Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5954] Infer cleaning policy based on clean configs #8238

Merged
merged 2 commits into from
Apr 1, 2023

Conversation

yihua
Copy link
Contributor

@yihua yihua commented Mar 20, 2023

Change Logs

This PR adds the logic of inferring the cleaning policy based on clean configs. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used):

  • "hoodie.cleaner.commits.retained": the KEEP_LATEST_COMMITS cleaning policy is used;
  • "hoodie.cleaner.hours.retained": the KEEP_LATEST_BY_HOURS cleaning policy is used;
  • "hoodie.cleaner.fileversions.retained": the KEEP_LATEST_FILE_VERSIONS cleaning policy is used.

Now setting only one of the configs above automatically switches the cleaning policy. Setting hoodie.cleaner.policy is deprecated.

Impact

A user does not need to explicitly set the cleaning policy alongside the one of following configs: "hoodie.cleaner.commits.retained", "hoodie.cleaner.hours.retained", or "hoodie.cleaner.fileversions.retained".

Risk level

none

Documentation Update

Docs update: HUDI-595

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

.defaultValue(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name())
.withInferFunction(cfg -> {
boolean isCommitsRetainedConfigured = cfg.contains(CLEANER_COMMITS_RETAINED_KEY);
boolean isHoursRetainedConfigured = cfg.contains(CLEANER_HOURS_RETAINED_KEY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so confused by these options, does the option hoodie.cleaner.policy make any sense here? If all the specific cleaning param: hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, hoodie.cleaner.fileversions.retained all have detemistic policy, then this option should be eliminated.

For example, can we use a combination like HoodieCleaningPolicy.KEEP_LATEST_COMMITS policy and hoodie.cleaner.fileversions.retained, if not, introduce the redundant option key hoodie.cleaner.policy is totally unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I marked it as deprecated.

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so confused by these options, does the option hoodie.cleaner.policy make any sense here? If all the specific cleaning param: hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, hoodie.cleaner.fileversions.retained all have detemistic policy, then this option should be eliminated.

For example, can we use a combination like HoodieCleaningPolicy.KEEP_LATEST_COMMITS policy and hoodie.cleaner.fileversions.retained, if not, introduce the redundant option key hoodie.cleaner.policy is totally unnecessary.

@yihua
Copy link
Contributor Author

yihua commented Mar 31, 2023

I'm so confused by these options, does the option hoodie.cleaner.policy make any sense here? If all the specific cleaning param: hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, hoodie.cleaner.fileversions.retained all have detemistic policy, then this option should be eliminated.

For example, can we use a combination like HoodieCleaningPolicy.KEEP_LATEST_COMMITS policy and hoodie.cleaner.fileversions.retained, if not, introduce the redundant option key hoodie.cleaner.policy is totally unnecessary.

@danny0405 what you mentioned totally makes sense. The reason I keep hoodie.cleaner.policy is that we should still make it backwards compatible, so that if hoodie.cleaner.policy is set, we should honor that. If none of hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, or hoodie.cleaner.fileversions.retained is set, the specified policy should be used, and the default clean config value (commits, hours, or file versions retained) is used.

@danny0405
Copy link
Contributor

I'm so confused by these options, does the option hoodie.cleaner.policy make any sense here? If all the specific cleaning param: hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, hoodie.cleaner.fileversions.retained all have detemistic policy, then this option should be eliminated.
For example, can we use a combination like HoodieCleaningPolicy.KEEP_LATEST_COMMITS policy and hoodie.cleaner.fileversions.retained, if not, introduce the redundant option key hoodie.cleaner.policy is totally unnecessary.

@danny0405 what you mentioned totally makes sense. The reason I keep hoodie.cleaner.policy is that we should still make it backwards compatible, so that if hoodie.cleaner.policy is set, we should honor that. If none of hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, or hoodie.cleaner.fileversions.retained is set, the specified policy should be used, and the default clean config value (commits, hours, or file versions retained) is used.

Then let's mark this option hoodie.cleaner.policy as deprecated and add some release notes on the website.

@yihua
Copy link
Contributor Author

yihua commented Apr 1, 2023

I'm so confused by these options, does the option hoodie.cleaner.policy make any sense here? If all the specific cleaning param: hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, hoodie.cleaner.fileversions.retained all have detemistic policy, then this option should be eliminated.
For example, can we use a combination like HoodieCleaningPolicy.KEEP_LATEST_COMMITS policy and hoodie.cleaner.fileversions.retained, if not, introduce the redundant option key hoodie.cleaner.policy is totally unnecessary.

@danny0405 what you mentioned totally makes sense. The reason I keep hoodie.cleaner.policy is that we should still make it backwards compatible, so that if hoodie.cleaner.policy is set, we should honor that. If none of hoodie.cleaner.commits.retained, hoodie.cleaner.hours.retained, or hoodie.cleaner.fileversions.retained is set, the specified policy should be used, and the default clean config value (commits, hours, or file versions retained) is used.

Then let's mark this option hoodie.cleaner.policy as deprecated and add some release notes on the website.

Makes sense. Fixed.

@hudi-bot
Copy link

hudi-bot commented Apr 1, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@yihua
Copy link
Contributor Author

yihua commented Apr 1, 2023

CI is green.
Screenshot 2023-04-01 at 00 19 32

@yihua yihua merged commit c739701 into apache:master Apr 1, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
This commit adds the logic of inferring the cleaning policy ("hoodie.cleaner.policy") based on clean configs. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used):

- "hoodie.cleaner.commits.retained": the KEEP_LATEST_COMMITS cleaning policy is used;
- "hoodie.cleaner.hours.retained": the KEEP_LATEST_BY_HOURS cleaning policy is used;
- "hoodie.cleaner.fileversions.retained": the KEEP_LATEST_FILE_VERSIONS cleaning policy is used.

Now setting only one of the configs above automatically switches the cleaning policy. Setting "hoodie.cleaner.policy" is deprecated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants