-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-349]: Added new cleaning policy based on number of hours #3646
Conversation
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCleaningPolicy.java
Outdated
Show resolved
Hide resolved
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Show resolved
Hide resolved
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
Outdated
Show resolved
Hide resolved
earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); //15 instants total, 10 commits to retain, this gives 6th instant in the list | ||
} else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LAST_X_HOURS) { | ||
Instant instant = Instant.now(); | ||
ZonedDateTime commitDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any precedence in hudi code base for doing time based calculations. Can you explore and let me know. Wanna maintain some uniformity if we have any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should sync this to how the instant times are generated.
Another way to think about this could be:
latestInstantTime = HoodieActiveTimeline.createNewInstantTime()
earliestCommitToRetain = HoodieActiveTimeline.findInstantInPast(latestInstantTime, N); // N = number of hours in the past.
Keeping the instant timestamp code together will prevent any surprises from a wrong time calculation cleaning data unintentionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan I checked. There is no uniformity as of now. We use joda dateTime library in TimestampBasedKeyGenerator, java.time.ZonedDateTime in SlashEncodedHourPartitionValueExtractor, and java.util.Date in HoodieActiveTimeline.
@prashantwason nthInstant is what we are using throughout the codebase. I guess there is no need to introduce a new method findInstantInPast
here.
@prashantwason : can you please address the feedback when you get a chance. |
@nsivabalan ack. |
@nsivabalan Please take a look. |
any update on this , I faced an issue today where the ETL failed in Prod because the file was deleted by Cleaner service
Having config based on time will help instead of relying on Number of commits |
@pratyakshsharma : Can you update the patch when you get a chance. |
@nsivabalan Please take a pass. |
Let me review this as well |
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Outdated
Show resolved
Hide resolved
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Outdated
Show resolved
Hide resolved
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Outdated
Show resolved
Hide resolved
} else if (policy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) { | ||
// This block corresponds to KEEP_LATEST_BY_HOURS policy | ||
// Do not delete the latest commit. | ||
if (fileCommitTime.equals(lastVersion)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know why do have (fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain) in L318. I would prefer to keep the same logic unless we have strong reasons to remove. As I mentioned, cleaning based on num hours is just a diff way to conveying retain N commits. So, lets try to keep the logic similar across both. Easier to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleaner policies are designed to enable the longest running query to succeed. Let us take a small example where ingestion is happening every 30 minutes and hoodie.cleaner.commits.retained = 2
. This implies the longest running query is going to take 1 hour to complete. Now if I maintain a list of commits, when commit at index=2 in this list (i.e 3rd commit from the beginning) is completed, the query which started when commit at index=0 completed might still be running.
list of commits -> 0. 1. 2
time (in mins). -> 0. 30. 60
Hence to allow this query to finish, we take into account lastVersionBeforeEarliestCommitToRetain
in L318. lastVersion
in this case is going to be commit at index=1.
With the new policy KEEP_LATEST_BY_HOURS, when 3rd commit is completed, 1st commit will be retained automatically by the logic. Hence the code in its current form works fine.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
Outdated
Show resolved
Hide resolved
@prashantwason : ping |
@nsivabalan Did you intend to ping @pratyakshsharma so he can respond to the comments? For me, this is on my review list but am a bit behind due to metadata index related PRs. |
I had been waiting for @vinothchandar 's review on this. Let me address the existing comments so we can land this. |
@hudi-bot run azure |
@nsivabalan Please take a pass. |
@pratyakshsharma : Can you check for CI failures. also, is it already rebased w/ latest master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested in other patch, can we try to keep the tests to TestCleanActionExecutor if possible.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
Outdated
Show resolved
Hide resolved
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Outdated
Show resolved
Hide resolved
thanks for fixing all refactoring feedback. now the source code changes looks a lot less. |
@pratyakshsharma : once you address the feedback, let me know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. wrt tests as suggested, can we move the tests to CleanActionExecutor tests rather than at write client layer. Is there any challenges with that?
*/ | ||
@ParameterizedTest | ||
@MethodSource("argumentsForTestKeepLatestCommits") | ||
public void testKeepXHoursWithCleaning(boolean simulateFailureRetry, boolean enableIncrementalClean, boolean enableBootstrapSourceClean) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved only one test in this PR since the rest are already moved in #4385. Once that is merged, changes will automatically reflect here as well. @nsivabalan
Tips
What is the purpose of the pull request
Add a new cleaning policy based on the number of hours. This gives more flexibility for users compared to the one based on the number of commits
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.