Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-349]: Added new cleaning policy based on number of hours #3646
[HUDI-349]: Added new cleaning policy based on number of hours #3646
Changes from 9 commits
97113ff
92d085d
9c1364c
f93ee61
0cd1fe6
68b0a68
f58beda
3c79136
676d999
29e8200
9527702
efa6056
7eea9ec
8e7e497
cc0a44a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know why do have (fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain) in L318. I would prefer to keep the same logic unless we have strong reasons to remove. As I mentioned, cleaning based on num hours is just a diff way to conveying retain N commits. So, lets try to keep the logic similar across both. Easier to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleaner policies are designed to enable the longest running query to succeed. Let us take a small example where ingestion is happening every 30 minutes and
hoodie.cleaner.commits.retained = 2
. This implies the longest running query is going to take 1 hour to complete. Now if I maintain a list of commits, when commit at index=2 in this list (i.e 3rd commit from the beginning) is completed, the query which started when commit at index=0 completed might still be running.list of commits -> 0. 1. 2
time (in mins). -> 0. 30. 60
Hence to allow this query to finish, we take into account
lastVersionBeforeEarliestCommitToRetain
in L318.lastVersion
in this case is going to be commit at index=1.With the new policy KEEP_LATEST_BY_HOURS, when 3rd commit is completed, 1st commit will be retained automatically by the logic. Hence the code in its current form works fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any precedence in hudi code base for doing time based calculations. Can you explore and let me know. Wanna maintain some uniformity if we have any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should sync this to how the instant times are generated.
Another way to think about this could be:
latestInstantTime = HoodieActiveTimeline.createNewInstantTime()
earliestCommitToRetain = HoodieActiveTimeline.findInstantInPast(latestInstantTime, N); // N = number of hours in the past.
Keeping the instant timestamp code together will prevent any surprises from a wrong time calculation cleaning data unintentionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan I checked. There is no uniformity as of now. We use joda dateTime library in TimestampBasedKeyGenerator, java.time.ZonedDateTime in SlashEncodedHourPartitionValueExtractor, and java.util.Date in HoodieActiveTimeline.
@prashantwason nthInstant is what we are using throughout the codebase. I guess there is no need to introduce a new method
findInstantInPast
here.