New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() #4810
[HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() #4810
Conversation
Here is the test table in my local env
Query
result with As we can see the result is duplicated. |
@@ -492,7 +505,7 @@ protected HoodieBaseFile addBootstrapBaseFileIfPresent(HoodieFileGroupId fileGro | |||
.map(fileGroup -> Option.fromJavaOptional(fileGroup.getAllBaseFiles() | |||
.filter(baseFile -> HoodieTimeline.compareTimestamps(baseFile.getCommitTime(), HoodieTimeline.LESSER_THAN_OR_EQUALS, maxCommitTime | |||
)) | |||
.filter(df -> !isBaseFileDueToPendingCompaction(df)).findFirst())) | |||
.filter(df -> !isBaseFileDueToPendingCompaction(df) && !isBaseFileDueToPendingClustering(df)).findFirst())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the caller pass in the right maxCommitTime
to filter out the pending base files ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When inflight clustering at the earliest instant of the active timeline, this bug could happen(based on follow code). So that no matter what maxCommitTime is, we can' t filter it out.
Now we use take containsOrBeforeTimelineStarts as committed, which may involve unfinished clustering data(this inflight clustering instant is at the earliest of active timeline.)
hudi/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java
Line 120 in 55ecbc6
private boolean isFileSliceCommitted(FileSlice slice) { |
Hi @codope and @satishkotha Sorry to bother you. Would you mind to take a look at his patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! LGTM. but lets wait to get a review from Satish or Sagar who authored most pieces of clustering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangyue19921010 Thanks for catching the bug! Left a few comments
*/ | ||
protected boolean isBaseFileDueToPendingClustering(HoodieBaseFile baseFile) { | ||
List<String> pendingReplaceInstants = | ||
metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not reuse isPendingClusteringScheduledForFileId()
or getPendingClusteringInstant()
? So, we maintain a map of fgIdToPendingClustering
which supports various methods. If we can reuse one of them then we need to call active timeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm, maybe we can't use fgIdToPendingClustering
to do filter here.
Because the files recorded in fgIdToPendingClustering
are committed file and need to be seen.
What we need to filter here are the in-flight uncommitted data files produced by clustering job.
So that we need to know the instant time of xxxx.replacecommit.requested
or xxxx.replacecommit.inflight
and use it to filter out uncommitted clustering creating data files instead of the files which need to be clustering.
* 2. getBaseFileOn | ||
* 3. getLatestBaseFilesInRange | ||
* 4. getAllBaseFiles | ||
* 5. getLatestBaseFiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other base file related APIs like fetchLatestBaseFiles
, fetchAllBaseFiles
? Are they all covered by this change?
PS: I think we should take a follow up task to make FSView APIs more uniform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, there is pretty much getxxxxLatestxxx() methods, hhh.
The root change here is that add isBaseFileDueToPendingClustering
the same as isBaseFileDueToPendingCompaction
so add this check to wherever isBaseFileDueToPendingCompaction
is.
And this APIs in UT are all affected APIs.
@codope : can we close the loop here please. If you want me to take it up, let me know. happy to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. All comments addressed.
https://issues.apache.org/jira/browse/HUDI-3421
What is the purpose of the pull request
If there is a inflight clustering instant at the earliest of active time line, AbstractTableFileSystemView#getxxBaseFIle() will be broken because of un-committed data file created by this clustering job.
We could find the result without replacecommit 2 is bigger than with it.
This patch is try to fix it and We couldn't take uncommitted data file created by inflight clustering as a
HoodieBaseFile
We could look at added UT for more details and without this Patch, this added UT will failed because of additional un-committed clustering data file.
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.