[HUDI-4777] Fix flink gen bucket index of mor table not consistent with spark lead to duplicate bucket #6593
Closed
TJX2014 wants to merge 1 commit into apache:master from
Conversation
Contributor
When Spark loads the latest file slice, if it also includes file slices that only contain a log file, then wouldn't the problem be solved as well?
Contributor
Author
It seems Spark does not need to include the log file, which is merged into the base file.
Contributor
Author
Sorry, closed by mistake, please see: #6595
Contributor
The PR can be reopened :)
Change Logs
Impact
The duplicate issue comes from a hudi-flink MOR table: Flink first appends to a log file and does not compact right away, so the bucket number is not yet present in any base file.
When Spark calls loadPartitionBucketIdFileIdMapping in org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it does not find the bucket number already written by hudi-flink, so it generates a new file group that is not consistent with hudi-flink's, producing duplicate buckets.
After this change, when hudi-flink writes a MOR table with the bucket index, it first writes a base parquet file after deduplication; if the base file already exists, it switches to writing log files. Following Spark's approach seems more stable for MOR tables.
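To illustrate why both engines must agree, here is a minimal, hypothetical sketch (modeled loosely on Hudi's BucketIdentifier, not the production code) of a consistent bucket index: the bucket number is derived deterministically from the record key and encoded as a zero-padded prefix of the file ID, so any reader can recover the bucket from the file name alone. If one writer creates only log files whose bucket numbers the other writer cannot see, the second writer may mint a fresh file ID for the same bucket, which is exactly the duplicate-bucket problem described above.

```java
// Hypothetical sketch: deterministic key-to-bucket mapping shared by all
// writers. Names and hashing are illustrative assumptions, not Hudi's
// exact implementation.
public class BucketIdSketch {

    // Deterministic bucket id from a record key: same key, same bucket,
    // regardless of which engine computes it.
    public static int getBucketId(String recordKey, int numBuckets) {
        int hash = recordKey.hashCode() & Integer.MAX_VALUE; // non-negative
        return hash % numBuckets;
    }

    // Encode the bucket number as a zero-padded 8-char prefix of the file
    // id, so the bucket can be recovered from the file name alone.
    public static String bucketIdToFileIdPrefix(int bucketId) {
        return String.format("%08d", bucketId);
    }

    public static int bucketIdFromFileId(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    public static void main(String[] args) {
        int buckets = 4;
        int b = getBucketId("uuid-123", buckets);
        String fileId = bucketIdToFileIdPrefix(b) + "-some-uuid";
        System.out.println("bucket=" + b + " fileId=" + fileId);
        // Both engines computing the same bucket for the same key is what
        // prevents two file groups for one bucket in a partition.
        if (b != getBucketId("uuid-123", buckets)
                || bucketIdFromFileId(fileId) != b) {
            throw new AssertionError("bucket mapping not consistent");
        }
    }
}
```

The key property is that the mapping is a pure function of the record key and bucket count; the fix in this PR ensures Flink materializes the bucket's file ID where Spark's index loading can discover it.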
Risk level: none | low | medium | high
none
Contributor's checklist