-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-7778] Fixing global index for duplicate updates #11256
[HUDI-7778] Fixing global index for duplicate updates #11256
Conversation
p.getRight().get().getPartitionPath(), | ||
p.getRight().get().getInstantTime(), | ||
p.getRight().get().getFileId())) | ||
.map(p -> Pair.of(p.getRight().get().getPartitionPath(), p.getRight().get().getFileId())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but can I ask a question. look like the instantime
is different with the same fgId in HoodieRecordGlobalLocation
, is this normal? In Global Simple Index
, it will use base file commit time as InstantTime
, and there is no different timestamp in the same file. So there is a problem here, HoodieRecordLocation's instantTime
is baseFile lastCommit time which file name contain, or every record commit time? or may be use any of them, there will be no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @KnightChess : atleast for global simple, global bloom and RLI, instant time should not matter. Hbase might have some intricacies with instant time to deduce rollbacks. I did not want to increase the scope of this patch. We are good wrt global simple, global bloom and RLI w/ cur fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan thanks for answering
p.getRight().get().getPartitionPath(), | ||
p.getRight().get().getInstantTime(), | ||
p.getRight().get().getFileId())) | ||
.map(p -> Pair.of(p.getRight().get().getPartitionPath(), p.getRight().get().getFileId())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we can simply remove instantTime
from the equals
and hashcode
of HoodieRecordGlobalLocation
. Why would we need instantTime
for global location? Clustering will anyway map to a new fileId. Compaction will create a base file with same file id but different instant time. However, for the location all we care about is partition path and filegroup id right. Instant time is only used for filesystem view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is a legacy design, to give a quick access of the instant time from a given Hoodie record, but I agree with you, the location should not include the instant time. The HoodieRecordLocation
should also be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instant time is used in other places as well (and not just global indexing). I do not want to include that in this patch. For now, lets proceed w/ 0.15.0 with cur fix and work towards properly fixing the RecordLocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah let's track separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change Logs
We occasionally this duplicate keys being ingested to RLI partition in MDT. Fixing the root cause in this patch.
Root cause:
After fetching record locations from RLI partition in MDT, before doing snapshot read to honor payload merge and ordering field, we fetch unique Partition and fileId pairs. Instead of fetching unique pair of {Partition, fileId}s, we were using HoodieRecordGlobalLocation which also contains "instantTime" in addition to partition path and fileId. So, this was resulting in 1 record from incoming resulting in 2 to 3 or N records after joining because of this.
I have written tests to reproduce the issue. If not for the fix, we will encounter below exception
Impact
Global index (bloom, simple, RLI) will not result in duplicates.
Risk level (write none, low medium or high below)
low.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist