[HUDI-5968] Fix global index duplicate when update partition #8344
Conversation
...t/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java
Force-pushed from 51b9969 to 697f6e5
LGTM. One comment on tests.
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HoodieSimpleDataGenerator {
Is it not possible to use HoodieTestDataGenerator or any other existing ones? We should try to standardize on these test data generators; ensuring no flakiness or bugs in new ones is hard. Let's try to stick to the ones we have already.
+1
HoodieTestDataGenerator actually needs an overhaul, as the APIs became unorganized over the years and are hard to use. More importantly, randomness is a big cause of flakiness, and we need a deterministic data generator more than a random one for UT/FT scenarios. I can revert this back to using the existing data gen class and let the future overhaul work cover the new class adoption.
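For illustration, the "deterministic" idea mentioned above could look like the following minimal sketch: every record is derived purely from its sequence index, so repeated runs produce identical data. The class and method names here are hypothetical, not Hudi's HoodieTestDataGenerator API.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch of a deterministic test data generator: no random
// source at all, so two runs with the same arguments produce identical rows.
public class DeterministicDataGen {

    // Generate n key/value rows derived only from the sequence index.
    public static List<String> rows(int n) {
        return IntStream.range(0, n)
            .mapToObj(i -> "key-" + i + ":val-" + (i * 31 % 97)) // pure function of i
            .collect(Collectors.toList());
    }
}
```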
Can we also add an e2e test like https://github.com/apache/hudi/pull/8228/files#diff-aab5a6c7849875ae6ee88b20c22c11c19d680eb7fae5410abf72ea6f74122633 ?
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
      .defaultValue("true")
      .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");

  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty
This and other parallelism configs seem like good candidates for HoodieInternalConfig. This is not going to be used by users often; their expectation would be to dedup as fast as we can. We don't have to do it in this patch, but I just want to know your thoughts.
Not very clear at the moment, given this is still tunable depending on the data's update ratio. It may stay as an infrequently used one, like hoodie.markers.delete.parallelism.
OK, let's keep it this way. We can revisit later if necessary.
Let's make sure this is tagged as an advanced config, or not exposed to users by default. Users shouldn't have to tune this.
Also, calling this "deduping" overloads the meaning a bit. We are not removing the duplicates per se, right? We only ensure the tagging routes it to the right record. How about "hoodie.global.index.reconcile.parallelism"?
Force-pushed from fa1b152 to 3c004c6
Force-pushed from 2068f09 to 7624300
Do we also need to handle this for the HBase index when its update-partition-path setting is enabled?
 * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
 */
return deduped.leftOuterJoin(undeduped
    .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))
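For readers unfamiliar with the left-anti join trick discussed in this thread, here is a simplified, self-contained model of the semantics: plain Maps stand in for the Spark pair RDDs, the map key stands in for the record key, and all names are illustrative rather than Hudi or Spark APIs.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Simplified model of "A left-anti join B": keep only the entries of Set A
// (the tagged inserts) whose key does NOT appear in Set B (the updates),
// so an update in B suppresses the corresponding insert in A.
public class LeftAntiJoinSketch {

    public static Map<String, String> leftAntiJoin(Map<String, String> inserts,
                                                   Map<String, String> updates) {
        return inserts.entrySet().stream()
            .filter(e -> !updates.containsKey(e.getKey()))   // drop keys matched in B
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

In the real code the same effect is obtained with a leftOuterJoin followed by a filter on whether the right side matched.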
Does it matter if we favor the insert or the update here? If yes, I feel it's better to favor the insert and drop the update, so that we maintain the behavior across the board, i.e., whenever a record migrates from one partition to another, we ignore whatever is in storage and do an insert into the incoming partition. To maintain similar semantics, I'm thinking we should favor the insert record over the update.
Synced up directly. Let's add Javadocs to call this out, i.e., why we should strictly favor the update record and not the insert, so that anyone looking to make changes in this code block is aware of all the nuances.
Should we instead be applying the payload to the old and new record?
Which is kind of the semantics we should be going for?
Once you add the Javadocs, we are good to go.
I have a question on the high-level approach taken here. Instead of de-duplicating further after reading the base file alone, why not read the base file, apply the delete blocks, and tag per usual? If that could work, I think it's the cleaner way to do this.
Even with the bloom index, we would get false positives from older file groups that may contain the key, but that would get fixed once we actually apply the delete blocks, no?
The issue is that we have to read the entire logs (including data files), since we realize deletes in different ways, e.g., via the "_hoodie_is_deleted" field. So, considering the cost (especially for a global index, where every file group is involved), we thought we would go with this approach.
Change Logs
When using a global index (bloom or simple) with update-partition-path set to true, there is a chance that a record starts in p1 and is later updated to p2; when it is then updated to p3 before compaction has happened, the global index joins both old versions of the record (in p1 and p2) and tags 2 records for insert into p3. These duplicates reside in the dataset and won't be reconciled unless the table is manually deduped.
This patch ensures dedup happens within the indexing (tagging) phase.
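A toy model of the scenario above (hypothetical names; a "tagged record" is reduced to a key-and-target-partition string, unlike the real Hudi types): the global index matches the key in both old partitions, so naive tagging would emit two inserts into p3, and a per-key dedup within the tagging phase collapses them to one.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model of the duplicate-tagging bug and its fix. Not Hudi code:
// the index lookup and tagged record are reduced to simple strings.
public class TaggingDedupSketch {

    // The index found the key in every old partition (e.g. p1 and p2), so
    // without dedup, tagging emits one insert into p3 per index match.
    public static List<String> tagWithoutDedup(String key, List<String> oldPartitions,
                                               String newPartition) {
        return oldPartitions.stream()
            .map(p -> key + "->" + newPartition)   // one tagged insert per match
            .collect(Collectors.toList());
    }

    // Deduping within tagging collapses the matches to a single record per key.
    public static List<String> tagWithDedup(String key, List<String> oldPartitions,
                                            String newPartition) {
        return tagWithoutDedup(key, oldPartitions, newPartition).stream()
            .distinct()
            .collect(Collectors.toList());
    }
}
```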
Impact
The global index has an extra dedup step for some records, which may slow down the whole process if many partition updates happen. In most scenarios, this is rare and the perf impact is negligible.
Risk level (write none, low medium or high below)
Medium
Documentation Update
hoodie.global.index.dedup.parallelism
Contributor's checklist