[HUDI-5968] Fix global index duplicate when update partition #8344

Closed · wants to merge 10 commits

Conversation

xushiyan
Member

Change Logs

When using a global index (bloom or simple) with update partition path enabled, there is a chance that a record starts in p1 and is later updated to p2; then, when it is updated to p3 before compaction has happened, the global index joins against both old versions of the record (in p1 and p2) and tags 2 records to insert into p3. These duplicates reside in the dataset and won't be reconciled unless the table is manually deduplicated.

This patch ensures dedup happens within the indexing (tagging) phase.
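As a rough sketch of the idea (plain Java with a stand-in record type, not Hudi's actual tagging code): when the index join yields more than one tagged copy of the same key, keep only one, preferring the copy tagged as an update to an existing location, matching the "favor the update record" semantics discussed in the review below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a tagged record: key, target partition, and whether tagging
// resolved it to an existing file location (update) or left it as an insert.
record Tagged(String key, String partition, boolean isUpdate) {}

public class GlobalIndexDedupSketch {
  // Keep at most one tagged record per key, preferring the update-tagged copy.
  static List<Tagged> dedup(List<Tagged> tagged) {
    Map<String, Tagged> byKey = new LinkedHashMap<>();
    for (Tagged t : tagged) {
      byKey.merge(t.key(), t, (a, b) -> a.isUpdate() ? a : b);
    }
    return new ArrayList<>(byKey.values());
  }

  public static void main(String[] args) {
    // k1 was joined against two old copies (p1 and p2), producing two tagged records.
    List<Tagged> tagged = List.of(
        new Tagged("k1", "p3", false),  // from the stale copy in p1
        new Tagged("k1", "p3", true),   // from the latest copy in p2
        new Tagged("k2", "p3", false)); // genuinely new record
    System.out.println(dedup(tagged)); // only one copy of k1 survives
  }
}
```

The real patch does this distributed (see the left-anti join excerpt later in the thread); the map-merge above only illustrates the per-key preference.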

Impact

Global index has an extra dedup step for some records, which may slow down the whole process if many partition updates happen. In most scenarios this is rare, and the perf impact is negligible.

Risk level (write none, low medium or high below)

Medium

Documentation Update

  • New config hoodie.global.index.dedup.parallelism
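For illustration, the new option would sit alongside the existing global-index settings in a writer properties file (the parallelism value of 200 is an arbitrary example, not a recommended default):

```properties
# Hypothetical write-config fragment; only the last key is introduced by this patch.
hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true
hoodie.global.index.dedup.parallelism=200
```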

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from 51b9969 to 697f6e5 on April 1, 2023 22:04
Contributor

@nsivabalan nsivabalan left a comment


LGTM. One comment on tests.

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HoodieSimpleDataGenerator {
Contributor


Is it not possible to use HoodieTestDataGenerator or any other existing ones? We should try to standardize on these test data generators. Ensuring there is no flakiness or bugs in new ones is hard. Let's try to stick to the ones we already have.

Member


+1

Member Author


HoodieTestDataGenerator actually needs an overhaul, as the APIs became disorganized over the years and are hard to use. More importantly, randomness is a big cause of flakiness, and we need a deterministic data generator more than a random one for UT/FT scenarios. I can revert this back to using the existing data gen class and let the future overhaul work cover adoption of the new class.

Member

@codope codope left a comment


@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
.defaultValue("true")
.withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");

public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty
Member


This and other parallelism configs seem like good candidates for HoodieInternalConfig. This is not going to be used by users often; their expectation would be to dedup as fast as we can. We don't have to do it in this patch, but I just want to know your thoughts?

Member Author


Not very clear at the moment, given this is still tunable depending on the data's update ratio. It may stay as an infrequently used one, like hoodie.markers.delete.parallelism.

Member


OK, let's keep it this way. We can revisit later if necessary.

Member


Let's make sure this is tagged as an advanced config, or not exposed to the user by default. Users shouldn't have to tune this.

Member


Also, calling this deduping overloads the meaning a bit; we are not removing the duplicates per se, right? We only ensure the tagging routes to the right record. Perhaps "hoodie.global.index.reconcile.parallelism"?

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HoodieSimpleDataGenerator {
Member


+1

@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from fa1b152 to 3c004c6 on April 4, 2023 01:38
@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from 2068f09 to 7624300 on April 4, 2023 04:05
Member

@codope codope left a comment


Do we also need to handle this for the HBase index when HBase index update partition path is enabled?

* So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
*/
return deduped.leftOuterJoin(undeduped
.filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))
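The excerpt above expresses a left-anti join via leftOuterJoin plus a filter. As a Spark-free illustration of the left-anti pattern itself (plain Java; the key values are hypothetical, not from the patch):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LeftAntiJoinSketch {
  // Left-anti join: keep only the elements of A that have no match in B.
  static <K> List<K> leftAntiJoin(List<K> a, Set<K> b) {
    return a.stream().filter(k -> !b.contains(k)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> a = List.of("k1", "k2", "k3"); // e.g. insert-tagged keys
    Set<String> b = Set.of("k1");               // e.g. update-tagged keys
    System.out.println(leftAntiJoin(a, b));     // [k2, k3]
  }
}
```

In the patch, the same effect drops the insert-tagged duplicate whenever an update-tagged copy of the record exists.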
Contributor


Does it matter whether we favor the insert or the update here?
If yes, I feel it's better to favor the insert and drop the update, so that we maintain the behavior across the board: whenever a record migrates from one partition to another, we ignore whatever is in storage and do an insert into the incoming partition. To maintain similar semantics, I'm thinking we should favor the insert record over the update.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced up directly. Let's add javadocs to call this out, i.e. why we should strictly favor the update record and not the insert, so that anyone looking to make changes in this code block is aware of all the nuances.

Member


Should we instead be applying the payload to the old and new record?

Member


That is kind of the semantics we should be going for, no?

Contributor

@nsivabalan nsivabalan left a comment


Once you add the javadocs, we are good to go.

Member

@vinothchandar vinothchandar left a comment


I have a question on the high-level approach taken here. Instead of de-duplicating further after reading the base file alone, why not read the base file, apply the delete blocks, and tag as usual? If that could work, I think it's the cleaner way to do this.

Even with the bloom index, we would get false positives from older file groups that may contain the key, but then it would get fixed once we actually apply the delete blocks, no?

@nsivabalan
Contributor

The issue is, we have to read entire logs (including data blocks), since we realize deletes in different ways, e.g. via the "_hoodie_is_deleted" field. So, considering the cost (especially since for a global index every file group is involved), we thought we would go with this approach.

@hudi-bot

hudi-bot commented Apr 4, 2023

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
