[HUDI-5968] Fix global index duplicate when update partition #8344

Closed · wants to merge 10 commits

Conversation

xushiyan
Member

Change Logs

When using a global index (bloom or simple) with update partition path enabled, there is a chance that a record starts in p1 and is later updated to p2; then, when it is updated to p3 before compaction has happened, the global index joins against both old versions of the record (in p1 and p2) and tags 2 records to insert into p3. These duplicates reside in the dataset and won't be reconciled unless the table is manually deduplicated.

This patch ensures dedup happens within the indexing (tagging) phase.
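As a rough sketch of the idea (plain Java with a stand-in record type, not Hudi's actual tagging code): when the index join yields more than one tagged copy of the same key, keep only one, preferring the copy tagged as an update to an existing location, matching the "favor the update record" semantics discussed in the review below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a tagged record: key, target partition, and whether tagging
// resolved it to an existing file location (update) or left it as an insert.
record Tagged(String key, String partition, boolean isUpdate) {}

public class GlobalIndexDedupSketch {
  // Keep at most one tagged record per key, preferring the update-tagged copy.
  static List<Tagged> dedup(List<Tagged> tagged) {
    Map<String, Tagged> byKey = new LinkedHashMap<>();
    for (Tagged t : tagged) {
      byKey.merge(t.key(), t, (a, b) -> a.isUpdate() ? a : b);
    }
    return new ArrayList<>(byKey.values());
  }

  public static void main(String[] args) {
    // k1 was joined against two old copies (p1 and p2), producing two tagged records.
    List<Tagged> tagged = List.of(
        new Tagged("k1", "p3", false),  // from the stale copy in p1
        new Tagged("k1", "p3", true),   // from the latest copy in p2
        new Tagged("k2", "p3", false)); // genuinely new record
    System.out.println(dedup(tagged)); // only one copy of k1 survives
  }
}
```

The real patch does this distributed (see the left-anti join excerpt later in the thread); the map-merge above only illustrates the per-key preference.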

Impact

Global index has an extra dedup step for some records, which may slow down the whole process if many partition updates happen. In most scenarios this is rare, and the perf impact is negligible.

Risk level (write none, low medium or high below)

Medium

Documentation Update

  • New config hoodie.global.index.dedup.parallelism
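For illustration, the new option would sit alongside the existing global-index settings in a writer properties file (the parallelism value of 200 is an arbitrary example, not a recommended default):

```properties
# Hypothetical write-config fragment; only the last key is introduced by this patch.
hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true
hoodie.global.index.dedup.parallelism=200
```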

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from 51b9969 to 697f6e5 on April 1, 2023 22:04
Contributor

@nsivabalan nsivabalan left a comment


LGTM. One comment on tests.

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HoodieSimpleDataGenerator {
Contributor


Is it not possible to use HoodieTestDataGenerator or any other existing ones? We should try to standardize on these test data generators. Ensuring there is no flakiness or bugs in new ones is hard. Let's try to stick to the ones we already have.

Member


+1

Member Author


HoodieTestDataGenerator actually needs an overhaul, as the APIs became disorganized over the years and are hard to use. More importantly, randomness is a big cause of flakiness, and we need a deterministic data generator more than a random one for UT/FT scenarios. I can revert this back to using the existing data gen class and let the future overhaul work cover adoption of the new class.

Member

@codope codope left a comment


@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
.defaultValue("true")
.withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");

public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty
Member


This and other parallelism configs seem like good candidates for HoodieInternalConfig. This is not going to be used by users often; their expectation would be to dedup as fast as we can. We don't have to do it in this patch, but I just want to know your thoughts?

Member Author


Not very clear at the moment, given this is still tunable depending on the data's update ratio. It may stay as an infrequently used one, like hoodie.markers.delete.parallelism.

Member


OK, let's keep it this way. We can revisit later if necessary.

Member


Let's make sure this is tagged as an advanced config, or not exposed to the user by default. Users shouldn't have to tune this.

Member


Also, calling this deduping overloads the meaning a bit; we are not removing the duplicates per se, right? We only ensure the tagging routes to the right record. Perhaps "hoodie.global.index.reconcile.parallelism"?

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HoodieSimpleDataGenerator {
Member


+1

@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from fa1b152 to 3c004c6 on April 4, 2023 01:38
@xushiyan xushiyan force-pushed the HUDI-5968-fix-global-index-dup branch from 2068f09 to 7624300 on April 4, 2023 04:05
Member

@codope codope left a comment


Do we also need to handle this for the HBase index when HBase index update partition path is enabled?

* So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
*/
return deduped.leftOuterJoin(undeduped
.filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))
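The excerpt above expresses a left-anti join via leftOuterJoin plus a filter. As a Spark-free illustration of the left-anti pattern itself (plain Java; the key values are hypothetical, not from the patch):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LeftAntiJoinSketch {
  // Left-anti join: keep only the elements of A that have no match in B.
  static <K> List<K> leftAntiJoin(List<K> a, Set<K> b) {
    return a.stream().filter(k -> !b.contains(k)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> a = List.of("k1", "k2", "k3"); // e.g. insert-tagged keys
    Set<String> b = Set.of("k1");               // e.g. update-tagged keys
    System.out.println(leftAntiJoin(a, b));     // [k2, k3]
  }
}
```

In the patch, the same effect drops the insert-tagged duplicate whenever an update-tagged copy of the record exists.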
Contributor


Does it matter whether we favor the insert or the update here?
If yes, I feel it's better to favor the insert and drop the update, so that we maintain the behavior across the board: whenever a record migrates from one partition to another, we ignore whatever is in storage and do an insert into the incoming partition. To maintain similar semantics, I'm thinking we should favor the insert record over the update.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced up directly. Let's add javadocs to call this out, i.e. why we should strictly favor the update record and not the insert, so that anyone looking to make changes in this code block is aware of all the nuances.

Member


Should we instead be applying the payload to the old and new record?

Member


That is kind of the semantics we should be going for, no?

Contributor

@nsivabalan nsivabalan left a comment


Once you add the javadocs, we are good to go.

Member

@vinothchandar vinothchandar left a comment


I have a question on the high-level approach taken here. Instead of de-duplicating further after reading the base file alone, why not read the base file, apply the delete blocks, and tag as usual? If that could work, I think it's the cleaner way to do this.

Even with the bloom index, we would get false positives from older file groups that may contain the key, but then it would get fixed once we actually apply the delete blocks, no?

@nsivabalan
Contributor

The issue is, we have to read entire logs (including data blocks), since we realize deletes in different ways, e.g. via the "_hoodie_is_deleted" field. So, considering the cost (especially since for a global index every file group is involved), we thought we would go with this approach.

@hudi-bot

hudi-bot commented Apr 4, 2023

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
