[HUDI-5953][DNM] Handle duplicate records in HoodieCreateHandle#8228

Closed
codope wants to merge 1 commit into apache:master from codope:fix-dups-create-handle

codope commented Mar 18, 2023

Change Logs

Handle potential duplicates in HoodieCreateHandle. Duplicates are already handled for compaction.
DO NOT MERGE. Tests are yet to be cleaned up.
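
For illustration only, here is a minimal sketch of what handle-side deduplication could look like. `DedupingCreateHandleSketch` and its methods are hypothetical names, not the actual HoodieCreateHandle code from this PR; the sketch assumes string record keys and keeps the first record seen for each key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a create-style write handle; NOT Hudi's HoodieCreateHandle.
class DedupingCreateHandleSketch {
    // Every key written so far is remembered so that later duplicates can be skipped.
    private final Set<String> writtenKeys = new HashSet<>();
    // Stand-in for the file being written.
    private final List<String> output = new ArrayList<>();
    private int skipped = 0;

    // Writes the payload unless a record with the same key was already written.
    // Returns true if the record was written, false if it was skipped as a duplicate.
    boolean write(String recordKey, String payload) {
        if (!writtenKeys.add(recordKey)) {
            skipped++;
            return false;
        }
        output.add(recordKey + "=" + payload);
        return true;
    }

    List<String> written() {
        return output;
    }

    int skippedCount() {
        return skipped;
    }

    // Number of keys held in memory; grows with the number of distinct keys written.
    int trackedKeyCount() {
        return writtenKeys.size();
    }
}
```

In this sketch, writing keys `key-1`, `key-1`, `key-2` produces two output records and one skipped record, and the handle holds one in-memory entry per distinct key.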

Impact

No public API change, but this is a critical change that will affect inserts.

Risk level (write none, low, medium or high below)

high

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

codope changed the title from "[WIP] Handle duplicate records in HoodieCreateHandle" to "[HUDI-5953] Handle duplicate records in HoodieCreateHandle" on Mar 18, 2023

codope commented Mar 18, 2023

@hudi-bot run azure

danny0405 left a comment

-1, I don't think we should go in this direction. A premise of HoodieCreateHandle is that the input data set is already deduplicated. There are two impacts I can think of here:

  1. It introduces unnecessary overhead for many workflows, such as bulk_insert, insert, compaction, and clustering.
  2. The cache of keys could result in an OOM exception; it is safer for the engine to handle the deduplication job.
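
As a rough sketch of the alternative described above, deduplication can happen before records reach the write handle, so the handle itself never tracks keys. `UpstreamDedupSketch`, `Rec`, and `orderingValue` are hypothetical names for illustration, not Hudi APIs; keeping the record with the highest ordering value is an assumption of this sketch. A real engine would typically do this as a distributed reduce-by-key rather than with a single in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-write deduplication step; NOT part of Hudi's API.
class UpstreamDedupSketch {
    // Hypothetical record shape: a key, an ordering value, and a payload.
    static final class Rec {
        final String key;
        final long orderingValue;
        final String payload;

        Rec(String key, long orderingValue, String payload) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.payload = payload;
        }
    }

    // Keeps one record per key: the one with the highest ordering value
    // (on ties, the later record in the input wins).
    static List<Rec> dedupeByKey(List<Rec> input) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : input) {
            latest.merge(r.key, r,
                (current, incoming) ->
                    incoming.orderingValue >= current.orderingValue ? incoming : current);
        }
        return new ArrayList<>(latest.values());
    }
}
```

The deduplicated list can then be handed to a writer that assumes unique keys, which matches the premise stated in the comment above.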

Fix merge handle and dedupe conditional
codope force-pushed the fix-dups-create-handle branch from cb172b5 to 1e2111d on March 20, 2023 04:21
codope changed the title from "[HUDI-5953] Handle duplicate records in HoodieCreateHandle" to "[HUDI-5953][DNM] Handle duplicate records in HoodieCreateHandle" on Mar 22, 2023
@hudi-bot

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

codope commented Jul 5, 2023

Closing this, as the fix landed via cabcb2b.

3 participants