Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not #6983

Merged
merged 3 commits into from Oct 18, 2023

Conversation

weimingdiit
Copy link
Contributor

@weimingdiit weimingdiit commented Oct 18, 2022

…rce table has partitions and the target table does not

Change Logs

Hudi merge into creates empty partition files when the source table has partitions and the target table does not

Impact

No impact for users, this is a internal code refactoring.

Risk level (write none, low medium or high below)

medium,see ut for detail

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added priority:major degraded perf; unable to move forward; potential bugs spark-sql labels Oct 21, 2022
@weimingdiit weimingdiit reopened this Oct 25, 2022
@YannByron
Copy link
Contributor

@weimingdiit i think only merge into can cause this issue, right?

@weimingdiit
Copy link
Contributor Author

weimingdiit commented Oct 26, 2022

@weimingdiit i think only merge into can cause this issue, right?

@YannByron yes, it is

Comment on lines 75 to 80
if (handle == null) {
// If the records are sorted, this means that we encounter a new partition path
// and the records for the previous partition path are all written,
// so we can safely closely existing open handle to reduce memory footprint.
if (areRecordsSorted) {
closeOpenHandles();
if (handle == null || !handle.canWrite(payload.record)) {
if (handle == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this logic flow? pls revert unnecessary changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no change this logic,Some code is duplicate:

in my first version :
first judge ignored record, other code no change,but every record need to judge,it is unnecessary

second version:
i move ignored record logic into if condition, but it need twice,if(handle == null) {...} and if(!handle.canWrite(payload.record)) {...}, and writeHandleFactory.create() also call twice, it is duplicate.

so i change this logic flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xushiyan how about this? If worried about causing other hidden problems, I can change the code , revert logic flow and return early

Comment on lines 93 to 98
insertPayload.getPartitionPath(), idPrefix, taskContextSupplier);
insertPayload.getPartitionPath(), idPrefix, taskContextSupplier);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change

Comment on lines 91 to 88
Option<IndexedRecord> insertRecord = payload.insertValue;
// just skip the ignored record,do not make partitions on fs
if (insertRecord.isPresent() && insertRecord.get().equals(IGNORE_RECORD)) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to return based on these conditions, why not return early?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after judge handle, so you don't have to judge every Record, Just judge when the if condition is met,Most records do not need to be judged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xushiyan how about that?

@weimingdiit
Copy link
Contributor Author

@xushiyan @yihua @XuQianJin-Stars hi, please take a review for this pr. thks!

@weimingdiit weimingdiit force-pushed the HUDI-5031 branch 2 times, most recently from 48ef299 to 139a4b2 Compare November 25, 2022 08:59
@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope changed the title [HUDI-5031]Hudi merge into creates empty partition files when the sou… [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not Dec 7, 2022
@bvaradar bvaradar added the bug-fix For PRs which fixes bugs label Oct 4, 2023
@bvaradar
Copy link
Contributor

bvaradar commented Oct 6, 2023

@weimingdiit : FYI: Have rebased and made slight changes. Will help take this to completion

…table has partitions but target table does not
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar bvaradar merged commit 61051a3 into apache:master Oct 18, 2023
23 of 28 checks passed
nsivabalan pushed a commit that referenced this pull request Nov 21, 2023
…table has partitions but target table does not (#6983)

* [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not

Co-authored-by: jameswei <jameswei@didiglobal.com>
Co-authored-by: balaji.varadarajan <vbalaji@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-fix For PRs which fixes bugs priority:major degraded perf; unable to move forward; potential bugs release-0.12.2 Patches targetted for 0.12.2 spark-sql
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

None yet

7 participants