Skip to content

[HUDI-5018] Make user-provided copyOnWriteRecordSizeEstimate first precedence#7226

Closed
xicm wants to merge 1 commit intoapache:masterfrom
xicm:HUDI-5018
Closed

[HUDI-5018] Make user-provided copyOnWriteRecordSizeEstimate first precedence#7226
xicm wants to merge 1 commit intoapache:masterfrom
xicm:HUDI-5018

Conversation

@xicm
Copy link
Contributor

@xicm xicm commented Nov 17, 2022

Change Logs

Make user-provided COPY_ON_WRITE_RECORD_SIZE_ESTIMATE first

  1. if user sets a value, then use it as is
  2. if user not setting it, infer from timeline commit metadata
  3. if timeline is empty, use a default (current: 1024)

Impact

HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE

Risk level (write none, low medium or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xicm
Copy link
Contributor Author

xicm commented Nov 21, 2022

@hudi-bot run azure

long defaultAvgSize = Integer.parseInt(HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.defaultValue());
long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();

if (avgSize != defaultAvgSize) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HoodieCompactionConfig#build already sets the defaults explicitly if it is not configured, so what's the problem here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you didn't what the moving average from the commit matadata, is the moving average not accurate or something else ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 This pr makes user-provided avg value first. We can't tell the value is default or user-provided here, so I suppose if the avg size is not default, it is user provided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you need to explain why the explicit record size is always considered while ignoring the write stats from the commit metadata ? Can we elaborate to make the commit metadata more accurate here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @danny0405 , do you mean to add some comment to explain why we ignore the write stats from the commit metadata and how to set the value more accurate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we should figure out why the write stats is not that accurate ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 The calculation of average value is OK from my side.
@xushiyan could you explain the purpose of making user-provided first?Are these changes what you expected?

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@KnightChess KnightChess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering this scenario, I will set COPY_ON_WRITE_RECORD_SIZE_ESTIMATE is more smaller than the original to prevent generate a large number of small files when I first load data. But I can't accurately estimate the size of each record after load in hudi, only use the original data meta.
And next time, if I hava history data in hudi table, I tend to use the existing data to calculate the record size. And if the schema is evolution, the user provide may be will inaccurate, they need to reset the conf, right?

@xicm
Copy link
Contributor Author

xicm commented Nov 21, 2022

Considering this scenario, I will set COPY_ON_WRITE_RECORD_SIZE_ESTIMATE is more smaller than the original to prevent generate a large number of small files when I first load data. But I can't accurately estimate the size of each record after load in hudi, only use the original data meta. And next time, if I hava history data in hudi table, I tend to use the existing data to calculate the record size. And if the schema is evolution, the user provide may be will inaccurate, they need to reset the conf, right?

@KnightChess Yes, you are right.
The original logic has a problem that user provided avg value does not take effect on existing table.
@xushiyan could you explain more about the original purpose of this issue?

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope added priority:high Significant impact; potential bugs writer-core and removed release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@xicm xicm closed this May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

7 participants