[HUDI-5018] Make user-provided copyOnWriteRecordSizeEstimate first precedence #7226
xicm wants to merge 1 commit into apache:master
@hudi-bot run azure
long defaultAvgSize = Integer.parseInt(HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.defaultValue());
long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();

if (avgSize != defaultAvgSize) {
HoodieCompactionConfig#build already sets the default explicitly when the value is not configured, so what's the problem here?
I see, you don't want the moving average from the commit metadata. Is the moving average inaccurate, or is it something else?
@danny0405 This PR gives the user-provided average size first precedence. We can't tell here whether the value is the default or user-provided, so I assume that if the average size differs from the default, it was provided by the user.
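The precedence rule described in this comment can be sketched as follows. This is a minimal illustration, not the code in the PR: the class `RecordSizePrecedence`, the method `resolve`, and the convention that a non-positive `historicalSize` means "no usable commit metadata" are all hypothetical names and assumptions introduced here.

```java
/** Hypothetical helper, not part of Hudi: illustrates the precedence rule from the comment above. */
final class RecordSizePrecedence {
    private RecordSizePrecedence() {}

    /**
     * @param configuredSize value read from the write config (may simply be the default)
     * @param defaultSize    default value of the record size estimate config
     * @param historicalSize average derived from commit metadata; a value <= 0 is assumed
     *                       here to mean that no usable history exists
     * @return the record size estimate to use
     */
    static long resolve(long configuredSize, long defaultSize, long historicalSize) {
        // A value that differs from the default is treated as user-provided and wins.
        if (configuredSize != defaultSize) {
            return configuredSize;
        }
        // Otherwise prefer the historical average, falling back to the default.
        return historicalSize > 0 ? historicalSize : defaultSize;
    }
}
```

One consequence of this design is that a user who explicitly sets the config to the default value cannot be distinguished from a user who did not set it at all, which is the limitation acknowledged in the comment.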
Yeah, you need to explain why the explicit record size should always take precedence while the write stats from the commit metadata are ignored. Can we instead work on making the commit metadata more accurate here?
Hi @danny0405, do you mean adding a comment that explains why we ignore the write stats from the commit metadata and how to set the value more accurately?
I mean we should figure out why the write stats are not that accurate.
@danny0405 The calculation of the average value looks OK from my side.
@xushiyan could you explain the purpose of giving the user-provided value first precedence? Are these changes what you expected?
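For readers following the discussion, here is a rough sketch of how an average record size might be derived from per-commit write stats. It is an assumption-laden illustration, not Hudi's actual implementation: the `CommitStats` type, the `minBytes` threshold, and the newest-first ordering are all hypothetical choices made for this example.

```java
import java.util.List;

/** Hypothetical sketch, not Hudi code: derives an average record size from write stats. */
final class AvgRecordSizeSketch {
    private AvgRecordSizeSketch() {}

    /** Hypothetical stand-in for the write stats stored in one commit's metadata. */
    static final class CommitStats {
        final long bytesWritten;
        final long recordsWritten;

        CommitStats(long bytesWritten, long recordsWritten) {
            this.bytesWritten = bytesWritten;
            this.recordsWritten = recordsWritten;
        }
    }

    /**
     * Returns bytes per record from the first (newest) commit that wrote more than
     * minBytes and at least one record; returns fallback if no commit qualifies.
     */
    static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                      long minBytes, long fallback) {
        for (CommitStats c : commitsNewestFirst) {
            // Skip tiny commits, whose ratio would be dominated by file overhead.
            if (c.bytesWritten > minBytes && c.recordsWritten > 0) {
                return (long) Math.ceil((double) c.bytesWritten / c.recordsWritten);
            }
        }
        return fallback;
    }
}
```

With an empty history the method returns the fallback, which corresponds to the first-load case where only the configured estimate is available.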
Consider this scenario: on the first load, I set COPY_ON_WRITE_RECORD_SIZE_ESTIMATE smaller than the original to avoid generating a large number of small files. But I can't accurately estimate the size of each record once it is loaded into Hudi; I can only go by the metadata of the original data.
Next time, if the Hudi table already has historical data, I would prefer to use the existing data to calculate the record size. And if the schema evolves, the user-provided value may become inaccurate, so users would need to reset the config, right?
@KnightChess Yes, you are right.
Change Logs
Give the user-provided COPY_ON_WRITE_RECORD_SIZE_ESTIMATE first precedence.
Impact
HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE
Risk level (write none, low medium or high below)
low