[HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits#8905
@hudi-bot run azure
.defaultValue(false)
.markAdvanced()
.withDocumentation("When set to true, cleaner will ignore partition affected by commits/delta commits. This is useful for append write mode");
Just disable the cleaning service for append mode?
> Just disable the cleaning service for append mode?
We still need the clean for replace commits.
The clean can be put right after the clustering commit, I think.
Do you mean we only trigger clean after every successful replace commit (including clustering and delete partition)? In that case we may need to disable auto clean and async clean, then add extra steps after committing the replace commit.
Yeah, but we have no good way to infer that config if the table service operations are all async. How about just configuring the ingestion job to disable cleaning manually?
> Yeah, but we have no good way to infer that config if the table service operations are all async. How about just configuring the ingestion job to disable cleaning manually?
That's why we introduced the config hoodie.cleaner.ignore.append.write.commits above in our internal version. We don't need to care whether the table services are sync or async, or whether they run in the streaming ingestion job or in a batch job such as a delete partition through Spark SQL. The user just needs to add a single config wherever the clean service is running. What do you think?
I kind of think the option is verbose. Disabling the cleaning service has the same effect, right?
> I kind of think the option is verbose. Disabling the cleaning service has the same effect, right?
True if there is no replace commit. However, clustering is commonly used in append-only ingestion.
But offline clustering will take over the cleaning task. Both Spark and Flink offline clustering and compaction now trigger cleaning; Spark supports this only in recent master code.
@danny0405 Thanks for your review! Closing this PR since we can achieve the same result by disabling clean in the ingestion job.
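As a rough illustration of the resolution above, the ingestion writer can be given properties that turn cleaning off, leaving the cleaning to the offline clustering job. This is only a sketch: the class and method names are hypothetical, and the two keys used (`hoodie.clean.automatic` and `hoodie.clean.async`) should be checked against the configuration reference for the Hudi version in use.

```java
import java.util.Properties;

public class IngestionCleanProps {

    // Hypothetical helper: builds writer properties for an append-only
    // ingestion job that should not run the cleaner itself.
    static Properties ingestionWriterProps() {
        Properties props = new Properties();
        // Do not trigger cleaning inline after each commit.
        props.setProperty("hoodie.clean.automatic", "false");
        // Do not schedule cleaning asynchronously from this job either.
        props.setProperty("hoodie.clean.async", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = ingestionWriterProps();
        // Print the keys so they can be copied into a job config.
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

The job that runs clustering keeps its default cleaning behaviour, so file groups replaced by clustering are still reclaimed.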
Change Logs
Incremental clean ignores partitions affected by append write commits/delta commits. In append write mode, a single commit may write thousands of files across different partitions, and we know they never need cleaning. However, the current incremental clean tries to list those partitions anyway. This PR fixes that, so we no longer list partitions affected by append write commits.
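The idea can be sketched in a few lines of Java. This is not the code from the PR: the `Instant` class and `partitionsToClean` method are hypothetical stand-ins, and only the action names (`commit`, `deltacommit`, `replacecommit`) follow Hudi's timeline conventions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class IncrementalCleanSketch {

    // Hypothetical, simplified stand-in for one completed instant on the timeline.
    static final class Instant {
        final String action;           // "commit", "deltacommit" or "replacecommit"
        final List<String> partitions; // partitions this instant wrote to

        Instant(String action, List<String> partitions) {
            this.action = action;
            this.partitions = partitions;
        }
    }

    // Collects the partitions the incremental cleaner would need to list.
    static Set<String> partitionsToClean(List<Instant> instants, boolean ignoreAppendWriteCommits) {
        Set<String> result = new TreeSet<>();
        for (Instant instant : instants) {
            boolean isAppendWrite =
                instant.action.equals("commit") || instant.action.equals("deltacommit");
            if (ignoreAppendWriteCommits && isAppendWrite) {
                // Append writes only add new files, so there is nothing to clean here.
                continue;
            }
            result.addAll(instant.partitions);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instant> instants = Arrays.asList(
            new Instant("commit", Arrays.asList("dt=2023-06-01", "dt=2023-06-02")),
            new Instant("deltacommit", Arrays.asList("dt=2023-06-03")),
            new Instant("replacecommit", Arrays.asList("dt=2023-06-01")));
        System.out.println(partitionsToClean(instants, false)); // all three partitions
        System.out.println(partitionsToClean(instants, true));  // only dt=2023-06-01
    }
}
```

With the flag on, only the partition touched by the replace commit is listed, which is the saving the change aims for when a commit spans thousands of partitions.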
Impact
Do not fetch commit metadata or list partitions for append writes (we still list partitions for replace commits).
Risk level (write none, low medium or high below)
low