At present, when the Hudi cleaner uses the KEEP_LATEST_FILE_VERSIONS policy to clean a Hudi table, it assumes that a file group becomes fully eligible for cleaning as soon as it is replaced, so all replaced data files are deleted as soon as possible.
As a result, downstream queries that are still reading those files fail.
We therefore need a mechanism to control whether KEEP_LATEST_VERSIONS cleans replaced files immediately or deletes them after a delay.
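The proposed control can be pictured as a retention window applied to replaced file groups. The following is a minimal sketch under stated assumptions, not the code in this PR: the class `ReplacedFileCleanPolicy`, the field `retainAfterReplace`, and the method `isEligibleForClean` are hypothetical names chosen for illustration and do not exist in Hudi.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of the Hudi codebase.
final class ReplacedFileCleanPolicy {

    // Hypothetical setting: how long replaced files are kept after the
    // replace commit completes. Duration.ZERO reproduces the current
    // "delete as soon as possible" behavior.
    private final Duration retainAfterReplace;

    ReplacedFileCleanPolicy(Duration retainAfterReplace) {
        if (retainAfterReplace.isNegative()) {
            throw new IllegalArgumentException("retention must not be negative");
        }
        this.retainAfterReplace = retainAfterReplace;
    }

    // A replaced file group becomes eligible for cleaning once the
    // retention window since the replace commit has fully elapsed.
    boolean isEligibleForClean(Instant replacedAt, Instant now) {
        return !now.isBefore(replacedAt.plus(retainAfterReplace));
    }

    public static void main(String[] args) {
        Instant replacedAt = Instant.parse("2022-12-21T00:00:00Z");
        ReplacedFileCleanPolicy delayed = new ReplacedFileCleanPolicy(Duration.ofHours(1));
        // 30 minutes after the replace: still inside the one-hour window.
        System.out.println(delayed.isEligibleForClean(replacedAt, replacedAt.plus(Duration.ofMinutes(30))));
        // 2 hours after the replace: the window has elapsed.
        System.out.println(delayed.isEligibleForClean(replacedAt, replacedAt.plus(Duration.ofHours(2))));
    }
}
```

With a zero retention the check is always true once the replace commit exists, which matches the behavior described above; a non-zero retention gives in-flight downstream queries time to finish before the files disappear.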
Impact
no
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
re-trigger failed CI
danny0405 changed the title from "[HUDI-5422] Control KEPP_LATEST_VERSIONS clean replaced files immediately or delete after a while" to "[HUDI-5422] Control KEEP_LATEST_VERSIONS clean replaced files immediately or delete after a while" on Dec 21, 2022
I guess this PR is related to https://github.com/apache/hudi/pull/7405/files: if the clustering metadata files are archived but the replaced files are not cleaned, queries would see duplicates.
Hi @danny0405, I think the two are related, but they do not aim to solve the same issue.
HUDI-5341 tries to fix incremental clean not cleaning all the replaced files as expected, which causes duplicate data.
This PR adds a new control for KEEP_LATEST_VERSIONS, which currently deletes all replaced files immediately and causes downstream queries to fail.
Of course, users need to set this time carefully to make sure all replaced files are deleted before archiving.