
[HUDI-5422] Control KEEP_LATEST_VERSIONS clean replaced files immediately or delete after a while#7519

Open
zhangyue19921010 wants to merge 4 commits intoapache:masterfrom
zhangyue19921010:control-KEEP_LATEST_VERSION-delete-replaced-files

@zhangyue19921010
Contributor

@zhangyue19921010 zhangyue19921010 commented Dec 20, 2022

Change Logs

At present, when the Hudi cleaner uses the KEEP_LATEST_FILE_VERSIONS policy to clean a Hudi table, it assumes that a file group becomes fully eligible for cleaning as soon as it is replaced, which means all the replaced data files are deleted as soon as possible.

As a result, downstream queries that are still reading those files will fail.

So we need a mechanism to control whether KEEP_LATEST_VERSIONS cleans replaced files immediately or deletes them after a delay.
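
The intended behavior can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ReplacedFileGroup`, `eligible_for_clean`, and the `retain` parameter are hypothetical names, and the real cleaner works on Hudi timeline instants rather than wall-clock datetimes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ReplacedFileGroup:
    # Hypothetical model: a file group id plus the time its replacecommit completed.
    file_group_id: str
    replaced_at: datetime


def eligible_for_clean(groups, now, retain=timedelta(0)):
    """Return the replaced file groups whose retention window has elapsed.

    retain=timedelta(0) mimics the current behavior (clean immediately);
    a positive value delays deletion so in-flight queries can finish.
    """
    return [g for g in groups if now - g.replaced_at >= retain]


# Example data (assumed, for illustration only).
now = datetime(2022, 12, 20, 12, 0)
groups = [
    ReplacedFileGroup("fg-1", now - timedelta(hours=3)),
    ReplacedFileGroup("fg-2", now - timedelta(minutes=10)),
]
clean_now = eligible_for_clean(groups, now)
clean_delayed = eligible_for_clean(groups, now, retain=timedelta(hours=1))
```

With no retention both file groups are cleaned; with a one-hour retention only `fg-1`, replaced three hours earlier, is cleaned.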

Impact

no

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@zhangyue19921010
Contributor Author

a5d2214 Azure: SUCCESS

UTs passed.

Re-triggering the failed CI job:
Java CI / validate-bundles (flink1.15, spark3.3) (pull_request)

@danny0405 danny0405 changed the title [HUDI-5422] Control KEPP_LATEST_VERSIONS clean replaced files immediately or delete after a while [HUDI-5422] Control KEEP_LATEST_VERSIONS clean replaced files immediately or delete after a while Dec 21, 2022
@danny0405
Contributor

I guess this PR is related to https://github.com/apache/hudi/pull/7405/files. If the clustering metadata files are archived but the replaced files are not cleaned, queries would see duplicates.

@zhangyue19921010
Contributor Author

zhangyue19921010 commented Dec 21, 2022

> I guess this PR is related to https://github.com/apache/hudi/pull/7405/files. If the clustering metadata files are archived but the replaced files are not cleaned, queries would see duplicates.

Hi @danny0405, I think the two are related, but they do not aim to solve the same issue.
HUDI-5341 addresses incremental clean not cleaning all the replaced files as expected, which causes duplicate data.

This PR adds a new control for KEEP_LATEST_VERSIONS, which currently deletes all the replaced files immediately and causes downstream queries to fail.

Of course, users need to set this time carefully to make sure all replaced files are deleted before archival.
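
As a rough illustration of that constraint (a sketch only; `retention_is_safe` and its parameters are hypothetical, and `archived_at` is an estimate the user would have to supply, since when an instant is archived depends on the table's archival configuration):

```python
from datetime import datetime, timedelta


def retention_is_safe(replaced_at, retain, archived_at):
    """True if the replaced files become cleanable no later than the
    estimated time at which their replacecommit is archived."""
    return replaced_at + retain <= archived_at


replaced_at = datetime(2022, 12, 21, 8, 0)
archived_at = datetime(2022, 12, 21, 20, 0)  # assumed estimate

ok = retention_is_safe(replaced_at, timedelta(hours=2), archived_at)
too_long = retention_is_safe(replaced_at, timedelta(hours=24), archived_at)
```

Here a two-hour retention ends before the estimated archival time, while a 24-hour retention would leave replaced files on storage after their metadata is archived, which is the duplicate-data case described above.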

@zhangyue19921010
Contributor Author

@hudi-bot run azure

@yihua yihua self-assigned this Dec 22, 2022
@yihua yihua added priority:critical Production degraded; pipelines stalled area:table-service Table services labels Dec 22, 2022
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2024

Projects

Status: 🔖 Ready for review

4 participants