[HUDI-5955] fix incremental clean not work caused by archive #8232

Closed
hbgstc123 wants to merge 1 commit into apache:master from hbgstc123:fix_incremental_clean_bug_by_archive

@hbgstc123
Contributor

Change Logs

The current incremental clean action may miss data files that should be cleaned if the commit instants of those data files have already been archived.

This PR makes sure that, when incremental clean is enabled, HoodieTimelineArchiver won't archive commit instants later than or equal to the earliestCommitToRetain of the last completed clean instant, so that the clean executor can still find those files in the active timeline.
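
As a rough illustration of the intended rule (a simplified sketch, not the patch itself): instants are modeled as plain timestamp strings, and `ArchiveGuardSketch` and `archivable` are hypothetical names, not Hudi APIs. It assumes instant times are fixed-width strings, so string order matches time order. Under that assumption, the archiver would only pick candidates strictly earlier than earliestCommitToRetain:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class ArchiveGuardSketch {

    // Hypothetical helper: keeps only the candidate instants that are strictly
    // earlier than the earliestCommitToRetain of the last completed clean.
    // When no clean has completed yet, nothing is held back.
    static List<String> archivable(List<String> candidates,
                                   Optional<String> earliestCommitToRetain) {
        return candidates.stream()
            .filter(ts -> earliestCommitToRetain
                .map(retain -> ts.compareTo(retain) < 0) // assumes fixed-width timestamps
                .orElse(true))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "20230301120000", "20230302120000", "20230303120000", "20230304120000");

        // With a completed clean, only instants before its retained commit are archived.
        System.out.println(archivable(candidates, Optional.of("20230303120000")));

        // Without any completed clean, every candidate stays eligible.
        System.out.println(archivable(candidates, Optional.empty()));
    }
}
```

The comparison is strict, so the instant equal to earliestCommitToRetain stays in the active timeline, matching the "later than or equal to" wording above.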

Impact

no

Risk level (write none, low, medium or high below)

low

Documentation Update

no

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

    .orElse(true)
).filter(s ->
    latestCleanRetainedInstant.map(instant ->
        compareTimestamps(s.getTimestamp(), LESSER_THAN, instant))
Contributor

I can see the failure case for incremental cleaning, but blocking the archiving of instants based on cleaning is somewhat risky; it would be better to fix the incremental cleaning procedure itself.

Contributor Author

We can fall back to full cleaning if any of the retained commits have been archived.

Contributor Author

Or we could just use the archived timeline to continue the incremental clean.

Contributor

The archived timeline should also be fine, as long as it is not loaded at high frequency.

@danny0405 added the area:table-service (Table services) label on Mar 20, 2023

@hbgstc123
Contributor Author

I submitted a new PR, #8373, that falls back to a full clean if an instant needed for incremental clean has been archived.
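
The fallback decision can be sketched roughly as follows. This is only an illustration of the idea, not the code in #8373: `CleanModeSketch`, `CleanMode` and `chooseMode` are hypothetical names, and instants are again assumed to be fixed-width timestamp strings whose string order matches time order.

```java
import java.util.Optional;

public class CleanModeSketch {

    enum CleanMode { INCREMENTAL, FULL }

    // Hypothetical decision helper.
    // lastEarliestCommitToRetain: value recorded by the last completed clean, if any.
    // firstActiveInstant: earliest instant still present in the active timeline, if any.
    static CleanMode chooseMode(Optional<String> lastEarliestCommitToRetain,
                                Optional<String> firstActiveInstant) {
        if (!lastEarliestCommitToRetain.isPresent() || !firstActiveInstant.isPresent()) {
            // Nothing to resume from, so scan everything.
            return CleanMode.FULL;
        }
        // If the retained commit is older than everything in the active timeline,
        // it has been archived and incremental clean could miss files.
        boolean archived =
            lastEarliestCommitToRetain.get().compareTo(firstActiveInstant.get()) < 0;
        return archived ? CleanMode.FULL : CleanMode.INCREMENTAL;
    }

    public static void main(String[] args) {
        // Retained commit is still in the active timeline: incremental is safe.
        System.out.println(chooseMode(Optional.of("20230303120000"),
                                      Optional.of("20230301120000")));
        // Retained commit is older than the first active instant: fall back.
        System.out.println(chooseMode(Optional.of("20230303120000"),
                                      Optional.of("20230305120000")));
    }
}
```

Compared with holding back the archiver, this keeps archiving independent of cleaning and only pays the cost of a full scan when the needed instants are gone.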

@hbgstc123 closed this on Apr 8, 2023