[HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits #8905

Closed
stream2000 wants to merge 1 commit into apache:master from stream2000:HUDI-6337_avoid_fetch_commit_metadata_in_append_mode

Conversation

@stream2000 stream2000 commented Jun 8, 2023

Change Logs

Incremental clean now ignores partitions affected by append-write commits/delta commits. In append write mode, a single commit may write thousands of files across different partitions, and we know that none of them need cleaning. However, the current incremental clean still tries to list those partitions. This PR fixes that: partitions affected by append-write commits are no longer listed.

Impact

Do not fetch commit metadata or list partitions for append writes (partitions touched by replace commits are still listed).
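The selection rule described above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `IncrementalCleanSketch`, its nested `Instant` class, and `partitionsToList` are hypothetical names, and the action strings are modeled on Hudi's timeline actions (`commit`, `deltacommit`, `replacecommit`).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: which partitions would incremental clean list?
final class IncrementalCleanSketch {

    // Simplified stand-in for a timeline instant; not Hudi's HoodieInstant.
    static final class Instant {
        final String action;           // "commit", "deltacommit", "replacecommit", ...
        final List<String> partitions; // partitions the instant wrote to

        Instant(String action, List<String> partitions) {
            this.action = action;
            this.partitions = partitions;
        }
    }

    // Collects the partitions to list for the instants added since the last clean.
    // With ignoreAppendWrites set, commit and deltacommit instants are skipped,
    // so their metadata is never read and their partitions are never listed.
    static Set<String> partitionsToList(List<Instant> newInstants, boolean ignoreAppendWrites) {
        Set<String> result = new LinkedHashSet<>();
        for (Instant instant : newInstants) {
            boolean appendWrite =
                "commit".equals(instant.action) || "deltacommit".equals(instant.action);
            if (ignoreAppendWrites && appendWrite) {
                continue;
            }
            result.addAll(instant.partitions);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instant> instants = Arrays.asList(
            new Instant("commit", Arrays.asList("dt=2023-06-01", "dt=2023-06-02")),
            new Instant("replacecommit", Arrays.asList("dt=2023-06-02")));
        System.out.println(partitionsToList(instants, true));
        System.out.println(partitionsToList(instants, false));
    }
}
```

In this sketch only the replace commit contributes partitions when the flag is on, which mirrors the stated impact: append writes are skipped, replace commits are not.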

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@stream2000 stream2000 marked this pull request as draft June 8, 2023 12:17
@stream2000 stream2000 force-pushed the HUDI-6337_avoid_fetch_commit_metadata_in_append_mode branch 2 times, most recently from 40921a2 to 8ba4c9e Compare June 8, 2023 16:12
@stream2000 stream2000 marked this pull request as ready for review June 8, 2023 16:13
@stream2000 stream2000 force-pushed the HUDI-6337_avoid_fetch_commit_metadata_in_append_mode branch from 8ba4c9e to f8f1426 Compare June 8, 2023 16:18
@stream2000 stream2000 changed the title from "[HUDI-6337] Incremental Clean skip fetch commit metadata for append mode" to "[HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits" Jun 8, 2023
@stream2000 stream2000 force-pushed the HUDI-6337_avoid_fetch_commit_metadata_in_append_mode branch from f8f1426 to f9adcec Compare June 8, 2023 16:23
@stream2000
Contributor Author

@hudi-bot run azure

@stream2000 stream2000 force-pushed the HUDI-6337_avoid_fetch_commit_metadata_in_append_mode branch from f9adcec to e73f7ed Compare June 9, 2023 11:54
@hudi-bot
Collaborator

hudi-bot commented Jun 9, 2023

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

.defaultValue(false)
.markAdvanced()
.withDocumentation("When set to true, cleaner will ignore partitions affected by commits/delta commits. This is useful for append write mode");

Contributor

Just disable the cleaning service for append mode?

Contributor Author

> Just disable the cleaning service for append mode?

We still need cleaning for replace commits.

Contributor

I think the clean can be run right after the clustering commit.

Contributor Author

stream2000 commented Jun 11, 2023

Do you mean we should trigger clean only after every successful replace commit (including clustering and delete partition)? That way we may need to disable auto clean and async clean, and then add extra steps after committing the replace commit.
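The "extra steps after committing the replace commit" could look something like the hook below. It is only a sketch of the idea under discussion: `CleanAfterReplaceSketch`, `CleanTrigger`, and `onCommitCompleted` are hypothetical names and do not correspond to any Hudi API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-commit hook: run clean only after a replace commit completes.
final class CleanAfterReplaceSketch {

    // Stand-in for whatever actually schedules and runs the clean.
    interface CleanTrigger {
        void clean(String instantTime);
    }

    private final CleanTrigger trigger;

    CleanAfterReplaceSketch(CleanTrigger trigger) {
        this.trigger = trigger;
    }

    // Called once per completed instant. Plain commits and delta commits are
    // ignored; a replace commit (clustering, delete partition) triggers a clean.
    void onCommitCompleted(String action, String instantTime) {
        if ("replacecommit".equals(action)) {
            trigger.clean(instantTime);
        }
    }

    public static void main(String[] args) {
        List<String> cleaned = new ArrayList<>();
        CleanAfterReplaceSketch hook = new CleanAfterReplaceSketch(cleaned::add);
        hook.onCommitCompleted("commit", "001");
        hook.onCommitCompleted("replacecommit", "002");
        System.out.println(cleaned);
    }
}
```

A hook like this only works where the replace commit is made, which is the difficulty raised in the following comments when table services run asynchronously.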

Contributor

Yeah, but we have no good way to infer that config if the table service operations are all async. How about just configuring the ingestion job manually to disable cleaning?

Contributor Author

stream2000 commented Jun 12, 2023

> Yeah, but we have no good way to infer that config if the table service operations are all async. How about just configuring the ingestion job manually to disable cleaning?

That's why we introduced the above config, hoodie.cleaner.ignore.append.write.commits, in our internal version. We don't need to care whether the table services are sync or async, or whether they run in the streaming ingestion job or in a batch job such as deleting partitions via Spark SQL. Users just need to add a single config wherever the clean service runs. What do you think?
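As an illustration of the "single config" idea, the sketch below sets and reads the option through plain `java.util.Properties`. The key is the one quoted in this thread; it was proposed in this PR and should not be assumed to exist in any released Hudi version. `CleanerOptionSketch` and `ignoreAppendWriteCommits` are hypothetical names, and the `false` default follows the `.defaultValue(false)` line in the reviewed diff.

```java
import java.util.Properties;

// Hypothetical sketch of reading the option proposed in this PR.
final class CleanerOptionSketch {

    // Key quoted from the discussion; proposed here, not a released Hudi option.
    static final String IGNORE_APPEND_WRITE_COMMITS =
        "hoodie.cleaner.ignore.append.write.commits";

    // Returns the flag value, defaulting to false when the key is absent.
    static boolean ignoreAppendWriteCommits(Properties props) {
        return Boolean.parseBoolean(props.getProperty(IGNORE_APPEND_WRITE_COMMITS, "false"));
    }

    public static void main(String[] args) {
        Properties cleanJobProps = new Properties();
        cleanJobProps.setProperty(IGNORE_APPEND_WRITE_COMMITS, "true");
        System.out.println(ignoreAppendWriteCommits(cleanJobProps));
    }
}
```

The point of the design is that the flag is read wherever the cleaner runs, so it does not matter which job produced the commits.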

Contributor

I kind of think the option is verbose. Disabling the cleaning service has the same effect, right?

Contributor Author

> I kind of think the option is verbose. Disabling the cleaning service has the same effect, right?

True, if there are no replace commits. However, clustering is commonly used in append-only ingestion.

Contributor

But offline clustering will take over the cleaning task: both Spark and Flink offline clustering and compaction now trigger cleaning, although Spark supports this only in recent master code.

@stream2000
Contributor Author

@danny0405 Thanks for your review! Closing this PR, since we can achieve the same result by disabling clean in the ingestion job.
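A minimal sketch of that resolution, assuming a Spark writer configured through properties: the ingestion job turns off automatic cleaning and leaves cleaning to the offline clustering job. `hoodie.clean.automatic` is, to our understanding, the Hudi write config that controls automatic cleaning after each commit; verify the key against the Hudi version in use. `IngestionCleanConfigSketch` and `ingestionWriterProps` are hypothetical names.

```java
import java.util.Properties;

// Hypothetical sketch: writer properties for an append-only ingestion job
// that leaves cleaning to a separate offline clustering job.
final class IngestionCleanConfigSketch {

    static Properties ingestionWriterProps() {
        Properties props = new Properties();
        // Assumed key: disables the clean that otherwise runs after each commit.
        props.setProperty("hoodie.clean.automatic", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(ingestionWriterProps());
    }
}
```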

@stream2000 stream2000 closed this Jun 14, 2023