[HUDI-6882] Differentiate between replacecommits in cluster planning #9755

Merged
yihua merged 2 commits into apache:master from jonvex:fix_clustering_plan_last_cluster
Sep 21, 2023

@jonvex (Contributor) commented Sep 20, 2023

Change Logs

Cluster planning schedules clustering every n commits. To do this, it finds the previous clustering instant and counts the commits made after it. However, it identified the previous clustering instant simply as the latest replacecommit, and replacecommit is also the action used for insert_overwrite. The planner now checks the commit metadata to confirm that the replacecommit is a clustering commit.
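
For illustration, here is a minimal, self-contained sketch of the difference between the two lookups described above. It does not use Hudi's classes: `Instant`, `latestReplaceCommit`, `latestClusteringCommit`, `commitsSince`, and the operation-type strings are hypothetical stand-ins, and the sketch counts every instant after the last clustering, which may differ from what the real planner counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class LastClusteringLookup {

    // Hypothetical stand-in for a timeline instant; this is not Hudi's HoodieInstant.
    static final class Instant {
        final String timestamp;
        final String action;        // e.g. "commit" or "replacecommit"
        final String operationType; // e.g. "cluster", "insert_overwrite", "upsert"

        Instant(String timestamp, String action, String operationType) {
            this.timestamp = timestamp;
            this.action = action;
            this.operationType = operationType;
        }
    }

    // Old behavior: the latest replacecommit of any kind is taken as the last clustering.
    static Optional<Instant> latestReplaceCommit(List<Instant> timeline) {
        for (int i = timeline.size() - 1; i >= 0; i--) {
            Instant in = timeline.get(i);
            if ("replacecommit".equals(in.action)) {
                return Optional.of(in);
            }
        }
        return Optional.empty();
    }

    // Fixed behavior: the replacecommit must also be recorded as a clustering operation.
    static Optional<Instant> latestClusteringCommit(List<Instant> timeline) {
        for (int i = timeline.size() - 1; i >= 0; i--) {
            Instant in = timeline.get(i);
            if ("replacecommit".equals(in.action) && "cluster".equals(in.operationType)) {
                return Optional.of(in);
            }
        }
        return Optional.empty();
    }

    // Number of instants strictly after the given one (timeline is in ascending order).
    static int commitsSince(List<Instant> timeline, Optional<Instant> last) {
        if (!last.isPresent()) {
            return timeline.size();
        }
        return timeline.size() - 1 - timeline.indexOf(last.get());
    }

    // One clustering followed by a mix of upserts and insert_overwrite writes.
    static List<Instant> sampleTimeline() {
        return Arrays.asList(
            new Instant("001", "replacecommit", "cluster"),
            new Instant("002", "commit", "upsert"),
            new Instant("003", "replacecommit", "insert_overwrite"),
            new Instant("004", "commit", "upsert"),
            new Instant("005", "replacecommit", "insert_overwrite"));
    }

    public static void main(String[] args) {
        List<Instant> timeline = sampleTimeline();
        int oldCount = commitsSince(timeline, latestReplaceCommit(timeline));
        int newCount = commitsSince(timeline, latestClusteringCommit(timeline));
        System.out.println("old=" + oldCount + " new=" + newCount); // prints "old=0 new=4"
    }
}
```

In this toy timeline the old lookup keeps resetting the counter on every insert_overwrite, so a threshold of n commits would rarely be reached; the metadata-aware lookup counts from the actual clustering at 001.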

Impact

Clustering now runs as expected on tables that use insert_overwrite heavily.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua force-pushed the fix_clustering_plan_last_cluster branch from b9218ae to 0d495d3 Compare September 21, 2023 15:00
@yihua yihua merged commit 03fdc63 into apache:master Sep 21, 2023
nsivabalan pushed a commit that referenced this pull request Nov 21, 2023
…9755)

Cluster planning will run clustering every n commits. To do this, it gets the previous clustering instant and then finds the number of commits after that. However, it was finding the previous clustering instant just by finding the latest replacecommit. Replacecommit is also used for insert_overwrite. This commit fixes the logic to check the commit metadata to ensure it is a cluster commit.

Co-authored-by: Jonathan Vexler <=>
    .getReverseOrderedInstants()
    .filter(i -> {
      try {
        HoodieCommitMetadata metadata = TimelineUtils.getCommitMetadata(i, this);

This introduced a regression: pending clustering is no longer considered when determining the lastClusteringInstant.
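
To make the regression concrete, here is a small hypothetical sketch. It assumes a simplified model in which each instant carries a `completed` flag and a `clustering` flag; none of these names come from Hudi, and how a pending instant is actually recognized as clustering in Hudi is not shown in this conversation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class PendingClusteringLookup {

    // Hypothetical stand-in for a timeline instant; this is not Hudi's HoodieInstant.
    static final class Instant {
        final String timestamp;
        final boolean completed;  // false models a requested or inflight instant
        final boolean clustering; // true if the instant is a clustering operation

        Instant(String timestamp, boolean completed, boolean clustering) {
            this.timestamp = timestamp;
            this.completed = completed;
            this.clustering = clustering;
        }
    }

    // Models the regression: only completed instants are inspected.
    static Optional<Instant> lastCompletedClustering(List<Instant> timeline) {
        for (int i = timeline.size() - 1; i >= 0; i--) {
            Instant in = timeline.get(i);
            if (in.completed && in.clustering) {
                return Optional.of(in);
            }
        }
        return Optional.empty();
    }

    // Models the behavior the reviewer expects: pending clustering also counts.
    static Optional<Instant> lastClusteringIncludingPending(List<Instant> timeline) {
        for (int i = timeline.size() - 1; i >= 0; i--) {
            Instant in = timeline.get(i);
            if (in.clustering) {
                return Optional.of(in);
            }
        }
        return Optional.empty();
    }

    // A completed clustering, a regular commit, then a clustering that is still pending.
    static List<Instant> sampleTimeline() {
        return Arrays.asList(
            new Instant("001", true, true),
            new Instant("002", true, false),
            new Instant("003", false, true));
    }

    public static void main(String[] args) {
        List<Instant> timeline = sampleTimeline();
        System.out.println("completed-only=" + lastCompletedClustering(timeline).get().timestamp
            + " including-pending=" + lastClusteringIncludingPending(timeline).get().timestamp);
    }
}
```

With the sample timeline, the completed-only lookup returns 001 and overlooks the pending clustering at 003, which is the behavior the comment flags.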
