
[HUDI-7030] Commit-based Clustering Plan Strategy#18251

Open

prashantwason wants to merge 2 commits into apache:master from prashantwason:pw_oss_commit_porting_53

Conversation

@prashantwason
Member

Summary

This PR introduces a new commit-based clustering plan strategy for Apache Hudi that processes data files based on their commit/ingestion order. This is particularly useful for streaming ingestion workloads where:

  • Newly generated small files need to be merged promptly to avoid performance degradation
  • Files produced earlier should be merged first to ensure accurate latency measurement
  • Already-clustered files should be excluded from future clustering plans

Key Changes

  • New CommitBasedClusteringPlanStrategy class - Processes commits in order and generates clustering plans based on commit patterns
  • Checkpoint-based tracking - Uses a checkpoint to track the last processed commit, enabling incremental clustering across runs
  • Replace commit handling - Properly excludes files that have been replaced by previous clustering jobs
  • Remaining partition handling - Creates clustering groups for partitions that haven't been fully processed
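The checkpoint-based tracking described above can be sketched in plain Java. This is an illustrative simplification, not the actual Hudi implementation; the class and method names are hypothetical. The idea is that each planning round only considers commits strictly newer than the persisted checkpoint, then advances the checkpoint to the last commit it processed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of incremental, checkpoint-based commit selection.
public class CommitCheckpointSketch {

  // Returns the commits newer than the checkpoint, in ingestion order.
  // Hudi commit times are lexicographically ordered timestamps, so plain
  // string comparison matches chronological comparison.
  public static List<String> selectCommitsAfter(List<String> sortedCommits, String checkpoint) {
    List<String> pending = new ArrayList<>();
    for (String commit : sortedCommits) {
      if (checkpoint == null || commit.compareTo(checkpoint) > 0) {
        pending.add(commit);
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    List<String> commits = List.of("20260226101010", "20260226111111", "20260226121212");
    // Only the two commits after the checkpoint are planned; the next run
    // would persist the last processed commit as the new checkpoint.
    System.out.println(selectCommitsAfter(commits, "20260226101010"));
  }
}
```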

Configuration

| Property | Description |
| --- | --- |
| hoodie.clustering.plan.strategy.class | Set to CommitBasedClusteringPlanStrategy |
| hoodie.clustering.plan.strategy.max.bytes.per.group | Maximum bytes per clustering group |
| hoodie.clustering.plan.strategy.max.num.groups | Maximum number of clustering groups per plan |
| hoodie.clustering.plan.last.commit | Checkpoint for the last processed commit |
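As a rough illustration, the options from the table could be wired into writer properties as below. The values are placeholders, and note that hoodie.clustering.plan.strategy.class would normally take the fully qualified class name, whose package is not shown in this excerpt.

```java
import java.util.Properties;

// Illustrative wiring of the clustering options from the table above.
public class ClusteringPropsExample {
  public static Properties clusteringProps() {
    Properties props = new Properties();
    // Fully qualified class name required in practice (package elided here).
    props.setProperty("hoodie.clustering.plan.strategy.class", "CommitBasedClusteringPlanStrategy");
    // Cap each clustering group at 2 GiB (placeholder value).
    props.setProperty("hoodie.clustering.plan.strategy.max.bytes.per.group",
        String.valueOf(2L * 1024 * 1024 * 1024));
    props.setProperty("hoodie.clustering.plan.strategy.max.num.groups", "30");
    // Checkpoint: last commit already covered by a previous clustering run.
    props.setProperty("hoodie.clustering.plan.last.commit", "20260226101010");
    return props;
  }
}
```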

Test plan

  • Unit tests for CommitBasedClusteringPlanStrategy (7 tests, all passing)
    • testGenerateClusteringPlanWithNoCommits
    • testGenerateClusteringPlanWithoutCheckpoint
    • testGenerateClusteringPlanWithSingleCommitSinglePartition (parameterized)
    • testGenerateClusteringPlanWithSingleCommitMultiPartitions
    • testGenerateClusteringPlanWithMultiCommits
    • testGenerateClusteringPlanWithReplaceCommit

🤖 Generated with Claude Code

Summary: The first version of the commit-based clustering plan strategy is scoped to append-only ingestion jobs

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen, ureview

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen

Tags: #has_java

JIRA Issues: HUDI-7030

Differential Revision: https://code.uberinternal.com/D18833959
Summary: create clustering groups for remaining partitions

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, ureview, vamshi

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi

Subscribers: vamshi

Tags: #has_java

JIRA Issues: HUDI-7030, HUDI-7197

Differential Revision: https://code.uberinternal.com/D18935003
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 4.31655% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (5fb1b34) to head (54f071b).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% 130 Missing ⚠️
...org/apache/hudi/config/HoodieClusteringConfig.java 75.00% 2 Missing ⚠️
...java/org/apache/hudi/config/HoodieWriteConfig.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18251      +/-   ##
============================================
- Coverage     57.29%   57.23%   -0.07%     
- Complexity    18542    18559      +17     
============================================
  Files          1945     1946       +1     
  Lines        106202   106395     +193     
  Branches      13130    13151      +21     
============================================
+ Hits          60848    60891      +43     
- Misses        39628    39780     +152     
+ Partials       5726     5724       -2     
Flag Coverage Δ
hadoop-mr-java-client 45.31% <4.31%> (-0.08%) ⬇️
spark-java-tests 47.37% <4.31%> (-0.05%) ⬇️
spark-scala-tests 45.45% <4.31%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 84.94% <0.00%> (-0.06%) ⬇️
...org/apache/hudi/config/HoodieClusteringConfig.java 87.73% <75.00%> (-0.33%) ⬇️
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% <0.00%> (ø)

... and 12 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@nsivabalan nsivabalan left a comment


Shared feedback for the source code; will review the tests in the next iteration.

```java
.noDefaultValue()
.markAdvanced()
.sinceVersion("0.14.0")
.withDocumentation("Last commit time to start clustering from, applied to commit-based clustering plan strategy");
```

Shouldn't this be, ....earliest.commit.time.to.cluster


can we call out in docs whether this is inclusive or exclusive?

```java
return getString(HoodieClusteringConfig.PARTITION_REGEX_PATTERN);
}

public String getClusteringLastCommit() {
```

Let's fix all getter methods once we fix the config key.

```java
List<FileSlice> fileSlices = new ArrayList<>();
List<HoodieWriteStat> writeStats = commitMetadata.getWriteStats(partition);
for (HoodieWriteStat stat : writeStats) {
// Only consider base files (ignore log files for clustering)
```

We can't ignore log files for clustering in a MOR table.
We have added the FileGroupReader abstraction in general, so we should not ignore any log files here.


Or, we should confine this plan only for COW table.
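For a MOR table, the reviewer's point is that a file slice's size accounting must include its log files, not just the base file. A minimal sketch with simplified types follows; WriteStat is a stand-in for Hudi's HoodieWriteStat, and the path patterns used for matching are assumptions, not the actual Hudi naming logic.

```java
import java.util.List;

// Hypothetical sketch: sum base-file and log-file bytes for one file slice.
public class SliceSizeSketch {
  // Simplified stand-in for HoodieWriteStat.
  public record WriteStat(String path, long fileSizeInBytes) {}

  // Total bytes for the slice: the .parquet base file plus any .log files.
  public static long sliceSizeBytes(List<WriteStat> stats) {
    return stats.stream()
        .filter(s -> s.path().endsWith(".parquet") || s.path().contains(".log."))
        .mapToLong(WriteStat::fileSizeInBytes)
        .sum();
  }
}
```

Alternatively, as the comment suggests, the strategy could be restricted to COW tables, where every write produces a new base file and log files never arise.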

```java
partitionToFiles.put(partition, eligibleFileSlices);
}

if (partitionToSize.get(partition) >= getWriteConfig().getClusteringMaxBytesInGroup()) {
```

But this does not strictly honor the max clustering group size, right?
We should remove the last added entry so that the group stays <= ClusteringMaxBytesInGroup.
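The stricter packing suggested here can be sketched with plain types (names hypothetical): close a group before adding the file that would push it past the cap, so every group's total stays within the limit. A single file larger than the cap still forms its own group, since it cannot be split at planning time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of greedy packing that never exceeds maxBytesPerGroup.
public class GroupPackingSketch {

  public static List<List<Long>> packIntoGroups(List<Long> fileSizes, long maxBytesPerGroup) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      // Close the current group *before* overflowing it, instead of
      // checking the cap after the file has already been added.
      if (!current.isEmpty() && currentBytes + size > maxBytesPerGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```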

```java
// For partitions that have not been processed, create clustering groups
for (String partition : partitionToSize.keySet()) {
if (partitionToSize.get(partition) > 0) {
Pair<Stream<HoodieClusteringGroup>, Boolean> groupPair = buildClusteringGroupsForPartition(partition, partitionToFiles.get(partition));
```

Are we not accounting for ClusteringMaxNumGroups here?
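One way to account for the max-groups cap the reviewer asks about is to truncate the stream of planned groups before building the plan. The sketch below uses plain strings in place of HoodieClusteringGroup; maxNumGroups mirrors hoodie.clustering.plan.strategy.max.num.groups.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: cap the number of clustering groups in one plan.
public class GroupLimitSketch {
  public static List<String> capGroups(Stream<String> plannedGroups, int maxNumGroups) {
    // Drops any groups beyond the cap; they get picked up in a later plan.
    return plannedGroups.limit(maxNumGroups).collect(Collectors.toList());
  }
}
```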

```xml
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
```

why do we need this change?

