
[HUDI-7030] Commit-based Clustering Plan Strategy#18251

Open

prashantwason wants to merge 2 commits into apache:master from prashantwason:pw_oss_commit_porting_53

Conversation

@prashantwason
Member

Summary

This PR introduces a new commit-based clustering plan strategy for Apache Hudi that processes data files based on their commit/ingestion order. This is particularly useful for streaming ingestion workloads where:

  • Newly generated small files need to be merged promptly to avoid performance degradation
  • Files produced earlier should be merged first to ensure accurate latency measurement
  • Already-clustered files should be excluded from future clustering plans

Key Changes

  • New CommitBasedClusteringPlanStrategy class - Processes commits in order and generates clustering plans based on commit patterns
  • Checkpoint-based tracking - Uses a checkpoint to track the last processed commit, enabling incremental clustering across runs
  • Replace commit handling - Properly excludes files that have been replaced by previous clustering jobs
  • Remaining partition handling - Creates clustering groups for partitions that haven't been fully processed
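The checkpoint-based tracking described above can be sketched in plain Java. This is an illustrative simplification, not the actual Hudi implementation; the class and method names are hypothetical. The idea is that each planning round only considers commits strictly newer than the persisted checkpoint, then advances the checkpoint to the last commit it processed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of incremental, checkpoint-based commit selection.
public class CommitCheckpointSketch {

  // Returns the commits newer than the checkpoint, in ingestion order.
  // Hudi commit times are lexicographically ordered timestamps, so plain
  // string comparison matches chronological comparison.
  public static List<String> selectCommitsAfter(List<String> sortedCommits, String checkpoint) {
    List<String> pending = new ArrayList<>();
    for (String commit : sortedCommits) {
      if (checkpoint == null || commit.compareTo(checkpoint) > 0) {
        pending.add(commit);
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    List<String> commits = List.of("20260226101010", "20260226111111", "20260226121212");
    // Only the two commits after the checkpoint are planned; the next run
    // would persist the last processed commit as the new checkpoint.
    System.out.println(selectCommitsAfter(commits, "20260226101010"));
  }
}
```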

Configuration

| Property | Description |
| --- | --- |
| hoodie.clustering.plan.strategy.class | Set to CommitBasedClusteringPlanStrategy |
| hoodie.clustering.plan.strategy.max.bytes.per.group | Maximum bytes per clustering group |
| hoodie.clustering.plan.strategy.max.num.groups | Maximum number of clustering groups per plan |
| hoodie.clustering.plan.last.commit | Checkpoint for the last processed commit |
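As a rough illustration, the options from the table could be wired into writer properties as below. The values are placeholders, and note that hoodie.clustering.plan.strategy.class would normally take the fully qualified class name, whose package is not shown in this excerpt.

```java
import java.util.Properties;

// Illustrative wiring of the clustering options from the table above.
public class ClusteringPropsExample {
  public static Properties clusteringProps() {
    Properties props = new Properties();
    // Fully qualified class name required in practice (package elided here).
    props.setProperty("hoodie.clustering.plan.strategy.class", "CommitBasedClusteringPlanStrategy");
    // Cap each clustering group at 2 GiB (placeholder value).
    props.setProperty("hoodie.clustering.plan.strategy.max.bytes.per.group",
        String.valueOf(2L * 1024 * 1024 * 1024));
    props.setProperty("hoodie.clustering.plan.strategy.max.num.groups", "30");
    // Checkpoint: last commit already covered by a previous clustering run.
    props.setProperty("hoodie.clustering.plan.last.commit", "20260226101010");
    return props;
  }
}
```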

Test plan

  • Unit tests for CommitBasedClusteringPlanStrategy (7 tests, all passing)
    • testGenerateClusteringPlanWithNoCommits
    • testGenerateClusteringPlanWithoutCheckpoint
    • testGenerateClusteringPlanWithSingleCommitSinglePartition (parameterized)
    • testGenerateClusteringPlanWithSingleCommitMultiPartitions
    • testGenerateClusteringPlanWithMultiCommits
    • testGenerateClusteringPlanWithReplaceCommit

🤖 Generated with Claude Code

Summary: The first version of the commit-based clustering plan strategy is scoped to append-only ingestion jobs

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen, ureview

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen

Tags: #has_java

JIRA Issues: HUDI-7030

Differential Revision: https://code.uberinternal.com/D18833959
Summary: create clustering groups for remaining partitions

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, ureview, vamshi

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi

Subscribers: vamshi

Tags: #has_java

JIRA Issues: HUDI-7030, HUDI-7197

Differential Revision: https://code.uberinternal.com/D18935003
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 4.31655% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (5fb1b34) to head (54f071b).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% 130 Missing ⚠️
...org/apache/hudi/config/HoodieClusteringConfig.java 75.00% 2 Missing ⚠️
...java/org/apache/hudi/config/HoodieWriteConfig.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18251      +/-   ##
============================================
- Coverage     57.29%   57.23%   -0.07%     
- Complexity    18542    18559      +17     
============================================
  Files          1945     1946       +1     
  Lines        106202   106395     +193     
  Branches      13130    13151      +21     
============================================
+ Hits          60848    60891      +43     
- Misses        39628    39780     +152     
+ Partials       5726     5724       -2     
Flag Coverage Δ
hadoop-mr-java-client 45.31% <4.31%> (-0.08%) ⬇️
spark-java-tests 47.37% <4.31%> (-0.05%) ⬇️
spark-scala-tests 45.45% <4.31%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 84.94% <0.00%> (-0.06%) ⬇️
...org/apache/hudi/config/HoodieClusteringConfig.java 87.73% <75.00%> (-0.33%) ⬇️
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% <0.00%> (ø)

... and 12 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@nsivabalan nsivabalan left a comment


Shared feedback for the source code; will review the tests in the next iteration.

```java
.noDefaultValue()
.markAdvanced()
.sinceVersion("0.14.0")
.withDocumentation("Last commit time to start clustering from, applied to commit-based clustering plan strategy");
```

Shouldn't this be, ....earliest.commit.time.to.cluster


can we call out in docs whether this is inclusive or exclusive?

```java
return getString(HoodieClusteringConfig.PARTITION_REGEX_PATTERN);
}

public String getClusteringLastCommit() {
```

Let's fix all getter methods once we fix the config key.

```java
List<FileSlice> fileSlices = new ArrayList<>();
List<HoodieWriteStat> writeStats = commitMetadata.getWriteStats(partition);
for (HoodieWriteStat stat : writeStats) {
// Only consider base files (ignore log files for clustering)
```

We can't ignore log files for clustering in a MOR table.
We have added the FileGroupReader abstraction in general, so we should not ignore any log files here.


Or, we should confine this plan only for COW table.
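For a MOR table, the reviewer's point is that a file slice's size accounting must include its log files, not just the base file. A minimal sketch with simplified types follows; WriteStat is a stand-in for Hudi's HoodieWriteStat, and the path patterns used for matching are assumptions, not the actual Hudi naming logic.

```java
import java.util.List;

// Hypothetical sketch: sum base-file and log-file bytes for one file slice.
public class SliceSizeSketch {
  // Simplified stand-in for HoodieWriteStat.
  public record WriteStat(String path, long fileSizeInBytes) {}

  // Total bytes for the slice: the .parquet base file plus any .log files.
  public static long sliceSizeBytes(List<WriteStat> stats) {
    return stats.stream()
        .filter(s -> s.path().endsWith(".parquet") || s.path().contains(".log."))
        .mapToLong(WriteStat::fileSizeInBytes)
        .sum();
  }
}
```

Alternatively, as the comment suggests, the strategy could be restricted to COW tables, where every write produces a new base file and log files never arise.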

```java
partitionToFiles.put(partition, eligibleFileSlices);
}

if (partitionToSize.get(partition) >= getWriteConfig().getClusteringMaxBytesInGroup()) {
```

But this does not strictly honor the max clustering group size, right?
We should remove the last added entry so that the group stays <= ClusteringMaxBytesInGroup.
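The stricter packing suggested here can be sketched with plain types (names hypothetical): close a group before adding the file that would push it past the cap, so every group's total stays within the limit. A single file larger than the cap still forms its own group, since it cannot be split at planning time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of greedy packing that never exceeds maxBytesPerGroup.
public class GroupPackingSketch {

  public static List<List<Long>> packIntoGroups(List<Long> fileSizes, long maxBytesPerGroup) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      // Close the current group *before* overflowing it, instead of
      // checking the cap after the file has already been added.
      if (!current.isEmpty() && currentBytes + size > maxBytesPerGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```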

```java
// For partitions that have not been processed, create clustering groups
for (String partition : partitionToSize.keySet()) {
if (partitionToSize.get(partition) > 0) {
Pair<Stream<HoodieClusteringGroup>, Boolean> groupPair = buildClusteringGroupsForPartition(partition, partitionToFiles.get(partition));
```

Are we not accounting for ClusteringMaxNumGroups here?
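One way to account for the max-groups cap the reviewer asks about is to truncate the stream of planned groups before building the plan. The sketch below uses plain strings in place of HoodieClusteringGroup; maxNumGroups mirrors hoodie.clustering.plan.strategy.max.num.groups.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: cap the number of clustering groups in one plan.
public class GroupLimitSketch {
  public static List<String> capGroups(Stream<String> plannedGroups, int maxNumGroups) {
    // Drops any groups beyond the cap; they get picked up in a later plan.
    return plannedGroups.limit(maxNumGroups).collect(Collectors.toList());
  }
}
```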

```xml
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
```

why do we need this change?

