
feat(table-services): Support hoodie.clustering.enable.expirations to allow cleanup of failed clustering plans (intended for PreferWriterConflictResolutionStrategy)#18302

Open
kbuci wants to merge 16 commits into apache:master from kbuci:PR-17879

Conversation

@kbuci (author) commented Mar 10, 2026

Describe the issue this Pull Request addresses

When using PreferWriterConflictResolutionStrategy for multi-writer setups, clustering jobs can fail and leave behind incomplete replacecommit instants on the timeline. These stale clustering instants block future writes targeting the same file groups and require manual intervention to clean up. This PR introduces automatic rollback of failed clustering instants with expired heartbeats, gated behind a new configuration so it is opt-in for users who need it.

Issue: #17879

Summary and Changelog

Adds opt-in support for automatically rolling back failed/stale clustering instants during the rollbackFailedWrites flow (LAZY cleaning policy), and a utility for partition-targeted rollback of failed clustering.

New Configurations:

  • hoodie.clustering.enable.expirations (default: false): Enables rollback of incomplete clustering instants with expired heartbeats. Can only be applied if PreferWriterConflictResolutionStrategy is the configured conflict resolution strategy.
  • hoodie.clustering.expiration.time.mins (default: 60): Minimum age (in minutes) a clustering instant must reach before it is eligible for rollback. Acts as a guardrail so that ingestion jobs do not roll back clustering operations that a table service platform is itself about to roll back and retry.
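As a quick illustration, the two new settings could be supplied as writer properties along the lines below. This is a sketch: the lock-config key and the strategy class path are assumptions based on Hudi's usual naming conventions, not taken from this diff.

```java
import java.util.Properties;

public class ClusteringExpirationConfig {
  // Builds writer properties enabling the clustering-expiration feature
  // introduced by this PR. The conflict-resolution key and class path are
  // assumed names, not verified against this change set.
  public static Properties writerProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.write.lock.conflict.resolution.strategy",
        "org.apache.hudi.client.transaction.PreferWriterConflictResolutionStrategy");
    // Opt in to rollback of failed clustering instants with expired heartbeats.
    props.setProperty("hoodie.clustering.enable.expirations", "true");
    // Only instants at least this many minutes old are eligible for rollback.
    props.setProperty("hoodie.clustering.expiration.time.mins", "60");
    return props;
  }
}
```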

Behavioral Changes:

  • HoodieWriteConfig.autoAdjustConfigsForConcurrencyMode: When PreferWriterConflictResolutionStrategy is enabled, the clustering updates strategy is automatically set to SparkAllowUpdateStrategy so that ingestion writes can proceed even when there is inflight clustering targeting the same file groups.
  • BaseHoodieTableServiceClient.getInstantsToRollback: Under the LAZY failed writes cleaning policy, eligible incomplete clustering instants (old enough, config enabled, confirmed as clustering action) are now included in the inflight stream before heartbeat-based expiry filtering.
  • BaseHoodieTableServiceClient.getInstantsToRollbackForLazyCleanPolicy: The double-check after timeline reload now also considers the pending replace/clustering timeline when the config is enabled, so that expired clustering instants (if eligible) are not inadvertently filtered out.
  • New helper BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback: Encapsulates the check for whether an instant is a clustering instant (with expired heartbeat) that is old enough and the rollback config is enabled.
  • BaseHoodieTableServiceClient.getPendingRollbackInfos: Uses the new helper to allow re-attempting pending rollback plans for eligible clustering instants.

New Utilities in HoodieClusteringJob:

  • getPendingClusteringInstantsForPartitions(metaClient, partitions): Returns all pending clustering instant times that target any of the given partitions.
  • rollbackFailedClusteringForPartitions(client, metaClient, partitions): Rolls back pending clustering instants targeting the given partitions, filtering for eligibility (config enabled, old enough, clustering action) and expired heartbeat. This allows users such as a table service platform to "clear out" any potentially conflicting clustering plans before attempting a new clustering plan (assuming hoodie.clustering.enable.expirations is enabled); otherwise the new clustering plan could not include file groups from those other inflight clustering plans still in the requested state.
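To make the partition-targeted flow concrete, here is a small illustrative stand-in in plain Java (not the actual Hudi utility): given each pending clustering instant's target partitions, it selects the instants that touch any of the requested partitions, which is the filtering step getPendingClusteringInstantsForPartitions performs against clustering plans on the timeline.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class PendingClusteringFilter {
  // Selects pending clustering instants whose target partitions intersect
  // the requested partitions. In Hudi the instant -> partitions mapping
  // would come from the clustering plans on the timeline.
  static List<String> instantsTargeting(Map<String, Set<String>> pendingInstantToPartitions,
                                        Set<String> partitions) {
    return pendingInstantToPartitions.entrySet().stream()
        .filter(e -> e.getValue().stream().anyMatch(partitions::contains))
        .map(Map.Entry::getKey)
        .sorted() // instant-time strings sort into timeline order
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, Set<String>> pending = new HashMap<>();
    pending.put("20260310T0101", new HashSet<>(Arrays.asList("2026/03/01")));
    pending.put("20260310T0202", new HashSet<>(Arrays.asList("2026/03/02")));
    // Only the first instant targets the requested partition.
    System.out.println(instantsTargeting(pending,
        Collections.singleton("2026/03/01"))); // prints [20260310T0101]
  }
}
```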

Tests:

  • Unit tests in TestHoodieWriteConfig for new config defaults, explicit enable, inference from PreferWriterConflictResolutionStrategy, and auto-adjustment of clustering update strategy.
  • Unit tests in TestBaseHoodieTableServiceClient for isClusteringInstantEligibleForRollback and getInstantsToRollback behavior with clustering instants under various conditions (config disabled, too recent, eligible, non-clustering, active vs expired heartbeat).
  • Integration tests in TestHoodieClusteringJob for getPendingClusteringInstantsForPartitions and rollbackFailedClusteringForPartitions (expired heartbeat triggers rollback, active heartbeat skips rollback).

Impact

  • Two new user-facing configurations: hoodie.clustering.enable.expirations and hoodie.clustering.expiration.time.mins.
  • When PreferWriterConflictResolutionStrategy is used, hoodie.clustering.updates.strategy is now auto-set to SparkAllowUpdateStrategy.
  • No breaking changes. The new behavior is entirely opt-in (disabled by default). Existing users who do not use PreferWriterConflictResolutionStrategy are unaffected.

Risk Level

Low. The rollback of failed clustering instants is gated behind a config that defaults to false and only activates for instants that are old enough (configurable wait time) with expired heartbeats. The auto-adjustment of the clustering update strategy only applies when PreferWriterConflictResolutionStrategy is already in use. Unit and integration tests cover the key scenarios.

Documentation Update

  • Config descriptions for hoodie.clustering.enable.expirations and hoodie.clustering.expiration.time.mins are included in the config property definitions with inline documentation.
  • The auto-adjustment of hoodie.clustering.updates.strategy when using PreferWriterConflictResolutionStrategy is logged at INFO level.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Mar 10, 2026
@nsivabalan (reviewer) left a comment:

Reviewed source code. Will review tests in next iteration

Option<HoodieClusteringPlan> clusteringPlan = table
.scheduleClustering(context, instantTime, extraMetadata);
option = clusteringPlan.map(plan -> instantTime);
if (option.isPresent() && config.isRollbackFailedClustering()) {
Reviewer:

using config.isRollbackFailedClustering() here does not sit well.

can we name something like
config.isExpirationOfClusteringEnabled()

this reads nicely and is understandable.
wdyt?

Author:

Expiration is a good name, since it's broader than "failed". Updated.

}
HoodieInstant instant = metaClient.getInstantGenerator()
.createNewInstant(HoodieInstant.State.INFLIGHT, action, instantToRollback);
if (!isClusteringInstantEligibleForRollback(metaClient, instant)
Reviewer:

Can we split this up to keep it simple, as before? The current code structure is a bit confusing.

boolean isClusteringInstant = ClusteringUtils.isClusteringInstant(
    metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
if (isClusteringInstant) {
  if (!isClusteringInstantEligibleForRollback(metaClient, instant)) {
    continue;
  }
} else {
  continue;
}

Reviewer:

Actually, instead of continue, can we add it to the list within the if condition only?

Author:

Sure, I refactored it to not need any continue and to be closer to the original approach.

.markAdvanced()
.withDocumentation("Number of heartbeat misses, before a writer is deemed not alive and all pending writes are aborted.");

public static final ConfigProperty<Boolean> ROLLBACK_FAILED_CLUSTERING = ConfigProperty
Reviewer:

Let's move this to HoodieClusteringConfig and name it hoodie.clustering.enable.expirations.

Reviewer:

Are there chances that we will enable expiration from clustering runners too, or do you folks purely need this from an ingestion-writer standpoint?

Author:

Sure, I can use that config name. In our clustering table service runner, we currently want to roll back any (potentially conflicting) clustering instant with an expired heartbeat, regardless of how recent it is. So with our internal setup/build, the clustering runner enables the config to allow rollback of failed clustering writes, but sets hoodie.clustering.expiration.time.mins to 0.

+ "client must be used to schedule, execute, and commit the clustering instant.");

public static final ConfigProperty<Long> ROLLBACK_FAILED_CLUSTERING_WAIT_MINUTES = ConfigProperty
.key("hoodie.rollback.failed.clustering.wait.minutes")
Reviewer:

How about hoodie.clustering.expiration.time.mins?

Author:

Sure, we can use that name.

.markAdvanced()
.withDocumentation("When hoodie.rollback.failed.clustering is enabled, rollbackFailedWrites will not attempt to rollback "
+ "a clustering instant unless it is at least this many minutes old. This is a temporary guardrail to reduce the chance "
+ "of transient failures from concurrent rollback attempts until https://github.com/apache/hudi/issues/18050 is resolved.");
Reviewer:

Let's not talk about 18050 here. We are going to get 18050 landed for 1.2 shortly, so why mention it in the documentation?

Author:

Oh, OK. If this change will be released with the rest in 1.2 anyway, that makes sense; let me remove that reference.

metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
}

private static boolean isInstantOldEnough(String instantTime, long waitMinutes) {
Reviewer:

hasInstantExpired

Author:

Sure, renamed.

}
}

public boolean isClusteringInstantEligibleForRollback(HoodieTableMetaClient metaClient, HoodieInstant instant) {
Reviewer:

Where are we checking for heartbeats?

Author:

In getInstantsToRollback, we first call this function, and the subsequent call to getInstantsToRollbackForLazyCleanPolicy filters out instants with an active heartbeat. Because getInstantsToRollbackForLazyCleanPolicy already takes care of the heartbeat check, I decided to avoid doing a heartbeat check plus timeline reload again beforehand in isClusteringInstantEligibleForRollback.

Reviewer:

sg

private static boolean isInstantOldEnough(String instantTime, long waitMinutes) {
try {
Date instantDate = TimelineUtils.parseDateFromInstantTime(instantTime);
long ageMs = System.currentTimeMillis() - instantDate.getTime();
Reviewer:

Let's get the time zone from hoodieTable.getMetaClient().getTableConfig().getTimelineTimezone() and calculate based on that?

Author:

Oh I see, good point: unlike the timeline, the client job might not be in UTC. Updated.


public static final ConfigProperty<Long> ROLLBACK_FAILED_CLUSTERING_WAIT_MINUTES = ConfigProperty
.key("hoodie.rollback.failed.clustering.wait.minutes")
.defaultValue(60L)
Reviewer:

Which interval does this config dictate? Is it: after the heartbeat expires, we need to wait X mins to trigger rollback? If yes, the documentation is not clear.

Author:

We wait X minutes after the requested instant is created; I updated the doc to clarify that.

@kbuci kbuci requested a review from nsivabalan March 17, 2026 06:05
@nsivabalan:

@kbuci : can you update PR description based on the change in config key

}
}

public boolean isClusteringInstantEligibleForRollback(HoodieTableMetaClient metaClient, HoodieInstant instant) {
Reviewer:

Why public?

Author:

I wanted HoodieClusteringJob to be able to use this API

Reviewer:

Can we make it static then?

HoodieInstantTimeGenerator.fixInstantTimeCompatibility(instantTime),
HoodieInstantTimeGenerator.MILLIS_INSTANT_TIME_FORMATTER);
long instantEpochMs = instantDateTime.atZone(zoneId).toInstant().toEpochMilli();
long ageMs = System.currentTimeMillis() - instantEpochMs;
Reviewer:

Should we do something like:

ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(java.time.Instant.now(),
    table.getMetaClient().getTableConfig().getTimelineTimezone().getZoneId());
long currentTimeMs = latestDateTime.toInstant().toEpochMilli();
long replaceCommitInstantTime = HoodieInstantTimeGenerator.parseDateFromInstantTime(instantTime).toInstant().toEpochMilli();
long ageMs = currentTimeMs - replaceCommitInstantTime;

Author:

Updated
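The agreed-upon timezone-aware age check can be sketched in plain Java. This is a simplified stand-in for the Hudi helper under discussion, not the actual implementation: the yyyyMMddHHmmss instant-time layout and the zone handling are assumptions based on this thread, and only the first 14 digits are parsed since seconds precision suffices for a minutes-granularity threshold.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

public class InstantAge {
  // Hudi instant times begin with yyyyMMddHHmmss (optionally followed by
  // millis); parsing the 14-digit prefix avoids millisecond-pattern quirks.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  // Interprets the instant string in the table's timeline timezone rather
  // than the JVM-local zone, then compares its age against the threshold.
  static boolean hasExpired(String instantTime, long thresholdMins, ZoneId timelineZone) {
    long instantEpochMs = LocalDateTime.parse(instantTime.substring(0, 14), FORMAT)
        .atZone(timelineZone).toInstant().toEpochMilli();
    long ageMs = Instant.now().toEpochMilli() - instantEpochMs;
    return ageMs >= TimeUnit.MINUTES.toMillis(thresholdMins);
  }
}
```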

return ageMs >= TimeUnit.MINUTES.toMillis(expirationMins);
} catch (DateTimeParseException e) {
log.warn("Could not parse instant time {}, assuming it has expired", instantTime, e);
return true;
Reviewer:

Won't this unintentionally roll back an ongoing or in-progress clustering? Should we throw instead?

Author:

Thanks for catching; fixing to make sure errors are re-thrown as needed.

.key("hoodie.clustering.updates.strategy")
.defaultValue("org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy")
.withInferFunction(cfg -> {
String strategy = cfg.getStringOrDefault(HoodieLockConfig.WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, "");
Reviewer:

Can you write some UTs for this? What does it mean to have a default value set (L245) and also an infer func (L246)?

Shouldn't only one of them take effect? If yes, then we should fix L247 to set org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy as the default.

Author:

Added a UT and removed the default value to avoid confusion.
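The resolution order the reviewer is asking about can be modeled with a minimal sketch: explicit value first, then the inferred value, then the default. The types and key names here are hypothetical stand-ins for Hudi's ConfigProperty, and note that the final revision of this PR removed the infer function, keeping only an explicit default.

```java
import java.util.Optional;
import java.util.Properties;
import java.util.function.Function;

public class InferredProperty {
  // Minimal model of a config property with a default and an optional infer
  // function. Resolution order: explicit value > inferred value > default.
  static String resolve(Properties props, String key, String defaultValue,
                        Function<Properties, Optional<String>> inferFunc) {
    String explicit = props.getProperty(key);
    if (explicit != null) {
      return explicit;
    }
    return inferFunc.apply(props).orElse(defaultValue);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("strategy.class", "PreferWriter");
    // Inferred: enable expirations when the PreferWriter strategy is set.
    String v = resolve(props, "enable.expirations", "false",
        p -> "PreferWriter".equals(p.getProperty("strategy.class"))
            ? Optional.of("true") : Optional.empty());
    System.out.println(v); // prints true
  }
}
```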

public static final ConfigProperty<Boolean> ENABLE_EXPIRATIONS = ConfigProperty
.key("hoodie.clustering.enable.expirations")
.defaultValue(false)
.withInferFunction(cfg -> {
Reviewer:

Same comment as above: let's validate whether setting both a default value and an infer func works.

Author:

removed infer func #18302 (comment)

.withInferFunction(cfg -> {
String strategy = cfg.getStringOrDefault(HoodieLockConfig.WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, "");
if (PreferWriterConflictResolutionStrategy.class.getName().equals(strategy)) {
return Option.of(true);
Reviewer:

Can we leave this false for OOB users? This is a slightly orthogonal feature, right?

Author:

Sure, let me just make this default false and not infer: even though this is intended for prefer-writer-conflict users, it's technically orthogonal to that.

metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
}

private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) {
Reviewer:

Are we not checking the heartbeat expiry, but going purely by the instant time?

Author:

I wanted to avoid redundant heartbeat checks, which is why I didn't do that here: https://github.com/apache/hudi/pull/18302/changes#r2943866147. Especially since the caller needs to reload the timeline afterward anyway, to double-check that an inflight instant with an expired heartbeat wasn't actually a completed instant. But since this is not intuitive when reading the code, I ended up abandoning that approach, and now this API also checks heartbeat status.
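The final behavior (age threshold plus heartbeat expiry) hinges on a heartbeat check, which can be sketched minimally as below. In Hudi the real check reads the heartbeat file's last modification time via the heartbeat client; the parameters here are illustrative stand-ins.

```java
import java.util.concurrent.TimeUnit;

public class HeartbeatExpiry {
  // A writer is considered dead once its last heartbeat is older than the
  // heartbeat interval multiplied by the tolerable number of missed beats.
  static boolean isExpired(long lastHeartbeatMs, long nowMs,
                           long intervalMs, int maxTolerableMisses) {
    return nowMs - lastHeartbeatMs > intervalMs * maxTolerableMisses;
  }

  public static void main(String[] args) {
    long interval = TimeUnit.SECONDS.toMillis(60);
    // Last beat 10 minutes ago with 2 tolerable misses -> expired.
    System.out.println(isExpired(0, TimeUnit.MINUTES.toMillis(10), interval, 2)); // prints true
  }
}
```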

Reviewer:

sg.

getPendingClusteringInstantsForPartitions(metaClient, partitions).stream()
.filter(instantTime -> {
HoodieInstant instant = metaClient.getInstantGenerator()
.createNewInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.CLUSTERING_ACTION, instantTime);
Reviewer:

Is it not possible to reuse the writeClient to roll back an expired clustering? I am looking to avoid duplicated code; for example, if we change the way we deduce the expiration of a clustering instant, we would have to fix it in two places.

Author:

Based on other changes, I avoided some redundant heartbeat-related checks.

metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
}

private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) {
Reviewer:

Should we check from the time the heartbeat expired whether the expiration interval has elapsed?

Author:

I didn't want to go with this approach: if the heartbeat file is cleaned up, I'm not sure there is an easy way to infer when the heartbeat actually expired.

Reviewer:

ok

@kbuci kbuci requested a review from nsivabalan March 20, 2026 06:41
@kbuci kbuci changed the title feat(table-services): Support hoodie.rollback.failed.clustering to allow cleanup of failed clustering plans (intended for PreferWriterConflictResolutionStrategy) feat(table-services): Support hoodie.clustering.enable.expirations to allow cleanup of failed clustering plans (intended for PreferWriterConflictResolutionStrategy) Mar 20, 2026
}
}

public boolean isClusteringInstantEligibleForRollback(HoodieTableMetaClient metaClient, HoodieInstant instant) {
Reviewer:

sg

}
}

public boolean isClusteringInstantEligibleForRollback(HoodieTableMetaClient metaClient, HoodieInstant instant) {
Reviewer:

Can we make it static then?

+ "client must be used to schedule, execute, and commit the clustering instant. And a clustering plan cannot be "
+ "re-attempted");

public static final ConfigProperty<Long> EXPIRATION_TIME_MINS = ConfigProperty
Reviewer:

Sorry, my bad: hoodie.clustering.expiration.threshold.mins. "Time" might give the wrong notion of an absolute value, whereas here we are referring to an interval or threshold.

metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
}

private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) {
Reviewer:

sg.

metaClient.getActiveTimeline(), instant, metaClient.getInstantGenerator());
}

private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) {
Reviewer:

ok

// --- Tests for clustering expiration logic ---

@Test
void isClusteringInstantEligibleForRollback_returnsFalseWhenConfigDisabled() throws IOException {
Reviewer:

Can we parametrize these tests for both v6 and v9? Because the requested and inflight instants have different actions in the timeline in these table versions.

// --- Tests for clustering expiration logic ---

@Test
void isClusteringInstantEligibleForRollback_returnsFalseWhenConfigDisabled() throws IOException {
Reviewer:

Similarly, for some of the tests added here, can we parametrize for v6 and v9? Let's be tactical: not parametrize all tests, but a good number of them to give us good coverage.

Krishen Bhan added 2 commits March 25, 2026 11:29
@hudi-bot:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter:

Codecov Report

❌ Patch coverage is 85.18519% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.32%. Comparing base (a179555) to head (7d5a775).
⚠️ Report is 23 commits behind head on master.

Files with missing lines Patch % Lines
...ache/hudi/client/BaseHoodieTableServiceClient.java 71.05% 8 Missing and 3 partials ⚠️
...org/apache/hudi/utilities/HoodieClusteringJob.java 96.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18302       +/-   ##
=============================================
- Coverage     69.26%   57.32%   -11.95%     
+ Complexity    27117    23484     -3633     
=============================================
  Files          2391     2433       +42     
  Lines        129572   133338     +3766     
  Branches      15366    16041      +675     
=============================================
- Hits          89746    76431    -13315     
- Misses        32969    50706    +17737     
+ Partials       6857     6201      -656     
Flag Coverage Δ
common-and-other-modules 21.77% <45.67%> (-22.61%) ⬇️
hadoop-mr-java-client 45.13% <33.92%> (-0.03%) ⬇️
spark-client-hadoop-common 48.56% <44.64%> (+0.22%) ⬆️
spark-java-tests 48.69% <37.03%> (+1.21%) ⬆️
spark-scala-tests 45.36% <18.51%> (-0.17%) ⬇️
utilities 38.56% <59.25%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...org/apache/hudi/config/HoodieClusteringConfig.java 92.79% <100.00%> (-1.80%) ⬇️
...java/org/apache/hudi/config/HoodieWriteConfig.java 88.97% <100.00%> (-0.82%) ⬇️
...org/apache/hudi/utilities/HoodieClusteringJob.java 81.94% <96.00%> (+2.95%) ⬆️
...ache/hudi/client/BaseHoodieTableServiceClient.java 73.59% <71.05%> (-0.57%) ⬇️

... and 562 files with indirect coverage changes

