Core: Add pluggable BackoffStrategy SPI for Tasks retries #16437
Open
wombatu-kun wants to merge 3 commits into
This PR isn't the cause of the Kafka Connect CI failure. I've fixed the Kafka Connect integration test flakiness in a separate PR, #16438.
Hi @nastra @jackye1995 @singhpk234
Retry backoff was a single hardcoded exponential-plus-jitter formula inlined in Tasks.runTaskWithRetry, with no way to substitute another algorithm. This makes the backoff a pluggable SPI (org.apache.iceberg.util.BackoffStrategy) while keeping the existing behavior as the default.

ExponentialBackoffStrategy reproduces the previous formula and jitter exactly. Tasks.Builder.exponentialBackoff(...) now installs that default and a new backoffStrategy(...) overrides it; a null strategy is a no-op, so behavior is byte-for-byte unchanged unless a strategy is configured.

A strategy is selected via one property, retry.strategy-impl, loaded by BackoffStrategies via the standard DynConstructors no-arg + initialize(Map) pattern.

Every production .exponentialBackoff(...) site now routes through the SPI and is selectable from its local properties (commits, REST handlers, metadata refresh, status checks, in-memory/DynamoDB locks, Hive Metastore lock via Hadoop conf, REST scan poll, Spark cleanup via FileIO.properties()). OAuth2 token refresh has no usable properties map and stays on the default, still SPI-routed.

Adds TestBackoffStrategies, TestBackoffStrategyCommit, and TestTasks coverage.

A follow-up will add an AIMD strategy on top of this SPI.

Part of apache#7528

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
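The no-arg constructor + initialize(Map) loading pattern described in this commit can be sketched roughly as below. This is a minimal, self-contained illustration and not the PR's code: the interface shape (`initialize`, `delayMs`), the `FixedBackoff` class, the `retry.fixed-delay-ms` key, and the choice to return null when nothing is configured are all assumptions, and plain reflection stands in for `DynConstructors`. Only the property name `retry.strategy-impl` comes from the commit message.

```java
import java.util.Map;

final class StrategyLoaderSketch {
  // Property name taken from the PR; everything else in this class is hypothetical.
  static final String STRATEGY_IMPL = "retry.strategy-impl";

  // Assumed shape of the SPI; the real method names and signatures may differ.
  interface BackoffStrategy {
    // Called once after no-arg construction with the properties map.
    void initialize(Map<String, String> properties);

    // Milliseconds to wait before retrying after the given 1-based attempt.
    long delayMs(int attempt);
  }

  // Example custom strategy: a fixed delay read from a hypothetical property.
  static final class FixedBackoff implements BackoffStrategy {
    private long fixedMs = 100L;

    @Override
    public void initialize(Map<String, String> properties) {
      fixedMs = Long.parseLong(properties.getOrDefault("retry.fixed-delay-ms", "100"));
    }

    @Override
    public long delayMs(int attempt) {
      return fixedMs;
    }
  }

  private StrategyLoaderSketch() {}

  // Returns null when no strategy is configured (an assumption of this sketch),
  // so the caller can keep its default.
  static BackoffStrategy load(Map<String, String> properties) {
    if (properties == null || !properties.containsKey(STRATEGY_IMPL)) {
      return null;
    }
    String impl = properties.get(STRATEGY_IMPL);
    Object instance;
    try {
      // Plain reflection stands in for DynConstructors here.
      instance = Class.forName(impl).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate backoff strategy: " + impl, e);
    }
    if (!(instance instanceof BackoffStrategy)) {
      throw new IllegalArgumentException(impl + " does not implement BackoffStrategy");
    }
    BackoffStrategy strategy = (BackoffStrategy) instance;
    strategy.initialize(properties); // exceptions from initialize propagate to the caller
    return strategy;
  }
}
```

In this sketch, unknown class names, a missing no-arg constructor, and non-strategy classes all surface as `IllegalArgumentException`; the exception types the real loader uses are not stated in this PR text.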
Tasks.Builder.exponentialBackoff(min, max, totalTimeout, scale) was carrying two unrelated concerns: the per-attempt formula parameters (min/max/scale) and the total retry budget (totalTimeout). With the BackoffStrategy SPI now mediating the per-attempt wait, every wired call site read as ".exponentialBackoff(...).backoffStrategy(...)", where the first call's formula args became dead weight whenever the second call's strategy actually won. The internal field maxDurationMs and the method parameter backoffMaxRetryTimeMs also diverged from the operator-facing property commit.retry.total-timeout-ms.

Splits the method into Tasks.Builder.totalTimeoutMs(long) (the retry budget) and Tasks.Builder.backoffStrategy(BackoffStrategy) (the per-attempt wait formula). The internal field is renamed from maxDurationMs to totalTimeoutMs to match the operator-facing name.

A new BackoffStrategies.from(Map, long, long, double) overload preserves operator-tunable formula defaults: it returns the custom strategy when retry.strategy-impl is set, otherwise a fresh ExponentialBackoffStrategy with the provided min/max/scale. Builder.backoffStrategy(null) is now a true no-op, the Builder field is initialized to a non-null default, and the in-Tasks fallback construction path is gone.

Every production call site is migrated. MetastoreLock's two sites with different scales (1.5 for lock-acquire, 2.0 for lock-creation) are preserved via two separate strategy fields. OAuth2Util, which has no properties map at its call site, passes null to the new helper.

Revapi accepts the removal against the 1.11.0 baseline with a justification that the behavior is fully recoverable via the new two-call form.

Part of apache#7528

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
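The fallback behavior of the four-argument helper in this commit can be illustrated with a small stand-alone sketch. It is not the PR's implementation: the one-method interface, the class names, and the `ConstantBackoff` example are hypothetical, and the formula (min × scale^(attempt − 1), capped at max, plus up to 10% random jitter) is an assumption about what "exponential-plus-jitter" means here, since the exact constants are not quoted in this PR text.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class BackoffFromSketch {
  // Hypothetical one-method view of the SPI; the real signature is not shown in this PR text.
  interface BackoffStrategy {
    long delayMs(int attempt); // attempt is 1-based
  }

  // Assumed formula: min * scale^(attempt - 1), capped at max, plus up to 10% jitter.
  static final class ExponentialBackoff implements BackoffStrategy {
    private final long minMs;
    private final long maxMs;
    private final double scale;

    ExponentialBackoff(long minMs, long maxMs, double scale) {
      this.minMs = minMs;
      this.maxMs = maxMs;
      this.scale = scale;
    }

    @Override
    public long delayMs(int attempt) {
      long base = (long) Math.min(minMs * Math.pow(scale, attempt - 1), maxMs);
      long jitterBound = Math.max(1L, (long) (base * 0.1));
      return base + ThreadLocalRandom.current().nextLong(jitterBound);
    }
  }

  // Custom strategy used only to demonstrate the override path.
  static final class ConstantBackoff implements BackoffStrategy {
    @Override
    public long delayMs(int attempt) {
      return 42L;
    }
  }

  private BackoffFromSketch() {}

  // Custom strategy when retry.strategy-impl is set; otherwise a fresh exponential
  // default built from the caller's min/max/scale. Null properties take the default path.
  static BackoffStrategy from(Map<String, String> props, long minMs, long maxMs, double scale) {
    String impl = props == null ? null : props.get("retry.strategy-impl");
    if (impl == null || impl.isEmpty()) {
      return new ExponentialBackoff(minMs, maxMs, scale);
    }
    try {
      return (BackoffStrategy) Class.forName(impl).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load backoff strategy: " + impl, e);
    }
  }
}
```

Under this shape, a caller with two different scales, like the MetastoreLock sites mentioned above, would simply call `from(props, min, max, 1.5)` and `from(props, min, max, 2.0)` and keep the two results in separate fields.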
The class-level javadoc referenced Tasks.Builder#exponentialBackoff, which was removed in the previous commit when the method was split into totalTimeoutMs and backoffStrategy. The broken link caused :iceberg-core:javadoc to fail on CI.

Updates the javadoc to point at BackoffStrategies.from(Map, long, long, double), the helper that call sites use to resolve the strategy, and at the ExponentialBackoffStrategy default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4c7f13e to f1fc5f1
Part of #7528.
Motivation
Retry backoff in `Tasks.runTaskWithRetry` was a single hardcoded exponential-plus-jitter formula with no way to substitute another algorithm.

What this PR does
SPI introduction. Introduces a pluggable SPI, `org.apache.iceberg.util.BackoffStrategy`, selected via a single property, `retry.strategy-impl`, and loaded reflectively via the standard `DynConstructors` no-arg constructor + `initialize(Map)` pattern that `catalog-impl`, `io-impl`, and `lock-impl` already use. The new `ExponentialBackoffStrategy` reproduces the previous inline formula and jitter exactly, and is the default whenever no strategy is configured, so existing callers see byte-for-byte identical per-attempt wait times.

Every production `.exponentialBackoff(...)` site that had access to a properties map is wired through the SPI: commits, REST catalog handlers, metadata refresh, status checks, in-memory and DynamoDB lock managers, the Hive Metastore lock (via Hadoop `Configuration`), REST scan polling, and Spark cleanup via `FileIO.properties()`. OAuth2 token refresh has no usable properties map at its call site, so it uses an explicitly constructed default.

API cleanup.
`Tasks.Builder.exponentialBackoff(long, long, long, double)` was carrying two unrelated concerns: per-attempt formula parameters (min/max/scale) and the total retry budget (totalTimeout). The new SPI made the conflation visible: at every wired site the formula args became dead weight whenever the configured strategy actually won. The method is replaced by two: `Tasks.Builder.totalTimeoutMs(long)` (the retry budget, aligned with the operator-facing property name `commit.retry.total-timeout-ms`) and `Tasks.Builder.backoffStrategy(BackoffStrategy)` (the per-attempt wait formula). A new `BackoffStrategies.from(Map, long, long, double)` overload preserves operator-tunable formula defaults: it returns the custom strategy when `retry.strategy-impl` is set, otherwise a fresh `ExponentialBackoffStrategy` with the provided params. The internal field `maxDurationMs` is renamed `totalTimeoutMs` to match the same operator-facing name.

A follow-up PR will add an AIMD strategy on top of this SPI without further core changes. The implementation is already prepared on branch
`issue/7528-aimd-backoff-strategy` (two commits: an `onSuccess()` callback added to `BackoffStrategy` plus `Tasks` wiring, then `AIMDBackoffStrategy` with tests and docs) and will be filed against `main` once this PR is merged.

Compatibility
For the SPI surface, behavior is byte-for-byte unchanged unless `retry.strategy-impl` is set: the default fallback uses the same exponential-plus-jitter formula with the same parameters as before.

For the builder API,
`Tasks.Builder.exponentialBackoff(long, long, long, double)` is removed. The behavior is fully recoverable via the new two-call form `.totalTimeoutMs(total).backoffStrategy(BackoffStrategies.from(props, min, max, scale))`. The revapi exception against the 1.11.0 baseline documents the removal and points at the replacement.

Tests
`TestBackoffStrategies` covers the loader: happy path with `initialize` invocation, null/empty/missing-key returns, unknown class name, missing no-arg constructor, non-strategy class, `initialize` exception propagation, and the new 4-arg `from(Map, long, long, double)` overload across three branches (custom strategy from properties, default exponential, null properties).

`TestTasks` covers Tasks integration: custom strategy is used, `null` is a no-op, the returned milliseconds actually drive `Thread.sleep` (timing-verified), and a single strategy instance is shared across parallel worker threads.

`TestBackoffStrategyCommit` exercises end-to-end commit retry through the SPI across all four format versions.
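To make the "returned milliseconds actually drive `Thread.sleep`" check concrete, here is a minimal sketch of a retry loop that keeps the total retry budget separate from the per-attempt wait, plus a recording strategy of the kind such a test might use. It is an illustration only: `runWithRetry`, `RecordingStrategy`, the one-method interface, and the exact budget check are hypothetical and are not the `Tasks` implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

final class RetryLoopSketch {
  // Hypothetical one-method view of the SPI.
  interface BackoffStrategy {
    long delayMs(int attempt); // attempt is 1-based
  }

  // Records every attempt it is asked about and always returns the same delay.
  // The list is synchronized so one instance can be shared across worker threads.
  static final class RecordingStrategy implements BackoffStrategy {
    final List<Integer> attempts = Collections.synchronizedList(new ArrayList<>());
    private final long fixedMs;

    RecordingStrategy(long fixedMs) {
      this.fixedMs = fixedMs;
    }

    @Override
    public long delayMs(int attempt) {
      attempts.add(attempt);
      return fixedMs;
    }
  }

  private RetryLoopSketch() {}

  // Retries until the task succeeds, maxAttempts is reached, or the elapsed time
  // exceeds totalTimeoutMs (the budget check is an assumption of this sketch).
  // The strategy alone decides how long each wait is.
  static <T> T runWithRetry(
      Callable<T> task, int maxAttempts, long totalTimeoutMs, BackoffStrategy strategy)
      throws Exception {
    long startNanos = System.nanoTime();
    int attempt = 0;
    while (true) {
      attempt += 1;
      try {
        return task.call();
      } catch (Exception e) {
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
        if (attempt >= maxAttempts || elapsedMs > totalTimeoutMs) {
          throw e; // budget exhausted: surface the last failure
        }
        Thread.sleep(strategy.delayMs(attempt));
      }
    }
  }
}
```

A timing-based test against this sketch would run a task that fails twice before succeeding, then check both that the strategy saw attempts 1 and 2 and that the elapsed wall-clock time is at least roughly the sum of the returned delays.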