Skip to content

Core: Add pluggable BackoffStrategy SPI for Tasks retries#16437

Open
wombatu-kun wants to merge 3 commits into
apache:mainfrom
wombatu-kun:issue/7528-pluggable-backoff-spi
Open

Core: Add pluggable BackoffStrategy SPI for Tasks retries#16437
wombatu-kun wants to merge 3 commits into
apache:mainfrom
wombatu-kun:issue/7528-pluggable-backoff-spi

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

@wombatu-kun wombatu-kun commented May 20, 2026

Part of #7528.

Motivation

Retry backoff in Tasks.runTaskWithRetry was a single hardcoded exponential-plus-jitter formula with no way to substitute another algorithm.

What this PR does

SPI introduction. Introduces a pluggable SPI org.apache.iceberg.util.BackoffStrategy selected via a single property retry.strategy-impl, loaded reflectively via the standard DynConstructors no-arg constructor + initialize(Map) pattern that catalog-impl, io-impl, and lock-impl already use. The new ExponentialBackoffStrategy reproduces the previous inline formula and jitter exactly, and is the default whenever no strategy is configured — so existing callers see byte-for-byte identical per-attempt wait times.

Every production .exponentialBackoff(...) site that had access to a properties map is wired through the SPI: commits, REST catalog handlers, metadata refresh, status checks, in-memory and DynamoDB lock managers, the Hive Metastore lock (via Hadoop Configuration), REST scan polling, and Spark cleanup via FileIO.properties(). OAuth2 token refresh has no usable properties map at its call site, so it uses an explicitly-constructed default.

API cleanup. Tasks.Builder.exponentialBackoff(long, long, long, double) was carrying two unrelated concerns: per-attempt formula parameters (min/max/scale) and the total retry budget (totalTimeout). The new SPI made the conflation visible — at every wired site the formula args became dead weight whenever the configured strategy actually won. The method is replaced by two: Tasks.Builder.totalTimeoutMs(long) (the retry budget — aligned with the operator-facing property name commit.retry.total-timeout-ms) and Tasks.Builder.backoffStrategy(BackoffStrategy) (the per-attempt wait formula). A new BackoffStrategies.from(Map, long, long, double) overload preserves operator-tunable formula defaults: it returns the custom strategy when retry.strategy-impl is set, otherwise a fresh ExponentialBackoffStrategy with the provided params. The internal field maxDurationMs is renamed totalTimeoutMs to match the same operator-facing name.

A follow-up PR will add an AIMD strategy on top of this SPI without further core changes. The implementation is already prepared on branch issue/7528-aimd-backoff-strategy (two commits: an onSuccess() callback added to BackoffStrategy plus Tasks wiring, then AIMDBackoffStrategy with tests and docs) and will be filed against main once this PR is merged.

Compatibility

For the SPI surface, behavior is byte-for-byte unchanged unless retry.strategy-impl is set — the default fallback uses the same exponential-plus-jitter formula with the same parameters as before.

For the builder API, Tasks.Builder.exponentialBackoff(long, long, long, double) is removed. The behavior is fully recoverable via the new two-call form .totalTimeoutMs(total).backoffStrategy(BackoffStrategies.from(props, min, max, scale)). The revapi exception against the 1.11.0 baseline documents the removal and points at the replacement.

Tests

TestBackoffStrategies covers the loader: happy path with initialize invocation, null/empty/missing-key returns, unknown class name, missing no-arg constructor, non-strategy class, initialize exception propagation, and the new 4-arg from(Map, long, long, double) overload across three branches (custom strategy from properties, default exponential, null properties). TestTasks covers Tasks integration: custom strategy is used, null is a no-op, the returned milliseconds actually drive Thread.sleep (timing-verified), and a single strategy instance is shared across parallel worker threads. TestBackoffStrategyCommit exercises end-to-end commit retry through the SPI across all four format versions.

@wombatu-kun wombatu-kun marked this pull request as draft May 20, 2026 03:00
@wombatu-kun wombatu-kun marked this pull request as ready for review May 20, 2026 05:33
@wombatu-kun
Copy link
Copy Markdown
Contributor Author

PR isn't the cause of Kafka Connect CI failure. I've fixed Kafka Connect integration test flakiness in separate PR #16438

@wombatu-kun
Copy link
Copy Markdown
Contributor Author

Hi @nastra @jackye1995 @singhpk234
This is ready for review when someone has a moment. Happy to split it up or adjust the approach based on feedback.

Vova Kolmakov and others added 3 commits May 20, 2026 19:49
Retry backoff was a single hardcoded exponential-plus-jitter formula inlined in Tasks.runTaskWithRetry, with no way to substitute another algorithm. This makes the backoff a pluggable SPI (org.apache.iceberg.util.BackoffStrategy) while keeping the existing behavior as the default.

ExponentialBackoffStrategy reproduces the previous formula and jitter exactly. Tasks.Builder.exponentialBackoff(...) now installs that default and a new backoffStrategy(...) overrides it; a null strategy is a no-op, so behavior is byte-for-byte unchanged unless a strategy is configured. A strategy is selected via one property, retry.strategy-impl, loaded by BackoffStrategies via the standard DynConstructors no-arg + initialize(Map) pattern. Every production .exponentialBackoff(...) site now routes through the SPI and is selectable from its local properties (commits, REST handlers, metadata refresh, status checks, in-memory/DynamoDB locks, Hive Metastore lock via Hadoop conf, REST scan poll, Spark cleanup via FileIO.properties()). OAuth2 token refresh has no usable properties map and stays on the default, still SPI-routed.

Adds TestBackoffStrategies, TestBackoffStrategyCommit, and TestTasks coverage. A follow-up will add an AIMD strategy on top of this SPI.

Part of apache#7528

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tasks.Builder.exponentialBackoff(min, max, totalTimeout, scale) was carrying two unrelated concerns: the per-attempt formula parameters (min/max/scale) and the total retry budget (totalTimeout). With the BackoffStrategy SPI now mediating the per-attempt wait, every wired call site read as ".exponentialBackoff(...).backoffStrategy(...)" where the first call's formula args became dead weight whenever the second call's strategy actually won. The internal field maxDurationMs and method parameter backoffMaxRetryTimeMs also diverged from the operator-facing property commit.retry.total-timeout-ms.

Splits the method into Tasks.Builder.totalTimeoutMs(long) (the retry budget) and Tasks.Builder.backoffStrategy(BackoffStrategy) (the per-attempt wait formula). The internal field is renamed maxDurationMs to totalTimeoutMs to match the operator-facing name. A new BackoffStrategies.from(Map, long, long, double) overload preserves operator-tunable formula defaults: it returns the custom strategy when retry.strategy-impl is set, otherwise a fresh ExponentialBackoffStrategy with the provided min/max/scale. Builder.backoffStrategy(null) is now a true no-op, the Builder field is initialized to a non-null default, and the in-Tasks fallback construction path is gone.

Every production call site is migrated. MetastoreLock's two sites with different scales (1.5 for lock-acquire, 2.0 for lock-creation) are preserved via two separate strategy fields. OAuth2Util, which has no properties map at its call site, passes null to the new helper.

Revapi accepts the removal against the 1.11.0 baseline with a justification that the behavior is fully recoverable via the new two-call form.

Part of apache#7528

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The class-level javadoc referenced Tasks.Builder#exponentialBackoff, which was removed in the previous commit when the method was split into totalTimeoutMs and backoffStrategy. The broken link caused :iceberg-core:javadoc to fail on CI. Updates the javadoc to point at BackoffStrategies.from(Map, long, long, double) — the helper call sites use to resolve the strategy — and at the ExponentialBackoffStrategy default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the issue/7528-pluggable-backoff-spi branch from 4c7f13e to f1fc5f1 Compare May 20, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant