Flink: Fix flaky TestMonitorSource.testStateRestore by wombatu-kun · Pull Request #16548 · apache/iceberg

wombatu-kun · 2026-05-24T04:21:31Z

Summary

TestMonitorSource.testStateRestore (the testStateRestore(File, ClusterClient) variant in the Flink v2.0/v2.1 trees) intermittently fails with a TimeoutException from CollectingSink.poll (example CI run). The timeout only means the sink queue stayed empty for 5s; the real cause is a savepoint-completion race in the shared test helper, not slow startup.

OperatorTestBase.closeJobClient(JobClient, File) discarded the CompletableFuture<String> returned by stopWithSavepoint and instead waited for the savepoint directory to appear on disk. That directory is created early in the savepoint process, before the _metadata/state files finish writing. Phase 2 of the test then restores from that path via clusterClient.submitJob; when the restore races savepoint completion, the restored job never comes up and emits nothing, so the poll times out.
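
To make the race concrete, here is a minimal sketch that uses only the JDK and no Flink classes. `writeSavepoint` is a hypothetical stand-in for the savepoint writer, and the two latches exist only to make the ordering deterministic. It assumes, as described above, that the directory becomes visible before `_metadata` is written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class SavepointRaceSketch {
  /** Hypothetical savepoint writer: the directory appears first, _metadata last. */
  static CompletableFuture<String> writeSavepoint(
      Path dir, CountDownLatch dirCreated, CountDownLatch mayFinish) {
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            Files.createDirectories(dir);
            dirCreated.countDown(); // the directory is now visible on disk
            mayFinish.await(); // simulate the remaining write work
            Files.writeString(dir.resolve("_metadata"), "state");
            return dir.toString();
          } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
          }
        });
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("sp-root").resolve("savepoint-1");
    CountDownLatch dirCreated = new CountDownLatch(1);
    CountDownLatch mayFinish = new CountDownLatch(1);
    CompletableFuture<String> future = writeSavepoint(dir, dirCreated, mayFinish);

    // What a "wait for the directory" check observes: the directory exists,
    // but the savepoint is not yet restorable.
    dirCreated.await();
    System.out.println("directory exists: " + Files.exists(dir));
    System.out.println("_metadata exists: " + Files.exists(dir.resolve("_metadata")));

    // Only the future signals that the savepoint is complete.
    mayFinish.countDown();
    String path = future.get();
    System.out.println("_metadata exists after get(): " + Files.exists(Path.of(path, "_metadata")));
  }
}
```

A restore that starts between the two checks sees a directory without usable metadata, which is the window the old helper left open.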

What changed

OperatorTestBase.closeJobClient now awaits the stopWithSavepoint(...) future and returns the path it resolves to, so the savepoint is guaranteed to be fully written before any job restores from it. This mirrors the existing idiom in TestIcebergSourceFailover.testBoundedWithSavepoint, which awaits the savepoint future with .get(). The only caller that passes a non-null savepoint directory is TestMonitorSource.testStateRestore, so the change is scoped to this test.

As a backstop for restored-job startup latency on busy CI, the Phase 2 poll in testStateRestore is raised from 5s to 30s. The assertion stays strict: a genuine re-read emits a non-empty event quickly and still fails fast, so the longer timeout only extends the wait for the (correct) first event. This mirrors the "deterministic fix + generous timeout backstop" pattern used when TestIcebergSourceFailover was de-flaked.
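
The shape of both changes can be sketched without Flink on the classpath. Everything below is illustrative, not the code in this PR: `StoppableJob` is a hypothetical stand-in for the one JobClient method involved, the sketch covers only the non-null savepoint-directory path, and `poll` is an assumed approximation of how a collecting sink might turn an empty queue into a TimeoutException.

```java
import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AwaitSavepointSketch {
  /** Hypothetical stand-in for the part of a job client this sketch needs. */
  interface StoppableJob {
    CompletableFuture<String> stopWithSavepoint(String savepointDir);
  }

  /** Awaits the savepoint future and returns the resolved path instead of polling the disk. */
  static String closeJobClient(StoppableJob job, String savepointDir) throws Exception {
    // get() returns only once the savepoint has completed (or failed with an exception).
    return job.stopWithSavepoint(savepointDir).get();
  }

  /** Assumed sink behaviour: an empty queue for the whole timeout becomes a TimeoutException. */
  static <T> T poll(BlockingQueue<T> queue, Duration timeout)
      throws InterruptedException, TimeoutException {
    T element = queue.poll(timeout.toMillis(), TimeUnit.MILLISECONDS);
    if (element == null) {
      throw new TimeoutException("No element within " + timeout);
    }
    return element;
  }

  public static void main(String[] args) throws Exception {
    // Fake job: the future resolves asynchronously to the final savepoint path.
    StoppableJob job = dir -> CompletableFuture.supplyAsync(() -> dir + "/savepoint-1");
    String savepointPath = closeJobClient(job, "/tmp/savepoints");
    System.out.println("restore from: " + savepointPath);

    // A generous timeout costs nothing when the event arrives promptly.
    BlockingQueue<String> sinkQueue = new LinkedBlockingQueue<>();
    sinkQueue.add("first-event");
    System.out.println(poll(sinkQueue, Duration.ofSeconds(30)));
  }
}
```

Because `poll` returns as soon as an element is available, raising the timeout affects only the failure path. A passing run takes no longer than before.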

Both changes are applied identically to the Flink v2.0 and v2.1 trees. The v1.20 variant uses env.executeAsync rather than clusterClient.submitJob and is not affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Commit fe70f2d: Flink: Fix flaky TestMonitorSource.testStateRestore

github-actions bot added the flink label on May 24, 2026

wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore