Skip to content

Flink: Fix flaky TestMonitorSource.testStateRestore#16548

Open
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore
Open

Flink: Fix flaky TestMonitorSource.testStateRestore#16548
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

Closes #16546.

Summary

TestMonitorSource.testStateRestore (the testStateRestore(File, ClusterClient) variant in the Flink v2.0/v2.1 trees) intermittently fails with a TimeoutException from CollectingSink.poll (example CI run). The timeout only means the sink queue stayed empty for 5s; the real cause is a savepoint-completion race in the shared test helper, not slow startup.

OperatorTestBase.closeJobClient(JobClient, File) discarded the CompletableFuture<String> returned by stopWithSavepoint and instead waited for the savepoint directory to appear on disk. That directory is created early in the savepoint process, before the _metadata/state files finish writing. Phase 2 of the test then restores from that path via clusterClient.submitJob; when the restore races savepoint completion, the restored job never comes up and emits nothing, so the poll times out.

What changed

  • OperatorTestBase.closeJobClient now awaits the stopWithSavepoint(...) future and returns the path it resolves to, so the savepoint is guaranteed to be fully written before any job restores from it. This mirrors the existing idiom in TestIcebergSourceFailover.testBoundedWithSavepoint, which awaits the savepoint future with .get(). The only caller that passes a non-null savepoint directory is TestMonitorSource.testStateRestore, so the change is scoped to this test.
  • As a backstop for restored-job startup latency on busy CI, the Phase 2 poll in testStateRestore is raised from 5s to 30s. The assertion stays strict — a genuine re-read emits a non-empty event quickly and still fails fast — so the longer timeout only extends the wait for the (correct) first event, mirroring the "deterministic fix + generous timeout backstop" pattern used when TestIcebergSourceFailover was de-flaked.

Both changes are applied identically to the Flink v2.0 and v2.1 trees. The v1.20 variant uses env.executeAsync rather than clusterClient.submitJob and is not affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the flink label May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test TestMonitorSource > testStateRestore

1 participant