Flink: Fix flaky TestMonitorSource.testStateRestore #16548
Open
wombatu-kun wants to merge 1 commit into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #16546.
Summary
TestMonitorSource.testStateRestore (the testStateRestore(File, ClusterClient) variant in the Flink v2.0/v2.1 trees) intermittently fails with a TimeoutException from CollectingSink.poll (example CI run). The timeout only means the sink queue stayed empty for 5s; the real cause is a savepoint-completion race in the shared test helper, not slow startup.
OperatorTestBase.closeJobClient(JobClient, File) discarded the CompletableFuture<String> returned by stopWithSavepoint and instead waited for the savepoint directory to appear on disk. That directory is created early in the savepoint process, before the _metadata/state files finish writing. Phase 2 of the test then restores from that path via clusterClient.submitJob; when the restore races savepoint completion, the restored job never comes up and emits nothing, so the poll times out.
What changed
OperatorTestBase.closeJobClient now awaits the stopWithSavepoint(...) future and returns the path it resolves to, so the savepoint is guaranteed to be fully written before any job restores from it. This mirrors the existing idiom in TestIcebergSourceFailover.testBoundedWithSavepoint, which awaits the savepoint future with .get(). The only caller that passes a non-null savepoint directory is TestMonitorSource.testStateRestore, so the change is scoped to this test.
The poll timeout in testStateRestore is raised from 5s to 30s. The assertion stays strict: a genuine re-read emits a non-empty event quickly and still fails fast, so the longer timeout only extends the wait for the (correct) first event, mirroring the "deterministic fix + generous timeout backstop" pattern used when TestIcebergSourceFailover was de-flaked.
Both changes are applied identically to the Flink v2.0 and v2.1 trees. The v1.20 variant uses env.executeAsync rather than clusterClient.submitJob and is not affected.
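
For reviewers who want to see the race in isolation, here is a minimal, self-contained sketch that uses only the JDK. It is not the code in this PR: fakeStopWithSavepoint is a hypothetical stand-in for stopWithSavepoint, and the latches exist only to make the ordering deterministic. It assumes, as described above, that the savepoint directory appears before _metadata is written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

final class SavepointRaceSketch {

  // Hypothetical stand-in for stopWithSavepoint: the directory is created first,
  // and _metadata is written only after `mayFinish` has been released.
  static CompletableFuture<String> fakeStopWithSavepoint(
      Path dir, CountDownLatch dirCreated, CountDownLatch mayFinish) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        Files.createDirectories(dir);
        dirCreated.countDown();
        mayFinish.await();
        Files.writeString(dir.resolve("_metadata"), "state");
        return dir.toString();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    });
  }

  static Path newSavepointDir() {
    try {
      return Files.createTempDirectory("savepoint-sketch").resolve("savepoint-1");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Old approach: treat "directory exists" as "savepoint is complete".
  static boolean metadataPresentWhenDirectoryAppears() {
    Path dir = newSavepointDir();
    CountDownLatch dirCreated = new CountDownLatch(1);
    CountDownLatch mayFinish = new CountDownLatch(1);
    CompletableFuture<String> future = fakeStopWithSavepoint(dir, dirCreated, mayFinish);
    try {
      dirCreated.await(); // the directory is on disk, the writer is still busy
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    boolean present = Files.exists(dir.resolve("_metadata"));
    mayFinish.countDown(); // let the writer finish so nothing is left hanging
    future.join();
    return present;
  }

  // New approach: wait on the future and use the path it resolves to.
  static boolean metadataPresentWhenFutureCompletes() {
    Path dir = newSavepointDir();
    CountDownLatch dirCreated = new CountDownLatch(1);
    CountDownLatch mayFinish = new CountDownLatch(0); // writer is not held back
    String path = fakeStopWithSavepoint(dir, dirCreated, mayFinish).join();
    return Files.exists(Path.of(path).resolve("_metadata"));
  }

  public static void main(String[] args) {
    System.out.println("directory poll sees _metadata: " + metadataPresentWhenDirectoryAppears());
    System.out.println("awaited future sees _metadata: " + metadataPresentWhenFutureCompletes());
  }
}
```

In this sketch the directory check observes a savepoint with no _metadata, while the awaited future only yields a path once the file exists. That is the ordering guarantee the closeJobClient change relies on.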