[FLINK-40070][tests] Fix flaky DynamicParameterITCase reading rolled JobManager logs #28636
Open
MartijnVisser wants to merge 1 commit into
…JobManager logs
The distribution log4j configuration rolls the log file on startup, so the JobManager startup banner frequently lands in a rolled .log.N file that FlinkDistribution.searchAllLogs skips; the test then either spins unboundedly waiting for the banner (multi-hour e2e_4 hang) or parses a half-written arguments block ("Missing required option: c"). Search rolled logs for the startup banner and bound the wait so a missing banner fails fast.
Generated-by: Claude Opus 4.8 (1M context)
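The rolled-log search described in the commit message can be illustrated with a minimal, standalone sketch. It is not the actual FlinkDistribution code: the class name RolledLogSearch, the method name findLines, and the file-name patterns (`*.log` for the live file, `*.log.N` for rolled files) are assumptions made for illustration. Only the idea of an includeRolledLogs flag, with the shorter overload delegating with false, comes from this PR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a log search helper; not Flink's real API.
class RolledLogSearch {

    // Assumed naming scheme: "x.log" is the live file, "x.log.1" is a rolled file.
    private static final Pattern LIVE_LOG = Pattern.compile(".*\\.log");
    private static final Pattern ROLLED_LOG = Pattern.compile(".*\\.log\\.\\d+");

    /** Returns every line matching linePattern, optionally including rolled files. */
    static List<String> findLines(Path logDir, Pattern linePattern, boolean includeRolledLogs)
            throws IOException {
        List<Path> candidates;
        try (Stream<Path> files = Files.list(logDir)) {
            candidates = files.sorted().collect(Collectors.toList());
        }
        List<String> matches = new ArrayList<>();
        for (Path file : candidates) {
            String name = file.getFileName().toString();
            boolean wanted = LIVE_LOG.matcher(name).matches()
                    || (includeRolledLogs && ROLLED_LOG.matcher(name).matches());
            if (!wanted || !Files.isRegularFile(file)) {
                continue;
            }
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (linePattern.matcher(line).find()) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    /** Existing behavior: rolled files are skipped, so other callers see no change. */
    static List<String> findLines(Path logDir, Pattern linePattern) throws IOException {
        return findLines(logDir, linePattern, false);
    }
}
```

With a banner that was rotated into a `.log.1` file, the two-argument call finds nothing, while passing true for includeRolledLogs finds it. That asymmetry is the failure mode the commit message describes.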
What is the purpose of the change
Fixes flaky DynamicParameterITCase runs that either hang the whole e2e_4 leg for hours or fail with "Missing required option: c" (Azure builds 75992, 76627). The distribution's log4j configuration rolls the log file on startup (OnStartupTriggeringPolicy), so the JobManager startup banner (program arguments, classpath) frequently lands in a rolled .log.N file, which FlinkDistribution.searchAllLogs deliberately skips. The test then either spins unboundedly waiting for a banner that never appears in the live .log, or parses a half-written arguments block.
Brief change log
FlinkDistribution.searchAllLogs: add an overload with an includeRolledLogs flag; the existing 2-arg method delegates with false, so all other callers are unchanged.
DynamicParameterITCase: search rolled logs for the startup banner, and bound the readiness wait with CommonTestUtils.waitUtil (1 minute) so a missing banner fails fast with a clear message instead of hanging until the CI watchdog kills the leg. The "Classpath:" line is logged after the program arguments, so its presence guarantees the complete block has been flushed before parsing.
Verifying this change
This change is already covered by existing tests (DynamicParameterITCase, all parameterizations green locally). The hang needs the on-startup log rotation to move the banner into a rolled file, which depends on prior runs' log state on the CI machine and is not deterministically reproducible locally.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.8 (1M context)
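For readers unfamiliar with the bounded readiness wait mentioned in the change log, the following is a rough, self-contained sketch of the pattern. It does not use or reproduce CommonTestUtils.waitUtil: the class BoundedWait, the method waitUntil, and the pollInterval parameter are hypothetical names chosen for this example. The point is only that polling against a deadline turns a missing banner into a prompt TimeoutException with a descriptive message rather than an unbounded spin.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating a deadline-bounded poll; not Flink's utility.
class BoundedWait {

    /**
     * Polls condition until it returns true, sleeping pollInterval between
     * attempts, and throws TimeoutException once timeout has elapsed.
     */
    static void waitUntil(
            BooleanSupplier condition,
            Duration timeout,
            Duration pollInterval,
            String errorMessage)
            throws TimeoutException, InterruptedException {
        // System.nanoTime is monotonic, so wall-clock adjustments cannot extend the wait.
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(errorMessage);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

In a test like the one described here, the condition would be "the Classpath: line is present in some JobManager log", the timeout would be one minute, and the error message would say that the startup banner was not found.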