
[FLINK-40070][tests] Fix flaky DynamicParameterITCase reading rolled JobManager logs #28636

Open
MartijnVisser wants to merge 1 commit into apache:master from MartijnVisser:fix-dynamicparameter-itcase-rolled-logs


@MartijnVisser

Contributor

What is the purpose of the change

Fixes flaky DynamicParameterITCase runs that either hang the whole e2e_4 leg for hours or fail with Missing required option: c (Azure builds 75992, 76627). The distribution's log4j configuration rolls the log file on startup (OnStartupTriggeringPolicy), so the JobManager startup banner (program arguments, classpath) frequently lands in a rolled .log.N file, which FlinkDistribution.searchAllLogs deliberately skips. The test then either spins unboundedly waiting for a banner that never appears in the live .log, or parses a half-written arguments block.
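The failure mode can be illustrated with a small in-memory sketch. This is not the FlinkDistribution implementation: the class name, the file names, the sample log lines, and the assumption that rolled files match `.log.<N>` are all made up for illustration. Only the includeRolledLogs flag and the delegating overload follow the change log in this description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical in-memory stand-in for a log directory; not Flink's FlinkDistribution. */
final class RolledLogSearchSketch {
    // Assumption: the live file ends in ".log", rolled files end in ".log.<N>".
    private static final Pattern ROLLED = Pattern.compile(".*\\.log\\.\\d+");

    /** Existing-style entry point: delegates with false, so only live logs are searched. */
    static List<String> search(Map<String, List<String>> logs, Pattern linePattern) {
        return search(logs, linePattern, false);
    }

    /** Overload with an opt-in flag that also searches rolled files. */
    static List<String> search(
            Map<String, List<String>> logs, Pattern linePattern, boolean includeRolledLogs) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : logs.entrySet()) {
            String name = entry.getKey();
            boolean live = name.endsWith(".log");
            boolean rolled = ROLLED.matcher(name).matches();
            if (!live && !(includeRolledLogs && rolled)) {
                continue; // skipped file
            }
            for (String line : entry.getValue()) {
                if (linePattern.matcher(line).find()) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    /** Made-up directory state: the banner was rolled away before the live file was written. */
    static Map<String, List<String>> sampleLogs() {
        Map<String, List<String>> logs = new LinkedHashMap<>();
        logs.put("jobmanager.log.1", Arrays.asList("Program Arguments:", "Classpath: lib/a.jar"));
        logs.put("jobmanager.log", Arrays.asList("Starting JobManager"));
        return logs;
    }
}
```

With this sample state, the two-argument search finds no "Classpath:" line, while the three-argument search with the flag set to true finds it in the rolled file.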

Brief change log

  • FlinkDistribution.searchAllLogs: add an overload with an includeRolledLogs flag; the existing 2-arg method delegates with false, so all other callers are unchanged.
  • DynamicParameterITCase: search rolled logs for the startup banner, and bound the readiness wait with CommonTestUtils.waitUtil (1 minute) so a missing banner fails fast with a clear message instead of hanging until the CI watchdog kills the leg. The "Classpath:" line is logged after the program arguments, so its presence guarantees the complete block has been flushed before parsing.
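The bounded readiness wait described above can be sketched roughly as follows. This is a standalone illustration, not the test's code: the PR uses CommonTestUtils.waitUtil, whose signature is not reproduced here, and the names BoundedBannerWait, waitUntil, and bannerComplete, as well as the poll interval, are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a bounded wait plus a completeness check; not Flink code. */
final class BoundedBannerWait {

    /** Polls the condition until it holds, or fails with the given message on timeout. */
    static void waitUntil(
            Supplier<Boolean> condition, Duration timeout, Duration pollInterval, String errorMessage) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.get()) {
            if (System.nanoTime() - deadline >= 0) {
                // Fail fast with a clear message instead of spinning forever.
                throw new IllegalStateException(errorMessage);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    /**
     * Treats the banner as complete once a "Classpath:" line is present, relying on the
     * ordering stated in the change log (it is logged after the program arguments).
     */
    static boolean bannerComplete(List<String> lines) {
        return lines.stream().anyMatch(line -> line.contains("Classpath:"));
    }
}
```

A caller would pass a condition that re-reads the logs and checks bannerComplete, with a timeout of one minute as in the change log, and only parse the arguments block after the wait returns.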

Verifying this change

This change is already covered by existing tests (DynamicParameterITCase, all parameterizations green locally). The hang needs the on-startup log rotation to move the banner into a rolled file, which depends on prior runs' log state on the CI machine and is not deterministically reproducible locally.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8, via Claude Code)

Generated-by: Claude Opus 4.8 (1M context)

…JobManager logs

The distribution log4j configuration rolls the log file on startup, so the JobManager startup banner frequently lands in a rolled .log.N file that FlinkDistribution.searchAllLogs skips; the test then either spins unboundedly waiting for the banner (multi-hour e2e_4 hang) or parses a half-written arguments block ("Missing required option: c"). Search rolled logs for the startup banner and bound the wait so a missing banner fails fast.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot

flinkbot commented Jul 3, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@spuru9 left a comment

Contributor


LGTM
