[FLINK-37701][flink-runtime] Backport to 2.0 of the fix for AdaptiveScheduler ignoring checkpoint state sizes for local recovery adjustment. #26796

Merged
1996fanrui merged 3 commits into apache:release-2.0 from Izeren:37701/2.0-fix-backport on Jul 15, 2025

Conversation

Izeren (Contributor) commented Jul 14, 2025

Previously merged PRs:

PR1 #26663

What is the purpose of the change

Address local recovery issues when the AdaptiveScheduler is enabled.

  1. Pass the latest completed checkpoint, in addition to the execution graph, to StateSizeEstimates. This is needed because the execution graph goes through the cancelling/cancelled states, and the checkpoint coordinator is already nulled by the time the calculations run.
  2. Assign a positive priority score to allocations that have overlapping key groups even when the state size is zero. Currently a priority score is given only if managedKeyedState is present, but local recovery semantics do not require state to be present.
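
The second point can be illustrated with a small standalone sketch. This is not the code from the PR: the class `LocalRecoveryScoreSketch`, the methods `overlap` and `score`, and the weighting formula (overlapping key groups multiplied by an estimated size, floored at 1) are all hypothetical and only show why a zero state size should not produce a zero score.

```java
// Hypothetical sketch, not Flink's actual slot-assignment code.
public final class LocalRecoveryScoreSketch {

    /** Number of key groups shared by two inclusive ranges [aStart, aEnd] and [bStart, bEnd]. */
    static int overlap(int aStart, int aEnd, int bStart, int bEnd) {
        int start = Math.max(aStart, bStart);
        int end = Math.min(aEnd, bEnd);
        return Math.max(0, end - start + 1);
    }

    /**
     * Priority score of a previous allocation for a new subtask.
     * Assumption: the score is the overlap weighted by an estimated state size per key group.
     * The weight is floored at 1 so that an overlap with zero-sized state still scores > 0.
     */
    static long score(int overlappingKeyGroups, long estimatedBytesPerKeyGroup) {
        if (overlappingKeyGroups <= 0) {
            return 0L; // no shared key groups, nothing to recover locally
        }
        long weight = Math.max(1L, estimatedBytesPerKeyGroup);
        return overlappingKeyGroups * weight;
    }

    public static void main(String[] args) {
        int shared = overlap(0, 63, 32, 127); // key groups 32..63 are shared
        System.out.println(score(shared, 0L));    // zero state size, still positive
        System.out.println(score(shared, 1024L)); // larger state, higher priority
        System.out.println(score(overlap(0, 31, 64, 127), 1024L)); // disjoint ranges
    }
}
```

With a plain product of overlap and state size, the first call would score zero and the previous allocation would get no preference, which is the behaviour the description above says is being changed.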

Context: When a job can be recovered locally, we should keep the slot allocation after restart to maintain

Verifying this change

LocalRecoveryITCase#testRecoverLocallyFromProcessCrashWithWorkingDirectory now passes when AdaptiveScheduler is enabled.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

PR2 #26712

Brief change log

  • Extract commonly re-used elements from AdaptiveSchedulerTest into AdaptiveSchedulerTestBase.
  • Move local recovery test into independent class inheriting from AdaptiveSchedulerTestBase.
  • Restore checkstyle file length limit to 3100 lines.
  • Minor cleanup changes missed in the previous PR.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Izeren added 3 commits July 14, 2025 16:18
…for local recovery adjustment.

(cherry picked from commit 8b6f9ce)
…t length, restore checkstyle line-length limit.

(cherry picked from commit 9365049)
flinkbot (Collaborator) commented Jul 14, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

rkhachatryan (Contributor) left a comment

LGTM assuming CI is green

1996fanrui (Member) left a comment

Thanks for the backporting, CI is green, merging

@1996fanrui 1996fanrui merged commit 5f4d713 into apache:release-2.0 Jul 15, 2025
Izeren (Contributor, Author) commented Jul 15, 2025

Thank you Rui

@Izeren Izeren deleted the 37701/2.0-fix-backport branch July 15, 2025 08:38