[FLINK-36753][runtime]Adaptive Scheduler actively triggers a Checkpoint after all resources are ready by Samrat002 · Pull Request #27921 · apache/flink

Samrat002 · 2026-04-12T17:47:35Z

What is the purpose of the change

FLIP-461 introduced checkpoint-synchronized rescaling where the Adaptive Scheduler waits for a checkpoint to complete before rescaling. However, it passively waits for the next periodic checkpoint, which can delay rescaling significantly when checkpoint intervals are large (e.g., 10 minutes).
This PR makes the Adaptive Scheduler actively trigger a checkpoint when resources change and rescaling is desired. The trigger fires when the DefaultStateTransitionManager enters the Stabilizing or Stabilized phase, i.e., when the resource gate is open and the scheduler is waiting for the checkpoint gate. The feature is controlled by a new configuration option jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled (default: false).

The feature respects execution.checkpointing.min-pause, skips if a checkpoint is already in progress, and only fires when parallelism has actually changed.
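The guard conditions listed above can be pictured with a small standalone sketch. It is not the code from this PR: the class name, method name, and parameters are hypothetical, and how the PR actually enforces execution.checkpointing.min-pause (for example by delaying instead of skipping) is not shown in this description, so the sketch simply declines to fire while the pause has not elapsed.

```java
import java.time.Duration;

/** Hypothetical, simplified model of the guards; not the actual Flink implementation. */
final class ActiveTriggerGuardSketch {

    /** Returns true only if every guard named in the PR description passes. */
    static boolean shouldTrigger(
            boolean featureEnabled,
            boolean checkpointingConfigured,
            boolean parallelismChanged,
            boolean checkpointInProgress,
            Duration sinceLastCheckpoint,
            Duration minPause) {
        if (!featureEnabled || !checkpointingConfigured) {
            return false; // option is off (the default) or the job has no checkpointing
        }
        if (!parallelismChanged) {
            return false; // e.g. the initial deployment: nothing to rescale
        }
        if (checkpointInProgress) {
            return false; // the in-flight checkpoint will open the checkpoint gate
        }
        // Assumption: min-pause is modelled as "do not fire yet".
        return sinceLastCheckpoint.compareTo(minPause) >= 0;
    }

    public static void main(String[] args) {
        // Rescale wanted, nothing in flight, min-pause of 0 s as in the e2e setup.
        System.out.println(shouldTrigger(true, true, true, false,
                Duration.ofSeconds(5), Duration.ZERO));
        // Initial deployment: parallelism unchanged, so no trigger.
        System.out.println(shouldTrigger(true, true, false, false,
                Duration.ofSeconds(5), Duration.ZERO));
    }
}
```

Ordering the cheap boolean checks first keeps the min-pause comparison as the last step, which mirrors the order in which the description lists the conditions.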

Brief change log

Added requestActiveCheckpointTrigger() to StateTransitionManager.Context interface
DefaultStateTransitionManager calls requestActiveCheckpointTrigger() when entering Stabilizing, on onChange during Stabilizing, and when entering Stabilized
Executing implements the callback with guard conditions (config enabled, checkpointing configured, parallelism changed, no checkpoint in progress)
Added config option jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled wired through AdaptiveScheduler.Settings
Added integration test proving rescale happens without periodic checkpoints or manual triggers
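To make the three call sites in the change log easier to follow, here is a minimal sketch of a phase machine that invokes the callback on entering Stabilizing, on a change during Stabilizing, and on entering Stabilized. Only the method name requestActiveCheckpointTrigger() comes from this PR; the class, the reduced set of phases, and the onStabilizationTimeout() method are illustrative assumptions, and the real DefaultStateTransitionManager has more phases and different internals.

```java
/** Hypothetical, reduced model of the phase transitions; not the actual Flink class. */
final class TransitionManagerSketch {

    /** Stand-in for StateTransitionManager.Context, reduced to the new callback. */
    interface Context {
        void requestActiveCheckpointTrigger();
    }

    enum Phase { IDLING, STABILIZING, STABILIZED }

    private final Context context;
    private Phase phase = Phase.IDLING;

    TransitionManagerSketch(Context context) {
        this.context = context;
    }

    /** A resource change arrived. */
    void onChange() {
        if (phase == Phase.IDLING) {
            enter(Phase.STABILIZING);                  // call site 1: entering Stabilizing
        } else if (phase == Phase.STABILIZING) {
            context.requestActiveCheckpointTrigger();  // call site 2: onChange during Stabilizing
        }
    }

    /** Assumed event: resources stopped changing for long enough. */
    void onStabilizationTimeout() {
        if (phase == Phase.STABILIZING) {
            enter(Phase.STABILIZED);                   // call site 3: entering Stabilized
        }
    }

    Phase phase() {
        return phase;
    }

    private void enter(Phase next) {
        phase = next;
        context.requestActiveCheckpointTrigger();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        TransitionManagerSketch manager = new TransitionManagerSketch(() -> calls[0]++);
        manager.onChange();               // Idling -> Stabilizing
        manager.onChange();               // change while Stabilizing
        manager.onStabilizationTimeout(); // Stabilizing -> Stabilized
        System.out.println(calls[0] + " trigger requests, phase " + manager.phase());
    }
}
```

The callback is only a request: in the PR, whether a checkpoint is actually triggered is decided by the guard conditions in Executing.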

Verifying this change

Unit Test

End-to-end test on a real cluster

Verified the feature on a local 2-TaskManager standalone cluster running the LargeStateGeneratorJob benchmark with a deliberately long checkpoint interval, so any checkpoint firing within seconds must be the active trigger.

Setup

| Setting | Value |
| --- | --- |
| Scheduler | `Adaptive` |
| `jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled` | `true` |
| `execution.checkpointing.interval` | `1 h` |
| `execution.checkpointing.min-pause` | `0 s` |
| `state.backend.type` | `hashmap` |
| Cluster | 1 JM + 1 TM (2 slots), 2nd TM added mid-run |
| Job | `LargeStateGeneratorJob` at parallelism 2, ~16 MB keyed state |

Scenarios

Initial deploy at parallelism 2 — Executing.requestActiveCheckpointTrigger() was called from Stabilizing entry but the parallelismChanged() guard correctly returned false; no active trigger fired.
Scale up 2 → 4 (start 2nd TM, then PUT /jobs/<id>/resource-requirements with upperBound: 4) — active trigger fired immediately, checkpoint completed in 22 ms, rescale proceeded.
Scale down 4 → 2 (PUT … upperBound: 2) — same flow, 22 ms checkpoint, rescale proceeded.
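For readers who want to reproduce the scale-up and scale-down steps, the PUT request could be built roughly as follows. This is a sketch under assumptions: the REST address localhost:8081 and the vertex ID are placeholders, the lower bound of 1 is a guess, and the per-vertex body shape follows my reading of the Flink REST documentation rather than anything shown in this PR, so it should be checked against the Flink version in use. The request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for building the resource-requirements request; not part of the PR. */
final class ResourceRequirementsRequestSketch {

    /** Builds a JSON body with parallelism bounds for a single job vertex. */
    static String body(String vertexId, int lowerBound, int upperBound) {
        return String.format(
                "{\"%s\":{\"parallelism\":{\"lowerBound\":%d,\"upperBound\":%d}}}",
                vertexId, lowerBound, upperBound);
    }

    /** Builds (but does not send) PUT /jobs/<id>/resource-requirements. */
    static HttpRequest put(String restBase, String jobId, String jsonBody) {
        return HttpRequest.newBuilder(
                        URI.create(restBase + "/jobs/" + jobId + "/resource-requirements"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        String jobId = "c06b881baef154cb51e196fa46844772"; // job ID from the log excerpt
        String vertexId = "<vertex-id>";                    // placeholder, look it up via the REST API
        String json = body(vertexId, 1, 4);                 // scale up: upperBound 4
        HttpRequest request = put("http://localhost:8081", jobId, json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```

Sending the request would be a matter of passing it to `java.net.http.HttpClient`; scaling back down uses the same call with an upper bound of 2.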

Grepped Log Lines

2026-04-25 23:53:49,059  AdaptiveScheduler             Actively triggering checkpoint to expedite rescaling, job c06b881baef154cb51e196fa46844772.
2026-04-25 23:53:49,059  DefaultStateTransitionManager Transitioning from Idling to Stabilizing, job ...
2026-04-25 23:53:49,060  CheckpointCoordinator         Triggering checkpoint 2 ...
2026-04-25 23:53:49,084  CheckpointCoordinator         Completed checkpoint 2 ... (16,395,090 bytes, checkpointDuration=22 ms)
2026-04-25 23:53:49,084  AdaptiveScheduler             Active checkpoint for rescale completed successfully: 2.
2026-04-25 23:53:49,084  DefaultStateTransitionManager Desired resources are met, transitioning to the subsequent state ...
2026-04-25 23:53:49,084  DefaultStateTransitionManager Transitioning from Stabilizing to Transitioning, job ...

2026-04-25 23:54:57,517  AdaptiveScheduler             Actively triggering checkpoint to expedite rescaling, job c06b881baef154cb51e196fa46844772.
2026-04-25 23:54:57,517  CheckpointCoordinator         Triggering checkpoint 4 ...
2026-04-25 23:54:57,539  CheckpointCoordinator         Completed checkpoint 4 ... (16,395,900 bytes, checkpointDuration=22 ms)
2026-04-25 23:54:57,539  AdaptiveScheduler             Active checkpoint for rescale completed successfully: 4.
2026-04-25 23:54:57,540  DefaultStateTransitionManager Transitioning from Stabilizing to Transitioning, job ...
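
The end-to-end latency implied by the log excerpts can be checked with a few lines of code. The sketch below parses the timestamps exactly as they appear above and computes the time from "Actively triggering checkpoint" to "Transitioning from Stabilizing to Transitioning"; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Illustrative helper for measuring gaps between log timestamps; not part of the PR. */
final class TriggerLatencySketch {

    // Timestamp layout used in the grepped log lines, e.g. 2026-04-25 23:53:49,059
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps. */
    static long millisBetween(String start, String end) {
        return Duration.between(
                        LocalDateTime.parse(start, LOG_TS),
                        LocalDateTime.parse(end, LOG_TS))
                .toMillis();
    }

    public static void main(String[] args) {
        // Scale up 2 -> 4: trigger at ,059 and transition at ,084
        System.out.println(millisBetween("2026-04-25 23:53:49,059", "2026-04-25 23:53:49,084")); // prints 25
        // Scale down 4 -> 2: trigger at ,517 and transition at ,540
        System.out.println(millisBetween("2026-04-25 23:54:57,517", "2026-04-25 23:54:57,540")); // prints 23
    }
}
```

Both gaps are in the tens of milliseconds, compared with the configured 1 h checkpoint interval that the passive approach would have waited on.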

Entire JobManager logs

e2e-proof-jm.log

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? JavaDocs

flinkbot · 2026-04-12T17:51:13Z

CI report:

940966f Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

`@flinkbot run azure`: re-run the last Azure build

Samrat002 · 2026-04-13T17:20:20Z

@1996fanrui PTAL whenever you have time.

pnowojski

Thanks for the contribution. I've left a couple of comments; however, I don't have the context to review whether this is properly integrated with AdaptiveScheduler and DefaultStateTransitionManager. It would be great for someone else to take a look as well.

ztison

Hi, thanks for the PR. We looked at it with @XComp and found a few things to improve.

Samrat002 · 2026-04-17T05:51:46Z

@flinkbot run azure

Samrat002 · 2026-04-17T08:45:20Z

@ztison @pnowojski PTAL. I have addressed the review comments.

Added unit tests, made the IT more robust, and ensured min-pause is respected.

ztison · 2026-04-20T07:42:44Z

> @ztison @pnowojski PTAL. I have addressed the review comments.
>
> Added unit tests, made the IT more robust, and ensured min-pause is respected.

Thanks for incorporating our improvements. I was on vacation for the last few days, so I haven't responded. I am back and will check the PR today or tomorrow.

ztison

I see some issues with the retry logic.

Samrat002 · 2026-04-24T04:28:56Z

@ztison PTAL, I have addressed the latest review comments.

ztison

Generally, I am OK with the implementation. Thanks for applying the suggested changes. I added one last small suggestion.
Thanks.

Samrat002 · 2026-04-28T09:04:56Z

@pnowojski @XComp PTAL whenever you have time.


Samrat002 marked this pull request as ready for review April 13, 2026 17:20

pnowojski reviewed Apr 14, 2026


ztison reviewed Apr 15, 2026


github-actions bot added the community-reviewed label (PR has been reviewed by the community) Apr 15, 2026

Samrat002 force-pushed the FLINK-36753 branch from bd96e05 to c5a4baa on April 16, 2026 18:52

Samrat002 requested review from pnowojski and ztison April 16, 2026 18:53

Samrat002 force-pushed the FLINK-36753 branch from c5a4baa to e4f65fc on April 17, 2026 03:17

ztison reviewed Apr 21, 2026


Samrat002 force-pushed the FLINK-36753 branch from e4f65fc to a6337c6 on April 24, 2026 04:27

Samrat002 requested a review from ztison April 24, 2026 04:28

ztison reviewed Apr 24, 2026


Comment thread flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/ExecutingTest.java

Comment thread flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/ExecutingTest.java

Samrat002 force-pushed the FLINK-36753 branch from a6337c6 to 1e5bc9b on April 24, 2026 10:16

Samrat002 requested a review from ztison April 24, 2026 12:54

Samrat002 force-pushed the FLINK-36753 branch 3 times, most recently from 37050e6 to e9735a9, on April 26, 2026 03:52

ztison approved these changes Apr 27, 2026


Comment thread flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java Outdated

Samrat002 force-pushed the FLINK-36753 branch from e9735a9 to bb61c5e on April 27, 2026 18:00

[FLINK-36753][runtime]Adaptive Scheduler actively triggers a Checkpoint after all resources are ready (940966f)

Samrat002 force-pushed the FLINK-36753 branch from bb61c5e to 940966f on May 7, 2026 12:56
