[FLINK-36753][runtime] Adaptive Scheduler actively triggers a Checkpoint after all resources are ready #27921
Samrat002 wants to merge 1 commit into apache:master
@1996fanrui PTAL whenever you have time.
pnowojski
left a comment
Thanks for the contribution. I've left a couple of comments; however, I don't have the context to review whether this is properly integrated with AdaptiveScheduler and DefaultStateTransitionManager. It would be great for someone else to take a look as well.
@flinkbot run azure
@ztison @pnowojski PTAL. I have addressed the review comments, added unit tests, made the IT more robust, and ensured min-pause is respected.
Thanks for incorporating our improvements. I was on vacation the last few days, so I haven't responded. I am back and will check the PR today or tomorrow.
ztison
left a comment
I see some issues with the retry logic.
@ztison PTAL, I have addressed the latest review comments.
37050e6 to
e9735a9
Compare
@pnowojski @XComp PTAL whenever you have time.
…nt after all resources are ready
What is the purpose of the change
FLIP-461 introduced checkpoint-synchronized rescaling where the Adaptive Scheduler waits for a checkpoint to complete before rescaling. However, it passively waits for the next periodic checkpoint, which can delay rescaling significantly when checkpoint intervals are large (e.g., 10 minutes).
This PR makes the Adaptive Scheduler actively trigger a checkpoint when resources change and rescaling is desired. The trigger fires when the DefaultStateTransitionManager enters the Stabilizing or Stabilized phase, i.e., when the resource gate is open and the scheduler is waiting for the checkpoint gate. The feature is controlled by a new configuration option, jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled (default: false).
The feature respects execution.checkpointing.min-pause, skips if a checkpoint is already in progress, and only fires when parallelism has actually changed.
Brief change log
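For readers skimming the change, the trigger conditions stated in the description (feature flag on, parallelism actually changed, no checkpoint in progress, min-pause respected) can be sketched roughly as below. This is only an illustration of the stated conditions, not the PR's code: the class, method, and parameter names are hypothetical, and whether the real implementation skips or retries later when min-pause has not elapsed is not something this sketch tries to capture.

```java
import java.time.Duration;

// Hypothetical illustration of the guards named in the PR description.
// None of these names are taken from the Flink code base.
class ActiveCheckpointTriggerSketch {

    /**
     * Returns true only if every stated guard passes.
     *
     * @param featureEnabled value of the new active-checkpoint option (default false)
     * @param parallelismChanged whether the desired parallelism differs from the current one
     * @param checkpointInProgress whether a checkpoint is already pending
     * @param sinceLastCheckpoint assumed: time elapsed since the last checkpoint finished
     * @param minPause value of execution.checkpointing.min-pause
     */
    static boolean shouldTrigger(
            boolean featureEnabled,
            boolean parallelismChanged,
            boolean checkpointInProgress,
            Duration sinceLastCheckpoint,
            Duration minPause) {
        if (!featureEnabled) {
            return false; // feature is opt-in
        }
        if (!parallelismChanged) {
            return false; // nothing to rescale, e.g. the initial deploy
        }
        if (checkpointInProgress) {
            return false; // the pending checkpoint already serves the rescale
        }
        // Respect min-pause: only fire once enough time has passed.
        return sinceLastCheckpoint.compareTo(minPause) >= 0;
    }

    public static void main(String[] args) {
        boolean fire = shouldTrigger(
                true, true, false, Duration.ofSeconds(30), Duration.ZERO);
        System.out.println("trigger checkpoint: " + fire);
    }
}
```

With min-pause set to 0 s, as in the end-to-end setup described in this PR, the last guard always passes, so only the first three conditions decide the outcome.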
Verifying this change
End-to-end test on a real cluster
Verified the feature on a local 2-TaskManager standalone cluster running the LargeStateGeneratorJob benchmark with a deliberately long checkpoint interval, so any checkpoint firing within seconds must be the active trigger.
Setup
Scheduler: Adaptive
jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled: true
execution.checkpointing.interval: 1 h
execution.checkpointing.min-pause: 0 s
state.backend.type: hashmap
Job: LargeStateGeneratorJob at parallelism 2, ~16 MB keyed state
Scenarios
Initial deploy at parallelism 2: Executing.requestActiveCheckpointTrigger() was called from Stabilizing entry, but the parallelismChanged() guard correctly returned false; no active trigger fired.
Scale up 2 → 4 (start 2nd TM, then `PUT /jobs/<id>/resource-requirements` with `upperBound: 4`): active trigger fired immediately, checkpoint completed in 22 ms, rescale proceeded.
Scale down 4 → 2 (`PUT … upperBound: 2`): same flow, 22 ms checkpoint, rescale proceeded.
Grepped Log Lines
Entire JobManager logs
e2e-proof-jm.log
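For anyone wanting to reproduce the scale-up and scale-down scenarios, the resource-requirements request could be built along these lines. This is a sketch under assumptions: the JSON body shape (vertex ID mapped to a `parallelism` object with `lowerBound` and `upperBound`) reflects my understanding of the Flink REST API and should be checked against the REST documentation, and the base URL, job ID, and vertex ID in `main` are made-up placeholders, not values from the test run.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for building the rescale request; not part of the PR.
class ResourceRequirementsRequestSketch {

    /** Builds the assumed JSON body for a single job vertex. */
    static String body(String vertexId, int lowerBound, int upperBound) {
        return "{\"" + vertexId + "\":{\"parallelism\":{"
                + "\"lowerBound\":" + lowerBound + ","
                + "\"upperBound\":" + upperBound + "}}}";
    }

    /** Builds (but does not send) the PUT request against the JobManager REST endpoint. */
    static HttpRequest request(String baseUrl, String jobId, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/jobs/" + jobId + "/resource-requirements"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder IDs; substitute the real job and vertex IDs from the cluster.
        String json = body("0123456789abcdef0123456789abcdef", 1, 4);
        HttpRequest req = request(
                "http://localhost:8081", "fedcba9876543210fedcba9876543210", json);
        System.out.println(req.method() + " " + req.uri());
        System.out.println(json);
    }
}
```

Sending the request is left out on purpose; passing it to `java.net.http.HttpClient` would do that against a running cluster.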
Does this pull request potentially affect one of the following parts:
Documentation