
[FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery #27688

Open: 1996fanrui wants to merge 5 commits into apache:master from 1996fanrui:39140/enhance-uc-itcase-during-recovery

1996fanrui (Member) commented Feb 26, 2026

What is the purpose of the change

The current Unaligned Checkpoint ITCases restart only once, from a normal checkpoint. They do not cover restoring from a checkpoint produced during the recovery phase, which is the key scenario for checkpointing during recovery.

Proposed mechanism: After restoring from a checkpoint, wait for the first new checkpoint to be produced, then immediately trigger a restart from it. Repeat for a configurable number of rounds (≥ 2). Whether to rescale depends on the specific test case.
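The proposed rounds can be sketched roughly as follows. This is a minimal, Flink-free model and not the code in this PR: `RestartRoundsSketch`, `FakeJob`, and `waitForFirstNewCheckpoint` are hypothetical names, and checkpoints are modelled as plain increasing IDs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Flink-free model of the multi-round restart loop. */
class RestartRoundsSketch {

    /** Stand-in for a running job; it only tracks checkpoint IDs. */
    static final class FakeJob {
        final long restoredFrom; // checkpoint the job was restored from (0 = fresh start)
        private long lastCompleted;

        FakeJob(long restoredFrom) {
            this.restoredFrom = restoredFrom;
            this.lastCompleted = restoredFrom;
        }

        /** Simulates waiting until one checkpoint newer than the restore point completes. */
        long waitForFirstNewCheckpoint() {
            lastCompleted++;
            return lastCompleted;
        }
    }

    /** Runs the given number of restore rounds and returns the checkpoint used in each round. */
    static List<Long> run(int rounds) {
        if (rounds < 2) {
            throw new IllegalArgumentException("rounds must be >= 2");
        }
        List<Long> restoredFrom = new ArrayList<>();
        FakeJob job = new FakeJob(0L);
        long checkpoint = job.waitForFirstNewCheckpoint(); // the initial, normal checkpoint
        for (int round = 0; round < rounds; round++) {
            // Restart (optionally rescaled, depending on the test case) from the newest checkpoint.
            job = new FakeJob(checkpoint);
            restoredFrom.add(job.restoredFrom);
            // Do not wait for steady state: take the first new checkpoint and restart again.
            checkpoint = job.waitForFirstNewCheckpoint();
        }
        return restoredFrom;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [1, 2, 3]
    }
}
```

From the second round on, each restore uses a checkpoint taken right after the previous restore, which is the case the description wants the tests to exercise.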

This mechanism works on the current master (validating normal checkpoint recovery). Once checkpointing during recovery is enabled, the same tests automatically cover recovery-phase checkpoint scenarios.

Brief change log

  • [FLINK-39140][checkpoint] Disable CUSTOM_PARTITIONER in unaligned checkpoint it case since it does not work well
  • [FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery
  • [FLINK-39140][checkpoint] Fix MAX_RETAINED_CHECKPOINTS not effective in UnalignedCheckpointRescaleWithMixedExchangesITCase
    • Move MAX_RETAINED_CHECKPOINTS from per-job config to MiniCluster cluster config. StandaloneCompletedCheckpointStore reads this value from the cluster-level configuration, so setting it in StreamExecutionEnvironment had no effect (defaulting to 1).
    • As a result, the checkpoint selected for restore was sometimes cleaned up when a new checkpoint completed.
  • [FLINK-39140][checkpoint] Change record type from Long to String in UnalignedCheckpointRescaleWithMixedExchangesITCase
    • Long records (8 bytes) allow hundreds of records per buffer, causing excessive backpressure during aligned checkpoint phases (forward/rescale exchanges). Using 100-char random String records reduces the record count per buffer, shortening the time needed to drain backpressured buffers.
  • [hotfix] Including task name and subtask index into channel-state-unspilling thread name

Verifying this change

This change enhances existing ITCases, so it is covered by those tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

flinkbot (Collaborator) commented Feb 26, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch 2 times, most recently from bb0667e to fe5b695 (March 3, 2026 09:39)
1996fanrui marked this pull request as ready for review (March 3, 2026 09:45)
1996fanrui changed the title from "[FLINK-39140][checkpoint] Enhance Unaligned Checkpoint ITCases to perform checkpointing during recovery" to "[FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery" (Mar 3, 2026)
1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from fe5b695 to a4ef930 (March 3, 2026 09:51)
rkhachatryan (Contributor) left a comment

Thanks for the PR @1996fanrui

I've left a couple of comments, PTAL.

Apart from that, could you add a component name to the last commit message (I guess [runtime])?
And for the earlier commits, [test] seems to be a more relevant component.

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from a4ef930 to ac826c3 (March 3, 2026 17:27)
…t it case since it does not work well

See FLINK-39162
…TCases to perform checkpointing during recovery
…lignedCheckpointRescaleWithMixedExchangesITCase

Move MAX_RETAINED_CHECKPOINTS from per-job config to MiniCluster cluster
config. StandaloneCompletedCheckpointStore reads this value from the
cluster-level configuration, so setting it in StreamExecutionEnvironment
had no effect (defaulting to 1). This caused checkpoint subsumption to
delete the selected checkpoint before the next job could restore from it.
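
The subsumption problem described in this commit message can be illustrated with a small model. This is a sketch under stated assumptions, not Flink's StandaloneCompletedCheckpointStore: `BoundedStore` and `selectedSurvives` are hypothetical names, and the store simply drops its oldest entries once the retention limit is exceeded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a bounded completed-checkpoint store (not Flink's actual class). */
class RetainedCheckpointsSketch {

    static final class BoundedStore {
        private final int maxRetained;
        private final Deque<Long> retained = new ArrayDeque<>();

        BoundedStore(int maxRetained) {
            this.maxRetained = maxRetained;
        }

        /** Adds a completed checkpoint and subsumes (drops) the oldest ones beyond the limit. */
        void add(long checkpointId) {
            retained.addLast(checkpointId);
            while (retained.size() > maxRetained) {
                retained.removeFirst();
            }
        }

        boolean contains(long checkpointId) {
            return retained.contains(checkpointId);
        }
    }

    /** Returns whether the checkpoint selected for restore survives one more completed checkpoint. */
    static boolean selectedSurvives(int maxRetained) {
        BoundedStore store = new BoundedStore(maxRetained);
        store.add(1L);
        long selected = 1L; // the test picks checkpoint 1 to restore from
        store.add(2L);      // a newer checkpoint completes before the next job restores
        return store.contains(selected);
    }

    public static void main(String[] args) {
        System.out.println("maxRetained=1 -> " + selectedSurvives(1)); // false: selected checkpoint was subsumed
        System.out.println("maxRetained=2 -> " + selectedSurvives(2)); // true: selected checkpoint is still retained
    }
}
```

With the limit effectively stuck at 1, any checkpoint that completes after the selection removes the selected one, which matches the intermittent failure the commit describes.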
…edCheckpointRescaleWithMixedExchangesITCase

Long records (8 bytes) allow thousands of records per buffer, causing
excessive backpressure during aligned checkpoint phases (forward/rescale
exchanges). Using 100-char random String records reduces the record count
per buffer, shortening the time needed to drain backpressured buffers.
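
A rough back-of-the-envelope check of the record counts: the sketch below assumes a 32 KiB network buffer (the test may configure a different size) and ignores per-record serialization overhead, so the numbers are only indicative. `RecordsPerBufferSketch` and `recordsPerBuffer` are hypothetical names.

```java
/** Hypothetical arithmetic sketch: how many fixed-size records fit into one buffer. */
class RecordsPerBufferSketch {

    /** Integer division: records that fit entirely into the buffer, ignoring framing overhead. */
    static int recordsPerBuffer(int bufferBytes, int recordBytes) {
        return bufferBytes / recordBytes;
    }

    public static void main(String[] args) {
        int bufferBytes = 32 * 1024; // assumed buffer size, not taken from the PR
        System.out.println("Long (8 bytes):       " + recordsPerBuffer(bufferBytes, 8));   // 4096
        System.out.println("String (~100 bytes):  " + recordsPerBuffer(bufferBytes, 100)); // 327
    }
}
```

Under these assumptions, a full buffer holds about 12 times fewer String records than Long records, so fewer records have to be processed to drain each backpressured buffer.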
1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from ac826c3 to f6da7ce (March 3, 2026 17:33)
1996fanrui (Member, Author) commented

Hey @rkhachatryan, thanks for the review!

Apart from that, could you add a component name to the last commit message (I guess [runtime])? And for the earlier commits, [test] seems to be a more relevant component.

Makes sense, updated.

rkhachatryan (Contributor) left a comment

Thanks for updating the PR, LGTM.

W.r.t. CustomPartitioner (FLINK-39162), I've drafted #27731. My guess is that we don't forward the disableUnalignedCheckpoints call to the nested partitioner (I haven't looked deeply into the code).
