[FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery by 1996fanrui · Pull Request #27688 · apache/flink

1996fanrui · 2026-02-26T13:54:06Z

What is the purpose of the change

The current Unaligned Checkpoint ITCases restart only once, from a normal checkpoint. They do not cover restoring from a checkpoint taken during the recovery phase, which is the key scenario for checkpointing during recovery.

Proposed mechanism: After restoring from a checkpoint, wait for the first new checkpoint to be produced, then immediately trigger a restart from it. Repeat for a configurable number of rounds (≥ 2). Whether to rescale depends on the specific test case.

This mechanism works on the current master (validating normal checkpoint recovery). Once checkpointing during recovery is enabled, the same tests automatically cover recovery-phase checkpoint scenarios.
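The restore-and-restart loop described above can be sketched roughly as follows. This is an illustration, not the code in this PR: `JobRound`, `runUntilFirstCheckpoint`, and `runRounds` are hypothetical names, and the "job" is a fake that only derives a new checkpoint name per round.

```java
import java.util.ArrayList;
import java.util.List;

public class RestartFromFirstCheckpointSketch {

    /** Hypothetical stand-in for "restore the job, run until its first new checkpoint completes". */
    interface JobRound {
        String runUntilFirstCheckpoint(String restorePath, int parallelism);
    }

    /**
     * Runs one round per entry in parallelismPerRound. Each round restores from the
     * checkpoint produced by the previous round and returns as soon as the first new
     * checkpoint exists. The returned list records what each round restored from.
     */
    static List<String> runRounds(JobRound job, String initialCheckpoint, int[] parallelismPerRound) {
        List<String> restoredFrom = new ArrayList<>();
        String checkpoint = initialCheckpoint;
        for (int parallelism : parallelismPerRound) {
            restoredFrom.add(checkpoint);
            // Restore, wait for the first new checkpoint, then restart from it in the next round.
            checkpoint = job.runUntilFirstCheckpoint(checkpoint, parallelism);
        }
        return restoredFrom;
    }

    public static void main(String[] args) {
        // Fake job: derives a new checkpoint name from a counter and the round's parallelism.
        int[] counter = {0};
        JobRound fakeJob = (restorePath, parallelism) -> "chk-" + (++counter[0]) + "@p" + parallelism;

        // Three rounds; whether the parallelism changes between rounds is up to the test case.
        List<String> restoredFrom = runRounds(fakeJob, "chk-0@p2", new int[] {4, 2, 3});
        System.out.println(restoredFrom); // [chk-0@p2, chk-1@p4, chk-2@p2]
    }
}
```

From the second round on, every restore uses a checkpoint that was taken right after a previous restore, which is the situation the tests are meant to exercise.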

Brief change log

[FLINK-39140][checkpoint] Disable CUSTOM_PARTITIONER in unaligned checkpoint it case since it does not work well
- See FLINK-39162
[FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery
[FLINK-39140][checkpoint] Fix MAX_RETAINED_CHECKPOINTS not effective in UnalignedCheckpointRescaleWithMixedExchangesITCase
- Move MAX_RETAINED_CHECKPOINTS from per-job config to MiniCluster cluster config. StandaloneCompletedCheckpointStore reads this value from the cluster-level configuration, so setting it in StreamExecutionEnvironment had no effect (defaulting to 1).
- As a result, the checkpoint selected for the next restore was sometimes deleted by subsumption when a newer checkpoint completed.
[FLINK-39140][checkpoint] Change record type from Long to String in UnalignedCheckpointRescaleWithMixedExchangesITCase
- Long records (8 bytes) allow thousands of records per buffer, causing excessive backpressure during aligned checkpoint phases (forward/rescale exchanges). Using 100-char random String records reduces the record count per buffer, shortening the time needed to drain backpressured buffers.
[hotfix] Including task name and subtask index into channel-state-unspilling thread name
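For the MAX_RETAINED_CHECKPOINTS fix in the change log above, the failure mode can be illustrated with a simplified model of a bounded checkpoint store. This is a sketch under assumptions: `BoundedStore` and `selectedSurvives` are hypothetical names and do not mirror Flink's StandaloneCompletedCheckpointStore, and the retention value 10 is an arbitrary example, not the value used in the test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RetainedCheckpointsSketch {

    /** Simplified model: keeps only the newest maxRetained completed checkpoints. */
    static final class BoundedStore {
        private final int maxRetained;
        private final Deque<Long> retained = new ArrayDeque<>();

        BoundedStore(int maxRetained) {
            this.maxRetained = maxRetained;
        }

        void add(long checkpointId) {
            retained.addLast(checkpointId);
            while (retained.size() > maxRetained) {
                retained.removeFirst(); // the oldest checkpoint is subsumed (deleted)
            }
        }

        boolean contains(long checkpointId) {
            return retained.contains(checkpointId);
        }
    }

    /** Completes checkpoints 1..newest and reports whether the selected one is still retained. */
    static boolean selectedSurvives(int maxRetained, long selected, long newest) {
        BoundedStore store = new BoundedStore(maxRetained);
        for (long id = 1; id <= newest; id++) {
            store.add(id);
        }
        return store.contains(selected);
    }

    public static void main(String[] args) {
        // Effective default of 1: checkpoint 1 is gone once checkpoint 2 completes.
        System.out.println(selectedSurvives(1, 1, 2));  // false
        // With a larger retention (example value), checkpoint 1 is still available for restore.
        System.out.println(selectedSurvives(10, 1, 2)); // true
    }
}
```

With a retention of 1, any checkpoint that completes between selecting a checkpoint and restoring from it removes the selected one, which matches the intermittent failures described.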

Verifying this change

This change enhances existing ITCases, so it is already covered by those tests.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

flinkbot · 2026-02-26T13:59:47Z

CI report:

f6da7ce Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

rkhachatryan

Thanks for the PR @1996fanrui

I've left a couple of comments, PTAL.

Apart from that, could you add a component name to the last commit message (I guess [runtime])?
And for the earlier commits, [test] seems to be a more relevant component.

flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointITCase.java

...ests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointRescaleITCase.java

…lignedCheckpointRescaleWithMixedExchangesITCase Move MAX_RETAINED_CHECKPOINTS from per-job config to MiniCluster cluster config. StandaloneCompletedCheckpointStore reads this value from the cluster-level configuration, so setting it in StreamExecutionEnvironment had no effect (defaulting to 1). This caused checkpoint subsumption to delete the selected checkpoint before the next job could restore from it.

…edCheckpointRescaleWithMixedExchangesITCase Long records (8 bytes) allow thousands of records per buffer, causing excessive backpressure during aligned checkpoint phases (forward/rescale exchanges). Using 100-char random String records reduces the record count per buffer, shortening the time needed to drain backpressured buffers.
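The record-count argument in this commit message can be checked with rough arithmetic. The numbers below are assumptions for illustration: a 32 KiB network buffer (the test may configure a different size), 8 bytes of payload per Long, and roughly 100 bytes per 100-char String, ignoring any per-record framing overhead. `recordsPerBuffer` is a hypothetical helper.

```java
public class RecordsPerBufferSketch {

    /** Upper bound on whole records that fit into one buffer, ignoring framing overhead. */
    static int recordsPerBuffer(int bufferBytes, int bytesPerRecord) {
        return bufferBytes / bytesPerRecord;
    }

    public static void main(String[] args) {
        int bufferBytes = 32 * 1024;  // assumption: 32 KiB buffers
        int longRecordBytes = Long.BYTES; // 8 bytes of payload per Long record
        int stringRecordBytes = 100;  // assumption: about 1 byte per character

        System.out.println(recordsPerBuffer(bufferBytes, longRecordBytes));   // 4096
        System.out.println(recordsPerBuffer(bufferBytes, stringRecordBytes)); // 327
    }
}
```

Under these assumptions a full buffer holds about twelve times fewer String records than Long records, so far fewer records have to be processed to drain each backpressured buffer.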

1996fanrui · 2026-03-03T17:57:34Z

Hey @rkhachatryan , thanks for the review!

Apart from that, could you add component name to the last commit message (I guess [runtime])? And for the earlier commits, [test] seems to be a more relevant component.

Makes sense, updated.

rkhachatryan

Thanks for updating the PR, LGTM.

W.r.t. CustomPartitioner (FLINK-39162), I've drafted #27731 - my guess is that we don't forward disableUnalignedCheckpoints call to the nested partitioner (I didn't look deeply into the code).

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch 2 times, most recently from bb0667e to fe5b695 Compare March 3, 2026 09:39

1996fanrui marked this pull request as ready for review March 3, 2026 09:45

1996fanrui changed the title ~~[FLINK-39140][checkpoint] Enhance Unaligned Checkpoint ITCases to perform checkpointing during recovery~~ [FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery Mar 3, 2026

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from fe5b695 to a4ef930 Compare March 3, 2026 09:51

1996fanrui requested review from pnowojski and rkhachatryan March 3, 2026 09:54

rkhachatryan reviewed Mar 3, 2026

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from a4ef930 to ac826c3 Compare March 3, 2026 17:27

1996fanrui added 5 commits March 3, 2026 18:32

[FLINK-39140][test] Disable CUSTOM_PARTITIONER in unaligned checkpoin…

2817da4

…t it case since it does not work well See FLINK-39162

[FLINK-39140][test] Allow multiple rescales in Unaligned Checkpoint I…

c76218d

…TCases to perform checkpointing during recovery

[hotfix][runtime] Including task name and subtask index into channel-…

f6da7ce

…state-unspilling thread name

1996fanrui force-pushed the 39140/enhance-uc-itcase-during-recovery branch from ac826c3 to f6da7ce Compare March 3, 2026 17:33

rkhachatryan approved these changes Mar 3, 2026

1996fanrui wants to merge 5 commits into apache:master from 1996fanrui:39140/enhance-uc-itcase-during-recovery
