[FLINK-39140][checkpoint] Allow multiple rescales in Unaligned Checkpoint ITCases to perform checkpointing during recovery #27688
rkhachatryan
left a comment
Thanks for the PR @1996fanrui
I've left a couple of comments, PTAL.
Apart from that, could you add a component name to the last commit message (I guess [runtime])?
For the earlier commits, [test] seems to be the more relevant component.
flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointITCase.java
...ests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointRescaleITCase.java
…t it case since it does not work well See FLINK-39162
…TCases to perform checkpointing during recovery
…lignedCheckpointRescaleWithMixedExchangesITCase Move MAX_RETAINED_CHECKPOINTS from per-job config to MiniCluster cluster config. StandaloneCompletedCheckpointStore reads this value from the cluster-level configuration, so setting it in StreamExecutionEnvironment had no effect (defaulting to 1). This caused checkpoint subsumption to delete the selected checkpoint before the next job could restore from it.
…edCheckpointRescaleWithMixedExchangesITCase Long records (8 bytes) allow thousands of records per buffer, causing excessive backpressure during aligned checkpoint phases (forward/rescale exchanges). Using 100-char random String records reduces the record count per buffer, shortening the time needed to drain backpressured buffers.
…state-unspilling thread name
Hey @rkhachatryan, thanks for the review!
Makes sense, updated.
rkhachatryan
left a comment
Thanks for updating the PR, LGTM.
W.r.t. CustomPartitioner (FLINK-39162), I've drafted #27731 - my guess is that we don't forward disableUnalignedCheckpoints call to the nested partitioner (I didn't look deeply into the code).
What is the purpose of the change
The current Unaligned Checkpoint ITCases restart only once, from a normal checkpoint. They do not cover restoring from a checkpoint produced during the recovery phase, which is the key scenario for checkpointing during recovery.
Proposed mechanism: After restoring from a checkpoint, wait for the first new checkpoint to be produced, then immediately trigger a restart from it. Repeat for a configurable number of rounds (≥ 2). Whether to rescale depends on the specific test case.
This mechanism works on the current master (validating normal checkpoint recovery). Once checkpointing during recovery is enabled, the same tests automatically cover recovery-phase checkpoint scenarios.
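The restart loop described above can be sketched roughly as follows. This is a plain-Java simulation, not the code in this PR: `RestartLoopSketch`, `SimulatedJob`, `awaitFirstNewCheckpoint`, and `runRounds` are hypothetical names introduced only for illustration, and checkpoints are modelled as increasing ids rather than real Flink checkpoints.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of the restart loop; no Flink classes are involved. */
public class RestartLoopSketch {

    /** Hypothetical stand-in for a job that was restored from a given checkpoint id. */
    static final class SimulatedJob {
        private final long restoredFrom;

        SimulatedJob(long restoredFrom) {
            this.restoredFrom = restoredFrom;
        }

        /**
         * Stands in for blocking until the first checkpoint newer than the
         * restore point completes. Here it simply returns the next id.
         */
        long awaitFirstNewCheckpoint() {
            return restoredFrom + 1;
        }
    }

    /**
     * Runs the given number of restore rounds and returns the checkpoint id
     * each round restored from. Rescaling between rounds is left out because
     * it depends on the specific test case.
     */
    static List<Long> runRounds(int rounds, long initialCheckpoint) {
        if (rounds < 2) {
            throw new IllegalArgumentException("rounds must be >= 2");
        }
        List<Long> restoredFrom = new ArrayList<>();
        long checkpoint = initialCheckpoint;
        for (int round = 0; round < rounds; round++) {
            // Restore from the most recent checkpoint.
            SimulatedJob job = new SimulatedJob(checkpoint);
            restoredFrom.add(checkpoint);
            // Wait for the first new checkpoint, then restart from it next round.
            checkpoint = job.awaitFirstNewCheckpoint();
        }
        return restoredFrom;
    }

    public static void main(String[] args) {
        System.out.println(runRounds(3, 1L)); // prints [1, 2, 3]
    }
}
```

In this sketch every round after the first restores from a checkpoint taken by a job that had itself just been restored, which is the situation the extra rounds are meant to exercise.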
Brief change log
UnalignedCheckpointRescaleWithMixedExchangesITCase
Verifying this change
This change enhances existing ITCases, so it is covered by those tests.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation