[FLINK-32523] Fix Timeout and Assert Error for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted #23283

masteryhx · 2023-08-24T09:26:06Z

What is the purpose of the change

This PR fixes the timeout exception and the assertion error in NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted.

From the exception stack and attached logs, I saw:

The failure reasons are not the same: some runs hit a timeout, others an assertion error.
In the timeout cases, the job failed and was then restored. In the assertion error cases, the job aborted twice (all of these cases have unaligned checkpoints enabled). In every successful run, the job never failed and ran until it finished.

So I think there are two exceptions in this ITCase:

Assert error -> The operators do not all call snapshotState strictly together for the checkpoint id that the test case marks to be declined. (This can be reproduced by adding Thread.sleep after the first verifyAllOperatorsNotifyAborted().)
Timeout exception -> Restarting (because the tolerable checkpoint failure number is 1) and the abort notification happen in different threads, so their order is nondeterministic. If the job restarts first, the test times out. (This can be reproduced by adding Thread.sleep in NormalMap#notifyCheckpointAborted.)

For the first exception, we can simply make all operators call snapshotState strictly together, which I think the ITCase should guarantee.
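To illustrate what "strictly together" means here, the following is a minimal sketch in plain Java, not the code from this PR. It assumes each operator can be modelled as a thread and uses a CountDownLatch so that no operator proceeds past its snapshot until every operator has reached it. The class and method names (SnapshotBarrierSketch, snapshotTogether) are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the test operators; this is not Flink code. */
class SnapshotBarrierSketch {

    /**
     * Simulates operatorCount operators snapshotting the same checkpoint.
     * Returns, per operator, how many operators had snapshotted at the moment
     * that operator was released from the latch.
     */
    static List<Integer> snapshotTogether(int operatorCount) throws InterruptedException {
        CountDownLatch allReached = new CountDownLatch(operatorCount);
        AtomicInteger snapshotted = new AtomicInteger();
        List<Integer> observed = new CopyOnWriteArrayList<>();
        // One thread per operator; fewer threads would deadlock on the latch.
        ExecutorService pool = Executors.newFixedThreadPool(operatorCount);
        try {
            for (int i = 0; i < operatorCount; i++) {
                pool.submit(() -> {
                    snapshotted.incrementAndGet(); // "snapshotState" happened
                    allReached.countDown();
                    try {
                        allReached.await();        // wait for every other operator
                        observed.add(snapshotted.get());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(snapshotTogether(3)); // prints [3, 3, 3]
    }
}
```

Because every operator increments the counter before counting down, each one observes the full count once released, so a later verification step cannot see a partially snapshotted state.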

For the second one, I think it is acceptable that the abort function may not be called if the job fails over (notifyCheckpointAborted is a best-effort function). So we can simply increase the tolerable checkpoint failure number.
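The effect of raising the tolerance can be sketched with a simplified model in plain Java. This is not Flink's CheckpointFailureManager. It assumes that consecutive checkpoint failures are counted, that a successful checkpoint resets the count, and that the job fails once the count exceeds the tolerable number. The class name, method name, and the checkpoint sequence below are illustrative only. In a real job the tolerance is usually set through CheckpointConfig#setTolerableCheckpointFailureNumber.

```java
/** Simplified, hypothetical model of tolerable checkpoint failures; not Flink code. */
class TolerableFailureSketch {

    /**
     * Returns the 1-based position of the checkpoint at which the job would fail,
     * or -1 if the job survives the whole sequence.
     * Assumption: only consecutive failures count, and a success resets the counter.
     */
    static int firstFailingCheckpoint(boolean[] checkpointSucceeded, int tolerableFailures) {
        int consecutiveFailures = 0;
        for (int i = 0; i < checkpointSucceeded.length; i++) {
            if (checkpointSucceeded[i]) {
                consecutiveFailures = 0;
            } else if (++consecutiveFailures > tolerableFailures) {
                return i + 1; // tolerance exceeded: the job fails and restarts
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative sequence: success, two declined checkpoints, success.
        boolean[] results = {true, false, false, true};
        System.out.println(firstFailingCheckpoint(results, 1)); // prints 3
        System.out.println(firstFailingCheckpoint(results, 2)); // prints -1
    }
}
```

With a tolerance of 1 the second consecutive failure restarts the job, which can race with the abort notification. With a higher tolerance the job keeps running and the notification has time to arrive.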

Why does the timeout exception occur only in 1.18?

This is because of FLINK-32347, which fixes the issue that exceptions from the CompletedCheckpointStore are not registered by the CheckpointFailureManager. After that fix, the job can fail once the tolerable checkpoint failure number is exceeded.

Brief change log

Guarantee that all operators trigger the checkpoint id marked to be declined together
Increase the tolerable checkpoint failure number to avoid the abort happening after the job has failed

Verifying this change

This change only fixes the failing ITCase.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? no


flinkbot · 2023-08-24T09:33:02Z

CI report:

1181b04 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build


masteryhx added 2 commits August 24, 2023 17:22

[FLINK-32523][test] Guarantee all operators triggering decline checkp…

aa20141

…oint together for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted (cherry picked from commit 66cc21d)

[FLINK-32523][test] Increase the tolerable checkpoint failure number …

1181b04

…to avoid aborting after job failing for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted(apache#23283) (cherry picked from commit 5bbdc46)

masteryhx force-pushed the release-1.18 branch from fde2a70 to 1181b04 on August 24, 2023 09:27

masteryhx closed this Aug 24, 2023

ruibinx mentioned this pull request Dec 29, 2023

[BP-1.18] [FLINK-33863][checkpoint] Fix restoring compressed operator state #24008

Merged

flinkbot added the component=Runtime/Checkpointing label Apr 4, 2024
