Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-32523] Fix Timeout and Assert Error for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted #23283

Closed
wants to merge 2 commits into from

Conversation

masteryhx
Copy link
Contributor

What is the purpose of the change

This pr wants to fix Timeout exception and assert error for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted

From the exception stack and attached logs, I saw:

  1. the failure reason is not same, timeout and assert error
  2. in the timeout cases, the job failed then restored, In the assert error cases, the job aborted two times (these cases are all enabling unaligned checkpoint), the job never failed and ran until finished for every success cases

So I think there are two exceptions in this ITCase:

  1. Assert error -> All operators haven't snapshotState together strictly for marked decline checkpoint id in the teste case (This could be reproduced by adding Thread.sleep after first verifyAllOperatorsNotifyAborted() )
  2. Timeout exception -> restarting (due to 1 tolerable checkpoint failure number) and notifying aborted occur in different threads, and the order is uncertain, if the job restart firstly, this will cause timeout exception (This could be reproduced by adding Thread.sleep in NormalMap#notifyCheckpointAborted)

For the first exception, we could just make them snapshotState together strictly which I think the ITCase should guarantee.

For the second one, I think it's acceptable that the abort function may not be called if the job failover (notifyCheckpointAborted is a best effort function). So we could just increase the tolerable checkpoint number.

Why timeout exception just occured in 1.18 ?

This is because FLINK-32347 which fixes the exception that CompletedCheckpointStore are not registered by the CheckpointFailureManager, After this, the job could fail due to the tolerable checkpoint failure number.

Brief change log

  • Guarantee all operators triggering marked decline checkpoint id together
  • Increase the tolerable checkpoint failure number to avoid aborting after job failing

Verifying this change

This change just fix the error ITCase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? no

…oint together for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted

(cherry picked from commit 66cc21d)
…to avoid aborting after job failing for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted(apache#23283)

(cherry picked from commit 5bbdc46)
@flinkbot
Copy link
Collaborator

flinkbot commented Aug 24, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

masteryhx added a commit that referenced this pull request Aug 24, 2023
…to avoid aborting after job failing for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted(#23283)

(cherry picked from commit 5bbdc46)
@masteryhx masteryhx closed this Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants