[FLINK-32523] Fix Timeout and Assert Error for NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted #23283
What is the purpose of the change
This PR fixes the timeout exception and the assertion error in NotifyCheckpointAbortedITCase#testNotifyCheckpointAborted.
From the exception stack and attached logs, I saw:
So I think there are two exceptions in this ITCase:
For the first exception, we could simply synchronize the snapshotState calls so that they strictly happen together, which I think the ITCase should guarantee.
For the second one, I think it is acceptable that the abort function may not be called if the job fails over (notifyCheckpointAborted is a best-effort function). So we could just increase the tolerable checkpoint failure number.
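One way to picture "snapshotState together strictly" is a barrier that every parallel snapshot call must reach before any of them proceeds. The following is only a minimal sketch in plain Java, not the code of the ITCase: the class `SnapshotBarrierSketch`, the method `runSnapshots`, and the event names are hypothetical, and threads stand in for Flink subtasks.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: threads stand in for subtasks, no Flink classes are used. */
class SnapshotBarrierSketch {

    /** Runs {@code parallelism} simulated snapshot calls and returns the ordered event log. */
    static List<String> runSnapshots(int parallelism) throws Exception {
        List<String> events = new CopyOnWriteArrayList<>();
        // The barrier opens only once all simulated subtasks have arrived.
        CyclicBarrier allArrived = new CyclicBarrier(parallelism);
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            for (int i = 0; i < parallelism; i++) {
                final int subtask = i;
                pool.submit(() -> {
                    events.add("arrived-" + subtask);
                    // Block until every subtask has entered its snapshot call.
                    allArrived.await(10, TimeUnit.SECONDS);
                    events.add("snapshot-" + subtask);
                    return null;
                });
            }
        } finally {
            pool.shutdown();
        }
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("simulated subtasks did not finish in time");
        }
        return events;
    }
}
```

With this arrangement no simulated subtask records its snapshot before all of them have arrived, which is the ordering guarantee described above.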
Why does the timeout exception only occur in 1.18?
This is because of FLINK-32347, which fixes the problem that exceptions from the CompletedCheckpointStore were not registered by the CheckpointFailureManager. After that fix, the job can fail once the tolerable checkpoint failure number is exceeded.
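The interaction between counted checkpoint failures and the tolerable failure number can be illustrated with a toy counter. This is a simplified model, not Flink's CheckpointFailureManager: the class `TolerableFailureSketch` and its methods are hypothetical, and the rules used here (only consecutive failures count, the job fails when the count exceeds the limit, a success resets the count) are assumptions of the sketch.

```java
/** Toy model of a tolerable-failure counter; all names here are hypothetical. */
class TolerableFailureSketch {
    private final int tolerableFailures;
    private int continuousFailures;

    TolerableFailureSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records one failed checkpoint and returns true if the job should now fail. */
    boolean onCheckpointFailure() {
        continuousFailures++;
        // Assumption: the job fails only when the count exceeds the limit.
        return continuousFailures > tolerableFailures;
    }

    /** Assumption: a completed checkpoint resets the consecutive failure count. */
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }
}
```

Under this model a limit of 0 fails the job on the first registered failure, while a larger limit lets the test survive aborted checkpoints. In a real job the limit is set through `CheckpointConfig#setTolerableCheckpointFailureNumber`.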
Brief change log
Verifying this change
This change only fixes the failing ITCase.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation