[FLINK-18063][checkpointing] Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner #12511
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request.
Automated Checks
Last check on commit 72fc053 (Sun Jun 07 14:09:43 UTC 2020)
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
…d method in CheckpointBarrierHandler
Simplify the implementations of CheckpointBarrierTracker and CheckpointBarrierUnaligner to reuse the parent default implementation. This closes apache#12460.
…atingCheckpointBarrierHandler#getAlignmentDurationNanos
We should take the value from the active handler instead of the aligned handler, because the aligned handler is only used for savepoints, and in most cases the unaligned alignment duration should be 0. This closes apache#12460.
Force-pushed from 4e9a8fe to 6778e65
…point in CheckpointBarrierUnaligner
There are three aborting scenarios which might encounter a race condition:
1. CheckpointBarrierUnaligner#processCancellationBarrier
2. CheckpointBarrierUnaligner#processEndOfPartition
3. AlternatingCheckpointBarrierHandler#processBarrier
When aborting, they only consider a pending checkpoint that was triggered by #processBarrier from the task thread. Under a race, the checkpoint might instead have been triggered by #notifyBarrierReceived from the netty thread, so that case must also be handled properly when aborting.
This closes apache#12460.
Force-pushed from 6778e65 to b4bb91c
Cherry-picked to master from #12406, which was reviewed and approved before.
What is the purpose of the change
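There are three aborting scenarios which might encounter a race condition:
1. CheckpointBarrierUnaligner#processCancellationBarrier
2. CheckpointBarrierUnaligner#processEndOfPartition
3. AlternatingCheckpointBarrierHandler#processBarrier
When aborting, they only consider a pending checkpoint that was triggered by #processBarrier from the task thread. Under a race, the checkpoint might instead have been triggered by #notifyBarrierReceived from the netty thread, so that case must also be handled properly when aborting.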
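The race can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Flink's implementation: the class `AbortRaceSketch`, its fields, and the helper `taskThreadOnlySeesPending` are hypothetical names invented for illustration, and only the method names `processBarrier`, `notifyBarrierReceived`, and `processCancellationBarrier` mirror the ones mentioned in this PR. It shows why an abort that checks only the task thread's view can miss a checkpoint that the netty thread has already triggered.

```java
/**
 * Hypothetical, simplified model of the abort race. Not Flink's actual classes:
 * all fields and the helper method below are assumptions made for illustration.
 */
final class AbortRaceSketch {
    // Latest checkpoint id seen on the task thread (models #processBarrier).
    private long consumedId = -1L;
    // Latest checkpoint id seen on the netty thread (models #notifyBarrierReceived).
    private long receivedId = -1L;
    // Highest checkpoint id that has been cancelled so far.
    private long cancelledId = -1L;

    /** Netty-thread path. Returns true if this call triggered the checkpoint. */
    synchronized boolean notifyBarrierReceived(long id) {
        if (id <= cancelledId || id <= receivedId) {
            return false; // already cancelled or already triggered
        }
        receivedId = id;
        return true;
    }

    /** Task-thread path. Returns true if this call triggered the checkpoint. */
    synchronized boolean processBarrier(long id) {
        if (id <= cancelledId || id <= consumedId) {
            return false; // already cancelled or already consumed
        }
        consumedId = id;
        if (id > receivedId) {
            receivedId = id; // the task thread was first to see this checkpoint
            return true;
        }
        return false; // the netty thread had already triggered it
    }

    /** The incomplete check: looks only at what the task thread has seen. */
    synchronized boolean taskThreadOnlySeesPending(long id) {
        return id == consumedId;
    }

    /**
     * Abort path. Returns true if a pending checkpoint with this id was discarded,
     * no matter which thread triggered it.
     */
    synchronized boolean processCancellationBarrier(long id) {
        if (id <= cancelledId) {
            return false; // this checkpoint was already aborted
        }
        boolean pending = id == consumedId || id == receivedId;
        cancelledId = id;
        // Later barriers for this id are ignored by both paths.
        consumedId = Math.max(consumedId, id);
        receivedId = Math.max(receivedId, id);
        return pending;
    }

    public static void main(String[] args) {
        // Ordering 1: netty thread triggers, then the cancellation arrives.
        AbortRaceSketch s1 = new AbortRaceSketch();
        s1.notifyBarrierReceived(7L);
        System.out.println("task-thread-only check: " + s1.taskThreadOnlySeesPending(7L));
        System.out.println("abort found pending:    " + s1.processCancellationBarrier(7L));

        // Ordering 2: task thread triggers, then the cancellation arrives.
        AbortRaceSketch s2 = new AbortRaceSketch();
        s2.processBarrier(7L);
        System.out.println("abort found pending:    " + s2.processCancellationBarrier(7L));

        // Ordering 3: cancellation arrives first, later barriers are ignored.
        AbortRaceSketch s3 = new AbortRaceSketch();
        s3.processCancellationBarrier(7L);
        System.out.println("late netty trigger:     " + s3.notifyBarrierReceived(7L));
        System.out.println("late task trigger:      " + s3.processBarrier(7L));
    }
}
```

In this model a single lock guards all three ids, which keeps the sketch easy to reason about. Whether the real handler uses one lock or a different synchronization scheme is not stated in this description.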
Brief change log
Fix AlternatingCheckpointBarrierHandler#processBarrier
Fix CheckpointBarrierUnaligner#processEndOfPartition to abort checkpoint properly
Fix CheckpointBarrierUnaligner#processCancellationBarrier to abort checkpoint properly
Verifying this change
CheckpointBarrierUnalignerTest#testProcessCancellationBarrierAfterNotifyBarrierReceived
CheckpointBarrierUnalignerTest#testProcessCancellationBarrierAfterProcessBarrier
CheckpointBarrierUnalignerTest#testProcessCancellationBarrierBeforeProcessAndReceiveBarrier
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation