-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-20654][FLINK-21104][network] Fix couple bugs in the handling of unaligned checkpoints and cancellations. #14797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit f7f5fb6 (Fri May 28 08:15:12 UTC 2021) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
...me/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGate.java
Outdated
Show resolved
Hide resolved
pnowojski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % a couple of minor comments.
...src/main/java/org/apache/flink/streaming/runtime/io/checkpointing/CheckpointedInputGate.java
Show resolved
Hide resolved
flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java
Outdated
Show resolved
Hide resolved
flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java
Outdated
Show resolved
Hide resolved
flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java
Outdated
Show resolved
Hide resolved
…n there is a failure while finishing. Also improved logging and check for data corruption.
… UnalignedCheckpoints If previous checkpoint is declined, it can happen that task receives both older and newer checkpoint barrier on two different channels, before processing any checkpoint cancellation message/RPC. If the newer checkpoint barrier happens to be processed before the obsolete one incorrect `checkState` in ChannelStatePersister would cause job failure. This checkState was assuming that the previous checkpoint would have been aborted/stopped before triggering the new one, while in reality, this previous checkpoint has never been triggered on this task so it also could not have been stopped.
…oteInputChannel This commit fixes a bug where RemoteInputChannel was incorrectly deciding which buffers should be spilled, if it has received an obsoleted CheckpointBarrier, that hasn't been cancelled (yet?).
During cancellation it may happen that CheckpointedInputGate may not poll a priority event if the corresponding channel has already been released. Until race conditions are removed, it safest to simply ignore an empty poll.
pnowojski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, azure green merging.
This PR depends on #14797 (@AHeise commits at the bottom)
This PR fixes two bugs on unaligned checkpoints. First:
Second:
Both commits are tested by the existing UnalignedCheckpointITCase and some freshly added unit tests.
Further, it addresses some issues in cancellation:
Both commits are also tested by the existing UnalignedCheckpointITCase and covered by 1 new unit tests each.
Lastly, there is a fix for UnalignedCheckpointITCase itself, which could have run indefinitively if there is a cancellation after the final expected checkpoint.
Two related side commits increase the debuggability of network code, especially in conjunction with unaligned checkpoint.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)Documentation