[FLINK-17823][network] Resolve the race condition while releasing RemoteInputChannel #12261
…oteInputChannel
RemoteInputChannel#releaseAllResources might be called by the canceler thread while the task thread is calling RemoteInputChannel#getNextBuffer. This can cause two problems:
1. The task thread might get a null buffer after the canceler thread has already released all the buffers, which leads to a misleading NPE in getNextBuffer.
2. The task thread and the canceler thread might pull the same buffer concurrently, which causes an unexpected exception when that buffer is recycled twice.
The solution is to properly synchronize the buffer queue in the release method so that the same buffer cannot be pulled by both the canceler thread and the task thread. In the getNextBuffer method, we also add explicit checks that avoid the misleading NPE and throw more meaningful exceptions instead.
Thanks for the review @pnowojski and @Jiayi-Liao. Merging!
What is the purpose of the change
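`RemoteInputChannel#releaseAllResources` might be called by the canceler thread while the task thread is calling `RemoteInputChannel#getNextBuffer`.
This can cause two problems:
1. The task thread might get a null buffer after the canceler thread has already released all the buffers, which leads to a misleading NPE in `getNextBuffer`.
2. The task thread and the canceler thread might pull the same buffer concurrently, which causes an unexpected exception when that buffer is recycled twice.
The solution is to properly synchronize the buffer queue in the release method so that the same buffer cannot be pulled by both the canceler thread and the task thread.
In the `getNextBuffer` method, we also add explicit checks that avoid the misleading NPE and throw more meaningful exceptions instead.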
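The approach described here can be illustrated with a minimal sketch. This is not the code from the PR: `SketchChannel`, `SketchBuffer`, `onBuffer`, and the exception messages are hypothetical stand-ins, and only the names `receivedBuffers`, `getNextBuffer`, and `releaseAllResources` are borrowed from the description. It assumes the release path drains the queue under the same lock the reader uses and recycles the drained buffers afterwards, so each buffer has exactly one owner.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for RemoteInputChannel; not Flink's actual code. */
final class SketchChannel {

    /** Hypothetical buffer that fails loudly when recycled twice. */
    static final class SketchBuffer {
        private boolean recycled;

        synchronized void recycle() {
            if (recycled) {
                throw new IllegalStateException("Buffer recycled twice");
            }
            recycled = true;
        }

        synchronized boolean isRecycled() {
            return recycled;
        }
    }

    // Both fields are guarded by the monitor of receivedBuffers.
    private final ArrayDeque<SketchBuffer> receivedBuffers = new ArrayDeque<>();
    private boolean released;

    /** Enqueues a buffer, or recycles it right away if the channel is released. */
    void onBuffer(SketchBuffer buffer) {
        synchronized (receivedBuffers) {
            if (!released) {
                receivedBuffers.add(buffer);
                return;
            }
        }
        buffer.recycle();
    }

    /** Task-thread side: polls under the lock and checks the result explicitly. */
    Optional<SketchBuffer> getNextBuffer() {
        final SketchBuffer next;
        final boolean wasReleased;
        synchronized (receivedBuffers) {
            next = receivedBuffers.poll();
            wasReleased = released;
        }
        if (next == null && wasReleased) {
            // Explicit check instead of a later, misleading NullPointerException.
            throw new IllegalStateException("Queried for a buffer after release");
        }
        return Optional.ofNullable(next);
    }

    /** Canceler-thread side: drains under the same lock, recycles outside it. */
    void releaseAllResources() {
        final List<SketchBuffer> drained;
        synchronized (receivedBuffers) {
            if (released) {
                return;
            }
            released = true;
            drained = new ArrayList<>(receivedBuffers);
            receivedBuffers.clear();
        }
        for (SketchBuffer buffer : drained) {
            buffer.recycle();
        }
    }
}
```

Because the drain and the poll share one lock, a buffer is handed either to the task thread or to the release path, never to both. Whether the real channel returns an empty result or throws for an unreleased channel with an empty queue is a design decision this sketch does not try to reproduce.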
Brief change log
Synchronize `receivedBuffers` in `RemoteInputChannel#releaseAllResources`
Add explicit checks in `RemoteInputChannel#getNextBuffer`
Verifying this change
New unit test in `RemoteInputChannelTest#testConcurrentGetNextBufferAndRelease`
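The shape of such a concurrency test might look like the sketch below. It is not the test from the PR: `ConcurrentReleaseSketch` and `raceOnce` are hypothetical names, and each buffer is modeled simply as an `AtomicInteger` recycle counter. It assumes two threads are started together from a latch, one polling the queue like the task thread and one draining it like the canceler, and then checks that every buffer was recycled exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical race harness; each "buffer" is just a recycle counter. */
final class ConcurrentReleaseSketch {

    /** Runs one task-vs-canceler race; true if every buffer was recycled exactly once. */
    static boolean raceOnce(int numBuffers) {
        final ArrayDeque<AtomicInteger> queue = new ArrayDeque<>();
        final List<AtomicInteger> all = new ArrayList<>();
        for (int i = 0; i < numBuffers; i++) {
            AtomicInteger recycleCount = new AtomicInteger();
            queue.add(recycleCount);
            all.add(recycleCount);
        }
        final CountDownLatch start = new CountDownLatch(1);
        final ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // "Task thread": polls one buffer at a time under the queue lock.
            Future<?> task = pool.submit(() -> {
                start.await();
                while (true) {
                    AtomicInteger next;
                    synchronized (queue) {
                        next = queue.poll();
                    }
                    if (next == null) {
                        return null;
                    }
                    next.incrementAndGet(); // consume, then recycle
                }
            });
            // "Canceler thread": drains everything under the same lock.
            Future<?> canceler = pool.submit(() -> {
                start.await();
                List<AtomicInteger> drained;
                synchronized (queue) {
                    drained = new ArrayList<>(queue);
                    queue.clear();
                }
                for (AtomicInteger buffer : drained) {
                    buffer.incrementAndGet();
                }
                return null;
            });
            start.countDown();
            task.get();
            canceler.get();
        } catch (Exception e) {
            throw new IllegalStateException("race failed", e);
        } finally {
            pool.shutdownNow();
        }
        for (AtomicInteger recycleCount : all) {
            if (recycleCount.get() != 1) {
                return false;
            }
        }
        return true;
    }
}
```

Running `raceOnce` in a loop raises the chance of hitting different interleavings, though a passing run cannot prove the absence of a race.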
Does this pull request potentially affect one of the following parts:
`@Public(Evolving)`: (yes / no)
Documentation