[FLINK-18595][network] Fix the deadlock issue by task thread and canceler thread in RemoteInputChannel #12912
Conversation
…eler thread in RemoteInputChannel

Assuming two remote channels as listeners in LocalBufferPool, the deadlock happens as follows:
1. While the Canceler thread is calling `ch1#releaseAllResources`, it will occupy ch1's bufferQueue lock and try to call `ch2#notifyBufferAvailable`.
2. While the task thread is exiting and calls `CachedBufferStorage#close`, it might release exclusive buffers for ch2. Then ch2 will occupy its bufferQueue lock and try to call `ch1#notifyBufferAvailable`.
3. ch1 and ch2 each hold their own bufferQueue lock while waiting for the other side's bufferQueue lock, causing a deadlock.

Regarding the solution, we can check the released state outside of the bufferQueue lock in `RemoteInputChannel#notifyBufferAvailable` and return immediately.
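To make the cyclic lock acquisition concrete, here is a minimal, self-contained sketch. The class and method names are invented for illustration and are not the actual Flink classes; it only shows the shape of the hazard: each channel notifies its peer while still holding its own bufferQueue lock, so two threads notifying in opposite directions can block forever.

```java
// Minimal sketch of the lock-order hazard described above (hypothetical names,
// not Flink code): each "channel" calls into its peer while holding its own lock.
public class DeadlockSketch {

    static class Channel {
        final Object bufferQueueLock = new Object();
        Channel peer;

        // Called while releasing resources or recycling a buffer.
        void recycleAndNotifyPeer() {
            synchronized (bufferQueueLock) {      // hold own bufferQueue lock ...
                peer.notifyBufferAvailable();     // ... and try to take the peer's lock
            }
        }

        void notifyBufferAvailable() {
            synchronized (bufferQueueLock) {
                // enqueue the floating buffer ...
            }
        }
    }

    public static void main(String[] args) {
        Channel ch1 = new Channel();
        Channel ch2 = new Channel();
        ch1.peer = ch2;
        ch2.peer = ch1;

        // Canceler thread: releases ch1 and notifies ch2 while holding ch1's lock.
        new Thread(ch1::recycleAndNotifyPeer, "canceler").start();
        // Task thread: recycles a buffer for ch2 and notifies ch1 while holding ch2's lock.
        new Thread(ch2::recycleAndNotifyPeer, "task").start();
        // With unlucky timing each thread holds one lock and waits for the other forever.
    }
}
```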
I considered some options to resolve this issue:

Regarding verification, I cannot reproduce this issue locally via the reported StreamFaultToleranceTestBase failure.
@flinkbot run azure
...src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java
Notice that for the deadlock: Case 1: the task thread grabs the bufferQueue lock for ch2 and gets the notification from ch1; because ch1 has already set released, this won't cause deadlock. Case 2: the Canceler is able to set releaseAllResources but is not yet able to grab ch1's lock; this won't cause deadlock either.
@curcur
So as long as two threads might recycle buffers concurrently, this potential deadlock can occur. E.g. while the canceler thread is releasing the channel and recycling its received buffers, the task thread might also recycle a buffer once that buffer has been fully consumed.
Hey @zhijiangW, I do not disagree with the solution of checking the released state. For the second point, I mean: if the two threads cleaned up the channels in the same order (sorted by channel id, for example), would that resolve the deadlock problem as well?
@curcur thanks for the explanation, I get your point now. But that is not the case for the current deadlock. In the current situation the two threads do not clean up input channels concurrently. Actually only one thread releases channels, while the other thread recycles buffers and thereby touches the internal lock of the respective input channel. So we cannot sort or otherwise control anything for the recycled buffers, which are maintained inside the LocalBufferPool.
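To illustrate that structural point, here is a hypothetical, simplified listener-queue sketch (the class and method names are invented and are not Flink's actual LocalBufferPool API): the notification callback runs on whichever thread happens to recycle a buffer, inside whatever locks that thread already holds, so the channels cannot impose a cleanup order on it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical pool sketch: channels register as listeners, and the recycling
// thread (task or canceler) invokes the next listener's callback directly.
class BufferPoolSketch {
    interface BufferListener {
        boolean notifyBufferAvailable(Object buffer);
    }

    private final Object poolLock = new Object();
    private final Queue<BufferListener> registeredListeners = new ArrayDeque<>();

    void registerListener(BufferListener listener) {
        synchronized (poolLock) {
            registeredListeners.add(listener);
        }
    }

    // Called by whichever thread recycles a buffer.
    void recycle(Object buffer) {
        BufferListener listener;
        synchronized (poolLock) {
            listener = registeredListeners.poll();
        }
        if (listener != null) {
            // The callback runs on the recycling thread, inside whatever locks that
            // thread already holds -- this is where the cross-channel locking arises.
            listener.notifyBufferAvailable(buffer);
        }
    }
}
```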
LGTM
Thanks for the review @rkhachatryan and @curcur, the azure failure seems unrelated, so merging!
What is the purpose of the change
Assuming two remote channels as listeners in LocalBufferPool, the deadlock happens as follows:

1. While the Canceler thread is calling `ch1#releaseAllResources`, it will occupy ch1's bufferQueue lock and try to call `ch2#notifyBufferAvailable`.
2. While the task thread is exiting and calls `CachedBufferStorage#close`, it might release exclusive buffers for ch2. Then ch2 will occupy its bufferQueue lock and try to call `ch1#notifyBufferAvailable`.
3. ch1 and ch2 each hold their own bufferQueue lock while waiting for the other side's bufferQueue lock, causing a deadlock.

Regarding the solution, we can check the released state outside of the bufferQueue lock in `RemoteInputChannel#notifyBufferAvailable` and return immediately.

Brief change log
Check the `isReleased` state before entering the synchronized block in `RemoteInputChannel#notifyBufferAvailable`.
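As a rough illustration of this change, here is a minimal sketch of the early-out pattern. The field and method shapes are simplified assumptions, not the actual `RemoteInputChannel` code: the released flag is consulted before the bufferQueue lock is taken, and re-checked under the lock before the buffer is queued.

```java
import java.util.ArrayDeque;

// Simplified sketch of the fix (hypothetical shapes, not Flink's RemoteInputChannel):
// a notification arriving for an already-released channel returns immediately
// instead of competing for the bufferQueue lock.
class ChannelSketch {
    private final Object bufferQueueLock = new Object();
    private final ArrayDeque<Object> floatingBuffers = new ArrayDeque<>();
    private volatile boolean isReleased;

    // Listener callback invoked by the buffer pool when a buffer becomes available.
    boolean notifyBufferAvailable(Object buffer) {
        // Early-out added by the fix: no lock is taken for a released channel.
        if (isReleased) {
            return false;
        }
        synchronized (bufferQueueLock) {
            // Re-check under the lock before actually queueing the buffer.
            if (isReleased) {
                return false;
            }
            floatingBuffers.add(buffer);
            return true;
        }
    }

    void releaseAllResources() {
        isReleased = true;
        synchronized (bufferQueueLock) {
            floatingBuffers.clear();
        }
    }
}
```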
Verifying this change
By the reported failure in `StreamFaultToleranceTestBase`.
Does this pull request potentially affect one of the following parts:

The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / no)

Documentation