Issue 1578: Fixed deadlock in auditor blocking ZK thread #1608
Conversation
```java
Thread.currentThread().interrupt();
// Do not perform blocking calls that involve making ZK calls from within the ZK
// event thread. Jump to background thread instead to avoid deadlock.
ForkJoinPool.commonPool().execute(() -> {
```
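For context, here is a minimal sketch of the pattern the diff introduces: hop off the ZK event thread before making a blocking call that itself waits on ZK responses. This is not the PR's exact code; `admin` and `checkLedger` are hypothetical placeholders.

```java
import java.util.concurrent.ForkJoinPool;

import org.apache.bookkeeper.client.BookKeeperAdmin;
import org.apache.bookkeeper.client.LedgerHandle;

// Sketch only: offload the blocking admin call from the ZK event thread
// to a background pool so the event thread stays free to deliver responses.
class OffloadSketch {
    private final BookKeeperAdmin admin;

    OffloadSketch(BookKeeperAdmin admin) {
        this.admin = admin;
    }

    void processLedger(long ledgerId) {
        ForkJoinPool.commonPool().execute(() -> {
            try {
                // Blocking call that issues ZK requests; safe here because we
                // are no longer on the ZK event thread that delivers replies.
                LedgerHandle lh = admin.openLedgerNoRecovery(ledgerId);
                checkLedger(lh);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // record the failure for this ledger and move on
            }
        });
    }

    private void checkLedger(LedgerHandle lh) {
        // placeholder for the actual per-ledger check
    }
}
```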
Don't we have some other threadpool available? If we have a BookKeeper object (inside BK admin), we could use its own main threadpool, for instance.
> Don't we have some other threadpool available?

We don't have pools for blocking operations.

> If we have a BookKeeper object (inside BK admin), we could use its own main threadpool for instance

I don't think we need a pool for blocking calls. All the pools in BK are for non-blocking operations.
Where does the backpressure occur with this? Let's say there are 1000 ledgers, and we create a ledgerManager.asyncProcessLedgers with this Processor implementation. Will 1000 tasks be submitted concurrently to the ForkJoinPool? Will this mean 1000 threads are created? I think the sync calls may have acted as a kind of nasty backpressure in the past.
No. There was a thread back in 2014 asking for changing autorecovery-related methods from sync to async.
I think there is a parallelism cap; the default parallelism is `Runtime.getRuntime().availableProcessors() - 1`.
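For reference, the common pool's cap can be checked with the standard JDK API (this snippet is illustrative, not code from the PR):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolParallelism {
    public static void main(String[] args) {
        // The common pool caps its worker threads at this parallelism level
        // (availableProcessors() - 1 by default, overridable via
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N).
        // Submitting 1000 tasks queues them; it does not create 1000 threads.
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}
```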
### Motivation

Fixes #1578

After getting a ZK callback from the ZK event thread, we need to jump to a background thread before doing the synchronous call to `admin.openLedgerNoRecovery(ledgerId);`, which will try to make a ZK request and wait for a response (which would be coming through the same ZK event thread that is currently blocked).

Author: Matteo Merli <mmerli@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1608 from merlimat/fix-auditor-deadlock, closes #1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
@merlimat @sijie @ivankelly I have a few questions here.
@jvrao @athanatos fyi.
It was always there.
The ZK "event-thread" is blocked, so nothing else using ZK will work.
I think the issue was named (not by me) based on the initially perceived behavior. The analysis of the stack trace is pretty clear on what the root problem is. It is a big problem to mix sync and async operations in ZK. It is imperative to not do anything blocking from a ZK callback thread.
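To illustrate the general principle, here is a standalone sketch (not the Auditor's code; the znode paths are placeholders) of why a synchronous call on the same ZooKeeper client from inside one of its callbacks blocks: the callback runs on the client's only event thread, so the response the sync call is waiting for can never be delivered.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: a sync call on the same ZooKeeper client from within one of
// its own callbacks self-deadlocks.
public class ZkSelfDeadlockSketch {
    static void demonstrate(ZooKeeper zk) {
        zk.getData("/some/znode", false, (rc, path, ctx, data, stat) -> {
            // This callback runs on the ZK client's single event thread.
            try {
                // exists() blocks until its response is dispatched by that same
                // event thread, which is stuck right here -> self-deadlock.
                zk.exists("/another/znode", false);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            } catch (KeeperException ke) {
                // handle ZK error
            }
        }, null);
    }
}
```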
@reddycharan: let me start with a summary of my thoughts. As you can read in the comment above, I raised concerns before that AutoRecovery calling sync methods in async callbacks is a very bad practice, so I wouldn't be surprised if more deadlocks are found in other places in AutoRecovery. This PR is not intended to address the whole "calling-sync-methods-in-async-callback" problem in AutoRecovery; it is intended to address the problem reported in #1578 first and get the bugfix available for 4.7.2, so the scope is limited to the bug reported in #1578.

Regarding the test: I think it is a bit hard to reproduce the sequence of this race condition, and the change is limited in scope to the sync call in the stack trace reported in #1578. Moving the sync call to a dedicated thread pool, without blocking the ZooKeeper callback thread, is a simple and straightforward fix. As the aim of this PR is to address the specific stack trace in #1578, this change is okay to go in as a bugfix for the immediate concerns in AutoRecovery. The long-term fix for AutoRecovery is to audit all the sync calls in callbacks and make them async, which I have raised before: http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201411.mbox/%3CCAO2yDyZo5AzYgE%3D%3Dk8C5bifA3Miv-2A9w%2B_-6aS%2Bwa4zxRGcOw%40mail.gmail.com%3E

More details on your questions:
I believe this has been there since the beginning, not just a regression. As I pointed out, AutoRecovery has multiple places that call sync methods in callbacks.
You will only have this issue when you call a sync method (waiting for a ZooKeeper result) in a ZooKeeper callback thread; not every "call-sync-methods-in-async-callback" case will have this issue. I don't think the other components are a concern. However, as a good practice, we need to clean up the sync calls in async callbacks; that is a very bad practice, but it is a big problem to address in AutoRecovery.
Yes. That's why the issue reporter says "Auditor periodic check only run once": the Auditor executor is blocked.
I think "Auditor run periodic check only once" is the reporter observed behavior. The race condition can be happening at any "checkAllLedgers" run, not necessarily to be the first one. if you look into the code, for each Auditor checkAllLedgers, a new zookeeper client is established, so the race condition can happen any any
I think this race condition depends on timing, and timing-related issues are usually very hard to cover or catch via test cases.
I think it is a bit hard to reproduce the sequence of this race condition, especially since it relates to timing in the ZooKeeper client callback, and the fix is straightforward: move the blocking call to a separate thread. That's why we make the exception to merge this as a bugfix for 4.7.2.
@eolivelli: I just created #1617 for addressing the broader issue of AutoRecovery.
@sijie, from @merlimat's description I understood that "admin.openLedgerNoRecovery" (f782a9d#diff-7525f06ad3a1ad0a00a462df4deb4698L645) will be blocked consistently. That's why I was wondering how we were okay so far (5 years since 005b62c was introduced), since the ZK thread deadlock will eventually lead to the Auditor being non-functional. If you are saying that we would only run into the issue because of a race condition in the ZK library, then it makes some sense why this issue was not fully identified so far. That being said, I'm just wondering, at a very high level, how probable is it to get into this ZK thread deadlock? Since it effectively makes the Auditor non-functional, I would like to ascertain how vulnerable we have been so far.
Btw, I described in my earlier comment why this ZK thread deadlock will lead to the Auditor being non-functional.
@sijie @merlimat From the call stack trace reported in issue #1578, we can say that the Auditor's single-threaded executor ('executor') is hung while waiting on "processDone.await()" in the checkAllLedgers method. So technically, even with this fix, there is still scope for the 'processDone' countdown latch not being counted down to zero (for whatever reason). In that case the executor will again be blocked and the Auditor will become non-functional. So I believe the important fix needed here is to not wait forever on this latch (https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L701); instead, have some timeout and move on. Ideally I would move the checkers' functionality to some other threadpool/executor, so that it won't impact the core functionality of the Auditor, which is super critical in the autoreplication system.

"AuditorBookie-XXXXX:3181" #40 daemon prio=5 os_prio=0 tid=0x00007f049c117830 nid=0x5da4 waiting on condition [0x00007f0477dfc000]
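For illustration, a minimal sketch of the bounded-wait idea being suggested; the 30-minute timeout, class name and logging are hypothetical, not code from this PR or the Auditor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch only: wait on the latch with a timeout instead of forever, so the
// auditor's executor thread cannot be parked indefinitely by a stuck check.
class BoundedWaitSketch {
    static void runCheck(CountDownLatch processDone) throws InterruptedException {
        // asyncProcessLedgers(...) is expected to call processDone.countDown()
        // once every ledger has been processed.
        if (!processDone.await(30, TimeUnit.MINUTES)) {
            // Give up on this run instead of blocking forever; the next
            // scheduled check can try again.
            System.err.println("checkAllLedgers did not finish in time; skipping this run");
        }
    }
}
```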
@merlimat @jvrao @reddycharan: I actually looked at the stack trace again. The problem is a race condition, and the fix here is okay as well; however, the race condition is not quite the one described in the issue. Let me summarize my findings and hope these will make things clearer for Charan's questions.

**ZooKeeper deadlock?**

Based on Matteo's initial investigation, there is a deadlock at the ZooKeeper thread.

The stack trace here is very confusing. As you can see from the stack trace, we are calling the synchronous `admin.openLedgerNoRecovery` (which itself waits on ZooKeeper) from a ZooKeeper callback thread. However, as I said, the stack trace is very, very confusing, because that kind of self-deadlock would only happen if the sync call went through the same ZooKeeper client that delivered the callback; for every checkAllLedgers, a brand new ZooKeeper client is created for opening ledgers, so a pure ZooKeeper self-deadlock should not be what happened here.

**What is the real deadlock?**

I looked again at the stack trace here: https://gist.github.com/hrsakai/d65e8e2cd511173232b1010a9bbdf126

**In summary, the real deadlock happens because:**

- the ZK event thread is blocked in the synchronous `openLedgerNoRecovery` call, waiting for a completion that runs on a BookKeeper callback thread; and
- that BookKeeper callback thread is blocked in a synchronous ZooKeeper call publishing a suspected ledger, waiting for a response that can only be delivered by the blocked ZK event thread.

**Both are waiting for each other.** This happens rarely because BookKeeper has multiple callback threads; most of the time, the callback thread which publishes a suspected ledger is different from the thread that completes the open call.

**What is the right fix?**

The fix here is okay, but it is not the complete fix. The right fix is to move all synchronous calls out of BK callback threads and the ZK callback thread. That's the first immediate action we can take. Then consider rewriting those sync methods to use async methods, to completely get rid of the sync logic.

Above are all my findings regarding this issue. Below I will leave my thoughts on some of Charan's and JV's questions.
There is only one send thread and one receive thread per ZooKeeper client; it is not configurable.
Yes, if we call the same ZK client's sync method in its callback.
I don't think "not wait forever" on this latch is the right fix. for sure you can timeout and let next check kick in. However the problem here is deadlock makes both bookkeeper and zookeeper client are not functionable. so introducing timeout just delay and hide the problem. In a production system, I would rather than let it wait forever and use metric (e.g. the number of auditor runs) for alerting. for example, if the number of auditor runs doesn't increment for more than checkIntervals, that means auditor is stuck. then an alert should be triggered.
Yeah, dedicated threadpools for the different checkers are a good idea. However, I think the most critical fix is to not call sync methods in async callbacks.
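As a sketch of that direction, roughly what an async-only per-ledger step could look like, assuming an async open variant such as `asyncOpenLedgerNoRecovery` is available on the admin client; the class, `onDone`, and `checkLedgerAsync` names are illustrative, not the actual #1619 change.

```java
import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.BookKeeperAdmin;
import org.apache.bookkeeper.client.LedgerHandle;

// Sketch only: replace the blocking open with an async open so no callback
// thread ever waits for another callback thread.
class AsyncCheckSketch {
    private final BookKeeperAdmin admin;

    AsyncCheckSketch(BookKeeperAdmin admin) {
        this.admin = admin;
    }

    void processLedger(long ledgerId, Runnable onDone) {
        // Assumes an async variant of openLedgerNoRecovery; the callback is
        // invoked on a BK client thread, so it must not block either.
        admin.asyncOpenLedgerNoRecovery(ledgerId, (rc, lh, ctx) -> {
            if (rc != BKException.Code.OK) {
                onDone.run();
                return;
            }
            checkLedgerAsync(lh, onDone); // hypothetical async checker
        }, null);
    }

    private void checkLedgerAsync(LedgerHandle lh, Runnable onDone) {
        // placeholder: kick off LedgerChecker-style async fragment checks,
        // then call onDone.run() from their completion callback.
        onDone.run();
    }
}
```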
Thanks @sijie for the detailed analysis!
Thanks @sijie for the detailed analysis. Even with it, this needs more digging to understand exactly what is going on. I think the reason for the confusion/uncertainty/convolutedness is that each call of Auditor.checkAllLedgers has partially its own set of resources, but not all of them. Auditor.checkAllLedgers creates its own 'newzk', 'client', 'admin', and 'checker', but for 'ledgerManager' and 'ledgerUnderreplicationManager' (ProcessLostFragmentsCb in Auditor.checkAllLedgers uses 'ledgerUnderreplicationManager') it uses the Auditor's instance variables. Because of this, 2 ZK clients * 2 threads (each ZK client's IO thread and event thread), 2 BK clients * 1 mainWorkerPool (an OrderedExecutor for each BK client), and the Auditor's single-threaded executor are all in play here. And our maze of callbacks in this component makes understanding the control/execution transfer between threads super complicated. I'm not sure if this is intentional, but reasoning out any such issue takes multiple pairs of eyes and multiple hours of debugging. Can you please consider evaluating whether Auditor.checkAllLedgers needs its own set of resources or not? If it needs its own set of resources, then is it okay to go only partway? Anyhow, I'm glad I started this conversation, since I'm not convinced by the original issue description and the description of the fix.
If Auditor.checkAllLedgers reused the Auditor's own zk, client, admin and checker, it would be a self-deadlock situation; and if Auditor.checkAllLedgers had its own ledgerManager and ledgerUnderreplicationManager along with the other newly created resources, we would also be in a self-deadlock situation. So AFAIU this issue was not fully discovered because partial new-resource creation is allowed in the original commit 005b62c.
With the current code, it needs its own zk, otherwise it is a self-deadlock. After my change #1619, it doesn't need its own resources.
The problem will be gone if we don't make sync calls in async callbacks, so my fix #1619 should be the correct fix. Please take a look.
Motivation

Fixes #1578

After getting a ZK callback on the ZK event thread, we need to jump to a background thread before doing the synchronous call to

`admin.openLedgerNoRecovery(ledgerId);`

which will try to make a ZK request and wait for a response (which would be coming through the same ZK event thread that is currently blocked).