Auditor run Periodic check only once #1578
Comments
There is a deadlock in the ZK event thread.
The Auditor is performing a blocking operation (one that involves a ZK request) from the ZK event thread.
@eolivelli this should be a blocker for 4.8.0 as well, and it would be great to cherry-pick it back to 4.7.2.
Sure
sijie
pushed a commit
that referenced
this issue
Aug 21, 2018
### Motivation

Fixes #1578

After getting a ZK callback on the ZK event thread, we need to jump to a background thread before making the synchronous call to `admin.openLedgerNoRecovery(ledgerId);`, which makes a ZK request and waits for a response (a response that would arrive through the same ZK event thread that is currently blocked).

Author: Matteo Merli <mmerli@apache.org>
Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1608 from merlimat/fix-auditor-deadlock, closes #1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
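To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described in this commit message. It does not use BookKeeper or ZooKeeper: `EVENT_THREAD`, `BACKGROUND`, `sendRequest`, and `blockingOpen` are hypothetical stand-ins for the ZK event thread, a background executor, an asynchronous ZK request, and a synchronous call such as `admin.openLedgerNoRecovery(ledgerId)`. A 200 ms timeout is assumed so that the broken variant fails instead of hanging forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class EventThreadHandOff {
    // Daemon single-thread executor, so a stuck demo never keeps the JVM alive.
    static ExecutorService singleDaemonThread(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    // Hypothetical stand-in for the single ZK event thread: all responses arrive here.
    static final ExecutorService EVENT_THREAD = singleDaemonThread("event-thread");
    // Hypothetical background thread on which blocking is acceptable.
    static final ExecutorService BACKGROUND = singleDaemonThread("background");

    // Hypothetical async request: its response is delivered on the event thread.
    static CompletableFuture<String> sendRequest(String what) {
        CompletableFuture<String> response = new CompletableFuture<>();
        EVENT_THREAD.execute(() -> response.complete("response:" + what));
        return response;
    }

    // Hypothetical synchronous call: sends a request, then blocks for the response.
    static String blockingOpen(long ledgerId) throws Exception {
        return sendRequest("ledger-" + ledgerId).get(200, TimeUnit.MILLISECONDS);
    }

    static void runInto(CompletableFuture<String> result, long ledgerId) {
        try {
            result.complete(blockingOpen(ledgerId));
        } catch (Exception e) {
            result.completeExceptionally(e);
        }
    }

    // Broken: the callback blocks on the event thread, so the response it waits
    // for is queued behind it and can never be delivered in time.
    static CompletableFuture<String> brokenCallback(long ledgerId) {
        CompletableFuture<String> result = new CompletableFuture<>();
        EVENT_THREAD.execute(() -> runInto(result, ledgerId));
        return result;
    }

    // Fixed: the callback only hands the work to a background thread and returns,
    // leaving the event thread free to deliver the response.
    static CompletableFuture<String> fixedCallback(long ledgerId) {
        CompletableFuture<String> result = new CompletableFuture<>();
        EVENT_THREAD.execute(() -> BACKGROUND.execute(() -> runInto(result, ledgerId)));
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fixed:  " + fixedCallback(42L).get());
        try {
            brokenCallback(42L).get();
        } catch (ExecutionException e) {
            System.out.println("broken: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

In this sketch the broken variant fails with a `TimeoutException` only because of the assumed timeout; with an unbounded wait it would block the event thread indefinitely, which matches the symptom of the Auditor never running its periodic check again.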
sijie
added a commit
to sijie/bookkeeper
that referenced
this issue
Aug 22, 2018
…nager

### Motivation

The Auditor has multiple places that call sync methods in async callbacks. This raises the possibility of hitting a deadlock. Issue apache#1578 is one example.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated` is the only interface that will be called in async callbacks. This change provides an async version of `markLedgerUnderreplicated`.

### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: apache#1578
Master Issue: apache#1617
sijie
added a commit
that referenced
this issue
Aug 27, 2018
…licationManager

Descriptions of the changes in this PR:

### Motivation

The Auditor has multiple places that call sync methods in async callbacks. This raises the possibility of hitting a deadlock. Issue #1578 is one example.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated` is the only interface that will be called in async callbacks. This change provides an async version of `markLedgerUnderreplicated`.

### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: #1578
Master Issue: #1617

Author: Sijie Guo <sijie@apache.org>
Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery
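The shape of this change, an async method as the primitive with the sync method layered on top of it, can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `UnderreplicationStore`, `InMemoryStore`, and the method signatures are hypothetical, and an in-memory map stands in for the metadata store.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;

class AsyncMarkSketch {
    // Hypothetical interface; names and signatures are assumptions for illustration.
    interface UnderreplicationStore {
        // Async primitive: safe to call from a callback because it never blocks.
        CompletableFuture<Void> markUnderreplicatedAsync(long ledgerId, String missingReplica);

        // Sync variant built on the async one. It blocks the caller, so it must
        // not be invoked from the thread that completes the future.
        default void markUnderreplicated(long ledgerId, String missingReplica)
                throws ExecutionException, InterruptedException {
            markUnderreplicatedAsync(ledgerId, missingReplica).get();
        }
    }

    // In-memory stand-in for the real metadata store.
    static class InMemoryStore implements UnderreplicationStore {
        final Map<Long, Set<String>> marked = new ConcurrentHashMap<>();

        @Override
        public CompletableFuture<Void> markUnderreplicatedAsync(long ledgerId, String missingReplica) {
            return CompletableFuture.runAsync(() ->
                marked.computeIfAbsent(ledgerId, id -> ConcurrentHashMap.newKeySet())
                      .add(missingReplica));
        }
    }

    public static void main(String[] args) throws Exception {
        InMemoryStore store = new InMemoryStore();
        // Callback style: chain the next step instead of blocking inside a callback.
        store.markUnderreplicatedAsync(7L, "bookie-a")
             .thenCompose(ignored -> store.markUnderreplicatedAsync(7L, "bookie-b"))
             .thenRun(() -> System.out.println("marked: " + store.marked.get(7L).size()))
             .get();
        // Sync style: acceptable here because main is not an event thread.
        store.markUnderreplicated(8L, "bookie-c");
        System.out.println("ledgers: " + store.marked.size());
    }
}
```

Keeping a single async implementation and deriving the sync method from it means callers inside callbacks can chain futures, while existing sync callers keep working unchanged.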
sijie
added a commit
that referenced
this issue
Aug 27, 2018
…licationManager

Descriptions of the changes in this PR:

### Motivation

The Auditor has multiple places that call sync methods in async callbacks. This raises the possibility of hitting a deadlock. Issue #1578 is one example.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated` is the only interface that will be called in async callbacks. This change provides an async version of `markLedgerUnderreplicated`.

### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: #1578
Master Issue: #1617

Author: Sijie Guo <sijie@apache.org>
Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery

(cherry picked from commit 73b428c)
Signed-off-by: Sijie Guo <sijie@apache.org>
This is fixed by #1619
reddycharan
pushed a commit
to reddycharan/bookkeeper
that referenced
this issue
Oct 17, 2018
### Motivation

Fixes apache#1578

After getting a ZK callback on the ZK event thread, we need to jump to a background thread before making the synchronous call to `admin.openLedgerNoRecovery(ledgerId);`, which makes a ZK request and waits for a response (a response that would arrive through the same ZK event thread that is currently blocked).

Author: Matteo Merli <mmerli@apache.org>
Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes apache#1608 from merlimat/fix-auditor-deadlock, closes apache#1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
(cherry picked from commit 51040cf)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>
reddycharan
pushed a commit
to reddycharan/bookkeeper
that referenced
this issue
Oct 17, 2018
…licationManager

Descriptions of the changes in this PR:

### Motivation

The Auditor has multiple places that call sync methods in async callbacks. This raises the possibility of hitting a deadlock. Issue apache#1578 is one example.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated` is the only interface that will be called in async callbacks. This change provides an async version of `markLedgerUnderreplicated`.

### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: apache#1578
Master Issue: apache#1617

Author: Sijie Guo <sijie@apache.org>
Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes apache#1619 from sijie/async_sync_autorecovery

(cherry picked from commit 3e01125)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>

Checkstyle fix
BUG REPORT
What did you do?
In our cluster, the Auditor runs the periodic check only once. When the interval expires after the first periodic check, the Auditor does not run the check again.
To run the periodic check again, we have to restart the Auditor bookie.
Auditor's thread dump
https://gist.github.com/hrsakai/d65e8e2cd511173232b1010a9bbdf126
It seems that the `AuditorBookie` thread is stopped, waiting on a `CountDownLatch`, for some reason.
I saw many timeout messages in the Auditor's log file.
What did you expect to see?
The Auditor runs the periodic check each time the interval expires.
What did you see instead?
The Auditor runs the periodic check only once.
System configuration
BookKeeper version : 4.7.0
Number of Bookies: 5
Ensemble size: 2
Write quorum size: 2
Ack quorum size: 2
Periodic check interval: 1 day