Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auditor run Periodic check only once #1578

Closed
hrsakai opened this issue Aug 3, 2018 · 4 comments
Closed

Auditor run Periodic check only once #1578

hrsakai opened this issue Aug 3, 2018 · 4 comments

Comments

@hrsakai
Copy link
Contributor

hrsakai commented Aug 3, 2018

BUG REPORT

What did you do?

In our cluster, Auditor run periodic check only once. If interval expires after first periodic check, auditor will not run periodic check.
If we want to run periodic check again, we have to restart auditor bookie.

Auditor's thread dump
It seems that AuditorBookie thread stop by CountDownLatch with some reason.
https://gist.github.com/hrsakai/d65e8e2cd511173232b1010a9bbdf126

"AuditorBookie-XXXXX:3181" #40 daemon prio=5 os_prio=0 tid=0x00007f049c117830 nid=0x5da4 waiting on condition [0x00007f0477dfc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e04e54f8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.apache.bookkeeper.replication.Auditor.checkAllLedgers(Auditor.java:696)
        at org.apache.bookkeeper.replication.Auditor$5.run(Auditor.java:359)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I saw many timed-out logs in Auditor's log file.

23:14:56.769 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  o.a.b.proto.PerChannelBookieClient   - Timed-out 2 operations to channel [id: 0xa17320e1, L:/AAAA:38234 - R:XXXX.co.jp/BBBB:3181] for BBBB:3181
23:14:56.921 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  o.a.b.proto.PerChannelBookieClient   - Timed-out 48 operations to channel [id: 0x359c20bd, L:/AAAA:38222 - R:BBBB/BBBB:3181] for BBBB:3181
23:15:37.768 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  o.a.b.proto.PerChannelBookieClient   - Timed-out 2 operations to channel [id: 0xa17320e1, L:/AAAA:38234 - R:XXXX.co.jp/BBBB:3181] for BBBB:3181
23:15:37.921 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  o.a.b.proto.PerChannelBookieClient   - Timed-out 91 operations to channel [id: 0x359c20bd, L:/AAAA:38222 - R:BBBB/BBBB:3181] for BBBB:3181
・
・

What did you expect to see?

Auditor run periodic check after every interval expires.

What did you see instead?

Auditor run periodic check only once.

System configuration
BookKeeper version : 4.7.0
Number of Bookies: 5
Ensemble size: 2
Write quorum size: 2
Ack quorum size: 2
Priodic check interval: 1day

@merlimat
Copy link
Contributor

merlimat commented Aug 3, 2018

There is a dead-lock in ZK thread.

"main-EventThread" #11 daemon prio=5 os_prio=0 tid=0x00007f05385d3aa0 nid=0x5bd2 waiting on condition [0x00007f05207f0000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e1374598> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
	at org.apache.bookkeeper.client.SyncCallbackUtils.waitForResult(SyncCallbackUtils.java:45)
	at org.apache.bookkeeper.client.BookKeeperAdmin.openLedgerNoRecovery(BookKeeperAdmin.java:327)
	at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:645)
	at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:627)
	at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:510)
	at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:484)
	at org.apache.bookkeeper.util.ZkUtils$6$1.processResult(ZkUtils.java:285)
	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$32$1.processResult(ZooKeeperClient.java:1395)
	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$31$1.processResult(ZooKeeperClient.java:1356)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)

Auditor is doing blocking operation (that involves ZK request) from ZK event thread.

@sijie
Copy link
Member

sijie commented Aug 9, 2018

@eolivelli this should be a blocker for 4.8.0 as well. and it would be great to cherry-pick back to 4.7.2

@eolivelli
Copy link
Contributor

Sure

@sijie sijie modified the milestones: 4.8.0, 4.9.0 Aug 21, 2018
@sijie sijie closed this as completed in f782a9d Aug 21, 2018
sijie pushed a commit that referenced this issue Aug 21, 2018
### Motivation

Fixes #1578

After getting ZK callback from ZK event thread, we need to jump to a background thread before doing synchronous call to `admin.openLedgerNoRecovery(ledgerId);` which will try to make a ZK request a wait for a response (which would be coming through same ZK event thread currently blocked..)

Author: Matteo Merli <mmerli@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1608 from merlimat/fix-auditor-deadlock, closes #1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
sijie pushed a commit that referenced this issue Aug 21, 2018
### Motivation

Fixes #1578

After getting ZK callback from ZK event thread, we need to jump to a background thread before doing synchronous call to `admin.openLedgerNoRecovery(ledgerId);` which will try to make a ZK request a wait for a response (which would be coming through same ZK event thread currently blocked..)

Author: Matteo Merli <mmerli@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1608 from merlimat/fix-auditor-deadlock, closes #1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
@sijie sijie reopened this Aug 22, 2018
sijie added a commit to sijie/bookkeeper that referenced this issue Aug 22, 2018
…nager

 ### Motivation

Auditor has multiple places calling sync methods in async callbacks.
This raises the possibility hitting deadlock. Issue apache#1578 is one of the examples.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated`
is the only interface that will be called in async callbacks. This change is
to provide an async version of `markLedgerUnderreplicated`.

 ### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: apache#1578
Master Issue: apache#1617
sijie added a commit that referenced this issue Aug 27, 2018
…licationManager

Descriptions of the changes in this PR:

 ### Motivation

Auditor has multiple places calling sync methods in async callbacks.
This raises the possibility hitting deadlock. Issue #1578 is one of the examples.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated`
is the only interface that will be called in async callbacks. This change is
to provide an async version of `markLedgerUnderreplicated`.

 ### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: #1578
Master Issue: #1617

Author: Sijie Guo <sijie@apache.org>

Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery
sijie added a commit that referenced this issue Aug 27, 2018
…licationManager

Descriptions of the changes in this PR:

 ### Motivation

Auditor has multiple places calling sync methods in async callbacks.
This raises the possibility hitting deadlock. Issue #1578 is one of the examples.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated`
is the only interface that will be called in async callbacks. This change is
to provide an async version of `markLedgerUnderreplicated`.

 ### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: #1578
Master Issue: #1617

Author: Sijie Guo <sijie@apache.org>

Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery

(cherry picked from commit 73b428c)
Signed-off-by: Sijie Guo <sijie@apache.org>
sijie added a commit that referenced this issue Aug 27, 2018
…licationManager

Descriptions of the changes in this PR:

 ### Motivation

Auditor has multiple places calling sync methods in async callbacks.
This raises the possibility hitting deadlock. Issue #1578 is one of the examples.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated`
is the only interface that will be called in async callbacks. This change is
to provide an async version of `markLedgerUnderreplicated`.

 ### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: #1578
Master Issue: #1617

Author: Sijie Guo <sijie@apache.org>

Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery
@sijie
Copy link
Member

sijie commented Aug 27, 2018

This is fixed by #1619

@sijie sijie closed this as completed Aug 27, 2018
reddycharan pushed a commit to reddycharan/bookkeeper that referenced this issue Oct 17, 2018
### Motivation

Fixes apache#1578

After getting ZK callback from ZK event thread, we need to jump to a background thread before doing synchronous call to `admin.openLedgerNoRecovery(ledgerId);` which will try to make a ZK request a wait for a response (which would be coming through same ZK event thread currently blocked..)

Author: Matteo Merli <mmerli@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes apache#1608 from merlimat/fix-auditor-deadlock, closes apache#1578

(cherry picked from commit f782a9d)
Signed-off-by: Sijie Guo <sijie@apache.org>
(cherry picked from commit 51040cf)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>
reddycharan pushed a commit to reddycharan/bookkeeper that referenced this issue Oct 17, 2018
…licationManager

Descriptions of the changes in this PR:

 ### Motivation

Auditor has multiple places calling sync methods in async callbacks.
This raises the possibility hitting deadlock. Issue apache#1578 is one of the examples.

After looking into the `LedgerUnderreplicationManager`, `markLedgerUnderreplicated`
is the only interface that will be called in async callbacks. This change is
to provide an async version of `markLedgerUnderreplicated`.

 ### Changes

- add `markLedgerUnderreplicatedAsync` interface in `LedgerUnderreplicationManager`.
- implement the logic of `markLedgerUnderreplicated` using async callbacks
- use `markLedgerUnderreplicatedAsync` in the Auditor

Related Issues: apache#1578
Master Issue: apache#1617

Author: Sijie Guo <sijie@apache.org>

Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes apache#1619 from sijie/async_sync_autorecovery

(cherry picked from commit 3e01125)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>

Checkstyle fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants