[Branch-2.7] Fixed deadlock on metadata cache missing while doing che… #12484

Merged: 1 commit, Jul 28, 2022

Conversation

merlimat
Contributor

Motivation

After the changes in #12340, there were still a couple of places making blocking calls. These calls occupy all the ordered scheduler threads, preventing the callbacks from completing until the 30-second timeout expires.

"bookkeeper-ml-scheduler-OrderedScheduler-7-0" #50 prio=5 os_prio=0 tid=0x00007f2d40050000 nid=0xe5 waiting on condition [0x00007f2d998d0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f38940080e0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
	at org.apache.pulsar.broker.service.persistent.PersistentTopic.checkReplication(PersistentTopic.java:1152)
	at org.apache.pulsar.broker.service.BrokerService$3.openLedgerComplete(BrokerService.java:1107)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$asyncOpen$8(ManagedLedgerFactoryImpl.java:425)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$581/978469035.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$2.initializeComplete(ManagedLedgerFactoryImpl.java:397)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$3$1.operationComplete(ManagedLedgerImpl.java:498)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$1.operationComplete(ManagedCursorImpl.java:316)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$1.operationComplete(ManagedCursorImpl.java:289)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$asyncGetCursorInfo$11(MetaStoreImpl.java:170)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl$$Lambda$679/542144696.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"pulsar-ordered-OrderedExecutor-0-0" #13 prio=5 os_prio=0 tid=0x00007f3f73dac800 nid=0xc1 waiting on condition [0x00007f2de07e1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f38940388f8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
	at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$43(BrokerService.java:1199)
	at org.apache.pulsar.broker.service.BrokerService$$Lambda$455/163843091.run(Unknown Source)
	at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49)
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Instead, this change converts the code to use getAsync().
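
As a rough illustration of the difference (not the actual patch), the sketch below contrasts a blocking get() with composing on the future returned by an async lookup. PolicyCache, checkReplicationBlocking, and checkReplicationAsync are hypothetical names invented for this example; they are not the real ZooKeeperDataCache or PersistentTopic APIs, and the single-thread executor only stands in for an ordered scheduler thread.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AsyncCacheSketch {
    // Hypothetical stand-in for a metadata cache; NOT the real ZooKeeperDataCache API.
    interface PolicyCache {
        CompletableFuture<Optional<String>> getAsync(String path);
    }

    // Blocking style: parks the calling thread until the lookup completes or times out.
    static Optional<String> checkReplicationBlocking(PolicyCache cache, String path) throws Exception {
        return cache.getAsync(path).get(30, TimeUnit.SECONDS);
    }

    // Non-blocking style: returns immediately and continues in a callback
    // once the lookup result is available.
    static CompletableFuture<String> checkReplicationAsync(PolicyCache cache, String path) {
        return cache.getAsync(path)
                .thenApply(policies -> policies.map(p -> "replicate:" + p).orElse("no-policies"));
    }

    public static void main(String[] args) throws Exception {
        // A single thread stands in for one ordered scheduler thread (assumption for the demo).
        ExecutorService ordered = Executors.newSingleThreadExecutor();
        try {
            // The cache completes its lookups on the same single thread.
            PolicyCache cache = path ->
                    CompletableFuture.supplyAsync(() -> Optional.of("us-west"), ordered);

            // The first task runs on the ordered thread, then chains the async lookup
            // instead of waiting for it, so the thread is free to run the lookup.
            CompletableFuture<String> result = CompletableFuture
                    .supplyAsync(() -> "tenant/ns", ordered)
                    .thenCompose(path -> checkReplicationAsync(cache, path));

            System.out.println(result.get(5, TimeUnit.SECONDS));
        } finally {
            ordered.shutdown();
        }
    }
}
```

If checkReplicationBlocking were called from a task running on that same single thread, the lookup would be queued behind the caller and the thread would stay parked until the timeout, which is the pattern visible in the stack traces above.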

@merlimat added the type/bug, doc-not-needed, and release/2.7.4 labels on Oct 25, 2021
@merlimat self-assigned this on Oct 25, 2021
@315157973
Contributor

/pulsarbot run-failure-checks

@codelipenghui
Contributor

/pulsarbot run-failure-checks

@hangc0276
Contributor

/pulsarbot run-failure-checks

@codelipenghui
Contributor

/pulsarbot run-failure-checks

@lhotari lhotari force-pushed the fix-deadlock-check-replication branch from 531a596 to f53ddc1 Compare February 10, 2022 13:53
@lhotari
Member

lhotari commented Feb 10, 2022

I rebased the changes. Let's see what the test failures are.

@lhotari
Member

lhotari commented Feb 11, 2022

There are too many failures, so I'm not confident about picking this into the 2.7.5 release.

@github-actions

The PR had no activity for 30 days; marking it with the Stale label.

@Jason918 Jason918 merged commit 32fe228 into apache:branch-2.7 Jul 28, 2022
Jason918 pushed a commit to Jason918/pulsar that referenced this pull request Jul 30, 2022
Jason918 added a commit that referenced this pull request Jul 31, 2022
* Revert "[fix][proxy] Fix client service url (#16834)"

This reverts commit 10b4e99.

* Revert "[Build] Use grpc-bom to align grpc library versions (#15234)"

This reverts commit 99c93d2.

* Revert "upgrade aircompressor to 0.20 (#11790)"

This reverts commit 5ad16b6.

* Revert "[Branch-2.7] Fixed deadlock on metadata cache missing while doing checkReplication (#12484)"

This reverts commit 32fe228.

* Revert changes of PersistentTopic#getMessageTTL in #12339.

Co-authored-by: JiangHaiting <janghaiting@apache.org>
Jason918 pushed a commit to Jason918/pulsar that referenced this pull request Jul 31, 2022
@Jason918
Contributor

This PR broke branch-2.7 and has been reverted.
I opened a new PR to fix this; see #16889.
@merlimat @codelipenghui @lhotari

Labels
doc-not-needed, lifecycle/stale, release/2.7.5, type/bug