Fix a deadlock that can happen when multiple stages execute in parallel in a worker #16524

Closed
LakshSingla wants to merge 3 commits into apache:master from LakshSingla:deadlock-run-all-fully

@LakshSingla
Contributor

Description

This PR fixes a deadlock that can happen when a worker executes multiple stages in parallel. In that case, multiple instances of RunAllFullyWidget share a single Bouncer, which makes the deadlock possible.

The deadlock happens as follows:
The bouncer hands out tickets to RunAllFullyWidget1 and RunAllFullyWidget2. Two threads call cleanup() on the two widgets at the same time:

Thread1 has the lock on RunAllFullyWidget1 and Thread2 has the lock on RunAllFullyWidget2 (since they have entered the cleanup() method).

cleanup() gives back the tickets that the widgets have acquired to the bouncer.

Thread1 giving back the ticket triggers the listener on the RunAllFullyWidget2 to acquire the ticket.
Thread2 giving back the ticket triggers the listener on the RunAllFullyWidget1 to acquire the ticket.

However, neither operation can succeed, since each thread holds the lock that the other thread needs.
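
The hazard described above comes down to a ticket being given back while the widget's lock is still held, so the next holder's listener runs under that lock. The following is a minimal sketch of that pattern and of one common way to avoid it (decide under the lock, give the ticket back after releasing it). It is not the code from this PR and may not match the fix that was made: `SimpleBouncer`, `Widget`, `cleanupHoldingLock`, `cleanupOutsideLock`, and `callbackRunsUnderWidgetLock` are hypothetical names, and the real Bouncer hands out tickets through futures rather than plain callbacks.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class TicketReleaseSketch {
    /** Hypothetical stand-in for a shared bouncer; this is not Druid's Bouncer API. */
    static final class SimpleBouncer {
        private final Object lock = new Object();
        private final Queue<Runnable> waiters = new ArrayDeque<>();
        private int available;

        SimpleBouncer(int tickets) {
            this.available = tickets;
        }

        /** Runs onTicket right away if a ticket is free, otherwise queues it. */
        void acquire(Runnable onTicket) {
            synchronized (lock) {
                if (available == 0) {
                    waiters.add(onTicket);
                    return;
                }
                available--;
            }
            onTicket.run();
        }

        /** Returns a ticket. A queued waiter, if any, runs on the caller's thread. */
        void giveBack() {
            Runnable next;
            synchronized (lock) {
                next = waiters.poll();
                if (next == null) {
                    available++;
                    return;
                }
            }
            next.run();
        }
    }

    /** Hypothetical widget that guards its state with its own lock. */
    static final class Widget {
        final Object lock = new Object();
        private final SimpleBouncer bouncer;
        private boolean hasTicket;

        Widget(SimpleBouncer bouncer) {
            this.bouncer = bouncer;
        }

        void start() {
            bouncer.acquire(() -> {
                synchronized (lock) {
                    hasTicket = true;
                }
            });
        }

        /** Hazardous shape: another widget's listener runs while this lock is held. */
        void cleanupHoldingLock() {
            synchronized (lock) {
                if (hasTicket) {
                    hasTicket = false;
                    bouncer.giveBack();
                }
            }
        }

        /** Safer shape: update state under the lock, give the ticket back after releasing it. */
        void cleanupOutsideLock() {
            boolean release;
            synchronized (lock) {
                release = hasTicket;
                hasTicket = false;
            }
            if (release) {
                bouncer.giveBack();
            }
        }
    }

    /** Reports whether a queued waiter's callback ran while the widget's lock was held. */
    static boolean callbackRunsUnderWidgetLock(boolean releaseInsideLock) {
        SimpleBouncer bouncer = new SimpleBouncer(1);
        Widget widget = new Widget(bouncer);
        widget.start(); // takes the only ticket
        boolean[] seen = {false};
        bouncer.acquire(() -> {
            seen[0] = Thread.holdsLock(widget.lock); // probe stands in for a second widget's listener
        });
        if (releaseInsideLock) {
            widget.cleanupHoldingLock();
        } else {
            widget.cleanupOutsideLock();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(callbackRunsUnderWidgetLock(true));
        System.out.println(callbackRunsUnderWidgetLock(false));
    }
}
```

In the sketch, the probe callback plays the role of the other widget's listener. When it runs under the first widget's lock and then tries to take its own widget's lock, two threads doing this in opposite directions can block each other, which is the cycle described above.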

jstack output of the deadlock (thanks to @Akshat-Jain for providing it):

"processing-8":
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.lambda$run$3(RunAllFullyWidget.java:234)
        - waiting to lock <0x00000006dab7d578> (a java.lang.Object)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable$$Lambda$717/0x0000000800a4e840.run(Unknown Source)
        at org.apache.druid.java.util.common.concurrent.DirectExecutorService.execute(DirectExecutorService.java:81)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:782)
        at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
        at org.apache.druid.frame.processor.Bouncer$Ticket.giveBack(Bouncer.java:117)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.cleanup(RunAllFullyWidget.java:365)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.cleanupIfNoMoreProcessors(RunAllFullyWidget.java:344)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.run(RunAllFullyWidget.java:208)
        - locked <0x00000006dab87378> (a java.lang.Object)
        at org.apache.druid.msq.exec.WorkerImpl$1$2.run(WorkerImpl.java:836)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.22/Executors.java:515)
        at java.util.concurrent.FutureTask.run$$$capture(java.base@11.0.22/FutureTask.java:264)
        at java.util.concurrent.FutureTask.run(java.base@11.0.22/FutureTask.java)
        at org.apache.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:259)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
"processing-5":
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.lambda$run$3(RunAllFullyWidget.java:234)
        - waiting to lock <0x00000006dab87378> (a java.lang.Object)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable$$Lambda$717/0x0000000800a4e840.run(Unknown Source)
        at org.apache.druid.java.util.common.concurrent.DirectExecutorService.execute(DirectExecutorService.java:81)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:782)
        at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
        at org.apache.druid.frame.processor.Bouncer$Ticket.giveBack(Bouncer.java:117)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.lambda$run$3(RunAllFullyWidget.java:236)
        - locked <0x00000006dab7d578> (a java.lang.Object)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable$$Lambda$717/0x0000000800a4e840.run(Unknown Source)
        at org.apache.druid.java.util.common.concurrent.DirectExecutorService.execute(DirectExecutorService.java:81)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:782)
        at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
        at org.apache.druid.frame.processor.Bouncer$Ticket.giveBack(Bouncer.java:117)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.cleanup(RunAllFullyWidget.java:365)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.cleanupIfNoMoreProcessors(RunAllFullyWidget.java:344)
        at org.apache.druid.frame.processor.RunAllFullyWidget$RunAllFullyRunnable.run(RunAllFullyWidget.java:208)
        - locked <0x00000006dab7d578> (a java.lang.Object)
        at org.apache.druid.msq.exec.WorkerImpl$1$2.run(WorkerImpl.java:836)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.22/Executors.java:515)
        at java.util.concurrent.FutureTask.run$$$capture(java.base@11.0.22/FutureTask.java:264)
        at java.util.concurrent.FutureTask.run(java.base@11.0.22/FutureTask.java)
        at org.apache.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:259)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

Release note


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions

github-actions bot commented Aug 5, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 5, 2024
@github-actions

github-actions bot commented Sep 2, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 2, 2024