
[fix][broker] Fix deadlock while skipping non-recoverable ledgers. #21915

Merged
merged 4 commits into apache:master from hrzzzz:fix-deadlock on Jan 22, 2024

Conversation

hrzzzz
Contributor

@hrzzzz hrzzzz commented Jan 18, 2024

Fixes #21914

Motivation

Fixes the deadlock that occurs when org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger and org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete are called concurrently.

Modifications

The deadlock occurs because org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger acquires a write lock over a wide scope, apparently only to check !individualDeletedMessages.contains(ledgerId, i). However, org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete already performs this check internally to ensure that positions which have been deleted are not deleted again. We can therefore remove the write lock from skipNonRecoverableLedger, drop the !individualDeletedMessages.contains(ledgerId, i) check, and call asyncDelete directly, as sketched below.
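For illustration, here is a rough before/after sketch of the change. It is a simplified stand-in, not the actual ManagedCursorImpl code: the class name, the isIndividuallyDeleted helper (standing in for individualDeletedMessages.contains) and the reduced asyncDelete signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in illustrating the change; not the actual broker code.
class SkipNonRecoverableLedgerSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical placeholder for individualDeletedMessages.contains(ledgerId, entryId).
    private boolean isIndividuallyDeleted(long ledgerId, long entryId) {
        return false;
    }

    // Hypothetical placeholder for asyncDelete(...), which already skips
    // positions that have been individually deleted.
    private void asyncDelete(long ledgerId, List<Long> entryIds) {
        // mark-delete machinery elided
    }

    // Before: the write lock spans the whole loop, only to guard the contains() check,
    // and asyncDelete() is invoked while the write lock is still held.
    void skipNonRecoverableLedgerBefore(long ledgerId, long lastEntryId) {
        lock.writeLock().lock();
        try {
            List<Long> toDelete = new ArrayList<>();
            for (long i = 0; i <= lastEntryId; i++) {
                if (!isIndividuallyDeleted(ledgerId, i)) {
                    toDelete.add(i);
                }
            }
            asyncDelete(ledgerId, toDelete);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // After: no write lock and no contains() check; asyncDelete() de-duplicates internally.
    void skipNonRecoverableLedgerAfter(long ledgerId, long lastEntryId) {
        List<Long> toDelete = new ArrayList<>();
        for (long i = 0; i <= lastEntryId; i++) {
            toDelete.add(i);
        }
        asyncDelete(ledgerId, toDelete);
    }
}
```

The key point is that the "after" version never holds the cursor's read-write lock while calling asyncDelete, so it can no longer participate in the lock-ordering cycle described in #21914.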

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: hrzzzz#4

github-actions bot added the doc-not-needed label Jan 18, 2024
@hrzzzz
Contributor Author

hrzzzz commented Jan 18, 2024

@poorbarcode @codelipenghui PTAL, thanks

@Technoboy- Technoboy- added this to the 3.2.0 milestone Jan 21, 2024
Technoboy- added the release/3.0.3, release/3.1.3, and release/blocker labels Jan 21, 2024
Member

@lhotari lhotari left a comment


It would be great to have the Iterable optimization proposed in a review comment.

@codecov-commenter

codecov-commenter commented Jan 21, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison: base (17bb322) 73.59% vs. head (13bb9bb) 73.62%.
The report is 1 commit behind head on master.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #21915      +/-   ##
============================================
+ Coverage     73.59%   73.62%   +0.03%     
- Complexity    32417    32427      +10     
============================================
  Files          1861     1861              
  Lines        138678   138675       -3     
  Branches      15188    15185       -3     
============================================
+ Hits         102060   102106      +46     
+ Misses        28715    28682      -33     
+ Partials       7903     7887      -16     
| Flag | Coverage Δ |
|---|---|
| inttests | 24.07% <18.18%> (-0.06%) ⬇️ |
| systests | 23.63% <18.18%> (-0.05%) ⬇️ |
| unittests | 72.92% <81.81%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...n/java/org/apache/pulsar/broker/PulsarService.java | 82.07% <100.00%> (-0.24%) ⬇️ |
| ...ache/pulsar/client/admin/internal/BrokersImpl.java | 87.50% <100.00%> (ø) |
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | 79.37% <80.00%> (-0.14%) ⬇️ |
| ...g/apache/pulsar/broker/admin/impl/BrokersBase.java | 70.86% <50.00%> (ø) |

... and 73 files with indirect coverage changes

@poorbarcode poorbarcode merged commit 5a65e98 into apache:master Jan 22, 2024
52 checks passed
Technoboy- added a commit that referenced this pull request Jan 22, 2024
### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger, which first obtains the write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this point, Thread B holds the write lock and is trying to acquire the synchronized monitor on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock, which cannot be granted until Thread B releases the write lock.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid holding the write lock over a wide scope.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
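For reference, the lock-ordering conflict described in the commit message above can be reproduced in a minimal, self-contained sketch. The field names lock and pendingMarkDeleteOps follow the description; the class and method names are hypothetical, and the lock is assumed to be a ReentrantReadWriteLock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the two code paths described above;
// not the actual ManagedCursorImpl code.
class CursorDeadlockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Object pendingMarkDeleteOps = new Object();

    // Thread A: asyncDelete -> internalAsyncMarkDelete -> persistPositionToLedger.
    // Lock order: monitor on pendingMarkDeleteOps, then the read lock.
    void threadAPath() {
        synchronized (pendingMarkDeleteOps) {
            lock.readLock().lock();      // blocks if another thread holds the write lock
            try {
                // buildIndividualDeletedMessageRanges() would run here
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    // Thread B (before the fix): skipNonRecoverableLedger -> asyncDelete.
    // Lock order: write lock, then the monitor on pendingMarkDeleteOps -- the reverse of Thread A.
    void threadBPathBeforeFix() {
        lock.writeLock().lock();
        try {
            synchronized (pendingMarkDeleteOps) {   // blocks if Thread A holds the monitor
                // asyncDelete() work would run here
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With the fix, the skipNonRecoverableLedger path no longer takes the write lock before calling asyncDelete, so only one acquisition order remains and the cycle cannot form.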
@hrzzzz hrzzzz deleted the fix-deadlock branch January 25, 2024 10:17
Technoboy- added a commit that referenced this pull request Jan 31, 2024
Technoboy- added a commit that referenced this pull request Feb 5, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
(cherry picked from commit 37fc40c)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
Labels
  • cherry-picked/branch-3.0
  • cherry-picked/branch-3.1
  • doc-not-needed (Your PR changes do not impact docs)
  • ready-to-test
  • release/blocker (Indicate the PR or issue that should block the release until it gets resolved)
  • release/3.0.3
  • release/3.1.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Deadlock occurs when skipping non-recoverable ledger
5 participants