
[fix][broker] Fix deadlock while skipping non-recoverable ledgers. #21915

Merged
merged 4 commits into apache:master from hrzzzz:fix-deadlock on Jan 22, 2024

Conversation

hrzzzz
Contributor

@hrzzzz hrzzzz commented Jan 18, 2024

Fixes #21914

Motivation

Fixes the deadlock that occurs when org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger and org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete are called concurrently.

Modifications

The deadlock occurs because org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger acquires a write lock over a wide scope, apparently only to check !individualDeletedMessages.contains(ledgerId, i). However, org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete already performs this check internally to ensure that positions which have been deleted are not deleted again. We can therefore remove the write lock from skipNonRecoverableLedger, drop the !individualDeletedMessages.contains(ledgerId, i) check, and call asyncDelete directly, as sketched below.
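For illustration, here is a rough before/after sketch of the change. It is a simplified stand-in, not the actual ManagedCursorImpl code: the class name, the isIndividuallyDeleted helper (standing in for individualDeletedMessages.contains) and the reduced asyncDelete signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in illustrating the change; not the actual broker code.
class SkipNonRecoverableLedgerSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical placeholder for individualDeletedMessages.contains(ledgerId, entryId).
    private boolean isIndividuallyDeleted(long ledgerId, long entryId) {
        return false;
    }

    // Hypothetical placeholder for asyncDelete(...), which already skips
    // positions that have been individually deleted.
    private void asyncDelete(long ledgerId, List<Long> entryIds) {
        // mark-delete machinery elided
    }

    // Before: the write lock spans the whole loop, only to guard the contains() check,
    // and asyncDelete() is invoked while the write lock is still held.
    void skipNonRecoverableLedgerBefore(long ledgerId, long lastEntryId) {
        lock.writeLock().lock();
        try {
            List<Long> toDelete = new ArrayList<>();
            for (long i = 0; i <= lastEntryId; i++) {
                if (!isIndividuallyDeleted(ledgerId, i)) {
                    toDelete.add(i);
                }
            }
            asyncDelete(ledgerId, toDelete);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // After: no write lock and no contains() check; asyncDelete() de-duplicates internally.
    void skipNonRecoverableLedgerAfter(long ledgerId, long lastEntryId) {
        List<Long> toDelete = new ArrayList<>();
        for (long i = 0; i <= lastEntryId; i++) {
            toDelete.add(i);
        }
        asyncDelete(ledgerId, toDelete);
    }
}
```

The key point is that the "after" version never holds the cursor's read-write lock while calling asyncDelete, so it can no longer participate in the lock-ordering cycle described in #21914.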

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: hrzzzz#4

github-actions bot added the doc-not-needed label Jan 18, 2024
@hrzzzz
Contributor Author

hrzzzz commented Jan 18, 2024

@poorbarcode @codelipenghui PTAL, thanks

@Technoboy- Technoboy- added this to the 3.2.0 milestone Jan 21, 2024
Technoboy- added the release/3.0.3, release/3.1.3, and release/blocker labels Jan 21, 2024
Member

@lhotari lhotari left a comment


It would be great to have the Iterable optimization proposed in a review comment.

@codecov-commenter

codecov-commenter commented Jan 21, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison: base (17bb322) 73.59% vs. head (13bb9bb) 73.62%.
The report is 1 commit behind head on master.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #21915      +/-   ##
============================================
+ Coverage     73.59%   73.62%   +0.03%     
- Complexity    32417    32427      +10     
============================================
  Files          1861     1861              
  Lines        138678   138675       -3     
  Branches      15188    15185       -3     
============================================
+ Hits         102060   102106      +46     
+ Misses        28715    28682      -33     
+ Partials       7903     7887      -16     
| Flag | Coverage Δ |
|---|---|
| inttests | 24.07% <18.18%> (-0.06%) ⬇️ |
| systests | 23.63% <18.18%> (-0.05%) ⬇️ |
| unittests | 72.92% <81.81%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...n/java/org/apache/pulsar/broker/PulsarService.java | 82.07% <100.00%> (-0.24%) ⬇️ |
| ...ache/pulsar/client/admin/internal/BrokersImpl.java | 87.50% <100.00%> (ø) |
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | 79.37% <80.00%> (-0.14%) ⬇️ |
| ...g/apache/pulsar/broker/admin/impl/BrokersBase.java | 70.86% <50.00%> (ø) |

... and 73 files with indirect coverage changes

@poorbarcode poorbarcode merged commit 5a65e98 into apache:master Jan 22, 2024
52 checks passed
Technoboy- added a commit that referenced this pull request Jan 22, 2024
### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger, which first obtains the write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this point, Thread B holds the write lock and is trying to acquire the synchronized monitor on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock, which cannot be granted until Thread B releases the write lock.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid holding the write lock over a wide scope.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
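For reference, the lock-ordering conflict described in the commit message above can be reproduced in a minimal, self-contained sketch. The field names lock and pendingMarkDeleteOps follow the description; the class and method names are hypothetical, and the lock is assumed to be a ReentrantReadWriteLock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the two code paths described above;
// not the actual ManagedCursorImpl code.
class CursorDeadlockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Object pendingMarkDeleteOps = new Object();

    // Thread A: asyncDelete -> internalAsyncMarkDelete -> persistPositionToLedger.
    // Lock order: monitor on pendingMarkDeleteOps, then the read lock.
    void threadAPath() {
        synchronized (pendingMarkDeleteOps) {
            lock.readLock().lock();      // blocks if another thread holds the write lock
            try {
                // buildIndividualDeletedMessageRanges() would run here
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    // Thread B (before the fix): skipNonRecoverableLedger -> asyncDelete.
    // Lock order: write lock, then the monitor on pendingMarkDeleteOps -- the reverse of Thread A.
    void threadBPathBeforeFix() {
        lock.writeLock().lock();
        try {
            synchronized (pendingMarkDeleteOps) {   // blocks if Thread A holds the monitor
                // asyncDelete() work would run here
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With the fix, the skipNonRecoverableLedger path no longer takes the write lock before calling asyncDelete, so only one acquisition order remains and the cycle cannot form.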
@hrzzzz hrzzzz deleted the fix-deadlock branch January 25, 2024 10:17
Technoboy- added a commit that referenced this pull request Jan 31, 2024
Technoboy- added a commit that referenced this pull request Feb 5, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
(cherry picked from commit 37fc40c)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
Labels
  • cherry-picked/branch-3.0
  • cherry-picked/branch-3.1
  • doc-not-needed (Your PR changes do not impact docs)
  • ready-to-test
  • release/blocker (Indicate the PR or issue that should block the release until it gets resolved)
  • release/3.0.3
  • release/3.1.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Deadlock occurs when skipping non-recoverable ledger
5 participants