[fix][broker] Fix deadlock while skip non-recoverable ledgers. #21915
Conversation
@poorbarcode @codelipenghui PTAL, thanks
It would be great to have the Iterable optimization proposed in a review comment.
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
## master #21915 +/- ##
============================================
+ Coverage 73.59% 73.62% +0.03%
- Complexity 32417 32427 +10
============================================
Files 1861 1861
Lines 138678 138675 -3
Branches 15188 15185 -3
============================================
+ Hits 102060 102106 +46
+ Misses 28715 28682 -33
+ Partials 7903 7887 -16
Flags with carried forward coverage won't be shown.
### Motivation

The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).
2. Inside internalAsyncMarkDelete, internalMarkDelete is called, which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.
3. Concurrently, Thread B executes skipNonRecoverableLedger, which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.
4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a wide-scoped lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
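To make the lock-ordering cycle above easier to see, here is a minimal, self-contained Java sketch of the same pattern. The class and method bodies only mirror the names mentioned in the description; this is an illustration of the two acquisition orders, not the actual ManagedCursorImpl code.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal illustration of the lock-ordering deadlock described above
// (not the real ManagedCursorImpl implementation).
public class LockOrderingDeadlockSketch {
    private final Object pendingMarkDeleteOps = new Object();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Mirrors Thread A: asyncDelete -> internalAsyncMarkDelete -> persistPositionToLedger.
    void threadA() {
        synchronized (pendingMarkDeleteOps) {        // 1st lock: the monitor
            lock.readLock().lock();                  // 2nd lock: blocks while a writer holds it
            try {
                // buildIndividualDeletedMessageRanges() would run here
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    // Mirrors Thread B: skipNonRecoverableLedger -> asyncDelete.
    void threadB() {
        lock.writeLock().lock();                     // 1st lock: the write lock
        try {
            synchronized (pendingMarkDeleteOps) {    // 2nd lock: blocks while Thread A holds the monitor
                // internalAsyncMarkDelete() would run here
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockOrderingDeadlockSketch s = new LockOrderingDeadlockSketch();
        new Thread(s::threadA).start();
        new Thread(s::threadB).start();  // with unlucky timing, both threads wait on each other forever
    }
}
```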
Fixes #21914
Motivation
Resolved the deadlock issue that occurred when the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method and the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method were called concurrently.
Modifications
The reason for the deadlock is that the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method internally acquired a write lock with a large scope, seemingly just to check !individualDeletedMessages.contains(ledgerId, i). However, the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method actually performs this check again internally to ensure that already deleted positions are not deleted repeatedly. Therefore, we can remove the write lock in the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method and omit the !individualDeletedMessages.contains(ledgerId, i) check, instead directly calling the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method, as sketched below.
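As a rough illustration of the resulting control flow, the following self-contained sketch shows the write lock removed from the skip path. The Position record, the simplified asyncDelete signature, and the loop bounds are placeholders for this illustration only, not the code that was actually merged.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the revised control flow, not the merged Pulsar code.
// The names below (Position, the simplified asyncDelete) are placeholders
// standing in for the real ManagedCursorImpl members.
class SkipNonRecoverableLedgerSketch {

    record Position(long ledgerId, long entryId) {}

    // Stand-in for ManagedCursorImpl#asyncDelete: it takes the
    // pendingMarkDeleteOps monitor itself and skips already-deleted positions,
    // so the caller needs neither the contains() pre-check nor the write lock.
    void asyncDelete(List<Position> positions) {
        // ... synchronized (pendingMarkDeleteOps) { ... } ...
    }

    // Revised shape of skipNonRecoverableLedger: collect the positions of the
    // non-recoverable ledger and delegate straight to asyncDelete, without
    // holding lock.writeLock() across the call, so the lock-ordering cycle
    // can no longer form.
    void skipNonRecoverableLedger(long ledgerId, long lastEntryId) {
        List<Position> positions = new ArrayList<>();
        for (long entryId = 0; entryId <= lastEntryId; entryId++) {
            positions.add(new Position(ledgerId, entryId));
        }
        asyncDelete(positions); // the individualDeletedMessages check happens inside
    }
}
```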
Verifying this change
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: hrzzzz#4