Reimplement writer-priority rwlock #3286
Open
chenBright wants to merge 1 commit into apache:master
Conversation
The previous Go-like RWLock bumps the reader count first and rolls it back only on the full success path. The state is not reversible on a partial failure, so a read timeout while a writer holds the lock leaves a dangling reader credit and permanently blocks future writers (issue apache#3051). For the same reason, the old implementation could not offer correct try_/timed_ APIs at all.
Co-authored-by: hairet
Pull request overview
This PR reimplements bthread_rwlock_t with a writer-priority algorithm to fix incorrect accounting in cancellable (try_* / timed_*) RWLock operations (issue #3051), and adds unit tests covering writer-priority semantics, cleanup correctness, and memory-ordering expectations.
Changes:
- Replaced the previous Go-like RWLock state machine with a writer-priority design using `lock_word` (butex), `writer_wait_count` (butex), and `writer_queue_mutex`.
- Updated the public `bthread_rwlock_t` struct layout to match the new implementation.
- Expanded RWLock unit tests to validate writer-priority behavior, cleanup after timed-out writers, and data consistency; minor perf-test logging tweaks.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/bthread/rwlock.cpp | New writer-priority RWLock implementation, new contention sampling flow, new init/destroy paths. |
| src/bthread/types.h | Updates bthread_rwlock_t fields to match the new implementation (butex pointers + mutex). |
| test/bthread_rwlock_unittest.cpp | Adds new behavioral tests for writer-priority and cleanup; tweaks perf output formatting and concurrency setting. |
Comment on lines 408 to 426
```diff
 int bthread_rwlock_init(bthread_rwlock_t* __restrict rwlock,
                         const bthread_rwlockattr_t* __restrict) {
-    int rc = bthread_sem_init(&rwlock->reader_sema, 0);
-    if (BAIDU_UNLIKELY(0 != rc)) {
-        return rc;
+    rwlock->writer_wait_count = bthread::butex_create_checked<unsigned>();
+    rwlock->lock_word = bthread::butex_create_checked<unsigned>();
+    if (NULL == rwlock->writer_wait_count || NULL == rwlock->lock_word) {
+        LOG(ERROR) << "Out of memory";
+        return ENOMEM;
     }
-    bthread_sem_disable_csite(&rwlock->reader_sema);
-    rc = bthread_sem_init(&rwlock->writer_sema, 0);
-    if (BAIDU_UNLIKELY(0 != rc)) {
-        bthread_sem_destroy(&rwlock->reader_sema);
-        return rc;
-    }
-    bthread_sem_disable_csite(&rwlock->writer_sema);
-    rwlock->reader_count = 0;
-    rwlock->reader_wait = 0;
-    rwlock->wlock_flag = false;
+    *rwlock->writer_wait_count = 0;
+    *rwlock->lock_word = 0;

     bthread_mutexattr_t attr;
     bthread_mutexattr_init(&attr);
+    // Disable csite on the inner queue mutex so the writer's wait time is
+    // accounted exactly once -- by the rwlock layer, not double-counted via
+    // the inner mutex.
     bthread_mutexattr_disable_csite(&attr);
-    rc = bthread_mutex_init(&rwlock->write_queue_mutex, &attr);
-    if (BAIDU_UNLIKELY(0 != rc)) {
-        bthread_sem_destroy(&rwlock->reader_sema);
-        bthread_sem_destroy(&rwlock->writer_sema);
-        return rc;
-    }
+    bthread_mutex_init(&rwlock->writer_queue_mutex, &attr);
     bthread_mutexattr_destroy(&attr);
```
Comment on lines 431 to 434

```diff
 int bthread_rwlock_destroy(bthread_rwlock_t* rwlock) {
-    bthread_sem_destroy(&rwlock->reader_sema);
-    bthread_sem_destroy(&rwlock->writer_sema);
-    bthread_mutex_destroy(&rwlock->write_queue_mutex);
+    bthread::butex_destroy(rwlock->writer_wait_count);
+    bthread::butex_destroy(rwlock->lock_word);
     return 0;
```
Comment on lines +177 to +178

```cpp
        LOG(ERROR) << "Invalid unrlock!";
        return -1;
```
Comment on lines +91 to +92

```cpp
int rwlock_rdlock(bthread_rwlock_t* rwlock, bool try_lock,
                  const struct timespec* abstime) {
```
Comment on lines +463 to +470

```cpp
int bthread_rwlock_unrlock(bthread_rwlock_t* rwlock) {
    return bthread::rwlock_unrdlock(rwlock);
}

int bthread_rwlock_unwlock(bthread_rwlock_t* rwlock) {
    return bthread::rwlock_unwrlock(rwlock);
}
```
Comment on lines +134 to +146

```cpp
    // No writer in flight: try to add ourselves to the reader count.
    // 2^31 readers should be enough for any realistic workload.
    unsigned l = lock_word->load(butil::memory_order_relaxed);
    if ((l >> 31) == 0) {
        // Acquire on success synchronizes-with the release-CAS in
        // unwrlock(), so any data written by the previous writer is
        // visible to us before we start reading.
        if (lock_word->compare_exchange_weak(l, l + 1,
                                             butil::memory_order_acquire,
                                             butil::memory_order_relaxed)) {
            rc = 0;
            break;
        }
```
```cpp
    bool wlock_flag;  // Flag used to indicate that a write lock has been held.
    bthread_mutex_t write_queue_mutex;  // Held if there are pending writers.
    bthread_contention_site_t writer_csite;
    // Count of bthreads currently holding the write lock.
```
What problem does this PR solve?
Issue Number: resolve #3051
Problem Summary:
The current bthread RWLock is implemented with a Go-like state machine where the reader count is bumped first and only rolled back if everything later succeeds. As pointed out in #3051, this state is not safely reversible on a partial failure:
- A read timeout while a writer holds the lock leaves the reader counter dangling. Subsequent writers can no longer acquire the lock, even after every legitimate reader has left.
- For the same reason, the old implementation could not offer correct `try_*`/`timed_*` APIs at all -- any cancellable wait risks corrupting the lock's accounting.
PR #1031 already proposed a writer-priority rwlock that fits this requirement;
this PR is built on top of that proposal.
What is changed and the side effects?
Changed:
- Reimplement `bthread_rwlock_t` on top of a writer-priority algorithm derived from PR #1031. The lock state is split into:
  - `lock_word` (butex): bit 31 = writer held, bits 0..30 = reader count;
  - `writer_wait_count` (butex): in-flight writer counter used by readers to honor writer priority;
  - `writer_queue_mutex`: serializes writers competing for `lock_word`.
- Every wait/park step is fully reversible, so the lock now correctly supports `try_*`/`timed_*` operations: `bthread_rwlock_tryrdlock`/`bthread_rwlock_timedrdlock`, `bthread_rwlock_trywrlock`/`bthread_rwlock_timedwrlock`, and `bthread_rwlock_unlock` (auto-dispatches to read or write unlock by inspecting `lock_word`).
- On any failure path (`EBUSY`, `ETIMEDOUT`, `EINTR` leading to failure), the side effects are rolled back via `rwlock_wrlock_cleanup`: release the inner mutex if held, decrement `writer_wait_count`, and wake parked readers if we were the last in-flight writer.
- Replace the unconditional broadcast in unlock with selective wakeup:
  - `unrdlock`: only the last reader (`lock_word` 1 -> 0) does `butex_wake(lock_word)` for the at-most-one writer parked on it (writers are serialized by `writer_queue_mutex`, so a single wake suffices). Earlier readers do not wake anyone.
  - `unwrlock`: deliberately unlocks `writer_queue_mutex` first (so the next queued writer -- which has already self-accounted in `writer_wait_count` -- gets a chance), and only does `butex_wake_all(writer_wait_count)` when we were the last in-flight writer (`fetch_sub` returned 1). This intentional ordering preserves strict writer priority and is documented in detail in the source.
- This eliminates the previous "wake everybody, let them re-contend" pattern for both readers and writers, removing a wakeup stampede on contended workloads.
Co-authored-by: @hairet
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: