Skip to content

Reimplement writer-priority rwlock#3286

Open
chenBright wants to merge 1 commit intoapache:masterfrom
chenBright:rwlock_
Open

Reimplement writer-priority rwlock#3286
chenBright wants to merge 1 commit intoapache:masterfrom
chenBright:rwlock_

Conversation

@chenBright
Copy link
Copy Markdown
Contributor

@chenBright chenBright commented May 8, 2026

What problem does this PR solve?

Issue Number: resolve #3051

Problem Summary:

The current bthread RWLock is implemented with a go-like state machine
where the reader count is bumped first and only rolled back if everything
later succeeds. As pointed out in #3051, this state is not safely
reversible on a partial failure
:

  • When read times out while a writer holds the lock, the pre-incremented
    reader counter is left dangling. Subsequent writers can no longer acquire
    the lock, even after every legitimate reader has left.
  • For the same structural reason, bthread rwlock cannot offer correct
    try_* / timed_* APIs at all -- any cancellable wait risks corrupting
    the lock's accounting.

PR #1031 already proposed a writer-priority rwlock that fits this requirement;
this PR is built on top of that proposal.

What is changed and the side effects?

Changed:

Reimplement bthread_rwlock_t on top of a writer-priority algorithm
derived from PR #1031.
The lock state is split into:

  • lock_word (butex): bit 31 = writer held, bits 0..30 = reader count;
  • writer_wait_count (butex): in-flight writer counter used by readers
    to honor writer-priority;
  • writer_queue_mutex: serializes writers competing for lock_word.

Every wait/park step is fully reversible, so the lock now correctly
supports try_* / timed_* operations:

  • bthread_rwlock_tryrdlock / bthread_rwlock_timedrdlock
  • bthread_rwlock_trywrlock / bthread_rwlock_timedwrlock
  • bthread_rwlock_unlock (auto-dispatches to read/write unlock by
    inspecting lock_word).

On any failure path (EBUSY, ETIMEDOUT, EINTR-leading-to-fail) the
side effects are rolled back via rwlock_wrlock_cleanup (release the inner
mutex if held; decrement writer_wait_count; wake parked readers if we
were the last in-flight writer).

Replace the unconditional broadcast in unlock with selective wakeup:

  • unrdlock: only the last reader (lock_word 1 -> 0) does
    butex_wake(lock_word) for the at-most-one writer parked on it
    (writers are serialized by writer_queue_mutex, so a single wake
    suffices). Earlier readers do not wake anyone.
  • unwrlock: deliberately unlocks writer_queue_mutex first (so the
    next queued writer -- which has already self-accounted in
    writer_wait_count -- gets a chance), and only does
    butex_wake_all(writer_wait_count) when we were the last in-flight
    writer (fetch_sub returned 1). The intentional ordering preserves
    strict writer-priority and is documented in detail in the source.

This eliminates the previous "wake everybody, let them re-contend"
pattern for both readers and writers, which removes a stampede on
contended workloads.

Co-authored-by: @hairet

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

@chenBright chenBright changed the title Reimplement writer-priority RWLock Reimplement writer-priority rwlock May 8, 2026
The previous go-like RWLock bumps the reader count first and rolls it
back only on the full success path. The state is not reversible on a
partial failure, so a read timeout while a writer holds the lock leaves a
dangling reader credit and permanently blocks future writers (issue apache#3051).
For the same reason the old implementation could not offer correct try_/timed_ APIs
at all.

Co-authored-by: hairet
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reimplements bthread_rwlock_t with a writer-priority algorithm to fix incorrect accounting in cancellable (try_* / timed_*) RWLock operations (issue #3051), and adds unit tests covering writer-priority semantics, cleanup correctness, and memory-ordering expectations.

Changes:

  • Replaced the previous Go-like RWLock state machine with a writer-priority design using lock_word (butex), writer_wait_count (butex), and writer_queue_mutex.
  • Updated the public bthread_rwlock_t struct layout to match the new implementation.
  • Expanded RWLock unit tests to validate writer-priority behavior, cleanup after timed-out writers, and data consistency; minor perf-test logging tweaks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File Description
src/bthread/rwlock.cpp New writer-priority RWLock implementation, new contention sampling flow, new init/destroy paths.
src/bthread/types.h Updates bthread_rwlock_t fields to match the new implementation (butex pointers + mutex).
test/bthread_rwlock_unittest.cpp Adds new behavioral tests for writer-priority and cleanup; tweaks perf output formatting and concurrency setting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/bthread/rwlock.cpp
Comment on lines 408 to 426
int bthread_rwlock_init(bthread_rwlock_t* __restrict rwlock,
const bthread_rwlockattr_t* __restrict) {
int rc = bthread_sem_init(&rwlock->reader_sema, 0);
if (BAIDU_UNLIKELY(0 != rc)) {
return rc;
rwlock->writer_wait_count = bthread::butex_create_checked<unsigned>();
rwlock->lock_word = bthread::butex_create_checked<unsigned>();
if (NULL == rwlock->writer_wait_count || NULL == rwlock->lock_word) {
LOG(ERROR) << "Out of memory";
return ENOMEM;
}
bthread_sem_disable_csite(&rwlock->reader_sema);
rc = bthread_sem_init(&rwlock->writer_sema, 0);
if (BAIDU_UNLIKELY(0 != rc)) {
bthread_sem_destroy(&rwlock->reader_sema);
return rc;
}
bthread_sem_disable_csite(&rwlock->writer_sema);

rwlock->reader_count = 0;
rwlock->reader_wait = 0;
rwlock->wlock_flag = false;
*rwlock->writer_wait_count = 0;
*rwlock->lock_word = 0;

bthread_mutexattr_t attr;
bthread_mutexattr_init(&attr);
// Disable csite on the inner queue mutex so the writer's wait time is
// accounted exactly once -- by the rwlock layer, not double-counted via
// the inner mutex.
bthread_mutexattr_disable_csite(&attr);
rc = bthread_mutex_init(&rwlock->write_queue_mutex, &attr);
if (BAIDU_UNLIKELY(0 != rc)) {
bthread_sem_destroy(&rwlock->reader_sema);
bthread_sem_destroy(&rwlock->writer_sema);
return rc;
}
bthread_mutex_init(&rwlock->writer_queue_mutex, &attr);
bthread_mutexattr_destroy(&attr);
Comment thread src/bthread/rwlock.cpp
Comment on lines 431 to 434
int bthread_rwlock_destroy(bthread_rwlock_t* rwlock) {
bthread_sem_destroy(&rwlock->reader_sema);
bthread_sem_destroy(&rwlock->writer_sema);
bthread_mutex_destroy(&rwlock->write_queue_mutex);
bthread::butex_destroy(rwlock->writer_wait_count);
bthread::butex_destroy(rwlock->lock_word);
return 0;
Comment thread src/bthread/rwlock.cpp
Comment on lines +177 to +178
LOG(ERROR) << "Invalid unrlock!";
return -1;
Comment thread src/bthread/rwlock.cpp
Comment on lines +91 to +92
int rwlock_rdlock(bthread_rwlock_t* rwlock, bool try_lock,
const struct timespec* abstime) {
Comment thread src/bthread/rwlock.cpp
Comment on lines +463 to 470
int bthread_rwlock_unrlock(bthread_rwlock_t* rwlock) {
return bthread::rwlock_unrdlock(rwlock);
}

int bthread_rwlock_unwlock(bthread_rwlock_t* rwlock) {
return bthread::rwlock_unwrlock(rwlock);
}

Comment thread src/bthread/rwlock.cpp
Comment on lines +134 to +146
// No writer in flight: try to add ourselves to the reader count.
// 2^31 readers should be enough for any realistic workload.
unsigned l = lock_word->load(butil::memory_order_relaxed);
if ((l >> 31) == 0) {
// Acquire on success synchronizes-with the release-CAS in
// unwrlock(), so any data written by the previous writer is
// visible to us before we start reading.
if (lock_word->compare_exchange_weak(l, l + 1,
butil::memory_order_acquire,
butil::memory_order_relaxed)) {
rc = 0;
break;
}
Comment thread src/bthread/types.h
bool wlock_flag; // Flag used to indicate that a write lock has been held.
bthread_mutex_t write_queue_mutex; // Held if there are pending writers.
bthread_contention_site_t writer_csite;
// Count of the bthread who holding write lock yet.
Comment thread src/bthread/types.h
Comment thread test/bthread_rwlock_unittest.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bthread::RWLock在写者持有锁期间,读者try_lock_for锁超时失败,会导致读者计数错误

2 participants