Commit 7d43f1c

Waiman Long authored and Ingo Molnar committed
locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where the
readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for a writer to spin on a
reader-owned rwsem. It is time-threshold-based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem. When the time threshold is exceeded, the
nonspinnable bits will be set in the owner field to indicate that no
more optimistic spinning will be allowed on this rwsem until it becomes
writer owned again. Not even readers are allowed to acquire the
reader-locked rwsem via optimistic spinning, for fairness.

We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now
split into two: one for readers and one for writers. When optimistic
spinning is disabled, both bits will be set. When the reader count
drops down to 0, the writer nonspinnable bit will be cleared to allow
writers to spin on the lock, but not the readers. When a writer
acquires the lock, it will write its own task structure pointer into
sem->owner and clear the reader nonspinnable bit in the process.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop:

      System                      Time for 16 Iterations
      ------                      ----------------------
  1-socket Skylake                       ~800ns
  4-socket Broadwell                     ~300ns
  2-socket ThunderX2 (arm64)             ~250ns

When the lock cacheline is contended, we can see an almost 10X increase
in elapsed time. So 25us will be at most 500, 1300 and 1600 iterations
for each of the above systems.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after this patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

  rwsem_opt_fail=79850
  rwsem_opt_nospin=5069
  rwsem_opt_rlock=597484
  rwsem_opt_wlock=957339
  rwsem_sleep_reader=57782
  rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

  rwsem_opt_fail=346723
  rwsem_opt_nospin=6293
  rwsem_opt_rlock=1127119
  rwsem_opt_wlock=1400628
  rwsem_sleep_reader=308201
  rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more
threads went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
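To make the empirical threshold concrete, here is a small user-space C sketch (illustrative only; the helper name rspin_threshold_ns and the sample table are taken from the figures quoted above, not from the kernel source) that reproduces the (10 + nr_readers/2) us formula capped at 25 us, and the per-system iteration budgets mentioned in the message:

	#include <stdio.h>

	#define NSEC_PER_USEC 1000ULL

	/*
	 * Illustrative reimplementation of the spinning-threshold formula from
	 * the commit message: (10 + nr_readers/2) us, capped at 30 readers
	 * (25 us). A standalone sketch, not the kernel's rwsem_rspin_threshold().
	 */
	static unsigned long long rspin_threshold_ns(int readers)
	{
		if (readers > 30)
			readers = 30;
		/* (20 + readers)/2 us == 10 us + readers/2 us, expressed in ns */
		return (20 + readers) * NSEC_PER_USEC / 2;
	}

	int main(void)
	{
		/* Minimum time for 16 loop iterations, from the commit message. */
		struct { const char *system; unsigned long long ns_per_16; } samples[] = {
			{ "1-socket Skylake",           800 },
			{ "4-socket Broadwell",         300 },
			{ "2-socket ThunderX2 (arm64)", 250 },
		};

		/* Worked example: spin budget at the 25 us cap (>= 30 readers). */
		unsigned long long cap_ns = rspin_threshold_ns(30);

		for (int i = 0; i < 3; i++) {
			unsigned long long iters = cap_ns * 16 / samples[i].ns_per_16;
			printf("%-28s ~%llu iterations within %llu ns\n",
			       samples[i].system, iters, cap_ns);
		}
		return 0;
	}

Compiled with any C99 compiler, this prints roughly 500, 1333 and 1600 iterations for the three sample systems, matching the "at most 500, 1300 and 1600 iterations" estimate above.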
1 parent 94a9717 commit 7d43f1c

File tree: 2 files changed, +144 −30 lines

kernel/locking/lock_events_list.h

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups		*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)	/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/

kernel/locking/rwsem.c

Lines changed: 143 additions & 30 deletions
@@ -23,6 +23,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/wake_q.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
@@ -31,32 +32,38 @@
 #include "lock_events.h"
 
 /*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
- *    readily determined. It can be reader owned or the owning writer
- *    is indeterminate.
+ *  - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ *  - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
  *
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the later case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear it only to give writers preference
+ * to acquire the lock via optimistic spinning, but not readers. Similar
+ * action is also done in the reader slowpath.
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
  * is involved. Ideally we would like to track all the readers that own
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
+#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
+#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
 #define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -141,7 +148,7 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
 
 	atomic_long_set(&sem->owner, val);
 }
@@ -191,6 +198,23 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	unsigned long owner = atomic_long_read(&sem->owner);
+
+	do {
+		if (!(owner & RWSEM_READER_OWNED))
+			break;
+		if (owner & RWSEM_NONSPINNABLE)
+			break;
+	} while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
+					  owner | RWSEM_NONSPINNABLE));
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -546,7 +570,8 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   unsigned long nonspinnable)
 {
 	struct task_struct *owner;
 	unsigned long flags;
@@ -562,7 +587,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_owner_flags(sem, &flags);
-	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
@@ -588,12 +613,12 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
 static inline enum owner_state
-rwsem_owner_state(struct task_struct *owner, unsigned long flags)
+rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long nonspinnable)
 {
-	if (flags & RWSEM_NONSPINNABLE)
+	if (flags & nonspinnable)
 		return OWNER_NONSPINNABLE;
 
 	if (flags & RWSEM_READER_OWNED)
@@ -602,14 +627,15 @@ rwsem_owner_state(struct task_struct *owner, unsigned long flags)
 	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state
+rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 {
 	struct task_struct *new, *owner;
 	unsigned long flags, new_flags;
 	enum owner_state state;
 
 	owner = rwsem_owner_flags(sem, &flags);
-	state = rwsem_owner_state(owner, flags);
+	state = rwsem_owner_state(owner, flags, nonspinnable);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -622,7 +648,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 		new = rwsem_owner_flags(sem, &new_flags);
 		if ((new != owner) || (new_flags != flags)) {
-			state = rwsem_owner_state(new, new_flags);
+			state = rwsem_owner_state(new, new_flags, nonspinnable);
			break;
		}
 
@@ -646,10 +672,39 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
+ *
+ *   Spinning threshold = (10 + nr_readers/2)us
+ *
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subjected to change in the future.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	int readers = count >> RWSEM_READER_SHIFT;
+	u64 delta;
+
+	if (readers > 30)
+		readers = 30;
+	delta = (20 + readers) * NSEC_PER_USEC / 2;
+
+	return sched_clock() + delta;
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
+	unsigned long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
					   : RWSEM_RD_NONSPINNABLE;
 
 	preempt_disable();
 
@@ -661,12 +716,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
+	 *  2) readers own the lock and spinning time has exceeded limit.
 	 */
 	for (;;) {
-		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+		enum owner_state owner_state;
 
+		owner_state = rwsem_spin_on_owner(sem, nonspinnable);
 		if (!(owner_state & OWNER_SPINNABLE))
 			break;
 
@@ -679,6 +734,38 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Re-initialize rspin_threshold every time when
+			 * the owner state changes from non-reader to reader.
+			 * This allows a writer to steal the lock in between
+			 * 2 reader phases and have the threshold reset at
+			 * the beginning of the 2nd reader phase.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (rwsem_test_oflags(sem, nonspinnable))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold once every 16 iterations to
+			 * avoid calling sched_clock() too frequently so
+			 * as to reduce the average latency between the times
+			 * when the lock becomes free and when the spinner
+			 * is ready to do a trylock.
+			 */
+			else if (!(++loop & 0xf) && (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
@@ -733,8 +820,25 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This give writers better chance to acquire the rwsem first before
+ * readers when the rwsem was being held by readers for a relatively long
+ * period of time. Race can happen that an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
+}
 #else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   unsigned long nonspinnable)
 {
 	return false;
 }
@@ -743,6 +847,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
 #endif
 
 /*
@@ -752,10 +858,11 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
+	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
-	if (!rwsem_can_spin_on_owner(sem))
+	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
 	/*
@@ -815,8 +922,12 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	    (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+	if (!(count & RWSEM_LOCK_MASK)) {
+		clear_wr_nonspinnable(sem);
+		wake = true;
+	}
+	if (wake || (!(count & RWSEM_WRITER_MASK) &&
+		    (adjustment & RWSEM_FLAG_WAITERS)))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -866,7 +977,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_can_spin_on_owner(sem) &&
+	if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
@@ -1124,8 +1235,10 @@ inline void __up_read(struct rw_semaphore *sem)
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
		      RWSEM_FLAG_WAITERS)) {
+		clear_wr_nonspinnable(sem);
 		rwsem_wake(sem, tmp);
+	}
 }
 
 /*
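For readers unfamiliar with the owner-field encoding the patch relies on, here is a minimal user-space sketch (the flag values come from the diff above; everything else, including the fake pointer value, is illustrative and assumes a 64-bit build) showing how the task_struct pointer and the three low-order flag bits share one word:

	#include <stdio.h>

	/* Flag bits as defined by this patch. */
	#define RWSEM_READER_OWNED	(1UL << 0)
	#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
	#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
	#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
	#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)

	int main(void)
	{
		/*
		 * Stand-in for a task_struct pointer; real task pointers are
		 * at least 8-byte aligned, so the low 3 bits are free for flags.
		 */
		unsigned long task = 0xffff888012345678UL & ~RWSEM_OWNER_FLAGS_MASK;

		/* Reader acquisition: only RWSEM_READER_OWNED is set now. */
		unsigned long owner = task | RWSEM_READER_OWNED;

		/* A writer whose spin time exceeded the threshold disables spinning. */
		owner |= RWSEM_NONSPINNABLE;

		/* Last unlocking reader gives writers (only) their spinning back. */
		owner &= ~RWSEM_WR_NONSPINNABLE;

		printf("owner=%#lx flags=%#lx task=%#lx\n",
		       owner, owner & RWSEM_OWNER_FLAGS_MASK,
		       owner & ~RWSEM_OWNER_FLAGS_MASK);
		return 0;
	}

The final state (RWSEM_READER_OWNED and RWSEM_RD_NONSPINNABLE set, RWSEM_WR_NONSPINNABLE cleared) is the condition the patch's clear_wr_nonspinnable() produces: writers may resume optimistic spinning, readers may not.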
