This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve spinning in ReaderWriter Lock #13324

Closed
wants to merge 3 commits

Conversation

vancem

@vancem vancem commented Aug 10, 2017

Fixes #12780. I suggested this in words to @kouvel as an alternative to PR #13243. He suggested that I write up the actual code (since the goal was for it to be simple).

Marked NoMerge for now since I have not tested the performance characteristics, and this is mostly so that @kouvel can review.

The change incorporates the optimization in PR #13243 to avoid read spinning if writers are trying to get the lock.

But the main change here is to remember how much we spun last time and use that to determine if spinning is worthwhile this time. Thus if spinning was not fruitful in the past, we don't spin at all (which is a big improvement).

This is done in both the MyLock spin loop and the Reader and Writer spin loops.

The code is basically the SpinHistory class (which is very straightforward) plus very minor changes to wire it in. The rest is some variable renames and comments.
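
A minimal sketch of the idea, purely for illustration (hypothetical names and constants; the actual SpinHistory class in this change may look quite different): remember how far we had to spin last time, budget a little more than that this time, skip spinning entirely if it was fruitless, and occasionally probe with a full spin to notice when conditions change.

    using System;
    using System.Threading;

    // Hypothetical sketch only -- not the SpinHistory class from this PR.
    internal sealed class SpinHistorySketch
    {
        private const int MaxSpins = 100;      // full spin budget
        private const int ProbeInterval = 64;  // how often to retry a full spin after giving up

        private int _lastSpinCount;            // spins needed last time (MaxSpins means spinning failed)
        private int _attempts;

        // tryAcquire attempts the lock once and returns true if it was acquired.
        public bool SpinThenTryAcquire(Func<bool> tryAcquire)
        {
            int budget;
            if (++_attempts % ProbeInterval == 0)
                budget = MaxSpins;                                // occasionally probe with a full spin
            else if (_lastSpinCount >= MaxSpins)
                budget = 0;                                       // spinning was fruitless last time: skip it
            else
                budget = Math.Min(MaxSpins, _lastSpinCount + 8);  // a little more than last time

            for (int i = 0; i < budget; i++)
            {
                if (tryAcquire())
                {
                    _lastSpinCount = i;                           // remember how far we had to spin
                    return true;
                }
                Thread.SpinWait(20);
            }

            _lastSpinCount = MaxSpins;                            // spinning did not pay off this time
            return tryAcquire();                                  // one last try before the caller blocks
        }
    }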

If you pass a null payload using Write<T>, and in the EventListener then call PayloadNames, it
will throw an IndexOutOfRangeException.   It should just return null.   This commit fixes that.
The main thing we do is remember how much spinning we did last time
and use that to decide how much to do this time.   Thus if spinning was
unproductive last time we don't spin this time (but will try from time to
time to see if things changed).
@vancem vancem added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 10, 2017
@vancem vancem changed the title Improve spinning in Improve spinning in ReaderWriter Lock Aug 10, 2017
@kouvel
Member

kouvel commented Aug 19, 2017

I did a bunch of tests on variations of this to see what is working and what is not, and to hopefully understand why.

I modified my benchmark to add some memory-heavy work on a specified number of threads; these threads just do some simple memory operations (array[i - 1] += array[i]) on an array (64 int elements) and increment a counter for each full iteration over the array.
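
Roughly, each background thread does something like the following (a minimal sketch with hypothetical names; not the actual benchmark code):

    using System.Threading;

    static class MemoryHeavyWork
    {
        // One background thread: repeatedly sweep a small array with dependent
        // read-modify-writes and count each completed pass over the array.
        public static void Run(int[] array, ref long iterations, CancellationToken cancel)
        {
            while (!cancel.IsCancellationRequested)
            {
                for (int i = 1; i < array.Length; i++)
                    array[i - 1] += array[i];
                Interlocked.Increment(ref iterations);
            }
        }
    }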

Issues with changes to the spin lock spin heuristics:

  • RE: Scale issue caused by threads attempting to enter the RW lock starving threads attempting to exit it from acquiring the spin lock
    • The change as-is, minus the RW lock spin history, doesn't seem to help, and with just readers and writers it breaks down fairly quickly with an increasing number of threads compared with the baseline. Based on the profile, it spends almost all of its time blocked when there is high contention.
      • I believe what is happening is that the saved starting spin count creeps upwards quickly because, under contention, threads fail to acquire the spin lock more frequently than they succeed. This introduces a significant delay into each operation that is necessary to make progress. Once a writer registers itself as a waiter, it blocks new enter-read threads from acquiring the RW lock, but they'll still spin and acquire the spin lock once before waiting. Exit-read threads (actually all threads) would start at a high enough spin count that they're doing a sleep(1) between each attempt at the spin lock, and with enough other threads stealing the lock, some exit-read threads are blocked for a long time.
    • Experiments:
      • 1 - I tried removing the spin count history when exiting the RW lock and that seems to perform much better
        • Since enter-read and enter-write threads are significantly delayed by the sleeps, exit read threads that spin more aggressively have an easy time acquiring the spin lock to make progress
        • Under high contention, this performs an order of magnitude better than the baseline (roughly ranging from 2x to 10x), but deprioritization still performs an order of magnitude better than this (roughly ranging from 2x to 4x)
      • 2 - I tried gradually increasing the initial spin count and quickly decreasing it, to discourage it from staying near the max
        • Something like this:
            spinCount -= 2;                                 // decay the count actually used this time
            if (spinCount < 0)
                spinCount = 0;
            if (spinCount <= _myLockSpinCountStart)
                _myLockSpinCountStart = (byte)spinCount;    // drop quickly toward the lower value
            else
                _myLockSpinCountStart = (byte)((spinCount + _myLockSpinCountStart * 3) / 4);  // creep upward by 1/4 of the gap
        • This performs about equally to the baseline. It does not combine well with experiment 1; I believe that is because it decreases the delay and increases contention on the spin lock, so exit-read threads are not as easily able to acquire the spin lock.
  • RE: Writers are starved for long durations
    • The change as is minus the RW lock spin history seems to be similar to the baseline
      • I believe that is because writers and readers are similarly likely to spin less and sleep for longer and that seems to balance out
    • Experiment 1 above seems to help to some degree, but deprioritization seems to work much better

Issues with changes to the RW lock spin heuristics:

  • Reducing enter-read spinning based on history seems to break down with fewer threads than before with high contention
    • I believe that is because it causes enter-read threads to wait more frequently (both sleep(1) and the actual wait). Readers waiting on the event are all woken when a write lock is released and there is no write waiter, and there is no guarantee that they will acquire their read lock after waking. With enough demand for writes (though quite little demand compared to reads), it can even be unlikely that awakened readers would acquire the read lock before a writer acquires a write lock. The profile shows that context switches account for a significant share of the time spent, a little more than half from Sleep(1) and the rest from the wait.
    • With the RW lock spin history changes, it performs an order of magnitude worse, and that is regardless of whether the changes are on top of experiment 1 above or on top of deprioritization (roughly 3x to 10x)
  • Reducing enter-write spinning based on history seems to perform worse
    • I believe that is because it causes enter-write threads to wait more frequently, causing enter-read threads to wait more frequently, leading to the same problem above
  • Delaying spinning (adding 100 to the spin count on the first failure to acquire the lock before a wait), as far as I can tell, only magnifies the above problems with any sizeable value, and I believe for the same reasons; the rest of my experiments excluded that part of the change
  • Experiments:
    • 3 - I tried removing the spin heuristic changes for readers and it seems to avoid the enter-read problem
    • 4 - I tried gradually increasing and more quickly decreasing the initial spin count for the RW lock spinning similarly to how I did for the spin lock above. It seems to help slightly but noticeably, but it still doesn't get close to deprioritization.

Effects of and on unrelated memory-intensive work (most of my tests were with at least proc count threads):

  • Having background CPU-bound work (memory-intensive or not) seems to mask part of the problem in most of my experiments. That is, any case that would otherwise break down doesn't break down anymore. It only masks the problem though, and with more threads it still seems to break down similarly to before.
    • I believe that is because the background work has an effect of reducing contention on the spin lock and RW lock. Although the relative contention with time slicing should be the same on the reader/writer threads, I believe the way it works is it becomes far less likely that the spin lock for instance would be held when a reader/writer thread attempts to acquire it because typically the spin lock is held for a short enough duration that it would be acquired and released in the same time slice.
  • In the vast majority of my experiments, there were no effects on the background work (it performed at a similar rate regardless of the solution being tested). The largest difference I saw was a 10-15% increase in throughput of the background work with better RW lock heuristics.
    • In all cases where I saw the background work's throughput increase, the reader/writer work throughput was also increased. The only exceptions were when the RW heuristics broke down; then the background work's throughput was higher but not relevant because the reader/writer throughput was on the floor anyway.

I found one thing that sometimes performed an order of magnitude better than deprioritization (but only on top of deprioritization).

  • Experiment 5:
    • Using the counterintuitive inverse heuristic - spin less when a spin succeeds and spin more when it fails - seemed to work quite well
    • I believe that is because it has the effect of reducing contention on the spin lock and RW lock, similarly to having background work but without the overhead at scale. Threads that acquire the spin lock, for instance, before reaching the max spin count will cause the next few threads to spin less and wait sooner, and threads that spin less and wait sooner (the previous batch) will cause the next few threads to spin more. So in this example it would reduce the number of threads contending on the spin lock. (A rough sketch of this heuristic is shown after this list.)
    • On top of deprioritization, it performs up to 2x better at medium thread counts. At lower and higher thread counts, there is no gain, but at least I haven't seen the combination perform worse than deprioritization alone. At higher thread counts I believe there is a waiting writer bottleneck that dominates the score in both cases and that's why there's no gain.
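
A minimal sketch of what such an inverse heuristic could look like (hypothetical names and step sizes, not the actual code used in these experiments): a successful spin lowers the spin budget handed to the next few threads, and a failed spin raises it.

    using System;

    // Hypothetical sketch of the inverse heuristic from experiment 5.
    // (In a real lock these racy updates would be acceptable as a heuristic.)
    internal sealed class InverseSpinHeuristicSketch
    {
        private const int MaxSpinCount = 100;
        private int _nextSpinCount = 10;   // budget handed to the next spinner

        public int NextSpinCount => _nextSpinCount;

        public void OnSpinSucceeded()
        {
            // Spinning paid off, so make the next few threads spin less and
            // fall back to waiting sooner, reducing contention on the lock.
            _nextSpinCount = Math.Max(0, _nextSpinCount - 10);
        }

        public void OnSpinFailed()
        {
            // Spinning did not pay off, so let the next few threads spin longer.
            _nextSpinCount = Math.Min(MaxSpinCount, _nextSpinCount + 1);
        }
    }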

In summary:

  • I was not able to find any order-of-magnitude improvements in the common cases with any solution
  • I don't think changing only spin heuristics sufficiently solves the scale problems or writer-starvation problems. Experiment 1 is the closest, but it doesn't solve the writer starvation problem and deprioritization performs equally or better in pretty much all cases including when the total thread count is <= the number of procs.
  • There definitely seems to be potential for improvement on top of deprioritization, but I think it would be rather difficult to tune changes to spin heuristics to work well. I'm not even inclined to use experiment 5 because I can't be certain of its behavior outside of my tests, which are not exactly representative of real-world scenarios. Deprioritization, on the other hand, sets clear rules that make logical sense to me and are conservative enough that I would find it surprising for it to perform an order of magnitude worse than the baseline in real-world scenarios.

@vancem
Author

vancem commented Sep 5, 2017

Thanks Kount for doing the performance work to investigate this.

I think that the general idea of using history to determine a good spin count is likely to be better than the data above suggests. The reason is that the benchmark we are using above is really an unusual corner case (where the reader-writer lock hold time is basically zero, and under super-high contention). In fact the data above supports this, because when other work was included you saw that everything performed about the same.

Indeed the target for the SpinHistory class was not really to optimize for this case. All I hoped SpinHistory would do for this super-high-contention case is avoid 'meltdown' (that is, that we do not become dramatically LESS efficient as contention increases). I expected it to quickly cause some threads to block (sleep), and this would dramatically affect those threads' throughput, but at least the threads that were making headway would have a very bounded amount of spinning (and thus the inefficiency would be bounded). Thus its value is not about optimality (this case frankly will never be good, and users should avoid it), but about simplicity (a good enough solution that never melts down and works well up to a very reasonable scaling factor).

SpinHistory actually helps in the case where the lock hold time may vary, but is often too long to spin for yet short enough that the lock might be taken many times (e.g. say 0.1 - 10 msec). In that case SpinHistory avoids any spinning (because a reasonable spin time (e.g. 10 usec) is almost certainly going to fail and simply waste CPU).

Now myLock should NOT have this hold-time characteristic (we expect myLock to only be held < 0.1 usec because it is only held during TRANSITIONS of the RW lock). Thus I would not expect SpinHistory to be particularly good there; it was just that it avoids meltdown in a simple way.

In the end, for the reader-writer lock I am OK with prioritizing exits (and writes). It adds complexity, but not that much, and it certainly makes intuitive sense and solves the meltdown problem.

I think SpinHistory is still useful for normal locks (which unlike myLock are more likely to have lock hold times that might be short at times and long at times).

But that is a different issue. I will close this one.

@vancem vancem closed this Sep 5, 2017