Improve ReaderWriterLockSlim scalability #13243

Merged
merged 3 commits on Sep 14, 2017

Conversation

@kouvel
Member

kouvel commented Aug 6, 2017

Fixes #12780

The _myLock spin lock runs into some bad scalability issues. For example:

  • Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
  • Readers, and especially writers, that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are ones that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

Took some suggestions from @vancem and landed on the following after some experiments:

  • Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
  • Limited spinning in some cases where it is very unlikely that spinning would help
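
The deprioritization policy can be sketched roughly as follows. This is a minimal illustration of the idea in Python, not the actual C# implementation; the function name and the spin counts are invented for the example:

```python
# Rough sketch of the deprioritization policy (illustrative only; the
# names and constants are invented, not taken from the coreclr sources).
# Acquirers that are unlikely to make progress on the RW lock burn extra
# spin-wait iterations before competing for the _myLock spin lock, which
# lets releasers (who always make progress) through first.

def spins_before_attempting_spin_lock(is_releaser: bool,
                                      deprioritized: bool,
                                      base_spins: int = 10) -> int:
    """Spin-wait iterations to burn before touching the spin lock."""
    if is_releaser:
        # A releaser always makes progress on the RW lock once it holds
        # the spin lock, so let it attempt the acquisition immediately.
        return 0
    if deprioritized:
        # E.g. a reader while a writer is waiting: it could not take the
        # RW lock anyway, so it should stay out of the releasers' way.
        return base_spins * 4
    return base_spins
```

The effect is a crude form of fairness: threads whose spin-lock acquisition cannot advance the RW lock's state arrive at the spin lock less often, so releasers and likely-to-succeed acquirers face less contention.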
Baseline (left) vs Changed (right)                       Left score       Right score      ∆ Score %
-------------------------------------------------------  ---------------  ---------------  ---------
Concurrency_12Readers                                    22576.33 ±0.61%  22485.13 ±0.84%     -0.40%
Concurrency_12Writers                                    22661.57 ±0.36%  23159.64 ±0.41%      2.20%
Concurrency_12Readers_4Writers                           1363.40 ±11.54%   7721.94 ±2.66%    466.37%
Concurrency_48Readers_8Writers                               6.01 ±3.81%   5495.41 ±4.76%  91334.09%
Concurrency_192Readers_16Writers                             6.99 ±2.55%   4071.60 ±6.46%  58152.74%
Concurrency_768Readers_32Writers                            15.23 ±3.74%   1296.81 ±4.83%   8412.51%
Contention_AcquireReleaseReadLock_When_BgWriteLock           3.81 ±0.81%      3.65 ±0.68%     -4.25%
Contention_AcquireReleaseWriteLock_When_BgReadLock           3.78 ±0.63%      3.66 ±0.69%     -3.20%
Contention_UpgradeDowngradeLock_When_BgReadLock              3.76 ±0.72%      3.59 ±0.80%     -4.63%
Micro_AcquireReleaseReadLock_When_BgReadLock              2578.81 ±0.41%   2566.67 ±0.04%     -0.47%
Micro_AcquireReleaseReadLock_When_NoLock                  2586.77 ±0.12%   2592.37 ±0.03%      0.22%
Micro_AcquireReleaseReadLock_When_ReadLock                2777.25 ±0.03%   2770.76 ±0.37%     -0.23%
Micro_AcquireReleaseUpgradeableReadLock_When_BgReadLock   2595.37 ±0.18%   2605.86 ±0.49%      0.40%
Micro_AcquireReleaseUpgradeableReadLock_When_NoLock       2646.20 ±0.12%   2609.14 ±0.13%     -1.40%
Micro_AcquireReleaseWriteLock_When_NoLock                 2548.90 ±0.04%   2621.03 ±0.18%      2.83%
Micro_AcquireReleaseWriteLock_When_WriteLock              2650.63 ±0.32%   2660.43 ±0.37%      0.37%
Micro_UpgradeDowngradeLock_When_ReadLock                  2535.17 ±0.10%   2512.07 ±0.09%     -0.91%
Timeout_AcquireReleaseReadLock_When_BgWriteLock           4016.95 ±1.00%   4012.96 ±0.23%     -0.10%
Timeout_AcquireReleaseWriteLock_When_BgReadLock           4014.43 ±0.14%   4119.19 ±0.24%      2.61%
Timeout_UpgradeDowngradeLock_When_BgReadLock              3979.77 ±0.07%   3893.49 ±0.04%     -2.17%
-------------------------------------------------------  ---------------  ---------------  ---------
Total                                                      532.66 ±1.40%   1395.93 ±1.24%    162.07%

@ChrisAhna's repro from #12780

Baseline

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5008ms FinalCount=146778099 CountsPerSecond=29303915.9706614
ThreadCount=0002: Elapsed=5013ms FinalCount=144566761 CountsPerSecond=28832836.5040624
ThreadCount=0004: Elapsed=5030ms FinalCount=142987794 CountsPerSecond=28424230.6837138
ThreadCount=0008: Elapsed=5046ms FinalCount=140895614 CountsPerSecond=27919285.270997
ThreadCount=0016: Elapsed=5061ms FinalCount=131126565 CountsPerSecond=25904896.8847266
ThreadCount=0032: Elapsed=5046ms FinalCount=118985913 CountsPerSecond=23578687.3922634
ThreadCount=0064: Elapsed=5077ms FinalCount=87382990 CountsPerSecond=17209966.7389088
ThreadCount=0128: Elapsed=5046ms FinalCount=13983552 CountsPerSecond=2771044.65190866
ThreadCount=0256: Elapsed=5061ms FinalCount=926020 CountsPerSecond=182954.410859513
ThreadCount=0512: Elapsed=5061ms FinalCount=554880 CountsPerSecond=109633.000133682
ThreadCount=1024: Elapsed=5047ms FinalCount=403372 CountsPerSecond=79907.8598817266
ThreadCount=2048: Elapsed=5158ms FinalCount=427853 CountsPerSecond=82937.9986349298
ThreadCount=4096: Elapsed=6843ms FinalCount=1454282 CountsPerSecond=212510.216120728

Changed

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5011ms FinalCount=146023913 CountsPerSecond=29136243.0844929
ThreadCount=0002: Elapsed=5029ms FinalCount=148085765 CountsPerSecond=29443444.9191416
ThreadCount=0004: Elapsed=5030ms FinalCount=147435037 CountsPerSecond=29307748.3158908
ThreadCount=0008: Elapsed=5046ms FinalCount=135669584 CountsPerSecond=26884746.2829748
ThreadCount=0016: Elapsed=5046ms FinalCount=117172253 CountsPerSecond=23219237.5243882
ThreadCount=0032: Elapsed=5077ms FinalCount=123019081 CountsPerSecond=24227324.0685217
ThreadCount=0064: Elapsed=5061ms FinalCount=114036461 CountsPerSecond=22528225.1624931
ThreadCount=0128: Elapsed=5061ms FinalCount=114874563 CountsPerSecond=22694305.2318717
ThreadCount=0256: Elapsed=5077ms FinalCount=111656891 CountsPerSecond=21990952.9701927
ThreadCount=0512: Elapsed=5092ms FinalCount=108080691 CountsPerSecond=21224516.1624796
ThreadCount=1024: Elapsed=5091ms FinalCount=101505410 CountsPerSecond=19936231.0295513
ThreadCount=2048: Elapsed=5168ms FinalCount=90210271 CountsPerSecond=17452847.1281253
ThreadCount=4096: Elapsed=5448ms FinalCount=70413247 CountsPerSecond=12923136.9608753

Monitor

Running StandardObjectLock write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5003ms FinalCount=256096538 CountsPerSecond=51188037.8514373
ThreadCount=0002: Elapsed=5014ms FinalCount=247492238 CountsPerSecond=49357843.8844401
ThreadCount=0004: Elapsed=5015ms FinalCount=233682614 CountsPerSecond=46594332.7385551
ThreadCount=0008: Elapsed=5014ms FinalCount=202084181 CountsPerSecond=40295977.2812599
ThreadCount=0016: Elapsed=5015ms FinalCount=160327931 CountsPerSecond=31967286.7931112
ThreadCount=0032: Elapsed=5015ms FinalCount=159973407 CountsPerSecond=31896067.0536476
ThreadCount=0064: Elapsed=5016ms FinalCount=159925779 CountsPerSecond=31881983.1519616
ThreadCount=0128: Elapsed=5018ms FinalCount=160565171 CountsPerSecond=31994598.5148935
ThreadCount=0256: Elapsed=5027ms FinalCount=160346276 CountsPerSecond=31893699.5203601
ThreadCount=0512: Elapsed=5059ms FinalCount=160314101 CountsPerSecond=31688269.2933108
ThreadCount=1024: Elapsed=5106ms FinalCount=160400801 CountsPerSecond=31409342.086444
ThreadCount=2048: Elapsed=5198ms FinalCount=160176757 CountsPerSecond=30814118.2536987
ThreadCount=4096: Elapsed=5398ms FinalCount=160109421 CountsPerSecond=29660512.8891984
{
// If there is a write waiter or write upgrade waiter, the waiter would block a reader from acquiring the RW lock
// because the waiter takes precedence. In that case, the reader is not likely to make progress by spinning.
return _fNoWaiters || _numWriteWaiters == 0 && _numWriteUpgradeWaiters == 0;

@stephentoub

stephentoub Aug 6, 2017

Member

Nit: consider adding parens here to help clarify the precedence

{
// If there is a write upgrade waiter, the waiter would block a writer from acquiring the RW lock because the waiter
// holds a read lock. In that case, the writer is not likely to make progress by spinning.
return isUpgradeToWrite || _numWriteUpgradeWaiters == 0;

@stephentoub

stephentoub Aug 6, 2017

Member

I understand how the comment applies to "_numWriteUpgradeWaiters == 0". What's the rationale for why we still spin if _numWriteUpgradeWaiters > 0 but this is a write upgrade?

@kouvel

kouvel Aug 6, 2017

Member

There can only be one upgradeable read lock held at a time, so if this is a write upgrade, there can't be any write upgrade waiters. I'll add that to the comment.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void EnterMyLockForEnterAnyRead(bool isForWait = false)
{
if (!TryEnterMyLock())

@stephentoub

stephentoub Aug 6, 2017

Member

Maybe we already do this elsewhere and I missed it, but when someone passes a timeout of 0, can/should we avoid spinning?

@kouvel

kouvel Aug 6, 2017

Member

Good point. When the enter is deprioritized, it would mean there is no progress to be made on the RW lock, so it doesn't have to spin for _myLock. I'll also fix it so that it doesn't create/wait on the event when it decides to skip spinning on the RW lock and the timeout has expired.

@kouvel

kouvel Aug 6, 2017

Member

If it's not deprioritized then I think it should spin for _myLock to attempt the RW lock once before returning

@kouvel

kouvel Aug 8, 2017

Member

I'll fix this in a separate PR, maybe along with fixes to #13254. I'm seeing some regressions with those at the moment, so it'll need some more work.

@kouvel kouvel added the * NO MERGE * label Aug 7, 2017

@kouvel

Member

kouvel commented Aug 7, 2017

Flagged as no-merge, as I'll be adding more commits and would like to merge them as separate commits (no squash & merge), since each commit will have a different level of impact and risk when considering a port to desktop.

@kouvel kouvel removed the * NO MERGE * label Aug 8, 2017

@vancem

Member

vancem commented Aug 8, 2017

Sorry for the delay; I was catching up from a vacation yesterday.

My biggest concern is complexity. As I mentioned in e-mail threads, generally speaking we are NOT fixing a pri1 scenario here because even after the fixes we have a very hot lock (that users really should fix, and what we are doing here does not help them). What we are trying to do is avoid really bad behavior in this case (things are very serialized, but you don't get meltdown). Thus ideally we do something which does not add complexity but addresses that issue.

I really like your ShouldSpinFor* functions because they are very simple (no new state), and the logic clearly is simply giving up on spinning (which does not have a semantic effect, only perf).

Do we know how much just doing this helps? It has the effect of 'prioritizing' without extra state variables. As I recall, Chris' benchmark was write-only, in which case it won't help there (but it is still definitely worth doing, as the read-write case is the common one).

I have been thinking about the 'best' way of doing spinning in general, and I believe a very good general solution is to use what happened in the past, to determine the likely amount of time you have to spin, and based on that decide whether spinning is worth doing at all. Thus if hold times are long (in which case you typically end up blocking anyway), you very quickly don't bother spinning at all (which is great).

One of the simpler ways of doing this is to simply remember the 'spinCount' variable from last time and use, say, 80% of it for next time (that way you tend to ratchet down, but if that fails you naturally home in on the correct value). The only trick is what you do when you end up blocking. The best answer is to use something like Stopwatch to actually measure the blocked time and only adjust the spinCount down if that time drops below some threshold. If we don't like the idea of taking a dependency on Stopwatch, we can also do it by simply increasing the spin count to something large enough that taking 80% of it, say, 100 times brings it back down to the threshold of spinning. What this does is guarantee that after you block, you skip spinning for the next ~100 entries, at which point you spin once just to see if anything has changed. Now the overhead is 1% of what it was before.

I am thinking that this would be simpler code: basically, spinCount becomes an instance variable (you have two of them, one for readers and one for writers, as they are likely to have different wait times).
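
A minimal sketch of the remembered-spinCount idea (the names, the 0.8 decay factor, and the limits are illustrative guesses for the example, not proposed values):

```python
# Illustrative-only sketch of the remembered-spin-count scheme: start at
# ~80% of last time's spin count, and after blocking, inflate the count
# past a limit so the next several entries skip spinning entirely.

SPIN_LIMIT = 20  # beyond this, spinning is assumed never worthwhile


class AdaptiveSpinBudget:
    def __init__(self) -> None:
        self.remembered = SPIN_LIMIT  # spin count carried over from last entry

    def begin_entry(self) -> int:
        """Spin budget for this entry; 0 means go straight to waiting."""
        # Ratchet down to 80% of last time, so the budget homes in on the
        # true short-wait cost while spinning keeps succeeding.
        self.remembered = max(1, int(self.remembered * 0.8))
        return self.remembered if self.remembered <= SPIN_LIMIT else 0

    def on_blocked(self) -> None:
        # Spinning failed and we blocked: inflate the remembered count so
        # that several 0.8x decays must pass before we spin again.
        self.remembered = SPIN_LIMIT * 4
```

With these made-up constants, one block suppresses spinning for the next five entries (80 decays to 64, 51, 40, 32, 25 before reaching the limit of 20); the "~100 skipped entries" figure in the comment above would correspond to a much larger inflation factor.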

The MyLock could also benefit from this scheme (making its spin count remembered in an instance variable).

Finally, I do think we can do better on our tuning parameters (changing them also does not increase complexity). First, the SpinWaits are too small. We start at 10, but the memory system is probably 20X slower than the CPU, so frankly it never gives the memory system 'a break', which is part of the point. I believe this value should be somewhere between 10 and 100, and I am guessing 30-50 is a good place to start for MyLock; SpinCount should be reduced to 5 to compensate. For the reader-writer lock, I think the spin wait should be 2-4X larger than for MyLock (each spin stresses the memory system more, because you have to lock and unlock), and the maximum number of spins should thus be reduced to something like 5 from its current 20 (because each spin waits longer, and frankly spinning is not as likely to be fruitful). This kind of tuning does take a while and assumes a good set of benchmarks to tune with, so we could skip it (but my suggestion is to UNDER spin rather than OVER spin; spinning is always SPECULATION, and in general you should not speculate when you have no real clue whether it is helping or not).

I like to believe that these suggestions (add ShouldSpinFor*, remember spinCounts in instance variables, and play with tuning parameters) keep the code simple, but should address excessive spinning.

I am concerned that the current PR has a non-trivial amount of new code and that that new complexity will never go away. The suggestion above also helps avoid CPU in the (reasonably common) case where you need to block for a small amount of time, but not small enough that spinning is effective.

Comments?

@vancem

In my comment I suggested that the scalability could be achieved more simply (by the 'Should*' methods here, coupled with remembering when spinning is ineffective (you will end up blocking anyway)).

@kouvel

Member

kouvel commented Aug 9, 2017

My biggest concern is complexity. As I mentioned in e-mail threads, generally speaking we are NOT fixing a pri1 scenario here because even after the fixes we have a very hot lock (that users really should fix, and what we are doing here does not help them). What we are trying to do is avoid really bad behavior in this case (things are very serialized, but you don't get meltdown). Thus ideally we do something which does not add complexity but addresses that issue.

General thoughts:

The issue arises when threads are thrashing on the spin lock, typically readers. When a writer comes in, it is unable to get the spin lock and quickly reaches the Sleep(1) stage of the spin loop, where it becomes even less likely to get the lock by chance. Eventually, after enough iterations of getting the spin lock and not making progress on the RW lock (because a read lock is typically held), it goes into a wait state. While the writer is waiting, new readers spin on the spin lock and the RW lock, delaying threads holding the read lock from releasing their lock. Eventually, readers are drained, and the writer is signaled and acquires the write lock. Then it has trouble releasing the write lock because readers are thrashing on the spin lock.

Potential considerations:

  • This would be a hot lock that gets serialized. If a user runs into this issue, probably the way to fix it would be to use a mutex-type lock. It's possible that the lock is hot sometimes and provides beneficial parallel reads at other times. In that case, they could opt to take a write lock for short operations. Should we fix this at all?
  • Once reaching the Sleep(1) stage, should it only do Sleep(1), or should it reset the count after some threshold and spin faster for a short time? I haven't experimented with this; it doesn't seem like a significant issue with the deprioritization.
  • Should writers spin on the RW lock at all? They may not be able to get the spin lock very easily, but at least they would only have to get it once to register as a waiter. I experimented with disabling spinning on the RW lock for writers; it works somewhat well for few threads but does not scale well. The writer then tends to intrude on readers too easily/quickly. Spinning delays the writer and allows readers some extra time to do their work. On the other hand, we may expect that writers are rare and may be OK with intruding on readers even if it doesn't scale well.
  • It seems logical that read lock releasers should take priority on the spin lock over write lock acquirers, and that write lock releasers should take priority over read lock acquirers. It seems to work well in practice too, improving throughput noticeably when a writer is trying to release its write lock.
  • Should readers spin on the RW lock while a write lock is held? My limited experimentation seems to say yes. Forcing readers into a wait state when a write lock is held and there is no write waiter seems to be bad sometimes. When the write lock is released, it wakes up all waiting readers, and before they wake up to attempt the lock, the same writer thread (or a different writer thread spinning to acquire a write lock) may acquire it. All the readers would then have to wake up and go back to sleep.

I really like your ShouldSpinFor* functions because they are very simple (no new state), and the logic clearly is simply giving up on spinning (which does not have a semantic effect, only perf).

Do we know how much just doing this helps? It has the effect of 'prioritizing' without extra state variables. As I recall, Chris' benchmark was write-only, in which case it won't help there (but it is still definitely worth doing, as the read-write case is the common one).

It appears the testing I did on limiting spinning was before I finished tuning the deprioritization. In the current state of this PR, it's not helping noticeably with scale. Even without the deprioritization (I hadn't tried this before), it seems to work somewhat well, but only noticeably with few threads. I still think it's a decent thing to do, as it's very unlikely to have negative effects. Mainly, once a writer starts waiting, it stops new readers from thrashing on the spin lock for too long unnecessarily, allowing read lock holders to efficiently release their locks. At least that's the theory. In practice, with the current state of deprioritization, as soon as a waiter starts to spin for the spin lock, new readers become deprioritized and read lock holders begin draining. By the time the writer registers as a waiter (often it does not even get to this point), most of the read lock holders have already drained, so the benefit of limiting new readers from contending with read lock releasers is negligible.

First the SpinWaits are too small. We start at 10, but the memory system is probably 20X slower than the CPU so frankly it never gives the memory system 'a break' which is part of the point.

That's interesting, I can try some experiments and get back.

Regarding limiting spinning, I can try out what you suggested. It may help to not starve the writer from registering as a waiter for so long. It may be tricky to tune, to avoid artificial delays that hurt throughput. I still prefer the deprioritization idea (introducing some fairness into spin lock acquisition), as it directly attacks the problem.

@vancem

Member

vancem commented Aug 9, 2017

Thanks @kouvel for the information above. It is helpful.

What I have learned (the hard way) with things like concurrency and thread pools, is

  1. Weird (bad) behaviors are the norm
  2. Thus the best solutions tend to be VERY simple, because the more complex it is the more places you can get weird behaviors.
  3. In particular you have to have a REALLY good simple rationale for why you do things.

So I would like to apply this to spinning and spin locks (in general). First, why do we have a spin lock in the reader-writer lock? Basically, the beauty of a spin lock is that it is both simple and efficient (because you don't need an interlocked operation on release), IF the lock hold time is (uniformly) small (that is, measured in 10s to 100s of CPU clocks, not 1000s). We strongly believe that this is true for 'myLock' (because it is ONLY held when you are in transition, for no more than the body of one of the RW lock methods). Indeed, the fact that we EVER get to the Sleep(0), let alone the Sleep(1), in the myLock code shows that we are in a 'bad' regime; however, myLock is not performing that badly (at any instant of time, SOME thread either has the lock or is in the process of getting it). This is about as good as you can do (after all, you DO have to serialize access to it). You can probably do a bit better to avoid consuming the (shared) memory bus (by spinning locally more before you ping the lock), but that is not the main problem.

The problem is the spinning at the RWLock level. Now it is fair to ask: do we need to spin at all? Why do we spin? The idea is that there is a COST to waiting (and waking up). It is the cost of context switching out and switching back (about 1-3K cycles), plus the cost you incur because the memory caches were 'cooled' (filled with other data) and now have to be refilled. This is hard to quantify and very scenario dependent, but a reasonable ballpark figure is about 10K-100K cycles (~0.03 msec). You can avoid this cost if you spin AND THE WAIT TIME IS SHORT (certainly < 100K cycles, but more like < 10K).
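
A quick sanity check on those numbers, assuming a ~3 GHz clock (the clock rate is my assumption; it is not stated above):

```python
# Back-of-envelope check of the wake-up cost estimate above, assuming a
# 3 GHz clock (an assumption; the comment does not name a clock rate).
CLOCK_HZ = 3e9
wake_cost_cycles = 100_000           # upper end of the 10K-100K estimate
wake_cost_ms = wake_cost_cycles / CLOCK_HZ * 1e3
print(f"{wake_cost_ms:.3f} ms")      # ~0.033 ms, roughly the 0.03 msec cited
```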

Thus spinning only helps in the case where the wait time is very short. More importantly, however, spinning only pays off if it succeeds. Otherwise you pay strictly MORE. This is the fundamental issue that causes the 'meltdown' where increased load makes the system LESS efficient.

One very simple solution is to simply not spin. This is really not that bad. It works well when lock hold time is long (since spinning does not help there), and when there is no contention. It will hurt however if hold time is very short. Thus what you WANT is something that spins when wait time is < 10-100K and otherwise does no spinning.

The trick is that you need to know how long the wait time is. But we HAVE a good idea of that: it is the time it took last time (or the average of some set of times in the past). To simplify the implementation even further, the spinCount is, roughly speaking, a measure of the time it took to enter the lock the last time. Thus simply remembering it as an approximation and starting 'there' (which might mean going directly to waiting) solves the problem. (You do have the complexity of updating this time approximation, but I believe we can do this in a very few lines of code (< 10).)

The key behavior we need is that if we ended up waiting in the past, then we should NOT bother spinning this time. Thus if there is high contention (and wait times are long), the system QUICKLY stops spinning completely.

While having priorities for the various kinds of entries into the lock also helps, what I like about the scheme above is that it is SIMPLE (and hopefully foolproof). You only spin if past experience gives you some reason to believe it will be successful. Otherwise you don't spin at all.

Notice that if you are NOT spinning, you ALWAYS make forward progress (every time the API enters MyLock, it accomplishes something). It will also be (roughly) fair (because when you wait, you get put in a queue of waiters). The lock will still be hot (lots of threads will all be trying to access it), but it will be an orderly process (every thread spins, enters MyLock, and either gets what it needs or gets blocked). Thus there are no 'useless spins'.

What I am arguing is that this scheme can be very simple (simpler than what you have), and the logic behind why it solves the 'spinning problem' is also simple (you never spin unless you have reason that it will succeed, and in particular you never REPEATEDLY spin and fail).

I realize that my suggestion here is not particularly welcome. You have something that seems to solve the problem, and here I am trying to make you do work to rewrite it. I get that... What I have learned is that the cost of developing software is small in comparison to maintaining it over time (since maintenance accumulates), so keeping things SIMPLE is worthwhile.

I am arguing that you can implement the scheme I suggest in well under half the lines of code (hopefully a couple dozen total). The number is small enough that I will write them if you want me to. We can then see if that solution solves the issue, and if it does, is it reasonable to solve it that way?

@kouvel

Member

kouvel commented Aug 9, 2017

I realize that my suggestion here is not particularly welcome. You have something that seems to solve the problem, and here I am trying to make you do work to rewrite it. I get that... What I have learned is that the cost of developing software is small in comparison to maintaining it over time (since maintenance accumulates), so keeping things SIMPLE is worthwhile.

I'm not opposed to changing the approach. I agree that what you suggested here may work reasonably well for this problem and that may suffice. It may even help with other more common scenarios.

I am arguing that you can implement the scheme I suggest in well under half the lines of code (hopefully a couple dozen total). The number is small enough that I will write them if you want me to. We can then see if that solution solves the issue, and if it does, is it reasonable to solve it that way?

Since you have something specific in mind, if you're interested in trying it out, please go ahead.

@kouvel kouvel added the * NO MERGE * label Aug 9, 2017

@vancem

Member

vancem commented Aug 9, 2017

Ok, let me take a crack at it.

@kouvel

Member

kouvel commented Aug 19, 2017

Responded on #13324. I have pushed a couple of commits to try and simplify the code a bit. Of what I've tried, I believe this is the most reasonable approach to solving these problems.

@kouvel

Member

kouvel commented Aug 19, 2017

Added another commit that prevents waking multiple waiters when only one can make progress. For instance, when there are multiple write waiters and a writer keeps entering and exiting a write lock, every exit causes it to wake up a new waiting writer, preventing write waiters from remaining in a wait state.

This last commit solves most of the scalability issue by itself and to a much greater degree in many cases than deprioritization. I believe it reduces contention on the spin lock by allowing waiters to remain waiting and eliminates excessive context switching from threads waking up and going back to sleep. However, it doesn't solve the writer starvation issue by itself, and deprioritization still performs an order of magnitude better in several cases. The last commit and deprioritization seem to combine well, getting the benefits of both.
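
The wake-one-waiter policy is essentially "notify one" instead of "notify all" when only one waiter can make progress. A minimal Python sketch of the general pattern (this is not the event machinery ReaderWriterLockSlim actually uses):

```python
import threading


class WriteGate:
    """Sketch of the wake-one policy: on exit, wake at most one waiting
    writer instead of all of them, since only one can acquire the lock."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._held = False

    def enter_write(self) -> None:
        with self._cond:
            while self._held:
                self._cond.wait()
            self._held = True

    def exit_write(self) -> None:
        with self._cond:
            self._held = False
            # notify(), not notify_all(): waking every waiting writer
            # would just send all but one back to sleep after a futile
            # attempt, which is the excess context switching described.
            self._cond.notify()
```

Waking a single waiter lets the rest remain parked, which is exactly what reduces the spin-lock contention and context-switch churn described in the comment above.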

Updated perf numbers (benchmark and scoring has changed from before):

Baseline (left) vs Changed (right)                       Left score       Right score      ∆ Score %
-------------------------------------------------------  ---------------  ---------------  ---------
Concurrency_12Readers_1-3Writers                         21335.21 ±1.54%  21814.40 ±1.04%     2.25%
Concurrency_48Readers_1-12Writers                         7795.03 ±9.83%  19802.80 ±2.57%   154.04%
Concurrency_192Readers_1-32Writers                        877.41 ±13.18%  19427.11 ±2.51%  2114.14%
Concurrency_768Readers_1-32Writers                        547.95 ±15.21%  21131.86 ±1.69%  3756.55%

@ChrisAhna's repro from #12780

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5000ms FinalCount=144589363 CountsPerSecond=28916967.4989173
ThreadCount=0002: Elapsed=5014ms FinalCount=144584762 CountsPerSecond=28830976.1227239
ThreadCount=0004: Elapsed=5015ms FinalCount=144036668 CountsPerSecond=28716351.0107656
ThreadCount=0008: Elapsed=5048ms FinalCount=138234503 CountsPerSecond=27380895.2031019
ThreadCount=0016: Elapsed=5036ms FinalCount=134589683 CountsPerSecond=26721099.3625001
ThreadCount=0032: Elapsed=5031ms FinalCount=127872590 CountsPerSecond=25415045.1992862
ThreadCount=0064: Elapsed=5042ms FinalCount=128792699 CountsPerSecond=25540375.4453682
ThreadCount=0128: Elapsed=5093ms FinalCount=127502272 CountsPerSecond=25030204.4401221
ThreadCount=0256: Elapsed=5086ms FinalCount=128297554 CountsPerSecond=25221690.5011927
ThreadCount=0512: Elapsed=5061ms FinalCount=129017905 CountsPerSecond=25489084.9449699
ThreadCount=1024: Elapsed=5062ms FinalCount=125965942 CountsPerSecond=24880288.419177
ThreadCount=2048: Elapsed=5094ms FinalCount=122257552 CountsPerSecond=23998523.3918935
ThreadCount=4096: Elapsed=5225ms FinalCount=110505127 CountsPerSecond=21147566.2289993

kouvel added a commit to kouvel/coreclr that referenced this pull request Aug 20, 2017

Improve ReaderWriterLockSlim scalability
Alternative to and subset of dotnet#13243, fixes dotnet#12780

- Prevented waking more than one waiter when only one of them may acquire the lock
- Limited spinning in some cases where it is very unlikely that spinning would help

The _myLock spin lock runs into some bad scalability issues. For example:
1. Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
2. Readers, and especially writers, that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are ones that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

This PR does not solve (1), but solves (2) to a degree that could be considered sufficient. dotnet#13243 solves (1) and (2) and for (2) it is still better by order-of-magnitude compared with this PR in several cases that I believe are not extreme, but if the complexity of deprioritization is deemed excessive for the goals then of what I tried so far this is the perhaps simplest and most reasonable way to solve (2).

I believe this PR would also be a reasonably low-risk one to port back to .NET Framework.
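The "prevented waking more than one waiter" idea can be illustrated with a small policy sketch. This is Python for illustration only; the function and parameter names are hypothetical, not the actual ReaderWriterLockSlim internals. The key observation is that only one writer can ever enter at a time, so waking more than one waiting writer just creates contention.

```python
def waiters_to_wake(lock_held_for_write, active_readers,
                    waiting_writers, waiting_readers):
    """Decide how many waiters to wake when a lock is released.

    Illustrative sketch of the single-waiter wakeup policy described
    above; all names here are invented, not CoreCLR code.
    """
    if lock_held_for_write or active_readers > 0:
        # Lock is still held; nobody woken now could make progress.
        return 0
    if waiting_writers > 0:
        # At most one writer can enter; waking more would only add
        # spinners that fail to acquire and go back to sleep.
        return 1
    # With no writers waiting, all waiting readers can enter together.
    return waiting_readers
```

Waking exactly one writer at a time avoids the thundering-herd effect that shows up as wasted CPU in the "only writers" benchmarks.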

kouvel added a commit that referenced this pull request Aug 23, 2017

Improve ReaderWriterLockSlim scalability (#13495)
* Improve ReaderWriterLockSlim scalability

Subset of #13243, fixes #12780

- Prevented waking more than one waiter when only one of them may acquire the lock
- Limited spinning in some cases where it is very unlikely that spinning would help

The _myLock spin lock runs into some bad scalability issues. For example:
1. Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
2. Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

This PR does not solve (1), but solves (2) to a degree that could be considered sufficient. #13243 solves (1) and (2) and for (2) it is still better by order-of-magnitude compared with this PR in several cases that I believe are not extreme, but if the complexity of deprioritization is deemed excessive for the goals then of what I tried so far this is the perhaps simplest and most reasonable way to solve (2).

I believe this PR would also be a reasonably low-risk one to port back to .NET Framework.

@karelz karelz removed the bug label Aug 25, 2017

@sdmaclea


sdmaclea commented Aug 25, 2017

@kouvel I have another optimization of this as well. I drafted it when the arm64 volatile function lookup cache was broken on arm64. SpinLock.cs was getting hit on every volatile instruction lookup. My patch made a substantial improvement. Since I fixed the real issue, I never submitted the alternative. I'll dig it up.

@sdmaclea


sdmaclea commented Aug 25, 2017

@kouvel #13599 is the patch I meant, but it is not in managed code.

@kouvel


kouvel commented Sep 5, 2017

Rebased and squashed to simplify resolving merge conflicts. Updated perf numbers from current state below. I think this is still a beneficial change, as it significantly improves scalability at higher thread counts when there are readers and writers, and fixes the writer starvation issue.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

RwSB vs RwS                         Left score        Right score      ∆ Score %
----------------------------------  ----------------  ---------------  ---------
Concurrency_OnlyReadersPcx01         23253.75 ±0.53%  24133.35 ±0.12%      3.78%
Concurrency_OnlyReadersPcx04         23643.40 ±0.10%  23961.26 ±0.35%      1.34%
Concurrency_OnlyReadersPcx16         23381.09 ±0.12%  23913.45 ±0.14%      2.28%
Concurrency_OnlyReadersPcx64         16003.45 ±3.93%  21900.96 ±0.29%     36.85%
Concurrency_OnlyWritersPcx01         23740.19 ±0.45%  24422.96 ±0.28%      2.88%
Concurrency_OnlyWritersPcx04         23923.59 ±0.33%  23973.27 ±0.20%      0.21%
Concurrency_OnlyWritersPcx16         23602.52 ±0.49%  23149.35 ±0.37%     -1.92%
Concurrency_OnlyWritersPcx64         24017.52 ±0.33%  19312.34 ±1.72%    -19.59%
Concurrency_Pcx01Readers_01Writers   10771.92 ±4.53%  11837.56 ±5.02%      9.89%
Concurrency_Pcx01Readers_02Writers  10009.64 ±14.71%  20353.36 ±3.56%    103.34%
Concurrency_Pcx04Readers_01Writers   16536.58 ±1.98%  19290.98 ±0.74%     16.66%
Concurrency_Pcx04Readers_02Writers   13014.96 ±3.44%  17689.59 ±3.25%     35.92%
Concurrency_Pcx04Readers_04Writers   12313.32 ±8.32%  20302.25 ±3.82%     64.88%
Concurrency_Pcx16Readers_01Writers   18518.40 ±4.11%  21922.77 ±1.98%     18.38%
Concurrency_Pcx16Readers_02Writers   16789.55 ±3.62%  20524.91 ±2.29%     22.25%
Concurrency_Pcx16Readers_04Writers   13315.66 ±7.32%  21255.95 ±2.22%     59.63%
Concurrency_Pcx16Readers_08Writers   13530.42 ±6.56%  22919.80 ±1.53%     69.39%
Concurrency_Pcx64Readers_01Writers   14054.68 ±7.12%  18556.82 ±4.78%     32.03%
Concurrency_Pcx64Readers_02Writers   11597.02 ±8.58%  17548.92 ±4.89%     51.32%
Concurrency_Pcx64Readers_04Writers   11727.83 ±7.50%  16453.37 ±5.18%     40.29%
Concurrency_Pcx64Readers_08Writers   9779.97 ±11.81%  21636.79 ±1.75%    121.24%
Concurrency_Pcx64Readers_16Writers   11873.76 ±9.23%  22943.82 ±0.80%     93.23%
----------------------------------  ----------------  ---------------  ---------
Total                                15802.03 ±4.87%  20561.88 ±2.07%     30.12%

Core i7-6700 (Skylake, 4-core, 8-thread):

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         25844.75 ±0.18%   27646.62 ±0.40%      6.97%
Concurrency_OnlyReadersPcx04         25960.00 ±0.15%   26554.70 ±0.71%      2.29%
Concurrency_OnlyReadersPcx16         25605.03 ±0.36%   27411.46 ±0.24%      7.05%
Concurrency_OnlyReadersPcx64         24484.74 ±0.85%   26779.13 ±0.18%      9.37%
Concurrency_OnlyWritersPcx01         29088.16 ±0.53%   28208.32 ±0.35%     -3.02%
Concurrency_OnlyWritersPcx04         26173.08 ±0.91%   27457.82 ±0.26%      4.91%
Concurrency_OnlyWritersPcx16         26048.41 ±1.82%   27051.25 ±0.20%      3.85%
Concurrency_OnlyWritersPcx64         20804.65 ±5.81%   26680.97 ±0.25%     28.25%
Concurrency_Pcx01Readers_01Writers   12637.79 ±8.93%   15786.84 ±7.72%     24.92%
Concurrency_Pcx01Readers_02Writers   4876.47 ±14.10%   24541.40 ±2.84%    403.26%
Concurrency_Pcx04Readers_01Writers    9726.81 ±8.71%   20252.46 ±2.52%    108.21%
Concurrency_Pcx04Readers_02Writers    5876.21 ±9.43%   24402.43 ±2.50%    315.28%
Concurrency_Pcx04Readers_04Writers   3523.01 ±28.39%   26001.85 ±2.86%    638.06%
Concurrency_Pcx16Readers_01Writers   8803.89 ±16.50%   21202.79 ±5.18%    140.83%
Concurrency_Pcx16Readers_02Writers   6036.79 ±16.47%   20864.33 ±3.74%    245.62%
Concurrency_Pcx16Readers_04Writers   3710.47 ±18.96%   27024.88 ±1.39%    628.34%
Concurrency_Pcx16Readers_08Writers   5065.23 ±28.57%   26793.17 ±2.12%    428.96%
Concurrency_Pcx64Readers_01Writers   8701.36 ±22.39%  19495.47 ±11.67%    124.05%
Concurrency_Pcx64Readers_02Writers   8640.47 ±19.63%  16308.06 ±14.03%     88.74%
Concurrency_Pcx64Readers_04Writers   5556.25 ±11.76%   20453.60 ±7.31%    268.12%
Concurrency_Pcx64Readers_08Writers   7012.25 ±16.31%   26671.01 ±1.52%    280.35%
Concurrency_Pcx64Readers_16Writers  14703.94 ±12.71%   27182.44 ±1.67%     84.86%
----------------------------------  ----------------  ----------------  ---------
Total                               11064.93 ±11.53%   23975.96 ±3.24%    116.68%

@kouvel kouvel requested a review from tarekgh Sep 5, 2017

@kouvel


kouvel commented Sep 5, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel


kouvel commented Sep 8, 2017

@dotnet-bot test OSX10.12 x64 Checked Build and Test
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel


kouvel commented Sep 8, 2017

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@kouvel


kouvel commented Sep 9, 2017

Ping for review please, @tarekgh, @stephentoub, @vancem. @vancem mentioned in #13324 that he's ok with this direction but it's not clear to me whether consensus has been reached on going ahead with this PR. I'm sure that the ideas @vancem suggested would improve this further for more common cases, but it's difficult to tune it and to get it right, and based on my tests and current understanding, those solutions don't solve the writer starvation problem and may at best reduce the severity of the scalability problem without solving it. As such, I believe the solutions are trying to solve different problems and I believe they would work well together, so I'd rather solve the scalability issue with this PR and leave the dynamic spin heuristics for another PR. Just to be clear, I'm perfectly ok with closing this PR if we want to do more investigation before settling on this solution.

@vancem


vancem commented Sep 13, 2017

I am OK with having this PR proceed. It is not ideal because it does add complexity, and I am still thinking that a simpler solution is possible, but at some point we have to move on. The solution embodied in this PR is conceptually simple (you want to have the _myLock spin lock understand priorities and you want to give priority to threads needing to take _myLock to RELEASE the lock, and secondarily to writers).

I was hoping to keep it even simpler by making spin-lock improvements (that we would make on all our spin locks), that would be 'good enough' to avoid the scaling issue. However @kouvel did quite a bit of measurement to show that it is harder than I thought. While I have not completely given up, I agree that, for now at least, we should move on. Thus we need SOME solution.

Arguably having this kind of priority scheme is always good from an efficiency perspective (it is clear that giving priority to those RELEASING locks can only help things); the question is whether it is worth the complexity. Given that it solves the original scaling problem (which is good, but is arguably a corner case, since users should FIX a lock that gets that hot), but ALSO improves more real-world scenarios (1 writer and a modest number of readers) in a nontrivial way

Concurrency_Pcx01Readers_01Writers   12637.79 ±8.93%   15786.84 ±7.72%     24.92%
Concurrency_Pcx04Readers_01Writers    9726.81 ±8.71%   20252.46 ±2.52%    108.21%
Concurrency_Pcx16Readers_01Writers   8803.89 ±16.50%   21202.79 ±5.18%    140.83

makes it worth it. Also the change really is not THAT complex (it is in the implementation of a single class, and the intuition behind why it works is intuitive). It is likely to be ADDITIVE to any improvements we make in spin heuristics later.

The only concern is this regression on the Xeon E5-1650

Concurrency_OnlyWritersPcx64         24017.52 ±0.33%  19312.34 ±1.72%    -19.59%

However I note that it did not happen on the Core i7-6700, where there was an improvement:

Concurrency_OnlyWritersPcx64         20804.65 ±5.81%   26680.97 ±0.25%     28.25%

And this benchmark is not the most interesting (if you are using a reader-writer lock, you expect readers).

So I am OK with this being checked in.

@vancem

vancem approved these changes Sep 13, 2017

Improve ReaderWriterLockSlim scalability
Fixes #12780

The _myLock spin lock runs into some bad scalability issues. For example:
- Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
- Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

Took some suggestions from @vancem and landed on the following after some experiments:
- Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
- Limited spinning in some cases where it is very unlikely that spinning would help
@@ -63,11 +63,17 @@ public class ReaderWriterLockSlim : IDisposable
// the events that hang off this lock (eg writeEvent, readEvent upgradeEvent).
private int _myLock;
// Used to deprioritize threads attempting to enter _myLock when they would not make progress after doing so
private int _enterMyLockDeprioritizationState;


@vancem

vancem Sep 13, 2017


Please put a reasonably verbose comment about what this state is. In particular, it is a tuple of two counts (read and write), and these counts are requests to deprioritize that set of lock requestors. All of this is hidden behind EnterSpinLock (I am sorely tempted to make a struct SpinLock which encapsulates _myLock and _enterMyLockDeprioritizationState and hides all this except for EnterMyLockSpin(EnterMyLockReason reason); this keeps it conceptually localized).
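The tuple-of-two-counts state described above can be sketched as two counts bit-packed into one int. This is a Python sketch under assumptions: the constants, shifts, and function names are invented for illustration and do not reflect the actual CoreCLR field layout.

```python
# Hypothetical packing: high bits hold the reader-deprioritization
# count, low 16 bits hold the writer-deprioritization count.
READER_SHIFT = 16
WRITER_MASK = (1 << READER_SHIFT) - 1


def deprioritize(state, readers=0, writers=0):
    """Request that future reader/writer acquirers of the spin lock
    back off for a while (e.g. to let a releaser through)."""
    return state + (readers << READER_SHIFT) + writers


def should_back_off(state, is_writer):
    """A deprioritized thread spins without even attempting to
    acquire the spin lock, reducing contention on it."""
    if is_writer:
        return (state & WRITER_MASK) != 0
    return (state >> READER_SHIFT) != 0
```

Keeping both counts in one int lets the real implementation update them with a single interlocked operation, which is presumably why a single field is used rather than two.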

@kouvel


kouvel commented Sep 13, 2017

Thanks @vancem.

Regarding the regression above, it appears to be happening by chance but more consistently on this processor. In that test all enter-write threads will eventually go into a wait state while one writer holds the lock. The writer that holds the lock typically acquires and releases the lock repeatedly and wakes up at most one write waiter at a time. In baseline, the write waiter is usually not able to acquire the lock when it wakes up. In the changed side, the write waiter is acquiring the lock more frequently. This causes the previous lock releaser thread to become a spinner and the write waiter that now holds the lock wakes up another write waiter when it releases the lock. As a result, more threads are spinning for the write lock instead of waiting.

The CPU usage shows this: the changed side is using 2.5x more CPU than the baseline. The issue occurs consistently when there are 64 * proc count writer threads but consistently does not occur when there are 16 * proc count writer threads, and there's not much difference between those two scenarios. I can't think of a reason why this would be happening; maybe slight timing changes are making the difference.

One way to fix this deterministically may be to register the number of spinners and avoid waking up waiters when spinners can satisfy the wait. I'll leave that for another time if it becomes a real issue.
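The spinner-registration idea suggested here could look roughly like the following. This is a hypothetical sketch; none of these names exist in ReaderWriterLockSlim, and the real fix was deferred as noted above.

```python
class WakePolicy:
    """Track spinning acquirers so a releaser can skip waking a
    sleeping waiter when a spinner is already poised to take the
    lock. Purely illustrative, not code from this PR."""

    def __init__(self):
        self.spinners = 0

    def begin_spin(self):
        self.spinners += 1

    def end_spin(self):
        self.spinners -= 1

    def should_wake_waiter_on_release(self, waiting_writers):
        # If a spinner can satisfy this release, waking a waiter
        # would only convert it into yet another spinner.
        return waiting_writers > 0 and self.spinners == 0
```

This would make the wake decision deterministic instead of depending on whether a just-woken waiter happens to win the race against the releaser.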

@vancem


vancem commented Sep 13, 2017

Thanks for looking into it @kouvel. It is good to hear that the regression does not happen consistently. I agree that it is not clear it is worth fixing (the scenario is not the highest priority, it only happens when the lock is hot enough that users should be fixing the hot lock, and the regression is not bad).

For what it is worth, I think it would be useful if we changed our scalability tests to have parameterized amounts of 'in-lock' and 'out-lock' work that is in the range of 100 to 10000 cycles. This would be much more representative of a real user workload. As it is, our current microbenchmarks seem to be leading us to tune for the 'hot lock' case, which is NOT really the most important case (when locks use 80%+ of the CPU, user code likely needs to be fixed to cool the lock; but when locks use 20% of the CPU, reducing it to 10% is an important win).

@kouvel


kouvel commented Sep 13, 2017

I'm gathering some perf numbers with delays added

@kouvel


kouvel commented Sep 14, 2017

I added a delay inside and outside each lock by calculating the N'th Fibonacci number recursively, where N is chosen between 4 and 14 by a pseudorandom number generator. On my machine, N=4 and N=14 correspond to approximately 100 and 10,000 cycles respectively. This scheme prefers smaller delays.

With delays added, it seems difficult to compare the throughput scores apples to apples due to the writer starvation problem. The baseline starves writers and allows readers to happily chug away. The changed version does not starve writers and since writers block readers from acquiring the lock, the greater the frequency of writers, the less the overall throughput because writers drain all readers before they acquire the lock.

To make things a bit more realistic and comparable, I increased the delay outside the write lock such that N is between 15 and 19. On my machine, Fib(19) takes about 35 us to calculate using recursion.
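The cost of the naive recursive Fibonacci grows roughly by the golden ratio for each increment of N, so a uniform draw of N lands mostly in the cheap range, which is why the scheme "prefers smaller delays". Counting recursive calls as a proxy for cycles makes the spread visible (Python, illustrative; the helper below is not the benchmark code):

```python
def fib_calls(n):
    """Number of function calls made by the naive recursive Fib(n)."""
    if n < 2:
        return 1
    return 1 + fib_calls(n - 1) + fib_calls(n - 2)

# The call count satisfies fib_calls(n) = 2 * Fib(n + 1) - 1, so the
# cost ratio between N=4 and N=14 is about 135x, roughly in line with
# the ~100-cycle vs ~10,000-cycle figures quoted above.
```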

Here are some raw numbers. Each line is a measurement over 500 ms, all in one process in continuous operation.

1 writer thread

(1 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1182 2 1184 1184 3193 29 3222 3222
1573 2 1575 1379 3233 31 3264 3243
825 2 827 1195 3157 34 3191 3226
1133 2 1134 1180 3188 30 3218 3224
719 1 720 1088 3140 35 3175 3214
1074 2 1076 1086 3150 36 3186 3209
1278 2 1279 1114 3166 31 3198 3208
792 2 794 1074 3142 35 3176 3204
1494 3 1497 1121 3049 35 3084 3191
1068 2 1070 1116 3142 34 3177 3189
959 3 961 1102 3171 29 3200 3190
1193 2 1195 1109 3134 33 3167 3188
934 2 936 1096 3151 32 3183 3188
1405 3 1408 1118 3170 35 3205 3189
1390 2 1392 1137 3155 33 3188 3189
855 2 857 1119 3144 34 3178 3188
1485 1 1486 1141 3166 31 3198 3189
995 2 997 1133 3158 33 3191 3189
1136 1 1138 1133 3160 32 3192 3189
620 1 621 1107 3132 33 3165 3188

(4 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1343 0 1343 1343 3205 18 3223 3223
1108 0 1108 1226 3245 22 3268 3245
1108 0 1108 1187 3255 18 3272 3254
1236 0 1236 1199 3224 19 3243 3252
870 0 870 1133 3204 18 3222 3246
1342 0 1342 1168 3237 18 3255 3247
1083 0 1083 1156 3183 22 3206 3241
1087 0 1087 1147 3240 15 3255 3243
1258 0 1258 1159 2856 14 2870 3202
1065 0 1065 1150 3253 13 3265 3208
1275 0 1275 1161 3227 20 3246 3211
1125 0 1125 1158 3219 16 3234 3213
968 0 968 1144 3101 23 3124 3206
1182 0 1182 1146 3142 20 3161 3203
1130 0 1130 1145 3243 15 3257 3207
1185 0 1185 1148 3169 18 3187 3206
1319 0 1319 1158 3204 17 3221 3207
1110 0 1110 1155 3247 10 3258 3209
1315 0 1315 1164 3222 17 3239 3211
1183 0 1183 1165 3180 24 3204 3211

Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2402 0 2402 2402 3025 4 3029 3029
2473 0 2473 2438 2552 1 2553 2791
1958 0 1958 2278 3159 7 3165 2916
2359 0 2359 2298 3250 6 3256 3001
2324 0 2324 2303 3275 4 3279 3057
2298 0 2298 2302 3211 9 3221 3084
2379 0 2379 2313 3161 3 3164 3095
2300 0 2300 2312 3238 5 3243 3114
2176 0 2176 2297 3298 4 3302 3135
2437 0 2437 2311 3262 5 3267 3148
2809 0 2809 2356 3256 7 3263 3158
1926 0 1926 2320 2893 6 2899 3137
2522 0 2522 2336 2832 6 2837 3114
2032 0 2032 2314 2896 6 2903 3099
2099 0 2099 2300 3271 3 3274 3110
2658 0 2658 2322 3281 5 3286 3121
2628 0 2628 2340 3286 8 3294 3132
2089 0 2089 2326 3066 8 3074 3128
1956 0 1956 2307 3084 9 3093 3127
2403 0 2403 2311 3332 4 3336 3137

2 writer threads

(1 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
828 7 835 835 850 80 2930 2930
765 6 771 803 2443 82 2525 2727
1079 7 1086 897 2796 79 2875 2777
1100 8 1108 950 2683 79 2762 2773
1114 10 1125 985 2660 77 2736 2766
1225 10 1235 1026 2638 77 2715 2757
867 7 874 1005 2794 79 2872 2774
876 7 883 989 2886 76 2962 2797
942 8 950 985 2866 72 2938 2813
893 5 898 976 2668 79 2747 2806
635 5 640 946 2669 80 2748 2801
923 8 931 945 2503 79 2582 2783
848 5 853 938 2843 79 2922 2793
792 5 797 927 2520 80 2600 2780
810 7 817 920 2647 80 2726 2776
1219 12 1231 940 2689 78 2766 2776
1078 9 1088 948 2669 77 2746 2774
1142 9 1151 959 2760 79 2840 2777
991 11 1002 962 2648 79 2727 2775
1129 10 1140 971 2759 80 2839 2778

(4 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1379 0 1379 1379 2409 50 2458 2458
1135 0 1135 1257 2054 63 2117 2288
1275 0 1275 1263 2185 61 2246 2274
1195 0 1195 1246 2340 61 2401 2306
1135 0 1135 1224 2120 49 2170 2278
1105 0 1105 1204 2087 67 2153 2258
1049 0 1049 1182 2234 49 2282 2261
1499 0 1499 1221 2001 68 2069 2237
1513 0 1513 1254 1913 66 1979 2208
1033 0 1033 1232 2007 60 2067 2194
980 0 980 1209 1954 62 2016 2178
1263 0 1263 1213 2101 64 2165 2177
1305 0 1305 1221 2146 72 2217 2180
1200 0 1200 1219 2087 66 2153 2178
1221 0 1222 1219 2334 71 2405 2193
1161 0 1161 1216 2326 69 2395 2206
1375 0 1375 1225 2093 65 2157 2203
1387 0 1387 1234 1703 65 1769 2179
1306 0 1306 1238 2632 55 2687 2206
1116 0 1116 1232 2244 54 2299 2210

Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2105 0 2105 2105 2494 32 2526 2526
1879 0 1879 1992 2067 44 2112 2319
2524 0 2524 2169 1622 47 1669 2102
1472 0 1472 1995 2149 48 2197 2126
2271 0 2271 2050 1745 37 1781 2057
1562 0 1562 1969 1511 37 1548 1972
2969 0 2969 2112 1818 36 1853 1955
2428 0 2428 2151 2038 36 2073 1970
2004 0 2004 2135 1498 41 1539 1922
1714 0 1714 2093 2049 33 2082 1938
2409 0 2409 2121 1896 38 1934 1938
2247 0 2247 2132 2044 41 2085 1950
2258 0 2258 2142 1919 44 1964 1951
2194 0 2194 2145 1816 42 1858 1944
2595 0 2595 2175 1904 31 1935 1944
2441 0 2441 2192 1869 39 1908 1942
2166 0 2167 2190 1971 36 2007 1945
2177 0 2177 2190 2116 32 2148 1957
2585 0 2585 2211 2420 37 2456 1983
2397 0 2397 2220 1627 42 1668 1967

The baseline has a higher overall throughput here but it also starves writers, so not very meaningful.


There will also be cases where the changed version has much lower throughput due to not starving writers and there being a high frequency of writers. The following is an example, and the difference can get more severe than this, but in the end I don't think it's a very meaningful comparison.

8 writer threads

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2070 0 2070 2070 1376 177 1553 1553
2542 0 2542 2306 1109 199 1309 1431
2774 0 2774 2462 890 213 1103 1322
1703 0 1703 2272 1221 172 1393 1340
2331 0 2331 2284 1155 191 1346 1341
2525 0 2525 2324 1192 177 1369 1346
2155 0 2156 2300 1002 155 1157 1319
2120 0 2121 2278 958 206 1164 1299
2414 0 2414 2293 1025 187 1213 1290
2461 0 2461 2310 1112 192 1304 1291
2271 0 2271 2306 1221 180 1400 1301
2373 0 2373 2312 1176 162 1338 1304
2616 0 2616 2335 1212 175 1386 1310
1629 0 1629 2285 1010 202 1212 1303
2295 0 2295 2285 757 206 963 1281
2485 0 2486 2298 1389 174 1563 1298
1834 0 1834 2271 1383 161 1544 1313
2407 0 2407 2278 1075 192 1266 1310
1902 0 1902 2258 1072 177 1249 1307
2267 0 2267 2259 926 186 1112 1297
@kouvel


kouvel commented Sep 14, 2017

Code used for raw numbers above:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

internal static class ReaderWriterLockSlimScalePerf
{
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;

        int readerThreadCount, writerThreadCount;
        if (args.Length == 0)
        {
            readerThreadCount = 1;
            writerThreadCount = 1;
        }
        else
        {
            readerThreadCount = (int)uint.Parse(args[0]);
            writerThreadCount = (int)uint.Parse(args[1]);
        }

        var rw = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);

        var sw = new Stopwatch();
        var startThreads = new ManualResetEvent(false);

        var counts = new int[48];
        var readerThreads = new Thread[readerThreadCount];
        ThreadStart readThreadStart =
            () =>
            {
                var rng = new Random(0);
                startThreads.WaitOne();
                while (true)
                {
                    uint d0 = (uint)rng.Next(4, 10);
                    uint d1 = (uint)rng.Next(4, 10);
                    rw.EnterReadLock();
                    Delayer.CpuDelay(d0);
                    rw.ExitReadLock();
                    Interlocked.Increment(ref counts[16]);
                    Delayer.CpuDelay(d1);
                }
            };
        for (int i = 0; i < readerThreadCount; ++i)
        {
            readerThreads[i] = new Thread(readThreadStart);
            readerThreads[i].IsBackground = true;
            readerThreads[i].Start();
        }

        var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
        var writerThreads = new Thread[writerThreadCount];
        ThreadStart writeThreadStart =
            () =>
            {
                var rng = new Random(1);
                startThreads.WaitOne();
                while (true)
                {
                    uint d0 = (uint)rng.Next(4, 14);
                    uint d1 = (uint)rng.Next(15, 20);
                    rw.EnterWriteLock();
                    Delayer.CpuDelay(d0);
                    rw.ExitWriteLock();
                    Interlocked.Increment(ref counts[32]);
                    Delayer.CpuDelay(d1);
                }
            };
        for (int i = 0; i < writerThreadCount; ++i)
        {
            writerThreads[i] = new Thread(writeThreadStart);
            writerThreads[i].IsBackground = true;
            writerThreads[i].Start();
        }

        startThreads.Set();

        // Warmup
        Thread.Sleep(500);
        Interlocked.Exchange(ref counts[16], 0);
        Interlocked.Exchange(ref counts[32], 0);

        // Actual run
        var unweightedScores = new List<double>();
        var scores = new List<double>();
        while (true)
        {
            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            int readCount = Interlocked.Exchange(ref counts[16], 0);
            int writeCount = Interlocked.Exchange(ref counts[32], 0);

            double elapsedMs = sw.Elapsed.TotalMilliseconds;
            double unweightedScore = (readCount + writeCount) / elapsedMs;
            double score =
                new double[]
                {
                    Math.Max(1, (readCount + writeCount) / elapsedMs),
                    Math.Max(1, writeCount / elapsedMs)
                }.GeometricMean(readerThreadCount, writerThreadCount);
            unweightedScores.Add(unweightedScore);
            scores.Add(score);

            Console.WriteLine(
                "{0:0}\t{1:0}\t{2:0}\t{3:0}\t{4:0}\t{5:0}",
                readCount / elapsedMs,
                writeCount / elapsedMs,
                unweightedScore,
                unweightedScores.Average(),
                score,
                scores.Average());
        }
    }

    internal static class Delayer
    {
        private static int[] s_delayValues = new int[32];

        public static void CpuDelay(uint n)
        {
            s_delayValues[16] = Fib(n);
        }

        private static int Fib(uint n)
        {
            if (n == 0)
                return 0;
            if (n == 1)
                return 1;
            return Fib(n - 2) + Fib(n - 1);
        }
    }

    public static double GeometricMean(this IEnumerable<double> values, params double[] weights)
    {
        double logSum = 0, weightSum = 0;
        int weightIndex = 0;
        foreach (var value in values)
        {
            if (weightIndex >= weights.Length)
                throw new InvalidOperationException();
            var weight = weights[weightIndex];
            ++weightIndex;
            logSum += Math.Log(value) * weight;
            weightSum += weight;
        }
        return Math.Exp(logSum / weightSum);
    }
}
@kouvel


kouvel commented Sep 14, 2017

If there are no further concerns, I'll go ahead and merge this tomorrow

@kouvel kouvel merged commit 7f2b64a into dotnet:master Sep 14, 2017

16 checks passed: CROSS Check; CentOS7.1 x64 Debug Build and Test; OSX10.12 x64 Checked Build and Test; Tizen armel Cross Debug Build; Tizen armel Cross Release Build; Ubuntu arm Cross Release Build; Ubuntu x64 Checked Build and Test; Ubuntu x64 Formatting; Ubuntu16.04 arm Cross Debug Build; Windows_NT x64 Debug Build and Test; Windows_NT x64 Formatting; Windows_NT x64 Release Priority 1 Build and Test; Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness; Windows_NT x86 Checked Build and Test; Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness; Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel kouvel deleted the kouvel:RwLockSlimSpinLock branch Sep 14, 2017

@vancem


vancem commented Sep 14, 2017

I liked the encapsulation of the myLock!
