Improve ReaderWriterLockSlim scalability #13243

Merged
merged 3 commits on Sep 14, 2017

Conversation

@kouvel
Member

kouvel commented Aug 6, 2017

Fixes #12780

The _myLock spin lock runs into some bad scalability issues. For example:

  • Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
  • Readers, and especially writers, that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are ones that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

Took some suggestions from @vancem and landed on the following after some experiments:

  • Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
  • Limited spinning in some cases where it is very unlikely that spinning would help
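
The deprioritization policy can be sketched roughly as follows. This is a minimal illustration of the idea in Python, not the actual C# implementation; the function name and the spin counts are invented for the example:

```python
# Rough sketch of the deprioritization policy (illustrative only; the
# names and constants are invented, not taken from the coreclr sources).
# Acquirers that are unlikely to make progress on the RW lock burn extra
# spin-wait iterations before competing for the _myLock spin lock, which
# lets releasers (who always make progress) through first.

def spins_before_attempting_spin_lock(is_releaser: bool,
                                      deprioritized: bool,
                                      base_spins: int = 10) -> int:
    """Spin-wait iterations to burn before touching the spin lock."""
    if is_releaser:
        # A releaser always makes progress on the RW lock once it holds
        # the spin lock, so let it attempt the acquisition immediately.
        return 0
    if deprioritized:
        # E.g. a reader while a writer is waiting: it could not take the
        # RW lock anyway, so it should stay out of the releasers' way.
        return base_spins * 4
    return base_spins
```

The effect is a crude form of fairness: threads whose spin-lock acquisition cannot advance the RW lock's state arrive at the spin lock less often, so releasers and likely-to-succeed acquirers face less contention.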
Baseline (left) vs Changed (right)                       Left score       Right score      ∆ Score %
-------------------------------------------------------  ---------------  ---------------  ---------
Concurrency_12Readers                                    22576.33 ±0.61%  22485.13 ±0.84%     -0.40%
Concurrency_12Writers                                    22661.57 ±0.36%  23159.64 ±0.41%      2.20%
Concurrency_12Readers_4Writers                           1363.40 ±11.54%   7721.94 ±2.66%    466.37%
Concurrency_48Readers_8Writers                               6.01 ±3.81%   5495.41 ±4.76%  91334.09%
Concurrency_192Readers_16Writers                             6.99 ±2.55%   4071.60 ±6.46%  58152.74%
Concurrency_768Readers_32Writers                            15.23 ±3.74%   1296.81 ±4.83%   8412.51%
Contention_AcquireReleaseReadLock_When_BgWriteLock           3.81 ±0.81%      3.65 ±0.68%     -4.25%
Contention_AcquireReleaseWriteLock_When_BgReadLock           3.78 ±0.63%      3.66 ±0.69%     -3.20%
Contention_UpgradeDowngradeLock_When_BgReadLock              3.76 ±0.72%      3.59 ±0.80%     -4.63%
Micro_AcquireReleaseReadLock_When_BgReadLock              2578.81 ±0.41%   2566.67 ±0.04%     -0.47%
Micro_AcquireReleaseReadLock_When_NoLock                  2586.77 ±0.12%   2592.37 ±0.03%      0.22%
Micro_AcquireReleaseReadLock_When_ReadLock                2777.25 ±0.03%   2770.76 ±0.37%     -0.23%
Micro_AcquireReleaseUpgradeableReadLock_When_BgReadLock   2595.37 ±0.18%   2605.86 ±0.49%      0.40%
Micro_AcquireReleaseUpgradeableReadLock_When_NoLock       2646.20 ±0.12%   2609.14 ±0.13%     -1.40%
Micro_AcquireReleaseWriteLock_When_NoLock                 2548.90 ±0.04%   2621.03 ±0.18%      2.83%
Micro_AcquireReleaseWriteLock_When_WriteLock              2650.63 ±0.32%   2660.43 ±0.37%      0.37%
Micro_UpgradeDowngradeLock_When_ReadLock                  2535.17 ±0.10%   2512.07 ±0.09%     -0.91%
Timeout_AcquireReleaseReadLock_When_BgWriteLock           4016.95 ±1.00%   4012.96 ±0.23%     -0.10%
Timeout_AcquireReleaseWriteLock_When_BgReadLock           4014.43 ±0.14%   4119.19 ±0.24%      2.61%
Timeout_UpgradeDowngradeLock_When_BgReadLock              3979.77 ±0.07%   3893.49 ±0.04%     -2.17%
-------------------------------------------------------  ---------------  ---------------  ---------
Total                                                      532.66 ±1.40%   1395.93 ±1.24%    162.07%

@ChrisAhna's repro from #12780

Baseline

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5008ms FinalCount=146778099 CountsPerSecond=29303915.9706614
ThreadCount=0002: Elapsed=5013ms FinalCount=144566761 CountsPerSecond=28832836.5040624
ThreadCount=0004: Elapsed=5030ms FinalCount=142987794 CountsPerSecond=28424230.6837138
ThreadCount=0008: Elapsed=5046ms FinalCount=140895614 CountsPerSecond=27919285.270997
ThreadCount=0016: Elapsed=5061ms FinalCount=131126565 CountsPerSecond=25904896.8847266
ThreadCount=0032: Elapsed=5046ms FinalCount=118985913 CountsPerSecond=23578687.3922634
ThreadCount=0064: Elapsed=5077ms FinalCount=87382990 CountsPerSecond=17209966.7389088
ThreadCount=0128: Elapsed=5046ms FinalCount=13983552 CountsPerSecond=2771044.65190866
ThreadCount=0256: Elapsed=5061ms FinalCount=926020 CountsPerSecond=182954.410859513
ThreadCount=0512: Elapsed=5061ms FinalCount=554880 CountsPerSecond=109633.000133682
ThreadCount=1024: Elapsed=5047ms FinalCount=403372 CountsPerSecond=79907.8598817266
ThreadCount=2048: Elapsed=5158ms FinalCount=427853 CountsPerSecond=82937.9986349298
ThreadCount=4096: Elapsed=6843ms FinalCount=1454282 CountsPerSecond=212510.216120728

Changed

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5011ms FinalCount=146023913 CountsPerSecond=29136243.0844929
ThreadCount=0002: Elapsed=5029ms FinalCount=148085765 CountsPerSecond=29443444.9191416
ThreadCount=0004: Elapsed=5030ms FinalCount=147435037 CountsPerSecond=29307748.3158908
ThreadCount=0008: Elapsed=5046ms FinalCount=135669584 CountsPerSecond=26884746.2829748
ThreadCount=0016: Elapsed=5046ms FinalCount=117172253 CountsPerSecond=23219237.5243882
ThreadCount=0032: Elapsed=5077ms FinalCount=123019081 CountsPerSecond=24227324.0685217
ThreadCount=0064: Elapsed=5061ms FinalCount=114036461 CountsPerSecond=22528225.1624931
ThreadCount=0128: Elapsed=5061ms FinalCount=114874563 CountsPerSecond=22694305.2318717
ThreadCount=0256: Elapsed=5077ms FinalCount=111656891 CountsPerSecond=21990952.9701927
ThreadCount=0512: Elapsed=5092ms FinalCount=108080691 CountsPerSecond=21224516.1624796
ThreadCount=1024: Elapsed=5091ms FinalCount=101505410 CountsPerSecond=19936231.0295513
ThreadCount=2048: Elapsed=5168ms FinalCount=90210271 CountsPerSecond=17452847.1281253
ThreadCount=4096: Elapsed=5448ms FinalCount=70413247 CountsPerSecond=12923136.9608753

Monitor

Running StandardObjectLock write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5003ms FinalCount=256096538 CountsPerSecond=51188037.8514373
ThreadCount=0002: Elapsed=5014ms FinalCount=247492238 CountsPerSecond=49357843.8844401
ThreadCount=0004: Elapsed=5015ms FinalCount=233682614 CountsPerSecond=46594332.7385551
ThreadCount=0008: Elapsed=5014ms FinalCount=202084181 CountsPerSecond=40295977.2812599
ThreadCount=0016: Elapsed=5015ms FinalCount=160327931 CountsPerSecond=31967286.7931112
ThreadCount=0032: Elapsed=5015ms FinalCount=159973407 CountsPerSecond=31896067.0536476
ThreadCount=0064: Elapsed=5016ms FinalCount=159925779 CountsPerSecond=31881983.1519616
ThreadCount=0128: Elapsed=5018ms FinalCount=160565171 CountsPerSecond=31994598.5148935
ThreadCount=0256: Elapsed=5027ms FinalCount=160346276 CountsPerSecond=31893699.5203601
ThreadCount=0512: Elapsed=5059ms FinalCount=160314101 CountsPerSecond=31688269.2933108
ThreadCount=1024: Elapsed=5106ms FinalCount=160400801 CountsPerSecond=31409342.086444
ThreadCount=2048: Elapsed=5198ms FinalCount=160176757 CountsPerSecond=30814118.2536987
ThreadCount=4096: Elapsed=5398ms FinalCount=160109421 CountsPerSecond=29660512.8891984
{
// If there is a write waiter or write upgrade waiter, the waiter would block a reader from acquiring the RW lock
// because the waiter takes precedence. In that case, the reader is not likely to make progress by spinning.
return _fNoWaiters || _numWriteWaiters == 0 && _numWriteUpgradeWaiters == 0;

@stephentoub

stephentoub Aug 6, 2017

Member

Nit: consider adding parens here to help clarify the precedence

{
// If there is a write upgrade waiter, the waiter would block a writer from acquiring the RW lock because the waiter
// holds a read lock. In that case, the writer is not likely to make progress by spinning.
return isUpgradeToWrite || _numWriteUpgradeWaiters == 0;

@stephentoub

stephentoub Aug 6, 2017

Member

I understand how the comment applies to "_numWriteUpgradeWaiters == 0". What's the rationale for why we still spin if _numWriteUpgradeWaiters > 0 but this is a write upgrade?

@kouvel

kouvel Aug 6, 2017

Member

There can only be one upgradeable read lock held at a time, so if this is a write upgrade, there can't be any write upgrade waiters. I'll add that to the comment.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void EnterMyLockForEnterAnyRead(bool isForWait = false)
{
if (!TryEnterMyLock())

@stephentoub

stephentoub Aug 6, 2017

Member

Maybe we already do this elsewhere and I missed it, but when someone passes a timeout of 0, can/should we avoid spinning?

@kouvel

kouvel Aug 6, 2017

Member

Good point. When the enter is deprioritized, it would mean there is no progress to be made on the RW lock, so it doesn't have to spin for _myLock. I'll also fix it so that it doesn't create/wait on the event when it decides to skip spinning on the RW lock and the timeout has expired.

@kouvel

kouvel Aug 6, 2017

Member

If it's not deprioritized then I think it should spin for _myLock to attempt the RW lock once before returning

@kouvel

kouvel Aug 8, 2017

Member

I'll fix this in a separate PR, maybe along with fixes to #13254. I'm seeing some regressions with those at the moment, so it'll need some more work.

@kouvel kouvel added the * NO MERGE * label Aug 7, 2017

@kouvel

Member

kouvel commented Aug 7, 2017

Flagged as no-merge, as I'll be adding more commits and would like to merge them as separate commits (no squash & merge), since each commit will have a different level of impact and risk when considering a port to desktop.

@kouvel kouvel removed the * NO MERGE * label Aug 8, 2017

@vancem

Member

vancem commented Aug 8, 2017

Sorry for the delay; I was catching up from a vacation yesterday.

My biggest concern is complexity. As I mentioned in e-mail threads, generally speaking we are NOT fixing a pri1 scenario here because even after the fixes we have a very hot lock (that users really should fix, and what we are doing here does not help them). What we are trying to do is avoid really bad behavior in this case (things are very serialized, but you don't get meltdown). Thus ideally we do something which does not add complexity but addresses that issue.

I really like your ShouldSpinFor* functions because they are very simple (no new state), and the logic clearly is simply giving up on spinning (which does not have a semantic effect, only perf).

Do we know how much just doing this helps? It has the effect of 'prioritizing' without extra state variables. As I recall, Chris' benchmark was write-only, in which case it won't help there (but it is still definitely worth doing, as the read-write case is the common one).

I have been thinking about the 'best' way of doing spinning in general, and I believe a very good general solution is to use what happened in the past, to determine the likely amount of time you have to spin, and based on that decide whether spinning is worth doing at all. Thus if hold times are long (in which case you typically end up blocking anyway), you very quickly don't bother spinning at all (which is great).

One of the simpler ways of doing this is to simply remember the 'spinCount' variable from last time and use, say, 80% of it for next time (that way you tend to ratchet down, but if that fails you naturally home in on the correct value). The only trick is what you do when you end up blocking. The best answer is to use something like Stopwatch to actually measure the blocked time and only adjust the spinCount down if that time drops below some threshold. If we don't like the idea of taking a dependency on Stopwatch, we can also do it by simply increasing the spin count to something large enough that taking 80% of it, say, 100 times brings it back down to the threshold of spinning. What this does is guarantee that after you block, you skip spinning for the next ~100 entries, at which point you spin once just to see if anything has changed. Now the overhead is 1% of what it was before.

I am thinking that this would be simpler code: basically, spinCount becomes an instance variable (you have two of them, one for readers and one for writers, as they are likely to have different wait times).
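
A minimal sketch of the remembered-spinCount idea (the names, the 0.8 decay factor, and the limits are illustrative guesses for the example, not proposed values):

```python
# Illustrative-only sketch of the remembered-spin-count scheme: start at
# ~80% of last time's spin count, and after blocking, inflate the count
# past a limit so the next several entries skip spinning entirely.

SPIN_LIMIT = 20  # beyond this, spinning is assumed never worthwhile


class AdaptiveSpinBudget:
    def __init__(self) -> None:
        self.remembered = SPIN_LIMIT  # spin count carried over from last entry

    def begin_entry(self) -> int:
        """Spin budget for this entry; 0 means go straight to waiting."""
        # Ratchet down to 80% of last time, so the budget homes in on the
        # true short-wait cost while spinning keeps succeeding.
        self.remembered = max(1, int(self.remembered * 0.8))
        return self.remembered if self.remembered <= SPIN_LIMIT else 0

    def on_blocked(self) -> None:
        # Spinning failed and we blocked: inflate the remembered count so
        # that several 0.8x decays must pass before we spin again.
        self.remembered = SPIN_LIMIT * 4
```

With these made-up constants, one block suppresses spinning for the next five entries (80 decays to 64, 51, 40, 32, 25 before reaching the limit of 20); the "~100 skipped entries" figure in the comment above would correspond to a much larger inflation factor.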

The MyLock could also benefit from this scheme (making its spin count remembered in an instance variable).

Finally, I do think we can do better on our tuning parameters (changing them also does not increase complexity). First, the SpinWaits are too small. We start at 10, but the memory system is probably 20X slower than the CPU, so frankly it never gives the memory system 'a break', which is part of the point. I believe this value should be somewhere between 10 and 100, and I am guessing 30-50 is a good place to start for MyLock; SpinCount should be reduced to 5 to compensate. For the reader-writer lock, I think the spin wait should be 2-4X larger than for MyLock (each spin stresses the memory system more, because you have to lock and unlock), and the maximum number of spins should thus be reduced to something like 5 from its current 20 (because each spin waits longer, and frankly spinning is not as likely to be fruitful). This kind of tuning does take a while and assumes a good set of benchmarks to tune with, so we could skip it (but my suggestion is to UNDER spin rather than OVER spin; spinning is always SPECULATION, and in general you should not speculate when you have no real clue whether it is helping or not).

I like to believe that these suggestions (add ShouldSpinFor*, remember spinCounts in instance variables, and play with tuning parameters) keep the code simple, but should address excessive spinning.

I am concerned that the current PR has a non-trivial amount of new code and that that new complexity will never go away. The suggestion above also helps avoid CPU in the (reasonably common) case where you need to block for a small amount of time, but not small enough that spinning is effective.

Comments?

@vancem

In my comment I suggested that the scalability could be achieved more simply (by the 'Should*' methods here, coupled with remembering when spinning is ineffective (you will end up blocking anyway)).

@kouvel

Member

kouvel commented Aug 9, 2017

My biggest concern is complexity. As I mentioned in e-mail threads, generally speaking we are NOT fixing a pri1 scenario here because even after the fixes we have a very hot lock (that users really should fix, and what we are doing here does not help them). What we are trying to do is avoid really bad behavior in this case (things are very serialized, but you don't get meltdown). Thus ideally we do something which does not add complexity but addresses that issue.

General thoughts:

The issue arises when threads are thrashing on the spin lock, typically readers. When a writer comes in, it is unable to get the spin lock and quickly reaches the Sleep(1) stage of the spin loop, where it becomes even less likely to get the lock by chance. Eventually, after enough iterations of getting the spin lock and not making progress on the RW lock (because a read lock is typically held), it goes into a wait state. While the writer is waiting, new readers spin on the spin lock and the RW lock, delaying threads holding the read lock from releasing their lock. Eventually, readers are drained, and the writer is signaled and acquires the write lock. Then it has trouble releasing the write lock because readers are thrashing on the spin lock.

Potential considerations:

  • This would be a hot lock that gets serialized. If a user runs into this issue, probably the way to fix it would be to use a mutex-type lock. It's possible that the lock is hot sometimes and provides beneficial parallel reads at other times. In that case, they could opt to take a write lock for short operations. Should we fix this at all?
  • Once reaching the Sleep(1) stage, should it only do Sleep(1), or should it reset the count after some threshold and spin faster for a short time? I haven't experimented with this; it doesn't seem like a significant issue with the deprioritization.
  • Should writers spin on the RW lock at all? They may not be able to get the spin lock very easily, but at least they would only have to get it once to register as a waiter. I experimented with disabling spinning on the RW lock for writers; it works somewhat well for few threads but does not scale well. The writer then tends to intrude on readers too easily/quickly. Spinning delays the writer and allows readers some extra time to do their work. On the other hand, we may expect that writers are rare and may be OK with intruding on readers even if it doesn't scale well.
  • It seems logical that read lock releasers should take priority on the spin lock over write lock acquirers, and that write lock releasers should take priority over read lock acquirers. It seems to work well in practice too, improving throughput noticeably when a writer is trying to release its write lock.
  • Should readers spin on the RW lock while a write lock is held? My limited experimentation seems to say yes. Forcing readers into a wait state when a write lock is held and there is no write waiter seems to be bad sometimes. When the write lock is released, it wakes up all waiting readers, and before they wake up to attempt the lock, the same writer thread (or a different writer thread spinning to acquire a write lock) may acquire it. All the readers would then have to wake up and go back to sleep.

I really like your ShouldSpinFor* functions because they are very simple (no new state), and the logic clearly is simply giving up on spinning (which does not have a semantic effect, only perf).

Do we know how much just doing this helps? It has the effect of 'prioritizing' without extra state variables. As I recall, Chris' benchmark was write-only, in which case it won't help there (but it is still definitely worth doing, as the read-write case is the common one).

It appears the testing I did on limiting spinning was before I finished tuning the deprioritization. In the current state of this PR, it's not helping noticeably with scale. Even without the deprioritization (I hadn't tried this before), it seems to work somewhat well, but only noticeably with few threads. I still think it's a decent thing to do, as it's very unlikely to have negative effects. Mainly, once a writer starts waiting, it stops new readers from thrashing on the spin lock for too long unnecessarily, allowing read lock holders to efficiently release their locks. At least that's the theory. In practice, with the current state of deprioritization, as soon as a waiter starts to spin for the spin lock, new readers become deprioritized and read lock holders begin draining. By the time the writer registers as a waiter (often it does not even get to this point), most of the read lock holders have already drained, so the benefit of limiting new readers from contending with read lock releasers is negligible.

First the SpinWaits are too small. We start at 10, but the memory system is probably 20X slower than the CPU so frankly it never gives the memory system 'a break' which is part of the point.

That's interesting, I can try some experiments and get back.

Regarding limiting spinning, I can try out what you suggested. It may help to not starve the writer from registering as a waiter for so long. It may be tricky to tune, to avoid artificial delays that hurt throughput. I still prefer the deprioritization idea (introducing some fairness into spin lock acquisition), as it directly attacks the problem.

@vancem

Member

vancem commented Aug 9, 2017

Thanks @kouvel for the information above. It is helpful.

What I have learned (the hard way) with things like concurrency and thread pools, is

  1. Weird (bad) behaviors are the norm
  2. Thus the best solutions tend to be VERY simple, because the more complex it is the more places you can get weird behaviors.
  3. In particular you have to have a REALLY good simple rationale for why you do things.

So I would like to apply this to spinning and spin locks (in general). First, why do we have a spin lock in the reader-writer lock? Basically, the beauty of a spin lock is that it is both simple and efficient (because you don't need an interlocked operation on release), IF the lock hold time is (uniformly) small (that is, measured in 10s to 100s of CPU clocks, not 1000s). We strongly believe that this is true for 'myLock' (because it is ONLY held when you are in transition, for no more than the body of one of the RW lock methods). Indeed, the fact that we EVER get to the Sleep(0), let alone the Sleep(1), in the myLock code shows that we are in a 'bad' regime; however, myLock is not performing that badly (at any instant of time, SOME thread either has the lock or is in the process of getting it). This is about as good as you can do (after all, you DO have to serialize access to it). You can probably do a bit better to avoid consuming the (shared) memory bus (by spinning locally more before you ping the lock), but that is not the main problem.

The problem is the spinning at the RWLock level. Now it is fair to ask: do we need to spin at all? Why do we spin? The idea is that there is a COST to waiting (and waking up). It is the cost of context switching out and switching back (about 1-3K cycles), plus the cost you incur because the memory caches were 'cooled' (filled with other data) and now have to be refilled. This is hard to quantify and very scenario dependent, but a reasonable ballpark figure is about 10K-100K cycles (~0.03 msec). You can avoid this cost if you spin AND THE WAIT TIME IS SHORT (certainly < 100K cycles, but more like < 10K).
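
A quick sanity check on those numbers, assuming a ~3 GHz clock (the clock rate is my assumption; it is not stated above):

```python
# Back-of-envelope check of the wake-up cost estimate above, assuming a
# 3 GHz clock (an assumption; the comment does not name a clock rate).
CLOCK_HZ = 3e9
wake_cost_cycles = 100_000           # upper end of the 10K-100K estimate
wake_cost_ms = wake_cost_cycles / CLOCK_HZ * 1e3
print(f"{wake_cost_ms:.3f} ms")      # ~0.033 ms, roughly the 0.03 msec cited
```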

Thus spinning only helps in the case where the wait time is very short. More importantly, however, spinning only pays off if it succeeds. Otherwise you pay strictly MORE. This is the fundamental issue that causes the 'meltdown' where increased load makes the system LESS efficient.

One very simple solution is to simply not spin. This is really not that bad. It works well when lock hold time is long (since spinning does not help there), and when there is no contention. It will hurt however if hold time is very short. Thus what you WANT is something that spins when wait time is < 10-100K and otherwise does no spinning.

The trick is that you need to know how long the wait time is. But we HAVE a good idea of that: it is the time it took last time (or the average of some set of times in the past). To simplify the implementation even further, the spinCount is, roughly speaking, a measure of the time it took to enter the lock the last time. Thus simply remembering it as an approximation and starting 'there' (which might mean going directly to waiting) solves the problem. (You do have the complexity of updating this time approximation, but I believe we can do this in a very few lines of code (< 10).)

The key behavior we need is that if we ended up waiting in the past, then we should NOT bother spinning this time. Thus if there is high contention (and wait times are long), the system QUICKLY stops spinning completely.

While having priorities for the various kinds of entries into the lock also helps, what I like about the scheme above is that it is SIMPLE (and hopefully foolproof). You only spin if past experience gives you some reason to believe it will be successful. Otherwise you don't spin at all.

Notice that if you are NOT spinning, you ALWAYS make forward progress (every time the API enters MyLock, it accomplishes something). It will also be (roughly) fair (because when you wait, you get put in a queue of waiters). The lock will still be hot (lots of threads will all be trying to access it), but it will be an orderly process (every thread spins, enters MyLock, and either gets what it needs or gets blocked). Thus there are no 'useless spins'.

What I am arguing is that this scheme can be very simple (simpler than what you have), and the logic behind why it solves the 'spinning problem' is also simple (you never spin unless you have reason that it will succeed, and in particular you never REPEATEDLY spin and fail).

I realize that my suggestion here is not particularly welcome. You have something that seems to solve the problem, and here I am trying to make you do work to rewrite it. I get that... What I have learned is that the cost of developing software is small in comparison to maintaining it over time (since maintenance accumulates), so keeping things SIMPLE is worthwhile.

I am arguing that you can implement the scheme I suggest in well under half the lines of code (hopefully a couple dozen total). The number is small enough that I will write them if you want me to. We can then see if that solution solves the issue, and if it does, is it reasonable to solve it that way?

@kouvel

Member

kouvel commented Aug 9, 2017

I realize that my suggestion here is not particularly welcome. You have something that seems to solve the problem, and here I am trying to make you do work to rewrite it. I get that... What I have learned is that the cost of developing software is small in comparison to maintaining it over time (since maintenance accumulates), so keeping things SIMPLE is worthwhile.

I'm not opposed to changing the approach. I agree that what you suggested here may work reasonably well for this problem and that may suffice. It may even help with other more common scenarios.

I am arguing that you can implement the scheme I suggest in well under half the lines of code (hopefully a couple dozen total). The number is small enough that I will write them if you want me to. We can then see if that solution solves the issue, and if it does, is it reasonable to solve it that way?

Since you have something specific in mind, if you're interested in trying it out, please go ahead.

@kouvel kouvel added the * NO MERGE * label Aug 9, 2017

@vancem

Member

vancem commented Aug 9, 2017

Ok, let me take a crack at it.

@kouvel

Member

kouvel commented Aug 19, 2017

Responded on #13324. I have pushed a couple of commits to try and simplify the code a bit. Of what I've tried, I believe this is the most reasonable approach to solving these problems.

@kouvel

Member

kouvel commented Aug 19, 2017

Added another commit that prevents waking multiple waiters when only one can make progress. For instance, when there are multiple write waiters and a writer keeps entering and exiting a write lock, every exit causes it to wake up a new waiting writer, preventing write waiters from remaining in a wait state.

This last commit solves most of the scalability issue by itself and to a much greater degree in many cases than deprioritization. I believe it reduces contention on the spin lock by allowing waiters to remain waiting and eliminates excessive context switching from threads waking up and going back to sleep. However, it doesn't solve the writer starvation issue by itself, and deprioritization still performs an order of magnitude better in several cases. The last commit and deprioritization seem to combine well, getting the benefits of both.
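
The wake-one-waiter policy is essentially "notify one" instead of "notify all" when only one waiter can make progress. A minimal Python sketch of the general pattern (this is not the event machinery ReaderWriterLockSlim actually uses):

```python
import threading


class WriteGate:
    """Sketch of the wake-one policy: on exit, wake at most one waiting
    writer instead of all of them, since only one can acquire the lock."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._held = False

    def enter_write(self) -> None:
        with self._cond:
            while self._held:
                self._cond.wait()
            self._held = True

    def exit_write(self) -> None:
        with self._cond:
            self._held = False
            # notify(), not notify_all(): waking every waiting writer
            # would just send all but one back to sleep after a futile
            # attempt, which is the excess context switching described.
            self._cond.notify()
```

Waking a single waiter lets the rest remain parked, which is exactly what reduces the spin-lock contention and context-switch churn described in the comment above.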

Updated perf numbers (benchmark and scoring has changed from before):

Baseline (left) vs Changed (right)                       Left score       Right score      ∆ Score %
-------------------------------------------------------  ---------------  ---------------  ---------
Concurrency_12Readers_1-3Writers                         21335.21 ±1.54%  21814.40 ±1.04%     2.25%
Concurrency_48Readers_1-12Writers                         7795.03 ±9.83%  19802.80 ±2.57%   154.04%
Concurrency_192Readers_1-32Writers                        877.41 ±13.18%  19427.11 ±2.51%  2114.14%
Concurrency_768Readers_1-32Writers                        547.95 ±15.21%  21131.86 ±1.69%  3756.55%

@ChrisAhna's repro from #12780

Running NonReentrantRwlockSlim write lock contention scenarios with process affinity set to 0x0000000000000fff...
Initial output should appear in approximately 5 seconds...
ThreadCount=0001: Elapsed=5000ms FinalCount=144589363 CountsPerSecond=28916967.4989173
ThreadCount=0002: Elapsed=5014ms FinalCount=144584762 CountsPerSecond=28830976.1227239
ThreadCount=0004: Elapsed=5015ms FinalCount=144036668 CountsPerSecond=28716351.0107656
ThreadCount=0008: Elapsed=5048ms FinalCount=138234503 CountsPerSecond=27380895.2031019
ThreadCount=0016: Elapsed=5036ms FinalCount=134589683 CountsPerSecond=26721099.3625001
ThreadCount=0032: Elapsed=5031ms FinalCount=127872590 CountsPerSecond=25415045.1992862
ThreadCount=0064: Elapsed=5042ms FinalCount=128792699 CountsPerSecond=25540375.4453682
ThreadCount=0128: Elapsed=5093ms FinalCount=127502272 CountsPerSecond=25030204.4401221
ThreadCount=0256: Elapsed=5086ms FinalCount=128297554 CountsPerSecond=25221690.5011927
ThreadCount=0512: Elapsed=5061ms FinalCount=129017905 CountsPerSecond=25489084.9449699
ThreadCount=1024: Elapsed=5062ms FinalCount=125965942 CountsPerSecond=24880288.419177
ThreadCount=2048: Elapsed=5094ms FinalCount=122257552 CountsPerSecond=23998523.3918935
ThreadCount=4096: Elapsed=5225ms FinalCount=110505127 CountsPerSecond=21147566.2289993

kouvel added a commit to kouvel/coreclr that referenced this pull request Aug 20, 2017

Improve ReaderWriterLockSlim scalability
Alternative to and subset of dotnet#13243, fixes dotnet#12780

- Prevented waking more than one waiter when only one of them may acquire the lock
- Limited spinning in some cases where it is very unlikely that spinning would help

The _myLock spin lock runs into some bad scalability issues. For example:
1. Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
2. Readers, and especially writers, that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are ones that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

This PR does not solve (1), but solves (2) to a degree that could be considered sufficient. dotnet#13243 solves (1) and (2) and for (2) it is still better by order-of-magnitude compared with this PR in several cases that I believe are not extreme, but if the complexity of deprioritization is deemed excessive for the goals then of what I tried so far this is the perhaps simplest and most reasonable way to solve (2).

I believe this PR would also be a reasonably low-risk one to port back to .NET Framework.
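The "prevented waking more than one waiter" idea can be illustrated with a small policy sketch. This is Python for illustration only; the function and parameter names are hypothetical, not the actual ReaderWriterLockSlim internals. The key observation is that only one writer can ever enter at a time, so waking more than one waiting writer just creates contention.

```python
def waiters_to_wake(lock_held_for_write, active_readers,
                    waiting_writers, waiting_readers):
    """Decide how many waiters to wake when a lock is released.

    Illustrative sketch of the single-waiter wakeup policy described
    above; all names here are invented, not CoreCLR code.
    """
    if lock_held_for_write or active_readers > 0:
        # Lock is still held; nobody woken now could make progress.
        return 0
    if waiting_writers > 0:
        # At most one writer can enter; waking more would only add
        # spinners that fail to acquire and go back to sleep.
        return 1
    # With no writers waiting, all waiting readers can enter together.
    return waiting_readers
```

Waking exactly one writer at a time avoids the thundering-herd effect that shows up as wasted CPU in the "only writers" benchmarks.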

kouvel added a commit that referenced this pull request Aug 23, 2017

Improve ReaderWriterLockSlim scalability (#13495)
* Improve ReaderWriterLockSlim scalability

Subset of #13243, fixes #12780

- Prevented waking more than one waiter when only one of them may acquire the lock
- Limited spinning in some cases where it is very unlikely that spinning would help

The _myLock spin lock runs into some bad scalability issues. For example:
1. Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
2. Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

This PR does not solve (1), but solves (2) to a degree that could be considered sufficient. #13243 solves (1) and (2) and for (2) it is still better by order-of-magnitude compared with this PR in several cases that I believe are not extreme, but if the complexity of deprioritization is deemed excessive for the goals then of what I tried so far this is the perhaps simplest and most reasonable way to solve (2).

I believe this PR would also be a reasonably low-risk one to port back to .NET Framework.

@karelz karelz removed the bug label Aug 25, 2017

@sdmaclea


sdmaclea commented Aug 25, 2017

@kouvel I have another optimization of this as well. I drafted it when the arm64 volatile function lookup cache was broken on arm64. SpinLock.cs was getting hit on every volatile instruction lookup. My patch made a substantial improvement. Since I fixed the real issue, I never submitted the alternative. I'll dig it up.

@sdmaclea


sdmaclea commented Aug 25, 2017

@kouvel #13599 is the patch I meant, but it is not in managed code.

@kouvel


kouvel commented Sep 5, 2017

Rebased and squashed to simplify resolving merge conflicts. Updated perf numbers from current state below. I think this is still a beneficial change, as it significantly improves scalability at higher thread counts when there are readers and writers, and fixes the writer starvation issue.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

RwSB vs RwS                         Left score        Right score      ∆ Score %
----------------------------------  ----------------  ---------------  ---------
Concurrency_OnlyReadersPcx01         23253.75 ±0.53%  24133.35 ±0.12%      3.78%
Concurrency_OnlyReadersPcx04         23643.40 ±0.10%  23961.26 ±0.35%      1.34%
Concurrency_OnlyReadersPcx16         23381.09 ±0.12%  23913.45 ±0.14%      2.28%
Concurrency_OnlyReadersPcx64         16003.45 ±3.93%  21900.96 ±0.29%     36.85%
Concurrency_OnlyWritersPcx01         23740.19 ±0.45%  24422.96 ±0.28%      2.88%
Concurrency_OnlyWritersPcx04         23923.59 ±0.33%  23973.27 ±0.20%      0.21%
Concurrency_OnlyWritersPcx16         23602.52 ±0.49%  23149.35 ±0.37%     -1.92%
Concurrency_OnlyWritersPcx64         24017.52 ±0.33%  19312.34 ±1.72%    -19.59%
Concurrency_Pcx01Readers_01Writers   10771.92 ±4.53%  11837.56 ±5.02%      9.89%
Concurrency_Pcx01Readers_02Writers  10009.64 ±14.71%  20353.36 ±3.56%    103.34%
Concurrency_Pcx04Readers_01Writers   16536.58 ±1.98%  19290.98 ±0.74%     16.66%
Concurrency_Pcx04Readers_02Writers   13014.96 ±3.44%  17689.59 ±3.25%     35.92%
Concurrency_Pcx04Readers_04Writers   12313.32 ±8.32%  20302.25 ±3.82%     64.88%
Concurrency_Pcx16Readers_01Writers   18518.40 ±4.11%  21922.77 ±1.98%     18.38%
Concurrency_Pcx16Readers_02Writers   16789.55 ±3.62%  20524.91 ±2.29%     22.25%
Concurrency_Pcx16Readers_04Writers   13315.66 ±7.32%  21255.95 ±2.22%     59.63%
Concurrency_Pcx16Readers_08Writers   13530.42 ±6.56%  22919.80 ±1.53%     69.39%
Concurrency_Pcx64Readers_01Writers   14054.68 ±7.12%  18556.82 ±4.78%     32.03%
Concurrency_Pcx64Readers_02Writers   11597.02 ±8.58%  17548.92 ±4.89%     51.32%
Concurrency_Pcx64Readers_04Writers   11727.83 ±7.50%  16453.37 ±5.18%     40.29%
Concurrency_Pcx64Readers_08Writers   9779.97 ±11.81%  21636.79 ±1.75%    121.24%
Concurrency_Pcx64Readers_16Writers   11873.76 ±9.23%  22943.82 ±0.80%     93.23%
----------------------------------  ----------------  ---------------  ---------
Total                                15802.03 ±4.87%  20561.88 ±2.07%     30.12%

Core i7-6700 (Skylake, 4-core, 8-thread):

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         25844.75 ±0.18%   27646.62 ±0.40%      6.97%
Concurrency_OnlyReadersPcx04         25960.00 ±0.15%   26554.70 ±0.71%      2.29%
Concurrency_OnlyReadersPcx16         25605.03 ±0.36%   27411.46 ±0.24%      7.05%
Concurrency_OnlyReadersPcx64         24484.74 ±0.85%   26779.13 ±0.18%      9.37%
Concurrency_OnlyWritersPcx01         29088.16 ±0.53%   28208.32 ±0.35%     -3.02%
Concurrency_OnlyWritersPcx04         26173.08 ±0.91%   27457.82 ±0.26%      4.91%
Concurrency_OnlyWritersPcx16         26048.41 ±1.82%   27051.25 ±0.20%      3.85%
Concurrency_OnlyWritersPcx64         20804.65 ±5.81%   26680.97 ±0.25%     28.25%
Concurrency_Pcx01Readers_01Writers   12637.79 ±8.93%   15786.84 ±7.72%     24.92%
Concurrency_Pcx01Readers_02Writers   4876.47 ±14.10%   24541.40 ±2.84%    403.26%
Concurrency_Pcx04Readers_01Writers    9726.81 ±8.71%   20252.46 ±2.52%    108.21%
Concurrency_Pcx04Readers_02Writers    5876.21 ±9.43%   24402.43 ±2.50%    315.28%
Concurrency_Pcx04Readers_04Writers   3523.01 ±28.39%   26001.85 ±2.86%    638.06%
Concurrency_Pcx16Readers_01Writers   8803.89 ±16.50%   21202.79 ±5.18%    140.83%
Concurrency_Pcx16Readers_02Writers   6036.79 ±16.47%   20864.33 ±3.74%    245.62%
Concurrency_Pcx16Readers_04Writers   3710.47 ±18.96%   27024.88 ±1.39%    628.34%
Concurrency_Pcx16Readers_08Writers   5065.23 ±28.57%   26793.17 ±2.12%    428.96%
Concurrency_Pcx64Readers_01Writers   8701.36 ±22.39%  19495.47 ±11.67%    124.05%
Concurrency_Pcx64Readers_02Writers   8640.47 ±19.63%  16308.06 ±14.03%     88.74%
Concurrency_Pcx64Readers_04Writers   5556.25 ±11.76%   20453.60 ±7.31%    268.12%
Concurrency_Pcx64Readers_08Writers   7012.25 ±16.31%   26671.01 ±1.52%    280.35%
Concurrency_Pcx64Readers_16Writers  14703.94 ±12.71%   27182.44 ±1.67%     84.86%
----------------------------------  ----------------  ----------------  ---------
Total                               11064.93 ±11.53%   23975.96 ±3.24%    116.68%

@kouvel kouvel requested a review from tarekgh Sep 5, 2017

@kouvel


kouvel commented Sep 5, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel


kouvel commented Sep 8, 2017

@dotnet-bot test OSX10.12 x64 Checked Build and Test
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel


kouvel commented Sep 8, 2017

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@kouvel


kouvel commented Sep 9, 2017

Ping for review please, @tarekgh, @stephentoub, @vancem. @vancem mentioned in #13324 that he's ok with this direction but it's not clear to me whether consensus has been reached on going ahead with this PR. I'm sure that the ideas @vancem suggested would improve this further for more common cases, but it's difficult to tune it and to get it right, and based on my tests and current understanding, those solutions don't solve the writer starvation problem and may at best reduce the severity of the scalability problem without solving it. As such, I believe the solutions are trying to solve different problems and I believe they would work well together, so I'd rather solve the scalability issue with this PR and leave the dynamic spin heuristics for another PR. Just to be clear, I'm perfectly ok with closing this PR if we want to do more investigation before settling on this solution.

@vancem


vancem commented Sep 13, 2017

I am OK with having this PR proceed. It is not ideal because it does add complexity, and I am still thinking that a simpler solution is possible, but at some point we have to move on. The solution embodied in this PR is conceptually simple (you want to have the _myLock spin lock understand priorities and you want to give priority to threads needing to take _myLock to RELEASE the lock, and secondarily to writers).

I was hoping to keep it even simpler by making spin-lock improvements (that we would make on all our spin locks), that would be 'good enough' to avoid the scaling issue. However @kouvel did quite a bit of measurement to show that it is harder than I thought. While I have not completely given up, I agree that, for now at least, we should move on. Thus we need SOME solution.

Arguably having this kind of priority scheme is always good from an efficiency perspective (it is clear that giving priority to those RELEASING locks can only help things); the question is whether it is worth the complexity. Given that it solves the original scaling problem (which is good, but is arguably a corner case, since users should FIX a lock that gets that hot), but ALSO improves more real-world scenarios (1 writer and a modest number of readers) in a nontrivial way

Concurrency_Pcx01Readers_01Writers   12637.79 ±8.93%   15786.84 ±7.72%     24.92%
Concurrency_Pcx04Readers_01Writers    9726.81 ±8.71%   20252.46 ±2.52%    108.21%
Concurrency_Pcx16Readers_01Writers   8803.89 ±16.50%   21202.79 ±5.18%    140.83

makes it worth it. Also the change really is not THAT complex (it is in the implementation of a single class, and the intuition behind why it works is intuitive). It is likely to be ADDITIVE to any improvements we make in spin heuristics later.

The only concern is this regression on the Xeon E5-1650

Concurrency_OnlyWritersPcx64         24017.52 ±0.33%  19312.34 ±1.72%    -19.59%

However I note that it did not happen on the Core i7-6700, where there was an improvement:

Concurrency_OnlyWritersPcx64         20804.65 ±5.81%   26680.97 ±0.25%     28.25%

And this benchmark is not the most interesting (if you are using a reader-writer lock, you expect readers).

So I am OK with this being checked in.

@vancem

vancem approved these changes Sep 13, 2017

Improve ReaderWriterLockSlim scalability
Fixes #12780

The _myLock spin lock runs into some bad scalability issues. For example:
- Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
- Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

Took some suggestions from @vancem and landed on the following after some experiments:
- Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
- Limited spinning in some cases where it is very unlikely that spinning would help
@@ -63,11 +63,17 @@ public class ReaderWriterLockSlim : IDisposable
// the events that hang off this lock (eg writeEvent, readEvent upgradeEvent).
private int _myLock;
// Used to deprioritize threads attempting to enter _myLock when they would not make progress after doing so
private int _enterMyLockDeprioritizationState;


@vancem

vancem Sep 13, 2017


Please put a reasonably verbose comment about what this state is. In particular, it is a tuple of two counts (read and write), and these counts are requests to deprioritize that set of lock requestors. All of this is hidden behind EnterSpinLock (I am sorely tempted to make a struct SpinLock which encapsulates _myLock and _enterMyLockDeprioritizationState and hides all this except for EnterMyLockSpin(EnterMyLockReason reason); this keeps it conceptually localized).
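The tuple-of-two-counts state described above can be sketched as two counts bit-packed into one int. This is a Python sketch under assumptions: the constants, shifts, and function names are invented for illustration and do not reflect the actual CoreCLR field layout.

```python
# Hypothetical packing: high bits hold the reader-deprioritization
# count, low 16 bits hold the writer-deprioritization count.
READER_SHIFT = 16
WRITER_MASK = (1 << READER_SHIFT) - 1


def deprioritize(state, readers=0, writers=0):
    """Request that future reader/writer acquirers of the spin lock
    back off for a while (e.g. to let a releaser through)."""
    return state + (readers << READER_SHIFT) + writers


def should_back_off(state, is_writer):
    """A deprioritized thread spins without even attempting to
    acquire the spin lock, reducing contention on it."""
    if is_writer:
        return (state & WRITER_MASK) != 0
    return (state >> READER_SHIFT) != 0
```

Keeping both counts in one int lets the real implementation update them with a single interlocked operation, which is presumably why a single field is used rather than two.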

@kouvel


kouvel commented Sep 13, 2017

Thanks @vancem.

Regarding the regression above, it appears to be happening by chance but more consistently on this processor. In that test all enter-write threads will eventually go into a wait state while one writer holds the lock. The writer that holds the lock typically acquires and releases the lock repeatedly and wakes up at most one write waiter at a time. In baseline, the write waiter is usually not able to acquire the lock when it wakes up. In the changed side, the write waiter is acquiring the lock more frequently. This causes the previous lock releaser thread to become a spinner and the write waiter that now holds the lock wakes up another write waiter when it releases the lock. As a result, more threads are spinning for the write lock instead of waiting.

The CPU usage shows this: the changed side is using 2.5x more CPU than the baseline. The issue occurs consistently when there are 64 * proc count writer threads but consistently does not occur when there are 16 * proc count writer threads, and there's not much difference between those two scenarios. I can't think of a reason why this would be happening; maybe slight timing changes are making the difference.

One way to fix this deterministically may be to register the number of spinners and avoid waking up waiters when spinners can satisfy the wait. I'll leave that for another time if it becomes a real issue.
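The spinner-registration idea suggested here could look roughly like the following. This is a hypothetical sketch; none of these names exist in ReaderWriterLockSlim, and the real fix was deferred as noted above.

```python
class WakePolicy:
    """Track spinning acquirers so a releaser can skip waking a
    sleeping waiter when a spinner is already poised to take the
    lock. Purely illustrative, not code from this PR."""

    def __init__(self):
        self.spinners = 0

    def begin_spin(self):
        self.spinners += 1

    def end_spin(self):
        self.spinners -= 1

    def should_wake_waiter_on_release(self, waiting_writers):
        # If a spinner can satisfy this release, waking a waiter
        # would only convert it into yet another spinner.
        return waiting_writers > 0 and self.spinners == 0
```

This would make the wake decision deterministic instead of depending on whether a just-woken waiter happens to win the race against the releaser.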

@vancem


vancem commented Sep 13, 2017

Thanks for looking into it @kouvel. It is good to hear that the regression does not happen consistently. I agree that it is not clear it is worth fixing (the scenario is not the highest priority, it only happens when the lock is hot enough that users should be fixing the hot lock, and the regression is not bad).

For what it is worth, I think it would be useful if we changed our scalability tests to have parameterized amounts of 'in-lock' and 'out-lock' work that is in the range of 100 to 10000 cycles. This would be much more representative of a real user workload. As it is, our current microbenchmarks seem to be leading us to tune for the 'hot lock' case, which is NOT really the most important case (when locks use 80%+ of the CPU, user code likely needs to be fixed to cool the lock; but when locks use 20% of the CPU, reducing it to 10% is an important win).

@kouvel


kouvel commented Sep 13, 2017

I'm gathering some perf numbers with delays added

@kouvel


kouvel commented Sep 14, 2017

I added a delay inside and outside each lock by calculating the N'th Fibonacci number recursively, where N is chosen between 4 and 14 by a pseudorandom number generator. On my machine, N=4 and N=14 correspond to approximately 100 and 10,000 cycles respectively. This scheme prefers smaller delays.

With delays added, it seems difficult to compare the throughput scores apples to apples due to the writer starvation problem. The baseline starves writers and allows readers to happily chug away. The changed version does not starve writers and since writers block readers from acquiring the lock, the greater the frequency of writers, the less the overall throughput because writers drain all readers before they acquire the lock.

To make things a bit more realistic and comparable, I increased the delay outside the write lock such that N is between 15 and 19. On my machine, Fib(19) takes about 35 us to calculate using recursion.
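The cost of the naive recursive Fibonacci grows roughly by the golden ratio for each increment of N, so a uniform draw of N lands mostly in the cheap range, which is why the scheme "prefers smaller delays". Counting recursive calls as a proxy for cycles makes the spread visible (Python, illustrative; the helper below is not the benchmark code):

```python
def fib_calls(n):
    """Number of function calls made by the naive recursive Fib(n)."""
    if n < 2:
        return 1
    return 1 + fib_calls(n - 1) + fib_calls(n - 2)

# The call count satisfies fib_calls(n) = 2 * Fib(n + 1) - 1, so the
# cost ratio between N=4 and N=14 is about 135x, roughly in line with
# the ~100-cycle vs ~10,000-cycle figures quoted above.
```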

Here are some raw numbers. Each line is a measurement over 500 ms, all in one process in continuous operation.

1 writer thread

(1 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1182 2 1184 1184 3193 29 3222 3222
1573 2 1575 1379 3233 31 3264 3243
825 2 827 1195 3157 34 3191 3226
1133 2 1134 1180 3188 30 3218 3224
719 1 720 1088 3140 35 3175 3214
1074 2 1076 1086 3150 36 3186 3209
1278 2 1279 1114 3166 31 3198 3208
792 2 794 1074 3142 35 3176 3204
1494 3 1497 1121 3049 35 3084 3191
1068 2 1070 1116 3142 34 3177 3189
959 3 961 1102 3171 29 3200 3190
1193 2 1195 1109 3134 33 3167 3188
934 2 936 1096 3151 32 3183 3188
1405 3 1408 1118 3170 35 3205 3189
1390 2 1392 1137 3155 33 3188 3189
855 2 857 1119 3144 34 3178 3188
1485 1 1486 1141 3166 31 3198 3189
995 2 997 1133 3158 33 3191 3189
1136 1 1138 1133 3160 32 3192 3189
620 1 621 1107 3132 33 3165 3188

(4 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1343 0 1343 1343 3205 18 3223 3223
1108 0 1108 1226 3245 22 3268 3245
1108 0 1108 1187 3255 18 3272 3254
1236 0 1236 1199 3224 19 3243 3252
870 0 870 1133 3204 18 3222 3246
1342 0 1342 1168 3237 18 3255 3247
1083 0 1083 1156 3183 22 3206 3241
1087 0 1087 1147 3240 15 3255 3243
1258 0 1258 1159 2856 14 2870 3202
1065 0 1065 1150 3253 13 3265 3208
1275 0 1275 1161 3227 20 3246 3211
1125 0 1125 1158 3219 16 3234 3213
968 0 968 1144 3101 23 3124 3206
1182 0 1182 1146 3142 20 3161 3203
1130 0 1130 1145 3243 15 3257 3207
1185 0 1185 1148 3169 18 3187 3206
1319 0 1319 1158 3204 17 3221 3207
1110 0 1110 1155 3247 10 3258 3209
1315 0 1315 1164 3222 17 3239 3211
1183 0 1183 1165 3180 24 3204 3211

Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2402 0 2402 2402 3025 4 3029 3029
2473 0 2473 2438 2552 1 2553 2791
1958 0 1958 2278 3159 7 3165 2916
2359 0 2359 2298 3250 6 3256 3001
2324 0 2324 2303 3275 4 3279 3057
2298 0 2298 2302 3211 9 3221 3084
2379 0 2379 2313 3161 3 3164 3095
2300 0 2300 2312 3238 5 3243 3114
2176 0 2176 2297 3298 4 3302 3135
2437 0 2437 2311 3262 5 3267 3148
2809 0 2809 2356 3256 7 3263 3158
1926 0 1926 2320 2893 6 2899 3137
2522 0 2522 2336 2832 6 2837 3114
2032 0 2032 2314 2896 6 2903 3099
2099 0 2099 2300 3271 3 3274 3110
2658 0 2658 2322 3281 5 3286 3121
2628 0 2628 2340 3286 8 3294 3132
2089 0 2089 2326 3066 8 3074 3128
1956 0 1956 2307 3084 9 3093 3127
2403 0 2403 2311 3332 4 3336 3137

2 writer threads

(1 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
828 7 835 835 850 80 2930 2930
765 6 771 803 2443 82 2525 2727
1079 7 1086 897 2796 79 2875 2777
1100 8 1108 950 2683 79 2762 2773
1114 10 1125 985 2660 77 2736 2766
1225 10 1235 1026 2638 77 2715 2757
867 7 874 1005 2794 79 2872 2774
876 7 883 989 2886 76 2962 2797
942 8 950 985 2866 72 2938 2813
893 5 898 976 2668 79 2747 2806
635 5 640 946 2669 80 2748 2801
923 8 931 945 2503 79 2582 2783
848 5 853 938 2843 79 2922 2793
792 5 797 927 2520 80 2600 2780
810 7 817 920 2647 80 2726 2776
1219 12 1231 940 2689 78 2766 2776
1078 9 1088 948 2669 77 2746 2774
1142 9 1151 959 2760 79 2840 2777
991 11 1002 962 2648 79 2727 2775
1129 10 1140 971 2759 80 2839 2778

(4 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
1379 0 1379 1379 2409 50 2458 2458
1135 0 1135 1257 2054 63 2117 2288
1275 0 1275 1263 2185 61 2246 2274
1195 0 1195 1246 2340 61 2401 2306
1135 0 1135 1224 2120 49 2170 2278
1105 0 1105 1204 2087 67 2153 2258
1049 0 1049 1182 2234 49 2282 2261
1499 0 1499 1221 2001 68 2069 2237
1513 0 1513 1254 1913 66 1979 2208
1033 0 1033 1232 2007 60 2067 2194
980 0 980 1209 1954 62 2016 2178
1263 0 1263 1213 2101 64 2165 2177
1305 0 1305 1221 2146 72 2217 2180
1200 0 1200 1219 2087 66 2153 2178
1221 0 1222 1219 2334 71 2405 2193
1161 0 1161 1216 2326 69 2395 2206
1375 0 1375 1225 2093 65 2157 2203
1387 0 1387 1234 1703 65 1769 2179
1306 0 1306 1238 2632 55 2687 2206
1116 0 1116 1232 2244 54 2299 2210

Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2105 0 2105 2105 2494 32 2526 2526
1879 0 1879 1992 2067 44 2112 2319
2524 0 2524 2169 1622 47 1669 2102
1472 0 1472 1995 2149 48 2197 2126
2271 0 2271 2050 1745 37 1781 2057
1562 0 1562 1969 1511 37 1548 1972
2969 0 2969 2112 1818 36 1853 1955
2428 0 2428 2151 2038 36 2073 1970
2004 0 2004 2135 1498 41 1539 1922
1714 0 1714 2093 2049 33 2082 1938
2409 0 2409 2121 1896 38 1934 1938
2247 0 2247 2132 2044 41 2085 1950
2258 0 2258 2142 1919 44 1964 1951
2194 0 2194 2145 1816 42 1858 1944
2595 0 2595 2175 1904 31 1935 1944
2441 0 2441 2192 1869 39 1908 1942
2166 0 2167 2190 1971 36 2007 1945
2177 0 2177 2190 2116 32 2148 1957
2585 0 2585 2211 2420 37 2456 1983
2397 0 2397 2220 1627 42 1668 1967

The baseline has a higher overall throughput here but it also starves writers, so not very meaningful.


There will also be cases where the changed version has much lower throughput due to not starving writers and there being a high frequency of writers. The following is an example, and the difference can get more severe than this, but in the end I don't think it's a very meaningful comparison.

8 writer threads

(16 * proc count) reader threads

Baseline: Read count / Write count / Sum / Sum avg so far | Changed: Read count / Write count / Sum / Sum avg so far
2070 0 2070 2070 1376 177 1553 1553
2542 0 2542 2306 1109 199 1309 1431
2774 0 2774 2462 890 213 1103 1322
1703 0 1703 2272 1221 172 1393 1340
2331 0 2331 2284 1155 191 1346 1341
2525 0 2525 2324 1192 177 1369 1346
2155 0 2156 2300 1002 155 1157 1319
2120 0 2121 2278 958 206 1164 1299
2414 0 2414 2293 1025 187 1213 1290
2461 0 2461 2310 1112 192 1304 1291
2271 0 2271 2306 1221 180 1400 1301
2373 0 2373 2312 1176 162 1338 1304
2616 0 2616 2335 1212 175 1386 1310
1629 0 1629 2285 1010 202 1212 1303
2295 0 2295 2285 757 206 963 1281
2485 0 2486 2298 1389 174 1563 1298
1834 0 1834 2271 1383 161 1544 1313
2407 0 2407 2278 1075 192 1266 1310
1902 0 1902 2258 1072 177 1249 1307
2267 0 2267 2259 926 186 1112 1297
@kouvel


kouvel commented Sep 14, 2017

Code used for raw numbers above:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

internal static class ReaderWriterLockSlimScalePerf
{
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;

        int readerThreadCount, writerThreadCount;
        if (args.Length == 0)
        {
            readerThreadCount = 1;
            writerThreadCount = 1;
        }
        else
        {
            readerThreadCount = (int)uint.Parse(args[0]);
            writerThreadCount = (int)uint.Parse(args[1]);
        }

        var rw = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);

        var sw = new Stopwatch();
        var startThreads = new ManualResetEvent(false);

        var counts = new int[48];
        var readerThreads = new Thread[readerThreadCount];
        ThreadStart readThreadStart =
            () =>
            {
                var rng = new Random(0);
                startThreads.WaitOne();
                while (true)
                {
                    uint d0 = (uint)rng.Next(4, 10);
                    uint d1 = (uint)rng.Next(4, 10);
                    rw.EnterReadLock();
                    Delayer.CpuDelay(d0);
                    rw.ExitReadLock();
                    Interlocked.Increment(ref counts[16]);
                    Delayer.CpuDelay(d1);
                }
            };
        for (int i = 0; i < readerThreadCount; ++i)
        {
            readerThreads[i] = new Thread(readThreadStart);
            readerThreads[i].IsBackground = true;
            readerThreads[i].Start();
        }

        var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
        var writerThreads = new Thread[writerThreadCount];
        ThreadStart writeThreadStart =
            () =>
            {
                var rng = new Random(1);
                startThreads.WaitOne();
                while (true)
                {
                    uint d0 = (uint)rng.Next(4, 14);
                    uint d1 = (uint)rng.Next(15, 20);
                    rw.EnterWriteLock();
                    Delayer.CpuDelay(d0);
                    rw.ExitWriteLock();
                    Interlocked.Increment(ref counts[32]);
                    Delayer.CpuDelay(d1);
                }
            };
        for (int i = 0; i < writerThreadCount; ++i)
        {
            writerThreads[i] = new Thread(writeThreadStart);
            writerThreads[i].IsBackground = true;
            writerThreads[i].Start();
        }

        startThreads.Set();

        // Warmup
        Thread.Sleep(500);
        Interlocked.Exchange(ref counts[16], 0);
        Interlocked.Exchange(ref counts[32], 0);

        // Actual run
        var unweightedScores = new List<double>();
        var scores = new List<double>();
        while (true)
        {
            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            int readCount = Interlocked.Exchange(ref counts[16], 0);
            int writeCount = Interlocked.Exchange(ref counts[32], 0);

            double elapsedMs = sw.Elapsed.TotalMilliseconds;
            double unweightedScore = (readCount + writeCount) / elapsedMs;
            double score =
                new double[]
                {
                    Math.Max(1, (readCount + writeCount) / elapsedMs),
                    Math.Max(1, writeCount / elapsedMs)
                }.GeometricMean(readerThreadCount, writerThreadCount);
            unweightedScores.Add(unweightedScore);
            scores.Add(score);

            Console.WriteLine(
                "{0:0}\t{1:0}\t{2:0}\t{3:0}\t{4:0}\t{5:0}",
                readCount / elapsedMs,
                writeCount / elapsedMs,
                unweightedScore,
                unweightedScores.Average(),
                score,
                scores.Average());
        }
    }

    internal static class Delayer
    {
        private static int[] s_delayValues = new int[32];

        public static void CpuDelay(uint n)
        {
            s_delayValues[16] = Fib(n);
        }

        private static int Fib(uint n)
        {
            if (n == 0)
                return 0;
            if (n == 1)
                return 1;
            return Fib(n - 2) + Fib(n - 1);
        }
    }

    public static double GeometricMean(this IEnumerable<double> values, params double[] weights)
    {
        double logSum = 0, weightSum = 0;
        int weightIndex = 0;
        foreach (var value in values)
        {
            if (weightIndex >= weights.Length)
                throw new InvalidOperationException();
            var weight = weights[weightIndex];
            ++weightIndex;
            logSum += Math.Log(value) * weight;
            weightSum += weight;
        }
        return Math.Exp(logSum / weightSum);
    }
}
@kouvel


kouvel commented Sep 14, 2017

If there are no further concerns, I'll go ahead and merge this tomorrow

@kouvel kouvel merged commit 7f2b64a into dotnet:master Sep 14, 2017

16 checks passed: CROSS Check; CentOS7.1 x64 Debug Build and Test; OSX10.12 x64 Checked Build and Test; Tizen armel Cross Debug Build; Tizen armel Cross Release Build; Ubuntu arm Cross Release Build; Ubuntu x64 Checked Build and Test; Ubuntu x64 Formatting; Ubuntu16.04 arm Cross Debug Build; Windows_NT x64 Debug Build and Test; Windows_NT x64 Formatting; Windows_NT x64 Release Priority 1 Build and Test; Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness; Windows_NT x86 Checked Build and Test; Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness; Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel kouvel deleted the kouvel:RwLockSlimSpinLock branch Sep 14, 2017

@vancem


vancem commented Sep 14, 2017

I liked the encapsulation of the myLock!
