Improve ReaderWriterLockSlim scalability #13243
Conversation
{
    // If there is a write waiter or write upgrade waiter, the waiter would block a reader from acquiring the RW lock
    // because the waiter takes precedence. In that case, the reader is not likely to make progress by spinning.
    return _fNoWaiters || _numWriteWaiters == 0 && _numWriteUpgradeWaiters == 0;
Nit: consider adding parens here to help clarify the precedence
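For reference, the parenthesized form of that return would read as follows (a trivial sketch of the suggestion; the final code in the PR may differ, and the meaning is unchanged since && binds tighter than ||):

return _fNoWaiters || (_numWriteWaiters == 0 && _numWriteUpgradeWaiters == 0);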
{
    // If there is a write upgrade waiter, the waiter would block a writer from acquiring the RW lock because the waiter
    // holds a read lock. In that case, the writer is not likely to make progress by spinning.
    return isUpgradeToWrite || _numWriteUpgradeWaiters == 0;
I understand how the comment applies to "_numWriteUpgradeWaiters == 0". What's the rationale for why we still spin if _numWriteUpgradeWaiters > 0 but this is a write upgrade?
There can only be one upgradeable read lock held at a time, so if this is a write upgrade, there can't be any write upgrade waiters. I'll add that to the comment.
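A sketch of what the expanded comment might look like (hedged; the actual wording added in the PR may differ):

// If there is a write upgrade waiter, the waiter would block a writer from acquiring the RW lock
// because the waiter holds a read lock. In that case, the writer is not likely to make progress
// by spinning. Only one upgradeable read lock can be held at a time, so if this thread is
// upgrading to a write lock, there can't be any write upgrade waiters to block it.
return isUpgradeToWrite || _numWriteUpgradeWaiters == 0;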
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void EnterMyLockForEnterAnyRead(bool isForWait = false)
{
    if (!TryEnterMyLock())
Maybe we already do this elsewhere and I missed it, but when someone passes a timeout of 0, can/should we avoid spinning?
Good point. When the enter is deprioritized, it means there is no progress to be made on the RW lock, so it doesn't have to spin for _myLock. I'll also fix it so that it doesn't create/wait on the event when it decides to skip spinning on the RW lock and the timeout has expired.
If it's not deprioritized then I think it should spin for _myLock to attempt the RW lock once before returning
I'll fix this in a separate PR, maybe along with fixes to #13254. I'm seeing some regressions with those at the moment, so it'll need some more work.
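A hypothetical, compilable sketch of the zero-timeout early-out discussed in this thread; the type, method, and parameter names are illustrative and not the PR's actual code:

using System.Threading;

internal sealed class ZeroTimeoutEnterSketch
{
    private int _myLock; // 0 = free, 1 = held

    public bool TryEnter(int millisecondsTimeout, bool isDeprioritized)
    {
        // A deprioritized attempt cannot make progress on the RW lock right now,
        // so with a zero timeout there is no point spinning for _myLock or
        // creating/waiting on an event; make at most one cheap attempt.
        if (millisecondsTimeout == 0 && isDeprioritized)
            return Interlocked.CompareExchange(ref _myLock, 1, 0) == 0;

        // Otherwise spin for _myLock so the RW lock is attempted at least once
        // before returning, per the suggestion above.
        var spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref _myLock, 1, 0) != 0)
            spinner.SpinOnce();
        return true;
    }
}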
Flagged as no-merge, as I'll be adding more commits and would like to merge the commits as separate commits (no squash & merge), since each commit will have a different level of impact and risk to consider for porting to desktop.
Sorry for the delay, I was catching up from a vacation yesterday.

My biggest concern is complexity. As I mentioned in e-mail threads, generally speaking we are NOT fixing a pri1 scenario here, because even after the fixes we have a very hot lock (which users really should fix, and what we are doing here does not help them). What we are trying to do is avoid really bad behavior in this case (things are very serialized, but you don't get meltdown). Thus ideally we do something which does not add complexity but addresses that issue.

I really like your ShouldSpinFor* functions because they are very simple (no new state), and the logic is clearly just giving up on spinning (which has no semantic effect, only perf). Do we know how much just doing this helps? It has the effect of 'prioritizing' without extra state variables. As I recall Chris' benchmark was just writes, in which case it won't help there (but it is still definitely worth doing, as the read-write case is the common one).

I have been thinking about the 'best' way of doing spinning in general, and I believe a very good general solution is to use what happened in the past to determine the likely amount of time you have to spin, and based on that decide whether spinning is worth doing at all. Thus if hold times are long (in which case you typically end up blocking anyway), you very quickly don't bother spinning at all (which is great). One of the simpler ways of doing this is to simply remember the 'spinCount' variable from last time and use, say, 80% of it for next time (that way you tend to ratchet down, but if that fails you naturally home in on the correct value). The only trick is what you do when you end up blocking. The best answer is to use something like Stopwatch to actually measure the blocked time and only adjust the spinCount down if that time drops below some threshold. If we don't like the idea of taking a dependency on Stopwatch, we can also do it by simply increasing the spin count to something large enough that taking 80% of it, say, 100 times gets it back to the threshold of spinning. What this does is guarantee that if you block, you skip spinning for roughly the next 100 attempts, at which point you do a spin just to see if anything has changed. Now the overhead is 1% of what it was before. I am thinking that this would be simpler code: basically spinCount becomes an instance variable (you have two of them, one for readers and one for writers, as they are likely to have different wait times). The MyLock could also benefit from this scheme (making its spin count remembered in an instance variable).

Finally, I do think we can do better on our tuning parameters (changing them also does not increase complexity). First, the SpinWaits are too small. We start at 10, but the memory system is probably 20X slower than the CPU, so frankly it never gives the memory system 'a break', which is part of the point. I believe this value should be somewhere between 10 and 100, and I am guessing 30-50 is a good place to start for MyLock. SpinCount should be reduced to 5 to compensate. For the reader-writer lock, I think the spin wait should be 2-4X larger than for MyLock because each spin stresses the memory system more (you have to lock and unlock); the maximum number of spins should thus be reduced to something like 5 from its current 20 (because each spin waits longer, and frankly spinning is not as likely to be as fruitful).

This kind of tuning does take a while and assumes a good set of benchmarks to tune with, so we could skip this (but my suggestion is to UNDER spin rather than OVER spin; spinning is always SPECULATION, and in general you should not speculate when you have no real clue whether it is helping or not). I like to believe that these suggestions (add ShouldSpinFor*, remember spinCounts in instance variables, and play with tuning parameters) keep the code simple but should address excessive spinning. I am concerned that the current PR has a non-trivial amount of new code, and that that new complexity will never go away. The suggestion above also helps avoid CPU in the (reasonably common) case where you need to block for a small amount of time, but not small enough that spinning is effective. Comments?
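A minimal, hedged sketch of the remembered-spinCount idea described above; all names and constants are illustrative, and the racy update of the estimate is deliberately tolerated since it is only a heuristic:

using System;
using System.Threading;

internal sealed class AdaptiveSpinLock
{
    private int _state;                 // 0 = free, 1 = held
    private double _expectedSpins = 10; // remembered estimate of spins needed
    private const double SpinThreshold = 100; // above this, skip spinning entirely

    public void Enter()
    {
        // Ratchet the estimate down ~20% per attempt so that after blocking,
        // roughly 100 attempts skip spinning before we probe again.
        _expectedSpins *= 0.8;
        if (_expectedSpins <= SpinThreshold)
        {
            for (int spins = 1; spins <= (int)SpinThreshold; spins++)
            {
                if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
                {
                    _expectedSpins = spins; // home in on the observed cost
                    return;
                }
                Thread.SpinWait(30); // a longer pause gives the memory system a break
            }
            // Spinning failed and we are about to block: inflate the estimate so
            // spinning is skipped until it ratchets back below the threshold.
            _expectedSpins = SpinThreshold * Math.Pow(1 / 0.8, 100);
        }
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
            Thread.Sleep(1); // stand-in for blocking on a real wait handle
    }

    public void Exit() => Volatile.Write(ref _state, 0);
}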
In my comment I suggested that the scalability could be achieved more simply (by 'Should*' methods here, coupled with remembering when spinning is ineffective because you will end up blocking anyway).
General thoughts: The issue arises when threads are thrashing on the spin lock, typically readers. When a writer comes in, it is unable to get the spin lock and quickly reaches the Sleep(1) stage of the spin loop, where it becomes even less likely that it would get the lock by chance. Eventually, after enough iterations of getting the spin lock and not making progress on the RW lock because a read lock is typically held, it goes into a wait state. While the writer is waiting, new readers spin on the spin lock and the RW lock, delaying threads holding the read lock from releasing their lock. Eventually, readers are drained and the writer is signaled and acquires the write lock. Then it has trouble releasing the write lock because readers are thrashing on the spin lock. Potential considerations:
It appears the testing I did on limiting spinning was done before I finished tuning the deprioritization. In the current state of this PR, it's not helping noticeably with scale. Even without the deprioritization (I hadn't tried this before), it seems to work somewhat well, but only noticeably at few threads. I still think it's a decent thing to do, as it's very unlikely to have negative effects. Mainly, once a writer starts waiting, it stops new readers from thrashing on the spin lock for too long unnecessarily, allowing read lock holders to efficiently release their locks. At least that's the theory. In practice, with the current state of deprioritization, as soon as a waiter starts to spin for the spin lock, new readers become deprioritized and read lock holders begin draining. By the time the writer registers as a waiter (often it does not even get to this point), most of the read lock holders have already drained, so the benefit of limiting new readers from contending with read lock releasers is negligible.
That's interesting, I can try some experiments and get back. Regarding limiting spinning, I can try out what you suggested. It may help to not starve the writer from registering as a waiter for so long. It may be tricky to tune, to avoid artificial delays that hurt throughput. I still prefer the deprioritization idea (introducing some fairness into spin lock acquisition), as it directly attacks the problem.
Thanks @kouvel for the information above. It is helpful. What I have learned (the hard way) with things like concurrency and thread pools, is
So I would like to apply this to spinning and spin locks (in general).

First, why do we have a spin lock in the reader-writer lock? Basically, the beauty of a spin lock is that it is both simple and efficient (because you don't need an interlocked operation on release), IF the lock hold time is (uniformly) small (that is, measured in 10s to 100s of CPU clocks, not 1000s). We strongly believe that this is true for the 'myLock' (because it is ONLY held when you are in transition, no more than the body of one of the RW lock methods). Indeed, the fact that we EVER get to the Sleep(0), let alone the Sleep(1), in the myLock code shows that we are in a 'bad' regime; however, myLock is not performing that badly (at any instant of time SOME thread either has the lock or is in the process of getting into it). This is about as good as you can do (after all, you DO have to serialize access to it). You can probably do a bit better to avoid consuming the (shared) memory bus (by spinning locally more before you ping the lock), but that is not the main problem. The problem is the spinning at the RWLock level.

Now it is fair to ask 'do we need to spin at all?' This is a fair question. Why do we spin at all? The idea is that there is a COST to waiting (and waking up). It is the cost of context switching out and switching back (this is about 1-3K cycles), and the cost you incur because the memory caches were 'cooled' (filled with other data), which now has to be fetched again. This is hard to quantify and would be very scenario dependent, but a reasonable ballpark figure is about 10K-100K cycles (0.03 msec). You can avoid this cost if you spin AND THE WAIT TIME IS SHORT (certainly < 100K cycles, but more like < 10K). Thus spinning only helps in the case where the wait time is very short. More importantly, however, spinning only pays off if it succeeds. Otherwise you pay strictly MORE. This is the fundamental issue that causes the 'meltdown' where increased load makes the system LESS efficient.

One very simple solution is to simply not spin. This is really not that bad. It works well when lock hold time is long (since spinning does not help there), and when there is no contention. It will hurt, however, if hold time is very short. Thus what you WANT is something that spins when the wait time is < 10-100K cycles and otherwise does no spinning. The trick is that you need to know how long the wait time is. But we HAVE a good idea of that: it is the time it took the last time (or the average of some set of times in the past). To simplify the implementation even further, the spinCount is roughly speaking a measure of the time it took to enter the lock the last time. Thus simply remembering this approximation and starting 'there' (which might mean going directly to waiting) solves the problem. (You do have the complexity of updating this time approximation, but I believe we can do this in a very few lines of code (< 10).) The key behavior we need is that if we ended up waiting in the past, then we should NOT bother spinning this time. Thus if there is high contention (and wait times are long), the system QUICKLY stops spinning completely. While having priorities for the various kinds of entries into the lock also helps, what I like about the scheme above is that it is SIMPLE (and hopefully foolproof). You only spin if by past example you have some reason to believe it will be successful. Otherwise you don't spin at all.

Notice that if you are NOT spinning, you ALWAYS make forward progress (every time the API enters MyLock, it accomplishes something). It will also be (roughly) fair (because when you wait you get put in a queue of waiters). The lock will still be hot (lots of threads will all be trying to access it), but it will be an orderly process (e.g. every thread spins, enters MyLock, gets what it needs or gets blocked). Thus there are no 'useless spins'. What I am arguing is that this scheme can be very simple (simpler than what you have), and the logic behind why it solves the 'spinning problem' is also simple (you never spin unless you have reason to believe it will succeed, and in particular you never REPEATEDLY spin and fail).

I realize that my suggestion here is not particularly welcome. You have something that seems to solve the problem, and here I am trying to make you do work to rewrite it. I get that... What I have learned is that the cost of developing software is small in comparison to maintaining it over time (since maintenance accumulates), so keeping things SIMPLE is worthwhile. I am arguing that you can implement the scheme I suggest in well under half the lines of code (hopefully a couple dozen total). The number is small enough that I will write them if you want me to. We can then see if that solution solves the issue, and if it does, whether it is reasonable to solve it that way.
I'm not opposed to changing the approach. I agree that what you suggested here may work reasonably well for this problem and that may suffice. It may even help with other more common scenarios.
Since you have something specific in mind, if you're interested in trying it out, please go ahead.
Ok, let me take a crack at it.
Responded on #13324. I have pushed a couple of commits to try to simplify the code a bit. Of what I've tried, I believe this is the most reasonable approach to solving these problems.
Added another commit that prevents waking multiple waiters when only one can make progress. For instance, when there are multiple write waiters and a writer keeps entering and exiting a write lock, every exit wakes up another waiting writer, preventing write waiters from remaining in a wait state.

This last commit solves most of the scalability issue by itself, and in many cases to a much greater degree than deprioritization. I believe it reduces contention on the spin lock by allowing waiters to remain waiting, and it eliminates excessive context switching from threads waking up and going back to sleep. However, it doesn't solve the writer starvation issue by itself, and deprioritization still performs an order of magnitude better in several cases. The last commit and deprioritization seem to combine well, getting the benefits of both.

Updated perf numbers (the benchmark and scoring have changed from before):
@ChrisAhna's repro from #12780 (perf table not preserved)
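A minimal sketch of the wake-one-waiter idea from the commit described above (not the actual ReaderWriterLockSlim internals; the names are illustrative). An auto-reset event releases at most one waiting thread per Set, so releasing a write lock wakes a single waiting writer instead of churning through all of them:

using System.Threading;

internal sealed class WriteWaiterGate
{
    private readonly AutoResetEvent _writeEvent = new AutoResetEvent(false);
    private int _numWriteWaiters; // guarded by the outer spin lock in real code

    public void OnExitWriteLock()
    {
        // Wake at most one writer; the rest remain parked instead of being
        // repeatedly woken only to fail to acquire the lock and sleep again.
        if (Volatile.Read(ref _numWriteWaiters) > 0)
            _writeEvent.Set();
    }

    public bool WaitForWriteLock(int millisecondsTimeout)
    {
        Interlocked.Increment(ref _numWriteWaiters);
        bool signaled = _writeEvent.WaitOne(millisecondsTimeout);
        Interlocked.Decrement(ref _numWriteWaiters);
        return signaled;
    }
}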
Alternative to and subset of dotnet#13243, fixes #12780

- Prevented waking more than one waiter when only one of them may acquire the lock
- Limited spinning in some cases where it is very unlikely that spinning would help

The _myLock spin lock runs into some bad scalability issues. For example:

1. Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
2. Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

This PR does not solve (1), but solves (2) to a degree that could be considered sufficient. dotnet#13243 solves (1) and (2), and for (2) it is still better by an order of magnitude compared with this PR in several cases that I believe are not extreme, but if the complexity of deprioritization is deemed excessive for the goals, then of what I have tried so far this is perhaps the simplest and most reasonable way to solve (2). I believe this PR would also be a reasonably low-risk one to port back to .NET Framework.
@kouvel I have another optimization for this as well. I drafted it when the arm64 volatile function lookup cache was broken; SpinLock.cs was getting hit on every volatile instruction lookup, and my patch made a substantial improvement. Since I fixed the real issue, I never submitted the alternative. I'll dig it up.
Rebased and squashed to simplify resolving merge conflicts. Updated perf numbers from the current state below. I think this is still a beneficial change, as it significantly improves scalability at higher thread counts when there are readers and writers, and fixes the writer starvation issue.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread): (perf table not preserved)

Core i7-6700 (Skylake, 4-core, 8-thread): (perf table not preserved)
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test OSX10.12 x64 Checked Build and Test
Ping for review please, @tarekgh, @stephentoub, @vancem. @vancem mentioned in #13324 that he's ok with this direction, but it's not clear to me whether consensus has been reached on going ahead with this PR. I'm sure the ideas @vancem suggested would improve this further for more common cases, but it's difficult to tune and to get right, and based on my tests and current understanding, those solutions don't solve the writer starvation problem and may at best reduce the severity of the scalability problem without solving it. As such, I believe the solutions are trying to solve different problems, and I believe they would work well together, so I'd rather solve the scalability issue with this PR and leave the dynamic spin heuristics for another PR. Just to be clear, I'm perfectly ok with closing this PR if we want to do more investigation before settling on this solution.
I am OK with having this PR proceed. It is not ideal because it does add complexity, and I am still thinking that a simpler solution is possible, but at some point we have to move on.

The solution embodied in this PR is conceptually simple: you want the _myLock spin lock to understand priorities, giving priority first to threads that need _myLock to RELEASE the RW lock, and secondarily to writers. I was hoping to keep it even simpler by making spin-lock improvements (that we would make on all our spin locks) that would be 'good enough' to avoid the scaling issue. However, @kouvel did quite a bit of measurement to show that it is harder than I thought. While I have not completely given up, I agree that, for now at least, we should move on. Thus we need SOME solution.

Arguably, having this kind of priority scheme is always good from an efficiency perspective (it is clear that giving priority to those RELEASING locks can only help things); the question is whether it is worth the complexity. Given that it solves the original scaling problem (which is good, but arguably a corner case, since users should be FIXING a lock that gets that hot), but ALSO improves more real-world scenarios (1 writer and a modest number of readers) in a nontrivial way, it makes it worth it. Also, the change really is not THAT complex (it is in the implementation of a single class, and the intuition behind why it works is straightforward). It is likely to be ADDITIVE to any improvements we make in spin heuristics later.

The only concern is the regression on the Xeon E5-1650. However, I note that it did not happen on the Core i7-6700, where there was improvement. And this benchmark is not the most interesting (if you are using a reader-writer lock, you expect readers). So I am OK with this being checked in.
Fixes #12780

The _myLock spin lock runs into some bad scalability issues. For example:

- Readers can starve writers for an unreasonable amount of time. Typically there would be more readers than writers, and it doesn't take many readers to starve a writer. On my machine with 6 cores (12 logical processors with hyperthreading), 6 to 16 reader threads attempting to acquire the spin lock to acquire or release a read lock can starve one writer thread from acquiring the spin lock for several or many seconds. The issue magnifies with more reader threads.
- Readers and especially writers that hold the RW lock can be starved from even releasing their lock. Releasing an RW lock requires acquiring the spin lock, so releasers are easily starved by acquirers. How badly they are starved depends on how many acquirers there are, and it doesn't take many to show a very noticeable scalability issue. Often, these acquirers are those that would not be able to acquire the RW lock until one or more releasers release their lock, so the acquirers effectively starve themselves.

Took some suggestions from @vancem and landed on the following after some experiments:

- Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
- Limited spinning in some cases where it is very unlikely that spinning would help
@@ -63,11 +63,17 @@ public class ReaderWriterLockSlim : IDisposable
    // the events that hang off this lock (eg writeEvent, readEvent upgradeEvent).
    private int _myLock;

    // Used to deprioritize threads attempting to enter _myLock when they would not make progress after doing so
    private int _enterMyLockDeprioritizationState;
Please put a reasonably verbose comment about what this state is. In particular, it is a tuple of two counts (read and write), and these counts are requests to deprioritize that set of lock requestors. All of this is hidden behind EnterSpinLock (I am sorely tempted to make a struct SpinLock which encapsulates _myLock and _enterMyLockDeprioritizationState and hides all this except for EnterMyLockSpin(EnterMyLockReason reason); this keeps it conceptually localized).
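A hypothetical sketch of the kind of encapsulation suggested above (the names and bit layout are illustrative, not the PR's actual code): one int holds two 16-bit counts of outstanding requests to deprioritize readers (low bits) and writers (high bits), so each can be updated with a single interlocked add.

using System.Threading;

internal struct DeprioritizedSpinLockState
{
    private int _deprioritizationState;

    private const int ReaderUnit = 1;       // low 16 bits: deprioritize-reader requests
    private const int WriterUnit = 1 << 16; // high 16 bits: deprioritize-writer requests

    public void DeprioritizeReaders() => Interlocked.Add(ref _deprioritizationState, ReaderUnit);
    public void UndeprioritizeReaders() => Interlocked.Add(ref _deprioritizationState, -ReaderUnit);
    public void DeprioritizeWriters() => Interlocked.Add(ref _deprioritizationState, WriterUnit);
    public void UndeprioritizeWriters() => Interlocked.Add(ref _deprioritizationState, -WriterUnit);

    public bool AreReadersDeprioritized => (_deprioritizationState & 0xffff) != 0;
    public bool AreWritersDeprioritized => ((uint)_deprioritizationState >> 16) != 0;
}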
Thanks @vancem. Regarding the regression above, it appears to be happening by chance, but more consistently on this processor. In that test, all enter-write threads will eventually go into a wait state while one writer holds the lock. The writer that holds the lock typically acquires and releases the lock repeatedly and wakes up at most one write waiter at a time. In the baseline, the write waiter is usually not able to acquire the lock when it wakes up. On the changed side, the write waiter acquires the lock more frequently. This causes the previous lock releaser thread to become a spinner, and the write waiter that now holds the lock wakes up another write waiter when it releases the lock. As a result, more threads are spinning for the write lock instead of waiting. The CPU usage shows this: the changed side is using 2.5x more CPU than the baseline.

The issue occurs consistently when there are 64 * proc count writer threads, but consistently does not occur when there are 16 * proc count writer threads, and there's not much difference between those two scenarios. I can't think of a reason why this would be happening; maybe slight timing changes are making the difference. One way to fix this deterministically may be to register the number of spinners and avoid waking up waiters when spinners can satisfy the wait. I'll leave that for another time if it becomes a real issue.
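A hypothetical sketch of the deterministic fix mentioned above (names are illustrative): track how many threads are actively spinning for the write lock, and skip waking a sleeping waiter when a spinner would satisfy the release anyway.

using System.Threading;

internal sealed class SpinnerAwareWakeGate
{
    private readonly AutoResetEvent _writeEvent = new AutoResetEvent(false);
    private int _numWriteWaiters;
    private int _numWriteSpinners; // incremented/decremented around the spin phase

    public void OnExitWriteLock()
    {
        // Waking a waiter while spinners are present usually just burns a
        // context switch: a spinner grabs the lock first and the woken
        // thread goes right back to sleep.
        if (Volatile.Read(ref _numWriteSpinners) == 0 &&
            Volatile.Read(ref _numWriteWaiters) > 0)
        {
            _writeEvent.Set(); // auto-reset: releases at most one waiter
        }
    }
}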
Thanks for looking into it @kouvel. It is good to hear that the regression does not happen consistently. I agree that it is not clear it is worth fixing (since the scenario is not the highest priority, it only happens when the lock is hot enough that users should be fixing the hot lock, and the regression is not bad).

For what it is worth, I think it would be useful if we changed our scalability tests to have parameterized amounts of 'in-lock' and 'out-lock' work in the range of 100 to 10,000 cycles. This would be much more representative of a real user workload. As it is, our current microbenchmarks seem to be leading us to tune for the 'hot lock' case, which is NOT really the most important case (when locks use 80%+ of the CPU, user code likely needs to be fixed to cool the lock; but when locks use 20% of the CPU, reducing it to 10% is an important win).
I'm gathering some perf numbers with delays added.
I added a delay inside and outside each lock by calculating the N'th Fibonacci number using recursion, where N is between 4 and 14, using a pseudorandom number generator. 4 and 14 on my machine correspond to approximately 100 and 10,000 cycles. This scheme prefers smaller delays. With delays added, it seems difficult to compare the throughput scores apples to apples due to the writer starvation problem. The baseline starves writers and allows readers to happily chug away. The changed version does not starve writers, and since writers block readers from acquiring the lock, the greater the frequency of writers, the lower the overall throughput, because writers drain all readers before they acquire the lock. To make things a bit more realistic and comparable, I increased the delay outside the write lock such that N is between 15 and 19. On my machine, Fib(19) takes about 35 us to calculate using recursion.

Here are some raw numbers. Each line is a measurement over 500 ms, all in one process in continuous operation.

1 writer thread

(1 * proc count) reader threads
(4 * proc count) reader threads
Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads
2 writer threads

(1 * proc count) reader threads
(4 * proc count) reader threads
Note that by this point the baseline starves writers so badly that not even one write lock is taken in 20 seconds.

(16 * proc count) reader threads
The baseline has a higher overall throughput here, but it also starves writers, so it's not very meaningful. There will also be cases where the changed version has much lower throughput due to not starving writers and there being a high frequency of writers. The following is an example, and the difference can get more severe than this, but in the end I don't think it's a very meaningful comparison.

8 writer threads

(16 * proc count) reader threads
Code used for raw numbers above:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

internal static class ReaderWriterLockSlimScalePerf
{
private static void Main(string[] args)
{
int processorCount = Environment.ProcessorCount;
int readerThreadCount, writerThreadCount;
if (args.Length == 0)
{
readerThreadCount = 1;
writerThreadCount = 1;
}
else
{
readerThreadCount = (int)uint.Parse(args[0]);
writerThreadCount = (int)uint.Parse(args[1]);
}
var rw = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);
var sw = new Stopwatch();
var startThreads = new ManualResetEvent(false);
var counts = new int[48]; // slots 16 (reads) and 32 (writes) are spaced apart to keep the counters on separate cache lines
var readerThreads = new Thread[readerThreadCount];
ThreadStart readThreadStart =
() =>
{
var rng = new Random(0);
startThreads.WaitOne();
while (true)
{
uint d0 = (uint)rng.Next(4, 10);
uint d1 = (uint)rng.Next(4, 10);
rw.EnterReadLock();
Delayer.CpuDelay(d0);
rw.ExitReadLock();
Interlocked.Increment(ref counts[16]);
Delayer.CpuDelay(d1);
}
};
for (int i = 0; i < readerThreadCount; ++i)
{
readerThreads[i] = new Thread(readThreadStart);
readerThreads[i].IsBackground = true;
readerThreads[i].Start();
}
var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
var writerThreads = new Thread[writerThreadCount];
ThreadStart writeThreadStart =
() =>
{
var rng = new Random(1);
startThreads.WaitOne();
while (true)
{
uint d0 = (uint)rng.Next(4, 14);
uint d1 = (uint)rng.Next(15, 20);
rw.EnterWriteLock();
Delayer.CpuDelay(d0);
rw.ExitWriteLock();
Interlocked.Increment(ref counts[32]);
Delayer.CpuDelay(d1);
}
};
for (int i = 0; i < writerThreadCount; ++i)
{
writerThreads[i] = new Thread(writeThreadStart);
writerThreads[i].IsBackground = true;
writerThreads[i].Start();
}
startThreads.Set();
// Warmup
Thread.Sleep(500);
Interlocked.Exchange(ref counts[16], 0);
Interlocked.Exchange(ref counts[32], 0);
// Actual run
var unweightedScores = new List<double>();
var scores = new List<double>();
while (true)
{
sw.Restart();
Thread.Sleep(500);
sw.Stop();
int readCount = Interlocked.Exchange(ref counts[16], 0);
int writeCount = Interlocked.Exchange(ref counts[32], 0);
double elapsedMs = sw.Elapsed.TotalMilliseconds;
double unweightedScore = (readCount + writeCount) / elapsedMs;
double score =
new double[]
{
Math.Max(1, (readCount + writeCount) / elapsedMs),
Math.Max(1, writeCount / elapsedMs)
}.GeometricMean(readerThreadCount, writerThreadCount);
unweightedScores.Add(unweightedScore);
scores.Add(score);
Console.WriteLine(
"{0:0}\t{1:0}\t{2:0}\t{3:0}\t{4:0}\t{5:0}",
readCount / elapsedMs,
writeCount / elapsedMs,
unweightedScore,
unweightedScores.Average(),
score,
scores.Average());
}
}
internal static class Delayer
{
private static int[] s_delayValues = new int[32];
public static void CpuDelay(uint n)
{
// Writing the result to a static array keeps the recursive computation from being optimized away
s_delayValues[16] = Fib(n);
}
private static int Fib(uint n)
{
if (n == 0)
return 0;
if (n == 1)
return 1;
return Fib(n - 2) + Fib(n - 1);
}
}
public static double GeometricMean(this IEnumerable<double> values, params double[] weights)
{
double logSum = 0, weightSum = 0;
int weightIndex = 0;
foreach (var value in values)
{
if (weightIndex >= weights.Length)
throw new InvalidOperationException();
var weight = weights[weightIndex];
++weightIndex;
logSum += Math.Log(value) * weight;
weightSum += weight;
}
return Math.Exp(logSum / weightSum);
}
}
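For reference, the program above takes the reader and writer thread counts as its two command-line arguments (for example, "48 1" for 48 reader threads and 1 writer thread); with no arguments, both default to 1.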
If there are no further concerns, I'll go ahead and merge this tomorrow |
I liked the encapsulation of myLock!
Fixes #12780
The _myLock spin lock runs into some bad scalability issues. For example:
- Readers can starve writers from acquiring the spin lock for an unreasonable amount of time (several or many seconds), and it doesn't take many reader threads to do so
- Readers and especially writers that hold the RW lock can be starved from even releasing their lock, since releasing requires acquiring the spin lock
Took some suggestions from @vancem and landed on the following after some experiments:
- Introduced some fairness to _myLock acquisition by deprioritizing attempts to acquire _myLock that are not likely to make progress on the RW lock
- Limited spinning in some cases where it is very unlikely that spinning would help
@ChrisAhna's repro from #12780: results for Baseline, Changed, and Monitor (perf tables not preserved)