Spin wait tuning #13388

Closed
helloguo opened this Issue Aug 15, 2017 · 47 comments

9 participants
@helloguo (Contributor) commented Aug 15, 2017

Spin-wait can have a significant cost in power and performance if not done properly. The PAUSE instruction on Skylake changed from previous generations such as Haswell and Broadwell. From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 8.4.7: "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles." "As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." We are seeing issues with spinning on Skylake Server with TechEmpower Plaintext, where an excessive amount of time is spent in spin code. It is beneficial to tune the spin count or spin cycles, and we have a proposal for a fix below.

For example, the original spin code from t_join::join is shown below.

                int spin_count = 4096 * g_num_processors;
                for (int j = 0; j < spin_count; j++)
                {
                    if (color != join_struct.lock_color)
                    {
                        break;
                    }
                    YieldProcessor();           // indicate to the processor that we are spinning
                }

Assuming YieldProcessor() took about 10 cycles when the above code was first written and tuned, we could bound the loop by spin cycles instead of a spin count, like this:

                int spin_count = 4096 * g_num_processors;

                // Assume YieldProcessor() took 10 cycles 
                // when the above code was first written and tuned
                ptrdiff_t endTime = get_cycle_count() + spin_count * 10;

                while (get_cycle_count() < endTime)
                {
                    if (color != join_struct.lock_color)
                    {
                        break;
                    }
                    YieldProcessor();           // indicate to the processor that we are spinning
                }
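For reference, get_cycle_count() above is not an existing CLR helper; a minimal sketch of what it could look like on x86, assuming the __rdtsc compiler intrinsic, might be:

    #include <stdint.h>
    #ifdef _MSC_VER
    #include <intrin.h>      // __rdtsc
    #else
    #include <x86intrin.h>   // __rdtsc
    #endif

    // Hypothetical helper assumed by the proposal above: read the processor's
    // time-stamp counter, which ticks at a roughly constant rate on modern x86
    // parts and so can serve as a cycle-budget clock for the spin loop.
    inline ptrdiff_t get_cycle_count()
    {
        return (ptrdiff_t)__rdtsc();
    }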
@helloguo (Contributor) commented Aug 15, 2017

/cc @jkotas

@jkotas (Member) commented Aug 15, 2017

cc @vancem

@vancem (Member) commented Aug 15, 2017

cc @kouvel

@benaadams (Collaborator) commented Aug 16, 2017

Might have a major effect on Thread.SpinWait(iterations)

For example, in ManualResetEventSlim, the wait path loops on i and calls Thread.SpinWait(PlatformHelper.ProcessorCount * (4 << i)):

for (int i = 0; i < spinCount; i++)
{
    if (IsSet)
    {
        return true;
    }
    else if (i < HOW_MANY_SPIN_BEFORE_YIELD)
    {
        if (i == HOW_MANY_SPIN_BEFORE_YIELD / 2)
        {
            Thread.Yield();
        }
        else
        {
            Thread.SpinWait(PlatformHelper.ProcessorCount * (4 << i));
        }
    }
    // ...
}

which calls

FCIMPL1(void, ThreadNative::SpinWait, int iterations)
{
    FCALL_CONTRACT;

    //
    // If we're not going to spin for long, it's ok to remain in cooperative mode.
    // The threshold is determined by the cost of entering preemptive mode; if we're
    // spinning for less than that number of cycles, then switching to preemptive
    // mode won't help a GC start any faster. That number is right around 1000000
    // on my machine.
    //
    if (iterations <= 1000000)
    {
        for (int i = 0; i < iterations; i++)
            YieldProcessor();
        return;
    }

    //
    // Too many iterations; better switch to preemptive mode to avoid stalling a GC.
    //
    HELPER_METHOD_FRAME_BEGIN_NOPOLL();
    GCX_PREEMP();
    for (int i = 0; i < iterations; i++)
        YieldProcessor();
    HELPER_METHOD_FRAME_END();
}
FCIMPLEND
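To get a feel for the magnitudes involved, a rough back-of-the-envelope sketch (the processor count, the PAUSE latencies, and the number of spin rounds are illustrative assumptions, not measurements):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // Illustrative assumptions only: 8 logical processors, the ~10-cycle
        // pre-Skylake PAUSE latency and ~140-cycle Skylake latency quoted from
        // the Intel optimization manual, and 10 spin rounds before yielding.
        const long long procCount = 8;
        for (int i = 0; i < 10; i++)
        {
            long long iterations = procCount * (4LL << i); // value passed to Thread.SpinWait
            std::printf("i=%d: %lld PAUSEs, ~%lld cycles pre-Skylake, ~%lld cycles on Skylake\n",
                        i, iterations, iterations * 10, iterations * 140);
        }
        return 0;
    }

Under these assumptions the last round alone issues 16,384 PAUSEs, which is on the order of 2.3 million cycles per SpinWait call on Skylake.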

@redknightlois commented Aug 16, 2017

I have been tracking an unusual MAJOR regression (as in capital-letters major) from 60K req/sec (running the load generator and the target process on the same machine) to 3K req/sec that may be related to this. We have only seen this after upgrading the solution to CoreCLR 2.0. Under the profiler, the only suspect for the odd behavior is ManualResetEventSlim: we see an inordinate amount of CPU time attributed to it when it should have been very low.

[profiler screenshot]

It is fair to say that we knew our bottleneck pre-2.0 was thread state handoff, so we expected it to be the same on 2.0, but this behavior took me off guard. If you think it could be related, just let me know and we can connect over Skype to help you build an environment for investigation.

danmosemsft added this to the 2.0.0 milestone on Aug 16, 2017

@vancem (Member) commented Aug 16, 2017

I have been in conversation with @kouvel on spinning, and I would like us to make a small modification to spin loops such that they remember the spin count that worked in the past (or, more importantly, the fact that it did NOT work in the past), and use that to determine how much spinning to do this time (or whether to bail immediately). This change was motivated simply by reducing spin overhead, but it should also help with this issue.

This does not completely fix the problem, because although you will bail quickly when you bail, the spinning you do do is 14X longer. It seems to me that what we should do is make SpinWait(count) normalize its argument (that is, a count would always represent a certain number of machine time units (cycles), so machines where PAUSE is long would issue proportionally fewer of them). This approach may have problems, but it seems like something that should be on the table...
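A rough sketch of the "remember what worked last time" idea, with all names and constants made up for illustration (this is not CoreCLR code):

    #include <algorithm>
    #include <atomic>
    #include <emmintrin.h>   // _mm_pause

    // Adaptive spin budget: remember whether spinning paid off last time and
    // adjust how much spinning is done next time, or bail almost immediately
    // if it never pays off.
    struct AdaptiveSpin
    {
        std::atomic<int> budget{ 64 };

        template <typename Condition>
        bool Spin(Condition cond)
        {
            int limit = budget.load(std::memory_order_relaxed);
            for (int i = 0; i < limit; i++)
            {
                if (cond())
                {
                    // Spinning worked; allow a somewhat larger budget next time.
                    budget.store(std::min(limit * 2, 4096), std::memory_order_relaxed);
                    return true;
                }
                _mm_pause();
            }
            // Spinning did not pay off; shrink the budget so the next waiter bails quickly.
            budget.store(std::max(limit / 2, 1), std::memory_order_relaxed);
            return false;
        }
    };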

@benaadams (Collaborator) commented Aug 16, 2017

/cc @stephentoub FYI

@redknightlois commented Aug 16, 2017

@vancem just point us to the private build for it and we can test it.

@helloguo (Contributor) commented Aug 16, 2017

@redknightlois what's your hardware?

@redknightlois commented Aug 16, 2017

Skylake & Kaby Lake, probably by tomorrow we will have the same test on pre-Skylake.

@helloguo (Contributor) commented Aug 18, 2017

@redknightlois Do you see the same issue on pre-Skylake platforms? If yes, it may suggest something else. If you only see the issue on Skylake, it may suggest the spin loops with the long PAUSE instruction need to be tuned accordingly. Some test cases would also be helpful to narrow down the problem.

@redknightlois commented Aug 18, 2017

@helloguo I haven't gotten to the office yet. Will be there in a few hours.

@redknightlois commented Aug 19, 2017

@helloguo @vancem I can confirm with absolute certainty (I believe this is the very first time in my life I can say that) that what we are looking at on our side is this issue at work. We ran the same thing on Skylake i7-6700K, Kaby Lake i7-7700 and Ivy Bridge i5-3470.

Results:

  • i7-6700K = 3.5K request/sec
  • i7-7700 = 3.5K request/sec
  • i5-3470 = 12K request/sec

Moreover, we can also confirm with high confidence that CoreCLR 1.1 also suffers from this, to a lesser degree. We were able to work around the Wait CPU consumption by manually yielding the thread and waiting until we hit the 5th round before calling the primitive Wait. That diminished the CPU cost of Wait from 30% to 8% in the exact same scenario and allowed us to jump from 30K to 43K requests/sec on the i7-7700.
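The workaround described above amounts to spinning and yielding for a few rounds before falling back to the blocking wait. A language-neutral sketch of that shape (the round count of 5 is taken from the comment; everything else, including the blocking_wait callback, is illustrative):

    #include <atomic>
    #include <thread>

    // Spin/yield a few rounds before falling back to a (potentially expensive)
    // blocking wait. blocking_wait() stands in for the primitive Wait call and
    // is assumed to be supplied by the caller.
    template <typename BlockingWait>
    void wait_with_spin(std::atomic<bool>& signaled, BlockingWait blocking_wait)
    {
        for (int round = 0; round < 5; round++)   // 5 rounds, per the comment above
        {
            if (signaled.load(std::memory_order_acquire))
                return;
            std::this_thread::yield();             // give up the time slice instead of burning PAUSEs
        }
        blocking_wait();                           // only now pay for the kernel wait
    }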

Hope we can fix this soon, because we cannot ship a Release Candidate with this issue in it.

@helloguo (Contributor) commented Aug 21, 2017

@redknightlois Thanks for your detailed numbers. Can you share how to set up and run your workload, so that I can run it against my prototype?

@kouvel (Member) commented Aug 21, 2017

@redknightlois, thanks for the information. One other difference between the above procs could be that the i7s have hyperthreading and the i5 doesn't, and since the pause count is multiplied by the number of procs, the i7s could be issuing twice as many pauses. Do you have hyperthreading enabled on those?

Probably the multiplying should be removed anyway.

Also @redknightlois, what spin counts are you using for the problematic ManualResetEventSlims? I don't think it matters but just to test similarly on my side.

@redknightlois commented Aug 21, 2017

@kouvel Yes, they both have hyperthreading enabled, and the ManualResetEventSlim is created with the default parameters. https://github.com/ravendb/ravendb/blob/v4.0/src/Voron/GlobalFlushingBehavior.cs#L80

@kouvel (Member) commented Aug 22, 2017

@helloguo, is there by any chance a way to determine from the processor the cycle count delay incurred by pause? It seems like using the cycle count could be error-prone and I'm inclined to just try reducing the quantity of pauses.

@helloguo (Contributor) commented Aug 22, 2017

@kouvel Off the top of my head, we could use CPUID to check the processor's code name and then return the latency of the PAUSE instruction based on that code name.

@kouvel @redknightlois I put a prototype here #13527, which uses the method recommended by the optimization guide.
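As an alternative to a code-name table, the PAUSE cost could also be estimated directly at startup. A minimal sketch, assuming an x86 target and the __rdtsc/_mm_pause intrinsics (measure_pause_cycles is a hypothetical name, not an existing runtime function):

    #include <stdint.h>
    #if defined(_MSC_VER)
    #include <intrin.h>        // __rdtsc, _mm_pause
    #else
    #include <x86intrin.h>     // __rdtsc
    #include <emmintrin.h>     // _mm_pause
    #endif

    // Hypothetical helper: estimate the average cost of one PAUSE in TSC ticks
    // (close to core cycles on parts with an invariant TSC). A real
    // implementation would repeat the measurement and guard against the thread
    // being preempted in the middle of it.
    static uint32_t measure_pause_cycles()
    {
        const int iterations = 1000;
        uint64_t start = __rdtsc();
        for (int i = 0; i < iterations; i++)
            _mm_pause();
        uint64_t end = __rdtsc();
        return (uint32_t)((end - start) / iterations);
    }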

@redknightlois commented Aug 22, 2017

@helloguo @kouvel Is there any way you can push to MyGet a version that I could link against for CoreCLR 2.0? I am not building it locally.

@helloguo (Contributor) commented Aug 22, 2017

@redknightlois Can you replace path\to\coreclr\packages\microsoft.dotnet.buildtools\2.0.0-prerelease-01914-02\lib\tool-runtime\project.csproj

with

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFrameworks>netcoreapp2.0;net46</TargetFrameworks>
    <EnableDefaultItems>false</EnableDefaultItems>
    <!--<RuntimeFrameworkVersion Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">2.0.0-beta-001776-00</RuntimeFrameworkVersion>-->
    <!--<PackageTargetFallback Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">$(PackageTargetFallback);dnxcore50;portable-net45+win8</PackageTargetFallback>-->
    <OutputType>Exe</OutputType>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="Program.cs" />
  </ItemGroup>

  <ItemGroup Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">
    <PackageReference Include="Microsoft.Build.Framework" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Runtime" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Tasks.Core" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Utilities.Core" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.DotNet.PlatformAbstractions" Version="2.0.0-preview2-25407-01" />
    <PackageReference Include="Microsoft.Net.Compilers.netcore" Version="2.0.0-rc4" />
    <PackageReference Include="Microsoft.Net.Compilers.Targets.NetCore" Version="0.1.5-dev" />
    <PackageReference Include="Microsoft.Cci" Version="4.0.0-rc4-24217-00" />
    <PackageReference Include="System.Composition" Version="1.1.0-preview2-25405-01" />
    <PackageReference Include="Newtonsoft.Json" Version="10.0.3" />
  </ItemGroup>

  <ItemGroup Condition=" '$(TargetFramework)' == 'net46' ">
    <PackageReference Include="Microsoft.Cci" Version="4.0.0-rc4-24217-00" />
    <PackageReference Include="Microsoft.NETCore.Windows.ApiSets" Version="1.0.1" />
    <PackageReference Include="System.IO.FileSystem" Version="4.3.0" />
    <PackageReference Include="System.Diagnostics.FileVersionInfo" Version="4.3.0" />
    <PackageReference Include="System.Diagnostics.TraceSource" Version="4.3.0" />
  </ItemGroup>

</Project>

I cannot build it without this change; see #13025.

@kouvel (Member) commented Aug 22, 2017

I'm planning on doing two separate PRs for this: one mainly to remove the "multiply by processor count", which IMO is a low-risk, easy-to-explain change, and another to tune the spin counts based on a measure of the delay. I don't think we would want to keep track of processor code names; that would be a new maintenance issue.

@kouvel (Member) commented Aug 22, 2017

@redknightlois:

Is there any way you can push to MyGet a version that I could link against for CoreCLR 2.0? I am not building it locally.

I'm hoping the first PR I mentioned above can be merged soon, and maybe you can test with the daily build that contains the change.

@helloguo (Contributor) commented Aug 22, 2017

one mainly to remove the "multiply by processor count" which IMO is a low-risk easy-to-explain change

I assume you are referring to the spin in the GC part, right? Any plans for the spins in MonitorEnter and other places?

Do you have any benchmarks that can be shared with the community? It would be great if we could measure them as well :)

I don't think we would want to keep track of processor code names, that would be a new maintenance issue.

I agree. If we can find a reasonable number of spin cycles by measuring benchmarks, we can use the "Dynamic Pause Loop Example" from https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf to get rid of processor code names.

[screenshot of the "Dynamic Pause Loop Example" from the Intel optimization manual]
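The manual's example is not reproduced here, but the general shape of such a dynamic back-off loop, sketched from the idea rather than copied from the manual (MAX_BACKOFF is an assumed tuning knob):

    #include <atomic>
    #include <emmintrin.h>   // _mm_pause

    // Exponential back-off spin: the number of PAUSEs per round grows
    // geometrically up to a cap instead of relying on a fixed count tuned for
    // one microarchitecture's PAUSE latency.
    static void spin_until_set(std::atomic<bool>& flag)
    {
        const int MAX_BACKOFF = 64;
        int backoff = 1;
        while (!flag.load(std::memory_order_acquire))
        {
            for (int i = 0; i < backoff; i++)
                _mm_pause();
            if (backoff < MAX_BACKOFF)
                backoff <<= 1;
        }
    }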

@redknightlois commented Aug 22, 2017

@kouvel I was going to ask the same thing; there have been many changes to sync primitives in CoreCLR 2.0, so this should be reviewed all over the place. If I remember correctly, @benaadams has done some assembler work too between 1.1 and 2.0 (correct me if I am wrong).

@kouvel (Member) commented Aug 23, 2017

Regarding removing the multiply by proc count, I was referring to ManualResetEventSlim and another place I found in managed code where the pause count per iteration is multiplied by proc count; I'm planning on removing those. The multiply by proc count in the GC is tuning the spin count with no back-off scheme, so it could probably be changed to use the method you suggested. For the second PR I was going to start with Thread.SpinWait, measuring the delay incurred by the pause and calibrating the delay issued per iteration based on that, and spin loops that use a back-off scheme could use the calibrated version to hopefully get somewhat equivalent behavior on different procs.

It'll have to be reviewed in a bunch of places and expanded, probably one at a time though, along with testing perf for what is being changed. I don't have benchmarks yet; I was just planning on writing them as I go.

@redknightlois commented Aug 23, 2017

@kouvel I created a repro package with some instructions if you want to try your changes on our codebase.

You need to have Bash for Windows (or run on Linux) with wrk installed.

  • Download: http://chilp.it/6c4620c
  • Run RavenDB on an SSD or RAM disk.
  • Open the browser at http://localhost:8080.
  • Create a database named BenchmarkDB (don't forget to do this, or you are not going to hit the proper code path).
  • Run the content of writes.cmd in bash (it will run wrk against localhost to generate contention against your local database).

Let me know how it went. If you need further assistance, let me know and we can setup a shared screen call to work on it.

EDIT: Updated the link because for some reason it wasn't working.

kouvel added a commit to kouvel/coreclr that referenced this issue Aug 23, 2017

Don't multiply YieldProcessor count by proc count
Related to issue mentioned in dotnet#13388
- Multiplying the YieldProcessor count by proc count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics as they are without the multiply seem to perform much better.
- The issue above also mentions that PAUSE on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that in both tests tested, it begins crawling at low thread counts.
- I did most of the testing on ManualResetEventSlim, and since Task is using the same spin heuristics, applied the same change there as well.
@kouvel (Member) commented Sep 1, 2017

That said, MRES has a threshold after which it goes into a Sleep(1), which can significantly increase latency. You may want to avoid that threshold, but it's worth a try. You could also try using ManualResetEvent (not slim) with your own custom spinning for the purpose of experimentation.

@kouvel (Member) commented Sep 1, 2017

Could also try ManualResetEvent (not slim) by itself (no spinning), as Yield/Sleep(0) can also contribute to context switching

@redknightlois commented Sep 5, 2017

OK. After much analysis work I can conclude and confirm:

  • With the latest commit (03bf95c) on the CoreCLR 2.1 branch, spinning in ManualResetEventSlim is no longer an issue, even though we are particularly limited by Kestrel P/Invokes into libuv and the async machinery.
  • The latest commit has higher throughput than the intermediate solution.
  • High certainty that kernel measurements are artifacts (kernel waits are not cheap, but nothing we didn't know about).
  • Windows Performance Recorder makes it hard (unless there is a way I didn't find) to filter out all instances of System\ETW Overhead from the measurements :)

So the fix LGTM. cc @helloguo

dotnet-bot added a commit to dotnet/corert that referenced this issue Sep 6, 2017

Add normalized equivalent of YieldProcessor, retune some spin loops (#13670)

* Add normalized equivalent of YieldProcessor, retune some spin loops

Part of fix for dotnet/coreclr#13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor are in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-ops and it turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
      - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
      - For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
      - Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
  - On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
  - On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
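As a rough illustration of the normalization the commit message describes (the identifier names, signature, and calibration constant here are illustrative, not the runtime's actual code):

    #include <emmintrin.h>   // _mm_pause

    // Calibrated lazily at run time (e.g. with an rdtsc measurement like the
    // one sketched earlier in this thread): how many raw PAUSEs it takes to
    // burn one "normalized" yield, where the normalized unit is roughly the
    // cost of a single PAUSE on Skylake (~125 cycles per the commit message,
    // about 7 PAUSEs on Sandy Bridge).
    static unsigned g_pausesPerNormalizedYield = 1;   // set by calibration code

    static void YieldProcessorNormalized(unsigned normalizedYields)
    {
        unsigned pauses = normalizedYields * g_pausesPerNormalizedYield;
        for (unsigned i = 0; i < pauses; i++)
            _mm_pause();
    }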

jkotas added a commit to dotnet/corert that referenced this issue Sep 6, 2017

Add normalized equivalent of YieldProcessor, retune some spin loops (#13670)

@kouvel (Member) commented Sep 6, 2017

Thanks @redknightlois. How much improvement are you seeing from 03bf95c? That change has more risk of regressions, so I wasn't considering porting it back to 2.0, but if there is a significant improvement in some real scenarios, I'm wondering whether it would be worth a try.

@redknightlois commented Sep 6, 2017

@kouvel We jumped from 27K requests/sec to 30K requests/sec, so that's roughly a 10% improvement.

@redknightlois commented Sep 27, 2017

Any ETA on the servicing release, now that 2.0.1 is scheduled to be (if not already) out in the open?

@kouvel (Member) commented Sep 28, 2017

I'm not sure about the cadence of .NET Core servicing releases. @karelz, is there a rough timeframe for 2.0.2 now?

@karelz (Member) commented Sep 28, 2017

@leecow do we have a rough ETA for the 2.0.1 release?
Is it something we could put on the roadmap (with the caveat that ETAs can change)?

@kouvel (Member) commented Sep 28, 2017

I think @redknightlois is looking for 2.0.2, as the fix was available when 2.0.1 was almost finalized

helloguo changed the title from "Spin count tuning" to "Spin wait tuning" on Sep 28, 2017

@redknightlois commented Sep 29, 2017

@karelz @kouvel Yes, looking for the 2.0.2 release :)

@redknightlois commented May 16, 2018

Shouldn't this be closed?

@kouvel (Member) commented May 16, 2018

Yes. Some spinning cases remain, but they are covered by other issues.
