Threadpool performance 5x slower under Linux under WSL2 vs. Windows #42994

Open
dje-dev opened this issue Oct 2, 2020 · 16 comments
Labels
area-System.Threading os-windows-wsl WSL (Windows Subsystem for Linux) OS - Linux binaries running on Windows tenet-performance Performance related issue
Milestone

Comments

@dje-dev

dje-dev commented Oct 2, 2020

Linux_threadpool_perf.txt

Description

A large C# application makes extensive use of multithreading (including ThreadPool) and runs well on Windows but degrades to 1/3 speed on Linux.

The attached standalone C# benchmark code demonstrates the apparent problem. It is based mostly on a performance benchmark written by a member of the mono team:
mono/mono#17387.

Configuration

.NET 5.0 RC1 running Windows 10 (2004)
For Linux test, running Ubuntu 20.04 via WSL2
Intel 2 sockets of 16 physical cores each

Regression?

Unknown.

Data

Windows runtime: 5 seconds (70 seconds CPU)
Linux runtime: 35 seconds (280 seconds CPU time)

Analysis

Attempts such as modifying the threadpool minimum size, or setting processor affinity to only one socket did not meaningfully change the results.
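For readers who want to repeat those two experiments, here is a minimal sketch of how they could be done from inside the process. The class and method names (`ThreadPoolTuning`, `RaiseMinWorkerThreads`, `PinToFirstCores`) are made up for illustration and are not taken from the attached benchmark; only `ThreadPool.GetMinThreads`, `ThreadPool.SetMinThreads` and `Process.ProcessorAffinity` are real .NET APIs. It assumes a 64-bit process on Windows or Linux.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, not part of the original benchmark.
public static class ThreadPoolTuning
{
    // Raises the worker-thread minimum to `workers` and leaves the
    // I/O completion-port minimum unchanged.
    // Returns false if the runtime rejected the request.
    public static bool RaiseMinWorkerThreads(int workers)
    {
        ThreadPool.GetMinThreads(out _, out int ioThreads);
        return ThreadPool.SetMinThreads(workers, ioThreads);
    }

    // Restricts the current process to the first `cores` logical processors.
    // Whether those processors all sit on one socket depends on how the OS
    // numbers them, so treat this as an approximation.
    public static void PinToFirstCores(int cores)
    {
        if (cores < 1 || cores > 63)
            throw new ArgumentOutOfRangeException(nameof(cores));

        long mask = (1L << cores) - 1;   // lowest `cores` bits set
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;
    }
}
```

Calling `ThreadPoolTuning.RaiseMinWorkerThreads(64)` and `ThreadPoolTuning.PinToFirstCores(16)` before starting the workload would roughly correspond to the two attempts described above, under those assumptions.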

@dje-dev dje-dev added the tenet-performance Performance related issue label Oct 2, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Threading untriaged New issue has not been triaged by the area owner labels Oct 2, 2020
@danmoseley
Member

Is it possible to measure with 3.1 to help check whether this is a regression in 5.0? That would make it more time critical if it is.
@kouvel

@dje-dev
Author

dje-dev commented Oct 2, 2020

Apologies, I should have already noted that. It is not a regression: similar behavior was seen on .NET 3.1 (so, unfortunately but understandably, I guess you will have to treat it as less time critical...)

@danmoseley
Member

As well as investigating, it would be nice to know whether we are missing interesting coverage in dotnet/performance, as I do not recall this showing up when @adamsitnik compared results by OS.

@adamsitnik
Member

hi @dje-dev

WSL2

WSL2 might add some non-trivial overhead. Have you tried to run the benchmark without it?

2 sockets

We had some issues in the past that were specific to hardware with multiple sockets. Have you tried to run it on a machine with a single socket?

The attached standalone C# benchmark code demonstrates

Is there any chance that you could contribute it to https://github.com/dotnet/performance repo? Benchmarks added to this repo are used to ensure that we don't introduce any regressions to .NET

@kouvel kouvel added this to the 6.0.0 milestone Oct 5, 2020
@kouvel kouvel removed the untriaged New issue has not been triaged by the area owner label Oct 5, 2020
@janvorli
Member

janvorli commented Oct 6, 2020

WSL2 might add some non-trivial overhead.

I was recently debugging a problem in a .NET Core app and I've noticed that under WSL2, that app was about 40% slower than in a VM on the same machine. The Linux distro in both the VM and WSL2 was the same. But I have no idea whether it is a general trend or if that app had some specific functionality that was interacting badly with WSL2. It was an app of another party, so I didn't know much of its internals.
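When comparing a VM against WSL2 like this, it can help to record which environment each run came from. The sketch below is one possible heuristic, not an official detection API: it assumes that WSL kernels report a release string containing "microsoft" in `/proc/sys/kernel/osrelease`. It does not distinguish WSL1 from WSL2, and the `WslProbe` class is hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper based on a kernel-release heuristic (an assumption).
public static class WslProbe
{
    // True when a kernel release string looks like a WSL kernel,
    // e.g. one containing "microsoft" in any casing.
    public static bool LooksLikeWsl(string kernelRelease) =>
        kernelRelease != null &&
        kernelRelease.Contains("microsoft", StringComparison.OrdinalIgnoreCase);

    // Reads the kernel release from procfs; returns false when the file
    // does not exist (for example on native Windows).
    public static bool IsProbablyWsl()
    {
        const string path = "/proc/sys/kernel/osrelease";
        return File.Exists(path) && LooksLikeWsl(File.ReadAllText(path));
    }
}
```

Logging `WslProbe.IsProbablyWsl()` next to each timing would make it harder to mix up WSL2 and Hyper-V results later.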

@dje-dev
Author

dje-dev commented Oct 8, 2020

Some progress with the help of the comments and suggestions:

  1. Testing has been moved to a single socket machine to eliminate that potential confounding factor.

  2. All performance tests from the following folders were run on (a) native Windows, (b) WSL2, and (c) Linux under Hyper-V:
    performance/src/benchmarks/micro/libraries/System.Threading
    performance/src/benchmarks/micro/libraries/System.Threading.ThreadPool

  3. Native Windows and Linux via Hyper-V yield about the same (good) performance.

  4. However, the existing ThreadPool performance test (QueueUserWorkItem_WaitCallback_Throughput) confirms the roughly 5x slower performance under WSL2.

  5. The threading microbenchmarks are mostly only moderately slower under WSL2. An exception is GetThreadStatic (8.3ns vs 2.6ns).

The tentative conclusion is that highly multithreaded .NET code is likely to run very slowly on WSL2. I suggest we focus on the simplest case: understanding why GetThreadStatic (a very simple operation) is so much slower.
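For anyone who wants to reproduce the throughput gap without setting up the dotnet/performance harness, a rough stand-alone sketch follows. It is not the code of QueueUserWorkItem_WaitCallback_Throughput; the `ThreadPoolThroughput` class is hypothetical and simply queues empty work items and times how long they take to drain. A single `Stopwatch` run like this is much noisier than BenchmarkDotNet, so treat the numbers as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-alone probe, not the dotnet/performance benchmark.
public static class ThreadPoolThroughput
{
    // Queues `workItems` trivial callbacks and returns the wall-clock time
    // until the last one has run.
    public static TimeSpan Measure(int workItems)
    {
        if (workItems < 1)
            throw new ArgumentOutOfRangeException(nameof(workItems));

        int remaining = workItems;
        // Deliberately not disposed: a worker may still be inside Set()
        // when Wait() returns on the calling thread.
        var done = new ManualResetEventSlim(false);

        WaitCallback callback = _ =>
        {
            // The callback that brings the counter to zero signals completion.
            if (Interlocked.Decrement(ref remaining) == 0)
                done.Set();
        };

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < workItems; i++)
            ThreadPool.QueueUserWorkItem(callback);
        done.Wait();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Running `ThreadPoolThroughput.Measure(20_000_000)` on native Windows, under WSL2 and in a Hyper-V Linux VM on the same machine would give a quick comparison of the three environments.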

How would you suggest we proceed? It gets complex because of the interaction with WSL2.

(full WSL2 and native Windows tests results below)

*** WSL2 ***
BenchmarkDotNet=v0.12.1, OS=ubuntu 20.04
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
  Job-OGDZSD : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT


| Read_double                    | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns |
| Write_double                   | 0.0032 ns | 0.0037 ns | 0.0053 ns | 0.0015 ns |

| GetThreadStatic                |  8.333 ns | 0.0290 ns | 0.0397 ns |
| SetThreadStatic                | 10.357 ns | 0.0436 ns | 0.0652 ns |

| EnterExit                      | 14.361 ns | 0.1072 ns | 0.1538 ns | 14.444 ns |
| TryEnterExit                   | 14.318 ns | 0.1324 ns | 0.1857 ns | 14.309 ns |
| TryEnter_Fail                  |  3.514 ns | 0.0113 ns | 0.0162 ns |  3.515 ns |
| EnterExit                      | 21.14 ns  | 0.045 ns | 0.066 ns | 21.14 ns |
| TryEnterExit                   | 23.13 ns  | 0.342 ns | 0.502 ns | 23.43 ns |

| Increment_int                  | 4.764 ns  | 0.0174 ns | 0.0260 ns |
| CompareExchange_object_NoMatch | 6.709 ns  | 0.0231 ns | 0.0324 ns |

| Set_Reset                      | 355.5 ns  | 3.49 ns | 5.00 ns |
| RegisterAndUnregister_Serial   |  60.72 ns | 0.878 ns | 1.231 ns |
| RegisterAndUnregister_Parallel |  20.27 ns | 0.256 ns | 0.366 ns |

|                         Cancel | 206.36 ns | 3.070 ns | 4.403 ns |
|       CreateLinkedTokenSource1 |  73.22 ns | 0.280 ns | 0.384 ns |
|       CreateLinkedTokenSource2 | 123.81 ns | 2.889 ns | 4.324 ns |
|       CreateLinkedTokenSource3 | 199.87 ns | 1.823 ns | 2.728 ns |

| QueueUserWorkItem_WaitCallback_Throughput |         20000000 | 11.89 s | 0.181 s | 0.411 s | 12.12 s |


*** Native Windows ***
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
  Job-LFGVEJ : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT

| Read_double                    | 0.0461 ns | 0.0063 ns | 0.0095 ns |
| Write_double                   | 0.2393 ns | 0.0096 ns | 0.0141 ns |

| GetThreadStatic                | 2.591 ns | 0.1606 ns | 0.2404 ns |
| SetThreadStatic                | 5.252 ns | 0.1517 ns | 0.2407 ns |

| EnterExit                      | 14.549 ns | 0.0751 ns | 0.1124 ns |
| TryEnterExit                   | 14.134 ns | 0.0815 ns | 0.1195 ns |
| TryEnter_Fail                  |  4.095 ns | 0.1863 ns | 0.2551 ns |
| EnterExit                      | 14.84 ns | 0.063 ns | 0.093 ns |
| TryEnterExit                   | 15.39 ns | 0.087 ns | 0.121 ns |

| Increment_int                  | 4.774 ns | 0.0192 ns | 0.0287 ns |
| CompareExchange_object_NoMatch | 5.088 ns | 0.0361 ns | 0.0540 ns |

| Set_Reset                      | 1.227 us | 0.0042 us | 0.0061 us |
|   RegisterAndUnregister_Serial |  46.817 ns | 0.6051 ns | 0.8283 ns |
| RegisterAndUnregister_Parallel |   8.211 ns | 0.2663 ns | 0.7107 ns |

|                         Cancel | 148.132 ns | 0.9243 ns | 1.3548 ns |
|       CreateLinkedTokenSource1 |  54.312 ns | 0.2988 ns | 0.4379 ns |
|       CreateLinkedTokenSource2 | 103.057 ns | 1.2763 ns | 1.8707 ns |
|       CreateLinkedTokenSource3 | 156.846 ns | 1.3350 ns | 1.9146 ns |

| QueueUserWorkItem_WaitCallback_Throughput |         20000000 | 2.531 s | 0.0261 s | 0.0349 s |

@danmoseley
Member

@dje-dev just curious, could you tell us more about the scenario here -- are you deploying a product to run on WSL2? I think generally I have been thinking of that as more of a developer platform, not a deployment platform: for testing and developing software to later deploy on a "regular" Linux machine or VM - so raw performance was less critical. What are you using WSL2 for?

@dje-dev
Author

dje-dev commented Oct 8, 2020

Fair enough, in some places Microsoft does refer to WSL2 as "primarily a tool for developers." But on the other hand, one could get by if it were a 5% or maybe even 50% performance regression, but at 500% it's no longer viable (at least with some applications) as a developer tool.

Further, WSL2 has been described by Microsoft as generally being very close to bare metal. This was confirmed by Phoronix, who ran a suite of 69 benchmarks (https://www.phoronix.com/scan.php?page=article&item=windows10-may2020-wsl2) including some with multithreading and reported "Ubuntu 20.04 running bare metal on the same system was faster by just 8%."

This suggests to me that there is a nontrivial possibility that either (a) there is some bad interaction between .NET runtime and WSL2, and/or (b) this performance problem is not intrinsically solvable (albeit possibly involving WSL2 adjustments).

@danmoseley
Member

@dje-dev I'm still curious about your production scenario: do you have one, or did you just happen to notice it? It would be interesting if there was data suggesting real customers deploy perf sensitive workloads to WSL2.

But yes, if it's essentially a regular VM then perhaps there's a perf issue to report to them here.

@dje-dev
Author

dje-dev commented Oct 8, 2020

Sure, my scenario is development, I was hoping to leverage awesome tools Microsoft is making available for this (https://devblogs.microsoft.com/dotnet/debug-your-net-core-apps-in-wsl-2-with-visual-studio/).

Of course Docker for Windows is now based on WSL2 so this will be a common scenario.

Just not sure where we take this issue from here... any thoughts appreciated.

@danmoseley
Member

It would be nice if we could localize this to some API we call, so that we could open an issue against WSL2. But, it is for @kouvel to determine whether or how to proceed as he owns this area.

@danmoseley
Member

(And even for a dev scenario, 5x slower may be unacceptable, as you say, depending on the scenario.)

@janvorli
Member

janvorli commented Oct 8, 2020

The comment above says:

GetThreadStatic (a very simple operation) is so much slower.

I wonder if WSL2 has some perf issue w.r.t. mechanisms used for thread local access. Linux accesses it via fs segment.

@dje-dev
Author

dje-dev commented Oct 16, 2020

Great, so we have isolated at least a part of this apparent problem with WSL2 + .NET to just 2 lines of code (see below) appearing as part of the .NET test suite at: https://github.com/dotnet/performance/blob/74fca49ecd1f0eae51b0172bd121ee7d0fdd2b6d/src/benchmarks/micro/corefx/System.Threading/Perf.ThreadStatic.cs

Further, janvorli has conjectured about a potential reason for the performance issue and I have verified the issue exists on two different systems.

These two products (WSL2 and .NET) are promoted as working well together, and if we can make sure there are no serious performance problems (such as this 4x to 5x regression in some scenarios) it will be surely helpful to me and others.

Is there some way we could move this forward?

Thank you.

    [ThreadStatic]
    private static object t_threadStaticValue = null;

    [Benchmark]
    public object GetThreadStatic() => t_threadStaticValue;
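To check the same read outside BenchmarkDotNet (for example when filing a report against WSL2), something like the following rough probe could be used. It is only a sketch: the `ThreadStaticProbe` class is hypothetical, the warm-up count is arbitrary, and a plain `Stopwatch` loop includes loop overhead and may be optimized by the JIT in ways the real benchmark avoids, so absolute numbers will not match the tables above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-alone probe, not the dotnet/performance benchmark.
public static class ThreadStaticProbe
{
    [ThreadStatic]
    private static object t_value;

    // Returns the average time per loop iteration, in nanoseconds,
    // for reading a [ThreadStatic] field on the calling thread.
    public static double NanosecondsPerRead(int iterations)
    {
        if (iterations < 1)
            throw new ArgumentOutOfRangeException(nameof(iterations));

        t_value = new object();
        object sink = null;

        // Warm-up so that the timed loop is less affected by JIT startup.
        for (int i = 0; i < 10_000; i++)
            sink = t_value;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            sink = t_value;
        sw.Stop();

        GC.KeepAlive(sink);   // keep the last read observable
        return sw.Elapsed.TotalMilliseconds * 1_000_000.0 / iterations;
    }
}
```

Comparing `ThreadStaticProbe.NanosecondsPerRead(100_000_000)` between WSL2 and a Hyper-V Linux VM on the same hardware would show whether the gap is visible without the benchmark harness.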

@danmoseley danmoseley added the os-windows-wsl WSL (Windows Subsystem for Linux) OS - Linux binaries running on Windows label Oct 16, 2020
@ShadyNagy

Does that problem still exist on .NET 6?

@kouvel kouvel modified the milestones: 6.0.0, Future Aug 10, 2021
@kouvel kouvel changed the title Threadpool performance 5x slower under Linux vs. Windows Threadpool performance 5x slower under Linux under WSL2 vs. Windows Aug 2, 2022

7 participants