
Add support for Windows IO completions to the portable thread pool #64834

Merged: 13 commits merged into dotnet:main from kouvel:TpIoWin on Mar 4, 2022

Conversation

@kouvel (Member) commented Feb 4, 2022

  • Added an implementation for BindHandle
  • Polling for IO completions is done in batches on separate threads, similar to what is done on Unix (a rough sketch of this approach follows below)
  • Added a high-priority work item queue so that IO completion work items run at higher priority
  • Removed the time-sensitive work item queue and used the high-priority queue instead

Test fixes on Mono
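For illustration only, here is a minimal sketch of the batched polling approach described above: a dedicated poller thread drains IO completions with GetQueuedCompletionStatusEx and hands each callback to the worker pool at high priority rather than invoking it inline. This is not the dotnet/runtime implementation; the batch size, the `queueHighPriority` delegate, and the helper names are assumptions made for the sketch.

```csharp
// Illustrative sketch only (not the dotnet/runtime implementation).
using System;
using System.Runtime.InteropServices;

internal static class IoCompletionPollerSketch
{
    // Matches the Win32 OVERLAPPED_ENTRY layout used by GetQueuedCompletionStatusEx.
    [StructLayout(LayoutKind.Sequential)]
    private struct OVERLAPPED_ENTRY
    {
        public UIntPtr lpCompletionKey;
        public IntPtr lpOverlapped;
        public UIntPtr Internal;
        public uint dwNumberOfBytesTransferred;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GetQueuedCompletionStatusEx(
        IntPtr completionPort,
        [Out] OVERLAPPED_ENTRY[] completionPortEntries,
        uint count,
        out uint numEntriesRemoved,
        uint milliseconds,
        bool alertable);

    private const uint Infinite = 0xFFFFFFFF;
    private const int BatchSize = 64; // assumed batch size, for illustration only

    // Poller loop: block for completions, drain up to BatchSize at a time, and
    // hand each completion to the worker pool at high priority instead of
    // running the callback on the poller thread.
    internal static void PollerLoop(IntPtr completionPort, Action<Action> queueHighPriority)
    {
        var entries = new OVERLAPPED_ENTRY[BatchSize];
        while (true)
        {
            if (!GetQueuedCompletionStatusEx(
                    completionPort, entries, (uint)entries.Length,
                    out uint removed, Infinite, alertable: false))
            {
                continue; // a real implementation would inspect Marshal.GetLastWin32Error()
            }

            for (uint i = 0; i < removed; i++)
            {
                IntPtr overlapped = entries[i].lpOverlapped;
                uint bytes = entries[i].dwNumberOfBytesTransferred;
                queueHighPriority(() => InvokeCompletion(overlapped, bytes));
            }
        }
    }

    private static void InvokeCompletion(IntPtr overlapped, uint bytesTransferred)
    {
        // The real runtime would unpack the NativeOverlapped and invoke the
        // callback registered via ThreadPool.BindHandle; elided in this sketch.
    }
}
```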

@kouvel kouvel added this to the 7.0.0 milestone Feb 4, 2022
@kouvel kouvel self-assigned this Feb 4, 2022
@ghost commented Feb 4, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: kouvel
Assignees: kouvel
Labels: area-System.Threading

Milestone: 7.0.0

@kouvel (Member, Author) commented Feb 4, 2022

ASP.NET perf on citrine-win:

| Benchmark | Before | After | Diff |
|---|---:|---:|---:|
| PlaintextPlatform | 10,346,490 | 10,627,688 | 2.7% |
| JsonPlatform | 840,775 | 973,058 | 15.7% |
| FortunesPlatform | 387,290 | 428,023 | 10.5% |
| Plaintext | 5,201,727 | 5,197,202 | -0.1% |
| Json | 824,142 | 919,967 | 11.6% |
| Fortunes | 277,752 | 303,691 | 9.3% |

@kouvel kouvel requested a review from mangod9 February 4, 2022 23:08
@marek-safar (Contributor)

@lateralusX please review

@lateralusX (Member) commented Feb 7, 2022

@kouvel Should we enable the runtime tests currently disabled on Mono as part of this PR? There are a handful of tests currently labeled with #34582 as an active issue due to missing features, and I expect the change in this PR resolves that issue. If so, we would get some additional coverage on MonoVM as part of this change.

There is also an open issue tied to the same missing functionality, #63185, which should probably be validated by `public class FileSystemWatcherTests : FileSystemWatcherTest` if we enable that test on MonoVM.

I will build your branch and validate some of these things manually as part of reviewing the MonoVM side of the changes in this PR.

@lateralusX (Member) commented Feb 7, 2022

Manually verified #63185 using this PR running on MonoVM on Windows; it works.

@kouvel (Member, Author) commented Feb 7, 2022

@lateralusX, there seem to be a number of tests in various places disabled on Mono for the same reason. I have enabled the tests mentioned in #34582 for now; there are likely more to enable, and I can try to do so in a separate commit, but I'm not sure I'll find them all.

@kouvel (Member, Author) commented Feb 7, 2022

Oh, I see that there are a number of other disabled tests referring to #34582; I can start with those.

@lateralusX (Member) commented Feb 8, 2022

I ran through the rest of the enabled tests locally and they all seem to pass, great work!

If we would like to run the Windows library test suite on Mono as part of this PR (for validation), we would need to enable it on PR runs (it is currently not run on PRs) by flipping this line to get the test suite running on this PR:

# - windows_x64

There is one test failing in the drawing test suite on Windows; that failure is currently expected and will be fixed in a different PR where I fix and enable other failing tests on Windows.

@stephentoub (Member) commented Feb 8, 2022

> at least, I assume that works.

Only once #64720 is merged. But this case is using a ThrowHelper, which we generally haven't replaced.

@kouvel (Member, Author) commented Feb 11, 2022

Rebased to fix conflicts

@kouvel kouvel requested a review from janvorli February 11, 2022 21:59
kouvel added a commit to kouvel/dotnet-diagnostics that referenced this pull request Feb 13, 2022
…ns on Windows

- Depends on dotnet/runtime#64834
- Refactored the enumeration of the global queue and reused it to also enumerate the high-priority work item queue
- Omitted info that is specific to the native thread pool implementation, when the portable thread pool is being used for IO completions
@davidfowl (Member)

This makes me so happy 😄

kouvel added commits to the PR branch:
- Renamed the config var for the number of poller threads to indicate that they are IO pollers
- Removed an unused property
- Reordered IO completion callback arguments for consistency with before and with other code paths
- A `FileSystemWatcher` test failed twice and could not be reproduced locally; tried offloading background work to a separate thread
- This reverts commit 9a73d8e6fbac6d3bb9d70d6aab8b8b0de5b3b6c3.
@kouvel (Member, Author) commented Mar 4, 2022

Looks like the remaining CI failures are unrelated and are also happening elsewhere.

@kouvel kouvel merged commit dd3b61d into dotnet:main Mar 4, 2022
@kouvel kouvel deleted the TpIoWin branch March 4, 2022 18:47
kouvel added a commit to dotnet/diagnostics that referenced this pull request Mar 4, 2022
…ns on Windows (#2870)

- Depends on dotnet/runtime#64834
- Refactored the enumeration of the global queue and reused it to also enumerate the high-priority work item queue
- Omitted info that is specific to the native thread pool implementation, when the portable thread pool is being used for IO completions
akoeplinger added a commit to akoeplinger/runtime that referenced this pull request Mar 11, 2022
akoeplinger added a commit that referenced this pull request Mar 12, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators Apr 3, 2022
@mconnew (Member) commented Aug 22, 2022

I just learnt about this change/PR. On Windows, in the previous IO thread pool implementation, threads bound to an IOCP behaved differently from normal threads. The Threads and Concurrency section of the Win32 documentation explains how the completion port limits the number of runnable threads, and .NET limited it to the number of CPU cores. The effect is that work executing on an IO thread could not be pre-emptively context switched by another IO thread within the IO thread pool. Posting the work to a regular thread pool queue that has higher priority but no such runnable-concurrency limit is going to reduce performance in some scenarios. Unless that thread pool is hard limited in the number of threads (and not modifiable with ThreadPool.SetMaxThreads), there is going to be more context switching, which introduces more overhead. But if you do that and the work executing on one of the threads blocks, you don't have any extra threads that can be woken up to complete more work, so that's not an option (and you can deadlock if the blocking thread is waiting on other work that is itself queued behind it).

I suspect this change will show some improvements for ASP.NET Core, but that's because it doesn't take advantage of the IOCP capability. I would expect any library that purposefully utilizes IO threads to parallelize work to see a regression from the cost of context switching. If a library (such as ASP.NET Core) wishes to offload work to a high-priority regular thread pool, it should do so itself. I think it's a bad idea to effectively remove IO threads from the runtime and automatically offload the work to a non-IOCP thread pool.

We've been performance testing CoreWCF and comparing Linux with Windows. We have special code to utilize IO threads on Windows in an efficient manner. When perf testing on Linux with all that code disabled (just letting the regular thread pool handle the work), we see measurably worse performance on Linux than on Windows. The "known wisdom" is that Linux is faster than Windows for networking, but when correctly utilizing IOCP, Windows can have better throughput than Linux.
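For reference, here is a minimal sketch of the Win32 mechanism being referred to above, using direct P/Invoke: the concurrency value passed to CreateIoCompletionPort is what caps the number of runnable threads released from the port. This only illustrates the underlying API, not how .NET wires it up internally.

```csharp
// Sketch of the underlying Win32 API only; .NET's IO thread pool wraps this internally.
using System;
using System.Runtime.InteropServices;

internal static class IocpConcurrencySketch
{
    private static readonly IntPtr InvalidHandleValue = new IntPtr(-1);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr CreateIoCompletionPort(
        IntPtr fileHandle,
        IntPtr existingCompletionPort,
        UIntPtr completionKey,
        uint numberOfConcurrentThreads);

    // Create a completion port whose concurrency value equals the CPU count:
    // the kernel then tries to keep at most that many dequeuing threads runnable,
    // which is the behavior described above for the previous IO thread pool.
    internal static IntPtr CreatePort() =>
        CreateIoCompletionPort(
            InvalidHandleValue,   // no file handle: just create the port
            IntPtr.Zero,
            UIntPtr.Zero,
            (uint)Environment.ProcessorCount);

    // Associate an overlapped-capable handle (socket, file, pipe) with the port;
    // its IO completions are then dequeued by threads waiting on that port.
    internal static void Bind(IntPtr port, IntPtr overlappedHandle) =>
        CreateIoCompletionPort(overlappedHandle, port, UIntPtr.Zero, 0);
}
```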

@kouvel (Member, Author) commented Aug 24, 2022

@mconnew, fair point. It may be possible to limit the IOCP concurrency for worker threads too on Windows, since they also use an IOCP; perhaps an option could be exposed through config. It may be interesting to try, although there may be some downsides to using it in some cases.

The IOCP concurrency limit applies only so long as threads don't block for any reason. If, for instance, there is any lock contention that causes one thread to wait (even in a spin-lock with Sleep(1)), another thread will be released by the IOCP, and after the lock becomes available there will be one extra thread running. So any sort of frequent lock contention or briefly blocking operation could cause more threads than the concurrency limit to be active, until the extra threads go back to waiting for more IO completions.

Extra context switching can also occur between IO threads and worker threads when both are being used, which is not uncommon. Reusing one pool for both types of work can reduce the amount of context switching in those typical cases.

There is another difference between Windows and Linux in how IO completions are processed. Before this change on Windows, IO completions were processed in parallel with worker threads. After this change on Windows, IO completions are queued to the worker thread pool at higher priority. On Linux, IO completions are queued as normal work items to the worker thread pool, and a long work queue could artificially delay IO completions from being processed. There are also some other differences in how wait completions are processed.
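As a conceptual illustration of what "queued at higher priority" means here (this is not the runtime's actual dispatch loop, and all names in the sketch are invented), a worker can simply prefer a high-priority queue over the global queue:

```csharp
// Conceptual sketch only; all names are invented and this is not the runtime's dispatch loop.
using System;
using System.Collections.Concurrent;

internal sealed class TwoLevelWorkQueueSketch
{
    private readonly ConcurrentQueue<Action> _highPriority = new();
    private readonly ConcurrentQueue<Action> _global = new();

    // IO completions would be enqueued here so they jump ahead of ordinary work items.
    public void EnqueueHighPriority(Action workItem) => _highPriority.Enqueue(workItem);

    // Ordinary work items go to the global queue.
    public void Enqueue(Action workItem) => _global.Enqueue(workItem);

    // A worker thread prefers the high-priority queue, then falls back to the
    // global queue, so a long backlog of ordinary work cannot delay IO completions.
    public bool TryDispatchOne()
    {
        if (_highPriority.TryDequeue(out var work) || _global.TryDequeue(out work))
        {
            work();
            return true;
        }
        return false;
    }
}
```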

I'd like to understand what kind of effect this change would have on CoreWCF's performance. I'm also curious which differences are causing it to perform slower on Linux. Can you point me to the performance tests and instructions for running them?

@davidfowl (Member)

> After this change on Windows, IO completions are queued to the worker thread pool at higher priority. On Linux, IO completions are queued as normal work items to the worker thread pool, and a long work queue could artificially delay IO completions from being processed.

@kouvel It feels like we should make this change on Linux as well. Another benefit of moving the polling logic into the runtime.

@kouvel (Member, Author) commented Aug 24, 2022

> @kouvel It feels like we should make this change on Linux as well. Another benefit of moving the polling logic into the runtime.

I think it would be good to do, at least for quasi-correctness if not for perf. Doing so may need some APIs such as those in #61273 or similar (or at least an internal way to get to those APIs).

@davidfowl (Member)

Should we re-open that issue for .NET 8?

@kouvel (Member, Author) commented Aug 26, 2022

> Should we re-open that issue for .NET 8?

Maybe the first option in #61273 (comment) could be used for starters.

@davidfowl (Member) commented Aug 26, 2022

Option 1b would be to expose ThreadPool.UnsafeQueueHighPriorityWorkItemInternal to the sockets assembly and use that instead. That wouldn't require moving the polling loop, but it doesn't get us any closer to making those new public APIs.

@mconnew (Member) commented Aug 26, 2022

@kouvel, I could be wrong on this, but my understanding was that when a thread goes into an alertable state (e.g. the Sleep(1) you mentioned), it yields the CPU and the kernel will activate a dormant thread. But that thread which went alertable won't get CPU time again until another thread goes alertable and yields up the CPU? Is this not the case? Does it only limit releasing a thread waiting for a completion from the IOCP when more threads than the configured concurrency limit are executing?

The model I would like to see is an instantiable ThreadPool class. This class would have the option on Windows to create an IOCP for the instance, as well as options to specify OS thread priority for the threads created by the pool. It would have a helper method to bind to a Socket, which on Windows would either bind the socket to the ThreadPool instance's IOCP or, if an IOCP hasn't been configured, bind to the ambient IO thread pool but have the completion dispatched to the ThreadPool instance (similar to the new IOCP model). On Linux it would set up a singleton polling loop in the ThreadPool instance and dispatch ready notifications to the ThreadPool's work queue.

The ThreadPool would have public helper properties that return a SynchronizationContext and a TaskScheduler to make it easy to run code in the ThreadPool using methods like TaskFactory.StartNew. It would also have its own QueueUserWorkItem method to easily dispatch delegates to it.

I think this model would make it possible to keep the original behavior that WCF/CoreWCF uses to do pseudo-cooperative multitasking, make it really easy to have all IO operations initiated and completed in the pool, and enable the current new IO thread pool model, which improves performance for some scenarios, without requiring compromises.

I think we're getting to the point where having two implicit built-in thread pools with hard-coded behavior is becoming a limiting factor.
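For concreteness, here is a speculative sketch of the API shape being proposed in the comment above; none of these types or members exist in .NET, and every name and option below is hypothetical.

```csharp
// Speculative API sketch only: none of these types exist in .NET, and every
// name, member, and option below is hypothetical.
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

public sealed class CustomThreadPoolOptions
{
    public int MaxThreads { get; set; } = Environment.ProcessorCount;
    public ThreadPriority ThreadPriority { get; set; } = ThreadPriority.Normal;

    // Windows-only: create a dedicated IOCP for this pool instance so its
    // runnable-thread count is capped by the port's concurrency value.
    public bool UseDedicatedCompletionPort { get; set; }
}

public abstract class CustomThreadPool : IDisposable
{
    // Hypothetical factory; a real proposal would spell out the creation semantics.
    public static CustomThreadPool Create(CustomThreadPoolOptions options) =>
        throw new NotImplementedException();

    // Queue a delegate to this pool's work queue.
    public abstract bool QueueUserWorkItem(Action<object?> callback, object? state);

    // Bind an overlapped-capable handle (e.g. a socket) so its IO completions are
    // dispatched on this pool: via the instance's IOCP on Windows, or a polling
    // loop feeding the instance's work queue on Linux.
    public abstract bool BindHandle(SafeHandle handle);

    // Helpers for routing Task and async work onto this pool,
    // e.g. with TaskFactory.StartNew or by installing the SynchronizationContext.
    public abstract TaskScheduler TaskScheduler { get; }
    public abstract SynchronizationContext SynchronizationContext { get; }

    public abstract void Dispose();
}
```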

@davidfowl (Member)

@mconnew You should file another issue and write up an API proposal. It sounds like you basically want a way to spin up your own poller for network IO (I wouldn't call this a threadpool).

@kouvel (Member, Author) commented Aug 27, 2022

> I could be wrong on this, but my understanding was that when a thread goes into an alertable state (e.g. the Sleep(1) you mentioned), it yields the CPU and the kernel will activate a dormant thread. But that thread which went alertable won't get CPU time again until another thread goes alertable and yields up the CPU? Is this not the case?

It looks like after the sleep expires, that thread is scheduled normally, even if that would exceed the configured concurrency limit. I didn't see a difference in behavior between alertable and non-alertable waits; both seem to allow more threads than the concurrency limit to run simultaneously. Waiting on GetQueuedCompletionStatus does appear to block the extra threads based on the concurrency limit.

> The model I would like to see is an instantiable ThreadPool class.

That may be reasonable if the underlying thread resources are shared to a large degree. I would be concerned about having too many thread pools that amount to too many threads underneath. There may be other advantages to having instantiable thread pools, such as configuring how an instance's work items run.

> I think this model would make it possible to keep the original behavior that WCF/CoreWCF uses to do pseudo-cooperative multitasking, make it really easy to have all IO operations initiated and completed in the pool, and enable the current new IO thread pool model, which improves performance for some scenarios, without requiring compromises.

I'd like to understand this more. Is there a performance drop with this change, or between .NET 6 and .NET 7? If so, I'd like to understand better what is causing it and see some perf profiles. If it really is a separate pool of IO completion threads that is helping in the scenario, there may be options to consider, and it would help to understand the reason for the difference to determine which options may be fruitful.
