Use a more dependable policy for thread pool thread injection #2329

Closed
GSPP opened this Issue Jul 13, 2015 · 10 comments

@GSPP

GSPP commented Jul 13, 2015

As of .NET 4.5 the thread pool injects one thread per 0.5 s if it thinks more threads are required. This is a problem if the number of required threads suddenly increases. Example: a corporate ASP.NET website is idle at night. At 8 AM, 1000 people log on and start working. If the app needs, say, 100 threads from 8 AM on, it will take about 50 seconds to create all of them. Until then there will be serious delays and timeouts. It is possible to construct arbitrarily bad scenarios.

Problem statement: If thread pool load suddenly increases in IO bound workloads, the pool is too slow to respond. This causes throughput and availability disruption. IO bound workloads relying on synchronous IO are common. Sudden workload changes are also common. Sometimes the workload can change due to a problem outside of the developer's control: a web service timing out or a database becoming slow.

Let me stress that this causes service interruption.

You can easily repro this yourself: run a load test on an ASP.NET site whose handler does Thread.Sleep(10000);. The thread count goes up by only 2 per second.
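The same ramp-up can be observed outside ASP.NET with a minimal console sketch (assuming default ThreadPool settings; the exact counts vary by machine and runtime):

```csharp
using System;
using System.Threading;

class PoolRampRepro
{
    static void Main()
    {
        // Queue far more blocking work items than there are cores.
        // Each one simulates slow synchronous IO with Thread.Sleep.
        for (int i = 0; i < 200; i++)
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(10000));

        // Watch the pool grow: with the ~500 ms injection delay,
        // the busy worker count should rise by roughly 2 per second.
        for (int i = 1; i <= 10; i++)
        {
            Thread.Sleep(1000);
            ThreadPool.GetAvailableThreads(out int avail, out _);
            ThreadPool.GetMaxThreads(out int max, out _);
            Console.WriteLine($"t={i}s  busy workers: {max - avail}");
        }
    }
}
```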

In my benchmarks, starting and shutting down a thread takes around 1 ms in total. Threads are not really an expensive resource. The thread pool should be much more eager to create and destroy threads. A 500 ms delay to potentially save 1 ms is not a good trade-off.

Easy fix: I propose lowering the injection delay to 100 ms. This reduces the problem given above by 5x. Ideally, the rate would be configurable. The shutdown delay could be lowered from 30 s as well; keeping an idle thread alive for 30,000 ms to save 1 ms seems excessive. In this ticket, though, I'm not that concerned with retiring threads.

Smarter, riskier fix: The delay could depend on the number of threads in existence, the core count, and the perceived pressure on the thread pool. The injection delay could be:

  • 0ms delay for up to (ProcessorCount * 1) threads
  • 50ms delay for up to (ProcessorCount * 4) threads
  • Beyond that, a delay of (100ms * (ThreadCount / ProcessorCount) * someFloatFactor). Reasoning: the more the CPU is oversubscribed, the more slowly we want to inject. We may need a maximum delay of 1 s. Or, the delay could rise sub-linearly (e.g. sqrt).
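As a sketch, the proposed schedule could look like the following (all names are hypothetical, including `someFloatFactor`; this is an illustration of the proposal, not the actual CLR algorithm):

```csharp
using System;

static class InjectionDelaySketch
{
    // Hypothetical sketch of the proposed injection-delay schedule.
    public static TimeSpan ProposedDelay(int threadCount, double someFloatFactor = 1.0)
    {
        int cpus = Environment.ProcessorCount;

        if (threadCount <= cpus)          // up to ProcessorCount * 1: inject immediately
            return TimeSpan.Zero;
        if (threadCount <= cpus * 4)      // up to ProcessorCount * 4: small delay
            return TimeSpan.FromMilliseconds(50);

        // Beyond that, back off in proportion to oversubscription, capped at 1 s.
        double ms = 100.0 * (threadCount / (double)cpus) * someFloatFactor;
        return TimeSpan.FromMilliseconds(Math.Min(ms, 1000.0));
    }
}
```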

Note that the Windows kernel's thread pool has injection heuristics that apply back pressure as more threads come into existence. This seems to work. Source: https://channel9.msdn.com/Shows/Going+Deep/Inside-Windows-8-Pedro-Teixeira-Thread-pool

@JonHanna

Collaborator

JonHanna commented Jul 13, 2015

One immediate thought is that the expense of threads differs according to how the assembly is run. On ASP.NET the default stack size is 256K for 32-bit, 512K for 64-bit, and 1M prior to Server 2003. Executables default to 1M. Any assessment of the cost of such a change should take all of those scenarios into account, and others besides.

@GSPP

GSPP commented Jul 13, 2015

Good point. One could argue that the default stack size should be lowered everywhere. Especially on memory-constrained phones, 1 MB of stack seems very excessive. .NET code normally does not use a lot of stack (in contrast to C code). 1000 frames (a lot!) times 100 bytes would still only be 100 KB. Even 256 KB still seems conservative.

@JonHanna

Collaborator

JonHanna commented Jul 13, 2015

I imagine the default is lower on phones too, but I'm not very knowledgeable in that space.

.NET code normally doesn't use a lot of stack space, but .NET code that does is out there. Indeed, the fact that stacks tend not to go that deep means that people might be stackallocing large chunks without any worries, because why not? You've got 1 MB there, and .NET code normally doesn't use a lot of space...

@stephentoub

Member

stephentoub commented Jul 13, 2015

cc: @ericeil

@ericeil

Contributor

ericeil commented Oct 9, 2015

Throttling thread creation is not only about the cost of creating a thread; it's mainly about the cost of having a large number of running threads on an ongoing basis. For example:

  • More threads means more context-switching, which adds CPU overhead. With a large number of threads, this can have a significant impact.
  • More threads means more active stacks, which impacts data locality. The more stacks a CPU is having to juggle in its various caches, the less effective those caches are.

The advantage of more threads than logical processors is, of course, that we can keep the CPU busy if some of the threads are blocked, and so get more work done. But we need to be careful not to "overreact" to blocking, and end up hurting performance by having too many threads.

The current algorithm is considerably more complex than @GSPP has suggested; for a taste of what's involved, see this MSDN article. What you are noticing is really just the "back-stop" we've implemented for the case where work items take so long to complete that our algorithms can't get a good read on the current "throughput" (work completed per second), and so we give up and just add another thread. It also provides an "escape valve" for code that is temporarily deadlocked due to an implicit reliance on a particular ThreadPool size, or for cases where code in another AppDomain is monopolizing the existing ThreadPool threads.

All that said, I do think there's lots of room for improvement here. I have been wanting to try moving over to the Windows thread pool (when running on Windows 8 or later; prior versions did not run managed code well on the system thread pool). As @GSPP points out, having a kernel component allows the Windows thread pool to do "smart" things that we can't do (or at least, that are hard to do) from user mode.

@ericeil

Contributor

ericeil commented Oct 9, 2015

Closing this, as it's not a CoreFx issue per se. The ThreadPool implementation is part of CoreCLR.

@ericeil ericeil closed this Oct 9, 2015

@GSPP

GSPP commented Oct 10, 2015

@ericeil what's the best way to proceed without dropping this ticket? Should I reopen at CoreCLR?

it's mainly about the cost of having a large number of running threads on an ongoing basis.

You are speaking of CPU bound threads. Most production apps are IO bound in some form. They do not approach anything like 100% CPU load. Context switches are less of a concern here.

Most apps are super boring ASP.NET apps that move data around.

This is the core issue and it should be addressed:

IO bound workloads relying on synchronous IO are common. Sudden workload changes are also common. Sometimes the workload can change due to a problem outside of the developer's control: a web service timing out or a database becoming slow.

Here, context switches are not a concern. Neither is running in the hundreds of threads temporarily.

Frankly, your response has not adequately addressed the availability problems I outlined. An app that goes down is a problem. Throughput is less of a concern than unpredictable unavailability. Shrugging this issue off as the app developer's responsibility falls short in my opinion, because it is largely addressable at the framework level.

@JonHanna

Collaborator

JonHanna commented Oct 10, 2015

The particular case you mention, @GSPP, could perhaps be served well by setting a minimum thread count above the norm. If you did that, the cool-down period would have wasted idle threads, but the morning warm-up would be much faster.
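For completeness, a minimal sketch of that mitigation (the 100 is an illustrative guess at peak concurrency, not a recommendation):

```csharp
using System;
using System.Threading;

class WarmPool
{
    static void Main()
    {
        // Read the current floors, then raise only the worker-thread floor.
        ThreadPool.GetMinThreads(out int worker, out int iocp);

        // Below the minimum, the pool creates threads on demand without the
        // ~500 ms injection delay, so a sudden burst ramps up much faster.
        ThreadPool.SetMinThreads(Math.Max(worker, 100), iocp);

        ThreadPool.GetMinThreads(out worker, out iocp);
        Console.WriteLine($"min worker threads now: {worker}");
    }
}
```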

Most production apps are IO bound in some form.

Of course the general move is toward avoiding threads having to block when a task is IO bound. The more that is done the better, though that is of little comfort to someone who isn't able to take advantage of it in their own case.

@GSPP

GSPP commented Oct 10, 2015

@JonHanna true.

Regarding async IO: it has a productivity and throughput cost. When threads can solve the problem, that is often the most appropriate solution. It is also possible to mix sync and async IO. The longer-running and more frequent the operation, the more attractive async IO becomes (e.g. web service calls). For typical database queries in the sub-100 ms range, async IO is usually a waste of dev time and threads should be used. The .NET Framework should support this use case well.
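A sketch of that mixed style (all names and the endpoint are hypothetical): the long-running remote call uses async IO so it doesn't tie up a pool thread, while a short local query stays synchronous for simplicity.

```csharp
using System;
using System.Data;
using System.Net.Http;
using System.Threading.Tasks;

class MixedIoSketch
{
    static readonly HttpClient Http = new HttpClient();

    // Long-running, frequent remote call: async IO frees the thread while waiting.
    static Task<string> CallSlowServiceAsync(Uri uri) => Http.GetStringAsync(uri);

    // Hypothetical short query (sub-100 ms): synchronous IO keeps the code simple
    // and blocks a thread only briefly.
    static int CountUsers(IDbConnection db)
    {
        using var cmd = db.CreateCommand();
        cmd.CommandText = "SELECT COUNT(*) FROM Users";
        return Convert.ToInt32(cmd.ExecuteScalar());
    }
}
```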

The min threads idea works and is what I have done in the past. Problems:

  • You need to set a very high (conservative) number, which can lead to unnecessary spikes. A little bit of throttling would be much better. Example: the app is hit by a volley of 1000 requests at the same time. You probably don't want 1000 threads to spin up instantaneously.
  • If you misestimate and configure the number too low, availability is still partially lost: some requests will time out, and many will see high latency.

@ericeil

Contributor

ericeil commented Oct 12, 2015

@GSPP, yes, please reopen this at CoreCLR.

@karelz karelz added enhancement and removed suggestion labels Sep 18, 2016

@karelz karelz modified the milestones: 1.2.0, 1.0.0-rtm Dec 3, 2016
