Use a more dependable policy for thread pool thread injection #1754

Open
GSPP opened this Issue Oct 13, 2015 · 11 comments

GSPP commented Oct 13, 2015

Reopening this ticket here per the suggestion of @ericeil.

As of .NET 4.5 the thread pool injects one thread per 0.5 s if it thinks more threads are required. This is a problem if the number of required threads suddenly increases. Example: a corporate ASP.NET website is idle at night. In the morning at 8 AM, 1000 people log on and start working. If the app needs, say, 100 threads, it will take about 50 s from 8 AM to create all of them. Until then there will be serious delays and timeouts. It is possible to construct arbitrarily bad scenarios.

Problem statement: if thread pool load suddenly increases in IO-bound workloads, the pool is too slow to respond. This causes throughput and availability disruption. IO-bound workloads relying on synchronous IO are common. Sudden workload changes are also common. Sometimes the workload changes due to a problem outside of the developer's control: a web service timing out or a database becoming slow.

Let me stress that this causes service interruption.

You can easily repro this yourself: run a load test on an ASP.NET site whose handler does Thread.Sleep(10000);. The thread count goes up by 2 each second.
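
For illustration, a handler along these lines is enough to reproduce it (a sketch; the class name is arbitrary):

using System.Threading;
using System.Web;

// Every request blocks a thread pool thread for 10 seconds. Under a load
// test, the process's thread count climbs by roughly 2 per second.
public class SleepHandler : IHttpHandler
{
    public bool IsReusable => true;

    public void ProcessRequest(HttpContext context)
    {
        Thread.Sleep(10000); // synchronous blocking "work"
        context.Response.Write("done");
    }
}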

I benchmarked starting and shutting down a thread at around 1 ms in total. Threads are not really an expensive resource, and the thread pool should be a lot more eager to create and destroy them. A 500 ms delay to potentially save 1 ms is not a good trade-off.

Easy fix: I propose lowering the injection delay to 100ms. This reduces the delay in the scenario above by 5x. Ideally, the rate would be configurable. The shutdown delay could be lowered from 30 s as well; keeping an idle thread alive for 30000 ms to save 1 ms seems excessive. In this ticket, though, I'm not that concerned with retiring threads.

Smarter, riskier fix: the delay could depend on the number of threads in existence, the core count, and the perceived pressure on the thread pool (a rough sketch follows the list). The injection rate could be:

  • 0ms delay for up to (ProcessorCount * 1) threads
  • 50ms delay for up to (ProcessorCount * 4) threads
  • Starting from that, a delay of (100ms * (ThreadCount / ProcessorCount * someFloatFactor)). Reasoning: the more the CPU is oversubscribed, the more slowly we want to inject. Maybe we need a maximum delay of 1 s, or the delay must rise sub-linearly (e.g. sqrt).
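
A rough sketch of such a curve (illustrative only; the constants and someFloatFactor are made-up tuning knobs, not a proposed implementation):

using System;

static class InjectionPolicy
{
    public static TimeSpan GetDelay(int threadCount, int processorCount)
    {
        const double someFloatFactor = 0.5;    // assumed tuning knob
        if (threadCount <= processorCount * 1) // up to 1x cores: inject immediately
            return TimeSpan.Zero;
        if (threadCount <= processorCount * 4) // up to 4x cores: small fixed delay
            return TimeSpan.FromMilliseconds(50);
        // Beyond that, back off sub-linearly (sqrt) with the oversubscription
        // ratio, capped at 1 second.
        double ratio = (double)threadCount / processorCount;
        double delayMs = 100.0 * Math.Sqrt(ratio * someFloatFactor);
        return TimeSpan.FromMilliseconds(Math.Min(delayMs, 1000.0));
    }
}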

Note that the Windows kernel has thread pool injection heuristics that apply back pressure as more threads come into existence. This seems to work. Source: https://channel9.msdn.com/Shows/Going+Deep/Inside-Windows-8-Pedro-Teixeira-Thread-pool

Here's what happens when you Parallel.ForEach over an infinite sequence and sleep for a few seconds in each work item ([graph: thread injection over time]). The memory use is proportional to the number of threads (here: 731):

System.Threading.Tasks.Parallel.ForEach(
    Enumerable.Range(0, int.MaxValue),
    i => Thread.Sleep(delay));

The problem exists at all delays:

//Time to get to about 750 threads on an 8 core box:
//var delay = Timeout.Infinite; //15min
//var delay = TimeSpan.FromSeconds(1); //30min
//var delay = TimeSpan.FromSeconds(0.1); //2h 15min
var delay = TimeSpan.FromSeconds(0.01); //980 threads after 6h

A 10 ms delay is very unrealistic to avoid in a production workload; almost nothing (e.g. a web service or database call) completes faster than 10 ms.

Please also see the discussion at the original location, which is now closed.

GSPP commented Oct 13, 2015

@stephentoub @ericeil pinging you. Please note the added information.

JeffCyr (Contributor) commented May 8, 2016

In Java there is a mechanism to inform the thread pool that one of its threads is about to block. This allows the thread pool to create a new thread immediately if needed.

Something similar might be doable in .NET; it could be done transparently in WaitHandle.WaitOne and all equivalent blocking calls. I think there is already something like this with SynchronizationContext.
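
In Java this role is played by ForkJoinPool.ManagedBlocker. A .NET equivalent might look roughly like this (hypothetical; no such API exists in ThreadPool today):

using System;
using System.Threading;

// Hypothetical wrapper: a work item calls RunBlocking around any operation it
// knows will block. The counter stands in for a signal to the pool's
// injection logic, which could start a compensating thread immediately.
static class BlockingAwarePool
{
    private static int s_blockedCount;

    public static void RunBlocking(Action blockingOperation)
    {
        Interlocked.Increment(ref s_blockedCount);     // "about to block"
        try
        {
            blockingOperation();                       // e.g. waitHandle.WaitOne()
        }
        finally
        {
            Interlocked.Decrement(ref s_blockedCount); // "unblocked"
        }
    }
}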

GSPP referenced this issue in dotnet/corefx on May 29, 2016: Sync-blocking on threadpool threads #8949 (closed)

acelent commented May 31, 2016

In ASP.NET under an IIS worker process, since the CLR is hosted, the host could treat the current HTTP request queues as pressure to create threads. Even with the 500 ms checking interval, it could create more than a single thread at a time, such as the minimum of 1) the number of queued requests and 2) the number of threads needed to reach the current MaxWorkerThreads configuration.

If ASP.NET Core under an IIS worker process runs in a specialized CLR host as well, that host could do the same.

But in essence, setting MinWorkerThreads to a high enough number has a similar effect, since it tells the CLR host how many threads should be created on demand (read: instantaneously) before the 1-thread-per-500ms adjusting algorithm takes over. Setting a high minimum number of threads doesn't actually mean creating them all upfront or keeping them all alive.

This also works when running a standalone ASP.NET Core application, i.e. without IIS.

If you have a requirement on the number of concurrent requests, you should probably configure a high enough minimum for now.
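
For example, with the existing ThreadPool.SetMinThreads API (200 is an illustrative value, not a recommendation):

using System.Threading;

// Raise the worker-thread floor so a burst of up to ~200 concurrent blocking
// requests gets threads on demand, before the 1-per-500ms algorithm applies.
int minWorker, minIo;
ThreadPool.GetMinThreads(out minWorker, out minIo);
ThreadPool.SetMinThreads(200, minIo); // keep the IO completion-port minimum as-is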

However, I agree that the thread pool algorithm could be improved, made configurable or replaceable, or exposed as an API, such that we can create our own thread pools and task schedulers that use them, replace the default thread pool at will, run certain tasks or requests in fixed thread pools, etc.

kouvel (Member) commented Oct 25, 2016

As @ericeil mentioned in dotnet/corefx#2329, it would be good to investigate switching to the Windows thread pool. But we need a good solution outside Windows as well, and SynchronizationContext.Wait only handles synchronization waits. For that, as @JeffCyr pointed out, I think an API that indicates when blocking begins and ends, along with some heuristics, could be used to mostly alleviate this problem. It may also require replacing some of the existing heuristics, or tuning them down, to handle the other edge cases they are intended to solve.

GSPP commented Nov 15, 2016

IO can also be blocking, and there is nothing wrong with doing synchronous IO.

I'm not sure how blocking would play into the heuristics even if the exact number of blocked threads were known. Just releasing another thread for each one blocked could easily cause an explosion of threads.

Knowing the number of blocked threads might help the heuristics, but this can never fully resolve the problems detailed in this issue.

kouvel (Member) commented Nov 15, 2016

Just releasing another thread for each one blocked could easily cause an explosion of threads.

Suppose we immediately spin off a new thread to compensate for a blocked thread. When blocking ends and the thread calls the API to indicate that blocking has ended, instead of allowing that thread to run, we could block it and put it in a separate queue that is prioritized when a thread pool thread becomes available. That way, it would not cause too many threads to be running after blocking operations complete. Realistically, there would need to be some sort of configurable limit on how many threads can be created like this, and perhaps also an increasing delay before compensating with a new thread, based on the number of active threads and processors, to prevent creating too many threads unnecessarily.
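
A toy model of that control flow, with a SemaphoreSlim standing in for the pool's run permits (the real mechanism would be internal to the pool):

using System;
using System.Threading;

static class CompensatingPool
{
    // Fixed number of "run permits", one per processor.
    private static readonly SemaphoreSlim s_runPermits =
        new SemaphoreSlim(Environment.ProcessorCount);

    public static void DoBlockingWork(Action blockingOperation)
    {
        // Hand the permit back before blocking, so a compensating thread
        // could be started with it.
        s_runPermits.Release();
        try
        {
            blockingOperation();
        }
        finally
        {
            // On unblock, wait for a permit instead of running immediately;
            // this is the "separate prioritized queue" in spirit.
            s_runPermits.Wait();
        }
    }
}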

this can never fully resolve the problems detailed in this issue.

The 0.5 sec heuristic you mentioned (the starvation heuristic) kicks in when none of the thread pool threads has completed its work item within that interval. That could be due to blocking work or to long CPU-bound work. With a heuristic that compensates for blocked threads much more quickly (immediately, for all practical purposes), the starvation heuristic would no longer kick in due to blocking, only due to long CPU-bound work (but that's a separate issue). Why would this not resolve the problems you mentioned?

GSPP commented Nov 17, 2016

This would immediately create the maximum number of threads with the following code:

Enumerable.Range(0, 100000).Select(_ => Task.Run(() => File.ReadAllBytes("..."))).ToList();

All those tasks immediately block, each causing another thread to be spawned.

Holding a thread at the end of a blocking operation can lead to deadlocks or catastrophically low throughput if that thread was holding any locks. It can dramatically alter the perceived latency of IO operations.

But I agree that using the number of blocked threads in the thread injection heuristics could improve them. It seems that injection should happen more liberally if there are many threads blocked on IO.

I think the bias should be towards over-subscribing resources instead of under-subscribing. 10x too few threads means a 10x loss in throughput; 10x too many threads usually means far less than a 10x loss.

What I recommend at the moment is to target the "Easy fix" immediately:

I propose lowering the injection delay to 100ms.

And then possibly pursue a more sophisticated solution.

kouvel (Member) commented Nov 17, 2016

Enumerable.Range(0, 100000).Select(_ => Task.Run(() => File.ReadAllBytes("..."))).ToList();

There would need to be some limit on how many threads are added to compensate for blocked threads, such that this type of worst-case scenario caps out the number of threads at a reasonable amount. There are several pieces of information available to make reasonable heuristics for this case.

Holding a thread at the end of a blocking operation can lead to deadlocks or catastrophically low throughput if that thread was holding any locks. It can dramatically alter the perceived latency of IO operations.

Yeah, that's a good point, especially if threads returning from the blocked state are numerous and the queue gets backed up. Another issue is that when there are both CPU-bound and IO-bound work items, it could artificially slow down the IO-bound work. I agree that we should let some threads continue running after unblocking, up to some point, and then queue them up, but it would not be reasonable to let all of them run immediately. There is lots of information we can use to tune this well, but it won't be perfect since the pool doesn't have control over scheduling. For instance, we can look at the duration and time spent unblocked for a particular work item, and make decisions based on that. In any case, it should be significantly better than what we have now.

I think the bias should be towards over-subscribing resources instead of under-subscribing. 10x too few threads means a 10x loss in throughput; 10x too many threads usually means far less than a 10x loss.

What I suggested would not undersubscribe resources; it just may not be fair in allocating them, and that can be tuned.

I propose lowering the injection delay to 100ms.

The heuristic is currently unbounded in the number of threads it will create (max threads by default is 32K), so realistically the delay is the bound. By reducing the delay, more apps would be subject to this heuristic, and the more the delay is reduced, the more it can get out of control: it could create more threads than necessary, creating inefficiencies and higher memory usage, and it would do so much faster than before due to the shorter delay.

So I don't think it's as simple as reducing the delay. Adding a bound to this heuristic would be arbitrary. Some apps may benefit from having many threads, and some may benefit from having few, and the information needed to create a reasonable bound is not available. Choosing thresholds for adjusting the rate of thread injection would also be arbitrary. This heuristic is inherently flawed, tweaking it may help some apps but it may hurt others.

For instance, an app that has a work item duration of 150 ms would not currently hit the starvation heuristic, and if all of those are CPU-bound work items, it would be performing well with the right number of threads. Reducing the delay to 100 ms would cause the starvation heuristic to kick in and 10 threads would be created per second in this app as long as there is work.

GSPP commented Nov 17, 2016

Maybe tunable settings for app developers to alter the injection speed? It seems we need two settings:

  1. The maximum thread count that we want to converge to over time.
  2. The speed at which we would want to inject.

To illustrate:

  • The red line is the old behavior.
  • The other lines show different maximum values and speeds. They all converge to a maximum but at different speeds.

[graph: thread count over time for different maximum values and injection speeds]

For all production apps that I have ever worked on, I'd prefer the new lines over the red line. The green one is what I would use in all my web apps.
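
As a sketch, the two settings might surface like this (entirely hypothetical API; nothing like SetInjectionPolicy exists today):

using System;

// Hypothetical tuning surface for the two knobs above.
static class ThreadPoolTuning
{
    // maxThreads: the ceiling the pool converges to over time.
    // injectionDelay: how quickly it approaches that ceiling under pressure.
    public static void SetInjectionPolicy(int maxThreads, TimeSpan injectionDelay)
    {
        // A real implementation would feed these into the injection heuristic.
    }
}

// e.g. the "green line": converge towards 400 threads, injecting every 100 ms:
// ThreadPoolTuning.SetInjectionPolicy(400, TimeSpan.FromMilliseconds(100));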

kouvel (Member) commented Nov 17, 2016

Configuring the minimum and maximum worker thread count produces an approximation of the green line above, though not as nice:

[graph: approximation of the green curve via configured min/max worker thread counts]

I'm not sure that it is always beneficial to have a curve; the pool should get to the right number of threads quickly. The type of work could also change over time, and the pool would have to adjust the thread count based on that. When the type of work is stable, configuring the min & max can produce decent results.
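
Concretely, with the existing APIs (values illustrative):

using System.Threading;

// Pin the floor and ceiling of the worker pool; the IO completion-port
// values are read back and left unchanged.
int minWorker, minIo, maxWorker, maxIo;
ThreadPool.GetMinThreads(out minWorker, out minIo);
ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
ThreadPool.SetMinThreads(100, minIo); // created on demand up to 100
ThreadPool.SetMaxThreads(400, maxIo); // hard cap at 400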

kouvel (Member) commented Nov 17, 2016

An issue with that is that, often, a min thread count close to the max typically required by the app will be configured, and then the app may not run as efficiently.
