I'd like to suggest changing the way threadpool workers dequeue items, so that the global queue gets the same level of priority as work-stealing.
I've explained most of the rationale in this article: http://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/
In a nutshell, threadpool starvation occurs pretty easily when mixing synchronous and asynchronous code. While this would happen with any scheduling system, it is made worse by the way the threadpool prioritizes its queues.
As a reminder, when a threadpool thread is free, it'll process items in order from:
- its local queue
- the global queue
- the local queues of other threads (in random order)
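To make that order concrete, here is a toy model of it. This is only a sketch and not the runtime's actual code: `CurrentOrder`, `TryDequeue`, and the use of plain `Queue<int>` for every queue are assumptions made for illustration, and the model ignores which end of a local queue the owner and the thieves take from.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the dequeue order listed above (hypothetical names,
// not the real ThreadPool internals). Work items are plain int ids.
static class CurrentOrder
{
    // Tries to find the next item for worker `self`.
    // Returns false when every queue is empty.
    public static bool TryDequeue(
        int self, List<Queue<int>> locals, Queue<int> global,
        Random rng, out int item)
    {
        // 1. The worker's own local queue.
        if (locals[self].Count > 0)
        {
            item = locals[self].Dequeue();
            return true;
        }

        // 2. The global queue, before any attempt to steal.
        if (global.Count > 0)
        {
            item = global.Dequeue();
            return true;
        }

        // 3. The other workers' local queues, starting at a random victim.
        int start = rng.Next(locals.Count);
        for (int i = 0; i < locals.Count; i++)
        {
            int victim = (start + i) % locals.Count;
            if (victim != self && locals[victim].Count > 0)
            {
                item = locals[victim].Dequeue();
                return true;
            }
        }

        item = default;
        return false;
    }
}
```

In this model, a freshly spawned worker (empty local queue) always drains the global queue before it looks at anyone else's local queue, which is the behavior discussed in the rest of this issue.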
The issue appears easily in applications that receive workload from a non-threadpool thread (for instance, because the transport layer is native or uses dedicated threads). In this case, the incoming workload will always be queued in the global queue. Then, when new threadpool workers are spawned, they will always dequeue from the global queue first (since their local queue is empty), adding more pressure on the system instead of stealing work from the other queues.
This is very apparent in the following snippet:
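using System;
using System.Threading;
using System.Threading.Tasks;

namespace Starvation
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(Environment.ProcessorCount);

            // Start with 8 worker threads and 8 I/O completion threads.
            ThreadPool.SetMinThreads(8, 8);

            Task.Factory.StartNew(
                Producer,
                TaskCreationOptions.None);

            Console.ReadLine();
        }

        // Produces one work item every 200 ms (5 items per second).
        static void Producer()
        {
            while (true)
            {
                // Fire and forget: the returned task is deliberately not awaited.
                Process();
                Thread.Sleep(200);
            }
        }

        static async Task Process()
        {
            // Reschedule the rest of the method onto the threadpool.
            await Task.Yield();

            var tcs = new TaskCompletionSource<bool>();

            // Simulated asynchronous work, queued from a threadpool thread.
            Task.Run(() =>
            {
                Thread.Sleep(1000);
                tcs.SetResult(true);
            });

            // Sync-over-async: blocks this worker until the task above completes.
            tcs.Task.Wait();

            Console.WriteLine("Ended - " + DateTime.Now.ToLongTimeString());
        }
    }
}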
This causes threadpool starvation on any system that has 8 CPUs or fewer, and the system never recovers, even though the workload is stable (5 items per second).
I think the local queue is a pretty good mechanism: once a worker has started executing tasks related to a particular async workflow, we want to prioritize that workflow and finish it as soon as possible. However, the priority given to the global queue over work-stealing is questionable. I suspect it was only done to limit the impact on legacy pre-4.0 applications.
I believe many threadpool starvation scenarios could be mitigated by lowering the priority of the global queue to the same level as work-stealing. When a threadpool thread is free, it would process items in order from:
- its local queue
- other queues in random order, including the global queue
Since it's likely that the global queue receives new items at a faster pace than the average local queue, some weighting would probably be needed, which clearly makes the change non-trivial. Still, I believe this is something worth discussing.
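As a starting point for that discussion, here is a rough sketch of what the proposed order could look like in a toy model. Everything in it is an assumption made for illustration: the `ProposedOrder` and `TryDequeue` names, the use of plain `Queue<int>` for every queue, and in particular the `globalWeight` parameter, which is just one naive way to express the weighting (the global queue is entered several times in the list of candidates). It is not a proposal for the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the proposed order (hypothetical names and weighting scheme).
static class ProposedOrder
{
    // Tries to find the next item for worker `self`.
    // `globalWeight` is how many "tickets" the global queue gets in the
    // random draw; every other non-empty local queue gets exactly one.
    public static bool TryDequeue(
        int self, List<Queue<int>> locals, Queue<int> global,
        int globalWeight, Random rng, out int item)
    {
        // 1. The worker's own local queue still comes first.
        if (locals[self].Count > 0)
        {
            item = locals[self].Dequeue();
            return true;
        }

        // 2. Collect the candidates: other non-empty local queues once each,
        //    and the global queue `globalWeight` times if it is non-empty.
        var candidates = new List<Queue<int>>();
        for (int i = 0; i < locals.Count; i++)
        {
            if (i != self && locals[i].Count > 0)
            {
                candidates.Add(locals[i]);
            }
        }
        if (global.Count > 0)
        {
            for (int w = 0; w < globalWeight; w++)
            {
                candidates.Add(global);
            }
        }

        if (candidates.Count == 0)
        {
            item = default;
            return false;
        }

        // 3. Draw one candidate uniformly at random and dequeue from it.
        item = candidates[rng.Next(candidates.Count)].Dequeue();
        return true;
    }
}
```

With `globalWeight` set to 1, the global queue is treated exactly like any other queue a worker could steal from; larger values bias the draw toward the global queue without ever making other workers' local queues unreachable.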
Note that, obviously, an application shouldn't mix sync and async code. If the workflow is purely asynchronous then the starvation issues do not occur. However, in many codebases this is not realistic, especially when in the middle of migrating from synchronous to asynchronous.