
Lower the priority of the threadpool global queue #28295

Closed
kevingosse opened this issue Dec 31, 2018 · 6 comments
Comments

@kevingosse
Contributor

I'd like to suggest changing the way the threadpool workers dequeue items, treating the global queue with the same level of priority as work-stealing.

I've explained most of the rationale in this article: http://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/

In a nutshell, threadpool starvation occurs pretty easily when mixing synchronous and asynchronous code. While this can happen with any scheduling system, it is made worse by the way the threadpool prioritizes its queues.

As a reminder, when a threadpool thread is free, it'll process items in order from:

  • its local queue
  • the global queue
  • the local queues of other threads (in random order)
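To make that order concrete, here is a minimal Python model of the dispatch logic (a sketch under simplifying assumptions; the queue shapes and names are illustrative, not the actual ThreadPoolWorkQueue implementation):

```python
import random
from collections import deque

def dequeue(local, global_q, other_locals, rng=random):
    """Model of the current dispatch order: own local queue first,
    then the global FIFO queue, then steal from other threads' local
    queues in random order. Illustrative sketch only."""
    if local:
        return local.pop()           # LIFO pop from the thread's own local queue
    if global_q:
        return global_q.popleft()    # FIFO dequeue from the global queue
    # Work-stealing: scan the other local queues in random order.
    for q in rng.sample(other_locals, len(other_locals)):
        if q:
            return q.popleft()       # steal from the "cold" end of another queue
    return None
```

With an empty local queue, the global queue always wins over stealing, which is exactly the behavior this issue questions.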

The issue appears easily in applications that receive workload from a non-threadpool thread (for instance, because the transport layer is native or uses dedicated threads). In this case, the incoming workload will always be queued in the global queue. Then, when new threadpool workers are spawned, they will always dequeue from the global queue first (since their local queue is empty), adding more pressure on the system instead of stealing work from the other queues.

This is very apparent in the following snippet:

using System;
using System.Threading;
using System.Threading.Tasks;

namespace Starvation
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(Environment.ProcessorCount);

            ThreadPool.SetMinThreads(8, 8);

            Task.Factory.StartNew(
                Producer,
                TaskCreationOptions.LongRunning); // dedicated thread, so Process is queued from outside the threadpool
            Console.ReadLine();
        }

        static void Producer()
        {
            while (true)
            {
                Process(); // fire-and-forget: the returned task is deliberately not awaited

                Thread.Sleep(200);
            }
        }

        static async Task Process()
        {
            await Task.Yield();

            var tcs = new TaskCompletionSource<bool>();

            Task.Run(() =>
            {
                Thread.Sleep(1000);
                tcs.SetResult(true);
            });

            tcs.Task.Wait();

            Console.WriteLine("Ended - " + DateTime.Now.ToLongTimeString());
        }
    }
}

This causes threadpool starvation on any system with 8 CPUs or fewer, and the system never recovers, even though the workload is stable (5 items per second).

I think the local queue is a pretty good mechanism: once a worker has started executing tasks related to a particular async workflow, we want to process that workflow in priority and finish it as soon as possible. However, the priority given to the global queue over work-stealing is questionable. I suspect it was only done to limit the impact on legacy pre-4.0 applications.

I believe many threadpool starvation scenarios could be mitigated by lowering the priority of the global queue to the same level as work-stealing. When a threadpool thread is free, it would process items in order from:

  • its local queue
  • other queues in random order, including the global queue
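As a sketch of what that could look like (the global_weight parameter is hypothetical, there only to illustrate the weighting idea; none of these names exist in the actual threadpool):

```python
import random
from collections import deque

def dequeue_proposed(local, global_q, other_locals, global_weight=1, rng=random):
    """Sketch of the proposed order: the local queue keeps top priority,
    but the global queue becomes just one more candidate in the random
    scan. Entering it more than once (global_weight > 1) makes it more
    likely to be visited first, approximating a simple weighting."""
    if local:
        return local.pop()
    # The global queue is just another steal candidate now.
    candidates = list(other_locals) + [global_q] * global_weight
    rng.shuffle(candidates)
    for q in candidates:
        if q:
            return q.popleft()
    return None
```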

Since it's likely that the global queue receives new items at a faster pace than the average local queue, some weighting would probably be needed, which clearly makes the change non-trivial. Still, I believe this is something worth discussing.

Note that, obviously, an application shouldn't mix sync and async code; if the workflow is purely asynchronous, the starvation issue does not occur. However, in many codebases this is not realistic, especially in the middle of a migration from synchronous to asynchronous code.

@danmoseley
Member

@kouvel @stephentoub

@stephentoub
Member

I understand the concern. At the same time, though:
a) When synchronously blocking in one work item while waiting for another, you can always come up with really bad situations, regardless of the scheduling/dispatch scheme and how the queues are checked. The answer is generally to fix that blocking, not to mitigate it for a particular workload by changing the scheduling algorithm to the detriment of other workloads.
b) Dequeueing from the global queue is significantly faster than searching and stealing from other queues. Any change made to treat the global queue with the same priority as local queues will almost certainly negatively impact real workloads and benchmarks.
c) Regarding "However, the priority given to the global queue over work-stealing is questionable. I suspect it was only done to limit the impact on legacy pre-4.0 applications.", that's actually not the case. The priority of the global queue over the local queues was very intentional, specifically because the local queues were introduced to let an individual operation scale out only when additional resources were available to help. If enough requests are coming in to keep all cores / threads saturated, then the most efficient thing to do is generally not to parallelize the processing of an individual operation by splitting it into sub-items, but instead to process as many of the requests as possible in parallel.

The local queues provide a way to get the best of both worlds: a request can be split up but still be primarily processed by the same thread / core, and only gets assistance in picking up its forked items if there are no other requests to be processed. In situations where other threads do help out by picking up the sub-items, there's generally some overhead / loss of efficiency involved, so it's often only a valuable approach when it's better than doing nothing.

You're of course welcome to experiment, but any changes in this area would need to be rigorously vetted on a large number and variety of workloads.

@kevingosse
Contributor Author

I doubt there would be a large overhead:

        public object Dequeue(ThreadPoolWorkQueueThreadLocals tl, ref bool missedSteal)
        {
            WorkStealingQueue localWsq = tl.workStealingQueue;
            object callback;

            if ((callback = localWsq.LocalPop()) == null && // first try the local queue
                !workItems.TryDequeue(out callback)) // then try the global queue
            {
                // finally try to steal from another thread's local queue
                WorkStealingQueue[] queues = WorkStealingQueueList.Queues;
                int c = queues.Length;
                Debug.Assert(c > 0, "There must at least be a queue for this thread.");
                int maxIndex = c - 1;
                int i = tl.random.Next(c);
                while (c > 0)
                {
                    i = (i < maxIndex) ? i + 1 : 0;
                    WorkStealingQueue otherQueue = queues[i];
                    if (otherQueue != localWsq && otherQueue.CanSteal)
                    {
                        callback = otherQueue.TrySteal(ref missedSteal);
                        if (callback != null)
                        {
                            break;
                        }
                    }
                    c--;
                }
            }

            return callback;
        }

I think it's just a matter of removing the workItems.TryDequeue(out callback) bit from the if condition and instead calling it depending on the random value.
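A rough Python mirror of that change, keeping the shape of the C# loop above (the names are illustrative, not the real implementation):

```python
import random
from collections import deque

def dequeue_modified(local, global_q, other_locals, rng=random):
    """Sketch: the global-queue check is taken out of the 'if' condition;
    the global queue instead occupies one slot in the same random
    rotation used for stealing."""
    if local:
        return local.pop()
    slots = list(other_locals) + [global_q]   # one extra slot for the global queue
    i = rng.randrange(len(slots))             # random starting point, as in the C# loop
    for step in range(len(slots)):
        q = slots[(i + step) % len(slots)]
        if q:
            return q.popleft()
    return None
```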

Your "only steal work as a last resort, to keep items on the same thread as much as possible" argument makes a lot of sense. On the other hand, I believe it's best for a system to dedicate more power to finishing the processing of current requests instead of dequeuing new ones, but that's the old latency vs. throughput argument (and my job biases me towards the former).

In any case, we both agree that this is a sensitive area and it needs to be backed by factual data, so I'll put together a few benchmarks and see how it goes :)

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub
Member

> I'll put together a few benchmarks and see how it goes :)

It's been a year since this :) Anything you can share?
cc: @kouvel

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 25, 2020
@stephentoub stephentoub modified the milestones: 5.0, Future Feb 25, 2020
@kevingosse
Contributor Author

It's a mixed bag. I've seen large improvements in some apps and degradation in others (probably because work-stealing is a costly operation, as you identified). I'll re-open the issue if I manage to identify more precisely what workflows benefit from this and why.

@stephentoub
Member

Thanks.

> I've seen large improvements in some apps and degradation in others

Yeah, that's what I'd expect.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020