
Lower the priority of the threadpool global queue #28295

Closed
kevingosse opened this issue Dec 31, 2018 · 6 comments
Comments

@kevingosse
Contributor

I'd like to suggest changing the way the threadpool workers dequeue items, treating the global queue with the same level of priority as work-stealing.

I've explained most of the rationale in this article: http://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/

In a nutshell, threadpool starvation occurs pretty easily when mixing synchronous and asynchronous code. While this can happen with any scheduling system, it is made worse by the way the threadpool prioritizes its queues.

As a reminder, when a threadpool thread is free, it'll process items in order from:

  • its local queue
  • the global queue
  • the local queues of other threads (in random order)
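To make that order concrete, here is a minimal Python model of the dispatch logic (a sketch under simplifying assumptions; the queue shapes and names are illustrative, not the actual ThreadPoolWorkQueue implementation):

```python
import random
from collections import deque

def dequeue(local, global_q, other_locals, rng=random):
    """Model of the current dispatch order: own local queue first,
    then the global FIFO queue, then steal from other threads' local
    queues in random order. Illustrative sketch only."""
    if local:
        return local.pop()           # LIFO pop from the thread's own local queue
    if global_q:
        return global_q.popleft()    # FIFO dequeue from the global queue
    # Work-stealing: scan the other local queues in random order.
    for q in rng.sample(other_locals, len(other_locals)):
        if q:
            return q.popleft()       # steal from the "cold" end of another queue
    return None
```

With an empty local queue, the global queue always wins over stealing, which is exactly the behavior this issue questions.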

The issue appears easily in applications that receive workload from a non-threadpool thread (for instance, because the transport layer is native or uses dedicated threads). In this case, the incoming workload will always be queued in the global queue. Then, when new threadpool workers are spawned, they will always dequeue from the global queue first (since their local queue is empty), adding more pressure on the system instead of stealing work from the other queues.

This is very apparent in the following snippet:

using System;
using System.Threading;
using System.Threading.Tasks;

namespace Starvation
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(Environment.ProcessorCount);

            ThreadPool.SetMinThreads(8, 8);

            Task.Factory.StartNew(
                Producer,
                TaskCreationOptions.LongRunning); // dedicated thread, so Process is queued from outside the threadpool
            Console.ReadLine();
        }

        static void Producer()
        {
            while (true)
            {
                Process(); // fire-and-forget: the returned task is deliberately not awaited

                Thread.Sleep(200);
            }
        }

        static async Task Process()
        {
            await Task.Yield();

            var tcs = new TaskCompletionSource<bool>();

            Task.Run(() =>
            {
                Thread.Sleep(1000);
                tcs.SetResult(true);
            });

            tcs.Task.Wait();

            Console.WriteLine("Ended - " + DateTime.Now.ToLongTimeString());
        }
    }
}

This causes threadpool starvation on any system with 8 CPUs or fewer, and the system never recovers, even though the workload is stable (5 items per second).

I think the local queue is a pretty good mechanism: once a worker has started executing tasks related to a particular async workflow, we want to process that workflow in priority and finish it as soon as possible. However, the priority given to the global queue over work-stealing is questionable. I suspect it was only done to limit the impact on legacy pre-4.0 applications.

I believe many threadpool starvation scenarios could be mitigated by lowering the priority of the global queue to the same level as work-stealing. When a threadpool thread is free, it would process items in order from:

  • its local queue
  • other queues in random order, including the global queue
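As a sketch of what that could look like (the global_weight parameter is hypothetical, there only to illustrate the weighting idea; none of these names exist in the actual threadpool):

```python
import random
from collections import deque

def dequeue_proposed(local, global_q, other_locals, global_weight=1, rng=random):
    """Sketch of the proposed order: the local queue keeps top priority,
    but the global queue becomes just one more candidate in the random
    scan. Entering it more than once (global_weight > 1) makes it more
    likely to be visited first, approximating a simple weighting."""
    if local:
        return local.pop()
    # The global queue is just another steal candidate now.
    candidates = list(other_locals) + [global_q] * global_weight
    rng.shuffle(candidates)
    for q in candidates:
        if q:
            return q.popleft()
    return None
```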

Since it's likely that the global queue receives new items at a faster pace than the average local queue, some weighting would probably be needed, which clearly makes the change non-trivial. Still, I believe this is something worth discussing.

Note that, obviously, an application shouldn't mix sync and async code; if the workflow is purely asynchronous, the starvation issue does not occur. However, in many codebases this is not realistic, especially in the middle of a migration from synchronous to asynchronous code.

@danmoseley
Member

@kouvel @stephentoub

@stephentoub
Member

I understand the concern. At the same time, though:
a) When synchronously blocking in one work item while waiting for another, you can always come up with really bad situations, regardless of the scheduling/dispatch scheme and how the queues are checked. The answer is generally to fix that blocking, not to mitigate it for a particular workload by changing the scheduling algorithm to the detriment of other workloads.
b) Dequeueing from the global queue is significantly faster than searching and stealing from other queues. Any change made to treat the global queue with the same priority as local queues will almost certainly negatively impact real workloads and benchmarks.
c) Regarding "However, the priority given to the global queue over work-stealing is questionable. I suspect it was only done to limit the impact on legacy pre-4.0 applications.", that's actually not the case. The priority of the global queue over the local queues was very intentional, specifically because the local queues were introduced to let an individual operation scale out only when additional resources were available to help. If enough requests are coming in to keep all cores / threads saturated, then the most efficient thing to do is generally not to parallelize the processing of an individual operation by splitting it into sub-items, but instead to process as many of the requests as possible in parallel.

The local queues provide a way to get the best of both worlds: a request can be split up but still be primarily processed by the same thread / core, and only gets assistance in picking up its forked items if there are no other requests to be processed. In situations where other threads do help out by picking up the sub-items, there's generally some overhead / loss of efficiency involved, so it's often only a valuable approach when it's better than doing nothing.

You're of course welcome to experiment, but any changes in this area would need to be rigorously vetted on a large number and variety of workloads.

@kevingosse
Contributor Author

I doubt there would be a large overhead:

        public object Dequeue(ThreadPoolWorkQueueThreadLocals tl, ref bool missedSteal)
        {
            WorkStealingQueue localWsq = tl.workStealingQueue;
            object callback;

            if ((callback = localWsq.LocalPop()) == null && // first try the local queue
                !workItems.TryDequeue(out callback)) // then try the global queue
            {
                // finally try to steal from another thread's local queue
                WorkStealingQueue[] queues = WorkStealingQueueList.Queues;
                int c = queues.Length;
                Debug.Assert(c > 0, "There must at least be a queue for this thread.");
                int maxIndex = c - 1;
                int i = tl.random.Next(c);
                while (c > 0)
                {
                    i = (i < maxIndex) ? i + 1 : 0;
                    WorkStealingQueue otherQueue = queues[i];
                    if (otherQueue != localWsq && otherQueue.CanSteal)
                    {
                        callback = otherQueue.TrySteal(ref missedSteal);
                        if (callback != null)
                        {
                            break;
                        }
                    }
                    c--;
                }
            }

            return callback;
        }

I think it's just a matter of removing the workItems.TryDequeue(out callback) bit from the if condition and instead calling it depending on the random value.
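A rough Python mirror of that change, keeping the shape of the C# loop above (the names are illustrative, not the real implementation):

```python
import random
from collections import deque

def dequeue_modified(local, global_q, other_locals, rng=random):
    """Sketch: the global-queue check is taken out of the 'if' condition;
    the global queue instead occupies one slot in the same random
    rotation used for stealing."""
    if local:
        return local.pop()
    slots = list(other_locals) + [global_q]   # one extra slot for the global queue
    i = rng.randrange(len(slots))             # random starting point, as in the C# loop
    for step in range(len(slots)):
        q = slots[(i + step) % len(slots)]
        if q:
            return q.popleft()
    return None
```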

Your "only steal work as a last resort, to keep items on the same thread as much as possible" argument makes a lot of sense. On the other hand, I believe it's best for a system to dedicate more power to finishing the processing of current requests instead of dequeuing new ones, but that's the old latency vs. throughput argument (and my job biases me towards the former).

In any case, we both agree that this is a sensitive area and it needs to be backed by factual data, so I'll put together a few benchmarks and see how it goes :)

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub
Member

> I'll put together a few benchmarks and see how it goes :)

It's been a year since this :) Anything you can share?
cc: @kouvel

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 25, 2020
@stephentoub stephentoub modified the milestones: 5.0, Future Feb 25, 2020
@kevingosse
Contributor Author

It's a mixed bag. I've seen large improvements in some apps and degradation in others (probably because work-stealing is a costly operation, as you identified). I'll re-open the issue if I manage to identify more precisely what workflows benefit from this and why.

@stephentoub
Member

Thanks.

> I've seen large improvements in some apps and degradation in others

Yeah, that's what I'd expect.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020