Lower the priority of the threadpool global queue #28295
I understand the concern. At the same time, though: you're of course welcome to experiment, but any changes in this area would need to be rigorously vetted on a large number and variety of workloads.
I doubt there would be a large overhead: I think it's just a matter of removing the check that dequeues the global queue before attempting to steal.

Your "only steal work as last resort to keep items in the same thread as much as possible" argument makes a lot of sense. On the other hand, I believe it's best for a system to dedicate more power to finishing the processing of current requests instead of dequeuing new ones, but that's the old latency vs. throughput argument (and I'm biased by my job towards the former). In any case, we both agree that this is a sensitive area and it needs to be backed by factual data, so I'll put together a few benchmarks and see how it goes :)
It's been a year since this :) Anything you can share?
It's a mixed bag. I've seen large improvements in some apps and degradation in others (probably because work-stealing is a costly operation, as you identified). I'll re-open the issue if I manage to identify more precisely what workflows benefit from this and why.
Thanks.
Yeah, that's what I'd expect.
I'd like to suggest changing the way threadpool workers dequeue items, and treating the global queue with the same level of priority as work-stealing.
I've explained most of the rationale in this article: http://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/
In a nutshell, threadpool starvation occurs pretty easily when mixing synchronous and asynchronous code. While this would happen with any scheduling system, it is made worse by the way the threadpool prioritizes its queues.
As a reminder, when a threadpool thread is free, it processes items in order from:

1. Its own local queue
2. The global queue
3. The local queues of the other threads (work-stealing)
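A rough sketch of that order (not the actual CoreCLR code; the queue types below are simplified stand-ins for the threadpool's internal structures):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

class Worker
{
    // Each worker thread owns a local queue; the pool owns a single global queue.
    public readonly ConcurrentQueue<object> LocalQueue = new ConcurrentQueue<object>();
    public static readonly ConcurrentQueue<object> GlobalQueue = new ConcurrentQueue<object>();
    public static readonly List<Worker> AllWorkers = new List<Worker>();

    // Current dequeue order, as described above.
    public object Dequeue()
    {
        object item;

        if (LocalQueue.TryDequeue(out item)) return item;   // 1. own local queue
        if (GlobalQueue.TryDequeue(out item)) return item;   // 2. global queue

        foreach (var other in AllWorkers)                     // 3. steal, only as a last resort
        {
            if (other != this && other.LocalQueue.TryDequeue(out item)) return item;
        }

        return null; // nothing to run
    }
}
```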
The issue appears easily in applications that receive workload from a non-threadpool thread (for instance, because the transport layer is native or uses dedicated threads). In this case, the incoming workload will always be queued in the global queue. Then, when new threadpool workers are spawned, they will always dequeue from the global queue first (since their local queue is empty), adding more pressure on the system instead of stealing work from the other queues.
This is very apparent in the following snippet:
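A minimal sketch along those lines, assuming a dedicated producer thread that queues sync-over-async work items at roughly 5 per second (the delays and the producer/worker shape are illustrative, not the original snippet):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // A dedicated (non-threadpool) thread produces the workload,
        // so every item lands in the threadpool's global queue.
        var producer = new Thread(Produce) { IsBackground = true };
        producer.Start();
        Console.ReadLine();
    }

    static void Produce()
    {
        while (true)
        {
            ThreadPool.QueueUserWorkItem(_ => Process());
            Thread.Sleep(200); // ~5 items per second
        }
    }

    static void Process()
    {
        // Sync-over-async: the threadpool thread blocks until an inner work
        // item, itself queued to the threadpool, has completed.
        var tcs = new TaskCompletionSource<bool>();

        Task.Run(() =>
        {
            Thread.Sleep(500); // simulate some work
            tcs.SetResult(true);
        });

        tcs.Task.Wait(); // blocks this threadpool thread until the inner item runs

        Console.WriteLine($"Processed at {DateTime.Now:HH:mm:ss}");
    }
}
```

The inner `Task.Run` item goes to the current thread's local queue, while the producer keeps filling the global queue; every newly injected worker picks up another `Process` call from the global queue and blocks in turn, instead of stealing the inner items that would unblock the waiting threads.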
This causes threadpool starvation on any system with 8 CPUs or fewer, and the system never recovers, even though the workload is stable (5 items per second).
I think the local queue is a pretty good mechanism: once a worker has started executing tasks related to a particular async workflow, we want to prioritize that workflow and finish it as soon as possible. However, the priority given to the global queue over work-stealing is questionable; I suspect it was only done to limit the impact on legacy pre-4.0 applications.
I believe many threadpool starvation scenarios could be mitigated by lowering the priority of the global queue to the same level as work-stealing. When a threadpool thread is free, it would then process items in order from:

1. Its own local queue
2. The global queue and the local queues of the other threads (work-stealing), at the same priority
Since it's likely that the global queue receives new items at a faster pace than the average local queue, some weighting would probably be needed, which clearly makes the change non-trivial. Still, I believe this is something worth discussing.
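Here's a rough sketch of what that could look like, with a simple alternation standing in for whatever weighting would actually be chosen (again, simplified stand-in types, not the real implementation):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

class Worker
{
    public readonly ConcurrentQueue<object> LocalQueue = new ConcurrentQueue<object>();
    public static readonly ConcurrentQueue<object> GlobalQueue = new ConcurrentQueue<object>();
    public static readonly List<Worker> AllWorkers = new List<Worker>();

    // Alternate between the global queue and stealing so neither starves the
    // other; a real change would likely weight the global queue more heavily,
    // since it receives items faster than any single local queue.
    private bool _tryGlobalFirst = true;

    public object Dequeue()
    {
        object item;

        if (LocalQueue.TryDequeue(out item)) return item; // 1. own local queue, unchanged

        // 2. global queue and work-stealing at the same priority level
        if (_tryGlobalFirst)
        {
            _tryGlobalFirst = false;
            if (GlobalQueue.TryDequeue(out item)) return item;
            if (TrySteal(out item)) return item;
        }
        else
        {
            _tryGlobalFirst = true;
            if (TrySteal(out item)) return item;
            if (GlobalQueue.TryDequeue(out item)) return item;
        }

        return null;
    }

    private bool TrySteal(out object item)
    {
        foreach (var other in AllWorkers)
        {
            if (other != this && other.LocalQueue.TryDequeue(out item)) return true;
        }
        item = null;
        return false;
    }
}
```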
Note that, obviously, an application shouldn't mix sync and async code; if the workflow is purely asynchronous, the starvation issue does not occur. However, in many codebases this is not realistic, especially in the middle of a migration from synchronous to asynchronous code.