[Discussion] Fail Tasks stuck in Wait() on ThreadPool when starvation occurs? #35777
Comments
As you point out, the nondeterministic behavior from such a change could lead to very unexpected results. Apps can themselves control Wait behavior with a timeout, so I wonder if such an "auto" recovery is required. Additionally, if there is a genuine bug/problem, wouldn't Tasks continuously keep getting cancelled?
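To make the existing escape hatch mentioned above concrete: a blocking wait can already be bounded with a timeout so the caller regains control instead of parking a thread indefinitely. A minimal sketch (the 2-second timeout is an arbitrary value for illustration):

```csharp
using System;
using System.Threading.Tasks;

class WaitTimeoutDemo
{
    static void Main()
    {
        // A task that completes only after 10 seconds.
        Task slow = Task.Delay(TimeSpan.FromSeconds(10));

        // Wait returns false if the task does not complete within the
        // timeout, so the caller can bail out rather than block forever.
        bool completed = slow.Wait(TimeSpan.FromSeconds(2));

        Console.WriteLine(completed
            ? "Task completed within the timeout."
            : "Timed out; the task is still running in the background.");
    }
}
```

Note that when `Wait` times out, the underlying task keeps running; only the caller's blocking wait is released — the same distinction raised later in this thread.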
This gist demonstrates the difference. In the 60th second of execution, the output of the blocking (sync-over-async) version shows only 249 executions completed and the thread count has climbed to 79; the delay is non-deterministic and the system becomes overwhelmed. The output of the asynchronous version shows 23.3K executions completed with a thread count of only 16; the delay is deterministic and the ThreadCount is not climbing.
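The pathological pattern being measured above is sync-over-async on pool threads: each work item blocks a ThreadPool thread waiting on an async operation whose continuation itself needs a free pool thread. A minimal sketch of the two shapes (not the original gist, which was not preserved in this thread):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class StarvationShapes
{
    // Blocking shape: parks a ThreadPool thread until the inner async
    // operation's continuation can run -- which itself needs a free pool
    // thread. Queued at scale, this starves the pool and forces slow
    // thread injection (roughly one new thread per second).
    static void BlockingWorkItem()
    {
        Task.Delay(100).Wait();   // sync-over-async: the thread is held here
    }

    // Asynchronous shape: the thread returns to the pool while the delay
    // is pending, so throughput is bounded by work, not by thread count.
    static async Task AsyncWorkItem()
    {
        await Task.Delay(100);    // no pool thread is held during the delay
    }

    static void Main()
    {
        // Queuing many BlockingWorkItem instances drives the pool toward
        // one thread per item; AsyncWorkItem does not.
        ThreadPool.QueueUserWorkItem(_ => BlockingWorkItem());
        AsyncWorkItem().Wait(); // acceptable here: Main is not a pool thread
        Console.WriteLine($"Pool threads in use: {ThreadPool.ThreadCount}");
    }
}
```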
If you end up in this state, it signals a problem in the construction of your app, and there's a really good chance that you're just going to end up back here after freeing up some threads. Code often ends up here in the first place because the pool starves, so it introduces another thread, and the work item it picks up to process itself spawns another sync-over-async operation and blocks waiting for it. Killing some waits by canceling them is likely to just lead to the exact same result: the threads freed up by canceling those waits are just going to go pick up some other work item that's likely to end right back in the same situation. On top of that, the code in question obviously wasn't stress/scale tested to be robust for this workload... there's a really good chance it also wasn't tested for cancellation exceptions emerging spuriously from operations that were never previously cancelable. I've seen multiple cases where the presence of an exception where there wasn't one before (or where code simply wasn't expecting it) becomes an infinite loop or something else that's just going to cause problems in a different way. And even if it did help alleviate some symptoms, it could make things worse by masking the actual problem. I'm personally skeptical this is something that should be pursued.
The trouble with ThreadPool starvation is that it kills everything; failing blocking Tasks would at least allow timers and the non-blocking portions of the app to continue running rather than everything becoming unresponsive (assuming the app does more than a single thing)?
How? What's to say those timers and non-blocking portions are going to be prioritized over all the other work filling the queues? And even if a couple did run, the app is written in a way that it's very likely to quickly find itself back in the same position after allowing a few such timers and other things to run, no? And in killing blocking tasks, you're not actually killing them: you're just killing the wait on them, which means you've now also got asynchronous operations that are orphaned, and who knows what they're doing or what they expect the state of things to be when they continue executing. Do you have a real app (not a synthetic example like you shared earlier) where this converts a deadlocked app into one that is able to meaningfully make forward progress and continue operating successfully indefinitely? And without the developer having had to do anything with regards to the random cancellation exceptions they received?
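The distinction drawn above — cancelling the wait rather than the task — can be demonstrated today. `Task.Wait(CancellationToken)` throws when the token fires, but the underlying task is untouched and keeps running, orphaned. A minimal sketch (the 200 ms / 5 s timings are arbitrary):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class OrphanedWaitDemo
{
    static void Main()
    {
        // Token that fires after 200 ms.
        var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(200));

        // A 5-second task that will still be running after the wait dies.
        Task work = Task.Delay(TimeSpan.FromSeconds(5));

        try
        {
            work.Wait(cts.Token); // only the *wait* is cancellable here
        }
        catch (OperationCanceledException)
        {
            // The blocked thread is released, but the task itself was not
            // cancelled: it continues in the background, orphaned.
            Console.WriteLine($"Wait cancelled; task status: {work.Status}");
        }
    }
}
```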
Maybe rather than investing in an auto-cancelling mechanism, we could look into some notification if the TP is able to detect such "starvation" -- maybe generate an ETW event. This could provide some visibility to the app owner for possible mitigation -- auto recovery is always tricky to get right in most cases.
There is an ETW event for that. However, it can be hard to track down what is causing it; the idea was to put the Tasks into a failed/cancelled state and then unblock the wait, causing an exception to propagate and bubble up with a stack trace for further diagnosis.
Well, the synthetic example would die because it would throw an uncaught exception on the ThreadPool 😅
There probably are aspects of this that could be pursued. For instance, the thread pool could track different types of work items, note which ones tend to block, and avoid scheduling all threads to such work items. The Windows thread pool does something like that, but differentiating work items is left to the user or formalized with specific usages in APIs, which probably wouldn't be sufficient here. There are, I think, cases where even an ideal thread pool could be coerced into the same worst-case behavior you mention; I don't think the problem there is with the thread pool. Timers I understand -- their callbacks should really run at higher priority than other regular queue-order work items. For other general work items, though, there is no such guarantee, and that is the current contract with using the thread pool. I'm sure there is more information that could be attained to do better in such cases (with new APIs), or perhaps alternate solutions, or even opportunities to expose workarounds that would not fail miserably so quickly.
At the extreme end we could stack bend #4654 at
@stephentoub I don't know if this proposal would make a thread-starved app behave better, but I think it would make it much easier to diagnose the problem. Pretty much every developer understands unhandled exceptions, and most codebases will have some way of dealing with them. But if you have an app that seems to be just stuck, that's much harder to diagnose and requires more advanced tools, especially if the problem only happens in production (which is likely here). To me, that would make this worth pursuing, assuming it doesn't have some other significant negative side-effect (I would consider changing the failure mode from thread starvation to infinite loop in a small number of apps acceptable). Though this assumes that the thrown exception will have a clear message, not just "The operation was canceled."
That was one example. I'd expect an even more common one to be that such an exception causes the app to crash. At which point either the app becomes just as unresponsive, or hopefully more likely a new process is spun up and now it has to spend time doing all of the start-up work it does on first use, and maybe it's temporarily responsive again until it quickly finds its way back to the same situation, at which point it crashes again... so it's still looping, just at a grander scale. However you slice it, I'm not seeing this generally improving things. From my perspective, such a system is "throwing good money after bad" as it were, and an app that can get into this situation needs to be fixed rather than worked around by introducing random behaviors that try to effectively randomly abort portions of operations. Based on what I currently know, I don't think it's worth pursuing. If someone can come up with a good way to prove that it would actually result in enough widespread good to outweigh the harm, I'd of course like to hear about it.
Like I said, I don't know if this would improve things, but improving things is not why I think this would be useful.
I agree. Which is why my goal is to make it easier to diagnose and fix such an app. And I think throwing an exception with a good error message helps with that. |
I don't believe there is consensus on any action items here. A change in behavior here would be quite risky too.
Closing for now, we can reopen in the future if required. |
ThreadPool starvation can be problematic and will never recover if the rate of ThreadPool threads blocking is higher than the rate of ThreadPool thread injection.
Suggestion
If `.Wait()` and its associated permutations (e.g. `.Result`, etc.) are invoked on a ThreadPool thread, this could trigger the spawning of a "finalizer"-type thread that fails (cancels?) the Tasks and releases the `.Wait`s if ThreadPool starvation occurs.

The Tasks would need to register themselves before waiting, if on a thread where `CurrentThread.IsThreadPoolThread` is true, so they are available to be cancelled and unblocked.

This would likely want to operate with some ThreadPool heuristics so it doesn't trigger too easily when natural thread injection would resolve it.
Drawbacks
`Task`s would start getting randomly cancelled if the ThreadPool starts getting starved, which might be unexpected behaviour; but that may be better than entering an unrecoverable state?

Example registration code
Task.cs
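The original `Task.cs` snippet was not preserved in this thread. Below is a hypothetical sketch of the registration idea described in the suggestion — every name here (`BlockedWaitRegistry`, `RegisterBlockingWait`, `WaitWithStarvationEscape`) is invented for illustration and is not a real runtime API:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical registry: blocking waits on pool threads register here so
// a starvation watchdog could cancel them and free their threads.
static class BlockedWaitRegistry
{
    static readonly ConcurrentDictionary<int, CancellationTokenSource> s_waits = new();
    static int s_nextId;

    public static IDisposable RegisterBlockingWait(out CancellationToken token)
    {
        var cts = new CancellationTokenSource();
        int id = Interlocked.Increment(ref s_nextId);
        s_waits[id] = cts;
        token = cts.Token;
        return new Registration(id);
    }

    // A starvation watchdog (detection heuristics elided) would call this
    // to fail every registered wait.
    public static void CancelAllWaits()
    {
        foreach (var cts in s_waits.Values) cts.Cancel();
    }

    sealed class Registration : IDisposable
    {
        readonly int _id;
        public Registration(int id) => _id = id;
        public void Dispose()
        {
            if (s_waits.TryRemove(_id, out var cts)) cts.Dispose();
        }
    }
}

static class TaskExtensions
{
    // Wrap Task.Wait: only register when running on a ThreadPool thread,
    // as the proposal suggests.
    public static void WaitWithStarvationEscape(this Task task)
    {
        if (!Thread.CurrentThread.IsThreadPoolThread)
        {
            task.Wait();
            return;
        }
        using (BlockedWaitRegistry.RegisterBlockingWait(out CancellationToken token))
        {
            // Throws OperationCanceledException if the watchdog fires;
            // note the task itself keeps running, as discussed above.
            task.Wait(token);
        }
    }
}
```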
/cc @stephentoub @davidfowl @jkotas @kouvel