
[Discussion] Fail Tasks stuck in Wait() on ThreadPool when starvation occurs? #35777

Closed
benaadams opened this issue May 3, 2020 · 15 comments
Labels
area-System.Threading, untriaged

Comments

@benaadams
Member

ThreadPool starvation can be problematic and will never recover if the rate of ThreadPool threads blocking is higher than the rate of ThreadPool thread injection.

Suggestion

If .Wait() and its associated permutations (e.g. .Result) are invoked on a ThreadPool thread, this could trigger the spawning of a "finalizer"-type thread that fails (cancels?) the Tasks and releases the .Waits if ThreadPool starvation occurs.

The Tasks would need to register themselves before waiting, when Thread.CurrentThread.IsThreadPoolThread is true, so they are available to be cancelled and unblocked.

This would likely want to operate with some ThreadPool heuristics so it doesn't trigger too easily in cases where natural thread injection would resolve the problem.

Drawbacks

Tasks would start getting randomly cancelled if the ThreadPool becomes starved, which might be unexpected behaviour; but that may be better than entering an unrecoverable state?

Example registration code

Task.cs

+ private static Dictionary<SetOnInvokeMres, Task>? s_threadPoolWaitRegistrations;
+ private static object? s_registrationLock;
  
  private bool SpinThenBlockingWait(int millisecondsTimeout, CancellationToken cancellationToken)
  {
      bool infiniteWait = millisecondsTimeout == Timeout.Infinite;
      uint startTimeTicks = infiniteWait ? 0 : (uint)Environment.TickCount;
      bool returnValue = SpinWait(millisecondsTimeout);
      if (!returnValue)
      {
          var mres = new SetOnInvokeMres();
+         var isWaitRegistered = false;
          try
          {
              AddCompletionAction(mres, addBeforeOthers: true);
+             // If running on a ThreadPool thread, register this Task and its wait so it is
+             //  available to be failed/cancelled if ThreadPool starvation occurs.
+             if (Thread.CurrentThread.IsThreadPoolThread)
+             {
+                 lock (ThreadPoolWaitRegistrationLock)
+                 {
+                     ThreadPoolWaitRegistrations.Add(mres, this);
+                     isWaitRegistered = true;
+                 }
+             }
  
              if (infiniteWait)
              {
                  returnValue = mres.Wait(Timeout.Infinite, cancellationToken);
              }
              else
              {
                  uint elapsedTimeTicks = ((uint)Environment.TickCount) - startTimeTicks;
                  if (elapsedTimeTicks < millisecondsTimeout)
                  {
                      returnValue = mres.Wait((int)(millisecondsTimeout - elapsedTimeTicks), cancellationToken);
                  }
              }
          }
          finally
          {
+             if (isWaitRegistered)
+             {
+                 lock (ThreadPoolWaitRegistrationLock)
+                 {
+                     // Remove the wait registration.
+                     ThreadPoolWaitRegistrations.Remove(mres);
+                 }
+             }
  
              if (!IsCompleted) RemoveContinuation(mres);
              // Don't Dispose of the MRES, because the continuation off of this task may
              // still be running.  This is ok, however, as we never access the MRES' WaitHandle,
              // and thus no finalizable resources are actually allocated.
          }
      }
      return returnValue;
  }

+ private static Dictionary<SetOnInvokeMres, Task> ThreadPoolWaitRegistrations
+     => s_threadPoolWaitRegistrations ?? CreateThreadPoolRegistrations();
+ 
+ [MethodImpl(MethodImplOptions.NoInlining)]
+ private static Dictionary<SetOnInvokeMres, Task> CreateThreadPoolRegistrations()
+ {
+     Interlocked.CompareExchange(ref s_threadPoolWaitRegistrations, new Dictionary<SetOnInvokeMres, Task>(), null);
+     return s_threadPoolWaitRegistrations;
+ }
+ 
+ private static object ThreadPoolWaitRegistrationLock
+     => s_registrationLock ?? CreateThreadPoolWaitRegistrationLock();
+ 
+ [MethodImpl(MethodImplOptions.NoInlining)]
+ private static object CreateThreadPoolWaitRegistrationLock()
+ {
+     Interlocked.CompareExchange(ref s_registrationLock, new object(), null);
+     return s_registrationLock;
+ }
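
A hypothetical sketch (not part of the proposal code above) of what the "finalizer"-type watchdog thread might look like is below. ThreadPoolAppearsStarved() and FailForThreadPoolStarvation() are illustrative placeholder names only, not existing runtime APIs, and a real heuristic would need to be far more careful.

  // Hypothetical watchdog loop, run on a dedicated non-ThreadPool thread.
  private static void StarvationWatchdogLoop()
  {
      while (true)
      {
          Thread.Sleep(500);

          // Placeholder heuristic: e.g. work items are queued but none have been dispatched
          // for some time and the blocked-wait registration count keeps growing.
          if (!ThreadPoolAppearsStarved())
          {
              continue;
          }

          KeyValuePair<SetOnInvokeMres, Task>[] registrations;
          lock (ThreadPoolWaitRegistrationLock)
          {
              var waits = ThreadPoolWaitRegistrations;
              registrations = new KeyValuePair<SetOnInvokeMres, Task>[waits.Count];
              ((ICollection<KeyValuePair<SetOnInvokeMres, Task>>)waits).CopyTo(registrations, 0);
          }

          foreach (KeyValuePair<SetOnInvokeMres, Task> registration in registrations)
          {
              // Transition the Task into a failed/cancelled state (hypothetical helper),
              // then set the event so the blocked .Wait() wakes and the exception propagates.
              registration.Value.FailForThreadPoolStarvation();
              registration.Key.Set();
          }
      }
  }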

/cc @stephentoub @davidfowl @jkotas @kouvel

@Dotnet-GitSync-Bot added the area-System.Threading and untriaged labels May 3, 2020
@mangod9
Member

mangod9 commented May 4, 2020

As you point out, the non-deterministic behavior from such a change could lead to very unexpected results. Apps can themselves control Wait behavior with a timeout, so I wonder whether such an "auto" recovery is required. Additionally, if there is a genuine bug/problem, wouldn't Tasks just continuously keep getting cancelled?

@benaadams
Member Author

benaadams commented May 5, 2020

Apps can themselves control Wait behavior with a timeout, so I wonder whether such an "auto" recovery is required.

This gist demonstrates the difference between await WaitAsync(); and WaitAsync().Wait(); on the ThreadPool: https://gist.github.com/benaadams/08e5af7881f25c6dd2577a92de182a73

In the 60th second of execution, the output for WaitAsync().Wait(); is:

Seq 243 => WaitsCompleted 175, Elapsed: 59.0sec, Delay: 0secs, ThreadCount: 78
Seq 244 => WaitsCompleted 179, Elapsed: 60.0sec, Delay: 0.969secs, ThreadCount: 78
Seq 245 => WaitsCompleted 179, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 78
Seq 246 => WaitsCompleted 179, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 78
Seq 247 => WaitsCompleted 179, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 78
Seq 248 => WaitsCompleted 179, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 78
Seq 249 => WaitsCompleted 186, Elapsed: 61.0sec, Delay: 1secs, ThreadCount: 79

Only 249 executions have completed and the thread count has reached 79; the delay is non-deterministic and the system will become overwhelmed.

And the output for await WaitAsync(); is:

Seq 23768 => WaitsCompleted 19771, Elapsed: 59.9sec, Delay: 0secs, ThreadCount: 17
Seq 23769 => WaitsCompleted 19771, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 17
Seq 23770 => WaitsCompleted 19771, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 17
Seq 23771 => WaitsCompleted 19771, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 17
Seq 23772 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23773 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23774 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23775 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23776 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23777 => WaitsCompleted 19778, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23778 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23779 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23780 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23781 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23782 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23783 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23784 => WaitsCompleted 19783, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23785 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23786 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23787 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23788 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23789 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23790 => WaitsCompleted 19790, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23791 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23792 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23793 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23794 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23795 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23796 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23797 => WaitsCompleted 19796, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23798 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23799 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23800 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23801 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23802 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23803 => WaitsCompleted 19803, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23804 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23805 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23806 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23807 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23808 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23809 => WaitsCompleted 19809, Elapsed: 60.0sec, Delay: 0secs, ThreadCount: 16
Seq 23810 => WaitsCompleted 19809, Elapsed: 60.1sec, Delay: 0secs, ThreadCount: 16

Over 23K executions have been completed and the thread count is only 16; the delay is deterministic and the thread count is not climbing.
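
For reference, a rough sketch of the kind of pattern the gist exercises is below; this is an approximation written for illustration, not the gist's exact code, and the rates and measurements in the gist differ.

  using System;
  using System.Threading;
  using System.Threading.Tasks;

  class StarvationDemo
  {
      static int s_waitsCompleted;

      static async Task Main()
      {
          for (int seq = 0; ; seq++)
          {
              // Queue a batch of work items; each one blocks a ThreadPool thread until its
              // semaphore is released one second later (sync-over-async).
              for (int i = 0; i < Environment.ProcessorCount; i++)
              {
                  var semaphore = new SemaphoreSlim(0);
                  _ = Task.Run(() =>
                  {
                      // Blocking variant; the non-blocking comparison is an async lambda
                      // that instead does: await semaphore.WaitAsync();
                      semaphore.WaitAsync().Wait();
                      Interlocked.Increment(ref s_waitsCompleted);
                  });
                  _ = Task.Delay(1000).ContinueWith(_ => semaphore.Release());
              }

              Console.WriteLine(
                  $"Seq {seq} => WaitsCompleted {Volatile.Read(ref s_waitsCompleted)}, " +
                  $"ThreadCount: {ThreadPool.ThreadCount}");

              await Task.Delay(250);
          }
      }
  }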

@stephentoub
Member

Tasks would start getting randomly cancelled if the ThreadPool starts getting starved which might be an unexpected behaviour; but may be better than entering an unrecoverable state?

If you end up in this state, it signals a problem in the construction of your app, and there's a really good chance that you're just going to end up back here after freeing up some threads. Code often ends up here in the first place because the pool starves, so it injects another thread, and the work item that thread picks up itself spawns another sync-over-async operation and blocks waiting for it. Killing some waits by canceling them is likely to just lead to the exact same result: the threads freed up by canceling those waits are just going to go pick up some other work item that's likely to end right back in the same situation.

On top of that, the code in question obviously wasn't stress/scale tested to be robust for this workload... there's a really good chance it also wasn't tested for cancellation exceptions emerging spuriously from operations that were never previously cancelable. I've seen multiple cases where the presence of an exception where there wasn't one before (or where code simply wasn't expecting one) becomes an infinite loop or something else that's just going to cause problems in a different way.

And even if it did help alleviate some symptoms, it could make things worse by masking the actual problem.

I'm personally skeptical this is something that should be pursued.

@benaadams
Member Author

The trouble with ThreadPool starvation is that it kills everything; killing blocking Tasks would at least allow timers and non-blocking portions of the app to continue running rather than everything becoming unresponsive (assuming the app does more than a single thing)?

@stephentoub
Member

killing blocking Tasks would at least allow timers and non-blocking portions of the app to continue running rather than everything becoming unresponsive

How? What's to say those timers and non-blocking portions are going to be prioritized over all the other work filling the queues? And even if a couple did run, the app is written in a way that it's very likely to quickly find itself back in the same position after allowing a few such timers and other things to run, no? And in killing blocking tasks, you're not actually killing them: you're just killing the wait on them, which means you've now also got asynchronous operations that are orphaned and who knows what they're doing or what they expect the state of things to be when they continue executing.

Do you have a real app (not a synthetic example like you shared earlier) where this converts a deadlocked app into one that is able to meaningfully make forward progress and continue operating successfully indefinitely? And without the developer having had to do anything with regards to the random cancellation exceptions they received?

@mangod9
Member

mangod9 commented May 5, 2020

Rather than investing in an auto-cancelling mechanism, perhaps we could look into some notifications if the TP is able to detect such "starvation" -- maybe generate an ETW event. This could provide some visibility to the app owner for possible mitigation -- auto recovery is always tricky to get right in most cases.

@benaadams
Member Author

benaadams commented May 5, 2020

There is an ETW event for this: ThreadPoolWorkerThreadAdjustment/Adjustment with reason Starvation.

However, it can be hard to track down what is causing it; the idea here was to put the Tasks into a failed/cancelled state and then unblock the wait, causing an exception to propagate and bubble up with a stack trace for further diagnosis.

https://docs.microsoft.com/en-us/archive/blogs/vancem/diagnosing-net-core-threadpool-starvation-with-perfview-why-my-service-is-not-saturating-all-cores-or-seems-to-stall
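
For in-process detection, something along these lines should work: an EventListener that enables the runtime provider's threading keyword and watches for the thread pool adjustment event. This is only a sketch; the keyword value, the event name as surfaced to EventListener, and the Reason payload value for starvation are taken from the published ETW manifest and should be verified against the runtime version in use.

  using System;
  using System.Diagnostics.Tracing;

  sealed class ThreadPoolStarvationListener : EventListener
  {
      // Threading keyword from the runtime's ETW manifest (assumption: 0x10000).
      private const EventKeywords ThreadingKeyword = (EventKeywords)0x10000;

      protected override void OnEventSourceCreated(EventSource eventSource)
      {
          if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
          {
              EnableEvents(eventSource, EventLevel.Informational, ThreadingKeyword);
          }
      }

      protected override void OnEventWritten(EventWrittenEventArgs eventData)
      {
          // ETW name: ThreadPoolWorkerThreadAdjustment/Adjustment.
          if (eventData.EventName?.Contains("ThreadPoolWorkerThreadAdjustment") != true ||
              eventData.PayloadNames is null || eventData.Payload is null)
          {
              return;
          }

          int reasonIndex = eventData.PayloadNames.IndexOf("Reason");
          if (reasonIndex >= 0)
          {
              // In the published manifest a Reason value of 6 maps to "Starvation"
              // (assumption; verify against your runtime version before acting on it).
              Console.WriteLine($"ThreadPool adjustment, reason: {eventData.Payload[reasonIndex]}");
          }
      }
  }

Instantiating the listener once at startup is enough to start receiving the events in-process; PerfView and dotnet-trace remain the out-of-process options described in the linked article.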

@benaadams
Member Author

Do you have a real app (not a synthetic example like you shared earlier) where this converts a deadlocked app into one that is able to meaningfully make forward progress and continue operating successfully indefinitely?

Well the synthetic example would die because it would throw an uncaught exception on the ThreadPool 😅

@kouvel
Member

kouvel commented May 5, 2020

The trouble with ThreadPool starvation is that it kills everything; killing blocking Tasks would at least allow timers and non-blocking portions of the app to continue running rather than everything becoming unresponsive

There probably are aspects of this that could be pursued; for instance the thread pool could track different types of work items, which ones tend to block, and avoid scheduling all threads to such work items. The Windows thread pool does something like that, but differentiating work items is left to the user or formalized with specific usages in APIs, which probably wouldn't be sufficient here. There are, I think, cases where even an ideal thread pool could be coerced into the same worst-case behaviour you mention, and I don't think the problem there is with the thread pool. Timers I understand - their callbacks should really run at higher priority than other regular queue-order work items. For other general work items, though, there is no such guarantee, and that is the current contract of using the thread pool. I'm sure there is more information that could be obtained to do better in such cases (with new APIs), or perhaps alternate solutions, or even opportunities to expose workarounds that would not fail miserably so quickly.

@benaadams
Member Author

or perhaps alternate solutions, or even opportunities to expose workarounds that would not fail miserably so quickly.

At the extreme end we could stack-bend (#4654) at Dispatch off the Thread's stack and, if .Wait was called, ditch the whole stack and return to the Dispatch loop to pick up the next item. If the item being resumed was a .Wait-ed item, then bend back onto its stack rather than onto a different detachable stack.

@svick
Contributor

svick commented May 5, 2020

@stephentoub I don't know if this proposal would make a thread-starved app behave better, but I think it would make it much easier to diagnose the problem. Pretty much every developer understands unhandled exceptions, and most codebases will have some way of dealing with them. But if you have an app that seems to be just stuck, that's much harder to diagnose and requires more advanced tools, especially if the problem only happens in production (which is likely here).

To me, that would make this worth pursuing, assuming it doesn't have some other significant negative side-effect (I would consider changing the failure mode from thread starvation to infinite loop in a small number of apps acceptable).

Though this assumes that the thrown exception will have a clear message, not just "The operation was canceled."

@stephentoub
Member

I would consider changing the failure mode from thread starvation to infinite loop in a small number of apps acceptable

That was one example. I'd expect an even more common one to be that such an exception causes the app to crash, at which point either the app becomes just as unresponsive or, hopefully more likely, a new process is spun up; now it has to spend time doing all of the start-up work it does on first use, and maybe it's temporarily responsive again until it quickly finds its way back to the same situation, at which point it crashes again... so it's still looping, just at a grander scale. However you slice it, I'm not seeing this generally improving things.

From my perspective, such a system is "throwing good money after bad" as it were, and an app that can get into this situation needs to be fixed rather than worked around by introducing random behaviors that try to effectively randomly abort portions of operations. Based on what I currently know, I don't think it's worth pursuing. If someone can come up with a good way to prove that it would actually result in enough widespread good to outweigh the harm, I'd of course like to hear about it.

@svick
Contributor

svick commented May 5, 2020

@stephentoub

However you slice it, I'm not seeing this generally improving things.

Like I said, I don't know if this would improve things, but improving things is not why I think this would be useful.

an app that can get into this situation needs to be fixed rather than worked around by introducing random behaviors that try to effectively randomly abort portions of operations

I agree. Which is why my goal is to make it easier to diagnose and fix such an app. And I think throwing an exception with a good error message helps with that.

@mangod9
Member

mangod9 commented Jul 6, 2020

Don't believe there is consensus on any action items here? A change in behavior here would be quite risky too.

@mangod9
Member

mangod9 commented Jul 7, 2020

Closing for now, we can reopen in the future if required.

@mangod9 mangod9 closed this as completed Jul 7, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020