Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Akka internals immune against thread starvation issues #23576

Closed
jrudolph opened this issue Aug 29, 2017 · 8 comments
Closed

Make Akka internals immune against thread starvation issues #23576

jrudolph opened this issue Aug 29, 2017 · 8 comments

Comments

@jrudolph
Copy link
Member

jrudolph commented Aug 29, 2017

There are uncountable numbers of issues caused by thread starvation issues. In many cases not only user code is affected (bad performance, bad CPU utilization, etc) but often Akka internals are affected as well. Examples are e.g. all kinds of remoting issues due to late messages that get no attention because threads are blocked. In streams it's quite easy to block or even deadlock ActorGraphInterpreters by user code blocking inside of (sub)streams. In Akka-Http basic infrastructure (like accepting new requests, handling timeout, delivering entity streams) can stop working when user code blocks.

The basic issue is that Akka's internals are by default not shielded against user code. We always tell people to use an IO or blocking dispatcher for "dangerous" tasks. That's definitely a good strategy for the user to "contain the damage". However, I wonder if we could more actively prevent user code from affecting Akka internals code by introducing containment ourselves by running internals on a dedicated dispatcher by default consciously trading off the latest bits of performance to keep the engine running.

Concretely that would mean:

  • we define an internal dispatcher used for all internal use cases, especially remoting and clustering
  • for framework-provided streams in general and for http in particular, we enforce (configurable) asynchronous boundaries between framework stream stages and user stream parts

The trade-off is a few extra threads and asynchronous boundaries. The exact number of internal threads would have to be decided. In the worst case of full utilization, it might mean that there are a few more threads battling for CPU resources (but note how our current defaults are also already overprovisioning threads) but preemptive OS scheduling of threads would still be fairer than user code blocking Akka internals from executing at all.

The expected result would be:

  • No more spurious reports of broken remoting connections due to thread starvation. Remaining issues be more likely due to excessive GC or real connection problems.
  • Some deadlock issues in Akka-Http would be automatically prevented because framework code and (potentially blocking) user code can still make progress.
  • Blocking user code will still affect user code and might lead to bad utilization.
@jrudolph jrudolph added discuss Tickets that need some discussion before proceeding. Not decided if it's a good idea. t:core t:stream labels Aug 29, 2017
@ktoso
Copy link
Member

ktoso commented Aug 29, 2017

I think I brought up a similar thing while we were discussing the [in]famous SO answer "Akka HTTP: Blocking in a future blocks the server", where blocking on a route kills everything since it blocks the default dispatcher... There I argued for giving that one a special dispatcher to protect the routing layer specifically, but yes - the same issue is elsewhere too.

So... I'd be in favour of such move if we can afford it.

@ktoso
Copy link
Member

ktoso commented Aug 29, 2017

re 3rd point - It is true that "don't block!" will remain a valid and super important advice, but at least it will shield the system messages more from bad stuff so it may help delay issues somewhat for users until they're "in a really bad spot"..

@patriknw
Copy link
Member

I agree that we might have to revisit this old decision. I'm not sure, but I think the arguments were that we by default should have small footprint (be able to run on Android and such), but it's probably not that important any longer. We could make it opt-in to use same dispatcher.

The performance penalty of the extra async boundaries in akka-http might not be great for benchmarks, but we can once again have it opt-in to use same dispatcher (without boundaries).

Realistically, we can't change these defaults in 2.5.x, but we can make it a recommended setting to turn on, and then change defaults in 2.6.

@jrudolph
Copy link
Member Author

I'm not sure, but I think the arguments were that we by default should have small footprint (be able to run on Android and such), but it's probably not that important any longer.

The default settings with parallelism-factor = 3 actually contradict this intention ;)

The performance penalty of the extra async boundaries in akka-http might not be great for benchmarks, but we can once again have it opt-in to use same dispatcher (without boundaries).

Yep, it should be an option with the default changing in 2.6, I agree.

@patriknw
Copy link
Member

The default settings with parallelism-factor = 3 actually contradict this intention ;)

yes, but note that FJP doesn't create threads eagerly and it stops unused threads, so it's not that bad (but I agree)

@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted and removed discuss Tickets that need some discussion before proceeding. Not decided if it's a good idea. labels Oct 30, 2017
@patriknw patriknw added this to the 2.6 milestone Jan 3, 2019
@He-Pin
Copy link
Member

He-Pin commented Jan 3, 2019

In our internal RPC framework, the default configuration which been used at very large scale is true as what @jrudolph just suggested.

We have a separated networking pool with 8 threads and a separated callbacks pool which with maximum 720 threads by default, and using a SynchronousQueue for communication. when there is no thread is free out there, the RPC framework will throw ThreadPool is full exception.

And every service stub methods could configure their own dedicated thread pool too. And user could also config sharing the networking pool with the user code and eliminating the callbacks pool too if the user's code is fully async.

I like this idea, as this is what we are currently using too and helps. If the user received an SMS notification warning about that Thread pool is full and he can do an online thread dumping to see what's happening and fix it up in the next release.

@patriknw
Copy link
Member

patriknw commented Mar 8, 2019

As part of this we should also change parallelism-factor for default dispatcher to something closer to 1

@patriknw patriknw added the medium label Mar 8, 2019
@johanandren
Copy link
Member

It could be a good idea to split the default-internal dispatcher work from the stream isolation part, they are fairly independent and the stream one seems more risky in terms of performance (putting the Tcp stream stages on a separate dispatcher for example)

@johanandren johanandren self-assigned this Apr 17, 2019
@johanandren johanandren added 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Apr 17, 2019
johanandren added a commit to johanandren/akka that referenced this issue Apr 18, 2019
 * Another dispatcher can be selected through akka.actor.internal-dispatcher
 * Factor of default dispatcher lowered to 1
@patriknw patriknw modified the milestones: 2.6, 2.6.0-M1 May 2, 2019
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label May 2, 2019
@patriknw patriknw closed this as completed May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants