Actor performance degrades catastrophically with contention #68299
Labels
bug, concurrency runtime, concurrency, run-time performance
Description
While benchmarking Swift actors I found that when multiple actors contend on a single actor, the throughput of the benchmark unexpectedly degrades almost linearly with the number of actors. My benchmark starts a specified number of `Pingable` and `Pinger` actors, and the latter call the `ping` method a fixed number of times in a loop. What I found is that when `--pingables` is set to 1 and `--pingers` is set to a large enough number (more than ~32 in my case), the throughput degrades almost linearly with the number of pingers. These are example numbers in a Linux aarch64 VM with the latest Swift nightly, but the same behavior occurs on a host M1 Air (Swift 5.8.1) as well. As you can see, with ~8k pingers the throughput is catastrophically low.
Steps to reproduce
Here's the benchmark code and the shell script that runs it with different parameters:
actor-latency-swift.swift
actor-latency-swift.sh
Expected behavior
With more pingers the throughput should initially degrade, since more concurrent threads are competing for the same actor, but it should plateau at around the number of cores and not degrade much afterwards. This is because the amount of work in the system is fixed and doesn't change with the number of pingers.
Environment
Additional context
I'm actually writing a C++ actor library (heavily inspired by Swift!) on top of C++ coroutines, so I was naturally comparing my performance numbers against Swift's when I noticed this unexpected degradation. Initially I thought this was some kind of thundering-herd issue, where actors might wake up when they shouldn't, though CPU consumption didn't really confirm that (the benchmark uses ~1 core). Unfortunately, my attempts to test a Swift compiler built from source have not been successful so far. On a Mac I cannot make it produce binaries that link against the freshly built libraries instead of the ones in the system (and the actor switching code is in the standard library), and on Linux the build needed far too much disk space for the time being.
What I managed to do so far is to profile it with `perf`, and I found that with thousands of pingers most of the time (>85%) is spent in the following stack:

That must be inside one of the `preprocessQueue` overloads, which is called from `drainOne` in `defaultActorDrain`. I think the problem here is that `ListMerger` is built for merging sorted lists, which makes sense when different priorities are involved. In my case, however, all priorities are the same. After a `Pingable` actor replies to a `Pinger`, that `Pinger` receives the response and immediately issues a new request, which adds exactly one item to the queue. But to preprocess that one item, `preprocessQueue` has to walk the entire list, whose length depends linearly on the number of pingers (the queue size). That's why the degradation under contention is so severe.

I gather priorities are important here (otherwise merging two lists would be a simple pointer swap), so I would suggest splitting actor job queues into two parts: the inverted head, with nodes having an unprocessed flag set, and a processed part kept as some kind of sorted tree (an intrusive red-black tree, if avoiding extra allocations is a hard requirement, although then taking the next item would be more expensive).
At the very least, the second pointer in `Job::SchedulerPrivate` could be used as a reference to the last job with the same priority in the processed list (valid only at the head of that priority), which would allow merging large lists of same-priority jobs efficiently, iterating only over the priorities (of which there are probably few) when more than one is present in the list.

Sorry I couldn't be of more help; this is only my second time looking into the Swift source code.