
Actor performance degrades catastrophically with contention #68299

Open
snaury opened this issue Sep 3, 2023 · 3 comments
Labels
bug (A deviation from expected or documented behavior. Also: expected but undesirable behavior.) · concurrency runtime (Feature: umbrella label for concurrency runtime features) · concurrency (Feature: umbrella label for concurrency language features) · run-time performance

Comments


snaury commented Sep 3, 2023

Description

While benchmarking Swift actors I found that when many actors contend on a single actor, the throughput of the benchmark unexpectedly degrades almost linearly with the number of actors. My benchmark starts a specified number of Pingable and Pinger actors; each Pinger calls the ping method on its Pingable a fixed number of times in a loop. When --pingables is set to 1 and --pingers is set to a large enough number (more than ~32 in my case), the throughput degrades almost linearly with the number of pingers. These are example numbers in a Linux aarch64 VM with the latest Swift nightly, but the same behavior occurs on a host M1 Air (Swift 5.8.1) as well:

1 pingers... Finished in 204ms (4887565/s), max latency = 41.71us
2 pingers... Finished in 196ms (5094343/s), max latency = 751.739us
4 pingers... Finished in 439ms (2275278/s), max latency = 159.715us
8 pingers... Finished in 467ms (2141133/s), max latency = 219.509us
16 pingers... Finished in 479ms (2085265/s), max latency = 2197.466us
32 pingers... Finished in 550ms (1816451/s), max latency = 447.393us
64 pingers... Finished in 746ms (1339919/s), max latency = 426.476us
128 pingers... Finished in 1068ms (935672/s), max latency = 384.349us
256 pingers... Finished in 1688ms (592305/s), max latency = 975.998us
512 pingers... Finished in 3466ms (288445/s), max latency = 3544.105us
1024 pingers... Finished in 7939ms (125952/s), max latency = 25041.866us
2048 pingers... Finished in 21884ms (45693/s), max latency = 64650.459us
4096 pingers... Finished in 35633ms (28063/s), max latency = 214673.696us
8192 pingers... Finished in 95032ms (10522/s), max latency = 836421.191us

As you can see, with ~8k pingers the throughput is catastrophically low: roughly 460x lower than with a single pinger, even though the total amount of work is the same.

Steps to reproduce

Here's the benchmark code and the shell script that runs it with different parameters:

actor-latency-swift.swift

let clock = ContinuousClock.continuous

actor Pingable {
    private var counter: Int = 0

    func ping() -> Int {
        counter += 1
        return counter
    }

    func getCounter() -> Int {
        return counter
    }
}

actor Pinger {
    private var target: Pingable

    init(_ target: Pingable) {
        self.target = target
    }

    func run(_ n: Int, start: ContinuousClock.Instant, withLatencies: Bool = true) async -> Duration {
        var maxLatency: Duration = .seconds(0)
        if withLatencies {
            var end = clock.now
            maxLatency = end - start
            for _ in 0..<n {
                let callStart = end
                _ = await self.target.ping()
                let callEnd = clock.now
                let latency = callEnd - callStart
                if maxLatency < latency {
                    maxLatency = latency
                }
                end = callEnd
            }
        } else {
            for _ in 0..<n {
                _ = await self.target.ping()
            }
        }
        return maxLatency
    }
}

func main() async {
    var numPingers = 1
    var numPingables = 1
    var count = 10_000_000
    var withLatencies = true

    var i = 1
    let args = CommandLine.arguments
    while i < args.count {
        switch args[i] {
        case "-p":
            i += 1
            numPingers = Int(args[i])!
            numPingables = numPingers
        case "--pingers":
            i += 1
            numPingers = Int(args[i])!
        case "--pingables":
            i += 1
            numPingables = Int(args[i])!
        case "-c", "--count":
            i += 1
            count = Int(args[i])!
        case "--without-latencies":
            withLatencies = false
        default:
            precondition(false, "Unexpected argument \(args[i])")
        }
        i += 1
    }

    var pingables = [Pingable]()
    for _ in 0..<numPingables {
        pingables.append(Pingable())
    }
    var pingers = [Pinger]()
    for i in 0..<numPingers {
        pingers.append(Pinger(pingables[i % numPingables]))
    }

    print("Warming up...")
    let r = count / numPingers / 100
    _ = await pingers[0].run(r, start: clock.now, withLatencies: withLatencies)

    print("Starting...")
    let start = clock.now
    let maxLatency = await withTaskGroup(of: Duration.self) { group in
        let m = count % numPingers
        for i in 0..<numPingers {
            let n = (count / numPingers) + (i < m ? 1 : 0)
            let pinger = pingers[i]
            let wl = withLatencies
            group.addTask {
                return await pinger.run(n, start: start, withLatencies: wl)
            }
        }

        var maxLatency: Duration = .seconds(0)
        for await latency in group {
            if maxLatency < latency {
                maxLatency = latency
            }
        }
        return maxLatency
    }
    let end = clock.now

    var sum = 0
    for pingable in pingables {
        sum += await pingable.getCounter()
    }
    precondition(sum == r + count)

    let elapsed = end - start
    let elapsedSeconds = elapsed / Duration.seconds(1)
    let elapsedMs = Int(elapsedSeconds * 1000)
    let rps = Int(Double(count) / elapsedSeconds)
    let maxLatencyUs = maxLatency / Duration.microseconds(1)
    print("Finished in \(elapsedMs)ms (\(rps)/s), max latency = \(maxLatencyUs)us")
}

await main()

actor-latency-swift.sh

#!/bin/sh
set -e
#export LIBDISPATCH_COOPERATIVE_POOL_STRICT=1
./swiftc -O actor-latency-swift.swift
for p in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192; do
    echo
    echo $p pingers...
    ./actor-latency-swift --pingables 1 --pingers $p -c 1000000
done

Expected behavior

With more pingers the throughput should initially degrade, since more concurrent threads are competing for the same actor, but it should plateau at around the number of cores and not degrade much afterwards. This is because the amount of work in the system is fixed and doesn't change with the number of pingers.

Environment

  • Swift version 5.9-dev (LLVM 91de708cf253c77, Swift 9f786c8)
  • Deployment target: darwin, linux

Additional context

I'm actually writing a C++ actor library (heavily inspired by Swift!) on top of C++ coroutines, so I was naturally comparing my performance numbers to Swift when I noticed this unexpected degradation. Initially I thought this was some kind of thundering-herd issue, where actors might wake up when they shouldn't, but CPU consumption didn't really confirm that (the benchmark uses ~1 core). Unfortunately, trying to test a Swift compiler built from source has not been successful so far. On a Mac I cannot make it produce binaries that link against the freshly built libraries instead of the ones in the system (and the actor-switching code is in the standard library), and on Linux the build needed far too much disk space for the time being.

What I have managed to do so far is profile it with perf, which shows that with thousands of pingers most of the time (>85%) is spent in the following stack:

swift::ListMerger<swift::Job*, (anonymous namespace)::JobQueueTraits>::merge
(anonymous namespace)::ProcessOutOfLineJob::process
swift::runJobInEstablishedExecutorContext
swift_job_run
_dispatch_continuation_pop
...

That must be inside one of the preprocessQueue overloads, which is called from drainOne in defaultActorDrain. I think the problem is that ListMerger is designed for merging sorted lists, which makes sense when different priorities are involved; in my case, however, all priorities are the same. After a Pingable actor replies to a Pinger, that Pinger receives the response and immediately issues a new request, which adds exactly one item to the queue. But to preprocess that one item, preprocessQueue has to walk the entire list, which is linear in the number of pingers (the queue size). That is why contention causes such severe degradation.
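To make the cost concrete, here is a minimal Swift model of the effect (the real code is the C++ ListMerger in the concurrency runtime, not this sketch; FakeJob and mergeOne are made-up names). Keeping the queue sorted by priority with FIFO order inside each priority means that inserting one new job behind N already-queued jobs of the same priority has to walk past all N of them, so the per-message cost grows with the queue length:

final class FakeJob {
    let priority: Int
    var next: FakeJob?
    init(priority: Int) { self.priority = priority }
}

// Insert `job`, keeping the list sorted by descending priority and FIFO within
// each priority, the way a ListMerger-style merge must. When every queued job
// has the same priority this walks the whole list, i.e. O(queue length) per
// enqueued message.
func mergeOne(into head: inout FakeJob?, job: FakeJob) {
    guard let first = head, first.priority >= job.priority else {
        job.next = head
        head = job
        return
    }
    var cur = first
    while let next = cur.next, next.priority >= job.priority {
        cur = next   // with equal priorities, this visits every node
    }
    job.next = cur.next
    cur.next = job
}

With one pingable and P pingers, the pingable's queue can hold on the order of P same-priority jobs, so every processed message pays an O(P) merge, which matches the roughly linear throughput loss in the numbers above.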

I gather priorities are important here (otherwise merging two lists would be a simple pointer swap), so I would suggest splitting actor job queues into two parts: the inverted head, whose nodes have an unprocessed flag set, and a processed part kept in some kind of sorted tree (an intrusive red-black tree if avoiding extra allocations is a hard requirement, although taking the next item would then be more expensive).

At the very least, the second pointer in Job::SchedulerPrivate could be used as a reference to the last job with the same priority in the processed list (valid only at the head of that priority run). That would allow merging large lists of same-priority jobs efficiently, iterating only over the distinct priorities (of which there are probably not many) when more than one is present in the list.
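Here is a sketch of that tail-pointer idea in the same toy Swift model (the real change would live in the C++ runtime's job-queue code; FakeJob2, lastSamePriority, and mergeOneFast are hypothetical names, with lastSamePriority standing in for the reused SchedulerPrivate word). The head job of each run of equal-priority jobs remembers the last job of that run, so appending another job of the same priority is O(1) and a merge steps once per distinct priority rather than once per job:

final class FakeJob2 {
    let priority: Int
    var next: FakeJob2?
    // Valid only on the first job of a run of equal-priority jobs.
    var lastSamePriority: FakeJob2?
    init(priority: Int) { self.priority = priority }
}

func mergeOneFast(into head: inout FakeJob2?, job: FakeJob2) {
    var prev: FakeJob2? = nil
    var cur = head
    // Step once per priority run, not once per job.
    while let runHead = cur, runHead.priority >= job.priority {
        let runTail = runHead.lastSamePriority ?? runHead
        if runHead.priority == job.priority {
            // Same priority: append after the run's tail in O(1) and update the hint.
            job.next = runTail.next
            runTail.next = job
            runHead.lastSamePriority = job
            return
        }
        prev = runTail
        cur = runTail.next
    }
    // No existing run with this priority: start a new run here.
    job.next = cur
    job.lastSamePriority = job
    if let prev { prev.next = job } else { head = job }
}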

Sorry if I couldn't be of more help; this is only my second time looking into the Swift source code.

snaury added the bug and triage needed labels on Sep 3, 2023
snaury commented Sep 4, 2023

For completeness, I confirmed that the degradation goes away (at least up to ~8k actors; after that the allocator seems to struggle to keep up) with this naive patch, where new jobs are always appended to the end of the queue and priorities are completely ignored: https://gist.github.com/snaury/a138b6f91234b46a71d3c2d1540842d8

Fixing it like that is far too naive, though; I imagine a real solution would need to keep different priorities sorted.
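For reference, this is roughly what such a naive queue looks like in a toy Swift model (the actual gist patches the C++ runtime; Node and NaiveQueue are made-up names): a plain head/tail linked list where every enqueue is an O(1) append at the tail and priorities are never consulted.

final class Node {
    var next: Node?
}

struct NaiveQueue {
    private var head: Node?
    private var tail: Node?

    // O(1) regardless of how many jobs are already queued.
    mutating func enqueue(_ node: Node) {
        node.next = nil
        if let tail {
            tail.next = node   // non-empty: link after the current tail
        } else {
            head = node        // empty: the node becomes the head
        }
        tail = node
    }

    // Pop from the head in FIFO order.
    mutating func dequeue() -> Node? {
        guard let node = head else { return nil }
        head = node.next
        if head == nil { tail = nil }
        return node
    }
}

That is why the degradation disappears: the per-message cost no longer depends on the number of waiting pingers, at the cost of ignoring priority ordering entirely.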

ktoso added the concurrency, run-time performance, and concurrency runtime labels and removed the triage needed label on Sep 5, 2023

ktoso commented Sep 15, 2023

Thanks for the great report. Yeah that's not great... we'll see if we can schedule some time to look into this


hassila commented Dec 14, 2023

For the record, we also see the same for some of our measurements.
