Swift Actor/Tasks concurrency on Linux - Lock contention in rand() #760

Closed
freef4ll opened this issue May 23, 2022 · 18 comments · Fixed by #804

freef4ll commented May 23, 2022

We have a workload pipeline that chains several thousand Actors to each other via an AsyncStream processing pipeline.
There is a multiplication effect: a single event at the start of the processing pipeline is amplified, as it is delivered to several Tasks that process the events concurrently. The processing time of each wakeup is currently quite small, in the range of a few microseconds.

Under Linux, when stressing this processing pipeline, we observed that ~45% of the sampled stacks show __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__(), which leads to lock contention in glibc rand(), as the ~60 threads that are created all contend here:

            7f193794a2db futex_wait+0x2b (inlined)
            7f193794a2db __GI___lll_lock_wait_private+0x2b (inlined)
            7f19378ff29b __random+0x6b (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f19378ff76c rand+0xc (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f1937bac612 __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__+0x12 (/usr/lib/swift/linux/libdispatch.so)

This occurs on every entry to DISPATCH_ROOT_QUEUE_CONTENDED_WAIT(), which uses the macro _dispatch_contention_wait_until(), which in turn uses _dispatch_contention_spins(). This is where the rand() call comes from, and the macro produces only four possible values - 31, 63, 95 and 127 - for how many pause/yield instructions to execute.
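
For illustration, here is a minimal standalone C sketch (not the libdispatch source) of what _dispatch_contention_spins() boils down to on Linux; the 31/127 bounds are assumed here only because they match the four values listed above:

#include <stdio.h>
#include <stdlib.h>

#define SPINS_MIN 31   /* 0b0011111: OR-ing this sets the low five bits */
#define SPINS_MAX 127  /* 0b1111111: AND-ing this keeps only bits 0..6  */

static unsigned int contention_spins(void)
{
    /* glibc's rand() takes an internal lock (visible as __random ->
     * __lll_lock_wait_private in the stacks above), which is where the
     * ~60 worker threads serialize. */
    return ((unsigned int)rand() & SPINS_MAX) | SPINS_MIN;
}

int main(void)
{
    /* Only bits 5 and 6 of the rand() result survive the mask/OR,
     * so the spin count is always 31, 63, 95 or 127. */
    for (int i = 0; i < 8; i++)
        printf("%u\n", contention_spins());
    return 0;
}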

The following example reproduces the issue, with ~28% of sampled time spent in the code path mentioned.
The example creates 5000 tasks which each do between 1 μs and 3 μs of work and then sleep for a random 6-10 milliseconds. The point of the test is to create contention and illustrate the issue with rand():

// $ swift package init --type executable --name RandomTasks
// $ cat  Sources/RandomTasks/main.swift &&  swift run -c release

import Foundation

let numberOfTasks = 5000
let randomSleepRangeMs: ClosedRange<UInt64> = 6 ... 10

// correlates closely to processing amount in micros
let randomWorkRange: ClosedRange<UInt32> = 1 ... 3

@available(macOS 10.15, *)
func smallInfinitiveTask() async {
    let randomWork = UInt32.random(in: randomWorkRange)
    let randomSleepNs = UInt64.random(in: randomSleepRangeMs) * 1_000_000
    print("Task start; sleep: \(randomSleepNs) ns, randomWork: \(randomWork) ")

    while true {
        do {
            var x2: String = ""
            x2.reserveCapacity(2000)
            for _ in 1 ... 50 * randomWork {
                x2 += "hi"
            }
            // Thread.sleep(forTimeInterval: 0.001) // 1ms
            try await Task.sleep(nanoseconds: randomSleepNs)
        } catch {}
    }
}

@available(macOS 10.15, *)
func startLotsOfTasks(_ tasks: Int) {
    for _ in 1 ... tasks {
        Task {
            await smallInfinitiveTask()
        }
    }
}

if #available(macOS 10.15, *) {
    startLotsOfTasks(numberOfTasks)
} else {
    // Fallback on earlier versions
    print("Unsupported")
}

sleep(600)

When run on a Ryzen 5950X system, 18-19 hyper-threaded cores are spent processing the workload, while on an M1 Pro it is only ~4.

(attached profiler screenshot: rand-contention)

@freef4ll (Author)

A semi-related issue was present on Windows; it was resolved with #453 / #455.

ktoso added the Linux label May 23, 2022
hassila commented May 24, 2022

Would it be possible to also use the Windows optimization on Linux to work around this?

weissi (Member) commented May 25, 2022

CC @rokhinip

@ilya-fedin (Contributor)

Is there any plan to solve this issue? It's approaching its third year since it was reported :(

weissi (Member) commented Nov 27, 2023

CC @ktoso / @rokhinip / @rjmccall - do you know what the plans are w.r.t. the Concurrency runtime and dispatch? Clearly there are a bunch of things that need addressing, and corelibs-dispatch isn't getting updated...

@ilya-fedin (Contributor)

Silence :(

ktoso (Member) commented Dec 13, 2023

Hi everyone,
we've been doing a lot of thinking about the executor pool provided here by (corelibs) Dispatch for Swift Concurrency.
We're leaning towards a fresh thread-pool implementation tailored specifically to Swift Concurrency's needs, rather than thinking in terms of the DispatchQueue APIs. Instead of explicitly having the dispatch APIs back Swift Concurrency 1:1 on various platforms, I think we're more interested in specialized executor implementations per platform, which Swift Concurrency then uses.

Sadly this is no small feat and a large project in itself. We don't currently have more details to share on the long term.

We'd certainly welcome any PRs that could help remove the short-term pain, but medium-to-long term we think replacing the executors backing Swift Concurrency may be a preferable direction here.

@ilya-fedin (Contributor)

What does this mean for me if I use libdispatch directly in a C++ application as a lock-free thread pool implementation? Is this variant of libdispatch officially discontinued?

@ilya-fedin (Contributor)

We'd certainly welcome any PRs that could help remove the short-term pain

Would such a PR be ok?

diff --git a/src/shims/yield.h b/src/shims/yield.h
index 53eb800..c8e5eed 100644
--- a/src/shims/yield.h
+++ b/src/shims/yield.h
@@ -98,7 +98,7 @@ void *_dispatch_wait_for_enqueuer(void **ptr);
 #define _dispatch_contention_spins() \
                ((DISPATCH_CONTENTION_SPINS_MIN) + ((DISPATCH_CONTENTION_SPINS_MAX) - \
                (DISPATCH_CONTENTION_SPINS_MIN)) / 2)
-#elif defined(_WIN32)
+#else
 // Use randomness to prevent threads from resonating at the same frequency and
 // permanently contending. Windows doesn't provide rand_r(), so use a simple
 // LCG. (msvcrt has rand_s(), but its security guarantees aren't optimal here.)
@@ -108,12 +108,6 @@ void *_dispatch_wait_for_enqueuer(void **ptr);
                os_atomic_store(&_seed, _next * 1103515245 + 12345, relaxed); \
                ((_next >> 24) & (DISPATCH_CONTENTION_SPINS_MAX)) | \
                                (DISPATCH_CONTENTION_SPINS_MIN); })
-#else
-// Use randomness to prevent threads from resonating at the same
-// frequency and permanently contending.
-#define _dispatch_contention_spins() ({ \
-               ((unsigned int)rand() & (DISPATCH_CONTENTION_SPINS_MAX)) | \
-                               (DISPATCH_CONTENTION_SPINS_MIN); })
 #endif
 #define _dispatch_contention_wait_until(c) ({ \
                bool _out = false; \
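
For context, here is a minimal standalone C sketch (not the libdispatch source) of the lock-free LCG approach the retained branch uses, with C11 atomics standing in for libdispatch's os_atomic wrappers and the same assumed 31/127 bounds:

#include <stdatomic.h>
#include <stdio.h>

#define SPINS_MIN 31
#define SPINS_MAX 127

static unsigned int contention_spins_lcg(void)
{
    /* A process-wide seed updated with relaxed atomics: no lock is taken,
     * so concurrent callers no longer serialize on glibc's rand() lock. */
    static _Atomic unsigned int seed = 1;
    unsigned int next = atomic_load_explicit(&seed, memory_order_relaxed);
    atomic_store_explicit(&seed, next * 1103515245u + 12345u,
                          memory_order_relaxed);
    /* The high bits of the LCG state are used, but the result is still one
     * of the same four spin counts: 31, 63, 95 or 127. */
    return ((next >> 24) & SPINS_MAX) | SPINS_MIN;
}

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%u\n", contention_spins_lcg());
    return 0;
}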

rjmccall (Member) commented Dec 14, 2023

Is this variant of libdispatch officially discontinued?

No. The Swift project is just considering whether it should continue to use libdispatch as the standard implementation of our thread pool on non-Darwin platforms.

What does this mean for me if I use libdispatch directly in a C++ application as a lock-free thread pool implementation?

Without any changes to make the thread pools coordinate, they'd both create their own threads. Out of the box, that means you could end up with 2 * ncpus threads — or more, given that normal uses of libdispatch don't hard-cap the thread count. IIRC, you can manually cap libdispatch's thread usage with environment variables, and I assume we'd want to offer the same capability with a Swift thread pool. We might also choose to offer direct APIs to schedule work on Swift's thread pool.

@ilya-fedin (Contributor)

Without any changes to make the thread pools coordinate, they'd both create their own threads.

Both? Perhaps there's a misunderstanding: I don't use Swift, I just use libdispatch.

@rjmccall (Member)

We'd certainly welcome any PRs that could help remove the short-term pain

Would such a PR be ok?

That seems abstractly right; could you adjust the comment and make a real PR?

@rjmccall (Member)

Without any changes to make the thread pools coordinate, they'd both create their own threads.

Both? Perhaps there's a misunderstanding: I don't use Swift, I just use libdispatch.

Then you would not be affected by Swift's use of a different thread pool implementation.

@ilya-fedin (Contributor)

Well, my question was about what this means for libdispatch bug priority & maintenance in general :)

ilya-fedin added a commit to ilya-fedin/swift-corelibs-libdispatch that referenced this issue Dec 14, 2023
@ilya-fedin (Contributor)

#804

hassila commented Dec 14, 2023

@ilya-fedin, thanks for the PR - we'll try to rerun the original workload test (@freef4ll, could you please give it a spin, perhaps with a benchmark, so we can get before/after numbers and add them to #804 as further input).

We'd certainly welcome any PRs that could help remove the short-term pain, but medium-to-long term we think replacing the executors backing Swift Concurrency may be a preferable direction here.

@ktoso and @rjmccall - thanks for clarifying your thoughts on the possible future direction of the shared concurrency pool for Swift; I think the approach in the quoted sentence above would be pragmatic.

I think what you suggest is the right call (I remember when we spent some time getting libdispatch to work on Solaris back in the day - the mach/kqueue bits make it a bit challenging to keep the codebase in sync, as has been seen, and Windows doesn't make it easier).

I also know that the needs of various users on different platforms can differ quite a bit (e.g. back in the day we wanted to prioritise latency over energy efficiency and had a spin thread for picking up new work as a configurable default; for data-center usage with many cores available, it was a great improvement for the workloads we ran). One could envision a few different variations even on the same OS platform. It would be nice to structure the code base of such an API so that not only are multiple platforms easy to support, but a couple of variants per platform are also possible.

Also, just to put one more need into your thought process: it's desirable to be able to pin executors to a given thread, and it'd be nice if a future API made that fairly straightforward (not sure what the interaction between the concurrent pool and additional such threads would look like, but it'd be nice if it could be managed by the same code base...). This is especially interesting for I/O threads, where one might pin the thread to a specific core designated to handle the interrupts from the network card (with a user-land networking stack, this allows low-latency processing of inbound packets with good cache behaviour...).

freef4ll (Author) commented Dec 14, 2023

#804 helps the CPU usage of our stress workload: a reduction from 100% utilisation of 30 CPU cores to ~18 cores on a 32-core system. Sadly, the throughput numbers of the workload do not change.

The lock contention in rand() is now replaced by ~20% of time spent in _dispatch_root_queue_mediator_is_gone():

(attached profiler screenshot: rand-patch-14318-2023-12-14_114949)

When a larger number of Actors is present, apple/swift#68299 reproduces.

ktoso closed this as completed in #804 on Jan 8, 2024
ktoso (Member) commented Jan 8, 2024

Thanks for verifying on your end @freef4ll

Merged, and it should be part of 5.9.3: https://forums.swift.org/t/development-open-for-swift-5-9-3-for-linux-and-windows/69020
