Swift Actor/Tasks concurrency on Linux - Lock contention in rand() #760

Closed
freef4ll opened this issue May 23, 2022 · 18 comments · Fixed by #804

freef4ll commented May 23, 2022

We have a workload pipeline that chains several thousand Actors to each other via an AsyncStream processing pipeline.
There is a multiplication effect: a single event at the start of the processing pipeline is amplified, as it is delivered to several Tasks that process the events concurrently. The processing time of each wakeup is currently quite small, in the range of a few microseconds.

Under Linux, when stressing this processing pipeline, we observed that ~45% of the sampled stacks show __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__(), which leads to lock contention in glibc rand(), as the ~60 threads that are created all contend here:

            7f193794a2db futex_wait+0x2b (inlined)
            7f193794a2db __GI___lll_lock_wait_private+0x2b (inlined)
            7f19378ff29b __random+0x6b (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f19378ff76c rand+0xc (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f1937bac612 __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__+0x12 (/usr/lib/swift/linux/libdispatch.so)

This occurs on every entry to DISPATCH_ROOT_QUEUE_CONTENDED_WAIT(), which uses the macro _dispatch_contention_wait_until(), which in turn uses _dispatch_contention_spins(). This is where the rand() call comes from, and the macro produces only four possible values - 31, 63, 95 and 127 - for how many pause/yield instructions to execute.
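
For illustration, here is a minimal standalone C sketch (not the libdispatch source) of what _dispatch_contention_spins() boils down to on Linux; the 31/127 bounds are assumed here only because they match the four values listed above:

#include <stdio.h>
#include <stdlib.h>

#define SPINS_MIN 31   /* 0b0011111: OR-ing this sets the low five bits */
#define SPINS_MAX 127  /* 0b1111111: AND-ing this keeps only bits 0..6  */

static unsigned int contention_spins(void)
{
    /* glibc's rand() takes an internal lock (visible as __random ->
     * __lll_lock_wait_private in the stacks above), which is where the
     * ~60 worker threads serialize. */
    return ((unsigned int)rand() & SPINS_MAX) | SPINS_MIN;
}

int main(void)
{
    /* Only bits 5 and 6 of the rand() result survive the mask/OR,
     * so the spin count is always 31, 63, 95 or 127. */
    for (int i = 0; i < 8; i++)
        printf("%u\n", contention_spins());
    return 0;
}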

The following example reproduces the issue, with ~28% of sampled time spent in the code path mentioned.
The example creates 5000 tasks which each do between 1 μs and 3 μs of work and then sleep for a random 6-10 milliseconds. The point of the test is to create contention and illustrate the issue with rand():

// $ swift package init --type executable --name RandomTasks
// $ cat  Sources/RandomTasks/main.swift &&  swift run -c release

import Foundation

let numberOfTasks = 5000
let randomSleepRangeMs: ClosedRange<UInt64> = 6 ... 10

// correlates closely to processing amount in micros
let randomWorkRange: ClosedRange<UInt32> = 1 ... 3

@available(macOS 10.15, *)
func smallInfinitiveTask() async {
    let randomWork = UInt32.random(in: randomWorkRange)
    let randomSleepNs = UInt64.random(in: randomSleepRangeMs) * 1_000_000
    print("Task start; sleep: \(randomSleepNs) ns, randomWork: \(randomWork) ")

    while true {
        do {
            var x2: String = ""
            x2.reserveCapacity(2000)
            for _ in 1 ... 50 * randomWork {
                x2 += "hi"
            }
            // Thread.sleep(forTimeInterval: 0.001) // 1ms
            try await Task.sleep(nanoseconds: randomSleepNs)
        } catch {}
    }
}

@available(macOS 10.15, *)
func startLotsOfTasks(_ tasks: Int) {
    for _ in 1 ... tasks {
        Task {
            await smallInfinitiveTask()
        }
    }
}

if #available(macOS 10.15, *) {
    startLotsOfTasks(numberOfTasks)
} else {
    // Fallback on earlier versions
    print("Unsupported")
}

sleep(600)

When run on a Ryzen 5950X system, 18-19 hyper-threaded cores are spent processing the workload, while on an M1 Pro it is only ~4.

(attached profiler screenshot: rand-contention)

@freef4ll (Author)

A semi-related issue was present on Windows; it was resolved with #453 / #455.

ktoso added the Linux label May 23, 2022
hassila commented May 24, 2022

Would it be possible to also use the Windows optimization on Linux to work around this?

weissi (Member) commented May 25, 2022

CC @rokhinip

@ilya-fedin (Contributor)

Is there any plan to solve this issue? It's approaching its third year since it was reported :(

weissi (Member) commented Nov 27, 2023

CC @ktoso / @rokhinip / @rjmccall - do you know what the plans are w.r.t. the Concurrency runtime and dispatch? Clearly there are a bunch of things that need addressing, and corelibs-dispatch isn't getting updated...

@ilya-fedin (Contributor)

Silence :(

ktoso (Member) commented Dec 13, 2023

Hi everyone,
we've been doing a lot of thinking about the executor pool provided here by (corelibs) Dispatch for Swift Concurrency.
We're leaning towards a fresh thread-pool implementation tailored specifically to Swift Concurrency's needs, rather than thinking in terms of the DispatchQueue APIs. Instead of explicitly having the dispatch APIs back Swift Concurrency 1:1 on various platforms, I think we're more interested in specialized executor implementations per platform, which Swift Concurrency then uses.

Sadly this is no small feat and a large project in itself. We don't currently have more details to share on the long term.

We'd certainly welcome any PRs that could help remove the short-term pain, but medium-to-long term we think replacing the executors backing Swift Concurrency may be a preferable direction here.

@ilya-fedin (Contributor)

What does this mean for me if I use libdispatch directly in a C++ application as a lock-free thread pool implementation? Is this variant of libdispatch officially discontinued?

@ilya-fedin (Contributor)

We'd certainly welcome any PRs that could help remove the short-term pain

Would such a PR be ok?

diff --git a/src/shims/yield.h b/src/shims/yield.h
index 53eb800..c8e5eed 100644
--- a/src/shims/yield.h
+++ b/src/shims/yield.h
@@ -98,7 +98,7 @@ void *_dispatch_wait_for_enqueuer(void **ptr);
 #define _dispatch_contention_spins() \
                ((DISPATCH_CONTENTION_SPINS_MIN) + ((DISPATCH_CONTENTION_SPINS_MAX) - \
                (DISPATCH_CONTENTION_SPINS_MIN)) / 2)
-#elif defined(_WIN32)
+#else
 // Use randomness to prevent threads from resonating at the same frequency and
 // permanently contending. Windows doesn't provide rand_r(), so use a simple
 // LCG. (msvcrt has rand_s(), but its security guarantees aren't optimal here.)
@@ -108,12 +108,6 @@ void *_dispatch_wait_for_enqueuer(void **ptr);
                os_atomic_store(&_seed, _next * 1103515245 + 12345, relaxed); \
                ((_next >> 24) & (DISPATCH_CONTENTION_SPINS_MAX)) | \
                                (DISPATCH_CONTENTION_SPINS_MIN); })
-#else
-// Use randomness to prevent threads from resonating at the same
-// frequency and permanently contending.
-#define _dispatch_contention_spins() ({ \
-               ((unsigned int)rand() & (DISPATCH_CONTENTION_SPINS_MAX)) | \
-                               (DISPATCH_CONTENTION_SPINS_MIN); })
 #endif
 #define _dispatch_contention_wait_until(c) ({ \
                bool _out = false; \
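
For context, here is a minimal standalone C sketch (not the libdispatch source) of the lock-free LCG approach the retained branch uses, with C11 atomics standing in for libdispatch's os_atomic wrappers and the same assumed 31/127 bounds:

#include <stdatomic.h>
#include <stdio.h>

#define SPINS_MIN 31
#define SPINS_MAX 127

static unsigned int contention_spins_lcg(void)
{
    /* A process-wide seed updated with relaxed atomics: no lock is taken,
     * so concurrent callers no longer serialize on glibc's rand() lock. */
    static _Atomic unsigned int seed = 1;
    unsigned int next = atomic_load_explicit(&seed, memory_order_relaxed);
    atomic_store_explicit(&seed, next * 1103515245u + 12345u,
                          memory_order_relaxed);
    /* The high bits of the LCG state are used, but the result is still one
     * of the same four spin counts: 31, 63, 95 or 127. */
    return ((next >> 24) & SPINS_MAX) | SPINS_MIN;
}

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%u\n", contention_spins_lcg());
    return 0;
}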

rjmccall (Member) commented Dec 14, 2023

Is this variant of libdispatch officially discontinued?

No. The Swift project is just considering whether it should continue to use libdispatch as the standard implementation of our thread pool on non-Darwin platforms.

What does this mean for me if I use libdispatch directly in a C++ application as a lock-free thread pool implementation?

Without any changes to make the thread pools coordinate, they'd both create their own threads. Out of the box, that means you could end up with 2 * ncpus threads — or more, given that normal uses of libdispatch don't hard-cap the thread count. IIRC, you can manually cap libdispatch's thread usage with environment variables, and I assume we'd want to offer the same capability with a Swift thread pool. We might also choose to offer direct APIs to schedule work on Swift's thread pool.

@ilya-fedin (Contributor)

Without any changes to make the thread pools coordinate, they'd both create their own threads.

Both? Perhaps there's a misunderstanding: I don't use Swift, I just use libdispatch.

@rjmccall (Member)

We'd certainly welcome any PRs that could help remove the short-term pain

Would such a PR be ok?

That seems abstractly right; could you adjust the comment and make a real PR?

@rjmccall (Member)

Without any changes to make the thread pools coordinate, they'd both create their own threads.

Both? Perhaps there's a misunderstanding: I don't use Swift, I just use libdispatch.

Then you would not be affected by Swift's use of a different thread pool implementation.

@ilya-fedin (Contributor)

Well, my question was about what this means for libdispatch bug priority & maintenance in general :)

ilya-fedin added a commit to ilya-fedin/swift-corelibs-libdispatch that referenced this issue Dec 14, 2023
@ilya-fedin (Contributor)

#804

hassila commented Dec 14, 2023

@ilya-fedin, thanks for the PR - we'll try to rerun the original workload test (@freef4ll, could you please give it a spin, perhaps with a benchmark, so we can get before/after numbers and add them to #804 as further input).

We'd certainly welcome any PRs that could help remove the short-term pain, but medium-to-long term we think replacing the executors backing Swift Concurrency may be a preferable direction here.

@ktoso and @rjmccall - thanks for clarifying your thoughts on the possible future direction of the shared concurrency pool for Swift; I think the approach in the quoted sentence above would be pragmatic.

I think what you suggest is the right call (I remember when we spent some time getting libdispatch to work on Solaris back in the day - the mach/kqueue bits make it a bit challenging to keep the codebase in sync, as has been seen, and Windows doesn't make it easier).

I also know that the needs of various users on different platforms can differ quite a bit (e.g. back in the day we wanted to prioritise latency over energy efficiency and had a spin thread for picking up new work as a configurable default; for data-center usage with many cores available, it was a great improvement for the workloads we ran). One could envision a few different variations even on the same OS platform. It would be nice to structure the code base of such an API so that not only are multiple platforms easy to support, but a couple of variants per platform are also possible.

Also, just to put one more need into your thought process: it's desirable to be able to pin executors to a given thread, and it'd be nice if a future API made that fairly straightforward (not sure what the interaction between the concurrent pool and additional such threads would look like, but it'd be nice if it could be managed by the same code base...). This is especially interesting for I/O threads, where one might pin the thread to a specific core designated to handle the interrupts from the network card (with a user-land networking stack, this allows low-latency processing of inbound packets with good cache behaviour...).

freef4ll (Author) commented Dec 14, 2023

#804 helps the CPU usage of our stress workload: a reduction from 100% utilisation of 30 CPU cores to ~18 cores on a 32-core system. Sadly, the throughput numbers of the workload do not change.

The lock contention in rand() is now replaced by ~20% of time spent in _dispatch_root_queue_mediator_is_gone():

(attached profiler screenshot: rand-patch-14318-2023-12-14_114949)

When a larger number of Actors is present, apple/swift#68299 reproduces.

ktoso closed this as completed in #804 on Jan 8, 2024
ktoso (Member) commented Jan 8, 2024

Thanks for verifying on your end @freef4ll

Merged, and it should be part of 5.9.3: https://forums.swift.org/t/development-open-for-swift-5-9-3-for-linux-and-windows/69020
