
Executors

A library with high-performance task executors for Rust.


Usage

Add this to your Cargo.toml:

[dependencies]
executors = "0.9"

You can use, for example, the crossbeam_workstealing_pool to schedule a number of jobs (n_jobs) over a number of worker threads (n_workers), and collect the results via an mpsc::channel.

use executors::*;
use executors::crossbeam_workstealing_pool;
use std::sync::mpsc::channel;

fn main() {
    let n_workers = 4;
    let n_jobs = 8;
    // A work-stealing pool sized for a small number of worker threads.
    let pool = crossbeam_workstealing_pool::small_pool(n_workers);

    let (tx, rx) = channel();
    for _ in 0..n_jobs {
        let tx = tx.clone();
        pool.execute(move || {
            tx.send(1).expect("channel will be there waiting for the pool");
        });
    }

    // Collect one result per job and sum them up.
    assert_eq!(rx.iter().take(n_jobs).fold(0, |a, b| a + b), 8);
}

Rust Version

Requires at least Rust 1.26, due to crossbeam-channel requirements.

Deciding on an Implementation

To select an Executor implementation, it is best to test your exact requirements on the target hardware. The crate executor-performance provides a performance testing suite for the provided implementations. To use it, clone this repository, run cargo build --release, and then check target/release/executor-performance --help to see the available options. There is also a small script to test thread-scaling with some reasonable default options.
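
For example:

git clone https://github.com/Bathtor/rust-executors
cd rust-executors
cargo build --release
target/release/executor-performance --help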

In general, crossbeam_workstealing_pool works best for workloads where the tasks on the worker threads spawn more and more tasks. If all tasks are spawned from a single thread that isn't part of the thread pool, then crossbeam_channel_pool tends to perform best.
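
For reference, a minimal sketch of the single-external-thread case using the channel-based pool, assuming the same Executor trait API as in the usage example above:

use executors::*;
use executors::crossbeam_channel_pool::ThreadPool;
use std::sync::mpsc::channel;

fn main() {
    // All tasks are submitted from this single external thread,
    // the workload where the channel-based pool tends to perform best.
    let pool = ThreadPool::new(4);
    let (tx, rx) = channel();
    for _ in 0..100 {
        let tx = tx.clone();
        pool.execute(move || {
            tx.send(1).expect("receiver should still be alive");
        });
    }
    assert_eq!(rx.iter().take(100).sum::<i32>(), 100);
}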

If you don't know what hardware your code is going to run on, use the crossbeam_workstealing_pool. It tends to perform best on all the hardware I have tested (which is pretty much Intel processors like i7 and Xeon).

If you absolutely need low response times to bursty workloads, you can compile the crate with the ws-no-park feature, which prevents the workers in the crossbeam_workstealing_pool from parking their threads when all task-queues are temporarily empty. This will, of course, not play well with other tasks running on the same system, although the threads still yield to the OS scheduler in between queue checks. See the latency results below to get an idea of the performance impact of this feature.
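
To enable the feature, adjust the dependency in your Cargo.toml:

[dependencies]
executors = { version = "0.9", features = ["ws-no-park"] }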

Core Affinity

You can enable support for pinning pool threads to particular CPU cores via the "thread-pinning" feature. By default, it will then pin as many threads as there are core ids. If you request more threads than that, the remainder will be created unpinned. You can also assign only a subset of your core ids to a thread pool by using the with_affinity(...) function instead of new(...).
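
As an illustrative sketch only (the exact constructor signature is an assumption here; check the crate documentation for the real API), pinning a pool to a subset of cores might look like this, using the core_affinity crate to discover core ids:

use executors::*;
use executors::crossbeam_channel_pool::ThreadPool;

fn main() {
    // Discover the available core ids (core_affinity is a separate crate).
    let core_ids = core_affinity::get_core_ids().expect("failed to get core ids");
    // Hypothetical: pin the pool's threads to the first two cores only,
    // using with_affinity(...) in place of new(...).
    let pool = ThreadPool::with_affinity(core_ids[0..2].to_vec());
    // ...submit work via pool.execute(...) as usual...
}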

Some Numbers

The following are some example results from my desktop machine (Intel i7-4770 @ 3.40GHz quad-core with HT, i.e. 8 logical cores, and 16GB of RAM). Note that they are all from a single run and thus not particularly scientific, and subject to whatever else was running on my system at the time.

Implementation abbreviations:

  • TP: threadpool_executor
  • CBCP: crossbeam_channel_pool
  • CBWP: crossbeam_workstealing_pool

Throughput

These experiments measure the throughput of a fully loaded executor implementation.

Low Amplification

Testing command (where $NUM_THREADS corresponds to #Threads in the table below):

target/release/executor-performance -t $NUM_THREADS -p 2 -m 10000000 -a 1 throughput --pre 10000 --post 10000

This corresponds to an I/O-handling-server-style workload, where the vast majority of tasks come in via the external queue, and only very few are spawned from within the executor's threads.

(Units are in million tasks/s)

| #Threads | TP | CBCP | CBWP (default features) | CBWP (ws-no-park) | CBWP (no ws-timed-fairness) |
|----------|-----|------|-------------------------|-------------------|-----------------------------|
| 1 | 2.4 | 3.1 | 3.4 | 3.5 | 3.5 |
| 2 | 1.2 | 2.8 | 3.3 | 3.4 | 3.4 |
| 3 | 1.5 | 3.1 | 4.0 | 4.0 | 4.0 |
| 4 | 1.2 | 3.6 | 4.4 | 4.3 | 4.3 |
| 5 | 1.3 | 3.6 | 4.2 | 4.2 | 4.2 |
| 6 | 1.2 | 3.7 | 4.2 | 4.2 | 4.1 |
| 7 | 1.1 | 3.7 | 3.7 | 4.1 | 3.3 |
| 8 | 1.1 | 3.4 | 2.4 | 4.0 | 1.8 |
| 9 | 1.1 | 3.4 | 1.5 | 3.6 | 1.6 |

High Amplification

Testing command (where $NUM_THREADS corresponds to #Threads in the table below):

target/release/executor-performance -t $NUM_THREADS -p 2 -m 1000 -a 50000 throughput --pre 10000 --post 10000

This corresponds to a message-passing-style (or fork-join) workload, where the vast majority of tasks are spawned from within the executor's threads and only relatively few come in via the external queue.

(Units are in million tasks/s)

| #Threads | TP | CBCP | CBWP (default features) | CBWP (ws-no-park) | CBWP (no ws-timed-fairness) |
|----------|-----|------|-------------------------|-------------------|-----------------------------|
| 1 | 6.8 | 9.4 | 10.9 | 10.9 | 10.9 |
| 2 | 2.9 | 5.6 | 7.4 | 6.8 | 7.4 |
| 3 | 3.3 | 6.7 | 9.0 | 8.3 | 8.5 |
| 4 | 2.9 | 7.4 | 10.0 | 10.6 | 10.3 |
| 5 | 2.8 | 8.0 | 11.1 | 12.4 | 12.0 |
| 6 | 2.6 | 8.8 | 11.8 | 12.3 | 12.6 |
| 7 | 2.5 | 9.0 | 12.8 | 12.7 | 12.5 |
| 8 | 2.5 | 8.8 | 14.1 | 12.9 | 13.2 |
| 9 | 2.4 | 8.9 | 13.4 | 12.7 | 12.7 |

Latency

These experiments measure the start-to-finish time (latency) of running a bursty workload, while sleeping in between runs. This is what typically happens in a network server that isn't fully loaded: it waits for a message, runs a series of tasks based on that message, then responds and waits for the next.

These experiments report averages collected over multiple measurements with a relative standard error (RSE) below 10%.

Small Bursts / Low Amplification

Testing command:

target/release/executor-performance -t 6 -p 1 -m 100 -a 1 latency -s 100

This corresponds to a very small internal workload in response to every external message. Sleep time between bursts is 100ms.

| Implementation | Mean | Max/Min |
|----------------|------|---------|
| TP | 0.539 ms | ±0.321 ms |
| CBCP | 0.442 ms | ±0.225 ms |
| CBWP | 0.372 ms | ±0.218 ms |
| CBWP (ws-no-park) | 0.072 ms | ±0.182 ms |

Medium Bursts / High Amplification

Testing command:

target/release/executor-performance -t 6 -p 1 -m 100 -a 100 latency -s 100

This corresponds to a significant internal workload in response to every external message. Sleep time between bursts is 100ms.

| Implementation | Mean | Max/Min |
|----------------|------|---------|
| TP | 9.062 ms | ±3.485 ms |
| CBCP | 4.215 ms | ±2.784 ms |
| CBWP | 3.941 ms | ±2.241 ms |
| CBWP (ws-no-park) | 0.343 ms | ±3.205 ms |

Large Bursts / Very High Amplification

Testing command:

target/release/executor-performance -t 6 -p 1 -m 100 -a 10000 latency -s 100

This corresponds to a very large internal workload in response to every external message. Sleep time between bursts is 100ms.

| Implementation | Mean | Max/Min |
|----------------|------|---------|
| TP | 381.023 ms | ±3.249 ms |
| CBCP | 124.666 ms | ±2.315 ms |
| CBWP | 79.532 ms | ±41.626 ms |
| CBWP (ws-no-park) | 51.867 ms | ±43.893 ms |

License

Licensed under the terms of the MIT license.

See LICENSE for details.