
Spawn several threads when we fail to enqueue work in the blocki… #181

Merged
2 commits merged into async-rs:master from tyler_spawn_several_threads on Nov 1, 2019

Conversation

@spacejam (Contributor)

This makes the blocking threadpool spin up far more efficiently when encountering bursts.

@vertexclique (Member)

This implementation will be ~2x slower (assuming 4 cores) due to thread congestion in every scenario and batch allocation under pressure (long bursts, short bursts, no bursts, long-running tasks, etc.), which is not desirable for most systems, including embedded systems, if this library is also targeted at the embedded domain in the future.

Missing parts:

This PR doesn't solve the system boundary problem, which was tested and fixed in #108.
The system boundary problem issue can be seen here: #126

@vertexclique (Member)

I've run the benchmarks; here are the results for this implementation:

running 2 tests
test blocking        ... bench: 567,998,519 ns/iter (+/- 381,638,949)
test blocking_single ... bench:      62,606 ns/iter (+/- 19,482)

@spacejam (Contributor, Author) commented Sep 11, 2019

@vertexclique with my measurements it behaves the same as #108, and sometimes with less jitter, because it's not dependent on the single thread that only adjusts every 200ms. The main thing #108 improves over master is that it will spawn more than one thread at a time. However, as we discussed, #108 is more complex than necessary, and in any case would require significant clean-up before it reaches a mergeable state.

Blocking 10k bench, measuring the spawn latency distribution in microseconds per spawn, on a 2018 i7 MacBook Pro with no foreground processes running other than tmux:

master:

Histogram[(0 -> 2) (50 -> 4) (75 -> 4) (90 -> 5) (95 -> 6) (97.5 -> 10) (99 -> 14) (99.9 -> 38) (99.99 -> 241) (100 -> 6973) ]

#108:

Histogram[(0 -> 1) (50 -> 2) (75 -> 2) (90 -> 3) (95 -> 3) (97.5 -> 5) (99 -> 8) (99.9 -> 44) (99.99 -> 79) (100 -> 5883) ]

this:

Histogram[(0 -> 1) (50 -> 1) (75 -> 1) (90 -> 2) (95 -> 3) (97.5 -> 3) (99 -> 5) (99.9 -> 18) (99.99 -> 364) (100 -> 5431) ]

In general with performance-related code, please test using a proper histogram library like historian to get a sense of tail latency for actual requests; I think it will make conversations around performance more productive. Statements like "this is 2x slower" are not specific or actionable.

From my measurements, I'm getting roughly the same performance, with a small fraction of the additional code and almost none of the math complexity or bespoke thread operational considerations.
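
For readers who haven't looked at #108: the sketch below shows the general shape of an EMA-driven sizing loop with a single adjuster thread on a ~200ms cadence, which is the only detail taken from this thread. The smoothing factor, the backlog metric, and the scale-up rule are invented for illustration; this is not the #108 code.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

// Purely illustrative EMA sizing loop; see the caveats above.
static QUEUED: AtomicUsize = AtomicUsize::new(0); // tasks waiting for a worker
static WORKERS: AtomicUsize = AtomicUsize::new(0); // live worker threads

fn spawn_worker() {
    WORKERS.fetch_add(1, Ordering::Relaxed);
    thread::spawn(|| {
        // A real worker would pull tasks from the pool's channel here.
    });
}

fn start_scaler() {
    thread::spawn(|| {
        let alpha = 0.3; // smoothing factor (assumed)
        let mut ema = 0.0_f64;
        loop {
            thread::sleep(Duration::from_millis(200));
            // Smooth the observed backlog so short spikes don't thrash the pool.
            let backlog = QUEUED.load(Ordering::Relaxed) as f64;
            ema = alpha * backlog + (1.0 - alpha) * ema;
            // Grow toward the smoothed demand, one decision per 200ms tick,
            // which is why a sudden burst can wait several ticks for threads.
            let want = ema.ceil() as usize;
            for _ in WORKERS.load(Ordering::Relaxed)..want {
                spawn_worker();
            }
        }
    });
}

fn main() {
    start_scaler();
    QUEUED.store(32, Ordering::Relaxed); // pretend a burst of work arrived
    thread::sleep(Duration::from_secs(1));
    println!("workers after ~5 ticks: {}", WORKERS.load(Ordering::Relaxed));
}

The point of contention above is exactly that single 200ms decision cadence versus growing the pool immediately at the moment an enqueue fails.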

@spacejam (Contributor, Author) commented Sep 11, 2019

Curious to see your results on more hardware, here's the general recipe:

diff --git a/Cargo.toml b/Cargo.toml
index ca0f3fc..f9f40b3 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -25,6 +25,7 @@ unstable = []

 [dependencies]
 async-task = "1.0.0"
+historian = "*"
 cfg-if = "0.1.9"
 crossbeam-channel = "0.3.9"
 futures-core-preview = "0.3.0-alpha.18"
diff --git a/src/task/blocking.rs b/src/task/blocking.rs
index cb2b891..6cc5a3b 100644
--- a/src/task/blocking.rs
+++ b/src/task/blocking.rs
@@ -45,6 +45,7 @@ lazy_static! {
         let (sender, receiver) = bounded(0);
         Pool { sender, receiver }
     };
+    pub static ref SPAWN_TIME: historian::Histo = historian::Histo::default();
 }

 // Create up to MAX_THREADS dynamic blocking task worker threads.
@@ -108,7 +109,9 @@ where
     R: Send + 'static,
 {
     let (task, handle) = async_task::spawn(future, schedule, ());
+    let now = std::time::Instant::now();
     task.schedule();
+    SPAWN_TIME.measure(now.elapsed().as_micros() as f64);
     JoinHandle(handle)
 }

diff --git a/src/task/mod.rs b/src/task/mod.rs
index eef7284..2c5d8db 100644
--- a/src/task/mod.rs
+++ b/src/task/mod.rs
@@ -36,4 +36,4 @@ mod pool;
 mod sleep;
 mod task;

-pub(crate) mod blocking;
+pub mod blocking;

@spacejam (Contributor, Author)

oh, actual bench code:

λ cat src/bin/blocking_histos.rs
use async_std::task;
use std::thread;
use std::time::Duration;

fn main() {
    // Queue 10k blocking tasks; the handles are kept but never joined, since
    // only the per-spawn latency histogram matters here.
    let _handles = (0..10_000)
        .map(|_| {
            task::blocking::spawn(async {
                let duration = Duration::from_millis(1);
                thread::sleep(duration);
            })
        })
        .collect::<Vec<_>>();

    task::blocking::SPAWN_TIME.print_percentiles();
}

@vertexclique (Member) commented Sep 11, 2019

I think what you are measuring is the spawn time on the pool. It will always be approximately the same in every implementation.

Also, this histogram will give the exact same numbers in every implementation that spawns onto the thread pool in the error case. The only difference will be caused by how many threads you spawn dynamically, and that will only scale the percentile values up toward the 0th percentile. Comparing that time to the EMA PR is comparing apples to nothing, because the spawn doesn't happen there.

> From my measurements, I'm getting roughly the same performance, with a small fraction of the additional code and almost none of the math complexity or bespoke thread operational considerations.

Total job in/out time is one thing; job-in rate overhead is another. You are measuring the latter.


Since the EMA PR didn't fit well (and I also don't think it was well-read), I am going to close it. It is up to you what you want to use.
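
For context, the two quantities being debated can be measured separately. The sketch below is illustrative only: it assumes the same pub mod blocking change from the diff above (so task::blocking::spawn is reachable from a binary) and that the returned JoinHandle can be awaited.

use std::thread;
use std::time::{Duration, Instant};

use async_std::task;

fn main() {
    let start = Instant::now();

    // Job-in rate overhead: how long it takes just to queue 10k blocking tasks.
    let handles = (0..10_000)
        .map(|_| {
            task::blocking::spawn(async {
                thread::sleep(Duration::from_millis(1));
            })
        })
        .collect::<Vec<_>>();
    let spawn_in = start.elapsed();

    // Total job in/out time: queueing plus waiting for every task to finish.
    task::block_on(async {
        for handle in handles {
            handle.await;
        }
    });
    let end_to_end = start.elapsed();

    println!("spawn-in: {:?}, end-to-end: {:?}", spawn_in, end_to_end);
}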

@naftulikay

Has the work on #108, the exponential moving average thread pool design, been closed and rejected? I'm not seeing an implementation of that here in this PR.

@spacejam (Contributor, Author)

@naftulikay this PR does not use an exponential moving average, it just spins up more than one thread at a time to get pretty much the same effect with far less complexity.
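
Roughly, the idea looks like the sketch below. This is an illustration, not the patch itself: the real pool in src/task/blocking.rs schedules async_task tasks, and MAX_THREADS, BATCH, and the one-second idle timeout here are assumed values rather than the crate's constants.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

use crossbeam_channel::{bounded, Receiver, Sender};
use lazy_static::lazy_static;

// Toy stand-in for the pool's work items; the real pool runs async_task tasks.
type Task = Box<dyn FnOnce() + Send + 'static>;

const MAX_THREADS: usize = 10_000; // assumed cap
const BATCH: usize = 4; // workers started per failed enqueue (assumed)

static THREAD_COUNT: AtomicUsize = AtomicUsize::new(0);

lazy_static! {
    // Zero-capacity channel: a send only succeeds if a worker is already
    // parked in recv, so a failed try_send means "no idle workers".
    static ref QUEUE: (Sender<Task>, Receiver<Task>) = bounded(0);
}

fn spawn_worker() {
    THREAD_COUNT.fetch_add(1, Ordering::SeqCst);
    thread::spawn(|| {
        // Serve tasks until this worker sits idle for a while, then exit.
        while let Ok(task) = QUEUE.1.recv_timeout(Duration::from_secs(1)) {
            task();
        }
        THREAD_COUNT.fetch_sub(1, Ordering::SeqCst);
    });
}

pub fn spawn_blocking(task: Task) {
    // Fast path: hand the task to an already-idle worker.
    if let Err(err) = QUEUE.0.try_send(task) {
        // Slow path: every worker is busy. Start several workers at once
        // rather than one, so bursts ramp the pool up quickly, then enqueue.
        let running = THREAD_COUNT.load(Ordering::SeqCst);
        for _ in 0..BATCH.min(MAX_THREADS.saturating_sub(running)) {
            spawn_worker();
        }
        QUEUE.0.send(err.into_inner()).unwrap();
    }
}

fn main() {
    for i in 0..16 {
        spawn_blocking(Box::new(move || println!("blocking task {} done", i)));
    }
    thread::sleep(Duration::from_millis(100)); // let the workers drain
}

In this sketch the zero-capacity channel keeps the backpressure behavior of the bounded(0) channel shown in the diff above; the only difference from a one-at-a-time policy is that a failed enqueue grows the pool by a small batch.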

@gnzlbg commented Sep 22, 2019

Is there a benchmark comparing this with EMA somewhere?

@vertexclique (Member)

@gnzlbg Yes. In the EMA branch there is a benchmark comparing the total time of execution for any implementation of the blocking pool.

You can run it with this branch. I ran and wrote the results in the second comment.

The EMA benchmark result is in the PR description.

@yoshuawuyts (Contributor)

ping @spacejam @stjepang, what would we like to do with this PR?

@ghost force-pushed the tyler_spawn_several_threads branch from 726ad2c to ee50156 on November 1, 2019 at 19:55
@ghost commented Nov 1, 2019

Rebased onto master. Merge if this passes CI :)

@ghost commented Nov 1, 2019

Switched to unbounded channels because that improves our benchmark numbers here by an order of magnitude: https://github.com/jebrosen/async-file-benchmark

Reading a 256K file:

  • tokio: 0.136 sec
  • async_std: 0.086 sec
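
For readers following the channel change, here is a minimal sketch of the semantic difference in crossbeam-channel terms; it illustrates the channel behavior only and is not the actual diff.

use crossbeam_channel::{bounded, unbounded};

fn main() {
    // Rendezvous channel (capacity 0): try_send fails unless a receiver is
    // already blocked in recv, i.e. unless an idle worker is waiting.
    let (tx, _rx) = bounded::<u32>(0);
    assert!(tx.try_send(1).is_err());

    // Unbounded channel: try_send always succeeds and the item is buffered
    // until some worker picks it up, so handing off work never has to wait.
    let (tx, rx) = unbounded::<u32>();
    assert!(tx.try_send(1).is_ok());
    assert_eq!(rx.recv().unwrap(), 1);
}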

@ghost changed the title from "Spawn several threads when we fail to enqueue work in the blocking threadpool" to "Spawn several threads when we fail to enqueue work in the blocki…" on Nov 1, 2019
@ghost merged commit 5adc608 into async-rs:master on Nov 1, 2019
This pull request was closed.