
EMA based statistically adaptive thread pool design #108

Conversation

@vertexclique (Member) commented Aug 24, 2019

This is an adaptive thread pool implementation driven by exponential moving averages (it does not rely on idle time).
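To illustrate the core idea, here is a minimal sketch (illustrative names, not the exact code in the diff): the pool samples how frequently blocking tasks arrive, smooths those samples with an EMA, and scales up only when the smoothed rate is rising.

```rust
use std::collections::VecDeque;

// Weighted EMA over a sliding window of frequency samples.
// Assumes the front of the queue holds the newest sample.
fn calc_ema(freq_queue: &VecDeque<u64>) -> f64 {
    let alpha = 2.0 / (freq_queue.len() as f64 + 1.0);
    freq_queue
        .iter()
        .enumerate()
        // Each step back in time decays a sample's weight geometrically.
        .map(|(i, &freq)| alpha * (1.0 - alpha).powi(i as i32) * freq as f64)
        .sum()
}

// Spawn an extra blocking thread only when the smoothed request rate rises.
fn should_scale_up(freq_queue: &VecDeque<u64>, prev_ema: f64) -> bool {
    calc_ema(freq_queue) > prev_ema
}
```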

Here are the benchmarks:

The new statistical EMA approach:

  • CPU utilization ~97%
running 1 test
test blocking ... bench:  39,116,408 ns/iter (+/- 5,686,213)

The current approach:

  • CPU utilization ~88%
running 1 test
test blocking ... bench: 254,204,382 ns/iter (+/- 214,099,973)

This approach yields a 5.5x speedup.
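For reference, the figure follows directly from the two means above: (254,204,382 − 39,116,408) / 39,116,408 ≈ 5.5.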

On top of that, this PR intrinsically solves #65, since it is not nice to manually include modules at the benchmark level.

Documentation

blocking doc comment

@yoshuawuyts (Member) commented Aug 26, 2019

I'm by no means qualified to review this code, but the results are looking very cool! Awesome work @vertexclique! 🎉

@stjepang requested a review from @spacejam Aug 26, 2019


@vertexclique force-pushed the vertexclique:ema-based-statistically-adaptive-thread-pool branch from 81837e7 to ecfce3a Aug 26, 2019

@spacejam (Member) commented Aug 26, 2019

Thanks for continuing to boil this down! These are really nice results that address the main runaway thread growth issue in the naive original implementation.

I'm working through some more correctness proofs on paper to try to find more ways to break the large number of Relaxed atomic accesses that are added in this.

@vertexclique (Member, Author) commented Aug 26, 2019

Thanks, Tyler. Very nice to hear that. If you are doing formal verification, please share it with me too.

@spacejam (Member) commented Aug 26, 2019

Hehehe, the reason I'm using paper for a couple of things is that I'm a little uncertain about translating Relaxed stuff into TLA (maybe using sets of all reorderable code and permuting them?).

Basically, the only thing I'm looking for, and desk-checking backwards from, is anywhere that a completely random value impacts runtime correctness.

@vertexclique (Member, Author) commented Aug 26, 2019

More to improve (aka TO-DO):

  • Adaptive scale-down (we can do this sometime later; it is harder and needs more sync points than the above).
  • Compartmentalize the pool creation into so-called execution contexts so that they satisfy the needs of issue #79. This PR also helps with that in an indirect way. (Atomics can be moved for that issue, no worries.)

@spacejam (Member) commented Aug 26, 2019

If more synchronization needs to occur, it probably isn't a bad time to take a step back from the 100% atomics-based approach and either use ROWEX when an adaptation is required (spawning readers work lock-free; adapting writers must claim an exclusive mutex to reduce the concurrent interleaving space) or just use a Mutex so that the implementation is obviously correct. Thinking about all of these atomic accesses is quite labor-intensive and error-prone.

Mutexes are quite fast compared to anything that needs to do IO, and I doubt that using one would have any performance impact on the kind of workload this module is involved with.
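A minimal sketch of the shape I mean (hypothetical names; on modern Rust `Mutex::new` is const in statics, while 2019-era code would need lazy_static! for this):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// The raw request counter can stay a lock-free atomic...
static FREQUENCY: AtomicU64 = AtomicU64::new(0);

// ...while everything derived from it is serialized behind one lock.
static FREQ_HISTORY: Mutex<Vec<u64>> = Mutex::new(Vec::new());

fn scale_pool() {
    // Swap the counter out without a lock; spawning readers never block.
    let freq = FREQUENCY.swap(0, Ordering::SeqCst);

    // All of the adaptation math happens under the mutex, so partially
    // updated values can never be observed by another thread.
    let mut history = FREQ_HISTORY.lock().unwrap();
    history.push(freq);
    // ... EMA calculation and the scale-up decision would go here ...
}
```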

@vertexclique (Member, Author) commented Aug 26, 2019

I think the TODOs can go into other PRs. As for Mutexes (if I understand you correctly, you are talking about the scaling-down part), they can be useful, but I don't think they should be used in this code since there is no critical-section code here; I don't like overusing lock exclusion in places where it isn't needed (but I am not fully against it).

I am rooting for this PR to be taken as it is for the blocking pool right now. We can then think about that later in the process of extraction (which, I think, will also apply to the async thread pool).

@spacejam (Member) commented Aug 26, 2019

I guess what I'm mostly getting at is that some potentially very subtle information flows can happen when this math operates in an entirely concurrent way on global variables via Relaxed atomics (or even SeqCst atomics), and this is expensive overall for a human to reason about (it's also expensive/unreasonable for the formal verification tooling I know about to provide meaningful validation of).

@spacejam (Member) commented Aug 26, 2019

There is almost never a real need for exclusion, but it makes PRs like this understandable without staring at them for hours :P

@spacejam (Member) commented Aug 26, 2019

In general, please change all of the atomics to use orderings/barriers that prevent potentially risky reorderings. Note that in the original, Relaxed atomics were used with a clear explanation of why randomness in their use path violated no correctness criteria. This code is much, much more complex, and Relaxed atomics make it quite difficult to reason about.

Mutexes will make this very easy for me to review and merge, and will not impact performance, so I recommend that you replace all of the atomic stuff with them, but I totally get that it's not as fun!
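To make the distinction concrete, a toy example (not from this PR):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

// Relaxed guarantees only the atomicity of this single operation;
// surrounding reads and writes may be reordered around it.
fn record_relaxed() {
    COUNTER.fetch_add(1, Ordering::Relaxed);
}

// SeqCst additionally places the operation in one global total order,
// ruling out the reorderings that make concurrent math hard to reason about.
fn record_seqcst() {
    COUNTER.fetch_add(1, Ordering::SeqCst);
}
```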

@spacejam (Member) commented Aug 27, 2019

Now that all of the Relaxed atomics have been removed, I still have some concerns about the long-term maintenance costs of this code due to the large amount of unserialized atomic access and the math without clear documentation. Please document the math more clearly for people like me who are less familiar with it, and please document the atomic access patterns, explaining the important correctness criteria at each dependency site, so that future contributors don't introduce bugs.

Since bugs have been found in review, it would also make me feel better if a concurrency test were added that exercises the interleavings more.

You may think that this is a bit draconian, but lock-free programming is not easy for humans to get right, and this code is not obviously correct to me yet.

This needs a bit more work before it's safe from a long-term maintenance perspective. The alternative is to wrap all atomic variables other than FREQUENCY in a Mutex so that their complex math is serialized. Please try this out if you're uninterested in adding concurrency tests. A Mutex unlock takes less than 16ns, and as long as you're only doing this math on the variables inside the Mutex, and not doing any IO or running a blocking function while holding it, it will be really fast.

So, please make this code obviously correct by either using a simple Mutex around all of the variables other than FREQUENCY or by adding concurrent interleaving tests that will cause assumptions we're making to pop out.

@vertexclique (Member, Author) commented Aug 28, 2019

src/task/blocking.rs:
// on the request rate.
//
// It uses a frequency-based calculation to define work, utilizing the average processing rate.
fn scale_pool() {

@killercup (Contributor) commented Aug 28, 2019

In another context there was some discussion around adding tracing (using log's trace! macro) to async-std. This function seems like a good place for that, too.
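For example (a sketch, assuming the log crate):

```rust
use log::trace;

fn scale_pool() {
    // A trace-level event each time the pool re-evaluates its size.
    trace!("scale_pool: recalculating adaptive pool size");
    // ... frequency sampling and scaling logic ...
}
```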

@naftulikay commented Aug 28, 2019

log::trace or the new tracing library? Wondering how much overhead each would add.

@killercup (Contributor) commented Aug 28, 2019

I just know that the idea was thrown around, and it would probably be an opt-in feature, too. @stjepang or @yoshuawuyts would know more :)

@yoshuawuyts (Member) commented Aug 30, 2019

log now has kv support, which we're already using in other places. tracing can pick these up too, so using log is probably the best choice here overall (:

@davidbarsky (Member) commented Aug 31, 2019

I think we should get kv support into tracing-log soon-ish? However, I don't believe log supports the span-like macros, so async-std might need to have its own internal span-like macro and dispatch conditionally to async-log or tracing.

I haven't considered how the differences between async-log's span and tracing's span can be bridged, but I think it's feasible if tracing's span!().enter() is used to mimic async-log's nested span block.
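Concretely, the tracing pattern I mean (standard tracing API; the function name is illustrative):

```rust
use tracing::{span, Level};

fn scale_pool() {
    // Everything until `_enter` is dropped is recorded inside the span,
    // mimicking async-log's nested span block.
    let span = span!(Level::TRACE, "scale_pool");
    let _enter = span.enter();
    // ... work recorded within the span ...
}
```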

@yoshuawuyts (Member) commented Sep 1, 2019

@davidbarsky we could probably use async-log::span for this. Though there are still a few open issues on the repo, the interface wouldn't change.

From convos with Eliza during RustConf there's definitely a desire to bridge the two approaches, and we mostly need to figure out how to go about it. rustasync/async-log#7 is what I've currently got (repo incoming soon), but I need to check how well that works (:

@davidbarsky (Member) commented Sep 1, 2019

@yoshuawuyts Yep! I think we're on the same page about wanting to bridge the two libraries.

I'm spitballing one mechanism for bridging that gap, which consists of using mutually exclusive feature flags in a build.rs to dispatch to either async-log or tracing. This would entail using an async-std-specific span macro until the gaps, if any exist, between async-std's span and tracing's span are closed. I'm personally excited to dig into rust-lang-nursery/log#353!

I'm sorry if what I communicated wasn't clear!

@vertexclique (Member, Author) commented Aug 30, 2019

So, to sum up, all the changes that I pushed:

  • Values that get moved around are behind a Mutex right now.
  • I now calculate the values cleanly and documented more explicitly how things work, including the formula, very explicitly.
  • Fixed #126 (an OS-specific problem, hard to hit but possible).
  • No significant degradation in performance. For the curious:
running 2 tests
test blocking        ... bench:  48,959,842 ns/iter (+/- 46,653,507)
test blocking_single ... bench:       3,322 ns/iter (+/- 8,452)
  • I think I covered most of the cases with tests. They are all marked ignored since they are exhaustive, but for the curious cats, here is how you can run them:
$ cargo test longhauling_task_join -- --ignored --exact --nocapture
$ cargo test slow_join_interrupted -- --ignored --exact --nocapture
$ cargo test slow_join -- --ignored --exact --nocapture
  • NOTE: The benchmarks cover the common cases; the tests cover the possible edge cases.
@killercup (Contributor) left a comment

Thanks for writing these docs! I left some inline comments to clarify some of it. In general I think this is a really neat, concise (just 300 LOC!) implementation. With a bit more docs it could be used as reference material for understanding the paper and task scheduling, but I don't think we need to focus on that before landing this PR :)

src/task/blocking.rs:
// Should be less than 200_000 ns.
// The previous implementation panics while this test runs.
assert!(elapsed < 200_000);
}

@spacejam (Member) commented Aug 30, 2019

These tests are generally non-reproducible, and it's unclear whether they are valuable. They tend to fail on my fairly modern 2018 i7 MacBook Pro.

@spacejam (Member) commented Aug 30, 2019

When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally). These might be OK as a benchmark, so you can compare results on your own machine before and after changes, but I don't think they are useful as tests.
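For example, a sketch of what I mean (illustrative; assumes the rand crate for the jitter):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

static FREQUENCY: AtomicU64 = AtomicU64::new(0);

// A tiny random pause before each cross-thread operation shakes out
// interleavings that a quiet test machine would otherwise never hit.
fn jitter() {
    std::thread::sleep(Duration::from_nanos(rand::random::<u64>() % 1_000));
}

#[test]
fn concurrent_frequency_updates() {
    let handles: Vec<_> = (0..8)
        .map(|_| {
            std::thread::spawn(|| {
                for _ in 0..1_000 {
                    jitter();
                    FREQUENCY.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Assert on an invariant, not on wall-clock timing.
    assert_eq!(FREQUENCY.load(Ordering::SeqCst), 8_000);
}
```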

@vertexclique (Member, Author) commented Aug 30, 2019

Please try running the tests with master's blocking.rs; they will panic. Also, please take a look at the referenced issue.

@spacejam (Member) commented Sep 2, 2019

They panic on this branch on a fairly modern machine, so I'm not sure they are useful on machines other than yours. In general, timing-based performance tests are an antipattern, as they have big reproducibility issues. If you want to do this kind of testing, it's better to use something like cachegrind to count the CPU instructions executed.

@spacejam (Member) commented Sep 2, 2019

Because they don't actually assert on anything now (which was what caused them to fail on my laptop), I think they are better as benchmarks, right? I don't think it makes sense to have tests that can't fail.

@vertexclique (Member, Author) commented Sep 2, 2019

> I don't think it makes sense to have tests that can't fail.

Please read my comments carefully. You can see what it tests in action:
https://asciinema.org/a/R6b3UcFo0xfNo1sCPOPJrB04T

> when I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher level primitives that invoke them internally).

Unexposed architecture shouldn't be tested in a whitebox fashion.

@spacejam (Member) commented Sep 2, 2019

Where are the assertions that fail? The asciinema is too long to watch.

Possible concurrent interleavings are exposed architecture, and this kind of test has resulted in a huge number of bugs being discovered in various lock-free projects I've been involved in. Concurrency is really hard to get right sometimes, and the original PR, with more unserialized atomic accesses, would have had bugs jump out pretty quickly with this approach.


@vertexclique force-pushed the vertexclique:ema-based-statistically-adaptive-thread-pool branch from 0377429 to e934a5e Aug 30, 2019

/// # Arguments
///
/// * `freq_queue` - Sliding window of frequency samples
#[inline]

@spacejam (Member) commented Sep 2, 2019

In general, humans are bad at guessing whether something should be inlined. Usually in Rust it's better to leave these attributes out unless the code is going to be executed in a super-hot path and you've done the work to ensure the improvement is measurable.

@vertexclique (Member, Author) commented Sep 2, 2019

This comment is not contributing to the overall discussion and review; I found it quite out of context, and also straight from the book. I've already run the benchmarks with and without it and decided to leave it inlined. You can see my latest run below, repeated after your comment (a rerun shouldn't be needed, given how the compiler inlines and unrolls):

Mean of 10 consecutive runs of the blocking benchmark:

43057207.9 ns without inlining
41816626.1 ns with inlining

That also indirectly benefits users of this library.

@spacejam (Member) commented Sep 2, 2019

I'm a bit skeptical that the change is related to the inline here, because this is a function that gets called 5 times per second, and the measurement skew is ~1ms. For a workload that lasts ~40ms, it's not clear to me how measuring something that happens on the order of once every 200ms is relevant.

I'm asking you to simplify your proposed code because we are going to have to keep it working over time, and it's important not to add bits that add complexity for humans without clear benefits.

@vertexclique deleted the vertexclique:ema-based-statistically-adaptive-thread-pool branch Sep 11, 2019

@vertexclique (Member, Author) commented Sep 11, 2019

Please track the progress of #181

@vertexclique restored the vertexclique:ema-based-statistically-adaptive-thread-pool branch Sep 20, 2019
