performance overhead of unused actor_system #699
This could be an artifact of various multicore optimizations for low-concurrency setups. For example, Intel's "Turbo Boost" will shut down unused cores in order to increase the clock speed for single-threaded applications. You could test for this by spawning one thread per core that does nothing but sleep a few milliseconds in a loop (e.g. until a global shutdown flag is set). That should reproduce the performance decrease you see without CAF if something like Intel's Turbo Boost is to blame here.

Mitigating this in CAF would require tweaks to the work-stealing scheduler. The scheduler is designed without any central component, so "no work is available in the entire system" is hard to detect. Shutting down workers once they agree nothing is happening might cause additional latency once work does become available again and needs to spread to all workers. We currently use a polling strategy with increasing wait intervals as a compromise between not wasting too much power and being able to get back up to speed quickly.

Just my thoughts so far. Can you try a simple run with periodically waking threads to see if that triggers the same behavior on your 32-core machine?
This behaviour could possibly also be fixed by the topic/latency-integration branch, which uses a different strategy than polling when configured with few or no aggressive attempts, `moderate-poll-attempts = 0`, and `relaxed-sleep-duration` around 100.

https://github.com/actor-framework/actor-framework/tree/topic/latency-integration

You may try it.
Tullio
I actually merged the branch into master two weeks ago, after finishing some final testing and benchmarking: 04c1164 🙂
Btw, the topic/latency-integration is actually an unsuccessful experiment of mine. I'm sure you meant topic/latency.
Yep, sorry for the misleading information 😊 Anyway, the way to trigger the wait behavior instead of polling is to set moderate-poll-attempts to 0 and fall back to the relaxed policy as soon as possible.
T
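Assuming the branch reads these settings from the usual CAF configuration file, they might look like this in `caf-application.ini`. The section and key names are taken from the comments above and should be treated as illustrative; the exact spelling may differ per branch:

```ini
; caf-application.ini -- illustrative only; key names follow this thread
[work-stealing]
; skip the moderate polling phase entirely ...
moderate-poll-attempts=0
; ... and sleep ~100 (units as defined by the branch) while idle
relaxed-sleep-duration=100
```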
It does: 32 threads sleeping for 10 ms gives around the same timing as the default actor system configuration. I notice that increasing …

Any ideas/plans for actor systems to scale even more dynamically toward the most efficient state? Ideally I don't want to have to tune these settings or ship arbitrary defaults, especially since we're not in control of the range of hardware or workloads our users run, and I'd rather not put the responsibility on each individual to tune many knobs. I know that's maybe a tall order, so feel free to close this if there are no plans, or to share other advice on how to go about providing tuning options -- right now, I only expose …
@jsiwek IIRC, you have a working setup in Broker for now that reliably allows the processor to shut down most cores. I think the default use case for CAF should remain focused on maximum multi-core throughput, though. I'm not sure there's a strategy that gives us the best of both extremes, so I'll close this for now. Please feel free to reopen if something comes up.
On a 32-core Linux system (`std::thread::hardware_concurrency()` is 32) and a default build of caf/master (cdfe2c2), using the following test benchmark:

I get this timing (built w/ `-O0`):

Then I recompile without creating an `actor_system`:

And I get this timing:

I can also get closer to the better timing by artificially limiting the number of threads via `cfg.scheduler_max_threads = 4;`. Running the benchmarks on a lower-core system with the default configuration would also not show as much of a problem.

Does this generally indicate scaling issues on high-core-count systems? Or is there room to improve CAF so that the default configuration of an actor system on a high-core-count machine doesn't have as much overhead?