
performance overhead of unused actor_system #699

Closed
jsiwek opened this issue Jun 13, 2018 · 7 comments

@jsiwek
Contributor

jsiwek commented Jun 13, 2018

On a 32-core Linux system (std::thread::hardware_concurrency() returns 32), with a default build of caf/master (cdfe2c2) and the following test benchmark:

#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"
#include <cstdint>
using namespace caf;

int main()
    {
    actor_system_config cfg;
    actor_system system{cfg};

    uint64_t total = 0u;
    uint64_t limit = 10000000000u;

    for ( uint64_t i = 0u; i < limit; ++i )
        total += i;

    return 0;
    }

I get this timing (built w/ -O0):

$ time ./a.out

real	1m6.162s
user	1m12.252s
sys	0m27.054s

Then recompile without creating an actor_system:

#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"
#include <cstdint>
using namespace caf;

int main()
    {
    //actor_system_config cfg;
    //actor_system system{cfg};

    uint64_t total = 0u;
    uint64_t limit = 10000000000u;

    for ( uint64_t i = 0u; i < limit; ++i )
        total += i;

    return 0;
    }

And I get this timing:

$ time ./a.out

real	0m45.918s
user	0m45.751s
sys	0m0.002s

I can also get closer to the faster timing by artificially limiting the number of scheduler threads via cfg.scheduler_max_threads = 4;. Running the benchmark with the default configuration on a system with fewer cores also shows less of a problem.

Does this generally indicate some scaling issues on high-core-count systems? Or is there any room to improve CAF so that the default configuration of an actor system on a high-core-count machine doesn't have as much overhead?

@Neverlord
Member

This could be an artifact of various multicore optimizations for low-concurrency setups. For example, Intel's "Turbo Boost" will shut down unused cores in order to increase the clock speed for single-threaded applications.

You could test for this by spawning one thread per core that does nothing but sleep for a few milliseconds in a loop (e.g. until a global shutdown flag is set). That should reproduce the performance decrease you see without CAF if something like Intel's Turbo Boost is to blame here.

Mitigating this in CAF would require tweaks to the work-stealing scheduler. The scheduler is designed without any central component, so "no work is available in the entire system" is hard to detect. Shutting down workers once they agree nothing is happening might cause additional latencies once work does become available again and needs to spread to all workers. We currently use a polling strategy with increasing wait intervals as a compromise between not wasting too much power and being able to get back up to speed quickly.
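The polling strategy described above can be sketched roughly as follows. This is an illustrative reconstruction, not CAF's actual scheduler code; `worker_poll_loop` and the interval values are hypothetical:

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Sketch of a worker's idle loop: poll for stealable work, doubling the
// sleep interval on each miss up to a cap, and resetting after a hit.
inline void worker_poll_loop(std::atomic<bool>& shutdown,
                             const std::function<bool()>& try_steal_job) {
  using namespace std::chrono;
  microseconds sleep_interval{50};       // initial short nap (assumed value)
  const microseconds max_sleep{10000};   // cap; CAF's default relaxed sleep is 10 ms
  while (!shutdown.load(std::memory_order_relaxed)) {
    if (try_steal_job()) {
      sleep_interval = microseconds{50}; // work found: reset the backoff
    } else {
      std::this_thread::sleep_for(sleep_interval);
      sleep_interval = std::min(sleep_interval * 2, max_sleep);
    }
  }
}
```

With this shape, an idle system quickly converges to all workers sleeping at the 10 ms cap, which is exactly the periodic wake-up pattern suspected of interfering with Turbo Boost.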

Just my thoughts so far. Can you try a simple run with periodically waking threads to see if that triggers the same behavior on your 32-core machine?
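The suggested experiment could look like the following sketch (plain standard C++, no CAF involved; `spawn_sleepers` and `busy_count` are hypothetical helper names):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Spawn n threads that do nothing but sleep in 10 ms intervals until a
// shared shutdown flag is set, mimicking idle scheduler workers.
inline std::vector<std::thread> spawn_sleepers(unsigned n,
                                               std::atomic<bool>& shutdown) {
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n; ++i)
    threads.emplace_back([&shutdown] {
      while (!shutdown.load(std::memory_order_relaxed))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });
  return threads;
}

// The benchmark's counting loop, factored out so the sleepers run alongside it.
inline uint64_t busy_count(uint64_t limit) {
  uint64_t total = 0;
  for (uint64_t i = 0; i < limit; ++i)
    total += i;
  return total;
}
```

A `main()` would call `spawn_sleepers(std::thread::hardware_concurrency(), flag)`, run `busy_count(10000000000u)`, set the flag, join the threads, and compare `time ./a.out` against the baseline.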

@lovemycatt

lovemycatt commented Jun 14, 2018 via email

@Neverlord
Member

> This behaviour could possibly be fixed also by the topic/latency-integration branch

I've actually merged the branch into master two weeks ago after finishing some final testing and benchmarking: 04c1164 🙂

@Neverlord
Member

Btw, topic/latency-integration is actually an unsuccessful experiment of mine. I'm sure you meant topic/latency.

@lovemycatt

lovemycatt commented Jun 14, 2018 via email

@jsiwek
Contributor Author

jsiwek commented Jun 14, 2018

> Can you try a simple run with periodically waking threads to see if that triggers the same behavior on your 32-core machine?

It does: 32 threads sleeping for 10 ms produce roughly the same timing as the default actor system configuration.

I noticed that increasing work_stealing_relaxed_sleep_duration_us (e.g. to 64 milliseconds) instead of limiting the number of threads also brings things back to within 2% of the baseline. (Maybe you want to consider raising the default from 10 ms?)

Are there any ideas/plans for actor systems to scale more dynamically toward their most efficient state? Ideally I don't want to tune these settings or ship arbitrary defaults, especially since we don't control the range of hardware or workloads our users have, and I'd rather not put the responsibility on each individual user to tune many knobs.

I know that's maybe a tall order, so feel free to close this if there are no plans; otherwise, any advice on how to go about providing tuning options would be appreciated. Right now, I only expose scheduler_max_threads and work_stealing_relaxed_sleep_duration_us to users and set them to more moderate defaults (cap threads at 8 and max sleep at 64 ms).
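For reference, the moderate defaults described above could be applied like this. This is a sketch using the config members referenced in this thread (CAF 0.15-era API; field names may differ in newer releases):

```cpp
#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"

int main() {
  caf::actor_system_config cfg;
  cfg.scheduler_max_threads = 8;                       // cap worker threads at 8
  cfg.work_stealing_relaxed_sleep_duration_us = 64000; // max backoff sleep: 64 ms
  caf::actor_system system{cfg};
  // ... spawn actors and do work ...
  return 0;
}
```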

@Neverlord
Member

@jsiwek IIRC, you now have a working setup in Broker that allows the processor to shut down most cores reliably. I think the default use case for CAF should remain focused on maximum multi-core throughput, though. I'm not sure there's a strategy that gives us the best of both extremes, so I'll close this for now. Please feel free to reopen if something comes up.
