
performance overhead of unused actor_system #699

Closed
jsiwek opened this issue Jun 13, 2018 · 7 comments

@jsiwek
Contributor

jsiwek commented Jun 13, 2018

On a 32-core Linux system (std::thread::hardware_concurrency() returns 32), with a default build of caf/master (cdfe2c2) and the following test benchmark:

#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"
#include <cstdint>
using namespace caf;

int main()
    {
    actor_system_config cfg;
    actor_system system{cfg};

    uint64_t total = 0u;
    uint64_t limit = 10000000000u;

    for ( uint64_t i = 0u; i < limit; ++i )
        total += i;

    return 0;
    }

I get this timing (built w/ -O0):

$ time ./a.out

real	1m6.162s
user	1m12.252s
sys	0m27.054s

Then recompile without creating an actor_system:

#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"
#include <cstdint>
using namespace caf;

int main()
    {
    //actor_system_config cfg;
    //actor_system system{cfg};

    uint64_t total = 0u;
    uint64_t limit = 10000000000u;

    for ( uint64_t i = 0u; i < limit; ++i )
        total += i;

    return 0;
    }

And I get this timing:

$ time ./a.out

real	0m45.918s
user	0m45.751s
sys	0m0.002s

I can also get closer to the faster timing by artificially limiting the number of scheduler threads via cfg.scheduler_max_threads = 4;. Running the benchmark with the default configuration on a system with fewer cores also shows less of a problem.

Does this generally indicate some scaling issues on high-core-count systems? Or is there any room to improve CAF so that the default configuration of an actor system on a high-core-count machine doesn't have as much overhead?

@Neverlord
Member

This could be an artifact of various multicore optimizations for low-concurrency setups. For example, Intel's "Turbo Boost" will shut down unused cores in order to increase the clock speed for single-threaded applications.

You could test for this by spawning one thread per core that does nothing but sleep for a few milliseconds in a loop (e.g. until a global shutdown flag is set). That should reproduce the performance decrease you see without CAF if something like Intel's Turbo Boost is to blame here.

Mitigating this in CAF would require tweaks to the work-stealing scheduler. The scheduler is designed without any central component, so "no work is available in the entire system" is hard to detect. Shutting down workers once they agree nothing is happening might cause additional latencies once work does become available again and needs to spread to all workers. We currently use a polling strategy with increasing wait intervals as a compromise between not wasting too much power and being able to get back up to speed quickly.
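The polling strategy described above can be sketched roughly as follows. This is an illustrative reconstruction, not CAF's actual scheduler code; `worker_poll_loop` and the interval values are hypothetical:

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Sketch of a worker's idle loop: poll for stealable work, doubling the
// sleep interval on each miss up to a cap, and resetting after a hit.
inline void worker_poll_loop(std::atomic<bool>& shutdown,
                             const std::function<bool()>& try_steal_job) {
  using namespace std::chrono;
  microseconds sleep_interval{50};       // initial short nap (assumed value)
  const microseconds max_sleep{10000};   // cap; CAF's default relaxed sleep is 10 ms
  while (!shutdown.load(std::memory_order_relaxed)) {
    if (try_steal_job()) {
      sleep_interval = microseconds{50}; // work found: reset the backoff
    } else {
      std::this_thread::sleep_for(sleep_interval);
      sleep_interval = std::min(sleep_interval * 2, max_sleep);
    }
  }
}
```

With this shape, an idle system quickly converges to all workers sleeping at the 10 ms cap, which is exactly the periodic wake-up pattern suspected of interfering with Turbo Boost.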

Just my thoughts so far. Can you try a simple run with periodically waking threads to see if that triggers the same behavior on your 32-core machine?
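The suggested experiment could look like the following sketch (plain standard C++, no CAF involved; `spawn_sleepers` and `busy_count` are hypothetical helper names):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Spawn n threads that do nothing but sleep in 10 ms intervals until a
// shared shutdown flag is set, mimicking idle scheduler workers.
inline std::vector<std::thread> spawn_sleepers(unsigned n,
                                               std::atomic<bool>& shutdown) {
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n; ++i)
    threads.emplace_back([&shutdown] {
      while (!shutdown.load(std::memory_order_relaxed))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });
  return threads;
}

// The benchmark's counting loop, factored out so the sleepers run alongside it.
inline uint64_t busy_count(uint64_t limit) {
  uint64_t total = 0;
  for (uint64_t i = 0; i < limit; ++i)
    total += i;
  return total;
}
```

A `main()` would call `spawn_sleepers(std::thread::hardware_concurrency(), flag)`, run `busy_count(10000000000u)`, set the flag, join the threads, and compare `time ./a.out` against the baseline.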

@lovemycatt

lovemycatt commented Jun 14, 2018 via email

@Neverlord
Member

> This behaviour could possibly be fixed also by the topic/latency-integration branch

I've actually merged the branch into master two weeks ago after finishing some final testing and benchmarking: 04c1164 🙂

@Neverlord
Member

Btw, topic/latency-integration is actually an unsuccessful experiment of mine. I'm sure you meant topic/latency.

@lovemycatt

lovemycatt commented Jun 14, 2018 via email

@jsiwek
Contributor Author

jsiwek commented Jun 14, 2018

> Can you try a simple run with periodically waking threads to see if that triggers the same behavior on your 32-core machine?

It does: 32 threads sleeping for 10 ms produce roughly the same timing as the default actor system configuration.

I noticed that increasing work_stealing_relaxed_sleep_duration_us (e.g. to 64 milliseconds) instead of limiting the number of threads also brings things back to within 2% of the baseline. (Maybe you want to consider raising the default from 10 ms?)

Are there any ideas/plans for actor systems to scale more dynamically toward their most efficient state? Ideally I don't want to tune these settings or ship arbitrary defaults, especially since we don't control the range of hardware or workloads our users have, and I'd rather not put the responsibility on each individual user to tune many knobs.

I know that's maybe a tall order, so feel free to close this if there are no plans; otherwise, any advice on how to go about providing tuning options would be appreciated. Right now, I only expose scheduler_max_threads and work_stealing_relaxed_sleep_duration_us to users and set them to more moderate defaults (cap threads at 8 and max sleep at 64 ms).
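For reference, the moderate defaults described above could be applied like this. This is a sketch using the config members referenced in this thread (CAF 0.15-era API; field names may differ in newer releases):

```cpp
#include "caf/actor_system.hpp"
#include "caf/actor_system_config.hpp"

int main() {
  caf::actor_system_config cfg;
  cfg.scheduler_max_threads = 8;                       // cap worker threads at 8
  cfg.work_stealing_relaxed_sleep_duration_us = 64000; // max backoff sleep: 64 ms
  caf::actor_system system{cfg};
  // ... spawn actors and do work ...
  return 0;
}
```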

@Neverlord
Member

@jsiwek IIRC, you now have a working setup in Broker that allows the processor to shut down most cores reliably. I think the default use case for CAF should remain focused on maximum multi-core throughput, though. I'm not sure there's a strategy that gives us the best of both extremes, so I'll close this for now. Please feel free to reopen if something comes up.
