Use of std::atomic can slow down multithreaded tests #452

Closed
martinus opened this issue Dec 29, 2020 · 5 comments
@martinus (Contributor) commented Dec 29, 2020

Description

Having lots of REQUIRE assertions in a multithreaded test is relatively slow due to the use of std::atomic.

I have a test that performs a lot of assertions and is run in parallel: 240016891 REQUIRE statements in total. Without the REQUIRE, the test takes 0.37 seconds on my machine. With the REQUIRE, it takes 8.05 seconds.

It seems that most of the slowdown comes from the use of std::atomic for the variable numAssertsCurrentTest_atomic. The problem is that in my case 12 threads continuously increment the same atomic, so every write invalidates the cache line for all the other threads.

I've played a bit with the code and have come up with a multi-lane implementation of an atomic counter. It splits the single atomic into multiple atomics, each sitting on its own cache line, and each thread operates on a different one, so the threads can't block each other. This speeds up the test from 8.05 seconds to 2.28 seconds on my machine.
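
For reference, here is a minimal standalone sketch (not doctest code; thread and iteration counts are arbitrary) of the contention pattern described above - every thread hammers the same std::atomic, so each increment bounces the cache line between cores:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0}; // one atomic shared by all threads
    auto start = std::chrono::steady_clock::now();
    auto threads = std::vector<std::thread>();
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i) {
        threads.emplace_back([&] {
            for (int it = 0; it < 10000000; ++it) {
                counter.fetch_add(1); // every write invalidates the line for the other cores
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%d increments took %lld ms\n", counter.load(), static_cast<long long>(ms));
}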

Steps to reproduce

Create a test that spawns many threads, each calling REQUIRE in a tight loop.

Put this into doctest.h:

// Provides a multi-lane implementation of an atomic variable that supports add, sub, load, store.
// Instead of using a single atomic variable, this splits the value across multiple atomic
// counters, each sitting on a separate cache line. This can reduce contention dramatically
// when many concurrent writes are performed.
template <typename T, size_t NumLanes = 32, size_t CacheLineSize = 64>
class MultiLaneAtomic {
    struct CacheLineAlignedAtomic {
        std::atomic<T> atomic{};
        // Pad the struct to a full cache line so that neighboring lanes never
        // share one (avoids false sharing between threads).
        std::array<uint8_t, CacheLineSize - sizeof(std::atomic<T>)> dummyData{};
    };
    std::array<CacheLineAlignedAtomic, NumLanes> mAtomics{};

public:
    MultiLaneAtomic(MultiLaneAtomic const&) = delete;
    MultiLaneAtomic(MultiLaneAtomic&&) = delete;
    auto operator=(MultiLaneAtomic const&) -> MultiLaneAtomic& = delete;
    auto operator=(MultiLaneAtomic&&) -> MultiLaneAtomic& = delete;

    MultiLaneAtomic() = default;
    ~MultiLaneAtomic() = default;

    auto operator++() noexcept -> T {
        return fetch_add(1) + 1;
    }

    auto operator++(int) noexcept -> T {
        return fetch_add(1);
    }

    auto fetch_add(T arg, std::memory_order order = std::memory_order_seq_cst) noexcept -> T {
        return myAtomic().fetch_add(arg, order);
    }

    auto fetch_sub(T arg, std::memory_order order = std::memory_order_seq_cst) noexcept -> T {
        return myAtomic().fetch_sub(arg, order);
    }

    operator T() const noexcept {
        return load();
    }

    auto load(std::memory_order order = std::memory_order_seq_cst) const noexcept -> T {
        auto result = T();
        for (auto const& c : mAtomics) {
            result += c.atomic.load(order);
        }
        return result;
    }

    auto operator=(T desired) noexcept -> T {
        store(desired);
        return desired;
    }

    void store(T desired, std::memory_order order = std::memory_order_seq_cst) noexcept {
        // The first lane gets the desired value, all other lanes become 0,
        // so that load() (which sums the lanes) yields exactly desired.
        for (auto& c : mAtomics) {
            c.atomic.store(desired, order);
            desired = {};
        }
    }

private:
    // Each thread operates on a different atomic so that threads don't contend.
    // This is a bit tricky: laneCounter is a global static counter that only ever goes up.
    // laneId is unique per thread, so every MultiLaneAtomic instance accessed from the same
    // thread uses the same laneId. To support any number of lanes, the lane is always
    // selected as laneId % NumLanes.
    auto myAtomic() noexcept -> std::atomic<T>& {
        static auto laneCounter = std::atomic<size_t>();
        thread_local auto laneId = laneCounter.fetch_add(1);

        return mAtomics[laneId % NumLanes].atomic;
    }
};
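
If it helps to sanity-check the class in isolation, a hypothetical standalone harness (not part of the proposed patch) could look like this:

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// ... paste the MultiLaneAtomic class from above here ...

int main() {
    MultiLaneAtomic<int> counter;
    auto threads = std::vector<std::thread>();
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) {
                ++counter; // each thread bumps its own lane
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    // load() sums all lanes, so this prints exactly 800000.
    std::printf("total = %d\n", static_cast<int>(counter));
}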

Then, replace the line

        std::atomic<int> numAssertsCurrentTest_atomic;

with

        MultiLaneAtomic<int> numAssertsCurrentTest_atomic;

The test now runs significantly faster with MultiLaneAtomic<int>.

If you think this is worthwhile, I could create a pull request for this.

  • doctest version: 2.4.1
  • Operating System: Linux 5.9.11-3-MANJARO
  • Compiler+version: clang++ 11.0.0
@martinus (Contributor, Author)

Here is a reproducer:

#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN // so this file builds as a standalone test runner
#include <doctest.h>

#include <thread>
#include <vector>

// 4.364 seconds: std::atomic<int> numAssertsCurrentTest_atomic
// 0.755 seconds: MultiLaneAtomic<int> numAssertsCurrentTest_atomic
TEST_CASE("MultiLaneAtomic") {
    static constexpr auto numIters = size_t(10000000);

    auto threads = std::vector<std::thread>();
    for (size_t i = 0; i < std::thread::hardware_concurrency(); ++i) {
        threads.emplace_back([] {
            for (size_t it = 0; it < numIters; ++it) {
                REQUIRE(it < numIters);
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
}

@onqtam (Member) commented Dec 30, 2020

It is my understanding that this solution would work even if there are more threads/cores than the 32 lanes (the default): in every CacheLineAlignedAtomic instance there is a std::atomic, so a few threads can end up using the same atomic due to the modulo addressing in myAtomic() - is that correct?

Regarding the timing - the example code ran on my machine for 4.5 seconds with plain std::atomic vs 3.2 seconds with this atomic class, so there was some gain indeed.
It also seems to slow down the same test by about 10% when run serially (no threads, no concurrency) - nothing is ever free.

Maybe the 3.5-to-1 difference you are observing is more pronounced because there are more cores/threads on your machine? I tested with 12 cores - maybe you have a lot more? If that's the case (a bigger payoff at higher thread counts), maybe it's worth going forward with a PR - even if everyone has to pay the 10% cost when not using concurrency. It would also be trivial to surround the entire atomic class with an #ifdef so that it's used conditionally - only by those for whom it matters most (see the sketch below).
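
A minimal sketch of such a switch, assuming an opt-out macro named DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS (the name used in the merged commit below); the alias name here is illustrative:

#ifdef DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS
// Opted out: fall back to a plain std::atomic.
template <typename T>
using AtomicOrMultiLaneAtomic = std::atomic<T>;
#else
template <typename T>
using AtomicOrMultiLaneAtomic = MultiLaneAtomic<T>;
#endif

// The counters would then be declared through the alias, e.g.:
// AtomicOrMultiLaneAtomic<int> numAssertsCurrentTest_atomic;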

I think the number of lanes should be a compile-time define so that it's configurable (something like DOCTEST_NUM_ATOMIC_THREAD_LANES), and the same goes for the cache line size (DOCTEST_CACHE_LINE_SIZE?). Those should go next to the #ifndef DOCTEST_THREAD_LOCAL block in the same fashion, so that they are overridable if already defined.
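
A sketch of how those defines could look, mirroring the existing #ifndef DOCTEST_THREAD_LOCAL pattern (the macro names here are the ones proposed above and may differ in the final PR):

#ifndef DOCTEST_NUM_ATOMIC_THREAD_LANES
#define DOCTEST_NUM_ATOMIC_THREAD_LANES 32
#endif

#ifndef DOCTEST_CACHE_LINE_SIZE
#define DOCTEST_CACHE_LINE_SIZE 64
#endif

// ...which would then serve as the template defaults:
// template <typename T, size_t NumLanes = DOCTEST_NUM_ATOMIC_THREAD_LANES,
//           size_t CacheLineSize = DOCTEST_CACHE_LINE_SIZE>
// class MultiLaneAtomic { ... };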

You can put this new atomic class right above struct ContextState. Also, it should use only C++11, and DOCTEST_THREAD_LOCAL instead of thread_local.
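
Concretely, the only change inside the class would be in myAtomic() - a sketch, assuming DOCTEST_THREAD_LOCAL expands to thread_local (or a compiler-specific equivalent):

auto myAtomic() noexcept -> std::atomic<T>& {
    // Plain declaration instead of "static auto x = std::atomic<...>()" - the
    // latter requires copy elision and is ill-formed in C++11, where
    // std::atomic cannot be copied or moved.
    static std::atomic<size_t> laneCounter;
    DOCTEST_THREAD_LOCAL size_t laneId = laneCounter.fetch_add(1);
    return mAtomics[laneId % NumLanes].atomic;
}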

@martinus (Contributor, Author)

> It is my understanding that this solution would work even if there are more threads/cores than the 32 lanes (the default): in every CacheLineAlignedAtomic instance there is a std::atomic, so a few threads can end up using the same atomic due to the modulo addressing in myAtomic() - is that correct?

Exactly, that's the idea. The goal is to spread the threads out across the lanes, and if some threads end up using the same lane it's not a problem - just a bit of a slowdown between those particular threads.
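
Concretely, with the default of 32 lanes (illustrative laneId values):

// laneId  7 -> lane  7 % 32 == 7
// laneId 39 -> lane 39 % 32 == 7   (shares lane 7 with laneId 7 - a bit more
//                                   contention between those two threads, but
//                                   load() still sums every lane, so the
//                                   counts stay exact)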

What compiler are you using? I saw on godbolt that Visual Studio seems to produce much more code for this than g++ or clang++ do, so it might not be as beneficial on Windows. I have an Intel i7-8700 with 6 cores/12 threads, and used clang++ with -O3 -march=native for this benchmark. I locked the CPU to a fixed frequency for more stable results.

@onqtam (Member) commented Dec 30, 2020

Whoops - rookie mistake - I wasn't specifying CMAKE_BUILD_TYPE=Release.

Just tested with both clang 9 and gcc 9 on Ubuntu 19.10, and now I see the same 3x speedup - this does indeed seem worthwhile!

onqtam pushed a commit that referenced this issue Dec 31, 2020
Adds the configuration option `DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS` to disable multi-lane atomics.

This can speed up assertions in highly parallel tests by a factor of 3 or more, with a slight slowdown for the single-threaded case.

Closes #452
@onqtam (Member) commented Dec 31, 2020

Merged the PR - closing this issue as well. I will probably release a new version sometime in January - use the dev branch until then.

onqtam closed this as completed Dec 31, 2020
onqtam pushed a commit that referenced this issue Feb 2, 2021