Use of std::atomic can slow down multithreaded tests #452

Closed
martinus opened this issue Dec 29, 2020 · 5 comments
@martinus (Contributor) commented Dec 29, 2020

Description

Having lots of REQUIRE assertions in a multithreaded test is relatively slow due to the use of std::atomic.

I have a test that performs a lot of assertions and is run in parallel: 240016891 REQUIRE statements in total. Without the REQUIRE, the test takes 0.37 seconds on my machine. With the REQUIRE, it takes 8.05 seconds.

It seems that most of the slowdown comes from the use of std::atomic for the variable numAssertsCurrentTest_atomic. The problem is that in my case 12 threads continuously increment the same atomic, so every write invalidates the cache line for all the other threads.

I've played a bit with the code and have come up with a multi-lane implementation of an atomic counter. It splits the single atomic into multiple atomics, each sitting on its own cache line, and each thread operates on a different one, so the threads can't block each other. This speeds up the test from 8.05 seconds to 2.28 seconds on my machine.
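
For reference, here is a minimal standalone sketch (not doctest code; thread and iteration counts are arbitrary) of the contention pattern described above - every thread hammers the same std::atomic, so each increment bounces the cache line between cores:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0}; // one atomic shared by all threads
    auto start = std::chrono::steady_clock::now();
    auto threads = std::vector<std::thread>();
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i) {
        threads.emplace_back([&] {
            for (int it = 0; it < 10000000; ++it) {
                counter.fetch_add(1); // every write invalidates the line for the other cores
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%d increments took %lld ms\n", counter.load(), static_cast<long long>(ms));
}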

Steps to reproduce

Create a test that spawns many threads, each calling REQUIRE in a tight loop.

Put this into doctest.h:

// Provides a multi-lane implementation of an atomic variable that supports add, sub, load, store.
// Instead of using a single atomic variable, this splits the value across multiple atomic
// counters, each sitting on a separate cache line. This can reduce contention dramatically
// when many concurrent writes are performed.
template <typename T, size_t NumLanes = 32, size_t CacheLineSize = 64>
class MultiLaneAtomic {
    struct CacheLineAlignedAtomic {
        std::atomic<T> atomic{};
        // Pad the struct to a full cache line so that neighboring lanes never
        // share one (avoids false sharing between threads).
        std::array<uint8_t, CacheLineSize - sizeof(std::atomic<T>)> dummyData{};
    };
    std::array<CacheLineAlignedAtomic, NumLanes> mAtomics{};

public:
    MultiLaneAtomic(MultiLaneAtomic const&) = delete;
    MultiLaneAtomic(MultiLaneAtomic&&) = delete;
    auto operator=(MultiLaneAtomic const&) -> MultiLaneAtomic& = delete;
    auto operator=(MultiLaneAtomic&&) -> MultiLaneAtomic& = delete;

    MultiLaneAtomic() = default;
    ~MultiLaneAtomic() = default;

    auto operator++() noexcept -> T {
        return fetch_add(1) + 1;
    }

    auto operator++(int) noexcept -> T {
        return fetch_add(1);
    }

    auto fetch_add(T arg, std::memory_order order = std::memory_order_seq_cst) noexcept -> T {
        return myAtomic().fetch_add(arg, order);
    }

    auto fetch_sub(T arg, std::memory_order order = std::memory_order_seq_cst) noexcept -> T {
        return myAtomic().fetch_sub(arg, order);
    }

    operator T() const noexcept {
        return load();
    }

    auto load(std::memory_order order = std::memory_order_seq_cst) const noexcept -> T {
        auto result = T();
        for (auto const& c : mAtomics) {
            result += c.atomic.load(order);
        }
        return result;
    }

    auto operator=(T desired) noexcept -> T {
        store(desired);
        return desired;
    }

    void store(T desired, std::memory_order order = std::memory_order_seq_cst) noexcept {
        // The first lane gets the desired value, all other lanes become 0,
        // so that load() (which sums the lanes) yields exactly desired.
        for (auto& c : mAtomics) {
            c.atomic.store(desired, order);
            desired = {};
        }
    }

private:
    // Each thread operates on a different atomic so that threads don't contend.
    // This is a bit tricky: laneCounter is a global static counter that only ever goes up.
    // laneId is unique per thread, so every MultiLaneAtomic instance accessed from the same
    // thread uses the same laneId. To support any number of lanes, the lane is always
    // selected as laneId % NumLanes.
    auto myAtomic() noexcept -> std::atomic<T>& {
        static auto laneCounter = std::atomic<size_t>();
        thread_local auto laneId = laneCounter.fetch_add(1);

        return mAtomics[laneId % NumLanes].atomic;
    }
};
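
If it helps to sanity-check the class in isolation, a hypothetical standalone harness (not part of the proposed patch) could look like this:

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// ... paste the MultiLaneAtomic class from above here ...

int main() {
    MultiLaneAtomic<int> counter;
    auto threads = std::vector<std::thread>();
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) {
                ++counter; // each thread bumps its own lane
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    // load() sums all lanes, so this prints exactly 800000.
    std::printf("total = %d\n", static_cast<int>(counter));
}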

Then, replace the line

        std::atomic<int> numAssertsCurrentTest_atomic;

with

        MultiLaneAtomic<int> numAssertsCurrentTest_atomic;

The test now runs significantly faster with MultiLaneAtomic<int>.

If you think this is worthwhile, I could create a pull request for this.

  • doctest version: 2.4.1
  • Operating System: Linux 5.9.11-3-MANJARO
  • Compiler+version: clang++ 11.0.0
@martinus (Contributor, Author)

Here is a reproducer:

#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN // so this file builds as a standalone test runner
#include <doctest.h>

#include <thread>
#include <vector>

// 4.364 seconds: std::atomic<int> numAssertsCurrentTest_atomic
// 0.755 seconds: MultiLaneAtomic<int> numAssertsCurrentTest_atomic
TEST_CASE("MultiLaneAtomic") {
    static constexpr auto numIters = size_t(10000000);

    auto threads = std::vector<std::thread>();
    for (size_t i = 0; i < std::thread::hardware_concurrency(); ++i) {
        threads.emplace_back([] {
            for (size_t it = 0; it < numIters; ++it) {
                REQUIRE(it < numIters);
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
}

@onqtam (Member) commented Dec 30, 2020

It is my understanding that this solution would work even if there are more threads/cores than the 32 lanes (the default): in every CacheLineAlignedAtomic instance there is a std::atomic, so a few threads can end up using the same atomic due to the modulo addressing in myAtomic() - is that correct?

Regarding the timing - the example code ran on my machine for 4.5 seconds with plain std::atomic vs 3.2 seconds with this atomic class, so there was some gain indeed.
It also seems to slow down the same test by about 10% when run serially (no threads, no concurrency) - nothing is ever free.

Maybe the 3.5-to-1 difference you are observing is more pronounced because there are more cores/threads on your machine? I tested with 12 cores - maybe you have a lot more? If that's the case (a bigger payoff at higher thread counts), maybe it's worth going forward with a PR - even if everyone has to pay the 10% cost when not using concurrency. It would also be trivial to surround the entire atomic class with an #ifdef so that it's used conditionally - only by those for whom it matters most (see the sketch below).
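
A minimal sketch of such a switch, assuming an opt-out macro named DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS (the name used in the merged commit below); the alias name here is illustrative:

#ifdef DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS
// Opted out: fall back to a plain std::atomic.
template <typename T>
using AtomicOrMultiLaneAtomic = std::atomic<T>;
#else
template <typename T>
using AtomicOrMultiLaneAtomic = MultiLaneAtomic<T>;
#endif

// The counters would then be declared through the alias, e.g.:
// AtomicOrMultiLaneAtomic<int> numAssertsCurrentTest_atomic;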

I think the number of lanes should be a compile-time define so that it's configurable (something like DOCTEST_NUM_ATOMIC_THREAD_LANES), and the same goes for the cache line size (DOCTEST_CACHE_LINE_SIZE?). Those should go next to the #ifndef DOCTEST_THREAD_LOCAL block in the same fashion, so that they are overridable if already defined.
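
A sketch of how those defines could look, mirroring the existing #ifndef DOCTEST_THREAD_LOCAL pattern (the macro names here are the ones proposed above and may differ in the final PR):

#ifndef DOCTEST_NUM_ATOMIC_THREAD_LANES
#define DOCTEST_NUM_ATOMIC_THREAD_LANES 32
#endif

#ifndef DOCTEST_CACHE_LINE_SIZE
#define DOCTEST_CACHE_LINE_SIZE 64
#endif

// ...which would then serve as the template defaults:
// template <typename T, size_t NumLanes = DOCTEST_NUM_ATOMIC_THREAD_LANES,
//           size_t CacheLineSize = DOCTEST_CACHE_LINE_SIZE>
// class MultiLaneAtomic { ... };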

You can put this new atomic class right above struct ContextState. Also, it should use only C++11, and DOCTEST_THREAD_LOCAL instead of thread_local.
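
Concretely, the only change inside the class would be in myAtomic() - a sketch, assuming DOCTEST_THREAD_LOCAL expands to thread_local (or a compiler-specific equivalent):

auto myAtomic() noexcept -> std::atomic<T>& {
    // Plain declaration instead of "static auto x = std::atomic<...>()" - the
    // latter requires copy elision and is ill-formed in C++11, where
    // std::atomic cannot be copied or moved.
    static std::atomic<size_t> laneCounter;
    DOCTEST_THREAD_LOCAL size_t laneId = laneCounter.fetch_add(1);
    return mAtomics[laneId % NumLanes].atomic;
}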

@martinus (Contributor, Author)

> It is my understanding that this solution would work even if there are more threads/cores than the 32 lanes (the default): in every CacheLineAlignedAtomic instance there is a std::atomic, so a few threads can end up using the same atomic due to the modulo addressing in myAtomic() - is that correct?

Exactly, that's the idea. The goal is to spread the threads out across the lanes, and if some threads end up using the same lane it's not a problem - just a bit of a slowdown between those particular threads.
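
Concretely, with the default of 32 lanes (illustrative laneId values):

// laneId  7 -> lane  7 % 32 == 7
// laneId 39 -> lane 39 % 32 == 7   (shares lane 7 with laneId 7 - a bit more
//                                   contention between those two threads, but
//                                   load() still sums every lane, so the
//                                   counts stay exact)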

What compiler are you using? I saw on godbolt that Visual Studio seems to produce much more code for this than g++ or clang++ do, so it might not be as beneficial on Windows. I have an Intel i7-8700 with 6 cores/12 threads, and used clang++ with -O3 -march=native for this benchmark. I locked the CPU to a fixed frequency for more stable results.

@onqtam (Member) commented Dec 30, 2020

Whoops - rookie mistake - I wasn't specifying CMAKE_BUILD_TYPE=Release.

Just tested with both clang 9 and gcc 9 on Ubuntu 19.10, and now I see the same 3x speedup - this does indeed seem worthwhile!

onqtam pushed a commit that referenced this issue Dec 31, 2020
Adds the configuration option `DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS` to disable multi-lane atomics.

This can speed up assertions in highly parallel tests by a factor of 3 or more, with a slight slowdown for the single-threaded case.

Closes #452
@onqtam (Member) commented Dec 31, 2020

Merged the PR - closing this issue as well. I will probably release a new version sometime in January - use the dev branch until then.

onqtam closed this as completed Dec 31, 2020
onqtam pushed a commit that referenced this issue Feb 2, 2021