
Fix: Atomics and Locks (ARM, AArch64, X86) #14293

Merged

Conversation

@ysbaddaden (Contributor) commented Feb 12, 2024

This patch should fix the use of atomics in locks once and for all (until we add new architectures, each of which will need to be investigated). All usages should now be correct as per the textbook, instead of relying on hacks, and without much performance impact.

Explanation

CPU architectures with a weak memory model (such as ARM) can reorder CPU instructions at runtime (unlike X86), not following the exact order of the assembly. We must respect each memory model's guarantees, using proper barriers or specific memory orderings at the places that require them (e.g. locks).

A lock must ensure that anything after the lock is acquired isn't executed before the acquisition, and the other way around: anything before we release the lock must be executed before we actually release it. Each architecture has its own quirks (a sketch of the resulting pattern follows the list):

  • AArch64: atomics guarantee the ordering, but we must use explicit atomics everywhere (at least load/store with acquire/release ordering) (source);
  • ARM v6/v7: atomics don't guarantee any ordering (only the atomicity of the operation), so we must add explicit memory barriers around locks (help wanted: can someone test on ARMv6 or v7 hardware? 🙏) (source);
  • X86: it seems we can get away with a non-atomic store, yet using acquire/release ordering isn't any slower than a non-atomic store.
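As a concrete illustration of the acquire/release pairing described above, here is a minimal spin lock sketch in Crystal, assuming the ordering arguments this PR adds to `Atomic` (the class name and exact signatures are illustrative, not the stdlib implementation):

```crystal
# Minimal spin lock sketch (illustrative only).
class SpinLock
  def initialize
    @held = Atomic(Int32).new(0)
  end

  def lock : Nil
    # Acquire: nothing after this point may be observed before the
    # successful swap that takes the lock.
    while @held.swap(1, :acquire) == 1
      # busy-wait until the lock is released
    end
  end

  def unlock : Nil
    # Release: everything before this point must be visible before
    # the store that releases the lock.
    @held.set(0, :release)
  end
end
```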

Explicit orderings

X86 is the main reason I added specific orderings. I couldn't notice any difference on AArch64 between seq-cst and acq/rel, but on X86 it had a great impact: with the correct acquire/release atomics the lock is no slower than with the lazy store.

Note: I'm following the algorithms used by the Linux v4.4 kernel overall.

Performance impact

I ran the "mutex" example from #14272 by @jgaskins ten times at different MT levels (ranging from 1 to 128 threads) and charted the average value. See run.cr.
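run.cr isn't reproduced here; as a rough idea, a contention benchmark of this kind might look like the following sketch (hypothetical, not @jgaskins' original; the fiber and iteration counts are arbitrary):

```crystal
# Hypothetical mutex contention micro-benchmark (not the original run.cr).
# Build with: crystal build -Dpreview_mt --release bench.cr
# Vary the thread count with: CRYSTAL_WORKERS=N ./bench
mutex = Mutex.new
counter = 0
done = Channel(Nil).new
fibers = 8
iterations = 100_000

elapsed = Time.measure do
  fibers.times do
    spawn do
      iterations.times do
        mutex.synchronize { counter += 1 }
      end
      done.send(nil)
    end
  end
  fibers.times { done.receive }
end

puts "#{counter} increments in #{elapsed.total_seconds}s"
```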

[chart: mutex-x86]

On X86 (Intel Haswell i7-4712HQ, 4 cores, 8 threads) we see that this patch isn't any slower than current master (incorrect, but appears to work), whereas using load + store with the default sequentially consistent ordering has a huge impact.

[chart: mutex-aarch64]

On AArch64 (Neoverse N1 server, 160 cores) we can see that this patch is slower than master with the traditional LL/SC atomics (inherited from ARMv6), but the difference becomes unnoticeable when the new LSE atomics are enabled (--mattr=+lse); they significantly outperform LL/SC on this hardware.

Future evolution: we might want to enable LSE atomics by default for AArch64 when available: we can detect the lse feature with LLVM 15 and this snippet by @HertzDevil, and maybe add a flag to disable it when needed.

Help wanted: I'd be very interested to see results of LL/SC vs LSE atomics on other AArch64 CPUs: Apple, Raspberry Pi, AWS, ... 🙏


Allows specifying how memory accesses, including non-atomic ones, may be
reordered around atomics. Follows the C/C++ semantics:
<https://en.cppreference.com/w/c/atomic/memory_order>

Keeps sequentially consistent as the default ordering.
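Assuming the API shape this adds (symbols autocast to an `Atomic::Ordering` enum mirroring the C/C++ names), usage would look something like:

```crystal
x = Atomic(Int32).new(0)

x.set(1)                    # defaults to :sequentially_consistent
x.set(2, :release)          # release store
value = x.get(:acquire)     # acquire load
x.add(1, :acquire_release)  # read-modify-write with acq/rel semantics
```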

There are a couple of difficulties with this patch:

1. LLVM 12 and below only accepted some specific ordering combinations
   in cmpxchg, and all LLVM versions refuse acquire ordering on stores
   and release ordering on loads.

2. Codegen needs the ordering to be a constant (to generate the LLVM
   IR), but it is now a variable in the stdlib. We thus generate all
   the possible cases every time, which the LLVM optimization passes
   optimize away when they realize only one branch is possible (unless
   `ordering` really is a variable, which should be very rare); see the
   sketch below.
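A hypothetical sketch of that dispatch pattern, with the stdlib's own `Atomic#get` standing in for the compiler primitive that requires a constant ordering:

```crystal
# Hypothetical sketch: branch over every ordering so the primitive is
# always called with a constant; LLVM's optimizer removes the dead
# branches whenever `ordering` is itself a constant at the call site.
def load_with(atomic : Atomic(Int32), ordering : Atomic::Ordering) : Int32
  case ordering
  in .relaxed?                 then atomic.get(:relaxed)
  in .acquire?                 then atomic.get(:acquire)
  in .sequentially_consistent? then atomic.get(:sequentially_consistent)
  in .release?, .acquire_release?
    raise ArgumentError.new("atomic loads cannot have release semantics")
  end
end
```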
Fix: detect LLVM version
@Sija (Contributor) commented Feb 15, 2024

macOS CI is still failing with `Assertion failed: (isValidFailureOrdering(Ordering) && "invalid CmpXchg failure ordering"), function setFailureOrdering, file Instructions.h, line 612`.

@ysbaddaden (Contributor, Author) commented Feb 15, 2024

Thanks @Sija, I missed that assertion; it felt like nixpkgs was crashing. I have no idea what's happening there. The assertion is pointless and doesn't hint at any actual issue.

Not having the assertion and having the actual LLVM IR error would actually be much more helpful.

Might Crystal::LLVM_VERSION be invalid or something?

@ysbaddaden (Contributor, Author)

@Sija are you sure? CI reports a different message: `invalid CmpXchg failure ordering`, not `must be atomic`.

@ysbaddaden (Contributor, Author)

I tried to reproduce with Nix on Linux, but the assertion won't trigger 🤷

@Sija (Contributor) commented Feb 15, 2024

It's odd: the failing assertion seems to be this one: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/Instructions.h#L623-L624, yet the CI message mentions line 612...

Also, llvm-config reports version 11.1.0 (Using /nix/store/pj58mky3176z3xh7xgsp5sf46dfpyy5k-llvm-11.1.0-dev/bin/llvm-config [version= 11.1.0]), but for that version the assertion is on a different line: https://github.com/llvm/llvm-project/blob/llvmorg-11.1.0/llvm/include/llvm/IR/Instructions.h#L611-L612

@ysbaddaden (Contributor, Author) commented Feb 15, 2024

The isValidFailureOrdering function doesn't even exist in LLVM 11. It's like the reported and actually-used LLVM versions are completely different. Is nix doing shenanigans? 🤔

I think I'm starting to understand: LLVM with assertions turned on still checks that the failure ordering can't have release semantics, but without assertions it will accept it. Sigh, so which is which? 😕
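For reference, the constraint under discussion, assuming `compare_and_set` gained separate success/failure orderings in this PR (signatures assumed):

```crystal
flag = Atomic(Int32).new(0)

# Valid: a failure ordering may be :relaxed, :acquire or
# :sequentially_consistent.
flag.compare_and_set(0, 1, :acquire_release, :acquire)

# Invalid: a failed CAS performs no store, so its ordering cannot have
# release semantics; this is what the LLVM assertion guards against.
# flag.compare_and_set(0, 1, :acquire_release, :release)
```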

There are cases where the detection doesn't work properly, and we might
want to avoid having two Crystal compilers behave differently depending
on the LLVM version they were compiled with.

We might want to enable more cases when we ditch support for LLVM < 13.0.
@ysbaddaden (Contributor, Author)

As stated in the latest commit: I removed the check for the LLVM version and only enabled the strictest of the choices. That seems to have fixed the issue with the LLVM assertions, and I think it's better not to have different semantics based on the LLVM version Crystal happens to have been compiled with.

@HertzDevil (Contributor)

The assertion failure comes from LLVM 15.0.2, since that is the version the compiler itself is built with.

Add comment about not enforcing the "no stronger" rule for CAS in the compiler (codegen has to be valid) vs the stdlib (consistent public API).
@straight-shoota straight-shoota added this to the 1.12.0 milestone Mar 4, 2024
@straight-shoota straight-shoota merged commit c734cc4 into crystal-lang:master Mar 5, 2024
56 of 57 checks passed
@ysbaddaden ysbaddaden deleted the fix/atomics-and-locks-3 branch March 5, 2024 15:25
@ysbaddaden (Contributor, Author)

Reading https://llvm.org/docs/Atomics.html again, I realize the notes on code generation mention using explicit fences on weakly-ordered architectures... and I'm back to wondering whether we ever needed explicit Atomic.fence calls for ARM at all, because codegen should already emit them.

If so, the problem was merely the lazy set breaking the acquire/release requirement (an acquire is only guaranteed when paired with a release, and vice versa), as the sketch below illustrates.
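A minimal sketch of that pairing, using `Atomic#lazy_set` (the non-atomic store from the stdlib):

```crystal
held = Atomic(Int32).new(1) # 1 = locked

# Before: lazy_set compiles to a plain, non-atomic store, so the unlock
# carries no release semantics and an acquire on the lock side has
# nothing to pair with.
held.lazy_set(0)

held.set(1) # take the lock again for the second half of the example

# After: an atomic store with release ordering, properly pairing with
# the acquire performed when taking the lock.
held.set(0, :release)
```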

Let's cross compile & disassemble to check that.

@ysbaddaden (Contributor, Author)

I got mixed results. With a mere --target=arm-linux-gnueabihf, LLVM assumes I'm targeting an old ARM CPU and expects __sync_* symbols from libgcc/libgcc_s, so I'm not sure about the memory barriers (it depends on the __sync implementation).

If I enable a recent CPU (e.g. --mattr=+armv7-a) then it generates proper loads and stores with memory barriers (dmb ish). For example:

   304bc:       e3a00f72        mov     r0, #456        ; 0x1c8
   304c0:       e58d0000        str     r0, [sp]
   304c4:       f57ff05b        dmb     ish
   304c8:       e59d0000        ldr     r0, [sp]
   304cc:       f57ff05b        dmb     ish

@ysbaddaden (Contributor, Author) commented May 6, 2024

Confirmed: the explicit fences are not needed for armv6 and armv7, while armv5 and older don't support fences at all. See #14567
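For context, a sketch of the now-redundant explicit-fence pattern, using the Atomic.fence mentioned above (its ordering parameter is assumed); LLVM already emits dmb ish for the atomic orderings themselves on ARMv6/v7:

```crystal
# Redundant pattern on ARMv6/v7: relaxed atomics plus explicit fences,
# instead of acquire/release orderings on the atomics themselves.
held = Atomic(Int32).new(0)

while held.swap(1, :relaxed) == 1
  # busy-wait
end
Atomic.fence(:acquire) # explicit barrier after taking the lock

# ... critical section ...

Atomic.fence(:release) # explicit barrier before releasing
held.set(0, :relaxed)
```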
