
Fix: Atomics and Locks (ARM, AArch64, X86) #14293

Merged

Conversation

@ysbaddaden (Contributor) commented Feb 12, 2024

This patch should fix the use of atomics in locks once and for all (until we add new architectures, each of which will need to be investigated). All usages should now be correct as per the textbook, instead of relying on hacks, and without much performance impact.

Explanation

CPU architectures with a weak memory model (such as ARM) can reorder CPU instructions at runtime (unlike X86), not following the exact order of the assembly. We must respect each memory model's guarantees, using proper barriers or specific memory orderings at the places that require them (e.g. locks).

A lock must ensure that anything after the lock is acquired isn't executed before the acquisition, and the other way around: anything before we release the lock must be executed before we actually release it. Each architecture has its own quirks (a sketch of the resulting pattern follows the list):

  • AArch64: atomics guarantee the ordering, but we must use explicit atomics everywhere (at least load/store with acquire/release ordering) (source);
  • ARM v6/v7: atomics don't guarantee any ordering (only the atomicity of the operation), so we must add explicit memory barriers around locks (help wanted: can someone test on ARMv6 or v7 hardware? 🙏) (source);
  • X86: it seems we can get away with a non-atomic store, yet using acquire/release ordering isn't any slower than a non-atomic store.
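As a concrete illustration of the acquire/release pairing described above, here is a minimal spin lock sketch in Crystal, assuming the ordering arguments this PR adds to `Atomic` (the class name and exact signatures are illustrative, not the stdlib implementation):

```crystal
# Minimal spin lock sketch (illustrative only).
class SpinLock
  def initialize
    @held = Atomic(Int32).new(0)
  end

  def lock : Nil
    # Acquire: nothing after this point may be observed before the
    # successful swap that takes the lock.
    while @held.swap(1, :acquire) == 1
      # busy-wait until the lock is released
    end
  end

  def unlock : Nil
    # Release: everything before this point must be visible before
    # the store that releases the lock.
    @held.set(0, :release)
  end
end
```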

Explicit orderings

X86 is the main reason I added specific orderings. I couldn't notice any difference on AArch64 between seq-cst and acq/rel, but on X86 it had a great impact: with the correct acquire/release atomics the lock is no slower than with the lazy store.

Note: I'm following the algorithms used by the Linux v4.4 kernel overall.

Performance impact

I ran the "mutex" example from #14272 by @jgaskins ten times at different MT levels (ranging from 1 to 128 threads) and charted the average value. See run.cr.
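run.cr isn't reproduced here; as a rough idea, a contention benchmark of this kind might look like the following sketch (hypothetical, not @jgaskins' original; the fiber and iteration counts are arbitrary):

```crystal
# Hypothetical mutex contention micro-benchmark (not the original run.cr).
# Build with: crystal build -Dpreview_mt --release bench.cr
# Vary the thread count with: CRYSTAL_WORKERS=N ./bench
mutex = Mutex.new
counter = 0
done = Channel(Nil).new
fibers = 8
iterations = 100_000

elapsed = Time.measure do
  fibers.times do
    spawn do
      iterations.times do
        mutex.synchronize { counter += 1 }
      end
      done.send(nil)
    end
  end
  fibers.times { done.receive }
end

puts "#{counter} increments in #{elapsed.total_seconds}s"
```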

[chart: mutex-x86]

On X86 (Intel Haswell i7-4712HQ, 4 cores, 8 threads) we see that this patch isn't any slower than current master (incorrect, but appears to work), whereas using load + store with the default sequentially consistent ordering has a huge impact.

[chart: mutex-aarch64]

On AArch64 (Neoverse N1 server, 160 cores) we can see that this patch is slower than master with the traditional LL/SC atomics (inherited from ARMv6), but the difference becomes unnoticeable when the new LSE atomics are enabled (--mattr=+lse); they significantly outperform LL/SC on this hardware.

Future evolution: we might want to enable LSE atomics by default for AArch64 when available: we can detect the lse feature with LLVM 15 and this snippet by @HertzDevil, and maybe add a flag to disable it when needed.

Help wanted: I'd be very interested to see results of LL/SC vs LSE atomics on other AArch64 CPUs: Apple, Raspberry Pi, AWS, ... 🙏


Allows specifying how memory accesses, including non-atomic ones, may be
reordered around atomics. Follows the C/C++ semantics:
<https://en.cppreference.com/w/c/atomic/memory_order>

Keeps sequentially consistent as the default ordering.
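Assuming the API shape this adds (symbols autocast to an `Atomic::Ordering` enum mirroring the C/C++ names), usage would look something like:

```crystal
x = Atomic(Int32).new(0)

x.set(1)                    # defaults to :sequentially_consistent
x.set(2, :release)          # release store
value = x.get(:acquire)     # acquire load
x.add(1, :acquire_release)  # read-modify-write with acq/rel semantics
```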

There are a couple of difficulties with this patch:

1. LLVM 12 and below only accepted some specific ordering combinations
   in cmpxchg, and all LLVM versions refuse acquire ordering on stores
   and release ordering on loads.

2. Codegen needs the ordering to be a constant (to generate the LLVM
   IR), but it is now a variable in the stdlib. We thus generate all
   the possible cases every time, which the LLVM optimization passes
   optimize away when they realize only one branch is possible (unless
   `ordering` really is a variable, which should be very rare); see the
   sketch below.
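A hypothetical sketch of that dispatch pattern, with the stdlib's own `Atomic#get` standing in for the compiler primitive that requires a constant ordering:

```crystal
# Hypothetical sketch: branch over every ordering so the primitive is
# always called with a constant; LLVM's optimizer removes the dead
# branches whenever `ordering` is itself a constant at the call site.
def load_with(atomic : Atomic(Int32), ordering : Atomic::Ordering) : Int32
  case ordering
  in .relaxed?                 then atomic.get(:relaxed)
  in .acquire?                 then atomic.get(:acquire)
  in .sequentially_consistent? then atomic.get(:sequentially_consistent)
  in .release?, .acquire_release?
    raise ArgumentError.new("atomic loads cannot have release semantics")
  end
end
```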
Fix: detect LLVM version
@Sija (Contributor) commented Feb 15, 2024

macOS CI is still failing with `Assertion failed: (isValidFailureOrdering(Ordering) && "invalid CmpXchg failure ordering"), function setFailureOrdering, file Instructions.h, line 612`.

@ysbaddaden (Contributor, Author) commented Feb 15, 2024

Thanks @Sija, I missed that assertion; it felt like nixpkgs was crashing. I have no idea what's happening there. The assertion is pointless and doesn't hint at any actual issue.

Not having the assertion and having the actual LLVM IR error would actually be much more helpful.

Might Crystal::LLVM_VERSION be invalid or something?

@ysbaddaden (Contributor, Author)

@Sija are you sure? CI reports a different message: `invalid CmpXchg failure ordering`, not `must be atomic`.

@ysbaddaden (Contributor, Author)

I tried to reproduce with Nix on Linux, but the assertion won't trigger 🤷

@Sija (Contributor) commented Feb 15, 2024

It's odd: the failing assertion seems to be this one: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/Instructions.h#L623-L624, yet the CI message mentions line 612...

Also, llvm-config reports version 11.1.0 (Using /nix/store/pj58mky3176z3xh7xgsp5sf46dfpyy5k-llvm-11.1.0-dev/bin/llvm-config [version= 11.1.0]), but for that version the assertion is on a different line: https://github.com/llvm/llvm-project/blob/llvmorg-11.1.0/llvm/include/llvm/IR/Instructions.h#L611-L612

@ysbaddaden (Contributor, Author) commented Feb 15, 2024

The isValidFailureOrdering function doesn't even exist in LLVM 11. It's like the reported and actually-used LLVM versions are completely different. Is nix doing shenanigans? 🤔

I think I'm starting to understand: LLVM with assertions turned on still checks that the failure ordering can't have release semantics, but without assertions it will accept it. Sigh, so which is which? 😕
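For reference, the constraint under discussion, assuming `compare_and_set` gained separate success/failure orderings in this PR (signatures assumed):

```crystal
flag = Atomic(Int32).new(0)

# Valid: a failure ordering may be :relaxed, :acquire or
# :sequentially_consistent.
flag.compare_and_set(0, 1, :acquire_release, :acquire)

# Invalid: a failed CAS performs no store, so its ordering cannot have
# release semantics; this is what the LLVM assertion guards against.
# flag.compare_and_set(0, 1, :acquire_release, :release)
```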

There are cases where the detection doesn't work properly, and we might
want to avoid having two Crystal compilers behave differently depending
on the LLVM version they were compiled with.

We might want to enable more cases when we ditch support for LLVM < 13.0.
@ysbaddaden (Contributor, Author)

As stated in the latest commit: I removed the check for the LLVM version and only enabled the strictest of the choices. That seems to have fixed the issue with the LLVM assertions, and I think it's better not to have different semantics based on the LLVM version Crystal happens to have been compiled with.

@HertzDevil (Contributor)

The assertion failure comes from LLVM 15.0.2, since that is the version the compiler itself is built with.

Add comment about not enforcing the "no stronger" rule for CAS in the compiler (codegen has to be valid) vs the stdlib (consistent public API).
@straight-shoota straight-shoota added this to the 1.12.0 milestone Mar 4, 2024
@straight-shoota straight-shoota merged commit c734cc4 into crystal-lang:master Mar 5, 2024
56 of 57 checks passed
@ysbaddaden ysbaddaden deleted the fix/atomics-and-locks-3 branch March 5, 2024 15:25
@ysbaddaden (Contributor, Author)

Reading https://llvm.org/docs/Atomics.html again, I realize the notes on code generation mention using explicit fences on weakly-ordered architectures... and I'm back to wondering whether we ever needed explicit Atomic.fence calls for ARM at all, because codegen should already emit them.

If so, the problem was merely the lazy set breaking the acquire/release requirement (an acquire is only guaranteed when paired with a release, and vice versa), as the sketch below illustrates.
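A minimal sketch of that pairing, using `Atomic#lazy_set` (the non-atomic store from the stdlib):

```crystal
held = Atomic(Int32).new(1) # 1 = locked

# Before: lazy_set compiles to a plain, non-atomic store, so the unlock
# carries no release semantics and an acquire on the lock side has
# nothing to pair with.
held.lazy_set(0)

held.set(1) # take the lock again for the second half of the example

# After: an atomic store with release ordering, properly pairing with
# the acquire performed when taking the lock.
held.set(0, :release)
```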

Let's cross compile & disassemble to check that.

@ysbaddaden (Contributor, Author)

I got mixed results. With a mere --target=arm-linux-gnueabihf, LLVM assumes I'm targeting an old ARM CPU and expects __sync_* symbols from libgcc/libgcc_s, so I'm not sure about the memory barriers (it depends on the __sync implementation).

If I enable a recent CPU (e.g. --mattr=+armv7-a) then it generates proper loads and stores with memory barriers (dmb ish). For example:

   304bc:       e3a00f72        mov     r0, #456        ; 0x1c8
   304c0:       e58d0000        str     r0, [sp]
   304c4:       f57ff05b        dmb     ish
   304c8:       e59d0000        ldr     r0, [sp]
   304cc:       f57ff05b        dmb     ish

@ysbaddaden (Contributor, Author) commented May 6, 2024

Confirmed: the explicit fences are not needed for armv6 and armv7, while armv5 and older don't support fences at all. See #14567
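For context, a sketch of the now-redundant explicit-fence pattern, using the Atomic.fence mentioned above (its ordering parameter is assumed); LLVM already emits dmb ish for the atomic orderings themselves on ARMv6/v7:

```crystal
# Redundant pattern on ARMv6/v7: relaxed atomics plus explicit fences,
# instead of acquire/release orderings on the atomics themselves.
held = Atomic(Int32).new(0)

while held.swap(1, :relaxed) == 1
  # busy-wait
end
Atomic.fence(:acquire) # explicit barrier after taking the lock

# ... critical section ...

Atomic.fence(:release) # explicit barrier before releasing
held.set(0, :relaxed)
```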
