Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

akirilov-arm · 2021-10-08T11:37:05Z

This is the first part of a fix to issue #3310. Unfortunately, there are more calls than necessary to rsix::process::membarrier(rsix::process::MembarrierCommand::RegisterPrivateExpeditedSyncCore) (it is sufficient to call it once per process), but both cranelift_jit::JITModule and wasmtime_jit::CodeMemory are public interfaces, so my current approach is the best I have come up with that hides this AArch64 memory model detail from any crate users; I would appreciate any suggestions for improvements.

cranelift/jit/src/backend.rs

alexcrichton · 2021-10-11T14:23:34Z

I'm trying to read up a bit more on the AArch64 requirements here to understand better why these are required. I'm currently reaching the conclusion though of wondering if we need this due to the safety of Rust, but you know much more than me so I'm hoping that you can educate me!

In the ARM manual 2.2.5 that talks about self-modifying code and 2.4.4 talks about cache coherency, but at least for us JIT code isn't self-modifying and for cache coherency I figured that Rust safety guarantees would save us from having to execute these barriers. For example the Module, when created, is only visible to the calling thread. For any other thread to actually get access to the Module there'd have to be some form of external synchronization (e.g. a mutex, a channel, etc). Do the typical memory-barrier instructions, though, not cover the instruction cache?

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel performing icache synchronization. Does this system call add more overhead on top of that? Or is w/e the kernel doing in there also required on top of this synchronization?

I suppose my main question is, in the context of safe Rust programs where we're guaranteed that making Module visible to other threads guarantees a correct memory barrier, is this still required? This sort of reminds me of Arc::clone in the standard library where increasing the reference count actually uses a Relaxed memory barrier because actually sending the new clone of the Arc to another thread requires some form of external synchronization, so there's no need to double it up.

cfallin · 2021-10-14T19:39:58Z

@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.

For example, the sequence of instructions in the manual B2.2.4 (page B2-131 in my PDF) does a DC CVAU to "clean the data cache", then a DSB ISH to "ensure visibility of the data cleaned from cache", then a IC IVAU to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)

In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.

The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the membarrier() that hits every core.

Thoughts?

cfallin · 2021-10-14T19:47:31Z

(Clarification on above: the IC IVAU is broadcast across the coherence domain apparently, but the ISB ("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).)

cfallin · 2021-10-14T19:51:48Z

Ah, and one more clarification I wanted to mention re: other comments above and the manual is that "self-modifying code" in this context is actually applicable, because although our JIT code doesn't literally modify itself, to the core (and to a microarchitect) it's all the same -- we are executing data that we've written to memory so we have to worry about when that data becomes visible to instruction-fetch.

akirilov-arm · 2021-10-14T20:59:54Z

@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!

I just want to add a couple of details - in essence this pull request is about adding a "broadcast" ISB, which is not part of the Arm architecture and which the membarrier() system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed, ISB is a pipeline serialization or "context synchronization event" in the architecture's parlance and a recency tracking mechanism could be beneficial because we do not need the serialization after generating every single bit of code - e.g. if several functions are compiled in parallel by different worker threads, we could wait till all compilations complete before executing an ISB instruction on the current processor, as long as it does not try to run any of the functions in the meantime.

My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the compiler-builtins crate is filled in. Cache flushing is in fact the easy part of the problem - precisely because IC IVAU is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

alexcrichton · 2021-10-14T21:36:20Z

Oh wow this is tricky, which I suppose should be expected of anything caching related.

So to make sure I understand, the first step recommended by the manual is that the data cache is flushed out (since we just wrote functions into memory) and then all instruction caches are brought up to date with the IC IVAU instruction. Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

In any case it sounds like it definitely doesn't suffice to have data-race-free code here since that basically only guarantees data-cache coherency, not instruction-cache coherency. Otherwise if the mapping of JIT code is reused from some previous mapping then an icache may have the old mapping still cached.

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

When I run this program on an aarch64 machine the example output I get is:

mprotect with no touched pages 960ns
mprotect with touched pages 43.838609ms

According to perf most of the time is spent in __flush_icache_range in the kernel. Given that this takes so long it seems like mprotect is doing some sort of synchronization in the kernel, I just don't know what kind. Is that entirely unrelated to the synchronization happening here though with ISB and such?

alexcrichton · 2021-10-14T21:39:27Z

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping? If that's the case then we may be able to skip synchronization since we know caches are empty for this range and no one will otherwise try to use it while we're writing to it.

cfallin · 2021-10-14T21:49:14Z

@alexcrichton I bet the membarrier() taking 43ms is due to IPIs and latency of entering the kernel on every other core; especially on the big 128-core machine we have to play with, that will take a while! (This is via smp_call_function_many() invoked in kernel/sched/membarrier.c, and that helper seems to broadcast the IPI and let them all go at once, but there is still the long tail if any core has disabled preemption temporarily.)

Re: fresh addresses ameliorating this, iirc at least some cores have PIPT (physically-indexed, physically-tagged) caches, so it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

alexcrichton · 2021-10-14T21:54:18Z

@cfallin but in the example program I'm not calling membarrier, just mprotect, so does that mean that mprotect is doing the same thing?

cfallin · 2021-10-14T21:56:22Z

Oh, I misread that, sorry; that's a good question! Maybe that is "just" TLB shootdown overhead then, hard to say without more detailed profiling...

cfallin · 2021-10-14T22:00:12Z

(Actually, not TLB shootdown if all the time is spent in __flush_icache_range as you say. I need more caffeine in this tea, sorry.)

This does make me think: if the mprotect() is doing an IPI to every core anyway, that certainly implies a full pipeline synchronization when the core handles the interrupt, if nothing else; maybe in practice this already covers us? Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

bjorn3 · 2021-10-15T08:18:40Z

Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

I guess you could ask a maintainer if this is guaranteed? And if not maybe Wasmtime depending on it will make it guaranteed due to "don't break the userspace"? https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE

akirilov-arm · 2021-10-15T22:01:11Z

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to ISB. In particular, this means that the thread generating the code does not need to execute ISB because mprotect() has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is indeed PIPT. For instance, the Technical Reference Manual (TRM) for Neoverse N1 states in section A6.1 that L1 caches behave as PIPT.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit 😄, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make the membarrier() call redundant).

alexcrichton

Thanks for indulging me and my questions, this has been very imformative! I'm happy at least with the explanations here and would be fine to see this land now.

Did you want to investigate though and see what's causing the time in mprotect to take awhile before merging though?

akirilov-arm · 2021-10-18T20:40:34Z

Yes, I am taking a look.

To go on a bit of a tangent, some recent research has uncovered a vulnerability, Speculative Code Store Bypass (SCSB, tracked by CVE-2021-0089 and CVE-2021-26313 on Intel and AMD processors respectively), that demonstrates the limitations of the approach on x86. The suggested mitigation seems to be pretty much equivalent to part of what the Arm architecture requires.

akirilov-arm · 2021-10-20T14:33:53Z

It turns out that when the Linux kernel makes a memory mapping executable, it also performs the necessary cache flushing on AArch64. Setting aside the discussion of whether we should rely on an implementation detail like that or not, the real point is that this behaviour still does not solve the issue that this PR is taking care of, namely the requirement to run ISB on all processors that might execute the new code. In fact, the implementation guarantees that there will be no inter-processor interrupts.

This was previously done on bytecodealliance#3426 for linux.

@akirilov-arm

* cranelift: Add FlushInstructionCache for AArch64 on Windows This was previously done on #3426 for linux. * wasmtime: Add FlushInstructionCache for AArch64 on Windows This was previously done on #3426 for linux. * cranelift: Add MemoryUse flag to JIT Memory Manager This allows us to keep the icache flushing code self-contained and not leak implementation details. This also changes the windows icache flushing code to only flush pages that were previously unflushed. * Add jit-icache-coherence crate * cranelift: Use `jit-icache-coherence` * wasmtime: Use `jit-icache-coherence` * jit-icache-coherence: Make rustix feature additive Mutually exclusive features cause issues. * wasmtime: Remove rustix from wasmtime-jit We now use it via jit-icache-coherence * Rename wasmtime-jit-icache-coherency crate * Use cfg-if in wasmtime-jit-icache-coherency crate * Use inline instead of inline(always) * Add unsafe marker to clear_cache * Conditionally compile all rustix operations membarrier does not exist on MacOS * Publish `wasmtime-jit-icache-coherence` * Remove explicit windows check This is implied by the target_os = "windows" above * cranelift: Remove len != 0 check This is redundant as it is done in non_protected_allocations_iter * Comment cleanups Thanks @akirilov-arm! * Make clear_cache safe * Rename pipeline_flush to pipeline_flush_mt * Revert "Make clear_cache safe" This reverts commit 21165d8. * More docs! * Fix pipeline_flush reference on clear_cache * Update more docs! * Move pipeline flush after `mprotect` calls Technically the `clear_cache` operation is a lie in AArch64, so move the pipeline flush after the `mprotect` calls so that it benefits from the implicit cache cleaning done by it. * wasmtime: Remove rustix backend from icache crate * wasmtime: Use libc for macos * wasmtime: Flush icache on all arch's for windows * wasmtime: Add flags to membarrier call

bjorn3 reviewed Oct 8, 2021

View reviewed changes

cranelift/jit/src/backend.rs Outdated Show resolved Hide resolved

bjorn3 reviewed Oct 8, 2021

View reviewed changes

cranelift/jit/src/backend.rs Show resolved Hide resolved

github-actions bot added the cranelift Issues related to the Cranelift code generator label Oct 8, 2021

akirilov-arm force-pushed the membarrier branch from a0cfa83 to 266c351 Compare October 15, 2021 16:54

alexcrichton approved these changes Oct 18, 2021

View reviewed changes

Call membarrier() after making JIT mappings executable on AArch64 Linux

d6e67b8

The membarrier() system call ensures that no processor has fetched a stale instruction stream. Copyright (c) 2021, Arm Limited.

akirilov-arm force-pushed the membarrier branch from 266c351 to d6e67b8 Compare October 22, 2021 19:31

akirilov-arm merged commit e9c4164 into bytecodealliance:main Oct 25, 2021

akirilov-arm deleted the membarrier branch October 25, 2021 12:25

afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022

cranelift: Add FlushInstructionCache for AArch64 on Windows

a1673ca

This was previously done on bytecodealliance#3426 for linux.

afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022

wasmtiem: Add FlushInstructionCache for AArch64 on Windows

4eeb614

This was previously done on bytecodealliance#3426 for linux.

afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022

cranelift: Add FlushInstructionCache for AArch64 on Windows

7e05702

This was previously done on bytecodealliance#3426 for linux.

afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022

wasmtime: Add FlushInstructionCache for AArch64 on Windows

d4741f8

This was previously done on bytecodealliance#3426 for linux.

afonso360 mentioned this pull request Oct 2, 2022

Flush Icache on AArch64 Windows #4997

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

akirilov-arm commented Oct 8, 2021

alexcrichton commented Oct 11, 2021

cfallin commented Oct 14, 2021 •

edited

Loading

cfallin commented Oct 14, 2021 •

edited

Loading

cfallin commented Oct 14, 2021

akirilov-arm commented Oct 14, 2021 •

edited

Loading

alexcrichton commented Oct 14, 2021

alexcrichton commented Oct 14, 2021

cfallin commented Oct 14, 2021

alexcrichton commented Oct 14, 2021

cfallin commented Oct 14, 2021

cfallin commented Oct 14, 2021

bjorn3 commented Oct 15, 2021

akirilov-arm commented Oct 15, 2021 •

edited

Loading

alexcrichton left a comment

akirilov-arm commented Oct 18, 2021

akirilov-arm commented Oct 20, 2021

Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

Conversation

akirilov-arm commented Oct 8, 2021

alexcrichton commented Oct 11, 2021

cfallin commented Oct 14, 2021 • edited Loading

cfallin commented Oct 14, 2021 • edited Loading

cfallin commented Oct 14, 2021

akirilov-arm commented Oct 14, 2021 • edited Loading

alexcrichton commented Oct 14, 2021

alexcrichton commented Oct 14, 2021

cfallin commented Oct 14, 2021

alexcrichton commented Oct 14, 2021

cfallin commented Oct 14, 2021

cfallin commented Oct 14, 2021

bjorn3 commented Oct 15, 2021

akirilov-arm commented Oct 15, 2021 • edited Loading

alexcrichton left a comment

Choose a reason for hiding this comment

akirilov-arm commented Oct 18, 2021

akirilov-arm commented Oct 20, 2021

cfallin commented Oct 14, 2021 •

edited

Loading

cfallin commented Oct 14, 2021 •

edited

Loading

akirilov-arm commented Oct 14, 2021 •

edited

Loading

akirilov-arm commented Oct 15, 2021 •

edited

Loading