Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call membarrier() after making JIT mappings executable on AArch64 Linux #3426

Merged
merged 1 commit into from
Oct 25, 2021

Conversation

akirilov-arm
Copy link
Contributor

This is the first part of a fix to issue #3310. Unfortunately, there are more calls than necessary to rsix::process::membarrier(rsix::process::MembarrierCommand::RegisterPrivateExpeditedSyncCore) (it is sufficient to call it once per process), but both cranelift_jit::JITModule and wasmtime_jit::CodeMemory are public interfaces, so my current approach is the best I have come up with that hides this AArch64 memory model detail from any crate users; I would appreciate any suggestions for improvements.

cranelift/jit/src/backend.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Oct 8, 2021
@alexcrichton
Copy link
Member

I'm trying to read up a bit more on the AArch64 requirements here to understand better why these are required. I'm currently reaching the conclusion though of wondering if we need this due to the safety of Rust, but you know much more than me so I'm hoping that you can educate me!

In the ARM manual 2.2.5 that talks about self-modifying code and 2.4.4 talks about cache coherency, but at least for us JIT code isn't self-modifying and for cache coherency I figured that Rust safety guarantees would save us from having to execute these barriers. For example the Module, when created, is only visible to the calling thread. For any other thread to actually get access to the Module there'd have to be some form of external synchronization (e.g. a mutex, a channel, etc). Do the typical memory-barrier instructions, though, not cover the instruction cache?

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel performing icache synchronization. Does this system call add more overhead on top of that? Or is w/e the kernel doing in there also required on top of this synchronization?

I suppose my main question is, in the context of safe Rust programs where we're guaranteed that making Module visible to other threads guarantees a correct memory barrier, is this still required? This sort of reminds me of Arc::clone in the standard library where increasing the reference count actually uses a Relaxed memory barrier because actually sending the new clone of the Arc to another thread requires some form of external synchronization, so there's no need to double it up.

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.

For example, the sequence of instructions in the manual B2.2.4 (page B2-131 in my PDF) does a DC CVAU to "clean the data cache", then a DSB ISH to "ensure visibility of the data cleaned from cache", then a IC IVAU to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)

In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.

The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the membarrier() that hits every core.

Thoughts?

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

(Clarification on above: the IC IVAU is broadcast across the coherence domain apparently, but the ISB ("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).)

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

Ah, and one more clarification I wanted to mention re: other comments above and the manual is that "self-modifying code" in this context is actually applicable, because although our JIT code doesn't literally modify itself, to the core (and to a microarchitect) it's all the same -- we are executing data that we've written to memory so we have to worry about when that data becomes visible to instruction-fetch.

@akirilov-arm
Copy link
Contributor Author

akirilov-arm commented Oct 14, 2021

@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!

I just want to add a couple of details - in essence this pull request is about adding a "broadcast" ISB, which is not part of the Arm architecture and which the membarrier() system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed, ISB is a pipeline serialization or "context synchronization event" in the architecture's parlance and a recency tracking mechanism could be beneficial because we do not need the serialization after generating every single bit of code - e.g. if several functions are compiled in parallel by different worker threads, we could wait till all compilations complete before executing an ISB instruction on the current processor, as long as it does not try to run any of the functions in the meantime.

My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the compiler-builtins crate is filled in. Cache flushing is in fact the easy part of the problem - precisely because IC IVAU is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

@alexcrichton
Copy link
Member

Oh wow this is tricky, which I suppose should be expected of anything caching related.

So to make sure I understand, the first step recommended by the manual is that the data cache is flushed out (since we just wrote functions into memory) and then all instruction caches are brought up to date with the IC IVAU instruction. Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

In any case it sounds like it definitely doesn't suffice to have data-race-free code here since that basically only guarantees data-cache coherency, not instruction-cache coherency. Otherwise if the mapping of JIT code is reused from some previous mapping then an icache may have the old mapping still cached.

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

When I run this program on an aarch64 machine the example output I get is:

mprotect with no touched pages 960ns
mprotect with touched pages 43.838609ms

According to perf most of the time is spent in __flush_icache_range in the kernel. Given that this takes so long it seems like mprotect is doing some sort of synchronization in the kernel, I just don't know what kind. Is that entirely unrelated to the synchronization happening here though with ISB and such?

@alexcrichton
Copy link
Member

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping? If that's the case then we may be able to skip synchronization since we know caches are empty for this range and no one will otherwise try to use it while we're writing to it.

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

@alexcrichton I bet the membarrier() taking 43ms is due to IPIs and latency of entering the kernel on every other core; especially on the big 128-core machine we have to play with, that will take a while! (This is via smp_call_function_many() invoked in kernel/sched/membarrier.c, and that helper seems to broadcast the IPI and let them all go at once, but there is still the long tail if any core has disabled preemption temporarily.)

Re: fresh addresses ameliorating this, iirc at least some cores have PIPT (physically-indexed, physically-tagged) caches, so it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

@alexcrichton
Copy link
Member

@cfallin but in the example program I'm not calling membarrier, just mprotect, so does that mean that mprotect is doing the same thing?

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

Oh, I misread that, sorry; that's a good question! Maybe that is "just" TLB shootdown overhead then, hard to say without more detailed profiling...

@cfallin
Copy link
Member

cfallin commented Oct 14, 2021

(Actually, not TLB shootdown if all the time is spent in __flush_icache_range as you say. I need more caffeine in this tea, sorry.)

This does make me think: if the mprotect() is doing an IPI to every core anyway, that certainly implies a full pipeline synchronization when the core handles the interrupt, if nothing else; maybe in practice this already covers us? Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 15, 2021

Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

I guess you could ask a maintainer if this is guaranteed? And if not maybe Wasmtime depending on it will make it guaranteed due to "don't break the userspace"? https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE

@akirilov-arm
Copy link
Contributor Author

akirilov-arm commented Oct 15, 2021

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to ISB. In particular, this means that the thread generating the code does not need to execute ISB because mprotect() has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is indeed PIPT. For instance, the Technical Reference Manual (TRM) for Neoverse N1 states in section A6.1 that L1 caches behave as PIPT.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit 😄, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make the membarrier() call redundant).

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for indulging me and my questions, this has been very imformative! I'm happy at least with the explanations here and would be fine to see this land now.

Did you want to investigate though and see what's causing the time in mprotect to take awhile before merging though?

@akirilov-arm
Copy link
Contributor Author

Yes, I am taking a look.

To go on a bit of a tangent, some recent research has uncovered a vulnerability, Speculative Code Store Bypass (SCSB, tracked by CVE-2021-0089 and CVE-2021-26313 on Intel and AMD processors respectively), that demonstrates the limitations of the approach on x86. The suggested mitigation seems to be pretty much equivalent to part of what the Arm architecture requires.

@akirilov-arm
Copy link
Contributor Author

It turns out that when the Linux kernel makes a memory mapping executable, it also performs the necessary cache flushing on AArch64. Setting aside the discussion of whether we should rely on an implementation detail like that or not, the real point is that this behaviour still does not solve the issue that this PR is taking care of, namely the requirement to run ISB on all processors that might execute the new code. In fact, the implementation guarantees that there will be no inter-processor interrupts.

The membarrier() system call ensures that no processor has fetched
a stale instruction stream.

Copyright (c) 2021, Arm Limited.
@akirilov-arm akirilov-arm merged commit e9c4164 into bytecodealliance:main Oct 25, 2021
@akirilov-arm akirilov-arm deleted the membarrier branch October 25, 2021 12:25
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Oct 2, 2022
cfallin pushed a commit that referenced this pull request Oct 12, 2022
* cranelift: Add FlushInstructionCache for AArch64 on Windows

This was previously done on #3426 for linux.

* wasmtime: Add FlushInstructionCache for AArch64 on Windows

This was previously done on #3426 for linux.

* cranelift: Add MemoryUse flag to JIT Memory Manager

This allows us to keep the icache flushing code self-contained and not leak implementation details.

This also changes the windows icache flushing code to only flush pages that were previously unflushed.

* Add jit-icache-coherence crate

* cranelift: Use `jit-icache-coherence`

* wasmtime: Use `jit-icache-coherence`

* jit-icache-coherence: Make rustix feature additive

Mutually exclusive features cause issues.

* wasmtime: Remove rustix from wasmtime-jit

We now use it via jit-icache-coherence

* Rename wasmtime-jit-icache-coherency crate

* Use cfg-if in wasmtime-jit-icache-coherency crate

* Use inline instead of inline(always)

* Add unsafe marker to clear_cache

* Conditionally compile all rustix operations

membarrier does not exist on MacOS

* Publish `wasmtime-jit-icache-coherence`

* Remove explicit windows check

This is implied by the target_os = "windows" above

* cranelift: Remove len != 0 check

This is redundant as it is done in non_protected_allocations_iter

* Comment cleanups

Thanks @akirilov-arm!

* Make clear_cache safe

* Rename pipeline_flush to pipeline_flush_mt

* Revert "Make clear_cache safe"

This reverts commit 21165d8.

* More docs!

* Fix pipeline_flush reference on clear_cache

* Update more docs!

* Move pipeline flush after `mprotect` calls

Technically the `clear_cache` operation is a lie in AArch64, so move the pipeline flush after the `mprotect` calls so that it benefits from the implicit cache cleaning done by it.

* wasmtime: Remove rustix backend from icache crate

* wasmtime: Use libc for macos

* wasmtime: Flush icache on all arch's for windows

* wasmtime: Add flags to membarrier call
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants