Call membarrier() after making JIT mappings executable on AArch64 Linux #3426
Conversation
I'm trying to read up a bit more on the AArch64 requirements here to understand better why these are required. I'm currently reaching the conclusion, though, of wondering whether we need this given the safety of Rust, but you know much more than me, so I'm hoping that you can educate me! In the ARM manual, section 2.2.5 talks about self-modifying code and 2.4.4 talks about cache coherency, but at least for us JIT code isn't self-modifying, and for cache coherency I figured that Rust's safety guarantees would save us from having to execute these barriers. For example, the …

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the …

I suppose my main question is: in the context of safe Rust programs, where we're guaranteed that making … |
@alexcrichton The issue here, I think, is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right. For example, the sequence of instructions in the manual's B2.2.4 (page B2-131 in my PDF) does a …

In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves: flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses, all the time). So we want to somehow track whether this core has done an icache flush recently enough to pick up whatever changes.

The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, and then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the membarrier. Thoughts? |
(Clarification on above: the … ) |
Ah, and one more clarification I wanted to mention re: other comments above and the manual: "self-modifying code" in this context is actually applicable, because although our JIT code doesn't literally modify itself, to the core (and to a microarchitect) it's all the same -- we are executing data that we've written to memory, so we have to worry about when that data becomes visible to instruction fetch. |
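For readers following along, the per-address sequence the manual describes (clean the dcache line, invalidate the icache line, with barriers in between) looks roughly like the sketch below. This is illustrative only, not the code in this PR; it assumes a 64-byte cache line, whereas real code should read the line sizes from `CTR_EL0`.

```rust
// Sketch of the Arm ARM B2.2.4 maintenance sequence for freshly written code.
// Assumes a 64-byte cache line; production code should query CTR_EL0 instead.
#[cfg(target_arch = "aarch64")]
unsafe fn sync_icache_range(start: *const u8, len: usize) {
    use std::arch::asm;
    const LINE: usize = 64;
    let begin = (start as usize) & !(LINE - 1);
    let end = start as usize + len;

    // Clean each data cache line to the point of unification, so the newly
    // written instructions reach the level where instruction fetch sees them.
    let mut addr = begin;
    while addr < end {
        asm!("dc cvau, {0}", in(reg) addr);
        addr += LINE;
    }
    asm!("dsb ish");

    // Invalidate the corresponding instruction cache lines.
    let mut addr = begin;
    while addr < end {
        asm!("ic ivau, {0}", in(reg) addr);
        addr += LINE;
    }
    asm!("dsb ish");

    // Flush this core's pipeline; other cores still need their own context
    // synchronization event, which is where membarrier() comes in.
    asm!("isb");
}
```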
@cfallin You are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing! I just want to add a couple of details: in essence this pull request is about adding a "broadcast" …

My changes are incomplete (hence I said that they were the first part of a fix) because the code that deals with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the …
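To make the syscall half of this concrete, here is a minimal sketch of the register-once-then-broadcast pattern being described. The command values come from `<linux/membarrier.h>`; the module and helper names are made up for illustration and are not the API added by this PR.

```rust
// Sketch only: register the process for "private expedited sync-core"
// membarrier once, then broadcast it after new code has been published.
#[cfg(target_os = "linux")]
mod sync_core {
    // Command values from <linux/membarrier.h>.
    const REGISTER_PRIVATE_EXPEDITED_SYNC_CORE: libc::c_long = 1 << 6;
    const PRIVATE_EXPEDITED_SYNC_CORE: libc::c_long = 1 << 5;

    fn membarrier(cmd: libc::c_long) -> std::io::Result<()> {
        // int membarrier(int cmd, unsigned int flags, int cpu_id)
        let rc = unsafe { libc::syscall(libc::SYS_membarrier, cmd, 0, 0) };
        if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
    }

    /// Needs to happen once per process, before the first broadcast.
    pub fn register() -> std::io::Result<()> {
        membarrier(REGISTER_PRIVATE_EXPEDITED_SYNC_CORE)
    }

    /// Forces a context synchronization event on every thread of the
    /// process, on whichever core each thread happens to be running.
    pub fn broadcast() -> std::io::Result<()> {
        membarrier(PRIVATE_EXPEDITED_SYNC_CORE)
    }
}
```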
@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?
Oh wow, this is tricky, which I suppose should be expected of anything caching-related. So to make sure I understand: the first step recommended by the manual is that the data cache is flushed out (since we just wrote functions into memory), and then all instruction caches are brought up to date with the …

In any case it sounds like it definitely doesn't suffice to have data-race-free code here, since that basically only guarantees data-cache coherency, not instruction-cache coherency. Otherwise, if the mapping of JIT code is reused from some previous mapping, then an icache may have the old mapping still cached.

Also to confirm, this PR is just the run-…
When I run this program on an aarch64 machine, the example output I get is: …

According to …
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap …
@alexcrichton I bet the …

Re: fresh addresses ameliorating this, IIRC at least some cores have PIPT (physically-indexed, physically-tagged) caches, so it's possible a physical address (a particular page frame) could be reused even if the virtual address is new, and there could be a stale cache line, I think; but @akirilov-arm can say if that's too paranoid :-)
@cfallin but in the example program I'm not calling …
Oh, I misread that, sorry; that's a good question! Maybe that is "just" TLB shootdown overhead then, hard to say without more detailed profiling...
(Actually, not TLB shootdown if all the time is spent in ….)

This does make me think: if the …
I guess you could ask a maintainer if this is guaranteed? And if not, maybe Wasmtime depending on it will make it guaranteed due to "don't break the userspace"? https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE
Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.
Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!
Correct.
When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view of the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is, architecturally equivalent to an ISB.
It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is indeed PIPT. For instance, the Technical Reference Manual (TRM) for Neoverse N1 states in section A6.1 that L1 caches behave as PIPT.
Maybe just a bit 😄, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler. As for the behaviour of Alex's example program, I'll have a look too because, again, in general I don't expect an IPI in response to …
Thanks for indulging me and my questions, this has been very informative! I'm happy at least with the explanations here and would be fine to see this land now.
Did you want to investigate what's causing the mprotect time to take a while before merging, though?
Yes, I am taking a look. To go on a bit of a tangent, some recent research has uncovered a vulnerability, Speculative Code Store Bypass (SCSB, tracked by CVE-2021-0089 and CVE-2021-26313 on Intel and AMD processors respectively), that demonstrates the limitations of the approach on x86. The suggested mitigation seems to be pretty much equivalent to part of what the Arm architecture requires.
It turns out that when the Linux kernel makes a memory mapping executable, it also performs the necessary cache flushing on AArch64. Setting aside the discussion of whether we should rely on an implementation detail like that or not, the real point is that this behaviour still does not solve the issue that this PR is taking care of, namely the requirement to run a context synchronizing operation (such as an ISB) on every core that might execute the new code.
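In other words, the ordering that ends up mattering looks something like the sketch below (variable and function names are illustrative, not the actual wasmtime code): copy the code, flip the mapping to read+execute (which on AArch64 Linux also takes care of the cache maintenance), and only then issue the broadcast barrier.

```rust
// Illustrative ordering only; not the code in this PR.
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
unsafe fn publish_code(code: *mut u8, len: usize) -> std::io::Result<()> {
    // 1. The JIT output has already been copied into the mapping at `code`.

    // 2. Make the mapping executable; the kernel performs dcache/icache
    //    maintenance for the affected range as part of this.
    if libc::mprotect(code.cast(), len, libc::PROT_READ | libc::PROT_EXEC) != 0 {
        return Err(std::io::Error::last_os_error());
    }

    // 3. Broadcast a sync-core membarrier (value 1 << 5 from
    //    <linux/membarrier.h>; registration with 1 << 6 must have happened
    //    earlier) so no core keeps running on a stale instruction stream.
    if libc::syscall(libc::SYS_membarrier, 1 << 5, 0, 0) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```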
The membarrier() system call ensures that no processor has fetched a stale instruction stream.

Copyright (c) 2021, Arm Limited.
* cranelift: Add FlushInstructionCache for AArch64 on Windows
  This was previously done on #3426 for linux.
* wasmtime: Add FlushInstructionCache for AArch64 on Windows
  This was previously done on #3426 for linux.
* cranelift: Add MemoryUse flag to JIT Memory Manager
  This allows us to keep the icache flushing code self-contained and not leak implementation details. This also changes the windows icache flushing code to only flush pages that were previously unflushed.
* Add jit-icache-coherence crate
* cranelift: Use `jit-icache-coherence`
* wasmtime: Use `jit-icache-coherence`
* jit-icache-coherence: Make rustix feature additive
  Mutually exclusive features cause issues.
* wasmtime: Remove rustix from wasmtime-jit
  We now use it via jit-icache-coherence.
* Rename wasmtime-jit-icache-coherency crate
* Use cfg-if in wasmtime-jit-icache-coherency crate
* Use inline instead of inline(always)
* Add unsafe marker to clear_cache
* Conditionally compile all rustix operations
  membarrier does not exist on MacOS.
* Publish `wasmtime-jit-icache-coherence`
* Remove explicit windows check
  This is implied by the target_os = "windows" above.
* cranelift: Remove len != 0 check
  This is redundant as it is done in non_protected_allocations_iter.
* Comment cleanups
  Thanks @akirilov-arm!
* Make clear_cache safe
* Rename pipeline_flush to pipeline_flush_mt
* Revert "Make clear_cache safe"
  This reverts commit 21165d8.
* More docs!
* Fix pipeline_flush reference on clear_cache
* Update more docs!
* Move pipeline flush after `mprotect` calls
  Technically the `clear_cache` operation is a lie on AArch64, so move the pipeline flush after the `mprotect` calls so that it benefits from the implicit cache cleaning done by them.
* wasmtime: Remove rustix backend from icache crate
* wasmtime: Use libc for macos
* wasmtime: Flush icache on all arch's for windows
* wasmtime: Add flags to membarrier call
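Based on the commit list above, the resulting crate is meant to be used in roughly this order after new code has been compiled. The function names are taken from the commits, but the signatures below are assumptions for illustration, so check the crate documentation rather than relying on this sketch.

```rust
// Rough usage sketch of wasmtime-jit-icache-coherence as described by the
// commits above; signatures are assumed, not copied from the crate.
use std::ffi::c_void;

unsafe fn publish(region: *const c_void, len: usize) {
    // Invalidate the icache for the region that received new code.
    wasmtime_jit_icache_coherence::clear_cache(region, len)
        .expect("icache maintenance failed");

    // ... make the region executable (mprotect / VirtualProtect) here ...

    // Then flush the pipelines of all threads in the process; per the
    // "Move pipeline flush after mprotect calls" commit this comes last.
    wasmtime_jit_icache_coherence::pipeline_flush_mt()
        .expect("pipeline flush failed");
}
```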
This is the first part of a fix to issue #3310. Unfortunately, there are more calls than necessary to `rsix::process::membarrier(rsix::process::MembarrierCommand::RegisterPrivateExpeditedSyncCore)` (it is sufficient to call it once per process), but both `cranelift_jit::JITModule` and `wasmtime_jit::CodeMemory` are public interfaces, so my current approach is the best I have come up with that hides this AArch64 memory model detail from any crate users; I would appreciate any suggestions for improvements.
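One possible way to avoid the redundant registration calls mentioned above would be to funnel every caller through a process-wide `Once`, as in the sketch below. This is a hypothetical helper for illustration, not something this PR adds.

```rust
use std::sync::Once;

static REGISTER_MEMBARRIER: Once = Once::new();

/// Hypothetical helper: safe to call from every JITModule/CodeMemory
/// constructor, but the registration syscall runs at most once per process.
pub fn ensure_sync_core_registered() {
    REGISTER_MEMBARRIER.call_once(|| {
        #[cfg(all(target_os = "linux", target_arch = "aarch64"))]
        unsafe {
            // MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE (1 << 6)
            libc::syscall(libc::SYS_membarrier, 1 << 6, 0, 0);
        }
    });
}
```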