Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush Icache on AArch64 Windows #4997

Merged
merged 28 commits into from
Oct 12, 2022

Conversation

afonso360
Copy link
Contributor

@afonso360 afonso360 commented Oct 2, 2022

👋 Hey,

I tried to run the cranelift filetest suite on a aarch64-pc-windows-msvc machine, and it crashes with STATUS_ILLEGAL_INSTRUCTION. All of the tests pass individually, and if I run it on a single core, it sometimes passes the entire test suite.

I think this is due to us not clearing the icache after writing the new code as required by arm. This PR adds a call to FlushInstructionCache where we already have a membarrier on linux (See #3426).

I'm not too knowledgeable about this, but it was what I saw recommended in a Arm Community blog post, although I would really appreciate if someone could doublecheck if this is the correct approach. This also seems to be what firefox does for their jit.

With this patch we can now pass the entire filetest suite without crashing!


I applied the same solution to the wasmtime side of things, but it's worth noting that I was never able to get a STATUS_ILLEGAL_INSTRUCTION there!
I tested with cargo test -p wasmtime-cli wast::Cranelift::spec::simd_i and all 48 tests pass, no matter how many times I try to run them.

I can't test the entire test suite since that fails due to #4992 (I think)

cc: @cfallin @akirilov-arm

cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Oct 2, 2022
This allows us to keep the icache flushing code self-contained and not leak implementation details.

This also changes the windows icache flushing code to only flush pages that were previously unflushed.
Copy link
Contributor

@akirilov-arm akirilov-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good - you might not have realized it, but you are fixing #3310 on Windows.

BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in #3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.

cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
crates/jit/src/code_memory.rs Outdated Show resolved Hide resolved
cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
@bjorn3
Copy link
Contributor

bjorn3 commented Oct 3, 2022

BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in #3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.

The documentation for FlushInstructionCache says:

https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushinstructioncache

Applications should call FlushInstructionCache if they generate or modify code in memory. The CPU cannot detect the change, and may execute the old code it cached.

So it seems that FlushInstructionCache must always be called no matter if FlushProcessWriteBuffers is called or not.

@akirilov-arm
Copy link
Contributor

akirilov-arm commented Oct 3, 2022

Yes, I am not claiming that FlushProcessWriteBuffers() replaces FlushInstructionCache() - my point is that it (or something else with the necessary semantics) should be used after the latter, unless FlushInstructionCache() also performs an implicit broadcast ISB operation to all running threads of the process (which the documentation doesn't suggest, to me at least). Similarly, as I mentioned on Linux membarrier() does not flush instruction caches.

P.S. The same considerations about single-threaded applications apply to FlushProcessWriteBuffers() as well, of course.

@afonso360
Copy link
Contributor Author

afonso360 commented Oct 3, 2022

Thanks for reviewing this! ❤️

BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in #3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.

So, if I understand #3310 correctly we need to both flush the icache (clear_cache / FlushInstructionCache) and then subsequently also flush the pipeline (membarrier / FlushProcessWriteBuffers). The pipeline flush only needs to happen in multi threaded environments but we cant easily detect that.

Is that right? If that is the case I might as well just clean up the terminology and add a clear_cache call for linux and that also fixes #3310 right?

@akirilov-arm
Copy link
Contributor

Is that right?

Yes, your understanding is correct. Technically the pipeline flush is always mandatory, but all implementations of __builtin___clear_cache() that I am aware of end with an ISB instruction (and I'd guess that FlushInstructionCache() does too) and furthermore a system call (e.g. the subsequent change of the memory page permissions to executable, since we don't use RWX mappings) is a context synchronization operation as well.

The ideal solution, which I had in mind when I opened #3310, would be to have a generic clear_cache() function (probably in the compiler-builtins crate to mirror what GCC and LLVM are doing; I am not aware of anything like that existing, but I would be happy to be proven wrong). Then we would call that function, followed by membarrier()/FlushProcessWriteBuffers(), finishing with making the memory executable. The order of the last 2 operations does not matter (as long as no one tries to run the newly executable code without a pipeline flush before that), but I think that this sequence looks nicer.

Note that it might appear that on Linux we don't flush instruction caches, so there might be a correctness issue. Actually we are relying on an implementation detail (as discussed in #3426), but I'd rather see an explicit operation, hence I have kept #3310 open. However, I would advise against implementing the cache flushing instruction sequence inside Cranelift and/or Wasmtime because it is a bit involved and a solution that is easily accessible to all Rust users on AArch64 is a much better option. In the meantime we can continue relying on an implementation detail; any other cleanup would be appreciated, of course.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 3, 2022

probably in the compiler-builtins crate to mirror what GCC and LLVM are doing

It should rather be in libcore as compiler-builtins is an implementation detail of rustc to be used by the compiler backend and never directly called by the user.

@afonso360
Copy link
Contributor Author

Alright I think I understand this a little bit more! Thank you for your patience dealing with this.

However, I would advise against implementing the cache flushing instruction sequence inside Cranelift and/or Wasmtime because it is a bit involved and a solution that is easily accessible to all Rust users on AArch64 is a much better option. In the meantime we can continue relying on an implementation detail; any other cleanup would be appreciated, of course.

Yeah I also doubt I could write anything like that correctly 😆 So I'm not very inclined to go that route.

What I think we could do is something along the lines of what @cfallin suggested on zulip create a jit-icache-coherence crate with the interface that we want, use that on both wasmtime-jit and cranelift-jit. This at least shares the windows implementation between both crates, we still need to duplicate the membarriers between rsix and libc since we want to keep those separate. But at least it is a little bit more organized.

For __builtin___clear_cache() can we for now link it in from a C file as you suggested? And later on use a version from libcore if one gets upstreamed. Or if we don't want to do that on wasmtime (due to not using rsix, not sure if that is a requirement), we can ignore it for now and have essentially the same implementation but shared with cranelift.

What do you guys think about this?

Note that it might appear that on Linux we don't flush instruction caches, so there might be a correctness issue. Actually we are relying on an implementation detail (as discussed in #3426), but I'd rather see an explicit operation, hence I have kept #3310 open.

Yeah that confused me for a while! This whole thing is not easy to understand. I'd really like to get this in a central place and write all of these details in comments around this code.

@akirilov-arm
Copy link
Contributor

For __builtin___clear_cache() can we for now link it in from a C file as you suggested?

That is certainly an option, but my impression is that we are trying to move away from relying on C files as much as possible (refer to the refactoring that has been done in the wasmtime-fiber crate, for instance), and given that on Linux we are kind of OK at the moment, IMHO it might not be the best course of action.

As for creating a crate to be used by both cranelift-jit and wasmtime-jit, since the issue was that cranelift-jit should not depend on rustix, I am not sure how much sharing would be possible, though we do have the option of just implementing the libc variant (and avoiding rustix completely). That would also help with consolidating the locations that @cfallin needs to change in #4987.

@afonso360
Copy link
Contributor Author

I've implemented a new crate jit-icache-coherency, that merges all of this code together so that we don't have to update stuff in multiple places, it also provides a common interface that both wasmtime-jit and cranelift-jit can use.

It does change our call's to SYNC_CORE membarriers, using the new method suggested by @akirilov-arm. We now execute SYNC_CORE and if it fails we retry after registering it. This should only happen once per process, so I don't think its too much overhead.

It also includes the changes made by @cfallin in #4987 since I thought it was just easier to do it here instead of having to deal with the ensuing merge conflicts.

I've tried to explain the whole cache coherency thing as best as I could in the documentation, hopefully I didn't miss anything!

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much nicer to have the abstracted crate shared in both places -- thanks for doing this work (and subsuming my PR too)!

One thought on crate naming, and a thought on RISC-V; and let's make sure @akirilov-arm is happy with the final version here too before merging.

crates/jit-icache-coherence/Cargo.toml Outdated Show resolved Hide resolved
//! This crate provides utilities for instruction cache maintenance for JIT authors.
//!
//! In self modifying codes such as when writing a JIT, special care must be taken when marking the
//! code as ready for execution. On fully coherent architectures (X86, S390X) the data cache (D-Cache)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary for this PR, but I wonder how RISC-V fits into this -- it looks like at the ISA level it has a fence.i instruction, so it is closer to AArch64 in this regard (weaker coherence by default). Is it enough to do the same membarrier calls as on aarch64? (cc @yuyang-ok)

In the absence of any other information, perhaps we could perform the same membarrier calls on RISC-V as we do on aarch64?

Copy link
Contributor Author

@afonso360 afonso360 Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we do need to do something, from what I've read RISCV is allowed to have incoherent I and D caches. From this documentation of the kernel, it looks like CORE_SYNC is not yet implemented for RISCV. I'm not sure they support GLOBAL either.

I've tried to read the kernel a bit, and from what I understand they have a custom syscall that does sort of what we want? But it looks like it does not guarantee anything regarding pipelines.

Edit: That syscall ends up doing something very similar to AArch64 where they execute a fence.i on all cores. (link)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an architectural detail - I am not familiar with RISC-V at all, but it is possible that the architecture specifies that if instruction caches are flushed, then the pipeline might be flushed as well if necessary, hence no need to do anything in addition; on AArch64 these actions are decoupled. Or to put it another way - an architecture having incoherent data and instruction caches does not imply that it behaves in exactly the same way as the 64-bit Arm architecture (and hence requiring exactly the same sequence of actions); possibly there are nuances.

BTW the system call you have linked to says that it can be made to apply to all threads in the process, not just the caller, which might be what you are looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or to put it another way - an architecture having incoherent data and instruction caches does not imply that it behaves in exactly the same way as the 64-bit Arm architecture (and hence requiring exactly the same sequence of actions); possibly there are nuances.

Yeah, that's right, we should go and double check that!

I've opened #5033 to track this, but I'm going to look at the ISA manual to check if they guarantee anything like that.

crates/jit-icache-coherence/src/rustix.rs Outdated Show resolved Hide resolved
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 68 to 73
#[cfg(all(not(target_os = "windows"), not(feature = "rustix")))]
mod libc;
#[cfg(all(not(target_os = "windows"), feature = "rustix"))]
mod rustix;
#[cfg(target_os = "windows")]
mod win;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way I'd recommend writing this to make this more easily maintainable over time is:

cfg_if::cfg_if! {
    if #[cfg(target_os = "windows")] {
        mod win;
        use win as imp;
    } else if #[cfg(feature = "rustix")] {
        mod rustix;
        use rustix as imp;
    } else {
        mod libc;
        use libc as imp;
    }
}

and below just use imp::the_method() instead of duplicating the #[cfg]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, how come there's a rustix and a libc implementation? Would it be reasonable to pick one as the only non-windows implementation?

Copy link
Contributor Author

@afonso360 afonso360 Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, for cranelift-jit we don't want to import rustix since that adds a bunch of dependencies for a couple of membarriers (#3395). For wasmtime I think we do want the safety guarantees of rustix.

So we ended up with both, one implementation in wasmtime-jit and another in cranelift-jit.

Would it be reasonable to pick one as the only non-windows implementation?

Sure, I'm happy to go with whatever people choose, I just wanted to minimize the changes here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a maintainability point of view though I don't think it makes sense to have two different versions of this code. If cranelift-jit doesn't want to use rustix because it's too big of a dependency then that seems like an equivalent argument could be made for wasmtime dropping rustix, but the same arguments for why wasmtime uses rustix I feel can be used in reverse as well to motivate the usage of rustix.

Overall I assume the actual compiled-down code is basically the same modulo what function does the syscall instruction so at least form my perspective I would prefer to only have one implementation to maintain rather than two.

Also a bit more broadly I feel that the cranelift-jit crate doesn't really fit well with this repository right now. Nothing in Wasmtime uses it and it does not see heavy usage in tests I believe, but it's quite a complicated an nontrivial crate at the same time. Basically the support/maintenance story for it seems somewhat unclear but I ideally don't want it to place further burdens on other code elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cranelift-jit is used and tested by cg_clif.

Copy link
Contributor Author

@afonso360 afonso360 Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I assume the actual compiled-down code is basically the same modulo what function does the syscall instruction so at least form my perspective I would prefer to only have one implementation to maintain rather than two.

I hope so too! Or its probably a bug. 😄 I don't really have an opinion on what we should do here, but I'm happy with whatever.

Also a bit more broadly I feel that the cranelift-jit crate doesn't really fit well with this repository right now. Nothing in Wasmtime uses it and it does not see heavy usage in tests I believe, but it's quite a complicated an nontrivial crate at the same time. Basically the support/maintenance story for it seems somewhat unclear but I ideally don't want it to place further burdens on other code elsewhere.

We do use the jit for all runtests in cranelift (in the filetest suite) and we also use the jit when fuzzing the cranelift-fuzzgen target.

I do think we could go the other way and try to use cranelift-jit in wasmtime instead of wasmtime-jit. That would probably be a big project, and I'm not sure if there is anything that would be fundamentally incompatible. But I think it would make sense in terms of a code sharing perspective, and as a bonus all cranelift users would get a better JIT!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could share the core for handling mapping as executable and relocating, but the user facing interface of cranelift-module is designed for the C linkage model and is not compatible with the wasm linkage model.

crates/jit-icache-coherence/src/rustix.rs Outdated Show resolved Hide resolved
@afonso360 afonso360 force-pushed the windows-aarch64-icache branch 2 times, most recently from 3d77de4 to eb57bee Compare October 5, 2022 10:22
This is implied by the target_os = "windows" above
cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/libc.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/rustix.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/rustix.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/win.rs Show resolved Hide resolved
crates/jit-icache-coherence/src/lib.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/lib.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/lib.rs Outdated Show resolved Hide resolved
crates/jit-icache-coherence/src/lib.rs Show resolved Hide resolved
crates/jit-icache-coherence/src/lib.rs Outdated Show resolved Hide resolved
@akirilov-arm
Copy link
Contributor

Forgot to post a general comment with my review - the overall code structure looks great and is definitely cleaner than what we had before; my remarks are mostly about improving the code comments. Also, as I have said before I am fine with just implementing the libc code path (and omitting the rustix one).

@afonso360
Copy link
Contributor Author

I think I got all of them, let me know if I missed any of the changes that you requested. And thanks for you patience in dealing with this! I'm still very much learning about all of this stuff.

Also, as I have said before I am fine with just implementing the libc code path (and omitting the rustix one).

I'm okay with libc too, but would like an ack from someone on the wasmtime side if we want to do that.

Copy link
Contributor

@akirilov-arm akirilov-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - thanks for tackling this! I have a final set of nits, but none of them are showstoppers.

crates/jit-icache-coherence/src/win.rs Outdated Show resolved Hide resolved
cranelift/jit/src/memory.rs Outdated Show resolved Hide resolved
crates/jit/src/code_memory.rs Outdated Show resolved Hide resolved
Technically the `clear_cache` operation is a lie in AArch64, so move the pipeline flush after the `mprotect` calls so that it benefits from the implicit cache cleaning done by it.
crates/jit-icache-coherence/src/win.rs Outdated Show resolved Hide resolved
@afonso360
Copy link
Contributor Author

afonso360 commented Oct 11, 2022

I had to add a flags arg to membarrier on libc since on my aarch64 machine it wasn't being properly added, somehow cranelift never triggered this. I've confirmed with strace and the membarrier is now being properly executed, and the register on fail trick is working.

[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = -1 EPERM (Operation not permitted)
[pid 615515] membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0
[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0

Is there a way to test the GLOBAL path on modern kernels? i.e. fail MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE with EINVAL?

Copy link
Contributor

@akirilov-arm akirilov-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to test the GLOBAL path on modern kernels? i.e. fail MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE with EINVAL?

I am not sure if I understand you correctly, but AFAIK if the membarrier() system call is supported at all, then MEMBARRIER_CMD_GLOBAL (AKA MEMBARRIER_CMD_SHARED) is supported as well.

@afonso360
Copy link
Contributor Author

Right, I just wanted to check if there was an easy way to add a test that forces us to use GLOBAL so that we at least have some CI coverage on that branch. But I tested that manually and it worked, so I guess that's ok.

@afonso360
Copy link
Contributor Author

afonso360 commented Oct 12, 2022

Hey, @alexcrichton @sunfishcode we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR as (in its current form) it changes wasmtime to use libc on this particular membarrier call.

As a summary: We got to this situation when I merged this particular piece of code from wasmtime-jit and cranelift-jit, both of those crates have to do the exact same thing when it comes to icache maintenance. After merging we had two backends which do exactly the same thing, and to avoid having duplicated code we need to decide on one.

wasmtime already has libc on its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc.

I'd like to know if this is okay with you guys.

@sunfishcode
Copy link
Member

Yes, that makes sense!

@alexcrichton
Copy link
Member

Indeed seems reasonable to me!

@cfallin cfallin merged commit 4639e85 into bytecodealliance:main Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants