
externref: implement stack map-based garbage collection #1832

Merged
merged 4 commits into from
Jun 16, 2020

Conversation

fitzgen
Member

@fitzgen fitzgen commented Jun 5, 2020

For host VM code, we use plain reference counting, where cloning increments the reference count, and dropping decrements it. We can avoid many of the on-stack increment/decrement operations that typically plague the performance of reference counting via Rust's ownership and borrowing system. Moving a VMExternRef avoids mutating its reference count, and borrowing it either avoids the reference count increment or delays it until if/when the VMExternRef is cloned.
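The ownership/borrowing point above can be sketched in plain Rust. This is a toy stand-in, not the actual VMExternRef: `Rc`'s strong count plays the role of VMExternRef's reference count, and `FakeExternRef`, `inspect`, and `consume` are hypothetical names for illustration.

```rust
use std::rc::Rc;

// Toy stand-in for VMExternRef: only an explicit `clone` increments the
// count; moves and borrows never touch it.
#[derive(Clone)]
pub struct FakeExternRef(pub Rc<String>);

// Taking by reference: no reference-count traffic at all.
pub fn inspect(r: &FakeExternRef) -> usize {
    Rc::strong_count(&r.0)
}

// Taking by move: ownership transfers, still no count traffic.
pub fn consume(r: FakeExternRef) -> usize {
    Rc::strong_count(&r.0)
}

fn main() {
    let r = FakeExternRef(Rc::new("hello".to_string()));
    assert_eq!(inspect(&r), 1); // borrow: count unchanged
    let c = r.clone(); // the only operation that increments
    assert_eq!(inspect(&r), 2);
    drop(c); // decrement happens at drop
    assert_eq!(consume(r), 1); // move: count unchanged
}
```

The point of the sketch: the count only moves on `clone` and `drop`, so passing values around by move or borrow is free, which is what the ownership system buys over naive reference counting.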

When passing a VMExternRef into compiled Wasm code, we don't want to do reference count mutations for every compiled local.{get,set}, nor for every function call. Therefore, we use a variation of deferred reference counting, where we only mutate reference counts when storing VMExternRefs somewhere that outlives the activation: into a global or table. Simultaneously, we over-approximate the set of VMExternRefs that are inside Wasm function activations. Periodically, we walk the stack at GC safe points, and use stack map information to precisely identify the set of VMExternRefs inside Wasm activations. Then we take the difference between this precise set and our over-approximation, and decrement the reference count for each of the VMExternRefs that are in our over-approximation but not in the precise set. Finally, the over-approximation is replaced with the precise set.
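The periodic GC step described above can be sketched as a set difference. This is a hypothetical reduction (refs shrunk to integer ids, `gc_sweep` an illustrative name); the real implementation builds `precise` by walking the stack with stack map information.

```rust
use std::collections::HashSet;

// `over_approx` is the table's over-approximation of refs held by Wasm
// activations; `precise` is what the stack-map-guided walk actually found.
// Returns the refs whose deferred reference count the caller must
// decrement, and replaces the over-approximation with the precise set.
pub fn gc_sweep(over_approx: &mut HashSet<u32>, precise: &HashSet<u32>) -> Vec<u32> {
    // Everything we conservatively thought was live but the walk did not find...
    let dead: Vec<u32> = over_approx.difference(precise).copied().collect();
    // ...loses its table-held count; the precise set becomes the new table.
    *over_approx = precise.clone();
    dead
}
```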

The VMExternRefActivationsTable implements the over-approximated set of VMExternRefs referenced by Wasm activations. Calling a Wasm function and passing it a VMExternRef moves the VMExternRef into the table, and the compiled Wasm function logically "borrows" the VMExternRef from the table. Similarly, global.get and table.get operations clone the gotten VMExternRef into the VMExternRefActivationsTable and then "borrow" the reference out of the table.

When a VMExternRef is returned to host code from a Wasm function, the host increments the reference count (because the reference is logically "borrowed" from the VMExternRefActivationsTable and the reference count from the table will be dropped at the next GC).
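A hypothetical sketch of the table's role and its deterministic GC trigger (`ActivationsTable` and `insert_with_gc` are illustrative names, not Wasmtime's actual API; the "GC" here just clears the table and counts how often it fired):

```rust
// Refs passed into Wasm are moved into the table; when capacity is
// exhausted, a GC fires deterministically before the next insertion.
pub struct ActivationsTable {
    slots: Vec<u32>,
    capacity: usize,
    pub gc_count: usize,
}

impl ActivationsTable {
    pub fn new(capacity: usize) -> Self {
        ActivationsTable { slots: Vec::new(), capacity, gc_count: 0 }
    }

    // Called when host code passes a ref into Wasm; the Wasm activation
    // then logically "borrows" the ref from the table.
    pub fn insert_with_gc(&mut self, r: u32) {
        if self.slots.len() == self.capacity {
            // Real implementation: walk the stack via stack maps, compute
            // the precise live set, decrement the rest. Here: just clear.
            self.slots.clear();
            self.gc_count += 1;
        }
        self.slots.push(r);
    }
}
```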

For more general information on deferred reference counting, see An Examination of Deferred Reference Counting and Cycle Detection by Quinane: https://openresearch-repository.anu.edu.au/bitstream/1885/42030/2/hon-thesis.pdf

cc #929

Fixes #1804

Depends on rust-lang/backtrace-rs#341

@fitzgen fitzgen requested a review from alexcrichton June 5, 2020 23:20
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jun 5, 2020


impl StackMapSink {
    fn finish(mut self) -> Vec<StackMapInformation> {
        self.infos.sort_by_key(|info| info.code_offset);

Contributor

Suggested change
-        self.infos.sort_by_key(|info| info.code_offset);
+        self.infos.sort_unstable_by_key(|info| info.code_offset);

///
/// Unsafe to drop earlier than its module is dropped. See
/// `StackMapRegistry::register_stack_maps` for details.
pub struct StackMapRegistration {
Contributor

Maybe panic in the drop implementation instead and add an unsafe finish function?

@tschneidereit
Member

Very exciting to see this! \o/

One question: does this have any implications for deterministic behavior? ISTM that as long as we don't have weak references anywhere in the system it probably doesn't, right?

Member

@alexcrichton alexcrichton left a comment

Very nice! Before I dig too closely into the internal details I wanted to get a grasp of how this is all organized and such. The main thought I have is that this is adding a new registry and a lot of separate points where we pass around registries and such. I'm hoping we can perhaps unify all these with existing registries and such we have? Internally it feels better if we don't have an arc and rwlock per-thing that we need to keep track of.

One other thing I'm remembering now which I think would be good to happen here, I think this should include an implementation of WasmTy for ExternRef. We'll want to enable closures with Func that take ExternRef as arguments and such, and it'd be cool to see what the monomorphize logic looks like for functions that take a number of ExternRef or produce them.

Finally I wanted to write some thoughts about the GC aspect. It feels a bit weird to me that we're using reference counting but still have to have explicit GC points. To make sure I understand, GC only happens right now automatically when you call into a function, right? Or are there other auto-inserted points that GC happens? At a minimum I think we need to make sure that all long-running code is eventually GC'd, so I think both entry and exit needs GC'ing (calling into a host and returning back to wasm may already do the GC, I likely missed it!).

I wanted to dig a bit more into the decision though to use deferred reference counting rather than explicit. Is it possible to somewhat easily get performance number comparisons? For example do we have a handle on what we predict the overhead will be? I'd imagine there are possible optimizations where local.get/local.set don't do reference counting but calling a function does.

I'm personally always a bit wary of things like stack maps and such because of how strictly precise they must be, while the surrounding bits are often only "mostly precise", like iterating the stack and/or getting the stack pointer. I don't think the implementation here is incorrect by any means, mostly just that maintaining this over time and porting it to all sorts of new platforms is likely to get hairier over time.

@@ -82,6 +83,7 @@ struct ModuleInner {
     store: Store,
     compiled: CompiledModule,
     frame_info_registration: Mutex<Option<Option<Arc<GlobalFrameInfoRegistration>>>>,
+    stack_map_registration: Mutex<Option<Option<Arc<StackMapRegistration>>>>,
Member

Sort of continuing my comment from earlier, but this is an example where it would be great to not have this duplication. Ideally there'd be one "register stuff" call, although I'm not sure yet if this is fundamentally required to be two separate steps.

@fitzgen
Member Author

fitzgen commented Jun 8, 2020

One question: does this have any implications for deterministic behavior? ISTM that as long as we don't have weak references anywhere in the system it probably doesn't, right?

This has deterministic (albeit perhaps surprising if you don't know the implementation) behavior. We only GC when either the embedder calls Store::gc or when the VMExternRefActivationsTable is full (which happens after passing in N externrefs to Wasm).

FWIW. VMExternRef doesn't currently support weak references. Shouldn't be too hard to add if we find we need it. It wouldn't change the determinism of GC though.

@fitzgen
Member Author

fitzgen commented Jun 8, 2020

Very nice! Before I dig too closely into the internal details I wanted to get a grasp of how this is all organized and such. The main thought I have is that this is adding a new registry and a lot of separate points where we pass around registries and such. I'm hoping we can perhaps unify all these with existing registries and such we have? Internally it feels better if we don't have an arc and rwlock per-thing that we need to keep track of.

I originally did add this to the existing global frame info, but after our discussions about not wanting implicit global context, I moved it out to its own registry.

One other thing I'm remembering now which I think would be good to happen here, I think this should include an implementation of WasmTy for ExternRef. We'll want to enable closures with Func that take ExternRef as arguments and such, and it'd be cool to see what the monomorphize logic looks like for functions that take a number of ExternRef or produce them.

Yeah, I have this on my TODO list for a follow up PR. Felt like this was big enough as is.

Finally I wanted to write some thoughts about the GC aspect. It feels a bit weird to me that we're using reference counting but still have to have explicit GC points. To make sure I understand, GC only happens right now automatically when you call into a function, right? Or are there other auto-inserted points that GC happens? At a minimum I think we need to make sure that all long-running code is eventually GC'd, so I think both entry and exit needs GC'ing (calling into a host and returning back to wasm may already do the GC, I likely missed it!).

I wanted to dig a bit more into the decision though to use deferred reference counting rather than explicit. Is it possible to somewhat easily get performance number comparisons? For example do we have a handle on what we predict the overhead will be? I'd imagine there are possible optimizations where local.get/local.set don't do reference counting but calling a function does.

Regarding when GC happens see my reply to Till: #1832 (comment)

For long-running Wasm, in the absence of explicit Store::gc calls, GC will happen as the VMExternRefActivationsTable fills up. I don't think it makes sense to GC on entry/exit from every single host<-->wasm function boundary, because it will be too expensive for simple getter/setter host functions that you expect to generally be fast. It might make sense for the embedder to call Store::gc after the "main" call into Wasm (e.g. after servicing an HTTP request, but not after each wasm -> host call to read an HTTP header; however in this case there should be zero Wasm frames on the stack, so we could provide an unsafe method to just clear the VMExternRefActivationsTable rather than even bothering to walk the stack and do the GC).
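The "GC after the main call" pattern could look like the following sketch. `Store`, `gc`, and the other names are stand-ins for illustration, not the actual wasmtime API surface.

```rust
// Stand-in for the real Store: it only tracks the deferred-count table.
pub struct Store {
    deferred: Vec<u32>, // refs handed to Wasm, counts deferred until GC
}

impl Store {
    pub fn new() -> Self {
        Store { deferred: Vec::new() }
    }

    pub fn gc(&mut self) {
        // With zero Wasm frames left on the stack, the precise live set is
        // empty, so "GC" degenerates to clearing the deferred table.
        self.deferred.clear();
    }

    pub fn pass_ref_to_wasm(&mut self, r: u32) {
        self.deferred.push(r);
    }

    pub fn deferred_len(&self) -> usize {
        self.deferred.len()
    }
}

// Hypothetical "main" call: many wasm <-> host transitions happen inside
// with no GC at the boundaries.
pub fn service_request(store: &mut Store) {
    for r in [1, 2, 3] {
        store.pass_ref_to_wasm(r);
    }
}

fn main() {
    let mut store = Store::new();
    service_request(&mut store);
    assert_eq!(store.deferred_len(), 3);
    store.gc(); // one GC per "main" call, not per boundary crossing
    assert_eq!(store.deferred_len(), 0);
}
```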

Regarding the decision to go with deferred reference counting, let's look at the alternative: we would have to do explicit reference count increments and decrements for references inside Wasm frames, rather than deferring them. First, to get this working at all, this would require extending cranelift-wasm to add ref count increments every time we pushed a reference onto the spec-defined Wasm execution stack (e.g. every local.get), and ref count decrements every time it is popped off the Wasm execution stack (e.g. every local.set). But doing this many reference counting operations is the thing that kills reference counting's throughput. So to alleviate that we could either do deferred reference counting, or we could add a static analysis to cranelift-wasm that sums the ref count increments and decrements in a basic block and does only the final effect once (hand waving a bit, would need to think about this a bit more). Then, if coalescing ref count operations within a basic block isn't enough, this static analysis could perhaps further be generalized across basic blocks (really hand waving now).
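The per-block coalescing being hand-waved here could, in its simplest form, look like the sketch below (`coalesce` is a hypothetical name; values are reduced to integer ids, and deltas are +1 per push and -1 per pop):

```rust
use std::collections::BTreeMap;

// Given the per-value ref-count deltas a basic block would emit, keep only
// each value's net effect, so a value pushed and then popped within the
// same block costs zero ref-count operations.
pub fn coalesce(ops: &[(u32, i32)]) -> Vec<(u32, i32)> {
    let mut net: BTreeMap<u32, i32> = BTreeMap::new();
    for &(value, delta) in ops {
        *net.entry(value).or_insert(0) += delta;
    }
    // Emit only the nonzero net effects, in value order.
    net.into_iter().filter(|&(_, d)| d != 0).collect()
}
```

Even this per-block version never reaches deferred reference counting's zero in-Wasm count operations; it only reduces their number.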

Note that we can't only do reference counting at function call boundaries because we need to handle the case where a Wasm function takes a reference and drops it. We need to decrement the reference count at the drop site, which is what instrumenting the spec stack pops does.

So that is a bunch of infrastructural work on Cranelift to get something that will perform worse but at least works, followed by more work to start moving towards fewer ref count operations (but never getting as few as deferred reference counting: zero, albeit with occasional stack walking pauses).

On the other hand, we already have stack maps produced by Cranelift, which are the necessary bits required for implementing deferred reference counting. We don't need to build any new Cranelift infrastructure, just the Wasmtime integration. However, yes, this does come with occasional stack-walking pauses and reliance on libunwind or some other way of walking the stack (but also we already require it for unwinding host functions after traps).

So that was basically the calculus: an easier path to getting a better implementation (or at least an easier path to getting a good implementation, if we are comparing against the hypothetical best version of the ref counting coalescing static analysis).

Unfortunately, I don't have benchmarks or performance numbers. It would be hard to get this information without implementing both approaches. And it is hard to know how many reference counting operations we could coalesce with the hypothetical static analysis. But I don't really have any doubt that naively doing all the increments and decrements for on-stack references would be quite slow: this is Well Known(tm) in the memory management world.

@tschneidereit
Member

This has deterministic (albeit perhaps surprising if you don't know the implementation) behavior.

Great, thank you for the explanation!

One additional question: will it be possible to use the same infrastructure for stack tracing once we start implementing one of the GC proposals? Seems like that should be the case, but I might well be missing something.

@fitzgen
Member Author

fitzgen commented Jun 9, 2020

One additional question: will it be possible to use the same infrastructure for stack tracing once we start implementing one of the GC proposals? Seems like that should be the case, but I might well be missing something.

Yep, doing that should be a lot easier after this lands.

@alexcrichton
Member

Sorry I haven't had a chance to read and fully digest your response yet @fitzgen (will do later today), but I wanted to comment here with a thought before I forget it. In #1845 it was found that we can't actually backtrace through other host JIT code (e.g. the CLR) on all platforms. I think that this GC implementation requires getting a full backtrace at all times, right? If, for example, a host call in the CLR triggered a GC it would get a smaller view of the world than actually exists and could cause a use-after-free?

@alexcrichton
Member

Ok now to respond in full! One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was largely whether we expect wasmtime to grow some other point at which it automatically calls it. For example, if an application never calls Store::gc, is it guaranteed that, regardless of what wasm runs, memory won't be leaked?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hinges on the previous question: it should never be required to call Store::gc to prevent leaking unbounded amounts of memory.

For alternative strategies of reference counting, to clarify I don't think that we should be reverting this and switching to explicit reference counting. I'm mostly probing because the rationale against reference counting feels a bit hand-wavy to me and given the cost of relying on stack unwinding and stack maps I'd just want to make sure we're set.

To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree local.{set,get} should not do reference counting at all. What I'm imagining is that there's a set of "live anyref" in a function frame that cranelift keeps track of. Instructions like call and {global,table}.{get,set} will increment the reference count for received anyref values and passed away anyref values. When the function returns cranelift would then inject a decrement of the reference count for all active anyref values in the frame.

I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that local.{get,set} are the lion's share of reference count activities so by deferring that we'd get the lion's share of the benefit.
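The function-boundary scheme sketched in this comment, as toy Rust (`Frame` is hypothetical, and `Rc`'s strong count stands in for the anyref reference count):

```rust
use std::rc::Rc;

// A frame records each anyref it receives (one increment via clone);
// popping the frame decrements everything it held in one go, which is the
// code Cranelift would inject at function return.
pub struct Frame {
    live: Vec<Rc<String>>,
}

impl Frame {
    pub fn new() -> Self {
        Frame { live: Vec::new() }
    }

    pub fn receive(&mut self, r: &Rc<String>) {
        self.live.push(Rc::clone(r)); // one increment per received ref
    }
}
// Dropping `Frame` drops `live`, performing the bulk decrement.

fn main() {
    let r = Rc::new("payload".to_string());
    {
        let mut f = Frame::new();
        f.receive(&r);
        assert_eq!(Rc::strong_count(&r), 2); // the frame holds one count
    } // frame "returns": bulk decrement
    assert_eq!(Rc::strong_count(&r), 1);
}
```

Note that local.get/local.set inside the frame never touch the count in this scheme; only entry and exit do.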

Thinking more on this though unwinds are a really important thing to account for here too. I believe the stack walking strategy perfectly handles unwinds (since you'll just gc later and realize that rooted things aren't on the stack any more), but anything with explicit code generation will still need to do something on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

Again though to be clear I'm trying to get it straight in my head why we're using deferred reference counting rather than explicit reference counting. I don't mean to seem like I'm challenging the conclusion, it's mostly that I just want to feel confident that we don't hand-wave too much by accident. For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.

@bjorn3
Contributor

bjorn3 commented Jun 10, 2020

Again though to be clear I'm trying to get it straight in my head why we're using deferred reference counting rather than explicit reference counting.

Deferred reference counting is more performant.

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

Deferred reference counting is more performant.

Being this short and this absolute is not helpful. It isn't a 100% cut and dry issue, as the lengthy discussion weighing its trade offs shows. Please engage with nuance in the future. Thanks.

@bjorn3
Contributor

bjorn3 commented Jun 10, 2020

Ok, sorry.

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

In #1845 it was found that we can't actually backtrace through other host JIT code (e.g. the CLR) on all platforms. I think that this GC implementation requires getting a full backtrace at all times, right? If, for example, a host call in the CLR triggered a GC it would get a smaller view of the world than actually exists and could cause a use-after-free?

This is worrying.

You are correct that if we miss stack frames during GC, then we can accidentally think a reference is no longer in use even though it still is, which can then lead to use-after-free. Soundness relies on stack walking.

I can think of two workarounds:

  1. We do something like SpiderMonkey's JitActivation RAII class. There is a single JitActivation per host -> JS (in our case, Wasm) call. JS->JS calls do not add new JitActivations. So a JitActivation represents the start of one or more JS frames. JitActivations have an intrusive linked list that is pointed to from the JSContext (analogous to our VMContext) so the VM can always iterate over all JitActivations. Then, SpiderMonkey additionally has its own stack unwinder that only knows how to iterate over its own frames within a single JitActivation. Given those two things, all JS stack frames can be walked. We could do something similar.

  2. For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

@alexcrichton
Member

How does SpiderMonkey unwind frames within a single JitActivation? Do they have a custom unwinder which can sort of start halfway down the stack?

Otherwise another possible alternative is we could perhaps have a flag in Store of "am I in wasm?" and some sort of "blessed" frame on the stack, where if we don't find that frame then we know the stack trace is incomplete and skip GC entirely?

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

How does SpiderMonkey unwind frames within a single JitActivation? Do they have a custom unwinder which can sort of start halfway down the stack?

Yes, they have their own unwinder that understands only their own JIT'd frames.

Otherwise another possible alternative is we could perhaps have a flag in Store of "am I in wasm?" and some sort of "blessed" frame on the stack, where if we don't find that frame then we know the stack trace is incomplete and skip GC entirely?

This would avoid the unsoundness, but wouldn't let us clean up the garbage, so it isn't super attractive...

@peterhuene
Member

peterhuene commented Jun 10, 2020

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code, and hence complete walks with the system unwinder are possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for walking the wasm frames as we support libunwind and Windows; on Windows this is possible with RtlVirtualUnwind, but I'm ignorant of the libunwind equivalent).

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

Ok now to respond in full! One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was largely whether we expect wasmtime to grow some other point at which it automatically calls it. For example, if an application never calls Store::gc, is it guaranteed that, regardless of what wasm runs, memory won't be leaked?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hinges on the previous question: it should never be required to call Store::gc to prevent leaking unbounded amounts of memory.

We could add a timer-based GC, but this does make the timing of when destructors are called non-deterministic.

The only situation where there could be long-running wasm without any GC is if the Wasm goes into a long-running loop that doesn't use any references. If it did use references (i.e. put them in tables/globals) then the VMExternRefActivationsTable would eventually fill up and trigger a GC.

I don't think recommending a Store::gc call on exit of the "main" wasm call is too onerous for embedders (especially given that they may not want to do it that way if there isn't a single "main" call, and instead let it come naturally), but we could also provide an RAII class or somehow do it by default by counting the number of active calls into wasm at any given time on the thread.

To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree local.{set,get} should not do reference counting at all. What I'm imagining is that there's a set of "live anyref" in a function frame that cranelift keeps track of. Instructions like call and {global,table}.{get,set} will increment the reference count for received anyref values and passed away anyref values. When the function returns cranelift would then inject a decrement of the reference count for all active anyref values in the frame.

I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that local.{get,set} are the lion's share of reference count activities so by deferring that we'd get the lion's share of the benefit.

I see more where you are going with this now.

Cranelift already does this to some degree when generating stack maps for safepoints: it asks the register allocator which values are still live after this point, filters for just the references, and generates a stack map for them based on their locations.

This is really late in the compilation pipeline though: it requires cooperation with the register allocator. I don't think we can do exactly that same approach for ref counting operations, because it is probably too late to insert new instructions (let alone whole new blocks for checking if the refcount reached zero inline, and only calling out to a VM function if so).

I think we could do something similar in cranelift-wasm, before the IR enters Cranelift's pipeline properly. (Once it goes into Cranelift, it is essentially "too late" to insert these hooks, as Cranelift doesn't ever call back out into user code at that point, and introducing it would be a bunch of work and also not necessarily wanted.) Still feel a bit hand wavy about this, and the more I think about it, the more similar it feels to the final form of the hypothetical static analysis.

but anything with explicit code generation will still need to do something on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?

For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.

Yes, if we use non-deferred reference counting for on-stack references, traps will have to decrement reference counts as they unwind through Wasm frames. I hadn't thought about this either. This does end up looking very similar to stack maps, but also with personality routines thrown into the mix. The difference is that failure to unwind stacks properly leads to leaks in this case, rather than unsafety.


Ultimately, I'm not 100% sure whether it makes more sense to keep investing in this stack maps-based approach to deferred reference counting, or to start fresh and try non-deferred reference counting.

I had been leaning towards exploring non-deferred reference counting, but I hadn't thought of the need to decrement reference counts when unwinding through wasm. When considering the effort required to do that properly, I'm less sure now.

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

Oh one other thing I wanted to mention: if we do keep going with this PR's approach and introduce a JitActivation analog, it isn't clear to me yet whether we can keep using libunwind (via the backtrace crate) for unwinding through a single JitActivation's wasm frames, or if we would need to implement our own stack walker that only handles the kinds of frames generated by Cranelift.

In order to keep using libunwind, we would need a way to initialize its unwind context manually ourselves, and not just with the current youngest stack frame's register values. We would also need to plumb that through the backtrace crate. I am not sure whether libunwind supports this or not. I'm going to look into it.

@peterhuene
Member

peterhuene commented Jun 10, 2020

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

If the wasm-to-host transitions stack is empty (i.e. there are only Wasm frames on top, or a Wasmtime function such as the signal or GC handler), then we can presumably start the walk from the current context, skipping any initial non-wasm frames. There shouldn't be any foreign unregistered frames that would prevent the walk in that case.

@fitzgen
Member Author

fitzgen commented Jun 10, 2020

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

Yes, we could do this. Adding an out-of-line function call (unw_getcontext) to every wasm->host transition isn't great though. (On arm, it is actually a macro of some inline assembly, but since Rust doesn't have inline assembly, we would probably wrap that in a C function that we call anyway...)

AFAICT, there is no blessed way of re-initializing an existing unw_context_t from just a PC and SP, which is all that should really be required for stack walking in principle. On both x86-64 and arm, unw_context_t is just a define of a ucontext_t, so maybe we could just poke the ucontext_t's register values directly...

@peterhuene
Member

peterhuene commented Jun 10, 2020

Agreed, if we go with such a design, I would want to limit the context capturing only to host functions not defined by Wasmtime (and perhaps only via the C API, where there is a chance of another JIT runtime being used). Perhaps it could even be something embedders opt-in to when defining a function.

The bottom line is that we shouldn't have to pay the cost for capturing context when calling into Wasmtime's WASI implementation especially, as we are guaranteed to be able to unwind through those frames for both libunwind and Windows.

@peterhuene
Member

peterhuene commented Jun 10, 2020

All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts (the .NET API doesn't support reference types yet anyway).

@bjorn3
Contributor

bjorn3 commented Jun 11, 2020

Maybe there is some way for the .NET Wasmtime implementation to insert a frame whose unwind info doesn't do normal unwinding, but instead sets the SP to the value it had before entering the .NET JITted code, such that all .NET JITted frames are skipped. Or a frame with a special personality function that invokes the .NET unwinder and then sets the register state to whatever the .NET unwinder gives back when unwinding past the JITted code.

@tschneidereit
Member

All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts (the .NET API doesn't support reference types yet anyway).

While I think that might be fine to do, I'm actually quite glad we have this embedding and thus caught this issue.

What's more, even if we find a workaround specifically for .NET, I think it might make sense to take a closer look at whether we can then have sufficient confidence in our ability to make this approach work in all cases that might become relevant. I.e., would landing this begin painting ourselves into a corner that's increasingly hard to get out of at some later time when we realize we need to?

@alexcrichton
Member

@fitzgen

I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?

Oh by "explicit code generation" I mean "the thing that isn't deferred reference counting".


So overall I feel like my thinking above is still somewhat attractive. In that scheme, a call to Store::gc doesn't actually do anything if we're missing some blessed marker on the stack (or some other mechanism detects that our stack trace is likely incomplete).

The downside is that runtimes like .NET will GC less, but that doesn't seem like much of an issue given that our "GC" here is very small. The collected values are ones you gave to wasm that then became unreachable during wasm's execution. If we don't eagerly clean those up, it isn't the end of the world, especially because we'll want to fix this eventually.

I think that explicit reference counting is a defunct strategy given our unwinding strategy. I don't really see how implementing unwinding correctly with explicit reference counting is going to be any less hazardous or any less work than this already was. If that's taken as an assumption, then the final form of this feature will be exactly this PR, except with more changes to the gc function and codegen (to manage JitActivation entries and such). @fitzgen brings up a good point though that the backtrace crate will likely no longer be what we use.

Personally I see a few directions to go from here:

  1. Land this PR, work afterwards to fix .NET
  2. Don't land this PR, fix .NET, then land this PR. (e.g. JitActivation or similar)
  3. Switch this PR to explicit reference counting.

I don't think (3) is feasible and I think 1/2 are pretty close. I think it might be good to sketch out in more detail how JitActivation could work though.

@tschneidereit
Member

Personally I see a few directions to go from here

I find this convincing, and support (1), given that @peterhuene also voiced support for it.

@fitzgen
Member Author

fitzgen commented Jun 11, 2020

Ok, so we talked about this in today's wasmtime meeting. The conclusion we came to was to implement a canary stack frame in this PR for now, so that we can detect situations where libunwind cannot walk the full stack and skip GC rather than risk unsoundness. But long term, we want to do the precise JitActivation approach (it would bring additional benefits, such as getting wasm stack traces for .NET traps, which aren't currently available).

Thanks everyone for the discussion and design input!

@github-actions github-actions bot added the cranelift label Jun 11, 2020

Member

@alexcrichton alexcrichton left a comment

Ok looking great to me! Apart from the inline comments the only other thing I'd say is that I think it'd be good to look into not having two registrations per module (GlobalFrameInfoRegistration and StackMapRegistration) and trying to lump that all into one (also deduplicating the BTreeMap lookups and such)

crates/jit/src/code_memory.rs
///
/// The `vmctx` also holds a raw pointer to the registry and relies on this
/// member to keep it alive.
pub(crate) stack_map_registry: Arc<StackMapRegistry>,
Member

I'd personally still prefer to avoid having this here if it's sole purpose is to keep the field alive. I think that memory management should be deferred to Store

Member

(same with externref_activations_table above too)

Review threads (resolved): crates/wasmtime/src/func.rs, crates/runtime/src/externref.rs (several), crates/environ/src/vmoffsets.rs
@fitzgen
Member Author

fitzgen commented Jun 12, 2020

Apart from the inline comments the only other thing I'd say is that I think it'd be good to look into not having two registrations per module (GlobalFrameInfoRegistration and StackMapRegistration) and trying to lump that all into one (also deduplicating the BTreeMap lookups and such)

Ah, I just remembered why it has to be this way: the GlobalFrameInfo is in the wasmtime crate, but we need the stack map stuff inside the wasmtime-runtime crate, so it can't be added as part of GlobalFrameInfo. The split between wasmtime and wasmtime-runtime crates is a bit annoying...

For host VM code, we use plain reference counting, where cloning increments
the reference count, and dropping decrements it. We can avoid many of the
on-stack increment/decrement operations that typically plague the
performance of reference counting via Rust's ownership and borrowing system.
Moving a `VMExternRef` avoids mutating its reference count, and borrowing it
either avoids the reference count increment or delays it until if/when the
`VMExternRef` is cloned.

When passing a `VMExternRef` into compiled Wasm code, we don't want to do
reference count mutations for every compiled `local.{get,set}`, nor for
every function call. Therefore, we use a variation of **deferred reference
counting**, where we only mutate reference counts when storing
`VMExternRef`s somewhere that outlives the activation: into a global or
table. Simultaneously, we over-approximate the set of `VMExternRef`s that
are inside Wasm function activations. Periodically, we walk the stack at GC
safe points, and use stack map information to precisely identify the set of
`VMExternRef`s inside Wasm activations. Then we take the difference between
this precise set and our over-approximation, and decrement the reference
count for each of the `VMExternRef`s that are in our over-approximation but
not in the precise set. Finally, the over-approximation is replaced with the
precise set.

The `VMExternRefActivationsTable` implements the over-approximated set of
`VMExternRef`s referenced by Wasm activations. Calling a Wasm function and
passing it a `VMExternRef` moves the `VMExternRef` into the table, and the
compiled Wasm function logically "borrows" the `VMExternRef` from the
table. Similarly, `global.get` and `table.get` operations clone the gotten
`VMExternRef` into the `VMExternRefActivationsTable` and then "borrow" the
reference out of the table.

When a `VMExternRef` is returned to host code from a Wasm function, the host
increments the reference count (because the reference is logically
"borrowed" from the `VMExternRefActivationsTable` and the reference count
from the table will be dropped at the next GC).

For more general information on deferred reference counting, see *An
Examination of Deferred Reference Counting and Cycle Detection* by Quinane:
https://openresearch-repository.anu.edu.au/bitstream/1885/42030/2/hon-thesis.pdf

cc bytecodealliance#929

Fixes bytecodealliance#1804

This allows us to detect when stack walking has failed to walk the whole stack,
and we are potentially missing on-stack roots, and therefore it would be unsafe
to do a GC because we could free objects too early, leading to use-after-free.
When we detect this scenario, we skip the GC.
@fitzgen fitzgen force-pushed the externref-stack-maps branch 2 times, most recently from b1410f0 to 6d88167 Compare June 15, 2020 16:50
Member

@alexcrichton alexcrichton left a comment

Ok I'll take a TODO item to clean up the registration stuff later. Let's go ahead and land with this current design. There's one safety issue below that needs fixing but otherwise I think this is basically good to go with a green CI!

Cargo.lock
version = "0.3.46"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b1e692897359247cc6bb902933361652380af0f1b7651ae5c5013407f30e109e"
version = "0.3.48"
Member

FWIW this is likely to break CI until this is fixed

Review threads (resolved): Cargo.lock, Cargo.toml, crates/runtime/src/externref.rs (several), crates/wasmtime/src/values.rs, crates/wasmtime/src/instance.rs, crates/wasmtime/src/module.rs, crates/wasmtime/src/ref.rs, crates/wasmtime/src/runtime.rs (several), tests/all/gc.rs, crates/jit/src/instantiate.rs, crates/wasmtime/src/trampoline/func.rs, crates/wasmtime/src/types.rs
@fitzgen
Member Author

fitzgen commented Jun 15, 2020

I had to cfg(target_arch = "x86_64") the reference types-related tests because Cranelift doesn't support safepoints/stack maps for any other targets (notably aarch64).

@fitzgen fitzgen force-pushed the externref-stack-maps branch 2 times, most recently from a383fae to 6a88d5c Compare June 15, 2020 23:43
@fitzgen
Member Author

fitzgen commented Jun 16, 2020

Confused why the aarch64 CI job is running the reference types tests despite them being ignored if !cfg!(target_arch = "x86_64"). Feel like I must be missing some simple typo or something like that, but I just don't see it...

@fitzgen
Member Author

fitzgen commented Jun 16, 2020

Gah, it's because the build.rs is being compiled for x64.

Cranelift does not support reference types on other targets.
@fitzgen fitzgen merged commit 647d2b4 into bytecodealliance:master Jun 16, 2020
@fitzgen fitzgen deleted the externref-stack-maps branch June 16, 2020 01:26
@@ -18,3 +18,7 @@ mod table;
mod traps;
mod use_after_drop;
mod wast;

// Cranelift only supports reference types on x64.
#[cfg(target_arch = "x86_64")]
Member

For future instances of this, mind tagging it with a FIXME and an issue number so that when we get around to aarch64 reference types we can make sure we run all the tests?

| ("reference_types", "externref_id_function") => {
// Ignore if this isn't x64, because Cranelift only supports
// reference types on x64.
return env::var("CARGO_CFG_TARGET_ARCH").unwrap() != "x86_64";
Member

Same comment here for the aarch64 testing

Labels
cranelift Issues related to the Cranelift code generator wasmtime:api Related to the API of the `wasmtime` crate itself
Development

Successfully merging this pull request may close these issues.

Do not leak externrefs passed into Wasm
5 participants