
Continuations #1252

Closed
autumnontape opened this issue Jan 7, 2019 · 38 comments

@autumnontape

Some way of saving the WebAssembly execution state and later restoring it would be useful in order to port blocking APIs to platforms whose equivalent APIs are async, especially the Web itself, which relies heavily on async APIs. Only a very limited form of continuations would be necessary to open a wide range of possibilities.

Searching for "continuations" in the issue tracker brings up a few comments by @rossberg mentioning plans for "delimited continuations"/"stack switching" in a later revision of WebAssembly: #796 (comment), #919 (comment), #1171 (comment). There doesn't seem to be any mention of the feature on the website, nor a dedicated issue in this repository, so I'd like to ask -- what's the status here?

@aardappel

I believe these features will be implemented as an extension (generalization) of exception handling support, so that being completed is probably where the current focus is.

I agree it would be nice to have more of a roadmap for what comes afterwards, yes.

@binji
Member

binji commented Jan 7, 2019

There was some description of this here, though the exception handling proposal has changed since then, so perhaps it's not relevant anymore. Maybe @rossberg and @aheejin can provide more info.

@rossberg
Member

rossberg commented Jan 9, 2019

Some folks among us are currently working on an experimental design that generalises the current exception proposal with resumption, yielding the equivalent of effect handlers, which are a structured and typable form of delimited continuations. We have a basic semantics but it is far too early for a proposal. Next steps will be prototyping it in the reference interpreter and experimenting with the design and perhaps some producers (as a side product, this would yield a reference implementation and spec proxy for the exception proposal).

But regardless of design specifics, the fundamental mechanism of creating, managing and switching between multiple execution stacks is not easily added to existing VMs, to say the least. This will be the far greater hurdle. There is some interest e.g. in the V8 team, but I expect that implementing it and evaluating performance viability will take considerable time.

@bitwalker

bitwalker commented Feb 4, 2019

This comment got a bit long, but bear with me, I promise it is all useful:

I'm currently working on an alternative implementation of Erlang that is based on ahead-of-time compilation rather than bytecode interpretation; the primary target is WebAssembly, but other targets will be supported eventually to support usage as a general native code compiler for BEAM languages. To implement the language and compiler, a few key requirements have to be met:

  • Proper tail calls: recursion is heavily used in the language, and there is no other means to express iteration
  • Cheap non-local returns: Erlang's throw is commonly used to simplify code which needs to exit early, or to allow returning a value from deep in a call stack; it is not something that is used just for exceptions
  • Pseudo-preemptive scheduling (implemented as cooperative scheduling in the runtime, but appears preemptive from Erlang code), which is used to implement green threads (called processes in Erlang)
  • Per-process garbage collection; important for performance, especially to avoid stopping the world to collect. In many cases, Erlang processes die before a GC of that process needs to be performed, which means large swaths of memory never need to be scanned at all.

To build an ahead-of-time compiler that meets these requirements, I chose to use continuation-passing style as described in Cheney on the MTA which provides a solution that elegantly addresses each requirement. In short, here is a description of how it works and how it solves my problems as a language implementer:

  • A program is converted into CPS form. In CPS form, and in this flavor in particular, every call is in tail position, and the callee receives both a return and an escape continuation (so-called "double barrel" continuations) in addition to any arguments.
  • Since no call ever returns, the stack would eventually blow, but the trick used in Cheney on the MTA is that the entry point takes the address of the stack pointer, and uses it to check periodically (usually each function call) to determine if the stack is about to run out of space. When that occurs, the live values on the stack are moved to the heap, along with the current continuation, and a longjmp is used to jump back to the entry point (which acts as a trampoline). This effectively resets the stack to square one, with very little overhead.
  • The jump also gives us a chance to perform garbage collection, scheduling, and as a consequence of resetting the stack, ensures that tail calls never blow the stack. Additional yield points can be added above and beyond the current state of the stack pointer, such as reduction counting (how Erlang determines when to preempt a green thread), or determining if a major garbage collection is needed.
  • The use of continuations means that we easily obtain cheap non-local returns
  • The use of continuations means that it is easy to suspend one green thread and resume another

None of what I described above actually requires explicit continuation support from the machine - it only requires some means by which the stack size can be checked (which we already have solutions for today, if a little hackish) and something like setjmp/longjmp to allow rewinding the stack cheaply.

From the research I've done so far though, the use of setjmp/longjmp in WebAssembly via JS exceptions comes at a significant performance penalty, and is basically just a compatibility shim. This makes any language using a Cheney on the MTA style compilation approach unnecessarily slow. A rather well-known example of such a language is Chicken Scheme, and I believe Chez Scheme also uses a similar, if not the same trick.

One alternative is to use trampolines - i.e. rather than longjmp when the stack is about to blow or a yield point is needed, the compiler generates code which returns to a trampoline every time one needs to invoke the next continuation. The problem of course is that it incurs an extra return + function call for each continuation called; however I suspect this is better today in WebAssembly vs the current setjmp/longjmp hack. One could also use an explicit return to unwind the stack, but I suspect that is even slower than the setjmp/longjmp solution.

All this to say, I believe something is needed to make the implementation of languages with similar requirements both feasible and performant. Perhaps that means some kind of native continuation support, or perhaps it means something less robust, like supporting the setjmp pattern I described better. I want to be clear that this isn't just an issue specific to my implementation - the set of languages which has or may have these requirements has some non-trivial members, e.g. Haskell, any standard-conforming Scheme implementation, Erlang (and BEAM languages in general), and essentially any language which depends on proper tail call support, not just the limited version made possible by the fastcc calling convention.

@rossberg Is there a place where I can review the rough sketch of what you are working on with the exception proposal + resumption? I'd like to dig in and see if it holds a solution for me or not. If so, and I can be of any help, I'm happy to try and lend a hand as well to keep it moving along.

If you have any questions (or even suggestions!) about the above, definitely let me know!

@aardappel

@bitwalker What is not clear to me from the options you describe:

  • Using longjmp would imply you use the actual wasm stack (which is not under your control), which means you have no way of determining how much you're using and when you're about to run out of stack space, currently?
  • If instead you used your own "shadow stack" in linear memory for storing stack frames, you could reset it without using longjmp, assuming you don't use wasm functions to implement the functions in your language, but instead have all bodies live inside a huge for(;;) switch(..) construct. This might be slower for other reasons (hard to do register allocation?), but at least requires no help from wasm.
  • Using a trampoline (with the wasm stack) would preclude uses where the stack frame has to survive, like green threads?

As an aside.. "Cheney on the MTA" is indeed elegant.. it's almost like a copying collector, but unlike a full GC, a list of stack frames is trivial to copy. Also love that Baker has been serving that paper in the same form from the same URL for >20 years :)

@bitwalker

bitwalker commented Feb 5, 2019

@aardappel Sure, let me clarify :)

Using longjmp would imply you use the actual wasm stack (which is not under your control), which means you have no way of determining how much you're using and when you're about to run out of stack space, currently?

This is true on more platforms than just WebAssembly, but the trick is that you first define an arbitrary max stack size (say, the smallest size of all the platforms you support, if you don't want to conditionally compile with different constants), and then the very first thing done on entry is to take the address of something on the stack (so not technically the stack pointer, but essentially usable as one), or get the stack pointer itself. Given your max size, the stack pointer (approximated or not), and the knowledge of whether the stack grows up or down on a given target, you can check how much buffer you have between the maximum you have defined and the stack address you hold.

If instead you used you own "shadow stack" in linear memory for storing stack frames, you could reset it without using longjmp, assuming you don't use wasm functions to implement the functions in your language, but instead have all bodies live inside a huge for(;;) switch(..) construct. This might be slower for other reasons (hard to do register allocation?), but at least requires no help from wasm.

You pretty much nailed one major issue here - it destroys the ability to optimize register allocation, which may not be an issue with WebAssembly (?), given its stack machine semantics, but it is definitely something I would avoid unless I had no other choice. A bigger problem is that it prevents a whole host of optimizations in the compiler backend (I'm using LLVM, but I've heard the issue is the same with various compile-to-C language implementations).

Using a trampoline (with the wasm stack) would preclude uses where the stack frame has to survive, like green threads?

It may help to understand how execution flows when scheduling. So assume we have some process (read: Erlang green thread) which is executing. Upon entry to the current function, after checking whether the stack needs to be reset and whether a major GC is required, it checks its reduction count (essentially an approximation of how much work it has done since the last yield) and sees that it needs to yield to the scheduler because it is going to exceed its quota:

  1. Control branches to what I call a "yield point", where a continuation is constructed which points to the current function, and references the current return and escape continuations, as well as any other arguments.
     • These arguments all represent live values on the "real" stack - in CPS form, anything not an argument to the current function is dead or on the heap.
  2. Any of the live values which are on the stack are moved to the heap (in our case, each process has its own heap, so the values are moved there).
  3. The constructed continuation is stored in the process struct (each process/green thread has a struct with all of its key metadata, such as a pointer to its heap, reduction count, etc.).
  4. A jump is performed, which returns control to the scheduler.
  5. The scheduler pulls the next process to be scheduled off its run queue, gets the continuation for that process, and calls it with the arguments stored in that continuation.

So with that in mind, using a trampoline doesn't prevent this flow, it just imposes a costly overhead - rather than calling a continuation directly (really just a function pointer at this point) using arguments already on the stack, a continuation must be constructed, and live values moved to the heap, before returning to the trampoline. Doing that for every call is significantly slower than only constructing a continuation when scheduling, GC, or a stack reset occurs.

As an aside.. "Cheney on the MTA" is indeed elegant.. it's almost like a copying collector, but unlike a full GC, a list of stack frames is trivial to copy. Also love that Baker has been serving that paper in the same form from the same URL for >20 years :)

Using the stack as a young generation is a brilliant stroke I thought. And it gets better! Since we're in CPS form, you don't need to copy stack frames, the only live values that need to be copied must be in, or referenced by, arguments to the current function, anything not reachable from those arguments is dead, and effectively collected by the jump :)

Before I encountered the paper, I am sad to say that I wasn't familiar with his work, but Baker is brilliant, I'm a huge fan, especially since as you say he continues to host all of his papers after all these years, essentially unchanged!

@bitwalker

I want to point out as well that although there is some crossover here with the tail call proposal, I feel like my point is more about arguing for supporting continuations as a language implementation strategy, one way or another. Proper tail calls is essentially the catch of course, but the key lesson behind Cheney on the MTA is that the lack of native support for tail calls can be worked around and remain relatively efficient, given the right tools.

More to the point though, the current tail call proposal is flawed in my opinion, as it essentially describes a trampoline (one must incur the cost of a return + call). One of the key benefits of tail calls is that they are efficient, essentially equivalent to a local goto. Without that, I'm not sure what benefit a tail call instruction in the spec provides, beyond a compiler not needing to generate its own trampolines. Perhaps it will be possible for WebAssembly machines to optimize those instructions into "real" tail calls, but I haven't seen any discussion to that point, and it is a critical one.

I'm not sure what the right solution is, but I do think a key indicator on whether WebAssembly has the right primitives is whether or not functional programs can be compiled to code which is essentially as performant as imperative counterparts (barring other issues, such as immutable data), which boils down to supporting tail calls efficiently, one way or the other.

@rossberg
Member

rossberg commented Feb 5, 2019

the current tail call proposal is flawed in my opinion, as it essentially describes a trampoline (one must incur the cost of a return + call).

@bitwalker, I'm afraid I don't follow. Why do you think that there has to be a trampoline? This is certainly meant to be a regular tail call.

Re resumable exceptions / effect handlers: there are some out-of-date example sketches here. Other than that we currently only have a formulation of a formal semantics, without any usable textual explanation in writing. We should probably write something up eventually, but there hasn't been much activity.

@bitwalker

I'm afraid I don't follow. Why do you think that there has to be a trampoline? This is certainly meant to be a regular tail call.

I apologize if I misinterpreted! I was referring to this section, where the execution semantics are said to behave like a return followed by a call. It seems like a confusing way to explain them, particularly because of the trampoline connection. I've always seen tail calls as something more akin to a jump, and that has been my impression of how other people view them as well. While the main point of tail calls is avoiding overflowing the stack by reusing stack frames, my understanding has been that another key property is that they avoid much of the stack shuffling that happens with return+call, or even just a call -- most obviously for self-recursive calls, but more generally for calls to functions with arguments in the same positions, where the compiler can avoid moving stuff around. I think I was just expecting to see that conveyed in the proposal. I'm not telling you anything you don't already know about tail calls, I'm just trying to explain why I interpreted the proposal the way I did :)

Other than that we currently only have a formulation of a formal semantics, without any usable textual explanation in writing. We should probably write something up eventually, but there hasn't been much activity.

Thanks for the link! It seems to me that this could probably be used to express the kind of control flow that I am using setjmp/longjmp for (as described above, e.g. Cheney on the MTA), assuming that "resuming" control in an outer call rewinds the call stack to that point. Is that the case?

Either way, is there anything I can do to help? Is the issue getting a reference implementation done, are there still open questions that need research/discussion, or does it just need someone to get it through the process?

@rossberg
Member

rossberg commented Feb 5, 2019

Well, given that caller and callee can have completely different parameter lists, tail calls cannot generally avoid argument shuffling, even if it happens in registers. Of course, engines are expected to minimise that in cases where they match, but that isn't semantically observable. Nothing unusual here, that's how they usually work.

@Danielhillerstrom has started hacking the experimental effect handler design into the reference interpreter as a first step, but I doubt that can be parallelised much. The next step would be preliminary experiments with targeting it, and that's where it could use all the help we can get: people trying to play with it, helping iterate on the design, perhaps writing an experimental compiler backend targeting it. But the hardest part by far, if we get there, will be an efficient implementation of multiple stacks / continuations (regardless of concrete design details) in a production engine, which is gonna be super tricky, given the utter complexity of existing JS/Wasm VMs.

@bitwalker

Well, given that caller and callee can have completely different parameter lists, tail calls cannot generally avoid argument shuffling, even if it happens in registers

My main issue is with this line:

Hence they unwind the operand stack like return does

This sounds a lot like it requires performing the same work that a return + a call performs, even if the implementation isn't strictly a trampoline. The reason why I think the distinction is important is that a key feature of tail calls is that they incur no more overhead than a regular call, in fact less, since no stack frame is allocated. But if tail calls are actually doing twice as much work as a normal call, then that is a huge price to pay for languages where tail calls aren't just an optimization.

I'm not necessarily saying that the proposal is requiring implementations to do it that way, but the way it is worded isn't ideal. Tail calls don't really "unwind" the stack, they overwrite their own stack frame, which I guess has the appearance of unwinding the stack, but is not the same thing. I didn't mean to derail the conversation here, but I think the spec should be explicit about these differences, unless the intention is to avoid requiring tail calls to be implemented as proper tail calls.

@Danielhillerstrom has started hacking the experimental effect handler design into the reference interpreter as a first step

Awesome, is that work public currently, or is it still being hacked together?

The next step would be preliminary experiments with targeting it, and that's where it could use all the help we can get, if people were trying to play with it, help iterating on the design, perhaps by writing an experimental compiler backend targeting it.

Once it is available to play with, even if it takes some work to set up on my end, I'd be open to writing a separate WebAssembly backend for my compiler, rather than using LLVM, or perhaps even by customizing LLVM with some new intrinsics + associated lowering. WebAssembly is an important target for me, so whatever I can do to help.

But the hardest part by far, if we get there, will be an efficient implementation of multiple stacks / continuations (regardless of concrete design details) in a production engine, which is gonna be super tricky, given the utter complexity of existing JS/Wasm VMs.

Yeah, I can imagine. That said, I would think the major engines are all strongly motivated to make it work, if WebAssembly is to become as foundational as it promises to be. Have engineers on those engines commented on this yet? Is it the kind of thing that is feasible and just a matter of time, or fundamentally incompatible with how those engines work today? I have little perspective on how those engines are currently implemented, but having an idea of the constraints they are operating under would be useful context to have for sure.

@rossberg
Member

rossberg commented Feb 6, 2019

My main issue is with this line:

Hence they unwind the operand stack like return does

Ah, don't confuse abstract machine lingo with implementation. This refers to the operand stack within the current function, which is just a notion of the spec's abstract machine and not physically materialised in real engines. This basically says that unused operands are dropped when you tail-call. This is all handled at compile time.

That said, there is no reason to believe that tail calls are faster than regular calls in general. That's a misunderstanding. Their purpose is reducing space, not time! They can well be somewhat more expensive in certain cases.

(The tail call instructions in some commercial engines, which shall remain unnamed, are notorious for having been up to 100x slower -- I'm confident that we will do better. :) )

Have engineers on those engines commented on this yet?

Of course. Everybody sees the need for some form of stack switching. That doesn't mean that everybody isn't scared...

@bitwalker

Ah, don't confuse abstract machine lingo with implementation. This refers to the operand stack within the current function, which is just a notion of the spec's abstract machine and not physically materialised in real engines.

Apologies for missing that! That indeed clarifies why it is phrased that way. Thanks for your patience :)

@aardappel

@bitwalker

but the trick is that you first define an arbitrary max stack size (say, the smallest size of all the platforms you support if you don't want to conditionally compile with different constants), and then the very first thing done on entry is to take the address of something on the stack

Except you can't take the address of anything on wasm stack, by design. The moment you take the address of a local variable (in LLVM or e.g. C++) the wasm code emitted changes the local variable from a wasm local to something on a shadow stack. So you can't measure the wasm stack this way. And you already said you'd be unhappy having everything on a shadow stack.

You could still emulate this.. you could for each wasm function statically compute the total size of all its locals + args + max temps, and use that as a conservative approximation of stack size used, and then execute your stack cleanup (using longjmp or exceptions) once it exceeds some conservative bound you estimate will be an ok amount of wasm stack to be using on any engine. But wasm doesn't specify the min/max stack an implementation should provide, so this is very iffy. You could even try to detect what an implementation provides by running out of stack space on purpose for a variety of recursive functions, count the number of calls, catch the trap, and estimate a conservative stack space bound from that ;)

Again, I think you're better off with a shadow stack, on the whole :)

Frankly, a wasm module should be allowed to at least hint at the amount of stack space it wants the engine to provide, because without such a thing, you would never be able to even use recursive algorithms (that process user data, e.g. a parser) in wasm, if the stack can be arbitrarily small. Or backends would always need to force recursive algorithms to use the shadow stack. Would that make any sense @rossberg @titzer @binji ?

So with that in mind, using a trampoline doesn't prevent this flow, instead it just imposes a costly overhead - rather than being able to call a continuation directly (really just a function pointer at this point), using arguments already on the stack; a continuation must be constructed, and live values moved to the heap, before returning to the trampoline. Doing that for every call is significantly slower than only constructing a continuation when scheduling, GC, or a stack reset occurs.

Ah ok, I was imagining you meant a trampoline for each call/callcc, but you mean it is ideally only used for yield and when you run out of stack space?

Before I encountered the paper, I am sad to say that I wasn't familiar with his work, but Baker is brilliant, I'm a huge fan, especially since as you say he continues to host all of his papers after all these years, essentially unchanged!

Yup, so many "creative" ideas, that are still relevant!

@rossberg
Member

rossberg commented Feb 7, 2019

@aardappel:

a wasm module should be allowed to at least hint at the amount of stack space it wants the engine to provide

It would be nice to provide some way to configure stack size, but I see two problems with the concrete suggestions:

  1. The stack is cross-module, so a per-module hint does not really make sense.
  2. There is no obvious metric by which to specify stack space, since stack use is the product of a whole range of variables.

Not sure how to address this. It would be nicer if the VM was just able to grow the stack. Then stack space would only be bounded by heap space.

@aardappel

@rossberg

  1. Good point, that is a problem. It is still useful to be able to specify it for the majority of cases though, where the "heavy" module (that relies on recursion) that was run as first/main module specifies such a limit, and any helper/child modules don't have any particular requirements. Or if stack resizing is easy, it could be the max of all modules involved.. though I am guessing there will be engines that can't resize if they directly use the process stack of the host, so it is all moot.

  2. Yes, it is extremely implementation dependent, but that is already the case if I write a native program. If I know Linux is going to give me 8MB, I then eyeball my recursive function to estimate how deep I can go and whether that is reasonable for the worst case data I think I will process.. before deciding to turn it into a non-recursive algorithm or not. That is already a very inexact science. I don't see why that would be much different for wasm. Being able to say "I expect there to be on the order of 1MB of stack" will give me much better guarantees than having to write my code assuming some engine is going to limit me to 16KB of stack (and thus having to remove any recursion from my code). Same for @bitwalker's use case.

@bitwalker

Except you can't take the address of anything on wasm stack, by design. The moment you take the address of a local variable (in LLVM or e.g. C++) the wasm code emitted changes the local variable from a wasm local to something on a shadow stack. So you can't measure the wasm stack this way. And you already said you'd be unhappy having everything on a shadow stack.

Well, perhaps an example is needed to illustrate what I mean. The following is a trivial C application that uses recursion to increment a counter until some arbitrary limit. It uses a trampoline, but only when the stack is going to blow, and it determines when to return to the trampoline by checking how much stack space it believes it has remaining:

#include <emscripten.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#define STACK_BUFFER 16384

// Checks whether the stack is overflowing
static bool stack_check(void);

// The address of the top of the stack when we first saw it
// This approximates a pointer to the bottom of the stack, but
// may not necessarily _be_ the exact bottom. This is set by
// the entry point of the application
static void *stack_initial = NULL;
// Pointer to the end of the stack
static uintptr_t stack_max = (uintptr_t)NULL;
// The current stack pointer
static uintptr_t stack_ptr_val = (uintptr_t)NULL;

// Maximum stack size of 248k
// This limit is arbitrary, but is what works for me under Node.js when compiled with emscripten
static size_t stack_limit = 1024 * 248 * 1;

static bool stack_check(void) {
  // Get the address of the top of the stack
  stack_ptr_val = (uintptr_t)__builtin_frame_address(0); // could also use alloca(0)

  // The check fails if we are within STACK_BUFFER bytes of the limit.
  // The difference must be compared as a signed value; the unsigned
  // expression (stack_ptr_val - stack_max - STACK_BUFFER) <= 0 would
  // only be true when it is exactly zero, since unsigned values
  // cannot go negative. Subtraction is used because the stack grows
  // downwards.
  return (ptrdiff_t)(stack_ptr_val - stack_max) <= (ptrdiff_t)STACK_BUFFER;
}

#define MAX_DEPTH 50000

static int depth;
static int trampoline;

int a(int val);
int b(int val);

int a(int val) {
    // Perform stack check and only recurse if we have stack
    if (stack_check()) {
        trampoline = 1;
        return val;
    } else {
        depth += 1;
        return b(val); 
    }
}

int b(int val) {
    int val2;
    if (depth < MAX_DEPTH) {
        // Keep recursing, but again only if we have stack
        val2 = val + 1;
        if (stack_check()) {
            trampoline = 1;
            return val2;
        }
        return a(val2);
    } else {
        return val;
    }
}

int main() {
    stack_initial = __builtin_frame_address(0); // could also use alloca(0)
    stack_max = (uintptr_t)stack_initial - stack_limit;

    double ts1, ts2 = 0;
    ts1 = emscripten_get_now();

    depth = 0;
    trampoline = 0;

    int val = 0;

    while (true) {
        val = a(val);
        // This flag helps us distinguish between a return to reset the stack vs
        // a return because we're done. Not how you'd probably do this generally,
        // but this is just to demonstrate the basic mechanism
        if (trampoline == 1) {
            trampoline = 0;
            continue;
        }
        break;
    }
    ts2 = emscripten_get_now();

    printf("executed in %fms\n", ts2-ts1);

    return 0;
}

To see for yourself, compile with emcc -s WASM=1 demo.c -o demo.js and run with node demo.js. My local node installation appears to have a maximum stack size of 984k, but as you can see I had to set the max stack size in my demo to 248k, as anything higher results in a stack overflow, and unfortunately there doesn't appear to be a way to get access to the precise limit of the stack; one has to guess or set a constant.

Please forgive the awful code; I sketched it out in a hurry. The basic idea should hopefully be clear: if one chooses to bounce back to a trampoline only when stack is running out, this is possible in WebAssembly today, albeit with some constraints. So far, in my profiling of similar scenarios, it is basically a tie between the above approach and an ordinary trampoline, but I suspect more realistic scenarios would show that the less frequently one has to return to the trampoline, the better. With such a small stack to work with, though, the benefit is quite small; I would expect to be working with a stack of at least 2mb, if not more, and 248k is just ridiculous. As for setjmp/longjmp, it doesn't even come close, due to how it is implemented today.

As @rossberg said, I don't think a per-module stack size makes sense, it would need to be configurable globally, which is essentially the case today except you are configuring the global stack size for the engine itself. As far as I'm concerned, that may as well mean it isn't configurable at all, and that's fine - the actual size isn't the primary problem, though the current situation isn't ideal. Allowing the stack to grow arbitrarily large by growing into the heap seems like a viable option as well, but probably not ideal since it isn't really what anyone would expect. I would rather just see a more forgiving stack size (like I said above, at least 2mb).

From the perspective of the spec, I don't think one can require a specific stack size, but I do think that a few things could be added or clarified:

  • A recommendation regarding the minimum stack size. Some highly restrictive platforms will probably need to go lower, but with a recommendation of something like 2mb, platforms that aren't so restricted can provide more stack space to work with, without the spec imposing a hard minimum
  • Most important is a way to get the current stack limit as applies to the WebAssembly code being executed (e.g. the value of getrlimit(RLIMIT_STACK, ..) would be a limit useful to code compiled to WebAssembly). This provides language implementations with the means to handle different sized stacks without having to add a bunch of conditional constants per-engine, or worse, a single extremely low constant for the lowest common denominator.

For my own project, I'm going to do more profiling, but based on what I'm seeing, I will probably just use a simple trampoline when targeting WebAssembly until setjmp/longjmp or something like resumable exceptions becomes viable, tail calls (which solve my use case) land, or the stack sizes in the major engines grow large enough for Cheney on the MTA to be employed. The tools at my disposal just aren't enough to make it work well today.

@bitwalker

I should also clarify, I compiled the above example with emcc, but it is configured to use my own LLVM installation (the one used for this demo is LLVM master, so 9, and built within the last few days - no changes on my end), so it is being compiled with the LLVM backend, not asm.js - just in case that is expected to make a difference.

@aardappel

@bitwalker
Are you sure the above code does what you think it does? Afaik, __builtin_frame_address (and alloca) will return addresses on the shadow stack, whereas your recursion is happening on the wasm stack, so these should be unrelated values. Since you're not actually storing anything on the shadow stack in this code, it may not even be growing, so __builtin_frame_address may return.. the same value each time?

@bitwalker

@aardappel I did a fair amount of testing before posting ;). To check the values returned at each step, I dumped all the stack-related values via printf. I also ran the experiment both with the stack checking enabled and without. Without it, the code gets nowhere close to 50k (the MAX_DEPTH value) and aborts, which is expected of course. With it enabled, the program reaches 50k while only hitting the trampoline ~6 times, completing successfully. If you change the maximum amount of stack space to something larger than 248k, say 256k, the stack overflows, which indicates that the code is returning to the trampoline correctly as we reach the defined stack limit. The value produced by __builtin_frame_address (which is lowered to the @llvm.frameaddress intrinsic) gets smaller by the correct amount each time a stack check is performed (based on a rough estimate from looking at the allocations in the LLVM IR), so it appears to be producing the value I'd expect.

I did try an implementation with getrlimit(2), but the value returned by that is clearly disconnected from the stack (the shadow stack you mention), and may refer to the real stack - I'm not sure, but it isn't useful for the stack check. So there are some tools that can't be used.

@binji
Member

binji commented Feb 8, 2019

@bitwalker I think that this may be because you're compiling without optimizations. I took your code and compiled it with webassembly.studio, which by default compiles with -O3. That completely optimizes away the loop and all function calls. If I add in a dummy call, it turns the whole thing into a loop. https://webassembly.studio/?f=0aj0cn61ob8

@bitwalker

@binji Yes, I'm compiling without optimizations, and that is intentional because of the size and simplicity of the example - it happens to be something a compiler can trivially convert into a loop in this case. Much larger programs would not be optimized this way because it would essentially require defining the whole program within a single function.

The point of the example is not whether a compiler can optimize this one case into a non-recursive form, but that fundamentally one can check the available stack space and act on it to implement Cheney on the MTA.

@rossberg
Member

rossberg commented Feb 8, 2019

@bitwalker, as @aardappel said, the value produced by __builtin_frame_address is for the shadow stack. Even if the value changes that says nothing about the room left on the actual system call stack, which is the one you care about. There is no way to measure or observe that from within Wasm.

Your code is exploiting undefined behaviour in C. It happens to work on common hardware, but not on Wasm.

@bitwalker

bitwalker commented Feb 8, 2019

Your code is exploiting undefined behaviour in C. It happens to work on common hardware, but not on Wasm.

Just to be clear, I am running that code in a WebAssembly engine, specifically v8, not compiling to x86_64.

I know I'm looking at the shadow stack in the example, but that's because I'm assuming that allocations I make on the stack (in WebAssembly code) are being allocated on the shadow stack, including stack frames for functions called in that code. In other words, the system stack is not something I need to even be aware of when writing WebAssembly; the shadow stack is the "real" stack in that context.

If that isn't the case, then yes, you're right that the example works by happy accident - but that would seem to imply that there is no way for a compiler to prevent generating code that could overflow the system stack, which seems like a huge problem to me. Is that really the case?

@rossberg
Member

rossberg commented Feb 8, 2019

Wasm functions use the system stack like most other code. The engine implements stack guards, but other than that it uses the system stack like any other compiler and runtime do. And yes, it can overflow, but such overflows are caught. The shadow stack is something introduced by the C compiler, and it only contains locals that need to be in memory addressable from C. Clearly, the engine could not store return addresses or other internal pointers there without creating major security problems.

@bitwalker

And yes, it can overflow, but such overflows are caught.

The problem I'm getting at is that this distinction isn't useful if a compiler implementation targeting Wasm is unable to implement any runtime checks to prevent overflowing the stack, or to recover from imminent overflow, as is possible on other targets. I'm unclear on why it is not desirable to make available, at least on a read-only basis, the current system stack pointer, as well as the maximum stack size, in order to do so. Clearly putting everything on a shadow stack isn't enough to guarantee anything, as previously suggested, so the options for functional languages here seem really awful, frankly.

My C example seems to have muddied the waters rather than help illustrate, so I apologize if it has been more trouble than it was worth. I'm neither using C, nor generating it, I chose it to keep the example simple. However, my compiler does use LLVM, and would use the intrinsics it provides for Wasm, which naturally rely on features defined in the spec. But it seems like there is no supported way for a language implementation based on Cheney on the MTA, such as mine, to target Wasm without significant refactoring, which in most cases I suspect is simply a non-starter.

So the most important question to me is: what is the path forward for these languages/implementations? Likewise, what can the spec do to support them? I thought the biggest missing piece was lack of native tail call support, but it seems like the typical options for working around that are essentially unavailable, or impractical due to overhead. The one exception is using a simple trampoline - and if you aren't starting from scratch, redesigning your implementation around that is non-trivial at best.

@rossberg
Member

rossberg commented Feb 8, 2019

The system stack pointer is a physical hardware address. Clearly, it is out of the question to expose that to untrusted code on a security-sensitive platform like the web, even if it were read-only.

In principle, Wasm could provide more abstract means for querying and manipulating stack sizes. But the details of how to provide such a mechanism are not obvious, see above. Especially because they would introduce a substantial degree of observable non-determinism. A more desirable route would be to have mechanisms for growing stacks in engines themselves rather than having each program hack around it.

Once the tail call proposal becomes available, the problem should be reasonably minor in practice, though. What remains are fairly strict resource constraints and the fact that programs can die unexpectedly when they blow them, but that has always been the reality of the web. I think it is an exaggeration to say that this is prohibitive, and Wasm is surely not making it worse.

@bitwalker

The system stack pointer is a physical hardware address. Clearly, it is out of the question to expose that to untrusted code on a security-sensitive platform like the web, even if it was read-only and inaccessible.

I'm interested to know more about the specific vulnerability(s) this represents, but I'm not arguing that real hardware addresses are necessary. Unless I'm missing something, the stack pointer address accessible to Wasm code need only be a useful proxy for the relative position in the system stack that the code is executing in (e.g., appearing to be an address between 0 and n), and the limit can be arbitrarily smaller than the real system stack size. This gets you the benefit of being able to query the stack without exposing real addresses outside the engine itself.

Especially because they would introduce a substantial degree of observable non-determinism.

Could you clarify what you mean by this? I'm specifically talking about being able to query stack size, I don't think manipulating it is a goal. Surely it is acceptable for different engines to have different stack sizes, so exposing that size would introduce no more non-determinism than is already fundamentally present.

A more desirable route would be to have mechanisms for growing stacks in engines themselves rather than having each program hack around it.

That deals with one issue (the stack being too small and thus too easy to overflow), but is tangential to the issue I'm trying to raise.

Once the tail call proposal becomes available, the problem should be reasonably minor in practice, though.

I agree, for some subset of languages anyway, but that proposal still seems to be early in its cycle and will be some time in coming. In the meantime, the lack of tail calls and the lack of workarounds is a huge issue for languages that require them. The tail call proposal itself doesn't contain a solution for languages which use the stack for allocations and garbage collect by jumping, but the resumable exceptions idea would likely provide one; however, that isn't being actively worked on as far as I'm aware, it is just theoretically possible once the exceptions proposal is accepted.

What remains are fairly strict resource constraints and the fact that programs can die unexpectedly when they blow them, but that has always been the reality of the web. I think it is an exaggeration to say that this is prohibitive, and Wasm is surely not making it worse.

The constraints themselves are not prohibitive, and aren't really the issue. Resource constraints and associated risk exist outside the web as well, but there are tools to deal with them there - in the context of stack limits, one has setjmp/longjmp, checking the stack and garbage collecting via jump/return when it is about to overflow, or using trampolines. Essentially, one can ensure that if you are about to exceed a limit, that it can be handled gracefully (or just allow the limit to be exceeded, and abort/panic). The issue with Wasm that I'm trying to find a solution for, is that these tools (with the exception of trampolines) are removed (either entirely, or in practice as is the case with SJLJ), and have no replacement. For languages which base their entire implementation around those tools, the lack of them is certainly a major issue. I'm not at all saying Wasm is fundamentally flawed or something, but I do think there is something missing in the spec here, specifically, some means of understanding current constraints code is executing under (e.g. stack size) and where you are relative to those constraints (e.g. something that resembles an address that gives you an idea of your position relative to the max stack size).

If I've fundamentally misunderstood or simply missed something, I'm definitely looking to learn and fix that. It just seems like this is something that has been overlooked, based on reading through the spec and looking through various issues/proposals in the org. If you have links to other discussions that are relevant here, definitely point me towards them so I can avoid rehashing anything that has already been discussed, or can at least make sure I'm bringing something new to the table.

@rossberg
Member

@bitwalker:

Especially because they would introduce a substantial degree of observable non-determinism.

Could you clarify what you mean by this? I'm specifically talking about being able to query stack size, I don't think manipulating it is a goal. Surely it is acceptable for different engines to have different stack sizes, so exposing that size would introduce no more non-determinism than is already fundamentally present.

Currently, Wasm code cannot observe stack overflows, because it will be aborted. Its own outcome cannot depend on stack size, only success vs failure. That is quite different from being able to succeed with result A or with result B depending on the stack.

Once the tail call proposal becomes available, the problem should be reasonably minor in practice, though.

I agree, for some subset of languages anyway, but that proposal still seems to be early in its cycle, and will be some in time in coming.

The proposal is at stage 3, so should be available relatively soon. Starting a whole different proposal from scratch seems very unlikely to go faster.

The issue with Wasm that I'm trying to find a solution for, is that these tools (with the exception of trampolines) are removed (either entirely, or in practice as is the case with SJLJ), and have no replacement.

My point was that these tools have never been available on the Web, so it should perhaps not be surprising if they are not available in Wasm either.

@bcardarella

My point was that these tools have never been available on the Web, so it should perhaps not be surprising if they are not available in Wasm either.

For sure, but now that Wasm opens up the possibility of these tools, shouldn't the spec strive to identify and address them?

@aardappel

So the most important question to me is: what is the path forward for these languages/implementations?

For now, use a shadow stack, much like what your C code does in an unoptimized build, and as I mentioned in my first reply.

That need not be disastrously slow; you can still load variables into locals inside a single function and get some benefits of register usage. Also, you now have super cheap stack manipulation, since you control all of it.

Like @rossberg says, while theoretically possible, I don't see it likely that wasm would provide enough access to the wasm stack such that you can implement all this functionality there.

The number of languages out there whose execution model deviates from the C-like model is endless; we're unlikely to support all of them directly. In fact, we don't even fully support C directly :) Tail calls and resumable exceptions will help for some, but "rolling your own stack frames" will remain a popular option, I predict.

@bitwalker

The proposal is at stage 3, so should be available relatively soon. Starting a whole different proposal from scratch seems very unlikely to go faster.

That's great news! I haven't seen much movement, though, and it is still unclear to me whether the actual proposal has proper tail calls or something more restricted. But tail calls are only part of the problem; memory management is the other key aspect of this.

Allocating on the stack when possible is always preferable, because on most systems, the stack is an area of memory for which access is optimized relative to main memory. In cases where it is not, it just functions like any other memory. So it follows that if the stack is mostly free, because all functions in your program are tail calls, and so consume no stack space, then you would want to use that free stack space for temporary allocations. In order to do that, you need to know whether an allocation will succeed or fail, by checking the amount of stack you have left. Furthermore, such a model simplifies garbage collection significantly, as one can easily garbage collect the stack space by either returning to a trampoline, or by manipulating the stack pointer. Which approach you use is not of any significance here, what matters is that you can tell when a GC needs to happen.

As a language implementer, you also don't want to have two completely different ways of doing memory management based on whether you are targeting WebAssembly or not. Not having the stack be optimized is no big loss, but having to avoid the stack on one platform means you have to alter both your allocation code and garbage collection implementations significantly. This is a huge burden on compilers for which WebAssembly is only one possible target. In situations where it is unavoidable, you just have to eat that cost or avoid support for the platform; but that is only true of WebAssembly if no proposal to make that change succeeds.

My point was that these tools have never been available on the Web, so it should perhaps not be surprising if they are not available in Wasm either.

I'm not surprised they aren't available right now, but I am surprised and disheartened to hear this kind of answer, especially from you, which seems to say that because something hasn't been available on the web historically, it isn't worth consideration. WebAssembly is already seeing use outside the web, and such use cases are even mentioned in the spec/materials for WebAssembly itself! Not to mention that WebAssembly is already bringing tools to the table that were previously not available; tail calls being just one of them.

Nothing about the web (that I'm aware of) fundamentally places a restriction on being able to query information about stack space - that's just an implementation detail of how the engines work today, it was never needed in JS, so why would it exist? If engine developers want to make it available to WebAssembly, they can, but I doubt they are inclined to do so without motivation. I'm advocating with the goal of trying to explain why it is necessary. To that end, I'm also working on getting in touch with a variety of functional language compiler engineers to gather more use cases and feedback, which I am hoping to have shared here soon.

One of the most inspiring things about WebAssembly is that it has the potential to enable so much of what was previously not possible on the web, or was possible only at great effort and expense. I generally agree with the sentiment that WebAssembly is about the web first and foremost, but unless there is a fundamental problem with some proposal, I'm not sure we should be shutting the door on ideas just because they aren't already present in some form. After all, isn't that what the proposal process here is for?

For now, use a shadow stack, much like what your C code does in an unoptimized build, and as I mentioned in my first reply.

That's not viable as an option, as was pointed out previously, because there is no direct correspondence between the shadow stack and the system stack; this means that preventing overflows, which is the point, is not guaranteed. The guarantee is necessary for tail calls, as well as for deciding when to perform allocations on the stack vs the heap and/or trigger GC.

That need not be disastrously slow, you can still load variables into locals inside as single function, and get some benefits of register usage. Also you now have super cheap stack manipulation since you control all of it.

How are you suggesting I implement my own stack? Compiling to native code means using the native stack, in this case that means the Wasm shadow stack, over which we don't have full control. Compiling a program into a giant loop/switch dispatch is an awful solution, for a number of reasons. The only current alternative is to trampoline for every call, but as mentioned before, beyond the performance cost, it still leaves out any language that wishes to perform GC using a Cheney on the MTA style generational collector, or use the stack for allocations.

The amount of languages out there that deviate in their execution model from the C-like model is endless, we'll unlikely support all of them directly. In fact we don't even fully support C directly :)

From what I've seen, virtually every functional language of consequence (Haskell, Lisp, ML, Prolog) has at least one compiler/implementation which generates C, and the models those languages represent cover essentially the entire space of functional programming languages. Erlang, for example, is quite similar to ML in its core representation. These languages all differ significantly from C semantically, but nevertheless they can all be translated down to something that can fit into C's model. I don't think we're talking about an issue of supporting every possible implementation detail, but having primitives to query usage of the stack is a cross-cutting concern that affects all kinds of languages, C included.

Tail calls and resumable exceptions will help for some, but "rolling your own stack frames" will remain a popular option I predict.

Do you have an example of a language compiler that targets Wasm, rolls its own stack frames, and is not an interpreted language? You predict it will remain popular, but I'm not seeing it; maybe I'm just looking in the wrong places. I suspect this is also a problem for FFI, but I don't know enough about the proposed implementation to say.

Perhaps I should open up a new issue specifically for the stack primitives I think are needed? This thread is not really about continuations specifically anymore, and it isn't really about tail calls either. It may help to hit the reset button, so to speak.

@autumnontape
Author

@bitwalker

Perhaps I should open up a new issue specifically for the stack primitives I think are needed?

To that end, I'm also working on getting in touch with a variety of functional language compiler engineers to gather more use cases and feedback, which I am hoping to have shared here soon.

At this point, yes, I believe you should open a separate issue.

I think you should also make a list of exactly which stack primitives you're asking for. I know that aside from features like tail calls and built-in SJLJ that are already planned, you'd like to be able to predict when the internal call stack may overflow in advance. Is that all you need?

@bitwalker

At this point, yes, I believe you should open a separate issue.

Will do

...you'd like to be able to predict when the internal call stack may overflow in advance. Is that all you need?

Right, I am thinking to more or less mirror the primitives one would use elsewhere:

  • Stack size limit, this could be arbitrarily smaller than the real stack size limit, but would achieve the goal of providing an upper bound on the amount of stack available.
  • Stack position, this isn't a pointer, but rather a position relative to the stack limit. This provides a way to evaluate how much stack remains.

But these could be rolled up into a single primitive "stack space remaining", which doesn't provide you with a limit or a position, if that is preferable. Since some things live on the Wasm shadow stack, and some on the host stack, one has to take into account both rather than just one when making decisions, and as far as the host stack is concerned, I think it is only really necessary to know how much space remains available.

Using these, a compiler could generate code that would check the stack remaining, and determine if a garbage collection/trampoline is required or not, then perform the stack allocations it needs, do any local computation, then proceed by invoking a continuation (generally another call, but can be a local jump in some cases), which in the case of a call, would enter the new function and again check the remaining stack and decide how to proceed. An important part of that flow is the ability to know where things will get allocated, so that the correct limit is checked. I'm not 100% clear on what all may end up on the host stack, so I'm going to try and follow up on that today before I open a new issue.

@sjmackenzie

sjmackenzie commented May 5, 2019

My point was that these tools have never been available on the Web, so it should perhaps not be surprising if they are not available in Wasm either.

Considering that a language like @shriram's Pyret draws from deep insight into the failings and success points of the top languages out there (including JavaScript), his comment carries weight: #919 (comment). JavaScript is a low bar; the main thing it has going for it is ubiquity, earned by participating in content dissemination over a point-to-point network. Data dissemination monopolies arise purely as an emergent behaviour of disseminating data over a point-to-point communication system. A language designed in 10 days rode the coattails of a heavy tcp/ip node destined to serve ads; it isn't special and didn't come to its current status through the virtue of its merits. We have an opportunity to right the wrongs and not require so many frameworks to half-arse reimplement continuations. TCO is a great step in the right direction.

Continuations allow for proper concurrent functional programming languages which have reactors that broadcast data to multiple points (not just GAFAM servers!). Especially when data becomes directly addressable and we don't need to obtain indexed data by addressing google's nodes. It benefits google to have a single threaded execution environment as it works well shunting ads down an IP client/server pipe. Proper continuations give us better options. The browser can be so much more than the deliberately crippled advert serving machine it is today.

@shriram

shriram commented May 6, 2019

A bit late to this conversation, which I didn't know was happening, but @bitwalker may find it useful to take a look (maybe not any more!) at Stopify [site: https://www.stopify.org], [paper: https://cs.brown.edu/~sk/Publications/Papers/Published/bnpkg-stopify/], [repo: https://github.com/plasma-umass/Stopify].

It presents a generic JS->JS compiler that implements similar transformations and includes optimizations to provide continuations. It shows how several JS-transpiled languages can be automatically enriched using these transformations.

While I would still like to see better native support, Stopify offers a good path (see our performance numbers) from today's reality to that dream. In particular, it takes a ton of crud out of the run-time system, separating core language concerns from continuation-related concerns.

CC @arjunguha @jpolitz @rachitnigam

@rossberg
Member

rossberg commented Jun 8, 2020

As a follow-up to some of my comments above, I should probably link to my presentation from the Feb 2020 meeting, showing ideas for adding stack switching/coroutines/continuations/effect handlers to Wasm.

@autumnontape
Author

Closing in favor of #1359


8 participants