This repository was archived by the owner on Apr 25, 2025. It is now read-only.

Primer on low-level stack and control-flow primitives #105

@RossTate

During all this conversation, I've been reading up on the various stack and control primitives out there. These are the low-level primitives that exceptions, stack tracing, gc-root marking, generators, resumable exceptions, and continuation delimiting can be and often are implemented with. They have been found to often work well across module boundaries and even across language boundaries (or at least to facilitate cross-boundary coordination). Given their low-level nature, they don't require introducing any new value types. So I thought I'd review them here for y'all and attempt to illustrate how they'd work in and contribute to WebAssembly.

Simple Stack Marking

Let's start with helping languages manage their own memory. One key part of garbage collection is finding all the roots, i.e. all the garbage-collectable references currently on the stack. Yes you could do this by maintaining your own list, but this list gets changed a lot, and it's only needed relatively infrequently—only when the garbage collector runs. So that's not particularly efficient. What you really want is a way to walk the current stack and mark all the references on it on demand. (Here I'm assuming that code occasionally calls gc_collect, explicitly handing over control, rather than the gc firing at arbitrary times.)

We'll do this using a stack mark:
mark gcroot : [] -> [i32]
This declares a new kind of stack mark called gcroot that provides an i32 (with no input needed), which conceptually is a root on the stack. This declaration would be up in the events section, not at the instruction level.

Next we'll actually mark the stack:

void foo() {
    i32 newref = gcalloc(12);
    with-value-mark (gcroot newref) {
        ... // do stuff with this mark on the stack
    }
}

Here we allocate some garbage-collected memory and assign the address to newref. Now this address is sitting on foo's stack frame, and so we need to make sure its referenced memory isn't cleaned up until foo returns. The with-value-mark (gcroot newref) {...} construct places a gcroot mark with value newref on the stack (often using code regions to indicate which instructions are marked, imposing little to no run-time overhead) and keeps it there until the ... in the body is exited. In particular, any call to gc_collect while that ... body is still executing will see the value of newref on the stack as a gcroot mark. Which leaves the question, how does gc_collect observe that mark?

Stack Walking

Now let's dive into gc_collect, or really its helper function collect_roots:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcroot {
                add_reachable(get-mark);
            } none {
                break;
            }
        }
    }
}

Here we have three new constructs: walk-stack, next-mark, and get-mark.
The construct next-mark is only usable within walk-stack, and the construct get-mark is only usable within next-mark.

What walk-stack does is store a pointer into the current stack, starting with the present point in the stack, indicating where we currently are in the stack walk. next-mark then walks up the stack looking for marks. Here it has just one mark type specified, gcroot, but in general there could be many mark types specified. Once next-mark finds a mark of one of the specified types, it updates the pointer into the stack, and then it enters the body corresponding to that tag. Within that body, whenever get-mark is executed it returns the payload of that mark onto the stack. If at some point next-mark is executed and unable to find a matching mark, it executes the none body.

So collect_roots effectively iterates through all the gcroot marks on the stack and calls add_reachable on each of the respective i32 addresses, returning when the top of the stack is reached.

Advanced Stack Marking

Now this is a little inefficient. Often a function references many roots at a time, and this will process them one by one. So let's optimize this example, first by understanding how with-value-mark is actually a shorthand.

Remember that the declaration mark gcroot : [] -> [i32] looks a lot like an input-output operation, not just a value. That's because it is. with-value-mark is just shorthand for "whenever someone does get-mark on this mark, provide this value".

So with that in mind, let's expand foo above:

void foo() {
    i32 newref = gcalloc(12);
    mark gcroot {
        push newref;
    } within {
        ... // do stuff with this mark on the stack
    }
}

Here we see that the body of mark gcroot { push newref; } is just a block of type [] -> [i32], matching the corresponding declaration. Whenever get-mark is called, it hands the current pointer into the stack to this block so that the block can access its stack frame while executing on top of the overall stack (rather than creating a new stack). Once this block terminates, its frame is popped off the stack. So this gives us a way to temporarily execute blocks from stack frames up the stack.

So let's take advantage of that and redesign our mark to be gcaddroots : [] -> [], and consider a new function foo2 with multiple roots:

void foo2() {
    i32 newref1 = gcalloc(12);
    i32 newref2 = gcalloc(24);
    mark gcaddroots {
        add_reachable(newref1);
        add_reachable(newref2);
    } within {
        ... // do stuff with this mark on the stack
    }
}

Here gcaddroots directly takes care of adding both references to the reachable set. This in turn makes collect_roots simpler:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcaddroots {
                get-mark;
            } none {
                break;
            }
        }
    }
}

So with lower-level stack primitives and control-flow primitives, we can enable much better support for manual memory management. Hopefully it's clear how these same primitives could enable one to implement get_current_stack_trace() and even to place stack-trace marks that indicate how WebAssembly code corresponds to source code, so that the result of get_current_stack_trace() is something that a non-wasm expert could actually use to debug their source code.
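
For instance, here is a minimal sketch of what get_current_stack_trace() could look like on top of these primitives, assuming a hypothetical stack_trace mark whose payload is a string describing the source location of the marked code (the mark name and the formatting are made up for illustration):

mark stack_trace : [] -> [string]

string get_current_stack_trace() {
    string trace = "";
    walk-stack {
        while (true) {
            next-mark stack_trace {
                trace = trace + get-mark + "\n"; // append this frame's source location
            } none {
                break;
            }
        }
    }
    return trace;
}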

Stack Unwinding

Now for the second set of stack primitives, used to implement stack unwinding. Stack unwinding is the process of cleaning up state as one pops frames off the stack. This cleanup takes the form of destructors in C++ and finally clauses in many other languages. The most familiar instigators of cleanup are function returns and thrown exceptions. As we'll see, exceptions are really many primitives put together. I'll use C# because it makes it easiest to see how to break exceptions down into these primitives.

Escaping

Consider the following C# function:

int bar() {
    try {
        return some_func();
    } catch {
        return uhoh();
    }
}

It uses the unqualified catch clause to catch both C++ and C# exceptions (at the cost of not being able to inspect the exception).

Here's how we would compile this to lower primitives (ignoring C++'s nested rethrow for now):

mark cpp_handler : [i32, i32] -> [] // rtti and value
mark csharp_handler : [csharpexn] -> []

int bar() {
    escape $target {
        mark cpp_handler(i32 rtti, i32 val) {
            escape-to $target;
        } within {
            mark csharp_handler(csharpexn exn) {
                escape-to $target;
            } within {
                return some_func();
            }
        }
    } hatch {
        return uhoh();
    }
}

void cpp_throw(i32 rtti, i32 val) {
    walk-stack {
        while (true) {
            next-mark cpp_handler {
                get-mark(rtti, val);
            }
        }
    }
}

void csharp_throw(csharpexn exn) {
    walk-stack {
        while (true) {
            next-mark csharp_handler {
                get-mark(exn);
            }
        }
    }
}

Here we introduce two new primitives: escape-hatch and escape-to. The body of escape $target executes normally unless/until an escape-to $target is reached, at which point all of the stack between those two points is unwound (starting from escape-to) and then redirected to the hatch clause. This in particular means the hatch clause is never executed if escape-to $target is never executed. It also means that the wasm program has control over when the stack is unwound and at the same time guarantees nothing is ever executed on an invalid stack.

If we look at the implementations of cpp_throw/csharp_throw, we see they work by walking up the stack until they find a handler. Note that cpp_throw/csharp_throw do not cause the stack to be unwound. Instead, as you can see in the lowering of bar, the handler is responsible for this, which is important for implementing more flexible exceptions (discussed next). If the handler doesn't do this, then cpp_throw/csharp_throw continue on to the next handler until they reach the top of the stack. Notice that next-mark doesn't have a none case here, which means cpp_throw/csharp_throw trap if they reach the top of the stack. Conveniently, all of the stack is still intact, so a debugger can kick in to let the programmer inspect the intact state of the program and cause of the exception, or the host can collect the stack-trace marks to report the trace of the exception.

Handing Off State

Okay, before we get into filtered exceptions, let's first illustrate how to hand off state to the escape-hatch:

int bar2() {
    try {
        return some_func();
    } catch (Exception e) {
        return uhoh(e);
    }
}

lowers to

int bar2() {
    escape $target {
        mark cpp_handler(i32 rtti, i32 val) {
            if (RuntimeCompatibility.WrapNonExceptionThrows)
                escape-to(new RuntimeWrappedException(rtti, val))  $target;
        } within {
            mark csharp_handler(csharpexn exn) {
                escape-to(exn) $target;
            } within {
                return some_func();
            }
        }
    } hatch (csharpexn e) {
        return uhoh(e);
    }
}

Here we see that escape-to can hand off values to the hatch. We also see a case of a handler not escaping in cpp_handler. There's a flag in C# (nowadays set to true by default) that makes catch (Exception e) even catch (wrapped) C++ exceptions. If that flag is set, then the cpp_handler wraps the exception and forwards it to the hatch. If not, then the cpp_handler does nothing so that cpp_throw continues on to a "real" cpp_handler.

Filtering Exceptions

Because we have given the wasm program more control, without any more changes we can support C#'s filtered exceptions. Consider the following example:

class MyException : Exception { int type; }
int bar3(int i) {
    try {
        return some_func();
    } catch (MyException e) when (e.type == i) {
        return uhoh(e);
    }
}

The when clause indicates that the exception should only be caught when that condition evaluates to true. This clause can be stateful and so is specified to be evaluated before any finally clauses or destructors. [Here](https://thomaslevesque.com/2015/06/21/exception-filters-in-c-6/)'s a blog post on why this is useful. (Hint: better debugging support is one reason.) As a consequence, we lower this program to the following:

int bar3(int i) {
    escape $target {
        mark csharp_handler(csharpexn exn) {
            if (exn is myexception && exn.type == i)
                escape-to(exn) $target;
        } within {
            return some_func();
        } 
    } hatch (myexception e) {
        return uhoh(e);
    }
}

Notice that no unwinding is done if the condition fails. We also didn't have to change csharp_throw to support any of this; csharp_throw just walks up the stack firing off potential exception handlers until one unwinds the stack (just like what a standard VM's implementation of exceptions does).

Stack Conventions

I say that we unwind the stack, but what does that mean? As a low-level primitive, it just means move the stack pointer and reclaim the stack. Technically that is it. But conventionally stack unwinding also means executing finally clauses and C++ destructors. Rather than baking in such a convention, let us modify escape/hatch to directly support it:

mark unwinder : [] -> [];

int bar4(int i) {
    escape $target {
        mark csharp_handler(csharpexn exn) {
            if (exn is myexception && exn.type == i)
                escape-to(exn) $target;
        } within {
            return some_func();
        } 
    } unwind unwinder {
        get-mark;
    } hatch (myexception e) {
        return uhoh(e);
    }
}

We have added an unwind clause to bar4. This forces escapes to $target to walk the stack as it is unwound, now frame by frame rather than all at once. Whenever the given mark is encountered, in this case unwinder, the corresponding unwind clause is executed (while the stack frame for the mark is still intact). If the escape-to clause provides a value of type t to pass to hatch, then the unwind clause technically has type [t] -> [t], giving it the chance to update this value, which can be useful in particular for collecting stack-trace marks while the stack is unwound. In the case of bar4, we don't care about updating the exception e, so we just use get-mark to execute the mark. In this way, the escape in bar4 effectively calls all unwinder marks as it unwinds the stack.

Another lesser-known convention exists in languages with more complex control flow. Consider how a stack walk essentially maintains the stack but shifts the control focus of the program up the stack. Sometimes it is useful for function calls on the stack to be able to track this focus. For example, a C++ program maintains its own stack in addition to the wasm stack, and so it might be helpful to also track what point within that separate stack corresponds to the current focal point of the stack walk.

Let's modify our garbage-collection helper collect_roots to permit such a convention:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcaddroots {
                get-mark;
            } focus-out {
                get-mark;
            } none {
                break;
            }
        }
        while (true) {
            prev-mark focus-in {
                get-mark;
            } none {
                break;
            }
        }
    }
}

This modification invokes focus-out marks on the way out so that they can observe that the stack walk has moved out of that part of the stack. After the walk reaches the top of the stack, it then walks back and invokes focus-in marks to conceptually revert any changes that were made by the focus-out marks.

Especially with these conventions, this pattern of simply invoking marks as one passes them is extremely common. So let's use walk-stack [next-pass*] [prev-pass*] and escape [unwind-pass*] as shorthand for simply invoking get-mark whenever a next-mark/prev-mark/unwind passes a mark in the respective lists. This shorthand lets us abbreviate collect_roots to just the following:

void collect_roots() {
    walk-stack [gcaddroots, focus-out] [focus-in] {
        next-mark none {}
        prev-mark none {}
    }
}

Nested Rethrow

At this point we have all the core functionality needed to implement a variety of features. I need to get to bed, so I'll edit this with some more non-exception examples later, but first let me illustrate how this supports C++'s nested rethrow. Actually, I'll show it supports Python's nested rethrow because, unlike C++, Python actually has a stack-tracing convention that's worth demonstrating.

To see this convention, if you run the following program:

def failure():
    raise ValueError, "Failure"
def refailure():
    raise
def fail():
    try:
        failure()
    except ValueError:
        refailure()
fail()

then you will get the following stack trace:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    fail()
  File "main.py", line 9, in fail
    refailure()
  File "main.py", line 7, in fail
    failure()
  File "main.py", line 2, in failure
    raise ValueError, "Failure"
ValueError: Failure

Notice that both failure and refailure are in the trace. This illustrates that Python builds the stack trace as it goes. Note: if you add prints, you'll also see that the stack for failure is unwound before refailure executes.

Here is how we can compile this (assuming for simplicity that all Python exceptions are ValueErrors and all we care about is the trace):

mark code_line : [] -> [string, i32, string, string]
mark python_handler : [pyref, string] -> []
mark python_except : [] -> [pyref, string]
void fail() {
    escape $target {
        mark python_handler(pyref exn, string trace_so_far) {
            if (exn == ValueError)
                escape-to(exn, trace_so_far) $target;
        } within {
            with-value-mark (code_line "main.py", 7, "fail", "failure()") {
                failure();
            }
        }
    } hatch(pyref exn, string trace_so_far) {
        with-value-mark (python_except exn, trace_so_far) {
            with-value-mark (code_line "main.py", 9, "fail", "refailure()") {
                refailure();
            }
        }
    }
}
void raise_with_trace(pyref exn, string trace) {
    walk-stack {
        while (true) {
            next-mark python_handler {
                get-mark(exn, trace);
            } code_line {
                [string file, i32 line, string body, string code] = get-mark;
                trace = file + line + body + code + trace (with formatting);
            }
        }
    }
}
void raise_new(pyref exn) { raise_with_trace(exn, ""); }
void reraise() {
    pyref exn;
    string trace;
    walk-stack {
        next-mark python_except {
            exn, trace = get-mark;
        }
    }
    raise_with_trace(exn, trace);
}

There are three things to note here. First, raise_with_trace also observes the code_line marks in order to collect the stack trace as it walks up the stack to find a handler (which is then handed the stack trace up to that point). Second, in the lowering of fail the call to refailure is executed within a python_except mark. Third, the reraise function implementing Python's rethrow walks up the stack looking for the nearest python_except mark, which provides the most recently caught exception. Altogether we get an implementation of Python's nested-reraise construct with its own semantics for stack tracing (which differs from both Java's and C#'s), and we didn't have to add anything new for it. Of course, the same technique can be adapted for C++'s nested-rethrow construct.

Generators

A number of languages feature generators. There are a few variations on this concept, so I'll focus on C# and Python generators. These have a foreach (x in list) {...} construct that executes its body for each element "in" the list, and a generator-method construct in which yield value can be used to conveniently yield the elements "in" the list. Although foreach is often implemented by translating to for (IEnumerator enum = list.GetEnumerator(); enum.MoveNext(); ) { x = enum.Current; ...; } and by converting a generator-method into a state machine that dynamically allocates and updates the state to simulate control flow, the two features can be matched together to make for a much more efficient implementation of this common pattern. For simplicity, I'll assume every IEnumerable has a Generate() method that directly executes the yield statements rather than transforming them into a state machine.

Let's start by lowering the following C#-ish code:

void baz(IEnumerable list, String header) {
    foreach (Object elem in list) {
        print(header);
        println(elem);
    }
    println("done");
}

to the following:

mark foreach : [ref] -> [];
void baz(IEnumerable list, String header) {
    mark foreach(ref elem) {
        print(header);
        println(elem);
    } within {
        list.Generate();
    }
    println("done");
}

Next let's lower the body of some generating code:

void GenerateInts(int x, int y) {
    for (int i = 0; i < x; i++)
         yield new Integer(i);
    for (int j = 0; j < y; j++)
         yield new Integer(j);
}

to the following:

void GenerateInts(int x, int y) {
    walk-stack {
        next-mark foreach {
            for (int i = 0; i < x; i++)
                 get-mark(new Integer(i));
            for (int j = 0; j < y; j++)
                 get-mark(new Integer(j));
        }
    }
}

So the foreach sets up a foreach mark on the stack within which it calls list.Generate(), which let's say eventually calls GenerateInts(5, 4). Then GenerateInts walks up the stack to find that (first) foreach mark, using get-mark to execute the foreach body with each yielded value. This pattern effectively implements the standard stack-sharing implementation of generators, again without needing to introduce any more primitives—without even references.

Stack-Allocated Closures

Let's review for a sec. We have this mechanism for walking the stack, which gets us a pointer into the stack. We have this mechanism for getting a mark, which gets a code pointer. The two together effectively make a stack-allocated closure, and get-mark is just calling that closure. (Technically, this is slightly different from a stack-allocated closure, but the analogy is close enough for our purposes.) Because the closure is allocated on the stack, rather than on the heap, we have to make sure our reference to it does not outlive the stack frame it sits upon. This is why this pair is not a first-class value and why get-mark is restricted to being used within a walk-stack and next-mark.

So stack walking and marks give us a way to go up and get stack-allocated closures, but it's also useful to be able to send stack-allocated closures (i.e. code pointer and into-stack pointer pair) down. I won't go into how this can be used to improve performance of higher-level languages, since that's still unpublished, but I will demonstrate how this can be used to optimize generators further and to help implement continuations.

First, let's extend functions with higher-order parameters so that one can write declarations like (higher-param $sac (i32 i32) (i32 i32)) after parameter declarations like (param $p i32) and before (result ...). These higher-order parameters are stack-allocated closures and consequently are invocable, let's say via higher-local.invoke $sac.

Using this, let's rewrite the lowering of GenerateInts to use stack-allocated closures:

void GenerateInts(int x, int y, void foreach(ref)) {
    for (int i = 0; i < x; i++)
        higher-local.invoke foreach (new Integer (i));
    for (int j = 0; j < y; j++)
        higher-local.invoke foreach (new Integer (j));
}

Notice that GenerateInts no longer has to walk the stack, which was the slowest part of its implementation before. It just invokes its higher-order parameter foreach. Note, though, that it does not treat foreach as a value (i.e. there's no higher-order.get), which guarantees the lifetime of foreach will not exceed the lifetime of the call to GenerateInts.

Next let's rewrite the lowering of baz to instead assume IEnumerable.Generate takes a stack-allocated closure:

void baz(IEnumerable list, String header) {
    higher-order.let foreach(ref elem) be {
        print(header);
        println(elem);
    } in {
        list.Generate(foreach);
    }
    println("done");
}

Here we declare a new higher-order closure using higher-order.let (it's unsound to have a higher-local.set) and hand it off to list.Generate—no need for a stack mark anymore. This doesn't involve any dynamic allocation despite the fact that foreach closes over the local parameter header because that local parameter is sitting on the stack frame and the lifetime of foreach is guaranteed to last only as long as the stack frame. In other words, the stack frame is its closure environment.

So we can implement generators even more efficiently with higher-order parameters and stack-allocated closures, but we can also combine first-class stacks with higher-order parameters to implement one-shot continuations.

First-Class Stacks

So far the mental model has been that there is one stack for the program, but it turns out that everything described works even if we have multiple stacks calling into (or really returning to) each other. Such functionality lets us implement all sorts of interesting things like lightweight threads or even continuations.

Stack Type

The stack type is stack (param ti*) (result to*), where ti* are the inputs the stack is expecting and to* are the outputs the stack will produce when/if it eventually returns. These stacks are just big state machines. We can run this state machine to its completion to get values returned, but it turns out to be much more useful for the state machine to pause on occasion, and this pausing is much more safely done if it's done voluntarily by the state machine rather than forcefully by some external entity.

Stack Allocation

To allocate a stack, one uses stack.alloc $f : [t*] -> [(stack (param ti*) (result to*))] where $f is a func (param ti* t*) (higher-param () (ti*)) (result to*). This creates a stack whose initial stack frame is the function $f along with some of its inputs and a special higher-order input. The higher-param with no inputs is to be used by $f to pause the newly allocated stack. stack.alloc initializes the portion of the stack frame corresponding to that higher-param to be (the pair of) the host-level code pointer for pausing a stack and the pointer to the freshly created stack. Since a non-executing stack is always awaiting ti* inputs, this "pause" will return back to $f the inputs it was given.

One thing to note is that the t* could include mutable references, which $f can use to externally expose changes to the stack's internal state as it desires. So while these stacks might seem like a big black box, whoever created the stack has the ability to open up as much of that black box as it wants. Of course, this all assumes we have a way to make the stack change.

Stack Stepping

After we allocate a stack, it's just sitting there. To make it actually do something, just use stack.step $l : [(stack (param ti*) (result to*)) ti*] -> [] where $l is a block from to*. This provides the stack with the ti* inputs and causes it to progress until either the "pause" higher-param is called from within the unknown enclosed $f, in which case stack.step returns normally, or until the unknown enclosed $f returns, in which case control is transferred to $l with the returned values on the stack.

(Note that, for simplicity, I'm putting aside the complication that every stack needs to be guarded by a lock so that two threads cannot step the same stack at the same time. Later on I'll show how to adjust the instructions to better accommodate that constraint.)
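
To make the mechanics concrete before the worker-thread example below, here is a minimal sketch of allocating and stepping a stack of type stack (param i32) (result i32); counter_body, demo, and the literal values are hypothetical, and I elide the bodies of the blocks passed to stack.step:

// counter_body : func (param i32) (higher-param () (i32)) (result i32)
i32 counter_body(i32 first, i32 pause()) {
    i32 total = first;
    while (total < 100) {
        total = total + pause(); // pause; the next stack.step supplies another i32
    }
    return total; // returning ends the stack, so stack.step's block runs
}

void demo() {
    (stack (param i32) (result i32)) counter = stack.alloc $counter_body;
    stack.step(counter, 3) {...};  // fills the awaited i32 with 3; counter_body pauses inside the loop
    stack.step(counter, 50) {...}; // resumes with 50; total is now 53, and counter_body pauses again
    stack.step(counter, 60) {...}; // total reaches 113, counter_body returns, and the block runs with 113
}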

Putting these together, we can implement lightweight cooperative worker threads:

mark yielder : [] -> [];
mark spawner : [$thread_state] -> [];
mark thread-locals : [] -> [$thread_state];
$result do_work($thread_state) {...} // might call yield, spawn, and get_thread_locals
$result join($result, $result) {...}
$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param) (result $result))>();
    stack_list.add(new_worker(work));
    while (!stack_list.is_empty()) {
        worker = stack_list.dequeue();
        mark spawner($thread_state more_work) {
            stack_list.add(new_worker(more_work));
        } within {
            stack.step(worker) { // block to run on return
                product = pop; // pop $result off stack
                init = join(init, product);
                continue;
            }
        }
        stack_list.enqueue(worker);
    }
    return init;
}
(stack (param) (result $result)) new_worker($thread_state work) {
    local.get $work;
    stack.alloc $worker_body;
    return;
}
$result worker_body($thread_state state, void pause()) {
    mark thread-locals() {
        local.get $state;
    } within {
        mark yielder() {
            pause();
        } within {
            return do_work(state);
        }
    }
}
void yield() { walk-stack { next-mark yielder { get-mark(); }}}
void spawn($thread_state work) { walk-stack { next-mark spawner { get-mark(work); }}}
$thread_state get_thread_locals() { walk-stack { next-mark thread-locals { return get-mark(); }}}

Here main maintains a queue of workers, with each worker being a stack. It repeatedly pops a stack off the queue, makes it take a step, requeues the stack if it doesn't finish, and aggregates the result of its work with the ongoing result if it completes. Before it forces the stack to take a step, it sets up a spawner stack mark to catch any stack walks done by a call to spawn within the worker stack, adding any provided workers to the queue.

The function new_worker's job is solely to create the stack, which it does by bundling the given $thread_state with the function worker_body, whose body is really where all the interesting stuff happens. Looking inside worker_body, we see that it sets up two stack marks. The thread-locals stack mark simply provides the state of the thread. The yielder stack mark invokes the provided pause higher-order parameter. This means that calls to yield() within do_work will walk up the stack, execute this mark, and thereby run pause, which stack.alloc has set up to cause the stack to pause. Note that, if the language compiling to wasm wants to support lightweight threads with very efficient yielding, i.e. not having to do a stack walk to find the "pause", then these primitives would also support another implementation strategy in which all compiled functions take a pause higher-order parameter and pass "pause" down to whomever they call. That is, we are not forcing a particular implementation strategy upon the language implementer; we are giving them the low-level primitives to choose their own.

Hopefully this illustrates how these low-level primitives combine together to support high-level patterns like algebraic effects. If the "event" of the algebraic effect has an input, the analog of worker_body would make its yielder mark update the $thread_state with that input. But we can also do a lot more than what algebraic effects support!
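
As a sketch of that last point: assuming a hypothetical emit effect that carries an i32 and a hypothetical last_emitted field on $thread_state, the analog of worker_body could record the effect's input before pausing:

mark emit : [i32] -> [];

$result effect_worker_body($thread_state state, void pause()) {
    mark emit(i32 payload) {
        state.last_emitted = payload; // expose the effect's input through the shared state
        pause();                      // suspend so the scheduler can react to it
    } within {
        return do_work(state);        // do_work performs the effect via perform_emit below
    }
}
void perform_emit(i32 payload) { walk-stack { next-mark emit { get-mark(payload); }}}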

Stack Extension

In order to allocate a stack (param ti*) (result to*) we allocated a partial stack frame for a function that pauses and otherwise converts ti* into to*. Once we have such a stack, if it's not executing then we know that its current stack frame is awaiting some ti*s. So we can extend the stack with another stack frame that will return ti*s. In order to preserve the type of the stack, this stack frame must itself be awaiting some ti*s. This suggests the following primitive:

stack.extend $f : [(stack (param ti*) (result to*)) t*] -> []

where $f is a func (param ti* t*) (higher-param () (ti*)) (result ti*).

This turns out to be surprisingly useful, especially for doing stack inspection. For example, suppose we wanted to combine our lightweight threads with our collect_roots implementation for program-managed garbage collection. The problem is that main has a lot of stacks representing worker threads that refer to root addresses that need to be collected. We can use stack extension to address this.

$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param) (result $result))>();
    mark gcaddroots() {
        foreach (worker in stack_list) {
            stack.extend(worker) $collect_worker_roots;
            stack-wall {
                stack.step(worker);
            }
        }
    } within {
        ... // same loop as before (maybe add call to gc_collect)
    }
    return init;
}
void collect_worker_roots(void pause()) {
    collect_roots();
    pause();
}

The revised main sets up its gcaddroots mark to go through each of the stacks in stack_list and collect their roots. It does so by adding a stack frame for collect_worker_roots onto each stack, stepping the stack to cause it to call collect_roots, which then walks that stack to run gcaddroots along it, and then to pause, thereby returning control back to main. The stack-wall construct is just an optimization that makes stack walks think they have hit the top of the stack at that point, so that the calls to collect_roots within the thread stacks don't each rewalk main's stack. stack-wall is also useful for security purposes, as it prevents callees from observing the marks on the caller's stack and even prevents observing the time it takes to walk the caller's stack.

Notice that the above example does a stack.extend and then a stack.step. There is actually one primitive that combines these two together and is in fact slightly stronger and lower-level than the combination:

stack.extend_and_step $f $l : [(stack (param ti*) (result to*)) t*] -> []

where $f is a func (param t*) (higher-param () (ti*)) (result ti*) and $l is a block from to*. As it sounds, this extends the stack with $f and steps the stack in one go. Notice that, unlike with stack.extend, $f does not take ti* as inputs, and that, unlike with stack.step, there need not be any ti* on the stack. In terms of this one primitive, stack.extend essentially adds a stack frame that first pauses to get more inputs and then calls the provided function, whereas stack.step essentially adds a stack frame, with the provided inputs, for a function that simply returns those inputs.
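
To make that relationship concrete, here is a rough sketch of how each could be encoded with stack.extend_and_step; the wrapper functions are hypothetical, $g stands for the function one would have passed to stack.extend, and the return blocks $l are elided:

// stack.extend $g on s with extra arguments t* is roughly
// stack.extend_and_step $extend_wrapper on s with those same arguments:
ti* extend_wrapper(t* extra, ti* pause()) {
    ti* inputs = pause();           // pause immediately, so the combined step makes no progress
    return g(inputs, extra, pause); // once stepped with real inputs, run $g as stack.extend would
}

// stack.step on s with inputs ti* is roughly
// stack.extend_and_step $step_wrapper on s with t* = ti*:
ti* step_wrapper(ti* inputs, ti* pause()) {
    return inputs; // immediately hand the inputs to the awaiting frame below, which resumes
}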

What is particularly useful about stack.extend_and_step, besides being more true to what actually happens at the low level, is that it gives you a way to step the stack without providing the expected inputs for the stack, as in the following variant where the worker threads each expect an i32.

$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param i32) (result $result))>();
    mark gcaddroots() {
        foreach (worker in stack_list) {
            stack-wall {
                stack.extend_and_step(worker) $collect_worker_roots; // no i32 provided
            }
        }
    } within {
        ... // same loop as before (maybe add call to gc_collect)
    }
    return init;
}
i32 collect_worker_roots(i32 pause()) {
    collect_roots();
    return pause();
}

Stack Running

In both stack allocation and extension we were careful to make sure no local stack state could be caught in the first-class stack. The reason is that even if we were to immediately step the first-class stack, it could pause and outlive the local stack frame. But, although pausing is the key feature of first-class stacks, sometimes one reaches a point where they simply want to run a first-class stack to completion, disabling pausing. This effectively fuses (the lifetime of) the first-class stack with (that of) the local stack, and as such we can safely permit local state to be infused into the first-class stack.

We achieve this functionality with the following primitive:

stack.run {
    ... // code to run on the end of the stack, returning ti* into the stack
} pausing {
    ... // code to run whenever the stack would have paused, returning ti* back into the stack
} : [(stack (param ti*) (result to*))] -> [to*]

This conceptually mounts the given first-class stack on the local stack, extends the first-class stack with the frame for the main body of stack.run, and then resumes the first-class stack, executing the pausing clause whenever the first-class stack would have paused and eventually returning the result values of the first-class stack. The pausing clause is most straightforwardly implemented by allocating space within each stack for a stack-allocated closure that is initially null; the stack-allocated closure originally handed out as pause first checks that slot and, if it is no longer null, defers to it.

We can use this in our thread-workers example to forcibly abort threads. That is, suppose it is possible for the main loop to recognize early that the $result has already been determined, e.g. because it's computing a big conjunction and some worker has resulted in false, making it unnecessary to continue executing the remaining threads. Those threads might still be holding onto resources, though, so it is important to clean them up. We can do so by adding the following after the main loop:

foreach (worker in stack_list) {
    escape $target [unwinder] {
        stack.run(worker) {
            escape-to $target;
        } pausing {
            escape-to $target;
        }
    } hatch {}
}

Stack Acquiring

Earlier I deferred the issue that first-class stacks need to have associated locks to prevent someone from, say, trying to extend a stack with a frame while another program is stepping the stack. We could have every stack instruction acquire and release this lock, but that seems inefficient and doesn't address the fact that one thread might want to guarantee that a number of operations happen in direct succession. So instead we add the construct stack.acquire($s := expr) {...}. This pops off the result of expr, which should be a stack, acquires the lock on that stack, binds a local-variable-of-sorts $s to that stack, executes the ... body within which $s is in scope, and then releases the lock on the stack. Its input and output types are the same as those of the body ... plus an additional stack input.

Now $s is not really a local variable. It simply names a way to refer to the stack.acquire. So the second thing we do is modify all of the above instructions to refer to $s rather than take a stack as input. That way these instructions do not need to acquire/release the stack's lock because they know that the stack is already acquired.

Let's illustrate this new construct by showing how we can use it to append an entire stack, not just a single stack frame, onto a stack:

void stack_append((stack (param ti*) (result to*)) s, (stack (param ti*) (result ti*)) sapp) {
    stack.acquire($s := s) {
        local.get sapp;
        stack.extend_and_step $s $stack_append_helper;
    }
}
ti* stack_append_helper((stack (param ti*) (result ti*)) sapp, ti* pause()) {
    stack.acquire($sapp := sapp) {
        stack.run $sapp {
            pause();
        } pausing {
            pause();
        }
    }
}

There is a lot going on here. First, stack_append acquires the lock on s, the stack to be appended to. Then it extends s with the stack frame for stack_append_helper, capturing sapp as the argument to the corresponding parameter, which it then steps into, effectively starting the execution of stack_append_helper. This simply acquires the lock on sapp and then runs sapp but immediately pauses execution to get the values to provide to sapp, restoring control to stack_append. The acquire on s then completes, releasing the lock, and stack_append returns. So by the end of this, no one is holding the lock on the stack s, but the stack frame added onto s is holding the lock on sapp. Furthermore, the next time someone tries to step s, the value will be handed to sapp instead, which will either run to completion or, due to the pausing clause, cause s to pause whenever sapp would have paused.

Altogether, semantically speaking after stack_append completes it is as if the entirety of sapp has been mounted onto s and locked into place, combining the two into one. Furthermore, one design choice I made in the above primitives is that this is completely equivalent to permanently acquiring the lock on sapp and then directly copying each of the stack frames on sapp onto s one by one. In other words, unless you were the one to create a stack boundary, you cannot observe a stack boundary.

(Side note: you can use the technique above to also compose stacks of types [ti*] -> [t*] and [t*] -> [to*] into a stack of type [ti*] -> [to*], just like you can do with closures.)

Stack Duplication

Lastly, for stack duplication we mainly just add the instruction stack.duplicate $s : [] -> [(stack (param ti*) (result to*))] where $s references a stack.acquire of a stack (param ti*) (result to*). This primarily just duplicates the stack via memcpy. However, there is one complication.

The stack being duplicated can be in the midst of a step or run of another stack, in which case it has acquired the lock on that stack. We cannot have two stacks holding onto the same lock. So we have to duplicate that stack as well. In other words, while a stack is stepping or running another stack, the two are temporarily conceptually one stack and so must both be duplicated.

As a convention, after duplicating a stack, one could walk the duplicate and run each mark duplicator : [] -> [] on it. These duplicator marks could update the stack frame to duplicate relevant resources (or trap if impossible) so that the duplicate is reasonably independent of the original. For example, the stack could be in the middle of a stack.acquire on the stack held in some local variable $x, and the duplicator may want to update $x to reference the duplicated stack. For this reason, we also add stack.get $s that returns the stack currently acquired by the referenced stack.acquire, which might not be the stack value that was first passed to the stack.acquire because a stack.duplicate might have happened. It is an odd corner case, but from what I can tell this works out and provides the right primitives for supporting multi-shot continuations.
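
As a sketch of that duplication convention, reusing the collect_worker_roots pattern from above (duplicate_worker and run_duplicators are hypothetical, and I assume a worker stack with no awaited inputs):

mark duplicator : [] -> [];

(stack (param) (result $result)) duplicate_worker((stack (param) (result $result)) worker) {
    (stack (param) (result $result)) dup;
    stack.acquire($s := worker) {
        dup = stack.duplicate $s; // memcpy-style copy of the acquired stack
    }
    stack.acquire($d := dup) {
        stack-wall {
            stack.extend_and_step $d $run_duplicators; // walk the copy, firing its duplicator marks
        }
    }
    return dup;
}
void run_duplicators(void pause()) {
    walk-stack {
        while (true) {
            next-mark duplicator {
                get-mark;
            } none {
                break;
            }
        }
    }
    pause(); // hand control back to duplicate_worker once every duplicator mark has run
}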

Wrapping Up

So what do y'all think of these primitives? Obviously this is a lot, but we don't have to add them all at once. In particular, the ones about first-class stacks don't even make sense to add until after gc, since stacks will need to be memory managed. But I thought y'all would appreciate seeing a roadmap for this functionality and how it all fits together.
