This repository was archived by the owner on Apr 25, 2025. It is now read-only.

Primer on low-level stack and control-flow primitives #105

@RossTate

During all this conversation, I've been reading up on the various stack and control primitives out there. These are the low-level primitives that exceptions, stack tracing, gc-root marking, generators, resumable exceptions, and continuation delimiting can be and often are implemented with. They have been found to often work well across module boundaries and even across language boundaries (or at least to facilitate cross-boundary coordination). Given their low-level nature, they don't require introducing any new value types. So I thought I'd review them here for y'all and attempt to illustrate how they'd work in and contribute to WebAssembly.

Simple Stack Marking

Let's start with helping languages manage their own memory. One key part of garbage collection is finding all the roots, i.e. all the garbage-collectable references currently on the stack. Yes you could do this by maintaining your own list, but this list gets changed a lot, and it's only needed relatively infrequently—only when the garbage collector runs. So that's not particularly efficient. What you really want is a way to walk the current stack and mark all the references on it on demand. (Here I'm assuming that code occasionally calls gc_collect, explicitly handing over control, rather than the gc firing at arbitrary times.)

We'll do this using a stack mark:
mark gcroot : [] -> [i32]
This declares a new kind of stack mark called gcroot that provides an i32 (with no input needed), which conceptually is a root on the stack. This declaration would be up in the events section, not at the instruction level.

Next we'll actually mark the stack:

void foo() {
    i32 newref = gcalloc(12);
    with-value-mark (gcroot newref) {
        ... // do stuff with this mark on the stack
    }
}

Here we allocate some garbage-collected memory and assign the address to newref. Now this address is sitting on foo's stack frame, and so we need to make sure its referenced memory isn't cleaned up until foo returns. The with-value-mark (gcroot newref) {...} construct places a gcroot mark with value newref on the stack (often using code regions to indicate which instructions are marked, imposing little to no run-time overhead) and keeps it there until the ... in the body is exited. In particular, any call to gc_collect while that ... body is still executing will see the value of newref on the stack as a gcroot mark. Which leaves the question, how does gc_collect observe that mark?

Stack Walking

Now let's dive into gc_collect, or really its helper function collect_roots:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcroot {
                add_reachable(get-mark);
            } none {
                break;
            }
        }
    }
}

Here we have three new constructs: walk-stack, next-mark, and get-mark.
The construct next-mark is only usable within walk-stack, and the construct get-mark is only usable within next-mark.

What walk-stack does is store a pointer into the current stack, starting with the present point in the stack, indicating where we currently are in the stack walk. next-mark then walks up the stack looking for marks. Here it has just one mark type specified, gcroot, but in general there could be many mark types specified. Once next-mark finds a mark of one of the specified types, it updates the pointer into the stack, and then it enters the body corresponding to that tag. Within that body, whenever get-mark is executed it returns the payload of that mark onto the stack. If at some point next-mark is executed and unable to find a matching mark, it executes the none body.

So collect_roots effectively iterates through all the gcroot marks on the stack and calls add_reachable on each of the respective i32 addresses, returning when the top of the stack is reached.

Advanced Stack Marking

Now this is a little inefficient. Often a function references many roots at a time, and this will process them one by one. So let's optimize this example, first by understanding how with-value-mark is actually a shorthand.

Remember that the declaration mark gcroot : [] -> [i32] looks a lot like an input-output operation, not just a value. That's because it is. with-value-mark is just shorthand for "whenever someone does get-mark on this mark, provide this value".

So with that in mind, let's expand foo above:

void foo() {
    i32 newref = gcalloc(12);
    mark gcroot {
        push newref;
    } within {
        ... // do stuff with this mark on the stack
    }
}

Here we see that the body of mark gcroot { push newref; } is just a block of type [] -> [i32], matching the corresponding declaration. Whenever get-mark is called, it hands the current pointer into the stack to this block so that the block can access its stack frame while executing on top of the overall stack (rather than creating a new stack). Once this block terminates, its frame is popped off the stack. So this gives us a way to temporarily execute blocks from stack frames up the stack.

So let's take advantage of that and redesign our mark to be gcaddroots : [] -> [], and consider a new function foo2 with multiple roots:

void foo2() {
    i32 newref1 = gcalloc(12);
    i32 newref2 = gcalloc(24);
    mark gcaddroots {
        add_reachable(newref1);
        add_reachable(newref2);
    } within {
        ... // do stuff with this mark on the stack
    }
}

Here gcaddroots directly takes care of adding both references to the reachable set. This in turn makes collect_roots simpler:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcaddroots {
                get-mark;
            } none {
                break;
            }
        }
    }
}

So with lower-level stack primitives and control-flow primitives, we can enable much better support for manual memory management. Hopefully it's clear how these same primitives could enable one to implement get_current_stack_trace() and even to place stack-trace marks that indicate how WebAssembly code corresponds to source code, so that the result of get_current_stack_trace() is something that a non-wasm expert could actually use to debug their source code.
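
For instance, here is a minimal sketch of what get_current_stack_trace() could look like on top of these primitives, assuming a hypothetical stack_trace mark whose payload is a string describing the source location of the marked code (the mark name and the formatting are made up for illustration):

mark stack_trace : [] -> [string]

string get_current_stack_trace() {
    string trace = "";
    walk-stack {
        while (true) {
            next-mark stack_trace {
                trace = trace + get-mark + "\n"; // append this frame's source location
            } none {
                break;
            }
        }
    }
    return trace;
}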

Stack Unwinding

Now for the second set of stack primitives, used to implement stack unwinding. Stack unwinding is the process of cleaning up state as one pops frames off the stack. This cleanup takes the form of destructors in C++ and finally clauses in many other languages. The most familiar instigators of cleanup are function returns and thrown exceptions. As we'll see, exceptions are really many primitives put together. I'll use C# because it makes it easiest to see how to break exceptions down into these primitives.

Escaping

Consider the following C# function:

int bar() {
    try {
        return some_func();
    } catch {
        return uhoh();
    }
}

It uses the unqualified catch clause to catch both C++ and C# exceptions (at the cost of not being able to inspect the exception).

Here's how we would compile this to lower primitives (ignoring C++'s nested rethrow for now):

mark cpp_handler : [i32, i32] -> [] // rtti and value
mark csharp_handler : [csharpexn] -> []

int bar() {
    escape $target {
        mark cpp_handler(i32 rtti, i32 val) {
            escape-to $target;
        } within {
            mark csharp_handler(csharpexn exn) {
                escape-to $target;
            } within {
                return some_func();
            }
        }
    } hatch {
        return uhoh();
    }
}

void cpp_throw(i32 rtti, i32 val) {
    walk-stack {
        while (true) {
            next-mark cpp_handler {
                get-mark(rtti, val);
            }
        }
    }
}

void csharp_throw(csharpexn exn) {
    walk-stack {
        while (true) {
            next-mark csharp_handler {
                get-mark(exn);
            }
        }
    }
}

Here we introduce two new primitives: escape-hatch and escape-to. The body of escape $target executes normally unless/until an escape-to $target is reached, at which point all of the stack between those two points is unwound (starting from escape-to) and then redirected to the hatch clause. This in particular means the hatch clause is never executed if escape-to $target is never executed. It also means that the wasm program has control over when the stack is unwound and at the same time guarantees nothing is ever executed on an invalid stack.

If we look at the implementations of cpp_throw/csharp_throw, we see they work by walking up the stack until they find a handler. Note that cpp_throw/csharp_throw do not cause the stack to be unwound. Instead, as you can see in the lowering of bar, the handler is responsible for this, which is important for implementing more flexible exceptions (discussed next). If the handler doesn't do this, then cpp_throw/csharp_throw continue on to the next handler until they reach the top of the stack. Notice that next-mark doesn't have a none case here, which means cpp_throw/csharp_throw trap if they reach the top of the stack. Conveniently, all of the stack is still intact, so a debugger can kick in to let the programmer inspect the intact state of the program and cause of the exception, or the host can collect the stack-trace marks to report the trace of the exception.

Handing Off State

Okay, before we get into filtered exceptions, let's first illustrate how to hand off state to the escape-hatch:

int bar2() {
    try {
        return some_func();
    } catch (Exception e) {
        return uhoh(e);
    }
}

lowers to

int bar2() {
    escape $target {
        mark cpp_handler(i32 rtti, i32 val) {
            if (RuntimeCompatibility.WrapNonExceptionThrows)
                escape-to(new RuntimeWrappedException(rtti, val))  $target;
        } within {
            mark csharp_handler(csharpexn exn) {
                escape-to(exn) $target;
            } within {
                return some_func();
            }
        }
    } hatch (csharpexn e) {
        return uhoh(e);
    }
}

Here we see that escape-to can hand off values to the hatch. We also see a case of a handler not escaping in cpp_handler. There's a flag in C# (nowadays set to true by default) that makes catch (Exception e) even catch (wrapped) C++ exceptions. If that flag is set, then the cpp_handler wraps the exception and forwards it to the hatch. If not, then the cpp_handler does nothing so that cpp_throw continues on to a "real" cpp_handler.

Filtering Exceptions

Because we have given the wasm program more control, without any more changes we can support C#'s filtered exceptions. Consider the following example:

class MyException : Exception { int type; }
int bar3(int i) {
    try {
        return some_func();
    } catch (MyException e) when (e.type == i) {
        return uhoh(e);
    }
}

The when clause indicates that the exception should only be caught when that condition evaluates to true. This clause can be stateful and so is specified to be evaluated before any finally clauses or destructors. [Here](https://thomaslevesque.com/2015/06/21/exception-filters-in-c-6/)'s a blog post on why this is useful. (Hint: better debugging support is one reason.) As a consequence, we lower this program to the following:

int bar3(int i) {
    escape $target {
        mark csharp_handler(csharpexn exn) {
            if (exn is myexception && exn.type == i)
                escape-to(exn) $target;
        } within {
            return some_func();
        } 
    } hatch (myexception e) {
        return uhoh(e);
    }
}

Notice that no unwinding is done if the condition fails. We also didn't have to change csharp_throw to support any of this; csharp_throw just walks up the stack firing off potential exception handlers until one unwinds the stack (just like what a standard VM's implementation of exceptions does).

Stack Conventions

I say that we unwind the stack, but what does that mean? As a low-level primitive, it just means move the stack pointer and reclaim the stack. Technically that is it. But conventionally stack unwinding also means executing finally clauses and C++ destructors. Rather than baking in such a convention, let us modify escape/hatch to directly support it:

mark unwinder : [] -> [];

int bar4(int i) {
    escape $target {
        mark csharp_handler(csharpexn exn) {
            if (exn is myexception && exn.type == i)
                escape-to(exn) $target;
        } within {
            return some_func();
        } 
    } unwind unwinder {
        get-mark;
    } hatch (myexception e) {
        return uhoh(e);
    }
}

We have added an unwind clause to bar4. This forces escapes to $target to walk the stack as it is unwound, now frame by frame rather than all at once. Whenever the given mark is encountered, in this case unwinder, the corresponding unwind clause is executed (while the stack frame for the mark is still intact). If the escape-to clause provides a value of type t to pass to hatch, then the unwind clause technically has type [t] -> [t], giving it the chance to update this value, which can be useful in particular for collecting stack-trace marks while the stack is unwound. In the case of bar4, we don't care about updating the exception e, so we just use get-mark to execute the mark. In this way, the escape in bar4 effectively calls all unwinder marks as it unwinds the stack.

Another lesser-known convention exists in languages with more complex control flow. Consider how a stack walk essentially maintains the stack but shifts the control focus of the program up the stack. Sometimes it is useful for function calls on the stack to be able to track this focus. For example, a C++ program maintains its own stack in addition to the wasm stack, and so it might be helpful to also track what point within that separate stack corresponds to the current focal point of the stack walk.

Let's modify our garbage-collection helper collect_roots to permit such a convention:

void collect_roots() {
    walk-stack {
        while (true) {
            next-mark gcaddroots {
                get-mark;
            } focus-out {
                get-mark;
            } none {
                break;
            }
        }
        while (true) {
            prev-mark focus-in {
                get-mark;
            } none {
                break;
            }
        }
    }
}

This modification invokes focus-out marks on the way out so that they can observe that the stack walk has moved out of that part of the stack. After the walk reaches the top of the stack, it then walks back and invokes focus-in marks to conceptually revert any changes that were made by the focus-out marks.

Especially with these conventions, this pattern of simply invoking marks as one passes them is extremely common. So let's use walk-stack [next-pass*] [prev-pass*] and escape [unwind-pass*] as shorthand for simply invoking get-mark whenever a next-mark/prev-mark/unwind passes a mark in the respective lists. This shorthand lets us abbreviate collect_roots to just the following:

void collect_roots() {
    walk-stack [gcaddroots, focus-out] [focus-in] {
        next-mark none {}
        prev-mark none {}
    }
}

Nested Rethrow

At this point we have all the core functionality needed to implement a variety of features. I need to get to bed, so I'll edit this with some more non-exception examples later, but first let me illustrate how this supports C++'s nested rethrow. Actually, I'll show it supports Python's nested rethrow because, unlike C++, Python actually has a stack-tracing convention that's worth demonstrating.

To see this convention, if you run the following program:

def failure():
    raise ValueError, "Failure"
def refailure():
    raise
def fail():
    try:
        failure()
    except ValueError:
        refailure()
fail()

then you will get the following stack trace:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    fail()
  File "main.py", line 9, in fail
    refailure()
  File "main.py", line 7, in fail
    failure()
  File "main.py", line 2, in failure
    raise ValueError, "Failure"
ValueError: Failure

Notice that both failure and refailure are in the trace. This illustrates that Python builds the stack trace as it goes. Note: if you add prints, you'll also see that the stack for failure is unwound before refailure executes.

Here is how we can compile this (assuming for simplicity that all Python exceptions are ValueErrors and all we care about is the trace):

mark code_line : [] -> [string, i32, string, string]
mark python_handler : [pyref, string] -> []
mark python_except : [] -> [pyref, string]
void fail() {
    escape $target {
        mark python_handler(pyref exn, string trace_so_far) {
            if (exn == ValueError)
                escape-to(exn, trace_so_far) $target;
        } within {
            with-value-mark (code_line "main.py", 7, "fail", "failure()") {
                failure();
            }
        }
    } hatch(pyref exn, string trace_so_far) {
        with-value-mark (python_except exn, trace_so_far) {
            with-value-mark (code_line "main.py", 9, "fail", "refailure()") {
                refailure();
            }
        }
    }
}
void raise_with_trace(pyref exn, string trace) {
    walk-stack {
        while (true) {
            next-mark python_handler {
                get-mark(exn, trace);
            } code_line {
                [string file, i32 line, string body, string code] = get-mark;
                trace = file + line + body + code + trace (with formatting);
            }
        }
    }
}
void raise_new(pyref exn) { raise_with_trace(exn, ""); }
void reraise() {
    pyref exn;
    string trace;
    walk-stack {
        next-mark python_except {
            exn, trace = get-mark;
        }
    }
    raise_with_trace(exn, trace);
}

There are three things to note here. First, raise_with_trace also observes the code_line marks in order to collect the stack trace as it walks up the stack to find a handler (which is then handed the stack trace up to that point). Second, in the lowering of fail the call to refailure is executed within a python_except mark. Third, the reraise function implementing Python's rethrow walks up the stack looking for the nearest python_except mark, which provides the most recently caught exception. Altogether we get an implementation of Python's nested-reraise construct with its own semantics for stack tracing (which differs from both Java's and C#'s), and we didn't have to add anything new for it. Of course, the same technique can be adapted for C++'s nested-rethrow construct.

Generators

A number of languages feature generators. There are a few variations on this concept, so I'll focus on C# and Python generators. These have a foreach (x in list) {...} construct that executes its body for each element "in" the list, and a generator-method construct in which yield value can be used to conveniently yield the elements "in" the list. Although foreach is often implemented by translating to for (IEnumerator enum = list.GetEnumerator(); enum.MoveNext(); ) { x = enum.Current; ...; } and by converting a generator-method into a state machine that dynamically allocates and updates the state to simulate control flow, the two features can be matched together to make for a much more efficient implementation of this common pattern. For simplicity, I'll assume every IEnumerable has a Generate() method that directly executes the yield statements rather than transforming them into a state machine.

Let's start by lowering the following C#-ish code:

void baz(IEnumerable list, String header) {
    foreach (Object elem in list) {
        print(header);
        println(elem);
    }
    println("done");
}

to the following:

mark foreach : [ref] -> [];
void baz(IEnumerable list, String header) {
    mark foreach(ref elem) {
        print(header);
        println(elem);
    } within {
        list.Generate();
    }
    println("done");
}

Next let's lower the body of some generating code:

void GenerateInts(int x, int y) {
    for (int i = 0; i < x; i++)
         yield new Integer(i);
    for (int j = 0; j < y; j++)
         yield new Integer(j);
}

to the following:

void GenerateInts(int x, int y) {
    walk-stack {
        next-mark foreach {
            for (int i = 0; i < x; i++)
                 get-mark(new Integer(i));
            for (int j = 0; j < y; j++)
                 get-mark(new Integer(j));
        }
    }
}

So the foreach sets up a foreach mark on the stack within which it calls list.Generate(), which let's say eventually calls GenerateInts(5, 4). Then GenerateInts walks up the stack to find that (first) foreach mark, using get-mark to execute the foreach body with each yielded value. This pattern effectively implements the standard stack-sharing implementation of generators, again without needing to introduce any more primitives—without even references.

Stack-Allocated Closures

Let's review for a sec. We have this mechanism for walking the stack, which gets us a pointer into the stack. We have this mechanism for getting a mark, which gets a code pointer. The two together effectively make a stack-allocated closure, and get-mark is just calling that closure. (Technically, this is slightly different from a stack-allocated closure, but the analogy is close enough for our purposes.) Because the closure is allocated on the stack, rather than on the heap, we have to make sure our reference to it does not outlive the stack frame it sits upon. This is why this pair is not a first-class value and why get-mark is restricted to being used within a walk-stack and next-mark.

So stack walking and marks give us a way to go up and get stack-allocated closures, but it's also useful to be able to send stack-allocated closures (i.e. code pointer and into-stack pointer pair) down. I won't go into how this can be used to improve performance of higher-level languages, since that's still unpublished, but I will demonstrate how this can be used to optimize generators further and to help implement continuations.

First, let's extend functions with higher-order parameters so that one can write declarations like (higher-param $sac (i32 i32) (i32 i32)) after parameter declarations like (param $p i32) and before (result ...). These higher-order parameters are stack-allocated closures and consequently are invocable, let's say via higher-local.invoke $sac.

Using this, let's rewrite the lowering of GenerateInts to use stack-allocated closures:

void GenerateInts(int x, int y, void foreach(ref)) {
    for (int i = 0; i < x; i++)
        higher-local.invoke foreach (new Integer (i));
    for (int j = 0; j < y; j++)
        higher-local.invoke foreach (new Integer (j));
}

Notice that GenerateInts no longer has to walk the stack, which was the slowest part of its implementation before. It just invokes its higher-order parameter foreach. Note, though, that it does not treat foreach as a value (i.e. there's no higher-order.get), which guarantees the lifetime of foreach will not exceed the lifetime of the call to GenerateInts.

Next let's rewrite the lowering of baz to instead assume IEnumerable.Generate takes a stack-allocated closure:

void baz(IEnumerable list, String header) {
    higher-order.let foreach(ref elem) be {
        print(header);
        println(elem);
    } in {
        list.Generate(foreach);
    }
    println("done");
}

Here we declare a new higher-order closure using higher-order.let (it's unsound to have a higher-local.set) and hand it off to list.Generate—no need for a stack mark anymore. This doesn't involve any dynamic allocation despite the fact that foreach closes over the local parameter header because that local parameter is sitting on the stack frame and the lifetime of foreach is guaranteed to last only as long as the stack frame. In other words, the stack frame is its closure environment.

So we can implement generators even more efficiently with higher-order parameters and stack-allocated closures, but we can also combine first-class stacks with higher-order parameters to implement one-shot continuations.

First-Class Stacks

So far the mental model has been that there is one stack for the program, but it turns out that everything described works even if we have multiple stacks calling into (or really returning to) each other. Such functionality lets us implement all sorts of interesting things like lightweight threads or even continuations.

Stack Type

The stack type is stack (param ti*) (result to*), where ti* are the inputs the stack is expecting and to* are the outputs the stack will produce when/if it eventually returns. These stacks are just big state machines. We can run this state machine to its completion to get values returned, but it turns out to be much more useful for the state machine to pause on occasion, and this pausing is much more safely done if it's done voluntarily by the state machine rather than forcefully by some external entity.

Stack Allocation

To allocate a stack, one uses stack.alloc $f : [t*] -> [(stack (param ti*) (result to*))] where $f is a func (param ti* t*) (higher-param () (ti*)) (result to*). This creates a stack whose initial stack frame is the function $f along with some of its inputs and a special higher-order input. The higher-param with no inputs is to be used by $f to pause the newly allocated stack. stack.alloc initializes the portion of the stack frame corresponding to that higher-param to be (the pair of) the host-level code pointer for pausing a stack and the pointer to the freshly created stack. Since a non-executing stack is always awaiting ti* inputs, this "pause" will return back to $f the inputs it was given.

One thing to note is that the t* could include mutable references, which $f can use to externally expose changes to the stack's internal state as it desires. So while these stacks might seem like a big black box, whoever created the stack has the ability to open up as much of that black box as it wants. Of course, this all assumes we have a way to make the stack change.

Stack Stepping

After we allocate a stack, it's just sitting there. To make it actually do something, just use stack.step $l : [(stack (param ti*) (result to*)) ti*] -> [] where $l is a block from to*. This provides the stack with the ti* inputs and causes it to progress until either the "pause" higher-param is called from within the unknown enclosed $f, in which case stack.step returns normally, or until the unknown enclosed $f returns, in which case control is transferred to $l with the returned values on the stack.

(Note that, for simplicity, I'm putting aside the complication that every stack needs to be guarded by a lock so that two threads cannot step the same stack at the same time. Later on I'll show how to adjust the instructions to better accommodate that constraint.)
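
To make the mechanics concrete before the worker-thread example below, here is a minimal sketch of allocating and stepping a stack of type stack (param i32) (result i32); counter_body, demo, and the literal values are hypothetical, and I elide the bodies of the blocks passed to stack.step:

// counter_body : func (param i32) (higher-param () (i32)) (result i32)
i32 counter_body(i32 first, i32 pause()) {
    i32 total = first;
    while (total < 100) {
        total = total + pause(); // pause; the next stack.step supplies another i32
    }
    return total; // returning ends the stack, so stack.step's block runs
}

void demo() {
    (stack (param i32) (result i32)) counter = stack.alloc $counter_body;
    stack.step(counter, 3) {...};  // fills the awaited i32 with 3; counter_body pauses inside the loop
    stack.step(counter, 50) {...}; // resumes with 50; total is now 53, and counter_body pauses again
    stack.step(counter, 60) {...}; // total reaches 113, counter_body returns, and the block runs with 113
}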

Putting these together, we can implement lightweight cooperative worker threads:

mark yielder : [] -> [];
mark spawner : [$thread_state] -> [];
mark thread-locals : [] -> [$thread_state];
$result do_work($thread_state) {...} // might call yield, spawn, and get_thread_locals
$result join($result, $result) {...}
$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param) (result $result))>();
    stack_list.add(new_worker(work));
    while (!stack_list.is_empty()) {
        worker = stack_list.dequeue();
        mark spawner($thread_state more_work) {
            stack_list.add(new_worker(more_work));
        } within {
            stack.step(worker) { // block to run on return
                product = pop; // pop $result off stack
                init = join(init, product);
                continue;
            }
        }
        stack_list.enqueue(worker);
    }
    return init;
}
(stack (param) (result $result)) new_worker($thread_state work) {
    local.get $work;
    stack.alloc $worker_body;
    return;
}
$result worker_body($thread_state state, void pause()) {
    mark thread-locals() {
        local.get $state;
    } within {
        mark yielder() {
            pause();
        } within {
            return do_work(state);
        }
    }
}
void yield() { walk-stack { next-mark yielder { get-mark(); }}}
void spawn($thread_state work) { walk-stack { next-mark spawner { get-mark(work); }}}
$thread_state get_thread_locals() { walk-stack { next-mark thread-locals { return get-mark(); }}}

Here main maintains a queue of workers, with each worker being a stack. It repeatedly pops a stack off the queue, makes it take a step, requeues the stack if it doesn't finish, and aggregates the result of its work with the ongoing result if it completes. Before it forces the stack to take a step, it sets up a spawner stack mark to catch any stack walks done by a call to spawn within the worker stack, adding any provided workers to the queue.

The function new_worker's job is solely to create the stack, which it does by bundling the given $thread_state with the function worker_body, whose body is really where all the interesting stuff happens. Looking inside worker_body, we see that it sets up two stack marks. The thread-locals stack mark simply provides the state of the thread. The yielder stack mark invokes the provided pause higher-order parameter. This means that calls to yield() within do_work will walk up the stack, execute this mark, and thereby run pause, which stack.alloc has set up to cause the stack to pause. Note that, if the language compiling to wasm wants to support lightweight threads with very efficient yielding, i.e. not having to do a stack walk to find the "pause", then these primitives would also support another implementation strategy in which all compiled functions take a pause higher-order parameter and pass "pause" down to whomever they call. That is, we are not forcing a particular implementation strategy upon the language implementer; we are giving them the low-level primitives to choose their own.

Hopefully this illustrates how these low-level primitives combine together to support high-level patterns like algebraic effects. If the "event" of the algebraic effect has an input, the analog of worker_body would make its yielder mark update the $thread_state with that input. But we can also do a lot more than what algebraic effects support!
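
As a sketch of that last point: assuming a hypothetical emit effect that carries an i32 and a hypothetical last_emitted field on $thread_state, the analog of worker_body could record the effect's input before pausing:

mark emit : [i32] -> [];

$result effect_worker_body($thread_state state, void pause()) {
    mark emit(i32 payload) {
        state.last_emitted = payload; // expose the effect's input through the shared state
        pause();                      // suspend so the scheduler can react to it
    } within {
        return do_work(state);        // do_work performs the effect via perform_emit below
    }
}
void perform_emit(i32 payload) { walk-stack { next-mark emit { get-mark(payload); }}}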

Stack Extension

In order to allocate a stack (param ti*) (result to*) we allocated a partial stack frame for a function that pauses and otherwise converts ti* into to*. Once we have such a stack, if it's not executing then we know that its current stack frame is awaiting some ti*s. So we can extend the stack with another stack frame that will return ti*s. In order to preserve the type of the stack, this stack frame must itself be awaiting some ti*s. This suggests the following primitive:

stack.extend $f : [(stack (param ti*) (result to*)) t*] -> []

where $f is a func (param ti* t*) (higher-param () (ti*)) (result ti*).

This turns out to be surprisingly useful, especially for doing stack inspection. For example, suppose we wanted to combine our lightweight threads with our collect_roots implementation for program-managed garbage collection. The problem is that main has a lot of stacks representing worker threads that refer to root addresses that need to be collected. We can use stack extension to address this.

$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param) (result $result))>();
    mark gcaddroots() {
        foreach (worker in stack_list) {
            stack.extend(worker) $collect_worker_roots;
            stack-wall {
                stack.step(worker);
            }
        }
    } within {
        ... // same loop as before (maybe add call to gc_collect)
    }
    return init;
}
void collect_worker_roots(void pause()) {
    collect_roots();
    pause();
}

The revised main sets up its gcaddroots mark to go through each of the stacks in stack_list and collect their roots. It does so by adding a stack frame for collect_worker_roots onto each stack, stepping the stack to cause it to call collect_roots, which then walks that stack to run gcaddroots along it, and then to pause, thereby returning control back to main. The stack-wall construct is just an optimization that makes stack walks think they have hit the top of the stack at that point, so that the calls to collect_roots within the thread stacks don't each rewalk main's stack. stack-wall is also useful for security purposes, as it prevents callees from observing the marks on the caller's stack and even prevents observing the time it takes to walk the caller's stack.

Notice that the above example does a stack.extend and then a stack.step. There is actually one primitive that combines these two together and is in fact slightly stronger and lower-level than the combination:

stack.extend_and_step $f $l : [(stack (param ti*) (result to*)) t*] -> []

where $f is a func (param t*) (higher-param () (ti*)) (result ti*) and $l is a block from to*. As it sounds, this extends the stack with $f and steps the stack in one go. Notice that, unlike with stack.extend, $f does not take ti* as inputs, and that, unlike with stack.step, there need not be any ti* on the stack. In terms of this one primitive, stack.extend essentially adds a stack frame that first pauses to get more inputs and then calls the provided function, whereas stack.step essentially adds a stack frame, with the provided inputs, for a function that simply returns those inputs.
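
To make that relationship concrete, here is a rough sketch of how each could be encoded with stack.extend_and_step; the wrapper functions are hypothetical, $g stands for the function one would have passed to stack.extend, and the return blocks $l are elided:

// stack.extend $g on s with extra arguments t* is roughly
// stack.extend_and_step $extend_wrapper on s with those same arguments:
ti* extend_wrapper(t* extra, ti* pause()) {
    ti* inputs = pause();           // pause immediately, so the combined step makes no progress
    return g(inputs, extra, pause); // once stepped with real inputs, run $g as stack.extend would
}

// stack.step on s with inputs ti* is roughly
// stack.extend_and_step $step_wrapper on s with t* = ti*:
ti* step_wrapper(ti* inputs, ti* pause()) {
    return inputs; // immediately hand the inputs to the awaiting frame below, which resumes
}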

What is particularly useful about stack.extend_and_step, besides being more true to what actually happens at the low level, is that it gives you a way to step the stack without providing the expected inputs for the stack, as in the following variant where the worker threads each expect an i32.

$result main($thread_state work, $result init) {
    stack_list = new List<(stack (param i32) (result $result))>();
    mark gcaddroots() {
        foreach (worker in stack_list) {
            stack-wall {
                stack.extend_and_step(worker) $collect_worker_roots; // no i32 provided
            }
        }
    } within {
        ... // same loop as before (maybe add call to gc_collect)
    }
    return init;
}
i32 collect_worker_roots(i32 pause()) {
    collect_roots();
    return pause();
}

Stack Running

In both stack allocation and extension we were careful to make sure no local stack state could be caught in the first-class stack. The reason is that even if we were to immediately step the first-class stack, it could pause and outlive the local stack frame. But, although pausing is the key feature of first-class stacks, sometimes one reaches a point where they simply want to run a first-class stack to completion, disabling pausing. This effectively fuses (the lifetime of) the first-class stack with (that of) the local stack, and as such we can safely permit local state to be infused into the first-class stack.

We achieve this functionality with the following primitive:

stack.run {
    ... // code to run on the end of the stack, returning ti* into the stack
} pausing {
    ... // code to run whenever the stack would have paused, returning ti* back into the stack
} : [(stack (param ti*) (result to*))] -> [to*]

This conceptually mounts the given first-class stack on the local stack, extends the first-class stack with the frame for the main body of stack.run, and then resumes the first-class stack, executing the pausing clause whenever the first-class stack would have paused and eventually returning the result values of the first-class stack. The pausing clause is most straightforwardly implemented by allocating space within each stack for a stack-allocated closure that is initially null; the stack-allocated closure originally handed out as pause first checks that slot and, if it is no longer null, defers to it.

We can use this in our thread-workers example to forcibly abort threads. That is, suppose it is possible for the main loop to recognize early that the $result has already been determined, e.g. because it's computing a big conjunction and some worker has resulted in false, making it unnecessary to continue executing the remaining threads. Those threads might still be holding onto resources, though, so it is important to clean them up. We can do so by adding the following after the main loop:

foreach (worker in stack_list) {
    escape $target [unwinder] {
        stack.run(worker) {
            escape-to $target;
        } pausing {
            escape-to $target;
        }
    } hatch {}
}

Stack Acquiring

Earlier I deferred the issue that first-class stacks need to have associated locks to prevent someone from, say, trying to extend a stack with a frame while another program is stepping the stack. We could have every stack instruction acquire and release this lock, but that seems inefficient and doesn't address the fact that one thread might want to guarantee that a number of operations happen in direct succession. So instead we add the construct stack.acquire($s := expr) {...}. This pops off the result of expr, which should be a stack, acquires the lock on that stack, binds a local-variable-of-sorts $s to that stack, executes the ... body within which $s is in scope, and then releases the lock on the stack. Its input and output types are the same as those of the body ... plus an additional stack input.

Now $s is not really a local variable. It simply names a way to refer to the stack.acquire. So the second thing we do is modify all of the above instructions to refer to $s rather than take a stack as input. That way these instructions do not need to acquire/release the stack's lock because they know that the stack is already acquired.

Let's illustrate this new construct by showing how we can use it to append an entire stack, not just a single stack frame, onto a stack:

void stack_append((stack (param ti*) (result to*)) s, (stack (param ti*) (result ti*)) sapp) {
    stack.acquire($s := s) {
        local.get sapp;
        stack.extend_and_step $s $stack_append_helper;
    }
}
ti* stack_append_helper((stack (param ti*) (result ti*)) sapp, ti* pause()) {
    stack.acquire($sapp := sapp) {
        stack.run $sapp {
            pause();
        } pausing {
            pause();
        }
    }
}

There is a lot going on here. First, stack_append acquires the lock on s, the stack to be appended to. Then it extends s with the stack frame for stack_append_helper, capturing sapp as the argument to the corresponding parameter, which it then steps into, effectively starting the execution of stack_append_helper. This simply acquires the lock on sapp and then runs sapp but immediately pauses execution to get the values to provide to sapp, restoring control to stack_append. The acquire on s then completes, releasing the lock, and stack_append returns. So by the end of this, no one is holding the lock on the stack s, but the stack frame added onto s is holding the lock on sapp. Furthermore, the next time someone tries to step s, the value will be handed to sapp instead, which will either run to completion or, due to the pausing clause, cause s to pause whenever sapp would have paused.

Altogether, semantically speaking after stack_append completes it is as if the entirety of sapp has been mounted onto s and locked into place, combining the two into one. Furthermore, one design choice I made in the above primitives is that this is completely equivalent to permanently acquiring the lock on sapp and then directly copying each of the stack frames on sapp onto s one by one. In other words, unless you were the one to create a stack boundary, you cannot observe a stack boundary.

(Side note: you can use the technique above to also compose stacks of types [ti*] -> [t*] and [t*] -> [to*] into a stack of type [ti*] -> [to*], just like you can do with closures.)

Stack Duplication

Lastly, for stack duplication we mainly just add the instruction stack.duplicate $s : [] -> [(stack (param ti*) (result to*))] where $s references a stack.acquire of a stack (param ti*) (result to*). This primarily just duplicates the stack via memcpy. However, there is one complication.

The stack being duplicated can be in the midst of a step or run of another stack, in which case it has acquired the lock on that stack. We cannot have two stacks holding onto the same lock. So we have to duplicate that stack as well. In other words, while a stack is stepping or running another stack, the two are temporarily conceptually one stack and so must both be duplicated.

As a convention, after duplicating a stack, one could walk the duplicate and run each mark duplicator : [] -> [] on it. These duplicator marks could update the stack frame to duplicate relevant resources (or trap if impossible) so that the duplicate is reasonably independent of the original. For example, the stack could be in the middle of a stack.acquire on the stack held in some local variable $x, and the duplicator may want to update $x to reference the duplicated stack. For this reason, we also add stack.get $s that returns the stack currently acquired by the referenced stack.acquire, which might not be the stack value that was first passed to the stack.acquire because a stack.duplicate might have happened. It is an odd corner case, but from what I can tell this works out and provides the right primitives for supporting multi-shot continuations.
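
As a sketch of that duplication convention, reusing the collect_worker_roots pattern from above (duplicate_worker and run_duplicators are hypothetical, and I assume a worker stack with no awaited inputs):

mark duplicator : [] -> [];

(stack (param) (result $result)) duplicate_worker((stack (param) (result $result)) worker) {
    (stack (param) (result $result)) dup;
    stack.acquire($s := worker) {
        dup = stack.duplicate $s; // memcpy-style copy of the acquired stack
    }
    stack.acquire($d := dup) {
        stack-wall {
            stack.extend_and_step $d $run_duplicators; // walk the copy, firing its duplicator marks
        }
    }
    return dup;
}
void run_duplicators(void pause()) {
    walk-stack {
        while (true) {
            next-mark duplicator {
                get-mark;
            } none {
                break;
            }
        }
    }
    pause(); // hand control back to duplicate_worker once every duplicator mark has run
}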

Wrapping Up

So what do y'all think of these primitives? Obviously this is a lot, but we don't have to add them all at once. In particular, the ones about first-class stacks don't even make sense to add until after gc, since stacks will need to be memory managed. But I thought y'all would appreciate seeing a roadmap for this functionality and how it all fits together.
