Benchmarks so far have been encouraging, the fib(35)
test with -O2
now takes around 5.5 seconds, but fib(40)
still takes around 54 seconds,
while Nystrom's stack-based bytecode interpreter can do that calculation
in around 5 seconds. Of course this is due to using environments instead
of a stack, and walking trees instead of stepping through bytecode.
Another factor is that the entire AST needs to be protected, and must be marked every time a garbage collection occurs.
So how difficult would it be to convert the AST to bytecode and use a local stack? It turns out to be not so hard. Of course we still need environments, because of closures that capture them, and it's quite possible that version 2 will actually be slower initially, but I'll discuss a possible version 2.1 later that I hope will fix that.
Anyway let's review the math from version 1 and see what the bytecode equivalent might look like.
The new machine no longer has any
One big difference however is of course that since there is no longer an AST, the C regiter is now an index into an array of bytecodes.
I'll present the original math for each step, then its bytecode equivalent.
CAVEAT - NONE OF THIS IS TESTED YET. Please don't rush to implement and then blame me if it doesn't work, If/when I get it working I'll update this document.
UPDATE - V2.0 is working as intended and
(fib 40)
now clocks in at ~40s which is a little over 20% faster, without any of the optimisations proposed for V2.1, so I'm now very hopeful it can be gotten close to the clox performance.
These first few bytecodes are the equivalent of the old
Variables get looked up in the environment:
bytecode for that:
bytecode | action |
---|---|
| VAR | frame | offset | |
push(lookup(frame, offset, env)) |
The bytecode consists of three bytes, a VAR
tag that identifies that
a variable is coming up, then a byte for its frame and a byte for its
offset in the frame (see Lexical Addressing
for details if you haven't already.)
On seeing that, the bytecode interpreter does the lookup, and pushes the result onto the stack.
env
here is the
Note that evaluating an aexp
always has a stack cost of 1, e.g. there
is always one more element on the stack after evaluation of an aexp
.
Constants evaluate to their value equivalents:
bytecode | action |
---|---|
| INT | hi | lo | |
push(hi << 8 | lo); |
| TRUE | |
push(1); |
| FALSE | |
push(0); |
(16 bit integers for now).
Lambdas become closures:
bytecode | action |
---|---|
| LAM | nvar | addr(after exp) | ..exp.. | |
push(clo(nvar, addr(exp), env)); |
Where addr(exp)
is the index of the exp
in the bytecode array, and
addr(after exp)
tells the bytecode interpreter where to resume after
pushing the closure.
Note the absence of explicit variable names. Again because of lexical addressing the only thing the closure needs to know is the size of the environment.
Primitive expressions are evaluated recursively:
where
It starts to get interesting here, if we rewrite this to Reverse Polish Notation we can achieve the entire operation very simply:
bytecode | action |
---|---|
| ..aexp1.. | ..aexp2.. | PRIM | |
push(O(PRIM)(pop(), pop())); |
I should probably explain that.
Consider a primitive sequence like 2 + 3 * 4
. That will parse to
2 + (3 * 4)
and the AST will look like:
flowchart TD
plus(plus)
plus ---- two(2)
plus --- times(times)
times --- three(3)
times --- four(4)
There are various ways to traverse that tree, for example for each
(non-terminal) node, visiting the left hand branch, then visiting the
operation, then visiting the right hand branch would recover the infix
notation we started with. However if instead we visit the left-hand
branch, then the right-hand branch, then the operation, we end up with
reverse polish notation: 2 3 4 * +
which is exactly the order we need
to evaluate the expressions:
- push 2.
- push 3.
- push 4.
- pop the 3 and the 4, multiply them and push the result 12.
- pop the 2 and the 12, add them and push the result 14.
Note that the entire operation has a stack cost of 1, preserving that invariant.
The rest of these situations change the overall state of the machine,
corresponding to
For function calls, step
first evaluates the function, then the
arguments, then it applies the function:
where
and
is
where
and
where
bytecode | condition | new state |
---|---|---|
| ..aexp1.. | .. | ..aexpn.. | ..aexp.. | APPLY | naargs | |
clo(nfargs, addr, env) = pop() |
(see below) |
cont(letk(body, env, k')) = pop() |
(body, extend(env, pop()), k', F) |
|
cont(halt) = pop(); |
(DONE, env, halt, F) |
This is a bit more tricky. a sequence of aexpn
arguments followed by
a callable aexp
followed by an APPLY
token followed by the nunber
of actual arguments (naargs
). By the time we get to the APPLY
all
the aexpx
have been evaluated and are sitting on the stack, with the
callable on top. But there are three possibilities. the callable could
be a closure, or it could be a continuation created by call/cc
. If
it's a continuation it could be a letk
or a halt
. We deal with
each separately.
Also, because of currying, tha actual number of arguments could be
different from the number of arguments the closure is expecting. The
following cases handle those circumstances, having popped a closure from
clo(nfargs, addr, env) = pop()
above.
condition | new state |
---|---|
nfargs == naargs |
(addr, extend(env, pop(naargs)), K, F, S) |
naargs == 0 |
(C, E, K, F, [S + clo]) |
nfargs > naargs |
(C, env, K, F, [S + clo(nfargs - naargs, addr, extend(env, pop(naargs)))]) |
The pseudocode pop(naargs)
just means to copy the stack to the
environment, a simple memcpy
followed by a stack reset should suffice.
So if the actual number of arguments passed to the function is equal to the number it is expecting, we extend the environment with those arguments and transfer control to the body of the function. If the closure expects arguments but receives none, we just re-push the closure and resume execution at the next instruction. If the actual number of arguments is less than the expected number, we create a new closure extended with the available arguments, and reduced number of expected arguments, push it on to the stack, and resume execution at the next instruction.
Static analysis will not cope with this scenario, it will assume all variables local to a function are in the one stack frame. plus the actual stack frame arrangement can only be known at run time if we do it this way. We might instead be able to control this with a new type of "partial closure" which is recognised as being in the process of being built.
If the closure is a normal one, and is given all of its arguments by APPLY, then all proceeds as above.
If the closure is partial (having already been supplied with some of its arguments) then it will need to hold both the number of arguments remaining and the number of arguments it has already. It is still a first class object so we can't modify it, instead we must clone it and replace the environment of the clone with the additional arguments.
So curried functions will be a bit more expensive to invoke, but hopefully should not slow down simple function application.
Explicitly if the pre-condition is pclo(nfargs, nargs, addr, env) = pop()
then:
condition | new state |
---|---|
nfargs == naargs |
(addr, clone(env, pop(naargs)), K, F, S) |
naargs == 0 |
(C, E, K, F, [S + pclo]) |
nfargs > naargs |
(C, env, K, F, [S + pclo(nfargs - naargs, nargs + naargs, addr, clone(env, pop(naargs)))]) |
where clone(env, pop(naargs))
means to duplicate the existing env and
extend its frame with room for the additional args.
When the expression under evaluation is an aexp
, that means we need
to return it to the continuation:
where
and
is
where
bytecode | condition | new state |
---|---|---|
| ..aexp.. | RETURN | |
letk(body, env, k') = K |
(body, extend(env, pop()), k', F) |
halt = K |
(DONE, E, K, F) |
We'll need that RETURN
code to stop the internal iterations from proceeding.
There are two possibilities here, the continuation K
could be a letk
or a halt
.
we deal with each appropriately.
bytecode | condition | new state |
---|---|---|
| ..aexp.. | IF | addr(ef) | ..et.. | JMP | addr(after ef) | ..ef.. | |
pop(p) |
(addr(et), E, K, F) |
!pop(p) |
(addr(ef), E, K, F) |
Not mutch to say here, if the value at the top of the stack after evaluating aexp
is true, go to the true branch,
otherwise go to the false branch, but the true branch must jump over the false branch when complete.
Evaluating let
forces the creation of a continuation
where
bytecode | new state |
---|---|
| LET | addr(body) | ..exp.. | ..body.. | |
(addr(exp), E, letk(addr(body), E, K), F) |
Again just a re-expression of the specified behaviour.
where:
but subsequently mutated with pseudomath:
bytecode | new state |
---|---|
| ENV | nargs | |
(C, extend(E, nargs), K, F) |
| ..aexp1.. | .. | ..aexpn.. | LETREC | nargs | ..body.. | |
(addr(body), E <== aexps, K, F) |
This is meant to be read as a single squence of bytecodes.
The ENV
followed by the number of arguments it should hold causes the current env to be extended
by a new frame with that many slots. Then each aexp
is evaluated, then the new env is populated
with the values, then the body is evaluated in the new environment.
call/cc
takes a function as argument and invokes it with the current
continuation (dressed up to look like a
as a reminder of
where
and
where
so
bytecode | preparation | new state |
---|---|---|
| ..aexp.. | CALLCC | |
K == letk(addr, env', k'); env'' = extend(env' pop()); |
(addr, env'', k', F) |
amb
arranges the next state such that its first argument will be
evaluated, and additionally installs a new Fail continuation that,
if backtracked to, will resume computation from the same state, except
evaluating the second argument.
bytecode | new state |
---|---|
| AMB | addr(e2) | ..e1.. | JMP | addr(after e2) | ..e2.. | |
(addr(e1), E, K, backtrack(addr(e2), E, K, F)) |
back
invokes the failure continuation, restoring the state captured by
amb
.
bytecode | condition | new state |
---|---|---|
| BACK | |
F == backtrack(addr, E, K, F') |
(addr, E, K, F') |
F == end |
(addr(DONE), E, K, F) |
The DONE
Exp signals termination.
terminates the machine, effectively it's undefined for
So why might V2 be slower, and how can we speed it up?
Another caveat, this is quite vague at this point, I'm just thinking out loud.
My gut tells me that V2 will probably be a bit faster, but not by much. It gains the locality of a bytecode interpreter, and the AST can be discarded, but I suspect the overhead of all that copying between stack and environment might offset those benefits.
One observation is that many variables will be local to the function, if you look at how functions are called, their arguments are built on the stack, then the environment of the closure is extended with the values from the stack, then the values are popped off the stack, then the function is executed. Worse, when a function needs a variable value it is copied back from the environment onto the stack again!
Why not use the values on the stack directly?
Variables are created in three ways:
- as arguments to functions.
- as bound
let
variables. - as
letrec
closures.
It should be easy to distinguish free variables from bound ones in a function, there are well-defined mechanisms,
plus we've already done lexical analysis and we know the frame of each variable, so a variable is bound if its frame
count minus its lexical depth (the number of let
s and letrec
s between it and the top-level of its function) is less than 1.
So imagine a fuction:
(lambda (a b c)
(let (d <expr>)
(let (e <expr>)
(letrec (g (lambda (x) ..))
(h (lambda (y)
(+ (g y) y)))
(+ a b d e (h c))))))
and annotate its variable usage with [frame:offset]
(lambda (a b c)
(let (d <expr>)
(let (e <expr>)
(letrec (g (lambda (x) ..))
(h (lambda (y)
(+ (g[1:0] y[0:0]) y[0:0])))
(+ a[3:0] b[3:1] d[2:0} e[1:0] (h[0:1] c[3:2]))))))
If instead of an environment, we used a stack for the function's variables, then at the point of the body of the inner letrec, the stack would be:
[h]
[g] // letrec
[e] // let
[d] // let
[c]
[b]
[a] // lambda
We could then, instead annotate those local variables with a stack position, an offset from the base of the current function's stack frame. We'd also need to tag them as a different type of variable, or maybe just use a frame number of zero to indicate that (I think I prefer the first option):
(lambda (a b c)
(let (d <expr>)
(let (e <expr>)
(letrec (g (lambda (x) ..))
(h (lambda (y)
(+ (g[0:0] y[0]) y[0])))
(+ a[0] b[1] d[3] e[4] (h[6] c[2]))))))
The closure generated by the function would no longer have a local environment, instead it would directly hold a reference to its parent environment. A local environment would be constructed on demand when needed to create a closure, and that environment would contain (a copy of) the complete stack frame of the function up to that point, plus a pointer to the parent environment.
Note in the above h
lambda, the g
is non-local, but has a frame of 0 because 0 is now the parent frame.
A feature of A-Normal form is that, even though it's not entirely in continuation-passing style, there is
only one situation where function calls are not in tail position, and that is the exp
in a (let (var exp) body)
The place that requires a new continuation in the evaluator is the same place where we can't throw away the current stack frame.
In all other cases the called function can re-use the calling function's stack frame.
Firstly we're going to need a new register to keep track of the stack, all the good letters are gone so I'm just going to call it (s)
.
Our CEKF
machine is now a CEKF(s)
machine.
bytecode | action |
---|---|
| VAR | frame | offset | |
push(lookup(frame, offset, env)) |
| LVAR | offset | |
push(stack[S + offset]) |
VAR
is effectively unchanged (except its frame count is one less), the new local LVAR
instead accesses
its value directly from the stack (unfortunately we need values guaranteed to be at the top of the stack.)
bytecode | action |
---|---|
| INT | hi | lo | |
push(hi << 8 | lo); |
| TRUE | |
push(1); |
| FALSE | |
push(0); |
no change.
bytecode | action |
---|---|
| LAM | nvar | ..exp.. | |
push(clo(nvar, addr(exp), env.extend(snapshot(frame))); |
The new snapshot(frame)
pseudocode means to copy the current stack frame into a new environment.
bytecode | action |
---|---|
| ..aexp1.. | ..aexp2.. | PRIM | |
push(O(PRIM)(pop(), pop())); |
Unchanged.
bytecode | condition | new state |
---|---|---|
| ..aexp1.. | .. | ..aexpn.. | ..aexp.. | APPLY | |
clo(nargs, addr, env) = pop() |
(addr, env, K, F, sp - nargs) |
cont(letk(body, env, k', s')) = pop() |
(body, extend(env, pop()), k', F, s') |
|
cont(halt) = pop(); |
(DONE, env, halt, F, S) |
We evaluate the args and the callable. If the callable is a closure we call it without extending its environment,
and we set the stack frame pointer to be the bottom of its arguments (sp
is the current top of stack).
If it's a letk
continuation ... TBD
bytecode | condition | preparation | new state |
---|---|---|---|
| ..aexp.. | RETURN | |
letk(body, env, k', s') = K |
S[0] = pop() |
(body, env, k', F, s') |
halt = K |
(DONE, E, K, F, S) |
We arrange for the result of the expression to be at the bottom of the called function's frame, then resore the caller.