This project contains
- A simple, stack-based virtual machine for executing low-level instructions
- An assembler supporting a Forth / PostScript like language
- An interpreter able to run compiled programs
Architecture and design
The instructions are fixed-width at 32-bits and so are the arithmetic operands.
By default, programs have 1 million cells available for both program text and data. This means that a virtual machine memory takes up 4MB plus the data and instruction stacks.
The text and data regions are overlapped, so you can easily write self-modifying code (early versions actually required self-modification to be able to return from subroutine calls, just like Knuth's MIX, but I've since taken liberty to add such modern convenience into the core instruction set).
There are no registers. This is a stack machine, after all.
As we know from theoretical computer science, a pushdown automaton needs two stacks to be Turing equivalent. Therefore we employ two as well; one for the instruction pointer and one for the data. They live separately from the text and data region, and are only limited by the host process heap size.
The machine contains no special facilities besides this: It's inherently single-threaded and has no protection mechanisms. Its operation is completely sandboxed, though, except for access to standard output.
The project aim was to create a simple machine and language to play around with. You can benefit from it by reading the source code, playing with a language similar to Forth, but conceptually simpler, and finally by seeing how easy it is to build your own system.
The programming language
The language is very similar to Forth and PostScript: You basically write in RPN --- reverse Polish notation. Anything not recognized as an instruction is put on the data stack, so to put the numbers 3 and 2 on the stack, just write
To multiply them, just append with an asterix:
3 2 * ; multiplication
This operation pops the topmost two numbers on the stack and replaces them with the result of the multiplication. To run such a program, you'd need to include the core library first, since multiplication is defined as a function:
$ cat tests/core.src your-file.src | sm 6
Labels, addresses and their values
Labels are identifiers ending with a colon.
They refer to a particular cell in the machine, and you can access their position, value or execute code from that cell location:
label: ; create a label for the cell at this location &label ; put ADDRESS of label on top of stack &label LOAD ; put VALUE of label's cell "label" on top-of-stack label ; EXECUTE code from label position
So, to put the address of a label on the top of the data stack, just prepend the label name with an ampersand.
If you want the value of an address, put the address on the TOS (top of
stack) and use the
LOAD instruction to replace the TOS with the value at
the given cell location.
When executing code at a given label position, the machine first puts the
address of the next instruction on top of the instruction stack. This way
you can return from a function call by using the instruction
main: ; program start print-dot print-dot HALT print-dot: '.' OUT '\n' OUT POPIP ; return from "function"
Variables and subroutines
An idiom for creating variables is to create labels and putting a
that location to reserve one memory cell to hold variables. An example of
using a counter variable to implement a loop is given below.
counter: NOP ; reserve 1 word for the variable "counter" program: 2 &counter STOR ; set counter to two &counter LOAD 1 ADD &counter STOR ; increment counter by one ; loop counter+1 times display: '\n' '*' OUT OUT ; print an asterix 1 &counter LOAD SUB &counter STOR ; decrement counter by one &display &counter LOAD JNZ ; jump to display if not zero
The output of the above program is three stars:
$ ./sm foo.src * * *
You can forward-reference labels. In fact, another idiom is to jump to the main part of the program at the start of the source.
You can do
72 OUT to print the letter "H" (72 is the ASCII code for "H").
Cutting to the chase, a program to print "Hello!" would be:
; Labels are written as a name without whitespace ; and a colon at the end. main: 72 out ; "H" 101 out ; "e" 108 dup out out ; "ll" 111 out ; "o" 33 out ; "!" ; newline '\n' out 42 outnum ; print a number '\n' out ; and newline ; stop program halt
Notice the use of the
HALT instruction to stop the program.
Multiplication and core library
I've implemented a multiplication function in the core library in
mul: ; ( a b -- (a*b) ) mul-res: nop ; placeholder for result mul-cnt: nop ; placeholder for counter mul-num: nop &mul-cnt stor ; b to cnt dup &mul-res stor ; a to res &mul-num stor ; and to num mul-loop: ; calculate res += a &mul-res load &mul-num load + &mul-res stor ; decrement counter &mul-cnt load -1 &mul-cnt stor ; loop until counter is zero &mul-cnt load &mul-loop swap -1 jnz &mul-res load popip ; ... *: ; alias for mul mul popip
Note that this function needs definitions for the functions
Recall the program to multiply two numbers. Put the following in a file
3 2 * outnum '\n' out halt
If we concatenate the core library with our program, we get:
$ cat tests/core.src hey.src | ./sm 6
You could implement the whole program without depending on the core library:
; semi-obfuscated multiply and print ; does not depend on any libraries ; re-inventing the wheel can be very educational! main: 12345 67890 * outnum '\n' out halt ; multiplication function w/inner loop *: R: nop C: nop N: nop &C stor dup &R stor &N stor *-loop: &R load &N load add &R stor 1 &C load sub &C stor &C load &*-loop swap 1 swap sub jnz &R load popip
While implementing the Karatsuba algorithm should be quite easy, Toom-Cook multiplication is left as an exercise for the reader.
It's not a joke
I think I need to clarify that this project is actually not a joke. Fun, absolutely, but not a joke.
I just wanted to create a simple virtual machine and from that I grew a language. It's very similar to Forth and PostScript, and we all know those are extremely powerful --- particularly Forth!
Building stuff yourself is a powerful way of learning.
A Fibonacci program
The following is a program to generate and print Fibonacci numbers, taken
; Print the well-known Fibonacci sequence ; ; Our word size is only 32-bits, so we can't ; count very far. ; Program starts at main, so jump there &main jmp ; Create label 'count', which refers to this memory ; address. ; ; The NOP (no operation; do nothing) is only used ; to reserve memory space for a variable. count: nop ; Initialize the counter by storing 46 at the address of 'count'. ; ; POPIP will pop the instruction pointer, effectively jumping to ; the next location (probably the caller). count-init: 46 &count stor popip ; Shorthand for loading the number at 'count' onto the top of the stack. ; ; The "( -- counter)" comment is similar to Forth's comments, explaining ; that no number is expected on the stack, and after running this function, ; a number ("counter") will be on the stack. count-get: ; ( -- counter ) &count load ; load number popip ; Shorthand for decrementing the number on the stack dec: ; ( a -- a-1 ) 1 swap sub popip ; Store top of stack to 'count', do not alter stack count-set: ; ( counter -- counter ) dup &count stor popip ; Decrement counter and return it count-dec: ; ( -- counter ) count-get dec count-set popip ; Print number with a newline without altering stack show: ; ( number -- number ) dup outnum '\n' out popip ; Duplicate two top-most numbers on stack dup2: ; ( a b -- a b a b ) swap ; b a dup ; b a a rol3 ; a a b dup ; a a b b rol3 ; a b b a swap ; a b a b popip jump-if-nonzero: ; ( dest_address predicate -- ) swap jnz popip ; The start of our Fibonacci printing program main: count-init 0 show ; first Fibonacci number 1 ; second Fibonacci number loop: ; add top numbers and show ; a b -> a b a b -> a b (a + b) dup2 add show ; decrement, loop if non-zero count-dec &loop jump-if-nonzero
I've added a
HALT instruction. This replaces the old idiom of looping
forever to signal that a program was finished:
stop: stop ; form 1 stop: &stop jmp ; form 2 halt ; convenience form
Originally, it was an argument of minimalism for not including any halt instructions.
Secondly, I've added a
POPIP instruction along with automatically storing
the next instruction before performing a jump. This effectively let's you
call and return from subroutines:
boot: &main jmp halt foo: bar: baz: '\n' '!' 'e' 'c' 'i' 'u' 'j' 'e' 'l' 't' 'e' 'e' 'B' out out out out out out out out out out out out out popip main: foo bar baz
Third, I never bothered to write my own print number function, because it
would require me to write both division and modulus functions in source
first. So I implemented
OUTNUM that prints a number to the output:
123 OUTNUM '\n' OUT ; prints "123\n"
Lacking is proper string handling. One could say that string handling is not this language's strongest point.
Compiling the project
To compile and run the examples:
$ make all check
To see the low-level machine instructions:
$ ./smr -h
To execute source code on-the-fly:
$ ./sm filename
To compile source to bytecode:
$ ./smc filename
The assembly language is not documented other than in code, because I'm actively playing with it.
Although the interpreter is slow, it should be possible to convert stack operations to a register machine. In fact, it should be trivial to compile programs to native machine code, e.g. x86.
The instructions are found
VALUE OPCODE EXPLANATION 0x00000000 NOP do nothing 0x00000001 ADD pop a, pop b, push a + b 0x00000002 SUB pop a, pop b, push a - b 0x00000003 AND pop a, pop b, push a & b 0x00000004 OR pop a, pop b, push a | b 0x00000005 XOR pop a, pop b, push a ^ b 0x00000006 NOT pop a, push !a 0x00000007 IN read one byte from stdin, push as word on stack 0x00000008 OUT pop one word and write to stream as one byte 0x00000009 LOAD pop a, push word read from address a 0x0000000A STOR pop a, pop b, write b to address a 0x0000000B JMP pop a, goto a 0x0000000C JZ pop a, pop b, if a == 0 goto b 0x0000000D PUSH push next word 0x0000000E DUP duplicate word on stack 0x0000000F SWAP swap top two words on stack 0x00000010 ROL3 rotate top three words on stack once left, (a b c) -> (b c a) 0x00000011 OUTNUM pop one word and write to stream as number 0x00000012 JNZ pop a, pop b, if a != 0 goto b 0x00000013 DROP remove top of stack 0x00000014 PUSHIP push a in IP stack 0x00000015 POPIP pop IP stack to current IP, effectively performing a jump 0x00000016 DROPIP pop IP, but do not jump 0x00000017 COMPL pop a, push the complement of a
The instruction set could easily be more minimal, even more so if we allowed registers. Also, we have taken absolutely no care about the machine code values for each instruction. A good design would do something cool with that.
License and author
Placed in the public domain in 2010 by the author, Christian Stigen Larsen http://csl.sublevel3.org