# The Mit Virtual Machine

Reuben Thomas

28th March 2020

### Typographical conventions

Instructions and registers are shown in typewriter font; interface calls are shown in **bold** type.

Addresses are given in bytes and refer to the VM address space except where stated. Addresses are written in hexadecimal; hex numbers are prefixed with "0x".

### 1 Introduction

Mit is a simple virtual machine for study and experiment. It is a stack machine, based on the more complex register machine Mite [4]. This paper gives a full description of Mit.

Mit is conceptually (and usually in fact) a library, embedded in other programs.

#### 2 Architecture

The address unit is the byte, which is eight bits. Most of the quantities on which Mit operates are fixed-size words, which are stored in memory in either big- or little-endian order. The choice of byte and word size enable efficient implementation on the vast majority of machine architectures.

#### 2.1 Parameters

Some parameters may take a range of values, but are fixed for a particular virtual machine instance. They are listed in table 1.

| Parameter        | Meaning                                                                                  |
|------------------|------------------------------------------------------------------------------------------|
| word_bytes       | The number of bytes in a word. Must be in the range 2 to 32 inclusive, and a power of 2. |
| instruction_bits | The number of bits in an instruction. Must be in the range $5$ to $8$ inclusive.         |

Table 1: Parameters

### 2.2 Registers

The registers are word quantities; they are listed, with their functions, in table 2. The registers are initialised to 0.

| Register          | Function                                                                                          |
|-------------------|---------------------------------------------------------------------------------------------------|
| рс                | The program counter. Points to the next word from which i may be loaded.                          |
| ir<br>stack_depth | The instruction register. Contains instructions to be executed. The number of words on the stack. |

Table 2: Registers

### 2.3 Memory

Mit's memory consists of discontiguous words in a single flat address space. The address of a word is that of the byte in it with the lowest address.

#### 2.4 Stack

The stack is a LIFO stack of words used for passing values to instructions and routines and for holding subroutine return addresses. To **push** a word on to the stack means to add a new word to the top of the stack, increasing the stack depth by 1; to **pop** a word means to reduce the stack depth by 1. Instructions that change the number of words on the stack implicitly pop their arguments and push their results.

#### 2.5 Execution

Execution proceeds as follows:

```
begin

let opcode be the least significant instruction_bits bits of ir

shift ir logically instruction_bits bits to the right

execute the instruction given by opcode

repeat
```

If an error occurs during execution (see section 2.6), the state of the virtual machine is reset to its state at the start of the loop before the error is raised. This allows instructions to be restarted after handling the error, where desired.

### 2.6 Errors and termination

When Mit encounters certain abnormal situations, such as an attempt to access an invalid address, or divide by zero, an **error** is **raised**, and execution terminates. The effect of the current instruction is undone (see section 2.5). An **error code** is returned to the caller.

Execution can be terminated explicitly by performing a halt instruction (see section 3.3.1).

Error codes are unsigned numbers. 0 to 127 are reserved for the specification; other error codes may be used by implementations. The meanings of those that may be raised by Mit are shown in table 3.

| Code | Meaning                                                        |  |
|------|----------------------------------------------------------------|--|
| 0    | single_step() has terminated without error.                    |  |
| 1    | Invalid opcode (see section 3.11).                             |  |
| 2    | Stack overflow.                                                |  |
| 3    | Invalid stack read.                                            |  |
| 4    | Invalid stack write.                                           |  |
| 5    | Invalid memory read.                                           |  |
| 6    | Invalid memory write.                                          |  |
| 7    | Address alignment error: raised when an instruction is given a |  |
|      | valid address, but insufficiently aligned.                     |  |
| 8    | Invalid size (greater than $\log_2$ word_bytes).               |  |
| 9    | Division by zero attempted (see section 3.8).                  |  |
| 127  | A halt instruction was executed.                               |  |

Table 3: Errors raised by Mit

### 3 Instruction set

The instruction set is listed below, with the instructions grouped according to function. The instructions are given in the following format:

```
NAME ( before - after ) Description.
```

The first line consists of the name of the instruction. On the right is the stack effect, which shows the effect of the instruction on the stack. Underneath is the description.

Stack effects are written

```
( before - after )
```

where *before* and *after* are stack pictures showing the items on top of a stack before and after the instruction is executed. An instruction only affects the items shown in its stack effects. The brackets and dashes serve merely to delimit the stack effect and to separate *before* from *after*. **Stack pictures** are a representation of the top-most items on the stack, and are written

$$i_1$$
  $i_2 \dots i_{n-1}$   $i_n$ 

where the  $i_k$  are stack items, each of which occupies a whole number of words,<sup>1</sup> with  $i_n$  being on top of the stack. The symbols denoting different types of stack item are shown in table 4.

Types are only used to indicate how instructions treat their arguments and results; Mit does not distinguish between stack items of different types. In stack

<sup>&</sup>lt;sup>1</sup>In this specification, each stack item occupies *precisely* one word.

| Symbol     | Data type                                                    |  |
|------------|--------------------------------------------------------------|--|
| flag       | a Boolean flag, 0 for false or non-zero for true             |  |
| size       | an integer in the range $0$ to $\log_2 word_bytes$ inclusive |  |
| n          | signed number                                                |  |
| u          | unsigned number                                              |  |
| $n \mid u$ | number (signed or unsigned)                                  |  |
| X          | unspecified word                                             |  |
| addr       | address                                                      |  |
| a-addr     | word-aligned address                                         |  |

Table 4: Types used in stack effects

pictures the most general argument types with which each instruction can be supplied are given; subtypes may be substituted. Using the phrase " $i \Rightarrow j$ " to denote "i is a subtype of j", table 5 shows the subtype relationships. The subtype relation is transitive.

```
u\Rightarrow x
n\Rightarrow x
flag\Rightarrow u
size\Rightarrow u
a-addr\Rightarrow addr\Rightarrow u
```

Table 5: The subtype relation

Numbers are represented in two's complement form. *addr* consists of all valid virtual machine addresses.

Each type may be suffixed by a number in stack pictures; if the same combination of type and suffix appears more than once in a stack effect, it refers each time to the identical stack item.

Ellipsis is used for indeterminate numbers of specified types of item.

#### 3.1 Instruction fetch

If an invalid or unaligned address is accessed when loading ir, the appropriate error is raised (see section 2.6).

Load the word pointed to by pc into ir then add word\_bytes to pc.

### 3.2 Control

These instructions implement unconditional and conditional branches, and subroutine call and return (subroutine return is jump):

```
jump ( a-addr -  )
```

If ir is not 0, raise error 1. Set pc to a-addr. Perform the action of next.

jumpz (flag a-addr - )

If *flag* is false then set pc to *a-addr* and perform the action of next.

call (  $a-addr_1$  -  $a-addr_2$  )

If ir is not 0, raise error 1. Exchange pc with the top stack value. Perform the action of next.

#### 3.3 Extra instructions

Since ir must be 0 when next is performed, the rest of an instruction word following next, jump and call must normally be all zero bits.

Non-zero values following next and call are reserved for the Mit specification; non-zero values following jump may be used by implementations to implement extra functionality.

When an extra instruction is performed, the original instruction is considered to have completed executing.

#### 3.3.1 Termination

This instruction terminates execution (see section 2.6):

Raise error 127.

### 3.4 Stack manipulation

These instructions manage the stack:

pop ( x - )

Remove *x* from the stack.

$$(x_u \dots x_0 \ u - x_u \dots x_0 \ x_u)$$

Remove u. Copy  $x_u$  to the top of the stack.

swap ( 
$$\mathbf{x}_{u+1} \dots \mathbf{x}_0 \ \mathbf{u} - \mathbf{x}_0 \ \mathbf{x}_u \dots \mathbf{x}_1 \ \mathbf{x}_{u+1}$$
 )

Exchange the top stack word with the u+1th.

#### 3.5 Literals

The word pointed to by pc is pushed on to the stack, and pc is incremented to point to the following word.

Like lit, except that the initial value of pc is added to the value pushed on to the stack.

Push 0 on to the stack.

Push 1 on to the stack.

Push 2 on to the stack.

Push 3 on to the stack.

### 3.6 Logic and shifts

Logic functions:

not (  $\mathbf{x}_1 - \mathbf{x}_2$  )

Invert all bits of  $x_1$ , giving its logical inverse  $x_2$ .

and  $(x_1 x_2 - x_3)$ 

 $x_3$  is the bit-by-bit logical "and" of  $x_1$  with  $x_2$ .

or  $(x_1 x_2 - x_3)$ 

 $\mathbf{x}_3$  is the bit-by-bit inclusive-or of  $\mathbf{x}_1$  with  $\mathbf{x}_2$ .

 $(x_1 x_2 - x_3)$ 

 $x_3$  is the bit-by-bit exclusive-or of  $x_1$  with  $x_2$ .

Shifts:

lshift  $(x_1 u - x_2)$ 

Perform a logical left shift of u bit-places on  $x_1$ , giving  $x_2$ . Put zero into the least significant bits vacated by the shift. If u is greater than or equal to the number of bits in a word,  $x_2$  is zero.

rshift 
$$(x_1 u - x_2)$$

Perform a logical right shift of u bit-places on  $x_1$ , giving  $x_2$ . Put zero into the most significant bits vacated by the shift. If u is greater than or equal to the number of bits in a word,  $x_2$  is zero.

arshift 
$$(x_1 u - x_2)$$

Perform an arithmetic right shift of u bit-places on  $x_1$ , giving  $x_2$ . Copy the original most-significant bits into the most significant bits vacated by the shift. If u is greater than or equal to the number of bits in a word, all the bits of  $x_2$  are the same as the original most-significant bit.

$$sign\_extend$$
 (  $u \ size - n$  )

Sign extend the  $2^{size}$ -byte quantity u to n.

### 3.7 Comparison

These words compare two numbers on the stack, returning a flag (for equality, use xor; see section 3.6):

lt (  $n_1$   $n_2$  - flag )

flag is 1 if and only if  $n_1$  is less than  $n_2$ .

ult  $(u_1 \ u_2 - flag)$ 

flag is 1 if and only if  $u_1$  is less than  $u_2$ .

#### 3.8 Arithmetic

These instructions consist of monadic and dyadic operators. All calculations are made without bounds or overflow checking, except as detailed for certain instructions.

Negation and addition:

negate  $(n_1 - n_2)$ 

Negate  $n_1$ , giving its arithmetic inverse  $n_2$ .

add 
$$(n_1 | u_1 n_2 | u_2 - n_3 | u_3)$$

Add  $n_2 \mid u_2$  to  $n_1 \mid u_1$ , giving the sum  $n_3 \mid u_3$ .

Multiplication and division (note that all division instructions raise error 9 if division by zero is attempted):

mul  $(n_1|u_1 n_2|u_2 - n_3|u_3)$ 

Multiply  $n_1 \mid u_1$  by  $n_2 \mid u_2$  giving the product  $n_3 \mid u_3$ .

divmod ( 
$$n_1$$
  $n_2$  -  $n_3$   $n_4$  )

Divide  $n_1$  by  $n_2$  using symmetric division, giving the single-word quotient  $n_3$  and the single-word remainder  $n_4$ . The quotient is rounded towards zero.

$$(u_1 u_2 - u_3 u_4)$$

Divide  $u_1$  by  $u_2$ , giving the single-word quotient  $u_3$  and the single-word remainder  $u_4$ .

### 3.9 Memory

These instructions fetch and store quantities to and from memory. If an invalid or unaligned address is accessed, the appropriate error is raised (see section 2.6).

Load the  $2^{size}$ -byte quantity x stored at addr, which must be a multiple of  $2^{size}$ . Any unused high-order bits are set to zero.

Store the  $2^{size}$  least-significant bytes of x at addr, which must be a multiple of  $2^{size}$ .

### 3.10 Instruction encoding

Instructions are encoded as instruction\_bits-bit opcodes; opcodes are packed into words, which are executed starting at the least-significant bits.

#### 3.11 Instruction opcodes

Table 6 lists the instruction opcodes in numerical order. Table 7 lists the extra instruction opcodes (following call; see section 3.3). Other instruction opcodes are undefined.

| Opcode | Instruction | Opcode | Instruction |
|--------|-------------|--------|-------------|
| 0x0    | next        | 0x10   | (undefined) |
| 0x1    | jump        | 0x11   | 1t          |
| 0x2    | jumpz       | 0x12   | ult         |
| 0x3    | call        | 0x13   | negate      |
| 0x4    | pop         | 0x14   | add         |
| 0x5    | dup         | 0x15   | mul         |
| 0x6    | swap        | 0x16   | divmod      |
| 0x7    | (undefined) | 0x17   | udivmod     |
| 0x8    | load        | 0x18   | not         |
| 0x9    | store       | 0x19   | and         |
| 0xa    | lit         | 0x1a   | or          |
| 0xb    | lit_pc_rel  | 0x1b   | xor         |
| 0xc    | lit_0       | 0x1c   | lshift      |
| 0xd    | lit_1       | 0x1d   | rshift      |
| 0xe    | lit_2       | 0x1e   | arshift     |
| 0xf    | lit_3       | 0x1f   | sign_extend |

Table 6: Instruction opcodes

| Opcode | Instruction |  |
|--------|-------------|--|
| 0x1    | halt        |  |

Table 7: Extra instruction opcodes

### 4 External interface

- Implementations should provide an **API** to create and run virtual machine instances, and provide access to its registers, stack and memory.
- Implementations can add **extra instructions** to provide extra computational primitives, and to offer access to system facilities, previously written code, native libraries and so on.

# Acknowledgements

Martin Richards introduced me to Cintcode [2], which kindled my interest in virtual machines, and led to Beetle [3] and an earlier version of Mite [4], of which the current Mite is a sort of synthesis. GNU *lightning* [1] helped inspire me to greater simplicity, while still aiming for good performance. Alistair Turnbull has been a fount of criticism for all my work on virtual machines.

# References

- [1] Paulo Bonzini. Using and porting GNU *lightning*, 2000. ftp://alpha.gnu.org/gnu/.
- [2] Martin Richards. Cintcode distribution, 2000. https://www.cl.cam.ac.uk/~mr/BCPL.html.
- [3] Reuben Thomas. Beetle and pForth: a Forth virtual machine and compiler. BA dissertation, University of Cambridge, 1995. https://rrt.sc3d.org/.
- [4] Reuben Thomas. *Mite: a basis for ubiquitous virtual machines*. PhD thesis, University of Cambridge Computer Laboratory, November 2000. https://rrt.sc3d.org/.