Rust Instruction Dispatch

This is a project to learn how to implement instruction dispatch in Rust. Instruction dispatch is a technique for executing bytecode instructions in a language interpreter. This project compares five types of instruction dispatch:

Switch Dispatch
Direct Call Threading
Direct Tail Call Threading
Direct Threading
Optimized Direct Threading

"Writing Interpreters in Rust: a Guide[1]" says this about "Instruction dispatch" in "4.8 Virtual Machine: Architecture and Design"

Prior research[2] into implementing dispatch in Rust concludes that simple switch-style dispatch is the only cross-platform construct we can reasonably make use of. Other mechanisms come with undesirable complexity or are platform dependent.

In my environment (MacBook Pro 13" 2020 Apple M1), Direct Tail Call Threading was faster than Switch Dispatch.

Installation

This project is written in Rust and C. To install Rust, please refer to here.

Also, the implementations of Direct Threading and Optimized Direct Threading in Rust are for Aarch64 only. If you want to support other architectures, you need to rewrite the inline assembly in the source code.

After downloading this project, build it with the following commands.

(cd Rust/switch-dispatch; cargo build --release)
(cd Rust/direct-call-threading; cargo build --release)
（cd Rust/direct-tail-call-threading; cargo build --release)
(cd Rust/direct-threading; cargo build --release)
(cd Rust/optimized-direct-threading; cargo build --release)
(cd C/switch-dispatch; make)
(cd C/direct-call-threading; make)
(cd C/direct-tail-call-threading; make)
(cd C/direct-threading; make)
(cd C/optimized-direct-threading; make)

Usage

After building, you can measure the performance of each type of instruction dispatch with the following commands.

(cd Rust/switch-dispatch; cargo run --release)
(cd Rust/direct-call-threading; cargo run --release)
(cd Rust/direct-tail-call-threading; cargo run --release)
(cd Rust/direct-threading; cargo run --release)
(cd Rust/optimized-direct-threading; cargo run --release)
C/switch-dispatch/switch_dispatch
C/direct-call-threading/direct_call_threading
C/direct-tail-call-threading/direct_tail_call_threading
C/direct-threading/direct_threading
C/optimized-direct-threading/optimized_direct_threading

The output example for "(cd Rust/switch-dispatch; cargo run --release)" is as follows.

> (cd Rust/switch-dispatch; cargo run --release)
    Finished release [optimized] target(s) in 0.08s
     Running `target/release/switch-dispatch`
1048575
Time elapsed:    5067625 ns
(omitted)
1048575
Time elapsed:    3709042 ns
Average of 100 results:    3916145 ns

Programs written in Rust can also run cargo criterion.

(cd Rust/switch-dispatch; cargo criterion)
(cd Rust/direct-call-threading; cargo criterion)
(cd Rust/direct-tail-call-threading; cargo criterion)
(cd Rust/direct-threading; cargo criterion)
(cd Rust/optimized-direct-threading; cargo criterion)

To run cargo criterion, you must install cargo-criterion as follows.

cargo install cargo-criterion

The output example for "(cd Rust/switch-dispatch; cargo criterion)" is as follows.

❯ (cd Rust/switch-dispatch; cargo criterion)
(omitted)
    Finished bench [optimized] target(s) in 21.06s
switch dispatch         time:   [3.8144 ms 3.8509 ms 3.8909 ms]

running 7 tests
test bytecode::test_get_opcode ... ignored
(omitted)

In addition, programs written in Rust can be run with the dev profile to output debugging information.

(cd Rust/switch-dispatch; cargo run)
(cd Rust/direct-call-threading; cargo run)
(cd Rust/direct-tail-call-threading; cargo run)
(cd Rust/direct-threading; cargo run)
(cd Rust/optimized-direct-threading; cargo run)

The output example for "(cd Rust/switch-dispatch; cargo run)" is as follows.

❯ (cd Rust/switch-dispatch; cargo run)
(omitted)
     Running `target/debug/switch-dispatch`
Program:
0: LOAD: memory[0] = 0
1: LOAD: memory[1] = 1
2: LOAD: memory[2] = 2
3: ADD: memory[0] = memory[0] + memory[1]
4: JMPNE: if memory[0] != memory[2] pc = 3
5: PRINT: print memory[0]
6: RET

Trace:
0: memory[0] = 0; memory[0]:0
1: memory[1] = 1; memory[1]:1
2: memory[2] = 2; memory[2]:2
3: memory[0]:0 = memory[0]:0 + memory[1]:1; memory[0]:1
4: if memory[0]:1 != memory[2]:2 pc = 3; pc: 3
3: memory[0]:1 = memory[0]:1 + memory[1]:1; memory[0]:2
4: if memory[0]:2 != memory[2]:2 pc = 3; pc: 5
5: print memory[0]
2
6: ret

Instruction Dispatch Techniques

For details on the individual instruction dispatch techniques, see the third chapter entitled "Dispatch Techniques" in the paper "YETI: a gradualY Extensible Trace Interpreter[3]".

The Wikipedia article "Threaded code[4]" is also helpful.

Switch Dispatch

In switch dispatch, each bytecode instruction is mapped to a case label in a switch statement.

Direct Call Threading

In direct call threading, each bytecode instruction is translated into a function pointer to an instruction implementation function.

Direct Tail Call Threading

In direct call threading, each bytecode instruction is translated into a function pointer to an instruction implementation function. The return from each function and the next function call are tail-call optimized.

This technique is mentioned in "Direct tail-call threading in Rust[5]”.

The Wikipedia article "Tail call[6]" is also helpful.

Direct Threading

In direct threading, each bytecode is executed by jumping to its implementation address.

The program "threadedasm.rs " in "Virtual Machine Dispatch Experiments in Rust[7]" implements direct threading in Rust using inline assembly. But, the program "threadedasm.rs" uses the old inline asm syntax, so I write direct threading program using the new inline asm syntax[8].

My program is only for Aarch64. I would appreciate it if you could send me pull requests for other architectural implementations.

My program runs only at opt-level 0 and is significantly slower than the -Ofast C implementation. I would appreciate it if you could send me a pull request for this as well.

Optimized Direct Threading

In the optimized direct thread, the number of array references during bytecode execution is one less than in the direct thread.

Like direct threading, my program is an Aarch64-only implementation, with the restriction that my program can only run at opt-level 0.

Comparison of Techniques

Threading Techniques	Pros	Cons
Switch dispatch	Simple	High dispatch overhead
	Portable	Poor instruction caching
Direct call threading	Reduce dispatch overhead	Conversion to internal format
	Portable	Function call overhead
		Function return overhead
Direct tail call threading	Reduce dispatch overhead	Conversion to internal format
	Remove function return overhead	Function call overhead
	Portable	Tail call isn't optimized in dev profile
Direct threading	Reduce dispatch overhead	Require compiler support
	Improve instruction caching	Indirect jump overhead (2 times)
Optimized Direct threading	Reduce dispatch overhead	Require compiler support
	Improve instruction caching	Indirect jump overhead (1 time)
	Reduce Indirect jump overhead

VM Bytecode for Benchmarking

LOAD a, imm

Set the value 'imm' at memory location 'a'.

ADD a, b, c

The values in memory location 'b' and memory location 'c' are added and stored in memory location 'a'.

JMPNE a, b, jmp

If the values of memory location 'a' and memory location 'b' are different, jump to 'jmp'; if they are the same, execute the next instruction.

PRINT a

Print the value of memory location 'a'.

RET

Terminate execution.

Virtual Machine Program for Benchmarking

Add memory location 0 by 1 from 0 to 1,048,575(0xfffff).

// Init

0: LOAD 0, 0

1: LOAD 1, 1

2: LOAD 2, 0xfffff

// Loop

3: ADD, 0, 0, 1

4: JMPNE 0, 2, 3

// Finish

5: PRINT 0

6: RET

Performance Comparison

Threading Techniques	Rust criterion (msec)	Rust Average of 100 runs (nanosec)	C Average of 100 runs (nanosec)
Switch dispatch	3.8509	3,916,145	3,865,595
Direct call threading	7.4248	7,923,624	6,831,052
Direct tail call threading	2.4223	2,429,428	2,618,682
Direct threading	17.583	19,134,075	1,561
Optimized Direct threading	12.829	13,066,978	627

Measured on MacBook Pro 13" 2020 (Apple M1)

Threading Techniques	Rust criterion (msec)	Rust Average of 100 runs (nanosec)	C Average of 100 runs (nanosec)
Switch dispatch	7.6682	6,025,179	6,825,759
Direct call threading	9.7546	6,582,182	7,998,624
Direct tail call threading	3.8759	3,430,561	3,473,483
Direct threading	-	-	1,074
Optimized Direct threading	-	-	895

Measured on DELL-inspiron 15 3000 2019 (Intel Core i7-1065G7)

References

[1] 2022. Writing Interpreters in Rust: a Guide. Retrieved from https://rust-hosted-langs.github.io/book/.

[2] Peter Liniker. 2017. Virtual Machine Dispatch Experiments in Rust. Retrieved from https://pliniker.github.io/post/dispatchers/.

[3] Mathew Zaleski. 2008. YETI: a graduallY Extensible Trace Interpreter. Ph.D. Dissertation. University of Tronto, Toronto, ON.

[4] Wikipedia. 2023. Threaded code. Retrieved from https://en.wikipedia.org/wiki/Threaded_code.

[5] Demindiro. 2021. Direct tail-call threading in Rust. Retrieved from https://www.demindiro.com/blog/2021/rust-direct-tail-call-threading.

[6] Wikipedia. 2023. Tail call. Retrieved from https://en.wikipedia.org/wiki/Tail_call.

[7] Peter Liniker. 2017. Virtual Machine Dispatch Experiments in Rust. Retrieved from https://github.com/pliniker/dispatchers.

[8] Rust RFC. 2873-inline-asm. Retrieved from https://rust-lang.github.io/rfcs/2873-inline-asm.html.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
C		C
Rust		Rust
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Rust Instruction Dispatch

Installation

Usage

Instruction Dispatch Techniques

Switch Dispatch

Direct Call Threading

Direct Tail Call Threading

Direct Threading

Optimized Direct Threading

Comparison of Techniques

VM Bytecode for Benchmarking

LOAD a, imm

ADD a, b, c

JMPNE a, b, jmp

PRINT a

RET

Virtual Machine Program for Benchmarking

Performance Comparison

References

About

Uh oh!

Releases

Packages

Languages

License

djwatson/rust-instruction-dispatch

Folders and files

Latest commit

History

Repository files navigation

Rust Instruction Dispatch

Installation

Usage

Instruction Dispatch Techniques

Switch Dispatch

Direct Call Threading

Direct Tail Call Threading

Direct Threading

Optimized Direct Threading

Comparison of Techniques

VM Bytecode for Benchmarking

LOAD a, imm

ADD a, b, c

JMPNE a, b, jmp

PRINT a

RET

Virtual Machine Program for Benchmarking

Performance Comparison

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages