zisp is a proof of concept that asks how far Zig's compile-time machinery and the new labeled switch continue syntax can push parser generation. The project starts from high-level PEG (Parsing Expression Grammar) declarations and lowers them, at compile time, into tightly-specialized VM loops that read more like hand-written interpreters than generic parser combinators.
The repository doubles as a playground for a few ideas:
- comptime-driven codegen – Grammar rules are analysed and expanded during compilation, producing concrete bytecode tables and AST layouts before the program ever runs.
- Switch-label `continue` – The VM core relies on Zig 0.15's ability to `continue :vm next_ip` directly from inside nested control flow, giving a threaded-interpreter style loop without manual `goto`s (see the sketch after this list).
- Runtime that still feels ergonomic – Even with all the specialization, the public API stays close to "declare a grammar, parse a buffer, walk a typed AST".
- Transparency of the generated code – We want to be able to inspect the lowered form easily (LLVM IR, assembly, AST dumps) and reason about the cost model.
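For a feel of the feature in isolation, here is a tiny, self-contained sketch (not taken from the zisp sources) of the labeled switch `continue` idiom: each arm re-dispatches by continuing with a new value, which is what gives the VM its threaded-interpreter shape.

```zig
const std = @import("std");

pub fn main() void {
    var acc: u32 = 0;
    // A labeled switch: `continue :dispatch <value>` jumps back to the
    // switch with a new operand instead of falling out of it.
    dispatch: switch (@as(u8, 0)) {
        0 => {
            acc += 1;
            continue :dispatch 1; // "goto" the next state
        },
        1 => {
            acc += 2;
            continue :dispatch 2; // 2 has no arm of its own, so `else` runs
        },
        else => {}, // exit the state machine
    }
    std.debug.print("acc = {d}\n", .{acc});
}
```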
- `src/peg.zig` – Grammar DSL, compile-time compilation of PEG rules, and AST helpers.
- `src/vm.zig` – The bytecode interpreter/VM with loop-mode execution using labeled switch `continue`.
- `src/main.zig` – CLI harness that exercises the parser and prints traces/ASTs.
- `docs/vm-loop-llvm.md` – Walkthrough of how to force Zig/LLVM to emit the specialized loop for `demoGrammar`.
- `vm_loop_demo.zig` – Minimal driver used by the docs to instantiate the VM in isolation.
You need Zig 0.15.1 or newer (the build script uses the labeled-continue feature). The usual workflow:
zig build run # build the CLI and run it
zig build test # run the grammar + VM unit tests

The CLI parses a miniature Zig subset (src/zigmini). Today that grammar still rides on the older pegvm.zig backend simply because it hasn't been ported over yet, but its shape mirrors the new peg.zig + vm.zig pipeline. For a quick feel of the existing system, run `zig run src/pegvm.zig`; that is the main entry point that prints the bytecode, step trace, and AST using the original VM. Try passing --dump-pegcode for a readable dump of the generated bytecode.
Running the grammar module directly prints the compiled bytecode, a step-by-step trace for a demo input, and the resulting typed forest:
$ zig run src/peg.zig
&Value:
0 push ->3
1 call ->5
2 drop ->4
3 call ->15
4 done
&Integer:
5 open
6 read 1..9
7 next
8 open
9 read 0..9*
10 shut
...
Parsing: "[[1] [2]]"
[ | 0000 push ->3
| 0001 call ->5
|-| 0005 open
...
✓ (156 steps)
Array [0..16) "[[1] [2] [4096]]"
└─values: 3 items
├─[0] Value: .array -> Integer d='1'
├─[1] Value: .array -> Integer d='2'
└─[2] Value: .array -> Integer d='4', ds="096"
The VM builds a "typed forest": every grammar rule owns a dedicated growable array, and siblings for a rule end up stored contiguously. That layout makes it cheap to gather a rule’s results and to reinterpret slices as strongly-typed structs/unions when you walk the AST later. In the demo run the root rule is Array, whose values field is emitted as a Kleene list of Value nodes; each Value lowers to either an Integer or another Array, and you can see the nesting clearly in the forest dump:
Array:
└─values: 3 items
├─[0] Value: .array
│ └─Array:
│ └─values: 1 items
│ └─[0] Value: .integer
│ └─Integer:
│ ├─d: '1' [2]
│ └─ds: (empty)
├─[1] Value: .array
│ └─Array:
│ └─values: 1 items
│ └─[0] Value: .integer
│ └─Integer:
│ ├─d: '2' [6]
│ └─ds: (empty)
└─[2] Value: .array
└─Array:
└─values: 1 items
└─[0] Value: .integer
└─Integer:
├─d: '4' [10]
└─ds: "096" [11..14)
The full trace (with detailed stack annotations and AST layout) is available any time you want to sanity-check how a grammar runs.
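To make that layout concrete, here is a hedged sketch of the typed-forest idea in plain Zig; the names (`Forest`, `Range`) and the flat-slice representation are illustrative, not the actual zisp types. Each rule gets its own backing array, and a parent refers to its children as a contiguous index range into the child rule's array, so gathering a rule's results is just slicing.

```zig
const std = @import("std");

const Range = struct { start: u32, len: u32 };

const Integer = struct { d: u8, ds: []const u8 };

const Value = union(enum) {
    integer: u32, // index into Forest.integers
    array: u32, // index into Forest.arrays
};

const Array = struct { values: Range }; // contiguous run in Forest.values

const Forest = struct {
    integers: []const Integer,
    values: []const Value,
    arrays: []const Array,

    fn children(f: Forest, r: Range) []const Value {
        // Siblings are stored next to each other, so a rule's results
        // come back as a plain slice of the per-rule array.
        return f.values[r.start .. r.start + r.len];
    }
};

test "walking a hand-built typed forest" {
    // Roughly mirrors one branch of the demo dump: Array -> Value -> Integer '1'.
    const forest = Forest{
        .integers = &.{.{ .d = '1', .ds = "" }},
        .values = &.{.{ .integer = 0 }},
        .arrays = &.{.{ .values = .{ .start = 0, .len = 1 } }},
    };
    const root = forest.arrays[0];
    for (forest.children(root.values)) |v| {
        try std.testing.expect(v == .integer);
        try std.testing.expectEqual(@as(u8, '1'), forest.integers[v.integer].d);
    }
}
```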
To look directly at the loop-mode codegen for the included demoGrammar, follow the steps in docs/vm-loop-llvm.md. The short version:
zig build-exe vm_loop_demo.zig \
-O ReleaseFast -fllvm \
-femit-llvm-ir=zig-out/vm_loop_demo.ll \
    -femit-asm=zig-out/vm_loop_demo.s

The emitted .ll and .s highlight how the interpreter turns into a computed-goto state machine with literal bitsets for character classes.
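As a rough illustration of what “literal bitsets for character classes” means (a generic sketch, not the actual zisp codegen): the accepted bytes can be folded into a wide integer constant at comptime, so a class test lowers to a shift and a mask rather than a chain of range comparisons.

```zig
const std = @import("std");

/// Build a 256-bit set with one bit per accepted byte, at comptime.
fn classBitset(comptime chars: []const u8) u256 {
    var set: u256 = 0;
    for (chars) |c| set |= @as(u256, 1) << c;
    return set;
}

const digit_set: u256 = classBitset("0123456789");

fn isDigit(c: u8) bool {
    // One shift and one mask against a literal constant.
    return (digit_set >> c) & 1 != 0;
}

test "digit class via a literal bitset" {
    try std.testing.expect(isDigit('7'));
    try std.testing.expect(!isDigit('['));
}
```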
Because the VM bytecode is baked during comptime, the “interpreter” that ships in the binary already knows the exact instruction stream. `VM(G).next` gets monomorphized for the grammar, the opcode array becomes a constant, and the main loop lowers to one giant switch/jump-table keyed on the instruction pointer. In other words, we don’t even switch on an opcode enum at runtime; we switch on the literal IP and jump straight to the inlined code for that specific instruction. A toy sketch of the shape you get looks like this:
// Pseudocode, but this is the flavour LLVM ends up with.
vm: switch (ip) {
    0 => { // read '['
        if (self.text[self.sp] != '[') return error.ParseFailed;
        self.sp += 1;
        continue :vm 1;
    },
    1 => { // call Skip rule
        try self.calls.append(.{ .return_ip = 2, .target_ip = 31, ... });
        continue :vm 31;
    },
    2 => { // next field, etc.
        ...;
        continue :vm 3;
    },
    else => return,
}

Every case carries the rule metadata, call targets, character sets, and struct bookkeeping as compile-time constants. In release builds the control flow resembles a hand-written, assembly-level threaded interpreter for a program that was known when you built the binary. The deep dive in docs/vm-loop-llvm.md shows the LLVM view, but even at the Zig level you can reason about the VM as a tightly unrolled state machine specialized to the grammar you compiled.
This is intentionally exploratory code. Expect breakage, rapid refactors, and plenty of TODOs around:
- Enriching the grammar DSL with more PEG operators.
- Experimenting with alternative backends (direct threaded code vs VM bytecode).
- Measuring performance against other PEG implementations.
- Refining the AST representation to reduce allocations.
If you're curious about a specific angle—memoization strategies, labelled-switch ergonomics, or further comptime tricks—open an issue or hack on a branch. The more weird experiments, the better.
MIT. See LICENSE for details.