This is an experiment on building a small language compiler on top of a home brewed parsing expression grammar implementation.
The language implemented in this project, effigy, currently compiles
down to a subset of the Python 3.7 bytecode format. More
specifically, the Effigy compiler produces .pyc
files.
Effigy's runtime is the Python 3.7 Virtual Machine. The difference is just how the bytecode gets generated. Most idioms like declaring literals, calling functions, assigning variables etc have the exact same semantics as in regular Python code.
Effigy differs from Python on the use of functions for control flow a little more often and the absence of classes (might be added later).
Effigy is currently a teeny little JavaScript program. You can
install it with npm i efgc
. After that, you can type your effigy
programs in a file and then run efgc yourfile.efg
. That will
generate a .pyc
file in the same directory as the source file
that can be ran with Python (currently only 3.7).
Here's what's available and some of what's not:
- integers
- strings (double quotes only. Single quotes currently yield syntax error)
- lists
- functions (named and anonymous)
- Arithmetic Operators
- Logic Operators
- Comparison Operators
- Flow Control (if/else/while/for)
- Exceptions (single catch block for now)
- Imports
- Slice notation
- Variadic arguments
- Named/Default parameters
- Floating points
As mentioned in the introduction, Effigy is an experiment. So it probably won't be a good example of how to write the next industry standard compiler, but it should give insights about what compilers do and at least one way of doing it.
The current version of the efgc
compiler is broken down into
three main pieces: 1) PEG parser-generator, 2) bytecode
translator, 3) assembler. Let's look at them separately.
The PEG is the most basic component of this compiler. It's what
the compiler uses to 1) Parse the program text into a parse tree
and 2) to transform the parse tree into bytecode
.
PEGs provide very similar functionality compared to Context Free Grammars. The most relevant difference is 1. being deterministic 2. allowing infinite lookahead via predicates. This allows PEGs to provide functionality for both syntactical and semantic matching. To read beyond this vague definition, I suggest reading the article that introduced the concept.
The API for parsing text currently looks like this:
> const g = peg.pegc('Digit <- [0-9]+'); // Compile Grammar
> g.match('123') // Match some input
['Digit', ['1', '2', '3']]
There's also an API for matching data structures (lists):
> peg.pegc('List <- { "a" { "b" } }').matchl(["a", ["b"]])
['L', ['a', ['b']]]
In very practical terms, this home grown PEG implementation is being used in the parser and the translator pieces. And besides the grammar language, this PEG also provides semantic actions exposed via the JavaScript API (not in the grammar language). Allowing the user to declare traversals for the output trees captured from successful matching. E.g.:
> const join = x => Array.isArray(x) ? x.join('') : x; // Helper for joining lists of strings together
> const g = peg.pegc('Digit <- [0-9]+') // Compile Grammar
> const r = g.bind({ Digit: ({ visit }) => parseInt(join(visit()), 10) }); // Bind semantic actions
> r('123')
123
It is worth mentioning that bindl()
is also available for
binding semantic actions to a generator that will process data
structures (lists) instead of text.
The semantic actions are modular. They're not executed until the whole match is finished successfully. That way, the user of the PEG engine doesn't ever have to think about the backtracking that happens behind the scenes.
This PEG implementation has no dependencies besides the host
language used to write the file peg.js
.
Sadly there are a few valuable things that I didn't get to implement yet that would considerably increase the quality of the PEG implementation:
-
Error Reporting. Although parser generators sometimes get bad fame for their error reporting, there is some modern literature on how to allow pretty good error reporting. The best this PEG does is to report accurately the farther failure position heuristics that tell how far on the input the current grammar was able to match before the error happened. Link for the aforementioned modern literature. Current error reporting on list matching is awful to say the least. It literally only tells you that it didn't match a list.
-
Arity of PEG operators. The operator
OneOrMore (+)
returns an item if it matches one and a list if it matches many. And the list is flattened. TheZeroOrMore (*)
operator behaves similarly to(+)
but can also return nothing. Which is represented withnull
. These are a bit confusing but I'm not really sure if I found all the answers to design something better yet. -
Left recursion. There's a branch for supporting that. It currently misses mutual left recursion support so it's not merged yet. The implementation leverages bounded left recursion.
Although the first target of the little compiler is a subset of Python, JavaScript was chosen as the host language for a few reasons:
-
I didn't want to do it in Python because it'd be very tempting to use one of its modules for parsing, scope analysis or code generation. I wanted to implement all the pieces of the compiler to be able to reason how far I could leverage the PEG to do those tasks.
-
Python and JavaScript have very similar semantics for closures but present slight differences in how side-effect (assignment) of values declared in enclosed scopes work. Java Script separates assignment from declaration, Python provides the
nonlocal
keyword.I wanted something right in the middle for Effigy: Assignment is coupled to declaring a variable, but provides the keyword
let
to mark names to be saved as closures so assignments in deeper scopes will know its not a new value. -
It doesn't really matter. The goal is to rewrite Effigy with Effigy.