The Hitch Hiker's Guide to the jeff65 Internals

Jonathan David Page edited this page Jun 19, 2018 · 3 revisions

In which the author fulfils a fantasy of having people ask them technical questions about their pet project.

This document is aimed at people who either want to contribute to jeff65, or who are just interested in how it works. It's also partially to get some stuff straightened out in my head.

Not really. There's no technical reason that a compiler has to be written in a compiled language; the main reason that they usually are is that it's traditional to work towards a goal of rewriting the compiler in the language it implements.

(This one is an actual question that I received.) Yes, us mere mortals can write compilers.

On the one hand, yes, there are a bunch of things that you need to know for writing a compiler that you don't really need to know to, e.g., write a webapp in JavaScript. On the other hand, the converse is also true, so it's really a matter of what you're familiar with and what you're interested in; part of the point of this project as far as I'm concerned is to get better at the kinds of things you need to know to write compilers.

Personally, a lot of reading and a lot of learning-by-doing-badly. Admittedly, Woody and I both have formal computer science education; Woody majored in it, and I was majoring in it until I switched to mathematics. This was useful mostly because it included a somewhat crunchy (for undergrads) algorithms and data structures course. I understand that the program offered a course in DFAs, which I never took, but which could have been useful.

A real gem is Sarkar, Waddell, and Dybvig's A Nanopass Framework for Compiler Education, which I'm slowly working through. Jeff65 generally follows the nanopass approach.

Early on, when we were using a hand-coded top-down parser, Simple Top-Down Parsing in Python was helpful, though we ran into issues with this approach, since that style of top-down parsing is better suited to expressions than to a full language. We ended up rewriting to use the ANTLR4 parser generator, though I could see this being replaced in the future. I've heard good things about Ghuloum's An Incremental Approach to Compiler Construction, though I've only skimmed it, and we have not followed its suggested order much at all. There's a well-known book on compiler theory called The Dragon Book. (This is not the actual title, but if you call it by its real name no-one will know what you're talking about.) I haven't touched it, and it's well-known mostly for being very, very tough reading.

In which we tour the compiler in order of what happens to the code.

Code usually starts out life as some text, even in image-based systems like Smalltalk. In jeff65, we take the usual approach, and expect code to live in a text file. Lexing is the process by which program text is converted from a stream of characters into a stream of tokens. Tokens group substrings of the program text into a single language element. For example

1 + foo * 3

becomes

NUMERIC("1") OPERATOR_PLUS("+") IDENTIFIER("foo") OPERATOR_TIMES("*") NUMERIC("3")

This doesn't really do much for the structure of the program--the flat stream of characters becomes a flat stream of tokens. However, it does make the next step much easier.
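To make this concrete, here's a minimal sketch of a regex-based lexer, in the spirit of what's described above. This is not jeff65's actual lexer; the token names just follow the examples in this section:

```python
import re

# Each token kind gets a named regex group; alternation order matters,
# since earlier patterns win ties.
TOKEN_SPEC = [
    ("NUMERIC",        r"\d+"),
    ("OPERATOR_PLUS",  r"\+"),
    ("OPERATOR_TIMES", r"\*"),
    ("IDENTIFIER",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WHITESPACE",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    """Yield (kind, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

print(list(lex("1 + foo * 3")))
# → [('NUMERIC', '1'), ('OPERATOR_PLUS', '+'),
#    ('IDENTIFIER', 'foo'), ('OPERATOR_TIMES', '*'), ('NUMERIC', '3')]
```

Note that the output is still just a flat list, matching the token stream shown above.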

Our lexer is currently a hand-coded affair. ANTLR4 does have a built-in system for generating lexers, but by the time we started using it, we already had a lexer which worked fairly well, and it wasn't clear how to get ANTLR4 to understand some of the idiosyncrasies of gold-syntax. In particular, we tokenize hello-world into IDENTIFIER("hello-world"), and hello - world into IDENTIFIER("hello") OPERATOR_MINUS("-") IDENTIFIER("world").

Of course, the lexer ended up being partially rewritten in the process of adjusting it to work with ANTLR4, in particular to make it use regular expressions.
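One way to express that hyphen rule with regular expressions is to let the identifier pattern greedily absorb hyphens that have no surrounding whitespace. This is only a sketch of the idea, not jeff65's actual rules, which may differ in the details:

```python
import re

# A hyphen flanked by identifier characters is part of the identifier;
# a free-standing hyphen falls through to OPERATOR_MINUS.
TOKEN_SPEC = [
    ("IDENTIFIER",     r"[A-Za-z_][A-Za-z0-9_]*(?:-[A-Za-z0-9_]+)*"),
    ("OPERATOR_MINUS", r"-"),
    ("WHITESPACE",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "WHITESPACE"]

print(lex("hello-world"))
# → [('IDENTIFIER', 'hello-world')]
print(lex("hello - world"))
# → [('IDENTIFIER', 'hello'), ('OPERATOR_MINUS', '-'), ('IDENTIFIER', 'world')]
```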

In general, when writing a compiler, one wants to work with trees, rather than flat streams. This transformation is effected using a parser, which takes our token stream

NUMERIC("1") OPERATOR_PLUS("+") IDENTIFIER("foo") OPERATOR_TIMES("*") NUMERIC("3")

and turns it into an abstract syntax tree (or "AST"):

add
  numeric :value 1
  mul
    identifier :name 'foo'
    numeric :value 3

Something to notice here is that our expression is now nested--the multiplication has been grouped as an argument to the addition, which gets us operator precedence!
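To illustrate how precedence can fall out of grammar structure, here's a sketch of a tiny recursive-descent parser (again, not jeff65's code, which delegates this to ANTLR4). Because the `expr` rule calls `term`, and `term` calls `factor`, the `*` operator binds tighter than `+` automatically:

```python
def parse(tokens):
    """Parse a token stream into a nested-tuple AST."""
    toks = list(tokens) + [("EOF", "")]
    pos = 0

    def peek():
        return toks[pos]

    def take():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def factor():  # atoms: numbers and identifiers
        kind, text = take()
        if kind == "NUMERIC":
            return ("numeric", int(text))
        return ("identifier", text)

    def term():  # handles '*', so it nests below '+'
        node = factor()
        while peek()[0] == "OPERATOR_TIMES":
            take()
            node = ("mul", node, factor())
        return node

    def expr():  # handles '+'
        node = term()
        while peek()[0] == "OPERATOR_PLUS":
            take()
            node = ("add", node, term())
        return node

    return expr()

ast = parse([("NUMERIC", "1"), ("OPERATOR_PLUS", "+"),
             ("IDENTIFIER", "foo"), ("OPERATOR_TIMES", "*"),
             ("NUMERIC", "3")])
print(ast)
# → ('add', ('numeric', 1), ('mul', ('identifier', 'foo'), ('numeric', 3)))
```

The result has the same shape as the tree shown above: the multiplication is nested as an argument to the addition.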

How this happens is... a big subject. We do it in two stages. For the first stage, we use a parser generated by a parser generator called ANTLR4, which takes the grammar file and converts it into a program which ingests the token stream and produces a tree which is similar to the one above.

For the second stage, we walk the tree produced by ANTLR4, and output a tree with mostly the same structure using our own data structures. This happens in ast.py. Some structures are simplified here, and a lot of data about the source file which is no longer relevant is thrown away. (For example, 1 + 2 * 3 and 1 + (2 * 3) will end up producing the same tree. Once we've established the tree structure, we don't need the parentheses anymore.)

All of the modifications that get made at this stage are in the grammar file and ast.py, particularly in the AstBuilder class, which implements the Visitor pattern. (Take note of that pattern. It'll show up a lot.)
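For readers unfamiliar with the Visitor pattern, here's a minimal sketch of the dispatch idea: each node kind gets a `visit_<kind>` method, looked up by name. jeff65's AstBuilder is more involved, and the `Evaluator` below is a made-up example (jeff65 builds trees rather than evaluating them), but the mechanism is the same:

```python
class Visitor:
    """Dispatch on the node's kind tag (the first tuple element)."""
    def visit(self, node):
        method = getattr(self, "visit_" + node[0], self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(node[0])

class Evaluator(Visitor):
    """Folds an expression tree of the shape shown above into a value."""
    def __init__(self, env):
        self.env = env  # maps identifier names to values

    def visit_numeric(self, node):
        return node[1]

    def visit_identifier(self, node):
        return self.env[node[1]]

    def visit_add(self, node):
        return self.visit(node[1]) + self.visit(node[2])

    def visit_mul(self, node):
        return self.visit(node[1]) * self.visit(node[2])

tree = ("add", ("numeric", 1),
        ("mul", ("identifier", "foo"), ("numeric", 3)))
print(Evaluator({"foo": 2}).visit(tree))  # → 7
```

The appeal of the pattern is that each tree transformation lives in its own visitor class, which fits nicely with the nanopass approach mentioned earlier.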

If this section seems like a bit of a cop-out... well, in a sense, it is. If we were writing our own parser, a much deeper understanding of how parsing happens would be necessary, but we're currently delegating that to a library.