Compiler internals

Here we explain how the compiler works. We link to the relevant code whenever possible, but since code changes we will use this version, which is a recent one and where later changes have a low impact on the general algorithm.

The main file

The main file that is compiled to generate the compiler is src/compiler/crystal.cr. Here all source code relative to Crystal is required and then Crystal::Command.run is executed. The Command module provides a command line interface to the compiler. According to command line options, it creates a Compiler, configures it and then uses it to compile one or more source files.

Let's see what the Compiler class does.

The Compiler class

The main public method of the Compiler class is compile.

First of all a Program is instantiated. A Program represents the top level container of everything. It's like a top level module that can have classes and methods. It's similar to Ruby's main when you do puts self at the top-level. However, unlike Ruby, when you define a method at the top level it gets defined in this Program, not as a private method of Object.

As you can see in Program's source code some basic types common to all programs are defined, like Object, Nil and String.

The Program is also a container of data associated to a single compilation, so for example it keeps track of all the symbols that were used (symbols can't be dynamically created), as well as some configurations, like CRYSTAL_PATH (similar to Ruby's $LOAD_PATH, only it is immutable).

Going back to Compiler#compile, a Program is created and configured. Then the source code is parsed (also here). After parsing each file into an AST it is normalized. Normalization consists of transforming some AST nodes into others. The most important transformation is transforming a require into AST nodes that result from actually requiring that file. Other transformations are, for example, transforming an unless to an if by simply inverting the branches.

At the end of this stage we will have an AST node representing the whole program, with all requires expanded (the special "prelude" file is automatically required). This AST node will probably be an Expressions node, which just represents more than one AST node.

The next step is the most important one: type inference.

Type inference

The name of this stage is actually misleading. It's called infer_type in the source code, but many things happen here. This has two important consequences: 1) The code is harder to follow and understand, because many concerns are mixed, and 2) The compiler is faster, as the whole program is traversed just once. We believe point 2 is more important than 1, as there will be more developers using the compiler than developers developing the compiler, and compile times are very important for us.

Let's take a look at what Program#infer_type(node) does.

The first and most important thing it does is to create a TypeVisitor to traverse that AST node. This makes use of the Visitor Pattern, which is one that is heavily used across the compiler and it's one of the most useful ways to deal with an AST node in a generic way. Because of Crystal's multi-dispatch feature implementing the Visitor Pattern is very easy, as no manual double-dispatch is needed.

The TypeVisitor does many things:

It declares types and methods
It binds AST nodes between each other to propagate type information
It analyzes variable types and their flow

Type and methods declaration

When the user writes this:

class Foo
  class Bar
    def baz
      1
    end
  end
end

we want a Foo class to be declared (or reopened), a Bar class to be declared (or reopened) inside it, and a baz method inside it to be declared (or redefined).

For this, the TypeVisitor has a stack of types. The stack starts with a single element, @mod.

Note: throughout the code the word mod will be a synonym of the Program, as it is the global module that's always accessible. In some other cases program is used. The word mod comes from old times and we could change it now to program, only mod is very short and convenient.

Back to the stack of types, the stack starts with the Program. That means that when a type is defined, it will be defined in that type (the Program). Then when processing the class' body, this new type is pushed to the stack so new types will be defined underneath it. After processing the class' body the stack is popped. You can see this in action in visit(node : ClassDef). In that method there is a lot more code than just that, but it's mostly validations (for example: superclass mismatch, reopening a type as a module, namespace not found, etc), dealing with definition of generic types, running hooks (inherited).

A similar process happens in other definitions: Module, Enum, Lib, Alias, Include, Extend, Def and Macro.

AST nodes binding

To understand this section better we highly recommend you to read this blog post and this other one.

In short, the whole type inference algorithm is based on binding AST nodes to each other. When you bind a node A to B, A's type gets B's type. If B's type changes, A's type changes as well. If A is bound to another node C, A's type will be the union of B's type and C's type.

Every ASTNode has a bind_to method to bind itself to one or more nodes. When you bind a node A to a node B, B is added to A's dependencies and at the same time B is added A as an observer. The result is that A's dependencies will be [B], and B's observers will be [A].

After a node is bound to others, its type is recomputed by doing a type merge of the dependencies' types. How types are merged is explained in the appendix. After the new type is computed it is propagated to suscribed observers by invoking their update method. The update method does more or less the same thing: compute the new type based on the dependencies, only this information is not yet propagated. The node is marked as dirty and after all observers are updated propagate is invoked on them. This makes the propagation happen in small steps, preventing extra propagations.

Note: the above code is in the semantic/ast.cr file. There's also syntax/ast.cr, which defines the AST nodes and their properties. Everything under the semantic directory has something to do with the semantic stage, and it can (and does, a lot) reopen AST nodes to add more functionality. This allows grouping functionality in different files without having a huge ast.cr file with all the functionality mixed.

Where is bind_to used? Let's see what happens when you write something like this:

a = 1

This is an Assign node, so the TypeVisitor will visit it. Since the target (the left-hand side) is a Var, this method will be invoked. First, the value is visited. In this case it's a NumberLiteral. Assigning a type to a NumberLiteral is easy: if it's an Int32 literal the type will be Int32. These types are well known, already defined (as we saw before) and accessed via the mod variable, which is the always-present Program. This is one of the few cases where bindings are not used. Other cases are Nil, Bool, Char and other primitive types.

Going back to the type_assign method we can see that (amongst many other things) the target is effectively bound to the value. The node is also bound to the value because the Assign's node type is that of the value.

To be continued...

Provide feedback

Saved searches