# [Writing your own programming language and compiler with Python](https://blog.usejournal.com/writing-your-own-programming-language-and-compiler-with-python-a468970ae6df)  
### Thank You [Marcelo](https://blog.usejournal.com/@marcelogdeandrade)!

The purpose of this notebook is to help people who are looking for a way to start developing their first programming language or compiler.  
The original author's source code can be [found on Github](https://github.com/marcelogdeandrade/PythonCompiler).

We'll be using [PLY](http://www.dabeaz.com/ply/) as the [lexer](https://en.wikipedia.org/wiki/Lexical_analysis) and parser, and [LLVMlite](http://llvmlite.readthedocs.io/en/latest/) to generate low-level intermediate code that can then be optimized.  
If you don't want to look that information up now, that's fine because we'll cover the relevant parts later.  
Also, the LLVM documentation contains a helpful tutorial called [Kaleidoscope: Implementing a Language with LLVM](https://llvm.org/docs/tutorial/index.html#kaleidoscope-implementing-a-language-with-llvm) that covers many of the same topics as this project.  
Some of the material in this notebook is for more experienced programmers, so make sure you do a bit of homework on [RPLY](https://github.com/alex/rply), which we'll be using instead of PLY, as well as [LLC](https://llvm.org/docs/CommandGuide/llc.html) and [GCC](https://gcc.gnu.org/).

## Let's Begin

Where shall we begin, anyway?  
A good start is to define a language, which we can call **TOY** for now.

That looks straightforward, but how do we implement that with our programming language?  
We can break that code down into a simpler example:

For a machine to process the instructions you give it, those instructions must follow a well-defined grammar.  
[EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) is a notation for formally describing a language, including a programming language.  
The original author of this tutorial recommends [this post](https://tomassetti.me/ebnf/) for a deeper look at how EBNF grammar works.

## What the EBNF?

Let’s create an EBNF that describes the minimal possible functionality of TOY, which is a sum operation.

Now for the EBNF:

If the underlying structure of the language isn't easy to spot yet, that's okay.  
You'll get better at spotting the patterns with more practice.  
Our programming language will need a bit more functionality to actually be useful.  
First, we want to be able to add as many numbers as we wish.  
Second, we want to be able to do subtraction as well.

As for the EBNF:

We'll also need some output, so let's add print functionality.

Our resulting EBNF will be:
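One way to write such a grammar, consistent with the token names the lexer uses later, appears in the string at the top of the sketch below. The rule names and exact layout are assumptions made for illustration. The function underneath is a small hand-written recognizer (plain `re`, not RPLY) that accepts exactly the programs this assumed grammar describes.

```python
import re

# Assumed EBNF for TOY at this stage (illustrative, not the author's listing).
ASSUMED_GRAMMAR = """
program    = "print", "(", expression, ")", ";" ;
expression = number, { ( "+" | "-" ), number } ;
number     = digit, { digit } ;
digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
"""

# The same structure as a verbose regular expression. Whitespace between
# tokens is optional, mirroring a lexer that ignores spaces.
_PROGRAM = re.compile(
    r"""
    \s* print \s* \( \s*             # "print", "("
    \d+ (?: \s* [+-] \s* \d+ )*      # number, { ("+" | "-"), number }
    \s* \) \s* ; \s*                 # ")", ";"
    """,
    re.VERBOSE,
)


def is_valid_toy(source):
    """Return True if source is a single print statement of the assumed form."""
    return _PROGRAM.fullmatch(source) is not None
```

For example, `is_valid_toy("print(4 + 4 - 2);")` is accepted, while `is_valid_toy("print(4 + );")` is rejected because an operator must be followed by a number.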

We have defined a basic grammar, so now we can move on to translating it to code and writing a program.  
We will be answering two questions:  
Can we translate the grammar to code in a way that we can validate and understand?  
After that, can we compile that code into a binary executable?  

## The Compiler

A compiler is a program that translates code written in one language into another language, in this case machine language.  
In this guide, we'll compile our programming language into LLVM IR and then into machine language.  

![compiling_process](img/compiling_process.png)

With LLVM, you get its optimization passes without having to implement compiler optimizations yourself, and LLVM provides well-developed libraries for building compilers.  
Our compiler can be divided into three components:

* Lexer
* Parser
* Code Generator

For the Lexer and Parser we’ll be using RPLY, which is similar to PLY, but with a more robust API.  
For the Code Generator, we’ll use LLVMlite, a Python library that provides bindings to LLVM components.

### The Lexer

The role of the **Lexer** is to take the program as input and divide it into *tokens*.

![tokenizing](img/tokenizing.png)

We can then use the minimal structures from our EBNF to define our tokens.

Our lexer will divide the statement above into the following list of tokens:
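As a dependency-free illustration of that split, the sketch below tokenizes `print(4 + 4 - 2);` (the input used in `main.py` later) with Python's `re` module instead of RPLY. The token names match the ones defined in `lexer.py`; the `TOKEN_SPEC` table, the `SKIP` entry, and the `tokenize` helper are hypothetical names introduced only for this example.

```python
import re

# (token name, regular expression) pairs, tried in this order.
TOKEN_SPEC = [
    ("PRINT", r"print"),
    ("OPEN_PAREN", r"\("),
    ("CLOSE_PAREN", r"\)"),
    ("SEMI_COLON", r";"),
    ("SUM", r"\+"),
    ("SUB", r"-"),
    ("NUMBER", r"\d+"),
    ("SKIP", r"\s+"),  # whitespace is matched but never emitted
]

# One combined pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source into (name, text) pairs, raising on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for name, text in tokenize("print(4 + 4 - 2);"):
    print(name, text)
```

The loop prints one token per line, starting with `PRINT print` and ending with `SEMI_COLON ;`.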

It's time to begin coding our compiler.  
Create a file in this directory named `lexer.py`.  
This file exists to define our tokens.  
We will use the `LexerGenerator` class from RPLY to create our lexer.

```python
# lexer.py

from rply import LexerGenerator


class Lexer:
    def __init__(self):
        self.lexer = LexerGenerator()

    def _add_tokens(self):
        # print keyword
        self.lexer.add('PRINT', r'print')
        # parentheses
        self.lexer.add('OPEN_PAREN', r'\(')
        self.lexer.add('CLOSE_PAREN', r'\)')
        # semicolon
        self.lexer.add('SEMI_COLON', r'\;')
        # addition and subtraction operators
        self.lexer.add('SUM', r'\+')
        self.lexer.add('SUB', r'\-')
        # number
        self.lexer.add('NUMBER', r'\d+')
        # ignore whitespace (raw string, so \s is not treated as a string escape)
        self.lexer.ignore(r'\s+')

    def get_lexer(self):
        # register every token rule, then build the lexer
        self._add_tokens()
        return self.lexer.build()
```

Let's create a `main.py` file that we can use to coordinate the compiler's three components.  
After you run the next cell, you should see all of the tokens in the input text printed out in the correct order.

```python
from lexer import Lexer

text_input = """print(4 + 4 - 2);"""

lexer = Lexer().get_lexer()
tokens = lexer.lex(text_input)

for token in tokens:
    print(token)
```

```
Token('PRINT', 'print')
Token('OPEN_PAREN', '(')
Token('NUMBER', '4')
Token('SUM', '+')
Token('NUMBER', '4')
Token('SUB', '-')
Token('NUMBER', '2')
Token('CLOSE_PAREN', ')')
Token('SEMI_COLON', ';')
```


When you run the `python main.py` command, the output should be the same as above.  
The token names can be anything you want, as long as they are consistent with the parser we're going to build.

### The Parser

The second component of our compiler is the **Parser**, which analyzes the program's syntax.  
The parser takes the list of tokens created by the lexer and produces a data structure called an [Abstract Syntax Tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) as output.  

![Lexical Analysis](img/lexical_analysis.png)

To implement our parser, we’ll use the structure created with our EBNF as a model.  
RPLY’s parser uses a format similar to the EBNF, so it's pretty straightforward.  
Create a new file named `ast.py` that will contain the classes the parser instantiates to build the AST.
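A minimal sketch of what such node classes could look like is shown below. The class names (`Number`, `BinaryOp`, `Sum`, `Sub`, `Print`) and the `eval` method are assumptions chosen to mirror the token names; each node simply evaluates itself in Python.

```python
class Number:
    """Leaf node holding the text of a NUMBER token."""

    def __init__(self, value):
        self.value = value

    def eval(self):
        return int(self.value)


class BinaryOp:
    """Base class for nodes with a left and a right operand."""

    def __init__(self, left, right):
        self.left = left
        self.right = right


class Sum(BinaryOp):
    def eval(self):
        return self.left.eval() + self.right.eval()


class Sub(BinaryOp):
    def eval(self):
        return self.left.eval() - self.right.eval()


class Print:
    """Statement node that prints the value of its expression."""

    def __init__(self, value):
        self.value = value

    def eval(self):
        print(self.value.eval())


# Tree for print(4 + 4 - 2); built by hand, grouped as (4 + 4) - 2.
Print(Sub(Sum(Number("4"), Number("4")), Number("2"))).eval()  # prints 6
```

One thing to keep in mind: a local file named `ast.py` shadows Python's standard-library `ast` module for scripts run from the same directory, so a different file name may be safer if other tools in the project import `ast`.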

Next, we can create the parser using `ParserGenerator` from RPLY.  
Create a file named `parser.py`.
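RPLY generates the parser from production rules, so there is no parsing loop to write by hand. To make clear what the generated parser has to do, here is a hand-written equivalent in plain Python. It is a stand-in for illustration, not RPLY code: the `parse` function, the `(name, text)` token pairs, and the tuple-shaped tree are all assumptions.

```python
def parse(tokens):
    """Parse (name, text) pairs for: print ( number { (+|-) number } ) ;

    Returns a nested-tuple tree, grouping operators from left to right.
    """
    pos = 0

    def expect(name):
        # Consume the next token if it has the expected name, else fail.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != name:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {name}, found {found}")
        text = tokens[pos][1]
        pos += 1
        return text

    expect("PRINT")
    expect("OPEN_PAREN")
    node = ("number", int(expect("NUMBER")))
    while pos < len(tokens) and tokens[pos][0] in ("SUM", "SUB"):
        op = "sum" if tokens[pos][0] == "SUM" else "sub"
        pos += 1
        node = (op, node, ("number", int(expect("NUMBER"))))
    expect("CLOSE_PAREN")
    expect("SEMI_COLON")
    if pos != len(tokens):
        raise SyntaxError("unexpected tokens after ';'")
    return ("print", node)


tokens = [
    ("PRINT", "print"), ("OPEN_PAREN", "("), ("NUMBER", "4"), ("SUM", "+"),
    ("NUMBER", "4"), ("SUB", "-"), ("NUMBER", "2"), ("CLOSE_PAREN", ")"),
    ("SEMI_COLON", ";"),
]
tree = parse(tokens)
```

Here `tree` is `("print", ("sub", ("sum", ("number", 4), ("number", 4)), ("number", 2)))`, which is the same shape an AST built from node classes would have.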

It's time to update our `main.py` file so that the Parser and Lexer can communicate with each other.

When you run the command  
`$ python main.py`  
you will see the expected output, a printed `6`, along with this warning:  
`parser.py:38: ParserGeneratorWarning: 4 shift/reduce conflicts return self.pg.build()`  
The warning indicates that the parser generator found places where the grammar can be read in more than one way, so there is still work to be done.
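Shift/reduce conflicts like the ones in that warning usually arise when a rule of the form `expression : expression SUB expression` does not say how a chain of operators should be grouped. The sketch below uses plain Python with hypothetical helper names (it is not RPLY code) to evaluate the same chain under both groupings. For `4 + 4 - 2` the two groupings happen to agree, which is why the output still looks right, but for `8 - 4 - 2` they differ.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}


def eval_left_assoc(first, rest):
    """Group as ((a op b) op c): the usual reading of + and -."""
    result = first
    for op, number in rest:
        result = OPS[op](result, number)
    return result


def eval_right_assoc(first, rest):
    """Group as (a op (b op c)): the other reading an ambiguous rule allows."""
    if not rest:
        return first
    op, number = rest[0]
    return OPS[op](first, eval_right_assoc(number, rest[1:]))


# 4 + 4 - 2: both groupings give 6, so the ambiguity stays hidden.
same_left = eval_left_assoc(4, [("+", 4), ("-", 2)])
same_right = eval_right_assoc(4, [("+", 4), ("-", 2)])

# 8 - 4 - 2: (8 - 4) - 2 is 2, but 8 - (4 - 2) is 6.
diff_left = eval_left_assoc(8, [("-", 4), ("-", 2)])
diff_right = eval_right_assoc(8, [("-", 4), ("-", 2)])
```

RPLY's `ParserGenerator` accepts a `precedence` argument, for example `precedence=[('left', ['SUM', 'SUB'])]`, to declare the intended grouping. Whether that clears all four conflicts in this parser is an assumption, since it depends on how the productions are written.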

However, with these two components in place, we have a working front end that evaluates our TOY language directly in Python.  
Now we have to work on creating machine language code, and then we'll need some optimization as well.  
Get ready for code generation using LLVM.

### The Code Generator

The third and last component of our compiler is the **Code Generator**.  
It transforms the AST created by the parser into machine language or an intermediate representation (IR).  
In this case, it’s going to transform the AST into [LLVM IR](https://llvm.org/docs/tutorial/LangImpl03.html).
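To give a feel for what "transforming the AST into LLVM IR" means, the sketch below walks a small expression tree and emits textual IR using string formatting. This is a simplification under stated assumptions: the tutorial builds IR through LLVMlite's builder objects rather than strings, the nested-tuple tree and the `emit_ir` helper are hypothetical, `i32` is an arbitrary choice of integer width, and the generated `main` returns the result instead of printing it.

```python
def emit_ir(tree):
    """Return textual LLVM IR for a main() that returns the value of tree.

    tree is ("number", n) or ("sum" | "sub", left, right).
    """
    lines = []
    counter = 0

    def visit(node):
        # Return the IR operand (a constant or a %temporary) holding node's value.
        nonlocal counter
        if node[0] == "number":
            return str(node[1])
        opcode = {"sum": "add", "sub": "sub"}[node[0]]
        left = visit(node[1])
        right = visit(node[2])
        name = f"%t{counter}"
        counter += 1
        lines.append(f"  {name} = {opcode} i32 {left}, {right}")
        return name

    result = visit(tree)
    header = ["define i32 @main() {", "entry:"]
    footer = [f"  ret i32 {result}", "}"]
    return "\n".join(header + lines + footer) + "\n"


# (4 + 4) - 2
ir_text = emit_ir(("sub", ("sum", ("number", 4), ("number", 4)), ("number", 2)))
print(ir_text)
```

Each operator node becomes one instruction that writes to a fresh temporary, which is the single-assignment style LLVM IR expects.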

LLVM can be really complex to understand, so if you wish to fully understand what is going on, I recommend reading the [LLVMlite documentation](http://llvmlite.readthedocs.io/en/latest/).  
LLVMlite doesn’t have a print function, so you have to define your own.  
To start, let’s create a file named `codegen.py` that will contain the class `CodeGen`.  
This class configures LLVM and saves the IR code.  

Now we must update our `main.py` file so that we can call our `CodeGen` methods.  
Instead of hard-coding the input, store the input `print(4 + 4 + 2);` in a file named `input.toy`.

An important change in the code above involves passing the `module`, `builder`, and `printf` objects to the parser.  
Now we can pass objects directly to the AST where the [LLVM-AST](https://llvm.org/docs/tutorial/LangImpl02.html) is created.  
To do this, we must update the `parser.py` file to receive those objects and pass them on to the AST.