# Lecture 01 - Introduction

In the first part of this lecture, you learned about formal languages: grammars, parsers, and finite state automata.
In this part, we will see an **application area**: compilers and interpreters - and how to implement them. 

These notebooks are not complete -- please check the lecture slides for more information.

## Stages of a Compiler

Remember that we typically want to go from source code like this

```
x := (3 + 4) + 5
x + 2
```

to assembly code.

### Lexical Syntax

**Concrete syntax** represents the source code. 
This source code is usually represented as a list of characters. 

In OCaml, we therefore represent the concrete syntax of an example program as a *string*:

In [1]:
let source_code : string =
"x := (3 + 4) + 5
x + 2"

val source_code : string = "x := (3 + 4) + 5\nx + 2"


We will see the relevant ``string`` functions in OCaml once we build our own lexers. For now, have a look at the OCaml documentation on strings at https://v2.ocaml.org/api/String.html if you want to know more.
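
As a small preview, the following sketch views a source string as a list of characters, using only ``String.length``, ``String.get`` and ``List.init`` from the standard library. The helper name ``explode`` is an assumption made for this illustration; it is not a standard library function.

```ocaml
(* The example program from the beginning of this section. *)
let source_code : string = "x := (3 + 4) + 5\nx + 2"

(* explode: hypothetical helper that turns a string into a list of characters. *)
let explode (s : string) : char list =
  List.init (String.length s) (String.get s)

let chars : char list = explode source_code

(* String.get s i returns the character at index i; s.[i] is shorthand for it. *)
let first_char : char = String.get source_code 0
```

The list ``chars`` has one element per character of ``source_code``, including the newline between the two lines.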

### Phrasal Syntax

A lexer converts this list of characters into a list of **words**. 
We call these words **tokens**.

We represent tokens as an **algebraic datatype**.
If you need a quick reminder, you can find good explanations here: https://cs3110.github.io/textbook/chapters/data/variants.html and here: https://cs3110.github.io/textbook/chapters/data/algebraic_data_types.html. 


In our case, a token is
- either an identifier ``ID``, with a value of type ``string`` - written e.g. ``ID "x"`` for an identifier *x*,
- or an integer ``INT``, with a value of type ``int`` - written e.g. ``INT 3``,
- or a symbol that carries no value: the assignment sign ``ASGN``, or one of ``PLUS``, ``MINUS``, ``STAR``, ...
- or the end-of-file symbol ``EOF``, which signals that we have reached the end of the file.

Because tokens are represented as an algebraic data type, a token can be **nothing else**.

In [2]:
type token = ID of string | INT of int
           | ASGN
           | PLUS  | MINUS | STAR | SLASH 
           | LBRA | RBRA | EOF

type token =
    ID of string
  | INT of int
  | ASGN
  | PLUS
  | MINUS
  | STAR
  | SLASH
  | LBRA
  | RBRA
  | EOF
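
One practical consequence is that a function on tokens can handle every case with a single pattern match, and the compiler warns if a constructor is forgotten. The sketch below illustrates this with a printing function; the name ``string_of_token`` and the chosen output format are assumptions made for this example. The type is repeated so that the block runs on its own.

```ocaml
(* Token type repeated from above so that this block is self-contained. *)
type token = ID of string | INT of int
           | ASGN
           | PLUS  | MINUS | STAR | SLASH
           | LBRA | RBRA | EOF

(* string_of_token: hypothetical helper; the output format is an arbitrary choice. *)
let string_of_token (t : token) : string =
  match t with
  | ID name -> "ID(" ^ name ^ ")"
  | INT n -> "INT(" ^ string_of_int n ^ ")"
  | ASGN -> ":="
  | PLUS -> "+"
  | MINUS -> "-"
  | STAR -> "*"
  | SLASH -> "/"
  | LBRA -> "("
  | RBRA -> ")"
  | EOF -> "<eof>"
```

Removing any one of the cases makes OCaml report a non-exhaustive pattern match warning.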


A whole program is then a list of tokens.
Recall lists with elements of type ``'a``, written ``'a list``, which can be either:
- an empty list ``[]``
- a list ``x :: xs`` where ``x : 'a`` and ``xs : 'a list``.

See here: https://cs3110.github.io/textbook/chapters/data/lists.html for a reminder.

In [3]:
[];; 

2 :: 3 :: [];; 

[2; 3]

- : 'a list = []


- : int list = [2; 3]


- : int list = [2; 3]
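
The two cases of a list correspond directly to the two branches of a pattern match. As a minimal sketch, the function below counts the elements of a list; the name ``count`` is chosen for this example only, and the standard library already provides ``List.length`` for the same purpose.

```ocaml
(* count: hypothetical helper, equivalent to List.length. *)
let rec count (l : 'a list) : int =
  match l with
  | [] -> 0                    (* the empty list has no elements *)
  | _ :: xs -> 1 + count xs    (* one element plus the rest of the list *)
```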


E.g., the previous program would be translated into the following list of tokens: 

In [4]:
let phrasal_syntax : token list =
[ID "x"; ASGN; LBRA; INT 3; PLUS; INT 4; RBRA; PLUS; INT 5; ID "x"; PLUS; INT 2; EOF]

val phrasal_syntax : token list =
  [ID "x"; ASGN; LBRA; INT 3; PLUS; INT 4; RBRA; PLUS; INT 5; ID "x"; PLUS;
   INT 2; EOF]
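
To give a rough idea of how such a token list could be computed, here is a minimal hand-written lexer sketch. It is not the lexer developed later in the course. It assumes that identifiers consist of letters only, that integers are non-negative digit sequences, and that whitespace merely separates tokens. The names ``lex``, ``scan_while``, ``is_digit`` and ``is_letter`` are hypothetical.

```ocaml
(* Token type repeated from above so that this block is self-contained. *)
type token = ID of string | INT of int
           | ASGN
           | PLUS  | MINUS | STAR | SLASH
           | LBRA | RBRA | EOF

let is_digit (c : char) : bool = '0' <= c && c <= '9'
let is_letter (c : char) : bool = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

(* scan_while p s i: first index j >= i at which p fails (or the end of s). *)
let rec scan_while (p : char -> bool) (s : string) (i : int) : int =
  if i < String.length s && p s.[i] then scan_while p s (i + 1) else i

(* lex: turns a string into a token list; fails on unexpected characters. *)
let lex (s : string) : token list =
  let n = String.length s in
  let rec go i =
    if i >= n then [EOF]
    else
      match s.[i] with
      | ' ' | '\t' | '\n' | '\r' -> go (i + 1)   (* skip whitespace *)
      | '+' -> PLUS :: go (i + 1)
      | '-' -> MINUS :: go (i + 1)
      | '*' -> STAR :: go (i + 1)
      | '/' -> SLASH :: go (i + 1)
      | '(' -> LBRA :: go (i + 1)
      | ')' -> RBRA :: go (i + 1)
      | ':' when i + 1 < n && s.[i + 1] = '=' -> ASGN :: go (i + 2)
      | c when is_digit c ->
          let j = scan_while is_digit s i in
          INT (int_of_string (String.sub s i (j - i))) :: go j
      | c when is_letter c ->
          let j = scan_while is_letter s i in
          ID (String.sub s i (j - i)) :: go j
      | c ->
          failwith (Printf.sprintf "unexpected character %C at position %d" c i)
  in
  go 0
```

Under these assumptions, ``lex "x := (3 + 4) + 5\nx + 2"`` produces the thirteen tokens of the example program, ending in ``EOF``.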


### Abstract Syntax

Even lists of tokens are not easy to analyse. 
Instead, a program is typically represented by a syntax tree. 

Again, we represent such trees as algebraic data types.

In [5]:
type op = Plus | Minus | Mult | Div 
type exp = Id of string | Numb of int | Op of exp * op * exp 
type cmd = Asgn of string * exp 
type program = cmd list * exp

type op = Plus | Minus | Mult | Div


type exp = Id of string | Numb of int | Op of exp * op * exp


type cmd = Asgn of string * exp


type program = cmd list * exp


Note that this time the type is recursive: an ``Op`` expression contains two further expressions as its operands.

E.g., we can represent the previous program as:

In [9]:
let abstract_syntax : program = 
([Asgn ("x", Op (Op (Numb 3, Plus, Numb 4), Plus, Numb 5))], Op (Id "x", Plus, Numb 2))

val abstract_syntax : program =
  ([Asgn ("x", Op (Op (Numb 3, Plus, Numb 4), Plus, Numb 5))],
   Op (Id "x", Plus, Numb 2))
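
A tree is easy to analyse because a function can follow the recursive structure of the type. As an illustration, here is a minimal interpreter sketch for this abstract syntax. The semantics is an assumption made for this example: assignments are executed in order, variables are stored in an association list, and the final expression gives the result. The names ``env``, ``apply_op``, ``eval_exp``, ``exec_cmd`` and ``run`` are hypothetical.

```ocaml
(* Types repeated from above so that this block is self-contained. *)
type op = Plus | Minus | Mult | Div
type exp = Id of string | Numb of int | Op of exp * op * exp
type cmd = Asgn of string * exp
type program = cmd list * exp

(* Assumed environment: maps variable names to their current values. *)
type env = (string * int) list

let apply_op (o : op) (a : int) (b : int) : int =
  match o with
  | Plus -> a + b
  | Minus -> a - b
  | Mult -> a * b
  | Div -> a / b   (* integer division; raises Division_by_zero if b = 0 *)

(* One case per constructor of exp; Op recurses into both operands. *)
let rec eval_exp (env : env) (e : exp) : int =
  match e with
  | Numb n -> n
  | Id x -> List.assoc x env   (* raises Not_found for an unbound variable *)
  | Op (l, o, r) -> apply_op o (eval_exp env l) (eval_exp env r)

(* An assignment adds a binding in front, shadowing any older one. *)
let exec_cmd (env : env) (Asgn (x, e) : cmd) : env =
  (x, eval_exp env e) :: env

let run ((cmds, result) : program) : int =
  let env = List.fold_left exec_cmd [] cmds in
  eval_exp env result

let abstract_syntax : program =
  ([Asgn ("x", Op (Op (Numb 3, Plus, Numb 4), Plus, Numb 5))],
   Op (Id "x", Plus, Numb 2))
```

With these definitions, ``run abstract_syntax`` first binds ``x`` to 12 and then evaluates ``x + 2`` to 14.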


### Code Generator

Lastly, we want to output machine instructions. 
For the above program, this could be code in MIPS assembly. 

This representation will require a little more work, but in general we again represent MIPS instructions via algebraic data types: 

In [11]:
type register = int

let t8 = 24 (* $t8 *)
let t9 = 25 (* $t9 *)

type instruction = Li of register * int 
            | Push of register
            | AddiR of register * register * register
            (* ... | and many more instructions *)

type code = instruction list

let example_code = [Li (t8, 3); Push t8] (* ... further instructions to come... *)

type register = int


val t8 : int = 24


val t9 : int = 25


type instruction =
    Li of register * int
  | Push of register
  | AddiR of register * register * register


type code = instruction list


val example_code : instruction list = [Li (24, 3); Push 24]
