# Parsing Algorithms

**Overview of Parsing**
Parsing is the process of analyzing a raw input (typically source code) according to the rules of a formal grammar to uncover its underlying structure. In compilers and interpreters, parsing bridges the gap between human-readable code and machine-friendly representations by transforming text into structured trees for further semantic checks or code generation.

## Why should you care about parsing?

Parsing is a fundamental concept in computer science, especially in the context of programming languages and compilers. Understanding parsing helps you:
* **Design languages**: Knowing how parsing works allows you to create languages such as Domain Specific Languages with clear and unambiguous syntax.
* **Implement compilers**: Parsing is a crucial step in compiling code, and understanding it helps you build efficient and effective compilers.
* **Debug and optimize**: Understanding parsing can help you identify and fix issues in your code, as well as optimize performance.


## A simple arithmetic parser and interpreter

First let's start with an informal way to create an interpreter for super simple arithmetic expressions. For now we will only support addition and subtraction of integers. We will not use any formal specification or grammar, but let's try to "wing it" and see how far we can get. We will use Python for this example, but the concepts apply to any programming language.

In [4]:
# so we will start with trying to do parser for expressions such as "1 + 2 + 3" just addition and numbers
# to do so we will simply split the string by spaces and then we will check if the first element is a number or not
text = "1 + 2 + 3"
tokens = text.split(" ")
print(tokens)

['1', '+', '2', '+', '3']


In [9]:
# how would we interpret this expression?
# we simply process the tokens one by one
# we will start with the first token and check if it is a number or not
# if it is a number we will store it in a variable and then we will check the next token
# if it is a number we will add it to the variable
result = 0
for token in tokens:
    if token.isnumeric():
        # if the token is a number we will store it in a variable
        result += int(token)
    else:
        # if the token is not a number we will check if it is a plus sign
        if token == "+":
            # if it is a plus sign we will check the next token
            # and add it to the result
            continue
        else:
            # if it is not a plus sign we will raise an error
            print("Invalid token", token)
            raise ValueError("Invalid token")
print(result)

6


In [15]:
# let's create a function that will do the same thing
def evaluate_add_expression(expression):
    tokens = expression.split(" ")
    result = 0
    for token in tokens:
        if token.isnumeric():
            result += int(token)
        else:
            if token == "+":
                continue
            else:
                print("Invalid token", token)
                raise ValueError("Invalid token")
    return result
# let's test the function
print(evaluate_add_expression("50 + 60 + 70 + 1000"))
# try it with negative numbers, what will happen?
# print(evaluate_add_expression("50 + 60 + 70 + -1000"))

1180


### Adding support for subtraction

To improve our simple evaluator we will add support for subtraction. This involves a bit more logic, but we will still keep it simple. We will use the same approach as before, but we will add a new function to handle subtraction. We will also need to modify the `evaluate_add_expression` function to handle the new operator.
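
A minimal sketch of that idea, assuming tokens are still separated by whitespace. The name `evaluate_add_sub_expression` is hypothetical and not part of the original notebook. The function remembers the most recent operator and applies it to the next number, strictly left to right.

```python
def evaluate_add_sub_expression(expression):
    """Evaluate whitespace-separated integers joined by + and -, left to right."""
    result = 0
    pending_op = "+"       # operator that will be applied to the next number
    expect_number = True   # tokens must alternate: number, operator, number, ...
    for token in expression.split():
        if expect_number:
            if not token.isdigit():
                raise ValueError(f"Expected a number, got {token!r}")
            value = int(token)
            result = result + value if pending_op == "+" else result - value
        else:
            if token not in ("+", "-"):
                raise ValueError(f"Expected + or -, got {token!r}")
            pending_op = token
        expect_number = not expect_number
    if expect_number:
        # the input was empty or ended with an operator
        raise ValueError("Expression ended where a number was expected")
    return result

print(evaluate_add_sub_expression("10 - 5 + 3 - 2 + 20"))  # 26
```

Because the expected token type alternates, inputs such as `"1 + + 2"` or `"1 +"` are rejected instead of being silently accepted.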




---

## **Core Components of a Parser**
A traditional parser comprises two main stages:

1. **Lexical Analysis (Lexer/Scanner)**: Converts the raw character stream into a sequence of tokens, each representing atomic elements like identifiers, keywords, literals, or operators.
2. **Syntactic Analysis (Parser)**: Consumes the token stream to build a parse tree or abstract syntax tree (AST) according to production rules defined in the grammar.

Some modern tools use **scannerless parsers**, which merge tokenization and parsing into a single step, treating character sequences and syntactic constructs uniformly.

---
## Compiler versus Interpreter Parsing

In a **compiler**, parsing is a critical phase that translates high-level source code into an intermediate representation (IR) or machine code. The parser checks for syntactic correctness and builds a parse tree or abstract syntax tree (AST) that reflects the program's structure.

In an **interpreter**, parsing is often more dynamic, as it may involve interpreting code on-the-fly. The parser may need to handle incomplete or incremental input, such as in interactive environments or REPLs (Read-Eval-Print Loops). In these cases, the parser must be able to adapt to partial inputs and provide immediate feedback or execution results.
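
As a small illustration of handling partial input, the sketch below uses Python's standard-library `codeop.compile_command`, which returns `None` when the source is a valid but incomplete prefix. The helper name `classify` is an assumption made for this example only.

```python
import codeop

def classify(source):
    """Return 'complete', 'incomplete', or 'invalid' for a chunk of Python source."""
    try:
        code = codeop.compile_command(source)
    except (SyntaxError, ValueError, OverflowError):
        return "invalid"
    # compile_command returns None when more input is needed
    return "incomplete" if code is None else "complete"

for chunk in ["1 + 2", "def f():", "1 + )"]:
    print(repr(chunk), "->", classify(chunk))
```

A REPL can use this kind of three-way answer to decide whether to execute the input, show a continuation prompt, or report an error.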

## **Grammar Families**
Grammars formalize the syntax of a language. The two primary classes are:

* **Regular Grammars**: Describe *regular languages* that can be recognized by finite-state machines (and thus by regular expressions). They cannot express nested or recursive constructs.
* **Context-Free Grammars (CFGs)**: More powerful; can describe recursive, nested structures common in programming languages. Recognized by pushdown automata and form the basis for most parser generators.

A third practical formalism is **Parsing Expression Grammars (PEGs)**, which resemble CFGs but resolve ambiguities by ordered choice and naturally pair with *packrat* (memoizing) parsers.
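
To make the regular versus context-free distinction concrete, here is a hedged sketch: a regular expression is enough to recognize an integer token, while checking arbitrarily nested parentheses needs some memory of the nesting depth (a counter here, standing in for the stack of a pushdown automaton). The names `INTEGER` and `is_balanced` are illustrative only.

```python
import re

INTEGER = re.compile(r"\d+")  # a regular language: one or more digits

def is_balanced(text):
    """Check nested parentheses using a depth counter (a one-symbol stack)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" appeared before its matching "("
                return False
    return depth == 0          # every "(" must have been closed

print(bool(INTEGER.fullmatch("12345")))     # True
print(is_balanced("((1 + 2) * (3 + 4))"))   # True
print(is_balanced("((1 + 2) * 3"))          # False
```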

---

## **Parsing Strategies**
Parsers are classified by the order in which they traverse the parse tree:

* **Top-Down Parsers** build the tree from the root downward, expanding nonterminals as they match tokens. They include LL-family algorithms (e.g., LL(1)) and **recursive descent** parsers.
* **Bottom-Up Parsers** build from the leaves upward, recognizing small constructs and reducing them to nonterminals until reaching the start symbol. The classic example is the LR-family (e.g., LR(1), LALR).

Chart parsers such as **Earley** and **CYK** blend strategies using dynamic programming to handle all CFGs (Earley) or CFGs in Chomsky Normal Form (CYK).
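
The dynamic-programming idea behind CYK can be sketched in a few lines. This is a minimal recognizer (it answers yes or no and builds no tree), and the toy grammar for the language aⁿbⁿ is an assumption chosen for the example: `S -> A B | A C`, `C -> S B`, `A -> "a"`, `B -> "b"`.

```python
def cyk_recognize(word, unary_rules, binary_rules, start="S"):
    """Return True if `word` is derivable from `start` in a CNF grammar."""
    n = len(word)
    if n == 0:
        return False
    # table[i][length - 1] holds the nonterminals deriving word[i:i + length]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, terminal in unary_rules if terminal == ch}
    for length in range(2, n + 1):            # substring length
        for i in range(n - length + 1):       # substring start
            for split in range(1, length):    # size of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "C")), ("C", ("S", "B"))]

print(cyk_recognize("aabb", UNARY, BINARY))  # True
print(cyk_recognize("aab", UNARY, BINARY))   # False
```

The three nested loops over length, start, and split point are where the cubic running time comes from.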

---

## **Key Parsing Algorithms**

| **Algorithm**         | **Strategy & Grammar**                  | **Complexity**              | **Notes**                                                        |
| --------------------- | --------------------------------------- | --------------------------- | ---------------------------------------------------------------- |
| **LL(1)**             | Top-Down, CFG without left recursion    | O(n)                        | Table-driven; requires careful grammar structuring               |
| **Recursive Descent** | Top-Down, backtracking or predictive    | O(n) predictive; exponential worst-case with backtracking | Easy to hand-write; left recursion must be rewritten (e.g., as loops) |
| **LR(1) / LALR**      | Bottom-Up, deterministic CFGs (LALR: a subset) | O(n)                 | Powerful; tables are large, so usually built by generators (e.g., Bison) |
| **Earley**            | Top-Down chart, all CFG                 | O(n³) worst, O(n²) unambiguous, O(n) for most LR(k) grammars | No grammar restrictions; prediction, scanning, and completion steps |
| **CYK**               | Bottom-Up chart, CNF-required CFG       | O(n³)                       | Theoretical importance; impractical for general parsing          |
| **Packrat (PEG)**     | Top-Down with memoization, PEG          | O(n)                        | Linear time via memoization; unlimited lookahead; high memory usage; ordered choice |

Each algorithm offers trade-offs in terms of grammar expressiveness, performance guarantees, ease of implementation, and memory consumption.

---

## **Parse Trees vs. Abstract Syntax Trees**

* **Parse Tree (Concrete Syntax Tree):** Mirrors the structure of the grammar exactly, including all intermediate symbols and sometimes punctuation tokens.
* **Abstract Syntax Tree (AST):** A pruned, higher-level representation that retains only semantically significant constructs (e.g., operator nodes, literal values, control-flow constructs) and omits purely syntactic detail such as parentheses and delimiters.

Transforming a parse tree into an AST simplifies downstream compiler or analysis phases by focusing on the core meaning of the code.
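
Python's own standard-library `ast` module shows this pruning in practice: parentheses influence the shape of the tree but do not appear in it as nodes. The helper name `same_ast` below is hypothetical, and the exact text printed by `ast.dump` varies between Python versions, so no expected output is shown for it.

```python
import ast

def same_ast(source_a, source_b):
    """Compare two expressions by the dump of their abstract syntax trees."""
    tree_a = ast.parse(source_a, mode="eval")
    tree_b = ast.parse(source_b, mode="eval")
    return ast.dump(tree_a) == ast.dump(tree_b)

tree = ast.parse("(1 + 2) * 3", mode="eval")
print(ast.dump(tree.body))  # a BinOp with a nested BinOp; no node for the parentheses

print(same_ast("(1 + 2) * 3", "((1 + 2)) * 3"))  # True: redundant parentheses vanish
print(same_ast("(1 + 2) * 3", "1 + 2 * 3"))      # False: grouping changes the tree
```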

---

## **Choosing an Algorithm**

* For **hand-crafted** parsers or small DSLs, recursive descent or Pratt parsers yield fast development.
* For **production** compilers, LALR or LR(1) via generator tools balance performance and grammar power.
* For **extensible** systems (e.g., IDEs supporting many languages) or educational purposes, Earley offers simplicity at some performance cost.
* For **PEG-style** grammars requiring unambiguous ordered choice, packrat parsing ensures linear performance when memory permits.
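
Since Pratt parsers are mentioned above, here is a minimal sketch of the technique, assuming only integers, the four binary operators, and parentheses. It evaluates while parsing instead of building a tree, and all names (`OPERATORS`, `pratt_eval`) are illustrative rather than taken from the notebook.

```python
import operator
import re

# Binding power (precedence) and implementation of each binary operator.
OPERATORS = {
    "+": (10, operator.add),
    "-": (10, operator.sub),
    "*": (20, operator.mul),
    "/": (20, operator.truediv),
}

def pratt_eval(text):
    """Evaluate an arithmetic expression with a tiny Pratt-style parser."""
    # Simplification: characters that are not digits, operators, or
    # parentheses are silently ignored by this tokenizer.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse(min_power):
        token = advance()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if token == "(":
            left = parse(0)
            if advance() != ")":
                raise SyntaxError("Expected ')'")
        elif token.isdigit():
            left = int(token)
        else:
            raise SyntaxError(f"Unexpected token: {token}")
        # Absorb operators only while they bind tighter than the caller allows.
        while peek() in OPERATORS and OPERATORS[peek()][0] > min_power:
            power, apply = OPERATORS[advance()]
            left = apply(left, parse(power))
        return left

    result = parse(0)
    if peek() is not None:
        raise SyntaxError(f"Unexpected token: {peek()}")
    return result

print(pratt_eval("2 + 3 * 4"))     # 14
print(pratt_eval("10 - 5 - 2"))    # 3
print(pratt_eval("(10 + 2) * 7"))  # 84
```

Adding an operator means adding one entry to the table, which is the main reason this style is popular for hand-written expression parsers.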






## Tooling for Parsing

Usually you do not want to implement a parser from scratch. Instead, you can use existing libraries or tools that provide parsing capabilities. Here are some popular options:

### Parser Generators

Parser generators automate the creation of parsers from formal grammar specifications. They typically generate code for a specific parsing algorithm, such as LALR or LL(1). Some popular parser generators include:
* **ANTLR**: A powerful parser generator that supports multiple languages and generates parsers in Java, C#, Python, and more. ANTLR 4 uses adaptive LL(*) parsing, written ALL(*), and can handle complex grammars. - https://www.antlr.org/
* **Bison**: A widely used parser generator for C/C++ that implements LALR(1) parsing. It is often used in conjunction with Flex for lexical analysis. - https://www.gnu.org/software/bison/
* **Yacc**: An older parser generator for C that also implements LALR(1) parsing. It is less commonly used today but still relevant in some legacy systems. - [https://dinosaur.compilertools.net/yacc/](https://en.wikipedia.org/wiki/Yacc)
* **PEG.js**: A parser generator for JavaScript that uses Parsing Expression Grammars (PEGs). It is suitable for building parsers for DSLs and other applications. - https://github.com/pegjs/pegjs

In [None]:
# so we want to parse input/strings such as "5 + 6 + 10"
# eventually we would want to move to "5+6 * 10+3" and parse that correctly

In [None]:
# for addition we do not need anything fancy

In [1]:
eval("5 + 6 + 7")  # this is dangerous if you do not control the string ! 
# so eval will lex the string and parse it and then actually do the work (meaning summing)

18

In [2]:
text = "5 + 6 + 7"
# no significant whitespace
clean = text.replace(" ","") # part of lexical analysis cleaning whitespace
tokens = clean.split("+") # tokenization again part of lexical analysis
result = sum([int(token) for token in tokens]) # here we skip the tree since all of the tokens are separated by +..
result

18

In [3]:
def add_interpreter(text):
    clean = text.replace(" ","")
    tokens = clean.split("+")
    result = sum([int(token) for token in tokens]) # nice shortcut because we only have +
    return result

In [4]:
add_interpreter("  5+5+10000+5   + 7 + 10  ")

10032

In [None]:
# how about subtraction? then we already need some sort of structure; we could use a stack-based structure to store operations
# "10 - 5 + 3 - 2 + 20"  should be 26

In [None]:
# we could start again by stripping whitespace and similar, as it is not significant here
# an optimization would be to skip the cleaning step and skip whitespace as we go
# then we could save the tokens in some sort of data structure (here stacks would work nicely)
# or we could interpret as we go (so sort of like a REPL)

In [5]:
def sub_add(text):
    acc = 0
    n = 0
    tok = ""
    state = "NUM" # "OP"
    operations = ["+","-"]
    op = ""
    # we really need a state machine here for determining whether we have a number or addition or subtraction
    # so ONE PASS parsing, scannerless parsing since it is so trivial
    # so there is a simple state machine hidden here
    for t in text:
        if t in [" ","\t"]: # same as replace: skip insignificant whitespace
            continue
        if t.isdigit():
#             print(f"Digit is {t} and tok is {tok}")
            if state == "OP":
                state = "NUM"
                tok = "" # not efficient keep building up the NUM
            tok += t
#             print(f"Digit is {t} and tok AFTER is {tok}")
            continue
        if t in operations:
            state = "OP" # FIXME multiple operations error
            print(f"BEFORE operation  {acc} {op} {tok}")
            if op == "+": # we check the previous operation
                acc += int(tok)
                tok = ""
            elif op == "-":
                acc -= int(tok)
                tok = ""
            elif op == "": # first time
                acc = int(tok)
                tok = ""
            print(f"AFTER operation {op} {acc}")
            op = t
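    # apply the last pending operation to the final number
    if op == "+":
        acc += int(tok)
    elif op == "-":
        acc -= int(tok)
    elif tok: # no operator was seen, the input was a single number such as "42"
        acc = int(tok)
    return acc
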

In [6]:
sub_add("10 - 5 + 3 - 2 + 20")


BEFORE operation  0  10
AFTER operation  10
BEFORE operation  10 - 5
AFTER operation - 5
BEFORE operation  5 + 3
AFTER operation + 8
BEFORE operation  8 - 2
AFTER operation - 6


26

In [None]:
# for more complicated operations we will need to build a syntax tree; we can't just use a simple accumulator design
# the above is only sufficient when operations are applied strictly left to right

# one example is given in this course
# https://ruslanspivak.com/lsbasi-part1/

## Full Arithmetic Parser

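To implement a full arithmetic parser we will separate tokenization and parsing. We will use an empty `AST` base class, with subclasses for the nodes of the tree.

We use EBNF to define the grammar. Note that the `integer` rule allows a leading minus sign, which the tokenizer in the implementation further down (`INTEGER = [0-9]+`) does not yet handle.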

Grammar (EBNF):
```

expression ::= term { ("+" | "-") term } ;
term       ::= factor { ("*" | "/") factor } ;
factor     ::= integer | "(" expression ")" ;
digit_excluding_zero ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
digit                ::= "0" | digit_excluding_zero ;
positive_integer ::= digit_excluding_zero, { digit } ;
integer ::= "0" | [ "-" ], positive_integer  ;

```

### EBNF parsers online

You can test your EBNF grammar online with the following tools:

* [EBNF Evaluator](https://mdkrajnak.github.io/ebnftest/)

### BNF parser

BNF is more limited than EBNF: it has no dedicated notation for optional or repeated elements, so these must be expressed with extra rules and recursion.
BNF can define the same grammars as EBNF, but it is more verbose.



* [BNF Playground](https://bnfplayground.pauliankline.com/)

More discussion on EBNF can be found in the [Wikipedia article](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form).

In [None]:
import re
from collections import namedtuple

# Grammar (EBNF):
# expression = term { ("+" | "-") term } ;
# term       = factor { ("*" | "/") factor } ;
# factor     = INTEGER | "(" expression ")" ;
# INTEGER    = [0-9]+ ;

Token = namedtuple('Token', ['type', 'value'])

TOKEN_SPEC = [
    ('INTEGER', r'\d+'),
    ('PLUS',    r'\+'),
    ('MINUS',   r'-'),
    ('MUL',     r'\*'),
    ('DIV',     r'/'),
    ('LPAREN',  r'\('),
    ('RPAREN',  r'\)'),
    ('WS',      r'\s+'),
    ('MISMATCH', r'.'),   # any other character is an error; must stay last
]

master_pattern = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Generate tokens from the input text."""
    for mo in master_pattern.finditer(text):
        kind = mo.lastgroup
        if kind == 'WS':
            continue
        if kind == 'MISMATCH':
            # without this check, finditer would silently skip unknown characters
            raise SyntaxError(f"Unexpected character: {mo.group()!r}")
        value = mo.group()
        yield Token(kind, value)
    yield Token('EOF', '')

# AST nodes
class AST:
    pass

class BinOp(AST):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op    # 'PLUS', 'MINUS', 'MUL', or 'DIV'
        self.right = right

class Num(AST):
    def __init__(self, value):
        self.value = int(value)

# Parser with operator precedence
class Parser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.current_token = next(self.tokens)

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.current_token = next(self.tokens)
        else:
            raise SyntaxError(f"Expected {token_type}, got {self.current_token.type}")

    def parse(self):
        node = self.parse_expression()
        if self.current_token.type != 'EOF':
            raise SyntaxError("Unexpected token after expression")
        return node

    def parse_expression(self):
        # expression = term { (+|-) term }
        node = self.parse_term()
        while self.current_token.type in ('PLUS', 'MINUS'):
            op = self.current_token.type
            self.eat(op)
            right = self.parse_term()
            node = BinOp(node, op, right)
        return node

    def parse_term(self):
        # term = factor { (*|/) factor }
        node = self.parse_factor()
        while self.current_token.type in ('MUL', 'DIV'):
            op = self.current_token.type
            self.eat(op)
            right = self.parse_factor()
            node = BinOp(node, op, right)
        return node

    def parse_factor(self):
        # factor = INTEGER | LPAREN expression RPAREN
        token = self.current_token
        if token.type == 'INTEGER':
            self.eat('INTEGER')
            return Num(token.value)
        elif token.type == 'LPAREN':
            self.eat('LPAREN')
            node = self.parse_expression()
            self.eat('RPAREN')
            return node
        else:
            raise SyntaxError(f"Unexpected token: {token.type}")

# Evaluator
def evaluate(node):
    if isinstance(node, Num):
        return node.value
    if isinstance(node, BinOp):
        left = evaluate(node.left)
        right = evaluate(node.right)
        if node.op == 'PLUS':
            return left + right
        elif node.op == 'MINUS':
            return left - right
        elif node.op == 'MUL':
            return left * right
        elif node.op == 'DIV':
            return left / right  # or integer division // if desired
    raise ValueError("Unknown node type")

# Interpreter function
def interpret(text):
    tokens = tokenize(text)
    parser = Parser(tokens)
    ast = parser.parse()
    return evaluate(ast)

# Examples

print(interpret("3 + 4 * (2 - 5) / 5"))  # 3 + (4*(2-5)/5) = 3 + (4*(-3)/5) = 3 - 12/5 = 0.6
print(interpret("(10 + 2) * 7"))       # (10+2) * 7 = 84


0.6000000000000001
84


## Multiline support and variable assignment

In [2]:
import re
from collections import namedtuple

# Grammar (EBNF):
# program    = statement { NEWLINE statement } ;
# statement  = assignment | expression ;
# assignment = IDENTIFIER "=" expression ;
# expression = term { ("+" | "-") term } ;
# term       = factor { ("*" | "/") factor } ;
# factor     = INTEGER | IDENTIFIER | "(" expression ")" ;

Token = namedtuple('Token', ['type', 'value'])

TOKEN_SPEC = [
    ('INTEGER',    r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_]\w*'),
    ('PLUS',       r'\+'),
    ('MINUS',      r'-'),
    ('MUL',        r'\*'),
    ('DIV',        r'/'),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('EQ',         r'='),
    ('WS',         r'\s+'),
    ('MISMATCH',   r'.'),   # any other character is an error; must stay last
]
master_pattern = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Generate tokens from the input text."""
    for mo in master_pattern.finditer(text):
        kind = mo.lastgroup
        if kind == 'WS':
            continue
        if kind == 'MISMATCH':
            # without this check, finditer would silently skip unknown characters
            raise SyntaxError(f"Unexpected character: {mo.group()!r}")
        yield Token(kind, mo.group())
    yield Token('EOF', '')

# AST nodes
class AST: pass

class BinOp(AST):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op    # 'PLUS', 'MINUS', 'MUL', 'DIV'
        self.right = right

class Num(AST):
    def __init__(self, value):
        self.value = int(value)

class Var(AST):
    def __init__(self, name):
        self.name = name

# Parser with operator precedence and variables
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.current_token = self.tokens[self.pos]

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.pos += 1
            self.current_token = self.tokens[self.pos]
        else:
            raise SyntaxError(f"Expected {token_type}, got {self.current_token.type}")

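    def parse(self):
        node = self.parse_expression()
        # reject trailing tokens, e.g. "x 5" must not be accepted as just "x"
        if self.current_token.type != 'EOF':
            raise SyntaxError("Unexpected token after expression")
        return node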

    def parse_expression(self):
        node = self.parse_term()
        while self.current_token.type in ('PLUS', 'MINUS'):
            op = self.current_token.type
            self.eat(op)
            right = self.parse_term()
            node = BinOp(node, op, right)
        return node

    def parse_term(self):
        node = self.parse_factor()
        while self.current_token.type in ('MUL', 'DIV'):
            op = self.current_token.type
            self.eat(op)
            right = self.parse_factor()
            node = BinOp(node, op, right)
        return node

    def parse_factor(self):
        token = self.current_token
        if token.type == 'INTEGER':
            self.eat('INTEGER')
            return Num(token.value)
        if token.type == 'IDENTIFIER':
            self.eat('IDENTIFIER')
            return Var(token.value)
        if token.type == 'LPAREN':
            self.eat('LPAREN')
            node = self.parse_expression()
            self.eat('RPAREN')
            return node
        raise SyntaxError(f"Unexpected token: {token.type}")

# Evaluator with variable environment

def evaluate(node, env):
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        if node.name in env:
            return env[node.name]
        raise NameError(f"Undefined variable: {node.name}")
    if isinstance(node, BinOp):
        left = evaluate(node.left, env)
        right = evaluate(node.right, env)
        if node.op == 'PLUS': return left + right
        if node.op == 'MINUS': return left - right
        if node.op == 'MUL': return left * right
        if node.op == 'DIV': return left / right
    raise ValueError("Unknown node type")

# Top-level interpreter with multiline and assignment support

def interpret(program_text):
    env = {}
    last_value = None
    for line in program_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if '=' in line:
            # assignment
            name, expr = map(str.strip, line.split('=', 1))
            if not re.match(r'^[A-Za-z_]\w*$', name):
                raise SyntaxError(f"Invalid variable name: {name}")
            tokens = tokenize(expr)
            parser = Parser(tokens)
            ast = parser.parse()
            value = evaluate(ast, env)
            env[name] = value
        else:
            # expression
            tokens = tokenize(line)
            parser = Parser(tokens)
            ast = parser.parse()
            last_value = evaluate(ast, env)
    return last_value


prog = """
x = 5
x * 10
"""
print(interpret(prog))  # 50


50


In [3]:
prog = """
y = 10
x = 5
(x + y) * 2
"""
print(interpret(prog))  # 30

30
