# Arithmetic Parser

Creating a simple arithmetic parser in Python involves several steps:

- **Tokenization**: Break the input string into a list of tokens. For arithmetic expressions, the tokens are numbers, parentheses, and operators such as `+`, `-`, `*`, `/`.
- **Parsing**: Convert the list of tokens into a structure that reflects operator precedence and associativity, typically an abstract syntax tree (AST).
- **Evaluation**: Walk the AST and calculate the result of the expression. The parser below folds this step into parsing: it computes values directly instead of building an explicit tree.

In [2]:
import re

# Step 1: Tokenization
def tokenize(expression):
    # Integers, operators, and parentheses; any other character is silently skipped
    return re.findall(r"\d+|[-+*/()]", expression)

# Step 2: Parsing
def parse(tokens):
    # expression: terms joined by '+' or '-' (lowest precedence)
    def evaluate(tokens):
        args = [term(tokens)]
        while tokens and tokens[0] in ("+", "-"):
            op = tokens.pop(0)
            if op == '+':
                args.append(term(tokens))
            else:
                args.append(-term(tokens))
        return sum(args)

    # term: factors joined by '*' or '/' (binds tighter than '+' and '-')
    def term(tokens):
        args = [factor(tokens)]
        while tokens and tokens[0] in ("*", "/"):
            op = tokens.pop(0)
            if op == '*':
                args.append(factor(tokens))
            else:
                args.append(1 / factor(tokens))
        result = 1
        for arg in args:
            result *= arg
        return result

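    # factor: a number or a parenthesized expression
    def factor(tokens):
        if tokens[0] == '(':
            tokens.pop(0)  # Remove '('
            result = evaluate(tokens)
            if not tokens or tokens.pop(0) != ')':  # Remove ')'
                raise ValueError("Expected ')'")
        else:
            result = float(tokens.pop(0))
        return result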

    return evaluate(tokens)

# Step 3: Evaluation (parse() computes the value while parsing, so this only chains the steps)
def evaluate_expression(expression):
    tokens = tokenize(expression)
    return parse(tokens)

In [3]:
evaluate_expression("2+3")  # 5

5.0

In [4]:
# let's check (2+3)*4 which should be 20
evaluate_expression("(2+3)*4")  # 20

20.0

In [5]:
# let's try nested parentheses: (2+3)*((4+5)/3) should be 15
evaluate_expression("(2+3)*((4+5)/3)")  # 15

15.0

In [6]:
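# let's add some text in between, like Valdis and RBS
# the regex tokenizer silently skips characters it does not recognize,
# so "2 + Valdis RBS 6" is tokenized as ['2', '+', '6'] and should give 8
evaluate_expression("2 + Valdis RBS 6")  # 8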

8.0
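
The regex tokenizer drops anything it does not recognize, which is why the expression above still evaluates. If that is not the desired behavior, a stricter tokenizer can reject unknown characters instead. The following is a minimal sketch under that assumption; `tokenize_strict` and its error message are illustrative and not part of the code above.

```python
import re

WHITESPACE_RE = re.compile(r"\s*")
TOKEN_RE = re.compile(r"\d+|[-+*/()]")

def tokenize_strict(expression):
    """Tokenize like tokenize(), but raise on characters that are not part of a token."""
    tokens = []
    pos = 0
    while True:
        # Skip whitespace between tokens (this pattern always matches, possibly empty)
        pos = WHITESPACE_RE.match(expression, pos).end()
        if pos == len(expression):
            break
        match = TOKEN_RE.match(expression, pos)
        if match is None:
            raise ValueError(
                f"Unexpected character {expression[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens

print(tokenize_strict("(2+3)*4"))  # ['(', '2', '+', '3', ')', '*', '4']
```

With this version, `tokenize_strict("2 + Valdis RBS 6")` raises a `ValueError` pointing at `'V'` instead of returning `['2', '+', '6']`.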

## Building your own tokenizer without regular expressions

In [14]:
# We could write our own tokenizer without regex, but that is more work.
# One way to do it is with a finite state machine (FSM).
# The tokenizer should support parentheses, +, -, *, /, and integers.

# There is a state for each type of token, plus a "start" state.
# We keep a current state and move to the next state based on the current character.

# The function uses:
# - a dictionary of transitions
# - a dictionary of final states
# - a dictionary of token values

# It takes a string and returns a list of (token type, token text) pairs.
# If the syntax is incorrect, it raises an exception.

# We call this function tokenize_fsm.

def tokenize_fsm(raw_string):
    # we will have a dictionary of transitions
    transitions = {
        "start": {
            "digit": "integer",
            "(": "lparen",
            ")": "rparen",
            "+": "plus",
            "-": "minus",
            "*": "mult",
            "/": "div",
            "whitespace": "start"
        },
        "integer": {
            "digit": "integer",
            "whitespace": "start"
        },
        "lparen": {
            "whitespace": "start"
        },
        "rparen": {
            "whitespace": "start"
        },
        "plus": {
            "whitespace": "start"
        },
        "minus": {
            "whitespace": "start"
        },
        "mult": {
            "whitespace": "start"
        },
        "div": {
            "whitespace": "start"
        }
    }
    # we will have a dictionary of final states
    final_states = {
        "integer": True,
        "lparen": True,
        "rparen": True,
        "plus": True,
        "minus": True,
        "mult": True,
        "div": True
    }
    # we will have a dictionary of token values
    token_values = {
        "integer": "",
        "lparen": "(",
        "rparen": ")",
        "plus": "+",
        "minus": "-",
        "mult": "*",
        "div": "/"
    }
    # we will have a current state
    current_state = "start"
    # we will have a current token
    current_token = ""
    # we will have a list of tokens
    tokens = []
    # we will iterate over the string
    for char in raw_string:
        # we will determine the type of character
        if char.isdigit():
            char_type = "digit"
        elif char == " ": # could add support for other whitespace characters
            char_type = "whitespace"
        else:
            char_type = char
        # we will check if the transition is valid
        if char_type not in transitions[current_state]:
            raise ValueError(f"Invalid syntax at {char}")
        # we will update the current state
        current_state = transitions[current_state][char_type]
        # we will update the current token
        if current_state != "start":
            current_token += char
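        # we will check if the current state is final
        # NOTE: every state except "start" is final, so a token is emitted after each
        # character and the state is reset; multi-digit integers are therefore split
        if current_state in final_states:
            tokens.append((current_state, current_token))
            current_token = ""
            current_state = "start"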
    
    # return [(token_values[token], value) for token, value in tokens]
    return tokens

# TODO: check the balance of parentheses
# TODO: digits should be combined into integers - try modifying the code yourself


In [15]:
# let's get tokens for 5 + 9000
# the goal is [('integer', '5'), ('plus', '+'), ('integer', '9000')]
my_tokens = tokenize_fsm("5 + 9000")
print(my_tokens)  # for now, each digit of 9000 comes out as a separate 'integer' token

[('integer', '5'), ('plus', '+'), ('integer', '9'), ('integer', '0'), ('integer', '0'), ('integer', '0')]


In [16]:
# how about something more difficult, such as nested parentheses
# (2+3)*((4+509)/3)
# the goal is the token texts '(', '2', '+', '3', ')', '*', '(', '(', '4', '+', '509', ')', '/', '3', ')'
my_tokens = tokenize_fsm("(2+3)*((4+509)/3)")
print(my_tokens)  # for now, 509 is split into '5', '0', '9'

[('lparen', '('), ('integer', '2'), ('plus', '+'), ('integer', '3'), ('rparen', ')'), ('mult', '*'), ('lparen', '('), ('lparen', '('), ('integer', '4'), ('plus', '+'), ('integer', '5'), ('integer', '0'), ('integer', '9'), ('rparen', ')'), ('div', '/'), ('integer', '3'), ('rparen', ')')]


In [17]:
# if we pass a badly formatted string, we should get an exception
# let's try adding letters in the middle of the expression
try:
    my_tokens = tokenize_fsm("2 + Val + 3")
except ValueError as e:
    print(f"Got an exception {e}")

Got an exception Invalid syntax at V
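
One possible way to address the two TODOs (combining digits into integers and checking the balance of parentheses) is to stay in an "integer" state while digits keep arriving and to emit the token only when a non-digit or the end of the input is reached. The sketch below is an illustrative variant, not the notebook's own solution: the name `tokenize_fsm_combined`, the depth counter, and the error messages are assumptions, and unlike `tokenize_fsm` it treats any whitespace character as a separator.

```python
def tokenize_fsm_combined(raw_string):
    """Tokenize with two states, "start" and "integer" (illustrative sketch)."""
    operators = {"(": "lparen", ")": "rparen", "+": "plus",
                 "-": "minus", "*": "mult", "/": "div"}
    tokens = []
    state = "start"
    digits = ""   # digits collected while in the "integer" state
    depth = 0     # number of currently open parentheses

    for position, char in enumerate(raw_string):
        if char.isdigit():
            # stay in (or enter) the "integer" state and keep collecting
            state = "integer"
            digits += char
            continue
        if state == "integer":
            # any non-digit ends the pending integer
            tokens.append(("integer", digits))
            state, digits = "start", ""
        if char.isspace():
            continue
        if char not in operators:
            raise ValueError(f"Invalid syntax at {char!r} (position {position})")
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"Unmatched ')' at position {position}")
        tokens.append((operators[char], char))

    if state == "integer":
        tokens.append(("integer", digits))  # flush an integer at end of input
    if depth != 0:
        raise ValueError("Unbalanced parentheses: missing ')'")
    return tokens


print(tokenize_fsm_combined("5 + 9000"))
# [('integer', '5'), ('plus', '+'), ('integer', '9000')]
```

The depth counter only checks that parentheses are balanced; whether the tokens form a valid expression is still left to the parser.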


## EBNF for checking syntax of our arithmetic

Let's write an EBNF grammar for our arithmetic expressions.

In EBNF, `{ }` enclose an element that may be repeated zero or more times, and `[ ]` enclose an optional element. A comma denotes concatenation, `|` separates alternatives, and `;` ends a rule.
```
expression = term , { ( "+" | "-" ) , term } ;
term = factor , { ( "*" | "/" ) , factor } ;
factor = number | "(" , expression , ")" ;
number = "0" | non_zero_digit , { digit } ;
non_zero_digit = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
digit = "0" | non_zero_digit ;
```

TODO: check the above grammar with an EBNF parser.
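
Short of running the grammar through a real EBNF tool, one way to sanity-check it is a hand-written recognizer with one function per rule. The sketch below is an assumption-laden illustration: `matches_grammar` is a hypothetical name, whitespace between tokens is ignored (the grammar itself does not mention whitespace), and the function only answers yes or no without computing a value.

```python
import re

def matches_grammar(text):
    """Return True if text can be derived from the `expression` rule."""
    # Runs of digits become one token; every other non-space character is its own token
    tokens = re.findall(r"\d+|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def number():
        # number = "0" | non_zero_digit , { digit } ;
        tok = peek()
        if tok is not None and re.fullmatch(r"0|[1-9][0-9]*", tok):
            advance()
            return True
        return False

    def factor():
        # factor = number | "(" , expression , ")" ;
        if peek() == "(":
            advance()
            if not expression() or peek() != ")":
                return False
            advance()
            return True
        return number()

    def term():
        # term = factor , { ( "*" | "/" ) , factor } ;
        if not factor():
            return False
        while peek() in ("*", "/"):
            advance()
            if not factor():
                return False
        return True

    def expression():
        # expression = term , { ( "+" | "-" ) , term } ;
        if not term():
            return False
        while peek() in ("+", "-"):
            advance()
            if not term():
                return False
        return True

    # The whole input must be consumed, not just a valid prefix
    return expression() and pos == len(tokens)


print(matches_grammar("(2+3)*((4+5)/3)"))  # True
print(matches_grammar("2+"))               # False
print(matches_grammar("007"))              # False
```

Note that the grammar rejects numbers with leading zeros such as `007`, whereas the regex tokenizer and `float()` used earlier accept them.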

## Parser design decisions

- This is a simple recursive-descent parser. It does not handle malformed input (for example, an empty string or a missing operand raises an `IndexError`) and does not provide detailed error messages.
- The tokenization step uses a regular expression to capture integers, operators, and parentheses; any other character is silently skipped.
- The parsing step converts the list of tokens into a number by recursively breaking the expression down into terms and factors. Precedence comes from the nesting of the functions: `*` and `/` are handled in `term`, one level below `+` and `-`. Subtraction and division are rewritten as adding a negated term and multiplying by a reciprocal, so the result matches left-to-right evaluation up to floating-point rounding.
- Finally, there is no separate evaluation step, because no explicit AST is built. The tree exists only implicitly in the recursive calls inside `parse()`, and each call returns the value of its subexpression.
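
For comparison, the same recursive structure can build an explicit AST and evaluate it in a separate pass. The following is a minimal sketch rather than the notebook's implementation: the names `parse_to_ast` and `evaluate_ast`, the nested-tuple node format `(operator, left, right)`, and the error messages are all assumptions made for illustration.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def parse_to_ast(tokens):
    """Build a nested-tuple AST; binary operators are left-associative."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched

    def expression():
        node = term()
        while tokens and tokens[0] in ("+", "-"):
            op = tokens.pop(0)
            node = (op, node, term())  # the tree built so far becomes the left child
        return node

    def term():
        node = factor()
        while tokens and tokens[0] in ("*", "/"):
            op = tokens.pop(0)
            node = (op, node, factor())
        return node

    def factor():
        if not tokens:
            raise ValueError("Unexpected end of input")
        token = tokens.pop(0)
        if token == "(":
            node = expression()
            if not tokens or tokens.pop(0) != ")":
                raise ValueError("Expected ')'")
            return node
        return float(token)  # raises ValueError for a non-numeric token

    tree = expression()
    if tokens:
        raise ValueError(f"Unexpected token {tokens[0]!r}")
    return tree

def evaluate_ast(node):
    """Evaluate a tree produced by parse_to_ast."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate_ast(left), evaluate_ast(right))
    return node  # a leaf is already a float

tree = parse_to_ast(re.findall(r"\d+|[-+*/()]", "2+3*4"))
print(tree)                # ('+', 2.0, ('*', 3.0, 4.0))
print(evaluate_ast(tree))  # 14.0
```

Keeping the tree separate from evaluation costs a second pass, but the same tree could then also be printed, simplified, or checked without re-parsing. This variant applies subtraction and division directly instead of using negation and reciprocals.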

In [3]:
test_expressions = [
    "2+3",
    "2-3",
    "2*3",
    "6/3",
    "(2+3)*4",
    "(2+3)/5",
    "2+3*4",
    "2*3+4",
]
for expression in test_expressions:
    print(f"{expression} = {evaluate_expression(expression)}")

2+3 = 5.0
2-3 = -1.0
2*3 = 6.0
6/3 = 2.0
(2+3)*4 = 20.0
(2+3)/5 = 1.0
2+3*4 = 14.0
2*3+4 = 10.0
