In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

In [None]:
import { readFileSync, writeFileSync } from "fs";
import { buildParser } from "@lezer/generator";
import { Tree, TreeCursor } from "@lezer/common";
import { LRParser } from "@lezer/lr";
import { instance } from "@viz-js/viz";
import { 
    ast2dot, genericLezerToAST, ASTConfig, AST, 
    one, many, asOne, asMany, asString 
} from "./AST2Dot";

const viz = await instance();

# Building a Complete Interpreter with Lezer

In this notebook, we build a fully functional interpreter for a simple `C`-like language.
We will implement the Scanner and Parser using **Lezer**, transform the result into a clean **AST**, and finally write an **Interpreter** that executes the code.

## 0. The Language Specification

Our target language supports arithmetic, variables, control flow (`if`, `while`), and function calls.
Formally, the grammar is defined as follows:

```ebnf
program
  : ε
  | stmnt program

stmnt
  : IF '(' bool_expr ')' stmnt
  | WHILE '(' bool_expr ')' stmnt
  | '{' program '}'
  | IDENTIFIER ':=' expr ';'
  | expr ';'

bool_expr
  : expr '==' expr
  | expr '!=' expr
  | expr '<=' expr | expr '>=' expr
  | expr '<'  expr | expr '>'  expr

expr
  : expr '+' product
  | expr '-' product
  | product

product
  : product '*' factor
  | product '/' factor
  | product '%' factor
  | factor

factor
  : '(' expr ')'
  | NUMBER
  | IDENTIFIER
  | IDENTIFIER '(' expr_list ')'

expr_list
  : ε
  | ne_expr_list

ne_expr_list
  : expr
  | expr ',' ne_expr_list
```

## 1. Defining the Lexical Grammar (Tokens)

The **Lexical Grammar** defines how raw text is split into tokens. We will define the content of the `@tokens` block in several parts and assemble them at the end.

### 1.1 Whitespace and Comments

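**Whitespace**: We define `space` to match spaces, tabs, newlines, and carriage returns.

**Line Comments (`// ...`)**: `LineComment` matches `//` and everything up to the end of the line.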

**Block Comments (`/* ... */`)**:
Recognizing C-style block comments is tricky because they have a multi-character ending `*/`. Block comments form a regular language, but a single regular expression for them is hard to read.
Instead, we define a small **State Machine**:

1.  **`BlockComment`** (Entry): Matches the start `/*` and transitions to the state `blockCommentRest`.
2.  **`blockCommentRest`** (Content State): This is the **body** of the comment. It consumes characters that are **not** `*`. If it sees a `*`, it transitions to the "Potential End State" (`blockCommentAfterStar`).
3.  **`blockCommentAfterStar`** (Potential End State): We just saw a `*`. We check the next character to decide if we are really done:
    * If it is `/`: We found the end `*/`. The token is complete.
    * If it is `*`: The comment may end in a run of stars such as `**/`, so we stay in this state (waiting for a slash or non-star).
    * Anything else: It was just a standalone `*` inside the text (like in `2 * 3`), so we go back to the "Content State".
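
The three states can also be traced in ordinary TypeScript. The following is a minimal hand-written sketch and not how Lezer compiles the rules internally; `matchBlockComment` and `CommentState` are names invented for this illustration.

```typescript
// Hypothetical hand-written counterpart of the three token rules.
type CommentState = "rest" | "afterStar";

// Returns the index just past the closing "*/", or -1 if no complete
// block comment starts at position `start`.
function matchBlockComment(input: string, start: number): number {
    // BlockComment (entry): the comment must open with "/*"
    if (input.slice(start, start + 2) !== "/*") return -1;

    let state: CommentState = "rest";
    for (let i = start + 2; i < input.length; i++) {
        const ch = input[i];
        if (state === "rest") {
            // blockCommentRest: consume non-'*', switch on '*'
            if (ch === "*") state = "afterStar";
        } else {
            // blockCommentAfterStar: we just saw a '*'
            if (ch === "/") return i + 1;      // found the end "*/"
            if (ch !== "*") state = "rest";    // lone '*', back to content
            // on another '*' we stay in this state
        }
    }
    return -1; // input ended before "*/" was seen
}

console.log(matchBlockComment("/* a * b **/ x", 0)); // 12
console.log(matchBlockComment("/* open", 0));        // -1
```

The first call passes through every transition: the lone `*` after `a` falls back to the content state, and the run `**` stays in the potential end state until the `/` arrives.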

In [None]:
const tokensComments : string = `
    // 1. Whitespace
    space { $[ \\t\\n\\r]+ }

    // 2. Comments (Finite State Machine)
    LineComment { "//" ![\\n]* }
    
    // Start with /*, then enter the state machine
    BlockComment { "/*" blockCommentRest }
    
    // State 1: Consume anything that is not '*'
    blockCommentRest { ![*] blockCommentRest | "*" blockCommentAfterStar }
    
    // State 2: We just saw a '*'. Check what comes next.
    blockCommentAfterStar { "/" | "*" blockCommentAfterStar | ![/*] blockCommentRest }
`;

### 1.2 Keywords and Identifiers

This section handles names.
* **Keywords**: `if` and `while`.
* **Identifiers**: Variable names.

**The Conflict:**
The input string `"if"` theoretically matches both the specific keyword `KwIf` and the general pattern `Identifier`.
We define them here, but we will need to resolve this ambiguity later using **Precedence**.

In [None]:
const tokensIdent : string = `
    // 3. Keywords
    KwIf { "if" }
    KwWhile { "while" }

    // 4. Identifiers
    // Must start with a letter, can continue with letters, numbers, or underscore.
    Identifier { $[a-zA-Z] $[a-zA-Z0-9_]* }
`;

### 1.3 Literals and Operators

Here we define the static symbols of our language.
* **Numbers**: Note the pattern `"0" | $[1-9] $[0-9]*`. A `Number` token is either a single `0` or starts with a non-zero digit, so an input with a leading zero (like `012`) is never read as one `Number` token.
* **Operators**: We map literal characters (like `+`) to token names (like `OpPlus`).
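
To see which strings the number pattern accepts as a whole, we can write the same pattern as a JavaScript regular expression. This is only an illustration under the assumption that we test one complete string at a time; `numberPattern` and `isNumberToken` are hypothetical names and not part of the grammar.

```typescript
// Regex counterpart of the token rule  "0" | $[1-9] $[0-9]*
// anchored so that the whole string must match.
const numberPattern = /^(0|[1-9][0-9]*)$/;

function isNumberToken(text: string): boolean {
    return numberPattern.test(text);
}

const samples: string[] = ["0", "7", "120", "012", "00"];
for (const sample of samples) {
    console.log(sample, isNumberToken(sample));
}
// 0 true, 7 true, 120 true, 012 false, 00 false
```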

In [None]:
const tokensOps : string = `
    // 5. Literals
    Number { "0" | $[1-9] $[0-9]* }

    // 6. Operators & Punctuation
    OpAssign { ":=" }
    OpEq { "==" } OpNe { "!=" }
    OpLe { "<=" } OpGe { ">=" }
    OpLt { "<" }  OpGt { ">" }
    
    OpPlus { "+" } OpMinus { "-" } 
    OpTimes { "*" } OpDivide { "/" } OpModulo { "%" }
    
    LParen { "(" } RParen { ")" }
    LBrace { "{" } RBrace { "}" }
    Semi { ";" } Comma { "," }
`;

### 1.4 Conflict Resolution & Assembly

Now we assemble the `@tokens` block.
Crucially, we define **Lexical Precedence** here to solve the ambiguities mentioned above:

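1.  `KwIf`, `KwWhile` > `Identifier`: Ensures that `if` and `while` are treated as keywords.
2.  `LineComment`, `BlockComment` > `OpDivide`: Ensures that `//` and `/*` start a comment instead of being read as a division operator.
3.  `@skip`: Not a precedence rule, but declared alongside. It allows whitespace and comments between any two tokens, so the grammar rules never have to mention them.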
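
The idea behind a precedence list can be sketched in plain TypeScript: when several rules match the same text, the one listed first wins. The sketch only models the case described above, a word that matches two rules in full; it makes no claim about how Lezer's tokenizer is implemented. `TokenRule`, `wordRules`, and `classifyWord` are hypothetical names.

```typescript
// Hypothetical model of token precedence: rules are tried in order,
// and the first rule that matches the whole word decides its type.
interface TokenRule {
    tokenName: string;
    pattern: RegExp;
}

// The order mirrors  @precedence { KwIf, KwWhile, Identifier }.
const wordRules: TokenRule[] = [
    { tokenName: "KwIf", pattern: /^if$/ },
    { tokenName: "KwWhile", pattern: /^while$/ },
    { tokenName: "Identifier", pattern: /^[a-zA-Z][a-zA-Z0-9_]*$/ },
];

function classifyWord(word: string): string | null {
    for (const rule of wordRules) {
        if (rule.pattern.test(word)) return rule.tokenName;
    }
    return null; // no rule matches
}

console.log(classifyWord("if"));     // KwIf (although Identifier matches too)
console.log(classifyWord("while"));  // KwWhile
console.log(classifyWord("sum"));    // Identifier
```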

In [None]:
const grammarTokens : string = `
  @tokens {
    ${tokensComments}
    ${tokensIdent}
    ${tokensOps}

    // --- CONFLICT RESOLUTION ---
    // Keywords take precedence over general identifiers
    @precedence { KwIf, KwWhile, Identifier }
    
    // Comments take precedence over division operator
    @precedence { LineComment, BlockComment, OpDivide }
  }

  // Globally skip whitespace and comments
  @skip { space | LineComment | BlockComment }
`;

## 2. Operator Precedence

In arithmetic expressions, ambiguity arises. For `1 + 2 * 3`, two trees are possible: `(1+2)*3` or `1+(2*3)`.

We solve this by defining **Precedence Levels**.
* **`times`**: Highest priority (binds tightest).
* **`plus`**: Medium priority.
* **`compare`**: Lowest priority.

The `@left` annotation indicates **Left Associativity**. `1 - 2 - 3` is parsed as `(1 - 2) - 3`, not `1 - (2 - 3)`.
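
The effect of precedence levels and left associativity can be made visible with a small grouping function. Note that this sketch uses precedence climbing, a different technique from the LR conflict resolution that Lezer performs; it only illustrates which groupings the declarations ask for. `bindingPower` and `group` are hypothetical names, and the input is assumed to be an already tokenized, well-formed expression without parentheses.

```typescript
// Larger number = binds tighter (compare < plus < times).
const bindingPower: { [op: string]: number } = {
    "==": 1, "!=": 1, "<=": 1, ">=": 1, "<": 1, ">": 1,
    "+": 2, "-": 2,
    "*": 3, "/": 3, "%": 3,
};

// Takes a flat token list such as ["1", "+", "2", "*", "3"] and returns a
// fully parenthesized string that shows the grouping.
function group(tokens: string[]): string {
    let pos = 0;

    function parseLevel(minPower: number): string {
        let left = tokens[pos++]; // operand: a number or an identifier
        while (pos < tokens.length) {
            const op = tokens[pos];
            const power = bindingPower[op];
            if (power === undefined || power < minPower) break;
            pos++;
            // Left associativity: the right operand may only contain
            // operators that bind strictly tighter than `op`.
            const right = parseLevel(power + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    return parseLevel(1);
}

console.log(group(["1", "+", "2", "*", "3"])); // (1 + (2 * 3))
console.log(group(["1", "-", "2", "-", "3"])); // ((1 - 2) - 3)
```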

In [None]:
const grammarPrecedence : string = `
  @precedence {
    times @left,
    plus @left,
    compare @left
  }
`;

## 3. Syntactic Grammar (Rules)

We now define the structure of our language. We split the rules into logical blocks and concatenate them at the end.

### 3.1 The Entry Point
The **`Program`** is the start symbol of our grammar.
It is defined simply as a sequence of zero or more `statement` nodes.

In [None]:
const ruleTop : string = `
  @top Program { statement* }
`;

### 3.2 Statements

A **`statement`** represents a standalone instruction. It acts as a wrapper for various specific statement types.
* **Control Flow**: `IfStatement` and `WhileStatement` define the structure for branching and looping. Note how they recursively use `statement` for their bodies.
* **Block**: A group of statements enclosed in curly braces `{ ... }`.
* **Assignment**: Assigns an expression (`Expr`) to an `Identifier`.
* **Expression Statement**: A standalone expression followed by a semicolon (e.g., a function call `print(x);`).

In [None]:
const ruleStatements : string = `
  statement {
    IfStatement { KwIf LParen Expr RParen statement } |
    WhileStatement { KwWhile LParen Expr RParen statement } |
    Block { LBrace statement* RBrace } |
    Assignment { Identifier OpAssign Expr Semi } |
    ExprStatement { Expr Semi }
  }
`;

### 3.3 Expressions

An **`Expr`** (Expression) represents a value computation.
It serves as a "dispatcher" rule that delegates to:
1.  **`BinaryExpression`**: Arithmetic or Logic.
2.  **`FunctionCall`**: Calling a function with arguments.
3.  **`Atom`**: The basic building blocks (numbers, variables).

In [None]:
const ruleExpr : string = `
  Expr {
    BinaryExpression |
    FunctionCall { Identifier LParen ArgList? RParen } |
    Atom
  }
`;

### 3.4 Binary Expressions & Precedence Handling

This is where the **Precedence** defined earlier is applied.
Instead of writing a complex recursive grammar (like `Expr -> Term -> Factor`), Lezer allows us to write a flat rule and apply **Precedence Markers** (`!times`, `!plus`, `!compare`).

* `Expr !times (OpTimes | ...) Expr`: This tells Lezer that this rule belongs to the `times` precedence group.
* Because `times` was defined above `plus`, Lezer knows that `*` binds tighter than `+`.

In [None]:
const ruleBinary : string = `
  BinaryExpression {
    Expr !compare (OpEq | OpNe | OpLe | OpGe | OpLt | OpGt) Expr |
    Expr !plus (OpPlus | OpMinus) Expr |
    Expr !times (OpTimes | OpDivide | OpModulo) Expr
  }
`;

### 3.5 Argument Lists and Atoms

* **`ArgList`**: Defines a comma-separated list of expressions. Used in function calls.
* **`Atom`**: The base case of our recursion.
    * `Number`: A raw numeric value.
    * `Identifier`: A variable name.
    * `LParen Expr RParen`: Parentheses allow us to override precedence manually (e.g., `(1+2)*3`).

In [None]:
const ruleAtoms : string = `
  ArgList {
    Expr (Comma Expr)*
  }

  Atom {
    Number |
    Identifier |
    LParen Expr RParen
  }
`;

## 4. Building the Parser

Finally, we concatenate all string parts to form the complete grammar definition and generate the parser.

In [None]:
const grammarString : string = 
    grammarTokens + 
    grammarPrecedence + 
    ruleTop + 
    ruleStatements + 
    ruleExpr + 
    ruleBinary + 
    ruleAtoms;

const parser : LRParser = buildParser(grammarString);
"Parser generated successfully.";

## 2. Testing the Scanner (Token Verification)

Since Lezer combines tokenization and parsing, we do not get a flat list of tokens automatically. However, we can simulate a scanner check by extracting the leaves of the resulting **Concrete Syntax Tree (CST)**.

**Formal Definition:**

Let $S$ be the input string and $T$ be the Syntax Tree generated by the parser.
We define the sequence of Tokens $\mathcal{T}$ as the ordered sequence of all leaf nodes in $T$.

A node $n \in T$ is considered a leaf if and only if it has no children:
$$\text{isLeaf}(n) \iff \text{degree}^{+}(n) = 0$$
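
The leaf definition can be tried out on a hand-built tree. This is a minimal sketch with a hypothetical `TreeNode` type standing in for a syntax tree; Lezer's real trees are traversed with cursors instead of nested objects.

```typescript
// Hypothetical stand-in for a syntax tree node.
interface TreeNode {
    label: string;
    children: TreeNode[];
}

// Collects the labels of all leaves in left-to-right order.
function leafLabels(node: TreeNode): string[] {
    if (node.children.length === 0) return [node.label]; // isLeaf(n)
    let result: string[] = [];
    for (const child of node.children) {
        result = result.concat(leafLabels(child));
    }
    return result;
}

const leaf = (label: string): TreeNode => ({ label: label, children: [] });

// Rough shape of the tree for an assignment like  x := 1;
const sampleTree: TreeNode = {
    label: "Assignment",
    children: [
        leaf("Identifier"),
        leaf("OpAssign"),
        { label: "Expr", children: [{ label: "Atom", children: [leaf("Number")] }] },
        leaf("Semi"),
    ],
};

console.log(leafLabels(sampleTree).join(" ")); // Identifier OpAssign Number Semi
```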

The function `testScanner` performs a Depth-First Traversal over $T$. For every node $n$ visited by the cursor $C$, we execute the following logic:

$$
\text{Output}(n) =
\begin{cases}
    \texttt{print}(n.\text{type}, S[n.\text{from} \dots n.\text{to}]) & \text{if } \text{isLeaf}(n) \\
    \text{continue} & \text{otherwise}
\end{cases}
$$

The leaf test must not move the cursor: `cursor.firstChild()` steps into the first child when one exists, and the following `cursor.next()` would then skip that child. We therefore inspect `cursor.node.firstChild` instead.

In [None]:
import { Tree, TreeCursor } from "@lezer/common";

function testScanner(fileName: string): void {
    const input: string = readFileSync(fileName, "utf8");
    
    console.log(`--- Scanning ${fileName} ---`);
    console.log(input);
    console.log("Tokens:");
    
    const tree: Tree = parser.parse(input);
    const cursor: TreeCursor = tree.cursor();
    
    do {
        // Leaf test that leaves the cursor where it is
        if (cursor.node.firstChild === null) {
            const tokenText = input.slice(cursor.from, cursor.to);
            const safeText = tokenText.replace(/\n/g, "\\n");
            console.log(`[${cursor.name}]`.padEnd(15) + `: ${safeText}`);
        }
    } while (cursor.next()); // pre-order step to the next node
}

In [None]:
testScanner('sum.sl');

In [None]:
testScanner('factorial.sl');

## 3. The CST-to-AST Transformation

Lezer generates a **Concrete Syntax Tree (CST)**. This tree records every syntactic detail of the source code, including comments, parentheses, and semicolons. While perfect for syntax highlighting, this structure is too noisy for interpretation.

We need to transform this CST into an **Abstract Syntax Tree (AST)**. An AST focuses purely on the logical structure of the program (e.g., "This is an assignment") rather than the syntactic sugar (e.g., "There is a semicolon here").

### 3.1 The Configuration Object (`ASTConfig`)

To perform this transformation generically, we define an **`ASTConfig`** object. This object serves as a blueprint for the mapper. It has three distinct purposes:

1.  **Noise Reduction (`ignore`)**:
    We define a set of token names that carry no semantic meaning for the interpreter. Tokens like `LParen` (`(`), `Semi` (`;`), or keywords like `KwIf` are strictly syntactic. They guide the parser but are irrelevant for the logical tree. The mapper automatically discards these nodes.

2.  **Leaf Extraction (`treatAsLiteral`)**:
    Some nodes, like **Operators** or **Identifiers**, are effectively "leaves" in our logic. We do not need to traverse inside them; we simply want their text content (e.g., the string `"x"` or `"+"`). We use a Regular Expression to identify these nodes and extract their raw text immediately.

3.  **Structural Transformation (`rules`)**:
    This is the core logic. For every complex node (like `BinaryExpression` or `Assignment`), we define a specific transformation function.
    * **Input**: A list of already processed children (sanitized of noise).
    * **Output**: A strictly typed `AST` object (e.g., `{ kind: 'Assignment', ... }`).

By separating the transformation logic (the generic walker) from the specific rules (the `ASTConfig`), we achieve a modular and type-safe design.
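
To make the three roles concrete, here is a strongly simplified walker over a hand-built tree. It is a sketch under stated assumptions: `CstNode`, `MiniAst`, `MiniConfig`, and `toAst` are hypothetical stand-ins, and the real `genericLezerToAST` in `./AST2Dot` may work differently (for instance, it operates on a Lezer cursor and uses the `one`/`many` wrappers).

```typescript
// Hypothetical, simplified stand-ins for the real types.
interface CstNode { label: string; text: string; children: CstNode[]; }
type MiniAst = string | number | { kind: string; parts: MiniAst[] };
type MiniRule = (children: MiniAst[], text: string) => MiniAst;

interface MiniConfig {
    ignore: string[];                            // 1. noise reduction
    treatAsLiteral: RegExp;                      // 2. leaf extraction
    rules: { [label: string]: MiniRule | undefined }; // 3. structural rules
}

function toAst(node: CstNode, config: MiniConfig): MiniAst[] {
    if (config.ignore.indexOf(node.label) >= 0) return [];     // drop noise
    if (config.treatAsLiteral.test(node.label)) return [node.text];
    let children: MiniAst[] = [];
    for (const child of node.children) {
        children = children.concat(toAst(child, config));
    }
    const rule = config.rules[node.label];
    // Nodes without a rule (pure wrappers) pass their children through.
    return rule ? [rule(children, node.text)] : children;
}

const tok = (label: string, text: string): CstNode =>
    ({ label: label, text: text, children: [] });

// Rough CST for  x := 1;
const cst: CstNode = {
    label: "Assignment", text: "x := 1;",
    children: [
        tok("Identifier", "x"), tok("OpAssign", ":="),
        { label: "Expr", text: "1", children: [
            { label: "Atom", text: "1", children: [tok("Number", "1")] } ] },
        tok("Semi", ";"),
    ],
};

const miniConfig: MiniConfig = {
    ignore: ["OpAssign", "Semi"],
    treatAsLiteral: /^(Op|Identifier)/,
    rules: {
        Number: (_children, text) => Number(text),
        Assignment: (children) => ({ kind: "Assignment", parts: children }),
    },
};

console.log(JSON.stringify(toAst(cst, miniConfig)));
// [{"kind":"Assignment","parts":["x",1]}]
```

In this sketch `ignore` is checked before `treatAsLiteral`, which is why `OpAssign` is dropped even though its name also matches the literal pattern.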

In [None]:
import { 
    ASTConfig, AST, Operator, 
    one, many, asOne, asMany, asString, 
    genericLezerToAST 
} from "./AST2Dot";

const astConfig: ASTConfig = {
    ignore: new Set([
        "LParen", "RParen", "LBrace", "RBrace", 
        "Semi", "Comma", "KwIf", "KwWhile", "OpAssign"
    ]),

    treatAsLiteral: /^(Op|Identifier)/,

    rules: {
        "Number": ({ text }) => one(Number(text)),

        "Program": ({ children }) => one({ 
            kind: 'Program', 
            statements: children.map(c => asOne(c, "Prog")) 
        }),
        
        "Block": ({ children }) => one({ 
            kind: 'Block', 
            statements: children.map(c => asOne(c, "Blk")) 
        }),

        "Assignment": ({ children }) => one({
            kind: 'Assignment',
            id: asString(children[0], "Assign.Id"),
            expr: asOne(children[1], "Assign.Expr")
        }),

        "BinaryExpression": ({ children }) => one({
            kind: 'BinaryExpr',
            left: asOne(children[0], "Bin.L"),
            op: asString(children[1], "Bin.Op") as Operator,
            right: asOne(children[2], "Bin.R")
        }),

        "IfStatement": ({ children }) => one({
            kind: 'If',
            condition: asOne(children[0], "If.Cond"),
            body: asOne(children[1], "If.Body")
        }),

        "WhileStatement": ({ children }) => one({
            kind: 'While',
            condition: asOne(children[0], "While.Cond"),
            body: asOne(children[1], "While.Body")
        }),

        "ExprStatement": ({ children }) => one({
            kind: 'ExprStmt',
            expr: asOne(children[0], "ExprStmt")
        }),

        "FunctionCall": ({ children }) => one({
            kind: 'Call',
            funcName: asString(children[0], "Call.Name"),
            // Handle optional ArgList: if present it's children[1], else empty array
            args: children[1] ? asMany(children[1], "Call.Args") : []
        }),

        "ArgList": ({ children }) => many(children.map(c => asOne(c, "Arg")))
    }
};

function parse(fileName: string): AST {
    const program = readFileSync(fileName, "utf8");
    const tree = parser.parse(program);
    return asOne(genericLezerToAST(tree.cursor(), program, astConfig), "Root");
}

### 3.2 Visualizing the AST

To verify that our configuration correctly transforms the CST into the intended AST structure, we utilize our `AST2Dot` library. This renders the tree structure, showing the **Program** root, the statements, and the nested expressions.

In [None]:
const astSum = parse("sum.sl");
const dotSum = ast2dot(astSum);
display.html(viz.renderString(dotSum, { format: "svg" }));

In [None]:
const astFact = parse("factorial.sl");
const dotFact = ast2dot(astFact);
display.html(viz.renderString(dotFact, { format: "svg" }));

## 4. The Interpreter

The interpreter breathes life into our **Abstract Syntax Tree (AST)**. It recursively traverses the tree structure and performs the operations described by the nodes.

We divide the implementation into three specialized functions, mirroring the structure of our AST:

1.  **`execute`**: Handles **Statements** (side effects). It modifies the program state but returns nothing (`void`).
2.  **`evaluate`**: Handles **Arithmetic Expressions**. It calculates and returns a `number`.
3.  **`evaluateBool`**: Handles **Conditions**. It returns a `boolean` to decide control flow paths.

### 4.1 State Management (Memory)

A computer program needs memory. We simulate this using a simple dictionary (a plain object used as a map):
* **Keys**: Variable names (`string`).
* **Values**: The stored numbers (`number`).

We define the type `Variables` for this map. Additionally, we introduce a global `inputStream` to simulate standard input (`stdin`) for the `read()` function.
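
Both pieces of state behave like ordinary JavaScript data structures. The following sketch shows the two operations the interpreter relies on; `memory`, `pending`, and `isDefined` are names chosen for this illustration only.

```typescript
// Same shape as the notebook's memory type: variable name -> number.
type Variables = { [key: string]: number };

const memory: Variables = {};
memory["x"] = 5;               // effect of  x := 5;
memory["x"] = memory["x"] + 1; // effect of  x := x + 1;

// A variable that was never assigned is simply absent from the object.
const isDefined = (varName: string): boolean => memory[varName] !== undefined;

// The input buffer is a FIFO queue: shift() removes and returns the oldest
// entry, or undefined once the queue is empty.
const pending: string[] = ["10", "20"];
const firstRead = pending.shift();
const secondRead = pending.shift();
const thirdRead = pending.shift();

console.log(memory["x"], isDefined("y")); // 6 false
console.log(firstRead, secondRead, thirdRead); // 10 20 undefined
```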

In [None]:
// Type Alias for Memory
type Variables = { [key: string]: number };

// Global Input Buffer (Simulates stdin queue)
let inputStream: string[] = [];

// Forward Declarations with strict signatures
let execute: (node: AST, values: Variables) => void;
let evaluate: (node: AST, values: Variables) => number;
let evaluateBool: (node: AST, values: Variables) => boolean;

### 4.2 Executing Statements (`execute`)

The `execute` function serves as the control center. It accepts an `AST` node and the current `values` (memory).

**The Logic:**
It uses a `switch` statement on the **`node.kind`** property. This is a "Discriminated Union" in TypeScript: inside each `case`, TypeScript knows exactly which fields exist on `node`.

* **`Program` / `Block`**: These are containers. We iterate through their `statements` array and recursively call `execute` for each one.
* **`Assignment`**: This is where memory changes. We `evaluate` the expression on the right-hand side and store the result in `values` using the variable name (`node.id`).
* **`If` / `While`**: These control flow structures use `evaluateBool` to check their condition. If true, they recursively `execute` their `body`.
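
Since the `AST` type itself is imported from `./AST2Dot`, the following sketch shows how a discriminated union of this kind could look. The shapes are an assumption reconstructed from how the nodes are used in this notebook; the actual definition may differ. `MiniExpr`, `MiniStmt`, and `countAssignments` are hypothetical names.

```typescript
// Assumed, reduced shapes; the real AST type lives in ./AST2Dot.
type MiniExpr =
    | number
    | string
    | { kind: "BinaryExpr"; left: MiniExpr; op: string; right: MiniExpr };

type MiniStmt =
    | { kind: "Block"; statements: MiniStmt[] }
    | { kind: "Assignment"; id: string; expr: MiniExpr }
    | { kind: "If"; condition: MiniExpr; body: MiniStmt };

// Inside each case TypeScript narrows `stmt` to the matching variant,
// so only the fields of that variant are accessible.
function countAssignments(stmt: MiniStmt): number {
    switch (stmt.kind) {
        case "Block": {
            let total = 0;
            for (const inner of stmt.statements) total += countAssignments(inner);
            return total;
        }
        case "Assignment":
            return 1; // stmt.id and stmt.expr exist here
        case "If":
            return countAssignments(stmt.body); // stmt.condition exists here
    }
}

const sampleProgram: MiniStmt = {
    kind: "Block",
    statements: [
        { kind: "Assignment", id: "x", expr: 1 },
        { kind: "If", condition: "x", body: { kind: "Assignment", id: "y", expr: "x" } },
    ],
};

console.log(countAssignments(sampleProgram)); // 2
```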

In [None]:
execute = (node: AST, values: Variables): void => {
    // 1. Safety check: Primitives (Leaves) cannot be executed as statements
    if (typeof node === 'number' || typeof node === 'string') return;

    // 2. Structural Switching
    switch (node.kind) {
        case 'Program':
        case 'Block':
            // Recursive Step: Execute list of statements sequentially
            for (const stmt of node.statements) {
                execute(stmt, values);
            }
            break;

        case 'Assignment':
            // Side Effect: Update memory
            // We use node.id (string) as key and evaluate(node.expr) as value
            values[node.id] = evaluate(node.expr, values);
            break;

        case 'ExprStmt':
            // Evaluate expression for side-effects (e.g., function calls)
            evaluate(node.expr, values);
            break;

        case 'If':
            // Conditional Execution
            if (evaluateBool(node.condition, values)) {
                execute(node.body, values);
            }
            break;

        case 'While':
            // Loop Execution
            while (evaluateBool(node.condition, values)) {
                execute(node.body, values);
            }
            break;
            
        default:
            // This ensures we handled all statement types defined in our AST
            throw new Error(`Runtime Error: Cannot execute node kind: ${(node).kind}`);
    }
};

### 4.3 Evaluating Conditions (`evaluateBool`)

Control flow statements like `if` and `while` require boolean values. Since our base language only supports numbers, we need a bridge between the two.

**The Logic:**
1.  **Comparisons**: If the node is a `BinaryExpr` with a comparison operator (like `<` or `==`), we evaluate both sides and compare them using JavaScript's native boolean logic.
2.  **C-Style Truthiness**: If the node is a number or an arithmetic expression (e.g., `while(1)`), we treat `0` as `false` and any non-zero value as `true`.

In [None]:
// Operators that produce a boolean instead of a number
const comparisonOps: string[] = ['==', '!=', '<=', '>=', '<', '>'];

evaluateBool = (node: AST, values: Variables): boolean => {
    // 1. Explicit Comparison Operators
    //    The operator is checked first so that the operands of an arithmetic
    //    expression are not evaluated twice (which would repeat side effects
    //    such as read()).
    if (typeof node === 'object' && node.kind === 'BinaryExpr' && comparisonOps.includes(node.op)) {
        const l = evaluate(node.left, values);
        const r = evaluate(node.right, values);
        
        switch (node.op) {
            case '==': return l === r;
            case '!=': return l !== r;
            case '<=': return l <= r;
            case '>=': return l >= r;
            case '<': return l < r;
            case '>': return l > r;
        }
    }
    
    // 2. Implicit Boolean (C-Style): 0 is false, everything else is true
    return evaluate(node, values) !== 0;
};

### 4.4 Evaluating Expressions (`evaluate`)

This function computes the actual numerical values.

**The Logic:**
* **Leaves**:
    * **Numbers**: Returned "as is".
    * **Variables (Strings)**: Looked up in the `values` map. We throw an error if the variable is undefined.
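* **Binary Expressions**: We recursively evaluate `left` and `right` operands and apply the math operator. Note that `/` performs **integer division** via `Math.floor` to keep things simple. `Math.floor` rounds toward negative infinity, so for negative operands the result differs from C's truncating division (`(0 - 7) / 2` yields `-4`, not `-3`).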
* **Function Calls**:
    * **`print`**: Evaluates its argument, logs it to the console (simulating `stdout`), and returns `0`.
    * **`read`**: Simulates reading from `stdin`. It takes the next value from the global `inputStream` queue. If the queue is empty, it throws a Runtime Error.
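
The rounding behaviour of the division can be checked in isolation. The sketch below compares the floor-based division used by the interpreter with a truncating variant as found in C; `floorDiv` and `truncDiv` are hypothetical helper names introduced only for this comparison.

```typescript
// Division as used by the interpreter: rounds toward negative infinity.
function floorDiv(l: number, r: number): number {
    return Math.floor(l / r);
}

// Truncating division (rounds toward zero), shown for comparison only.
function truncDiv(l: number, r: number): number {
    const q = l / r;
    return q < 0 ? Math.ceil(q) : Math.floor(q);
}

console.log(floorDiv(7, 2), truncDiv(7, 2));   // 3 3
console.log(floorDiv(-7, 2), truncDiv(-7, 2)); // -4 -3

// JavaScript's % takes the sign of the dividend, so it pairs with
// truncating division: truncDiv(l, r) * r + (l % r) === l.
console.log(-7 % 2);                           // -1
console.log(truncDiv(-7, 2) * 2 + (-7 % 2));   // -7
console.log(floorDiv(-7, 2) * 2 + (-7 % 2));   // -9

// Division by zero does not throw; it yields Infinity.
console.log(floorDiv(1, 0));                   // Infinity
```

For non-negative operands, which is the common case in the example programs, both variants agree.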

In [None]:
evaluate = (node: AST, values: Variables): number => {
    // 1. Literals (Base Cases)
    if (typeof node === 'number') return node;
    
    // 2. Variable Lookup
    if (typeof node === 'string') {
        if (values[node] === undefined) {
            throw new Error(`Runtime Error: Undefined variable '${node}'`);
        }
        return values[node];
    }

    // 3. Complex Expressions
    switch (node.kind) {
        case 'BinaryExpr': {
            const l = evaluate(node.left, values);
            const r = evaluate(node.right, values);
            switch (node.op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return Math.floor(l / r); // Integer Division
                case '%': return l % r;
                default: throw new Error(`Unknown numeric op: ${node.op}`);
            }
        }
        
        case 'Call': {
            // Built-in Function: print(x)
            if (node.funcName === 'print') {
                const val = evaluate(node.args[0], values);
                console.log(">> STDOUT:", val);
                return 0; // Void return value
            }
            
            // Built-in Function: read()
            if (node.funcName === 'read') {
                const input = inputStream.shift(); // Dequeue input
                
                if (input === undefined) {
                    throw new Error("Runtime Error: 'read()' called but Input Stream is empty!");
                }
                
                console.log(`<< STDIN: Read value '${input}'`);
                
                const num = Number(input);
                if (isNaN(num)) {
                    throw new Error(`Runtime Error: Input '${input}' is not a valid number.`);
                }
                return num;
            }
            
            throw new Error(`Unknown function: ${node.funcName}`);
        }
        
        default:
            throw new Error(`Cannot evaluate node kind: ${node.kind}`);
    }
};

## 5. Execution

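Finally, we define a `main` function that puts everything together:
1.  Reset the simulated input stream to the given `inputs`.
2.  Read the source file and parse it into an AST (both done by `parse`).
3.  Initialize an empty memory.
4.  Execute the AST and print the final memory state.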

In [None]:
function main(fileName: string, inputs: string[] = []) {
    console.log(`\n--- Executing ${fileName} with inputs [${inputs.join(", ")}] ---`);
    
    // 1. Reset Input Stream for this run
    inputStream = [...inputs]; 
    
    try {
        const ast = parse(fileName);
        const values: Variables = {};
        
        execute(ast, values);
        
        console.log("Final Memory State:", values);
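    } catch (e) {
        // `e` may be typed as `unknown`, so narrow it before reading `.message`
        console.error(e instanceof Error ? e.message : String(e));
    }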
}

In [None]:
main('sum.sl', ["5"]);

In [None]:
main('sum.sl');

In [None]:
main('sum.sl', ["6"]);

In [None]:
main('factorial.sl', ["10"]);