In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# An EBNF-Based Parser for Arithmetic Expressions

In this notebook, we implement a recursive-descent parser based on **EBNF (Extended Backus-Naur Form)**. Unlike the simple recursive parser, this implementation expresses repetition with loops (`MANY`) instead of recursive helper rules, which maps directly onto the parsing DSL of the *Chevrotain* library.

## Architectural Overview

1.  **Lexing:** Tokenizes the input string into numbers and operators.
2.  **Parsing:** Uses an **LL(k)** parser with iterative rules (`expr`, `product`) to build a concrete syntax tree (CST).
3.  **Visiting:** Evaluates the CST to compute the result and collects all unique numbers encountered.

## Grammar Specification

The grammar rules are defined iteratively using the Kleene star `*`, avoiding the need for helper rules like `exprRest` or `productRest`:

$$
\begin{array}{lcl}
  \mathrm{expr}    & \rightarrow & \mathrm{product} \; \bigl( (\mathtt{'+'} \mid \mathtt{'-'}) \; \mathrm{product} \bigr)^* \\
  \mathrm{product} & \rightarrow & \mathrm{factor} \; \bigl( (\mathtt{'*'} \mid \mathtt{'/'}) \; \mathrm{factor} \bigr)^* \\
  \mathrm{factor}  & \rightarrow & \mathtt{'('} \; \mathrm{expr} \;\mathtt{')'} \\
                   & \mid        & \mathtt{NUMBER}
\end{array}
$$
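To see how each Kleene star turns into a loop, here is a minimal hand-written sketch of the same grammar that evaluates while it parses. It is independent of Chevrotain, and every name in it (`Cursor`, `scan`, `parseExpr`, `parseProduct`, `parseFactor`, `evalExpr`) is an illustrative assumption, not part of the notebook's parser. The regex-based `scan` silently skips characters it does not recognize, which is acceptable for a sketch only.

```typescript
// Hand-written sketch: each (...)* in the grammar becomes a while loop.
// All names here are illustrative and not part of the Chevrotain parser.
type Cursor = { tokens: string[]; pos: number };

function scan(s: string): string[] {
  // Integers without leading zeros, operators, and parentheses.
  return s.match(/[1-9][0-9]*|0|[-+*\/()]/g) ?? [];
}

// expr -> product (('+' | '-') product)*
function parseExpr(c: Cursor): number {
  let result = parseProduct(c);
  while (c.tokens[c.pos] === "+" || c.tokens[c.pos] === "-") {
    const op = c.tokens[c.pos++];
    const rhs = parseProduct(c);
    result = op === "+" ? result + rhs : result - rhs;
  }
  return result;
}

// product -> factor (('*' | '/') factor)*
function parseProduct(c: Cursor): number {
  let result = parseFactor(c);
  while (c.tokens[c.pos] === "*" || c.tokens[c.pos] === "/") {
    const op = c.tokens[c.pos++];
    const rhs = parseFactor(c);
    result = op === "*" ? result * rhs : result / rhs;
  }
  return result;
}

// factor -> '(' expr ')' | NUMBER
function parseFactor(c: Cursor): number {
  const tok = c.tokens[c.pos++];
  if (tok === "(") {
    const value = parseExpr(c);
    if (c.tokens[c.pos++] !== ")") throw new Error("expected ')'");
    return value;
  }
  if (tok === undefined || !/^[0-9]+$/.test(tok)) {
    throw new Error(`unexpected token: ${tok}`);
  }
  return Number(tok);
}

function evalExpr(s: string): number {
  const c: Cursor = { tokens: scan(s), pos: 0 };
  const value = parseExpr(c);
  if (c.pos !== c.tokens.length) throw new Error("trailing input");
  return value;
}

console.log(evalExpr("1 - 2 + 3")); // 2
```

Because the loop applies each operator as soon as it is read, left associativity falls out naturally here. The Chevrotain version in this notebook separates parsing from evaluation, so the visitor has to recover that order from the CST.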

### Imports and Setup

We use `Chevrotain` for parsing and `RecursiveSet` to track all unique numbers encountered during evaluation.

In [None]:
import {
  createToken,
  Lexer,
  CstParser,
  IToken,
  ILexingResult,
  TokenType,
  CstNode,
} from "chevrotain";
import { RecursiveSet } from "recursive-set";

## 1. Specification of the Scanner

We define the tokens for the arithmetic language.

**Token Definitions:**

| Token Name | Pattern | Description |
| :--- | :--- | :--- |
| `WhiteSpace` | `[ \t]+` | Spaces and tabs (skipped). |
| `NumberToken`| `[1-9][0-9]*\|0` | Integers. |
| `Operators` | `+`, `-`, `*`, `/` | Basic arithmetic operators. |
| `Parentheses`| `(`, `)` | Grouping. |
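The `NumberToken` pattern deliberately excludes leading zeros. The following sketch checks a few sample strings against an anchored copy of that pattern using plain `RegExp`; the names `numberPattern` and `samples` are introduced only for this illustration.

```typescript
// Anchored copy of the NumberToken pattern, used only for this illustration.
const numberPattern = /^(?:[1-9][0-9]*|0)$/;

const samples: string[] = ["0", "7", "120", "007", "1.5", ""];
for (const s of samples) {
  // Only "0", "7", and "120" are accepted as whole-string matches.
  console.log(JSON.stringify(s), numberPattern.test(s));
}
```

Note that a lexer matches at the current input position rather than against the whole string, so an input such as `007` would presumably be split into the three tokens `0`, `0`, `7` and only be rejected later by the parser.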

In [None]:
const WhiteSpace: TokenType = createToken({
  name: "WhiteSpace",
  pattern: /[ \t]+/,
  group: Lexer.SKIPPED,
});

const NumberToken: TokenType = createToken({
  name: "NumberToken",
  pattern: /[1-9][0-9]*|0/,
});

const Plus: TokenType = createToken({ name: "Plus", pattern: /\+/ });
const Minus: TokenType = createToken({ name: "Minus", pattern: /-/ });
const Multi: TokenType = createToken({ name: "Multi", pattern: /\*/ });
const Div: TokenType = createToken({ name: "Div", pattern: /\// });
const LParen: TokenType = createToken({ name: "LParen", pattern: /\(/ });
const RParen: TokenType = createToken({ name: "RParen", pattern: /\)/ });

const allTokens: TokenType[] = [
  WhiteSpace,
  NumberToken,
  Plus,
  Minus,
  Multi,
  Div,
  LParen,
  RParen,
];

const ArithmeticLexer = new Lexer(allTokens, { positionTracking: "onlyOffset" });

### Helper Function `tokenize`

Wraps the lexer for debugging purposes.

**Input:**
* `s`: Source string.

**Output:**
* List of token images (`string[]`).

In [None]:
function tokenize(s: string): string[] {
  const lexingResult: ILexingResult = ArithmeticLexer.tokenize(s);

  if (lexingResult.errors.length > 0) {
    throw new Error(`Lexing errors: ${lexingResult.errors[0].message}`);
  }

  return lexingResult.tokens.map((token: IToken) => token.image);
}

In [None]:
console.log(tokenize('12 * 13 + 14 * 4 / 6 - 7'))

## 2. Implementing the EBNF Parser

We implement the parser class using `CstParser`.

**Key differences from the simple recursive parser:**
* Instead of recursive function calls for "rest" rules, we use `this.MANY(...)`.
* This creates arrays of children in the CST (e.g., an array of `Plus` tokens), which we can iterate over during evaluation.
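As a rough illustration of those arrays, the sketch below builds by hand the approximate shape of the `expr` node for the input `1 - 2 + 3`. The interfaces `SketchToken` and `SketchNode` are simplified, hypothetical stand-ins for Chevrotain's `IToken` and `CstNode`, which carry more fields than shown here.

```typescript
// Simplified, hypothetical stand-ins for Chevrotain's IToken and CstNode.
interface SketchToken { image: string; startOffset: number }
interface SketchNode {
  name: string;
  children: Record<string, (SketchToken | SketchNode)[]>;
}

// Helper: a `product` node wrapping a single numeric `factor`.
const num = (image: string, startOffset: number): SketchNode => ({
  name: "product",
  children: {
    factor: [{ name: "factor", children: { NumberToken: [{ image, startOffset }] } }],
  },
});

// Approximate shape of the `expr` node for the input "1 - 2 + 3".
const exprNode: SketchNode = {
  name: "expr",
  children: {
    product: [num("1", 0), num("2", 4), num("3", 8)],
    Minus: [{ image: "-", startOffset: 2 }],
    Plus: [{ image: "+", startOffset: 6 }],
  },
};

// Children are grouped by name: three operands, one array per operator type.
console.log(Object.keys(exprNode.children).join(", ")); // product, Minus, Plus
console.log(exprNode.children["product"]?.length); // 3
```

The grouping by name means the relative order of `Minus` and `Plus` is not stored in the arrays themselves; it can only be recovered from position information such as `startOffset`.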

**Rules:**
* **`expr`**: Consumes a `product` and then loops over `+`/`-` and subsequent `product`s.
* **`product`**: Consumes a `factor` and then loops over `*`/`/` and subsequent `factor`s.
* **`factor`**: Handles numbers and parentheses.

**Input:**
* Token vector.

**Output:**
* CST root node.

In [None]:
class EbnfArithmeticParser extends CstParser {
  constructor() {
    super(allTokens);
    this.performSelfAnalysis();
  }
  // ----- expr Rule -----
  // EBNF: expr -> product ( ('+'|'-') product )*
  public expr = this.RULE("expr", () => {
    this.SUBRULE(this.product);
    this.MANY(() => {
      this.OR([
        { ALT: () => this.CONSUME(Plus) },
        { ALT: () => this.CONSUME(Minus) },
      ]);
      this.SUBRULE2(this.product);
    });
  });
  // ----- product Rule -----
  // EBNF: product -> factor ( ('*'|'/') factor )*
  public product = this.RULE("product", () => {
    this.SUBRULE(this.factor);
    this.MANY(() => {
      this.OR([
        { ALT: () => this.CONSUME(Multi) },
        { ALT: () => this.CONSUME(Div) },
      ]);
      this.SUBRULE2(this.factor);
    });
  });
  // ----- factor Rule -----
  // EBNF: factor -> '(' expr ')' | NUMBER
  public factor = this.RULE("factor", () => {
    this.OR([
      {
        ALT: () => {
          this.CONSUME(LParen);
          this.SUBRULE(this.expr);
          this.CONSUME(RParen);
        },
      },
      { ALT: () => this.CONSUME(NumberToken) },
    ]);
  });
}

const parser = new EbnfArithmeticParser();
const BaseCstVisitor = parser.getBaseCstVisitorConstructor();

## 3. Visitor: Evaluation

The `ArithmeticVisitor` computes the result by traversing the CST.

### Algorithm: Iterative Evaluation with Precedence

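Because `expr` and `product` are implemented iteratively (using `MANY`), the CST contains arrays of operands and operators. Chevrotain groups the children of a node by name, so `Plus` and `Minus` tokens (and likewise `Multi` and `Div`) end up in separate arrays and their relative order is not preserved. To evaluate left-associatively, so that `1 - 2 + 3` yields `2` rather than `0`, we must restore the textual order of the operators before applying them.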

**Algorithm Sketch:**

Let $T$ be the list of operand nodes.
Let $Ops$ be the list of operator tokens (e.g., all `Plus` and `Minus` tokens of an `expr` node).

1.  **Collect & Sort:** Gather all operator tokens and sort them by `startOffset`.
2.  **Fold Left:** Start with the first operand.
3.  **Iterate:** For each sorted operator $op_i$, apply it to the current result and the next operand $T_{i+1}$.

**Output:**
* The calculated number ($\mathbb{R}$).
* Side effect: Populates `foundNumbers`.
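The three steps of the algorithm sketch can be tried in isolation on plain numbers. The following is a minimal sketch under stated assumptions: `OpToken` and `foldLeft` are hypothetical names, and `OpToken` keeps only the two fields the algorithm needs rather than Chevrotain's full `IToken`.

```typescript
// Minimal operator-token shape for this sketch (hypothetical, not Chevrotain's IToken).
interface OpToken { name: "Plus" | "Minus"; startOffset: number }

function foldLeft(operands: number[], pluses: OpToken[], minuses: OpToken[]): number {
  // 1. Collect & sort: restore the textual order of the operators.
  const ops = [...pluses, ...minuses].sort((a, b) => a.startOffset - b.startOffset);
  if (operands.length === 0 || ops.length !== operands.length - 1) {
    throw new Error("expected exactly one operator between consecutive operands");
  }
  // 2. Fold left: start with the first operand.
  let result = operands[0]!;
  // 3. Iterate: apply op_i to the running result and the next operand.
  ops.forEach((op, i) => {
    const rhs = operands[i + 1]!;
    result = op.name === "Plus" ? result + rhs : result - rhs;
  });
  return result;
}

// "1 - 2 + 3": the Minus sits at offset 2, the Plus at offset 6.
const foldedValue = foldLeft(
  [1, 2, 3],
  [{ name: "Plus", startOffset: 6 }],
  [{ name: "Minus", startOffset: 2 }],
);
console.log(foldedValue); // 2
```

Without the sort, concatenating the `Plus` array before the `Minus` array would compute `1 + 2 - 3 = 0`, which is why the offsets matter.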

In [None]:
interface IResult {
  value: number;
  numbers: RecursiveSet<number>;
}

class ArithmeticVisitor extends BaseCstVisitor {
  public foundNumbers: RecursiveSet<number>;

  constructor() {
    super();
    this.foundNumbers = new RecursiveSet<number>();
    this.validateVisitor();
  }

  public expr(ctx: {
    product: CstNode[];
    Plus?: IToken[];
    Minus?: IToken[];
  }): number {
    let result: number = this.visit(ctx.product[0]) as number;

    // Iteration over the arrays created by MANY
    if (ctx.product.length > 1) {
      const pluses = ctx.Plus || [];
      const minuses = ctx.Minus || [];
      const allOps = [...pluses, ...minuses].sort((a, b) => a.startOffset - b.startOffset);

      for (let i = 1; i < ctx.product.length; i++) {
        const operand: number = this.visit(ctx.product[i]) as number;
        const operator = allOps[i - 1];

        // Check token type by name instead of array existence
        if (operator.tokenType.name === "Plus") {
          result += operand;
        } else {
          result -= operand;
        }
      }
    }
    return result;
  }

  public product(ctx: {
    factor: CstNode[];
    Multi?: IToken[];
    Div?: IToken[];
  }): number {
    let result: number = this.visit(ctx.factor[0]) as number;

    if (ctx.factor.length > 1) {
      const multis = ctx.Multi || [];
      const divs = ctx.Div || [];
      const allOps = [...multis, ...divs].sort((a, b) => a.startOffset - b.startOffset);

      for (let i = 1; i < ctx.factor.length; i++) {
        const operand: number = this.visit(ctx.factor[i]) as number;
        const operator = allOps[i - 1];

        if (operator.tokenType.name === "Multi") {
          result *= operand;
        } else {
          result /= operand;
        }
      }
    }
    return result;
  }

  public factor(ctx: { expr?: CstNode[]; NumberToken?: IToken[] }): number {
    if (ctx.expr) {
      return this.visit(ctx.expr[0]) as number;
    } else {
      const token: IToken = ctx.NumberToken![0];
      // NumberToken only matches integers, so parse the image as a base-10 integer.
      const val: number = Number.parseInt(token.image, 10);
      this.foundNumbers.add(val);
      return val;
    }
  }
}

## 4. Main Parsing Function

### Function `parse`

Orchestrates the complete pipeline.

**Input:**
* `s`: Expression string.

**Output:**
* `IResult` object with `value` and `numbers`.

In [None]:
function parse(s: string): IResult {
  const lexingResult: ILexingResult = ArithmeticLexer.tokenize(s);

  if (lexingResult.errors.length > 0) {
    throw new Error(`Lexing Errors: ${lexingResult.errors[0].message}`);
  }

  parser.input = lexingResult.tokens;
  const cst: CstNode = parser.expr();

  if (parser.errors.length > 0) {
    throw new Error(`Parsing Errors: ${parser.errors[0].message}`);
  }

  const visitor = new ArithmeticVisitor();
  const value: number = visitor.visit(cst) as number;

  return {
    value,
    numbers: visitor.foundNumbers,
  };
}

## 5. Testing

### Function `test`

Runs the parser on a given string and prints debug information (tokens, evaluation result, found numbers).

**Input:**
* `s`: Expression string.

**Output:**
* Console logs of the process.

In [None]:
function test(s: string): void {
  try {
    // Check tokenization output
    console.log(`Tokens: [${tokenize(s).join(", ")}]`);

    // Parse and evaluate
    const result: IResult = parse(s);

    console.log(`Input: ${s}`);
    console.log(`Result: ${result.value}`);
    console.log(`Numbers: ${result.numbers.toString()}`);
    console.log("------------------------------------------------");
  } catch (e) {
    console.error(`Error processing '${s}':`, e);
  }
}

In [None]:
parse('12 * 13 + 14 * 4 / 6 - 7')

In [None]:
test('11+22*(33-44)/(5-10*5/(4-3))')

In [None]:
test('0*11+22*(33-44)/(5-10*5/(4-3))')