In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css: string = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

The following example has been adapted from the official PLY documentation and ported to TypeScript using Lezer.

## A Tokenizer for Numbers and the Arithmetical Operators

In this notebook, we take a **declarative** approach with **Lezer**, a modern, incremental parser generator often used in web-based editors such as CodeMirror.

**Lezer** requires us to write a **Grammar**.

Usually, a grammar defines both the tokens (words) and the sentence structure (syntax). However, since we are currently focusing only on **Tokenization** (Scanning) and have not yet covered Grammars or LR-Parsers, we will configure Lezer to act as a pure Scanner.

### Defining the Lexical Rules

We define our tokens inside a grammar string, using a notation similar to **Regular Expressions**.

We define a "permissive" top-level rule:
$$\text{Script} \rightarrow (\text{AnyToken})^*$$

This tells the parser: "A script is simply a sequence of any known tokens, in any order." This disables syntax checking for now, allowing us to focus entirely on how raw text is split into tokens.

## Step-by-Step Lexer Definition

Instead of writing one giant grammar string, we will build our Lexer definition piece by piece. This allows us to inspect exactly how tokens are defined and structured.

First, we need the generator tools.

In [None]:
import { buildParser } from "@lezer/generator";
import { Tree, TreeCursor } from "@lezer/common";
import { LRParser } from "@lezer/lr";

### 1. The Entry Point

Since we are building a **Scanner** (not a Parser yet), we want to be very permissive. We define a rule named `Script` that accepts a sequence of *any* valid tokens (`token*`).

This creates a "flat" structure. It tells Lezer: *"Don't check if the order makes sense mathematically, just check if the words exist."*

In [None]:
const entryPoint: string = `
  @top Script { token* }
`;

### 2. The Token Union

We need a rule that lists all possible things that can appear in our token stream. This acts as a central registry for our scanner.

In [None]:
const tokenStructure: string = `
  token {
    Number |
    Identifier |
    Plus | Minus | Times | Divide |
    LParen | RParen
  }
`;

### 3. Defining Tokens in Lezer

In **Lezer**, we define all lexical rules inside a `@tokens { ... }` block within the grammar string. This block acts as the dictionary for our language, mapping **Token Names** to **Patterns**.

#### 3.1 Literal Tokens (Operators)

The simplest tokens are fixed strings, such as mathematical operators or parentheses. We define them by mapping a name (Capitalized) to a string literal.

* **Syntax:** `TokenName { "string" }`
* **Note:** Unlike standard Regular Expressions, we do not need to escape special characters like `+` or `*` here. Lezer treats quoted strings as exact text matches.
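
For contrast, the following sketch shows what the same operator characters require in a JavaScript regular expression. It uses only the built-in `RegExp` and does not call Lezer; the helper name `compiles` is hypothetical and chosen for illustration.

```typescript
// Hypothetical helper: reports whether a pattern string is a valid JavaScript regex.
function compiles(pattern: string): boolean {
  try {
    new RegExp(pattern);
    return true;
  } catch {
    return false;
  }
}

// Unescaped operator characters are rejected by the regex engine ...
const bareOperators: boolean[] = ["+", "*", "("].map(compiles);
// ... whereas the escaped forms are accepted.
const escapedOperators: boolean[] = ["\\+", "\\*", "\\("].map(compiles);

// The escaped regex matches the literal character, like the quoted string "+" in Lezer.
const matchesPlus: boolean = new RegExp("^\\+$").test("+");
```

In the Lezer grammar below, the quoted strings play the role of the escaped forms, so no backslashes are needed.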

In [None]:
const simpleTokens: string = `
  @tokens {
    // 1. Literal Matches (Operators & Punctuation)
    Plus   { "+" }
    Minus  { "-" }
    Times  { "*" }
    Divide { "/" }
    LParen { "(" }
    RParen { ")" }
`;

#### 3.2 Complex Patterns (Regular Expressions)

For dynamic values like Numbers or Identifiers, we cannot list every possibility. Instead, we use **Patterns**.

Lezer uses a notation very similar to Regular Expressions (Regex), but with a key difference regarding **Character Sets**:

1.  **Character Classes `$[...]`**:
    In standard Regex, `[0-9]` matches any digit. In Lezer, you **must** prefix the brackets with a dollar sign `$`.
    * Standard Regex: `[a-z]` $\rightarrow$ Lezer: `$[a-z]`
    
2.  **Quantifiers and Alternation**:
    These work exactly like standard Regex:
    * `*` : Match zero or more times.
    * `+` : Match one or more times.
    * `|` : Logical OR (Alternative).

**The Number Definition:**
We define a number using this pattern:
`'0' | $[1-9] $[0-9]*`

This logic is precise:
* **Either** match the character `'0'` exactly.
* **OR** match a non-zero digit (`1-9`) followed by any number of digits (`0-9`).
* *Why?* This prevents a sequence like `007` from being interpreted as a single number "Seven". Instead, our scanner will see it as three separate tokens: `0`, `0`, and `7`.

**The Identifier Definition:**
`$[a-zA-Z_] $[a-zA-Z0-9_]*`
* Must start with a letter or underscore.
* Can be followed by letters, numbers, or underscores.
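
To make the `007` case concrete, here is a minimal sketch that mimics the two patterns with JavaScript regular expressions instead of Lezer. The helper `splitAll` is a hypothetical name. JavaScript alternation is ordered rather than longest-match, but since the two alternatives of the number pattern start with different characters, that makes no difference here.

```typescript
// JavaScript RegExp counterparts of the two Lezer patterns. The sticky flag "y"
// anchors each match at lastIndex, much like a scanner position.
const numberRe: RegExp = /0|[1-9][0-9]*/y;
const identifierRe: RegExp = /[a-zA-Z_][a-zA-Z0-9_]*/y;

// Hypothetical helper: repeatedly match `re`, starting at the beginning of `text`.
function splitAll(re: RegExp, text: string): string[] {
  const parts: string[] = [];
  let pos = 0;
  while (pos < text.length) {
    re.lastIndex = pos;
    const m = re.exec(text);
    if (m === null || m[0].length === 0) break; // no token at this position
    parts.push(m[0]);
    pos += m[0].length;
  }
  return parts;
}

const leadingZeros: string[] = splitAll(numberRe, "007");  // ["0", "0", "7"]
const plainNumber: string[] = splitAll(numberRe, "1070");  // ["1070"]
const identifier: string[] = splitAll(identifierRe, "_tmp1"); // ["_tmp1"]
```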

In [None]:
const complexTokens: string = `
    // 2. Complex Patterns using Lezer Regex Syntax

    // Number: Matches '0' OR (1-9 followed by digits)
    Number { '0' | $[1-9] $[0-9]* }

    // Identifier: Starts with Letter/Underscore, then alphanumeric
    Identifier { $[a-zA-Z_] $[a-zA-Z0-9_]* }
`;

#### 3.3 Handling Whitespace

We need to explicitly define what "whitespace" is so we can tell the parser to ignore it later. We define a token named `space` that matches spaces, tabs (`\t`), newlines (`\n`), and carriage returns (`\r`).

* **Pattern:** `$[ \t\n\r]+` (One or more whitespace characters).

**Important Note on Escaping:**
Lezer is flexible regarding escape sequences. It accepts both:
1.  **Single Backslash (`\n`):** Creates an actual newline character inside the TypeScript string. Lezer interprets this correctly as whitespace.
2.  **Double Backslash (`\\n`):** Creates the literal text characters `\` followed by `n`. Lezer parses this text sequence as a "newline rule".

**Why do we use double backslashes below?**
We use `\\t` and `\\n` purely for **visualization purposes**. When we print the `finalGrammar` string to the console later, we want to read the text `\n` (the definition) rather than seeing an invisible line break.
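
The difference between the two spellings can be checked with plain string operations. This sketch does not call Lezer; it only shows which characters end up in the TypeScript string, and the variable names are illustrative.

```typescript
// Single backslash: the template literal contains a real line break.
const singleBackslash: string = `$[ \n]`;
// Double backslash: the template literal contains the two characters "\" and "n".
const doubleBackslash: string = `$[ \\n]`;

const singleHasLineBreak: boolean = singleBackslash.includes("\n"); // true
const doubleHasLineBreak: boolean = doubleBackslash.includes("\n"); // false

// "$", "[", " ", line break, "]"  versus  "$", "[", " ", "\", "n", "]"
const lengths: number[] = [singleBackslash.length, doubleBackslash.length]; // [5, 6]
```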

Finally, we close the `@tokens` block.

In [None]:
const whitespaceAndClose: string = `
    // 3. Whitespace Definition
    space { $[ \\t\\n\\r]+ }
  }
`;

const tokenDefinitions: string = simpleTokens + complexTokens + whitespaceAndClose;

#### Note on Values

Unlike some other lexing tools, Lezer does not automatically convert the text `"42"` into the integer `42` during this step.

Lezer's job is to produce a **Syntax Tree** containing the raw text segments. Converting these strings into actual numbers (e.g., using `parseInt`) happens in a later step when we process the tree.
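
As a rough sketch of that later step, the conversion could look like the following. The `RawToken` shape and the `numericValue` helper are hypothetical names chosen for illustration; they are not part of Lezer.

```typescript
// Hypothetical minimal token shape: a type name plus the raw source text.
interface RawToken {
  type: string;
  value: string;
}

// Convert the raw text of a Number token to an integer; other tokens have no value.
function numericValue(token: RawToken): number | null {
  return token.type === "Number" ? parseInt(token.value, 10) : null;
}

const fortyTwo: number | null = numericValue({ type: "Number", value: "42" }); // 42
const noValue: number | null = numericValue({ type: "Plus", value: "+" });     // null
```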

### 4. Skip Logic

Next, we tell Lezer which tokens should be ignored in the output. Usually, we want to discard spaces, tabs, and line breaks to keep the result clean.

In [None]:
const skipStrategy: string = `
  @skip { space }
`;

### 5. Assembly and Execution

Now we concatenate all string parts to form the full grammar definition and generate the parser.

In [None]:
const finalGrammar: string = entryPoint + tokenStructure + tokenDefinitions + skipStrategy;

### Inspecting the Complete Grammar

Before we compile the parser, let's verify the final grammar string we have assembled.

**Note on Best Practices:**
In this notebook, we split the grammar into multiple string variables (`entryPoint`, `tokenStructure`, etc.) solely for **educational purposes** to explain each section individually.

In a real-world project, you would typically write the entire grammar in:
1.  A single **Template Literal** (one large backtick string).
2.  Or, more commonly, in a separate file (e.g., `arithmetic.grammar`) which is then compiled by the Lezer CLI tools.
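
For comparison, option 1 might look like the sketch below. It is assumed to contain the same rules as the pieces assembled in this notebook, minus the comments; the variable name `singleGrammar` is illustrative.

```typescript
// Assumption: the same rules as entryPoint + tokenStructure + tokenDefinitions
// + skipStrategy, written as one template literal.
const singleGrammar: string = `
  @top Script { token* }

  token {
    Number |
    Identifier |
    Plus | Minus | Times | Divide |
    LParen | RParen
  }

  @tokens {
    Plus   { "+" }
    Minus  { "-" }
    Times  { "*" }
    Divide { "/" }
    LParen { "(" }
    RParen { ")" }

    Number     { '0' | $[1-9] $[0-9]* }
    Identifier { $[a-zA-Z_] $[a-zA-Z0-9_]* }

    space { $[ \\t\\n\\r]+ }
  }

  @skip { space }
`;
```

Such a string could then be passed to `buildParser` in place of `finalGrammar`.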

In [None]:
finalGrammar;

In [None]:
const parser: LRParser = buildParser(finalGrammar);

Let's test the generated lexer with the following string:

In [None]:
const data: string = `
       3 + 4 * 10 + 007 + (-20) * 2
       42/À
       a
       `;

Here is the input string we will tokenize:

In [None]:
data;

### 6. Generating the Syntax Tree

To verify our lexer logic, we run the generated parser on our input data.

**Function Specification:**
We utilize the `parse` method of the generated parser.
* **Input:** Source string $S \in \Sigma^*$ (where $\Sigma$ is the set of Unicode characters).
* **Output:** A Concrete Syntax Tree (CST) $T$.

After parsing, we initialize a `TreeCursor` $C$. This cursor acts as an iterator that traverses $T$ in a **depth-first, pre-order** manner, so we visit every token in exactly the order in which it appears in the source text.
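
The traversal order can be illustrated on a hand-built stand-in for the tree. The `DemoNode` type and the `preOrder` function below are hypothetical and do not use Lezer's `Tree` or `TreeCursor`; the sketch only shows why a pre-order walk of a flat tree yields the tokens from left to right.

```typescript
// Hypothetical stand-in for a syntax tree node (not Lezer's Tree class).
interface DemoNode {
  name: string;
  children: DemoNode[];
}

// Depth-first, pre-order: visit the node itself first, then its children left to right.
function preOrder(node: DemoNode, out: string[] = []): string[] {
  out.push(node.name);
  for (const child of node.children) {
    preOrder(child, out);
  }
  return out;
}

// A flat tree such as we would expect for the input "3 + 4".
const demoTree: DemoNode = {
  name: "Script",
  children: [
    { name: "Number", children: [] },
    { name: "Plus", children: [] },
    { name: "Number", children: [] },
  ],
};

const visitOrder: string[] = preOrder(demoTree); // ["Script", "Number", "Plus", "Number"]
```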

In [None]:
const tree: Tree = parser.parse(data);
const cursor: TreeCursor = tree.cursor();

### 7. Transforming the Tree into a Token List

To separate the scanning logic from the presentation logic, we define a transformation function `extractTokens`. This function flattens the Lezer CST into a linear sequence of typed objects.

#### 7.1 Data Structure Definition

We define a `Token` as a 5-tuple:
$$ t = ( \text{type}, \text{value}, \text{start}, \text{end}, \text{isError} ) $$

Where:
* $\text{type} \in \text{String}$: The name of the token class (e.g., "Number", "Plus").
* $\text{value} \in \text{String}$: The literal substring from the source code.
* $\text{start}, \text{end} \in \mathbb{N}_0$: The offsets into the source string; $\text{start}$ is inclusive and $\text{end}$ is exclusive.
* $\text{isError} \in \text{Boolean}$: A flag indicating lexical failures.

In [None]:
interface Token {
  type: string;
  value: string;
  start: number;
  end: number;
  isError: boolean;
}
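
To illustrate how the fields relate, here is a hand-written example for the input `3 + 4`. The values are what we would expect from the definitions above, not actual parser output, and the name `DemoToken` is hypothetical (it mirrors the `Token` interface so that the sketch is self-contained).

```typescript
// Hypothetical copy of the Token shape, so this sketch stands on its own.
interface DemoToken {
  type: string;
  value: string;
  start: number;
  end: number;
  isError: boolean;
}

const demoSource: string = "3 + 4";

// Hand-written tokens; the whitespace at offsets 1 and 3 produces no token.
const demoTokens: DemoToken[] = [
  { type: "Number", value: "3", start: 0, end: 1, isError: false },
  { type: "Plus",   value: "+", start: 2, end: 3, isError: false },
  { type: "Number", value: "4", start: 4, end: 5, isError: false },
];

// For every token, value must equal the source slice from start up to, but excluding, end.
const offsetsConsistent: boolean = demoTokens.every(
  (t: DemoToken) => demoSource.substring(t.start, t.end) === t.value
);
```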

#### 7.2 The Extraction Algorithm

**Input:** A Syntax Tree $Tree$ and the Source String $S$.
**Output:** A sequence of Tokens $L = [t_1, t_2, \dots, t_n]$.

**Algorithm:**

1. Initialize an empty list $L \leftarrow []$.
2. Initialize cursor $C$ at the root of $Tree$.
3. **Do** while $C$ can move to the next node:
    * Let $N$ be the name of the current node.
    * **If** $N \in \{ \text{"Script"}, \text{"token"} \}$ (Grammar Wrappers), **Continue**.
    * Extract substring $V = S[C.\text{from} \dots C.\text{to}]$.
    * Determine type: if $N = \text{"⚠"}$, then $type \leftarrow \text{"Error"}$, else $type \leftarrow N$.
    * Construct token $t = (type, V, C.\text{from}, C.\text{to}, N = \text{"⚠"})$.
    * Append $t$ to $L$.
4. **Return** $L$.

In [None]:
function extractTokens(tree: Tree, source: string): Token[] {
  const cursor: TreeCursor = tree.cursor();
  const tokens: Token[] = [];

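  do {
    const typeName = cursor.name;

    // Skip the grammar wrapper nodes; `continue` still evaluates cursor.next() below.
    if (typeName === "Script" || typeName === "token") {
      continue;
    }

    // Slice the token text out of the source. Lezer names its error nodes "⚠".
    const value = source.substring(cursor.from, cursor.to);
    const isError = typeName === "⚠";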

    const token: Token = {
      type: isError ? "Error" : typeName,
      value: value,
      start: cursor.from,
      end: cursor.to,
      isError: isError
    };

    tokens.push(token);

  } while (cursor.next());

  return tokens;
}

### 8. Result Inspection

Finally, we apply our transformation function to the parsed tree and iterate over the resulting typed array to display the token stream.

**Process:**
1.  **Input:** The raw `tree` and `data` string.
2.  **Transformation:** Call `extractTokens(tree, data)` to obtain `Token[]`.
3.  **Output:** Print each token to the console, formatting newlines (`\n`) for visibility.

In [None]:
const tokenList: Token[] = extractTokens(tree, data);

console.log("Token Stream Analysis (Typed):");
console.log("------------------------------");

tokenList.forEach((t: Token) => {
  const displayVal = t.value.replace(/\n/g, "\\n");
  
  if (t.isError) {
    console.log(`Illegal character: '${displayVal}' at pos ${t.start}`);
  } else {
    console.log(`Token(${t.type.padEnd(11)}, '${displayVal}', Pos: ${t.start})`);
  }
});