In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# From Regular Expressions to <span style="font-variant:small-caps;">Fsm</span>s

This notebook shows how a given regular expression $r$
can be transformed into an equivalent non-deterministic finite state machine (NFA) with $\varepsilon$-transitions. It implements the theory that is outlined in Section 4.4 of the lecture notes.

## Declaring the Necessary Types

First, we import the necessary libraries. We continue to use `RecursiveSet` to manage sets of states efficiently.

**System Architecture Note:**
Instead of redefining the NFA structure locally, we import the **Single Source of Truth** definitions (`NFA`, `State`, `Char`, `TransRel`) from our previous notebook (`01-NFA-2-DFA`).

This ensures strict interoperability: The NFAs generated by our Regular Expression parser are guaranteed to be type-compatible with our NFA-to-DFA conversion and minimization algorithms. Since our `State` type allows numbers (`string | number`), the integer states generated during the construction algorithm fit perfectly into our existing type system.

In [None]:
import { RecursiveSet, RecursiveMap, Tuple, Value } from 'recursive-set';
import { NFA, State, Char, TransRel } from "./01-NFA-2-DFA";

### Defining Regular Expressions

The type `RegExp` describes the parse tree of a regular expression. This recursive structure will be the input for the program we develop in this notebook.



To align with the mathematical definition and ensure type safety, we define one class per kind of node. Every class extends the `Tuple` generic from the `recursive-set` library, which allows for structural equality checks.

The definition maps as follows:
- **EmptySet** ($\emptyset$): A tuple containing the literal number `0`.
- **Epsilon** ($\varepsilon$): A tuple containing the literal string `'ε'`.
- **CharNode** ($c \in \Sigma$): A tuple containing a single character (type `Char`).
- **Composite Types:** `Star` (`*`), `Concat` (`⋅`), and `Union` (`+`) are tuples that hold the sub-expressions together with the operator symbol.

In [None]:
// ============================================================================
// AST CLASS DEFINITIONS
// ============================================================================

/**
 * Base class for all Regular Expression Nodes.
 * All nodes are Tuples, ensuring structural equality and hashability.
 */
abstract class RegExpNode<T extends Value[]> extends Tuple<T> {}

// --- Atomic Types ---

/** Represents the Empty Set (∅). */
class EmptySet extends RegExpNode<[0]> {
    constructor() { super(0); }
}

/** Represents Epsilon (ε). */
class Epsilon extends RegExpNode<['ε']> {
    constructor() { super('ε'); }
}

/** Represents a single character (c ∈ Σ). */
class CharNode extends RegExpNode<[Char]> {
    constructor(c: Char) { super(c); }
    get value(): Char { return this.get(0); }
}

// --- Composite Types ---

type UnaryOp = '*';
type BinaryOp = '⋅' | '+';

/**
 * Represents the Kleene Star (r*).
 * We wrap the inner expression.
 */
class Star extends RegExpNode<[RegExp, UnaryOp]> {
    constructor(inner: RegExp) { super(inner, '*'); }
    get inner(): RegExp { return this.get(0); }
}

/**
 * Abstract base for Binary Operations to ensure type safety.
 */
abstract class BinaryRegExp<L extends RegExp, R extends RegExp> 
    extends RegExpNode<[L, BinaryOp, R]> {
    
    constructor(left: L, op: BinaryOp, right: R) {
        super(left, op, right);
    }

    get left(): L { return this.get(0); }
    get right(): R { return this.get(2); }
}

/** Represents Concatenation (r1 ⋅ r2). */
class Concat extends BinaryRegExp<RegExp, RegExp> {
    constructor(left: RegExp, right: RegExp) { super(left, '⋅', right); }
}

/** Represents Union (r1 + r2). */
class Union extends BinaryRegExp<RegExp, RegExp> {
    constructor(left: RegExp, right: RegExp) { super(left, '+', right); }
}

/**
 * The Union Type of all possible Regular Expression nodes.
 * This is the type used in function signatures.
 */
type RegExp = 
    | EmptySet 
    | Epsilon 
    | CharNode 
    | Star 
    | Concat 
    | Union;
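
To make the shape of such a parse tree concrete, the following sketch builds the tree for $(a + b)^* \cdot c$ and prints it back in fully parenthesised form. It is a simplified stand-in and not the notebook's definition: plain tagged objects replace the `Tuple`-based classes, and the names `Regex`, `chr`, and `show` are hypothetical (`Regex` is used so that the sketch does not clash with the built-in `RegExp` outside of a module).

```typescript
// Simplified stand-in for the Tuple-based node classes (assumption: plain tagged objects).
type Regex =
    | { kind: "empty" }
    | { kind: "epsilon" }
    | { kind: "char"; value: string }
    | { kind: "star"; inner: Regex }
    | { kind: "concat"; left: Regex; right: Regex }
    | { kind: "union"; left: Regex; right: Regex };

// Hypothetical helper that builds a character node.
const chr = (value: string): Regex => ({ kind: "char", value });

// Parse tree of (a + b)* ⋅ c
const example: Regex = {
    kind: "concat",
    left: {
        kind: "star",
        inner: { kind: "union", left: chr("a"), right: chr("b") },
    },
    right: chr("c"),
};

// Render a tree in fully parenthesised notation.
function show(r: Regex): string {
    switch (r.kind) {
        case "empty":   return "∅";
        case "epsilon": return "ε";
        case "char":    return r.value;
        case "star":    return `${show(r.inner)}*`;
        case "concat":  return `(${show(r.left)}⋅${show(r.right)})`;
        case "union":   return `(${show(r.left)}+${show(r.right)})`;
    }
}

console.log(show(example)); // ((a+b)*⋅c)
```

The root of the tree is the operator that is applied last, here the concatenation.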

### State Generator

Since we need to generate unique integer states for each new NFA component we build, we use a simple helper class `StateGenerator`.


In [None]:
class StateGenerator {
    private stateCount: number = 0;

    getNewState(): State {
        return ++this.stateCount;
    }
}
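
A short usage sketch shows how the generator behaves. The class is repeated here with `number` standing in for the imported `State` type, so that the snippet runs on its own; the variable names are illustrative only.

```typescript
// Local copy of StateGenerator (assumption: `number` stands in for `State`).
class StateGenerator {
    private stateCount: number = 0;

    getNewState(): number {
        return ++this.stateCount;
    }
}

const gen = new StateGenerator();
const ids = [gen.getNewState(), gen.getNewState(), gen.getNewState()];
console.log(ids); // [ 1, 2, 3 ]  (pre-increment: the first state is 1, not 0)

// A second generator starts counting again, so its states collide with the first one's.
const other = new StateGenerator();
const clash = other.getNewState() === ids[0];
console.log(clash); // true
```

The second half illustrates why all components of one automaton should draw their states from the same generator instance.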

### Helper: Extracting the Single Accepting State

In Thompson's construction, every NFA component we build maintains a structural invariant: it has exactly **one start state** and **one accepting state**.

However, our general `NFA` type defines `A` (the set of accepting states) as a `RecursiveSet<State>`. To connect two NFAs (e.g., for concatenation $f_1 \cdot f_2$), we need to access the specific accepting state of $f_1$ to add an $\varepsilon$-transition from it.

The helper `getOnlyElement` extracts this single state from the set.

In [None]:
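// Returns the element of S. Callers rely on the Thompson invariant that S is a
// singleton; if S had several elements, the first one encountered would be returned.
function getOnlyElement(S: RecursiveSet<State>): State {
    for (const s of S) return s;
    throw new Error("getOnlyElement: expected a singleton set, but the set is empty");
}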
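
If the invariant should be checked rather than assumed, a stricter variant is possible. The sketch below uses a native `Set<number>` instead of `RecursiveSet<State>` so that it is self-contained; the name `getOnlyElementStrict` is hypothetical.

```typescript
// Stricter variant over a native Set (assumption: not the notebook's RecursiveSet).
function getOnlyElementStrict(S: Set<number>): number {
    if (S.size !== 1) {
        throw new Error(`Expected exactly one accepting state, found ${S.size}`);
    }
    for (const s of S) return s;
    throw new Error("Unreachable"); // S.size === 1 guarantees the loop returns
}

console.log(getOnlyElementStrict(new Set([7]))); // 7
```

Unlike the helper above, this version also rejects sets with more than one element instead of silently picking one of them.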

## NFA Construction Functions

The NFA `genEmptyNFA()` is defined as
$$ \langle \{q_0, q_1\}, \Sigma, \{\}, q_0, \{q_1\} \rangle. $$
Note that this NFA has no transitions at all.
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the empty set](./aLeer.jpg)

In [None]:
function genEmptyNFA(gen: StateGenerator, Sigma: RecursiveSet<Char>): NFA {
    const q0 = gen.getNewState();
    const q1 = gen.getNewState();
    
    return {
        Q: new RecursiveSet(q0, q1),
        Σ: Sigma,
        δ: new RecursiveMap<Tuple<[State,Char]>, RecursiveSet<State>>(),
        q0: q0,
        A: new RecursiveSet(q1)
    };
}

The NFA `genEpsilonNFA` is defined as
$$ \langle \{q_0, q_1\}, \Sigma, \{ \langle q_0, \varepsilon \rangle \mapsto \{q_1\} \}, q_0, \{q_1\} \rangle. $$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the empty string](./aEpsilon.jpg)

In [None]:
function genEpsilonNFA(gen: StateGenerator, Sigma: RecursiveSet<Char>): NFA {
    const q0 = gen.getNewState();
    const q1 = gen.getNewState();
    
    const delta = new RecursiveMap<Tuple<[State,Char]>, RecursiveSet<State>>();
    delta.set(new Tuple(q0, "ε"), new RecursiveSet(q1));
    
    return {
        Q: new RecursiveSet(q0, q1),
        Σ: Sigma,
        δ: delta,
        q0: q0,
        A: new RecursiveSet(q1)
    };
}

For a letter $c \in \Sigma$, the NFA `genCharNFA(c)` is defined as
$$ A(c) = \langle \{q_0, q_1\}, \Sigma, \{ \langle q_0, c \rangle \mapsto \{q_1\} \}, q_0, \{q_1\} \rangle. $$
Graphically, this <span style="font-variant:small-caps;">NFA</span> looks as follows:

![NFA recognizing the character c](./aChar.jpg)

In [None]:
function genCharNFA(
    gen: StateGenerator,
    Sigma: RecursiveSet<Char>,
    c: Char,
): NFA {
    const q0 = gen.getNewState();
    const q1 = gen.getNewState();
    
    const delta = new RecursiveMap<Tuple<[State,Char]>, RecursiveSet<State>>();
    delta.set(new Tuple(q0, c), new RecursiveSet(q1));
    
    return {
        Q: new RecursiveSet(q0, q1),
        Σ: Sigma,
        δ: delta,
        q0: q0,
        A: new RecursiveSet(q1)
    };
}

### Helper: Merging Deltas

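When combining NFAs, we often need to merge two transition functions $\delta_1$ and $\delta_2$. Since the state sets of the two NFAs are disjoint, no key $\langle q, c \rangle$ occurs in both maps, so it suffices to copy all entries into a fresh map.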


In [None]:
function copyDelta(d1: TransRel, d2: TransRel): TransRel {
    const newDelta = new RecursiveMap<Tuple<[State, Char]>, RecursiveSet<State>>();    
    for (const [k, v] of d1) newDelta.set(k, v);
    for (const [k, v] of d2) newDelta.set(k, v);
    return newDelta;
}
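
`copyDelta` lets an entry of `d2` overwrite an entry of `d1` with the same key. This is harmless as long as the two automata have disjoint states. For comparison, the following sketch merges two maps by taking the union of the target sets when a key occurs in both. It is an alternative under stated assumptions, not the notebook's method: native `Map` and `Set` with string keys of the form `"q,c"` replace `RecursiveMap` and `Tuple`, and the names `Delta`, `key`, and `mergeDelta` are hypothetical.

```typescript
// Assumption: native collections with string keys "q,c" instead of RecursiveMap/Tuple.
type Delta = Map<string, Set<number>>;
const key = (q: number, c: string): string => `${q},${c}`;

function mergeDelta(d1: Delta, d2: Delta): Delta {
    const merged: Delta = new Map();
    for (const d of [d1, d2]) {
        for (const [k, targets] of d) {
            // Copy into a fresh set so the inputs are never mutated.
            const current = merged.get(k) ?? new Set<number>();
            for (const t of targets) current.add(t);
            merged.set(k, current);
        }
    }
    return merged;
}

const d1: Delta = new Map();
d1.set(key(1, "a"), new Set([2]));
const d2: Delta = new Map();
d2.set(key(1, "a"), new Set([3]));
d2.set(key(3, "b"), new Set([4]));

const merged = mergeDelta(d1, d2);
const fromOneOnA = Array.from(merged.get(key(1, "a")) ?? []).sort((x, y) => x - y);
console.log(fromOneOnA); // [ 2, 3 ]
```

With overwrite semantics the entry for `"1,a"` would contain only state 3.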

### Concatenation

Given two <span style="font-variant:small-caps;">Nfa</span>s `f1` and `f2`, the function `catenate(f1, f2)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if it can be written 
in the form
$$ s = s_1s_2 $$
and $s_1$ is recognized by `f1` and $s_2$ is recognized by `f2`. 

Assume that $f_1$ and $f_2$ have the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_2 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_3, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.
 
Then $\texttt{catenate}(f_1, f_2)$ is defined as:
$$  \bigl\langle Q_1 \cup Q_2, \Sigma, 
   \bigl\{ \langle q_2,\varepsilon\rangle  \mapsto \{q_3\} \bigr\} 
         \cup \delta_1 \cup \delta_2, q_1, \{ q_4 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the concatenation of two languages](./aConcat.jpg)

In [None]:
function catenate(gen: StateGenerator, f1: NFA, f2: NFA): NFA {
    const q1 = f1.q0;
    const q3 = f2.q0;
    const q2 = getOnlyElement(f1.A);
    
    const delta = copyDelta(f1.δ, f2.δ);
    delta.set(new Tuple(q2, 'ε'), new RecursiveSet(q3));
    
    return {
        Q: f1.Q.union(f2.Q),
        Σ: f1.Σ,
        δ: delta,
        q0: q1,
        A: f2.A
    };
}
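
As a worked example, the sketch below concatenates the automata for the letters `a` and `b`. It is a self-contained stand-in under stated assumptions: native `Map` and `Set` with string keys `"q,c"` replace the `recursive-set` types, the state numbers are passed in by hand, and `MiniNFA`, `key`, and `charNFA` are hypothetical names.

```typescript
// Assumption: simplified NFA record with a single accepting state.
type MiniNFA = {
    states: Set<number>;
    delta: Map<string, Set<number>>;
    start: number;
    accept: number;
};
const key = (q: number, c: string): string => `${q},${c}`;

// Two-state automaton q0 --c--> q1.
function charNFA(q0: number, q1: number, c: string): MiniNFA {
    const delta = new Map<string, Set<number>>();
    delta.set(key(q0, c), new Set([q1]));
    return { states: new Set([q0, q1]), delta, start: q0, accept: q1 };
}

// Mirrors catenate: no fresh state, one ε-edge from f1's accepting state to f2's start.
function catenate(f1: MiniNFA, f2: MiniNFA): MiniNFA {
    const delta = new Map<string, Set<number>>();
    for (const [k, v] of f1.delta) delta.set(k, v);
    for (const [k, v] of f2.delta) delta.set(k, v);
    delta.set(key(f1.accept, "ε"), new Set([f2.start]));

    const states = new Set<number>();
    for (const q of f1.states) states.add(q);
    for (const q of f2.states) states.add(q);

    return { states, delta, start: f1.start, accept: f2.accept };
}

const ab = catenate(charNFA(1, 2, "a"), charNFA(3, 4, "b"));
console.log(Array.from(ab.delta.keys())); // [ '1,a', '3,b', '2,ε' ]
console.log(ab.start, ab.accept);         // 1 4
```

The result has the four states of its parts and exactly one additional transition, $\langle 2, \varepsilon \rangle \mapsto \{3\}$.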

### Disjunction

Given two <span style="font-variant:small-caps;">Nfa</span>s `f1` and `f2`, the function `disjunction(f1, f2)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if $s$ is 
recognized by `f1` or by `f2`. 

Assume again that the sets of states of 
$f_1$ and $f_2$ are disjoint and that $f_1$ and $f_2$ have the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_3 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_2, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.

Then $\texttt{disjunction}(f_1, f_2)$ is defined as follows:
$$ \bigl\langle \{ q_0, q_5 \} \cup Q_1 \cup Q_2, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_2\},
                   \langle q_3,\varepsilon\rangle \mapsto \{q_5\}, 
                   \langle q_4,\varepsilon\rangle \mapsto \{q_5\} \bigr\} 
                   \cup \delta_1 \cup \delta_2, q_0, \{ q_5 \} \bigr\rangle
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the disjunction](./aPlus.jpg)


In [None]:
function disjunction(gen: StateGenerator, f1: NFA, f2: NFA): NFA {
    const q1 = f1.q0;
    const q2 = f2.q0;
    const q3 = getOnlyElement(f1.A);
    const q4 = getOnlyElement(f2.A);
    
    const q0 = gen.getNewState();
    const q5 = gen.getNewState();
    
    const delta = copyDelta(f1.δ, f2.δ);
    
    delta.set(new Tuple(q0, 'ε'), new RecursiveSet(q1, q2));
    
    const targetQ5 = new RecursiveSet(q5);
    delta.set(new Tuple(q3, 'ε'), targetQ5);
    delta.set(new Tuple(q4, 'ε'), targetQ5);
    
    return {
        Q: new RecursiveSet(q0, q5).union(f1.Q).union(f2.Q),
        Σ: f1.Σ,
        δ: delta,
        q0: q0,
        A: targetQ5
    };
}
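
As a worked instance, assume a fresh `StateGenerator`, so that $A(a)$ receives the states $1, 2$ and $A(b)$ the states $3, 4$, and the two new states are $q_0 = 5$ and $q_5 = 6$. Under these assumptions the definition yields:

$$ \texttt{disjunction}\bigl(A(a), A(b)\bigr) = \bigl\langle \{5, 6\} \cup \{1, 2\} \cup \{3, 4\}, \Sigma,
   \bigl\{ \langle 5,\varepsilon\rangle \mapsto \{1, 3\},
           \langle 2,\varepsilon\rangle \mapsto \{6\},
           \langle 4,\varepsilon\rangle \mapsto \{6\},
           \langle 1,a\rangle \mapsto \{2\},
           \langle 3,b\rangle \mapsto \{4\} \bigr\}, 5, \{6\} \bigr\rangle
$$

The string $a$ is accepted along the path $5 \to 1 \to 2 \to 6$, the string $b$ along $5 \to 3 \to 4 \to 6$.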

### Kleene Star

Given an <span style="font-variant:small-caps;">Nfa</span> `f`, the function `kleene(f)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if it can be written as
$$ s = s_1 s_2 \cdots s_n $$
and all $s_i$ are recognized by `f`. Note that $n$ might be $0$, in which case $s = \varepsilon$.

If `f` is defined as
$$ f = \langle Q, \Sigma, \delta, q_1, \{ q_2 \} \rangle,
$$
then  `kleene(f)` is defined as follows:
$$ \bigl\langle \{ q_0, q_3 \} \cup Q, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_3\},  
                        \langle q_2,\varepsilon\rangle \mapsto \{q_1, q_3\} \bigr\} 
                \cup \delta, q_0, \{ q_3 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the Kleene star](./aStar.jpg)

In [None]:
function kleene(gen: StateGenerator, f: NFA): NFA {
    const q1 = f.q0;
    const q2 = getOnlyElement(f.A);
    
    const q0 = gen.getNewState();
    const q3 = gen.getNewState();
    
    const delta = new RecursiveMap<Tuple<[State, Char]>, RecursiveSet<State>>();
    for (const [k, v] of f.δ) delta.set(k, v);
    
    const targets = new RecursiveSet(q1, q3);
    
    delta.set(new Tuple(q0, 'ε'), targets);
    delta.set(new Tuple(q2, 'ε'), targets);
    
    return {
        Q: new RecursiveSet(q0, q3).union(f.Q),
        Σ: f.Σ,
        δ: delta,
        q0: q0,
        A: new RecursiveSet(q3)
    };
}
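
As a worked instance, assume a fresh `StateGenerator`, so that $f = A(a)$ has the states $q_1 = 1$ and $q_2 = 2$, and the two new states are $q_0 = 3$ and $q_3 = 4$. Under these assumptions the definition yields:

$$ \texttt{kleene}\bigl(A(a)\bigr) = \bigl\langle \{3, 4\} \cup \{1, 2\}, \Sigma,
   \bigl\{ \langle 3,\varepsilon\rangle \mapsto \{1, 4\},
           \langle 2,\varepsilon\rangle \mapsto \{1, 4\},
           \langle 1,a\rangle \mapsto \{2\} \bigr\}, 3, \{4\} \bigr\rangle
$$

The empty string is accepted via $3 \to 4$, the string $a$ via $3 \to 1 \to 2 \to 4$, and each further $a$ adds one pass through the loop $2 \to 1 \to 2$.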

## Main Class: RegExp2NFA

We now bundle the construction logic into the `RegExp2NFA` class. This class acts as the **Compiler** that transforms the abstract syntax tree of a regular expression into an executable NFA.

**Key Implementation Details:**

1.  **State Management:** The class holds a persistent `StateGenerator` instance. This ensures that every state generated during the recursive process gets a globally unique identifier, preventing collisions between different parts of the automaton.
2.  **Declarative Logic:** The `toNFA` method dispatches on the class of the node with `instanceof` checks. Each check narrows the type of `r`, so the accessors `r.value`, `r.inner`, `r.left`, and `r.right` can be used without unsafe type assertions (`as ...`).
3.  **Recursion (Inductive Step):** For composite expressions (Star, Concat, Union), the method first converts the sub-expressions (e.g., `r.left`) into NFAs recursively and then combines them using the Thompson construction functions (`kleene`, `catenate`, `disjunction`) defined earlier.
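
The construction functions above allocate states in a fixed pattern: each atomic expression creates two states, `disjunction` and `kleene` add two fresh states each, and `catenate` adds none. The size of the resulting automaton can therefore be predicted from the parse tree alone. The sketch below does this on a simplified stand-in for the node classes (plain tagged objects; the names `Regex` and `countStates` are hypothetical).

```typescript
// Assumption: plain tagged objects instead of the Tuple-based node classes.
type Regex =
    | { kind: "empty" }
    | { kind: "epsilon" }
    | { kind: "char"; value: string }
    | { kind: "star"; inner: Regex }
    | { kind: "concat"; left: Regex; right: Regex }
    | { kind: "union"; left: Regex; right: Regex };

// Number of states the construction allocates for r.
function countStates(r: Regex): number {
    switch (r.kind) {
        case "empty":
        case "epsilon":
        case "char":
            return 2;                                               // q0 and q1
        case "star":
            return countStates(r.inner) + 2;                        // new start and accept
        case "concat":
            return countStates(r.left) + countStates(r.right);      // no new state
        case "union":
            return countStates(r.left) + countStates(r.right) + 2;  // new start and accept
    }
}

// (a + b)* ⋅ c
const example: Regex = {
    kind: "concat",
    left: {
        kind: "star",
        inner: {
            kind: "union",
            left: { kind: "char", value: "a" },
            right: { kind: "char", value: "b" },
        },
    },
    right: { kind: "char", value: "c" },
};

console.log(countStates(example)); // 10
```

The number of states is thus linear in the size of the regular expression.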

In [None]:
class RegExp2NFA {
    private gen: StateGenerator;
    private Σ: RecursiveSet<Char>;

    constructor(Σ: RecursiveSet<Char>) {
        this.Σ = Σ;
        this.gen = new StateGenerator();
    }

    public toNFA(r: RegExp): NFA {
        if (r instanceof EmptySet) {
            return genEmptyNFA(this.gen, this.Σ);
        }

        if (r instanceof Epsilon) {
            return genEpsilonNFA(this.gen, this.Σ);
        }

        if (r instanceof CharNode) {
            return genCharNFA(this.gen, this.Σ, r.value);
        }
        if (r instanceof Star) {
            return kleene(this.gen, this.toNFA(r.inner));
        }

        if (r instanceof Concat) {
            return catenate(
                this.gen,
                this.toNFA(r.left),
                this.toNFA(r.right)
            );
        }

        if (r instanceof Union) {
            return disjunction(
                this.gen,
                this.toNFA(r.left),
                this.toNFA(r.right)
            );
        }

        throw new Error(`Unknown RegExp Node: ${r}`);
    }
}
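
To see the whole pipeline at work without the `recursive-set` dependency, the following sketch repeats the construction on simplified types and adds a small simulator that decides acceptance by tracking $\varepsilon$-closures. Everything in it is an illustrative assumption and not the notebook's API: the names `Regex`, `MiniNFA`, `build`, `closure`, and `accepts`, the string keys `"q,c"`, and the module-level counter that plays the role of `StateGenerator`.

```typescript
type Regex =
    | { kind: "empty" } | { kind: "epsilon" } | { kind: "char"; value: string }
    | { kind: "star"; inner: Regex }
    | { kind: "concat"; left: Regex; right: Regex }
    | { kind: "union"; left: Regex; right: Regex };

type MiniNFA = { delta: Map<string, number[]>; start: number; accept: number };

const EPS = "ε";
const key = (q: number, c: string): string => `${q},${c}`;

let counter = 0;                        // plays the role of StateGenerator
const fresh = (): number => ++counter;

function build(r: Regex): MiniNFA {
    const delta = new Map<string, number[]>();
    const absorb = (f: MiniNFA): void => { f.delta.forEach((v, k) => delta.set(k, v)); };
    switch (r.kind) {
        case "empty": {
            const start = fresh(), accept = fresh();
            return { delta, start, accept };            // no transitions at all
        }
        case "epsilon": {
            const start = fresh(), accept = fresh();
            delta.set(key(start, EPS), [accept]);
            return { delta, start, accept };
        }
        case "char": {
            const start = fresh(), accept = fresh();
            delta.set(key(start, r.value), [accept]);
            return { delta, start, accept };
        }
        case "concat": {
            const f1 = build(r.left), f2 = build(r.right);
            absorb(f1); absorb(f2);
            delta.set(key(f1.accept, EPS), [f2.start]);
            return { delta, start: f1.start, accept: f2.accept };
        }
        case "union": {
            const f1 = build(r.left), f2 = build(r.right);
            const start = fresh(), accept = fresh();
            absorb(f1); absorb(f2);
            delta.set(key(start, EPS), [f1.start, f2.start]);
            delta.set(key(f1.accept, EPS), [accept]);
            delta.set(key(f2.accept, EPS), [accept]);
            return { delta, start, accept };
        }
        case "star": {
            const f = build(r.inner);
            const start = fresh(), accept = fresh();
            absorb(f);
            delta.set(key(start, EPS), [f.start, accept]);
            delta.set(key(f.accept, EPS), [f.start, accept]);
            return { delta, start, accept };
        }
    }
}

// All states reachable from `states` using only ε-transitions.
function closure(f: MiniNFA, states: Set<number>): Set<number> {
    const result = new Set<number>(states);
    const todo = Array.from(states);
    while (todo.length > 0) {
        const q = todo.pop() as number;
        for (const p of f.delta.get(key(q, EPS)) ?? []) {
            if (!result.has(p)) { result.add(p); todo.push(p); }
        }
    }
    return result;
}

function accepts(f: MiniNFA, input: string): boolean {
    let current = closure(f, new Set([f.start]));
    for (const c of input) {
        const next = new Set<number>();
        for (const q of current) {
            for (const p of f.delta.get(key(q, c)) ?? []) next.add(p);
        }
        current = closure(f, next);
    }
    return current.has(f.accept);
}

// (a + b)* ⋅ c
const chr = (value: string): Regex => ({ kind: "char", value });
const nfa = build({
    kind: "concat",
    left: { kind: "star", inner: { kind: "union", left: chr("a"), right: chr("b") } },
    right: chr("c"),
});

console.log(accepts(nfa, "abac")); // true
console.log(accepts(nfa, "c"));    // true
console.log(accepts(nfa, "ab"));   // false
```

Because the counter is shared by all recursive calls, the sub-automata never reuse a state number, which is the property that the persistent `StateGenerator` provides in `RegExp2NFA`.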

The notebook `04-Test-Regexp-2-NFA.ipynb` can be used to test the functions implemented in this notebook.