In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Term Rewriting System for Regular Expressions

In this notebook, we implement a **Term Rewriting System** (TRS) to simplify complex regular expressions. The expressions generated by algorithms like *State Elimination* (see Module 05) often contain redundancies (e.g., $R + \emptyset$, $\varepsilon \cdot R$).

We define an algebraic simplification engine based on the axioms of **Regular Algebra** (Kleene Algebra).

## Data Structures: Type Extension Architecture

To ensure seamless integration with our existing toolchain (Parser $\to$ NFA $\to$ DFA), we **reuse the strict `RegExp` type** defined in `03-RegExp-2-NFA`.

However, a rewriting system introduces a new concept: **Variables**.
A rule like $R + \emptyset \to R$ uses $R$ as a placeholder for *any* sub-expression. Since our strict `RegExp` type only allows characters from the alphabet $\Sigma$, we define an **Extended Type System**:

1.  **`RegExp` (Strict):** The concrete type used by NFAs (Imported from Module 03).
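2.  **Patterns (Extended):** `RegExp` trees that may additionally contain **`Variable`** nodes (placeholders named like "R", "S"). In the code below, `Variable` is imported from the same module as the other node classes, so rule patterns and concrete terms share the `RegExp` type.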
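
The split between concrete terms and patterns can be sketched with a small tagged-union stand-in. The `Term` type and the `isConcrete` helper are hypothetical names introduced for illustration; they are not the classes from Module 03, which are not shown here.

```typescript
// Hypothetical stand-in for the node classes of Module 03 (assumption).
type Term =
    | { kind: "empty" }
    | { kind: "eps" }
    | { kind: "char"; value: string }
    | { kind: "var"; name: string } // only meaningful inside rule patterns
    | { kind: "star"; inner: Term }
    | { kind: "concat"; left: Term; right: Term }
    | { kind: "union"; left: Term; right: Term };

// A term is concrete (usable by the NFA pipeline) if it contains no variables.
function isConcrete(t: Term): boolean {
    switch (t.kind) {
        case "var":
            return false;
        case "star":
            return isConcrete(t.inner);
        case "concat":
        case "union":
            return isConcrete(t.left) && isConcrete(t.right);
        default:
            return true; // empty, eps, char
    }
}

// The pattern R + 0 contains a variable; the term a + 0 does not.
const rulePattern: Term = {
    kind: "union",
    left: { kind: "var", name: "R" },
    right: { kind: "empty" },
};
const concreteTerm: Term = {
    kind: "union",
    left: { kind: "char", value: "a" },
    right: { kind: "empty" },
};
```

Under this sketch, only terms for which `isConcrete` returns `true` would be handed to the NFA construction; patterns stay inside the rewriting engine.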

In [6]:
import { 
    RegExp, Variable, EmptySet, Epsilon, CharNode, Star, Concat, Union, RegExpNode 
} from "./03-RegExp-2-NFA";

// A substitution maps variable names to complete RegExp trees
type Subst = Map<string, RegExp>;
type Rule = [RegExp, RegExp];

## Pattern Matching Engine

The core of a rewriting system is **Pattern Matching**. We need to determine if a specific concrete term (e.g., `(a + 0)`) matches a defined rule pattern (e.g., `(R + 0)`).

The engine consists of four main functions, all operating on `RegExp` trees (patterns may additionally contain `Variable` nodes):

1.  **`deepEquals`**: Recursively checks if two ASTs are structurally identical. It compares the node classes before the contents, so a `CharNode` `'a'` never equals a `Variable` `'a'`.
2.  **`match`**: Checks if a `term` matches a `pattern`.
    * If the pattern node is a **Variable**, it binds the corresponding sub-tree of the term to that variable in a `Substitution` map.
    * If the pattern is a structure, it verifies that the term has the exact same structure and recursively matches children.
3.  **`apply`**: Reconstructs the term by replacing variables in the Right-Hand-Side (RHS) of a rule with their bound values from the substitution map.
4.  **`rewrite`**: Combines matching and application to perform a single transformation step.
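
The binding behaviour of `match` is easiest to see on a non-linear pattern such as $R + R$, where the same variable occurs twice. The following is a minimal, self-contained sketch using a hypothetical tagged-union `Term` type instead of the Module 03 classes; the names `equal`, `matchTerm`, `v`, `ch`, and `union` are assumptions made for this example only.

```typescript
type Term =
    | { kind: "empty" }
    | { kind: "eps" }
    | { kind: "char"; value: string }
    | { kind: "var"; name: string }
    | { kind: "star"; inner: Term }
    | { kind: "concat"; left: Term; right: Term }
    | { kind: "union"; left: Term; right: Term };

type Subst = Map<string, Term>;

// Structural equality on terms.
function equal(a: Term, b: Term): boolean {
    if (a.kind !== b.kind) return false;
    if (a.kind === "char" && b.kind === "char") return a.value === b.value;
    if (a.kind === "var" && b.kind === "var") return a.name === b.name;
    if (a.kind === "star" && b.kind === "star") return equal(a.inner, b.inner);
    if ((a.kind === "concat" || a.kind === "union") &&
        (b.kind === "concat" || b.kind === "union")) {
        return equal(a.left, b.left) && equal(a.right, b.right);
    }
    return true; // empty, eps
}

// Matches term t against pattern p, extending the substitution s.
function matchTerm(p: Term, t: Term, s: Subst): boolean {
    if (p.kind === "var") {
        const bound = s.get(p.name);
        if (bound !== undefined) return equal(bound, t); // must agree with earlier binding
        s.set(p.name, t);
        return true;
    }
    if (p.kind !== t.kind) return false;
    if (p.kind === "char" && t.kind === "char") return p.value === t.value;
    if (p.kind === "star" && t.kind === "star") return matchTerm(p.inner, t.inner, s);
    if ((p.kind === "concat" || p.kind === "union") &&
        (t.kind === "concat" || t.kind === "union")) {
        return matchTerm(p.left, t.left, s) && matchTerm(p.right, t.right, s);
    }
    return true; // empty, eps
}

// Small constructors to keep the examples readable.
const v = (name: string): Term => ({ kind: "var", name });
const ch = (value: string): Term => ({ kind: "char", value });
const union = (left: Term, right: Term): Term => ({ kind: "union", left, right });

const rPlusR = union(v("R"), v("R"));

const s1: Subst = new Map();
const matchesSame = matchTerm(rPlusR, union(ch("a"), ch("a")), s1);      // R is bound to a
const matchesDiff = matchTerm(rPlusR, union(ch("a"), ch("b")), new Map()); // second R disagrees
```

Here `matchesSame` is `true` with `R` bound to `a`, while `matchesDiff` is `false` because the second occurrence of `R` would have to equal `b`. This is why the idempotence rule $R + R \to R$ fires only on syntactically identical operands.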

In [7]:
function deepEquals(a: RegExp, b: RegExp): boolean {
    if (a === b) return true;
    if (a.constructor !== b.constructor) return false;

    if (a instanceof CharNode && b instanceof CharNode) return a.value === b.value;
    if (a instanceof Variable && b instanceof Variable) return a.name === b.name;
    
    // Recursive Checks
    if (a instanceof Star && b instanceof Star) {
        return deepEquals(a.inner, b.inner);
    }
    if ((a instanceof Concat && b instanceof Concat) || 
        (a instanceof Union && b instanceof Union)) {
        return deepEquals(a.left, b.left) && deepEquals(a.right, b.right);
    }
    
    // Singletons (EmptySet, Epsilon) match if constructors match
    return true; 
}

function match(pattern: RegExp, term: RegExp, substitution: Subst): boolean {
    // A. Variable Match (The logic hook)
    if (pattern instanceof Variable) {
        const name = pattern.name;
        if (substitution.has(name)) {
            // Variable already bound: must match the existing binding exactly
            return deepEquals(substitution.get(name)!, term);
        } else {
            // Bind variable
            substitution.set(name, term);
            return true;
        }
    }

    // B. Structure Match (Must be same class)
    if (pattern.constructor !== term.constructor) return false;

    // C. Recursive Descent
    if (pattern instanceof Star && term instanceof Star) {
        return match(pattern.inner, term.inner, substitution);
    }
    
    if ((pattern instanceof Concat && term instanceof Concat) || 
        (pattern instanceof Union && term instanceof Union)) {
        return match(pattern.left, term.left, substitution) &&
               match(pattern.right, term.right, substitution);
    }

    if (pattern instanceof CharNode && term instanceof CharNode) {
        return pattern.value === term.value;
    }

    return true; // EmptySet, Epsilon
}

function apply(term: RegExp, substitution: Subst): RegExp {
    if (term instanceof Variable) {
        return substitution.has(term.name) ? substitution.get(term.name)! : term;
    }

    // Rebuild the node with substituted children.
    // No casts are needed because the constructors accept any RegExp.
    if (term instanceof Star) {
        return new Star(apply(term.inner, substitution));
    }

    if (term instanceof Concat) {
        return new Concat(
            apply(term.left, substitution),
            apply(term.right, substitution)
        );
    }

    if (term instanceof Union) {
        return new Union(
            apply(term.left, substitution),
            apply(term.right, substitution)
        );
    }

    return term; // Primitives unchanged
}

function rewrite(term: RegExp, rule: Rule): { simplified: boolean, result: RegExp } {
    const [lhs, rhs] = rule;
    const substitution: Subst = new Map();

    if (match(lhs, term, substitution)) {
        return { simplified: true, result: apply(rhs, substitution) };
    }
    return { simplified: false, result: term };
}

## Algebraic Rules (Axioms)

We define the **Axioms of Regular Algebra** (Kleene Algebra) as a list of rewrite rules `LHS -> RHS`.

**Common Simplifications:**
* **Identity:** $R + 0 \to R$, $\varepsilon \cdot R \to R$
* **Annihilation:** $R \cdot 0 \to 0$
* **Idempotence:** $R + R \to R$
* **Kleene Star:** $\varepsilon + R \cdot R^* \to R^*$ (Arden's Rule lemma)
* **Associativity:** $R + (S + T) \to (R + S) + T$ (Standardizing structure to the left; the same is done for concatenation)
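
To see why the Kleene star rule in the list above is sound, one can unfold the usual language semantics $L(R^*) = \bigcup_{n \ge 0} L(R)^n$ (a standard definition, assumed here rather than taken from this notebook):

$$
\begin{aligned}
L(\varepsilon + R \cdot R^*) &= \{\varepsilon\} \cup L(R) \cdot \bigcup_{n \ge 0} L(R)^n \\
&= L(R)^0 \cup \bigcup_{n \ge 1} L(R)^n \\
&= \bigcup_{n \ge 0} L(R)^n = L(R^*)
\end{aligned}
$$

Both sides denote the same language, so replacing the left-hand side by the shorter right-hand side does not change what the expression matches.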

We use the helper `T(...)` to define these rules compactly.

In [9]:
// An "input" to our DSL can be a ready-made RegExp object,
// the number 0, the epsilon symbol, or a string (Char/Variable).
type DSLInput = RegExp | 0 | string;

// 1. Overloads: define the allowed signatures
function T(arg: DSLInput): RegExp;
function T(inner: DSLInput, op: '*'): RegExp;
function T(left: DSLInput, op: '+' | '⋅', right: DSLInput): RegExp;

// 2. Implementation: the logic that covers all cases
function T(arg0: DSLInput, arg1?: string, arg2?: DSLInput): RegExp {
    
    // Case 1: Atom (only arg0 is set)
    if (arg1 === undefined) {
        if (arg0 instanceof RegExpNode) return arg0 as RegExp; // Pass-through
        if (arg0 === 0) return new EmptySet();
        if (arg0 === "ε") return new Epsilon();
        
        if (typeof arg0 === "string") {
            // Convention: Uppercase = Variable, Lowercase = Char
            // (simple check for a single uppercase letter A-Z)
            return (arg0.length === 1 && arg0 >= "A" && arg0 <= "Z")
                ? new Variable(arg0)
                : new CharNode(arg0);
        }
        
        throw new Error(`Invalid Atom: ${arg0}`);
    }

    // Case 2: Kleene Star (arg1 is '*')
    if (arg1 === '*') {
        // Call T recursively to ensure that arg0 becomes a RegExp
        return new Star(T(arg0));
    }

    // Case 3: Binary operation (arg1 is '+' or '⋅', arg2 must exist)
    if ((arg1 === '+' || arg1 === '⋅') && arg2 !== undefined) {
        const left = T(arg0);
        const right = T(arg2); // TS accepts this because arg2 was checked against undefined
        
        if (arg1 === '+') return new Union(left, right);
        if (arg1 === '⋅') return new Concat(left, right);
    }

    throw new Error(`Invalid Rule Template: ${arg0}, ${arg1}, ${arg2}`);
}

In [11]:
// === THE RULES ===

function getRules(): Rule[] {
    const rules: Rule[] = [
        [T("R", "+", 0), T("R")], 
        [T(0, "+", "R"), T("R")],
        [T("R", "+", "R"), T("R")],

        [T("ε", "+", T("R", "*")), T("R", "*")],
        [T(T("R", "*"), "+", "ε"), T("R", "*")],
        [T("ε", "+", T("R", "⋅", T("R", "*"))), T("R", "*")],
        [T("ε", "+", T(T("R", "*"), "⋅", "R")), T("R", "*")],
        [T(T("R", "⋅", T("R", "*")), "+", "ε"), T("R", "*")],
        [T(T(T("R", "*"), "⋅", "R"), "+", "ε"), T("R", "*")],

        [T("S", "+", T("S", "⋅", "T")), T("S", "⋅", T("ε", "+", "T"))],
        [T("S", "+", T("T", "⋅", "S")), T(T("ε", "+", "T"), "⋅", "S")],

        [T(0, "⋅", "R"), T(0)],
        [T("R", "⋅", 0), T(0)],
        [T("ε", "⋅", "R"), T("R")],
        [T("R", "⋅", "ε"), T("R")],

        [T(T("ε", "+", "R"), "⋅", T("R", "*")), T("R", "*")],
        [T(T("R", "+", "ε"), "⋅", T("R", "*")), T("R", "*")],
        [T(T("R", "*"), "⋅", T("R", "+", "ε")), T("R", "*")],
        [T(T("R", "*"), "⋅", T("ε", "+", "R")), T("R", "*")],

        [T(0, "*"), T("ε")],
        [T("ε", "*"), T("ε")],

        [T(T("ε", "+", "R"), "*"), T("R", "*")],
        [T(T("R", "+", "ε"), "*"), T("R", "*")],

        [T("R", "+", T("S", "+", "T")), T(T("R", "+", "S"), "+", "T")],
        [T("R", "⋅", T("S", "⋅", "T")), T(T("R", "⋅", "S"), "⋅", "T")],

        [
            T(T("R", "⋅", T("S", "*")), "⋅", T("ε", "+", "S")),
            T("R", "⋅", T("S", "*")),
        ],
    ];
    return rules;
}

## Main Simplification Algorithm

The simplification process uses a **Fixpoint Iteration** strategy combined with recursive descent.

### Algorithm `simplifyOnce`
This function performs a single pass over the AST:
1.  **Check Current Node:** It tries to apply every rule in the catalogue to the current term. If a rule matches (e.g., $R \cdot \varepsilon \to R$), it returns the transformed result immediately.
2.  **Recurse:** If no rule matches at the top level, it recurses into the children (`left`, `right`, or `inner`) to simplify sub-expressions.

### Algorithm `simplify`
Repeatedly calls `simplifyOnce` until the term stabilizes (i.e., `deepEquals(current, next)` holds) or a safety limit of 1000 iterations is exceeded. This ensures that simplifications propagate correctly up the tree (e.g., simplifying a leaf node $0^* \to \varepsilon$ might trigger a subsequent parent node simplification $R \cdot \varepsilon \to R$).
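
The need for more than one pass can be traced on the term $a \cdot \emptyset^*$. The sketch below is a simplified illustration, not the notebook's engine: it hard-codes only the two rules $\emptyset^* \to \varepsilon$ and $R \cdot \varepsilon \to R$ instead of using the rule catalogue, and the names `Term`, `rewriteRoot`, `simplifyOnceSketch`, `show`, and `trace` are assumptions.

```typescript
type Term =
    | { kind: "empty" }
    | { kind: "eps" }
    | { kind: "char"; value: string }
    | { kind: "star"; inner: Term }
    | { kind: "concat"; left: Term; right: Term };

// Two hard-coded rules; returns null if neither applies at the root.
function rewriteRoot(t: Term): Term | null {
    if (t.kind === "star" && t.inner.kind === "empty") return { kind: "eps" }; // ∅* → ε
    if (t.kind === "concat" && t.right.kind === "eps") return t.left;          // R⋅ε → R
    return null;
}

// One pass: rewrite at the root if possible, otherwise descend into the children.
function simplifyOnceSketch(t: Term): Term {
    const rewritten = rewriteRoot(t);
    if (rewritten !== null) return rewritten;
    if (t.kind === "star") {
        return { kind: "star", inner: simplifyOnceSketch(t.inner) };
    }
    if (t.kind === "concat") {
        return {
            kind: "concat",
            left: simplifyOnceSketch(t.left),
            right: simplifyOnceSketch(t.right),
        };
    }
    return t;
}

// Minimal printer, sufficient for the terms used in this example.
function show(t: Term): string {
    if (t.kind === "empty") return "∅";
    if (t.kind === "eps") return "ε";
    if (t.kind === "char") return t.value;
    if (t.kind === "star") return `(${show(t.inner)})*`;
    return `${show(t.left)}⋅${show(t.right)}`;
}

// Records every intermediate term until two consecutive passes agree.
// Comparing printed forms is enough for this tiny example.
function trace(t: Term): string[] {
    const steps = [show(t)];
    for (let i = 0; i < 100; i++) {
        const next = simplifyOnceSketch(t);
        if (show(next) === show(t)) break;
        steps.push(show(next));
        t = next;
    }
    return steps;
}

const startTerm: Term = {
    kind: "concat",
    left: { kind: "char", value: "a" },
    right: { kind: "star", inner: { kind: "empty" } },
};
const steps = trace(startTerm);
// steps: ["a⋅(∅)*", "a⋅ε", "a"]
```

The first pass finds no rule at the root and only rewrites the child $\emptyset^*$; the parent rule becomes applicable in the second pass, and a third pass confirms the fixpoint.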

In [14]:
// ============================================================================
// 5. MAIN SIMPLIFICATION ALGORITHM
// ============================================================================

function simplifyOnce(term: RegExp, rules: Rule[]): RegExp {
    // 1. Try top-level rewrite
    for (const rule of rules) {
        const { simplified, result } = rewrite(term, rule);
        if (simplified) return result;
    }

    // 2. Recurse into children (Inductive Step)
    // No casting needed: The constructors accept 'RegExp', which includes all nodes.
    
    if (term instanceof Star) {
        return new Star(simplifyOnce(term.inner, rules));
    }

    if (term instanceof Concat) {
        return new Concat(
            simplifyOnce(term.left, rules),
            simplifyOnce(term.right, rules)
        );
    }

    if (term instanceof Union) {
        return new Union(
            simplifyOnce(term.left, rules),
            simplifyOnce(term.right, rules)
        );
    }

    // 3. Base cases (EmptySet, Epsilon, CharNode, Variable) are leaves
    return term;
}

function simplify(t: RegExp): RegExp {
    const rules = getRules();
    let current = t;
    let iterations = 0;
    const MAX = 1000;

    // Fixed-Point Iteration
    while (true) {
        const next = simplifyOnce(current, rules);
        if (deepEquals(current, next)) return next;

        current = next;
        if (++iterations > MAX) {
            console.warn("Rewrite limit reached");
            return current;
        }
    }
}

## Pretty Printing

Finally, we convert the internal AST back into a human-readable string format.
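The function `regexpToString` dispatches on the node class to render:
* **Atoms:** $\emptyset$ for `EmptySet`, $\varepsilon$ for `Epsilon`, and the character itself for `CharNode`.
* **Variables:** The variable name directly.
* **Precedence:** A starred sub-expression is parenthesized unless it is atomic (e.g., `(a+b)*` needs parentheses, `a*` does not). Unions are always parenthesized, which keeps concatenations unambiguous at the cost of some redundant parentheses.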
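
One common way to emit only the parentheses that are strictly required is to assign each operator a precedence level and wrap a sub-expression only when it binds more weakly than its context demands. The following sketch illustrates that technique on a hypothetical tagged-union `Term` type; `prec` and `render` are assumed names and this is not the notebook's `regexpToString`.

```typescript
type Term =
    | { kind: "char"; value: string }
    | { kind: "star"; inner: Term }
    | { kind: "concat"; left: Term; right: Term }
    | { kind: "union"; left: Term; right: Term };

// Higher number = binds tighter.
function prec(t: Term): number {
    if (t.kind === "union") return 1;
    if (t.kind === "concat") return 2;
    if (t.kind === "star") return 3;
    return 4; // char
}

// Renders t; wraps it in parentheses if it binds more weakly than minPrec.
function render(t: Term, minPrec: number = 1): string {
    let text: string;
    if (t.kind === "char") {
        text = t.value;
    } else if (t.kind === "star") {
        text = render(t.inner, 4) + "*"; // only atoms go unparenthesized under a star
    } else if (t.kind === "concat") {
        text = render(t.left, 2) + render(t.right, 2);
    } else {
        text = render(t.left, 1) + "+" + render(t.right, 1);
    }
    return prec(t) < minPrec ? `(${text})` : text;
}

const ca: Term = { kind: "char", value: "a" };
const cb: Term = { kind: "char", value: "b" };
const cc: Term = { kind: "char", value: "c" };

const starOfUnion = render({ kind: "star", inner: { kind: "union", left: ca, right: cb } });
const starOfChar = render({ kind: "star", inner: ca });
const unionOfConcat = render({ kind: "union", left: ca, right: { kind: "concat", left: cb, right: cc } });
const concatOfUnion = render({ kind: "concat", left: { kind: "union", left: ca, right: cb }, right: cc });
```

With this scheme a top-level union is printed without surrounding parentheses. Note that nested unions or concatenations are printed flat (e.g., `a+b+c`), so the output preserves the language but not the exact tree shape.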

In [15]:
function regexpToString(r: RegExp): string {
    // 1. Atomic Cases
    if (r instanceof EmptySet) return "∅";
    if (r instanceof Epsilon) return "ε";
    if (r instanceof CharNode) return r.value;
    if (r instanceof Variable) return r.name;

    if (r instanceof Star) {
        const inner = regexpToString(r.inner);
        
        const isAtomic = 
            r.inner instanceof CharNode || 
            r.inner instanceof Variable || 
            r.inner instanceof EmptySet ||
            r.inner instanceof Epsilon;
            
        return isAtomic ? `${inner}*` : `(${inner})*`;
    }

    if (r instanceof Concat) {
        return regexpToString(r.left) + regexpToString(r.right);
    }

    if (r instanceof Union) {
        return `(${regexpToString(r.left)}+${regexpToString(r.right)})`;
    }

    return "?";
}