In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting a Deterministic <span style="font-variant:small-caps;">Fsm</span> into a Regular Expression

To implement the mathematical concept of "sets of sets" efficiently in TypeScript, we utilize the RecursiveSet library. This library allows sets to contain other sets as elements and supports using sets as keys in maps based on their structural content, which is essential for constructing the states of our deterministic machine.

In [2]:
import { RecursiveSet } from "recursive-set";

State is the abstract type of the states of an Fsm.

In [3]:
type State = string | number;

In [4]:
type Char = string;

We represent the transition function $\delta$ of the deterministic <span style="font-variant:small-caps;">Fsm</span> by the following type:

In [5]:
type TransRelDet = Map<string, RecursiveSet<State>>;

The deterministic finite state machine (DFA) that is converted in this notebook has the following type.

In [6]:
type DFA = {
    Q: RecursiveSet<RecursiveSet<State>>; // set of states (a set of sets)
    Sigma: RecursiveSet<Char>;
    delta: TransRelDet;
    q0: RecursiveSet<State>;
    A: RecursiveSet<RecursiveSet<State>>;
};

The type `RegExp` defines regular expressions as nested tuples.

In [7]:
type BinaryOp = '⋅' | '+';
type UnaryOp = '*';

In [8]:
type RegExp = number | string | [RegExp, UnaryOp] | [RegExp, BinaryOp, RegExp];
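
As an illustration, the following sketch shows how the expression $(a + b)^* \cdot c$ could be encoded with this tuple representation. The type is named `Rx` here only so that the block does not clash with the built-in `RegExp` when it is run outside the notebook. The helper `show` is hypothetical and not part of the notebook.

```typescript
// Same shape as the notebook's RegExp type, renamed to Rx (assumption)
// so that it does not shadow the built-in RegExp outside the notebook.
type BinaryOp = '⋅' | '+';
type UnaryOp = '*';
type Rx = number | string | [Rx, UnaryOp] | [Rx, BinaryOp, Rx];

// (a + b)* ⋅ c written as a nested tuple
const example: Rx = [[['a', '+', 'b'], '*'], '⋅', 'c'];

// Hypothetical helper: renders a nested tuple as a string.
// Binary nodes are parenthesized, the star is appended to its operand.
function show(r: Rx): string {
    if (typeof r === 'number' || typeof r === 'string') {
        return String(r);
    }
    if (r.length === 2) {
        return `${show(r[0])}*`;
    }
    return `(${show(r[0])} ${r[1]} ${show(r[2])})`;
}

console.log(show(example));  // ((a + b)* ⋅ c)
```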

First, we define `key`, a helper function to generate unique identifiers for transition lookups given a state (or a set of states) and a character.

In [9]:
function key(q: State | RecursiveSet<State>, c: Char): string {
    return `${q.toString()},${c}`;
}
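
The pair is encoded as a string because a JavaScript `Map` compares object keys by identity: a freshly built `[state, char]` array never matches an earlier one, whereas equal strings do. The sketch below demonstrates this with plain string states. The function `keyOf` is a stand-in for `key`, and the sample transition is made up. For set-valued states this approach presumably relies on `toString()` returning the same text for structurally equal sets.

```typescript
// Stand-in for the notebook's `key`, restricted to primitive states (assumption).
function keyOf(q: string | number, c: string): string {
    return `${q.toString()},${c}`;
}

// Array keys are compared by reference, so a lookup with a new array fails.
const byTuple = new Map<[string, string], string>();
byTuple.set(['q0', 'a'], 'q1');
const tupleHit = byTuple.get(['q0', 'a']);          // undefined

// String keys are compared by value, so the lookup succeeds.
const byString = new Map<string, string>();
byString.set(keyOf('q0', 'a'), 'q1');
const stringHit = byString.get(keyOf('q0', 'a'));   // 'q1'

console.log(tupleHit, stringHit);
```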

The function `regexpSum` takes a set $S = \{ r_1, \cdots, r_n \}$ of regular expressions
as its argument.  It returns the regular expression 
$$ r_1 + \cdots + r_n. $$
The regular expression is represented as a nested tuple that uses the operators `+` (for alternatives), `⋅` (for concatenations), and `*` (for repetitions).

In the recursive invocation the function is called with an array instead of a set.  This complicates the type annotation a little and forces us to annotate the argument `S` with the union type `RecursiveSet<RegExp> | RegExp[]`.

In [10]:
function regexpSum(S: RecursiveSet<RegExp> | RegExp[]): RegExp {
    // Array.isArray cleanly distinguishes the two cases.
    // The cast 'as RegExp[]' is needed in case the iterator yields 'unknown'.
    const elems: RegExp[] = Array.isArray(S) 
        ? S 
        : Array.from(S) as RegExp[];

    const n = elems.length;

    if (n === 0) {
        return 0; // empty set
    }
    if (n === 1) {
        return elems[0];
    }

    const [r, ...rest] = elems;
    return [r, '+', regexpSum(rest)];
}
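
To make the shape of the result concrete, here is a minimal sketch of the same recursion restricted to arrays, so that it runs without the `recursive-set` library. The names `Rx` and `sumOf` are assumptions introduced for this example only.

```typescript
// Same shape as the notebook's RegExp type, renamed to Rx (assumption).
type Rx = number | string | [Rx, '*'] | [Rx, '⋅' | '+', Rx];

// Array-only variant of regexpSum (assumption: the set case is left out).
function sumOf(elems: Rx[]): Rx {
    if (elems.length === 0) {
        return 0;               // empty sum: the empty language
    }
    if (elems.length === 1) {
        return elems[0];        // a single expression needs no '+'
    }
    const [r, ...rest] = elems;
    return [r, '+', sumOf(rest)];   // the sum nests to the right
}

console.log(JSON.stringify(sumOf([])));               // 0
console.log(JSON.stringify(sumOf(['a'])));            // "a"
console.log(JSON.stringify(sumOf(['a', 'b', 'c'])));  // ["a","+",["b","+","c"]]
```

The nesting to the right is harmless because `+` is associative.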

The function `rpq` assumes there is some <span style="font-variant:small-caps;">Fsm</span>
$$ F = \langle \texttt{States}, \Sigma, \delta, \texttt{q0}, \texttt{Accepting} \rangle $$
given and takes five arguments:
- `p1` and `p2` are states of the <span style="font-variant:small-caps;">Fsm</span> $F$,
- $\Sigma$ is the alphabet of the <span style="font-variant:small-caps;">Fsm</span>,
- $\delta$ is the transition function of the <span style="font-variant:small-caps;">Fsm</span> $F$, and
- `Allowed` is a subset of the set `States`, which is passed as an array of states.

The function `rpq` computes a regular expression that describes those strings that take the 
<span style="font-variant:small-caps;">Fsm</span> $F$ from the state `p1` to state `p2`.
While $F$ moves from `p1` to `p2`, only states from the set `Allowed` may be visited in between.

The function is defined by recursion on the set `Allowed`.  There are two cases:
- $\texttt{Allowed} = \{\}$.  
  Define `AllChars` as the set of all characters that, when read by $F$ in the state `p1`, cause $F$ to enter the state `p2`:
  $$ \texttt{AllChars} = \{ c \in \Sigma \mid \delta(p_1, c) = p_2 \} $$
  Then we need a further case distinction:
  - $p_1 = p_2$: In this case we have:
    $$ \texttt{rpq}(p_1, p_2, \{\}) := \sum\limits_{c\in\texttt{AllChars}} c \quad + \varepsilon$$
    If $\texttt{AllChars} = \{\}$ the sum $\sum\limits_{c\in\texttt{AllChars}} c$ is to be interpreted as the
    regular expression $\emptyset$ that denotes the empty language. 
    
    Otherwise, if $\texttt{AllChars} = \{c_1,\cdots,c_n\}$ we have
    $\sum\limits_{c\in\texttt{AllChars}} c \quad = c_1 + \cdots + c_n$.
  - $p_1 \not= p_2$: In this case we have:
    $$ \texttt{rpq}(p_1, p_2, \{\}) := \sum\limits_{c\in\texttt{AllChars}} c \quad$$
- $\texttt{Allowed} = \{ q \} \cup \texttt{RestAllowed}$.  In this case we recursively define the following variables:
  1. $\texttt{rp1p2} := \texttt{rpq}(p_1, p_2, \Sigma, \delta, \texttt{RestAllowed})$,
  2. $\texttt{rp1q } := \texttt{rpq}(p_1, q, \Sigma, \delta, \texttt{RestAllowed})$,
  3. $\texttt{rqq }\texttt{ } := \texttt{rpq}(q, q, \Sigma, \delta, \texttt{RestAllowed})$,
  4. $\texttt{rqp2 } := \texttt{rpq}(q, p_2, \Sigma, \delta, \texttt{RestAllowed})$.

  Then we can define:
  $$ \texttt{rpq}(p_1, p_2, \texttt{Allowed}) := \texttt{rp1p2} + \texttt{rp1q} \cdot \texttt{rqq}^* \cdot \texttt{rqp2} $$
  This formula can be understood as follows:  If a string $w$ is read in state $p_1$ and reading this string takes the 
  <span style="font-variant:small-caps;">Fsm</span> $F$ from the state $p_1$ to the state $p_2$ while only visiting states from the set 
  `Allowed` in-between, then there are two cases:
  - Reading $w$ does not visit the state $q$ in-between.  Hence the string $w$ can be described by the regular expression
    `rp1p2`.
  - The string $w$ can be written as $w = t u_1 \cdots u_n v$ where:
    - reading $t$ in the state $p_1$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ into the state $q$,
    - for all $i \in \{1,\cdots,n\}$ reading $u_i$ in the state $q$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ from $q$ to $q$, and
    - reading $v$ in the state $q$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ into the state $p_2$.

    Hence the string $w$ can be described by the regular expression $\texttt{rp1q} \cdot \texttt{rqq}^* \cdot \texttt{rqp2}$.
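
The recursion above can be traced on a small example. The following sketch is a simplified variant, not the notebook's implementation: states are plain numbers compared with `===`, the alphabet is an array, and the toy DFA as well as the names `Rx`, `sumOf`, and `rpqSketch` are made up for illustration.

```typescript
type Rx = number | string | [Rx, '*'] | [Rx, '⋅' | '+', Rx];

// Toy DFA (assumption): states 0 and 1, alphabet {a, b}.
// Reading 'a' always leads to state 1, reading 'b' always leads to state 0.
const sigma: string[] = ['a', 'b'];
const delta = new Map<string, number>([
    ['0,a', 1], ['0,b', 0],
    ['1,a', 1], ['1,b', 0],
]);

// Right-nested sum of a list of expressions; 0 stands for the empty language.
function sumOf(elems: Rx[]): Rx {
    if (elems.length === 0) { return 0; }
    if (elems.length === 1) { return elems[0]; }
    const [r, ...rest] = elems;
    return [r, '+', sumOf(rest)];
}

function rpqSketch(p1: number, p2: number, allowed: number[]): Rx {
    if (allowed.length === 0) {
        // AllChars = { c in Σ | δ(p1, c) = p2 }
        const allChars: Rx[] = sigma.filter(c => delta.get(`${p1},${c}`) === p2);
        const r = sumOf(allChars);
        if (p1 === p2) {
            return ['ε', '+', r];   // staying in place also admits ε
        }
        return r;
    }
    // Remove one state q from the allowed set and recurse.
    const [q, ...rest] = allowed;
    const rp1p2 = rpqSketch(p1, p2, rest);
    const rp1q = rpqSketch(p1, q, rest);
    const rqq = rpqSketch(q, q, rest);
    const rqp2 = rpqSketch(q, p2, rest);
    return [rp1p2, '+', [[rp1q, '⋅', [rqq, '*']], '⋅', rqp2]];
}

console.log(JSON.stringify(rpqSketch(0, 1, [])));   // "a"
console.log(JSON.stringify(rpqSketch(0, 0, [])));   // ["ε","+","b"]
console.log(JSON.stringify(rpqSketch(0, 1, [0])));
```

The last call yields $a + (\varepsilon + b) \cdot (\varepsilon + b)^* \cdot a$, which describes the same language as $b^* \cdot a$. The result is correct but not simplified.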

In [11]:
function rpq(
    p1: RecursiveSet<State>,
    p2: RecursiveSet<State>,
    Sigma: RecursiveSet<Char>,
    delta: TransRelDet,
    Allowed: RecursiveSet<State>[]
): RegExp {
    // Base case: no intermediate states are allowed any more
    if (Allowed.length === 0) {
        // RecursiveSet replaces the built-in Set.
        // The element type is RegExp because every Char is also a RegExp.
        const allChars = new RecursiveSet<RegExp>();

        // AllChars = { c in Σ | δ(p1, c) == p2 }
        for (const symbol of Sigma) {
            const c = symbol as Char;
            // The key has to match the format used in the transition map
            const target = delta.get(key(p1, c));

            // Compare by structure with .equals(): the DFA construction often
            // creates new objects, so equal sets need not be the same reference.
            if (target && (target === p2 || target.equals(p2))) {
                allChars.add(c);
            }
        }

        const r = regexpSum(allChars);

        // If the start state and the target state are equal (same set), ε is allowed as well
        if (p1.equals(p2)) {
            return ['ε', '+', r];
        } else {
            return r;
        }
    }

    // Recursive case: eliminate one intermediate state q from Allowed
    const [q, ...RestAllowed] = Allowed;

    const rp1p2 = rpq(p1, p2, Sigma, delta, RestAllowed);
    const rp1q = rpq(p1, q, Sigma, delta, RestAllowed);
    const rqq = rpq(q, q, Sigma, delta, RestAllowed);
    const rqp2 = rpq(q, p2, Sigma, delta, RestAllowed);

    // Term: (rp1q ⋅ rqq* ⋅ rqp2)
    const loop: RegExp = [rqq, '*'];
    const concat1: RegExp = [rp1q, '⋅', loop];
    const concat2: RegExp = [concat1, '⋅', rqp2];

    // Overall result: rp1p2 + (rp1q ⋅ rqq* ⋅ rqp2)
    return [rp1p2, '+', concat2];
}

The function `dfa2regexp` takes a deterministic <span style="font-variant:small-caps;">Fsm</span> $F$ and computes a regular expression $r$ that describes the same language as $F$, i.e. we have
$$ L(F) = L(r). $$
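
Written as a formula, and assuming $F = \langle Q, \Sigma, \delta, q_0, A \rangle$ as in the type `DFA`, the regular expression $r$ is the sum over all accepting states, where every state in $Q$ is allowed as an intermediate state:
$$ r = \sum\limits_{p \in A} \texttt{rpq}(q_0, p, \Sigma, \delta, Q) $$
If $A = \{\}$, the sum is the regular expression $\emptyset$, which `regexpSum` represents as `0`.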

In [12]:
function dfa2regexp(F: DFA): RegExp {
    const { Q, Sigma, delta, q0, A } = F;

    // List of allowed states: all states of the DFA as an array.
    // Array.from(Q) is cast to the proper element type.
    const allStates = Array.from(Q) as RecursiveSet<State>[];

    // RecursiveSet collects the parts of the regular expression
    const parts = new RecursiveSet<RegExp>();

    // For every accepting state p, compute r_{q0,p}
    for (const acc of A) {
        const p = acc as RecursiveSet<State>;
        const r = rpq(q0, p, Sigma, delta, allStates);
        parts.add(r);
    }

    // The language of the DFA is the sum over all accepting states
    return regexpSum(parts);
}

The notebook `06-Test-DFA-2-RegExp.ipynb` provides a test for the function `dfa2regexp`.