In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting a Deterministic <span style="font-variant:small-caps;">Fsm</span> into a Regular Expression

This notebook implements an algorithm that converts a given DFA back into an equivalent regular expression. The algorithm is based on the **State Elimination** method, expressed here through the recursive path function `rpq`.

## Data Structures and Imports

To maintain strict type safety, we import the core definitions from our previous modules.

**Crucially**, we use `TransRelDet` because in a DFA the transition relation maps each pair of state and character to a single resulting state (which is itself a `DFAState`, i.e. a `RecursiveSet`).

- **`01-NFA-2-DFA`**: For `DFA`, `DFAState` (the sets), and `TransRelDet`.
- **`03-RegExp-2-NFA`**: For the `RegExp` AST and aliases.

In [None]:
import { RecursiveSet, Tuple } from 'recursive-set';
import { DFA, DFAState, State, Char, TransRelDet, key } from "./01-NFA-2-DFA";
import { RegExp, BinaryOp, UnaryOp, EmptySet, Epsilon } from "./03-RegExp-2-NFA";

The function `regexpSum` takes a collection $S = \{ r_1, \cdots, r_n \}$ of regular expressions,
given either as a `RecursiveSet` or as an array, as its argument.  It returns the regular expression
$$ r_1 + \cdots + r_n. $$
If $S$ is empty, it returns `0`, which stands for the regular expression $\emptyset$. The regular expression is represented as a nested tuple that uses the operators `+` (for alternatives), `⋅` (for concatenations), and `*` (for repetitions).


In [None]:
function regexpSum(S: RecursiveSet<RegExp> | RegExp[]): RegExp {
  const elems: readonly RegExp[] = (S instanceof RecursiveSet) ? S.raw : S;
  const n = elems.length;

  if (n === 0) return 0;
  if (n === 1) return elems[0];

  const [r, ...rest] = elems;

  return new Tuple(
    r, 
    '+', 
    regexpSum(rest)
  );
}
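
The right-nested shape of the result may be easier to see on plain arrays. The following is a minimal sketch, not the notebook's implementation: the type `Re` and the function `sumOf` are hypothetical stand-ins that replace `Tuple` and `RecursiveSet` with ordinary arrays so that the snippet runs on its own.

```typescript
// Hypothetical stand-in for the notebook's RegExp type: 0 is the empty set,
// a string is a single character, and a 3-element array is a '+' node.
type Re = 0 | string | [Re, '+', Re];

function sumOf(elems: readonly Re[]): Re {
  if (elems.length === 0) return 0;        // empty sum denotes the empty language
  if (elems.length === 1) return elems[0]; // a single summand needs no operator
  const [r, ...rest] = elems;
  return [r, '+', sumOf(rest)];            // nest the remaining summands to the right
}

const nested = sumOf(['a', 'b', 'c']);
console.log(JSON.stringify(nested)); // ["a","+",["b","+","c"]]
```

The sum of three characters is thus stored as `a + (b + c)`, which mirrors the recursion in `regexpSum`.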

The function `rpq` assumes that some <span style="font-variant:small-caps;">Fsm</span>
$$ F = \langle \texttt{States}, \Sigma, \delta, \texttt{q0}, \texttt{Accepting} \rangle $$
is given. It takes five arguments:
- `p1` and `p2` are states of the <span style="font-variant:small-caps;">Fsm</span> $F$,
- $\Sigma$ is the alphabet of the <span style="font-variant:small-caps;">Fsm</span>,
- $\delta$ is the transition function of the <span style="font-variant:small-caps;">Fsm</span> $F$, and
- `Allowed` is a subset of the set `States`.  In the implementation, `Allowed` is passed as an array of states.

The function `rpq` computes a regular expression that describes those strings that take the 
<span style="font-variant:small-caps;">Fsm</span> $F$ from the state `p1` to the state `p2`.
On the way from `p1` to `p2`, only states in the set `Allowed` may be visited in-between.

The function is defined by recursion on the set `Allowed`.  There are two cases:
- $\texttt{Allowed} = \{\}$.  
  Define `AllChars` as the set of all characters that, when read by $F$ in the state `p1`, cause $F$ to enter the state `p2`:
  $$ \texttt{AllChars} = \{ c \in \Sigma \mid \delta(p_1, c) = p_2 \} $$
  Then we need a further case distinction:
  - $p_1 = p_2$: In this case we have:
    $$ \texttt{rpq}(p_1, p_2, \{\}) := \sum\limits_{c\in\texttt{AllChars}} c \quad + \varepsilon$$
    If $\texttt{AllChars} = \{\}$ the sum $\sum\limits_{c\in\texttt{AllChars}} c$ is to be interpreted as the
    regular expression $\emptyset$ that denotes the empty language. 
    
    Otherwise, if $\texttt{AllChars} = \{c_1,\cdots,c_n\}$ we have
    $\sum\limits_{c\in\texttt{AllChars}} c \quad = c_1 + \cdots + c_n$.
  - $p_1 \not= p_2$: In this case we have:
    $$ \texttt{rpq}(p_1, p_2, \{\}) := \sum\limits_{c\in\texttt{AllChars}} c \quad$$
- $\texttt{Allowed} = \{ q \} \cup \texttt{RestAllowed}$.  In this case we recursively define the following variables:
  1. $\texttt{rp1p2} := \texttt{rpq}(p_1, p_2, \Sigma, \delta, \texttt{RestAllowed})$,
  2. $\texttt{rp1q} := \texttt{rpq}(p_1, q, \Sigma, \delta, \texttt{RestAllowed})$,
  3. $\texttt{rqq} := \texttt{rpq}(q, q, \Sigma, \delta, \texttt{RestAllowed})$,
  4. $\texttt{rqp2} := \texttt{rpq}(q, p_2, \Sigma, \delta, \texttt{RestAllowed})$.

  Then we can define:
  $$ \texttt{rpq}(p_1, p_2, \texttt{Allowed}) := \texttt{rp1p2} + \texttt{rp1q} \cdot \texttt{rqq}^* \cdot \texttt{rqp2} $$
  This formula can be understood as follows:  If a string $w$ is read in state $p_1$ and reading this string takes the 
  <span style="font-variant:small-caps;">Fsm</span> $F$ from the state $p_1$ to the state $p_2$ while only visiting states from the set 
  `Allowed` in-between, then there are two cases:
  - Reading $w$ does not visit the state $q$ in-between.  Hence the string $w$ can be described by the regular expression
    `rp1p2`.
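  - Reading $w$ visits the state $q$ at least once in-between.  Then the string $w$ can be written as $w = t u_1 \cdots u_n v$ for some $n \geq 0$ where:
    - reading $t$ in the state $p_1$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ into the state $q$,
    - for all $i \in \{1,\cdots,n\}$ reading $u_i$ in the state $q$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ from $q$ to $q$, and
    - reading $v$ in the state $q$ takes the <span style="font-variant:small-caps;">Fsm</span> $F$ into the state $p_2$.

    In each of these steps only states from `RestAllowed` are visited in-between.  Hence the string $w$ can be described by the regular expression $\texttt{rp1q} \cdot \texttt{rqq}^* \cdot \texttt{rqp2}$.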
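
To see the recursion at work, consider a small example that is not part of the original notebook: a hypothetical DFA with the states $0$ and $1$, the alphabet $\Sigma = \{a, b\}$, and the transitions $\delta(0,a) = 1$, $\delta(0,b) = 0$, $\delta(1,a) = 1$, $\delta(1,b) = 0$.  Two of the base cases are
$$ \texttt{rpq}(0, 0, \{\}) = b + \varepsilon \quad \text{and} \quad \texttt{rpq}(0, 1, \{\}) = a. $$
Choosing $q = 0$ and $\texttt{RestAllowed} = \{\}$, the recursive case yields
$$ \texttt{rpq}(0, 1, \{0\}) = a + (b + \varepsilon) \cdot (b + \varepsilon)^* \cdot a. $$
This expression denotes the same language as $b^* \cdot a$: any number of $b$s keeps the machine in the state $0$, and a final $a$ takes it into the state $1$.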

In [None]:
function rpq(
  p1: DFAState,
  p2: DFAState,
  Sigma: RecursiveSet<Char>,
  delta: TransRelDet,
  Allowed: readonly DFAState[]
): RegExp {
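  // Base case: no intermediate states are allowed, so only direct transitions count.
  if (Allowed.length === 0) {
    // Collect every character c with delta(p1, c) = p2.
    const allChars: Char[] = [];

    for (const c of Sigma) {
      const target = delta.get(key(p1, c));

      if (target && target.equals(p2)) {
        allChars.push(c);
      }
    }

    const r = regexpSum(allChars);

    // If p1 = p2, the empty string also leads from p1 to p2.
    if (p1.equals(p2)) {
      return new Tuple('ε', '+', r);
    } else {
      return r;
    }
  }

  // Recursive case: pick a state q and split on whether q is visited in-between.
  const [q, ...RestAllowed] = Allowed;

  const rp1p2 = rpq(p1, p2, Sigma, delta, RestAllowed);
  const rp1q  = rpq(p1, q,  Sigma, delta, RestAllowed);
  const rqq   = rpq(q,  q,  Sigma, delta, RestAllowed);
  const rqp2  = rpq(q,  p2, Sigma, delta, RestAllowed);

  // rp1q ⋅ rqq* ⋅ rqp2 describes the paths that visit q at least once.
  const loop = new Tuple(rqq, '*');
  const concat1 = new Tuple(rp1q, '⋅', loop);
  const concat2 = new Tuple(concat1, '⋅', rqp2);

  // rp1p2 describes the paths that never visit q.
  return new Tuple(
    rp1p2, 
    '+', 
    concat2
  );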
}

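The function `dfa2regexp` takes a deterministic <span style="font-variant:small-caps;">Fsm</span> $F$ and computes a regular expression $r$ that describes the same language as $F$, i.e. we have
$$ L(F) = L(r). $$
To this end, it sums the regular expressions $\texttt{rpq}(\texttt{q0}, p, \Sigma, \delta, \texttt{States})$ over all accepting states $p$.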

In [None]:
function dfa2regexp(F: DFA): RegExp {
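  const { Q, Sigma, delta, q0, A } = F;

  // Every state of F may be visited as an intermediate state.
  const allStates = Q.raw;

  const parts: RegExp[] = [];

  // One regular expression per accepting state: all paths from q0 to acc.
  for (const acc of A) {
    const r = rpq(q0, acc, Sigma, delta, allStates);
    parts.push(r);
  }

  // F accepts a string iff the string leads from q0 to some accepting state.
  return regexpSum(parts);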
}
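
The whole construction can be tried out without the notebook's modules. The following is a self-contained sketch under simplifying assumptions, not the implementation above: states are plain numbers, the transition function is a `Map` keyed by `"state,char"`, and regular expressions are nested arrays. The names `Re`, `k`, `sumOf`, `paths`, and `toJs` are hypothetical. The sketch translates the result into a JavaScript regular expression and compares it with a direct simulation of the DFA on all strings up to length 5.

```typescript
// Hypothetical miniature types; none of these names come from the notebook's modules.
type Re = 0 | string | [Re, '*'] | [Re, '+' | '⋅', Re];
type Delta = Map<string, number>;

const k = (q: number, c: string): string => `${q},${c}`;

function sumOf(rs: readonly Re[]): Re {
  if (rs.length === 0) return 0;
  if (rs.length === 1) return rs[0];
  const [r, ...rest] = rs;
  return [r, '+', sumOf(rest)];
}

// Same recursion as rpq, on number states.
function paths(p1: number, p2: number, sigma: readonly string[],
               delta: Delta, allowed: readonly number[]): Re {
  if (allowed.length === 0) {
    const r = sumOf(sigma.filter(c => delta.get(k(p1, c)) === p2));
    return p1 === p2 ? ['ε', '+', r] : r;
  }
  const [q, ...rest] = allowed;
  const loop: Re = [paths(q, q, sigma, delta, rest), '*'];
  const viaQ: Re = [[paths(p1, q, sigma, delta, rest), '⋅', loop], '⋅',
                    paths(q, p2, sigma, delta, rest)];
  return [paths(p1, p2, sigma, delta, rest), '+', viaQ];
}

// Translate the nested arrays into JavaScript regex syntax.
// Assumes that all alphabet characters are plain letters.
function toJs(r: Re): string {
  if (r === 0) return '[^\\s\\S]';                       // matches no string at all
  if (typeof r === 'string') return r === 'ε' ? '' : r;
  if (r.length === 2) return `(?:${toJs(r[0])})*`;
  const [left, op, right] = r;
  return op === '+' ? `(?:${toJs(left)}|${toJs(right)})`
                    : `(?:${toJs(left)}${toJs(right)})`;
}

// Example DFA (an assumption for illustration): accepts the strings ending in 'a'.
const sigma = ['a', 'b'];
const delta: Delta = new Map([
  [k(0, 'a'), 1], [k(0, 'b'), 0],
  [k(1, 'a'), 1], [k(1, 'b'), 0],
]);
const accepting = [1];

const re = sumOf(accepting.map(acc => paths(0, acc, sigma, delta, [0, 1])));
const js = new RegExp(`^${toJs(re)}$`);

// Direct simulation of the DFA, starting in state 0.
const accepts = (w: string): boolean => {
  let q = 0;
  for (const c of w) q = delta.get(k(q, c))!;
  return accepting.includes(q);
};

// Compare both on every string over sigma up to length 5.
let words: string[] = [''];
let agree = true;
for (let len = 0; len <= 5; len++) {
  for (const w of words) {
    if (js.test(w) !== accepts(w)) agree = false;
  }
  const longer: string[] = [];
  for (const w of words) for (const c of sigma) longer.push(w + c);
  words = longer;
}
console.log(agree);
```

A bounded comparison like this is only a sanity check, not a proof of equivalence. Note also that each call of `paths` with a non-empty `allowed` triggers four recursive calls, so the resulting expression grows quickly with the number of states.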

The notebook `06-Test-DFA-2-RegExp.ipynb` provides a test for the function `dfa2regexp`.