In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# From Regular Expressions to <span style="font-variant:small-caps;">Fsm</span>s

This notebook shows how a given regular expression $r$ can be transformed into
an equivalent finite state machine. It implements the theory described in
Section 4.4 of the lecture notes.

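The type `RegExp`, which is defined further below, describes the parse tree of a regular expression.
It is a recursive `TypeScript` type, namely a discriminated *union* of object types.
We start with the type `Char` for single characters and the constant `EPS`, which labels $\varepsilon$-transitions.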

In [2]:
type Char = string;
const EPS = 'ε' as const;

We will represent the states of the `NFA` as integers.

In [3]:
type State = number;

The type `Delta` denotes the transition relation of a non-deterministic finite state machine.
The type `NFA` denotes a non-deterministic finite state machine.

In [4]:
type Delta = Map<string, Set<State>>;

// We represent a non-deterministic finite automaton (NFA) as a tuple-like object
// (but it's clearer to use a structured interface in TypeScript)
interface NFA {
  Q: Set<State>;       // set of states
  Sigma: Set<Char>;    // input alphabet
  delta: Delta;        // transition relation
  q0: State;           // start state
  F: Set<State>;       // set of final (accepting) states
}
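
The key encoding of `Delta` can be illustrated with a minimal sketch. The helpers `key` and `step` are hypothetical names that do not occur elsewhere in this notebook; the sketch only assumes that a pair $\langle q, c \rangle$ is stored under the string `"q,c"`, which is how the code further below fills the map via `delta.set`. A string key is a natural choice here because a `Map` compares array keys by reference, so a tuple `[q, c]` would not be found again.

```typescript
type State = number;
type Delta = Map<string, Set<State>>;

// Hypothetical helper: builds the string key "q,c" that indexes Delta.
function key(q: State, c: string): string {
  return `${q},${c}`;
}

// Hypothetical helper: successor states of q on input c
// (the empty set if no transition is defined).
function step(delta: Delta, q: State, c: string): Set<State> {
  return delta.get(key(q, c)) ?? new Set<State>();
}

const delta: Delta = new Map();
delta.set(key(1, 'a'), new Set([2]));    // ⟨1, a⟩ ↦ {2}
delta.set(key(2, 'ε'), new Set([3, 4])); // ⟨2, ε⟩ ↦ {3, 4}

console.log(Array.from(step(delta, 1, 'a'))); // successors of 1 on 'a': 2
console.log(Array.from(step(delta, 2, 'ε'))); // successors of 2 on ε: 3 and 4
console.log(Array.from(step(delta, 3, 'b'))); // no transition defined: empty
```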

The class `RegExp2NFA`, which is defined further below, administers two member variables:
- `Sigma` is the <em style="color:blue">alphabet</em>, i.e. the set of characters used.
- `StateCount` is a counter that is needed to create <em style="color:blue">unique</em> state names.

Before the class is defined, the next two cells implement a tokenizer and a recursive descent parser that turn a string into a parse tree of type `RegExp`.

In [5]:
type Tok =
  | { t: 'sym'; v: string }
  | { t: '(' | ')' | '+' | '*' | '⋅' | '.' | 'ε' | '∅' };

// Returns true if c is one of the operator characters of the regular expression syntax.
function isOpChar(c: string): boolean {
  return '()+*⋅.'.includes(c);
}

function tokenize(input: string): Tok[] {
  const toks: Tok[] = [];
  let i = 0;
  while (i < input.length) {
    const c = input[i]!;
    if (/\s/.test(c)) {
      i++;
      continue;
    }

    if ('()+*⋅.'.includes(c)) {
      toks.push({ t: c } as Tok);
      i++;
      continue;
    }
    if (c === 'ε' || c === '∅') {
      toks.push({ t: c } as Tok);
      i++;
      continue;
    }
    if (!isOpChar(c)) {
      toks.push({ t: 'sym', v: c });
      i++;
      continue;
    }

    throw new Error(`Unexpected character '${c}'`);
  }

  // implicit concatenation: insert '⋅' between two adjacent atoms, e.g. ab ↦ a⋅b
  const out: Tok[] = [];
  // tokens that can start an atom ('∅' is an atom just like 'ε')
  const isAtomRight = (t?: Tok) =>
    !!t && (t.t === 'sym' || t.t === 'ε' || t.t === '∅' || t.t === '(');
  // tokens that can end an atom
  const isAtomLeft = (t?: Tok) =>
    !!t && (t.t === 'sym' || t.t === 'ε' || t.t === '∅' || t.t === ')' || t.t === '*');
  for (let k = 0; k < toks.length; k++) {
    const a = toks[k]!;
    const b = toks[k + 1];
    out.push(a);
    if (isAtomLeft(a) && isAtomRight(b)) out.push({ t: '⋅' } as Tok);
  }
  return out;
}
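
The implicit concatenation step at the end of `tokenize` can be studied in isolation. The following is a simplified sketch, not the tokenizer itself: it works on single characters instead of tokens, the function name `insertConcat` is hypothetical, and it treats every character that is not an operator as an atom.

```typescript
// Simplified sketch of implicit concatenation on plain characters.
function insertConcat(input: string): string {
  const chars = Array.from(input).filter((c) => !/\s/.test(c));
  // A character can end an atom unless it is '(' or a binary operator.
  const endsAtom = (c: string) => !'(+⋅.'.includes(c);
  // A character can start an atom unless it is ')' or any operator.
  const startsAtom = (c: string) => !')+*⋅.'.includes(c);
  let out = '';
  chars.forEach((c, k) => {
    out += c;
    const following = chars[k + 1];
    if (following !== undefined && endsAtom(c) && startsAtom(following)) {
      out += '⋅';
    }
  });
  return out;
}

console.log(insertConcat('ab*(c+d)')); // a⋅b*⋅(c+d)
```

Note that no `⋅` is inserted in front of `*`, `+` or `)`, because none of these characters can start an atom.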


In [6]:
type RegExp =
  | { kind: 'empty' }
  | { kind: 'epsilon' }
  | { kind: 'sym'; ch: string }
  | { kind: 'concat'; left: RegExp; right: RegExp }
  | { kind: 'union'; left: RegExp; right: RegExp }
  | { kind: 'star'; expr: RegExp };

function parseRegex(expr: string): RegExp {
  const toks = tokenize(expr);
  let pos = 0;
  function peek() {
    return toks[pos];
  }
  function next() {
    return toks[pos++]!;
  }

  function parseE(): RegExp {
    let left = parseT();
    while (peek()?.t === '+') {
      next();
      const right = parseT();
      left = { kind: 'union', left, right };
    }
    return left;
  }

  function parseT(): RegExp {
    let left = parseF();
    while (peek() && (peek()!.t === '⋅' || peek()!.t === '.')) {
      next();
      const right = parseF();
      left = { kind: 'concat', left, right };
    }
    return left;
  }

  function parseF(): RegExp {
    let base = parseA();
    while (peek()?.t === '*') {
      next();
      base = { kind: 'star', expr: base };
    }
    return base;
  }

  function parseA(): RegExp {
    const tok = next();
    if (!tok) throw new Error('Unexpected end of input');
    switch (tok.t) {
      case 'sym':
        return { kind: 'sym', ch: tok.v };
      case 'ε':
        return { kind: 'epsilon' };
      case '∅':
        return { kind: 'empty' };
      case '(': {
        const inner = parseE();
        const close = next();
        if (!close || close.t !== ')')
          throw new Error('Missing closing parenthesis');
        return inner;
      }
      default:
        throw new Error(`Unexpected token: ${tok.t}`);
    }
  }

  const result = parseE();
  if (pos !== toks.length)
    throw new Error('Unexpected characters at end of expression');
  return result;
}

The method `toNFA` of the class `RegExp2NFA` takes a regular expression `r` and returns a finite state machine 
that accepts the language described by `r`.  The regular expression is represented in `TypeScript` as a value of the type `RegExp` as follows:
- The regular expression $\emptyset$ is represented as the object `{ kind: 'empty' }`.
- The regular expression $\varepsilon$ is represented as the object `{ kind: 'epsilon' }`.
- The regular expression $c$ that matches the character $c$ is represented by the object `{ kind: 'sym', ch: c }`.
- The regular expression $r_1 \cdot r_2$  is represented by the object `{ kind: 'concat', left: repr(r1), right: repr(r2) }`.

  Here, and in the following, for a given regular expression $r$ the expression `repr(r)` denotes the `TypeScript` representation of the regular 
  expression $r$.
- The regular expression $r_1 + r_2$  is represented by the object `{ kind: 'union', left: repr(r1), right: repr(r2) }`.
- The regular expression $r^*$  is represented by the object `{ kind: 'star', expr: repr(r) }`.

Since `RegExp` is a discriminated union, the `switch` on `r.kind` in `toNFA` narrows the type of `r` in every branch, so no type casts are needed.
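
As a concrete illustration of the `RegExp` type defined above, the following sketch writes out the parse tree of $(a + b)^* \cdot c$ by hand. The type is copied under the name `Regex`, an assumption made only so that the snippet does not clash with the built-in `RegExp` type when it is run outside of this notebook. The function `render` is a hypothetical helper that prints a tree as a fully parenthesized string.

```typescript
type Regex =
  | { kind: 'empty' }
  | { kind: 'epsilon' }
  | { kind: 'sym'; ch: string }
  | { kind: 'concat'; left: Regex; right: Regex }
  | { kind: 'union'; left: Regex; right: Regex }
  | { kind: 'star'; expr: Regex };

// Hypothetical helper: renders a parse tree as a fully parenthesized string.
function render(r: Regex): string {
  switch (r.kind) {
    case 'empty':   return '∅';
    case 'epsilon': return 'ε';
    case 'sym':     return r.ch;
    case 'concat':  return `(${render(r.left)}⋅${render(r.right)})`;
    case 'union':   return `(${render(r.left)}+${render(r.right)})`;
    case 'star':    return `${render(r.expr)}*`;
  }
}

// Parse tree of (a + b)* ⋅ c, written out by hand.
const tree: Regex = {
  kind: 'concat',
  left: {
    kind: 'star',
    expr: {
      kind: 'union',
      left: { kind: 'sym', ch: 'a' },
      right: { kind: 'sym', ch: 'b' },
    },
  },
  right: { kind: 'sym', ch: 'c' },
};

console.log(render(tree)); // ((a+b)*⋅c)
```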

In [11]:
class RegExp2NFA {
  Sigma: Set<Char>;
  StateCount: number;

  constructor(Sigma: Set<Char>) {
    this.Sigma = Sigma;
    this.StateCount = 0;
  }

  toNFA(r: RegExp): NFA {
    switch (r.kind) {
      case 'empty':
        return this.genEmptyNFA();
      case 'epsilon':
        return this.genEpsilonNFA();
      case 'sym':
        return this.genCharNFA(r.ch);
      case 'concat':
        return this.catenate(this.toNFA(r.left), this.toNFA(r.right));
      case 'union':
        return this.disjunction(this.toNFA(r.left), this.toNFA(r.right));
      case 'star':
        return this.kleene(this.toNFA(r.expr));
      default:
        throw new Error('Unknown regex kind');
    }
  }

  genEmptyNFA(): NFA {
    const q0 = this.getNewState();
    const q1 = this.getNewState();
    const delta: Delta = new Map();
    return {
      Q: new Set([q0, q1]),
      Sigma: this.Sigma,
      delta,
      q0,
      F: new Set([q1]),
    };
  }

  genEpsilonNFA(): NFA {
    const q0 = this.getNewState();
    const q1 = this.getNewState();
    const delta: Delta = new Map();
    delta.set(`${q0},${EPS}`, new Set([q1]));
    return {
      Q: new Set([q0, q1]),
      Sigma: this.Sigma,
      delta,
      q0,
      F: new Set([q1]),
    };
  }

  genCharNFA(c: Char): NFA {
    const q0 = this.getNewState();
    const q1 = this.getNewState();
    const delta: Delta = new Map();
    delta.set(`${q0},${c}`, new Set([q1]));
    return {
      Q: new Set([q0, q1]),
      Sigma: this.Sigma,
      delta,
      q0,
      F: new Set([q1]),
    };
  }

  // ---------- Operations on Automata ---------------------------------------

  catenate(f1: NFA, f2: NFA): NFA {
    const delta: Delta = new Map([...f1.delta, ...f2.delta]);
    const [q2] = Array.from(f1.F);
    delta.set(`${q2},${EPS}`, new Set([f2.q0]));
    return {
      Q: new Set([...f1.Q, ...f2.Q]),
      Sigma: f1.Sigma,
      delta,
      q0: f1.q0,
      F: new Set(f2.F),
    };
  }

  disjunction(f1: NFA, f2: NFA): NFA {
    const [q3] = Array.from(f1.F);
    const [q4] = Array.from(f2.F);
    const q0 = this.getNewState();
    const q5 = this.getNewState();

    const delta: Delta = new Map([...f1.delta, ...f2.delta]);
    delta.set(`${q0},${EPS}`, new Set([f1.q0, f2.q0]));
    delta.set(`${q3},${EPS}`, new Set([q5]));
    delta.set(`${q4},${EPS}`, new Set([q5]));

    return {
      Q: new Set([q0, q5, ...f1.Q, ...f2.Q]),
      Sigma: f1.Sigma,
      delta,
      q0,
      F: new Set([q5]),
    };
  }

  kleene(f: NFA): NFA {
    const [q2] = Array.from(f.F);
    const q0 = this.getNewState();
    const q3 = this.getNewState();

    const delta: Delta = new Map([...f.delta]);
    delta.set(`${q0},${EPS}`, new Set([f.q0, q3]));
    delta.set(`${q2},${EPS}`, new Set([f.q0, q3]));

    return {
      Q: new Set([q0, q3, ...f.Q]),
      Sigma: f.Sigma,
      delta,
      q0,
      F: new Set([q3]),
    };
  }

  getNewState(): State {
    this.StateCount += 1;
    return this.StateCount;
  }
}

The <span style="font-variant:small-caps;">Fsm</span> `genEmptyNFA()` is defined as
$$\bigl\langle \{ q_0, q_1 \}, \Sigma, \{\}, q_0, \{ q_1 \} \bigr\rangle. $$
Note that this <span style="font-variant:small-caps;">Fsm</span> has no transitions at all, so the accepting state $q_1$ cannot be reached and the accepted language is empty.
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the empty set](./aLeer.jpg)

The <span style="font-variant:small-caps;">Fsm</span> `genEpsilonNFA` is defined as
$$  \bigl\langle \{ q_0, q_1 \}, \Sigma, 
                          \bigl\{ \langle q_0, \varepsilon\rangle \mapsto \{q_1\} \bigr\}, q_0, \{ q_1 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the empty string](./aEpsilon.jpg)

For a letter $c \in \Sigma$ the <span style="font-variant:small-caps;">Fsm</span> `genCharNFA`$(c)$ is defined as 
$$ A(c) = 
   \bigl\langle \{ q_0, q_1 \}, \Sigma, 
   \bigl\{ \langle q_0, c \rangle \mapsto \{q_1\}\bigr\}, q_0, \{ q_1 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the character c](./aChar.jpg)

Given two <span style="font-variant:small-caps;">Fsm</span>s `f1` and `f2`, the function `catenate(f1, f2)` 
creates an <span style="font-variant:small-caps;">Fsm</span> that recognizes a string $s$ if it can be written 
in the form
$$ s = s_1s_2 $$
and $s_1$ is recognized by `f1` and $s_2$ is recognized by `f2`. 

Assume that $f_1$ and $f_2$ have the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_2 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_3, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.
 
Then $\texttt{catenate}(f_1, f_2)$ is defined as:
$$  \bigl\langle Q_1 \cup Q_2, \Sigma, 
   \bigl\{ \langle q_2,\varepsilon\rangle  \mapsto \{q_3\} \bigr\} 
         \cup \delta_1 \cup \delta_2, q_1, \{ q_4 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the concatenation of two languages](./aConcat.jpg)
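
The definition can be checked on a small example. The sketch below is a standalone copy of the concatenation step, written as a free function instead of a method; the helper `charNFA`, which takes its two state numbers as explicit arguments, is an assumption made so that the states match the names $q_1, \ldots, q_4$ used in the formula above.

```typescript
type State = number;
type Delta = Map<string, Set<State>>;
interface NFA {
  Q: Set<State>;
  Sigma: Set<string>;
  delta: Delta;
  q0: State;
  F: Set<State>;
}

const EPS = 'ε';
const Sigma = new Set(['a', 'b']);

// Hypothetical helper: automaton for the single character c with given states.
function charNFA(c: string, q0: State, q1: State): NFA {
  const delta: Delta = new Map();
  delta.set(`${q0},${c}`, new Set([q1]));
  return { Q: new Set([q0, q1]), Sigma, delta, q0, F: new Set([q1]) };
}

// Link the accepting state of f1 to the start state of f2 by an ε-transition.
function catenate(f1: NFA, f2: NFA): NFA {
  const delta: Delta = new Map([...f1.delta, ...f2.delta]);
  const [q2] = Array.from(f1.F);
  delta.set(`${q2},${EPS}`, new Set([f2.q0]));
  return {
    Q: new Set([...f1.Q, ...f2.Q]),
    Sigma: f1.Sigma,
    delta,
    q0: f1.q0,
    F: new Set(f2.F),
  };
}

// f1 = A(a) with states 1, 2 and f2 = A(b) with states 3, 4.
const ab = catenate(charNFA('a', 1, 2), charNFA('b', 3, 4));

// Transitions: ⟨1, a⟩ ↦ {2}, ⟨3, b⟩ ↦ {4}, ⟨2, ε⟩ ↦ {3}
for (const [k, targets] of ab.delta) {
  console.log(k, '->', Array.from(targets));
}
```

The result has the start state $1$, the single accepting state $4$, and exactly one new transition, which leads from state $2$ to state $3$.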

Given two <span style="font-variant:small-caps;">Fsm</span>s `f1` and `f2`, the function `disjunction(f1, f2)` 
creates an <span style="font-variant:small-caps;">Fsm</span> that recognizes a string $s$ if it is 
recognized by `f1` or by `f2`. 

Assume again that $f_1$ and $f_2$ have disjoint sets of states and the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_3 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_2, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.

Then $\texttt{disjunction}(f_1, f_2)$ is defined as follows:
$$ \bigl\langle \{ q_0, q_5 \} \cup Q_1 \cup Q_2, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_2\},
                   \langle q_3,\varepsilon\rangle \mapsto \{q_5\}, 
                   \langle q_4,\varepsilon\rangle \mapsto \{q_5\} \bigr\} 
                   \cup \delta_1 \cup \delta_2, q_0, \{ q_5 \} \bigr\rangle
$$
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the disjunction](./aPlus.jpg)

Given an <span style="font-variant:small-caps;">Fsm</span> `f`, the function `kleene(f)` 
creates an <span style="font-variant:small-caps;">Fsm</span> that recognizes a string $s$ if it can be written as
$$ s = s_1 s_2 \cdots s_n $$
and all $s_i$ are recognized by `f`.  Note that $n$ might be $0$. 

If `f` is defined as
$$ f = \langle Q, \Sigma, \delta, q_1, \{ q_2 \} \rangle,
$$
then  `kleene(f)` is defined as follows:
$$ \bigl\langle \{ q_0, q_3 \} \cup Q, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_3\},  
                        \langle q_2,\varepsilon\rangle \mapsto \{q_1, q_3\} \bigr\} 
                \cup \delta, q_0, \{ q_3 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Fsm</span> looks as follows:

![Fsm recognizing the Kleene star](./aStar.jpg)
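
To see that this construction indeed accepts $n = 0, 1, 2, \ldots$ repetitions, the automaton for $a^*$ can be written down by hand and simulated. The following sketch is not part of the notebook: the functions `epsClosure` and `accepts` are hypothetical helpers, and the state numbers $0, \ldots, 3$ are chosen to match $q_0, \ldots, q_3$ in the formula above rather than the numbers that `getNewState` would produce.

```typescript
type State = number;
type Delta = Map<string, Set<State>>;
interface NFA {
  Q: Set<State>;
  Sigma: Set<string>;
  delta: Delta;
  q0: State;
  F: Set<State>;
}
const EPS = 'ε';

// A(a) uses the states 1 and 2; the Kleene construction adds 0 and 3.
const delta: Delta = new Map();
delta.set('1,a', new Set([2]));         // transition of A(a)
delta.set(`0,${EPS}`, new Set([1, 3])); // ⟨q0, ε⟩ ↦ {q1, q3}
delta.set(`2,${EPS}`, new Set([1, 3])); // ⟨q2, ε⟩ ↦ {q1, q3}
const aStar: NFA = {
  Q: new Set([0, 1, 2, 3]),
  Sigma: new Set(['a']),
  delta,
  q0: 0,
  F: new Set([3]),
};

// All states reachable from the states in S via ε-transitions only.
function epsClosure(nfa: NFA, S: Set<State>): Set<State> {
  const result = new Set(S);
  const todo = Array.from(S);
  while (todo.length > 0) {
    const q = todo.pop()!;
    for (const p of nfa.delta.get(`${q},${EPS}`) ?? []) {
      if (!result.has(p)) {
        result.add(p);
        todo.push(p);
      }
    }
  }
  return result;
}

// Simulates the NFA on s by tracking the set of all possible current states.
function accepts(nfa: NFA, s: string): boolean {
  let current = epsClosure(nfa, new Set([nfa.q0]));
  for (const c of s) {
    const reached = new Set<State>();
    for (const q of current) {
      for (const p of nfa.delta.get(`${q},${c}`) ?? []) reached.add(p);
    }
    current = epsClosure(nfa, reached);
  }
  return Array.from(current).some((q) => nfa.F.has(q));
}

console.log(accepts(aStar, ''));    // true  (n = 0)
console.log(accepts(aStar, 'aaa')); // true  (n = 3)
console.log(accepts(aStar, 'ab'));  // false
```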

The auxiliary function `getNewState` returns a new number that has not yet been used as a state.
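
Because `getNewState` is called exactly once for every state that is created, the number of states of the resulting automaton can be predicted from the parse tree: every atom ($\emptyset$, $\varepsilon$ or a character) contributes two states, union and Kleene star add two more each, and concatenation adds none. The following sketch computes this number. The function `countStates` and the type name `Regex` (a copy of `RegExp`, renamed to avoid a clash with the built-in type outside of this notebook) are assumptions made for this illustration.

```typescript
type Regex =
  | { kind: 'empty' }
  | { kind: 'epsilon' }
  | { kind: 'sym'; ch: string }
  | { kind: 'concat'; left: Regex; right: Regex }
  | { kind: 'union'; left: Regex; right: Regex }
  | { kind: 'star'; expr: Regex };

// Number of states that the construction creates for the parse tree r.
function countStates(r: Regex): number {
  switch (r.kind) {
    case 'empty':
    case 'epsilon':
    case 'sym':
      return 2; // start state and accepting state
    case 'concat':
      return countStates(r.left) + countStates(r.right); // no new states
    case 'union':
      return countStates(r.left) + countStates(r.right) + 2;
    case 'star':
      return countStates(r.expr) + 2;
  }
}

// (a + b)* ⋅ c: three atoms, one union, one star.
const example: Regex = {
  kind: 'concat',
  left: {
    kind: 'star',
    expr: {
      kind: 'union',
      left: { kind: 'sym', ch: 'a' },
      right: { kind: 'sym', ch: 'b' },
    },
  },
  right: { kind: 'sym', ch: 'c' },
};

console.log(countStates(example)); // 10
```

If a freshly constructed `RegExp2NFA` object converts this expression, `StateCount` should therefore be `10` afterwards.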

The notebook `04-Test-Regexp-2-NFA` can be used to test the functions implemented in this notebook.