# 🔍 Exploring Regular Expression Engine Implementations 🚀

I want to understand what goes on inside a regex engine by writing a *simplified, illustrative* model. It is **not** a fully compliant engine, but it is more than sufficient for understanding the core ideas.

**1. 🧱 Regex Building Blocks: A Quick Recap**

   *   Metacharacters (`.`, `\w`, `\d`, `\s`): Special symbols that stand for a whole class of characters (any character, word character, digit, whitespace).
   *   Quantifiers (`*`, `+`, `?`, `{n,m}`): Control repetition.
   *   Character Classes (`[a-z]`, `[0-9]`, `[^abc]`): Define sets of characters.
   *   Grouping (`()`): Create sub-patterns.
   *   Anchors (`^`, `$`): Match the start and the end of the input.
   *   Alternation (`|`): Matches either side (`a|b` means "a OR b").

**2. ⚙️ Regex to NFA: Representing a Regex as an NFA**

   This section dives into the conversion of regular expressions into Non-deterministic Finite Automata (NFAs). An NFA is a state machine that accepts or rejects strings based on defined transitions. Converting a regex to an NFA provides a visual and computational representation of the pattern.

   *   **What is an NFA?** An NFA consists of:
         *   A set of states $Q$
         *   An input alphabet $\Sigma$
         *   A transition function $\delta: Q \times (\Sigma \cup \{\epsilon\}) \rightarrow \mathcal{P}(Q)$  (where $\mathcal{P}(Q)$ is the power set of Q, meaning it returns a *set* of possible next states, and $\epsilon$ represents the empty string/epsilon transition)
         *   A start state $q_0 \in Q$
         *   A set of accept states $F \subseteq Q$


   *   **Thompson's Construction:** A common algorithm for converting a regex to an NFA. It builds the NFA step-by-step based on the regex operators.  We'll implement a similar approach.

   *   **Example:** The regex `a*` can be represented by an NFA with states $q_0$ and $q_1$, where $q_0$ is the start state and $q_1$ is the accept state. There's an 'a' transition *looping* on $q_1$ and an epsilon transition from $q_0$ to $q_1$.
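
To make the definition concrete for myself, here is a minimal sketch of the `a*` automaton described above. It is only an illustration under my own assumptions: the names `EPS`, `epsilon_closure` and `accepts` are hypothetical, and the transition function $\delta$ is stored as a plain dictionary from `(state, symbol)` to a set of next states.

```python
from typing import Dict, Optional, Set, Tuple

EPS = None  # hypothetical marker for an epsilon transition

# delta maps (state, symbol) -> set of next states,
# mirroring Q x (Sigma ∪ {ε}) -> P(Q)
Delta = Dict[Tuple[str, Optional[str]], Set[str]]


def epsilon_closure(states: Set[str], delta: Delta) -> Set[str]:
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(word: str, delta: Delta, start: str, accept: Set[str]) -> bool:
    """Track every state the NFA could be in after each input symbol."""
    current = epsilon_closure({start}, delta)
    for symbol in word:
        moved: Set[str] = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(moved, delta)
    return bool(current & accept)


# The a* automaton from the example: q0 --ε--> q1, q1 --a--> q1, q1 accepting.
delta_a_star: Delta = {
    ("q0", EPS): {"q1"},
    ("q1", "a"): {"q1"},
}

print(accepts("", delta_a_star, "q0", {"q1"}))     # True
print(accepts("aaa", delta_a_star, "q0", {"q1"}))  # True
print(accepts("ab", delta_a_star, "q0", {"q1"}))   # False
```

Keeping a *set* of current states is what makes the non-determinism manageable: the simulation never has to guess which branch to follow.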


**3. 🚂 Shunting Yard Algorithm: From Infix to Postfix**

   We'll implement the Shunting Yard algorithm to transform an infix regex into postfix (Reverse Polish Notation, RPN), which is easier for a machine to evaluate. The algorithm is best known from arithmetic, where `+` and `*` are binary operators and `*` binds tighter than `+`.

   **Example:** `a + b * c  -->  a b c * +`
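
Before applying this to regexes, here is a small sketch of the algorithm on the arithmetic example above. It assumes space-separated tokens and left-associative operators; the function name `to_postfix` and the precedence table are my own choices for illustration.

```python
from typing import List


def to_postfix(tokens: List[str]) -> List[str]:
    """Shunting Yard for left-associative binary arithmetic operators."""
    precedence = {"+": 1, "-": 1, "*": 2, "/": 2}
    output: List[str] = []
    stack: List[str] = []
    for tok in tokens:
        if tok in precedence:
            # Pop operators that bind at least as tightly as the incoming one.
            while (stack and stack[-1] != "("
                   and precedence[stack[-1]] >= precedence[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operands go straight to the output
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


print(" ".join(to_postfix("a + b * c".split())))      # a b c * +
print(" ".join(to_postfix("( a + b ) * c".split())))  # a b + c *
```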



## **I. NFA Representation**

In [60]:
# One way to represent an NFA: a 2D transition table,
#   next_state = transition_table[row][col]
# where events (input symbols) are rows and states are columns.
# To match "abc", the transition table should look like this (N = no transition).
# The last state is probably always the accept state.
#   s0 s1 s2 s3
# a 1  N  N  N
# b N  2  N  N
# c N  N  3  N
from typing import List, Tuple, Union


EPSILON_TRANSITION = None
EPSILON = 'ε'
class State:
    def __init__(self, size, name, is_accept, is_epsilon=False):
        # One slot per alphabet symbol; 0 means "no transition" (failure)
        self.transitions = [0] * size
        self.name = name
        self.is_accept = is_accept
        self.is_epsilon = is_epsilon
        
    def set_transition_vec(self, symbols: range, next_states: Tuple[int, ...]) -> None:
        # Every symbol index in `symbols` leads to the same tuple of candidate next states
        for i in symbols:
            self.transitions[i] = next_states
    def __repr__(self):
        return f"State(name='{self.name}', is_accept={self.is_accept}, transitions={self.transitions})"
        
class EngineNFA:
    def __init__(self, alphabet):
        self.transitions_table = [] 
        self.alphabet = alphabet.copy() # the language lookup this NFA can describe {char: int}
        self.alphabet[EPSILON] = max(alphabet.values()) + 1 # EPSILON is always present in the alphabet
        
    def add_state(self, name, transition: range, nextStatesIdx: Tuple[int, ...], is_accept=False, is_epsilon=False) -> None:
        new_state = State(len(self.alphabet.keys()), name, is_accept, is_epsilon)  # Create a new instance of State
        new_state.set_transition_vec(transition, nextStatesIdx)
        self.transitions_table.append(new_state)  # Add state to transitions table
        
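    def append_state(self, state: State):
        last_state = self.transitions_table[-1]
        state.name = self.next_state_name(last_state)
        # The state's targets are relative to index 0 of its own NFA: rebase them onto this table
        offset = len(self.transitions_table)
        state.transitions = [tuple(t + offset for t in x) if isinstance(x, tuple) else x
                             for x in state.transitions]
        self.transitions_table.append(state)
        
    def insert_state_at_idx(self, state: State, idx: int):
        # States from idx onward move one slot to the right, so shift their targets by one
        for s in self.transitions_table[idx:]:
            s.transitions = [tuple(t + 1 for t in x) if isinstance(x, tuple) else x
                             for x in s.transitions]
        # The inserted state's targets are relative to index 0 of its own NFA: rebase them onto idx
        state.transitions = [tuple(t + idx for t in x) if isinstance(x, tuple) else x
                             for x in state.transitions]
        self.transitions_table.insert(idx, state)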
                
    def next_state_name(self, state: State):
        num_as_str = ""
        i = 0
        while i < len(state.name):
            if state.name[i] == 'S':
                i += 1  # Move past the 'S'
                while i < len(state.name) and state.name[i].isdigit():
                    num_as_str += state.name[i]
                    i += 1
            else:
                i += 1
        return f"S{int(num_as_str) + 1}"

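    def is_match(self, input: Union[str, List[str]]):
        next_state = None
        next_state_idx = 0
        for char in input:
            # .get() returns None for unknown symbols instead of raising KeyError
            alphabetKey = self.alphabet.get(char)
            if alphabetKey is None:
                raise ValueError(f"Value '{char}' is not part of the defined alphabet.")
            if next_state_idx >= len(self.transitions_table):
                return False  # input continues past the last state
            next_state = self.transitions_table[next_state_idx]
            # we now have a tuple of next possible states
            # take the first one for now
            # TODO: handle multiple possible states on a single character
            next_possible_states = next_state.transitions[alphabetKey]
            if next_possible_states == 0:
                return False  # 0 marks "no transition", so the input cannot match
            next_state_idx = next_possible_states[0]
        # the state that consumed the last symbol decides acceptance (empty input never matches)
        return next_state is not None and next_state.is_accept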
    
    def dump(self):
        # Find the maximum string length of any element (int or tuple) so columns can be aligned
        max_len = 0
        for row in self.transitions_table:
            for element in row.transitions:
                element_str = str(element)
                max_len = max(max_len, len(element_str))
                
        for symbol in self.alphabet.keys():
            print(f"{symbol} => ", end=" ")
            for state in self.transitions_table:
                element = str(state.transitions[self.alphabet[symbol]])
                padding = ' ' * (max_len - len(element))
                print(element + padding, end=" ")
            print('\n')
    

In [57]:
# example of an NFA that recognizes the string "acb" (epsilon steps are passed explicitly for now)
alphabet = {chr(i): i - ord('a') for i in range(ord('a'), ord('c') + 1)}

events = ['a', 'c', EPSILON, EPSILON, 'b']
NFA = EngineNFA(alphabet)
for idx, e in enumerate(events):
    NFA.add_state(f"S{idx}", range(NFA.alphabet[e], NFA.alphabet[e] + 1), (len(
        NFA.transitions_table) + 1,), is_accept=idx == len(events) - 1, is_epsilon=e == EPSILON)

NFA.dump()
print(NFA.is_match(['a', 'c', EPSILON, EPSILON, 'b']))

a =>  (1,) 0    0    0    0    

b =>  0    0    0    0    (5,) 

c =>  0    (2,) 0    0    0    

ε =>  0    0    (3,) (4,) 0    

True


## **II. Regex 2 NFA**


## Step 1: Parsing
#### Actually, instead of creating a parser for the regex, I will use the _Shunting-Yard Algorithm_ to translate the regex to reverse Polish notation.

In [None]:
# Shunting-Yard Algorithm
#  Operator precedence and associativity; table taken from the POSIX ERE specification (not all operators are supported in the implementation below).
#  +---+----------------------------------------------------------+
#  |   |             ERE Precedence (from high to low)            |
#  +---+----------------------------------------------------------+
#  | 1 | Collation-related bracket symbols | [==] [::] [..]       |
#  | 2 | Escaped characters                | \<special character> |
#  | 3 | Bracket expression                | []                   |
#  | 4 | Grouping                          | ()                   |
#  | 5 | Single-character-ERE duplication  | * + ? {m,n}          |
#  | 6 | Concatenation                     | #                    |
#  | 7 | Anchoring                         | ^ $                  |
#  | 8 | Alternation                       | |                    |
#  +---+-----------------------------------+----------------------+
def getPrecedence(op):
    # Higher number = binds tighter. Anything not listed is an operand and gets the highest value.
    # Note: unlike the table above, `^` and `$` rank above concatenation here.
    opPrecedence = {"(": 1, "|": 2, "#": 3, "?": 6, "*": 6, "+": 6, "^": 5, "$": 5}
    if op in opPrecedence:
        return opPrecedence[op]
    else:
        return max(opPrecedence.values()) + 1
    
def implicitConcat(regex):
    # Insert an explicit '#' wherever concatenation is implied
    output = regex[0]
    i = 1
    while i < len(regex):
        match regex[i]:
            # closing parentheses, quantifiers and '|' never start a new operand
            case char if char in (")", "+", "*", "?", "|"):
                output += regex[i]
            case _:
                if regex[i-1] == "(" or regex[i-1] == "|":
                    output += regex[i]
                else:
                    output += "#" + regex[i]
        i += 1
    return output
        
    
def infix2Postfix(regex):
    stack = []
    output = ""
    for char in regex:
        match char:
            case "(":
                stack.append(char)
            case ")":
                while len(stack) and stack[-1] != '(':
                    output += stack.pop()
                if not stack:
                    raise ValueError("Invalid Expression: Closing parenthesis without opening")
                stack.pop()  # discard the matching '('
            case _:
                # operands are pushed too; their maximal precedence pops them before any operator
                while len(stack):
                    if getPrecedence(stack[-1]) >= getPrecedence(char):
                        output += stack.pop()
                    else: break
                stack.append(char)
                    
    while len(stack):
        if stack[-1] == '(':
            raise ValueError("Invalid Expression: Open parenthesis without closing")
        output += stack.pop()
    return output

print(infix2Postfix(implicitConcat("(a|b)+a?|c")))
# +------------+---------------+
# | INFIX      | POSTFIX       |
# +------------+---------------+
# | ^xyz$      |  ^x#y#z#$#    |
# | (a|b)+a?|c |  ab|+a?#c|    |
# +------------+---------------+


ab|+a?#c|
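
A quick way to sanity-check a postfix string is to fold it back into a fully parenthesized infix form. The sketch below is only a checking aid under my own assumptions: `postfix_to_infix` is a hypothetical helper, `*`, `+` and `?` are treated as unary, `#` and `|` as binary, and every other character (including `^` and `$`) as an operand.

```python
def postfix_to_infix(postfix: str) -> str:
    """Rebuild a fully parenthesized infix string from a postfix regex."""
    unary = {"*", "+", "?"}
    binary = {"#", "|"}
    stack = []
    for symbol in postfix:
        if symbol in unary:
            if not stack:
                raise ValueError(f"'{symbol}' has no operand")
            operand = stack.pop()
            stack.append(f"({operand}{symbol})")
        elif symbol in binary:
            if len(stack) < 2:
                raise ValueError(f"'{symbol}' needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left}{symbol}{right})")
        else:
            stack.append(symbol)  # plain operand
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


print(postfix_to_infix("ab|"))        # (a|b)
print(postfix_to_infix("ab|+a?#c|"))  # ((((a|b)+)#(a?))|c)
```

If a binary operator ever finds fewer than two operands on the stack, the postfix string is malformed, which makes this a cheap test for the conversion step.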


## Step 2: Building the NFA

In [None]:

class NFABuilder:
    def __init__(self, alphabet, postfix_regex):
        """
        Initializes the NFABuilder object.
        """
        self.postfix_regex = postfix_regex      
        self.alphabet = alphabet
        
    def single_nfa_generator(self,events):
        NFA = EngineNFA(self.alphabet)
        for idx, e in enumerate(events):
            NFA.add_state(f"S{idx}", range(NFA.alphabet[e], NFA.alphabet[e] + 1), (len(
                NFA.transitions_table) + 1,), idx == len(events) - 1, e == EPSILON)
        return NFA
        
    def build_nfa(self):
        """
        Evaluates the postfix regex with a stack of partial NFAs and returns the final NFA.
        Work in progress: '#' and '|' currently assume single-character operands.
        """
        # consider: ab|
        nfas = []
        for char in self.postfix_regex:
            match char:
                # Initial State --a--> Final State
                case char if char.isalnum():
                    char_nfa = self.single_nfa_generator([char])
                    nfas.append(char_nfa)
                    
                    # concatenation
                    # N(s) --> N(t)
                case '#':
                    if(len(nfas) < 2):
                        raise ValueError("Concatenation symbol must be preceded by at least 2 operands")
                    nfa1 = nfas.pop()  # right operand
                    nfa2 = nfas.pop()  # left operand
                    nfa2.transitions_table[-1].is_accept = False  # only the end of the chain accepts
                    nfa2.append_state(nfa1.transitions_table[0])
                    nfas.append(nfa2)  # push the combined NFA back for later operators
                    
                    # alternation
                    #       --> ε --> N(s) --ε-->
                    #     /                       \
                    # Start                          --> End
                    #     \                       /
                    #     --> ε --> N(t) --ε-->
                case '|': 
                    if(len(nfas) < 2):
                        raise ValueError("Alternation symbol must be preceded by at least 2 operands")
                    nfa1 = nfas.pop() #b
                    nfa2 = nfas.pop() #a
                    union_nfa = EngineNFA(self.alphabet)
                    # the hard-coded targets (1, 3) and (5,) only fit single-state operands
                    union_nfa.add_state(f"S{1}", range(
                            union_nfa.alphabet[EPSILON], union_nfa.alphabet[EPSILON] + 1), (1, 3), False, True)
                    union_nfa.insert_state_at_idx(nfa2.transitions_table[0],1)
                    union_nfa.add_state(f"S{1}", range(
                            union_nfa.alphabet[EPSILON], union_nfa.alphabet[EPSILON] + 1), (5,), False, True)
                    union_nfa.insert_state_at_idx(nfa1.transitions_table[0],3)
                    union_nfa.add_state(f"S{1}", range(
                            union_nfa.alphabet[EPSILON], union_nfa.alphabet[EPSILON] + 1), (5,), False, True)
                    union_nfa.dump()
                    nfas.append(union_nfa)  # push the union back for later operators
                case _:
                    raise ValueError(f"Unknown operator: {char}")
        return nfas[-1] if nfas else None
                                        
alphabet = {chr(i): i - ord('a') for i in range(ord('a'), ord('b') + 1)}
postfix_regex = 'ab|'
builder = NFABuilder(alphabet, postfix_regex)
nfa = builder.build_nfa()



a =>  0      (2,)   0      0      0      

b =>  0      0      0      (4,)   0      

ε =>  (1, 3) 0      (5,)   0      (5,)   
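
The five-column table above has the shape Thompson's construction gives for alternation: one ε-fork, two branches, and ε-edges into a shared end state. For comparison, here is a compact sketch of the whole construction on a postfix string. It is not the `NFABuilder` above but an independent illustration under my own assumptions: states are plain integers, transitions live in a dictionary, the names `thompson`, `closure` and `matches` are hypothetical, and anchors and escaped operators are not handled.

```python
from typing import Dict, List, Optional, Set, Tuple

EPS = None  # hypothetical marker for an epsilon transition
Transitions = Dict[Tuple[int, Optional[str]], Set[int]]


def thompson(postfix: str) -> Tuple[Transitions, int, int]:
    """Build an NFA from a postfix regex; returns (transitions, start, accept)."""
    transitions: Transitions = {}
    stack: List[Tuple[int, int]] = []  # fragments as (start, accept) pairs
    counter = 0

    def new_state() -> int:
        nonlocal counter
        counter += 1
        return counter - 1

    def link(src: int, symbol: Optional[str], dst: int) -> None:
        transitions.setdefault((src, symbol), set()).add(dst)

    def pop() -> Tuple[int, int]:
        if not stack:
            raise ValueError("malformed postfix expression")
        return stack.pop()

    for symbol in postfix:
        if symbol == "#":                  # concatenation: N(s) --ε--> N(t)
            s2, a2 = pop()
            s1, a1 = pop()
            link(a1, EPS, s2)
            stack.append((s1, a2))
        elif symbol == "|":                # alternation: fork, then join
            s2, a2 = pop()
            s1, a1 = pop()
            start, accept = new_state(), new_state()
            link(start, EPS, s1)
            link(start, EPS, s2)
            link(a1, EPS, accept)
            link(a2, EPS, accept)
            stack.append((start, accept))
        elif symbol in ("*", "+", "?"):    # quantifiers wrap one fragment
            s, a = pop()
            start, accept = new_state(), new_state()
            link(start, EPS, s)
            link(a, EPS, accept)
            if symbol in ("*", "?"):
                link(start, EPS, accept)   # zero occurrences allowed
            if symbol in ("*", "+"):
                link(a, EPS, s)            # repetition allowed
            stack.append((start, accept))
        else:                              # literal: start --symbol--> accept
            start, accept = new_state(), new_state()
            link(start, symbol, accept)
            stack.append((start, accept))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    start, accept = stack.pop()
    return transitions, start, accept


def closure(states: Set[int], transitions: Transitions) -> Set[int]:
    """States reachable through epsilon transitions only."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for nxt in transitions.get((q, EPS), set()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def matches(postfix: str, text: str) -> bool:
    """Simulate the NFA on `text`, following all possible states at once."""
    transitions, start, accept = thompson(postfix)
    current = closure({start}, transitions)
    for ch in text:
        moved: Set[int] = set()
        for q in current:
            moved |= transitions.get((q, ch), set())
        current = closure(moved, transitions)
    return accept in current


print(matches("ab|", "a"))           # True
print(matches("ab|", "ab"))          # False
print(matches("ab|+a?#c|", "abba"))  # True
```

Because every fragment has exactly one start and one accept state, each operator only needs to add a constant number of states and ε-edges, which is what keeps the construction so mechanical.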

