# 🔍 Exploring Regular Expression Engine Implementations 🚀

I want to understand what goes on inside a regex engine by writing a *simplified, illustrative* model. It is **not** a fully compliant engine, but it is more than enough for me to understand the core ideas.

**1. 🧱 Regex Building Blocks: A Quick Recap**

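   *   Metacharacters (`.`, `\w`, `\d`, `\s`): Symbols that match a kind of character instead of a literal: any character, a word character, a digit, whitespace.
   *   Quantifiers (`*`, `+`, `?`, `{n,m}`): Control how many times the preceding item repeats.
   *   Character Classes (`[a-z]`, `[0-9]`, `[^abc]`): Define sets of characters; a leading `^` negates the set.
   *   Grouping (`()`): Create sub-patterns.
   *   Anchors (`^`, `$`): Match the start and end of the input.
   *   Alternation (`|`): Match either side; `a|b` means "a OR b".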
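
To make the recap concrete, here is a small sketch that checks each building block against Python's standard `re` module. The patterns and sample strings are made up for illustration, and `re` is used only as a reference point, not as the engine this notebook builds.

```python
import re

# (pattern, text, expected full-match result); all samples are illustrative
examples = [
    (r"\d\w.",        "7a!",    True),   # metacharacters: digit, word char, any char
    (r"ab{2,3}c?",    "abbb",   True),   # quantifiers: 2-3 b's, optional c
    (r"ab{2,3}c?",    "ab",     False),  # only one b, so {2,3} fails
    (r"[a-z]+[^abc]", "heyz",   True),   # character classes, including a negated one
    (r"(ab)+",        "ababab", True),   # grouping: the group repeats as a unit
    (r"a|b",          "b",      True),   # alternation
    (r"a|b",          "ab",     False),  # alternation picks one side, not both
]

results = [re.fullmatch(pattern, text) is not None for pattern, text, _ in examples]
for (pattern, text, expected), got in zip(examples, results):
    print(f"{pattern!r:18} {text!r:10} -> {got}")

# Anchors: "^cat" must match at the start, so it does not find "cat" inside "concat"
anchored = re.search(r"^cat", "concat")
print(anchored)
```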

**2. ⚙️ Regex to NFA: Representing a Regex as an NFA**

   This section dives into the conversion of regular expressions into Non-deterministic Finite Automata (NFAs). An NFA is a state machine that accepts or rejects strings based on defined transitions. Converting a regex to an NFA provides a visual and computational representation of the pattern.

   *   **What is an NFA?** An NFA consists of:
         *   A set of states $Q$
         *   An input alphabet $\Sigma$
         *   A transition function $\delta: Q \times (\Sigma \cup \{\epsilon\}) \rightarrow \mathcal{P}(Q)$  (where $\mathcal{P}(Q)$ is the power set of Q, meaning it returns a *set* of possible next states, and $\epsilon$ represents the empty string/epsilon transition)
         *   A start state $q_0 \in Q$
         *   A set of accept states $F \subseteq Q$


   *   **Thompson's Construction:** A common algorithm for converting a regex to an NFA. It builds the NFA step-by-step based on the regex operators.  We'll implement a similar approach.

   *   **Example:** The regex `a*` can be represented by an NFA with states $q_0$ and $q_1$, where $q_0$ is the start state and $q_1$ is the accept state. There's an 'a' transition *looping* on $q_1$ and an epsilon transition from $q_0$ to $q_1$.
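
The two-state machine for `a*` can be simulated directly from the formal definition. The sketch below is a minimal illustration: it stores $\delta$ as a dictionary and uses the empty string as the epsilon label, both of which are assumptions of this example, as are the helper names `epsilon_closure` and `accepts`.

```python
from typing import Dict, Set, Tuple

EPSILON = ""  # assumption: the empty string labels an epsilon move

# delta for the `a*` machine: q0 --eps--> q1, and q1 --a--> q1
delta: Dict[Tuple[str, str], Set[str]] = {
    ("q0", EPSILON): {"q1"},
    ("q1", "a"): {"q1"},
}
START = "q0"
ACCEPT = {"q1"}

def epsilon_closure(states: Set[str]) -> Set[str]:
    """Return every state reachable from `states` using only epsilon moves."""
    closure = set(states)
    todo = list(states)
    while todo:
        state = todo.pop()
        for target in delta.get((state, EPSILON), set()):
            if target not in closure:
                closure.add(target)
                todo.append(target)
    return closure

def accepts(text: str) -> bool:
    """Track the set of possible current states while reading `text`."""
    current = epsilon_closure({START})
    for ch in text:
        moved: Set[str] = set()
        for state in current:
            moved |= delta.get((state, ch), set())
        current = epsilon_closure(moved)
        if not current:  # no live state left, so the input is rejected
            return False
    return bool(current & ACCEPT)

print(accepts(""), accepts("aaa"), accepts("ab"))
```

Tracking a *set* of states is what makes the simulation non-deterministic: the machine is "in" every state it could have reached so far.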


**3. 🚂 Shunting Yard Algorithm: From Infix to Postfix**

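   We'll implement the Shunting Yard algorithm to transform familiar infix notation into postfix (Reverse Polish Notation, RPN), which is easier for machines to evaluate. The classic illustration is arithmetic, where `+` and `*` are binary operators; for regexes the operators are alternation, concatenation, and the quantifiers.

   **Example (arithmetic):** `a + b * c  -->  a b c * +`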
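
The arithmetic example can be reproduced with a few lines of Python. This is a minimal sketch for whitespace-separated tokens and left-associative binary operators only; the name `to_postfix` and the `PRECEDENCE` table are assumptions of this example, separate from the regex version implemented later.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # higher number binds tighter

def to_postfix(tokens):
    """Convert a list of infix tokens to a list of postfix tokens."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Flush operators back to the matching "("
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the "("
        elif tok in PRECEDENCE:
            # Pop operators that bind at least as tightly (left associativity)
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operands go straight to the output
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output

print(" ".join(to_postfix("a + b * c".split())))      # a b c * +
print(" ".join(to_postfix("( a + b ) * c".split())))  # a b + c *
```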



## **I. NFA Representation**

In [None]:
# One way to represent an NFA: a 2D transition table, so that
#   next_state = transition_table[row][col]
# Events (input symbols) are rows and states are columns.
# To match "abc" the table should look like this (N = no transition):
#      s0 s1 s2 s3
#   a  1  N  N  N
#   b  N  2  N  N
#   c  N  N  3  N
# The last state (s3) is the accept state and has no outgoing transitions.
# The code below shifts everything by one: index 0 is reserved as the
# failure state, so matching starts at index 1.
class State:
    def __init__(self, size, name):
        self.transitions = [0] * size  # One slot per alphabet symbol; 0 is the failure state
        self.name = name

    def setTransitions(self, symbols: range, nextStateIdx: int) -> None:
        # Every symbol index in `symbols` leads to nextStateIdx
        for i in symbols:
            self.transitions[i] = nextStateIdx
        
class EngineNFA:
    def __init__(self, alphabet):
        self.transitions_table = [] 
        self.alphabet = alphabet  # maps symbol index -> character (the input alphabet)
        
    def add_state(self, name, transition: range, nextStateIdx: int) -> None:
        new_state = State(len(self.alphabet.keys()), name)  # Create a new instance of State
        new_state.setTransitions(transition, nextStateIdx)
        self.transitions_table.append(new_state)  # Add it to the transition table
            
    def get_key_from_value(self, value):
        keys = [k for k, v in self.alphabet.items() if v == value]
        return keys[0] if keys else None

    def is_match(self, text: str) -> bool:
        # The accept state is one past the last stored state
        accept_state = len(self.transitions_table)
        next_state = 1  # index 0 is the failure state, so matching starts at 1
        for char in text:
            # Input remains but we already failed or already accepted:
            # a full match is impossible. Breaking out of the loop here instead
            # would accept any string that merely starts with "abc", e.g. "abccc".
            if next_state == 0 or next_state >= accept_state:
                return False
            alphabetKey = self.get_key_from_value(char)
            if alphabetKey is None:
                raise ValueError(f"Value '{char}' is not part of the defined alphabet.")
            next_state = self.transitions_table[next_state].transitions[alphabetKey]

        return next_state == accept_state

    def dump(self):
        # Print one row per symbol, one column per state
        for symbol in self.alphabet.keys():
            print(f"{self.alphabet[symbol]} => ", end=" ")
            for state in self.transitions_table:
                print(state.transitions[symbol], end=" ")
            print('\n')


In [52]:
# example of a NFA that recognizes the string "abc"
alphabet = {i - ord('a'): chr(i) for i in range(ord('a'), ord('c') + 1)}
events = ['a', 'b', 'c']
NFA = EngineNFA(alphabet)
# index 0 is the reserved failure state: it has no outgoing transitions
NFA.add_state("fail", range(0), 0)
# s1 is the start state; each state advances on exactly one symbol
for idx, _ in enumerate(events):
    NFA.add_state("s" + str(idx + 1), range(idx, idx + 1), len(NFA.transitions_table) + 1)

print(NFA.is_match("abc"))
print(NFA.is_match("abccc"))

True
False


## **II. Regex 2 NFA**


## Step 1: Parsing
#### Actually, instead of writing a full parser for the regex, I will use the _Shunting-Yard Algorithm_ to translate the regex to reverse Polish notation.

In [9]:
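# Shunting-Yard Algorithm
#  Operator precedence table, taken from the POSIX (Unix) ERE rules. Not all operators are supported in the implementation below.
#  "#" is not a POSIX operator: it is the explicit concatenation marker that implicitConcat inserts.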
#  +---+----------------------------------------------------------+
#  |   |             ERE Precedence (from high to low)            |
#  +---+----------------------------------------------------------+
#  | 1 | Collation-related bracket symbols | [==] [::] [..]       |
#  | 2 | Escaped characters                | \<special character> |
#  | 3 | Bracket expression                | []                   |
#  | 4 | Grouping                          | ()                   |
#  | 5 | Single-character-ERE duplication  | * + ? {m,n}          |
#  | 6 | Concatenation                     | #                    |
#  | 7 | Anchoring                         | ^ $                  |
#  | 8 | Alternation                       | |                    |
#  +---+-----------------------------------+----------------------+
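def getPresedence(op):
    # Larger number = binds tighter (the reverse of the table's numbering).
    # Anchors rank above "#" here, so "^" and "$" behave like zero-width atoms joined by concatenation.
    opPresedence = {"(": 1, "|": 2, "#": 3, "?":6, "*":6, "+": 6, "^": 5, "$": 5}
    if op in opPresedence:
        return opPresedence[op]
    # Anything else is a literal: rank it above every operator
    return max(opPresedence.values()) + 1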
    
def implicitConcat(regex):
    # Insert an explicit "#" wherever two adjacent tokens are concatenated
    if not regex:
        return ""
    output = regex[0]
    i = 1
    while(i < len(regex)):
        match regex[i]:
            # ")", the postfix quantifiers and "|" never start a new operand,
            # so no "#" goes in front of them
            case char if char in (")", "+", "*", "?", "|"):
                output += regex[i]
            case _:
                if(regex[i-1] == "(" or regex[i-1] == "|"):
                    output += regex[i]
                else:
                    output += "#" + regex[i]
        i +=1
    return output


def infix2Postfix(regex):
    stack = []
    output = ""
    for char in regex:
        match char:
            case "(":
                stack.append(char)
            case ")":
                while(len(stack) and stack[-1] != '('):
                    output += stack.pop()
                if not stack:
                    raise ValueError("Invalid Expression: Closing parenthesis without opening")
                stack.pop()  # discard the matching "("
            case _:
                # Literals are pushed as well: they have the highest precedence,
                # so the next operator pops them straight to the output
                while(len(stack)):
                    if(getPresedence(stack[-1]) >= getPresedence(char)):
                        output += stack.pop()
                    else: break
                stack.append(char)

    while (len(stack)):
        if (stack[-1] == '('):
            raise ValueError("Invalid Expression: Open parenthesis without closing")
        output += stack.pop()
    return output

print(infix2Postfix(implicitConcat("(a|b)+a?|c")))
# +------------+---------------+
# | INFIX      | POSTFIX       |
# +------------+---------------+
# | ^xyz$      |  ^x#y#z#$#    |
# | (a|b)+a?|c |  ab|+a?#c|    |
# +------------+---------------+


ab|+a?#c|
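
As a rough sketch of where the postfix form leads, the snippet below builds a Thompson-style NFA from a postfix string that uses `#` for concatenation and then simulates it. The names `postfix_to_nfa`, `closure`, and `full_match`, and the use of `None` as the epsilon label, are assumptions made for illustration and are not part of the cells above. Only literals, `#`, `|`, `*`, `+`, and `?` are handled.

```python
from itertools import count

def postfix_to_nfa(postfix):
    """Return (start, accept, edges); edges[s] is a list of (symbol, target), None = epsilon."""
    ids = count()
    edges = {}

    def new_state():
        s = next(ids)
        edges[s] = []
        return s

    stack = []  # NFA fragments, each a (start, accept) pair
    for tok in postfix:
        if tok == "#":                        # concatenation: link first.accept to second.start
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            edges[a1].append((None, s2))
            stack.append((s1, a2))
        elif tok == "|":                      # alternation: fork into both fragments, then join
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            s, a = new_state(), new_state()
            edges[s] += [(None, s1), (None, s2)]
            edges[a1].append((None, a))
            edges[a2].append((None, a))
            stack.append((s, a))
        elif tok in ("*", "+", "?"):          # repetition around one fragment
            s1, a1 = stack.pop()
            s, a = new_state(), new_state()
            edges[s].append((None, s1))
            edges[a1].append((None, a))
            if tok in ("*", "?"):
                edges[s].append((None, a))    # zero occurrences allowed
            if tok in ("*", "+"):
                edges[a1].append((None, s1))  # loop back for another occurrence
            stack.append((s, a))
        else:                                 # literal character
            s, a = new_state(), new_state()
            edges[s].append((tok, a))
            stack.append((s, a))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    start, accept = stack.pop()
    return start, accept, edges

def closure(states, edges):
    """Expand a set of states with everything reachable through epsilon edges."""
    seen, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for sym, target in edges[s]:
            if sym is None and target not in seen:
                seen.add(target)
                todo.append(target)
    return seen

def full_match(postfix, text):
    start, accept, edges = postfix_to_nfa(postfix)
    current = closure({start}, edges)
    for ch in text:
        step = {t for s in current for sym, t in edges[s] if sym == ch}
        current = closure(step, edges)
    return accept in current

# "ab|*c#" is the postfix form of (a|b)*c
print(full_match("ab|*c#", "abbac"))  # True
print(full_match("ab|*c#", "ab"))     # False
```

Each operator adds at most two new states, which is the property that keeps a Thompson-style NFA linear in the size of the regex.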
