# REGEX Pattern Match Program
### Assignment 1, Programming in Python
**Note:** *Two literary works, 'Pride and Prejudice' and 'A Tale of Two Cities', were provided to test the program. For the program to operate correctly, the script and texts must be stored in the same directory.* 

___

## 1. DEVELOPMENT PRINCIPLES
### 1.1 Conceptual Framework
The program prints and returns results for exact and multi-pattern searches of short strings and patterns in literary texts, which are short enough for linear-time solutions to work efficiently. We expect a simple string search to take O(n) time on average, i.e. a single linear pass through the text. In theory this can be improved with pattern preprocessing, such as a sliding window approach with optimization or heuristics (e.g. the Rabin-Karp and Boyer-Moore algorithms, which achieve O(n) and O(n/m) respectively), or with text preprocessing using a suffix tree, which achieves a query time of O(m). Here n is the length of the text and m is the length of the pattern. <br>
*For more details, see Appendix 1: String Search & Pattern Matching Notes & References*<br>
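
As a point of reference for the complexities quoted above, the following is a minimal sketch of the naive sliding-window search, which tries each of the n - m + 1 alignments and compares up to m characters at each one. The function name `naive_search` and the sample sentence are illustrative assumptions and are not part of the assignment code.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    matches = []
    # Try every alignment of the pattern against the text (n - m + 1 windows)
    for start in range(n - m + 1):
        # Each window comparison inspects at most m characters
        if text[start:start + m] == pattern:
            matches.append(start)
    return matches


# Hypothetical sample text, used only for illustration
sample = "it was the best of times, it was the worst of times"
positions = naive_search(sample, "times")
print(positions)
```

Unlike the regex searches used later in this document, this sketch is case-sensitive and reports overlapping matches.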


### 1.2 REGEX Approach
Regular expression engines use sophisticated algorithms for pattern matching, including Finite State Automata and Thompson NFA (Python's `re` module itself uses a backtracking matcher).  During testing on the texts specified for this assignment, the REGEX solution outperformed several popular sliding window and trie matching algorithms.  However, in real-world applications using very large texts or datasets, such as DNA sequence matching, Intrusion Detection Systems or datasets compiled for data mining, specialized search algorithms will typically outperform REGEX and naive/brute-force approaches on average.  <br>
*For a comparison of performance, see Appendix 2: REGEX solution v. Knuth-Morris-Pratt, Wu-Manber, etc.* <br>


### 1.3 Time and Space efficiency
The regex solution performs a linear search with a time of O(n) (best case) and O(n·m) (worst case). We will also look at a more efficient algorithm for large texts and long patterns, the Boyer-Moore algorithm, which has a preprocessing time complexity of O(m) for the bad character table and good suffix table, a best-case search time of O(n/m) and a worst-case search time of O(n·m). 

|Algorithm |Preprocessing time |Matching time complexity|
|:---|:---:|---:|
|Naïve string search algorithm| 0 (no preprocessing) |O((n-m+1) m)| 
|Trie-matching |0 (no preprocessing) |O (m + · n)|
|Rabin-Karp string search algorithm |θ(m) |O((n-m+1) m)| 
|Finite automata |O(m \|Σ\|)| θ(n)|
|Knuth-Morris-Pratt algorithm |θ(m) |θ(n)| 
|Boyer-Moore string search algorithm |O(m)| average O(n/m)|
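
To illustrate how a θ(m) preprocessing step buys a θ(n) matching phase, here is a minimal sketch of the Knuth-Morris-Pratt approach listed in the table. The function names `build_failure_table` and `kmp_search` are illustrative assumptions and do not come from the assignment code or the appendices.

```python
def build_failure_table(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it (preprocessing step)."""
    failure = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return start indices of all (possibly overlapping) matches."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse what is already known; never re-read text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # continue, allowing overlapping matches
    return matches


print(build_failure_table("abab"))
print(kmp_search("abababa", "aba"))
```

Each text character is read once, which is where the linear matching time in the table comes from.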

### 1.4 References
* *A Fast String Searching Algorithm*, R. S. Boyer, J. S. Moore (ACM, 1977)<br>
* *Analysis of Multiple String Pattern Matching Algorithms*, A. I. Jony, (IJASCIT, 2014)<br>
* *Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA)*, Z. Qadi, M.J. Aqel, I. El Emary (IJSC, 2007)


Additional references are listed in Appendix 1<br>
An invaluable resource is J.S. Moore's personal website, which has excellent examples of how different fast matching algorithms work:<br>
https://www.cs.utexas.edu/~moore/best-ideas/string-searching/index.html

___

# 2. Iterative Development of Program
The program is iteratively developed as follows:<br>
*  Formulation of function for Text Loading & String Count 
*  Formulation of function for pattern matching with counter
*  Error handling with exceptions
*  Combination of functions and driver in simple program with docstrings, annotation & dynamic formatting
*  Testing of program with string and list inputs

## 2.1 LOAD TEXT

For the assignment, a word is defined as a continuous sequence of characters from the ranges A–Z, a–z, 0–9, and the underscore (`_`) character. This corresponds to the regular expression `\w` for a single character, or `\w+` for a whole word. (In Python 3, `\w` also matches non-ASCII letters and digits in `str` patterns unless the `re.ASCII` flag is set.) All other characters are treated as separators when counting words in the selected text, e.g. "they're" counts as two separate words, "they" and "re". 
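
The word definition above can be checked on a short string. This is a minimal sketch; the sample sentence is an assumption chosen to show an apostrophe, a number and an underscore.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "They're off: 2 horses, one post_chaise!"

# Every maximal run of word characters counts as one "word"
tokens = re.findall(r"\w+", sample)
print(tokens)
print(len(tokens))
```

The apostrophe splits "They're" into two tokens, while the digit and the underscore-joined term each count as a single word.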

In [1]:
import re # only library permitted for exercise

# Open, read & close selected text using 'with'
file_path = 'a-tale-of-two-cities.txt' 
with open(file_path, "r", encoding = "utf-8") as f:  # NB: utf-8 encoding necessary 
    text = f.read()

# Use re.findall() to determine word count for the text
text_length = len(re.findall('\\w+', text))  
print(f"There are {text_length} words in {file_path}")

There are 138206 words in a-tale-of-two-cities.txt


## 2.2 PATTERN MATCHING WITH REGEX

### *Single String Pattern & Frequency*

In [2]:
# Use RE to define pattern 
word = "France"  # Random word input
pattern = '\\b' + word + '\\b' 

# Match all instances of the pattern and return count in string
frequency = len(re.findall(pattern, text, re.IGNORECASE))  #  search should not be case-sensitive
print(f"The word '{word}' appears {frequency} times")


The word 'France' appears 59 times


### *List Pattern & Frequency*

Terms are not limited to single words, but can also be a list of words/terms
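
As an alternative to looping over the terms, a list can also be searched in a single pass by joining the terms into one alternation. The sketch below is an assumption about how that could look, not the approach used in the cells that follow; `count_terms` is a hypothetical helper name.

```python
import re


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count whole-word, case-insensitive matches for each term in one pass."""
    cleaned = [term.strip() for term in terms]
    if not cleaned:
        return {}

    # Longest terms first so the alternation prefers the longer alternative
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    counts = {term: 0 for term in cleaned}
    lookup = {term.lower(): term for term in cleaned}  # map match text to term
    for match in pattern.finditer(text):
        counts[lookup[match.group(0).lower()]] += 1
    return counts


# Hypothetical sample text, used only for illustration
print(count_terms("A man and a woman met a Man.", ["man", "woman"]))
```

Because a single pass only finds non-overlapping matches, the totals can differ from the per-term loop when one term contains another (e.g. "old man" and "man").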

In [3]:
# Create a loop for pattern list
substring_list = ['man', 'woman']
for word in substring_list:
    word = word.strip()
    pattern = '\\b' + word + '\\b'
    frequency = len(re.findall(pattern, text, re.IGNORECASE)) 
    print(f"{word} occurrence = {frequency}")

man occurrence = 300
woman occurrence = 74


In [4]:
# Include a counter for list results
substring_list = ['man', 'woman']
counter = 0 
for word in substring_list:
    word = word.strip()
    pattern = '\\b' + word + '\\b'
    frequency = len(re.findall(pattern, text, re.IGNORECASE))  # Repeat frequency call for this part of function 
    counter += frequency # Add this term's frequency to the running total
    print(f"{word} occurrence = {frequency}")
print(f"Total patterns found: {counter}")

man occurrence = 300
woman occurrence = 74
Total patterns found: 374


## 2.3 FUNCTIONS & ERROR HANDLING
To simplify the program's development, debugging and maintenance, we create two individual functions. Errors raised at runtime are caught with a simple catch-all `except Exception as e:` clause and printed (not raised) using f-string formatting, with a distinct error message in each function so that users can unambiguously identify the source of an error. 

### *File Opener*

The file opener uses a `try`/`except` construct to facilitate error handling. Rather than raising the error and stopping the program, an error report indicates the source of the problem to the user, and the function returns `None`. For simplicity, the catch-all `except Exception as e:` is used with a message clearly indicating to users of the final program that the error was raised in this function.
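
For comparison with the catch-all approach, the sketch below catches the most likely file errors individually so that each gets its own message. It is an illustrative alternative under stated assumptions, not the version used in this program; `read_text` is a hypothetical name.

```python
from typing import Optional


def read_text(file_path: str) -> Optional[str]:
    """Return the file's text, or None after printing a specific error report."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"File not found: '{file_path}'")
    except UnicodeDecodeError as e:
        print(f"'{file_path}' is not valid UTF-8: {e}")
    except OSError as e:  # other I/O problems, e.g. permissions
        print(f"Could not read '{file_path}': {e}")
    return None


# Hypothetical file name, assumed not to exist
result = read_text("wrong_file_name_for_illustration.txt")
```

`FileNotFoundError` is listed before `OSError` because it is a subclass of it; the more specific handler has to come first to be reached.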

In [5]:
def file_opener(file_path):
    try:
        with open(file_path, "r", encoding = "utf-8") as f:
            text = f.read()  
            word_count = len(re.findall('\\w+', text))            
            print(f"There are {word_count} words in the text '{file_path}'")
            return text
            
    except Exception as e:
        print(f"Please check file_path, an error occurred: {e}")


In [6]:
# Check error reporting is working
file_opener("wrong_file_name")

Please check file_path, an error occurred: [Errno 2] No such file or directory: 'wrong_file_name'


### *Word Counter*
The word counter uses a boolean conditional (`if`/`else`) to determine whether the search terms are handled as a single word or as a list.  The conditional is wrapped in a `try`/`except` structure to facilitate error handling. Outputs are not formatted at this stage.
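
One caveat of building the pattern by string concatenation is that regex metacharacters in a search term (such as `.`) are interpreted as pattern syntax. The sketch below shows the effect of `re.escape`; it is a suggested refinement under that assumption, not part of the functions in this section, and `count_term` is a hypothetical helper.

```python
import re


def count_term(text: str, term: str, escape: bool = True) -> int:
    """Count whole-word, case-insensitive matches of a single term."""
    cleaned = term.strip()
    body = re.escape(cleaned) if escape else cleaned
    return len(re.findall(r"\b" + body + r"\b", text, re.IGNORECASE))


# Hypothetical sample text, used only for illustration
sample = "Mr. Darcy bowed. Mrs Darcy smiled."

literal_hits = count_term(sample, "Mr. Darcy")                # '.' is a literal dot
pattern_hits = count_term(sample, "Mr. Darcy", escape=False)  # '.' matches any character
print(literal_hits, pattern_hits)
```

Without escaping, the unescaped dot also matches the "s" in "Mrs Darcy", so the term is over-counted.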

In [7]:
# Define search function
def word_count_with_regex(file_path, search_terms):

    # Open file with opener function
    text = file_opener(file_path)
    
    # Search for frequency of substring/pattern
    try:                
        # boolean handles input as word or list of words
        if isinstance(search_terms, str): 
            pattern = '\\b' + search_terms + '\\b'
            frequency = len(re.findall(pattern, text, re.IGNORECASE))
            print(f"The word '{search_terms}' appears {frequency} times.\n")
        
        else:
            # Input is a list for pattern match, not a word for exact substring match 
            print(f"The {len(search_terms)} search terms appear in this text as follows: \n")          

            # Strip words from substring list
            counter = 0 
            for word in search_terms:
                word = word.strip()
                pattern = '\\b' + word + '\\b'
                frequency = len(re.findall(pattern, text, re.IGNORECASE))  # frequency call for this part of function only 
                counter += frequency 
                print(f"{word} occurrence = {frequency}")
            print(f"Total patterns found: {counter}")

    except Exception as e:
        print(f"Please check search_terms. An error occurred: {e}")
        
    return


In [8]:
word_count_with_regex('a-tale-of-two-cities.txt', ['man', 'woman', 'horse'])

There are 138206 words in the text 'a-tale-of-two-cities.txt'
The 3 search terms appear in this text as follows: 

man occurrence = 300
woman occurrence = 74
horse occurrence = 24
Total patterns found: 398


In [9]:
# Check error handling
word_count_with_regex('a-tale-of-two-cities.txt', 42)

There are 138206 words in the text 'a-tale-of-two-cities.txt'
Please check search_terms. An error occurred: object of type 'int' has no len()


___

# 3. Combining Functions to create Program

The functions with regex appear to be working as intended, so we can build the program with the desired formatting. The program combines the two functions with a driver and returns results in sentence or table format, with a dynamic layout using f-string formatting and docstrings to explain the functions.  The `file_opener` function passes the selected text to the `word_count_summary` function, which counts the instances of each substring/pattern.  To prevent a NameError when the script is run directly, the driver provides default values for the `file_path` and `search_terms` parameters. <br>

In [10]:
#!/usr/bin/env python
# NB. Code block can be run from terminal (the shebang must be the first line of the script)
import re

def file_opener(file_path: str) -> str:
    """
    Opens a text file with a context manager (closed automatically) and reads it
    Output: Returns the selected text, or None after reporting a file handling error
    """
    try:
        with open(file_path, "r", encoding = "utf-8") as f:
            content = f.read()  
            word_count = len(re.findall('\\w+', content))    # Regex 'words' defined as A–Z, a–z, 0–9, _
            print(f"There are {word_count} words in the text '{file_path}'")
            return content
            
    except Exception as e:
        print(f"A file_path error occurred: {e}")


# Define search function
def word_count_summary(file_path: str, search_terms: str) -> list[int]:
    """
    Scans for frequency of substrings & patterns in str texts
    Arguments: 
        Name and path of a file containing text
        String input of substring/pattern
    Returns:  
        Exact match of substring/individual words (str)
        Multi-pattern match for list of words (in tabular format int[])
    """
    # Open file with opener function
    content = file_opener(file_path)
    
    # Search for frequency of terms/expressions
    try:                
        if isinstance(search_terms, str): # input is string, not list
            pattern = '\\b' + search_terms + '\\b'
            frequency = len(re.findall(pattern, content, re.IGNORECASE))
            print(f"The word '{search_terms}' appears {frequency} times.\n")
            return frequency
        
        else:
            print(f"The {len(search_terms)} search terms appear in this text as follows: \n")  # Input is a list, not string
        
            # Dynamic formatting for table
            longest_term = max((term.strip() for term in search_terms), key=len)  # longest term determines width of columns
            column_width = max(len(longest_term), 8)  # sets minimum width of columns at 8
            line_break = f"| {'-' * column_width} | {'-' * column_width} |"  # Formatting for header/end breaks

            # Print header
            header = "| {0:^{1}} | {2:^{1}} |".format('Word', column_width, 'Count') # header center-aligned
            print(f"{line_break}\n{header}\n{line_break}")

            # Strip terms for multi-term search
            word_counter = 0 
            for word in search_terms:
                word = word.strip()
                pattern = '\\b' + word + '\\b'
                frequency = len(re.findall(pattern, content, re.IGNORECASE))
                word_counter += frequency
                print(f"| {word:<{column_width}} | {frequency:>{column_width}} |")  # Output aligned left/right by column
    
            #  Aggregate search counts of all strings 
            sum_total = f"| {'Total':<{column_width}} | {word_counter:>{column_width}} |"
            print(f"{line_break}\n{sum_total}\n{line_break}\n")
            return word_counter
                
    except Exception as e:
        print(f"A counter error occurred: {e}")
        
    return

# Set driver code default values to  
# prevent TypeError/NameErrors when run as program 
if __name__ == '__main__':
    file_path  = 'pride-and-prejudice.txt'
    search_terms  = ['man', 'woman', 'horse']
    word_count_summary(file_path, search_terms)

There are 122900 words in the text 'pride-and-prejudice.txt'
The 3 search terms appear in this text as follows: 

| -------- | -------- |
|   Word   |  Count   |
| -------- | -------- |
| man      |      150 |
| woman    |       61 |
| horse    |        3 |
| -------- | -------- |
| Total    |      214 |
| -------- | -------- |



### *Testing the Program*

In [11]:
help(word_count_summary)

Help on function word_count_summary in module __main__:

word_count_summary(file_path: str, search_terms: str) -> list[int]
    Scans for frequency of substrings & patterns in str texts
    Arguments:
        Name and path of a file containing text
        String input of substring/pattern
    Returns:
        Exact match of substring/individual words (str)
        Multi-pattern match for list of words (in tabular format int[])



In [12]:
# Test the program with both texts and search for a string/list of strings

# Love and 18th century happiness
word_count_summary("pride-and-prejudice.txt", "the")  # test words from Assignment notes
word_count_summary("pride-and-prejudice.txt", ["Jane", "Elizabeth", "Mary", "Kitty", "Lydia"])  # test a list of words  
word_count_summary("pride-and-prejudice.txt", ["round", "ability", "enemy"])   # test words from Assignment Appendix
word_count_summary("pride-and-prejudice.txt", ["old man", "strong woman", "love", "marriage", "happy"])  

# War, evil and self-sacrifice
word_count_summary("a-tale-of-two-cities.txt", "pizza")
word_count_summary("a-tale-of-two-cities.txt", ["London", "Paris"])
word_count_summary("a-tale-of-two-cities.txt", ["sacrifice", "fear", "death", "guillotine", "victims", "orphan"])  
word_count_summary("a-tale-of-two-cities.txt", ["innocent man", "God", "the devil"])   # test list of strings and patterns            


There are 122900 words in the text 'pride-and-prejudice.txt'
The word 'the' appears 4333 times.

There are 122900 words in the text 'pride-and-prejudice.txt'
The 5 search terms appear in this text as follows: 

| --------- | --------- |
|   Word    |   Count   |
| --------- | --------- |
| Jane      |       292 |
| Elizabeth |       634 |
| Mary      |        39 |
| Kitty     |        71 |
| Lydia     |       171 |
| --------- | --------- |
| Total     |      1207 |
| --------- | --------- |

There are 122900 words in the text 'pride-and-prejudice.txt'
The 3 search terms appear in this text as follows: 

| -------- | -------- |
|   Word   |  Count   |
| -------- | -------- |
| round    |       17 |
| ability  |        0 |
| enemy    |        1 |
| -------- | -------- |
| Total    |       18 |
| -------- | -------- |

There are 122900 words in the text 'pride-and-prejudice.txt'
The 5 search terms appear in this text as follows: 

| ------------ | ------------ |
|   


# 4. Algorithm Optimization with Heuristics: Boyer Moore Algorithm

*"Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is 'sublinear' in the sense that it generally looks at fewer characters than it passes."* J.S. Moore 

As indicated, there are several sliding window algorithms for improving pattern matching performance on large datasets, either by preprocessing the pattern or by preprocessing the text for substring search using suffixes.  A workbook examining the relative performance of representative algorithms can be found in Appendix 2. Where certain algorithms performed well for exact substring/pattern matching, their performance for multiple pattern matching was inferior to regular expressions and multi-pattern algorithms, i.e. exact match algorithms are obliged to loop through the search terms.  For the literary texts used in this exercise, the window shifting KMP and Boyer-Moore algorithms outperformed the regex pattern search, while the exact search and trie algorithms were slower. <br>
<br>
It should be noted that this notebook does not pretend to offer an empirical analysis of the solution against algorithms for exact and multi-pattern searches, but to put the REGEX solution in context by looking at one of the most popular window sliding algorithms for multiple pattern search, the Boyer-Moore algorithm, which was the best performing algorithm on the exercise texts.  

### Algorithm
Proposed in 1977, the Boyer-Moore algorithm is a string matching algorithm that achieves high efficiency by beginning from the last character of the pattern and skipping sections of the text during the search phase. It leverages two heuristics: the bad character rule and the good suffix rule, which determine how far the pattern can be shifted when a mismatch occurs. These rules significantly reduce the number of comparisons, especially in practical scenarios where the pattern is long or the alphabet is large. The algorithm’s worst-case time complexity is O(n ⋅ m) but it performs close to O(n / m) on average for random text and patterns.
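
The bad character rule described above can be illustrated in isolation. The sketch below builds the last-occurrence table for a pattern and computes the shift after a mismatch; the helper names are illustrative assumptions, and the good suffix rule is deliberately left out.

```python
def bad_character_table(pattern: str) -> dict[str, int]:
    """Map each character to the index of its last occurrence in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def bad_character_shift(pattern: str, mismatch_index: int, text_char: str) -> int:
    """How far the pattern may slide when text_char mismatches at mismatch_index."""
    last = bad_character_table(pattern).get(text_char, -1)  # -1: not in pattern
    # Align the last occurrence of text_char with the mismatch, moving at least 1
    return max(1, mismatch_index - last)


table = bad_character_table("horses")
print(table)

# Mismatch on the final character (index 5) of "horses"
print(bad_character_shift("horses", 5, "x"))  # character absent from the pattern
print(bad_character_shift("horses", 5, "o"))  # character present at index 1
```

A text character that does not occur in the pattern lets the window jump by the full pattern length, which is why longer patterns tend to be searched faster.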


___

**DISCLOSURE**<br>
The code for the Boyer-Moore model was not independently developed. It is derivative work synthesized from several works, most importantly the following papers:<br>
* **A Fast String Searching Algorithm**, R. S. Boyer, J. S. Moore (ACM, 1977)<br>
* **Analysis of Multiple String Pattern Matching Algorithms**, A. I. Jony, (IJASCIT, 2014)<br>
* **Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA)**, Z. Qadi, M.J. Aqel, I. El Emary (IJSC, 2007)


___

In [13]:
# Additional libraries are used to compare the solutions
import time  # Import the time library
from collections import Counter

def regex_counter(file_path, search_terms):
    # Time the complete regex solution, including file loading and printing
    start_time = time.time()
    word_count_summary(file_path, search_terms)
    end_time = time.time()
    re_total_time = end_time - start_time
    print(f"REGEX total time: {re_total_time:.4f} seconds")
    
regex_counter("a-tale-of-two-cities.txt", ["birds", "women", "horses"])

There are 138206 words in the text 'a-tale-of-two-cities.txt'
The 3 search terms appear in this text as follows: 

| -------- | -------- |
|   Word   |  Count   |
| -------- | -------- |
| birds    |       10 |
| women    |       61 |
| horses   |       40 |
| -------- | -------- |
| Total    |      111 |
| -------- | -------- |

REGEX total time: 0.0806 seconds


In [14]:
# Disclosure & Acknowledgment:
# This version of the Boyer-Moore algorithm is
# adapted from an interpretation found at GeeksforGeeks.com
# https://www.geeksforgeeks.org/boyer-moore-algorithm-for-pattern-searching/
# and from
# https://github.com/je-suis-tm/search-and-sort/blob/master/boyer%20moore%20search.py
# NB: only the bad character rule is implemented here, not the good suffix rule

def boyer_moore_search(content, pattern):
    n = len(content)
    m = len(pattern)
    if m == 0 or m > n:
        return []

    # Bad character table: last index at which each character occurs in the pattern
    bad_char = dict()
    for i in range(m):
        bad_char[ord(pattern[i])] = i

    matches = []
    shift = 0

    # Slide the pattern over the content
    while shift <= n - m:
        j = m - 1

        # Match pattern from the end
        while j >= 0 and content[shift + j] == pattern[j]:
            j -= 1

        if j < 0:
            # Pattern found at this shift
            matches.append(shift)
            # Skip past the match, so only non-overlapping matches are counted
            shift += (m if (shift + m) < n else 1)  # Move by the full length or 1 if at the end of the text
        else:
            # Use the bad character heuristic to shift window
            bad_char_shift = bad_char.get(ord(content[shift + j]), -1)  # -1 if the character does not occur in the pattern
            shift += max(1, j - bad_char_shift)

    return matches

def count_words(text):
    return len(text.split())

if __name__ == "__main__":
    file_name = "a-tale-of-two-cities.txt"
    with open(file_name, "r", encoding="utf-8") as file:
        text = file.read()

    patterns = ["birds", "women", "horses"]

    total_words = count_words(text)
    print(f"There are {total_words} words in the text: {file_name}")

    total_matches = Counter()
    bm_total_time = 0

    for pattern in patterns:
        start_time = time.time()
        matches = boyer_moore_search(text, pattern)
        end_time = time.time()

        bm_total_time += (end_time - start_time)
        total_matches[pattern] += len(matches)
        
        print(f"{pattern} : {len(matches)} matches")

    print(f"\nBoyer-Moore total time: {bm_total_time:.4f} seconds")


There are 135660 words in the text: a-tale-of-two-cities.txt
birds : 10 matches
women : 61 matches
horses : 40 matches

Boyer-Moore total time: 0.1225 seconds


### Conclusions
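The time performance of the Boyer-Moore algorithm at 0.1225 seconds is significantly slower than REGEX at 0.0806 seconds, even though the REGEX timing also includes loading the file and printing the table. The two solutions report a different number of 'words' in the text (135660 v. 138206) because they define a word differently (whitespace splitting v. `\w+`), while the counts for the three search terms are identical in this run. Note that the Boyer-Moore function performs a case-sensitive substring match, whereas the regex solution is case-insensitive and requires word boundaries, so the counts can differ for other search terms. The performance of other algorithms can be found in the Jupyter notebook in Appendix 2.<br>
It should be noted that the test is for illustration purposes only and is by no means an empirical demonstration of the efficiency of REGEX compared to sophisticated algorithms, which would be expected to perform better on longer patterns and larger string inputs.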
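
A like-for-like comparison would time only the search step, on text already in memory, and repeat each measurement. The sketch below shows one way this could be set up with `time.perf_counter`; the synthetic text, the helper names and the use of `str.count` as a stand-in for `boyer_moore_search` are all assumptions made for illustration.

```python
import re
import time


def best_time(func, *args, repeats: int = 5):
    """Run func(*args) several times; return (fastest seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def regex_count(text: str, term: str) -> int:
    # Case-sensitive substring count, so both candidates do the same job
    return len(re.findall(re.escape(term), text))


def builtin_count(text: str, term: str) -> int:
    # Stand-in for boyer_moore_search(); any search function can be plugged in
    return text.count(term)


# Synthetic in-memory text (an assumption) so that file I/O is not timed
text = "the horses and the women watched the birds " * 1000

for term in ["birds", "women", "horses"]:
    regex_seconds, regex_hits = best_time(regex_count, text, term)
    other_seconds, other_hits = best_time(builtin_count, text, term)
    print(f"{term}: regex {regex_hits} hits in {regex_seconds:.6f}s, "
          f"stand-in {other_hits} hits in {other_seconds:.6f}s")
```

Taking the fastest of several runs reduces the influence of background activity on short measurements.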

___

### *Code References by Academic Module*

___

**File Path (Lecture 14)**<br>
The file path code is based on instruction received in Lecture 14, Files. While it is good practice to close a file explicitly to prevent corruption of the file and avoid OS limits on open files, using a context manager (`with`) lets Python close the file automatically, even if an error occurs.<br>
<br>
**REGEX patterns & Word Count (Lecture 16)**<br>
The REGEX search patterns are based on instruction received in Lecture 16, Regular Expression.  We search for (A–Z, a–z, 0–9, _ ) using the '\\w+' formulation. The selected text is searched using a pattern that places the search term between word boundaries ('\\b'). <br> Please note that a more accurate count of words, which excludes approx. 100 numeric expressions, can be obtained by using the code: <br>
    `word_pattern = '\\b[^\d\W_]+\\b'` <br>
    `non_numeric_words = re.findall(word_pattern, text)` <br>
    `numeric_count = word_count - len(non_numeric_words)` <br>
    `print(f"There are {word_count} strings in this text, including", len(non_numeric_words), f"words and {numeric_count} numerics")` <br>
<br>
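
The snippet above can be tried on a short string. This is a minimal sketch with an assumed sample sentence; note that the pattern only accepts tokens made up entirely of letters, so tokens mixing letters with digits or underscores are counted with the numerics by the subtraction.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "In 1775 there were 2 kings and 1 queen"

all_tokens = re.findall(r"\w+", sample)                   # every \w+ token
non_numeric_words = re.findall(r"\b[^\d\W_]+\b", sample)  # letters-only tokens
numeric_count = len(all_tokens) - len(non_numeric_words)

print(f"There are {len(all_tokens)} strings in this text, including "
      f"{len(non_numeric_words)} words and {numeric_count} numerics")
```
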
**_Printf_ formatting (Lecture 11)**<br>
The _printf_ formatting is based on instruction received in Lecture 11, String Manipulation. <br> 
Using a generator expression, the program iterates through the stripped search terms to find the length of the longest term to be included in the table of results. The minimum width of the table's equal-sized columns is set at 8 characters (dictated by the title words). Words in the title row are center-aligned using the `^` format specifier, while the left ('Word') column is left-aligned using `<` and the right ('Count') column is right-aligned using `>`, each with a nested width field.<br>
<br>
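
The alignment rules described above can be sketched as a small stand-alone function. `format_table` is a hypothetical helper written for illustration; it returns the table as a string instead of printing row by row as the program does.

```python
def format_table(rows: list[tuple[str, int]], min_width: int = 8) -> str:
    """Build a two-column table whose width follows the longest word."""
    width = max([min_width] + [len(word) for word, _ in rows])
    rule = f"| {'-' * width} | {'-' * width} |"

    # Header is center-aligned with '^' and a nested width field
    lines = [rule, f"| {'Word':^{width}} | {'Count':^{width}} |", rule]
    for word, count in rows:
        # Words are left-aligned with '<', counts right-aligned with '>'
        lines.append(f"| {word:<{width}} | {count:>{width}} |")
    lines.append(rule)
    return "\n".join(lines)


table = format_table([("man", 150), ("woman", 61), ("horse", 3)])
print(table)
```
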
**Input string & list of strings (Lecture 6)**<br>
Per instruction in Lecture 6, Conditionals, we determine the processing of the search terms using a boolean test (`isinstance`) and an _'if-else'_ conditional, which processes the search terms either as a string or as a list. <br> 
*  Where the boolean is True, the search term is determined to be a single string and the frequency of the search term is reported within a string (a simple sentence).
*  Where the boolean is False, the search terms are assumed to be a list, and the frequencies are computed by iterating through the list in a loop and recorded in a table formatted using printf.
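
The branching described in the bullets above could also be expressed by normalizing the input to a list first. This is a sketch of that alternative, assuming a hypothetical helper `normalize_terms`; it is not how the program itself is structured.

```python
def normalize_terms(search_terms) -> list[str]:
    """Return the search terms as a list of stripped strings."""
    if isinstance(search_terms, str):
        return [search_terms.strip()]  # a single term becomes a one-item list
    if isinstance(search_terms, (list, tuple)):
        return [str(term).strip() for term in search_terms]
    # Anything else (e.g. an int) is reported with an explicit message
    raise TypeError(
        f"search_terms must be a str or a list of str, not {type(search_terms).__name__}"
    )


print(normalize_terms(" France "))
print(normalize_terms(["man ", " woman"]))
```

With the input normalized, the single-word and list cases can share one counting loop, and the sentence or table layout can be chosen from the length of the list.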

**Error handling (Topic 2, Lecture 5)**<br>
Per instruction in Topic 2: Lecture 5 'Structure: Errors and exception handling', we use the 'try-except' formulation for handling errors.  The errors are not re-raised; instead, error reports are printed so that the source of the problem can easily be identified. 

___

### Appendices
**Appendix 1:** *Overview of String Search & Search Pattern Matching Algorithms*<br>
**Appendix 2:** *Jupyter Notebook: Comparison of Regex v. Popular Pattern Matching Algorithms*
