# String Matching Algorithms

![Medieval String Matching](https://github.com/ValRCS/RTU_Algorithms_DIP321/blob/main/imgs/Medieval_May%2020,%202025,%2011_13_48%20AM.png?raw=true)

## Description

### String Matching

String matching is a fundamental problem in computer science: finding a given pattern (the needle) in a text (the target, or haystack). The pattern can be a plain string, that is, a sequence of characters, or a regular expression.

Haystack is not the most precise term, since a string is a sequence of characters and not a stack, but it is the common name for the text in string matching algorithms.

In [3]:
# in Python we already have the string methods find / index (find returns -1 when nothing is found, index raises ValueError)
text = "MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???"
text.find("RTU"), text.index("RTU"), text.rfind("RTU"), text.rindex("RTU")

(23, 23, 38, 38)

In [4]:
# how about finding all indexes of a substring that matches the search string
# we can loop through the string and find the index of the substring and then move the pointer to the next index
needle = "RTU"
text = "MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???"
# we will save indexes in a list - so we start with a blank container
indexes = []
# we will start from 0
index = 0
# we will loop through the text
while index < len(text):
    # we will find the index of the substring
    index = text.find(needle, index) # so find has a start index
    # if we find it we will add it to the list
    if index != -1:
        indexes.append(index)
        # we will move the pointer to the next index - this way we do not find the same substring again, we start looking from the next index
        index += 1
    else: # means we did not find the substring
        break
indexes

[23, 27, 31, 38]

In [5]:
# we can double check using slicing and the length of the needle
for index in indexes:
    print(f"Starting at index {index}", text[index:index+len(needle)]) # so we slice the text from the index to the index + length of the needle

Starting at index 23 RTU
Starting at index 27 RTU
Starting at index 31 RTU
Starting at index 38 RTU


## Brute force approach

In [6]:
def find_needle(text, needle):
    '''
    Brute force search for needle in text
    '''
    for i in range(len(text)): # more Pythonic would be to use enumerate
        # python slicing makes this easy
        # we check whether current window of text matches the needle
        if text[i:i+len(needle)] == needle: 
            return i # if we had to find all we could add the index to a list
    return -1

find_needle(text, "RTU")

23

In [7]:
# lets rewrite the above function without slicing
def find_needle_no_slice(text, needle):
    '''
    Brute force search for needle in text
    Not pythonic implementation - more like C
    '''
    for i in range(len(text)-len(needle)+1): # slight optimization we do not need to go to the end of the text
        # compare the window character by character instead of slicing
        for j in range(len(needle)):
            if text[i+j] != needle[j]: # so we check each character one by one
                # here we would add a flag if we did not use the else for the for loop (not available in C or Java)
                break # we failed to find the needle so no need to continue inner loop
        else: # this else is for the for loop, not the if - it means we did not break out of the loop
            return i # if we didn't break out of the loop, we found the needle

    return -1

find_needle_no_slice(text, "RTU")

23

## Complexity of brute force algorithm

The brute force algorithm is the simplest algorithm for string matching. It slides the pattern over the text one position at a time and checks for a match at each position. Its worst-case time complexity is O(mn), where m is the length of the pattern (the needle) and n is the length of the text.

Some sources swap the two letters and use n for the pattern and m for the text, so check the definitions before comparing formulas.
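
To make the O(mn) bound concrete, here is a minimal sketch that counts the character comparisons made by a brute-force search. The function name `count_comparisons` and the sample strings are made up for this illustration and are not used elsewhere in the notebook.

```python
def count_comparisons(text, needle):
    """Brute-force search that also reports how many character comparisons it made."""
    comparisons = 0
    for i in range(len(text) - len(needle) + 1):
        for j in range(len(needle)):
            comparisons += 1
            if text[i + j] != needle[j]:
                break  # mismatch: give up on this window
        else:
            return i, comparisons  # inner loop finished without break: match
    return -1, comparisons

# adversarial input: every window matches until its very last character
print(count_comparisons("a" * 20 + "b", "aaab"))  # (17, 72)

# typical input: most windows fail on the first character
print(count_comparisons("MIT HARVARD OXFORD RTU", "RTU"))
```

In the adversarial case all 18 windows cost the full 4 comparisons, which is exactly (n - m + 1) * m. On ordinary text most windows are rejected after a single comparison, which is why brute force is often acceptable in practice.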

## Knuth-Morris-Pratt Algorithm

### Description

**Big idea: after a partial match we already know which text characters we have just read, so we can use that information to avoid re-comparing characters that we know will match anyway.**


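In KMP we build a prefix table from the pattern. For each position it stores the length of the longest proper prefix of the pattern that is also a suffix of the part matched so far. This table is called the failure function or the prefix function.
When a mismatch occurs, the table tells us how far the pattern can be shifted without re-comparing text characters that we already know will match, so we never move backwards in the text.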
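
The definition of the prefix table can be checked with a deliberately slow sketch that follows it literally. The name `naive_prefix_table` is hypothetical, and this version is only meant to show what the table contains, not how KMP builds it efficiently.

```python
def naive_prefix_table(needle):
    """For each prefix of needle, find the longest proper prefix that is also a suffix."""
    table = []
    for i in range(len(needle)):
        prefix = needle[:i + 1]
        best = 0
        # k < len(prefix) keeps the candidate a *proper* prefix
        for k in range(1, len(prefix)):
            if prefix[:k] == prefix[-k:]:
                best = k
        table.append(best)
    return table

print(naive_prefix_table("aabaaa"))  # [0, 1, 0, 1, 2, 2]
print(naive_prefix_table("RTU"))     # [0, 0, 0]
```

For example, the entry 2 at index 4 says that in `"aabaa"` the prefix `"aa"` is also a suffix, so after a mismatch at the next character the search can continue as if `"aa"` had already been matched.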

We could also build a DFA using this information.

DFA - Deterministic Finite Automaton
https://en.wikipedia.org/wiki/Deterministic_finite_automaton
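
As a rough illustration of the DFA view, the sketch below builds one transition table per pattern position and then feeds the text through it one character at a time. The names `build_dfa` and `dfa_search` are hypothetical, and taking the alphabet from the characters of the text and pattern is an assumption made to keep the example short.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][char] is the next state; state k means k pattern characters are matched."""
    m = len(pattern)
    dfa = [{char: 0 for char in alphabet} for _ in range(m)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state we would be in after re-reading the matched part minus its first character
    for j in range(1, m):
        for char in alphabet:
            dfa[j][char] = dfa[restart][char]  # mismatch: behave like the restart state
        dfa[j][pattern[j]] = j + 1             # match: advance to the next state
        restart = dfa[restart][pattern[j]]
    return dfa

def dfa_search(text, pattern):
    """Return the index of the first match or -1; every text character is read exactly once."""
    if not pattern:
        return 0
    dfa = build_dfa(pattern, set(text) | set(pattern))
    state = 0
    for i, char in enumerate(text):
        state = dfa[state][char]
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1

print(dfa_search("MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???", "RTU"))  # 23
```

The table costs memory proportional to the pattern length times the alphabet size, which is the price paid for never having to fall back during the scan.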

### Complexity

Time complexity: O(m+n) where m is the length of the pattern and n is the length of the text.

In [10]:
def kmp_table(needle):
    '''
    Build the KMP table for the needle
    '''
    # initialize table
    table = [0]*len(needle)
    i = 1
    j = 0
    # we start at 1 because table[0] is always 0
    while i < len(needle): 
        if needle[i] == needle[j]:
            table[i] = j+1
            i += 1
            j += 1
        elif j > 0:
            j = table[j-1]
        else:
            i += 1
    return table

kmp_table("RTU")

[0, 0, 0]

## Links on building DFA

* Original paper by Knuth, Morris and Pratt (1977): https://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knuth77.pdf
* Video: https://www.youtube.com/watch?v=GTJr8OvyEVQ


In [11]:
kmp_table("aabaaa")

[0, 1, 0, 1, 2, 2]

In [12]:
# kmp search
def kmp_search(text, needle, table=None):
    '''
    KMP search for needle in text
    '''
    if not needle:
        return 0 # an empty needle matches at index 0, same as str.find
    if table is None:
        table = kmp_table(needle) # a precomputed table can be passed in instead - building it is O(m), where m is the length of the needle
    i = 0
    j = 0
    while i < len(text):
        if text[i] == needle[j]:
            # so if we are at the end of the needle, we found it!!
            if j == len(needle)-1:
                return i-j
            else:
                i += 1
                j += 1
        elif j > 0:
            j = table[j-1] # fall back to the longest needle prefix that is also a suffix of what we just matched - note that i never moves backwards
        else:
            i += 1
    return -1

kmp_search(text, "RTU")

23

In [None]:
# left as an exercise for the reader run timing tests on the above functions
# and compare them to the python find method

# you can use %timeit in a jupyter notebook to time a function
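
Outside a Jupyter notebook the standard library `timeit` module can be used for the same exercise. The sketch below is one possible setup; `brute_force_find`, `sample_text` and `sample_needle` are names invented for this example, and the numbers it prints depend entirely on the machine.

```python
import timeit

def brute_force_find(text, needle):
    """Slicing-based brute-force search, used here only as something to time."""
    for i in range(len(text) - len(needle) + 1):
        if text[i:i + len(needle)] == needle:
            return i
    return -1

sample_text = "ACGT" * 25_000 + "GATTACA"  # the needle only occurs at the very end
sample_needle = "GATTACA"

runs = 5  # timeit.timeit returns the total time for `number` calls
builtin_time = timeit.timeit(lambda: sample_text.find(sample_needle), number=runs) / runs
brute_time = timeit.timeit(lambda: brute_force_find(sample_text, sample_needle), number=runs) / runs

print(f"str.find:    {builtin_time * 1e3:.3f} ms per call")
print(f"brute force: {brute_time * 1e3:.3f} ms per call")
```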

## Boyer-Moore Algorithm

https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

![1980s](https://github.com/ValRCS/RTU_Algorithms_DIP321/blob/main/imgs/1980s_May%2020,%202025,%2011_07_58%20AM.png?raw=true)

### Description

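The Boyer-Moore algorithm is a string searching algorithm that compares the pattern against the text from the end of the pattern backwards and uses what it learns to skip characters in the text. It uses two rules to decide how far to shift the pattern.

1. Bad character rule: when a mismatch occurs, look at the text character that caused it. Shift the pattern to the right so that the rightmost occurrence of that character in the pattern lines up with it, always moving at least one position. If the character does not occur in the pattern at all, shift the whole pattern past it.
2. Good suffix rule: when some suffix of the pattern has already matched before the mismatch, shift the pattern so that another occurrence of that suffix in the pattern (or, failing that, the longest prefix of the pattern that is also a suffix of the matched part) lines up with the matched text. The implementation in this notebook uses only the bad character rule.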
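
The bad character rule can be sketched in a few lines. The helper names `bad_character_table` and `bad_character_shift` are hypothetical, and the shift is expressed as "how many positions the pattern moves right", which is a slightly different bookkeeping than an index-based implementation would use.

```python
def bad_character_table(needle):
    """Map each character to the index of its rightmost occurrence in needle."""
    return {char: index for index, char in enumerate(needle)}

def bad_character_shift(needle, j, text_char):
    """Shift suggested by the bad character rule for a mismatch at needle[j]."""
    last = bad_character_table(needle).get(text_char, -1)  # -1: character not in needle
    # if the rightmost occurrence lies to the right of j, the rule would move
    # the pattern backwards, so fall back to a shift of one position
    return max(1, j - last)

# needle positions: G=0 A=1 T=2 T=3 A=4 C=5 A=6
print(bad_character_shift("GATTACA", 6, "X"))  # 7 - X is not in the needle, skip past it
print(bad_character_shift("GATTACA", 6, "T"))  # 3 - line up the T at index 3
print(bad_character_shift("GATTACA", 4, "C"))  # 1 - rightmost C is right of the mismatch
```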

### Example page

Author: J Strother Moore
https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html



### Complexity

Worst case time complexity: O(mn) where m is the length of the pattern and n is the length of the text.

Best case time complexity: O(n/m) where m is the length of the pattern/needle and n is the length of the text.
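
The gap between the best and the worst case shows up clearly when we count how many text characters a bad-character-only search reads. The following is an instrumented sketch under the assumption of a non-empty needle; the name `boyer_moore_inspections` is made up for this example.

```python
def boyer_moore_inspections(text, needle):
    """Bad-character-only search that also counts how many text characters it reads."""
    last = {char: index for index, char in enumerate(needle)}  # rightmost occurrences
    m = len(needle)
    inspected = 0
    i = j = m - 1  # start comparing at the end of the needle
    while i < len(text):
        inspected += 1
        if text[i] == needle[j]:
            if j == 0:
                return i, inspected
            i -= 1
            j -= 1
        else:
            # a character missing from the needle gives last = -1, so the jump is m
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return -1, inspected

# best case: the text character under the needle's end never occurs in the needle
print(boyer_moore_inspections("a" * 1000, "b" * 10))       # (-1, 100)

# worst case: every window matches until the needle's first character
print(boyer_moore_inspections("a" * 1000, "b" + "a" * 9))  # (-1, 9910)
```

In the first call only every tenth character is read, which is the n/m behaviour. In the second call each of the 991 windows costs 10 comparisons, which is the (n - m + 1) * m behaviour of brute force.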

In [13]:
# let's implement Boyer Moore
def boyer_moore(text, needle):
    '''
    Boyer Moore search for needle in text
    '''
    # build the bad character table: rightmost index of each character in the needle
    table = {}
    for i in range(len(needle)):
        table[needle[i]] = i # later occurrences overwrite earlier ones
    # now we search
    i = len(needle)-1 # start at the end of the needle!
    j = i
    while i < len(text):
        if text[i] == needle[j]:
            if j == 0:
                return i
            else:
                i -= 1
                j -= 1
        else:
            if text[i] in table:
                i += len(needle) - min(j, 1+table[text[i]])
            else:
                i += len(needle)
            j = len(needle)-1
    return -1

boyer_moore(text, "RTU")

23

## Timing

In [14]:
import random
# insert seed here
random.seed(42) # answer to life the universe and everything
long_random_text = "".join([random.choice("ACGT") for i in range(1_000_000)])
long_random_text[:10]

'AAGCCCAATA'

In [15]:
# movie GATTACA - url https://www.imdb.com/title/tt0119177/
needle = "GATTACA"
long_random_text.find(needle)

27459

In [16]:
%timeit long_random_text.find(needle)

37.2 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [17]:
%timeit find_needle(long_random_text, needle)

3.63 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%timeit find_needle_no_slice(long_random_text, needle)

7.24 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%timeit boyer_moore(long_random_text, needle)

4.34 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit kmp_search(long_random_text, needle)

4.06 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
# let's look at the end of the long_random_text
long_random_text[-50:]

'CGTGGTTGGTTTCGGATCTGTTGACAGAGAACTGACCCCATCCGCCTTGA'

In [22]:
# our needle will be last 100 characters
needle = long_random_text[-100:]

In [23]:
%%timeit
long_random_text.find(needle)

1.23 ms ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
%%timeit
find_needle(long_random_text, needle)

137 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [25]:
%%timeit
find_needle_no_slice(long_random_text, needle)

267 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%%timeit
kmp_search(long_random_text, needle)

144 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [27]:
%%timeit
boyer_moore(long_random_text, needle)

193 ms ± 9.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


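### Python implementation of find

It turns out that the CPython implementation of find is based on ideas from the Boyer-Moore algorithm, more precisely on a simplified variant in the style of the Boyer-Moore-Horspool algorithm. Newer CPython versions may switch to other algorithms for some inputs, so treat this as a description of the general approach rather than of every version.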

* https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm
* Stack Overflow: https://stackoverflow.com/questions/681649/how-is-string-find-implemented-in-cpython
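
To give a feel for the Horspool idea, here is a minimal pure-Python sketch. It is not CPython's actual code, and the function name `horspool` is invented for this example. The shift always depends on the text character aligned with the last position of the needle.

```python
def horspool(text, needle):
    """Boyer-Moore-Horspool style search; returns the first match index or -1."""
    n, m = len(text), len(needle)
    if m == 0:
        return 0
    # distance from the rightmost occurrence of each character (excluding the
    # needle's last position) to the end of the needle
    shift = {char: m - 1 - index for index, char in enumerate(needle[:-1])}
    i = 0  # start index of the current window
    while i <= n - m:
        if text[i:i + m] == needle:
            return i
        # look only at the text character under the needle's last position;
        # characters that never occur in needle[:-1] allow a full jump of m
        i += shift.get(text[i + m - 1], m)
    return -1

print(horspool("MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???", "RTU"))  # 23
```

Horspool drops the good suffix rule entirely and keeps a single shift table, which makes it short and usually fast on natural text.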

## Timing longer needles

We expect that Boyer Moore and KMP will be faster than brute force for longer needles. Let's test this hypothesis.

In [23]:
# lets get some longer needle from back of the long_random_text
needle = long_random_text[-200:-50]
needle

'GCCATATTACTTAGGTTAAGGTTGGCGTACTCGTGTTTAACATCCGGCCTACGCAGGCTATTTTATACATTATTGTACTTTTTGATAGTTAGTCAATGCGCCACCGGTTCGTTAGAGGGTAGGTATCTCTTTTGGCGAGGATGCACGTCC'

In [24]:
%timeit long_random_text.find(needle)

3.6 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
%timeit find_needle(long_random_text, needle)

303 ms ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%timeit boyer_moore(long_random_text, needle)

378 ms ± 42.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%timeit kmp_search(long_random_text, needle)

261 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
# let's replace some characters in the long random text with X
long_list = list(long_random_text)
for i in range(950):
    long_list[i*1000] = "X"
long_random_text = "".join(long_list)

In [26]:
%timeit boyer_moore(long_random_text, needle)

233 ms ± 7.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%timeit find_needle(long_random_text, needle)

164 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [28]:
%timeit long_random_text.find(needle)

1.61 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [None]:
# so it does look like find and our naive Boyer Moore improve when text contains characters that are NOT in the needle


## Timing longer needles with longer text

Finally we will use bigger alphabets and longer texts to see how the algorithms perform.

In [28]:
import string
letters = string.ascii_letters
letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [30]:
long_text = "".join([random.choice(letters) for i in range(1000000)])
needle = long_text[-200:-50]
needle

'zyjnFPQbKJRTsQEawcXZWYnKTlJiZCbduFFXofSHHwcdGoTMpYsCcQMBpaYdcoNPXWnJChYWGcfsAGMKIVKwuLnEpWBSPOKeQdvfQuYGbYbRghMOsuQrzezdMwXnpePcIbpMzdahLYAHkwFXLYFGDj'

In [31]:
len(set(needle))

47

In [33]:
long_text.find(needle)

999800

In [32]:
%timeit long_text.find(needle)

336 µs ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [34]:
%timeit find_needle(long_text, needle)

269 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
%timeit kmp_search(long_text, needle)

284 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%timeit boyer_moore(long_text, needle)

13 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Rabin-Karp

The **Rabin-Karp algorithm** is a clever string searching technique that uses **hashing** to find all occurrences of a pattern in a text. It is particularly efficient when you need to search for **multiple patterns** in the same text.

---

## 📜 History

The Rabin-Karp algorithm was developed by **Michael O. Rabin and Richard M. Karp** in **1987**. It introduced a **probabilistic hashing approach** to string matching, making it faster in many real-world scenarios, especially for **multi-pattern searching**.

---

## 🧠 Key Idea

Instead of checking characters one by one (like KMP or Boyer-Moore), Rabin-Karp:

1. Computes a **hash** of the pattern.
2. Computes hashes of all substrings of the text with the same length.
3. Compares hashes:

   * If hash matches, do a **character-by-character check** to confirm.
   * If not, move on.

> Step 2 relies on a **rolling hash**: the hash of each window is updated in constant time from the hash of the previous window as we slide along the text, instead of being recomputed from scratch.
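
The rolling update can be checked against hashes computed from scratch. This is a small sketch; `window_hash`, `rolling_hashes` and the sample string are names chosen for this illustration, with `base=256` and `q=101` assumed as defaults.

```python
def window_hash(s, base=256, q=101):
    """Polynomial hash of s computed from scratch with Horner's rule."""
    value = 0
    for char in s:
        value = (value * base + ord(char)) % q
    return value

def rolling_hashes(text, m, base=256, q=101):
    """Hashes of every length-m window, each derived from the previous one in O(1)."""
    if m == 0 or m > len(text):
        return []
    high = pow(base, m - 1, q)  # weight of the character that leaves the window
    current = window_hash(text[:m], base, q)
    hashes = [current]
    for i in range(len(text) - m):
        current = (current - ord(text[i]) * high) % q      # drop the leftmost character
        current = (current * base + ord(text[i + m])) % q  # append the next character
        hashes.append(current)
    return hashes

sample = "RRRRRTU RTU"
from_scratch = [window_hash(sample[i:i + 3]) for i in range(len(sample) - 3 + 1)]
print(rolling_hashes(sample, 3) == from_scratch)  # True
```

Computing every window hash from scratch costs O(m) per window, while the rolling version costs O(1) per window after the first one.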

---

## 🔢 Hash Function (Basic Version)

Use a polynomial hash:

```
hash(s) = (s[0]*base^(m-1) + s[1]*base^(m-2) + ... + s[m-1]*base^0) mod q
```

* `base` is a constant (like 256 for ASCII)
* `q` is a prime number to avoid large numbers and reduce collisions

---

## 🧪 Python Implementation

```python
def rabin_karp(text, pattern, base=256, q=101):
    """
    Rabin-Karp string matching using rolling hash.
    - base: size of character set (e.g., 256 for extended ASCII)
    - q: a prime number to mod the hash values
    Returns list of starting indices where pattern is found in text.
    """
    n = len(text)
    m = len(pattern)

    if m > n:
        return []

    pattern_hash = 0  # hash value for pattern
    text_hash = 0     # hash value for current text window
    h = 1             # high-order base multiplier (base^(m-1) % q)

    # Precompute h = pow(base, m-1) % q
    for _ in range(m - 1):
        h = (h * base) % q

    # Compute the hash value for pattern and first window of text
    for i in range(m):
        pattern_hash = (base * pattern_hash + ord(pattern[i])) % q
        text_hash = (base * text_hash + ord(text[i])) % q

    matches = []

    # Slide the pattern over text one by one
    for i in range(n - m + 1):
        # Check if the hash values match
        if pattern_hash == text_hash:
            # Double-check with direct character comparison
            if text[i:i + m] == pattern:
                matches.append(i)

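        # Compute the hash of the next window
        if i < n - m:
            # Drop the leftmost character, shift by base, append the next character.
            # Python's % always returns a non-negative result for a positive q,
            # so no extra sign correction is needed here.
            text_hash = (base * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % q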

    return matches
```

---

## ✅ Example Usage

```python
text = "the quick brown fox jumps over the lazy dog"
pattern = "over"
print("Pattern found at indices:", rabin_karp(text, pattern))
```

### Output:

```
Pattern found at indices: [26]
```

---

## ⏱️ Time Complexity

| Case         | Complexity |
| ------------ | ---------- |
| Average Case | O(n + m)   |
| Worst Case   | O(n \* m)  |

* `n`: length of text
* `m`: length of pattern
* The worst case happens when there are many hash collisions (or many real matches), because every hash hit triggers a character-by-character check.

---

## 📚 Advantages & Use Cases

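* **Efficient for multiple pattern matching**: hash each pattern once, keep the hashes in a set, and compare every window hash against that set (this works directly when the patterns have the same length).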
* Used in **plagiarism detection**, **virus signature scanning**, **dictionary matching**.
* Simple to adapt for **binary data**, **DNA**, and other non-textual sequences.
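
The multiple-pattern use case could look roughly like the sketch below. It assumes that all patterns have the same length, which is a simplification; the function name `rabin_karp_multi` and the larger modulus `1_000_003` are choices made for this example only.

```python
def rabin_karp_multi(text, patterns, base=256, q=1_000_003):
    """Find all occurrences of several equal-length patterns in one pass over text."""
    patterns = set(patterns)
    if not patterns:
        return {}
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes all patterns have the same length")
    m = lengths.pop()
    matches = {p: [] for p in patterns}
    if m == 0 or m > len(text):
        return matches

    def full_hash(s):
        value = 0
        for char in s:
            value = (value * base + ord(char)) % q
        return value

    # group patterns by hash so that each window needs a single dictionary lookup
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, m - 1, q)
    current = full_hash(text[:m])
    for i in range(len(text) - m + 1):
        for p in by_hash.get(current, ()):
            if text[i:i + m] == p:  # confirm, to rule out hash collisions
                matches[p].append(i)
        if i < len(text) - m:
            current = ((current - ord(text[i]) * high) * base + ord(text[i + m])) % q
    return matches

found = rabin_karp_multi("MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???", ["RTU", "MIT", "XYZ"])
print(found)
```

Here `found` maps `"RTU"` to `[23, 27, 31, 38]`, `"MIT"` to `[0]` and `"XYZ"` to an empty list. The text is scanned once no matter how many patterns are in the set.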

---

## ⚠️ Notes on Practical Use

* Choice of `q` (modulus) and `base` affects performance and collision rate.
* Real implementations may use **Rabin fingerprints**, **rolling hashes with primes**, or **double hashing** for reliability.
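
The effect of the modulus can be observed by counting spurious hits, that is, windows whose hash equals the pattern hash although the characters differ. The sketch below is an experiment rather than a benchmark; `spurious_hits`, the random DNA-like text and the two moduli are assumptions chosen for illustration.

```python
import random

def spurious_hits(text, pattern, base=256, q=101):
    """Count windows whose hash equals the pattern's hash although the text differs."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return 0

    def full_hash(s):
        value = 0
        for char in s:
            value = (value * base + ord(char)) % q
        return value

    target = full_hash(pattern)
    high = pow(base, m - 1, q)
    current = full_hash(text[:m])
    count = 0
    for i in range(len(text) - m + 1):
        if current == target and text[i:i + m] != pattern:
            count += 1  # hash hit that the character check would reject
        if i < len(text) - m:
            current = ((current - ord(text[i]) * high) * base + ord(text[i + m])) % q
    return count

rng = random.Random(42)
dna = "".join(rng.choice("ACGT") for _ in range(100_000))

small_q_hits = spurious_hits(dna, "GATTACA", q=101)
large_q_hits = spurious_hits(dna, "GATTACA", q=1_000_003)
print(f"q=101:       {small_q_hits} spurious hits")
print(f"q=1_000_003: {large_q_hits} spurious hits")
```

With only 101 possible hash values roughly one window in a hundred is expected to collide, and each collision costs an extra character-by-character comparison. A larger modulus makes such collisions far less likely.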




In [29]:
def rabin_karp(text, pattern, base=256, q=101):
    """
    Rabin-Karp string matching using rolling hash.
    - base: size of character set (e.g., 256 for extended ASCII)
    - q: a prime number to mod the hash values
    Returns list of starting indices where pattern is found in text.
    """
    n = len(text)
    m = len(pattern)

    if m > n:
        return []

    pattern_hash = 0  # hash value for pattern
    text_hash = 0     # hash value for current text window
    h = 1             # high-order base multiplier (base^(m-1) % q)

    # Precompute h = pow(base, m-1) % q
    for _ in range(m - 1):
        h = (h * base) % q

    # Compute the hash value for pattern and first window of text
    for i in range(m):
        pattern_hash = (base * pattern_hash + ord(pattern[i])) % q
        text_hash = (base * text_hash + ord(text[i])) % q

    matches = []

    # Slide the pattern over text one by one
    for i in range(n - m + 1):
        # Check if the hash values match
        if pattern_hash == text_hash:
            # Double-check with direct character comparison
            if text[i:i + m] == pattern:
                matches.append(i)

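        # Compute the hash of the next window
        if i < n - m:
            # Drop the leftmost character, shift by base, append the next character.
            # Python's % always returns a non-negative result for a positive q,
            # so no extra sign correction is needed here.
            text_hash = (base * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % q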

    return matches

# our needle
needle = "RTU"
print(f"Needle: {needle}")
print(f"Text: {text}")
print(f"Text length: {len(text)}")
print(f"Needle length: {len(needle)}")
print(f"Needle found at: {rabin_karp(text, needle)}")

Needle: RTU
Text: MIT HARVARD OXFORD RRRRRTU RTU RTUUUU RTU ???
Text length: 45
Needle length: 3
Needle found at: [23, 27, 31, 38]


## Conclusion

Boyer-Moore performs best when there are many potential "easy losses" - some character is present in the text but not in the pattern. 

KMP will be faster than brute force when the text and the pattern produce many long partial matches, for example when the pattern has many repeated characters.

Use Rabin-Karp when you need to search for multiple patterns in the same text. It is also useful when you need to search for a pattern in a large text with many repeated characters.

## Resources for String Matching Algorithms

I have checked the links, but some of them may break in the future. If you find a broken link, please let me know and I will fix it.
E-mail: valdis.saulespurens@rtu.lv

Second, the list mixes authoritative sources such as academic papers with less formal ones such as blog posts and articles. I have tried to mark each link with the type of resource it is.



| Algorithm       | Resource Name & Link                                                  | URL                                                         | Type                     | Description                                                                                         | Date           |
|----------------|------------------------------------------------------------------------|--------------------------------------------------------------|--------------------------|-----------------------------------------------------------------------------------------------------|----------------|
| Naïve          | [GeeksforGeeks: Naive Pattern Searching](https://www.geeksforgeeks.org/naive-algorithm-for-pattern-searching/) | https://www.geeksforgeeks.org/naive-algorithm-for-pattern-searching/ | Tutorial (article)       | Basic brute-force matching explained with code and examples.                                        | Apr 20, 2024   |
| Naïve          | [Exact String Matching Algorithms: Brute Force](https://www-igm.univ-mlv.fr/~lecroq/string/node3.html)         | https://www-igm.univ-mlv.fr/~lecroq/string/node3.html         | Academic resource        | Charras & Lecroq's comprehensive 1997 writeup with C code and algorithm description.                | Jan 14, 1997   |
| KMP            | [Dev.to: Understanding the KMP Algorithm](https://dev.to/yo-shi/for-beginners-understanding-the-kmp-algorithm-by-comparing-with-the-brute-force-1da3)           | https://dev.to/yo-shi/for-beginners-understanding-the-kmp-algorithm-by-comparing-with-the-brute-force-1da3     | Blog post                | Simple explanation of KMP with examples and comparisons to brute force.                             | Jan 4, 2025    |
| KMP            | [Virtual Labs: KMP Algorithm](https://ds2-iiith.vlabs.ac.in/exp/kmp-algorithm/index.html)                            | https://ds2-iiith.vlabs.ac.in/exp/kmp-algorithm/index.html          | Interactive tutorial     | Animation and quiz for learning KMP through simulations.                                            | —              |
| KMP            | [Knuth, Morris, Pratt (1977)](https://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knuth77.pdf)                                   | https://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knuth77.pdf                  | Academic paper           | The original paper introducing KMP, with formal proofs and linear time analysis.                    | 1977           |
| Boyer-Moore    | [Medium: Boyer-Moore String Search](https://medium.com/@siddharth.21/the-boyer-moore-string-search-algorithm-674906cab162)  | https://medium.com/@siddharth.21/the-boyer-moore-string-search-algorithm-674906cab162 | Blog post      | Explanation of both bad-character and good-suffix rules with Python code.                          | Oct 19, 2021   |
| Boyer-Moore    | [UBC Visualization: Boyer-Moore Search](https://cmps-people.ok.ubc.ca/ylucet/DS/BoyerMoore.html)                  | https://cmps-people.ok.ubc.ca/ylucet/DS/BoyerMoore.html            | Interactive tool         | Step-by-step animation of Boyer-Moore alignment and pattern shifts.                                | —              |
| Boyer-Moore    | [Boyer & Moore (1977)](https://www.cs.utexas.edu/~moore/publications/fstrpos.pdf)                                         | https://www.cs.utexas.edu/~moore/publications/fstrpos.pdf                  | Academic paper           | Foundational work that introduced Boyer-Moore and sublinear average-case search.                   | 1977           |
| Rabin-Karp     | [GeeksforGeeks: Rabin-Karp Algorithm](https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/) | https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/ | Tutorial (article) | Clear step-by-step example of Rabin-Karp’s rolling hash technique.                                 | Feb 26, 2025   |
| Rabin-Karp     | *Introduction to Algorithms* (Cormen et al.)                           | https://mitpress.mit.edu/9780262046305/                      | Textbook (paid)          | Covers Rabin-Karp and compares it with other algorithms in detail.                                 | 2022           |
| Rabin-Karp     | [Karp & Rabin (1987)](https://citeseerx.ist.psu.edu/document?doi=c47d151f09c567013761632c89e237431c6291a2)                                              | https://citeseerx.ist.psu.edu/document?doi=c47d151f09c567013761632c89e237431c6291a2                    | Academic paper           | The original randomized pattern-matching paper for multiple patterns.                               | 1987           |
| Aho-Corasick   | [Aho & Corasick (1975)](https://cr.yp.to/bib/1975/aho.pdf)                                         | https://cr.yp.to/bib/1975/aho.pdf                  | Academic paper           | Introduced the multi-pattern matching algorithm using tries and failure transitions.                | 1975           |
| Suffix Trees   | [Ukkonen (1995)](https://link.springer.com/article/10.1007/BF01206331)                                         | https://link.springer.com/article/10.1007/BF01206331          | Academic paper           | Describes the first linear-time online suffix tree construction algorithm.                          | 1995           |
| Levenshtein    | [Levenshtein (1966)](https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf) | https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf | Academic paper | Defines the edit distance and lays the foundation for approximate matching and corrections.         | 1966           |
