# String Searching Algorithms

String matching algorithms find all occurrences of a given pattern string within a text string. They have a wide range of applications, including text editing, data compression, and search engines.

There are several different string matching algorithms, each with its own advantages and disadvantages. Some of the most commonly used algorithms include:

1. Naive String Matching: This algorithm checks every substring of the text against the pattern, which is the simplest approach but also the least efficient.

2. Knuth-Morris-Pratt (KMP) Algorithm: This algorithm uses the prefix function to avoid unnecessary comparisons of characters in the text string and the pattern.

3. Boyer-Moore Algorithm: This algorithm uses two preprocessing steps to speed up the matching process: a bad character rule and a good suffix rule.

4. Rabin-Karp Algorithm: This algorithm uses hashing to check if the pattern matches a substring of the text string.

5. Finite Automaton Algorithm: This algorithm uses a finite state machine to recognize the pattern in the text string.

These algorithms have different time and space complexities, and the choice of which algorithm to use depends on the specific requirements of the application.

## Practical Applications

* Search engines: Search engines use string matching algorithms to find relevant web pages based on the user's query.

* Virus scanners: Virus scanners use string matching algorithms to detect malicious code in files by searching for known virus signatures.

* Data compression: Compression algorithms use string matching algorithms to identify repeated patterns in the data, which can then be replaced with a shorter code.

* Text editors: Text editors use string matching algorithms to implement search and replace functionality.

* Natural language processing: String matching algorithms are used in natural language processing tasks such as named entity recognition, where a specific pattern of words or characters is matched to identify named entities such as people, organizations, and locations.

* DNA sequence analysis: String matching algorithms are used in bioinformatics to search for specific patterns in DNA sequences.

## Naive String Matching

The naive string matching algorithm is the simplest string matching algorithm. It checks every substring of the text that has the same length as the pattern against the pattern string. If the substring equals the pattern, the index of its first character is added to the list of matches. Either way, the algorithm then continues with the next substring.


### Naive String Matching Algorithm explanation

The brute_string_matcher function shown further below takes two input strings, text and pattern, and returns a list of the starting indices of all occurrences of pattern in text.

The function first initializes the lengths of the text and pattern, and creates an empty list occurrences to store the indices of each occurrence of pattern in text.

It then iterates through every possible starting index of a substring of text that is the same length as pattern. For each starting index i, it checks whether the substring of text starting at index i and with length m (i.e., the same length as pattern) is equal to pattern. If it is, then it appends i to the occurrences list.

Finally, the function returns the occurrences list containing the indices of each occurrence of pattern in text.
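
The steps above rely on comparing whole substrings. A fully naive variant compares one character at a time instead; the following is a minimal sketch of that idea. The name naive_charwise_matcher is made up for this illustration and is not used elsewhere in the notebook.

```python
def naive_charwise_matcher(text, pattern):
    """Return start indices of all occurrences of pattern in text (hypothetical helper)."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):          # every possible alignment of pattern in text
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # extend the match one character at a time
        if j == m:                      # all m characters matched
            occurrences.append(i)
    return occurrences

print(naive_charwise_matcher("abababa", "aba"))  # [0, 2, 4]
```

Overlapping occurrences are reported because every alignment is tried, regardless of earlier matches.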

In [1]:
text = "Riga Rocks RBS Rocking as well also RBS is great plus we have RBS at home"
pattern = "RBS"
# Python built-ins
text.find(pattern), text.rfind(pattern) # first and last occurrence; if you need every occurrence you have to write some more code

(11, 62)

In [2]:
# we can also do an existence check
print(f"Is there NEEDLE: {pattern} in our haystack? {pattern in text}")

Is there NEEDLE: RBS in our haystack? True


In [13]:
# let's return indexes of all occurrences of pattern in text
def find_all_naive(text, pattern):
    indexes = []
    i = 0
    while i < len(text): # if i matches or goes over len(text) means we reached the end of the string
        i = text.find(pattern, i) # find returns -1 if pattern is not found
        # so it is actually not completely naive because we are using Python's find which has optimizations
        if i == -1:
            break
        indexes.append(i)
        i += 1
    return indexes

indexes = find_all_naive(text, pattern)
indexes

[11, 36, 62]

In [4]:
text.find("willnotexist")

-1

In [14]:
def find_all_indexes_using_built_in_find(text, pattern):
    occurrences = []
    find_index = text.find(pattern) # tiny optimization over the previous approach: we start from the first find instead of index 0
    while find_index >= 0:
        occurrences.append(find_index)
        find_index = text.find(pattern, find_index + 1) # keep searching from just past the last match
    return occurrences

In [7]:
find_all_indexes_using_built_in_find(text,pattern)

[11, 36, 62]

In [11]:
## Implementation of naive string matching

def brute_string_matcher(text, pattern):
    n = len(text) # n is length of our text/haystack
    m = len(pattern) # m is how long our needle/pattern is
    occurrences = [] # storage of found patterns

    for i in range(n - m + 1): # +1 so that the last start index tried is n - m
        if text[i:i + m] == pattern: # the slice comparison runs in C, so it is fairly efficient
        # to make it completely naive we would need to compare each character one by one
            occurrences.append(i)
            # here if we did not want all matches we could just return immediately/early with i

    return occurrences

# so this has O(n*m) complexity in the worst case


In [12]:
brute_string_matcher(text, pattern)

[11, 36, 62]

In [None]:
# again, practical advice - use whatever is built into your language first,
# even then you can use those basic tools (such as in, find, rfind, index, rindex in Python) to make your case
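
Another built-in route is the re module from the standard library. The sketch below is one possible way to collect all start indices with it; the helper name find_all_regex and its overlapping flag are assumptions made for this example.

```python
import re

def find_all_regex(text, pattern, overlapping=True):
    """Start indices of pattern in text via the re module (illustrative helper)."""
    literal = re.escape(pattern)            # treat the pattern as plain text, not as a regex
    if overlapping:
        literal = f"(?={literal})"          # zero-width lookahead also finds overlapping matches
    return [match.start() for match in re.finditer(literal, text)]

print(find_all_regex("abababa", "aba"))                      # [0, 2, 4]
print(find_all_regex("abababa", "aba", overlapping=False))   # [0, 4]
```

Without the lookahead, re.finditer resumes after the end of each match, so overlapping occurrences are skipped.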

## Knuth-Morris-Pratt (KMP) Algorithm

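The idea behind the KMP algorithm is to never re-examine a text character that has already been matched. When a mismatch occurs after some characters of the pattern have matched, we already know what those text characters were, so instead of moving back in the text we shift the pattern and continue matching from the current text position.

We perform pre-processing on the pattern string to create a prefix function. For each position in the pattern, it stores the length of the longest proper prefix of the pattern that is also a suffix of the part matched so far, which tells us how many characters of the pattern can be kept when a mismatch occurs.

This prefix function essentially plays the role of the transition table of a DFA (Deterministic Finite Automaton) in compact form: it tells us which state to fall back to when a mismatch occurs.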
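
As a small worked example of what the prefix function contains, the sketch below computes it for the pattern ababaca. Both the pattern and the helper name prefix_function are chosen only for this illustration.

```python
def prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1] that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0                                   # length of the currently matched prefix
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to the next shorter candidate prefix
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For instance, pi[4] = 3 means that after matching ababa, a mismatch on the next character still leaves aba matched, so matching can resume at pattern index 3 without moving back in the text.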

In [10]:
# let's implement KMP algorithm

def compute_prefix_function(pattern):
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern, pi=None):
    n = len(text)
    m = len(pattern)
    if pi is None:
        pi = compute_prefix_function(pattern)
    q = 0
    occurrences = []
    for i in range(n): # a single pass over the text; q falls back at most as often as it advanced, so O(n) overall
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            occurrences.append(i - m + 1)
            q = pi[q - 1]
    return occurrences

# test it
kmp_matcher(text, pattern)

[11, 36, 62]

In [15]:
# let's make our text 1000 times bigger
text1k = text * 1000
# our pattern will be our text itself so it is 73 characters long
pattern1k = text
# so length of n
print(f"length of text(n): {len(text1k)}, pattern(m): {len(pattern1k)}")

length of text(n): 73000, pattern(m): 73


In [16]:
%%timeit
brute_string_matcher(text1k, pattern1k)

9.59 ms ± 566 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%%timeit
kmp_matcher(text1k, pattern1k)
# so kmp_matcher is actually slower than my brute_string_matcher

22.3 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
# let's also check the naive one from the beginning
%%timeit
find_all_naive(text1k, pattern1k)
# why is this one fastest?
# because we are actually utilizing find inside it
# and find is highly optimized already
# so naive is actually not so naive, it relies on the built-in find

# in CPython find is implemented in C, with specialised fast paths for some needle lengths

601 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Testing with needle at the very end only

In [28]:
needle = "Tuesday2025Algorithms"
# let's create a new text by putting this needle at the very end
text1k_augmented = text1k + needle
# how long is text1k_augmented and needle
print(f"length of text(n): {len(text1k_augmented)}, pattern(m): {len(needle)}")

length of text(n): 73021, pattern(m): 21


In [32]:
# let's check that all 4 approaches find this needle
brute_index = brute_string_matcher(text1k_augmented, needle)
kmp_index = kmp_matcher(text1k_augmented, needle)
naive_index = find_all_naive(text1k_augmented, needle)
built_in_index = text1k_augmented.find(needle)
# print all 4 indexes
print(f"brute_index: {brute_index}, kmp_index: {kmp_index}, naive_index: {naive_index}, built_in_index {built_in_index} ")

brute_index: [73000], kmp_index: [73000], naive_index: [73000], built_in_index 73000 


In [33]:
# let's time all of these
%%timeit
brute_string_matcher(text1k_augmented, needle)
#

15.3 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [34]:
# now kmp
%%timeit
kmp_matcher(text1k_augmented, needle)
#

6.3 ms ± 91.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [35]:
# now naive
%%timeit
find_all_naive(text1k_augmented, needle)
#

33.6 µs ± 909 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [36]:
# now built in
%%timeit
text1k_augmented.find(needle)

31.4 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [None]:
# so why is naive so close to built in?
# because if you check find_all_naive,
# in this case it ends up making just two calls to the same find function that we used in built_in
# (one that finds the needle and one that returns -1),
# so there is only a bit of overhead for creating a list to store the results, otherwise it is identical

### Preliminary Results

So our naive match actually beats KMP? Why? One candidate explanation is that kmp_matcher recomputes the prefix function on every call. We can precompute the prefix function once and pass it in when matching the pattern against the text.

In [19]:
# now let's precompute prefix function
pi = compute_prefix_function(pattern1k)

In [20]:
%%timeit
kmp_matcher(text1k, pattern1k, pi)

12.6 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
results = kmp_matcher(text1k, pattern1k, pi)
len(results)

1000

In [22]:
results_naive = brute_string_matcher(text1k, pattern1k)
len(results_naive)

1000

In [23]:
# first 5 results
results[:5], results_naive[:5]

([0, 73, 146, 219, 292], [0, 73, 146, 219, 292])

In [None]:
# last 5 results
results[-5:], results_naive[-5:]

([72635, 72708, 72781, 72854, 72927], [72635, 72708, 72781, 72854, 72927])

In [24]:
# start of text1k
text1k[:100]

'Riga Rocks RBS Rocking as well also RBS is great plus we have RBS at homeRiga Rocks RBS Rocking as w'

In [25]:
# end of text1k
text1k[-100:]

'at plus we have RBS at homeRiga Rocks RBS Rocking as well also RBS is great plus we have RBS at home'

### More conclusions

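Our naive match is still faster than KMP, even with the prefix function precomputed. Why?
One hypothesis is that the slice comparison with == runs in optimized C code, whereas kmp_matcher compares one character at a time in interpreted Python.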
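
One way to probe this hypothesis is to time the two kinds of comparison in isolation. The following is a rough sketch using the standard timeit module; the sample strings and the helper names slice_equal and loop_equal are assumptions made for this example, and the absolute numbers will vary by machine.

```python
import timeit

sample_text = "Riga" * 1000
sample_pattern = "Riga" * 100

def slice_equal(text, pattern, i):
    return text[i:i + len(pattern)] == pattern           # the comparison itself runs in C

def loop_equal(text, pattern, i):
    for j in range(len(pattern)):                        # the comparison runs in Python bytecode
        if text[i + j] != pattern[j]:
            return False
    return True

slice_time = timeit.timeit(lambda: slice_equal(sample_text, sample_pattern, 0), number=2000)
loop_time = timeit.timeit(lambda: loop_equal(sample_text, sample_pattern, 0), number=2000)
print(f"slice ==: {slice_time:.4f} s, character loop: {loop_time:.4f} s")
```

Both helpers return the same answer; only the place where the character comparisons happen differs.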

### Needle matters!

Our needle "Riga Rocks RBS Rocking as well also RBS is great plus we have RBS at home" produces a mismatch very quickly even with the naive match: every R in the text is followed by i, o, or B, so we only need to check R or Ri at most, that is, no more than 2 characters per alignment.
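
To see how strongly the needle affects the amount of work, one can count character comparisons instead of measuring time. The sketch below does this for a deliberately awkward input (a text of repeated a and the pattern aaab); the two counting helpers are hypothetical and exist only for this illustration.

```python
def count_naive_comparisons(text, pattern):
    """Character comparisons made by a character-by-character naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons

def count_kmp_comparisons(text, pattern):
    """Character comparisons made by KMP during the search (pattern must be non-empty)."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):                       # standard prefix function
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    comparisons = 0
    q = 0
    for ch in text:
        while True:
            comparisons += 1
            if pattern[q] == ch:
                q += 1
                break
            if q == 0:
                break
            q = pi[q - 1]                       # fall back and compare again
        if q == m:
            q = pi[q - 1]
    return comparisons

awkward_text = "a" * 20
awkward_pattern = "aaab"
print(count_naive_comparisons(awkward_text, awkward_pattern))  # 68
print(count_kmp_comparisons(awkward_text, awkward_pattern))    # 37
```

The naive count grows with (n - m + 1) * m on inputs like this, while the KMP count stays below 2n.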

In [None]:
# so our pattern will be "Riga" * 100 + "X"
patternX = "Riga" * 100 + "X"
# our text will be ("Riga" * 1000 + "X") repeated 100 times
textX = ("Riga" * 1000 + "X") * 100
# lengths
print(f"length of text(n): {len(textX)}, pattern(m): {len(patternX)}")

length of text(n): 400100, pattern(m): 401


In [None]:
# test results
results = kmp_matcher(textX, patternX)
len(results)

100

In [None]:
# naive
results_naive = brute_string_matcher(textX, patternX)
len(results_naive)

100

In [None]:
# let's precompute the prefix function
pi = compute_prefix_function(patternX)


In [None]:
results = kmp_matcher(textX, patternX, pi)
len(results)

100

In [None]:
%%timeit
brute_string_matcher(textX, patternX)

85.3 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
%%timeit
kmp_matcher(textX, patternX, pi)

108 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Random text creation

In [43]:
# let's create a function that returns some random text from some alphabet
import random
import string # for ascii
alpha = string.ascii_lowercase+string.ascii_uppercase+string.digits+string.punctuation+" "
def gen_random_text(n, alphabet=alpha, seed=None):
    if seed is not None:
        random.seed(seed)
    return ''.join(random.choice(alphabet) for _ in range(n))

# test with 100
text = gen_random_text(100, seed=2025)
text

"*k?9w&aVW+DiWZmfpBWm0g;d)$iyyd{D>Y6T%d/o(zApF:eor)Vl.6J4,w~P&5'6Z_j+3@.dQ;FSD(&/NuJQLUj*s\\J(J+>saS~<"

### Library of Babel

https://en.wikipedia.org/wiki/The_Library_of_Babel

![Babel](https://upload.wikimedia.org/wikipedia/en/a/ae/The_library_of_babel_-_bookcover.jpg)

<iframe width="560" height="315" src="https://www.youtube.com/embed/no_elVGGgW8?si=5tmekCUwWoo-JPwu" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In [45]:
# so lets gen 10M text
text10M = gen_random_text(10**7, seed=2025)
pattern100 = gen_random_text(10**2,seed=2025)

# lengths
print(f"length of text(n): {len(text10M)}, pattern(m): {len(pattern100)}")
print(f"Is pattern in text? {pattern100 in text10M}")
#

length of text(n): 10000000, pattern(m): 100
Is pattern in text? True


In [47]:
# where can we find this pattern?
text10M.find(pattern100)

0

In [48]:
# of course same seed so same start....
print(f"First 120 characters of text10M: {text10M[:120]}")
print(f"First 120 characters of pattern100: {pattern100[:120]}")
# not shocking that same seed produces same values....

First 120 characters of text10M: *k?9w&aVW+DiWZmfpBWm0g;d)$iyyd{D>Y6T%d/o(zApF:eor)Vl.6J4,w~P&5'6Z_j+3@.dQ;FSD(&/NuJQLUj*s\J(J+>saS~<).(),t#XR$FN.d#,0^eA
First 120 characters of pattern100: *k?9w&aVW+DiWZmfpBWm0g;d)$iyyd{D>Y6T%d/o(zApF:eor)Vl.6J4,w~P&5'6Z_j+3@.dQ;FSD(&/NuJQLUj*s\J(J+>saS~<


In [None]:
# so let's try something else
# TODO explore seeds to find a non-trivial match: one seed for generating text10M and a different one for pattern100, such that pattern100 is still found at least once
# we could do this by generating text10M once
# and then trying different seeds for pattern100
# however, first we should ask: what is the probability that we find a match at all?
# my gut feeling is that the probability is very, very low - meaning 0.000000000000....00001
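
The gut feeling can be checked with a short calculation. Assuming every character is drawn independently and uniformly from the alphabet, the expected number of occurrences of a fixed pattern of length m in a random text of length n is (n - m + 1) / |alphabet|^m. The sketch below works in log10 to avoid floating-point underflow; the function name is an assumption of this example.

```python
import math
import string

# Same construction as `alpha` in the notebook: 95 printable characters.
alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits + string.punctuation + " "

def log10_expected_matches(n, m, alphabet_size):
    """log10 of the expected number of occurrences of a fixed length-m pattern
    in a uniformly random text of length n (assumes independent characters)."""
    positions = n - m + 1                       # possible start indices
    return math.log10(positions) - m * math.log10(alphabet_size)

print(len(alphabet))                                          # 95
print(log10_expected_matches(10**7, 100, len(alphabet)))      # roughly -190.8
print(log10_expected_matches(10**7, 10, len(alphabet)))       # roughly -12.8
```

Under these assumptions even a 10-character random pattern is very unlikely to appear in 10M random characters, which is why sampling the pattern from the text itself is the practical way to guarantee a match.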

In [None]:
# let's do smaller pattern
pattern10 = gen_random_text(10, seed=1)
print(f"Mini 10 character pattern: {pattern10}")
#

In [49]:
# so the idea from class is to just pick a random index in the 10M text and take a 100-character pattern from there
random_index = random.randint(0, len(text10M)-100)
print(f"Random index: {random_index}")

Random index: 9191496


In [50]:
# let's get our needle from there
pattern100 = text10M[random_index:random_index+100]
# how many times it occurs in our text
print(f"How many times it occurs in our text: {text10M.count(pattern100)}")

How many times it occurs in our text: 1


In [51]:
# now let's test it
results = kmp_matcher(text10M, pattern100)
len(results), results

(1, [9191496])

In [52]:
naive_results = brute_string_matcher(text10M, pattern100)
len(naive_results)

1

In [None]:
# we are allowed to precompute
pi = compute_prefix_function(pattern100)
# so jumps are all 0, meaning we do not gain anything from any recurrence within the pattern
pi

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [None]:
%%timeit
brute_string_matcher(text10M, pattern100)

1.91 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
kmp_matcher(text10M, pattern100, pi)

1.69 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Boyer-Moore Algorithm

The Boyer-Moore algorithm is a string matching algorithm that uses two preprocessing steps to speed up the matching process: a bad character rule and a good suffix rule.

The bad character rule examines the character in text that caused the mismatch with pattern. The pattern is shifted to the right so that its rightmost occurrence of that character (to the left of the mismatch position) lines up with the bad character in text; if the character does not occur in pattern at all, the pattern can be shifted entirely past it. This is safe because any smaller shift would place a different pattern character under the bad character and therefore could not produce a match.

The good suffix rule applies when some suffix of pattern has already matched the text before the mismatch. The pattern is shifted so that another occurrence of that matched suffix within pattern (or, failing that, the longest prefix of pattern that is also a suffix of the matched part) lines up with the matched text. Shifts that do not realign the matched part can be skipped because they could not produce a match.

The Boyer-Moore algorithm combines these two rules to determine the best shift to make after a mismatch occurs. Specifically, it chooses the larger of the shifts suggested by the bad character rule and the good suffix rule. This means that the algorithm skips over as many characters as possible in text before attempting another match.

In practice, the Boyer-Moore algorithm is often faster than other string matching algorithms, particularly when the pattern is long and the alphabet is large, because mismatches then allow long shifts. The advantage shrinks when the pattern is short or the alphabet is small, since shifts are bounded by the pattern length, and simplified variants that use only the bad character rule can degrade to O(n*m) in the worst case.
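
As an illustration of the preprocessing, the sketch below builds a simplified bad character table of the kind used by the implementation in this section. Using a dict instead of a fixed 256-entry list is a choice made for this example so that any character can be looked up; the helper name and the sample pattern are likewise assumptions.

```python
def bad_character_table(pattern):
    """Map each character to its shift: the distance from its last occurrence
    (ignoring the final pattern character) to the end of the pattern."""
    m = len(pattern)
    table = {}
    for i in range(m - 1):
        table[pattern[i]] = m - i - 1       # later occurrences overwrite earlier ones
    return table

example_pattern = "sleeping dog"
table = bad_character_table(example_pattern)
for ch in "dongw":
    # characters that are missing from the table shift by the full pattern length
    print(repr(ch), table.get(ch, len(example_pattern)))
```

Characters near the end of the pattern get small shifts, while characters that never occur in the pattern allow a jump of the full pattern length.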

In [None]:
def boyer_moore(text, pattern, debug=False):
    n = len(text)
    m = len(pattern)
    if m == 0:
        return 0

    # Build the bad character skip table (assumes character codes below 256, e.g. ASCII text)
    skip = [m] * 256
    for i in range(m - 1):
        skip[ord(pattern[i])] = m - i - 1

    # Search for the pattern in the text
    i = m - 1
    while i < n:
        j = m - 1
        while text[i] == pattern[j]:
            if debug:
                print(f"Potential match at location {i} letter {text[i]}")
            if j == 0:
                return i # we could modify this to collect indexes
            i -= 1
            j -= 1
        i += max(skip[ord(text[i])], m - j)
        if debug:
            print(f"skipping to position {i}")

    return -1

In [None]:
# we will modify Boyer-Moore to return all indexes
def boyer_moore_all(text, pattern, debug=False):
    n = len(text)
    m = len(pattern)
    if m == 0:
        return []

    # Build the bad character skip table (assumes character codes below 256, e.g. ASCII text)
    skip = [m] * 256
    for i in range(m - 1):
        skip[ord(pattern[i])] = m - i - 1

    # Search for the pattern in the text
    i = m - 1
    indexes = []
    while i < n:
        j = m - 1
        while text[i] == pattern[j]:
            if debug:
                print(f"Potential match at location {i} letter {text[i]}")
            if j == 0:
                indexes.append(i)
                break
            i -= 1
            j -= 1
        i += max(skip[ord(text[i])], m - j)
        if debug:
            print(f"skipping to position {i}")

    return indexes

# test it on Riga Riga Riga Rocks Riga
text = "Good old Riga Riga Riga Rocks Riga"
pattern = "Riga"
boyer_moore_all(text, pattern)

[9, 14, 19, 30]

In [None]:
# print(text, pattern)
# print lengths
print(f"length of text(n): {len(text)}, pattern(m): {len(pattern)}")
boyer_moore(text, pattern)

length of text(n): 100, pattern(m): 3


-1

In [None]:
# let's try the 10M text
boyer_moore(text10M, pattern100)

-1

In [None]:
%%timeit
boyer_moore(text10M, pattern100)

80.2 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Comments on Boyer-Moore results

We can see that Boyer-Moore beats both naive and KMP here. The implementation above uses only a bad character skip table (it does not implement the good suffix rule), but that is already enough to skip over most of the text: with a long pattern over a large random alphabet, most mismatches allow a shift of many characters, whereas naive and KMP examine every position in the text.

Of course, this was not a completely fair comparison because naive and KMP collect all matches, while this Boyer-Moore version returns at the first match.

However, for these particular 10M characters the recorded call returned -1, meaning no match was found for the pattern, so the comparison is still valid.

In [None]:
%%timeit
boyer_moore_all(text10M, pattern100)

95.5 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Final Conclusions on Boyer-Moore

We see that even after adding the match collection to Boyer-Moore, it is still faster than Naive and KMP.

In [None]:
boyer_moore("abra dabra cadabra", "cada")

11

In [None]:
boyer_moore("a quick brown fox jumped over a sleeping dog and ate it", "sleeping dog", debug=True)

skipping to position 23
skipping to position 25
skipping to position 26
skipping to position 38
skipping to position 43
Potential match at location 43 letter g
Potential match at location 42 letter o
Potential match at location 41 letter d
Potential match at location 40 letter  
Potential match at location 39 letter g
Potential match at location 38 letter n
Potential match at location 37 letter i
Potential match at location 36 letter p
Potential match at location 35 letter e
Potential match at location 34 letter e
Potential match at location 33 letter l
Potential match at location 32 letter s


32

## Boyer-Moore efficiency

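So Boyer-Moore is quite efficient in real-life applications where the pattern is long: in the best case it inspects only about n/m characters of the text, i.e. Omega(n/m). The simplified version shown here can still take O(n*m) time in the worst case.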
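
Both extremes can be observed by counting how many text characters get inspected. The sketch below mirrors the skip logic of the functions above but uses a dict for the skip table; the helper name count_inspections and the two test inputs are assumptions made for this illustration.

```python
def count_inspections(text, pattern):
    """Count text characters inspected by a bad-character-only search (pattern non-empty)."""
    n, m = len(text), len(pattern)
    skip = {}
    for k in range(m - 1):
        skip[pattern[k]] = m - k - 1
    inspected = 0
    i = m - 1                                   # text index aligned with the pattern's last character
    while i < n:
        j = m - 1
        while True:
            inspected += 1
            if text[i] != pattern[j]:
                break                           # mismatch
            if j == 0:
                break                           # full match
            i -= 1
            j -= 1
        i += max(skip.get(text[i], m), m - j)   # always moves the window forward
    return inspected

# Best case: the inspected character never occurs in the pattern, so every shift is m.
print(count_inspections("a" * 1000, "b" * 10))        # 100, i.e. n / m
# Bad case: nine characters match before every mismatch and the window moves by one.
print(count_inspections("a" * 1000, "b" + "a" * 9))   # 9910
```

In the first call only every tenth character is looked at; in the second, almost every character is inspected ten times.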

### Boyer-Moore Algorithm explanation

The boyer_moore function takes two input strings text and pattern, and returns the index of the first occurrence of pattern in text. If pattern does not occur in text, the function returns -1.

The function first initializes the lengths of the text and pattern, and checks if the length of pattern is zero. If it is, it returns 0 (indicating that pattern occurs at the beginning of text).

It then initializes a skip table skip that stores the number of characters to skip when a mismatch occurs, based on the text character that caused the mismatch. The skip table is initialized with the value m for each of the 256 character codes it covers, so the function assumes text whose character codes are below 256 (for example ASCII).

Next, the function iterates through the first m - 1 characters of pattern and sets the entry for each character to m - i - 1, the distance from that character's position i to the end of pattern.

Finally, the function searches for pattern in text. It aligns the end of pattern with index m - 1 of text and compares characters from right to left until a mismatch is found. If the comparison reaches the first character of pattern and it also matches, the whole pattern has matched and the function returns the current index in text, which is the start of the occurrence. Otherwise, the function jumps ahead in text by the maximum of the skip value for the mismatched text character and m - j, where j is the pattern position of the mismatch; the m - j term guarantees that the pattern always moves forward by at least one position.

If pattern is not found in text, the function returns -1.

## Other String Matching Algorithms

Other string matching algorithms include the Rabin-Karp algorithm and the finite automaton algorithm. Like the algorithms above, they have different time and space complexities, and the choice of which algorithm to use depends on the specific requirements of the application.
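
To give a flavour of the hashing approach, here is a minimal Rabin-Karp sketch with a rolling polynomial hash. The base and modulus are arbitrary choices made for this illustration, and returning an empty list for an empty pattern is likewise an assumption rather than a convention from this notebook.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return start indices of all occurrences of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)            # weight of the window's leading character
    p_hash = t_hash = 0
    for k in range(m):                          # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[k])) % modulus
        t_hash = (t_hash * base + ord(text[k])) % modulus
    occurrences = []
    for i in range(n - m + 1):
        # equal hashes can be a collision, so confirm with a real comparison
        if p_hash == t_hash and text[i:i + m] == pattern:
            occurrences.append(i)
        if i < n - m:                           # roll the window one character to the right
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % modulus
    return occurrences

print(rabin_karp("Good old Riga Riga Riga Rocks Riga", "Riga"))  # [9, 14, 19, 30]
```

The hash comparison filters out most alignments cheaply, and the explicit slice comparison keeps the result correct even when two different substrings share a hash value.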

## TODO: Which algorithm do Python's find and index use?