# String Matching Algorithms

## Description

### String Matching

String matching is a fundamental problem in computer science: finding where a given pattern (the needle) occurs in a text (the target). The pattern can be a literal string or, more generally, a regular expression; this notebook covers literal patterns only.

In [5]:
# Python strings already provide find and index methods
# (find returns -1 when the pattern is absent, index raises ValueError)
text = "RRRRTTRUUU RTUUUUURRTUUUUU"
text.find("RTU"), text.index("RTU")

(11, 11)

In [6]:
def find_needle(text, needle):
    '''
    Brute force search for needle in text
    '''
    # no match can start after index len(text) - len(needle)
    for i in range(len(text) - len(needle) + 1):
        # python slicing makes this easy
        if text[i:i+len(needle)] == needle: 
            return i
    return -1

find_needle(text, "RTU")

11

In [7]:
# let's rewrite the above function without slicing
def find_needle_no_slice(text, needle):
    '''
    Brute force search for needle in text
    Not pythonic implementation - more like C
    '''
    for i in range(len(text)-len(needle)+1):
        # compare character by character instead of slicing
        for j in range(len(needle)):
            if text[i+j] != needle[j]:
                break
        else: # this else is for the for loop, not the if
            return i # if we didn't break out of the loop, we found the needle
    return -1

find_needle_no_slice(text, "RTU")

11

## Complexity of brute force algorithm

The brute force algorithm is the simplest approach to string matching. It slides the pattern over the text one position at a time and compares the pattern with the text at each position. In the worst case each of the roughly n alignments needs up to m character comparisons, so the time complexity is O(mn), where m is the length of the pattern and n is the length of the text.
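To make the O(mn) bound concrete, here is a minimal sketch that counts character comparisons during a brute-force scan. The helper name `count_comparisons` and the two test inputs are assumptions chosen for illustration; they are not part of the notebook's own cells.

```python
def count_comparisons(text, needle):
    """Brute-force search that also counts character comparisons.

    Returns (index, comparisons); index is -1 when needle is absent.
    """
    comparisons = 0
    n, m = len(text), len(needle)
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != needle[j]:
                break
        else:
            # inner loop finished without a mismatch: match at i
            return i, comparisons
    return -1, comparisons


# Near-worst case: every alignment matches 9 characters before failing.
worst = count_comparisons("a" * 1000, "a" * 9 + "b")
# Friendly case: every alignment fails on its first character.
easy = count_comparisons("a" * 1000, "b" * 10)
print(worst, easy)
```

With n = 1000 and m = 10 there are 991 alignments, so the first call needs about m times as many comparisons as the second.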

In [9]:
# 6 in binary is 110
bin(3),bin(6), bin(9),bin(12)

('0b11', '0b110', '0b1001', '0b1100')

## Knuth-Morris-Pratt Algorithm

### Description

Big idea: when a mismatch occurs after several characters have matched, we already know what those text characters are, because they equal a prefix of the pattern. We can use that knowledge to avoid comparing them again.

In KMP we precompute a prefix table for the pattern, also called the failure function or the prefix function. Entry i holds the length of the longest proper prefix of pattern[0..i] that is also a suffix of it.
On a mismatch, the table tells us how far the pattern can shift while keeping the already-matched characters aligned, so the search never moves backwards in the text.

We could also build a DFA using this information.

DFA - Deterministic Finite Automaton
https://en.wikipedia.org/wiki/Deterministic_finite_automaton
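
One way to realize the DFA idea is sketched below. It follows the common textbook construction in which state j means "the first j characters of the needle have matched". The names `build_dfa` and `dfa_search`, and the choice to derive the alphabet from the inputs, are assumptions made for this example rather than anything defined in the notebook.

```python
def build_dfa(needle, alphabet):
    """Build a transition table: dfa[state][char] -> next state."""
    m = len(needle)
    dfa = [{c: 0 for c in alphabet} for _ in range(m)]
    dfa[0][needle[0]] = 1
    x = 0  # restart state: where the DFA would be after needle[1:j]
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]      # mismatch: behave like the restart state
        dfa[j][needle[j]] = j + 1      # match: advance to the next state
        x = dfa[x][needle[j]]          # update the restart state
    return dfa


def dfa_search(text, needle):
    """Return the index of the first match of needle in text, or -1."""
    if not needle:
        return 0
    alphabet = set(text) | set(needle)
    dfa = build_dfa(needle, alphabet)
    state = 0
    for i, c in enumerate(text):
        state = dfa[state][c]          # exactly one table lookup per text character
        if state == len(needle):
            return i - len(needle) + 1
    return -1


print(dfa_search("RRRRTTRUUU RTUUUUURRTUUUUU", "RTU"))
```

The trade-off is memory: the table has one entry per (state, character) pair, so it grows with both the needle length and the alphabet size, whereas the prefix table needs only one integer per needle character.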

### Complexity

Time complexity: O(m+n), where m is the length of the pattern and n is the length of the text: O(m) to build the table plus O(n) to scan the text.

In [10]:
def kmp_table(needle):
    '''
    Build the KMP table for the needle
    '''
    # initialize table
    table = [0]*len(needle)
    i = 1
    j = 0
    # we start at 1 because table[0] is always 0
    while i < len(needle): 
        if needle[i] == needle[j]:
            table[i] = j+1
            i += 1
            j += 1
        elif j > 0:
            j = table[j-1]
        else:
            i += 1
    return table

kmp_table("RTU")

[0, 0, 0]

In [11]:
kmp_table("aabaaa")

[0, 1, 0, 1, 2, 2]

In [12]:
# kmp search
def kmp_search(text, needle):
    '''
    KMP search for needle in text
    '''
    table = kmp_table(needle)
    i = 0
    j = 0
    while i < len(text):
        if text[i] == needle[j]:
            # so if we are at the end of the needle, we found it!!
            if j == len(needle)-1:
                return i-j
            else:
                i += 1
                j += 1
        elif j > 0:
            j = table[j-1]
        else:
            i += 1
    return -1

kmp_search(text, "RTU")

11

In [None]:
# Left as an exercise for the reader: run timing tests on the above functions
# and compare them to Python's built-in find method.

# You can use %timeit in a Jupyter notebook to time a function.
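
Outside a notebook, the standard library's `timeit` module does a similar job. The sketch below is one possible setup, not a benchmark methodology: the function `brute_force`, the repeated test text, the needle `"RTUX"` and the repetition count are all assumptions picked to keep the run short.

```python
import timeit


def brute_force(text, needle):
    """Slice-based brute-force search, returns -1 when needle is absent."""
    for i in range(len(text) - len(needle) + 1):
        if text[i:i + len(needle)] == needle:
            return i
    return -1


# Hypothetical test input: a repeated text and a needle that never occurs,
# so both searches have to scan the whole text.
sample_text = "RRRRTTRUUU RTUUUUURRTUUUUU" * 200
sample_needle = "RTUX"

# timeit.timeit accepts a zero-argument callable and runs it `number` times.
builtin_time = timeit.timeit(lambda: sample_text.find(sample_needle), number=20)
brute_time = timeit.timeit(lambda: brute_force(sample_text, sample_needle), number=20)

print(f"str.find:    {builtin_time:.6f} s for 20 runs")
print(f"brute_force: {brute_time:.6f} s for 20 runs")
```

Absolute numbers depend on the machine and Python version, so only the ratio between the two is worth comparing.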

## Boyer-Moore Algorithm

https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

### Description

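The Boyer-Moore algorithm compares the pattern against the text from right to left, starting at the last character of the pattern, and uses what it learns at a mismatch to skip alignments that cannot match. It has two shift rules:

1. Bad character rule
   When a mismatch occurs, look at the text character that caused it. Shift the pattern to the right so that the rightmost occurrence of that character in the pattern lines up with it. If the character does not occur in the pattern at all, shift the pattern completely past it. If the suggested shift would move the pattern backwards, shift by one position instead.
2. Good suffix rule
   When a mismatch occurs after a suffix of the pattern has already matched, shift the pattern so that another occurrence of that suffix in the pattern (or, failing that, the longest pattern prefix that matches the end of the suffix) lines up with the matched text. The implementation in this notebook uses only the bad character rule.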
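
The bad character rule can be sketched as a small table lookup. The helper names `last_occurrence` and `bad_character_shift` are hypothetical, and the shift is written here in its common textbook form (in terms of the pattern position j); the `boyer_moore` function in this notebook applies a similar rule expressed through the text index instead.

```python
def last_occurrence(needle):
    """Map each character to the index of its rightmost occurrence in needle."""
    return {ch: idx for idx, ch in enumerate(needle)}


def bad_character_shift(needle, j, bad_char):
    """Shift suggested when needle[j] mismatches the text character bad_char.

    Aligns the rightmost occurrence of bad_char in needle with the text
    character, and never shifts by less than one position.
    """
    last = last_occurrence(needle).get(bad_char, -1)  # -1: not in needle
    return max(1, j - last)


needle = "GATTACA"
print(last_occurrence(needle))
print(bad_character_shift(needle, 6, "T"))  # mismatch at the last position
print(bad_character_shift(needle, 6, "X"))  # character absent from the needle
print(bad_character_shift(needle, 2, "A"))  # rightmost "A" lies right of j
```

A character that is absent from the needle gives the largest possible shift, which is where Boyer-Moore gains most of its speed.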

### Complexity

Worst case time complexity: O(mn) where m is the length of the pattern and n is the length of the text.

Best case time complexity: O(n/m) where m is the length of the pattern/needle and n is the length of the text.
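
Both bounds can be observed by counting how many text characters a bad-character-only search inspects. The following is a sketch under that assumption; `boyer_moore_inspections` is a hypothetical name, and the two inputs are constructed specifically to hit the best and worst cases.

```python
def boyer_moore_inspections(text, needle):
    """Bad-character-only Boyer-Moore that counts inspected text characters.

    Returns (index, inspected); index is -1 when needle is absent.
    """
    m = len(needle)
    last = {ch: idx for idx, ch in enumerate(needle)}  # rightmost positions
    inspected = 0
    i = k = m - 1  # i indexes the text, k indexes the needle
    while i < len(text):
        inspected += 1
        if text[i] == needle[k]:
            if k == 0:
                return i, inspected
            i -= 1
            k -= 1
        else:
            # jump ahead using the rightmost occurrence of text[i] in needle
            i += m - min(k, 1 + last.get(text[i], -1))
            k = m - 1
    return -1, inspected


# Best case: no text character occurs in the needle, so every mismatch
# skips a full needle length.
best = boyer_moore_inspections("a" * 1000, "b" * 10)
# Worst case: each alignment matches 9 characters before failing at the
# front of the needle, and the shift is only one position.
worst = boyer_moore_inspections("a" * 1000, "b" + "a" * 9)
print(best, worst)
```

With n = 1000 and m = 10, the first search inspects n/m text characters while the second inspects roughly m times n.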

In [13]:
# let's implement Boyer Moore
def boyer_moore(text, needle):
    '''
    Boyer Moore search for needle in text
    '''
    # build bad character table
    table = {}
    for i in range(len(needle)):
        table[needle[i]] = i
    # now we search
    i = len(needle)-1 # start at the end of the needle!
    j = i
    while i < len(text):
        if text[i] == needle[j]:
            if j == 0:
                return i
            else:
                i -= 1
                j -= 1
        else:
            if text[i] in table:
                i += len(needle) - min(j, 1+table[text[i]])
            else:
                i += len(needle)
            j = len(needle)-1
    return -1

boyer_moore(text, "RTU")

11

## Timing

In [16]:
import random
# insert seed here
random.seed(42) # answer to life the universe and everything
long_random_text = "".join([random.choice("ACGT") for i in range(1000000)])
long_random_text[:10]

'AAGCCCAATA'

In [17]:
needle = "GATTACA"
long_random_text.find(needle)

27459

In [18]:
%timeit long_random_text.find(needle)

76.6 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [19]:
%timeit find_needle(long_random_text, needle)

7.89 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit find_needle_no_slice(long_random_text, needle)

13.8 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%timeit boyer_moore(long_random_text, needle)

7.87 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%timeit kmp_search(long_random_text, needle)

8.73 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Timing longer needles

We expect Boyer-Moore and KMP to be faster than brute force for longer needles. Let's test this hypothesis.

In [23]:
# let's take a longer needle from near the end of long_random_text
needle = long_random_text[-200:-50]
needle

'GCCATATTACTTAGGTTAAGGTTGGCGTACTCGTGTTTAACATCCGGCCTACGCAGGCTATTTTATACATTATTGTACTTTTTGATAGTTAGTCAATGCGCCACCGGTTCGTTAGAGGGTAGGTATCTCTTTTGGCGAGGATGCACGTCC'

In [24]:
%timeit long_random_text.find(needle)

3.6 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
%timeit find_needle(long_random_text, needle)

303 ms ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%timeit boyer_moore(long_random_text, needle)

378 ms ± 42.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%timeit kmp_search(long_random_text, needle)

261 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Timing longer needles with a bigger alphabet

Finally, we will use a bigger alphabet (52 letters instead of 4) with a text of the same length to see how the algorithms perform.

In [28]:
import string
letters = string.ascii_letters
letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [30]:
long_text = "".join([random.choice(letters) for i in range(1000000)])
needle = long_text[-200:-50]
needle

'zyjnFPQbKJRTsQEawcXZWYnKTlJiZCbduFFXofSHHwcdGoTMpYsCcQMBpaYdcoNPXWnJChYWGcfsAGMKIVKwuLnEpWBSPOKeQdvfQuYGbYbRghMOsuQrzezdMwXnpePcIbpMzdahLYAHkwFXLYFGDj'

In [31]:
len(set(needle))

47

In [33]:
long_text.find(needle)

999800

In [32]:
%timeit long_text.find(needle)

336 µs ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [34]:
%timeit find_needle(long_text, needle)

269 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
%timeit kmp_search(long_text, needle)

284 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%timeit boyer_moore(long_text, needle)

13 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Conclusion

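Boyer-Moore performs best when there are many potential "easy losses": text characters that are absent from the pattern, or occur only far from its end, which let the search jump ahead by many positions at once. This is why it was about 20 times faster than the other pure-Python implementations on the 52-letter alphabet (13 ms versus roughly 270 ms) but no faster on the 4-letter DNA text.

KMP never moves backwards in the text, so it beats brute force mainly when the text and pattern produce long partial matches, for example a pattern with repeated prefixes searched in a repetitive text. On the random texts above such partial matches are rare, and KMP ran at about the same speed as brute force.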
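
The claim about KMP can be checked by counting comparisons on a deliberately repetitive input. This is a minimal sketch: `prefix_table` and `kmp_comparisons` are hypothetical names, the search loop mirrors the structure of `kmp_search` in this notebook, and the input is an assumption chosen to be bad for brute force.

```python
def prefix_table(needle):
    """table[i] = length of the longest proper prefix of needle[:i+1]
    that is also a suffix of it."""
    table = [0] * len(needle)
    j = 0
    for i in range(1, len(needle)):
        while j > 0 and needle[i] != needle[j]:
            j = table[j - 1]
        if needle[i] == needle[j]:
            j += 1
        table[i] = j
    return table


def kmp_comparisons(text, needle):
    """KMP search that counts character comparisons.

    Returns (index, comparisons); index is -1 when needle is absent.
    """
    table = prefix_table(needle)
    comparisons = 0
    i = j = 0
    while i < len(text):
        comparisons += 1
        if text[i] == needle[j]:
            if j == len(needle) - 1:
                return i - j, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]  # fall back in the needle, keep i where it is
        else:
            i += 1
    return -1, comparisons


# Repetitive text and a needle that almost matches at every alignment.
repetitive_text = "a" * 1000
repetitive_needle = "a" * 9 + "b"
result = kmp_comparisons(repetitive_text, repetitive_needle)
print(result)
```

On this input KMP stays below 2n = 2000 comparisons, whereas a brute-force scan re-compares nine matching characters at each of the 991 alignments.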