# String Searching Algorithms

String matching algorithms find occurrences of a given pattern string in a text string. They have a wide range of applications, including text editing, data compression, and search engines.

There are several different string matching algorithms, each with its own advantages and disadvantages. Some of the most commonly used algorithms include:

1. Naive String Matching: This algorithm compares the pattern against the text at every possible starting position. It is the simplest approach, but for a text of length n and a pattern of length m it can take O(nm) character comparisons in the worst case.

2. Knuth-Morris-Pratt (KMP) Algorithm: This algorithm precomputes a prefix function of the pattern so that, after a mismatch, it never re-examines text characters that have already been matched.

3. Boyer-Moore Algorithm: This algorithm uses two preprocessing steps to speed up the matching process: a bad character rule and a good suffix rule.

4. Rabin-Karp Algorithm: This algorithm compares a hash of each pattern-sized window of the text with the hash of the pattern, and verifies character by character only when the hashes agree.

5. Finite Automaton Algorithm: This algorithm uses a finite state machine to recognize the pattern in the text string.

These algorithms have different time and space complexities, and the choice of which algorithm to use depends on the specific requirements of the application.
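To make the hashing idea in item 4 more concrete, here is a minimal sketch of Rabin-Karp with a rolling hash. The function name `rabin_karp_find_all` and the default `base` and `modulus` values are assumptions chosen for illustration, not part of the original notebook.

```python
def rabin_karp_find_all(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start index of every occurrence of pattern in text.

    base and modulus are illustrative defaults, not tuned values.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    pattern_hash = window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        window_hash = (window_hash * base + ord(text[k])) % modulus

    occurrences = []
    for i in range(n - m + 1):
        # Hashes can collide, so confirm every candidate with a direct comparison.
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            occurrences.append(i)
        if i < n - m:
            # Roll the hash: drop text[i], then append text[i + m].
            window_hash = ((window_hash - ord(text[i]) * high) * base
                           + ord(text[i + m])) % modulus
    return occurrences


print(rabin_karp_find_all("abababa", "aba"))  # [0, 2, 4]
```

Updating the hash costs constant time per window, so the expected running time is linear in the text length as long as hash collisions are rare.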

## Practical Applications

* Search engines: Search engines use string matching algorithms to find relevant web pages based on the user's query.

* Virus scanners: Virus scanners use string matching algorithms to detect malicious code in files by searching for known virus signatures.

* Data compression: Compression algorithms use string matching algorithms to identify repeated patterns in the data, which can then be replaced with a shorter code.

* Text editors: Text editors use string matching algorithms to implement search and replace functionality.

* Natural language processing: String matching algorithms are used in natural language processing tasks such as named entity recognition, where a specific pattern of words or characters is matched to identify named entities such as people, organizations, and locations.

* DNA sequence analysis: String matching algorithms are used in bioinformatics to search for specific patterns in DNA sequences.

## Naive String Matching

The naive string matching algorithm is the simplest string matching algorithm. It slides the pattern over the text one position at a time and compares it with the substring of the same length that starts at that position. If the two are equal, the starting index is added to the list of matches. Either way, the algorithm then moves on to the next position.


### Naive String Matching Algorithm explanation

The naive_string_matcher function takes two input strings text and pattern, and returns a list of all occurrences of the pattern in the text.

The function first stores the lengths of the text and the pattern in n and m, and creates an empty list occurrences to store the indices of each occurrence of pattern in text.

It then iterates through every possible starting index of a substring of text that is the same length as pattern, that is, from 0 through n - m. For each starting index i, it checks whether the substring of text starting at index i and with length m is equal to pattern. If it is, then it appends i to the occurrences list. Each of these checks can take up to m character comparisons, so the worst-case running time is O(nm).

Finally, the function returns the occurrences list containing the indices of each occurrence of pattern in text.

In [7]:
text = "Riga Rocks RBS Rocking as well also RBS is great plus we have RBS at home"
pattern = "RBS"
# Python built-ins: find returns the first occurrence, rfind the last one
text.find(pattern), text.rfind(pattern) # finding every occurrence takes a loop, as in the function further down

(11, 62)

In [9]:
text.find("willnotexist")

-1

In [12]:
def find_all_indexes_using_built_in_find(text, pattern):
    occurrences = []
    find_index = text.find(pattern)
    while find_index >= 0:
        occurrences.append(find_index)
        # resume the search one position past the last match (overlapping matches are found too)
        find_index = text.find(pattern, find_index + 1)
    return occurrences

In [13]:
find_all_indexes_using_built_in_find(text,pattern)

[11, 36, 62]

In [5]:
## Implementation of naive string matching

def naive_string_matcher(text, pattern):
    n = len(text)
    m = len(pattern)
    occurrences = [] # start indices of the matches found so far

    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            occurrences.append(i)

    return occurrences


In [8]:
naive_string_matcher(text, pattern)

[11, 36, 62]

## Boyer-Moore Algorithm

The Boyer-Moore algorithm is a string matching algorithm that uses two preprocessing steps to speed up the matching process: a bad character rule and a good suffix rule.

The bad character rule examines the character in text that caused the mismatch with pattern (the "bad character"). The pattern is shifted to the right so that the rightmost occurrence of that character in pattern, to the left of the mismatch position, is aligned with the bad character in text. If the character does not occur there at all, the pattern is shifted completely past it. The shift is safe because any smaller shift would place a different pattern character under the bad character and therefore could not produce a match.
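As a rough illustration of the bad character rule, the sketch below builds a table of rightmost positions and derives a shift from it. The helper names `bad_character_table` and `bad_character_shift` are hypothetical, and this simple table ignores the requirement that the occurrence lie to the left of the mismatch, so the shift is clamped to at least 1.

```python
def bad_character_table(pattern):
    """Map each character to the index of its rightmost occurrence in pattern."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def bad_character_shift(table, j, bad_char):
    """Shift suggested after a mismatch at pattern index j against bad_char.

    Characters absent from the pattern are treated as index -1, so the
    pattern moves completely past the bad character.
    """
    return max(1, j - table.get(bad_char, -1))


table = bad_character_table("cada")
print(table)                               # {'c': 0, 'a': 3, 'd': 2}
print(bad_character_shift(table, 3, "r"))  # 4: 'r' is not in the pattern
print(bad_character_shift(table, 3, "d"))  # 1: align the 'd' at index 2
```

Using a dictionary rather than a fixed-size list is a design choice made here so that any character can be looked up.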

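The good suffix rule applies when a suffix of pattern has already matched the text before the mismatch occurred. The pattern is shifted so that the nearest earlier occurrence of that suffix within pattern is aligned with the matched text. If the suffix does not occur again, the pattern is shifted so that its longest prefix that is also a suffix of the matched text is aligned with the end of that text, and if there is no such prefix the pattern moves completely past it. The shift is safe because any smaller shift would align pattern characters that disagree with text already known to equal the suffix.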
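The good suffix rule can be sketched by brute force: try every shift and keep the smallest one that stays consistent with the matched suffix. The function name `good_suffix_shift` is hypothetical, and the sketch uses the common "strong" form of the rule, in which the character brought under the mismatch must differ from the one that just failed. Real implementations precompute these shifts in O(m) time; this version favours clarity over speed.

```python
def good_suffix_shift(pattern, j):
    """Smallest safe shift after pattern[j+1:] matched and pattern[j] mismatched."""
    m = len(pattern)
    for s in range(1, m + 1):
        # The shifted pattern must agree with the part of the text that matched.
        suffix_ok = all(pattern[k - s] == pattern[k]
                        for k in range(j + 1, m) if k - s >= 0)
        # Strong rule: do not bring the same failing character under the mismatch.
        differs = j - s < 0 or pattern[j - s] != pattern[j]
        if suffix_ok and differs:
            return s
    return m  # not reached: a shift of m always satisfies both conditions


print(good_suffix_shift("ANPANMAN", 5))  # 3: "AN" occurs again at index 3
print(good_suffix_shift("ANPANMAN", 6))  # 8: every earlier "N" follows an "A"
```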

The Boyer-Moore algorithm combines these two rules to determine the best shift to make after a mismatch occurs. Specifically, it chooses the larger of the shifts suggested by the bad character rule and the good suffix rule. This means that the algorithm skips over as many characters as possible in text before attempting another match.

In practice, the Boyer-Moore algorithm is often faster than other string matching algorithms, particularly when the pattern is long and the alphabet is large, because mismatches then tend to allow long shifts. It can be slower than simpler algorithms when the pattern is short, since the possible shifts are small and the preprocessing cost is not recovered, or when the alphabet is small, since most characters then occur somewhere in the pattern.

In [19]:
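def boyer_moore(text, pattern, debug=True):
    """Return the index of the first occurrence of pattern in text, or -1.

    Simplified variant: only a bad-character skip table is used.
    Assumes every character has a code point below 256.
    """
    n = len(text)
    m = len(pattern)
    if m == 0:
        return 0

    # skip[c]: distance from the last occurrence of c in pattern[:-1]
    # to the end of the pattern, or m if c does not occur there
    skip = [m] * 256
    for i in range(m - 1):
        skip[ord(pattern[i])] = m - i - 1

    # Search for the pattern in the text, comparing right to left
    i = m - 1
    while i < n:
        j = m - 1
        while text[i] == pattern[j]:
            if debug:
                print(f"Potential match at location {i} letter {text[i]}")
            if j == 0:
                return i
            i -= 1
            j -= 1
        # m - j guarantees that the pattern always moves at least one position right
        i += max(skip[ord(text[i])], m - j)
        if debug:
            print(f"skipping to position {i}")

    return -1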

In [21]:
print(text, pattern)
boyer_moore(text, pattern)

Riga Rocks RBS Rocking as well also RBS is great plus we have RBS at home RBS
skipping to position 5
skipping to position 7
skipping to position 10
skipping to position 13
Potential match at location 13 letter S
Potential match at location 12 letter B
Potential match at location 11 letter R


11

In [22]:
boyer_moore("abra dabra cadabra", "cada")

Potential match at location 3 letter a
skipping to position 6
Potential match at location 6 letter a
Potential match at location 5 letter d
skipping to position 8
skipping to position 12
Potential match at location 12 letter a
skipping to position 14
Potential match at location 14 letter a
Potential match at location 13 letter d
Potential match at location 12 letter a
Potential match at location 11 letter c


11

In [23]:
boyer_moore("a quick brown fox jumped over a sleeping dog and ate it", "sleeping dog")

skipping to position 23
skipping to position 25
skipping to position 26
skipping to position 38
skipping to position 43
Potential match at location 43 letter g
Potential match at location 42 letter o
Potential match at location 41 letter d
Potential match at location 40 letter  
Potential match at location 39 letter g
Potential match at location 38 letter n
Potential match at location 37 letter i
Potential match at location 36 letter p
Potential match at location 35 letter e
Potential match at location 34 letter e
Potential match at location 33 letter l
Potential match at location 32 letter s


32

## Boyer-Moore efficiency

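In the best case Boyer-Moore inspects only about n/m characters of the text, i.e. Ω(n/m), where n is the length of the text and m the length of the pattern: every mismatch on a character that does not occur in the pattern lets it skip m positions at once. This is why it is efficient in real-life applications where the pattern is long.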
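The n/m behaviour can be checked by counting character comparisons on an input that is favourable to skipping. The sketch below is an illustration under assumptions: the function names are hypothetical, and the skipping scan is a Horspool-style simplification that uses only a bad-character table, not a full Boyer-Moore.

```python
def naive_comparisons(text, pattern):
    """Count character comparisons made by a left-to-right naive scan."""
    n, m = len(text), len(pattern)
    count = 0
    for i in range(n - m + 1):
        for j in range(m):
            count += 1
            if text[i + j] != pattern[j]:
                break
    return count


def horspool_comparisons(text, pattern):
    """Count comparisons for a bad-character-only (Horspool-style) scan."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Distance from the last occurrence of each character (final one excluded)
    # to the end of the pattern.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    count = 0
    end = m - 1  # text index aligned with the last pattern character
    while end < n:
        j = m - 1
        while j >= 0:
            count += 1
            if text[end - (m - 1 - j)] != pattern[j]:
                break
            j -= 1
        end += shift.get(text[end], m)
    return count


sample_text = "a" * 1000
sample_pattern = "b" * 10
print(naive_comparisons(sample_text, sample_pattern))     # 991
print(horspool_comparisons(sample_text, sample_pattern))  # 100
```

On this input the skipping scan makes n/m = 100 comparisons, while the naive scan makes one comparison at each of the 991 starting positions.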

### Boyer-Moore Algorithm explanation

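The boyer_moore function takes two input strings text and pattern, and returns the index of the first occurrence of pattern in text. If pattern does not occur in text, the function returns -1. It is a simplified variant of the algorithm: it uses only a bad-character skip table and does not implement the good suffix rule.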

The function first initializes the lengths of the text and pattern, and checks if the length of pattern is zero. If it is, it returns 0 (indicating that pattern occurs at the beginning of text).

It then initializes a skip table skip that stores the number of characters to skip when a mismatch occurs, based on the text character that caused the mismatch. The skip table is initialized with the value m for each of the 256 possible character codes (0 to 255), so text containing characters outside that range is not supported.

Next, the function iterates through the first m - 1 characters of pattern and sets the skip value of each character to its distance from the end of pattern, m - i - 1. Later occurrences overwrite earlier ones, so each character ends up with the distance of its rightmost occurrence, ignoring the final character of pattern.

Finally, the function searches for pattern in text. It aligns the end of pattern with an index in text and compares backwards through pattern and text until a mismatch is found. If all m characters match, that is, the comparison reaches the first character of pattern without a mismatch, the function returns the current index in text, which is the start of the occurrence. Otherwise, the function jumps ahead in text by the maximum of the skip value for the mismatched text character and m - j, which is one more than the number of pattern characters already matched and guarantees that the pattern always moves forward.

If pattern is not found in text, the function returns -1.
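The function described above stops at the first occurrence. A minimal sketch of a variant that collects every occurrence, as naive_string_matcher does, might look like this. It is a Horspool-style simplification rather than the function above: the name `horspool_find_all` is hypothetical, the shift always uses the text character under the end of the pattern, each window is checked with a slice comparison, and a dictionary replaces the 256-entry list so that any Unicode character is accepted.

```python
def horspool_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Distance from the last occurrence of each character (final one excluded)
    # to the end of the pattern; characters not listed shift by m.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}

    occurrences = []
    end = m - 1  # text index aligned with the last pattern character
    while end < n:
        start = end - m + 1
        if text[start:end + 1] == pattern:
            occurrences.append(start)
        end += shift.get(text[end], m)
    return occurrences


print(horspool_find_all("abra dabra cadabra", "cada"))  # [11]
print(horspool_find_all("aaaa", "aa"))                  # [0, 1, 2]
```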

## Other String Matching Algorithms

There are several other string matching algorithms, including the Knuth-Morris-Pratt (KMP) algorithm, the Rabin-Karp algorithm, and the finite automaton algorithm. These algorithms have different time and space complexities, and the choice of which algorithm to use depends on the specific requirements of the application.
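As an example of one of these alternatives, here is a minimal sketch of KMP built on the prefix function. The function names `prefix_function` and `kmp_find_all` are chosen for illustration and do not come from the notebook.

```python
def prefix_function(pattern):
    """pi[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    occurrences = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back to the longest border instead of rescanning text.
        while k > 0 and ch != pattern[k]:
            k = pi[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            occurrences.append(i - k + 1)
            k = pi[k - 1]  # keep going so overlapping matches are found
    return occurrences


print(prefix_function("abab"))           # [0, 0, 1, 2]
print(kmp_find_all("abababa", "aba"))    # [0, 2, 4]
```

Each text character is read once and the fallback loop is amortized constant, so the search runs in O(n + m) time regardless of the input.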

## TODO What algorithm is Python Find/Index using