# String Algorithms and Techniques
### String notations and concepts
Strings are basically sequences of objects, most commonly sequences of characters. As with any other data type, such as `int` or `float`, we need a way to store the data and a set of operations to apply to it. The string data type stores the data, and Python provides a rich set of operations and functions that can be applied to strings. Most of these operations and functions were described in Chapter 1.

Strings are mainly textual data that is generally handled very efficiently. The following is an example of a string:
```python 
"packt publishing" 
```
A substring is a contiguous sequence of characters that is part of the given string. For example, `"packt"` is a substring of the string `"packt publishing"`.

A subsequence is a sequence of characters that can be obtained from the given string by removing some of its characters while keeping the remaining characters in their original order. For example, `"pct pblishing"` is a valid subsequence of the string `"packt publishing"`, obtained by removing the characters a, k, and u. However, it is not a substring, because its characters are not contiguous in the original string. Every substring is a subsequence, but not every subsequence is a substring, so subsequences can be considered a generalization of substrings.
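The difference between the two notions can be checked directly in Python. The following sketch uses the built-in `in` operator for the substring test and a small helper, `is_subsequence`, for the subsequence test. The helper name is hypothetical and is introduced here only for illustration; it is not a Python built-in.

```python
def is_subsequence(candidate, string):
    """Return True if candidate can be obtained from string by deleting characters."""
    remaining = iter(string)
    # Each character of candidate must be found, in order, in what is left of string.
    # The membership test on an iterator consumes it up to the character found.
    return all(ch in remaining for ch in candidate)


s = "packt publishing"
print("packt" in s)                        # substring test on a contiguous run
print("pct pblishing" in s)                # characters are not contiguous in s
print(is_subsequence("pct pblishing", s))  # order is preserved, so this is a subsequence
```

The three calls print `True`, `False`, and `True`, which shows that `"pct pblishing"` is a subsequence of `s` without being a substring of it.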

A prefix of a string `s` is a substring that occurs at the start of `s`, so that `s` consists of the prefix followed by another string `u`. For example, `"pack"` is a prefix of the string `s = "packt publishing"`, since `s` starts with it and continues with the remaining string `u = "t publishing"`.

A suffix `d` is a substring that occurs at the end of the string `s`, so that another nonempty substring precedes `d`. For example, `"shing"` is a suffix of the string `"packt publishing"`. Python has built-in methods to check whether a string has a given prefix or suffix, as shown in the following code snippet:

```python
string = "this is data structures book by packt publisher"
suffix = "publisher"
prefix = "this"
print(string.endswith(suffix))    # Check whether the string ends with the given suffix
print(string.startswith(prefix))  # Check whether the string starts with the given prefix
```
The output is as follows:
```
True
True
```


Pattern matching algorithms are among the most important string processing algorithms, and we will be discussing them in the following sections.
### Pattern matching algorithms
A pattern matching algorithm determines the index positions at which a given pattern string occurs in a text string. If the pattern does not occur in the text string, it reports `"pattern not found"`. For example, for the string `s = "packt publisher"` and the pattern `p = "publisher"`, the pattern matching algorithm returns index position 6, which is where the pattern starts in the text string.
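As a quick illustration of the expected result, Python's built-in `str.find` method returns the lowest index at which the pattern occurs, or `-1` if it does not occur. It is used here only as a reference point for checking results, not as one of the algorithms discussed in this section.

```python
s = "packt publisher"
p = "publisher"

index = s.find(p)   # lowest index at which p occurs in s, or -1 if it is absent
if index == -1:
    print("pattern not found")
else:
    print("pattern found at index", index)   # index 6 for this input
```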

In this section, we will discuss four pattern matching algorithms: the brute-force method, the Rabin-Karp algorithm, the Knuth-Morris-Pratt (KMP) algorithm, and the Boyer-Moore algorithm.
### The brute-force algorithm
The brute-force algorithm, or naive approach to pattern matching, is very basic. We simply test the pattern against every possible starting position in the given string to find where the pattern occurs. This approach is not suitable if the text is very long.

We start by comparing the characters of the pattern and the text string one by one. If all the characters of the pattern match the text, we return the index position of the text at which the first character of the pattern is placed. If any character of the pattern does not match the text string, we shift the pattern by one place. We continue comparing the pattern and the text string in this way, shifting the pattern by one index position each time.

Let's now consider the Python implementation of the brute-force algorithm for pattern matching:

```python
def brute_force(text, pattern):
    l1 = len(text)      # Length of the text in which to search for the pattern
    l2 = len(pattern)   # Length of the pattern to search for
    i = 0               # Index position in the text
    j = 0               # Index position in the pattern
    flag = False        # Becomes True once the pattern has been found at least once
    while i < l1:       # Try every starting index in the text, beginning at index 0
        j = 0
        count = 0       # Number of pattern characters matched at the current starting index
        while j < l2:
            if i + j < l1 and text[i+j] == pattern[j]:  # Check whether the characters match
                count += 1                              # A match: update count
            else:
                break                                   # A mismatch: stop comparing at this index
            j += 1
        if count == l2:                                 # All characters of the pattern matched
            print("\nPattern occurs at index", i)       # Print the starting index of the match
            flag = True                                 # Record that at least one match was found
        i += 1
    if not flag:                                        # The pattern does not occur even once
        print('\nPattern is not present in the text')

brute_force('acbcabccababcaacbcaabacbbc', 'acbcaa')     # function call
```
The output is as follows:
```
Pattern occurs at index 14
```


In the preceding code for the brute-force approach, we start by computing the lengths of the given text string and the pattern. We also initialize the looping variables to `0` and set the flag to `False`. This flag records whether the pattern has been found anywhere in the text string. If the flag is still `False` when we reach the end of the text string, it means that there is no match of the pattern at all in the text string.

Next, we start the searching loop from the 0<sup>th</sup> index to the end of the text string. In this loop, we have a count variable that keeps track of the length up to which the pattern and the text have been matched. Then, we have a nested loop that runs from the 0<sup>th</sup> index to the length of the pattern. Here, the variable `i` keeps track of the index position in the text string and the variable `j` keeps track of the characters in the pattern. We compare the characters of the pattern and the text string using the following code fragment:
```python
if i + j < l1 and text[i+j] == pattern[j]:
```
Furthermore, we increment the count variable after every match of a character of the pattern in the text string. Then, we continue matching the characters of the pattern and the text string. If the count variable becomes equal to the length of the pattern, it means there is a match.

We print the index position of the text string if there is a match of the pattern in the text string, and set the flag variable to `True`. The search then continues, since we wish to find any further matches of the pattern in the text string. Finally, if the value of the flag variable is `False`, it means that there was no match of the pattern in the text string at all.
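A variant that returns the match positions instead of printing them is often easier to reuse and to test. The following is a minimal sketch of that idea, not the implementation described above: the function name `find_all` is hypothetical, it compares slices rather than counting character matches, and it only tries the starting positions at which the whole pattern still fits in the text.

```python
def find_all(text, pattern):
    """Return every index at which pattern occurs in text, using a naive search."""
    n, m = len(text), len(pattern)
    matches = []
    # Only n - m + 1 starting positions can hold the whole pattern.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches


print(find_all("acbcabccababcaacbcaabacbbc", "acbcaa"))   # [14]
print(find_all("aaaa", "aa"))                             # [0, 1, 2] (overlapping matches)
print(find_all("abc", "xyz"))                             # []
```

Returning a list leaves it to the caller to decide how to report the result, and an empty list plays the role of the `flag` variable.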

The best-case and worst-case time complexities of the naive string matching algorithm are $\mathcal{O}(n)$ and $\mathcal{O}(m*(n-m+1))$, respectively, where $n$ is the length of the text and $m$ is the length of the pattern. The best case occurs when the pattern is not found in the text and the first character of the pattern is not present in the text at all, for example, if the text string is `ABAACEBCCDAAEE` and the pattern is `FAA`. Here, since the first character of the pattern never matches, each starting position can be rejected after a single comparison, so the number of comparisons is proportional to the length of the text, `n`.

The worst case occurs when all characters of the text string and the pattern are the same, for example, if the text string is `AAAAAAAAAAA` and the pattern is `AAAA`. Another worst-case scenario occurs when only the last character is different, for example, if the text string is `AAAAAAAAAAAAAF` and the pattern is `AAAAF`. In both scenarios, all $m$ characters of the pattern have to be compared at each of the $n-m+1$ starting positions. Thus, the worst-case time complexity is $\mathcal{O}(m*(n-m+1))$.
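The best and worst cases can be observed by counting character comparisons. The sketch below assumes a naive search that abandons a starting position at the first mismatch and tries only the `n - m + 1` positions at which the pattern fits. The helper `count_comparisons` is a hypothetical name introduced for this illustration.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by a naive search with early exit."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):          # every starting position where the pattern fits
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                   # mismatch: move on to the next starting position
    return comparisons


# Best case: the first character never matches, so one comparison per position.
print(count_comparisons("ABAACEBCCDAAEE", "FAA"))
# Worst case: every character matches, so m comparisons per position.
print(count_comparisons("AAAAAAAAAAA", "AAAA"))
# Worst case: only the last character of the pattern can mismatch.
print(count_comparisons("AAAAAAAAAAAAAF", "AAAAF"))
```

For the first input the count equals `n - m + 1`, which grows linearly with the length of the text. For the other two inputs it equals `m * (n - m + 1)`.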
### Rabin-Karp algorithm
