# String-searching algorithms

String-searching algorithms find positions where one or several strings (also called patterns) are found within a larger string.

We search for a pattern $P = p_1p_2 \dots p_m$ of length $m$ in a template string $T = t_1t_2 \dots t_n$ of length $n$.

Output:

Every position $i$ at which a substring of $T$ equals $P$, i.e. $T[i:i+m] = P$

## Naive approach

We move along the string T character by character, checking for a match with P at each position.

Time complexity in the worst case: $O(mn)$

![Title](img/naive_string_search.png)


In [1]:
def naive_search(t, p):
    m = len(p)
    n = len(t)
    i = 0
    while i <= n - m:
        j = 0
        while j < m:
            if p[j] != t[i+j]:
                break
            else:
                j += 1
            if j == m:
                print(f"Match at position {i}")
                break
        i+=1
        
naive_search("aaaabaaaa", "aaa")

Match at position 0
Match at position 1
Match at position 5
Match at position 6


## Boyer-Moore Algorithm

The main idea of the Boyer-Moore algorithm is to skip unnecessary alignments during search.

Alignment direction - left-to-right

Comparison direction - right-to-left

![Title](img/boyer-moore.png)

At each mismatch, two rules propose how many alignments of P against T can be skipped: the bad character rule and the good suffix rule. The algorithm takes the larger of the two shifts.

### Bad character rule

Upon mismatch, skip alignments until

I. A mismatch becomes a match.

or

II. P moves past a mismatched character.

![Title](img/boyer-moore_bad_char.png)

We can precalculate the number of alignments we can skip:

![Title](img/boyer-moore_bad_char_precalc.png)
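The bad character rule can be sketched in code as follows. This is a simplified variant, not the full Boyer-Moore algorithm: it ignores the good suffix rule and stores only the rightmost index of each character in P rather than a per-position table. The names `build_bad_char_table` and `bad_char_search` are made up for this illustration.

```python
def build_bad_char_table(p):
    """Map each character of p to the index of its rightmost occurrence."""
    return {c: i for i, c in enumerate(p)}


def bad_char_search(t, p):
    """Search p in t using only the (simplified) bad character rule."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    last = build_bad_char_table(p)
    matches = []
    i = 0  # current alignment: p[0] sits under t[i]
    while i <= n - m:
        j = m - 1  # compare right-to-left
        while j >= 0 and p[j] == t[i + j]:
            j -= 1
        if j < 0:
            matches.append(i)
            i += 1
        else:
            # Move the rightmost occurrence of the mismatched text character
            # under position i + j, or move P past it if it is not in P.
            # The shift is always at least 1.
            i += max(1, j - last.get(t[i + j], -1))
    return matches


print(bad_char_search("aaaabaaaa", "aaa"))  # [0, 1, 5, 6]
```

At alignment 2 the mismatched character `b` does not occur in P, so the search jumps directly to alignment 5.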


### Good suffix rule

Let's define the string $s$ as a substring of T that matches a suffix of $P$  at a given position. We can skip alignments until

I. There are no mismatches between P and s.

or

II. P moves past s. In that case, shift only until a prefix of P matches a suffix of s, if such a prefix exists.

![Title](img/boyer-moore_good_suf.png)


### Implementation

https://github.com/jordancheah/DNA-Sequencing-Boyer-Moore-Approximate-Matching/blob/master/boyermoore.py

Time complexity in the worst-case: $O(mn)$

Time complexity in the best-case: $O(n/m)$


# Knuth-Morris-Pratt Algorithm

**Prefix function** of string $S$:

$PrefixFunction(S)$ = the maximum length $k$ ($k < |S|$) such that the prefix $S[:k]$ equals the suffix $S[-k:]$.

Example:

PrefixFunction(**abra**cad**abra**) = 4

**abra** is both a prefix and a suffix. We can calculate PrefixFunction for every prefix of a string:

![Title](img/prefix_function.png)


In [2]:
def prefix_function(s):
    n =  len(s)
    prefix_array = [0]*n
    for i in range(1,n):
        j = prefix_array[i-1]
        while j > 0 and s[i] != s[j]:
            j = prefix_array[j-1]
        if s[i] == s[j]:
            j+=1
        prefix_array[i] = j
    return prefix_array

prefix_function('abracadabra')

[0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]

### Knuth-Morris-Pratt algorithm steps:

- Concatenate strings P and T using some separator character, e.g., \$

- Calculate PrefixFunction(P\$T)

- Using the PrefixFunction array, find prefixes U: PrefixFunction(U) = Length(P)

Example: T = banana, P = ana

![Title](img/knuth_morris_pratt.png)

Time complexity: $O(m+n)$


In [3]:
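def kmp(t, p):
    # The prefix function of p + "$" + t equals len(p) exactly where a match ends.
    prefix_array = prefix_function(p + "$" + t)
    # A match ending at index i of the concatenation starts at i - 2*len(p) in t.
    return [i - 2 * len(p) for i, x in enumerate(prefix_array) if x == len(p)]

kmp('banana', 'ana')

[1, 3]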

# Rabin-Karp Algorithm

The main idea of the algorithm is to calculate the hash function of each substring of T and compare the hash values with the hash of the pattern P.

Each character of a string $S = s_1s_2 \dots s_n$ has a numeric code: $\sigma_1\sigma_2 \dots \sigma_n$.

Polynomial hash of a string:

$N(S) = \sigma_1A^{n-1} + \sigma_2A^{n-2} + \dots + \sigma_n$

String hash $H(S) = N(S) \bmod M$, where $A$ and $M$ are coprime.


###  Rolling hash

When the window is shifted one position along T, the hash of the new substring can be computed in constant time from the hash of the previous substring. This is called a **rolling hash**.

![Title](img/rolling_hash.png)

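With 0-based positions as in Python slices (so $\sigma_i$ is the code of $T[i]$) and all arithmetic taken modulo $M$:

$H(T[i:i+m]) = \sigma_i\cdot A^{m-1} + H(T[i+1:i+m])$

$H(T[i+1:i+m]) = H(T[i:i+m]) - \sigma_i\cdot A^{m-1}$

$H(T[i+1:i+m+1]) = A\cdot(H(T[i:i+m]) - \sigma_i\cdot A^{m-1}) + \sigma_{i+m}$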
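As a quick check of the rolling update, the sketch below computes the hash of every window of length $m$ incrementally and compares it with the hash computed from scratch. The leading character of a window of length $m$ carries the weight $A^{m-1}$. The values `A = 256` and `M = 101` and the helper names are assumptions chosen only for illustration.

```python
A, M = 256, 101  # assumed base and modulus (coprime)


def poly_hash(s):
    """Polynomial hash of s modulo M, computed from scratch."""
    h = 0
    for c in s:
        h = (h * A + ord(c)) % M
    return h


def rolling_hashes(t, m):
    """Hashes of all length-m windows of t, each derived from the previous one."""
    if m == 0 or m > len(t):
        return []
    top = pow(A, m - 1, M)  # weight of the leading character of a window
    h = poly_hash(t[:m])
    hashes = [h]
    for i in range(len(t) - m):
        # Drop t[i], shift the remaining characters, append t[i + m].
        h = (A * (h - ord(t[i]) * top) + ord(t[i + m])) % M
        hashes.append(h)
    return hashes


t, m = "CAGCAGCGCAG", 3
direct = [poly_hash(t[i:i + m]) for i in range(len(t) - m + 1)]
print(rolling_hashes(t, m) == direct)  # True
```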



### Rabin-Karp algorithm steps: 

1) Compute $H(P)$.

2) Compute the hash of each substring of $T$ of length $m$: $H(T[i:i+m])$.

3) If $H(T[i:i+m]) = H(P)$, compare the strings character by character to rule out a hash collision.

Worst-case time complexity (the hash matches at every position): $O(m(n-m))$

Best-case time complexity (no matches and no hash collisions): $O(n + m)$

Time complexity for k matches: $O(n + m + km)$

In [None]:
class RabinKarp:
    def __init__(self):
        self.A = 256
        self.M = 101
    
    def hash(self, s):
        h = 0
        for c in s:
            h = (h * self.A + ord(c)) % self.M
        return h
    
    def search(self, t, p):
        n = len(t)
        m = len(p)
        hash_p = self.hash(p)
        h = pow(self.A, m-1, self.M)
        hash_t = self.hash(t[:m])
        results = []
        
        for i in range(n-m+1):
            if hash_t == hash_p:
                match = True
                for j in range(m):
                    if t[i+j] != p[j]:
                        match = False
                        break
                if match:
                    results.append(i)
            
            if i < n-m:
                hash_t = (self.A*(hash_t - ord(t[i])*h) + ord(t[i+m])) % self.M
        
        return results


In [None]:
rabin_karp = RabinKarp()
rabin_karp.search('CAGCAGCGCAG', 'CAG')

# Aho–Corasick algorithm

The Aho-Corasick algorithm can be used to simultaneously search for several patterns in text. The algorithm uses a trie structure.


Input:

T - template string

k pattern strings: $P_1, \dots, P_k$

Output:

All occurrences of $P_1, \dots, P_k$ in T.

### Trie

A trie is a tree-like data structure that is used to efficiently store and retrieve strings or sequences of characters. Each edge of such a tree is labeled with a character. The nodes are empty or can contain the marker of the end of the string.

Example:

The words he, she, his, and hers were added to the trie:

![Title](https://neerc.ifmo.ru/wiki/images/e/ea/%D0%91%D0%BE%D1%80.jpg)

Adding a string of length $n$: $O(n)$

Retrieving a string of length $n$: $O(n)$
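A minimal trie sketch, assuming a dictionary of children per node and a boolean end-of-string marker. The class and method names (`TrieNode`, `Trie`, `add`, `contains`) are illustrative and not taken from any library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (one character) -> child node
        self.is_end = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert word, creating missing nodes along its path."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def contains(self, word):
        """Return True only if word itself was added (not just a prefix)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end


trie = Trie()
for w in ["he", "she", "his", "hers"]:
    trie.add(w)
print(trie.contains("hers"), trie.contains("her"))  # True False
```

Both operations follow one edge per character, which is where the linear bound in the string length comes from.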

### Suffix links

To implement the Aho-Corasick algorithm, we add all pattern strings $P_1, \dots, P_k$ to a trie. Next, we add suffix links to the trie.

**Suffix link**: $\pi(u)=v$ if $v$ is the node whose string is the longest proper suffix of the string of $u$ that is present in the trie. If no such node exists, the suffix link of $u$ points to the root.

![Title](https://neerc.ifmo.ru/wiki/images/d/d6/Axo.jpg)

### Aho–Corasick algorithm steps

1) Build a trie with suffix links from $P_1, \dots, P_k$.

2) Process characters from T one by one and follow the links in the trie at each step.


The algorithm constructs a finite-state machine. When a match fails, a suffix link moves directly to another branch of the trie whose prefix equals a suffix of the text read so far, so no character of T is read twice.

Example:

T: shers

Search path:

→ s → h → e → suffixlink → r → s

Functions:

- **goto** function returns the state that is reached from state s by character a.

- **fail** function returns the state to move to when the current character does not match (via the suffix link).

- **out** function returns the set of patterns whose occurrences are detected upon transition to state s.
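The three functions can be sketched with plain lists indexed by state number. This is one possible layout, assumed here for brevity: `goto` is a list of dictionaries, `fail` holds the suffix links, and `out` holds the patterns reported in each state. The function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]  # goto[state][char] -> next state; state 0 is the root
    fail = [0]   # suffix link of each state
    out = [[]]   # patterns detected on entering each state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)

    # Breadth-first traversal: suffix links of shallower states are ready
    # before they are needed. Depth-1 states keep their link to the root.
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] = out[v] + out[fail[v]]  # inherit patterns ending here
            queue.append(v)
    return goto, fail, out


def aho_corasick_search(t, patterns):
    """Return (start position, pattern) for every occurrence in t."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(t):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))
    return matches


print(sorted(aho_corasick_search("shers", ["he", "she", "his", "hers"])))
# [(0, 'she'), (1, 'he'), (1, 'hers')]
```

On the text `shers`, the state for `she` also reports `he` through its suffix link, and the missing `r` edge after `she` is resolved by following that link to the `he` branch.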



# Suffix Array

A suffix array is a sorted array of all suffixes of a string, in which each suffix is stored together with its starting position in the template string.
All suffixes that have the pattern as a prefix form one contiguous block of the array, so the occurrences of a pattern can be found with binary search.

![Title](img/suffix_array.png)
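A naive sketch of this idea, assuming the suffix array is stored as a list of starting positions only. Construction by sorting the suffixes directly is quadratic or worse and is used here only for clarity. The function names are hypothetical.

```python
def build_suffix_array(s):
    """Starting positions of all suffixes of s, sorted by suffix (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """Binary-search the block of suffixes that start with p."""
    m = len(p)
    # Lower bound: first suffix whose first m characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


s = "banana"
sa = build_suffix_array(s)
print(sa)                              # [5, 3, 1, 0, 4, 2]
print(find_occurrences(s, sa, "ana"))  # [1, 3]
```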

https://github.com/dohlee/pysuffixarray/blob/master/pysuffixarray/core.py

# Suffix tree

First, let's build a trie from the suffixes in the suffix array.

![Title](img/suffix_trie.png)

We can compress the suffix trie into the suffix tree to reduce the amount of memory used. Ukkonen's algorithm can be used to construct a suffix tree in linear time.

![Title](img/suffix_tree.png)

To check the occurrence of a pattern, we can go along the corresponding edges in the trie. To get all the positions of its occurrence, we must get the positions of the suffixes from the leaf vertices accessible from a given node.

![Title](img/suffix_trie_search.png)
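The search can be sketched on an uncompressed suffix trie built from nested dictionaries. This is not a suffix tree and not Ukkonen's algorithm: it needs quadratic memory and only illustrates the lookup. The `$` terminator and the use of the key `None` to store a leaf's suffix position are assumptions of this sketch.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie of s + '$' as nested dicts."""
    s += "$"  # assumed terminator, must not occur in s
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf: where this suffix begins
    return root


def collect_positions(node):
    """Gather suffix positions from all leaves reachable from node."""
    positions = []
    stack = [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


def suffix_trie_search(root, p):
    """Walk the edges spelled by p, then report all leaves below."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    return collect_positions(node)


root = build_suffix_trie("banana")
print(suffix_trie_search(root, "ana"))  # [1, 3]
```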


About MUMmer: https://www.biostat.wisc.edu/bmi776/spring-18/lectures/long-alignment.pdf

# Burrows-Wheeler Transform

S = ABAABA\$

1. Get all rotations of the string:

![Title](img/bwt_rotations.png)

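2. Sort the rotations lexicographically (\$ sorts before every letter).

3. Take the last column of the sorted rotations:

**BWT(S) = ABBA\$AA**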

We can compress the BWT result using run-length encoding (RLE):

**1A2B1A1\$2A**
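A direct sketch of the transform and of run-length encoding, assuming the string already ends with a unique `$` that sorts before every other character (true for uppercase letters in ASCII). Sorting all rotations explicitly is quadratic in memory and is used only for illustration. The function names are hypothetical.

```python
from itertools import groupby


def bwt(s):
    """Burrows-Wheeler transform via explicitly sorted rotations."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def rle(s):
    """Run-length encoding: each run becomes <count><character>."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(s))


print(bwt("ABAABA$"))       # ABBA$AA
print(rle(bwt("ABAABA$")))  # 1A2B1A1$2A
```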

![Title](img/bwt_lf.png)

## BWT: reversing

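Start from the row that begins with \$ and repeatedly follow the LF mapping. The characters of the string are recovered from its end to its beginning.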

![Title](img/bwt_reversing.png)
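The reversal can be sketched with the LF mapping: the k-th occurrence of a character in the last column corresponds to the k-th occurrence of that character in the first column. The sketch assumes the transformed string contains exactly one `$`, which sorts before every other character. The function name is hypothetical.

```python
def inverse_bwt(b):
    """Recover the original string (ending in '$') from its BWT."""
    # A stable sort of positions by character yields the first column;
    # order[f_row] is the last-column row holding the same character.
    order = sorted(range(len(b)), key=lambda i: b[i])
    lf = [0] * len(b)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row

    row = 0  # row 0 is the rotation that starts with '$'
    chars = []
    for _ in range(len(b) - 1):
        chars.append(b[row])  # character preceding the current rotation start
        row = lf[row]
    # Characters were collected from the end of the string to its beginning.
    return "".join(reversed(chars)) + "$"


print(inverse_bwt("ABBA$AA"))  # ABAABA$
```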


## BWT: pattern search

![Title](img/bwt_search.png)
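Pattern search on the BWT processes the pattern from its last character to its first, narrowing a range of rows at each step (backward search). The sketch below assumes the string ends with a unique `$`, builds the suffix array naively, and keeps full occurrence tables, which real implementations sample to save memory. All names are hypothetical.

```python
def bwt_search(s, p):
    """Sorted start positions of p in s; s must end with a unique '$'."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    b = "".join(s[i - 1] for i in sa)  # i == 0 wraps around to the final '$'

    # first[c]: index of the first row whose rotation starts with c.
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in b[:i].
    occ = {c: [0] * (len(b) + 1) for c in counts}
    for i, ch in enumerate(b):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (c == ch)

    lo, hi = 0, len(b)  # half-open range of rows that match the suffix so far
    for ch in reversed(p):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(bwt_search("ABAABA$", "ABA"))  # [0, 3]
```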

### BWT lectures:

https://www.youtube.com/watch?v=6BJbEWyO_N0

https://www.youtube.com/watch?v=GWFb_C4IR14

### BWT implementation:

https://github.com/BenLangmead/comp-genomics-class/tree/master/notebooks