# BC205: Algorithms for Bioinformatics.
## VI. Rapid Searches

### Christoforos Nikolaou

### Introduction

#### Sequence similarity. Up to now:

Dealing with the problem of sequence similarity and comparison, we have up to now seen how:
* We can quantify the similarity between two sequences using scoring schemes that are either arbitrarily defined, or based on **substitution matrices** obtained from molecular evolution approaches.
* We can define an objective methodology that maximizes sequence similarity through the concept of **alignment**.
* We can distinguish between **global** (for the entire sequence length) and **local** (for subparts of the sequences) alignment.

#### More complex problems
We are now moving on to a different aspect of the sequence similarity problem, one that stems from the scale of sequence size and volume.  

Consider the following problems:

1. We need to check the similarity not between two sequences but between one sequence and a much longer one (such as, for instance, a whole genome).
2. We would like to search for similarity for a given sequence in a database that contains hundreds of thousands, or millions of sequences.  
  
In both of the above problems we cannot proceed with sequence alignment in the way we have seen up to now. In fact we need to consider rather different strategies for rapid sequence matches. In this and the next week, we will discuss:

a. How we may approach the problem of similarity with identical matches. These sound more restrictive than the alignment concept but we will see how we can apply alignment strategies on selected identical matching "seeds" that effectively help us speed up the whole process.

b. How we can use data transformation techniques to enable the speeding up and parallelization of sequence searches.

### Pattern Searches

We will start from the simplest, yet fundamental, approach to the string matching problem. Consider a long sequence we will call _sequence_, and a shorter one we will call _pattern_. Our goal is to write a program that will identify matches of _pattern_ within _sequence_. Even though we are tempted to use pre-defined functions, it is a good exercise to try and think how we would do this from scratch.  

### Naive Pattern Search

Consider the simplest way possible. 

We start from the beginning of _sequence_ and compare every substring whose length equals that of _pattern_ against _pattern_, one character at a time. We do this exhaustively for all substrings. Thus, the steps are:

1. Start from _sequence[0]_ and loop one residue at a time (i=0)    
2. Take the substring of _sequence_ that starts at _i_ and has the same length as _pattern_  
3. Start from _pattern[0]_ (j=0) and compare _pattern[j]_ to _sequence[i+j]_, increasing _j_ for as long as the characters match
4. If there is a mismatch, exit, take the next _i_ and go to 2
5. If there are no mismatches and _j_ reaches the length of _pattern_, report a full match, take the next _i_ and go to 2  
  
Below is a Python script to do this in the simplest way possible. 

In [13]:
# Naive Pattern Searching algorithm
def naivePatternSearch(pattern, sequence):
    p = len(pattern)
    s = len(sequence)
    no = 0
 
    # We slide pattern one residue/character at a time
    for i in range(s - p + 1):
        no += 1
        j = 0
        
        # for each pairing of pattern to sequence we check characters starting from the beginning
        while(j < p):
            if (sequence[i + j] != pattern[j]): 
                break
            j += 1
 
        if (j == p):
            print("Pattern found at index ", i)
    return("Number of steps taken=", no)
 
naivePatternSearch('abba', 'IhateabbaandIdontlikealibabba')

Pattern found at index  5
Pattern found at index  25


('Number of steps taken=', 26)

### Naive Pattern Search. Limitations

Can you spot problems with the approach above that make it slow and inefficient? In which algorithm category do you think it falls?

### Optimized naive search

There are actually two points that we should consider. One is that in the case of a long pattern that matches in the first _p-1_ positions but not in the final one, we make a lot of comparisons that we could do away with.
The other is that we may be able to speed up the process by sliding the pattern not by 1 character at a time but by more, provided that the _pattern_ has a particular property: all of its characters must be different. When a mismatch occurs after _j_ matches, we then know that the first character of the pattern cannot match any of the other _j-1_ matched characters of the sequence, because these are equal to pattern characters that all differ from the first one. So we can always slide the pattern **not by 1 but by j** without missing any valid shifts. When the pattern contains repeated characters this guarantee no longer holds, and sliding by _j_ may skip over a valid match.

The following is the modified code, optimized for such special patterns. 


In [12]:
# Optimized Naive Search 
def optNaivePatternSearch(pattern, sequence):
    p = len(pattern)
    s = len(sequence)
    i = 0
    no = 0
  
    while i <= s - p:
        no += 1
        # For the current index i, count how many leading characters match
        j = 0
        while j < p and sequence[i + j] == pattern[j]:
            j += 1
  
        if j == p:    # pattern[0..p-1] == sequence[i..i+p-1]
            print("Pattern found at index " + str(i))
            i = i + p
        elif j == 0:
            i = i + 1
        else:
            i = i + j    # slide the pattern by j
    return("Number of steps taken=", no)

optNaivePatternSearch('abba', 'IhateabbaandIdontlikealibabba')


Pattern found at index 5
Pattern found at index 25


('Number of steps taken=', 23)

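Notice how the optimized approach takes 3 fewer steps than the naive one. (Note that 'abba' contains repeated characters, so the result is correct here only because no valid match happens to be skipped.) The gain is even greater when the pattern is longer and more variable than the one above. Compare: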


In [18]:
naivePatternSearch('ledzeppelin', 'IhateabbaledzeppeliniswhatIreallylove')



Pattern found at index  9


('Number of steps taken=', 27)

with:

In [19]:
optNaivePatternSearch('ledzeppelin', 'IhateabbaledzeppeliniswhatIreallylove')

Pattern found at index 9


('Number of steps taken=', 17)

You see that with a long pattern the search becomes much faster, since we don't need to slide it exhaustively. 
A number of exact pattern matching approaches are based on efficient pattern sliding techniques. In the following we will discuss some of the most common ones. 
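
A caveat on the sliding rule, as a minimal sketch: sliding by _j_ is only guaranteed to be safe when all characters of the pattern are different. The helper names `naive_matches` and `sliding_matches` below are hypothetical; `sliding_matches` applies the same sliding rule as `optNaivePatternSearch` but returns the hits instead of printing them. With a pattern that repeats a character, a valid match can be skipped:

```python
def naive_matches(pattern, sequence):
    """Reference scan: test every window, one position at a time."""
    p = len(pattern)
    return [i for i in range(len(sequence) - p + 1) if sequence[i:i + p] == pattern]


def sliding_matches(pattern, sequence):
    """Same sliding rule as optNaivePatternSearch, returning the hit positions."""
    p, s = len(pattern), len(sequence)
    hits, i = [], 0
    while i <= s - p:
        j = 0
        while j < p and sequence[i + j] == pattern[j]:
            j += 1
        if j == p:
            hits.append(i)
            i += p          # jump past the whole match
        else:
            i += max(j, 1)  # slide by the number of matched characters
    return hits


seq = "ledzeppeledzeppelin"
print(naive_matches("ledzeppelin", seq))    # [8]
print(sliding_matches("ledzeppelin", seq))  # [] : the match at 8 is skipped

print(naive_matches("abba", "abbabba"))     # [0, 3]
print(sliding_matches("abba", "abbabba"))   # [0] : the overlapping match is skipped
```

The first case fails because 'l' occurs twice in the pattern; the second because jumping by the full pattern length after a match ignores overlapping occurrences.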

### More efficient approaches in pattern sliding. Knuth-Morris-Pratt Algorithm (KMP)

Both naive approaches don’t work well in cases where we see many matching characters followed by a mismatching character. This is because we still have to try to match a large number of characters before realizing that an exact match is impossible (think, e.g., of the case of 'ledzeppelin' in 'Iloveledzeppelix'). 
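
To make the cost of such near-matches concrete, here is a small sketch that counts individual character comparisons (rather than window positions) in the naive scan. The function name `count_naive_comparisons` is a hypothetical helper, not part of the code above:

```python
def count_naive_comparisons(pattern, sequence):
    """Return (match positions, number of character comparisons) for the naive scan."""
    p, s = len(pattern), len(sequence)
    hits, comparisons = [], 0
    for i in range(s - p + 1):
        j = 0
        while j < p:
            comparisons += 1
            if sequence[i + j] != pattern[j]:
                break
            j += 1
        if j == p:
            hits.append(i)
    return hits, comparisons


# Every window matches three characters before failing on the fourth
print(count_naive_comparisons("aaab", "a" * 12))                    # ([], 36)
# Ten characters match at position 5 before the final mismatch
print(count_naive_comparisons("ledzeppelin", "Iloveledzeppelix"))   # ([], 17)
```

In the first call each of the 9 windows costs 4 comparisons, which is the worst case for the naive scan: the work grows with the product of the pattern length and the number of windows.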

The KMP matching algorithm focuses **again** on the _pattern_, and in particular on specific structural properties of it, for instance whether the pattern contains sub-patterns that appear more than once. The basic idea behind the KMP algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will match anyway.  
  
For instance check the following example:

sequence : XXXXAAAAAXXBB
pattern : AAA

The _pattern_ will match three consecutive positions in sequence but because its structure is repetitive, once we have matched it for the first time, we don't need to match the first two positions in the second match. We can simply move to the third position and check that one to ascertain the match. 

#### Preprocessing in KMP
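KMP uses the structure of the pattern to allow for efficient slides like the one above. It does so by preprocessing the pattern. The idea of preprocessing is to create an auxiliary array, called `lps` in the code below, that holds for every position of the pattern the length of the longest proper prefix that is also a suffix of the pattern up to that position.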
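
As a minimal sketch of what the preprocessing step computes, the function below builds the "longest prefix that is also a suffix" table for a pattern. The name `prefix_table` is a hypothetical helper; the logic mirrors the `computeLPSArray` function used in the KMP cell:

```python
def prefix_table(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0  # length of the current matching prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # fall back to the next shorter candidate prefix, without advancing i
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


print(prefix_table("AAA"))        # [0, 1, 2]
print(prefix_table("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
print(prefix_table("AAACAAAA"))   # [0, 1, 2, 0, 1, 2, 3, 3]
```

For the pattern AAA the table [0, 1, 2] says that after a full match two characters are already known to match at the next position, so only one new character needs to be checked.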



In [6]:
# Python program for KMP Algorithm
def KMPSearch(pat, txt):
    M = len(pat)
    N = len(txt)
  
    # create lps[] that will hold the longest prefix suffix 
    # values for pattern
    lps = [0]*M
    j = 0 # index for pat[]
  
    # Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps)
  
    i = 0 # index for txt[]
    while i < N:
        if pat[j] == txt[i]:
            i += 1
            j += 1
  
        if j == M:
            print("Found pattern at index " + str(i-j))
            j = lps[j-1]
  
        # mismatch after j matches
        elif i < N and pat[j] != txt[i]:
            # Do not match lps[0..lps[j-1]] characters,
            # they will match anyway
            if j != 0:
                j = lps[j-1]
            else:
                i += 1
  
def computeLPSArray(pat, M, lps):
    length = 0 # length of the previous longest prefix suffix
  
    lps[0] = 0 # lps[0] is always 0
    i = 1
  
    # the loop calculates lps[i] for i = 1 to M-1
    while i < M:
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            # This is tricky. Consider the example
            # AAACAAAA and i = 7. The idea is similar
            # to the search step.
            if length != 0:
                length = lps[length-1]
  
                # Also, note that we do not increment i here
            else:
                lps[i] = 0
                i += 1
  
txt = "ABABDABACDABABCABAB"
pat = "ABABCABAB"
KMPSearch(pat, txt)

Found pattern at index 10


### Boyer-Moore Algorithm (BM)

In [11]:
# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm
 
NO_OF_CHARS = 256
 
def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
 
    # Initialize all occurrence as -1
    badChar = [-1]*NO_OF_CHARS
 
    # Fill the actual value of last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i
 
    # return the filled list
    return badChar
 
def search(txt, pat):
    '''
    A pattern searching function that uses Bad Character
    Heuristic of Boyer Moore Algorithm
    '''
    m = len(pat)
    n = len(txt)
 
    # create the bad character list by calling
    # the preprocessing function badCharHeuristic()
    # for given pattern
    badChar = badCharHeuristic(pat, m)
 
    # s is shift of the pattern with respect to text
    s = 0
    while(s <= n-m):
        j = m-1
 
        # Keep reducing index j of pattern while
        # characters of pattern and text are matching
        # at this shift s
        while j>=0 and pat[j] == txt[s+j]:
            j -= 1
 
        # If the pattern is present at current shift,
        # then index j will become -1 after the above loop
        if j<0:
            print("Pattern occurs at shift = {}".format(s))
 
            # Shift the pattern so that the next character in text
            # aligns with the last occurrence of it in pattern.
            # The condition s+m < n is necessary for the case when
            # pattern occurs at the end of text
            s += (m-badChar[ord(txt[s+m])] if s+m<n else 1)
        else:
            # Shift the pattern so that the bad character in text
            # aligns with the last occurrence of it in pattern. The
            # max function is used to make sure that we get a positive
            # shift. We may get a negative shift if the last occurrence
            # of bad character in pattern is on the right side of the
            # current character.
            s += max(1, j-badChar[ord(txt[s+j])])
 
 
# Driver program to test above function
def main():
    txt = "ABAAABCD"
    pat = "ABC"
    search(txt, pat)
 
if __name__ == '__main__':
    main()
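
The 256-entry table above assumes that every character has a code point below 256. As a hedged alternative sketch, the same bad character rule can be written with a dictionary, which works for any alphabet and returns the shifts instead of printing them. The names `last_occurrence` and `bm_bad_char_search` are hypothetical:

```python
def last_occurrence(pattern):
    """Map each character of the pattern to the index of its last occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}


def bm_bad_char_search(text, pattern):
    """Boyer-Moore search using only the bad character rule; returns all shifts."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    last = last_occurrence(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        # compare from the right end of the pattern
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            shifts.append(s)
            # align the character after the window with its last occurrence
            s += (m - last.get(text[s + m], -1)) if s + m < n else 1
        else:
            # align the mismatching text character; always move at least 1
            s += max(1, j - last.get(text[s + j], -1))
    return shifts


print(bm_bad_char_search("ABAAABCD", "ABC"))  # [4]
print(bm_bad_char_search("AAAA", "AA"))       # [0, 1, 2]
```

Characters that do not occur in the pattern get the default value -1, which reproduces the behaviour of the array initialised with -1.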

### FASTA

#### FastA - Steps

1. **Hashing:** FastA locates regions of the query sequence and matching regions in the database sequences that have high densities of exact word matches (without gaps). The length of the matched word is called the ktup parameter.
2. **Scoring:** The ten highest scoring regions are rescored using a scoring matrix. The score for such a pair of regions is saved as the init1 score.
3. **Introduction of Gaps:** FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region, at the end of this step, is saved as the initn score.
4. **Alignment:** After computing the initial scores, FastA determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. The score for this alignment is the opt score.
5. **Random Sequence Simulation:** In order to evaluate the significance of such an alignment, FastA empirically estimates the score distribution from the alignment of many random pairs of sequences. More precisely, the characters of the query sequence are reshuffled (to maintain bias due to length and character composition) and searched against a random subset of the database. This empirical distribution is extrapolated, assuming it is an extreme value distribution, and each alignment to the real query is assigned a Z-score and an E-value.
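
The hashing step can be illustrated with a toy sketch. This is not FastA's actual implementation, only the underlying idea: index all words of length ktup in the query, then look up every word of the target and count exact word hits per diagonal (the offset between target and query positions). The function names `ktup_index` and `diagonal_hits` are hypothetical:

```python
from collections import Counter, defaultdict


def ktup_index(query, ktup):
    """Map every word of length ktup in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, target, ktup=2):
    """Count exact word matches on each diagonal (target position - query position)."""
    index = ktup_index(query, ktup)
    diagonals = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


hits = diagonal_hits("ACGT", "TTACGTTT", ktup=2)
print(hits.most_common(1))  # [(2, 3)]
```

A diagonal with many hits (here diagonal 2, with three consecutive word matches) marks a gap-free region of high similarity, which is what the later steps rescore and try to join.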

In [15]:
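# Creation of a PSSM
import math
import numpy as np

def pssm(pwm, nucfreqs, pseudocount=0.01):
    # pwm: one row per nucleotide, one column per motif position
    # nucfreqs: background frequency of each nucleotide, in the same row order
    scores = []
    for i in range(len(nucfreqs)):
        # the pseudocount avoids taking the logarithm of zero
        ratios = (np.array(pwm[i]) + pseudocount) / nucfreqs[i]
        # log2 of the ratio between observed and background frequency
        scores.append([math.log2(r) for r in ratios])
    return scores

# mypwm and nucfreqs are assumed to have been defined in an earlier cell
mypssm = pssm(mypwm, nucfreqs)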
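
Since the input matrices are not shown here, the following is a small worked sketch with made-up values. `toy_pwm` and `toy_background` are hypothetical (a 3-position motif that prefers ACG and a uniform background), and `log_odds_matrix` and `kmer_score` are illustrative helper names written in plain Python:

```python
import math

NUCLEOTIDES = "ACGT"


def log_odds_matrix(pwm, background, pseudocount=0.01):
    """log2 ratio of (frequency + pseudocount) over background, row by row."""
    return [[math.log2((p + pseudocount) / bg) for p in row]
            for row, bg in zip(pwm, background)]


def kmer_score(kmer, pssm):
    """Sum the log-odds of each nucleotide at its position."""
    return sum(pssm[NUCLEOTIDES.index(nuc)][pos] for pos, nuc in enumerate(kmer))


# Hypothetical position weight matrix: rows are A, C, G, T; columns are positions
toy_pwm = [
    [0.80, 0.10, 0.10],  # A
    [0.10, 0.70, 0.10],  # C
    [0.05, 0.10, 0.70],  # G
    [0.05, 0.10, 0.10],  # T
]
toy_background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background

toy_pssm = log_odds_matrix(toy_pwm, toy_background)
print(round(kmer_score("ACG", toy_pssm), 2))  # positive: fits the motif
print(round(kmer_score("TTT", toy_pssm), 2))  # negative: less likely than background
```

Positive entries mark nucleotides that are more frequent in the motif than in the background, so summing them along a k-mer gives a score that can be compared against a threshold.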

#### c. Searching a sequence with a PWM/PSSM or with Hamming Distance 

In [16]:
def pssmSearch(pssm, sequence, threshold):
    nuc = ['A', 'C', 'G', 'T']
    width = len(pssm[0])    # motif length
    hits = []
    instances = []
    # + 1 so that the last window of the sequence is also scored
    for i in range(len(sequence) - width + 1):
        instance = sequence[i:i + width]
        score = 0
        for l in range(width):
            score = score + pssm[nuc.index(instance[l])][l]
        if score > threshold:
            hits.append(i)
            instances.append(instance)
    return (hits, instances)

# mypssm comes from the cell above; targetsequence is assumed to be defined elsewhere
out = pssmSearch(mypssm, targetsequence, 9)
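
The Hamming distance variant mentioned in the heading can be sketched in the same sliding-window fashion: instead of scoring each window with a matrix, we count its mismatches against a known motif. The function names and the mismatch cutoff are hypothetical choices for illustration:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_search(motif, sequence, max_mismatches):
    """Return (position, instance) for windows within max_mismatches of the motif."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if hamming(motif, window) <= max_mismatches:
            hits.append((i, window))
    return hits


print(hamming_search("ACGT", "TTACGATT", 1))  # [(2, 'ACGA')]
```

Unlike the PSSM score, the Hamming distance treats all positions and all substitutions as equally important.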

### BLAST

In [None]:
# code to be added

### The next problem. Discover a new motif from a given set of sequences

#### Part 1. Formulating the problem
1. Given a set of sequences that each contains an instance of the motif, find the motif.

#### A first approach 


1. Given a set of s sequences, find a set of k-mers (for a given length k, one from each sequence) that maximizes the score (or minimizes the distance) of each k-mer with respect to its sequence
2. Collect the k-mers
3. Create a motif from them



### Brute Force Approach

What is the complexity of the brute force approach (BFA)?
1. Number of k-mers: $4^k$
2. Number of k-mers in each sequence: $(n - k + 1)$
3. Number of calculations for each k-mer given s sequences of length n: $(n - k + 1)^s$
4. Total number of calculations: $4^k (n - k + 1)^s$

The complexity of the algorithm is at least $O(n^s)$.

We need something faster!



* Assuming we have a way to calculate the distance of a k-mer k from a given sequence seq:

```
for k in kmers:
    for seq in sequences:
        if distance(k, seq) < min_distance:
            min_distance <- distance(k, seq)
            motif[seq] <- k
```

* Because each k-mer needs to pass only once through each sequence, the median string has $O(4^k)$ complexity, because k is (usually) much shorter than the length of the sequence.

* However, it is still quite slow, and for k>10 it is impractical to apply.
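
A minimal sketch of the median string idea, under the assumption that the distance of a k-mer from a sequence is the smallest Hamming distance between the k-mer and any window of that sequence. The function names are hypothetical, and every sequence is assumed to be at least k long:

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def distance(kmer, seq):
    """Smallest Hamming distance between the k-mer and any window of seq."""
    k = len(kmer)
    return min(hamming(kmer, seq[i:i + k]) for i in range(len(seq) - k + 1))


def median_string(sequences, k):
    """Try all 4^k k-mers and keep the one with the smallest total distance."""
    best_kmer, best_total = None, float("inf")
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        total = sum(distance(kmer, seq) for seq in sequences)
        if total < best_total:
            best_kmer, best_total = kmer, total
    return best_kmer, best_total


print(median_string(["TTACGTT", "ACGAAAA", "GGGGACG"], 3))  # ('ACG', 0)
```

The outer loop runs over all $4^k$ candidate k-mers, which is why the approach quickly becomes too slow as k grows.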



### An alternative

* Assume a greedy approach that goes through all sequences, updating a motif every time

* Starting from sequence i:

1. find the most common k-mer
2. create a profile from it (adding pseudocounts to all 0-values)
3. go to the next sequence
4. choose the k-mer that best fits the profile
5. store that k-mer in the collection and update the profile
6. iterate steps 3->5.

* We’ve just described a Greedy approach for discovering a motif p of a given length k among t sequences.


### A greedy Approach

* Assuming a set of s sequences and a given consensus k-mer k:

* We will construct a PWM “on the go” as we move from one sequence to the next.
1. Start from the first sequence (i=1)
2. For each k-mer in seq_1:  
    2.1 For i = 2 to i = s:  
    2.2 Find the best (smallest distance) k-mer in seq_i  
    2.3 Build (update) the profile with it  
    2.4 If score(profile) is better than all previous ones, keep this profile  
3. Repeat for the next k-mer of seq_1
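
The steps above can be sketched as follows. This is one possible reading, not a definitive implementation: the profile stores nucleotide frequencies with a pseudocount of 1, the "best" k-mer is the most probable one under the current profile, and a motif set is scored by its number of mismatches to the consensus (lower is better). All function names are hypothetical:

```python
def build_profile(motifs, pseudocount=1):
    """Per-position nucleotide frequencies, with pseudocounts to avoid zeros."""
    total = len(motifs) + 4 * pseudocount
    profile = []
    for pos in range(len(motifs[0])):
        column = [motif[pos] for motif in motifs]
        profile.append({nuc: (column.count(nuc) + pseudocount) / total
                        for nuc in "ACGT"})
    return profile


def kmer_probability(kmer, profile):
    """Probability of the k-mer under the profile."""
    prob = 1.0
    for pos, nuc in enumerate(kmer):
        prob *= profile[pos][nuc]
    return prob


def best_kmer(seq, k, profile):
    """The window of seq that best fits the profile."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(windows, key=lambda w: kmer_probability(w, profile))


def mismatch_score(motifs):
    """Total number of mismatches to the consensus; lower is better."""
    score = 0
    for pos in range(len(motifs[0])):
        column = [motif[pos] for motif in motifs]
        score += len(column) - max(column.count(nuc) for nuc in "ACGT")
    return score


def greedy_motif_search(sequences, k):
    best = [seq[:k] for seq in sequences]  # arbitrary starting point
    first = sequences[0]
    for i in range(len(first) - k + 1):
        motifs = [first[i:i + k]]          # seed with a k-mer of the first sequence
        for seq in sequences[1:]:
            profile = build_profile(motifs)  # profile built "on the go"
            motifs.append(best_kmer(seq, k, profile))
        if mismatch_score(motifs) < mismatch_score(best):
            best = motifs
    return best


print(greedy_motif_search(["TTACGTT", "ACGAAAA", "GGGGACG"], 3))
# ['ACG', 'ACG', 'ACG']
```

Only the k-mers of the first sequence are ever used as seeds, which is what makes the search fast and also what makes it sensitive to the choice of that first sequence.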


### Analysis. Greedy Approach

1. Why Greedy: It takes k-mers from the first sequence only and scans for them in the following ones. Thus it doesn’t go through all combinations of sequences and k-mers. As we’ve seen above, what we gain from this trade-off is speed.
2. KEY: It assumes that all sequences contain the motif. If the first sequence doesn’t contain the motif (in any variation), then we are doomed to look for something nonsensical.
3. A way around this is to sample a small percentage of sequences randomly, which brings us to the next-to-last chapter of the motif finding problem.


### A randomized approach

* In the Greedy Approach we take the k-mers from the first sequence and scan over the rest. In this way an initial wrong choice may lead to disastrous results.
* In a Randomized Approach we start instead with a collection of s k-mers, one from each sequence, build a profile, scan the sequences with that profile, update it and repeat until the k-mer set is a good enough match for the updated profile.
* Stop and think of the problems we get rid of with this approach.


### Pseudocode

```
for seq in sequences:  
    profile[seq]<-random(k, seq)  
while distance(profile, sequences)>threshold:  
    for seq in sequences:  
        profile[seq]<-max(k, profile, seq)  
```
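
A hedged Python sketch of the randomized approach follows. Two assumptions differ from the pseudocode and are choices made for this illustration: instead of a fixed distance threshold, a run stops as soon as an update no longer improves the score, and the whole procedure is restarted several times from new random k-mers, keeping the best run. All function names are hypothetical:

```python
import random


def build_profile(motifs, pseudocount=1):
    """Per-position nucleotide frequencies, with pseudocounts to avoid zeros."""
    total = len(motifs) + 4 * pseudocount
    return [{nuc: ([m[pos] for m in motifs].count(nuc) + pseudocount) / total
             for nuc in "ACGT"}
            for pos in range(len(motifs[0]))]


def most_probable_kmer(seq, k, profile):
    """The window of seq with the highest probability under the profile."""
    def probability(kmer):
        prob = 1.0
        for pos, nuc in enumerate(kmer):
            prob *= profile[pos][nuc]
        return prob
    return max((seq[i:i + k] for i in range(len(seq) - k + 1)), key=probability)


def mismatch_score(motifs):
    """Total number of mismatches to the consensus; lower is better."""
    score = 0
    for pos in range(len(motifs[0])):
        column = [m[pos] for m in motifs]
        score += len(column) - max(column.count(nuc) for nuc in "ACGT")
    return score


def randomized_motif_search(sequences, k, rng):
    # start from one random k-mer per sequence
    best = []
    for seq in sequences:
        start = rng.randrange(len(seq) - k + 1)
        best.append(seq[start:start + k])
    while True:
        profile = build_profile(best)
        motifs = [most_probable_kmer(seq, k, profile) for seq in sequences]
        if mismatch_score(motifs) < mismatch_score(best):
            best = motifs   # improvement: update and scan again
        else:
            return best     # no improvement: stop this run


def best_of_restarts(sequences, k, restarts=50, seed=0):
    """Repeat the randomized search and keep the best-scoring motif set."""
    rng = random.Random(seed)
    runs = (randomized_motif_search(sequences, k, rng) for _ in range(restarts))
    return min(runs, key=mismatch_score)


motifs = best_of_restarts(["TTACGTT", "ACGAAAA", "GGGGACG"], 3)
print(motifs, mismatch_score(motifs))
```

Each run terminates because the score strictly decreases with every accepted update and cannot drop below zero. A single run may end in a poor local optimum, which is why restarts are commonly used.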
