# BC205: Algorithms for Bioinformatics.
## V. Sequence Comparison and Alignment
### Christoforos Nikolaou

### Goals for this class

* To present the biological problem of Sequence Comparison
* To define String Similarity and show how different definitions change the problem
* To present the problem of Sequence Alignment
* To solve the problem of Sequence Alignment using the algorithm presented by Saul B. Needleman and Christian D. Wunsch (1970)
* To present the concept of Dynamic Programming as an algorithmic technique


### Introduction. The biological problem
In this class we move on to problems of similarity in sequence analysis. The two basic questions we will deal with are:

* Given two sequences of comparable length, what is the best way to objectively define a measure of their similarity?
* Given two sequences with significantly different lengths, how can we identify all identical matches of the shorter within the longer one?



### Part A. Objectively defining the similarity between two sequences

Biological Context

This is one of the most fundamental problems of bioinformatics, with its roots in the study of:   
*   Phylogenetic relationships at the molecular level, e.g. the comparison of orthologous sequences  
*   Evolutionary dynamics in genomes, e.g. the identification of gene duplications, rearrangements etc  
*   Genomic variability, e.g. comparisons of the same sequence among different individuals


### Sequence Similarity. The problem
How similar are two sequences?

Let’s start by looking into a very simple example of two 10-nucleotide sequences
```
G G G A A T T T C C
G G C A A T T T C C
```
It is obvious that they differ in only one nucleotide. The question is how we can quantify this difference.


### Measures of sequence distance

In Computer Science this is a problem of “character string comparison” (or “string comparison”). There are many measures for quantifying such similarity. The simplest one, which we will call “Edit Distance” here, counts substitutions only; in this restricted form it is also known as “Hamming Distance”. We can define it as follows:

**Edit distance is the number of residues we need to edit/substitute to obtain one sequence from the other, without re-arrangements, deletions or insertions.**

### Edit Distance of two Sequences

![Edit Distance 1](figures/EditDistance1.png)

In the figure you see that, between the two sequences, there are 9 identical residues, meaning only one change is needed. 

We say that they have Distance=1. This can be scaled by the length of the sequences: Distance=1/10=0.1

In a similar way we can quantify the distance of the following pair of sequences as Distance=3/10=0.3

![Edit Distance 2](figures/EditDistance2.png)
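The distance defined above can be sketched in a few lines of Python. The function names `hamming_distance` and `normalized_distance` are chosen here for illustration only; the sequences are the two 10-nucleotide examples shown earlier.

```python
def hamming_distance(seq1, seq2):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def normalized_distance(seq1, seq2):
    """Scale the distance by the sequence length, giving a value in [0, 1]."""
    return hamming_distance(seq1, seq2) / len(seq1)


seq1 = "GGGAATTTCC"
seq2 = "GGCAATTTCC"
print(hamming_distance(seq1, seq2))     # 1
print(normalized_distance(seq1, seq2))  # 0.1
```

Note that this sketch refuses sequences of different lengths, because without insertions or deletions the positions cannot be paired up.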

### Going beyond distance measures
What if the sequences are not so similar? In the example we see below the distance is calculated as D=9/10=0.9. 

![Edit Distance 3](figures/EditDistance3.png)

This means that the two sequences are only 10% similar to each other. Are they though?

If one “slides” the two sequences against each other we can obtain a smaller distance and higher similarity as residues become “aligned”

![Edit Distance 4](figures/MatchVsSlide.png)

The sequences now have a much lower distance than 0.9, even allowing for the fact that the combined length of the two aligned sequences is now 12 nucleotides: D=4/12=0.33.

The main point here is that we have allowed ourselves the additional liberty of displacing the two sequences against each other, without changing the order of residues in either of the two.
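A minimal sketch of this “sliding” idea is to try every possible displacement and count the matching residues for each one. The function name `best_slide` and the two sequences are hypothetical (they are not the ones in the figure); the second sequence is simply the first one displaced by two positions.

```python
def best_slide(seq1, seq2):
    """Try every ungapped offset of seq2 against seq1; return (matches, offset).

    offset is the position of seq1 that is aligned with seq2[0]; it is negative
    when seq2 starts to the left of seq1.
    """
    best = (0, 0)
    for offset in range(-(len(seq2) - 1), len(seq1)):
        matches = 0
        for j, residue in enumerate(seq2):
            i = offset + j
            if 0 <= i < len(seq1) and seq1[i] == residue:
                matches += 1
        if matches > best[0]:
            best = (matches, offset)
    return best


# Hypothetical sequences, not taken from the figure
seq1 = "ACGTACGTAC"
seq2 = "GTACGTACGG"
matches, offset = best_slide(seq1, seq2)
print(matches, offset)  # 8 2

# Combined length of the two displaced sequences, and the resulting distance
span = max(len(seq1), offset + len(seq2)) - min(0, offset)
print(span, round((span - matches) / span, 2))  # 12 0.33
```

Here the overhanging ends are counted as differences, which is one possible convention among several.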

### The Sequence Alignment Problem - Description

The highlighted text above is a good definition of Sequence Alignment.

This is:

_Assuming two sequences, which is the best way we can “slide” one against the other, without altering the order of residues, in order to maximize their similarity?_

In the example shown below, we see that two sequences can be “aligned” in more than one way.

![MatchVsSlide2](figures/MatchVsSlide2.png)

Both of the alignments above give rise to 5 “matching” residues, but the first requires 4 more “slides” to open a gap in the middle of the second sequence. Do we prefer the first or the second? Should we take into account only the “matching” residues or also the number of “gaps” in the alignment?


### The Sequence Alignment Problem - Definition

The question that arises is what exactly we mean by the “best way” mentioned earlier. The problem is now reduced to the following:

_Given two sequences, how can we determine which of their possible alignments gives the highest similarity score, given a scoring of matches, mismatches and gaps?_

We will have to provide some specifications on how to score all three of them.

![SequenceAlignment](figures/SequenceAlignmentSimilarity.png)
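To make the idea of scoring concrete, here is a small sketch that sums the score of an alignment column by column. The function name `score_alignment`, the use of `-` as the gap symbol, and the two example alignments are assumptions made for illustration; the default values match=1, mismatch=-2, gap=-2 are one simple choice of scheme.

```python
def score_alignment(aligned1, aligned2, match=1, mismatch=-2, gap=-2):
    """Sum the per-column scores of two gapped strings of equal length."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Two hypothetical alignments of the same pair of sequences (GATTACA, GACTAA)
print(score_alignment("GATTACA", "GACTA-A"))  # 1
print(score_alignment("GATTACA", "GA-CTAA"))  # -5
```

The same pair of sequences receives very different scores depending on where the gap is placed, which is why we need a systematic way to search for the best alignment.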

So, assuming we have e.g. the DNA sequences:

```
ACACAGACATAGCATCGACTAGGAGAGA
ACAGGGACTAGGTTGCCA
```

What is the best way to align them?

### The Sequence Alignment Problem - An example

We will start the description of the problem with a somewhat strange analogy: a sailor who wants to travel across the Aegean Sea.  
His problem is described [here](https://rpubs.com/ChristoforosNikolaou/MBioMedPtIIb_A) and can also be found in a set of slides [here](cb_2016_lecture_04_seqcomparison.pdf).

Below is the problem.

A sailor is set to sail from the port of Rafina to Astypalaia.

![](figures/Cruise1.png)

The weather and other constraints only permit him to move to the right, to the bottom or diagonally bottom-right.   
Before he begins, he has scored in advance all the places he may (or may not) visit.  

![](figures/Cruise2.png)

The Question is: **Which route should he choose given the constraints and the scoring board, that will allow him to gather the maximum total score?**  

![](figures/Cruise3.png)

One first -and rather obvious choice- is the **greedy approach**.   
We start from the top-left corner and always take the allowed route to the position with the maximum score.  

![](figures/Cruise4.png)
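The greedy approach can be sketched as follows. The function name `greedy_route` and the 3x3 scoring grid are hypothetical (the values are not those of the figures), and the score of the starting square is assumed to count towards the total.

```python
def greedy_route(grid):
    """Always step to the allowed neighbour with the highest score."""
    rows, cols = len(grid), len(grid[0])
    i, j = 0, 0
    total = grid[0][0]
    route = [(0, 0)]
    while (i, j) != (rows - 1, cols - 1):
        # Allowed moves: right, down, diagonal; keep those inside the grid
        candidates = [(i, j + 1), (i + 1, j), (i + 1, j + 1)]
        candidates = [(r, c) for r, c in candidates if r < rows and c < cols]
        i, j = max(candidates, key=lambda pos: grid[pos[0]][pos[1]])
        total += grid[i][j]
        route.append((i, j))
    return total, route


# Hypothetical scoring grid
grid = [
    [1, 5, 1],
    [2, 1, 9],
    [8, 8, 1],
]
print(greedy_route(grid))  # (16, [(0, 0), (0, 1), (1, 2), (2, 2)])
```

On this grid the route along the left and bottom edges collects 1+2+8+8+1=20, so the greedy choice is not guaranteed to find the best route.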

A different approach would be an **exhaustive, brute force approach**.   
In this, we would calculate the score for **every possible** complete route and then choose the best.  
The number of possible routes grows exponentially with the size of the grid, so this is not a viable option.  
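To get a feeling for how fast the number of routes grows, we can count them. A route into a square must come from the left, from above or from the upper-left diagonal, so the count for a grid is the sum of the counts for the three smaller grids. The function name `count_routes` is chosen here for illustration; the resulting numbers are known as Delannoy numbers.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_routes(rows, cols):
    """Number of routes across a grid of rows x cols squares,
    moving only right, down or diagonally down-right."""
    if rows == 1 or cols == 1:
        return 1  # a single row or column leaves only one route
    return (count_routes(rows - 1, cols)
            + count_routes(rows, cols - 1)
            + count_routes(rows - 1, cols - 1))


for size in (2, 3, 4, 8, 16):
    print(size, count_routes(size, size))
```

A 4x4 grid already has 63 routes, and each of them would have to be scored separately in a brute force search.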
  
So, we need **a new approach**.  
This is what we will do:   

We will start by filling the first row and the first column with the scores that can be calculated.  
This is easy because in the case of horizontal/vertical displacements in the grid there are no options.  
Only one choice exists.  

![](figures/Cruise5.png)

Now, here comes the main trick: We will then work our way **backward**.   
Instead of asking which is the best square to _move to_ from a given point, we will ask which is the best way to _have arrived at_ a given point.  

This can be done only for squares in the grid that:  
*   lie at the bottom-right corner of a set of 4 squares  
*   have all three other squares pre-calculated   

We can thus apply a simple rule for the first internal square, like this:  
1. We examine our three options for arriving at this square: from the top vertically, from the left horizontally and from the upper-left diagonally  
2. We choose the greatest score combination, by adding the score of the square of origin to the one of the destination.  
3. We assign the **cumulative score** to the square of destination and **mark the direction of origin** with a _tracking arrow_.  

![](figures/Cruise6.png)

Once we do this for the first internal square, two more squares "open up".  
We go on calculating this for all squares in the grid.   
Remember: what we are actually doing is calculating the best way to arrive at **any possible square**.  
In this way, upon reaching the final bottom-right square, **we will have effectively solved the whole route**.   
The score of the final square is by definition **the maximum score** we can have with the given scoring grid.  

To find the route, we now only need to go back from the bottom-right to the top left, following the **tracking arrows**. This is called backtracking.  

![](figures/Cruise7.png)
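The whole procedure (filling the first row and column, choosing the best origin for every internal square, and backtracking along the arrows) can be sketched as below. The function name `best_route` and the scoring grid are hypothetical, and the score of the starting square is assumed to count towards the total.

```python
def best_route(grid):
    """Maximum cumulative score and route from top-left to bottom-right."""
    rows, cols = len(grid), len(grid[0])
    score = [[0] * cols for _ in range(rows)]
    arrow = [[None] * cols for _ in range(rows)]  # square we arrived from

    score[0][0] = grid[0][0]
    # First row and first column: there is only one way to arrive
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + grid[0][j]
        arrow[0][j] = (0, j - 1)
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + grid[i][0]
        arrow[i][0] = (i - 1, 0)

    # Internal squares: best of arriving from above, from the left or diagonally
    for i in range(1, rows):
        for j in range(1, cols):
            origins = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            best = max(origins, key=lambda pos: score[pos[0]][pos[1]])
            score[i][j] = score[best[0]][best[1]] + grid[i][j]
            arrow[i][j] = best

    # Backtracking: follow the arrows from the bottom-right to the top-left
    route = [(rows - 1, cols - 1)]
    while arrow[route[-1][0]][route[-1][1]] is not None:
        route.append(arrow[route[-1][0]][route[-1][1]])
    route.reverse()
    return score[-1][-1], route


# Hypothetical scoring grid
grid = [
    [1, 5, 1],
    [2, 1, 9],
    [8, 8, 1],
]
print(best_route(grid))  # (20, [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)])
```

Each square is visited once and only three origins are compared, so the work grows with the number of squares rather than with the number of routes. On this particular grid, always stepping to the highest-scoring neighbour would collect only 16.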

### The Needleman-Wunsch Algorithm

The basis of the NW Algorithm is breaking the bigger problem into smaller ones.   
The pairwise sequence comparison problem is directly analogous to the one of the sailor.   
Here is how:  
1. The two sequences can be seen as the indexes of the rows and the columns in a grid.   
2. We start from the top-left at the point when we have not read _any_ residue and we need to reach the bottom right when both sequences will have been read in full.  
3. Every time we make a horizontal or a vertical move **we are introducing a gap** in the alignment since we are only reading one of the two sequences.  
4. Every time we make a diagonal move we have to distinguish between outcomes of match/mismatch or other, more elaborate scoring schemes (see below).  

In this way we can use a simple scheme of e.g. match=1, mismatch=-2, gap=-2 to implement the algorithm in the following example of two amino acid sequences.  
We start by filling the first row/column with cumulative gap scores, as below.

![](figures/NW1.png)

The choice for each square is done on the basis of the scoring scheme we are given.  

![](figures/NW2.png)

We work our way down and to the right across the grid

![](figures/NW3.png)

until the whole grid is filled, as shown below

![](figures/NW4.png)

In the end, we simply backtrack from the bottom right corner, taking note of the tracking pointers/arrows.   
Note how the alignment is created from the scored grid.

![](figures/NW5.png)
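The steps above can be put together in a compact sketch of the algorithm. This is one possible implementation under stated assumptions, not a reference one: the function name `needleman_wunsch`, the string labels used for the arrows and the rule for breaking ties (diagonal first) are choices made here. The default scores are the match=1, mismatch=-2, gap=-2 scheme mentioned above.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    arrow = [[None] * (m + 1) for _ in range(n + 1)]

    # First row and column: cumulative gap scores
    for i in range(1, n + 1):
        score[i][0] = i * gap
        arrow[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = j * gap
        arrow[0][j] = "left"

    # Every internal square: best of diagonal (match/mismatch), up or left (gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], arrow[i][j] = max(options, key=lambda option: option[0])

    # Backtracking from the bottom-right corner, following the arrows
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif move == "up":
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned1)), "".join(reversed(aligned2))


print(needleman_wunsch("GGGAATTTCC", "GGCAATTTCC"))
# (7, 'GGGAATTTCC', 'GGCAATTTCC')

# The two DNA sequences from the definition of the problem
for line in needleman_wunsch("ACACAGACATAGCATCGACTAGGAGAGA", "ACAGGGACTAGGTTGCCA"):
    print(line)
```

When several alignments share the maximum score, this sketch reports only one of them, determined by the tie-breaking rule.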



### Scoring schemes and Substitution Matrices

When it comes to the outcome of an alignment, the scoring scheme is of prime importance. 
Notice how the final score, and more importantly the alignment itself, changes with a slight change in the scoring scheme.

![](figures/NW6.png)

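In reality, and especially for protein sequences, the alignment is scored with substitution matrices such as BLOSUM, which assign a separate score to every possible pair of residues.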

![](figures/BLOSUM.png)
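A substitution matrix is, in practice, a lookup table with one score for every pair of residues. The sketch below builds a toy table for DNA; the scores are made up for illustration and are not BLOSUM values. It assumes that substitutions within the purines (A, G) or within the pyrimidines (C, T) are penalized less than substitutions between the two groups. All names (`substitution_score`, `matrix`) are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b, match=1, transition=-1, transversion=-2):
    """Toy scheme with illustrative values: substitutions within the same
    chemical class (transitions) cost less than those between classes."""
    if a == b:
        return match
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion


# Build the full lookup table, one entry per ordered pair of residues
matrix = {(a, b): substitution_score(a, b) for a in "ACGT" for b in "ACGT"}
print(matrix[("A", "A")], matrix[("A", "G")], matrix[("A", "C")])  # 1 -1 -2
```

In an alignment algorithm, a lookup of this kind would take the place of the single match/mismatch decision made on diagonal moves.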

### Dynamic Programming



In [None]:
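def SimpleFibonacci(N):
    # Build the sequence iteratively, starting from F(0)=0 and F(1)=1
    fib = [0, 1]
    for i in range(2, N+1):
        fib.append(fib[i-1] + fib[i-2])
    # Index with N rather than the loop variable, which is undefined when N < 2
    return fib[N]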

### Recursion in Fibonacci

Remember that, if we want to use recursion to calculate the Fibonacci sequence, we can use the following algorithm.

In [None]:
def RecursiveFibonacci(N):
    fib = {}
    if N in {0,1}:
        fib[N] = N
    if N > 1:
        fib[N] = RecursiveFibonacci(N-2) + RecursiveFibonacci(N-1)
    return fib[N]

This is not efficient, since many of the steps are repeated many times.

![](figures/FibonacciRecursion.JPG)

We can use a dynamic programming approach to speed up the recursion: before calculating a value we check whether it has already been calculated and, if so, we skip the calculation step.

See below how the green steps are not calculated.

![](figures/FibonacciDP.JPG)

In [None]:
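def DPFibonacci(N, fib=None):
    # fib stores the values already calculated and is shared by all recursive calls
    if fib is None:
        fib = {0: 0, 1: 1}
    # Calculate a value only if it is not stored yet
    if N not in fib:
        fib[N] = DPFibonacci(N-2, fib) + DPFibonacci(N-1, fib)
    return fib[N]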
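To see how many steps are saved by storing already calculated values, we can count the function calls. The sketch below is an illustration with hypothetical names (`fib_plain`, `fib_cached`, `calls`); it uses `functools.lru_cache` from the Python standard library as a ready-made way of storing results, which is a substitute for keeping the dictionary by hand.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}


def fib_plain(n):
    """Plain recursion: the same values are recalculated many times."""
    calls["plain"] += 1
    if n < 2:
        return n
    return fib_plain(n - 2) + fib_plain(n - 1)


@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but the body runs only once for every value of n."""
    calls["cached"] += 1
    if n < 2:
        return n
    return fib_cached(n - 2) + fib_cached(n - 1)


print(fib_plain(20), calls["plain"])    # 6765 21891
print(fib_cached(20), calls["cached"])  # 6765 21
```

Both functions return the same value, but the second one calculates each of the 21 values from F(0) to F(20) exactly once.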

In [None]:
import time
start_time = time.time()
myFib = SimpleFibonacci(10)
print(myFib)
print("--- %s seconds ---" % (time.time() - start_time))
#
start_time = time.time()
myFib = RecursiveFibonacci(10)
print(myFib)
print("--- %s seconds ---" % (time.time() - start_time))
#
start_time = time.time()
myFib = DPFibonacci(10)
print(myFib)
print("--- %s seconds ---" % (time.time() - start_time))


## Longest common subsequence (LCS)

In [None]:
def longest_common_subsequence(str1, str2):
    m = len(str1)
    n = len(str2)

    # Initialize a 2D array to store the lengths of the common subsequences
    dp = [[0] * (n+1) for _ in range(m+1)]

    # Fill the dp array using dynamic programming
    for i in range(1, m+1):
        for j in range(1, n+1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    # Backtrack to find the longest common subsequence
    lcs = ""
    i, j = m, n
    while i > 0 and j > 0:
        if str1[i-1] == str2[j-1]:
            lcs = str1[i-1] + lcs
            i -= 1
            j -= 1
        elif dp[i-1][j] > dp[i][j-1]:
            i -= 1
        else:
            j -= 1

    return lcs

# Test the function with two strings
str1 = "ABCBDAB"
str2 = "BDCAB"
print(longest_common_subsequence(str1, str2))  # Output: "BDAB" ("BCAB" is an equally long alternative)

In [None]:
def longest_common_substring(text1, text2):
  """
  Finds the longest common substring between two strings.

  Args:
      text1: The first string.
      text2: The second string.

  Returns:
      The longest common substring between the two strings.
  """
  m = len(text1)
  n = len(text2)
  # dp[i][j] is the length of the longest common suffix of text1[:i] and text2[:j]
  dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

  longest_len = 0
  end = 0  # position in text1 just after the end of the best substring
  for i in range(1, m + 1):
    for j in range(1, n + 1):
      if text1[i - 1] == text2[j - 1]:
        dp[i][j] = dp[i - 1][j - 1] + 1
        # Remember where the longest run of consecutive matches ends
        if dp[i][j] > longest_len:
          longest_len = dp[i][j]
          end = i
      else:
        dp[i][j] = 0

  # A substring is contiguous, so it can be sliced directly out of text1
  return text1[end - longest_len:end]

# Example usage
text1 = "ATGCCCAC"
text2 = "AAATGCTTTTTT"
lcs = longest_common_substring(text2, text1)
print("Longest Common Substring:", lcs)

## Longest common substring in quadratic time

In [None]:
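def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    # shift is the position of string1 that is aligned with string2[0].
    # Negative shifts cover matches that start later in string2 than in string1.
    for shift in range(-(len2 - 1), len1):
        match = ""
        for j in range(len2):
            i = shift + j
            if (0 <= i < len1 and string1[i] == string2[j]):
                match += string2[j]
                # Update here so that a match reaching the end is not lost
                if (len(match) > len(answer)): answer = match
            else:
                match = ""
    return answer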

print(longestSubstringFinder("grapple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("batplesx", "cappleses"))

In [None]:
def LCSubStr(str1, str2):

    N = len(str1)
    M = len(str2)

    # LCSuff[i][j] is the length of the longest common suffix of str1[:i] and str2[:j]
    LCSuff = [[0 for k in range(M+1)] for l in range(N+1)]
    mx = 0  # length of the longest common substring found so far
    for i in range(N + 1):
        for j in range(M + 1):
            if (i == 0 or j == 0):
                LCSuff[i][j] = 0
            elif (str1[i-1] == str2[j-1]):
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1
                mx = max(mx, LCSuff[i][j])
            else:
                LCSuff[i][j] = 0
    return mx

LCSubStr("grapple pie available", "apple pies")

In [None]:
def LCS(X, Y):
    m = len(X)
    n = len(Y)
    
    L = [[0 for x in range(n+1)] for y in range(m+1)]
    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
    return L[m][n]

LCS("grapple pie available", "apple pies")

## LCS with Dynamic Programming

In [None]:
def lcs_algo(S1, S2, m, n):
    L = [[0 for x in range(n+1)] for x in range(m+1)]

    # Building the matrix in a bottom-up way
    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif S1[i-1] == S2[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])

    index = L[m][n]

    # List that holds the characters of the LCS; it is filled from the end
    lcs_chars = [""] * (index+1)
    lcs_chars[index] = ""

    i = m
    j = n
    while i > 0 and j > 0:

        if S1[i-1] == S2[j-1]:
            lcs_chars[index-1] = S1[i-1]
            i -= 1
            j -= 1
            index -= 1

        elif L[i-1][j] > L[i][j-1]:
            i -= 1
        else:
            j -= 1
            
    # Printing the two sequences and their longest common subsequence
    print("S1 : " + S1 + "\nS2 : " + S2)
    print("LCS: " + "".join(lcs_chars))


S1 = "grapple pie available"
S2 = "apple pies"
m = len(S1)
n = len(S2)
lcs_algo(S1, S2, m, n)
