# BC205: Algorithms for Bioinformatics.
## V. Sequence Comparison and Alignment
### Christoforos Nikolaou

### Goals for this class

* To present the biological problem of Sequence Comparison
* To define String Similarity and show how different definitions change the problem
* To present the problem of Sequence Alignment
* To solve the problem of Sequence Alignment using the algorithm presented by Saul B. Needleman and Christian D. Wunsch (1970)
* To present the concept of Dynamic Programming as an algorithmic technique


### Introduction. The biological problem
In this class we move on to problems of similarity in sequence analysis. The two basic questions we will deal with are:

* Given two sequences of comparable length, what is the best way to objectively define a measure of their similarity?
* Given two sequences with significantly different lengths, how can we identify all identical matches of the shorter within the longer one?



### Part A. Objectively defining the similarity between two sequences

Biological Context

This is one of the most fundamental problems in bioinformatics, with its roots in the study of:   
*   Phylogenetic relationships at the molecular level, e.g. the comparison of orthologous sequences  
*   Evolutionary dynamics in genomes, e.g. the identification of gene duplications, rearrangements etc  
*   Genomic variability, e.g. comparisons of the same sequence among different individuals


### Sequence Similarity. The problem
How similar are two sequences?

Let’s start by looking into a very simple example of two 10-nucleotide sequences
```
G G G A A T T T C C
G G C A A T T T C C
```
They clearly differ in only one nucleotide. The question is how we can quantify this difference.


### Measures of sequence distance

In Computer Science this is a problem of “character string comparison” (or “string comparison”). There are many measures for quantifying such similarities. The simplest is a restricted form of “Edit Distance” that allows only substitutions, also known as “Hamming Distance”. We can define it as follows:

**Edit distance, in this restricted sense, is the number of residues we need to substitute to obtain one sequence from the other, without re-arrangements, deletions or insertions.**

### Edit Distance of two Sequences

![Edit Distance 1](figures/EditDistance1.png)

In the figure you see that, between the two sequences, there are 9 identical residues, meaning only one change is needed. 

We say that they have Distance=1. This can be scaled by the length of the sequences: Distance=1/10=0.1.

In a similar way we can quantify the distance of the following pair of sequences as Distance=3/10=0.3.

![Edit Distance 2](figures/EditDistance2.png)
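The substitution-only distance described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original material: the function names `hamming_distance` and `normalized_distance` are hypothetical, and the sketch assumes both sequences have the same length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled by the sequence length (assumed non-empty)."""
    return hamming_distance(a, b) / len(a)


# The two 10-nucleotide sequences from the first example
s1 = "GGGAATTTCC"
s2 = "GGCAATTTCC"
print(hamming_distance(s1, s2))     # 1
print(normalized_distance(s1, s2))  # 0.1
```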

### Going beyond distance measures
What if the sequences are not so similar? In the example we see below the distance is calculated as D=9/10=0.9. 

![Edit Distance 3](figures/EditDistance3.png)

This means that the two sequences are only 10% similar to each other. Are they though?

If we “slide” the two sequences against each other, we can obtain a smaller distance and a higher similarity as residues become “aligned”.

![Edit Distance 4](figures/MatchVsSlide.png)

The sequences now have a much lower distance than 0.9, even allowing for the fact that the aligned pair now spans 12 positions instead of 10: D=4/12=0.33.

The main point here is that we have allowed ourselves the additional liberty of displacing the two sequences against each other, without changing the order of residues in either of the two.
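The sliding idea can be sketched as a brute-force search over all relative offsets. The sketch below rests on several assumptions: the names `count_matches` and `best_offset` are hypothetical, the example sequences are made up for illustration (they are not the ones in the figure), and only matches within the overlapping region are counted, with no penalty for overhanging ends.

```python
def count_matches(a: str, b: str, offset: int) -> int:
    """Matches when b is shifted `offset` positions to the right of a.

    Position j of b is compared with position j + offset of a; only the
    overlapping part of the two sequences is compared.
    """
    matches = 0
    for j, cb in enumerate(b):
        i = j + offset
        if 0 <= i < len(a) and a[i] == cb:
            matches += 1
    return matches


def best_offset(a: str, b: str):
    """Try every offset that leaves at least one overlapping position."""
    offsets = range(-(len(b) - 1), len(a))
    best = max(offsets, key=lambda k: count_matches(a, b, k))
    return best, count_matches(a, b, best)


# Made-up sequences: without sliding they share no matching position
a = "AAGGCCTT"
b = "GGCCTTAA"
print(count_matches(a, b, 0))  # 0
print(best_offset(a, b))       # (2, 6)
```

Sliding alone cannot open a gap inside a sequence; it only shifts one sequence as a whole.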

### The Sequence Alignment Problem - Description

The last point above is essentially a description of Sequence Alignment.

Stated as a question:

_Assuming two sequences, which is the best way we can “slide” one against the other, without altering the order of residues, in order to maximize their similarity?_

In the example shown below, we see that two sequences can be “aligned” in more than one way.

![MatchVsSlide2](figures/MatchVsSlide2.png)

Both alignments give rise to 5 “matching” residues, but the first requires 4 more “slides” in order to open a gap in the middle of the second sequence. Do we prefer the first or the second? Should we take into account only the “matching” residues, or also the number of “gaps” in the alignment?


### The Sequence Alignment Problem - Definition

The question that arises is what exactly we mean by the “best way” mentioned earlier. The problem is now reduced to the following:

_Given two sequences, how can we determine which of their possible alignments provides the highest similarity score, given a scoring of matches, mismatches and gaps?_

We will have to provide some specifications on how to score all three of them.

![SequenceAlignment](figures/SequenceAlignmentSimilarity.png)
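To make the scoring idea concrete, here is a minimal sketch that scores an alignment that has already been written out, with `-` marking a gap. The function name `score_alignment` and the default values (+1 for a match, -1 for a mismatch, -1 for a gap) are assumptions chosen for illustration; other scoring schemes are equally valid and will rank alignments differently.

```python
def score_alignment(row1: str, row2: str,
                    match: float = 1.0,
                    mismatch: float = 1.0,
                    gap: float = 1.0) -> float:
    """Score two aligned rows of equal length; '-' denotes a gap.

    `mismatch` and `gap` are penalties, so they are subtracted.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    for c1, c2 in zip(row1, row2):
        if c1 == "-" or c2 == "-":
            score -= gap        # a residue aligned against a gap
        elif c1 == c2:
            score += match      # identical residues
        else:
            score -= mismatch   # different residues
    return score


# 4 matches, 2 mismatches and 2 gaps under the assumed default scheme
print(score_alignment("G-ATTACA", "GCA-TGCU"))        # 0.0
# 6 matches and 1 mismatch, no gaps
print(score_alignment("GATTACA", "GATTGCA", gap=2))   # 5.0
```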

So, assuming we have e.g. the DNA sequences:

```
ACACAGACATAGCATCGACTAGGAGAGA
ACAGGGACTAGGTTGCCA
```

What is the best way to align them?

### The Sequence Alignment Problem - An example

We will start the description of the problem 

```
G-ATTACA
GCA-TGCU
0.0

----G-C--AGGCAAGTGGGGCACCCGTATCCT-T-T-C-C-AACTTACAAGGGT-C-CC-----CGT-T
GTGCGCCAGAGG-AAGT----CA--C-T-T--TATATCCGCG--C--AC---GGTACTCCTTTTTC-TA-
30.0

GCAG-GCAAGTGG--GGCAC-CCGTATCCTTTC-CAAC-TTACAAGGGTCC-CCGT-T-
G-TGCGCCAGAGGAAGTCACTTTATATCC--GCGC-ACGGTAC-----TCCTTTTTCTA
-1.0

GCAGGCAAGTGG--GGCAC-CCGTATCCTTTCCAACTTACAAGGGTCCCCGTT
GTGCGCCAGAGGAAGTCACTTTATATCC-GCGCACGGTAC-TCCTTTTTC-TA
-11.0
```


```python
x = "GATTACA"
y = "GATTGCA"
print(nw(x, y, gap = 2))
```

```
GATTACA
GATTGCA
5.0
```
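For readers who want to experiment, the following is a minimal sketch of the Needleman-Wunsch dynamic programming approach. It is not the `nw` function used above, whose definition is not shown here: the name `nw_sketch`, its return value (a tuple) and the default penalties are all assumptions. The match and mismatch values of +1 and -1 are chosen because they reproduce the score of 5.0 in the example.

```python
def nw_sketch(x, y, match=1.0, mismatch=1.0, gap=1.0):
    """Global alignment of x and y; returns (aligned_x, aligned_y, score)."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap   # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap   # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else -mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] - gap,     # gap in y
                          F[i][j - 1] - gap)     # gap in x
    # Trace back from the bottom-right corner to recover one best alignment
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else -mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and (j == 0 or F[i][j] == F[i - 1][j] - gap):
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), F[n][m]


print(nw_sketch("GATTACA", "GATTGCA", gap=2))
# ('GATTACA', 'GATTGCA', 5.0)
```

When several alignments share the best score, this traceback returns only one of them, preferring a diagonal step over a gap.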
