# Sequence Alignment

## 1st approach: alignment based on sequence distance

- Hamming
- Levenshtein

## 2nd approach: scoring scheme

### Hamming distance

In [1]:
function hamming(x,y)
    if length(x) != length(y)
        println("ERROR: sequences should have equal lengths!")
        return
    else
        d = 0
        for i = 1:length(x)
            if x[i] != y[i]
                d = d + 1
            end
        end
        return d
    end
end

hamming (generic function with 1 method)

In [2]:
x = "wheeaaa"; y = "ghearpa"
d = hamming(x,y)

4
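
The same count can be written more compactly. The following is a sketch rather than part of the original notebook: `hamming_zip` is a hypothetical name, and throwing an `ArgumentError` instead of printing a message is an assumption about the desired error handling. Iterating with `zip` compares characters rather than byte indices, so it should also behave correctly for non-ASCII strings.

```julia
# Hypothetical alternative to `hamming`
function hamming_zip(x, y)
    length(x) == length(y) ||
        throw(ArgumentError("sequences should have equal lengths"))
    # count the positions at which the two sequences differ
    return count(a != b for (a, b) in zip(x, y))
end

hamming_zip("wheeaaa", "ghearpa")
```

For the pair used above this gives the same distance, 4.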

### How to read real sequences from an online database

In [7]:
using FastaIO
#using Images
using OffsetArrays
using PyPlot
using DelimitedFiles
using BenchmarkTools
using StatsBase
using LinearAlgebra
using Printf
using HTTP

In [9]:
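function sequenceDownload(sequence)
    # name of the local FASTA file, e.g. "P68871.fasta"
    sequenceFile = sequence * ".fasta"

    # UniProtKB URL for the given accession
    URL = "https://www.uniprot.org/uniprotkb/" * sequenceFile

    # download the FASTA record as text
    query = HTTP.get(URL)
    fastaString = String(query.body)

    # save a local copy so that FastaIO can read it
    open(sequenceFile, "w") do f
        write(f, fastaString)
    end

    # readfasta returns (description, sequence) pairs; keep the first sequence
    FastaIO.readfasta(sequenceFile)[1][2]
end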

sequenceDownload (generic function with 1 method)
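
A FASTA record is a header line starting with `>` followed by one or more sequence lines. To make explicit what the download returns, here is a minimal parser sketch that needs neither the network nor a package. `parse_fasta` is a hypothetical helper, not part of FastaIO, and the record below is made up purely for illustration.

```julia
# Minimal FASTA parser (illustrative; `parse_fasta` is a hypothetical helper)
function parse_fasta(text::AbstractString)
    records = Tuple{String,String}[]
    header = nothing
    buffer = IOBuffer()
    for line in eachline(IOBuffer(text))
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            # a new header closes the previous record, if there is one
            header === nothing || push!(records, (header, String(take!(buffer))))
            header = String(line[2:end])
        else
            # sequence lines of one record are concatenated
            print(buffer, line)
        end
    end
    header === nothing || push!(records, (header, String(take!(buffer))))
    return records
end

# Made-up record used only to exercise the parser
example = """
>demo|X00000|EXAMPLE made-up record
ACDEFGHIK
LMNPQRSTV
"""
records = parse_fasta(example)
records[1][2]   # sequence lines joined into a single string
```

Indexing with `[1][2]` mirrors the last line of `sequenceDownload`: first record, second field (the sequence).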

In [10]:
HBB_Human = sequenceDownload("P68871")
HBA_Bonobo = sequenceDownload("P69906")
HBA_Chimp = sequenceDownload("P69907")
HBA_Donkey = sequenceDownload("P01959")
LegHem = sequenceDownload("P02240")

"MGALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA"

In [8]:
d = hamming(HBB_Human,HBA_Bonobo)

ERROR: sequences should have equal lengths!


In [9]:
@show length(HBB_Human)
@show length(HBA_Bonobo)
@show length(HBA_Chimp)
@show length(HBA_Donkey)
@show length(LegHem)

length(HBB_Human) = 147
length(HBA_Bonobo) = 142
length(HBA_Chimp) = 142
length(HBA_Donkey) = 142
length(LegHem) = 154


154

In [69]:
@show hamming(HBA_Bonobo,HBA_Chimp)
@show hamming(HBA_Bonobo,HBA_Donkey)
@show hamming(HBA_Chimp,HBA_Donkey)

hamming(HBA_Bonobo, HBA_Chimp) = 0
hamming(HBA_Bonobo, HBA_Donkey) = 20
hamming(HBA_Chimp, HBA_Donkey) = 20


20

### Levenshtein distance: recursive

$$L(i,j) = \min{\begin{cases}1-\delta_{x_i,y_j}+L(i-1,j-1)\\1+L(i-1,j)\\1+L(i,j-1)\end{cases}}\qquad L(i,0)=i,\quad L(0,j)=j$$

Here $\delta_{x_i,y_j}=1$ if $x_i=y_j$ and $0$ otherwise.

In [11]:
function leven(x,y)
    D = Dict()

    function levenshtein(x,y)
        isempty(x) && return length(y)
        isempty(y) && return length(x)
        haskey(D,(x,y)) && return D[(x,y)]
        D[(x,y)] = min(1 - (x[end] == y[end]) + levenshtein(x[1:end-1],y[1:end-1]), 1 + levenshtein(x[1:end-1],y), 1 + levenshtein(x,y[1:end-1]))
    end

    levenshtein(x,y)
end

leven (generic function with 1 method)

In [72]:
@show leven(HBA_Bonobo,HBA_Chimp);
@show leven(HBA_Bonobo,HBA_Donkey);
@show leven(HBA_Chimp,HBA_Donkey);

leven(HBA_Bonobo, HBA_Chimp) = 0
leven(HBA_Bonobo, HBA_Donkey) = 20
leven(HBA_Chimp, HBA_Donkey) = 20


In [73]:
@show leven(HBB_Human,HBA_Bonobo);
@show leven(HBB_Human,HBA_Chimp);
@show leven(HBB_Human,HBA_Donkey);
@show leven(HBB_Human,LegHem);

leven(HBB_Human, HBA_Bonobo) = 84
leven(HBB_Human, HBA_Chimp) = 84
leven(HBB_Human, HBA_Donkey) = 84
leven(HBB_Human, LegHem) = 119
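
The recurrence can also be evaluated bottom-up by filling a table, which avoids deep recursion and the cost of hashing substrings. This is a sketch under that assumption, not the notebook's method; `leven_table` is a hypothetical name.

```julia
# Bottom-up version of the recurrence (`leven_table` is a hypothetical name)
function leven_table(x, y)
    xs, ys = collect(x), collect(y)   # characters, not byte indices
    n, m = length(xs), length(ys)
    L = zeros(Int, n + 1, m + 1)      # L[i + 1, j + 1] stores L(i, j)
    L[:, 1] = 0:n                     # L(i, 0) = i
    L[1, :] = 0:m                     # L(0, j) = j
    for i in 1:n, j in 1:m
        L[i + 1, j + 1] = min(L[i, j] + (xs[i] != ys[j]),   # substitution or match
                              L[i, j + 1] + 1,              # deletion
                              L[i + 1, j] + 1)              # insertion
    end
    return L[end, end]
end

leven_table("kitten", "sitting")
```

Since it implements the same recurrence, it should agree with `leven` on any pair of sequences; the table costs $O(nm)$ time and memory.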


### Scoring scheme

#### Substitution matrix
$$S(X,Y)=\sum_{i=1}^N\log{\frac{p_{x_iy_i}}{q_{x_i}q_{y_i}}}=\sum_{i=1}^Ns(x_i,y_i)$$
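
As a small illustration of the log-odds score, the sketch below uses a toy two-letter alphabet. All probabilities are made-up values chosen only so that the numbers are easy to follow; they are not taken from any real substitution matrix.

```julia
# Toy two-letter alphabet; all probabilities are illustrative assumptions
q = Dict('A' => 0.5, 'B' => 0.5)                  # background frequencies q_a
p = Dict(('A', 'A') => 0.4, ('B', 'B') => 0.4,    # joint probabilities p_ab
         ('A', 'B') => 0.1, ('B', 'A') => 0.1)

# log-odds score of one aligned pair
s(a, b) = log(p[(a, b)] / (q[a] * q[b]))

# total score of an ungapped alignment of two equal-length sequences
S(x, y) = sum(s(a, b) for (a, b) in zip(x, y))

S("AAB", "ABB")   # = s('A','A') + s('A','B') + s('B','B')
```

With these values, identical pairs score positively ($0.4/0.25 > 1$) and mismatched pairs score negatively ($0.1/0.25 < 1$).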

#### Gap score
$$\gamma(g)=\begin{cases}-dg & \text{linear gap}\\-d-e(g-1) & \text{affine gap, } e < d\end{cases}$$
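
The two gap schemes can be compared directly. In this sketch `gap_linear` and `gap_affine` are hypothetical names; `d = 7` matches the value used later in `globalAlignment`, while `e = 1` is an assumed extension penalty.

```julia
# d: gap-open penalty, e: gap-extension penalty (e = 1 is an assumed value)
gap_linear(g; d = 7) = -d * g
gap_affine(g; d = 7, e = 1) = -d - e * (g - 1)

gap_linear(3), gap_affine(3)
```

For a gap of length 3 the linear scheme gives $-21$ and the affine scheme gives $-9$: with $e < d$, extending an existing gap is cheaper than opening a new one.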


### Sequence Alignment: global
Global alignment between $X=(x_1,\dots,x_n)$ and $Y=(y_1,\dots,y_m)$.
- Initialization: $F(0,0)=0,\ F(i,0)=-id,\ F(0,j)=-jd\ \forall\ i,j$

- Recursion: $F(i,j)=\max{\begin{cases}F(i-1,j-1)+s(x_i,y_j) & \text{Substitution}\\F(i-1,j)-d & \text{Deletion (in X)}\\F(i,j-1)-d & \text{Insertion (in Y)}\end{cases}}$ for $\begin{cases}1\leq i \leq n \\ 1\leq j \leq m\end{cases}$

- Termination: $F(n,m)$ is the optimal score

In [4]:
#=
x = "wheea" -> |x| = 5
y = "hepga" -> |y| = 5
whe-ea
-hepga
DMMISM
D: DELETION
M: MATCH
I: INSERTION
S: SUBSTITUTION

    0   1   2   3   4   5
0   0  -1  -2  -3  -4  -5     
1  -1 
2  -2 
3  -3 
4  -4 
5  -5 
=#

function globalAlignment(x,y)
    d = 7 # linear gap penalty
    # initialization: first column F(i,0) = -id, first row F(0,j) = -jd
    F = zeros(Int64,length(x) + 1,length(y) + 1)
    for i in eachindex(x)
        F[i + 1,1] = - d * i
    end
    for j in eachindex(y)
        F[1,j + 1] = - d * j
    end
    # recursion
    for i in eachindex(x)
        for j in eachindex(y)
            # placeholder score: the constant 1 stands in for s(x[i],y[j])
            F[i + 1,j + 1] = max(F[i,j] + 1,F[i,j + 1] - d,F[i + 1,j] - d)
        end
    end
    return F
end

globalAlignment (generic function with 1 method)

In [8]:
x = "wheea"
y = "hepga"

@benchmark globalAlignment(x,y)

BenchmarkTools.Trial: 10000 samples with 560 evaluations.
 Range (min … max):  206.671 ns … 1.849 μs  ┊ GC (min … max): 0.00% … 87.06%
 Time  (median):     210.664 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.084 ns ± 81.127 ns  ┊ GC (mean ± σ):  2.25% ±  5.54%

In [18]:
a = globalAlignment(HBB_Human,HBA_Bonobo)
maximum(a)

142
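
`globalAlignment` scores every aligned pair with a constant `+1` as a stand-in for $s(x_i,y_j)$, and the termination rule reads the optimal score from $F(n,m)$, that is `F[end,end]`, rather than from the matrix maximum. The sketch below fills the same table (the Needleman-Wunsch recursion) with an actual pair score and adds a traceback. The names `score` and `global_align`, the match/mismatch values of `+1`/`-1`, and the default `d = 1` are illustrative assumptions, not the notebook's choices.

```julia
# Hypothetical scoring: +1 for a match, -1 for a mismatch
score(a, b) = a == b ? 1 : -1

function global_align(x, y; d = 1, s = score)
    xs, ys = collect(x), collect(y)      # characters, not byte indices
    n, m = length(xs), length(ys)
    F = zeros(Int, n + 1, m + 1)         # F[i + 1, j + 1] stores F(i, j)
    for i in 1:n
        F[i + 1, 1] = -d * i             # F(i, 0) = -id
    end
    for j in 1:m
        F[1, j + 1] = -d * j             # F(0, j) = -jd
    end
    for i in 1:n, j in 1:m
        F[i + 1, j + 1] = max(F[i, j] + s(xs[i], ys[j]),
                              F[i, j + 1] - d,
                              F[i + 1, j] - d)
    end

    # traceback from F(n, m) to F(0, 0), rebuilding one optimal alignment
    ax, ay = Char[], Char[]
    i, j = n, m
    while i > 0 || j > 0
        if i > 0 && j > 0 && F[i + 1, j + 1] == F[i, j] + s(xs[i], ys[j])
            push!(ax, xs[i]); push!(ay, ys[j]); i -= 1; j -= 1
        elseif i > 0 && F[i + 1, j + 1] == F[i, j + 1] - d
            push!(ax, xs[i]); push!(ay, '-'); i -= 1    # x[i] aligned to a gap
        else
            push!(ax, '-'); push!(ay, ys[j]); j -= 1    # y[j] aligned to a gap
        end
    end
    return F[end, end], String(reverse(ax)), String(reverse(ay))
end

best, ax, ay = global_align("wheea", "hepga")
```

The traceback returns one optimal alignment; when several paths tie, the order of the branches decides which one is reported.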