# The Manhattan tourist problem

<img src="assets/Manhattan_tourist.png" width="700"> 

<u>Recursive solution:</u>

$S[0, 0] = 0$

$S[i, 0] = S[i-1, 0] + weight ((i-1, 0), (i, 0)), \hspace{0.3cm} \forall i > 0$

$S[0, j] = S[0, j-1] + weight ((0, j-1), (0, j)), \hspace{0.3cm} \forall j > 0$

$S[i, j] = max \begin{cases}
                    S[i-1, j] + weight ((i-1, j), (i, j)) \\
                    S[i, j-1] + weight ((i, j-1), (i, j)) 
                \end{cases}$
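
As a cross-check, the recurrence can also be evaluated top-down with memoization. This is a minimal sketch, not the notebook's implementation: the name `manhattan_score_top_down` is hypothetical, and it assumes the edge layout used by the code further down, where `DOWN[i][j]` is the weight of the edge entering node (i, j) from above and `RIGHT[i][j]` the weight of the edge entering it from the left. Along the first row and the first column only one predecessor exists, so only one candidate is considered there.

```python
from functools import lru_cache

def manhattan_score_top_down(DOWN, RIGHT, n, m):
    """Best score from (0, 0) to (n-1, m-1), computed by memoized recursion."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == 0 and j == 0:
            return 0                                  # the source node
        candidates = []
        if i > 0:                                     # arrive from the node above
            candidates.append(best(i - 1, j) + DOWN[i][j])
        if j > 0:                                     # arrive from the node to the left
            candidates.append(best(i, j - 1) + RIGHT[i][j])
        return max(candidates)

    return best(n - 1, m - 1)

# Same weights as the example further down
DOWN = [[0, 0, 0],
        [1, 2, 1],
        [1, 2, 1]]
RIGHT = [[0, 1, 1],
         [0, 1, 1],
         [0, 1, 2]]

print(manhattan_score_top_down(DOWN, RIGHT, 3, 3))
```

The bottom-up version with an explicit table is usually preferable in practice, because it avoids deep recursion on large grids and makes backtracking straightforward.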

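The function **manhattan_tourist** finds the maximum-weight path from node (0, 0) to node (n-1, m-1) in a grid graph with **n** $\times$ **m** nodes, whose edge weights are given by the matrices **DOWN** and **RIGHT**: `DOWN[i][j]` is the weight of the edge from (i-1, j) to (i, j), and `RIGHT[i][j]` is the weight of the edge from (i, j-1) to (i, j), so row 0 of **DOWN** and column 0 of **RIGHT** are unused. The helper function **backtracking_manhattan** reconstructs the path that achieves the computed best score, i.e. the path of maximum weight.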

In [1]:
def manhattan_tourist(DOWN, RIGHT, n, m):
    S = [[0 for j in range(m)] for i in range(n)]       
    BACKTRACK = [[None for j in range(m)] for i in range(n)]
    
    for i in range(1, n):
        S[i][0] = S[i-1][0] + DOWN[i][0]
        BACKTRACK[i][0] = (i-1, 0)
        
    for j in range(1, m):
        S[0][j] = S[0][j-1] + RIGHT[0][j]
        BACKTRACK[0][j] = (0, j-1)
        
    for i in range(1, n):
        for j in range(1, m):
            from_top = S[i-1][j] + DOWN[i][j]
            from_left = S[i][j-1] + RIGHT[i][j]
            
            S[i][j] = max(from_top, from_left)
            
            if S[i][j] == from_top:
                BACKTRACK[i][j] = (i-1, j)
            else:
                BACKTRACK[i][j] = (i, j-1)
             
    score = S[n-1][m-1]
    path = backtracking_manhattan(BACKTRACK, n, m)
    
    return (score, path)

In [2]:
def backtracking_manhattan(BACKTRACK, n, m):
    i = n - 1
    j = m - 1
    
    path = []
        
    while BACKTRACK[i][j] is not None:      # only cell (0, 0) still holds None
        path.append((i, j))
        (i, j) = BACKTRACK[i][j]
        
    path.append((0, 0))
    path.reverse()
    
    return path

<u>Our example:</u>

In [3]:
DOWN = [[0, 0, 0],
        [1, 2, 1],
        [1, 2, 1]]

RIGHT = [[0, 1, 1],
         [0, 1, 1],
         [0, 1, 2]]

In [4]:
(score, path) = manhattan_tourist(DOWN, RIGHT, 3, 3)

print('Score: ', score)
print('Path: ', path)

Score:  7
Path:  [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]


# The longest common subsequence problem

Terminology:
- substring = a run of consecutive characters of a string
- subsequence = a sequence of characters of a string taken at strictly increasing positions (not necessarily consecutive)
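
To make the distinction concrete, here is a small sketch (the helper `is_subsequence` is a hypothetical name, not part of the notebook). A substring can be tested with Python's `in` operator, whereas a subsequence only requires the characters to appear in the same order.

```python
def is_subsequence(candidate, text):
    """True if the characters of candidate occur in text in the same order."""
    remaining = iter(text)
    # 'ch in remaining' advances the iterator, so every match must come
    # after the previous one
    return all(ch in remaining for ch in candidate)

text = 'ATCGTCC'
print('TCG' in text, is_subsequence('TCG', text))    # both a substring and a subsequence
print('ATGT' in text, is_subsequence('ATGT', text))  # a subsequence, but not a substring
```

Every substring is also a subsequence, but not the other way around.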

The longest common subsequence problem for two strings can be reduced to the Manhattan tourist problem with small modifications. The grid graph for this problem has (n+1) $\times$ (m+1) nodes (n and m are the lengths of the strings), and besides the 'down' and 'right' edges it also has diagonal edges. The 'down' and 'right' edges correspond to skipping a character in one string or the other, while the diagonal edges correspond to **(mis)matches** of the corresponding characters of the two strings. Edge weights are defined accordingly: a diagonal edge that pairs (aligns) two identical symbols has weight 1, and all other edges have weight 0.

<img src="assets/longest_common_subsequence.png" width="350"> 

<u>Recursive solution:</u>

$S[i, 0] = 0, \hspace{0.3cm} \forall i$

$S[0, j] = 0, \hspace{0.3cm} \forall j$

$S[i, j] = max \begin{cases}
                    S[i-1, j] + 0 \\ 
                    S[i, j-1] + 0 \\ 
                    S[i-1, j-1] + match (seq1[i], seq2[j]) 
               \end{cases}$

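The function **longest_common_subsequence** finds a longest common subsequence of the strings **seq1** and **seq2**. The helper function **backtracking_lcs** reconstructs the path (i.e. the subsequence) that achieves the computed best score. The recurrence indexes characters from 1, whereas Python strings are indexed from 0, so the code compares `seq1[i-1]` with `seq2[j-1]`.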

In [5]:
def longest_common_subsequence(seq1, seq2):
    n = len(seq1) + 1
    m = len(seq2) + 1
    
    S = [[0 for j in range(m)] for i in range(n)]
    BACKTRACK = [[None for j in range(m)] for i in range(n)]    
    
    for i in range(1, n):
        # S[i][0] stays 0
        BACKTRACK[i][0] = (i-1, 0)
        
    for j in range(1, m):
        # S[0][j] stays 0
        BACKTRACK[0][j] = (0, j-1)
        
    for i in range(1, n):
        for j in range(1, m):
            from_top = S[i-1][j] + 0
            from_left = S[i][j-1] + 0
            from_diagonal = S[i-1][j-1] + int(seq1[i-1] == seq2[j-1])
                
            S[i][j] = max(from_top, from_left, from_diagonal)
            
            if S[i][j] == from_top:
                BACKTRACK[i][j] = (i-1, j)
            elif S[i][j] == from_left:
                BACKTRACK[i][j] = (i, j-1)
            else:
                BACKTRACK[i][j] = (i-1, j-1)
                    
    score = S[n-1][m-1]
    lcs = backtracking_lcs(BACKTRACK, n, m, seq1, seq2)
    
    return (score, lcs)

In [6]:
def backtracking_lcs(BACKTRACK, n, m, seq1, seq2):
    i = n - 1
    j = m - 1
    
    lcs = ''
    
    while BACKTRACK[i][j] is not None:
        if BACKTRACK[i][j] == (i-1, j-1):     # we arrived along a diagonal, i.e. on a match
            lcs = seq1[i-1] + lcs             # seq1[i-1] == seq2[j-1] here, so either string works
            
        (i, j) = BACKTRACK[i][j]
        
    return lcs    

In [7]:
seq1 = 'ATCGTCC'
seq2 = 'ATGTTATA'

(score, lcs) = longest_common_subsequence(seq1, seq2)

print('LCS score: ', score)
print('LCS: ', lcs)

LCS score:  4
LCS:  ATGT
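
For inputs this short, the score can be sanity-checked by brute force. The sketch below is an illustration only (both function names are hypothetical): it enumerates subsequences of the first string from longest to shortest and stops at the first one that is also a subsequence of the second string. Its running time is exponential in the length of the first string, so it is not a substitute for the dynamic programming solution.

```python
from itertools import combinations

def is_subsequence(candidate, text):
    """True if the characters of candidate occur in text in the same order."""
    remaining = iter(text)
    return all(ch in remaining for ch in candidate)

def lcs_length_brute_force(seq1, seq2):
    """Length of the longest subsequence of seq1 that is also one of seq2."""
    # Try lengths from longest to shortest; seq1 has 2**len(seq1) subsequences
    for length in range(len(seq1), 0, -1):
        for positions in combinations(range(len(seq1)), length):
            candidate = ''.join(seq1[p] for p in positions)
            if is_subsequence(candidate, seq2):
                return length
    return 0

print(lcs_length_brute_force('ATCGTCC', 'ATGTTATA'))
```

The printed length should agree with the LCS score reported above.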


# Edit (Levenshtein) distance of two strings

Unlike the Hamming distance, which compares strings of equal length, the edit distance can also compare strings of different lengths. It is defined as the minimum number of *edit operations* needed to transform one string into the other.
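
For contrast, here is a minimal sketch of the Hamming distance (the function name is hypothetical). It simply counts differing positions, which is why it is undefined for strings of different lengths.

```python
def hamming_distance(seq1, seq2):
    """Number of positions at which two equal-length strings differ."""
    if len(seq1) != len(seq2):
        raise ValueError('Hamming distance requires strings of equal length')
    return sum(a != b for a, b in zip(seq1, seq2))

print(hamming_distance('ATCGTCC', 'ATGGTAC'))
```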

Edit operations:
- deletion - removing a character from a string
- insertion - inserting a character into a string
- substitution - pairing a character of one string with a character of the other (either a 'match', which costs nothing, or a 'mismatch')

<u>Recursive solution:</u>

$S[i, 0] = i, \hspace{0.3cm} \forall i$

$S[0, j] = j, \hspace{0.3cm} \forall j$

$S[i, j] = min \begin{cases} 
                    S[i-1, j] + 1 \\
                    S[i, j-1] + 1 \\ 
                    S[i-1, j-1] + mismatch (seq1[i], seq2[j]) 
                \end{cases}$
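
The recurrence translates almost literally into a memoized recursion. The following is a sketch for illustration (the name `edit_distance_top_down` is hypothetical); it returns only the distance, not an alignment, and relies on Python's recursion, so it is suitable for short strings only.

```python
from functools import lru_cache

def edit_distance_top_down(seq1, seq2):
    """Edit distance of seq1 and seq2 via memoized recursion on prefixes."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # distance between the prefixes seq1[:i] and seq2[:j]
        if i == 0:
            return j          # insert the j characters of seq2[:j]
        if j == 0:
            return i          # delete the i characters of seq1[:i]
        return min(dist(i - 1, j) + 1,                                   # deletion
                   dist(i, j - 1) + 1,                                   # insertion
                   dist(i - 1, j - 1) + int(seq1[i - 1] != seq2[j - 1])) # (mis)match

    return dist(len(seq1), len(seq2))

print(edit_distance_top_down('ATCGTCC', 'ATGTTATA'))
```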

The function **edit_distance** computes the edit distance of the strings **seq1** and **seq2**, together with an optimal alignment. The helper function **backtracking_alignment** reconstructs the path (i.e. the alignment) that achieves the computed distance (the minimum score).

In [8]:
def edit_distance(seq1, seq2):
    n = len(seq1) + 1
    m = len(seq2) + 1
    
    S = [[0 for j in range(m)] for i in range(n)]
    BACKTRACK = [[None for j in range(m)] for i in range(n)]
    
    for i in range(1, n):
        S[i][0] = i                
        BACKTRACK[i][0] = (i-1, 0)
        
    for j in range(1, m):
        S[0][j] = j                
        BACKTRACK[0][j] = (0, j-1)  
        
    for i in range(1, n):
        for j in range(1, m):
            from_top = S[i-1][j] + 1
            from_left = S[i][j-1] + 1
            from_diagonal = S[i-1][j-1]  + int(seq1[i-1] != seq2[j-1])
                
            S[i][j] = min(from_top, from_left, from_diagonal)
            
            if S[i][j] == from_top:
                BACKTRACK[i][j] = (i-1, j)
            elif S[i][j] == from_left:
                BACKTRACK[i][j] = (i, j-1)
            else:
                BACKTRACK[i][j] = (i-1, j-1)
                
    distance = S[n-1][m-1]
    (seq1_align, seq2_align) = backtracking_alignment(BACKTRACK, n, m, seq1, seq2)
    
    return (distance, seq1_align, seq2_align)

In [9]:
def backtracking_alignment(BACKTRACK, n, m, seq1, seq2):
    i = n - 1
    j = m - 1
    
    seq1_align = ''
    seq2_align = ''
    
    while BACKTRACK[i][j] is not None:
        if BACKTRACK[i][j] == (i-1, j):
            seq1_align = seq1[i-1] + seq1_align
            seq2_align = '-' + seq2_align
            
        elif BACKTRACK[i][j] == (i, j-1):
            seq1_align = '-' + seq1_align
            seq2_align = seq2[j-1] + seq2_align
            
        else:
            seq1_align = seq1[i-1] + seq1_align
            seq2_align = seq2[j-1] + seq2_align
            
        (i, j) = BACKTRACK[i][j]
        
    return (seq1_align, seq2_align)    

In [10]:
seq1 = 'ATCGTCC'
seq2 = 'ATGTTATA'

(distance, seq1_align, seq2_align) = edit_distance(seq1, seq2)

print('Edit distance: ', distance)
print('Alignment:')
print(seq1_align)
print(seq2_align)

Edit distance:  5
Alignment:
ATCGTCC--
AT-GTTATA
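
The printed alignment can be checked against the reported distance by counting its non-matching columns. A minimal sketch, with `alignment_cost` as a hypothetical helper name:

```python
def alignment_cost(seq1_align, seq2_align):
    """Count the columns of a gapped alignment that are not exact matches."""
    if len(seq1_align) != len(seq2_align):
        raise ValueError('aligned strings must have the same length')
    # A column costs 1 when the symbols differ, which covers mismatches
    # as well as gaps ('-' never equals a letter)
    return sum(a != b for a, b in zip(seq1_align, seq2_align))

print(alignment_cost('ATCGTCC--', 'AT-GTTATA'))
```

Here one deletion, two mismatches and two insertions add up to the edit distance printed above.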


# The Needleman-Wunsch algorithm

The previous algorithms penalized every mismatch in an alignment equally. It is also possible to use a so-called scoring matrix, which assigns a separate score to each pair of symbols. For alignments of biological sequences these values are set from empirical probabilities that the corresponding mutation occurs in nature (e.g. some nucleotides/amino acids mutate into certain others more often than into the rest).

<u>Recursive solution:</u>

$S[i, 0] = i \cdot gap\_penalty, \hspace{0.3cm} \forall i$

$S[0, j] = j \cdot gap\_penalty, \hspace{0.3cm} \forall j$

$S[i, j] = max \begin{cases} 
                    S[i-1, j] + gap\_penalty \\ 
                    S[i, j-1] + gap\_penalty \\ 
                    S[i-1, j-1] + match\_score (seq1[i], seq2[j]) 
               \end{cases}$

The most commonly used scoring matrices for amino acid sequences are:
- BLOSUM matrices (BLOcks SUbstitution Matrix)
- PAM matrices (Point Accepted Mutation)
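
In the code that follows, a scoring matrix is represented as a dictionary keyed by pairs of symbols. As a sketch of that layout without any data file, the block below builds a deliberately simple nucleotide scheme. The values +1 and -1 and the name `simple_scores` are assumptions chosen for illustration, not an empirically derived matrix.

```python
from itertools import product

ALPHABET = 'ACGT'

# +1 for identical nucleotides, -1 for any substitution (illustrative values)
simple_scores = {(a, b): (1 if a == b else -1)
                 for a, b in product(ALPHABET, repeat=2)}

print(len(simple_scores))
print(simple_scores[('A', 'A')], simple_scores[('A', 'G')])
```

Any dictionary of this shape can be passed wherever a `match_score` argument is expected.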

In [11]:
BLOSUM_mapping = {}

# 'with' closes the file even if reading fails
with open('data/BLOSUM62.txt', 'r') as input_file:
    lines = input_file.readlines()

i = 0
for line in lines:
    if line.startswith("#"):        # skip comment lines
        continue

    elems = line.split()            # split on any run of whitespace
    if not elems:                   # skip blank lines
        continue

    if i == 0:                      # header line with the amino acid labels
        AAs = elems
    else:
        AA = elems[0]               # row label, followed by one score per column
        for j in range(1, len(elems)):
            BLOSUM_mapping[(AA, AAs[j-1])] = int(elems[j])

    i += 1

In [12]:
BLOSUM_mapping

{('A', 'A'): 4,
 ('A', 'R'): -1,
 ('A', 'N'): -2,
 ('A', 'D'): -2,
 ('A', 'C'): 0,
 ('A', 'Q'): -1,
 ('A', 'E'): -1,
 ('A', 'G'): 0,
 ('A', 'H'): -2,
 ('A', 'I'): -1,
 ('A', 'L'): -1,
 ('A', 'K'): -1,
 ('A', 'M'): -1,
 ('A', 'F'): -2,
 ('A', 'P'): -1,
 ('A', 'S'): 1,
 ('A', 'T'): 0,
 ('A', 'W'): -3,
 ('A', 'Y'): -2,
 ('A', 'V'): 0,
 ('A', 'B'): -2,
 ('A', 'Z'): -1,
 ('A', 'X'): 0,
 ('A', '*'): -4,
 ('R', 'A'): -1,
 ('R', 'R'): 5,
 ('R', 'N'): 0,
 ('R', 'D'): -2,
 ('R', 'C'): -3,
 ('R', 'Q'): 1,
 ('R', 'E'): 0,
 ('R', 'G'): -2,
 ('R', 'H'): 0,
 ('R', 'I'): -3,
 ('R', 'L'): -2,
 ('R', 'K'): 2,
 ('R', 'M'): -1,
 ('R', 'F'): -3,
 ('R', 'P'): -2,
 ('R', 'S'): -1,
 ('R', 'T'): -1,
 ('R', 'W'): -3,
 ('R', 'Y'): -2,
 ('R', 'V'): -3,
 ('R', 'B'): -1,
 ('R', 'Z'): 0,
 ('R', 'X'): -1,
 ('R', '*'): -4,
 ('N', 'A'): -2,
 ('N', 'R'): 0,
 ('N', 'N'): 6,
 ('N', 'D'): 1,
 ('N', 'C'): -3,
 ('N', 'Q'): 0,
 ('N', 'E'): 0,
 ('N', 'G'): 0,
 ('N', 'H'): 1,
 ('N', 'I'): -3,
 ('N', 'L'): -3,
 ('N', 'K'): 0,
 (

In [13]:
gap_penalty = BLOSUM_mapping[('A', '*')]
gap_penalty

-4

In [14]:
def needleman_wunsch(seq1, seq2, match_score, gap_penalty):
    n = len(seq1) + 1
    m = len(seq2) + 1
    
    S = [[0 for j in range(m)] for i in range(n)]
    BACKTRACK = [[None for j in range(m)] for i in range(n)]
    
    for i in range(1, n):
        S[i][0] = S[i-1][0] + gap_penalty
        BACKTRACK[i][0] = (i-1, 0)
        
    for j in range(1, m):
        S[0][j] = S[0][j-1] + gap_penalty
        BACKTRACK[0][j] = (0, j-1)
        
    for i in range(1, n):
        for j in range(1, m):
            from_top = S[i-1][j] + gap_penalty
            from_left = S[i][j-1] + gap_penalty
            from_diagonal = S[i-1][j-1] + match_score[(seq1[i-1], seq2[j-1])]
                
            S[i][j] = max(from_top, from_left, from_diagonal)
            
            if S[i][j] == from_top:
                BACKTRACK[i][j] = (i-1, j)
            elif S[i][j] == from_left:
                BACKTRACK[i][j] = (i, j-1)
            else:
                BACKTRACK[i][j] = (i-1, j-1)
                
    score = S[n-1][m-1]
    seq1_align, seq2_align = backtracking_alignment(BACKTRACK, n, m, seq1, seq2)
    
    return (score, seq1_align, seq2_align)           

In [15]:
seq1 = 'ATCGTCC'
seq2 = 'ATGTTATA'

(score, seq1_align, seq2_align) = needleman_wunsch(seq1, seq2, BLOSUM_mapping, gap_penalty)

print('Score: ', score)
print('Alignment:')
print(seq1_align)
print(seq2_align)

Score:  8
Alignment:
ATCGT-C-C
AT-GTTATA
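
The reported score can be re-derived from the printed alignment by summing its columns. The sketch below uses a hypothetical helper `alignment_score`. To stay self-contained it lists only the four pairs that occur in this alignment, with values copied by hand from BLOSUM62; treat them as an assumption and compare them with `BLOSUM_mapping` loaded from the file.

```python
def alignment_score(seq1_align, seq2_align, match_score, gap_penalty):
    """Sum the column scores of a gapped alignment."""
    total = 0
    for a, b in zip(seq1_align, seq2_align):
        if a == '-' or b == '-':
            total += gap_penalty            # a gap in either row
        else:
            total += match_score[(a, b)]    # substitution score for the pair
    return total

# Hand-copied BLOSUM62 entries for the pairs in the alignment above
# (assumed values, to be checked against BLOSUM_mapping)
pair_scores = {('A', 'A'): 4, ('T', 'T'): 5, ('G', 'G'): 6, ('C', 'A'): 0}

print(alignment_score('ATCGT-C-C', 'AT-GTTATA', pair_scores, -4))
```

With three gap columns at -4 each, the column scores sum to the value printed by `needleman_wunsch`.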


# The Needleman-Wunsch algorithm with better space complexity

The Needleman-Wunsch algorithm has both time and space complexity **O(n $\times$ m)**, where n and m are the lengths of the sequences. The space complexity can be reduced to linear (O(m) when filling by rows) by keeping only the current and the previous row (or column) instead of the whole matrix S, because the values in the current row depend only on its already filled part and on the previous row. Depending on whether S is filled row by row or column by column, we keep either the current and previous row or the current and previous column, not both!

**However, such an algorithm cannot reconstruct the path (the alignment); it only computes the best score.**

In [16]:
import copy

In [17]:
def needleman_wunsch_score(seq1, seq2, match_score, gap_penalty):
    n = len(seq1) + 1
    m = len(seq2) + 1
    
    # keep only the previous row (S[0]) and the current row (S[1])
    S = [[0 for j in range(m)] for i in range(2)]   
    
    for j in range(1, m):
        S[0][j] = S[0][j-1] + gap_penalty
        
    for i in range(1, n):
        S[1][0] = S[0][0] + gap_penalty         # first-column entry: its only predecessor is the cell above
        
        for j in range(1, m):
            from_top = S[0][j] + gap_penalty
            from_left = S[1][j-1] + gap_penalty
            from_diagonal = S[0][j-1] + match_score[(seq1[i-1], seq2[j-1])]
                
            S[1][j] = max(from_top, from_left, from_diagonal)
             
        S[0] = copy.copy(S[1])                 # the current row becomes the previous row for the next iteration
        
    return S[0][m-1]                           # S[0] holds the last computed row, also when seq1 is empty

In [18]:
seq1 = 'ATCGTCC'
seq2 = 'ATGTTATA'

score = needleman_wunsch_score(seq1, seq2, BLOSUM_mapping, gap_penalty)

print('Score: ', score)

Score:  8
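
A common variation swaps two row buffers instead of copying one into the other, which avoids an O(m) copy per row. The sketch below is one possible way to write it, not the notebook's implementation: the name `two_row_score` is hypothetical, and the scoring scheme (+1 match, -1 mismatch, gap penalty -2) is an assumption used only to keep the example independent of the BLOSUM file.

```python
from itertools import product

def two_row_score(seq1, seq2, match_score, gap_penalty):
    """Global alignment score using two row buffers of length len(seq2) + 1."""
    m = len(seq2) + 1
    previous = [j * gap_penalty for j in range(m)]   # row 0 of the full table
    current = [0] * m

    for i in range(1, len(seq1) + 1):
        current[0] = i * gap_penalty                 # first column: gaps only
        for j in range(1, m):
            current[j] = max(previous[j] + gap_penalty,
                             current[j - 1] + gap_penalty,
                             previous[j - 1] + match_score[(seq1[i - 1], seq2[j - 1])])
        # swap the buffers; the old 'previous' is fully overwritten next round
        previous, current = current, previous

    return previous[m - 1]    # after the swap, 'previous' holds the last row

# Illustrative scoring scheme (assumed values)
scores = {(a, b): (1 if a == b else -1) for a, b in product('ACGT', repeat=2)}

print(two_row_score('ACGT', 'ACGT', scores, -2))
```

Returning the value from `previous` also covers the case of an empty first sequence, where the loop body never runs.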
