# Multiple sequence alignment

---

## Before Class
In class today we will be implementing a genearlized version of Smith-Waterman algorithm to identify optimal local alignments of multiple.

Prior to class, please do the following:
1. Review slides on sequence alignments in detail
* Focus on how to conceptually translate the algorithm to code

---
Today we will be implementing a generalized form of Smith-Waterman from our previous class. This is a dynamic programming algorithm used for multiple local sequence alignments. For today's class, we have provided the basic implementation of the algorithm but have not populated the functions for 1) scoring the alignment in the or for 2) traceback through the matrix. You will implement these portions of the algorithm in class today.

As a reminder from the slides, the scoring for Smith-Waterman only uses the scores from the positions above, left, and above-left of the current position in the matrix. The main difference today will be dealing with multiple string and the scoring function for the diagonal matrix which is the average of the possible scores given all sequences.

For traceback, you will need to keep track of the direction of the arrows in a matrix and then begin traceback from the maximum value.

---
## Imports

In [1]:
import numpy as np
from itertools import product #iterate bases in two alignments
#this is what you want 

---
## Implement progressive alignment for multiple sequence alignment

In [1]:
#Define the calultation of diag (substitution) score for alignment of a pair of alignments
def compute_diag_score(aln1_chars, aln2_chars, match, mismatch): #treat mismatch & gap as same
    '''
    Calculate diag score by averaging over all base combinations in alignment 1 and alignment 2
    
    Args:
        aln1_chars (str): bases in alignment 1
        aln2_chars (str): bases in alignment 2
    
    Returns:
        diag_score (float): diag score
    '''
    diag_score_temp = 0
    
    #generates all the possible combinations, sets it equal to one variable. 
    combo_generator = product(aln1_chars, aln2_chars) 
    
    #for loop to check match vs mismatch
    for char1, char2 in combo_generator:
        if char1 == char2:
            #want to continually add to that one variable to finally take the average. 
            diag_score_temp += match 
        if char1 != char2:
            diag_score_temp += mismatch 
            
    #final diag score calculation 
    #taking the average here. 
    #need the total number of chars for each alignment. 
    diag_score = diag_score_temp / (len(aln1_chars) + len(aln2_chars))
    
    return diag_score 
                      

In [2]:
test1 = [1,2,3,4,5,6,7,8,9,10]
test2 = [11,12,13,14,15,16,17,18,19,20]

In [80]:
# final = product(test1, test2)
# for test1, test2 in final:
#     if test1 == test2:
#         print("True")
#     if test1 != test2:
#         print("False")

In [81]:
# final = product(test1, test2)
# list(final)

In [3]:
#Modify score calculation function from previous class: 
def cal_score(matrix, aln1, aln2, i, j, match, mismatch, gap):
    '''Calculate score for position (i,j) in scoring matrix, also record move to trace back
    
    Args:
        matrix (numpy array): scoring matrix
        i (int): row number
        j (int): column number
        
    Returns:
        score in position (i,j)    
        move to trace back: 0-END, 1-DIAG, 2-UP, 3-LEFT
    Pseudocode:
        aln1_chars = bases of all seqs in alignment 1 in position (i,j)
        aln2_chars = bases of all seqs in alignment 2 in position (i,j)
        calculate scores based on upper-left, up, left neighbors:
        diag_score = compute_diag_score(aln1_chars,aln2_chars)
        up_score = ...
        left_score = ...
        take the maximum:
        score = max(0, diag_score, up_score, left_score)
        move = ...
        
    '''
    score = 0
    move = 0
    
    #aln1_chars = bases of all seqs in alignment 1 in position (i,j)
    #aln2_chars = bases of all seqs in alignment 2 in position (i,j)
    
    for seq in aln1:
        #put it into a list and then join into a string. 
        aln1_temp = [seq[i-1]]
        aln1_chars = "".join(aln1_temp) 
    
    for seq in aln2:
        #put this into a list and join into a string. 
        aln2_temp = [seq[j - 1]]
        aln2_chars = "".join(aln2_temp)
        
    #now calculate each respective score 
    diag_score = matrix[i-1][j-1] + compute_diag_score(aln1_chars, aln2_chars, match, mismatch)
    up_score = matrix[i-1][j] + gap 
    left_score = matrix[i][j-1] + gap 
    
    
    #will take the max of all three scores. 
    score = max(0, diag_score, up_score, left_score) 
    
    #finally do same thing with move. make sure they're the same order and an index will be assigned to each element in the list. 
    move = np.argmax([0, diag_score, up_score, left_score]) 
    
    return score, move 
    
    

In [4]:
test = ["a", "b", "c"]
test2 = "".join(test)
test2

'abc'

In [70]:
test

['a', 'b', 'c']

In [71]:
test3 = np.argmax(test) #assigns an index to each variable and returns the max numerical value. 
test3

2

In [72]:
test22 = np.argmax([0, "hi", "you", "me", "bye"])
print(test22)

2


In [26]:
test1 = [1,2,3,4,5]
test2 = [2,3,4,5,6]
final = zip(test1, test2)
list(final)

[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

In [30]:
aligned_test1 = [[] for index in range(len(test1))]
aligned_test1

[[], [], [], [], []]

In [32]:
aligned_test2 = [[] for index in range(len(test2))]
aligned_test2

[[], [], [], [], []]

In [33]:
aln1 = ["A", "T", "C", "G", "R"]

In [37]:
test = zip(aligned_test1, aln1)
list(test)

[([], 'A'), ([], 'T'), ([], 'C'), ([], 'G'), ([], 'R')]

In [41]:
for x, j in test:
    aligned_test1.append(j)
print(aligned_test1)

[[], [], [], [], []]


In [44]:
#Hint: use zip()
def traceback(aln1, aln2, traceback_matrix, maximum_position):
    '''Find the opmital path through scoring matrix
        
        diagonal: match/mismatch
        up: gap in aln1
        left: gap in aln2
        
    Args:
        score_matrix (numpy array): scoring matrix
        start_row, start_col: starting position (i.e. max_pos) to trace back
        
    Returns:
        aln_final (array of str): results of multiple sequence alignment (e.g. ['GTTGAC','GTT-AC','GTTG-C'])
        
    Pseudocode:
        #Initialize alignment results for aln1 and aln2
        aligned_aln1 = [[] for i in range(len(aln1))]
        aligned_aln2 = [[] for i in range(len(aln2))]
        while current_move != END:
            current_move = traceback_matrix[current_row][current_col]
            if current_move == DIAG:
                for each element 
                ...
            elif current_move == UP:
                ...
            elif current_move == LEFT:
                ...
            
    '''
    
    #initialize alignment results for aln1 and aln2
    #basically for each index in the range of the length of aln1, place an emtpy list within this one large list. 
    aligned_aln1 = [[] for index in range(len(aln1))]
    aligned_aln2 = [[] for index in range(len(aln2))]
    gap = "-"
    aln_final = []
    test = zip(aligned_aln1, aln1)
    print(aligned_aln2)
    # list(test)
    
    #need to define current_row and current_col 
    current_row = maximum_position[0]
    current_col = maximum_position[0]
    current_move = None #this will keep track of where we're at when we do the if statements
    
    #map the numbers of the move:
    END, DIAG, UP, LEFT = range(4) #0, 1, 2, 3 pretty cool!! 
    
    while current_move != END:
        #setting the current_move to equal the (0,0) position in the traceback_matrix 
        current_move = traceback_matrix[current_row][current_col]
        if current_move == DIAG:
            for aligned_seq, actual_seq in zip(aligned_aln1, aln1): #putting each alinged seq base next to the empty dictionary. ([], A), ([], C) 
                aligned_aln1.append(actual_seq[current_row - 1]) #obviously 
            for alinged_seq, actual_seq in zip(aligned_aln2, aln2):
                aligned_aln2.append(actual_seq[current_col - 1]) 
            
            #update the current_moves
            current_row = current_row - 1
            current_col = current_col - 1
            
        elif current_move == UP:
            for alinged_seq, actual_seq in zip(aligned_aln1, aln1):
                aligned_aln1.append(actual_seq[current_row - 1]) 
            for aligned_seq, actual_seq in zip(aligned_aln2, aln2):
                aligned_aln2.append(gap) 
            
            #update the current_moves 
            current_row = current_row = 1
            #current_col does not change so leave it be. 
            
        elif current_move == LEFT:
            for aligned_seq, actual_seq in zip(aligned_aln1, aln1):
                aligned_aln1.append(gap) 
            for aligned_seq, actual_seq in zip(aligned_aln2, aln2):
                aligned_aln2.append(actual_seq[current_col - 1]) 
            
            #update current moves 
            current_col = current_col - 1
            #leave current_row alone. 
            
        aligned_aln1 = ["".join(aligned_seq[::-1]) for aligned_seq in aligned_aln1]
        aligned_aln2 = ["".join(aligned_seq[::-1]) for aligned_seq in aligned_aln2]
        
        for char1 in aligned_aln1:
            aln_final.append(char1)
        for char2 in aligned_aln2:
            aln_final.append(char2) 
            
    
    return aln_final 
            

In [45]:
#Generalize Smith-Waterman algorithm for a pair of alignments
def SmithWaterman_generalized(aln1, aln2, match=3, mismatch=-3, gap=-2):
    '''Smith-Waterman algorithm for local alignment, generalized for a pair of alignments
    
    Args:
        seq1 (array of strs): input alingment 1 (e.g. ['GTTGAC','GTT-AC'])
        seq2 (array of strs): input alingment 2 (e.g. ['GTTGAC','GTT-AC'])
        match: default = 3
        mismatch: default = -3
        gap: default = -2
    
    Returns:
        results of multiple sequence alignment (array of strs) 
        score_matrix (numpy array): scoring matrix
    '''
    
    
    num_rows = len(aln1[0]) +1
    num_cols = len(aln2[0]) +1
    score_matrix = np.zeros(shape=(num_rows,num_cols), dtype=float) #diag scores can be float
    traceback_matrix = np.zeros(shape=(num_rows,num_cols), dtype=int)
    max_score = 0
    max_pos = (0,0)

    #Create scoring matrix
    for i in range(1,num_rows):
        for j in range(1,num_cols): #iteration starts from position (1,1)
            score_matrix[i][j], traceback_matrix[i][j] = cal_score(score_matrix, aln1, aln2, i, j, match, mismatch, gap)
            
            # Keep track of maximum position for trackback
            if score_matrix[i][j] > max_score:
                max_score = score_matrix[i][j]
                max_pos = (i,j)
    
    #Traceback the optimal path through scoring matrix
    aln_final = traceback(aln1, aln2, traceback_matrix, max_pos)
    
    return aln_final, score_matrix

In [46]:
aligned_seq1 = 'GTTGAC'
aligned_seq2 = 'GTT-AC'
aln1 = ['GTTGAC', 'GTT-AC']
aln2 = ['AGTTGCG']

In [47]:
SmithWaterman_generalized(aln1, aln2)

[[]]


(['', '', 'T', 'T', '', 'T', '', '', 'T', 'T', '', 'T'],
 array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
        [0. , 0. , 1.5, 0. , 0. , 1.5, 0. , 1.5],
        [0. , 0. , 0. , 3. , 1.5, 0. , 0. , 0. ],
        [0. , 0. , 0. , 1.5, 4.5, 2.5, 0.5, 0. ],
        [0. , 0. , 0. , 0. , 2.5, 3. , 1. , 0. ],
        [0. , 1.5, 0. , 0. , 0.5, 1. , 1.5, 0. ],
        [0. , 0. , 0. , 0. , 0. , 0. , 2.5, 0.5]]))