# Alignment 2: Smith-Waterman Algorithm

The Needleman-Wunsch algorithm finds the optimal **global** alignment of two given sequences.

Often, however, a local alignment is more desirable than a global one. Suppose you are mapping a short sequence (no more than 100 letters) onto a long one (say, 1000 letters). If the short sequence has an exact match inside the longer sequence, that match is the alignment you want to find. The global Needleman-Wunsch algorithm, however, must also align the letters outside the exactly matching region, and these can influence the result so that a different alignment is often produced.

~~~
sequence1 is 
sequence2 is
local alignment:
nw (global) alignment: check the previous notebook
~~~
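
To make the cost of a global alignment concrete, here is a minimal sketch that computes only the optimal global score, not the alignment itself. The helper name `nw_score` is hypothetical, and the scoring (match 1, mismatch -1, linear gap -2) is assumed to be the same as the defaults used in the implementation later in this notebook.

```python
def nw_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Optimal global (Needleman-Wunsch) score with a linear gap penalty."""
    m, n = len(seq1), len(seq2)
    # first row: aligning the empty prefix of seq1 with j letters of seq2 costs j gaps
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        # first column: i letters of seq1 aligned against nothing costs i gaps
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # diagonal: align two letters
                          prev[j] + gap,      # gap in seq2
                          curr[j - 1] + gap)  # gap in seq1
        prev = curr
    return prev[n]

# The short sequence occurs exactly inside the long one, yet the 8 flanking
# letters must each be paired with a gap: 7 matches + 8 gaps = 7 - 16.
print(nw_score('ACGGTAC', 'TTTTACGGTACTTTT'))  # -9
```

Under these assumptions the exact 7-letter match alone would score 7, but the global score is dominated by the penalties for the flanking letters.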

The Smith-Waterman algorithm finds the optimal **local** alignment of two given sequences. It can be seen as a variation of the Needleman-Wunsch algorithm, with two main differences:

- If the value of a cell in the score matrix is negative, it is set to zero;
- The trace back begins from the cell that has the highest score, and ends when a cell whose score is 0 is encountered.
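
The two differences listed above can be sketched in a few lines that fill only the score matrix. This is an illustrative sketch, not the implementation below: the name `sw_score_matrix` is hypothetical, plain Python lists are used instead of NumPy, and ties for the highest score are resolved in favour of the first cell found.

```python
def sw_score_matrix(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Fill a Smith-Waterman score matrix; return it with the best score and its cell."""
    m, n = len(seq1), len(seq2)
    # first row and column stay at zero: a local alignment may start anywhere
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # difference 1: negative values are clamped to zero
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            # difference 2: remember the highest-scoring cell, where the traceback starts
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell

H, best, cell = sw_score_matrix('ACGGTAC', 'TTTTACGGTACTTTT')
print(best, cell)  # 7 (7, 11)
```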

## Implementation

In [None]:
import numpy as np

In [None]:
def smith_waterman(seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> None:
    '''
    Smith-Waterman local alignment algorithm implementation; prints the alignment,
    the score matrix and the path matrix.
    Args:
        seq1 (str): first sequence, it will appear on the first line of the alignment
        seq2 (str): second sequence
        match (int): score for a match
        mismatch (int): score for a mismatch
        gap (int): score for a gap, usually negative
    '''
    
    m, n = len(seq1), len(seq2)
    # integer matrices: all scores and path codes are whole numbers
    score_matrix = np.zeros((m+1, n+1), dtype=int)
    path_matrix = np.zeros((m+1, n+1), dtype=int)
    
    path = {
        'diagonal': 1,
        'down': 2,
        'right': 3
    }
    
    # initialize: unlike Needleman-Wunsch, the first row and column of the score matrix are all zeros
    score_matrix[:, 0] = np.zeros(m+1)
    score_matrix[0, :] = np.zeros(n+1)
    path_matrix[1:, 0] = np.full(m, path['down'])
    path_matrix[0, 1:] = np.full(n, path['right'])
    score_max, max_i, max_j = 0, 0, 0
    
    # fill the matrices
    for i in range(1, m+1):
        for j in range(1, n+1):
            if seq1[i-1] == seq2[j-1]: 
                score_diagonal = score_matrix[i-1, j-1] + match
            else:
                score_diagonal = score_matrix[i-1, j-1] + mismatch
            score_down = score_matrix[i-1, j] + gap
            score_right = score_matrix[i, j-1] + gap
            score = max(score_diagonal, score_down, score_right)
            if score >= score_max:
                score_max = score
                max_i = i
                max_j = j
            score_matrix[i, j] = max(score, 0)
            
            # when multiple paths are available, priority: diagonal > down > right
            if score >= 0:
                if score == score_diagonal:
                    path_matrix[i, j] = path['diagonal']
                elif score == score_down:
                    path_matrix[i, j] = path['down']
                elif score == score_right:
                    path_matrix[i, j] = path['right']
    
    # trace back
    trace_i, trace_j = max_i, max_j
    alignment_1, alignment_2, alignment_3 = [], [], [] # print in three rows
    alignment_1.append(' '+str(trace_i))
    alignment_2.append('  ')
    alignment_3.append(' '+str(trace_j))
    while trace_i > 0 or trace_j > 0:
        if score_matrix[trace_i, trace_j] == 0:
            break
        if path_matrix[trace_i, trace_j] == path['diagonal']:
            alignment_1.append(seq1[trace_i-1])
            if seq1[trace_i-1] == seq2[trace_j-1]:
                alignment_2.append('|')
            else:
                alignment_2.append(' ')
            alignment_3.append(seq2[trace_j-1])
            trace_i -= 1
            trace_j -= 1
        elif path_matrix[trace_i, trace_j] == path['down']:
            alignment_1.append(seq1[trace_i-1])
            alignment_2.append(' ')
            alignment_3.append('-')
            trace_i -= 1
        elif path_matrix[trace_i, trace_j] == path['right']:
            alignment_1.append('-')
            alignment_2.append(' ')
            alignment_3.append(seq2[trace_j-1])
            trace_j -= 1
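    # pad the start indices to a common width so the three rows stay aligned
    # even when trace_i and trace_j have a different number of digits
    width = max(len(str(trace_i)), len(str(trace_j))) + 1
    alignment_1.append(str(trace_i).ljust(width))
    alignment_2.append(' ' * width)
    alignment_3.append(str(trace_j).ljust(width))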
    
    # print the result
    alignment_1.reverse()
    alignment_2.reverse()
    alignment_3.reverse()
    alignment = ''.join(alignment_1) + '\n' + ''.join(alignment_2) + '\n' + ''.join(alignment_3) + '\n'
    print(alignment)
    print('score matrix')
    print(score_matrix)
    print('path matrix')
    print(path_matrix)

## Tests

In [None]:
seq1 = 'needleman'
seq2 = 'neadlmen'
smith_waterman(seq1, seq2)

In [None]:
smith_waterman(seq1, seq2, gap=-1)

In [None]:
seq3 = 'ACGGTAC'
seq4 = 'TTTTACGGTACTTTT'
smith_waterman(seq3, seq4)
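
Because `smith_waterman` only prints its result, the last test is hard to check programmatically. The sketch below is one possible way to do that, under stated assumptions: `sw_local_alignment` is a hypothetical helper, it uses plain lists instead of NumPy, it assumes a negative `gap`, and it recomputes the traceback direction from the score matrix rather than storing a path matrix. The tie-breaking priority (diagonal, then down, then right) is the same as in the implementation above, but the starting cell is the first highest-scoring cell rather than the last.

```python
def sw_local_alignment(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (aligned1, aligned2, start_i, start_j) for one optimal local alignment."""
    m, n = len(seq1), len(seq2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, i, j = 0, 0, 0
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            sub = match if seq1[a - 1] == seq2[b - 1] else mismatch
            H[a][b] = max(0, H[a - 1][b - 1] + sub, H[a - 1][b] + gap, H[a][b - 1] + gap)
            if H[a][b] > best:
                best, i, j = H[a][b], a, b

    # trace back from the best cell until a zero score is reached
    out1, out2 = [], []
    while H[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:    # diagonal: two letters aligned
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:      # down: gap in seq2
            out1.append(seq1[i - 1])
            out2.append('-')
            i -= 1
        else:                                   # right: gap in seq1
            out1.append('-')
            out2.append(seq2[j - 1])
            j -= 1
    # i and j are now the 0-based start offsets of the aligned region
    return ''.join(reversed(out1)), ''.join(reversed(out2)), i, j

print(sw_local_alignment('ACGGTAC', 'TTTTACGGTACTTTT'))
# ('ACGGTAC', 'ACGGTAC', 0, 4)
```

The local alignment recovers the exact match and reports that it starts at offset 4 of the longer sequence, which is the behaviour the introduction asks for.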