## Session 2: Sequence alignments

In this exercise worksheet, we will write our very own (rudimentary, but functional) sequence aligner. It will consist of two pieces:
1. An exact matching step to find series of short exact matches, or "seeds".
2. An inexact matching step to extend the seeds and find gapped alignments.

We will then use it to align an unknown gene from a bat coronavirus to the genome of Sars-cov-2 and identify which gene it corresponds to.

### Exercise 1: Finding exact matches

Here we will implement a suffix array to quickly lookup the positions of short sequences in a larger one.

As we saw in the lecture, there are many ways to identify exact matches between two sequences. Different methods can have different performances, in terms of speed and memory requirements. Suffix arrays allow searches in 

**a) Write the `build_suffix_array` function.**

**b) Write the `search_suffix_array` function**

**c) What is the time complexity of the building process ?**

In [1]:

def build_suffix_array(target_seq):
    """
    Given an input DNA sequence, this function should generate
    The corresponding suffix array.
    >>> build_suffix_array("ACATG")
    [0, 2, 1, 4, 3]
    """
    suffix_array = [0] * len(target_seq)
    # For each n we generate a tuple (suffix, index) -> O(n)
    suffixes = [(target_seq[i:], i) for i in range(len(target_seq))]
    # We sort the tuples according to their suffix. Sorting means O(n log n)
    # comparisons. Each suffix comparisons involves at most O(n) operations
    # -> O(n * n log n) = O(n^2 log n)
    for i, (_, rank) in enumerate(sorted(suffixes)):
             suffix_array[i] = rank
    
    return suffix_array



def search_suffix_array(target_seq, suffix_array, query_seq):
    """
    This function should return the position at which query_seq
    is found in target_seq, using the suffix array of target_seq.
    >>> search_suffix_array("ACATG", [0, 2, 1, 4, 3], "CAT")
    1
    >>> search_suffix_array("ACATG", [0, 2, 1, 4, 3], "PPP")
    -1
    """
    
    def cmp_str(q, t):
        """
        Compare two strings element-by-element.
        If q is (alphabetically) smaller than t, returns -1.
        If it is larger than t, return 1, and if they are equal,
        return 0.
        >>> cmp_str('AACT', 'AAG')
        -1
        """
        i = 0
        while i < len(q):
            if q[i] < t[i]:
                return -1
            elif q[i] > t[i]:
                return 1
            i += 1
        return 0
    
    m = len(query_seq)
    n = len(target_seq)            
    left, right = 0, n - 1
    # We use binary search with character-based comparison
    while left <= right:
        middle = (left + right) // 2
        suffix_idx = suffix_array[middle]
        comp = cmp_str(query_seq, target_seq[suffix_idx:suffix_idx+m])
        if not comp:
            return suffix_idx
        if comp > 0:
            left = middle + 1
        elif comp < 0:
            right = middle - 1
    # Return -1 if not found
    return -1

In [2]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
assert build_suffix_array("ACATAGGGGATT") == [0, 4, 2, 9, 1, 8, 7, 6, 5, 11, 3, 10], 'The suffix array is wrong !'
assert search_suffix_array("ACATG", [0, 2, 1, 4, 3], "CAT") == 1, 'The suffix array search yields a wrong index'
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!




**d) Can you estimate how long it would take your function to generate the suffix array for the chromosome 1 of the human genome (250Mbp) ?**
> Hint: Once you know the time complexity of `build_array`, you could use something like scipy.optimize.fit_curve() to estimate the parameters. Then you could extrapolate values from the function.

In [3]:
%matplotlib notebook
import matplotlib.pyplot as plt

# In this cell, we use %timeit to measure the runtime of "build_suffix_array" with different input sizes
# We then use matplotlib to visualise the results. %timeit is an iPython "magic method".
# It is available in jupyter notebooks, but not in regular python scripts.
# More about magic methods at: https://ipython.readthedocs.io/en/stable/interactive/magics.html

# We use exponentially growing sequences so that we have closely space values for short sequences
# while also measuring time for a few very long sequences

# The input sizes to test
seq_lengths = [int(1.07**i) for i in range(10, 180, 5)]
run_time = [0] * len(seq_lengths)

for i, n in enumerate(seq_lengths):
    # Note we generate a sequence of only "A". This is the worst case. Can you guess why ?
    # This means our estimates will be very pessimistic
    seq = "A" * n
    # %timeit will run each "build_suffix_array" call 3x10 times
    # It will keep the best of 3 repeats, and return the average of 20 times in it
    # Note: If this takes too much time on your computer, you can reduce -n and -r
    t = %timeit -q -n10 -r3 -o build_suffix_array(seq)
    run_time[i] = t.average

plt.plot(seq_lengths, run_time)
plt.scatter(seq_lengths, run_time)
plt.title("Run time of build_suffix_array")
plt.xlabel("Sequence length [bp]")
plt.ylabel("Run time [s]")
                    

<IPython.core.display.Javascript object>

Text(0, 0.5, 'Run time [s]')

In [4]:
from scipy.optimize import curve_fit
import numpy as np
import math

def time_complexity(x, a, b):
    """Based on the number of operations, we can know the time complexity function is O(n^2 log n)"""
    return a * x**2 * np.log(x) + b


# We use the curve_fit function to find optimal values of the b and c parameters
(a, b), _ = curve_fit(time_complexity, np.array(seq_lengths), np.array(run_time))

In [5]:
%matplotlib notebook

extra_lengths = range(100, 250000000, 10000)
plt.plot(extra_lengths, time_complexity(np.array(extra_lengths), a, b))
plt.title(f"{a:.2E} * n^2 * log(n) + {b:.2E})")
plt.show()
print(f"Construction time for human chr1: {time_complexity(250000000, a, b) / (3600 * 24):.2f} days")

<IPython.core.display.Javascript object>

Construction time for human chr1: 76.54 days


**Bonus (not part of the exercises):**

Here we compare the performances of an optimized suffix array construction algorithm with our naive implementation, and estimate how long it would take it to index human chromosome 1.

The performance differences emphasize the importance of reducing the number of operations when designing algorithm. By switching from the naive to the optimized implementation, we go from an estimated time of 72 days to index human chromosome 1, to about 40 minutes.

In [6]:
# If you are interested in suffix arrays, the optimised implementation below is
# taken from: https://louisabraham.github.io/notebooks/suffix_arrays.html
# The author discusses various optimisations for suffix array implementations

from itertools import zip_longest, islice

def val_to_rank(l):
    """
    l: iterable of keys
    returns: a list with the rank of input keys (integers)
    >>> val_to_rank('BAC')
    [1, 0, 2]
    >>> val_to_rank([10, 32, 1])
    [1, 2, 0]
    """
    seen_keys = set()
    unique_keys = []
    for e in l:
        if not e in seen_keys:
            unique_keys.append(e)
            seen_keys.add(e)
    # Sort keys by value
    unique_keys.sort()
    # Map key values to their rank (between 0 and the number of unique keys)
    key_to_rank = {key: rank for rank, key in enumerate(unique_keys)}
    # Compute the array of ranks from the input
    ranks = [key_to_rank[k] for k in l]
    return ranks


def fast_suffix_array(s):
    """
    Build the suffix array of s in O(n * log(n)^2).
    
    This implementation uses two optimisations: 
    - prefix doubling: Only the prefixes of suffixes are compared.
      Their length is doubled on each iteration. Prefixes are converted to ranks
      at every iteration, so we do not need to perform string comparisons.
    - tuple to integer conversion: (a, b) -> a * (n + 1) + b + 1. This guarantees that
      the rank "a" will have more weight in the comparison than rank "b". Same as if it
      was in the first position of the tuple (a, b).
      
    With this algorithm, there will be at most log(n) doubling iterations.
    For each doubling iteration, we need to sort integers (inside val_to_rank),
    which takes O(n log n).
    Therefore the total number of operations is O(n * log n * log n) = O(n (log n)^2)
    
    # Here is a visualisation of prefix doubling (without tuple to int conversion):
    
    BANANA # Input
    102020 # Convert values to ranks
    02020$ # Shift to the left of 2^0 (Fill with smallest possible character, here using $)
    (1, 0), (0, 2), (2, 0), (0, 2), (2, 0), (0, $) # Merge rank of i and i+2^0 into a tuple
    213130 # Rank tuples
    3130$$ # Shift to the left of 2^1
    (2, 3), (1, 1), (3, 3), (1, 0), (3, $), (0, $) # Merge rank of i and i+2^1 into tuples
    325140 # Rank tuples
    # Max(array) == len(array - 1) -> every rank is unique -> suffix array resolved: We can stop !
    
    """
    n = len(s)
    k = 1
    # Convert input letters to their ranks (A->0, C->1, G->2, T->3)
    line = val_to_rank(s)
    # Keep iterating while the array is not fully sorted
    # (i.e. no duplicate rank -> as many ranks (max line) as indicees (len(s)))
    while max(line) < n - 1:
        # For iteration k, the array is shifted to the left of 2^k positions
        # Each element i is combined into an int with element i+2^k
        # The resulting combined keys are ranked again
        line = val_to_rank(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        k <<= 1 # Bitshift operator, k will be 1, 2, 4, 8, 16...
    return line

# The input sizes to test
fast_run_time = [0] * len(seq_lengths)

for i, n in enumerate(seq_lengths):
    seq = "A" * n
    # %timeit will run each "build_suffix_array" call 3x10 times
    # It will keep the best of 3 repeats, and return the average of 20 times in it
    # Note: If this takes too much time on your computer, you can reduce -n and -r
    t = %timeit -q -n10 -r3 -o fast_suffix_array(seq)
    fast_run_time[i] = t.average


In [11]:
%matplotlib notebook
# Visual comparison of runtime between naive and fast construction algorithms

plt.plot(seq_lengths, run_time, c='r', label='naive')
plt.scatter(seq_lengths, run_time, c='r')
plt.plot(seq_lengths, fast_run_time, c='g', label='fast')
plt.scatter(seq_lengths, fast_run_time, c='g')
plt.title("Run time of fast_suffix_array")
plt.xlabel("Sequence length [bp]")
plt.ylabel("Run time [s]")
plt.legend()

<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0x7f1419ab2910>

In [12]:
%matplotlib notebook
# Extrapolation to the size of human chr1

def fast_time_complexity(x, fa, fb):
    """Based on the number of operations, we can know the time complexity function is O(n^2 log n)"""
    return fa * x * np.log(x)**2 + fb


# We use the curve_fit function to find optimal values of the b and c parameters
(fa, fb), _ = curve_fit(fast_time_complexity, np.array(seq_lengths), np.array(fast_run_time))
extra_lengths = range(100, 250000000, 10000)
plt.plot(
    extra_lengths, time_complexity(np.array(extra_lengths), a, b), c='r',
    label=f'naive: ${a:.2E} \cdot n^2 log(n) + {b:.2E})$'
)
plt.plot(
    extra_lengths, fast_time_complexity(np.array(extra_lengths), fa, fb), c='g', 
    label=f'fast: ${fa:.2E} \cdot n log(n)^2 + {fb:.2E})$')
plt.title(f"Naive vs fast SA construction: extrapolation")
plt.axvline(250000000, ls=':')
plt.legend()
plt.show()
print(f"Construction time for human chr1: {fast_time_complexity(250000000, fa, fb) / 3600:.2f} hours")

<IPython.core.display.Javascript object>

Construction time for human chr1: 0.65 hours


### Exercise 2: Finding seeds

Many sequence alignment algorithms start by identifying regions with small exact matches to start their inexact alignments. Here we will implement a simple seeding algorithm using the suffix array above.

The target and query sequences you will be using are real biological sequences. The target is the Sars-cov-2 genome (isolate Wuhan-Hu-1). The query is a viral gene from a bat coronavirus. You will writing a home-made aligner to identify the which gene the query is.


**a) Extract k-mers from the query and feed them to the suffix array to find their position in the sequence**

**b) Merge exact matches closer than A to each other into a single seed.**

**c) Extract a subsequence of 300bp around each seed. We will use them in exercise 3.**

In [6]:
from Bio import SeqIO
target = str(next(SeqIO.parse('data/session_2_target.fasta', format='fasta')).seq) # Sars-cov-2 genome
query = str(next(SeqIO.parse('data/session_2_query.fasta', format='fasta')).seq) # Unknown bat coronavirus gene

In [7]:
def extract_kmers(seq, k=7):
    """
    Extract all unique k-mers in the input sequence.
    >>> extract_kmers("ACATGA", k=3)
    ("ACA", "CAT", "ATG", "TGA")
    """
    kmers = set()
    
    for i in range(len(seq) - (k - 1)):
        kmers.add(seq[i:i+k])
    
    return kmers

def lookup_kmers(target, suffix_array, kmers):
    """
    Given a target sequence, its suffix array and a
    list of kmers (words), identify the position of
    exact matches in the target sequence for each k-mer.
    >>> lookup_kmers("ACATG", [0, 2, 1, 4, 3], ["CAT", "ATG"])
    [1, 2]
    """
    pos_seeds = []
    for kmer in kmers:
        pos_seeds.append(search_suffix_array(target, suffix_array, kmer))

    return pos_seeds

def merge_seeds(positions, A=100):
    """
    Given a list of sorted positions for exact matches, group all those closer than A into
    a single seed and return its middle position.
    >>> merge_seeds([0, 1, 2, 100, 103, 1000], A=10)
    [1, 101, 1000]
    """
    seeds = []
    current_seed = []
    n = len(positions)
    # Initialize seed with the first position
    current_seed.append(positions[0])
    # Add a meaningless value to allow iterating on last position
    positions.append(-1)
    # Loop over positions of each exact match
    for i in range(1, n + 1):
        # If exact match is too far from the previous one,
        # store and flush current seed and start a new one
        if ((positions[i] - positions[i-1]) > A) or (i == n):
            # Compute the middle position of the current seed
            middle = (min(current_seed) + max(current_seed)) // 2
            # Add the middle position to the output seeds
            seeds.append(int(middle))
            # Flush positions in current seed
            current_seed = []
        current_seed.append(positions[i])
    
    
    return seeds



In [8]:
m, n = len(query), len(target)
extend_len = 300
kmers = extract_kmers(query, k=17) # Extract all unique k-mers from the query
suffix_array = build_suffix_array(target) # Build the target's suffix array
pos_kmers = lookup_kmers(target, suffix_array, kmers) # Find the position of query k-mers in target
# Remove negative values (k-mers absent from target)
pos_kmers = sorted([p for p in pos_kmers if p >= 0])
pos_seeds = merge_seeds(pos_kmers, A=200)

seq_seeds = []
print("Seed spans: ")
for pos in pos_seeds:
    start = max(0, pos - extend_len)
    end =   min(n, pos + extend_len)
    print(f"{start}-{end}")
    seq_seeds.append(target[start:end])

Seed spans: 
21563-22163
22396-22996
23662-24262
24238-24838
24775-25375
25042-25642


In [9]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
k_set = set(["ACA", "CAT", "ATG", "TGA"])
k_remain = set(["ACA", "CAT", "ATG", "TGA"])
for k in extract_kmers("ACATGA", k=3):
    k_remain.remove(k)
    if k not in k_set:
        raise ValueError(f"Wrong k-mer found: {k}")
if len(k_remain):
    raise ValueError(f"Missing k-mers from the extracted set: {k_remain}")
assert lookup_kmers("ACATG", [0, 2, 1, 4, 3], ["CAT", "ATG"]) == [1, 2], 'K-mers reported at wrong positions'
assert merge_seeds([0, 1, 2, 100, 103, 1000], A=10) == [1, 101, 1000], 'Seeds are not merged as expected'
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!


### Exercise 3

In practice, we usually want to find inexact matches. Here we will focus on finding local alignments using dynamic programming.

**a) Implement the Smith Waterson algorithm. Some of the code is already written to help you, look at the commments for hints and help.**


In [10]:
# This cell contains the meat of the inexact alignment algorithm.
# The code has been partially written, but critical parts are missing from the functions
# Try to fill up the missing parts based on what we saw in the dynamic programming lecture

def fill_score_matrix(query, target):
    '''
    Creates the dynamic-programming matrix for local sequence alignment.
    This matrix is filled with cumulative scores as the pairwise sequence
    alignment progresses.
    >>>fill_score_matrix("CGT", "ACAT")
    [
     [0,0,0,0,0],
     [0,0,2,1,0],
     [0,0,1,1,0],
     [0,0,0,0,3],
    ],
    (3, 4),
    3
    '''
    # Default scores for Smith-Waterman.
    # Note we could use a substitution table instead of a fixed score for match / mismatch
    match = 2
    mismatch = -1
    gap = -1
    # We initialize a scoring matrix which we will fill using dynamic programming
    # Note the matrix has an additional row and column for the gaps (-)
    rows = len(query) + 1
    cols = len(target) + 1
    score_matrix = [[0 for col in range(cols)] for row in range(rows)]

    # Fill the scoring matrix.
    max_score = 0
    max_pos   = None    # The row and column of the highest score in matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            similarity = match if query[i - 1] == target[j - 1] else mismatch
            diag_score = score_matrix[i- 1][j - 1] + similarity
            up_score   = score_matrix[i- 1][j] + gap
            left_score = score_matrix[i][j - 1] + gap
            # The score of position i,j depends on scores of the up, left and up-left neighbours
            score = max(0, diag_score, up_score, left_score)
            if score > max_score:
                max_score = score
                max_pos   = (i, j)

            score_matrix[i][j] = score

    try:
        assert max_pos is not None
    except AssertionError:
        print("No score was above zero")

    return score_matrix, max_pos, max_score


def next_move(score_matrix, x, y):
    """
    This function decides the next traceback move based
    on the score values around input position (x, y).
    It will return a movement encoded as a number:
    - 0: END
    - 1: DIAG
    - 2: UP
    - 3: LEFT
    """
    diag = score_matrix[x - 1][y - 1]
    up   = score_matrix[x - 1][y]
    left = score_matrix[x][y - 1]
    if diag >= up and diag >= left:     # Tie goes to the DIAG move.
        return 1 if diag != 0 else 0    # 1 signals a DIAG move. 0 signals the end.
    elif up > diag and up >= left:      # Tie goes to UP move.
        return 2 if up != 0 else 0      # UP move or end.
    elif left > diag and left > up:
        return 3 if left != 0 else 0    # LEFT move or end.
    else:
        # Execution should not reach here.
        raise ValueError('invalid move during traceback')


def traceback(query, target, score_matrix, start_pos):
    '''Find the optimal path through the matrix.

    This function traces a path from the bottom-right to the top-left corner of
    the scoring matrix. Each move corresponds to a match, mismatch, or gap in one
    or both of the sequences being aligned. Moves are determined by the score of
    three adjacent squares: the upper square, the left square, and the diagonal
    upper-left square.

    WHAT EACH MOVE REPRESENTS
        diagonal: match/mismatch
        up:       gap in sequence 1
        left:     gap in sequence 2
    
    score_matrix is the filled score matrix, start_pos is the position from which
    the traceback will start. This should be the position with the maximum score.
    >>> traceback("AC", "AAC", [[]])
    '''

    END, DIAG, UP, LEFT = range(4)
    aligned_query = []
    aligned_target = []
    x, y = start_pos
    move = next_move(score_matrix, x, y)
    while move != END:
        if move == UP:
            aligned_query.append(query[x - 1])
            aligned_target.append('-')
            x -= 1
        elif move == DIAG:
            aligned_query.append(query[x - 1])
            aligned_target.append(target[y - 1])
            x -= 1
            y -= 1
        else:
            aligned_query.append('-')
            aligned_target.append(target[y - 1])
            y -= 1

        move = next_move(score_matrix, x, y)

    aligned_query.append(query[x - 1])
    aligned_target.append(target[y - 1])

    return ''.join(reversed(aligned_query)), ''.join(reversed(aligned_target))


In [22]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
test_q = "CGT"
test_t = "ACAT"
test_m = [[0,0,0,0,0],[0,0,2,1,0],[0,0,1,1,0],[0,0,0,0,3]]
m, p, s = fill_score_matrix(test_q, test_t)
assert m == test_m, 'Wrong scoring matrix generated by "fill_score_matrix"'
assert p == (3, 4), 'Wrong starting position returned by "fill_score_matrix"'
assert s == 3, 'Wrong max score returned by "fill_score"matrix"'
assert next_move(test_m, 2, 3) == 1, '"next_move" yields wrong move'
assert next_move(test_m, 1, 2) == 0, "\"next_move\" should yield END (=0) but it doesn't"
assert traceback(test_q, test_t, m, p) == ('CGT', 'CAT')
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!


In [16]:
# These two functions are just for cosmetic purpose, to visualise the results of the alignment.
# You don't need to change anything here.

def alignment_string(aligned_query, aligned_target):
    '''Construct a special string showing identities, gaps, and mismatches.

    This string is printed between the two aligned sequences and shows the
    identities (|), gaps (-), and mismatches (:). As the string is constructed,
    it also counts number of identities, gaps, and mismatches and returns the
    counts along with the alignment string.

    AAGGATGCCTCAAATCGATCT-TTTTCTTGG-
    ::||::::::||:|::::::: |:  :||:|   <-- alignment string
    CTGGTACTTGCAGAGAAGGGGGTA--ATTTGG
    '''
    # Build the string as a list of characters to avoid costly string
    # concatenation.
    idents, gaps, mismatches = 0, 0, 0
    alignment_string = []
    for base_q, base_t in zip(aligned_query, aligned_target):
        if base_q == base_t:
            alignment_string.append('|')
            idents += 1
        elif '-' in (base_q, base_t):
            alignment_string.append(' ')
            gaps += 1
        else:
            alignment_string.append(':')
            mismatches += 1

    return ''.join(alignment_string), idents, gaps, mismatches

def prettify_alignment(query_aligned, target_aligned):
    # Pretty print the results. The printing follows the format of BLAST results
    # as closely as possible.
    alignment_str, idents, gaps, mismatches = alignment_string(query_aligned, target_aligned)
    alength = len(query_aligned)

    print(' Identities = {0}/{1} ({2:.1%}), Gaps = {3}/{4} ({5:.1%})'.format(idents,
          alength, idents / alength, gaps, alength, gaps / alength))
    print()
    for i in range(0, alength, 60):
        query_slice = query_aligned[i:i+60]
        print('Query  {0:<4}  {1}  {2:<4}'.format(i + 1, query_slice, i + len(query_slice)))
        print('             {0}'.format(alignment_str[i:i+60]))
        target_slice = target_aligned[i:i+60]
        print('Sbjct  {0:<4}  {1}  {2:<4}'.format(i + 1, target_slice, i + len(target_slice)))
        print()

You can use the cell below to visually test your Smith-Waterman implementation.
You should see the following output:
```
 Identities = 2/3 (66.7%), Gaps = 0/3 (0.0%)

Query  1     CGT  3   
             |:|
Sbjct  1     CAT  3   

```

In [24]:
# Here you can give your inexact aligner a test run, you can try changing the dummy inputs 
# and see how it affects the resulting alignment
dummy_query = "GGGGGCCA"
dummy_target = "CGTTTTACAGGGGACAT"

# Initialize the scoring matrix and fill it with scores
score_matrix, start_pos, max_score = fill_score_matrix(dummy_query, dummy_target)

# Find the optimal path through the scoring matrix 
# producing the best local sequence alignment.
query_aligned, target_aligned = traceback(dummy_query, dummy_target, score_matrix, start_pos)
prettify_alignment(query_aligned, target_aligned)


 Identities = 6/8 (75.0%), Gaps = 1/8 (12.5%)

Query  1     GGGGGCCA  8   
             ||||:| |
Sbjct  1     GGGGAC-A  8   



**b) Run your seed sequences through the algorithm.**


In [122]:
best_score = 0
best_seed = ""
best_seed_pos = 0
for seed in seq_seeds:
    mat, _, score = fill_score_matrix(query, seed)
    if score > best_score:
        best_score = score
        best_seed = seed
        best_seed_pos = pos

mat, start_pos, _ = fill_score_matrix(query, best_seed)
query_aligned, target_aligned = traceback(query, best_seed, mat, start_pos)


**d) What are the coordinates of the best the alignment ?**


In [123]:
print(f"Seed position in target: {best_seed_pos}")
prettify_alignment(query_aligned, target_aligned)


Seed position in target: 25341
 Identities = 527/640 (82.3%), Gaps = 74/640 (11.6%)

Query  1     GGTGCAGCTCTTCAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGCATTGGA  60  
             |   ||  | |:|||||||||||||||||||||||||||||||||||||||||:||||||
Sbjct  1     G---CA--T-TACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTGGA  60  

Query  61    GTTACTCA-AAATGTTCTCTATGAGAACCAAAAGCT-GATAGCCAATC-AGTTTAATAGT  120 
             |||||:|| || ||||||||||||||||||||| :| |||:||||| | |:|||||||||
Sbjct  61    GTTACACAGAA-TGTTCTCTATGAGAACCAAAA-ATTGATTGCCAA-CCAATTTAATAGT  120 

Query  121   GCTATAGGCAAAATTCAAGAATCA-TTATCAT-CTACTGCAAGTGCACTAGGAAAACTGC  180 
             |||||:||||||||||||||:||| || || | |:||:|||||||||||:||||||||:|
Sbjct  121   GCTATTGGCAAAATTCAAGACTCACTT-TC-TTCCACAGCAAGTGCACTTGGAAAACTTC  180 

Query  181   A-GGATGTGGTTAACCAAAATGCACAAGCTCTTAA-CACGCTTGTTAAACAACTCAGCTC  240 
             | | |||||||:|||||||||||||||||| |||| ||||||||||||||||||:|||||
Sbjct  181   AAG-ATGTGGTCAACCAAAATGCACAAGCT-TTAAACACGCTT

**e) What gene does that fall into ? (Look at the file "data/session_2_annotations.bed")**

The S gene, which encodes for the "Spike" surface glycoprotein.

**f) Can you think of a way to measure the significance of your alignments ? (discussion)**

Generate many random queries and align them to the target with Smith-Waterman. We can use store their score distribution to compute an empirical p-value; The p-value will be equal to the proportion of random scores higher than the score of the actual query.