## Overview

This notebook contains code, notes, and solutions to week two's programming assignment.

### Naive Read Alignment

A modified implementation of the naive read-alignment algorithm that optionally tolerates mismatches.

In [10]:
# Question 5 requires something slightly different (mismatch tolerance)
# occurrences, num_alignments, num_character_comparisons = naive_with_counts(p, t)
#  print(occurrences, num_alignments, num_character_comparisons)
def naive_mismatch(target, reference, n_mismatch=0):
    """
    Modified version of the naive, exact-match algorithm.

    Args:
        target (str or Seq): sequence (pattern) to search for
        reference (str or Seq): reference genome (text) to search in
        n_mismatch (int): number of mismatches to tolerate

    Returns:
        tuple: (occurrences, num_alignments, num_character_comparisons),
            where occurrences is a list of match offsets in reference

    Note:
        A special case of this algorithm with n_mismatch=0 is an
        "exact-match" implementation of the naive read alignment
        algorithm.
    """

    occurrences = []

    # Additional counters for assignment purposes
    num_alignments = 0
    num_character_comparisons = 0

    for i in range(len(reference) - len(target) + 1):  # loop over alignments

        match = True

        # Track alignments
        num_alignments += 1

        # Track mismatch count
        mismatch = 0

        for j in range(len(target)):
            num_character_comparisons += 1

            if reference[i+j] != target[j]:
                mismatch += 1

            if mismatch > n_mismatch:
                match = False
                break

        if match:
            occurrences.append(i)  # within mismatch tolerance; record offset

    return occurrences, num_alignments, num_character_comparisons

In [11]:
# Example 1
p = 'word'
t = 'there would have been a time for such a word'
occurrences, num_alignments, num_character_comparisons = naive_mismatch(p, t)
print(occurrences, num_alignments, num_character_comparisons)

[40] 41 46


In [12]:
# Example 2
p = 'needle'
t = 'needle need noodle needle'
occurrences, num_alignments, num_character_comparisons = naive_mismatch(p, t)

print(occurrences, num_alignments, num_character_comparisons)

[0, 19] 20 35


The naive read-alignment implementation above appears to work as intended.
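One quick sanity check on the counts, sketched under the assumption that the naive loop tries every offset exactly once: the number of alignments should equal `len(t) - len(p) + 1`. The helper name `expected_alignments` is hypothetical and not part of the assignment code.

```python
def expected_alignments(pattern, text):
    """Number of offsets a naive scan tries (hypothetical helper)."""
    # The last valid offset is len(text) - len(pattern); offsets start at 0.
    return max(len(text) - len(pattern) + 1, 0)


example_1 = expected_alignments('word', 'there would have been a time for such a word')
example_2 = expected_alignments('needle', 'needle need noodle needle')
print(example_1, example_2)
```

These values should agree with the `num_alignments` figures printed for Example 1 and Example 2 (41 and 20).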

### Boyer-Moore

This section contains a (slightly) modified implementation of the Boyer-Moore (BM) algorithm. The version shown here is exact-match only and still needs mismatch tolerance added for the assignment. Note that mismatch tolerance here means Hamming distance rather than edit distance. In other words, only substitutions are considered, not insertions or deletions.
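To make the distinction concrete, here is a minimal sketch of Hamming-distance matching. The names `hamming_distance` and `hamming_matches` are illustrative and do not appear in the assignment code, and the scan is a plain sliding window rather than Boyer-Moore.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def hamming_matches(pattern, text, max_mismatches):
    """Offsets in text where pattern matches with at most max_mismatches substitutions."""
    width = len(pattern)
    return [
        i for i in range(len(text) - width + 1)
        if hamming_distance(pattern, text[i:i + width]) <= max_mismatches
    ]


text = 'needle need noodle needle'
print(hamming_distance('needle', 'noodle'))  # substitutions only
print(hamming_matches('needle', text, 0))    # exact matches
print(hamming_matches('needle', text, 2))    # up to two substitutions
```

An insertion or deletion shifts every later character, so a read with a missing base cannot be compared position by position this way; that case would need edit distance instead.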

In [2]:
# First, we need to import the Boyer-Moore classes and functions
from bm_preproc import BoyerMoore

# Now, import other modules we'll likely need along the way

In [101]:
# This function is lifted directly from the programming reading.
#  Note: I made some small changes because my OCD won't allow for
#  the super crappy linting and programming practices.
def boyer_moore(p, p_bm, t):
    """
    Do Boyer-Moore matching.

    At its heart, Boyer-Moore is an exact-matching algorithm
    that allows the user to skip many of the potential read
    alignments using a simple heuristic:
    
        - Compare the pattern and the text backwards
          (right to left within each alignment)
        - If a "bad character" (mismatch) is found, then advance
          the pattern until that text character lines up with a
          matching character in the pattern.
        - If a "good suffix" is found followed by a mismatch,
          then advance the pattern until another copy of that
          suffix in the pattern lines up with the matched text,
          so the already-matched characters still match.

    Note:
        This implementation has no mismatch tolerance.
        Must be modified for the programming assignment.

    Args:
        p (string): pattern (sequence)
        p_bm (BoyerMoore): preprocessed BoyerMoore object
        t (string): text to which the pattern is compared

    Returns:
        tuple: (occurrences, num_alignments, num_character_comparisons),
            where occurrences is a list of exact-match offsets in t
    """

    # Start at beginning of the sequence (offset of 0)
    i = 0

    # Track matches as a list
    occurrences = []

    # Counters for tracking purposes
    num_alignments = 0
    num_character_comparisons = 0

    # Loop over candidate alignments (exact matching; no mismatch tolerance yet)
    while i < len(t) - len(p) + 1:

        # Checking an alignment
        num_alignments += 1

        # By default, we will move to the next position
        shift = 1

        mismatched = False

        # We move backwards through the pattern to find
        # a potentially matching suffix.
        #
        # j in this case refers to the position within the pattern
        # sequence.
        for j in range(len(p)-1, -1, -1):

            # Checking a new character
            num_character_comparisons += 1

            if p[j] != t[i+j]:
    
                # Look up the maximum shift based on the
                # bad character rule
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                
                # Lookup maximum shift based on the good suffix rule
                skip_gs = p_bm.good_suffix_rule(j)
                
                # Figure out what the maximal shift is overall
                shift = max(shift, skip_bc, skip_gs)

                mismatched = True

            if mismatched:
                break

        if not mismatched:
            occurrences.append(i)
            
            # After a full match, look up how far the pattern can safely shift
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)

        i += shift

    return occurrences, num_alignments, num_character_comparisons



In [106]:
# Example 1
p = 'word'
t = 'there would have been a time for such a word'
lowercase_alphabet = 'abcdefghijklmnopqrstuvwxyz '
p_bm = BoyerMoore(p, lowercase_alphabet)
occurrences, num_alignments, num_character_comparisons = boyer_moore(p, p_bm, t)
print(occurrences, num_alignments, num_character_comparisons)

[40] 12 15


In [103]:
# Example 2
p = 'needle'
t = 'needle need noodle needle'
p_bm = BoyerMoore(p, lowercase_alphabet)
occurrences, num_alignments, num_character_comparisons = boyer_moore(p, p_bm, t)
print(occurrences, num_alignments, num_character_comparisons)

[0, 19] 5 18


## Quiz

Notes for the regular quiz.

In [132]:
# Question 1
p = 'TAATAAA'
p_bm = BoyerMoore(p, 'ATGC')
# Mismatch at pattern offset 4 against text character 'T'
shift = p_bm.bad_character_rule(4, 'T')
n_skip = shift - 1  # a shift of s skips s - 1 alignments
print(n_skip)

0
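
This result can be cross-checked without `bm_preproc`. The sketch below is a simplified stand-in for the bad character rule, assuming the shift is the distance from the mismatched pattern offset back to the rightmost earlier occurrence of the text character (or past the mismatch entirely if there is none). The function name `bad_character_shift` is hypothetical.

```python
def bad_character_shift(pattern, j, c):
    """Shift suggested by the bad character rule for a mismatch
    at pattern offset j against text character c."""
    # Rightmost occurrence of c strictly to the left of offset j;
    # str.rfind returns -1 when c does not occur there.
    k = pattern.rfind(c, 0, j)
    # If c is absent, k == -1 and the shift is j + 1, which moves the
    # pattern completely past the mismatched text character.
    return j - k


bc_shift = bad_character_shift('TAATAAA', 4, 'T')
print(bc_shift, bc_shift - 1)  # shift, then alignments skipped
```

Here the nearest 'T' sits immediately to the left of offset 4, so the shift is 1 and no alignments are skipped, which agrees with the output above.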


In [138]:
# Question 2
p = 'TAATTAA'
p_bm = BoyerMoore(p, 'ATGC')
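# Mismatch at pattern offset 4, so the matched suffix is p[5:]
shift = p_bm.good_suffix_rule(4)
n_skip = shift - 1  # a shift of s skips s - 1 alignments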
print(n_skip)

3


## Programming Quiz

This section contains quiz notes and solutions.

In [107]:
# Need to load the genome
from Bio.Seq import Seq
import Bio.SeqIO

genome = list(Bio.SeqIO.parse('chr1.GRCh38.excerpt.fasta', 'fasta')).pop().seq

In [109]:
# Question 1/2
p = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
occurrences, num_alignments, num_character_comparisons = naive_mismatch(p, genome)
print(occurrences, num_alignments, num_character_comparisons)

[56922] 799954 984143


In [110]:
# Question 3
p = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
p_bm = BoyerMoore(p, 'ATGC')
occurrences, num_alignments, num_character_comparisons = boyer_moore(p, p_bm, genome)
print(occurrences, num_alignments, num_character_comparisons)

[56922] 127974 165191


In [112]:
# Boyer-Moore alignments as a percentage of naive alignments
127974 / 799954 * 100

15.997669866017297