# Burrows-Wheeler Alignment

---

## Before Class

Prior to class, please do the following:
1. Review the Last-to-First (LF) property of the Burrows-Wheeler Transform matrix
2. Review the Burrows-Wheeler Transform for string matching

---
## Learning Objectives
1. Implement the Burrows-Wheeler Transform to match a substring against a reference string

---
## Background

Today we continue implementing the Burrows-Wheeler Transform and use it for alignment.

---
## Imports

In [10]:
from collections import Counter

---
## Burrows-Wheeler Transform for string matching


In [1]:
#function to calculate BWT string from last class
def BWT(string):
    ''' Function to calculate Burrows-Wheeler Transform for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        bwt_string (str): BWT string        
        
    Example:
        >>> BWT('googol')
        'lo$oogg'
        
    '''
    #Append '$' to the end of string
    string += '$'
    
    #generate table of circulated strings
    t = sorted(string[i:] + string[:i] for i in range(len(string)))
    #concatenate last symbols of circulated strings to generate BWT string
    bwt_string = ''.join([l[-1] for l in t])

    return bwt_string

In [2]:
#function to calculate suffix array from last class
def suffix_array(string):
    ''' Function to calculate suffix-array for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        sa (array of integers): suffix array
        
    Example:
    >>> suffix_array('googol')
    [6, 3, 0, 5, 2, 4, 1]
        
    '''
    #Append '$' to the end of string
    string += '$'

    #generate suffix array
    sa = [index for suffix, index in sorted((string[i:], i) for i in range(len(string)))]
    #equivalent to sorting the rotations instead of the suffixes:
    #sa = [index for rotation, index in sorted((string[i:]+string[:i], i) for i in range(len(string)))]
    
    return sa

---
## Continuation of `cal_count` and `cal_occur` functions


A key step in string matching is the Last-to-First column mapping (LF mapping) within the BWT matrix. To use the LF property, we need to build two dictionaries for our reference string beforehand:
1. count: e.g. for the string `'ATGACG'`, `{'A': 0, 'C': 2, 'G': 3, 'T': 5}`

Where for each character `a` in a string, `count[a]` contains the number of characters in the string that are lexicographically smaller than `a`.


2. occur: e.g. for the BWT string `'AG$CG'`, `{'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}`

Where for each character `a` in a BWT string, `occur[a][i]` contains the number of occurrences of `a` in `bwt_string[0..i]`, `i = 0,...,len(bwt_string)-1` (i.e. the first `i+1` characters of the BWT string, position `i` included).

With those two dictionaries, we can then start matching the query string to our reference string. 
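
To make the LF property concrete, here is a minimal sketch that is not part of the class code. The helper name `lf_mapping` is hypothetical, and the sketch assumes that `$` sorts before every character of the reference (true for alphabetic strings). It computes, for each row `i` of the BWT matrix, the row that starts with the same character occurrence that ends row `i`, using `count[c] + occur[c][i]` with `count` built from the reference without `$`.

```python
from collections import Counter

def lf_mapping(reference):
    """Return the sorted rotations and the LF mapping of each row (illustrative helper)."""
    s = reference + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = ''.join(row[-1] for row in rotations)

    # count[c]: number of characters in the reference (without '$') smaller than c
    freq = Counter(reference)
    count, total = {}, 0
    for ch in sorted(freq):
        count[ch] = total
        total += freq[ch]

    seen = Counter()  # after processing bwt[i], seen[c] equals occur[c][i]
    lf = []
    for ch in bwt:
        seen[ch] += 1
        # '$' is the unique smallest character, so it always maps to row 0
        lf.append(0 if ch == '$' else count[ch] + seen[ch])
    return rotations, lf

rotations, lf = lf_mapping('banana')
# Moving the last character of row i to the front must give row lf[i]
for i, j in enumerate(lf):
    assert rotations[j] == rotations[i][-1] + rotations[i][:-1]
print(lf)  # [1, 5, 6, 4, 0, 2, 3]
```

The assertion checks the mapping against the explicit rotation table, which is only practical for short strings but is a convenient sanity check while developing the dictionaries.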


In [3]:
#function to build count dictionary
def cal_count(string):
    '''Function to count the number of characters in string that 
        are lexicographically smaller than a given character
    
    Args:
        string (str): Input string
    
    Returns:
        count (dict)
    
    Example:
        >>> cal_count('ATGACG')
        {'A': 0, 'C': 2, 'G': 3, 'T': 5}
    '''
    c = Counter(string)
    number = 0
    count = {}
    for char in sorted(set(string)):
        count[char] = number
        number += c[char]
    return count

In [4]:
#function to build occur dictionary
def cal_occur(bwt_string):
    '''Function to calculate the number of occurrences of each character 
        in bwt_string[0..i] (position i included), i=0,...,len(bwt_string)-1
    
    Args:
        bwt_string (str): BWT string
    
    Returns:
        occur (dict of arrays)
    
    Example:
        >>> cal_occur('AG$CG')
        {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
    '''
    
    #one running-count list per distinct character, initialized to 0
    chars = sorted(set(bwt_string))
    occur = {char: [0]*len(bwt_string) for char in chars}
    occur[bwt_string[0]][0] = 1
    for i in range(1,len(bwt_string)):
        for char in chars:
            if char == bwt_string[i]:
                occur[char][i] = occur[char][i-1]+1
            else:
                occur[char][i] = occur[char][i-1]
        
    return occur   

In [63]:
test = "AAGGTT$CCG"
set(test)

{'$', 'A', 'C', 'G', 'T'}

In [64]:
sorted(test)

['$', 'A', 'A', 'C', 'C', 'G', 'G', 'G', 'T', 'T']

In [65]:
sorted(set(test))

['$', 'A', 'C', 'G', 'T']

---
## Alignments using BWT

We now use the BWT and our helper functions to perform alignments. This takes a pair of functions, `find_match` and `update_range`.

We want to find the rows of the BWT matrix that begin with the query string. Because the rows of the BWT matrix are sorted, those rows are adjacent, so we only need to find the range of matching rows.
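
As a brute-force illustration of this claim (a sketch only, with the hypothetical helper name `matching_rows`), the code below builds the sorted rotation table explicitly, collects the rows that start with the query, and checks that they form one contiguous block. Each rotation is paired with its start offset, which is the corresponding suffix array entry.

```python
def matching_rows(query, reference):
    """Return (row indices starting with query, start positions in reference)."""
    s = reference + '$'
    # Pair each rotation with its start offset (the suffix array entry of that row)
    rows = sorted((s[i:] + s[:i], i) for i in range(len(s)))
    hits = [r for r, (rotation, _) in enumerate(rows) if rotation.startswith(query)]
    positions = sorted(rows[r][1] for r in hits)
    return hits, positions

hits, positions = matching_rows('ana', 'banana')
# The matching rows are adjacent in the sorted matrix
assert hits == list(range(hits[0], hits[-1] + 1))
print(hits)       # [2, 3]
print(positions)  # [1, 3]
```

This scan touches every row, so it is only meant to show what the range-based search has to find.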


1. We initialize the range with the first and the last row of the BWT matrix. The matrix has `len(reference) + 1` rows because of the appended `$`, so the index of the last row is `len(reference)`:

`lower = 0`

`upper = len(reference)`


2. Then we update the range as we go through each character `a` of the query string in reverse order: 

`lower_new = count[a] + occur[a][lower_old-1] + 1`

`upper_new = count[a] + occur[a][upper_old]`

(we define `occur[a][-1] = 0`)

Here `count` is built from the reference without `$`, so the `+ 1` skips the `$` row at the top of the matrix. Both bounds are inclusive: as long as `lower_new <= upper_new`, there is at least one match. Once `lower_new > upper_new`, the query does not occur in the reference.

3. After we go through the entire query string and get the final range (i.e. the indexes of the rows of the BWT matrix that have the query string as a prefix), we can map it to the reference string with the suffix array `sa` to get the matched position(s) (start position(s) of the query string within the reference string):

`matched_positions = sa[lower_final:upper_final+1]`
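
The three steps above can be traced with a small, self-contained sketch. It is an illustration under stated assumptions, not the class implementation: the name `backward_search_trace` is hypothetical, and `count` and `occur` are recomputed on the fly with `sum` and `str.count` instead of being looked up in prebuilt dictionaries.

```python
def backward_search_trace(query, reference):
    """Return the (character, lower, upper) trace and the sorted match positions."""
    s = reference + '$'
    order = sorted(range(len(s)), key=lambda i: s[i:])  # suffix array
    bwt = ''.join(s[i - 1] for i in order)              # last column of the BWT matrix
    lower, upper = 0, len(reference)                    # step 1: full (inclusive) range
    trace = []
    for a in reversed(query):                           # step 2: right-to-left updates
        smaller = sum(ch < a for ch in reference)       # plays the role of count[a]
        occ_before = bwt[:lower].count(a)               # occur[a][lower-1], 0 if lower == 0
        occ_upto = bwt[:upper + 1].count(a)             # occur[a][upper]
        lower = smaller + occ_before + 1
        upper = smaller + occ_upto
        trace.append((a, lower, upper))
        if lower > upper:                               # empty range: no match
            return trace, []
    return trace, sorted(order[lower:upper + 1])        # step 3: map rows to positions

trace, positions = backward_search_trace('ana', 'banana')
print(trace)      # [('a', 1, 3), ('n', 5, 6), ('a', 2, 3)]
print(positions)  # [1, 3]
```

For `banana` the BWT string is `annb$aa` and the suffix array is `[6, 5, 3, 1, 0, 4, 2]`, so the final range `(2, 3)` maps to positions `3` and `1`.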

In [54]:
def find_match(query, reference):
    '''Function to find exact matching by applying Burrows-Wheeler Transform
    
    Args:
        query (str): query string
        reference (str): reference string
    
    Returns:
        matched_positions (array of integers): start position(s) of query string within reference string
    
    Example:
    >>> find_match('ana','banana')
    [1, 3]
    
    '''
    reference_bwt = BWT(reference)
    sa = suffix_array(reference)
    count = cal_count(reference)
    occur = cal_occur(reference_bwt)
    #start with the full range of rows; both bounds are inclusive
    lower = 0
    upper = len(reference)
    for char in query[::-1]:
        #a character that is absent from the reference cannot match
        if char not in count:
            return []
        lower,upper = update_range(lower,upper,count,occur,char)
        #an empty range means the query does not occur in the reference
        if lower > upper:
            return []
    
    #upper is inclusive, so the slice has to end at upper+1
    return sorted(sa[lower:upper+1])

In [55]:
find_match("ana", "banana")

[1, 3]

In [12]:
def update_range(lower,upper,count,occur,a):
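    '''Function to update the inclusive row range given character a.
    
    occur[a][-1] is defined as 0; the lower == 0 branch handles that case.
    
    Args:
        lower (int): the lower boundary of range
        upper (int): the upper boundary of range
        count (dict): output of cal_count for the reference string
        occur (dict): output of cal_occur for the BWT string of the reference
        a (str): the current character of the query string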
    Returns:
        lower_new (int): the updated lower boundary of range
        upper_new (int): the updated upper boundary of range
    
    '''
    if lower == 0:
        lower_new = count[a] + 0 + 1
    else:
        lower_new = count[a] + occur[a][lower-1] + 1
    upper_new = count[a] + occur[a][upper]
    
    return lower_new,upper_new

In [None]:
matched_positions = find_match('ana','banana')
print (matched_positions)

Expected output: [1, 3]
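
One way to check the BWT-based result is to compare it with a direct scan of the reference. The sketch below is an assumption-labeled helper (the name `naive_match` is hypothetical and not part of the class code) that tests every start position with `str.startswith`.

```python
def naive_match(query, reference):
    """Return all start positions of query in reference by direct scanning."""
    return [i for i in range(len(reference) - len(query) + 1)
            if reference.startswith(query, i)]

print(naive_match('ana', 'banana'))  # [1, 3]
print(naive_match('nab', 'banana'))  # []
```

In the notebook, the output of `find_match(query, reference)` can be compared with `naive_match(query, reference)` for a handful of queries, including overlapping matches such as `'ana'` in `'banana'`.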

In [None]:
import doctest
doctest.testmod()