# Burrows-Wheeler Transform

---

## Before Class

Prior to class, please do the following:

1. Review the Burrows-Wheeler Transform (BWT).
2. Familiarize yourself with sorting in Python (the built-in `sorted` function).

---
## Learning Objectives

1. Implement the Burrows-Wheeler Transform to calculate the BWT string and the suffix array of a string.


---
## Background

Today we will build the Burrows-Wheeler Transform (BWT) and the suffix array for a string, as described in the lecture notes.

To compute the BWT, we append `$` to the string, build a matrix of all its rotations, sort the rows lexicographically, and return the last column:

```
BWT(T):
    Append $ to T
    Build matrix of all rotations of T
    Sort rows of matrix lexicographically
    return last column of matrix
```
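
As a worked illustration of the pseudocode, the sketch below prints the sorted rotation matrix for `googol` and reads off its last column. The helper name `rotation_matrix` is hypothetical and used only for this example. It assumes the input contains only letters, so that `$` sorts before every other character under Python's default string comparison.

```python
def rotation_matrix(text):
    """Return the sorted rotations of text + '$' (illustrative helper)."""
    text += '$'
    # Rotation i moves the first i characters to the end of the string
    return sorted(text[i:] + text[:i] for i in range(len(text)))

matrix = rotation_matrix('googol')
for row in matrix:
    print(row)
# $googol
# gol$goo
# googol$
# l$googo
# ogol$go
# ol$goog
# oogol$g

# The BWT string is the last column, read top to bottom
last_column = ''.join(row[-1] for row in matrix)
print(last_column)  # lo$oogg
```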
    
We also need to calculate the suffix array for the string. We will need it when we use the BWT for string matching in the next class. To generate a suffix array for a string:
```
suffix_array(T):
    Append $ to T
    Build matrix of all rotations of T, keeping each row's index i
    Sort the indices i by the lexicographic order of their rotations
    return the sorted indices
```
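
The two procedures are closely linked: when `$` is the smallest character and occurs only once, sorting the rotations gives the same order as sorting the suffixes, so each BWT character is the character that precedes the corresponding suffix in the text. A minimal sketch that checks this on `googol` is shown here. The name `sa_and_bwt` is hypothetical, and the input is assumed to contain only letters.

```python
def sa_and_bwt(text):
    """Return (suffix array, BWT string) from index-tagged rotations (illustrative helper)."""
    text += '$'
    n = len(text)
    # Pair each rotation with the row index it started at, then sort by rotation
    rows = sorted((text[i:] + text[:i], i) for i in range(n))
    sa = [i for _, i in rows]
    bwt = ''.join(rotation[-1] for rotation, _ in rows)
    return sa, bwt

sa, bwt = sa_and_bwt('googol')
text = 'googol$'

# Each BWT character precedes its suffix in the text; index -1 wraps around to '$'
assert bwt == ''.join(text[i - 1] for i in sa)

print(sa)   # [6, 3, 0, 5, 2, 4, 1]
print(bwt)  # lo$oogg
```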



---
## Burrows-Wheeler Transform



In [7]:
#function to calculate BWT string from last class
def BWT(string):
    ''' Function to calculate Burrows-Wheeler Transform for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        bwt_string (str): BWT string        
        
    Example:
        >>> BWT('googol')
        'lo$oogg'
        
    '''
    #Append '$' to the end of string
    string += '$'
    
    #generate sorted table of all rotations of the string
    table = sorted(string[i:] + string[:i] for i in range(len(string)))
    
    #concatenate the last character of each sorted rotation to generate BWT string
    bwt_string = ''.join([rotated_string[-1] for rotated_string in table])

    return bwt_string

In [None]:
# Alternative implementation
def BWT2(string):
    ''' Function to calculate Burrows-Wheeler Transform for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        bwt_string (str): BWT string        
        
    Example:
        >>> BWT2('googol')
        'lo$oogg'
        
    '''
    #Append '$' to the end of string
    string += '$'
    
    #generate table of all rotations of the string
    table = []
    for i in range(len(string)):
        rotated_string = string[i:] + string[:i]
        table.append(rotated_string)
    
    table = sorted(table)
    
    #concatenate the last character of each sorted rotation to generate BWT string
    bwt_string = ''
    for i in range(len(table)):
        bwt_string = bwt_string + table[i][-1]

    return bwt_string

In [2]:
#function to calculate suffix array from last class
def suffix_array(string):
    ''' Function to calculate suffix-array for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        sa (list of int): suffix array
        
    Example:
        >>> suffix_array('googol')
        [6, 3, 0, 5, 2, 4, 1]
        
    '''
    #Append '$' to the end of string
    string += '$'

    #generate suffix array
    sa = [index for suffix, index in sorted((string[i:], i) for i in range(len(string)))]
    
    return sa

In [5]:
# Alternative implementation
def suffix_array2(string):
    ''' Function to calculate suffix-array for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        sa (list of int): suffix array
        
    Example:
        >>> suffix_array2('googol')
        [6, 3, 0, 5, 2, 4, 1]
        
    '''
    #Append '$' to the end of string
    string += '$'

    #generate table of all rotations of the string
    table = []
    for i in range(len(string)):
        rotated_string = string[i:] + string[:i]
        table.append(rotated_string)
    
    #pair each rotation with its original row index and sort the pairs by rotation
    table_index = sorted(zip(table, range(len(string))))
    
    #Use the now sorted index values for our final output
    sa = []
    for item, index in table_index:
        sa.append(index)
        
    return sa

In [None]:
import doctest
doctest.testmod()

In [8]:
BWT('googol')

'lo$oogg'

In [9]:
suffix_array2('googol')

[6, 3, 0, 5, 2, 4, 1]

---
If you finish the two functions above, you can start on two more functions that we will use in the next class for string matching.

## Background
One of the key steps in string matching is Last-to-First column mapping (LF mapping) within the BWT matrix. To use the LF property, we need to build two dictionaries for our reference string beforehand:
1. count: e.g. `{'A': 0, 'C': 2, 'G': 3, 'T': 5}`

For each character `a` in a string, `count[a]` is the number of characters in the string that are lexicographically smaller than `a`.


2. occur: e.g. `{'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}`

For each character `a` in a BWT string, `occur[a][i]` is the number of occurrences of `a` in the first `i + 1` characters of the BWT string, i.e. positions `0` through `i` inclusive, for `i = 0, ..., len(bwt_string) - 1`.
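
To make the two definitions concrete, here is one possible sketch that reproduces the example dictionaries above. The function names `count_smaller` and `running_occurrences` are hypothetical and chosen so they do not clash with the exercise stubs. This is only one of several reasonable approaches, not necessarily the one intended in the lecture notes.

```python
from collections import Counter

def count_smaller(string):
    """For each character, count the characters in string that sort before it."""
    freq = Counter(string)
    count = {}
    total = 0
    # Walk the distinct characters in sorted order, accumulating frequencies
    for char in sorted(freq):
        count[char] = total
        total += freq[char]
    return count

def running_occurrences(bwt_string):
    """occur[a][i] = number of occurrences of a in positions 0..i of bwt_string."""
    occur = {char: [] for char in sorted(set(bwt_string))}
    seen = Counter()
    for char in bwt_string:
        seen[char] += 1
        # Record the running total of every character after this position
        for key in occur:
            occur[key].append(seen[key])
    return occur

print(count_smaller('ATGACG'))
# {'A': 0, 'C': 2, 'G': 3, 'T': 5}
print(running_occurrences('AG$CG'))
# {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
```

The inner loop makes `running_occurrences` take time proportional to the string length times the alphabet size, which is fine for small alphabets such as DNA.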

---
## Imports

In [None]:
from collections import Counter

In [None]:
#function to build count dictionary
def cal_count(string):
    '''Function to count the number of characters in string that 
        are lexicographically smaller than a given character
    
    Args:
        string (str): Input string
    
    Returns:
        count (dict)
    
    Example:
        >>> cal_count('ATGACG')
        {'A': 0, 'C': 2, 'G': 3, 'T': 5}
    '''
    pass

In [None]:
#function to build occur dictionary
def cal_occur(bwt_string):
    '''Function to calculate the number of occurrences of each character 
        in positions 0..i (inclusive) of bwt_string, i=0,...,len(bwt_string)-1
    
    Args:
        bwt_string (str): BWT string
    
    Returns:
        occur (dict of arrays)
    
    Example:
        >>> cal_occur('AG$CG')
        {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
    '''
    pass

In [None]:
import doctest
doctest.testmod()