# Burrows-Wheeler Transform

---

## Before Class

Prior to class, please do the following:

1. Review the Burrows-Wheeler Transform
2. Familiarize yourself with sorting in Python (for example, the built-in `sorted` function)

---
## Learning Objectives

1. Implement the Burrows-Wheeler Transform to calculate the BWT string and the suffix array of a string.


---
## Background

Today we will be building a Burrows-Wheeler transform and a suffix array for a string, as described in the lecture notes.

To generate the BWT string, we append \$ to the string, build a matrix of all its rotations, sort the rows lexicographically, and return the last column:

```
BWT(T):
    Append $ to T
    Build matrix of all rotations of T
    sort matrix
    return last column of matrix
```
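
As a small worked example of the pseudocode above, the sketch below builds the sorted rotation matrix for a short string and reads off its first and last columns. The helper name `sorted_rotations` is an illustrative assumption, not part of the assignment, and the sketch assumes `$` does not occur in the input.

```python
def sorted_rotations(text):
    """Return the sorted matrix of all rotations of text + '$' (illustrative helper)."""
    t = text + "$"
    return sorted(t[i:] + t[:i] for i in range(len(t)))

rows = sorted_rotations("banana")
first_column = "".join(row[0] for row in rows)
last_column = "".join(row[-1] for row in rows)   # the last column is the BWT string

for row in rows:
    print(row)
print("first:", first_column)
print("last: ", last_column)
```

Note that the first column is simply the characters of `banana$` in sorted order, while the last column is the BWT string.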
    
We also need to calculate the suffix array for the string. It will be required when we use the BWT for string matching in the next class. To generate a suffix array for a string:
```
suffix_array(T):
    Append $ to T
    build matrix of all rotations of T with row index i
    sort i by lexicographic sorting of rotation matrix
    return i
```
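
The two procedures above are closely related. As a hedged sketch (the function names here are illustrative assumptions, not the ones used in this assignment), sorting the suffixes of `T + "$"` gives the same order as sorting its rotations, provided `$` occurs only once and sorts before every other character in the string. The BWT string can then be read off the suffix array: the character in row `i` is the one just before position `sa[i]` in `T + "$"`.

```python
def suffix_array_from_suffixes(text):
    """Suffix array obtained by sorting suffixes instead of rotations (sketch)."""
    t = text + "$"
    # Assumes "$" is absent from text and smaller than every character in it.
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_from_suffix_array(text):
    """BWT string derived from the suffix array (sketch)."""
    t = text + "$"
    sa = suffix_array_from_suffixes(text)
    # t[i - 1] wraps around to "$" when i == 0, matching the rotation matrix.
    return "".join(t[i - 1] for i in sa)

print(suffix_array_from_suffixes("googol"))
print(bwt_from_suffix_array("googol"))
```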



---
## Burrows-Wheeler Transform




In [8]:
#function to calculate BWT string
def BWT(string):
    ''' Function to calculate Burrows-Wheeler Transform for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        bwt_string (str): BWT string        
        
    Example:
        >>> BWT('googol')
        'lo$oogg'
        
    '''
    # Append "$" as the end-of-string marker.
    new_string = string + "$"
    rotated_matrix = []
    # Build every cyclic rotation of the string.
    for pos in range(len(new_string)):
        rotated_string = new_string[pos:] + new_string[:pos]
        rotated_matrix.append(rotated_string)
    
    # Sort the rotations once, after the matrix is complete.
    alph_matrix = sorted(rotated_matrix)
    
    # The BWT string is the last character of each sorted rotation.
    final_bwt = "".join(row[-1] for row in alph_matrix)
    return final_bwt  #the final result is of type str. 

In [9]:
BWT("googol")

'lo$oogg'

In [11]:
string = "AATTCC"
final_string = string + "$"
final_string

'AATTCC$'

In [12]:
string = "AATTCC"
string += "$"
string

'AATTCC$'

In [13]:
#slicing from index i drops the first i characters, leaving each suffix
for i in range(len(string)):
    print(string[i:])
    # print(string[:i])

AATTCC$
ATTCC$
TTCC$
TCC$
CC$
C$
$


In [14]:
for i in range(len(string)):
    print(string[:i])


A
AA
AAT
AATT
AATTC
AATTCC


In [16]:
for i in range(len(string)):
    print(string[i:] + string[:i])

AATTCC$
ATTCC$A
TTCC$AA
TCC$AAT
CC$AATT
C$AATTC
$AATTCC


In [19]:
test = "GGTTC$CGAA"
sorted(test)

['$', 'A', 'A', 'C', 'C', 'G', 'G', 'G', 'T', 'T']

In [22]:
matrix = []
for i in range(len(string)):
    new_string = string[i:] + string[:i]
    print(new_string)
    matrix.append(new_string)
    # print(matrix)
sorted_matrix = sorted(matrix)
print(sorted_matrix)

AATT
ATTA
TTAA
TAAT
['AATT', 'ATTA', 'TAAT', 'TTAA']


In [19]:
string = "AATT"
new_string = "AATT" + "$"
string_list = new_string
string_list

'AATT$'

In [20]:
new_string[0] 

'A'

In [21]:
new_string[-1]

'$'

In [69]:
#function to calculate suffix array
def suffix_array(string):
    ''' Function to calculate suffix-array for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        sa (array of integers): suffix array
        
    Example:
    >>> suffix_array('googol')
    [6, 3, 0, 5, 2, 4, 1]
        
    '''
    # Plan:
    #   1. build the matrix of rotations
    #   2. pair each rotation with its starting index
    #   3. sort the pairs lexicographically by rotation
    #   4. return just the indices, in sorted order
    
    # Append "$" as the end-of-string marker.
    new_string = string + "$"
    rotated_matrix = []
    suf_arr = []
    # Build every cyclic rotation of the string.
    for pos in range(len(new_string)):
        rotated_string = new_string[pos:] + new_string[:pos]
        rotated_matrix.append(rotated_string)
    
    # Pair each rotation with its starting index: (rotation, index).
    indexed_rotations = zip(rotated_matrix, range(len(new_string)))
    
    # sorted() compares the tuples by rotation first, i.e. lexicographically.
    alph_matrix = sorted(indexed_rotations)
    
    # Keep only the index (the second element) of each sorted tuple.
    for i in range(len(alph_matrix)):
        suf_arr.append(alph_matrix[i][1])
    
    return suf_arr
    

In [31]:
suffix_array("googol")

[6, 3, 0, 5, 2, 4, 1]

In [107]:
string = "ATCGTCG"
# for i in range(len(string)):
test = zip(string, range(len(string)))
list(test)

[('A', 0), ('T', 1), ('C', 2), ('G', 3), ('T', 4), ('C', 5), ('G', 6)]

In [103]:
string_div = zip(string)
list(string_div)

[('A',), ('T',), ('C',), ('G',), ('T',), ('C',), ('G',)]

In [161]:
test = [("AAT", 1), ("GGT", 2), ("HHT", 3)]
# for i in range(len(test)):
#     test2 = test[i][1]
# print(test2)
# print((len(test)))
# len(test)
for i in range(len(test)):
    print(test[i][1])

1
2
3


In [29]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=2)

---
If you finish the two functions above, you can start working on two more functions that we will use in the next class for string matching.

## Background
One of the key parts of string matching is Last-to-First column mapping (LF mapping) within the BWT matrix. To use the LF property, we need to build two dictionaries for our reference string beforehand:
1. count: e.g.  `{'A': 0, 'C': 2, 'G': 3, 'T': 5}` for the string `ATGACG`

For each character `a` in a string, `count[a]` is the number of characters in the string that are lexicographically smaller than `a`.


2. occur: e.g. `{'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}` for the BWT string `AG$CG`

For each character `a` in a BWT string, `occur[a][i]` is the number of occurrences of `a` in `bwt_string[0..i]`, inclusive (i.e. the first `i + 1` characters of the BWT string when `i` is a 0-based Python index).
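
The two definitions above can be checked directly by brute force. The sketch below is only a reference for testing, not the intended implementation: the function names are illustrative assumptions, and both functions are deliberately simple rather than efficient.

```python
def count_by_definition(string):
    """count[a] = number of characters in string that are smaller than a."""
    return {a: sum(c < a for c in string) for a in sorted(set(string))}

def occur_by_definition(bwt_string):
    """occur[a][i] = occurrences of a in bwt_string[0..i], inclusive."""
    return {
        a: [bwt_string[:i + 1].count(a) for i in range(len(bwt_string))]
        for a in sorted(set(bwt_string))
    }

print(count_by_definition("ATGACG"))
print(occur_by_definition("AG$CG"))
```

Comparing the output of a faster implementation against these two functions on a few short strings is a quick way to catch off-by-one errors.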

---
## Continuation of `cal_count` and `cal_occur` functions



With those two dictionaries, we can then start matching the query string to our reference string. 
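
Before matching, one way to see the LF property at work is to use the two dictionaries to walk a BWT string back to the original text. The sketch below is an illustration under stated assumptions: `inverse_bwt` is a hypothetical helper that is not part of the assignment, `count` is built from the characters other than `$` (as it would be from the reference string), and `$` is assumed to sort before every other character. Row `i` of the last column then maps to row `count[c] + occur[c][i]` of the first column, where `c = bwt_string[i]`.

```python
from collections import Counter

def inverse_bwt(bwt_string):
    """Recover the original string from its BWT string via LF mapping (sketch)."""
    freq = Counter(bwt_string.replace("$", ""))

    # count[c]: number of non-"$" characters that sort before c.
    count, total = {}, 0
    for c in sorted(freq):
        count[c] = total
        total += freq[c]

    # occur[c][i]: occurrences of c in bwt_string[0..i], inclusive.
    occur = {c: [] for c in freq}
    running = dict.fromkeys(freq, 0)
    for ch in bwt_string:
        if ch in running:
            running[ch] += 1
        for c in freq:
            occur[c].append(running[c])

    # Row 0 starts with "$", so its last character is the last character
    # of the original string. Follow LF mapping until "$" is reached.
    row, collected = 0, []
    while bwt_string[row] != "$":
        c = bwt_string[row]
        collected.append(c)
        row = count[c] + occur[c][row]   # LF mapping
    return "".join(reversed(collected))

print(inverse_bwt("lo$oogg"))
```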


---
## Imports

In [38]:
from collections import Counter

In [70]:
#function to build count dictionary
def cal_count(string):
    '''Function to count the number of characters in string that 
        are lexicographically smaller than a given character
    
    Args:
        string (str): Input string
    
    Returns:
        count (dict)
    
    Example:
        >>> cal_count('ATGACG')
        {'A': 0, 'C': 2, 'G': 3, 'T': 5}
    '''
    # Target for 'ATGACG':
    # A C G T 
    # 0 2 3 5 
    
    # Counter gives the number of occurrences of each character in the string.
    string_counter = Counter(string)
    
    count_tracker = 0  # running total; the smallest character gets 0
    final_count_dict = {}
    
    # Visit the distinct characters in sorted order.
    for elem in sorted(string_counter):
        # Everything counted so far is lexicographically smaller than elem.
        final_count_dict[elem] = count_tracker 
        # Add elem's own occurrences before moving to the next character,
        # e.g. the two A's in 'ATGACG' give "C" a count of 2.
        count_tracker += string_counter[elem]
    
    return final_count_dict
    

In [46]:
cal_count("ATGACG")

{'A': 0, 'C': 2, 'G': 3, 'T': 5}

In [42]:
test_string = "AGTCGGT"
count_test = Counter(test_string)
count_test

Counter({'A': 1, 'G': 3, 'T': 2, 'C': 1})

In [99]:
#test of empty_dict. 
empty_dict = {}
empty_dict["Amelia"] = 20
empty_dict

{'Amelia': 20}

In [83]:
count_test = Counter(string)
count_test
list(count_test)

['A', 'T', 'G', 'C']

In [42]:
string = "ATGACG"
sorted(string)

['A', 'A', 'C', 'G', 'G', 'T']

In [36]:
counter = Counter()
counter.update(string)
counter

Counter({'A': 2, 'T': 1, 'G': 2, 'C': 1})

In [38]:
num_string = len(string)
num_string

6

In [None]:
#I want each char to have a starting number of 0
for i in range(len(string)):
    count = 0
    for k in 

In [33]:
new_string = sorted(string)
new_string

['$', 'A', 'A', 'C', 'C', 'T', 'T']

In [34]:
#so when you sort lexicographically, you automatically increase going from left to right. 
if new_string[2] < new_string[0]:
    print("True")
else:
    print("False")

False


In [66]:
string = "ATGACG"
for a in range(len(string)):
    #it was very long 
    count = 0
    for b in range(a + 1, len(string)):
        if (string[b] > string[a]):
            count += 1
    print(count)

4
0
0
2
1
0


In [70]:
count = Counter(string)
alph_string = sorted(string)

#start every distinct character at 0 (a list cannot be indexed by a character, so use a dict)
zero_counts = {char: 0 for char in alph_string}
alph_string

['A', 'A', 'C', 'G', 'G', 'T']

In [72]:
test = {}
type(test)

dict

2. occur: e.g. `{'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}` for the BWT string `AG$CG`

For each character `a` in a BWT string, `occur[a][i]` is the number of occurrences of `a` in `bwt_string[0..i]`, inclusive (i.e. the first `i + 1` characters of the BWT string when `i` is a 0-based Python index).

In [110]:
BWT("ATGACG")

'GG$ACTA'

In [None]:
#example, the bwt_string is AG$CG

In [71]:
# #function to build occur dictionary
# def cal_occur(bwt_string):
#     '''Function to calculate number of occurrences of each character 
#         in bwt [0,i], i=1,...,len(bwt_string)
    
#     Args:
#         b (str): BWT string
    
#     Returns:
#         occur (dict of arrays)
    
#     Example:
#         >>> cal_occur('AG$CG')
#         {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
#     '''
#     #want to create an empty dictionary that I'll ultimately return 
#     cal_occur_dict = {}
#     sorted_bwt_str = bwt_string #just set it equal to a variable here. 
    
#     for elem in sorted_bwt_str:
#         #so for each element in bwt_string I need it to start with zeros. 
#         cal_occur_dict[elem] = [0]*len(bwt_string)
#         # print(cal_occur_dict)
#         #we need to see where each element occurs in the original bwt string. 
#         for char_index in range(len(sorted_bwt_str)): #need to get an index for each element. 
#             for elem in sorted_bwt_str: #needs to be bwt_str and not cal_occur_dict since you only want the elements. 
#                 #doing cal_occur_dict[elem] won't work since that is giving the value of the key, not the key itself.
#                 if elem == sorted_bwt_str[char_index]: #if that index has that same element in the string.  
#                     cal_occur_dict[elem][char_index] = cal_occur_dict[elem][char_index] + 1  #then we need to add a number to that key value
#                     #the return dict. which key [elem] and then which position in the key value [index] 
#                     #now we need to add a 1               
#                 else:
#                     cal_occur_dict[elem][char_index] = cal_occur_dict[elem][char_index] #based off of the slides. 
#                     #keep the current number. 
                    
#     return cal_occur_dict
        
    #WILL REDO 

In [8]:
#function to build occur dictionary
def cal_occur(bwt_string):
    '''Function to calculate the running number of occurrences of each
        character in bwt_string[0..i] (inclusive), for every index i
    
    Args:
        bwt_string (str): BWT string
    
    Returns:
        occur (dict of arrays)
    
    Example:
        >>> cal_occur('AG$CG')
        {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
    '''
    obs_dict = {}
    elements = sorted(set(bwt_string))
    
    for elem in elements:
        obs = 0  # running count of elem seen so far
        obs_tracker = []
        for char in bwt_string:
            if char == elem:
                obs = obs + 1
            # Record the running count at every position, match or not.
            obs_tracker.append(obs)
        obs_dict[elem] = obs_tracker
    return obs_dict
    

In [9]:
cal_occur("AAG$CGGG")

{'$': [0, 0, 0, 1, 1, 1, 1, 1],
 'A': [1, 2, 2, 2, 2, 2, 2, 2],
 'C': [0, 0, 0, 0, 1, 1, 1, 1],
 'G': [0, 0, 1, 1, 1, 2, 3, 4]}

In [3]:
bwt_string = "AG$CG"
len(bwt_string)

5

In [122]:
sorted(bwt_string)

['$', 'A', 'C', 'G', 'G']

In [124]:
type(sorted(bwt_string))

list

In [174]:
test_dict = {}
for x in sorted(bwt_string):
    test_dict[x] = [0]*len(bwt_string)
test_dict
test_dict
for elem in test_dict:
    print(elem)

$
A
C
G


In [47]:
lst_zeros = [0]*10
lst_zeros

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [173]:
test = [0]*len(bwt_string)
test

[0, 0, 0, 0, 0]

In [119]:
count = Counter(bwt_string)
count

Counter({'A': 1, 'G': 2, '$': 1, 'C': 1})

In [None]:
import doctest
doctest.testmod()

## Alignments using BWT

We now want to use our BWT and helper functions to perform alignments. This is done with a pair of functions.

We want to find the rows of the BWT matrix that begin with the query string. Because the BWT matrix is sorted, those rows are guaranteed to be adjacent, so we only need to find the range of matching rows. 


1. We initialize the range with the first and the last row of the BWT matrix (the matrix has `len(reference) + 1` rows because of the appended `$`, so the last row index is `len(reference)`):

`lower = 0`

`upper = len(reference)`


2. Then we update the range as we go through each character `a` of the query string in reverse order: 

`lower_new = count[a] + occur[a][lower_old-1] + 1`

`upper_new = count[a] + occur[a][upper_old]`

(we define `occur[a][-1] = 0`)

As long as `lower_new <= upper_new`, there is at least one match.

3. After we go through the entire query string and get the final range (i.e. the indexes of the rows of the BWT matrix that have the query string as a prefix), we can map it to the reference string with the suffix array `sa` to get the matched position(s) (start position(s) of the query string within the reference string):

`matched_positions = sa[lower_final:upper_final+1]`
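
The three steps above can be traced on a small example. The following is a self-contained sketch, assuming a hypothetical helper `backward_search_trace` that rebuilds the suffix array, BWT string, `count`, and `occur` inline so that it can run on its own; it is not the assignment's `find_match`. It records the range after each character so the updates can be followed by hand.

```python
from collections import Counter

def backward_search_trace(query, reference):
    """Exact matching by range updates; also returns the range after each step."""
    text = reference + "$"
    # Suffix array: start positions of the rotations in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # BWT string: the character just before each sorted rotation's start.
    bwt = "".join(text[i - 1] for i in sa)

    # count[a]: characters of the reference that sort before a.
    freq = Counter(reference)
    count, total = {}, 0
    for a in sorted(freq):
        count[a] = total
        total += freq[a]

    # occur[a][i]: occurrences of a in bwt[0..i], inclusive.
    occur = {a: [] for a in freq}
    running = dict.fromkeys(freq, 0)
    for ch in bwt:
        if ch in running:
            running[ch] += 1
        for a in freq:
            occur[a].append(running[a])

    lower, upper = 0, len(reference)
    trace = []
    for a in reversed(query):
        if a not in count:          # character absent from the reference
            return [], trace
        before = occur[a][lower - 1] if lower > 0 else 0   # occur[a][-1] = 0
        lower, upper = count[a] + before + 1, count[a] + occur[a][upper]
        trace.append((a, lower, upper))
        if lower > upper:           # empty range: no match
            return [], trace
    return sorted(sa[lower:upper + 1]), trace

positions, trace = backward_search_trace("ana", "banana")
print(trace)
print(positions)
```

For `'ana'` in `'banana'` the range narrows from rows 1 to 3, then 5 to 6, then 2 to 3, and the suffix array maps the final rows to start positions 1 and 3.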

In [86]:
def find_match(query, reference):
    '''Function to find exact matching by applying Burrows-Wheeler Transform
    
    Args:
        query (str): query string
        reference (str): reference string
    
    Returns:
        matched_positions (array of integers): start position(s) of query string within reference string
    
    Example:
    >>> find_match('ana','banana')
    [1, 3]
    
    '''
    
    # Initialize the range with the first and last rows of the BWT matrix.
    # The matrix has len(reference) + 1 rows because of the appended "$".
    lower_row = 0 
    upper_row = len(reference)
    
    # Gather the helper structures built so far.
    bwt_string = BWT(reference)            # BWT string (last column of the matrix)
    suff_array = suffix_array(reference)   # suffix array
    count = cal_count(reference)           # count dict
    occur = cal_occur(bwt_string)          # occur dict
    
    # Walk through the query from its last character to its first.
    for char in query[::-1]:
        # A character that is absent from the reference cannot match anywhere.
        if char not in count:
            return []
        lower_row, upper_row = update_range(lower_row, upper_row, count, occur, char)
        # An empty range means there is no match.
        if lower_row > upper_row:
            return []
    
    # Map the matched rows back to start positions via the suffix array.
    # The + 1 is needed because Python slices exclude the end index.
    return sorted(suff_array[lower_row:upper_row + 1])
    
    

In [7]:
# find_match("ana", "banana")

In [87]:
def update_range(lower,upper,count,occur,a):
    '''Function to update range given character a, define occur[a][-1] = 0
    
    Args:
        lower (int): the lower boundary of range
        upper (int): the upper boundary of range
        count (dict)
        occur (dict)
        a (char)
    Returns:
        lower_new (int): the updated lower boundary of range
        upper_new (int): the updated upper boundary of range
    
    '''
    # occur[a][-1] is defined as 0, so handle lower == 0 separately
    # (Python would otherwise read the last element of the list).
    if lower == 0:
        lower_new = count[a] + 0 + 1
    else:
        lower_new = count[a] + occur[a][lower - 1] + 1
    # The upper bound does not depend on the branch above.
    upper_new = count[a] + occur[a][upper]
    
    return lower_new, upper_new
    

In [88]:
matched_positions = find_match('ana','banana')
print (matched_positions)

[1, 3]
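
A simple way to cross-check results like the one above is to compare them with a naive scan of the reference string. The sketch below uses only `str.find`; the name `naive_find_all` is an illustrative assumption, and the function is meant as a testing aid rather than a replacement for the BWT approach.

```python
def naive_find_all(query, reference):
    """Return every start position of query in reference, allowing overlaps."""
    positions = []
    start = reference.find(query)
    while start != -1:
        positions.append(start)
        # Restart one position later so overlapping matches are found too.
        start = reference.find(query, start + 1)
    return positions

print(naive_find_all("ana", "banana"))   # prints [1, 3]
```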


In [89]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=5)