# Burrows-Wheeler Transform

**Section:** G03

**Submitted by:**
- Aquino, Kurt Neil
- Matias, Angelo Christian

**Proposal:** Comparison of the Implementations of the Burrows-Wheeler Transform Algorithm Between Python and PyOpenCL

The Burrows-Wheeler Transform (BWT), invented by Michael Burrows and David Wheeler in 1994, is an algorithm that rearranges data to prepare it for data compression techniques. It sorts all rotations of the input text in lexicographical order and outputs the last column of the sorted rotations.
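As a minimal sketch of this definition, the following builds every rotation explicitly, sorts them, and reads off the last column. The function name `naive_bwt` and the `sentinel` parameter are illustrative assumptions; this version needs quadratic memory and is only meant to mirror the description.

```python
def naive_bwt(text, sentinel="$"):
    """Return the BWT of text by explicitly sorting all rotations (illustrative only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Build every cyclic rotation of s
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sort the rotations lexicographically
    rotations.sort()
    # The transform is the last column of the sorted rotations
    return "".join(rotation[-1] for rotation in rotations)


print(naive_bwt("banana"))  # annb$aa
```

For `banana$` the sorted rotations are `$banana`, `a$banan`, `ana$ban`, `anana$b`, `banana$`, `na$bana`, `nana$ba`, whose last characters spell `annb$aa`.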

This notebook demonstrates a simple implementation of the BWT algorithm in Python.

First, we import the required Python modules and read the input string, appending `$` as an end-of-text marker:

In [1]:
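from datetime import datetime

# Read the file, drop the line breaks, and append "$" as the end-of-text marker.
# Note: the name `input` shadows the Python built-in; it is kept because later cells use it.
with open("test.txt", "r") as file:
    input = "".join(file.read().splitlines()) + "$"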

print(input)

banana$


We then perform the Burrows-Wheeler Transform itself to prepare the input string for data compression:

In [2]:
start_time = datetime.now()
print("Started BWT at:", start_time)

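# Steps 1 & 2 - Sort the rotations of the input string in lexicographical order.
# Provided "$" occurs only at the end of the input, sorting the suffixes input[i:]
# gives the same order as sorting the full rotations.
rotations = sorted(range(len(input)), key=lambda i: input[i:])

# Step 3 - Take the last character of each sorted rotation, i.e. the character
# just before its starting offset (input[-1] wraps around when i == 0)
bwt = [input[i - 1] for i in rotations]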

end_time = datetime.now()
print("Finished BWT at:", end_time)

runtime = end_time - start_time
print("BWT runtime in seconds:", runtime.total_seconds())

print("\nBWT:", "".join(bwt))

Started BWT at: 2017-12-11 16:57:10.549590
Finished BWT at: 2017-12-11 16:57:10.565216
BWT runtime in seconds: 0.015626

BWT: annb$aa
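Since the proposal compares runtimes, a monotonic high-resolution timer may give steadier numbers than `datetime.now()`, which is a wall-clock timestamp. The sketch below uses `time.perf_counter()` from the standard library and keeps the best of several runs; the helper names `bwt_suffix_sort` and `time_call` and the repeat count are assumptions for illustration, not part of the original notebook.

```python
import time


def bwt_suffix_sort(s):
    """BWT via sorted suffix offsets; assumes s ends with a unique end marker."""
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)


def time_call(func, arg, repeats=5):
    """Run func(arg) several times; return (result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(arg)
        best = min(best, time.perf_counter() - start)
    return result, best


result, elapsed = time_call(bwt_suffix_sort, "banana$")
print("BWT:", result)
print(f"Best of 5 runs: {elapsed:.6f} s")
```

Taking the minimum over repeated runs reduces the influence of unrelated system activity on very short measurements.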


# FM-Index using BWT

One algorithm that uses the BWT to build a compressed full-text substring index is the Full-text index in Minute space, or FM-index for short. This index, invented by Paolo Ferragina and Giovanni Manzini, makes it possible to efficiently count the occurrences of a pattern within the compressed text and to locate the position of each occurrence.
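For reference, counting in an FM-index is commonly done with backward search, which narrows a range of sorted rotations one pattern character at a time, from the last character to the first. The following is a minimal sketch of that idea, not this notebook's method: the names `fm_count`, `C`, and `occ` are illustrative, the BWT is passed as a string (for example `"".join(bwt)`), and `occ` rescans the string on each call where a real index would use precomputed rank tables.

```python
from collections import Counter


def fm_count(bwt, pattern):
    """Count occurrences of pattern in the original text via backward search."""
    # C[c] = number of characters in the text that are strictly smaller than c
    counts = Counter(bwt)
    C = {}
    total = 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Number of occurrences of c in bwt[:i] (recomputed here for simplicity)
        return bwt.count(c, 0, i)

    # Half-open range [lo, hi) of sorted rotations that start with the
    # pattern suffix processed so far; initially all rotations match
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(fm_count("annb$aa", "ana"))  # 2
```

For `ana` the range shrinks from all seven rows to rows 1-3 (rotations starting with `a`), then rows 5-6 (`na`), then rows 2-3 (`ana`), giving a count of 2.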

This notebook counts and locates a substring with a simplified approach that rebuilds the leading characters of every rotation from the BWT:
1. Create an array containing the characters of the BWT
2. Sort the array lexicographically
3. Prepend each character of the BWT to the entry at the same position in the sorted array
4. Repeat steps 2 and 3 until the entries have the same length as the pattern being searched

In [3]:
substring = 'ana'

# Rebuild the leading characters of every rotation from the BWT, one character per pass
fm_index = sorted(bwt)
for i in range(1, len(substring)):
    print("\nPass #", i)
    fm_index = sorted(fm_index)
    fm_index = [x + y for x, y in zip(bwt, fm_index)]
    
    for j in range(len(bwt)):
        print(fm_index[j])
        
# FM-Index Count - get the number of occurrences of the substring within the text
substring_count = fm_index.count(substring)
print("\nNumber of substrings present:", substring_count)

# FM-Index Locate - find the rows of the table that match the substring
# (these are row indices in BWT order, not character offsets in the original text)
substring_index = [i for i, j in enumerate(fm_index) if j == substring]
if substring_count > 0:
    print("Index/Indices where the substring is present:", substring_index)


Pass # 1
a$
na
na
ba
$b
an
an

Pass # 2
a$b
na$
nan
ban
$ba
ana
ana

Number of substrings present: 2
Index/Indices where the substring is present: [5, 6]
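The indices `[5, 6]` are rows of the rebuilt table rather than character offsets in `banana$`. One way to convert them, sketched below under the assumption that the sorted start offsets (the notebook's `rotations` list) are kept, is to note that each table entry begins one character before the start of the rotation in the same row. The variable names `text`, `suffix_array`, `table`, `rows`, and `positions` are illustrative.

```python
text = "banana$"
pattern = "ana"

# Start offsets of the sorted rotations (same ordering as `rotations` in the notebook)
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
bwt = [text[i - 1] for i in suffix_array]

# Rebuild the table of length-len(pattern) strings by repeated sort-and-prepend
table = sorted(bwt)
for _ in range(len(pattern) - 1):
    table = [c + row for c, row in zip(bwt, sorted(table))]

# Rows of the table that equal the pattern
rows = [j for j, entry in enumerate(table) if entry == pattern]

# For patterns of length >= 2, entry j starts one character before
# offset suffix_array[j], wrapping around cyclically
positions = sorted((suffix_array[j] - 1) % len(text) for j in rows)

print(rows)       # [5, 6]
print(positions)  # [1, 3]
```

Here `text[1:4]` and `text[3:6]` are both `ana`, matching the two occurrences counted above.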
