Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:


In [2]:
NAME = ""
IMMATRICULATION_NUMBER = ""

---


# Exercise 5: Suffix arrays

Goals of this exercise are:

- to implement a simple version of a suffix array construction algorithm
- to get a feeling for the interval search using suffix arrays
- to understand the relationship between the suffix array and the BWT

For this exercise you can assume that the input string ends with a \\$ sign and does not contain further \\$ signs or characters that are lexicographically smaller than the \\$ sign.
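
The input assumption above can be checked up front. The following is a minimal sketch, not part of the exercise; the helper name `check_terminated` is hypothetical. It relies on Python comparing characters by code point, so in ASCII only the space, `!`, `"` and `#` sort before the \\$ sign.

```python
def check_terminated(text):
    """Return text unchanged if it satisfies the exercise's input assumption.

    Hypothetical helper: the exercise itself only states the assumption.
    """
    if not text.endswith("$"):
        raise ValueError("input must end with '$'")
    if text.count("$") != 1:
        raise ValueError("input must contain exactly one '$'")
    # Characters are compared by code point; '$' is 36 in ASCII.
    smaller = sorted({c for c in text if c < "$"})
    if smaller:
        raise ValueError(f"characters smaller than '$': {smaller}")
    return text


print(check_terminated("abracadabra$"))
```

With this check in place, sorting the suffixes always puts the suffix consisting only of the \\$ sign first.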


## 1. Suffix array construction: Text -> SA

Implement a function <code>makeSA</code> that can be called by

<code>SA = makeSA(inputString)</code>

which generates a suffix array from a given input string.

<i>Hint:</i> You can reuse and modify your implementation of Assignment 1.

For example, for the input string <code>"abracadabra\$"</code>, the function call should return an array containing the values

<code>11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2</code>

Use other input strings and verify manually that your implementation is correct.
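
For manual verification it can help to list the suffixes in suffix-array order and read them top to bottom. The sketch below assumes a hypothetical helper `show_suffixes`; the suffix array is the one given in the example above.

```python
def show_suffixes(text, SA):
    """Return one line per suffix array entry: rank, start position, suffix."""
    return [f"{rank:>2} {pos:>2} {text[pos:]}" for rank, pos in enumerate(SA)]


text = "abracadabra$"
SA = [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]  # values from the example above

lines = show_suffixes(text, SA)
print("\n".join(lines))

# A suffix array is correct exactly when the listed suffixes are in sorted order
# and every start position appears once.
suffixes = [text[pos:] for pos in SA]
assert suffixes == sorted(suffixes)
assert sorted(SA) == list(range(len(text)))
```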


In [3]:
def makeSA(S):
    # Pair every suffix with its start position and sort the pairs lexicographically.
    suffixes = []
    for i in range(len(S)):
        suffixes.append((S[i:], i))
    suffixes.sort()
    # The suffix array consists of the start positions in sorted suffix order.
    res = [suffix[1] for suffix in suffixes]
    assert len(res) == len(S)
    return res

In [4]:
S = "alfalfa$"

print(f"{S=}, {makeSA(S)=}")

S='alfalfa$', makeSA(S)=[7, 6, 3, 0, 5, 2, 4, 1]


In [97]:
assert makeSA("abracadabra$") == [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
assert makeSA("yabbadabbadoo$") == [13, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 12, 11, 0]

## 2. Constructing a BWT string from the suffix array

Implement a function <code>makeBWT</code> that can be called by

<code>BWT = makeBWT(SA,inputString)</code>

and that returns a computed BWT string from the suffix array SA and the input string.
Because you can use the suffix array, your implementation should be shorter than the solution for Assignment 1.

For example, using the suffix array SA with the values

<code>11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2</code>

and the input string <code>"abracadabra\$"</code>, the function call should return the BWT string

<code>"ard\$rcaaaabb"</code>.


In [98]:
def makeBWT(SA, inputString):
    # The BWT character for the suffix starting at pos is the character preceding it;
    # for pos == 0 the index -1 wraps around to the final "$".
    return "".join(inputString[pos - 1] for pos in SA)

In [99]:
def testMakeBWT(inputString, expectedResult):
    sa = makeSA(inputString)
    bwt = makeBWT(sa, inputString)
    assert bwt == expectedResult, "Test makeBWT failed for inputString " + inputString


testMakeBWT("abracadabra$", "ard$rcaaaabb")

testMakeBWT("hokuspokus$", "s$oophsuukk")
testMakeBWT("yabbadabbadoo$", "oydbbbbaaaaod$")
testMakeBWT("tobeornottobeortobeornot$", "tooobbbrrttteeennoooor$to")
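
The relationship between the suffix array and the BWT can be cross-checked against the rotation-based definition of the BWT. The sketch below uses two hypothetical helpers, `bwt_from_rotations` and `bwt_from_sa`. It assumes that the text ends with a unique, smallest \\$ sign, in which case sorting the rotations gives the same order as sorting the suffixes.

```python
def bwt_from_rotations(text):
    """BWT by definition: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_from_sa(SA, text):
    """BWT via the suffix array: the character preceding each sorted suffix.

    For pos == 0, text[-1] wraps around to the terminating '$'.
    """
    return "".join(text[pos - 1] for pos in SA)


text = "abracadabra$"
SA = [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]  # values from the example above

assert bwt_from_rotations(text) == bwt_from_sa(SA, text) == "ard$rcaaaabb"
```

The rotation-based version builds and sorts all rotations of the text, whereas the suffix-array version needs only one lookup per entry.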

## 3. Searching a substring by binary search using suffix array and input string (without using LCP (or bcp or ncp) arrays!)

Implement a function <code>find</code> that can be called by

<code>positions = find(SA,substring,inputString)</code>

and that uses <b>binary search</b> on the suffix array SA (and then accesses the inputString) to return all the start positions of substring in the string inputString.

For example, using the suffix array SA with the values

<code>11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2</code>

and the example search substring <code>"abr"</code> and the input string <code>"abracadabra\$"</code>, the function call should return <code>[7,0]</code> as the start positions are 7 and 0. If a substring does not occur, the function should return <code>[ ]</code>. Use other substrings and other input strings and verify manually that your implementation is correct.

<i>Hint</i>: Use two functions each of which uses binary search, i.e.,

- <code>largestIndex_SmallerSuffix(...)</code> to find the largest index with a smaller suffix than the substring and
- <code>smallestIndex_GreaterSuffix(...)</code> to find the smallest index with a greater suffix than the substring
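
One possible reading of this hint is sketched below; it is not the reference solution. The function names come from the hint, while the signatures, the return values for "no such index" (`-1` and `len(SA)`) and the wrapper `find_sketch` are assumptions. Each suffix is compared only up to the length of the substring, so "smaller" and "greater" refer to these truncated suffixes.

```python
def largestIndex_SmallerSuffix(SA, substring, inputString):
    """Largest index whose truncated suffix is smaller than substring, or -1."""
    m = len(substring)
    lo, hi = 0, len(SA)  # all indices below lo are known to be smaller
    while lo < hi:
        mid = (lo + hi) // 2
        if inputString[SA[mid]:SA[mid] + m] < substring:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1


def smallestIndex_GreaterSuffix(SA, substring, inputString):
    """Smallest index whose truncated suffix is greater than substring, or len(SA)."""
    m = len(substring)
    lo, hi = 0, len(SA)  # all indices from hi on are known to be greater
    while lo < hi:
        mid = (lo + hi) // 2
        if inputString[SA[mid]:SA[mid] + m] <= substring:
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_sketch(SA, substring, inputString):
    """All start positions of substring: the SA entries strictly between both bounds."""
    left = largestIndex_SmallerSuffix(SA, substring, inputString)
    right = smallestIndex_GreaterSuffix(SA, substring, inputString)
    return [SA[i] for i in range(left + 1, right)]


SA = [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_sketch(SA, "abr", "abracadabra$"))  # [7, 0]
```

Each helper performs O(log n) comparisons of at most `len(substring)` characters each, so the search never scans the whole text.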


In [102]:
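import bisect

# bisect only accepts a key parameter from Python 3.10 on, which we don't have,
# so the sequence is wrapped in a view that applies the key on access instead.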


class KeyWrapper:
    def __init__(self, iterable, key):
        self.it = iterable
        self.key = key

    def __getitem__(self, i):
        return self.key(self.it[i])

    def __len__(self):
        return len(self.it)


def find(SA, substring, inputString):
    # Compare the substring against each suffix truncated to len(substring).
    # The truncated suffixes are still sorted, so bisect finds the matching interval.
    def prefix(pos):
        return inputString[pos : pos + len(substring)]

    l = bisect.bisect_left(KeyWrapper(SA, prefix), substring)
    r = bisect.bisect_right(KeyWrapper(SA, prefix), substring)
    # Every suffix array entry in [l, r) starts with the substring.
    return [SA[i] for i in range(l, r)]


print(find(makeSA("yabbadabbadoo$"), "yab", "yabbadabbadoo$"))

[0]


In [103]:
inputString = "yabbadabbadoo$"
SA = makeSA(inputString)


# Helper method to find all positions of a substring within a given string without using a SA
def find_all(substring, inputString):
    res = []
    start = inputString.find(substring, 0)
    while start > -1:
        res.append(start)
        start = inputString.find(substring, start + 1)
    return res


# Test for all possible matches
for i in range(len(inputString) - 1):
    # suffix
    myString = inputString[i : len(inputString) - 1]
    findRes = set(find(SA, myString, inputString))
    findAllRes = set(find_all(myString, inputString))
    assert findRes - findAllRes == set([]) and findAllRes - findRes == set([]), (
        "find failed for input " + inputString + " and substring " + myString
    )
    # prefix of suffix up to length 3
    for j in range(3):
        myString2 = myString[: j + 1]
        findRes = set(find(SA, myString2, inputString))
        findAllRes = set(find_all(myString2, inputString))

        assert findRes - findAllRes == set([]) and findAllRes - findRes == set([]), (
            "find failed for input " + inputString + " and substring " + myString2
        )

# Test for strings not occurring in the string, but which are in between two suffixes (or smaller than the smallest or larger than the largest suffix)
for i in range(len(inputString) + 1):
    # suffix
    myString = inputString[i : len(inputString)] + "!"  # "!" < "$" and "!" is not in inputString, so this string never occurs
    findRes = set(find(SA, myString, inputString))
    findAllRes = set(find_all(myString, inputString))
    assert findRes - findAllRes == set([]) and findAllRes - findRes == set([]), (
        "find failed for input " + inputString + " and substring " + myString
    )

# Test for multiple occurrences
length = 16
for i in range(length):
    for j in range(i + 1, length):
        inputString = "x" * i + "a" * (j - i + 1) + "x" * (length - j - 1) + "$"
        SA = makeSA(inputString)
        # substring of length 1
        myString = "a"
        findRes = set(find(SA, myString, inputString))
        findAllRes = set(find_all(myString, inputString))
        assert findRes - findAllRes == set([]) and findAllRes - findRes == set([]), (
            "find failed for input " + inputString + " and substring " + myString
        )
        # substring of length 2 (overlapping)
        myString = "aa"
        findRes = set(find(SA, myString, inputString))
        findAllRes = set(find_all(myString, inputString))
        assert findRes - findAllRes == set([]) and findAllRes - findRes == set([]), (
            "find failed for input " + inputString + " and substring " + myString
        )

## 4. Adding further test cases

Add at least one test case that tests the method <code>find</code> implemented in Task 3.

Add a comment that explains why this test case is useful and describes which typical mistake it is meant to catch.


In [104]:
import random
import string
# I am a fan of property-based testing: instead of hand-picking inputs, we check
# the property prop_find_equals_find_all on many random (inputString, substring) pairs.

# As before: a model of the function we want to test
def find_all(substring, inputString):
    res = []
    # We want to return all positions in the string where
    # inputString[start:start+len(substring)] == substring
    for i in range(len(inputString) - len(substring) + 1):
        # The i < len(inputString) guard excludes the position one past the end,
        # which would otherwise match the empty substring.
        if i < len(inputString) and inputString[i:i+len(substring)] == substring:
            res.append(i)
    return res

def prop_find_equals_find_all(inputString, substring):
    SA = makeSA(inputString)
    findRes = set(find(SA, substring, inputString))
    findAllRes = set(find_all(substring, inputString))
    return findRes == findAllRes

def test_string_pair_prop(n = 100000):
    for _ in range(n):
        inputString = "".join(random.choices(string.ascii_letters + string.digits, k=random.randint(0, 100)))
        substring = "".join(random.choices(string.ascii_letters + string.digits, k=random.randint(0, 100)))
        SA = makeSA(inputString)
        assert prop_find_equals_find_all(inputString, substring), f"Failed for {inputString=}, {substring=}, findRes={find(SA, substring, inputString)}, findAllRes={find_all(substring, inputString)}"

test_string_pair_prop()
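
Randomly drawn patterns seldom occur in a randomly drawn text once they are longer than a couple of characters, so a few hand-picked boundary cases may complement the random test. The sketch below is an illustration only. The helper `check_boundary_cases` is hypothetical and takes the functions under test as parameters, so it does not depend on a particular implementation. The mistakes it targets are off-by-one errors at either end of the suffix array and reading past the end of the text.

```python
def check_boundary_cases(find, makeSA):
    """Run hand-picked edge cases against a find/makeSA pair."""
    text = "abracadabra$"
    SA = makeSA(text)
    cases = {
        "$": [11],              # matches only the smallest suffix (first SA entry)
        "racadabra$": [2],      # matches only the largest suffix (last SA entry)
        "a": [0, 3, 5, 7, 10],  # several occurrences forming one SA interval
        "abracadabra$": [0],    # pattern equal to the whole text
        "abracadabra$x": [],    # pattern longer than every suffix
        "b$": [],               # absent, sorts between two existing suffixes
        "z": [],                # absent, greater than every suffix
        "!": [],                # absent, smaller than every suffix ('!' < '$')
    }
    for pattern, expected in cases.items():
        got = sorted(find(SA, pattern, text))
        assert got == expected, f"{pattern=}: expected {expected}, got {got}"
```

In the notebook this would be called as `check_boundary_cases(find, makeSA)`.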