## Suffix Trees

Before running the code below, install the suffix-tree package:
open "File -> New Console for notebook" and run "pip install suffix-tree".


In [24]:
from suffix_tree import Tree

tree = Tree ({ 'A' : 'xabxac' })
print(tree.find ('abx'))
print(tree.find ('abc'))

True
False
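The idea behind `find` can be illustrated without the package: a pattern occurs in a text exactly when it is a prefix of some suffix of the text. The following is a minimal sketch of that principle only. The helper name `occurs` is made up for illustration, and the quadratic scan is not how the suffix tree answers the query.

```python
def occurs(text, pattern):
    """Return True if pattern is a prefix of some suffix of text."""
    # Try every suffix start position i and compare the pattern there.
    return any(text.startswith(pattern, i) for i in range(len(text) + 1))

print(occurs('xabxac', 'abx'))
print(occurs('xabxac', 'abc'))
```

The suffix tree stores all suffixes in one structure, so the same question is answered by walking down from the root along the pattern instead of trying each suffix in turn.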


#### The package implements a generalized suffix tree (i.e. a tree for a set of strings)

In [8]:
tree = Tree ({ 'A' : 'xabxac',
               'B' : 'xbacazbxaz',
               'C' : 'bxtzbxa'})

# True if the pattern occurs in at least one of the strings
print(tree.find('xab'))

# find all occurrences; returns a list of (string id, path) pairs
print(tree.find_all("bxa"))

True
[('A', <suffix_tree.util.Path object at 0x7fa9906ca610>), ('B', <suffix_tree.util.Path object at 0x7fa9906cd2b0>), ('C', <suffix_tree.util.Path object at 0x7fa9906cdaf0>)]


## Problem: Longest common substring

#### Let $K$ be the number of strings and $n$ their total length. For $k=2,\dots,K$, define $l(k)$ as the length of a longest substring common to at least $k$ of the strings. Use the suffix-tree package to implement an $O(Kn)$ algorithm that computes the table of $l(k)$ values.

#### Help: Recall that in the generalized suffix tree, each leaf has two identifiers: one for the string and one for the suffix of that string. For any inner node $v$, let $C(v)$ denote the number of distinct string identifiers occurring at the leaves of the subtree below $v$. How can $l(k)$ be computed from $C(v)$ and the string depth of $v$, taken over all $v$?

###### (An $O(n)$ algorithm is also possible, but requires more work; go ahead if you like)
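One possible way to combine $C(v)$ and the string depths, sketched under the assumption that a list of `(C(v), string depth)` pairs has already been collected from the inner nodes: first record the largest depth seen for each exact value of $C(v)$, then take a running maximum from $k=K$ down to $k=2$ to turn "exactly $k$" into "at least $k$". The function name `l_table` and the sample pairs are made up for illustration and do not come from the package.

```python
def l_table(nodes, K):
    """nodes: iterable of (c_value, string_depth) pairs, one per inner node.
    Returns a dict mapping k -> l(k) for k = 2..K."""
    # best[k] = largest string depth among nodes with exactly k string ids
    best = [0] * (K + 2)
    for c, depth in nodes:
        if depth > best[c]:
            best[c] = depth
    # "at least k": a node with more than k ids also counts for k,
    # so propagate the maximum downwards from K to 2
    for k in range(K - 1, 1, -1):
        best[k] = max(best[k], best[k + 1])
    return {k: best[k] for k in range(2, K + 1)}

# hypothetical (C(v), string depth) pairs, not taken from a real tree
pairs = [(5, 0), (2, 4), (4, 3), (5, 2), (3, 1)]
table = l_table(pairs, K=5)
print(table)
```

Both passes are linear in the number of inner nodes plus $K$, so in this sketch the cost of the overall approach would be dominated by computing $C(v)$.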

In [30]:
S = { 'A' : 'sandollar',
      'B' : 'sandlot',
      'C' : 'handler',
      'D' : 'grand',
      'E' : 'panrry'}

# e.g. l(2)=4 ("sand"), l(5)=2 ("an")

T = Tree(S)

# Traverse the tree bottom-up.
# compute_C(u) returns the set of string IDs at the leaves below u
# and stores its size, C(u), in u.c_value for every inner node u.
def compute_C(u):
    if u.is_leaf():
        return {u.str_id}
    ids = set()
    for c in u.children.values():
        ids.update(compute_C(c))
    u.c_value = len(ids)
    return ids
        
compute_C(T.root)

print(T.root.c_value)


# String depth via: node.string_depth()
# Idea: traverse again and use C(v) and the string-depth of v to compute l(k)
# Help: If you are not sure how to handle the "at least k", find a solution for "exactly k" first
    

2
2
4
5
5
2
4
5
2
4
2
3
4
5
5
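To check a tree-based solution on small inputs, a brute-force reference can be handy. The sketch below is not the $O(Kn)$ algorithm asked for: it enumerates every substring of every string, which is only feasible for toy data. The function name `brute_force_l` is an assumption made for this example.

```python
def brute_force_l(strings):
    """strings: dict mapping string id -> string.
    Returns a dict mapping k -> l(k) for k = 2..K by full enumeration."""
    K = len(strings)
    # Map each distinct substring to the set of string ids that contain it.
    owners = {}
    for sid, s in strings.items():
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners.setdefault(s[i:j], set()).add(sid)
    table = {k: 0 for k in range(2, K + 1)}
    # A substring shared by m strings is a candidate for every k <= m.
    for sub, ids in owners.items():
        for k in range(2, len(ids) + 1):
            table[k] = max(table[k], len(sub))
    return table

S = {'A': 'sandollar',
     'B': 'sandlot',
     'C': 'handler',
     'D': 'grand',
     'E': 'panrry'}
reference = brute_force_l(S)
print(reference)
```

For the example strings this reproduces l(2)=4 and l(5)=2 from the comment in the cell above, so the table computed from the suffix tree can be compared against it entry by entry.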
