An implementation of a suffix tree.
How to run:

    > ghci suffixTree.hs
    λ graph . SuffixTree.construct $ "cacao"

(or whatever other string you want instead of `"cacao"`)
Dependencies:

- GraphViz (`brew install graphviz`)
- Haskell GraphViz (`cabal install graphviz`)
- Zora (`cabal install zora`)
The DC3 implementation uses regular (comparison-based) sorting instead of radix sort, so it runs in `O(n log n)` time instead of linear time.
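A rough sketch of why the bound stays at `O(n log n)` and does not pick up a second log factor: each level of the DC3 recursion sorts `O(n)` items and then recurses on a string about two thirds as long. Assuming a constant `c` for the per-level comparison sort (an illustrative symbol, not something measured in this code):

$$
T(n) = T\!\left(\tfrac{2n}{3}\right) + c\,n\log n
\;\le\; c\,n\log n \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
= 3c\,n\log n = O(n \log n)
$$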
Found a bug? E-mail me at firstname.lastname@example.org to let me know and I'll get right on it :)
- `T$`: The input string (of length `n`).
- Suffix array: An array of the suffixes of `T$`, stored in sorted order.
    - Usually just an array of the start indices of the suffixes.
- Suffix tree: A Patricia trie of the suffixes of `T$` where each leaf is labeled with the index where the corresponding suffix starts in `T$`.
    - `O(n)` space, but with a large constant factor.
- Patricia trie (radix trie): A trie where nodes with only one child are merged with their parents.
- LCP: Longest common prefix of two strings.
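To make these definitions concrete, here is a minimal sketch that builds a suffix array the slow way and measures an LCP. It is written in Python, matching the `pos[rank[i] - 1]` style of notation used in these notes; the repository's own code is Haskell. The names `naive_suffix_array` and `lcp` are made up for this illustration and are not part of the module.

```python
def naive_suffix_array(text):
    """Start indices of the suffixes of `text`, in sorted order.

    Sorting whole suffixes is far slower than DC3; it is only here
    to show what the array contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


print(naive_suffix_array("nonsense$"))  # [8, 7, 4, 0, 5, 2, 1, 6, 3]
print(lcp("nse$", "nsense$"))           # 3
```

Note that `'$'` sorts before the lowercase letters in ASCII, which is what makes it usable as the sentinel here.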
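The construction has three steps (all are `O(n)` in the underlying algorithms):

- Build suffix array
- Build LCP array (`L`: the LCP lengths of adjacent elems in the suffix array)
- Build suffix tree from suffix array and LCP array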
####Step 1. Build suffix array
Recursively get the sorted order of all suffixes starting at positions that aren't multiples of three.

- Construct a new string based on suffixes starting at positions in `T1` and `T2` (positions congruent to 1 and 2 mod 3).
    - Begin by computing `T$[1:]` and `T$[2:]` and padding each with `'$'` until the lengths are multiples of three, then strcat them.
    - Treat each block of three characters as its own character.
    - Can determine the relative ordering of those characters by radix sort.
    - Replace each block of three characters with its index.
- Compute the suffix array of that string, recursively.
- Use the resulting suffix array to deduce the orderings of the suffixes.
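The block renaming above can be sketched as follows. This is an illustration in Python under a few assumptions: `reduced_string` is a hypothetical name, the blocks are ranked with a comparison sort (as this implementation does) and not a radix sort, and the recursive call is replaced by a naive sort of the reduced string so that the snippet stays short.

```python
def reduced_string(text):
    """Rename the 3-character blocks of text[1:] and text[2:] by their rank.

    Returns (ranks, starts): ranks[k] is the new "character" for block k,
    and starts[k] is the position in `text` where that block begins.
    """
    blocks, starts = [], []
    for offset in (1, 2):
        part = text[offset:]
        part += "$" * (-len(part) % 3)      # pad to a multiple of three
        for b in range(0, len(part), 3):
            blocks.append(part[b:b + 3])
            starts.append(offset + b)
    # Comparison sort of the distinct blocks (radix sort would be linear).
    index = {blk: r for r, blk in enumerate(sorted(set(blocks)))}
    return [index[blk] for blk in blocks], starts


ranks, starts = reduced_string("nonsense$")
# Stand-in for the recursive call: sort the reduced string's suffixes naively.
order = sorted(range(len(ranks)), key=lambda k: ranks[k:])
sorted12 = [starts[k] for k in order]

print(ranks)     # [4, 2, 1, 3, 3, 0]
print(sorted12)  # [8, 7, 4, 5, 2, 1]
```

For `"nonsense$"` the blocks are `ons ens e$$` followed by `nse nse $$$`, and `sorted12` lists the non-multiple-of-three positions in suffix order.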
Using this information, sort the suffixes at positions that are multiples of three (call them `T0`).

- For each position in `T0`, form a pair of the letter at that position and the index (rank) of the suffix right after it (which is in `T1`). These pairs are effectively strings drawn from an alphabet of size `|Σ| + n`.
- Radix sort them.
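A small sketch of that pair construction, in Python. It assumes the sorted order of the non-multiple-of-three suffixes is already available as a list (called `sorted12` here, a made-up name), and it uses the built-in comparison sort where the text calls for a radix sort.

```python
def sort_t0(text, sorted12):
    """Sort the suffixes starting at multiples of three.

    `sorted12` holds the start positions of the other suffixes, already
    in sorted order, so a position's index in it is that suffix's rank.
    """
    rank = {p: r for r, p in enumerate(sorted12)}
    t0 = range(0, len(text), 3)
    # Pair: (letter at i, rank of the suffix starting at i + 1).
    # If i + 1 is past the end, use -1 so the empty suffix sorts first.
    return sorted(t0, key=lambda i: (text[i], rank.get(i + 1, -1)))


print(sort_t0("nonsense$", [8, 7, 4, 5, 2, 1]))  # [0, 6, 3]
```

Positions 3 and 6 both start with `s`, so the tie is settled by the ranks of the suffixes at 4 and 7.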
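Merge the sorted `T0` suffixes with the sorted `T1`/`T2` suffixes into the overall suffix array.

- If the two letters being compared are the same, compare the letters after them in the string, until both suffixes reach positions whose relative order is already known.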
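The merge comparison can be sketched like this, in Python. The names `merge_suffixes`, `sorted0` and `sorted12` are hypothetical; the inputs are the two sorted lists of start positions from the previous steps. A suffix from `T0` is compared with one from `T1` by one letter plus a rank, and with one from `T2` by two letters plus a rank, because that is how far you have to read before both sides land on positions with known ranks.

```python
def merge_suffixes(text, sorted0, sorted12):
    """Merge the sorted T0 list with the sorted T1/T2 list."""
    n = len(text)
    rank = {p: r for r, p in enumerate(sorted12)}

    def ch(p):                       # letter at p, or "" past the end
        return text[p] if p < n else ""

    def rk(p):                       # rank of the suffix at p, or -1 past the end
        return rank.get(p, -1)

    def t0_first(i, j):              # i is in T0, j is in T1 or T2
        if j % 3 == 1:
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sorted0) and b < len(sorted12):
        if t0_first(sorted0[a], sorted12[b]):
            merged.append(sorted0[a])
            a += 1
        else:
            merged.append(sorted12[b])
            b += 1
    return merged + sorted0[a:] + sorted12[b:]


print(merge_suffixes("nonsense$", [0, 6, 3], [8, 7, 4, 5, 2, 1]))
# [8, 7, 4, 0, 5, 2, 1, 6, 3]
```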
####Step 2. Build LCP array
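The LCP array `L` (called `height` below) holds the LCP lengths of adjacent elems in `pos` (the suffix array).

- `pos[i]`: "what's the `i`^th lexicographically ordered suffix (== what position does it start at?)?"
- `rank[i]`: "what's the lexicographic order (rank) of the suffix starting at `i`?"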
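Put differently, `rank` is the inverse permutation of `pos`. A tiny Python sketch (the helper name `invert` is made up for this example):

```python
def invert(pos):
    """Build rank from pos, so that rank[pos[j]] == j for every j."""
    rank = [0] * len(pos)
    for j, start in enumerate(pos):
        rank[start] = j
    return rank


pos = [8, 7, 4, 0, 5, 2, 1, 6, 3]   # suffix array of "nonsense$"
rank = invert(pos)
print(rank)                          # [3, 6, 5, 8, 2, 4, 7, 1, 0]
```

For example, `rank[0] == 3` because `"nonsense$"` (the suffix starting at 0) is the fourth suffix in sorted order.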
In the main `for` loop:

- `i` represents the starting index of a suffix.
- `rank[i]` is the lexicographic order (rank) of the suffix starting at `i`.

=> each iteration of the `for` loop fills `height` at the rank of the next suffix in text order, e.g.

- 1st iteration finds the LCP for `"nonsense$"` and whatever suffix is lexicographically before that
- 2nd iteration finds the LCP for `"onsense$"` and whatever suffix is lexicographically before that
- 3rd iteration ...
`k` represents the starting index of the suffix we're comparing against. Simply put, we have `i`, the index of the current suffix (one of the two in the comparison). To get the other, since we want to compare with an adjacent suffix in the suffix array (`pos`), we take precisely the one before it in `pos`: `k = pos[rank[i] - 1]` (remember that `pos[j]` is the starting index of the `j`th lexicographically ordered suffix).
`if h > 0: h -= 1` only makes sense in the context of comparing `s[i+h]` with `s[k+h]`. So, why `+h`? OK, this is pretty cool. If `h > 0`, then in the last iteration we compared some suffix `S` with the suffix lexicographically neighbouring it (this implementation uses the one previous to it in `pos`, but you could do the one after it if you wanted) (let it be `T`) and found an overlap of `h` characters. Because of the way the `for` loop is defined (forwards over the starting indices `i`), the next current suffix (let it be `S'`) is just `S` with the first char lopped off. `T` with the first char lopped off (let it be `T'`) is also a suffix, it still sorts before `S'`, and it shares `h - 1` characters with `S'`. The suffix we actually compare against, `pos[rank[i] - 1]`, is either `T'` itself or sits between `T'` and `S'` in `pos`, hence its LCP length with `S'` is at least the LCP length between `S` and `T` minus one, and the comparison can start `h - 1` characters in. What's more, that's why this runs in linear time: `h` drops by at most one per iteration and never exceeds `n`, so the total number of character comparisons is `O(n)`. Even if the string was the same character repeated a whole bunch of times, we'd still be caching the previous iteration's overlap amount in `h`.
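The loop being described is the one from Kasai et al. Here is a minimal Python sketch of it using the same names (`pos`, `rank`, `height`, `i`, `k`, `h`). It is an illustration of the technique, not a transcription of the Haskell code; the bounds checks and the `else` branch that resets `h` are additions for safety on inputs without a unique sentinel.

```python
def kasai(text, pos):
    """LCP array: height[j] is the LCP of the suffixes at pos[j-1] and pos[j]."""
    n = len(text)
    rank = [0] * n
    for j, start in enumerate(pos):
        rank[start] = j

    height = [0] * n                 # height[0] has no predecessor; left at 0
    h = 0
    for i in range(n):               # forwards over starting indices
        if rank[i] > 0:
            k = pos[rank[i] - 1]     # the suffix just before suffix i in pos
            while i + h < n and k + h < n and text[i + h] == text[k + h]:
                h += 1
            height[rank[i]] = h
            if h > 0:
                h -= 1               # next suffix loses one shared character
        else:
            h = 0                    # no predecessor, nothing to carry over
    return height


print(kasai("nonsense$", [8, 7, 4, 0, 5, 2, 1, 6, 3]))
# [0, 0, 1, 0, 1, 3, 0, 0, 2]
```

Tracing `i = 2` and `i = 3` shows the carry-over: `"nsense$"` vs `"nse$"` gives `h = 3`, and the next iteration (`"sense$"` vs `"se$"`) starts from `h = 2` without re-reading `se`.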
Here's an example of the interaction between `pos`, `rank`, and `height` during the construction of `height` for `"nonsense$"`:

| `j` (\*) | `pos[j]` | suffix | `height[j]` (\*\*) |
|----------|----------|--------|--------------------|
| 0 | 8 | `$` | (none) |
| 1 | 7 | `e$` | 0 |
| 2 | 4 | `ense$` | 1 |
| 3 | 0 | `nonsense$` | 0 |
| 4 | 5 | `nse$` | 1 |
| 5 | 2 | `nsense$` | 3 |
| 6 | 1 | `onsense$` | 0 |
| 7 | 6 | `se$` | 0 |
| 8 | 3 | `sense$` | 2 |

(\*) In the `for` loop, this will be `rank[i]`.

(\*\*) `height[j]` is the LCP length of the suffixes at `pos[j - 1]` and `pos[j]`, i.e. of each row and the row above it.
####Step 3. Build suffix tree from suffix array and LCP array
Construct a Cartesian tree from the LCP array, fusing together nodes with the same values if one becomes a parent of the other.
Run a DFS over the tree and add missing children in the order in which they appear in the suffix array.
Assign labels to the edges based on the LCP values.
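As a rough illustration, here is a Python sketch that goes from the suffix array and LCP array to the tree. Note that it swaps in a stack-based variant: instead of building an explicit Cartesian tree and then running a DFS, it keeps the rightmost path of the tree on a stack and adds the suffixes in suffix-array order, which produces the same branching nodes in one pass. The `Node` class and the function names are made up for this example.

```python
class Node:
    def __init__(self, depth, start):
        self.depth = depth      # length of the string spelled from the root to here
        self.start = start      # start of one suffix that passes through this node
        self.children = []      # kept in suffix-array (lexicographic) order


def build_suffix_tree(text, pos, height):
    n = len(text)
    root = Node(0, 0)
    stack = [root]                           # rightmost path, root at the bottom
    for j in range(n):
        lcp = height[j] if j > 0 else 0
        last = None
        while stack[-1].depth > lcp:         # climb until the shared prefix fits
            last = stack.pop()
        if stack[-1].depth < lcp:            # need a branching node at depth lcp
            mid = Node(lcp, last.start)
            stack[-1].children[-1] = mid     # mid takes the place of last ...
            mid.children.append(last)        # ... and adopts it
            stack.append(mid)
        leaf = Node(n - pos[j], pos[j])
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root


def edge_label(text, parent, child):
    """Edge labels fall out of the depths, i.e. the LCP values."""
    return text[child.start + parent.depth:child.start + child.depth]


def leaves(node):
    if not node.children:
        return [node.start]
    return [s for c in node.children for s in leaves(c)]


text = "nonsense$"
pos = [8, 7, 4, 0, 5, 2, 1, 6, 3]
height = [0, 0, 1, 0, 1, 3, 0, 0, 2]
root = build_suffix_tree(text, pos, height)
print([edge_label(text, root, c) for c in root.children])
# ['$', 'e', 'n', 'onsense$', 'se']
```

The leaves come out in suffix-array order, and each internal node's depth is one of the LCP values, which is the correspondence with the Cartesian tree.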
T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. CPM, volume 2089 of LNCS, pages 181–192. Springer, 2001.
J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proc. ICALP, pages 943–955, 2003.
Harold Carr for examples using GraphViz in Haskell: (http://haroldcarr.com/posts/2014-02-28-using-graphviz-via-haskell.html)