-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
486 lines (352 loc) · 15.4 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
suffix_tree 1.0
=========================
Version: 2.0
README file updated on: 08-2002
This document is a summary of the project report and contains useful
information for development purposes.
By: Dotan Tsadok.
Instructor: Shlomo Yona.
Table of Contents
-----------------
1. Project Overview
2. Installation instructions
3. User Guide
4. Developer Guide
5. Testing Guide
6. References
7. License
8. Reporting bugs
1. Project Overview
-------------------
The intention of this project is to provide an open-source implementation
for an efficient data structure for strings manipulation - the Suffix Tree.
The code was written with as much consistency with the theoretical algorithm
as possible (see references). It provides a set of interface functions for
creating and searching the tree.
The suffix tree is implemented in ANSI-C. The code is written based on the
original algorithm by E.Ukkonen. For a detailed description see references.
The code is explained internally (comments in the code) and externally (this
document) and thus could be used for development.
the structure was tested for both correctness and space/time performance (see
Testing Guide).
2. Installation instructions
----------------------------
Installation for Command Line Use
Files needed for the installation:
* suffix_tree.c
* suffix_tree.h
* main.c
For all platforms that support ANSI C: Windows/DOS/Linux/Unix/Mac:
* Copy the files suffix_tree.c, suffix_tree.h and main.c into your working
directory.
* Compile suffix_tree.c and main.c (they both include suffix_tree.h so make
sure they are at the same directory) in that order. For example, in UNIX
prompt write:
gcc -ansi -pedantic -Wall suffix_tree.c main.c -o suffix_tree
In the above example, the executable suffix_tree will be created.
* Run the executable:
./suffix_tree
* For further usage information User Guide.
Installation for "As Is" Development
Files needed for the installation:
* suffix_tree.c
* suffix_tree.h
For all platforms that support ANSI C: Windows/DOS/Linux/Unix/Mac:
* Copy the files suffix_tree.c and suffix_tree.h to your working directory.
* Compile suffix_tree.c (it includes suffix_tree.h so make sure they are at the
same directory).
* In your code - include (#include) suffix_tree.h.
* Use interface functions that are listed in suffix_tree.h (see developer
Guide).
Installation for Advanced Development
Same as the above (see Installation for "As Is" Development) only now
changes are made to suffix_tree.c and/or suffix_tree.h by the programmer.
--------------------------------------------------------
Example of using the interface functions for developing:
--------------------------------------------------------
#include "suffix_tree.h"
#include <stdio.h>
int main()
{
/*Will hold the position of the substring if exists in the tree.*/
DBL_WORD position;
/*Create the suffix tree*/
SUFFIX_TREE* tree = ST_CreateTree("mississippi", 11);
/*Print the suffix tree.*/
ST_PrintTree(tree);
/*Search for a substring in the tree and return its position if exists.*/
position = ST_FindSubstring(tree, "ssis", 4);
/*Print the position of the substring*/
printf("\nPosition of ssis in mississippi is %ld.\n\n", position);
/*Delete the tree and all its nodes.*/
ST_DeleteTree(tree);
}
--------------------------------------------------------
3. User Guide
-------------
For online help - just run the executable files (see installation instructions).
The following printout will appear:
USAGE: suffix_tree <cmd> <string | filename> [<search string>] [p]
examples: immidiate string with printing: suffix_tree s mystring trin p
string from file: suffix_tree f mytree.txt substring
immidate with self testing: suffix_tree ts mystring trin
file with self testing: suffix_tree tf mytree.txt substring
<cmd> - Command. These commands are available:
s - build a suffix tree based on an immediate string, meanning
the actual string follows the command.
f - build a suffix tree based on a file, meanning the
full path file name follows the command.
ts - same as the s command but includes a self-test
as well (See Testing Guide).
tf - same as the f command but includes a self-test
as well (See Testing Guide).
<string | filename> - the string that follows command s or the file name that
follows the command f (see <cmd>).
<search string> - the string to search after the construction is done.
[p] - optional - print the tree. Printing is useful when dealing with
small trees, while printing a large tree
could take a long time.
The tree size is theoretically bounded by about 4 billion characters
(see Testing Guide). But - memory will probably be over on much lower
numbers.
In such case - A message "Out of memory." will apear on the screen and
the progarm will exit.
Output examples:
When STATISTICS is defined (see Developer Guide) and command line is
"./suffix_tree s mississippi sip", the following output appears:
=======================================================
Constructing tree.....Done.
root
+mississippi$
+i
|+ssi
||+ssippi$
||+ppi$
|+ppi$
|+$
+s
|+si
||+ssippi$
||+ppi$
|+i
||+ssippi$
||+ppi$
+p
|+pi$
|+i$
+$
N = 12
Construction: Bytes allocated per text symbol = 53.083
Atomic operations per text symbol = 4.000
Searching: Atomic operations per text symbol = 2.333
Results: Substring exists in position 7.
=======================================================
When STATISTICS is not defined the command line is "s mississippi sip",
the following output appears:
=======================================================
Constructing tree.....Done.
root
+mississippi$
+i
|+ssi
||+ssippi$
||+ppi$
|+ppi$
|+$
+s
|+si
||+ssippi$
||+ppi$
|+i
||+ssippi$
||+ppi$
+p
|+pi$
|+i$
+$
Results: Substring exists in position 7.
=======================================================
4. Developer Guide
------------------
Basic Ideas
-----------
* The suffix tree is an acyclic trie.
* The sources string is a the string the tree is based on. Only one true
source string exists.
* The Ukkonen algorithm, as described in the references, is for constructing
a tree out of one sources string and not for adding strings to the tree.
Therfore, this suffix tree behaves the same.
* The $ sign is considered unique and thus must not be a part of the source
string.
* The standart \0 sign is not assumed to end the string. This is why the
length parameter is neccessary in the cunstruction process.
* a character is of type char so up to 256 different symbols are
supported. There's no need to define how many different symbols are in
the source string in advance.
* #define STATISTICS for measuring time/space complexity.
* #define DEBUG for print-outs of stages in the constructing and searching
algorithms.
* ST_ERROR is a global variable (and not a #define) that symbols an error. It
equals length+10 (and thus is not a macro).
* A path of a node is a string from the root to that node.
* All edges are represented inside their ending nodes (more detail later) by
an index of the starting position and ending position in the source string.
* Rules 1 and 3 are implicit and thus are not implemented. Only rule 2 is
"real".
* a variable edge_pos is used for storing the index relarive an specific edge
where last match occured (details later).
* A function SPA is called for each prefix of the source string. In return,
SPA calles to function SEA for each suffix of that suffix. Naively, this
takes O(m^3) time.
* A major shortcut in Ukkonen's algorithm: When rule 3 is applied in a
specific suffix, current SEA is done and current SPA is done. The next
SPA starts by increasing e (the virtual end of all leaves) by one, and
starts calling SEA only from the suffix which rule 3 was applied to in
the previous phase (cause for that - implicit rule 1). For more detailes
on that see Ukkonen's algorithm (references).
Data Structures
--------------
The suffix tree is represented by the following data:
* tree_string - the one and only string of the tree, all edges point to it
(using indice).
* length - the length of the string + 1 for the $ sign.
* e - the current end of the tree. Used in all leaves.
* root - pointerto the tree's root node.
In the suffix tree there are no edges - edges are represented by their
ending nodes. Meanning: edge from node A to node B is written inside node B.
Being acyclic - no two incoming edges are of the same node, and thus it is
implemented correctly.
A node is represented by the floowing data:
* sons - a linked list of sons of that node.
* right_sibling - a linked list of right siblings of that node.
* left_sibling - a linked list of left siblings of that node.
* father - a pointer to that node's father.
* suffix_link - a pointer to the node that represents the largest suffix
of the current node.
* path_position - index of the start position of the node's path.
* edge_label_start - start index of the incoming edge.
* edge_label_end - end index of the incoming edge.
The actual string of the tree is written from index 1 for consistency
with the algorithm. Also - a $ sign is added at the end of it to make
the final implicit tree into explicit one (see references).
Interface functions
-------------------
ST_CreateTree(const char* str, DBL_WORD length)
Allocates memory for the tree and starts Ukkonen's construction algorithm.
Parameters: pointer to the string, length of the string.
Returns : pointer to the new tree.
ST_FindSubstring(SUFFIX_TREE* tree, char* W, DBL_WORD P)
Searches for a string in the tree. It traverses the tree down starting
its root like in a regular trie.
Parameters: the tree to search in, a pointer to the string to look for,
length of that string.
Returns : if string is not in the tree - returns ST_ERROR.
Else, returns the position it was found in the source string.
ST_PrintTree(SUFFIX_TREE* tree)
Prints the tree.
Parameters: the tree to print.
Returns : void.
ST_DeleteTree(SUFFIX_TREE* tree)
Deletes a suffix tree.
Parameters: the tree to delete.
Returns : void.
ST_SelfTest(SUFFIX_TREE* tree)
Tests a suffix tree - search for all substrings of the main string (see
Testing Guide).
Parameters: the tree to test.
Returns : 1 for success, 0 for failure.
Major Internal Functions
------------------------
apply_extension_rule_2()
Apply extension rule 2 (see Ukkonen's algorithm) - 2 cases:
(1) (1)
1.A new son (leaf) is added to / \ -> / | \
a node that already has sons. (2) (3) (2)(3)(4)
2.An edge is split and a new | |
leaf and an internal node are | (3)
added. | -> / \
(1) (1) (2)
Parameters: see code.
Returns : a pointer to the newly created node.
trace_string()
Traces for a string in the tree.
Parameters: see code.
Returns : the node where tracing has stopped, the edge position where
last match occured, the string position where last match
occured, number of characters found, a flag for signaling
whether search is done, and a flag to signal whether search
stopped at a last character of an edge.
follow_suffix_link(SUFFIX_TREE* tree, POS* pos)
Creates a suffix link of a newly created node (meaning, after rule 2 was
applied) according to Ukkonen's rules.
Parameters: the tree, the newly created node.
Returns : the position in edge where last match occured (while tracing
for the suffix node).
SEA()
Single-Phase-Algorithm (see references). Ensures that a specific extension
is in the tree by:
1. Following the current node's suffix link.
2. Checking whether the rest of the extension already is in the tree
(implicit rule 3).
3. If it is - reporting of rule 3 (= current phase is done, report handled
by function SPA()).
4. If it's not - inserting it by applying rule 2 (creating a new leaf - see
function apply_extension_rule_2()).
Parameters: see code.
Returns : a pointer to the current node, the one that will be used in the
next extentsion/phase.
void SPA()
Performs all insertion of a single phase by calling function SEA. See Basic
Ideas and description of SEA.
Parameters: see code.
Returns : a pointer to the current node, the one that will be used in the
next phase.
Minor Internal Functions
------------------------
create_node(NODE* father, DBL_WORD start, DBL_WORD end, DBL_WORD position)
Allocates a node with the given field-values. All other fields are field
with NULL or 0.
Parameters: the node's father, incoming edge start and stop index, path
position.
Returns : a pointer to that node.
find_son(SUFFIX_TREE* tree, NODE* node, char character)
Finds a son of the node that starts with character. Used for seraching
in the tree.
Parameters: the tree, the source node, the character to look for.
Returns : a pointer to that son if exists, otherwise NULL.
get_node_label_end(SUFFIX_TREE* tree, NODE* node)
Returns the end index of the incoming edge to that node.
IMPORTANT: this function is requierd because leaves does not really
hold their incoming edge end index - the real one is tree->e, which is
increased by one in each phase. This function identify if the node is a
leaf and due to that it returns its real end. NEVER access
node->edge_label_end directly.
get_node_label_length(SUFFIX_TREE* tree, NODE* node)
Returns the length of the incoming edge to that node.
Uses function get_node_label_end() (IMPORTANT: See description of the
previous function).
5. Testing Guide
----------------
There is a SelfTest interface function for testing a specific tree.
It searches all possible substring of the sources string in the tree.
In large trees this can take a cosiderable amount of time, so be careful
with it.
All indices are unsigned long int - meanning they are bounded by 2^32.
But - this is only theoretical. The tree has not been tested for that for
lack of memory reasons. It would take many gigabytes to handle that amount
of characters, considering a certain coeficent on the number of characters.
6. References
-------------
[1] Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6 -
Linear-Time Construction of Suffix Trees, Ukkonen's Algorithm.
University of California, Davis, 1997.
[2] Udi Manbar and Gene Myers, Suffix Arrays: A new method for on-line
string searches. University of Arizona, Tucson, 1989.
7. License
----------
Copyright 2002-2003 Shlomo Yona. All rights reserved.
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
8. Bug reports
--------------
Please report bugs to Shlomo Yona <shlomo@cs.haifa.ac.il>