-
Notifications
You must be signed in to change notification settings - Fork 0
Implementation of a state-of-art dictionary compression method based on LZ-Trie data structure
License
dkorenci/lzt
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
LZT package is an implementation of a state-of-art dictionary compression method based on LZ-Trie data structure. The method represents an (enumerated) set of words as an (enumerated) determinisitic finite automaton - (E)DFA. The automaton is repesented as a LZ-Trie data structure (word Trie viewed as string) and compressed with Lempel-Ziv-like method based on a fast compression algorithm. The LZ-trie data structure and compression algorithm are described in the paper: Fast construction of space-optimized recursive automaton; S.Ristov, D.Korencic; Software Practice and Experience 2014 This package implements lexicon compression, ie. compression of a list of words, compression of an enumerated lexicon, i.e. string dictionary (bidirectional mapping of words to integer indices), as well as compression of a full word to word mapping based on enumerated lexicon. The lexicon and string dictionary are implemented by the tool in shell_lzt. Word to word mapping is implemented by the tool in shell_lztd. ********* AUTHOR CONTACT You can send comments, bug reports, and queries to: Damir Korencic, name.surname@[gmail.com; irb.hr] ********* COMPILING AND RUNNING Command line tools can be found in shell_lzt and shell_lztd folders. To create the executables, run runmake.sh script from within the shell_lzt(d) folder. Running the executables without parameters outputs usage instructions. For large dictionaries stack size should be increased to prevent stack overflow when reading the compressed structure. On Linux, the command "ulimits -s sizeInKB" should do the trick. To perform batch run and test of string dictionary compression use batch_eval.sh script in shell_lzt dictionary: ./batch_eval.sh lzt "words1.txt words2.txt words3.txt" wordFilesFolder resultOutputFolder "lzt" can be substituted by any name the executable shell tool is assigned. Files must be lists of words with standard newline endings. ********* LICENSING The code is licensed under GNU LGPL license, except for the following code derived from other sources: 1. Code in the "lzt_core/fsa_convert/s_fsa_subset/" folder which contains code (modified and original) from Jan Daciuk's fsa package (http://www.jandaciuk.pl/fsa.html) that is licensed under GNU GPL. 2. Makefiles, which are adapted from the autogenerated Netbeans makefiles ********* FUNCTIONALITY OVERVIEW Implementations of algorithms and data structures can be found in the lzt_core folder. Tests exist for majority of the core functionality, and classes with tests are located in the same folders as the corresponding functionality, while the lzt_test folder contains code neccessary to run the tests with CppUnit framework. The data structures and algorithms are coded using cpp templates that enable flexible definition of basic data types that are used to represent string symbols (default is unsigned char) and trie indices (default is long or int). To change the types, change the definitions of TSymbol and TIndex types in the client code, i.e. the code of shell tools in shell_lzt(d) folders. Larger TIndex types increase memory usage during compression, however the type must suport indices large enough to index the entire trie containing all the words. Int is the safe setting for all but the largest dictionaries. Varying of TSymbol can be used for compression of non-character strings. Definition of TSymbol can also, to a small degree, influence the size of the compressed structure since TSymbol fields are used to store replaced substring lengths of lz pointers (see the original algorithm for details). The vizualization folder contains code for algorithm introspection, i.e. for creating graphical (dot format) and string representations of the trie structure.
About
Implementation of a state-of-art dictionary compression method based on LZ-Trie data structure
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published