GitHub - dkorenci/lzt: Implementation of a state-of-art dictionary compression method based on LZ-Trie data structure

dkorenci / lzt Public

Notifications You must be signed in to change notification settings
Fork 0
Star 3

Implementation of a state-of-art dictionary compression method based on LZ-Trie data structure

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
lzt_core		lzt_core
lzt_test		lzt_test
shell_lzt		shell_lzt
shell_lztd		shell_lztd
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.txt		README.txt

Repository files navigation

LZT package is an implementation of a state-of-art dictionary compression method based on LZ-Trie data structure.
The method represents an (enumerated) set of words as an (enumerated) determinisitic finite automaton - (E)DFA.
The automaton is repesented as a LZ-Trie data structure (word Trie viewed as string)
and compressed with Lempel-Ziv-like method based on a fast compression algorithm.

The LZ-trie data structure and compression algorithm are described in the paper:
Fast construction of space-optimized recursive automaton; S.Ristov, D.Korencic; Software Practice and Experience 2014

This package implements lexicon compression, ie. compression of a list of words,
compression of an enumerated lexicon, i.e. string dictionary (bidirectional mapping of words to integer indices),
as well as compression of a full word to word mapping based on enumerated lexicon.
The lexicon and string dictionary are implemented by the tool in shell_lzt.
Word to word mapping is implemented by the tool in shell_lztd.

********* AUTHOR CONTACT

You can send comments, bug reports, and queries to:
Damir Korencic, name.surname@[gmail.com; irb.hr]

********* COMPILING AND RUNNING

Command line tools can be found in shell_lzt and shell_lztd folders.
To create the executables, run runmake.sh script from within the shell_lzt(d) folder.
Running the executables without parameters outputs usage instructions.
For large dictionaries stack size should be increased to prevent stack overflow when reading the compressed structure.
On Linux, the command "ulimits -s sizeInKB" should do the trick.

To perform batch run and test of string dictionary compression use batch_eval.sh script in shell_lzt dictionary:
./batch_eval.sh lzt "words1.txt words2.txt words3.txt" wordFilesFolder resultOutputFolder
"lzt" can be substituted by any name the executable shell tool is assigned.
Files must be lists of words with standard newline endings.

********* LICENSING

The code is licensed under GNU LGPL license, except for the following code derived from other sources:

1. Code in the "lzt_core/fsa_convert/s_fsa_subset/" folder which contains code (modified and original)
from Jan Daciuk's fsa package (http://www.jandaciuk.pl/fsa.html) that is licensed under GNU GPL.

2. Makefiles, which are adapted from the autogenerated Netbeans makefiles

********* FUNCTIONALITY OVERVIEW

Implementations of algorithms and data structures can be found in the lzt_core folder.
Tests exist for majority of the core functionality, and classes with tests are
located in the same folders as the corresponding functionality,
while the lzt_test folder contains code neccessary to run the tests with CppUnit framework.

The data structures and algorithms are coded using cpp templates that
enable flexible definition of basic data types that are used to represent
string symbols (default is unsigned char) and trie indices (default is long or int).
To change the types, change the definitions of TSymbol and TIndex types in the client code,
i.e. the code of shell tools in shell_lzt(d) folders.
Larger TIndex types increase memory usage during compression, however the type
must suport indices large enough to index the entire trie containing all the words.
Int is the safe setting for all but the largest dictionaries.
Varying of TSymbol can be used for compression of non-character strings.
Definition of TSymbol can also, to a small degree, influence the size of the compressed structure
since TSymbol fields are used to store replaced substring lengths of lz pointers (see the original algorithm for details).

The vizualization folder contains code for algorithm introspection,
i.e. for creating graphical (dot format) and string representations of the trie structure.