This repository contains an implementation of the MR-CFG (Maximal Repeat Context-Free Grammar) construction algorithm presented in [1]. Specifically, the algorithm computes an MR-CFG (a straight-line grammar) for a string using its LCP-intervals.
The algorithm described in [1] is a generic algorithm that allows LCP-intervals to be computed from a variety of different data-structures so long as the intervals are iterated in shortest-LCP-value-first order. This implementation uses a Compressed Suffix Array (CSA) to compute LCP-intervals using the algorithm of [2]. Both the "optimal" and "online" versions of the algorithm described in [1] are implemented. Additionally, a third "fast" implementation based on compressed bitmaps has been implemented.
This implementation is not particularly fast or memory efficient. And the generated grammar is not encoded or output. Instead, statistics about the grammar are reported and then the input string is reconstructed from the grammar and output for validation. Moreover, this implementation is intended to be used as a reference only.
All non-CLI code has been implemented as a header-only library so that it may easily be used and extended.
This project depends on Succinct Data Structure Library 3.0 and Roaring Bitmap. You can install these on your system yourself before proceeding or let the build system install them in the repository's directory for you.
The project uses the CMake meta-build system to generate build files specific to your environment. Generate the build files as follows:
cmake -B build .
This will create a build/
directory containing all the build files.
The first time you run this command it may take a while because it downloads dependencies from GitHub.
To actually build the code, run:
cmake --build build
This will generate an MR-CFG
executable in the build/
directory.
Similar to the previous command, the first time you run this command may take a while because it has to build the dependencies.
If you make changes to the code, you only have to run this command to recompile the code.
MR-CFG
uses a command-line interface (CLI).
Its usage instructions are as follows:
Usage: MR-CFG {OPTIMAL|ONLINE|FAST} <FILE>
The first argument - {OPTIMAL|ONLINE|FAST}
- specifies what interval stabbing algorithm to use.
OPTIMAL
uses the theoretically optimal ONLINE
uses the more space efficient but theoretically slower FAST
uses an algorithm based on compressed bitmaps that is relatively fast and space efficient.
The second argument - <FILE>
- is a file containing text a straight-line grammar (SLG) will be built from.
After it computes the SLG, MR-CFG outputs the string the SLG produces to the standard error stream for validation. This output can be captured in a file as follows:
./build/MR-CFG ONLINE my-input-file.txt 2> my-output-file.txt
Here my-input-file.txt
is the input file and my-output-file.txt
is the file that will capture the output to standard error, i.e. a copy of my-input-file.txt
generated from the SLG.
These files can be compared using a tool like diff
:
diff my-input-file.txt my-output-file.txt
Basic run-time info and statistics about the computed SLG will be output to the standard output.
The MR-CFG was first introduced in [3] as a CFG that can be derived from the Compact Directed Acyclic Word Graph (CDAWG) of a string. In [1], we showed that its relationship to the CDAWG connects the MR-CFG to various string properties and data-structures. These connections are not typical of SLGs and merit further investigation. The primary contribution of [1] was to elucidate these connections and provide an algorithm for constructing the MR-CFG without the CDAWG, allowing it to be studied independently while further relating it to the existing stringology literature. Additionally, these connections provide an opportunity for various string properties and data-structures to be better connected to string attractors and dictionary compressors in general [4].
The actual run-time performance of our algorithm and compression power of the MR-CFG was not the purpose of our study, so experiments were not included in [1]. However, such experiments may provide insight into limitations of the MR-CFG and guide improvements to both the algorithm and the grammar. Here we present experimental results on the data set from the MR-RePair manuscript [5]. We chose this data set because it contains a nice variety of data types and sizes. MR-RePair is also a recent grammar-based compression algorithm based on maximal repeats, although it is not as strongly connected with other string properties and structures as MR-CFG.
Data | MR-RePair | MR-CFG | ||
---|---|---|---|---|
Name Number of Characters Alphabet Size |
rand77.txt 65537 77 |
4492 46143 9 46152 |
123 54 65483 65537 |
Number of Rules Non-Start Total Length Start Length Total Length |
Name Number of Characters Alphabet Size |
world192.txt 2473401 94 |
48601 104060 212940 317000 |
96701 295586 693085 988671 |
Number of Rules Non-Start Total Length Start Length Total Length |
Name Number of Characters Alphabet Size |
bible.txt 4077776 63 |
72082 153266 386516 539782 |
216244 610301 1549683 2159984 |
Number of Rules Non-Start Total Length Start Length Total Length |
Name Number of Characters Alphabet Size |
E_coli.txt 4638691 4 |
62363 129138 650174 779312 |
283701 680553 3725385 4405938 |
Number of Rules Non-Start Total Length Start Length Total Length |
Name Number of Characters Alphabet Size |
einstein.de.txt 92758442 117 |
21787 71709 12683 84392 |
26355 155697 29310 185007 |
Number of Rules Non-Start Total Length Start Length Total Length |
Name Number of Characters Alphabet Size |
fib41.txt 267914297 2 |
38 76 3 79 |
41 74 24 98 |
Number of Rules Non-Start Total Length Start Length Total Length |
- A. Cleary and J. Dood, "Constructing the CDAWG CFG using LCP-Intervals," 2023 Data Compression Conference (DCC). IEEE, Mar. 2023. doi: 10.1109/DCC55655.2023.00026
- T. Beller, K. Berger, and E. Ohlebusch, "Space-Efficient Computation of Maximal and Supermaximal Repeats in Genome Sequences," String Processing and Information Retrieval. Springer Berlin Heidelberg, pp. 99–110, 2012. doi: 10.1007/978-3-642-34109-0_11.
- D. Belazzougui and F. Cunial, "Representing the Suffix Tree with the CDAWG," 2019 Data Compression Conference (DCC). Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik GmbH, Wadern/Saarbruecken, Germany, 2017, doi: 10.4230/LIPICS.CPM.2017.7.
- D. Kempa and N. Prezza, "At the roots of dictionary compression: string attractors," Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. ACM, Jun. 20, 2018. doi: 10.1145/3188745.3188814.
- I. Furuya, T. Takagi, Y. Nakashima, S. Inenaga, H. Bannai, and T. Kida, "MR-RePair: Grammar Compression Based on Maximal Repeats," 2019 Data Compression Conference (DCC). IEEE, Mar. 2019. doi: 10.1109/dcc.2019.00059.