Skip to content

MR-CFG builds a straight-line grammar using LCP-intervals

License

Notifications You must be signed in to change notification settings

alancleary/mr-cfg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MR-CFG

This repository contains an implementation of the MR-CFG (Maximal Repeat Context-Free Grammar) construction algorithm presented in [1]. Specifically, the algorithm computes an MR-CFG (a straight-line grammar) for a string using its LCP-intervals.

The algorithm described in [1] is a generic algorithm that allows LCP-intervals to be computed from a variety of different data-structures so long as the intervals are iterated in shortest-LCP-value-first order. This implementation uses a Compressed Suffix Array (CSA) to compute LCP-intervals using the algorithm of [2]. Both the "optimal" and "online" versions of the algorithm described in [1] are implemented. Additionally, a third "fast" implementation based on compressed bitmaps has been implemented.

This implementation is not particularly fast or memory efficient. And the generated grammar is not encoded or output. Instead, statistics about the grammar are reported and then the input string is reconstructed from the grammar and output for validation. Moreover, this implementation is intended to be used as a reference only.

All non-CLI code has been implemented as a header-only library so that it may easily be used and extended.

Building

This project depends on Succinct Data Structure Library 3.0 and Roaring Bitmap. You can install these on your system yourself before proceeding or let the build system install them in the repository's directory for you.

The project uses the CMake meta-build system to generate build files specific to your environment. Generate the build files as follows:

cmake -B build .

This will create a build/ directory containing all the build files. The first time you run this command it may take a while because it downloads dependencies from GitHub.

To actually build the code, run:

cmake --build build

This will generate an MR-CFG executable in the build/ directory. Similar to the previous command, the first time you run this command may take a while because it has to build the dependencies. If you make changes to the code, you only have to run this command to recompile the code.

Running

MR-CFG uses a command-line interface (CLI). Its usage instructions are as follows:

Usage: MR-CFG {OPTIMAL|ONLINE|FAST} <FILE>

The first argument - {OPTIMAL|ONLINE|FAST} - specifies what interval stabbing algorithm to use. OPTIMAL uses the theoretically optimal $\mathcal{O}(n)$ time algorithm, where $n$ is the length of the input text. ONLINE uses the more space efficient but theoretically slower $\mathcal{O}(n\log{m})$ time algorithm based on binary search, where $m$ is the number of maximal repeats in the input text. And FAST uses an algorithm based on compressed bitmaps that is relatively fast and space efficient. The second argument - <FILE> - is a file containing text a straight-line grammar (SLG) will be built from.

After it computes the SLG, MR-CFG outputs the string the SLG produces to the standard error stream for validation. This output can be captured in a file as follows:

./build/MR-CFG ONLINE my-input-file.txt 2> my-output-file.txt

Here my-input-file.txt is the input file and my-output-file.txt is the file that will capture the output to standard error, i.e. a copy of my-input-file.txt generated from the SLG. These files can be compared using a tool like diff:

diff my-input-file.txt my-output-file.txt

Basic run-time info and statistics about the computed SLG will be output to the standard output.

Results

The MR-CFG was first introduced in [3] as a CFG that can be derived from the Compact Directed Acyclic Word Graph (CDAWG) of a string. In [1], we showed that its relationship to the CDAWG connects the MR-CFG to various string properties and data-structures. These connections are not typical of SLGs and merit further investigation. The primary contribution of [1] was to elucidate these connections and provide an algorithm for constructing the MR-CFG without the CDAWG, allowing it to be studied independently while further relating it to the existing stringology literature. Additionally, these connections provide an opportunity for various string properties and data-structures to be better connected to string attractors and dictionary compressors in general [4].

The actual run-time performance of our algorithm and compression power of the MR-CFG was not the purpose of our study, so experiments were not included in [1]. However, such experiments may provide insight into limitations of the MR-CFG and guide improvements to both the algorithm and the grammar. Here we present experimental results on the data set from the MR-RePair manuscript [5]. We chose this data set because it contains a nice variety of data types and sizes. MR-RePair is also a recent grammar-based compression algorithm based on maximal repeats, although it is not as strongly connected with other string properties and structures as MR-CFG.

Data MR-RePair MR-CFG
Name
Number of Characters
Alphabet Size
rand77.txt
65537
77
4492
46143
9
46152
123
54
65483
65537
Number of Rules
Non-Start Total Length
Start Length
Total Length
Name
Number of Characters
Alphabet Size
world192.txt
2473401
94
48601
104060
212940
317000
96701
295586
693085
988671
Number of Rules
Non-Start Total Length
Start Length
Total Length
Name
Number of Characters
Alphabet Size
bible.txt
4077776
63
72082
153266
386516
539782
216244
610301
1549683
2159984
Number of Rules
Non-Start Total Length
Start Length
Total Length
Name
Number of Characters
Alphabet Size
E_coli.txt
4638691
4
62363
129138
650174
779312
283701
680553
3725385
4405938
Number of Rules
Non-Start Total Length
Start Length
Total Length
Name
Number of Characters
Alphabet Size
einstein.de.txt
92758442
117
21787
71709
12683
84392
26355
155697
29310
185007
Number of Rules
Non-Start Total Length
Start Length
Total Length
Name
Number of Characters
Alphabet Size
fib41.txt
267914297
2
38
76
3
79
41
74
24
98
Number of Rules
Non-Start Total Length
Start Length
Total Length

References

  1. A. Cleary and J. Dood, "Constructing the CDAWG CFG using LCP-Intervals," 2023 Data Compression Conference (DCC). IEEE, Mar. 2023. doi: 10.1109/DCC55655.2023.00026
  2. T. Beller, K. Berger, and E. Ohlebusch, "Space-Efficient Computation of Maximal and Supermaximal Repeats in Genome Sequences," String Processing and Information Retrieval. Springer Berlin Heidelberg, pp. 99–110, 2012. doi: 10.1007/978-3-642-34109-0_11.
  3. D. Belazzougui and F. Cunial, "Representing the Suffix Tree with the CDAWG," 2019 Data Compression Conference (DCC). Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik GmbH, Wadern/Saarbruecken, Germany, 2017, doi: 10.4230/LIPICS.CPM.2017.7.
  4. D. Kempa and N. Prezza, "At the roots of dictionary compression: string attractors," Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. ACM, Jun. 20, 2018. doi: 10.1145/3188745.3188814.
  5. I. Furuya, T. Takagi, Y. Nakashima, S. Inenaga, H. Bannai, and T. Kida, "MR-RePair: Grammar Compression Based on Maximal Repeats," 2019 Data Compression Conference (DCC). IEEE, Mar. 2019. doi: 10.1109/dcc.2019.00059.