MR-CFG

This repository contains an implementation of the MR-CFG (Maximal Repeat Context-Free Grammar) construction algorithm presented in [1]. Specifically, the algorithm computes an MR-CFG (a straight-line grammar) for a string using its LCP-intervals.

The algorithm described in [1] is a generic algorithm that allows LCP-intervals to be computed from a variety of different data-structures so long as the intervals are iterated in shortest-LCP-value-first order. This implementation uses a Compressed Suffix Array (CSA) to compute LCP-intervals using the algorithm of [2]. Both the "optimal" and "online" versions of the algorithm described in [1] are implemented. Additionally, a third "fast" implementation based on compressed bitmaps has been implemented.

This implementation is not particularly fast or memory efficient. And the generated grammar is not encoded or output. Instead, statistics about the grammar are reported and then the input string is reconstructed from the grammar and output for validation. Moreover, this implementation is intended to be used as a reference only.

All non-CLI code has been implemented as a header-only library so that it may easily be used and extended.

Building

This project depends on Succinct Data Structure Library 3.0 and Roaring Bitmap. You can install these on your system yourself before proceeding or let the build system install them in the repository's directory for you.

The project uses the CMake meta-build system to generate build files specific to your environment. Generate the build files as follows:

cmake -B build .

This will create a build/ directory containing all the build files. The first time you run this command it may take a while because it downloads dependencies from GitHub.

To actually build the code, run:

cmake --build build

This will generate an MR-CFG executable in the build/ directory. Similar to the previous command, the first time you run this command may take a while because it has to build the dependencies. If you make changes to the code, you only have to run this command to recompile the code.

Running

MR-CFG uses a command-line interface (CLI). Its usage instructions are as follows:

Usage: MR-CFG {OPTIMAL|ONLINE|FAST} <FILE>

The first argument - {OPTIMAL|ONLINE|FAST} - specifies what interval stabbing algorithm to use. OPTIMAL uses the theoretically optimal $\mathcal{O}(n)$ time algorithm, where $n$ is the length of the input text. ONLINE uses the more space efficient but theoretically slower $\mathcal{O}(n\log{m})$ time algorithm based on binary search, where $m$ is the number of maximal repeats in the input text. And FAST uses an algorithm based on compressed bitmaps that is relatively fast and space efficient. The second argument - <FILE> - is a file containing text a straight-line grammar (SLG) will be built from.

After it computes the SLG, MR-CFG outputs the string the SLG produces to the standard error stream for validation. This output can be captured in a file as follows:

./build/MR-CFG ONLINE my-input-file.txt 2> my-output-file.txt

Here my-input-file.txt is the input file and my-output-file.txt is the file that will capture the output to standard error, i.e. a copy of my-input-file.txt generated from the SLG. These files can be compared using a tool like diff:

diff my-input-file.txt my-output-file.txt

Basic run-time info and statistics about the computed SLG will be output to the standard output.

Results

The MR-CFG was first introduced in [3] as a CFG that can be derived from the Compact Directed Acyclic Word Graph (CDAWG) of a string. In [1], we showed that its relationship to the CDAWG connects the MR-CFG to various string properties and data-structures. These connections are not typical of SLGs and merit further investigation. The primary contribution of [1] was to elucidate these connections and provide an algorithm for constructing the MR-CFG without the CDAWG, allowing it to be studied independently while further relating it to the existing stringology literature. Additionally, these connections provide an opportunity for various string properties and data-structures to be better connected to string attractors and dictionary compressors in general [4].

The actual run-time performance of our algorithm and compression power of the MR-CFG was not the purpose of our study, so experiments were not included in [1]. However, such experiments may provide insight into limitations of the MR-CFG and guide improvements to both the algorithm and the grammar. Here we present experimental results on the data set from the MR-RePair manuscript [5]. We chose this data set because it contains a nice variety of data types and sizes. MR-RePair is also a recent grammar-based compression algorithm based on maximal repeats, although it is not as strongly connected with other string properties and structures as MR-CFG.

	Data	MR-RePair	MR-CFG
Name Number of Characters Alphabet Size	rand77.txt 65537 77	4492 46143 9 46152	123 54 65483 65537	Number of Rules Non-Start Total Length Start Length Total Length
Name Number of Characters Alphabet Size	world192.txt 2473401 94	48601 104060 212940 317000	96701 295586 693085 988671	Number of Rules Non-Start Total Length Start Length Total Length
Name Number of Characters Alphabet Size	bible.txt 4077776 63	72082 153266 386516 539782	216244 610301 1549683 2159984	Number of Rules Non-Start Total Length Start Length Total Length
Name Number of Characters Alphabet Size	E_coli.txt 4638691 4	62363 129138 650174 779312	283701 680553 3725385 4405938	Number of Rules Non-Start Total Length Start Length Total Length
Name Number of Characters Alphabet Size	einstein.de.txt 92758442 117	21787 71709 12683 84392	26355 155697 29310 185007	Number of Rules Non-Start Total Length Start Length Total Length
Name Number of Characters Alphabet Size	fib41.txt 267914297 2	38 76 3 79	41 74 24 98	Number of Rules Non-Start Total Length Start Length Total Length

References

A. Cleary and J. Dood, "Constructing the CDAWG CFG using LCP-Intervals," 2023 Data Compression Conference (DCC). IEEE, Mar. 2023. doi: 10.1109/DCC55655.2023.00026
T. Beller, K. Berger, and E. Ohlebusch, "Space-Efficient Computation of Maximal and Supermaximal Repeats in Genome Sequences," String Processing and Information Retrieval. Springer Berlin Heidelberg, pp. 99–110, 2012. doi: 10.1007/978-3-642-34109-0_11.
D. Belazzougui and F. Cunial, "Representing the Suffix Tree with the CDAWG," 2019 Data Compression Conference (DCC). Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik GmbH, Wadern/Saarbruecken, Germany, 2017, doi: 10.4230/LIPICS.CPM.2017.7.
D. Kempa and N. Prezza, "At the roots of dictionary compression: string attractors," Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. ACM, Jun. 20, 2018. doi: 10.1145/3188745.3188814.
I. Furuya, T. Takagi, Y. Nakashima, S. Inenaga, H. Bannai, and T. Kida, "MR-RePair: Grammar Compression Based on Maximal Repeats," 2019 Data Compression Conference (DCC). IEEE, Mar. 2019. doi: 10.1109/dcc.2019.00059.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
include/mr-cfg		include/mr-cfg
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MR-CFG

Building

Running

Results

References

About

Releases 1

Packages

Languages

License

alancleary/mr-cfg

Folders and files

Latest commit

History

Repository files navigation

MR-CFG

Building

Running

Results

References

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages