Heuristics for the Smallest Grammar Problem (SGP)
The smallest grammar problem is the problem of finding the smallest context-free grammar that generates exactly one given sequence. Approximating the problem with a ratio of less than 8569/8568 is known to be NP-hard. Most work on this problem has focused on finding decent solutions fast (mostly in linear time), rather than on good heuristic algorithms.
This project provides a hybrid of a max-min ant system and a genetic algorithm that in combination with a novel local search outperforms the state of the art on all files of the Canterbury corpus, a standard benchmark suite. Furthermore, this hybrid performs well on a standard DNA corpus.
- Go version 1.x
Most of the code is released under the Apache 2.0 license (see LICENSE). The only exception is src/qsufsort/qsufsort.go which is a modified version of the file from the Go library and thus released under a BSD-style license (see src/qsufsort/LICENSE). All source code files contain a copyright notice header.
To build everything, use:
A explanation of all command line arguments is provided with:
Usage of sgp: -choice="": file containing an encoded choice of yields to start with -cpuprofile="": enables CPU profiling -csv=false: output in CSV format -i="": input file -initls=false: save intermediate results of initial local search -irr=10: number of IRR runs -ls=true: use local search -lscandidates=1000: max #candidates considered for bottom-up local search -m="hybrid": metaheuristic (genetic, ants, hybrid, none) -number=-1: this number is appended to the created files -outputlimit=-1: output every grammar of this or smaller size -print=false: prints the grammar human readable to stdout -rirr=false: randomized IRR -seed=-1: seed for pseudo-random number generator -sizelimit=-1: stop if a grammar of this size is reached -timelimit=0: time limit with unit (e.g. 10s)
Run the hybrid
sgp --i <input file> --timelimit 10s
Any result can easily be checked for correctness with a separate small checker. The checker is an independent programm that has less than 200 lines of code. Thus, the implementation fulfills the requirements of a Certifying Algorithm. The choice of nonterminals is used as a compact way of storing grammars and also plays the role of the witness:
checker --f <input file> --g <grammar to verify> -s <grammar size>
Canterbury corpus: http://corpus.canterbury.ac.nz/
This project started during my Master's thesis at Saarland University in winter 2012/13. The results are published in the following paper:
An Effective Heuristic for the Smallest Grammar Problem. Florian Benz and Timo Kötzing. GECCO '13: Proceedings of the 15th International Conference on Genetic and Evolutionary Computation