Probabilistic Context Free Grammar (PCFG) Library for C#
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
CFGLib don't print productions in pdf for now Jan 6, 2018
CFGLibTest only parse ebnf with new parser Jan 3, 2018
ConsolePlayground parsing a really simple "cons" attribute Jan 6, 2018
Grammars fixing capitalization of resource files for case sensitive platforms Sep 12, 2018
comparison
wiki updating readme with sppf Mar 9, 2016
.gitignore adding comparison grammar Dec 7, 2017
CFGLib.sln highly ambiguous grammars still use tons of memory in traversal Dec 31, 2017
LICENSE.txt added basic license and readme Jan 18, 2016
README.md removed Production.New(), but added some new constructors Mar 12, 2016

README.md

CFGLib

A C# library for (Probabilistic) Context Free Grammars (PCFGs). Supports probability-preserving conversion to Chomsky Normal Form (CNF), as well as the CYK and Earley parsing algorithms. These algorithms allow you to tell whether or not a grammar could have generated a given string, and if so, at what probability they are generated. In addition, the Earley algorithm supports returning a parse forest (containing all possible parse trees) in Shared Packed Parse Forest (SPPF) format.

Earley is significantly faster, while CYK is simpler and the code is easier to understand. Having both allows me to test them against one another.

License

CFGLib is licensed using the MIT license.

Examples

CFGLib can perform a number of basic operations on context free grammars. If there are some operations you need, feel free to suggest them on the Issues page!

Representation and Acceptance

CFGLib can be used to represent grammars and then use those grammars to parse terms:

// S -> aSa | bSb | ε
var productions = new List<Production> {
	// construct productions by passing arguments...
	new Production(
		lhs: Nonterminal.Of("S"),
		rhs: new Sentence { Terminal.Of("a"), Nonterminal.Of("S"), Terminal.Of("a") },
		weight: 20
	),
	// or from a string...
	CFGParser.Production(@"<S> -> 'b' <S> 'b' [10]"),
	CFGParser.Production(@"<S> -> ε [5]"),
};
var cfg = new Grammar(productions, Nonterminal.Of("S"));
var ep = new EarleyParser(cfg);

// How frequently does the grammar generate the string "aabb"?
Console.WriteLine("aabb: {0}", ep.ParseGetProbability(Sentence.FromLetters("aabb")));
// How about "abba"?
Console.WriteLine("abba: {0}", ep.ParseGetProbability(Sentence.FromLetters("abba")));
aabb: 0
abba: 0.0233236151603499

Explore the Grammar

CFGLib can generate random terms from a given grammar:

for (int i = 0; i < 5; i++) {
	Console.WriteLine(cfg.ProduceRandom().AsTerminals());
}
bbbaaaaaabbb
aa
aaabaabaaa
aa
ε

It can also explore your grammar, returning all the strings generated up to a particular depth:

var sentences = cfg.ProduceToDepth(3);
foreach (var sentence in sentences) {
	Console.WriteLine(sentence.Value.AsTerminals());
}
ε
aa
bb
aaaa
abba
baab
bbbb

Parse Forests

Consider the following simple, but ambiguous grammar:

<S> → <S> '+' <S>
<S> → '0'

Using this grammar to parse the string 0 + 0 + 0 yields a parse forest, not just a parse tree, because the parsing is ambiguous. That is, it could either be grouped like (0 + 0) + 0 or 0 + (0 + 0). The tool returns a shared packed parse forest (SPPF):
Addition SPPF

or for convenience (but not efficiency), a flattened version:
Addition SPPF Flattened

Random Grammar Generation

You can also generate random Grammars (either generic, or CNF).

var gg = new GrammarGenerator(1);
var terminals = new List<Terminal> { Terminal.Of("a"), Terminal.Of("b") };
var randGram = gg.NextCFG(
	numNonterminals: 4,
	numProductions: 10,
	maxProductionLength: 4,
	terminals: terminals
);
Console.WriteLine(randGram);
Grammar(<X_0>){
  2.75e-001: <X_0> → 'a' <X_3> <X_2> [47.7010679872246]
  1.49e-001: <X_0> → 'b' [25.8029145527645]
  4.19e-001: <X_2> → 'b' 'a' <X_1> [29.2729223967869]
  4.41e-001: <X_3> → <X_3> 'b' 'b' [17.9467153106568]
  5.77e-001: <X_0> → 'a' 'b' <X_3> 'a' [100.187609646091]
  4.05e-001: <X_1> → <X_0> <X_2> <X_1> [68.5405530107862]
  5.81e-001: <X_2> → ε [40.5143099778864]
  2.33e-001: <X_3> → <X_2> <X_0> <X_0> <X_2> [9.50396021665258]
  5.95e-001: <X_1> → 'a' <X_0> <X_3> <X_3> [100.789105122811]
  3.26e-001: <X_3> → <X_0> <X_1> [13.2542167139492]
}

I used randomly generated grammars and both parsing algorithms to help test and make sure everything is correct.

Caveats

There are still a few weaknesses:

  • No support for EBNF
  • No easy to use forest visitor

Other Useful Classes and Methods

Here are some other useful classes and methods, in addition to the ones exhibited above.

  • Sentence: A list of Nonterminals and Terminals
    • static Sentence FromWords(string s); (tokenizes on space)
    • static Sentence FromTokens(IEnumerable<string> tokens);
  • BaseGrammar > {Grammar, CNFGrammar}: All kinds of grammars inherit from BaseGrammar.
    • Sentence ProduceRandom();
    • CNFGrammar ToCNF();: Returns an equivalent grammar in CNF form
    • double EstimateProbabilityNull(Nonterminal nt, long iterations);
    • void Simplify(); [Note: Grammars are simplified by default, but if you edit a production afterwards (making it unproducing), you might need to force a simplification]
  • Parser > {EarleyParser, CykParser}: The currently implemented parsers
    • Sppf ParseGetForest();