Puck is a lightning-fast parser for natural languages using GPUs

Puck

(c) 2014 David Hall

Puck is a high-speed, high-accuracy parser for natural languages. It's (currently) designed for use with grammars trained with the Berkeley Parser and on NVIDIA cards. On recent-ish NVIDIA cards (e.g. a GTX 680), it parses around 400 sentences a second with a full Berkeley grammar, for sentences of length <= 40.

Puck is only useful if you plan on parsing a lot of sentences (on the order of a few thousand). Also, it's designed for throughput, not latency.

The current version is 0.2.

Puck is based on the research in two papers:

  • David Hall, Taylor Berg-Kirkpatrick, John Canny, and Dan Klein. 2014. Better, Faster, Sparser GPU Parsing. To Appear in Proceedings of the Association for Computational Linguistics.
  • John Canny, David Hall, and Dan Klein. 2013. A multi-Teraflop Constituency Parser using GPUs. In Proceedings of Empirical Methods in Natural Language Processing.

Documentation

Puck has three main classes: puck.parser.CompileGrammar compiles the GPU representation of a grammar, puck.parser.RunParser parses with that grammar, and puck.parser.CLParser is for experimental use. Running any of them with --help lists all options.

Obtaining Puck

A prebuilt version of Puck can be downloaded from http://www.scalanlp.org/releases/puck-assembly-0.2.jar. Plain-text grammar files (needed to set up a parser) are available in the textGrammars/ directory of this repository.

Building Puck

This project can be built with sbt 0.13. Run sbt assembly to create a fat jar in target/scala-2.10/.

Compiling a Grammar

The first step in using Puck is to compile a grammar to GPU code. The best way to do this is to run the command

java -Xmx4g -cp target/scala-2.10/puck-assembly-0.2.jar puck.parser.CompileGrammar --textGrammarPrefix textGrammars/wsj_1.gr:textGrammars/wsj_6.gr --grammar grammar.grz

This command will take a long time: up to an hour. When it finishes, it writes a parser equivalent to the one used in the 2014 paper to a file called grammar.grz. The textGrammarPrefix argument accepts a sequence of plain-text grammars, separated by colons. We have provided the cascade of grammars used in the Berkeley Parser for English. In practice, using just wsj_1 and wsj_6 gives you all the benefit for GPU grammars.
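
Since the textGrammarPrefix argument takes a colon-separated list, it can be convenient to assemble the value from individual paths. The sketch below is one way to do that in a POSIX shell. join_grammars is a hypothetical helper, not part of Puck, and the last command only prints the compile command from above rather than running it.

```shell
# join_grammars is a hypothetical helper (not part of Puck): it joins its
# arguments with ':' to form a value for --textGrammarPrefix.
join_grammars() {
  # "$*" joins the arguments with the first character of IFS; the subshell
  # keeps the IFS change from leaking into the caller.
  ( IFS=':'; printf '%s\n' "$*" )
}

prefix=$(join_grammars textGrammars/wsj_1.gr textGrammars/wsj_6.gr)

# Print (rather than run) the compile command shown above.
echo java -Xmx4g -cp target/scala-2.10/puck-assembly-0.2.jar \
  puck.parser.CompileGrammar --textGrammarPrefix "$prefix" --grammar grammar.grz
```

Passing more grammar files to join_grammars extends the list in the order given.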

Running the parser

The parser can be run with:

java -Xmx4g -cp target/scala-2.10/puck-assembly-0.2.jar puck.parser.RunParser --grammar grammar.grz <input files>

This will output 1 tree per line to files named [input file name].parsed. By default it will skip sentences longer than 50 words, printing out "(())" instead. If no files are listed, it will read from standard input.

If the sentences are already split up into one sentence per line, use --sentences newline. If the words are already tokenized into PTB tokens, use --tokens whitespace.
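
Because a skipped sentence appears as a literal "(())" line, one quick way to see how many sentences were skipped is to count those lines in an output file. This is a minimal sketch, assuming a POSIX shell with grep. count_skipped is a hypothetical helper, and the sample file contents are invented for illustration; only the "(())" placeholder comes from the description above.

```shell
# count_skipped is a hypothetical helper (not part of Puck): it counts lines
# that consist solely of the "(())" placeholder in a .parsed file.
count_skipped() {
  # -x: whole-line match, -F: fixed string, -c: print the count.
  # grep exits non-zero when the count is 0, so ignore its exit status.
  grep -c -x -F '(())' "$1" || true
}

# Invented sample output standing in for a real [input file name].parsed.
sample=$(mktemp)
printf '%s\n' '(S (NP (NN example)) (VP (VBZ parses)))' '(())' \
  '(S (NP (NN another)) (VP (VBZ parses)))' '(())' > "$sample"

skipped=$(count_skipped "$sample")
total=$(wc -l < "$sample" | tr -d ' ')
echo "skipped $skipped of $total sentences"
rm -f "$sample"
```

On real output, pass the path of a .parsed file to count_skipped instead of the temporary sample.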

Initializing the parser can take 2-3 minutes, longer if it's your first time starting the parser. So this parser is only worth your time if you plan on parsing a lot of text. Also note that you won't get 400 sentences a second if you aren't parsing a lot of sentences.
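
To see why start-up time matters, here is a rough back-of-the-envelope estimate. It assumes a fixed start-up cost of 150 seconds (the midpoint of the 2-3 minutes mentioned above) followed by a steady 400 sentences a second. Both the model and the effective_rate helper are illustrative assumptions, not measurements.

```shell
# effective_rate is a hypothetical helper: overall sentences per second
# including start-up, under the simple model  n / (startup + n / rate).
effective_rate() {
  awk -v n="$1" -v startup=150 -v rate=400 \
    'BEGIN { printf "%.1f\n", n / (startup + n / rate) }'
}

for n in 1000 100000 1000000; do
  echo "$n sentences: $(effective_rate "$n") sentences/second overall"
done
```

Under these assumptions, 100,000 sentences average 250 sentences a second overall, while a 1,000-sentence job is dominated by start-up. The model is optimistic for small jobs, since the 400 sentences a second figure itself only holds when parsing a lot of sentences.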

Experiments

We benchmarked our parser by running it on the treebank.

java -cp target/scala-2.10/puck-assembly-0.2.jar puck.parser.CLParser --maxParseLength 40 --treebank.path /path/to/treebank/wsj --maxLength 40 --numToParse 20000 --reproject false --viterbi true --cache false --textGrammarPrefix textGrammars/wsj_1.gr:textGrammars/wsj_6.gr --mem 4g --device 680

Running this command should reproduce our benchmark.

Acknowledgements

David Hall is supported by a Google PhD Fellowship. Taylor Berg-Kirkpatrick is supported by a Qualcomm fellowship. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.