A simple chart parser in Python, with CKY implemented in Cython for speed.
Inspired by the recent success of benepar and the minimal-span parser, I wanted to revisit chart parsing with CKY on binarized trees. No neural networks here, however: just rule probabilities estimated by maximum likelihood.
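As a rough illustration of what maximum-likelihood estimation means here, the sketch below computes P(lhs -> rhs) = count(lhs -> rhs) / count(lhs) from a list of rule occurrences. The function name `estimate_rule_probs` and the toy rules are assumptions made for this example, not the code or data used in this repository.

```python
from collections import Counter


def estimate_rule_probs(rules):
    """Relative-frequency (maximum-likelihood) estimate of rule probabilities.

    `rules` is a list of (lhs, rhs) pairs, one per rule occurrence,
    where rhs is a tuple of symbols.
    """
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical rule occurrences, as if read off a few binarized trees.
toy_rules = [
    ("S", ("NP", "VP")),
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("VP", ("VBD", "NP")),
]

probs = estimate_rule_probs(toy_rules)
for (lhs, rhs), p in sorted(probs.items()):
    print(lhs, "->", " ".join(rhs), round(p, 3))
```

The probabilities of all rules sharing a left-hand side sum to one, which is what makes the result a proper PCFG.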
To obtain the data and grammar, use:
cd grammar
./get-grammar.sh
To compile cky, use:
cd cky
python setup.py build_ext --inplace
To run a quick test, use:
python main.py
To parse a sentence, use:
python main.py --sent "The horse raced past the barn fell."
To parse the dev-set and compute f-score, use:
python main.py --infile grammar/dev/dev.tokens --outfile grammar/dev/dev.pred.trees --goldfile grammar/data/dev.trees
This can be done in parallel by adding --parallel.
To parse 5 sentences from the dev-set, show predicted and gold parses, and compute their individual f-scores, use:
python main.py --treefile grammar/data/dev.trees -n 5
The default grammar used is the vanilla CNF. To use the (v1h1) Markovized grammar, use:
python main.py --grammar grammar/train/train.markov.grammar
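For readers unfamiliar with the v1h1 notation: v1 means no parent annotation, and h1 means that the intermediate symbols introduced during binarization remember only one previously generated sibling. The sketch below shows one way such a binarization could look. The function `binarize` and the symbol format `NP|<DT>` are assumptions for illustration; the symbols in the grammar files shipped here may be named differently.

```python
def binarize(lhs, rhs, h=1):
    """Left-to-right binarization of lhs -> rhs with horizontal Markov order h.

    Each intermediate symbol records only the last h siblings already
    generated, so long rules share intermediate symbols.
    """
    rhs = list(rhs)
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    parent = lhs
    for i in range(len(rhs) - 2):
        history = rhs[max(0, i + 1 - h): i + 1]
        new_symbol = "{}|<{}>".format(lhs, "-".join(history))
        rules.append((parent, (rhs[i], new_symbol)))
        parent = new_symbol
    rules.append((parent, (rhs[-2], rhs[-1])))
    return rules


for rule_lhs, rule_rhs in binarize("NP", ["DT", "JJ", "NN", "NN"], h=1):
    print(rule_lhs, "->", " ".join(rule_rhs))
```

With h=1 the rule NP -> DT JJ NN NN becomes NP -> DT NP|<DT>, NP|<DT> -> JJ NP|<JJ>, and NP|<JJ> -> NN NN. A larger h keeps more sibling context in each intermediate symbol, at the cost of a larger grammar.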
To speed up the CKY parsing, we use a (simple) cythonized version that is almost a numpy implementation.
We also provide a numpy cky. To use this, add the flag --use-numpy.
The speed difference is very significant: the cython CKY parses a 20-word sentence in ~1 second, the numpy CKY takes ~90 seconds.
Parsing the entire development set in parallel with 8 processes (for my quad-core machine) takes around 15 minutes.
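To make the cost of CKY concrete, here is a minimal pure-Python sketch of Viterbi CKY over a toy CNF grammar. It is not the Cython or numpy implementation from this repository. The names `cky` and `build_tree`, the toy lexicon and the toy rules are all assumptions for illustration. The nested loops over span length, start position, split point and rules are what make the algorithm cubic in sentence length, and they are the part that benefits from compilation.

```python
import math


def cky(words, lexicon, binary_rules):
    """Viterbi CKY for a PCFG in Chomsky normal form.

    chart[i][j] maps a nonterminal to (best log-probability, backpointer)
    for the span words[i:j].
    """
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Fill in the part-of-speech tags for single words.
    for i, word in enumerate(words):
        for tag, p in lexicon.get(word, {}).items():
            chart[i][i + 1][tag] = (math.log(p), word)
    # Combine smaller spans into larger ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart[i][j]
            for k in range(i + 1, j):
                for (a, b, c), p in binary_rules.items():
                    if b in chart[i][k] and c in chart[k][j]:
                        score = math.log(p) + chart[i][k][b][0] + chart[k][j][c][0]
                        if a not in cell or score > cell[a][0]:
                            cell[a] = (score, (k, b, c))
    return chart


def build_tree(chart, i, j, label):
    """Follow backpointers to recover the best tree as a bracketed string."""
    _, back = chart[i][j][label]
    if isinstance(back, str):
        return "({} {})".format(label, back)
    k, left, right = back
    return "({} {} {})".format(
        label, build_tree(chart, i, k, left), build_tree(chart, k, j, right)
    )


# Hypothetical toy grammar.
lexicon = {
    "the": {"DT": 1.0},
    "dog": {"NN": 0.5},
    "cat": {"NN": 0.5},
    "saw": {"VBD": 1.0},
}
binary_rules = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VBD", "NP"): 1.0,
}

words = "the dog saw the cat".split()
chart = cky(words, lexicon, binary_rules)
tree = build_tree(chart, 0, len(words), "S")
print(tree)
```

A real grammar has many thousands of binary rules, so iterating over every rule at every split point, as this sketch does, is far too slow in practice; indexing rules by their children is the usual first optimization.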
The Markovized CNF gives these results on the test set:
=== Summary ===
-- All --
Number of sentence = 2416
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 2416
Bracketing Recall = 78.20
Bracketing Precision = 76.53
Bracketing FMeasure = 77.36
Complete match = 14.82
Average crossing = 2.44
No crossing = 41.68
2 or less crossing = 65.02
Tagging accuracy = 95.49
This is what we should expect based on the numbers that Klein and Manning (2003) report on the unrefined and Markovized grammars.
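As a quick sanity check on the summary above, the bracketing F-measure is the harmonic mean of bracketing precision and recall. The helper `f_measure` below is a hypothetical name used only for this check; it is not part of this repository or of PYEVALB.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the summary above.
f1 = f_measure(76.53, 78.20)
print(round(f1, 2))  # 77.36
```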
python>=3.6.0
numpy
cython
nltk
tqdm
flake8
PYEVALB
Working to make collaboration easier.
Under construction
Run flake8 from the project directory for style guide enforcement. See the documentation for more info on flake8.
- More elaborate unking scheme (e.g. UNK-DASH-ity)
- Write a setup.py to make collaboration easier. See this example.
- Add tests. See this example.
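One possible shape for the more elaborate unking scheme mentioned in the list above is to map each unknown word to a signature built from surface features. This is only a sketch of the idea under assumed feature names: the function `unk_signature`, the feature tags CAP, NUM and DASH, and the suffix list are all hypothetical and not implemented in this repository.

```python
def unk_signature(word, suffixes=("ity", "ing", "ed", "ly", "s")):
    """Map an unknown word to a signature such as UNK-DASH-ity."""
    parts = ["UNK"]
    if word[:1].isupper():
        parts.append("CAP")
    if any(ch.isdigit() for ch in word):
        parts.append("NUM")
    if "-" in word:
        parts.append("DASH")
    lower = word.lower()
    # Record at most one suffix, and only for words long enough to have a stem.
    for suffix in suffixes:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            parts.append(suffix)
            break
    return "-".join(parts)


for example in ["non-linearity", "Running", "42", "blorp"]:
    print(example, "->", unk_signature(example))
```

Rare training words would be replaced by their signatures before estimating the lexicon, so that unseen test words with the same signature share those probabilities.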