Chart parser

A simple chart parser in Python, with the CKY algorithm implemented in Cython for speed.

Inspired by the recent success of benepar and the minimal-span parser, I wanted to revisit chart parsing with CKY on binarized trees. There are no neural networks here, however: just rule probabilities estimated by maximum likelihood.
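
As a rough illustration of what "estimated by maximum likelihood" means here, each rule probability is the relative frequency of that rule among all expansions of its left-hand side. The sketch below is not the code in pcfg.py; the function name `estimate_rule_probs` and the toy rule list are assumptions made for the example.

```python
from collections import Counter


def estimate_rule_probs(rules):
    """Relative-frequency (maximum likelihood) estimate of P(rhs | lhs).

    `rules` is an iterable of (lhs, rhs) pairs read off binarized trees,
    where rhs is a tuple of symbols.
    """
    rule_counts = Counter(rules)
    # Total number of times each left-hand side was expanded.
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Toy rule occurrences (hypothetical, not taken from the treebank).
observed = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("VP", ("VBD", "NP")),
]
probs = estimate_rule_probs(observed)
print(probs[("NP", ("DT", "NN"))])  # 2 of the 3 NP expansions
```

The probabilities for each left-hand side sum to one by construction, which is what makes the result a proper PCFG.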

Setup

To obtain the data and grammar, use:

cd grammar
./get-grammar.sh

To compile cky, use:

cd cky
python setup.py build_ext --inplace

Usage

To run a quick test, use:

python main.py

To parse a sentence, use:

python main.py --sent "The horse raced past the barn fell."

To parse the development set and compute the F-score, use:

python main.py --infile grammar/dev/dev.tokens --outfile grammar/dev/dev.pred.trees --goldfile grammar/data/dev.trees

This can be done in parallel by adding --parallel.

To parse 5 sentences from the development set, show the predicted and gold parses, and compute their individual F-scores, use:

python main.py --treefile grammar/data/dev.trees -n 5

The default grammar used is the vanilla CNF. To use the (v1h1) Markovized grammar, use:

python main.py --grammar grammar/train/train.markov.grammar
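
For readers unfamiliar with the notation: in Klein and Manning's terms, v1 means no parent annotation and h1 means that the intermediate symbols introduced by binarization remember only one previously generated sibling. A minimal sketch of that horizontal Markovization is given below. It is an illustration only: the function `binarize` and the `@A|B` naming of intermediate symbols are assumptions, and the grammar produced by get-grammar.sh may label them differently.

```python
def binarize(lhs, rhs, h=1):
    """Right-binarize lhs -> rhs, keeping at most `h` previous siblings
    in the name of each intermediate symbol (horizontal Markovization)."""
    rhs = list(rhs)
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    parent = lhs
    for i in range(len(rhs) - 2):
        # Siblings generated so far are rhs[:i + 1]; remember the last h.
        context = rhs[max(0, i + 1 - h):i + 1]
        new_sym = "@{}|{}".format(lhs, "_".join(context))
        rules.append((parent, (rhs[i], new_sym)))
        parent = new_sym
    # The last intermediate symbol rewrites to the final two children.
    rules.append((parent, (rhs[-2], rhs[-1])))
    return rules


for rule in binarize("NP", ["DT", "JJ", "NN", "NN"], h=1):
    print(rule)
```

With h=1, the rules NP -> DT JJ NN NN and NP -> DT JJ JJ NN share the intermediate symbol `@NP|JJ`, so counts are pooled across similar long rules. A large h keeps the full sibling history and recovers plain binarization.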

Speed

To speed up CKY parsing, we use a simple Cython version that stays very close to a numpy implementation. A pure numpy CKY is also provided; to use it, add the flag --use-numpy. The speed difference is large: the Cython CKY parses a 20-word sentence in ~1 second, while the numpy CKY takes ~90 seconds.
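
To make the cost concrete, here is a small pure-Python sketch of Viterbi CKY for a grammar in Chomsky normal form. It is not the Cython or numpy code in this repository: the function name `cky`, the dictionary-based chart, and the toy grammar are all assumptions for illustration. The nested loops over span length, start position, split point, and rules are the part that dominates running time.

```python
import math


def cky(words, lexicon, binary_rules, root="S"):
    """Viterbi CKY for a PCFG in Chomsky normal form.

    lexicon:      dict mapping word -> list of (tag, prob)
    binary_rules: list of (parent, left, right, prob)
    Returns (log_prob, tree_string) for the best parse, or None.
    """
    n = len(words)
    # chart[i][j] maps a label to (log_prob, backpointer) for words[i:j].
    chart = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag, prob in lexicon.get(word, []):
            chart[i][i + 1][tag] = (math.log(prob), word)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart[i][j]
            for k in range(i + 1, j):  # split point
                left_cell, right_cell = chart[i][k], chart[k][j]
                for parent, left, right, prob in binary_rules:
                    if left in left_cell and right in right_cell:
                        score = (math.log(prob) + left_cell[left][0]
                                 + right_cell[right][0])
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (k, left, right))
    if root not in chart[0][n]:
        return None

    def build(i, j, label):
        # Follow backpointers to reconstruct the bracketed tree.
        _, back = chart[i][j][label]
        if isinstance(back, str):
            return "({} {})".format(label, back)
        k, left, right = back
        return "({} {} {})".format(
            label, build(i, k, left), build(k, j, right))

    return chart[0][n][root][0], build(0, n, root)


# Toy grammar (hypothetical, not the treebank grammar).
lexicon = {
    "the": [("DT", 1.0)],
    "horse": [("NN", 0.5)],
    "barn": [("NN", 0.5)],
    "saw": [("VBD", 1.0)],
}
binary_rules = [
    ("S", "NP", "VP", 1.0),
    ("NP", "DT", "NN", 1.0),
    ("VP", "VBD", "NP", 1.0),
]
result = cky("the horse saw the barn".split(), lexicon, binary_rules)
print(result)
```

The innermost loop runs once per rule for every (span, split) pair, so with a treebank-sized grammar the work per sentence is large even for short inputs, which is why moving this loop out of interpreted Python pays off.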

Parsing the entire development set in parallel with 8 processes (on my quad-core machine) takes around 15 minutes.

Accuracy

The Markovized CNF gives these results on the test set:

=== Summary ===

-- All --
Number of sentence        =   2416
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   2416
Bracketing Recall         =  78.20
Bracketing Precision      =  76.53
Bracketing FMeasure       =  77.36
Complete match            =  14.82
Average crossing          =   2.44
No crossing               =  41.68
2 or less crossing        =  65.02
Tagging accuracy          =  95.49

This is what we should expect based on the numbers that Klein and Manning (2003) report on the unrefined and Markovized grammars.
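
As a sanity check on the summary above, the F-measure is the harmonic mean of bracketing precision and recall. The sketch below recomputes it from the reported numbers and also shows how the three scores could be derived from labeled spans. The helper names are assumptions, and using sets of (label, start, end) triples is a simplification of what EVALB does, so this is not a replacement for the actual evaluation.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def bracket_scores(pred, gold):
    """Precision, recall and F1 over labeled spans (label, start, end)."""
    pred, gold = set(pred), set(gold)
    matched = len(pred & gold)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Recompute the reported F-measure from the reported precision and recall.
print(round(f_measure(76.53, 78.20), 2))  # 77.36

# Toy spans (hypothetical): the predicted VP starts one word too late.
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 1, 5), ("NP", 3, 5)]
print(bracket_scores(pred, gold))
```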

Requirements

python>=3.6.0
numpy
cython
nltk
tqdm
flake8
PYEVALB

Contributing

Working to make collaboration easier.

Run tests

Under construction

Run linters

Run flake8 from the project directory for style guide enforcement. See the documentation for more info on flake8.

TODO

More elaborate unking scheme (e.g. UNK-DASH-ity)
Write a setup.py to make collaboration easier. See this example.
Add tests. See this example.
