Shift-Reduce Constituency Parsing with Contextual Representations

This repository contains an implementation of the Recurrent Neural Network Grammars (Dyer et al. 2016) and In-Order (Liu and Zhang 2017) constituency parsers, both integrated with a BERT (Devlin et al. 2019) sentence encoder.

Our best current models for the In-Order system with BERT obtain 96.0 F1 on the English PTB test set and 92.0 F1 on the Chinese Treebank v5.1 test set. More results (including out-of-domain transfer) are described in Cross-Domain Generalization of Neural Constituency Parsers (Fried*, Kitaev*, and Klein, 2019).

Modifications to the RNNG and In-Order parsers implemented here include:

  • BERT integration for the discriminative models
  • Beam search decoding for the discriminative and generative models
  • Minimum-risk training using policy gradient for the discriminative models
  • Dynamic oracle training for the RNNG discriminative model

This repo contains a compilation of code from many people and multiple research projects; please see Credits and Citations below for details.

Note: for most practical parsing purposes, we'd recommend using the BERT-equipped Chart parser of Kitaev, Cao, and Klein, 2019, which is easier to set up, faster, has a smaller model size, and achieves performance nearly as strong as this parser.

Contents

  1. Available Models
  2. Prerequisites
  3. Build Instructions
  4. Usage
  5. Training
  6. Citations
  7. Credits

Available Models

  • english (English): 95.65 F1 / 57.28 EM on the PTB test set (with beam size 10). 1.2GB. This is the best-scoring model on the development set out of the five runs of In-Order+BERT English models described in our ACL 2019 paper.
  • english-wwm (English): 95.99 F1 / 57.99 EM on the PTB test set (with beam size 10). 1.2GB. Identical to english above, but uses a BERT model pre-trained with whole-word masking.
  • chinese (Chinese): 91.96 F1 / 44.54 EM on the CTB v5.1 test set (with beam size 10). 370MB. This is the best-scoring model on the development set out of the five runs of In-Order+BERT Chinese models.

Prerequisites

Required (used in the build and usage instructions below):

  • CMake and a C++ compiler
  • Boost
  • The latest development version of Eigen
  • The TensorFlow C library (e.g. libtensorflow 1.12.0), used to run the BERT encoder
  • Python 3, for the parsing and evaluation scripts

Optional:

  • MKL allows faster processing for the non-BERT CPU operations

We use a submodule for the BERT code. To get this when cloning our repository:

git clone --recursive https://github.com/dpfried/rnng-bert.git

If you didn't clone with --recursive, you'll need to manually get the bert submodule. Run the following inside the top-level rnng-bert directory:

git submodule update --init --recursive

Build Instructions

Assuming the latest development version of Eigen is stored at /opt/tools/eigen-dev and you've extracted or built the TensorFlow C library (see Prerequisites above) at $HOME/lib/libtensorflow-gpu-linux-x86_64-1.12.0:

mkdir build
cd build
cmake -DEIGEN3_INCLUDE_DIR=/opt/tools/eigen-dev -DTENSORFLOW_ROOT=$HOME/lib/libtensorflow-gpu-linux-x86_64-1.12.0 -DCMAKE_BUILD_TYPE=Release ..
make -j2

If your Boost installation is in a non-standard location, also pass -DBOOST_ROOT=/path/to/boost to cmake.

Optional: to compile with MKL, assuming MKL is stored at /opt/intel/mkl, instead run:

mkdir build
cd build
cmake -DEIGEN3_INCLUDE_DIR=/opt/tools/eigen-dev -DTENSORFLOW_ROOT=$HOME/lib/libtensorflow-gpu-linux-x86_64-1.12.0 -DMKL=TRUE -DMKL_ROOT=/opt/intel/mkl -DCMAKE_BUILD_TYPE=Release ..
make -j2

Optional: If training the parser, you'll also need the evalb executable. Build it by running make inside the EVALB directory.

Usage

First, download and extract one of the models. For the rest of this section, we'll assume that you've downloaded and extracted english-wwm into the bert_models folder.
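
If you'd like to script the download, here is a minimal Python sketch (not part of this repo). The URL below is a placeholder for the actual english-wwm link in the Available Models table, and the sketch assumes the model is distributed as a .tar.gz archive:

import tarfile
import urllib.request

MODEL_URL = "https://example.com/english-wwm.tar.gz"  # placeholder -- use the real link from the table above

# Download the archive and extract it into the bert_models directory.
archive_path, _ = urllib.request.urlretrieve(MODEL_URL, "english-wwm.tar.gz")
with tarfile.open(archive_path) as archive:
    archive.extractall(path="bert_models")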

Parsing Raw Text

Input should be a file with one sentence per line, consisting of space-separated tokens. For best performance, you should use tokenization in the style of the Penn Treebank.

For English, you can tokenize sentences using a tokenizer such as nltk.word_tokenize. Here is an example tokenized sentence (taken from the Penn Treebank):

No , it was n't Black Monday .

(note that "wasn't" is split into "was" and "n't").
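
If you're starting from raw (untokenized) English text, here is a minimal Python sketch using NLTK (not part of this repo; it assumes nltk is installed with its punkt data downloaded, and uses the example filenames raw.txt and tokens.txt):

import nltk

# Tokenize raw English text, one sentence per line, into space-separated
# PTB-style tokens (e.g. "wasn't" becomes "was" and "n't").
with open("raw.txt") as fin, open("tokens.txt", "w") as fout:
    for line in fin:
        tokens = nltk.word_tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")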

For Chinese, use a tokenizer such as jieba or one of the unofficial Chinese tokenizers for spaCy. Here is an example tokenized sentence (from the Penn Chinese Treebank, using its tokenization; automatic tokenizers may return different tokenizations):

“ 中 美 合作 高 科技 项目 签字 仪式 ” 今天 在 上海 举行 。
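
A similar sketch for Chinese using jieba (again not part of this repo; it assumes jieba is installed, and jieba's segmentation will not exactly match CTB tokenization):

import jieba

# Segment raw Chinese text, one sentence per line, into space-separated tokens.
with open("raw_zh.txt") as fin, open("tokens_zh.txt", "w") as fout:
    for line in fin:
        tokens = [t for t in jieba.lcut(line.strip()) if t.strip()]
        fout.write(" ".join(tokens) + "\n")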

Once the token input file is constructed, run python3 scripts/bert_parse.py $model_dir $token_file --beam_size 10 to parse the token file and print parse trees to standard out. For example,

python3 scripts/bert_parse.py bert_models/english-wwm tokens.txt --beam_size 10

Note: These parsers are not currently designed to predict part-of-speech (POS) tags, and will output trees that use XX for all POS tags.
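
If you want to inspect the predicted trees programmatically, one option is to load them with NLTK's Tree class; a sketch (not part of this repo), assuming the parser's standard output was redirected to a file named parses.txt:

from nltk import Tree

# Each output line is one bracketed parse; every POS tag is the placeholder XX.
with open("parses.txt") as f:
    for line in f:
        tree = Tree.fromstring(line)
        print(tree.label(), tree.leaves())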

Comparing Against a Treebank

Given a treebank file in $treebank_file with one tree per line (for example, as produced by our PTB data generation code), you can parse the tokens in these sentences and compute parse evaluation scores using the following:

python3 scripts/dump_tokens.py $treebank_file > treebank.tokens
python3 scripts/bert_parse.py bert_models/english-wwm treebank.tokens --beam_size 10 > treebank.parsed
python3 scripts/retag.py $treebank_file treebank.parsed > treebank.parsed.retagged
EVALB/evalb -p EVALB/COLLINS_ch.prm $treebank_file treebank.parsed.retagged

(COLLINS_ch.prm is a parameter file that can be used to evaluate on either the English or Chinese Penn Treebanks; it is modified from COLLINS.prm to drop the PU punctuation tag which is found in the CTB corpora.)

Training

Instructions should (hopefully) be coming soon. In the meantime, please contact dfried AT cs DOT berkeley DOT edu if you'd like help training the models that use BERT. The oracle generation scripts we used are in corpora/*/build_corpus.sh and the training scripts are in train_*.sh, but there are currently some missing dependencies and hard-coded paths.

Citations

This repo contains code from a number of papers.

For the RNNG or In-Order models, please cite the original papers:

@inproceedings{dyer-rnng:16,
  author = {Chris Dyer and Adhiguna Kuncoro and Miguel Ballesteros and Noah A. Smith},
  title = {Recurrent Neural Network Grammars},
  booktitle = {Proc. of NAACL},
  year = {2016},
} 

@article{TACL1199,
  author = {Liu, Jiangming and Zhang, Yue},
  title = {In-Order Transition-based Constituent Parsing},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {5},
  year = {2017},
  issn = {2307-387X},
  pages = {413--424}
}

For beam search in the generative model:

@InProceedings{Stern-Fried-Klein:2017:GenerativeParserInference,
  title     = {Effective Inference for Generative Neural Parsing},
  author    = {Mitchell Stern and Daniel Fried and Dan Klein},
  booktitle = {Proceedings of EMNLP},
  month     = {September},
  year      = {2017},
}

For policy gradient or dynamic oracle training:

@InProceedings{Fried-Klein:2018:PolicyGradientParsing,
  title     = {Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing},
  author    = {Daniel Fried and Dan Klein},
  booktitle = {Proceedings of ACL},
  month     = {July},
  year      = {2018},
}

For the BERT integration:

@inproceedings{devlin-etal-2019-bert,
  title = {{BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author = {Devlin, Jacob  and
    Chang, Ming-Wei  and
    Lee, Kenton  and
    Toutanova, Kristina},
  booktitle = {Proceedings of NAACL},
  month = {June},
  year = {2019},
}

@InProceedings{Fried-Kitaev-Klein:2019:ParserGeneralization,
  title     = {Cross-Domain Generalization of Neural Constituency Parsers},
  author    = {Daniel Fried and Nikita Kitaev and Dan Klein},
  booktitle = {Proceedings of ACL},
  month     = {July},
  year      = {2019},
}

Credits

The code in this repo (and parts of this readme) is derived from the RNNG parser by Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah Smith, incorporating the In-Order transition system of Jiangming Liu and Yue Zhang. Additional modifications (beam search, abstraction of the parser state and ensembling, BERT integration, the RNNG dynamic oracle, and min-risk policy gradient training) were made by Daniel Fried, Mitchell Stern, and Nikita Kitaev.
