
Parsing as pretraining


Tested on Ubuntu 18.04 (64-bit)

  1. Create a virtualenv: virtualenv -p python3.6 parsing-as-pretraining

  2. Activate the virtualenv: source parsing-as-pretraining/bin/activate

  3. Install the required dependencies: pip install -r requirements.txt

Preliminaries: Linearizing constituent and dependency trees

To learn how to transform a constituent or a dependency tree into a sequence of labels, please check the documentation of the tree2labels and dep2labels repositories.

In what follows, we assume the linearized datasets are stored in PTB-linearized/ and EN_EWT-linearized/. Each folder contains three files: train.tsv, dev.tsv, and test.tsv
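For reference, the linearized files can be read with a few lines of Python. This is a hedged sketch that assumes the usual sequence-labeling layout (one token per line, tab-separated columns with the token first and the label last, a blank line between sentences); the NT=... labels below are dummies, so double-check against the actual output of tree2labels/dep2labels:

```python
# Minimal sketch of a reader for the linearized .tsv files (assumed layout:
# one token per line, tab-separated columns, blank line between sentences).

def read_linearized_tsv(text):
    """Split a linearized .tsv string into sentences of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # first column is the token, last column is the linearized label
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences

# dummy labels, for illustration only
sample = "The\tNT=2\ndog\tNT=1\nbarks\tNT=0\n\nIt\tNT=1\nran\tNT=0\n"
print(read_linearized_tsv(sample))
```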

Training a model

Training an NCRFpp model


cd NCRFpp
python --config $PATH_CONFIG_FILE

The folders NCRFpp/const_confs/ and NCRFpp/dep_confs/ show some examples of configuration files.

Parameters used to train different types of models:

  • contextualize: [True|False] Whether to further contextualize the word vectors through the NCRFpp BiLSTMs
  • use_elmo: [True|False] Run ELMo to compute the word vectors, instead of using precomputed or random representations
  • fine_tune_emb: [True|False] Whether to fine-tune the pretrained encoder during training
  • use_char: [True|False] Whether to use the character LSTMs supported by NCRFpp (always False in our work)
  • use_features: [True|False] Whether to use features other than words that are present in the linearized dataset (always False in our work)
  • word_emb_dim: Size of the word embeddings; used when training random representations
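For illustration only, a hypothetical configuration fragment combining the parameters above, written in NCRFpp's key=value style; the key names and paths here are assumptions, so refer to the real files in NCRFpp/const_confs/ and NCRFpp/dep_confs/:

```
### hypothetical excerpt of an NCRFpp config file
train_dir=PTB-linearized/train.tsv
dev_dir=PTB-linearized/dev.tsv
test_dir=PTB-linearized/test.tsv
contextualize=True
use_elmo=False
fine_tune_emb=True
use_char=False
use_features=False
word_emb_dim=300
```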

Specific parameters to train constituent models:


Specific parameters to train dependency models:


Training a BERT model

Adapt the paths accordingly and run ./

The script assumes that the dataset is inside a folder, split into three files named train.tsv, dev.tsv, and test.tsv.
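For example, the expected layout for the constituency data would be:

```
PTB-linearized/
├── train.tsv
├── dev.tsv
└── test.tsv
```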

General parameter description:

  • --bert_model: The base model used during training, e.g. bert-base-cased
  • --task_name: Specifies the format of the input files (always sl_tsv)
  • --model_dir: Path where the model will be saved
  • --max_seq_length: Expected maximum sequence length
  • --output_dir: Path where the outputs generated by the model will be stored
  • --do_train: Activate to train the model
  • --do_eval: Activate to evaluate the model on the dev set
  • --do_test: Activate to run the model on the test set
  • --do_lower_case: Lowercases the input; use it with an uncased model (e.g. bert-base-uncased)

Additional options:

  • --parsing_paradigm: [dependencies|constituency]
  • --not_finetune: Keeps the BERT weights frozen during training
  • --use_bilstms: Adds BiLSTMs before the output layer

Additional specific options for dependency parsers:

  • --path_gold_conll: Path to the gold conll file to evaluate
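For intuition, this is roughly what the dependency evaluation computes from the predicted output and the gold CoNLL-U file. It is a simplified sketch, not the evaluation script shipped with the repository:

```python
# Hedged sketch of UAS/LAS between a predicted and a gold CoNLL-U file.
# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
# Comment lines start with '#'; multiword-token ("1-2") and empty-node
# ("1.1") lines are skipped, since they carry no syntactic arc.

def read_conllu_arcs(text):
    """Return a list of (head, deprel) pairs, one per word, for all sentences."""
    arcs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:   # multiword/empty tokens
            continue
        arcs.append((cols[6], cols[7]))        # HEAD and DEPREL columns
    return arcs

def uas_las(pred_text, gold_text):
    """Unlabeled and labeled attachment scores over aligned token lists."""
    pred, gold = read_conllu_arcs(pred_text), read_conllu_arcs(gold_text)
    assert len(pred) == len(gold), "token mismatch between files"
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / len(gold)
    las = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    return uas, las
```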

Additional specific options for constituent parsers:

  • --evalb_param: [True|False] Whether to use the COLLINS.prm parameter file when computing the bracketing F1 score
  • --path_gold_parenthesized: Path to the gold parenthesized tree to evaluate
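As background, the bracketing F1 score compares the labeled spans of the predicted and gold parenthesized trees. The sketch below is a simplified stand-in for EVALB; it ignores the COLLINS.prm options (e.g. punctuation deletion and label equivalences) and skips only preterminals:

```python
from collections import Counter

def labeled_spans(tree):
    """Extract (label, start, end) spans from a parenthesized tree string."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans, stack = [], []
    pos = idx = 0                              # word position / token index
    while idx < len(tokens):
        if tokens[idx] == "(":
            # "(TAG word)" is a preterminal: skip it, counting the word
            if tokens[idx + 2] != "(" and tokens[idx + 3] == ")":
                pos += 1
                idx += 4
            else:                              # open a phrasal constituent
                stack.append((tokens[idx + 1], pos))
                idx += 2
        elif tokens[idx] == ")":               # close the innermost constituent
            label, start = stack.pop()
            spans.append((label, start, pos))
            idx += 1
        else:                                  # stray bare word
            pos += 1
            idx += 1
    return spans

def bracketing_f1(pred_tree, gold_tree):
    """Labeled bracketing F1 between two parenthesized trees."""
    pred, gold = Counter(labeled_spans(pred_tree)), Counter(labeled_spans(gold_tree))
    matched = sum((pred & gold).values())      # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```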


Example:

python \
--data_dir ./data/datasets/PTB-linearized/ \
--bert_model bert-base-cased \
--task_name sl_tsv \
--model_dir /tmp/bert.finetune.linear.model \
--output_dir /tmp/dev.bert.finetune.linear.output \
--path_gold_parenthesized ../data/datasets/PTB/dev.trees \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 15 --max_seq_length 250 [--use_bilstms] [--not_finetune]

Running an NCRFpp model

Adapt the paths and run the scripts ./ (constituents) and ./ (dependencies)

Running a BERT model

Adapt the paths and model names accordingly and execute ./

Example for constituency parsing:

	python \
	--data_dir ./data/datasets/PTB-linearized/ \
	--bert_model bert-base-cased \
	--task_name sl_tsv \
	--model_dir ./data/bert_models_const/bert.const.finetune.linear \
	--output_dir ./data/outputs_const/test.bert.finetune.linear.output \
	--evalb_param True \
	--max_seq_length 250 \
	--path_gold_parenthesized ./data/datasets/PTB/test.trees \
	--parsing_paradigm constituency --do_test [--use_bilstms]

Example for dependency parsing:

	python \
	--data_dir ./data/datasets/EN_EWT-pred-linearized \
	--bert_model bert-base-cased \
	--task_name sl_tsv \
	--model_dir  ./data/bert_models_dep/bert.dep.finetune.linear \
	--output_dir ./data/outputs_dep/test.bert.finetune.linear.output \
	--path_gold_conll ./data/datasets/en-ewt/en_ewt-ud-test.conllu \
	--max_seq_length 350 \
	--parsing_paradigm dependencies --do_test [--use_bilstms]

Note: Remember to also use the --do_lower_case option if you trained an uncased model.

Getting some extra stats

Use python [--predicted] [--gold] to plot some charts for the constituent experiments:

  • --predicted: Path to the directory containing the files (each of them in PTB, parenthesized format) for which to plot the charts
  • --gold: Path to the file containing the gold trees in PTB (parenthesized) format

Use python [--predicted] [--gold] to show some charts referring to the dependency experiments:

  • --predicted: Path to the directory containing the files (each of them a predicted conllu file) for which to plot the charts
  • --gold: Path to the corresponding gold conllu file.


Reference

Vilares, D., Strzyz, M., Søgaard, A. and Gómez-Rodríguez, C. Parsing as Pretraining. In AAAI 2020.

