Benjamin Newman, Kai-Siang Ang, Julia Gong, and John Hewitt
How should we evaluate the syntactic understanding of our NLP models? We build on a body of work that uses minimal pairs for evaluation and argue that we should be evaluating models' likely behavior and their systematicity. We adapt minimal pair evaluation to address these goals and find that models preferentially conjugate verbs they deem likely.
Our publication is available here.
To get started, set up your environment:
conda env create -f environment.yaml
conda activate refining-tse
Create the full verb list:
python scripts/create_combined_verb_list.py
Run a small experiment:
python run.py configs/bert-base-cased/ML_simple_agrmt/mw.yaml
You should see the results in the results folder, at results/bert-base-cased/ML_simple_agrmt/mw/custom_bert-base-cased_main/metrics/main.txt.
To reproduce all the experiments from the paper:
python run_all.py configs --whitelist bert-large-cased,bert-large-uncased,gpt2-xl,roberta-large
And to generate the main plots and table:
python src/plots.py
Recommended exploration:
- Begin in run.py, follow it to src/experiments.py, then explore the src/datasets, src/models, and src/metrics folders.
- Examine some of the .yaml files in configs.
- Try running an experiment and inspecting the results folder.
Your own extensions:
- Extend TransformersModel or MPEModel to your own model implementation in src/models/<YOUR_MODEL>.py.
- Extend MetricComputer to your own metric implementation in src/metrics/<YOUR_METRIC>.py.
- Adapt your dataset to the CustomDataset APIs and the data folder. If needed, extend MPEDataset to your own dataset type in src/datasets/<YOUR_DATASET>.py and data/<YOUR_DATASET_TYPE>.
.
├── configs
│ ├── <MODEL>/<TEMPLATE>/<METRIC>.yaml # complete config specification for an experiment
│ └── ⋮
├── data
│ ├── ML # Marvin and Linzen (2018) S/V agreement templates
│ │ ├── <SOME_TEMPLATE>.jsonl
│ │ └── ⋮
│ ├── blimp # BLiMP (Warstadt et al., 2020) S/V agreement templates
│ │ ├── <SOME_TEMPLATE>.jsonl
│ │ └── ⋮
│ ├── verbs # the verb lemmas used for experiments
│ │ └── combined_verb_list.csv
│ ├── <SOME_OTHER_DATASET> # folder with data to support your own templates
│ └── ⋮
├── src # source code
│ ├── datasets
│ │ ├── __init__.py
│ │ ├── datasets.py # routes user to appropriate dataset
│ │ ├── MPE_dataset.py # base dataset class
│ │ ├── custom_dataset.py # implementation for Marvin and Linzen (2018) and BLiMP dataset
│ │ ├── <SOME_DATASET>.py # your own specific dataset implementation
│ │ └── ⋮
│ ├── metrics
│ │ ├── __init__.py
│ │ ├── metric_computer.py # base metric class
│ │ ├── metrics.py # routes user to appropriate metrics
│ │ ├── ML_metric.py # implements TSE
│ │ ├── main_metric.py # implements MW and EW
│ │ ├── <SOME_METRIC_COMPUTER>.py # your own specific metric implementation
│ │ └── ⋮
│ ├── models
│ │ ├── __init__.py
│ │ ├── models.py # routes user to appropriate model
│ │ ├── MPE_model.py # base model class
│ │ ├── transformers_model.py # base class to interface with Hugging Face Transformers
│ │ ├── utils.py # utilities used by models
│ │ ├── <SOME_MODEL>.py # specific model implementation
│ │ └── ⋮
│ ├── constants.py
│ ├── experiments.py # core logic of experiment
│ ├── logger.py # manages record-keeping
│ ├── plots.py # generates summary plots and tables
├── results
│ ├── <MODEL>/<TEMPLATE>/<METRIC>/<NAME> # experiment description path
│ │ ├── figs # folder of figures (currently unused)
│ │ ├── metrics # folder of per-metric outputs
│ │ ├── npzs # folder of .npz files
│ │ ├── pickles # folder of .pkl files
│ │ ├── config.yaml # config file for this experiment
│ │ └── log.txt # human-readable experiment log
│ └── ⋮
├── plots # directory with plots and latex table
├── .gitignore
├── environment.yml
├── LICENSE
├── README.md
├── run.py # entry point for running an experiment; does set-up
└── run_all.py # entry point for launching multiple experiments
CustomDataset parses .jsonl template files from the specified template directory. It expects the datasets to be in the following form:
- Each line of a template file is a dict with a sentence_good field, a sentence_bad field, and a label field.
- The sentence_{good/bad} fields contain the strings of the correct and incorrect sentences. These should differ by exactly one verb.
- The label field is -2 if the correct verb is plural, and -1 if the correct verb is singular.
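For example, a single line of a template file could look like the following (the sentence pair here is illustrative, not taken from the released templates):

{"sentence_good": "The authors laugh.", "sentence_bad": "The authors laughs.", "label": -2}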
The verbs in data/verbs/combined_verb_list.csv are derived from COCA and the Penn Treebank. The rest of the verbs can be found here and can be appended to the csv file by running the command in the Getting Started section: python scripts/create_combined_verb_list.py. (Note that this only needs to be run once.)
Custom models should extend MPEModel and implement two methods: word_to_index, which maps vocabulary item strings to indices, and predict, which returns logits given a batch of left and right contexts on either side of the verb of interest.
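A minimal sketch of such a model (the import path, constructor, and tensor shapes below are assumptions, not the repository's exact interface):

# Sketch only: import path, constructor, and shapes are assumed.
import torch
from src.models.MPE_model import MPEModel  # assumed import path

class UniformModel(MPEModel):
    """Toy model that assigns the same logit to every vocabulary item."""

    def __init__(self, vocab):
        self.vocab_to_index = {word: i for i, word in enumerate(vocab)}

    def word_to_index(self, word):
        # Map a vocabulary item string to its index.
        return self.vocab_to_index[word]

    def predict(self, left_contexts, right_contexts):
        # Return one row of logits per example; a real model would condition on
        # the left and right contexts surrounding the verb of interest.
        batch_size = len(left_contexts)
        return torch.zeros(batch_size, len(self.vocab_to_index))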
To add a custom model that is in the Hugging Face Transformers library, extend the TransformersModel class. (See src/models/roberta_model.py for an example.) This class automatically creates the word_to_index method from a transformers.PreTrainedTokenizer.
Finally, to use the custom model, add it to the model_name_to_MPEModel_class dictionary in src/models/models.py.
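For example (the existing dictionary entries are omitted; only the pattern of adding a key is meant literally):

# In src/models/models.py: register the new class under the name used in the config.
model_name_to_MPEModel_class = {
    # ... existing entries ...
    "uniform": UniformModel,  # key matching model.name in your config file
}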
Custom metrics should extend MetricComputer and implement a _compute method that generates a score for each example, given a batch of logits, labels indicating whether plural or singular conjugations are preferred, and the model's word_to_index.
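As an illustration (the exact _compute signature, and whether word_to_index is passed as a callable or a dict, are assumptions based on the description above), a toy metric might look like this:

# Sketch only: the _compute signature is assumed from the description above.
from src.metrics.metric_computer import MetricComputer  # assumed import path

class IsAreMetric(MetricComputer):
    """Toy metric: checks whether the model ranks 'are' above 'is' exactly when
    the label says the plural conjugation is correct."""

    def _compute(self, logits, labels, word_to_index):
        singular_index = word_to_index("is")
        plural_index = word_to_index("are")
        scores = []
        for example_logits, label in zip(logits, labels):
            prefers_plural = bool(example_logits[plural_index] > example_logits[singular_index])
            plural_is_correct = (label == -2)  # label convention from the dataset format above
            scores.append(float(prefers_plural == plural_is_correct))
        return scores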
To access the custom metric, add it to the metric_name_to_MetricComputer_class dictionary in src/metrics/metrics.py.
Below is an annotated config explaining how to run your own experiments:
dataset:
  add_trailing_period: true # adds trailing period to right context. Should be true.
  capitalize_first_word: true # capitalizes first word of minimal pair. Should be true for cased models.
  max_examples_per_template: null # controls number of examples held in the dataset (useful for debugging)
  name: custom # dataset name. Should always be custom
  template_dir: data/ML # folder where templates are stored
  template_files:
  - simple_agrmt_all.jsonl # names of templates to evaluate
experiment:
  max_examples: null # controls number of examples to send to the model. Functions the same as dataset.max_examples_per_template
logger:
  path: results/bert-base-cased/ML_simple_agrmt/mw
  print_metrics_to_general_log: false # prints individual metrics logging info to the general log in addition to metric-specific log
  print_to_stdout: false # prints log to STDOUT as well as log file
metrics:
  ML: # ML: TSE
    example_aggregator: mean # averages scores over templates
    use_custom_dataset: true # should always be true
  main: # main: EW or MW (see use_equal_verb_voting)
    cutoffs_bot: # bottom percentile cutoffs to investigate
    - 0.5
    - 0.1
    - 0.01
    - 0.001
    - 0.0001
    - 1.0e-05
    - 1.0e-06
    cutoffs_top: # top percentile cutoffs to investigate
    - 0.1
    - 0.2
    - 0.3
    - 0.4
    - 0.5
    - 0.6
    - 0.7
    - 0.8
    - 0.9
    - 0.95
    - 0.97
    - 1.0
    example_aggregator: mean
    lemma_inflections_path: data/verbs/combined_verb_list.csv
    use_custom_dataset: true
    use_equal_verb_voting: false # true = use EW, false = use MW
model:
  name: bert-base-cased # name of the model. If using Hugging Face Transformers, should match the pretrained model name
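To run an experiment with a config like this, save it following the configs/<MODEL>/<TEMPLATE>/<METRIC>.yaml layout and pass its path to run.py, as in the Getting Started section:

python run.py configs/<MODEL>/<TEMPLATE>/<METRIC>.yaml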
- Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
- Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
@inproceedings{newman2021refining,
title={Refining Targeted Syntactic Evaluation of Language Models},
author={Newman, Benjamin and Ang, Kai-Siang and Gong, Julia and Hewitt, John},
booktitle={Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year={2021}
}