Data Generation

This project includes utilities and scripts for automatic dataset generation. It is used in the following papers:

  • Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S. F., & Bowman, S. R. (2019). Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs. arXiv preprint arXiv:1909.02597.
  • Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. F., & Bowman, S. R. (2019). BLiMP: A Benchmark of Linguistic Minimal Pairs for English. arXiv preprint arXiv:1912.00582.
  • Jeretic, P., Warstadt, A., Bhooshan, S., & Williams, A. (2020). Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition. arXiv preprint arXiv:2004.03066.


To run a sample data generation script, navigate to the data_generation directory and run the following command:

python -m generation_projects.blimp.adjunct_island

If all dependencies are present in your workspace, this will generate the adjunct_island dataset in BLiMP. Generation takes about a minute to begin, after which progress can be monitored in outputs/benchmark/adjunct_island.jsonl.
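Once generation starts, each line of the output .jsonl file is one JSON object describing a minimal pair. A minimal sketch of reading the file back, assuming the sentence_good/sentence_bad field names of the published BLiMP format (other generation projects may use different fields):

```python
import json

def read_pairs(path):
    """Read one JSON object per non-empty line from a .jsonl file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                pairs.append(json.loads(line))
    return pairs
```

Because generation writes line by line, this can also be used to inspect a partially generated dataset.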


Branches

  • With the exception of BLiMP, all project-specific code is kept in separate branches. BLiMP appears in master as a helpful exemplar.
  • Major branches include:
    • blimp
    • imppres
    • npi
    • msgs
    • structure_dependence
    • qp

Project Structure

  • The project contains the following packages:
    • generation_projects: scripts for generating data, organized into subdirectories by research project.
    • mturk_qc: code for carrying out Amazon mechanical turk quality control.
    • outputs: generated data, organized into subdirectories by research project.
    • results: experiment results files.
    • results_processing: scripts for analyzing results and producing figures.
    • utils: shared code for generation projects, including utilities for processing the vocabulary, generating constituents, manipulating generated strings, etc.
  • It also contains a vocabulary file and documentation of the vocabulary:
    • vocabulary.csv: the vocab file.
    • the vocab documentation
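Since the project stores the vocabulary as a numpy structured array (see utils.data_type below), the .csv can be sketched as loading like this. This is an illustrative loader, not the project's own: the real code defines an explicit dtype in utils.data_type, whereas here numpy infers column types from the file.

```python
import numpy as np

def load_vocab(path="vocabulary.csv"):
    """Load a vocab CSV into a numpy structured array.

    Column names come from the CSV header (names=True); dtype=None
    lets numpy infer each column's type. The real project instead
    uses the explicit dtype defined in utils.data_type.
    """
    return np.genfromtxt(path, dtype=None, delimiter=",",
                         names=True, encoding="utf-8")
```

Rows are then addressable as records and columns by name, e.g. vocab["expression"] or vocab["animate"].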


Vocabulary

  • The vocabulary file is vocabulary.csv.
  • Each row in the .csv is a lexical item, and each column is a feature encoding grammatical information about that item. Detailed documentation of the columns can be found in the vocab documentation.
  • Selectional restrictions in the arg_1, arg_2, and arg_3 columns are written in disjunctive normal form: ; marks disjunction, ^ marks conjunction, and each atom has the form column=value.
  • Example 1: arg_1 of the lexical item breaking is animate=1. This means any noun appearing as the subject of breaking must have value 1 in the column animate.
  • Example 2: arg_1 of the lexical item buys is institution=1^sg=1;animate=1^sg=1. This means any noun appearing as the subject of buys must meet one of the following conditions:
    1. have value 1 in column institution and value 1 in column sg, or
    2. have value 1 in column animate and value 1 in column sg.
  • Disclaimer: As this project is under active development, data generated with different versions of the vocabulary may differ slightly.
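The DNF notation above can be sketched as a small checker. This is a minimal illustration, not the project's implementation; the helper name satisfies is hypothetical, and entry stands for any mapping from column names to values:

```python
def satisfies(entry, restriction):
    """Check a vocab entry against a DNF restriction string.

    ";" separates disjuncts, "^" separates conjuncts within a
    disjunct, and each atom has the form column=value. The entry
    matches if all atoms of at least one disjunct hold.
    """
    for disjunct in restriction.split(";"):
        if all(str(entry.get(col)) == val
               for col, val in (atom.split("=") for atom in disjunct.split("^"))):
            return True
    return False
```

For instance, with the arg_1 restriction from Example 2, an entry with institution=1 and sg=1 satisfies the first disjunct even if it is not animate.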


Utils

  • The utils package contains the shared code for the various generation projects.
    • utils.conjugate includes functions which conjugate verbs and add selecting auxiliaries/modals
    • utils.constituent_building includes functions which "do syntax". The following are especially useful:
      • verb_args_from_verb: gather all arguments of a verb into a dictionary
      • V_to_VP_mutate: given a verb, modify the expression to contain the string corresponding to a full VP
      • N_to_DP_mutate: given a noun, gather all arguments and a determiner, and modify the expression to contain the string corresponding to a full DP
    • utils.data_generator defines general classes that are instantiated by a particular generation project. The classes contain metadata fields, the main loop for generating a dataset (generate_paradigm), and functions for logging and exception handling.
    • utils.data_type contains the data_type necessary for the numpy structured array data structure used in the vocabulary.
      • if the columns of the vocabulary file are ever modified, this file must be modified to match.
    • utils.string_utils contains functions for cleaning up generated strings (removing extra whitespace, capitalization, etc.)
    • utils.vocab_sets contains constants for accessing commonly used sets of vocab entries. Building these constants takes about a minute at the beginning of running a generation script, but this speeds up generation of large datasets.
    • utils.vocab_table contains functions for creating and accessing the vocabulary table
      • get_all gathers all vocab items with a given restriction
      • get_all_conjunctive gathers all vocab items with the given restrictions
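On a numpy structured array like the vocabulary table, get_all-style lookups amount to boolean-mask filtering. A rough sketch under that assumption (the signatures here are illustrative, not the project's actual API):

```python
import numpy as np

def get_all(vocab, field, value):
    """Return all rows whose `field` column equals `value`."""
    return vocab[vocab[field] == value]

def get_all_conjunctive(vocab, conditions):
    """Return all rows satisfying every (field, value) pair at once."""
    mask = np.ones(len(vocab), dtype=bool)
    for field, value in conditions:
        mask &= vocab[field] == value
    return vocab[mask]
```

Precomputing such filtered subsets once, as utils.vocab_sets does, avoids re-scanning the table on every sentence generated.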


Citation

If you use the data generation project in your work, please cite the BLiMP paper:

@article{warstadt2019blimp,
  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:1912.00582},
  year={2019}
}
