Data Generation

This project includes utilities and scripts for automatic dataset generation. It is used in the following papers:

Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S. F., & Bowman, S. R. (2019). Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs. arXiv preprint arXiv:1909.02597.
Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. F., & Bowman, S. R. (2019). BLiMP: A Benchmark of Linguistic Minimal Pairs for English. arXiv preprint arXiv:1912.00582.
Jeretic, P., Warstadt, A., Bhooshan, S., & Williams, A. (2020). Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition. arXiv preprint arXiv:2004.03066.

Usage

To run a sample data generation script, navigate to the data_generation directory and run the following command:

python -m generation_projects.blimp.adjunct_island

If all dependencies are present in your workspace, this will generate the adjunct_island dataset in BLiMP. Generation takes about a minute to begin; after that, progress can be watched in outputs/benchmark/adjunct_island.jsonl.
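
As an alternative to watching the file by hand, progress can be checked from Python by counting the complete records written so far. This is a minimal sketch and not part of the repository: the helper name count_examples is hypothetical, and it assumes only that the output is a JSON Lines file with one example per line.

```python
import json
from pathlib import Path


def count_examples(path):
    """Count complete JSON records in a JSON Lines file (hypothetical helper)."""
    path = Path(path)
    if not path.exists():
        return 0  # generation has not started writing yet
    count = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # A partially written last line is expected while generation runs.
                continue
            count += 1
    return count


if __name__ == "__main__":
    print(count_examples("outputs/benchmark/adjunct_island.jsonl"))
```

Run it from the data_generation directory so that the relative path resolves.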

Branches

With the exception of BLiMP, all project-specific code is kept in separate branches. BLiMP appears in master as a helpful exemplar.
Major branches include:
- blimp
- imppres
- npi
- msgs
- structure_dependence
- qp

Project Structure

The project contains the following packages:
- generation_projects: scripts for generating data, organized into subdirectories by research project.
- mturk_qc: code for carrying out Amazon Mechanical Turk quality control.
- outputs: generated data, organized into subdirectories by research project.
- results: experiment results files.
- results_processing: scripts for analyzing results and producing figures.
- utils: shared code for generation projects. Includes utilities for processing the vocabulary, generating constituents, manipulating generated strings, etc.
It also contains a vocabulary file and documentation of the vocabulary:
- vocabulary.csv: the vocab file.
- vocab_documentation.md: the vocab documentation.

Vocabulary

The vocabulary file is vocabulary.csv.
Each row in the .csv is a lexical item. Each column is a feature encoding grammatical information about the lexical item. Detailed documentation of the columns can be found in vocab_documentation.md.

The following notation is used to define selectional restrictions in the arg_1, arg_2, and arg_3 columns:

<DISJUNCTION> := <CONJUNCTION> | <CONJUNCTION>;<DISJUNCTION>
<CONJUNCTION> := <CONDITION> | <CONDITION>^<CONJUNCTION>
<CONDITION> := <COLUMN>=<VALUE>

In other words, the entire restriction is written in disjunctive normal form where ; is used for disjunction and ^ is used for conjunction.
Example 1: arg_1 of lexical item breaking is animate=1. This means any noun appearing as the subject of breaking must have value 1 in the column animate.
Example 2: arg_1 of lexical item buys is institution=1^sg=1;animate=1^sg=1. This means any noun appearing as the subject of buys must meet one of the following conditions:
1. have value 1 in column institution and value 1 in column sg, or
2. have value 1 in column animate and value 1 in column sg.
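
The notation above can be read mechanically: split on ; to get the disjuncts, then on ^ to get each conjunct's conditions. The sketch below illustrates this under stated assumptions and is not the repository's own parser: the function names parse_restriction and satisfies are hypothetical, the toy lexical items and their feature values are made up, values are compared as strings, and the restriction is assumed to be non-empty and well-formed.

```python
def parse_restriction(restriction):
    """Parse 'a=1^b=1;c=1' into a list of conjunctions (dicts of column -> value)."""
    disjuncts = []
    for conjunction in restriction.split(";"):
        conditions = {}
        for condition in conjunction.split("^"):
            column, value = condition.split("=", 1)
            conditions[column.strip()] = value.strip()
        disjuncts.append(conditions)
    return disjuncts


def satisfies(item, restriction):
    """Return True if `item` (a dict of column -> value) meets the restriction."""
    return any(
        all(str(item.get(column)) == value for column, value in conj.items())
        for conj in parse_restriction(restriction)
    )


# Toy lexical items; the feature values are invented for illustration.
teacher = {"expression": "teacher", "animate": "1", "institution": "0", "sg": "1"}
teachers = {"expression": "teachers", "animate": "1", "institution": "0", "sg": "0"}

buys_arg_1 = "institution=1^sg=1;animate=1^sg=1"
print(parse_restriction(buys_arg_1))
print(satisfies(teacher, buys_arg_1))   # True: animate=1 and sg=1
print(satisfies(teachers, buys_arg_1))  # False: sg=0 fails both conjunctions
```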
Disclaimer: As this project is under active development, data generated with different versions of the vocabulary may differ slightly.

Utils

The utils package contains the shared code for the various generation projects.
- utils.conjugate includes functions which conjugate verbs and add selecting auxiliaries/modals
- utils.constituent_building includes functions which "do syntax". The following are especially useful:
  - verb_args_from_verb: gather all arguments of a verb into a dictionary
  - V_to_VP_mutate: given a verb, modify the expression to contain the string corresponding to a full VP
  - N_to_DP_mutate: given a noun, gather all arguments and a determiner, and modify the expression to contain the string corresponding to a full DP
- utils.data_generator defines general classes that are instantiated by a particular generation project. The classes contain metadata fields, the main loop for generating a dataset (generate_paradigm), and functions for logging and exception handling
- utils.data_type defines the data_type required by the NumPy structured array that holds the vocabulary.
  - If the columns of the vocabulary file are ever modified, this file must be modified to match.
- utils.string_utils contains functions for cleaning up generated strings (removing extra whitespace, fixing capitalization, etc.)
- utils.vocab_sets contains constants for accessing commonly used sets of vocab entries. Building these constants takes about a minute at the beginning of running a generation script, but this speeds up generation of large datasets.
- utils.vocab_table contains functions for creating and accessing the vocabulary table
  - get_all gathers all vocab items with a given restriction
  - get_all_conjunctive gathers all vocab items with the given restrictions
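
To make the vocabulary table idea concrete, here is a small illustration of a NumPy structured array filtered by one or several column restrictions. It is a sketch under assumptions, not the code in utils.vocab_table: the toy dtype, the toy rows, and the helper names select and select_conjunctive are all hypothetical, and this README does not confirm that get_all and get_all_conjunctive take these arguments or work this way.

```python
import numpy as np

# Toy dtype with a handful of string-valued columns. The real dtype is defined
# in utils.data_type and has to mirror the columns of vocabulary.csv.
toy_dtype = np.dtype(
    [("expression", "U20"), ("animate", "U1"), ("institution", "U1"), ("sg", "U1")]
)

# Invented rows for illustration only.
toy_vocab = np.array(
    [
        ("teacher", "1", "0", "1"),
        ("teachers", "1", "0", "0"),
        ("school", "0", "1", "1"),
    ],
    dtype=toy_dtype,
)


def select(table, column, value):
    """Return all rows whose `column` equals `value` (hypothetical helper)."""
    return table[table[column] == value]


def select_conjunctive(table, conditions):
    """Return all rows that meet every (column, value) pair in `conditions`."""
    mask = np.ones(len(table), dtype=bool)
    for column, value in conditions:
        mask &= (table[column] == value)
    return table[mask]


print(select(toy_vocab, "animate", "1")["expression"].tolist())
print(select_conjunctive(toy_vocab, [("animate", "1"), ("sg", "1")])["expression"].tolist())
```

Because each field is addressed by name, adding or renaming a column in the CSV without updating the dtype would break lookups like table[column], which is why the two must be kept in sync.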

Citation

If you use the data generation project in your work, please cite the BLiMP paper:

@article{warstadt2019blimp,
  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1912.00582},
  year={2019}
}
