Learning Data Augmentation Schedules for NLP

This code is made available for both reproducibility purposes and to allow for further research on the topic of data augmentation for NLP

The PBA part of the code is based on https://github.com/arcelien/pba and some of the rest comes from other publicly available repositories (see specific files for details)

Introduction

This repository contains code for the work "Learning Data Augmentation Schedules for NLP" in TensorFlow and Python.

Getting Started

Install requirements

The code was tested on Python 3.6.10.

pip install -r requirements.txt

Download SST-2 and MNLI datasets

Download the SST-2 and MNLI datasets from GLUE and paste the downloaded files in datasets/glue_data/SST-2 and datasets/glue_data/MNLI respectively.

Download BERT base uncased model

Download the BERT base uncase model from BERT and paste the files bert_config.json, bert_model.ckpt.data-00000-of-00001, bert_model.ckpt.index, bert_model.ckpt.meta and vocab.txt in datasets/pretrained_models/bert_base.

Download precomputed augmentated data

Download the pre-generated augmentated data for contextual augmentation using the link https://drive.google.com/file/d/1XYnp2MsqTZvm0nl8UydthYneoP0YJtGv/view?usp=sharing.

Reproduce Results

Scripts to reproduce results with augmenation reported in the paper are located in scripts/eval_glue_*.sh. And results without augmentation are located in scripts/no_aug_glue_*.sh. One argument, the dataset name (choices are sst2 or mnli_mm), is required for all of the scripts. Hyperparamaters are also located inside each script file. The hyperparameter --cpu should be modified to fit the number of available cpus.

For example, to reproduce the SST-2 results with augmentation and dataset size N=3000:

bash scripts/eval_glue_3000.sh sst2

To reproduce the MNLI results with augmentation and N=1500:

bash scripts/eval_glue_1500.sh mnli_mm

To reproduce the results without augmentation, for example on SST-2 with dataset size N=9000:

bash scripts/no_aug_glue_FULL.sh sst2

Run PBA Search

Run PBA search with the file scripts/search_*.sh. One argument, the dataset name, is required. Choices are sst2 or mnli_mm. The star replaces the dataset size (choices are 1500, 2000, 3000, 4000, FULL). For example to run a search for N=2000 on MNLI:

CUDA_VISIBLE_DEVICES=0 bash scripts/search_2000.sh mnli_mm

The resulting schedules used in search can be retrieved from the Ray result directory, and the log files can be converted into policy schedules with the main() function in pba/utils.py. The name of the experiment (name of the folder that contains the search results in pba/results) needs to be manually given at the beginning of the main() using the variable exp_name. Similarly, the variable task_name is used to indicate the name of the dataset.

Run PBA Search and Evaluation on New Dataset

Details will follow

Citation

If you use this code in your research, please cite:

@inproceedings{chopard2021learning,
  title     = {Learning Data Augmentation Schedules for Natural Language Processing},
  author    = {Daphn{\'e} Chopard and
               Matthias S. Treder and
               Irena Spasi{\'c}},
  maintitle = {EMNLP Workshop},
  booktitle = {Proceedings of the Second Workshop on Insights from Negative Results in NLP},
  year      = {2021}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

autoaugment

autoaugment

datasets

datasets

pba

pba

results

results

schedules

schedules

scripts

scripts

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Learning Data Augmentation Schedules for NLP

Table of Contents

Introduction

Getting Started

Install requirements

Download SST-2 and MNLI datasets

Download BERT base uncased model

Download precomputed augmentated data

Reproduce Results

Run PBA Search

Run PBA Search and Evaluation on New Dataset

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
autoaugment		autoaugment
datasets		datasets
pba		pba
results		results
schedules		schedules
scripts		scripts
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

License

chopardda/LDAS-NLP

Folders and files

Latest commit

History

Repository files navigation

Learning Data Augmentation Schedules for NLP

Table of Contents

Introduction

Getting Started

Install requirements

Download SST-2 and MNLI datasets

Download BERT base uncased model

Download precomputed augmentated data

Reproduce Results

Run PBA Search

Run PBA Search and Evaluation on New Dataset

Citation

About

Resources

License

Stars

Watchers

Forks

Languages