Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning

This repository contains code for our EMNLP 2021 paper: Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. Li Zhou, Kevin Small, Yong Zhang, Sandeep Atluri.

Table of Contents

(SC)^2QA Dataset
D-S-DRIL Model
How to Cite

(SC)^2QA Dataset

We provide code and scripts to construct the public version of the (SC)^2QA dataset from Common Crawl's news data stream. Compared with the internal version of (SC)^2QA used in our paper, this public version is larger, including 50,441 question-article pairs and 65,332 {question, article, summary, length constraint} 4-tuples. We also provide an even larger question-article dataset with 529,039 pairs.

For convenience, we also provide the constructed datasets, which you can load directly with Hugging Face's load_dataset API.

#!pip install datasets
from datasets import load_dataset
sc2qa_dataset = load_dataset("sc2qa/sc2qa_commoncrawl")

This will load {Question, Article, Summary, Length Constraint} 4-tuples. However, if you only want question-article pairs (i.e., the output of Step 1 below), you can do

#!pip install datasets
from datasets import load_dataset
sc2q_dataset = load_dataset("sc2qa/sc2q_commoncrawl")

We also provide an even larger question-article pairs dataset (without summaries), which includes articles from an expanded domain list.

#!pip install datasets
from datasets import load_dataset
sc2q_dataset_large = load_dataset("sc2qa/sc2q_commoncrawl_large")
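
If you are unsure which splits and columns a loaded dataset exposes, printing the returned DatasetDict is a safe first step; the short sketch below (using the large variant loaded above) makes no assumptions about split or column names.

# Inspect whichever dataset you loaded; splits and columns are printed
# rather than assumed.
print(sc2q_dataset_large)  # shows splits, column names, and sizes
first_split = next(iter(sc2q_dataset_large.values()))
print(first_split[0])  # first record of the first split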

You can skip the remainder of this section if you use the load_dataset API above.

The following steps construct the dataset from scratch. Appendix A of our paper describes each step in detail.

Install dependencies

pip3 install -r requirements.txt

Step 1 Collect Question-Article Pairs

cd CommonCrawl_Question_Mining
bash collect_qa.sh

This script will download WARC files from 2019/01 to 2021/09 from Common Crawl's news data stream, and then filter the news articles based on a set of rules we defined in CommonCrawl_Question_Mining/collect_qa_step2.py; a rough sketch of the idea follows.
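
As an illustration of this kind of rule-based filtering, the sketch below keeps only titles that read as stand-alone questions. The heuristics shown are hypothetical stand-ins; the actual rules live in CommonCrawl_Question_Mining/collect_qa_step2.py.

# Hypothetical stand-in for the title filter in collect_qa_step2.py.
WH_WORDS = ("what", "why", "how", "when", "where", "who", "which")

def looks_like_self_contained_question(title: str) -> bool:
    words = title.strip().split()
    if not words or not words[-1].endswith("?"):
        return False
    # Require an interrogative opener and a minimum length so the
    # question is likely to stand on its own without the article.
    return words[0].lower() in WH_WORDS and len(words) >= 5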

Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples as Training and Validation Set

bash collect_qasl.sh

This script will call BART, PEGASUS, and CTRLSum models to generate length-constrained summaries for the articles from Step 1, and then use our pre-trained question-answering model to filter out summaries that are likely incorrect answers.
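
For reference, a single length-constrained summary could be generated with the transformers library as sketched below; the facebook/bart-large-cnn checkpoint and the decoding settings are illustrative stand-ins for the configuration in collect_qasl.sh.

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize(article: str, min_len: int, max_len: int) -> str:
    inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
    # min_length/max_length impose the length constraint on the summary.
    ids = model.generate(inputs["input_ids"], num_beams=4,
                         min_length=min_len, max_length=max_len)
    return tokenizer.decode(ids[0], skip_special_tokens=True)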

In total, we have 65,332 {Question, Article, Summary, Length Constraint} 4-tuples. We use the first 57,332 4-tuples as the training set and the last 8,000 as the validation set.

Step 3 Collect Articles as Test Set (Optional)

bash collect_test_set_articles.sh

This script will randomly sample 10,000 news articles from 2021/09 in Common Crawl's news data stream. There are no ground-truth questions for these articles.

D-S-DRIL Model

Install dependencies

pip3 install -r requirements.txt

Train an Answer Generation Model using DRIL

cd D-S-DRIL
bash scripts/train_ag.sh

This trains an answer generation model using the DRIL method proposed in the paper. The model samples summaries (answers) during training and calculates gradients based on the question reconstruction loss.
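
Conceptually, one training step proceeds roughly as in the sketch below. This is a REINFORCE-style simplification for illustration only: the paper's contribution is a differentiable reward rather than a score-function estimator, and all tensor and field names here are hypothetical.

import torch

def dril_style_step(ag_model, qg_model, batch, optimizer):
    optimizer.zero_grad()
    # 1. Sample a summary (candidate answer) from the answer generation model.
    sampled = ag_model.generate(batch["article_ids"], do_sample=True, max_length=128)
    # 2. Reward: negative question reconstruction loss under the QG model.
    with torch.no_grad():
        reward = -qg_model(input_ids=sampled, labels=batch["question_ids"]).loss
    # 3. Re-score the sampled summary under the AG model to obtain a
    #    differentiable negative log-likelihood (ag_out.loss).
    ag_out = ag_model(input_ids=batch["article_ids"], labels=sampled)
    # 4. Policy-gradient surrogate: minimizing reward * NLL raises the
    #    likelihood of high-reward samples; in practice a baseline is
    #    subtracted from the reward to reduce variance.
    (reward * ag_out.loss).backward()
    optimizer.step()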

We use Amazon EC2 p3dn.24xlarge GPU instances for training; each has 8 GPUs with 32 GB of memory apiece. Training takes about 8 hours. If you encounter GPU out-of-memory issues, consider reducing the batch size.

During training, model checkpoints are saved to ag_model_output/checkpoint-*. Each checkpoint folder has a trainer_state.json file recording the current best checkpoint. At the end of training, the best model is saved to ag_model_output/. However, you may encounter GPU out-of-memory issues when saving the best model. If so, you can manually copy the best checkpoint from ag_model_output/checkpoint-* to ag_model_output/ and then copy the model configuration file from scripts/model_config/config.json to ag_model_output/config.json, as sketched below.
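
A minimal sketch of that manual fallback (the checkpoint name below is hypothetical; use the best checkpoint reported in trainer_state.json):

import shutil

best = "ag_model_output/checkpoint-12345"  # hypothetical: your best checkpoint
shutil.copytree(best, "ag_model_output", dirs_exist_ok=True)
shutil.copy("scripts/model_config/config.json", "ag_model_output/config.json")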

Train a Question Generation Model

bash scripts/train_qg.sh

This trains a question generation model that generates questions from summaries of articles.

Inference

bash scripts/inference.sh [validation|test]

This generates question-answer pairs for the articles in the validation set or the test set.

How to Cite

If you find this repository useful, please cite the following paper.

@inproceedings{zhou-etal-2021-generating,
    title = "Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning",
    author = "Zhou, Li and Small, Kevin and Zhang, Yong and Atluri, Sandeep",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2021",
    pages = "5103--5135",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.416",
}
