Unsupervised Question Answering via Cloze Translation
Code, data, and models supporting the experiments in the ACL 2019 paper: Unsupervised Question Answering by Cloze Translation.

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, without using the SQuAD training data at all, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.


This repository provides code to run pre-trained models that generate synthetic question answering data. We also make available a very large synthetic training dataset for extractive question answering.

NOTE: The data is available for download now. The code and pre-trained models are coming soon.

Dataset Downloads

We make available a dataset of roughly 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.

The data can be downloaded here. It is in the SQuAD v1 format and contains the following folds:

| Fold | # Paragraphs | # QA pairs |
| --- | --- | --- |
| unsupervised_qa_train.json | 782,556 | 3,915,498 |
| unsupervised_qa_dev.json | 1,000 | 4,795 |
| unsupervised_qa_test.json | 1,000 | 4,804 |
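
As a quick sanity check after downloading, the files can be read with the standard `json` module. The sketch below assumes the usual SQuAD v1 layout (`data` → `paragraphs` → `qas` → `answers` with `text` and `answer_start`). The helper names `load_squad_v1`, `count_squad_v1`, and `answers_align` are illustrative and not part of this repository, and the `toy` record is made up to show the shape of the data.

```python
import json
from typing import Dict, Tuple


def load_squad_v1(path: str) -> Dict:
    """Load a SQuAD v1-style JSON file (hypothetical helper, not repo code)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_squad_v1(dataset: Dict) -> Tuple[int, int]:
    """Return (number of paragraphs, number of QA pairs)."""
    n_paragraphs = 0
    n_qa_pairs = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            n_paragraphs += 1
            n_qa_pairs += len(paragraph["qas"])
    return n_paragraphs, n_qa_pairs


def answers_align(dataset: Dict) -> bool:
    """Check that every answer text matches the context at answer_start."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        return False
    return True


# Made-up record in the assumed SQuAD v1 layout, for illustration only.
_context = "Paris is the capital of France."
toy = {
    "version": "1.1",
    "data": [{
        "title": "toy",
        "paragraphs": [{
            "context": _context,
            "qas": [
                {"id": "q1", "question": "What is the capital of France?",
                 "answers": [{"text": "Paris",
                              "answer_start": _context.index("Paris")}]},
                {"id": "q2", "question": "Paris is the capital of what?",
                 "answers": [{"text": "France",
                              "answer_start": _context.index("France")}]},
            ],
        }],
    }],
}

print(count_squad_v1(toy), answers_align(toy))
```

Running `count_squad_v1(load_squad_v1("unsupervised_qa_dev.json"))` on a downloaded fold should reproduce the paragraph and QA pair counts listed in the table, provided the file follows the layout assumed here.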

Using this training data to fine-tune BERT-Large for reading comprehension, you should be able to achieve over 50.0 F1 on the SQuAD v1.1 development set.
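
For reference, the F1 figures quoted here are token-overlap scores between predicted and gold answer strings. The following is a minimal sketch of that metric in the style of the SQuAD v1.1 evaluation (lowercasing, stripping punctuation and the articles a/an/the, then computing bag-of-tokens F1 and taking the best match over the gold answers). It is an illustration rather than the official evaluation script, and the function names are chosen here for clarity.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Bag-of-tokens F1 between one prediction and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def dataset_f1(predictions: List[str], references: List[List[str]]) -> float:
    """Mean of the best per-question F1 over gold answers, scaled to 0-100."""
    scores = [
        max(token_f1(pred, gold) for gold in golds)
        for pred, golds in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)


print(token_f1("the Eiffel Tower", "Eiffel Tower"))
print(dataset_f1(["Paris", "London"], [["Paris"], ["Berlin"]]))
```

To report numbers comparable to published results, use the official SQuAD evaluation script; the sketch above is only meant to show what the F1 score measures.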

Models and Code

Pre-trained models and the code to run them are coming soon.


Please cite [1] and [2] if you find the resources in this repository useful.

Unsupervised Question Answering by Cloze Translation

See the LICENSE file for more details.

