

MultiQA is a project whose goal is to facilitate training and evaluating reading comprehension models over arbitrary sets of datasets. All datasets are provided in a single format, accompanied by an AllenNLP DatasetReader and model that enable easy training and evaluation on multiple subsets of datasets.

This repository contains the code for our paper MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension.

This work was performed at the Allen Institute for Artificial Intelligence.

This project is constantly being improved. Contributions, comments and suggestions are welcome!


Date Message
Oct 06, 2019 ComplexQuestions and ComQA added, thanks to tuvuumass!
Sep 19, 2019 DuoRC added; multiple dataset links in SQuAD2.0 format are now available.
Sep 15, 2019 DROP and WikiHop added.
Aug 31, 2019 Version 0.1.0 of the MultiQA format is out, with a json-schema for validation, and new pytests
Aug 24, 2019 New! convert multiqa format to SQuAD2.0 format + Pytorch-Transformers models support
Aug 14, 2019 BoolQ and ComplexWebQuestions data added.
Aug 12, 2019 Added support for easy training and evaluation on multiple datasets.
Aug 07, 2019 TriviaQA-Wikipedia BERT-Base model is now available; improved results will follow soon.
Aug 03, 2019 BERT-Large Models are now available!


Links to the single-format datasets are provided in two formats: MultiQA and SQuAD2.0. The SQuAD2.0 files are GZipped JSONs produced by applying convert_multiqa_to_squad_format to the MultiQA datasets. To use them with the Pytorch-Transformers code, simply unzip them (see the models Readme for an example).
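Unzipping can also be done from Python. The following is a minimal sketch using only the standard library; the helper name gunzip and the example file name are illustrative and not part of this repository.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip(src: str, dst: Optional[str] = None) -> Path:
    """Decompress a .gz file and return the path of the decompressed copy."""
    src_path = Path(src)
    if dst is None:
        if src_path.suffix != ".gz":
            raise ValueError("cannot infer output name: source does not end in .gz")
        # Default target: same path without the trailing ".gz".
        dst_path = src_path.with_suffix("")
    else:
        dst_path = Path(dst)
    # Stream the data so large dataset files are never held fully in memory.
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path
```

For example, gunzip("HotpotQA_dev.json.gz") would write HotpotQA_dev.json next to the downloaded file (the file name here is only an example).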

Dataset MultiQA format SQuAD2.0 format (GZipped)
SQuAD-1.1 train , dev train , dev
SQuAD-2.0 train , dev train , dev
NewsQA train , dev train , dev
HotpotQA train , dev train , dev
TriviaQA-unfiltered dev dev
TriviaQA-wiki train , dev train , dev
SearchQA train , dev train , dev
BoolQ train , dev train , dev
ComplexWebQuestions train , dev train , dev
DROP train , dev train , dev
WikiHop train , dev train , dev
DuoRC Paraphrase train , dev train , dev
DuoRC Self train , dev train , dev
ComplexQuestions train , dev train , dev
ComQA train , dev train , dev
Natural Questions Coming soon Coming soon

Datasets will be added weekly, so please stay tuned!

Models and Results

Trained models for download and their results are posted in this table. The BERT-Base column contains evaluation results (EM/F1) as well as a link to the trained model. The MultiQA-5Base column contains a link to the model (in the header) and evaluation results for this model, which is BERT-Base trained on 5 datasets. The pytorch-transformers columns contain .bin models trained with the added pytorch-transformers code.

Dataset BERT-Base
BERT Base uncased
AllenNLP (model)
SQuAD-1.1 80.1 / 87.5
80.2 / 87.7
81.7 / 88.8 83.3 / 90.3
NewsQA 47.5 / 62.9
47.5 / 62.3
48.3 / 64.7 50.3 / 66.0
HotpotQA 50.1 / 63.2
- 54.0 / 67.0
TriviaQA-unfiltered 59.4 / 65.2
59.0 / 64.7 60.7 / 66.5
TriviaQA-wiki 57.5 / 62.3
- -
SearchQA 58.7 / 65.2
58.8 / 65.3 60.5 / 67.3
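
For readers unfamiliar with the EM/F1 pairs reported above, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction. It assumes the usual SQuAD answer normalization (lowercasing, removing punctuation and English articles); whether the evaluation code in this repository normalizes answers in exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers agree; one empty answer is a complete miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the average of these per-question values (taking the best score over all gold answers for each question), multiplied by 100.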

multiqa commands

To train BERT on multiple datasets (with AllenNLP), and then evaluate the trained model, please use:

python train --datasets SQuAD1-1,NewsQA,SearchQA --cuda_device 0,1,2,3
python evaluate --model SQuAD1-1 --datasets SQuAD1-1,NewsQA,SearchQA --cuda_device 0

By default, the output is stored in models/dataset1_dataset2_...; to change this, set --serialization_dir.
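
The default naming can be pictured with a small sketch. The helper default_serialization_dir is hypothetical and not part of this repository; it only mirrors the pattern described above, assuming the dataset names are joined with underscores in the order given to --datasets.

```python
from pathlib import PurePosixPath


def default_serialization_dir(datasets: str, root: str = "models") -> str:
    """Build a directory name from a comma-separated dataset list.

    Hypothetical helper: assumes names are joined with "_" in the given order.
    """
    names = [name.strip() for name in datasets.split(",") if name.strip()]
    if not names:
        raise ValueError("expected at least one dataset name")
    return str(PurePosixPath(root) / "_".join(names))
```

Under these assumptions, the training command above would write to models/SQuAD1-1_NewsQA_SearchQA unless --serialization_dir is given.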

Type python for additional options.

Note: this version uses the default MultiQA-format datasets stored in S3. To use your own dataset, please see the Readme for using AllenNLP core commands.

MultiQA format to SQuAD2.0 format

If you prefer using the SQuAD2.0 format, or want to run the Pytorch-Transformers models, please use:

python --datasets --output_file data/squad_format/HotpotQA_dev.json
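
A quick way to sanity-check a converted file is to walk the standard SQuAD2.0 layout (data, paragraphs, qas, is_impossible) and count the questions. This is a sketch that assumes the converted output follows that public layout; count_squad_questions is an illustrative name, not a function from this repository.

```python
import json
from typing import Dict


def count_squad_questions(path: str) -> Dict[str, int]:
    """Count answerable and unanswerable questions in a SQuAD2.0-style JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    counts = {"answerable": 0, "unanswerable": 0}
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD1.1-style files have no "is_impossible" key; treat those as answerable.
                key = "unanswerable" if qa.get("is_impossible", False) else "answerable"
                counts[key] += 1
    return counts
```

Calling count_squad_questions("data/squad_format/HotpotQA_dev.json") would return a dictionary with the two counts.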


Setting up a virtual environment

  1. First, clone the repository:

    git clone
  2. Change your directory to where you cloned the files:

    cd MultiQA
  3. Create a virtual environment with Python 3.6 or above:

    virtualenv venv --python=python3.7 (or python3.7 -m venv venv or conda create -n multiqa python=3.7)
  4. Activate the virtual environment. You will need to activate the venv environment in each terminal in which you want to use MultiQA.

    source venv/bin/activate (or source venv/bin/activate.csh or conda activate multiqa)
  5. Install the required dependencies:

    pip install -r requirements.txt


You can test all challenges using pytest, or using pycharm tests directory (pytest-pycharm added):

pytest pytests


The AllenNLP caching infrastructure is used, so be sure to have enough disk space. You can control the cache directory using the ALLENNLP_CACHE_ROOT environment variable.
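
If you launch runs from a Python script, the cache directory can be set there too. The sketch below is a hypothetical helper (use_cache_root is not part of this repository) that points ALLENNLP_CACHE_ROOT at a directory and reports the free space on that disk. It assumes AllenNLP reads the variable when it is imported, so the helper should be called before any allennlp import.

```python
import os
import shutil
from pathlib import Path


def use_cache_root(cache_dir: str, min_free_gb: float = 0.0) -> float:
    """Point ALLENNLP_CACHE_ROOT at cache_dir and return the free space there in GB."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"only {free_gb:.1f} GB free in {path}, need at least {min_free_gb} GB"
        )
    # Assumption: must be set before `import allennlp` to take effect.
    os.environ["ALLENNLP_CACHE_ROOT"] = str(path)
    return free_gb
```

From a shell, the equivalent is to export ALLENNLP_CACHE_ROOT before running any of the commands.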

Build Dataset

This will take a dataset from its original URL and output the same dataset in the MultiQA format.

python --dataset_name HotpotQA --split train --output_file path/to/output.jsonl.gz --n_processes 10
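
To peek at the resulting file, a minimal reader sketch follows. It assumes only what the .jsonl.gz extension suggests, namely one JSON object per line inside a gzip archive; it makes no assumption about field names (see the Readme in the datasets folder for the actual format). The name read_jsonl_gz is illustrative.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def read_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, large files can be inspected lazily, for example by taking only the first record with next(read_jsonl_gz("path/to/output.jsonl.gz")).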


The first argument is the AllenNLP model, the second is the preprocessed evaluation file (path/to/output.jsonl.gz from the preprocessing step), and the third is the dataset name (used to create the official predictions format).

python --model --dataset --dataset_name SQuAD

To predict only the first N examples, use --sample_size N

To add a GPU device simply append: --cuda_device 0

By default the output will be saved at results/DATASET_NAME/... You may also change the output filename and path using --prediction_filepath path/to/my/output

Multiqa Data Format

See the Readme in the datasets folder. A json-schema for a single context in MultiQA is available here.

Training using AllenNLP

See the Readme in the models folder.

Training using Pytorch-Transformers

See the Readme in the models folder.

