
ModularQA is a QA system that answers complex multi-hop and discrete reasoning questions by decomposing them into sub-questions answerable by two sub-models: a neural factoid single-span QA model and a symbolic calculator. These sub-questions, together with the sub-models' answers, provide a natural-language explanation of the system's reasoning. The system is designed and trained within the Text Modular Networks framework, where decompositions are generated in the language of the sub-models without requiring annotated decompositions. For more details, refer to the paper.


Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models
Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal
NAACL 2021


@inproceedings{khot2021text,
  title={Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models},
  author={Tushar Khot and Daniel Khashabi and Kyle Richardson and Peter Clark and Ashish Sabharwal},
  booktitle={NAACL},
  year={2021}
}


Note that responses might be slow


QA Datasets

We used the following subsets of HotpotQA and DROP to train and evaluate our models.

Training Datasets

  • NextGen Training Data: The decomposition chains generated from these DROP+HotpotQA subsets. These chains were used to train the NextGen model.

  • Chains Scorer Training Data: The chains generated by running inference with our NextGen model, with associated labels: 1 indicates that the final answer produced by the chain is correct (F1 > 0.2), and 0 that it is incorrect.
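The F1 > 0.2 correctness threshold can be sketched with a standard SQuAD-style bag-of-tokens F1; the function names below are illustrative, not the repository's actual code:

```python
import re
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style bag-of-tokens F1 between a predicted and a gold answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def chain_label(predicted_answer: str, gold_answer: str) -> int:
    """1 if the chain's final answer clears the F1 > 0.2 threshold, else 0."""
    return int(token_f1(predicted_answer, gold_answer) > 0.2)
```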


If you want the predictions of the system, without having to run the code, we provide them here:


We also provide the trained models used in our system.

  • NextGen Model: A BART-Large model trained to produce the next sub-question given the complex question and previous question-answer pairs. Sample input-output:
    QC: When did the magazine Wallace Hester work for run? QI: (squad) What magazine did Hester work for? A: "Vanity Fair". QS:

    (squad) When did the second Vanity Fair run?
  • Chains Scorer Model: A RoBERTa-Large model trained to predict whether the final answer produced by an inference chain is correct (captured by the score for the label 1). Sample input:
   QC: How many percent of jobs were not in wholesale? QI: (squad) What percent of jobs are in wholesale? A: 12.4 QI: (math) not(12.4) A: 87.6 QS: [EOQ]
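As an illustration of the chain format above (assumed names, not the repository's actual API): a chain is serialized into the QC/QI/A/QS string, and the (math) module's not op reads as the complement of a percentage, i.e. not(12.4) = 100 − 12.4 = 87.6:

```python
# Sketch of the chain format: a complex question QC, a sequence of
# (module, sub-question, answer) steps, and a trailing QS prompt for the
# next sub-question, with [EOQ] marking a finished chain.

def serialize_chain(complex_question, steps, next_question="", done=False):
    parts = [f"QC: {complex_question}"]
    for module, sub_question, answer in steps:
        parts.append(f"QI: ({module}) {sub_question} A: {answer}")
    parts.append(f"QS: {'[EOQ]' if done else next_question}".rstrip())
    return " ".join(parts)

def math_module(expression):
    """Toy symbolic calculator covering the `not` op from the example:
    not(x) = 100 - x, the complement of a percentage."""
    op, arg = expression.rstrip(")").split("(", 1)
    if op == "not":
        return 100 - float(arg)
    raise ValueError(f"unknown op: {op}")
```

For instance, math_module("not(12.4)") returns 87.6, matching the answer in the sample chain.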


We use a fork of the HuggingFace Transformers codebase, which is based on an older version of Transformers.

Running Inference

To run inference, follow these steps:

  1. Clone the GitHub repo and check out the modularqav2 branch.
  git clone modularqa_transformers
  cd modularqa_transformers
  git checkout modularqav2
  export PYTHONPATH=src
  2. Download the HotpotQA and DROP subsets from above and unzip them here.

  3. Save (or download from above) the trained models:
    • NextGen model to nexgen_model/
    • Chain scorer to chain_scorer/
    • SQuAD 2.0 QA model to qa_model/

If the models are downloaded to a different path, change the paths in the config files below.

  4. To run inference on the HotpotQA dev set, run:
python -u -m modularqa.inference.configurable_inference \
        --input hotpot_subset/dev.json \
        --output predictions_hotpot_dev.json \
        --config modularqa_configs/hotpot_dev_config.json --reader hotpot

To evaluate on the held-out test set, change dev to test in the command above.

To evaluate the DROP set, run:

python -u -m modularqa.inference.configurable_inference \
        --input drop_subset/dev.json \
        --output predictions_drop_dev.json \
        --config modularqa_configs/drop_dev_config.json --reader drop

Similarly, replace dev with test to evaluate on the held-out test set.

NOTE: These are slow inference steps but highly parallelizable. If you want, you can directly use our predictions available above.
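One way to parallelize, assuming the subset files use the SQuAD JSON layout ({"version": ..., "data": [...]}, as the hotpot evaluation script's name suggests), is to shard the input file and run configurable_inference on each shard separately. shard_squad_file below is an illustrative helper, not part of the repo:

```python
import json
import math

def shard_squad_file(path, n_shards, prefix):
    """Split a SQuAD-format JSON file ({"version": ..., "data": [...]})
    into up to n_shards files so inference can run on each in parallel."""
    with open(path) as f:
        dataset = json.load(f)
    articles = dataset["data"]
    per_shard = math.ceil(len(articles) / n_shards)
    shard_paths = []
    for i in range(n_shards):
        chunk = articles[i * per_shard:(i + 1) * per_shard]
        if not chunk:
            break  # fewer articles than requested shards
        out_path = f"{prefix}.{i}.json"
        with open(out_path, "w") as f:
            json.dump({"version": dataset.get("version"), "data": chunk}, f)
        shard_paths.append(out_path)
    return shard_paths
```

Each shard can then be passed as --input to a separate configurable_inference process, and the per-shard prediction files merged afterwards.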

  5. Compute the metrics using the evaluation scripts released with the HotpotQA and DROP datasets. For example:
  • HotpotQA
  python -m modularqa.evals.evaluate_hotpot_squad_format \
    predictions_hotpot_dev.json hotpot_subset/dev.json
  • DROP
  python -m modularqa.evals.drop_eval \
    --gold_path drop_subset/dev.json \
    --prediction_path predictions_drop_dev.json
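For a quick sanity check before running the official scripts, an exact-match score can be computed directly, assuming predictions are stored as a mapping from question id to answer string (an assumption about the output format, not documented here):

```python
import re

def normalize(answer: str) -> str:
    """SQuAD-style normalization: lowercase, drop articles and punctuation."""
    answer = re.sub(r"\b(a|an|the)\b", " ", answer.lower())
    answer = re.sub(r"[^\w\s]", " ", answer)
    return " ".join(answer.split())

def exact_match(predictions: dict, gold: dict) -> float:
    """Fraction of questions whose predicted answer matches a gold answer
    after normalization; gold values may be a string or a list of strings."""
    hits = 0
    for qid, pred in predictions.items():
        golds = gold[qid] if isinstance(gold[qid], list) else [gold[qid]]
        hits += any(normalize(pred) == normalize(g) for g in golds)
    return hits / len(predictions) if predictions else 0.0
```

The official evaluation scripts remain the source of truth for reported numbers; this is only a rough cross-check.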

