# Open-domain Question Answering (ODQA) with Dense Passage Retrieval

This is the DPR baseline that uses a subset of Wikipedia, presented as a baseline for the EfficientQA challenge.
- Source: https://github.com/efficientqa/efficientqa.github.io/blob/master/getting_started.md

This baseline code loads a pre-trained retriever and reader to perform ODQA. You can train your own encoders (for retrieval) and reader by following the instructions in the DPR repository.
- DPR Github: https://github.com/facebookresearch/DPR

[Other useful links]
- Dense Passage Retrieval for Open-Domain Question Answering: https://arxiv.org/abs/2004.04906
- Natural Questions: https://ai.google.com/research/NaturalQuestions
- EfficientQA challenge: https://efficientqa.github.io/
- EfficientQA baselines: https://github.com/efficientqa/retrieval-based-baselines
- NQ open dataset: https://github.com/google-research-datasets/natural-questions/tree/master/nq_open

### Requirements

In [1]:
# Pin the DPR version (<1.0.0) for reproducibility
# When you train your own model, you can use the latest version

!git clone https://github.com/facebookresearch/DPR.git 
!cd DPR && git checkout -b under_v1 42161470d6f16d20c20f6ea2516941c224fc0b89
!cd DPR && pip3 install .

import sys
sys.path.append('/content/DPR')
!mkdir DPR/data

fatal: destination path 'DPR' already exists and is not an empty directory.
fatal: A branch named 'under_v1' already exists.
Processing /home/sjyang/federated_learning/DPR
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: dpr
  Building wheel for dpr (setup.py) ... done
  Created wheel for dpr: filename=dpr-0.1.0-py3-none-any.whl size=12897 sha256=c2741fceb11c77d52a5e869fbd1de6e08a2703bcb452c0848c24a09704076e02
  Stored in directory: /tmp/pip-ephem-wheel-cache-oauwrkjl/wheels/0c/3d/31/4f671b52d9268c81687ed029900a32471ece8654ee68d595d3
Successfully built dpr
Installing collected packages

In [2]:
!pip install datasets==1.6.2
!pip install gdown
!pip install jsonlines

Collecting datasets==1.6.2
  Downloading datasets-1.6.2-py3-none-any.whl (221 kB)
Collecting dill
  Using cached dill-0.3.3-py2.py3-none-any.whl (81 kB)
Collecting pyarrow>=1.0.0<4.0.0
  Downloading pyarrow-4.0.1-cp38-cp38-manylinux2014_x86_64.whl (21.9 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.11.1-py38-none-any.whl (126 kB)
Collecting tqdm<4.50.0,>=4.27
  Downloading tqdm-4.49.0-py2.py3-none-any.whl (69 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp38-cp38-manylinux2010_x86_64.whl (243 kB)
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.10-py3-none-any.whl (37 kB)
Installing collected packages: tqdm, dill, xxhash, pyarrow, multiproc

### 1. Load Datasets

Load the question answering dataset (NQ-open) and the Wikipedia documents (for the retriever).




In the final project, you will use the open-domain variant of the Natural Questions dataset.

In [1]:
###################################
import os
os.environ["CUDA_VISIBLE_DEVICES"]="YOUR_GPU_NUM"
###################################

from datasets import load_dataset

qa_dataset = load_dataset('nq_open')

Reusing dataset nq_open (/home/sjyang/.cache/huggingface/datasets/nq_open/nq_open/1.0.0/e2fefd08353e9ff28e75cf9849dd18e727be41e477bad044f2d7ec5200edb90c)


In [2]:
print("Num Train Samples: %d, Num Valid Samples: %d" 
      % (len(qa_dataset['train']), len(qa_dataset['validation'])))
qa_dataset['train'][0], qa_dataset['validation'][0]

Num Train Samples: 87925, Num Valid Samples: 1800


({'answer': ['Fernie Alpine Resort'],
  'question': 'where did they film hot tub time machine'},
 {'answer': ['1988'],
  'question': 'the last time la dodgers won the world series'})

In [3]:
questions = qa_dataset['validation']['question']
question_answers = qa_dataset['validation']['answer']
questions[0], question_answers[0]

('the last time la dodgers won the world series', ['1988'])

For open-domain question answering, you need a retrieval step to find relevant documents (or passages). To reduce disk usage, this baseline uses only a subset of Wikipedia: the documents that are relevant to the questions in the training data. \
You can find DPR performance on the full vs. subset Wikipedia, along with disk usage, at this link: https://github.com/efficientqa/efficientqa.github.io/blob/master/getting_started.md \
As you can see, performance with the full Wikipedia is much better than with the subset (EM: 41% vs. 34.8% on NQ-dev). You can use the full Wikipedia for your final project.

The following cell downloads the subset of Wikipedia (1 GB).



In [4]:
!gdown https://drive.google.com/uc?id=1_V-P6GEqBhr-7WoK_BpYeGxQNccEzy4Z
!tar xf psgs_w100_subset.tar.gz -C DPR/data && rm psgs_w100_subset.tar.gz

/bin/bash: gdown: command not found
tar: psgs_w100_subset.tar.gz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


The following cell downloads the whole Wikipedia (13 GB).

In [None]:
# This is for whole wikipedia dump
# !python3 /content/DPR/dpr/data/download_data.py \
#   --resource data.wikipedia_split --output_dir data

In [4]:
# from dense_retriever import load_passages

# Remove this block when submitting
#############################################################
import sys ###
sys.path.insert(0, 'YOUR_DPR_DIRECTORY') ###
############################################################
from dense_retriever import load_passages

db_path = 'DPR/data/psgs_w100_subset.tsv'

all_passages = load_passages(db_path)

Reading data from: DPR/data/psgs_w100_subset.tsv


In [5]:
print(len(all_passages))
print(all_passages['1'])

1642807
('Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from', 'Aaron')
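
The output above suggests that `all_passages` maps a passage id (a string) to a `(text, title)` tuple. The sketch below shows one way such a mapping could be built from a tab-separated file. The function name `read_passages_tsv`, the toy rows, and the assumed layout (a header row, then `id`, `text`, `title` columns) are illustrative assumptions, not the DPR implementation itself.

```python
import csv
import io


def read_passages_tsv(handle):
    """Return {passage_id: (text, title)} from a tab-separated stream.

    Hypothetical helper: assumes columns are id, text, title.
    """
    passages = {}
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        # Skip blank lines and the header row, whose first field is "id".
        if not row or row[0] == "id":
            continue
        passage_id, text, title = row[0], row[1], row[2]
        passages[passage_id] = (text, title)
    return passages


# Toy in-memory file standing in for the real TSV (made-up rows).
sample = io.StringIO(
    "id\ttext\ttitle\n"
    "1\tAaron is a prophet and the brother of Moses.\tAaron\n"
    "2\tMiriam is the elder sister of Aaron.\tMiriam\n"
)
toy_passages = read_passages_tsv(sample)
print(len(toy_passages))
print(toy_passages["1"])
```

Keeping ids as strings matches the `all_passages['1']` lookup used above.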


### 2. Download and Load DPR Model 

Download model checkpoints and load a question encoder.

In [6]:
######
# When submitting, setup_args_gpu in options.py needs to be modified
######
import os
import argparse
import json
from dpr.models import init_biencoder_components
from dpr.utils.data_utils import Tensorizer
from dpr.utils.model_utils import setup_for_distributed_mode, get_model_obj, load_states_from_checkpoint
from dpr.indexer.faiss_indexers import DenseIndexer, DenseFlatIndexer
from dense_retriever import DenseRetriever, validate, save_results
from dpr.options import add_encoder_params, setup_args_gpu, print_args, set_encoder_params_from_state, \
            add_tokenizer_params, add_cuda_params, add_training_params, add_reader_preprocessing_params

Download the model checkpoints (reader and retriever) and the index. We use a faiss index for faster search. \
(faiss link: https://github.com/facebookresearch/faiss)

You can also download the updated DPR weights for your project. Please see the DPR repository for details. (https://github.com/facebookresearch/DPR)

- Checkpoint: checkpoint.retriever.single-adv-hn.nq.bert-base-encoder
- Wikipedia embeddings: data.retriever_results.nq.single-adv-hn.wikipedia_passages
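
To make the role of the index concrete, here is a minimal pure-Python sketch of the search a flat (exhaustive) index performs: score every passage vector against the question vector by inner product and keep the top k. The names `inner_product` and `top_k_by_inner_product` and the 3-dimensional toy vectors are assumptions for illustration; the real index works on 768-dimensional vectors and uses faiss for speed.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(query, passage_vectors, k):
    """Return the k (passage_id, score) pairs with the highest inner product."""
    scored = (
        (pid, inner_product(query, vec)) for pid, vec in passage_vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings (hypothetical values, 3 dimensions instead of 768).
toy_vectors = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.2, 0.8, 0.1],
    "p3": [0.7, 0.6, 0.2],
}
toy_query = [1.0, 0.5, 0.0]

top2 = top_k_by_inner_product(toy_query, toy_vectors, k=2)
print([pid for pid, _ in top2])
```

An exhaustive scan like this is linear in the number of passages, which is why a dedicated library is used once the collection reaches millions of passages.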

In [9]:

########################################################################################
!python3 DPR/data/download_data.py --resource checkpoint.retriever.single.nq.bert-base-encoder --output_dir DPR/data # retrieval checkpoint

# Subset Index
!python3 DPR/data/download_data.py --resource indexes.single.nq.subset --output_dir DPR/data # DPR index

!python3 DPR/data/download_data.py --resource checkpoint.reader.nq-single-subset.hf-bert-base --output_dir DPR/data # reader checkpoint
########################################################################################

Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/retriever/single/nq/hf_bert_base.cp
File already exist  DPR/data/checkpoint/retriever/single/nq/bert-base-encoder.cp
Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/indexes/single/nq/seen_only.index.dpr
File already exist  DPR/data/indexes/single/nq/subset/index.dpr
Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/indexes/single/nq/seen_only.index_meta.dpr
File already exist  DPR/data/indexes/single/nq/subset/index_meta.dpr
Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/reader/nq-single-seen_only/hf_bert_base.cp
File already exist  DPR/data/checkpoint/reader/nq-single-subset/hf-bert-base.cp


In [7]:
def arguments():
    parser = argparse.ArgumentParser()

    # general params
    parser.add_argument('--dpr_model_file', type=str, default="DOCUMENT RETRIEVAL MODEL PATH") ###########
    parser.add_argument('--retrieval_type', type=str, default='dpr',
                        choices=['tfidf', 'dpr'])
    parser.add_argument('--output_dir', type=str, default='DPR/data')

    # retrieval specific params
    parser.add_argument('--dense_index_path', type=str, default="DPR/data/indexes/single/nq/subset")
    parser.add_argument('--match', type=str, default='string', choices=['regex', 'string'])
    parser.add_argument('--n-docs', type=int, default=100)
    parser.add_argument('--index_buffer', type=int, default=50000,
                        help="Temporal memory data buffer size (in samples) for indexer")
    parser.add_argument("--hnsw_index", action='store_true', help='If enabled, use inference time efficient HNSW index')
    parser.add_argument("--save_or_load_index", action='store_true', default=True, help='If enabled, save index')

    # reader specific params
    add_encoder_params(parser)
    add_training_params(parser)
    add_tokenizer_params(parser)
    add_reader_preprocessing_params(parser)


    parser.add_argument("--max_n_answers", default=10, type=int,
                        help="Max amount of answer spans to marginalize per single passage")
    parser.add_argument('--passages_per_question', type=int, default=2,
                        help="Total amount of positive and negative passages per question")
    parser.add_argument('--passages_per_question_predict', type=int, default=40,
                        help="Total amount of positive and negative passages per question for evaluation")
    parser.add_argument("--max_answer_length", default=10, type=int,
                        help="The maximum length of an answer that can be generated. This is needed because the start "
                             "and end predictions are not conditioned on one another.")
    parser.add_argument('--eval_top_docs', type=list, default=[10, 20, 40, 50, 80, 100],
                        help="top retrieval passages thresholds to analyze prediction results for")
    parser.add_argument('--checkpoint_file_name', type=str, default='dpr_reader')
    parser.add_argument('--prediction_results_file', type=str)


    args = parser.parse_args("")
    args.model_file = 'DPR/data/checkpoint/reader/nq-single-subset/hf-bert-base.cp'
    args.dev_batch_size = 8
    args.batch_size = 8
    args.sequence_length = 350
    args.pretrained_model_cfg = 'bert-base-uncased'
    args.encoder_model_type = 'hf_bert'
    args.do_lower_case = True
    args.prediction_results_file = 'dev_predictions.json'
    

    return args

In [8]:
args = arguments()
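
The `arguments()` function calls `parse_args("")` so that the parser ignores the notebook kernel's own command line and every option takes its default. A minimal sketch of that pattern, using a hypothetical `toy_arguments` function with two of the options from above:

```python
import argparse


def toy_arguments():
    """Hypothetical, trimmed-down version of the notebook's arguments()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-docs", type=int, default=100)
    parser.add_argument("--match", type=str, default="string",
                        choices=["regex", "string"])
    # An empty argument list means every option falls back to its default
    # instead of reading sys.argv, which belongs to the notebook kernel.
    args = parser.parse_args([])
    # Values can still be set or overridden on the namespace afterwards.
    args.batch_size = 8
    return args


toy_args = toy_arguments()
print(toy_args.n_docs, toy_args.match, toy_args.batch_size)
```

Note that argparse converts the dash in `--n-docs` to an underscore, so the value is read as `args.n_docs`.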

In [9]:
saved_state = load_states_from_checkpoint(args.dpr_model_file)  
set_encoder_params_from_state(saved_state.encoder_params, args)
tensorizer, encoder, _ = init_biencoder_components(args.encoder_model_type, args, inference_only=True)
encoder = encoder.question_model
setup_args_gpu(args)
encoder, _ = setup_for_distributed_mode(encoder, None, args.device, args.n_gpu,
                                        args.local_rank,
                                        args.fp16)
encoder.eval()

model_to_load = get_model_obj(encoder)
prefix_len = len('question_model.')
question_encoder_state = {key[prefix_len:]: value for (key, value) in saved_state.model_dict.items() if
                          key.startswith('question_model.')}
model_to_load.load_state_dict(question_encoder_state)
vector_size = model_to_load.get_out_size()

Reading saved model from DPR/data/checkpoint/retriever/single/nq/bert-base-encoder.cp
model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
Overriding args parameter value from checkpoint state. Param = do_lower_case, value = True
Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
Overriding args parameter value from checkpoint state. Param = sequence_length, value = 256
loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /home/sjyang/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob":

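The cell above keeps only the checkpoint entries whose keys start with `question_model.` and strips that prefix, so the remaining keys line up with the question encoder's own parameter names. A small sketch of the same filtering on a plain dictionary; the helper `strip_prefix` and the toy keys and values are hypothetical:

```python
def strip_prefix(state, prefix):
    """Keep entries whose key starts with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in state.items()
        if key.startswith(prefix)
    }


# Toy stand-in for a bi-encoder checkpoint (made-up keys and values).
toy_state = {
    "question_model.embeddings.weight": 1,
    "question_model.encoder.layer.0.bias": 2,
    "ctx_model.embeddings.weight": 3,
}

question_only = strip_prefix(toy_state, "question_model.")
print(sorted(question_only))
```
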
### 3. Retrieve relevant document (Retriever)

Load the retriever and the index, then retrieve relevant documents for each question. \
(As this requires a large amount of memory, you can skip the cells below and download the retrieved results in Section 4.)

In [10]:
# Load the index
# Retrieval requires a large amount of memory. You can run the cells below or just load the retrieval result file.

index_buffer_sz = args.index_buffer
index = DenseFlatIndexer(vector_size)
retriever = DenseRetriever(encoder, args.batch_size, tensorizer, index)
retriever.index.deserialize_from(args.dense_index_path)

Loading index from DPR/data/indexes/single/nq/subset
Loaded index of type <class 'faiss.swigfaiss.IndexFlat'> and size 1642800


In [15]:
questions_tensor = retriever.generate_question_vectors(questions)
top_ids_and_scores = retriever.get_top_docs(questions_tensor.numpy(), 100) # number of passages retrieved per question; we varied this value (1, 20, 40, 60, 80, 100)

Encoded queries 200
Encoded queries 400
Encoded queries 600
Encoded queries 800
Encoded queries 1000
Encoded queries 1200
Encoded queries 1400
Encoded queries 1600
Encoded queries 1800
Total encoded queries tensor torch.Size([1800, 768])
index search time: 3.196406 sec.


In [37]:
questions_doc_hits = validate(all_passages, question_answers, top_ids_and_scores,
                              1, args.match)

Matching answers in top docs...
Per question validation results len=1800
Validation results: top k documents hits [575, 750, 842, 909, 949, 980, 1002, 1020, 1037, 1054, 1069, 1085, 1096, 1100, 1107, 1115, 1120, 1124, 1132, 1138, 1148, 1152, 1162, 1165, 1170, 1173, 1178, 1181, 1183, 1186, 1194, 1199, 1204, 1207, 1208, 1215, 1217, 1220, 1226, 1227, 1230, 1231, 1235, 1238, 1239, 1243, 1246, 1249, 1250, 1250, 1251, 1251, 1252, 1254, 1254, 1257, 1258, 1258, 1260, 1260]
Validation results: top k documents hits accuracy [0.3194444444444444, 0.4166666666666667, 0.4677777777777778, 0.505, 0.5272222222222223, 0.5444444444444444, 0.5566666666666666, 0.5666666666666667, 0.5761111111111111, 0.5855555555555556, 0.5938888888888889, 0.6027777777777777, 0.6088888888888889, 0.6111111111111112, 0.615, 0.6194444444444445, 0.6222222222222222, 0.6244444444444445, 0.6288888888888889, 0.6322222222222222, 0.6377777777777778, 0.64, 0.6455555555555555, 0.6472222222222223, 0.65, 0.6516666666666666, 0.654444444444

In [15]:
retrieval_file = "retrieved.json"
save_results(all_passages,
            questions,
            question_answers, #["" for _ in questions],
            top_ids_and_scores,
            questions_doc_hits, #[[False for _ in range(args.n_docs)] for _n in questions],
            retrieval_file)

Saved results * scores  to retrieved.json


In [16]:
len(questions_doc_hits), len(questions_doc_hits[0])

(1800, 100)
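
`questions_doc_hits` holds one list of booleans per question, where position `i` says whether the passage at rank `i` contains an answer. The cumulative "top k documents hits" printed by `validate` can be derived from such lists, as in the sketch below. The function name `top_k_hit_counts` and the toy data are assumptions for illustration.

```python
def top_k_hit_counts(per_question_hits):
    """For each k, count questions with at least one hit in the top k passages."""
    n_docs = len(per_question_hits[0])
    counts = [0] * n_docs
    for hits in per_question_hits:
        # Rank (0-based) of the first passage that contains an answer, if any.
        first_hit = next((rank for rank, hit in enumerate(hits) if hit), None)
        if first_hit is not None:
            # The question counts as covered at this rank and every later one.
            for k in range(first_hit, n_docs):
                counts[k] += 1
    return counts


# Toy data: 4 questions, 3 retrieved passages each (made-up hits).
toy_hits = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, True],
]
counts = top_k_hit_counts(toy_hits)
accuracy = [c / len(toy_hits) for c in counts]
print(counts)
print(accuracy)
```

Because the counts are cumulative, the resulting list never decreases as k grows, which matches the shape of the validation output above.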

### 4. Predict answers (Reader)

Predict the final answer to each question from the retrieved documents.
The performance of this baseline (DPR-subset) is EM = 30%. \
You can find the performance of other baselines at this link (EfficientQA dev):
-  https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
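
EM (exact match) counts a prediction as correct when it equals any gold answer after light normalization. The sketch below uses a common SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); this is an assumption, and the official evaluation script may differ in details. The function names and toy predictions are illustrative.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


# Toy predictions paired with gold answer lists (made-up predictions).
toy_predictions = ["1988", "The Fernie Alpine Resort.", "Vancouver"]
toy_gold = [["1988"], ["Fernie Alpine Resort"], ["Fernie Alpine Resort"]]

em = sum(
    exact_match(pred, gold) for pred, gold in zip(toy_predictions, toy_gold)
) / len(toy_gold)
print(round(em, 3))
```

Here two of the three toy predictions match after normalization, so the score is two thirds.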

In [11]:
# Load retrieved results

from train_reader import ReaderTrainer

retrieval_file = 'retrieved.json' #####
if not os.path.exists(retrieval_file):
  !gdown https://drive.google.com/uc?id=1_TQaJy1oBbx4BAO08SsqP8lD_65KZcqA


setup_args_gpu(args)
args.dev_file = retrieval_file


Initialized host aiamdserver01 as d.rank -1 on device=cuda, n_gpu=1, world size=1
16-bits training: False 


In [12]:
# Predict answers and validate results
# The prediction result file is saved as 'dev_predictions.json'

class MyReaderTrainer(ReaderTrainer):
  def _save_predictions(self, out_file, prediction_results):
    # Keep only the prediction made from the top `passages_per_question_predict` passages
    with open(out_file, 'w', encoding="utf-8") as output:
      save_results = []
      for r in prediction_results:
        save_results.append({
          'question': r.id,
          'prediction': r.predictions[args.passages_per_question_predict].prediction_text
          })
      # Write once, after the loop, so the file holds a single JSON array
      output.write(json.dumps(save_results, indent=4) + "\n")

trainer = MyReaderTrainer(args)
trainer.validate()

for i in range(args.num_workers):
    os.remove(retrieval_file.replace(".json", ".{}.pkl".format(i)))

***** Initializing components for training *****
Reading saved model from DPR/data/checkpoint/reader/nq-single-subset/hf-bert-base.cp
model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
Overriding args parameter value from checkpoint state. Param = sequence_length, value = 350
loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /home/sjyang/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  