# Training Your Own "Dense Passage Retrieval" Model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial9_DPR_training.ipynb)

Haystack contains all the tools needed to train your own Dense Passage Retrieval model.
This tutorial will guide you through the steps required to create a retriever that is specifically tailored to your domain.


In [1]:
# Here are some imports that we'll need

from haystack.retriever.dense import DensePassageRetriever
from haystack.preprocessor.utils import fetch_archive_from_http

01/07/2021 15:03:47 - INFO - faiss -   Loading faiss with AVX2 support.
01/07/2021 15:03:47 - INFO - faiss -   Loading faiss.


## Training Data

DPR training performed using Information Retrieval data.
More specifically, you want to feed in pairs of queries and relevant documents.

To train a model, we will need a dataset that has the same format as the original DPR training data.
Each data point in the dataset should have the following dictionary structure.

``` python
    {
        "dataset": str,
        "question": str,
        "answers": list of str
        "positive_ctxs": list of dictionaries of format {'title': str, 'text': str, 'score': int, 'title_score': int, 'passage_id': str}
        "negative_ctxs": list of dictionaries of format {'title': str, 'text': str, 'score': int, 'title_score': int, 'passage_id': str}
        "hard_negative_ctxs": list of dictionaries of format {'title': str, 'text': str, 'score': int, 'title_score': int, 'passage_id': str}
    }
```

## Using Question Answering Data

Question Answering datasets can sometimes be used as training data.
Google's Natural Questions dataset, is sufficiently large
and contains enough unique passages, that it can be converted into a DPR training set.
This is done simply by considering answer containing passages as relevant documents to the query.

The SQuAD dataset, however, is not as suited to this use case since its question and answer pairs
are created on only a very small slice of wikipedia documents.

## Download Original DPR Training Data

WARNING: These files are large! The train set is 7.4GB and the dev set is 800MB

We can download the original DPR training data with the following cell.
Note that this data is probably only useful if you are trying to train from scratch.

In [None]:
# Download original DPR data
# WARNING: the train set is 7.4GB and the dev set is 800MB

doc_dir = "data/dpr_training/"

s3_url_train = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz"
s3_url_dev = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz"

fetch_archive_from_http(s3_url_train, output_dir=doc_dir + "train/")
fetch_archive_from_http(s3_url_dev, output_dir=doc_dir + "dev/")

## Option 1: Training DPR from Scratch

The default variables that we provide below are chosen to train a DPR model from scratch.
Here, both passage and query embedding models are initialized using BERT base
and the model is trained using Google's Natural Questions dataset (in a format specialised for DPR).

If you are working in a language other than English,
you will want to initialize the passage and query embedding models with a language model that supports your language
and also provide a dataset in your language.

In [3]:
# Here are the variables to specify our training data, the models that we use to initialize DPR
# and the directory where we'll be saving the model

doc_dir = "data/dpr_training/"

train_filename = "train/biencoder-nq-train.json"
dev_filename = "dev/biencoder-nq-dev.json"

query_model = "bert-base-uncased"
passage_model = "bert-base-uncased"

save_dir = "../saved_models/dpr"

## Option 2: Finetuning DPR

If you have your own domain specific question answering or information retrieval dataset,
you might instead be interested in finetuning a pretrained DPR model.
In this case, you would initialize both query and passage models using the original pretrained model.
You will want to load something like this set of variables instead of the ones above

In [None]:
# Here are the variables you might want to use instead of the set above
# in order to perform pretraining

doc_dir = "PATH_TO_YOUR_DATA_DIR"
train_filename = "TRAIN_FILENAME"
dev_filename = "DEV_FILENAME"

query_model = "facebook/dpr-question_encoder-single-nq-base"
passage_model = "facebook/dpr-ctx_encoder-single-nq-base"

save_dir = "../saved_models/dpr"

## Initialization

Here we want to initialize our model either with plain language model weights for training from scratch
or else with pretrained DPR weights for finetuning.
We follow the [original DPR parameters](https://github.com/facebookresearch/DPR#best-hyperparameter-settings)
for their max passage length but set max query length to 64 since queries are very rarely longer.

In [4]:
## Initialize DPR model

retriever = DensePassageRetriever(
    document_store=None,
    query_embedding_model=query_model,
    passage_embedding_model=passage_model,
    max_seq_len_query=64,
    max_seq_len_passage=256
)

	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.


## Training

Let's start training and save our trained model!

On a V100 GPU, you can fit up to batch size 4 so we set gradient accumulation steps to 4 in order
to simulate the batch size 16 of the original DPR experiment.

When `embed_title=True`, the document title is prepended to the input text sequence with a `[SEP]` token
between it and document text.

When training from scratch with the above variables, 1 epoch takes around an hour and we reached the following performance:

```
loss: 0.09334952129693501
acc: 0.984035000191887
f1: 0.936147352264006
acc_and_f1: 0.9600911762279465
average_rank: 0.07075978511128166
```

In [None]:
# Start training our model and save it when it is finished

retriever.train(
    data_dir=doc_dir,
    train_filename=dev_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=1,
    batch_size=4,
    grad_acc_steps=4,
    save_dir=save_dir,
    evaluate_every=3000,
    embed_title=True
)

Preprocessing Dataset data/dpr_training/dev/biencoder-nq-dev.json: 100%|██████████| 6515/6515 [00:07<00:00, 928.71 Dicts/s] 
Preprocessing Dataset data/dpr_training/dev/biencoder-nq-dev.json:  66%|██████▌   | 4301/6515 [00:05<00:02, 786.28 Dicts/s] Process ForkPoolWorker-8:
Process ForkPoolWorker-12:
Process ForkPoolWorker-11:
Preprocessing Dataset data/dpr_training/dev/biencoder-nq-dev.json:  66%|██████▌   | 4301/6515 [00:05<00:02, 814.99 Dicts/s]Process ForkPoolWorker-10:
Process ForkPoolWorker-9:
Process ForkPoolWorker-14:
Process ForkPoolWorker-13:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/branden/Code/anaconda3/envs/haystack/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/branden/Code/anaconda3/envs/haystack/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/hom

## Loading

Loading our newly trained model is simple!

In [6]:
reloaded_retriever = DensePassageRetriever.load(load_dir=save_dir, document_store=None)



ValueError: Could not infer tokenizer_class from model config or name '..saved_models/dpr/query_encoder'. Set arg `tokenizer_class` in Tokenizer.load() to one of: AlbertTokenizer, XLMRobertaTokenizer, RobertaTokenizer, DistilBertTokenizer, BertTokenizer, XLNetTokenizer, CamembertTokenizer, ElectraTokenizer, DPRQuestionEncoderTokenizer,DPRContextEncoderTokenizer.