In [None]:
BRANCH = 'ir_tutorial'

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/AlexGrinch/NeMo.git@{BRANCH}#egg=nemo_toolkit[nlp]


In [None]:
# If you're not using Colab, you might need to upgrade jupyter notebook to avoid the following error:
# 'ImportError: IProgress not found. Please update jupyter and ipywidgets.'

! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension

# Please restart the kernel after running this cell

In [None]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

In this tutorial, we describe how to fine-tune a BERT-like model, based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), on an information retrieval task. We will work with the MSMARCO dataset, which contains more than 800K queries and more than 8.8M short passages. The task is to rank all passages by their relevance to a given query. Specifically, we will train two different models.

### BERT joint re-ranking 

* Original paper: [Passage Re-ranking with BERT](https://arxiv.org/abs/1901.04085).
* Model overview: the input query-passage pair is encoded as **[CLS] query_tokens [SEP] passage_tokens [SEP]** and fed into a BERT encoder. The last hidden state of the [CLS] token is then passed into a fully-connected layer to get the similarity score.
* Pros: high accuracy.
* Cons: the model is too computationally demanding to re-rank all passages. It is better suited to re-ranking a shortlist of the top ~100 candidates.
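
To make the input layout above concrete, here is a minimal sketch of how a query-passage pair could be assembled for the joint encoder. It is not NeMo's preprocessing code: the function name `build_joint_input`, the whitespace "tokenization", the `max_len` default, and the segment ids (standard BERT practice, not stated above) are all assumptions made for illustration.

```python
CLS, SEP = "[CLS]", "[SEP]"


def build_joint_input(query_tokens, passage_tokens, max_len=512):
    """Return (tokens, segment_ids) for a hypothetical joint encoder input."""
    # Reserve room for [CLS] and the two [SEP] tokens, then truncate the passage.
    budget = max_len - 3 - len(query_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    passage_tokens = passage_tokens[:budget]

    tokens = [CLS] + query_tokens + [SEP] + passage_tokens + [SEP]
    # Segment 0 covers [CLS] + query + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids


# Whitespace splitting stands in for a real WordPiece tokenizer here.
tokens, segment_ids = build_joint_input(
    "what is bert".split(), "bert is a language model".split()
)
print(tokens)
print(segment_ids)
```

Because every candidate passage has to be paired with the query and passed through the encoder again, the cost grows linearly with the number of candidates, which is why this model is usually applied only to a shortlist.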

### Dense Passage Retrieval

* Original paper: [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906).
* Model overview: the query and the passage are encoded separately as **[CLS] query_tokens [SEP]** and **[CLS] passage_tokens [SEP]**, which are fed into two different BERT encoders. The last hidden states of the corresponding [CLS] tokens are treated as the query and passage embeddings, respectively. The similarity score is computed as the dot product between the query and passage embeddings.
* Pros: because the computation of query and passage embeddings is disentangled, all passage embeddings can be pre-computed in a single pass through the passage collection. We can then build a FAISS index on top of them and retrieve relevant passages very quickly.
* Cons: accuracy is lower compared to the joint model.
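
The scoring step described above can be sketched in a few lines. This is only an illustration with made-up three-dimensional embeddings and hypothetical helper names (`dot`, `rank_passages`); real embeddings would come from the two encoders, and a large collection would be searched through a FAISS index rather than this brute-force loop.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_emb, passage_embs, top_k=2):
    """Score every pre-computed passage embedding against the query."""
    scores = {pid: dot(query_emb, emb) for pid, emb in passage_embs.items()}
    # Highest dot product first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy passage embeddings, computed once and reused for every query.
passage_embs = {
    "p1": [0.1, 0.9, 0.0],
    "p2": [0.8, 0.1, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query_emb = [0.7, 0.2, 0.1]

top = rank_passages(query_emb, passage_embs)
print(top)
```

Only the query has to be encoded at search time; the passage side is a lookup into embeddings that were computed ahead of time.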

# Dataset

First, we need to download and prepare the training dataset. Navigate to [examples/nlp/information_retrieval](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/information_retrieval) and download the MSMARCO dataset:

`bash get_msmarco.sh`

The dataset should contain the following files:
* collection.tsv - training passages with entries (passage_id, passage_text)
* queries.(train/dev).tsv - training/development queries with entries (query_id, query_text)
* qrels.(train/dev).tsv - training/development relevance scores with entries (query_id, 0, passage_id, 1)
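
As a rough illustration of the qrels layout listed above, the sketch below parses rows of the form (query_id, 0, passage_id, 1) into a mapping from each query to its relevant passages. It assumes the files are tab-separated, as the `.tsv` extension suggests; the helper `read_qrels` and the sample ids are made up for this example.

```python
import csv
import io
from collections import defaultdict


def read_qrels(handle):
    """Map query_id -> set of relevant passage_ids."""
    relevant = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, _, passage_id, _ = row
        relevant[query_id].add(passage_id)
    return relevant


# In-memory stand-in for a qrels file; the ids are illustrative only.
sample = io.StringIO("101\t0\t7\t1\n102\t0\t16\t1\n101\t0\t9\t1\n")
qrels = read_qrels(sample)
print(dict(qrels))
```

With a real file, the same function could be called on `open(path, newline="")` instead of the `io.StringIO` object.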

Information retrieval models are usually trained in a contrastive manner on lists consisting of a query, its corresponding relevant passage, and several irrelevant passages. Thus, the training dataset for NeMo information retrieval models has the following format:
* collection.tsv.(pkl/npz) - tokenized and cached training passages
* queries.train.tsv.(pkl/npz) - tokenized and cached training queries
* query2passages.tsv - file with entries (query_id, rel_psg_id, irrel_psg_1_id, ..., irrel_psg_k_id)
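
One common way to train on such a list is to treat the similarity scores as logits and apply a softmax cross-entropy loss in which the relevant passage is the correct class. The sketch below shows that formulation in plain Python; it assumes the relevant passage's score comes first, mirroring the order of entries in query2passages.tsv, and it is not necessarily the exact loss used by the NeMo models.

```python
import math


def contrastive_loss(scores):
    """Negative log-softmax of scores[0], the assumed relevant-passage score."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]


# The loss is small when the relevant passage has the highest score...
good = contrastive_loss([5.0, 1.0, 0.5])
# ...and large when an irrelevant passage outscores it.
bad = contrastive_loss([0.5, 5.0, 1.0])
print(good, bad)
```

Minimizing this loss pushes the relevant passage's score above the scores of the irrelevant passages in the same list.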

Note that the way we choose irrelevant passages for training is important for the model's performance. If the negative passages are too easy (that is, they have nothing to do with the corresponding relevant passage), the model will quickly learn to distinguish them from positive passages but will perform poorly on harder negatives (for example, those with significant token overlap with the query or the positive passage). Thus, it is common to construct harder negatives, for example by selecting irrelevant passages with a high BM25 score. For this tutorial, we will choose negative passages randomly. To do so, run the corresponding script from [examples/nlp/information_retrieval](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/information_retrieval):

`python construct_random_negatives.py`

Once this script has created the two files query2passages.train.tsv and query2passages.dev.tsv, we have everything we need to start training.
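
For intuition, random negative sampling might look roughly like the sketch below. This is not the code of `construct_random_negatives.py`: the function `sample_random_negatives`, its parameters, and the toy ids are hypothetical, and filtering the whole collection per query, as done here, would be too slow for 8.8M passages.

```python
import random


def sample_random_negatives(qrels, passage_ids, num_negatives, seed=0):
    """Build rows of (query_id, rel_psg_id, irrel_psg_1_id, ..., irrel_psg_k_id)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows = []
    for query_id, relevant in qrels.items():
        # Any passage not marked relevant for this query counts as a negative.
        candidates = [pid for pid in passage_ids if pid not in relevant]
        for rel_id in sorted(relevant):
            negatives = rng.sample(candidates, num_negatives)
            rows.append([query_id, rel_id] + negatives)
    return rows


toy_qrels = {"q1": {"p1"}, "q2": {"p3"}}
toy_passages = ["p1", "p2", "p3", "p4", "p5"]
rows = sample_random_negatives(toy_qrels, toy_passages, num_negatives=2)
for row in rows:
    print("\t".join(row))
```

Each printed line follows the tab-separated layout of the query2passages files described earlier.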

# Model Training

If you have NeMo installed locally, you can also train the model with `examples/nlp/information_retrieval/bert_joint_ir.py`

To run the training script, use:

`python bert_joint_ir.py`