# Better Retrieval via "Dense Passage Retrieval"

EXECUTABLE VERSION: [colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)

### Importance of Retrievers

The Retriever has a huge impact on the performance of our overall search pipeline.


### Different types of Retrievers
#### Sparse
A family of algorithms based on counting word occurrences (bag-of-words), which yields very sparse vectors whose length equals the vocabulary size.

**Examples**: BM25, TF-IDF

**Pros**: Simple, fast, easy to explain

**Cons**: Relies on exact keyword matches between query and text
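
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not how BM25 or TF-IDF are implemented in Haystack; the helper names (`tokenize`, `bow_vector`, `overlap_score`) and the two example sentences are made up for illustration, and the scoring is a plain count overlap without any term weighting.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bow_vector(text, vocab):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

def overlap_score(query_vec, doc_vec):
    """Dot product of two count vectors: only shared words contribute."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

docs = [
    "the king rules the north",
    "the monarch governs the northern lands",
]
# The vector length equals the vocabulary size.
vocab = sorted({word for doc in docs for word in tokenize(doc)})
query = "king of the north"

query_vec = bow_vector(query, vocab)
scores = [overlap_score(query_vec, bow_vector(doc, vocab)) for doc in docs]
print(scores)  # [4, 2]
```

Both documents say roughly the same thing, but the second one only scores through the shared word "the", because "monarch" and "northern" are not exact matches for "king" and "north".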
 

#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two different approaches: 

a) Single encoder: Use a **single model** to embed both query and passage.  
b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...). 

**Examples**: REALM, DPR, Sentence-Transformers

**Pros**: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics ...)

**Cons**: Computationally heavier; the model has to be trained first
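
The scoring step of a dual encoder can be sketched in a few lines. The vectors below are hand-written stand-ins for what a query encoder and a passage encoder would produce (real embeddings have hundreds of dimensions), and `rank_passages` is a hypothetical helper, not a Haystack function. The dot product is used as the similarity measure, as in the DPR paper.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_passages(query_embedding, passage_embeddings, top_k=2):
    """Return the ids of the top_k passages with the highest dot product."""
    scored = [(dot(query_embedding, emb), pid) for pid, emb in passage_embeddings.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]

# Made-up values, as if produced by the passage encoder ahead of time.
passage_embeddings = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.3],
    "p3": [0.0, 0.2, 0.9],
}
# Made-up values, as if produced by the separate query encoder at query time.
query_embedding = [0.8, 0.2, 0.1]

print(rank_passages(query_embedding, passage_embeddings))  # ['p1', 'p2']
```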


### "Dense Passage Retrieval"

In this tutorial, we highlight one dense dual-encoder: the Dense Passage Retriever (DPR). 
It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). 

Original Abstract: 

_"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."_

Paper: https://arxiv.org/abs/2004.04906  
Original Code: https://fburl.com/qa-dpr 




### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg">

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Mon Aug 24 11:56:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    39W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

In [2]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-fqgbr4x7
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-fqgbr4x7


Building wheels for collected packages: farm-haystack
  Building wheel for farm-haystack (setup.py) ... done
  Created wheel for farm-haystack: filename=farm_haystack-0.3.0-py3-none-any.whl size=99007 sha256=c46bad086db77ddc557d67d6a47b0e8ead6a76c20451e21bd7e56e7b3adf5434
  Stored in directory: /tmp/pip-ephem-wheel-cache-s2p1ltpe/wheels/5b/d7/60/7a15bd24f2905dfa70aa762413b9570b9d37add064b151aaf0
Successfully built farm-haystack
You should consider upgrading via the '/home/ubuntu/py3_6/bin/python3.6 -m pip install --upgrade pip' command.


Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.5.1%2Bcu101-cp36-cp36m-linux_x86_64.whl (704.4 MB)
     |████████████████████████████████| 704.4 MB 9.3 kB/s eta 0:00:011
Collecting torchvision==0.6.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.6.1%2Bcu101-cp36-cp36m-linux_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 881 kB/s eta 0:00:01
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.5.1
    Uninstalling torch-1.5.1:
      Successfully uninstalled torch-1.5.1
Successfully installed torch-1.5.1+cu101 torchvision-0.6.1+cu101
You should consider upgrading via the '/home/ubuntu/py3_6/bin/python3.6 -m pip install --upgrade pip' command.


In [2]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (in-memory SQLite by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index, which is queried later when searching for answers.
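
The split between a SQL table for text and a separate vector index can be illustrated with a small sketch. `ToyVectorStore` is a hypothetical class written for this explanation, not the `FAISSDocumentStore` implementation: it keeps text in an in-memory SQLite table and replaces the FAISS index with a plain Python list that is searched by brute force.

```python
import sqlite3

class ToyVectorStore:
    """Text lives in SQLite; vectors live in a separate in-memory list."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT)")
        self.vectors = []  # (doc_id, vector) pairs; stand-in for a FAISS index

    def write_document(self, text):
        """Store the text and return the id assigned by SQLite."""
        cursor = self.db.execute("INSERT INTO document (text) VALUES (?)", (text,))
        return cursor.lastrowid

    def add_vector(self, doc_id, vector):
        """Register the embedding that belongs to a stored document."""
        self.vectors.append((doc_id, vector))

    def query(self, query_vector, top_k=1):
        """Find the closest vectors first, then look up their text by id."""
        scored = sorted(
            self.vectors,
            key=lambda item: sum(q * v for q, v in zip(query_vector, item[1])),
            reverse=True,
        )
        results = []
        for doc_id, _ in scored[:top_k]:
            row = self.db.execute(
                "SELECT text FROM document WHERE id = ?", (doc_id,)
            ).fetchone()
            results.append(row[0])
        return results

store = ToyVectorStore()
first = store.write_document("Winterfell is a castle in the North.")
second = store.write_document("Dothraki is a constructed language.")
store.add_vector(first, [1.0, 0.0])   # made-up embedding
store.add_vector(second, [0.0, 1.0])  # made-up embedding
print(store.query([0.2, 0.9]))  # ['Dothraki is a constructed language.']
```

The vector search only returns ids; the text itself is fetched from the SQL table afterwards, which is why both parts are needed.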

In [3]:
from haystack.document_store.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore()

08/25/2020 08:27:51 - INFO - faiss -   Loading faiss with AVX2 support.
08/25/2020 08:27:51 - INFO - faiss -   Loading faiss.


### Cleaning & indexing documents

Similarly to the previous tutorials, we download, convert and index some Game of Thrones articles into our DocumentStore.

In [4]:
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

08/25/2020 08:27:53 - INFO - haystack.indexing.utils -   Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.


### Initialize Retriever, Reader, & Finder

#### Retriever

**Here:** We use a `DensePassageRetriever`

**Alternatives:**

- The `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging

In [5]:
from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)

08/25/2020 08:28:12 - INFO - haystack.database.faiss -   Updating embeddings for 2497 docs ...
	nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
	nonzero(Tensor input, *, bool as_tuple)
08/25/2020 08:28:13 - INFO - haystack.retriever.dense -   Embedded 80 / 2497 texts
08/25/2020 08:28:14 - INFO - haystack.retriever.dense -   Embedded 160 / 2497 texts
08/25/2020 08:28:14 - INFO - haystack.retriever.dense -   Embedded 240 / 2497 texts
08/25/2020 08:28:15 - INFO - haystack.retriever.dense -   Embedded 320 / 2497 texts
08/25/2020 08:28:16 - INFO - haystack.retriever.dense -   Embedded 400 / 2497 texts
08/25/2020 08:28:17 - INFO - haystack.retriever.dense -   Embedded 480 / 2497 texts
08/25/2020 08:28:17 - INFO - haystack.retriever.dense -   Embedded 560 / 2497 texts
08/25/2020 08:28:18 - INFO - haystack.retriever.dense -   Embedded 640 / 2497 texts
08/25/2020 08:28:19 - INFO - haystack.retriever.dense -   Embedded 720 / 2497 texts
08/25/2020 08:2
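
The "embed once, in batches" step can be sketched as follows. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for the passage encoder, `embed_corpus` is not a Haystack function, and the corpus is synthetic. Only the corpus size (2497) and the batch size (16) are taken from the cell above.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def toy_encode(texts):
    """Hypothetical stand-in for a passage encoder: one small vector per text."""
    return [[float(len(text)), float(len(text.split()))] for text in texts]

def embed_corpus(texts, batch_size=16):
    """Embed every text exactly once, batch by batch."""
    embeddings = []
    encoder_calls = 0
    for batch in batches(texts, batch_size):
        embeddings.extend(toy_encode(batch))
        encoder_calls += 1
    return embeddings, encoder_calls

corpus = [f"passage number {i}" for i in range(2497)]
embeddings, encoder_calls = embed_corpus(corpus, batch_size=16)
print(len(embeddings), encoder_calls)  # 2497 157
```

Indexing needs one encoder call per batch, whereas answering a question later needs only a single call for the query, since the passage vectors are already stored.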

#### Reader

Similar to previous tutorials, we now initialize our reader.

Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)



##### FARMReader

In [6]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

08/25/2020 08:28:54 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/25/2020 08:28:54 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/25/2020 08:29:09 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/25/2020 08:29:10 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
08/25/2020 08:29:10 - INFO - farm.infer -    0    0    0    0    0    0    0 
08/25/2020 08:29:10 - INFO - farm.infer -   /w\  /w\  /w\  /w\  /w\  /w\  /w\
08/25/2020 08:29:10 - INFO - farm.infer -   /'\  / \  /'\  /'\  / \  / \  /'\
08/25/2020 08:29:10 - INFO - farm.infer -               


#### Finder

The Finder combines the reader and the retriever in a pipeline to answer our actual questions. 

In [7]:
finder = Finder(reader, retriever)
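
Conceptually, such a pipeline retrieves a handful of candidate documents and then lets the reader score an answer in each of them. The sketch below shows that flow with toy components; it is an assumption-laden illustration, not the `Finder` implementation. `word_overlap`, `make_retriever`, `toy_read` and this `get_answers` function are all hypothetical, and the toy "reader" simply returns the first three words of a document.

```python
def word_overlap(question, document):
    """Number of distinct lowercase tokens shared by question and document."""
    question_words = set(question.lower().split())
    return len(question_words & set(document.lower().split()))

def make_retriever(documents):
    """Build a toy retriever that ranks documents by word overlap."""
    def retrieve(question, top_k):
        ranked = sorted(documents, key=lambda d: word_overlap(question, d), reverse=True)
        return ranked[:top_k]
    return retrieve

def toy_read(question, document):
    """Toy reader: 'answer' is the first three words, score is the overlap."""
    answer = " ".join(document.split()[:3])
    return {"answer": answer, "score": word_overlap(question, document)}

def get_answers(question, retrieve, read, top_k_retriever=10, top_k_reader=5):
    """Stage 1: fetch candidates. Stage 2: read each one and keep the best answers."""
    candidates = retrieve(question, top_k_retriever)
    answers = [read(question, doc) for doc in candidates]
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]

documents = [
    "David J. Peterson created the Dothraki language.",
    "Winterfell is the seat of House Stark.",
    "The Dothraki are nomadic horse riders.",
]
result = get_answers(
    "Who created the Dothraki language?",
    make_retriever(documents),
    toy_read,
    top_k_retriever=2,
    top_k_reader=1,
)
print(result)  # [{'answer': 'David J. Peterson', 'score': 3}]
```

The reader never sees documents that the retriever did not return, which is why the retriever's candidate count limits the quality of the final answers.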

### Voilà! Ask a question!

In [8]:
# You can configure how many candidates the reader and retriever return.
# The higher top_k_retriever, the better (but also the slower) your answers.
prediction = finder.get_answers(question="Who created the Dothraki vocabulary?", top_k_retriever=10, top_k_reader=5)

#prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)
#prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_retriever=10, top_k_reader=5)

08/25/2020 08:30:28 - INFO - haystack.finder -   Reader is looking for detailed answer in 9168 chars ...
  start_logits_normalized = nn.functional.softmax(start_logits)
  end_logits_normalized = nn.functional.softmax(end_logits)
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 38.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 39.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 53.05 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 37.39 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 67.21 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 67.10 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 66.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.91 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 33.05 Batches/s]


In [9]:
print_answers(prediction, details="minimal")

[   {   'answer': 'David J. Peterson',
        'context': '\n'
                   '===Valyrian===\n'
                   'David J. Peterson, who created the Dothraki language for '
                   'the first season of the show, was entrusted by the '
                   'producers to design a new '},
    {   'answer': 'David Peterson',
        'context': '\n'
                   '==Phonology and romanization==\n'
                   'David Peterson has said, "You know, most people probably '
                   "don't really know what Arabic actually sounds like, so to "
                   'an '},
    {   'answer': 'books',
        'context': 'ints.  First, the language had to match the uses already '
                   'put down in the books. Secondly, it had to be easily '
                   'pronounceable or learnable by the actors'},
    {   'answer': "'''Nevakhi vekha ha maan: Rekke, m'aresakea norethi fitte.'",
        'context': '\n'
                   '==Sample==\n'
             