# Open-retrieval Conversational Question Answering
Based on the paper _Open-Retrieval Conversational Question Answering_ by _Qu et al_.

Since ConverSE is built upon Haystack, this notebook closely follows the original Haystack notebook on Dense Passage Retrieval: https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb#scrollTo=kFwiPP60A6N7

## Prepare environment

In [4]:
# Make sure you have a GPU running
!nvidia-smi


!pip install git+https://github.com/deepset-ai/haystack.git  # Install the latest master of Haystack
!pip install git+https://github.com/giguru/converse.git  # Install the latest master of Converse

/bin/sh: nvidia-smi: command not found
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /private/var/folders/n3/qrq_p64n7hn0ln41q671bk8w0000gn/T/pip-req-build-3k86i78x


Building wheels for collected packages: farm-haystack


  Building wheel for farm-haystack (setup.py) ... done
  Created wheel for farm-haystack: filename=farm_haystack-0.4.0-py3-none-any.whl size=82059 sha256=da9eb10ac22dfd764b2ad654d2f8817001d5e7870422865fee0876d2058213c0
  Stored in directory: /private/var/folders/n3/qrq_p64n7hn0ln41q671bk8w0000gn/T/pip-ephem-wheel-cache-cwvgt6kn/wheels/a7/05/3b/9b33368d9af06a39f8e6af2e97fa2af876e893ade323cfc2c9
Successfully built farm-haystack


In [5]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.utils import print_answers

from converse.src.reader.farm import FARMReader
from converse.src.reader.transformers import TransformersReader
from converse.src.retriever.dense_passage_retriever import DensePassageRetriever
from converse.src.converse import Converse

## Indexer and data

Add the document collection to a DocumentStore. The original text is indexed first; conversion into embeddings is done below.

In [6]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

10/01/2020 19:22:27 - INFO - faiss -   Loading faiss.
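
The fixed `sleep 30` above is only a guess at how long Elasticsearch takes to start. A minimal sketch of polling instead, assuming Elasticsearch answers on its default HTTP port 9200 on localhost. `es_is_up` and `wait_until` are hypothetical helper names for illustration, not part of Haystack or ConverSE:

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", request_timeout=2.0):
    """Return True if an HTTP GET on `url` answers with status 200 (assumed ES address)."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat the server as not ready yet
        return False


def wait_until(probe, timeout=60.0, interval=1.0):
    """Call `probe` repeatedly until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, a call such as `wait_until(es_is_up)` could stand in for `! sleep 30`; it returns `False` if the server never answers within the timeout.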


In [None]:
# Download evaluation data, which is a subset of Natural Questions development set containing 50 documents
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

In [None]:
# make sure these indices do not collide with existing ones, the indices will be wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

In [None]:
# Set up the DocumentStore that will hold the document collection. The original text is indexed first;
# conversion into embeddings is done below.
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document",
                                            create_index=False, embedding_field="emb",
                                            embedding_dim=768, excluded_meta_data=["emb"])

In [None]:
# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to not have duplicate elements
document_store.delete_all_documents(index=doc_index)
document_store.delete_all_documents(index=label_index)
document_store.add_eval_data(filename="../data/nq/nq_dev_subset_v2.json", doc_index=doc_index, label_index=label_index)

In [7]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",  # TODO replace with ORConvQA model
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",  # TODO replace with ORConvQA model
    use_gpu=True,
    embed_title=True,
    max_seq_len=256,
    batch_size=16,
    remove_sep_tok_from_untitled_passages=True
)

Some weights of DPRQuestionEncoder were not initialized from the model checkpoint at facebook/dpr-question_encoder-single-nq-base and are newly initialized: ['question_encoder.bert_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DPRContextEncoder were not initialized from the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base and are newly initialized: ['ctx_encoder.bert_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Embed passages
Since retrieval is done on the embeddings, the embedding representation of each document needs to be computed.
This only needs to be done once.

In [8]:
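# The evaluation documents were written to doc_index, so compute the embeddings for that index
document_store.update_embeddings(retriever, index=doc_index)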
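
To illustrate why passage embeddings only need to be computed once: at query time, only the question is encoded, and it is compared against the stored passage vectors. The sketch below ranks passages by dot product, the similarity used by DPR. It uses toy 4-dimensional vectors rather than the 768-dimensional embeddings configured above, and `top_k_by_dot_product` is a hypothetical helper, not a Haystack or ConverSE function:

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Toy passage embeddings, computed once and stored
passages = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
# Toy question embedding, computed at query time
query = [0.9, 0.1, 0.0, 0.0]

ranked = top_k_by_dot_product(query, passages, k=2)
print([index for index, _ in ranked])  # [0, 2]
```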

In [9]:
# Load a local model or any of the QA models on Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

10/01/2020 19:23:18 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
10/01/2020 19:23:18 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
10/01/2020 19:23:29 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
10/01/2020 19:23:29 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
10/01/2020 19:23:29 - INFO - farm.infer -    0    0    0    0    0    0    0 
10/01/2020 19:23:29 - INFO - farm.infer -   /w\  /w\  /w\  /w\  /w\  /w\  /w\
10/01/2020 19:23:29 - INFO - farm.infer -   /'\  / \  /'\  /'\  / \  / \  /'\
10/01/2020 19:23:29 - INFO - farm.infer -               


In [10]:
finder = Converse(reader, retrievers=[retriever])

## Evaluate pipeline

In [12]:
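# Evaluate the combination of Reader and Retriever through Converse.
# eval() requires label_index and doc_index; calling it without them raises
# "TypeError: eval() missing 2 required positional arguments: 'label_index' and 'doc_index'".
finder_eval_results = finder.eval(label_index=label_index, doc_index=doc_index, top_k_retriever=1, top_k_reader=10)
finder.print_eval_results(finder_eval_results)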
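
`top_k_retriever` limits how many passages the retriever hands to the reader, so its value directly affects how often the reader even sees a relevant passage. One common way to summarize this is recall@k: the fraction of questions for which at least one gold document appears among the top k retrieved results. A minimal sketch on toy data follows. The function name and document ids are hypothetical, and this is not necessarily how `eval()` computes its numbers:

```python
from typing import List, Set


def recall_at_k(retrieved_per_query: List[List[str]],
                gold_per_query: List[Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one gold document id in the top k results."""
    if not gold_per_query:
        return 0.0
    hits = 0
    for retrieved, gold in zip(retrieved_per_query, gold_per_query):
        if any(doc_id in gold for doc_id in retrieved[:k]):
            hits += 1
    return hits / len(gold_per_query)


# Toy ranked retrieval results for three questions (hypothetical ids)
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
# Gold (relevant) document ids per question
gold = [{"d1"}, {"d5"}, {"d0"}]

print(recall_at_k(retrieved, gold, k=1))
print(recall_at_k(retrieved, gold, k=2))
```

With this toy data, one of three questions is answered at k=1 and two of three at k=2, which shows how a small retriever cut-off can cap end-to-end accuracy.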