# Evaluation of a QA System

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)

To make a statement about the performance of a question-answering system, you need to evaluate it. Evaluation also helps you determine which parts of the system can be improved.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from its release archive.

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4

In [None]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
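
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As a minimal sketch, assuming Elasticsearch listens on `http://localhost:9200` (the address that appears in the logs of this tutorial), the server could be polled until it responds. The helper names `elasticsearch_is_up` and `wait_until_ready` are hypothetical and not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout_s=2.0):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet
        return False


def wait_until_ready(probe, max_wait_s=60.0, interval_s=1.0):
    """Call `probe()` repeatedly until it returns True or `max_wait_s` elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready(elasticsearch_is_up)` in place of the fixed sleep returns `True` as soon as the server answers, or `False` if it has not come up within `max_wait_s` seconds.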

In [1]:
from farm.utils import initialize_device_settings

device, n_gpu = initialize_device_settings(use_cuda=True)

03/31/2021 15:17:47 - INFO - farm.utils -   Using device: CUDA 
03/31/2021 15:17:47 - INFO - farm.utils -   Number of GPUs: 1
03/31/2021 15:17:47 - INFO - farm.utils -   Distributed Training: False
03/31/2021 15:17:47 - INFO - farm.utils -   Automatic Mixed Precision: None


In [2]:
from haystack.preprocessor.utils import fetch_archive_from_http

# Download the evaluation data: a subset of the Natural Questions development set containing 50 documents
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

03/31/2021 15:17:51 - INFO - faiss -   Loading faiss with AVX2 support.
03/31/2021 15:17:51 - INFO - faiss -   Loading faiss.
03/31/2021 15:17:52 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
03/31/2021 15:17:52 - INFO - haystack.preprocessor.utils -   Found data stored in `../data/nq`. Delete this first if you really want to fetch new data.


False

In [3]:
# Make sure these indices do not collide with existing ones: they will be wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

In [4]:
# Connect to Elasticsearch
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document",
                                            create_index=False, embedding_field="emb",
                                            embedding_dim=768, excluded_meta_data=["emb"])

03/31/2021 15:17:58 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.003s]


In [5]:
from haystack.preprocessor import PreProcessor

# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to not have duplicate elements
# and also split our documents into shorter passages using the PreProcessor
preprocessor = PreProcessor(
    split_length=500,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False
)
document_store.delete_all_documents(index=doc_index)
document_store.delete_all_documents(index=label_index)
document_store.add_eval_data(
    filename="../data/nq/nq_dev_subset_v2.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor
)

# Let's prepare the labels that we need for the retriever and the reader
labels = document_store.get_all_labels_aggregated(index=label_index)
q_to_l_dict = {
    l.question: {
        "retriever": l,
        "reader": l
    } for l in labels
}

03/31/2021 15:18:15 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_delete_by_query [status:200 request:0.053s]
03/31/2021 15:18:17 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_labels/_delete_by_query [status:200 request:0.028s]
03/31/2021 15:18:19 - INFO - elasticsearch -   HEAD http://localhost:9200/tutorial5_docs [status:200 request:0.001s]
03/31/2021 15:18:20 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.435s]
03/31/2021 15:18:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.016s]
03/31/2021 15:18:21 - INFO - elasticsearch -   HEAD http://localhost:9200/tutorial5_labels [status:200 request:0.001s]
03/31/2021 15:18:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.287s]
03/31/2021 15:18:21 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_labels/_search?scroll=1d&size=10000
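
To make the effect of `split_length=500` and `split_overlap=0` more concrete, here is a simplified sketch of fixed-length splitting. It assumes that the splitting unit is whitespace-separated words and it normalizes whitespace, so it is not the actual `PreProcessor` implementation. The function name `split_into_passages` is hypothetical.

```python
def split_into_passages(text, split_length=500, split_overlap=0):
    """Split `text` into passages of at most `split_length` words.

    Consecutive passages share `split_overlap` words.
    Assumption: words are whitespace-separated tokens.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + split_length]))
        # Stop once the current passage reaches the end of the text
        if start + split_length >= len(words):
            break
    return passages


# A 1200-word toy document yields passages of 500, 500 and 200 words
toy_document = " ".join(f"w{i}" for i in range(1200))
passage_lengths = [len(p.split()) for p in split_into_passages(toy_document)]
print(passage_lengths)
```

With no overlap, every word ends up in exactly one passage, which keeps the indexed passages short enough for the reader while preserving all of the text.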

## Initialize components of QA-System

In [6]:
# Initialize Retriever
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)
# Alternative: Evaluate DensePassageRetriever
# Note that DPR works best when you index short passages of fewer than 512 tokens, as only those tokens will be used for the embedding.
# Here, for nq_dev_subset_v2.json, the average number of tokens is 5220(!).
# DPR still outperforms Elastic's BM25 by a small margin here.
# from haystack.retriever.dense import DensePassageRetriever
# retriever = DensePassageRetriever(document_store=document_store,
#                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#                                  use_gpu=True,
#                                  embed_title=True,
#                                  max_seq_len=256,
#                                  batch_size=16,
#                                  remove_sep_tok_from_untitled_passages=True)
#document_store.update_embeddings(retriever, index=doc_index)

In [8]:
# Initialize Reader
from haystack.reader.farm import FARMReader

reader = FARMReader("deepset/roberta-base-squad2", top_k_per_candidate=4, return_no_answer=True)


03/31/2021 15:20:01 - INFO - farm.utils -   Using device: CUDA 
03/31/2021 15:20:01 - INFO - farm.utils -   Number of GPUs: 1
03/31/2021 15:20:01 - INFO - farm.utils -   Distributed Training: False
03/31/2021 15:20:01 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/31/2021 15:20:14 - INFO - farm.utils -   Using device: CUDA 
03/31/2021 15:20:14 - INFO - farm.utils -   Number of GPUs: 1
03/31/2021 15:20:14 - INFO - farm.utils -   Distributed Training: False
03/31/2021 15:20:14 - INFO - farm.utils -   Automatic Mixed Precision: None
03/31/2021 15:20:14 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
03/31/2021 15:20:14 - INFO - farm.infer -    0    0    0

In [7]:
from haystack.eval import EvalReader, EvalRetriever

# Here we initialize the nodes that perform evaluation
eval_retriever = EvalRetriever()
eval_reader = EvalReader()

## Evaluation of Retriever
Here we evaluate only the retriever, based on whether the gold_label document is retrieved.

In [9]:
## Evaluate Retriever on its own
retriever_eval_results = retriever.eval(top_k=20, label_index=label_index, doc_index=doc_index)
## Retriever Recall is the proportion of questions for which the correct document containing the answer is
## among the retrieved documents
print("Retriever Recall:", retriever_eval_results["recall"])
## Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])

03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_labels/_search?scroll=1d&size=10000 [status:200 request:0.007s]
03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.002s]
03/31/2021 15:20:52 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.001s]
03/31/2021 15:20:52 - INFO - haystack.retriever.base -   Performing eval queries...
  0%|          | 0/50 [00:00<?, ?it/s]03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.004s]
03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.004s]
03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.006s]
03/31/2021 15:20:52 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.

Retriever Recall: 1.0
Retriever Mean Avg Precision: 0.5936623277242845
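
The two retriever metrics printed above can be reproduced on toy data. The following is a minimal sketch using textbook definitions: recall counts a query as a hit if any relevant document appears in the ranking, and average precision is normalized by the number of relevant documents. Whether `retriever.eval()` uses exactly the same normalization is an assumption. The function names and the document ids are made up for illustration.

```python
def recall(ranked_ids_per_query, relevant_ids_per_query):
    """Fraction of queries with at least one relevant document in the ranking."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query)
        if any(doc_id in relevant for doc_id in ranked)
    )
    return hits / len(ranked_ids_per_query)


def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank over the ranks that hold a relevant document."""
    found = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            found += 1
            precision_sum += found / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(ranked_ids_per_query, relevant_ids_per_query):
    """Mean of the per-query average precision."""
    scores = [
        average_precision(ranked, relevant)
        for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query)
    ]
    return sum(scores) / len(scores)


# Hypothetical rankings for three queries
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6"]]
relevant = [{"d1"}, {"d2", "d4"}, {"d8"}]

toy_recall = recall(rankings, relevant)
toy_map = mean_average_precision(rankings, relevant)
print("Recall:", toy_recall)
print("Mean Avg Precision:", toy_map)
```

In this toy data the first two queries are hits and the third is a miss. Mean average precision is lower than recall because relevant documents that are ranked second or third contribute less than those ranked first.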


## Evaluation of Reader
Here we evaluate only the reader in a closed-domain fashion, i.e., the reader is given one query
and one document, and metrics are calculated on whether the right position in this text is selected by
the model as the answer span (i.e., SQuAD style).

In [10]:
# Evaluate Reader on its own
reader_eval_results = reader.eval(document_store=document_store, device=device, label_index=label_index, doc_index=doc_index)
# Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch
#reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json", device=device)

## Reader Top-N-Accuracy is the proportion of predicted answers that match with their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_labels/_search?scroll=1d&size=10000 [status:200 request:0.007s]
03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.002s]
03/31/2021 15:21:03 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.001s]
03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search?scroll=1d&size=10000 [status:200 request:0.030s]
03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.002s]
03/31/2021 15:21:03 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.001s]
03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.002s]
03/31/2021 15:21:03 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 

Reader Top-N-Accuracy: 97.5975975975976
Reader Exact Match: 74.47447447447448
Reader F1-Score: 75.30691611151383
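
Exact Match and F1 are easiest to understand on a single prediction. The sketch below follows the commonly used SQuAD-style recipe (lowercase, strip punctuation and English articles, then compare tokens). It is an illustration with hypothetical function names, not the code that `reader.eval()` runs, and it returns values between 0 and 1 rather than the percentages printed above.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower!", "eiffel tower"))
print(f1_score("in Paris, France", "Paris"))
```

F1 gives partial credit when the predicted span overlaps the gold answer without matching it exactly, which is why the F1 score is never lower than Exact Match on the same predictions.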


## Evaluation of Retriever and Reader (Open Domain)
Here we evaluate retriever and reader in an open-domain fashion, i.e., a document is considered
correctly retrieved if it contains the answer string within it. The reader is evaluated based purely on the
predicted string, regardless of which document this came from and the position of the extracted span.
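
The open-domain retrieval criterion described above boils down to a substring check. The following is a minimal sketch under two assumptions: matching is case-insensitive, and questions without a gold answer always count as correctly retrieved (as the printed retriever results for this tutorial state). The function names are hypothetical and do not mirror the internals of `EvalRetriever`.

```python
def contains_answer(document_text, gold_answers):
    """True if any gold answer string occurs in the document text."""
    text = document_text.lower()
    return any(answer.lower() in text for answer in gold_answers)


def retrieved_correctly(retrieved_texts, gold_answers):
    """Open-domain check for one query.

    Assumption: a query with no gold answers (no_answer) always
    counts as correctly retrieved.
    """
    if not gold_answers:
        return True
    return any(contains_answer(text, gold_answers) for text in retrieved_texts)


# Hypothetical retrieved passages for one question
passages = ["Berlin is the capital of Germany.", "Munich is in Bavaria."]
print(retrieved_correctly(passages, ["berlin"]))
print(retrieved_correctly(passages, ["Hamburg"]))
```

Because only the answer string matters, a passage from a different document than the labelled one still counts as a hit if it happens to contain the answer.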

In [11]:
from haystack import Pipeline

# Here is the pipeline definition
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["QAReader"])
results = []

In [12]:
# This is how to run the pipeline
for q, l in q_to_l_dict.items():
    res = p.run(
        query=q,
        top_k_retriever=10,
        labels=l,
        top_k_reader=10,
        index=doc_index,
    )
    results.append(res)

03/31/2021 15:21:45 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.005s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.31 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.80 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.75 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.82 Batches/s]
03/31/2021 15:21:46 - INFO - elasticsearch -   POST http://localhost:9200/tutorial5_docs/_search [status:200 request:0.003s]
Inferencing Samples:

In [13]:
# When we have run evaluation using the pipeline, we can print the results
eval_retriever.print()
print()
retriever.print_time()
print()
eval_reader.print(mode="reader")
print()
reader.print_time()
print()
eval_reader.print(mode="pipeline")


Retriever
-----------------
has_answer recall: 0.96 (24/25)
no_answer recall:  1.00 (25/25) (no_answer samples are always treated as correctly retrieved)
recall: 0.98 (49 / 50)

Retriever (Speed)
---------------
No indexing performed via Retriever.run()
Queries Performed: 50
Query time: 0.21246098399205948s
0.00424921967984119 seconds per query

Reader
-----------------
has answer queries: 24
top 1 EM: 0.16666666666666666
top k EM: 0.4583333333333333
top 1 F1: 0.3670704295704296
top k F1: 0.6450604950604949

no_answer queries: 25
top 1 no_answer accuracy: 0.56

Reader (Speed)
---------------
Queries Performed: 50
Query time: 52.35428286099341s
1.0470856572198681 seconds per query

Pipeline
-----------------
queries: 50
top 1 EM: 0.36
top k EM: 0.72
top 1 F1: 0.45619380619380623
top k F1: 0.8096290376290375
(top k results are likely inflated since the Reader always returns a no_answer prediction in its top k)
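
The note about inflated top k results can be illustrated with a small sketch. It assumes that a `no_answer` prediction is represented as `None` and that every candidate list contains one; the data and the function name `top_k_exact_match` are made up for illustration.

```python
def top_k_exact_match(candidates_per_query, gold_per_query, k):
    """Fraction of queries whose gold answer is among the first k candidates."""
    correct = sum(
        1
        for candidates, gold in zip(candidates_per_query, gold_per_query)
        if gold in candidates[:k]
    )
    return correct / len(gold_per_query)


# Hypothetical reader output: None stands for a no_answer prediction
candidates = [
    ["1989", None, "1990"],
    ["Paris", None],
    ["Berlin", None, "Bonn"],
]
gold = ["1990", None, "Munich"]  # the second question has no answer

top_1_em = top_k_exact_match(candidates, gold, k=1)
top_3_em = top_k_exact_match(candidates, gold, k=3)
print("top 1 EM:", top_1_em)
print("top 3 EM:", top_3_em)
```

The second question is counted as correct at top k only because a `no_answer` candidate is always present, even though the reader's first choice was wrong. This is the inflation effect the note refers to, so top 1 metrics are the more conservative number to report.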
