# Evaluation of a QA System

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)

To make a statement about the performance of a question-answering system, it is important to evaluate it. Evaluation also helps to determine which parts of the system can be improved.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Mon Oct 25 12:02:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [2]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

Collecting grpcio-tools==1.34.1
  Downloading grpcio_tools-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 12.2 MB/s 
Installing collected packages: grpcio-tools
Successfully installed grpcio-tools-1.34.1
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-x_63gv_n
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-x_63gv_n
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 1.5 MB/s 
Collecting mlflow<=1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
     |████████████████████████████████| 14.1 MB 31 kB/s 
Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 34.9 MB/s 
Collecting fastapi
  Downloading fastapi-0.70.0-py3-none-any.whl (51 kB)

In [3]:
# In Colab / No Docker environments: Start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
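
A fixed `sleep 30` can be too short on a slow machine and wastefully long on a fast one. The following is a minimal sketch of polling for readiness instead. It assumes Elasticsearch listens on its default address `http://localhost:9200`; the helper names `elasticsearch_is_up` and `wait_until` are hypothetical and not part of Haystack.

```python
import http.client
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET on `url` answers with status 200.

    The default URL is an assumption (Elasticsearch's default port).
    """
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until(probe, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage in place of the fixed sleep:
# if not wait_until(elasticsearch_is_up, timeout=60):
#     raise RuntimeError("Elasticsearch did not start in time")
```

The probe is passed in as a callable so the waiting logic can be exercised without a running server.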

In [1]:
from haystack.modeling.utils import initialize_device_settings

devices, n_gpu = initialize_device_settings(use_cuda=True)



In [2]:
from haystack.utils import fetch_archive_from_http

# Download evaluation data, which is a subset of Natural Questions development set containing 50 documents
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

True

In [3]:
# Make sure these indices do not collide with existing ones; the indices will be wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

In [4]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document",
                                            create_index=False, embedding_field="emb",
                                            embedding_dim=768, excluded_meta_data=["emb"])

In [5]:
from haystack.nodes import PreProcessor

# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to not have duplicate elements
# and also split our documents into shorter passages using the PreProcessor
preprocessor = PreProcessor(
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False
)
document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)
document_store.add_eval_data(
    filename="../data/nq/nq_dev_subset_v2.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor
)

# Let's prepare the labels that we need for the retriever and the reader
labels = document_store.get_all_labels_aggregated(index=label_index, drop_negative_labels=True, drop_no_answers=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
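
To make the effect of `split_length=200` and `split_overlap=0` more concrete, here is a rough sketch of word-based splitting in plain Python. It assumes the splitting unit is whitespace-separated words and is not Haystack's implementation: unlike the `PreProcessor` configured above, it collapses whitespace, and the function name `split_by_words` is hypothetical.

```python
def split_by_words(text, split_length=200, split_overlap=0):
    """Split `text` into passages of at most `split_length` words.

    Consecutive passages share `split_overlap` words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), step)
    ]


# A 450-word document yields passages of 200, 200 and 50 words.
document = " ".join(f"w{i}" for i in range(450))
print([len(p.split()) for p in split_by_words(document)])
```

Shorter passages keep each chunk within the reader's sequence length limit, at the cost of sometimes cutting an answer's context at a passage boundary.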


## Initialize components of QA-System

In [6]:
# Initialize Retriever
from haystack.nodes import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)
# Alternative: Evaluate dense Retrievers (DPR and SentenceTransformers)
# Dense Passage Retrieval uses separate transformer-based encoders for the query and the document
# SentenceTransformers have a single encoder for both
# Please make sure the "embedding_dim" parameter in the DocumentStore above matches the output dimension of your models!
# Please also take care that the PreProcessor splits your files into chunks that can be completely converted with
#        the max_seq_len limitations of Transformers
# The SentenceTransformer model "all-mpnet-base-v2" generally works well on any kind of English text.
# For more information check out the documentation at: https://www.sbert.net/docs/pretrained_models.html
# from haystack.nodes import DensePassageRetriever, EmbeddingRetriever
# retriever = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#                                   use_gpu=True,
#                                   max_seq_len_passage=256,
#                                   embed_title=True)
# retriever = EmbeddingRetriever(document_store=document_store, model_format="sentence_transformers",
#                                embedding_model="all-mpnet-base-v2")
# document_store.update_embeddings(retriever, index=doc_index)

In [7]:
# Initialize Reader
from haystack.nodes import FARMReader

reader = FARMReader("deepset/roberta-base-squad2", top_k=4, return_no_answer=True)


Some weights of the model checkpoint at deepset/roberta-base-squad2 were not used when initializing RobertaModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.


In [8]:
from haystack.nodes import EvalAnswers, EvalDocuments

# Here we initialize the nodes that perform evaluation
eval_retriever = EvalDocuments()
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

## Evaluation of Retriever
Here we evaluate only the retriever, based on whether the gold_label document is retrieved.

In [9]:
## Evaluate Retriever on its own
retriever_eval_results = retriever.eval(top_k=20, label_index=label_index, doc_index=doc_index)
## Retriever Recall is the proportion of questions for which the correct document containing the answer is
## among the retrieved documents
print("Retriever Recall:", retriever_eval_results["recall"])
## Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])

100%|██████████| 771/771 [00:07<00:00, 106.27it/s]


Retriever Recall: 0.32814526588845655
Retriever Mean Avg Precision: 0.09174006526330329
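
The two retriever metrics can be illustrated on toy data. This is a sketch using textbook definitions (a query counts as a hit if any relevant document is retrieved; average precision is normalised by the number of relevant documents). Haystack's `retriever.eval` may differ in details, and the document IDs below are made up.

```python
def is_hit(retrieved_ids, relevant_ids):
    """True if at least one relevant document is among the retrieved ones."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids)


def average_precision(retrieved_ids, relevant_ids):
    """Average of precision@rank over the ranks where a relevant document appears.

    Assumes `retrieved_ids` contains no duplicates.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def evaluate_retrieval(runs):
    """`runs` is a list of (retrieved_ids, relevant_ids) pairs, one per query."""
    n_queries = len(runs)
    recall = sum(is_hit(r, g) for r, g in runs) / n_queries
    mean_ap = sum(average_precision(r, g) for r, g in runs) / n_queries
    return {"recall": recall, "map": mean_ap}


# Made-up example: three queries with ranked results and gold document IDs.
toy_runs = [
    (["d3", "d1", "d7"], {"d1"}),        # relevant document at rank 2
    (["d2", "d4"], {"d9"}),              # relevant document not retrieved
    (["d5", "d6", "d8"], {"d5", "d8"}),  # relevant documents at ranks 1 and 3
]
print(evaluate_retrieval(toy_runs))
```

Recall only asks whether a relevant document was found at all, while mean average precision also rewards placing it near the top of the ranking. That is why the two numbers above can differ so much.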


## Evaluation of Reader
Here we evaluate only the reader in a closed-domain fashion, i.e., the reader is given one query
and one document, and metrics are calculated on whether the right position in this text is selected by
the model as the answer span (SQuAD style).

In [10]:
# Evaluate Reader on its own
reader_eval_results = reader.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
# Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch
#reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json", device=devices[0])

## Reader Top-N-Accuracy is the proportion of predicted answers that match with their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Evaluating: 100%|██████████| 32/32 [00:45<00:00,  1.41s/it]

Reader Top-N-Accuracy: 98.94875164257556
Reader Exact Match: 84.62549277266754
Reader F1-Score: 85.16284899931244
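
Exact Match and F1 can be sketched in a few lines following the common SQuAD-style recipe: normalise both strings, then compare them exactly (EM) or by token overlap (F1). This is an illustration under those assumptions rather than the code Haystack runs, and it returns values between 0 and 1, whereas the numbers printed above appear to be on a 0 to 100 scale.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match (e.g. a correct "no answer").
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction, gold_answers):
    """Score a prediction against several gold answers and keep the best."""
    return max(metric(prediction, gold) for gold in gold_answers)


print(exact_match("The Eiffel Tower", "eiffel tower."))
print(f1_score("in Paris, France", "Paris"))
```

F1 gives partial credit when the predicted span is slightly too long or too short, which is why it is usually at least as high as Exact Match.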





## Evaluation of Retriever and Reader (Open Domain)
Here we evaluate retriever and reader in an open-domain fashion, i.e., a document is considered
correctly retrieved if it contains the answer string. The reader is evaluated based purely on the
predicted string, regardless of which document it came from and the position of the extracted span.
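
The open-domain retrieval criterion can be sketched as a plain substring check. The function name is hypothetical, the match is case-sensitive as a simplifying assumption, and questions whose gold answer is the empty string (no-answer samples) are treated as always correctly retrieved, which is also an assumption of this sketch.

```python
def retrieval_is_correct(retrieved_texts, gold_answers):
    """Open-domain criterion: some retrieved text contains some gold answer string."""
    # Assumption: an empty gold answer marks a no-answer question,
    # for which retrieval is counted as correct regardless of the results.
    if "" in gold_answers:
        return True
    return any(
        answer in text
        for text in retrieved_texts
        for answer in gold_answers
    )


# Made-up example passages
passages = [
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
print(retrieval_is_correct(passages, ["Paris"]))
print(retrieval_is_correct(passages, ["Madrid"]))
```

Because any document containing the answer string counts, this criterion is more lenient than matching against a specific gold document ID.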

In [11]:
from haystack import Pipeline

# Here is the pipeline definition
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["Retriever"])
p.add_node(component=reader, name="Reader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["Reader"])
results = []

In [12]:
# This is how to run the pipeline
for label in labels:
    res = p.run(
        query=label.query,
        labels=label,
        params={"index": doc_index, "Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
    )
    results.append(res)

There seem to be empty string labels in the dataset suggesting that there are samples with is_impossible=True. Retrieval of these samples is always treated as correct.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.01 Batches/s]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [13]:
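# After running the evaluation through the pipeline, print the collected results
n_queries = len(labels)  # number of labels used in the evaluation
eval_retriever.print()  # retriever recall and mean reciprocal rank
print()
retriever.print_time()  # retriever query time
print()
eval_reader.print(mode="reader")  # reader metrics (EM, F1, SAS)
print()
reader.print_time()  # reader query time
print()
eval_reader.print(mode="pipeline")  # metrics for the whole pipeline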

EvalRetriever
-----------------
has_answer recall@10: 0.9200 (23/25)
no_answer recall@10:  1.00 (25/25) (no_answer samples are always treated as correctly retrieved)
has_answer mean_reciprocal_rank@10: 0.6110
no_answer mean_reciprocal_rank@10:  1.0000 (no_answer samples are always treated as correctly retrieved at rank 1)
recall@10: 0.9600 (48 / 50)
mean_reciprocal_rank@10: 0.8055

Retriever (Speed)
---------------
No indexing performed via Retriever.run()
Queries Performed: 50
Query time: 0.5041560229994957s
0.010083120459989913 seconds per query

Reader
-----------------
has answer queries: 23
top 1 EM: 0.2609
top k EM: 0.5217
top 1 F1: 0.3364
top k F1: 0.6180
top 1 SAS: 0.5375
top k SAS: 0.7885

no_answer queries: 25
top 1 no_answer accuracy: 0.0000

Reader (Speed)
---------------
Queries Performed: 50
Query time: 80.12184924100029s
1.6024369848200057 seconds per query

Pipeline
-----------------
queries: 50
top 1 EM: 0.1200
top k EM: 0.7400
top 1 F1: 0.1548
top k F1: 0.7843
top 1 S
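
To make the retriever numbers above more concrete, here is a minimal sketch of how recall@k and mean reciprocal rank can be computed from ranked retrieval results. It is not Haystack's implementation: the helper names (`first_relevant_rank`, `recall_at_k`, `mean_reciprocal_rank`) and the toy document IDs are hypothetical, and recall@k is assumed to mean the fraction of queries with at least one relevant document in the top k, which matches the `(23/25)` style of the report.

```python
from typing import List, Optional, Set


def first_relevant_rank(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> Optional[int]:
    """Return the 1-based rank of the first relevant document in the top k, or None."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return rank
    return None


def recall_at_k(all_ranked: List[List[str]], all_relevant: List[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        first_relevant_rank(ranked, relevant, k) is not None
        for ranked, relevant in zip(all_ranked, all_relevant)
    )
    return hits / len(all_ranked)


def mean_reciprocal_rank(all_ranked: List[List[str]], all_relevant: List[Set[str]], k: int) -> float:
    """Average of 1/rank of the first relevant document; a miss contributes 0."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        rank = first_relevant_rank(ranked, relevant, k)
        if rank is not None:
            total += 1.0 / rank
    return total / len(all_ranked)


# Toy data (hypothetical document IDs): three queries with three results each
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d6"}, {"d0"}]

recall = recall_at_k(ranked, relevant, k=3)        # 2 of 3 queries are hits
mrr = mean_reciprocal_rank(ranked, relevant, k=3)  # (1 + 1/3 + 0) / 3
print(f"recall@3: {recall:.4f}")
print(f"mean_reciprocal_rank@3: {mrr:.4f}")
```

The overall figures in the report are consistent with this reading: because no_answer samples count as retrieved at rank 1, recall@10 is (23 + 25) / 50 = 0.96, and the overall mean reciprocal rank is (0.6110 * 25 + 1.0 * 25) / 50 = 0.8055.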

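The reader metrics EM and F1 can be illustrated in a similar way. The sketch below uses the common SQuAD-style definitions (normalized string equality for EM, token overlap for F1). This is an assumption: Haystack's own normalization may differ in detail, and the function names and example strings are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, one empty as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy examples (hypothetical predictions and gold answers)
em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = f1_score("in Paris, France", "Paris")
print(f"EM: {em}, F1: {f1}")
```

Under these definitions F1 is never lower than EM for a given answer, since a partially overlapping prediction still earns partial credit. This fits the report, where top 1 F1 (0.3364) is above top 1 EM (0.2609).
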
## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: Industry-specific language models & large-scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)