# Evaluation of a Pipeline and its Components

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)

To judge the quality of the results that a question-answering pipeline, or any other Haystack pipeline, produces, you need to evaluate it. Evaluation also shows which components of the pipeline can be improved.
The evaluation results can be saved as CSV files, which contain all the information needed to calculate additional metrics later or to inspect individual predictions.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [2]:
# Make sure you have a GPU running
!nvidia-smi


In [6]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
# Quote the URL: shells such as zsh otherwise treat the [colab] brackets as a glob pattern
!pip install "git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]"


## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can instead download the Elasticsearch archive manually and run it directly.

In [11]:
# If Docker is available: Start Elasticsearch as docker container
# from haystack.utils import launch_es
# launch_es()

# Alternative in Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
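
A fixed `sleep 30` can be too short on a slow machine and wastes time on a fast one. As a rough alternative, the sketch below polls the server until it answers. It assumes Elasticsearch listens on `http://localhost:9200` (its default HTTP port). `elasticsearch_is_up` and `wait_until_ready` are hypothetical helper names, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET on `url` succeeds (assumed default ES address)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or server not ready yet
        return False


def wait_until_ready(probe, timeout_s=60.0, poll_s=1.0):
    """Call `probe()` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage (commented out so the sketch has no side effects when pasted):
# if not wait_until_ready(elasticsearch_is_up):
#     raise RuntimeError("Elasticsearch did not start in time")
```

Passing the probe as a callable keeps the waiting logic independent of Elasticsearch, so the same loop can wait for any other service.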


## Fetch, Store And Preprocess the Evaluation Dataset

In [13]:
from haystack.utils import fetch_archive_from_http

# Download evaluation data, which is a subset of Natural Questions development set containing 50 documents with one question per document and multiple annotated answers
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

INFO - haystack.utils.import_utils -  Found data stored in `../data/nq`. Delete this first if you really want to fetch new data.


False

In [14]:
# Make sure these indices do not collide with existing ones; they are wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

In [15]:
from haystack.document_stores import ElasticsearchDocumentStore

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index=doc_index,
    label_index=label_index,
    embedding_field="emb",
    embedding_dim=768,
    excluded_meta_data=["emb"],
)

In [16]:
from haystack.nodes import PreProcessor

# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to not have duplicate elements
# and also split our documents into shorter passages using the PreProcessor
preprocessor = PreProcessor(
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)
document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

# The add_eval_data() method converts the given dataset in json format into Haystack document and label objects. Those objects are then indexed in their respective document and label index in the document store. The method can be used with any dataset in SQuAD format.
document_store.add_eval_data(
    filename="../data/nq/nq_dev_subset_v2.json", doc_index=doc_index, label_index=label_index, preprocessor=preprocessor
)
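
For readers who want to bring their own data, the sketch below shows what a minimal dataset in SQuAD layout can look like and how its nesting is traversed. The article, question, and answer are made up for illustration. `iter_questions` is a hypothetical helper, and `is_impossible` is the SQuAD v2 flag for unanswerable questions.

```python
# A minimal, made-up dataset in SQuAD layout (illustrative content only).
squad_like = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "Haystack is an open source framework for building search systems.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is Haystack?",
                            "answers": [{"text": "an open source framework", "answer_start": 12}],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ]
}


def iter_questions(dataset):
    """Yield (question, context, answers) triples from a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield qa["question"], paragraph["context"], qa["answers"]


for question, context, answers in iter_questions(squad_like):
    for answer in answers:
        start = answer["answer_start"]
        # The character offset must point at the answer text inside the context.
        assert context[start:start + len(answer["text"])] == answer["text"]
```

Checking the offsets like this before indexing is a cheap way to catch misaligned annotations.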

## Initialize the Two Components of an ExtractiveQAPipeline: Retriever and Reader

In [23]:
# Initialize Retriever
from haystack.nodes import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)

# Alternative: Evaluate dense retrievers (EmbeddingRetriever or DensePassageRetriever)
# The EmbeddingRetriever uses a single transformer-based encoder model for both queries and documents.
# In contrast, DensePassageRetriever uses two separate encoders, one for queries and one for documents.

# Please make sure the "embedding_dim" parameter in the DocumentStore above matches the output dimension of your models!
# Please also make sure that the PreProcessor splits your files into chunks that fit completely within
#        the max_seq_len limit of the Transformer models
# The SentenceTransformer model "sentence-transformers/multi-qa-mpnet-base-dot-v1" generally works well with the EmbeddingRetriever on any kind of English text.
# For more information and suggestions on different models check out the documentation at: https://www.sbert.net/docs/pretrained_models.html

# from haystack.nodes import EmbeddingRetriever, DensePassageRetriever
# retriever = EmbeddingRetriever(document_store=document_store, model_format="sentence_transformers",
#                                embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")
# retriever = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#                                   use_gpu=True,
#                                   max_seq_len_passage=256,
#                                   embed_title=True)
# document_store.update_embeddings(retriever, index=doc_index)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1
INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 1334 docs ...



In [24]:
# Initialize Reader
from haystack.nodes import FARMReader

reader = FARMReader("deepset/roberta-base-squad2", top_k=4, return_no_answer=True)

# Define a pipeline consisting of the initialized retriever and reader
from haystack.pipelines import ExtractiveQAPipeline

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# The evaluation also works with any other pipeline.
# For example you could use a DocumentSearchPipeline as an alternative:

# from haystack.pipelines import DocumentSearchPipeline
# pipeline = DocumentSearchPipeline(retriever=retriever)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

INFO - haystack.modeling.infer -  Got ya 7 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0    0    0    0    0    0 
INFO - haystack.modeling.infer -  /w\  /w\  /w\  /w\  /w\  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \  /'\  /'\  / \  / \  /'\




## Evaluation of an ExtractiveQAPipeline
Here we evaluate the retriever and the reader in open-domain fashion on the full corpus of documents, i.e. a document is considered
correctly retrieved if it contains the gold answer string. The reader is evaluated purely on the
predicted answer string, regardless of which document it came from and where the extracted span is located.

The generation of predictions is separated from the calculation of metrics. This allows you to run the computation-heavy model predictions only once and then iterate flexibly on the metrics or reports you want to generate.
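
The two ideas above can be illustrated with a small sketch: an open-domain relevance check based on the answer string, and metrics computed from stored predictions in a separate step. All data below is made up, `contains_gold_answer` and `recall_at_k` are hypothetical helpers, and the case-insensitive comparison is an assumption of this sketch rather than a statement about Haystack's matching rules.

```python
def contains_gold_answer(document_text, gold_answers):
    """Open-domain relevance check: does the text contain any gold answer string?"""
    text = document_text.lower()
    return any(answer.lower() in text for answer in gold_answers)


# Step 1: predictions are produced once (here: hard-coded toy retrieval results,
# listed in ranked order for each query).
predictions = {
    "who wrote hamlet": ["Macbeth is a play.", "Hamlet is a tragedy by William Shakespeare."],
    "capital of france": ["Berlin is the capital of Germany.", "Lyon is a city in France."],
}
gold = {
    "who wrote hamlet": ["William Shakespeare", "Shakespeare"],
    "capital of france": ["Paris"],
}


# Step 2: metrics can be recomputed from the stored predictions as often as needed.
def recall_at_k(predictions, gold, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        any(contains_gold_answer(doc, gold[query]) for doc in docs[:k])
        for query, docs in predictions.items()
    )
    return hits / len(predictions)


for k in (1, 2):
    print(f"recall@{k} = {recall_at_k(predictions, gold, k)}")
```

Because the predictions are plain data, changing `k` or swapping the relevance check does not require running any model again.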


In [25]:
from haystack.schema import EvaluationResult, MultiLabel

# We can load evaluation labels from the document store
# We are also opting to filter out no_answer samples
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_labels = [label for label in eval_labels if not label.no_answer]  # filter out no_answer cases

# Alternative: Define queries and labels directly
# (this requires: from haystack.schema import Answer, Document, Label, Span)

# eval_labels = [
#    MultiLabel(
#        labels=[
#            Label(
#                query="who is written in the book of life",
#                answer=Answer(
#                    answer="every person who is destined for Heaven or the World to Come",
#                    offsets_in_context=[Span(374, 434)]
#                ),
#                document=Document(
#                    id='1b090aec7dbd1af6739c4c80f8995877-0',
#                    content_type="text",
#                    content=(
#                        "Book of Life - wikipedia Book of Life Jump to: navigation, search This article is "
#                        "about the book mentioned in Christian and Jewish religious teachings..."
#                    )
#                ),
#                is_correct_answer=True,
#                is_correct_document=True,
#                origin="gold-label"
#            )
#        ]
#    )
# ]

# Similar to pipeline.run() we can execute pipeline.eval()
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})

INFO - haystack.document_stores.base -  Numba not found, replacing njit() with no-op implementation. Enable it with 'pip install numba'.
INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/

In [26]:
# The EvaluationResult contains a pandas dataframe for each pipeline node.
# That's why there are two dataframes in the EvaluationResult of an ExtractiveQAPipeline.

retriever_result = eval_result["Retriever"]
retriever_result.head()

| | multilabel_id | query | filters | gold_document_contents | content | gold_id_match | answer_match | gold_id_or_answer_match | rank | document_id | gold_document_ids | type | node | eval_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72540922682054949 | who is written in the book of life | b'null' | [Book of Life - wikipedia Book of Life Jump to: navigation, search This arti... | apostles' names are \`\`written in heaven'' (Luke x. 20), or \`\`the fellow-work... | 0.0 | 0.0 | 0.0 | 1.0 | 1b090aec7dbd1af6739c4c80f8995877-3 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | document | Retriever | integrated |
| 1 | 72540922682054949 | who is written in the book of life | b'null' | [Book of Life - wikipedia Book of Life Jump to: navigation, search This arti... | Book of Life - wikipedia Book of Life Jump to: navigation, search This artic... | 1.0 | 1.0 | 1.0 | 2.0 | 1b090aec7dbd1af6739c4c80f8995877-0 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | document | Retriever | integrated |
| 2 | 72540922682054949 | who is written in the book of life | b'null' | [Book of Life - wikipedia Book of Life Jump to: navigation, search This arti... | as adversaries (of God). Also, according to ib. xxxvi. 10, one who contrives... | 0.0 | 0.0 | 0.0 | 3.0 | 1b090aec7dbd1af6739c4c80f8995877-2 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | document | Retriever | integrated |
| 3 | 72540922682054949 | who is written in the book of life | b'null' | [Book of Life - wikipedia Book of Life Jump to: navigation, search This arti... | class of men who are no longer held in suspense. (Benonim, the middle), and ... | 0.0 | 0.0 | 0.0 | 4.0 | 1b090aec7dbd1af6739c4c80f8995877-5 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | document | Retriever | integrated |
| 4 | 72540922682054949 | who is written in the book of life | b'null' | [Book of Life - wikipedia Book of Life Jump to: navigation, search This arti... | people considered righteous before God. God has such a book, and to be blott... | 0.0 | 0.0 | 0.0 | 5.0 | 1b090aec7dbd1af6739c4c80f8995877-1 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | document | Retriever | integrated |
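
The `rank` and match columns in the rows above are the raw material for retrieval metrics. As a rough sketch, the code below aggregates toy rows into a single-hit recall and a mean reciprocal rank. It uses `gold_id_or_answer_match` as the relevance flag, which is an assumption of this example. `retrieval_metrics` is a hypothetical helper, not Haystack's implementation, and the rows are made up apart from mirroring the first query's match at rank 2.

```python
# Toy rows mirroring three columns of the retriever result (values are made up,
# except that the first query has its only match at rank 2, as in the rows above).
rows = [
    {"multilabel_id": "q1", "rank": 1, "gold_id_or_answer_match": 0.0},
    {"multilabel_id": "q1", "rank": 2, "gold_id_or_answer_match": 1.0},
    {"multilabel_id": "q1", "rank": 3, "gold_id_or_answer_match": 0.0},
    {"multilabel_id": "q2", "rank": 1, "gold_id_or_answer_match": 1.0},
    {"multilabel_id": "q2", "rank": 2, "gold_id_or_answer_match": 0.0},
    {"multilabel_id": "q3", "rank": 1, "gold_id_or_answer_match": 0.0},
    {"multilabel_id": "q3", "rank": 2, "gold_id_or_answer_match": 0.0},
]


def retrieval_metrics(rows):
    """Return (single-hit recall, mean reciprocal rank) aggregated over queries."""
    queries = set()
    first_hit_rank = {}
    for row in rows:
        query_id = row["multilabel_id"]
        queries.add(query_id)
        if row["gold_id_or_answer_match"] > 0:
            rank = row["rank"]
            # Keep the best (lowest) rank at which a relevant document appears
            first_hit_rank[query_id] = min(rank, first_hit_rank.get(query_id, rank))
    recall = len(first_hit_rank) / len(queries)
    mrr = sum(1.0 / rank for rank in first_hit_rank.values()) / len(queries)
    return recall, mrr


recall, mrr = retrieval_metrics(rows)
print(f"recall_single_hit={recall:.3f}  mrr={mrr:.3f}")
```

A query without any relevant document contributes zero to both metrics, which is why `q3` lowers the averages.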


In [27]:
reader_result = eval_result["Reader"]
reader_result.head()

| | multilabel_id | query | filters | gold_answers | answer | context | exact_match | f1 | rank | document_id | gold_document_ids | offsets_in_document | gold_offsets_in_documents | type | node | eval_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72540922682054949 | who is written in the book of life | b'null' | [all people considered righteous before God, every person who is destined fo... | | | 0.0 | 0.0 | 1.0 | | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | [{'start': 0, 'end': 0}] | [{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}] | answer | Reader | integrated |
| 1 | 72540922682054949 | who is written in the book of life | b'null' | [all people considered righteous before God, every person who is destined fo... | God records the names of every person who is destined for Heaven or the Worl... | n tēs Zōēs) is the book in which God records the names of every person who i... | 0.0 | 0.846154 | 2.0 | 1b090aec7dbd1af6739c4c80f8995877-0 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | [{'start': 349, 'end': 434}] | [{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}] | answer | Reader | integrated |
| 2 | 72540922682054949 | who is written in the book of life | b'null' | [all people considered righteous before God, every person who is destined fo... | all people considered righteous before God | Hebrew Bible(edit) In the Hebrew Bible the Book of Life - the book or muste... | 1.0 | 1.0 | 3.0 | 1b090aec7dbd1af6739c4c80f8995877-0 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | [{'start': 1107, 'end': 1149}] | [{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}] | answer | Reader | integrated |
| 3 | 72540922682054949 | who is written in the book of life | b'null' | [all people considered righteous before God, every person who is destined fo... | those whose names are written in the Book of Life from the foundation of the... | ohn of Patmos. As described, only those whose names are written in the Book ... | 0.0 | 0.083333 | 4.0 | 1b090aec7dbd1af6739c4c80f8995877-3 | [1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0] | [{'start': 576, 'end': 658}] | [{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}] | answer | Reader | integrated |
| 0 | -8461897212797475232 | who was the girl in the video brenda got a baby | b'null' | [Ethel \`\`Edy'' Proctor] | her cousin | ng a story in the newspaper of a 12-year-old girl getting pregnant by her co... | 0.0 | 0.0 | 1.0 | 965a125f65658579529b39f8e4344969-3 | [965a125f65658579529b39f8e4344969-3] | [{'start': 423, 'end': 433}] | [{'start': 181, 'end': 202}] | answer | Reader | integrated |
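
The `exact_match` and `f1` columns compare each predicted answer string with the gold answers. The sketch below shows the general idea behind a token-level F1 and an exact match. The normalization here (lowercasing and whitespace splitting) is a deliberate simplification and an assumption of this example, so its scores need not reproduce the values in the rows above. All function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction, gold_answers):
    """With several annotated answers, score against the best-matching one."""
    return max(metric(prediction, gold) for gold in gold_answers)


gold_answer = "all people considered righteous before God"
print(exact_match("All people considered righteous before God", gold_answer))
print(token_f1("all righteous people", gold_answer))
print(token_f1("her cousin", "Ethel ``Edy'' Proctor"))
```

F1 gives partial credit when a prediction shares tokens with a gold answer, whereas exact match is all or nothing.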


In [28]:
# We can filter for all documents retrieved for a given query
query = "who is written in the book of life"
retriever_book_of_life = retriever_result[retriever_result["query"] == query]

In [None]:
# We can also filter for all answers predicted for a given query
reader_book_of_life = reader_result[reader_result["query"] == query]

In [30]:
# Save the evaluation result so that we can reload it later and calculate evaluation metrics without running the pipeline again.
eval_result.save("../")

## Calculating Evaluation Metrics
Load an EvaluationResult to quickly calculate standard evaluation metrics for all predictions,
such as the F1 score of each individual prediction of the Reader node or the recall of the Retriever.
To learn more about the metrics, see [Evaluation Metrics](https://haystack.deepset.ai/guides/evaluation#metrics-retrieval).

In [31]:
saved_eval_result = EvaluationResult.load("../")
metrics = saved_eval_result.calculate_metrics()
print(f'Retriever - Recall (single relevant document): {metrics["Retriever"]["recall_single_hit"]}')
print(f'Retriever - Recall (multiple relevant documents): {metrics["Retriever"]["recall_multi_hit"]}')
print(f'Retriever - Mean Reciprocal Rank: {metrics["Retriever"]["mrr"]}')
print(f'Retriever - Precision: {metrics["Retriever"]["precision"]}')
print(f'Retriever - Mean Average Precision: {metrics["Retriever"]["map"]}')

print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')

Retriever - Recall (single relevant document): 0.8
Retriever - Recall (multiple relevant documents): 0.8
Retriever - Mean Reciprocal Rank: 0.5773333333333334
Retriever - Precision: 0.16
Retriever - Mean Average Precision: 0.5773333333333334
Reader - F1-Score: 0.4957658522707067
Reader - Exact Match: 0.4


## Generating an Evaluation Report
A summary of the evaluation results can be printed to get a quick overview. It includes some aggregated metrics and also shows a few wrongly predicted examples.

In [32]:
pipeline.print_eval_report(saved_eval_result)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:  0.92
                        | recall_single_hit_top_1:  0.92
                        |
                      Reader
                        |
                        | exact_match:   0.4
                        | exact_match_top_1:  0.04
                        | f1: 0.496
                        | f1_top_1: 0.0804
                        |
                      Output

                Wrong Reader Examples
Query: 
 	who was the girl in the video brenda got a baby
Gold Answers: 
 	Ethel ``Edy'' Proctor
Gold Document Ids: 
 	965a125f65658579529b39f8e4344969-3
Metrics: 
 	f1: 0.0
 	exact_match: 0.0
Answers: 
 	multilabel_id: -8461897212797475232
 	filters: b'null'
 	answer: her cousin
 	context: ng a story in the newspaper of a 12-year-old girl getting pregnant by her c

## Advanced Evaluation Metrics
Semantic answer similarity (SAS) is a more advanced evaluation metric. It measures whether the meaning of a predicted answer is similar to the annotated gold answer, rather than only comparing strings.
SAS relies on pre-trained models for this. For English, we recommend "cross-encoder/stsb-roberta-large", whereas for German we recommend "deepset/gbert-large-sts". A good multilingual model is "sentence-transformers/paraphrase-multilingual-mpnet-base-v2".
More information on this metric can be found in our [paper](https://arxiv.org/abs/2108.06130) or in our [blog post](https://www.deepset.ai/blog/semantic-answer-similarity-to-evaluate-qa).

In [33]:
advanced_eval_result = pipeline.eval(
    labels=eval_labels, params={"Retriever": {"top_k": 1}}, sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)

metrics = advanced_eval_result.calculate_metrics()
print(metrics["Reader"]["sas"])

0.43349278


## Isolated Evaluation Mode
Isolated node evaluation feeds the labels to the Reader node as input instead of the output of the preceding Retriever node.
This lets us additionally calculate the upper bounds of the Reader's evaluation metrics, that is, the scores the Reader would reach if the Retriever were perfect. Note that integrated evaluation still runs even when isolated evaluation is enabled.
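
The difference between integrated and isolated evaluation can be illustrated with a toy sketch. Everything in it is hypothetical: `toy_retriever` and `toy_reader` are stand-in functions and not Haystack nodes, and the corpus and labels are invented. The only point is that handing the gold document to the reader removes retrieval errors, so the isolated score acts as an upper bound for the integrated one.

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus and labels
DOCS: Dict[str, str] = {
    "d1": "Paris is the capital of France.",
    "d2": "Berlin is the capital of Germany.",
}
LABELS: List[Dict[str, str]] = [
    {"query": "capital of France", "gold_doc": "d1", "answer": "Paris"},
    {"query": "capital of Germany", "gold_doc": "d2", "answer": "Berlin"},
]


def toy_retriever(query: str) -> List[str]:
    """A deliberately weak retriever that returns the same document for every query."""
    return ["d1"]


def toy_reader(query: str, doc_ids: List[str]) -> str:
    """A stand-in reader that 'extracts' the first word of the first document."""
    return DOCS[doc_ids[0]].split()[0]


def exact_match_rate(get_doc_ids: Callable[[Dict[str, str]], List[str]]) -> float:
    """Share of labels for which the reader returns the gold answer."""
    hits = [
        toy_reader(label["query"], get_doc_ids(label)) == label["answer"]
        for label in LABELS
    ]
    return sum(hits) / len(hits)


# Integrated: the reader sees whatever the retriever returned
integrated_em = exact_match_rate(lambda label: toy_retriever(label["query"]))
# Isolated: the reader sees the gold document taken from the label
isolated_em = exact_match_rate(lambda label: [label["gold_doc"]])

print("integrated exact_match:", integrated_em)
print("isolated exact_match (upper bound):", isolated_em)
```

A large gap between the two numbers suggests that improving the retriever is likely to help more than improving the reader.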


In [34]:
eval_result_with_upper_bounds = pipeline.eval(
    labels=eval_labels, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}}, add_isolated_node_eval=True
)


In [35]:
pipeline.print_eval_report(eval_result_with_upper_bounds)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:  0.92
                        | recall_single_hit_top_1:  0.92
                        |
                      Reader
                        |
                        | exact_match upper bound:  0.48
                        | exact_match:  0.44
                        | exact_match_top_1:  0.04
                        | f1 upper bound: 0.603
                        | f1: 0.536
                        | f1_top_1: 0.0804
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	who took over for batman when his back was broken
Gold Document Ids: 
 	4c02586bad8a9fa0ef164d0b9bc27499-1
 	4c02586bad8a9fa0ef164d0b9bc27499-1
Metrics: 
 	recall_multi_hit: 0.0
 	recall_single_hit: 0.0
 	precision: 0.0
 	map: 0.0
 	mrr: 0.0
 	ndcg:

## Evaluation of Individual Components: Retriever
Sometimes you might want to evaluate individual components, for example, if you don't have a pipeline but only a retriever or a reader with a model that you trained yourself.
Here we evaluate only the retriever, based on whether the gold_label document is retrieved.

In [36]:
# Evaluate Retriever on its own
# Note that no_answer samples are omitted when evaluation is performed with this method
retriever_eval_results = retriever.eval(top_k=5, label_index=label_index, doc_index=doc_index)
# Retriever Recall is the proportion of questions for which the correct document containing the answer is
# among the retrieved documents
print("Retriever Recall:", retriever_eval_results["recall"])
# Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])

INFO - haystack.nodes.retriever.base -  Performing eval queries...
INFO - haystack.nodes.retriever.base -  For 20 out of 25 questions (80.00%), the answer was in the top-5 candidate passages selected by the retriever.


Retriever Recall: 0.8
Retriever Mean Avg Precision: 0.5773333333333333


As a sanity check, we can compare the recall from `retriever.eval()` with the multi-hit recall from `pipeline.eval(add_isolated_node_eval=True)`.
These two recall metrics are comparable only because we chose to filter out no_answer samples when generating eval_labels.


In [37]:
metrics = eval_result_with_upper_bounds.calculate_metrics()
print(metrics["Retriever"]["recall_multi_hit"])

0.8


## Evaluation of Individual Components: Reader
Here we evaluate only the reader in a closed-domain fashion, i.e., the reader is given one query
and its corresponding relevant document, and metrics are calculated on whether the right position in this text is selected by
the model as the answer span (SQuAD style).

In [38]:
# Evaluate Reader on its own
reader_eval_results = reader.eval(document_store=document_store, label_index=label_index, doc_index=doc_index)
# Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch
# reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json", device=device)

# Reader Top-N-Accuracy is the proportion of predicted answers that match with their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
# Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
# Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5
  start_indices = flat_sorted_indices // max_seq_len
Evaluating: 100%|██████████| 33/33 [01:58<00:00,  3.60s/it]

Reader Top-N-Accuracy: 99.09208819714657
Reader Exact Match: 84.95460440985732
Reader F1-Score: 85.48499103563783
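
The Exact Match and F1 scores reported above can be sketched as follows. This is a simplified, SQuAD-style illustration and not Haystack's or FARM's implementation: the `normalize` helper only lowercases and splits on whitespace (an assumption; real evaluation scripts typically also strip punctuation and articles), and the sketch returns fractions between 0 and 1 rather than the percentages printed by `reader.eval()`.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Simplified normalisation (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalised prediction equals the normalised gold answer."""
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    token_precision = overlap / len(pred_tokens)
    token_recall = overlap / len(gold_tokens)
    return 2 * token_precision * token_recall / (token_precision + token_recall)


# The wrong reader example from the evaluation report: no shared tokens
print(f1("her cousin", "Ethel ``Edy'' Proctor"))
# A partially overlapping, hypothetical prediction
print(f1("the Eiffel Tower", "Eiffel Tower"))
print(exact_match("Her Cousin", "her cousin"))
```

F1 gives partial credit for overlapping tokens, whereas Exact Match only rewards predictions that match the gold answer completely after normalisation.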





## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)