# Evaluation
To make a statement about the performance of a question-answering system, you need to evaluate it. Evaluation also helps determine which parts of the system can be improved.

## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch archive and run it directly.
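
The Docker route is mentioned above but not shown. The following is a minimal sketch of how the container could be launched from Python. It assumes the official `elasticsearch` image with the tag `7.6.2` (matching the archive version used in this notebook) and the default port 9200; `build_docker_command` is a hypothetical helper, not part of Haystack.

```python
import shlex


def build_docker_command(version="7.6.2", port=9200):
    """Build the argument list for a single-node Elasticsearch container.

    The image tag and port are assumptions chosen to match this notebook.
    """
    return [
        "docker", "run", "-d",
        "-p", f"{port}:9200",               # map the host port to the container port
        "-e", "discovery.type=single-node",  # run without cluster discovery
        f"elasticsearch:{version}",
    ]


command = build_docker_command()
print(shlex.join(command))

# To actually start the container (requires a working Docker installation):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list keeps it inspectable before anything is executed, which is convenient in environments where Docker may or may not be present.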

In [0]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack and the version of torch that works with the Colab GPUs
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [0]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
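
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. As an alternative, here is a minimal sketch that polls the server until it answers. It assumes Elasticsearch listens on `http://localhost:9200` without authentication; `http_probe` and `wait_until_ready` are hypothetical helpers written for this illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (assumes a local, unauthenticated Elasticsearch instance):
# ready = wait_until_ready(lambda: http_probe("http://localhost:9200"))
# print("Elasticsearch ready:", ready)
```

Passing the probe as a function keeps the waiting logic independent of HTTP, so the same loop could wait on any other readiness check.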

In [0]:
# install haystack
! pip install git+https://github.com/deepset-ai/haystack.git

In [6]:
from farm.utils import initialize_device_settings

device, n_gpu = initialize_device_settings(use_cuda=True)

06/05/2020 16:11:23 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None


In [7]:

from haystack.indexing.utils import fetch_archive_from_http

# Download evaluation data: a subset of the Natural Questions development set containing 50 documents
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

06/05/2020 16:11:26 - INFO - haystack.indexing.io -   Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset.json.zip to `../data/nq`
100%|██████████| 621983/621983 [00:01<00:00, 477723.47B/s]


True

In [0]:
# Connect to Elasticsearch
from haystack.database.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document",
                                            create_index=False, embedding_field="emb",
                                            embedding_dim=768, excluded_meta_data=["emb"])

In [9]:
# Add evaluation data to Elasticsearch database
document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")

06/05/2020 16:11:30 - INFO - elasticsearch -   POST http://localhost:9200/_bulk [status:200 request:1.613s]
06/05/2020 16:11:31 - INFO - elasticsearch -   POST http://localhost:9200/_bulk [status:200 request:0.453s]
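
The Reader section of this notebook refers to the evaluation data as a SQuAD-formatted file. The sketch below shows what that layout generally looks like and how questions with and without answers could be counted. The sample record is invented for illustration, and whether `nq_dev_subset_v2.json` follows exactly this layout is an assumption.

```python
import json

# A tiny, made-up dataset in the SQuAD v2 layout (not taken from the real file).
squad_like = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Germany?",
                            "is_impossible": False,
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Who founded Berlin?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}


def count_questions(dataset):
    """Return (number of questions with an answer, number without)."""
    with_answer = without_answer = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("answers"):
                    with_answer += 1
                else:
                    without_answer += 1
    return with_answer, without_answer


print(count_questions(squad_like))

# To inspect the downloaded file instead (path taken from the cell above):
# with open("../data/nq/nq_dev_subset_v2.json") as f:
#     print(count_questions(json.load(f)))
```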


## Initialize components of the QA system

In [0]:
# Initialize Retriever
from haystack.retriever.sparse import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)

# Alternative: Evaluate DensePassageRetriever
# from haystack.retriever.dense import DensePassageRetriever
# retriever = DensePassageRetriever(document_store=document_store, embedding_model="dpr-bert-base-nq",batch_size=32)
# document_store.update_embeddings(retriever, index="eval_document")

In [11]:
# Initialize Reader
from haystack.reader.farm import FARMReader

reader = FARMReader("deepset/roberta-base-squad2")

06/05/2020 16:11:31 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
06/05/2020 16:11:31 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
06/05/2020 16:12:46 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
06/05/2020 16:12:46 - INFO - farm.infer -    0 
06/05/2020 16:12:46 - INFO - farm.infer -   /w\
06/05/2020 16:12:46 - INFO - farm.infer -   /'\
06/05/2020 16:12:46 - INFO - farm.infer -   


In [0]:
# Initialize Finder, which combines the Reader and the Retriever
from haystack.finder import Finder

finder = Finder(reader, retriever)

## Evaluation of Retriever

In [13]:
# Evaluate Retriever on its own
retriever_eval_results = retriever.eval()

## Retriever Recall is the proportion of questions for which the correct document containing the answer is
## among the retrieved documents
print("Retriever Recall:", retriever_eval_results["recall"])
## Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])

06/05/2020 16:12:46 - INFO - elasticsearch -   POST http://localhost:9200/feedback/_search?scroll=5m&size=1000 [status:200 request:0.170s]
06/05/2020 16:12:46 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.069s]
06/05/2020 16:12:46 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:12:46 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.022s]
06/05/2020 16:12:46 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:12:46 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.021s]
06/05/2020 16:12:46 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:12:46 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.019s]
06/05/2020 16:12:46 - INFO - haystack.retriever.elasticsearch -   Go

Retriever Recall: 1.0
Retriever Mean Avg Precision: 0.9367283950617283
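
To make the two retriever metrics concrete, here is a minimal sketch that computes them on toy rankings. It assumes exactly one correct document per question, in which case average precision reduces to the reciprocal of the rank at which that document appears. The document IDs are invented, and this is not necessarily how Haystack computes the values internally.

```python
def retriever_metrics(ranked_ids_per_question, correct_id_per_question, top_k=10):
    """Compute recall and mean average precision for single-answer questions."""
    hits = 0
    summed_precision = 0.0
    for ranked, correct in zip(ranked_ids_per_question, correct_id_per_question):
        top = ranked[:top_k]
        if correct in top:
            hits += 1
            rank = top.index(correct) + 1     # 1-based position of the correct document
            summed_precision += 1.0 / rank    # higher rank (smaller number) scores more
    n = len(correct_id_per_question)
    return {"recall": hits / n, "map": summed_precision / n}


# Hypothetical rankings for three questions.
ranked = [
    ["d1", "d7", "d3"],  # correct document at rank 1
    ["d4", "d2", "d9"],  # correct document at rank 2
    ["d5", "d6", "d8"],  # correct document not retrieved
]
correct = ["d1", "d2", "d0"]

print(retriever_metrics(ranked, correct))
```

In this toy case two of three questions are hits, and the second hit contributes only half as much to the mean average precision because the correct document is ranked second.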


## Evaluation of Reader

In [14]:
# Evaluate Reader on its own
reader_eval_results = reader.eval(document_store=document_store, device=device)

# Evaluation of Reader can also be done directly on a SQuAD-formatted file
# without passing the data to Elasticsearch
#reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json", device=device)

## Reader Top-N-Accuracy is the proportion of predicted answers that match their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

06/05/2020 16:12:47 - INFO - elasticsearch -   POST http://localhost:9200/feedback/_search?scroll=5m&size=1000 [status:200 request:0.022s]
06/05/2020 16:12:47 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.005s]
06/05/2020 16:12:47 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.003s]
06/05/2020 16:12:47 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search?scroll=5m&size=1000 [status:200 request:0.039s]
06/05/2020 16:12:47 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.010s]
06/05/2020 16:12:47 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.003s]
Evaluating: 100%|██████████| 78/78 [00:31<00:00,  2.50it/s]


Reader Top-N-Recall: 0.6111111111111112
Reader Exact Match: 0.4074074074074074
Reader F1-Score: 0.4340132402934336
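
Exact Match and F1 can be illustrated on plain strings. The sketch below uses the common SQuAD-style token-overlap definition with a simplified normalization (lowercasing and whitespace splitting). This is an assumption made for illustration; the Reader's own evaluation may measure overlap differently, for example on answer offsets.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    true_tokens = normalize(truth)
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match (no-answer case), otherwise a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))
print(f1_score("the Eiffel Tower", "Eiffel Tower"))
```

The example prediction contains one extra token, so it fails Exact Match while still earning partial credit under F1, which is why F1 is usually at least as high as EM.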


## Evaluation of Finder

In [15]:
# Evaluate combination of Reader and Retriever through Finder
finder_eval_results = finder.eval()

print("\n___Retriever Metrics in Finder___")
print("Retriever Recall:", finder_eval_results["retriever_recall"])
print("Retriever Mean Avg Precision:", finder_eval_results["retriever_map"])

# The Reader is only evaluated on questions for which the correct document is among the retrieved ones
print("\n___Reader Metrics in Finder___")
print("Reader Top-1 accuracy:", finder_eval_results["reader_top1_accuracy"])
print("Reader Top-1 accuracy (has answer):", finder_eval_results["reader_top1_accuracy_has_answer"])
print("Reader Top-k accuracy:", finder_eval_results["reader_top_k_accuracy"])
print("Reader Top-k accuracy (has answer):", finder_eval_results["reader_topk_accuracy_has_answer"])
print("Reader Top-1 EM:", finder_eval_results["reader_top1_em"])
print("Reader Top-1 EM (has answer):", finder_eval_results["reader_top1_em_has_answer"])
print("Reader Top-k EM:", finder_eval_results["reader_topk_em"])
print("Reader Top-k EM (has answer):", finder_eval_results["reader_topk_em_has_answer"])
print("Reader Top-1 F1:", finder_eval_results["reader_top1_f1"])
print("Reader Top-1 F1 (has answer):", finder_eval_results["reader_top1_f1_has_answer"])
print("Reader Top-k F1:", finder_eval_results["reader_topk_f1"])
print("Reader Top-k F1 (has answer):", finder_eval_results["reader_topk_f1_has_answer"])
print("Reader Top-1 no-answer accuracy:", finder_eval_results["reader_top1_no_answer_accuracy"])
print("Reader Top-k no-answer accuracy:", finder_eval_results["reader_topk_no_answer_accuracy"])

# Time measurements
print("\n___Time Measurements___")
print("Total retrieve time:", finder_eval_results["total_retrieve_time"])
print("Avg retrieve time per question:", finder_eval_results["avg_retrieve_time"])
print("Total reader timer:", finder_eval_results["total_reader_time"])
print("Avg read time per question:", finder_eval_results["avg_reader_time"])
print("Total Finder time:", finder_eval_results["total_finder_time"])

06/05/2020 16:13:44 - INFO - elasticsearch -   POST http://localhost:9200/feedback/_search?scroll=5m&size=1000 [status:200 request:0.014s]
06/05/2020 16:13:44 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.015s]
06/05/2020 16:13:44 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:13:45 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.011s]
06/05/2020 16:13:45 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:13:45 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.021s]
06/05/2020 16:13:45 - INFO - haystack.retriever.elasticsearch -   Got 10 candidates from retriever
06/05/2020 16:13:45 - INFO - elasticsearch -   POST http://localhost:9200/eval_document/_search [status:200 request:0.016s]
06/05/2020 16:13:45 - INFO - haystack.retriever.elasticsearch -   Go


___Retriever Metrics in Finder___
Retriever Recall: 1.0
Retriever Mean Avg Precision: 0.9367283950617283

___Reader Metrics in Finder___
Reader Top-1 accuracy: 0.3333333333333333
Reader Top-1 accuracy (has answer): 0.12
Reader Top-k accuracy: 0.6851851851851852
Reader Top-k accuracy (has answer): 0.36
Reader Top-1 EM: 0.2777777777777778
Reader Top-1 EM (has answer): 0.0
Reader Top-k EM: 0.5370370370370371
Reader Top-k EM (has answer): 0.04
Reader Top-1 F1: 0.3891157185894027
Reader Top-1 F1 (has answer): 0.24048995215311006
Reader Top-k F1: 0.6400575387839845
Reader Top-k F1 (has answer): 0.2625242837734066
Reader Top-1 no-answer accuracy: 0.5172413793103449
Reader Top-k no-answer accuracy: 0.9655172413793104

___Time Measurements___
Total retrieve time: 1.1358914375305176
Avg retrieve time per question: 0.02049741480085585
Total reader timer: 717.9651441574097
Avg read time per question: 13.295561874354327
Total Finder time: 719.1010527610779
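
The overall Reader accuracy relates to the "has answer" and "no-answer" figures as a weighted average. The sketch below reproduces that relationship from the printed values. The split into 25 questions with an answer and 29 without is not stated in the output; it is an assumption inferred from the printed ratios.

```python
def combine(acc_has_answer, n_has_answer, acc_no_answer, n_no_answer):
    """Weighted average of two subgroup accuracies."""
    correct = acc_has_answer * n_has_answer + acc_no_answer * n_no_answer
    return correct / (n_has_answer + n_no_answer)


# Assumed subgroup sizes (inferred from the ratios, not reported by the library).
N_HAS_ANSWER = 25
N_NO_ANSWER = 29

# Subgroup accuracies copied from the output above.
top1 = combine(0.12, N_HAS_ANSWER, 0.5172413793103449, N_NO_ANSWER)
topk = combine(0.36, N_HAS_ANSWER, 0.9655172413793104, N_NO_ANSWER)

print("Top-1 accuracy (recombined):", round(top1, 4))
print("Top-k accuracy (recombined):", round(topk, 4))
```

Under this assumed split, the recombined values agree with the reported overall Top-1 and Top-k accuracy, which suggests that most of the overall score comes from correctly predicted no-answer questions.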
