# Evaluation of the Question Answering System using Haystack

Initialization: set the runtime type to GPU.

The individual components of a question answering system can be evaluated with Haystack's evaluation utilities.

The question answering system implemented here is a Retriever-Reader system, and the two components are evaluated both on their own and together as a pipeline.

The following pre-trained models from the Hugging Face Transformers library are evaluated for the reader component.

1. twmkn9/bert-base-uncased-squad2
2. deepset/roberta-base-squad2
3. vanadhi/bert-base-uncased-fiqa-flm-sq-flit
4. vanadhi/roberta-base-fiqa-flm-sq-flit
5. deepset/minilm-uncased-squad2
6. ahotrod/albert_xxlargev1_squad2_512

The results of the evaluation are saved as CSV files, which contain all the information to calculate additional metrics later on or inspect individual predictions.
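As a minimal sketch of the kind of follow-up analysis the saved CSV files allow, the snippet below recomputes a "top-k" metric from per-answer rows. The column names `query`, `rank`, `exact_match` and `f1` follow the reader dataframe shown later in this notebook; the rows themselves and the helper `metric_at_k` are hypothetical, and the sketch assumes that a top-k metric is the best value among the first k answers of each query, averaged over queries.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a saved Reader CSV (made-up rows, real column names).
CSV_TEXT = """query,rank,exact_match,f1
q1,1,1.0,1.0
q1,2,0.0,0.0
q2,1,0.0,0.5
q2,2,1.0,1.0
"""


def metric_at_k(rows, metric, k):
    """Best value of `metric` among the top-k answers, averaged over queries."""
    best = defaultdict(float)
    for row in rows:
        if int(row["rank"]) <= k:
            best[row["query"]] = max(best[row["query"]], float(row[metric]))
    return sum(best.values()) / len(best)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
em_top_1 = metric_at_k(rows, "exact_match", k=1)
em_top_2 = metric_at_k(rows, "exact_match", k=2)
f1_top_1 = metric_at_k(rows, "f1", k=1)
print(em_top_1, em_top_2, f1_top_1)
```

The same aggregation can be written with `pandas` (filter on `rank`, group by `query`, take the maximum, then the mean) once the real CSV files are loaded.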

# Initialization

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1 -q
!pip install git+https://github.com/deepset-ai/haystack.git -q

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

In [None]:
import pandas as pd
from haystack.nodes import FARMReader
from haystack.nodes import TransformersReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.schema import EvaluationResult, MultiLabel

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
from haystack.modeling.utils import initialize_device_settings
devices, n_gpu = initialize_device_settings(use_cuda=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Start an Elasticsearch server

Elasticsearch is downloaded manually as a release archive and started as a background process.

In [None]:
# Alternative in Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
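A fixed 30-second sleep may be too short on a slow machine and needlessly long on a fast one. As a hedged alternative, the sketch below polls until the server answers. The helper names `es_is_up` and `wait_until` are hypothetical, and the URL assumes Elasticsearch listens on its default HTTP port 9200 on localhost.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", timeout_s=2.0):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or reset: the server is not ready yet.
        return False


def wait_until(probe, max_wait_s=60.0, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `probe` repeatedly until it returns True or `max_wait_s` elapses."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage in the notebook, in place of the fixed sleep:
#     if not wait_until(es_is_up, max_wait_s=60):
#         raise RuntimeError("Elasticsearch did not start in time")
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without a running server.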

## Fetch, Store And Preprocess the Evaluation Dataset

In [None]:
# Make sure these indices do not collide with existing ones: both indices are wiped clean before data is inserted.
doc_index = "finlit_docs"
label_index = "finlit_labels"
val_file_path = "/content/drive/MyDrive/FinLitQA/flitqa/eval_final.json"

In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=doc_index,
                                            label_index=label_index, embedding_field="emb",
                                            embedding_dim=768, excluded_meta_data=["emb"])

In [None]:
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    split_length=300,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False
)
document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

document_store.add_eval_data(
    filename=val_file_path,
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
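To make the effect of `split_length=300` and `split_overlap=0` concrete, here is a minimal pure-Python sketch of word-based splitting. It is not Haystack's implementation: the function `split_by_words` is hypothetical, it assumes the unit of `split_length` is words, and it normalizes whitespace, which the `PreProcessor` configured above does not.

```python
def split_by_words(text, split_length=300, split_overlap=0):
    """Split `text` into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(700))
chunks = split_by_words(sample, split_length=300, split_overlap=0)
print([len(chunk.split()) for chunk in chunks])  # prints [300, 300, 100]
```

With no overlap, a gold answer that straddles a chunk boundary ends up cut in two, which is one reason to experiment with a non-zero `split_overlap`.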




# Retriever


In [None]:
# Initialize Retriever
from haystack.nodes import ElasticsearchRetriever
retriever_es = ElasticsearchRetriever(document_store=document_store)


In [None]:
# Optional dense retrievers (left disabled; only the Elasticsearch retriever is evaluated here)
# from haystack.nodes import DensePassageRetriever, EmbeddingRetriever
# retriever_dpr = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#                                   use_gpu=True,
#                                   max_seq_len_passage=256,
#                                   embed_title=True)
# retriever_dpr = EmbeddingRetriever(document_store=document_store, model_format="sentence_transformers",
#                                embedding_model="all-mpnet-base-v2")
# document_store.update_embeddings(retriever_dpr, index=doc_index)

In [None]:
## Evaluate Retriever on its own
retriever_eval_results = retriever_es.eval(top_k=5, label_index=label_index, doc_index=doc_index)
retriever_eval_results

INFO - haystack.nodes.retriever.base -  Performing eval queries...
100%|██████████| 51/51 [00:00<00:00, 106.76it/s]
INFO - haystack.nodes.retriever.base -  For 51 out of 51 questions (100.00%), the answer was in the top-5 candidate passages selected by the retriever.


{'map': 0.9166666666666666,
 'mrr': 0.9166666666666666,
 'n_questions': 51,
 'recall': 1.0,
 'retrieve_time': 0.4659871339995334,
 'top_k': 5}
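The retriever metrics above are averages of per-query scores. The sketch below shows one common way to compute those per-query scores from a ranked list of hit flags. The function names are hypothetical and the definitions are the textbook ones, which may differ in detail from Haystack's. The example pattern, with relevant documents at ranks 1 and 3 of 5, is an assumption; under these definitions it yields precision 0.4, average precision 0.833 and reciprocal rank 1.0, the values this notebook reports for the PMJJBY query.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@rank over the ranks that hold a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def precision(relevance):
    """Share of retrieved documents that are relevant."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def recall_single_hit(relevance):
    """1.0 if at least one relevant document was retrieved, else 0.0."""
    return 1.0 if any(relevance) else 0.0


# Assumed hit pattern for one query with top_k=5: relevant at ranks 1 and 3.
pattern = [1, 0, 1, 0, 0]
print(precision(pattern), average_precision(pattern),
      reciprocal_rank(pattern), recall_single_hit(pattern))
```

Averaging `reciprocal_rank` and `average_precision` over all queries gives `mrr` and `map` respectively.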

# Reader 1 - twmkn9/bert-base-uncased-squad2

In [None]:
# Initialize Reader
model_path = "twmkn9/bert-base-uncased-squad2"
result_path  = "/content/drive/MyDrive/FinLitQA/eval_results/bert-base-sq"

#reader_bert_sq = TransformersReader(model_path, use_gpu=1)
reader_bert_sq = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_bert_sq = ExtractiveQAPipeline(reader=reader_bert_sq, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find twmkn9/bert-base-uncased-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...



INFO - haystack.modeling.model.language_model -  Loaded twmkn9/bert-base-uncased-squad2



INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_bert_sq.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)


In [None]:
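# Aggregate the pipeline run into per-node metrics (one column per node)
metrics_bert_sq = eval_result.calculate_metrics()
metrics_bert_sq_df = pd.DataFrame(metrics_bert_sq)
metrics_bert_sq_df

# Evaluate the Reader on its own against the labels in the document store
reader_eval_results = reader_bert_sq.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

# Save the per-node results as CSV files for later inspection
eval_result.save(result_path)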

| Metric | Retriever | Reader |
|---|---|---|
| recall_multi_hit | 1.0 | |
| recall_single_hit | 1.0 | |
| precision | 0.205882 | |
| map | 0.916667 | |
| mrr | 0.916667 | |
| exact_match | | 0.529412 |
| f1 | | 0.759684 |
| sas | | 0.810783 |


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5


{'EM': 41.17647058823529,
 'f1': 60.17273576097105,
 'reader_time': 1.2652803519999907,
 'seconds_per_query': 0.024809418666666482,
 'top_n': 4,
 'top_n_accuracy': 90.19607843137256}
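Note that the standalone reader evaluation reports `EM` and `f1` on a 0 to 100 scale, while the pipeline table reports `exact_match` and `f1` on a 0 to 1 scale. For readers unfamiliar with the two metrics, the sketch below implements the usual SQuAD-style definitions for a single prediction and a single gold answer. It is an illustration under that assumption, not Haystack's code, and the normalization Haystack applies may differ.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match (no-answer case); otherwise a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("2015", "2015"), f1_score("in 2015", "2015"))
```

The second call illustrates why F1 is usually higher than exact match: a prediction that contains the gold answer plus an extra token still earns partial credit.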

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_bert_sq.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.529
                        | exact_match_top_1: 0.353
                        | f1:  0.76
                        | f1_top_1: 0.493
                        | sas: 0.811
                        | sas_top_1: 0.538
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

Unnamed: 0,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction


Unnamed: 0,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015.0,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 154, 'end': 158}]","hen Finance Minister, Mr. Arun Jaitley in the annual financial budget of 201...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452


# Reader 2 - deepset/roberta-base-squad2

In [None]:
# Initialize Reader
model_path = "deepset/roberta-base-squad2"
result_path  = "/content/drive/MyDrive/FinLitQA/eval_results/roberta-base-sq2"

# reader_roberta_sq = TransformersReader(model_path, use_gpu=1)
reader_roberta_sq = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_roberta_sq = ExtractiveQAPipeline(reader=reader_roberta_sq, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...



INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2



INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_roberta_sq.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)


In [None]:
metrics_roberta_sq = eval_result.calculate_metrics()
metrics_roberta_sq_df = pd.DataFrame(metrics_roberta_sq)
metrics_roberta_sq_df

eval_result.save(result_path)

# Evaluate Reader on its own
reader_eval_results = reader_roberta_sq.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

| Metric | Retriever | Reader |
|---|---|---|
| recall_multi_hit | 1.0 | |
| recall_single_hit | 1.0 | |
| precision | 0.205882 | |
| map | 0.916667 | |
| mrr | 0.916667 | |
| exact_match | | 0.588235 |
| f1 | | 0.784492 |
| sas | | 0.825707 |


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5


{'EM': 49.01960784313725,
 'f1': 64.27128427128427,
 'reader_time': 1.2769869420001214,
 'seconds_per_query': 0.025038959647061203,
 'top_n': 4,
 'top_n_accuracy': 86.27450980392157}

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_roberta_sq.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.588
                        | exact_match_top_1: 0.333
                        | f1: 0.784
                        | f1_top_1: 0.421
                        | sas: 0.826
                        | sas_top_1: 0.445
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

Unnamed: 0,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction


Unnamed: 0,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015.0,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 154, 'end': 158}]","hen Finance Minister, Mr. Arun Jaitley in the annual financial budget of 201...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452


# Reader 3 - vanadhi/bert-base-uncased-fiqa-flm-sq-flit

In [None]:
# Initialize Reader
model_path = "vanadhi/bert-base-uncased-fiqa-flm-sq-flit"
result_path  = "/content/drive/MyDrive/FinLitQA/eval_results/bert-base-flit"

# reader_bert_flit = TransformersReader(model_path, use_gpu=1)
reader_bert_flit = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_bert_flit = ExtractiveQAPipeline(reader=reader_bert_flit, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find vanadhi/bert-base-uncased-fiqa-flm-sq-flit locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...



INFO - haystack.modeling.model.language_model -  Loaded vanadhi/bert-base-uncased-fiqa-flm-sq-flit



INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_bert_flit.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)


In [None]:
metrics_bert_flit = eval_result.calculate_metrics()
metrics_bert_flit_df = pd.DataFrame(metrics_bert_flit)
metrics_bert_flit_df

eval_result.save(result_path)

# Evaluate Reader on its own
reader_eval_results = reader_bert_flit.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

| Metric | Retriever | Reader |
|---|---|---|
| recall_multi_hit | 1.0 | |
| recall_single_hit | 1.0 | |
| precision | 0.205882 | |
| map | 0.916667 | |
| mrr | 0.916667 | |
| exact_match | | 0.627451 |
| f1 | | 0.810368 |
| sas | | 0.866676 |


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5


{'EM': 54.90196078431373,
 'f1': 70.49892674230173,
 'reader_time': 1.2596296130000155,
 'seconds_per_query': 0.0246986198627454,
 'top_n': 4,
 'top_n_accuracy': 92.15686274509804}

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_bert_flit.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.627
                        | exact_match_top_1:  0.49
                        | f1:  0.81
                        | f1_top_1: 0.635
                        | sas: 0.867
                        | sas_top_1: 0.682
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

Unnamed: 0,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction


Unnamed: 0,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015.0,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 154, 'end': 158}]","hen Finance Minister, Mr. Arun Jaitley in the annual financial budget of 201...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452


# Reader 4 - vanadhi/roberta-base-fiqa-flm-sq-flit

In [None]:
# Initialize Reader
model_path = "vanadhi/roberta-base-fiqa-flm-sq-flit"
result_path  = "/content/drive/MyDrive/FinLitQA/eval_results/roberta-base-flit"

# reader_roberta_flit = TransformersReader(model_path, use_gpu=1)
reader_roberta_flit = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_roberta_flit = ExtractiveQAPipeline(reader=reader_roberta_flit, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find vanadhi/roberta-base-fiqa-flm-sq-flit locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded vanadhi/roberta-base-fiqa-flm-sq-flit
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_roberta_flit.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)



In [None]:
metrics_roberta_flit = eval_result.calculate_metrics()
metrics_roberta_flit_df = pd.DataFrame(metrics_roberta_flit)
metrics_roberta_flit_df

eval_result.save(result_path)

# Evaluate Reader on its own
reader_eval_results = reader_roberta_flit.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

| Metric | Retriever | Reader |
|---|---|---|
| recall_multi_hit | 1.0 | |
| recall_single_hit | 1.0 | |
| precision | 0.205882 | |
| map | 0.916667 | |
| mrr | 0.916667 | |
| exact_match | | 0.666667 |
| f1 | | 0.849473 |
| sas | | 0.883281 |


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5



{'EM': 60.78431372549019,
 'f1': 77.73634629750171,
 'reader_time': 1.2329038050002055,
 'seconds_per_query': 0.024174584411768736,
 'top_n': 4,
 'top_n_accuracy': 92.15686274509804}
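With four readers evaluated so far, the pipeline-level reader metrics can be put side by side. The sketch below copies the `exact_match`, `f1` and `sas` values from the pipeline metric tables in this notebook and ranks the models by F1. The choice of `deepset/roberta-base-squad2` as the baseline for the gain is an assumption made for illustration.

```python
# Reader metrics copied from the pipeline evaluation tables in this notebook
# (ExtractiveQAPipeline.eval with Retriever top_k=5 and Reader top_k=4).
reader_metrics = [
    {"model": "twmkn9/bert-base-uncased-squad2",
     "exact_match": 0.529412, "f1": 0.759684, "sas": 0.810783},
    {"model": "deepset/roberta-base-squad2",
     "exact_match": 0.588235, "f1": 0.784492, "sas": 0.825707},
    {"model": "vanadhi/bert-base-uncased-fiqa-flm-sq-flit",
     "exact_match": 0.627451, "f1": 0.810368, "sas": 0.866676},
    {"model": "vanadhi/roberta-base-fiqa-flm-sq-flit",
     "exact_match": 0.666667, "f1": 0.849473, "sas": 0.883281},
]

# Rank the readers from best to worst F1.
ranked = sorted(reader_metrics, key=lambda row: row["f1"], reverse=True)
best = ranked[0]

# F1 gain of the best reader over the assumed baseline model.
baseline = next(row for row in reader_metrics
                if row["model"] == "deepset/roberta-base-squad2")
f1_gain = best["f1"] - baseline["f1"]

print(f"{'model':45s} {'EM':>6s} {'F1':>6s} {'SAS':>6s}")
for row in ranked:
    print(f"{row['model']:45s} {row['exact_match']:6.3f} "
          f"{row['f1']:6.3f} {row['sas']:6.3f}")
print(f"F1 gain of best reader over baseline: {f1_gain:.3f}")
```

The evaluation set holds only 51 questions, so a single question accounts for about 2 percentage points of exact match, and small differences between models should be read with that in mind.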

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_roberta_flit.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.667
                        | exact_match_top_1:  0.51
                        | f1: 0.849
                        | f1_top_1: 0.642
                        | sas: 0.883
                        | sas_top_1: 0.675
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

index,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction
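
The per-query retriever metrics in the report (precision 0.4, map 0.833, mrr 1.0 with `top_k` = 5) can be reproduced with the textbook definitions of these ranking metrics. The sketch below assumes a ranking with hits at ranks 1 and 3, which is one ranking consistent with the reported values; only ranks 1 and 2 are visible in the output above, so rank 3 is an assumption. The helper functions are illustrative and may differ in detail from Haystack's implementation.

```python
def precision(relevance):
    """Share of retrieved documents that are relevant."""
    return sum(relevance) / len(relevance)


def reciprocal_rank(relevance):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@rank taken at each rank that holds a relevant document."""
    hits = 0
    total = 0.0
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical ranking: relevant documents at ranks 1 and 3 out of 5 retrieved.
relevance = [1, 0, 1, 0, 0]

print("precision:", precision(relevance))
print("reciprocal rank:", reciprocal_rank(relevance))
print("average precision:", average_precision(relevance))
```

Averaging the reciprocal rank and the average precision over all queries gives the `mrr` and `map` figures of the metrics table.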


index,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 154, 'end': 158}]","hen Finance Minister, Mr. Arun Jaitley in the annual financial budget of 201...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452


# Reader 5 - deepset/minilm-uncased-squad2

In [None]:
# Initialize Reader
model_path = "deepset/minilm-uncased-squad2"
result_path = "/content/drive/MyDrive/FinLitQA/eval_results/minilm_sq"

# Alternative reader backend, left disabled:
# reader_minilm_sq = TransformersReader(model_path, use_gpu=1)

# top_k=4 returns up to four answers per query; return_no_answer=True keeps "no answer" predictions in the results.
reader_minilm_sq = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_minilm_sq = ExtractiveQAPipeline(reader=reader_minilm_sq, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/minilm-uncased-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded deepset/minilm-uncased-squad2


INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_minilm_sq.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)

In [None]:
metrics_minilm_sq = eval_result.calculate_metrics()
metrics_minilm_sq_df = pd.DataFrame(metrics_minilm_sq)
metrics_minilm_sq_df

eval_result.save(result_path)

# Evaluate Reader on its own
reader_eval_results = reader_minilm_sq.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

metric,Retriever,Reader
recall_multi_hit,1.0,
recall_single_hit,1.0,
precision,0.205882,
map,0.916667,
mrr,0.916667,
exact_match,,0.568627
sas,,0.863456
f1,,0.822283


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5


{'EM': 47.05882352941176,
 'f1': 71.59538437527053,
 'reader_time': 0.6250120240001706,
 'seconds_per_query': 0.012255137725493541,
 'top_n': 4,
 'top_n_accuracy': 96.07843137254902}

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_minilm_sq.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.569
                        | exact_match_top_1: 0.412
                        | f1: 0.822
                        | f1_top_1: 0.583
                        | sas: 0.863
                        | sas_top_1: 0.595
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

index,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction


index,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 154, 'end': 158}]","hen Finance Minister, Mr. Arun Jaitley in the annual financial budget of 201...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452
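
The reader dataframe holds one row per returned answer, so the gap between `exact_match` and `exact_match_top_1` in the report can be explored directly with pandas. The sketch below assumes that the aggregate metric credits a query when any of its returned answers matches, while the `_top_1` variant only looks at `rank == 1`. The first two rows mirror the output above; the remaining rows are hypothetical and only serve the illustration.

```python
import pandas as pd

reader_df = pd.DataFrame(
    [
        # Rows mirroring the reader output shown above.
        {"query": "What year did PMJJBY started?", "answer": "2015", "rank": 1, "exact_match": 1.0, "f1": 1.0},
        {"query": "What year did PMJJBY started?", "answer": "", "rank": 2, "exact_match": 0.0, "f1": 0.0},
        # Hypothetical rows: the correct answer only appears at rank 2.
        {"query": "hypothetical query", "answer": "partial answer", "rank": 1, "exact_match": 0.0, "f1": 0.5},
        {"query": "hypothetical query", "answer": "full answer", "rank": 2, "exact_match": 1.0, "f1": 1.0},
    ]
)

# Top-1 view: only the best-ranked answer of each query counts.
top_1 = reader_df[reader_df["rank"] == 1]
top_1_misses = top_1[top_1["exact_match"] == 0.0]

# Top-k view (assumption): a query is credited with its best-scoring answer.
best_per_query = reader_df.groupby("query")[["exact_match", "f1"]].max()

print("exact_match_top_1:", top_1["exact_match"].mean())
print("exact_match over all returned answers:", best_per_query["exact_match"].mean())
print(top_1_misses[["query", "answer", "f1"]])
```

Applied to `eval_result["Reader"]`, the same filter lists the queries where the reader's first answer is wrong even though a lower-ranked answer matches.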


# Reader 6 - ahotrod/albert_xxlargev1_squad2_512

In [None]:
# Initialize Reader
model_path = "ahotrod/albert_xxlargev1_squad2_512"
result_path = "/content/drive/MyDrive/FinLitQA/eval_results/albert_sq"

# Alternative reader backend, left disabled:
# reader_albert_sq = TransformersReader(model_path, use_gpu=1)

# top_k=4 returns up to four answers per query; return_no_answer=True keeps "no answer" predictions in the results.
reader_albert_sq = FARMReader(model_name_or_path=model_path, top_k=4, return_no_answer=True)
pipeline_albert_sq = ExtractiveQAPipeline(reader=reader_albert_sq, retriever=retriever_es)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find ahotrod/albert_xxlargev1_squad2_512 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded ahotrod/albert_xxlargev1_squad2_512


INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
## Evaluation of an ExtractiveQAPipeline
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_result = pipeline_albert_sq.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large"
)

In [None]:
metrics_albert_sq = eval_result.calculate_metrics()
metrics_albert_sq_df = pd.DataFrame(metrics_albert_sq)
metrics_albert_sq_df

eval_result.save(result_path)

# Evaluate Reader on its own
reader_eval_results = reader_albert_sq.eval(document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index)
reader_eval_results

metric,Retriever,Reader
recall_multi_hit,1.0,
recall_single_hit,1.0,
precision,0.205882,
map,0.916667,
mrr,0.916667,
exact_match,,0.627451
sas,,0.866875
f1,,0.81923


INFO - haystack.nodes.reader.farm -  Performing Evaluation using top_k_per_candidate = 3 
and consequently, QuestionAnsweringPredictionHead.n_best = 4. 
This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5


{'EM': 39.21568627450981,
 'f1': 65.02934044662406,
 'reader_time': 22.84806640700026,
 'seconds_per_query': 0.44800130209804434,
 'top_n': 4,
 'top_n_accuracy': 94.11764705882352}
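
The standalone reader results for Readers 4 to 6 can be put side by side to weigh accuracy against inference time. The sketch below copies the `EM`, `f1` and `seconds_per_query` values from the three dictionaries printed in this notebook; the row labels reuse the suffixes of the notebook's variable names, and the `slowdown_vs_fastest` column is an illustrative addition.

```python
import pandas as pd

# Values copied from the reader.eval() dictionaries printed in this notebook.
standalone = pd.DataFrame(
    {
        "EM": [60.78431372549019, 47.05882352941176, 39.21568627450981],
        "f1": [77.73634629750171, 71.59538437527053, 65.02934044662406],
        "seconds_per_query": [0.024174584411768736, 0.012255137725493541, 0.44800130209804434],
    },
    index=["roberta_flit", "minilm_sq", "albert_sq"],
)

# Relative cost of each reader compared with the fastest one.
fastest = standalone["seconds_per_query"].min()
standalone["slowdown_vs_fastest"] = standalone["seconds_per_query"] / fastest

# Most accurate reader first.
print(standalone.sort_values("f1", ascending=False).round(3))
```

Timings come from a single run on one GPU, so they are best read as a rough ordering rather than as a benchmark.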

In [None]:
## Generating an Evaluation Report
saved_eval_result = EvaluationResult.load(result_path)
pipeline_albert_sq.print_eval_report(saved_eval_result)

# The EvaluationResult contains a pandas dataframe for each pipeline node.
retriever_result = eval_result["Retriever"]
retriever_result.head(2)

reader_result = eval_result["Reader"]
reader_result.head(2)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:   1.0
                        | recall_single_hit_top_1:   1.0
                        |
                      Reader
                        |
                        | exact_match: 0.627
                        | exact_match_top_1: 0.431
                        | f1: 0.819
                        | f1_top_1: 0.621
                        | sas: 0.867
                        | sas_top_1:  0.65
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What year did PMJJBY started?
Gold Document Ids: 
 	86b6adea4b99d7e62c865b8c75353e3a-0
Metrics: 
 	recall_multi_hit: 1.0
 	recall_single_hit: 1.0
 	precision: 0.4
 	map: 0.8333333333333333
 	mrr: 1.0
Documents: 
 	content: PMJJBY

The Pradhan Mantri Jeevan Jyoti Bima Yojana 

index,content,document_id,type,gold_document_ids,gold_document_contents,gold_id_match,answer_match,gold_id_or_answer_match,rank,node,query,node_input
0,PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started b...,86b6adea4b99d7e62c865b8c75353e3a-0,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,1.0,1.0,1.0,1,Retriever,What year did PMJJBY started?,prediction
1,"will last till 31 May of the next year. Henceforth, the policy can be renewe...",86b6adea4b99d7e62c865b8c75353e3a-1,document,[86b6adea4b99d7e62c865b8c75353e3a-0],[PMJJBY\n\nThe Pradhan Mantri Jeevan Jyoti Bima Yojana or PMJBY was started ...,0.0,0.0,0.0,2,Retriever,What year did PMJJBY started?,prediction


index,answer,document_id,offsets_in_document,context,type,gold_answers,gold_offsets_in_documents,gold_document_ids,exact_match,f1,rank,node,query,node_input,sas
0,2015,86b6adea4b99d7e62c865b8c75353e3a-0,"[{'start': 153, 'end': 157}]","then Finance Minister, Mr. Arun Jaitley in the annual financial budget of 20...",answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],1.0,1.0,1,Reader,What year did PMJJBY started?,prediction,0.948495
1,,,"[{'start': 0, 'end': 0}]",,answer,[2015],"[{'start': 154, 'end': 158}]",[86b6adea4b99d7e62c865b8c75353e3a-0],0.0,0.0,2,Reader,What year did PMJJBY started?,prediction,0.01452
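
The three evaluation reports printed for Readers 4 to 6 can be condensed into one table that shows how much each reader depends on its lower-ranked answers. The sketch below copies the rounded reader metrics from the "Pipeline Overview" outputs; the `*_gap` columns (metric over all returned answers minus its `_top_1` variant) are an illustrative addition and not a Haystack metric.

```python
import pandas as pd

# Rounded reader metrics copied from the three "Pipeline Overview" reports.
report = pd.DataFrame(
    {
        "exact_match": [0.667, 0.569, 0.627],
        "exact_match_top_1": [0.51, 0.412, 0.431],
        "f1": [0.849, 0.822, 0.819],
        "f1_top_1": [0.642, 0.583, 0.621],
        "sas": [0.883, 0.863, 0.867],
        "sas_top_1": [0.675, 0.595, 0.65],
    },
    index=["roberta_flit", "minilm_sq", "albert_sq"],
)

# Gap between the metric over all returned answers and its top-1 variant.
for metric in ("exact_match", "f1", "sas"):
    report[f"{metric}_gap"] = report[metric] - report[f"{metric}_top_1"]

# Rank readers by the quality of their first answer.
print(report.sort_values("f1_top_1", ascending=False).round(3))
```

A large gap suggests that the correct answer is often found but not ranked first, which matters if only one answer is shown to the user.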
