# TM. Assignment 1. Sparse vs Dense Retrieval using Haystack

## Objectives
- Implement IR systems based on sparse and dense representations.
- Evaluate the performance of sparse versus dense representations on a real dataset ([TREC COVID Collection](https://ir.nist.gov/trec-covid/)).
- Use a real-world end-to-end IR/NLP library (Haystack).

## **TASK 0.** Basic usage of Haystack

### Haystack

[Haystack](https://haystack.deepset.ai/) is an open-source Python library for building end-to-end Search and Question Answering (QA) systems and NLP applications.
* Provides an easy-to-use framework and a set of components that cover all stages of an NLP project, making it possible to work with text data, perform document retrieval, and apply ML to extract answers from documents.
* Includes integration components to work with LLMs (_Large Language Models_) and to interface with models from [Hugging Face](https://huggingface.co/), [Sentence Bert](https://www.sbert.net/), [OpenAI](https://platform.openai.com/docs/models) and others.

### Haystack Components

Applications built on Haystack are based on the [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines) concept, which organizes the sequence of tasks to be performed on processed text or documents.
- These _Pipelines_ are composed of different components (called _Nodes_ in Haystack), each of which performs a specific task.
- Pipeline components receive as input and emit as output core elements: [_Documents_, _Answers_, _Labels_](https://docs.haystack.deepset.ai/docs/documents_answers_labels).
- Available pipeline components are grouped according to the function they perform (Data Handling, Semantic Search, Prompt and LLM, etc.). See [Pipeline Components Overview](https://docs.haystack.deepset.ai/docs/nodes_overview).

**Notes**
- In the tasks of this assignment we will use the Haystack components independently, without integrating them into a _Pipeline_.
- We will work primarily with the [Semantic Search](https://docs.haystack.deepset.ai/docs/semantic_search) components.

### Haystack Components relevant for this assignment

- **DocumentStore** Acts as a database that stores _Documents_ (textual contents + metadata + (optionally) embedding vector) and provides them to the _Retriever_ at query time. (Details in [DocumentStore documentation](https://docs.haystack.deepset.ai/docs/document_store))

    - Several _DocumentStore_ implementations are available, supporting sparse retrieval, dense retrieval, or both: [InMemoryDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#inmemorydocumentstore), [ElasticsearchDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#elasticsearchdocumentstore), [OpenSearchDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#opensearchdocumentstore), [FAISSDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#faissdocumentstore), [WeaviateDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#weaviatedocumentstore), ...

    - Relevant methods:
        - _write_documents(docs)_: adds a list of _Documents_
        - _update_embeddings(retriever)_: updates vector embedding for _Documents_ in _DocumentStore_ using the given dense _Retriever_
- **Retriever** Performs document retrieval, backed by a compatible _DocumentStore_, and returns a ranked set of candidate _Documents_ that are relevant to the given query. (Details in [Retriever documentation](https://docs.haystack.deepset.ai/docs/retriever))

  - _Retrievers_ can follow a sparse retrieval approach or a dense one (see [Vector-based vs Keyword-based](https://docs.haystack.deepset.ai/docs/vector-based-vs-keyword-based-retrievers)) and are tightly coupled with the corresponding _DocumentStore_ (see [Compatibility Matrix](https://docs.haystack.deepset.ai/docs/retriever#documentstore-compatibility))

      - Sparse _Retriever_: [BM25Retriever](https://docs.haystack.deepset.ai/reference/retriever-api#bm25retriever), [TfidfRetriever](https://docs.haystack.deepset.ai/reference/retriever-api#tfidfretriever)

      - Dense _Retriever_: [EmbeddingRetriever](https://docs.haystack.deepset.ai/reference/retriever-api#embeddingretriever), [DensePassageRetriever](https://docs.haystack.deepset.ai/reference/retriever-api#densepassageretriever)

  - Relevant methods:
      - _retrieve(query, top_k=5, ...)_: returns a list with the _k_ highest-scored _Documents_ that match the given _query_ text
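
To make the sparse side of this distinction concrete, the following is a minimal, self-contained sketch of Okapi BM25 scoring over a toy corpus. It is not the implementation used by Haystack or Elasticsearch: the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the IDF variant, the function names and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real engines also strip punctuation, stem, etc.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents in which each term appears
    df = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, not taken from TREC-COVID
toy_docs = [
    "gene expression in epithelial cells",
    "epithelial cells and virus infection",
    "weather forecast for tomorrow",
]
toy_scores = bm25_scores("epithelial gene expression", toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best, toy_scores)
```

Only documents sharing at least one term with the query receive a non-zero score, which is the defining trait of sparse retrieval; a dense retriever would instead compare embedding vectors and could match documents with no terms in common.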

### Example

**STEP 1.** Install Haystack with GPU support (see https://docs.haystack.deepset.ai/docs/installation)

In [None]:
!pip install --upgrade pip
!pip install 'farm-haystack[all-gpu]'


**STEP 2** Download the TREC-COVID Collection (approx. 170K documents) from the BEIR Repository (see https://github.com/beir-cellar/beir#beers-available-datasets)


In [None]:
!wget -c https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
!unzip trec-covid.zip
!echo -e "\nCORPUS CONTENTS"
!head -5 trec-covid/corpus.jsonl
!echo -e "\nQUERIES CONTENTS"
!head -5 trec-covid/queries.jsonl
!echo -e "\nQRELS CONTENTS"
!head -10 trec-covid/qrels/test.tsv

--2023-11-05 11:47:05--  https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186
Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73876720 (70M) [application/zip]
Saving to: ‘trec-covid.zip’


2023-11-05 11:47:14 (9.77 MB/s) - ‘trec-covid.zip’ saved [73876720/73876720]

Archive:  trec-covid.zip
   creating: trec-covid/
   creating: trec-covid/qrels/
  inflating: trec-covid/qrels/test.tsv  
  inflating: trec-covid/corpus.jsonl  
  inflating: trec-covid/queries.jsonl  

CORPUS CONTENTS
{"_id": "ug7v899j", "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia", "text": "OBJECTIVE: This retrospective chart review describes the epidemiology and cli

**STEP 3** Declare a utility function to read documents from trec-covid/corpus.jsonl and load them into Haystack Document objects

In [None]:
import json

from haystack import Document

def load_haystackdocs_from_trec_covid(path):
  '''Reads a BEIR corpus.jsonl file and returns a list of Haystack Documents.'''
  result = []
  with open(path, encoding='utf-8') as f:
    for line in f:
      doc_json = json.loads(line)
      # Title and text are concatenated into a single content field
      doc_haystack = Document(content_type='text',
                              content=doc_json['title'] + ". " + doc_json['text'],
                              id=doc_json['_id'],
                              meta={
                                  'pmid': doc_json['metadata']['pubmed_id'],
                                  'url': doc_json['metadata']['url']
                              })
      result.append(doc_haystack)
  return result

docs = load_haystackdocs_from_trec_covid('trec-covid/corpus.jsonl')

print(docs[4])

<Document: id=9785vg6d, content='Gene expression in epithelial cells in response to pneumovirus infection. Respiratory syncytial viru...'>


**STEP 4** Add TREC-COVID documents to an _InMemoryDocumentStore_ (set _use_bm25=True_ to enable BM25 retriever)

In [None]:
from haystack.document_stores import InMemoryDocumentStore
inmemory_store = InMemoryDocumentStore(use_bm25=True)

inmemory_store.write_documents(docs)
print("Number of indexed docs. {}".format(inmemory_store.get_document_count()))

Updating BM25 representation...: 100%|██████████| 171332/171332 [00:24<00:00, 7079.68 docs/s]


Number of indexed docs. 171332


**STEP 5** Create a _BM25Retriever_ and search with the title of the 5th document in the TREC-COVID collection (see https://docs.haystack.deepset.ai/docs/retriever)

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=inmemory_store)

candidate_documents = retriever.retrieve(
    query="Gene expression in epithelial cells in response to pneumovirus infection",
    top_k=10
)

for d in candidate_documents:
  print("Score({}) Id: {}  Content: {}".format(d.score, d.id, d.content[0:50]))

Score(0.9989907772473965) Id: 9785vg6d  Content: Gene expression in epithelial cells in response to
Score(0.9936369117196465) Id: b831y105  Content: Lung epithelial cells have virus-specific and shar
Score(0.9933910463807736) Id: qowp861l  Content: Gene expression and in situ protein profiling of c
Score(0.9933887346920499) Id: iedh762s  Content: Lung epithelial cells have virus-specific and shar
Score(0.9929880986768462) Id: 8kjqhyvd  Content: Differential expression of interferon-lambda recep
Score(0.9928198816586955) Id: zks03hq7  Content: The Healthy Infant Nasal Transcriptome: A Benchmar
Score(0.9927019727044312) Id: fdb6az0v  Content: Hypercapnia Alters Expression of Immune Response, 
Score(0.9923594653116434) Id: 08p8ns2d  Content: SARS-CoV-2 activates lung epithelia cell proinflam
Score(0.9915729430282115) Id: tna7e9dw  Content: Type 2 Inflammation Modulates ACE2 and TMPRSS2 in 
Score(0.9914339424159164) Id: rya21i2b  Content: The influence of interferon-lambda on restricting 


## **TASK 1.** Evaluate Sparse Retrieval with Haystack (TODO)

### Steps

1. Create an _ElasticsearchDocumentStore_ and add the TREC-COVID documents to it
2. Run the 50 queries from the TREC-COVID collection using a _BM25Retriever_ and a _TfidfRetriever_ over the previous _ElasticsearchDocumentStore_
    - For each query in `trec-covid/queries.jsonl` use as query text the field `metadata.query` (**NOTE: Important**)
3. Evaluate the retrieval performance of both methods using the following measures
    - _MAP_ (Mean Average Precision)
    - _P@5_, _P@10_ and _P@20_ (Precision at 5, 10 and 20)
    - _R@5_, _R@10_ and _R@20_ (Recall at 5, 10 and 20)
    - _nDCG@10_ (NDCG at 10) [to compare with [TREC COVID Round 1 leaderboard](https://castorini.github.io/TREC-COVID/round1/)]

  To compute these measures you can employ one of the following libraries ([ir-measures](https://ir-measur.es/en/latest/), [pytrec_eval](https://github.com/cvangysel/pytrec_eval), [trectools](https://github.com/joaopalotti/trectools))
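
As a reference for what these measures compute, here is a minimal pure-Python sketch for a single query. It is only meant to illustrate the definitions and is not a replacement for the libraries above; the function names, the toy ranking and the toy relevance judgments are hypothetical, and the nDCG variant (linear gain, log2 discount) is an assumption about the convention used.

```python
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top k results that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents found in the top k results
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    # Mean of the precision values at each rank where a relevant document
    # appears, divided by the total number of relevant documents
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, gains, k):
    # `gains` maps doc_id -> graded relevance; unjudged documents count as 0
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking and judgments for one query
ranked = ["d1", "d2", "d3", "d4", "d5"]
gains = {"d1": 2, "d3": 1, "d9": 1}   # d9 is relevant but was not retrieved
relevant = {d for d, g in gains.items() if g > 0}

p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
ndcg5 = ndcg_at_k(ranked, gains, 5)
print(p5, r5, ap, ndcg5)
```

In this toy case P@5 is 2/5, R@5 is 2/3 and AP is 5/9. MAP is the mean of AP over all queries.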

### Preliminary remarks  (how to install Elasticsearch 7.x in Google Colab)

1. Download and uncompress the Elasticsearch 7.x distribution (by default Elasticsearch 7.x does not use HTTPS connections, so there is no need to create/manage TLS/SSL certificates)
2. Start the Elasticsearch daemon (it will take more than 30 seconds)
3. Check that Elasticsearch is running (the daemon process is running, port 9200 is open, the REST API responds)
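
Since the daemon needs some time to start, step 3 can also be done by polling the REST endpoint from Python instead of re-running the check by hand. The following is a minimal sketch using only the standard library; the helper name `wait_for_http` and its default timeout and interval are assumptions, not part of Haystack or Elasticsearch.

```python
import time
import urllib.error
import urllib.request

def wait_for_http(url, timeout_s=60.0, interval_s=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet (connection refused, timeout, ...)
        time.sleep(interval_s)
    return False
```

For the setup described here the call would be `wait_for_http("http://localhost:9200/")`, which returns `True` once the REST API responds and `False` if it does not respond within the timeout.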

Details at https://haystack.deepset.ai/tutorials/03_scalable_qa_system and https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb

In [None]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q -c
!tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
%%bash --bg
sudo -H -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

In [None]:
!ps -ef | grep elasticsearch
!ss -lpn | grep 9200

root        3444    3442  0 16:45 ?        00:00:00 sudo -H -u daemon -- elasticsearch-7.9.2/bin/ela
root        3445     253  0 16:45 ?        00:00:00 /bin/bash -c ps -ef | grep elasticsearch
root        3447    3445  0 16:45 ?        00:00:00 grep elasticsearch


In [None]:
!curl -sX GET "localhost:9200/"

In [None]:
!pip install ir-measures


### Start here ....

In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

elasticsearch_store = ElasticsearchDocumentStore(host="localhost", port=9200)
elasticsearch_store.write_documents(docs)

print(f"Number of indexed docs. {elasticsearch_store.get_document_count()}")

Number of indexed docs. 171332


In [None]:
from haystack.nodes import BM25Retriever, TfidfRetriever

bm25_retriever = BM25Retriever(document_store=elasticsearch_store)
tfidf_retriever = TfidfRetriever(document_store=elasticsearch_store)

In [None]:
def load_queries(path):
  '''Reads the id and the metadata.query text of every query in the given file.'''
  queries = []

  with open(path) as f:
    for query_line in f:
      data = json.loads(query_line)
      query_id = data["_id"]
      query_text = data["metadata"]["query"]
      queries.append({"id": query_id, "text": query_text})

  return queries

queries = load_queries("trec-covid/queries.jsonl")

In [None]:
from ir_measures import ScoredDoc

def make_query(query, retriever, top_k, ranker=None, calc_score=False):
  if ranker is None:
    docs = retriever.retrieve(query=query, top_k=top_k)
  else:
    docs = retriever.retrieve(query=query, top_k=top_k*2)
    docs = ranker.predict(query, docs, top_k=top_k)

  # Some retrievers (e.g. TfidfRetriever) do not return the scores, but we can
  # calculate them again and add them to the documents retrieved
  # See https://github.com/deepset-ai/haystack/blob/main/haystack/nodes/retriever/sparse.py#L457
  if (calc_score):
    scores = retriever._calc_scores([query], retriever.document_store.index)[0]

    # `scores` is an OrderedDict sorted by the score, so the first `top_k` scores
    # correspond to the scores of our `top_k` documents retrieved before
    for idx, (_, score) in enumerate(scores.items()):
      if (idx == top_k):
        break

      docs[idx].score = score

  return docs

def make_queries(queries, retriever, top_k, ranker=None, update_embeddings=False, calc_scores=False, print_results=False):

  if update_embeddings:
    # Update the embeddings of the documents using the given retriever
    retriever.document_store.update_embeddings(retriever)

  results = []  # query_id, doc_id, score

  for query in queries:
    query_id = query["id"]

    # Perform the query and save the documents retrieved
    candidate_documents = make_query(query["text"], retriever, top_k, ranker=ranker, calc_score=calc_scores)
    for d in candidate_documents:
      results.append(ScoredDoc(query_id, d.id, d.score))

    # Show the documents retrieved
    if (print_results):
      print("*"*50)
      print(f"QUERY: {query['text']}")
      for d in candidate_documents:
        print(f"Score({d.score}) Id: {d.id}  Content: {d.content[0:50]}")

  return results

In [None]:
bm25_run = make_queries(queries, bm25_retriever, 50, print_results=False)
bm25_run

[ScoredDoc(query_id='1', doc_id='8ccl9aui', score=0.7745283464148603),
 ScoredDoc(query_id='1', doc_id='kqqantwg', score=0.7744898263297265),
 ScoredDoc(query_id='1', doc_id='12dcftwt', score=0.7744898263297265),
 ScoredDoc(query_id='1', doc_id='es7q6c90', score=0.7738326195877332),
 ScoredDoc(query_id='1', doc_id='pl48ev5o', score=0.7693328369767511),
 ScoredDoc(query_id='1', doc_id='h8ahn8fw', score=0.7693328369767511),
 ScoredDoc(query_id='1', doc_id='6foz003n', score=0.7667897300355583),
 ScoredDoc(query_id='1', doc_id='jpnbppry', score=0.7667897300355583),
 ScoredDoc(query_id='1', doc_id='558awj1m', score=0.7651467195418238),
 ScoredDoc(query_id='1', doc_id='t7gpi2vo', score=0.7651467195418238),
 ScoredDoc(query_id='1', doc_id='e3wjo0yk', score=0.7651467195418238),
 ScoredDoc(query_id='1', doc_id='4dtk1kyh', score=0.7646983258337638),
 ScoredDoc(query_id='1', doc_id='ne5r4d4b', score=0.7594805151469529),
 ScoredDoc(query_id='1', doc_id='r0peje13', score=0.7591065399280874),
 Score

In [None]:
tfidf_run = make_queries(queries, tfidf_retriever, 50, calc_scores=True, print_results=False)
tfidf_run

[ScoredDoc(query_id='1', doc_id='pl48ev5o', score=0.6424834781394142),
 ScoredDoc(query_id='1', doc_id='h8ahn8fw', score=0.6424834781394142),
 ScoredDoc(query_id='1', doc_id='jkejiuf2', score=0.552093085289193),
 ScoredDoc(query_id='1', doc_id='dv9m19yk', score=0.5221092587168044),
 ScoredDoc(query_id='1', doc_id='beguhous', score=0.5145681490154591),
 ScoredDoc(query_id='1', doc_id='6foz003n', score=0.44894587692710586),
 ScoredDoc(query_id='1', doc_id='jpnbppry', score=0.44111053049107063),
 ScoredDoc(query_id='1', doc_id='be0mr85h', score=0.4386990199762382),
 ScoredDoc(query_id='1', doc_id='bp9xz9wk', score=0.4386990199762382),
 ScoredDoc(query_id='1', doc_id='j1cdoxqs', score=0.4386990199762382),
 ScoredDoc(query_id='1', doc_id='kvb7moqt', score=0.42836891639832636),
 ScoredDoc(query_id='1', doc_id='4977dzxz', score=0.42836891639832636),
 ScoredDoc(query_id='1', doc_id='v9f5jck9', score=0.41341790983452653),
 ScoredDoc(query_id='1', doc_id='a7w6lael', score=0.40994736075029725),
 

In [None]:
from ir_measures import Qrel

def load_qrels(path):
  '''Reads the query relevances from the given file.'''
  qrels = []

  with open(path) as f:
    f.readline()  # skip header

    for qrel_line in f:
      query_id, doc_id, relevance = qrel_line.rstrip().split("\t")
      qrel = Qrel(query_id, doc_id, int(relevance))
      qrels.append(qrel)

    return qrels

qrels = load_qrels("trec-covid/qrels/test.tsv")

In [None]:
import ir_measures
from ir_measures import MAP, P, R, nDCG, AP

metrics = [MAP, P@5, P@10, P@20, R@5, R@10, R@20, nDCG@10]  # MAP = AP
evaluator = ir_measures.evaluator(metrics, qrels)

In [None]:
bm25_measures = evaluator.calc_aggregate(bm25_run)
bm25_measures

{P@20: 0.628,
 R@5: 0.008743070585130616,
 nDCG@10: 0.6103525118191296,
 R@20: 0.030435964722485494,
 P@10: 0.69,
 P@5: 0.7,
 AP: 0.05024216889672638,
 R@10: 0.017536952661119924}

In [None]:
tfidf_measures = evaluator.calc_aggregate(tfidf_run)
tfidf_measures

{P@20: 0.44099999999999995,
 R@5: 0.005951740782327519,
 nDCG@10: 0.41380591302681763,
 R@20: 0.021608592853643936,
 P@10: 0.46400000000000013,
 P@5: 0.4720000000000001,
 AP: 0.03171703159523822,
 R@10: 0.011692836207329265}

**Precision at K (P@K):**
  - P@5, P@10 and P@20 for BM25 are relatively high (0.7, 0.69 and 0.628, respectively), meaning that on average 70%, 69% and 62.8% of the top 5, 10 and 20 retrieved documents are relevant to the query. This shows that BM25 is effective at placing relevant documents in the top results.

  - For Tfidf, P@5, P@10 and P@20 are much lower (0.472, 0.464 and 0.441, respectively), so Tfidf places fewer relevant documents in the top results than BM25.


**Recall at K (R@K):**
  - R@5, R@10 and R@20 for BM25 are 0.0087, 0.0175 and 0.0304, respectively, which indicates that the top 5, 10 and 20 results contain only a small fraction of all the relevant documents.

  - For Tfidf, R@5, R@10 and R@20 are even lower (0.0060, 0.0117 and 0.0216), indicating that it misses more relevant documents within the top 5, 10 and 20 results.


**Mean Average Precision (MAP):**
  - BM25 has a higher MAP (reported as AP by `ir_measures`) of 0.0502, indicating that, on average, it ranks the relevant documents higher than Tfidf does. This suggests that BM25 provides a more consistent ranking quality.

  - Tfidf has a lower MAP of 0.0317, which means that its ranking quality is less consistent than BM25's.


 **Normalized Discounted Cumulative Gain (nDCG@10):**
  - BM25's nDCG@10 is 0.6104, suggesting that it provides a much higher degree of relevance in the top 10 results. The retrieval scores of its top 50 results are higher than Tfidf's, but they are also more uniform (the difference between the first and the 50th document is lower than 0.10).

  - Tfidf's nDCG@10 is 0.4138, indicating that the degree of relevance in the top 10 results is lower than with BM25. Its retrieval scores are lower but more diverse, with differences of up to 0.50 between the first and the last document of the ranking.

**TREC-COVID Round 1 Leaderboard:**
  - In comparison with the top submissions of this leaderboard, we notice that we have obtained a very good nDCG@10 with BM25, since the best one obtains a value of 0.6844 and the top 6 obtains a value of 0.6082 (lower than ours).

  - Regarding P@5, our retrievers are much worse in this metric. The top 1 from the leaderboard obtains a P@5 of 0.8333, while our best retriever (BM25) would be in the 14th position (tied with 15th and 16th).

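  - In terms of MAP, BM25 would not even be in the top 30 of the list, since the best runs show a MAP higher than 0.3. Note, however, that our runs contain only 50 documents per query, which limits the MAP they can reach when a query has many more relevant documents than that.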


**Conclusions:**
- BM25Retriever outperforms TfidfRetriever in all the evaluated metrics, including precision, recall, mean average precision, and nDCG.

- BM25 retrieves more relevant documents in the top results, as indicated by higher P@K scores.

- BM25 provides a more consistent ranking quality, with a higher MAP.

- TfidfRetriever has lower precision, recall, and nDCG values, suggesting that it may not be as effective as BM25 in retrieving relevant information.

- The choice of retrieval method can significantly impact the quality of search results, and BM25 appears to be a better choice for this specific task.

- Both retrievers return very low recall values. Note that recall divides the number of relevant documents retrieved by the total number of relevant documents for the query, so it is bounded by the cutoff: the top K results can contain at most K relevant documents. With a larger cutoff the recall can only increase, while with a smaller one it can only decrease. For this reason we pay more attention to precision, to avoid wasting time tuning this parameter to find an optimal value.

- BM25 shows top results for nDCG@10, implying that it ranks the first documents properly, but its MAP is much worse than the values on the leaderboard, which suggests a lower-quality ranking overall (at least in part because our runs are limited to 50 documents per query).
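
The remarks on recall and MAP above rest on a simple bound, sketched below as worked arithmetic. If a query has R relevant documents, the top K results can contain at most min(K, R) of them, so R@K is at most min(K, R) / R; likewise, a run truncated to 50 documents cannot reach an AP above min(50, R) / R. The value `n_relevant = 500` is a hypothetical round number chosen for illustration, not a figure measured from the qrels, and the function names are ours.

```python
def max_recall_at_k(k, n_relevant):
    # Best case: all of the top k results are relevant
    return min(k, n_relevant) / n_relevant

def max_average_precision(run_depth, n_relevant):
    # Best case: every retrieved document is relevant, so the precision is 1.0
    # at each hit and AP = (number of hits) / (number of relevant documents)
    return min(run_depth, n_relevant) / n_relevant

n_relevant = 500  # hypothetical number of relevant documents for one query
recall_caps = {k: max_recall_at_k(k, n_relevant) for k in (5, 10, 20)}
ap_cap = max_average_precision(50, n_relevant)
print(recall_caps, ap_cap)
```

Under this assumption even a perfect ranking would stay at or below 0.04 for R@20 and 0.1 for AP, so low absolute values for these two measures do not by themselves indicate a poor retriever.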

In summary, the evaluation results indicate that BM25Retriever is a more effective retrieval method for the given TREC-COVID queries when using an ElasticsearchDocumentStore. It provides better precision, recall, and ranking quality compared to TfidfRetriever.

### Extra

Complete the tests with the *BM25Retriever* using a *Ranker*.

Use a *SentenceTransformersRanker* and configure it with one of the CrossEncoder models.

In [None]:
from haystack.nodes import SentenceTransformersRanker

ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")


In [None]:
bm25_ranked_run = make_queries(queries, bm25_retriever, 50, ranker=ranker, print_results=False)
bm25_ranked_run

[ScoredDoc(query_id='1', doc_id='ne5r4d4b', score=0.9997484087944031),
 ScoredDoc(query_id='1', doc_id='4dtk1kyh', score=0.9997424483299255),
 ScoredDoc(query_id='1', doc_id='e6h1qvdk', score=0.9996529817581177),
 ScoredDoc(query_id='1', doc_id='3ll2tlzr', score=0.9995995163917542),
 ScoredDoc(query_id='1', doc_id='fs07zdu6', score=0.9994439482688904),
 ScoredDoc(query_id='1', doc_id='d2knbzhl', score=0.999431312084198),
 ScoredDoc(query_id='1', doc_id='1mjaycee', score=0.999190628528595),
 ScoredDoc(query_id='1', doc_id='t1iagum7', score=0.9991310238838196),
 ScoredDoc(query_id='1', doc_id='es7q6c90', score=0.9990391731262207),
 ScoredDoc(query_id='1', doc_id='utsr0zv7', score=0.9990293979644775),
 ScoredDoc(query_id='1', doc_id='wuegn0jg', score=0.9989718198776245),
 ScoredDoc(query_id='1', doc_id='yzp9wjuk', score=0.9989283680915833),
 ScoredDoc(query_id='1', doc_id='12dcftwt', score=0.9989114999771118),
 ScoredDoc(query_id='1', doc_id='kqqantwg', score=0.9989084005355835),
 ScoredD

In [None]:
bm25_ranked_measures = evaluator.calc_aggregate(bm25_ranked_run)
bm25_ranked_measures

{R@5: 0.009553427367945845,
 P@20: 0.6780000000000002,
 AP: 0.05565376213220255,
 P@10: 0.7199999999999999,
 R@20: 0.03311666783922393,
 nDCG@10: 0.6498648582417755,
 R@10: 0.01781965633239111,
 P@5: 0.748}

In [None]:
import statistics

def compare_bm25s(bm25_run, bm25_ranked_run):
  # Check differences in the documents retrieved
  print("Without rank:\t", bm25_run)
  print("With rank:\t", bm25_ranked_run)

  # Show common results
  common_results = []
  scores = []
  ranked_scores = []
  for doc in bm25_run:
    for doc_ranked in bm25_ranked_run:
      if doc.doc_id == doc_ranked.doc_id:
        common_results.append({"doc_id": doc.doc_id, "score": doc.score, "ranked_score": doc_ranked.score})
        scores.append(doc.score)
        ranked_scores.append(doc_ranked.score)

  print(f"Found {len(common_results)} common results:", common_results)
  print("Mean score without rank:", statistics.mean(scores), "+-", statistics.stdev(scores))
  print("Mean score with rank:", statistics.mean(ranked_scores), "+-", statistics.stdev(ranked_scores))
  print()

compare_bm25s(bm25_run[0:50], bm25_ranked_run[0:50])  # query 1
compare_bm25s(bm25_run[50:100], bm25_ranked_run[50:100])  # query 2

Without rank:	 [ScoredDoc(query_id='1', doc_id='8ccl9aui', score=0.7745283464148603), ScoredDoc(query_id='1', doc_id='kqqantwg', score=0.7744898263297265), ScoredDoc(query_id='1', doc_id='12dcftwt', score=0.7744898263297265), ScoredDoc(query_id='1', doc_id='es7q6c90', score=0.7738326195877332), ScoredDoc(query_id='1', doc_id='pl48ev5o', score=0.7693328369767511), ScoredDoc(query_id='1', doc_id='h8ahn8fw', score=0.7693328369767511), ScoredDoc(query_id='1', doc_id='6foz003n', score=0.7667897300355583), ScoredDoc(query_id='1', doc_id='jpnbppry', score=0.7667897300355583), ScoredDoc(query_id='1', doc_id='558awj1m', score=0.7651467195418238), ScoredDoc(query_id='1', doc_id='t7gpi2vo', score=0.7651467195418238), ScoredDoc(query_id='1', doc_id='e3wjo0yk', score=0.7651467195418238), ScoredDoc(query_id='1', doc_id='4dtk1kyh', score=0.7646983258337638), ScoredDoc(query_id='1', doc_id='ne5r4d4b', score=0.7594805151469529), ScoredDoc(query_id='1', doc_id='r0peje13', score=0.7591065399280874), Scor

We observe that using a Ranker after the BM25 retrieval slightly improves the results. For example, we obtain a MAP of 0.056 (vs 0.050), an nDCG@10 of 0.650 (vs 0.610) and a P@5 of 0.748 (vs 0.7). All the evaluated metrics improve with the Ranker.

We also observe differences in the scores given to the ranked documents. For some queries the scores are higher and even more uniform than before: for query 1, BM25 alone returns a mean score of 0.753 +- 0.012, while the Ranker yields 0.998 +- 0.001. For other queries the ranked scores show more diversity: for query 2 the Ranker yields a mean score of 0.724 +- 0.296, whereas BM25 alone returned 0.825 +- 0.015.
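
One possible explanation for these score distributions is the way raw scores are mapped into the (0, 1) interval. The sketch below assumes, without having checked it against the Haystack source, that BM25 scores are squashed with a logistic function applied to `raw / 8` and that cross-encoder logits are squashed with a plain logistic function. Both the constant 8 and all the raw values are illustrative assumptions, not measurements from this run.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw BM25 scores for one query (illustrative values only)
raw_bm25 = [10.0, 9.5, 9.0, 8.5]
# Hypothetical raw cross-encoder logits for the same documents
raw_logits = [8.0, 7.0, 1.0, -2.0]

# Assumed scaling (see the text above): dividing by 8 flattens the differences
scaled_bm25 = [sigmoid(s / 8) for s in raw_bm25]
# Logits pass through the sigmoid directly, so large ones saturate near 1.0
scaled_logits = [sigmoid(z) for z in raw_logits]

print([round(s, 3) for s in scaled_bm25])
print([round(s, 3) for s in scaled_logits])
```

If the scaling works this way, nearly uniform BM25 scores would be an artifact of the compression rather than a sign that the documents are nearly identical, and ranked scores close to 1.0 would correspond to saturated logits.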

Notice that the documents in the final top results are not the same. For query 1, for example, there are 36 common results between BM25 alone and BM25 with a Ranker, while for query 2 there are 25 common results.
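
The overlap counts can also be obtained for every query at once by grouping each run by `query_id` and intersecting the document ids, which avoids relying on each query having exactly 50 results at fixed list positions. This is a minimal sketch under stated assumptions: the `ScoredDoc` namedtuple is a stand-in for `ir_measures.ScoredDoc` so that the example runs on its own, and the helper names and toy runs are hypothetical.

```python
from collections import namedtuple

# Stand-in for ir_measures.ScoredDoc so the sketch runs without extra packages
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])

def runs_by_query(run):
    """Group a flat list of ScoredDoc by query_id."""
    grouped = {}
    for d in run:
        grouped.setdefault(d.query_id, []).append(d)
    return grouped

def common_doc_ids(run_a, run_b):
    """Return the doc_ids present in both runs, in the order of `run_a`."""
    ids_b = {d.doc_id for d in run_b}  # set lookup instead of a nested loop
    return [d.doc_id for d in run_a if d.doc_id in ids_b]

# Hypothetical toy runs
run_a = [ScoredDoc("1", "a", 0.9), ScoredDoc("1", "b", 0.8), ScoredDoc("2", "c", 0.7)]
run_b = [ScoredDoc("1", "b", 0.99), ScoredDoc("1", "x", 0.5), ScoredDoc("2", "c", 0.4)]

grouped_a, grouped_b = runs_by_query(run_a), runs_by_query(run_b)
overlap = {q: common_doc_ids(grouped_a[q], grouped_b.get(q, [])) for q in grouped_a}
print(overlap)
```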

In summary, we would say that the Ranker helps to build a better ranking of the retrieved documents. This is shown by the improved metrics, but also qualitatively by the larger deviation of the document scores: BM25 always returned very similar scores for all the documents retrieved for a query, while with the Ranker the scores are sometimes more uniform but sometimes show very large differences. This agrees with common sense, since it would be strange for a collection to produce, for every query, a set of retrieved documents with nearly the same score. That would imply that there are clusters of documents (each cluster associated with the results of a query) with a very high intracluster similarity (almost identical documents). This is probably not desirable, since we want a good diversity of results and not almost the same document repeated for a given query.

## **TASK 2.** Evaluate Dense Retrieval with Haystack (TODO)

### Steps
1. Create a _FAISSDocumentStore_ and add to it the TREC-COVID documents
  
2. Run the 50 queries from TREC-COVID collection using an _EmbeddingRetriever_ and a _DensePassageRetriever_ over the previous _FAISSDocumentStore_
   - For each query in `trec-covid/queries.jsonl` use as query text the field `text` (**NOTE: Important**)
   - _EmbeddingRetriever_ needs an `embedding_model` parameter, for example `embedding_model="sentence-transformers/multi-qa-mpnet-base-cos-v1"` using a model from [_SentenceTransformers_ models](https://www.sbert.net/docs/pretrained_models.html#)
   - _DensePassageRetriever_ uses two encoder models, one to embed the query and one to embed the document. By default these models are `query_embedding_model="facebook/dpr-question_encoder-single-nq-base"` and `passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"` from [Hugging Face models](https://huggingface.co/models?other=dpr)
   - See examples at https://haystack.deepset.ai/tutorials/06_better_retrieval_via_embedding_retrieval
   
3. Evaluate the retrieval performance of both methods using the following measures
   - MAP (Mean Average Precision)
   - P@5, P@10 and P@20 (Precision at 5, 10 and 20)
   - R@5, R@10 and R@20 (Recall at 5, 10 and 20)
   - _nDCG@10_ (NDCG at 10) [to compare with [TREC COVID Round 1 leaderboard](https://castorini.github.io/TREC-COVID/round1/)]


   To compute these measures you can employ one of the following libraries ([ir-measures](https://ir-measur.es/en/latest/), [pytrec_eval](https://github.com/cvangysel/pytrec_eval), [trectools](https://github.com/joaopalotti/trectools))

### Preliminary remarks (about using FAISSDocumentStore)

- _FAISSDocumentStore_ mixes two storage methods: a SQLite database to store _Document_ text and metadata and a FAISS index to store dense vectors
    - The SQLite database file is written automatically, but the FAISS index must be explicitly saved to disk by calling the _save(...)_ method (see [FAISSDocumentStore documentation](https://docs.haystack.deepset.ai/reference/document-store-api#faissdocumentstore))
    - Vector embeddings for the stored _Documents_ are computed and indexed in FAISS by calling _update_embeddings(retriever)_ with a dense Retriever instance
- Dense Retrievers (both _EmbeddingRetriever_ and _DensePassageRetriever_) require a GPU and will take more than 1 hour to process the TREC-COVID collection.


### Start here ...

In [None]:
from haystack.document_stores import FAISSDocumentStore

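faiss_store = FAISSDocumentStore()
faiss_store.write_documents(docs)

# Persist the FAISS index explicitly (the SQLite database is written automatically).
# Embeddings are only added later by update_embeddings(), so save again after that step.
faiss_store.save("my_faiss_index.faiss")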

print(f"Number of indexed docs. {faiss_store.get_document_count()}")

Number of indexed docs. 171332





In [None]:
from haystack.nodes import EmbeddingRetriever, DensePassageRetriever

# Haystack warns that dot_product is used instead of cosine similarity.
# According to https://www.sbert.net/docs/pretrained_models.html#multi-qa-models
# dot product is also valid for this model, because its embeddings are normalized.
embedding_retriever = EmbeddingRetriever(
    document_store=faiss_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-cos-v1"
)

densepassage_retriever = DensePassageRetriever(
    document_store=faiss_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)
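
The comment on the `EmbeddingRetriever` relies on the fact that dot product and cosine similarity coincide for unit-length vectors. A minimal check of that identity, using made-up 3-dimensional vectors rather than real model embeddings (the helper names are only illustrative):

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def cosine(u, v):
    """Cosine similarity of two (not necessarily normalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 2.0, 2.0]
doc = [2.0, 0.5, 1.0]

cos_raw = cosine(query, doc)                      # cosine on the raw vectors
dot_unit = dot(normalize(query), normalize(doc))  # dot product after normalization

print(cos_raw, dot_unit)
```

Both values agree up to floating-point error, which is why a dot-product index ranks normalized embeddings in the same order as cosine similarity would.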

In [None]:
import json

def load_queries(path):
  '''Reads the id (`_id`) and text (`text`) of each query from a JSON Lines file.'''
  queries = []

  with open(path, encoding="utf-8") as f:
    for query_line in f:
      data = json.loads(query_line)
      query_id = data["_id"]
      query_text = data["text"]
      queries.append({"id": query_id, "text": query_text})

  return queries

queries = load_queries("trec-covid/queries.jsonl")

In [None]:
embedding_run = make_queries(queries, embedding_retriever, 50, update_embeddings=True, print_results=False)
embedding_run

Documents Processed: 180000 docs [1:01:15, 48.97 docs/s]

[ScoredDoc(query_id='1', doc_id='dv9m19yk', score=0.5020940632870664),
 ScoredDoc(query_id='1', doc_id='1q2gqh22', score=0.5020714919958008),
 ScoredDoc(query_id='1', doc_id='h2m0rhk1', score=0.5020714919958008),
 ScoredDoc(query_id='1', doc_id='rzpbpxw2', score=0.5020711006980231),
 ScoredDoc(query_id='1', doc_id='gdfxiosc', score=0.5020655839355196),
 ScoredDoc(query_id='1', doc_id='qp0h50t3', score=0.5020458558766471),
 ScoredDoc(query_id='1', doc_id='wy0y5ztd', score=0.5020380956303417),
 ScoredDoc(query_id='1', doc_id='jb05x03a', score=0.5020380956303417),
 ScoredDoc(query_id='1', doc_id='u7u75sl0', score=0.5020207333813757),
 ScoredDoc(query_id='1', doc_id='utsr0zv7', score=0.5019837872897543),
 ScoredDoc(query_id='1', doc_id='1mjaycee', score=0.5019790714443196),
 ScoredDoc(query_id='1', doc_id='cs1g1z9b', score=0.5019595401992852),
 ScoredDoc(query_id='1', doc_id='y3uzb4dx', score=0.5019250273956012),
 ScoredDoc(query_id='1', doc_id='8lm4tkpd', score=0.5019182210927372),
 Score

In [None]:
densepassage_run = make_queries(queries, densepassage_retriever, 50, update_embeddings=True, print_results=False)
densepassage_run

[ScoredDoc(query_id='1', doc_id='7u41zswj', score=0.6854896717042651),
 ScoredDoc(query_id='1', doc_id='0tv54lld', score=0.6838125611637117),
 ScoredDoc(query_id='1', doc_id='873txs85', score=0.6835343772785598),
 ScoredDoc(query_id='1', doc_id='10l6kx3e', score=0.6828443243706618),
 ScoredDoc(query_id='1', doc_id='5044ruwl', score=0.6823845142039454),
 ScoredDoc(query_id='1', doc_id='q7r2vfo1', score=0.6822365514530792),
 ScoredDoc(query_id='1', doc_id='e9dianoi', score=0.6819731637885053),
 ScoredDoc(query_id='1', doc_id='1ajqchgu', score=0.6807269632122527),
 ScoredDoc(query_id='1', doc_id='645s1i7e', score=0.6799853610943891),
 ScoredDoc(query_id='1', doc_id='d4h49d8b', score=0.6795969164833516),
 ScoredDoc(query_id='1', doc_id='bpavqshc', score=0.679561879452983),
 ScoredDoc(query_id='1', doc_id='fmri74v7', score=0.6795212246570593),
 ScoredDoc(query_id='1', doc_id='goordy3i', score=0.6793532434558165),
 ScoredDoc(query_id='1', doc_id='mqp3pjx6', score=0.6792454411476504),
 Scored

In [None]:
import ir_measures
from ir_measures import MAP, P, R, nDCG, AP

metrics = [MAP, P@5, P@10, P@20, R@5, R@10, R@20, nDCG@10]  # ir_measures reports MAP under the name AP
evaluator = ir_measures.evaluator(metrics, qrels)

In [None]:
embedding_measures = evaluator.calc_aggregate(embedding_run)
embedding_measures

{P@5: 0.6559999999999999,
 R@10: 0.016658712296898495,
 R@20: 0.030905642273255936,
 P@10: 0.622,
 P@20: 0.6050000000000001,
 R@5: 0.008902055872757466,
 nDCG@10: 0.5957348271940447,
 AP: 0.047685854976561484}

In [None]:
densepassage_measures = evaluator.calc_aggregate(densepassage_run)
densepassage_measures

{P@5: 0.248,
 R@10: 0.005710123154555368,
 R@20: 0.009451163214174085,
 P@10: 0.23399999999999996,
 P@20: 0.2,
 R@5: 0.003086938969510733,
 nDCG@10: 0.21240929541905182,
 AP: 0.00787999885532798}

**Precision at K (P@K):**
  - P@5, P@10 and P@20 for the Embedding Retriever are relatively high (0.656, 0.622 and 0.605, respectively): on average, 65.6%, 62.2% and 60.5% of the top 5, 10 and 20 retrieved documents are relevant to the query. Precision drops only slightly as the cutoff grows, so the Embedding Retriever keeps returning relevant documents well beyond the first few results.

  - For the Dense Passage Retriever, P@5, P@10 and P@20 are much lower (0.248, 0.234 and 0.200, respectively): at most about one in four of its top-ranked documents is relevant, far fewer than for the Embedding Retriever.


**Recall at K (R@K):**
  - R@5, R@10 and R@20 for the Embedding Retriever are 0.0089, 0.0167 and 0.0309, respectively, so only a small fraction of all relevant documents appears within the top results. This is expected whenever a query has far more relevant documents than the cutoff K, because R@K can then never approach 1.

  - For the Dense Passage Retriever, R@5, R@10 and R@20 are even lower (0.0031, 0.0057 and 0.0095), so it misses more relevant documents within the top 5, 10 and 20 results.
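
The low recall values can be put in context with a back-of-the-envelope estimate. The sketch below uses the aggregate P@20 and R@20 reported above for the Embedding Retriever to approximate how many relevant documents a query has on average. Dividing two per-query means is only a rough approximation of the true per-query figure, so the result should be read as an order of magnitude; the variable names are illustrative.

```python
# Aggregate values reported above for the Embedding Retriever.
p_at_20 = 0.6050000000000001
r_at_20 = 0.030905642273255936
k = 20

# P@k * k is the mean number of relevant documents found in the top k.
relevant_in_top_k = p_at_20 * k

# Rough estimate of relevant documents per query (ratio of means, an approximation).
est_relevant_per_query = relevant_in_top_k / r_at_20

# Best possible R@k if every one of the top-k documents were relevant.
max_recall_at_k = k / est_relevant_per_query

print(f"~{relevant_in_top_k:.1f} relevant documents in the top {k}")
print(f"~{est_relevant_per_query:.0f} relevant documents per query (rough estimate)")
print(f"upper bound on R@{k}: ~{max_recall_at_k:.3f}")
```

Under this estimate each query has a few hundred relevant documents, so even a perfect top-20 list could only reach a recall of about 0.05.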


**Mean Average Precision (MAP, reported as AP):**
  - The Embedding Retriever has the higher MAP (0.0477), indicating that, on average, it ranks relevant documents higher in the results and provides more consistent ranking quality. The value is low in absolute terms because AP is normalized by the total number of relevant documents for the query, most of which do not appear in the retrieved list.

  - The Dense Passage Retriever has a much lower MAP (0.0079), which reflects its poorer ranking quality.


**Normalized Discounted Cumulative Gain (nDCG@10):**
  - The Embedding Retriever's nDCG@10 is 0.5957, suggesting that, on average, relevant documents are not only more likely to be retrieved but also ranked higher in the search results than with the Dense Passage Retriever (with an nDCG@10 of 0.2124).


**TREC-COVID Round 1 Leaderboard:**
  - The Embedding Retriever's nDCG@10 would place it in the top 10 of the leaderboard, although near the bottom of that group.

  - In terms of P@5, our retrievers are less competitive. The top-ranked run in the table has a P@5 of 0.8333, while our best retriever (Embedding) would rank 23rd.

  - In terms of MAP, the Embedding Retriever falls far short: it would rank 92nd, well outside the top 50.


**Conclusions:**
- The EmbeddingRetriever outperforms the DensePassageRetriever on every evaluated metric: precision, recall, mean average precision and nDCG.

- The EmbeddingRetriever's higher P@K scores highlight its ability to return relevant documents in the top search results.

- The EmbeddingRetriever's higher MAP highlights its ability to rank relevant documents ahead of non-relevant ones.

- These results emphasize the importance of selecting the right retrieval method, as the choice can drastically influence the quality of the search results. In this context, the EmbeddingRetriever stands out as the preferable option.

In summary, the evaluation results show that the EmbeddingRetriever is the more effective of the two dense retrievers for the given TREC-COVID queries when combined with a FAISSDocumentStore.

### Extra

Complete *DensePassageRetriever* tests using query and context encoders pretrained with biomedical texts.

In [None]:
from haystack.nodes import DensePassageRetriever

densepassage_biomedical_retriever = DensePassageRetriever(
    document_store=faiss_store,
    query_embedding_model="ncbi/MedCPT-Query-Encoder",
    passage_embedding_model="ncbi/MedCPT-Article-Encoder"
)

In [None]:
densepassage_biomedical_run = make_queries(queries, densepassage_biomedical_retriever, 50, update_embeddings=True, print_results=False)
densepassage_biomedical_run

[ScoredDoc(query_id='1', doc_id='dv9m19yk', score=0.6683160626067299),
 ScoredDoc(query_id='1', doc_id='pl48ev5o', score=0.6661775836340293),
 ScoredDoc(query_id='1', doc_id='h8ahn8fw', score=0.6661775836340293),
 ScoredDoc(query_id='1', doc_id='gyj5213f', score=0.6648139150818296),
 ScoredDoc(query_id='1', doc_id='jkejiuf2', score=0.6643667583639502),
 ScoredDoc(query_id='1', doc_id='a7w6lael', score=0.6628848558852525),
 ScoredDoc(query_id='1', doc_id='89qmvo9a', score=0.6624332250253521),
 ScoredDoc(query_id='1', doc_id='mxvbbkc4', score=0.6621677806726863),
 ScoredDoc(query_id='1', doc_id='jpnbppry', score=0.661964055022119),
 ScoredDoc(query_id='1', doc_id='ytimgqb3', score=0.6618759743593672),
 ScoredDoc(query_id='1', doc_id='3fiz0tqy', score=0.6618210441777124),
 ScoredDoc(query_id='1', doc_id='dtv7to3l', score=0.6617693542559895),
 ScoredDoc(query_id='1', doc_id='wadf3d4a', score=0.661202339139029),
 ScoredDoc(query_id='1', doc_id='a4jn6gpk', score=0.6611945456462659),
 ScoredD

In [None]:
densepassage_biomedical_measures = evaluator.calc_aggregate(densepassage_biomedical_run)
densepassage_biomedical_measures

{R@20: 0.03349084608111749,
 AP: 0.05293514716959453,
 R@10: 0.01684344931740832,
 nDCG@10: 0.6131918056404521,
 P@10: 0.6679999999999999,
 P@20: 0.6549999999999998,
 P@5: 0.6840000000000002,
 R@5: 0.008397279897031429}

With query and context encoders pretrained on biomedical texts, the DensePassageRetriever shows large improvements in the precision metrics (P@5, P@10 and P@20 rise from 0.248, 0.234 and 0.200 to 0.684, 0.668 and 0.655), indicating an increased ability to retrieve relevant documents at different retrieval depths.

The recall metrics (R@5, R@10, R@20) also improve with respect to the default DensePassageRetriever, although they remain quite low in absolute terms.

Average Precision (AP) and nDCG@10, which assess ranking quality, improve substantially as well (AP from 0.0079 to 0.0529 and nDCG@10 from 0.2124 to 0.6132). This indicates that biomedical pretraining improves the ranking of the retrieved documents, contributing to better overall search results.

These findings emphasize the importance of using encoders pretrained on domain-specific texts. Such pretraining can substantially improve the performance of the DensePassageRetriever, making it a recommended choice for tasks in this domain.

However, although the improvement over the default DensePassageRetriever is large, the margin over the Embedding Retriever is small (for example, an nDCG@10 of 0.6132 versus 0.5957). This is notable given that the MedCPT encoders were trained specifically on biomedical documents, whereas the Embedding Retriever's model was trained on general-domain text.
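
To put these comparisons in numbers, the sketch below computes the ratio between the biomedical DensePassageRetriever scores and those of the two earlier dense retrievers. The values are the aggregate results reported above, rounded to four decimals; the dictionary and function names are only illustrative.

```python
# Aggregate scores copied from the outputs above, rounded to four decimals.
embedding = {
    "P@5": 0.6560, "P@10": 0.6220, "P@20": 0.6050,
    "R@5": 0.0089, "R@10": 0.0167, "R@20": 0.0309,
    "nDCG@10": 0.5957, "AP": 0.0477,
}
dpr_default = {
    "P@5": 0.2480, "P@10": 0.2340, "P@20": 0.2000,
    "R@5": 0.0031, "R@10": 0.0057, "R@20": 0.0095,
    "nDCG@10": 0.2124, "AP": 0.0079,
}
dpr_biomedical = {
    "P@5": 0.6840, "P@10": 0.6680, "P@20": 0.6550,
    "R@5": 0.0084, "R@10": 0.0168, "R@20": 0.0335,
    "nDCG@10": 0.6132, "AP": 0.0529,
}

def ratios(new, baseline):
    """Return new/baseline for every measure present in both dictionaries."""
    return {m: new[m] / baseline[m] for m in new if m in baseline}

vs_default = ratios(dpr_biomedical, dpr_default)
vs_embedding = ratios(dpr_biomedical, embedding)

for measure in dpr_biomedical:
    print(f"{measure:8s} x{vs_default[measure]:.2f} vs default DPR, "
          f"x{vs_embedding[measure]:.2f} vs EmbeddingRetriever")
```

On these rounded values, the biomedical encoders score roughly 2.7 to 6.7 times higher than the default DensePassageRetriever, but stay within about 11% of the EmbeddingRetriever and are slightly below it on R@5.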

# Conclusions

As general remarks on this work, we would highlight:
- The specific combination of document store and retriever is essential. We have seen the clear superiority of the BM25 retriever over the TF-IDF one, and of the Embedding retriever over the DensePassage one. Therefore, the combination of document store and retriever should be chosen carefully for the problem at hand.

- On this collection, sparse retrievers perform much better than dense ones. BM25 achieves the highest precision, MAP and nDCG@10 of all the retrievers. As for recall, only the dense Embedding retriever outperforms BM25, on R@5 and R@20 and by a small margin. A possible reason is that we are dealing with a very specific domain, with documents related to COVID topics. Models trained on general-domain text may embed all the domain-specific documents into almost the same region of the embedding space, making them hard to tell apart. The very similar scores of all the documents returned by the dense retrievers are consistent with this explanation.

- The use of a Ranker helps the retriever return relevant documents, as shown by the slight improvement on all the metrics with respect to the standalone BM25 retriever.

- Relying on encoders trained on domain-specific text is a better choice when selecting a dense retriever, since they can capture fine-grained information about the domain and distinguish between very similar documents. However, we still did not get outstanding results with the biomedical encoders, probably because our TREC-COVID collection is much more specific than the domain on which they were trained.
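
One caveat applies when judging embedding quality from the printed scores: they may lie in a narrow band partly because of score scaling. Assuming the runs used Haystack 1.x's default scaling for dot-product similarity, which to our knowledge maps a raw score `x` to `expit(x / 100)` (this is an assumption about the setup, since the retrieval call inside `make_queries` is not shown here), the raw similarities can be recovered with the inverse (logit) transform. The function name below is hypothetical.

```python
import math

def unscale_dot_product(score):
    """Invert score = 1 / (1 + exp(-raw / 100)); assumed scaling, see the text above."""
    return 100 * math.log(score / (1 - score))

# First and 14th scores printed above for query 1 in each run.
printed = {
    "EmbeddingRetriever": (0.5020940632870664, 0.5019182210927372),
    "DensePassageRetriever": (0.6854896717042651, 0.6792454411476504),
    "DensePassageRetriever (MedCPT)": (0.6683160626067299, 0.6611945456462659),
}

raw = {name: tuple(unscale_dot_product(s) for s in pair)
       for name, pair in printed.items()}

for name, (top, fourteenth) in raw.items():
    print(f"{name}: raw top score {top:.3f}, raw 14th score {fourteenth:.3f}")
```

Under that assumption, the EmbeddingRetriever's top score corresponds to a raw dot product of about 0.84, a plausible cosine value for normalized vectors, so near-identical printed scores do not by themselves show that the embeddings collapse into one region.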