# Introduction

In this project, we are interested in dense information retrieval (IR) over the scientific papers related to long COVID (PASC) that are available on PubMed. There are a variety of [pretrained sentence transformers](https://huggingface.co/sentence-transformers) for IR/semantic search on SentenceTransformers 🤗. These models were trained on datasets such as [PAQ](https://github.com/facebookresearch/PAQ), [MSMARCO](https://microsoft.github.io/msmarco/), and [GooAQ](https://github.com/allenai/gooaq). While their general performance is good, they can perform rather poorly on domains that differ substantially from the domain they were trained on. The main challenge in domain adaptation for IR is the lack of labeled training data for the specific domain of interest. Adaptive pre-training with methods such as [TSDAE](https://arxiv.org/pdf/2104.06979), and [generative pseudo-labeling (GPL)](https://arxiv.org/pdf/2112.07577), are the two main techniques that have been proposed for this scenario. In this project, we use GPL to finetune [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) for IR in our domain.

### Motivation for information retrieval on publications related to long COVID
Long COVID, also known as Post-Acute Sequelae of SARS-CoV-2 infection (PASC), is defined by the WHO as the continuation or development of new symptoms 3 months after the initial SARS-CoV-2 infection, with these symptoms lasting for at least 2 months with no other explanation. PASC is a multi-organ disease for which over 200 different symptoms have been reported. While it is estimated that [about 7% of the population](https://ceal.nih.gov/sites/default/files/2023-02/CEAL-WhatYouNeedtoKnowLongCOVID-English.pdf) suffers from PASC, this complex disease and its possible treatments are not well understood. As research on this prevalent, multi-organ condition is rapidly evolving, it is important for general practitioners, patients, and other individuals concerned with or interested in this disease to be able to answer relevant questions based on the evolving research, using an information retrieval system that performs well on the related publications.

 ### Motivation for the use of GPL
Adaptive pre-training methods first pre-train on the unlabeled target corpus using methods such as MLM or TSDAE and then finetune on an existing labeled dataset. A major disadvantage of these methods is that they are computationally very expensive: to get good results, one has to use a very large labeled training dataset (on the order of tens of millions of training pairs/triplets). GPL, on the other hand, can finetune a pretrained sentence transformer/bi-encoder model using a much smaller generated training set (e.g. fewer than half a million triplets). While the performance of GPL is often not as high as TSDAE, it often achieves a significant improvement over the zero-shot model (for a comparison of different methods, see Table 1 in the [GPL paper](https://arxiv.org/pdf/2112.07577)). Given the limited computational resources for this personal project, I chose GPL over adaptive pre-training methods.

 ### Result
I compared the performance of the finetuned model with the original [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) model on a set of 30 questions that include both general questions, such as the prevalence and risk factors of PASC, and advanced scientific questions, such as its pathophysiology. While the original model performed well on general queries, it performed poorly on advanced topics. With the finetuned model, mAP@10 (mean average precision at 10) increased from 0.646 to 0.704 (about a 9% relative increase), mainly due to significantly better performance on queries related to more advanced scientific topics.
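As a rough illustration of the metric, the following sketch computes mAP@10 from binary relevance judgments, together with the relative increase quoted above. The function names are hypothetical, and the normalization (dividing by the number of relevant results found in the top 10) is one common convention that is assumed here; the evaluation code behind the reported numbers may normalize differently.

```python
def average_precision_at_k(relevance, k=10):
    """relevance: 0/1 flags for the ranked results of one query, best result first."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by the number of relevant results retrieved in the top k.
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k=10):
    """Average of AP@k over all queries."""
    scores = [average_precision_at_k(flags, k) for flags in relevance_lists]
    return sum(scores) / len(scores) if scores else 0.0


def relative_increase(before, after):
    return (after - before) / before


# relative change of the two mAP@10 values reported in this section
print(round(relative_increase(0.646, 0.704), 3))
```

With binary judgments, a query whose relevant passages sit at ranks 1 and 3 gets an AP of (1/1 + 2/3) / 2 under this convention.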

# Required Packages

In [None]:
!pip install datasets sentence_transformers faiss-gpu biopython Bio openai --quiet


In [None]:
import re
from Bio import Entrez
import pandas as pd
from google.colab import drive
import numpy as np
import random
from random import sample
from tqdm.auto import tqdm
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import util, SentenceTransformer, CrossEncoder, InputExample, losses
from datasets import Dataset, load_from_disk
import faiss
from openai import OpenAI

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


# Creating the Dataset

Our training data and our database are obtained from full-text publications available on PubMed Central (PMC) as well as abstracts available on PubMed. We obtain the top 3000 publications on PMC and the top 10000 publications on PubMed for the search term "Long COVID." We then build a dataset of all passages together with the metadata of each passage. The data for this notebook was last obtained on July 19, 2024. Of course, one can regularly obtain an updated list of publications and use the finetuned bi-encoder to perform IR on the updated list of papers.

In [None]:
'''returns the Entrez search results for the given query; the ids of the relevant papers are under the key 'IdList' '''
def search_papers(query, db='pmc', retmode='XML', max_results=5000):
  Entrez.email = "dorna.abdolazimi@gmail.com"
  handle = Entrez.esearch(db=db, term=query, retmax=max_results,  sort='relevance', retmode=retmode)
  results = Entrez.read(handle, validate=False)
  return results

'''given a list of publication's ids, returns a list of python objects (lists or dictionaries) that include each publications' info.'''
def fetch_papers(ids, db='pmc', retmode='XML', batch_size=500):
  Entrez.email = "dorna.abdolazimi@gmail.com"
  results = []
  # request the ids in comma-separated batches of at most batch_size ids
  for start in range(0, len(ids), batch_size):
    id_batch = ','.join(ids[start:start+batch_size])
    handle = Entrez.efetch(db, id=id_batch, retmode=retmode)
    batch_results = Entrez.read(handle, validate=False)
    if db=='pubmed':
      batch_results = batch_results['PubmedArticle']
    results = results + batch_results
  return results

In [None]:
'''given a dictionary that contain's a PMC paper's information, returns the paper's pmc id'''
def get_pmc_article_pmcid(paper):
  try:
    return paper['front']['article-meta']['article-id'][1]
  except:
    return None

'''given a dictionary that contain's a PMC paper's information, returns the paper's id on PubMed'''
def get_pmc_article_pmid(paper):
  try:
    return paper['front']['article-meta']['article-id'][0]
  except:
    return None

'''given a dictionary that contain's a PMC paper's information, returns the paper's abstract'''
def get_pmc_article_abstract(paper):
  try:
    return paper['front']['article-meta']['abstract'][0]['p']
  except:
    return None

'''given a dictionary that contain's a PMC paper's information, returns the paper's title'''
def get_pmc_article_title(paper):
  try:
    return paper['front']['article-meta']['title-group']['article-title']
  except:
    return None

'''given a dictionary that contain's a PMC paper's information, returns the paper's publication year'''
def get_pmc_article_year(paper):
  try:
    return paper['front']['article-meta']['pub-date'][0][-1]
  except:
    return None

'''given a dictionary that contain's a PMC paper's information, returns paper's authors as a string'''
def get_pmc_article_authors(paper):
  authors_names = []
  try:
    authors_info = paper['front']['article-meta']['contrib-group'][0]
  except:
    # return an empty string so that the return type is always str
    return ''
  for i in range(len(authors_info)):
    if isinstance(authors_info[i], list):
      if isinstance(authors_info[i][0], dict):
        try:
          authors_names.append(authors_info[i][0]['given-names']+ " " +authors_info[i][0]['surname'])
        except:
          pass
  return ', '.join(authors_names)

'''given a passage obtained from the body of a PMC publication, remove the inline citations'''
def remove_citations(text):
    text = re.sub(r'\>[^a-zA-Z]*\<', '', str(text))
    text = re.sub(r'\<[^>]*\>', '', text)
    text = re.sub(r'\(\)', '', text)
    text = re.sub(r'\[\]', '', text)
    return text
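To make the effect of the four substitutions concrete, here is a small self-contained check. The function body is copied from the cell above; the sample paragraph and its `xref` tags are made up for illustration and may not match the exact markup returned by Entrez.

```python
import re


def remove_citations(text):
    # drop '>...<' runs that contain no letters, e.g. '>1<' or '>, <'
    text = re.sub(r'\>[^a-zA-Z]*\<', '', str(text))
    # drop whatever tag markup is left
    text = re.sub(r'\<[^>]*\>', '', text)
    # drop parentheses and brackets that were emptied by the steps above
    text = re.sub(r'\(\)', '', text)
    text = re.sub(r'\[\]', '', text)
    return text


# hypothetical paragraph with two inline citation tags
sample = 'Fatigue is common [<xref rid="B1">1</xref>, <xref rid="B2">2</xref>] in PASC.'
cleaned = remove_citations(sample)
print(repr(cleaned))  # 'Fatigue is common  in PASC.'
```

Note that the citation is removed but the surrounding spaces are kept, so a double space remains where the citation used to be.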

'''given a dictionary that contain's a PMC paper's information, returns a list of the paper's passages (other than abstract)'''
def get_pmc_article_body_passages(paper, with_subtitle=False):
  passages = []
  try:
    body = paper['body']['sec']
  except:
    return None
  for i in range(len(body)):
    section = body[i]
    for l in range(len(section['p'])):
      passage = remove_citations(section['p'][l])
      if len(passage):
        passages.append(passage)
    for j in range(len(section['sec'])):
      subsection = section['sec'][j]
      for l in range(len(subsection['p'])):
        passage = remove_citations(subsection['p'][l])
        if len(passage):
          if with_subtitle:
            passages.append(subsection.get('title', '') + ': '+ passage)
          else:
            passages.append(passage)
      # passages of the nested subsubsections, visited once per subsection
      for k in range(len(subsection['sec'])):
        subsubsection = subsection['sec'][k]
        for l in range(len(subsubsection['p'])):
          passage = remove_citations(subsubsection['p'][l])
          if len(passage):
            if with_subtitle:
              passages.append(subsubsection.get('title', '') + ': '+passage)
            else:
              passages.append(passage)
  return passages

In [None]:
'''given a dictionary that contain's a PubMed paper's information, returns the paper's PubMed id'''
def get_pm_article_pmid(paper):
    try:
      return str(paper['MedlineCitation']['PMID'])
    except:
      return None

'''given a dictionary that contain's a PubMed paper's information, returns the paper's abstract'''
def get_pm_article_abstract(paper):
    try:
      return paper['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
    except:
      return None

'''given a dictionary that contain's a PubMed paper's information, returns the paper's authors as a string'''
def get_pm_article_authors(paper):
  authors_names = []
  try:
    authors_info = paper['MedlineCitation']['Article']['AuthorList']
  except:
    # return an empty string so that the return type is always str
    return ''
  for i in range(len(authors_info)):
      if isinstance(authors_info[i], dict):
        try:
          authors_names.append(authors_info[i]['ForeName']+ " " +authors_info[i]['LastName'])
        except:
          pass
  return ', '.join(authors_names)

'''given a dictionary that contain's a PubMed paper's information, returns the paper's publication date'''
def get_pm_article_year(paper):
    try:
      return paper['MedlineCitation']['Article']['ArticleDate'][0]['Year']
    except:
      return None

'''given a dictionary that contain's a PubMed paper's information, returns the paper's title'''
def get_pm_article_title(paper):
    try:
      return paper['MedlineCitation']['Article']['ArticleTitle']
    except:
      return None

In [None]:
'''creates a dictionary mapping PMC papers' ids to other info'''
def map_pmcid_to_info(pmc_papers):
  pmcid_info_map = {}
  for paper in pmc_papers:
    pmcid = get_pmc_article_pmcid(paper)
    title = get_pmc_article_title(paper)
    authors = get_pmc_article_authors(paper)
    pmid = get_pmc_article_pmid(paper)
    year = get_pmc_article_year(paper)
    pmcid_info_map[pmcid] = {'title': title, 'authors': authors, 'year': year, 'pmid': pmid}
  return pmcid_info_map

'''given a list of PMC papers, returns a list of all passages, a list of all passages with the subtitle of the passage as prefix,
a list of corresponding PMC ids of passages, and a boolean list that specify if each passage belongs to the body or abstract'''
def get_pmc_article_passages(pmc_papers):
  max_token = []
  passages = []
  passages_with_subtitle = []
  passages_pmcid = []
  passages_is_abstract = []
  for paper in pmc_papers:
    body = get_pmc_article_body_passages(paper)
    body_with_subtitle = get_pmc_article_body_passages(paper, with_subtitle=True)
    abstract = get_pmc_article_abstract(paper)
    pmcid = get_pmc_article_pmcid(paper)
    if body is not None:
      if len(body_with_subtitle) != len(body):
        print('issue')
      for passage in body:
        passages.append(passage)
        passages_pmcid.append(pmcid)
        passages_is_abstract.append(False)
      for passage in body_with_subtitle:
        passages_with_subtitle.append(passage)
    if abstract is not None:
      for passage in abstract:
        passages.append(passage)
        passages_with_subtitle.append(passage)
        passages_pmcid.append(pmcid)
        passages_is_abstract.append(True)
  return passages, passages_with_subtitle, passages_pmcid, passages_is_abstract

In [None]:
'''creates a dictionary mapping PubMed papers' ids to other info'''
def map_pmid_to_info(pm_papers):
  pmid_info_map = {}
  for paper in pm_papers:
    pmid = get_pm_article_pmid(paper)
    title = get_pm_article_title(paper)
    authors = get_pm_article_authors(paper)
    year = get_pm_article_year(paper)
    pmid_info_map[pmid] = {'title': title, 'authors': authors, 'year': year, 'pmid': pmid, 'pmcid': None}
  return pmid_info_map

'''given a list of papers, returns a list of all passages and a list of corresponding PubMed ids of passages'''
def get_pm_article_passages(pm_papers):
  max_token = []
  passages = []
  passages_pmid = []
  for paper in pm_papers:
    abstract = get_pm_article_abstract(paper)
    pmid = get_pm_article_pmid(paper)
    if abstract is not None:
      passages.append(abstract)
      passages_pmid.append(pmid)
  return passages, passages_pmid

Using the above functions, we create a dataframe of passages of PubMed Central papers and a dataframe of passages of PubMed papers.

In [None]:
pmc_results = search_papers('Long COVID', max_results=3000)
pmc_id_list = pmc_results['IdList']
pmc_papers = fetch_papers(pmc_id_list)

In [None]:
pm_results = search_papers('Long COVID', db='pubmed', max_results=10000)
pm_id_list = pm_results['IdList']
pm_papers = fetch_papers(pm_id_list, db='pubmed')

In [None]:
pmc_passages, pmc_passages_with_subtitle, pmc_passages_pmcid, pmc_passages_is_abstract = get_pmc_article_passages(pmc_papers)
df_pmc = pd.DataFrame({'passage': pmc_passages, 'passage_with_subtitle': pmc_passages_with_subtitle, 'pmcid': pmc_passages_pmcid, 'is_abstract': pmc_passages_is_abstract})

In [None]:
pmcid_to_info_dict = map_pmcid_to_info(pmc_papers)
df_pmc['title'] = df_pmc['pmcid'].map(lambda pmcid: pmcid_to_info_dict[pmcid]['title'])
df_pmc['authors'] = df_pmc['pmcid'].map(lambda pmcid: pmcid_to_info_dict[pmcid]['authors'])
df_pmc['year'] = df_pmc['pmcid'].map(lambda pmcid: pmcid_to_info_dict[pmcid]['year'])
df_pmc['pmid'] = df_pmc['pmcid'].map(lambda pmcid: pmcid_to_info_dict[pmcid]['pmid'])
df_pmc['passage_with_titles'] = df_pmc['title'] + ': ' + df_pmc['passage_with_subtitle']
df_pmc['source'] = 'pmc'

In [None]:
pm_passages, pm_passages_pmid = get_pm_article_passages(pm_papers)
df_pm = pd.DataFrame({'passage': pm_passages,  'passage_with_subtitle': pm_passages, 'pmid': pm_passages_pmid})

In [None]:
pmid_to_info_dict = map_pmid_to_info(pm_papers)
df_pm['title'] = df_pm['pmid'].map(lambda pmid: pmid_to_info_dict[pmid]['title'])
df_pm['authors'] = df_pm['pmid'].map(lambda pmid: pmid_to_info_dict[pmid]['authors'])
df_pm['year'] = df_pm['pmid'].map(lambda pmid: pmid_to_info_dict[pmid]['year'])
df_pm['pmid'] = df_pm['pmid'].map(lambda pmid: pmid_to_info_dict[pmid]['pmid'])
df_pm['pmcid'] = df_pm['pmid'].map(lambda pmid: None)
df_pm['passage_with_titles'] = df_pm['title'] + ': ' + df_pm['passage_with_subtitle']
df_pm['is_abstract'] = True
df_pm['source'] = 'pm'

We concatenate the PubMed and PubMed Central dataframes, keep only the passages that are longer than 100 words, and save the data.

In [None]:
df = pd.concat([df_pmc, df_pm])
df['passage_length'] = df['passage'].map(lambda x: len(x.split()))
df = df.loc[df['passage_length'] > 100]

In [None]:
df.size

682572

In [None]:
with open('/content/drive/My Drive/long_COVID_semantic_search/pmc_pm_long_COVID_dataset.csv', 'w') as f:
  df.to_csv(f)

# Query Generation

We use a set of 10,000 passages for finetuning.
As the first step of GPL, we use a doc2query model based on T5 to produce 10 queries for each passage (so 100,000 query-passage pairs). Note that, based on the GPL paper, using two or three times as many query-passage pairs would likely result in a better model. However, for this version of this personal project, I use 100k pairs to be able to finetune the model with the limited GPU resources available.

We use the titled version of passages, i.e. we concatenate the title of the paper and the subtitle of the section that the passage belongs to as a prefix to each passage. The reason for this decision is that the bi-encoder that we want to finetune already had a much better performance on the titled version of the passages.  
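As a minimal sketch of what "titled" means here, the helper below builds the string from a paper title, an optional section subtitle, and the passage text, mirroring the `title + ': ' + subtitle + ': ' + passage` concatenation used when the dataframe was built. The function name and the sample strings are hypothetical, and, as a simplification, a missing subtitle is skipped entirely instead of leaving an empty prefix.

```python
def titled_passage(title, passage, subtitle=None):
    """Prefix a passage with its paper title and, if present, its section subtitle."""
    body = f"{subtitle}: {passage}" if subtitle else passage
    return f"{title}: {body}"


# made-up example values
example = titled_passage(
    title="Long COVID: an overview",
    subtitle="Risk factors",
    passage="Several studies report that ...",
)
print(example)
```

The intent is that the bi-encoder sees the paper-level and section-level context of every passage, which is otherwise lost when a paper is split into paragraphs.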

In [None]:
df = pd.read_csv('/content/drive/My Drive/long_COVID_semantic_search/pmc_pm_long_COVID_dataset.csv')

In [None]:
'''returns a list of "size" passages for the purpose of finetuning our bi-encoder using GPL'''
def get_train_passages(size, abstract_ratio=0.05, with_titles=True):
  df_abstracts = df.loc[(df['is_abstract']==True) & (df['source']=='pm')]
  df_bodies = df.loc[df['is_abstract']==False]
  if with_titles:
    abstract_passages = df_abstracts['passage_with_titles'].drop_duplicates()[:int(abstract_ratio*size)].tolist()
    body_passages = df_bodies['passage_with_titles'].drop_duplicates()[:int((1-abstract_ratio)*size)].tolist()
  else:
    abstract_passages = df_abstracts['passage'].drop_duplicates()[:int(abstract_ratio*size)].tolist()
    body_passages = df_bodies['passage'].drop_duplicates()[:int((1-abstract_ratio)*size)].tolist()
  passages = body_passages + abstract_passages
  return passages

We use [doc2query/all-with_prefix-t5-base-v1](https://huggingface.co/doc2query/all-with_prefix-t5-base-v1) to generate 10 queries for each passage. Note that this model was trained with a prefix, and its output depends on the prefix. We generate 5 queries using the prefix 'text2query' and 5 more using 'answer2question'.

In [None]:
'''given a list of passages, returns 5 queries for each passage, using "all-with_prefix-t5-base-v1",
a doc2query model based on T5, and a prefix indicating the type of the output'''
def get_query_passage_pairs(passages,max_length=350, prefix='text2query', num_queries=5):
  model_name = 'doc2query/all-with_prefix-t5-base-v1'
  query_passage_pairs = []
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda()
  passage_batch = []
  passage_prefix_batch = []
  batch_size = 64
  for i in tqdm(range(len(passages))):
    passage = passages[i]
    passage_prefix = prefix + ": " + passage
    passage_prefix_batch.append(passage_prefix)
    passage_batch.append(passage)
    if len(passage_prefix_batch) == batch_size or i==len(passages)-1:
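      inputs = tokenizer(passage_prefix_batch, max_length=max_length, truncation=True, padding=True, return_tensors='pt')
      outputs = model.generate(
        input_ids=inputs['input_ids'].cuda(),
        # pass the attention mask so that the padding tokens of shorter passages are ignored
        attention_mask=inputs['attention_mask'].cuda(),
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=num_queries)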
      queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
      for j in range(len(queries)):
        query_passage_pairs.append((queries[j], passage_batch[j//num_queries]))
      passage_batch = []
      passage_prefix_batch = []
  return query_passage_pairs
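The pairing step in the function above relies on `generate` returning the sequences grouped by input: with `num_return_sequences=n`, the first `n` outputs belong to the first passage of the batch, the next `n` to the second, and so on. A minimal sketch of that index arithmetic, with a hypothetical helper name and made-up strings in place of model output:

```python
def pair_queries_with_passages(queries, passages, num_queries):
    """Map the flat list of generated queries back to the passages they came from."""
    if len(queries) != len(passages) * num_queries:
        raise ValueError("expected num_queries generated queries per passage")
    # query j was generated from passage j // num_queries
    return [(query, passages[j // num_queries]) for j, query in enumerate(queries)]


# made-up batch: 2 passages, 2 generated queries each
passages = ["passage A", "passage B"]
queries = ["query A1", "query A2", "query B1", "query B2"]
pairs = pair_queries_with_passages(queries, passages, num_queries=2)
print(pairs)
```

The length check is an addition for illustration; it guards against silently mispairing queries if the batch and the output ever get out of sync.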

In [None]:
train_corpus_size = 10000
passages = get_train_passages(train_corpus_size)

In [None]:
torch.cuda.empty_cache()
query_passage_pairs = get_query_passage_pairs(passages, prefix='text2query')


In [None]:
torch.cuda.empty_cache()
query_passage_pairs.extend(get_query_passage_pairs(passages, prefix='answer2question'))


In [None]:
len(query_passage_pairs)

100000

In [None]:
df_query_passage = pd.DataFrame(query_passage_pairs, columns=['query', 'passage'])
with open('/content/drive/My Drive/long_COVID_semantic_search/train/query_passage_pairs_long_COVID.csv', 'w') as f:
  df_query_passage.to_csv(f)

# Negative Mining

As the next step of GPL, we mine a hard negative passage for each query-passage pair. We perform the negative mining using the same pretrained bi-encoder that we want to finetune: [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5).

In [None]:
df_query_passage = pd.read_csv('/content/drive/MyDrive/long_COVID_semantic_search/train/query_passage_pairs_long_COVID.csv')

In [None]:
'''a helper function that encodes a list of texts using a given bi-encoder model'''
def get_embeddings(texts, model, show_progress=True):
  embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=show_progress).detach().cpu().numpy()
  return embeddings

In [None]:
'''given a bi-encoder, creates a huggingface dataset of passage, embedding pairs,
 add faiss index, and saves to dir'''
def create_train_passage_embedding_dataset(model, dir):
  passages = list(set(df_query_passage['passage']))
  passage_embeddings = get_embeddings(passages, model)
  passage_embedding_dict = {'passage': passages, 'passage_embedding':passage_embeddings}
  passage_embedding_dataset = Dataset.from_dict(passage_embedding_dict)
  passage_embedding_dataset.save_to_disk(dir)
  passage_embedding_dataset.add_faiss_index('passage_embedding')
  passage_embedding_dataset.save_faiss_index('passage_embedding', dir+'/index.faiss')
  return passage_embedding_dataset

In [None]:
'''given a bi-encoder, creates a huggingface dataset of query,
 embedding pairs, and saves to dir'''
def create_train_query_embedding_dataset(model, dir):
  queries = list(df_query_passage['query'])
  query_embeddings = get_embeddings(queries, model, show_progress=True)
  query_embedding_dict = {'query':queries, 'query_embedding': query_embeddings}
  query_embedding_dataset = Dataset.from_dict(query_embedding_dict)
  query_embedding_dataset.save_to_disk(dir)
  return query_embedding_dataset

For each query-passage pair, we mine a hard negative for the query by retrieving the 50 passages most relevant to the query and returning one of them uniformly at random as the hard negative. We then add the hard negative passage to the query-passage pair to form a (query, positive passage, negative passage) triplet. Note that the hard negative passage might actually be relevant to the query. However, because the next step assigns each triplet a pseudo-label and we finetune with the MarginMSE loss, this does not cause a problem.
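The sampling rule described above can be sketched in isolation as follows. The helper name and the toy candidate list are hypothetical; in the notebook the candidates come from the FAISS index. Picking uniformly among the candidates that differ from the positive passage yields the same distribution as shuffling the candidates and taking the first one that is not the positive.

```python
import random


def sample_hard_negative(candidates, positive, rng=random):
    """Pick one retrieved passage, uniformly at random, that is not the positive passage."""
    pool = [passage for passage in candidates if passage != positive]
    return rng.choice(pool) if pool else None


# toy stand-in for the top retrieved passages of one query
retrieved = ["passage 1", "passage 2", "passage 3", "passage 4"]
rng = random.Random(0)  # seeded only to make the example repeatable
negative = sample_hard_negative(retrieved, positive="passage 2", rng=rng)
print(negative)
```

Returning `None` when no other candidate exists is a choice made for this sketch; the corresponding situation in the notebook simply produces no triplet for that pair.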

In [None]:
'''given a query embedding dataset and a passage embedding dataset,
 create a dataframe of train triplets, and saves to dir'''
def create_train_triplets(pssg_embd_dataset, query_embd_dataset, dir):
  passage_query_pairs = df_query_passage[['query', 'passage']].values.tolist()
  triplets=[]
  for i in tqdm(range(len(passage_query_pairs))):
    query, pos_passage = passage_query_pairs[i]
    query_embd = np.array(query_embd_dataset[i]['query_embedding'], dtype=np.float32)
    scores, samples = pssg_embd_dataset.get_nearest_examples("passage_embedding", query_embd, k=50)
    neg_passages = samples['passage']
    random.shuffle(neg_passages)
    for neg_passsage in neg_passages:
      if neg_passsage != pos_passage:
        triplets.append([query, pos_passage, neg_passsage])
        break
  df_triplets = pd.DataFrame(triplets, columns=['query', 'pos_passage', 'neg_passage'])
  with open(dir, 'w') as f:
    df_triplets.to_csv(f)
  return df_triplets

In [None]:
train_pssg_embd_dir = \
  "/content/drive/MyDrive/long_COVID_semantic_search/train/pssg_embd_msmarco_distilbert"
train_query_embd_dir = \
  "/content/drive/MyDrive/long_COVID_semantic_search/train/query_embd_msmarco_distilbert"
biencoder_negative_mining = SentenceTransformer("msmarco-distilbert-dot-v5")
biencoder_negative_mining.max_seq_length = 350


In [None]:
train_pssg_embd = create_train_passage_embedding_dataset(
    biencoder_negative_mining,
    train_pssg_embd_dir)
train_query_embd = create_train_query_embedding_dataset(
    biencoder_negative_mining,
    train_query_embd_dir)


In [None]:
train_pssg_embd_dir = \
  "/content/drive/MyDrive/long_COVID_semantic_search/train/pssg_embd_msmarco_distilbert"
train_query_embd_dir = \
  "/content/drive/MyDrive/long_COVID_semantic_search/train/query_embd_msmarco_distilbert"
biencoder_negative_mining = SentenceTransformer("msmarco-distilbert-dot-v5")
biencoder_negative_mining.max_seq_length = 350

In [None]:
train_passg_embd_reload = load_from_disk(train_pssg_embd_dir)
train_passg_embd_reload.load_faiss_index('passage_embedding', train_pssg_embd_dir+'/index.faiss')
train_query_embd_reload = load_from_disk(train_query_embd_dir)

In [None]:
triplets_dir = \
'/content/drive/My Drive/long_COVID_semantic_search/train/triplets_msmarco_distilbert.csv'

In [None]:
df_triplets = create_train_triplets(
    train_passg_embd_reload,
    train_query_embd_reload,
    triplets_dir)

100%|██████████| 100000/100000 [51:27<00:00, 32.39it/s]


# Pseudo Labeling

Now, with the training triplets ready, we use a cross-encoder to generate labels for the triplets. The label for each triplet $(Q, P^+, P^-)$ is the score margin
$$\operatorname{CE}(Q, P^+) - \operatorname{CE}(Q, P^-).$$
The logic behind this step is that cross-encoders perform much better and are less prone to domain shift than bi-encoders (for more info, see [this](https://arxiv.org/pdf/2112.07577), [this](https://arxiv.org/pdf/2010.08240), and [this](https://arxiv.org/pdf/2010.02666)), but they have a very high computational overhead and cannot be used efficiently over a large vector DB for IR. By labeling the training triplets with a cross-encoder, we transfer knowledge from the cross-encoder to the bi-encoder.
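A small worked example of the label may help. The scores below are invented numbers standing in for cross-encoder outputs, and the helper name is hypothetical. A large positive margin means the cross-encoder clearly prefers the positive passage; a margin near zero or below zero signals that the mined "negative" is about as relevant as the positive, which is how false negatives are softened instead of being trained on as hard labels.

```python
def score_margins(pos_scores, neg_scores):
    """Pseudo-label for each triplet: CE(Q, P+) - CE(Q, P-)."""
    return [pos - neg for pos, neg in zip(pos_scores, neg_scores)]


# invented cross-encoder scores for two triplets
pos_scores = [8.0, 2.5]   # CE(Q, P+)
neg_scores = [1.0, 3.0]   # CE(Q, P-)
margins = score_margins(pos_scores, neg_scores)
print(margins)  # [7.0, -0.5]
```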


In [None]:
'''given a cross encoder, computes triplets' score margin,
and saves the updated triplets dataframe to dir'''
def create_train_triplets_with_score(cross_encoder_model, df_triplets, dir):
  neg_pairs = df_triplets[['query', 'neg_passage']].values.tolist()
  pos_pairs = df_triplets[['query', 'pos_passage']].values.tolist()
  pos_scores = cross_encoder_model.predict(pos_pairs, convert_to_tensor=True, show_progress_bar=True).detach().cpu().numpy()
  neg_scores = cross_encoder_model.predict(neg_pairs, convert_to_tensor=True, show_progress_bar=True).detach().cpu().numpy()
  df_triplets.loc[:, 'score_margin'] = pos_scores - neg_scores
  with open(dir, 'w') as f:
    df_triplets.to_csv(f)
  return df_triplets

In [None]:
df_triplets_reload = pd.read_csv(triplets_dir)

In [None]:
cross_encoder_name= 'cross-encoder/ms-marco-MiniLM-L-12-v2'
cross_encoder_model = CrossEncoder(cross_encoder_name)
triplets_scored_dir = \
'/content/drive/My Drive/long_COVID_semantic_search/train/triplets_scored_msmarco_distilbert.csv'
cross_encoder_model.max_length = 350


In [None]:
df_triplets_scored = create_train_triplets_with_score(
    cross_encoder_model=cross_encoder_model,
    df_triplets=df_triplets_reload,
    dir=triplets_scored_dir)


# Finetuning the Bi-Encoder



In [None]:
drive.mount('/content/drive')
triplets_scored_dir = \
  '/content/drive/My Drive/long_COVID_semantic_search/train/triplets_scored_msmarco_distilbert.csv'
df_triplets_scored_reload = pd.read_csv(triplets_scored_dir)
queries = df_triplets_scored_reload['query'].values.tolist()
pos_passages = df_triplets_scored_reload['pos_passage'].values.tolist()
neg_passages = df_triplets_scored_reload['neg_passage'].values.tolist()
scores = df_triplets_scored_reload['score_margin'].values.tolist()

In [None]:
train_data = []
for i in tqdm(range(len(queries))):
  train_data.append(InputExample(texts=[queries[i], pos_passages[i], neg_passages[i]], label=float(scores[i])))
torch.cuda.empty_cache()
batch_size = 32
loader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=batch_size)

100%|██████████| 100000/100000 [00:00<00:00, 131496.65it/s]


In [None]:
model_long_covid_msmarco_distilbert = SentenceTransformer("msmarco-distilbert-dot-v5")
model_long_covid_msmarco_distilbert.max_seq_length = 350
model_long_covid_msmarco_distilbert

SentenceTransformer(
  (0): Transformer({'max_seq_length': 350, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

We train the model for 2 epochs, which took about 2 hours on a Google Colab L4 GPU.

In [None]:
loss = losses.MarginMSELoss(model=model_long_covid_msmarco_distilbert)
epoch = 2
warmup_steps = int(len(loader) * epoch * 0.1)

model_long_covid_msmarco_distilbert.fit(
    train_objectives=[(loader, loss)],
    epochs=epoch,
    warmup_steps=warmup_steps,
    show_progress_bar=True)

Step,Training Loss
500,7.2744
1000,5.7773
1500,5.0475
2000,4.9046
2500,4.5782
3000,4.4291
3500,3.9685
4000,2.7673
4500,2.3663
5000,2.3667
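
For intuition about the objective being minimized, here is a minimal pure-Python sketch of a margin-MSE loss in the spirit of GPL: the bi-encoder's dot-product margin between the positive and the negative passage is regressed onto the cross-encoder's margin (the `score_margin` label). The toy vectors and the helper names `dot` and `margin_mse` are illustrative assumptions; they are not part of the notebook or of the SentenceTransformers API.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_mse(triplets, teacher_margins):
    """Mean squared error between student and teacher score margins.

    triplets: list of (query_vec, pos_vec, neg_vec) embeddings (toy values here).
    teacher_margins: one cross-encoder margin per triplet.
    """
    squared_errors = []
    for (q, pos, neg), target in zip(triplets, teacher_margins):
        # Student margin: how much higher the positive scores than the negative.
        student_margin = dot(q, pos) - dot(q, neg)
        squared_errors.append((student_margin - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy 2-dimensional embeddings (made-up numbers, for illustration only).
triplets = [
    ([1.0, 0.0], [2.0, 1.0], [0.5, 3.0]),  # student margin = 2.0 - 0.5 = 1.5
    ([0.0, 1.0], [1.0, 4.0], [2.0, 1.0]),  # student margin = 4.0 - 1.0 = 3.0
]
teacher_margins = [2.5, 3.0]
loss = margin_mse(triplets, teacher_margins)
print(loss)
```

The loss is zero only when the bi-encoder reproduces the cross-encoder's margins exactly, so the labels act as soft targets rather than hard relevant/irrelevant classes.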


In [None]:
model_long_covid_msmarco_distilbert.save_pretrained(
    '/content/drive/My Drive/long_COVID_semantic_search/finetuned_long_covid_msmarco_distilbert')

# Semantic Search Using the Finetuned Bi-Encoder

We are done with finetuning the model using GPL. Now, we use the model to encode the database of all passages. Just as we did during the training process, we use the FAISS library to run approximate nearest-neighbor search over the embedding database, which makes IR faster.

In [None]:
df_passages_all = pd.read_csv('/content/drive/MyDrive/long_COVID_semantic_search/pmc_pm_long_COVID_dataset.csv')
columns = ['passage', 'passage_with_titles', 'title', 'authors', 'pmcid', 'pmid', 'source']
df_passages_all = df_passages_all[columns]

In [None]:
def create_passage_embd_database(model, dir):
  passages = list(set(df_passages_all['passage_with_titles']))
  passage_embeddings = get_embeddings(passages, model)
  passage_embedding_dict = {'passage': passages, 'passage_embedding':passage_embeddings}
  passage_embedding_dataset = Dataset.from_dict(passage_embedding_dict)
  passage_embedding_dataset.save_to_disk(dir)
  passage_embedding_dataset.add_faiss_index('passage_embedding')
  passage_embedding_dataset.save_faiss_index('passage_embedding', dir+'/index.faiss')
  return passage_embedding_dataset

In [None]:
biencoder_finetuned = \
 SentenceTransformer('/content/drive/My Drive/long_COVID_semantic_search/finetuned_long_covid_msmarco_distilbert')

In [None]:
pssg_embd_db_finetuned_dir = \
  '/content/drive/MyDrive/long_COVID_semantic_search/pssg_embd_database_long_covid_msmarco_distilbert'

In [None]:
pssg_embd_db_finetuned = create_passage_embd_database(
    biencoder_finetuned,
    pssg_embd_db_finetuned_dir)

We have the vector database of passages saved. We can reload it anytime and use it for semantic search.

In [None]:
pssg_embd_db_finetuned = load_from_disk(pssg_embd_db_finetuned_dir)
pssg_embd_db_finetuned.load_faiss_index('passage_embedding', pssg_embd_db_finetuned_dir+'/index.faiss')

The following function can be used to search for relevant passages to a query in a 🤗 database of vector embeddings indexed by FAISS. The bi-encoder has to be given as an input as well in order to encode the queries.

In [None]:
def semantic_search(biencoder, queries, passage_embd_db, k=5):
  query_embeddings = get_embeddings(queries, biencoder, show_progress=True)
  retrieved_passages = []
  for i in range(len(queries)):
    scores, samples = passage_embd_db.get_nearest_examples(
        "passage_embedding", query_embeddings[i], k=k)
    query_passage_rank = [
        (queries[i], samples['passage'][j], j)
        for j in range(len(samples['passage']))]
    retrieved_passages.extend(query_passage_rank)
  df_retrieved_passages = pd.DataFrame(
      retrieved_passages, columns=['query', 'passage', 'rank'])
  return df_retrieved_passages
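
Conceptually, the nearest-neighbor lookup returns the `k` passages whose embeddings score best against the query embedding. The sketch below shows the exhaustive version of that lookup in pure Python, assuming dot-product scoring (a natural fit for a dot-product model such as msmarco-distilbert-dot-v5; the metric actually used by the FAISS index depends on how the index was built). The function name `top_k_by_dot_product` and the toy vectors are hypothetical.

```python
import heapq


def top_k_by_dot_product(query_vec, passage_vecs, k=5):
    """Return (index, score) pairs for the k passages with the highest dot product."""
    scores = [
        (sum(q * p for q, p in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(passage_vecs)
    ]
    # nlargest sorts by score first, so the best-scoring passage comes first.
    best = heapq.nlargest(k, scores)
    return [(idx, score) for score, idx in best]


# Toy 2-dimensional embeddings (made-up numbers, for illustration only).
passage_vecs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]]
query_vec = [1.0, 0.0]
hits = top_k_by_dot_product(query_vec, passage_vecs, k=2)
print(hits)
```

An index replaces this linear scan with a data structure that avoids scoring every passage, which matters once the database holds many thousands of embeddings.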

Let's take a look at the model's performance on a number of queries, including more general questions and more technical ones.

In [None]:
queries = ['How prevalent is long COVID?',
           'what is known about the effect of reinfection on the risk of PASC?',
           'does covid-19 increase the risk of new-onset diabetes?',
           'what is the pathophysiology behind PASC?',
           ]

In [None]:
k = 5
df_retrieved_passages_finetuned = semantic_search(
    biencoder=biencoder_finetuned,
    queries=queries,
    passage_embd_db=pssg_embd_db_finetuned,
    k=k)
for i in range(len(queries)):
  print(f'query: {queries[i]}')
  for j in range(k):
    passage = df_retrieved_passages_finetuned.iloc[i*k+j]['passage']
    print(f'passage {j}: {passage}')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

query: How prevalent is long COVID?
passage 0: Risk Factors for Long COVID in Older Adults: According to current research , the prevalence of long COVID is estimated to be between 31% and 69%, indicating that over 200 million individuals worldwide may experience long COVID symptoms. While some studies have suggested that older adults may not be at a higher risk of long COVID than younger individuals , this may be because there are more younger COVID-19 survivors, and the epidemiological statistics for long COVID in elderly people exclude a substantial number of fatalities and may introduce bias towards older adults . Additionally, many mechanisms of long COVID remain unclear, and there is a lack of targeted and effective treatment options . Considering that older adults are a population that requires considerable health care resources, long COVID in this population remains a major challenge in public health, clinical medicine, and basic medical research.
passage 1: Prevalence and chara

Here is a summary of the finetuned model's performance on these queries:

For the first query, passages 0, 1, and 4 present relevant information to the question.

For the second query, passages 1 and 3 are very relevant. Passage 2 is also relevant, although the claim is not stated in a rigorous manner.

For the third query, passages 0, 1, 3, 4 are relevant. Passage 2 mentions the possibility of new-onset diabetes very briefly but is focused on diabetes as a risk factor for long COVID.

For the fourth query, passages 2, 3, and 4 address the question. Passages 0 and 1 discuss the question as the primary topic, although they do not provide a direct answer.

Let's compare the results for these queries with the original model before finetuning.

In [None]:
pssg_embd_db_original_dir = \
  '/content/drive/MyDrive/long_COVID_semantic_search/eval/pssg_embd_msmarco_database_distilbert'
biencoder_original = SentenceTransformer('msmarco-distilbert-dot-v5')


In [None]:
pssg_embd_original = create_passage_embd_database(
    biencoder_original,
    pssg_embd_db_original_dir)

In [None]:
pssg_embd_original = load_from_disk(pssg_embd_db_original_dir)
pssg_embd_original.load_faiss_index('passage_embedding', pssg_embd_db_original_dir+'/index.faiss')

In [None]:
k = 5
df_retrieved_passages_original = semantic_search(
    biencoder=biencoder_original,
    queries=queries,
    passage_embd_db=pssg_embd_original,
    k=k)
for i in range(len(queries)):
  print(f'query: {queries[i]}')
  for j in range(k):
    passage = df_retrieved_passages_original.iloc[i*k+j]['passage']
    print(f'passage {j}: {passage}')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

query: How prevalent is long COVID?
passage 0: Risk Factors for Long COVID in Older Adults: According to current research , the prevalence of long COVID is estimated to be between 31% and 69%, indicating that over 200 million individuals worldwide may experience long COVID symptoms. While some studies have suggested that older adults may not be at a higher risk of long COVID than younger individuals , this may be because there are more younger COVID-19 survivors, and the epidemiological statistics for long COVID in elderly people exclude a substantial number of fatalities and may introduce bias towards older adults . Additionally, many mechanisms of long COVID remain unclear, and there is a lack of targeted and effective treatment options . Considering that older adults are a population that requires considerable health care resources, long COVID in this population remains a major challenge in public health, clinical medicine, and basic medical research.
passage 1: Prevalence and Facto

Here is a summary of the original model's performance on these queries.

For the first query, passages 0, 2, and 3 provide relevant answers to the question.

For the second query, none of the retrieved passages are directly relevant to the question. The most relevant passage is passage 4 which highlights the impact of treatment during the acute phase on the risk of PASC, which indirectly relates to reinfection.

For the third query, passages 0, 2, and 4 provide an answer to the question, but passages 1 and 3 talk about diabetes as a risk factor rather than new-onset diabetes in people who contracted COVID-19.

For the fourth query, passages 1 and 4 are directly relevant to the question. Passages 0 and 2 do not directly answer the question but discuss it briefly. Passage 3 is completely irrelevant.






One can observe a significant improvement for the more technical queries, such as the risk of developing new-onset diabetes as a result of a COVID-19 infection, as well as more detailed questions about the mechanism of PASC and pharmaceutical options for preventing PASC.

In the next section, we use a set of 30 queries that were prepared before finetuning the model to evaluate the performance of the finetuned model against the original model.

# Evaluation

For a set of 30 queries that were prepared before finetuning the model, I looked at the top 10 results returned by the finetuned model and the original model and scored each as relevant or irrelevant. In a few cases where I was unsure, I consulted a friend who is a medical student. I then calculated the mean average precision for the top 10 retrieved results (mAP@10) for each bi-encoder.

A comment on scoring: a passage is considered relevant to a query either if it contains an explicit answer to the question, or if it discusses the question in detail as a primary topic, e.g. it discusses the existing challenges related to answering the question in detail, or it discusses the question in detail and indicates that the paper is going to study it.
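
To make the metric concrete, here is a small worked sketch of average precision at k on toy relevance lists (1 = relevant, 0 = irrelevant, ordered by rank). It assumes the variant in which AP is divided by the number of relevant passages found in the top k; other definitions divide by the total number of relevant passages in the collection, which is unknown here. The function names and the toy lists are illustrative.

```python
def average_precision_at_k(relevance, k=10):
    """AP@k for one query, normalized by the number of relevant hits in the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant == 1:
            hits += 1
            # Precision at this rank: relevant passages so far / passages seen so far.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists, k=10):
    """mAP@k: the mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, k) for r in relevance_lists) / len(relevance_lists)


# Query A: relevant passages at ranks 1, 3 and 4 -> (1/1 + 2/3 + 3/4) / 3
query_a = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
# Query B: a single relevant passage at rank 2 -> (1/2) / 1
query_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

ap_a = average_precision_at_k(query_a)
ap_b = average_precision_at_k(query_b)
map_score = mean_average_precision([query_a, query_b])
print(ap_a, ap_b, map_score)
```

Because each hit is weighted by the precision at its rank, moving a relevant passage closer to the top raises the score even when the number of relevant passages stays the same.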

In [None]:
biencoder_finetuned = \
 SentenceTransformer('/content/drive/My Drive/long_COVID_semantic_search/finetuned_long_covid_msmarco_distilbert')
pssg_embd_db_finetuned_dir = \
  '/content/drive/MyDrive/long_COVID_semantic_search/pssg_embd_database_long_covid_msmarco_distilbert'
pssg_embd_db_finetuned = load_from_disk(pssg_embd_db_finetuned_dir)
pssg_embd_db_finetuned.load_faiss_index('passage_embedding', pssg_embd_db_finetuned_dir+'/index.faiss')

In [None]:
biencoder_original = SentenceTransformer('msmarco-distilbert-dot-v5')
pssg_embd_db_original_dir = \
  '/content/drive/MyDrive/long_COVID_semantic_search/eval/pssg_embd_msmarco_database_distilbert'
pssg_embd_db_original = load_from_disk(pssg_embd_db_original_dir)
pssg_embd_db_original.load_faiss_index('passage_embedding', pssg_embd_db_original_dir+'/index.faiss')

Here is the list of queries that we consider for evaluation:

In [None]:
queries = ['what treatments are available for long covid?',
           'what are underlying causes of PASC?',
           'how prevalent is long-COVID?',
           'what are the common symptoms of long COVID?',
           'what are the risk factors for developing PASC?',
           'does vaccination lower the risk of long COVID?',
           'are vaccines effective in preventing pasc?',
           'do antiviral drugs prevent long COVID?',
           'do antiviral drugs prevent pasc?',
           'what is the pathophysiology behind PASC?',
           'what is the mechanism behind post-acute sequelae of SARS-CoV-2?',
           'what is known about microbiome changes in PASC?',
           'what is known about viral persistance in long-COVID?',
           'does covid-19 increase the risk of new-onset diabetes?',
           'does pre-existing diabetes increase the risk of developing pasc?',
           'what are cardiovascular symptoms of pasc?',
           'is obesity a risk factor for developing long COVID?',
           'what are cardiovascular symptoms of long COVID?',
           'what are the neurological manifestations of PASC?',
           'does reinfection increase the risk of developing PASC?',
           'does reinfection increase the risk of developing long COVID?',
           'does reinfection increase the risk of post-COVID conditions?',
           'what is the role of inflammation in long covid?',
           'what is the role of inflammation in pasc?',
           'what is known about the immunopathology of long-COVID?',
           'what is known about the immunopathology of PASC?',
           'what are the immune mechanisms underlying PASC?',
           'what is the underlying cause of ME/CFS in PASC',
           'what is the underlying cause of fatigue in long-covid',
           'is gender a risk factor for developing PASC?',
           ]

In [None]:
df_retrieved_passages_finetuned = semantic_search(
    biencoder=biencoder_finetuned,
    queries=queries,
    passage_embd_db=pssg_embd_db_finetuned,
    k=10)

df_retrieved_passages_original = semantic_search(
    biencoder=biencoder_original,
    queries=queries,
    passage_embd_db=pssg_embd_db_original,
    k=10)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

We merge the dataframes of retrieved passages and shuffle the rows to prevent bias in scoring.

In [None]:
df_retrieved_passages_finetuned['model'] = 'finetuned'
df_retrieved_passages_original['model'] = 'original'
df_retrieved_passages = pd.concat([df_retrieved_passages_finetuned, df_retrieved_passages_original])

In [None]:
df_retrieved_passages_shuffeled = df_retrieved_passages.sample(frac=1).reset_index(drop=True)
with open('/content/drive/My Drive/long_COVID_semantic_search/eval/retrieved_passages_long_COVID.csv', 'w') as f:
  df_retrieved_passages_shuffeled.to_csv(f)

Now we load the scored dataset.

In [None]:
df_retrieved_passages_scored = \
  pd.read_csv('/content/drive/MyDrive/long_COVID_semantic_search/eval/retrieved_passages_scored_long_COVID.csv')

We are ready to compute mAP for each model.

In [None]:
df_retrieved_passages_scored_finetuned = df_retrieved_passages_scored[df_retrieved_passages_scored.model=='finetuned']
df_retrieved_passages_scored_finetuned = df_retrieved_passages_scored_finetuned.sort_values(by=['query', 'rank'])
df_retrieved_passages_scored_finetuned = df_retrieved_passages_scored_finetuned.reset_index(drop=True)
df_retrieved_passages_scored_original = df_retrieved_passages_scored[df_retrieved_passages_scored.model=='original']
df_retrieved_passages_scored_original = df_retrieved_passages_scored_original.sort_values(by=['query', 'rank'])
df_retrieved_passages_scored_original = df_retrieved_passages_scored_original.reset_index(drop=True)

In [None]:
def get_relevance_lists(df, k=10):
  '''Returns a list of size-k lists, one per query, holding the relevance scores of that query's top-k passages.

  Assumes df is sorted by query and rank and has exactly k rows per query.'''
  relevance_scores = df.relevance.tolist()
  relevance_lists = []
  for i in range(len(relevance_scores) // k):
    relevance_lists.append(relevance_scores[i*k:(i+1)*k])
  return relevance_lists


In [None]:
def calculate_mAP(relevance_lists):
    '''Given relevance lists, calculate mAP.

    AP is normalized by the number of relevant passages retrieved for the query;
    a query with no relevant passage contributes an AP of 0.'''
    AP_sum = 0
    for relevance_list in relevance_lists:
        prec_sum = 0
        relevant_passages = 0
        for i, doc in enumerate(relevance_list):
            if doc == 1:
                relevant_passages += 1
                prec_sum += relevant_passages / (i + 1)
        if relevant_passages > 0:
            AP_sum += prec_sum / relevant_passages
    # Avoid shadowing the built-in map().
    mean_ap = AP_sum / len(relevance_lists)
    return mean_ap

In [None]:
relevance_lists_finetuned = get_relevance_lists(df_retrieved_passages_scored_finetuned)
calculate_mAP(relevance_lists_finetuned)

0.7044419459141681

In [None]:
relevance_lists_original = get_relevance_lists(df_retrieved_passages_scored_original)
calculate_mAP(relevance_lists_original)

0.6460075480809608

Thus, finetuning increased the model's mAP@10 from 0.646 to 0.704.

Next, we use [LLM-as-a-judge](https://arxiv.org/pdf/2306.05685) to evaluate the performance of the models on the same queries. Advanced large language models like GPT-4o can score the relevance of retrieved passages to a query, and these scores correlate highly with human ground truth (e.g. see [this](https://arxiv.org/pdf/2309.10621) and [this](https://arxiv.org/pdf/2304.09161)). I used the following function to get the GPT-4o-generated scores.

In [None]:
client = OpenAI(
    api_key="", #key omitted from notebook
)


def get_score(query, passage):
    prompt = f"""given the query related to long COVID, the passage been retrieved by an information retrieval system
    from a database of scientific and medical publications.
    Given a query and a passage, you must provide a zero or one score with the following meanings:
    0 = represents that the passage is irrelevant to the query,
    1 = represents that the passage is relevant to the query.
    Important Instruction: Assign score 1 if the passage is
    related to the query. If the passage directly answer the query, assign score 1.
    If the passage does not provide a direct answer but discusses the query in details,
    consider it as related and assign score 1. If none of these hold, consider the passage
    as irreverent and assign score 0.
    Query: are antiviral drugs effective in preventing long COVID?
    Passage: Treatment of COVID-19 during the Acute Phase in Hospitalized Patients Decreases Post-Acute Sequelae of COVID-19:
    Despite the important impact of PASC on our health systems, there is a lack of studies specifically designed to evaluate whether
    treatments administered in the acute phase of COVID-19 disease can reduce the risk of PASC. There is evidence from a study that
    demonstrated that patients treated with interferon β-1b and antiviral therapy in the acute phase showed a greater probability
    of recovering their initial health status. Furthermore, there is growing evidence supporting the effectiveness
    of antiviral therapies in reducing the risk of PASC. Specifically, the use of molnupiravir within five days of
    a positive SARS-CoV-2 test result has been associated with a reduced risk of PASC, post-acute death,
    and post-acute hospital admission in individuals with at least one risk factor for severe COVID-19 .
    Similarly, the combined antivirals nirmatrelvir–ritonavir have also shown a reduction in the risk of PASC,
    post-acute death, and post-acute hospital admission in outpatients .'
    ##final score: 1
    Query: are antiviral drugs effective in preventing long COVID?
    Passage:  'Use of Antiandrogens as Therapeutic Agents in COVID-19 Patients: Considering the prohibiting cost of
    current COVID-19 drug regimens for low- and middle-income countries, the emerging SARS-CoV-2 variants and the COVID-19
    vaccine rollout and efficacy challenges, the need for cost-effective, orally available and broad-spectrum antivirals
    that can act against a wide range of SARS-CoV-2 variants remains urgent . Despite the promising antiviral effect that
    a range of antiandrogens display in vitro against SARS-CoV-2, the results of finalised clinical trials on the efficacy of
    ADT or antiandrogens in COVID-19 patients have not been conclusive enough to inform clinical practice. Various next-generation
    antiandrogens have been formulated, and the development of a lot more is underway, including apalutamide, darolutamide,
    orteronel and galeterone. These new drugs should be explored for their antiviral effects and clinical outcomes as they might
    be more effective against SARS-CoV-2 and perhaps more amenable for widespread use in COVID-19 patients.
    ##final score: 0
    Query: {query}
    Passage: {passage}
    ##final score:
    """
    response = client.chat.completions.create(
      model="gpt-4o",
      temperature=0,
      max_tokens=15,
      messages=[
        {"role": "system", "content": "You are a Relevance assessor that judges the relevance of a passage to a query."},
        {"role": "user", "content": prompt}
      ]
    )
    result = response.choices[0].message.content
    try:
        final_score = int(result.strip())
    except ValueError:
        final_score = None
    return final_score
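
Note that `int(result.strip())` yields `None` whenever the model wraps the digit in extra text (for example by repeating the `##final score:` label). A more tolerant parser could look like the sketch below; `parse_binary_score` is a hypothetical helper, not something the notebook used, and it simply takes the first standalone 0 or 1 in the reply.

```python
import re


def parse_binary_score(reply):
    """Extract a 0/1 relevance score from a model reply, or None if none is found."""
    # \b keeps digits that are part of a longer number (e.g. "10") from matching.
    match = re.search(r"\b([01])\b", reply)
    return int(match.group(1)) if match else None


print(parse_binary_score("1"))
print(parse_binary_score("##final score: 0"))
print(parse_binary_score("The passage is relevant."))
```

Replies for which no score can be parsed are best inspected by hand before computing mAP, since `calculate_mAP` counts anything other than 1 as irrelevant.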

In [None]:
def generate_relevance_scores_gpt(df):
  df_copy = df.copy()
  df_copy['relevance'] = 0
  for i in tqdm(range(df_copy.shape[0])):
    query = df_copy['query'].iloc[i]
    passage = df_copy['passage'].iloc[i]
    df_copy.at[i, 'relevance'] = get_score(query, passage)
  return df_copy

In [None]:
df_retrieved_passages_finetuned = df_retrieved_passages[df_retrieved_passages.model=='finetuned']
df_retrieved_passages_finetuned = df_retrieved_passages_finetuned.sort_values(by=['query', 'rank'])
df_retrieved_passages_finetuned = df_retrieved_passages_finetuned.reset_index(drop=True)
df_retrieved_passages_original = df_retrieved_passages[df_retrieved_passages.model=='original']
df_retrieved_passages_original = df_retrieved_passages_original.sort_values(by=['query', 'rank'])
df_retrieved_passages_original = df_retrieved_passages_original.reset_index(drop=True)

In [None]:
df_retrieved_passages_scored_finetuned = \
  generate_relevance_scores_gpt(df_retrieved_passages_finetuned)
df_retrieved_passages_scored_original = \
  generate_relevance_scores_gpt(df_retrieved_passages_original)

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/300 [00:00<?, ?it/s]

In [None]:
df_retrieved_passages_scored = pd.concat(
    [df_retrieved_passages_scored_finetuned,
     df_retrieved_passages_scored_original])
with open('/content/drive/My Drive/long_COVID_semantic_search/eval/retrieved_passages_scored_with_GPT_long_COVID.csv', 'w') as f:
  df_retrieved_passages_scored.to_csv(f)

In [None]:
relevance_lists_finetuned = get_relevance_lists(df_retrieved_passages_scored_finetuned)
calculate_mAP(relevance_lists_finetuned)

0.7573401675485009

In [None]:
relevance_lists_original = get_relevance_lists(df_retrieved_passages_scored_original)
calculate_mAP(relevance_lists_original)

0.6826686507936509

Therefore, we get a similar improvement according to the relevance scores generated by GPT-4o: mAP@10 increases from 0.683 to 0.757.
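
To compare the two evaluations on the same footing, one can express each gain relative to the original model's score. The short sketch below does this arithmetic with the four mAP@10 values reported above; `relative_gain` is an illustrative helper name. It is a descriptive summary only: with 30 queries it says nothing about statistical significance.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


# mAP@10 values reported in this notebook.
human_gain = relative_gain(0.6460075480809608, 0.7044419459141681)
gpt_gain = relative_gain(0.6826686507936509, 0.7573401675485009)

print(f"human-scored: {human_gain:.1%}, GPT-4o-scored: {gpt_gain:.1%}")
```

Both evaluations put the relative gain at roughly 9 to 11 percent, even though GPT-4o assigns somewhat higher absolute scores to both models.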

# Potential Further Improvements

Based on the results of the GPL paper, it is expected that using additional queries (e.g. 20 queries per passage) could result in improved performance. Additionally, one can consider augmenting the queries by replacing some keywords with their synonyms, or by using more advanced methods.

Moreover, one can consider generating queries using a more recent, more advanced model that is still affordable. For example, using GPT-4o-mini one can produce 10 queries for each of 10,000 passages for $2. Unlike using paid embedding models, which would require ongoing payments for embedding user queries as well as for updating the vector database, this would be a one-time payment used for finetuning our own embedding model.

Finally, it would be interesting to try applying GPL to other available pretrained bi-encoder models and compare the results.