# **Document Retrieval System 🗂️**

We are going to implement a document retrieval system that returns the titles and the relevant passages of scientific papers containing the answer to a given user question. Along the way, I'm going to experiment with different models for sentence embeddings.

Dataset: [COVID-19 Open Research Dataset (CORD-19)](https://www.semanticscholar.org/cord19).

We are going to use the articles in the folder `comm_use_subset` of the [*first version (2020-03-13)*](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz) of CORD-19.

**Some essential imports**

In [None]:
import os, json
import pandas as pd
import numpy as np
import nltk
import ssl
import re
import torch
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
import time

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### **Install the sentence transformers 🤗**

In [None]:
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

### **Download the CORD-19 dataset ⬇️**

Get the first version (2020-03-13) of CORD-19 and extract the folder `comm_use_subset`.

In [None]:
!wget -nc https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz

--2021-10-14 06:20:35--  https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz
Resolving ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)... 52.218.181.153
Connecting to ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)|52.218.181.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278921140 (266M) [application/x-tar]
Saving to: ‘cord-19_2020-03-13.tar.gz’


2021-10-14 06:20:46 (24.2 MB/s) - ‘cord-19_2020-03-13.tar.gz’ saved [278921140/278921140]



In [None]:
!tar -xvf cord-19_2020-03-13.tar.gz
!tar -xvf 2020-03-13/comm_use_subset.tar.gz



### **Data preprocessing 💽**

We get the *title, abstract* and *text* from each article.

#### **Keep the useful data from articles 📁**

In [None]:
# this finds our json files
path_to_json = 'comm_use_subset/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['title', 'abstract', 'text'])

# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
  with open(os.path.join(path_to_json, js)) as json_file:
    json_text = json.load(json_file)
    title = json_text['metadata']['title']
    #if not title:
    #  title = 'Article without title'
    
    # Join paragraphs with a space so that sentences at paragraph
    # boundaries are not glued together (e.g. "... [4] .The spectrum ...")
    all_abstract = ' '.join(p['text'] for p in json_text['abstract'])
    
    #if not all_abstract:
    #  all_abstract = 'No abstract'
    
    all_text = ' '.join(p['text'] for p in json_text['body_text'])
    
    #if not all_text:
    #  all_text = 'No text'
        
    jsons_data.loc[index] = [title, all_abstract, all_text]
        
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)

                                                  title  ...                                               text
0     Hazard Analysis of Critical Control Points Ass...  ...  Since 1980, on average one new emerging infect...
1     Gene expression patterns induced at different ...  ...  Human rhinovirus (HRV), a non-segmented positi...
2     Comparison of Influenza Epidemiological and Vi...  ...  Influenza virus is estimated to cause 3 to 5 m...
3     expression cloning and production of human hea...  ...  FIgURe 1 | Schematic representation of the pro...
4     STATISTICS-BASED PREDICTIONS OF CORONAVIRUS EP...  ...  Here, we consider the development of an epidem...
...                                                 ...  ...                                                ...
8995  Comprehensive Genomic Characterization Analysi...  ...  Long non-coding RNAs (lncRNAs), which are tran...
8996       Emerging Microbes & Infections (2017) 6, e14  ...  The outbreak of severe acute respiratory s

In [None]:
print(jsons_data.head())

                                               title  ...                                               text
0  Hazard Analysis of Critical Control Points Ass...  ...  Since 1980, on average one new emerging infect...
1  Gene expression patterns induced at different ...  ...  Human rhinovirus (HRV), a non-segmented positi...
2  Comparison of Influenza Epidemiological and Vi...  ...  Influenza virus is estimated to cause 3 to 5 m...
3  expression cloning and production of human hea...  ...  FIgURe 1 | Schematic representation of the pro...
4  STATISTICS-BASED PREDICTIONS OF CORONAVIRUS EP...  ...  Here, we consider the development of an epidem...

[5 rows x 3 columns]
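
The loading loop above reads three fields from each CORD-19 JSON file. As a minimal sketch of that extraction step on its own, the hypothetical helper `extract_fields` below (not part of the original notebook) pulls the same keys out of one parsed document. The toy document is made up and only mirrors the keys the loop reads.

```python
import json

def extract_fields(doc):
    """Return title, abstract and body text from one parsed CORD-19 JSON document."""
    title = doc.get('metadata', {}).get('title', '')
    # Paragraphs are joined with a space so that sentences at paragraph
    # boundaries stay separated for the sentence tokenizer.
    abstract = ' '.join(p['text'] for p in doc.get('abstract', []))
    text = ' '.join(p['text'] for p in doc.get('body_text', []))
    return {'title': title, 'abstract': abstract, 'text': text}

# Toy document (made-up content) with the same keys the loading loop reads
toy_doc = json.loads('''{
  "metadata": {"title": "A toy paper"},
  "abstract": [{"text": "First abstract paragraph."}, {"text": "Second one."}],
  "body_text": [{"text": "Body paragraph."}]
}''')

record = extract_fields(toy_doc)
print(record)
```

Collecting such dictionaries in a list and building the DataFrame once with `pd.DataFrame(records)` would avoid growing it row by row with `.loc`, which may be noticeably faster for about 9000 files.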


#### **Data cleaning 🧹**

In [None]:
def clean_data(sentences, MAX_SEQ_LEN):
  # Tokenize the passages into sentences
  sentences = sent_tokenize(sentences, language='english')
  # Keep only the sentences with more than 15 words and at most MAX_SEQ_LEN words.
  # Note: this counts whitespace-separated words, not model tokens.
  sentences = [sentence for sentence in sentences if 15 < len(sentence.split()) <= MAX_SEQ_LEN]

  return sentences

In [None]:
# Checking on the cleaned data
print(clean_data(jsons_data['abstract'][0], 256))

['Highly pathogenic avian influenza virus (HPAI) strain H5N1 has had direct and indirect economic impacts arising from direct mortality and control programmes in over 50 countries reporting poultry outbreaks.', 'HPAI H5N1 is now reported as the most widespread and expensive zoonotic disease recorded and continues to pose a global health threat.', 'The aim of this research was to assess the potential of utilising Hazard Analysis of Critical Control Points (HACCP) assessments in providing a framework for a rapid response to emerging infectious disease outbreaks.', 'This novel approach applies a scientific process, widely used in food production systems, to assess risks related to a specific emerging health threat within a known zoonotic disease hotspot.', "We conducted a HACCP assessment for HPAI viruses within Vietnam's domestic poultry trade and relate our findings to the existing literature.", "Our HACCP assessment identified poultry flock isolation, transportation, slaughter, prepara

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Working on {device}')

Working on cuda


### **Questions definition❓**

In [None]:
questions = ['What are the coronaviruses?', 
            'What was discovered in Wuhuan in December 2019?',
            'What is Coronovirus Disease 2019?',
            'What is COVID-19?',
            'What is caused by SARS-COV2?',
            'How is COVID-19 spread?',
            'Where was COVID-19 discovered?',
            'How many deaths were caused by COVID-19?',
            'Does wearing a mask prevent the spread of COVID-19?']

### **Models definition 🤖**

I list the sentence-transformers models in order of their Semantic Search performance, from best to worst, according to the [ranking from SBERT](https://www.sbert.net/_static/html/models_en_sentence_embeddings.html).

1. `multi-qa-mpnet-base-dot-v1`
2. `all-mpnet-base-v2`        
3. `all-roberta-large-v1`       
4. `msmarco-bert-base-dot-v5`   
5. `msmarco-distilbert-dot-v5`  


In [None]:
model_1 = SentenceTransformer('multi-qa-mpnet-base-dot-v1').to(device) 
model_2 = SentenceTransformer('all-mpnet-base-v2').to(device) 
model_3 = SentenceTransformer('all-roberta-large-v1').to(device) 
model_4 = SentenceTransformer('msmarco-bert-base-dot-v5').to(device) 
model_5 = SentenceTransformer('msmarco-distilbert-dot-v5').to(device) 


In [None]:
# Declaring the models' names for printing purposes
model_names = {
  model_1 : 'multi-qa-mpnet-base-dot-v1',
  model_2 : 'all-mpnet-base-v2',
  model_3 : 'all-roberta-large-v1',
  model_4 : 'msmarco-bert-base-dot-v5',
  model_5 : 'msmarco-distilbert-dot-v5'
}

In [None]:
# Define the maximum sequence length of input for each model as stated in their documentation 
model_max_seq_len = {
  model_1 : 512,
  model_2 : 384,
  model_3 : 256,
  model_4 : 512,
  model_5 : 512,
}

### **Search function 🔍**

In [None]:
def search(model, N_SAMPLES = None):
  best_scores = [float('-inf') for i in range(len(questions))]
  best_titles = ['' for i in range(len(questions))]
  best_answers = ['' for i in range(len(questions))]

  if N_SAMPLES is not None:
    sample = jsons_data.sample(n=N_SAMPLES, random_state=123)
  else:
    sample = jsons_data

  print(f'Running the `{model_names[model]}` for {len(sample.index)} documents\n')
  
  questions_embeddings = model.encode(questions)
  MAX_SEQ_LEN = model_max_seq_len[model]

  start_time = time.time()
  counter = 0
  for index, row in sample.iterrows():
    counter += 1
    abstract_sentences = row['abstract']
    text_sentences = row['text']
    # Separate abstract and body with a space so their boundary sentences do not merge
    data = abstract_sentences + ' ' + text_sentences
    data = clean_data(data, MAX_SEQ_LEN)

    if not row['title']:
      print(f'Document {counter-1} has no title. Proceeding..')
      continue
    if not data:
      print(f'No data on document {counter-1}. Proceeding..')
      continue

    data_embeddings = model.encode(data)

    for q_idx, question_embeddings in enumerate(questions_embeddings):
      
      scores = util.dot_score(question_embeddings, data_embeddings)[0].cpu().tolist()

      # Combine docs & scores
      doc_score_pairs = list(zip(data, scores))

      # Sort by decreasing score
      doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

      score = doc_score_pairs[0][1]
      answer = doc_score_pairs[0][0]

      if score > best_scores[q_idx]:
        best_scores[q_idx]  = score
        best_titles[q_idx]  = row['title']
        best_answers[q_idx] = answer
            
    print(f'Document {counter}/{len(sample.index)}')
      
  end_time = time.time()
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  print(f'Elapsed time: {elapsed_mins}m {elapsed_secs}s')

  return best_scores, best_titles, best_answers
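
The search function keeps only the single best sentence per question by sorting all (sentence, score) pairs and taking the first one. As a sketch of a possible variation, the hypothetical helper `top_k_passages` below keeps the `k` best-scoring sentences with `heapq.nlargest` instead of a full sort. The two-dimensional vectors are toy stand-ins for real sentence embeddings, and the plain-Python `dot` is an assumption that replaces `util.dot_score` for illustration only.

```python
import heapq

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_passages(question_vec, passages, passage_vecs, k=3):
    """Return the k (score, passage) pairs with the highest dot product."""
    scored = ((dot(question_vec, vec), passage)
              for passage, vec in zip(passages, passage_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Toy data: three "sentences" with made-up 2-d embeddings
passages = ['sentence A', 'sentence B', 'sentence C']
passage_vecs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
question_vec = [2.0, 1.0]

best = top_k_passages(question_vec, passages, passage_vecs, k=2)
print(best)
```

Returning a few candidates per question instead of one would make it easier to inspect near misses by hand.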

In [None]:
# Check what GPU we are assigned
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-6ee0d770-3b4c-534f-6981-2d0748cc6745)


### **Run the models 🏃‍♂️**

Choose between the following models:

1. `model_1` : `multi-qa-mpnet-base-dot-v1`
2. `model_2` : `all-mpnet-base-v2`
3. `model_3` : `all-roberta-large-v1`
4. `model_4` : `msmarco-bert-base-dot-v5`
5. `model_5` : `msmarco-distilbert-dot-v5`




In [None]:
# SEARCH!
scores, titles, answers = search(model_1)

Streaming output truncated to the last 5000 lines.
Document 4002/9000
Document 4003/9000
Document 4004/9000
Document 4005/9000
Document 4006/9000
Document 4006 has no title. Proceeding..
Document 4008/9000
Document 4009/9000
Document 4010/9000
Document 4011/9000
Document 4012/9000
Document 4013/9000
Document 4014/9000
Document 4015/9000
Document 4016/9000
Document 4017/9000
Document 4018/9000
Document 4019/9000
Document 4020/9000
Document 4021/9000
Document 4021 has no title. Proceeding..
Document 4023/9000
Document 4024/9000
Document 4025/9000
Document 4026/9000
Document 4027/9000
Document 4028/9000
Document 4029/9000
Document 4030/9000
Document 4031/9000
Document 4032/9000
Document 4033/9000
Document 4034/9000
Document 4035/9000
Document 4036/9000
Document 4037/9000
Document 4038/9000
Document 4039/9000
Document 4040/9000
Document 4041/9000
Document 4042/9000
Document 4043/9000
Document 4044/9000
Document 4044 has no title. Proceeding..
Document 4046/9000
Document 4047/

In [None]:
for idx, question in enumerate(questions):
  print(f'Question: {question}')
  print(f'From paper with title: {titles[idx]}')
  print(f'Passage answer: {answers[idx]}')
  print(f'Score: {scores[idx]:.2f}\n')

Question: What are the coronaviruses?
From paper with title: Alignment-free method for DNA sequence clustering using Fuzzy integral similarity
Passage answer: The coronaviruses are pleomorphic RNA viruses that are widespread among avians, bats, humans and other mammals.
Score: 85.51

Question: What was discovered in Wuhuan in December 2019?
From paper with title: Clinical Medicine Characteristics of and Public Health Responses to the Coronavirus Disease 2019 Outbreak in China
Passage answer: In December 2019In December 2019, a cluster of pneumonia of unknown etiology was detected in Wuhan City, Hubei Province of China.
Score: 79.24

Question: What is Coronovirus Disease 2019?
From paper with title: Comment
Passage answer: The ongoing coronavirus disease 2019 (COVID-19) outbreak is giving rise to worldwide anxieties, rumours, and online misinformation.
Score: 79.98

Question: What is COVID-19?
From paper with title: Consensus statement The species Severe acute respiratory syndrome- rela

### **Evaluate answers ✅**

Let's look at the answers we got after running the model `multi-qa-mpnet-base-dot-v1` on the whole dataset (9000 documents), which took 1 hour on the Tesla P100 GPU. All models produce fairly good results, and each one answers a different subset of the questions best. I provide the results for each of the models in the `results.pdf` file (in the current directory) and compare the performance of the models based on some metrics.

**Question:** What are the coronaviruses?  
**From paper with title:** Natural Bis-Benzylisoquinoline Alkaloids-Tetrandrine, Fangchinoline, and Cepharanthine, Inhibit Human Coronavirus OC43 Infection of MRC-5 Human Lung Cells  
**Passage answer:** Coronaviruses (CoVs) are enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of animal species and cause multiple respiratory outcomes of varying severity, including the common cold, bronchiolitis, and pneumonia [13].

**Comments:** It's a very good answer. Just what we wanted.

---

**Question:** What was discovered in Wuhuan in December 2019?  
**From paper with title:** Identification of a Novel Polyomavirus from Patients with Acute Respiratory Tract Infections  
**Passage answer:** This discovery raises many questions for further investigation, such as, Is WU virus a human pathogen?

**Comments:** Relevant answer. It doesn't answer our question though. 

---

**Question:** What is Coronovirus Disease 2019?  
**From paper with title:** Systematic Comparison of Two Animal-to-Human Transmitted Human Coronaviruses: SARS-CoV-2 and SARS-CoV  
**Passage answer:** This virus causes acute lung symptoms, leading to a condition that has been named as "coronavirus disease 2019" (COVID-19).

**Comments:** It's a fine answer. Answers our question.

---

**Question:** What is COVID-19?  
**From paper with title:** Systematic Comparison of Two Animal-to-Human Transmitted Human Coronaviruses: SARS-CoV-2 and SARS-CoV  
**Passage answer:** This virus causes acute lung symptoms, leading to a condition that has been named as "coronavirus disease 2019" (COVID-19).

**Comments:** It's a good answer. Answers our question.

---

**Question:** What is caused by SARS-COV2?  
**From paper with title:** Viral Mimicry to Usurp Ubiquitin and SUMO Host Pathways  
**Passage answer:** SARS CoV is the causative agent for severe acute respiratory syndrome, which frequently leads to outbreaks and high mortality.

**Comments:** It's a very good answer. Answers our question.

---

**Question:** How is COVID-19 spread?  
**From paper with title:** Rapid communication  
**Passage answer:** The outbreak has spread rapidly, affecting other parts of China, and cases have been recorded on several continents (Asia, Australia, Europe and North America); further global spread is likely to occur [4] .The spectrum of this disease in humans, now named coronavirus disease 2019 (COVID-19) [5] , is yet to be fully determined.

**Comments:** Related answer. It does not answer the question the way we wanted, but such information may not have been available when the dataset was published; the passage itself states that "The spectrum of this disease in humans, now named coronavirus disease 2019 (COVID-19) [5] , *is yet to be fully determined.*"

---

**Question:** Where was COVID-19 discovered?  
**From paper with title:** Clinical Medicine Optimization Method for Forecasting Confirmed Cases of COVID-19 in China  
**Passage answer:** In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.

**Comments:** It's a very good answer. Just what we wanted.

---

**Question:** How many deaths were caused by COVID-19?  
**From paper with title:** Identification of COVID-19 Can be Quicker through Artificial Intelligence framework using a Mobile Phone-Based Survey in the Populations when Cities/Towns Are Under Quarantine  
**Passage answer:** As of February 25 th 2020, the World Health Organization's situational data indicates that there were about 77780 confirmed cases, including 2666 deaths due to COVID-19, including cases in 25 countries [4] .

**Comments:** Very good answer. Just what we wanted.

---

**Question:** Does wearing a mask prevent the spread of COVID-19?  
**From paper with title:** Transmission of Influenza A in a Student Office Based on Realistic Person-to-Person Contact and Surface Touch Behaviour  
**Passage answer:** Wearing a mask can control the spread of disease via the long-range airborne, fomite and close contact routes.

**Comments:** Very good answer. Just what we wanted.


### **Search a question 🧐**

I have modified the *search function* in order to ask the system for just a single question.

In [None]:
def search_a_question(model, question, N_SAMPLES = None):
  best_score = float('-inf')
  best_title = ''
  best_answer = ''

  if N_SAMPLES is not None:
    sample = jsons_data.sample(n=N_SAMPLES, random_state=123)
  else:
    sample = jsons_data

  print(f'Running the `{model_names[model]}` for {len(sample.index)} documents\n')
  
  question_embeddings = model.encode(question)
  MAX_SEQ_LEN = model_max_seq_len[model]

  start_time = time.time()
  counter = 0
  for index, row in sample.iterrows():
    counter += 1
    abstract_sentences = row['abstract']
    text_sentences = row['text']
    # Separate abstract and body with a space so their boundary sentences do not merge
    data = abstract_sentences + ' ' + text_sentences
    data = clean_data(data, MAX_SEQ_LEN)

    if not row['title']:
      print(f'Document {counter-1} has no title. Proceeding..')
      continue
    if not data:
      print(f'No data on document {counter-1}. Proceeding..')
      continue

    data_embeddings = model.encode(data)

    scores = util.dot_score(question_embeddings, data_embeddings)[0].cpu().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(data, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    score = doc_score_pairs[0][1]
    answer = doc_score_pairs[0][0]

    if score > best_score:
      best_score = score
      best_title = row['title']
      best_answer = answer
            
    print(f'Document {counter}/{len(sample.index)}')
      
  end_time = time.time()
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  print(f'Elapsed time: {elapsed_mins}m {elapsed_secs}s')

  return best_score, best_title, best_answer
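
`search_a_question` re-encodes every document each time a new question is asked, and encoding is presumably where most of the run time goes. One possible alternative, sketched below under stated assumptions, is to encode the corpus once and reuse the vectors for every later question. The names `build_index`, `query_index`, `toy_clean` and `toy_encode` are hypothetical; in the notebook, `encode` would be `model.encode` and `clean` a wrapper around `clean_data`. The toy encoder just counts two words so that the example runs without downloading a model.

```python
def build_index(documents, encode, clean):
    """Encode each document's sentences once; returns (title, sentences, vectors) triples."""
    index = []
    for title, raw_text in documents:
        sentences = clean(raw_text)
        if not title or not sentences:
            continue  # same skip rule as in search_a_question
        index.append((title, sentences, encode(sentences)))
    return index

def query_index(index, question_vec):
    """Return (score, title, sentence) with the highest dot product."""
    best = (float('-inf'), None, None)
    for title, sentences, vectors in index:
        for sentence, vec in zip(sentences, vectors):
            score = sum(a * b for a, b in zip(question_vec, vec))
            if score > best[0]:
                best = (score, title, sentence)
    return best

# Toy stand-ins (assumptions) for clean_data and model.encode
def toy_clean(text):
    return [s.strip() for s in text.split('.') if s.strip()]

def toy_encode(sentences):
    # 2-d vector: (count of "virus", count of "mask")
    return [[float(s.lower().count('virus')), float(s.lower().count('mask'))]
            for s in sentences]

documents = [
    ('Paper one', 'The virus spreads. A virus is small'),
    ('', 'Untitled text about a mask'),
    ('Paper two', 'A mask helps. Nothing here'),
]

index = build_index(documents, toy_encode, toy_clean)  # done once
print(query_index(index, [0.0, 1.0]))  # question vector about "mask"
print(query_index(index, [1.0, 0.0]))  # question vector about "virus"
```

The trade-off is memory: all sentence embeddings of the corpus have to be kept around (or saved to disk) between questions.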

In [None]:
question = 'What are the symptoms of COVID-19?'

score, title, answer = search_a_question(model_1, question, N_SAMPLES=150)

Running the `multi-qa-mpnet-base-dot-v1` for 150 documents

Document 1/150
Document 2/150
Document 3/150
Document 3 has no title. Proceeding..
Document 5/150
Document 6/150
Document 7/150
Document 8/150
Document 9/150
Document 10/150
Document 11/150
Document 12/150
Document 13/150
Document 14/150
Document 15/150
Document 16/150
Document 17/150
Document 18/150
Document 19/150
Document 20/150
Document 21/150
Document 22/150
Document 23/150
Document 24/150
Document 25/150
Document 25 has no title. Proceeding..
Document 27/150
Document 28/150
Document 29/150
Document 30/150
Document 31/150
Document 32/150
Document 33/150
Document 34/150
Document 35/150
Document 35 has no title. Proceeding..
Document 37/150
Document 38/150
Document 39/150
Document 40/150
Document 41/150
Document 42/150
Document 43/150
Document 44/150
Document 45/150
Document 46/150
Document 47/150
Document 48/150
Document 49/150
Document 50/150
Document 51/150
Document 52/150
Document 53/150
Document 54/150
Document 55/150


In [None]:
print(f'Question: {question}')
print(f'From paper with title: {title}')
print(f'Passage answer: {answer}')
print(f'Score: {score:.2f}\n')

Question: What are the symptoms of COVID-19?
From paper with title: The history and epidemiology of Middle East respiratory syndrome corona virus
Passage answer: The clinical presentation of MERS-CoV ranges from flu-like symptoms, i.e., fever and cough in 87% of patients, chills, rigor, rhinorrhea, myalgia, and fatigue, to more severe symptoms, including shortness of breath in 48% of patients and respiratory failure, resulting in the requirement for intubation and ventilation.
Score: 23.52



### **Thoughts 💭**

The answers above come from using `multi-qa-mpnet-base-dot-v1` to embed the whole dataset (9000 scientific articles) and `dot_score` to calculate the similarity of each sentence with each question posed. We can see that all answers are relevant to their question, and almost all of them are pretty good in that they actually answer it!

**Technical details:**
* **Model used:** From SBERT, Pretrained Models, the `multi-qa-mpnet-base-dot-v1`
* **Run time:** 1 hour on the Tesla P100 GPU. On the GPU T8, the runtime is four times as long, about 4 hours. For this reason, the search function takes an `N_SAMPLES` parameter that limits the search to a randomly selected subset of the documents. Setting `N_SAMPLES = 500`, for example, also gives pretty good results in a short amount of time!
* **Similarity check function:** [`util.dot_score`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.dot_score) from sentence transformers library, which calculates the dot product between each embedded sentence and each embedded question.
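
The scores printed earlier (for example 85.51) are not bounded because a dot product grows with the length of the vectors, not only with how well their directions agree. The sketch below uses made-up two-dimensional vectors and plain-Python helpers (`dot` and `cosine` are illustrative names, not the library functions) to show how the two measures can rank the same candidates differently.

```python
import math

def dot(u, v):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

question = [1.0, 1.0]
same_direction = [1.0, 1.0]   # points exactly the same way as the question
longer_vector = [3.0, 4.0]    # slightly different direction, larger magnitude

print(dot(question, same_direction), dot(question, longer_vector))
print(cosine(question, same_direction), cosine(question, longer_vector))
```

Here the dot product prefers `longer_vector`, while cosine similarity prefers `same_direction`. Which measure suits a given model depends on how it was trained, so the model card is the place to check.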

**Can we do any better?**

I think we can. A very important factor is how the similarity score is calculated. Given two sentences (in our case a question and a candidate answer), the model does not understand the meaning of either one. It simply scores how closely their embeddings match, and we hope that the closest match is also the best answer. Often it is not. Take for example the question `"What are the coronaviruses?"`. If I didn't keep only sentences longer than 15 words (see `clean_data`), the model would return the answer: coronaviruses. Yes, obviously coronaviruses are coronaviruses, but we are asking for a more detailed explanation of what a coronavirus is, which the model cannot understand. So I think that a more intelligent way of calculating the similarity score would give better results.

**Something kind of weird:** When running the model with `N_SAMPLES = 500` we get much better answers for some queries than when running it on the whole dataset. Perhaps this is because the model selects as the best answer the sentence whose embedding is most similar to the question, and a larger pool may contain more sentences that merely echo the question's wording. The model cannot understand the conceptual meaning of the question; it only tries to find sentences as "close" as possible to it. As we can see, this produces poor results for some questions.

Overall, I would say that the model produces pretty good answers and in no case irrelevant ones.
