# 1. Document Retrieval System.

## 1) Data Extracting & Preprocessing.

In this first section of the assignment, we extract the useful data from the given papers.

First, we load the ***comm_use_subset.zip*** file from local disk (or from Google Drive), extract it, and create a folder named papers. The resulting structure looks like this:

```
/content/papers/comm_use_subset/.jsonfiles
```

### (1.1): Useful Functions and Libraries.

In [1]:
import zipfile

def createFolder(zip_location):
  with zipfile.ZipFile(zip_location, 'r') as zip_ref:
    zip_ref.extractall('/content/papers')

In [2]:
def writeToCSV(jsonf, data_file):

  with open(jsonf) as json_file: 

      try:    
        data = json.load(json_file)

        csv_writer = csv.writer(data_file)

        row = []

        paper_id = data['paper_id']
        row.append(paper_id)

        title = data['metadata']['title']
        row.append(title)

        if data['abstract']:
          abstract = data['abstract'][0]['text']
          text = abstract 
        else:
          text = ""

        body_text = data['body_text']

        text = re.sub(r"\[\d*\]", "", text)                                     # Drop numeric citation markers such as [19]
        text = re.sub(r" +", " ", text)                                         # Collapse runs of spaces

        for sections in body_text:

          sub_text = sections['text']
          x = re.sub(r"\[\d*\]", "", sub_text)
          x = re.sub(r" +", " ", x)
          text += x

        row.append(text)

        csv_writer.writerow(row)

      except Exception as err:                                                  # Report the failing file instead of stopping the whole run
        print('Error in json file:', jsonf, err)

In [3]:
import os
import csv
import json
import re

import pandas as pd
from pandas import read_csv

def createCSV(fLocation):
  entries = os.listdir(fLocation)                                               # Read all papers
  
  data_file = open('papers.csv', 'w', newline='')                              # newline='' is what the csv module expects
  csv_writer = csv.writer(data_file)
  csv_writer.writerow(['paper_id', 'title', 'text'])
  json_path = fLocation

  for entry in entries:
    
    json_path = fLocation + "/" +entry
    writeToCSV(json_path, data_file)

  data_file.close()

### (1.2): Explanation of the Data Management.

Then, to make the data easier to inspect, we convert the .json files into a single .csv file. Of course, we keep only the important fields from these files:



*   paper_id
*   title
*   body_text
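
As an illustration of the fields listed above, here is a minimal sketch that pulls them out of one parsed record. The `sample` dictionary is a hypothetical, heavily shortened record, and `extract_fields` is an illustrative helper, not a function from the notebook. Unlike `writeToCSV`, it joins the parts with a single space.

```python
# Hypothetical, heavily shortened record; real files contain many more fields.
sample = {
    "paper_id": "abc123",
    "metadata": {"title": "An example title"},
    "abstract": [{"text": "Short abstract."}],
    "body_text": [{"text": "First section."}, {"text": "Second section."}],
}

def extract_fields(record):
    """Return (paper_id, title, text) from one parsed JSON record."""
    abstract = record.get("abstract") or []
    # Start with the first abstract paragraph when there is one.
    parts = [abstract[0]["text"]] if abstract else []
    # Then append the text of every body section, in order.
    parts += [section["text"] for section in record.get("body_text", [])]
    title = record.get("metadata", {}).get("title", "")
    return record["paper_id"], title, " ".join(parts)

row = extract_fields(sample)
print(row)
```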

It's important to emphasize that we also clean the text of each paper. Throughout the text there are citation markers like this one:


    Old version: " Bolton et al. [19] demonstrated that the LysM protein ". 

We don't want these markers in the text, because they are clearly noise, so we remove them. From now on, every such sentence is clean:

    Cleared version:  " Bolton et al. demonstrated that the LysM protein ". 


We also collapse repeated white spaces into one. At this point, we are ready to extract all the useful information from the given papers. The cleaned .csv file is named **papers.csv**.
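
The cleaning step can be tried in isolation with the short sketch below. It applies the same two substitutions as `writeToCSV`; `clean_text` is an illustrative name, not part of the notebook. Note that the pattern only matches purely numeric markers, so a grouped citation such as `[3,4]` is left untouched.

```python
import re

def clean_text(text):
    """Remove numeric citation markers, then collapse runs of spaces."""
    text = re.sub(r"\[\d*\]", "", text)
    return re.sub(r" +", " ", text)

old = " Bolton et al. [19] demonstrated that the LysM protein "
cleaned = clean_text(old)
print(repr(cleaned))

# Grouped citations contain a comma, so the pattern does not match them.
grouped = clean_text("as shown before [3,4]")
print(repr(grouped))
```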


In [4]:
#Import .zip file straight from /content

# Uncomment if the .zip is loaded
# createFolder('/content/cord.zip')

In [5]:
#Code to read .zip file from drive

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [6]:
downloaded = drive.CreateFile({'id':'1shaeus7dCdt6LZxnQu9Au9oOHZrcyWYf'})
downloaded.GetContentFile('comm_use_subset.zip')

In [7]:
createFolder('/content/comm_use_subset.zip')
createCSV('/content/papers/comm_use_subset')

In [8]:
LocationCSV = r'/content/papers.csv'
papers = pd.read_csv(LocationCSV);                                             
papers.head()

Unnamed: 0,paper_id,title,text
0,0d1448dbed0b123a78907316826d116ad97cfbcf,a section of the journal Frontiers in Pharmaco...,"Since its discovery in 2001, the major focus o..."
1,e1c5d3a82f4296f2867ccca63989530f17a773d7,molecules Using UPLC-MS/MS for Characterizatio...,"Yupingfeng (YPF), a famous traditional Chinese..."
2,b3ad716630b356b1399e9df08cad73b1e92f317d,Changing risk awareness and personal protectio...,Background: Outbreaks of low and high pathogen...
3,ae0700fe06361c6d9e2286332ba467940ec9f89e,Citation: A Host Factor GPNMB Restricts Porcin...,Porcine circovirus type 2 (PCV2) is the infect...
4,4c16e6d3922d2b800f5f3e2dec0e51b9f7700898,Programmed cell removal by calreticulin in tis...,T he process of viable cell clearance via phag...


## 2) Implementation of 2 different sentence embedding approaches.

### (2.1): First approach: SentenceBERT.

Sentence-BERT uses a Siamese-network-like architecture that takes 2 sentences as input. 

These 2 sentences are passed through BERT models and a pooling layer to generate their embeddings. The embeddings of the sentence pair are then used as inputs to calculate the **cosine similarity.**
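
To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling, which is what the "mean-tokens" suffix of the model name used in this notebook suggests. The token vectors and the mask are made-up toy values, and `mean_pool` is an illustrative helper rather than the library's own code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring positions where the mask is 0."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]      # shape (T, 1)
    summed = (np.asarray(token_embeddings, dtype=float) * mask).sum(axis=0)
    return summed / max(mask.sum(), 1e-9)                        # avoid division by zero

# Three token vectors; the last one stands for padding and is masked out.
tokens = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [100.0, 100.0]])
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```

The result is one fixed-size vector per sentence, which is what makes a direct cosine comparison between two sentences possible.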

Let's install Sentence BERT:

In [9]:
!pip install -U sentence-transformers


Then, we load the pre-trained BERT model. From the list of available models, we choose **bert-base-nli-mean-tokens**. [Here](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0) is a list of the models.

In [10]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:15<00:00, 26.9MB/s]


Then, we create a list of **sentences**. Each "sentence" is the full content of one paper. 

We should also mention that we remove "noise" such as emails and URLs from each paper.

In [11]:
def createSentences(dataframe):

  sentences = []

  url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
  email_regex = r'\S*@\S*\s'

  for paper in dataframe['text']:

      # sentences.append(paper)
      paper = re.sub(email_regex, '', paper)
      paper = re.sub(url_regex, '', paper)
      # text = re.sub("\[\d\]","", paper)
      # text = re.sub(" +"," ",text)
      sentences.append(paper)

  return sentences
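
The behaviour of the email pattern can be checked on toy strings, as in the sketch below. The example address is made up. Because the pattern ends in `\s`, it also consumes the whitespace after the address, and it does not match an address at the very end of a text.

```python
import re

email_regex = r'\S*@\S*\s'   # same pattern as in createSentences

# The address and the space that follows it are both removed.
with_space = re.sub(email_regex, '', "Contact john@example.org for data.")
print(repr(with_space))

# No trailing whitespace after the address, so nothing is removed.
at_end = re.sub(email_regex, '', "Contact john@example.org")
print(repr(at_end))
```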

In [12]:
sentences = createSentences(papers)
print(len(sentences))

9000


### (2.2): Creation of the Sentence Embeddings.

Then, we'll create the sentence embeddings, with the help of our **pre-trained SBERT** model.

In [13]:
sentence_embeddings = sbert_model.encode(sentences)

Also, to measure how "similar" 2 sentences are, we use the cosine similarity metric.

In [14]:
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
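
A few toy vectors show the range of this metric. The sketch repeats the `cosine` function so that it runs on its own; the input vectors are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction, different length: the score ignores vector magnitude.
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Perpendicular vectors share no direction at all.
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite directions give the lowest possible score.
opposite = cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(same_direction, orthogonal, opposite)
```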

Below is the code that returns the "best" paper, i.e. the one most likely to contain the answer we want. For the assignment's purpose, we print the title of that paper.

In [15]:
def get_key(val, my_dict):
    for key, value in my_dict.items():
         if val == value:
             return key
 
    return "key doesn't exist"

def askQuestion(query):

  # query = "I like salmon and shrimps"
  query_vec = sbert_model.encode([query])[0]
  dict_sent   = {}
  index_sent  = {}
  counter = 0

  for sent in range(len(sentences)):

    sim = cosine(query_vec, sbert_model.encode([sentences[sent]])[0])
    index_sent[sent] = sim
    dict_sent[sentences[sent]] = sim
    # print(counter)
    counter += 1

  a = sorted(dict_sent.items(), key=lambda x: x[1], reverse=True)    

  print("Top-4 Answers: \n")
  for hit in a[:4]:
    print("\t ", hit[1] , "\t", hit[0])

  # print(a[:3])
  value = a[0][1]
  
  # print(get_key(value, index_sent))
  return(get_key(value, index_sent))

def getTitle(dataframe, index, column):

  return dataframe[column][index]

Our next task is to create a list of questions to feed to our model. (You can find this list of questions [here](https://drive.google.com/file/d/1XxqGTttbTqcrC88M3ZerpsCS10L0tVME/view).)

In [16]:
questions = ["What are the coronoviruses?", 
       "What was discovered in Wuhuan in December 2019?", 
       "What is Coronovirus Disease 2019?",
       "What is COVID-19?",
       "What is caused by SARS-COV2?",
       "Where was COVID-19 discovered?",
       "How does coronavirus spread?"]

Now, for each question we print the Top-4 scores together with the text retrieved from our dataset. Finally, for each question we return the title of the corresponding paper.

In [17]:
def provideQuestions(questions):


  for question in range(len(questions)):
    
    print(question,"\b.")
    index = askQuestion(questions[question])
    title = getTitle(papers, index, 'title')
    print("\n Title: ",title)

In [18]:
provideQuestions(questions)

0 .
Top-4 Answers: 

	  0.68208283 	 Porcine deltacoronavirus (PDCoV), a member of genus Deltacoronavirus, is an emerging swine enteropathogenic coronavirus (CoV). Although outstanding efforts have led to the identification of Alphacoronavirus and Betacoronavirus receptors, the receptor for Deltacoronavirus is unclear. Here, we compared the amino acid sequences of several representative CoVs. Phylogenetic analysis showed that PDCoV spike (S) protein was close to the cluster containing transmissible gastroenteritis virus (TGEV), which utilizes porcine aminopeptidase N (pAPN) as a functional receptor. Ectopic expression of pAPN in non-susceptible BHK-21 cells rendered them susceptible to PDCoV. These results indicate that pAPN may be a functional receptor for PDCoV infection. However, treatment with APN-specific antibody and inhibitors did not completely block PDCoV infection in IPI-2I porcine intestinal epithelial cells. pAPN knockout in IPI-2I cells completely blocked TGEV infection b

Let's sum up how good our results were:



1.   At a quick glance, the returned answers are not as good as expected, and the best candidate answers are usually not at the top of the returned list.

2.   Each question takes around 3.5 minutes to run. That's because `askQuestion` re-encodes and compares **every paper in the dataset** each time we provide a new question, which is a lot for a single query.


We get these poor results because all we do is compare the similarity between each question and the full text of each paper in the dataset. Below, we implement a much better solution for our Q&A problem.

We keep this code anyway, because it is closer to a sentence-similarity program than to a Q&A implementation.
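
Much of the per-question cost comes from `askQuestion` calling `sbert_model.encode` on every paper again for each query. A possible alternative, sketched below with NumPy only, is to score the query against embeddings that were computed once. `rank_by_cosine` is a hypothetical helper, and the toy matrix stands in for `sentence_embeddings`, which is assumed here to be a 2-D array with one row per paper.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix, top_k=4):
    """Return (row index, cosine score) pairs for the top_k most similar rows."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # One matrix-vector product scores every document at once.
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / np.maximum(norms, 1e-12)
    order = np.argsort(-scores)[:top_k]          # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-in for precomputed paper embeddings (3 papers, 2 dimensions).
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])

top = rank_by_cosine([1.0, 0.0], docs, top_k=2)
print(top)
```

Because the function returns row indices directly, the reverse lookup through a dictionary of scores is no longer needed in this variant.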



### (2.3): Second Approach: Bidirectional & Cross Encoder.

For the second sentence embedding approach, we were inspired by the examples of the [UKPLab](https://github.com/UKPLab/sentence-transformers).

At the end of this section of our notebook, we can input a query or a question. The script then uses semantic search to find relevant passages in the given [papers](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz).

For semantic search, we use SentenceTransformer('msmarco-distilbert-base-v2') and retrieve 100 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')) that scores the query against all retrieved passages for relevance. The cross-encoder is necessary to filter out noise that might be retrieved by the semantic search step.
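
The two-stage idea described above can be sketched without any model, which makes the control flow easier to follow. In this toy version the first stage ranks by cosine similarity and the second stage re-scores only the surviving candidates. `retrieve_and_rerank`, the toy vectors, and the `toy_scores` lookup (standing in for a cross-encoder) are all hypothetical.

```python
import numpy as np

def retrieve_and_rerank(query_vec, doc_matrix, rerank_score, top_k=100, final_k=5):
    """Stage 1: cheap vector search. Stage 2: re-score only the top_k candidates."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Stage 1: cosine similarity against every document, keep the best top_k.
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    sims = doc_matrix @ query_vec / np.maximum(denom, 1e-12)
    candidates = np.argsort(-sims)[:top_k]

    # Stage 2: the (expensive) scorer only sees the candidates from stage 1.
    rescored = [(int(i), float(rerank_score(int(i)))) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_scores = {0: 0.2, 1: 0.9, 2: 5.0}   # pretend cross-encoder scores per document

result = retrieve_and_rerank([1.0, 0.0], docs, toy_scores.get, top_k=2, final_k=2)
print(result)
```

Document 2 has the highest toy re-ranking score but never reaches the second stage, because the first stage did not retrieve it. This is the trade-off of the design: the re-ranker can only reorder what the retriever returns.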

In [19]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")

First, we'll initialize our model and we are going to define the number of the papers we want to retrieve.

In [20]:
#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
model_name = 'msmarco-distilbert-base-v2'
bi_encoder = SentenceTransformer(model_name)
top_k = 100     #Number of passages we want to retrieve with the bi-encoder

100%|██████████| 245M/245M [00:10<00:00, 23.4MB/s]


The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality.


In [21]:
#Load the cross-encoder used to re-rank the bi-encoder hits
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')


As the dataset, we use the first version of the COVID-19 Open Research Dataset (CORD-19). We encode the cleaned text of each paper (the `sentences` list built above) with the bi-encoder (**msmarco-distilbert-base-v2 model**).

In [22]:
sentence_embeddings = bi_encoder.encode(sentences, convert_to_tensor=True, show_progress_bar=True)


In [23]:
corpus_embeddings = sentence_embeddings
passages = sentences

Next, we have the *search* function. It searches the whole dataset to find the best possible answer to each query.

In [24]:
def search(query):
  #Encode the query using the bi-encoder and find potentially relevant passages
  start_time = time.time()
  question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
  if torch.cuda.is_available():                                                 # Move to the GPU only when one exists
    question_embedding = question_embedding.cuda()
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
  hits = hits[0]  # Get the hits for the first query

  #Now, score all retrieved passages with the cross_encoder
  cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
  cross_scores = cross_encoder.predict(cross_inp)

  #Attach the cross-encoder score to each hit (the sorting happens further down)
  for idx in range(len(cross_scores)):
      hits[idx]['cross-score'] = cross_scores[idx]

  
  end_time = time.time()

  #Output of top-5 hits
  print("Input question:", query)
  print("Results (after {:.3f} seconds):".format(end_time - start_time))

  print("Top-5 Bi-Encoder Retrieval hits \n")
  hits = sorted(hits, key=lambda x: x['score'], reverse=True)
  for hit in hits[0:5]:
      print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

  print("Top-5 Cross-Encoder Re-ranker hits \n")
  hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
  for hit in hits[0:5]:
      print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
  
  # print(hits[0]['cross-score'])
  return hits[0]['corpus_id']


For the best answer according to the *Cross-Encoder*, we also return the title of the paper.

In [25]:
indexes = []

def provideQuestions2(questions):

  for question in range(len(questions)):

    print(question,"\b. \n")
    index = search(questions[question])
    indexes.append(index)
    title = getTitle(papers, index, 'title')
    print("\n Title: ",title)

In [26]:
provideQuestions2(questions)

0 . 

Input question: What are the coronoviruses?
Results (after 2.335 seconds):
Top-5 Bi-Encoder Retrieval hits 

	0.541	Coronaviruses (CoVs) are a group of enveloped viruses with a large positive single-stranded RNA genome (∼26-32 kb in length) of the subfamily Coronavirinae under the family Coronaviridae. The complete genome of CoV contains five major open reading frames (ORFs) that encode replicase polyproteins (ORF1ab), spike glycoprotein (S), envelope protein (E), membrane protein (M), and nucleocapsid protein (N) flanked by a 5 -untranslated region (UTR) and a 3 -UTR. Currently, members of the subfamily Coronavirinae are classified into four genera, Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus (Fehr and Perlman, 2015; Su et al., 2016) . CoVs can cause upper and lower respiratory diseases, gastroenteritis, and central nervous system infections in a wide variety of avian and mammalian hosts. Some CoVs are human pathogens that cause mild to severe dise

Compared to the previous sentence embedding approach, this is far better.

The returned answers are much better than the previous ones. Also, each query takes around 3 seconds.

We are going to continue with this approach, and we save the index of the best returned paper for each question in the list **indexes**. Next, we will try to return the answer **itself** (the passage).

For this task we use another notebook, to keep resource usage as low as possible. We will only need the indexes.

In [27]:
indexes

[8093, 1988, 2384, 2596, 3533, 7823, 4815]