<a href="https://colab.research.google.com/github/cemreefe/cmpe493-project/blob/main/huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip3 install xmltodict

import os
import io
import re
import json
import math
import pickle
import string
import tarfile
import xmltodict
import numpy as np
import pandas as pd

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


**Dataset download**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
def read_file(path):
  with open(path, 'r') as f:
    return f.read()

In [3]:
# Create the project folder on Drive if it does not exist yet
os.makedirs('drive/MyDrive/CMPE/CMPE493', exist_ok=True)

In [4]:
if not os.path.exists('drive/MyDrive/CMPE/CMPE493/topics-rnd5.xml'):
  !curl https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml --output drive/MyDrive/CMPE/CMPE493/topics-rnd5.xml

if not os.path.exists('drive/MyDrive/CMPE/CMPE493/qrels-covid_d5_j0.5-5.txt'):
  !curl https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt --output drive/MyDrive/CMPE/CMPE493/qrels-covid_d5_j0.5-5.txt

if not os.path.exists('drive/MyDrive/CMPE/CMPE493/cord-19_2020-07-16.tar.gz'):
  !curl https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-07-16.tar.gz --output drive/MyDrive/CMPE/CMPE493/cord-19_2020-07-16.tar.gz
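
The three checks above follow the same pattern: download a file only when it is not already cached on Drive. A minimal sketch of that pattern as a helper, assuming a hypothetical `ensure_file` function and using `urllib.request.urlretrieve` as a stand-in for `curl` (neither is part of the original notebook):

```python
import os
import urllib.request


def ensure_file(path, url, fetch=urllib.request.urlretrieve):
  """Download `url` to `path` unless `path` already exists.

  `ensure_file` and the `fetch` parameter are hypothetical names for this
  sketch; `fetch(url, path)` must write the downloaded bytes to `path`.
  Returns True if a download happened, False if the cached file was reused.
  """
  if os.path.exists(path):
    return False
  # Create the parent directory first so the download has somewhere to land.
  parent = os.path.dirname(path)
  if parent:
    os.makedirs(parent, exist_ok=True)
  fetch(url, path)
  return True
```

With this helper, each check could shrink to a single call such as `ensure_file('drive/MyDrive/CMPE/CMPE493/topics-rnd5.xml', 'https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml')`.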

In [5]:
if not os.path.exists('2020-07-16'):
  # The context manager closes the archive even if extraction fails
  with tarfile.open('drive/MyDrive/CMPE/CMPE493/cord-19_2020-07-16.tar.gz', 'r:gz') as tar:
    tar.extractall()

**Using pandas dataframes to read and prepare the data**


In [6]:
df_metadata = pd.read_csv('2020-07-16/metadata.csv')



In [7]:
# Keep only the columns used below; all other metadata columns are dropped
df_metadata = df_metadata[['cord_uid', 'title', 'abstract']].copy()

In [8]:
# Delete duplicate document entries
df_metadata.drop_duplicates(subset='cord_uid', keep='first', inplace=True)

In [9]:
df_metadata

Unnamed: 0,cord_uid,title,abstract
0,ug7v899j,Clinical features of culture-proven Mycoplasma...,OBJECTIVE: This retrospective chart review des...
1,02tnwd4m,Nitric oxide: a pro-inflammatory mediator in l...,Inflammatory diseases of the respiratory tract...
2,ejv2xln0,Surfactant protein-D and pulmonary host defense,Surfactant protein-D (SP-D) participates in th...
3,2b73a28n,Role of endothelin-1 in lung disease,Endothelin-1 (ET-1) is a 21 amino acid peptide...
4,9785vg6d,Gene expression in epithelial cells in respons...,Respiratory syncytial virus (RSV) and pneumoni...
...,...,...,...
192504,z4ro6lmh,Rapid radiological improvement of COVID-19 pne...,
192505,hi8k8wvb,SARS E protein in phospholipid bilayers: an an...,Abstract We report on an anomalous X-ray refle...
192506,ma3ndg41,Italian Society of Interventional Cardiology (...,COVID‐19 pandemic raised the issue to guarante...
192507,wh10285j,"Nimble, Together: A Training Program's Respons...",


In [12]:
# Read the relevance judgements (qrels) file. It has no header row, so one is prepended.
topic_relevances = 'topic iter document_id judgement\n' + read_file('drive/MyDrive/CMPE/CMPE493/qrels-covid_d5_j0.5-5.txt')

df_relevances = pd.read_csv(io.StringIO(topic_relevances), sep=' ')
del df_relevances['iter']

df_relevances

Unnamed: 0,topic,document_id,judgement
0,1,005b2j4b,2
1,1,00fmeepz,1
2,1,010vptx3,2
3,1,0194oljo,1
4,1,021q9884,1
...,...,...,...
69313,50,zvop8bxh,2
69314,50,zwf26o63,1
69315,50,zwsvlnwe,0
69316,50,zxr01yln,1


In [13]:
# Read topics file
topics_obj = xmltodict.parse(read_file('drive/MyDrive/CMPE/CMPE493/topics-rnd5.xml'))
topics     = json.loads(json.dumps(topics_obj))

# Query, question and narrative fields are concatenated
topics_dict = {}
for topic in topics['topics']['topic']:
  # a topic has the following fields:
  #  * @number
  #  * narrative
  #  * query
  #  * question
  topics_dict[topic['@number']] = topic['query'] + ' ' + topic['question'] + ' ' + topic['narrative']
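
To make the shape of the topics file concrete, here is a minimal sketch that builds the same kind of dictionary from an inline sample. It uses the standard library's `xml.etree.ElementTree` rather than `xmltodict` so that it runs without extra packages. The sample XML and the names `SAMPLE_XML` and `parse_topics` are illustrative assumptions, not the contents of `topics-rnd5.xml`.

```python
import xml.etree.ElementTree as ET

# Illustrative sample with the same fields as the real topics
# (number attribute, query, question, narrative); not the real file.
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking information about the virus's origin</narrative>
  </topic>
</topics>
"""


def parse_topics(xml_text):
  """Return {topic number: 'query question narrative'} for each topic."""
  root = ET.fromstring(xml_text.strip())
  result = {}
  for topic in root.findall('topic'):
    fields = [topic.findtext(name, default='').strip()
              for name in ('query', 'question', 'narrative')]
    # The number attribute is a string, just like the '@number' key above.
    result[topic.get('number')] = ' '.join(fields)
  return result


sample_topics = parse_topics(SAMPLE_XML)
```

As with `topics_dict`, the keys are strings, which is why integer topic ids need a `str(...)` conversion before lookup.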

# Data so far

* `topics_dict` 
      maps each topic id (a string such as '1') to the concatenated query, question, and narrative text
* `df_relevances` 
      has the following three columns:
      topic	document_id	judgement
* `df_metadata`
      holds one row per document
      has the following three columns (the others were deleted):
      cord_uid	title	abstract


In [14]:
# Download nltk English stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
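# Build the text for each document: title followed by abstract.
# Missing titles or abstracts become '' so they do not show up as the string 'nan'.
docs = np.array(df_metadata.fillna(''))
contents = {}

for doc in docs:
  contents[doc[0]] = f'{doc[1]} {doc[2]}'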

### `contents` is a dictionary that maps each document id (`cord_uid`) to the document's title and abstract, joined by a space.
```
document_id: f'{document_title} {document_abstract}'
```

Reference: [HuggingFace Sentence Transformers](https://huggingface.co/sentence-transformers/ce-ms-marco-TinyBERT-L-2)

In [None]:
# Install the sentence-transformers library, which provides the CrossEncoder class used below
!pip install -U sentence-transformers

Models that were tested:


*   sentence-transformers/ce-ms-marco-TinyBERT-L-2
*   sentence-transformers/ce-ms-marco-TinyBERT-L-6
*   sentence-transformers/ce-ms-marco-electra-base

In [32]:
from sentence_transformers import CrossEncoder
import transformers

model_name = 'sentence-transformers/ce-ms-marco-electra-base'
model = CrossEncoder(model_name, max_length=512)
model.tokenizer = transformers.BertTokenizerFast.from_pretrained(model_name)

In [18]:
# Collect the (topic, document_id) pairs that have a judgement in df_relevances
pairs = np.array(df_relevances[['topic', 'document_id']])
pairs

array([[1, '005b2j4b'],
       [1, '00fmeepz'],
       [1, '010vptx3'],
       ...,
       [50, 'zwsvlnwe'],
       [50, 'zxr01yln'],
       [50, 'zz8wvos9']], dtype=object)

In [19]:
# Replace topic and document ids with their texts.
# topics_dict is keyed by strings, so the integer topic id is converted first.
value_pairs = [(topics_dict[str(topic)], contents[doc_id]) for topic, doc_id in pairs]

value_pairs[0]

("coronavirus origin what is the origin of COVID-19 seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans",
 'Monophyletic Relationship between Severe Acute Respiratory Syndrome Coronavirus and Group 2 Coronaviruses Although primary genomic analysis has revealed that severe acute respiratory syndrome coronavirus (SARS CoV) is a new type of coronavirus, the different protein trees published in previous reports have provided no conclusive evidence indicating the phylogenetic position of SARS CoV. To clarify the phylogenetic relationship between SARS CoV and other coronaviruses, we compiled a large data set composed of 7 concatenated protein sequences and performed comprehensive analyses, using the maximum-likelihood, Bayesian-inference, and maximum-parsimony methods. All resulting phylogenetic trees displayed an identical topology and supported the hypothesis that the relationship between SARS CoV and 

In [33]:
%%time
# Given a (topic, document) pair, the model predicts a relevance score
scores = model.predict(value_pairs)

CPU times: user 22min 31s, sys: 19min 55s, total: 42min 27s
Wall time: 41min 40s
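
Because a single `model.predict` call over all pairs took about 41 minutes here, it can help to score in chunks and keep partial results, so that an interrupted Colab session does not lose everything. The sketch below shows one way to do that. `score_in_chunks`, `chunk_size`, `done`, and the `score_fn` callback are hypothetical names, and `score_fn` stands in for something like `model.predict`.

```python
def score_in_chunks(pairs, score_fn, chunk_size=1000, done=None):
  """Score `pairs` chunk by chunk, resuming after any scores in `done`.

  `score_fn` takes a list of (topic_text, document_text) pairs and returns
  one score per pair. `done` holds scores already computed for the first
  len(done) pairs; it is extended in place and returned.
  """
  scores = [] if done is None else done
  for start in range(len(scores), len(pairs), chunk_size):
    chunk = pairs[start:start + chunk_size]
    scores.extend(float(s) for s in score_fn(chunk))
    # A real run could save `scores` to Drive here (for example with pickle).
  return scores
```

A call such as `score_in_chunks(value_pairs, model.predict)` would then be expected to give the same scores as one large call, returned as a plain list rather than a NumPy array.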


In [34]:
scores

array([0.6301793 , 0.99271387, 0.95577544, ..., 0.9325474 , 0.95387155,
       0.59199375], dtype=float32)

In [35]:
# Prepare results for writing to file, one line per pair:
#   topic 0 document_id 0 score 0
# Only even-numbered topics are kept for evaluation
results = []
for score, pair in zip(scores, pairs):
  if pair[0] % 2 == 0:
    results.append(f'{pair[0]} 0 {pair[1]} 0 {score} 0')

In [36]:
results[:10]

['2 0 01goni72 0 0.09895388036966324 0',
 '2 0 01yc7lzk 0 0.22163823246955872 0',
 '2 0 02cy1s8x 0 0.801167905330658 0',
 '2 0 02f0opkr 0 0.3528762459754944 0',
 '2 0 03h85lvy 0 0.016960153356194496 0',
 '2 0 03id5o2g 0 0.7504514455795288 0',
 '2 0 03s9spbi 0 0.9414146542549133 0',
 '2 0 04awj06g 0 0.9425361752510071 0',
 '2 0 04rbtmmi 0 0.9602665305137634 0',
 '2 0 084o1dmp 0 0.0003308752493467182 0']
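
Each line above has six space-separated fields. They line up with the run format conventionally read by `trec_eval` (`topic Q0 document_id rank score run_tag`), with `0` written in the second, fourth, and sixth positions. If explicit ranks are wanted, a sketch along these lines could sort each topic's documents by score and number them. `to_run_lines` and the default `run_tag` value are hypothetical, and whether the downstream evaluation needs explicit ranks is an assumption.

```python
from collections import defaultdict


def to_run_lines(scored, run_tag='ce-run'):
  """Turn (topic, document_id, score) triples into ranked run lines.

  Documents are ranked per topic by descending score, starting at rank 1.
  """
  by_topic = defaultdict(list)
  for topic, doc_id, score in scored:
    by_topic[topic].append((doc_id, score))

  lines = []
  for topic in sorted(by_topic):
    ranked = sorted(by_topic[topic], key=lambda item: item[1], reverse=True)
    for rank, (doc_id, score) in enumerate(ranked, start=1):
      lines.append(f'{topic} Q0 {doc_id} {rank} {score} {run_tag}')
  return lines
```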

In [38]:
# Write results to file
with open('results_ce-ms-marco-electra-base.txt', 'w') as f:
  f.write('\n'.join(results))
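
Once the run is written, it can be compared against the judgements in `df_relevances`. As an illustration only, the sketch below computes precision at k for one topic, treating a judgement of 1 or 2 as relevant. That threshold, the function name `precision_at_k`, and the choice of metric are assumptions rather than the evaluation used in this project.

```python
def precision_at_k(scored_docs, judgements, k=10):
  """Fraction of the top-k documents (by score) that are judged relevant.

  `scored_docs` is a list of (document_id, score) pairs for one topic and
  `judgements` maps document_id to its judgement (0, 1, or 2).
  Unjudged documents count as not relevant. If fewer than k documents are
  available, the missing positions count as misses.
  """
  if k <= 0:
    raise ValueError('k must be positive')
  top = sorted(scored_docs, key=lambda item: item[1], reverse=True)[:k]
  hits = sum(1 for doc_id, _ in top if judgements.get(doc_id, 0) >= 1)
  return hits / k
```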