# 1. Initialize Script

If you're running this script on Google Colab, mount your Google Drive first:
1. Click the folder icon on the left.
2. Click Mount Drive.
3. The root directory will be /content/.
```
# your Google Drive folder would be at:
/content/drive/My Drive/
```

Change the working directory by running this command:
```
%cd /content/drive/My Drive/<your folder>
```
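The same two steps can also be done from a code cell. The sketch below assumes the `google.colab` package that Colab provides and the Drive path shown above; `mount_and_chdir` is a hypothetical helper name, not part of this repository, and it simply returns `False` when run outside Colab.

```python
import os

def mount_and_chdir(folder, mount_point='/content/drive'):
    """Mount Google Drive on Colab and change into `folder`.

    Hypothetical helper: returns True on Colab, False elsewhere.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    os.chdir(folder)  # paths containing spaces, such as "My Drive", are fine here
    return True

on_colab = mount_and_chdir('/content/drive/My Drive')
```

Unlike `%cd`, `os.chdir` is plain Python, so the same cell also runs unchanged as a script.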

In [1]:
%cd /content/drive/My Drive/Data Science/CORD-19_NLP

/content/drive/My Drive/Data Science/CORD-19_NLP


# 2. Load NLP functions

In [2]:
from src.text_preprocessing import spacy_NLP, STOP_WORDS, text_preprocess
# Build the tokenizer once instead of constructing spacy_NLP on every call
tokenize = spacy_NLP('en_core_web_sm').tokenize

def text_prep(text):
    return text_preprocess(text, tokenizer=tokenize, stopwords=STOP_WORDS)

# from src.text_preprocessing import nltk_NLP
# from nltk.stem.porter import PorterStemmer
# from nltk.stem.wordnet import WordNetLemmatizer
# nlp_tokenizer = nltk_NLP().tokenize_API()
# nlp_tokenizer = nltk_NLP(stemming=PorterStemmer, lemmatisation=WordNetLemmatizer).tokenize()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 3. Prepare and Pre-process Dataset

## 3.1. Read all
Each paper is stored in JSON format.

In [None]:
%%time
from src.covid_19_tp import authors_name, body_text, format_bib

from os import walk as dir_list
from tqdm import tqdm
import json

src_folder = 'raw_data/comm_use_subset'
def load_json(path):
    # Open each file in a context manager so its handle is closed after reading
    with open(path) as f:
        return json.load(f)

data = [
    {
        'paper_id': paper['paper_id'],
        'title': paper['metadata']['title'],
        'authors': authors_name(paper['metadata']['authors'], affiliation=True),

        'abstract': body_text(paper['abstract']),
        'text': body_text(paper['body_text']),

        'bibliography': format_bib(paper['bib_entries'])
    }
    for subdir, dirs, files in dir_list(f'./{src_folder}')
    for paper in tqdm(
        [
            load_json(f'{subdir}/{name}')
            for name in tqdm(files, desc=f'Loading all files in {subdir}')
        ], desc=f'Reading individual files in {subdir}'
    )
]

import pandas as pd
data = pd.DataFrame(data)

# Save dataset to pickle for faster loading in the future
from src.helper import pickle_dump
filename = '_'.join(src_folder.split('/'))
des_folder = 'processed_data'
pickle_dump(f'{des_folder}/{filename}_df.pkl', data)

## 3.2. Create Corpus from dataset
Save the corpus as a pickle file so that future runs can load it (see section 4) instead of rebuilding it.

In [None]:
from tqdm import tqdm
corpus = [
    text_prep(text)
    for text in tqdm(list(data['title'] + ' ' + data['abstract'] + ' ' + data['text']))
]

from src.helper import pickle_dump
filename = '_'.join(src_folder.split('/'))
folder = 'processed_data'
pickle_dump(f'./{folder}/{filename}_corpus.pkl', corpus)
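The save-once, load-later pattern used in this section can be sketched with the standard `pickle` module. This is a minimal illustration, not the code in `src.helper`: `load_or_build` and `build_corpus` are hypothetical names, and the toy corpus stands in for the real one.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the object pickled at `path`, building and saving it first if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    obj = build()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

calls = []

def build_corpus():
    calls.append(1)  # record how often the expensive step actually runs
    return [['virus', 'transmission'], ['incubation', 'period']]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_corpus.pkl')
    first = load_or_build(path, build_corpus)   # builds, then saves
    second = load_or_build(path, build_corpus)  # loads from disk
```

The second call returns the same corpus without running `build_corpus` again, which is the time saving the pickle files are meant to provide.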

## 3.3. Conduct TF-IDF and BM25


### 3.3.1. TFIDF

corpus_doc_tfidf: a list of the TF-IDF scores (term: score) of each document<br>
score:<br>
> Low = terms that are common across the corpus<br>
> High = terms that are rare across the corpus<br>

```
tfidf.corpus_doc_tfidf[:1]
```

term_df (term document frequency): a dict (key: value pairs) mapping each term to the number of documents in which it occurs<br>

```
tfidf.term_df
```
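To make the low/high scoring concrete, here is a minimal sketch of TF-IDF on a toy corpus. The exact weighting inside `src.tfidf.TFIDF` is not shown in this notebook, so the textbook form `tf * log(N / df)` is assumed, and `tfidf_scores` is a hypothetical function name.

```python
import math
from collections import Counter

def tfidf_scores(corpus):
    """Return (per-document {term: tf-idf}, {term: document frequency})."""
    n_doc = len(corpus)
    term_df = Counter()
    for doc in corpus:
        term_df.update(set(doc))  # count each term once per document
    idf = {term: math.log(n_doc / df) for term, df in term_df.items()}
    doc_tfidf = []
    for doc in corpus:
        tf = Counter(doc)
        doc_tfidf.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return doc_tfidf, dict(term_df)

toy_corpus = [
    ['virus', 'aerosol', 'transmission'],
    ['virus', 'incubation', 'period'],
    ['virus', 'surface', 'transmission'],
]
doc_tfidf, term_df = tfidf_scores(toy_corpus)
```

In this toy corpus "virus" occurs in every document and scores zero, "transmission" occurs in two documents and scores higher, and "aerosol" occurs in only one and scores highest.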

In [4]:
from src.tfidf import TFIDF
tfidf = TFIDF(corpus)

TF and DF for each term on each document: 100%|██████████| 9315/9315 [00:07<00:00, 1279.63it/s]
IDF for each term: 100%|██████████| 181453/181453 [00:00<00:00, 931566.14it/s]
TF-IDF on each document: 100%|██████████| 9315/9315 [00:07<00:00, 1303.76it/s]


In [5]:
from src.sklearn_tfidf import sklearn_TFIDF
sk_tfidf = sklearn_TFIDF(corpus)

Conduct TFIDF for individual documents: 100%|██████████| 9315/9315 [00:33<00:00, 276.90it/s]


### 3.3.2. Conduct BM25

In [6]:
from src.bm25 import BM25
bm25 = BM25(corpus)

Conducting TF and DF on corpus: 100%|██████████| 9315/9315 [00:07<00:00, 1294.94it/s]
[BM25] IDF for each term: 100%|██████████| 181453/181453 [00:00<00:00, 906297.83it/s]


In [7]:
from src.bm25 import BM25L
bm25L = BM25L(corpus)

Conducting TF and DF on corpus: 100%|██████████| 9315/9315 [00:07<00:00, 1262.01it/s]
[BM25] IDF for each term: 100%|██████████| 181453/181453 [00:00<00:00, 884656.43it/s]


In [8]:
from src.bm25 import BM25plus
bm25plus = BM25plus(corpus)

Conducting TF and DF on corpus: 100%|██████████| 9315/9315 [00:07<00:00, 1302.43it/s]
[BM25] IDF for each term: 100%|██████████| 181453/181453 [00:00<00:00, 936717.65it/s]


## 3.4. Get keywords for each document
Tag each document with its top 20 keywords, ranked by TF-IDF score.

In [None]:
data = data.reindex(columns=list(data.columns)+['keywords'])

import pandas as pd
from tqdm import tqdm
tqdm.pandas()
data['keywords'] = pd.Series(corpus).progress_apply(
    lambda doc: tfidf.doc_keywords(doc, 20)
)

from src.helper import pickle_dump
folder = 'processed_data'
filename = 'raw_data_comm_use_subset_df_keyword'
pickle_dump(f'./{folder}/{filename}.pkl', data)
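Selecting keywords from TF-IDF scores amounts to a top-k selection. The sketch below assumes a document's scores are available as a `{term: score}` dict; `top_keywords` is a hypothetical helper, and how `tfidf.doc_keywords` does this internally is not shown here.

```python
import heapq

def top_keywords(term_scores, k=20):
    """Return the k terms with the highest scores, best first."""
    return heapq.nlargest(k, term_scores, key=term_scores.get)

toy_scores = {'virus': 0.0, 'aerosol': 0.37, 'transmission': 0.14, 'droplet': 0.29}
keywords = top_keywords(toy_scores, k=2)
```

`heapq.nlargest` avoids sorting the whole vocabulary of a document when only a handful of keywords are needed.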

## 3.5. Relevant Retrieve class
1. The corpus is built from the data columns ['title', 'abstract', 'text'].
2. If a text_preprocessor function is passed to the class, it is applied to the corpus.
3. TF-IDF is computed on the corpus.
4. Each document is tagged with 20 keywords using TF-IDF.

### **search_similar** (func):
1. Tokens from the query are matched against each document's keywords to find matching documents (token-to-keyword matching).
2. BM25 is then used to score the relevance of those documents.
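The two steps can be sketched as a keyword filter followed by BM25 ranking. This is an illustration under stated assumptions, not the code in `src.covid_19_rr`: the function names are hypothetical, the Okapi BM25 form with `k1=1.5`, `b=0.75` and a non-negative IDF variant is assumed, and scores are computed over the filtered documents only.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against `query` with Okapi BM25."""
    n_doc = len(docs)
    avg_len = sum(len(d) for d in docs) / n_doc
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_doc - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def keyword_then_bm25(query, docs, keywords):
    """Keep documents sharing a keyword with the query, then rank them with BM25."""
    tokens = set(query)
    candidates = [i for i, kw in enumerate(keywords) if tokens & set(kw)]
    if not candidates:
        return {}
    scores = bm25_scores(query, [docs[i] for i in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda kv: kv[1], reverse=True)
    return dict(ranked)

toy_docs = [
    ['aerosol', 'droplet', 'transmission', 'virus'],
    ['incubation', 'period', 'virus'],
    ['droplet', 'microfluidic', 'chip'],
]
toy_keywords = [['aerosol', 'droplet'], ['incubation'], ['droplet', 'microfluidic']]
ranked = keyword_then_bm25(['aerosol', 'droplet'], toy_docs, toy_keywords)
```

Here the second document is dropped by the keyword filter, and the first document ranks above the third because it matches both query tokens.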

In [9]:
from src.covid_19_rr import rel_retrieve
rr = rel_retrieve(data, None, corpus)

TF and DF for each term on each document: 100%|██████████| 9315/9315 [00:07<00:00, 1300.32it/s]
IDF for each term: 100%|██████████| 181453/181453 [00:00<00:00, 947010.83it/s] 
TF-IDF on each document: 100%|██████████| 9315/9315 [00:07<00:00, 1314.12it/s]

Obtain 20 keywords from each documents using TFIDF


100%|██████████| 9315/9315 [00:13<00:00, 710.50it/s]


# 4. Load all pre-saved objects

In [3]:
from src.helper import pickle_load
filepath = 'processed_data/raw_data_comm_use_subset_df.pkl'
data = pickle_load(filepath)

filepath = 'processed_data/raw_data_comm_use_subset_corpus.pkl'
corpus = pickle_load(filepath)

# filepath = 'processed_data/raw_data_comm_use_subset_df_keyword.pkl'
# data = pickle_load(filepath)

# 5. Search Relevant Articles based on question
Compare the results returned by the different retrieval methods for the same question.

In [11]:
question_list = [
    "Is the virus transmitted by aerisol, droplets, food, close contact, fecal matter, or water",
    "How long is the incubation period for the virus",
    "Can the virus be transmitted asymptomatically or during the incubation period",
    "What is the quantity of asymptomatic shedding",
    "How does temperature and humidity affect the transmission of 2019-nCoV",
    "How long can 2019-nCoV remain viable on inanimate, environmental, or common surfaces",
    "What types of inanimate or environmental surfaces affect transmission, survival, or inactivation of 2019-nCoV",
    "Can the virus be found in nasal discharge, sputum, urine, fecal matter, or blood",
    "What risk factors contribute to the severity of 2019-nCoV",
    "How does hypertension affect patients"
]

from src.helper import sort_dict
from IPython.display import display
import pandas as pd

methods = [
    tfidf.search_similar,
    sk_tfidf.search_similar,
    bm25.get_scores,
    bm25L.get_scores,
    bm25plus.get_scores
]

for question in question_list[:1]:
    print(question)
    question = text_prep(question)

    result = sort_dict(
        rr.search_similar(question),
        'value', True, 10
    )
    print(result)
    display(
        pd.DataFrame(
            [
                {'Title': rr.df_dict.get(paper_id)[rr.col_idx_title['title']], 'Score': score}
                for paper_id, score in result.items()
            ]
        ).set_index('Title')
    )
    for method in methods:
        print(method.__func__)
        result = sort_dict(
            dict(zip(range(tfidf.n_doc), method(question))),
            'value', True, 10
        )

        display(
            pd.DataFrame(
                [{'Title': data.iloc[key]['title'], 'Score': score} for key, score in result.items()]
            ).set_index('Title')
        )

Is the virus transmitted by aerisol, droplets, food, close contact, fecal matter, or water


Conducting TF and DF on corpus: 100%|██████████| 114/114 [00:00<00:00, 75740.64it/s]
[BM25] IDF for each term: 100%|██████████| 815/815 [00:00<00:00, 460322.89it/s]
Conducting TF and DF on corpus: 100%|██████████| 114/114 [00:00<00:00, 19060.46it/s]
[BM25] IDF for each term: 100%|██████████| 5757/5757 [00:00<00:00, 1213946.41it/s]

{'340036a465efeb78d0b0160bf6dddd2322f293ef': 4576.699335146094, 'bd246a6d199cb961f7c4c3421d22324aface4ff4': 4576.699335146094, '04d02a37dcbb17916d2a5c03288cb9b59000ebba': 4576.699335146094, '936d646d345dd1d9e2df55574f53ebae22c29146': 4576.699335146094, '03d32bf9da6495150f5016a0bf2d4b7647620c7d': 4576.699335146094, 'a2ea85a02fee49f55f485a2b5d808636cb38a0bb': 4576.699335146094, 'b7de4f4a99e8da86891ed28bca52afcbcbdabfa1': 4576.699335146094, '9756bb3c608ed790d2306fc8db815a694eeca45f': 4548.642609742626, '513bf780b2d2a5af6a194122cd3dd98c7b507fbe': 4546.718965051681, '77dc09841a62d92ba5a40d4f848f34e3c4e27713': 4545.951882207763}





| Title | Score |
|---|---|
|  | 4576.699335 |
|  | 4576.699335 |
|  | 4576.699335 |
|  | 4576.699335 |
|  | 4576.699335 |
|  | 4576.699335 |
|  | 4576.699335 |
| Transmission routes of 2019-nCoV and controls in dental practice | 4548.64261 |
| Patterns of human social contact and contact with animals in Shanghai, China | 4546.718965 |
| Social contact patterns relevant to the spread of respiratory infectious diseases in Hong | 4545.951882 |


<function TFIDF.search_similar at 0x7fd5869e69d8>


| Title | Score |
|---|---|
| The efficacy of medical masks and respirators against respiratory infection in healthcare workers | 0.069543 |
| The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus | 0.067443 |
| Detection of immunoglobulin (Ig) A antibodies against porcine epidemic diarrhea virus (PEDV) in fecal and serum samples | 0.062812 |
| Understanding Viral Transmission Behavior via Protein Intrinsic Disorder Prediction: Coronaviruses | 0.058863 |
| Transmission of Influenza A in a Student Office Based on Realistic Person-to-Person Contact and Surface Touch Behaviour | 0.05472 |
| Equine rhinitis B viruses in horse fecal samples from the Middle East | 0.052421 |
| Awareness of droplet and airborne isolation precautions among dental health professionals during the outbreak of corona virus infection in Riyadh city, Saudi Arabia | 0.051947 |
| RNA Viral Community in Human Feces: Prevalence of Plant Pathogenic Viruses | 0.051095 |
|  | 0.050165 |
| Recognition of aerosol transmission of infectious agents: a commentary | 0.050053 |


<function sklearn_TFIDF.search_similar at 0x7fd559b9b378>


| Title | Score |
|---|---|
| Cough aerosol in healthy participants: fundamental knowledge to optimize droplet-spread infectious respiratory disease management Cough aerosol in healthy participants: fundamental knowledge to optimize droplet-spread infectious respiratory disease management | 0.323815 |
| micromachines Recent Advances in Droplet-based Microfluidic Technologies for Biochemistry and Molecular Biology | 0.317858 |
| Effect of selected gastrointestinal parasites and viral agents on fecal S100A12 concentrations in puppies as a potential comparative model | 0.316948 |
| RNA Viral Community in Human Feces: Prevalence of Plant Pathogenic Viruses | 0.261757 |
| Characterizing the rapid spread of porcine epidemic diarrhea virus (PEDV) through an animal food manufacturing facility | 0.254149 |
| Effectiveness of cough etiquette maneuvers in disrupting the chain of transmission of infectious respiratory diseases | 0.238926 |
| Standardized Preparation for Fecal Microbiota Transplantation in Pigs | 0.232145 |
| Emerging Themes in Epidemiology Mixing patterns and the spread of close-contact infectious diseases | 0.225107 |
| Theoretical Biology and Medical Modelling | 0.224804 |
| Detection of immunoglobulin (Ig) A antibodies against porcine epidemic diarrhea virus (PEDV) in fecal and serum samples | 0.218448 |


<function BM25.get_scores at 0x7fd559b9b620>


| Title | Score |
|---|---|
|  | 40.411282 |
| Genetic Analysis of West Nile Virus Isolates from an Outbreak in Idaho, United States, 2006-2007 | 40.363436 |
| SUPPLEMENTARY ONLINE MATERIAL -SEARCH STRATEGY Medline (Ovid) Studies that describe epidemiology in severe cases. Viruses | 39.656789 |
| Heterogeneity in District-level Transmission of Ebola Virus Disease during the Epidemic in West Africa | 39.1665 |
| Supplementary Information for A structure-based rationale for sialic acid independent host-cell entry of Sosuga virus | 38.92572 |
| SUPPLEMENTARY MATERIALS FOR Viruses in Vietnamese patients presenting with community acquired sepsis of unknown cause | 38.744017 |
| SUPPLEMENTARY DATA Sequence-independent characterization of viruses based on the pattern of viral small RNAs produced by the host | 38.671825 |
|  | 38.520246 |
| GMC21.tibia_length GMC21.spleen_wt | 38.425654 |
|  | 38.284635 |


<function BM25L.get_scores at 0x7fd559b9b6a8>


| Title | Score |
|---|---|
| Infectious Disease Risk Across the Growing Human-Non Human Primate Interface: A Review of the Evidence | 22.959753 |
| Transmission routes of 2019-nCoV and controls in dental practice | 22.48145 |
| PUBLIC HEALTH REVIEW ARTICLE | 22.470956 |
| The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus | 22.419969 |
| Hepatitis A Virus: Essential Knowledge and a Novel Identify-Isolate-Inform Tool for Frontline Healthcare Providers | 22.348442 |
| Understanding community perceptions, social norms and current practice related to respiratory infection in Bangladesh during 2009: a qualitative formative study | 22.236799 |
| Fomite-mediated transmission as a sufficient pathway: a comparative analysis across three viral pathogens | 22.195121 |
| A Review and Update on Waterborne Viral Diseases Associated with Swimming Pools | 22.096086 |
| PUBLIC HEALTH REVIEW ARTICLE Animal viruses, bacteria, and cancer: a brief commentary | 22.058787 |
| Transmission of Influenza A in a Student Office Based on Realistic Person-to-Person Contact and Surface Touch Behaviour | 22.039038 |


<function BM25plus.get_scores at 0x7fd559b9b950>


| Title | Score |
|---|---|
| Infectious Disease Risk Across the Growing Human-Non Human Primate Interface: A Review of the Evidence | 28.529059 |
| Microbiome analysis reveals the abundance of bacterial pathogens in Rousettus leschenaultii guano | 27.584505 |
| Fomite-mediated transmission as a sufficient pathway: a comparative analysis across three viral pathogens | 27.578554 |
| Hepatitis A Virus: Essential Knowledge and a Novel Identify-Isolate-Inform Tool for Frontline Healthcare Providers | 27.264113 |
| PUBLIC HEALTH REVIEW ARTICLE | 27.13095 |
| Understanding community perceptions, social norms and current practice related to respiratory infection in Bangladesh during 2009: a qualitative formative study | 27.068386 |
| The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus | 26.972003 |
| Transmission routes of 2019-nCoV and controls in dental practice | 26.849155 |
| PUBLIC HEALTH REVIEW ARTICLE Animal viruses, bacteria, and cancer: a brief commentary | 26.685579 |
| A Review and Update on Waterborne Viral Diseases Associated with Swimming Pools | 26.678893 |
