### Initialize Script

If you're running this script on Google Colab, mount your Google Drive:<br>
1. Click the folder icon on the left.
2. Click Mount Drive.
3. The root directory will be /content/.
```
# Your Google Drive folder will be at:
/content/drive/My Drive/
```

Change the working directory by running this command:
```
%cd /content/drive/My Drive/<your folder>
```

In [None]:
%cd /content/drive/My Drive/Data Science/Covid-19

### Read all
Each paper is stored in JSON format.

In [None]:
%%time
from src.covid_19_tp import authors_name, body_text, format_bib

from os import walk as dir_list
from tqdm import tqdm
import json

src_folder = 'raw_data/comm_use_subset'
data = [
    {
        'paper_id': file['paper_id'],
        'title': file['metadata']['title'],
        'authors': authors_name(file['metadata']['authors'], affiliation=True),

        'abstract': body_text(file['abstract']),
        'text': body_text(file['body_text']),

        'bibliography': format_bib(file['bib_entries'])
    }
    for subdir, dirs, files in dir_list(f'./{src_folder}')  # walk the source folder defined above
    for file in tqdm(
        [
            json.load(open(f'{subdir}/{file}'))
            for file in tqdm(files, desc=f'Loading all files in {subdir}')
        ], desc=f'Reading individual files in {subdir}'
    )
]
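
The nested comprehension above can be hard to follow, so here is a minimal sketch of the same idea written as an explicit loop. It is not the code in `src.covid_19_tp`: `join_paragraphs` and `load_papers` are hypothetical names, and the assumption that `abstract` and `body_text` are lists of paragraph dicts with a `text` key is mine. The demo writes one tiny paper to a temporary folder so the block runs on its own.

```python
import json
import os
import tempfile


def join_paragraphs(sections):
    """Join the 'text' field of each paragraph dict (assumed layout)."""
    return '\n'.join(section['text'] for section in sections)


def load_papers(src_folder):
    """Walk src_folder and turn every .json file into a flat record."""
    records = []
    for subdir, _dirs, files in os.walk(src_folder):
        for name in sorted(files):
            if not name.endswith('.json'):
                continue
            # The context manager closes each file as soon as it is parsed.
            with open(os.path.join(subdir, name), encoding='utf-8') as handle:
                paper = json.load(handle)
            records.append({
                'paper_id': paper['paper_id'],
                'title': paper['metadata']['title'],
                'abstract': join_paragraphs(paper.get('abstract', [])),
                'text': join_paragraphs(paper.get('body_text', [])),
            })
    return records


# Demo with a single made-up paper in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_paper = {
        'paper_id': 'demo-1',
        'metadata': {'title': 'A demo paper'},
        'abstract': [{'text': 'Short abstract.'}],
        'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
    }
    with open(os.path.join(tmp, 'demo-1.json'), 'w', encoding='utf-8') as out:
        json.dump(demo_paper, out)
    records = load_papers(tmp)

print(records[0]['title'])
```

Opening each file in a `with` block avoids leaving thousands of file handles open while the list is being built.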

### Create DataFrame with dataset

In [None]:
import pandas as pd
data = pd.DataFrame(data)

### Save Dataset

In [None]:
filename = '_'.join(src_folder.split('/'))
des_folder = 'processed_data'

from os.path import isdir
from os import mkdir
if not isdir(f'./{des_folder}'): mkdir(f'./{des_folder}') # Create folder if it does not exist

import pickle
with open(f'{des_folder}/{filename}_df.pkl', 'wb') as output:
    pickle.dump(data, output)

### Load Dataset

In [1]:
filepath = 'processed_data/raw_data_comm_use_subset_df.pkl'

import pickle
with open(filepath, 'rb') as f:
    data = pickle.load(f)
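
The save and load steps above follow the same pattern each time, so they could be wrapped in two small helpers. This is only a sketch: `save_pickle` and `load_pickle` are hypothetical names, and the demo uses a temporary folder and a plain dict instead of the real DataFrame.

```python
import os
import pickle
import tempfile


def save_pickle(obj, folder, filename):
    """Pickle obj to <folder>/<filename>.pkl, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, f'{filename}.pkl')
    with open(path, 'wb') as output:
        pickle.dump(obj, output)
    return path


def load_pickle(path):
    """Load a pickled object back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip in a temporary folder with a stand-in object.
with tempfile.TemporaryDirectory() as tmp:
    path = save_pickle({'paper_id': 'abc'}, os.path.join(tmp, 'processed_data'), 'demo_df')
    saved_name = os.path.basename(path)
    restored = load_pickle(path)

print(saved_name, restored)
```

`os.makedirs(..., exist_ok=True)` replaces the separate `isdir` check and `mkdir` call.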

### Download required NLTK data (optional if already downloaded)

Download the NLTK stopwords corpus to use stop words:
```
import nltk
nltk.download('stopwords')
```
Download the NLTK wordnet corpus to use WordNetLemmatizer:
```
import nltk
nltk.download('wordnet')
```
Download the NLTK punkt models to use the Punkt Sentence Tokenizer:
```
import nltk
nltk.download('punkt')
```

### Load NLP functions

In [2]:
from src.text_preprocessing import nltk_NLP, spacy_NLP, STOP_WORDS, text_preprocess
spacy_tokenizer = spacy_NLP('en_core_web_sm').tokenize_API()
nlp_tokenizer = nltk_NLP().tokenize_API()

# from nltk.stem.porter import PorterStemmer
# from nltk.stem.wordnet import WordNetLemmatizer
# nlp_custom_tokenizer = nltk_NLP(stemming=PorterStemmer, lemmatisation=WordNetLemmatizer).custom_API()

text_prep = lambda text: text_preprocess(text, tokenizer=spacy_tokenizer, stopwords=STOP_WORDS)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Evan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
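
The implementation of `text_preprocess` lives in `src.text_preprocessing` and is not shown here. As a rough illustration of what such a step usually does (lowercasing, tokenizing, dropping stop words), here is a minimal sketch. `simple_preprocess` and `DEMO_STOP_WORDS` are hypothetical, and the regex tokenizer stands in for the spaCy or NLTK tokenizers used above.

```python
import re

# Tiny illustrative stop word list; the real STOP_WORDS set is much larger.
DEMO_STOP_WORDS = {'the', 'is', 'by', 'or', 'a', 'of', 'and', 'to', 'in'}


def simple_preprocess(text, stopwords=DEMO_STOP_WORDS):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return ' '.join(token for token in tokens if token not in stopwords)


cleaned = simple_preprocess('Is the virus transmitted by aerosol, droplets, or water?')
print(cleaned)
```

The sketch returns a single space-separated string so that the result can be split into tokens later, which matches how `text_prep(query).split()` is used in the BM25 cell.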


### Create Corpus from dataset
Save the corpus as a pickle file so that later runs can load it instead of rebuilding it.

In [None]:
from tqdm import tqdm

corpus = [
    text_prep(text)
    for text in tqdm(list(data['title'] + ' ' + data['abstract'] + ' ' + data['text']))
]

filename = '_'.join(src_folder.split('/'))  # src_folder is defined in the "Read all" cell
folder = 'processed_data'

from os.path import isdir
from os import mkdir
if not isdir(f'./{folder}'): mkdir(f'./{folder}') # Create folder if it does not exist

import pickle
with open(f'{folder}/{filename}_corpus.pkl', 'wb') as output:
    pickle.dump(corpus, output)

### Load Corpus from Pickle

In [3]:
import pickle
folder = 'processed_data'
filename = 'raw_data_comm_use_subset_corpus'
with open(f'./{folder}/{filename}.pkl', 'rb') as f:
    corpus = pickle.load(f) 

### Conduct TF-IDF using custom functions

In [None]:
%%time
from src.tf_idf import TFIDF

tdidf = TFIDF()
tdidf.tfidf_corpus(corpus)

In [None]:
'''
    corpus_doc_tfidf: list of TF-IDF scores (term: score) for each document
    score:
        Low = terms that are frequent across documents
        High = terms that are rare across documents
'''
tdidf.corpus_doc_tfidf[:1]

In [None]:
'''
    term_doc_freq: a dict (key: value pairs) mapping each term to the number of documents it occurs in
'''
tdidf.term_doc_freq
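
The two attributes described above can be illustrated with a small from-scratch sketch. This is not the `TFIDF` class from `src.tf_idf`, whose exact formula is not shown. It assumes the common definition of term frequency multiplied by `log(N / document frequency)`, and `tfidf_scores` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_scores(corpus):
    """Return (document frequency per term, list of {term: score} per document)."""
    docs = [doc.split() for doc in corpus]
    # Count each term once per document it appears in.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return doc_freq, scores


demo_corpus = ['virus bat origin', 'virus weather season', 'virus immunity vaccine']
doc_freq, scores = tfidf_scores(demo_corpus)
print(doc_freq['virus'], scores[0])
```

In this toy corpus `virus` occurs in every document, so its score is zero, while `bat` occurs in only one document and scores higher.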

In [None]:
text = corpus[0]
print(text)

tdidf.get_text_keywords(text, 10)

In [None]:
question = "Is the virus transmitted by aerosol, droplets, food, close contact, fecal matter, or water"

tdidf.get_text_keywords(text_prep(question), 10)
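
Keyword extraction of this kind usually comes down to picking the highest-scoring terms. The sketch below shows that selection step only. `top_keywords` is a hypothetical helper, not the `get_text_keywords` method, and the scores in `demo_scores` are made-up values for illustration.

```python
import heapq


def top_keywords(term_scores, n=10):
    """Return the n (term, score) pairs with the highest scores."""
    return heapq.nlargest(n, term_scores.items(), key=lambda item: item[1])


# Illustrative scores only, not output from the notebook.
demo_scores = {'virus': 0.0, 'aerosol': 0.42, 'droplets': 0.37, 'water': 0.11}
keywords = top_keywords(demo_scores, n=2)
print(keywords)
```

`heapq.nlargest` avoids sorting the full vocabulary when only a handful of keywords is needed.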

### Conduct TF-IDF using sklearn functions

In [4]:
%%time
from src.tf_idf import sklearn_TFIDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sk_tfidf = sklearn_TFIDF()
sk_tfidf.tfidf_corpus(corpus)

Wall time: 16.4 s


In [None]:
text = corpus[0]
print(text)

sk_tfidf.get_text_keywords(text, 10)

In [None]:
question = "Is the virus transmitted by aerosol, droplets, food, close contact, fecal matter, or water"

sk_tfidf.get_text_keywords(text_prep(question))

### Get keywords for each document
Using the sklearn-based TF-IDF.

In [5]:
data = data.reindex(columns=list(data.columns)+['keywords'])

import pandas as pd
from tqdm import tqdm
tqdm.pandas()
data['keywords'] = pd.Series(corpus).progress_apply(
    lambda doc: sk_tfidf.get_text_keywords(doc)
)

100%|█████████████████████████████████████████████████████████████████████████████| 9315/9315 [00:38<00:00, 241.09it/s]


### BM25

In [8]:
queries = [
    'coronavirus origin',
    'coronavirus response to weather changes',
    'coronavirus immunity'
]
from src.helper import sort_dict
from src.covid_19_BM25 import query_bm25

from IPython.display import display

for query in queries:
    print(f'The query is: {query}')
    
    result = query_bm25(
        text_prep(query).split(),
        data
    )
    display(result)

The query is: coronavirus origin


| Title | Score |
| --- | --- |
| Bat origin of human coronaviruses | 15.093743 |
| Identification and Characterization of a Novel Alpaca Respiratory Coronavirus Most Closely Related to the Human Coronavirus 229E | 13.851795 |
| Novel algorithm for accelerated electroanatomic mapping and prediction of earliest activation of focal cardiac arrhythmias using mathematical optimization | 7.648872 |
| Modelling input-output flows of severe acute respiratory syndrome in mainland China | 6.17483 |
| Cryo-EM structure of infectious bronchitis coronavirus spike protein reveals structural and functional evolution of coronavirus spike proteins | 4.770482 |
| Complete genome analysis of canine respiratory coronavirus | 4.157596 |
| Isolation and characterization of avian coronavirus from healthy Eclectus parrots (Eclectus roratus) from Indonesia | 3.959852 |
| SCIENCE CHINA SPECIAL TOPIC: Haunted with and hunting for viruses A novel human coronavirus: Middle East respiratory syndrome human coronavirus | 3.45686 |
| Phylogenetic investigation of enteric bovine coronavirus in Ireland reveals partitioning between European and global strains | 3.281272 |
| Molecular evidence for the evolution of ichnoviruses from ascoviruses by symbiogenesis | 3.031386 |


The query is: coronavirus response to weather changes


| Title | Score |
| --- | --- |
| Pathogen seasonality and links with weather in England and Wales: a big data time series analysis | 93.729419 |
| Challenges in developing methods for quantifying the effects of weather and climate on water-associated diseases: A systematic review | 56.138194 |
| A case-crossover analysis of the impact of weather on primary cases of Middle East respiratory syndrome | 43.226082 |
| Short Term Effects of Weather on Hand, Foot and Mouth Disease | 42.016031 |
| Directional and reoccurring sequence change in zoonotic RNA virus genomes visualized by time- series word count OPEN | 32.464706 |
| One Health Á a strategy for resilience in a changing arctic | 31.394139 |
| Regulation of Immunogen Processing: Signal Sequences and Their Application for the New Generation of DNA-Vaccines | 29.439184 |
| Estimating the Economic Impact of Climate Change on Cardiovascular Diseases-Evidence from Taiwan | 26.905581 |
| The rise and fall of infectious disease in a warmer world [version 1; referees: 2 approved] | 22.660624 |
| Changes in temperature alter the potential outcomes of virus host shifts | 21.63101 |


The query is: coronavirus immunity


| Title | Score |
| --- | --- |
| Curating the innate immunity interactome | 6.078533 |
| Prolonging herd immunity to cholera via vaccination: Accounting for human mobility and waning vaccine effects | 3.944526 |
| innate immunity to Respiratory infection in early Life | 3.021976 |
| Impact of Preexisting Adenovirus Vector Immunity on Immunogenicity and Protection Conferred with an Adenovirus-Based H5N1 Influenza Vaccine | 2.957018 |
| Nasal Delivery of an Adenovirus-Based Vaccine Bypasses Pre-Existing Immunity to the Vaccine Carrier and Improves the Immune Response in Mice | 2.946935 |
| Egyptian rousette bats maintain long-term protective immunity against Marburg virus infection despite diminished antibody levels OPEN | 2.912944 |
| Pre-existing immunity against vaccine vectors - friend or foe? | 2.892225 |
| cells Interplay between Intrinsic and Innate Immunity during HIV Infection | 2.447426 |
| Cryo-EM structure of infectious bronchitis coronavirus spike protein reveals structural and functional evolution of coronavirus spike proteins | 2.338074 |
| Complete genome analysis of canine respiratory coronavirus | 2.288797 |
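
The rankings above come from `query_bm25` in `src.covid_19_BM25`, whose implementation is not shown. For readers unfamiliar with the scoring, here is a minimal sketch of Okapi BM25 under common assumptions: `k1=1.5`, `b=0.75`, and the smoothed IDF `log((N - df + 0.5) / (df + 0.5) + 1)`. `bm25_scores` is a hypothetical name, and the documents are toy token lists, not the dataset.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in counts:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


demo_docs = [
    ['bat', 'origin', 'coronavirus'],
    ['weather', 'season', 'pathogen'],
    ['coronavirus', 'immunity'],
]
scores = bm25_scores(['coronavirus', 'origin'], demo_docs)
print(scores)
```

The first document matches both query terms and ranks highest, the third matches only `coronavirus`, and the second matches nothing and scores zero.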
