# COVID-19 Open Research Dataset Challenge (CORD-19) - Task 1

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?fbclid=IwAR1kPwO-OjU-MGEYnaaDAIHVfJ5lyhpn3V79AnL20bE1IOb16pmEg5H1NMM

## Dataset Description

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

## Introduction

We are issuing a call to action to the world's artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions. The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide. There is a growing urgency for these approaches because of the rapid increase in coronavirus literature, making it difficult for the medical community to keep up.

A list of our initial key questions can be found under the Tasks section () of this dataset. These key scientific questions are drawn from the NASEM’s SCIED (National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats) research topics and the World Health Organization’s R&D Blueprint for COVID-19.

Many of these questions are suitable for text mining, and we encourage researchers to develop text mining tools to provide insights on these questions.

### Task 1

**Task Details** What is known about transmission, incubation, and environmental stability? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?*

Specifically, we want to know what the literature reports about:

1. Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
2. Prevalence of asymptomatic shedding and transmission (e.g., particularly children).
3. Seasonality of transmission.
4. Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
5. Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
6. Persistence of virus on surfaces of different materials (e,g., copper, stainless steel, plastic).
7. Natural history of the virus and shedding of it from an infected person
8. Implementation of diagnostics and products to improve clinical processes
9. Disease models, including animal models for infection, disease and transmission
10. Tools and studies to monitor phenotypic change and potential adaptation of the virus
11. Immune response and immunity
12. Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings
13. Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings
14. Role of the environment in transmission

## Abstract

In order to respond to this Call to Action, and provide answers to these questions, i implemented an information retrieval engine, based on both Latent Dirichlet Allocation model (Blei et Al, 2003, for topics modelling, and Query likelihood maximum model for information retrieval. This model is called LDA-Based Document Models, and was theorised and experimented in 2006 by Xing Wei and W. Bruce Croft from the Computer Science Department of University of Massachusetts Amherst.
The tool is basically a simple search engine, it was designed to allow the user to query the given dataset to find the most relevants research papers.

Different researches in the machine learning litterature showed that documents clustering can improve retrieval effectiveness in the language modeling framework (Hoffman, 1999, probabilistic Latent Semantic Indexing (pLSI), Liu and Croft,2004, Cluster-Based Model). The experiments carried out by Liu and Croft showed that lda-based model (LBDM) outperforms both query likelihood model and cluster-based retrieval model (CBDM).



<figure>
<img src="./img/results.png" width="500" height="335"/>
<figcaption>Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with the LDA- based document models (LBDM). The evaluation measure is average precision. AP data set. Stars indicate statistically significant differences in performance with a 95% confidence according to the Wilcoxon test.
Table taken from LDA-Based Document Models for Ad-hoc Retrieval</figcaption>
</figure>

Unfortunatally, the given dataset does not include a query sets with its relevants documents to appreciate correctly the effectiveness of my model measuring the precision and the recall. However, i could still have an idea of the effectivness of the LDA-based model, directly by looking at the content of retrieves documents from any query. Furthermore, due te its limited capacities, my laptop was not able to processed the whole dataset.

Finally, the lda-based model has proved to be very efficient for quickly retrieving relevant informations from the dataset.

## Possible Future Work

LDA-based model is a great tool for retrieving most relevant articles, but it do not directly answer to the question. For that we have to go deeper in the text mining, in order to extract insight from the most relevant articles retrieved by the algorithm. in that respect, our LDA-based could be a good starting point.

To get the model more handly, it could be possible to develop a graphic interface.

## Ressources

Here is all the ressources that help me implement the model
- Xing Wei and W. Bruce Croft, "LDA-Based Document Models for Ad-hoc Retrieval", 2006
- 2009 Cambridge University, 8 Evaluation in information retrieval p. 151-175 link : https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
- Youtube, PyData, Bhargav Srinivasa Desikan - Topic Modelling (and more) with NLP framework Gensim. Link : https://www.youtube.com/watch?v=ZkAFJwi-G98&list=PLdgx9krQrE0Ng5c80ko5TEZESGtQ9qGM3&index=6&t=214s
- GenSim documentation : https://radimrehurek.com/gensim/auto_examples/index.html

## Code

In [204]:
# Pandas
import pandas as pd

# Numpy
import numpy as np
from numpy import load
from numpy import save

# Gensim library
from gensim.test.utils import datapath
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.matutils import corpus2csc

import gc

In [205]:

#gc.collect()
#reset_selective -f test_mat

359

In [None]:
#!pip install gensim
#!pip install google-compute-engine
#!pip3 uninstall boto
#!pip install --upgrade gensim smart_open

### Data collecting

Due to the limited ressources of my laptop, i could not handle the whole dataset, so i selected only the best papers, those that fulfill the condition has_full_text=True, and i removed papers without any title and/or no abstracts. 

In [None]:
meta = pd.read_csv('./CORD-19-research-challenge/metadata.csv')

In [None]:
meta.head()

In [None]:
meta_full_text = meta[meta.has_full_text == True]
#meta_full_text.head()

In [None]:
#body-text or abstracts
def get_content(tag_content):
    content = ''
    c = 1
    for el in tag_content:
        content += el['text'] + ' '
        c += 1
        
    return content
#TO - DO
#def get_ref(bibref_content):
#    bib_entries

In [None]:
import json
publish_time = []
paper_id = []
titles = []
abstracts = []
bodies = []
authors_name = []
for sha in meta_full_text['sha']:
    #because sha can contain more than 1 sha, so we split the sha into array 
    for sha2 in [x.strip() for x in sha.split(sep=';')]:
        file = './papers/' + sha2 + '.json'
        with open(file) as f:
            json_data = json.load(f)
        paper_id.append(sha2)
        titles.append(meta_full_text[meta_full_text.sha == sha]['title'].iloc[0])
        publish_time.append(meta_full_text[meta_full_text.sha == sha]['publish_time'].iloc[0])
        abstracts.append(get_content(json_data['abstract']))
        bodies.append(get_content(json_data['body_text']))


d = {
    'paper_id':paper_id,
    'title':titles,
    'abstract': abstracts,
    'body' : bodies}


docs = pd.DataFrame(d)


docs.head()

In [None]:
df_docs = pd.read_csv("dataset.csv")
print(df_docs.shape)
df_docs.head()

### Data pre-processing

In text pre-processing, we exclude punctuation and stop words (such as, "if", "the", or "on", which contain little topical content), stem and lemmatize all text. 

#### Reshape the dataset

First lets Merge all three columns (title abstract and body) into one column content_word, after dropping Nan Value

In [None]:
df_papers = pd.read_csv('dataset.csv')
#drop NaN value
print('shape : ' + str(df_papers.shape))
df_papers_dropna = df_papers.dropna()
print('shape (after drop nan) : ' + str(df_papers_dropna.shape))
df_papers_dropna.head()

In [None]:
df_papers_dropna['content_word'] = df_papers_dropna[df_papers_dropna.columns[1:]].apply(
    lambda x: ' '.join(x.astype(str)),
    axis=1
)
df_papers_merged = df_papers_dropna[['paper_id','content_word']]
df_papers_merged.shape
df_papers_merged.head()

#### Stemming Lemmatization and Tokenizing using nlkt library

In [4]:
# Initialize a list of stop words wish does not figure in the original NLKT stopwords list
stop_words_list = ['%', '0', '1', '2', '3', '4', '5', '<', '=', '>', 'also', 'although',
                   'and/or', 'b', 'c', 'conclude', 'conclusion', 'could', 'either', 'first',
                   'follow', 'followed', 'following', 'furthermore', 'h', 'however', 'ii', 'last',
                   'later', 'less', 'make', 'may', 'might', 'moreover', 'much', 'need', 'new',
                   'often', 'one', 'rather', 'result', 'result', 'results', 'see', 'seen', 'set',
                   'still', 'studied', 'studies10', 'study', 'therefore', 'thus', 'together', 'two',
                   'us', 'use', 'use', 'way', 'well', 'whereas', 'whether', 'within', 'without', 
                   'would', 'year', 'yet']

In [5]:
#obtain the resource :
#nltk.download()
import re
#!pip install --user -U nltk
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
#from nltk.stem import PorterStemmer
#from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
snowball = SnowballStemmer("english",ignore_stopwords = True)
def stemSentence(sentence):
    sentence = re.sub("[?:!|.`',;()*-+]", "", sentence)
    sentence = sentence.replace("[","")
    sentence = sentence.replace("]","")
    sentence = sentence.lower()
    stop_words = set(stopwords.words("english"))
    stop_words.update(stop_words_list)
    token_words=word_tokenize(sentence)
    stem_sentence=[]
    for word in token_words:
        if word not in stop_words:
            stem_sentence.append(snowball.stem(word))
    return stem_sentence

In [6]:
#stemDataSet
#parameter : two columns dataframe : paper_id and content words
#return a stemmed and lemmatized matrix, in a numpy
def stemDataSet(original_dataframe):
    stemmed_words = []
    for s_brut in original_dataframe['content_word']:
        stemmed_words.append(stemSentence(s_brut))
    return np.array([original_dataframe['paper_id'],stemmed_words]).transpose()

Proceed to the Stemming and tokenization of the dataset

In [None]:
dataset_processed_numpy = stemDataSet(df_papers_merged)

In [None]:
# Save the newly created processed dataset.
save('dataset_processed_2.npy',dataset_processed_numpy)

### Modeling

####  Latent Dirichlet Allocation using GENSIM framework

In [7]:
# load numpy array from npy file
papers_matrix = load('dataset_processed.npy',allow_pickle=True )
papers_matrix.shape

(21260, 2)

##### Mapping function

In [8]:
# return the original paper's identifiant from id in the corpus
def id2paper(doc_id):
    return papers_matrix[doc_id][0]

# return id in the corpus from the original identifiant
def paper2id(paper_id):
    return np.argwhere(papers_matrix[:,0] ==paper_id)[0][0]

##### Converting text to bag of words

In [9]:
# Dictionary encapsulates the mapping between normalized words and their integer ids.
dictionary = Dictionary(papers_matrix[:,1])

In [10]:
bow_corpus = [dictionary.doc2bow(doc) for doc in papers_matrix[:,1]]

##### Training LDA

According to https://radimrehurek.com/gensim/models/ldamulticore.html,
number of workers should be one less than the number of core, we have a dual-core, so we set the parameter workers to 1.


According to the experiment of Wei Bruce Croft experiment : After 50 iterations, performance is quite stable

We use symmetric Dirichlet priors in the LDA estimation with α = 50 / K and η=0.01, which are common settings in the
literature

In [237]:
#initialize number of topics k = [0,k-1]
lda_model.clear()
k = 200
alpha = 50/k
eta = 0.01
lda_model =  LdaMulticore(
                        corpus=bow_corpus, 
                        num_topics = k,
                        iterations = 50,
                        alpha=alpha,
                        eta=eta,
                        id2word = dictionary,                                    
                        workers = 1)

In [None]:
# Save model to disk.
temp_file = datapath("lda_model_k250")
lda_model.save(temp_file)

In [11]:
# Load a potentially pretrained model from disk.
temp_file = datapath("lda_model_k100")
lda_model = LdaMulticore.load(temp_file)

##### Initialize important matrices

From our LDA model we construct the document θ theta and φ phi matrix, respectively the documents distribution over topics and the topics distribution over words

In [None]:
# Phi - φ
# get_topics : Get the term-topic matrix learned during inference.
word_topic_matrix = lda_model.get_topics()

# Theta - θ
# get_document_topics() : Get the topic distribution for the given document.
documents_topics = lda_model.get_document_topics(bow_corpus, minimum_probability=0)
all_topics_csr = corpus2csc(documents_topics)
doc_topics_matrix = all_topics_csr.T.toarray()

# Save numpy matrix to disk
save('word_topic_matrix_k300.npy',word_topic_matrix)
save('doc_topics_matrix_k300.npy',doc_topics_matrix)

In [12]:
# load numpy matrix θ and φ from npy file
word_topic_matrix = load('word_topic_matrix.npy',allow_pickle=True )
doc_topics_matrix   = load('doc_topics_matrix.npy',allow_pickle=True )

In [209]:
# After we get the posterior estimates of θ and φ , 
# we can compute the probability of a word in a document as following,
def proba_of_word_in_doc_lda(word_id,doc_id):
    return np.sum(word_topic_matrix[:,word_id]*doc_topics_matrix[doc_id,:])

function proba_of_word_in_doc_lda above correspond to the following posterior probability :
$P_{lda}( w | d,\hat{\theta},\hat{\phi}) = \sum_{z=1}^{K}P(w|z,\hat{\phi})P(z|\hat{\theta},d)$

#### Query likelihood model for IR

We get the words frequency for the ML model. A dok matrix is necessary to retrieve quikly the information

In [15]:
from scipy.sparse import dok_matrix
words_frequence_matrix = dok_matrix((papers_matrix.shape[0], len(dictionary)), dtype=int)
for i in range(papers_matrix.shape[0]):
    for tupl in bow_corpus[i]:
        words_frequence_matrix[i,tupl[0]] = tupl[1]

In [16]:
# GLOBAL
# Number of tokens in documents
NB_WORDS = 0
for doc in papers_matrix:
    NB_WORDS += len(doc[1])
NB_WORDS

62223933

In [17]:
# return the total number of token in the given document
def get_nb_token(doc_id):
    return len(papers_matrix[doc_id][1])

In [None]:
#sys.getsizeof(words_frequence_matrix)/1024/1024

In [194]:
def proba_of_word_in_doc_ml(word_id,doc_id):
    # μ (mu) smooting prior : a small value of mu means more smoothing, we use a fix value of μ = 1000,
    # since the best results are consistently obtained with this setting (according to the reported experiments
    # of Liu and Croft.
    # Nd is the number of token in the document D
    mu = 1000
    Nd = get_nb_token(doc_id)
    # we compute the maximum likelihood of P(w|Md) and P(w|Mc)
    p_w_given_md = words_frequence_matrix[doc_id,word_id]/Nd
    p_w_given_mc = dictionary.cfs[word_id]/NB_WORDS
    return (Nd/(Nd+mu))*p_w_given_md*(1 - Nd/(Nd+mu))*p_w_given_mc

proba_of_word_in_doc_ml(word_id,doc_id) correspond to the following the probability :

$P(w|D)= \frac{Nd}{Nd +μ}P_{ML} (w|D)+(1−\frac{Nd}{Nd +μ})P_{ML}(w|coll)$

where P’(w|D) is the maximum likelihood estimate of word w in the document D, and P’(w|coll) is the maximum likelihood estimate of word w in the entire collection. μ is the Dirichlet
prior

In [None]:
#import timeit
#timeit.timeit('proba_of_word_in_doc_ml(1,8)', number=20000,globals=globals())

#### Retrieval with LDA + Query likelihood

In [229]:
# this is the function that linearly combine document model and LDA
# parameter : word_id, doc_id
# return : the probability of the of the word_id given doc_id
#lm_score_words_docs = [] # contain lm score for each words
#lda_score_words_docs = [] # contain lda score for each words
def word_score_lbdm(word_id, doc_id):
    #the parameter lmbda secify the proportion of LDA in the linear combination
    lmbda = 0.9
    #lm_score_words_docs.append([doc_id,proba_of_word_in_doc_ml(word_id,doc_id)])
    #lda_score_words_docs.append([doc_id,proba_of_word_in_doc_lda(word_id,doc_id)])
    return lmbda*proba_of_word_in_doc_ml(word_id,doc_id)+(1-lmbda)*proba_of_word_in_doc_lda(word_id,doc_id)

In [None]:
timeit.timeit('word_score_lbdm(85524,4)', number=22000*4,globals=globals())

In [228]:
# The function tokenize and stem the given query, and compute the score of each word in the given document
# parameter: the query string, the document id
# return the score of the query in the document id
#lbdm_score_words_docs = [] # contain lbdm score for each words
def query_score_lbdm(query,doc_id):
    score = 1
    #lbdm_score_words = []
    for term in query:
        term_score = word_score_lbdm(term,doc_id)  
        #lbdm_score_words.append(term_score)
        score *= term_score
    #lbdm_score_words_docs.append([doc_id,lbdm_score_words])
    return score

In [None]:
timeit.timeit('query_score_lbdm(test_q,3)', number=22000,globals=globals())

In [193]:
# function that wrap up all the aboves function
# parameter : (raw) query
# return : the ranked list of the most 10 relevant paper id, with its rank score
def run_search_query(q): 
    # convert string query to a list of token id
    stem_query = [dictionary.token2id[term] for term in stemSentence(q) if dictionary.token2id.get(term)]
    docs_score_vector = []
    for i in range(papers_matrix.shape[0]):
        docs_score_vector.append([i,query_score_lbdm(stem_query,i)])
        #if(i%5000 == 0): print(i)
    docs_score_vector.sort(key=lambda x: x[1], reverse= True)

    return docs_score_vector[:20]

### Model Evaluation

To evaluatethe effectivness, we have two solutions : we judge by ourself if the document retrieved are relevant according to the input query.
Or in a more scientific way, we sample a set of query from papers title, execute all the queries in the sample and compute the number of time that the assiocated document is present in the top ten documents retrieved. With this method, we find a precision very close to the mean precision from the orignial experiment (see table at the beginning)
with the parameters :
λ = 0,7
Precision is 0,255 

In [None]:
# Create a sample from the dataset of paper we want to test
import random as rd
df_docs = pd.read_csv("dataset.csv")
sample = rd.seed(30)
sample = rd.sample(range(0, len(papers_matrix)), 200)
paper_title = []
for s in sample:
    s_ = id2paper(s)
    ti = df_docs[df_docs.paper_id == s_]['title'].iloc[0]
    paper_title.append([s_,ti])
    
%reset_selective -f df_docs
print("gc collect : " + str(gc.collect()))
print(len(paper_title))

In [None]:
list_position = []
for i in range(0,200):
    p_id = paper2id(paper_title[i][0])
    #print('paper id : ' + paper_title[i][0] + " / " + str(p_id))
    un_titre = paper_title[i][1]
    res = run_search_query(un_titre)
    pos = [doc_id[0] for doc_id in res].index(p_id)
    list_position.append(pos)
    print(str(i) + ' -- Position in the vector : ' + str(pos))
    #print('Top 3 most relevant paper : ')
    #print([id2paper(p[0]) for p in res[:3]])
print('Completed.')

In [None]:
# Print evaluation score :
from statistics import mean
top_ten = len([it for it in list_position if it < 10])/len(list_position)
print("% in the top ten : " + str(top_ten))

### parameter-selecting experiments 

### Try yourself

In [185]:
# Full term list.
menu = ["incubation period in humans COVID-19",
        "COVID-19 contagious period","COVID-19 asymptomatic shedding",
        "asymptomatic transmission COVID-19","seasonality coronavirus",
        "charge distribution COVID-19","hydrophilic surfaces COVID-19",
        "hydrophobic surfaces","environmental survival",
        "decontamination","viral shedding",
        "persistence","stability","natural history","diagnostics covid",
        "clinical process","disease model","phenotypic change",
        "immune response","secondary transmission",
        "personal protective equipment"]

#### Select a query among the list above, or type a query and affect it to the variable query and run the cell

In [230]:
query = menu[3]
results = run_search_query(query)

#### Run the cell below to see the result. Result is contained into the pandas data frame "df_res"

In [231]:
#df_docs = pd.read_csv("dataset.csv")
relevant_list = []
for p in [id2paper(x[0]) for x in results]:
    relevant_list.append([
    df_docs[df_docs.paper_id == p]['paper_id'],
    df_docs[df_docs.paper_id == p]['title'],
    df_docs[df_docs.paper_id == p]['abstract'],
    df_docs[df_docs.paper_id == p]['body']])
    
df_res = pd.DataFrame(relevant_list, columns = ['paper_id','title','abstract','body'])
for i in range(20):
    print(str(i+1) + ' : ' + df_res['title'].iloc[i].iloc[0])
    print(' --- ')

1 : Transmission potential of influenza A/H7N9, February to May 2013, China
 --- 
2 : Assessing the Impact of Reduced Travel on Exportation Dynamics of Novel Coronavirus Infection (COVID-19)
 --- 
3 : Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
 --- 
4 : Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
 --- 
5 : Time variations in the transmissibility of pandemic influenza in Prussia, Germany, from 1918–19
 --- 
6 : Effect of the One-Child Policy on Influenza Transmission in China: A Stochastic Transmission Model
 --- 
7 : The Waiting Time for Inter-Country Spread of Pandemic Influenza
 --- 
8 : Analysis of the epidemic growth of the early 2019-nCoV outbreak using internationally confirmed cases
 --- 
9 : Estimates of Outbreak Risk from New Introductions of Ebola with Immediate and Delayed Transmission Control
 --- 
10 : Estimating human-to-human transmissibility of hepatitis A virus in an outbreak at an elementary sc