# COVID-19 Open Research Dataset Challenge (CORD-19)
![](https://altaonline.typepad.com/.a/6a0192ac343706970d025d9b3673bb200c-800wi)

<h2>Goal</h2><br>
    This is my third response to the call to action asking artificial intelligence experts (if I can be called one) to develop text and data mining tools that can help the medical community answer high-priority scientific questions. For that I will use the CORD-19 dataset, which represents the most extensive machine-readable coronavirus literature collection available for data mining to date. Below are the current tasks for this challenge. I will address them by creating a cluster graph, a text classification for each cluster, and a word cloud of the main keywords of each cluster, all based on the all_sources_metadata file. There are around 29,500 papers in the dataset, all listed in the all_sources_metadata file. Some of the papers in the metadata are also available as JSON files. The eventual goal is to connect the metadata with the JSON data.<br>
    <h2>Tasks</h2>
    <ul>
    <li>What is known about transmission, incubation, and environmental stability?</li>
    <li>What do we know about COVID-19 risk factors?</li>
    <li>What do we know about virus genetics, origin, and evolution?</li>
    <li>Sample task with sample submission</li>
    <li>What do we know about vaccines and therapeutics?</li>
    <li>What do we know about non-pharmaceutical interventions?</li>
    <li>What has been published about ethical and social science considerations?</li>
    <li>What do we know about diagnostics and surveillance?</li>
    <li>What has been published about medical care?</li>
    <li>What has been published about information sharing and inter-sectoral collaboration?</li>
    </ul>
    <h2>Citations, ups and downs</h2><br>
    I tried to do this notebook by myself, which was terrifying. Still, I was looking forward to testing different NLP solutions (check my other ones <a href='https://www.kaggle.com/beatrizyumi/covid-19-autocomplete-search-bar'>here</a>  and here). I opted for LDA because it is easy to implement, has a visual output that is also easy to produce, and makes it possible to retrieve the main keywords and documents of each topic. On the plus side, you know exactly what each cluster is about and you can see its main keywords visually. On the down side, I had zero experience with this sort of modeling, which made the whole thing take even longer, and depending on the size of the dataframe the model can take a while to run (mine took a few hours). Besides that, I am still working on a way to split authors when they are separated by a comma or by a semicolon.<br>
    <h2>Features of this notebook</h2>
<ol><li>Viewing the papers in the metadata csv as a dataframe</li>
    <li>Viewing the papers separated by clusters in a graph</li>
    <li>Viewing the clusters listed and separated by topics</li>
    <li>Having a word cloud for each cluster with its main keywords</li></ol>
    <h2>Turn your internet on!</h2><br>
    For this notebook to work your internet must be on.

# Importing libraries and datasets
![](https://media1.giphy.com/media/2yqYwtl5MUa6V405Ib/giphy.gif?cid=790b7611312b8226ba98603f316f4a2740c6e4a11ef228fc&rid=giphy.gif)

In [1]:
# Importing libraries

import sys
import os
from pathlib import Path
import re, numpy as np, pandas as pd
from pprint import pprint
import json
import glob

# Gensim
import gensim, spacy, logging, warnings
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# NLTK Stop words
import nltk
from nltk.corpus import stopwords

#plotting tools
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS

In [2]:
#loading metadata file

metadata_path = 'metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

FileNotFoundError: [Errno 2] File metadata.csv does not exist: 'metadata.csv'

In [None]:
# importing all json files

all_json = glob.glob('**/*.json', recursive=True)
len(all_json)

# Treating the data
![](https://media2.giphy.com/media/l41YtBXZvSRdgqq7m/giphy.gif?cid=790b76115c16e4f50dc6c6525b7b74b67a0ef53415ae01c0&rid=giphy.gif)

First we create a class that reads the JSON files into a human-readable form.

In [None]:
# File Reader class

class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'

In [None]:
# Checking if the File Reader Class worked

print(FileReader(all_json[0]))
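
To make the expected input concrete, here is a minimal sketch of the JSON layout that `FileReader` relies on. The keys `paper_id`, `abstract`, `body_text` and `text` come from the class above; the sample values and the temporary file are made up for illustration and are not part of the dataset.

```python
import json
import os
import tempfile

# Hypothetical, hand-written paper with the keys FileReader reads
sample_paper = {
    "paper_id": "abc123",
    "abstract": [{"text": "First abstract paragraph."},
                 {"text": "Second abstract paragraph."}],
    "body_text": [{"text": "Introduction text."},
                  {"text": "Methods text."}],
}

# Write it to a temporary file, then read it back the same way the class does
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample_paper, tmp)
    sample_path = tmp.name

with open(sample_path) as file:
    content = json.load(file)
os.remove(sample_path)

# Paragraphs are joined with newlines, as in FileReader.__init__
abstract_text = "\n".join(entry["text"] for entry in content["abstract"])
body_text = "\n".join(entry["text"] for entry in content["body_text"])

print(content["paper_id"])
print(abstract_text)
```

A file that lacks the `abstract` key would raise a `KeyError` in this sketch, just as it would in the class.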

In [None]:
# Function to add break every length characters

def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data
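
As a possible alternative to the helper above, the standard library's `textwrap` module can do the line breaking. This is only a sketch: the function name `get_breaks_textwrap` and the sample title are hypothetical, and unlike `get_breaks` it counts spaces toward the line length.

```python
import textwrap

def get_breaks_textwrap(content, length):
    """Hypothetical variant: break text into lines of at most `length` characters."""
    lines = textwrap.wrap(content, width=length)
    return "<br>".join(lines)

# Made-up title, used only to show the effect
title = "A fairly long paper title that needs to be broken into several shorter lines"
wrapped = get_breaks_textwrap(title, 20)
print(wrapped)
```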

We then load all the JSON files into a DataFrame. (This might take a while, since there are a lot of files.) Several things happen here.<br><br>
<b>First</b>, we only work with papers that have metadata.<br><br>
<b>Second</b>, if no abstract is provided, we use the title of the research paper in its place (stored in the abstract_summary column), since the analysis is based on the abstract.<br><br>
<b>Third</b>, all the other information is stored in its own field, with multiple values separated, as in the case of authors, for example.


In [None]:
# Input the research papers into a DataFrame

dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
progress_step = max(1, len(all_json) // 10)  # avoids a modulo by zero when there are fewer than 10 files
for idx, entry in enumerate(all_json):
    if idx % progress_step == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided, we input the title
        dict_['abstract_summary'].append(meta_data['title'].values[0])
    else:
        dict_['abstract_summary'].append(content.abstract)
        
    # authors are separated by ';' in the metadata; join them with '. ' for display
    # (str() turns a missing value into the string 'nan')
    authors = str(meta_data['authors'].values[0]).split(';')
    dict_['authors'].append(". ".join(authors))
    
    # add the title information, add breaks when needed
    dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()
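
One possible way to handle author strings that use either semicolons or commas as separators is sketched below. The function name `split_authors` and the sample strings are hypothetical. The sketch assumes that when a semicolon is present it separates authors and any commas belong to "Last, First" names, so a single author written as "Last, First" with no semicolon would be split incorrectly.

```python
def split_authors(raw):
    """Split an author string on semicolons, falling back to commas."""
    # Missing values (for example NaN from pandas) are not strings
    if not isinstance(raw, str) or not raw.strip():
        return []
    # Assumption: a semicolon, when present, is the separator between authors
    separator = ";" if ";" in raw else ","
    return [name.strip() for name in raw.split(separator) if name.strip()]

print(split_authors("Smith, John; Doe, Jane"))
print(split_authors("John Smith, Jane Doe"))
```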

In [None]:
dict_ = None

In [None]:
# Adding word count column

df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))
df_covid.head()

There may be papers that were entered from more than one source, so we should check for and remove duplicates.

In [None]:
# Shape before removing the duplicated papers
df_covid.shape

In [None]:
# Removing duplicated papers

duplicate_paper = ~(df_covid.title.isnull() | df_covid.abstract.isnull()) & (df_covid.duplicated(subset=['title', 'abstract']))
df_covid = df_covid[~duplicate_paper].reset_index(drop=True)
df_covid.shape

We will now start cleaning the text inside the DataFrame. First we need to define a few helper functions.

In [None]:
# Creating a function that cleans text of special characters

def strip_characters(text):
    t = re.sub(r'\(|\)|:|,|;|\.|’|”|“|\?|%|>|<', '', text)
    t = re.sub('/', ' ', t)
    t = t.replace("'",'')
    return t

In [None]:
# Creating a function that makes text lowercase and uses the function created above

def clean(text):
    t = text.lower()
    t = strip_characters(t)
    return t

In [None]:
# Tokenize into individual tokens - words mostly

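def tokenize(text):
    words = nltk.word_tokenize(text)
    # set() removes duplicate tokens; note that it also discards word order
    return list(set([word for word in words 
                     if len(word) > 1
                     and not (word.isnumeric() and len(word) != 4)
                     # this last condition drops every purely numeric token, including 4-digit ones
                     and (not word.isnumeric() or word.isalpha())] )
               )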

In [None]:
# Creating a function that cleans and tokenizes texts

def preprocess(text):
    t = clean(text)
    tokens = tokenize(t)
    return tokens

We will now tokenize the abstract and transform the output into a list.

In [None]:
# Preprocessing all the strings inside the column abstract. It will clean them, make them lower case and tokenize them.

df_covid['abstract_processed'] = df_covid['abstract'].apply(lambda x: preprocess(x))

In [None]:
# Transforming abstract_processed into a list

abstract = list(df_covid['abstract_processed'])
print(abstract[:1])
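
To see what the preprocessing does to a single sentence, here is a small self-contained sketch. It mirrors the steps above but makes two assumptions: whitespace splitting stands in for `nltk.word_tokenize`, and duplicates are removed with `dict.fromkeys` so that word order is kept. The function names and the sample sentence are made up for illustration.

```python
import re

def strip_characters_sketch(text):
    # Same characters as strip_characters, with the patterns written as raw strings
    t = re.sub(r"\(|\)|:|,|;|\.|’|”|“|\?|%|>|<", "", text)
    t = re.sub(r"/", " ", t)
    return t.replace("'", "")

def tokenize_sketch(text):
    # Assumption: whitespace splitting stands in for nltk.word_tokenize
    words = text.split()
    # Keep tokens longer than one character that are not purely numeric
    kept = [w for w in words if len(w) > 1 and not w.isnumeric()]
    # dict.fromkeys removes duplicates but keeps first-seen order
    return list(dict.fromkeys(kept))

sample = "The virus (SARS-CoV-2) spreads; the virus mutates in 2020."
tokens = tokenize_sketch(strip_characters_sketch(sample.lower()))
print(tokens)
```

Keeping the order matters if the tokens are later passed to a phrase model, since bigrams are detected from adjacent words.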

We will now create bigram and trigram models and lemmatize.

In [None]:
# Build the bigram and trigram models

bigram = gensim.models.Phrases(abstract, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)

trigram = gensim.models.Phrases(bigram[abstract], threshold=100)  
trigram_mod = gensim.models.phrases.Phraser(trigram)
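
The `min_count` and `threshold` arguments are easier to read with the scoring formula in mind. The sketch below reproduces, in plain Python, the default score described in the gensim documentation for `Phrases` (the `original_scorer` function); a word pair is promoted to a phrase when its score exceeds `threshold`. The counts used here are hypothetical and chosen only to illustrate the arithmetic.

```python
def phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Default Phrases score: (bigram_count - min_count) / count_a / count_b * vocab size."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# Hypothetical counts, for illustration only
score = phrase_score(worda_count=50, wordb_count=40, bigram_count=30,
                     len_vocab=10000, min_count=5)
print(score)
print(score > 100)  # compared against threshold=100, as used above
```

A higher `threshold` therefore yields fewer, more strongly associated phrases.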

In [None]:
# Creating a list of stopwords in english

stop_words = list(set(stopwords.words('english')))
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come'])

In [None]:
# Creating a function that will remove stopwords, form bigrams, form trigrams and lemmatize

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]    
    return texts_out

In [None]:
# Processing the abstract

abstract_processed = process_words(abstract)

# It is time for some modeling
![](https://media0.giphy.com/media/Mvm1XBC8O48EM/giphy.gif?cid=790b76119bdd3e0bcedc198f7b26c76abecbb392aba651cf&rid=giphy.gif)

But first, let's create a dictionary and a corpus.

In [None]:
# Creating the dictionary

id2word = corpora.Dictionary(abstract_processed)

In [None]:
# Create Corpus: Term Document Frequency

corpus = [id2word.doc2bow(text) for text in abstract_processed]
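
For readers unfamiliar with the bag-of-words format, the following plain-Python sketch shows the shape of the data that the dictionary and corpus hold: each token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The documents are made up, and the ids assigned by gensim's `Dictionary` may differ from the ones produced here.

```python
from collections import Counter

# Made-up tokenized documents
docs = [["virus", "cell", "virus"], ["cell", "protein"]]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # (token_id, count) pairs sorted by id, similar in shape to doc2bow output
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```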

In [None]:
from gensim.models.ldamulticore import LdaMulticore

In [None]:
lda = LdaMulticore(corpus, id2word=id2word, num_topics=26)

In [None]:
# Defining the best number of topics for the LDA model

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # train one model per candidate number of topics
        model = LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=abstract_processed, start=2, limit=100, step=6)

In [None]:
# Show graph
limit=100; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
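
Since the models are stored in a list, it helps to see which number of topics each list position stands for. The sketch below rebuilds the grid from the same `start`, `limit` and `step` values used above. The helper `pick_best` is hypothetical and simply returns the grid value with the highest score.

```python
start, limit, step = 2, 100, 6  # same values as in the call above
topic_grid = list(range(start, limit, step))

print(len(topic_grid))        # 17 candidate values
print(topic_grid[:6])         # [2, 8, 14, 20, 26, 32]
print(topic_grid.index(20))   # 3

def pick_best(grid, scores):
    """Hypothetical helper: grid value whose coherence score is highest."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return grid[best_index]
```

Index 3 in `model_list` thus corresponds to the grid value 20.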

The coherence value was highest at 26 topics before flattening out; the model below is built with 20 topics.

In [None]:
# Build LDA model

lda_model = LdaMulticore(corpus=corpus,
                        id2word=id2word,
                        num_topics=20, 
                        random_state=42,
                        chunksize=150,
                        passes=30,
                        iterations = 500,
                        per_word_topics=True)

In [None]:
# Compute Perplexity: log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound), and lower perplexity is better

print('\nPerplexity: ', lda_model.log_perplexity(corpus))
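
The value printed above is a per-word likelihood bound rather than the perplexity itself; according to the gensim documentation, the perplexity is `2 ** (-bound)`. A minimal sketch of that conversion follows, using hypothetical bound values rather than results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into perplexity: 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

# Hypothetical bound values, for illustration only
print(perplexity_from_bound(-8.0))   # 256.0
print(perplexity_from_bound(-7.0))   # 128.0
```

A bound closer to zero therefore means a lower perplexity.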

In [None]:
# Compute Coherence Score

coherence_model_lda = CoherenceModel(model=lda_model, texts=abstract_processed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Visualize the topics

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

In [None]:
# Select the model and print the topics

optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

In [None]:
# Find the dominant topic in each document

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=abstract_processed):
    # Collect one row per document, then build the DataFrame once
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # With per_word_topics=True the model returns a tuple whose first element is the topic distribution
        if ldamodel.per_word_topics:
            doc_topics = doc_topics[0]
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

In [None]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=abstract_processed)
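
The core of the function above is picking, for each document, the topic with the highest probability. A minimal sketch of that step, using a made-up topic distribution in the `(topic_id, probability)` format:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up distribution for a single document
example_doc_topics = [(0, 0.10), (3, 0.65), (7, 0.25)]
topic_id, prob = dominant_topic(example_doc_topics)
print(topic_id, prob)   # 3 0.65
```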

In [None]:
# Format

df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

In [None]:
# Show

df_dominant_topic.head(10)

In [None]:
# Keep the most representative document for each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

In [None]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

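# Topic Number and Keywords: one row per topic, indexed by topic number so it aligns with the counts
topic_num_keywords = (df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
                      .drop_duplicates()
                      .set_index('Dominant_Topic', drop=False))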

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics

In [None]:
# Creating word clouds for each topic

cols = [color for name, color in mcolors.XKCD_COLORS.items()]

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    # draw the word cloud of topic i directly on its own axes
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16))
    ax.axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
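
The figure above uses a 2x2 grid, so it shows four topics. To draw one word cloud per topic, the grid could be sized from the number of topics. The sketch below only computes the grid shape; `grid_shape` and its default of four columns are hypothetical choices. Note also that `show_topics` returns 10 topics by default, so `num_topics` would need to be passed explicitly to retrieve more.

```python
import math

def grid_shape(n_topics, n_cols=4):
    """Rows and columns needed to give every topic its own subplot."""
    n_rows = math.ceil(n_topics / n_cols)
    return n_rows, n_cols

print(grid_shape(20))   # (5, 4)
print(grid_shape(10))   # (3, 4)
```

When the grid has more cells than topics, the loop over the axes would need to stop at the last topic and switch off the unused axes.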