## Topic Modelling With LDA

In this notebook, we try to find related articles in the [COVID-19 Open Research Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) using topic modeling. I use [LDA (Latent Dirichlet Allocation)](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158). In short, LDA is widely regarded as one of the most powerful topic modeling methods. Its main purpose is to find the topics a document belongs to, based on the words in it.
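To make the idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling on four made-up documents, using only the standard library. The function name `toy_lda`, the hyperparameter values and the toy documents are all assumptions chosen for illustration; this is not the Gensim implementation and is far too slow for a real corpus.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (illustration only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # total tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic j is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic total + V * beta).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


# Made-up documents: two about infections, two about topic models.
docs = [
    "virus infection lung virus".split(),
    "lung infection virus patient".split(),
    "model topic word model".split(),
    "topic word document model".split(),
]
theta, topic_word = toy_lda(docs)
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one document's estimated mixture over the two topics. Documents that share vocabulary tend to end up with similar mixtures, which is the property used later to find related articles.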

### Exploratory Data Analysis

As a first step, we start with some exploratory data analysis.

In [1]:
#Import necessary libraries

import os
import re
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk import word_tokenize
from nltk.corpus import stopwords

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_lg

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

from tqdm import tqdm_notebook as tqdm
from pprint import pprint

%matplotlib inline

pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

The cell above imports three widely used NLP libraries: NLTK, spaCy and Gensim. For their differences, pros and cons, see this [comparison](https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb).

In [15]:
# Load the metadata file and look at the first rows
main_path = '../input/CORD-19-research-challenge/'
meta_path = main_path+'metadata.csv'
meta_df = pd.read_csv(meta_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})

meta_df.head()



Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz Universit...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 4...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b02786f8929fd9c900897fb.json,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35282/,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in lung disease?,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract are commonly associated with elevated production ...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cross, Carroll E",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737eb0a2f63f2dce2e5a7d.json,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC59543/,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in the innate response to inhaled microorganisms and or...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa72528f2eeaae1d58927.json,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC59549/,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been im...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a376498df9bf41f7123605.json,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC59574/,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in response to pneumovirus infection,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumonia virus of mice (PVM) are viruses of the family Pa...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Rosenberg, Helene F",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f56016f4981ae2ca6031b32.json,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC59580/,


In [16]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 659502 entries, 0 to 659501
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cord_uid          659502 non-null  object 
 1   sha               208625 non-null  object 
 2   source_x          659502 non-null  object 
 3   title             659194 non-null  object 
 4   doi               333703 non-null  object 
 5   pmcid             217243 non-null  object 
 6   pubmed_id         288016 non-null  object 
 7   license           659502 non-null  object 
 8   abstract          483440 non-null  object 
 9   publish_time      659276 non-null  object 
 10  authors           643457 non-null  object 
 11  journal           617775 non-null  object 
 12  mag_id            0 non-null       float64
 13  who_covidence_id  304537 non-null  object 
 14  arxiv_id          8239 non-null    object 
 15  pdf_json_files    208625 non-null  object 
 16  pmc_json_files    16

In [17]:
# List the PDF JSON file names and count them

all_files = os.listdir(f'{main_path}/document_parses/pdf_json/')
pprint(all_files[:5])
print(len(all_files))

['8187ea360c53a56ca2c579d758a5d6aa67716836.json',
 'a0d063dca746b135afe0451ce0b3bb1e06cf15ae.json',
 'edb294108440787c9f074483fd3c953a83e53622.json',
 'ee5af71875f2e77135974c75980ce22fff03e4f8.json',
 'a0bc6bc5b8547b98a2d77b81ca81cb18fa1b7ee9.json']
221846


In [18]:
# Build the list of full paths to the PDF JSON files

pdf_files_path = f'{main_path}/document_parses/pdf_json/'
all_json = [pdf_files_path + jf for jf in all_files]
all_json[:5]

['../input/CORD-19-research-challenge//document_parses/pdf_json/8187ea360c53a56ca2c579d758a5d6aa67716836.json',
 '../input/CORD-19-research-challenge//document_parses/pdf_json/a0d063dca746b135afe0451ce0b3bb1e06cf15ae.json',
 '../input/CORD-19-research-challenge//document_parses/pdf_json/edb294108440787c9f074483fd3c953a83e53622.json',
 '../input/CORD-19-research-challenge//document_parses/pdf_json/ee5af71875f2e77135974c75980ce22fff03e4f8.json',
 '../input/CORD-19-research-challenge//document_parses/pdf_json/a0bc6bc5b8547b98a2d77b81ca81cb18fa1b7ee9.json']
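The doubled slash in the paths above appears because `main_path` already ends with `/` and the f-string adds another one. Most operating systems resolve such paths anyway, so the listing still works. As a small sketch of an alternative, `os.path.join` adds a separator only where one is missing; the file name below is copied from the listing above and the variable names `concatenated` and `joined` are just for illustration.

```python
import os

main_path = '../input/CORD-19-research-challenge/'
file_name = '8187ea360c53a56ca2c579d758a5d6aa67716836.json'

# Plain concatenation keeps both slashes.
concatenated = f'{main_path}/document_parses/pdf_json/' + file_name

# os.path.join does not add a second separator after the trailing '/'.
joined = os.path.join(main_path, 'document_parses', 'pdf_json', file_name)

print(concatenated)
print(joined)
```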

In [13]:
# Look at a sample PDF JSON file
with open(all_json[1], 'rb') as f:
    pprint(json.load(f))

{'abstract': [{'cite_spans': [],
               'ref_spans': [],
               'section': 'Abstract',
               'text': 'Background Brazil ranks second worldwide in total '
                       'number of COVID-19 cases and deaths. Understanding the '
                       'possible socioeconomic and ethnic health inequities is '
                       'particularly important given the diverse population '
                       'and fragile political and economic situation. We aimed '
                       'to characterise the COVID-19 pandemic in Brazil and '
                       'assess variations in mortality according to region, '
                       'ethnicity, comorbidities, and symptoms.'},
              {'cite_spans': [],
               'ref_spans': [],
               'section': 'Abstract',
               'text': 'Methods We conducted a cross-sectional observational '
                       'study of COVID-19 hospital mortality using data from '
                

In [21]:
# Helper class that reads one paper's JSON file and joins its abstract and body paragraphs
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            json_data=json.load(file)
            self.paper_id=json_data['paper_id']
            self.abstract = []
            self.body_text = []
            
            for item in json_data['abstract']:
                self.abstract.append(item['text'])
            
            for item in json_data['body_text']:
                self.body_text.append(item['text'])
            
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
            
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'

first_row=FileReader(all_json[1])
pprint(first_row)        

a0d063dca746b135afe0451ce0b3bb1e06cf15ae: Background Brazil ranks second worldwide in total number of COVID-19 cases and deaths. Understanding the possible socioeconomic and ethnic health inequities is particularly important given the diverse... The COVID-19 pandemic has created an unprecedented worldwide strain on health care. Although early reports from east Asia and Europe meant that Brazil was well positioned to implement non-pharmaceutic...
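`FileReader` indexes `json_data['abstract']` and `json_data['body_text']` directly, so a file that lacks either key raises a `KeyError`. Whether such files occur in this dataset is an assumption here, not something the output above shows. The sketch below uses a hypothetical helper, `read_paper`, that falls back to an empty string instead, and tries it on a made-up miniature file with the same layout as the sample printed earlier.

```python
import json
import os
import tempfile


def read_paper(file_path):
    """Return (paper_id, abstract, body_text); a missing section becomes ''."""
    with open(file_path) as f:
        data = json.load(f)
    abstract = '\n'.join(item['text'] for item in data.get('abstract', []))
    body_text = '\n'.join(item['text'] for item in data.get('body_text', []))
    return data['paper_id'], abstract, body_text


# Made-up paper without an 'abstract' key, written to a temporary file.
sample = {
    'paper_id': 'demo-paper',
    'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo-paper.json')
    with open(path, 'w') as f:
        json.dump(sample, f)
    paper_id, abstract, body_text = read_paper(path)

print(paper_id, repr(abstract), repr(body_text))
```

Returning plain strings rather than an object keeps the sketch short; the same `.get(..., [])` fallback could be used inside `FileReader` if keeping the class is preferred.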


In [22]:
# Check whether each JSON file is valid and keep only those with body text

from tqdm import tqdm
all_json_clean = list()
for idx, entry in tqdm(enumerate(all_json), total=len(all_json)):
    
    try:
        content = FileReader(entry)
    except Exception as e:
        continue  # invalid paper format, skip
    
    if len(content.body_text) == 0:
        continue
    
    all_json_clean.append(all_json[idx])
    
all_json = all_json_clean
len(all_json)

  1%|          | 1565/221846 [00:13<30:58, 118.56it/s]


KeyboardInterrupt: 
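The loop above catches every `Exception`, which can also hide unrelated bugs. A hedged alternative is to put the check in a small function that catches only the errors expected from unreadable or malformed files. The helper name `has_body_text` and the three demo files are made up for this sketch. Unlike the loop above, it does not require `paper_id` or `abstract` to be present, and it treats a paper as valid when at least one body paragraph is non-blank.

```python
import json
import os
import tempfile


def has_body_text(file_path):
    """True if the file parses as JSON and has at least one non-blank body paragraph."""
    try:
        with open(file_path) as f:
            data = json.load(f)
        return any(item['text'].strip() for item in data['body_text'])
    except (OSError, ValueError, KeyError, TypeError):
        # ValueError covers json.JSONDecodeError and decoding errors.
        return False


# Three made-up files: one valid, one with an empty body, one with broken JSON.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        'good.json': json.dumps({'body_text': [{'text': 'Some text.'}]}),
        'empty.json': json.dumps({'body_text': []}),
        'broken.json': '{not valid json',
    }
    paths = []
    for name, text in contents.items():
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(text)
        paths.append(path)
    kept = [os.path.basename(p) for p in paths if has_body_text(p)]

print(kept)
```

With a helper like this, the filtering step could be written as a single list comprehension over `all_json`, wrapped in `tqdm` to keep the progress bar.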

In [None]:
first_row.paper_id