# Opening, cleaning and preprocessing the data

This notebook makes the data easier to use.

The source files, in JSON format, are converted to CSV.

The text is then preprocessed (stopword removal and lemmatization) so that it is ready for downstream use, for example with TF-IDF.
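
As a rough illustration of the TF-IDF use case mentioned above, the sketch below removes stopwords and computes TF-IDF weights with the standard library only. The stopword list, the `tokenize` and `tf_idf` helpers and the toy documents are assumptions made for this example. It performs no lemmatization, which would normally come from a library such as NLTK or spaCy, and it uses the plain `tf * log(N / df)` formula rather than the smoothed variants that libraries such as scikit-learn apply.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "in", "is", "and", "to", "by"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "The coronavirus is a virus",
    "Infectious bronchitis is caused by a virus",
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document (here `virus`) gets weight zero, while terms specific to one document keep a positive weight.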

## Loading Packages

In [2]:
import numpy as np 
import pandas as pd

import glob
import json

## Loading and preparing data
We follow two Kaggle notebooks to load and prepare the data:

- [This one](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool)

- [This one](https://www.kaggle.com/danielwolffram/cord-19-create-dataframe)

In [5]:
root_path = 'CORD-19-research-challenge'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,custom_license
1,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,custom_license
2,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,custom_license
3,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,custom_license
4,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,custom_license


In [6]:
# sorted() makes the file order, and therefore all_json[0], reproducible across systems
all_json = sorted(glob.glob(f'{root_path}/**/*.json', recursive=True))
len(all_json)

29315

In [7]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])
print(first_row)

0015023cc06b5362d332b3baf348d11567ca2fbb: word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a si... VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions encode the non-structural proteins 2B and 2C and 3A, 3B (1-3) (VPg), 3C pro and 4 structura...
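
`FileReader` assumes that every file has `paper_id`, `abstract` and `body_text` keys. The sketch below shows the layout it relies on, using a made-up record, together with a hypothetical `read_paper` helper that falls back to an empty string when a section is missing. Whether any file in the dataset actually lacks one of these keys is not checked here.

```python
import json
import os
import tempfile

def read_paper(file_path):
    """Return (paper_id, abstract, body_text) from one parsed-paper JSON file."""
    with open(file_path, encoding="utf-8") as file:
        content = json.load(file)
    # Each section is a list of {"text": ...} entries; join them with newlines.
    abstract = "\n".join(e["text"] for e in content.get("abstract", []))
    body_text = "\n".join(e["text"] for e in content.get("body_text", []))
    return content["paper_id"], abstract, body_text

# Made-up record with the same layout, but no "abstract" key.
sample = {
    "paper_id": "example-id",
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file)
    paper_id, abstract, body_text = read_paper(path)

print(paper_id, repr(abstract), repr(body_text))
```

Returning an empty string for a missing section matches how `FileReader` represents a paper whose `abstract` list is empty.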


In [8]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': []}
for idx, entry in enumerate(all_json):
    if idx % max(1, len(all_json) // 10) == 0:  # max(1, ...) avoids a zero modulus with fewer than 10 files
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
papers = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text'])
papers.head()

Processing index: 0 of 29315
Processing index: 2931 of 29315
Processing index: 5862 of 29315
Processing index: 8793 of 29315
Processing index: 11724 of 29315
Processing index: 14655 of 29315
Processing index: 17586 of 29315
Processing index: 20517 of 29315
Processing index: 23448 of 29315
Processing index: 26379 of 29315
Processing index: 29310 of 29315


Unnamed: 0,paper_id,abstract,body_text
0,0015023cc06b5362d332b3baf348d11567ca2fbb,word count: 194 22 Text word count: 5168 23 24...,"VP3, and VP0 (which is further processed to VP..."
1,004f0f8bb66cf446678dc13cf2701feec4f36d76,,The 2019-nCoV epidemic has spread across China...
2,00d16927588fb04d4be0e6b269fc02f0d3c2aa7b,Infectious bronchitis (IB) causes significant ...,"Infectious bronchitis (IB), which is caused by..."
3,0139ea4ca580af99b602c6435368e7fdbefacb03,Nipah Virus (NiV) came into limelight recently...,Nipah is an infectious negative-sense single-s...
4,013d9d1cba8a54d5d3718c229b812d7cf91b6c89,Background: A novel coronavirus (2019-nCoV) em...,"In December 2019, a cluster of patients with p..."


In [9]:
df = pd.merge(papers, meta_df, left_on='paper_id', right_on='sha', how='left').drop('sha', axis=1)

## New columns

In [12]:
# some new columns for convenience
df['publish_year'] = df.publish_time.str[:4].fillna(-1).astype(int) # -1 marks a missing publish_time
df['link'] = 'http://dx.doi.org/' + df.doi
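
The slice `str[:4]` assumes that every non-missing `publish_time` starts with a four-digit year. A slightly more defensive variant, sketched here on made-up rows (all toy values are assumptions), extracts the year with a regular expression so that unexpected strings also map to -1:

```python
import pandas as pd

toy = pd.DataFrame({
    "publish_time": ["2020-01-11", "1972-12-31", None, "unknown"],
    "doi": ["10.1101/2020.01.10.901801", None, None, "10.1101/634600"],
})

# Pull a leading four-digit year; anything else becomes NaN, then -1.
year = toy["publish_time"].str.extract(r"^(\d{4})", expand=False)
toy["publish_year"] = pd.to_numeric(year, errors="coerce").fillna(-1).astype(int)

# Concatenating with a missing DOI yields a missing link, not a broken URL.
toy["link"] = "http://dx.doi.org/" + toy["doi"]

print(toy[["publish_year", "link"]])
```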

## Exploration and cleaning

In [13]:
df[df.abstract_x != df.abstract_y].shape

(27219, 19)

In [14]:
df[df.abstract_x != df.abstract_y].head()

Unnamed: 0,paper_id,abstract_x,body_text,source_x,title,doi,pmcid,pubmed_id,license,abstract_y,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,publish_year,link
0,0015023cc06b5362d332b3baf348d11567ca2fbb,word count: 194 22 Text word count: 5168 23 24...,"VP3, and VP0 (which is further processed to VP...",biorxiv,The RNA pseudoknots in foot-and-mouth disease ...,10.1101/2020.01.10.901801,,,biorxiv,The positive stranded RNA genomes of picornavi...,2020-01-11,"Ward, J. C. J.; Lasecka-Dykes, L.; Neil, C.; A...",,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.01.10.901801
1,004f0f8bb66cf446678dc13cf2701feec4f36d76,,The 2019-nCoV epidemic has spread across China...,medrxiv,Healthcare-resource-adjusted vulnerabilities t...,10.1101/2020.02.11.20022111,,,medrvix,We integrate the human movement and healthcare...,2020-02-12,Hanchu Zhou; Jianan Yang; Kaichen Tang; Qingpe...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.02.11.20022111
2,00d16927588fb04d4be0e6b269fc02f0d3c2aa7b,Infectious bronchitis (IB) causes significant ...,"Infectious bronchitis (IB), which is caused by...",biorxiv,"Real-time, MinION-based, amplicon sequencing f...",10.1101/634600,,,biorxiv,Infectious bronchitis (IB) causes significant ...,2019-05-10,"Butt, S. L.; Erwood, E. C.; Zhang, J.; Sellers...",,,,True,biorxiv_medrxiv,2019,http://dx.doi.org/10.1101/634600
3,0139ea4ca580af99b602c6435368e7fdbefacb03,Nipah Virus (NiV) came into limelight recently...,Nipah is an infectious negative-sense single-s...,biorxiv,A Combined Evidence Approach to Prioritize Nip...,10.1101/2020.03.12.977918,,,biorxiv,AbstractBackgroundNipah Virus (NiV) came into ...,2020-03-12,Nishi Kumari; Ayush Upadhyay; Kishan Kalia; Ra...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.03.12.977918
4,013d9d1cba8a54d5d3718c229b812d7cf91b6c89,Background: A novel coronavirus (2019-nCoV) em...,"In December 2019, a cluster of patients with p...",medrxiv,Assessing spread risk of Wuhan novel coronavir...,10.1101/2020.02.04.20020479,,,medrvix,Background: A novel coronavirus (2019-nCoV) em...,2020-02-05,Shengjie Lai; Isaac Bogoch; Nick Ruktanonchai;...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.02.04.20020479


In [15]:
# abstracts found in the JSON file but missing from the metadata, for papers that have a DOI link
mask = df.abstract_y.isnull() & (df.abstract_x != '') & df.link.notnull()
df.loc[mask, ['abstract_x', 'abstract_y', 'link']]

  


Unnamed: 0,abstract_x,abstract_y,link
422,A novel human coronavirus (2019-nCoV) was iden...,,http://dx.doi.org/10.1101/2020.02.02.20020016
462,As the outbreak of novel 2019 coronavirus (201...,,http://dx.doi.org/10.1101/2020.01.31.20019935
504,A novel corona virus (2019-nCoV) was identifie...,,http://dx.doi.org/10.1101/2020.02.01.20019984
787,medRxiv preprint Effect of pre-existing immuni...,,http://dx.doi.org/10.1101/2020.01.15.19015693
1137,The need for a names-based cyber-infrastructur...,,http://dx.doi.org/10.3897/BDJ.4.e8080
1787,"In the twenty-first century, we have seen the ...",,http://dx.doi.org/10.1007/s40506-016-0083-7
1862,The Drosophila genome encodes 18 canonical nuc...,,http://dx.doi.org/10.1371/journal.pmed.0030023
2560,Our understanding of non-coding RNA has signif...,,http://dx.doi.org/10.3390/genes10060457
3344,Positive-strand (+)RNA viruses are important a...,,http://dx.doi.org/10.1371/journal.ppat.1005912
3715,"The var multigene family encodes PfEMP1, which...",,http://dx.doi.org/10.1371/journal.pmed.0030222


In [16]:
df.shape

(29327, 19)
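
The merged frame has 29327 rows although only 29315 files were read. With a left merge this can only happen when some `sha` values appear on more than one metadata row. The sketch below, on made-up data, shows how a duplicated right-hand key multiplies rows and how the `indicator` and `validate` arguments of `pd.merge` can surface it. The toy frames, their values and the de-duplication policy at the end are assumptions for illustration, not what this notebook does.

```python
import pandas as pd

# Made-up stand-ins for `papers` and `meta_df`.
toy_papers = pd.DataFrame({"paper_id": ["a1", "b2", "c3"],
                           "body_text": ["text a", "text b", "text c"]})
toy_meta = pd.DataFrame({"sha": ["a1", "b2", "b2"],      # "b2" is listed twice
                         "title": ["A", "B (v1)", "B (v2)"]})

merged = pd.merge(toy_papers, toy_meta, left_on="paper_id", right_on="sha",
                  how="left", indicator=True)
print(len(toy_papers), len(merged))      # the duplicated key adds one row
print(merged["_merge"].value_counts())   # "left_only" rows have no metadata

# validate="many_to_one" raises MergeError when right-hand keys are not unique.
try:
    pd.merge(toy_papers, toy_meta, left_on="paper_id", right_on="sha",
             how="left", validate="many_to_one")
    duplicates_found = False
except pd.errors.MergeError:
    duplicates_found = True

# One possible policy: keep only the first metadata row per sha.
deduped = pd.merge(toy_papers, toy_meta.drop_duplicates("sha"),
                   left_on="paper_id", right_on="sha", how="left")
print(len(deduped))
```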

In [17]:
df.abstract_x.isnull().sum(), (df.abstract_x =='').sum() # missing abstracts in json files

(0, 8060)

In [18]:
df.abstract_y.isnull().sum(), (df.abstract_y=='').sum() # missing abstracts in metadata

(5282, 0)

In [19]:
# fill missing metadata abstracts with the abstract parsed from the JSON file
missing = df.abstract_y.isnull() & (df.abstract_x != '')
df.loc[missing, 'abstract_y'] = df.loc[missing, 'abstract_x']

In [20]:
df.abstract_y.isnull().sum()

3639
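
The same fill can also be expressed with `Series.mask` and `Series.fillna`. The sketch below uses a made-up three-row frame; the column names follow the merged DataFrame, everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "abstract_x": ["from json", "", "json again"],     # '' means no abstract in the JSON file
    "abstract_y": [np.nan, np.nan, "from metadata"],   # NaN means no abstract in the metadata
})

# Turn empty JSON abstracts into NaN so they are never used as a fill value.
json_abstract = toy["abstract_x"].mask(toy["abstract_x"] == "")

# Keep the metadata abstract when present, otherwise fall back to the JSON one.
toy["abstract"] = toy["abstract_y"].fillna(json_abstract)

print(toy["abstract"].tolist())
print(toy["abstract"].isnull().sum())   # rows with no abstract in either source
```

The metadata abstract takes priority, and rows that remain null have no abstract in either source.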

In [21]:
df.isnull().sum()

paper_id                           0
abstract_x                         0
body_text                          0
source_x                        1637
title                           1681
doi                             1947
pmcid                          17249
pubmed_id                       8505
license                         1637
abstract_y                      3639
publish_time                    1709
authors                         2387
journal                         2531
Microsoft Academic Paper ID    29030
WHO #Covidence                 28904
has_full_text                   1637
full_text_file                  1637
publish_year                       0
link                            1947
dtype: int64

In [26]:
df[df.title.isnull()].body_text.iloc[1][:600]

'for fixed constant weights λ i ≥ 1, (often taken to be 2), i = 1, 2, 3. Under these assumptions single statistics\ncan be expressed [1] , where S k (y), U k (y) and T k (y) are the number of k-stars, k-2-paths and k-triangles in network y, respectively [1] . These single statistics are called the alternating k-star, alternating two-path and alternating k-triangle statistics, respectively.'

## Exporting the DataFrame

In [27]:
df.rename(columns = {'abstract_y': 'abstract'}, inplace=True)
df.drop('abstract_x', axis=1, inplace=True)

In [28]:
df.columns

Index(['paper_id', 'body_text', 'source_x', 'title', 'doi', 'pmcid',
       'pubmed_id', 'license', 'abstract', 'publish_time', 'authors',
       'journal', 'Microsoft Academic Paper ID', 'WHO #Covidence',
       'has_full_text', 'full_text_file', 'publish_year', 'link'],
      dtype='object')

In [30]:
df.head()

Unnamed: 0,paper_id,body_text,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,publish_year,link
0,0015023cc06b5362d332b3baf348d11567ca2fbb,"VP3, and VP0 (which is further processed to VP...",biorxiv,The RNA pseudoknots in foot-and-mouth disease ...,10.1101/2020.01.10.901801,,,biorxiv,The positive stranded RNA genomes of picornavi...,2020-01-11,"Ward, J. C. J.; Lasecka-Dykes, L.; Neil, C.; A...",,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.01.10.901801
1,004f0f8bb66cf446678dc13cf2701feec4f36d76,The 2019-nCoV epidemic has spread across China...,medrxiv,Healthcare-resource-adjusted vulnerabilities t...,10.1101/2020.02.11.20022111,,,medrvix,We integrate the human movement and healthcare...,2020-02-12,Hanchu Zhou; Jianan Yang; Kaichen Tang; Qingpe...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.02.11.20022111
2,00d16927588fb04d4be0e6b269fc02f0d3c2aa7b,"Infectious bronchitis (IB), which is caused by...",biorxiv,"Real-time, MinION-based, amplicon sequencing f...",10.1101/634600,,,biorxiv,Infectious bronchitis (IB) causes significant ...,2019-05-10,"Butt, S. L.; Erwood, E. C.; Zhang, J.; Sellers...",,,,True,biorxiv_medrxiv,2019,http://dx.doi.org/10.1101/634600
3,0139ea4ca580af99b602c6435368e7fdbefacb03,Nipah is an infectious negative-sense single-s...,biorxiv,A Combined Evidence Approach to Prioritize Nip...,10.1101/2020.03.12.977918,,,biorxiv,AbstractBackgroundNipah Virus (NiV) came into ...,2020-03-12,Nishi Kumari; Ayush Upadhyay; Kishan Kalia; Ra...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.03.12.977918
4,013d9d1cba8a54d5d3718c229b812d7cf91b6c89,"In December 2019, a cluster of patients with p...",medrxiv,Assessing spread risk of Wuhan novel coronavir...,10.1101/2020.02.04.20020479,,,medrvix,Background: A novel coronavirus (2019-nCoV) em...,2020-02-05,Shengjie Lai; Isaac Bogoch; Nick Ruktanonchai;...,,,,True,biorxiv_medrxiv,2020,http://dx.doi.org/10.1101/2020.02.04.20020479


In [31]:
df.shape

(29327, 18)

We can now export the cleaned DataFrame:

In [32]:
df.to_csv('cord19_df.csv', index=False)
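
When the exported file is read back, identifier columns need the same `dtype` treatment as in the metadata load, otherwise pandas infers numeric types for them. A minimal sketch with a made-up two-row frame, written to an in-memory buffer instead of `cord19_df.csv`:

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({"paper_id": ["a1", "b2"],
                    "pubmed_id": ["4361535", np.nan],
                    "doi": ["10.1016/0002-8703(72)90077-4", np.nan]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without dtype hints, a numeric-looking column with gaps is read as float.
buffer.seek(0)
inferred = pd.read_csv(buffer)
print(inferred["pubmed_id"].dtype)

# With explicit string dtypes the identifiers come back as text.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"pubmed_id": str, "doi": str})
print(restored["pubmed_id"].iloc[0])
```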