<h1> CORD-19 Solution Toolbox</h1>


This notebook provides a minimal toolset for exploring the dataset.

# Load packages

We load only the packages we need for now.

In [1]:
import numpy as np
import pandas as pd

import os
import json

# Explore the data

In [2]:
count = 0
file_exts = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        count += 1
        file_ext = filename.split(".")[-1]
        file_exts.append(file_ext)

file_ext_set = set(file_exts)

print(f"Files: {count}")
print(f"Files extensions: {file_ext_set}\n\n=====================\nFiles extension count:\n=====================")
file_ext_list = list(file_ext_set)
for fe in file_ext_list:
    fe_count = file_exts.count(fe)
    print(f"{fe}: {fe_count}")

Files: 13206
Files extensions: {'json', 'readme', 'pdf', 'txt', 'csv'}

Files extension count:
json: 13202
readme: 1
pdf: 1
txt: 1
csv: 1
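
The same count can be written more compactly with `collections.Counter`. The sketch below is an alternative, not the code that produced the output above. The helper name `count_extensions` is an assumption introduced here, and it uses `os.path.splitext`, so a file without a dot is reported as `(none)` rather than under its full name.

```python
import os
from collections import Counter


def count_extensions(root):
    """Count files under `root` by extension (lower-cased, without the dot)."""
    counts = Counter()
    for _dirname, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            # splitext returns ('name', '.ext'); files without a dot get ''.
            ext = os.path.splitext(filename)[1].lstrip(".").lower()
            counts[ext or "(none)"] += 1
    return counts


if __name__ == "__main__":
    ext_counts = count_extensions("/kaggle/input")
    print(f"Files: {sum(ext_counts.values())}")
    # most_common() lists the extensions from most to least frequent.
    for ext, n in ext_counts.most_common():
        print(f"{ext}: {n}")
```

A single pass over the tree is enough here, whereas calling `list.count` once per extension rescans the full list each time.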


Let's also look at the directory structure, to see how the data is organized at a high level:

In [3]:
# Print every directory together with its immediate subfolders
for root, folders, filenames in os.walk('/kaggle/input'):
    print(root, folders)

/kaggle/input ['CORD-19-research-challenge']
/kaggle/input/CORD-19-research-challenge ['2020-03-13']
/kaggle/input/CORD-19-research-challenge/2020-03-13 ['comm_use_subset', 'pmc_custom_license', 'noncomm_use_subset', 'biorxiv_medrxiv']
/kaggle/input/CORD-19-research-challenge/2020-03-13/comm_use_subset ['comm_use_subset']
/kaggle/input/CORD-19-research-challenge/2020-03-13/comm_use_subset/comm_use_subset []
/kaggle/input/CORD-19-research-challenge/2020-03-13/pmc_custom_license ['pmc_custom_license']
/kaggle/input/CORD-19-research-challenge/2020-03-13/pmc_custom_license/pmc_custom_license []
/kaggle/input/CORD-19-research-challenge/2020-03-13/noncomm_use_subset ['noncomm_use_subset']
/kaggle/input/CORD-19-research-challenge/2020-03-13/noncomm_use_subset/noncomm_use_subset []
/kaggle/input/CORD-19-research-challenge/2020-03-13/biorxiv_medrxiv ['biorxiv_medrxiv']
/kaggle/input/CORD-19-research-challenge/2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv []
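
To see how the files are distributed over these folders, the walk can also report a file count per directory. This is a minimal sketch; the helper name `files_per_directory` is an assumption introduced here, and no counts are claimed for the actual dataset.

```python
import os


def files_per_directory(root):
    """Map each directory under `root` that contains files to its file count."""
    return {
        dirpath: len(filenames)
        for dirpath, _subdirs, filenames in os.walk(root)
        if filenames  # skip directories that only hold subfolders
    }


if __name__ == "__main__":
    for dirpath, n_files in sorted(files_per_directory("/kaggle/input").items()):
        print(f"{n_files:>6}  {dirpath}")
```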


Most of the files are in JSON format. The files are grouped into 4 folders and 4 tar archives.

The following sections provide some tools for exploring the JSON files.

## Read a json file



In [4]:
json_folder_path = "/kaggle/input/CORD-19-research-challenge/2020-03-13/pmc_custom_license/pmc_custom_license"
json_file_name = os.listdir(json_folder_path)[0]
json_path = os.path.join(json_folder_path, json_file_name)

with open(json_path) as json_file:
    json_data = json.load(json_file)
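
Before flattening anything, it can help to look at the top-level keys of the parsed object. The sketch below is only an illustration: `describe_top_level` is a helper name introduced here, and `sample` is a hypothetical miniature record whose key names mirror the columns that appear later in this notebook.

```python
def describe_top_level(record):
    """Return {key: short description} for the top-level keys of a parsed JSON object."""
    summary = {}
    for key, value in record.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            summary[key] = type(value).__name__
    return summary


# Hypothetical miniature record, not taken from the dataset.
sample = {
    "paper_id": "0000",
    "metadata": {"title": "A title", "authors": []},
    "abstract": [{"text": "First paragraph."}],
    "body_text": [{"text": "Body."}, {"text": "More body."}],
}

for key, description in describe_top_level(sample).items():
    print(f"{key}: {description}")
```

On the real data, the same call would be `describe_top_level(json_data)`.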

To make the data easier to work with, we can normalize the JSON, that is, flatten its nested structure into a table. Here is the code.

In [5]:
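# pd.json_normalize is the public name in pandas 1.0 and later;
# on older versions use pd.io.json.json_normalize instead.
json_data_df = pd.json_normalize(json_data)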

The JSON was transformed into a single dataframe row, with column names formed by joining the successive levels of the JSON structure with dots. Let's check the result.

In [6]:
json_data_df

Unnamed: 0,paper_id,abstract,body_text,back_matter,metadata.title,metadata.authors,bib_entries.BIBREF0.ref_id,bib_entries.BIBREF0.title,bib_entries.BIBREF0.authors,bib_entries.BIBREF0.year,...,bib_entries.BIBREF9.authors,bib_entries.BIBREF9.year,bib_entries.BIBREF9.venue,bib_entries.BIBREF9.volume,bib_entries.BIBREF9.issn,bib_entries.BIBREF9.pages,bib_entries.BIBREF9.other_ids.DOI,ref_entries.FIGREF0.text,ref_entries.FIGREF0.latex,ref_entries.FIGREF0.type
0,05326cc45fa2898c5850df85d30dad3d2c82acef,[],[{'text': 'A quatic wild birds are the natural...,[{'text': 'We detected bovine kobuvirus (BKV) ...,References 1. STI Study Group. Syphilis and go...,"[{'first': 'J', 'middle': ['P'], 'last': 'Dori...",b0,Evolution and ecology of influenza A viruses,"[{'first': 'R', 'middle': ['G'], 'last': 'Webs...",1992,...,"[{'first': 'K', 'middle': ['L'], 'last': 'Laur...",2015,Clin Vaccine Immunol,22,,957--64,[10.1128/CVI.00278-15],Maximum-likelihood phylogenetic tree showing r...,,figure
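
The dotted column names above come from joining nested keys. The sketch below illustrates that idea in plain Python. It is not pandas' implementation, and both the `flatten` helper and the `sample` record are hypothetical, introduced here only for illustration.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Lists are left untouched, which mirrors why columns such as
    `abstract` above still hold a list after normalization.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the accumulated key prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical miniature record, shaped like the files in this dataset.
sample = {
    "paper_id": "0000",
    "abstract": [],
    "metadata": {"title": "A title"},
    "bib_entries": {"BIBREF0": {"ref_id": "b0", "year": 1992}},
}

print(sorted(flatten(sample)))
# ['abstract', 'bib_entries.BIBREF0.ref_id', 'bib_entries.BIBREF0.year', 'metadata.title', 'paper_id']
```

Because every bibliography entry becomes its own group of columns, papers with many references produce very wide rows.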


## Convert the folder into a dataframe


Let's now process the entire folder. We will build a single dataframe from the JSON files it contains.

In [7]:
print(f"Files in folder: {len(os.listdir(json_folder_path))}")

Files in folder: 1426


In [8]:
from tqdm import tqdm

# to process all files, uncomment the next line and comment the line below
# list_of_files = list(os.listdir(json_folder_path))
list_of_files = list(os.listdir(json_folder_path))[0:50]
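frames = []

for file in tqdm(list_of_files):
    json_path = os.path.join(json_folder_path, file)
    with open(json_path) as json_file:
        json_data = json.load(json_file)
    # pd.json_normalize is the public name in pandas 1.0 and later
    frames.append(pd.json_normalize(json_data))

# Concatenate once at the end instead of appending inside the loop
# (DataFrame.append copies the whole frame on every call);
# sort=True sorts the union of columns, matching the output below.
pmc_custom_license_df = pd.concat(frames, sort=True)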

100%|██████████| 50/50 [00:11<00:00,  4.38it/s]


In [9]:
pmc_custom_license_df.head()

Unnamed: 0,abstract,back_matter,bib_entries.BIBREF0.authors,bib_entries.BIBREF0.issn,bib_entries.BIBREF0.other_ids.DOI,bib_entries.BIBREF0.pages,bib_entries.BIBREF0.ref_id,bib_entries.BIBREF0.title,bib_entries.BIBREF0.venue,bib_entries.BIBREF0.volume,...,ref_entries.TABREF6.type,ref_entries.TABREF7.latex,ref_entries.TABREF7.text,ref_entries.TABREF7.type,ref_entries.TABREF8.latex,ref_entries.TABREF8.text,ref_entries.TABREF8.type,ref_entries.TABREF9.latex,ref_entries.TABREF9.text,ref_entries.TABREF9.type
0,[],[{'text': 'We detected bovine kobuvirus (BKV) ...,"[{'first': 'R', 'middle': ['G'], 'last': 'Webs...",,,152--79,b0,Evolution and ecology of influenza A viruses,Microbiol Rev,56.0,...,,,,,,,,,,
0,[],[],[],,,,b0,Avian influenza: assessing the pandemic threat,,,...,,,,,,,,,,
0,[{'text': 'The Centers for Disease Control and...,"[{'text': 'We thank Sarah Hedges, Maria Varvou...",[],,,,b0,US Centers for Disease Control and Prevention ...,,,...,,,,,,,,,,
0,[{'text': 'Although oligonucleotide probes com...,[{'text': 'We would like to thank Nathaniel Hu...,"[{'first': 'K', 'middle': ['U'], 'last': 'Mir'...",,,329--360,b0,Sequence variation in genes and genomic DNA: m...,Annu. Rev. Genomics Hum. Genet,1.0,...,,,,,,,,,,
0,[{'text': 'Middle East respiratory syndrome co...,[{'text': 'We thank everyone who assisted with...,"[{'first': 'Z', 'middle': ['A'], 'last': 'Memi...",,[10.1056/NEJMc1308698],884--890,b0,Middle East respiratory syndrome coronavirus i...,N Engl J Med,369.0,...,,,,,,,,,,
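
Full normalization yields one column per bibliography and figure field, so most cells are empty. If only a few fields are needed, a lighter option is to pick them out while reading each file. The sketch below assumes the key names seen in the columns above (`paper_id`, `metadata.title`, `abstract`, `body_text`); the helper name `load_compact_records` and its `limit` parameter are introduced here for illustration.

```python
import json
import os


def load_compact_records(folder, limit=None):
    """Read up to `limit` JSON files from `folder`, keeping a few fields from each."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".json"))[:limit]
    records = []
    for name in names:
        with open(os.path.join(folder, name)) as json_file:
            data = json.load(json_file)
        records.append({
            "paper_id": data.get("paper_id"),
            # .get with a default keeps files that lack a key from raising
            "title": data.get("metadata", {}).get("title"),
            "abstract": data.get("abstract", []),
            "body_text": data.get("body_text", []),
        })
    return records
```

Passing the result to `pd.DataFrame(load_compact_records(json_folder_path, limit=50))` would give one row per paper with four columns.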


## Extract abstract text


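Now let's extract the abstract text from the abstract column. The code below keeps the text of the first entry in each abstract list and returns an empty string when the list is empty.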


In [10]:
pmc_custom_license_df['abstract_text'] = pmc_custom_license_df['abstract'].apply(lambda x: x[0]['text'] if x else "")

In [11]:
pd.set_option('display.max_colwidth', 500)
pmc_custom_license_df[['abstract', 'abstract_text']].head()

Unnamed: 0,abstract,abstract_text
0,[],
0,[],
0,"[{'text': 'The Centers for Disease Control and Prevention has established 10 Global Disease Detection (GDD) Program regional centers around the world that serve as centers of excellence for public health research on emerging and reemerging infectious diseases. The core activities of the GDD Program focus on applied public health research, surveillance, laboratory, public health informatics, and technical capacity building. During 2015-2016, program staff conducted 205 discrete projects on a ...","The Centers for Disease Control and Prevention has established 10 Global Disease Detection (GDD) Program regional centers around the world that serve as centers of excellence for public health research on emerging and reemerging infectious diseases. The core activities of the GDD Program focus on applied public health research, surveillance, laboratory, public health informatics, and technical capacity building. During 2015-2016, program staff conducted 205 discrete projects on a range of to..."
0,"[{'text': 'Although oligonucleotide probes complementary to single nucleotide substitutions are commonly used in microarray-based screens for genetic variation, little is known about the hybridization properties of probes complementary to small insertions and deletions. It is necessary to define the hybridization properties of these latter probes in order to improve the specificity and sensitivity of oligonucleotide microarray-based mutational analysis of disease-related genes. Here, we comp...","Although oligonucleotide probes complementary to single nucleotide substitutions are commonly used in microarray-based screens for genetic variation, little is known about the hybridization properties of probes complementary to small insertions and deletions. It is necessary to define the hybridization properties of these latter probes in order to improve the specificity and sensitivity of oligonucleotide microarray-based mutational analysis of disease-related genes. Here, we compare and con..."
0,"[{'text': 'Middle East respiratory syndrome coronavirus (MERS-CoV) infections sharply increased in the Arabian Peninsula during spring 2014. In Abu Dhabi, United Arab Emirates, these infections occurred primarily among healthcare workers and patients. To identify and describe epidemiologic and clinical characteristics of persons with healthcare-associated infection, we reviewed laboratory-confirmed MERS-CoV cases reported to the Health Authority of Abu Dhabi during January 1, 2013-May 9, 201...","Middle East respiratory syndrome coronavirus (MERS-CoV) infections sharply increased in the Arabian Peninsula during spring 2014. In Abu Dhabi, United Arab Emirates, these infections occurred primarily among healthcare workers and patients. To identify and describe epidemiologic and clinical characteristics of persons with healthcare-associated infection, we reviewed laboratory-confirmed MERS-CoV cases reported to the Health Authority of Abu Dhabi during January 1, 2013-May 9, 2014. Of 65 ca..."
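
If an abstract is split into several entries, taking only the first one drops the rest. The sketch below joins all entries instead. The helper name `join_paragraphs` is an assumption introduced here. It also returns an empty string for values that are not lists (for example a missing value), which the lambda above does not handle.

```python
def join_paragraphs(paragraphs, sep="\n\n"):
    """Concatenate the `text` of every paragraph dict; return "" for empty or non-list input."""
    if not isinstance(paragraphs, list):
        return ""
    return sep.join(p.get("text", "") for p in paragraphs if isinstance(p, dict))


# Hypothetical two-paragraph abstract, not taken from the dataset.
abstract = [{"text": "Background."}, {"text": "Results."}]
print(join_paragraphs(abstract, sep=" "))  # Background. Results.
```

It can be applied to the dataframe with `pmc_custom_license_df['abstract'].apply(join_paragraphs)`.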
