# Processing the CORD-19 Dataset

**Purpose:** This notebook interactively guides the user through processing the machine-readable corpus of COVID-19 research made available by the White House on 2020-03-16.  After downloading the original dataset, the user only needs to enter their directory (using the text boxes embedded in the notebook) to read in, process, and export the data.  This workflow is designed for anyone looking to use Python to explore and analyze the COVID-19 text.

**About the Dataset:**
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.
    
<br/><br/>
- **Resources**
    - **[Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)**
    
<br/><br/>
- **Datasets**
    - **[COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)** - Kaggle is hosting the COVID-19 Open Research Dataset Challenge, a series of important questions designed to inspire the community to use CORD-19 to find new insights about the COVID-19 pandemic including the natural history, transmission, and diagnostics for the virus, management measures at the human-animal interface, lessons from previous epidemiological studies, and more.  You can download the full dataset on their website.
    - **[COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research)** - The CORD-19 resource is available on the Allen Institute’s SemanticScholar.org website and will continue to be updated as new research is published in archival services and peer-reviewed publications.

    
## Table of Contents

**1.0** **- Data Ingestion**
    * 1.1 - Set Your Working Directory
    * 1.2 - Load Helper Functions
    * 1.3 - Load Processing Functions
    
**2.0** **- Process the Raw Data**
    * 2.1 - Run the Processing Function
    * 2.2 - View a Sample of the Data
   
**3.0** **- Export Dataframes for Offline Analysis or Secondary Processes**
    * 3.1 - Load Export Functions
    * 3.2 - Select and Export Files

## Dependencies

This script was executed using the following version of Python:
* **Python 3.6.2 :: Anaconda, Inc.**

Use this link to install Python on your machine:
* https://www.anaconda.com/distribution/#download-section

**About Python Versions:**
If you are running a higher version of Python and this notebook fails to execute properly, you can downgrade your version in the terminal by running the following commands:
* conda search python [to see which versions are available from your configured conda channels]
* conda install python=3.6.2 [which will switch the active environment to 3.6.2, if it appears in the list above]

**About Python Packages:**
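The third-party packages used in this notebook can be installed on your machine using the "pip install [package_name]" command in your terminal.  The standard-library modules in the list below (os, datetime, io, and warnings) ship with Python and do not need to be installed.  Be sure you've installed each of the remaining packages before attempting to execute the notebook.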

Current package requirements include:
* os - https://docs.python.org/3/library/os.html
* Pandas - https://pandas.pydata.org/
* Numpy - http://www.numpy.org/
* Datetime - https://docs.python.org/3/library/datetime.html
* ipywidgets - https://ipywidgets.readthedocs.io/en/stable/user_install.html
* ipython - https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/extension_api.html
* requests - https://2.python-requests.org/en/master/user/install/
* io - https://docs.python.org/3/library/io.html
* warnings - https://docs.python.org/3/library/warnings.html
* pyarrow - https://arrow.apache.org/docs/python/parquet.html

The current template uses the following versions:
* os: standard library (bundled with Python 3.6)
* pandas==0.24.1
* numpy==1.16.1
* datetime: standard library (bundled with Python 3.6)
* ipywidgets==7.4.2
* ipython==6.2.1
* requests==2.18.4
* io: standard library (bundled with Python 3.6)
* warnings: standard library (bundled with Python 3.6)
* pyarrow==0.16.0
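
To compare your own environment against the versions listed above, a short check along these lines may help. It is a sketch rather than part of the original notebook: `report_versions` is a hypothetical helper name, and it assumes each third-party package exposes a `__version__` attribute, which standard-library modules such as `os` generally do not.

```python
import importlib

def report_versions(module_names):
    """Return {module name: version string} for the given importable names."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = 'not installed'
            continue
        # Standard-library modules usually have no __version__ attribute.
        versions[name] = getattr(module, '__version__', 'no __version__ attribute')
    return versions

# Import names used by this notebook (IPython is the import name for ipython).
packages = ['pandas', 'numpy', 'ipywidgets', 'IPython', 'requests', 'pyarrow']
for name, version in report_versions(packages).items():
    print('{}=={}'.format(name, version))
```

Any package reported as "not installed" needs to be installed before the import cell further down will run.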

**About the Metadata File:**
1. Metadata for papers from these sources are combined: CZI, PMC, bioRxiv/medRxiv (29500 records in total).
    - CZI: 1236 records
    - PMC: 27337 records
    - bioRxiv: 566 records
    - medRxiv: 361 records
2. About 17K of the paper records have PDFs; the hashes of those PDFs are stored in 'sha'.
3. For PMC-sourced papers, one paper's metadata can be associated with one or more PDFs/shas: a PDF/sha corresponding to the main article, and possibly additional PDFs/shas corresponding to supporting materials for the article.
4. About 13K of the PDFs were processed into full text ('has_full_text'=True).
5. Various 'keys' are populated with the metadata:
    - 'pmcid': populated for all PMC paper records (27337 non-null)
    - 'doi': populated for all bioRxiv/medRxiv paper records and most of the other records (26357 non-null)
    - 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non-null)
    - 'pubmed_id': populated for some of the records
    - 'Microsoft Academic Paper ID': populated for some of the records
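
The non-null counts quoted above can be re-derived from the metadata CSV once it is downloaded. The sketch below shows the idea with the standard-library `csv` module on a tiny in-memory sample. The helper name `count_populated` is hypothetical and the sample rows are made up for illustration; only the column names come from the notes above.

```python
import csv
import io

def count_populated(csv_text, columns):
    """Count, per column, the rows whose value is not empty."""
    counts = {column: 0 for column in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            if (row.get(column) or '').strip():
                counts[column] += 1
    return counts

# Made-up rows for illustration only; they are not taken from CORD-19.
sample = (
    'sha,pmcid,doi\n'
    'aaa111,PMC0000001,10.0000/example.1\n'
    ',PMC0000002,\n'
    'bbb222,,10.0000/example.2\n'
    ',PMC0000003,\n'
)
print(count_populated(sample, ['sha', 'pmcid', 'doi']))
# prints {'sha': 2, 'pmcid': 3, 'doi': 2}
```

Once the metadata file has been read into a pandas dataframe, `DataFrame.notna().sum()` on the same columns should give the equivalent counts.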

## Before you begin, ensure you've installed the required Python packages

* See the list above and make note of the specific versions that were used in this notebook

In [33]:
############################################
###### Import required Python packages #####
############################################

import os
import json
import re
from copy import deepcopy
import datetime as dt
import pyarrow
import numpy as np
import pandas as pd
from ipywidgets import interact, interactive, IntSlider, Layout
import ipywidgets as widgets
from IPython.display import display
import requests
import warnings
warnings.filterwarnings('ignore')

## AN IMPORTANT NOTE ABOUT INTERACTIVE WIDGETS

This notebook uses interactive widgets to help you make selections and inputs more conveniently.  As you work through this notebook, be sure to follow the steps below to ensure your selections are incorporated in the cells that follow:

#### 1. Run the cell containing the interactive widget(s) to bring them into view
#### 2. Apply your selections and/or inputs to the widgets that appear
#### 3. DO NOT rerun the cell as it will erase your selections and inputs
#### 4. To proceed, simply click on the next cell in the notebook, and Run it

<br/>

## 1.0 - Data Ingestion

The series of code blocks below will walk you through the process of mapping to your working directory and uploading your dataset.

## 1.1 - Set Your Working Directory

Your "working directory" is a folder location on your computer that will store files either read-in or written-out by this script.  This code by default will return your current, active directory.  You can change this directory by typing in a specific path into the text box provided.

In [34]:
set_working_directory = widgets.Text(
    value=os.getcwd(),
    placeholder='/Users/bblanchard006/Desktop/covid19/2020-03-13',
    description='Directory:',
    disabled=False,
    layout=Layout(width='50%')
)

display(set_working_directory)

Text(value='/Users/bblanchard006/Desktop/covid19/2020-03-13', description='Directory:', layout=Layout(width='5…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After executing the cell above, you can leave the default directory or overwrite the text string that appears with your desired folder directory. **DO NOT execute the cell again after making your update.** The input above will be fed into the following code cell, where it will either successfully map to the new directory or notify you of an error.

In [35]:
try:
    os.chdir(set_working_directory.value)
    print('Changed directory to {}'.format(set_working_directory.value))
except Exception as e:
    print('Failed to change directory')
    print(e)

Changed directory to /Users/bblanchard006/Desktop/covid19/2020-03-13
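
If you would rather validate a path before changing into it, a check like the following is one option. It is a minimal sketch and not part of the original notebook: `resolve_directory` is a hypothetical helper that expands `~`, converts the path to an absolute one, and raises an error when the folder does not exist.

```python
import os

def resolve_directory(path):
    """Expand '~' and return an absolute path, or raise if it is not a folder."""
    expanded = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(expanded):
        raise NotADirectoryError('{} is not an existing directory'.format(expanded))
    return expanded

# The current directory always exists, so this prints its absolute path.
print(resolve_directory('.'))
```

The returned string could then be passed to `os.chdir` in place of the raw widget value.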


## 1.2 - Load Helper Functions

The list of functions below will help us extract the important attributes embedded in the .json files.  Much of the source material detailed below was originally published to the Kaggle community by various supporters and has been lightly modified for the purposes of this Notebook.  You can access the original walkthrough provided on Kaggle below:

[CORD-19: EDA, parse JSON and generate clean CSV](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv)

In [36]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])

def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)

    return ", ".join(name_ls)    

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

def format_bib(bibs):
    if type(bibs) == dict:
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

def get_bib_authors(bibs):
    if type(bibs) == dict:
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str.upper(bib[k]) for k in ['authors']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

def get_bib_titles(bibs):
    if type(bibs) == dict:
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []

    for bib in bibs:
        formatted_ls = [str.upper(bib[k]) for k in ['title']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

def get_bib_count(bibs):
    # len() works for both the dict and list forms handled by the other
    # bibliography helpers, so the count is always defined
    return len(bibs)
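
To make the behavior of `format_body` concrete, the sketch below applies the same merging logic to a made-up `body_text` list. `merge_sections` is a condensed, hypothetical stand-in written so the example is self-contained, and the sample paragraphs are invented for illustration.

```python
def merge_sections(body_text):
    """Concatenate paragraphs that share a section heading, in first-seen order."""
    merged = {}
    for paragraph in body_text:
        # dicts preserve insertion order (guaranteed from Python 3.7).
        section = paragraph['section']
        merged[section] = merged.get(section, '') + paragraph['text']
    return ''.join('{}\n\n{}\n\n'.format(section, text)
                   for section, text in merged.items())

# Made-up paragraphs in the shape of a 'body_text' list.
sample_body = [
    {'section': 'Introduction', 'text': 'First paragraph.'},
    {'section': 'Methods', 'text': 'Study design.'},
    {'section': 'Introduction', 'text': 'Second paragraph.'},
]
print(repr(merge_sections(sample_body)))
# prints 'Introduction\n\nFirst paragraph.Second paragraph.\n\nMethods\n\nStudy design.\n\n'
```

Note that paragraphs belonging to the same section are joined without any separator, so sentence boundaries between paragraphs can run together in the output.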


## 1.3 - Load Processing Functions

The list of functions below will leverage the helper functions above to process, consolidate, and output the full-text articles from the CORD-19 data set into a columnar structure for machine learning.

In [37]:
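def load_files(dirname):
    raw_files = []

    for filename in os.listdir(dirname):
        # os.path.join works whether or not dirname ends with a separator,
        # and the with-block closes each file after it is parsed
        with open(os.path.join(dirname, filename), 'rb') as handle:
            raw_files.append(json.load(handle))
    
    return raw_files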

def generate_clean_df(all_files):
    cleaned_files = []

    for file in all_files:
        features = [
            file['paper_id'],
            file['metadata']['title'].upper(),
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            get_bib_titles(file['bib_entries']),
            get_bib_count(file['bib_entries']),
            get_bib_authors(file['bib_entries']),
        ]

        cleaned_files.append(features)

    col_names = [
                 'paper_id',
                 'title',
                 'authors',
                 'affiliations',
                 'abstract',
                 'text', 
                 'bibliography',
                 'bibliography_titles',
                 'number_of_references',
                 'bibliography_authors',
    ]

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    return clean_df
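
Folders downloaded from the web sometimes contain stray files (hidden system files, partial downloads) that would make a loader fail when every file is parsed as JSON. The sketch below shows one defensive variant. It is an illustration under stated assumptions, not the notebook's loader: `load_json_files` is a hypothetical name, and skipping files instead of raising an error is a design choice you may not want.

```python
import json
import os
import tempfile

def load_json_files(dirname):
    """Parse every *.json file in dirname; return (records, skipped filenames)."""
    records, skipped = [], []
    for filename in sorted(os.listdir(dirname)):
        if not filename.endswith('.json'):
            skipped.append(filename)  # e.g. hidden system files
            continue
        with open(os.path.join(dirname, filename), 'r', encoding='utf-8') as handle:
            try:
                records.append(json.load(handle))
            except json.JSONDecodeError:
                skipped.append(filename)
    return records, skipped

# Demonstration on a throwaway folder with one valid and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'paper.json'), 'w', encoding='utf-8') as handle:
        json.dump({'paper_id': 'example'}, handle)
    with open(os.path.join(tmp, 'notes.txt'), 'w', encoding='utf-8') as handle:
        handle.write('not a paper')
    records, skipped = load_json_files(tmp)
    print(len(records), skipped)
# prints 1 ['notes.txt']
```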

The cell below loads the main processing function

In [38]:
def process_data():

    data_dict = {}
    
    folder_dir = [
            'biorxiv_medrxiv',
            'pmc_custom_license',
            'comm_use_subset',
            'noncomm_use_subset'
    ]
    
    for f in folder_dir:
        dir_mapping = os.getcwd()+os.sep+f+os.sep+f+os.sep
        temp_files = load_files(dir_mapping)
        temp_df = generate_clean_df(temp_files)
        temp_df = temp_df.fillna('None')
        temp_df = temp_df.replace(r'^\s*$', 'None', regex=True)
        data_dict.update({f+'_full_text':temp_df})

    meta_file = pd.read_csv(os.getcwd()+os.sep+'all_sources_metadata_2020-03-13.csv')      
    data_dict.update({'metadata_file':meta_file})

    full_text_frames = []
    for key, value in data_dict.items():
        if key != 'metadata_file':
            value['cord_19_source'] = key
            full_text_frames.append(value)
            
    stacked_df = pd.concat(full_text_frames, sort=False)

    # Sets keep the membership tests below fast on tens of thousands of rows
    sha_set = set(data_dict['metadata_file']['sha'])
    paper_id_set = set(stacked_df['paper_id'])
    
    stacked_df['metadata_match'] = stacked_df['paper_id'].apply(lambda x: 'yes' if x in sha_set else 'no')  
    data_dict['metadata_file']['paper_id_match'] = data_dict['metadata_file']['sha'].apply(lambda x: 'yes' if x in paper_id_set else 'no')

    data_dict.update({'consolidated_full_text':stacked_df})

    return data_dict
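
The matching step in `process_data()` compares each `paper_id` with the whole `sha` value of a metadata row. If a metadata row lists several hashes in one field, an exact comparison would miss them. The sketch below shows one way to handle that case in plain Python. The `;` separator, the helper names `split_shas` and `match_paper_ids`, and the sample hashes are all assumptions made for illustration; check the separator against your copy of the metadata file before relying on it.

```python
def split_shas(sha_field, separator=';'):
    """Split a metadata 'sha' value into individual hashes (assumed separator)."""
    if not isinstance(sha_field, str):
        return []  # missing values are read by pandas as float NaN
    return [part.strip() for part in sha_field.split(separator) if part.strip()]

def match_paper_ids(paper_ids, sha_fields):
    """Return {paper_id: 'yes' or 'no'} using set membership."""
    known = {sha for field in sha_fields for sha in split_shas(field)}
    return {paper_id: 'yes' if paper_id in known else 'no' for paper_id in paper_ids}

# Made-up hashes for illustration.
result = match_paper_ids(
    ['aaa111', 'ccc333', 'zzz999'],
    ['aaa111', 'bbb222; ccc333', float('nan')],
)
print(result)
# prints {'aaa111': 'yes', 'ccc333': 'yes', 'zzz999': 'no'}
```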


## 2.0 - Process the Raw Data

This code block will read in the 13k JSON files contained in the (4) folders making up the entire CORD-19 dataset.

## 2.1 - Run the Processing Function

The following cell will execute the "process_data()" function on the (4) data folders included in the CORD-19 dataset.  The function will return a dictionary containing (6) separate dataframes: one per data folder, the metadata file, and the consolidated full text.

In [39]:
data_dict = process_data()

## 2.2 - View a Sample of the Data

The cell below will return the dimensions of the files processed and added to the dictionary

In [40]:
for key, value in data_dict.items():
    print('{} with {} rows and {} columns was successfully added to dictionary'.format(key, value.shape[0], value.shape[1]))
    

biorxiv_medrxiv_full_text with 803 rows and 11 columns was successfully added to dictionary
pmc_custom_license_full_text with 1426 rows and 11 columns was successfully added to dictionary
comm_use_subset_full_text with 9000 rows and 11 columns was successfully added to dictionary
noncomm_use_subset_full_text with 1973 rows and 11 columns was successfully added to dictionary
metadata_file with 29500 rows and 15 columns was successfully added to dictionary
consolidated_full_text with 13202 rows and 12 columns was successfully added to dictionary


The cell below will return the first row of data from the **consolidated_full_text** dataframe (which contains the stacked, full-text articles contained in the (4) primary datasets)

In [41]:
data_dict['consolidated_full_text'].head(1)

| | paper_id | title | authors | affiliations | abstract | text | bibliography | bibliography_titles | number_of_references | bibliography_authors | cord_19_source | metadata_match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f905f78b32f63c6d14a79984dfb33f1b358b8ab4 | MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C... | Meytal Galilee, Akram Alian | Meytal Galilee (Technion -Israel Institute of ... | Abstract\n\nNew anti-AIDS treatments must be c... | \n\nIn the absence of a curative treatment, th... | HIV drug resistance against strand transfer in... | HIV DRUG RESISTANCE AGAINST STRAND TRANSFER IN... | 38 | K ANSTETT, B BRENNER, T MESPLEDE, M A WAINBERG... | biorxiv_medrxiv_full_text | yes |


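The cell below returns the full text stored under index label 1 of the **consolidated_full_text** dataframe.  Because the source dataframes are stacked without resetting their indexes, that label matches one document from each source, so the result is a list of texts rather than a single document.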

In [42]:
list(data_dict['consolidated_full_text']['text'][1])

['Introduction\n\nEighteen years ago, severe acute respiratory syndrome (SARS) broke out globally, which caused more than 8000 cases with a fatality rate of 9.6% (1). Since December 2019, a pneumonia infection with 2019-nCoV, which was now named as Novel Coronavirus Pneumonia (NCP), broke out in Wuhan, Hubei Province, China, and rapidly spread throughout China and to many other countries (2) (3) (4) . As of February 9, 2020, 2019-nCoV had transmitted to 34 provinces, regions and municipal cities across China. A total of 40,235 confirmed NCP cases with 908 deaths (2.3%) and 6,484 (16.1%) critically ill cases, and there were still 23,589 suspected cases (5) .As an emerging infectious disease, we need to quickly understand its etiological, epidemiological and clinical characteristics to take prevention and control measures. Several studies have described the epidemiological and clinical characteristics of NCP, and have demonstrated that it can transmitted between humans (4, 6) . A few stu

## 3.0 - Export Dataframes for Offline Analysis or Secondary Processes

The following code block will allow you to select and export dataframes to a local directory.  Use the inputs below to write the files to your current directory and to apply a timestamp to the filenames to prevent the risk of overwriting prior files saved to that folder.

## 3.1 - Load Export Functions

The functions below make two export formats available to the user.  One exports the files as .xlsx, while the other exports them as parquet.

In [43]:
def dict_to_excel(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes and the key (dframe) of the one to write;
    #         timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes a dataframe contained within a dictionary to xlsx (on your directory)

    # Colons are removed from the ISO timestamp so the filename is also valid on Windows
    stamp = '_' + dt.datetime.now().isoformat().replace(':', '') if timestamp else ''
    file_name = dframe + stamp + '.xlsx'
    file_path = os.path.join(subfolder, file_name) if subfolder else file_name
        
    try:
        dict_name[dframe].to_excel(file_path, index = False)
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(file_name, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        # Show the underlying error instead of hiding it
        print('Writing the data to the directory failed')
        print(e)
        
def dict_to_parquet(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes and the key (dframe) of the one to write;
    #         timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes a dataframe contained within a dictionary to parquet (on your directory)

    # Colons are removed from the ISO timestamp so the filename is also valid on Windows
    stamp = '_' + dt.datetime.now().isoformat().replace(':', '') if timestamp else ''
    file_name = dframe + stamp + '.parquet.gzip'
    file_path = os.path.join(subfolder, file_name) if subfolder else file_name
        
    try:
        dict_name[dframe].to_parquet(file_path, compression='gzip')
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(file_name, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        # Show the underlying error instead of hiding it
        print('Writing the data to the directory failed')
        print(e)
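
Both export functions build their output filename the same way: dataframe name, optional timestamp, then the file extension, optionally placed inside a subfolder. The sketch below isolates that naming step so it can be tried without writing any files. `build_export_path` is a hypothetical helper, not a function from the notebook.

```python
import datetime as dt
import os

def build_export_path(name, extension, subfolder=None, timestamp=False):
    """Return '<subfolder>/<name>[_<timestamp>]<extension>'."""
    suffix = ''
    if timestamp:
        # Colons are not allowed in Windows filenames, so drop them.
        suffix = '_' + dt.datetime.now().isoformat().replace(':', '')
    filename = name + suffix + extension
    return os.path.join(subfolder, filename) if subfolder else filename

print(build_export_path('metadata_file', '.xlsx'))
# prints metadata_file.xlsx
print(build_export_path('metadata_file', '.parquet.gzip',
                        subfolder='output', timestamp=True))
```

Because the timestamp includes the time of day down to fractions of a second, two exports of the same dataframe get different filenames and do not overwrite each other.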


## 3.2 - Select and Export Files

The code block below will allow you to select which files you would like to extract and in what format.  If choosing to write the files to a "subfolder" - please ensure that the subfolder exists in your directory.

Select one or more available dataframes, then select whether or not you'd like the files saved to the current working directory or a subfolder in the directory.  Lastly, if you would like a timestamp to be added to your exported filenames, select Timestamp = 'yes' to prevent overwriting prior files saved to the folder.

### Important Note:
The **consolidated_full_text** label in the Tables menu below contains the "stacked" full text from all of the original data sources (combined).  The other labels ending in the **full_text** suffix contain the full text for their respective sources ONLY.

In [44]:
dict_keys = widgets.SelectMultiple(
    options=list(data_dict.keys()),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

subfolder_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

output_type = widgets.RadioButtons(
    options=['xlsx','parquet'],
    value='xlsx',
    description='Output Type:',
    disabled=False
)

timestamp_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Timestamp:',
    disabled=False
)

subfolder_text = widgets.Text(
    value='output',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

def sub_folder_edit(y):
    if y == 'yes':
        # The subfolder name is read from subfolder_text.value in the export cell,
        # so no submit callback needs to be registered here
        display(subfolder_text)
        print('Your file(s) will be written to the subfolder in {}[Your Entry Above]'.format(os.getcwd()+os.sep))
    else:
        print('Using {} folder'.format(os.getcwd()))
        
y = widgets.interactive(sub_folder_edit, y=subfolder_option)

display(y, timestamp_option, output_type)

SelectMultiple(description='Tables:', layout=Layout(width='50%'), options=('biorxiv_medrxiv_full_text', 'pmc_c…

interactive(children=(RadioButtons(description='Subfolder:', options=('no', 'yes'), value='no'), Output()), _d…

RadioButtons(description='Timestamp:', options=('no', 'yes'), value='no')

RadioButtons(description='Output Type:', options=('xlsx', 'parquet'), value='xlsx')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
Execute the code cell below to export the selected files, in the chosen format (xlsx or parquet), to your chosen directory.

**NOTE:** If you have chosen to write your files to a "subfolder", ensure that the folder already exists in your working directory.  The function below will **not** create a subfolder in your directory.
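
If the subfolder does not exist yet, it can be created from Python before running the export cell. The sketch below is one way to do that with the standard library. `ensure_subfolder` is a hypothetical helper, and the demonstration uses a temporary directory so nothing is left behind.

```python
import os
import tempfile

def ensure_subfolder(base_dir, subfolder):
    """Create base_dir/subfolder if needed and return its path."""
    target = os.path.join(base_dir, subfolder)
    os.makedirs(target, exist_ok=True)  # no error if it already exists
    return target

with tempfile.TemporaryDirectory() as tmp:
    created = ensure_subfolder(tmp, 'output')
    print(os.path.isdir(created))
    # Calling it a second time is harmless.
    ensure_subfolder(tmp, 'output')
# prints True
```

In the notebook itself, the equivalent call would use `os.getcwd()` as the base directory and the value typed into the Subfolder text box as the name.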

In [45]:
if subfolder_option.value == 'yes':
    subfolder = subfolder_text.value
else:
    subfolder = None
    
dframe_list = list(dict_keys.value)

timestamp_boolean = timestamp_option.value == 'yes'
 
for df in dframe_list:
    if output_type.value == 'parquet':
        dict_to_parquet(data_dict, df, subfolder, timestamp = timestamp_boolean)
    else:
        dict_to_excel(data_dict, df, subfolder, timestamp = timestamp_boolean)

Successfully wrote biorxiv_medrxiv_full_text.xlsx with 803 rows and 11 columns to the directory
Successfully wrote pmc_custom_license_full_text.xlsx with 1426 rows and 11 columns to the directory
Successfully wrote comm_use_subset_full_text.xlsx with 9000 rows and 11 columns to the directory
Successfully wrote noncomm_use_subset_full_text.xlsx with 1973 rows and 11 columns to the directory
Successfully wrote metadata_file.xlsx with 29500 rows and 15 columns to the directory
Successfully wrote consolidated_full_text.xlsx with 13202 rows and 12 columns to the directory
