## About this notebook

In this notebook, I quickly explore the `biorxiv` subset of the papers. Because each paper is stored as nested JSON, the structure is too complex to analyze directly. I therefore explore the structure of those files and also provide the following helper functions, which flatten the inner dictionaries of each file into plain strings:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

Feel free to reuse those functions for your own purpose! If you do, please leave a link to this notebook.

Throughout the EDA, I show you how to use each of those functions. At the end, I show you how to generate a clean version of the `biorxiv` subset as well as all the other datasets, which you can use directly by choosing this notebook as a data source ("File" -> "Add or upload data" -> "Kernel Output File" tab -> search the name of this notebook).

### Update Log

* V9: First release.
* V10: Updated paths to include the [14k new papers](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137474).

In [1]:
import os
import json
from pprint import pprint
from copy import deepcopy

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

## Helper Functions

Unhide the cell below to find the definition of the following functions:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

In [2]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

def format_bib(bibs):
    if isinstance(bibs, dict):
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

Unhide the cell below to find the definition of the following functions:
* `load_files(dirname)`
* `generate_clean_df(all_files)`

In [3]:
def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

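    for filename in tqdm(filenames):
        # os.path.join works whether or not dirname ends with a slash
        path = os.path.join(dirname, filename)
        # The context manager closes each file handle after parsing
        with open(path, 'rb') as f:
            raw_files.append(json.load(f))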
    
    return raw_files

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    
    return clean_df

## Biorxiv: Exploration

Let's first take a quick glance at the `biorxiv` subset of the data. We will also use this opportunity to load all of the JSON files into a list of **nested** dictionaries (each `dict` is an article).

In [4]:
biorxiv_dir = '/home/ubuntu/covid19-challenge/data/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/'
filenames = os.listdir(biorxiv_dir)
print("Number of articles retrieved from biorxiv:", len(filenames))

Number of articles retrieved from biorxiv: 2670


In [6]:
all_files = []

for filename in filenames:
    # Close each file handle once its JSON has been parsed
    with open(os.path.join(biorxiv_dir, filename), 'rb') as f:
        all_files.append(json.load(f))

In [7]:
file = all_files[0]
print("Dictionary keys:", file.keys())

Dictionary keys: dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])


## Biorxiv: Abstract

The `abstract` entry is a list of paragraph dictionaries with a fairly simple structure:

In [8]:
pprint(file['abstract'])

[{'cite_spans': [],
  'ref_spans': [],
  'section': 'Abstract',
  'text': 'The potential infectiousness of asymptomatic COVID-19 cases '
          'together with a substantial fraction of asymptomatic infections '
          'among all infections, have been highlighted in clinical studies. We '
          'conducted statistical modeling analysis to derive the '
          'delay-adjusted asymptomatic proportion of the positive COVID-19 '
          'infections onboard the Princess Cruises ship along with the '
          'timeline of infections. We estimated the asymptomatic proportion at '
          '17.9% (95% CrI: 15.5%-20.2%), with most of the infections occurring '
          'before the start of the 2-week quarantine.'},
 {'cite_spans': [],
  'ref_spans': [],
  'section': 'Abstract',
  'text': 'Word count:'},
 {'cite_spans': [], 'ref_spans': [], 'section': 'Abstract', 'text': 'Main:'}]


## Biorxiv: body text

Let's first probe what the `body_text` entry looks like:

In [9]:
print("body_text type:", type(file['body_text']))
print("body_text length:", len(file['body_text']))
print("body_text keys:", file['body_text'][0].keys())

body_text type: <class 'list'>
body_text length: 22
body_text keys: dict_keys(['text', 'cite_spans', 'ref_spans', 'section'])


We take a look at the first part of the `body_text` content. As you will notice, the body text is split into a list of small subsections, each containing a `section` and a `text` key. Since multiple subsections can share the same section title, we first need to group them by section before concatenating everything.

In [10]:
print("body_text content:")
pprint(file['body_text'][:2], depth=3)

body_text content:
[{'cite_spans': [{...}, {...}, {...}, {...}, {...}, {...}, {...}],
  'ref_spans': [],
  'section': '',
  'text': 'The clinical and epidemiological characteristics of COVID-19 '
          'continue to be investigated as the virus continues its march '
          'through the human population [2] [3] . While reliable estimates of '
          'the reproduction number and the death risk associated with COVID-19 '
          'are crucially needed to guide public health policy, another key '
          'epidemiological parameter that could inform the intensity and range '
          'of social distancing strategies to combat COVID-19 is the '
          'asymptomatic proportion, which is broadly defined as the proportion '
          'of asymptomatic infections among all the infections of the disease. '
          'Indeed, the asymptomatic proportion is a useful quantity to gauge '
          'the true burden of the disease and better interpret estimates of '
          'the transm

Let's see what the grouped section titles are for the example above:

In [11]:
texts = [(di['section'], di['text']) for di in file['body_text']]
texts_di = {di['section']: "" for di in file['body_text']}
for section, text in texts:
    texts_di[section] += text

pprint(list(texts_di.keys()))

['', 'Statistical modelling', 'Discussion']
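
To make the grouping behaviour easier to see, here is a minimal sketch on made-up data. The `toy_body` list and the `group_sections` name are illustrative assumptions, not part of the dataset or of the helpers above; the logic mirrors the dictionary-based grouping used in this notebook.

```python
# Toy body_text (made-up data): "Methods" appears twice, non-adjacently.
toy_body = [
    {"section": "Introduction", "text": "Intro text. "},
    {"section": "Methods", "text": "First methods paragraph. "},
    {"section": "Results", "text": "Results text. "},
    {"section": "Methods", "text": "Second methods paragraph. "},
]


def group_sections(body_text):
    """Concatenate text per section title, keeping first-seen order."""
    grouped = {}
    for part in body_text:
        section = part["section"]
        grouped[section] = grouped.get(section, "") + part["text"]
    return grouped


toy_grouped = group_sections(toy_body)
print(list(toy_grouped.keys()))
print(toy_grouped["Methods"])
```

Note that paragraphs sharing a section title are merged even when they are not adjacent, so the grouped text does not always follow the original reading order.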


The following example shows what the final result looks like, after we format each section title with its content:

In [12]:
body = ""

for section, text in texts_di.items():
    body += section
    body += "\n\n"
    body += text
    body += "\n\n"

print(body[:3000])



The clinical and epidemiological characteristics of COVID-19 continue to be investigated as the virus continues its march through the human population [2] [3] . While reliable estimates of the reproduction number and the death risk associated with COVID-19 are crucially needed to guide public health policy, another key epidemiological parameter that could inform the intensity and range of social distancing strategies to combat COVID-19 is the asymptomatic proportion, which is broadly defined as the proportion of asymptomatic infections among all the infections of the disease. Indeed, the asymptomatic proportion is a useful quantity to gauge the true burden of the disease and better interpret estimates of the transmission potential. This proportion varies widely across infectious diseases, ranging from 8% for measles and 32% for norovirus up to 90-95% for polio [4] [5] [6] . Most importantly, it is well established that asymptomatic individuals are frequently able to transmit the viru

The `format_body` function does all of the above in a single call and produces exactly the same output:

In [13]:
print(format_body(file['body_text'])[:3000])



The clinical and epidemiological characteristics of COVID-19 continue to be investigated as the virus continues its march through the human population [2] [3] . While reliable estimates of the reproduction number and the death risk associated with COVID-19 are crucially needed to guide public health policy, another key epidemiological parameter that could inform the intensity and range of social distancing strategies to combat COVID-19 is the asymptomatic proportion, which is broadly defined as the proportion of asymptomatic infections among all the infections of the disease. Indeed, the asymptomatic proportion is a useful quantity to gauge the true burden of the disease and better interpret estimates of the transmission potential. This proportion varies widely across infectious diseases, ranging from 8% for measles and 32% for norovirus up to 90-95% for polio [4] [5] [6] . Most importantly, it is well established that asymptomatic individuals are frequently able to transmit the viru

## Biorxiv: Metadata

Let's first see what keys are contained in the `metadata` dictionary:

In [14]:
print(all_files[0]['metadata'].keys())

dict_keys(['title', 'authors'])


Let's take a look at each of the corresponding values:

In [15]:
print(all_files[0]['metadata']['title'])

Estimating the Asymptomatic Proportion of 2019 Novel Coronavirus onboard the Princess Cruises Ship, 2020


In [16]:
authors = all_files[0]['metadata']['authors']
pprint(authors[:3])

[{'affiliation': {'institution': 'Kyoto University Yoshida-Nakaadachi-cho',
                  'laboratory': '',
                  'location': {'addrLine': 'Sakyo-ku',
                               'country': 'Japan',
                               'settlement': 'Kyoto'}},
  'email': 'mizumoto.kenji.5a@kyoto-u.ac.jp',
  'first': 'Kenji',
  'last': 'Mizumoto1',
  'middle': [],
  'suffix': ''},
 {'affiliation': {'institution': 'Kyoto University',
                  'laboratory': '',
                  'location': {'addrLine': 'Sakyo-ku',
                               'country': 'Japan',
                               'settlement': 'Yoshidahonmachi, Kyoto'}},
  'email': '',
  'first': 'Katsushi',
  'last': 'Kagaya',
  'middle': [],
  'suffix': ''},
 {'affiliation': {'institution': 'University of Oxford',
                  'laboratory': '',
                  'location': {'country': 'UK'}},
  'email': '',
  'first': 'Alexander',
  'last': 'Zarebski5',
  'middle': [],
  'suffix': ''}]


The `format_name` and `format_affiliation` functions turn each author dictionary into readable strings:

In [17]:
for author in authors:
    print("Name:", format_name(author))
    print("Affiliation:", format_affiliation(author['affiliation']))
    print()

Name: Kenji Mizumoto1
Affiliation: Kyoto University Yoshida-Nakaadachi-cho, Sakyo-ku, Kyoto, Japan

Name: Katsushi Kagaya
Affiliation: Kyoto University, Sakyo-ku, Yoshidahonmachi, Kyoto, Japan

Name: Alexander Zarebski5
Affiliation: University of Oxford, UK

Name: Gerardo Chowell3 Affiliations
Affiliation: Georgia State University, Atlanta, Georgia, USA



Now, let's take as an example a slightly longer list of authors:

In [18]:
pprint(all_files[4]['metadata'], depth=4)

{'authors': [{'affiliation': {'institution': 'Zhongnan Hospital of Wuhan '
                                             'University',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
              'first': 'Guangming',
              'last': 'Ye',
              'middle': [],
              'suffix': ''},
             {'affiliation': {'institution': 'Sun Yat-sen University',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
              'first': 'Hualiang',
              'last': 'Lin',
              'middle': [],
              'suffix': ''},
             {'affiliation': {'institution': 'Zhongnan Hospital of Wuhan '
                                             'University',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
              'first': 'Liangjun',
         

Here, I provide the function `format_authors`, which lets you format a list of authors into a single string, with an optional argument to include each affiliation:

In [19]:
authors = all_files[4]['metadata']['authors']
print("Formatting without affiliation:")
print(format_authors(authors, with_affiliation=False))
print("\nFormatting with affiliation:")
print(format_authors(authors, with_affiliation=True))

Formatting without affiliation:
Guangming Ye, Hualiang Lin, Liangjun Chen, Shichan Wang, Zhikun Zeng, Wei Wang, Shiyu Zhang, Terri Rebmann, Yirong Li, Zhenyu Pan, Zhonghua Yang, Ying Wang, Fubing Wang, Zhengmin , Min Qian, Xinghuan Wang

Formatting with affiliation:
Guangming Ye (Zhongnan Hospital of Wuhan University, Wuhan, China), Hualiang Lin (Sun Yat-sen University, Guangzhou, China), Liangjun Chen (Zhongnan Hospital of Wuhan University, Wuhan, China), Shichan Wang (Zhongnan Hospital of Wuhan University, Wuhan, China), Zhikun Zeng (Zhongnan Hospital of Wuhan University, Wuhan, China), Wei Wang, Shiyu Zhang (Zhongnan Hospital of Wuhan University, Wuhan, China), Terri Rebmann (Saint Louis University, USA), Yirong Li (Zhongnan Hospital of Wuhan University, Wuhan, China), Zhenyu Pan (Zhongnan Hospital of Wuhan University, Wuhan, China), Zhonghua Yang (Zhongnan Hospital of Wuhan University, Wuhan, China), Ying Wang (Zhongnan Hospital of Wuhan University, Wuhan, China), Fubing Wang (Zhon

## Biorxiv: bibliography

Let's take a look at the bibliography section. 

In [20]:
bibs = list(file['bib_entries'].values())
pprint(bibs[:2], depth=4)

[{'authors': [],
  'issn': '',
  'other_ids': {},
  'pages': '',
  'ref_id': 'b0',
  'title': 'Novel Coronavirus (2019-nCoV) situation reports',
  'venue': '',
  'volume': '',
  'year': None},
 {'authors': [{'first': 'N', 'last': 'Linton', 'middle': [...], 'suffix': ''},
              {'first': 'T', 'last': 'Kobayashi', 'middle': [], 'suffix': ''},
              {'first': 'Y', 'last': 'Yang', 'middle': [], 'suffix': ''},
              {'first': 'K', 'last': 'Hayashi', 'middle': [], 'suffix': ''},
              {'first': 'A',
               'last': 'Akhmetzhanov',
               'middle': [...],
               'suffix': ''},
              {'first': 'S', 'last': 'Jung', 'middle': [...], 'suffix': ''}],
  'issn': '',
  'other_ids': {'DOI': ['10.1101/2020.01.26.20018754']},
  'pages': '',
  'ref_id': 'b1',
  'title': 'Epidemiological characteristics of novel coronavirus infection: A '
           'statistical analysis of publicly available case data. medRxiv '
           '2020.01.26',
  've

You can reuse the `format_authors` function here:

In [21]:
format_authors(bibs[1]['authors'], with_affiliation=False)

'N M Linton, T Kobayashi, Y Yang, K Hayashi, A R Akhmetzhanov, S M Jung'

The following function lets you format the bibliography all at once. It extracts only the title, authors, venue, and year, and separates the entries of the bibliography with a `;`.

In [22]:
bib_formatted = format_bib(bibs[:5])
print(bib_formatted)

Novel Coronavirus (2019-nCoV) situation reports, , , None; Epidemiological characteristics of novel coronavirus infection: A statistical analysis of publicly available case data. medRxiv 2020.01.26, N M Linton, T Kobayashi, Y Yang, K Hayashi, A R Akhmetzhanov, S M Jung, , 2001; Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travelers from Wuhan, China, J A Backer, D Klinkenberg, J Wallinga, Euro Surveill, 2020; Abortive and subclinical poliomyelitis in a family during the 1992 epidemic in The Netherlands, F P Kroon, H T Weiland, A M Van Loon, R Van Furth, Clin Infect Dis, 1995; Achieving measles control: lessons from the 2002-06 measles control strategy for Uganda, W B Mbabazi, M Nanyunja, I Makumbi, Health Policy Plan, 2009
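
As the output above shows, missing fields come through as empty strings or `None`. If that is unwanted, a variant along the following lines could drop them. This is only a sketch: `format_bib_compact` is a hypothetical name, the entries in `toy_bibs` are made up, and it assumes bibliography entries carry the same `title`, `authors`, `venue`, and `year` keys seen earlier.

```python
def format_bib_compact(bibs):
    """Like format_bib, but skips empty or None fields (hypothetical helper)."""
    if isinstance(bibs, dict):
        bibs = list(bibs.values())

    entries = []
    for bib in bibs:
        # Build "first middle last" for each author, ignoring empty parts.
        names = []
        for author in bib.get("authors", []):
            parts = [author.get("first", ""),
                     *author.get("middle", []),
                     author.get("last", "")]
            names.append(" ".join(p for p in parts if p))

        fields = [bib.get("title"), ", ".join(names),
                  bib.get("venue"), bib.get("year")]
        # Keep only fields that actually hold a value.
        entries.append(", ".join(str(f) for f in fields if f not in (None, "")))

    return "; ".join(entries)


# Made-up entries that mimic the structure shown above.
toy_bibs = [
    {"title": "Situation reports", "authors": [], "venue": "", "year": None},
    {"title": "Some study",
     "authors": [{"first": "N", "middle": ["M"], "last": "Linton", "suffix": ""}],
     "venue": "Euro Surveill", "year": 2020},
]
print(format_bib_compact(toy_bibs))
```

Unlike `format_bib`, this variant does not modify its input, so no `deepcopy` is needed.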


## Biorxiv: Generate CSV

In this section, I show you how to manually generate the CSV files. As you can see, it is now very simple thanks to the `format_` helper functions. In the next sections, I show you how to generate them in 3 lines using the `load_files` and `generate_clean_df` helper functions.

In [23]:
cleaned_files = []

for file in tqdm(all_files):
    features = [
        file['paper_id'],
        file['metadata']['title'],
        format_authors(file['metadata']['authors']),
        format_authors(file['metadata']['authors'], 
                       with_affiliation=True),
        format_body(file['abstract']),
        format_body(file['body_text']),
        format_bib(file['bib_entries']),
        file['metadata']['authors'],
        file['bib_entries']
    ]
    
    cleaned_files.append(features)


In [24]:
col_names = [
    'paper_id', 
    'title', 
    'authors',
    'affiliations', 
    'abstract', 
    'text', 
    'bibliography',
    'raw_authors',
    'raw_bibliography'
]

clean_df = pd.DataFrame(cleaned_files, columns=col_names)
clean_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,207dcb6e3cd43cdea91d29a15be1ee34068bd54e,Estimating the Asymptomatic Proportion of 2019...,"Kenji Mizumoto1, Katsushi Kagaya, Alexander Za...",Kenji Mizumoto1 (Kyoto University Yoshida-Naka...,Abstract\n\nThe potential infectiousness of as...,\n\nThe clinical and epidemiological character...,Novel Coronavirus (2019-nCoV) situation report...,"[{'first': 'Kenji', 'middle': [], 'last': 'Miz...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Novel C..."
1,9e780780dbaf35c0d537ddb290dfd484148a3c55,Estimating the Relative Probability of Direct ...,"Sarah V Leavitt, Robyn S Lee, Paola Sebastiani...","Sarah V Leavitt (Boston University, Boston, MA...",Abstract\n\nEstimating infectious disease para...,INTRODUCTION\n\nInfectious disease parameters ...,Transmission parameters of the A / H1N1 ( 2009...,"[{'first': 'Sarah', 'middle': ['V'], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Transmi..."
2,781bb4f96f0bba44488b7e396e8a8ca451e936f8,Title: Influenza-Negative Influenza-Like Illne...,"Fatima N Mirza, Amyn A Malik, Saad B Omer, Phd","Fatima N Mirza (Yale University, 06510, New Ha...",Abstract\n\nThough ideal for determining the b...,\n\nCC-BY-NC 4.0 International license It is m...,An interactive web-based dashboard to track CO...,"[{'first': 'Fatima', 'middle': ['N'], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'An inte..."
3,09274ab34ed74eb299f51e0171c2fddd0092bc18,Predicting the epidemic trend of COVID-19 in C...,"Mengyuan Li, Zhilan Zhang, Shanmei Jiang, Qian...","Mengyuan Li (China Pharmaceutical University, ...",Abstract\n\nBackground: Although COVID-19 has ...,Background\n\nAlthough the spread of COVID-19 ...,Responding to Covid-19 -A Once-in-a-Century Pa...,"[{'first': 'Mengyuan', 'middle': [], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Respond..."
4,254c1e4d661d7fb5954c855387595c3cde0e8156,Environmental contamination of the SARS-CoV-2 ...,"Guangming Ye, Hualiang Lin, Liangjun Chen, Shi...",Guangming Ye (Zhongnan Hospital of Wuhan Unive...,,Introduction\n\nAn outbreak of COVID-19 began ...,The Novel Coronaries Pneumonia Emergency Respo...,"[{'first': 'Guangming', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'The Nov..."


In [25]:
clean_df.to_csv('biorxiv_clean.csv', index=False)
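
One thing to keep in mind when reading the CSV back: the `raw_authors` and `raw_bibliography` columns hold Python lists and dictionaries, which `to_csv` writes out as their string representation. The sketch below shows one way to restore them, using a made-up one-row frame (`toy_df`) in place of `clean_df` and an in-memory buffer in place of the file; it assumes the stored values are plain Python literals, which `ast.literal_eval` can parse.

```python
import ast
import io

import pandas as pd

# Made-up one-row frame standing in for clean_df.
toy_df = pd.DataFrame({
    "paper_id": ["abc123"],
    "raw_authors": [[{"first": "Kenji", "middle": [], "last": "Mizumoto"}]],
    "raw_bibliography": [{"BIBREF0": {"ref_id": "b0", "year": None}}],
})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, the nested columns are plain strings.
print(type(loaded.loc[0, "raw_authors"]))

# Parse the string representations back into lists and dictionaries.
for col in ["raw_authors", "raw_bibliography"]:
    loaded[col] = loaded[col].apply(ast.literal_eval)

print(loaded.loc[0, "raw_authors"][0]["last"])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for this kind of round trip.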

## Generate CSV: Custom (PMC), Commercial, Non-commercial licenses

In [27]:
pmc_dir = '/home/ubuntu/covid19-challenge/data/custom_license/custom_license/pdf_json/'
pmc_files = load_files(pmc_dir)
pmc_df = generate_clean_df(pmc_files)
pmc_df.to_csv('clean_pmc.csv', index=False)
pmc_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,Development of animal models against emerging ...,"Troy C Sutton, Kanta Subbarao","Troy C Sutton (NIAID, NIH, United States), Kan...",Abstract\n\nTwo novel coronaviruses have emerg...,"Introduction\n\nWithin the last two decades, t...",Replication and shedding of MERS-CoV in upper ...,"[{'first': 'Troy', 'middle': ['C'], 'last': 'S...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Replica..."
1,640de65e9f09545c463bc419bffb7084fc40fae5,X-RAY CRYSTALLOGRAPHIC STUDIES OF THE IDIOTYPI...,"Nenad Ban, Alexander Mcpherson","Nenad Ban (University of California, 92521, Ri...",,\n\n1. viral: type B viral hepatitis (Kennedy ...,"Three-dimensional structure of antibodies, P M...","[{'first': 'Nenad', 'middle': [], 'last': 'Ban...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Three-d..."
2,a8676c57d7e3a52378b9e554cc0886ad91999e13,,,,,"\n\nziektegeschiedenis Patiënt A, een 29-jarig...",Onbegrepen klachten bij patiënten met infectie...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Onbegre..."
3,c2ab03046662fc55e0162afc133b4f73ea9ed866,AVCpred: an integrated web server for predicti...,"Abid Qureshi, Gazaldeep Kaur, Manoj Kumar","Abid Qureshi, Gazaldeep Kaur, Manoj Kumar",Abstract\n\nViral infections constantly jeopar...,"\n\nvalidation. Furthermore, similar performan...",Using Locally Weighted Learning to Improve SMO...,"[{'first': 'Abid', 'middle': [], 'last': 'Qure...","{'BIBREF45': {'ref_id': 'b45', 'title': 'Using..."
4,96798549e4d680ff281d40d8d1dd400fb6afaafa,TOPICAL REVIEW Microbial volatile compounds in...,"Robin Michael, Statham Thorn, John Greenman","Robin Michael, Statham Thorn, John Greenman",Abstract\n\nMicrobial cultures and/or microbia...,Introduction\n\nVolatile organic compounds (VO...,Defining the normal bacterial flora of the ora...,"[{'first': 'Robin', 'middle': [], 'last': 'Mic...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Definin..."


In [28]:
comm_dir = '/home/ubuntu/covid19-challenge/data/comm_use_subset/comm_use_subset/pdf_json/'
comm_files = load_files(comm_dir)
comm_df = generate_clean_df(comm_files)
comm_df.to_csv('clean_comm_use.csv', index=False)
comm_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,5da136317f5b97ed8371d5121d8828f1c9a5372d,Congenital Malaria in China,"Zhi-Yong Tao, Qiang Fang, Xue Liu, Richard Cul...","Zhi-Yong Tao (Bengbu Medical College, Bengbu, ...","Abstract\n\nBackground: Congenital malaria, in...",Introduction\n\nMalaria is a mosquito-borne in...,Estimates of child deaths prevented from malar...,"[{'first': 'Zhi-Yong', 'middle': [], 'last': '...","{'BIBREF1': {'ref_id': 'b1', 'title': 'Estimat..."
1,8befdc2bb43130a5e90c11061e8bc8955718a825,,"Melinda Frost, Richun Li, Ronald Moolenaar, Q...","Melinda Frost, Richun Li, Ronald Moolenaar, Q...",Abstract\n\nBackground: Following the SARS out...,"Background\n\nIn 2003, the world was struck by...",Chinese Center for Health Education; CFETP: Ch...,"[{'first': 'Melinda', 'middle': [], 'last': 'F...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Chinese..."
2,1771673809c10324fde2768ce37d548a5077577f,Quercetin Feeding in Newborn Dairy Calves Cann...,"Jeannine Gruse, Ellen Kanitz, Joachim M Weitze...","Jeannine Gruse, Ellen Kanitz, Joachim M Weitze...",Abstract\n\nImmaturity of the neonatal immune ...,Introduction\n\nCalfhood diseases play a key r...,"Calf mortality in Norwegian dairy herds, S M G...","[{'first': 'Jeannine', 'middle': [], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Calf mo..."
3,b8e9fcc34571f9c29e04f9feb34197250556cb9a,Mechanisms of protective immune responses indu...,"Mccoy, Margaret E Mccoy, Hannah E Golden, Tai...","Mccoy, Margaret E Mccoy, Hannah E Golden, Tai...",Abstract\n\nBackground: A lack of defined corr...,Background\n\nThe most basic and desirable out...,Live attenuated malaria vaccine designed to pr...,"[{'first': '', 'middle': [], 'last': 'Mccoy', ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Live at..."
4,2ec6d0d83b08c9b4e1784e319978aff1930d5484,What Effect Did the Global Financial Crisis Ha...,"Philip D Parker, John Jerrim, Jake Anders","Philip D Parker, John Jerrim, Jake Anders",,\n\nThe influence of macrolevel events and con...,Mastering metrics: The path from cause to effe...,"[{'first': 'Philip', 'middle': ['D'], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Masteri..."


In [29]:
noncomm_dir = '/home/ubuntu/covid19-challenge/data/noncomm_use_subset/noncomm_use_subset/pdf_json/'
noncomm_files = load_files(noncomm_dir)
noncomm_df = generate_clean_df(noncomm_files)
noncomm_df.to_csv('clean_noncomm_use.csv', index=False)
noncomm_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,9a0f63b592a07e5823715e548da6f3605faf2bbb,Multiplex SYBR Green Real-Time PCR Assay for D...,"Mozhdeh Sultani, ; Talat, Mokhtari Azad, Moham...",Mozhdeh Sultani (Tehran University of Medical ...,,Background\n\nAcute respiratory infections (AR...,"Use of sensitive, broad-spectrum molecular ass...","[{'first': 'Mozhdeh', 'middle': [], 'last': 'S...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Use of ..."
1,c8e37f08a80fe7f78bcb5ca1e9598bc65c470d07,Supplemental Information Systems Vaccinology I...,"Anne Rechtien, Laura Richert, Hadrien Lorenzo,...","Anne Rechtien, Laura Richert, Hadrien Lorenzo,...",,\n\n. Correlation matrix (Pearson correlation ...,Controlling the false discovery rate: a practi...,"[{'first': 'Anne', 'middle': [], 'last': 'Rech...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Control..."
2,5f52ac4738312fdb215ad06e5a2e886a6fb63efd,SUP PLE MEN TAL MAT ERI AL,Napier,Napier,,\n\n. Biallelic mutations using CRI SPR-Cas9 i...,Enhanced monocyte response and decreased centr...,"[{'first': '', 'middle': [], 'last': 'Napier',...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Enhance..."
3,8dd1202c8cda0d3a3859c3346cfbc6647317b4f3,Use of Highly Pathogenic Avian Influenza A(H5N...,"C Todd Davis, Li-Mei Chen, Claudia Pappas, Jam...","C Todd Davis, Li-Mei Chen, Claudia Pappas, Jam...",Abstract\n\n. Use of highly pathogenic avian i...,\n\nZ oonotic influenza viruses circulating in...,Cumulative number of confirmed human cases of ...,"[{'first': 'C', 'middle': ['Todd'], 'last': 'D...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Cumulat..."
4,6fc30224f0fef04b0cb8feab92a3a89ecb52870f,,,,,Patient transport and operating room managing ...,Jae Hee Woo (Ewha Womans University College of...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Jae Hee..."
