# Introduction

This notebook imports the data needed for the analysis. Data is imported for three points in time, and each import needs three fields: paper ID, abstract, and body text.

A sample of 5,000 papers will be used for each point in time (LDA TBC). Each sample after the first will consist of 2,500 papers that were in the previous subset and 2,500 new papers. This recreates the data that would have been available had this analysis been run at the time.
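The sampling plan above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: `build_subset`, its parameters, and the toy IDs are hypothetical, and the top-up rule (fill with new papers when fewer than `n_carry` previous papers survive) is an assumption.

```python
import random


def build_subset(previous_subset, available_ids, n_total=5000, n_carry=2500, seed=42):
    """Pick n_total paper IDs for one snapshot (hypothetical helper).

    Carries over at most n_carry IDs from previous_subset that still exist in
    available_ids, then tops up with randomly sampled IDs not seen before.
    """
    available = set(available_ids)
    seen_before = set(previous_subset)
    rng = random.Random(seed)

    # Papers from the last subset that are still present in this snapshot.
    survivors = [pid for pid in previous_subset if pid in available]
    carried = survivors if len(survivors) <= n_carry else rng.sample(survivors, n_carry)

    # Sort the pool so the sample is reproducible for a given seed.
    new_pool = sorted(available - seen_before)
    n_new = n_total - len(carried)
    if n_new > len(new_pool):
        raise ValueError("not enough new papers to reach n_total")
    return carried + rng.sample(new_pool, n_new)


# Toy IDs: only 3 of the 10 previous papers survive into the next snapshot.
april = [f"a{i}" for i in range(10)]
june = ["a0", "a1", "a2"] + [f"j{i}" for i in range(20)]
subset = build_subset(april, june, n_total=10, n_carry=5)
```

In the toy call, the three surviving papers are all carried over and the remaining seven slots are filled with new papers.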

In [2]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

import numpy as np
from langdetect import detect
import re
import joblib

# April 2020

The April data is currently imported from the cleaned CSV created by this Kaggle submission: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv/data?select=clean_pmc.csv

In [2]:
if False:
    data_april = pd.read_csv("data/2020-04/clean_pmc.csv", nrows=5000, usecols=[0])
    joblib.dump(data_april, "outputs/april_ids_5k.pkl")
    data_april.shape

In [11]:
joblib.dump(data_april['paper_id'], "outputs/ids_april_5000.pkl")

['outputs/ids_april_5000.pkl']

In [12]:
data_april = joblib.load("outputs/ids_april_5000.pkl")
data_june = joblib.load("outputs/ids_june_5000.pkl")

In [13]:
data_april.head()

0    14572a7a9b3e92b960d92d9755979eb94c448bb5
1    bb790e8366da63c4f5e2d64fa7bbd5673b93063c
2    24f204ce5a1a4d752dc9ea7525082d225caed8b3
3    f5bc62a289ef384131f592ec3a8852545304513a
4    ab78a42c688ac199a2d5669e42ee4c39ff0df2b8
Name: paper_id, dtype: object

In [14]:
data_june.head()

0    f5bc62a289ef384131f592ec3a8852545304513a
1    31105078a2953217223699d09c6a80d0f5edfdf6
2    e16072734dff5a66dcdd1c147957daf02444f84d
3    dff25248c855cb84f8a9c5e07c1a220675c023d7
4    6a4a0de0df03bfd7860d0732ea9d42daeac06b10
Name: paper_id, dtype: object

In [24]:
data_april['paper_id'].nunique()

5000

In [6]:
# only keep rows that contain an alpha character (a letter) as
# detect() throws an error if this isn't the case.
# re.search scans the whole string, including text after a newline
data_cleaned = data[data['text'].apply(lambda x: bool(re.search('[a-zA-Z]', x)))]

NameError: name 'data' is not defined
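A small stdlib sketch of the letter check, for texts that span several paragraphs. By default `.` does not match a newline, so a pattern of the form `.*[a-zA-Z]+` used with `re.match` only inspects the first line, whereas `re.search` with a bare character class scans the whole string. The helper name `has_letter` and the sample strings are made up for illustration.

```python
import re

HAS_LETTER = re.compile(r"[a-zA-Z]")


def has_letter(text):
    """Return True if text is a string containing at least one ASCII letter."""
    return isinstance(text, str) and HAS_LETTER.search(text) is not None


samples = ["12345", "", "42\n\nResults were mixed.", "Abstract"]
flags = [has_letter(s) for s in samples]

# The anchored pattern misses the letters that follow the first newline.
anchored = re.match(".*[a-zA-Z]+", "42\n\nResults were mixed.")
```

The `isinstance` guard also keeps missing values (for example `NaN` floats) from raising a `TypeError`.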

In [None]:
# create pd.Series that predicts the language of the text. Because this takes ~30 mins to run,
# export the result to CSV which can then be imported. The two lines of code that create the CSV
# have been commented out.

# lang = data_cleaned['text'].progress_apply(detect)
# lang.to_csv("Data/predicted_lang.csv")
lang = pd.read_csv("Data/predicted_lang.csv", index_col=0).squeeze()

In [None]:
# filter only english text
data_eng = data_cleaned[lang == 'en']
print('Rows before removing non-english:', data_cleaned.shape[0])
print('Rows after removing non-english:', data_eng.shape[0])

Rows before removing non-english: 17526
Rows after removing non-english: 17045
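The compute-once, reload-later pattern used for the language predictions can be wrapped in a small helper. This is a sketch under assumptions: `load_or_compute` is a hypothetical name, it caches with `pickle` rather than the CSV used above, and `slow_step` stands in for the roughly 30-minute detection run.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if it exists, else run compute() and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_step():
    """Stand-in for an expensive computation; records each invocation."""
    calls.append(1)
    return ["en", "fr", "en"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "outputs" / "predicted_lang.pkl"
    first = load_or_compute(cache, slow_step)   # computes and writes the cache
    second = load_or_compute(cache, slow_step)  # reads the cache instead
```

Compared with commenting lines in and out, this keeps the expensive call in the notebook while still skipping it on later runs.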


# June 2020

In [4]:
import glob

In [5]:
all_papers_june = glob.glob("data/2020-06/pdf_json/*")
# raw string so that \. is read as a regex escape, not a string escape
all_papers_june_id = [re.findall(r"([a-z0-9]+)\.json", paper)[0] for paper in all_papers_june]
len(all_papers_june_id)

28523
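As an alternative to a regular expression, the ID can be read from the file name with `pathlib`, and duplicates can be listed rather than only counted. This is a sketch: the helper names are hypothetical and it assumes every file is named `<paper_id>.json`.

```python
from collections import Counter
from pathlib import Path


def paper_id_from_path(path):
    """Return the file name without its final extension."""
    return Path(path).stem


def duplicated_ids(ids):
    """Return the IDs that occur more than once."""
    return [pid for pid, count in Counter(ids).items() if count > 1]


paths = [
    "data/2020-06/pdf_json/f5bc62a289ef384131f592ec3a8852545304513a.json",
    "data/2020-06/pdf_json/31105078a2953217223699d09c6a80d0f5edfdf6.json",
]
ids = [paper_id_from_path(p) for p in paths]
duplicates = duplicated_ids(ids + ids[:1])
```

Listing the offending IDs makes a failed uniqueness check easier to investigate than comparing two lengths.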

Check there aren't any duplicate IDs in the June papers.

In [6]:
len(set(all_papers_june_id))

28523

In [7]:
data_april['paper_id'].isin(all_papers_june_id).sum()

1228

There are 1,228 papers in the June data that were also in the April data. We'll use all of these in the subset.

In [8]:
previous_papers = data_april[data_april['paper_id'].isin(all_papers_june_id)]

Then we need to select 3,772 new papers from the June data (5,000 − 1,228) to get a subset of 5,000.

In [9]:
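# build the set once so each membership test is a hash lookup, not an array scan
april_ids = set(data_april['paper_id'])
new_papers = [paper for paper in all_papers_june_id if paper not in april_ids]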
# convert to series to enable use of sample method
new_papers = pd.Series(new_papers, name='paper_id')
new_papers_sample = new_papers.sample(n=3772, random_state=42)
new_papers_sample.shape

(3772,)

Sanity check that none of the newly sampled June papers were in the April papers.

In [26]:
# True only if no sampled ID appears among the April IDs
not new_papers_sample.isin(data_april['paper_id']).any()

True
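The same kind of check can be phrased with sets, which also reports which IDs overlap when the check fails. A minimal sketch with made-up IDs; `overlapping_ids` is a hypothetical helper.

```python
def overlapping_ids(new_ids, old_ids):
    """Return the set of IDs present in both collections."""
    return set(new_ids) & set(old_ids)


old_ids = ["a0", "a1", "a2"]
clean_sample = ["j0", "j1", "j2"]
leaky_sample = ["j0", "a1"]

clean_overlap = overlapping_ids(clean_sample, old_ids)
leaky_overlap = overlapping_ids(leaky_sample, old_ids)

# set.isdisjoint gives the yes/no answer without building the intersection.
is_clean = set(clean_sample).isdisjoint(old_ids)
```

An empty result means the two samples share no papers.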

In [11]:
june_papers_to_use = pd.concat([previous_papers['paper_id'], new_papers_sample], ignore_index=True)
june_papers_to_use

0       f5bc62a289ef384131f592ec3a8852545304513a
1       31105078a2953217223699d09c6a80d0f5edfdf6
2       e16072734dff5a66dcdd1c147957daf02444f84d
3       dff25248c855cb84f8a9c5e07c1a220675c023d7
4       6a4a0de0df03bfd7860d0732ea9d42daeac06b10
                          ...                   
4995    0036b28fddf7e93da0970303672934ea2f9944e7
4996    b09362a1f23af2b6603e7b92b58003b7f9719840
4997    e12d7aa58d54b0ff35deeec1ebba036e91c81468
4998    e7a4eae5bc97a5dc97189e3faa40ef9a91bb3207
4999    1ccd924dbf169e600355923da3de6e6a8ac217c1
Name: paper_id, Length: 5000, dtype: object

Confirm all the June papers to use are in the June data.

In [27]:
# subset test: every selected ID must be one of the June IDs
set(june_papers_to_use) <= set(all_papers_june_id)

True

Import the abstract and body text of each paper into a pandas DataFrame.

In [13]:
import json

In [14]:
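data_json = []
for paper in june_papers_to_use:
    paper_name = 'data/2020-06/pdf_json/' + paper + '.json'
    try:
        # parse each file once and close the handle afterwards
        with open(paper_name, 'rb') as f:
            paper_json = json.load(f)
    except (OSError, ValueError) as err:
        # skip unreadable files rather than re-appending the previous paper;
        # any skip shows up in the length check further down
        print("Error raised when parsing paper: " + paper_name, err)
        continue
    data_json.append(paper_json)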
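One way to keep paper IDs aligned with their parsed content, even when some files fail to load, is to key the results by ID and collect failures separately. This is a sketch only: `load_papers` and its return shape are assumptions, and the demo writes throwaway files to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def load_papers(paper_ids, json_dir):
    """Parse <json_dir>/<paper_id>.json for each ID.

    Returns (loaded, failed): a dict of parsed papers keyed by ID, and a list
    of (paper_id, error) pairs for files that could not be read or parsed.
    """
    loaded, failed = {}, []
    for paper_id in paper_ids:
        path = Path(json_dir) / f"{paper_id}.json"
        try:
            with path.open("r", encoding="utf-8") as handle:
                loaded[paper_id] = json.load(handle)
        except (OSError, ValueError) as err:
            # OSError: missing or unreadable file; ValueError: invalid JSON or bad encoding
            failed.append((paper_id, repr(err)))
    return loaded, failed


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.json").write_text('{"abstract": [], "body_text": []}', encoding="utf-8")
    Path(tmp, "bad.json").write_text("{not valid json", encoding="utf-8")
    loaded, failed = load_papers(["good", "bad", "missing"], tmp)
    failed_ids = [paper_id for paper_id, _ in failed]
```

Because the result is keyed by ID, a later DataFrame can be built from `loaded` without relying on list positions matching the ID series.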

In [15]:
abstracts = []
body_texts = []
for paper in data_json:
    abstract = paper['abstract']
    body_text = paper['body_text']
    # an empty list means the paper has no abstract; otherwise keep
    # the first abstract paragraph only
    if not abstract:
        abstracts.append(np.nan)
    else:
        abstracts.append(abstract[0]['text'])

    # concatenate the body paragraphs, each followed by a blank line
    body_texts.append(''.join(p['text'] + '\n\n' for p in body_text))
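The extraction step can also be written as a function over a single parsed paper. Note the differences from the cell above, which are choices made for this sketch rather than the notebook's behaviour: it joins all abstract paragraphs instead of keeping only the first, it returns `None` instead of `np.nan` for a missing abstract, and it leaves no trailing blank line. `extract_text` is a hypothetical name, and the input layout (lists of dicts with a `text` key) follows the fields accessed above.

```python
def extract_text(paper):
    """Return (abstract, body) strings for one parsed paper dict."""
    abstract_parts = paper.get("abstract") or []
    body_parts = paper.get("body_text") or []
    # Join paragraphs with a blank line; an empty abstract becomes None.
    abstract = "\n\n".join(part["text"] for part in abstract_parts) or None
    body = "\n\n".join(part["text"] for part in body_parts)
    return abstract, body


example = {
    "abstract": [{"text": "First abstract paragraph."}, {"text": "Second one."}],
    "body_text": [{"text": "Intro."}, {"text": "Methods."}],
}
abstract, body = extract_text(example)
empty_abstract, empty_body = extract_text({"abstract": [], "body_text": []})
```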

In [39]:
len(abstracts) == len(body_texts) == 5000

True

In [17]:
data_dict = {'paper_id': june_papers_to_use.values, 'abstract': abstracts, 'text': body_texts}
data_june = pd.DataFrame(data_dict)
data_june

| | paper_id | abstract | text |
|---|---|---|---|
| 0 | f5bc62a289ef384131f592ec3a8852545304513a | | Worldwide, the leading causes of death in neon... |
| 1 | 31105078a2953217223699d09c6a80d0f5edfdf6 | | Worldwide, the leading causes of death in neon... |
| 2 | e16072734dff5a66dcdd1c147957daf02444f84d | | Worldwide, the leading causes of death in neon... |
| 3 | dff25248c855cb84f8a9c5e07c1a220675c023d7 | | Worldwide, the leading causes of death in neon... |
| 4 | 6a4a0de0df03bfd7860d0732ea9d42daeac06b10 | An outbreak of aseptic meningitis occurred in ... | E nteroviruses circulate worldwide and are the... |
| ... | ... | ... | ... |
| 4995 | 0036b28fddf7e93da0970303672934ea2f9944e7 | and Blautia (P = 0.008) significantly decrease... | human type 1 DM. The aim of this study was to ... |
| 4996 | b09362a1f23af2b6603e7b92b58003b7f9719840 | Vertebrate interferon-induced transmembrane (I... | First discovered by cDNA library screening in ... |
| 4997 | e12d7aa58d54b0ff35deeec1ebba036e91c81468 | This paper presents an applied study of schedu... | The bus transit scheduling problem is of great... |
| 4998 | e7a4eae5bc97a5dc97189e3faa40ef9a91bb3207 | Astragali radix (AR) is one of the most widely... | Astragali radix (AR), also well-known as Huang... |


In [19]:
joblib.dump(data_june, 'outputs/data_june_5k.pkl')

['outputs/data_june_5k.pkl']

**TODO:**
- Don't include papers with missing abstracts
- Repeat process for August data