# Introduction

This notebook imports the data needed for the analysis. The data is imported at three points in time, one import per snapshot. Each import requires the following fields: paper ID, abstract, and body text.

A sample of 5,000 papers is used for each point in time (LDA TBC). Each sample consists of 2,500 papers carried over from the previous sample and 2,500 new papers. This recreates the data that would have been available had the analysis been run at that time.
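
The sampling scheme described above can be sketched as a small helper. This is only an illustration under stated assumptions: `build_snapshot_sample`, its parameters, and the toy IDs are hypothetical names that do not appear in the notebook.

```python
import pandas as pd


def build_snapshot_sample(previous_ids, current_ids, n_each, seed=42):
    """Combine n_each carried-over IDs with n_each newly seen IDs.

    previous_ids: IDs used in the previous snapshot's sample (hypothetical input).
    current_ids:  all IDs available in the current snapshot (hypothetical input).
    """
    previous_ids = pd.Series(previous_ids).drop_duplicates()
    current_ids = pd.Series(current_ids).dropna().drop_duplicates()

    # Papers from the previous sample that still exist in the current snapshot
    carried_over = previous_ids[previous_ids.isin(current_ids)]
    # Papers in the current snapshot that were not in the previous sample
    new = current_ids[~current_ids.isin(previous_ids)]

    return pd.concat(
        [
            carried_over.sample(n=n_each, random_state=seed),
            new.sample(n=n_each, random_state=seed),
        ],
        ignore_index=True,
    )


# Toy example: 6 old IDs, of which 4 survive into a 10-ID snapshot
old = ["a", "b", "c", "d", "x", "y"]
current = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
sample = build_snapshot_sample(old, current, n_each=3)
```

Fixing `random_state` keeps the sample reproducible between runs, which matters when later snapshots depend on which papers were drawn earlier.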

In [1]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

import numpy as np
from langdetect import detect
import re

# April 2020

This data is currently imported from the cleaned CSV created by the following Kaggle submission: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv/data?select=clean_pmc.csv

In [16]:
data_april = pd.read_csv("data/2020-04/clean_pmc.csv", nrows=5000, usecols=[0])
data_april.shape

(5000, 1)
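
The introduction lists paper ID, abstract, and body text as required fields, whereas the cell above loads only the first column. A minimal sketch of selecting columns by name rather than by position follows. The column names `paper_id`, `abstract`, and `text` are assumptions about the layout of `clean_pmc.csv`, and the in-memory CSV is a made-up stand-in for the real file.

```python
import io

import pandas as pd

# Stand-in for clean_pmc.csv; the column names are assumed, not confirmed
csv_text = io.StringIO(
    "paper_id,title,abstract,text\n"
    "id1,Title one,Abstract one,Body one\n"
    "id2,Title two,Abstract two,Body two\n"
)

# Select the required fields by name so the import does not depend on column order
required = ["paper_id", "abstract", "text"]
df = pd.read_csv(csv_text, usecols=required, nrows=5000)
```

Selecting by name makes the import fail loudly if a required column is missing, instead of silently loading a different column.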

In [17]:
data_april.head()

Unnamed: 0,paper_id
0,14572a7a9b3e92b960d92d9755979eb94c448bb5
1,bb790e8366da63c4f5e2d64fa7bbd5673b93063c
2,24f204ce5a1a4d752dc9ea7525082d225caed8b3
3,f5bc62a289ef384131f592ec3a8852545304513a
4,ab78a42c688ac199a2d5669e42ee4c39ff0df2b8


In [25]:
data_april.nunique()

paper_id    5000
dtype: int64

In [None]:
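# Only keep rows that contain at least one letter, because
# detect() raises an error on text without any.
# NOTE: `data` is not defined earlier in this notebook; it is assumed to be
# a DataFrame with a 'text' column.
# re.search is used because re.match with '.*' stops at the first newline.
data_cleaned = data[data['text'].apply(lambda x: bool(re.search('[a-zA-Z]', x)))]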
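
An alternative to pre-filtering is to wrap the detector so that unclassifiable text yields `None` instead of raising. The sketch below is an assumption-laden illustration: `safe_detect` and `fake_detector` are hypothetical names, and the detector is passed in as a parameter so the example runs without `langdetect` installed. In the notebook, `langdetect.detect` would be passed instead.

```python
import re

HAS_LETTER = re.compile(r"[a-zA-Z]")


def safe_detect(text, detector):
    """Return detector(text), or None when the text cannot be classified."""
    # Skip missing values and text without any letters
    if not isinstance(text, str) or not HAS_LETTER.search(text):
        return None
    try:
        return detector(text)
    except Exception:  # langdetect raises LangDetectException on failure
        return None


# Stand-in detector for illustration only; it always answers "en"
def fake_detector(text):
    return "en"


results = [safe_detect(t, fake_detector) for t in ["Some text", "12345", None]]
```

The langdetect README also notes that results can vary between runs unless `DetectorFactory.seed` is set, which is worth considering if the predictions are cached and compared later.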

In [None]:
# create pd.Series that predicts the language of the text. Because this takes ~30 mins to run,
# export the result to CSV which can then be imported. The two lines of code that create the CSV
# have been commented out.

# lang = data_cleaned['text'].progress_apply(detect)
# lang.to_csv("Data/predicted_lang.csv")
lang = pd.read_csv("Data/predicted_lang.csv", index_col=0).squeeze()
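
The comment-out-and-reload pattern above could also be expressed as a compute-or-load helper, so no lines need to be toggled by hand. This is a sketch only: `load_or_compute` is a hypothetical helper, and the temporary directory and toy Series stand in for the real cache file and language predictions.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Load a cached Series from CSV if it exists, else compute and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0).squeeze("columns")
    result = compute()
    result.to_csv(path)
    return result


# Toy usage with a temporary directory in place of the real cache location
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "predicted_lang.csv")
    # First call computes and writes the cache
    first = load_or_compute(cache_path, lambda: pd.Series(["en", "fr"], name="lang"))
    # Second call reads the cache; the compute function is never invoked
    second = load_or_compute(cache_path, lambda: pd.Series(["should", "not", "run"]))
```

In the notebook, the `compute` argument would be something like `lambda: data_cleaned['text'].progress_apply(detect)`.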

In [None]:
# keep only English text
data_eng = data_cleaned[lang == 'en']
print('Rows before removing non-english:', data_cleaned.shape[0])
print('Rows after removing non-english:', data_eng.shape[0])

Rows before removing non-english: 17526
Rows after removing non-english: 17045


# June 2020

In [18]:
metadata_june = pd.read_csv("data/2020-06/metadata.csv")
metadata_june.shape

(140532, 19)

In [19]:
metadata_june.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636.0,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967.0,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972.0,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871.0,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888.0,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


Drop duplicated papers using the CORD UID so that the lookup between papers is 1:1.

In [21]:
metadata_june.drop_duplicates('cord_uid', inplace=True)
metadata_june.shape

(139532, 19)

In [33]:
data_april['paper_id'].isin(metadata_june['sha']).sum()

3599

Of the 5,000 April papers, 3,599 also appear in the June data. We'll randomly select 2,500 of these.

In [61]:
previous_papers = data_april['paper_id'][data_april['paper_id'].isin(metadata_june['sha'])]
previous_papers_sample = previous_papers.sample(n=2500, random_state=42)
previous_papers_sample.shape

(2500,)

Then we need to select 2,500 new papers from the June data.

In [74]:
new_papers = metadata_june['sha'][~metadata_june['sha'].isin(data_april['paper_id'])]
new_papers = new_papers[new_papers.notnull()]
new_papers_sample = new_papers.sample(n=2500, random_state=42)
new_papers_sample.shape

(2500,)

Sanity check: none of the new June papers should appear among the sampled April papers.

In [75]:
any(new_papers_sample.isin(previous_papers_sample))

False
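
One caveat: a `sha` value in `metadata.csv` can hold several hashes separated by `; `, and an exact `isin` comparison never matches such rows. A paper that was in the April data could therefore be counted as new. The sketch below shows one possible way to handle this by splitting the hashes into one row each. The frames and IDs are toy stand-ins, and `sha_long`, `seen_uids`, and `exact_match` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for metadata_june and data_april['paper_id']
metadata = pd.DataFrame({
    "cord_uid": ["u1", "u2", "u3"],
    "sha": ["aaa", "bbb; ccc", None],
})
april_ids = pd.Series(["ccc", "zzz"], name="paper_id")

# One row per (cord_uid, single hash)
sha_long = (
    metadata.dropna(subset=["sha"])
    .assign(sha=lambda d: d["sha"].str.split("; "))
    .explode("sha")
)

# cord_uids with at least one hash among the April IDs
seen_uids = set(sha_long.loc[sha_long["sha"].isin(april_ids), "cord_uid"])

# For comparison: the exact-match approach misses the multi-hash row
exact_match = metadata["sha"].isin(april_ids).sum()
```

Here the exact match finds no overlap, while the split version identifies `u2` as already seen.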

In [78]:
# combine the carried-over and new paper IDs into the June sample
june_papers_to_use = pd.concat([previous_papers_sample, new_papers_sample])
june_papers_to_use

1861               8fd61d620483a0c7f420c44888a0f7607aa91eb3
2231               92a7e4ead4b225560c768ff6033653ab3c1d686d
3061               2ca704516c253ae250de0ed2598315aa1beceda8
3969               5a2bbe485917282b8eff252ec710a1e456768dfe
303                baa783a36a0a189d01d4ac57d975e49b807b0f95
                                ...                        
116644             6eca85c97b8b0cc6af6f18532615a3fb9e126a24
112546    aada914c9454237bad1a0dd33ae257e8f9d2ca2f; 308e...
12412              c91f21cf133b4b9e4d6a725d0dc05b4083a20bbe
60176              4f79b4f685424d235bb6e1e2402c87424307afdd
1764               20656a048091de934d0fb432f8d070304a1664dd
Length: 5000, dtype: object