# Metadata Analysis

In [1]:
%load_ext autoreload
%autoreload 2

import re
from pathlib import Path, PurePath

import numpy as np
import pandas as pd
from IPython.display import display

from cord.cord19 import ResearchPapers

# Show long text columns and large frames without truncation
pd.options.display.max_colwidth = 200
pd.options.display.max_rows = 4000

In [2]:
data_path = Path('data') / 'CORD-19-research-challenge'
# data_path is already a Path, so it can be joined directly
metadata_path = data_path / 'metadata.csv'

## Load Metadata

In [3]:
from cord.core import describe_dataframe
metadata = ResearchPapers.load_metadata()
describe_dataframe(metadata)

Loading metadata from data\CORD-19-research-challenge
Cleaning metadata
Applying tags to metadata


| column | non-null | null | unique | duplicate | most common |
|---|---|---|---|---|---|
| cord_uid | 128492 | 0 | 128162.0 | 330 | f8hgcngj |
| sha | 55751 | 72741 | 55748.0 | 3 | 0ed3c6a5559cd73307184f51fc53ccc76da559bc |
| source | 128492 | 0 | 33.0 | 128459 | Medline |
| title | 128492 | 0 | 118159.0 | 10333 | |
| doi | 128492 | 0 | 100162.0 | 28330 | |
| pmcid | 60771 | 67721 | 60771.0 | 0 | PMC7184393 |
| pubmed_id | 99124 | 29368 | 92197.0 | 6927 | 15473650 |
| license | 128492 | 0 | 17.0 | 128475 | unk |
| abstract | 128492 | 0 | 118732.0 | 9760 | |
| published | 128477 | 15 | 7100.0 | 121377 | 2020-01-01 00:00:00 |
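
In the summary above, the `duplicate` count equals `non-null` minus `unique` in the rows shown (for `cord_uid`, 128492 - 128162 = 330). The following is a minimal sketch of how comparable per-column counts could be computed with plain pandas. The helper name `summarize_columns` and the toy frame are hypothetical, and this is not necessarily how `describe_dataframe` is implemented.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts similar to the summary table (hypothetical helper)."""
    rows = {}
    for name in df.columns:
        col = df[name]
        non_null = int(col.notna().sum())
        unique = int(col.nunique(dropna=True))
        modes = col.mode(dropna=True)
        rows[name] = {
            "non-null": non_null,
            "null": int(col.isna().sum()),
            "unique": unique,
            # Same relationship as in the table: duplicate = non-null - unique
            "duplicate": non_null - unique,
            # First mode if the column has any non-null value
            "most common": modes.iloc[0] if not modes.empty else None,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data for illustration only, not taken from metadata.csv
sample = pd.DataFrame({
    "cord_uid": ["a1", "a1", "b2", "c3"],
    "pmcid": ["PMC1", None, None, "PMC2"],
})
summary = summarize_columns(sample)
print(summary)
```

A non-zero `duplicate` count for an identifier column such as `cord_uid` is a hint that rows may need de-duplicating before they are used as keys.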


In [4]:
papers = ResearchPapers.load()

Loading metadata from data\CORD-19-research-challenge
Cleaning metadata
Applying tags to metadata

Indexing research papers
Creating the BM25 index from the abstracts of the papers
Use index="text" if you want to index the texts of the paper instead
Finished Indexing in 98.0 seconds
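
The log above says a BM25 index is built from the abstracts. As a rough illustration of what BM25 ranking does, here is a minimal, self-contained Okapi BM25 sketch. The class `TinyBM25`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy abstracts are all assumptions made for this example; the source does not show which BM25 variant or library the `cord` package actually uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


class TinyBM25:
    """Okapi BM25 over a small in-memory corpus (illustrative only)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(d) for d in documents]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(self.doc_tokens)
        n_docs = len(documents)
        # Document frequency: number of documents containing each term
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        # Smoothed IDF that is always positive
        self.idf = {
            term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            for term, n in df.items()
        }

    def score(self, query, index):
        """BM25 score of one document for the query."""
        freqs = self.doc_freqs[index]
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[index]) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            total += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def search(self, query, top_n=3):
        """Return (score, document position) pairs, best match first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored = [(s, i) for s, i in scored if s > 0]
        return sorted(scored, reverse=True)[:top_n]


# Made-up abstracts for illustration only
abstracts = [
    "coronavirus spike protein binding",
    "influenza vaccine trial results",
    "coronavirus vaccine candidate in mice",
]
bm25 = TinyBM25(abstracts)
results = bm25.search("coronavirus vaccine")
print(results)
```

The third toy abstract contains both query terms, so it ranks first even though it is the longest document; the length normalisation only discounts each term's contribution, it does not remove it.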


In [5]:
papers

| Papers | Oldest | Newest | SARS-COV-2 | SARS | Coronavirus | Virus | Antivirals |
|---|---|---|---|---|---|---|---|
| 128492 | 1870-01-01 | 2021-12-31 | 26799 | 7491 | 25755 | 54345 | 1083 |

index,title,abstract,journal,authors,published,when
0,"Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospita...,BMC Infect Dis,"Madani, Tariq A; Al-Ghamdi, Aisha A",2001-07-04,19 years ago
1,Nitric oxide: a pro-inflammatory mediator in lung disease?,Inflammatory diseases of the respiratory tract are commonly associated with elevated production of nitric oxide (NO•) and increased indices of NO• -dependent oxidative stress. Although NO• is know...,Respir Res,"Vliet, Albert van der; Eiserich, Jason P; Cross, Carroll E",2000-08-15,20 years ago
2,Surfactant protein-D and pulmonary host defense,"Surfactant protein-D (SP-D) participates in the innate response to inhaled microorganisms and organic antigens, and contributes to immune and inflammatory regulation within the lung. SP-D is synth...",Respir Res,"Crouch, Erika C",2000-08-25,20 years ago
3,Role of endothelin-1 in lung disease,"Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflamm...",Respir Res,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",2001-02-22,19 years ago
4,Gene expression in epithelial cells in response to pneumovirus infection,"Respiratory syncytial virus (RSV) and pneumonia virus of mice (PVM) are viruses of the family Paramyxoviridae, subfamily pneumovirus, which cause clinically important respiratory infections in hum...",Respir Res,"Domachowske, Joseph B; Bonville, Cynthia A; Rosenberg, Helene F",2001-05-11,19 years ago
...,...,...,...,...,...,...
128487,Impact of BSE on the biotechnology industry — detection and risk assessment,This feature briefly looks at filtration technologies that are being developed by Pall Corp to reduce the potential risk of ‘Mad Cow Disease’ in biological products and drugs. It also provides det...,Membrane Technology,"Atkinson, Simon",2004-10-31,15 years ago
128488,Evolving Gene Targets and Technology in Influenza Detection,"Influenza viruses cause recurring epidemic outbreaks every year associated with high morbidity and mortality. Despite extensive research and surveillance efforts to control influenza outbreaks, th...",Mol Diagn Ther,"Malanoski, Anthony P.; Lin, Baochuan",2013-05-18,7 years ago
128489,How to train health personnel to protect themselves from SARS-CoV-2 (novel coronavirus) infection when caring for a patient or suspected case,How to train health personnel to protect themselves from SARS-CoV-2 (novel coronavirus) infection when caring for a patient or suspected case,J Educ Eval Health Prof,"Huh, Sun",2020-03-07,2 months ago
128490,Epidemic spreading in complex networks,"The study of epidemic spreading in complex networks is currently a hot topic and a large body of results have been achieved. In this paper, we briefly review our contributions to this field, which...",Front Phys China,"Zhou, Jie; Liu, Zong-hua",2008-07-08,12 years ago


In [17]:
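# Keep only rows that have a url, then select those whose url field lists several URLs separated by ';'
metadata_nonnulls = metadata.dropna(subset=['url'])
metadata_multi = metadata_nonnulls[metadata_nonnulls.url.str.contains(';', regex=False)]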

In [25]:
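# regex=False matches the literal substring; by default '.' would be treated as a regex wildcard
metadata_multi[metadata_multi.url.str.contains('https://api.elsevier', regex=False)]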

index,cord_uid,sha,source,title,doi,pmcid,pubmed_id,license,abstract,published,...,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,when,covid_related,virus,coronavirus,sars
57460,07redqbb,,Elsevier,,10.1016/b978-0-12-812343-0.09995-1,,,els-covid,,2017-12-31,...,,,,https://api.elsevier.com/content/article/pii/B9780128123430099951; https://www.sciencedirect.com/science/article/pii/B9780128123430099951,214757615.0,2 years ago,False,False,False,False
57475,dthdszdn,75cf04320213a3ed0e444333d095f803065bd0ee,Elsevier,Cardiovascular implications of the COVID-19 pandemic: a global perspective,10.1016/j.cjca.2020.05.018,,,els-covid,"The Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), represents the pandemic of the century, with approximately 3.5 million cases and 2...",2020-05-16,...,,document_parses/pdf_json/75cf04320213a3ed0e444333d095f803065bd0ee.json,,https://api.elsevier.com/content/article/pii/S0828282X20304645; https://www.sciencedirect.com/science/article/pii/S0828282X20304645?v=s5,218657226.0,2 weeks ago,True,True,True,False
57478,3qmag7ny,,Elsevier,"The Political Economy of the SARS Epidemic: The Impact on Human Resources in East Asia, Grace O.M. Lee, Malcolm Warner, Routledge London and New York, 2008, xxii+168 pp. $19.95, ISBN: 0-415-39498-8.",10.1016/j.asieco.2008.09.008,,,els-covid,"The Political Economy of the SARS Epidemic: The Impact on Human Resources in East Asia, Grace O.M. Lee, Malcolm Warner, Routledge London and New York, 2008, xxii+168 pp. $19.95, ISBN: 0-415-39498-8.",2009-01-31,...,,,,https://api.elsevier.com/content/article/pii/S1049007808000900; https://www.sciencedirect.com/science/article/pii/S1049007808000900,152480814.0,11 years ago,False,False,False,True
57480,sa2c3z14,62cc1be8f41f61f0b55f07df753b74e2f4c4b4ce,Elsevier,Manejo de pacientes de Ortopedia y Traumatología en el contexto de la contingencia por covid-19: revisión de conceptos actuales,10.1016/j.rccot.2020.05.001,,,els-covid,"Resumen El objetivo del presente estudio es presentar una revisión de la literatura disponible, que permita abordar de forma ordenada la evidencia actual con respecto a la organización de un servi...",2020-05-16,...,,document_parses/pdf_json/62cc1be8f41f61f0b55f07df753b74e2f4c4b4ce.json,,https://api.elsevier.com/content/article/pii/S0120884520300468; https://www.sciencedirect.com/science/article/pii/S0120884520300468?v=s5,218656924.0,2 weeks ago,True,False,False,False
57485,5kohk0c3,46c645c9ad604d7cb3740e444618b75e32012d66,Elsevier,"Impact of urbanization on pollution-related agricultural input intensity in Hubei, China",10.1016/j.ecolind.2015.11.002,,,els-covid,"Agricultural input intensity increases significantly during the rapid urbanization in China, which has contributed to the increasingly serious non-point pollution. Using the vector autoregression...",2016-03-31,...,,document_parses/pdf_json/46c645c9ad604d7cb3740e444618b75e32012d66.json,,https://api.elsevier.com/content/article/pii/S1470160X15006330; https://www.sciencedirect.com/science/article/pii/S1470160X15006330,86132905.0,4 years ago,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128482,ks6tcyka,1f0097fb57b078c0e81d219f65f218d571314be2,Elsevier; Medline; PMC,APOBEC3G directly binds Hepatitis B virus core protein in cell and cell free systems,10.1016/j.virusres.2010.05.009,PMC7173111,20510315,els-covid,APOBEC3G (A3G) is an intrinsic antiretroviral factor which can inhibit Hepatitis B virus (HBV) replication. This antiviral activity mainly depends on A3G incorporation into viral particles. Howev...,2010-08-31,...,,document_parses/pdf_json/1f0097fb57b078c0e81d219f65f218d571314be2.json,document_parses/pmc_json/PMC7173111.xml.json,https://api.elsevier.com/content/article/pii/S0168170210001760; https://www.sciencedirect.com/science/article/pii/S0168170210001760,22731338.0,10 years ago,False,True,False,False
128483,71omypkg,0a003aa69f43cee4357f1e943df79a8b87c0a88e; 64f48f817331fc485485f543207eee3d76f1b022,Elsevier; PMC,Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen,10.1016/j.heliyon.2016.e00158,PMC5035357,27699279,cc-by-nc-nd,"Bovine Respiratory Disease is the most costly disease that affects beef and dairy cattle industry. Its etiology is multifactorial, arising from predisposing environmental stress conditions as well...",2016-09-19,...,,document_parses/pdf_json/0a003aa69f43cee4357f1e943df79a8b87c0a88e.json; document_parses/pdf_json/64f48f817331fc485485f543207eee3d76f1b022.json,document_parses/pmc_json/PMC5035357.xml.json,https://api.elsevier.com/content/article/pii/S2405844016310933; https://www.sciencedirect.com/science/article/pii/S2405844016310933,15303860.0,4 years ago,False,True,False,False
128484,faec051u,89c97f15ddca5f7b2f4bf2757ee4a5cd7fd2a651,Elsevier; Medline; PMC; WHO,Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study,10.1016/s2589-7500(20)30026-1,PMC7158945,32309796,no-cc,": As the outbreak of coronavirus disease 2019 (COVID-19) progresses, epidemiological data are needed to guide situational awareness and intervention strategies. Here we describe efforts to compile...",2020-02-20,...,,document_parses/pdf_json/89c97f15ddca5f7b2f4bf2757ee4a5cd7fd2a651.json,document_parses/pmc_json/PMC7158945.xml.json,https://www.sciencedirect.com/science/article/pii/S2589750020300261; https://api.elsevier.com/content/article/pii/S2589750020300261,213852992.0,3 months ago,True,True,True,False
128487,a11jyui1,,Elsevier; PMC,Impact of BSE on the biotechnology industry — detection and risk assessment,10.1016/s0958-2118(04)00238-1,PMC7148827,,els-covid,This feature briefly looks at filtration technologies that are being developed by Pall Corp to reduce the potential risk of ‘Mad Cow Disease’ in biological products and drugs. It also provides det...,2004-10-31,...,,,,https://www.sciencedirect.com/science/article/pii/S0958211804002381; https://api.elsevier.com/content/article/pii/S0958211804002381,109580667.0,15 years ago,False,False,False,False
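
The rows above show `url` fields that hold two links separated by `; `, one pointing at `api.elsevier.com` and one at `www.sciencedirect.com`, in either order. The following is a minimal sketch of splitting such a field and picking the link that is not an API endpoint. The helpers `split_urls` and `preferred_url` are hypothetical names introduced for this example, and this is not necessarily how the `cord` package resolves a paper's URL.

```python
import pandas as pd


def split_urls(url_field):
    """Split a ';'-separated url field into a list of stripped URLs."""
    if pd.isna(url_field):
        return []
    return [part.strip() for part in str(url_field).split(";") if part.strip()]


def preferred_url(url_field, avoid="https://api.elsevier.com"):
    """Return the first URL not starting with `avoid`, else the first URL."""
    urls = split_urls(url_field)
    for url in urls:
        if not url.startswith(avoid):
            return url
    return urls[0] if urls else None


# url value copied from row 57460 in the output above
example = (
    "https://api.elsevier.com/content/article/pii/B9780128123430099951; "
    "https://www.sciencedirect.com/science/article/pii/B9780128123430099951"
)
urls = split_urls(example)
chosen = preferred_url(example)
print(urls)
print(chosen)
```

Applied to a frame, this could be used as `metadata_multi.url.map(preferred_url)`. Note that a URL which itself contains `;` would be split incorrectly by this sketch.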


In [39]:
papers[128491].url

'https://www.sciencedirect.com/science/article/pii/S0261517709001745'