Update the variables below to match your setup.  
`NEO4J_OUTPUT_DIR` is the directory where the data to be imported into Neo4j is written.  
`METADATA_PATH` is the path to the file containing metadata for each article.  
`JOURNAL_INFO_PATH` is the path to the file containing journal information, including impact factor (to be integrated with the rest of the metadata).  

In [1]:
NEO4J_OUTPUT_DIR = '../neo4j-import' # folder to store files for neo4j import
METADATA_PATH = '../data/metadata.csv' # CORD-19 metadata from kaggle
JOURNAL_INFO_PATH = './journal_data.csv' # journal info from SCI Journal Citation Reports with impact factor
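Since every later cell depends on these three paths, it may help to check them before loading anything. The sketch below is one possible way to do that; `missing_paths` is a hypothetical helper, not part of the original notebook, and the path values are simply the defaults from the cell above.

```python
import os

def missing_paths(paths):
    """Return the paths from `paths` that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]

# Defaults copied from the configuration cell; adjust to your setup.
configured = ['../neo4j-import', '../data/metadata.csv', './journal_data.csv']
missing = missing_paths(configured)
if missing:
    print('Missing paths, please update the configuration:', missing)
```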

In [2]:
import pandas as pd
import os

In [3]:
metadata_df = pd.read_csv(METADATA_PATH)
journal_df = pd.read_csv(JOURNAL_INFO_PATH, skiprows=1)


In [4]:
# create journal to impact factor mapping
journal_abbr_title_df = journal_df[['JCR Abbreviated Title', 'Journal Impact Factor']].dropna().drop_duplicates()
journal_abbr_title_df.columns = ['journal_title', 'journal_info']

journal_full_title_df = journal_df[['Full Journal Title', 'Journal Impact Factor']].dropna().drop_duplicates()
journal_full_title_df.columns = ['journal_title', 'journal_info']
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
journal_to_impact_factor_df = pd.concat([journal_abbr_title_df, journal_full_title_df], ignore_index=True)
journal_to_impact_factor_df['journal_title_lower'] = journal_to_impact_factor_df['journal_title'].str.lower()
journal_to_impact_factor_df = journal_to_impact_factor_df.drop_duplicates(subset=['journal_title_lower'])
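To illustrate what the cell above produces, here is a minimal sketch of the same idea on toy data: abbreviated and full titles are stacked into one lookup table and de-duplicated on a lowercased key. The toy frame and the `build_title_lookup` helper are assumptions made for illustration only; the column names follow the ones used above.

```python
import pandas as pd

# Toy stand-in for the journal table; rows are illustrative only.
toy_journal_df = pd.DataFrame({
    'JCR Abbreviated Title': ['LANCET', 'NEW ENGL J MED'],
    'Full Journal Title': ['Lancet', 'New England Journal of Medicine'],
    'Journal Impact Factor': [59.102, 70.67],
})

def build_title_lookup(df):
    """Stack abbreviated and full titles into one case-insensitive lookup (hypothetical helper)."""
    frames = []
    for col in ['JCR Abbreviated Title', 'Full Journal Title']:
        part = df[[col, 'Journal Impact Factor']].dropna().drop_duplicates()
        part.columns = ['journal_title', 'journal_info']
        frames.append(part)
    lookup = pd.concat(frames, ignore_index=True)
    lookup['journal_title_lower'] = lookup['journal_title'].str.lower()
    # Keep the first row per lowercased title, so abbreviated titles win on ties.
    return lookup.drop_duplicates(subset=['journal_title_lower'])

toy_lookup = build_title_lookup(toy_journal_df)
print(toy_lookup)
```

Here 'LANCET' and 'Lancet' collapse into a single key, which is why the lookup ends up with three rows rather than four.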

In [5]:
journal_to_impact_factor_df.head()

Unnamed: 0,journal_title,journal_info,journal_title_lower
0,CA-CANCER J CLIN,223.679,ca-cancer j clin
1,NAT REV MATER,74.449,nat rev mater
2,NEW ENGL J MED,70.67,new engl j med
3,LANCET,59.102,lancet
4,NAT REV DRUG DISCOV,57.618,nat rev drug discov


In [6]:
# add journal impact factors
metadata_df['journal_title_lower'] = metadata_df['journal'].str.lower()
metadata_df = metadata_df.merge(journal_to_impact_factor_df, how='left', on='journal_title_lower')
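A left join keeps every metadata row, so journals without a match simply end up with an empty `journal_info`. If you want to see how many rows actually matched, one option is the `indicator` parameter of `merge`. The sketch below uses toy frames (assumptions, not the real data) to show the pattern.

```python
import pandas as pd

# Toy frames for illustration only.
toy_metadata = pd.DataFrame({'journal': ['Lancet', 'Unknown Journal', None]})
toy_lookup = pd.DataFrame({'journal_title_lower': ['lancet'], 'journal_info': [59.102]})

toy_metadata['journal_title_lower'] = toy_metadata['journal'].str.lower()
merged = toy_metadata.merge(
    toy_lookup, how='left', on='journal_title_lower', indicator=True
)

# '_merge' is 'both' for matched rows and 'left_only' for rows without a journal match.
matched = int((merged['_merge'] == 'both').sum())
print('rows with an impact factor:', matched, 'of', len(merged))
```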

In [7]:
# text-document relationships produced from Stage 2, used to get the list of documents for which to extract metadata
sentences_path = os.path.join(NEO4J_OUTPUT_DIR, 'text_edges.csv')
sentences_df = pd.read_csv(sentences_path)
all_paper_ids = set(sentences_df['doc_id:END_ID(Document)'])
print('total number of document/paper IDs after Stage 2:', len(all_paper_ids))

# formatted metadata for Neo4j import
output_path = os.path.join(NEO4J_OUTPUT_DIR, 'metadata.csv')

total number of document/paper IDs after Stage 2: 178900


In [8]:
# metadata statistics
all_abstracts = set(metadata_df['abstract'].dropna().str.lower().str.strip())
print('Total number of abstracts in metadata =', len(all_abstracts))
print('Total number of PMC IDs in metadata =', len(set(metadata_df['pmcid'].dropna())))
print('Total number of PubMed IDs in metadata =', len(set(metadata_df['pubmed_id'].dropna())))

Total number of abstracts in metadata = 309175
Total number of PMC IDs in metadata = 170722
Total number of PubMed IDs in metadata = 233312


In [9]:
header = ['cord_uid', 'title', 'authors', 'journal', 'journal_info', 'publish_time'] # original header
header_neo4j = ['doc_id:ID(Document)', 'title:STRING', 'authors:STRING', 'journal:STRING', 'journal_info:STRING', 'publish_time:DATE'] # headers for neo4j import

In [10]:
# format metadata for Neo4j import
metadata_df['publish_time'] = pd.to_datetime(metadata_df['publish_time'])
metadata_df = metadata_df.sort_values(by='publish_time', ascending=False)
metadata_df = metadata_df[header]
metadata_df = metadata_df[metadata_df['cord_uid'].isin(all_paper_ids)]
metadata_df = metadata_df.dropna(subset=['cord_uid']).drop_duplicates(subset=['cord_uid'])
metadata_df.columns = header_neo4j
paper_ids = set(metadata_df['doc_id:ID(Document)'])

print('total number of document/paper IDs after Stage 2 with metadata information:', len(paper_ids))

total number of document/paper IDs after Stage 2 with metadata information: 178900


In [11]:
# add papers w/o metadata
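# build from a sorted list: the DataFrame constructor rejects sets as unordered
missing_ids = sorted(all_paper_ids.difference(paper_ids))
if missing_ids:
    temp_df = pd.DataFrame({header_neo4j[0]: missing_ids})
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    metadata_df = pd.concat([metadata_df, temp_df])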
metadata_df[':LABEL'] = 'Document'

In [12]:
# save to .csv
metadata_df.to_csv(output_path, index=False)
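The `publish_time:DATE` column is written using the default datetime formatting of `to_csv`. If you prefer to make the date format explicit before saving, a sketch like the following could be used. It assumes the importer expects ISO `YYYY-MM-DD` strings, which you should confirm against the Neo4j import documentation for your version; the toy frame is illustrative only.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({'publish_time': ['2022-01-01', '2021-12-31', None]})
toy['publish_time'] = pd.to_datetime(toy['publish_time'])

# Render dates as explicit ISO strings; missing dates stay missing.
toy['publish_time:DATE'] = toy['publish_time'].dt.strftime('%Y-%m-%d')
print(toy['publish_time:DATE'].tolist())
```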

In [13]:
metadata_df.head()

Unnamed: 0,doc_id:ID(Document),title:STRING,authors:STRING,journal:STRING,journal_info:STRING,publish_time:DATE,:LABEL
257938,4fmocguu,Exposure of pediatric emergency patients to im...,"Floriani, Isabela Dombeck; Borgmann, Ariela Vi...","Rev. Paul. Pediatr. (Ed. Port., Online)",,2022-01-01,Document
299658,5cxheca8,Chapter 40 - COVID-19 Infection: A Novel Fatal...,"Maleki, Majid Norouzi Zeinab Maleki Alireza",Practical Cardiology (Second Edition),,2022-01-01,Document
257937,8ulzzzjc,Exposure of pediatric emergency patients to im...,"Floriani, Isabela Dombeck Borgmann Ariela Vict...","Rev. Paul. Pediatr. (Ed. Port., Online)",,2022-01-01,Document
399032,ra86cr5c,Chapter 5 Treatment of COVID-19,"Qu, Jie-Ming; Cao, Bin; Chen, Rong-Chang",COVID-19,,2021-12-31,Document
437334,2eyzbjdp,Pathogenic Human Coronaviruses,"Schoeman, Dewald; Gordon, Bianca; Fielding, Bu...",Reference Module in Biomedical Sciences,,2021-12-31,Document
