# Load SARS-CoV-2 Strain Metadata from Nextstrain.org
**[Work in progress]**

This notebook downloads and standardizes SARS-CoV-2 strain metadata from [Nextstrain.org](https://nextstrain.org) for ingestion into a Knowledge Graph. 

Because the metadata file must be downloaded manually from Nextstrain.org, this notebook reads a local copy of [nextstrain_ncov_global_metadata.tsv](../../reference_data/nextstrain_ncov_global_metadata.tsv). The data are therefore not updated daily.

Additional information about these strains will be loaded later in the [01e-CNCBStrain.ipynb](01e-CNCBStrain.ipynb) notebook.

Data source: [Nextstrain.org](https://nextstrain.org/ncov/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import csv
import pandas as pd
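import dateutil.parser  # import the submodule explicitly; a bare "import dateutil" does not guarantee dateutil.parser is loaded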
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_HOME = Path(os.environ['NEO4J_HOME'])  # raises KeyError if NEO4J_HOME is not set
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3


In [4]:
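# Read all columns as strings and skip malformed lines.
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines instead.
df = pd.read_csv("../../reference_data/nextstrain_ncov_global_metadata.tsv", sep='\t', dtype=str, error_bad_lines=False)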

b'Skipping line 1229: expected 22 fields, saw 30\nSkipping line 1776: expected 22 fields, saw 30\n'
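The skipped lines reported above can be inspected before deciding whether they need repair. The following is a minimal sketch, assuming a tab-delimited file with a header row. `find_malformed_lines` is a hypothetical helper that is not part of the original notebook, and the in-memory sample stands in for the real file.

```python
import csv
import io


def find_malformed_lines(handle, delimiter="\t"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    # Line numbers are 1-based and count the header as line 1; blank rows are ignored
    return [(reader.line_num, len(row)) for row in reader if row and len(row) != expected]


# Made-up sample: the third line has five fields instead of three
sample = io.StringIO("a\tb\tc\n1\t2\t3\n1\t2\t3\t4\t5\n")
bad = find_malformed_lines(sample)
print(bad)
```

For the real file, the handle would come from `open(path, newline='')`. Line numbers may differ from those pandas reports if quoted fields span several lines.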


## Transform and standardize data

Neo4j does not store properties with "null" values. Missing values (`?`, `Unknown`, and NaN) are therefore set to '', so that they will not be represented in the graph.

In [5]:
df.replace('?', '', inplace=True)
df.replace('Unknown', '', inplace=True)
df.fillna('', inplace=True)
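The effect of these three calls can be seen on a small example. This is a sketch with made-up values, and the column names are only illustrative. Note that `replace` with a plain string matches whole cell values, so a value such as `Male?` is left unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not taken from the Nextstrain file)
toy = pd.DataFrame({
    "Strain": ["A/1", "B/2", "C/3"],
    "Age": ["52", "?", np.nan],
    "Sex": ["Female", "Unknown", "Male?"],
})

# Equivalent to the three in-place calls above, written as one chained expression
cleaned = toy.replace(["?", "Unknown"], "").fillna("")
print(cleaned)
```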

In [6]:
df.head()

Unnamed: 0,Strain,Age,Clade,Country,Country of exposure,Admin Division,Division of exposure,genbank_accession,gisaid_epi_isl,Host,Old Nextstrain clade,Location,Originating Lab,Pangolin lineage,Submission Date,Region,Sex,Submitting Lab,url,Collection Data,Author,Region of exposure
0,Wuhan/WIV05/2019,52,19A,China,China,Hubei,Hubei,MN996529,EPI_ISL_402128,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,Female,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,
1,Wuhan/WIV02/2019,32,19A,China,China,Hubei,Hubei,MN996527,EPI_ISL_402127,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,Male,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,
2,Hangzhou/ZJU-011/2020,62,19A,China,China,Zhejiang,Zhejiang,,EPI_ISL_418991,Human,,Hangzhou,State Key Laboratory for Diagnosis and Treatme...,B,Older,Asia,Male,State Key Laboratory for Diagnosis and Treatme...,,2020-02-04,Hangping Yao et al,
3,Hangzhou/ZJU-07/2020,0,19A,China,China,Zhejiang,Zhejiang,,EPI_ISL_416425,Human,,Hangzhou,State Key Laboratory for Diagnosis and Treatme...,B,Older,Asia,Male,State Key Laboratory for Diagnosis and Treatme...,,2020-02-03,Hangping Yao et al,
4,Taiwan/TSGH-37/2020,21,19A,Taiwan,Taiwan,Taiwan,Taiwan,,EPI_ISL_457733,Human,,Taipei,TSGH-CP molecular lab,,1-2 days ago,Asia,Male,TSGH-CP molecular lab,,2020-02-08,Cherng-Lih Perng et al,


Apply Neo4j property naming conventions

In [7]:
df.rename(columns={'Strain': 'name', 'Clade': 'clade', 'Age': 'age', 'Sex': 'sex', 'Collection Data': 'collectionDate'}, inplace=True)
df.rename(columns={'Country of exposure': 'exposureCountry', 'Division of exposure': 'exposureAdmin1'}, inplace=True)

Parse collection dates into datetime objects (empty values stay empty)

In [8]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
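One caveat with `dateutil.parser.parse`: when a string omits the day or month, the missing fields are filled in from the current date unless a `default` is passed. Whether the Nextstrain file contains such partial dates is not stated here, so the following is only a sketch of a deterministic variant. `parse_collection_date` and `DEFAULT_DATE` are hypothetical names, and filling with the first of January is an assumption.

```python
from datetime import datetime

import dateutil.parser

# Assumed placeholder for missing date components (January 1st)
DEFAULT_DATE = datetime(1900, 1, 1)


def parse_collection_date(value):
    """Parse a date string; return '' for empty input, as the cell above does."""
    if len(value) == 0:
        return ''
    # Fields absent from the string are taken from DEFAULT_DATE, not from today
    return dateutil.parser.parse(value, default=DEFAULT_DATE)


full = parse_collection_date('2019-12-30')
partial = parse_collection_date('2020-03')
print(full, partial)
```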

Taxonomy

In [9]:
# assign taxonomy for SARS-CoV-2
df['taxonomyId'] = 'taxonomy:2697049'

In [10]:
# read Organism reference dictionary (organism name -> taxonomy id)
data = pd.read_csv("../../reference_data/OrganismDictionary.csv", comment='#', index_col=False)
organism_to_id = dict(zip(data['organism'], data['taxonomyId']))

In [11]:
# assign taxonomy id for host; hosts not found in the dictionary keep their original name
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: organism_to_id.get(s.lower(), s))

In [12]:
df['hostTaxonomyId'].unique()

array(['taxonomy:9606', 'Environment', '', 'taxonomy:9666',
       'taxonomy:9608', 'taxonomy:9685', 'taxonomy:419130'], dtype=object)
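The output above includes `'Environment'`, a host name that was not found in the dictionary and so passed through unchanged. A short sketch of how such unmapped values could be listed follows. The one-entry dictionary is a toy stand-in for `OrganismDictionary.csv` (only the human mapping is taken from the output in this notebook), and the check for the `taxonomy:` prefix is an assumption about the identifier format.

```python
import pandas as pd

# Toy stand-in for the reference dictionary; keys are lower-case organism names
organism_to_id = {"human": "taxonomy:9606"}

hosts = pd.Series(["Human", " Human ", "Environment", ""])
host_ids = hosts.str.strip().apply(lambda s: organism_to_id.get(s.lower(), s))

# Non-empty values that did not resolve to a taxonomy identifier
is_unmapped = ~host_ids.str.startswith("taxonomy:") & (host_ids != "")
unmapped = sorted(set(host_ids[is_unmapped]))
print(unmapped)
```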

In [13]:
# for consistency with NCBI, use lower case
df['sex'] = df['sex'].str.lower()

In [14]:
df.head()

Unnamed: 0,name,age,clade,Country,exposureCountry,Admin Division,exposureAdmin1,genbank_accession,gisaid_epi_isl,Host,Old Nextstrain clade,Location,Originating Lab,Pangolin lineage,Submission Date,Region,sex,Submitting Lab,url,collectionDate,Author,Region of exposure,taxonomyId,hostTaxonomyId
0,Wuhan/WIV05/2019,52,19A,China,China,Hubei,Hubei,MN996529,EPI_ISL_402128,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,female,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,,taxonomy:2697049,taxonomy:9606
1,Wuhan/WIV02/2019,32,19A,China,China,Hubei,Hubei,MN996527,EPI_ISL_402127,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,male,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,,taxonomy:2697049,taxonomy:9606
2,Hangzhou/ZJU-011/2020,62,19A,China,China,Zhejiang,Zhejiang,,EPI_ISL_418991,Human,,Hangzhou,State Key Laboratory for Diagnosis and Treatme...,B,Older,Asia,male,State Key Laboratory for Diagnosis and Treatme...,,2020-02-04,Hangping Yao et al,,taxonomy:2697049,taxonomy:9606
3,Hangzhou/ZJU-07/2020,0,19A,China,China,Zhejiang,Zhejiang,,EPI_ISL_416425,Human,,Hangzhou,State Key Laboratory for Diagnosis and Treatme...,B,Older,Asia,male,State Key Laboratory for Diagnosis and Treatme...,,2020-02-03,Hangping Yao et al,,taxonomy:2697049,taxonomy:9606
4,Taiwan/TSGH-37/2020,21,19A,Taiwan,Taiwan,Taiwan,Taiwan,,EPI_ISL_457733,Human,,Taipei,TSGH-CP molecular lab,,1-2 days ago,Asia,male,TSGH-CP molecular lab,,2020-02-08,Cherng-Lih Perng et al,,taxonomy:2697049,taxonomy:9606


### Create unique and interoperable identifiers

**id**: URI formed by appending the GISAID accession number (`gisaid_epi_isl`) to `https://www.gisaid.org/` ([Global Initiative on Sharing All Influenza Data, GISAID](https://www.gisaid.org/help/publish-with-gisaid-references))

In [15]:
df['id'] = 'https://www.gisaid.org/' + df['gisaid_epi_isl']
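Because the identifier is built by string concatenation, a row with an empty accession would receive the bare prefix as its id, and repeated accessions would yield repeated ids. Whether either case occurs in the Nextstrain file is not known here. The sketch below shows one way to check, using a toy frame in which the empty and repeated values are made up for illustration.

```python
import pandas as pd

GISAID_PREFIX = "https://www.gisaid.org/"

# Toy accessions: one empty and one repeated value, to exercise both checks
toy = pd.DataFrame({"gisaid_epi_isl": ["EPI_ISL_402128", "", "EPI_ISL_402128"]})
toy["id"] = GISAID_PREFIX + toy["gisaid_epi_isl"]

# Rows without an accession, and non-empty ids that occur more than once
missing = toy["gisaid_epi_isl"] == ""
duplicated = toy["id"].duplicated(keep=False) & ~missing
print(int(missing.sum()), int(duplicated.sum()))
```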

### Save data for Knowledge Graph Import

In [16]:
df = df[['id', 'name', 'taxonomyId', 'collectionDate',
         'hostTaxonomyId', 'sex', 'age', 'clade',
         'exposureCountry', 'exposureAdmin1']]
df.head()

Unnamed: 0,id,name,taxonomyId,collectionDate,hostTaxonomyId,sex,age,clade,exposureCountry,exposureAdmin1
0,https://www.gisaid.org/EPI_ISL_402128,Wuhan/WIV05/2019,taxonomy:2697049,2019-12-30,taxonomy:9606,female,52,19A,China,Hubei
1,https://www.gisaid.org/EPI_ISL_402127,Wuhan/WIV02/2019,taxonomy:2697049,2019-12-30,taxonomy:9606,male,32,19A,China,Hubei
2,https://www.gisaid.org/EPI_ISL_418991,Hangzhou/ZJU-011/2020,taxonomy:2697049,2020-02-04,taxonomy:9606,male,62,19A,China,Zhejiang
3,https://www.gisaid.org/EPI_ISL_416425,Hangzhou/ZJU-07/2020,taxonomy:2697049,2020-02-03,taxonomy:9606,male,0,19A,China,Zhejiang
4,https://www.gisaid.org/EPI_ISL_457733,Taiwan/TSGH-37/2020,taxonomy:2697049,2020-02-08,taxonomy:9606,male,21,19A,Taiwan,Taiwan


In [17]:
df.to_csv(NEO4J_HOME / "import/01b-Nextstrain.csv", index=False)