# Load SARS-CoV-2 Strain Metadata from Nextstrain.org
**[Work in progress]**

This notebook loads and standardizes SARS-CoV-2 strain metadata from [Nextstrain.org](https://nextstrain.org) for ingestion into a Knowledge Graph.

The metadata file must be downloaded manually from Nextstrain.org, so this notebook reads a local copy, [nextstrain_ncov_global_metadata.tsv](../../reference_data/nextstrain_ncov_global_metadata.tsv). Unlike automatically downloaded sources, this file is not updated daily.

Additional information about these strains will be loaded later in the [01e-CNCBStrain.ipynb](01e-CNCBStrain.ipynb) notebook.

Data source: [Nextstrain.org](https://nextstrain.org/ncov/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import csv
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
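
If the `NEO4J_IMPORT` environment variable is not set, `os.getenv` returns `None` and `Path(None)` raises a `TypeError` that does not name the missing variable. A minimal sketch of a more explicit lookup follows; the helper name `get_import_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name="NEO4J_IMPORT"):
    """Return the import directory named by an environment variable.

    Hypothetical helper: raises a descriptive error if the variable
    is unset or empty instead of failing inside Path().
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    return Path(value)
```

With this helper, cell 3 could read `NEO4J_IMPORT = get_import_dir()`.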


In [4]:
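# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on_bad_lines='warn' keeps the old behavior of skipping malformed rows with a warning
df = pd.read_csv("../../reference_data/nextstrain_ncov_global_metadata.tsv", sep='\t', dtype=str, on_bad_lines='warn')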

## Transform and standardize data

Graph databases don't have "null" values. By setting missing values to '', they will not be represented in the graph.

In [5]:
df.replace('?', '', inplace=True)
df.replace('Unknown', '', inplace=True)
df.fillna('', inplace=True)
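
The effect of this step can be seen on a small example. The sketch below uses a made-up two-column frame (`toy` is illustrative data, not taken from the Nextstrain file) and applies the same three replacements in a single chained expression.

```python
import pandas as pd

# Illustrative data only: three placeholder styles for "missing"
toy = pd.DataFrame({
    "Sex": ["Male", "?", None],
    "Age": ["53", "Unknown", "49"],
})

# Map the '?' and 'Unknown' placeholders and real nulls to empty strings
cleaned = toy.replace(["?", "Unknown"], "").fillna("")

print(cleaned)
```

After this, every missing value in `cleaned` is the empty string, regardless of how it was encoded in the input.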

In [6]:
df.head()

Unnamed: 0,Strain,GISAID Clade,Age,Clade,Country,Country of exposure,Admin Division,Division of exposure,gisaid_epi_isl,Host,Old Nextstrain clade,Originating Lab,Pangolin lineage,Submission Date,Region,Sex,Emerging clade,Submitting Lab,url,Collection Data,Author,Location,genbank_accession,Region of exposure
0,Guangdong/GD2020139-P0007/2020,L,53,19A,China,China,Guangdong,Guangdong,EPI_ISL_413882,Human,,Guangdong Provincial Institution of Public Hea...,B,Older,Asia,Male,19A,Guangdong Provincial Institution of Public Health,,2020-02-02,Jing Lu et al (https://dx.doi.org/10.1016/j.ce...,,,
1,Wuhan/HBCDC-HB-01/2019,L,49,19A,China,China,Hubei,Hubei,EPI_ISL_402132,Human,,Wuhan Jinyintan Hospital,B,Older,Asia,Female,19A,Hubei Provincial Center for Disease Control an...,,2019-12-30,Bin Fang et al (https://dx.doi.org/10.1101/202...,Wuhan,,
2,Wuhan/WH01/2019,L,44,19A,China,China,Hubei,Hubei,EPI_ISL_406798,Human,,General Hospital of Central Theater Command of...,B,Older,Asia,Male,19A,"BGI & Institute of Microbiology, Chinese Acade...",,2019-12-26,Weijun Chen et al (https://dx.doi.org/10.1016/...,Wuhan,LR757998,
3,Wuhan/WIV07/2019,L,56,19A,China,China,Hubei,Hubei,EPI_ISL_402130,Human,,Wuhan Jinyintan Hospital,B,Older,Asia,Male,19A,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,Wuhan,MN996531,
4,Wuhan/IPBCAMS-WH-01/2019,L,65,19A,China,China,Hubei,Hubei,EPI_ISL_402123,Human,,"Institute of Pathogen Biology, Chinese Academy...",B,Older,Asia,Male,19A,"Institute of Pathogen Biology, Chinese Academy...",,2019-12-24,Lili Ren et al A,Wuhan,MT019529,


Apply Neo4j property naming conventions

In [7]:
df.rename(columns={'Strain': 'name', 'Clade': 'clade', 'Age': 'age', 'Sex': 'sex'}, inplace=True)
df.rename(columns={'Country of exposure': 'exposureCountry', 'Division of exposure': 'exposureAdmin1'}, inplace=True)

Standardize sex values

In [8]:
# for consistency with NCBI, use lower case
df['sex'] = df['sex'].str.lower()

### Create unique and interoperable identifiers

**id**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)

In [9]:
df['id'] = 'https://www.gisaid.org/' + df['gisaid_epi_isl']
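
Because missing values were replaced with `''` earlier, a row without a GISAID accession would receive the bare prefix as its `id`, which is neither unique nor resolvable. The sketch below shows one way to detect such rows and to check for duplicate identifiers; it uses made-up rows, and the names `PREFIX`, `missing_id`, and `valid` are illustrative assumptions rather than part of this notebook.

```python
import pandas as pd

PREFIX = "https://www.gisaid.org/"

# Illustrative data only: the middle row has no accession
toy = pd.DataFrame({"gisaid_epi_isl": ["EPI_ISL_413882", "", "EPI_ISL_402132"]})
toy["id"] = PREFIX + toy["gisaid_epi_isl"]

# Rows whose id is just the prefix had an empty accession
missing_id = toy["id"] == PREFIX

# Keep only rows with a usable identifier, then check uniqueness
valid = toy[~missing_id]
has_duplicates = valid["id"].duplicated().any()

print(missing_id.sum(), "row(s) without accession; duplicates:", has_duplicates)
```

Whether such rows should be dropped or kept is a modeling decision for the Knowledge Graph; this check only makes them visible.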

### Save data for Knowledge Graph Import

In [10]:
df = df[['id', 'sex', 'age', 'clade', 'exposureCountry', 'exposureAdmin1']]
df.head()

Unnamed: 0,id,sex,age,clade,exposureCountry,exposureAdmin1
0,https://www.gisaid.org/EPI_ISL_413882,male,53,19A,China,Guangdong
1,https://www.gisaid.org/EPI_ISL_402132,female,49,19A,China,Hubei
2,https://www.gisaid.org/EPI_ISL_406798,male,44,19A,China,Hubei
3,https://www.gisaid.org/EPI_ISL_402130,male,56,19A,China,Hubei
4,https://www.gisaid.org/EPI_ISL_402123,male,65,19A,China,Hubei


In [11]:
df.to_csv(NEO4J_IMPORT / "01d-Nextstrain.csv", index=False)
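
To confirm that the exported file keeps empty strings as empty fields, the CSV can be read back with pandas. This is a sketch under assumptions: it writes a made-up one-row frame to a temporary directory rather than to `NEO4J_IMPORT`, and it passes `keep_default_na=False` so that pandas does not turn empty fields back into `NaN`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative data only: 'age' is missing and stored as ''
toy = pd.DataFrame({
    "id": ["https://www.gisaid.org/EPI_ISL_413882"],
    "sex": ["male"],
    "age": [""],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy.csv"
    toy.to_csv(out_path, index=False)  # same call as above, no index column

    # First line of the file is the header row
    header = out_path.read_text().splitlines()[0]

    # Read back as strings, keeping empty fields as '' instead of NaN
    restored = pd.read_csv(out_path, dtype=str, keep_default_na=False)

print(header)
print(restored)
```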

In [12]:
df.shape

(5174, 6)