# Load SARS-CoV-2 Strain Metadata from Nextstrain.org
**[Work in progress]**

This notebook loads and standardizes SARS-CoV-2 strain metadata from [Nextstrain.org](https://nextstrain.org), which obtains the data from [GISAID](https://www.gisaid.org/), for ingestion into a Knowledge Graph.

This notebook reads a local copy of the [nextstrain_ncov_global_metadata.tsv](../reference_data/nextstrain_ncov_global_metadata.tsv) file, because the file must be downloaded manually from Nextstrain.org. The local copy is therefore not updated daily.

Additional information about these strains will be loaded later in the [01e-CNCBStrain.ipynb](01e-CNCBStrain.ipynb) notebook.

Data source: [Nextstrain.org](https://nextstrain.org/ncov/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil.parser  # import the submodule explicitly; "import dateutil" alone does not guarantee it is loaded
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3


In [4]:
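# Read all columns as strings; malformed rows are skipped and reported.
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# newer versions use on_bad_lines='skip' (or 'warn') instead.
df = pd.read_csv("../reference_data/nextstrain_ncov_global_metadata.tsv", sep='\t', dtype=str, error_bad_lines=False)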

b'Skipping line 2920: expected 19 fields, saw 24\n'
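
The `error_bad_lines` argument used above exists only in older pandas releases. As a hedged sketch, assuming pandas 1.4 or later, similar skip-and-report behavior can be obtained with `on_bad_lines`, which also accepts a callable (Python engine only) so that the rejected rows can be inspected. The helper name `read_tsv_collecting_bad_lines` and the toy data are illustrative and not part of this notebook.

```python
import io
import pandas as pd

def read_tsv_collecting_bad_lines(source):
    """Read a TSV as strings; return (DataFrame, list of rejected rows)."""
    bad_lines = []

    def collect(fields):
        # Called by pandas for each row with too many fields;
        # returning None tells pandas to skip the row.
        bad_lines.append(fields)
        return None

    frame = pd.read_csv(source, sep='\t', dtype=str,
                        engine='python', on_bad_lines=collect)
    return frame, bad_lines

# Toy input: the second data row has three fields instead of two.
toy = io.StringIO("Strain\tHost\nA/1\tHuman\nB/2\tHuman\textra\nC/3\tbat\n")
toy_df, rejected = read_tsv_collecting_bad_lines(toy)
print(len(toy_df), rejected)
```

Passing the path of the metadata file instead of the toy buffer would return the skipped row(s) alongside the data, which makes it easier to decide whether they can be repaired.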


## Transform and standardize data

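Graph databases such as Neo4j do not store "null" property values. Placeholder values ('?' and 'Unknown') and missing values are therefore replaced with the empty string '', so that they are not represented in the graph.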

In [5]:
df.replace('?', '', inplace=True)
df.replace('Unknown', '', inplace=True)
df.fillna('', inplace=True)

In [6]:
df.head()

Unnamed: 0,Strain,Admin Division,gisaid_epi_isl,url,Country,Age,Sex,Host,Submission Date,Submitting Lab,Originating Lab,Region,Division of exposure,Clade,Collection Data,Author,genbank_accession,Location,Country of exposure
0,Shanghai/SH0007/2020,Shanghai,EPI_ISL_416320,,China,65,Male,Human,Older,National Research Center for Translational Med...,"Shanghai Public Health Clinical Center, Shangh...",Asia,Shanghai,,2020-01-28,Shengyue Wang et al,,,
1,Wuhan/WH01/2019,Hubei,EPI_ISL_406798,,China,44,Male,Human,Older,"BGI & Institute of Microbiology, Chinese Acade...",General Hospital of Central Theater Command of...,Asia,Hubei,,2019-12-26,Weijun Chen et al,LR757998,Wuhan,
2,Hangzhou/ZJU-07/2020,Zhejiang,EPI_ISL_416425,,China,0,Male,Human,Older,State Key Laboratory for Diagnosis and Treatme...,State Key Laboratory for Diagnosis and Treatme...,Asia,Zhejiang,,2020-02-03,Hangping Yao et al,,Hangzhou,
3,Guangzhou/GZMU0060/2020,Guangdong,EPI_ISL_429103,,China,25,,Human,One month ago,BGI-shenzhen & The First Affiliated Hospital o...,The First Affiliated Hospital of Guangzhou Med...,Asia,Guangdong,,2020-02-09,et al,,Guangzhou,
4,Wuhan/IVDC-HB-05/2019,Hubei,EPI_ISL_402121,,China,32,Male,Human,Older,National Institute for Viral Disease Control a...,National Institute for Viral Disease Control a...,Asia,Hubei,,2019-12-30,Wenjie Tan et al,,Wuhan,


Apply Neo4j property naming conventions

In [7]:
df.rename(columns={'Strain': 'name', 'Clade': 'clade', 'Age': 'age', 'Sex': 'sex', 'Collection Data': 'collectionDate'}, inplace=True)
df.rename(columns={'Country of exposure': 'exposureCountry', 'Division of exposure': 'exposureAdmin1'}, inplace=True)

Fix collection dates

In [8]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
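
One caveat with `dateutil.parser.parse`: when a date string is incomplete (for example, only a year and month), the missing fields are filled in from a default date, which is today's date unless `default` is passed. Assuming the file may contain such partial dates, which this notebook does not confirm, a stricter conversion could keep only complete `YYYY-MM-DD` dates. The helper `to_iso_date` below is a hypothetical sketch using only the standard library.

```python
from datetime import datetime

def to_iso_date(value):
    """Return value as 'YYYY-MM-DD' if it is a complete date, else ''."""
    try:
        # strptime rejects partial dates such as '2020-03' or '2020'
        return datetime.strptime(value.strip(), '%Y-%m-%d').date().isoformat()
    except ValueError:
        return ''

print(to_iso_date('2020-01-28'))  # complete date is kept
print(to_iso_date('2020-03'))     # partial date becomes ''
print(to_iso_date(''))            # missing value stays ''
```

It could be applied with `df['collectionDate'].apply(to_iso_date)`, which yields strings rather than datetime objects.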

Taxonomy

In [9]:
# assign taxonomy for SARS-CoV-2
df['taxonomyId'] = 'taxonomy:2697049'

In [10]:
# TODO find a general solution to map host names to NCBI taxonomy

# some host specifications are ambiguous, 
# they don't match a specific species in NCBI taxonomy

taxonomy_to_id = {'Human': 'taxonomy:9606', 
                  'Homo sapiens': 'taxonomy:9606',
                  'Rhinolophus affinis': 'taxonomy:59477', 
                  'Rhinolophus sp. (bat)': 'taxonomy:49442',
                  'Mustela lutreola': 'taxonomy:9666',
                  'Panthera tigris jacksoni': 'taxonomy:419130',
                  'bat': 'taxonomy:49442',
                  'Manis javanica': 'taxonomy:9974',
                  'palm civet': 'taxonomy:71116',
                  'Canine': 'taxonomy:9608',
                  'Felis catus': 'taxonomy:9685'
                 }
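
Because hosts missing from this dictionary lead to rows being dropped in the next step, it may be worth listing which host labels are not covered before filtering. The sketch below uses a hypothetical helper, `unmapped_hosts`, and a shortened copy of the mapping with made-up host labels so that it runs on its own.

```python
from collections import Counter

def unmapped_hosts(hosts, mapping):
    """Count host labels (after stripping whitespace) absent from mapping."""
    counts = Counter(h.strip() for h in hosts)
    return {host: n for host, n in counts.items() if host not in mapping}

# Shortened mapping and made-up host labels, for illustration only
example_mapping = {'Human': 'taxonomy:9606', 'Felis catus': 'taxonomy:9685'}
example_hosts = ['Human', 'Human ', 'Felis catus', 'Environment', 'Environment']
print(unmapped_hosts(example_hosts, example_mapping))
```

In the notebook this would be called as `unmapped_hosts(df['Host'], taxonomy_to_id)`, after missing values have been replaced with ''.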

In [11]:
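# assign taxonomy id for host
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: taxonomy_to_id.get(s, ''))
# keep only rows whose host could be mapped; copy to avoid SettingWithCopyWarning on later assignments
df = df.query("hostTaxonomyId != ''").copy()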

In [12]:
df['hostTaxonomyId'].unique()

array(['taxonomy:9606', 'taxonomy:9666', 'taxonomy:9685',
       'taxonomy:419130'], dtype=object)

In [13]:
# for consistency with NCBI, use lower case
df['sex'] = df['sex'].str.lower()

In [14]:
df.head()

Unnamed: 0,name,Admin Division,gisaid_epi_isl,url,Country,age,sex,Host,Submission Date,Submitting Lab,Originating Lab,Region,exposureAdmin1,clade,collectionDate,Author,genbank_accession,Location,exposureCountry,taxonomyId,hostTaxonomyId
0,Shanghai/SH0007/2020,Shanghai,EPI_ISL_416320,,China,65,male,Human,Older,National Research Center for Translational Med...,"Shanghai Public Health Clinical Center, Shangh...",Asia,Shanghai,,2020-01-28,Shengyue Wang et al,,,,taxonomy:2697049,taxonomy:9606
1,Wuhan/WH01/2019,Hubei,EPI_ISL_406798,,China,44,male,Human,Older,"BGI & Institute of Microbiology, Chinese Acade...",General Hospital of Central Theater Command of...,Asia,Hubei,,2019-12-26,Weijun Chen et al,LR757998,Wuhan,,taxonomy:2697049,taxonomy:9606
2,Hangzhou/ZJU-07/2020,Zhejiang,EPI_ISL_416425,,China,0,male,Human,Older,State Key Laboratory for Diagnosis and Treatme...,State Key Laboratory for Diagnosis and Treatme...,Asia,Zhejiang,,2020-02-03,Hangping Yao et al,,Hangzhou,,taxonomy:2697049,taxonomy:9606
3,Guangzhou/GZMU0060/2020,Guangdong,EPI_ISL_429103,,China,25,,Human,One month ago,BGI-shenzhen & The First Affiliated Hospital o...,The First Affiliated Hospital of Guangzhou Med...,Asia,Guangdong,,2020-02-09,et al,,Guangzhou,,taxonomy:2697049,taxonomy:9606
4,Wuhan/IVDC-HB-05/2019,Hubei,EPI_ISL_402121,,China,32,male,Human,Older,National Institute for Viral Disease Control a...,National Institute for Viral Disease Control a...,Asia,Hubei,,2019-12-30,Wenjie Tan et al,,Wuhan,,taxonomy:2697049,taxonomy:9606


### Create unique and interoperable identifiers

**id**: URI formed by appending the GISAID accession (`gisaid_epi_isl`) to [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)

In [15]:
df['id'] = 'https://www.gisaid.org/' + df['gisaid_epi_isl']
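
The concatenation above yields a usable identifier only when `gisaid_epi_isl` is present; an empty value produces the bare prefix, and repeated accessions produce duplicate ids. A minimal check is sketched below, assuming accessions follow the `EPI_ISL_<digits>` pattern seen in the sample rows. The helper name `check_ids` is hypothetical.

```python
import re
from collections import Counter

PREFIX = 'https://www.gisaid.org/'
# Assumed accession pattern, inferred from the sample rows shown earlier
ACCESSION = re.compile(r'EPI_ISL_\d+')

def check_ids(ids):
    """Return (malformed ids, ids occurring more than once)."""
    malformed = [i for i in ids
                 if not (i.startswith(PREFIX)
                         and ACCESSION.fullmatch(i[len(PREFIX):]))]
    duplicates = [i for i, n in Counter(ids).items() if n > 1]
    return malformed, duplicates

example_ids = [PREFIX + 'EPI_ISL_416320', PREFIX + 'EPI_ISL_416320', PREFIX]
print(check_ids(example_ids))
```

`check_ids(df['id'])` would return two empty lists if every row has a well-formed, unique identifier.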

### Save data for Knowledge Graph Import

In [16]:
df = df[['id', 'name', 'taxonomyId', 'collectionDate',
         'hostTaxonomyId', 'sex', 'age', 'clade',
         'exposureCountry', 'exposureAdmin1']]
df.head()

Unnamed: 0,id,name,taxonomyId,collectionDate,hostTaxonomyId,sex,age,clade,exposureCountry,exposureAdmin1
0,https://www.gisaid.org/EPI_ISL_416320,Shanghai/SH0007/2020,taxonomy:2697049,2020-01-28,taxonomy:9606,male,65,,,Shanghai
1,https://www.gisaid.org/EPI_ISL_406798,Wuhan/WH01/2019,taxonomy:2697049,2019-12-26,taxonomy:9606,male,44,,,Hubei
2,https://www.gisaid.org/EPI_ISL_416425,Hangzhou/ZJU-07/2020,taxonomy:2697049,2020-02-03,taxonomy:9606,male,0,,,Zhejiang
3,https://www.gisaid.org/EPI_ISL_429103,Guangzhou/GZMU0060/2020,taxonomy:2697049,2020-02-09,taxonomy:9606,,25,,,Guangdong
4,https://www.gisaid.org/EPI_ISL_402121,Wuhan/IVDC-HB-05/2019,taxonomy:2697049,2019-12-30,taxonomy:9606,male,32,,,Hubei


In [17]:
df.to_csv(NEO4J_HOME / "import/01b-Nextstrain.csv", index=False)