# Load SARS-CoV-2 Strain Metadata from Nextstrain.org
**[Work in progress]**

This notebook downloads and standardizes SARS-CoV-2 strain metadata from [Nextstrain.org](https://nextstrain.org) for ingestion into a Knowledge Graph. 

This notebook reads a local copy of the [nextstrain_ncov_global_metadata.tsv](../../reference_data/nextstrain_ncov_global_metadata.tsv) file because the file must be downloaded manually from Nextstrain.org. The local copy is therefore not updated daily.

Additional information about these strains will be loaded later in the [01e-CNCBStrain.ipynb](01e-CNCBStrain.ipynb) notebook.

Data source: [Nextstrain.org](https://nextstrain.org/ncov/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import csv
import pandas as pd
import dateutil.parser  # import the submodule explicitly; dateutil.parser.parse is used below
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-328d8379-6ab4-4cc1-a397-2de37909d2e4/installation-4.1.0/import
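`os.getenv` returns `None` when the variable is not set, and `Path(None)` then fails with a `TypeError` that does not name the cause. The following is a minimal sketch of a more explicit check; `resolve_import_dir` is a hypothetical helper and not part of this notebook.

```python
import os
from pathlib import Path


def resolve_import_dir(var_name="NEO4J_IMPORT"):
    """Return the directory named by an environment variable.

    Hypothetical helper: raises a descriptive error instead of the
    TypeError that Path(None) would produce.
    """
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"{path} is not an existing directory")
    return path
```

With this helper, the cell above could read `NEO4J_IMPORT = resolve_import_dir()`.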


In [4]:
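# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False: malformed lines are skipped with a warning
df = pd.read_csv("../../reference_data/nextstrain_ncov_global_metadata.tsv", sep='\t', dtype=str, on_bad_lines='warn')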

b'Skipping line 978: expected 24 fields, saw 33\n'
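The message above means that one line was dropped because it has more fields than the header. To inspect such lines before deciding whether skipping them is acceptable, a sketch along the following lines may help. `find_malformed_rows` is a hypothetical helper, and the in-memory sample is made up to stand in for the TSV file.

```python
import csv
import io


def find_malformed_rows(handle, delimiter="\t"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))  # number of fields in the header row
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]


# Made-up sample: the third line has four fields instead of three.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n")
malformed = find_malformed_rows(sample)
print(malformed)
```

For the real file, the same function could be called on `open(path, newline='')`. Line numbers can differ from those reported by pandas if quoted fields contain line breaks.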


## Transform and standardize data

Graph databases such as Neo4j do not store "null" property values. Missing values are therefore set to the empty string `''` so that they are not represented in the graph.

In [5]:
df.replace('?', '', inplace=True)
df.replace('Unknown', '', inplace=True)
df.fillna('', inplace=True)
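The effect of these three calls can be checked on a small example. The sketch below uses made-up rows and combines both placeholder replacements into a single `replace` call; it is an illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the placeholder values '?' and 'Unknown'.
demo = pd.DataFrame({
    "Age": ["36", "?", np.nan],
    "Sex": ["Male", "Unknown", "Female"],
})

# Replace both placeholders in one call, then fill the remaining NaN values.
demo = demo.replace(["?", "Unknown"], "").fillna("")
print(demo)
```

After this step every cell is a string, so later calls such as `len(d)` or `s.lower()` do not fail on `NaN`.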

In [6]:
df.head()

Unnamed: 0,Strain,GISAID Clade,Age,Clade,Country,Country of exposure,Admin Division,Division of exposure,gisaid_epi_isl,Host,Old Nextstrain clade,Location,Originating Lab,Pangolin lineage,Submission Date,Region,Sex,Emerging clade,Submitting Lab,url,Collection Data,Author,genbank_accession,Region of exposure
0,Guangdong/20SF179/2020,O,36,19A,China,China,Guangdong,Guangdong,EPI_ISL_428451,Human,,Zhuhai,Guangdong Provincial Center for Diseases Contr...,B,Older,Asia,Male,19A,"School of Public Health, The University of Hon...",,2020-01-22,Bosheng Li et al,,
1,Taiwan/TSGH-37/2020,L,21,19A,Taiwan,Taiwan,Taiwan,Taiwan,EPI_ISL_457733,Human,,Taipei,TSGH-CP molecular lab,B,Older,Asia,Male,19A,TSGH-CP molecular lab,,2020-02-08,Cherng-Lih Perng et al,,
2,Wuhan/IVDC-HB-05/2019,L,32,19A,China,China,Hubei,Hubei,EPI_ISL_402121,Human,,Wuhan,National Institute for Viral Disease Control a...,B,Older,Asia,Male,19A,National Institute for Viral Disease Control a...,,2019-12-30,Wenjie Tan et al A (https://dx.doi.org/10.1056...,,
3,Wuhan/WIV07/2019,L,56,19A,China,China,Hubei,Hubei,EPI_ISL_402130,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,Male,19A,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,MN996531,
4,Wuhan/WH01/2019,L,44,19A,China,China,Hubei,Hubei,EPI_ISL_406798,Human,,Wuhan,General Hospital of Central Theater Command of...,B,Older,Asia,Male,19A,"BGI & Institute of Microbiology, Chinese Acade...",,2019-12-26,Weijun Chen et al (https://dx.doi.org/10.1016/...,LR757998,


Apply Neo4j property naming conventions

In [7]:
df.rename(columns={'Strain': 'name', 'Clade': 'clade', 'Age': 'age', 'Sex': 'sex', 'Collection Data': 'collectionDate'}, inplace=True)
df.rename(columns={'Country of exposure': 'exposureCountry', 'Division of exposure': 'exposureAdmin1'}, inplace=True)

Fix collection dates

In [8]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
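The lambda above raises an exception on any non-empty string that `dateutil` cannot parse. If the metadata contains such values, a more tolerant variant could look like the sketch below; `parse_collection_date` is a hypothetical helper, and mapping unparseable values to `''` is an assumption about the desired behavior.

```python
from datetime import datetime

import dateutil.parser


def parse_collection_date(text):
    """Parse a date string; return '' for empty or unparseable input."""
    if not text:
        return ''
    try:
        return dateutil.parser.parse(text)
    except (ValueError, OverflowError):
        # dateutil signals unparseable input with ValueError (or a subclass of it)
        return ''


print(parse_collection_date('2020-01-22'))
```

Note that `dateutil.parser.parse` fills in missing date components from the current date by default, so a year-only value such as `'2020'` would not raise an error but would receive today's month and day.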

Taxonomy

In [9]:
# assign taxonomy for SARS-CoV-2
df['taxonomyId'] = 'taxonomy:2697049'

In [10]:
# read Organism reference dictionary
data = pd.read_csv("../../reference_data/OrganismDictionary.csv", comment='#', index_col=False)
# map organism name -> taxonomy id
organism_to_id = dict(zip(data['organism'], data['taxonomyId']))

In [11]:
# assign taxonomy id for host
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: organism_to_id.get(s.lower(), s))

In [12]:
df['hostTaxonomyId'].unique()

array(['taxonomy:9606', '', 'taxonomy:9666', 'taxonomy:9685',
       'Environment'], dtype=object)
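The output shows that host names without an entry in the dictionary, such as `Environment`, pass through unchanged because `dict.get` falls back to the original string. The sketch below illustrates this fallback and one way to list unmapped values. The one-entry dictionary and the `to_taxonomy_id` helper are assumptions for illustration; the real mapping is read from `OrganismDictionary.csv`.

```python
# Hypothetical excerpt of the organism dictionary.
organism_to_id = {"human": "taxonomy:9606"}


def to_taxonomy_id(host):
    """Look up a host name case-insensitively; return the name itself if unmapped."""
    key = host.strip()
    return organism_to_id.get(key.lower(), key)


hosts = ["Human", "human ", "Environment", ""]
mapped = [to_taxonomy_id(h) for h in hosts]

# Non-empty values that were not converted to a taxonomy id
unmapped = sorted({m for m in mapped if m and not m.startswith("taxonomy:")})
print(unmapped)
```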

In [13]:
# for consistency with NCBI, use lower case
df['sex'] = df['sex'].str.lower()

In [14]:
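def create_location_string(x):
    # x holds the Country, Admin Division, and Location values, in that order
    country, division, location = x
    y = country
    # Nextstrain sometimes assigns the country name to Admin Division; ignore Admin Division in that case
    if division != '' and division != country:
        y = y + ',' + division
    if location != '':
        y = y + ',' + location
    return y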

In [15]:
df['origLocation'] = df[['Country', 'Admin Division', 'Location']].apply(create_location_string, axis=1)
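The rule for building the location string can be checked in isolation. The sketch below restates it as a plain function, `build_location` (a hypothetical name), and applies it to two rows from the table above.

```python
def build_location(country, division, location):
    """Join country, division, and location, skipping empty or repeated parts."""
    parts = [country]
    # Skip the division when it is empty or merely repeats the country name
    if division and division != country:
        parts.append(division)
    if location:
        parts.append(location)
    return ",".join(parts)


print(build_location("China", "Guangdong", "Zhuhai"))
print(build_location("Taiwan", "Taiwan", "Taipei"))
```

The second call drops the division because Nextstrain lists `Taiwan` as both the country and the admin division.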

In [16]:
df.head()

Unnamed: 0,name,GISAID Clade,age,clade,Country,exposureCountry,Admin Division,exposureAdmin1,gisaid_epi_isl,Host,Old Nextstrain clade,Location,Originating Lab,Pangolin lineage,Submission Date,Region,sex,Emerging clade,Submitting Lab,url,collectionDate,Author,genbank_accession,Region of exposure,taxonomyId,hostTaxonomyId,origLocation
0,Guangdong/20SF179/2020,O,36,19A,China,China,Guangdong,Guangdong,EPI_ISL_428451,Human,,Zhuhai,Guangdong Provincial Center for Diseases Contr...,B,Older,Asia,male,19A,"School of Public Health, The University of Hon...",,2020-01-22,Bosheng Li et al,,,taxonomy:2697049,taxonomy:9606,"China,Guangdong,Zhuhai"
1,Taiwan/TSGH-37/2020,L,21,19A,Taiwan,Taiwan,Taiwan,Taiwan,EPI_ISL_457733,Human,,Taipei,TSGH-CP molecular lab,B,Older,Asia,male,19A,TSGH-CP molecular lab,,2020-02-08,Cherng-Lih Perng et al,,,taxonomy:2697049,taxonomy:9606,"Taiwan,Taipei"
2,Wuhan/IVDC-HB-05/2019,L,32,19A,China,China,Hubei,Hubei,EPI_ISL_402121,Human,,Wuhan,National Institute for Viral Disease Control a...,B,Older,Asia,male,19A,National Institute for Viral Disease Control a...,,2019-12-30,Wenjie Tan et al A (https://dx.doi.org/10.1056...,,,taxonomy:2697049,taxonomy:9606,"China,Hubei,Wuhan"
3,Wuhan/WIV07/2019,L,56,19A,China,China,Hubei,Hubei,EPI_ISL_402130,Human,,Wuhan,Wuhan Jinyintan Hospital,B,Older,Asia,male,19A,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Peng Zhou et al,MN996531,,taxonomy:2697049,taxonomy:9606,"China,Hubei,Wuhan"
4,Wuhan/WH01/2019,L,44,19A,China,China,Hubei,Hubei,EPI_ISL_406798,Human,,Wuhan,General Hospital of Central Theater Command of...,B,Older,Asia,male,19A,"BGI & Institute of Microbiology, Chinese Acade...",,2019-12-26,Weijun Chen et al (https://dx.doi.org/10.1016/...,LR757998,,taxonomy:2697049,taxonomy:9606,"China,Hubei,Wuhan"


### Create unique and interoperable identifiers

**id**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)

In [17]:
df['id'] = 'https://www.gisaid.org/' + df['gisaid_epi_isl']
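Because missing values were replaced with `''` earlier, a row without a GISAID accession receives the bare prefix as its `id`. The following sketch, on made-up rows, shows one way to flag such rows and duplicated identifiers before import. Whether the real file contains either case is not known from this notebook.

```python
import pandas as pd

# Made-up accessions: one missing value and one duplicate.
demo = pd.DataFrame({"gisaid_epi_isl": ["EPI_ISL_428451", "", "EPI_ISL_428451"]})
demo["id"] = "https://www.gisaid.org/" + demo["gisaid_epi_isl"]

# Rows whose id is only the prefix
missing = demo["gisaid_epi_isl"].eq("")
# Rows that share an id with another row (ignoring the missing ones)
duplicated = demo["id"].duplicated(keep=False) & ~missing

print(int(missing.sum()), int(duplicated.sum()))
```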

### Save data for Knowledge Graph Import

In [18]:
df = df[['id', 'name', 'taxonomyId', 'collectionDate',
         'hostTaxonomyId', 'sex', 'age', 'clade',
         'exposureCountry', 'exposureAdmin1', 'origLocation']]
df.head()

Unnamed: 0,id,name,taxonomyId,collectionDate,hostTaxonomyId,sex,age,clade,exposureCountry,exposureAdmin1,origLocation
0,https://www.gisaid.org/EPI_ISL_428451,Guangdong/20SF179/2020,taxonomy:2697049,2020-01-22,taxonomy:9606,male,36,19A,China,Guangdong,"China,Guangdong,Zhuhai"
1,https://www.gisaid.org/EPI_ISL_457733,Taiwan/TSGH-37/2020,taxonomy:2697049,2020-02-08,taxonomy:9606,male,21,19A,Taiwan,Taiwan,"Taiwan,Taipei"
2,https://www.gisaid.org/EPI_ISL_402121,Wuhan/IVDC-HB-05/2019,taxonomy:2697049,2019-12-30,taxonomy:9606,male,32,19A,China,Hubei,"China,Hubei,Wuhan"
3,https://www.gisaid.org/EPI_ISL_402130,Wuhan/WIV07/2019,taxonomy:2697049,2019-12-30,taxonomy:9606,male,56,19A,China,Hubei,"China,Hubei,Wuhan"
4,https://www.gisaid.org/EPI_ISL_406798,Wuhan/WH01/2019,taxonomy:2697049,2019-12-26,taxonomy:9606,male,44,19A,China,Hubei,"China,Hubei,Wuhan"


In [19]:
df.to_csv(NEO4J_IMPORT / "01b-Nextstrain.csv", index=False)

In [20]:
df.shape

(4702, 11)