# Load SARS-CoV-2 Strain Data
**[Work in progress]**

This notebook downloads SARS-CoV-2 strain metadata from [Nextstrain.org](https://nextstrain.org), which obtains it from [GISAID](https://www.gisaid.org/), and standardizes it for ingestion into a Knowledge Graph.

Data source: [git repo](https://github.com/nextstrain/ncov)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil.parser  # import the parser submodule explicitly; it is used below
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3
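
`Path(os.getenv('NEO4J_HOME'))` raises a `TypeError` when the environment variable is not set. A minimal sketch of a more readable guard follows; the helper name `resolve_neo4j_home` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def resolve_neo4j_home(env=os.environ):
    """Return NEO4J_HOME as a Path, or raise a readable error if it is unset.

    Hypothetical helper: `env` is any mapping, which makes the function easy to test.
    """
    value = env.get("NEO4J_HOME", "")
    if not value:
        raise RuntimeError(
            "NEO4J_HOME is not set; point it at the Neo4j installation directory"
        )
    return Path(value)
```

With this helper, the cell above could be written as `NEO4J_HOME = resolve_neo4j_home()`.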


In [4]:
df = pd.read_csv("https://github.com/nextstrain/ncov/raw/master/data/metadata.tsv", sep = '\t', dtype=str)

## Transform and standardize data

Graph databases don't have "null" values. Missing values are set to '' so that they are not represented in the graph. The placeholders '?' and 'Unknown' are treated as missing values as well.

In [5]:
df.replace('?', '', inplace=True)
df.replace('Unknown', '', inplace=True)
df.fillna('', inplace=True)

In [6]:
df.head()

Unnamed: 0,strain,virus,gisaid_epi_isl,genbank_accession,date,region,country,division,location,region_exposure,country_exposure,division_exposure,segment,length,host,age,sex,originating_lab,submitting_lab,authors,url,title,date_submitted
0,Algeria/G0638_2264/2020,ncov,EPI_ISL_418241,,2020-03-02,Africa,Algeria,Boufarik,,Africa,Algeria,Boufarik,genome,29862,Human,28,Female,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-03-29
1,Algeria/G0640_2265/2020,ncov,EPI_ISL_418242,,2020-03-08,Africa,Algeria,Blida,,Africa,Algeria,Blida,genome,29867,Human,87,Male,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-03-29
2,Algeria/G0860_2262/2020,ncov,EPI_ISL_420037,,2020-03-02,Africa,Algeria,Boufarik,,Africa,Algeria,Boufarik,genome,29862,Human,41,Male,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-04-04
3,Anhui/SZ005/2020,ncov,EPI_ISL_413485,,2020-01-24,Asia,China,Anhui,Suzhou,Asia,China,Anhui,genome,29860,Human,58,Male,"Department of microbiology laboratory,Anhui Pr...","Department of microbiology laboratory,Anhui Pr...",Li et al,https://www.gisaid.org,,2020-03-05
4,Argentina/C121/2020,ncov,EPI_ISL_420600,,2020-03-07,South America,Argentina,Argentina,,South America,Argentina,,genome,29903,Human,51,Male,Servicio Virosis Respiratorias-Departamento Vi...,Instituto Nacional Enfermedades Infecciosas C....,Baumeister et al,https://www.gisaid.org,,2020-04-06


Fix collection dates

In [7]:
# TODO: date standardization introduces artifacts, e.g. Dec 2019 -> 2019-12-01
# Add column that specifies time granularity: Y, M, D
df.rename(columns={'date': 'collection_date'}, inplace=True)
# fix dates with this format: 2020-01-XX
df['collection_date'] = df['collection_date'].str.replace('-XX','')
df['collection_date'] = df['collection_date'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
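
The TODO above asks for a column that records the time granularity (Y, M, D) of each collection date. One possible approach, sketched under the assumption that dates arrive as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` with `XX` marking unknown parts: parse the components by hand, fill unknown parts with 1, and return the granularity alongside the date. The function name `parse_partial_date` is hypothetical. It also sidesteps a point worth verifying in the cell above: as far as I know, `dateutil.parser.parse` fills fields missing from the input using its `default` argument, which is the current date when none is passed.

```python
from datetime import date


def parse_partial_date(text):
    """Return (date, granularity) for 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'.

    Components that are missing or given as 'XX' are filled with 1.
    Granularity is 'Y', 'M', or 'D'. Empty input returns ('', '').
    """
    parts = []
    for part in text.strip().split("-"):
        # stop at the first unknown component, e.g. 2020-01-XX -> ['2020', '01']
        if not part or part.upper() == "XX":
            break
        parts.append(part)
    parts = parts[:3]
    if not parts:
        return "", ""
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    granularity = "YMD"[len(parts) - 1]
    return date(year, month, day), granularity
```

Applied to the raw date strings (before any other parsing), `df['collection_date'].apply(parse_partial_date)` would yield one tuple per row, which can then be split into a date column and a granularity column.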

Rename the `strain` column to `name`. The unique strain id is assigned further below; the GenBank accession, when available, is kept as an alias to enable joining with NCBI data.

In [8]:
df.rename(columns={'strain': 'name'}, inplace=True)

Taxonomy

In [9]:
# assign taxonomy for SARS-CoV-2
df['taxonomy_id'] = 'taxonomy:2697049'

In [10]:
# TODO: find a general solution to map host names to NCBI taxonomy

# some host specifications are ambiguous, 
# they don't match a specific species in NCBI taxonomy

taxonomy_to_id = {'Human': 'taxonomy:9606', 
                  'Homo sapiens': 'taxonomy:9606',
                  'Rhinolophus affinis': 'taxonomy:59477', 
                  'Rhinolophus sp. (bat)': 'taxonomy:49442',
                  'bat': 'taxonomy:49442',
                  'Manis javanica': 'taxonomy:9974',
                  'palm civet': 'taxonomy:71116',
                  'Canine': 'taxonomy:9608'
                 }

In [11]:
# assign taxonomy id to host
df['host'] = df['host'].str.strip()
df['host_taxonomy_id'] = df['host'].apply(lambda s: taxonomy_to_id.get(s, ''))
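
Because unknown host names silently map to '', it can help to list the names the dictionary does not cover. The following is a minimal sketch; the helper `unmapped_hosts` is hypothetical, and the host name `Mink` in the usage note is an invented example value, not taken from the data.

```python
from collections import Counter


def unmapped_hosts(hosts, mapping):
    """Count host names that have no entry in the taxonomy mapping.

    Names are stripped of surrounding whitespace; blank names are ignored.
    """
    cleaned = (h.strip() for h in hosts)
    return Counter(h for h in cleaned if h and h not in mapping)
```

For example, `unmapped_hosts(['Human', 'Mink', '', 'Mink'], {'Human': 'taxonomy:9606'})` reports `Mink` twice. In this notebook the call would be `unmapped_hosts(df['host'], taxonomy_to_id)`.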

In [12]:
# for consistency with NCBI, use lower case
df['sex'] = df['sex'].str.lower()

In [13]:
df.head()

Unnamed: 0,name,virus,gisaid_epi_isl,genbank_accession,collection_date,region,country,division,location,region_exposure,country_exposure,division_exposure,segment,length,host,age,sex,originating_lab,submitting_lab,authors,url,title,date_submitted,taxonomy_id,host_taxonomy_id
0,Algeria/G0638_2264/2020,ncov,EPI_ISL_418241,,2020-03-02,Africa,Algeria,Boufarik,,Africa,Algeria,Boufarik,genome,29862,Human,28,female,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-03-29,taxonomy:2697049,taxonomy:9606
1,Algeria/G0640_2265/2020,ncov,EPI_ISL_418242,,2020-03-08,Africa,Algeria,Blida,,Africa,Algeria,Blida,genome,29867,Human,87,male,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-03-29,taxonomy:2697049,taxonomy:9606
2,Algeria/G0860_2262/2020,ncov,EPI_ISL_420037,,2020-03-02,Africa,Algeria,Boufarik,,Africa,Algeria,Boufarik,genome,29862,Human,41,male,NIC Viral Respiratory Unit - Institut Pasteur ...,National Reference Center for Viruses of Respi...,Albert et al,https://www.gisaid.org,,2020-04-04,taxonomy:2697049,taxonomy:9606
3,Anhui/SZ005/2020,ncov,EPI_ISL_413485,,2020-01-24,Asia,China,Anhui,Suzhou,Asia,China,Anhui,genome,29860,Human,58,male,"Department of microbiology laboratory,Anhui Pr...","Department of microbiology laboratory,Anhui Pr...",Li et al,https://www.gisaid.org,,2020-03-05,taxonomy:2697049,taxonomy:9606
4,Argentina/C121/2020,ncov,EPI_ISL_420600,,2020-03-07,South America,Argentina,Argentina,,South America,Argentina,,genome,29903,Human,51,male,Servicio Virosis Respiratorias-Departamento Vi...,Instituto Nacional Enfermedades Infecciosas C....,Baumeister et al,https://www.gisaid.org,,2020-04-06,taxonomy:2697049,taxonomy:9606


In [14]:
df['admin1_exposure'] = df['division_exposure']

### Read Clade information
Clade info is missing from the file downloaded above. The file with clade info can only be downloaded manually from the Nextstrain.org web site. Furthermore, that file has fewer strains and even fewer clade assignments. GitHub issues have been filed for both problems (https://github.com/nextstrain/ncov/issues/207, https://github.com/nextstrain/ncov/issues/208).

In [15]:
clade = pd.read_csv("../reference_data/nextstrain_ncov_global_metadata.tsv", sep = '\t', dtype=str)

In [16]:
clade.head()

Unnamed: 0,Strain,Age,Clade,Country,Admin Division,gisaid_epi_isl,Host,Location,Originating Lab,Submission Date,Region,Submitting Lab,url,Collection Data,Author,Sex,genbank_accession,Exposure History
0,Guangzhou/GZMU0060/2020,25,,China,Guangdong,EPI_ISL_429103,Human,Guangzhou,The First Affiliated Hospital of Guangzhou Med...,3-7 days ago,Asia,BGI-shenzhen & The First Affiliated Hospital o...,,2020-02-09,et al,,,
1,Wuhan/IVDC-HB-04/2020,61,,China,Hubei,EPI_ISL_402120,Human,Wuhan,National Institute for Viral Disease Control a...,Older,Asia,National Institute for Viral Disease Control a...,,2020-01-01,Tan et al,Male,,
2,Shanghai/SH0036/2020,52,,China,Shanghai,EPI_ISL_416345,Human,,"Shanghai Public Health Clinical Center, Shangh...",Older,Asia,National Research Center for Translational Med...,,2020-02-04,Wang et al,Female,,
3,Hangzhou/ZJU-07/2020,0,,China,Zhejiang,EPI_ISL_416425,Human,Hangzhou,State Key Laboratory for Diagnosis and Treatme...,Older,Asia,State Key Laboratory for Diagnosis and Treatme...,,2020-02-03,Yao et al,Male,,
4,Wuhan/WIV07/2019,56,,China,Hubei,EPI_ISL_402130,Human,Wuhan,Wuhan Jinyintan Hospital,Older,Asia,"Wuhan Institute of Virology, Chinese Academy o...",,2019-12-30,Zhou et al,Male,MN996531,


### Merge strain data with clade assignments
Strains with similar genomic sequences are clustered into clades.

In [17]:
# select and rename in one step to avoid modifying a slice in place (SettingWithCopyWarning)
clade = clade[['Strain', 'Clade']].rename(columns={'Strain': 'name', 'Clade': 'clade'})
df = df.merge(clade, on='name', how='left')
df.fillna('', inplace=True)

### Create unique and interoperable identifiers

**id**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)

**alias**: CURIE: [insdc](https://registry.identifiers.org/registry/insdc) (International Nucleotide Sequence Database Collaboration, INSDC)

A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is an abbreviated form of a Uniform Resource Identifier (URI) that can be resolved by [Identifiers.org](https://identifiers.org/).

In [18]:
# the 'strain' column was already renamed to 'name' above
df['id'] = 'https://www.gisaid.org/' + df['gisaid_epi_isl']
df['alias'] = df['genbank_accession'].apply(lambda s:  "insdc:" + s if len(s) > 0 else s)
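
To illustrate how such a CURIE resolves, here is a minimal sketch that expands an alias into an Identifiers.org URL by appending `prefix:id` to the resolver base. The function name `curie_to_url` is hypothetical; the accession in the usage note is taken from the clade table above.

```python
def curie_to_url(curie, resolver="https://identifiers.org/"):
    """Expand a CURIE such as 'insdc:MN996531' into a resolvable URL.

    An empty string (no alias) is returned unchanged.
    """
    if not curie:
        return ""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"{resolver}{prefix}:{local_id}"
```

For example, `curie_to_url('insdc:MN996531')` gives `https://identifiers.org/insdc:MN996531`.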

### Save data for Knowledge Graph Import

In [19]:
df = df[['id', 'name', 'alias', 'taxonomy_id', 'collection_date',
         'host_taxonomy_id', 'sex', 'age', 'clade',
         'country_exposure', 'admin1_exposure', 'country', 'division', 'location']]
df.head()

Unnamed: 0,id,name,alias,taxonomy_id,collection_date,host_taxonomy_id,sex,age,clade,country_exposure,admin1_exposure,country,division,location
0,https://www.gisaid.org/EPI_ISL_418241,Algeria/G0638_2264/2020,,taxonomy:2697049,2020-03-02,taxonomy:9606,female,28,A2a,Algeria,Boufarik,Algeria,Boufarik,
1,https://www.gisaid.org/EPI_ISL_418242,Algeria/G0640_2265/2020,,taxonomy:2697049,2020-03-08,taxonomy:9606,male,87,A2a,Algeria,Blida,Algeria,Blida,
2,https://www.gisaid.org/EPI_ISL_420037,Algeria/G0860_2262/2020,,taxonomy:2697049,2020-03-02,taxonomy:9606,male,41,A2a,Algeria,Boufarik,Algeria,Boufarik,
3,https://www.gisaid.org/EPI_ISL_413485,Anhui/SZ005/2020,,taxonomy:2697049,2020-01-24,taxonomy:9606,male,58,B,China,Anhui,China,Anhui,Suzhou
4,https://www.gisaid.org/EPI_ISL_420600,Argentina/C121/2020,,taxonomy:2697049,2020-03-07,taxonomy:9606,male,51,A2a,Argentina,,Argentina,Argentina,


In [20]:
df.to_csv(NEO4J_HOME / "import/01b-Nextstrain.csv", index=False)