# Load SARS-CoV-2 Virus Strain Metadata from CNCB
**[Work in progress]**

This notebook downloads and standardizes viral strain data from CNCB for ingestion into a Knowledge Graph.

Data source: [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import shutil
import glob
import ftplib
import re
import dateutil.parser  # explicit submodule import; dateutil.parser.parse is used below
import pandas as pd
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
metadata_url = "https://bigd.big.ac.cn/ncov/genome/export/meta"

In [4]:
# Path will take care of handling operating system differences.
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
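
If the `NEO4J_IMPORT` environment variable is not set, `os.getenv` returns `None` and `Path(None)` raises a `TypeError` that does not name the cause. A minimal sketch of a guard follows; the helper name `get_import_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name="NEO4J_IMPORT"):
    """Return the directory named by an environment variable as a Path.

    Raises EnvironmentError with a descriptive message when the variable
    is missing or empty, instead of the TypeError raised by Path(None).
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    return Path(value)
```

With this helper, the cell above could be written as `NEO4J_IMPORT = get_import_dir()`.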


### Download strain metadata

In [5]:
df = pd.read_excel(metadata_url, dtype='str')
df.fillna('', inplace=True)

In [6]:
print("Total number of strains:", df.shape[0])

Total number of strains: 203052


In [7]:
df.head(10)

Unnamed: 0,Virus Strain Name,Accession ID,Data Source,Related ID,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,Sample Collection Date,Location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,Complete,29848,High,0/0/0/1/NO,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 20:04:48,2020-09-09 11:31:17
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-09-09 11:31:17
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,Complete,29848,High,0/0/0/0/NO,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-09-09 11:31:17
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,Complete,29896,High,0/0/0/2/NO,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-09-09 11:31:17
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,Complete,29891,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-09-09 11:31:17
5,BetaCoV/Wuhan/IVDC-HB-05/2019,NMDC60013086-01,NMDC,EPI_ISL_402121,Complete,29891,High,0/0/0/2/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-09-09 11:31:17
6,BetaCoV/Kanagawa/1/2020,EPI_ISL_402126,GISAID,,Partial,369,High,0/0/NA/NA/NA,Homo sapiens,2020-01-14,Japan / Kanagawa Prefecture,"Department of Virology III, National Institute...",2020-01-16,"Department of Virology III, National Institute...",2020-01-20 20:04:48,2020-01-20 20:04:48
7,Wuhan-Hu-1,MN908947,GenBank,"NC_045512,EPI_ISL_402125",Complete,29903,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,Shanghai Public Health Clinical Center & Schoo...,2020-01-17,Shanghai Public Health Clinical Center & Schoo...,2020-01-20 20:04:48,2020-05-20 11:14:12
8,BetaCoV/Zhejiang/WZ-01/2020,NMDC60013099-01,NMDC,EPI_ISL_404227,Complete,29839,High,0/0/0/2/NO,Homo sapiens,2020-01-16,China / Zhejiang,"Department of Microbiology, Zhejiang Provincia...",2020-01-21,"Department of Microbiology, Zhejiang Provincia...",2020-01-22 20:56:56,2020-09-09 11:31:17
9,BetaCoV/Zhejiang/WZ-02/2020,NMDC60013100-01,NMDC,EPI_ISL_404228,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-17,China / Zhejiang,"Department of Microbiology, Zhejiang Provincia...",2020-01-21,"Department of Microbiology, Zhejiang Provincia...",2020-01-22 20:56:56,2020-09-09 11:31:17


### Assign identifiers, aliases, and compact identifiers (CURIEs)

In [8]:
# https://registry.identifiers.org/registry/insdc
# raw strings keep regex escapes such as \d from being read as string escapes
insdc_pattern = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')
# https://registry.identifiers.org/registry/refseq
refseq_pattern = re.compile(r'^(((AC|AP|NC|NG|NM|NP|NR|NT|NW|XM|XP|XR|YP|ZP)_\d+)|(NZ\_[A-Z]{2,4}\d+))(\.\d+)?$')

In [9]:
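def assign_curie(identifier):
    """Return a compact identifier (CURIE) or URI for a single accession."""
    identifier = identifier.strip()
    if not identifier:
        return identifier
    if identifier.startswith('EPI'):
        # GISAID ids are returned as URIs rather than CURIEs
        return 'https://www.gisaid.org/' + identifier
    if refseq_pattern.match(identifier) is not None:
        return 'refseq:' + identifier
    if insdc_pattern.match(identifier) is not None:
        return 'insdc:' + identifier
    # TODO are URIs available for these cases?
    return identifier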
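
To illustrate which branch each kind of identifier takes, here is a small self-contained sketch. It reuses the two registry patterns from above; the `classify` helper and its labels are hypothetical and exist only for this illustration.

```python
import re

# Patterns as published by identifiers.org for INSDC and RefSeq
INSDC = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')
REFSEQ = re.compile(r'^(((AC|AP|NC|NG|NM|NP|NR|NT|NW|XM|XP|XR|YP|ZP)_\d+)|(NZ_[A-Z]{2,4}\d+))(\.\d+)?$')


def classify(identifier):
    """Label an identifier with the branch that assign_curie would take."""
    identifier = identifier.strip()
    if not identifier:
        return 'empty'
    if identifier.startswith('EPI'):
        return 'gisaid'
    if REFSEQ.match(identifier):
        return 'refseq'
    if INSDC.match(identifier):
        return 'insdc'
    return 'unresolved'  # returned unchanged, e.g. NMDC accessions


# Identifiers taken from the table shown earlier
examples = ['MN908947', 'NC_045512', 'EPI_ISL_402125', 'NMDC60013088-01', ' ']
results = {e: classify(e) for e in examples}
```

NMDC accessions such as `NMDC60013088-01` match neither pattern because of the `-01` suffix, so they pass through without a prefix.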

In [10]:
def assign_curies(ids):
    return [assign_curie(id) for id in ids.split(',')]

In [11]:
def get_gisaid_id(ids):
    for id in ids:
        if id.startswith('https://www.gisaid.org/'):
            return id
        
    return ''

#### Rename and concatenate fields

In [12]:
#df['Accession ID'] = df['Accession ID'].str.strip()
#df['Related ID'] = df['Related ID'].str.strip()

# combine all ids into an accession column and assign curies
df['accessions'] = df['Accession ID'] + df['Related ID'].apply(lambda s: ',' + s if len(s) > 0 else s)
df['accessions'] = df['accessions'].apply(assign_curies)
df['gisaidId'] = df['accessions'].apply(get_gisaid_id)
df['accessions'] = df['accessions'].apply(lambda x: ';'.join(x))

df['accession'] = df['Accession ID'].apply(assign_curie)
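
Some `accessions` values in the outputs further down end with a separator (for example `insdc:LR877414;`), which would be consistent with a `Related ID` that contains only whitespace. The sketch below shows one way to drop blank entries before joining. The `combine_ids` helper is hypothetical and is not what the notebook currently does.

```python
def combine_ids(accession_id, related_ids):
    """Merge a primary id with a comma-separated string of related ids.

    Whitespace is stripped from every entry and blank entries are dropped,
    so a later ';'.join(...) cannot produce a trailing separator.
    """
    candidates = [accession_id] + related_ids.split(',')
    return [c.strip() for c in candidates if c.strip()]
```

Under that assumption, `combine_ids('LR877414', ' ')` yields a single-element list, so the joined value carries no trailing `;`.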

In [13]:
df.rename(columns={
    'Data Source': 'source',
    'Sequence Length': 'sequenceLength',
    'Sequence Quality': 'sequenceQuality',
    'Quality Assessment': 'qualityAssessment',
    'Originating Lab': 'originatingLab',
    'Virus Strain Name': 'name',
    'Sample Collection Date': 'collectionDate',
    'Location': 'location',
}, inplace=True)

Remove invalid collection dates (records with the placeholder value 2020-00-00)

In [14]:
df.query("collectionDate == '2020-00-00'")

Unnamed: 0,name,Accession ID,source,Related ID,Nuc.Completeness,sequenceLength,sequenceQuality,qualityAssessment,Host,collectionDate,location,originatingLab,Submission Date,Submitting Lab,Create Time,Last Update Time,accessions,gisaidId,accession
97092,covid_hub_pl_ibch_0028,LR877414,GenBank,,Partial,29903,Low,27050/0/NA/NA/NA,Homo sapiens,2020-00-00,Poland,WSSE,2020-08-17,"COVID-HUB-PL, Institute of Bioorganic Chemistr...",2020-09-09 18:34:25,2020-09-14 23:26:05,insdc:LR877414;,,insdc:LR877414
97102,covid_hub_pl_ibch_0044,LR877424,GenBank,,Complete,29903,Low,654/0/NA/NA/NA,Homo sapiens,2020-00-00,Poland,WSSE,2020-08-17,"COVID-HUB-PL, Institute of Bioorganic Chemistr...",2020-09-09 18:34:25,2020-09-14 23:26:05,insdc:LR877424;,,insdc:LR877424


In [15]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: '' if d == '2020-00-00' else d)

In [16]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
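
The two cells above special-case the one known placeholder, `2020-00-00`. As a hedged alternative, the sketch below rejects any string that is not a valid ISO date, using only the standard library. The `parse_collection_date` helper is hypothetical. Note that it is stricter than `dateutil`: inputs that are not a full `YYYY-MM-DD` date are treated as missing, and whether the CNCB export contains such values is not shown here.

```python
from datetime import date


def parse_collection_date(text):
    """Return a date for a valid YYYY-MM-DD string, or None otherwise.

    Blank strings and impossible dates such as '2020-00-00' both map to None.
    """
    text = text.strip()
    if not text:
        return None
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None
```

Applied with `df['collectionDate'].apply(parse_collection_date)`, this would make the explicit `2020-00-00` check unnecessary, at the cost of storing `None` rather than an empty string for missing dates.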

In [17]:
df[df['accessions'].str.contains('refseq:NC_045512')]

Unnamed: 0,name,Accession ID,source,Related ID,Nuc.Completeness,sequenceLength,sequenceQuality,qualityAssessment,Host,collectionDate,location,originatingLab,Submission Date,Submitting Lab,Create Time,Last Update Time,accessions,gisaidId,accession
7,Wuhan-Hu-1,MN908947,GenBank,"NC_045512,EPI_ISL_402125",Complete,29903,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,Shanghai Public Health Clinical Center & Schoo...,2020-01-17,Shanghai Public Health Clinical Center & Schoo...,2020-01-20 20:04:48,2020-05-20 11:14:12,insdc:MN908947;refseq:NC_045512;https://www.gi...,https://www.gisaid.org/EPI_ISL_402125,insdc:MN908947


In [18]:
df.head()

Unnamed: 0,name,Accession ID,source,Related ID,Nuc.Completeness,sequenceLength,sequenceQuality,qualityAssessment,Host,collectionDate,location,originatingLab,Submission Date,Submitting Lab,Create Time,Last Update Time,accessions,gisaidId,accession
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,Complete,29848,High,0/0/0/1/NO,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 20:04:48,2020-09-09 11:31:17,NMDC60013088-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402132,NMDC60013088-01
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-09-09 11:31:17,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,Complete,29848,High,0/0/0/0/NO,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-09-09 11:31:17,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,Complete,29896,High,0/0/0/2/NO,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-09-09 11:31:17,NMDC60013085-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402120,NMDC60013085-01
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,Complete,29891,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-09-09 11:31:17,NMDC60013084-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402119,NMDC60013084-01


#### Assign taxonomy ids

In [19]:
# read Organism reference dictionary
data = pd.read_csv("../../reference_data/OrganismDictionary.csv", comment='#')
organism_to_id = dict(zip(data['organism'], data['taxonomyId']))

In [20]:
print(organism_to_id)

{'human': 'taxonomy:9606', 'homo sapiens': 'taxonomy:9606', 'mus musculus': 'taxonomy:10090', 'rhinolophus affinis': 'taxonomy:59477 ', 'rhinolophus malayanus': 'taxonomy:608659', 'mustela lutreola': 'taxonomy:9666', 'panthera tigris jacksoni': 'taxonomy:419130', 'rhinolophus sp. (bat)': 'taxonomy:49442', 'bat': 'taxonomy:49442', 'manis javanica': 'taxonomy:9974', 'manis pentadactyla': 'taxonomy:143292', 'palm civet': 'taxonomy:71116', 'canine': 'taxonomy:9608', 'felis catus': 'taxonomy:9685', 'neovison vison': 'taxonomy:452646', 'mesocricetus auratus': 'taxonomy:10036', 'panthera leo': 'taxonomy:9689', 'panthera tigris': 'taxonomy:9694', 'environment': 'Environment', 'environmental': 'Environment'}


In [21]:
# assign taxonomy id to host
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: organism_to_id.get(s.lower(), s))
df['hostTaxonomyId'].unique()

array(['taxonomy:9606', 'taxonomy:59477 ', 'Environment', 'taxonomy:9974',
       'taxonomy:9608', 'taxonomy:9685', 'unknown', 'taxonomy:608659',
       'taxonomy:419130', 'taxonomy:9666', 'Vero cell culture',
       'taxonomy:10090', 'taxonomy:452646', 'taxonomy:9689',
       'taxonomy:9694', 'taxonomy:143292', 'taxonomy:10036'], dtype=object)
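
The output above still contains free-text values such as `unknown` and `Vero cell culture`, because hosts missing from the dictionary fall through unchanged. A minimal sketch for listing such unmapped hosts follows. The dictionary is a small subset of the one printed above, and the `map_host` helper is hypothetical.

```python
# Subset of the organism dictionary printed above, for illustration only
organism_to_id = {
    'homo sapiens': 'taxonomy:9606',
    'felis catus': 'taxonomy:9685',
    'environment': 'Environment',
}


def map_host(host, lookup):
    """Map a host name to its taxonomy id, or return the stripped name if unknown."""
    host = host.strip()
    return lookup.get(host.lower(), host)


hosts = ['Homo sapiens', 'Felis catus ', 'Vero cell culture', 'unknown']
mapped = [map_host(h, organism_to_id) for h in hosts]

# Anything that is neither a taxonomy CURIE nor the 'Environment' label is unmapped
unmapped = sorted({m for m in mapped
                   if not m.startswith('taxonomy:') and m != 'Environment'})
```

Reviewing `unmapped` after each download is one way to spot host names that should be added to the reference dictionary.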

In [22]:
df['taxonomyId'] = 'taxonomy:2697049' # SARS-CoV-2

#### Standardize location information

In [23]:
df[['loc0', 'loc1', 'loc2', 'loc3']] = df['location'].str.split('/', n=3, expand=True)
# strip whitespace from all string (object) columns
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [24]:
df['origLocation'] = df[['loc0', 'loc1', 'loc2', 'loc3']].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
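
The split-strip-join sequence above can be illustrated on single strings without pandas. The sketch below is an approximation under stated assumptions: `normalize_location` is a hypothetical helper, and unlike the cells above it also drops empty segments.

```python
def normalize_location(location, max_levels=4):
    """Split a '/'-delimited location into at most max_levels parts,
    strip whitespace from each part, and join the non-empty parts with ','.
    """
    parts = location.split('/', max_levels - 1)
    return ','.join(p.strip() for p in parts if p.strip())
```

For the locations in the table above, `'China / Hubei / Wuhan'` becomes `'China,Hubei,Wuhan'` and `'Thailand/ Nonthaburi Province'` becomes `'Thailand,Nonthaburi Province'`.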

### Save strain metadata

In [25]:
strains = df[['name', 'accession', 'accessions', 'gisaidId', 'source', 'taxonomyId', 'hostTaxonomyId', 
              'sequenceLength', 'sequenceQuality', 'qualityAssessment', 'collectionDate', 'location', 
              'origLocation', 'originatingLab']].copy()

In [26]:
strains.head()

Unnamed: 0,name,accession,accessions,gisaidId,source,taxonomyId,hostTaxonomyId,sequenceLength,sequenceQuality,qualityAssessment,collectionDate,location,origLocation,originatingLab
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC60013088-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402132,NMDC,taxonomy:2697049,taxonomy:9606,29848,High,0/0/0/1/NO,2019-12-30,China / Hubei,"China,Hubei",Hubei Provincial Center for Disease Control an...
1,hCoV-19/Thailand/74/2020,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963,GISAID,taxonomy:2697049,taxonomy:9606,29859,High,0/0/0/0/NO,2020-01-13,Thailand/ Nonthaburi Province,"Thailand,Nonthaburi Province","Department of Medical Sciences, Ministry of Pu..."
2,hCoV-19/Thailand/61/2020,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962,GISAID,taxonomy:2697049,taxonomy:9606,29848,High,0/0/0/0/NO,2020-01-08,Thailand/ Nonthaburi Province,"Thailand,Nonthaburi Province","Department of Medical Sciences, Ministry of Pu..."
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC60013085-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402120,NMDC,taxonomy:2697049,taxonomy:9606,29896,High,0/0/0/2/NO,2020-01-01,China / Hubei / Wuhan,"China,Hubei,Wuhan",National Institute for Viral Disease Control a...
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC60013084-01;https://www.gisaid.org/EPI_ISL...,https://www.gisaid.org/EPI_ISL_402119,NMDC,taxonomy:2697049,taxonomy:9606,29891,High,0/0/0/0/NO,2019-12-30,China / Hubei / Wuhan,"China,Hubei,Wuhan",National Institute for Viral Disease Control a...


In [27]:
print('Number of strains:', strains.shape[0])

Number of strains: 203052


In [28]:
strains.to_csv(NEO4J_IMPORT / "01c-CNCBStrainPre.csv", index=False)