# Load SARS-CoV-2 Virus Strain Data from CNCB
**[Work in progress]**

This notebook downloads and standardizes viral strain and variation data from CNCB for ingestion into a Knowledge Graph.

Data source: [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome)

Author: Peter Rose (pwrose@ucsd.edu)

In [36]:
import os
import pandas as pd
import dateutil
import re
from pathlib import Path
import glob
import ftplib

In [37]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [38]:
# Path will take care of handling operating system differences.
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-b9d10363-6d59-4deb-9595-2cb904a99d1d/installation-4.1.0/import


In [39]:
# Create a directory to cache variation data
CACHE = Path(NEO4J_IMPORT / 'cache')
CACHE.mkdir(exist_ok=True)

In [40]:
# Create a directory to cache variation data that could not be parsed
CACHE_FAILED = Path(NEO4J_IMPORT / 'cache_failed')
CACHE_FAILED.mkdir(exist_ok=True)

## Download SARS-CoV-2 Strain metadata

This notebook downloads more than 20,000 files. To work with a small sample (50 files), set

`run_small_sample_only = True`

In [41]:
run_small_sample_only = False

In [42]:
metadata_url = "https://bigd.big.ac.cn/ncov/genome/export/meta"
annotation_url = "ftp://download.big.ac.cn/GVM/Coronavirus/gff3/" 

In [43]:
df = pd.read_excel(metadata_url, dtype='str')
df.fillna('', inplace=True)

In [44]:
print("Total number of strains:", df.shape[0])

Total number of strains: 96075


In [45]:
df = df.query("`Sequence Quality` == 'High'")
df = df.query("`Nuc.Completeness` == 'Complete'")

In [46]:
print("Number of complete high quality strains", df.shape[0])

Number of complete high quality strains 50907


In [47]:
df.head()

Unnamed: 0,Virus Strain Name,Accession ID,Data Source,Related ID,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,Sample Collection Date,Location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,Complete,29848,High,0/0/0/1/NO,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 20:04:48,2020-05-07 23:03:25
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,Complete,29848,High,0/0/0/0/NO,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,Complete,29896,High,0/0/0/2/NO,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,Complete,29891,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25


#### Create a separate row for each Accession and Related ID

In [None]:
df['Accession ID'] = df['Accession ID'].str.strip()
df['Related ID'] = df['Related ID'].str.strip()

# combine all ids into a single column
df['alias'] = df['Accession ID'] + df['Related ID'].apply(lambda s: ',' + s if len(s) > 0 else s)
df['alias'] = df['alias'].str.replace(' ', '')

# then "explode" ids into separate rows
df['id'] = df['alias'].apply(lambda s: s.split(','))
df = df.explode('id')
df['id'] = df['id'].str.strip()
df['alias'] = df['alias'].str.replace(',', ';')
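
The combine-and-explode step can be tried on a toy frame. This is a minimal sketch: the two rows reuse accession values shown in the metadata preview, and the `toy` frame itself is illustrative only.

```python
import pandas as pd

# toy data: one strain with a related ID, one without
toy = pd.DataFrame({
    'Accession ID': ['NMDC60013088-01', 'EPI_ISL_403963'],
    'Related ID': ['EPI_ISL_402132', ''],
})

# join the accession with any related ids, skipping empty entries
toy['alias'] = toy['Accession ID'] + toy['Related ID'].apply(lambda s: ',' + s if s else '')

# one row per id; the alias column keeps the full list, separated by ';'
toy['id'] = toy['alias'].str.split(',')
toy = toy.explode('id')
toy['alias'] = toy['alias'].str.replace(',', ';', regex=False)

print(toy[['id', 'alias']])
```

Note that `explode` repeats the original index for each new row, which is why the strain tables in this notebook show duplicate index values.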

#### Assign taxonomy ids

In [None]:
# read Organism reference dictionary
data = pd.read_csv("../../reference_data/OrganismDictionary.csv", comment='#')
organism_to_id = dict(zip(data['organism'], data['taxonomyId']))

In [None]:
# assign taxonomy id to host
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: organism_to_id.get(s.lower(), s))
df['hostTaxonomyId'].unique()
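
Because unmatched host names pass through the lookup unchanged, it can help to list them explicitly. The sketch below assumes a one-entry stand-in for `OrganismDictionary.csv` with lower-case keys; the real file contents are not shown here.

```python
import pandas as pd

# hypothetical stand-in for OrganismDictionary.csv (assumes keys are lower case)
organism_to_id = {'homo sapiens': 'taxonomy:9606'}

hosts = pd.Series(['Homo sapiens', ' homo sapiens ', 'Unknown host'])
host_ids = hosts.str.strip().apply(lambda s: organism_to_id.get(s.lower(), s))

# names missing from the dictionary keep their original text and are easy to list
unmapped = sorted(set(host_ids[~host_ids.str.startswith('taxonomy:')]))
print(unmapped)
```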

In [None]:
df['taxonomyId'] = 'taxonomy:2697049' # SARS-CoV-2

#### Standardize node property names (CURIEs and URIs)

In [None]:
df.rename(columns={'Virus Strain Name': 'name',
                   'Sample Collection Date':'collectionDate',
                   'Location':'location'}, 
          inplace=True)

In [19]:
# https://registry.identifiers.org/registry/insdc
# raw string so that \d and \. reach the regex engine unchanged
insdc_pattern = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')

In [20]:
def assign_curie(id):
    id = id.strip()
    if len(id) > 0:
        if id.startswith('EPI'):
            return 'https://www.gisaid.org/' + id
        elif id.startswith('NC_'):
            # NCBI reference sequences resolve with ncbiprotein CURIE.
            # This check must come before the INSDC pattern, which does not allow underscores.
            return 'ncbiprotein:' + id
        elif insdc_pattern.match(id) is not None:
            return 'insdc:' + id
        else:
            # TODO are URIs available for these cases?
            return id
    else:
        return id
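
The prefix rules can be checked in isolation. The following is a standalone sketch, not the notebook's function: `to_curie` is a hypothetical helper that tests the `NC_` prefix before the INSDC pattern, and the sample accessions are taken from this notebook.

```python
import re

# INSDC accession pattern as listed at identifiers.org
INSDC = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')

def to_curie(accession):
    """Illustrative helper: map an accession to a CURIE or URI."""
    accession = accession.strip()
    if accession.startswith('EPI'):
        return 'https://www.gisaid.org/' + accession
    if accession.startswith('NC_'):
        return 'ncbiprotein:' + accession
    if INSDC.match(accession):
        return 'insdc:' + accession
    return accession  # no known prefix, returned unchanged

for acc in ['EPI_ISL_402132', 'MT066156', 'NC_045512', 'NMDC60013088-01']:
    print(acc, '->', to_curie(acc))
```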

In [21]:
df[['loc0', 'loc1', 'loc2', 'loc3']] = df['location'].str.split('/', n=3, expand=True)
# strip white space
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [22]:
df['origLocation'] = df[['loc0', 'loc1', 'loc2', 'loc3']].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
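
For comparison, the same normalization can be written in one step without intermediate columns. This is only a sketch of an alternative, using location strings from the metadata preview.

```python
import pandas as pd

locations = pd.Series(['China / Hubei / Wuhan', 'Thailand/ Nonthaburi Province', 'Italy'])

# split on '/', strip each part, drop empty parts, and rejoin with commas
normalized = locations.apply(
    lambda s: ','.join(part.strip() for part in s.split('/') if part.strip())
)
print(normalized.tolist())
```

Unlike the four-column split, this variant does not cap the number of location levels.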

In [29]:
strains = df[['id', 'name', 'alias', 'taxonomyId', 'hostTaxonomyId','collectionDate', 'location', 'origLocation']].copy()
strains['id'] = strains['id'].apply(assign_curie)
strains.head()

Unnamed: 0,id,name,alias,taxonomyId,hostTaxonomyId,collectionDate,location,origLocation
0,NMDC60013088-01,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China / Hubei,"China,Hubei"
0,https://www.gisaid.org/EPI_ISL_402132,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China / Hubei,"China,Hubei"
1,https://www.gisaid.org/EPI_ISL_403963,hCoV-19/Thailand/74/2020,EPI_ISL_403963,taxonomy:2697049,taxonomy:9606,2020-01-13,Thailand/ Nonthaburi Province,"Thailand,Nonthaburi Province"
2,https://www.gisaid.org/EPI_ISL_403962,hCoV-19/Thailand/61/2020,EPI_ISL_403962,taxonomy:2697049,taxonomy:9606,2020-01-08,Thailand/ Nonthaburi Province,"Thailand,Nonthaburi Province"
3,NMDC60013085-01,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01;EPI_ISL_402120,taxonomy:2697049,taxonomy:9606,2020-01-01,China / Hubei / Wuhan,"China,Hubei,Wuhan"


In [51]:
strains.query("collectionDate == '2020-01-30'")

Unnamed: 0,id,name,alias,taxonomyId,hostTaxonomyId,collectionDate,location,origLocation
88,https://www.gisaid.org/EPI_ISL_407896,hCoV-19/Australia/QLD02/2020,EPI_ISL_407896,taxonomy:2697049,taxonomy:9606,2020-01-30,Australia / Queensland / Gold Coast,"Australia,Queensland,Gold Coast"
215,https://www.gisaid.org/EPI_ISL_412029,hCoV-19/Hong Kong/VB20024950-2/2020,EPI_ISL_412029,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Hong Kong,"China,Hong Kong"
235,https://www.gisaid.org/EPI_ISL_412870,hCoV-19/South Korea/KCDC06/2020,EPI_ISL_412870,taxonomy:2697049,taxonomy:9606,2020-01-30,South Korea/ Seoul,"South Korea,Seoul"
236,https://www.gisaid.org/EPI_ISL_412869,hCoV-19/South Korea/KCDC05/2020,EPI_ISL_412869,taxonomy:2697049,taxonomy:9606,2020-01-30,South Korea /Seoul,"South Korea,Seoul"
414,insdc:MT066156,SARS-CoV-2/human/ITA/INMI1/2020,MT066156,taxonomy:2697049,taxonomy:9606,2020-01-30,Italy,Italy
460,NMDC60013027-01,hCoV-19/Guangdong/2020XN4273-P0036/2020,NMDC60013027-01;EPI_ISL_413860,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Guangdong,"China,Guangdong"
460,https://www.gisaid.org/EPI_ISL_413860,hCoV-19/Guangdong/2020XN4273-P0036/2020,NMDC60013027-01;EPI_ISL_413860,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Guangdong,"China,Guangdong"
462,NMDC60013025-01,hCoV-19/Guangdong/2020XN4459-P0041/2020,NMDC60013025-01;EPI_ISL_413858,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Guangdong,"China,Guangdong"
462,https://www.gisaid.org/EPI_ISL_413858,hCoV-19/Guangdong/2020XN4459-P0041/2020,NMDC60013025-01;EPI_ISL_413858,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Guangdong,"China,Guangdong"
467,NMDC60013021-01,hCoV-19/Guangdong/2020XN4475-P0042/2020,NMDC60013021-01;EPI_ISL_413854,taxonomy:2697049,taxonomy:9606,2020-01-30,China / Guangdong,"China,Guangdong"


In [23]:
strains.to_csv(NEO4J_IMPORT / "01d-CNCBStrain.csv", index=False)

## Merge Metadata with Variation Data

#### Get list of file names from FTP site

In [24]:
server = "download.big.ac.cn"
user = "anonymous"
password = "anonymous"
source = "/GVM/Coronavirus/gff3/"

ftp = ftplib.FTP(server)
ftp.login(user, password)
ftp.cwd(source) 
filelist=ftp.nlst()
ftp.quit()

'221 Goodbye.'

In [25]:
df_file = pd.DataFrame(filelist, columns=['filename'])

Extract identifiers from file name

Example: 2019-nCoV_CNA0013697_variants.gff3 -> CNA0013697

In [26]:
df_file['id'] = df_file['filename'].str[10:]
df_file['id'] = df_file['id'].str.replace('_variants.gff3','')

In [27]:
df_file = df_file.query("id != ''")
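
Slicing by position works as long as every file name starts with the 10-character prefix `2019-nCoV_`. A more defensive sketch, assuming the naming scheme `2019-nCoV_<id>_variants.gff3` holds for all annotation files, uses a regular expression (`extract_id` is a hypothetical helper):

```python
import re

# assumed naming scheme: 2019-nCoV_<id>_variants.gff3
FILE_PATTERN = re.compile(r'^2019-nCoV_(.+)_variants\.gff3$')

def extract_id(filename):
    """Return the strain id embedded in a file name, or '' if the name does not match."""
    m = FILE_PATTERN.match(filename)
    return m.group(1) if m else ''

print(extract_id('2019-nCoV_CNA0013697_variants.gff3'))
```

File names that do not follow the scheme yield an empty id and are then dropped by the same `id != ''` filter.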

In [28]:
print("Number of available files:", df_file.shape[0])

Number of available files: 43946


In [29]:
df_file.head()

Unnamed: 0,filename,id
0,2019-nCoV_CNA0007332_variants.gff3,CNA0007332
1,2019-nCoV_CNA0007334_variants.gff3,CNA0007334
2,2019-nCoV_CNA0007335_variants.gff3,CNA0007335
3,2019-nCoV_CNA0013697_variants.gff3,CNA0013697
4,2019-nCoV_CNA0013698_variants.gff3,CNA0013698


In [30]:
df = df.merge(df_file, on='id')

In [31]:
print('Strains with a matching filename:', df.shape[0])

Strains with a matching filename: 43804
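
The inner merge silently drops strains without an annotation file. To see which ids have no match, a left merge with `indicator=True` can be used. The toy frames below are illustrative only.

```python
import pandas as pd

strains_toy = pd.DataFrame({'id': ['EPI_ISL_402132', 'NMDC60013088-01', 'MT066156']})
files_toy = pd.DataFrame({
    'filename': ['2019-nCoV_EPI_ISL_402132_variants.gff3', '2019-nCoV_CNA0013697_variants.gff3'],
    'id': ['EPI_ISL_402132', 'CNA0013697'],
})

# how='left' keeps every strain; the _merge column shows whether a file was found
merged = strains_toy.merge(files_toy, on='id', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
print(missing)
```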


In [32]:
if run_small_sample_only:
    df = df.sample(n=50, random_state=5)

In [33]:
df.head()

Unnamed: 0,name,Accession ID,Data Source,Related ID,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,collectionDate,location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time,alias,id,hostTaxonomyId,taxonomyId,loc1,loc2,loc3,loc4,origLocation,filename
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,Complete,29848,High,0/0/0/1/NO,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 20:04:48,2020-05-07 23:03:25,NMDC60013088-01;EPI_ISL_402132,EPI_ISL_402132,taxonomy:9606,taxonomy:2697049,China,Hubei,,,"China,Hubei",2019-nCoV_EPI_ISL_402132_variants.gff3
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28,EPI_ISL_403963,EPI_ISL_403963,taxonomy:9606,taxonomy:2697049,Thailand,Nonthaburi Province,,,"Thailand,Nonthaburi Province",2019-nCoV_EPI_ISL_403963_variants.gff3
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,Complete,29848,High,0/0/0/0/NO,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28,EPI_ISL_403962,EPI_ISL_403962,taxonomy:9606,taxonomy:2697049,Thailand,Nonthaburi Province,,,"Thailand,Nonthaburi Province",2019-nCoV_EPI_ISL_403962_variants.gff3
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,Complete,29896,High,0/0/0/2/NO,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25,NMDC60013085-01;EPI_ISL_402120,EPI_ISL_402120,taxonomy:9606,taxonomy:2697049,China,Hubei,Wuhan,,"China,Hubei,Wuhan",2019-nCoV_EPI_ISL_402120_variants.gff3
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,Complete,29891,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25,NMDC60013084-01;EPI_ISL_402119,EPI_ISL_402119,taxonomy:9606,taxonomy:2697049,China,Hubei,Wuhan,,"China,Hubei,Wuhan",2019-nCoV_EPI_ISL_402119_variants.gff3


Keep only unique entries (there are a few duplicate cases)

In [34]:
df.drop_duplicates(subset='name', inplace=True)
df.drop_duplicates(subset='id', inplace=True)

In [35]:
print('Strains with a matching filename:', df.shape[0])

Strains with a matching filename: 43743


#### Download variant annotations for each strain
To avoid downloading the same files on every run, they are cached, and newly downloaded files are added to the cache.

In [36]:
names = ['taxon1', 'variantType', 'name', 'start', 'end','x1', 'x2', 'x3','taxon2', 'x4', 'strainStart', 'taxon3', 'x5', 'strainEnd', 'ref', 'alt', 'vepAnnotation']

In [37]:
def download_gff3(filename, url):
    # a failed download raises here and is reported by the caller
    raw = pd.read_csv(url, header=None, comment='#', sep='[\t|;]', engine='python', names=names)
    # use the filename argument instead of relying on the global loop variable `row`
    csv_name = filename + '.csv'
    try:
        gff3 = raw.copy()
        gff3['ref'] = gff3['ref'].str.replace('REF=','')
        gff3['alt'] = gff3['alt'].str.replace('ALT=','')
        gff3['vepAnnotation'] = gff3['vepAnnotation'].str.replace('VEP=','')
        # prepare for 3-way split (need at least two commas)
        gff3['vepAnnotation'] = gff3['vepAnnotation'].apply(lambda s: s + ',,' if s.count(',') < 2 else s)
        # 3-way split
        gff3[['variantConsequence','proteinVariant','geneVariant']] = gff3['vepAnnotation'].str.split(',', n=2, expand=True)
        gff3['geneVariant'] = gff3['geneVariant'].str.replace('gene-','')
        gff3 = gff3[['name', 'variantType', 'start', 'end', 'ref', 'alt', 'variantConsequence', 'proteinVariant', 'geneVariant']]
        gff3.to_csv(CACHE / csv_name, index=False)
    except Exception:
        print('Parsing failed for: ', filename)
        # cache the unparsed data so we don't reprocess this file next time
        raw.to_csv(CACHE_FAILED / csv_name, index=False)
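
The padding and 3-way split of the annotation string can be examined on its own. In this sketch the protein variant is the example used later in this notebook, while the consequence and gene values are made up for illustration.

```python
import pandas as pd

# made-up annotation strings: one complete, one with a single field
vep = pd.Series(['missense_variant,QHD43415.1:p.5828P>L,gene-orf1ab', 'intergenic_variant'])

# pad so every string has at least two commas, then split into exactly three parts
padded = vep.apply(lambda s: s + ',,' if s.count(',') < 2 else s)
parts = padded.str.split(',', n=2, expand=True)
parts.columns = ['variantConsequence', 'proteinVariant', 'geneVariant']
parts['geneVariant'] = parts['geneVariant'].str.replace('gene-', '', regex=False)
print(parts)
```

With `n=2` any additional commas stay in the third column, so the result always has three columns.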

In [None]:
for index, row in df.iterrows():
    url = annotation_url + row['filename']
    filename = row['filename'] + '.csv'
    # skip files that have been processed in previous runs
    if not ((CACHE / filename).exists() or (CACHE_FAILED / filename).exists()):
        try:
            download_gff3(row['filename'], url)
            print(row['filename'], end=' ')
        except Exception:
            # catch Exception rather than everything, so Ctrl-C still interrupts the loop
            print('Download failed for: ', row['filename'])

2019-nCoV_EPI_ISL_411220_variants.gff3 2019-nCoV_EPI_ISL_415787_variants.gff3 2019-nCoV_EPI_ISL_416907_variants.gff3 2019-nCoV_EPI_ISL_418186_variants.gff3 2019-nCoV_EPI_ISL_419264_variants.gff3 2019-nCoV_EPI_ISL_419265_variants.gff3 2019-nCoV_EPI_ISL_419266_variants.gff3 2019-nCoV_EPI_ISL_419908_variants.gff3 2019-nCoV_EPI_ISL_419909_variants.gff3 2019-nCoV_EPI_ISL_419910_variants.gff3 2019-nCoV_EPI_ISL_419911_variants.gff3 2019-nCoV_EPI_ISL_419912_variants.gff3 2019-nCoV_EPI_ISL_419914_variants.gff3 2019-nCoV_EPI_ISL_419915_variants.gff3 2019-nCoV_EPI_ISL_419916_variants.gff3 2019-nCoV_EPI_ISL_419917_variants.gff3 2019-nCoV_EPI_ISL_419918_variants.gff3 2019-nCoV_EPI_ISL_419919_variants.gff3 2019-nCoV_EPI_ISL_419920_variants.gff3 2019-nCoV_EPI_ISL_419921_variants.gff3 2019-nCoV_EPI_ISL_419922_variants.gff3 2019-nCoV_EPI_ISL_419923_variants.gff3 2019-nCoV_EPI_ISL_419924_variants.gff3 2019-nCoV_EPI_ISL_419925_variants.gff3 2019-nCoV_EPI_ISL_419926_variants.gff3 2019-nCoV_EPI_ISL_419927_
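
The caching logic boils down to "fetch only when no cached copy exists". The sketch below shows that pattern without network access: `fetch_if_missing` and `fake_fetch` are hypothetical names, and a temporary directory stands in for the cache directory.

```python
import tempfile
from pathlib import Path

def fetch_if_missing(name, cache_dir, fetch):
    """Hypothetical helper: call fetch(name) only when no cached copy exists."""
    target = Path(cache_dir) / (name + '.csv')
    if target.exists():
        return False  # already cached, nothing downloaded
    target.write_text(fetch(name))
    return True

calls = []

def fake_fetch(name):
    # stand-in for the real FTP download; records each call
    calls.append(name)
    return 'name,start,end\n'

with tempfile.TemporaryDirectory() as tmp:
    first = fetch_if_missing('2019-nCoV_CNA0013697_variants.gff3', tmp, fake_fetch)
    second = fetch_if_missing('2019-nCoV_CNA0013697_variants.gff3', tmp, fake_fetch)

print(first, second, len(calls))
```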

### Concatenate all variation data into a single dataframe

In [None]:
# use all cached data files
path = str(CACHE / '*.gff3.csv')
filenames = glob.glob(path)

variations = pd.concat((pd.read_csv(f, index_col=None, header=0) for f in filenames))
variations.fillna('', inplace=True)

List of variant types and consequences:

https://uswest.ensembl.org/info/genome/variation/prediction/classification.html

https://uswest.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences

#### Extract protein position and protein id from proteinVariant string

Example: QHD43415.1:p.5828P>L -> proteinPosition: 5828, proteinAccession: QHD43415

In [None]:
# raw string so that \. and \- reach the regex engine unchanged
position_pattern = re.compile(r':p\.(.*?)[A-Z|\-]+')

In [None]:
def extract_protein_position(s):
    # return the position captured after ':p.', or '' if there is no match
    match = position_pattern.search(s)
    return match.group(1) if match is not None else ''

In [None]:
variations['proteinPosition'] = variations['proteinVariant'].apply(extract_protein_position)

In [None]:
variations['proteinAccession'] = variations['proteinVariant'].apply(lambda s: s.split('.')[0] if '.' in s else '')
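
Both values can also be pulled out in a single pass. This is a simplified sketch, assuming the position after `:p.` is purely numeric; `parse_protein_variant` is a hypothetical helper and, unlike the cells above, returns an empty accession when no position is found.

```python
import re

# capture the digits after ':p.' (assumes the position is purely numeric)
POSITION = re.compile(r':p\.(\d+)')

def parse_protein_variant(s):
    """Illustrative helper: return (proteinAccession, proteinPosition) or ('', '')."""
    m = POSITION.search(s)
    if m is None:
        return '', ''
    # the accession is the text before the version suffix, e.g. 'QHD43415' in 'QHD43415.1'
    return s.split('.')[0], m.group(1)

print(parse_protein_variant('QHD43415.1:p.5828P>L'))
```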

In [None]:
variations['proteinAccession'].unique()

#### Assign SARS-CoV-2 taxonomy id

In [None]:
variations['taxonomyId'] = 'taxonomy:2697049'

#### Assign Reference genome

The first SARS-CoV-2 genome sequence is the reference for the variant annotation below.

[Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1](https://www.ncbi.nlm.nih.gov/nuccore/MN908947)

In [None]:
# variations['referenceGenome'] = 'insdc:MN908947' -> same as NCBI reference sequence NC_045512
variations['referenceGenome'] = 'ncbiprotein:NC_045512'

In [None]:
variations['proteinAccession'] = variations['proteinAccession'].apply(lambda s: 'ncbiprotein:' + s if s != '' else s)

In [None]:
print('variantType:', variations['variantType'].unique())

In [None]:
print("variantConsequence:", variations['variantConsequence'].unique())

In [None]:
print("Number of variants:", variations.shape[0])

In [None]:
variations.to_csv(NEO4J_IMPORT / "01d-CNCBVariant.csv", index=False)

In [None]:
variations.head()