# Load SARS-CoV-2 Virus Strain Data from CNCB
**[Work in progress]**

This notebook downloads and standardizes viral strain and variation data from CNCB for ingestion into a Knowledge Graph.

Data source: [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
import re
from pathlib import Path
import glob
import ftplib

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
# Path will take care of handling operating system differences.
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-19636412-9e74-4bac-8a4c-c6c8b49bb9d3/installation-4.1.0/import


In [4]:
# Create a directory to cache variation data
CACHE = Path(NEO4J_IMPORT / 'cache')
CACHE.mkdir(exist_ok=True)

In [5]:
# Create a directory to cache variation data that could not be parsed
CACHE_FAILED = Path(NEO4J_IMPORT / 'cache_failed')
CACHE_FAILED.mkdir(exist_ok=True)

## Download SARS-CoV-2 Strain metadata

This notebook will download > 20,000 files. To work on a small sample (50 files), set 

`run_small_sample_only = True`

In [6]:
run_small_sample_only = False

In [7]:
metadata_url = "https://bigd.big.ac.cn/ncov/genome/export/meta"
annotation_url = "ftp://download.big.ac.cn/GVM/Coronavirus/gff3/" 

In [8]:
df = pd.read_excel(metadata_url, dtype='str')
df.fillna('', inplace=True)

In [9]:
print("Total number of strains:", df.shape[0])

Total number of strains: 64641


In [10]:
df = df.query("`Sequence Quality` == 'High'")
df = df.query("`Nuc.Completeness` == 'Complete'")

In [11]:
print("Number of complete high quality strains", df.shape[0])

Number of complete high quality strains 31658


In [12]:
df.head()

Unnamed: 0,Virus Strain Name,Accession ID,Data Source,Related ID,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,Sample Collection Date,Location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,Complete,29848,High,0/0/0/1/NO,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 20:04:48,2020-05-07 23:03:25
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,Complete,29859,High,0/0/0/0/NO,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,Complete,29848,High,0/0/0/0/NO,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 20:04:48,2020-06-28 15:24:28
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,Complete,29896,High,0/0/0/2/NO,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,Complete,29891,High,0/0/0/0/NO,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 20:04:48,2020-05-07 23:03:25


#### Create a separate row for each Accession and Related ID

In [13]:
df['Accession ID'] = df['Accession ID'].str.strip()
df['Related ID'] = df['Related ID'].str.strip()

# combine all ids into a single column
df['alias'] = df['Accession ID'] + df['Related ID'].apply(lambda s: ',' + s if len(s) > 0 else s)
df['alias'] = df['alias'].str.replace(' ', '')

# then "explode" ids into separate rows
df['id'] = df['alias'].apply(lambda s: s.split(','))
df = df.explode('id')
df['id'] = df['id'].str.strip()
df['alias'] = df['alias'].str.replace(',', ';')
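
The combine-then-explode step can be tried on two toy rows. This is a minimal sketch: the frame `toy` is a hypothetical stand-in for the metadata table, with one strain that has a Related ID and one that does not.

```python
import pandas as pd

# Toy rows mimicking the metadata table; the second strain has no Related ID.
toy = pd.DataFrame({
    'Accession ID': ['NMDC60013088-01', 'EPI_ISL_403963'],
    'Related ID': ['EPI_ISL_402132', ''],
})

# Join both id columns with a comma, skipping empty Related IDs.
toy['alias'] = toy['Accession ID'] + toy['Related ID'].apply(lambda s: ',' + s if s else s)

# One row per identifier; the original row index is repeated for exploded rows.
toy['id'] = toy['alias'].str.split(',')
toy = toy.explode('id')

# The alias column keeps the full set of ids, separated by semicolons.
toy['alias'] = toy['alias'].str.replace(',', ';', regex=False)

print(toy[['id', 'alias']])
```

Each exploded row keeps the index of the row it came from, which is why the strain table further down shows index 0 twice.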

#### Assign taxonomy ids

In [14]:
# read Organism reference dictionary
organism_to_id = dict()
data = pd.read_csv("../../reference_data/OrganismDictionary.csv", comment='#')
for index, row in data.iterrows():
    organism_to_id[row['organism']] = row['taxonomyId']

In [15]:
# assign taxonomy id to host
df['Host'] = df['Host'].str.strip()
df['hostTaxonomyId'] = df['Host'].apply(lambda s: organism_to_id.get(s.lower(), s))
df['hostTaxonomyId'].unique()

array(['taxonomy:9606', 'taxonomy:59477 ', 'Environment', 'taxonomy:9974',
       'taxonomy:608659', 'taxonomy:419130', 'taxonomy:9666',
       'taxonomy:9608', 'taxonomy:10090', 'taxonomy:9685'], dtype=object)
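
The lookup falls back to the original host string when no dictionary entry exists, which is why `Environment` appears unchanged in the output above. A minimal sketch of that behavior, assuming a two-entry dictionary (a hypothetical excerpt, not the contents of `OrganismDictionary.csv`):

```python
import pandas as pd

# Hypothetical excerpt of the organism dictionary (lowercase name -> taxonomy CURIE).
organism_to_id = {'homo sapiens': 'taxonomy:9606', 'mus musculus': 'taxonomy:10090'}

hosts = pd.Series(['Homo sapiens', 'Homo Sapiens', ' Mus musculus', 'Environment'])
hosts = hosts.str.strip()

# Lowercasing makes the lookup case-insensitive; unknown hosts pass through unchanged.
host_ids = hosts.apply(lambda s: organism_to_id.get(s.lower(), s))
print(host_ids.tolist())
```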

In [16]:
df['taxonomyId'] = 'taxonomy:2697049' # SARS-CoV-2

#### Standardize node property names (CURIEs and URIs)

In [17]:
df.rename(columns={'Virus Strain Name': 'name',
                   'Sample Collection Date':'collectionDate',
                   'Location':'location'}, 
          inplace=True)

In [18]:
# https://registry.identifiers.org/registry/insdc
# raw string, so that \d and \. are passed to the regex engine unchanged
insdc_pattern = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')
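
To see which identifiers this pattern accepts, it can be run against a few ids. The sketch below takes `MT344960`, `EPI_ISL_402132`, and `NMDC60013088-01` from the tables in this notebook; `MN908947.3` is an assumed versioned form of the reference accession, added only to exercise the optional version suffix.

```python
import re

insdc_pattern = re.compile(r'^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(\.\d+)?$')

samples = ['MT344960', 'MN908947.3', 'EPI_ISL_402132', 'NMDC60013088-01', 'NC_045512']

# True if the whole id matches one of the INSDC accession shapes.
matches = {s: insdc_pattern.match(s) is not None for s in samples}
for s, ok in matches.items():
    print(s, ok)
```

GISAID ids, NMDC ids with a `-01` suffix, and RefSeq ids with an underscore do not match, so they need separate handling.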

In [19]:
def assign_curie(id):
    id = id.strip()
    if len(id) > 0:
        if id.startswith('EPI'):
            return 'https://www.gisaid.org/' + id
        elif id.startswith('NC_'):
            # NCBI reference sequences (NC_...) resolve with ncbiprotein CURIE.
            # The underscore is kept: stripping it first would make this branch unreachable.
            return 'ncbiprotein:' + id
        elif insdc_pattern.match(id) is not None:
            return 'insdc:' + id
        else:
            # TODO are URIs available for these cases?
            return id
    else:
        return id

In [20]:
strains = df[['id', 'name', 'alias', 'taxonomyId', 'hostTaxonomyId','collectionDate', 'location']].copy()
strains['id'] = strains['id'].apply(assign_curie)
strains.head()

Unnamed: 0,id,name,alias,taxonomyId,hostTaxonomyId,collectionDate,location
0,NMDC60013088-01,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China / Hubei
0,https://www.gisaid.org/EPI_ISL_402132,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China / Hubei
1,https://www.gisaid.org/EPI_ISL_403963,hCoV-19/Thailand/74/2020,EPI_ISL_403963,taxonomy:2697049,taxonomy:9606,2020-01-13,Thailand/ Nonthaburi Province
2,https://www.gisaid.org/EPI_ISL_403962,hCoV-19/Thailand/61/2020,EPI_ISL_403962,taxonomy:2697049,taxonomy:9606,2020-01-08,Thailand/ Nonthaburi Province
3,NMDC60013085-01,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01;EPI_ISL_402120,taxonomy:2697049,taxonomy:9606,2020-01-01,China / Hubei / Wuhan


In [21]:
strains.to_csv(NEO4J_IMPORT / "01d-CNCBStrain.csv", index=False)

## Merge Metadata with Variation Data

#### Get list of file names from FTP site

In [22]:
server = "download.big.ac.cn"
user = "anonymous"
password = "anonymous"
source = "/GVM/Coronavirus/gff3/"

ftp = ftplib.FTP(server)
ftp.login(user, password)
ftp.cwd(source) 
filelist=ftp.nlst()
ftp.quit()

'221 Goodbye.'

In [23]:
df_file = pd.DataFrame(filelist, columns=['filename'])

Extract identifiers from file name

Example: 2019-nCoV_CNA0013697_variants.gff3 -> CNA0013697

In [24]:
df_file['id'] = df_file['filename'].str[10:]
# regex=False: treat the suffix as a literal string ('.' is not a wildcard)
df_file['id'] = df_file['id'].str.replace('_variants.gff3', '', regex=False)
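
The slice-based extraction relies on every file name starting with the 10-character prefix `2019-nCoV_`. As an alternative sketch (not the notebook's method), an anchored regular expression pulls out the identifier and leaves non-matching names empty. The frame `files` and the entry `README.txt` are hypothetical.

```python
import pandas as pd

files = pd.DataFrame({'filename': [
    '2019-nCoV_CNA0013697_variants.gff3',
    '2019-nCoV_EPI_ISL_447442_variants.gff3',
    'README.txt',  # hypothetical non-variant file in the listing
]})

# The capture group is the identifier; names that do not match yield NaN.
files['id'] = files['filename'].str.extract(r'^2019-nCoV_(.+)_variants\.gff3$', expand=False)
files = files.dropna(subset=['id'])

print(files)
```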

In [25]:
df_file = df_file.query("id != ''")

In [26]:
print("Number of available files:", df_file.shape[0])

Number of available files: 31721


In [27]:
df_file.head()

Unnamed: 0,filename,id
0,2019-nCoV_CNA0007332_variants.gff3,CNA0007332
1,2019-nCoV_CNA0007334_variants.gff3,CNA0007334
2,2019-nCoV_CNA0007335_variants.gff3,CNA0007335
3,2019-nCoV_CNA0013697_variants.gff3,CNA0013697
4,2019-nCoV_CNA0013698_variants.gff3,CNA0013698


In [28]:
df = df.merge(df_file, on='id')

In [29]:
print('Strains with a matching filename:', df.shape[0])

Strains with a matching filename: 31642
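
The default inner merge silently drops strains that have no annotation file. To inspect which ids fall out, one option is a left merge with `indicator=True`. The sketch below uses hypothetical toy frames (`strains_toy`, `files_toy`) and made-up strain names.

```python
import pandas as pd

strains_toy = pd.DataFrame({
    'id': ['MT344960', 'EPI_ISL_426435', 'EPI_ISL_447442'],
    'name': ['strain A', 'strain A', 'strain B'],
})
files_toy = pd.DataFrame({
    'id': ['MT344960', 'EPI_ISL_447442', 'CNA0007332'],
    'filename': ['2019-nCoV_MT344960_variants.gff3',
                 '2019-nCoV_EPI_ISL_447442_variants.gff3',
                 '2019-nCoV_CNA0007332_variants.gff3'],
})

# The _merge column reports 'both' for matched ids and 'left_only' for strains without a file.
merged = strains_toy.merge(files_toy, on='id', how='left', indicator=True)
matched = merged[merged['_merge'] == 'both']

print(merged[['id', '_merge']])
```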


In [30]:
if run_small_sample_only:
    df = df.sample(n=50, random_state=5)

In [31]:
df.head()

Unnamed: 0,name,Accession ID,Data Source,Related ID,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,collectionDate,location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time,alias,id,hostTaxonomyId,taxonomyId,filename
5671,SARS-CoV-2/human/USA/RI_0882/2020,MT344960,GenBank,EPI_ISL_426435,Complete,29882,High,0/0/0/7/NO,Homo sapiens,2020-03-09,United States / RI,"Division of Viral Diseases, Centers for Diseas...",2020-04-15,"Division of Viral Diseases, Centers for Diseas...",2020-04-17 14:20:21,2020-04-17 14:20:21,MT344960;EPI_ISL_426435,MT344960,taxonomy:9606,taxonomy:2697049,2019-nCoV_MT344960_variants.gff3
13331,hCoV-19/Israel/13075882/2020,EPI_ISL_447442,GISAID,,Complete,29892,High,"0/0/0/8/28881~28883(3-3-1.00,SNP:28881; SNP:28...",Homo Sapiens,2020-03-30,Israel / Tel Aviv District,Stern Lab,2020-05-17,Stern Lab,2020-05-18 01:37:26,2020-05-18 01:37:26,EPI_ISL_447442,EPI_ISL_447442,taxonomy:9606,taxonomy:2697049,2019-nCoV_EPI_ISL_447442_variants.gff3
27595,hCoV-19/England/OXON-B1781/2020,EPI_ISL_479044,GISAID,,Complete,29903,High,0/0/0/8/NO,Homo Sapiens,2020-05-04,United Kingdom / England,"Oxford Viromics, NDM, University of Oxford; Ox...",2020-06-30,COVID-19 Genomics UK (COG-UK) Consortium,2020-07-01 19:10:51,2020-07-01 19:10:51,EPI_ISL_479044,EPI_ISL_479044,taxonomy:9606,taxonomy:2697049,2019-nCoV_EPI_ISL_479044_variants.gff3
24890,hCoV-19/England/BIRM-5F8CE/2020,EPI_ISL_473347,GISAID,,Complete,29876,High,"2/0/1/8/28881~28883(3-3-1.00,SNP:28881; SNP:28...",Homo Sapiens,2020-05-18,United Kingdom / England,University of Birmingham,2020-06-23,COVID-19 Genomics UK (COG-UK) Consortium,2020-06-24 18:33:55,2020-06-24 18:33:55,EPI_ISL_473347,EPI_ISL_473347,taxonomy:9606,taxonomy:2697049,2019-nCoV_EPI_ISL_473347_variants.gff3
17560,hCoV-19/Croatia/7R-S8new/2020,EPI_ISL_455567,GISAID,,Complete,29884,High,0/0/0/7/NO,Homo Sapiens,2020-03-29,Croatia / Istria,Institute for Public Health,2020-05-31,Laboratory for advanced genomics,2020-06-01 12:13:52,2020-06-01 12:13:52,EPI_ISL_455567,EPI_ISL_455567,taxonomy:9606,taxonomy:2697049,2019-nCoV_EPI_ISL_455567_variants.gff3


Keep only unique entries (there are a few duplicate cases)

In [32]:
df.drop_duplicates(subset='name', inplace=True)
df.drop_duplicates(subset='id', inplace=True)

In [33]:
print('Strains with a matching filename:', df.shape[0])

Strains with a matching filename: 50


#### Download variant annotations for each strain
To avoid downloading the same files on every run, they are cached, and newly downloaded files are added to the cache.

In [34]:
names = ['taxon1', 'variantType', 'name', 'start', 'end','x1', 'x2', 'x3','taxon2', 'x4', 'strainStart', 'taxon3', 'x5', 'strainEnd', 'ref', 'alt', 'vepAnnotation']

In [35]:
def download_gff3(filename, url):
    # read the raw table once; it is reused below if parsing fails
    raw = pd.read_csv(url, header=None, comment='#', sep='[\t|;]', engine='python', names=names)
    # use the filename argument rather than the loop variable `row`
    csv_name = filename + '.csv'
    try:
        gff3 = raw.copy()
        gff3['ref'] = gff3['ref'].str.replace('REF=','')
        gff3['alt'] = gff3['alt'].str.replace('ALT=','')
        gff3['vepAnnotation'] = gff3['vepAnnotation'].str.replace('VEP=','')
        # prepare for 3-way split (need at least two commas)
        gff3['vepAnnotation'] = gff3['vepAnnotation'].apply(lambda s: s + ',,' if s.count(',') < 2 else s)
        # 3-way split
        gff3[['variantConsequence','proteinVariant','geneVariant']] = gff3['vepAnnotation'].str.split(',', n=2, expand=True)
        gff3['geneVariant'] = gff3['geneVariant'].str.replace('gene-','')
        gff3 = gff3[['name', 'variantType', 'start', 'end', 'ref', 'alt', 'variantConsequence', 'proteinVariant', 'geneVariant']]

        gff3.to_csv(CACHE / csv_name, index=False)
    except Exception:
        print('Parsing failed for: ', filename)
        # cache files that failed to parse so we don't reprocess them next time
        raw.to_csv(CACHE_FAILED / csv_name, index=False)
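
The padding and 3-way split of the VEP annotation can be checked on two strings. This is a sketch under an assumption: the input strings are reconstructed from the parsed columns shown at the end of this notebook (with the `VEP=` and `gene-` prefixes that the function strips), not copied from a real GFF3 file.

```python
import pandas as pd

# Reconstructed example annotations (assumed format): one with all three parts, one with only a consequence.
vep = pd.Series([
    'VEP=missense_variant,QHD43415.1:p.5828P>L,gene-orf1ab:c.17483cCt>cTt',
    'VEP=intergenic_variant',
])

vep = vep.str.replace('VEP=', '', regex=False)
# Pad short annotations so that the split always yields three columns.
vep = vep.apply(lambda s: s + ',,' if s.count(',') < 2 else s)

parts = vep.str.split(',', n=2, expand=True)
parts.columns = ['variantConsequence', 'proteinVariant', 'geneVariant']
parts['geneVariant'] = parts['geneVariant'].str.replace('gene-', '', regex=False)

print(parts)
```

Without the padding, `expand=True` would produce `None` in the missing columns instead of empty strings.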

In [36]:
for index, row in df.iterrows():
    url = annotation_url + row['filename']
    filename = row['filename'] + '.csv'
    # skip files that have been processed in previous runs
    if not ((CACHE / filename).exists() or (CACHE_FAILED / filename).exists()):
        try:
            download_gff3(row['filename'], url)
            print(row['filename'], end=' ')
        except Exception:
            # download_gff3 handles parsing errors itself, so this reports download errors
            print('Download failed for: ', row['filename'])

2019-nCoV_MT344960_variants.gff3 2019-nCoV_EPI_ISL_479044_variants.gff3 2019-nCoV_EPI_ISL_473347_variants.gff3 2019-nCoV_EPI_ISL_455567_variants.gff3 2019-nCoV_EPI_ISL_463260_variants.gff3 2019-nCoV_EPI_ISL_456906_variants.gff3 2019-nCoV_MT706144_variants.gff3 2019-nCoV_EPI_ISL_486505_variants.gff3 2019-nCoV_EPI_ISL_463705_variants.gff3 2019-nCoV_EPI_ISL_444717_variants.gff3 2019-nCoV_NMDC60013174_variants.gff3 2019-nCoV_EPI_ISL_457988_variants.gff3 2019-nCoV_MT470146_variants.gff3 2019-nCoV_EPI_ISL_429815_variants.gff3 2019-nCoV_EPI_ISL_483316_variants.gff3 2019-nCoV_EPI_ISL_481676_variants.gff3 2019-nCoV_EPI_ISL_429421_variants.gff3 2019-nCoV_EPI_ISL_485193_variants.gff3 2019-nCoV_EPI_ISL_480365_variants.gff3 2019-nCoV_EPI_ISL_434047_variants.gff3 2019-nCoV_EPI_ISL_477005_variants.gff3 2019-nCoV_EPI_ISL_465708_variants.gff3 2019-nCoV_EPI_ISL_468482_variants.gff3 2019-nCoV_EPI_ISL_480998_variants.gff3 2019-nCoV_MT658504_variants.gff3 2019-nCoV_EPI_ISL_457594_variants.gff3 2019-nCoV_EP
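
The skip logic can be exercised without touching the FTP server. The sketch below uses temporary directories in place of `CACHE` and `CACHE_FAILED`; the helper `needs_download` and the file names `a.gff3`, `b.gff3`, `c.gff3` are hypothetical.

```python
import tempfile
from pathlib import Path

def needs_download(filename, cache_dir, failed_dir):
    """Return True if no cached CSV exists for this file in either directory."""
    csv_name = filename + '.csv'
    return not ((cache_dir / csv_name).exists() or (failed_dir / csv_name).exists())

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'cache'
    failed = Path(tmp) / 'cache_failed'
    cache.mkdir()
    failed.mkdir()

    # Simulate one file parsed in an earlier run and one that failed to parse.
    (cache / 'a.gff3.csv').touch()
    (failed / 'b.gff3.csv').touch()

    pending = [f for f in ['a.gff3', 'b.gff3', 'c.gff3'] if needs_download(f, cache, failed)]

print(pending)
```

Only files absent from both directories are fetched again. A file that failed to download (as opposed to failed to parse) is not cached, so it is retried on the next run.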

### Concatenate all variation data into a single dataframe

In [37]:
# use all cached data files
path = str(CACHE / '*.gff3.csv')
filenames = glob.glob(path)

variations = pd.concat((pd.read_csv(f, index_col=None, header=0) for f in filenames))
variations.fillna('', inplace=True)

List of variant types and consequences:

https://uswest.ensembl.org/info/genome/variation/prediction/classification.html

https://uswest.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences

#### Extract protein position and protein id from proteinVariant string

Example: QHD43415.1:p.5828P>L

proteinPosition: 5828
proteinAccession: QHD43415

In [38]:
# raw string, so that \. and \- are passed to the regex engine unchanged
position_pattern = re.compile(r':p\.(.*?)[A-Z|\-]+')

In [39]:
def extract_protein_position(s):
    # an empty string, or a string without a ':p.' position, yields ''
    match = position_pattern.search(s)
    return match.group(1) if match is not None else ''

In [40]:
variations['proteinPosition'] = variations['proteinVariant'].apply(extract_protein_position)

In [41]:
variations['proteinAccession'] = variations['proteinVariant'].apply(lambda s: s.split('.')[0] if '.' in s else '')
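
Both extraction steps can be checked against `proteinVariant` strings that appear in this notebook. A minimal sketch; the helper `split_protein_variant` is hypothetical and only bundles the two steps for testing.

```python
import re

position_pattern = re.compile(r':p\.(.*?)[A-Z|\-]+')

def split_protein_variant(s):
    """Return (accession, position) for a proteinVariant string, or empty strings if absent."""
    match = position_pattern.search(s)
    position = match.group(1) if match is not None else ''
    # The accession is everything before the first '.', i.e. without the version suffix.
    accession = s.split('.')[0] if '.' in s else ''
    return accession, position

print(split_protein_variant('QHD43415.1:p.5828P>L'))
print(split_protein_variant('QHD43415.1:p.2839S'))
print(split_protein_variant(''))
```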

In [42]:
variations['proteinAccession'].unique()

array(['QHD43415', 'QHD43416', 'QHD43422', '', 'QHD43417', 'QHI42199',
       'QHD43423', 'QHD43419', 'QHD43418', 'QHD43421', 'QHD43420'],
      dtype=object)

#### Assign SARS-CoV-2 taxonomy id

In [43]:
variations['taxonomyId'] = 'taxonomy:2697049'

#### Assign Reference genome

The first SARS-CoV-2 genome sequence is the reference for the variant annotation below.

[Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1](https://www.ncbi.nlm.nih.gov/nuccore/MN908947)

In [44]:
# variations['referenceGenome'] = 'insdc:MN908947' -> same as NCBI reference sequence NC_045512
variations['referenceGenome'] = 'ncbiprotein:NC_045512'

In [45]:
variations['proteinAccession'] = variations['proteinAccession'].apply(lambda s: 'ncbiprotein:' + s if s != '' else s)

In [46]:
print('variantType:', variations['variantType'].unique())

variantType: ['SNP' 'Deletion' 'Insertion' 'Indel']


In [47]:
print("variantConsequence:", variations['variantConsequence'].unique())

variantConsequence: ['synonymous_variant' 'missense_variant' 'intergenic_variant'
 'upstream_gene_variant' 'downstream_gene_variant'
 'coding_sequence_variant' 'inframe_deletion' 'inframe_insertion'
 'frameshift_variant' 'stop_gained' 'protein_altering_variant'
 'start_lost' 'stop_lost']


In [48]:
print("Number of variants:", variations.shape[0])

Number of variants: 47162


In [49]:
variations.to_csv(NEO4J_IMPORT / "01d-CNCBVariant.csv", index=False)

In [50]:
variations.head()

Unnamed: 0,name,variantType,start,end,ref,alt,variantConsequence,proteinVariant,geneVariant,proteinPosition,proteinAccession,taxonomyId,referenceGenome
0,hCoV-19/Iceland/259/2020,SNP,8782,8782,C,T,synonymous_variant,QHD43415.1:p.2839S,orf1ab:c.8517agC>agT,2839,ncbiprotein:QHD43415,taxonomy:2697049,ncbiprotein:NC_045512
1,hCoV-19/Iceland/259/2020,SNP,17747,17747,C,T,missense_variant,QHD43415.1:p.5828P>L,orf1ab:c.17483cCt>cTt,5828,ncbiprotein:QHD43415,taxonomy:2697049,ncbiprotein:NC_045512
2,hCoV-19/Iceland/259/2020,SNP,17858,17858,A,G,missense_variant,QHD43415.1:p.5865Y>C,orf1ab:c.17594tAt>tGt,5865,ncbiprotein:QHD43415,taxonomy:2697049,ncbiprotein:NC_045512
3,hCoV-19/Iceland/259/2020,SNP,18060,18060,C,T,synonymous_variant,QHD43415.1:p.5932L,orf1ab:c.17796ctC>ctT,5932,ncbiprotein:QHD43415,taxonomy:2697049,ncbiprotein:NC_045512
4,hCoV-19/Iceland/259/2020,SNP,24694,24694,A,T,synonymous_variant,QHD43416.1:p.1044G,S:c.3132ggA>ggT,1044,ncbiprotein:QHD43416,taxonomy:2697049,ncbiprotein:NC_045512
