# Example queries for biological entities on COVID-19-Net Knowledge Graph
[Work in progress]

This notebook demonstrates how to run Cypher queries to retrieve data about biological entities in the Knowledge Graph.

In [1]:
import datetime
import pandas as pd
from py2neo import Graph

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

#### Connect to COVID-19-Net Knowledge Graph

In [3]:
graph = Graph("bolt://132.249.238.185:7687", user="reader", password="demo")

### What kind of entities, including bioentities, does the KG contain?

In [4]:
query = """
MATCH (n:NodeMetadata)
RETURN n.name, n.shortDescription, n.description, n.example, n.definitionSource, n.dataProviders
"""
graph.run(query).to_data_frame()

Unnamed: 0,n.name,n.shortDescription,n.description,n.example,n.definitionSource,n.dataProviders
0,Location,Geographic location,A geographic location,"World, ..., Country, State, Country, City, Cru...",,"[GeoNames, UNSD, USCensus, HUD, JHU]"
1,World,The World,Top level location,,,
2,UNRegion,Continental regions,Continental regions according to the M49 stan...,Americas,https://unstats.un.org/unsd/methodology/m49/,[UNSD]
3,UNSubRegion,Subcontinental regions,Subcontinental regions according to the M49 st...,Latin America and the Caribbean,https://unstats.un.org/unsd/methodology/m49/,[UNSD]
4,UNIntermediateRegion,Subdivisions of subcontinental regions,Subdivisions of subcontinental regions accordi...,Caribbean,https://unstats.un.org/unsd/methodology/m49/,[UNSD]
5,Country,Countries and dependent Territories,Countries and dependent Territories defined b...,United States,http://www.geonames.org/,[GeoNames]
6,Admin1,"State, Province, Municipality","First administrative divisions, e.g, State, Pr...",California,http://www.geonames.org/,"[GeoNames, USCensus]"
7,Admin2,County,Second administrative divisions: County in the US,San Diego County,http://www.geonames.org/,"[GeoNames, USCensus]"
8,City,City,City,San Diego,http://www.geonames.org/,"[GeoNames, USCensus]"
9,PostalCode,Postal Code,"E.g., a ZIP Code is a postal code used by the ...",92121,http://purl.obolibrary.org/obo/OPMI_0000120,"[GeoNames, HUD]"


### Fuzzy full-text search for bioentities
Results are ordered by match score. 

Note: a protein may have one or more protein names. Proteins that have been cleaved have the `fullLength` flag set to `False`.

In [5]:
query = """
CALL db.index.fulltext.queryNodes("bioentities", "Spike") YIELD node, score
RETURN node.name AS name, node.fullLength AS fullLength, labels(node), score
"""
df = graph.run(query).to_data_frame()
df.head(25)

Unnamed: 0,name,fullLength,labels(node),score
0,spike protein,,[ProteinName],11.634229
1,spike glycoprotein,,[ProteinName],11.634229
2,spike glycoprotein,True,[Protein],5.612414
3,Spike glycoprotein,True,[Protein],5.612414
4,Spike glycoprotein,,[ProteinName],5.612414
5,Spike glycoprotein,,[ProteinName],5.612414
6,spike protein,True,[Protein],5.612414
7,Spike glycoprotein,False,[Protein],5.612414
8,Spike protein S1,False,[Protein],5.084917
9,Spike protein S2',False,[Protein],5.084917
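
The search term above is written directly into the query text. A minimal sketch of an alternative is to pass it as a Cypher parameter (`$term`) and escape characters that the Lucene query syntax treats as operators, so a term such as `S1/S2` is matched literally. The helper names `escape_lucene` and `search_bioentities` are hypothetical and not part of the notebook; the sketch assumes `graph` is the py2neo `Graph` created above and that `Graph.run` accepts query parameters as keyword arguments.

```python
import re

# Characters that act as operators in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_lucene(term: str) -> str:
    """Backslash-escape Lucene operators so the term is matched literally."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


FULLTEXT_QUERY = """
CALL db.index.fulltext.queryNodes("bioentities", $term) YIELD node, score
RETURN node.name AS name, node.fullLength AS fullLength, labels(node) AS labels, score
"""


def search_bioentities(graph, term):
    """Run the full-text search with the term supplied as a parameter.

    `graph` is assumed to be a py2neo Graph (hypothetical helper).
    """
    return graph.run(FULLTEXT_QUERY, term=escape_lucene(term)).to_data_frame()
```

With this in place, `search_bioentities(graph, "Spike").head(25)` should correspond to the cell above, apart from the renamed `labels` column.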


### Exact full-text search for bioentities
For exact matches, enclose the phrase in double quotes (written as `\"` inside the query string).

In [6]:
query = """
CALL db.index.fulltext.queryNodes("bioentities", '\"Spike glycoprotein\"') YIELD node
RETURN node.name AS name, node.start AS start, node.end AS end, node.fullLength AS fullLength, node.accession AS accession, node.proId AS proId, labels(node)
"""
df = graph.run(query).to_data_frame()
df.head(25)

Unnamed: 0,name,start,end,fullLength,accession,proId,labels(node)
0,spike glycoprotein,,,True,ncbiprotein:YP_009825051,,[Protein]
1,Spike glycoprotein,,,,uniprot:P0DTC2,uniprot.chain:PRO_0000449646,[ProteinName]
2,Spike glycoprotein,,,,uniprot:P0DTC2,,[ProteinName]
3,Spike glycoprotein,13.0,1273.0,False,uniprot:P0DTC2,uniprot.chain:PRO_0000449646,[Protein]
4,Spike glycoprotein,1.0,1273.0,True,uniprot:P0DTC2,,[Protein]
5,spike glycoprotein,,,,ncbiprotein:YP_009825051,,[ProteinName]
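
The quoting used for exact matches can be sketched in plain Python. In a regular (non-raw) Python string, `\"` is just a double-quote character, so the text sent to the server contains the phrase wrapped in ordinary double quotes. The helper `as_phrase` is a hypothetical name introduced here for illustration.

```python
def as_phrase(text: str) -> str:
    """Wrap text in double quotes so it is searched as one exact phrase."""
    # Escape backslashes first, then any embedded double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# The escaped form used in the query cell is the same string as the plain form.
print('\"Spike glycoprotein\"' == '"Spike glycoprotein"')  # True

print(as_phrase("Spike glycoprotein"))  # "Spike glycoprotein"
```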


### Which organisms have been identified as hosts for SARS-CoV-2?

In [7]:
query = """
MATCH (h:Host)
RETURN h.name AS name, h.scientificName AS scientificName, h.id AS taxonomyId
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,scientificName,taxonomyId
0,human,Homo sapiens,taxonomy:9606
1,house mouse,Mus musculus,taxonomy:10090
2,intermediate horseshoe bat,Rhinolophus affinis,taxonomy:59477
3,Malayan horseshoe bat,Rhinolophus malayanus,taxonomy:608659
4,horseshoe bat,Rhinolophus,taxonomy:49442
5,Malayan pangolin,Manis javanica,taxonomy:9974
6,palm civet,Paradoxurus,taxonomy:71116
7,carnivores,Canidae,taxonomy:9608
8,cat,Felis catus,taxonomy:9685
9,European mink,Mustela lutreola,taxonomy:9666


### Which Pathogens have caused Coronavirus Outbreaks?

In [8]:
query = """
MATCH (p:Pathogen)-[:CAUSES]->(o:Outbreak)
RETURN p.name AS name, p.scientificName AS scientificName, p.id AS taxonomyId, o.id AS outbreak, o.startDate AS startDate
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,scientificName,taxonomyId,outbreak,startDate
0,SARS-CoV-2,Severe acute respiratory syndrome coronavirus 2,taxonomy:2697049,COVID-19,2019
1,MERS-CoV,Middle East respiratory syndrome-related coron...,taxonomy:1335626,MERS,2012
2,SARS-CoV,Severe acute respiratory syndrome-related coro...,taxonomy:227984,SARS,2003


### Which PubMedCentral articles mention strain datasets about different hosts?

In [9]:
query = """
MATCH (p:Publication)-[:MENTIONS]->(s:Strain)<-[:CARRIES]-(h:Host)
RETURN p.id AS pmc, s.name AS name, s.collectionDate AS collectionDate, h.name AS host, h.id AS hostTaxonomyId
ORDER BY s.collectionDate
"""
graph.run(query).to_data_frame().head(20)

Unnamed: 0,pmc,name,collectionDate,host,hostTaxonomyId
0,pmc:PMC7166309,hCoV-19/pangolin/Guangxi/P4L/2017,2017-01-01,Malayan pangolin,taxonomy:9974
1,pmc:PMC7166309,hCoV-19/pangolin/Guangxi/P5E/2017,2017-01-01,Malayan pangolin,taxonomy:9974
2,pmc:PMC7166309,hCoV-19/pangolin/Guangxi/P1E/2017,2017-01-01,Malayan pangolin,taxonomy:9974
3,pmc:PMC7166309,hCoV-19/pangolin/Guangxi/P5L/2017,2017-01-01,Malayan pangolin,taxonomy:9974
4,pmc:PMC7166309,hCoV-19/pangolin/Guangxi/P2V/2017,2017-01-01,Malayan pangolin,taxonomy:9974
5,https://europepmc.org/article/PPR/PPR166395,hCoV-19/pangolin/Guangxi/P5L/2017,2017-01-01,Malayan pangolin,taxonomy:9974
6,pmc:PMC7447761,hCoV-19/pangolin/Guangxi/P4L/2017,2017-01-01,Malayan pangolin,taxonomy:9974
7,https://europepmc.org/article/PPR/PPR161913,hCoV-19/pangolin/Guangxi/P5L/2017,2017-01-01,Malayan pangolin,taxonomy:9974
8,https://europepmc.org/article/PPR/PPR161913,hCoV-19/pangolin/Guangxi/P5E/2017,2017-01-01,Malayan pangolin,taxonomy:9974
9,https://europepmc.org/article/PPR/PPR161913,hCoV-19/pangolin/Guangxi/P4L/2017,2017-01-01,Malayan pangolin,taxonomy:9974


### What are the genes and gene products of the SARS-CoV-2 virus?
This query lists the genes and proteins encoded by the SARS-CoV-2 reference genome NC_045512. MN908947 is the original unannotated dataset submitted to NCBI; it is the first sequenced genome of SARS-CoV-2, collected in Wuhan.

In [10]:
query = """
MATCH (n:Genome{id: 'refseq:NC_045512'})-[:HAS_GENE]->(g:Gene)-[:ENCODES]->(p:Protein)
RETURN n.id AS referenceGenome, n.name AS name,
       g.name AS gene, g.id AS geneId, p.name AS protein, p.accession AS accession, p.sequence AS sequence
"""
graph.run(query).to_data_frame()

Unnamed: 0,referenceGenome,name,gene,geneId,protein,accession,sequence
0,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF3a,refseq:NC_045512-25393-26220,ORF3a protein,uniprot:P0DTC3,MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWL...
1,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF8,refseq:NC_045512-27894-28259,ORF8 protein,uniprot:P0DTC8,MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVVDDPCPIHFYSKWY...
2,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF1ab,refseq:NC_045512-266-21555,Replicase polyprotein 1ab,uniprot:P0DTD1,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
3,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF7b,refseq:NC_045512-27756-27887,ORF7b protein,uniprot:P0DTD8,MIELSLIDFYLCFLAFLLFLVLIMLIIFWFSLELQDHNETCHA
4,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF7a,refseq:NC_045512-27394-27759,ORF7a protein,uniprot:P0DTC7,MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNSPF...
5,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,E,refseq:NC_045512-26245-26472,Envelope small membrane protein,uniprot:P0DTC4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...
6,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF6,refseq:NC_045512-27202-27387,ORF6 protein,uniprot:P0DTC6,MFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSLTE...
7,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,M,refseq:NC_045512-26523-27191,Membrane protein,uniprot:P0DTC5,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...
8,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,ORF1ab,refseq:NC_045512-266-13483,Replicase polyprotein 1a,uniprot:P0DTC1,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
9,refseq:NC_045512,Severe acute respiratory syndrome coronavirus ...,N,refseq:NC_045512-28274-29533,Nucleoprotein,uniprot:P0DTC9,MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLP...


### How many complete SARS-CoV-2 genomes have been sequenced in each country?
Here we aggregate the number of genomes over up to 3 hops to the Country level:

`City -> Admin2(county) -> Admin1(state, province) -> Country`

using the variable-length relationship `[:IN*0..3]`.

This number includes only complete high-quality genomes as determined by the [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome).

Note: some strains have been deposited under different names in multiple repositories. Therefore, the numbers below include some duplicates.

In [11]:
query = """
MATCH (s:Strain)-[:FOUND_IN]->(l:Location)-[:IN*0..3]->(c:Country)
RETURN count(s) AS count, c.name AS country
ORDER BY count DESC
"""
graph.run(query).to_data_frame()

Unnamed: 0,count,country
0,36021,United States
1,35340,United Kingdom
2,7979,Australia
3,2292,Canada
4,2234,Netherlands
5,2169,India
6,1641,Switzerland
7,1252,France
8,1225,Spain
9,1016,Portugal
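
To illustrate what the variable-length pattern `[:IN*0..3]` does, here is a minimal pure-Python sketch over hypothetical toy data (the strain names and the small location hierarchy are made up for illustration and are not taken from the Knowledge Graph). Each strain is counted for a country if that country can be reached from the strain's location in 0 to 3 `IN` hops; 0 hops covers strains attached directly to a country.

```python
from collections import Counter

# Hypothetical child -> parent edges standing in for [:IN] relationships.
PARENT = {
    "San Diego": "San Diego County",
    "San Diego County": "California",
    "California": "United States",
    "Ontario": "Canada",
}
COUNTRIES = {"United States", "Canada"}

# Hypothetical strains and the location each one was FOUND_IN.
FOUND_IN = {
    "strain-1": "San Diego",      # city: 3 hops to the country
    "strain-2": "California",     # admin1: 1 hop
    "strain-3": "United States",  # country itself: 0 hops
    "strain-4": "Ontario",        # admin1: 1 hop
}


def reachable(start, min_hops, max_hops):
    """Yield nodes reached by following min_hops..max_hops IN edges."""
    node, hops = start, 0
    while node is not None and hops <= max_hops:
        if hops >= min_hops:
            yield node
        node = PARENT.get(node)
        hops += 1


def count_by_country(min_hops=0, max_hops=3):
    """Count strains per country, mimicking [:IN*min_hops..max_hops]."""
    counts = Counter()
    for location in FOUND_IN.values():
        for node in reachable(location, min_hops, max_hops):
            if node in COUNTRIES:
                counts[node] += 1
    return counts


print(count_by_country())  # Counter({'United States': 3, 'Canada': 1})
```

Lowering `max_hops` to 2 drops the city-level strain from the count, and raising `min_hops` to 1 drops the strain attached directly to the country, which is why the query uses the range `0..3`.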


### How many complete SARS-CoV-2 genomes have been sequenced in each UN Region?
Here we aggregate the number of genomes over up to 6 hops to the UN Region level:

`City -> Admin2(county) -> Admin1(state, province) -> Country -> UNIntermediateRegion -> UNSubRegion -> UNRegion`

using the variable-length relationship `[:IN*0..6]`.

In [12]:
query = """
MATCH (s:Strain)-[:FOUND_IN]->(l:Location)-[:IN*0..6]->(u:UNRegion)
RETURN count(s) AS count, u.name
ORDER BY count DESC
"""
graph.run(query).to_data_frame()

Unnamed: 0,count,u.name
0,49463,Europe
1,46586,Americas
2,8505,Oceania
3,8355,Asia
4,1849,Africa
