# Example queries for biological entities on COVID-19-Net Knowledge Graph
[Work in progress]

This notebook demonstrates how to run Cypher queries to retrieve data about biological entities in the Knowledge Graph.

In [1]:
import datetime
import pandas as pd
from py2neo import Graph

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

#### Connect to COVID-19-Net Knowledge Graph

In [3]:
graph = Graph("bolt://132.249.238.185:7687", user="reader", password="demo")

### What kind of entities, including bioentities, does the KG contain?

In [4]:
query = """
MATCH (n:MetaNode)
RETURN n.name, n.shortDescription, n.description, n.example, n.definitionSource, n.dataSources
"""
graph.run(query).to_data_frame()

### Fuzzy full-text search for bioentities
Results are ordered by match score. 

Note that a protein may have one or more protein names. Proteins that have been cleaved have the `fullLength` flag set to `False`.

In [5]:
query = """
CALL db.index.fulltext.queryNodes("bioentities", "Spike") YIELD node, score
RETURN node.name AS name, node.fullLength AS fullLength, labels(node), score
"""
df = graph.run(query).to_data_frame()
df.head(25)

Unnamed: 0,name,fullLength,labels(node),score
0,Spike,True,[Protein],4.653281
1,Spike,True,[Protein],4.653281
2,Spike,,[ProteinName],4.653281
3,Spike,,[ProteinName],4.653281
4,Spike,,[ProteinName],4.653281
5,Spike,,[ProteinName],4.653281
6,Spike,,[ProteinName],4.653281
7,Spike,,[ProteinName],4.653281
8,Spike,,[ProteinName],4.653281
9,Spike,,[ProteinName],4.653281
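
The query above passes the plain term `Spike` to the index. Neo4j full-text indexes accept Lucene query syntax, in which a `~` suffix with an optional maximum edit distance (0 to 2) makes a term tolerant of misspellings. The following is a minimal sketch of a helper that builds such a term; the name `fuzzy_term` is hypothetical and is not part of this notebook or of py2neo.

```python
import re

# Characters and operators with special meaning in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def fuzzy_term(term, max_edits=1):
    """Build a Lucene fuzzy term such as 'Spike~1' (hypothetical helper)."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene allows an edit distance of 0, 1, or 2")
    if not term or any(ch.isspace() for ch in term):
        raise ValueError("fuzzy matching applies to a single term without whitespace")
    # Prefix every special character with a backslash so it is matched literally.
    escaped = _LUCENE_SPECIAL.sub(r"\\\1", term)
    return f"{escaped}~{max_edits}"

print(fuzzy_term("Spike"))         # Spike~1
print(fuzzy_term("nsp1-like", 2))  # nsp1\-like~2
```

The resulting string can be used in place of `"Spike"` in the `queryNodes` call; whether it returns additional matches depends on the contents of the index.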


### Exact full-text search for bioentities
For an exact match, enclose the phrase in double quotes. Inside the Python string below they are written as `\"`.

In [6]:
query = """
CALL db.index.fulltext.queryNodes("bioentities", '\"Spike glycoprotein\"') YIELD node
WHERE 'Protein' IN labels(node)
RETURN node.name AS name, node.start as start, node.end as end, node.fullLength as fullLength, node.accession as accession, node.proId as proId, node.taxonomyId as taxonomy
ORDER BY taxonomy, start
"""
df = graph.run(query).to_data_frame()
df.head(25)

Unnamed: 0,name,start,end,fullLength,accession,proId,taxonomy
0,Spike glycoprotein,1,1173,True,uniprot:H1AG29,,taxonomy:11137
1,Spike glycoprotein,1,1170,True,uniprot:Q1HVK9,,taxonomy:11137
2,Spike glycoprotein,1,1171,True,uniprot:A0A384RBT3,,taxonomy:11137
3,Spike glycoprotein,1,1170,True,uniprot:A0A384QXY4,,taxonomy:11137
4,Spike glycoprotein,1,1171,True,uniprot:A0A384RHP9,,taxonomy:11137
5,Spike glycoprotein,1,567,True,uniprot:Q6TUL3,,taxonomy:11137
6,Spike glycoprotein,1,353,True,uniprot:A0A0A0QCI3,,taxonomy:11137
7,Spike glycoprotein,1,567,True,uniprot:Q6TUL1,,taxonomy:11137
8,Spike glycoprotein,1,1170,True,uniprot:A0A384RNQ9,,taxonomy:11137
9,Spike glycoprotein,1,1170,True,uniprot:Q1HVL9,,taxonomy:11137
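
Escaping quotes by hand inside a triple-quoted string is easy to get wrong. One alternative, sketched here under stated assumptions, is to build the quoted phrase in Python and pass it to the query as a Cypher parameter. The helper name `exact_phrase` and the parameter name `$phrase` are chosen for illustration only; py2neo's `Graph.run` accepts query parameters as keyword arguments.

```python
def exact_phrase(phrase):
    """Wrap a phrase in double quotes for a Lucene exact-phrase match (hypothetical helper)."""
    # Escape backslashes first, then any embedded double quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

# The phrase is supplied as the $phrase parameter instead of being embedded in the query text.
query = """
CALL db.index.fulltext.queryNodes("bioentities", $phrase) YIELD node
WHERE 'Protein' IN labels(node)
RETURN node.name AS name, node.accession AS accession
"""
params = {"phrase": exact_phrase("Spike glycoprotein")}
print(params["phrase"])  # "Spike glycoprotein"

# With the connection opened earlier in this notebook:
# df = graph.run(query, **params).to_data_frame()
```

Keeping the search phrase out of the query text also means the same query string can be reused for other phrases.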


### Which organisms have been identified as hosts for coronaviruses?

In [9]:
query = """
MATCH (h:Organism{type: 'Host'})
RETURN h.name AS name, h.scientificName AS scientificName, h.id AS taxonomyId
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,scientificName,taxonomyId
0,human,Homo sapiens,taxonomy:9606
1,carnivores,Canidae,taxonomy:9608
2,dog,Canis lupus familiaris,taxonomy:9615
3,European mink,Mustela lutreola,taxonomy:9666
4,cat,Felis catus,taxonomy:9685
5,lion,Panthera leo,taxonomy:9689
6,tiger,Panthera tigris,taxonomy:9694
7,Malayan pangolin,Manis javanica,taxonomy:9974
8,golden hamster,Mesocricetus auratus,taxonomy:10036
9,house mouse,Mus musculus,taxonomy:10090


### Which Pathogens have caused Coronavirus Outbreaks?

In [10]:
query = """
MATCH (p:Organism)-[:CAUSES]->(o:Outbreak)
RETURN p.name AS name, p.scientificName AS scientificName, p.id AS taxonomyId, o.id AS outbreak, o.startDate AS startDate
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,scientificName,taxonomyId,outbreak,startDate
0,SARS-CoV-2,Severe acute respiratory syndrome coronavirus 2,taxonomy:2697049,COVID-19,2019
1,MERS-CoV,Middle East respiratory syndrome-related coron...,taxonomy:1263720,MERS,2012
2,SARS-CoV,Severe acute respiratory syndrome-related coro...,taxonomy:694009,SARS,2003


### Which PubMedCentral articles mention strain datasets about different hosts?

In [16]:
query = """
MATCH (p:Publication)-[:MENTIONS]->(s:Strain)<-[:CARRIES]-(h:Organism)
RETURN p.id AS pmc, s.name AS name, s.collectionDate  AS collectionDate, h.name AS host, h.id AS hostTaxonomyId
"""
graph.run(query).to_data_frame().sample(10)

Unnamed: 0,pmc,name,collectionDate,host,hostTaxonomyId
1626,https://europepmc.org/article/PPR/PPR204749,hCoV-19/USA/NV-NSPHL-A0132/2020,2020-04-29,human,taxonomy:9606
2477,https://europepmc.org/article/PPR/PPR180499,hCoV-19/Canada/ON_QGLO-011/2020,2020-03-26,human,taxonomy:9606
2181,https://europepmc.org/article/PPR/PPR195212,hCoV-19/Turkey/HSGM-10232/2020,2020-03-24,human,taxonomy:9606
3410,pubmed:32543353,SARS-CoV-2/human/TWN/CGMH-CGU-18/2020,2020-03-18,human,taxonomy:9606
2402,https://europepmc.org/article/PPR/PPR210955,hCoV-19/Peru/INS-024/2020,2020-03-25,human,taxonomy:9606
3511,https://europepmc.org/article/PPR/PPR214566,hCoV-19/India/CCMB_J375/2020,2020-04-02,human,taxonomy:9606
3696,https://europepmc.org/article/PPR/PPR210888,hCoV-19/South Korea/KCDC2655/2020,2020-07-21,human,taxonomy:9606
3300,https://europepmc.org/article/PPR/PPR224372,hCoV-19/Cyprus/005/2020,2020-04-01,human,taxonomy:9606
804,pubmed:32937193,Wuhan-Hu-1,2019-12-30,human,taxonomy:9606
122,pubmed:32489698,hCoV-19/Italy/SPL1/2020,2020-01-29,human,taxonomy:9606


### What are the genes and gene products of the SARS-CoV-2 Virus?
This query lists the genes and proteins encoded by the SARS-CoV-2 reference genome NC_045512. MN908947 is the original unannotated dataset submitted to NCBI; it is the first sequenced genome of SARS-CoV-2, collected in Wuhan.

In [19]:
query = """
MATCH (o:Organism{name: 'SARS-CoV-2'})-[:HAS_GENOME]-(n:Genome)-[:HAS_CHROMOSOME]-(:Chromosome)-[:HAS_GENE]-(g:Gene)-[:ENCODES]->(p:Protein{reviewed: True})
RETURN o.name as name, g.name as gene, p.name as protein, p.accession as accession, p.sequence as sequence
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,gene,protein,accession,sequence
0,SARS-CoV-2,M,Membrane protein,uniprot:P0DTC5,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...
1,SARS-CoV-2,3a,ORF3a protein,uniprot:P0DTC3,MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWL...
2,SARS-CoV-2,rep,Replicase polyprotein 1ab,uniprot:P0DTD1,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
3,SARS-CoV-2,N,Nucleoprotein,uniprot:P0DTC9,MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLP...
4,SARS-CoV-2,6,ORF6 protein,uniprot:P0DTC6,MFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSLTE...
5,SARS-CoV-2,ORF14,Uncharacterized protein 14,uniprot:P0DTD3,MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ...
6,SARS-CoV-2,E,Envelope small membrane protein,uniprot:P0DTC4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...
7,SARS-CoV-2,9b,ORF9b protein,uniprot:P0DTD2,MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQNNVGPKVYPIIL...
8,SARS-CoV-2,1a,Replicase polyprotein 1a,uniprot:P0DTC1,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
9,SARS-CoV-2,S,Spike glycoprotein,uniprot:P0DTC2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...


#### The following proteins are proteolytic cleavage products of expressed proteins

In [20]:
query = """
MATCH (o:Organism{name: 'SARS-CoV-2'})-[:HAS_GENOME]-(n:Genome)-[:HAS_CHROMOSOME]-(:Chromosome)-[:HAS_GENE]-(g:Gene)-[:ENCODES]->(:Protein{reviewed: True})-[:CLEAVED_TO]->(p:Protein)
RETURN o.name as name, g.name as gene, p.name as protein, p.accession as accession, p.sequence as sequence
"""
graph.run(query).to_data_frame()

Unnamed: 0,name,gene,protein,accession,sequence
0,SARS-CoV-2,rep,Non-structural protein 4,uniprot:P0DTC1,KIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGY...
1,SARS-CoV-2,rep,Non-structural protein 2,uniprot:P0DTC1,AYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKR...
2,SARS-CoV-2,rep,RNA-directed RNA polymerase,uniprot:P0DTD1,SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFA...
3,SARS-CoV-2,rep,Helicase,uniprot:P0DTD1,AVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVN...
4,SARS-CoV-2,rep,Non-structural protein 7,uniprot:P0DTC1,SKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTT...
5,SARS-CoV-2,rep,3C-like proteinase,uniprot:P0DTC1,SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS...
6,SARS-CoV-2,rep,Non-structural protein 3,uniprot:P0DTC1,APTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVEL...
7,SARS-CoV-2,rep,Host translation inhibitor nsp1,uniprot:P0DTC1,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
8,SARS-CoV-2,rep,Non-structural protein 8,uniprot:P0DTC1,AIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAK...
9,SARS-CoV-2,rep,Uridylate-specific endoribonuclease,uniprot:P0DTD1,SLENVAFNVVNKGHFDGQQGEVPVSIINNTVYTKVDGVDVELFENK...
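
Protein nodes also carry `start` and `end` properties, as returned by the exact-match query earlier in this notebook. Assuming these are 1-based, inclusive residue positions on the full-length precursor (the outputs shown here are consistent with that, but do not confirm it), the sequence of a cleavage product corresponds to a slice of the precursor sequence. The sketch below illustrates the index arithmetic; the function name `cleavage_product` is hypothetical and the coordinates are invented.

```python
def cleavage_product(precursor, start, end):
    """Return residues start..end (assumed 1-based, inclusive) of a precursor sequence."""
    if not 1 <= start <= end <= len(precursor):
        raise ValueError("expected 1 <= start <= end <= len(precursor)")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return precursor[start - 1:end]

# First 20 residues of the replicase polyprotein as listed in the gene product table;
# the coordinates below are invented for illustration and are not real cleavage sites.
precursor = "MESLVPGFNEKTHVQLSLPV"
print(cleavage_product(precursor, 1, 5))   # MESLV
print(cleavage_product(precursor, 6, 10))  # PGFNE
```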


### How many complete SARS-CoV-2 genomes have been sequenced in each country?
Here we aggregate the number of genomes over up to 3 hops to the country level:

`City -> Admin2(county) -> Admin1(state, province) -> Country`

using the variable-length relationship `[:IN*0..3]`.

This number includes only complete high-quality genomes as determined by the [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome).

Note that some strains have been deposited under different names in multiple repositories. Therefore, the numbers below include some duplicates.

In [30]:
query = """
MATCH (s:Strain)-[:FOUND_IN]->(l:Location)-[:IN*0..3]->(c:Country)
RETURN count(s) AS count, c.name AS country
ORDER BY count DESC
"""
graph.run(query).to_data_frame()

Unnamed: 0,count,country
0,90014,United Kingdom
1,67228,United States
2,14343,Australia
3,9008,Denmark
4,4500,Spain
5,3496,Netherlands
6,3350,Switzerland
7,3261,Canada
8,3256,India
9,1986,South Africa
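
To make the semantics of `[:IN*0..3]` concrete, here is a small in-memory sketch of the same roll-up. It is not how Neo4j evaluates the pattern; it only mimics the result on toy data. All strain identifiers and the location hierarchy below are invented, and the sketch assumes each location has at most one parent.

```python
from collections import Counter

# Toy data, invented for illustration: child -> parent along IN relationships.
IN = {
    "San Diego": "San Diego County",
    "San Diego County": "California",
    "California": "United States",
    "England": "United Kingdom",
}
COUNTRIES = {"United States", "United Kingdom", "Denmark"}

# Strain -> location it was FOUND_IN, which may sit at any level of the hierarchy.
FOUND_IN = {
    "s1": "San Diego",
    "s2": "California",
    "s3": "England",
    "s4": "Denmark",
    "s5": "United States",
}

def country_of(location, max_hops=3):
    """Follow between 0 and max_hops IN edges; 0 hops means the location is itself a country."""
    for _ in range(max_hops + 1):
        if location in COUNTRIES:
            return location
        location = IN.get(location)
        if location is None:
            return None
    return None

counts = Counter(c for c in map(country_of, FOUND_IN.values()) if c is not None)
print(counts["United States"])  # 3
```

The lower bound of 0 matters: a strain attached directly to a country node (like `s4` and `s5` here) is counted without traversing any `IN` relationship.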


### How many complete SARS-CoV-2 genomes have been sequenced in each UN Region?
Here we aggregate the number of genomes over up to 6 hops to the UN Region level:

`City -> Admin2(county) -> Admin1(state, province) -> Country -> UNSubRegion -> UNIntermediateRegion -> UNRegion`

using the variable-length relationship `[:IN*0..6]`.

In [31]:
query = """
MATCH (s:Strain)-[:FOUND_IN]->(l:Location)-[:IN*0..6]->(u:UNRegion)
RETURN count(s) AS count, u.name
ORDER BY count DESC
"""
graph.run(query).to_data_frame()

Unnamed: 0,count,u.name
0,123088,Europe
1,83625,Americas
2,15322,Oceania
3,12065,Asia
4,3757,Africa
