# Examples of Full-text queries in the COVID-19-Net Knowledge Graph
[Work in progress]

This notebook demonstrates how to run full-text [Lucene queries](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview) to retrieve data about entities in the Knowledge Graph. 

COVID-19-Net KG supports text searches for the following types of entities:
* bioentities: full-text query for biological entities (proteins, genes, domains, etc.)
* locations: full-text query for locations (cities, counties, states, countries, etc.)
* geoids: keyword (exact) query for geographic identifiers (ZIP codes, FIPS codes, ISO codes, etc.)

Full-text queries have the following format:

```CALL db.index.fulltext.queryNodes('<type of entity>', '<text query>') YIELD node, score```

The queries return the matching nodes, and a score for each match (higher scores indicate better matches).
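
The call format above can also be written with Cypher parameters, so that the index name and the query text do not have to be pasted into the statement. The following is a minimal sketch, not part of the original notebook: the helper name `fulltext_query` and its arguments are hypothetical, and it assumes that the server accepts `$index` and `$text` parameters in this procedure call.

```python
def fulltext_query(return_clause="node.name AS name, labels(node) AS labels, score",
                   limit=None):
    """Build a parameterized Cypher statement for a full-text index lookup.

    Hypothetical helper: the index name and query text are supplied
    separately as the parameters $index and $text.
    """
    lines = [
        "CALL db.index.fulltext.queryNodes($index, $text) YIELD node, score",
        f"RETURN {return_clause}",
    ]
    if limit is not None:
        # Cast to int so that only a number can end up in the statement.
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


cypher = fulltext_query(limit=20)
params = {"index": "bioentities", "text": "Spike"}
print(cypher)
print(params)
```

With a py2neo connection such as the `graph` object created in this notebook, the statement could then be run as `graph.run(cypher, params).to_data_frame()`.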

[Query Syntax](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview)

[Learn more about full-text searches](https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html).

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import datetime
import pandas as pd
from py2neo import Graph

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

#### Connect to COVID-19-Net Knowledge Graph

In [3]:
graph = Graph("bolt://132.249.238.185:7687", user="reader", password="demo")

### Full-text search for a term (word)
Results are ordered by match score ([TermsQuery](https://lucene.apache.org/core/5_5_5/queries/org/apache/lucene/queries/TermsQuery.html)).
Note that queries are case-insensitive.

In [4]:
# This query searches for the term "Spike" in the bioentities.
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'Spike') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(20)

Results: 62


Unnamed: 0,name,labels(node),score
0,Spike glycoprotein,[ProteinName],8.872746
1,Spike glycoprotein,[ProteinName],8.872746
2,Spike glycoprotein,[ProteinName],8.872746
3,Spike glycoprotein,[ProteinName],8.872746
4,Spike glycoprotein,[ProteinName],8.872746
5,Spike glycoprotein,[ProteinName],8.872746
6,Spike glycoprotein,[ProteinName],8.872746
7,Spike glycoprotein,[ProteinName],8.872746
8,Spike glycoprotein,[ProteinName],8.872746
9,Spike glycoprotein,[ProteinName],8.872746


### Search for an exact phrase
For exact matches, enclose the phrase in **double** quotes ([PhraseQuery](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/PhraseQuery.html)).

In [5]:
# This query searches for the exact phrase "Spike glycoprotein" in the bioentities.
query = """
CALL db.index.fulltext.queryNodes('bioentities', '"Spike glycoprotein"') YIELD node
WHERE 'Protein' IN labels(node)
RETURN node.name AS name, node.start AS start, node.end AS end, node.fullLength AS fullLength, node.accession AS accession, node.proId AS proId,
       node.taxonomyId AS taxonomy, node.sequence AS sequence
ORDER BY taxonomy, start
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 15


Unnamed: 0,name,start,end,fullLength,accession,proId,taxonomy,sequence
0,Spike glycoprotein,1.0,1173.0,True,uniprot:P15423,,taxonomy:11137,MFVLLVAYALLHIAGCQTTNGLNTSYSVCNGCVGYSENVFAVESGG...
1,Spike glycoprotein,16.0,1173.0,False,uniprot:P15423,uniprot.chain:PRO_0000037203,taxonomy:11137,CQTTNGLNTSYSVCNGCVGYSENVFAVESGGYIPSDFAFNNWFLLT...
2,Spike glycoprotein,1.0,1353.0,True,uniprot:K9N5Q8,,taxonomy:1263720,MIHSVFLLMFLLTPTESYVDVGPDSVKSACIEVDIQQTFFDKTWPR...
3,Spike glycoprotein,18.0,1353.0,False,uniprot:K9N5Q8,uniprot.chain:PRO_0000422465,taxonomy:1263720,YVDVGPDSVKSACIEVDIQQTFFDKTWPRPIDVSKADGIIYPQGRT...
4,Spike glycoprotein,1.0,1273.0,True,uniprot:P0DTC2,,taxonomy:2697049,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...
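
When the phrase comes from user input rather than a fixed string, the double quotes have to be added programmatically. The sketch below shows one way to do that; the helper name `as_phrase` is hypothetical, and escaping embedded quotes with a backslash follows the classic Lucene query syntax.

```python
def as_phrase(text):
    """Wrap text in double quotes so Lucene treats it as a single phrase.

    Backslashes and embedded double quotes are escaped first, so they
    cannot terminate the phrase early.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(as_phrase("Spike glycoprotein"))
```

The resulting string is the Lucene query text only. It still has to be placed into the Cypher statement, for example as a query parameter.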


### Search a specific node property
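To restrict a search to one node property, prefix the query with the property name followed by a colon. For exact matches, enclose the phrase in **double** quotes ([PhraseQuery](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/PhraseQuery.html)).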

In [6]:
# This query matches nodes with the property scientificName == "Homo sapiens".

query = """
CALL db.index.fulltext.queryNodes('bioentities', 'scientificName: "Homo sapiens"') YIELD node, score
RETURN node.name AS name, node.scientificName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(15)

Results: 1


Unnamed: 0,name,node.scientificName,labels(node),score
0,human,Homo sapiens,[Organism],2.801928
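
Because the colon separates a property name from its search term, a value that itself contains a colon (for example an identifier such as `uniprot:P15423`) would be parsed as a property-specific search unless the colon is escaped. The following sketch escapes the special characters of the classic Lucene query syntax with a backslash. The function name `escape_term` is hypothetical, and which properties are covered by each index is not stated in this notebook.

```python
import re

# Characters that have a special meaning in the classic Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/|&])')


def escape_term(text):
    """Prefix every Lucene special character in text with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


print(escape_term("uniprot:P15423"))
print(escape_term("DNA/RNA"))
```

Escaping is only needed for literal values. Operators that are meant as operators, such as the `~` and `*` used in the following sections, must stay unescaped.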


### Fuzzy Query
A fuzzy query finds approximate matches by adding a tilde "~" at the end of the query term
([FuzzyQuery](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/FuzzyQuery.html)).

In [7]:
# This query does a fuzzy search for the term "protease" in the bioentities. 
# It matches terms such as "proteinase" or "proteasome".
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'protease~') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.sample(5)

Results: 953


Unnamed: 0,name,labels(node),score
145,Papain-like proteinase,[ProteinName],5.508669
381,Ubiquitin-specific-processing protease 19,[ProteinName],4.805839
532,Proteasome subunit alpha type-2,[ProteinName],4.050915
431,Ubiquitin-specific-processing protease 40,[ProteinName],4.805839
59,Serine protease 23,[ProteinName],5.9311


To restrict the fuzzy search to more similar terms, a similarity threshold between 0.0 and 1.0 can be specified after the tilde (default 0.5).

In [8]:
# This query searches for the location "antonio" with a similarity threshold of 0.8.

query = """
CALL db.index.fulltext.queryNodes('locations', 'antonio~0.8') YIELD node, score
RETURN node.name AS name, node.placeName AS placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 393


Unnamed: 0,name,placeName,labels(node),score
0,81120,Antonito,"[Location, PostalCode]",4.358802
1,Saint-Antonin,,"[Location, City]",3.965481
2,Maria Antonia,,"[Location, City]",3.965481
3,Sant Antoni,,"[Location, City]",3.5902
4,Sankt Antoni,,"[Location, City]",3.5902
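
Fuzzy matching is based on edit distance, the number of single-character changes needed to turn one term into another. The sketch below computes the plain Levenshtein distance for some of the term pairs seen in the results above, to illustrate why they count as approximate matches. It is an illustration only and not Lucene's implementation, which by default also treats a transposition of two adjacent characters as a single edit.

```python
def levenshtein(a, b):
    """Return the number of insertions, deletions, and substitutions
    needed to turn string a into string b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


pairs = [("protease", "proteinase"),
         ("protease", "proteasome"),
         ("antonio", "Antonito")]
for term, candidate in pairs:
    print(term, candidate, levenshtein(term, candidate))
```

A smaller distance means a more similar term, so a stricter threshold keeps only candidates that are fewer edits away from the query term.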


### Wildcard query
Wildcards are used to search for alternate spellings and variations on a root word. A "*" matches zero or more characters and a "?" matches exactly one character
([WildcardQuery](https://lucene.apache.org/core/5_5_5/core/org/apache/lucene/search/WildcardQuery.html)).

In [9]:
# This query matches terms that start with "poly". "*" matches zero or more characters.

query = """
CALL db.index.fulltext.queryNodes('bioentities', 'poly*') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 1509


Unnamed: 0,name,labels(node),score
0,Replicase polyprotein 1ab,[ProteinName],2.0
1,"DNA-directed RNA polymerases I, II, and III su...",[ProteinName],2.0
2,"RNA polymerases I, II, and III subunit ABC3",[ProteinName],2.0
3,DNA-directed RNA polymerase II subunit H,[ProteinName],2.0
4,"DNA-directed RNA polymerases I, II, and III 17...",[ProteinName],2.0


A "?" is a wildcard for exactly one character.

In [10]:
# This query matches any three-character term that ends in "NA", such as RNA and DNA.
query = """
CALL db.index.fulltext.queryNodes('bioentities', '?NA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1807


Unnamed: 0,name,labels(node),score
0,RNA-directed RNA polymerase,[ProteinName],2.0
1,RNA-directed RNA polymerase,[ProteinName],2.0
2,RNA-directed RNA polymerase,[ProteinName],2.0
3,RNA-directed RNA polymerase,[ProteinName],2.0
4,DNA repair protein REV1,[ProteinName],2.0
5,Ribosomal RNA processing protein 1 homolog A,[ProteinName],2.0
6,DNA amplified in mammary carcinoma 1 protein,[ProteinName],2.0
7,CAAT box DNA-binding protein subunit A,[ProteinName],2.0
8,RNA-binding protein Nova-1,[ProteinName],2.0
9,Long intergenic non-protein coding RNA 1006,[ProteinName],2.0
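
Wildcards are applied to the individual terms of a name, not to the name as a whole, which is why "Replicase polyprotein 1ab" matches `poly*`. The sketch below imitates this locally with Python's `fnmatch` module, whose `*` and `?` have the same meaning for simple patterns like these. The tokenizer is an assumption (lowercase, split on non-alphanumeric characters), since the analyzer configured for the index is not shown in this notebook, and the function name `name_matches` is hypothetical.

```python
import re
from fnmatch import fnmatchcase


def name_matches(name, pattern):
    """Return True if any term of name matches the wildcard pattern.

    Assumed tokenization: lowercase the name and split it on every
    character that is not a letter or a digit.
    """
    terms = re.findall(r"[a-z0-9]+", name.lower())
    return any(fnmatchcase(term, pattern.lower()) for term in terms)


names = ["Replicase polyprotein 1ab",
         "RNA-directed RNA polymerase",
         "Long intergenic non-protein coding RNA 1006"]
for name in names:
    print(name, name_matches(name, "poly*"), name_matches(name, "?NA"))
```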


### Boolean expressions in queries
The Boolean operators OR and AND combine several terms in one search. They must be written in uppercase.

#### OR operator

In [11]:
# Find matches for the terms RNA or DNA.
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'DNA OR RNA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1792


Unnamed: 0,name,labels(node),score
0,DNA-directed DNA/RNA polymerase mu,[ProteinName],10.641387
1,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],9.090837
2,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],9.090837
3,DNA/RNA-binding protein KIN17,[ProteinName],9.090837
4,ATP-dependent DNA/RNA helicase DHX36,[ProteinName],8.30603
5,DNA-directed RNA polymerase II subunit C,[ProteinName],7.646928
6,DNA-directed RNA polymerase II subunit RPB3,[ProteinName],7.646928
7,DNA-directed RNA polymerase III subunit RPC1,[ProteinName],7.646928
8,DNA-directed RNA polymerase III largest subunit,[ProteinName],7.646928
9,DNA-directed RNA polymerase III subunit A,[ProteinName],7.646928


#### AND operator

In [12]:
# Find matches that contain the terms DNA and RNA
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'DNA AND RNA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 137


Unnamed: 0,name,labels(node),score
0,DNA-directed DNA/RNA polymerase mu,[ProteinName],10.641387
1,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],9.090837
2,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],9.090837
3,DNA/RNA-binding protein KIN17,[ProteinName],9.090837
4,ATP-dependent DNA/RNA helicase DHX36,[ProteinName],8.30603
5,DNA-directed RNA polymerase I subunit RPA1,[ProteinName],7.646928
6,DNA-directed RNA polymerase I largest subunit,[ProteinName],7.646928
7,DNA-directed RNA polymerase I subunit A,[ProteinName],7.646928
8,DNA-directed RNA polymerase I subunit RPA34,[ProteinName],7.646928
9,DNA-directed RNA polymerase I subunit G,[ProteinName],7.646928
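
The two operators can also be nested, because the Lucene query syntax supports grouping with parentheses. The sketch below builds such expressions from lists of terms. The helper name `combine` is hypothetical, and the nested query at the end is an illustration that was not run against the Knowledge Graph.

```python
def combine(terms, operator="OR"):
    """Join query terms with an uppercase Boolean operator.

    The result is wrapped in parentheses so that it can be used as a
    term inside another Boolean expression.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return "(" + f" {operator} ".join(terms) + ")"


either = combine(["DNA", "RNA"])                 # same terms as the OR example
both = combine(["DNA", "RNA"], operator="AND")   # same terms as the AND example
nested = combine([either, "polymerase"], operator="AND")
print(either)
print(both)
print(nested)
```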


### Keyword searches for geographic ids

In [13]:
# Find matches for the id "USA" (iso3 country code)
query = """
CALL db.index.fulltext.queryNodes('geoids', 'USA') YIELD node, score
RETURN node.name AS name, node.placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1


Unnamed: 0,name,node.placeName,labels(node),score
0,United States,,"[Location, Country]",2.330875


In [14]:
# Find matches for the id "CA"
query = """
CALL db.index.fulltext.queryNodes('geoids', 'CA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 3


Unnamed: 0,name,labels(node),score
0,Canada,"[Location, Country]",7.894792
1,California,"[Location, Admin1]",3.345964
2,Capellen,"[Location, Admin1]",3.345964


In [15]:
# Find matches that contain the ZIP code 90210
query = """
CALL db.index.fulltext.queryNodes('geoids', '90210') YIELD node, score
RETURN node.name AS name, node.placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1


Unnamed: 0,name,node.placeName,labels(node),score
0,90210,Beverly Hills,"[Location, PostalCode]",5.563916


In [16]:
# Find matches that contain the county FIPS code 073
query = """
CALL db.index.fulltext.queryNodes('geoids', '073') YIELD node, score
RETURN node.name AS name, node.fips, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 35


Unnamed: 0,name,node.fips,labels(node),score
0,Bokak Atoll,,"[Location, Admin1]",3.345964
1,Jayuya,,"[Location, Admin1]",3.345964
2,Pondera County,73.0,"[Location, Admin2]",2.655782
3,Owyhee County,73.0,"[Location, Admin2]",2.655782
4,Lincoln County,73.0,"[Location, Admin2]",2.655782
5,San Diego County,73.0,"[Location, Admin2]",2.655782
6,Marathon County,73.0,"[Location, Admin2]",2.655782
7,Jerauld County,73.0,"[Location, Admin2]",2.655782
8,Lawrence County,73.0,"[Location, Admin2]",2.655782
9,Orleans County,73.0,"[Location, Admin2]",2.655782
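
In the table above, the FIPS code is displayed as `73.0` rather than `073`. pandas stores a numeric column as floating point when some rows have no value, and the leading zero is not part of a number. The sketch below converts such a value back to a zero-padded string. The function name `county_fips` is hypothetical, and it assumes that county codes are three digits wide, as in the query.

```python
import math


def county_fips(value, width=3):
    """Format a numeric county FIPS code as a zero-padded string.

    Returns None for missing values (None or NaN).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(int(value)).zfill(width)


print(county_fips(73.0))
print(county_fips(float("nan")))
```

Applied to the result DataFrame, this could be used as `df['node.fips'].map(county_fips)`.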
