# Examples of Full-text queries in the COVID-19-Net Knowledge Graph
[Work in progress]

This notebook demonstrates how to run full-text [Lucene queries](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview) in the Knowledge Graph. 

The COVID-19-Net KG supports text searches for the following types of entities:
* bioentities: full-text query for biological entities (proteins, genes, domains, etc.)
* locations: full-text query for locations (cities, counties, states, countries, etc.)
* sequences: full-text search for protein sequences
* geoids: keyword (exact) query for geographic identifiers (zip codes, fips codes, iso codes, etc.)

Full-text queries have the following format:

```CALL db.index.fulltext.queryNodes('<type of entity>', '<text query>') YIELD node, score```

The queries return the matching nodes, and a score for each match (higher scores indicate better matches).
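
As a minimal sketch, the call format above can be wrapped in a small helper that passes the index name and the Lucene query text as Cypher parameters instead of pasting them into the query string. The helper name `fulltext_query` and its `limit` argument are hypothetical and not part of this notebook; the returned pair is meant to be handed to py2neo as `graph.run(cypher, params)`.

```python
# Hypothetical helper (not part of the original notebook): builds a
# parameterized full-text query for one of the indexes listed above.
VALID_INDEXES = {"bioentities", "locations", "sequences", "geoids"}


def fulltext_query(index, text, limit=None):
    """Return (cypher, params) for a full-text query on the given index."""
    if index not in VALID_INDEXES:
        raise ValueError(f"unknown full-text index: {index!r}")
    cypher = (
        "CALL db.index.fulltext.queryNodes($index, $text) YIELD node, score\n"
        "RETURN node.name AS name, labels(node) AS labels, score"
    )
    params = {"index": index, "text": text}
    if limit is not None:
        # Optional cap on the number of returned rows.
        cypher += "\nLIMIT $limit"
        params["limit"] = int(limit)
    return cypher, params


cypher, params = fulltext_query("bioentities", "Spike", limit=10)
print(cypher)
print(params)
```

Keeping the query text in a parameter avoids having to escape single quotes inside the Cypher string literal.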

[Query Syntax](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview)

[Learn more about full-text searches](https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html).

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import datetime
import pandas as pd
from py2neo import Graph

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

#### Connect to COVID-19-Net Knowledge Graph

In [3]:
graph = Graph("bolt://132.249.238.185:7687", user="reader", password="demo")

### Full-text Query for a term (word)
Results are ordered by match score ([TermsQuery](https://lucene.apache.org/core/5_5_5/queries/org/apache/lucene/queries/TermsQuery.html)).
Note that full-text queries are case-insensitive.
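
Plain words such as "Spike" can be used as-is, but identifiers that contain characters like `:` or `-` collide with the Lucene query syntax. The sketch below escapes the special characters listed in the Lucene classic query syntax with a backslash. The function name `escape_lucene_term` is hypothetical and not part of this notebook.

```python
# Characters with special meaning in the Lucene classic query syntax.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene_term(term):
    """Prefix every Lucene special character in `term` with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in term)


print(escape_lucene_term("SARS-CoV-2"))  # SARS\-CoV\-2
print(escape_lucene_term("pdb:4CR2"))    # pdb\:4CR2
```

If the escaped text is pasted into a Cypher string literal rather than passed as a query parameter, the backslashes themselves have to be doubled.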

In [4]:
# This query searches for the term "Spike" in the bioentities.
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'Spike') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 406


Unnamed: 0,name,labels(node),score
0,Spike glycoprotein,[Protein],5.255547
1,Spike glycoprotein,[Protein],5.255547
2,Spike glycoprotein,[Protein],5.255547
3,Spike glycoprotein,[Protein],5.255547
4,spike protein,[ProteinName],5.255547
5,Spike glycoprotein,[ProteinName],5.255547
6,Spike glycoprotein,[ProteinName],5.255547
7,Spike glycoprotein,[Protein],5.255547
8,Spike glycoprotein,[Protein],5.255547
9,Spike glycoprotein,[ProteinName],5.255547


### Search for an exact Phrase 
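For exact matches, enclose the phrase in **double** quotes ([PhraseQuery](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/PhraseQuery.html)). The surrounding Cypher string literal uses single quotes, so the two kinds of quotes do not clash.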
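
A minimal sketch of building such a phrase from arbitrary text, assuming that embedded double quotes and backslashes should be escaped with a backslash as in the Lucene query syntax. The function name `phrase_query` is hypothetical.

```python
def phrase_query(phrase):
    """Wrap `phrase` in double quotes, escaping embedded backslashes and quotes."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(phrase_query("Spike glycoprotein"))  # "Spike glycoprotein"
```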

In [5]:
# This query searches for the exact phrase "Spike glycoprotein" in the bioentities.
query = """
CALL db.index.fulltext.queryNodes('bioentities', '"Spike glycoprotein"') YIELD node
WHERE 'Protein' IN labels(node)
RETURN node.name AS name, node.start AS start, node.end AS end, node.fullLength AS fullLength, node.accession AS accession, node.proId AS proId,
       node.taxonomyId AS taxonomy, node.sequence AS sequence
ORDER BY taxonomy, start
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 14


Unnamed: 0,name,start,end,fullLength,accession,proId,taxonomy,sequence
0,Spike glycoprotein,1,1173,True,uniprot:P15423,,taxonomy:11137,MFVLLVAYALLHIAGCQTTNGLNTSYSVCNGCVGYSENVFAVESGG...
1,Spike glycoprotein,16,1173,False,uniprot:P15423,uniprot.chain:PRO_0000037203,taxonomy:11137,CQTTNGLNTSYSVCNGCVGYSENVFAVESGGYIPSDFAFNNWFLLT...
2,Spike glycoprotein,1,1353,True,uniprot:K9N5Q8,,taxonomy:1263720,MIHSVFLLMFLLTPTESYVDVGPDSVKSACIEVDIQQTFFDKTWPR...
3,Spike glycoprotein,18,1353,False,uniprot:K9N5Q8,uniprot.chain:PRO_0000422465,taxonomy:1263720,YVDVGPDSVKSACIEVDIQQTFFDKTWPRPIDVSKADGIIYPQGRT...
4,Spike glycoprotein,1,1273,True,uniprot:P0DTC2,,taxonomy:2697049,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...


### Node Property Query
Queries can be restricted to a specific node property name: `<node property>: <term>`.
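
The `<node property>: <term>` form can be sketched as a small string builder that quotes multi-word values so they are treated as a phrase. The function name `field_query` is hypothetical, and which properties are indexed is not stated here, so `scientificName` is the only property taken from this notebook.

```python
def field_query(field, value):
    """Build `<field>:<value>`, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        # Multi-word values are wrapped in double quotes (phrase query).
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{field}:{value}"


print(field_query("scientificName", "Homo sapiens"))  # scientificName:"Homo sapiens"
print(field_query("scientificName", "sapiens"))       # scientificName:sapiens
```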

In [6]:
# This query matches nodes with the node property 'scientificName'

query = """
CALL db.index.fulltext.queryNodes('bioentities', 'scientificName: "Homo sapiens"') YIELD node, score
RETURN node.name AS name, node.scientificName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(15)

Results: 1


Unnamed: 0,name,node.scientificName,labels(node),score
0,human,Homo sapiens,[Organism],2.801928


### Fuzzy Query
A fuzzy query finds approximate matches by adding a tilde `~` at the end of the query term
([FuzzyQuery](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/FuzzyQuery.html)).

In [7]:
# This query does a fuzzy search for the term "protease" in the bioentities. 
# It matches terms such as "proteinase" or "proteasome".
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'protease~') YIELD node, score
RETURN node.name AS name, node.description AS description, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(5)

Results: 4172


Unnamed: 0,name,description,labels(node),score
0,Proteasome,Proteasome subunit,[Domain],7.068081
1,Prostase,,[ProteinName],6.422506
2,pdb:4CR2,"PROTEASOME COMPONENT PRE3, PROTEASOME COMPONEN...",[Structure],5.389273
3,pdb:4CR3,"PROTEASOME COMPONENT PRE3, PROTEASOME COMPONEN...",[Structure],5.389273
4,pdb:4CR4,"PROTEASOME COMPONENT PRE3, PROTEASOME COMPONEN...",[Structure],5.389273


### Fuzzy query with Similarity Threshold
The similarity threshold [0.0 .. 1.0] can be specified in a fuzzy query (default 0.5).
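
One way to read the threshold is as an edit budget that scales with the length of the term. The sketch below assumes the behaviour of Lucene 5.x, where a fractional similarity is converted to a whole number of edits and capped at 2; the Lucene version behind this Knowledge Graph is not stated, so treat the formula as an assumption. The function name `similarity_to_edits` is hypothetical.

```python
MAX_EDITS = 2  # assumed upper bound on edits supported by Lucene fuzzy queries


def similarity_to_edits(min_similarity, term_length):
    """Approximate the edit budget implied by a similarity threshold in (0, 1)."""
    if not 0.0 < min_similarity < 1.0:
        raise ValueError("min_similarity must be strictly between 0.0 and 1.0")
    # Assumed conversion: fraction of the term that may differ, rounded down.
    return min(int((1.0 - min_similarity) * term_length), MAX_EDITS)


print(similarity_to_edits(0.8, len("antonio")))  # 1
```

Under this assumption, `antonio~0.8` allows a single edit, which is consistent with matches such as "Antonito", "Antonin", "Antonia", and "Antoni".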

In [8]:
# This query searches for the location "antonio" with a similarity threshold of 0.8.

query = """
CALL db.index.fulltext.queryNodes('locations', 'antonio~0.8') YIELD node, score
RETURN node.name AS name, node.placeName AS placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 393


Unnamed: 0,name,placeName,labels(node),score
0,81120,Antonito,"[Location, PostalCode]",4.358802
1,Saint-Antonin,,"[Location, City]",3.965481
2,Maria Antonia,,"[Location, City]",3.965481
3,Sant Antoni,,"[Location, City]",3.5902
4,Sankt Antoni,,"[Location, City]",3.5902


### Fuzzy Query with Edit Distance
Lucene supports fuzzy searches based on [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance). To do a fuzzy search, append the tilde symbol `~` to a single-word term. The maximum number of edits (substitutions, deletions, insertions, and transpositions of adjacent characters) can be specified after the tilde, e.g. `~3` to search for matches with up to 3 edits. Note that the Lucene query syntax documentation describes this value as ranging from 0 to 2, so larger values may effectively be capped at 2 edits.
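
To make the notion of "edits" concrete, the following sketch computes the optimal string alignment distance, a restricted variant of the Damerau-Levenshtein distance. It is a plain-Python illustration, not the automaton-based implementation Lucene uses; the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("SADAQSFLNGFAV", "SADASTFLNGFAV"))  # 2
print(osa_distance("protease", "proteasome"))          # 2
```

The two nsp11 sequences differ by two substitutions (QS -> ST), so they are within two edits of each other.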

In [9]:
# Find matches that contain sequences similar to SADAQSFLNGFAV with up to 3 edits (e.g. substitutions)
# This query returns two results:
#    exact match for nsp11 protein (SARS-CoV-2, taxonomyId:2697049) 
#    approximate match for nsp11 protein (SARS-CoV, taxonomyId:694009) with 2 substitutions QS -> ST

query = """
CALL db.index.fulltext.queryNodes('sequences', 'SADAQSFLNGFAV~3') YIELD node, score
RETURN node.name AS name, node.sequence, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 2


Unnamed: 0,name,node.sequence,node.accession,node.taxonomyId,labels(node),score
0,Non-structural protein 11,SADAQSFLNGFAV,uniprot:P0DTC1,taxonomy:2697049,[Protein],5.965485
1,Non-structural protein 11,SADASTFLNGFAV,uniprot:P0C6U8,taxonomy:694009,[Protein],5.047718


### Wildcard Query
Wildcards are used to search for alternate spellings and variations on a root word. A `*` matches zero or more characters and a `?` matches exactly one character
([WildcardQuery](https://lucene.apache.org/core/5_5_5/core/org/apache/lucene/search/WildcardQuery.html)).
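
The wildcard semantics can be tried out locally with Python's `fnmatch` module, which uses the same meaning for `*` and `?`. This is only an emulation on single, lowercased terms, assuming case-insensitive matching as noted earlier; unlike Lucene wildcards, `fnmatch` also interprets square brackets as character classes. The function name `wildcard_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, term):
    """Emulate a case-insensitive wildcard match against one whole term."""
    return fnmatchcase(term.lower(), pattern.lower())


print(wildcard_matches("poly*", "Polyoma_coat"))  # True
print(wildcard_matches("?NA", "DNA"))             # True
print(wildcard_matches("?NA", "mRNA"))            # False
```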

In [10]:
# This query matches terms that start with "poly". "*" matches zero or more characters.

query = """
CALL db.index.fulltext.queryNodes('bioentities', 'poly*') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 7814


Unnamed: 0,name,labels(node),score
0,Polysacc_deac_1,[Domain],2.0
1,Polyketide_cyc2,[Domain],2.0
2,polyprenyl_synt,[Domain],2.0
3,Polyoma_coat,[Domain],2.0
4,Polysacc_synt_2,[Domain],2.0


In [11]:
# This query matches three-character terms that end in "NA", such as RNA and DNA.
query = """
CALL db.index.fulltext.queryNodes('bioentities', '?NA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 14908


Unnamed: 0,name,labels(node),score
0,Rap1-DNA-bind,[Domain],2.0
1,PTS_2-RNA,[Domain],2.0
2,MerR-DNA-bind,[Domain],2.0
3,Arm-DNA-bind_3,[Domain],2.0
4,Arm-DNA-bind_1,[Domain],2.0
5,Cauli_DNA-bind,[Domain],1.0
6,Drc1-Sld2,[Domain],1.0
7,SLD3,[Domain],1.0
8,Methyltrn_RNA_4,[Domain],1.0
9,DMP12,[Domain],1.0


### Boolean expressions in queries
Boolean operators `OR` and `AND` can be specified in searches.
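
The difference between the two operators can be illustrated on a handful of names. The sketch below uses a rough stand-in for the index analyzer (lowercased alphanumeric tokens) and checks only the name, whereas the real index may also cover other node properties. The function names `tokens` and `matches` are hypothetical.

```python
import re


def tokens(text):
    """Rough stand-in for the index analyzer: lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def matches(text, terms, operator):
    """Return True if `text` satisfies `terms` combined with AND or OR."""
    found = [term.lower() in tokens(text) for term in terms]
    return all(found) if operator == "AND" else any(found)


names = [
    "DNA-directed DNA/RNA polymerase mu",
    "PTS_2-RNA",
    "Rap1-DNA-bind",
    "Spike glycoprotein",
]
print([n for n in names if matches(n, ["DNA", "RNA"], "OR")])
print([n for n in names if matches(n, ["DNA", "RNA"], "AND")])
```

Every `AND` match is also an `OR` match, which is why the `AND` query returns far fewer results.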

#### OR operator

In [12]:
# Find matches for the terms RNA or DNA.
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'DNA OR RNA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 14877


Unnamed: 0,name,labels(node),score
0,DNA-directed DNA/RNA polymerase mu,[Protein],5.980085
1,DNA-directed DNA/RNA polymerase mu,[ProteinName],5.980085
2,PTS_2-RNA,[Domain],5.875489
3,Rap1-DNA-bind,[Domain],5.48579
4,MerR-DNA-bind,[Domain],5.48579
5,Arm-DNA-bind_3,[Domain],5.414452
6,"DNA-directed RNA polymerase, mitochondrial",[Protein],5.257301
7,DNA/RNA-binding protein KIN17,[Protein],5.257301
8,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],5.257301
9,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],5.257301


#### AND operator

In [13]:
# Find matches that contain the terms DNA and RNA
query = """
CALL db.index.fulltext.queryNodes('bioentities', 'DNA AND RNA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1278


Unnamed: 0,name,labels(node),score
0,DNA-directed DNA/RNA polymerase mu,[Protein],5.980085
1,DNA-directed DNA/RNA polymerase mu,[ProteinName],5.980085
2,"DNA-directed RNA polymerase, mitochondrial",[Protein],5.257301
3,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],5.257301
4,"DNA-directed RNA polymerase, mitochondrial",[ProteinName],5.257301
5,DNA/RNA-binding protein KIN17,[ProteinName],5.257301
6,DNA/RNA-binding protein KIN17,[Protein],5.257301
7,"DNA-directed RNA polymerase, mitochondrial",[Protein],5.257301
8,pdb:4QIW,[Structure],4.875343
9,pdb:6KF3,[Structure],4.851152


### Keyword Query for Geographic Ids

In [14]:
# Find matches for the id "USA" (iso3 country code)
query = """
CALL db.index.fulltext.queryNodes('geoids', 'USA') YIELD node, score
RETURN node.name AS name, node.placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1


Unnamed: 0,name,node.placeName,labels(node),score
0,United States,,"[Location, Country]",2.330875


In [15]:
# Find matches for the id "CA"
query = """
CALL db.index.fulltext.queryNodes('geoids', 'CA') YIELD node, score
RETURN node.name AS name, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 3


Unnamed: 0,name,labels(node),score
0,Canada,"[Location, Country]",7.894792
1,California,"[Location, Admin1]",3.345964
2,Capellen,"[Location, Admin1]",3.345964


In [16]:
# Find matches that contain the ZIP code 90210
query = """
CALL db.index.fulltext.queryNodes('geoids', '90210') YIELD node, score
RETURN node.name AS name, node.placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 1


Unnamed: 0,name,node.placeName,labels(node),score
0,90210,Beverly Hills,"[Location, PostalCode]",5.563916


In [17]:
# Find matches that contain the county FIPS code 073
query = """
CALL db.index.fulltext.queryNodes('geoids', '073') YIELD node, score
RETURN node.name AS name, node.fips, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head(10)

Results: 35


Unnamed: 0,name,node.fips,labels(node),score
0,Bokak Atoll,,"[Location, Admin1]",3.345964
1,Jayuya,,"[Location, Admin1]",3.345964
2,Pondera County,73.0,"[Location, Admin2]",2.655782
3,Owyhee County,73.0,"[Location, Admin2]",2.655782
4,Lincoln County,73.0,"[Location, Admin2]",2.655782
5,San Diego County,73.0,"[Location, Admin2]",2.655782
6,Marathon County,73.0,"[Location, Admin2]",2.655782
7,Jerauld County,73.0,"[Location, Admin2]",2.655782
8,Lawrence County,73.0,"[Location, Admin2]",2.655782
9,Orleans County,73.0,"[Location, Admin2]",2.655782


### Range Query
Range queries match documents whose field values lie between the lower and upper bound specified by the query. Range queries can be inclusive or exclusive of the upper and lower bounds. Inclusive range queries are denoted by square brackets `[ ]`. Exclusive range queries are denoted by curly brackets `{ }`. Bounds are compared lexicographically, i.e. as text rather than as numbers.
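
Lexicographic comparison can give surprising results for values of different lengths. The sketch below emulates the comparison with Python string ordering; the function name `in_range` is hypothetical.

```python
def in_range(value, lower, upper, inclusive=True):
    """Lexicographic range test, as in `[lower TO upper]` or `{lower TO upper}`."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper


print(in_range("90233", "90200", "90250"))         # True
print(in_range("90250", "90200", "90250"))         # True (inclusive bounds)
print(in_range("90250", "90200", "90250", False))  # False (exclusive bounds)
print(in_range("9021", "90200", "90250"))          # True: compared as text
```

Fixed-width identifiers such as five-digit ZIP codes behave as expected, because text order and numeric order coincide for strings of equal length.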

In [18]:
# Find matches that contain the ZIP range 90200 to 90250
query = """
CALL db.index.fulltext.queryNodes('geoids', 'name: [90200 TO 90250]') YIELD node, score
RETURN node.name AS name, node.placeName, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 25


Unnamed: 0,name,node.placeName,labels(node),score
0,90201,Bell Gardens,"[Location, PostalCode]",1.0
1,90233,Culver City,"[Location, PostalCode]",1.0
2,90250,Hawthorne,"[Location, PostalCode]",1.0
3,90249,Gardena,"[Location, PostalCode]",1.0
4,90248,Gardena,"[Location, PostalCode]",1.0


### Protein Sequence Query
**NOTE: The Lucene full-text search indexes only the first 255 characters.** Sequences longer than 255 residues are truncated to 255 characters, and only the first 255 characters are used in the query.
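
Given the limit stated in the note, a query sequence can be normalized and shortened before it is sent. This is a minimal sketch; the function name `prepare_sequence_query` is hypothetical, and the value 255 is taken from the note above rather than verified against the index.

```python
MAX_QUERY_LENGTH = 255  # limit stated in the note above


def prepare_sequence_query(sequence, limit=MAX_QUERY_LENGTH):
    """Return (query_text, was_truncated) for a raw protein sequence."""
    # Remove whitespace and line breaks (e.g. from FASTA text) and uppercase.
    cleaned = "".join(sequence.split()).upper()
    return cleaned[:limit], len(cleaned) > limit


text, truncated = prepare_sequence_query("sadaqs flngfav")
print(text, truncated)  # SADAQSFLNGFAV False
```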

In [19]:
# Find matches that contain the sequence: SADAQSFLNGFAV
query = """
CALL db.index.fulltext.queryNodes('sequences', 'SADAQSFLNGFAV') YIELD node, score
RETURN node.name AS name, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 1


Unnamed: 0,name,node.accession,node.taxonomyId,labels(node),score
0,Non-structural protein 11,uniprot:P0DTC1,taxonomy:2697049,[Protein],5.965485


### Regular Expression Query
Lucene supports a limited number of queries with regular expressions ([RegExp](https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/automaton/RegExp.html)). Regular expressions are enclosed between two forward slashes `/regular expression/`. 

#### Matching variable characters
A `.` specifies a variable character in a term.
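
For simple patterns like this one, Python's `re` module can serve as a local check. The sketch assumes that a Lucene regular expression has to match the whole indexed term, which is emulated here with `re.fullmatch`; Lucene's RegExp dialect differs from Python's in other respects. The two test sequences are the nsp11 sequences shown elsewhere in this notebook.

```python
import re

# The two nsp11 sequences that appear in this notebook's results.
sequences = ["SADAQSFLNGFAV", "SADASTFLNGFAV"]

pattern = "SADA..FLNGFAV"  # each "." stands for exactly one residue
print([s for s in sequences if re.fullmatch(pattern, s)])

# A longer term that merely contains the pattern does not match as a whole.
print(re.fullmatch(pattern, "MSADAQSFLNGFAVK") is not None)  # False
```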

In [20]:
# Find matches for the sequence SADA..FLNGFAV with two variable residues.
query = """
CALL db.index.fulltext.queryNodes('sequences', '/SADA..FLNGFAV/') YIELD node, score
RETURN node.name AS name, node.sequence, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 2


Unnamed: 0,name,node.sequence,node.accession,node.taxonomyId,labels(node),score
0,Non-structural protein 11,SADAQSFLNGFAV,uniprot:P0DTC1,taxonomy:2697049,[Protein],1.0
1,Non-structural protein 11,SADASTFLNGFAV,uniprot:P0C6U8,taxonomy:694009,[Protein],1.0


#### Matching alternative characters
A list of alternative characters can be specified in brackets `[ ]`.

In [21]:
# Find matches that contain sequences with alternative residues Q or S at position 4 and S or T
# at position 5 (zero-based positions)

query = """
CALL db.index.fulltext.queryNodes('sequences', '/SADA[QS][ST]FLNGFAV/') YIELD node, score
RETURN node.name AS name, node.sequence, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 2


Unnamed: 0,name,node.sequence,node.accession,node.taxonomyId,labels(node),score
0,Non-structural protein 11,SADAQSFLNGFAV,uniprot:P0DTC1,taxonomy:2697049,[Protein],1.0
1,Non-structural protein 11,SADASTFLNGFAV,uniprot:P0C6U8,taxonomy:694009,[Protein],1.0


#### Matching an expression with multiple occurrences

The number of occurrences of an expression can be specified:

* `?` (zero or one occurrence)
* `*` (zero or more occurrences)
* `+` (one or more occurrences)
* `{n}` (n occurrences)
* `{n,}` (n or more occurrences)
* `{n,m}` (n to m occurrences, including both)
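
These quantifiers have the same meaning in Python's `re` module, so they can be tried out locally. As before, this is a sketch that assumes whole-term matching (emulated with `re.fullmatch`); the function name `regexp_matches` is hypothetical.

```python
import re


def regexp_matches(pattern, term):
    """Return True if `pattern` matches the whole `term`."""
    return re.fullmatch(pattern, term) is not None


seq = "SADAQSFLNGFAV"
print(regexp_matches("SADA.{2}FLNGFAV", seq))   # True: exactly two residues
print(regexp_matches("SADA.{3,}FLNGFAV", seq))  # False: fewer than three residues
print(regexp_matches(".*FLNG.*", seq))          # True: subsequence anywhere
print(regexp_matches("FLNG", seq))              # False: pattern must cover the term
```

The last two lines show why a subsequence search needs `.*` on both sides of the expression.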

In [22]:
# This query searches for the subsequence AVGACVLCNSQTSLR by specifying zero or more variable characters
# at the beginning and end of a string.

query = """
CALL db.index.fulltext.queryNodes('sequences', '/.*AVGACVLCNSQTSLR.*/') YIELD node, score
RETURN node.name AS name, node.sequence, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 4


Unnamed: 0,name,node.sequence,node.accession,node.taxonomyId,labels(node),score
0,Replicase polyprotein 1ab,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,uniprot:P0DTD1,taxonomy:2697049,[Protein],1.0
1,Helicase,AVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVN...,uniprot:P0C6X7,taxonomy:694009,[Protein],1.0
2,Replicase polyprotein 1ab,MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...,uniprot:P0C6X7,taxonomy:694009,[Protein],1.0
3,Helicase,AVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVN...,uniprot:P0DTD1,taxonomy:2697049,[Protein],1.0


In [23]:
# This query specifies zero or more variable characters at the beginning and end of an expression.

query = """
CALL db.index.fulltext.queryNodes('sequences', '/.*SADA[QS][ST]FLNGFAV.*/') YIELD node, score
RETURN node.name AS name, node.sequence, node.accession, node.taxonomyId, labels(node), score
"""
df = graph.run(query).to_data_frame()
print('Results:', df.shape[0])
df.head()

Results: 4


Unnamed: 0,name,node.sequence,node.accession,node.taxonomyId,labels(node),score
0,Replicase polyprotein 1a,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,uniprot:P0DTC1,taxonomy:2697049,[Protein],1.0
1,Non-structural protein 11,SADASTFLNGFAV,uniprot:P0C6U8,taxonomy:694009,[Protein],1.0
2,Replicase polyprotein 1a,MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...,uniprot:P0C6U8,taxonomy:694009,[Protein],1.0
3,Non-structural protein 11,SADAQSFLNGFAV,uniprot:P0DTC1,taxonomy:2697049,[Protein],1.0
