# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph. BioThings Explorer can answer two classes of queries -- "EXPLAIN" and "PREDICT". EXPLAIN queries are described in [EXPLAIN_demo.ipynb](EXPLAIN_demo.ipynb). Here, we describe PREDICT queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

PREDICT queries are designed to **predict plausible relationships between one entity and an entity class**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*"What drugs might be used to treat hyperphenylalaninemia?"*

**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/PREDICT_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite). This only needs to be done once, but the install command is included here for compatibility with [colab](https://colab.research.google.com/).

In [1]:
%%capture
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Next, import the relevant modules:

* **Hint**: Finds the bio-entity representation used in BioThings Explorer that corresponds to the user input (which can be any database ID, symbol, or name)
* **FindConnection**: Finds intermediate bio-entities that connect the user-specified input and output

In [2]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## Step 1: Find representation of "hyperphenylalaninemia" in BTE

In this step, BioThings Explorer translates our query string "hyperphenylalaninemia"  into BioThings objects, which contain mappings to many common identifiers.  Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm that using the identifiers shown.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [3]:
ht = Hint()
# find all potential representations of hyperphenylalaninemia
hyperphenylalaninemia_hint = ht.query("hyperphenylalaninemia")
# select the correct representation of hyperphenylalaninemia
hyperphenylalaninemia = hyperphenylalaninemia_hint['DiseaseOrPhenotypicFeature'][0]
hyperphenylalaninemia

{'mondo': 'MONDO:0016543',
 'umls': 'C0751435',
 'name': 'hyperphenylalaninemia',
 'display': 'mondo(MONDO:0016543) umls(C0751435) name(hyperphenylalaninemia) ',
 'type': 'DiseaseOrPhenotypicFeature',
 'primary': {'identifier': 'mondo',
  'cls': 'DiseaseOrPhenotypicFeature',
  'value': 'MONDO:0016543'}}
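
Before settling on index `[0]`, it can help to list every candidate that `Hint` returned. The sketch below is a minimal illustration that runs on a hand-built dictionary, not on a live query. It assumes the result is a dictionary keyed by entity class whose values are lists of objects shaped like the output above. The helper name `summarize_candidates`, the empty `Gene` list, and the second candidate are hypothetical and exist only for this example.

```python
# Hypothetical stand-in for the dictionary returned by ht.query(...).
# The first candidate copies the identifiers shown above; the second one
# is invented purely to illustrate a result with more than one candidate.
hint_result = {
    "DiseaseOrPhenotypicFeature": [
        {
            "name": "hyperphenylalaninemia",
            "primary": {
                "identifier": "mondo",
                "cls": "DiseaseOrPhenotypicFeature",
                "value": "MONDO:0016543",
            },
        },
        {
            "name": "example second candidate",
            "primary": {
                "identifier": "umls",
                "cls": "DiseaseOrPhenotypicFeature",
                "value": "C0000000",
            },
        },
    ],
    "Gene": [],
}


def summarize_candidates(hint_result):
    """Return (class, index, name, primary id) tuples for every candidate."""
    rows = []
    for cls, candidates in hint_result.items():
        for index, candidate in enumerate(candidates):
            primary = candidate.get("primary", {})
            rows.append((cls, index, candidate.get("name"), primary.get("value")))
    return rows


for row in summarize_candidates(hint_result):
    print(row)
```

Scanning the printed names and primary identifiers makes it easier to confirm that the first entry really is the intended entity before passing it to a query.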

## Step 2: Find drugs that are associated with genes involved in hyperphenylalaninemia

In this section, we find all paths in the knowledge graph that connect hyperphenylalaninemia to any entity that is a chemical compound.  To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months. 

The parameters for `FindConnection` are described below:


In [4]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples:
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes (

Here, we formulate a `FindConnection` query with "hyperphenylalaninemia" as the `input_obj` and the Biolink Entity class "ChemicalSubstance" as the `output_obj`. With the `intermediate_nodes` parameter, we further specify that we are looking for paths joining hyperphenylalaninemia and chemical compounds through *one* intermediate node that is a Gene. (The ability to search for longer reasoning paths that include additional intermediate nodes will be added shortly.)

In [5]:
fc = FindConnection(input_obj=hyperphenylalaninemia, output_obj='ChemicalSubstance', intermediate_nodes=['Gene'])

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, then assembling the results.

A verbose log of this process is displayed below:

In [6]:
# setting verbose=True displays every step that BTE takes to find the connection
fc.connect(verbose=True)


BTE will find paths that join 'hyperphenylalaninemia' and 'ChemicalSubstance'.                   Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene




==== Step #1: Query path planning ====

Because hyperphenylalaninemia is of type 'DiseaseOrPhenotypicFeature', BTE will query our meta-KG for APIs that can take 'DiseaseOrPhenotypicFeature' as input and 'Gene' as output

BTE found 5 apis:

API 1. scibite_disease2gene(1 API call)
API 2. biolink_disease2gene(1 API call)
API 3. semmeddisease(1 API call)
API 4. DISEASES(1 API call)
API 5. mydisease.info(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 5.1: http://mydisease.info/v1/query (POST "q=C0751436,C0751435&scopes=mondo.xrefs.umls,disgenet.xrefs.umls&fields=disgenet.genes_related_to_disease&species=human&size=100")
API 4.1: https://biothings.ncats.io/DISEASES/query (POST "q=MONDO:



==== Step #3: Output normalization ====

API 1.1 cordgene: 723 hits
API 6.1 dgidb_gene2chemical: 1 hits
API 6.2 dgidb_gene2chemical: 1 hits
API 6.3 dgidb_gene2chemical: 3 hits
API 6.4 dgidb_gene2chemical: No hits
API 6.5 dgidb_gene2chemical: 11 hits
API 6.6 dgidb_gene2chemical: 17 hits
API 6.7 dgidb_gene2chemical: No hits
API 6.8 dgidb_gene2chemical: 8 hits
API 6.9 dgidb_gene2chemical: No hits
API 6.10 dgidb_gene2chemical: 1 hits
API 6.11 dgidb_gene2chemical: 1 hits
API 6.12 dgidb_gene2chemical: 35 hits
API 6.13 dgidb_gene2chemical: No hits
API 6.14 dgidb_gene2chemical: No hits
API 6.15 dgidb_gene2chemical: 22 hits
API 6.16 dgidb_gene2chemical: 1 hits
API 4.1 mychem.info: 4 hits
API 4.2 mychem.info: 51 hits
API 4.3 mychem.info: 15 hits
API 3.1 opentarget: 15 hits
API 3.2 opentarget: No hits
API 3.3 opentarget: No hits
API 3.4 opentarget: No hits
API 3.5 opentarget: No hits
API 3.6 opentarget: No hits
API 3.7 opentarget: 15 hits
API 3.8 opentarget: No hits
API 3.9 opentarget: No hits
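
The planning step in the log above can be pictured as a lookup from (input type, output type) pairs to the APIs that serve them. The sketch below is a toy illustration of that idea only; it is not BTE's implementation. The `META_KG` dictionary and the `plan_path` function are hypothetical. The first entry copies the five APIs listed in the log, while the second entry groups API names that appear in the normalization log and the results table, and that grouping is an assumption.

```python
# Toy meta-KG: (input type, output type) -> names of APIs serving that hop.
META_KG = {
    # The five APIs reported by the planning step in the log above.
    ("DiseaseOrPhenotypicFeature", "Gene"): [
        "scibite_disease2gene",
        "biolink_disease2gene",
        "semmeddisease",
        "DISEASES",
        "mydisease.info",
    ],
    # Assumed grouping of API names seen elsewhere in this notebook's output.
    ("Gene", "ChemicalSubstance"): [
        "semmedgene",
        "cordgene",
        "dgidb_gene2chemical",
    ],
}


def plan_path(input_type, intermediate_types, output_type):
    """Return, for each hop along the path, the APIs that can serve it."""
    node_types = [input_type, *intermediate_types, output_type]
    plan = []
    # Pair each node type with the next one to get the individual hops.
    for source, target in zip(node_types, node_types[1:]):
        plan.append({"hop": (source, target), "apis": META_KG.get((source, target), [])})
    return plan


plan = plan_path("DiseaseOrPhenotypicFeature", ["Gene"], "ChemicalSubstance")
for step in plan:
    print(step["hop"], "->", len(step["apis"]), "candidate APIs")
```

Each hop would then be turned into one or more API calls, whose outputs feed the next hop. That is why the number of second-hop calls in the log grows with the number of genes found in the first hop.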


## Step 3: Display and Filter results
This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, all matching paths can be exported to a data frame. Let's examine a sample of those results.

In [7]:
df = fc.display_table_view()

# UMLS is not yet well integrated into our ID-to-object translation system,
# so drop rows whose output or intermediate node is identified only by a UMLS ID
# (raw strings keep "\d" from being read as a string escape)
umls_only = r"^umls:C\d+"
df = df[~df.output_id.str.contains(umls_only)]
df = df[~df.node1_id.str.contains(umls_only)]

# drop outputs whose only available name is a raw InChI string
inchi_name = r"^InChI="
df = df[~df.output_name.str.contains(inchi_name)]

df.sample(10)

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
40959,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,scibite,scibite_disease2gene,,Gene,GCHFR,entrez:2644,coexists_with,SEMMED API,semmedgene,2615775,ChemicalSubstance,SODIUM LAURYL SULFATE,chembl:CHEMBL23393
55741,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,1704167170416725107546,ChemicalSubstance,FLECAINIDE,chembl:CHEMBL652
2726,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,25260610,Gene,APP,entrez:351,positively_regulated_by,SEMMED API,semmedgene,27363580,ChemicalSubstance,DEFEROXAMINE,chembl:CHEMBL556
60077,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,treated_by,SEMMED API,semmeddisease,9634518,Gene,PAH,entrez:5053,associated_with,CORD API,cordgene,,ChemicalSubstance,cellobiose,pubchem:439178
37575,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,28794131,Gene,HSPA4,entrez:3308,positively_regulated_by,SEMMED API,semmedgene,7784180,ChemicalSubstance,CYCLOHEXIMIDE,chembl:CHEMBL123292
39356,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,25260610,Gene,APP,entrez:351,physically_interacts_with,SEMMED API,semmedgene,27392820,ChemicalSubstance,MELATONIN,chembl:CHEMBL45
59064,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,D-galactosaminate,pubchem:16738705
59473,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,10874306,Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,fulminate,chembl:CHEMBL2220234
50999,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,28016559,ChemicalSubstance,EXEMESTANE,chembl:CHEMBL1200374
41006,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,disgenet,mydisease.info,,Gene,PTS,entrez:5805,negatively_regulated_by,SEMMED API,semmedgene,1945499,ChemicalSubstance,SORBITOL,chembl:CHEMBL1682
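
The filters in the cell above rely on anchored regular expressions, and pandas' `str.contains` applies them with `re.search` semantics. The sketch below is a minimal, standard-library-only illustration of which rows those patterns remove. The helper `keep_row` is hypothetical, the first row copies values from the sample table, and the remaining rows are invented solely to trigger each filter.

```python
import re

# Same patterns as in the filtering cell, written as raw strings.
UMLS_ONLY = re.compile(r"^umls:C\d+")
INCHI_NAME = re.compile(r"^InChI=")


def keep_row(row):
    """Mirror the three notebook filters for one row given as a dict."""
    return not (
        UMLS_ONLY.search(row["output_id"])
        or INCHI_NAME.search(row["output_name"])
        or UMLS_ONLY.search(row["node1_id"])
    )


rows = [
    # Values copied from the sample table above.
    {"node1_id": "entrez:5805", "output_name": "FLECAINIDE", "output_id": "chembl:CHEMBL652"},
    # Hypothetical rows, invented to trigger each filter in turn.
    {"node1_id": "entrez:5805", "output_name": "example", "output_id": "umls:C0000000"},
    {"node1_id": "entrez:5805", "output_name": "InChI=1S/example", "output_id": "pubchem:0"},
    {"node1_id": "umls:C0000000", "output_name": "example", "output_id": "chembl:CHEMBL0"},
]

kept = [row for row in rows if keep_row(row)]
print([row["output_name"] for row in kept])
```

Because the patterns are anchored with `^`, an identifier that merely contains `umls:` somewhere in the middle would not be removed.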


While most results are based on edges from [semmed](https://skr3.nlm.nih.gov/SemMed/), edges from [DGIdb](http://www.dgidb.org/), [biolink](https://monarchinitiative.org/), [disgenet](http://www.disgenet.org/), and [mychem.info](https://mychem.info) were also retrieved from their respective APIs. 

Next, let's examine which chemical compounds appear in the largest number of paths.

In [8]:
df.output_name.value_counts().head(20)

DEXTROSE               88
Linear DNA duplexes    85
HYDROGEN PEROXIDE      54
NITRIC OXIDE           54
CHOLESTEROL            43
NOREPINEPHRINE         37
4301                   35
HYDROCORTISONE         31
797752                 29
UREA                   28
ALCOHOL                25
SODIUM CHLORIDE        24
DOPAMINE               23
OXYGEN                 22
Ferrocenyl Sugar       22
SAPROPTERIN            21
CISPLATIN              21
ESTROGEN               19
CHEBI:24433            19
DB01109                18
Name: output_name, dtype: int64

Hyperphenylalaninemia is a condition characterized by elevated levels of phenylalanine in the blood. This phenotype is strongly associated with phenylketonuria (PKU), an inherited disorder that affects the ability to metabolize phenylalanine. **Sapropterin** is a naturally occurring cofactor associated with several enzymatic processes, including the metabolism of phenylalanine to tyrosine. It has been [FDA-approved](https://www.centerwatch.com/drug-information/fda-approved-drugs/drug/972/kuvan-sapropterin-dihydrochloride) as a treatment for PKU patients. Tyrosine is also a precursor to several neurotransmitters, including **norepinephrine** and **dopamine**.

While several of the listed compounds would be considered positive controls, others on the list could be viewed as **testable hypotheses and discovery opportunities** to be evaluated by domain experts. For example, let's examine which genes link hyperphenylalaninemia to nitric oxide.

In [9]:
df[df.output_name=="NITRIC OXIDE"]

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
39767,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39768,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,disgenet,mydisease.info,,Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39770,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,10874306,Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39771,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,treated_by,SEMMED API,semmeddisease,8841415,Gene,PTS,entrez:5805,associated_with,CORD API,cordgene,,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39772,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,15202319,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39773,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,disgenet,mydisease.info,,Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,15202319,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39775,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,10874306,Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,15202319,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39776,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,treated_by,SEMMED API,semmeddisease,8841415,Gene,PTS,entrez:5805,coexists_with,SEMMED API,semmedgene,15202319,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39777,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,"10585341,11388593,11694255,25525159,7493990,22...",Gene,PTS,entrez:5805,positively_regulated_by,SEMMED API,semmedgene,22919591,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689
39778,hyperphenylalaninemia,DiseaseOrPhenotypicFeature,associatedWith,disgenet,mydisease.info,,Gene,PTS,entrez:5805,positively_regulated_by,SEMMED API,semmedgene,22919591,ChemicalSubstance,NITRIC OXIDE,chembl:CHEMBL1200689


BTE identifies a number of genes, including [HSPA4](https://www.ncbi.nlm.nih.gov/gene/3308) and [GCH1](https://www.ncbi.nlm.nih.gov/gene/2643), as being the key nodes that join hyperphenylalaninemia with nitric oxide.
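
Each row in the table is one combination of a first-hop edge and a second-hop edge, so a single gene can contribute many rows to a compound's count (all ten nitric oxide rows shown above pass through PTS). A complementary ranking counts distinct intermediate genes per compound. The sketch below is a minimal illustration in plain Python using a few (gene, compound) pairs taken from the tables in this notebook; the `paths` list is a hand-picked subset chosen for the example, not the full result set.

```python
from collections import Counter, defaultdict

# (intermediate gene, output compound) pairs, one per path.
# Pairs are taken from the tables above; repeats stand for paths that
# differ only in their predicates or sources.
paths = [
    ("PTS", "NITRIC OXIDE"),
    ("PTS", "NITRIC OXIDE"),
    ("PTS", "NITRIC OXIDE"),
    ("APP", "MELATONIN"),
    ("APP", "DEFEROXAMINE"),
    ("PTS", "FLECAINIDE"),
]

# Ranking used in the notebook: number of paths (rows) per compound.
row_counts = Counter(compound for _, compound in paths)

# Alternative ranking: number of distinct genes linking to each compound.
genes_by_compound = defaultdict(set)
for gene, compound in paths:
    genes_by_compound[compound].add(gene)
gene_counts = {compound: len(genes) for compound, genes in genes_by_compound.items()}

print(row_counts["NITRIC OXIDE"], gene_counts["NITRIC OXIDE"])  # prints: 3 1
```

On the full data frame, a comparable count could likely be obtained with `df.groupby("output_name")["node1_name"].nunique()`, which would show whether a highly ranked compound is supported by many genes or by many edges through a single gene.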

## Conclusions and caveats

This notebook demonstrated the use of BioThings Explorer in PREDICT mode to identify chemical compounds related to hyperphenylalaninemia. BioThings Explorer autonomously queried a **distributed knowledge graph of biomedical APIs** to find the most common compounds, and several of the top hits have clear biochemical and clinical relevance to hyperphenylalaninemia.

There are still many areas for improvement (and some areas in which BioThings Explorer is still buggy).  And of course, BioThings Explorer is dependent on the accessibility of the APIs that comprise the distributed knowledge graph.  Nevertheless, we encourage users to try other variants of the PREDICT queries demonstrated in this notebook.