# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed biomedical knowledge graph.

BioThings Explorer can answer two classes of queries -- "EXPLAIN" and "PREDICT". EXPLAIN queries are described in [EXPLAIN_demo.ipynb](EXPLAIN_demo.ipynb). Here, we describe PREDICT queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

PREDICT queries are designed to **predict plausible relationships between one specific biomedical entity (an object or a concept) and one biomedical entity type**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*"What drugs might be used to treat hyperphenylalaninemia?"*

**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/PREDICT_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite).  This only needs to be done once (but it's included here for compatibility with [Colab](https://colab.research.google.com/)).

In [1]:
%%capture
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Next, import the relevant modules:

* **Hint**: Find corresponding BioThings representations to use in BioThings Explorer based on user input (could be any database IDs, symbols, names)
* **FindConnection**: Find the relationship(s) between a specific entity and an entity type (PREDICT) or between two specific entities (EXPLAIN)

In [1]:
## import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

In [2]:
## functions to add to modules?
def hint_display(query, hint_result):
    """
    Show the type, name, and number of IDs for all results returned by the query.

    :param query: string used in the hint query
    :param hint_result: object returned from the hint query, a dictionary of lists of dictionaries

    Returns: None
    """
    ## function needs to be rewritten if it's going to give the exact index of each object within its type
    display = ['type', 'name']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## skip types that returned no objects
            for item in result:
                ## number of identifiers per object: number of keys - 4 (name, primary, display, type)
                num_ids = len(item) - 4
                concise_results.append((item[display[0]], item[display[1]], str(num_ids)))

    print('There are {total} BioThings objects returned for {ht}:'.format(
        total=len(concise_results), ht=query))
    for display_info in concise_results:
        print('{0}, {1}, num of IDs: {2}'.format(*display_info))

## Step 1: Find representation of "hyperphenylalaninemia" in BTE

In this step, BioThings Explorer translates our query string "hyperphenylalaninemia"  into BioThings objects, which contain mappings to many common identifiers.  Generally, the top result returned by the `Hint` module will match what you want, but you should confirm that using the identifiers shown. 

BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. 

**Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [3]:
ht = Hint()
object1 = 'hyperphenylalaninemia'
## find all potential representations of hyperphenylalaninemia
hyperphenylalaninemia_hint = ht.query(object1)

hint_display(object1, hyperphenylalaninemia_hint)

There are 11 BioThings objects returned for hyperphenylalaninemia:
Disease, mild hyperphenylalaninemia, num of IDs: 2
Disease, Transient hyperphenylalaninemia, num of IDs: 2
Disease, Maternal hyperphenylalaninemia, num of IDs: 2
Disease, Atypical hyperphenylalaninemia, num of IDs: 2
Disease, hyperphenylalaninemia due to tetrahydrobiopterin deficiency, num of IDs: 3
PhenotypicFeature, Hyperphenylalaninemia, num of IDs: 3
PhenotypicFeature, Atypical hyperphenylalaninemia, num of IDs: 2
PhenotypicFeature, Transient hyperphenylalaninemia, num of IDs: 2
PhenotypicFeature, Maternal hyperphenylalaninemia, num of IDs: 2
Pathway, Hyperphenylalaninemia due to dhpr-deficiency, num of IDs: 0
Pathway, Hyperphenylalaninemia due to 6-pyruvoyltetrahydropterin synthase deficiency (ptps), num of IDs: 0


Based on the information above, we'll pick the top Disease choice (indexed at 0) for hyperphenylalaninemia for our query. We can look at identifier mappings inside this BioThings object. 

In [4]:
## select the first Disease object for our hyperphenylalaninemia query
hyperphenylalaninemia = hyperphenylalaninemia_hint['Disease'][0]
hyperphenylalaninemia

{'MONDO': 'MONDO:0019335',
 'name': 'mild hyperphenylalaninemia',
 'ORPHANET': '79651',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0019335'},
 'display': 'MONDO(MONDO:0019335) ORPHANET(79651) name(mild hyperphenylalaninemia)',
 'type': 'Disease'}
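
Because the top hit is not guaranteed to be the entity you intended, it can help to select a hint by its primary identifier rather than by position. The following is a minimal sketch, not part of BioThings Explorer: `pick_hint` is a hypothetical helper, and `mock_hint` is a hand-written stand-in that mirrors the structure of the object shown above so the example runs without network access.

```python
def pick_hint(hint_result, bt_type, expected_primary=None):
    """Return a hint of `bt_type`, optionally matching a primary identifier.

    Hypothetical helper for illustration; not a BioThings Explorer function.
    """
    candidates = hint_result.get(bt_type) or []
    if not candidates:
        raise ValueError("no {} hints returned".format(bt_type))
    if expected_primary is None:
        # Same behaviour as indexing with [0], but with a clearer error above.
        return candidates[0]
    for candidate in candidates:
        if candidate["primary"]["value"] == expected_primary:
            return candidate
    raise ValueError("no {} hint with primary ID {}".format(bt_type, expected_primary))


# Hand-written stand-in for the output of Hint().query(...), trimmed to one hit.
mock_hint = {
    "Disease": [
        {
            "MONDO": "MONDO:0019335",
            "name": "mild hyperphenylalaninemia",
            "ORPHANET": "79651",
            "primary": {"identifier": "MONDO", "cls": "Disease", "value": "MONDO:0019335"},
            "display": "MONDO(MONDO:0019335) ORPHANET(79651) name(mild hyperphenylalaninemia)",
            "type": "Disease",
        }
    ],
    "Pathway": [],
}

chosen = pick_hint(mock_hint, "Disease", expected_primary="MONDO:0019335")
print(chosen["name"])
```

With a real query result, the same call would take `hyperphenylalaninemia_hint` in place of `mock_hint`, assuming its structure matches the output shown above.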

## Step 2: Find drugs associated with the genes involved in hyperphenylalaninemia

In this section, we find paths in the knowledge graph that connect hyperphenylalaninemia to chemical compounds (nodes with the BioThings type `ChemicalSubstance`).

To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months. 

The parameters for `FindConnection` are described below:

In [5]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples:
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes (

Here, we formulate a `FindConnection` query with "hyperphenylalaninemia" as the `input_obj` and "ChemicalSubstance" (a BioThings type) as the `output_obj`.  With the `intermediate_nodes` parameter, we further specify that we are looking for paths joining hyperphenylalaninemia and chemical compounds through *one* intermediate node of the BioThings type "Gene".

In [6]:
fc = FindConnection(input_obj=hyperphenylalaninemia,
                    output_obj='ChemicalSubstance',
                    intermediate_nodes=['Gene'])

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, and then assembling the results.
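
As a rough mental model of the planning step, the sketch below looks up, for each hop in the path, which APIs can translate one node type into the next. It is a toy illustration under stated assumptions, not BTE's actual planner: `META_KG` and `plan_path` are hypothetical, the Disease -> Gene API names are the ones reported in the verbose log in this notebook, and the remaining API names are invented placeholders.

```python
# Toy meta-KG: (input type, output type, API name).
# The Disease -> Gene names come from the verbose log in this notebook;
# the "example_*" names are invented placeholders.
META_KG = [
    ("Disease", "Gene", "scigraph"),
    ("Disease", "Gene", "biolink"),
    ("Disease", "Gene", "pharos"),
    ("Disease", "Gene", "hetio"),
    ("Disease", "Gene", "scibite"),
    ("Gene", "ChemicalSubstance", "example_gene_chem_api_1"),
    ("Gene", "ChemicalSubstance", "example_gene_chem_api_2"),
    ("Gene", "Pathway", "example_gene_pathway_api"),
]


def plan_path(input_type, intermediate_types, output_type, meta_kg=META_KG):
    """Return, for each hop in the path, the APIs that can serve it."""
    node_types = [input_type] + list(intermediate_types) + [output_type]
    plan = []
    # Pair each node type with the next one to get the hops of the path.
    for source, target in zip(node_types, node_types[1:]):
        apis = [api for (src, tgt, api) in meta_kg if src == source and tgt == target]
        plan.append({"hop": (source, target), "apis": apis})
    return plan


plan = plan_path("Disease", ["Gene"], "ChemicalSubstance")
for step in plan:
    print(step["hop"], "->", step["apis"])
```

Execution would then call every API listed for a hop and feed the returned intermediate nodes into the next hop.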

A verbose log of the `connect` call is displayed below:

In [7]:
## setting verbose=True displays every step BTE takes to find the connection
fc.connect(verbose = True)


BTE will find paths that join 'mild hyperphenylalaninemia' and 'ChemicalSubstance'.                   Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene




==== Step #1: Query path planning ====

Because mild hyperphenylalaninemia is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'Gene' as output

BTE found 5 apis:

API 1. scigraph(1 API call)
API 2. biolink(1 API call)
API 3. pharos(1 API call)
API 4. hetio(1 API call)
API 5. scibite(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 2.1: https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0019335/genes?rows=200
API 1.1: https://automat.renci.org/cord19-scigraph/disease/gene/MONDO:0019335
API 3.1: https://automat.renci.org/pharos/disease/gene/MONDO:0019335
API 5.1: https://automat.renci.org/cord19-scibite/disease/gene/MONDO:001

## Step 3: Display and filter results

This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, results can be exported to a data frame. Let's examine a sample of those results.

In [8]:
df = fc.display_table_view()

## because UMLS is not currently well-integrated in our ID-to-object
##      translation system, removing UMLS-only outputs here
## (raw strings keep "\d" from being read as a string escape)
patternDel = r"^UMLS:C\d+"
umls_output = df.output_id.str.contains(patternDel)
df = df[~umls_output]

## remove outputs whose name is a raw InChI string
patternDel2 = r"^InChI="
inchi_name = df.output_name.str.contains(patternDel2)
df = df[~inchi_name]

## remove rows whose intermediate node has only a UMLS identifier
patternDel3 = r"^umls:C\d+"
umls_node = df.node1_id.str.contains(patternDel3)
df = df[~umls_node]

# df[df.output_id.str.contains(patternDel)]
print(df.shape)
df.sample(10)

(289, 16)


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
343,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,AP24534,name:AP24534
330,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,GLYOXYLATE,name:GLYOXYLATE
153,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,physically_interacts_with,SEMMED,SEMMED Gene API,11386853.0,ChemicalSubstance,GLYCERIN,name:GLYCERIN
321,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,PYRROLIDINE DITHIOCARBAMATE,name:PYRROLIDINE DITHIOCARBAMATE
163,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,positively_regulated_by,SEMMED,SEMMED Gene API,10964764.0,ChemicalSubstance,OCHRATOXINS,name:OCHRATOXINS
59,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,CHEBI:50904,CHEBI:CHEBI:50904
209,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,negatively_regulated_by,SEMMED,SEMMED Gene API,10839989.0,ChemicalSubstance,BIOPTERIN,name:BIOPTERIN
244,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,scigraph,Automat CORD19 Scigraph API,,ChemicalSubstance,PYRENE,name:PYRENE
58,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,CHEBI:50112,CHEBI:CHEBI:50112
42,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,scigraph,Automat CORD19 Scigraph API,,ChemicalSubstance,CHEBI:76413,CHEBI:CHEBI:76413


Notice that each result gives the knowledge source (API), and many also give the literature source (PubMed ID).

Next, let's examine which chemical compounds were most frequently mentioned.

In [9]:
df.output_name.value_counts().head(10)

SAPROPTERIN          4
(R)-NORADRENALINE    4
CORTISOL             3
PTERINS              3
DIOXYGEN             3
(R)-ADRENALINE       3
WATER                3
BENZO(A)PYRENE       2
CHEBI:24632          2
CHEBI:59659          2
Name: output_name, dtype: int64

*Note: as of 2020-08-13, the notebook output above matches the description below. As knowledge sources (APIs) and code are updated and added, users may obtain different results.*

Hyperphenylalaninemia is a condition characterized by elevated levels of phenylalanine in the blood. This phenotype is strongly associated with phenylketonuria (PKU), an inherited genetic disorder that affects the ability to metabolize phenylalanine.  **Sapropterin** is a naturally occurring cofactor associated with several enzymatic processes, including the metabolism of phenylalanine to tyrosine. It has been [FDA-approved](https://www.centerwatch.com/drug-information/fda-approved-drugs/drug/972/kuvan-sapropterin-dihydrochloride) as a treatment for PKU patients.  Tyrosine is also a precursor to several neurotransmitters, including **norepinephrine** and dopamine.

While several of the listed compounds would be considered positive controls, others on the list could be viewed as **testable hypotheses and discovery opportunities** to be evaluated by domain experts. For example, let's examine which genes link hyperphenylalaninemia to nitric oxide.

In [10]:
df[df.output_name == "NITRIC OXIDE"]

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
328,MILD HYPERPHENYLALANINEMIA,Disease,related_to,,BioLink API,,Gene,PAH,NCBIGene:5053,related_to,Translator Text Mining Provider,CORD Gene API,,ChemicalSubstance,NITRIC OXIDE,name:NITRIC OXIDE


*Note: as of 2020-08-13, the notebook output above matches the description below. As knowledge sources (APIs) and code are updated and added, users may obtain different results.*

BTE identifies [PAH](https://www.ncbi.nlm.nih.gov/gene/5053) as being a gene connecting hyperphenylalaninemia and nitric oxide.
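
The same question can be asked for every compound at once by grouping rows by output and collecting the supporting genes, APIs, and PubMed IDs. The sketch below is one way to do that with plain Python; `summarize_support` is a hypothetical helper, and `rows` is a hand-copied subset of the results table in this notebook, with PubMed IDs written as strings for simplicity.

```python
from collections import defaultdict

# Rows copied by hand from the results table in this notebook,
# trimmed to the four columns used here.
rows = [
    {"node1_name": "PAH", "pred2_api": "SEMMED Gene API",
     "pred2_pubmed": "10839989", "output_name": "BIOPTERIN"},
    {"node1_name": "PAH", "pred2_api": "CORD Gene API",
     "pred2_pubmed": None, "output_name": "NITRIC OXIDE"},
    {"node1_name": "PAH", "pred2_api": "SEMMED Gene API",
     "pred2_pubmed": "11386853", "output_name": "GLYCERIN"},
]


def summarize_support(rows):
    """Group rows by output compound and collect the evidence behind each."""
    summary = defaultdict(lambda: {"genes": set(), "apis": set(), "pubmed": set()})
    for row in rows:
        entry = summary[row["output_name"]]
        entry["genes"].add(row["node1_name"])
        entry["apis"].add(row["pred2_api"])
        if row["pred2_pubmed"]:  # skip rows without a literature reference
            entry["pubmed"].add(row["pred2_pubmed"])
    return dict(summary)


support = summarize_support(rows)
for compound, evidence in sorted(support.items()):
    print(compound, sorted(evidence["genes"]), sorted(evidence["apis"]))
```

Compounds with an empty `pubmed` set, such as nitric oxide in this subset, are supported only by the API-level assertion and may deserve closer review.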

## Conclusions and caveats

This notebook demonstrates the use of BioThings Explorer (PREDICT queries) to identify chemical compounds related to hyperphenylalaninemia. BioThings Explorer autonomously queried a **distributed knowledge graph of biomedical APIs** to find the most common compounds, and several of the top hits had a clear biochemical and clinical relevance to hyperphenylalaninemia.

There are still many areas for improvement (and some areas in which BioThings Explorer is still buggy).  And of course, BioThings Explorer is dependent on the accessibility of the APIs that comprise the distributed knowledge graph.  Nevertheless, we encourage users to try other variants of the PREDICT queries demonstrated in this notebook.

## More query information for users

*Note: as of 2020-08-13, the information below is valid.*

The input into the Hint query should be one string with **at least one whole word.** 

Currently, the user must set the `input_obj` parameter of `FindConnection()` to **one specific BioThings object** (e.g., one chemical, phenotype, disease, or GO term). To do a PREDICT query, the `output_obj` parameter of `FindConnection()` must be **one BioThings type** (the `FindConnection` docstring shown in Step 2 also lists a list of types as accepted).

For a PREDICT query, BTE will accept one or more intermediate nodes (for example, with two intermediate nodes the chain of queries is A -> B -> C -> D). **However, having more than one intermediate node currently leads to slow performance and an overwhelming number of results, so it is discouraged.**

Intermediate nodes cannot be specific biomedical entities. They must be BioThings type(s), None, or an empty list ([]). 

The `intermediate_nodes` parameter can be a list or `None`. **`None` or an empty list will tell BTE to look for direct connections (no intermediate nodes).** For a non-empty list, each element describes one node and is either a string (one acceptable entity type) or a tuple of strings (several acceptable types).
* `[('AnatomicalEntity', 'CellularComponent')]` means one intermediate node that can be an anatomical entity or a cellular component (GO term).
* `[('AnatomicalEntity', 'CellularComponent'), 'CellularComponent']` means two intermediate nodes, so the chain of queries is A -> B -> C -> D. The first intermediate node (B) can be either of the two types, while the second intermediate node (C) MUST be a cellular component (GO term).
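
The rules above can be made concrete with a small sketch that turns an intermediate-node specification into the chain it describes. `describe_path` is a hypothetical helper written for this notebook; it mirrors the rules as stated here and is not how BTE itself validates the parameter.

```python
def describe_path(intermediate_nodes):
    """Summarize how a list of intermediate nodes would be read.

    Hypothetical helper for illustration; not part of BioThings Explorer.
    """
    nodes = intermediate_nodes or []  # None and [] both mean a direct query
    allowed = []
    for node in nodes:
        if isinstance(node, str):
            allowed.append((node,))  # a single acceptable type
        elif isinstance(node, tuple) and all(isinstance(t, str) for t in node):
            allowed.append(node)  # several acceptable types for one node
        else:
            raise TypeError("each node must be a string or a tuple of strings")
    labels = ["input_obj"] + [" | ".join(types) for types in allowed] + ["output_obj"]
    return " -> ".join(labels)


print(describe_path(None))
print(describe_path([("AnatomicalEntity", "CellularComponent")]))
print(describe_path([("AnatomicalEntity", "CellularComponent"), "CellularComponent"]))
```

Each `|` in the printed chain separates the acceptable types for a single node, and each `->` is one hop in the query.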

BTE will query from input_obj -> intermediate and from intermediate -> output_obj and then join results based on shared intermediates. The results should therefore be read as input_obj -> intermediate -> output_obj. 
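
The join described above behaves like an inner join on the intermediate node: only intermediates that appear in both hops produce a path. The sketch below illustrates this with toy edge lists and a hypothetical `join_hops` function. The PAH edges are taken from the results table in this notebook, while `GENE_X` and `GENE_Y` are invented to show what gets dropped.

```python
from collections import defaultdict

# Toy edge lists as (subject, predicate, object) triples.
# PAH edges come from the results table; GENE_X and GENE_Y are invented.
first_hop = [
    ("MILD HYPERPHENYLALANINEMIA", "related_to", "PAH"),
    ("MILD HYPERPHENYLALANINEMIA", "related_to", "GENE_X"),
]
second_hop = [
    ("PAH", "negatively_regulated_by", "BIOPTERIN"),
    ("PAH", "related_to", "NITRIC OXIDE"),
    ("GENE_Y", "related_to", "WATER"),
]


def join_hops(first_hop, second_hop):
    """Join two edge lists on the node they share (an inner join)."""
    by_intermediate = defaultdict(list)
    for intermediate, pred2, output in second_hop:
        by_intermediate[intermediate].append((pred2, output))
    paths = []
    for input_name, pred1, intermediate in first_hop:
        # Intermediates with no outgoing edge in the second hop yield no path.
        for pred2, output in by_intermediate.get(intermediate, []):
            paths.append((input_name, pred1, intermediate, pred2, output))
    return paths


paths = join_hops(first_hop, second_hop)
for path in paths:
    print(" -> ".join(path))
```

Here `GENE_X` has no second-hop edge and `GENE_Y` has no first-hop edge, so neither appears in the joined paths.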

If **an intermediate node has a tuple of acceptable types**, BTE will look for knowledge sources that can handle at least one of the types listed.   