# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph. BioThings Explorer can answer two classes of queries: "PREDICT" and "EXPLAIN". PREDICT queries are described in [PREDICT_demo.ipynb](PREDICT_demo.ipynb). Here, we describe EXPLAIN queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

EXPLAIN queries are designed to **identify plausible reasoning chains to explain the relationship between two entities**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"*Why does hydroxychloroquine have an effect on ACE2?*"  



**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/EXPLAIN_ACE2_hydroxychloroquine_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite).  This only needs to be done once; the install command is included here for compatibility with [colab](https://colab.research.google.com/).

In [1]:
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Collecting biothings_explorer from git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer
  Cloning https://github.com/biothings/biothings_explorer to /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-install-87l7sgk0/biothings-explorer
  Running command git clone -q https://github.com/biothings/biothings_explorer /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-install-87l7sgk0/biothings-explorer
  Running command git submodule update --init --recursive -q
Building wheels for collected packages: biothings-explorer
  Building wheel for biothings-explorer (setup.py) ... done
  Stored in directory: /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-ephem-wheel-cache-mp83lp3v/wheels/61/44/e1/901cb798059240028e8e2b5d8ed46a47aafa11af30a20c465a
Successfully built biothings-explorer
Installing collected packages: biothings-explorer
Successfully installed biothings-explorer-0.0.1
You should consider upgrading via the 'pip install

Next, import the relevant modules:

* **Hint**: Finds the bio-entity representation used in BioThings Explorer that corresponds to the user input (any database ID, symbol, or name)
* **FindConnection**: Finds intermediate bio-entities that connect a user-specified input and output

In [1]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## Step 1: Find representation of "ACE2" and "hydroxychloroquine" in BTE

In this step, BioThings Explorer translates our query strings "ACE2" and "hydroxychloroquine" into BioThings objects, which contain mappings to many common identifiers.  Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm this using the identifiers shown.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [2]:
ht = Hint()
# find all potential representations of ACE2
ace2_hint = ht.query("ACE2")
# select the correct representation of ACE2
ace2 = ace2_hint['Gene'][0]
ace2

{'NCBIGene': '59272',
 'name': 'angiotensin I converting enzyme 2',
 'SYMBOL': 'ACE2',
 'UMLS': 'C1422064',
 'HGNC': '13557',
 'UNIPROTKB': 'Q9BYF1',
 'ENSEMBL': 'ENSG00000130234',
 'primary': {'identifier': 'NCBIGene', 'cls': 'Gene', 'value': '59272'},
 'display': 'NCBIGene(59272) ENSEMBL(ENSG00000130234) HGNC(13557) UMLS(C1422064) UNIPROTKB(Q9BYF1) SYMBOL(ACE2)',
 'type': 'Gene'}
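The confirmation step can be made explicit by comparing the returned object against identifiers you already trust. The sketch below is not part of BioThings Explorer: `confirm_identifiers` is a hypothetical helper, and `ace2_example` is a hand-copied subset of the output above so that the example runs without network access.

```python
def confirm_identifiers(entity, expected):
    """Return the expected identifiers that are missing or different in `entity`.

    `entity` is a dict shaped like one item returned by Hint (see the output above).
    `expected` maps identifier names to the values you trust.
    An empty dict means every expected identifier matched.
    """
    mismatches = {}
    for key, value in expected.items():
        if entity.get(key) != value:
            # Record what was actually found (None if the key is absent).
            mismatches[key] = entity.get(key)
    return mismatches


# Hand-copied subset of the Hint output shown above (assumption: same shape).
ace2_example = {
    "NCBIGene": "59272",
    "SYMBOL": "ACE2",
    "HGNC": "13557",
    "UNIPROTKB": "Q9BYF1",
    "type": "Gene",
}

problems = confirm_identifiers(ace2_example, {"SYMBOL": "ACE2", "HGNC": "13557"})
print(problems or "all expected identifiers match")
```

In the live notebook, the same helper could be applied to `ace2` directly before passing it to later steps.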

In [3]:
# find all potential representations of hydroxychloroquine 
hydroxychloroquine_hint = ht.query("hydroxychloroquine")
# select the correct representation of hydroxychloroquine 
hydroxychloroquine = hydroxychloroquine_hint['ChemicalSubstance'][0]
hydroxychloroquine

{'DRUGBANK': 'DB01611',
 'CHEBI': 'CHEBI:5801',
 'name': 'hydroxychloroquine',
 'primary': {'identifier': 'CHEBI',
  'cls': 'ChemicalSubstance',
  'value': 'CHEBI:5801'},
 'display': 'CHEBI(CHEBI:5801) DRUGBANK(DB01611) name(hydroxychloroquine)',
 'type': 'ChemicalSubstance'}
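Because the top hit is not guaranteed to be the right one, it can help to list every candidate before indexing with `[0]`. The following is a minimal sketch, assuming a `Hint` result is a dict that maps Biolink class names to lists of candidate dicts (the shape implied by `hydroxychloroquine_hint['ChemicalSubstance'][0]`). `list_candidates` and `example_hint` are hypothetical names, and the empty `Gene` list is an assumption made for illustration.

```python
def list_candidates(hint_result):
    """Return (class_name, rank, display) tuples for every candidate in a Hint-style dict."""
    rows = []
    for cls, candidates in hint_result.items():
        for rank, candidate in enumerate(candidates):
            # Fall back to the name if no display string is present.
            label = candidate.get("display", candidate.get("name", "?"))
            rows.append((cls, rank, label))
    return rows


# Hand-built stand-in for a Hint result (assumption: same shape as the real one).
example_hint = {
    "ChemicalSubstance": [
        {
            "name": "hydroxychloroquine",
            "display": "CHEBI(CHEBI:5801) DRUGBANK(DB01611) name(hydroxychloroquine)",
        },
    ],
    "Gene": [],
}

for cls, rank, label in list_candidates(example_hint):
    print(f"{cls}[{rank}]: {label}")
```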

## Step 2: Find intermediate nodes connecting ACE2 and hydroxychloroquine 

In this section, we find all paths in the knowledge graph that connect ACE2 and hydroxychloroquine.  To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months.

The parameters for `FindConnection` are described below:


In [5]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples: 
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes 

Here, we formulate a `FindConnection` query with "ACE2" as the `input_obj` and "hydroxychloroquine" as the `output_obj`.  We further specify with the `intermediate_nodes` parameter that we are looking for paths joining ACE2 and hydroxychloroquine with *one* intermediate node of type `BiologicalEntity`, so an intermediate node of any biological entity type is allowed.  (The ability to search for longer reasoning paths that include additional intermediate nodes will be added shortly.)

In [4]:
fc = FindConnection(input_obj=ace2, output_obj=hydroxychloroquine, intermediate_nodes=['BiologicalEntity'])

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, then assembling the results.

A verbose log of this process is displayed below:

In [5]:
# setting verbose=True displays every step BTE takes to find the connection
fc.connect(verbose=True)


BTE will find paths that join 'ACE2' and 'hydroxychloroquine'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: BiologicalEntity



==== Step #1: Query path planning ====

Because ACE2 is of type 'Gene', BTE will query our meta-KG for APIs that can take 'Gene' as input and 'None' as output

BTE found 17 apis:

API 1. hmdb(1 API call)
API 2. mychem(3 API calls)
API 3. dgidb(1 API call)
API 4. cord_gene(1 API call)
API 5. ebi_gene2phenotype(1 API call)
API 6. scibite(2 API calls)
API 7. chembio(1 API call)
API 8. mydisease(1 API call)
API 9. biolink(4 API calls)
API 10. semmed_gene(13 API calls)
API 11. mygene(9 API calls)
API 12. DISEASES(1 API call)
API 13. pharos(2 API calls)
API 14. opentarget(1 API call)
API 15. scigraph(2 API calls)
API 16. hetio(1 API call)
API 17. ctd(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 8.1: http://
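The plan-then-execute flow in the log above can be pictured with a toy model. This sketch is not BTE's implementation: the API names, the `TOY_META_KG` lookup table, and the canned responses are all invented for illustration, and assembling results by intersecting the neighbors found from each end of the query is an assumption about one possible strategy.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "meta-KG": which (hypothetical) APIs accept which input type.
TOY_META_KG = {
    "Gene": ["toy_api_a", "toy_api_b"],
    "ChemicalSubstance": ["toy_api_c"],
}

# Canned responses; a real call would be an HTTP request to the API.
TOY_RESPONSES = {
    "toy_api_a": ["node1", "node2"],
    "toy_api_b": ["node2", "node3"],
    "toy_api_c": ["node2", "node4"],
}


def plan(input_type):
    """Planning: look up the APIs that accept this input type."""
    return TOY_META_KG.get(input_type, [])


def call_api(api_name):
    """Execution of a single call: return the nodes this toy API links to."""
    return set(TOY_RESPONSES[api_name])


def neighbors(input_type):
    """Dispatch all planned calls in parallel and merge their results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(call_api, plan(input_type)))
    return set().union(*results)


# Assembly: intermediate nodes reachable from both ends of the query.
shared = neighbors("Gene") & neighbors("ChemicalSubstance")
print(sorted(shared))
```

The thread pool mirrors the note in the log that API requests are dispatched in parallel; everything else is simplified.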

## Step 3: Display and Filter results
This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, all matching paths can be exported to a data frame. Let's examine a sample of those results.

In [8]:
df = fc.display_table_view()

df.head()

| | input | input_type | pred1 | pred1_source | pred1_api | pred1_pubmed | node1_type | node1_name | node1_id | pred2 | pred2_source | pred2_api | pred2_pubmed | output_type | output_name | output_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACE2 | Gene | positively_regulates | SEMMED | SEMMED Gene API | 1897819418978194 | Gene | C3539645 | UMLS:C3539645 | positively_regulates | SEMMED | SEMMED Chemical API | 1099312210993122 | Gene | HYDROXYCHLOROQUINE | name:HYDROXYCHLOROQUINE |
| 1 | ACE2 | Gene | negatively_regulates | SEMMED | SEMMED Gene API | 18156169 | Gene | C1709136 | UMLS:C1709136 | physically_interacts_with | SEMMED | SEMMED Chemical API | 27939463 | Gene | HYDROXYCHLOROQUINE | name:HYDROXYCHLOROQUINE |
| 2 | ACE2 | Gene | physically_interacts_with | SEMMED | SEMMED Gene API | 25875512 | Gene | C0014442 | UMLS:C0014442 | physically_interacts_with | SEMMED | SEMMED Chemical API | 9001825 | Gene | HYDROXYCHLOROQUINE | name:HYDROXYCHLOROQUINE |
| 3 | ACE2 | Gene | negatively_regulates | SEMMED | SEMMED Gene API | 18156169 | Gene | C0669365 | UMLS:C0669365 | physically_interacts_with | SEMMED | SEMMED Chemical API | 27939463 | Gene | HYDROXYCHLOROQUINE | name:HYDROXYCHLOROQUINE |
| 4 | ACE2 | Gene | physically_interacts_with | SEMMED | SEMMED Gene API | 19232015 | ChemicalSubstance | C0309049 | UMLS:C0309049 | physically_interacts_with | SEMMED | SEMMED Chemical API | 246698762619770726872459 | ChemicalSubstance | HYDROXYCHLOROQUINE | name:HYDROXYCHLOROQUINE |


While most results are based on edges from [semmed](https://skr3.nlm.nih.gov/SemMed/), edges from [DGIdb](http://www.dgidb.org/), [biolink](https://monarchinitiative.org/), [disgenet](http://www.disgenet.org/), [mydisease.info](https://mydisease.info) and [drugcentral](http://drugcentral.org/) were also retrieved from their respective APIs.  
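To check where the edges in a result set come from, tally the `pred1_api` and `pred2_api` columns. The sketch below uses only the five sample rows shown earlier, reduced to those two columns and written as plain dicts (the shape produced by `df.to_dict("records")`). `count_edge_sources` is a hypothetical helper, not a BTE function.

```python
from collections import Counter

# The five sample rows from df.head(), reduced to the two API columns.
sample_rows = [
    {"pred1_api": "SEMMED Gene API", "pred2_api": "SEMMED Chemical API"}
    for _ in range(5)
]


def count_edge_sources(rows, columns=("pred1_api", "pred2_api")):
    """Count how many edges each API contributed across the given columns."""
    counts = Counter()
    for row in rows:
        for column in columns:
            counts[row[column]] += 1
    return counts


for api, n_edges in count_edge_sources(sample_rows).most_common():
    print(f"{api}: {n_edges}")
```

On the full data frame, `df.pred1_api.value_counts()` and `df.pred2_api.value_counts()` give the per-column counts directly.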

Next, let's see which types of intermediate nodes appear in the results.

In [9]:
df.node1_type.unique()

array(['Gene', 'ChemicalSubstance', 'Disease', 'AnatomicalEntity',
       'GenomicEntity', 'BiologicalProcess', 'CellularComponent'],
      dtype=object)
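With the node types known, a natural follow-up is to restrict the results to one type and count how often each intermediate node occurs. This is a minimal sketch over the five sample rows shown earlier, written as plain dicts so it runs without the live query; `top_nodes` is a hypothetical helper. With only five rows every node appears once, so the ranking becomes informative only on the full result set.

```python
from collections import Counter

# node1 columns of the five sample rows from df.head().
sample_rows = [
    {"node1_type": "Gene", "node1_name": "C3539645"},
    {"node1_type": "Gene", "node1_name": "C1709136"},
    {"node1_type": "Gene", "node1_name": "C0014442"},
    {"node1_type": "Gene", "node1_name": "C0669365"},
    {"node1_type": "ChemicalSubstance", "node1_name": "C0309049"},
]


def top_nodes(rows, node_type, n=None):
    """Return (node1_name, count) pairs for one node type, most frequent first."""
    counts = Counter(
        row["node1_name"] for row in rows if row["node1_type"] == node_type
    )
    return counts.most_common(n)


for name, n_paths in top_nodes(sample_rows, "Gene"):
    print(f"{name}: {n_paths}")
```

The pandas equivalent on the notebook's data frame is `df[df.node1_type == "Gene"].node1_name.value_counts()`.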