# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph. BioThings Explorer can answer two classes of queries -- "PREDICT" and "EXPLAIN". PREDICT queries are described in [PREDICT_demo.ipynb](PREDICT_demo.ipynb). Here, we describe EXPLAIN queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

EXPLAIN queries are designed to **identify plausible reasoning chains to explain the relationship between two entities**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"*Why does hydroxychloroquine have an effect on ACE2?*"  



**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/EXPLAIN_ACE2_hydroxychloroquine_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite). This only needs to be done once (it is included here for compatibility with [Colab](https://colab.research.google.com/)).

In [1]:
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Collecting biothings_explorer from git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer
  Cloning https://github.com/biothings/biothings_explorer to /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-install-87l7sgk0/biothings-explorer
  Running command git clone -q https://github.com/biothings/biothings_explorer /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-install-87l7sgk0/biothings-explorer
  Running command git submodule update --init --recursive -q
Building wheels for collected packages: biothings-explorer
  Building wheel for biothings-explorer (setup.py) ... done
  Stored in directory: /private/var/folders/59/w2v_bg_d2rj_hg69468vdxzw0000gn/T/pip-ephem-wheel-cache-mp83lp3v/wheels/61/44/e1/901cb798059240028e8e2b5d8ed46a47aafa11af30a20c465a
Successfully built biothings-explorer
Installing collected packages: biothings-explorer
Successfully installed biothings-explorer-0.0.1
You should consider upgrading via the 'pip install

Next, import the relevant modules:

* **Hint**: Finds the bio-entity representation used in BioThings Explorer that corresponds to the user's input (which can be any database ID, symbol, or name)
* **FindConnection**: Finds intermediate bio-entities that connect the user-specified input and output

In [1]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## Step 1: Find representation of "ACE2" and "hydroxychloroquine" in BTE

In this step, BioThings Explorer translates our query strings "ACE2" and "hydroxychloroquine" into BioThings objects, which contain mappings to many common identifiers. Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm this by checking the identifiers shown.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [4]:
ht = Hint()
# find all potential representations of ACE2
ace2_hint = ht.query("ACE2")
# select the correct representation of ACE2
ace2 = ace2_hint['Gene'][0]
ace2

{'entrez': '59272',
 'name': 'angiotensin I converting enzyme 2',
 'symbol': 'ACE2',
 'taxonomy': 9606,
 'umls': 'C1422064',
 'uniprot': 'Q9BYF1',
 'hgnc': '13557',
 'ensembl': 'ENSG00000130234',
 'display': 'entrez(59272) name(angiotensin I converting enzyme 2) symbol(ACE2) taxonomy(9606) umls(C1422064) uniprot(Q9BYF1) hgnc(13557) ensembl(ENSG00000130234) ',
 'type': 'Gene',
 'primary': {'identifier': 'entrez', 'cls': 'Gene', 'value': '59272'}}
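
One way to confirm the top hit is to compare a few of its identifiers against values you already trust. The sketch below is only illustrative: `find_mismatches` is a hypothetical helper, not part of `biothings_explorer`, and the dictionary is a shortened copy of the output above, so the block runs without querying any API.

```python
# Shortened copy of the Hint output above; in the notebook this is ace2_hint['Gene'][0].
ace2 = {
    'entrez': '59272',
    'name': 'angiotensin I converting enzyme 2',
    'symbol': 'ACE2',
    'taxonomy': 9606,
    'uniprot': 'Q9BYF1',
    'hgnc': '13557',
    'ensembl': 'ENSG00000130234',
    'type': 'Gene',
    'primary': {'identifier': 'entrez', 'cls': 'Gene', 'value': '59272'},
}


def find_mismatches(candidate, expected):
    """Return the identifier fields whose values differ from `expected`.

    Hypothetical helper for illustration; values are compared as strings
    so that 9606 and '9606' count as equal.
    """
    return [
        key for key, value in expected.items()
        if str(candidate.get(key)) != str(value)
    ]


# Identifiers we expect for human ACE2 (taken from the output above).
expected = {'symbol': 'ACE2', 'taxonomy': 9606, 'uniprot': 'Q9BYF1'}
mismatches = find_mismatches(ace2, expected)
print(mismatches)  # an empty list means every checked identifier agrees
```

If the list is not empty, it may be worth inspecting the other candidates in the `Hint` result before continuing.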

In [5]:
# find all potential representations of hydroxychloroquine 
hydroxychloroquine_hint = ht.query("hydroxychloroquine")
# select the correct representation of hydroxychloroquine 
hydroxychloroquine = hydroxychloroquine_hint['ChemicalSubstance'][0]
hydroxychloroquine

{'chembl': 'CHEMBL1535',
 'drugbank': 'DB01611',
 'name': 'Hydroxychloroquine',
 'pubchem': 3652,
 'umls': 'C0020336',
 'mesh': 'D006886',
 'chebi': 'CHEBI:5801',
 'smiles': 'CCN(CCO)CCCC(C)Nc1ccnc2cc(Cl)ccc12',
 'display': 'chembl(CHEMBL1535) drugbank(DB01611) name(Hydroxychloroquine) pubchem(3652) umls(C0020336) mesh(D006886) chebi(CHEBI:5801) smiles(CCN(CCO)CCCC(C)Nc1ccnc2cc(Cl)ccc12) ',
 'type': 'ChemicalSubstance',
 'primary': {'identifier': 'chembl',
  'cls': 'ChemicalSubstance',
  'value': 'CHEMBL1535'}}
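
The results table later in this notebook refers to nodes by prefixed identifiers such as `chembl:CHEMBL1535`. Assuming that this form simply joins the `identifier` and `value` fields of the `primary` entry (an assumption based on the outputs shown here, not on the BTE source), it can be rebuilt with a small hypothetical helper:

```python
def primary_curie(bio_object):
    """Format the 'primary' entry of a Hint result as 'prefix:value'.

    Hypothetical helper; assumes the object has the 'primary' layout
    shown in the Hint outputs above.
    """
    primary = bio_object['primary']
    return f"{primary['identifier']}:{primary['value']}"


# Shortened copy of the Hint output above.
hydroxychloroquine = {
    'name': 'Hydroxychloroquine',
    'type': 'ChemicalSubstance',
    'primary': {
        'identifier': 'chembl',
        'cls': 'ChemicalSubstance',
        'value': 'CHEMBL1535',
    },
}

print(primary_curie(hydroxychloroquine))  # chembl:CHEMBL1535
```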

## Step 2: Find intermediate nodes connecting ACE2 and hydroxychloroquine 

In this section, we find all paths in the knowledge graph that connect ACE2 and hydroxychloroquine. To do that, we will use `FindConnection`. This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months.

The parameters for `FindConnection` are described below:


In [5]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples: 
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes 

Here, we formulate a `FindConnection` query with ACE2 as the `input_obj` and hydroxychloroquine as the `output_obj`. We further specify with the `intermediate_nodes` parameter that we are looking for paths joining ACE2 and hydroxychloroquine with *one* intermediate node, which may be any `BiologicalEntity`. (The ability to search for longer reasoning paths that include additional intermediate nodes will be added shortly.)

In [12]:
fc = FindConnection(input_obj=ace2, output_obj=hydroxychloroquine, intermediate_nodes=['BiologicalEntity'])

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, then assembling the results.

A verbose log of this process is displayed below:

In [13]:
# setting verbose=True displays every step BTE takes to find the connection
fc.connect(verbose=True)


BTE will find paths that join 'ACE2' and 'Hydroxychloroquine'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: BiologicalEntity



==== Step #1: Query path planning ====

Because ACE2 is of type 'Gene', BTE will query our meta-KG for APIs that can take 'Gene' as input and 'None' as output

BTE found 18 apis:

API 1. semmedgene(1 API call)
API 2. biolink_gene2anatomy(1 API call)
API 3. ctd_gene2disease(1 API call)
API 4. scibite_gene2chemical(1 API call)
API 5. pfocr(1 API call)
API 6. DISEASES(1 API call)
API 7. mydisease.info(1 API call)
API 8. mychem.info(3 API calls)
API 9. mygene.info(4 API calls)
API 10. biolink_gene2phenotype(1 API call)
API 11. dgidb_gene2chemical(1 API call)
API 12. opentarget(1 API call)
API 13. biolink_gene2disease(1 API call)
API 14. scibite_gene2disease(1 API call)
API 15. cordgene(1 API call)
API 16. myvariant.info(1 API call)
API 17. biolink_geneinteraction(1 API call)
API 18. ebigene2phenotype(1 API call)


=
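
To make the planning, execution, and assembly steps more concrete, here is a toy sketch. It is not how BTE is implemented: the registry, the canned responses, and the functions `plan`, `execute`, and `connect` are all hypothetical, and searching from both ends is a simplifying assumption. The API names are borrowed from this notebook's outputs, and the single malaria edge pair mirrors the row shown in Step 3.

```python
from collections import defaultdict

# Toy stand-in for a meta-KG: (api_name, input_type, output_type).
# The type signatures are assumptions made for this example.
TOY_META_KG = [
    ("DISEASES", "Gene", "DiseaseOrPhenotypicFeature"),
    ("dgidb_gene2chemical", "Gene", "ChemicalSubstance"),
    ("scibite_chemical2disease", "ChemicalSubstance", "DiseaseOrPhenotypicFeature"),
]

# Toy stand-in for API responses: (api_name, input_name) -> neighbour names.
TOY_RESPONSES = {
    ("DISEASES", "ACE2"): ["malaria"],
    ("scibite_chemical2disease", "Hydroxychloroquine"): ["malaria"],
}


def plan(input_type):
    """Query path planning: select every API that accepts `input_type`."""
    return [api for api, in_type, _ in TOY_META_KG if in_type == input_type]


def execute(apis, input_name):
    """Query path execution: 'call' each API and map neighbour -> source APIs."""
    neighbours = defaultdict(list)
    for api in apis:
        for node in TOY_RESPONSES.get((api, input_name), []):
            neighbours[node].append(api)
    return neighbours


def connect(start, end):
    """Assemble one-intermediate-node paths by intersecting both neighbour sets."""
    from_start = execute(plan(start["type"]), start["name"])
    from_end = execute(plan(end["type"]), end["name"])
    shared = sorted(set(from_start) & set(from_end))
    return [(start["name"], node, end["name"]) for node in shared]


paths = connect(
    {"name": "ACE2", "type": "Gene"},
    {"name": "Hydroxychloroquine", "type": "ChemicalSubstance"},
)
print(paths)  # [('ACE2', 'malaria', 'Hydroxychloroquine')]
```

The point of the sketch is only the shape of the process: type-based API selection, one call per selected API, and a join on shared intermediate nodes.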

## Step 3: Display and Filter results
This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, all matching paths can be exported to a data frame. Let's examine a sample of those results.

In [16]:
df = fc.display_table_view()

df[df['pred2_api'] == 'scibite_chemical2disease']

| | input | input_type | pred1 | pred1_source | pred1_api | pred1_pubmed | node1_type | node1_name | node1_id | pred2 | pred2_source | pred2_api | pred2_pubmed | output_type | output_name | output_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | ACE2 | Gene | associatedWith | DISEASES | DISEASES | | DiseaseOrPhenotypicFeature | malaria | mondo:MONDO:0005136 | associatedWith | scibite | scibite_chemical2disease | | DiseaseOrPhenotypicFeature | HYDROXYCHLOROQUINE | chembl:CHEMBL1535 |


While most results are based on edges from [semmed](https://skr3.nlm.nih.gov/SemMed/), edges from [DGIdb](http://www.dgidb.org/), [biolink](https://monarchinitiative.org/), [disgenet](http://www.disgenet.org/), [mydisease.info](https://mydisease.info) and [drugcentral](http://drugcentral.org/) were also retrieved from their respective APIs.  
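
A quick way to check which APIs contribute the most edges is to tally the `pred1_api` and `pred2_api` columns. The sketch below works on plain dictionaries so that it runs on its own; in the notebook, `df.to_dict("records")` would produce a list of this shape. Only the first row comes from the table above. The other rows, including the name `placeholder_api`, are invented placeholders, and `count_edge_apis` is a hypothetical helper.

```python
from collections import Counter

# Row 1 is taken from the table above; rows 2-3 are invented placeholders.
records = [
    {"pred1_api": "DISEASES", "node1_name": "malaria",
     "pred2_api": "scibite_chemical2disease"},
    {"pred1_api": "semmedgene", "node1_name": "placeholder_node_A",
     "pred2_api": "placeholder_api"},
    {"pred1_api": "semmedgene", "node1_name": "placeholder_node_B",
     "pred2_api": "placeholder_api"},
]


def count_edge_apis(rows):
    """Count how many edges each API contributed across both hops of every path."""
    counts = Counter()
    for row in rows:
        counts[row["pred1_api"]] += 1
        counts[row["pred2_api"]] += 1
    return counts


api_counts = count_edge_apis(records)
for api, n_edges in api_counts.most_common():
    print(f"{api}: {n_edges}")
```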

Next, let's see which types of intermediate nodes appear in the results.

In [18]:
df.node1_type.unique()

array(['GenomicEntity', 'AnatomicalEntity', 'DiseaseOrPhenotypicFeature',
       'ChemicalSubstance', 'Gene', 'CellularComponent',
       'BiologicalProcess'], dtype=object)
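
Once the node types are known, a natural follow-up is to pick one type and rank its intermediate nodes by how many paths pass through them. The sketch below is a minimal, standalone illustration on plain dictionaries (the shape that `df.to_dict("records")` would give). `top_nodes` is a hypothetical helper, and every row except the malaria one is an invented placeholder.

```python
from collections import Counter

# The malaria row mirrors the table shown earlier; the Gene rows are placeholders.
records = [
    {"node1_type": "DiseaseOrPhenotypicFeature", "node1_name": "malaria"},
    {"node1_type": "Gene", "node1_name": "placeholder_gene_A"},
    {"node1_type": "Gene", "node1_name": "placeholder_gene_A"},
    {"node1_type": "Gene", "node1_name": "placeholder_gene_B"},
]


def top_nodes(rows, node_type, limit=5):
    """Return (node name, path count) pairs for one node type, most frequent first."""
    names = (row["node1_name"] for row in rows if row["node1_type"] == node_type)
    return Counter(names).most_common(limit)


print(top_nodes(records, "Gene"))
# [('placeholder_gene_A', 2), ('placeholder_gene_B', 1)]
```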