# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph. BioThings Explorer can answer two classes of queries -- "PREDICT" and "EXPLAIN".  PREDICT queries are described in [PREDICT_demo.ipynb](PREDICT_demo.ipynb). Here, we describe EXPLAIN queries and how to use BioThings Explorer to execute them.  A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

EXPLAIN queries are designed to **identify plausible reasoning chains to explain the relationship between two entities**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"*Why does imatinib have an effect on the treatment of chronic myelogenous leukemia (CML)?*"  

Later, we also compare those results to a similar query looking at imatinib's role in treating gastrointestinal stromal tumors (GIST).

**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/EXPLAIN_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite).  This only needs to be done once, but it is included here for compatibility with [colab](https://colab.research.google.com/).

In [1]:
%%capture
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Next, import the relevant modules:

* **Hint**: Finds the bio-entity representation used in BioThings Explorer that corresponds to the user input (which can be any database ID, symbol, or name)
* **FindConnection**: Finds intermediate bio-entities that connect the user-specified input and output

In [2]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## Step 1: Find representation of "chronic myelogenous leukemia" and "imatinib" in BTE

In this step, BioThings Explorer translates our query strings "chronic myelogenous leukemia" and "imatinib" into BioThings objects, which contain mappings to many common identifiers.  Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm that using the identifiers shown.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [3]:
ht = Hint()
# find all potential representations of CML
cml_hint = ht.query("chronic myelogenous leukemia")
# select the correct representation of CML
cml = cml_hint['DiseaseOrPhenotypicFeature'][0]
cml

{'mondo': 'MONDO:0011996',
 'doid': 'DOID:8552',
 'umls': 'C1292772',
 'mesh': 'D015464',
 'name': 'chronic myelogenous leukemia',
 'display': 'mondo(MONDO:0011996) doid(DOID:8552) umls(C1292772) mesh(D015464) name(chronic myelogenous leukemia) ',
 'type': 'DiseaseOrPhenotypicFeature',
 'primary': {'identifier': 'mondo',
  'cls': 'DiseaseOrPhenotypicFeature',
  'value': 'MONDO:0011996'}}

In [4]:
# find all potential representations of imatinib
imatinib_hint = ht.query("imatinib")
# select the correct representation of imatinib
imatinib = imatinib_hint['ChemicalSubstance'][0]
imatinib

{'chembl': 'CHEMBL941',
 'drugbank': 'DB00619',
 'name': 'Imatinib',
 'pubchem': 5291,
 'umls': 'C0935989',
 'mesh': 'D000068877',
 'chebi': 'CHEBI:45783',
 'smiles': 'Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1',
 'display': 'chembl(CHEMBL941) drugbank(DB00619) name(Imatinib) pubchem(5291) umls(C0935989) mesh(D000068877) chebi(CHEBI:45783) smiles(Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1) ',
 'type': 'ChemicalSubstance',
 'primary': {'identifier': 'chembl',
  'cls': 'ChemicalSubstance',
  'value': 'CHEMBL941'}}
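
Before picking the first hit, it can help to list every candidate with its primary identifier. The sketch below is a minimal illustration, not part of BioThings Explorer: `summarize_hint` is a hypothetical helper, and `hint_result` is a toy dictionary that assumes a hint result maps each Biolink class to a ranked list of records shaped like the outputs above. The second candidate is an invented placeholder.

```python
# Toy hint result, assumed to mirror the shape of Hint().query(...) output:
# a dict mapping each Biolink class to a ranked list of candidate records.
hint_result = {
    "DiseaseOrPhenotypicFeature": [
        {"name": "chronic myelogenous leukemia",
         "primary": {"identifier": "mondo", "value": "MONDO:0011996"}},
        # Invented placeholder record, for illustration only.
        {"name": "hypothetical second match",
         "primary": {"identifier": "mondo", "value": "MONDO:0000000"}},
    ],
    "Gene": [],
}


def summarize_hint(result):
    """Return (class, rank, name, primary ID) tuples for every candidate."""
    rows = []
    for cls, candidates in result.items():
        for rank, candidate in enumerate(candidates):
            primary_id = candidate["primary"]["value"]
            rows.append((cls, rank, candidate["name"], primary_id))
    return rows


for row in summarize_hint(hint_result):
    print(row)
```

With a real result, the same loop could be run over `cml_hint` or `imatinib_hint` to check that index `0` is the intended entity.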

## Step 2: Find intermediate nodes connecting imatinib and chronic myelogenous leukemia

In this section, we find all paths in the knowledge graph that connect imatinib and chronic myelogenous leukemia.  To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months. 

The parameters for `FindConnection` are described below:


In [5]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples:
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes (

Here, we formulate a `FindConnection` query with "CML" as the `input_obj` and "imatinib" as the `output_obj`.  We further specify with the `intermediate_nodes` parameter that we are looking for paths joining chronic myelogenous leukemia and imatinib with *one* intermediate node that is a Gene.  (The ability to search for longer reasoning paths that include additional intermediate nodes will be added shortly.)

In [6]:
fc = FindConnection(input_obj=cml, output_obj=imatinib, intermediate_nodes='Gene')

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, then assembling the results.

A verbose log of this process is displayed below:

In [7]:
# setting verbose=True displays every step that BTE takes to find the connection
fc.connect(verbose=True)


BTE will find paths that join 'chronic myelogenous leukemia' and 'Imatinib'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene



==== Step #1: Query path planning ====

Because chronic myelogenous leukemia is of type 'DiseaseOrPhenotypicFeature', BTE will query our meta-KG for APIs that can take 'DiseaseOrPhenotypicFeature' as input and 'Gene' as output

BTE found 7 apis:

API 1. biolink_disease2gene(1 API call)
API 2. scibite_disease2gene(1 API call)
API 3. mgigene2phenotype(1 API call)
API 4. corddisease(1 API call)
API 5. semmeddisease(1 API call)
API 6. DISEASES(1 API call)
API 7. mydisease.info(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 6.1: https://biothings.ncats.io/DISEASES/query (POST "q=MONDO:0011996&scopes=_id&fields=DISEASES.associatedWith&species=human&size=100")
API 3.1: https://biothings.ncats.io/mgigene2phen
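
Conceptually, assembling the results amounts to joining the disease-to-gene edges with the chemical-to-gene edges on the shared gene. The sketch below illustrates that idea with a few toy edges; it is an assumption about the general approach, not BioThings Explorer's actual implementation, and `join_on_gene` is a hypothetical helper.

```python
from collections import defaultdict

# Toy edges standing in for API responses (real responses carry more fields).
disease_to_gene = [
    ("chronic myelogenous leukemia", "related_to", "ABL1"),
    ("chronic myelogenous leukemia", "caused_by", "KDR"),
    ("chronic myelogenous leukemia", "affected_by", "CA2"),
]
chemical_to_gene = [
    ("Imatinib", "negatively_regulates", "ABL1"),
    ("Imatinib", "associated_with", "KDR"),
    ("Imatinib", "physically_interacts_with", "VEGFA"),
]


def join_on_gene(left_edges, right_edges):
    """Pair every disease-gene edge with every chemical-gene edge on the same gene."""
    by_gene = defaultdict(list)
    for chemical, predicate, gene in right_edges:
        by_gene[gene].append((predicate, chemical))

    paths = []
    for disease, pred1, gene in left_edges:
        # Genes with no chemical-side edge contribute no paths.
        for pred2, chemical in by_gene.get(gene, []):
            paths.append((disease, pred1, gene, pred2, chemical))
    return paths


paths = join_on_gene(disease_to_gene, chemical_to_gene)
for path in paths:
    print(" -> ".join(path))
```

Only genes that appear on both sides survive the join, so in this toy data *CA2* and *VEGFA* drop out.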

## Step 3: Display and Filter results
This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, all matching paths can be exported to a data frame. Let's examine a sample of those results.

In [8]:
df = fc.display_table_view()

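# because UMLS is not currently well-integrated in our ID-to-object translation system, removing UMLS-only entries here
# (raw string avoids an invalid "\d" escape; the mask name avoids shadowing the built-in filter())
pattern_del = r"^umls:C\d+"
is_umls_only = df.node1_id.str.contains(pattern_del)
df = df[~is_umls_only]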

print(df.shape)
df.sample(10)

(1621, 16)


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
2087,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,disrupted_by,SEMMED API,semmeddisease,26734127,Gene,PTEN,entrez:5728,coexists_with,SEMMED API,semmedchemical,24091144,Gene,IMATINIB,chembl:CHEMBL941
220,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,"11979553,10498618,10822991,11368359,20809971,1...",Gene,ABL1,entrez:25,negatively_regulates,SEMMED API,semmedchemical,21693657,Gene,IMATINIB,chembl:CHEMBL941
341,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,2647654426476544,Gene,KDR,entrez:3791,associated_with,CORD API,cordchemical,,Gene,IMATINIB,chembl:CHEMBL941
122,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,associatedWith,scibite,scibite_disease2gene,,Gene,ABL1,entrez:25,associatedWith,ctd,ctd_chemical2gene,12032135,Gene,IMATINIB,chembl:CHEMBL941
2159,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,"16263579,17064569,24726782,10979972,12496355,1...",Gene,VEGFA,entrez:7422,physically_interacts_with,SEMMED API,semmedchemical,1807530224726782,Gene,IMATINIB,chembl:CHEMBL941
208,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,18406870,Gene,ABL1,entrez:25,coexists_with,SEMMED API,semmedchemical,260302912661817526618175,Gene,IMATINIB,chembl:CHEMBL941
260,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,affected_by,SEMMED API,semmeddisease,6686171,Gene,CA2,entrez:760,associatedWith,ctd,ctd_chemical2gene,18237030,Gene,IMATINIB,chembl:CHEMBL941
410,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,associated_with,CORD API,corddisease,,Gene,KIT,entrez:3815,associatedWith,ctd,ctd_chemical2gene,1742028627793025,Gene,IMATINIB,chembl:CHEMBL941
1974,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,disrupted_by,SEMMED API,semmeddisease,18184863,Gene,AKT1,entrez:207,negatively_regulates,SEMMED API,semmedchemical,12481435126720431799974218794444,Gene,IMATINIB,chembl:CHEMBL941
1983,chronic myelogenous leukemia,DiseaseOrPhenotypicFeature,caused_by,SEMMED API,semmeddisease,24598853,Gene,AKT1,entrez:207,physically_interacts_with,SEMMED API,semmedchemical,25182956,Gene,IMATINIB,chembl:CHEMBL941


While most results are based on edges from [semmed](https://skr3.nlm.nih.gov/SemMed/), edges from [DGIdb](http://www.dgidb.org/), [biolink](https://monarchinitiative.org/), [disgenet](http://www.disgenet.org/), [mydisease.info](https://mydisease.info) and [drugcentral](http://drugcentral.org/) were also retrieved from their respective APIs.  
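
One way to check which sources dominate is to tally the source of both edges in each path. The sketch below does this with the standard library on a few rows copied from the sample above, reduced to three columns. On the real data frame, `df.pred1_source.value_counts()` should give the equivalent per-column tally.

```python
from collections import Counter

# A few rows from the sampled table above, reduced to the columns of interest.
rows = [
    {"node1_name": "PTEN", "pred1_source": "SEMMED API", "pred2_source": "SEMMED API"},
    {"node1_name": "KDR", "pred1_source": "SEMMED API", "pred2_source": "CORD API"},
    {"node1_name": "ABL1", "pred1_source": "scibite", "pred2_source": "ctd"},
    {"node1_name": "KIT", "pred1_source": "CORD API", "pred2_source": "ctd"},
]

# Each path has two edges, so count the source of both.
source_counts = Counter()
for row in rows:
    source_counts.update([row["pred1_source"], row["pred2_source"]])

for source, count in source_counts.most_common():
    print(source, count)
```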

Next, let's look to see which genes are mentioned the most.

In [9]:
df.node1_name.value_counts().head(10)

BCR       221
ABL1      156
KIT       130
ABCB1      75
MTTP       63
AKT1       55
TP53       49
CD34       36
TNF        32
STAT5A     28
Name: node1_name, dtype: int64

Not surprisingly, the top two genes that BioThings Explorer found that join imatinib to CML are *BCR* and *ABL1*, the two genes that are fused in the "Philadelphia chromosome", the genetic abnormality that underlies CML, *and* the validated target of imatinib.

Let's examine some of the PubMed articles linking **CML to *ABL1*** and ***ABL1* to imatinib**.

In [10]:
# fetch all articles connecting 'chronic myelogenous leukemia' and 'ABL1'
articles = []
for info in fc.display_edge_info('chronic myelogenous leukemia', 'ABL1').values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are "+str(len(articles))+" articles supporting the edge between CML and ABL1. Sampling of 10 of those:")
for pmid in articles[0:10]:
    print("http://pubmed.gov/"+str(pmid))

There are 22 articles supporting the edge between CML and ABL1. Sampling of 10 of those:
http://pubmed.gov/18243808
http://pubmed.gov/25389112
http://pubmed.gov/26186983
http://pubmed.gov/27282582
http://pubmed.gov/23955590
http://pubmed.gov/27904889
http://pubmed.gov/11979553
http://pubmed.gov/10498618
http://pubmed.gov/10822991
http://pubmed.gov/11368359


In [11]:
# fetch all articles connecting 'ABL1' and 'Imatinib'
articles = []
for info in fc.display_edge_info('ABL1', 'Imatinib').values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are "+str(len(articles))+" articles supporting the edge between ABL1 and imatinib. Sampling of 10 of those:")
for pmid in articles[0:10]:
    print("http://pubmed.gov/"+str(pmid))

There are 33 articles supporting the edge between ABL1 and imatinib. Sampling of 10 of those:
http://pubmed.gov/15799618
http://pubmed.gov/15917650
http://pubmed.gov/15949566
http://pubmed.gov/16153117
http://pubmed.gov/16205964
http://pubmed.gov/12032135
http://pubmed.gov/12446442
http://pubmed.gov/15039284
http://pubmed.gov/15107311
http://pubmed.gov/15329907
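
The two cells above repeat the same collection loop, and the same article can be attached to more than one edge. A small helper can factor the loop out and drop duplicates. This is a sketch under assumptions: `collect_pubmed_ids` is a hypothetical function, and `toy_edge_info` is an invented stand-in that assumes `fc.display_edge_info(...)` returns a dictionary whose values hold a list of PubMed IDs under `['info']['pubmed']`, as the cells above suggest.

```python
def collect_pubmed_ids(edge_info):
    """Gather PubMed IDs from every edge record, dropping duplicates
    while preserving first-seen order."""
    seen = {}
    for record in edge_info.values():
        # Edges without literature support have no 'pubmed' key.
        for pmid in record["info"].get("pubmed", []):
            seen.setdefault(str(pmid), None)
    return list(seen)


# Invented stand-in for fc.display_edge_info('chronic myelogenous leukemia', 'ABL1');
# the grouping of IDs into edges is illustrative only.
toy_edge_info = {
    "edge_0": {"info": {"pubmed": ["11979553", "10498618"]}},
    "edge_1": {"info": {"pubmed": ["10498618", "10822991"]}},
    "edge_2": {"info": {}},
}

pmids = collect_pubmed_ids(toy_edge_info)
for pmid in pmids:
    print("http://pubmed.gov/" + pmid)
```

Deduplicating changes the reported article count, so the totals may be lower than those printed above.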


## Comparing results between CML and GIST

Let's perform another BioThings Explorer query, this time looking to EXPLAIN the relationship between imatinib and gastrointestinal stromal tumors (GIST), another disease treated by imatinib.

In [12]:
ht = Hint()
# find all potential representations of GIST
gist_hint = ht.query("gastrointestinal stromal tumor")
# select the correct representation of GIST
gist = gist_hint['DiseaseOrPhenotypicFeature'][0]
gist

{'mondo': 'MONDO:0011719',
 'doid': 'DOID:9253',
 'umls': 'C3179349',
 'mesh': 'D046152',
 'name': 'gastrointestinal stromal tumor',
 'display': 'mondo(MONDO:0011719) doid(DOID:9253) umls(C3179349) mesh(D046152) name(gastrointestinal stromal tumor) ',
 'type': 'DiseaseOrPhenotypicFeature',
 'primary': {'identifier': 'mondo',
  'cls': 'DiseaseOrPhenotypicFeature',
  'value': 'MONDO:0011719'}}

In [13]:
fc = FindConnection(input_obj=gist, output_obj=imatinib, intermediate_nodes='Gene')

In [14]:
fc.connect(verbose=False) # skipping the verbose log here

0, message='Attempt to decode JSON with unexpected mimetype: text/plain;charset=utf-8', url='http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=D000068877%7Cmercury&report=genes_curated&format=json


In [15]:
df = fc.display_table_view()

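# because UMLS is not currently well-integrated in our ID-to-object translation system, removing UMLS-only entries here
# (raw string avoids an invalid "\d" escape; the mask name avoids shadowing the built-in filter())
pattern_del = r"^umls:C\d+"
is_umls_only = df.node1_id.str.contains(pattern_del)
df = df[~is_umls_only]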

print(df.shape)
df.sample(10)

(1757, 16)


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
500,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,has_part,SEMMED API,semmeddisease,150628761882269221835435,Gene,KIT,entrez:3815,target,drugbank,mychem.info,1686556516087693174585631736958317559139,Gene,IMATINIB,chembl:CHEMBL941
1105,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,disrupted_by,SEMMED API,semmeddisease,12918066,Gene,KIT,entrez:3815,coexists_with,SEMMED API,semmedchemical,18515176267796182147585021566433,Gene,IMATINIB,chembl:CHEMBL941
2070,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,associatedWith,DISEASES,DISEASES,,Gene,TP53,entrez:7157,physically_interacts_with,SEMMED API,semmedchemical,2009479829115375,Gene,IMATINIB,chembl:CHEMBL941
967,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,treated_by,SEMMED API,semmeddisease,"15717993,16740027,17534660,16596277,19466511,1...",Gene,KIT,entrez:3815,physically_interacts_with,SEMMED API,semmedchemical,25174682,Gene,IMATINIB,chembl:CHEMBL941
1908,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,treated_by,SEMMED API,semmeddisease,26078569,Gene,PECAM1,entrez:5175,associatedWith,ctd,ctd_chemical2gene,15867366,Gene,IMATINIB,chembl:CHEMBL941
312,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,associatedWith,DISEASES,DISEASES,,Gene,PDGFRA,entrez:5156,negatively_regulates,SEMMED API,semmedchemical,17437861,Gene,IMATINIB,chembl:CHEMBL941
1164,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,associatedWith,scibite,scibite_disease2gene,,Gene,KIT,entrez:3815,negatively_regulates,SEMMED API,semmedchemical,"16818499,14531349,16614006,17363509,15139047,1...",Gene,IMATINIB,chembl:CHEMBL941
1239,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,produces,SEMMED API,semmeddisease,1522395815223958182656492619128711012738,Gene,KIT,entrez:3815,physically_interacts_with,SEMMED API,semmedchemical,"12441322,15846297,16797704,21586300,21249321,2...",Gene,IMATINIB,chembl:CHEMBL941
1765,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,has_part,SEMMED API,semmeddisease,2886226328862263,Gene,NF1,entrez:4763,associated_with,CORD API,cordchemical,,Gene,IMATINIB,chembl:CHEMBL941
2140,gastrointestinal stromal tumor,DiseaseOrPhenotypicFeature,related_to,SEMMED API,semmeddisease,"14614641,14614641,12949061,16029637,16029637,1...",Gene,ZNRD2,entrez:10534,physically_interacts_with,SEMMED API,semmedchemical,18974147,Gene,IMATINIB,chembl:CHEMBL941


In [16]:
df.node1_name.value_counts().head(10)

KIT       810
PDGFRA    255
BCR        51
TP53       49
CD34       48
EGFR       36
VEGFA      36
ABL1       28
BRAF       25
MTOR       18
Name: node1_name, dtype: int64

Here, the top two genes that BioThings Explorer found that join imatinib to GIST are *PDGFRA* and *KIT*, the most commonly mutated genes found in GIST *and* validated targets of imatinib.

While several of the listed genes would be considered positive controls, others on the list could be viewed as **testable hypotheses and discovery opportunities** to be evaluated by domain experts.
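
To make the comparison concrete, the two top-10 lists can be split into shared and disease-specific genes. The sketch below hard-codes the gene names from the two `value_counts()` outputs above; the variable names are illustrative.

```python
# Top-10 genes copied from the value_counts() outputs above, in rank order.
cml_top = ["BCR", "ABL1", "KIT", "ABCB1", "MTTP",
           "AKT1", "TP53", "CD34", "TNF", "STAT5A"]
gist_top = ["KIT", "PDGFRA", "BCR", "TP53", "CD34",
            "EGFR", "VEGFA", "ABL1", "BRAF", "MTOR"]

cml_set, gist_set = set(cml_top), set(gist_top)

# List comprehensions keep the original rank order, unlike raw set operations.
shared = [gene for gene in cml_top if gene in gist_set]
cml_only = [gene for gene in cml_top if gene not in gist_set]
gist_only = [gene for gene in gist_top if gene not in cml_set]

print("Shared:   ", shared)
print("CML only: ", cml_only)
print("GIST only:", gist_only)
```

Half of each list is shared between the two diseases, while *PDGFRA* appears only for GIST. *ABL1* is in both lists but ranks much lower for GIST.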

## Conclusions and caveats

This notebook demonstrated the use of BioThings Explorer in EXPLAIN mode to investigate the relationship between imatinib and two diseases that it treats -- chronic myelogenous leukemia (CML) and gastrointestinal stromal tumors (GIST).  In each case, BioThings Explorer autonomously queried a **distributed knowledge graph of biomedical APIs** to find the most common genes, and in each case the relevant targets were retrieved.

There are still many areas for improvement (and some areas in which BioThings Explorer is still buggy).  And of course, BioThings Explorer is dependent on the accessibility of the APIs that comprise the distributed knowledge graph.  Nevertheless, we encourage users to try other variants of the EXPLAIN queries demonstrated in this notebook.