# Introduction

This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph of biomedical knowledge.

BioThings Explorer can answer two classes of queries -- "EXPLAIN" and "PREDICT". PREDICT queries are described in [PREDICT_demo.ipynb](PREDICT_demo.ipynb). Here, we describe EXPLAIN queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

EXPLAIN queries are designed to **identify plausible reasoning chains to explain the relationship between two specific biomedical entities (objects or concepts)**.  For example, in this notebook, we explore the question:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"*Why does imatinib have an effect on the treatment of chronic myelogenous leukemia (CML)?*"  

Later, we also compare those results to a similar query looking at imatinib's role in treating gastrointestinal stromal tumors (GIST).

**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/biothings/biothings_explorer/blob/master/jupyter%20notebooks/EXPLAIN_demo.ipynb).**

## Step 0: Load BioThings Explorer modules

First, install the `biothings_explorer` and `biothings_schema` packages, as described in this [README](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/README.md#prerequisite).  This only needs to be done once (but it's included here for compatibility with [Colab](https://colab.research.google.com/)).

In [1]:
%%capture
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer

Next, import the relevant modules:

* **Hint**: Find corresponding BioThings representations to use in BioThings Explorer based on user input (could be any database IDs, symbols, names)
* **FindConnection**: Find the relationship(s) between a specific entity and an entity type (PREDICT) or between two specific entities (EXPLAIN)

In [1]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

In [4]:
# helper for summarizing Hint results (a candidate for inclusion in the package)
def hint_display(query, hint_result):
    """
    Print the type, name, and number of identifiers for every object returned by a Hint query.

    :param query: string used in the Hint query
    :param hint_result: object returned from the Hint query (a dictionary of lists of dictionaries)
    :return: None
    """
    # NOTE: this would need to be rewritten to report each object's index within its type
    fields = ('type', 'name')  # the parts of each BioThings object to show
    non_id_keys = 4  # keys that are not identifiers: name, primary, display, type
    concise_results = []
    for objects in hint_result.values():
        for obj in objects or []:  # skip types with no results
            num_ids = len(obj) - non_id_keys
            concise_results.append((obj[fields[0]], obj[fields[1]], num_ids))

    print('There are {total} BioThings objects returned for {ht}:'.format(
        total=len(concise_results), ht=query))
    for bt_type, name, num_ids in concise_results:
        print('{0}, {1}, num of IDs: {2}'.format(bt_type, name, num_ids))

## Step 1: Find representations of "chronic myelogenous leukemia" and "imatinib" in BTE

In this step, BioThings Explorer translates our query strings "chronic myelogenous leukemia" and "imatinib" into BioThings objects, which contain mappings to many common identifiers.  Generally, the top result returned by the `Hint` module will match what you want, but you should confirm that using the identifiers shown. 

BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. 

**Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [5]:
ht = Hint()
cml_str = "chronic myelogenous leukemia"
# find all potential representations of CML
cml_hint = ht.query(cml_str)

hint_display(cml_str, cml_hint)

There are 6 BioThings objects returned for chronic myelogenous leukemia:
Disease, chronic myelogenous leukemia, BCR-ABL1 positive, num of IDs: 6
Disease, Philadelphia chromosome positive chronic myelogenous leukemia, num of IDs: 2
Disease, Philadelphia chromosome negative chronic myelogenous leukemia, num of IDs: 2
Disease, Chronic myelogenous leukemia, BCR/ABL positive, num of IDs: 2
Disease, blast phase chronic myelogenous leukemia, BCR-ABL1 positive, num of IDs: 3
PhenotypicFeature, Chronic myelogenous leukemia, num of IDs: 3


Based on the information above, we'll pick the top Disease choice (indexed at 0) for CML for our query. We can look at identifier mappings inside this BioThings object. 

In [6]:
# select the first Disease object for our CML query
cml = cml_hint['Disease'][0]
cml

{'MONDO': 'MONDO:0011996',
 'DOID': 'DOID:8552',
 'UMLS': 'C0023473',
 'name': 'chronic myelogenous leukemia, BCR-ABL1 positive',
 'MESH': 'D015464',
 'OMIM': '608232',
 'ORPHANET': '521',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0011996'},
 'display': 'MONDO(MONDO:0011996) DOID(DOID:8552) OMIM(608232) ORPHANET(521) UMLS(C0023473) MESH(D015464) name(chronic myelogenous leukemia, BCR-ABL1 positive)',
 'type': 'Disease'}
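
One way to confirm a Hint choice is to look it up by a known identifier instead of relying on its position. The sketch below assumes only the structure shown above (a dictionary mapping each BioThings type to a list of dictionaries). `find_by_id` is a hypothetical helper, not part of `biothings_explorer`, and `sample_hint` is an abridged stand-in for the real `cml_hint`.

```python
def find_by_id(hint_result, id_type, id_value):
    """Return the first object whose `id_type` field equals `id_value`, else None."""
    for objects in hint_result.values():
        for obj in objects or []:
            if obj.get(id_type) == id_value:
                return obj
    return None


# Abridged stand-in for the dictionary returned by ht.query(...);
# values are copied from the output above, other keys are omitted.
sample_hint = {
    "Disease": [
        {"MONDO": "MONDO:0011996", "DOID": "DOID:8552",
         "name": "chronic myelogenous leukemia, BCR-ABL1 positive",
         "type": "Disease"},
    ],
    "PhenotypicFeature": [
        {"name": "Chronic myelogenous leukemia", "type": "PhenotypicFeature"},
    ],
}

match = find_by_id(sample_hint, "MONDO", "MONDO:0011996")
print(match["name"] if match else "no match")
```

With the live result, the same call on `cml_hint` would be expected to return the object selected by `cml_hint['Disease'][0]`, provided the Hint output has not changed.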

In [7]:
# find all potential representations of imatinib
imatinib_str = "imatinib"
imatinib_hint = ht.query(imatinib_str)

hint_display(imatinib_str, imatinib_hint)

There are 5 BioThings objects returned for imatinib:
ChemicalSubstance, IMATINIB, num of IDs: 9
ChemicalSubstance, Imatinib, num of IDs: 6
ChemicalSubstance, IMATINIB MESYLATE, num of IDs: 9
Pathway, Imatinib Pathway, Pharmacokinetics/Pharmacodynamics, num of IDs: 1
Pathway, Imatinib and Chronic Myeloid Leukemia, num of IDs: 1


While multiple results look valid (the entries at positions 0, 1, 3, and 4 in the list above), for now we'll pick the top ChemicalSubstance choice (index 0 within its type) for imatinib for our query. We can look at identifier mappings inside this BioThings object. 

In [8]:
# select the correct representation of imatinib
imatinib = imatinib_hint['ChemicalSubstance'][0]
imatinib

{'CHEMBL.COMPOUND': 'CHEMBL941',
 'DRUGBANK': 'DB00619',
 'PUBCHEM': 5291,
 'CHEBI': 'CHEBI:45783',
 'UMLS': 'C0935989',
 'MESH': 'D000068877',
 'UNII': '8A1O1M485B',
 'INCHIKEY': 'KTUFNOKKBVMGRW-UHFFFAOYSA-N',
 'INCHI': 'InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34)',
 'name': 'IMATINIB',
 'primary': {'identifier': 'CHEBI',
  'cls': 'ChemicalSubstance',
  'value': 'CHEBI:45783'},
 'display': 'CHEBI(CHEBI:45783) CHEMBL.COMPOUND(CHEMBL941) DRUGBANK(DB00619) PUBCHEM(5291) MESH(D000068877) UNII(8A1O1M485B) UMLS(C0935989) name(IMATINIB)',
 'type': 'ChemicalSubstance'}

## Step 2: Find intermediate nodes connecting imatinib and chronic myelogenous leukemia

In this section, we find some paths in the knowledge graph that connect imatinib and chronic myelogenous leukemia.  To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months. 

The parameters for `FindConnection` are described below:

In [9]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples:
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes (

Here, we formulate a `FindConnection` query with CML as the `input_obj`, imatinib as the `output_obj`.  We further specify with the `intermediate_nodes` parameter that we are looking for paths joining chronic myelogenous leukemia and imatinib with *one* intermediate node that is a "Gene" (BioThings type).  

In [10]:
fc = FindConnection(input_obj = cml, output_obj = imatinib, \
                    intermediate_nodes = 'Gene')

We next execute the `connect` method, which performs the **query path planning** and **query path execution** process.  In short, BioThings Explorer is deconstructing the query into individual API calls, executing those API calls, then assembling the results.

A verbose log of this process is displayed below:

In [11]:
# setting verbose=True displays every step BTE takes to find the connection
fc.connect(verbose = True)


BTE will find paths that join 'chronic myelogenous leukemia, BCR-ABL1 positive' and 'IMATINIB'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene



==== Step #1: Query path planning ====

Because chronic myelogenous leukemia, BCR-ABL1 positive is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'Gene' as output

BTE found 10 apis:

API 1. scibite(1 API call)
API 2. semmed_disease(15 API calls)
API 3. pharos(1 API call)
API 4. cord_disease(1 API call)
API 5. biolink(1 API call)
API 6. hetio(1 API call)
API 7. mydisease(1 API call)
API 8. scigraph(1 API call)
API 9. mgi_gene2phenotype(1 API call)
API 10. DISEASES(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 5.1: https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0011996/genes?rows=200
API 7.1: https://mydisease.info/v1/query


After id-to-object translation, BTE retrieved 608 unique objects.



BTE found 328 unique intermediate nodes connecting 'chronic myelogenous leukemia, BCR-ABL1 positive' and 'IMATINIB'
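
The planning step in the log above can be pictured as a lookup over records that say which API maps which entity type to which. The following is a toy sketch of that idea only, not BTE's implementation: `TOY_META_KG`, `plan`, and the two `hypothetical_*` API names are made up for illustration, while `pharos` and `mydisease` are names taken from the log.

```python
# Toy meta-KG: each record says an API can map one entity type to another.
# The last two records are illustrative assumptions, not real BTE entries.
TOY_META_KG = [
    {"api": "pharos", "input": "Disease", "output": "Gene"},
    {"api": "mydisease", "input": "Disease", "output": "Gene"},
    {"api": "hypothetical_chem_api", "input": "ChemicalSubstance", "output": "Gene"},
    {"api": "hypothetical_pathway_api", "input": "Disease", "output": "Pathway"},
]


def plan(meta_kg, input_type, output_types):
    """Return names of APIs that accept `input_type` and emit any of `output_types`."""
    if isinstance(output_types, str):
        output_types = [output_types]
    return [rec["api"] for rec in meta_kg
            if rec["input"] == input_type and rec["output"] in output_types]


print(plan(TOY_META_KG, "Disease", "Gene"))
```

Execution would then issue one or more calls per selected API and merge the responses, which is the part the log labels "Query path execution".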


## Step 3: Display and filter results

This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the **query path execution** module for interleaved filtering, thereby enabling longer query paths. More details to come...

First, results can be exported to a data frame. Let's examine a sample of those results.

In [12]:
df = fc.display_table_view()

# UMLS is not yet well integrated into our ID-to-object translation system, so drop UMLS-only entries
pattern_del = r"^UMLS:C\d+"
umls_only = df.node1_id.str.contains(pattern_del)
df = df[~umls_only]

print(df.shape)
df.sample(10)

(1249, 16)


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
663,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,treated_by,SEMMED,SEMMED Disease API,"15717993,16205964,16596277,17694264,17891251,2...",Gene,KIT,NCBIGene:3815,physically_interacts_with,SEMMED,SEMMED Chemical API,"12481435,12672043,14645423,15689683,15893416,1...",Gene,IMATINIB,name:IMATINIB
1801,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,treated_by,SEMMED,SEMMED Disease API,29041012,Gene,WT1,NCBIGene:7490,coexists_with,SEMMED,SEMMED Chemical API,19580340,Gene,IMATINIB,name:IMATINIB
983,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,related_to,SEMMED,SEMMED Disease API,15995325,Gene,FGFR1,NCBIGene:2260,physically_interacts_with,drugcentral,MyChem.info API,,Gene,IMATINIB,name:IMATINIB
990,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,related_to,SEMMED,SEMMED Disease API,29096333,Gene,EPHB4,NCBIGene:2050,physically_interacts_with,SEMMED,SEMMED Chemical API,24415355,Gene,IMATINIB,name:IMATINIB
903,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,related_to,pharos,Automat PHAROS API,,Gene,ABL1,NCBIGene:25,positively_regulated_by,SEMMED,SEMMED Chemical API,26251899,Gene,IMATINIB,name:IMATINIB
1468,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,related_to,disgenet,mydisease.info API,,Gene,CXCR4,NCBIGene:7852,negatively_regulates,SEMMED,SEMMED Chemical API,24502926,Gene,IMATINIB,name:IMATINIB
1571,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,caused_by,SEMMED,SEMMED Disease API,15833859,Gene,IFI27,NCBIGene:3429,physically_interacts_with,SEMMED,SEMMED Chemical API,18974147,Gene,IMATINIB,name:IMATINIB
684,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,treated_by,SEMMED,SEMMED Disease API,"15717993,16205964,16596277,17694264,17891251,2...",Gene,KIT,NCBIGene:3815,coexists_with,SEMMED,SEMMED Chemical API,"16143881,17208434,17372901,18234488,21181476,2...",Gene,IMATINIB,name:IMATINIB
766,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,related_to,scibite,Automat CORD19 Scibite API,,Gene,PDGFRA,NCBIGene:5156,physically_interacts_with,drugbank,MyChem.info API,1589492815921304159283351594658915995325,Gene,IMATINIB,name:IMATINIB
1836,"CHRONIC MYELOGENOUS LEUKEMIA, BCR-ABL1 POSITIVE",Disease,treated_by,SEMMED,SEMMED Disease API,24574458,Gene,IL2RA,NCBIGene:3559,negatively_regulates,SEMMED,SEMMED Chemical API,18190601,Gene,IMATINIB,name:IMATINIB


Notice that each result gives the knowledge source (API), and many also give the literature source (PubMed IDs). 
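
The UMLS filter applied in the cell above is a regular-expression match against the start of each `node1_id`. A minimal sketch of the same test on plain strings follows; the UMLS identifier in `node_ids` is a made-up placeholder, and the two gene identifiers are copied from the table.

```python
import re

# Same pattern as in the cell above, written as a raw string so that
# the backslash in \d is not treated as a string escape.
UMLS_ONLY = re.compile(r"^UMLS:C\d+")

# "UMLS:C0000000" is a placeholder, not a real concept identifier.
node_ids = ["NCBIGene:3815", "UMLS:C0000000", "NCBIGene:25"]

# pandas' str.contains uses a regex search by default, so re.search mirrors it.
kept = [node_id for node_id in node_ids if not UMLS_ONLY.search(node_id)]
print(kept)
```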

Next, let's look to see which genes are mentioned the most.

In [13]:
df.node1_name.value_counts().head(10)

BCR      156
ABL1      81
KIT       80
MTTP      42
AKT1      30
TP53      25
JAK2      24
ABCB1     21
ABCG2     21
LYN       20
Name: node1_name, dtype: int64

*Note: as of 2020-08-13, the notebook output above matches the description below. As knowledge sources (APIs) and code are updated and added, users may obtain different results.*

Not surprisingly, the top two genes that BioThings Explorer found that join imatinib to CML are *BCR* and *ABL1*, the two genes that are fused in the "Philadelphia chromosome", the genetic abnormality that underlies CML, *and* whose fusion product is the validated target of imatinib.

Let's examine some of the PubMed articles linking **CML to *ABL1*** and ***ABL1* to imatinib**.

In [14]:
# fetch all articles connecting 'chronic myelogenous leukemia' and 'ABL1'
ABL1_str = 'ABL1'
CML_name = cml['name']

articles = []
for info in fc.display_edge_info(CML_name, ABL1_str).values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are {} articles supporting the edge between CML and ABL1. Sampling of 10 of those:".format(len(articles)))
for pmid in articles[:10]:
    print("http://pubmed.gov/" + str(pmid))

There are 17 articles supporting the edge between CML and ABL1. Sampling of 10 of those:
http://pubmed.gov/24662807
http://pubmed.gov/26179066
http://pubmed.gov/10498618
http://pubmed.gov/10822991
http://pubmed.gov/11368359
http://pubmed.gov/11979553
http://pubmed.gov/18082628
http://pubmed.gov/18243808
http://pubmed.gov/18406870
http://pubmed.gov/20809971


In [15]:
# fetch all articles connecting 'ABL1' and 'imatinib'
imatinib_name = imatinib['name']

articles = []
for info in fc.display_edge_info(ABL1_str, imatinib_name).values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are {} articles supporting the edge between ABL1 and imatinib. Sampling of 10 of those:".format(len(articles)))
for pmid in articles[:10]:
    print("http://pubmed.gov/" + str(pmid))

There are 11 articles supporting the edge between ABL1 and imatinib. Sampling of 10 of those:
http://pubmed.gov/15799618
http://pubmed.gov/15917650
http://pubmed.gov/15949566
http://pubmed.gov/16153117
http://pubmed.gov/16205964
http://pubmed.gov/15713800
http://pubmed.gov/19166098
http://pubmed.gov/26030291
http://pubmed.gov/26618175
http://pubmed.gov/21693657
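
The two cells above repeat the same loop, so it could be factored into a helper. The sketch below is one possible version under stated assumptions: `collect_pubmed` is a hypothetical name, the nesting of `sample_edges` is inferred from how the loops index the result of `fc.display_edge_info(...)`, and the grouping of PubMed IDs into edges is invented for illustration. Unlike the cells above, it also removes duplicate IDs.

```python
def collect_pubmed(edge_info):
    """Gather PubMed IDs from all edges, dropping duplicates but keeping first-seen order."""
    seen = {}
    for edge in edge_info.values():
        for pmid in edge["info"].get("pubmed", []):
            seen.setdefault(str(pmid), None)
    return list(seen)


# Stand-in for fc.display_edge_info(...); keys and nesting are assumptions.
sample_edges = {
    "edge_a": {"info": {"pubmed": ["24662807", "26179066"]}},
    "edge_b": {"info": {"pubmed": ["26179066", "10498618"]}},
    "edge_c": {"info": {}},  # an edge without literature support
}

pmids = collect_pubmed(sample_edges)
for pmid in pmids:
    print("http://pubmed.gov/" + pmid)
```

Deduplicating changes the reported article count whenever several edges cite the same paper, so the totals may be lower than those printed above.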


## Comparing results between CML and GIST

Let's perform another BioThings Explorer query, this time looking to EXPLAIN the relationship between imatinib and gastrointestinal stromal tumors (GIST), another disease treated by imatinib.

In [16]:
ht = Hint()
gist_str = "gastrointestinal stromal tumor"

# find all potential representations of GIST
gist_hint = ht.query(gist_str)

hint_display(gist_str, gist_hint)

There are 5 BioThings objects returned for gastrointestinal stromal tumor:
Disease, gastrointestinal stromal tumor, num of IDs: 6
Disease, colorectal gastrointestinal stromal tumor, num of IDs: 2
Disease, Gastrointestinal stromal tumor, benign, num of IDs: 2
Disease, Childhood Gastrointestinal Stromal Tumor, num of IDs: 2
Disease, GASTROINTESTINAL STROMAL TUMOR, FAMILIAL, num of IDs: 2


While all results look valid, for now we'll pick the top Disease choice (indexed at 0) for GIST for our query. We can look at identifier mappings inside this BioThings object. 

In [17]:
# select the first Disease object for our GIST query
gist = gist_hint['Disease'][0]
gist

{'MONDO': 'MONDO:0011719',
 'DOID': 'DOID:9253',
 'UMLS': 'C0238198',
 'name': 'gastrointestinal stromal tumor',
 'MESH': 'D046152',
 'OMIM': '606764',
 'ORPHANET': '44890',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0011719'},
 'display': 'MONDO(MONDO:0011719) DOID(DOID:9253) OMIM(606764) ORPHANET(44890) UMLS(C0238198) MESH(D046152) name(gastrointestinal stromal tumor)',
 'type': 'Disease'}

Notice that like before, we are specifying with the `intermediate_nodes` parameter that we are looking for paths joining GIST and imatinib with *one* intermediate node that is a "Gene" (BioThings type).  

In [18]:
fc2 = FindConnection(input_obj = gist, output_obj = imatinib, \
                     intermediate_nodes = 'Gene')

In [19]:
fc2.connect(verbose = True) 


BTE will find paths that join 'gastrointestinal stromal tumor' and 'IMATINIB'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene



==== Step #1: Query path planning ====

Because gastrointestinal stromal tumor is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'Gene' as output

BTE found 10 apis:

API 1. scibite(1 API call)
API 2. semmed_disease(15 API calls)
API 3. pharos(1 API call)
API 4. cord_disease(1 API call)
API 5. biolink(1 API call)
API 6. hetio(1 API call)
API 7. mydisease(1 API call)
API 8. scigraph(1 API call)
API 9. mgi_gene2phenotype(1 API call)
API 10. DISEASES(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 5.1: https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0011719/genes?rows=200
API 7.1: https://mydisease.info/v1/query?fields=disgenet.genes_related_to_


After id-to-object translation, BTE retrieved 608 unique objects.



BTE found 135 unique intermediate nodes connecting 'gastrointestinal stromal tumor' and 'IMATINIB'


In [20]:
df2 = fc2.display_table_view()

# UMLS is not yet well integrated into our ID-to-object translation system, so drop UMLS-only entries
pattern_del = r"^UMLS:C\d+"
umls_only = df2.node1_id.str.contains(pattern_del)
df2 = df2[~umls_only]

print(df2.shape)
df2.sample(10)

(790, 16)


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
854,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,SEMMED,SEMMED Disease API,18265649186487362552585128698433,Gene,PDGFRB,NCBIGene:5159,related_to,Translator Text Mining Provider,CORD Chemical API,,Gene,IMATINIB,name:IMATINIB
245,GASTROINTESTINAL STROMAL SARCOMA,Disease,prevented_by,SEMMED,SEMMED Disease API,21814597,Gene,KIT,NCBIGene:3815,physically_interacts_with,drugcentral,MyChem.info API,,Gene,IMATINIB,name:IMATINIB
1088,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,DISEASE,DISEASES API,,Gene,TP53,NCBIGene:7157,positively_regulated_by,SEMMED,SEMMED Chemical API,19150257,Gene,IMATINIB,name:IMATINIB
464,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,scibite,Automat CORD19 Scibite API,,Gene,KIT,NCBIGene:3815,positively_regulates,SEMMED,SEMMED Chemical API,28016253,Gene,IMATINIB,name:IMATINIB
882,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,Translator Text Mining Provider,CORD Disease API,,Gene,EGFR,NCBIGene:1956,positively_regulates,SEMMED,SEMMED Chemical API,15844661,Gene,IMATINIB,name:IMATINIB
230,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,scibite,Automat CORD19 Scibite API,,Gene,KIT,NCBIGene:3815,physically_interacts_with,drugbank,MyChem.info API,1608769316865565173695831745856317559139,Gene,IMATINIB,name:IMATINIB
605,GASTROINTESTINAL STROMAL SARCOMA,Disease,treated_by,SEMMED,SEMMED Disease API,"15717993,16596277,16740027,17064855,17534660,1...",Gene,KIT,NCBIGene:3815,related_to,Translator Text Mining Provider,CORD Chemical API,,Gene,IMATINIB,name:IMATINIB
923,GASTROINTESTINAL STROMAL SARCOMA,Disease,affected_by,SEMMED,SEMMED Disease API,18615679,Gene,BRAF,NCBIGene:673,related_to,Translator Text Mining Provider,CORD Chemical API,,Gene,IMATINIB,name:IMATINIB
1036,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,Translator Text Mining Provider,CORD Disease API,,Gene,FIP1L1,NCBIGene:81608,related_to,scibite,Automat CORD19 Scibite API,,Gene,IMATINIB,name:IMATINIB
781,GASTROINTESTINAL STROMAL SARCOMA,Disease,related_to,pharos,Automat PHAROS API,,Gene,PDGFRA,NCBIGene:5156,related_to,pharos,Automat PHAROS API,,Gene,IMATINIB,name:IMATINIB


In [21]:
df2.node1_name.value_counts().head(10)

KIT       288
PDGFRA     81
BCR        48
TP53       30
ABL1       27
VEGFA      20
BRAF       15
CD34       15
PDGFRB     14
MTTP       14
Name: node1_name, dtype: int64

*Note: as of 2020-08-13, the notebook output above matches the description below. As knowledge sources (APIs) and code are updated and added, users may obtain different results.*

Here, the top two genes that BioThings Explorer found that join imatinib to GIST are *PDGFRA* and *KIT*, the most commonly mutated genes found in GIST *and* validated targets of imatinib.

While several of the listed genes would be considered positive controls, others on the list could be viewed as **testable hypotheses and discovery opportunities** to be evaluated by domain experts.
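
To make the CML and GIST comparison concrete, the two top-10 lists can be split into shared and disease-specific genes. The sketch below hard-codes the gene names from the two `value_counts()` outputs above (as of 2020-08-13); with live data frames, the lists could instead be taken from `df.node1_name.value_counts().head(10).index`.

```python
# Top-10 intermediate genes, copied from the outputs above.
cml_top = ["BCR", "ABL1", "KIT", "MTTP", "AKT1",
           "TP53", "JAK2", "ABCB1", "ABCG2", "LYN"]
gist_top = ["KIT", "PDGFRA", "BCR", "TP53", "ABL1",
            "VEGFA", "BRAF", "CD34", "PDGFRB", "MTTP"]

cml_set, gist_set = set(cml_top), set(gist_top)

# List comprehensions keep the original ranking order, which plain sets would lose.
shared = [gene for gene in cml_top if gene in gist_set]
cml_only = [gene for gene in cml_top if gene not in gist_set]
gist_only = [gene for gene in gist_top if gene not in cml_set]

print("shared:   ", shared)
print("CML only: ", cml_only)
print("GIST only:", gist_only)
```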

## Conclusions and caveats

This notebook demonstrated the use of BioThings Explorer in EXPLAIN mode to investigate the relationship between imatinib and two diseases that it treats -- chronic myelogenous leukemia (CML) and gastrointestinal stromal tumors (GIST).  In each case, BioThings Explorer autonomously queried a **distributed knowledge graph of biomedical APIs** to find the most common genes, and in each case the relevant targets were retrieved.

There are still many areas for improvement (and some areas in which BioThings Explorer is still buggy).  And of course, BioThings Explorer is dependent on the accessibility of the APIs that comprise the distributed knowledge graph.  Nevertheless, we encourage users to try other variants of the EXPLAIN queries demonstrated in this notebook.

## More query information for users

*Note: as of 2020-08-13, the information below is valid.*

The input into the Hint query should be one string with **at least one whole word.** 

Currently, the user must set the `input_obj` parameter of `FindConnection()` to **one specific BioThings object** (e.g., one chemical, phenotype, disease, or GO term). To do an EXPLAIN query, the `output_obj` parameter of `FindConnection()` must be **another specific BioThings object**. 

**For an EXPLAIN query, BTE will only accept one intermediate node.** The intermediate node cannot be a specific BioThings object. It must be given as one or more BioThings types, None, or an empty list ([]). 

The intermediate node parameter can be a single entity type (string, example: 'BiologicalEntity'), a list of acceptable entity types (list of strings, example: ['Gene', 'AnatomicalEntity']), None, or an empty list. **None or an empty list will tell BTE to look for direct connections (no intermediate nodes).** If the intermediate node parameter is a list, BTE will look for knowledge sources that can handle at least one of the types listed.   
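
The accepted forms of the intermediate node parameter can be summarized as a small normalization routine. This is an illustrative sketch of the rules stated above, not BTE's own validation code, and `normalize_intermediate` is a hypothetical name.

```python
def normalize_intermediate(intermediate_nodes):
    """Return the list of acceptable types for the single intermediate node.

    An empty list means "look for direct connections only".
    """
    if intermediate_nodes is None or intermediate_nodes == []:
        return []
    if isinstance(intermediate_nodes, str):
        return [intermediate_nodes]
    if isinstance(intermediate_nodes, list) and all(
            isinstance(item, str) for item in intermediate_nodes):
        return list(intermediate_nodes)
    # Anything else, such as a specific BioThings object (a dict), is rejected.
    raise TypeError("intermediate_nodes must be a type name, a list of type names, "
                    "None, or an empty list")


print(normalize_intermediate("Gene"))
print(normalize_intermediate(["Gene", "AnatomicalEntity"]))
print(normalize_intermediate(None))
```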

BTE will query from input_obj -> intermediate and **from output_obj -> intermediate** and then join results based on shared intermediates. The results should therefore be read as **input_obj -> intermediate <- output_obj. This directionality in results should be kept in mind when interpreting them.**
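
The join on shared intermediates can be sketched with plain Python data. This is a simplified illustration, not BTE's code: `join_on_intermediate` is a hypothetical helper, the predicate and gene pairs are loosely based on rows of the CML table above, and `GENE_X` is a placeholder for a node reached from only one side.

```python
def join_on_intermediate(from_input, from_output):
    """Pair up edges that end on the same intermediate node.

    Both arguments are lists of (predicate, intermediate_name) tuples.
    Each result reads: input -pred1-> intermediate <-pred2- output.
    """
    preds_by_node = {}
    for pred2, node in from_output:
        preds_by_node.setdefault(node, []).append(pred2)

    paths = []
    for pred1, node in from_input:
        for pred2 in preds_by_node.get(node, []):
            paths.append((pred1, node, pred2))
    return paths


# Edges found when querying from the disease side.
cml_edges = [("related_to", "ABL1"), ("treated_by", "KIT"), ("related_to", "GENE_X")]
# Edges found when querying from the drug side.
imatinib_edges = [("physically_interacts_with", "KIT"),
                  ("coexists_with", "KIT"),
                  ("positively_regulated_by", "ABL1")]

paths = join_on_intermediate(cml_edges, imatinib_edges)
for pred1, node, pred2 in paths:
    print("CML -{}-> {} <-{}- imatinib".format(pred1, node, pred2))
```

Because both predicates point toward the intermediate node, a second predicate such as `positively_regulated_by` describes imatinib's relationship to the gene as reported from the imatinib side, not the gene's relationship to imatinib.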