# Introduction

<img src="img/tidbit2.png" width=300 style="float: right;"/>

This notebook demonstrates how BioThings Explorer can be used to answer the following query:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*"What drugs might be used to treat Parkinson's disease?"*

This query corresponds to [Tidbit 2](https://ncats.nih.gov/tidbit/tidbit_02.html), which was formulated as a demonstration of the NCATS Translator program.

**Background**: BioThings Explorer can answer two classes of queries -- "EXPLAIN" and "PREDICT". EXPLAIN queries are described in [EXPLAIN_demo.ipynb](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/EXPLAIN_demo.ipynb), and PREDICT queries are described in [PREDICT_demo.ipynb](https://github.com/biothings/biothings_explorer/blob/master/jupyter%20notebooks/PREDICT_demo.ipynb). Here, we describe PREDICT queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).

## Step 1: Find representation of "Parkinson disease" in BTE

In this step, BioThings Explorer translates our query string "Parkinson disease" into BioThings objects, which contain mappings to many common identifiers. Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm this by checking the identifiers shown.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [1]:
from biothings_explorer.hint import Hint
ht = Hint()
parkDis = ht.query("Parkinson disease")['DiseaseOrPhenotypicFeature'][0]

parkDis

{'mondo': 'MONDO:0005180',
 'doid': 'DOID:14330',
 'umls': 'C0030567',
 'mesh': 'D010300',
 'name': 'Parkinson disease',
 'display': 'mondo(MONDO:0005180) doid(DOID:14330) umls(C0030567) mesh(D010300) name(Parkinson disease) ',
 'type': 'DiseaseOrPhenotypicFeature',
 'primary': {'identifier': 'mondo',
  'cls': 'DiseaseOrPhenotypicFeature',
  'value': 'MONDO:0005180'}}
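
Because the top hit is not guaranteed to be the intended entity, a quick programmatic check can complement the visual inspection. The sketch below works on a plain dictionary copied from the output above, so it runs without network access. `confirm_hit` is a hypothetical helper written for this illustration and is not part of BioThings Explorer; the "expected" identifiers are simply reused from the output above and would normally come from an independent lookup.

```python
# Dictionary copied from the Hint output shown above (trimmed to the ID fields).
park_dis_hit = {
    "mondo": "MONDO:0005180",
    "doid": "DOID:14330",
    "umls": "C0030567",
    "mesh": "D010300",
    "name": "Parkinson disease",
    "type": "DiseaseOrPhenotypicFeature",
}


def confirm_hit(hit, expected_ids):
    """Return the expected identifiers that the hit does NOT match.

    Hypothetical helper for illustration only. An empty result means
    every expected identifier agrees with the hit.
    """
    return {
        key: value
        for key, value in expected_ids.items()
        if hit.get(key) != value
    }


# Assumption: these identifiers were looked up independently.
expected = {"mondo": "MONDO:0005180", "mesh": "D010300"}
mismatches = confirm_hit(park_dis_hit, expected)
print(mismatches or "top hit matches the expected identifiers")
```

If `mismatches` is not empty, it may be worth inspecting the other entries in the list returned by `ht.query(...)` rather than taking element `[0]`.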

## Step 2: Find drugs that are associated with genes involved in Parkinson disease

In this section, we find all paths in the knowledge graph that connect Parkinson disease to any entity that is a chemical compound.  To do that, we will use `FindConnection`.  This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months. 

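The call below passes three parameters to `FindConnection`: `input_obj`, the starting node (here, the `parkDis` object from Step 1); `output_obj`, the type of the ending node (here, `ChemicalSubstance`); and `intermediate_nodes`, a list of type constraints for the nodes in between (here, a single `Gene` node).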


In [2]:
from biothings_explorer.user_query_dispatcher import FindConnection

fc = FindConnection(input_obj=parkDis, output_obj='ChemicalSubstance', intermediate_nodes=['Gene'])
fc.connect(verbose=True)


BTE will find paths that join 'Parkinson disease' and 'ChemicalSubstance'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene





==== Step #1: Query path planning ====

Because Parkinson disease is of type 'DiseaseOrPhenotypicFeature', BTE will query our meta-KG for APIs that can take 'DiseaseOrPhenotypicFeature' as input

BTE found 3 apis:

API 1. mydisease.info(1 API call)
API 2. biolink_disease2gene(1 API call)
API 3. semmeddisease(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 2.1: http://mydisease.info/v1/query (POST "q=C0030567&scopes=mondo.xrefs.umls,disgenet.xrefs.umls&fields=disgenet.genes_related_to_disease&species=human&size=100")
API 3.1: http://pending.biothings.io/semmed/query (POST "q=C0030567&scopes=umls&fields=AFFECTS_reverse.gene,CAUSES_reverse.gene,AFFECTS_reverse.protein,AFFECTS.gene,AFFECTS.protein,ASSOCIAT

API 2.95: http://www.dgidb.org/api/v2/interactions.json?genes=SLC11A2
API 2.58: http://www.dgidb.org/api/v2/interactions.json?genes=ABCB1
API 2.81: http://www.dgidb.org/api/v2/interactions.json?genes=HLA-DRA
API 2.92: http://www.dgidb.org/api/v2/interactions.json?genes=APOE
dgidb_gene2chemical failed
API 2.77: http://www.dgidb.org/api/v2/interactions.json?genes=ATXN8OS
API 2.67: http://www.dgidb.org/api/v2/interactions.json?genes=PTEN
API 2.76: http://www.dgidb.org/api/v2/interactions.json?genes=CYP2E1
API 2.70: http://www.dgidb.org/api/v2/interactions.json?genes=RET
API 2.90: http://www.dgidb.org/api/v2/interactions.json?genes=LINC02451
API 2.99: http://www.dgidb.org/api/v2/interactions.json?genes=TMED9
API 2.117: http://www.dgidb.org/api/v2/interactions.json?genes=TWSG1
API 2.110: http://www.dgidb.org/api/v2/interactions.json?genes=MOG
API 2.109: http://www.dgidb.org/api/v2/interactions.json?genes=SMAD3
API 2.82: http://www.dgidb.org/api/v2/interactions.json?genes=CXCR4
API 2.75: htt

API 2.246: http://www.dgidb.org/api/v2/interactions.json?genes=PARK16
API 2.245: http://www.dgidb.org/api/v2/interactions.json?genes=TRPS1
API 2.244: http://www.dgidb.org/api/v2/interactions.json?genes=EDN1
API 2.248: http://www.dgidb.org/api/v2/interactions.json?genes=CNTN1
API 2.247: http://www.dgidb.org/api/v2/interactions.json?genes=CEBPZ
API 2.236: http://www.dgidb.org/api/v2/interactions.json?genes=SLC6A2
API 2.243: http://www.dgidb.org/api/v2/interactions.json?genes=TALDO1
API 2.233: http://www.dgidb.org/api/v2/interactions.json?genes=CA2
API 2.249: http://www.dgidb.org/api/v2/interactions.json?genes=GPR65
API 2.250: http://www.dgidb.org/api/v2/interactions.json?genes=CD24
API 2.251: http://www.dgidb.org/api/v2/interactions.json?genes=CAT
API 2.254: http://www.dgidb.org/api/v2/interactions.json?genes=EPHX1
API 2.253: http://www.dgidb.org/api/v2/interactions.json?genes=TARDBP
API 2.252: http://www.dgidb.org/api/v2/interactions.json?genes=CDK5R1
API 2.255: http://www.dgidb.org/api

API 3.3: http://mychem.info/v1/query (POST "q=CCDC82,HSPA4,RAB39B,DAPK2,TGFA,INSR,PLPPR1,RPS8,FTL,ADORA1,CASC16,UTRN,IREB2,HEXIM2,GFAP,EGF,BIN3,FAM47E,ZCWPW2,LCN2,SPTSSB,FXN,HMOX1,COL5A2,SIRT1,NFE2L2,NAT2,CCDC62,WASHC1,CYBB,PAK1,FKBP1AP2,TXNIP,ATP13A2,MSX1,TNF,GFPT1,SKP1,ATXN2,COPS5,BRINP1,RPL3,TPPP,BDNF,ABCB6,MIR185,DNAJA3,ITGA8,IGBP1,MFN1,CDC42,VPS13C,BST1,FKBP1AP1,RPL14,PRSS53,KITLG,ABCB1,FKBP1AP3,TIRAP,MDGA2,ATF4,CTSB,ARHGAP24,FGB,WASH6P,PTEN,DNAH8,UCN2,RET,APP,PRDX2,RNF41,ZNF646,CYP2D6,CYP2E1,ATXN8OS,SOD1,KCNJ4,ND2,HLA-DRA,CXCR4,CLRN3,DSG3,SEMA5A,SYNJ1,ALDH1A1,RIT2,LINC02331,LINC02451,MSMB,APOE,SCARB2,SLC18A2,SLC11A2,TRIB3,TMC3-AS1,XK,TMED9,GRN,VCP,CTC1,SLC6A3,HTR2A-AS1,BAX,ITGA2B,FYN,COL13A1,SMAD3,MOG,KLHDC1,TMEM230,ZNF165,TMEM189,MIR30E,HSPA8,TWSG1,HDAC6,CHRNA4,MT2A,GPNMB,TBC1D5,GBF1,FBP1,RTN4,MAP3K5,GRK2,COMT,MTERF4,DCTN4,LRRK2-DT,INS,MTNR1B,VPS26A,P2RX7,TMPRSS9,TH,PARK7,HTRA2,AAK1,DNAJC6,LRRK2,COX6A1,SREBF1,PDE10A,PSMC1,CNKSR3,HSPA1A,DRD2,AKT1,MAPT,CRK,MAOB,UNC13B,IGF2R,ASPRV1

API 2.115 dgidb_gene2chemical: No hits
API 2.116 dgidb_gene2chemical: 8 hits
API 2.117 dgidb_gene2chemical: No hits
API 2.118 dgidb_gene2chemical: 24 hits
API 2.119 dgidb_gene2chemical: No hits
API 2.120 dgidb_gene2chemical: 2 hits
API 2.121 dgidb_gene2chemical: No hits
API 2.122 dgidb_gene2chemical: No hits
API 2.123 dgidb_gene2chemical: No hits
API 2.124 dgidb_gene2chemical: 6 hits
API 2.125 dgidb_gene2chemical: No hits
API 2.126 dgidb_gene2chemical: 2 hits
API 2.127 dgidb_gene2chemical: No hits
API 2.128 dgidb_gene2chemical: 14 hits
API 2.129 dgidb_gene2chemical: No hits
API 2.130 dgidb_gene2chemical: No hits
API 2.131 dgidb_gene2chemical: No hits
API 2.132 dgidb_gene2chemical: 2 hits
API 2.133 dgidb_gene2chemical: 26 hits
API 2.134 dgidb_gene2chemical: No hits
API 2.135 dgidb_gene2chemical: 17 hits
API 2.136 dgidb_gene2chemical: No hits
API 2.137 dgidb_gene2chemical: 17 hits
API 2.138 dgidb_gene2chemical: No hits
API 2.139 dgidb_gene2chemical: No hits
API 2.140 dgidb_gene2chemical:

API 2.336 dgidb_gene2chemical: 2 hits
API 2.337 dgidb_gene2chemical: 5 hits
API 2.338 dgidb_gene2chemical: 19 hits
API 3.1 mychem.info: 352 hits
API 3.2 mychem.info: 1208 hits
API 3.3 mychem.info: 1239 hits
API 1.1 semmedgene: 63036 hits

After id-to-object translation, BTE retrieved 11358 unique objects.



In the #1 query, BTE found 475 unique Gene nodes
In the #2 query, BTE found 11358 unique ChemicalSubstance nodes
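
Conceptually, the two queries logged above amount to a two-hop traversal: disease to gene, then gene to chemical. The following sketch is a toy illustration of that idea only and does not reflect how BTE plans or executes queries internally. The gene-chemical pairs are a small hand-picked sample from the result tables in this notebook, `GENE_X` is a made-up placeholder, and `two_hop_paths` is a hypothetical name.

```python
# Toy edge lists. GENE_X is a hypothetical placeholder for a gene with no hits.
disease_to_genes = {"Parkinson disease": ["DRD2", "SLC6A2", "GENE_X"]}
gene_to_chemicals = {
    "DRD2": ["PERPHENAZINE", "HALOPERIDOL"],
    "SLC6A2": ["QUETIAPINE", "TRODUSQUEMINE"],
    # GENE_X has no entry, so it contributes no paths.
}


def two_hop_paths(start, first_hop, second_hop):
    """Enumerate (start, intermediate, end) paths across two edge maps."""
    paths = []
    for mid in first_hop.get(start, []):
        for end in second_hop.get(mid, []):
            paths.append((start, mid, end))
    return paths


paths = two_hop_paths("Parkinson disease", disease_to_genes, gene_to_chemicals)
for path in paths:
    print(" -> ".join(path))
```

Genes without any downstream chemical drop out of the result, which is consistent with the many "No hits" lines in the log contributing nothing to the final table.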


In [3]:
df = fc.display_table_view()


The `df` object contains the full output from BioThings Explorer. Each row shows one path that joins the input node (Parkinson disease) to an intermediate node (a gene or protein) and then to an ending node (a chemical compound). The data frame includes columns with additional details on each node and edge, including human-readable labels, identifiers, and sources. Let's remove all rows where `output_name` (the compound label) is `None`, and focus on paths with the specific mechanistic predicates `causedBy` and `targetedBy`.

In [5]:
dfFilt = df.loc[df['output_name'].notnull()].query('pred1 == "causedBy" and pred2 == "targetedBy"')
dfFilt

| | input | input_type | pred1 | pred1_source | pred1_api | pred1_pubmed | node1_id | node1_name | node1_type | pred2 | pred2_source | pred2_api | pred2_pubmed | output_id | output_name | output_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:5599 | MAPK8 | Gene | targetedBy | dgidb | dgidb_gene2chemical | | chembl:CHEMBL47 | VITAMIN E | ChemicalSubstance |
| 69 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:2534 | FYN | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL442 | ERGOTAMINE | ChemicalSubstance |
| 98 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:6530 | SLC6A2 | Gene | targetedBy | dgidb | dgidb_gene2chemical | | chembl:CHEMBL716 | QUETIAPINE | ChemicalSubstance |
| 145 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:134 | ADORA1 | Gene | targetedBy | dgidb | dgidb_gene2chemical | | chembl:CHEMBL106265 | 8-CYCLOPENTYLTHEOPHYLLINE | ChemicalSubstance |
| 202 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:1813 | DRD2 | Gene | targetedBy | dgidb | dgidb_gene2chemical | | chembl:CHEMBL567 | PERPHENAZINE | ChemicalSubstance |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92887 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:1565 | CYP2D6 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL415 | CLOMIPRAMINE | ChemicalSubstance |
| 92922 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:6530 | SLC6A2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL508583 | TRODUSQUEMINE | ChemicalSubstance |
| 92979 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:760 | CA2 | Gene | targetedBy | mychem.info | mychem.info | | drugbank:DB03221 | InChI=1S/C14H17N3O5S3/c1-16-12-8-17(9-4-3-5-10... | ChemicalSubstance |
| 93021 | Parkinson disease | DiseaseOrPhenotypicFeature | causedBy | semmed | semmeddisease | | entrez:6010 | RHO | Gene | targetedBy | mychem.info | mychem.info | | drugbank:DB04450 | InChI=1S/C13H26O5S/c1-2-3-4-5-6-7-19-13-12(17)... | ChemicalSubstance |
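
The filter above keeps a single predicate pair. To see which other combinations exist in the full results, one option is to tabulate them first. The sketch below uses a small stand-in data frame (`sample`, a hypothetical name) built from a few rows shown in this notebook, so the counts are illustrative only; on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Small stand-in for `df`, using a few rows that appear in this notebook.
sample = pd.DataFrame(
    {
        "pred1": ["causedBy", "causedBy", "associatedWith", "associatedWith"],
        "node1_name": ["MAPK8", "DRD2", "INSR", "RET"],
        "pred2": ["targetedBy"] * 4,
        "output_name": ["VITAMIN E", "PERPHENAZINE", "NINTEDANIB", "NINTEDANIB"],
    }
)

# Count how many paths use each (pred1, pred2) combination.
pred_counts = (
    sample.groupby(["pred1", "pred2"]).size().sort_values(ascending=False)
)
print(pred_counts)
```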


Let's examine how many unique Parkinson disease - gene - drug paths there are:

In [6]:
dfFiltUnique = dfFilt[["input","node1_name","output_name"]].drop_duplicates()
dfFiltUnique

| | input | node1_name | output_name |
|---|---|---|---|
| 34 | Parkinson disease | MAPK8 | VITAMIN E |
| 69 | Parkinson disease | FYN | ERGOTAMINE |
| 98 | Parkinson disease | SLC6A2 | QUETIAPINE |
| 145 | Parkinson disease | ADORA1 | 8-CYCLOPENTYLTHEOPHYLLINE |
| 202 | Parkinson disease | DRD2 | PERPHENAZINE |
| ... | ... | ... | ... |
| 92886 | Parkinson disease | CYP2D6 | CLOMIPRAMINE |
| 92922 | Parkinson disease | SLC6A2 | TRODUSQUEMINE |
| 92979 | Parkinson disease | CA2 | InChI=1S/C14H17N3O5S3/c1-16-12-8-17(9-4-3-5-10... |
| 93021 | Parkinson disease | RHO | InChI=1S/C13H26O5S/c1-2-3-4-5-6-7-19-13-12(17)... |


## Results

Finally, let's sort the drugs by the number of genes that link them to Parkinson's disease.

In [9]:
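import pandas as pd

# For each drug, collect the linking genes as a comma-separated string
genes = dfFiltUnique.groupby(['output_name'])['node1_name'].apply(','.join)
# For each drug, count how many genes link it to the disease
count = dfFiltUnique.groupby(['output_name'])['node1_name'].count()
result = pd.DataFrame({'genes': genes, 'count': count})

# Show the 30 drugs supported by the most genes
result.sort_values("count", ascending=False).head(30)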

| output_name | genes | count |
|---|---|---|
| ZINC CHLORIDE | CA2,SLC6A2,GSN,MT2A,TP53,FYN,UTRN,PON1,APP | 9 |
| InChI=1S/Cu | PON1,APP,BDNF,PARK7,PRDX2,GSN,SNCA,HSPA8 | 8 |
| TAMOXIFEN | ADORA1,LRRK2,FYN,NFE2L2,TP53,MAPK8 | 6 |
| HALOPERIDOL | BDNF,HSPA4,DRD2,SLC18A2,TP53,NR4A1 | 6 |
| ZINC ACETATE | MT2A,PON1,APP,GSN,UTRN,TP53 | 6 |
| PACLITAXEL | TP53,FYN,PTEN,MAPT,NAT2,BDNF | 6 |
| InChI=1S/Zn | MT2A,PON1,APP,UTRN,GSN,TP53 | 6 |
| RESVERATROL | AKT1,SNCA,SIRT1,MTNR1B,APP | 5 |
| LEVODOPA | FYN,BDNF,NR4A1,DRD2,COMT | 5 |
| QUERCETIN | AKT1,GABPA,CDK5R1,CA2,ADORA1 | 5 |
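
The same gene list and count can also be produced in a single pass with pandas named aggregation. The sketch below is one possible alternative, not the notebook's original method. It runs on a small stand-in for `dfFiltUnique` (`pairs`, a hypothetical name) containing only a few gene-drug pairs from the table above, so its counts are smaller than the real ones.

```python
import pandas as pd

# Stand-in for `dfFiltUnique`, using a few gene-drug pairs from the table above.
pairs = pd.DataFrame(
    {
        "node1_name": ["DRD2", "COMT", "DRD2", "TP53", "SIRT1"],
        "output_name": [
            "LEVODOPA",
            "LEVODOPA",
            "HALOPERIDOL",
            "HALOPERIDOL",
            "RESVERATROL",
        ],
    }
)

# One groupby computes both the sorted gene list and the distinct-gene count.
summary = (
    pairs.groupby("output_name")
    .agg(
        genes=("node1_name", lambda s: ",".join(sorted(set(s)))),
        count=("node1_name", "nunique"),
    )
    .sort_values("count", ascending=False)
)
print(summary)
```

Using `nunique` rather than `count` guards against double counting if the input has not been de-duplicated beforehand.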


While the list above could clearly benefit from more filtering and sorting, the table draws on a wide range of information from our distributed knowledge graph to suggest potentially testable hypotheses. For more details on any individual drug candidate, we can again query the original BTE results. For example, here we examine the evidence behind the link between Parkinson's disease and the drug nintedanib.

In [10]:
df[df['output_name'] == 'NINTEDANIB']

| | input | input_type | pred1 | pred1_source | pred1_api | pred1_pubmed | node1_id | node1_name | node1_type | pred2 | pred2_source | pred2_api | pred2_pubmed | output_id | output_name | output_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9105 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | disgenet | mydisease.info | | entrez:3643 | INSR | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 11417 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | semmed | semmeddisease | | entrez:5979 | RET | Gene | targetedBy | dgidb | dgidb_gene2chemical | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 11418 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | semmed | semmeddisease | | entrez:5979 | RET | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 43775 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | biolink | biolink_disease2gene | | entrez:22848 | AAK1 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 47749 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | disgenet | mydisease.info | | entrez:3480 | IGF1R | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 65089 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | biolink | biolink_disease2gene | | entrez:120892 | LRRK2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 65090 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | biolink | biolink_disease2gene | | entrez:120892 | LRRK2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 65091 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | disgenet | mydisease.info | | entrez:120892 | LRRK2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 65092 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | semmed | semmeddisease | | entrez:120892 | LRRK2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
| 65093 | Parkinson disease | DiseaseOrPhenotypicFeature | associatedWith | semmed | semmeddisease | | entrez:120892 | LRRK2 | Gene | targetedBy | mychem.info | mychem.info | | chembl:CHEMBL502835 | NINTEDANIB | ChemicalSubstance |
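
Several genes appear more than once in the table above because the same edge is reported by different sources. As a rough sketch, the rows can be condensed into one line per gene that lists the distinct sources behind each edge. The data frame below is transcribed from the table above (selected columns only); `evidence` and `by_gene` are hypothetical names, and on the real data the same aggregation would be applied to `df[df['output_name'] == 'NINTEDANIB']`.

```python
import pandas as pd

# Rows transcribed from the NINTEDANIB table above (selected columns only).
evidence = pd.DataFrame(
    [
        ("INSR", "disgenet", "mychem.info"),
        ("RET", "semmed", "dgidb"),
        ("RET", "semmed", "mychem.info"),
        ("AAK1", "biolink", "mychem.info"),
        ("IGF1R", "disgenet", "mychem.info"),
        ("LRRK2", "biolink", "mychem.info"),
        ("LRRK2", "biolink", "mychem.info"),
        ("LRRK2", "disgenet", "mychem.info"),
        ("LRRK2", "semmed", "mychem.info"),
        ("LRRK2", "semmed", "mychem.info"),
    ],
    columns=["node1_name", "pred1_source", "pred2_source"],
)


def distinct_sources(values):
    """Join the distinct values of a column into a sorted, readable string."""
    return ", ".join(sorted(set(values)))


# For each gene, list the distinct sources behind each edge of the path.
by_gene = evidence.groupby("node1_name").agg(
    disease_gene_sources=("pred1_source", distinct_sources),
    gene_drug_sources=("pred2_source", distinct_sources),
)
print(by_gene)
```

In this sample, LRRK2 is the only gene whose disease association is reported by all three disease-gene sources, which is the kind of pattern this view makes easier to spot.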
