# Introduction

<img src="img/tidbit2.png" width=300 style="float: right;"/>
This notebook demonstrates how BioThings Explorer can be used to answer the following query: 

> *"What existing drugs might be used to treat Parkinson's disease based on an intermediate protein?"*

This query corresponds to [Tidbit 2](https://ncats.nih.gov/tidbit/tidbit_02.html), which was formulated as a demonstration of the NCATS Translator program.

**Background**: BioThings Explorer is an engine for autonomously querying a distributed knowledge graph. It can answer two classes of queries: "EXPLAIN" and "PREDICT". EXPLAIN queries are described in [EXPLAIN_demo.ipynb](EXPLAIN_demo.ipynb), and PREDICT queries are described in [PREDICT_demo.ipynb](PREDICT_demo.ipynb). A more detailed overview of the BioThings Explorer system is provided in [these slides](https://docs.google.com/presentation/d/1QWQqqQhPD_pzKryh6Wijm4YQswv8pAjleVORCPyJyDE/edit?usp=sharing).


## Step 1: Find representation of "Parkinson disease" in BTE

In this step, BioThings Explorer translates our query string "Parkinson disease" into BioThings objects, which contain mappings to many common identifiers. Generally, the top result returned by the `Hint` module will be the correct item, but you should confirm this by checking the identifiers shown in the output.

Search terms can correspond to any child of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `DiseaseOrPhenotypicFeature` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle").

In [1]:
from biothings_explorer.hint import Hint
ht = Hint()
parkDis = ht.query("Parkinson disease")['DiseaseOrPhenotypicFeature'][0]

In [2]:
parkDis

{'mondo': 'MONDO:0005180',
 'doid': 'DOID:14330',
 'umls': 'C0030567',
 'mesh': 'D010300',
 'name': 'Parkinson disease',
 'display': 'mondo(MONDO:0005180) doid(DOID:14330) umls(C0030567) mesh(D010300) name(Parkinson disease) ',
 'type': 'DiseaseOrPhenotypicFeature',
 'primary': {'identifier': 'mondo',
  'cls': 'DiseaseOrPhenotypicFeature',
  'value': 'MONDO:0005180'}}
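
To make the "confirm using the identifiers" step concrete, the sketch below lists every candidate with its primary identifier so the top hit can be checked by eye. It runs on a mock dictionary rather than a live `Hint` call. The mock assumes the result is keyed by semantic type with a ranked list of hits per type, as the indexing in cell 1 suggests. The helper name `list_candidates`, the second hit, and the empty `Gene` key are hypothetical and used only for illustration.

```python
def list_candidates(hint_result):
    """Return (semantic type, rank, name, primary ID) for every candidate hit."""
    rows = []
    for semantic_type, hits in hint_result.items():
        for rank, hit in enumerate(hits):
            primary = hit.get("primary", {})
            rows.append((semantic_type, rank, hit.get("name"), primary.get("value")))
    return rows


# Mock result: the first hit copies fields from the output above; the second
# hit and the empty "Gene" list are made-up placeholders (assumptions).
mock_result = {
    "DiseaseOrPhenotypicFeature": [
        {
            "name": "Parkinson disease",
            "primary": {
                "identifier": "mondo",
                "cls": "DiseaseOrPhenotypicFeature",
                "value": "MONDO:0005180",
            },
        },
        {
            "name": "hypothetical second hit",
            "primary": {
                "identifier": "mondo",
                "cls": "DiseaseOrPhenotypicFeature",
                "value": "MONDO:0000000",
            },
        },
    ],
    "Gene": [],
}

for row in list_candidates(mock_result):
    print(row)
```

With a live session, the same helper could presumably be applied to the dictionary returned by `ht.query("Parkinson disease")`.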

## Step 2: Find drugs that are associated with genes which are involved in Parkinson disease

In this section, we find all paths in the knowledge graph that connect Parkinson disease to any chemical compound through one intermediate gene node. To do that, we will use `FindConnection`. This class is a convenient wrapper around two advanced functions for **query path planning** and **query path execution**. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months.
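
As a conceptual illustration only, the sketch below enumerates two-hop paths (disease, gene, chemical) over a tiny in-memory edge list. It is not how BioThings Explorer is implemented: BTE plans and dispatches API calls against a distributed knowledge graph, whereas this toy simply walks a local list. The function name `two_hop_paths` is hypothetical, the gene and drug edges are a small subset of the results shown later in this notebook, and the pathway edge is a made-up distractor.

```python
from collections import defaultdict


def two_hop_paths(edges, node_types, start, intermediate_type, output_type):
    """Enumerate start -> intermediate -> output paths with type constraints."""
    out_edges = defaultdict(list)
    for subj, pred, obj in edges:
        out_edges[subj].append((pred, obj))

    paths = []
    for pred1, mid in out_edges[start]:
        if node_types.get(mid) != intermediate_type:
            continue  # intermediate node has the wrong type
        for pred2, end in out_edges[mid]:
            if node_types.get(end) == output_type:
                paths.append((start, pred1, mid, pred2, end))
    return paths


# Toy graph; the pathway node and its predicate are hypothetical distractors.
node_types = {
    "Parkinson disease": "DiseaseOrPhenotypicFeature",
    "LRRK2": "Gene",
    "DRD2": "Gene",
    "hypothetical pathway": "Pathway",
    "NINTEDANIB": "ChemicalSubstance",
    "SUMANIROLE": "ChemicalSubstance",
}
edges = [
    ("Parkinson disease", "causedBy", "LRRK2"),
    ("Parkinson disease", "causedBy", "DRD2"),
    ("Parkinson disease", "hypotheticalPredicate", "hypothetical pathway"),
    ("LRRK2", "targetedBy", "NINTEDANIB"),
    ("DRD2", "targetedBy", "SUMANIROLE"),
]

for path in two_hop_paths(edges, node_types, "Parkinson disease", "Gene", "ChemicalSubstance"):
    print(" -> ".join(path))
```

The type constraint on the middle node plays the role that `intermediate_nodes=['Gene']` plays in the `FindConnection` call below.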


In [3]:
from biothings_explorer.user_query_dispatcher import FindConnection

In [4]:
fc = FindConnection(input_obj=parkDis, output_obj='ChemicalSubstance', intermediate_nodes=['Gene'])

In [5]:
fc.connect(verbose=True)


BTE will find paths that join 'Parkinson disease' and 'ChemicalSubstance'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: Gene





==== Step #1: Query path planning ====

Because Parkinson disease is of type 'DiseaseOrPhenotypicFeature', BTE will query our meta-KG for APIs that can take 'DiseaseOrPhenotypicFeature' as input

BTE found 3 apis:

API 1. mydisease.info(1 API call)
API 2. semmeddisease(1 API call)
API 3. biolink_disease2gene(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 1.1: http://mydisease.info/v1/query (POST "q=C0030567&scopes=mondo.xrefs.umls,disgenet.xrefs.umls&fields=disgenet.genes_related_to_disease&species=human&size=100")
API 2.1: http://pending.biothings.io/semmed/query (POST "q=C0030567&scopes=umls&fields=AFFECTS.protein,AFFECTS.gene,ASSOCIATED_WITH.gene,CAUSES_reverse.gene,AFFECTS_reverse.protein,AFFECTS_

API 3.68: http://www.dgidb.org/api/v2/interactions.json?genes=FTL
API 3.66: http://www.dgidb.org/api/v2/interactions.json?genes=PARK16
API 3.67: http://www.dgidb.org/api/v2/interactions.json?genes=IGBP1
API 3.71: http://www.dgidb.org/api/v2/interactions.json?genes=TFEB
dgidb_gene2chemical failed
API 3.82: http://www.dgidb.org/api/v2/interactions.json?genes=GIGYF2
API 3.79: http://www.dgidb.org/api/v2/interactions.json?genes=PLEKHM1
API 3.83: http://www.dgidb.org/api/v2/interactions.json?genes=CTC1
dgidb_gene2chemical failed
dgidb_gene2chemical failed
dgidb_gene2chemical failed
API 3.89: http://www.dgidb.org/api/v2/interactions.json?genes=GSTM1
API 3.92: http://www.dgidb.org/api/v2/interactions.json?genes=MIR185
API 3.86: http://www.dgidb.org/api/v2/interactions.json?genes=PLPPR1
API 3.73: http://www.dgidb.org/api/v2/interactions.json?genes=FBXO7
API 3.87: http://www.dgidb.org/api/v2/interactions.json?genes=NQO1
dgidb_gene2chemical failed
API 3.74: http://www.dgidb.org/api/v2/interactio

API 3.240: http://www.dgidb.org/api/v2/interactions.json?genes=HMOX1
API 3.242: http://www.dgidb.org/api/v2/interactions.json?genes=HTR2A-AS1
API 3.234: http://www.dgidb.org/api/v2/interactions.json?genes=IGKV2-14
API 3.245: http://www.dgidb.org/api/v2/interactions.json?genes=HSPA1A
API 3.246: http://www.dgidb.org/api/v2/interactions.json?genes=XK
API 3.220: http://www.dgidb.org/api/v2/interactions.json?genes=IGF1R
API 3.251: http://www.dgidb.org/api/v2/interactions.json?genes=ATP7A
API 3.252: http://www.dgidb.org/api/v2/interactions.json?genes=ATF4
API 3.258: http://www.dgidb.org/api/v2/interactions.json?genes=RPL3
API 3.266: http://www.dgidb.org/api/v2/interactions.json?genes=SMAD3
API 3.271: http://www.dgidb.org/api/v2/interactions.json?genes=NUCKS1
API 3.273: http://www.dgidb.org/api/v2/interactions.json?genes=SLC41A1
API 3.262: http://www.dgidb.org/api/v2/interactions.json?genes=CASC16
API 3.263: http://www.dgidb.org/api/v2/interactions.json?genes=EBP
API 3.275: http://www.dgidb.o

API 2.1: http://mychem.info/v1/query (POST "q=WBP1L,DDIT4,SCN2A,INS,FYN,TMPRSS9,TRH,SKP1,CHCHD2,DNAJA3,EPHB1,DENR,TWSG1,AIF1,DSG3,ATXN2,TMEM229B,ZNF646,DNAJC6,WNT3,HEXIM2,VPS26A,LINC02451,TCEANC2,LRPPRC,TAZ,MTERF4,PDSS2,BAG2,SNCA-AS1,NDUFAF2,CP,PNPLA8,COX17,TIRAP,EIF4G1,CAT,PARK16,FTL,FBXO7,PRDX2,PDE10A,PLEKHM1,KLHDC1,NQO1,SIPA1L2,GPR37,NSF,IGF2,CHL1,TARDBP,SH3GL2,ODAPH,RAB39B,SLC2A13,SYNJ1,GDNF,CA8,AKT1,TALDO1,PITX3,TMC3,TPPP,ZP3,MOG,GPR65,HSPA4,FUS,ARHGAP24,RHOD,BIN3,FKBP1AP2,HDAC6,LY6E,ASPRV1,GABPA,IREB2,STAP1,COL13A1,EGF,PGAM5,TNS1,EDN1,LINGO1,CA2,GRK2,COX7B,CHRNA4,CNR2,ZCWPW2,DGKQ,RIT2,FBP1,UCN2,MCCC1,MSX1,MAPT-AS1,MFN1,HSPA5,TXNIP,MDGA2,SPTSSB,TWNK,SPPL2C,IGKV2-14,SREBF1,TMED9,TRPS1,XK,BRINP1,HTR2A-AS1,HSPA1A,BCKDK,EPHX1,ATF4,SEMA5A,KCNJ4,ND2,RPL3,LTB,ATXN3,CDC42,CASC16,CTSB,CLRN3,P2RX7,FGB,TMEM189,ARL6IP5,SPHK2,GBF1,HSPA9,KITLG,HGF,HLA-DRA,SNCA,SLC6A2,PSMC1,MSMB,CYP17A1,GFPT1,TBC1D5,ITGA8,COX6A1,BAX,FKBP1AP1,ABCB6,GBA,SQSTM1,KCNIP4,CEBPZ,WASH6P,DLG2,NEDD4,SOD1,INPP5F,RPL6,NAT2,R

API 3.49 dgidb_gene2chemical: No hits
API 3.50 dgidb_gene2chemical: No hits
API 3.51 dgidb_gene2chemical: No hits
API 3.52 dgidb_gene2chemical: No hits
API 3.53 dgidb_gene2chemical: 3 hits
API 3.54 dgidb_gene2chemical: 19 hits
API 3.55 dgidb_gene2chemical: No hits
API 3.56 dgidb_gene2chemical: 2 hits
API 3.57 dgidb_gene2chemical: No hits
API 3.58 dgidb_gene2chemical: No hits
API 3.59 dgidb_gene2chemical: No hits
API 3.60 dgidb_gene2chemical: No hits
API 3.61 dgidb_gene2chemical: No hits
API 3.62 dgidb_gene2chemical: No hits
API 3.63 dgidb_gene2chemical: No hits
API 3.64 dgidb_gene2chemical: No hits
API 3.65 dgidb_gene2chemical: 1 hits
API 3.66 dgidb_gene2chemical: No hits
API 3.67 dgidb_gene2chemical: No hits
API 3.68 dgidb_gene2chemical: 2 hits
API 3.69 dgidb_gene2chemical: No hits
API 3.70 dgidb_gene2chemical: 22 hits
API 3.71 dgidb_gene2chemical: No hits
API 3.72 dgidb_gene2chemical: 77 hits
API 3.73 dgidb_gene2chemical: No hits
API 3.74 dgidb_gene2chemical: No hits
API 3.75 dgidb_g

API 3.265 dgidb_gene2chemical: 104 hits
API 3.266 dgidb_gene2chemical: 4 hits
API 3.267 dgidb_gene2chemical: No hits
API 3.268 dgidb_gene2chemical: No hits
API 3.269 dgidb_gene2chemical: No hits
API 3.270 dgidb_gene2chemical: No hits
API 3.271 dgidb_gene2chemical: No hits
API 3.272 dgidb_gene2chemical: 71 hits
API 3.273 dgidb_gene2chemical: No hits
API 3.274 dgidb_gene2chemical: 17 hits
API 3.275 dgidb_gene2chemical: 6 hits
API 3.276 dgidb_gene2chemical: No hits
API 3.277 dgidb_gene2chemical: No hits
API 3.278 dgidb_gene2chemical: No hits
API 3.279 dgidb_gene2chemical: No hits
API 3.280 dgidb_gene2chemical: No hits
API 3.281 dgidb_gene2chemical: No hits
API 3.282 dgidb_gene2chemical: No hits
API 3.283 dgidb_gene2chemical: 1 hits
API 3.284 dgidb_gene2chemical: 17 hits
API 3.285 dgidb_gene2chemical: No hits
API 3.286 dgidb_gene2chemical: 13 hits
API 3.287 dgidb_gene2chemical: 18 hits
API 3.288 dgidb_gene2chemical: No hits
API 3.289 dgidb_gene2chemical: No hits
API 3.290 dgidb_gene2chemic
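
The verbose log above is long, so a per-API tally can be easier to read. The sketch below parses lines in the two formats visible in the log (`API x.y <name>: N hits` or `No hits`, and `<name> failed`) and counts them. It assumes the log text has already been captured as a string; how to capture it from `fc.connect(verbose=True)` is not covered here. The function name `tally_log` and the sample lines are illustrative, and lines that match neither pattern (such as truncated ones) are ignored.

```python
import re
from collections import Counter

# Patterns modeled on the log lines shown above.
HIT_LINE = re.compile(r"^API (\d+)\.(\d+) (\S+): (No hits|(\d+) hits)$")
FAIL_LINE = re.compile(r"^(\S+) failed$")


def tally_log(lines):
    """Return (total hits, empty responses, failures), each keyed by API name."""
    hits, empty, failed = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        match = HIT_LINE.match(line)
        if match:
            api = match.group(3)
            if match.group(5) is None:
                empty[api] += 1  # "No hits"
            else:
                hits[api] += int(match.group(5))
            continue
        match = FAIL_LINE.match(line)
        if match:
            failed[match.group(1)] += 1
    return hits, empty, failed


# A few lines copied from the log above, including one truncated line.
sample = """\
API 3.52 dgidb_gene2chemical: No hits
API 3.53 dgidb_gene2chemical: 3 hits
API 3.54 dgidb_gene2chemical: 19 hits
dgidb_gene2chemical failed
API 3.75 dgidb_g
""".splitlines()

hits, empty, failed = tally_log(sample)
print(dict(hits), dict(empty), dict(failed))
```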

In [6]:
df = fc.display_table_view()

The `df` object contains the full output from BioThings Explorer. Each row shows one path that joins the input node (Parkinson disease) to an intermediate node (a gene or protein) and then to an ending node (a chemical compound). The data frame includes a set of columns with additional details on each node and edge (including human-readable labels, identifiers, and sources). Let's remove all rows where `output_name` (the compound label) is `None`, and keep only paths with the mechanistic predicates `causedBy` (disease to gene) and `targetedBy` (gene to compound).

In [105]:
dfFilt = df.loc[df['output_name'].notnull()].query('pred1 == "causedBy" and pred2 == "targetedBy"')
dfFilt

index,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_id,node1_name,node1_type,pred2,pred2_source,pred2_api,pred2_pubmed,output_id,output_name,output_type
33,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:7157,TP53,Gene,targetedBy,dgidb,dgidb_gene2chemical,,chembl:CHEMBL1421,DASATINIB,ChemicalSubstance
113,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:1813,DRD2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL419792,SUMANIROLE,ChemicalSubstance
123,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:351,APP,Gene,targetedBy,dgidb,dgidb_gene2chemical,,chembl:CHEMBL1743058,PONEZUMAB,ChemicalSubstance
151,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:6530,SLC6A2,Gene,targetedBy,dgidb,dgidb_gene2chemical,,chembl:CHEMBL1201260,BETHANIDINE,ChemicalSubstance
152,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:6530,SLC6A2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL1201260,BETHANIDINE,ChemicalSubstance
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42407,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:10,NAT2,Gene,targetedBy,dgidb,dgidb_gene2chemical,,chembl:CHEMBL446,SULFAMETHAZINE,ChemicalSubstance
42497,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:1312,COMT,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL1454910,NITROXOLINE,ChemicalSubstance
42498,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:1312,COMT,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL1454910,NITROXOLINE,ChemicalSubstance
42504,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:7157,TP53,Gene,targetedBy,dgidb,dgidb_gene2chemical,,chembl:CHEMBL2103879,GANETESPIB,ChemicalSubstance
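
Before settling on one predicate pair, it can help to see which `pred1`/`pred2` combinations occur and how often. The sketch below tabulates them on a small toy frame and then applies the same filter as the cell above, written as a boolean mask instead of `.query`. The toy rows reuse column names from `df`; the `GENE_X` row with a missing compound name is a made-up placeholder, and the counts say nothing about the real results.

```python
import pandas as pd

# Toy stand-in for `df`; the GENE_X row is hypothetical.
toy = pd.DataFrame(
    {
        "pred1": ["causedBy", "associatedWith", "causedBy", "causedBy"],
        "pred2": ["targetedBy", "targetedBy", "targetedBy", "targetedBy"],
        "node1_name": ["LRRK2", "LRRK2", "DRD2", "GENE_X"],
        "output_name": ["NINTEDANIB", "NINTEDANIB", "SUMANIROLE", None],
    }
)

# How often does each (pred1, pred2) combination occur?
pair_counts = toy.groupby(["pred1", "pred2"]).size().sort_values(ascending=False)
print(pair_counts)

# Same filter as the notebook cell, expressed as a boolean mask.
mask = (
    toy["output_name"].notnull()
    & (toy["pred1"] == "causedBy")
    & (toy["pred2"] == "targetedBy")
)
filtered = toy[mask]
print(filtered)
```

The mask form is easier to extend with conditions that are awkward to express in a query string, such as membership tests against a Python list.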


Let's examine the unique `Parkinson disease` - `Gene` - `Drug` paths, collapsing rows that differ only in their source or API:

In [110]:
dfFiltUnique = dfFilt[["input","node1_name","output_name"]].drop_duplicates()
dfFiltUnique

index,input,node1_name,output_name
33,Parkinson disease,TP53,DASATINIB
113,Parkinson disease,DRD2,SUMANIROLE
123,Parkinson disease,APP,PONEZUMAB
151,Parkinson disease,SLC6A2,BETHANIDINE
238,Parkinson disease,BDNF,DOXORUBICIN
...,...,...,...
42404,Parkinson disease,BDNF,K-252A
42407,Parkinson disease,NAT2,SULFAMETHAZINE
42497,Parkinson disease,COMT,NITROXOLINE
42504,Parkinson disease,TP53,GANETESPIB
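
To turn the table into actual counts, one option is to take the length of the de-duplicated frame along with the number of distinct genes and drugs. The sketch below does this on a handful of rows copied from the tables above, so the printed numbers describe only this toy subset, not the full result.

```python
import pandas as pd

# A few rows from the filtered table; two pairs repeat because they were
# reported by more than one source or record.
paths = pd.DataFrame(
    {
        "input": ["Parkinson disease"] * 6,
        "node1_name": ["SLC6A2", "SLC6A2", "COMT", "COMT", "TP53", "TP53"],
        "output_name": [
            "BETHANIDINE",
            "BETHANIDINE",
            "NITROXOLINE",
            "NITROXOLINE",
            "DASATINIB",
            "GANETESPIB",
        ],
    }
)

unique_paths = paths.drop_duplicates()
summary = {
    "paths": len(unique_paths),
    "genes": unique_paths["node1_name"].nunique(),
    "drugs": unique_paths["output_name"].nunique(),
}
print(summary)
```

Applied to the notebook's data, the same three expressions would be evaluated on `dfFiltUnique`.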


Next, let's sort the drugs by the number of genes that link them to Parkinson disease.

In [113]:
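import pandas as pd

# For each drug, concatenate the names of the genes that link it to the disease
genes = dfFiltUnique.groupby(['output_name'])['node1_name'].apply(','.join)
# Count the linking genes per drug
count = dfFiltUnique.groupby(['output_name'])['node1_name'].count()
result = pd.DataFrame({ 'genes': genes, 'count': count } )

# Show the 30 drugs supported by the most genes
result.sort_values("count", ascending=False).head(30)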

output_name,genes,count
ZINC CHLORIDE,"FYN,CA2,APP,MT2A,PON1,GSN,UTRN,SLC6A2,TP53",9
TAMOXIFEN,"ADORA1,MAPK8,TP53,NFE2L2,FYN,LRRK2",6
PACLITAXEL,"NAT2,FYN,PTEN,TP53,BDNF,MAPT",6
ZINC ACETATE,"MT2A,PON1,UTRN,TP53,APP,GSN",6
DESIPRAMINE,"BDNF,SLC6A2,MC1R,CYP2D6,DRD2",5
DOCETAXEL,"NAT2,GSTM1,MAPT,ATP7A,TP53",5
ERLOTINIB,"CYP2D6,LRRK2,EPHB1,TP53,PTEN",5
RESVERATROL,"SNCA,AKT1,SIRT1,MTNR1B,APP",5
NINTEDANIB,"LRRK2,DAPK2,EPHB1,FYN,MAPK8",5
FOSTAMATINIB,"FYN,PAK1,LRRK2,EPHB1,DAPK2",5


The list above would clearly benefit from further filtering and sorting, but it already draws on a wide range of information from our distributed knowledge graph and suggests testable hypotheses. For more details on any individual drug candidate, we can again query the original BTE results:

In [117]:
df[df['output_name'] == 'NINTEDANIB']

index,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_id,node1_name,node1_type,pred2,pred2_source,pred2_api,pred2_pubmed,output_id,output_name,output_type
1313,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1314,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1315,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,disgenet,mydisease.info,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1316,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1317,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1318,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1319,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1320,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1321,Parkinson disease,DiseaseOrPhenotypicFeature,causedBy,semmed,semmeddisease,,entrez:120892,LRRK2,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
1585,Parkinson disease,DiseaseOrPhenotypicFeature,associatedWith,biolink,biolink_disease2gene,,entrez:22848,AAK1,Gene,targetedBy,mychem.info,mychem.info,,chembl:CHEMBL502835,NINTEDANIB,ChemicalSubstance
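
Because the raw rows for a single drug contain many repeats, a per-gene evidence summary can be easier to scan. The sketch below removes duplicate rows and then lists, for each gene, the distinct predicates and sources that connect it to the disease. It uses a few rows modeled on the NINTEDANIB table above; the helper `join_unique` and the output column names are hypothetical choices, not part of BioThings Explorer.

```python
import pandas as pd


def join_unique(values):
    """Comma-join the sorted distinct values of a column."""
    return ",".join(sorted(set(values)))


# Rows modeled on the NINTEDANIB output above (one exact duplicate kept on purpose).
rows = pd.DataFrame(
    {
        "pred1": ["associatedWith", "associatedWith", "associatedWith",
                  "associatedWith", "causedBy", "associatedWith"],
        "pred1_source": ["biolink", "disgenet", "semmed",
                         "semmed", "semmed", "biolink"],
        "node1_name": ["LRRK2", "LRRK2", "LRRK2", "LRRK2", "LRRK2", "AAK1"],
        "pred2_source": ["mychem.info"] * 6,
        "output_name": ["NINTEDANIB"] * 6,
    }
)

evidence = (
    rows.drop_duplicates()
    .groupby("node1_name")
    .agg(
        predicates=("pred1", join_unique),
        sources=("pred1_source", join_unique),
        n_rows=("pred1", "count"),
    )
)
print(evidence)
```

In the notebook, `rows` would be replaced by `df[df['output_name'] == 'NINTEDANIB']`.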
