# CX Work-In-Progress (WIP) Notebook

This is where I play with BioThings Explorer (BTE) features, troubleshoot issues, and build functions. Other notebooks in this folder include specific use cases (drafts of chained query logic, result post-processing, corroborating evidence for results) and demos. 

Note: are there still issues with interpreting UMLS results? I notice that in the PREDICT and EXPLAIN demo notebooks, output objects with UMLS IDs are removed. 

## Documentation (demo intro?):         

BioThings Explorer is an engine for autonomously querying a distributed knowledge graph of biomedical knowledge.

Currently, the user must set the input_obj parameter of FindConnection() as **one specific BioThings object** (ex: one chemical, phenotype, disease, GO term, etc.). The output_obj parameter of FindConnection() can be one specific BioThings object (for EXPLAIN class queries) OR one BioThings type (for PREDICT class queries). See EXPLAIN vs PREDICT below.  

BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. 

**Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

## Overview of a basic query

Import the relevant modules:
* **Hint**: Find corresponding BioThings representations to use in BioThings Explorer based on user input (could be any database IDs, symbols, names)
* **FindConnection**: Find the relationship(s) between a specific entity and an entity type (PREDICT) or between two specific entities (EXPLAIN)

The steps for one query currently are (current for master version as of 2020-08-13):  
1. **Use Hint module**: Find corresponding BioThings representations to use in BioThings Explorer based on user input (could be any database IDs, symbols, names). The input into the Hint query should be one string with at least one whole word. The BioThings objects include identifiers (IDs) that will help with querying knowledge sources (APIs). 
2. **Use FindConnection module**: BTE then finds ways to chain knowledge-source searches (API calls) to find the relationship(s) between a specific entity and an entity type (PREDICT) or between two specific entities (EXPLAIN)
    * BTE may be able to find direct connections between objects. All of these are one-way/directed connections (input_obj -> output_obj). 
        - Example: the input_obj is "chronic myelogenous leukemia" (a disease), and the output_obj is "PhenotypicFeature" (a biomedical entity type). BTE will find knowledge sources with disease (input) -> phenotype (output). 
        - **Done by setting the intermediate_node parameter to None or [] (empty list)**
    * The user can set intermediate node(s), chaining queries and finding indirect connections between objects. Note that this behavior is very different depending on whether the output_obj is a specific biomedical entity or an entity type (see EXPLAIN vs PREDICT below).
3. BTE constructs the knowledge-source searches (queries the APIs), sends out the queries, receives the results, and does some pre-processing. 
4. The results can be explored in a table (pandas dataframe object) or exported as a graph (graphml file) or JSON file (normal or ReasonerStdAPI format). 
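
Step 2 hinges on what gets passed as output_obj. Here is a minimal sketch of that distinction. It assumes that a specific BioThings object is a dictionary (the Hint results in this notebook are dictionaries of lists of dictionaries) and that an entity type is a plain string. `query_class` is a hypothetical helper of mine, not part of biothings_explorer.

```python
def query_class(output_obj):
    """Return 'EXPLAIN' or 'PREDICT' for a given output_obj (hypothetical helper)."""
    if isinstance(output_obj, dict):
        ## a specific BioThings object, e.g. one element of a Hint result list
        return 'EXPLAIN'
    if isinstance(output_obj, str):
        ## a BioThings type name such as 'Gene' or 'PhenotypicFeature'
        return 'PREDICT'
    raise TypeError('output_obj should be a Hint object (dict) or a type name (str)')


## toy stand-in for a Hint object; only the keys used in this notebook are shown
example_obj = {'name': 'Dravet syndrome', 'type': 'Disease'}
print(query_class(example_obj))
print(query_class('PhenotypicFeature'))
```

Anything else (for example a list of type names) is rejected up front with a readable message.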

## EXPLAIN vs PREDICT

As mentioned above, relationships between objects are one-way. For example, disease -> phenotype asks: if BTE puts a specific disease's name into a knowledge source, can it retrieve phenotypic information? 

But what happens if only indirect relationships exist and BTE has to chain queries together? This is the current situation:

### Intermediate nodes cannot be specific biomedical entities. They must be entity type(s). The acceptable format to specify this depends on what the output_obj is.

### EXPLAIN: If your output_obj is a specific biomedical entity too

This is known as an "Explain"-type query. 

BTE will only accept one intermediate node. The intermediate node parameter can be a single entity type (string, example: 'BiologicalEntity') or a list of acceptable entity types (list of strings, example: ['Gene', 'AnatomicalEntity']). 

BTE will query from input_obj -> intermediate and from output_obj -> intermediate, then join results based on shared intermediates. The results should therefore be read as input_obj -> intermediate <- output_obj.  

If the intermediate node parameter is a list of acceptable types, BTE will look for knowledge sources that can handle at least one of the types listed.   
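
The join described above can be sketched with plain dictionaries. This is only an illustration of the "shared intermediates" idea, not BTE's actual implementation. The function name, the data structure, and the IDs and predicates are all made up.

```python
def join_on_shared_intermediates(from_input, from_output):
    """
    from_input / from_output: dicts mapping an intermediate node ID to the
    predicate linking it to the input or output entity (toy structure).
    returns: list of (intermediate, pred_from_input, pred_from_output) tuples
    """
    shared = sorted(set(from_input) & set(from_output))
    return [(node, from_input[node], from_output[node]) for node in shared]


## made-up IDs and predicates, for illustration only
input_hits = {'GENE:A': 'related_to', 'GENE:B': 'caused_by'}
output_hits = {'GENE:B': 'treated_by', 'GENE:C': 'related_to'}

for node, pred1, pred2 in join_on_shared_intermediates(input_hits, output_hits):
    print('input_obj -[{p1}]-> {n} <-[{p2}]- output_obj'.format(p1=pred1, n=node, p2=pred2))
```

Only intermediates reached from both sides survive the join, which is why the results read as input_obj -> intermediate <- output_obj.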

### PREDICT: If your output_obj is an entity type

This is known as a "Predict"-type query. 

Currently, the user must set the input_obj parameter of FindConnection() as **one specific BioThings object** (ex: one chemical, phenotype, disease, GO term, etc.). To do a Predict-type query, the output_obj parameter of FindConnection() must be **one BioThings type**. 

BTE will accept one or more intermediate nodes (example: the chain of queries is then A -> B -> C -> D). **However, having more than one intermediate node currently (as of 2020-08-12) leads to slow performance and an overwhelming number of results.**  

Intermediate nodes cannot be specific biomedical entities. They must be BioThings type(s). The **intermediate node parameter** is a list, where each element is a string (one acceptable entity type) or a tuple of strings (describing the acceptable types) for one node. 
* [('AnatomicalEntity', 'CellularComponent')] means one intermediate node that can be an anatomical entity or a cellular component (GO term)
* [('AnatomicalEntity', 'CellularComponent'), 'CellularComponent'] means two intermediate nodes, so the chain of queries is A -> B -> C -> D. The first intermediate node (B) can be either of the two types, while the second intermediate node (C) MUST be a cellular component (GO term). 

BTE will query from input_obj -> intermediate and from intermediate -> output_obj, then join results based on shared intermediates. The results should therefore be read as input_obj -> intermediate -> output_obj. 

If **an intermediate node has a tuple of acceptable types**, BTE will look for knowledge sources that can handle at least one of the types listed.   
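
To make the list-of-strings-or-tuples format concrete, here is a small sketch that turns an intermediate node specification into a readable chain. `describe_chain` is a hypothetical helper written for this notebook. It only checks the format described above and does not call BTE.

```python
def describe_chain(intermediate_nodes):
    """
    intermediate_nodes: None, [], or a list whose elements are a string (one type)
    or a tuple of strings (several acceptable types), one element per node.
    returns: a readable description of the query chain
    """
    steps = ['input_obj']
    for node in intermediate_nodes or []:
        if isinstance(node, str):
            steps.append(node)
        elif isinstance(node, tuple) and all(isinstance(t, str) for t in node):
            steps.append('(' + ' or '.join(node) + ')')
        else:
            raise TypeError('each intermediate node must be a type name or a tuple of type names')
    steps.append('output_obj')
    return ' -> '.join(steps)


print(describe_chain(None))
print(describe_chain([('AnatomicalEntity', 'CellularComponent')]))
print(describe_chain([('AnatomicalEntity', 'CellularComponent'), 'CellularComponent']))
```

Running a specification through a check like this before building the query makes it easier to see how many nodes the chain will have.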

### What if I find multiple specific biomedical objects that are valid in step 1? 

This can happen due to different IDs, slightly different contexts or definitions, etc. The problem of object and identifier resolution is an on-going issue in knowledge integration.   

Currently there is no way to combine them into one object for one query. The user has to run multiple / parallel queries for each BioThings object that is valid to their question.  
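
A sketch of the "one query per valid object" workaround. `candidates_of_type` is a hypothetical helper, and the toy Hint result only mimics the shape of a real one (a dictionary of lists of dictionaries). The actual FindConnection calls are left as a comment so the sketch runs offline.

```python
def candidates_of_type(hint_result, bt_type):
    """Return every Hint object of one BioThings type (empty list if none)."""
    return list(hint_result.get(bt_type, []))


## toy Hint result with the same shape as the real one: a dict of lists of dicts
toy_hint_result = {
    'Disease': [
        {'name': 'Dravet syndrome', 'type': 'Disease'},
        {'name': 'obsolete Dravet syndrome', 'type': 'Disease'},
    ],
    'Gene': [],
}

for candidate in candidates_of_type(toy_hint_result, 'Disease'):
    ## in a real session, each candidate would be passed to its own
    ## FindConnection(input_obj=candidate, ...) query and the tables compared
    print('would run one query for: {name}'.format(name=candidate['name']))
```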

Below are code blocks in preparation for showing issues. 

In [1]:
## Setup 
## CX: make it so multiple lines of code can print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

In [2]:
def hint_display(query, hint_result):
    """
    show the name and type of all results returned by the query
    input: query (string used in hint query), hint_result (object returned from hint query)
    hint_result is a dictionary of lists of dictionaries
    returns: None
    """
    display = ['name', 'type']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## skip types with no hits
            for item in result:
                concise_results.append((item[display[0]], item[display[1]]))
    print('There are {total} BioThings objects returned for {ht}:'.format(
        total=len(concise_results), ht=query))
    for idx, display_info in enumerate(concise_results):
        print('{i}: {val}'.format(i=idx, val=display_info))

## Issues with the Hint module

### Synonyms don't work

In [3]:
Synonym_str = "GIST"
SynonymLook = Hint().query(Synonym_str)
hint_display(Synonym_str, SynonymLook)

There are 0 BioThings objects returned for GIST:
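
One possible workaround on the user side is to expand known abbreviations before calling Hint. The lookup table below is hypothetical and hand-maintained. Nothing like it ships with biothings_explorer as far as I know.

```python
## hypothetical, hand-maintained lookup table; extend as needed
ABBREVIATIONS = {
    'GIST': 'gastrointestinal stromal tumor',
}


def expand_abbreviation(query, table=ABBREVIATIONS):
    """Return the full name for a known abbreviation, otherwise the query unchanged."""
    return table.get(query.strip().upper(), query)


print(expand_abbreviation('GIST'))
print(expand_abbreviation('Lafora'))
```

The expanded string would then be passed to Hint().query() in place of the abbreviation.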


### Partial words don't work

In [4]:
## notice that this works for hyphenated words, but doesn't return names with 'gastrointestinal'
PartialWorks_str = "gastro"
PartialWorksLook = Hint().query(PartialWorks_str)
hint_display(PartialWorks_str, PartialWorksLook)

print('\n')

## looking for contractures (phenotype)
PartialNoWork_str = "contract"
PartialNoWorkLook = Hint().query(PartialNoWork_str)
hint_display(PartialNoWork_str, PartialNoWorkLook)

There are 7 BioThings objects returned for gastro:
0: ('Drugs For Peptic Ulcer And Gastro-oesophageal Reflux Disease (gord)', 'ChemicalSubstance')
1: ('Other drugs for peptic ulcer and gastro-oesophageal reflux disease (GORD)', 'ChemicalSubstance')
2: ('genetic gastro-esophageal disease', 'Disease')
3: ('obsolete gastro-enteropancreatic neuroendocrine tumor', 'Disease')
4: ('Gastro-esophageal reflux disease with esophagitis', 'Disease')
5: ('gastro-intestinal system smooth muscle contraction', 'BiologicalProcess')
6: ('positive regulation of gastro-intestinal system smooth muscle contraction', 'BiologicalProcess')


There are 0 BioThings objects returned for contract:


### Singular vs Plural queries behave in unexpected ways

Different results:

In [5]:
## notice that this gives 4 objects
SingularGIST = "gastrointestinal stromal tumor"
SingularLook = Hint().query(SingularGIST)
hint_display(SingularGIST, SingularLook)

print('\n')

## notice that this gives 1 object
PluralGIST = "gastrointestinal stromal tumors"
PluralLook = Hint().query(PluralGIST)
hint_display(PluralGIST, PluralLook)

There are 4 BioThings objects returned for gastrointestinal stromal tumor:
0: ('gastrointestinal stromal tumor', 'Disease')
1: ('GASTROINTESTINAL STROMAL TUMOR, FAMILIAL', 'Disease')
2: ('Gastric Gastrointestinal Stromal Tumor', 'Disease')
3: ('colorectal gastrointestinal stromal tumor', 'Disease')


There are 1 BioThings objects returned for gastrointestinal stromal tumors:
0: ('gastrointestinal stromal tumor', 'Disease')
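
Until this is sorted out, one option is to query both the singular and the plural form and merge the results. The sketch below is my own workaround idea. `merged_lookup` is hypothetical, the pluralization is deliberately naive, and `fake_query` stands in for Hint().query so the block runs without network access.

```python
def merged_lookup(term, query_fn):
    """
    Query both `term` and a naive singular/plural variant, then merge the results.
    query_fn: any callable that takes a string and returns a Hint-style result
    (dict of lists of dicts); in a real session this could be Hint().query.
    returns: list of unique (name, type) tuples in first-seen order
    """
    variant = term[:-1] if term.endswith('s') else term + 's'  ## naive English plural only
    seen = []
    for text in (term, variant):
        for result in query_fn(text).values():
            for item in result or []:
                key = (item['name'], item['type'])
                if key not in seen:
                    seen.append(key)
    return seen


## stand-in for Hint().query; names are a subset of the output printed above
def fake_query(text):
    hits = {
        'gastrointestinal stromal tumor': ['gastrointestinal stromal tumor',
                                           'Gastric Gastrointestinal Stromal Tumor'],
        'gastrointestinal stromal tumors': ['gastrointestinal stromal tumor'],
    }
    return {'Disease': [{'name': n, 'type': 'Disease'} for n in hits.get(text, [])]}


print(merged_lookup('gastrointestinal stromal tumors', fake_query))
```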


## Issues with FindConnection() module  

**Compare the help for FindConnection() with the errors or verbose output below.**

In [6]:
help(FindConnection.__init__)

Help on function __init__ in module biothings_explorer.user_query_dispatcher:

__init__(self, input_obj, output_obj, intermediate_nodes, registry=None)
    Find relationships in the Knowledge Graph between an Input Object and an Output Object.
    
    Args:
        input_obj (required): must be an object returned from Hint corresponding to a specific biomedical entity.
                            Examples:
                Hint().query("Fanconi anemia")['DiseaseOrPhenotypicFeature'][0]
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
    
        output_obj (required): must EITHER be an object returned from Hint corresponding to a specific biomedical
                            entity, OR be a string or list of strings corresponding to Biolink Entity classes.
                            Examples:
                Hint().query("acetaminophen")['ChemicalSubstance'][0]
                'Gene'
                ['Gene','ChemicalSubstance']
    
        intermediate_nodes (

Some example objects to use as input and output in later queries. This is from Step 1 of the *Overview of a basic query* above.

In [7]:
## looking through the options here: pick the first disease option

LaforaString = 'Lafora'
LaforaLookup = Hint().query(LaforaString)
LaforaLookup.keys()
hint_display(LaforaString, LaforaLookup)
LaforaInput = LaforaLookup['Disease'][0]

print('\n')

DravetString = 'Dravet'
DravetLookup = Hint().query(DravetString)
hint_display(DravetString, DravetLookup)
DravetOutput = DravetLookup['Disease'][0]

dict_keys(['Gene', 'SequenceVariant', 'ChemicalSubstance', 'Disease', 'MolecularActivity', 'BiologicalProcess', 'CellularComponent', 'Pathway', 'AnatomicalEntity', 'PhenotypicFeature'])

There are 4 BioThings objects returned for Lafora:
0: ('Lafora disease', 'Disease')
1: ('early-onset Lafora body disease', 'Disease')
2: ('Lafora body', 'CellularComponent')
3: ('Myoclonic epilepsy of Lafora', 'Pathway')


There are 2 BioThings objects returned for Dravet:
0: ('Dravet syndrome', 'Disease')
1: ('obsolete Dravet syndrome', 'Disease')


### Intermediate node cannot be a specific biomedical entity (BioThings object) 

Confirms what the FindConnection() help says. 

In [8]:
fc1 = FindConnection(input_obj = LaforaInput, output_obj = 'PhenotypicFeature', \
                    intermediate_nodes = DravetOutput)
fc1.connect(verbose = True)


BTE will find paths that join 'Lafora disease' and 'PhenotypicFeature'.                   Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: {'MONDO': 'MONDO:0100135', 'DOID': 'DOID:0060171', 'UMLS': 'C0751122', 'name': 'Dravet syndrome', 'primary': {'identifier': 'MONDO', 'cls': 'Disease', 'value': 'MONDO:0100135'}, 'display': 'MONDO(MONDO:0100135) DOID(DOID:0060171) UMLS(C0751122) name(Dravet syndrome)', 'type': 'Disease'}



TypeError: sequence item 0: expected str instance, dict found

### output_obj can't be a list of entity types - not matching the FindConnection() help

In [9]:
fc2 = FindConnection(input_obj = LaforaInput,\
                     output_obj = ['CellularComponent', 'AnatomicalFeature'], \
                    intermediate_nodes = None)
fc2.connect(verbose = True)


BTE will find paths that join 'Lafora disease' and '['CellularComponent', 'AnatomicalFeature']'.                   Paths will have 0 intermediate node.



TypeError: sequence item 0: expected str instance, list found
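
Both TypeErrors above have the wording that Python's `str.join` produces when an element of the sequence is not a string. One plausible reading (an assumption on my part, since the full traceback is not shown here) is that BTE joins the type constraints into a single string somewhere along the way. The error text itself can be reproduced without BTE:

```python
def join_types(type_constraints):
    """Mimic joining type names into one string; returns the error text on failure."""
    try:
        return ','.join(type_constraints)
    except TypeError as err:
        return 'TypeError: {msg}'.format(msg=err)


print(join_types(['PhenotypicFeature', 'Gene']))                 ## list of strings: fine
print(join_types([{'name': 'Dravet syndrome'}]))                 ## a Hint object (dict) as an element
print(join_types([['CellularComponent', 'AnatomicalFeature']]))  ## a nested list as an element
```

If that reading is right, a type check with a clearer message before the join would make both failures easier to diagnose.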

### EXPLAIN: unexpected behavior when trying for > 1 intermediate node 

If the output_obj is a specific biomedical entity ('Explain' type query), there can only be one intermediate node - not matching the FindConnection() help. 

Notice how the parameter syntax gets resolved unexpectedly, and how the verbose output contradicts itself (it announces 2 intermediate nodes, then plans a single step with 'PhenotypicFeature AND Gene' as output). The result matches this workflow: Lafora disease -> PhenotypicFeature or Gene <- Dravet syndrome. The table output also has a wrong column (output_type is Gene rather than Disease). 

In [10]:
fc3 = FindConnection(input_obj = LaforaInput, output_obj = DravetOutput, \
                    intermediate_nodes = ['PhenotypicFeature', 'Gene'])  
## trying to get two intermediate nodes
fc3.connect(verbose=True)
df3 = fc3.display_table_view()
df3


BTE will find paths that join 'Lafora disease' and 'Dravet syndrome'. Paths will have 2 intermediate node.

Intermediate node #1 will have these type constraints: PhenotypicFeature,Gene


Intermediate node #2 will have these type constraints: PhenotypicFeature,Gene



==== Step #1: Query path planning ====

Because Lafora disease is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'PhenotypicFeature AND Gene' as output

BTE found 10 apis:

API 1. cord_disease(1 API call)
API 2. DISEASES(1 API call)
API 3. mgi_gene2phenotype(1 API call)
API 4. pharos(1 API call)
API 5. semmed_disease(15 API calls)
API 6. scigraph(1 API call)
API 7. mydisease(2 API calls)
API 8. scibite(1 API call)
API 9. biolink(2 API calls)
API 10. hetio(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 7.2: https://mydisease.info/v1/query?fields=disgenet.genes_related_


After id-to-object translation, BTE retrieved 4 unique objects.



BTE found 3 unique intermediate nodes connecting 'Lafora disease' and 'Dravet syndrome'


Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,output_type,output_name,output_id
0,"LAFORA BODY DISEASE, LATE ONSET",Disease,treated_by,SEMMED,SEMMED Disease API,113549789989632,Gene,C0017337,UMLS:C0017337,caused_by,SEMMED,SEMMED Disease API,12821740,Gene,DRAVET SYNDROME,MONDO:MONDO:0100135
1,"LAFORA BODY DISEASE, LATE ONSET",Disease,related_to,SEMMED,SEMMED Disease API,111752831861753023922729,Gene,C0017337,UMLS:C0017337,caused_by,SEMMED,SEMMED Disease API,12821740,Gene,DRAVET SYNDROME,MONDO:MONDO:0100135
2,"LAFORA BODY DISEASE, LATE ONSET",Disease,treated_by,SEMMED,SEMMED Disease API,113549789989632,Gene,C0017337,UMLS:C0017337,treated_by,SEMMED,SEMMED Disease API,11940708,Gene,DRAVET SYNDROME,MONDO:MONDO:0100135
3,"LAFORA BODY DISEASE, LATE ONSET",Disease,related_to,SEMMED,SEMMED Disease API,111752831861753023922729,Gene,C0017337,UMLS:C0017337,treated_by,SEMMED,SEMMED Disease API,11940708,Gene,DRAVET SYNDROME,MONDO:MONDO:0100135


Notice that the only results were shared genes and not shared phenotypes, even though both are seizure-related disorders. I believe that this is because of [an issue with MONDO where Dravet Syndrome has no phenotypes](https://monarchinitiative.org/disease/MONDO:0100135) but [another more general disease term with Dravet Syndrome in the description does have phenotypes](https://monarchinitiative.org/disease/MONDO:0018214).
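
The mislabeled output_type column in the table above could be flagged mechanically. This is a sketch with a hypothetical helper and two hand-copied columns. With a real dataframe, the rows could come from something like `df3.to_dict('records')`.

```python
def mismatched_output_rows(rows, expected_type):
    """Return the indices of rows whose output_type differs from expected_type."""
    return [idx for idx, row in enumerate(rows) if row['output_type'] != expected_type]


## two columns copied from the first two table rows; the other columns are omitted
rows = [
    {'output_type': 'Gene', 'output_name': 'DRAVET SYNDROME'},
    {'output_type': 'Gene', 'output_name': 'DRAVET SYNDROME'},
]

## the Hint object for Dravet syndrome has type 'Disease'
print(mismatched_output_rows(rows, 'Disease'))
```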