## Step 0: Load BTE modules, notebook functions

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## show time that this notebook was executed 
from datetime import datetime

## packages to work with objects 
import re

In [2]:
## functions to add to modules?
def hint_display(query, hint_result):
    """
    show the name and type of all results returned by the query
    
    :param: query: string used in hint query
    :param: hint_result: object returned from hint query, a dictionary of lists of dictionaries
    
    Returns: None
    """
    ## function needs to be rewritten if it's going to give the exact index of each object within its type 
    display = ['type', 'name']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## basically if it's not empty
            for items in result:
                concise_results.append((items[display[0]], items[display[1]]))
    print('There are {total} BioThings objects returned for {ht}:'.format(
        total = len(concise_results), ht = query))
    for display_info in concise_results:
        print(display_info)
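
The comment inside `hint_display` notes that it does not report each object's index within its type, which is the index needed later when selecting with `MSUD_hint[MSUD_choice_type][MSUD_choice_idx]`. A minimal sketch of such a variant follows. The name `hint_display_indexed` is hypothetical, and `sample_hint` is a hand-made stand-in with the shape described in the docstring (a dictionary of lists of dictionaries), not real BTE output.

```python
def hint_display_indexed(query, hint_result):
    """Hypothetical variant of hint_display: shows (type, index, name)."""
    rows = []
    for bt_type, results in hint_result.items():
        # enumerate gives the position of each object within its own type list
        for idx, item in enumerate(results):
            rows.append((bt_type, idx, item['name']))
    print('There are {total} BioThings objects returned for {ht}:'.format(
        total=len(rows), ht=query))
    for row in rows:
        print(row)
    return rows


# hand-made stand-in for a Hint result (assumed shape, not real BTE output)
sample_hint = {
    'Gene': [],
    'Disease': [{'name': 'maple syrup urine disease', 'type': 'Disease'},
                {'name': 'classic maple syrup urine disease', 'type': 'Disease'}],
    'Pathway': [{'name': 'Maple Syrup Urine Disease', 'type': 'Pathway'}],
}
indexed_rows = hint_display_indexed('maple syrup urine disease', sample_hint)
```

Returning the rows as well as printing them makes it easy to check programmatically which index to use.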

In [3]:
def filter_table(df):
    """
    use _source and _method columns to remove rows (paths) from the dataframe
    
    :param df: pandas dataframe containing results from BTE FindConnection module, in table form
    
    Returns: filtered copy of the dataframe (the input is not modified)
    """
    ## note: still needs checking with EXPLAIN queries
    ## key is the string to match to column, value is a list of strings to match to column values
    filter_out = {
        '_source': ['SEMMED', 'CTD', 'ctd', 'omia'],
        ## '_method': [],  ## currently no method values I want to filter out
    }
    ## SEMMED: text mining results wrong for PhenotypicFeature -> Gene
    ## CTD/ctd: results odd for MSUD -> ChemicalSubstance
    ## omia: results wrong or discontinued gene IDs for PhenotypicFeature -> Gene
    
    
    df_temp = df.copy()  ## so the original df isn't modified in-place
    for key,val in filter_out.items():
        ## find columns that match the key string
        columns = [i for i in df_temp.columns if key in i]
        ## iterate through each column
        for col in columns:
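            ## iterate through each value to take out, check if string CONTAINS match. 
            ## only keep rows that don't contain the value
            ## na=False: rows with a missing value count as "no match" and are kept
            ##   (without it, NaN in the mask makes the ~ operation fail)
            ## regex=False: match the value as a literal substring
            for i in val:
                df_temp = df_temp[~ df_temp[col].str.contains(i, na = False, regex = False)]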

    return df_temp
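
To make the substring-based filtering concrete, here is a small sketch of the same idea on a toy table. The rows and the `hpo` source value are invented for illustration; only the column name pattern (`pred1_source`) and the blocked strings come from this notebook. Passing `na=False` is an assumption about how rows with missing provenance should be handled (they are kept).

```python
import pandas as pd

# toy stand-in for a BTE results table (values are made up for illustration)
toy_paths = pd.DataFrame({
    'output_name': ['COMA', 'ATAXIA', 'EMESIS', 'LETHARGY'],
    'pred1_source': ['hpo', 'SEMMED', None, 'ctd'],
})

blocked = ['SEMMED', 'CTD', 'ctd', 'omia']
keep = pd.Series(True, index=toy_paths.index)
for term in blocked:
    # na=False treats missing provenance as "no match", so those rows are kept
    # regex=False matches the term as a literal substring
    keep &= ~toy_paths['pred1_source'].str.contains(term, na=False, regex=False)

toy_filtered = toy_paths[keep]
print(toy_filtered)
```

Only the `COMA` and `EMESIS` rows survive: `ATAXIA` and `LETHARGY` carry a blocked source, while the row with no source is left in.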

In [28]:
## record when cell blocks are executed
print('The time that this notebook was executed is...')
print('Local time (PST, West Coast USA): ')
print(datetime.now())
print('UTC time: ')
print(datetime.utcnow())

The time that this notebook was executed is...
Local time (PST, West Coast USA): 
2020-09-13 19:38:27.652538
UTC time: 
2020-09-14 02:38:27.652735
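
The two timestamps above are 7 hours apart, which corresponds to Pacific Daylight Time rather than the "PST" in the hard-coded label. A minimal sketch of a variant that avoids hard-coding the zone by using timezone-aware objects from the standard library (the variable names are illustrative):

```python
from datetime import datetime, timezone

utc_now = datetime.now(timezone.utc)   # timezone-aware UTC timestamp
local_now = utc_now.astimezone()       # same instant in the machine's local zone

# isoformat() includes the UTC offset, so the zone is visible in the output
print('Local time:', local_now.isoformat())
print('UTC time:  ', utc_now.isoformat())
```

Because both values describe the same instant, they compare equal even though they print differently.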


## Step 1: Find representation of "maple syrup urine disease" in BTE

In this step, BioThings Explorer translates our query string "maple syrup urine disease" into BioThings objects, which contain mappings to many common identifiers. We then pick the BioThings object that best matches what we want (the rare disease).

Generally, the top result returned by the Hint module for your BioThings type of interest will match what you want, but you should confirm that using the identifiers shown. 


> BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. **Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [5]:
ht = Hint()  ## neater way to call this BTE module

## the human user gives this input
MSUD_starting_str = "maple syrup urine disease"

MSUD_hint = ht.query(MSUD_starting_str)
## the result is a dictionary: keys = BioThings type names (strings), values = lists of dictionaries. 
hint_display(MSUD_starting_str, MSUD_hint)

There are 6 BioThings objects returned for maple syrup urine disease:
('Disease', 'maple syrup urine disease')
('Disease', 'maple syrup urine disease, mild variant')
('Disease', 'pyruvate dehydrogenase E3 deficiency')
('Disease', 'inborn disorder of branched-chain amino acid metabolism')
('Disease', 'classic maple syrup urine disease')
('Pathway', 'Maple Syrup Urine Disease')


Based on the information above, we'll pick the top `Disease` choice (indexed at 0) for our query. We can look at identifier mappings inside this BioThings object for maple syrup urine disease (MSUD). 

In [6]:
## the human user makes this choice, gives this input
MSUD_choice_type = 'Disease'
MSUD_choice_idx = 0

MSUD_hint_obj = MSUD_hint[MSUD_choice_type][MSUD_choice_idx]  
MSUD_hint_obj
## these inner dictionaries are keys = id type, 
##       values = curie, normal string, or dictionary (for the key 'primary')

{'MONDO': 'MONDO:0009563',
 'DOID': 'DOID:9269',
 'UMLS': 'C0024776',
 'name': 'maple syrup urine disease',
 'MESH': 'D008375',
 'OMIM': '248600',
 'ORPHANET': '511',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0009563'},
 'display': 'MONDO(MONDO:0009563) DOID(DOID:9269) OMIM(248600) ORPHANET(511) UMLS(C0024776) MESH(D008375) name(maple syrup urine disease)',
 'type': 'Disease'}
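
As a short illustration of how to read this object, the sketch below separates the identifier mappings from the descriptive fields and pulls out the primary CURIE. The dictionary is copied from the output above (with `display` omitted for brevity); treating `name`, `primary`, `display`, and `type` as the non-identifier keys is an assumption based on that output.

```python
# the dictionary printed above, without the long 'display' string
msud = {
    'MONDO': 'MONDO:0009563',
    'DOID': 'DOID:9269',
    'UMLS': 'C0024776',
    'name': 'maple syrup urine disease',
    'MESH': 'D008375',
    'OMIM': '248600',
    'ORPHANET': '511',
    'primary': {'identifier': 'MONDO', 'cls': 'Disease', 'value': 'MONDO:0009563'},
    'type': 'Disease',
}

# keys assumed to be metadata rather than identifier namespaces
non_id_keys = {'name', 'primary', 'display', 'type'}
id_map = {k: v for k, v in msud.items() if k not in non_id_keys}

primary_curie = msud['primary']['value']
print(primary_curie)
print(sorted(id_map))
```

These identifiers are what the later API calls use: for example, the OMIM and ORPHANET values reappear as `q=248600` and `q=511` in the query log of the next step.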

## Step 3: MSUD &rarr; PhenotypicFeature

In [29]:
## the human user gives this input
q1_output_type = 'PhenotypicFeature'
q1_intermediate = None

q1 = FindConnection(input_obj = MSUD_hint_obj,
                    output_obj = q1_output_type,
                    intermediate_nodes = q1_intermediate)
q1.connect(verbose = True)


BTE will find paths that join 'maple syrup urine disease' and 'PhenotypicFeature'. Paths will have 0 intermediate node.




==== Step #1: Query path planning ====

Because maple syrup urine disease is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'PhenotypicFeature' as output

BTE found 3 apis:

API 1. biolink(1 API call)
API 2. mydisease(2 API calls)
API 3. semmed_disease(11 API calls)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 2.2: https://mydisease.info/v1/query?fields=hpo.phenotype_related_to_disease (POST -d q=511&scopes=hpo.orphanet)
API 2.1: https://mydisease.info/v1/query?fields=hpo.phenotype_related_to_disease (POST -d q=248600&scopes=hpo.omim)
API 3.4: https://biothings.ncats.io/semmed/query?fields=caused_by (POST -d q=C1855371,C0024776&scopes=umls)
API 3.11: https://biothings.ncats.io/semmed/query?fields=negat
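
Each log line above pairs a URL with the form data that is POSTed to it. As a rough sketch of how the "API 2.1" line maps onto an HTTP request, the code below builds (but does not send) the equivalent request with the standard library. It only illustrates the shape of the call; how BTE actually issues its requests is not shown in this notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request

# values copied from the "API 2.1" log line above
url = 'https://mydisease.info/v1/query?fields=hpo.phenotype_related_to_disease'
form = {'q': '248600', 'scopes': 'hpo.omim'}

# supplying data makes urllib treat this as a POST; nothing is sent here
req = Request(url, data=urlencode(form).encode('utf-8'))
print(req.get_method(), req.full_url)
print(req.data.decode('utf-8'))
```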

In [30]:
# q1_r_graph = q1.fc.G  ## for changing the graph object to reflect the table
q1_r_paths_table = q1.display_table_view()

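## pull the class name of the dispatcher object out of its type string
## (the scoring function below expects 'Predict' or 'Explain')
q1_type = re.findall(r"dispatcher\.([a-zA-Z]+)'", str(type(q1.fc)))
q1_type = "".join(q1_type)  ## convert the list of matches to a string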

# q1 = None  ## clear memory

We can see the number of PhenotypicFeatures that were linked to MSUD and the total number of paths from MSUD to PhenotypicFeature nodes. 

In [31]:
## show number of unique output nodes
print("There are {0} unique {1}s.".format(
    q1_r_paths_table.output_name.nunique(), q1_output_type))

## show number of paths 
print("There are {0} unique paths.".format(
    q1_r_paths_table.shape[0]))

There are 27 unique PhenotypicFeatures.
There are 45 unique paths.


### Filtering and scoring

Filtering uses edge provenance, such as the source a relationship came from and the method used to make the association, to remove edges (and, in the process, any nodes left without edges). 

However, in this particular example, few or no edges will be removed, since we consider almost all of the information reliable. 

In [32]:
q1_r_paths_table = filter_table(q1_r_paths_table)

## show number of paths from MSUD to PhenotypicFeatures after filtering
print("There are {0} unique paths.".format(
    q1_r_paths_table.shape[0]))

There are 45 unique paths.


For a direct query (no intermediates), I can't score the results if I flatten ALL the multi-edges. Instead, I keep some of the multi-edges and count the edges per output node. 

I don't want to count edges as distinct when biolink returns multi-edges that differ only by their pubmed sources, or only by a long list of sources that differs by one member.

The count is then normalized by dividing by the maximum possible number of edges an output node could have. Here that maximum is 4, the number of unique predicate / API / method combinations. 

The scores will look odd since, right now, the most we have is 1 edge per API...
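
The arithmetic described above can be sketched on a toy table. All rows and the `MSUD` input label are invented for illustration; the column names follow the results table used in this notebook. Edges that differ only in their pubmed value are collapsed, each output node's remaining edges are counted, and the counts are divided by the number of unique predicate / API / method combinations.

```python
import pandas as pd

# toy one-hop results (invented rows); two COMA rows differ only by pubmed ID
toy = pd.DataFrame({
    'input':        ['MSUD'] * 5,
    'pred1':        ['has_phenotype', 'related_to', 'related_to',
                     'related_to', 'related_to'],
    'pred1_api':    ['BioLink API'] + ['mydisease.info API'] * 4,
    'pred1_method': [None, 'TAS', 'TAS', 'IEA', 'PCS'],
    'pred1_pubmed': [None, '111', '222', None, None],
    'output_name':  ['COMA', 'COMA', 'COMA', 'ATAXIA', 'EMESIS'],
})

# ignore source / pubmed columns, then collapse edges that are now identical
ignored = [c for c in toy.columns if '_source' in c or '_pubmed' in c]
edges = toy.drop(columns=ignored).drop_duplicates()

# count the remaining edges per output node
counts = edges['output_name'].value_counts()

# maximum possible edges = unique input / predicate / API / method combinations
max_edges = edges.drop(columns=['output_name']).drop_duplicates().shape[0]

scores = counts / max_edges
print(scores)
```

In this toy case `COMA` keeps 2 of its 3 rows (the pubmed-only duplicate is dropped) and there are 4 possible edge types, so its score is 2 / 4 = 0.5, while `ATAXIA` and `EMESIS` each score 1 / 4 = 0.25.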

In [47]:
q1_r_paths_table.columns
q1_r_paths_table[['input', 'pred1', 'pred1_api', 'pred1_method']].drop_duplicates()

Index(['input', 'input_type', 'pred1', 'pred1_source', 'pred1_api',
       'pred1_pubmed', 'pred1_method', 'output_type', 'output_name',
       'output_id'],
      dtype='object')

|    | input | pred1 | pred1_api | pred1_method |
|---:|:---|:---|:---|:---|
| 0 | BCKD DEFICIENCY | has_phenotype | BioLink API | |
| 2 | BCKD DEFICIENCY | related_to | mydisease.info API | PCS |
| 4 | BCKD DEFICIENCY | related_to | mydisease.info API | IEA |
| 17 | BCKD DEFICIENCY | related_to | mydisease.info API | TAS |


In [38]:
set(q1_r_paths_table.columns)

{'input',
 'input_type',
 'output_id',
 'output_name',
 'output_type',
 'pred1',
 'pred1_api',
 'pred1_method',
 'pred1_pubmed',
 'pred1_source'}

In [48]:
def scoring_output(df, q_type):
    """
    score results based on whether query was Predict or Explain type, number of 
        intermediate nodes 
    :param df: pandas dataframe containing results from BTE FindConnection module
    :param q_type: string describing type of query (Predict or Explain)
    
    May flatten some edges, because score only counts one edge per 
        unique predicate / API / method (ignoring source and pubmed col)
    
    Predict queries: score each output node by counting # of paths
        from input nodes to it. Normalize by dividing by maximum
        possible # of paths
    Explain two-hop (one intermediate) queries: score each intermediate node by 
        counting # of paths (between input and output nodes) that include it. 
        Normalize by dividing by maximum possible # of paths    

    Explain one-hop (direct) queries: no need to score, prints message
    Other Explain queries (many-hops): currently not able to score, prints message     
    
    Returns: pandas dataframe with a single 'score' column, indexed by node name
             or None (one-hop or many-hop Explain query)
    """
    df_temp = df.copy()  ## so no chance to mutate this   
    flag_direct = False  ## one-hop query or not
    ## use df_col to look quicker into columns
    df_col = set(df_temp.columns)
    
    ## ignore source and pubmed col in looking at unique edges 
    columns_drop = [col for col in df_col if (('_source' in col) or ('_pubmed' in col))]
    df_temp.drop(columns = columns_drop, inplace = True)    
    df_temp.drop_duplicates(inplace = True)
    
    ## check if query is one-hop or not
    if "node1_name" not in df_col:    ## name for first intermediate node layer
        flag_direct = True  
    
    if q_type == 'Explain':
        if flag_direct:   # one hop / no intermediates
            print('No valid node scoring for one-hop (direct) Explain queries.')
            return None
        ## if there are many-hops/intermediate layers
        elif "node2_name" in df_col:  ## name for 2nd intermed. node layer
            print('Cannot currently score many-hop Explain queries.')
            return None
        else:   ## two-hop / 1 intermediate layer
            ## count multi-edges to results (the intermediate node1 col)
            scores = df_temp.node1_name.value_counts() 
            ## to find the maximum-possible number of edges, look at non-result cols
            ## use the current columns (source / pubmed columns were already dropped)
            columns_drop = [col for col in df_temp.columns if 'node1' in col]
            df_temp.drop(columns = columns_drop, inplace = True)
            ## now look at number of unique combos for input, edge info, output
            df_temp.drop_duplicates(inplace = True)
            max_paths = df_temp.shape[0]            
            ## normalize scores by dividing each by max number of paths
            scores = scores / max_paths
            ## return scores as pandas dataframe
            return scores.to_frame(name = 'score')            

    else:  ## Predict type query
        ## count multi-edges to results (the output col)
        scores = df_temp.output_name.value_counts()
        ## to find the maximum number of multi-edges, look at non-output col
        columns_drop = [col for col in df_temp.columns if 'output' in col]
        df_temp.drop(columns = columns_drop, inplace = True)
        ## now look at number of unique paths possible
        df_temp.drop_duplicates(inplace = True)
        max_paths = df_temp.shape[0]
        ## normalize scores by dividing each by max number of paths
        scores = scores / max_paths
        ## return scores as pandas dataframe
        return scores.to_frame(name = 'score')                 
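
For the two-hop Explain branch of the function above, a compact sketch on toy data may help. Everything in the table is invented, and the `pred2` column name is an assumption about how a two-hop table is laid out (only `node1_name` and `output_name` are referenced by the function itself). Each intermediate node is scored by the number of paths that pass through it, divided by the number of distinct paths that remain once the intermediate column is removed.

```python
import pandas as pd

# toy two-hop table: input -> node1 -> output (all rows are invented)
toy2 = pd.DataFrame({
    'input':       ['A', 'A', 'A'],
    'pred1':       ['related_to', 'related_to', 'causes'],
    'node1_name':  ['GENE1', 'GENE2', 'GENE1'],
    'pred2':       ['targets', 'targets', 'targets'],
    'output_name': ['DRUG', 'DRUG', 'DRUG'],
})

# count the unique paths that pass through each intermediate node
per_node = toy2.drop_duplicates()['node1_name'].value_counts()

# with the intermediate column removed, the unique rows give the
# maximum number of paths any single intermediate node could have
max_paths = toy2.drop(columns=['node1_name']).drop_duplicates().shape[0]

scores2 = (per_node / max_paths).to_frame(name='score')
print(scores2)
```

Here `GENE1` appears on both possible edge combinations (score 2 / 2 = 1.0) while `GENE2` appears on only one (score 1 / 2 = 0.5).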

In [49]:
## create scoring table for results (output nodes)
q1_scoring = scoring_output(q1_r_paths_table, q1_type)

q1_scoring

|    | score |
|:---|---:|
| INCREASED LEVEL OF HIPPURIC ACID IN URINE | 0.5 |
| BRAIN EDEMA | 0.5 |
| EPILEPSY | 0.5 |
| EMESIS | 0.5 |
| ATAXIA | 0.5 |
| HYPOGLYCAEMIA | 0.5 |
| COMA | 0.5 |
| LETHARGY | 0.5 |
| ELEVATED PLASMA BRANCHED CHAIN AMINO ACIDS | 0.5 |
| DECREASED MUSCLE TONE | 0.5 |
