# Translator Use Case Question 2: Maple Syrup Urine Disease

## Introduction

The Translator Use Case Question #2 is:    

> What genes or proteins may be associated with symptoms of a disease X (such as based on drugs that are currently used to treat that disease, etc)?

<br>

BioThings Explorer (BTE) can answer two classes of queries: "Explain" and "Predict". This question fits the Predict template, which starts with **a specific biomedical entity** (a `Disease` X) and finds relationships with **one biomedical entity type** (like `PhenotypicFeature` or `Gene`).
* Note that a `Protein` biomedical entity type is not currently implemented in BTE. Instead, protein-coding and some non-coding genes are represented as `Gene` entities. 

<br>

We will use the rare metabolic disease [Maple Syrup Urine Disease (MSUD)](https://ghr.nlm.nih.gov/condition/maple-syrup-urine-disease) as our specific disease of interest. In this disease, the normal process of breaking down branched-chain amino acids (isoleucine, leucine, and valine) from food for energy is impaired. We first address the question above using the query: `Disease` MSUD &rarr; `PhenotypicFeature` &rarr; `Gene`.    
* In other words, starting with the Disease MSUD, find entities of type PhenotypicFeature associated with this disease, then find the entities of type Gene associated with those PhenotypicFeatures.       
* This query will return a graph object with entities as nodes and relationships as edges. We then use edge provenance information to **filter** the results. For each Gene node, we use the number of unique paths from MSUD (input node) to that node to **score** it. The scores can then be used to sort the results. 

<br>

We then address the question using a different query and the same process: `Disease` MSUD &rarr; `ChemicalSubstance` &rarr; `Gene`.     
* In other words, starting with the Disease MSUD, find entities of type ChemicalSubstance (which will include drugs) associated with this disease, then find the entities of type Gene associated with those ChemicalSubstances.    
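
The two-hop pattern described above can be illustrated without calling BTE. The sketch below is only a toy: the edge tables, node names (`disease_X`, `phenotype_A`, `GENE_1`, ...) and the use of a pandas merge are assumptions made for illustration, not how BTE plans or executes a query. The column names mirror those used in the results table later in this notebook.

```python
import pandas as pd

# Hypothetical first-hop edges: Disease -> intermediate node
hop1 = pd.DataFrame({
    "input_name": ["disease_X", "disease_X"],
    "node1_name": ["phenotype_A", "phenotype_B"],
})

# Hypothetical second-hop edges: intermediate node -> Gene
hop2 = pd.DataFrame({
    "node1_name": ["phenotype_A", "phenotype_A", "phenotype_B", "phenotype_C"],
    "output_name": ["GENE_1", "GENE_2", "GENE_1", "GENE_3"],
})

# An inner join on the shared intermediate keeps only complete two-hop paths;
# phenotype_C is not linked to the disease, so GENE_3 never appears
paths = hop1.merge(hop2, on="node1_name", how="inner")
print(paths)
```

Each row of `paths` is one path from the input node to a Gene node. A gene reached through several intermediates (here `GENE_1`) appears in several rows, which is what the scoring step later counts.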

## Step 0: Load BTE modules, notebook functions

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## show time that this notebook was executed 
from datetime import datetime

## packages to work with objects 
import re

In [2]:
## functions to add to modules?
def hint_display(query, hint_result):
    """
    show the type, name, number of IDs for all results returned by the query
    
    :param: query: string used in hint query
    :param: hint_result: object returned from hint query, a dictionary of lists of dictionaries
    
    Returns: None
    """
    ## function needs to be rewritten if it's going to give the exact index of each object within its type 
    display = ['type', 'name']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## basically if it's not empty
            for items in result:
                ## number of identifiers per object: number of keys - 4 (name, primary, display, type)
                temp = len(items) - 4
                concise_results.append((items[display[0]], items[display[1]], 
                                         str(temp)))
                    
    print('There are {total} BioThings objects returned for {ht}:'.format(\
                total = len(concise_results), ht = query))
    for display_info in concise_results:
        print('{0}, {1}, num of IDs: {2}'.format(display_info[0], display_info[1], display_info[2]))

In [3]:
def filter_table(df):
    """
    use _source and _method columns to remove rows (paths) from the dataframe
    :param: pandas dataframe containing results from BTE FindConnection module, in table form
    
    Returns: filtered dataframe
    """
    ## note: still needs checking with EXPLAIN queries
    ## key is the string to match to column, value is a list of strings to match to column values
    filter_out = {'_source': ['SEMMED', 'CTD', 'ctd', 'omia']   
#                   '_method': []  ## currently no method stuff I want to filter out
                 }
    ## SEMMED: text mining results wrong for PhenotypicFeature -> Gene
    ## CTD/ctd: results odd for MSUD -> ChemicalSubstance
    ## omia: results wrong or discontinued gene IDs for PhenotypicFeature -> Gene
    
    
    df_temp = df.copy()  ## so the original df isn't modified in-place
    for key,val in filter_out.items():
        ## find columns that match the key string
        columns = [i for i in df_temp.columns if key in i]
        ## iterate through each column
        for col in columns:
            ## iterate through each value to take out, check if string CONTAINS match. 
            ## only keep rows that don't contain the value
            for i in val:
                df_temp = df_temp[~ df_temp[col].str.contains(i, na = False)]
    return df_temp

In [4]:
def scoring_output(df, q_type):
    """
    score results based on whether query was Predict or Explain type, number of 
        intermediate nodes 
    :param: pandas dataframe containing results from BTE FindConnection module
    :param: string describing type of query (Predict or Explain)
    
    May flatten some edges, because score only counts one edge per 
        unique predicate / API / method (ignoring source and pubmed col)
    
    Predict queries: score each output node by counting # of paths
        from input nodes to it. Normalize by dividing by maximum
        possible # of paths
    Explain two-hop (one intermediate) queries: score each intermediate node by 
        counting # of paths (between input and output nodes) that include it. 
        Normalize by dividing by maximum possible # of paths    

    Explain one-hop (direct) queries: no need to score, prints message
    Other Explain queries (many-hops): currently not able to score, prints message     
    
    Returns: pandas dataframe with one 'score' column, indexed by node name
             (output_name for Predict, node1_name for two-hop Explain),
             or None (one-hop or many-hop Explain query)
    """
    df_temp = df.copy()  ## so no chance to mutate this   
    flag_direct = False  ## one-hop query or not
    ## use df_col to look quicker into columns
    df_col = set(df_temp.columns)
    
    ## ignore source and pubmed col in looking at unique edges 
    columns_drop = [col for col in df_col if (('_source' in col) or ('_pubmed' in col))]
    df_temp.drop(columns = columns_drop, inplace = True)    
    df_temp.drop_duplicates(inplace = True)
    
    ## check if query is one-hop or not
    if "node1_name" not in df_col:    ## name for first intermediate node layer
        flag_direct = True  
    
    if q_type == 'Explain':
        if flag_direct:   # one hop / no intermediates
            print('No valid node scoring for one-hop (direct) Explain queries.')
            return None
        ## if there are many-hops/intermediate layers
        elif "node2_name" in df_col:  ## name for 2nd intermed. node layer
            print('Cannot currently score many-hop Explain queries.')
            return None
        else:   ## two-hop / 1 intermediate layer
            ## count multi-edges to results (the intermediate node1 col)
            scores = df_temp.node1_name.value_counts() 
            ## to find the maximum-possible number of edges, look at non-result cols
            columns_drop = [col for col in df_col if 'node1' in col]
            df_temp.drop(columns = columns_drop, inplace = True)
            ## now look at number of unique combos for input, edge info, output
            df_temp.drop_duplicates(inplace = True)
            max_paths = df_temp.shape[0]            
            ## normalize scores by dividing each by max number of paths
            scores = scores / max_paths
            ## return scores as pandas dataframe
            return scores.to_frame(name = 'score')            

    else:  ## Predict type query
        ## count multi-edges to results (the output col)
        scores = df_temp.output_name.value_counts()
        ## to find the maximum number of multi-edges, look at non-output col
        columns_drop = [col for col in df_temp.columns if 'output' in col]
        df_temp.drop(columns = columns_drop, inplace = True)
        ## now look at number of unique paths possible
        df_temp.drop_duplicates(inplace = True)
        max_paths = df_temp.shape[0]
        ## normalize scores by dividing each by max number of paths
        scores = scores / max_paths
        ## return scores as pandas dataframe
        return scores.to_frame(name = 'score')                 

In [5]:
## record when cell blocks are executed
print('The time that this notebook was executed is...')
print('Local time (PST, West Coast USA): ')
print(datetime.now())
print('UTC time: ')
print(datetime.utcnow())

The time that this notebook was executed is...
Local time (PST, West Coast USA): 
2020-09-14 17:54:21.300690
UTC time: 
2020-09-15 00:54:21.300802


## Step 1: Find representation of "maple syrup urine disease" in BTE

In this step, BioThings Explorer translates our query string "maple syrup urine disease"  into BioThings objects, which contain mappings to many common identifiers. We then pick the BioThings object that best matches what we want. 

Generally, the top result returned by the Hint module for your BioThings type of interest will match what you want, but you should confirm that using the identifiers shown. 
> BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. **Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [6]:
ht = Hint()  ## neater way to call this BTE module

## the human user gives this input
MSUD_starting_str = "maple syrup urine disease"

MSUD_hint = ht.query(MSUD_starting_str)
hint_display(MSUD_starting_str, MSUD_hint)

There are 6 BioThings objects returned for maple syrup urine disease:
Disease, maple syrup urine disease, num of IDs: 6
Disease, maple syrup urine disease, mild variant, num of IDs: 3
Disease, pyruvate dehydrogenase E3 deficiency, num of IDs: 4
Disease, inborn disorder of branched-chain amino acid metabolism, num of IDs: 3
Disease, classic maple syrup urine disease, num of IDs: 3
Pathway, Maple Syrup Urine Disease, num of IDs: 0


Based on the information above, we'll pick the top `Disease` choice (indexed at 0) for our query. We can look at identifier mappings inside this BioThings object for maple syrup urine disease (MSUD). 

In [7]:
## the human user makes this choice, gives this input
MSUD_choice_type = 'Disease'
MSUD_choice_idx = 0

MSUD_hint_obj = MSUD_hint[MSUD_choice_type][MSUD_choice_idx]  
MSUD_hint_obj

{'MONDO': 'MONDO:0009563',
 'DOID': 'DOID:9269',
 'UMLS': 'C0024776',
 'name': 'maple syrup urine disease',
 'MESH': 'D008375',
 'OMIM': '248600',
 'ORPHANET': '511',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0009563'},
 'display': 'MONDO(MONDO:0009563) DOID(DOID:9269) OMIM(248600) ORPHANET(511) UMLS(C0024776) MESH(D008375) name(maple syrup urine disease)',
 'type': 'Disease'}
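
To confirm the choice using the identifiers, it can help to separate the identifier mappings from the descriptive keys. The sketch below works on a copy of the object printed above; the helper name `identifier_mappings` and the constant `NON_ID_KEYS` are hypothetical and not part of BTE. It assumes, as `hint_display` does, that every key other than `name`, `primary`, `display`, and `type` is an identifier namespace.

```python
# Copy of the hint object shown above (values taken from the notebook output)
msud_obj = {
    "MONDO": "MONDO:0009563",
    "DOID": "DOID:9269",
    "UMLS": "C0024776",
    "name": "maple syrup urine disease",
    "MESH": "D008375",
    "OMIM": "248600",
    "ORPHANET": "511",
    "primary": {"identifier": "MONDO", "cls": "Disease", "value": "MONDO:0009563"},
    "display": "MONDO(MONDO:0009563) DOID(DOID:9269) OMIM(248600) ORPHANET(511) "
               "UMLS(C0024776) MESH(D008375) name(maple syrup urine disease)",
    "type": "Disease",
}

# Keys that describe the object rather than map it to an identifier namespace
NON_ID_KEYS = {"name", "primary", "display", "type"}


def identifier_mappings(hint_obj):
    """Return only the namespace -> identifier pairs of a hint object."""
    return {k: v for k, v in hint_obj.items() if k not in NON_ID_KEYS}


ids = identifier_mappings(msud_obj)
for namespace in sorted(ids):
    print(namespace, ids[namespace])
```

The six namespaces returned here match the "num of IDs: 6" reported for this object by `hint_display`.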

## Step 2: MSUD &rarr; PhenotypicFeature &rarr; Gene

### Generating a knowledge graph

In this section, we dynamically generate a knowledge graph with paths connecting maple syrup urine disease (MSUD) to genes *using PhenotypicFeature intermediates* (representing the symptoms part of the question).  

BTE performs the **query path planning** and **query path execution** by deconstructing the query into individual API calls, executing those API calls, and then assembling the results.

The code block below takes 1-2 minutes to run. 

In [8]:
## the human user gives this input
q1_output_type = 'Gene'
q1_intermediate = 'PhenotypicFeature'

q1 = FindConnection(input_obj = MSUD_hint_obj,\
                     output_obj = q1_output_type, \
                    intermediate_nodes = q1_intermediate)
q1.connect(verbose = False)

In [9]:
# q1_r_graph = q1.fc.G  ## for changing the graph object to reflect the table
q1_r_paths_table = q1.display_table_view()

q1_type = re.findall("dispatcher.([a-zA-Z]+)'", str(type(q1.fc)))
q1_type = "".join(q1_type)  ## convert to string

q1 = None  ## clear memory

We can see the number of PhenotypicFeatures that were linked to both MSUD and to a Gene, the number of Genes returned as output nodes, and the total number of paths from MSUD to Gene nodes. 
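
Because the same three counts are printed several times in this notebook, they could be gathered by one small helper. This is a sketch only: `summarize_paths` is a hypothetical name, and the toy table stands in for the table returned by `display_table_view()`, of which only the `node1_name` and `output_name` columns are assumed.

```python
import pandas as pd


def summarize_paths(paths_table):
    """Count unique intermediates, unique outputs, and rows (one row per path)."""
    return {
        "intermediates": paths_table["node1_name"].nunique(),
        "outputs": paths_table["output_name"].nunique(),
        "paths": paths_table.shape[0],
    }


# Toy stand-in for a BTE results table
toy = pd.DataFrame({
    "node1_name": ["phenotype_A", "phenotype_A", "phenotype_B"],
    "output_name": ["GENE_1", "GENE_2", "GENE_1"],
})

summary = summarize_paths(toy)
print(summary)
```

Calling the helper before and after filtering would make the effect of the filter easy to compare.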

In [10]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q1_r_paths_table.node1_name.nunique(), q1_intermediate, MSUD_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q1_r_paths_table.output_name.nunique(), q1_intermediate))

## show number of paths from MSUD to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 27 unique PhenotypicFeatures for maple syrup urine disease.
There are 1561 unique Genes linked to those PhenotypicFeatures.
There are 7964 unique paths.


### Filtering and scoring

Filtering uses edge provenance, such as the source a relationship came from and the method used to make the association, to remove edges (and, as a result, any nodes left without edges). 

In this particular example, only a few edges are removed (the output below shows 7960 of the original 7964 paths remaining), because we consider almost all of the information reliable. 
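
The substring-based filtering done by `filter_table` can be seen on a toy table. In this sketch the rows, the provenance values `hpo` and `orphanet`, and the column names `pred1_source` and `pred2_source` are assumptions for illustration; only the blocked strings come from `filter_table` itself.

```python
import pandas as pd

# Toy table: one row per path, with a provenance column for each hop
toy = pd.DataFrame({
    "output_name": ["GENE_1", "GENE_2", "GENE_3", "GENE_4"],
    "pred1_source": ["hpo", "SEMMED", None, "hpo"],
    "pred2_source": ["orphanet", "orphanet", "orphanet", "omia"],
})

blocked = ["SEMMED", "CTD", "ctd", "omia"]  # same strings as in filter_table

source_cols = [c for c in toy.columns if "_source" in c]
keep = pd.Series(True, index=toy.index)
for col in source_cols:
    for term in blocked:
        # na=False treats missing provenance as "no match", so the row is kept
        keep &= ~toy[col].str.contains(term, na=False, regex=False)

filtered = toy[keep]
print(filtered["output_name"].tolist())
```

A path is dropped if any of its provenance columns contains a blocked string, so `GENE_2` and `GENE_4` are removed while the row with missing provenance (`GENE_3`) is kept.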

In [11]:
q1_r_paths_table = filter_table(q1_r_paths_table)

## show number of paths from MSUD to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 7960 unique paths.


The scoring process for Predict queries (the type of query we're using now): 

1. To score individual Gene nodes, we first take a copy of the knowledge graph (KG) and remove some multi-edges. 
    * Each edge has predicate, API, method, source, and pubmed information. For scoring purposes, we will ignore pubmed and source information because APIs handle this information differently (returning multiple edges or single edges). 
2. We then count the number of paths from the MSUD node to each Gene node.        
3. Finally, we "normalize" the score by dividing those counts by the maximum possible number of paths from the MSUD node to a Gene node.

We can then see the top-scored nodes. A score closer to 1 means that many PhenotypicFeatures and relationships link MSUD and the Gene node. A score closer to 0 means that only a few PhenotypicFeatures and relationships link MSUD and the Gene node. 
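
The three scoring steps can be traced on a toy table. This sketch follows the Predict branch of `scoring_output` defined in Step 0; the rows, node names, and source labels (`src_1`, `src_2`) are made up for illustration.

```python
import pandas as pd

# Toy paths table; rows 1 and 2 differ only in their source
toy = pd.DataFrame({
    "input_name":   ["disease_X"] * 4,
    "pred1":        ["has_phenotype"] * 4,
    "pred1_source": ["src_1", "src_2", "src_1", "src_1"],
    "node1_name":   ["phenotype_A", "phenotype_A", "phenotype_B", "phenotype_B"],
    "pred2":        ["related_to"] * 4,
    "output_name":  ["GENE_1", "GENE_1", "GENE_1", "GENE_2"],
})

# Step 1: ignore source/pubmed columns, then collapse duplicate edges
provenance_cols = [c for c in toy.columns if "_source" in c or "_pubmed" in c]
edges = toy.drop(columns=provenance_cols).drop_duplicates()

# Step 2: count the remaining paths that end at each Gene node
counts = edges["output_name"].value_counts()

# Step 3: the maximum possible count is the number of distinct paths
# once the output columns are ignored
output_cols = [c for c in edges.columns if "output" in c]
max_paths = edges.drop(columns=output_cols).drop_duplicates().shape[0]

scores = (counts / max_paths).to_frame(name="score")
print(scores)
```

Here `GENE_1` is reached through both intermediates and scores 1.0, while `GENE_2` is reached through one of the two and scores 0.5. The duplicate row that differed only by source does not inflate the count.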

In [12]:
## create scoring table for Genes (output nodes)
q1_scoring = scoring_output(q1_r_paths_table, q1_type)

q1_scoring.head(10)

Unnamed: 0,score
BCKDHA,0.871795
DBT,0.871795
BCKDHB,0.846154
DLD,0.615385
NDUFV1,0.551282
NDUFS1,0.512821
NDUFA1,0.448718
NDUFS7,0.448718
PCCB,0.448718
PCCA,0.448718


Different knowledge sources (APIs) were called in different parts of the query. 

In the first part of the query (maple syrup urine disease &rarr; PhenotypicFeature), the following APIs returned results and the following predicates (semantic relationships) were found.

In [13]:
## show that the APIs use different predicates
q1_r_paths_table[['pred1_api', 'pred1']].drop_duplicates().sort_values(by = ['pred1_api', 'pred1'])

Unnamed: 0,pred1_api,pred1
0,BioLink API,has_phenotype
2,mydisease.info API,related_to


In the second part of the query (PhenotypicFeatures &rarr; Gene), the following APIs returned results and the following predicates (semantic relationships) were found.

In [14]:
## show that the APIs use different predicates
q1_r_paths_table[['pred2_api', 'pred2']].drop_duplicates().sort_values(by = ['pred2_api', 'pred2'])

Unnamed: 0,pred2_api,pred2
7741,BioLink API,contributes_to_condition
0,BioLink API,has_phenotype
4,EBIgene2phenotype API,related_to


[TO DO: Actual code depends on graph object (ReasonerStd object or Networkx MultiDiGraph) used.]

We then add the scores and score provenance information to the graph object that will be returned to the ARS. These are stored as node attributes on the results (`Gene`-type nodes) of the "answer graph". 

In [15]:
# nx.set_node_attributes(q1_r_graph, values = q1_scoring['score'].to_dict(), \
#                        name = 'score')
# nx.set_node_attributes(q1_r_graph, values = '(0-1]', name = 'score_range')
# nx.set_node_attributes(q1_r_graph, values = 'normalized path count', name = 'score_method')
# ## what is a good score, larger or smaller? 
# nx.set_node_attributes(q1_r_graph, values = 'larger', name = 'score_better_direction')  

# ## example of Gene node object with score-related attributes
# q1_r_graph.nodes()['NDUFS1']
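
As a minimal sketch of what the commented-out cell is aiming for, assuming the graph object turns out to be a networkx `MultiDiGraph` (the TO DO above leaves this open), the scores could be attached as follows. The toy graph and scores are hypothetical. Note that `nx.set_node_attributes` expects a `{node: value}` dictionary, so the `score` column is selected before calling `to_dict()`.

```python
import networkx as nx
import pandas as pd

# Toy answer graph with hypothetical nodes
graph = nx.MultiDiGraph()
graph.add_edge("disease_X", "phenotype_A", label="has_phenotype")
graph.add_edge("phenotype_A", "GENE_1", label="related_to")
graph.add_edge("phenotype_A", "GENE_2", label="related_to")

# Toy scoring table in the same shape as the output of scoring_output
scoring = pd.DataFrame({"score": [1.0, 0.5]}, index=["GENE_1", "GENE_2"])

# Selecting the column gives {node: value}; DataFrame.to_dict() alone
# would give {"score": {node: value}}, which does not match any node
nx.set_node_attributes(graph, values=scoring["score"].to_dict(), name="score")

# A scalar value would be applied to every node, so the provenance
# attributes are restricted to the scored nodes here
provenance = {
    "score_range": "(0-1]",
    "score_method": "normalized path count",
    "score_better_direction": "larger",
}
for attr, value in provenance.items():
    nx.set_node_attributes(graph, {n: value for n in scoring.index}, name=attr)

print(graph.nodes["GENE_1"])
```

Restricting the provenance attributes to scored nodes is a design choice made in this sketch; the commented-out cell above would instead apply them to every node in the graph.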

## Step 3: MSUD &rarr; ChemicalSubstance &rarr; Gene

### Generating a knowledge graph

In this section, we dynamically generate a knowledge graph with paths connecting maple syrup urine disease (MSUD) to genes *with ChemicalSubstance intermediates* (representing the drugs part of the question).  

BTE performs the **query path planning** and **query path execution** by deconstructing the query into individual API calls, executing those API calls, and then assembling the results.

The following code block takes about 30 seconds to run. 

In [16]:
## the human user gives this input
q2_output_type = 'Gene'
q2_intermediate = 'ChemicalSubstance'

q2 = FindConnection(input_obj = MSUD_hint_obj,\
                     output_obj = q2_output_type, \
                    intermediate_nodes = q2_intermediate)
q2.connect(verbose = False)

API 4.1 pharos failed


In [17]:
# q2_r_graph = q2.fc.G ## for changing the graph object to reflect the table
q2_r_paths_table = q2.display_table_view()

q2_type = re.findall("dispatcher.([a-zA-Z]+)'", str(type(q2.fc)))
q2_type = "".join(q2_type)  ## convert to string

q2 = None ## clear memory

We can see the number of ChemicalSubstances that were linked to both MSUD and to a Gene, the number of Genes returned as output nodes, and the total number of paths from MSUD to Gene nodes. 

In [18]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q2_r_paths_table.node1_name.nunique(), q2_intermediate, MSUD_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q2_r_paths_table.output_name.nunique(), q2_intermediate))

## show number of paths from MSUD to genes
print("There are {0} unique paths.".format( \
    q2_r_paths_table.shape[0]))

There are 58 unique ChemicalSubstances for maple syrup urine disease.
There are 11847 unique Genes linked to those ChemicalSubstances.
There are 62968 unique paths.


### Filtering and scoring

Filtering uses edge provenance, such as the source a relationship came from and the method used to make the association, to remove edges (and, as a result, any nodes left without edges). 

In [19]:
q2_r_paths_table = filter_table(q2_r_paths_table)

After filtering, there are fewer results in the answer knowledge graph. 

In [20]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q2_r_paths_table.node1_name.nunique(), q2_intermediate, MSUD_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q2_r_paths_table.output_name.nunique(), q2_intermediate))

## show number of paths from MSUD to genes
print("There are {0} unique paths.".format( \
    q2_r_paths_table.shape[0]))

There are 30 unique ChemicalSubstances for maple syrup urine disease.
There are 1140 unique Genes linked to those ChemicalSubstances.
There are 1925 unique paths.


The scoring process for Predict queries (the type of query we're using now): 

1. To score individual Gene nodes, we first take a copy of the knowledge graph (KG) and remove some multi-edges. 
    * Each edge has predicate, API, method, source, and pubmed information. For scoring purposes, we will ignore pubmed and source information because APIs handle this information differently (returning multiple edges or single edges). 
2. We then count the number of paths from the MSUD node to each Gene node.        
3. Finally, we "normalize" the score by dividing those counts by the maximum possible number of paths from the MSUD node to a Gene node.

We can then see the top-scored nodes. A score closer to 1 means that many ChemicalSubstances and relationships link MSUD and the Gene node. A score closer to 0 means that only a few ChemicalSubstances and relationships link MSUD and the Gene node. 

In [21]:
## create scoring table for Genes (output nodes)
q2_scoring = scoring_output(q2_r_paths_table, q2_type)

q2_scoring.head(10)

Unnamed: 0,score
CAT,0.117117
BCAT1,0.108108
BCAT2,0.099099
PPIB,0.099099
INS,0.081081
CS,0.081081
CASP14,0.072072
TLN1,0.072072
SLC25A5,0.063063
PTS,0.063063


Different knowledge sources (APIs) were called in different parts of the query. 

In the first part of the query (maple syrup urine disease &rarr; ChemicalSubstance), the following APIs returned results and the following predicates (semantic relationships) were found.

In [22]:
## show that the APIs use different predicates, one each
q2_r_paths_table[['pred1_api', 'pred1']].drop_duplicates().sort_values(by = ['pred1_api', 'pred1'])

Unnamed: 0,pred1_api,pred1
5166,Automat CORD19 Scigraph API,related_to
5148,Automat HMDB API,related_to


In the second part of the query (ChemicalSubstance &rarr; Gene), the following APIs returned results and the following predicates (semantic relationships) were found.

In [23]:
## show that the APIs use different predicates
q2_r_paths_table[['pred2_api', 'pred2']].drop_duplicates().sort_values(by = ['pred2_api', 'pred2'])

Unnamed: 0,pred2_api,pred2
19162,Automat CHEMBIO API,related_to
5254,Automat CORD19 Scibite API,related_to
5252,Automat CORD19 Scigraph API,related_to
7661,Automat HMDB API,related_to
22615,Automat PHAROS API,related_to
5148,CORD Chemical API,related_to
6073,DGIdb API,physically_interacts_with
22892,MyChem.info API,metabolic_processing_affected_by
6072,MyChem.info API,physically_interacts_with


[TO DO: Actual code depends on graph object (ReasonerStd object or Networkx MultiDiGraph) used.]

We then add the scores and score provenance information to the graph object that will be returned to the ARS. These are stored as node attributes on the Gene nodes of the "answer graph". 

In [24]:
# nx.set_node_attributes(q2_r_graph, values = q2_scoring['score'].to_dict(), \
#                        name = 'score')
# nx.set_node_attributes(q2_r_graph, values = '(0-1]', name = 'score_range')
# nx.set_node_attributes(q2_r_graph, values = 'normalized path count', name = 'score_method')
# ## what is a good score, larger or smaller? 
# nx.set_node_attributes(q2_r_graph, values = 'larger', name = 'score_better_direction')  

# ## example of Gene node object with score-related attributes
# q2_r_graph.nodes()['BCAT2']

## Compare answers to known genes

[From the NIH](https://ghr.nlm.nih.gov/condition/maple-syrup-urine-disease#genes), maple syrup urine disease (MSUD) is caused by mutations in the genes **BCKDHA**, **BCKDHB**, and **DBT**. 

These three genes are the **top-scored results of the query `Disease` MSUD &rarr; `PhenotypicFeature` &rarr; `Gene`**. The information supporting this answer came from multiple APIs. 
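
One way to check the "top-scored" claim is to rank the genes by score. The sketch below uses the ten scores shown in the scoring output of Step 2; adding a `rank` column with `method="min"` is an illustrative choice, not something the notebook itself does.

```python
import pandas as pd

# Top-10 scores copied from the Step 2 scoring output in this notebook
top10 = pd.DataFrame(
    {"score": [0.871795, 0.871795, 0.846154, 0.615385, 0.551282,
               0.512821, 0.448718, 0.448718, 0.448718, 0.448718]},
    index=["BCKDHA", "DBT", "BCKDHB", "DLD", "NDUFV1",
           "NDUFS1", "NDUFA1", "NDUFS7", "PCCB", "PCCA"],
)
known = ["BCKDHA", "BCKDHB", "DBT"]

# method="min" gives tied genes the same (best) rank
top10["rank"] = top10["score"].rank(ascending=False, method="min").astype(int)
known_ranks = top10.loc[known, "rank"]
print(known_ranks)
```

BCKDHA and DBT tie for rank 1 and BCKDHB follows at rank 3, so the three known genes occupy the top three positions.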

In [25]:
known_answers = ['BCKDHA', 'BCKDHB', 'DBT']

In [26]:
print('Scoring from the MSUD -> PhenotypicFeature -> Gene query')

## reset index to show placement of genes
q1_scoring_df = q1_scoring.reset_index().rename(columns = {'index': 'output_name'})
q1_scoring_df[q1_scoring_df.output_name.isin(known_answers)]

## show APIs/sources for the underlying info for these answers
q1_r_paths_table[q1_r_paths_table.output_name.isin(known_answers)][['pred1_api']].drop_duplicates()
q1_r_paths_table[q1_r_paths_table.output_name.isin(known_answers)][['pred2_api']].drop_duplicates()

Scoring from the MSUD -> PhenotypicFeature -> Gene query


Unnamed: 0,output_name,score
0,BCKDHA,0.871795
1,DBT,0.871795
2,BCKDHB,0.846154


Unnamed: 0,pred1_api
52,BioLink API
54,mydisease.info API


Unnamed: 0,pred2_api
52,BioLink API
55,EBIgene2phenotype API


Two of the genes (**BCKDHA**, **BCKDHB**) are found in the query `Disease` MSUD &rarr; `ChemicalSubstance` &rarr; `Gene`. However, they are not highly scored. This is likely because:
* ChemicalSubstance intermediates included chemicals that are involved in many biological processes (and would therefore be annotated to many genes)
* the causal genes for this disease act on branched-chain-amino-acid-derived alpha-keto-acids (and would therefore not be annotated to many chemicals). 

The information supporting those answers came from the Automat HMDB API. 

In [27]:
print('Scoring from the MSUD -> ChemicalSubstance -> Gene query')

## reset index to show placement of genes
q2_scoring_df = q2_scoring.reset_index().rename(columns = {'index': 'output_name'})
q2_scoring_df[q2_scoring_df.output_name.isin(known_answers)]

## show APIs/sources for the underlying info for these answers
q2_r_paths_table[q2_r_paths_table.output_name.isin(known_answers)][['pred1_api']].drop_duplicates()
q2_r_paths_table[q2_r_paths_table.output_name.isin(known_answers)][['pred2_api']].drop_duplicates()

Scoring from the MSUD -> ChemicalSubstance -> Gene query


Unnamed: 0,output_name,score
125,BCKDHA,0.027027
184,BCKDHB,0.027027


Unnamed: 0,pred1_api
55761,Automat HMDB API


Unnamed: 0,pred2_api
55761,Automat HMDB API


Other interesting, high-ranked genes returned by the queries include:
* **DLD** encodes a protein that is [a member of the same enzyme complex as the products of the three known MSUD genes](https://ghr.nlm.nih.gov/condition/dihydrolipoamide-dehydrogenase-deficiency#genes). It is also a member of several other enzyme complexes. **DLD** mutations can result in dihydrolipoamide dehydrogenase deficiency (previously known as a type of maple syrup urine disease), a disease that [has some overlapping features with MSUD but also other, more severe phenotypes](https://www.omim.org/entry/248600). 
* **BCAT1** and **BCAT2** are part of the same process (branched-chain amino acid catabolism) as the three known MSUD genes; they both encode proteins that act in an earlier step of the process. [This reference from 2019](https://pubmed.ncbi.nlm.nih.gov/31177572/) describes how **BCAT2** deficiencies (caused by gene mutations) appear to have some similarities and differences from MSUD. 

[This reference](https://academic.oup.com/hmg/article/23/R1/R1/2900649) explains the role of the BCAT genes and MSUD genes in branched-chain amino acid metabolism. Note that the **BCAT** gene products participate in the first step of the process (amino acid -> alpha-keto-acid). The **BCKDHA** (E1-alpha subunit), **BCKDHB** (E1-beta subunit), **DBT** (E2 subunit), and **DLD** (E3 subunit) gene products participate in the next (rate-limiting) step of the process. 

The code blocks below show the genes returned by the queries and the APIs that the underlying information supporting those genes came from. 

In [28]:
other_interesting_genes = ['DLD', 'BCAT1', 'BCAT2']

print('Scoring from the MSUD -> PhenotypicFeature -> Gene query')

q1_scoring_df[q1_scoring_df.output_name.isin(other_interesting_genes)]
## show APIs/sources for the underlying info for these answers
q1_r_paths_table[q1_r_paths_table.output_name.isin(other_interesting_genes)][['pred1_api']].drop_duplicates()
q1_r_paths_table[q1_r_paths_table.output_name.isin(other_interesting_genes)][['pred2_api']].drop_duplicates()

Scoring from the MSUD -> PhenotypicFeature -> Gene query


Unnamed: 0,output_name,score
3,DLD,0.615385


Unnamed: 0,pred1_api
1222,BioLink API
1223,mydisease.info API


Unnamed: 0,pred2_api
1222,EBIgene2phenotype API
1235,BioLink API


In [29]:
print('Scoring from the MSUD -> ChemicalSubstance -> Gene query')

q2_scoring_df[q2_scoring_df.output_name.isin(other_interesting_genes)]
## show APIs/sources for the underlying info for these answers
q2_r_paths_table[q2_r_paths_table.output_name.isin(other_interesting_genes)][['pred1_api']].drop_duplicates()
q2_r_paths_table[q2_r_paths_table.output_name.isin(other_interesting_genes)][['pred2_api']].drop_duplicates()

Scoring from the MSUD -> ChemicalSubstance -> Gene query


Unnamed: 0,output_name,score
1,BCAT1,0.108108
2,BCAT2,0.099099
741,DLD,0.009009


Unnamed: 0,pred1_api
22889,Automat HMDB API


Unnamed: 0,pred2_api
22889,Automat HMDB API
22891,MyChem.info API
22893,DGIdb API
