# Translator Use Case SARS (as COVID-19 proxy)

## Introduction

The Translator Use Case Question #2 is:    

> What genes or proteins may be associated with symptoms of a disease X (such as based on drugs that are currently used to treat that disease, etc)?

<br>

BioThings Explorer (BTE) can answer two classes of queries: "EXPLAIN" and "PREDICT". This question fits the PREDICT template of starting with **a specific biomedical entity** (a specific `Disease` X) and finding relationships with **one biomedical entity type** (like `PhenotypicFeature` or `Gene`).
* Note that currently a `Protein` biomedical entity type is not implemented in BTE. Instead, protein-coding and some non-coding gene information is implemented as the `Gene` type. 

<br>

We will use severe acute respiratory syndrome (SARS), as a proxy for COVID-19, as our specific disease of interest. We address the question above using the query: `Disease` SARS &rarr; `PhenotypicFeature` &rarr; `Gene`.    
* In other words, starting with the Disease SARS, find entities of type PhenotypicFeature associated with this disease, then find the entities of type Gene associated with those PhenotypicFeatures.       
* This query will return a graph object with entities as nodes and relationships as edges. We then use edge provenance information to **filter** the results. For each Gene node, we use the number of unique paths from SARS (input node) to that node to **score** it. The scores can then be used to sort the results.  
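
To make the two-hop query shape concrete, here is a minimal sketch of how paths of the form input &rarr; intermediate &rarr; output can be enumerated from two edge lists. It does not call BTE; the function name `two_hop_paths` and all node names (`PHENO_1`, `GENE_A`, ...) are placeholders invented for illustration, not results from the query.

```python
from collections import defaultdict

# Toy edge lists with placeholder names (real edges come from BTE's API calls).
disease_to_phenotype = [
    ("SARS", "PHENO_1"),
    ("SARS", "PHENO_2"),
    ("SARS", "PHENO_3"),
]
phenotype_to_gene = [
    ("PHENO_1", "GENE_A"),
    ("PHENO_3", "GENE_A"),
    ("PHENO_2", "GENE_B"),
    ("PHENO_9", "GENE_C"),  # PHENO_9 is not linked to SARS, so GENE_C is unreachable
]


def two_hop_paths(first_hop, second_hop, start):
    """Return every (start, intermediate, output) path, in first-hop order."""
    targets_by_source = defaultdict(list)
    for source, target in second_hop:
        targets_by_source[source].append(target)

    paths = []
    for source, intermediate in first_hop:
        if source != start:
            continue
        for output in targets_by_source.get(intermediate, []):
            paths.append((source, intermediate, output))
    return paths


paths = two_hop_paths(disease_to_phenotype, phenotype_to_gene, "SARS")
for path in paths:
    print(" -> ".join(path))
# SARS -> PHENO_1 -> GENE_A
# SARS -> PHENO_2 -> GENE_B
# SARS -> PHENO_3 -> GENE_A
```

Each printed line corresponds to one row of the path table used later for filtering and scoring.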

## Step 0: Load BTE modules, notebook functions

In [11]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## show time that this notebook was executed 
from datetime import datetime

## packages to work with objects 
import re

In [12]:
## functions to add to modules?
def hint_display(query, hint_result):
    """
    show the name and type of all results returned by the query
    
    :param: query: string used in hint query
    :param: hint_result: object returned from hint query, a dictionary of lists of dictionaries
    
    Returns: None
    """
    ## function needs to be rewritten if it's going to give the exact index of each object within its type 
    display = ['type', 'name']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## basically if it's not empty
            for items in result:
                concise_results.append((items[display[0]], items[display[1]]))
    print('There are {total} BioThings objects returned for {ht}:'.format(\
                total = len(concise_results), ht = query))
    for display_info in concise_results:
        print(display_info)

In [13]:
def filter_table(df):
    """
    use _source and _method columns to remove rows (paths) from the dataframe
    :param: pandas dataframe containing results from BTE FindConnection module, in table form
    
    Returns: filtered dataframe
    """
    ## note: still needs checking with EXPLAIN queries
    ## key is the string to match to column, value is a list of strings to match to column values
    filter_out = {'_source': ['SEMMED', 'CTD', 'ctd', 'omia']   
#                   '_method': []  ## currently no method stuff I want to filter out
                 }
    ## SEMMED: text mining results wrong for PhenotypicFeature -> Gene
    ## CTD/ctd: results odd for MSUD -> ChemicalSubstance
    ## omia: results wrong or discontinued gene IDs for PhenotypicFeature -> Gene
    
    
    df_temp = df.copy()  ## so the original df isn't modified in-place
    for key,val in filter_out.items():
        ## find columns that match the key string
        columns = [i for i in df_temp.columns if key in i]
        ## iterate through each column
        for col in columns:
            ## iterate through each value to take out, check if string CONTAINS match. 
            ## only keep rows that don't contain the value
            for i in val:
                ## na = False: rows with missing provenance count as non-matching and are kept
                df_temp = df_temp[~ df_temp[col].str.contains(i, na = False)]

    return df_temp

In [4]:
def scoring_output(df, q_type):
    """
    score based on flattened edges, equal to number of unique paths between input and output
        node(s) divided by the maximum number of paths possible between those node types
    :param: pandas dataframe containing results from BTE FindConnection module, in table form
    :param: string describing type of query (Predict or Explain)
    
    Returns: pandas dataframe with a single 'score' column, index is output_name
    """
    ## note: still needs code to work with EXPLAIN queries
    df_temp = df.copy()
    
    ## equivalent of flattening edges, remove edge provenance and remove duplicate rows
    columns_to_drop = [col for col in df_temp.columns if 'pred' in col]
    df_temp.drop(columns = columns_to_drop, inplace = True)
    df_temp.drop_duplicates(inplace = True)
    
    if q_type == "Predict":
        scores = df_temp.output_name.value_counts()

        ## figure out the other node layers, besides the output nodes
        ## currently the column with input node name is named "input" only. 
        ## other node name columns have _name suffixes
        temp_list = []
        for col in df_temp.columns:
            if col != 'output_name':
                if ('_name' in col) or (col == 'input'):
                    temp_list.append(col)    

        ## calculate maximum number of paths to output, using other node layers
        max_paths = 1
        for col in temp_list:
            max_paths *= df_temp[col].nunique()

        ## normalize scores by dividing each by max number of paths
        scores = scores / max_paths

        ## return scores as pandas dataframe
        return scores.to_frame(name = 'score')

In [5]:
## record when cell blocks are executed
print('The time that this notebook was executed is...')
print('Local time (PST, West Coast USA): ')
print(datetime.now())
print('UTC time: ')
print(datetime.utcnow())

The time that this notebook was executed is...
Local time (PST, West Coast USA): 
2020-09-04 16:15:57.848837
UTC time: 
2020-09-04 23:15:57.848993


## Step 1: Find representation of "coronavirus" in BTE

In this step, BioThings Explorer translates our query string "coronavirus" into BioThings objects, which contain mappings to many common identifiers. We then pick the BioThings object that best matches what we want (here, the `Disease` SARS). 

Generally, the top result returned by the Hint module for your BioThings type of interest will match what you want, but you should confirm that using the identifiers shown. 


> BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. **Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [6]:
ht = Hint()  ## neater way to call this BTE module

## the human user gives this input
corona_starting_str = "coronavirus"

corona_hint = ht.query(corona_starting_str)
## the object is a dictionary with keys = set strings, values = list of dictionaries. 
hint_display(corona_starting_str, corona_hint)

There are 6 BioThings objects returned for coronavirus:
('Gene', 'human coronavirus sensitivity')
('ChemicalSubstance', 'AZD1222')
('Disease', 'COVID-19')
('Disease', 'severe acute respiratory syndrome')
('Disease', 'Orthocoronavirinae infectious disease')
('PhenotypicFeature', 'Susceptibility to coronavirus 229e')


Note: the query failed to retrieve Disease &rarr; PhenotypicFeatures for COVID-19 (sibling of SARS in Mondo) and Orthocoronavirinae infectious disease (parent of COVID-19 and SARS). 

So we'll pick the SARS `Disease` choice (indexed at 1) for our query. We can look at identifier mappings inside this BioThings object. 
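
Hard-coding an index works for a single run, but the order of Hint results could change between runs. As a hedged alternative, the sketch below looks the object up by name. The dictionary `mock_hint` is a stand-in that keeps only the `type` and `name` fields printed above, and `pick_by_name` is a hypothetical helper, not part of BTE.

```python
# Stand-in for the Hint result, reduced to the fields shown by hint_display.
mock_hint = {
    "Gene": [{"type": "Gene", "name": "human coronavirus sensitivity"}],
    "ChemicalSubstance": [{"type": "ChemicalSubstance", "name": "AZD1222"}],
    "Disease": [
        {"type": "Disease", "name": "COVID-19"},
        {"type": "Disease", "name": "severe acute respiratory syndrome"},
        {"type": "Disease", "name": "Orthocoronavirinae infectious disease"},
    ],
    "PhenotypicFeature": [
        {"type": "PhenotypicFeature", "name": "Susceptibility to coronavirus 229e"}
    ],
}


def pick_by_name(hint_result, bt_type, name):
    """Hypothetical helper: return (index, object) for the first exact,
    case-insensitive name match within one BioThings type."""
    wanted = name.lower()
    for idx, obj in enumerate(hint_result.get(bt_type, [])):
        if obj.get("name", "").lower() == wanted:
            return idx, obj
    raise LookupError("no {0} named {1!r}".format(bt_type, name))


sars_idx, sars_obj = pick_by_name(
    mock_hint, "Disease", "severe acute respiratory syndrome"
)
print(sars_idx, sars_obj["name"])
# 1 severe acute respiratory syndrome
```

The identifiers inside the chosen object should still be checked by eye, as described above.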

In [24]:
## the human user makes this choice, gives this input
corona_choice_type = 'Disease'
corona_choice_idx = 1

corona_hint_obj = corona_hint[corona_choice_type][corona_choice_idx]  
corona_hint_obj
## these inner dictionaries are keys = id type, 
##       values = curie, normal string, or dictionary (for the key 'primary')

{'MONDO': 'MONDO:0005091',
 'DOID': 'DOID:2945',
 'UMLS': 'C1175175',
 'name': 'severe acute respiratory syndrome',
 'MESH': 'D045169',
 'ORPHANET': '140896',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0005091'},
 'display': 'MONDO(MONDO:0005091) DOID(DOID:2945) ORPHANET(140896) UMLS(C1175175) MESH(D045169) name(severe acute respiratory syndrome)',
 'type': 'Disease'}

## Step 2: SARS &rarr; PhenotypicFeature &rarr; Gene


In this section, we dynamically generate a knowledge graph with paths connecting SARS to genes *using PhenotypicFeature intermediates* (representing the symptoms part of the question).  

BTE performs the **query path planning** and **query path execution** by deconstructing the query into individual API calls, executing those API calls, and then assembling the results.

The code block below takes about 1 minute to run.   
Ran: 21:35:05 2020-09-06 local time (PST)    

In [25]:
## the human user gives this input
q1_output_type = 'Gene'
q1_intermediate = 'PhenotypicFeature'

q1 = FindConnection(input_obj = corona_hint_obj,\
                     output_obj = q1_output_type, \
                    intermediate_nodes = q1_intermediate)
q1.connect(verbose = True)


BTE will find paths that join 'severe acute respiratory syndrome' and 'Gene'.                   Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: PhenotypicFeature




==== Step #1: Query path planning ====

Because severe acute respiratory syndrome is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'PhenotypicFeature' as output

BTE found 3 apis:

API 1. biolink(1 API call)
API 2. semmed_disease(11 API calls)
API 3. mydisease(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 3.1: https://mydisease.info/v1/query?fields=hpo.phenotype_related_to_disease (POST -d q=140896&scopes=hpo.orphanet)
API 2.1: https://biothings.ncats.io/semmed/query?fields=affected_by (POST -d q=C1175175&scopes=umls)
API 2.6: https://biothings.ncats.io/semmed/query?fields=prevented_by (POST -d q=C1175175&scopes=umls)
API 

API 2.14: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0002721/genes?rows=200&direct=true
API 2.3: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0001626/genes?rows=200&direct=true
API 2.7: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0002664/genes?rows=200&direct=true
API 2.12: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0012735/genes?rows=200&direct=true
API 2.18: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0000819/genes?rows=200
API 2.1: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0025439/genes?rows=200&direct=true
API 2.6: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0012418/genes?rows=200&direct=true
API 2.4: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0006528/genes?rows=200&direct=true
API 2.28: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0012735/genes?rows=200
API 2.10: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0001

In [26]:
q1_r_graph = q1.fc.G
q1_r_paths_table = q1.display_table_view()

q1_type = re.findall("dispatcher.([a-zA-Z]+)'", str(type(q1.fc)))
q1_type = "".join(q1_type)  ## convert to string

q1 = None  ## clear memory

Different knowledge sources (APIs) were called in different parts of the query. 

In the first part of the query (Disease &rarr; PhenotypicFeature), the cell below shows which APIs returned results, which predicates (semantic relationships) they reported between SARS and PhenotypicFeatures, and which symptoms/PhenotypicFeatures were found. 

In [171]:
## show that the APIs use different predicates
q1_r_paths_table[['pred1_api', 'pred1_source', 'pred1']].drop_duplicates().sort_values(by = ['pred1_api', 'pred1'])
q1_r_paths_table['node1_name'].unique()

Unnamed: 0,pred1_api,pred1_source,pred1
0,BioLink API,hpoa,has_phenotype
1,mydisease.info API,hpo,related_to


array(['DECREASED IMMUNE FUNCTION', 'HEADACHE', 'DIABETES MELLITUS',
       'FEVER', 'CHRONIC LUNG DISEASE', 'ABNORMAL TISSUE MASS',
       'BREATHING DIFFICULTIES', 'MUSCLE ACHE', 'ABNORMAL BREATHING',
       'COUGH',
       'RESPIRATORY DISTRESS NECESSITATING MECHANICAL VENTILATION',
       'HYPOXEMIA', 'ABNORMALITY OF THE CARDIOVASCULAR SYSTEM',
       'ACUTE INFECTIOUS PNEUMONIA', 'ACUTE KIDNEY FAILURE',
       'PHARYNGITIS'], dtype=object)

In the second part of the query (PhenotypicFeature &rarr; Gene), the following APIs returned results and the following predicates (semantic relationships) were found between PhenotypicFeatures and Genes.

In [28]:
## show that the APIs use different predicates
q1_r_paths_table[['pred2_api', 'pred2_source', 'pred2']].drop_duplicates().sort_values(by = ['pred2_api', 'pred2'])

Unnamed: 0,pred2_api,pred2_source,pred2
628,BioLink API,gwascatalog,contributes_to_condition
2,BioLink API,"omim,orphanet,hpoa",has_phenotype
12,BioLink API,"omim,orphanet,clinvar,hpoa",has_phenotype
20,BioLink API,"omim,coriell,orphanet,clinvar,hpoa",has_phenotype
22,BioLink API,"omim,coriell,hpoa",has_phenotype
36,BioLink API,"orphanet,hpoa",has_phenotype
44,BioLink API,"omim,clinvar,hpoa",has_phenotype
106,BioLink API,"clinvar,hpoa",has_phenotype
144,BioLink API,"orphanet,clinvar,hpoa",has_phenotype
148,BioLink API,"omim,coriell,orphanet,hpoa",has_phenotype


Finally, we can see the number of PhenotypicFeatures that were linked to both SARS and to a Gene, the number of Genes returned as output nodes, and the total number of paths from SARS to Gene nodes. 

In [30]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q1_r_paths_table.node1_name.nunique(), q1_intermediate, corona_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q1_r_paths_table.output_name.nunique(), q1_intermediate))

## show number of paths from disease to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 16 unique PhenotypicFeatures for coronavirus.
There are 985 unique Genes linked to those PhenotypicFeatures.
There are 3140 unique paths.


### Filtering and scoring

Filtering uses edge provenance, such as the source a relationship came from and the method used to make the association, to remove edges (which can also remove nodes in the process). 

However, in this particular example, no edges will be removed (we consider almost all the information reliable). 
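
The filtering rule can be sketched without pandas as follows. This is an illustration of the substring-matching idea in `filter_table`, not a replacement for it: rows are plain dictionaries, `keep_row` is a hypothetical helper, and the example rows (including the one with a `SEMMED` source) are made up to show a removal.

```python
# Same filter list as filter_table: drop a row if any *_source column
# contains one of these substrings.
FILTER_OUT = {"_source": ["SEMMED", "CTD", "ctd", "omia"]}


def keep_row(row, filter_out=FILTER_OUT):
    """Return False if any matching provenance column contains an unwanted term."""
    for key, unwanted in filter_out.items():
        for column, value in row.items():
            # Only inspect columns whose name contains the key; skip missing values.
            if key in column and isinstance(value, str):
                if any(term in value for term in unwanted):
                    return False
    return True


# Made-up rows for illustration only.
rows = [
    {"pred1_source": "hpoa", "pred2_source": "omim,orphanet,hpoa", "output_name": "GENE_A"},
    {"pred1_source": "hpo", "pred2_source": "SEMMED", "output_name": "GENE_B"},
    {"pred1_source": "hpoa", "pred2_source": None, "output_name": "GENE_C"},
]

kept = [row for row in rows if keep_row(row)]
print([row["output_name"] for row in kept])
# ['GENE_A', 'GENE_C']
```

In this sketch a row with missing provenance is kept rather than dropped; whether that is the right default is a judgment call.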

In [31]:
q1_r_paths_table = filter_table(q1_r_paths_table)

## show number of paths from disease to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 3140 unique paths.


The scoring process relies on the assumption that the user would be most interested in Genes that share many phenotypes (intermediate nodes) with SARS. 

To score individual Gene nodes, we first take a copy of the knowledge graph (KG) and flatten the multi-edges into single edges. 

We then count the number of paths from the SARS node to each Gene node. This count is also equivalent to the number of phenotypes/symptoms (PhenotypicFeatures) that SARS and the Gene have in common, and to the node degree of the Gene in the flattened KG.    

Finally, we "normalize" the score by dividing those counts by the maximum possible number of paths from the SARS node to a Gene node. In this case (with one input node and one intermediate node type), this is equivalent to the number of PhenotypicFeature nodes. 

We can then see the top-scored nodes. A score of 1 would mean that the maximum number of PhenotypicFeatures were linked to both SARS and to that Gene. A score closer to 0 would mean that only a few PhenotypicFeatures were linked to both SARS and to that Gene. 
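
As a worked illustration: this query found 16 unique PhenotypicFeatures, so a score of 0.375 corresponds to 6 shared PhenotypicFeatures (6/16) and a score of 0.0625 to a single shared PhenotypicFeature (1/16). The sketch below reproduces the same calculation on a toy set of paths using only the standard library. The function `score_outputs` and all node names are placeholders for illustration.

```python
from collections import Counter


def score_outputs(paths):
    """Score each output node by (unique paths to it) / (maximum possible paths).

    paths: iterable of (input, intermediate, output) tuples; duplicates stand
    for the same path reported by several APIs or predicates.
    """
    unique_paths = set(paths)  # flatten multi-edges into single edges
    counts = Counter(output for _, _, output in unique_paths)

    # Maximum possible paths = product of unique node counts in the other layers.
    n_inputs = len({p[0] for p in unique_paths})
    n_intermediates = len({p[1] for p in unique_paths})
    max_paths = n_inputs * n_intermediates

    return {output: count / max_paths for output, count in counts.items()}


toy_paths = [
    ("SARS", "PHENO_1", "GENE_A"),
    ("SARS", "PHENO_1", "GENE_A"),  # same path from a second source
    ("SARS", "PHENO_2", "GENE_A"),
    ("SARS", "PHENO_3", "GENE_A"),
    ("SARS", "PHENO_4", "GENE_B"),
]

scores = score_outputs(toy_paths)
print(sorted(scores.items()))
# [('GENE_A', 0.75), ('GENE_B', 0.25)]
```

Here `GENE_A` shares 3 of the 4 toy phenotypes with the input and `GENE_B` shares 1, and the duplicated path is counted once.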

In [32]:
## create scoring table for Genes (output nodes)
q1_scoring = scoring_output(q1_r_paths_table, q1_type)

q1_scoring.head(10)

Unnamed: 0,score
HBB,0.375
MTHFR,0.375
CTNNB1,0.3125
APC,0.3125
CSF2RB,0.3125
WT1,0.3125
ADA2,0.3125
GBA,0.3125
PDGFRA,0.3125
RET,0.3125


[TO DO: Actual code depends on graph object (ReasonerStd object or Networkx MultiDiGraph) used.]

We then add the scores and score provenance information to the graph object that will be returned to the ARS. These are stored as node attributes on the Gene nodes of the "answer graph". 

In [None]:
# ## q1_scoring is a dataframe, so select the 'score' column before converting to a dict
# nx.set_node_attributes(q1_r_graph, values = q1_scoring['score'].to_dict(), \
#                        name = 'score')
# nx.set_node_attributes(q1_r_graph, values = '(0-1]', name = 'score_range')
# nx.set_node_attributes(q1_r_graph, values = 'normalized path count', name = 'score_method')
# ## what is a good score, larger or smaller? 
# nx.set_node_attributes(q1_r_graph, values = 'larger', name = 'score_better_direction')  

# ## example of Gene node object with score-related attributes
# q1_r_graph.nodes()['NDUFS1']

## Compare answers to bradykinin article

The ["bradykinin storm" hypothesis article](https://elifesciences.org/articles/59177) describes many gene products that may be linked to COVID-19 symptoms. 

We compared BTE's answers to the genes of the biological entities mentioned in the article. 

In [146]:
article_genes = ['AGT', ## AGT = angiotensin precursor. 
                 'AGTR1', 'AGTR2', 'MAS1', ## receptors angiotensin binds to
                 'REN', 'ACE', 'ACE2',  ## enzymes working on angiotensin II precursors
                 'KNG1', 'KLKB1', ## Bradykinin precursor
                 'BDKRB2', 'BDKRB1', ## receptors for bradykinin
                 'NOS1', 'NOS2',  ## nitric oxide synthases: missing NOS3
                 'VDR', 'CYP24A1', 'CYP3A4',  ## vitamin D related
                 'KLK1', 'KLK2', 'KLK3', 'KLK4', 'KLK5', ## KLK = enzymes working on bradykinin precursor 
                 'KLK6', 'KLK7', 'KLK8', 'KLK9', 'KLK10', 
                 'KLK11', 'KLK12', 'KLK13', 'KLK14', 'KLK15', 
                 'F12',  ## F12: factor XII related to clotting and bradykinin precursor
                 'SERPING1', ## protease inhibitor, involved in kinin, clotting, complement pathways
                 'CPN1', ## protease linked to kinins, angioedema
                 'HAS1', 'HAS2', 'HAS3', ## hyaluronic acid synthases
                 'HYAL1', 'HYAL2',  ## degrade hyaluronic acid 
                 'CD44',  ## related to hyaluronic acid and immune system
                 'TMSB4X', ## linked to ACE, tissue regeneration
                 'IL2', ## link to CD44, capillary leak syndrome (called VLS in the paper)
                 'APLN', 'APLNR',  ## apelin and its receptor. related to RAS, heart 
                 'MME', ## related to degrading bradykinin, apelin 
                 'NFKB1', 'NFKB2', 'RELA', 'RELB', 'REL', ## genes for NF-kB complex
                 'IKBKG', ## part of IKB kinase complex that activates NFkappaB, degraded by virus 3CLpro protease?
                 'DPP4' ## linked to MERS and immune system
                ]  

Of the gene entities discussed in the bradykinin article, the following are in our results:     
* **NOS1**, linked to cough
* **SERPING1**, linked to breathing difficulties and abnormal breathing
* **NFKB1**, **NFKB2**, and **IKBKG**, linked to decreased immune function

In [148]:
print('Scoring from SARS -> PhenotypicFeature -> Gene query')

## reset index to show placement of genes
q1_scoring_df = q1_scoring.reset_index().rename(columns = {'index': 'output_name'})
q1_scoring_df[q1_scoring_df.output_name.isin(article_genes)]

## show APIs/sources for the underlying info for these answers
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)][['pred1_api']].drop_duplicates()
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)][['pred2_api']].drop_duplicates()

Scoring from SARS -> PhenotypicFeature -> Gene query


Unnamed: 0,output_name,score
177,SERPING1,0.125
410,IKBKG,0.0625
444,NFKB2,0.0625
540,NOS1,0.0625
896,NFKB1,0.0625


Unnamed: 0,pred1_api
0,BioLink API
1,mydisease.info API


Unnamed: 0,pred2_api
0,EBIgene2phenotype API
2,BioLink API


In [152]:
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)].node1_name.drop_duplicates()
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)]

0       DECREASED IMMUNE FUNCTION
1134                        COUGH
1812       BREATHING DIFFICULTIES
1814           ABNORMAL BREATHING
Name: node1_name, dtype: object

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,pred1_method,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,pred2_method,output_type,output_name,output_id
0,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,related_to,EBI,EBIgene2phenotype API,,,Gene,IKBKG,NCBIGene:8517
1,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,related_to,EBI,EBIgene2phenotype API,,,Gene,IKBKG,NCBIGene:8517
2,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,IKBKG,NCBIGene:8517
3,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,IKBKG,NCBIGene:8517
122,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,NFKB2,NCBIGene:4791
123,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,NFKB2,NCBIGene:4791
124,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,NFKB1,NCBIGene:4790
125,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,DECREASED IMMUNE FUNCTION,UMLS:C0021051,has_phenotype,"omim,orphanet,hpoa",BioLink API,,,Gene,NFKB1,NCBIGene:4790
1134,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,COUGH,UMLS:C0010200,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,NOS1,NCBIGene:4842
1135,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,COUGH,UMLS:C0010200,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,NOS1,NCBIGene:4842


### Top gene results

In [161]:
q1_scoring_df.head(20)

Unnamed: 0,output_name,score
0,HBB,0.375
1,MTHFR,0.375
2,CTNNB1,0.3125
3,APC,0.3125
4,CSF2RB,0.3125
5,WT1,0.3125
6,ADA2,0.3125
7,GBA,0.3125
8,PDGFRA,0.3125
9,RET,0.3125


Interesting top genes:   
* **CSF2RB** (shared subunit of the IL3, IL5, and GM-CSF receptors; these receptors are expressed on innate immune cells)   
* **ADA2** (related to immune system function, levels increased in several immune diseases and cancer)   
* **WAS** (related to clotting and immune deficiency, linked to Wiskott–Aldrich syndrome, interacts with PIK3R1)
* **SCNN1A**, **SCNN1G** (related to blood pressure, potassium/sodium levels)
* **ATP6** (mitochondrial gene. linked to neurodegenerative/cardiovascular disease, including Leigh syndrome. Leigh syndrome frequently happens after an event that uses up a lot of energy, including viral infection. Cardiac/neural effects thought to be related to energy production issues)
* **TRNL1** (mitochondrial gene; linked to several conditions, see https://ghr.nlm.nih.gov/gene/MT-TL1#conditions) 

Genes linked to the biology of the genes above (for background on GM-CSF, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4553090/):
* **RET** (related to cancer, enteric nervous system development, kidney development, GM-CSF-related genes)
* **JAK2** (linked to cancer and to GM-CSF immune receptors)
* **PIK3R1** (related to cancer, insulin resistance)


Not-so-interesting top genes:
* HBB (hemoglobin)    
* MTHFR (related to cancer, neuronal disease, vascular disease)    
* CTNNB1 (related to cancer, heart disease, cell adhesion)    
* APC (see above on CTNNB1)    
* WT1 (related to cancer)  
* GBA (mutations cause Gaucher's disease)
* PDGFRA (related to cancer)
* TP53 (cancer)
* LRRC56 (linked to primary ciliary dyskinesia, which involves frequent lung infections)
* DMPK (linked to myotonic dystrophy) 

In [172]:
q1_r_paths_table[q1_r_paths_table['output_name'] == 'CSF2RB']

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,pred1_method,node1_type,node1_name,node1_id,pred2,pred2_source,pred2_api,pred2_pubmed,pred2_method,output_type,output_name,output_id
1430,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,COUGH,UMLS:C0010200,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1431,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,COUGH,UMLS:C0010200,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1432,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,HYPOXEMIA,UMLS:C0700292,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1433,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,HYPOXEMIA,UMLS:C0700292,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1434,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,ABNORMAL BREATHING,UMLS:C0013404,has_phenotype,"omim,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1435,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,ABNORMAL BREATHING,UMLS:C0013404,has_phenotype,"omim,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1436,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,RESPIRATORY DISTRESS NECESSITATING MECHANICAL ...,UMLS:C4025279,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1437,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,RESPIRATORY DISTRESS NECESSITATING MECHANICAL ...,UMLS:C4025279,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1438,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,has_phenotype,hpoa,BioLink API,,,PhenotypicFeature,ACUTE INFECTIOUS PNEUMONIA,UMLS:C4023112,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439
1439,ACUTE RESPIRATORY CORONAVIRUS INFECTION,Disease,related_to,hpo,mydisease.info API,,TAS,PhenotypicFeature,ACUTE INFECTIOUS PNEUMONIA,UMLS:C4023112,has_phenotype,"orphanet,hpoa",BioLink API,,,Gene,CSF2RB,NCBIGene:1439


## ERROR: SARS &rarr; ChemicalSubstance &rarr; Gene

In this section, we dynamically generate a knowledge graph with paths connecting SARS to genes *with ChemicalSubstance intermediates* (representing the drugs part of the question).  

BTE performs the **query path planning** and **query path execution** by deconstructing the query into individual API calls, executing those API calls, and then assembling the results.

The following code block takes about 30 seconds to run. 

In [33]:
## the human user gives this input
q2_output_type = 'Gene'
q2_intermediate = 'ChemicalSubstance'

q2 = FindConnection(input_obj = corona_hint_obj,\
                     output_obj = q2_output_type, \
                    intermediate_nodes = q2_intermediate)
q2.connect(verbose = True)


BTE will find paths that join 'severe acute respiratory syndrome' and 'Gene'.                   Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: ChemicalSubstance




==== Step #1: Query path planning ====

Because severe acute respiratory syndrome is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'ChemicalSubstance' as output

BTE found 8 apis:

API 1. scigraph(1 API call)
API 2. pharos(1 API call)
API 3. mychem(2 API calls)
API 4. hmdb(1 API call)
API 5. cord_disease(1 API call)
API 6. semmed_disease(15 API calls)
API 7. scibite(1 API call)
API 8. mydisease(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 7.1: https://mydisease.info/v1/query?fields=ctd.chemical_related_to_disease (POST -d q=D045169&scopes=mondo.xrefs.mesh, disgenet.xrefs.mesh)
API 5.2: https://biothings.ncats.io/semmed/qu

API 7.36 dgidb failed
API 10.128: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:2955
API 10.126: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:16796
API 10.124: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:27899
API 10.123: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:95002
API 10.125: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:36066
API 7.28 dgidb failed
API 10.122: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:16526
API 10.121: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:50122
API 10.127: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:7789
API 10.129: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:16027
API 7.8: http://dgidb.genome.wustl.edu/api/v2/interactions.json?drugs=CHEMBL584
API 7.37 dgidb failed
API 4.54 ctd failed
API 4.15: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&i

API 7.118 dgidb failed
API 7.119 dgidb failed
API 10.185: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:4659
API 10.186: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:26340
API 4.124 ctd failed
API 4.125 ctd failed
API 10.190: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:31857
API 10.188: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:50266
API 4.126 ctd failed
API 7.120 dgidb failed
API 4.127 ctd failed
API 4.128 ctd failed
API 10.192: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:27690
API 10.191: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:48912
API 10.193: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:36067
API 10.195: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:26334
API 10.194: https://automat.renci.org/chembio/chemical_substance/gene/CHEBI:57433
API 4.28: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&form

API 8.49: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:7798
API 4.171 ctd failed
API 7.173 dgidb failed
API 4.85: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=C011501
API 4.86: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=D003401
API 8.50: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:29309
API 8.51: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:29194
API 4.75: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=D018984
API 8.52: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:50667
API 8.53: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:16811
API 7.174 dgidb failed
API 4.97: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=D053243
API 8.54: https://automat.renci.org

API 8.120: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:78161
API 8.122: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:16526
API 8.121: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:50122
API 8.123: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:95002
API 4.182: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=C099410
API 8.124: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:27899
API 8.125: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:36066
API 4.181: http://ctdbase.org/tools/batchQuery.go?inputType=chem&report=genes_curated&format=json&inputTerms=D064704
API 8.126: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:16796
API 8.132: https://automat.renci.org/cord19-scibite/chemical_substance/gene/CHEBI:15377
API 4.184: http://ctdbase.org/tools/batchQuery.go?inputType=ch

API 5.47: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:46740
API 5.46: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:34623
API 5.45: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:79736
API 5.44: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:68657
API 5.43: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:9062
API 5.48: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:31821
API 5.42: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:31858
API 5.40: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:17650
API 5.49: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:7798
API 5.51: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:29194
API 5.52: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:50667
API 5.53: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:16811
API 5.50: https://automat.renci.org/pharos/chemical_su

API 5.182: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:35222
API 5.180: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:25847
API 5.187: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:27584
API 7.137: http://dgidb.genome.wustl.edu/api/v2/interactions.json?drugs=CHEMBL196537
API 5.183: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:47916
API 5.181: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:36044
API 5.189: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:17246
API 5.198: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:63598
API 5.193: https://automat.renci.org/pharos/chemical_substance/gene/CHEBI:36067
API 1.2: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:35660
API 1.1: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:46761
API 1.6: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:26343
API 5.179: htt

API 7.192: http://dgidb.genome.wustl.edu/api/v2/interactions.json?drugs=CHEMBL529
API 1.79: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:24780
API 1.86: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:90705
API 1.83: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:57947
API 1.84: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:25434
API 1.80: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:58007
API 1.85: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:8382
API 7.179: http://dgidb.genome.wustl.edu/api/v2/interactions.json?drugs=CHEMBL91704
API 7.188: http://dgidb.genome.wustl.edu/api/v2/interactions.json?drugs=CHEMBL163
API 1.88: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:2639
API 1.87: https://automat.renci.org/cord19-scigraph/chemical_substance/gene/CHEBI:26335
API 1.90: https://automat.renci.org/cord19-scigrap

API 3.1: https://biothings.ncats.io/semmedchemical/query?fields=physically_interacts_with (POST -d q=C0244713,C0003995,C0008269,C0042287,C0027866,C0012091,C0243077,C0674432,C0007012,C0164613,C0282580,C0206443,C0002598,C0032623,C0298067,C0013227,C0771194,C0034496,C1956582,C0450442,C0042525,C1257994,C0257685,C0003451,C0007009,C0038960,C0012772,C0000618,C0392938,C0028333,C0939788,C0040880,C0024330,C0058516,C0389500,C0002932,C0066700,C0006280,C0012253,C0033554,C0663241,C0243192,C0283243,C0078809,C1101610,C0078794,C0041305,C0043553,C0003751,C0032521,C0078774,C0078787,C0599740,C0043505,C0016229,C0001869,C0033567,C0033613,C0700798,C0017977,C0142223,C0016745,C0002771,C0018340,C0042210,C1512792,C0026874,C0010286,C0061751,C0065972,C0171302,C0003320,C0036751,C0699572,C0042212,C0019588,C0072683,C0310663,C0014838,C0527189,C0041942,C0028910,C0038317,C0017110,C0002502,C0032346,C0389484,C0965617,C0005525,C0013879,C0065180,C0021747,C0020268,C1136254,C0034260,C0077394,C0043047,C0008947,C0728966,C0055147

API 6.152: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:23888
API 6.150: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:27889
API 6.149: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:64276
API 6.148: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:3561
API 6.153: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:46915
API 6.151: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:75984
API 6.157: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:48433
API 6.166: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:30185
API 6.160: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:67040
API 6.154: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:94796
API 6.162: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:63175
API 6.158: https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:50059
API 6.161: https://automat.renci.org/hmdb/chemical_substance/gene

API 3.9: https://biothings.ncats.io/semmedchemical/query?fields=produced_by (POST -d q=C0244713,C0003995,C0008269,C0042287,C0027866,C0012091,C0243077,C0674432,C0007012,C0164613,C0282580,C0206443,C0002598,C0032623,C0298067,C0013227,C0771194,C0034496,C1956582,C0450442,C0042525,C1257994,C0257685,C0003451,C0007009,C0038960,C0012772,C0000618,C0392938,C0028333,C0939788,C0040880,C0024330,C0058516,C0389500,C0002932,C0066700,C0006280,C0012253,C0033554,C0663241,C0243192,C0283243,C0078809,C1101610,C0078794,C0041305,C0043553,C0003751,C0032521,C0078774,C0078787,C0599740,C0043505,C0016229,C0001869,C0033567,C0033613,C0700798,C0017977,C0142223,C0016745,C0002771,C0018340,C0042210,C1512792,C0026874,C0010286,C0061751,C0065972,C0171302,C0003320,C0036751,C0699572,C0042212,C0019588,C0072683,C0310663,C0014838,C0527189,C0041942,C0028910,C0038317,C0017110,C0002502,C0032346,C0389484,C0965617,C0005525,C0013879,C0065180,C0021747,C0020268,C1136254,C0034260,C0077394,C0043047,C0008947,C0728966,C0055147,C0022634,C003

API 3.13: https://biothings.ncats.io/semmedchemical/query?fields=positively_regulated_by (POST -d q=C0244713,C0003995,C0008269,C0042287,C0027866,C0012091,C0243077,C0674432,C0007012,C0164613,C0282580,C0206443,C0002598,C0032623,C0298067,C0013227,C0771194,C0034496,C1956582,C0450442,C0042525,C1257994,C0257685,C0003451,C0007009,C0038960,C0012772,C0000618,C0392938,C0028333,C0939788,C0040880,C0024330,C0058516,C0389500,C0002932,C0066700,C0006280,C0012253,C0033554,C0663241,C0243192,C0283243,C0078809,C1101610,C0078794,C0041305,C0043553,C0003751,C0032521,C0078774,C0078787,C0599740,C0043505,C0016229,C0001869,C0033567,C0033613,C0700798,C0017977,C0142223,C0016745,C0002771,C0018340,C0042210,C1512792,C0026874,C0010286,C0061751,C0065972,C0171302,C0003320,C0036751,C0699572,C0042212,C0019588,C0072683,C0310663,C0014838,C0527189,C0041942,C0028910,C0038317,C0017110,C0002502,C0032346,C0389484,C0965617,C0005525,C0013879,C0065180,C0021747,C0020268,C1136254,C0034260,C0077394,C0043047,C0008947,C0728966,C0055147,

API 3.7: https://biothings.ncats.io/semmedchemical/query?fields=coexists_with (POST -d q=C0244713,C0003995,C0008269,C0042287,C0027866,C0012091,C0243077,C0674432,C0007012,C0164613,C0282580,C0206443,C0002598,C0032623,C0298067,C0013227,C0771194,C0034496,C1956582,C0450442,C0042525,C1257994,C0257685,C0003451,C0007009,C0038960,C0012772,C0000618,C0392938,C0028333,C0939788,C0040880,C0024330,C0058516,C0389500,C0002932,C0066700,C0006280,C0012253,C0033554,C0663241,C0243192,C0283243,C0078809,C1101610,C0078794,C0041305,C0043553,C0003751,C0032521,C0078774,C0078787,C0599740,C0043505,C0016229,C0001869,C0033567,C0033613,C0700798,C0017977,C0142223,C0016745,C0002771,C0018340,C0042210,C1512792,C0026874,C0010286,C0061751,C0065972,C0171302,C0003320,C0036751,C0699572,C0042212,C0019588,C0072683,C0310663,C0014838,C0527189,C0041942,C0028910,C0038317,C0017110,C0002502,C0032346,C0389484,C0965617,C0005525,C0013879,C0065180,C0021747,C0020268,C1136254,C0034260,C0077394,C0043047,C0008947,C0728966,C0055147,C0022634,C0

API 8.55 scibite: 2 hits
API 8.56 scibite: No hits
API 8.57 scibite: 2 hits
API 8.58 scibite: 2 hits
API 8.59 scibite: No hits
API 8.60 scibite: 5 hits
API 8.61 scibite: 15 hits
API 8.62 scibite: No hits
API 8.63 scibite: No hits
API 8.64 scibite: No hits
API 8.65 scibite: No hits
API 8.66 scibite: No hits
API 8.67 scibite: No hits
API 8.68 scibite: No hits
API 8.69 scibite: No hits
API 8.70 scibite: No hits
API 8.71 scibite: No hits
API 8.72 scibite: No hits
API 8.73 scibite: No hits
API 8.74 scibite: No hits
API 8.75 scibite: No hits
API 8.76 scibite: 3 hits
API 8.77 scibite: No hits
API 8.78 scibite: No hits
API 8.79 scibite: No hits
API 8.80 scibite: No hits
API 8.81 scibite: 2 hits
API 8.82 scibite: No hits
API 8.83 scibite: No hits
API 8.84 scibite: No hits
API 8.85 scibite: 6 hits
API 8.86 scibite: 29 hits
API 8.87 scibite: No hits
API 8.88 scibite: 8 hits
API 8.89 scibite: No hits
API 8.90 scibite: No hits
API 8.91 scibite: 1 hits
API 8.92 scibite: No hits
API 8.93 scibite: No 

API 5.190 pharos: 1 hits
API 5.191 pharos: No hits
API 5.192 pharos: 1 hits
API 5.193 pharos: No hits
API 5.194 pharos: No hits
API 5.195 pharos: No hits
API 5.196 pharos: No hits
API 5.197 pharos: No hits
API 5.198 pharos: No hits
API 5.199 pharos: 2 hits
API 5.200 pharos: No hits
API 3.2 semmed_chemical: 15391 hits
API 1.1 scigraph: 4 hits
API 1.2 scigraph: No hits
API 1.3 scigraph: 24 hits
API 1.4 scigraph: 30 hits
API 1.5 scigraph: 12 hits
API 1.6 scigraph: No hits
API 1.7 scigraph: 72 hits
API 1.8 scigraph: No hits
API 1.9 scigraph: 5 hits
API 1.10 scigraph: 6 hits
API 1.11 scigraph: 20 hits
API 1.12 scigraph: 5 hits
API 1.13 scigraph: No hits
API 1.14 scigraph: No hits
API 1.15 scigraph: No hits
API 1.16 scigraph: No hits
API 1.17 scigraph: No hits
API 1.18 scigraph: No hits
API 1.19 scigraph: 26 hits
API 1.20 scigraph: No hits
API 1.21 scigraph: 5 hits
API 1.22 scigraph: 33 hits
API 1.23 scigraph: 8 hits
API 1.24 scigraph: 2 hits
API 1.25 scigraph: No hits
API 1.26 scigraph: No 

API 6.78 hmdb: 129 hits
API 6.79 hmdb: No hits
API 6.80 hmdb: No hits
API 6.81 hmdb: No hits
API 6.82 hmdb: No hits
API 6.83 hmdb: 10 hits
API 6.84 hmdb: No hits
API 6.85 hmdb: 4 hits
API 6.86 hmdb: No hits
API 6.87 hmdb: No hits
API 6.88 hmdb: 12 hits
API 6.89 hmdb: No hits
API 6.90 hmdb: No hits
API 6.91 hmdb: 2 hits
API 6.92 hmdb: No hits
API 6.93 hmdb: No hits
API 6.94 hmdb: 1 hits
API 6.95 hmdb: No hits
API 6.96 hmdb: No hits
API 6.97 hmdb: No hits
API 6.98 hmdb: No hits
API 6.99 hmdb: No hits
API 6.100 hmdb: No hits
API 6.101 hmdb: No hits
API 6.102 hmdb: No hits
API 6.103 hmdb: No hits
API 6.104 hmdb: No hits
API 6.105 hmdb: No hits
API 6.106 hmdb: No hits
API 6.107 hmdb: No hits
API 6.108 hmdb: No hits
API 6.109 hmdb: No hits
API 6.110 hmdb: 1 hits
API 6.111 hmdb: No hits
API 6.112 hmdb: No hits
API 6.113 hmdb: No hits
API 6.114 hmdb: No hits
API 6.115 hmdb: No hits
API 6.116 hmdb: No hits
API 6.117 hmdb: No hits
API 6.118 hmdb: No hits
API 6.119 hmdb: No hits
API 6.120 hmdb: N

SyntaxError: invalid syntax (<string>, line 1)

In [16]:
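## keep the result graph and the table of paths (one row per path)
q2_r_graph = q2.fc.G
q2_r_paths_table = q2.display_table_view()

## extract the class name of the query object from its type string
q2_type = re.findall("dispatcher.([a-zA-Z]+)'", str(type(q2.fc)))
q2_type = "".join(q2_type)  ## convert to string

q2 = None ## clear memory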

Different knowledge sources (APIs) were called in different parts of the query. 

In the first part of the query (Disease &rarr; ChemicalSubstance), the APIs listed below returned results. The table also shows the predicates (semantic relationships) each API reported between SARS and ChemicalSubstances.

In [17]:
## show which predicates each API reported (one row per API/source/predicate combination)
q2_r_paths_table[['pred1_api', 'pred1_source', 'pred1']].drop_duplicates().sort_values(by = ['pred1_api', 'pred1'])

| | pred1_api | pred1_source | pred1 |
|---|---|---|---|
| 53 | Automat CORD19 Scigraph API | scigraph | related_to |
| 52 | Automat HMDB API | hmdb | related_to |
| 17 | SEMMED Disease API | SEMMED | affected_by |
| 0 | SEMMED Disease API | SEMMED | caused_by |
| 19 | SEMMED Disease API | SEMMED | disrupted_by |
| 10179 | SEMMED Disease API | SEMMED | prevented_by |
| 5 | SEMMED Disease API | SEMMED | related_to |
| 1 | SEMMED Disease API | SEMMED | treated_by |
| 46 | mydisease.info API | ctd | related_to |


In the second part of the query (ChemicalSubstance &rarr; Gene), the APIs listed below returned results. The table also shows the predicates (semantic relationships) each API reported between ChemicalSubstances and Genes.

In [18]:
## show that the APIs use different predicates
q2_r_paths_table[['pred2_api', 'pred2_source', 'pred2']].drop_duplicates().sort_values(by = ['pred2_api', 'pred2'])

| | pred2_api | pred2_source | pred2 |
|---|---|---|---|
| 18904 | Automat CHEMBIO API | chembio | related_to |
| 5230 | Automat CORD19 Scibite API | scibite | related_to |
| 5178 | Automat CORD19 Scigraph API | scigraph | related_to |
| 6001 | Automat HMDB API | hmdb | related_to |
| 22243 | Automat PHAROS API | pharos | related_to |
| 5057 | CORD Chemical API | Translator Text Mining Provider | related_to |
| 5065 | CTD API | CTD | related_to |
| 6015 | DGIdb API | GuideToPharmacologyInteractions | physically_interacts_with |
| 8809 | DGIdb API | DrugBank | physically_interacts_with |
| 40863 | DGIdb API | GuideToPharmacologyInteractions,DrugBank,TTD | physically_interacts_with |


Finally, we can see the number of ChemicalSubstances that were linked to both SARS and to a Gene as well as the number of Genes returned as output nodes. 

In [None]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q2_r_paths_table.node1_name.nunique(), q2_intermediate, corona_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q2_r_paths_table.output_name.nunique(), q2_intermediate))

### Filtering and scoring

Filtering uses edge provenance, such as the source a relationship came from and the method used to make the association, to remove edges (which also removes nodes). Before filtering, there are a large number of results (see the output of the code block above) and a large number of paths between the SARS node and Gene nodes (see the output of the code block below). 

Note that there is no method information in this particular table to filter on. 

In [20]:
## show number of paths from SARS to genes
print("There are {0} unique paths.".format( \
    q2_r_paths_table.shape[0]))

## show that there isn't any method information in this table
## shows number of cells without missing data (aka non-NA values)
print("\nThere is no method information in this table.")
q2_r_paths_table[['pred1_method', 'pred2_method']].count()

There are 58774 unique paths.

There is no method information in this table.


pred1_method    0
pred2_method    0
dtype: int64

In [21]:
q2_r_paths_table = filter_table(q2_r_paths_table)
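
The body of `filter_table` is not shown in this section, so its actual rules are unknown here. As a rough illustration of filtering on edge provenance, the sketch below drops every path that has an edge from an excluded source. The function name `sketch_filter_by_source`, the exclusion list, and the toy rows are all assumptions made for illustration; only the column names come from the table above.

```python
import pandas as pd

# Toy stand-in for the paths table. The column names mirror those shown above,
# but the rows are made up for illustration.
paths = pd.DataFrame({
    "node1_name":   ["chemA", "chemA", "chemB", "chemC"],
    "output_name":  ["GENE1", "GENE2", "GENE1", "GENE3"],
    "pred1_source": ["SEMMED", "ctd", "hmdb", "SEMMED"],
    "pred2_source": ["CTD", "DrugBank", "scibite", "pharos"],
})

# Hypothetical exclusion list: which sources to drop is an assumption,
# not the rule used by filter_table.
EXCLUDED_SOURCES = {"SEMMED", "scibite"}


def sketch_filter_by_source(table, excluded):
    """Keep only paths whose two edges both come from non-excluded sources."""
    keep = ~table["pred1_source"].isin(excluded) & ~table["pred2_source"].isin(excluded)
    return table[keep].reset_index(drop=True)


filtered = sketch_filter_by_source(paths, EXCLUDED_SOURCES)
print(len(paths), "paths before filtering,", len(filtered), "after")
```

Dropping a path row removes its edges from the answer; a ChemicalSubstance or Gene that no longer appears in any remaining row is effectively removed as a node.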

After filtering, there are fewer paths (and fewer nodes/edges) in the answer knowledge graph. 

In [None]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q2_r_paths_table.node1_name.nunique(), q2_intermediate, corona_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q2_r_paths_table.output_name.nunique(), q2_intermediate))

## show number of paths from SARS to genes
print("There are {0} unique paths.".format( \
    q2_r_paths_table.shape[0]))

The scoring process relies on the assumption that the user would be most interested in Genes that share many ChemicalSubstances (intermediate nodes) with SARS. 

To score individual Gene nodes, we first take a copy of the knowledge graph (KG) and flatten the multi-edges into single edges. 

We then count the number of paths from the SARS node to each Gene node. This count is also equivalent to the number of ChemicalSubstances that SARS and the Gene have in common and to the node (out)degree of the Gene in the flattened KG.    
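
The flattening and path-counting steps can be sketched with NetworkX on a toy graph. This is an illustration only, not the notebook's scoring code: the node names and edge keys are made up, and the edges are assumed to point from SARS toward the Genes, so the matching degree in this sketch is the in-degree (with edges stored in the opposite direction it would be the out-degree).

```python
import networkx as nx

# Toy multigraph with made-up nodes. Parallel edges stand in for the same
# relationship being reported by more than one API.
G = nx.MultiDiGraph()
G.add_edge("SARS", "chemA", key="api1")
G.add_edge("SARS", "chemA", key="api2")   # parallel edge
G.add_edge("SARS", "chemB", key="api1")
G.add_edge("chemA", "GENE1", key="api3")
G.add_edge("chemA", "GENE1", key="api4")  # parallel edge
G.add_edge("chemB", "GENE1", key="api3")
G.add_edge("chemB", "GENE2", key="api3")

# Converting to a DiGraph keeps a single edge per (source, target) pair.
flat = nx.DiGraph(G)

genes = ["GENE1", "GENE2"]

# Count two-hop paths SARS -> ChemicalSubstance -> Gene in the flattened graph.
path_counts = {
    g: sum(1 for _ in nx.all_simple_paths(flat, "SARS", g, cutoff=2))
    for g in genes
}

# In this toy graph the same numbers fall out of the Gene in-degrees.
in_degrees = {g: flat.in_degree(g) for g in genes}

print(path_counts)
print(in_degrees)
```

Without the flattening step, the parallel edges would inflate the counts for Genes whose relationships happen to be reported by several APIs.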

Finally, we "normalize" the score by dividing those counts by the maximum possible number of paths from the SARS node to a Gene node. In this case (with one input node and one intermediate node type), this is equivalent to the number of ChemicalSubstance nodes. 

We can then see the top-scored nodes. A score of 1 would mean that every ChemicalSubstance linked to SARS was also linked to that Gene. A score closer to 0 would mean that only a few ChemicalSubstances were linked to both SARS and that Gene. 
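
The normalized path count described above can be sketched directly on a paths table. This is a minimal illustration under stated assumptions, not the notebook's `scoring_output` function: `sketch_normalized_scores` is a hypothetical helper, the rows are toy data, and only the column names `node1_name` and `output_name` are taken from the table used earlier.

```python
import pandas as pd

# Toy paths table: each row is one SARS -> ChemicalSubstance -> Gene path.
# The repeated (chemA, GENE1) row stands in for a multi-edge.
paths = pd.DataFrame({
    "node1_name":  ["chemA", "chemA", "chemA", "chemB", "chemB", "chemC"],
    "output_name": ["GENE1", "GENE1", "GENE2", "GENE1", "GENE3", "GENE1"],
})


def sketch_normalized_scores(table):
    """Score each output node by the fraction of intermediates it shares."""
    # Flatten: keep one row per distinct (intermediate, output) pair.
    flat = table[["node1_name", "output_name"]].drop_duplicates()
    # Number of distinct intermediates linked to each output node.
    counts = flat.groupby("output_name")["node1_name"].nunique()
    # Maximum possible path count = number of distinct intermediates.
    max_paths = flat["node1_name"].nunique()
    return (counts / max_paths).sort_values(ascending=False).rename("score")


scores = sketch_normalized_scores(paths)
print(scores)
```

Here GENE1 is reached through all three toy ChemicalSubstances and scores 1, while GENE2 and GENE3 are each reached through one and score 1/3.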

In [23]:
## create scoring table for Genes (output nodes)
q2_scoring = scoring_output(q2_r_paths_table, q2_type)

q2_scoring.head(10)

| | score |
|---|---|
| PPIB | 0.366667 |
| CAT | 0.333333 |
| CASP14 | 0.266667 |
| TLN1 | 0.266667 |
| DUOXA1 | 0.233333 |
| INS | 0.233333 |
| IDS | 0.2 |
| 7841 | 0.2 |
| BCAT2 | 0.2 |
| PTS | 0.2 |


[TO DO: Actual code depends on graph object (ReasonerStd object or Networkx MultiDiGraph) used.]

We then add the scores and score provenance information to the graph object that will be returned to the ARS. These are stored as node attributes on the Gene nodes of the "answer graph". 

In [None]:
# ## use the 'score' column so the dict maps node name -> score
# nx.set_node_attributes(q2_r_graph, values = q2_scoring['score'].to_dict(), \
#                        name = 'score')
# nx.set_node_attributes(q2_r_graph, values = '(0-1]', name = 'score_range')
# nx.set_node_attributes(q2_r_graph, values = 'normalized path count', name = 'score_method')
# ## what is a good score, larger or smaller? 
# nx.set_node_attributes(q2_r_graph, values = 'larger', name = 'score_better_direction')  

# ## example of Gene node object with score-related attributes
# q2_r_graph.nodes()['BCAT2']