# Translator Use Case Question 2: SARS (proxy for COVID-19)

## Introduction

**To experiment with an executable version of this notebook, [load it in Google Colaboratory](https://colab.research.google.com/github/colleenXu/biothings_explorer/blob/relay/jupyter%20notebooks/CX_WIPs/TranslatorUseCase_Q2_SARS_COVIDproxy.ipynb).**

The Translator Use Case Question #2 is:    

> What genes or proteins may be associated with symptoms of a disease X (such as based on drugs that are currently used to treat that disease, etc)?

<br>

BioThings Explorer (BTE) can answer two classes of queries -- "EXPLAIN" and "PREDICT". This Question fits the PREDICT template of starting with **a specific biomedical entity** (a specific `Disease` X) and finding relationships with **one biomedical entity type** (like `PhenotypicFeature` or `Gene`).
* Note that a `Protein` biomedical entity type is not currently implemented in BTE. Instead, protein-coding genes and some non-coding genes are represented as `Gene` entities.

<br>

We will use severe acute respiratory syndrome (SARS), as a proxy for COVID-19, as our specific disease of interest.  We address the question above using the query: `Disease` SARS &rarr; `PhenotypicFeature` &rarr; `Gene`.    
* In other words, starting with the Disease SARS, find entities of type PhenotypicFeature associated with this disease, then find the entities of type Gene associated with those PhenotypicFeatures.       
* This query will return a graph object with entities as nodes and relationships as edges. We then use edge provenance information to **filter** the results. For each Gene node, we use the number of unique paths from SARS (input node) to that node to **score** it. The scores can then be used to sort the results.  

We will then compare the results to the genes mentioned in a [recent article describing a "bradykinin storm" mechanism for COVID-19](https://elifesciences.org/articles/59177). 

## Step 0: Load BTE modules, notebook functions

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection

## show time that this notebook was executed 
from datetime import datetime

## packages to work with objects 
import re

## to get around bugs
import nest_asyncio
nest_asyncio.apply()

In [2]:
## functions to add to modules?
def hint_display(query, hint_result):
    """
    show the type, name, number of IDs for all results returned by the query
    
    :param: query: string used in hint query
    :param: hint_result: object returned from hint query, a dictionary of lists of dictionaries
    
    Returns: None
    """
    ## function needs to be rewritten if it's going to give the exact index of each object within its type 
    display = ['type', 'name']  ## replace with the parts of the BioThings object you want to see
    concise_results = []
    for BT_type, result in hint_result.items():
        if result:  ## basically if it's not empty
            for items in result:
                ## number of identifiers per object: number of keys - 4 (name, primary, display, type)
                temp = len(items) - 4
                concise_results.append((items[display[0]], items[display[1]], 
                                         str(temp)))
                    
    print('There are {total} BioThings objects returned for {ht}:'.format(\
                total = len(concise_results), ht = query))
    for display_info in concise_results:
        print('{0}, {1}, num of IDs: {2}'.format(display_info[0], display_info[1], display_info[2]))

In [3]:
def filter_table(df):
    """
    use _source and _method columns to remove rows (paths) from the dataframe
    :param: pandas dataframe containing results from BTE FindConnection module, in table form
    
    Returns: filtered dataframe
    """
    ## note: still needs checking with EXPLAIN queries
    ## key is the string to match to column, value is a list of strings to match to column values
    filter_out = {'_source': ['SEMMED', 'CTD', 'ctd', 'omia']   
#                   '_method': []  ## currently no method stuff I want to filter out
                 }
    ## SEMMED: text mining results wrong for PhenotypicFeature -> Gene
    ## CTD/ctd: results odd for MSUD -> ChemicalSubstance
    ## omia: results wrong or discontinued gene IDs for PhenotypicFeature -> Gene
    
    
    df_temp = df.copy()  ## so the original df isn't modified in-place
    for key,val in filter_out.items():
        ## find columns that match the key string
        columns = [i for i in df_temp.columns if key in i]
        ## iterate through each column
        for col in columns:
            ## iterate through each value to take out, check if string CONTAINS match. 
            ## only keep rows that don't contain the value
            for i in val:
                df_temp = df_temp[~ df_temp[col].str.contains(i, na = False)]
    return df_temp

In [4]:
def scoring_output(df, q_type):
    """
    score results based on whether query was Predict or Explain type, number of 
        intermediate nodes 
    :param: pandas dataframe containing results from BTE FindConnection module
    :param: string describing type of query (Predict or Explain)
    
    May flatten some edges, because score only counts one edge per 
        unique predicate / API / method (ignoring source and pubmed col)
    
    Predict queries: score each output node by counting # of paths
        from input nodes to it. Normalize by dividing by maximum
        possible # of paths
    Explain two-hop (one intermediate) queries: score each intermediate node by 
        counting # of paths (between input and output nodes) that include it. 
        Normalize by dividing by maximum possible # of paths    

    Explain one-hop (direct) queries: no need to score, prints message
    Other Explain queries (many-hops): currently not able to score, prints message     
    
    Returns: pandas series with scores, index is output_name
             or None (one-hop or many-hop Explain query)
    """
    df_temp = df.copy()  ## so no chance to mutate this   
    flag_direct = False  ## one-hop query or not
    ## use df_col to look quicker into columns
    df_col = set(df_temp.columns)
    
    ## ignore source and pubmed col in looking at unique edges 
    columns_drop = [col for col in df_col if (('_source' in col) or ('_pubmed' in col))]
    df_temp.drop(columns = columns_drop, inplace = True)    
    df_temp.drop_duplicates(inplace = True)
    
    ## check if query is one-hop or not
    if "node1_name" not in df_col:    ## name for first intermediate node layer
        flag_direct = True  
    
    if q_type == 'Explain':
        if flag_direct:   # one hop / no intermediates
            print('No valid node scoring for one-hop (direct) Explain queries.')
            return None
        ## if there are many-hops/intermediate layers
        elif "node2_name" in df_col:  ## name for 2nd intermed. node layer
            print('Cannot currently score many-hop Explain queries.')
            return None
        else:   ## two-hop / 1 intermediate layer
            ## count multi-edges to results (the intermediate node1 col)
            scores = df_temp.node1_name.value_counts() 
            ## to find the maximum-possible number of edges, look at non-result cols
            columns_drop = [col for col in df_col if 'node1' in col]
            df_temp.drop(columns = columns_drop, inplace = True)
            ## now look at number of unique combos for input, edge info, output
            df_temp.drop_duplicates(inplace = True)
            max_paths = df_temp.shape[0]            
            ## normalize scores by dividing each by max number of paths
            scores = scores / max_paths

    else:  ## Predict type query
        ## count multi-edges to results (the output col)
        scores = df_temp.output_name.value_counts()
        ## to find the maximum number of multi-edges, look at non-output col
        columns_drop = [col for col in df_temp.columns if 'output' in col]
        df_temp.drop(columns = columns_drop, inplace = True)
        ## now look at number of unique paths possible
        df_temp.drop_duplicates(inplace = True)
        max_paths = df_temp.shape[0]
        ## normalize scores by dividing each by max number of paths
        scores = scores / max_paths
            
    ## return scores as pandas dataframe, with rank
    scores = scores.to_frame(name = 'score') 
    scores['rank'] = scores['score'].rank(method = 'dense', ascending = False)
    return scores

In [5]:
## record when cell blocks are executed
print('The time that this notebook was executed is...')
print('Local time (PST, West Coast USA): ')
print(datetime.now())
print('UTC time: ')
print(datetime.utcnow())

The time that this notebook was executed is...
Local time (PST, West Coast USA): 
2020-11-26 17:55:36.454409
UTC time: 
2020-11-27 01:55:36.454540


## Step 1: Find representation of "SARS" in BTE

In this step, BioThings Explorer translates our query string "SARS"  into BioThings objects, which contain mappings to many common identifiers. We then pick the BioThings object that best matches what we want. 

Generally, the top result returned by the Hint module for your BioThings type of interest will match what you want, but you should confirm that using the identifiers shown. 


> BioThings types correspond to children and descendants of [BiologicalEntity](https://biolink.github.io/biolink-model/docs/BiologicalEntity.html) from the [Biolink Model](https://biolink.github.io/biolink-model/docs/), including `Disease` (e.g., "lupus"), `ChemicalSubstance` (e.g., "acetaminophen"), `Gene` (e.g., "CDK2"), `BiologicalProcess` (e.g., "T cell differentiation"), and `Pathway` (e.g., "Citric acid cycle"). **However, [only a subset of the Biolink BiologicalEntity children / descendants are currently implemented in BTE](https://smart-api.info/portal/translator/metakg)**. More biomedical object types will be available as more knowledge sources (APIs) are added to the system. **Note that the type `BiologicalEntity` means any BioThings type currently implemented in BTE will be accepted.**

In [6]:
ht = Hint()  ## neater way to call this BTE module

## the human user gives this input
disease_starting_str = "SARS"

disease_hint = ht.query(disease_starting_str)
hint_display(disease_starting_str, disease_hint)

There are 10 BioThings objects returned for SARS:
ChemicalSubstance, Anti-SARS-CoV-2 REGN-COV2, num of IDs: 1
Disease, severe acute respiratory syndrome, num of IDs: 5
Disease, COVID-19, num of IDs: 2
Disease, COVID-19–associated multisystem inflammatory syndrome in children, num of IDs: 2
Disease, IMD74, num of IDs: 0
MolecularActivity, selenocysteine-tRNA ligase activity, num of IDs: 2
MolecularActivity, mRNA (guanine-N7-)-methyltransferase activity, num of IDs: 3
MolecularActivity, 5'-3' RNA helicase activity, num of IDs: 2
MolecularActivity, mRNA (nucleoside-2'-O-)-methyltransferase activity, num of IDs: 3
MolecularActivity, RNA-directed 5'-3' RNA polymerase activity, num of IDs: 3


Note: the query failed to retrieve Disease &rarr; PhenotypicFeatures for COVID-19 (a sibling of SARS in the Mondo ontology) and Orthocoronavirinae infectious disease (parent of COVID-19 and SARS in the Mondo ontology). 

So we'll pick the SARS `Disease` choice (indexed at 0) for our query. We can look at identifier mappings inside this BioThings object. 

In [7]:
## the human user makes this choice, gives this input
disease_choice_type = 'Disease'
disease_choice_idx = 0

disease_hint_obj = disease_hint[disease_choice_type][disease_choice_idx]  
disease_hint_obj
## these inner dictionaries are keys = id type, 
##       values = curie, normal string, or dictionary (for the key 'primary')

{'MONDO': 'MONDO:0005091',
 'DOID': 'DOID:2945',
 'UMLS': 'C1175175',
 'name': 'severe acute respiratory syndrome',
 'MESH': 'D045169',
 'ORPHANET': '140896',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0005091'},
 'display': 'MONDO(MONDO:0005091) DOID(DOID:2945) ORPHANET(140896) UMLS(C1175175) MESH(D045169) name(severe acute respiratory syndrome)',
 'type': 'Disease'}
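
To make the shape of a Hint result concrete, the sketch below builds a small mock of it (a dictionary of lists of dictionaries) from the output above and applies the same "number of keys minus 4" rule that `hint_display` uses. No BTE call is made; `mock_hint`, `NON_ID_KEYS`, and `count_ids` are illustrative names that are not part of BTE, and the `display` string is shortened for this sketch.

```python
# Mock of a Hint result: {BioThings type: [object, ...]}.
# Identifier values are copied from the SARS object shown above.
mock_hint = {
    "Disease": [
        {
            "MONDO": "MONDO:0005091",
            "DOID": "DOID:2945",
            "UMLS": "C1175175",
            "name": "severe acute respiratory syndrome",
            "MESH": "D045169",
            "ORPHANET": "140896",
            "primary": {
                "identifier": "MONDO",
                "cls": "Disease",
                "value": "MONDO:0005091",
            },
            "display": "(shortened for this sketch)",
            "type": "Disease",
        }
    ],
    "Gene": [],  # types with no match are assumed to map to an empty list
}

# Keys that describe the object rather than hold an external identifier.
NON_ID_KEYS = {"name", "primary", "display", "type"}


def count_ids(obj):
    """Count the identifier entries in one BioThings object."""
    return len([key for key in obj if key not in NON_ID_KEYS])


# Pick an object by type and position, as the notebook does with index 0.
chosen = mock_hint["Disease"][0]
print(chosen["primary"]["value"], count_ids(chosen))
```

Checking the primary identifier and the identifier count this way is a quick confirmation that the chosen object is the intended disease before it is passed on as the query input.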

## Step 2: SARS &rarr; PhenotypicFeature &rarr; Gene

In this section, we dynamically generate a knowledge graph with paths connecting SARS to genes *using PhenotypicFeature intermediates* (representing the symptoms part of the question).  

BTE performs the **query path planning** and **query path execution** by deconstructing the query into individual API calls, executing those API calls, and then assembling the results.
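
As a conceptual illustration of the final "assembling" step only, the sketch below joins first-hop edges (`Disease` to `PhenotypicFeature`) with second-hop edges (`PhenotypicFeature` to `Gene`) on the shared intermediate node. It is not BTE's implementation: `assemble_paths` is a hypothetical helper, the label `"SARS"` stands in for the input node, and the edge lists are a tiny hand-written subset loosely based on results shown later in this notebook.

```python
from collections import defaultdict


def assemble_paths(hop1, hop2):
    """Join two edge lists on the intermediate node to form two-hop paths."""
    # Index second-hop edges by their source node (the intermediate node).
    by_source = defaultdict(list)
    for source, predicate, target in hop2:
        by_source[source].append((predicate, target))

    paths = []
    for input_node, pred1, intermediate in hop1:
        # Intermediates with no second-hop edge contribute no paths.
        for pred2, output_node in by_source.get(intermediate, []):
            paths.append((input_node, pred1, intermediate, pred2, output_node))
    return paths


# Toy edges: (source, predicate, target)
hop1 = [
    ("SARS", "has_phenotype", "COUGH"),
    ("SARS", "has_phenotype", "HYPOXEMIA"),
    ("SARS", "has_phenotype", "FEVER"),
]
hop2 = [
    ("COUGH", "has_phenotype", "NOS1"),
    ("COUGH", "has_phenotype", "CSF2RB"),
    ("HYPOXEMIA", "has_phenotype", "CSF2RB"),
]

paths = assemble_paths(hop1, hop2)
for path in paths:
    print(" -> ".join(path))
```

In this toy data, `FEVER` has no second-hop edge, so it does not appear in any assembled path. This mirrors why only PhenotypicFeatures linked to both the disease and a gene show up as intermediate nodes in the results.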

The code block below takes about 18 seconds to run.   

In [8]:
## the human user gives this input
q1_output_type = 'Gene'
q1_intermediate = 'PhenotypicFeature'

q1 = FindConnection(input_obj = disease_hint_obj,\
                     output_obj = q1_output_type, \
                    intermediate_nodes = q1_intermediate)
q1.connect(verbose = True)


BTE will find paths that join 'severe acute respiratory syndrome' and 'Gene'. Paths will have 1 intermediate node.

Intermediate node #1 will have these type constraints: PhenotypicFeature




==== Step #1: Query path planning ====

Because severe acute respiratory syndrome is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'PhenotypicFeature' as output

BTE found 3 apis:

API 1. mydisease(1 API call)
API 2. biolink(1 API call)
API 3. semmed_disease(11 API calls)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 1.1: https://mydisease.info/v1/query?fields=hpo.phenotype_related_to_disease (POST -d q=140896&scopes=hpo.orphanet)
API 3.1: https://biothings.ncats.io/semmed/query?fields=affected_by (POST -d q=C1175175&scopes=umls)
API 3.10: https://biothings.ncats.io/semmed/query?fields=disrupts (POST -d q=C1175175&scopes=umls)
API 3.2: https://biothing

API 1.20: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0000819/genes?rows=200
API 1.27: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0001945/genes?rows=200
API 1.13: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0003326/genes?rows=200&direct=true
API 1.17: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0002721/genes?rows=200
API 1.31: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0001626/genes?rows=200
API 1.24: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0012735/genes?rows=200
API 1.26: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0006528/genes?rows=200
API 1.19: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0001919/genes?rows=200
API 1.8: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0012735/genes?rows=200&direct=true
API 1.5: https://api.monarchinitiative.org/api/bioentity/phenotype/HP:0002315/genes?rows=200&direct=true
API 1.18: https://api.mona

In [9]:
# q1_r_graph = q1.fc.G   ## for changing the graph object to reflect the table
q1_r_paths_table = q1.display_table_view()

q1_type = re.findall("dispatcher.([a-zA-Z]+)'", str(type(q1.fc)))
q1_type = "".join(q1_type)  ## convert to string

q1 = None  ## clear memory

We can see the number of PhenotypicFeatures that were linked to both SARS and to a Gene, the number of Genes returned as output nodes, and the total number of paths from SARS to Gene nodes. 

In [10]:
## show number of unique intermediate nodes
print("There are {0} unique {1}s for {2}.".format( \
    q1_r_paths_table.node1_name.nunique(), q1_intermediate, disease_starting_str))

## show number of unique output nodes
print("There are {0} unique Genes linked to those {1}s.".format( \
    q1_r_paths_table.output_name.nunique(), q1_intermediate))

## show number of paths from disease to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 16 unique PhenotypicFeatures for SARS.
There are 951 unique Genes linked to those PhenotypicFeatures.
There are 2838 unique paths.


### Filtering and scoring

Filtering involves using edge provenance, like the source this relationship came from and the method used to make this association, to filter out edges (removing nodes in the process). 

In this particular example, only 10 of the 2838 paths are removed, because we consider almost all of the information reliable. 
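
The substring-based exclusion that `filter_table` performs can be seen on a toy table. This is a minimal sketch that assumes pandas is installed; the column names follow the `_source` naming used in this notebook's results table, while the rows and the `GENE_A` to `GENE_D` labels are invented for illustration.

```python
import pandas as pd

# Toy paths table; one row has a missing source value.
toy = pd.DataFrame(
    {
        "pred1_source": ["hpoa", "hpo", "hpoa", None],
        "pred2_source": ["orphanet,hpoa", "SEMMED", "omia", "omim,hpoa"],
        "output_name": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    }
)

# Same exclusion terms as the filter_out dictionary in filter_table.
excluded = ["SEMMED", "CTD", "ctd", "omia"]
source_cols = [col for col in toy.columns if "_source" in col]

# Start by keeping every row, then drop rows whose source contains a term.
keep = pd.Series(True, index=toy.index)
for col in source_cols:
    for term in excluded:
        # na=False treats missing sources as "no match", so those rows stay.
        keep &= ~toy[col].str.contains(term, na=False, regex=False)

filtered = toy[keep]
print(filtered["output_name"].tolist())
```

Because the match is a substring test, a multi-valued source such as `"orphanet,hpoa"` is only removed if one of the excluded terms appears somewhere in the string.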

In [11]:
q1_r_paths_table = filter_table(q1_r_paths_table)

## show number of paths from disease to genes
print("There are {0} unique paths.".format( \
    q1_r_paths_table.shape[0]))

There are 2828 unique paths.


The scoring process relies on the assumption that the user would be most interested in Genes that share many phenotypes (intermediate nodes) with SARS. 
1. To score individual Gene nodes, we first take a copy of the knowledge graph (KG) and flatten the multi-edges into single edges. 
2. We then count the number of paths from the SARS node to each Gene node. This count is also equivalent to the number of phenotypes/symptoms (PhenotypicFeatures) that SARS and the Gene have in common, and to the node (out)degree of the Gene in the flattened KG.    
3. Finally, we "normalize" the score by dividing those counts by the maximum possible number of paths from the SARS node to a Gene node. In this case (with one input node and one intermediate node type), this is equivalent to the number of PhenotypicFeature nodes. 

We can then see the top-scored nodes. A score of 1 would mean that the maximum number of PhenotypicFeatures were linked to both SARS and to that Gene. A score closer to 0 would mean that only a few PhenotypicFeatures were linked to both SARS and to that Gene. 
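
The three steps above can be sketched on toy data as follows. This is a simplified illustration rather than the notebook's `scoring_output` function, which works on the full results table (including predicate and API columns) and so can arrive at different denominators. `score_genes` is a hypothetical helper, and `GENE_X` and its edges are invented.

```python
from collections import Counter


def score_genes(paths):
    """Score each gene by its share of the intermediate phenotypes."""
    # Step 1: flatten multi-edges so each (phenotype, gene) pair counts once.
    flat = set(paths)
    # Step 2: count paths per gene, i.e. phenotypes shared with the disease.
    counts = Counter(gene for _, gene in flat)
    # Step 3: normalize by the number of distinct intermediate phenotypes.
    n_phenotypes = len({phenotype for phenotype, _ in flat})
    return {gene: count / n_phenotypes for gene, count in counts.items()}


# Toy (phenotype, gene) paths from a single disease node.
toy_paths = [
    ("COUGH", "CSF2RB"),
    ("COUGH", "CSF2RB"),  # duplicate multi-edge, e.g. reported by two sources
    ("HYPOXEMIA", "CSF2RB"),
    ("COUGH", "NOS1"),
    ("FEVER", "GENE_X"),
    ("HEADACHE", "GENE_X"),
]

scores = score_genes(toy_paths)
for gene, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(gene, score)
```

With four distinct phenotypes in the toy data, a gene linked to two of them scores 0.5 and a gene linked to one scores 0.25, regardless of how many duplicate edges support each link.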

In [12]:
## create scoring table for Genes (output nodes)
q1_scoring = scoring_output(q1_r_paths_table, q1_type)

q1_scoring.head(10)

|  | score | rank |
|---|---|---|
| MTHFR | 0.225806 | 1.0 |
| GBA | 0.225806 | 1.0 |
| CSF2RB | 0.193548 | 2.0 |
| CSF2RA | 0.16129 | 3.0 |
| APC | 0.16129 | 3.0 |
| ATM | 0.16129 | 3.0 |
| PDGFRA | 0.16129 | 3.0 |
| CTNNB1 | 0.16129 | 3.0 |
| GAA | 0.16129 | 3.0 |
| DMPK | 0.16129 | 3.0 |


Different knowledge sources (APIs) were called in different parts of the query. 

In the first part of the query (SARS &rarr; PhenotypicFeature), the following APIs returned results and the following predicates (semantic relationships) were found. 

In [13]:
## show that the APIs use different predicates
q1_r_paths_table[['pred1_api', 'pred1']].drop_duplicates().sort_values(by = ['pred1_api', 'pred1'])

|  | pred1_api | pred1 |
|---|---|---|
| 0 | BioLink API | has_phenotype |
| 1 | mydisease.info API | related_to |


The following symptoms (PhenotypicFeatures) were linked to genes and to SARS. 

In [14]:
q1_r_paths_table['node1_name'].unique()

array(['DECREASED IMMUNE FUNCTION', 'HEADACHE',
       'RESPIRATORY DISTRESS NECESSITATING MECHANICAL VENTILATION',
       'CHRONIC LUNG DISEASE', 'FEVER', 'MUSCLE ACHE',
       'DIABETES MELLITUS', 'COUGH', 'ABNORMAL TISSUE MASS',
       'ABNORMAL BREATHING', 'BREATHING DIFFICULTIES', 'PHARYNGITIS',
       'HYPOXEMIA', 'ABNORMALITY OF THE CARDIOVASCULAR SYSTEM',
       'ACUTE INFECTIOUS PNEUMONIA', 'ACUTE KIDNEY FAILURE'], dtype=object)

In the second part of the query (PhenotypicFeature &rarr; Gene), the following APIs returned results and the following predicates (semantic relationships) were found.

In [15]:
## show that the APIs use different predicates
q1_r_paths_table[['pred2_api', 'pred2']].drop_duplicates().sort_values(by = ['pred2_api', 'pred2'])

|  | pred2_api | pred2 |
|---|---|---|
| 782 | BioLink API | contributes_to_condition |
| 2 | BioLink API | has_phenotype |
| 0 | EBIgene2phenotype API | related_to |


[TO DO: Actual code depends on graph object (ReasonerStd object or Networkx MultiDiGraph) used.]

We then add the scores and score provenance information to the graph object that will be returned to the ARS. These are stored as node attributes on the Gene nodes of the "answer graph". 

In [16]:
# nx.set_node_attributes(q1_r_graph, values = q1_scoring.to_dict(), \
#                        name = 'score')
# nx.set_node_attributes(q1_r_graph, values = '(0-1]', name = 'score_range')
# nx.set_node_attributes(q1_r_graph, values = 'normalized path count', name = 'score_method')
# ## what is a good score, larger or smaller? 
# nx.set_node_attributes(q1_r_graph, values = 'larger', name = 'score_better_direction')  

# ## example of Gene node object with score-related attributes
# q1_r_graph.nodes()['NDUFS1']

## Compare answers to bradykinin article

The ["bradykinin storm" mechanism article](https://elifesciences.org/articles/59177) describes many gene products and states that these may be linked to COVID-19 symptoms. 

We used BTE to find genes that are linked to COVID-19 / SARS symptoms (PhenotypicFeatures). We can then compare BTE's answers to the genes mentioned in this article. 

In [16]:
article_genes = ['AGT', ## AGT = angiotensin precursor. 
                 'AGTR1', 'AGTR2', 'MAS1', ## receptors angiotensin binds to
                 'REN', 'ACE', 'ACE2',  ## enzymes working on angiotensin II precursors
                 'KNG1', 'KLKB1', ## Bradykinin precursor
                 'BDKRB2', 'BDKRB1', ## receptors for bradykinin
                 'NOS1', 'NOS2',  ## nitric oxide synthases: missing NOS3
                 'VDR', 'CYP24A1', 'CYP3A4',  ## vitamin D related
                 'KLK1', 'KLK2', 'KLK3', 'KLK4', 'KLK5', ## KLK = enzymes working on bradykinin precursor 
                 'KLK6', 'KLK7', 'KLK8', 'KLK9', 'KLK10', 
                 'KLK11', 'KLK12', 'KLK13', 'KLK14', 'KLK15', 
                 'F12',  ## F12: factor XII related to clotting and bradykinin precursor
                 'SERPING1', ## protease inhibitor, involved in kinin, clotting, complement pathways
                 'CPN1', ## protease linked to kinins, angioedema
                 'HAS1', 'HAS2', 'HAS3', ## hyaluronic acid synthases
                 'HYAL1', 'HYAL2',  ## degrade hyaluronic acid 
                 'CD44',  ## related to hyaluronic acid and immune system
                 'TMSB4X', ## linked to ACE, tissue regeneration
                 'IL2', ## link to CD44, capillary leak syndrome (called VLS in the paper)
                 'APLN', 'APLNR',  ## apelin and its receptor. related to RAS, heart 
                 'MME', ## related to degrading bradykinin, apelin 
                 'NFKB1', 'NFKB2', 'RELA', 'RELB', 'REL', ## genes for NF-kB complex
                 'IKBKG', ## part of IKB kinase complex that activates NFkappaB, degraded by virus 3CLpro protease?
                 'DPP4' ## linked to MERS and immune system
                ]  

Of the gene entities discussed in the bradykinin article, the following are in our results:
* **NOS1**, linked to cough
* **SERPING1**, linked to breathing difficulties and abnormal breathing
* **NFKB1**, **NFKB2**, and **IKBKG**, linked to decreased immune function

Below, the scores and COVID-19 / SARS symptoms linked to the genes are shown. 

In [17]:
print('Scoring from SARS -> PhenotypicFeature -> Gene query')

## reset index to show placement of genes
q1_scoring_df = q1_scoring.reset_index().rename(columns = {'index': 'output_name'})
q1_scoring_df[q1_scoring_df.output_name.isin(article_genes)]

q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)][['node1_name', 'pred2', 'output_name']]

Scoring from SARS -> PhenotypicFeature -> Gene query


|  | output_name | score | rank |
|---|---|---|---|
| 261 | IKBKG | 0.064516 | 6.0 |
| 309 | NFKB2 | 0.032258 | 7.0 |
| 614 | SERPING1 | 0.032258 | 7.0 |
| 617 | NOS1 | 0.032258 | 7.0 |
| 650 | NFKB1 | 0.032258 | 7.0 |


|  | node1_name | pred2 | output_name |
|---|---|---|---|
| 0 | DECREASED IMMUNE FUNCTION | related_to | IKBKG |
| 1 | DECREASED IMMUNE FUNCTION | related_to | IKBKG |
| 2 | DECREASED IMMUNE FUNCTION | has_phenotype | IKBKG |
| 3 | DECREASED IMMUNE FUNCTION | has_phenotype | IKBKG |
| 302 | DECREASED IMMUNE FUNCTION | has_phenotype | NFKB2 |
| 303 | DECREASED IMMUNE FUNCTION | has_phenotype | NFKB2 |
| 304 | DECREASED IMMUNE FUNCTION | has_phenotype | NFKB1 |
| 305 | DECREASED IMMUNE FUNCTION | has_phenotype | NFKB1 |
| 916 | COUGH | has_phenotype | NOS1 |
| 917 | COUGH | has_phenotype | NOS1 |
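
A per-gene summary of linked symptoms can be derived from rows like those above by grouping on the gene name. The sketch below is a minimal stdlib version that uses a hand-copied subset of the rows, with duplicates omitted; `symptoms_by_gene` is an illustrative name.

```python
from collections import defaultdict

# (node1_name, pred2, output_name) rows, copied from the table above.
rows = [
    ("DECREASED IMMUNE FUNCTION", "related_to", "IKBKG"),
    ("DECREASED IMMUNE FUNCTION", "has_phenotype", "IKBKG"),
    ("DECREASED IMMUNE FUNCTION", "has_phenotype", "NFKB2"),
    ("DECREASED IMMUNE FUNCTION", "has_phenotype", "NFKB1"),
    ("COUGH", "has_phenotype", "NOS1"),
]

# Collect the distinct symptoms per gene, ignoring which predicate was used.
symptoms_by_gene = defaultdict(set)
for symptom, _predicate, gene in rows:
    symptoms_by_gene[gene].add(symptom.lower())

for gene in sorted(symptoms_by_gene):
    print(gene, "->", ", ".join(sorted(symptoms_by_gene[gene])))
```

Using a set per gene collapses rows that differ only in predicate or source, so each symptom is listed once per gene.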


The APIs involved in those results are shown below. 

In [18]:
## show APIs/sources for the underlying info for these answers
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)][['pred1_api']].drop_duplicates()
q1_r_paths_table[q1_r_paths_table.output_name.isin(article_genes)][['pred2_api']].drop_duplicates()

|  | pred1_api |
|---|---|
| 0 | BioLink API |
| 1 | mydisease.info API |


|  | pred2_api |
|---|---|
| 0 | EBIgene2phenotype API |
| 2 | BioLink API |


## Top scored genes and disease mechanism ideas

The top scored genes in BTE's dynamically generated knowledge graph are...

In [19]:
q1_scoring_df.sort_values(by = ['score', 'output_name'], ascending = [False, True]).head(20)

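|  | output_name | score | rank |
|---|---|---|---|
| 1 | GBA | 0.225806 | 1.0 |
| 0 | MTHFR | 0.225806 | 1.0 |
| 2 | CSF2RB | 0.193548 | 2.0 |
| 4 | APC | 0.16129 | 3.0 |
| 5 | ATM | 0.16129 | 3.0 |
| 3 | CSF2RA | 0.16129 | 3.0 |
| 7 | CTNNB1 | 0.16129 | 3.0 |
| 9 | DMPK | 0.16129 | 3.0 |
| 8 | GAA | 0.16129 | 3.0 |
| 12 | HBB | 0.16129 | 3.0 |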


[Note: the genes discussed below were in the results at the time that this notebook was run.]

Even though the symptoms for COVID-19 / SARS are non-specific (many diseases share some of these symptoms), BTE's top results include two interesting genes that relate to potential mechanisms of disease. 

**Immune-system-related genes:**
* [**ADA2**](https://ghr.nlm.nih.gov/gene/ADA2): this gene appears to be involved in the growth and development of immune cells, particularly macrophages. Mutations in this gene can lead to [adenosine deaminase 2 deficiency](https://ghr.nlm.nih.gov/condition/adenosine-deaminase-2-deficiency), which involves abnormal inflammation of blood vessels (vasculitis) and other tissues. Symptoms involve fever, skin discoloration (livedo racemosa), enlarged liver and spleen, recurrent strokes, and immune system abnormalities. 
* [**CSF2RB**](https://www.genecards.org/cgi-bin/carddisp.pl?gene=CSF2RB): encodes a subunit for the IL3 receptor, IL5 receptor, and GM-CSF receptor. [IL3, IL5, and GM-CSF signaling is involved in the growth, development, and maturation of immune cells.](https://reactome.org/PathwayBrowser/#/R-HSA-512988) 

In particular, GM-CSF-related therapies have been investigated for COVID-19 (references [here](https://www.nature.com/articles/s41577-020-0357-7) and [here](https://els-jbs-prod-cdn.jbs.elsevierhealth.com/pb/assets/raw/Health%20Advance/journals/jmcp/jmcp_ft95_8_7.pdf)).
For more information on GM-CSF and its role in suppressing autoimmunity, [see this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4553090/). 


Other top genes that seem less related to COVID-19 / SARS:
* [GBA: related to Gaucher disease (symptoms include enlarged liver and spleen, anemia, easy bruising, and neurological issues), Parkinson disease, and dementia with Lewy bodies](https://ghr.nlm.nih.gov/gene/GBA)
* [HBB: encodes beta-globin, a subunit of hemoglobin. Related to beta-thalassemia and sickle cell anemia](https://monarchinitiative.org/gene/HGNC:4827)
* [MTHFR: related to cancer, neuronal disease, vascular disease](https://monarchinitiative.org/gene/HGNC:7436)    
* [APC: related to cancer, cell adhesion, cell migration, etc.](https://monarchinitiative.org/gene/HGNC:583)
* [ATM: related to cancer, cell division, DNA repair, multiple body systems' development and activity](https://ghr.nlm.nih.gov/gene/ATM)
* [CAV1: linked to lipodystrophy](https://ghr.nlm.nih.gov/gene/CAV1#conditions)
* [CTNNB1: related to cancer, cell adhesion](https://ghr.nlm.nih.gov/gene/CTNNB1)
* [DMPK: linked to myotonic dystrophy](https://ghr.nlm.nih.gov/gene/DMPK#conditions)
* [FBP1: linked to fructose-1,6-bisphosphatase deficiency, which involves hypoglycemia and metabolic acidosis with fasting.](https://www.omim.org/entry/229700)
* [GAA: linked to Pompe disease, which involves the accumulation of glycogen in the body. This can cause progressive muscle weakness, breathing issues, heart issues, etc.](https://ghr.nlm.nih.gov/condition/pompe-disease#genes)
* [PDGFRA: related to cancer, cell growth and division](https://ghr.nlm.nih.gov/gene/PDGFRA)
* [RET: related to cancer, nerve and kidney development](https://ghr.nlm.nih.gov/gene/RET)
* [WT1: related to kidney development and cancer](https://ghr.nlm.nih.gov/gene/WT1)

In [20]:
q1_r_paths_table[q1_r_paths_table['output_name'] == 'CSF2RB']

|  | input | input_type | pred1 | pred1_source | pred1_api | pred1_pubmed | pred1_method | node1_type | node1_name | node1_id | pred2 | pred2_source | pred2_api | pred2_pubmed | pred2_method | output_type | output_name | output_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 836 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | has_phenotype | hpoa | BioLink API |  |  | PhenotypicFeature | COUGH | UMLS:C0010200 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 837 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | related_to | hpo | mydisease.info API |  | TAS | PhenotypicFeature | COUGH | UMLS:C0010200 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 838 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | has_phenotype | hpoa | BioLink API |  |  | PhenotypicFeature | HYPOXEMIA | UMLS:C0700292 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 839 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | related_to | hpo | mydisease.info API |  | TAS | PhenotypicFeature | HYPOXEMIA | UMLS:C0700292 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 840 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | has_phenotype | hpoa | BioLink API |  |  | PhenotypicFeature | BREATHING DIFFICULTIES | UMLS:C0013404 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 841 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | related_to | hpo | mydisease.info API |  | TAS | PhenotypicFeature | BREATHING DIFFICULTIES | UMLS:C0013404 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 842 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | has_phenotype | hpoa | BioLink API |  |  | PhenotypicFeature | ABNORMAL BREATHING | UMLS:C0013404 | has_phenotype | omim,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 843 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | related_to | hpo | mydisease.info API |  | TAS | PhenotypicFeature | ABNORMAL BREATHING | UMLS:C0013404 | has_phenotype | omim,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 844 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | has_phenotype | hpoa | BioLink API |  |  | PhenotypicFeature | RESPIRATORY DISTRESS NECESSITATING MECHANICAL ... | UMLS:C4025279 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
| 845 | ACUTE RESPIRATORY CORONAVIRUS INFECTION | Disease | related_to | hpo | mydisease.info API |  | TAS | PhenotypicFeature | RESPIRATORY DISTRESS NECESSITATING MECHANICAL ... | UMLS:C4025279 | has_phenotype | orphanet,hpoa | BioLink API |  |  | Gene | CSF2RB | NCBIGene:1439 |
