# Introduction: Constructing a data-driven gene ontology to study disease mechanisms

This notebook demonstrates how to create data-driven gene ontologies to study disease mechanisms. It focuses on Fanconi Anemia (FA), a rare genetic disorder associated with bone marrow failure, myeloid dysplasia, and increased cancer risk. Although mutations in 20 genes are known to cause FA by disrupting the repair of DNA damage, it remains unclear whether other FA genes exist and whether pathways other than DNA repair are involved. To discover new FA genes and pathways, this notebook executes a five-step pipeline to construct a Fanconi Anemia gene ontology (FanGO):

1. Gather input data, consisting of the 20 known FA genes as a seed set of genes for modeling and a pre-computed gene similarity network derived by integrating several types of molecular evidence including protein-protein interactions, co-expression, co-localization, and epistasis.

2. Score every gene for its involvement in FA by calculating its average functional similarity to the seed genes. The minimum score among the seed genes is used as a threshold to identify an additional set of 174 candidate genes.

3. Organize all genes in a hierarchy of 74 cellular subsystems to construct FanGO.

4. Align FanGO to the Gene Ontology.

5. Upload FanGO to an online database, the Network Data Exchange ([NDEx](http://ndexbio.org)), and visualize FanGO in the [HiView](http://hiview.ucsd.edu) web application.

Code is also provided to analyze 651 other diseases using a similar pipeline. Known gene-disease associations are taken from the [Monarch Initiative database](https://monarchinitiative.org/).

Before reading this notebook, we recommend working through the [DDOT tutorial](https://github.com/michaelkyu/ddot/blob/master/examples/Tutorial.ipynb).

<img src="https://raw.githubusercontent.com/michaelkyu/ddot/master/docs/software_pipeline_23jan2018.png" width="700" align="left">

In [3]:
import pandas as pd
import networkx as nx
import numpy as np
import os
from ndex.client import Ndex

import ddot
from ddot import Ontology, ndex_to_sim_matrix, expand_seed, melt_square, make_seed_ontology, make_index

# Set the NDEx server and the user account/password
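* Replace with your own NDEx user account. The cell below reads the credentials from the `NDEX_USERNAME` and `NDEX_PASSWORD` environment variables.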

In [5]:
ndex_server = 'http://ndexbio.org'
ndex_user, ndex_pass = os.environ['NDEX_USERNAME'], os.environ['NDEX_PASSWORD']
ndex = Ndex(ndex_server, ndex_user, ndex_pass)
ndex

<ndex.client.Ndex at 0x119c26e10>

# Read gene-gene integrated similarity network

In [None]:
## Read gene similarity network
sim, sim_names = ndex_to_sim_matrix(
    ndex_uuid='d2dfa5cc-56de-11e7-a2e2-0660b7976219',
    similarity='similarity',
    input_fmt='cx_matrix',
    output_fmt='matrix',
    subset=None,
)

sim = pd.DataFrame(sim, columns=sim_names, index=sim_names)

## Rank transform the similarities
sim_rank = sim.rank(0) / (sim.shape[0] - 1)
sim_rank = pd.DataFrame((sim_rank.values + sim_rank.values.T) / 2.0, columns=sim_names, index=sim_names)

sim_rank.head()
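
The rank transform above can be traced on a toy matrix. The following is a minimal sketch, assuming a made-up 3-gene similarity matrix (the gene names and values are hypothetical, not taken from the real network). It shows that ranking down each column gives an asymmetric matrix, and that averaging with the transpose restores symmetry.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-gene similarity matrix (symmetric, zero diagonal)
toy_genes = ['A', 'B', 'C']
toy_sim = pd.DataFrame([[0.0, 0.9, 0.1],
                        [0.9, 0.0, 0.4],
                        [0.1, 0.4, 0.0]],
                       index=toy_genes, columns=toy_genes)

# Rank the values within each column (axis 0), then scale by (n - 1)
toy_rank = toy_sim.rank(0) / (toy_sim.shape[0] - 1)

# Column-wise ranks are not symmetric, so average with the transpose
toy_rank_sym = pd.DataFrame((toy_rank.values + toy_rank.values.T) / 2.0,
                            index=toy_genes, columns=toy_genes)

print(toy_rank)
print(toy_rank_sym)
```

In this toy case, the B-C entry is 1.5 in one direction and 1.0 in the other before averaging, and 1.25 in both directions afterwards.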

In [None]:
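# Alternative to the NDEx download above: load precomputed raw and rank-transformed matrices from local files.
# These paths are specific to the original author's environment; replace them with your own or skip this cell.
tmp = np.load('/cellar/users/mikeyu/DeepTranslate/hnexo/RFv2r3_square.npz')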
sim, sim_names = tmp['rf'], tmp['genes']
np.fill_diagonal(sim, 0)
sim[np.isnan(sim)] = 0
sim = pd.DataFrame(sim, columns=sim_names, index=sim_names)

tmp = np.load('/cellar/users/mikeyu/DeepTranslate/hnexo/RFv2r3_square.ranked.npz')
sim_rank, sim_names = tmp['rf'], tmp['genes']
sim_rank = pd.DataFrame(sim_rank, columns=sim_names, index=sim_names)

# Specify a set of seed genes with known associations to the disease being studied

In [None]:
# Let seed genes be the 20 known genes that cause Fanconi Anemia (from the Fanconi Anemia Mutation Database, http://www2.rockefeller.edu/fanconi/)
seed = ['FANCA', 'FANCB', 'FANCC', 'BRCA2', 'FANCD2',
        'FANCE', 'FANCF', 'FANCG', 'FANCI', 'BRIP1',
        'FANCL', 'FANCM', 'PALB2', 'RAD51C', 'SLX4',
        'ERCC4', 'RAD51', 'BRCA1', 'UBE2T', 'XRCC2']

In [None]:
# # Let seed genes be the known genes for one of 651 diseases (uncomment to use)

# # Retrieve a table of gene-disease associations from the Monarch Initiative (reformatted and stored on NDEx)
# monarch, _ = ddot.ndex_to_sim_matrix(
#     ddot.config.MONARCH_DISEASE_GENE_SLIM_URL,
#     similarity=None,
#     input_fmt='cx',
#     output_fmt='sparse')
# print(monarch.head())

# # Example: get the known genes for Caffey Disease
# seed = monarch.loc[monarch['disease']=='caffey_disease', 'gene'].tolist()
# seed = [s for s in seed if s in sim_names]
# print('Seed:', seed)

# Identify candidate set of genes that are highly similar to the seed set of genes

In [None]:
expand, expand_idx, sim_2_seed, fig = expand_seed(
    seed,
    sim_rank.values,
    sim_names,
    seed_perc=0,
    agg='mean',
    figure=True)
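
The scoring idea behind this step (average similarity to the seed genes, thresholded at the lowest-scoring seed gene) can be sketched in plain NumPy. This is an illustrative sketch only, not the `expand_seed` implementation: the helper `score_and_expand`, the toy gene names, and the toy matrix are all hypothetical, and details such as how self-similarity is handled are assumptions.

```python
import numpy as np

def score_and_expand(seed_genes, sim_matrix, gene_names):
    """Hypothetical helper: score genes by mean similarity to the seed set."""
    gene_names = list(gene_names)
    seed_idx = [gene_names.index(g) for g in seed_genes]
    # Score of each gene = average similarity to the seed genes
    scores = sim_matrix[:, seed_idx].mean(axis=1)
    # Threshold = minimum score among the seed genes themselves
    threshold = scores[seed_idx].min()
    keep = np.flatnonzero(scores >= threshold)
    return [gene_names[i] for i in keep], scores, threshold

# Toy data (assumed): two seed genes, two other genes, zero diagonal
toy_names = ['S1', 'S2', 'G3', 'G4']
toy_matrix = np.array([[0.0, 0.8, 0.7, 0.1],
                       [0.8, 0.0, 0.9, 0.2],
                       [0.7, 0.9, 0.0, 0.3],
                       [0.1, 0.2, 0.3, 0.0]])

toy_expanded, toy_scores, toy_threshold = score_and_expand(
    ['S1', 'S2'], toy_matrix, toy_names)
print(toy_expanded)
```

Here `G3` scores above the weakest seed gene and joins the candidate set, while `G4` falls below the threshold and is excluded.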

# Organize seed and candidate genes into a data-driven gene ontology

In [None]:
# Run CliXO, with parameters alpha=0.05 and beta=0.5
ont = Ontology.run_clixo(sim.loc[expand, :].loc[:, expand], alpha=0.05, beta=0.5, square=True)
ont

# Align the data-driven ontology with the Gene Ontology (GO)

In [None]:
# Read Gene Ontology from NDEx. This version has been pre-processed to contain a non-redundant set of GO terms and connections that are relevant to human genes (see Get_Gene_Ontology.ipynb) 
go_human = Ontology.from_ndex(ddot.config.GO_HUMAN_URL)
print(go_human)

In [None]:
# Align ontologies
alignment = ont.align(go_human, 
                      iterations=100,
                      update_self=['Term_Description'],
                      align_label='Term_Description',
                      verbose=True)
alignment.head()

In [None]:
# Note how node attributes have been updated to reflect the ontology alignment
ont.node_attr

# Upload ontology to NDEx to visualize in the HiView application (http://hiview.ucsd.edu)
* A two-dimensional layout of nodes is automatically calculated to optimize visualization of hierarchical structure
* Molecular networks, such as protein-protein interactions and RNA coexpression, can be visualized in HiView to understand how an ontology's structure is consistent with data
* Node attributes (color and size) can be set to visualize metadata.

In [None]:
# Set the node color of seed genes to be green
fill_attr = pd.DataFrame({'Vis:Fill Color' : '#6ACC65'}, index=seed)
ont.update_node_attr(fill_attr)

In [None]:
# Set the node color of inferred terms according to the alignment with GO (for visualization in HiView)
fill_attr = ont.node_attr['Aligned_Similarity'].dropna().map(ddot.color_gradient)
fill_attr = fill_attr.to_frame().rename(columns={'Aligned_Similarity' : 'Vis:Fill Color'})
ont.update_node_attr(fill_attr)
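
The cell above maps each alignment score to a color with `ddot.color_gradient`, whose exact color scale is not shown in this notebook. As a rough illustration of that kind of mapping, here is a hypothetical stand-in, `score_to_hex`, which linearly interpolates between two assumed RGB endpoints. It is not the ddot function and its colors are arbitrary.

```python
def score_to_hex(score, low=(255, 255, 255), high=(31, 119, 180)):
    """Hypothetical stand-in: map a score in [0, 1] to a hex color string."""
    # Clamp the score so out-of-range values still give a valid color
    score = min(max(float(score), 0.0), 1.0)
    # Linearly interpolate each RGB channel between the two endpoints
    rgb = [round(lo + (hi - lo) * score) for lo, hi in zip(low, high)]
    return '#{:02X}{:02X}{:02X}'.format(*rgb)

print(score_to_hex(0.0), score_to_hex(0.5), score_to_hex(1.0))
```

A function of this shape can be passed to `Series.map` in the same way as in the cell above.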

In [None]:
# Download a table containing multiple types of gene-gene interactions, which were preformatted and uploaded to NDEx for the Fanconi Anemia example.
from ndex.networkn import NdexGraph
G_ndex = NdexGraph(server='http://test.ndexbio.org', uuid='9412e430-02f1-11e8-bd69-0660b7976219')
G = ddot.NdexGraph_to_nx(G_ndex)
gene_network_data = ddot.nx_edges_to_pandas(G)
gene_network_data.index.names = ['Gene1', 'Gene2']
gene_network_data.reset_index(inplace=True)
gene_network_data['RandomForest integrated similarity'] = [sim.loc[g1, g2] for g1, g2 in zip(gene_network_data['Gene1'], gene_network_data['Gene2'])]
gene_network_data.head()

In [None]:
# Upload ontology to NDEx
ndex_url, G = ont.to_ndex(
    name='Fanconi Anemia Gene Ontology',
    description='Generated with the Data-driven Ontology Toolkit (https://github.com/michaelkyu/ddot)',
    ndex_server=ndex_server,
    ndex_user=ndex_user,
    ndex_pass=ndex_pass,
    visibility='PUBLIC',
    layout='bubble-collect',
    network=gene_network_data,
    main_feature='RandomForest integrated similarity',
)

print('Go to http://hiview.ucsd.edu in your web browser')
print('Enter this into the "NDEx Server URL" field: %s' % (ndex_url.split('ndexbio.org')[0] + 'ndexbio.org'))
print('Enter this into the "UUID of the main hierarchy" field: %s' % ndex_url.split('/')[-1])
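
The string splitting above can also be done with the standard library's URL parser. This is a minimal sketch, assuming the returned URL ends with the network UUID as its last path component. The helper `split_ndex_url` and the example URL, including its placeholder UUID, are hypothetical.

```python
from urllib.parse import urlsplit

def split_ndex_url(url):
    """Hypothetical helper: split a network URL into (server, uuid)."""
    parts = urlsplit(url)
    # Server = scheme plus host, e.g. 'http://test.ndexbio.org'
    server = '{}://{}'.format(parts.scheme, parts.netloc)
    # UUID = last non-empty path component
    uuid = parts.path.rstrip('/').split('/')[-1]
    return server, uuid

# Placeholder URL for illustration only (not a real network)
example_url = 'http://test.ndexbio.org/v2/network/00000000-0000-0000-0000-000000000000'
example_server, example_uuid = split_ndex_url(example_url)
print(example_server)
print(example_uuid)
```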