### In this notebook

- Filtering for the most variable genes before performing gene set enrichment analysis
- Visualizing the mHG enrichment statistic
- Visualizing the expression profiles of the genes responsible for the enrichment

### Performing gene set enrichment analysis using the most variable genes

We're repeating the gene set enrichment analysis from the previous notebook, but this time, we'll only perform the analysis on the 10,000 most variable genes. First, we'll do the variance filtering:

In [1]:
import os

from genometools.expression import ExpMatrix

import pandas as pd

dead_expression_file = os.path.join('..', 'data', 'brca_expression_5yr_dead.tsv')
survive_expression_file = os.path.join('..', 'data', 'brca_expression_5yr_survive.tsv')

top_genes = 10000

# read expression data
matrix_dead = ExpMatrix.read_tsv(dead_expression_file)
matrix_survive = ExpMatrix.read_tsv(survive_expression_file)

# combine matrix and filter genes by variance
matrix_comb = pd.concat([matrix_dead, matrix_survive], axis=1)
matrix_comb = matrix_comb.filter_variance(top_genes)

# select the most variable genes in the original matrices
matrix_dead = matrix_dead.loc[matrix_comb.genes]
matrix_survive = matrix_survive.loc[matrix_comb.genes]
print(matrix_dead.shape)

[2016-11-01 10:18:41] INFO: Selected the 10000 most variable genes (excluded 49.5% of genes, representing 9.4% of total variance).
(10000, 30)
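
To make the variance filter less of a black box, here is a minimal sketch of the idea in plain pandas, using a made-up four-gene matrix. The helper `filter_top_variance` is hypothetical and is not the genometools implementation; in particular, using the sample variance (pandas' default, `ddof=1`) and returning genes in order of decreasing variance are assumptions made for illustration.

```python
import pandas as pd


def filter_top_variance(matrix, top):
    """Keep the `top` most variable rows (genes) of a gene-by-sample matrix.

    Hypothetical helper for illustration only; not the genometools code.
    Returns the filtered matrix and the fraction of total variance excluded.
    """
    var = matrix.var(axis=1)  # per-gene sample variance (ddof=1, an assumption)
    keep = var.sort_values(ascending=False).index[:top]
    excluded_var = 1.0 - var.loc[keep].sum() / var.sum()
    return matrix.loc[keep], excluded_var


# toy expression matrix: rows are genes, columns are samples
toy = pd.DataFrame(
    [[1.0, 1.0, 1.0, 1.0],
     [0.0, 4.0, 0.0, 4.0],
     [2.0, 3.0, 2.0, 3.0],
     [0.0, 10.0, 0.0, 10.0]],
    index=['geneA', 'geneB', 'geneC', 'geneD'],
    columns=['s1', 's2', 's3', 's4'],
)

filtered, excluded = filter_top_variance(toy, 2)
print(list(filtered.index))
print('excluded %.1f%% of total variance' % (100 * excluded))
```

In this toy example, the two discarded genes account for less than 1% of the total variance, which mirrors the situation reported in the log message above: many genes are dropped, but they carry only a small share of the variance.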


Next, we're reading the other data required for the analysis (the list of protein-coding genes and the GO-derived gene sets):

In [2]:
import os

from genometools.basic import GeneSetCollection
from genometools.expression import ExpGenome, ExpMatrix

genome_file = os.path.join('..', 'data', 'protein_coding_genes_human_ensembl83.tsv')
gene_set_file = os.path.join('..', 'data', 'GO_gene_sets_human_ensembl83_goa153_ontology2016-01-18.tsv')

# read list of protein-coding genes
genome = ExpGenome.read_tsv(genome_file)

# read gene sets
gene_sets = GeneSetCollection.read_tsv(gene_set_file)

Finally, we're running the enrichment analysis: 

In [3]:
from genometools.enrichment import GeneSetEnrichmentAnalysis

pval_thresh = 0.05

diff = matrix_dead.median(axis=1) - matrix_survive.median(axis=1)
diff.sort_values(ascending=False, inplace=True)

enrichment = GeneSetEnrichmentAnalysis(genome, gene_sets)
enriched = enrichment.get_rank_based_enrichment(diff.index, pval_thresh=pval_thresh)

# sort significantly enriched GO terms by their E-score (higher E-score = stronger enrichment)
enriched = sorted(enriched, key=lambda x:-x.escore)

# print nicely formatted list of enriched GO terms
for i, enr in enumerate(enriched):
    print('%02d. %s' %(i, enr.get_pretty_format()))

[2016-11-01 10:18:42] INFO: Starting new HTTPS connection (1): api.plot.ly
[2016-11-01 10:18:48] INFO: Generating gene-by-gene set membership matrix...
[2016-11-01 10:18:49] INFO: Conducting 5538 tests.
[2016-11-01 10:18:49] INFO: Using Bonferroni-corrected p-value threshold: 9.0e-06
[2016-11-01 10:18:50] INFO: 39 / 6921 gene sets were found to be significantly enriched (p-value <= 9.0e-06).
00. condensed chromosome outer kinetochore (6 / 12 @ 319), p=4.7e-06, e=15.7x
01. mitotic chromosome condensation (7 / 10 @ 531), p=6.5e-07, e=13.2x
02. spindle checkpoint (10 / 28 @ 367), p=3.0e-07, e=13.0x
03. chromosome condensation (7 / 12 @ 531), p=4.3e-06, e=11.0x
04. condensed chromosome kinetochore (14 / 21 @ 1346), p=2.6e-07, e=10.4x
05. condensed chromosome, centromeric region (16 / 25 @ 1346), p=7.0e-08, e=10.0x
06. spindle assembly checkpoint (9 / 26 @ 367), p=1.8e-06, e=9.7x
07. negative regulation of sister chromatid segregation (9 / 28 @ 367), p=3.7e-06, e=9.0x
08. regulation of ubiq

You can see that far fewer gene sets are reported as enriched, and that the p-values are generally higher (worse). The analysis using all genes could therefore have contained a lot of false-positive results.
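
The Bonferroni-corrected threshold reported in the log output can be reproduced by hand: it is the nominal p-value threshold divided by the number of tests conducted. The short calculation below uses the two numbers from this notebook (`pval_thresh = 0.05` and the 5538 tests reported in the log); the variable names `num_tests` and `corrected_thresh` are only chosen for illustration.

```python
# nominal p-value threshold, as set in the enrichment cell
pval_thresh = 0.05

# number of tests, taken from the log line "Conducting 5538 tests."
num_tests = 5538

# Bonferroni correction: divide the threshold by the number of tests
corrected_thresh = pval_thresh / num_tests

print('%.1e' % corrected_thresh)  # prints 9.0e-06
```

A gene set is only reported as significantly enriched if its p-value is at or below this corrected threshold.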

### Visualizing the enrichment statistic

We can use the `get_result_figure` function from the XL-mHG package to visualize the enrichment statistic for any of our significantly enriched GO terms. Here, we plot the term with the highest E-score:

In [4]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode()

from xlmhg import get_result_figure

enr = enriched[0]
fig = get_result_figure(enr)
fig.layout.title = enr.get_pretty_format()
iplot(fig)
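
For some intuition about the statistic being visualized: the basic (unconstrained) mHG statistic is the smallest hypergeometric tail p-value obtained over all possible cutoffs of the ranked list. The sketch below computes it for a made-up ranked 0/1 vector, where 1 marks a gene that belongs to the gene set. It is a simplified illustration and not the `xlmhg` implementation: it ignores the additional `X` and `L` parameters of the XL-mHG test, and it does not compute the p-value associated with the statistic. The function names `hypergeom_tail` and `mhg_statistic` are hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total


def mhg_statistic(v):
    """Return (statistic, cutoff) for a ranked list `v` of 0s and 1s.

    The statistic is the minimum hypergeometric tail p-value over all
    cutoffs n = 1..N; the cutoff is the n at which it is attained.
    """
    N, K = len(v), sum(v)
    best_p, best_n = 1.0, 0
    k = 0  # number of 1s among the first n entries
    for n in range(1, N + 1):
        k += v[n - 1]
        p = hypergeom_tail(N, K, n, k)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n


# toy ranked list: gene set members are concentrated near the top
v = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
stat, cutoff = mhg_statistic(v)
print('statistic = %.4f at cutoff %d' % (stat, cutoff))
```

Because the minimum is taken over many cutoffs, the statistic itself is not a valid p-value; the XL-mHG package computes a proper p-value for it.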

### Visualizing the expression profiles of the genes responsible for the enrichment

In [5]:
from genometools.expression.cluster import cluster_samples

# cluster the samples in each of the two groups
matrix_dead = cluster_samples(matrix_dead)
matrix_survive = cluster_samples(matrix_survive)

print(matrix_dead.shape, matrix_survive.shape)

# combine the clustered matrices
matrix_comb = pd.concat([matrix_dead, matrix_survive], axis=1)

# create the heatmap
fig = matrix_comb.loc[enr.genes_above_cutoff].center_genes(use_median=True).get_figure(show_sample_labels=False)

# add a line separating the two classes
fig.layout.shapes.append(
    {
        'type': 'line',
        'x0': matrix_dead.shape[1],
        'y0': enr.k-0.5,
        'x1': matrix_dead.shape[1],
        'y1': -0.5,
        'line': {
            'color': 'rgb(255, 0, 255)',  # magenta; rgb() channels range from 0 to 255
            'width': 4,
        }
    }
)

iplot(fig)


(10000, 30) (10000, 123)
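
The heatmap shows centered expression values rather than raw ones. As a rough sketch of what median-centering means, the toy example below subtracts each gene's median across samples using plain pandas. This assumes that `center_genes(use_median=True)` performs a per-gene median subtraction; the toy matrix and variable names are made up for illustration.

```python
import pandas as pd

# toy expression matrix: rows are genes, columns are samples
toy = pd.DataFrame(
    [[1.0, 2.0, 6.0],
     [10.0, 10.0, 13.0]],
    index=['geneA', 'geneB'],
    columns=['s1', 's2', 's3'],
)

# subtract each gene's median across samples (row-wise centering)
centered = toy.sub(toy.median(axis=1), axis=0)
print(centered)
```

After centering, every gene has a median of zero, so the heatmap colors reflect how far each sample deviates from the typical expression level of that gene, regardless of the gene's absolute expression.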


### Copyright and License

Copyright (c) 2016 Florian Wagner.

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).