### In this notebook

- Hierarchical clustering of genes and samples, based on the 2,000 most variable genes

### Hierarchical clustering

We perform hierarchical clustering on both the genes (rows) and the samples (columns) of the expression matrix for 123 breast cancer "survivors" (patients who lived for at least five years after their initial diagnosis). Before clustering, we keep only the 2,000 most variable genes and center each gene's expression on its median.

In [1]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode()

import os

from genometools.expression import ExpMatrix
from genometools.expression.cluster import bicluster

expression_file = os.path.join('..', 'data', 'brca_expression_5yr_survive.tsv')

# read expression data
matrix = ExpMatrix.read_tsv(expression_file)
print('matrix shape:', matrix.shape)

# variance filtering
matrix = matrix.filter_variance(top=2000)

# center expression for each gene
matrix = matrix.center_genes(use_median=True)

# perform hierarchical clustering on both genes and samples
matrix = bicluster(matrix)

# visualize the clustered matrix as a heatmap
fig = matrix.get_figure(
    height=800, width=900,
    emin=-2.0, emax=2.0,
    margin_bottom=100,
    heatmap_kw=dict(colorbar_label='Centered Expression<br>(log<sub>2</sub>-scale)'))
iplot(fig)

matrix shape: (19785, 123)
[2016-11-01 10:16:41] INFO: Selected the 2000 most variable genes (excluded 89.9% of genes, representing 58.0% of total variance).
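
To make the steps in the cell above more concrete, here is a minimal sketch of variance filtering, median centering, and two-way hierarchical clustering using only NumPy and SciPy. It is an illustration under assumptions, not the `genometools` implementation: the toy matrix is random, and the linkage method (`average`) and distance metric (`correlation`) are choices made for this example; `bicluster` may use different settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)

# Toy expression matrix (hypothetical data): 50 genes x 8 samples,
# with a different spread per gene so that variances differ.
X = rng.normal(size=(50, 8)) * rng.uniform(0.1, 3.0, size=(50, 1))

# Variance filtering: keep the `top` genes with the largest variance.
top = 10
var = X.var(axis=1, ddof=1)
keep = np.sort(np.argsort(var)[::-1][:top])
X_top = X[keep]

# Share of the total variance carried by the retained genes.
retained = var[keep].sum() / var.sum()

# Center each gene (row) on its median across samples.
X_centered = X_top - np.median(X_top, axis=1, keepdims=True)

# Cluster genes (rows) and samples (columns) separately, then
# reorder the matrix by the leaf order of each dendrogram.
gene_order = leaves_list(
    linkage(X_centered, method='average', metric='correlation'))
sample_order = leaves_list(
    linkage(X_centered.T, method='average', metric='correlation'))
X_clustered = X_centered[gene_order][:, sample_order]

print('clustered shape:', X_clustered.shape)
print('fraction of variance retained: %.3f' % retained)
```

Reordering rows and columns by dendrogram leaf order is what places similar genes and similar samples next to each other in the heatmap.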


Several gene clusters are discernible in the heatmap, each with a rather distinct expression pattern across the samples. Only in some cases are the patterns of two clusters strongly anti-correlated. In other words, there appears to be considerable expression heterogeneity within this panel of tumors. To generate hypotheses about the biological importance of a cluster, we would have to carefully examine the individual genes it contains.
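
One possible way to get at the genes in each cluster is to cut the gene dendrogram into a fixed number of flat clusters and list the members of each. The sketch below does this with SciPy on a small toy matrix. The gene names, the data, the number of clusters, and the linkage settings are all hypothetical, and the sketch does not assume anything about what `bicluster` returns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Toy data (hypothetical): two groups of four genes with opposite
# expression patterns across six samples, plus a little noise.
pattern = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
up = pattern + 0.1 * rng.normal(size=(4, 6))
down = -pattern + 0.1 * rng.normal(size=(4, 6))
X = np.vstack([up, down])
genes = ['gene_%d' % i for i in range(X.shape[0])]

# Build the gene dendrogram using correlation distance.
Z = linkage(X, method='average', metric='correlation')

# Cut the dendrogram so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')

# Collect the gene names belonging to each cluster.
clusters = {}
for gene, label in zip(genes, labels):
    clusters.setdefault(int(label), []).append(gene)

for label, members in sorted(clusters.items()):
    print('cluster %d: %s' % (label, ', '.join(members)))
```

With real data, the gene lists obtained this way could then be inspected individually or passed to an enrichment analysis.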

### Copyright and License

Copyright (c) 2016 Florian Wagner.

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).