### In this notebook

- How to use the `ExpMatrix` class from the `genometools` package to work with expression data
- How to normalize, threshold and log-transform expression data
- How to perform hierarchical clustering and visualize an expression matrix as an interactive heatmap

### Disclaimer

The `genometools` package was developed by me (Florian Wagner).

### Expression file format

We're reading the expression data from a tab-delimited text file. Here's an example of how such a file must look in order for the method used here to work:

```
    ignored Sample1 Sample2 Sample3
    IGBP1   8.64947 8.01958 7.95444
    MYC     7.61296 7.38281 7.58559
    SMAD1   8.84338 8.41662 8.94365
    MDM1    6.17908 6.07470 5.59411
    CD44    7.64093 7.56293 7.58277
```

As you can see, the first line contains the sample labels (with the first field being ignored), and each following row corresponds to a gene. The first column contains the gene name, and the following columns contain the expression values. In this example, the data is already $\log_2$-transformed.
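
To make the layout concrete, here is a minimal sketch that parses the example above with plain pandas instead of `genometools`. It only illustrates the file format (header line with sample labels, gene names in the first column) and is not meant to show how `ExpMatrix.read_tsv()` works internally. The variable names `tsv_text` and `toy` are made up for this illustration.

```python
import io

import pandas as pd

# the example file from above, with real tab characters between the fields
tsv_text = (
    "ignored\tSample1\tSample2\tSample3\n"
    "IGBP1\t8.64947\t8.01958\t7.95444\n"
    "MYC\t7.61296\t7.38281\t7.58559\n"
    "SMAD1\t8.84338\t8.41662\t8.94365\n"
    "MDM1\t6.17908\t6.07470\t5.59411\n"
    "CD44\t7.64093\t7.56293\t7.58277\n"
)

# index_col=0 turns the first column (gene names) into the row index;
# the remaining fields of the header line become the sample labels
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(toy.shape)                  # (number of genes, number of samples)
print(toy.loc["MYC", "Sample2"])  # expression of MYC in Sample2
```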

In [1]:
# ignore this -- just setting a few display options for pandas
import pandas as pd
#print(pd.get_option("display.max_rows"))
#print(pd.get_option("display.max_columns"))
pd.options.display.max_rows = 20
pd.options.display.max_columns = 5

### Reading expression data using the `genometools` package

We're reading an example dataset that is contained in this repository. This dataset contains TCGA breast cancer expression data for patients surviving for at least five years after their diagnosis. The data was generated using RNA-Seq, and the expression values are provided as FPKM values (fragments per kilobase of transcript per million mapped reads).

In [2]:
import os

from genometools.expression import ExpMatrix

# we're using os.path.join() so that we get a valid path on Linux, Mac, and Windows
expression_file = os.path.join('..', 'data', 'brca_expression_5yr_survive_raw.tsv')
print('Path of expression file: ', expression_file)
print()

# read the expression data
matrix = ExpMatrix.read_tsv(expression_file)

# print expression data of the first five genes
print(matrix[:5])  

Path of expression file:  ../data/brca_expression_5yr_survive_raw.tsv

Samples  TCGA-A2-A25E-01A  TCGA-AR-A24L-01A        ...         \
Genes                                              ...          
RAB4B             3.02223           1.27855        ...          
TIGAR             3.08927           4.16475        ...          
RNF44             5.40491           8.69783        ...          
DNAH3             0.01141           0.05329        ...          
RPL23A          124.78030         100.39448        ...          

Samples  TCGA-AR-A0TW-01A  TCGA-AR-A252-01A  
Genes                                        
RAB4B             3.92477           1.72338  
TIGAR             4.52063           5.16972  
RNF44            13.76553           9.29157  
DNAH3             0.00213           0.10911  
RPL23A          142.02469         173.36890  

[5 rows x 123 columns]


Note: When using `print()` on a matrix, the output is split into multiple blocks of columns if not all the columns fit onto one line.

### The relationship between the `ExpMatrix` class and the `pandas.DataFrame` class

The `ExpMatrix` class *inherits* from the `pandas.DataFrame` class. Inheritance is a concept from object-oriented programming. When a class inherits from another class, we can call the inheriting class the *child*, and the other class the *parent*. The child then has all the attributes and methods from the parent, but a programmer can add additional attributes and methods to the child that are not found in the parent.
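
As a rough illustration of this idea, the sketch below defines a toy child class of `pandas.DataFrame`. The class `ToyMatrix` and its `num_genes` attribute are hypothetical and exist only for this example; this is not how `ExpMatrix` itself is written. The `_constructor` property is the hook that pandas documents for subclasses, so that methods inherited from the parent return the child type.

```python
import pandas as pd


class ToyMatrix(pd.DataFrame):
    """Hypothetical child class of pandas.DataFrame (illustration only)."""

    @property
    def _constructor(self):
        # tells pandas to build ToyMatrix objects (not plain DataFrames)
        # when an inherited method returns a new table
        return ToyMatrix

    @property
    def num_genes(self):
        # an attribute added by the child that the parent does not have
        return self.shape[0]


toy = ToyMatrix(
    {"Sample1": [8.6, 7.6], "Sample2": [8.0, 7.4]},
    index=["IGBP1", "MYC"],
)

print(isinstance(toy, pd.DataFrame))  # the child is still a DataFrame
print(toy.mean(axis=1))               # method inherited from the parent
print(toy.num_genes)                  # attribute defined only on the child
```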

What this means in practice is that you can do everything with `ExpMatrix` objects that you can do with *pandas* `DataFrame` objects. You can access all the attributes and methods from the `pandas.DataFrame` class, as [documented here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). You can also use all the ways of indexing a pandas `DataFrame`, as [documented here](http://pandas.pydata.org/pandas-docs/stable/indexing.html). For example, you can select a gene by name:

In [3]:
profile = matrix.loc['MYC']
print(profile)
print('------------------')
print('Type:', type(profile))

Samples
TCGA-A2-A25E-01A    61.56640
TCGA-AR-A24L-01A    20.02871
TCGA-AR-A24Z-01A    16.45212
TCGA-AO-A03N-01B     1.83598
TCGA-BH-A0EI-01A     8.76648
TCGA-B6-A0WY-01A    43.11045
TCGA-AO-A12D-01A    23.40482
TCGA-BH-A0AU-01A    25.72791
TCGA-AR-A1AO-01A    40.51482
TCGA-AO-A03R-01A    74.17699
                      ...   
TCGA-A2-A04U-01A    47.47473
TCGA-AR-A0TS-01A    18.80322
TCGA-AR-A24U-01A    28.39909
TCGA-B6-A0RT-01A    70.45375
TCGA-BH-A18K-01A    21.56477
TCGA-GM-A2DC-01A    27.15174
TCGA-BH-A0AZ-01A    13.21819
TCGA-AO-A0JE-01A    21.41329
TCGA-AR-A0TW-01A    65.92601
TCGA-AR-A252-01A    20.15881
Name: MYC, dtype: float64
------------------
Type: <class 'genometools.expression.profile.ExpProfile'>


As you can see, selecting a single gene gives us an `ExpProfile` object.

Similarly to how the `ExpMatrix` class inherits from `pandas.DataFrame`, the `ExpProfile` class inherits from `pandas.Series`. This means you can access all the attributes and methods of `pandas.Series`, as [documented here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html). For example, you can print some summary statistics using the `pandas.Series.describe()` method:

In [4]:
print(profile.describe())

count    123.000000
mean      39.530449
std       31.389610
min        1.835980
25%       19.415965
50%       32.700830
75%       48.116135
max      201.999290
Name: MYC, dtype: float64


So the mean expression of MYC in our data is 39.5, and the median is 32.7.

### Sorting the genes alphabetically

As you may have noticed, the genes in our matrix are not sorted alphabetically. Let's change that:

In [5]:
matrix = matrix.sort_index()

print(matrix[:5])

Samples  TCGA-A2-A25E-01A  TCGA-AR-A24L-01A        ...         \
Genes                                              ...          
A1BG              0.09294           0.19308        ...          
A1CF              0.01454           0.01363        ...          
A2M             104.43383         114.38324        ...          
A2ML1             0.08709           0.02933        ...          
A3GALT2           0.01516           0.00000        ...          

Samples  TCGA-AR-A0TW-01A  TCGA-AR-A252-01A  
Genes                                        
A1BG              0.13932           0.13110  
A1CF              0.00498           0.00296  
A2M              63.98945         187.70691  
A2ML1             0.02402           0.01749  
A3GALT2           0.01559           0.00000  

[5 rows x 123 columns]


### Normalizing the expression values for each sample

We would like the expression values of all genes to add up to the same total in every sample. Right now, the totals differ from sample to sample:

In [6]:
print(matrix.sum(axis=0))

Samples
TCGA-A2-A25E-01A    270028.72911
TCGA-AR-A24L-01A    245778.70120
TCGA-AR-A24Z-01A    240779.11464
TCGA-AO-A03N-01B    484606.53165
TCGA-BH-A0EI-01A    234636.34215
TCGA-B6-A0WY-01A    265204.13565
TCGA-AO-A12D-01A    284467.77727
TCGA-BH-A0AU-01A    247396.35926
TCGA-AR-A1AO-01A    284120.05646
TCGA-AO-A03R-01A    415584.23859
                        ...     
TCGA-A2-A04U-01A    313258.64889
TCGA-AR-A0TS-01A    243910.32221
TCGA-AR-A24U-01A    276432.11400
TCGA-B6-A0RT-01A    274050.75451
TCGA-BH-A18K-01A    293826.93000
TCGA-GM-A2DC-01A    324669.25042
TCGA-BH-A0AZ-01A    264875.22081
TCGA-AO-A0JE-01A    272453.66686
TCGA-AR-A0TW-01A    291380.68812
TCGA-AR-A252-01A    253417.17351
dtype: float64


Let's normalize the values to make them all sum to one million. Thanks to pandas, this is quite simple:

In [7]:
matrix = 1e6*(matrix/matrix.sum(axis=0))
print(matrix.sum(axis=0))

Samples
TCGA-A2-A25E-01A    1000000.0
TCGA-AR-A24L-01A    1000000.0
TCGA-AR-A24Z-01A    1000000.0
TCGA-AO-A03N-01B    1000000.0
TCGA-BH-A0EI-01A    1000000.0
TCGA-B6-A0WY-01A    1000000.0
TCGA-AO-A12D-01A    1000000.0
TCGA-BH-A0AU-01A    1000000.0
TCGA-AR-A1AO-01A    1000000.0
TCGA-AO-A03R-01A    1000000.0
                      ...    
TCGA-A2-A04U-01A    1000000.0
TCGA-AR-A0TS-01A    1000000.0
TCGA-AR-A24U-01A    1000000.0
TCGA-B6-A0RT-01A    1000000.0
TCGA-BH-A18K-01A    1000000.0
TCGA-GM-A2DC-01A    1000000.0
TCGA-BH-A0AZ-01A    1000000.0
TCGA-AO-A0JE-01A    1000000.0
TCGA-AR-A0TW-01A    1000000.0
TCGA-AR-A252-01A    1000000.0
dtype: float64
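
The reason this one-liner works is that pandas aligns the labels of the column sums with the column labels of the matrix, so each sample is divided by its own total. Here is a small sketch on made-up numbers (the names `toy`, `col_sums` and `normalized` are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Sample1": [1.0, 3.0], "Sample2": [10.0, 10.0]},
    index=["GeneA", "GeneB"],
)

# one total per sample: 4.0 for Sample1 and 20.0 for Sample2
col_sums = toy.sum(axis=0)

# DataFrame / Series matches the Series labels to the column labels,
# so every column is divided by its own total
normalized = 1e6 * (toy / col_sums)

print(normalized)
print(normalized.sum(axis=0))  # every sample now sums to one million
```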


### Thresholding expression values for lowly expressed genes

Next, we're going to threshold low expression values. We're going to set all values below 5 (in the normalized units from the previous step) to 5. This is a rule of thumb that gets rid of a lot of technical noise that affects lowly expressed genes.

In [8]:
matrix[matrix<5.0] = 5.0

# to show that this worked, we're printing the minimum value for each sample
print(matrix.min(axis=0))

Samples
TCGA-A2-A25E-01A    5.0
TCGA-AR-A24L-01A    5.0
TCGA-AR-A24Z-01A    5.0
TCGA-AO-A03N-01B    5.0
TCGA-BH-A0EI-01A    5.0
TCGA-B6-A0WY-01A    5.0
TCGA-AO-A12D-01A    5.0
TCGA-BH-A0AU-01A    5.0
TCGA-AR-A1AO-01A    5.0
TCGA-AO-A03R-01A    5.0
                   ... 
TCGA-A2-A04U-01A    5.0
TCGA-AR-A0TS-01A    5.0
TCGA-AR-A24U-01A    5.0
TCGA-B6-A0RT-01A    5.0
TCGA-BH-A18K-01A    5.0
TCGA-GM-A2DC-01A    5.0
TCGA-BH-A0AZ-01A    5.0
TCGA-AO-A0JE-01A    5.0
TCGA-AR-A0TW-01A    5.0
TCGA-AR-A252-01A    5.0
dtype: float64
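
If you prefer not to modify the matrix in place, `pandas.DataFrame.clip()` gives the same result and returns a new object. The following sketch compares both approaches on made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Sample1": [0.0, 12.5], "Sample2": [4.9, 5.0]},
    index=["GeneA", "GeneB"],
)

# approach used in this notebook: boolean mask plus in-place assignment
by_mask = toy.copy()
by_mask[by_mask < 5.0] = 5.0

# alternative: clip() raises everything below the lower bound to the bound
# and leaves the original object untouched
by_clip = toy.clip(lower=5.0)

print(by_clip)
print(by_mask.equals(by_clip))
```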


### Log-transforming the data

We would also like to log-transform the data. This serves as a variance-stabilizing transformation that helps ensure that the amount of technical noise associated with each expression measurement is approximately the same.

Note that since we raised all low expression values to 5 in the previous step, we avoid taking the logarithm of 0, which is undefined.

In [9]:
import numpy as np

matrix = np.log2(matrix)

print(matrix[:5])

Samples  TCGA-A2-A25E-01A  TCGA-AR-A24L-01A        ...         \
Genes                                              ...          
A1BG             2.321928          2.321928        ...          
A1CF             2.321928          2.321928        ...          
A2M              8.595261          8.862300        ...          
A2ML1            2.321928          2.321928        ...          
A3GALT2          2.321928          2.321928        ...          

Samples  TCGA-AR-A0TW-01A  TCGA-AR-A252-01A  
Genes                                        
A1BG             2.321928          2.321928  
A1CF             2.321928          2.321928  
A2M              7.778785          9.532752  
A2ML1            2.321928          2.321928  
A3GALT2          2.321928          2.321928  

[5 rows x 123 columns]
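
The value 2.321928 that appears so often in this output is simply $\log_2 5$, the transformed threshold. A short sketch with made-up values shows this, and also what would happen to a zero without thresholding:

```python
import numpy as np

raw = np.array([0.0, 2.0, 5.0, 80.0])

# same effect as the thresholding step: everything below 5 becomes 5
floored = np.maximum(raw, 5.0)
logged = np.log2(floored)
print(logged)  # the first three entries are all log2(5)

# without thresholding, a zero would turn into negative infinity
with np.errstate(divide="ignore"):
    zero_log = np.log2(0.0)
print(zero_log)
```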


### Performing variance filtering

Filtering the expression data for the most variable genes has been a popular technique for analyzing microarray data [(Bourgon et al.)](https://www.ncbi.nlm.nih.gov/pubmed/20460310), and it works well for RNA-Seq data, too. For example, with RNA-Seq data, I often focus on the 10,000 most variable genes. Here we'll restrict ourselves even more, to 2,000 genes. We can do this with the `filter_variance()` method:

In [10]:
matrix = matrix.filter_variance(2000)

[2016-10-26 12:30:09] INFO: Selected the 2000 most variable genes (excluded 89.9% of genes, representing 58.0% of total variance).
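
Conceptually, variance filtering can be sketched in a few lines of plain pandas. The helper `top_variance_genes()` below is hypothetical and only illustrates the idea; `filter_variance()` may differ in details such as the variance estimator or the order of the returned genes.

```python
import pandas as pd


def top_variance_genes(df, n):
    """Keep the n rows (genes) with the largest variance across samples."""
    variances = df.var(axis=1)            # one variance per gene
    keep = variances.nlargest(n).index    # labels of the n most variable genes
    # the boolean mask keeps the selected genes in their original order
    return df.loc[df.index.isin(keep)]


toy = pd.DataFrame(
    [[1.0, 1.0, 1.0], [0.0, 5.0, 10.0], [2.0, 3.0, 4.0]],
    index=["GeneA", "GeneB", "GeneC"],
    columns=["Sample1", "Sample2", "Sample3"],
)

filtered = top_variance_genes(toy, 2)
print(filtered)  # GeneA is constant across samples and gets dropped
```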


### Centering gene expression values

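For visualization purposes, and for clustering samples, it often helps to center gene expression values. Centering means subtracting each gene's mean (or median) expression from all of its values, so that the mean (or median) expression of each gene becomes 0. We can do this easily (here we use the median):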

In [14]:
matrix = matrix.center_genes(use_median=True)
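
In plain pandas, median centering amounts to subtracting each row's median from that row. The sketch below uses made-up numbers and does not claim to reproduce the internals of `center_genes()`:

```python
import pandas as pd

toy = pd.DataFrame(
    [[1.0, 2.0, 9.0], [4.0, 4.0, 7.0]],
    index=["GeneA", "GeneB"],
    columns=["Sample1", "Sample2", "Sample3"],
)

# one median per gene (row), subtracted along the rows (axis=0 aligns
# the medians with the row labels)
gene_medians = toy.median(axis=1)
centered = toy.sub(gene_medians, axis=0)

print(centered)
print(centered.median(axis=1))  # every gene now has a median of 0
```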

### Performing hierarchical clustering and generating an interactive heatmap

As the last part of this notebook, we'll cluster the filtered expression matrix using hierarchical clustering and then show a heatmap of the result.

In [15]:
from genometools.expression.cluster import bicluster

matrix_clustered = bicluster(matrix)
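
To give an idea of what biclustering does, here is a minimal sketch that uses SciPy to cluster the rows and the columns of a small random matrix separately and then reorders both according to the resulting dendrograms. The helper `reorder_by_clustering()`, the average linkage method and the Euclidean metric are assumptions made for this illustration; the `bicluster()` function used above may rely on different settings.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage


def reorder_by_clustering(df, method="average", metric="euclidean"):
    """Reorder rows and columns by the leaf order of their dendrograms."""
    row_order = leaves_list(linkage(df.values, method=method, metric=metric))
    col_order = leaves_list(linkage(df.values.T, method=method, metric=metric))
    return df.iloc[row_order, col_order]


rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=["Gene%d" % i for i in range(6)],
    columns=["Sample%d" % j for j in range(4)],
)

clustered = reorder_by_clustering(toy)
print(clustered.index.tolist())    # genes in dendrogram order
print(clustered.columns.tolist())  # samples in dendrogram order
```

Only the order of the genes and samples changes; the values themselves stay the same.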

In [21]:
# plotting the figure

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode()

fig = matrix_clustered.get_figure(emin=-2, emax=2, height=600, width=800, show_sample_labels=False)
iplot(fig)

### Copyright and License

Copyright (c) 2016 Florian Wagner.

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).