# Creating a Gene Expression Heatmap

## Introduction
This is a companion notebook to the `ClusteringMethods` notebook. Its purpose is to provide a "real-world" example of how clustering can be used to identify gene expression patterns. It uses the same [Mice Protein Expression Data Set](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression#) from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.html). For additional information on the dataset, please refer to the UCI site and the original notebook.

## Data Cleaning / Preprocessing
The same process will be used to construct the final, "clean" dataset, so please refer to the `ClusteringMethods` notebook for full details. A small difference is the use of the extra library `RColorBrewer`, in order to produce a more colorful figure.

In [1]:
library(dplyr);
library(gplots);
library(RColorBrewer);

MouseDataRaw <- read.csv(file="data/Data_Cortex_Nuclear.csv",header=TRUE,sep=";");

# STEP 1: Restructure data for the clustering process
# ====================================================
# First, include an extra column that will correspond to the label of each instance
# The label is constructed through concatenation of two existing fields,
# MouseID and class, separated by three spaces
MouseDataRaw$rowNamesInfo <- paste(MouseDataRaw$MouseID, MouseDataRaw$class, sep="   ");

# Second, split the dataset for the different classes (i.e. 8 datasets)
# Each subset contains 79 attributes: the 77 protein expression levels, the class, and the label
MouseData_cCSs <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "c-CS-s") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_cCSm <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "c-CS-m") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_cSCs <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "c-SC-s") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_cSCm <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "c-SC-m") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_tCSs <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "t-CS-s") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_tCSm <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "t-CS-m") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_tSCs <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "t-SC-s") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);

MouseData_tSCm <- MouseDataRaw %>%
  na.omit() %>%
  filter(class == "t-SC-m") %>%
  select(-MouseID, -Genotype, -Treatment, -Behavior);


# Finally, join all subsets in the final set. The label attribute is assigned as the
# row name, and dropped as an independent attribute
# Final construct: each row has 77 attributes (for the 77 proteins)
MouseDataClean <- bind_rows(MouseData_cCSs, MouseData_cCSm, MouseData_cSCs, MouseData_cSCm, MouseData_tCSs, MouseData_tCSm, MouseData_tSCs, MouseData_tSCm);
rownames(MouseDataClean) <- MouseDataClean$rowNamesInfo; 
MouseData <- select(MouseDataClean, -rowNamesInfo, -class);

head(MouseData);
summary(MouseData);


| | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | pELK_N | … | SHH_N | BAD_N | BCL2_N | pS6_N | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4458384 | 0.719069 | 0.4190171 | 2.859232 | 5.321076 | 0.229538 | 0.1712234 | 3.518429 | 0.2241737 | 1.502336 | … | 0.2052275 | 0.1366628 | 0.1069778 | 0.1252275 | 0.1149125 | 0.5784831 | 0.195846 | 0.1494049 | 0.1824971 | 1.736803 |
| 2 | 0.4274166 | 0.7232267 | 0.4328929 | 2.939673 | 5.38491 | 0.2344402 | 0.169854 | 3.541551 | 0.2389604 | 1.575278 | … | 0.2247601 | 0.135684 | 0.1210247 | 0.132908 | 0.1245799 | 0.5950421 | 0.2085423 | 0.1612039 | 0.1939317 | 1.84391 |
| 3 | 0.4567857 | 0.7507313 | 0.4632568 | 3.090683 | 5.576101 | 0.2443046 | 0.1782643 | 3.654995 | 0.235972 | 1.722099 | … | 0.2295513 | 0.1298929 | 0.1298468 | 0.1287389 | 0.1326625 | 0.5667467 | 0.2008863 | 0.1651588 | 0.1794221 | 1.77031 |
| 4 | 0.3662387 | 0.5892331 | 0.3644717 | 2.534339 | 4.605254 | 0.2343032 | 0.1873012 | 3.230416 | 0.194487 | 1.342561 | … | 0.2057434 | 0.1421907 | 0.1097177 | 0.1277717 | 0.1154732 | 0.556222 | 0.2028353 | 0.1605477 | 0.1905974 | 1.763359 |
| 5 | 0.3851905 | 0.6067402 | 0.3761718 | 2.584431 | 4.786994 | 0.2408924 | 0.1698113 | 3.230806 | 0.1905779 | 1.457933 | … | 0.22147 | 0.1412626 | 0.1201586 | 0.1329674 | 0.1230253 | 0.5512046 | 0.2114059 | 0.1664532 | 0.1915828 | 1.807502 |
| 6 | 0.3773878 | 0.6306113 | 0.3909981 | 2.666428 | 5.101839 | 0.2443887 | 0.1766953 | 3.27149 | 0.1948424 | 1.531519 | … | 0.2264863 | 0.1370762 | 0.1228518 | 0.1285416 | 0.1240711 | 0.525836 | 0.2072689 | 0.167615 | 0.1916512 | 1.707095 |


    DYRK1A_N         ITSN1_N           BDNF_N           NR1_N      
 Min.   :0.1453   Min.   :0.2454   Min.   :0.1152   Min.   :1.331  
 1st Qu.:0.2908   1st Qu.:0.4805   1st Qu.:0.2790   1st Qu.:2.044  
 Median :0.3721   Median :0.5903   Median :0.3085   Median :2.285  
 Mean   :0.4152   Mean   :0.6231   Mean   :0.3150   Mean   :2.295  
 3rd Qu.:0.4957   3rd Qu.:0.7306   3rd Qu.:0.3470   3rd Qu.:2.544  
 Max.   :0.9922   Max.   :1.3364   Max.   :0.4972   Max.   :3.758  
     NR2A_N          pAKT_N          pBRAF_N         pCAMKII_N    
 Min.   :1.738   Min.   :0.1210   Min.   :0.1076   Min.   :1.344  
 1st Qu.:3.118   1st Qu.:0.1974   1st Qu.:0.1567   1st Qu.:2.484  
 Median :3.708   Median :0.2206   Median :0.1761   Median :3.370  
 Mean   :3.784   Mean   :0.2245   Mean   :0.1754   Mean   :3.621  
 3rd Qu.:4.343   3rd Qu.:0.2478   3rd Qu.:0.1925   3rd Qu.:4.596  
 Max.   :8.483   Max.   :0.3538   Max.   :0.3171   Max.   :7.105  
    pCREB_N           pELK_N          pERK_N           
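
The eight per-class subsets in the cell above follow one pattern: drop incomplete rows, group by class in a fixed order, and move the label into the row names. As a rough sketch of that pattern in base R, the snippet below uses a tiny made-up data frame (`toy_raw`, with hypothetical mouse IDs and values and only two protein columns) rather than the real dataset. It is an illustration under those assumptions, not a replacement for the notebook's code.

```r
# Toy stand-in for MouseDataRaw (hypothetical IDs and values, two proteins only)
toy_raw <- data.frame(
  MouseID   = c("m1_1", "m2_1", "m3_1", "m4_1"),
  DYRK1A_N  = c(0.45, 0.43, NA, 0.37),
  ITSN1_N   = c(0.72, 0.72, 0.75, 0.59),
  Genotype  = c("Control", "Ts65Dn", "Control", "Control"),
  Treatment = c("Memantine", "Saline", "Saline", "Saline"),
  Behavior  = c("C/S", "C/S", "S/C", "C/S"),
  class     = c("c-CS-m", "t-CS-s", "c-SC-s", "c-CS-s"),
  stringsAsFactors = FALSE
)

# Same class order as the eight subsets in the notebook
class_order <- c("c-CS-s", "c-CS-m", "c-SC-s", "c-SC-m",
                 "t-CS-s", "t-CS-m", "t-SC-s", "t-SC-m")

# Build the label, drop rows with missing values, then sort rows by class
toy_raw$rowNamesInfo <- paste(toy_raw$MouseID, toy_raw$class, sep = "   ")
toy_clean <- na.omit(toy_raw)
toy_clean$class <- factor(toy_clean$class, levels = class_order)
toy_clean <- toy_clean[order(toy_clean$class), ]   # order() keeps ties in their original order

# Use the label as row name and keep only the protein columns
rownames(toy_clean) <- toy_clean$rowNamesInfo
drop_cols <- c("MouseID", "Genotype", "Treatment", "Behavior", "class", "rowNamesInfo")
toy_data <- toy_clean[, !(names(toy_clean) %in% drop_cols)]

print(toy_data)
print(table(toy_clean$class))   # rows per class, in class_order
```

The per-class counts from `table()` play the same role as the `nrow()` calls on the individual subsets later in the notebook.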

## Initializing the Heatmap
The content of the heatmap will directly correspond to the gene expression data. However, before starting the visualization process, the data should be clustered in order to produce the necessary hierarchical structure.

The parameters to be considered are the following:
1. The distance metric to be applied to the vectors. Choices are:  
  - "euclidean" *(default)*
  - "maximum"
  - "manhattan"
  - "canberra"
  - "binary"
  - "minkowski"
2. The clustering method for joining similar vectors together. Choices are:  
  - "complete" *(default)*
  - "single"
  - "ward.D" and "ward.D2"
  - "average", which is UPGMA 
  - "mcquitty", which is WPGMA
  - "median", which is WPGMC
  - "centroid", which is UPGMC

The choice of these parameters matters: a well-chosen combination (based on your experience in the first part of the exercise, the `ClusteringMethods` notebook) will produce the most meaningful results.

Moreover, clustering can be applied to both axes; data can be clustered both at the protein level (i.e. columns) and at the sample level (i.e. rows). In the first case (columns), our instances are vectors corresponding to the expression level of one protein across all samples. In the second case (rows), our instances are vectors corresponding to the expression levels of all proteins in one sample.
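
To make the row/column distinction concrete, here is a minimal sketch on a small random matrix. The matrix `toy` and its sample and protein names are hypothetical, and the metric and linkage choices are arbitrary picks from the lists above. `dist()` always compares the rows of its input, so clustering the columns only requires transposing the matrix first.

```r
set.seed(1)

# Hypothetical 5 samples x 3 proteins matrix of made-up expression values
toy <- matrix(runif(5 * 3), nrow = 5, ncol = 3,
              dimnames = list(paste0("sample", 1:5),
                              c("protA", "protB", "protC")))

# Rows as instances: one vector of 3 protein levels per sample
row_d  <- dist(toy, method = "euclidean")
row_hc <- hclust(row_d, method = "average")

# Columns as instances: transpose so each protein becomes a row of 5 sample values
col_d  <- dist(t(toy), method = "manhattan")
col_hc <- hclust(col_d, method = "single")

# A dist object stores one value per pair of instances
cat("pairs of samples: ", length(row_d), "\n")
cat("pairs of proteins:", length(col_d), "\n")

# Leaf order of each dendrogram
print(row_hc$labels[row_hc$order])
print(col_hc$labels[col_hc$order])
```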

In [2]:
# Clustering of rows (i.e. tissues/samples)
row_distance = dist(as.matrix(MouseData)[c(1:nrow(MouseData)),], method = "euclidean");
row_cluster = hclust(row_distance, method = "complete");

# Clustering of columns (i.e. protein expressions)
col_distance = dist(t(as.matrix(MouseData)[c(1:nrow(MouseData)),]), method = "euclidean");
col_cluster = hclust(col_distance, method = "complete");
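
One common heuristic for comparing linkage methods (not something this notebook prescribes) is the cophenetic correlation: the correlation between the original pairwise distances and the distances implied by the dendrogram. The sketch below computes it for a few methods on a random matrix (`toy`, hypothetical data); in the notebook, `row_distance` or `col_distance` could take the place of `d`. A higher value only means the tree preserves the distances more faithfully, not that the clusters are biologically meaningful.

```r
set.seed(42)

# Hypothetical data: 20 instances described by 4 made-up features
toy <- matrix(rnorm(20 * 4), nrow = 20)
d   <- dist(toy, method = "euclidean")

# Candidate linkage methods, all accepted by hclust()
linkages <- c("complete", "single", "average", "ward.D2")

# Cophenetic correlation for each linkage: cor(original distances, tree distances)
coph <- sapply(linkages, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})

print(round(sort(coph, decreasing = TRUE), 3))
```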

Starting the visualization process, we need to first define the coloring range for the expression. In this case, we are going to use a green - yellow - orange - red scale.

In [3]:
# Define the color range for the actual expression levels
my_palette <- colorRampPalette(c("green", "yellow", "orange", "red"))(n = 399);

The next step is to identify the thresholds at which the color changes, i.e. the expression levels that count as low, medium and high. This choice is somewhat subjective, so the thresholds should be adjusted as needed. The current levels are as follows:
1.   **0 - 0.49**  : Green hue, low expression
2. **0.5 - 0.79**  : Yellow hue, mid low expression
3. **0.8 - 0.99**  : Orange hue, mid high expression
4.   **1 - 10**    : Red hue, high expression

In [4]:
# Define the color breaks that would better represent the different values
col_breaks = c(seq(0  ,  0.49,   length=100),    # for green
               seq(0.5,  0.79,   length=100),    # for yellow
               seq(0.8,  0.99,   length=100),    # for orange
               seq(1  , 10,      length=100));   # for red
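
The palette and the breaks have to agree with each other: the break vector needs exactly one more element than the palette (400 breaks for 399 colors here) and must be strictly increasing. The sketch below checks both conditions and uses `findInterval()` to show roughly which palette entry a few example values (hypothetical, chosen to fall in each range) would land in. This only approximates the lookup done by the plotting code, and the handling of values exactly on a break may differ.

```r
# Same palette and breaks as in the notebook cells above
my_palette <- colorRampPalette(c("green", "yellow", "orange", "red"))(n = 399)
col_breaks <- c(seq(0,   0.49, length.out = 100),
                seq(0.5, 0.79, length.out = 100),
                seq(0.8, 0.99, length.out = 100),
                seq(1,   10,   length.out = 100))

# One more break than colors, and no repeated or decreasing breaks
stopifnot(length(col_breaks) == length(my_palette) + 1)
stopifnot(!is.unsorted(col_breaks, strictly = TRUE))

# Hypothetical expression values, one per intended range
values <- c(0.2, 0.6, 0.9, 3.5)

# Index of the interval [break_i, break_i+1) that contains each value
bin <- findInterval(values, col_breaks, rightmost.closed = TRUE)

print(data.frame(value = values, bin = bin, color = my_palette[bin]))
```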

Other than the coloring scheme for the expression level, it would be useful to include a coloring scheme for each of the 8 classes.

In [5]:
# Define a color for each of the 8 different cases (the ninth entry in the list is unused)
colorList = c("gray", "blue", "lightsalmon", "orchid", "skyblue", "black", "green", "chartreuse4", "burlywood");
rowCategories <- c(rep(colorList[1], nrow(MouseData_cCSs)),   # c-CS-s
                   rep(colorList[2], nrow(MouseData_cCSm)),   # c-CS-m
                   rep(colorList[3], nrow(MouseData_cSCs)),   # c-SC-s
                   rep(colorList[4], nrow(MouseData_cSCm)),   # c-SC-m
                   rep(colorList[5], nrow(MouseData_tCSs)),   # t-CS-s
                   rep(colorList[6], nrow(MouseData_tCSm)),   # t-CS-m
                   rep(colorList[7], nrow(MouseData_tSCs)),   # t-SC-s
                   rep(colorList[8], nrow(MouseData_tSCm))    # t-SC-m
);

classNames <- c("c-CS-s", "c-CS-m", "c-SC-s", "c-SC-m", "t-CS-s", "t-CS-m", "t-SC-s", "t-SC-m");
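
The `rep()` construction above relies on the rows being stored class by class, in the same order as the subsets. A possible alternative, sketched here with a hypothetical vector of per-row class labels (`toy_class`), is to look the colors up by class name. That works for any row order, and in this notebook the labels could plausibly come from `as.character(MouseDataClean$class)`.

```r
classNames <- c("c-CS-s", "c-CS-m", "c-SC-s", "c-SC-m",
                "t-CS-s", "t-CS-m", "t-SC-s", "t-SC-m")

# Named vector: class name -> color (the first eight colors of colorList)
classColors <- setNames(c("gray", "blue", "lightsalmon", "orchid",
                          "skyblue", "black", "green", "chartreuse4"),
                        classNames)

# Hypothetical per-row class labels, deliberately not sorted by class
toy_class <- c("t-SC-m", "c-CS-s", "c-CS-s", "t-CS-m")

# Look up one color per row by name
rowCategories_toy <- unname(classColors[toy_class])
print(rowCategories_toy)
```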

The next step will be to create the parameters for the png file to be produced.

In [6]:
# Creates the png image file
png("MouseProteinExpressionHeatmap.png",   # create PNG for the heat map        
    width     = 8000,                      # set the width of the image in pixels
    height    = 6000,                      # set the height of the image in pixels
    res       = 300,                       # set the resolution to 300 pixels per inch
    pointsize = 5);                        # set the size of any letters/text    

Finally, we create the heatmap, providing as parameters all the information gathered and defined so far.

In [None]:
# Constructs the heatmap with the selected parameters
heatmap.2(as.matrix(MouseData)[c(1:nrow(MouseData)),],                        # data for heatmap
          #cellnote = as.matrix(MouseDataClean)[c(1:nrow(MouseDataClean)),],  # uncomment to show cell values
          main = "Mouse Protein Expression",                         # heat map title
          notecol="black",                                           # change font color of cell labels to black
          density.info="histogram",                                  # shows a histogram inside the color key (options: "histogram", "density", "none")
          trace="none",                                              # turns off trace lines inside the heat map (row, column, both, none)
          tracecol="cyan",                                           # character string giving the color for "trace" line
          margins =c(20,20),                                         # widens margins around plot
          col=my_palette,                                            # use the color palette defined earlier
          dendrogram="both",                                         # character string indicating whether to draw 'none', 'row', 'column' or 'both' dendrograms
          Rowv = as.dendrogram(row_cluster),                         # use the row clustering computed earlier
          Colv = as.dendrogram(col_cluster),                         # use the column clustering computed earlier
          RowSideColors = rowCategories,                             # grouping row-variables into different categories
          breaks=col_breaks                                          # enable color transition at specified limits
);

# Include legend in figure
legend("topright",                                         # location of the legend on the heatmap plot
       legend = classNames,                                # category labels
       col = colorList[1:length(unique(rowCategories))],   # color key
       lty= 1,                                             # line style
       lwd = 10                                            # line width
);

# Close the file
dev.off()

The output of the previous command is the following figure:

![image](MouseProteinExpressionHeatmap.png)

Neither the clustering scheme nor the coloring scheme provides much information here. The samples (as indicated by the colored column to the left of the heatmap) are scattered rather than grouped by class. The proteins are grouped better but still behave unevenly, as seen in the highly expressed columns at the left and right ends of the heatmap. A better choice of distance metric and clustering method could potentially lead to more meaningful results.