# GCB535: Batch Effects

## Instructions

In this *adventure*, you are going to analyze gene expression data from
<a href="https://www.pnas.org/content/111/48/17224" target="_blank">Lin S et al. PNAS 2014</a>, a study that aimed to compare transcription signatures across tissues between mouse and human. We will explore how batch variability can confound this analysis and make it very difficult to compare other variables.

First, let's load the data and the libraries that we will use.

**Execute the code below.**

In [None]:
library(tidyverse)
library(pheatmap)
library(sva)

norm_data <- as_tibble(read.csv('norm_data.csv'))
sample_data <- as_tibble(read.csv('sample_data.csv'))

Next, we will examine the similarity between samples using both clustering and dimensionality reduction.

To examine the relationship between the samples we will use a heatmap to plot the distance between samples. This includes two steps:
* Calculating the similarity between samples. This can be done using different metrics; here we will use the Pearson correlation.
* Using hierarchical clustering to arrange the samples according to their similarity.
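
The two steps above can be sketched by hand on a small made-up matrix. This is only an illustration in base R: the toy values and the names `toy`, `sample_cor`, `sample_dist` and `hc` are assumptions, and `pheatmap` does its own clustering internally with its own defaults, so this does not reproduce its exact output.

```r
# Toy expression matrix: 5 genes (rows) x 4 samples (columns); values are made up.
set.seed(1)
toy <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# Step 1: Pearson correlation between every pair of columns (samples).
sample_cor <- cor(toy, method = "pearson")

# Correlation is a similarity, so convert it to a dissimilarity before clustering.
sample_dist <- as.dist(1 - sample_cor)

# Step 2: hierarchical clustering arranges the samples by similarity.
hc <- hclust(sample_dist, method = "complete")
hc$order  # column indices in the order the dendrogram places them
```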

**Q1.** Run the code below to plot the heatmap. Then add code to look only at the output of the function `cor`. Why is the diagonal of this matrix always 1?

In [None]:
pheatmap(cor(norm_data))

**Q2.** Would you say that the samples are clustered by tissue or by organism?

Next, we will use PCA and explore different labels, including information about the sequencing batch.

Use the code below to calculate the principal components of our data. Note that we use the function `t` to transpose the data: `prcomp` treats rows as observations, and we want the samples (the columns of `norm_data`) to be the observations.
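
To see what the transpose does, here is a minimal sketch on a made-up matrix. The names `toy`, `toy_pca` and `var_explained` are hypothetical and the values are random; only `t`, `prcomp` and its `x` and `sdev` components come from base R.

```r
# Toy matrix: 6 genes (rows) x 3 samples (columns); values are made up.
set.seed(2)
toy <- matrix(rnorm(18), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:3)))

dim(toy)     # genes x samples
dim(t(toy))  # samples x genes, so each sample is one observation

toy_pca <- prcomp(t(toy), center = TRUE, scale. = TRUE)

# The score matrix has one row per sample and one column per component.
rownames(toy_pca$x)

# Proportion of variance per component, derived from the standard deviations.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
var_explained
```

The same proportions appear in the "Proportion of Variance" row printed by `summary()` on a `prcomp` object.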

In [None]:
pca_no_batch_correction <- prcomp(t(norm_data), scale. = TRUE, center = TRUE)
summary(pca_no_batch_correction)

**Q3.** How much variance is captured by the first two principal components?

We have provided the code below to construct a tibble for you containing the first two principal components.

**Q4.** Add code to join this tibble with `sample_data` to add additional information about each sample.

In [None]:
# Use the PCA object created above (pca_no_batch_correction)
tb <- tibble(setname = rownames(pca_no_batch_correction$x),
             PC1 = pca_no_batch_correction$x[, 'PC1'],
             PC2 = pca_no_batch_correction$x[, 'PC2'])

**Q5.** Now use `ggplot` to make a scatter plot of the two principal components, using `col` for `tissue` and `shape` for `species`. Then make a second plot in which you use `seqBatch` instead of `tissue`, to examine the distance between the different sequencing batches.

We will now try to computationally remove the effects associated with sequencing runs using a function called <a href="https://rdrr.io/bioc/sva/man/ComBat.html" target="_blank">ComBat</a> from the `sva` package. This method was developed for microarray data but is also widely used for RNA-seq.
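
To build intuition for what "removing" a batch effect means, here is a deliberately simplified sketch that only re-centers each gene within each batch. This is not the ComBat algorithm: ComBat also adjusts the scale of each batch and uses empirical Bayes estimates that borrow information across genes. The toy data, the batch offset and all variable names are assumptions made for illustration.

```r
# Toy data: 4 genes x 6 samples; samples 4-6 get a made-up batch offset of +5.
set.seed(3)
toy <- matrix(rnorm(24), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6)))
batch <- factor(c("A", "A", "A", "B", "B", "B"))
toy[, batch == "B"] <- toy[, batch == "B"] + 5

# Simplified location adjustment: for each gene, subtract the batch mean
# and add back the overall mean of that gene.
centered <- toy
for (b in levels(batch)) {
  in_batch <- batch == b
  centered[, in_batch] <- toy[, in_batch] - rowMeans(toy[, in_batch]) + rowMeans(toy)
}

# After centering, each gene has the same mean in both batches.
batch_gap <- rowMeans(centered[, batch == "A"]) - rowMeans(centered[, batch == "B"])
batch_gap
```

Note that if batch is confounded with a biological variable, this kind of adjustment removes the biological difference along with the technical one.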

**Q6.** Execute the code below to run the ComBat algorithm on the data.

In [None]:
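# Assumes the rows of sample_data are in the same order as the columns of norm_data
combat <- ComBat(dat = as.matrix(norm_data), batch = sample_data$seqBatch, mod = NULL)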
head(combat)

**Q7.** Re-plot the heatmap and PCA plots using the batch-corrected data. What is your conclusion?

    (1) Samples are more similar by tissue than by organism
    (2) Samples are more similar by organism than by tissue
    (3) Cannot be determined from the data