# GCB535: Batch Effects

## Instructions

In this *adventure*, you are going to analyze gene expression data from
<a href="https://www.pnas.org/content/111/48/17224" target="_blank">Lin S et al. PNAS 2014</a>, a study that compared transcription signatures across tissues between mouse and human. We will explore how batch variability can confound this analysis and make it very difficult to compare other variables.

First, let's load the data and the libraries that we will use.

We first need to make sure that our System is set to `Ubuntu 20.04 (Experimental)` in CoCalc. To do this, in CoCalc:

- Click on the Settings Tab (wrench icon) 
- Look in the Project Control Panel (left side) 
- In "Software Environment" select "Ubuntu 20.04 (Experimental)" from the drop down menu 
- Click the "Save and Restart" Button that pops up

**Execute the following code below.**

In [None]:
library(tidyverse)
library(pheatmap)
library(sva)

norm_data <- as_tibble(read.csv('norm_data.csv'))
sample_data <- as_tibble(read.csv('sample_data.csv'))

# Replace every "(", ")", and " " with "." so that the labels in
# sample_data match the column names of norm_data.
# gsub() replaces all occurrences; sub() would replace only the first.
sample_data$setname <- gsub("[() ]", ".", sample_data$setname)
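
The renaming matters because `read.csv` by default passes column names through `make.names`, which turns characters such as spaces and parentheses into `.`. A minimal sketch with a made-up sample label (the name `Mouse (brain) 1` is hypothetical and not taken from the data) illustrates the idea, followed by a check you could run on the real objects:

```r
# Hypothetical label, used only for illustration
raw_label <- "Mouse (brain) 1"

# What read.csv() does to column names by default (check.names = TRUE)
as_column_name <- make.names(raw_label)

# The same substitution applied by hand, as done for sample_data$setname
as_setname <- gsub("[() ]", ".", raw_label)

identical(as_column_name, as_setname)

# On the real data, a check along these lines should return TRUE
# (assuming every column of norm_data is a sample):
# all(sample_data$setname %in% colnames(norm_data))
```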

We will examine the similarity between samples using both clustering and dimensionality reduction.

To examine the relationship between the samples, we will use a heatmap to plot how similar the samples are to each other. This includes two steps:
* Calculating the similarity between samples, which can be done using different metrics. Here, we will use Pearson's correlation (a higher correlation means a smaller distance between two samples).
* Using hierarchical clustering to arrange the samples according to their similarity.
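
As a rough sketch of these two steps on simulated data: the matrix `toy` and its sample names are made up, and `1 - correlation` is just one common way to turn a similarity into a distance. It is not what `pheatmap` does internally, since by default `pheatmap` clusters on the Euclidean distance between the rows and columns of the matrix it is given.

```r
set.seed(1)
# 100 simulated "genes" (rows) by 4 simulated "samples" (columns)
toy <- matrix(rnorm(400), nrow = 100,
              dimnames = list(NULL, c("s1", "s2", "s3", "s4")))
# Make s2 resemble s1, and s4 resemble s3
toy[, "s2"] <- toy[, "s1"] + rnorm(100, sd = 0.3)
toy[, "s4"] <- toy[, "s3"] + rnorm(100, sd = 0.3)

# Step 1: pairwise Pearson correlation between columns (samples)
toy_cor <- cor(toy, method = "pearson")

# Step 2: convert similarity to distance and cluster hierarchically
toy_clust <- hclust(as.dist(1 - toy_cor), method = "average")

# Cut the tree into two groups to see which samples end up together
groups <- cutree(toy_clust, k = 2)
groups
```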

**Q1.** Run the code below to plot the heatmap. Then add code to look only at the output of the function `cor`. Why is the diagonal of this matrix always 1?

In [None]:
pheatmap(cor(norm_data))

Your answer:

**Q2.** Based on these data, would you say that the samples clustered by tissue or by organism? Does this make intuitive sense? Why or why not?

Next, we will use PCA and explore different labels, including information about the sequencing batch.

Use the code below to estimate the principal components of our data. Note that we use the function `t()` to transpose the data. This is necessary because `prcomp` treats rows as observations, and we would like to compare the columns (samples).

In [None]:
# prcomp's scaling argument is named `scale.`; `scale=` only works via partial matching
pca_no_batch_correction <- prcomp(t(norm_data), center = TRUE, scale. = TRUE)
summary(pca_no_batch_correction)
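
The proportions reported by `summary()` can also be derived by hand from the standard deviations that `prcomp` returns. A minimal sketch on simulated data (the matrix `toy` is made up for illustration) shows both the effect of the transpose and this calculation:

```r
set.seed(2)
# 50 simulated genes (rows) by 6 simulated samples (columns)
toy <- matrix(rnorm(300), nrow = 50,
              dimnames = list(NULL, paste0("sample", 1:6)))

# Transpose so that samples become rows (observations)
toy_pca <- prcomp(t(toy), center = TRUE, scale. = TRUE)

# One row of scores per sample
dim(toy_pca$x)

# Proportion of variance explained by each component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Cumulative proportion, as in the last row of summary()
round(cumsum(var_explained), 3)
```

The same two lines applied to `pca_no_batch_correction$sdev` should reproduce the "Proportion of Variance" and "Cumulative Proportion" rows printed above.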

**Q3.** How much variance is explained by the first two principal components?

We have provided the code below to construct a tibble for you containing the first two principal components.

**Execute the code below.**

In [None]:
tb <- tibble(setname = rownames(pca_no_batch_correction$x),
             PC1 = (pca_no_batch_correction$x[,'PC1']),
             PC2 = (pca_no_batch_correction$x[,'PC2']))

**Q4.** Next, join `tb` with `sample_data` to add additional information about each sample.

**Provide and execute your code below.**

**Q5.** Now:

- Use `ggplot` to make a scatter plot of these two principal components
- Assign `col` for `tissue` and `shape` for `species`
- Then, create a second scatter plot where, in place of `tissue`, you use `seqBatch` to examine the distance between the different sequencing batches.

**Provide and execute your code below:**

We will now try to computationally estimate and remove the effects associated with sequencing runs using <a href="https://rdrr.io/bioc/sva/man/ComBat.html" target="_blank">ComBat</a>, a function in the `sva` package. ComBat was developed for microarray data but is also widely used on normalized RNA-seq data.
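
To build intuition for what removing a batch effect means, here is a deliberately simplified sketch on simulated data. It centers each gene within each batch, which is only a location adjustment and is not the ComBat algorithm (ComBat additionally adjusts the scale and uses empirical Bayes shrinkage across genes). All object names are made up for illustration.

```r
set.seed(3)
n_genes <- 20
batch <- rep(c("A", "B"), each = 3)  # 6 simulated samples in 2 batches

expr <- matrix(rnorm(n_genes * 6), nrow = n_genes)
# Simulate a batch effect: shift every gene in batch B by 2
expr[, batch == "B"] <- expr[, batch == "B"] + 2

# Subtract each gene's batch mean, then restore the gene's overall mean
adjusted <- expr
for (b in unique(batch)) {
  cols <- batch == b
  adjusted[, cols] <- expr[, cols] - rowMeans(expr[, cols]) + rowMeans(expr)
}

# Difference in average expression between batches, before and after
diff_before <- mean(expr[, batch == "B"]) - mean(expr[, batch == "A"])
diff_after <- mean(adjusted[, batch == "B"]) - mean(adjusted[, batch == "A"])
c(before = diff_before, after = diff_after)
```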

**Execute the code below to run the ComBat algorithm on the data.**

In [None]:
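# ComBat expects a numeric matrix (genes x samples); the entries of `batch`
# must be in the same order as the columns of the data
combat <- ComBat(dat = as.matrix(norm_data), batch = sample_data$seqBatch, mod = NULL)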
head(combat)

**Q6.** The output of `ComBat()` is a table of batch-corrected data, analogous to the `norm_data` object. Using your ComBat-adjusted expression data:

- (Re)construct the heat map
- Perform PCA analysis and report the summary
- Create a tibble called `tb_combat` which contains the first two PCs of this analysis
- Merge `tb_combat` with `sample_data`
- Use this to create new scatter plots of the data, again assigning `col` for `tissue` and `shape` for `species`

**Provide and execute your code below:**

**Q7.** Examine your plots carefully and compare them with the ones you generated in **Q1** and **Q5**. Based on these plots, which of the following statements, if any, do you now think is correct? Why?

    (A) Samples are more similar by tissue than by organism
    (B) Samples are more similar by organism than by tissue
    (C) Cannot be determined from the data