de_example.Rmd

---
title: "An example of RNA-seq differential expression analysis"
author: "Davide Risso"
date: "10/27/2017"
output: BiocStyle::html_document
bibliography: biblio.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# The data

We will use the data from [@peixoto2015data].
The gene-level read counts are available on [Github](https://github.com/drisso/peixoto2015_tutorial).

Briefly, the dataset consists of 5 biological replicates per three conditions: controls, fear-condition (memory formation), and memory retrieval.

We also have access to a list of negative control genes, i.e., genes that are not influenced by the biological effect of interest, and to a list of positive control genes, i.e., genes known to be differentially expressed with respect to the biological effect of interest.

We will use these data to show how to perform a typical differential expression 
workflow, using edgeR and DESeq2.

# Packages needed

```{r}
library(RColorBrewer)
library(EDASeq)
library(DESeq2)
library(RUVSeq)
library(edgeR)
```

# Exploratory Data Analysis

After reading the counts and the positive and negative control genes in R, we filter out the non expressed genes.

```{r readFC}
data_dir <- "~/git/peixoto2015_tutorial/Peixoto_Input_for_Additional_file_1/"
fc <- read.table(paste0(data_dir, "Peixoto_CC_FC_RT.txt"), row.names=1, header=TRUE)
negControls <- read.table(paste0(data_dir, "Peixoto_NegativeControls.txt"), sep='\t', header=TRUE, as.is=TRUE)
positive <- read.table(paste0(data_dir, "Peixoto_positive_controls.txt"), as.is=TRUE, sep='\t', header=TRUE)

x <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x) <- colnames(fc)

filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter,]

negCon <- intersect(negControls[,2], rownames(filtered))
FCup <- intersect(positive[positive[,3]=="UP",1], rownames(filtered))
FCdown <- intersect(positive[positive[,3]=="DOWN",1], rownames(filtered))
RTup <- intersect(positive[positive[,4]=="UP",1], rownames(filtered))
RTdown <- intersect(positive[positive[,4]=="DOWN",1], rownames(filtered))

colors <- brewer.pal(9, "Set1")
colLib <- colors[x]
``` 

The *betweenLaneNormalization* function of *EDASeq* implements UQ normalization. We can then use the *plotRLE* and *plotPCA* functions of *EDASeq* to explore the normalized data.

```{r uq}
uq <- betweenLaneNormalization(filtered, which="upper")

plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
``` 

From these plots we notice that the data points denoted with "3" and "8" are somewhat different from the rest of the samples.

For now, let's filter out these samples.

```{r filter}
idx <- grep("[3,8]", colnames(fc), invert = TRUE)
filter <- apply(fc[, idx], 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter, idx]

x <- as.factor(rep(c("CC", "FC", "RT"), each=3))
names(x) <- colnames(filtered)
colLib <- colors[x]
```

```{r uq2}
uq <- betweenLaneNormalization(filtered, which="upper")

plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
``` 

# Differential expression with DESeq2

First, we need to create a `DESeqDataSet` object to store the count matrix and the sample-level information.

```{r deseq2}
dds <- DESeqDataSetFromMatrix(countData = filtered,
                              colData = data.frame(Condition = x, 
                                                   Expt = substring(colnames(filtered), 3)),
                              design = ~ Condition)
dds
colData(dds)
```

To test for the differential expression, we can simply call the wrapper function

```{r deseq2de}
dds <- DESeq(dds)
```

This function runs the following in this order.

```{r, eval=FALSE}
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
```

To understand how the dispersion is estimated by `DESeq2`, we can plot it against the mean.

```{r disp}
plotDispEsts(dds)
```

The genewise estimates of the dispersion are shrunk towards the fitted line. Some gene-wise estimates are flagged as outliers and not shrunk towards the fitted value.

Let's explore the results.

```{r deseq2res}
res <- results(dds)
res
```

Calling `results` without any arguments will extract the estimated log2 fold-changes and p-values for the last variable in the design formula. If there are more than 2 levels for this variable, `results` will extract the results table for a comparison of the last level over the first level.

A nice feature of the `DESeq2` package is that we can easily summarize these results.

```{r deseq2sum}
summary(res)
```

To perform a different comparison, say FC vs. CC, we need to use the `contrast` argument.

```{r deseq2contrast}
res2 <- results(dds, contrast=c("Condition", "FC", "CC"))
summary(res2)
```

To explore the results, we can use look at the MA-plot and the histogram of the p-values.

```{r}
plotMA(res2, ylim=c(-5, 5))
hist(res2$pvalue)
```

This is a good distribution of the p-values. We expect the majority of the genes to be non-differentially expressed, hence leading to a uniform distribution of the p-value and a small portion of genes to be differentially expressed, leading to p-values close to zero.

Another very useful plot is the volcano plot. As far as I know, `DESeq2` does not have an automatic function for that, but it is easy to do it "by hand."

```{r}
plot(res2$log2FoldChange, -log10(res2$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2$padj <= 0.1)
points(res2[de, "log2FoldChange"],
       -log10(res2[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2) %in% c(FCup, FCdown))
points(res2[pos, "log2FoldChange"],
       -log10(res2[pos, "pvalue"]),
       pch=1, col=2)
```

## Using all the samples

What happens if we repeat the analysis with all the original samples?

```{r deseq2all}
x_orig <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x_orig) <- colnames(fc)

filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered_orig <- as.matrix(fc)[filter,]

dds_all <- DESeqDataSetFromMatrix(countData = filtered_orig,
                              colData = data.frame(Condition = x_orig, 
                                                   Expt = substring(colnames(filtered_orig), 3)),
                              design = ~ Condition)
dds_all <- DESeq(dds_all)

res2_all <- results(dds_all, contrast=c("Condition", "FC", "CC"))
summary(res2_all)
```

```{r}
hist(res2_all$pvalue)
```

This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.

```{r}
plot(res2_all$log2FoldChange, -log10(res2_all$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_all$padj <= 0.1)
points(res2_all[de, "log2FoldChange"],
       -log10(res2_all[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2_all) %in% c(FCup, FCdown))
points(res2_all[pos, "log2FoldChange"],
       -log10(res2_all[pos, "pvalue"]),
       pch=1, col=2)
```

## Accounting for unwanted variation

Sometimes filtering out "bad" samples is not feasible or undesirable. In such cases, we can try and capture the unwanted 
variation and include it in the model. This is the approach used in the `RUVSeq` package.

```{r ruvseq}
uq <- betweenLaneNormalization(filtered_orig, which="upper")

ruv <- RUVg(uq, cIdx = negCon, k = 5)
plotRLE(ruv$normalizedCounts, col = colors[x_orig], outline=FALSE)
plotPCA(ruv$normalizedCounts, col=colors[x_orig], cex=1, cex.axis=1, cex.lab=1)
```

Now that we checked that these factors account for the unwanted variation, we can include them in the model.

```{r deseq2ruv}
dds_ruv <- DESeqDataSetFromMatrix(countData = filtered_orig,
                              colData = data.frame(Condition = x_orig, 
                                                   Expt = substring(colnames(filtered_orig), 3),
                                                   ruv$W),
                              design = ~ Condition + W_1 + W_2 + W_3 + W_4 + W_5)
dds_ruv <- DESeq(dds_ruv)

res2_ruv <- results(dds_ruv, contrast=c("Condition", "FC", "CC"))
summary(res2_ruv)
```

```{r}
hist(res2_ruv$pvalue)
```

This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.

```{r}
plot(res2_ruv$log2FoldChange, -log10(res2_ruv$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_ruv$padj <= 0.1)
points(res2_ruv[de, "log2FoldChange"],
       -log10(res2_ruv[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2_ruv) %in% c(FCup, FCdown))
points(res2_ruv[pos, "log2FoldChange"],
       -log10(res2_ruv[pos, "pvalue"]),
       pch=1, col=2)
```

# Differential expression with edgeR

Similarly to `DESeq2`, `edgeR` works on a dedicated object, called `DGEList`, that we need to create from our matrix.

```{r edger}
y <- DGEList(counts = filtered_orig, group = x_orig)
```

The steps of a typical analysis in `edgeR` are very similar to those of `DESeq2`: calculate the normalization factors, estimate the dispersion parameters, and test for differential expression.

```{r edger2}
y <- calcNormFactors(y)
design <- model.matrix(~x_orig + ruv$W)
y <- estimateDisp(y, design)
```

`estimateDisp` in this case consist of three steps: it first estimates a common dispersion parameter, then a "trend" capturing the relation between dispersion and mean and finally a "tagwise" dispersion parameter shrinked towards the common trend.

In order to understand how this work, we can plot the mean-variance plot.

```{r meanvar}
meanVarPlot <- plotMeanVar(y, 
                           show.raw.vars=TRUE, 
                           show.tagwise.vars=TRUE,
                           show.binned.common.disp.vars=FALSE,
                           show.ave.raw.vars=FALSE, 
                           NBline = TRUE , nbins = 100,
                           pch = 16, 
                           xlab ="Mean Expression (Log10 Scale)", 
                           ylab = "Variance (Log10 Scale)" , 
                           main = "Mean-Variance Plot" )
```

To test for differential expression, we first fit a generalized linear model (GLM) and then test using the likelihood ratio test.

```{r edger3}
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table

hist(top$PValue)

plot(top$logFC, -log10(top$PValue),
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")

de <- which(top$FDR <= 0.1)
points(top[de, "logFC"],
       -log10(top[de, "PValue"]),
       pch=20, col=4)
pos <- which(rownames(top) %in% c(FCup, FCdown))
points(top[pos, "logFC"],
       -log10(top[pos, "PValue"]),
       pch=1, col=2)
```

We can test contrasts in `edgeR` in the following way. Imagine we want to test the difference between FC and RT.

```{r contrasts_edger}
colnames(design) <- c("Intercept", "FC", "RT", 
                      "W1", "W2", "W3", "W4", "W5")
cont <- makeContrasts(FC - RT, levels = colnames(design))
cont
lrt <- glmLRT(fit, contrast = cont)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)
```