title | author | date | output | bibliography |
---|---|---|---|---|
An example of RNA-seq differential expression analysis |
Davide Risso |
10/27/2017 |
BiocStyle::html_document |
biblio.bib |
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
We will use the data from [@peixoto2015data]. The gene-level read counts are available on Github.
Briefly, the dataset consists of 5 biological replicates per three conditions: controls, fear-condition (memory formation), and memory retrieval.
We also have access to a list of negative control genes, i.e., genes that are not influenced by the biological effect of interest, and to a list of positive control genes, i.e., genes known to be differentially expressed with respect to the biological effect of interest.
We will use these data to show how to perform a typical differential expression workflow, using edgeR and DESeq2.
library(RColorBrewer)
library(EDASeq)
library(DESeq2)
library(RUVSeq)
library(edgeR)
After reading the counts and the positive and negative control genes in R, we filter out the non expressed genes.
data_dir <- "~/git/peixoto2015_tutorial/Peixoto_Input_for_Additional_file_1/"
fc <- read.table(paste0(data_dir, "Peixoto_CC_FC_RT.txt"), row.names=1, header=TRUE)
negControls <- read.table(paste0(data_dir, "Peixoto_NegativeControls.txt"), sep='\t', header=TRUE, as.is=TRUE)
positive <- read.table(paste0(data_dir, "Peixoto_positive_controls.txt"), as.is=TRUE, sep='\t', header=TRUE)
x <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x) <- colnames(fc)
filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter,]
negCon <- intersect(negControls[,2], rownames(filtered))
FCup <- intersect(positive[positive[,3]=="UP",1], rownames(filtered))
FCdown <- intersect(positive[positive[,3]=="DOWN",1], rownames(filtered))
RTup <- intersect(positive[positive[,4]=="UP",1], rownames(filtered))
RTdown <- intersect(positive[positive[,4]=="DOWN",1], rownames(filtered))
colors <- brewer.pal(9, "Set1")
colLib <- colors[x]
The betweenLaneNormalization function of EDASeq implements UQ normalization. We can then use the plotRLE and plotPCA functions of EDASeq to explore the normalized data.
uq <- betweenLaneNormalization(filtered, which="upper")
plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
From these plots we notice that the data points denoted with "3" and "8" are somewhat different from the rest of the samples.
For now, let's filter out these samples.
idx <- grep("[3,8]", colnames(fc), invert = TRUE)
filter <- apply(fc[, idx], 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter, idx]
x <- as.factor(rep(c("CC", "FC", "RT"), each=3))
names(x) <- colnames(filtered)
colLib <- colors[x]
uq <- betweenLaneNormalization(filtered, which="upper")
plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
First, we need to create a DESeqDataSet
object to store the count matrix and the sample-level information.
dds <- DESeqDataSetFromMatrix(countData = filtered,
colData = data.frame(Condition = x,
Expt = substring(colnames(filtered), 3)),
design = ~ Condition)
dds
colData(dds)
To test for the differential expression, we can simply call the wrapper function
dds <- DESeq(dds)
This function runs the following in this order.
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
To understand how the dispersion is estimated by DESeq2
, we can plot it against the mean.
plotDispEsts(dds)
The genewise estimates of the dispersion are shrunk towards the fitted line. Some gene-wise estimates are flagged as outliers and not shrunk towards the fitted value.
Let's explore the results.
res <- results(dds)
res
Calling results
without any arguments will extract the estimated log2 fold-changes and p-values for the last variable in the design formula. If there are more than 2 levels for this variable, results
will extract the results table for a comparison of the last level over the first level.
A nice feature of the DESeq2
package is that we can easily summarize these results.
summary(res)
To perform a different comparison, say FC vs. CC, we need to use the contrast
argument.
res2 <- results(dds, contrast=c("Condition", "FC", "CC"))
summary(res2)
To explore the results, we can use look at the MA-plot and the histogram of the p-values.
plotMA(res2, ylim=c(-5, 5))
hist(res2$pvalue)
This is a good distribution of the p-values. We expect the majority of the genes to be non-differentially expressed, hence leading to a uniform distribution of the p-value and a small portion of genes to be differentially expressed, leading to p-values close to zero.
Another very useful plot is the volcano plot. As far as I know, DESeq2
does not have an automatic function for that, but it is easy to do it "by hand."
plot(res2$log2FoldChange, -log10(res2$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2$padj <= 0.1)
points(res2[de, "log2FoldChange"],
-log10(res2[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2) %in% c(FCup, FCdown))
points(res2[pos, "log2FoldChange"],
-log10(res2[pos, "pvalue"]),
pch=1, col=2)
What happens if we repeat the analysis with all the original samples?
x_orig <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x_orig) <- colnames(fc)
filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered_orig <- as.matrix(fc)[filter,]
dds_all <- DESeqDataSetFromMatrix(countData = filtered_orig,
colData = data.frame(Condition = x_orig,
Expt = substring(colnames(filtered_orig), 3)),
design = ~ Condition)
dds_all <- DESeq(dds_all)
res2_all <- results(dds_all, contrast=c("Condition", "FC", "CC"))
summary(res2_all)
hist(res2_all$pvalue)
This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.
plot(res2_all$log2FoldChange, -log10(res2_all$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_all$padj <= 0.1)
points(res2_all[de, "log2FoldChange"],
-log10(res2_all[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2_all) %in% c(FCup, FCdown))
points(res2_all[pos, "log2FoldChange"],
-log10(res2_all[pos, "pvalue"]),
pch=1, col=2)
Sometimes filtering out "bad" samples is not feasible or undesirable. In such cases, we can try and capture the unwanted
variation and include it in the model. This is the approach used in the RUVSeq
package.
uq <- betweenLaneNormalization(filtered_orig, which="upper")
ruv <- RUVg(uq, cIdx = negCon, k = 5)
plotRLE(ruv$normalizedCounts, col = colors[x_orig], outline=FALSE)
plotPCA(ruv$normalizedCounts, col=colors[x_orig], cex=1, cex.axis=1, cex.lab=1)
Now that we checked that these factors account for the unwanted variation, we can include them in the model.
dds_ruv <- DESeqDataSetFromMatrix(countData = filtered_orig,
colData = data.frame(Condition = x_orig,
Expt = substring(colnames(filtered_orig), 3),
ruv$W),
design = ~ Condition + W_1 + W_2 + W_3 + W_4 + W_5)
dds_ruv <- DESeq(dds_ruv)
res2_ruv <- results(dds_ruv, contrast=c("Condition", "FC", "CC"))
summary(res2_ruv)
hist(res2_ruv$pvalue)
This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.
plot(res2_ruv$log2FoldChange, -log10(res2_ruv$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_ruv$padj <= 0.1)
points(res2_ruv[de, "log2FoldChange"],
-log10(res2_ruv[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2_ruv) %in% c(FCup, FCdown))
points(res2_ruv[pos, "log2FoldChange"],
-log10(res2_ruv[pos, "pvalue"]),
pch=1, col=2)
Similarly to DESeq2
, edgeR
works on a dedicated object, called DGEList
, that we need to create from our matrix.
y <- DGEList(counts = filtered_orig, group = x_orig)
The steps of a typical analysis in edgeR
are very similar to those of DESeq2
: calculate the normalization factors, estimate the dispersion parameters, and test for differential expression.
y <- calcNormFactors(y)
design <- model.matrix(~x_orig + ruv$W)
y <- estimateDisp(y, design)
estimateDisp
in this case consist of three steps: it first estimates a common dispersion parameter, then a "trend" capturing the relation between dispersion and mean and finally a "tagwise" dispersion parameter shrinked towards the common trend.
In order to understand how this work, we can plot the mean-variance plot.
meanVarPlot <- plotMeanVar(y,
show.raw.vars=TRUE,
show.tagwise.vars=TRUE,
show.binned.common.disp.vars=FALSE,
show.ave.raw.vars=FALSE,
NBline = TRUE , nbins = 100,
pch = 16,
xlab ="Mean Expression (Log10 Scale)",
ylab = "Variance (Log10 Scale)" ,
main = "Mean-Variance Plot" )
To test for differential expression, we first fit a generalized linear model (GLM) and then test using the likelihood ratio test.
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)
plot(top$logFC, -log10(top$PValue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(top$FDR <= 0.1)
points(top[de, "logFC"],
-log10(top[de, "PValue"]),
pch=20, col=4)
pos <- which(rownames(top) %in% c(FCup, FCdown))
points(top[pos, "logFC"],
-log10(top[pos, "PValue"]),
pch=1, col=2)
We can test contrasts in edgeR
in the following way. Imagine we want to test the difference between FC and RT.
colnames(design) <- c("Intercept", "FC", "RT",
"W1", "W2", "W3", "W4", "W5")
cont <- makeContrasts(FC - RT, levels = colnames(design))
cont
lrt <- glmLRT(fit, contrast = cont)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)