Skip to content

drisso/rnaseq_meetup

Repository files navigation

title author date output bibliography
An example of RNA-seq differential expression analysis
Davide Risso
10/27/2017
BiocStyle::html_document
biblio.bib
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

The data

We will use the data from [@peixoto2015data]. The gene-level read counts are available on Github.

Briefly, the dataset consists of 5 biological replicates per three conditions: controls, fear-condition (memory formation), and memory retrieval.

We also have access to a list of negative control genes, i.e., genes that are not influenced by the biological effect of interest, and to a list of positive control genes, i.e., genes known to be differentially expressed with respect to the biological effect of interest.

We will use these data to show how to perform a typical differential expression workflow, using edgeR and DESeq2.

Packages needed

library(RColorBrewer)
library(EDASeq)
library(DESeq2)
library(RUVSeq)
library(edgeR)

Exploratory Data Analysis

After reading the counts and the positive and negative control genes in R, we filter out the non expressed genes.

data_dir <- "~/git/peixoto2015_tutorial/Peixoto_Input_for_Additional_file_1/"
fc <- read.table(paste0(data_dir, "Peixoto_CC_FC_RT.txt"), row.names=1, header=TRUE)
negControls <- read.table(paste0(data_dir, "Peixoto_NegativeControls.txt"), sep='\t', header=TRUE, as.is=TRUE)
positive <- read.table(paste0(data_dir, "Peixoto_positive_controls.txt"), as.is=TRUE, sep='\t', header=TRUE)

x <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x) <- colnames(fc)

filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter,]

negCon <- intersect(negControls[,2], rownames(filtered))
FCup <- intersect(positive[positive[,3]=="UP",1], rownames(filtered))
FCdown <- intersect(positive[positive[,3]=="DOWN",1], rownames(filtered))
RTup <- intersect(positive[positive[,4]=="UP",1], rownames(filtered))
RTdown <- intersect(positive[positive[,4]=="DOWN",1], rownames(filtered))

colors <- brewer.pal(9, "Set1")
colLib <- colors[x]

The betweenLaneNormalization function of EDASeq implements UQ normalization. We can then use the plotRLE and plotPCA functions of EDASeq to explore the normalized data.

uq <- betweenLaneNormalization(filtered, which="upper")

plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)

From these plots we notice that the data points denoted with "3" and "8" are somewhat different from the rest of the samples.

For now, let's filter out these samples.

idx <- grep("[3,8]", colnames(fc), invert = TRUE)
filter <- apply(fc[, idx], 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter, idx]

x <- as.factor(rep(c("CC", "FC", "RT"), each=3))
names(x) <- colnames(filtered)
colLib <- colors[x]
uq <- betweenLaneNormalization(filtered, which="upper")

plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)

Differential expression with DESeq2

First, we need to create a DESeqDataSet object to store the count matrix and the sample-level information.

dds <- DESeqDataSetFromMatrix(countData = filtered,
                              colData = data.frame(Condition = x, 
                                                   Expt = substring(colnames(filtered), 3)),
                              design = ~ Condition)
dds
colData(dds)

To test for the differential expression, we can simply call the wrapper function

dds <- DESeq(dds)

This function runs the following in this order.

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)

To understand how the dispersion is estimated by DESeq2, we can plot it against the mean.

plotDispEsts(dds)

The genewise estimates of the dispersion are shrunk towards the fitted line. Some gene-wise estimates are flagged as outliers and not shrunk towards the fitted value.

Let's explore the results.

res <- results(dds)
res

Calling results without any arguments will extract the estimated log2 fold-changes and p-values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level.

A nice feature of the DESeq2 package is that we can easily summarize these results.

summary(res)

To perform a different comparison, say FC vs. CC, we need to use the contrast argument.

res2 <- results(dds, contrast=c("Condition", "FC", "CC"))
summary(res2)

To explore the results, we can use look at the MA-plot and the histogram of the p-values.

plotMA(res2, ylim=c(-5, 5))
hist(res2$pvalue)

This is a good distribution of the p-values. We expect the majority of the genes to be non-differentially expressed, hence leading to a uniform distribution of the p-value and a small portion of genes to be differentially expressed, leading to p-values close to zero.

Another very useful plot is the volcano plot. As far as I know, DESeq2 does not have an automatic function for that, but it is easy to do it "by hand."

plot(res2$log2FoldChange, -log10(res2$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2$padj <= 0.1)
points(res2[de, "log2FoldChange"],
       -log10(res2[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2) %in% c(FCup, FCdown))
points(res2[pos, "log2FoldChange"],
       -log10(res2[pos, "pvalue"]),
       pch=1, col=2)

Using all the samples

What happens if we repeat the analysis with all the original samples?

x_orig <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x_orig) <- colnames(fc)

filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered_orig <- as.matrix(fc)[filter,]

dds_all <- DESeqDataSetFromMatrix(countData = filtered_orig,
                              colData = data.frame(Condition = x_orig, 
                                                   Expt = substring(colnames(filtered_orig), 3)),
                              design = ~ Condition)
dds_all <- DESeq(dds_all)

res2_all <- results(dds_all, contrast=c("Condition", "FC", "CC"))
summary(res2_all)
hist(res2_all$pvalue)

This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.

plot(res2_all$log2FoldChange, -log10(res2_all$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_all$padj <= 0.1)
points(res2_all[de, "log2FoldChange"],
       -log10(res2_all[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2_all) %in% c(FCup, FCdown))
points(res2_all[pos, "log2FoldChange"],
       -log10(res2_all[pos, "pvalue"]),
       pch=1, col=2)

Accounting for unwanted variation

Sometimes filtering out "bad" samples is not feasible or undesirable. In such cases, we can try and capture the unwanted variation and include it in the model. This is the approach used in the RUVSeq package.

uq <- betweenLaneNormalization(filtered_orig, which="upper")

ruv <- RUVg(uq, cIdx = negCon, k = 5)
plotRLE(ruv$normalizedCounts, col = colors[x_orig], outline=FALSE)
plotPCA(ruv$normalizedCounts, col=colors[x_orig], cex=1, cex.axis=1, cex.lab=1)

Now that we checked that these factors account for the unwanted variation, we can include them in the model.

dds_ruv <- DESeqDataSetFromMatrix(countData = filtered_orig,
                              colData = data.frame(Condition = x_orig, 
                                                   Expt = substring(colnames(filtered_orig), 3),
                                                   ruv$W),
                              design = ~ Condition + W_1 + W_2 + W_3 + W_4 + W_5)
dds_ruv <- DESeq(dds_ruv)

res2_ruv <- results(dds_ruv, contrast=c("Condition", "FC", "CC"))
summary(res2_ruv)
hist(res2_ruv$pvalue)

This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.

plot(res2_ruv$log2FoldChange, -log10(res2_ruv$pvalue), 
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_ruv$padj <= 0.1)
points(res2_ruv[de, "log2FoldChange"],
       -log10(res2_ruv[de, "pvalue"]),
       pch=20, col=4)
pos <- which(rownames(res2_ruv) %in% c(FCup, FCdown))
points(res2_ruv[pos, "log2FoldChange"],
       -log10(res2_ruv[pos, "pvalue"]),
       pch=1, col=2)

Differential expression with edgeR

Similarly to DESeq2, edgeR works on a dedicated object, called DGEList, that we need to create from our matrix.

y <- DGEList(counts = filtered_orig, group = x_orig)

The steps of a typical analysis in edgeR are very similar to those of DESeq2: calculate the normalization factors, estimate the dispersion parameters, and test for differential expression.

y <- calcNormFactors(y)
design <- model.matrix(~x_orig + ruv$W)
y <- estimateDisp(y, design)

estimateDisp in this case consist of three steps: it first estimates a common dispersion parameter, then a "trend" capturing the relation between dispersion and mean and finally a "tagwise" dispersion parameter shrinked towards the common trend.

In order to understand how this work, we can plot the mean-variance plot.

meanVarPlot <- plotMeanVar(y, 
                           show.raw.vars=TRUE, 
                           show.tagwise.vars=TRUE,
                           show.binned.common.disp.vars=FALSE,
                           show.ave.raw.vars=FALSE, 
                           NBline = TRUE , nbins = 100,
                           pch = 16, 
                           xlab ="Mean Expression (Log10 Scale)", 
                           ylab = "Variance (Log10 Scale)" , 
                           main = "Mean-Variance Plot" )

To test for differential expression, we first fit a generalized linear model (GLM) and then test using the likelihood ratio test.

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table

hist(top$PValue)

plot(top$logFC, -log10(top$PValue),
     xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")

de <- which(top$FDR <= 0.1)
points(top[de, "logFC"],
       -log10(top[de, "PValue"]),
       pch=20, col=4)
pos <- which(rownames(top) %in% c(FCup, FCdown))
points(top[pos, "logFC"],
       -log10(top[pos, "PValue"]),
       pch=1, col=2)

We can test contrasts in edgeR in the following way. Imagine we want to test the difference between FC and RT.

colnames(design) <- c("Intercept", "FC", "RT", 
                      "W1", "W2", "W3", "W4", "W5")
cont <- makeContrasts(FC - RT, levels = colnames(design))
cont
lrt <- glmLRT(fit, contrast = cont)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages