-
Notifications
You must be signed in to change notification settings - Fork 4
/
de_example.Rmd
325 lines (245 loc) · 11.1 KB
/
de_example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
title: "An example of RNA-seq differential expression analysis"
author: "Davide Risso"
date: "10/27/2017"
output: BiocStyle::html_document
bibliography: biblio.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# The data
We will use the data from [@peixoto2015data].
The gene-level read counts are available on [Github](https://github.com/drisso/peixoto2015_tutorial).
Briefly, the dataset consists of 5 biological replicates per three conditions: controls, fear-condition (memory formation), and memory retrieval.
We also have access to a list of negative control genes, i.e., genes that are not influenced by the biological effect of interest, and to a list of positive control genes, i.e., genes known to be differentially expressed with respect to the biological effect of interest.
We will use these data to show how to perform a typical differential expression
workflow, using edgeR and DESeq2.
# Packages needed
```{r}
library(RColorBrewer)
library(EDASeq)
library(DESeq2)
library(RUVSeq)
library(edgeR)
```
# Exploratory Data Analysis
After reading the counts and the positive and negative control genes in R, we filter out the non expressed genes.
```{r readFC}
data_dir <- "~/git/peixoto2015_tutorial/Peixoto_Input_for_Additional_file_1/"
fc <- read.table(paste0(data_dir, "Peixoto_CC_FC_RT.txt"), row.names=1, header=TRUE)
negControls <- read.table(paste0(data_dir, "Peixoto_NegativeControls.txt"), sep='\t', header=TRUE, as.is=TRUE)
positive <- read.table(paste0(data_dir, "Peixoto_positive_controls.txt"), as.is=TRUE, sep='\t', header=TRUE)
x <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x) <- colnames(fc)
filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter,]
negCon <- intersect(negControls[,2], rownames(filtered))
FCup <- intersect(positive[positive[,3]=="UP",1], rownames(filtered))
FCdown <- intersect(positive[positive[,3]=="DOWN",1], rownames(filtered))
RTup <- intersect(positive[positive[,4]=="UP",1], rownames(filtered))
RTdown <- intersect(positive[positive[,4]=="DOWN",1], rownames(filtered))
colors <- brewer.pal(9, "Set1")
colLib <- colors[x]
```
The *betweenLaneNormalization* function of *EDASeq* implements UQ normalization. We can then use the *plotRLE* and *plotPCA* functions of *EDASeq* to explore the normalized data.
```{r uq}
uq <- betweenLaneNormalization(filtered, which="upper")
plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
```
From these plots we notice that the data points denoted with "3" and "8" are somewhat different from the rest of the samples.
For now, let's filter out these samples.
```{r filter}
idx <- grep("[3,8]", colnames(fc), invert = TRUE)
filter <- apply(fc[, idx], 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter, idx]
x <- as.factor(rep(c("CC", "FC", "RT"), each=3))
names(x) <- colnames(filtered)
colLib <- colors[x]
```
```{r uq2}
uq <- betweenLaneNormalization(filtered, which="upper")
plotRLE(uq, col=colLib, outline=FALSE, las=3, ylab="Relative Log Expression", cex.axis=1, cex.lab=1)
plotPCA(uq, col=colLib, cex=1, cex.axis=1, cex.lab=1)
```
# Differential expression with DESeq2
First, we need to create a `DESeqDataSet` object to store the count matrix and the sample-level information.
```{r deseq2}
dds <- DESeqDataSetFromMatrix(countData = filtered,
colData = data.frame(Condition = x,
Expt = substring(colnames(filtered), 3)),
design = ~ Condition)
dds
colData(dds)
```
To test for the differential expression, we can simply call the wrapper function
```{r deseq2de}
dds <- DESeq(dds)
```
This function runs the following in this order.
```{r, eval=FALSE}
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
```
To understand how the dispersion is estimated by `DESeq2`, we can plot it against the mean.
```{r disp}
plotDispEsts(dds)
```
The genewise estimates of the dispersion are shrunk towards the fitted line. Some gene-wise estimates are flagged as outliers and not shrunk towards the fitted value.
Let's explore the results.
```{r deseq2res}
res <- results(dds)
res
```
Calling `results` without any arguments will extract the estimated log2 fold-changes and p-values for the last variable in the design formula. If there are more than 2 levels for this variable, `results` will extract the results table for a comparison of the last level over the first level.
A nice feature of the `DESeq2` package is that we can easily summarize these results.
```{r deseq2sum}
summary(res)
```
To perform a different comparison, say FC vs. CC, we need to use the `contrast` argument.
```{r deseq2contrast}
res2 <- results(dds, contrast=c("Condition", "FC", "CC"))
summary(res2)
```
To explore the results, we can use look at the MA-plot and the histogram of the p-values.
```{r}
plotMA(res2, ylim=c(-5, 5))
hist(res2$pvalue)
```
This is a good distribution of the p-values. We expect the majority of the genes to be non-differentially expressed, hence leading to a uniform distribution of the p-value and a small portion of genes to be differentially expressed, leading to p-values close to zero.
Another very useful plot is the volcano plot. As far as I know, `DESeq2` does not have an automatic function for that, but it is easy to do it "by hand."
```{r}
plot(res2$log2FoldChange, -log10(res2$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2$padj <= 0.1)
points(res2[de, "log2FoldChange"],
-log10(res2[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2) %in% c(FCup, FCdown))
points(res2[pos, "log2FoldChange"],
-log10(res2[pos, "pvalue"]),
pch=1, col=2)
```
## Using all the samples
What happens if we repeat the analysis with all the original samples?
```{r deseq2all}
x_orig <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x_orig) <- colnames(fc)
filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered_orig <- as.matrix(fc)[filter,]
dds_all <- DESeqDataSetFromMatrix(countData = filtered_orig,
colData = data.frame(Condition = x_orig,
Expt = substring(colnames(filtered_orig), 3)),
design = ~ Condition)
dds_all <- DESeq(dds_all)
res2_all <- results(dds_all, contrast=c("Condition", "FC", "CC"))
summary(res2_all)
```
```{r}
hist(res2_all$pvalue)
```
This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.
```{r}
plot(res2_all$log2FoldChange, -log10(res2_all$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_all$padj <= 0.1)
points(res2_all[de, "log2FoldChange"],
-log10(res2_all[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2_all) %in% c(FCup, FCdown))
points(res2_all[pos, "log2FoldChange"],
-log10(res2_all[pos, "pvalue"]),
pch=1, col=2)
```
## Accounting for unwanted variation
Sometimes filtering out "bad" samples is not feasible or undesirable. In such cases, we can try and capture the unwanted
variation and include it in the model. This is the approach used in the `RUVSeq` package.
```{r ruvseq}
uq <- betweenLaneNormalization(filtered_orig, which="upper")
ruv <- RUVg(uq, cIdx = negCon, k = 5)
plotRLE(ruv$normalizedCounts, col = colors[x_orig], outline=FALSE)
plotPCA(ruv$normalizedCounts, col=colors[x_orig], cex=1, cex.axis=1, cex.lab=1)
```
Now that we checked that these factors account for the unwanted variation, we can include them in the model.
```{r deseq2ruv}
dds_ruv <- DESeqDataSetFromMatrix(countData = filtered_orig,
colData = data.frame(Condition = x_orig,
Expt = substring(colnames(filtered_orig), 3),
ruv$W),
design = ~ Condition + W_1 + W_2 + W_3 + W_4 + W_5)
dds_ruv <- DESeq(dds_ruv)
res2_ruv <- results(dds_ruv, contrast=c("Condition", "FC", "CC"))
summary(res2_ruv)
```
```{r}
hist(res2_ruv$pvalue)
```
This is not a good distribution of the p-values. Usually a symptom that the data are affected by batch effects or other unwanted variation.
```{r}
plot(res2_ruv$log2FoldChange, -log10(res2_ruv$pvalue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(res2_ruv$padj <= 0.1)
points(res2_ruv[de, "log2FoldChange"],
-log10(res2_ruv[de, "pvalue"]),
pch=20, col=4)
pos <- which(rownames(res2_ruv) %in% c(FCup, FCdown))
points(res2_ruv[pos, "log2FoldChange"],
-log10(res2_ruv[pos, "pvalue"]),
pch=1, col=2)
```
# Differential expression with edgeR
Similarly to `DESeq2`, `edgeR` works on a dedicated object, called `DGEList`, that we need to create from our matrix.
```{r edger}
y <- DGEList(counts = filtered_orig, group = x_orig)
```
The steps of a typical analysis in `edgeR` are very similar to those of `DESeq2`: calculate the normalization factors, estimate the dispersion parameters, and test for differential expression.
```{r edger2}
y <- calcNormFactors(y)
design <- model.matrix(~x_orig + ruv$W)
y <- estimateDisp(y, design)
```
`estimateDisp` in this case consist of three steps: it first estimates a common dispersion parameter, then a "trend" capturing the relation between dispersion and mean and finally a "tagwise" dispersion parameter shrinked towards the common trend.
In order to understand how this work, we can plot the mean-variance plot.
```{r meanvar}
meanVarPlot <- plotMeanVar(y,
show.raw.vars=TRUE,
show.tagwise.vars=TRUE,
show.binned.common.disp.vars=FALSE,
show.ave.raw.vars=FALSE,
NBline = TRUE , nbins = 100,
pch = 16,
xlab ="Mean Expression (Log10 Scale)",
ylab = "Variance (Log10 Scale)" ,
main = "Mean-Variance Plot" )
```
To test for differential expression, we first fit a generalized linear model (GLM) and then test using the likelihood ratio test.
```{r edger3}
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)
plot(top$logFC, -log10(top$PValue),
xlab = "log2FC", ylab = "-log10(p)", pch=20, col="grey")
de <- which(top$FDR <= 0.1)
points(top[de, "logFC"],
-log10(top[de, "PValue"]),
pch=20, col=4)
pos <- which(rownames(top) %in% c(FCup, FCdown))
points(top[pos, "logFC"],
-log10(top[pos, "PValue"]),
pch=1, col=2)
```
We can test contrasts in `edgeR` in the following way. Imagine we want to test the difference between FC and RT.
```{r contrasts_edger}
colnames(design) <- c("Intercept", "FC", "RT",
"W1", "W2", "W3", "W4", "W5")
cont <- makeContrasts(FC - RT, levels = colnames(design))
cont
lrt <- glmLRT(fit, contrast = cont)
topTags(lrt)
top <- topTags(lrt, n = Inf)$table
hist(top$PValue)
```