-
Notifications
You must be signed in to change notification settings - Fork 4
/
slides.Rmd
132 lines (89 loc) · 4.09 KB
/
slides.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: "RNA-seq differential expression analysis"
author: "Davide Risso"
date: "10/27/2017"
output:
beamer_presentation:
toc: no
keep_tex: no
---
## What we will cover
We will cover differential expression analysis of RNA-seq data in R/Bioconductor.
We will start from a matrix of gene-level read counts.
We will cover the two most popular packages, `DESeq2` and `edgeR`.
I will also show you how to deal with unwanted variation using the `RUVSeq` package.
## What we will not cover
I will not talk about the preprocessing of RNA-seq data, i.e., what we do to obtain the gene-level read counts.
These steps are usually done with stand-alone software outside R.
I will not talk about isoform-level analysis and alternative splicing.
We will focus on gene-level differential expression.
## Where to find these slides
\Large
[https://github.com/drisso/rnaseq_meetup](github.com/drisso/rnaseq_meetup)
## Where to find additional resources
- The edgeR user guide [https://bioconductor.org/packages/edgeR](bioconductor.org/packages/edgeR)
- The DESeq2 vignette [https://bioconductor.org/packages/DESeq2](bioconductor.org/packages/DESeq2)
- The F1000 Research Bioconductor gateway [https://f1000research.com/gateways/bioconductor](f1000research.com/gateways/bioconductor)
- [https://support.bioconductor.org](support.bioconductor.org)
## From RNA to gene-level read counts
\centering
\includegraphics[width=.6\linewidth]{central_dogma}
## From RNA to gene-level read counts
\centering
\includegraphics[width=\linewidth]{rna-seq.jpg}
## From RNA to gene-level read counts
\scriptsize
```{r, echo=FALSE, results='markup'}
data_dir <- "~/git/peixoto2015_tutorial/Peixoto_Input_for_Additional_file_1/"
fc <- read.table(paste0(data_dir, "Peixoto_CC_FC_RT.txt"), row.names=1, header=TRUE)
negControls <- read.table(paste0(data_dir, "Peixoto_NegativeControls.txt"), sep='\t', header=TRUE, as.is=TRUE)
positive <- read.table(paste0(data_dir, "Peixoto_positive_controls.txt"), as.is=TRUE, sep='\t', header=TRUE)
x <- as.factor(rep(c("CC", "FC", "RT"), each=5))
names(x) <- colnames(fc)
filter <- apply(fc, 1, function(x) length(x[which(x>10)])>5)
filtered <- as.matrix(fc)[filter,]
head(filtered)
```
## The Poisson Model
When statisticians see counts, they immediately think about Simeon Poisson.
\centering
\includegraphics[width=.7\linewidth]{Simeon_Poisson}
## The Poisson Model
The Poisson distribution naturally arises from binomial calculations, with a large number of trials and a small probability.
It has a rather stringent assumption: **the variance is equal to the mean**!
$$
Var(Y_{ij}) = \mu_{ij}
$$
In real datasets the variance is greater than the mean, a condition known as **overdispersion**.
## A real example
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(edgeR)
y <- DGEList(counts = filtered, group = x)
y <- calcNormFactors(y)
design <- model.matrix(~x)
y <- estimateDisp(y, design)
meanVarPlot <- plotMeanVar(y,
show.raw.vars=TRUE,
show.tagwise.vars=FALSE,
show.binned.common.disp.vars=FALSE,
show.ave.raw.vars=FALSE,
NBline = TRUE , nbins = 100,
pch = 16,
xlab ="Mean Expression (Log10 Scale)",
ylab = "Variance (Log10 Scale)" ,
main = "Mean-Variance Plot" )
```
## The Negative Binomial Model
A generalization of the Poisson model is the negative binomial, that assumes that the variance is a quadratic function of the mean.
$$
Var(Y_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2
$$
where $\phi$ is called the **dispersion parameter**.
Both `edgeR` and `DESeq2` assume that the data is distributed as a negative binomial.
## An example dataset
![](FCdesign.png)
## An example dataset
- C57BL/6J adult male mice (2 months of age).
- Five animals per group: fear conditioning (FC), memory retrieval (RT), and controls (CC).
- Illumina 100bp paired-end reads mapped to the mouse genome (mm9) using GMAP/GSNAP.
- Ensembl (release 65) gene counts obtained using HTSeq.