-
Notifications
You must be signed in to change notification settings - Fork 11
/
Tutorial_150827.Rmd
429 lines (273 loc) · 10.3 KB
/
Tutorial_150827.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
---
title: "Pathway analysis"
author: "Asela Wijeratne"
output: html_document
---
###During this session you will learn about:
1. Use of GAGE package to do pathway analysis
a. "Native workflow"
b. Combined workflow with RNAseq
2. Interpret the GAGE output
3. Use of Pathview to visualize the perturbed KEGG pathways
######For more details, please go to relevent reference manauls:
GAGE package and tutorial:
http://bioconductor.org/packages/devel/bioc/html/gage.html
pathview and gageData:
http://bioconductor.org/packages/devel/bioc/html/pathview.html
http://bioconductor.org/packages/devel/data/experiment/html/gageData.html
EdgeR user guide:
http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
###Installing necessary packages
Before you start the tutorial following packages needs to be installed.
Install required packages:
```
source("http://bioconductor.org/biocLite.R")
biocLite(c("pathview", "edgeR", "gage"))
```
Load required packages
```
library(edgeR)
library(gage)
library(pathview)
```
###Getting and preprocessing data
First, let's get some RNAseq data. This data comes from a study described in Zhang et al., 2011, Genome-wide mapping of the HY5-mediated gene-networks in Arabidopsis that involve both transcriptional and post-transcriptional regulation (http://onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2010.04426.x/full). This a comparison between a Arabidopsis mutant called hy5 and wild-type. For each genotype, there are two replicates. Vince Buffalo has re-analyzed this data as described here: (https://github.com/vsbuffalo/rna-seq-example). We are going to grab the count data file from this analysis.
```
rnaseq_URL <- "https://github.com/vsbuffalo/rna-seq-example/raw/master/results/raw-counts.txt"
download.file(rnaseq_URL, destfile = "rnaseq_data.txt")
rnaseq_data <- read.table("rnaseq_data.txt",
header=TRUE)
```
#####If above code work, ignore the rest of the next part of the code.
You can also get the data from the following link and save it in the ```Desktop/pathway_analysis``` folder.
http://bit.ly/150826_rnaseqdata
Read the data into R using the following code:
```
rnaseq_data <- read.table("rnaseq_data.txt",
header=TRUE)
```
Let's look at the head of the data set.
```
head(rnaseq_data)
````
This is how they are labelled:
```
GEO:GSM613465: wild type samples
SRR070570: replicate 1
SRR070571: replicate 2
GEO:GSM613466: hy5 mutant samples
SRR070572: replicate 1
SRR070573: replicate 2
```
Rename the columns so we can identify them easily.
```
colnames(rnaseq_data) <- c("WT_1", "WT_2", "hy_1", "hy_2")
```
###Getting KEGG data
#####1. Finding your species in KEGG
First, we need to make sure the species we work with is found in KEGG database. You can go the following webpage and see the organism is list:
http://rest.kegg.jp/list/organism
In addition, ```pathview``` also carry a data matrix with KEGG supported species. You can explore this matrix as well.
```
data(korg)
head(korg)
```
You can get the short names for Arabidopsis by searching the ```korg``` data.
```
org <- "arabidopsis thaliana"
species <- unlist(sapply(1:ncol(korg), function(i) {
agrep(org, korg[, i])
}))
korg[species, 1, drop = F]
```
#####2. Creating the KEGG dataset for GAGE analysis
```kegg.gsets``` can be used to get KEGG data for any species present in the KEGG database. GAGE manual recommends that you save this data as a .Rdata file. This way you don't need to download this each time you need to use and also increase the reproducibility.
```
kegg_arab <- kegg.gsets("ath")
kegg.gs <- kegg_arab$kg.sets[kegg_arab$sigmet.idx]
```
###GAGE "native" workflow
#####1. Remove zero counts in all samples.
```
rnaseq_counts <- rnaseq_data # rename the dataset
dim(rnaseq_counts)
non_zero <- rowSums(rnaseq_counts) != 0
rnaseq_counts <- rnaseq_counts[non_zero,]
# number of genes left after removing zero counts
dim(rnaseq_counts)
```
#####2. Normalize library sizes
```
libsizes <- colSums(rnaseq_counts)
size_factor <- libsizes / exp(mean(log(libsizes)))
norm_counts <- t(t(rnaseq_counts) / size_factor)
range(norm_counts)
```
#####3. Variance stabilizing transformation
We need to add small (positive) value to avoid getting ```-inf``` during log transformation. We will add 8 as per GAGE manual.
```
norm_counts <- log2(norm_counts + 8)
range(norm_counts)
```
You can do MA and PCA plots to see the processed data variances and overall variances and similarity between samples, respectively.
#####4. Define reference and experiment samples
```
ref_idx <- 1:2 #wt
samp_idx <- 3:4 #hy mutant
```
#####5. Enrichment analysis
Here we are going to perform the enrichment analysis using fold change (default) as the gene set statistics. In addition, control (wt) and experiment (hy) are not paired. Therefore, we going to use ```compare ="unpaired"```.
```
native_kegg_fc <- gage(norm_counts,
gsets = kegg.gs,
ref = ref_idx,
samp = samp_idx,
compare ="unpaired")
```
#####6. Some outputs
First four pathways that are up-regulated.
```
head(native_kegg_fc$greater[,1:5], 4)
```
######Meaning of the output:
```p.geomean``` geometric mean of the individual p-values from multiple single array based gene set tests
```stat.mean``` mean of the individual statistics from multiple single array based gene set tests
```p.val``` gloal p-value or summary of the individual p-values from multiple single array based gene set tests. This is the default p-value being used.
```q.val``` FDR q-value adjustment of the global p-value using the Benjamini & Hochberg procedure implemented in multtest package. This is the default q-value being used.
```set.size``` the effective gene set size, i.e. the number of genes included in the gene set test
First four pathways that are down-regulated.
```
head(native_kegg_fc$less[,1:5], 4)
```
We can also do significance test to find the pathways that are up-regulated and down-regulated using a given cutoff (default is 0.1). ```sigGeneSet``` function will also create a heat map, showing the average fold change of expression in pathways that are perturbed.
```
wt_hy_sig_kegg <-sigGeneSet(native_kegg_fc, outname="wt_hy.kegg")
```
Write the significant pathway sets to a text file.
```
write.table(rbind(wt_hy_sig_kegg$greater,
wt_hy_sig_kegg$less),
file = "wt_hy_sig_kegg.txt", sep = "\t")
```
###Combine GAGE and EdgeR analysis
First, we will do a differential gene expression analysis using the ```edgeR``` package.
#####1. Make the study design as follows:
```
targets <-
data.frame(sample_name=colnames(rnaseq_data),Group=rep(c("WT","hy"),each=2))
targets
```
#####2. Create DGElist object.
This is a R list object that can be easily manipulated.
```
d <- DGEList(counts=rnaseq_data,
group=targets$Group) # Constructs DGEList object
```
#####3. Remove lowly expressed genes
Keep genes with more than 10 counts per million (calculated with ```cpm()```) at least in two samples.
```
#before filtering:
dim(d)
keep <- rowSums(cpm(d)> 10) >= 2
d <- d[keep,]
#after filtering:
dim(d)
```
#####4. Normalize the data
```
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d)
d
```
#####5. multidimentioanl scaling plot
Quick sanity check to see how samples are related to each other using multidimensional scaling plot (MSD).
```
plotMDS(d, labels = targets$Group,
col = c("darkgreen","blue")[factor(targets$Group)])
```
#####6. Estimate the dispersion
```
d1 <- estimateCommonDisp(d, verbose=T)
d1 <- estimateTagwiseDisp(d1)
plotBCV(d1)
```
#####7. Differentail gene expression
```
de.com <- exactTest(d1, pair = c("WT", "hy"))
topTags(de.com, n = 5)
```
```
FDR <- p.adjust(de.com$table$PValue, method="BH")
de1 <- decideTestsDGE(de.com, adjust.method="BH", p.value=0.05)
summary(de1)
```
#####8. Making the data suitable for pathway analysis
```
isDE <- as.logical(de1) # covert to DE set to true/false set
DEnames <- rownames(d)[isDE] # get the DE gene names
edger.fc <- de.com$table$logFC # get the log fold change
names(edger.fc) <- rownames(de.com$table) # assign row names to fold change
exp.fc <- edger.fc
head(exp.fc)
```
Getting DE gene set and fold change
```
DE_foldchange <- edger.fc[DEnames]
length(DE_foldchange)
```
#####9. GAGE analysis
```
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
head(fc.kegg.p$greater[,1:5], 4)
```
Significance test
```
wt_hy_sig_kegg <-sigGeneSet(fc.kegg.p, outname="wt_hy.kegg")
```
###Visualize the data using Pathview
```
log_fc= norm_counts[, samp_idx]-rowMeans(norm_counts[, ref_idx])
```
#####1. Selecting the path ids for the upregulated set.
```
greater_set <- native_kegg_fc$greater[, "q.val"] < 0.1 &
!is.na(native_kegg_fc$greater[, "q.val"])
greater_ids <- rownames(native_kegg_fc$greater)[greater_set]
head(greater_ids)
```
#####2. Selecting path ids for down-regulated set
```
less_set <- native_kegg_fc$less[, "q.val"] < 0.1 &
!is.na(native_kegg_fc$less[,"q.val"])
less_ids <- rownames(native_kegg_fc$less)[less_set]
```
#####3. Combine up and down-regulated path ids.
```
combine_ids <- substr(c(greater_ids, less_ids), 1, 8)
head(combine_ids)
```
#####4. Visualization
Here we are going to get the first three pathways and visualize using ```pathview```.
```
pv.out.list <- sapply(combine_ids[1:3], function(pid) pathview(
gene.data = exp.fc, pathway.id = pid,
species = "ath", out.suffix="Wt_hy",
gene.idtype="KEGG"))
```
#####5. Mapping DE genes to individual pathways
We can map genes that show statistically significant changes to individual pathways. As an example, we can take DNA replication pathway and see which of the DE genes from EdgeR is present.
```
pv_replication <- pathview(gene.data = DE_foldchange,
gene.idtype = "KEGG",
pathway.id = combine_ids[3],
species = "ath",
out.suffix = "DNA_replication",
keys.align = "y",
kegg.native = T,
match.data = T,
key.pos = "topright")
```
See how the output look like.
```
head(pv_replication)
```