-
Notifications
You must be signed in to change notification settings - Fork 1
/
fgsea.Rmd
387 lines (282 loc) · 13.7 KB
/
fgsea.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Using the Fast preranked Gene Set Enrichment Analysis (fgsea) package"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html:
toc: true
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
From the [original paper](https://pubmed.ncbi.nlm.nih.gov/16199517/) describing the Gene Set Enrichment Analysis (GSEA):
>The goal of GSEA is to determine whether members of a gene set **S** tend to occur toward the top (or bottom) of the list **L**, in which case the gene set is correlated with the phenotypic class distinction.
To provide an example, consider carrying out RNA-seq experiments with two conditions, where one set of experiments are the control and another set are the treatment. A differential gene expression analysis is carried out and each gene is **ranked** by the fold-change between the control and treatment; genes that are up-regulated (fold-change > 1) are ranked at the top and genes that are down-regulated (fold-change < 1) are ranked at the bottom.
Besides inspecting the individual genes that are up- or down-regulated, you wish to find out whether a pathway of genes is up- or down-regulated. Therefore you carry out GSEA using a set of gene pathways (**S**) and check whether genes belonging to the same pathway tend to occur at the start or the end of the list of ranked genes (**L**).
## Installation
First install [fgsea](https://bioconductor.org/packages/release/bioc/html/fgsea.html).
```{r install_fgsea, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("fgsea", quietly = TRUE))
BiocManager::install("fgsea")
library(fgsea)
```
## Example data
The example pathways are packaged with `fgsea` and can be loaded with `data()`. The example pathways are stored in a list.
```{r example_pathways}
data("examplePathways", package = "fgsea")
class(examplePathways)
```
There are `r length(examplePathways)` example pathways.
```{r example_pathways_n}
length(examplePathways)
```
The first pathway `r head(names(examplePathways), 1)` contains Entrez Gene IDs that belong to this gene set.
```{r first_pathway}
examplePathways[1]
```
The gene ranks are also packaged with `fgsea` but we will re-generate the ranks based on the [author's code](https://github.com/ctlab/fgsea/blob/master/inst/gen_gene_ranks.R) to see how the ranks were created. Several other Bioconductor packages are required to generate the ranks. The Reactome database is also installed here for later use.
```{r install_bioc}
bioc_pac <- c("GEOquery", "limma", "org.Mm.eg.db", "reactome.db")
cran_pac <- c("data.table", "pheatmap")
install_pac <- function(x, repo){
if (!require(x, quietly = TRUE, character.only = TRUE)){
if(repo == "bioc"){
BiocManager::install(x, character.only = TRUE)
} else if (repo == "cran"){
install.packages(x, character.only = TRUE)
} else {
stop("Unknown repo")
}
}
}
sapply(bioc_pac, install_pac, repo = "bioc")
sapply(cran_pac, install_pac, repo = "cran")
```
Load packages.
```{r gen_gene_ranks_packages}
library(GEOquery)
library(limma)
library(org.Mm.eg.db)
library(data.table)
# for collapseBy
source("https://raw.githubusercontent.com/assaron/r-utils/master/R/exprs.R")
```
The example gene ranks were created using [mouse microarray data](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14308), which we can download using `getGEO` from the `GEOquery` package.
```{r download_gse14308}
gse14308 <- getGEO("GSE14308")[[1]]
```
Conditions (are extracted from the title and) are added to the phenotypic data.
```{r add_condition}
pData(gse14308)$condition <- sub("-.*$", "", gse14308$title)
pData(gse14308)[, c('platform_id', 'type', 'condition')]
```
`fData` retrieves information on features, which are microarray probe sets.
```{r feature_dat}
feature_dat <- fData(gse14308)
colnames(feature_dat)
```
The `collapseBy` function is sourced from [exprs.R](https://raw.githubusercontent.com/assaron/r-utils/master/R/exprs.R) and the source is shown below but named as `collapseBy_` so we do not overwrite the sourced function. We will go through each line of code to try to understand what the function is doing.
```{r collapse_by_}
collapseBy_ <- function(es, factor, FUN=median) {
ranks <- apply(exprs(es), 1, FUN)
t <- data.frame(f=factor, i=seq_along(ranks), r=ranks)
t <- t[order(t$r, decreasing=T), ]
keep <- t[!duplicated(t$f) & !is.na(t$f),]$i
res <- es[keep, ]
fData(res)$origin <- rownames(res)
rownames(res) <- factor[keep]
res
}
```
`ranks` contains the median probe intensity across all samples.
```{r ranks}
ranks <- apply(exprs(gse14308), 1, median)
head(ranks)
```
The `ENTREZ_GENE_ID`, index, and median get saved into a data frame. (I've used `my_df` here because `t` is the name of the transpose function.) Next, the data frame is ordered from highest intensity to lowest.
```{r my_df}
my_df <- data.frame(
f=fData(gse14308)$ENTREZ_GENE_ID,
i=seq_along(ranks),
r=ranks
)
my_df <- my_df[order(my_df$r, decreasing=TRUE), ]
head(my_df)
```
A vector called `keep` is created to keep only rows with non-duplicated and non-missing Entrez Gene IDs.
```{r keep}
keep <- my_df[!duplicated(my_df$f) & !is.na(my_df$f),]$i
length(keep)
```
Finally, `keep` is used to subset the data; `origin` is created to store the original probe IDs before replacing the row names with Entrez Gene IDs.
```{r subset_keep}
res <- gse14308[keep, ]
fData(res)$origin <- rownames(res)
rownames(res) <- fData(gse14308)$ENTREZ_GENE_ID[keep]
res
```
Therefore, the `collapseBy` function is used to calculate the median intensity across samples/experiments, which is used to rank features, and to remove features that are duplicated or have no Entrez Gene ID.
Now that we know what `collapseBy` does, we can use it.
```{r collapse_by}
es <- collapseBy(gse14308, fData(gse14308)$ENTREZ_GENE_ID, FUN=median)
es
```
Probe IDs that mapped to several Entrez Gene IDs and empty entries are also removed.
```{r es_subset}
es <- es[!grepl("///", rownames(es)), ]
es <- es[rownames(es) != "", ]
dim(exprs(es))
```
Quantile normalisation is carried out.
```{r quantile_normalisation}
exprs(es) <- normalizeBetweenArrays(log2(exprs(es)+1), method="quantile")
```
A [design matrix](https://f1000research.com/articles/9-1444) is defined.
```{r design}
es.design <- model.matrix(~0+condition, data=pData(es))
es.design
```
A linear model is fit given the design.
```{r lm_fit}
fit <- lmFit(es, es.design)
```
A contrasts matrix is used to compute contrasts using our fitted linear model. Here we're contrasting naive T-cells to T-helper 1 cells.
```{r make_and_fit_contrasts}
makeContrasts(
conditionTh1-conditionNaive,
levels=es.design
)
fit2 <- contrasts.fit(
fit,
makeContrasts(
conditionTh1-conditionNaive,
levels=es.design
)
)
fit2
```
Finally, the differential expression analysis is carried out and the results saved. The results are ranked by [limma's moderated t-statistic](https://support.bioconductor.org/p/6124/) and this creates the ranked list of genes.
```{r ebayes}
fit2 <- eBayes(fit2)
names(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"))
de <- data.table(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"), keep.rownames = TRUE)
ranks <- de[order(t), list(rn, t)]
ranks
```
Load `exampleRanks`.
```{r example_ranks}
data("exampleRanks", package = "fgsea")
head(exampleRanks)
```
The names of the vector are the Entrez Gene IDs and the values are the rank metric.
Compare with our results.
```{r compare}
wanted <- head(names(exampleRanks))
ranks[rn %in% wanted]
```
Visualise the six most significantly down- and up-regulated genes.
```{r pheatmap}
library(pheatmap)
my_sample <- pData(es)$condition == "Th1" | pData(es)$condition == "Naive"
my_group <- data.frame(group = pData(es)$condition[my_sample])
row.names(my_group) <- colnames(exprs(es))[my_sample]
pheatmap(
mat = es[c(head(de[order(t), 1])$rn, tail(de[order(t), 1])$rn), my_sample],
annotation_col = my_group,
cluster_rows = FALSE,
cellwidth=25,
cellheight=15,
scale = "row"
)
```
## Vignette analysis
The following section is based on the [fgsea tutorial](https://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html) but with my elaborations. The pathways are stored in `examplePathways` and the ranked gene list in `exampleRanks`.
The `fgsea()` function runs the pre-ranked gene set enrichment analysis.
```{r fgsea}
data(examplePathways)
data(exampleRanks)
set.seed(42)
system.time(
fgseaRes <- fgsea(
pathways = examplePathways,
stats = exampleRanks,
minSize=15,
maxSize=500
)
)
class(fgseaRes)
```
Top 6 enriched pathways; see the [data.table Wiki](https://github.com/Rdatatable/data.table/wiki) for more information on the `data.table` package.
```{r fgsea_result}
head(fgseaRes[order(pval), ])
```
Number of significant pathways at padj < 0.01.
```{r fgsea_sign_result}
sum(fgseaRes[, padj < 0.01])
```
Plot the most significantly enriched pathway.
```{r plot_sign_pathway}
plotEnrichment(
examplePathways[[head(fgseaRes[order(pval), ], 1)$pathway]],
exampleRanks
) +
ggplot2::labs(title=head(fgseaRes[order(pval), ], 1)$pathway)
```
Plot the top 10 pathways enriched at the top and bottom of the ranked list, respectively.
```{r plot_gsea_table}
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(
examplePathways[topPathways],
exampleRanks,
fgseaRes,
gseaParam = 0.5
)
```
Since the genes were ranked according to their differential expression significance and fold change, with the most significantly down-regulated genes at the top and up-regulated genes at the bottom of the list, the enriched gene sets provides us with some idea of the function of these genes. For example, `exampleRanks` was generated by testing naive T cells against T helper 1 cells, so the enriched pathways may suggest how naive cells are differentiated into T helper 1 cells.
## Reactome
[Reactome pathways](https://reactome.org/) can also be used with fgsea and this database is available via Bioconductor (and was installed earlier in this post).
The `reactomePathways()` function returns a list of Reactome pathways given a list of Entrez Gene IDs. In the example data, the IDs are stored as the names in the `exampleRanks` vector.
```{r my_pathways}
my_pathways <- reactomePathways(names(exampleRanks))
```
Out of interest, let's check the median number of genes in the pathways.
```{r my_pathways_median}
summary(sapply(my_pathways, length))
```
The results are very similar to the example pathways, where the pathways are related to the cell cycle and mitosis.
```{r fgsea_reactome}
set.seed(42)
system.time(
fgsea_reactome <- fgsea(
pathways = my_pathways,
stats = exampleRanks,
minSize=15,
maxSize=500
)
)
head(fgsea_reactome[order(pval), ])
```
## Leading edge
Again quote the [original paper](https://pubmed.ncbi.nlm.nih.gov/16199517/):
> Often it is useful to extract the core members of high scoring gene sets that contribute to the enrichment score. We define the leading-edge subset to be those genes in the gene set **S** that appear in the ranked list **L** at, or before, the point where the running sum reaches its maximum deviation from zero.
The `fgsea` results includes the leading edge, which contains the genes that contributed to the enrichment score. To illustrate, let's take the leader edge genes from the most significant Reactome pathway.
```{r fgsea_reactome_most_sign}
fgsea_reactome[order(pval),][1,]
```
The set of leading edge genes is made up of `r length(fgsea_reactome[order(pval),][1,]$leadingEdge[[1]])` genes.
```{r leading_edge}
fgsea_reactome[order(pval),][1,]$leadingEdge[[1]]
```
The `r fgsea_reactome[order(pval),][1,]$pathway` contains `r length(my_pathways[[fgsea_reactome[order(pval),][1,]$pathway]])` genes, which was indicated in the `fgsea` result as size (but it's always good to double-check).
```{r length_pathway}
length(my_pathways[[fgsea_reactome[order(pval),][1,]$pathway]])
```
## Summary
The Gene Set Enrichment Analysis (GSEA) has been around since 2005 and has become a routine step when analysing gene expression data. It differs from Gene Ontology enrichment analysis in that it considers all genes in contrast to taking only significantly differentially expressed genes.
The `fgsea` package allows one to conduct a pre-ranked GSEA in R, which is one approach in a GSEA. A p-value is estimated by permuting the genes in a gene set, which leads to randomly assigned gene sets of the same size. Note that "This approach is not strictly accurate because it ignores gene-gene correlations and will overestimate the significance levels and may lead to false positives." However, it is still useful in getting an idea of the potential roles of the genes that are up- and down-regulated. If your pre-ranked GSEA returns no significant gene sets, you may still get an idea of what roles the up- and down-regulated genes may be involved in by examining the leading edge set. This set indicates the genes that contributed to the enrichment score.
The example ranks in the `fgsea` package were ranked based on the moderated t-statistic. [Another metric](https://genomespot.blogspot.com/2014/09/data-analysis-step-8-pathway-analysis.html) that you may consider is to take the signed fold change and multiply it by the $-log_{10}$p-value. For example, if you performed your differential expression analysis with `edgeR`, you can simply multiply the signed fold change column to the $-log_{10}$p-value column. This metric is based on the paper: [Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures](https://pubmed.ncbi.nlm.nih.gov/20660011/).