-
Notifications
You must be signed in to change notification settings - Fork 1
/
go_enrichment.Rmd
286 lines (207 loc) · 9.76 KB
/
go_enrichment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "Gene Ontology Enrichment Analysis"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Getting started
The Gene Ontology Enrichment Analysis (GOEA) is a typical analysis carried out on transcriptome data. Online tools for performing a GOEA include [DAVID](https://david.ncifcrf.gov/), [Enrichr](https://maayanlab.cloud/Enrichr/), and [PANTHER](http://www.pantherdb.org/) just to name a few. While web-based tools are easy to use, it becomes tedious when you have to analyse (or re-analyse) lots of datasets. Therefore, it is preferable to use a programmatic approach and in this post we will check out some Bioconductor packages that allow to perform a GOEA.
First install the following packages, if necessary, and then load them.
```{r install_and_or_load_packages, message=FALSE, warning=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
my_packages <- c("clusterProfiler",
"GOstats",
"GO.db",
"org.Hs.eg.db")
to_install <- my_packages[!my_packages %in% installed.packages()]
# install missing packages, if any
if (length(to_install) > 0){
BiocManager::install(pkgs = to_install)
}
# load all packages and suppress output of sapply
invisible(sapply(my_packages, library, character.only = TRUE))
```
Create a positive control where the gene set are composed of genes that are all associated with `GO:0007411` (axon guidance); we will use the `org.Hs.eg.db` package to achieve this based on [the vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf).
```{r ls_org_hs_eg_db, include=FALSE}
ls("package:org.Hs.eg.db")
```
Methods that can be applied to `AnnotationDbi` objects such as `org.Hs.eg.db` include: `columns`, `keytypes`, `keys`, and `select`.
Use `columns` to find out what data can be retrived using `select`.
```{r columns}
columns(org.Hs.eg.db)
```
Use `keytypes` to find out what fields we can use as keys to query the database.
```{r keytypes}
keytypes(org.Hs.eg.db)
```
Select all genes with `GO:0007411`.
```{r go_to_entrez}
go_to_entrez <- select(org.Hs.eg.db,
keys = "GO:0007411",
columns = "ENTREZID",
keytype = "GO")
axon_gene <- unique(go_to_entrez$ENTREZID)
length(axon_gene)
```
Note that we can also use `select` on `GO.db` to fetch more information on `GO:0007411`.
```{r select_go_db}
select(GO.db,
keys = "GO:0007411",
columns = columns(GO.db),
keytype = "GOID")
```
To perform the GOEA we need to create a gene background called the `universe` and we will use all genes with a GO term. Normally the `universe` should be the list of genes that were actually assayed in your transcriptome analysis.
```{r universe}
all_go_terms <- keys(org.Hs.eg.db, keytype = "GO")
all_go <- select(org.Hs.eg.db, keys = all_go_terms, columns = c("ENTREZID", "GO"), keytype = "GO")
universe <- unique(all_go$ENTREZID)
length(universe)
```
The function `hyperGTest` will perform the GOEA based on a set of parameters; in this example, we are testing for the over-representation of biological process (BP) terms and using a p-value cutoff of 0.001 or less.
```{r hypergeometric_test}
params <- new('GOHyperGParams',
geneIds = axon_gene,
universeGeneIds = universe,
ontology = 'BP',
pvalueCutoff = 0.001,
conditional = FALSE,
testDirection = 'over',
annotation = "org.Hs.eg.db"
)
my_test <- hyperGTest(params)
my_test
```
Use `summary` to get a summary of the results. The summary contains the `GOID`, `Pvalue`, `OddsRatio`, `ExpCount`, `Count`, and `Size`.
* `ExpCount` is the expected count
* `Count` is how many instances of that term were actually observed in your gene list
* `Size` is the number that could have been found in your gene list if every instance had turned up
```{r summary}
head(summary(my_test))
```
GO terms associated to axons are enriched as expected. Note that the `Count` and `Size` for GO:0007411 is not identical even though we had selected all genes associated with GO:0007411. If we manually select Entrez gene IDs using `org.Hs.egGO`, we still get the same list of genes.
```{r check_go_0007411}
my_df <- as.data.frame(org.Hs.egGO)
my_idx <- my_df$go_id == "GO:0007411"
length(unique(my_df[my_idx, "gene_id"])) == length(axon_gene)
```
This is probably because genes containing GO terms that are descendants of GO:0007411 are also included.
## Relations in the Gene Ontology
Gene ontologies (GO) are structured as a directed acyclic graph with GO terms as nodes and their relationships as edges. The most commonly used relationships in GO are:
* is a
* part of
* has part
* regulates
* negatively regulates
* positively regulates
Below is [an example](http://geneontology.org/docs/ontology-relations/) of the `is a` and `part of` relationships.
![](assets/diag-graph-example.gif)
We can use the `GOBPCHILDREN` [annotation map or Bimap](https://rdrr.io/bioc/AnnotationDbi/man/Bimap.html) from the [GO.db package](https://www.bioconductor.org/packages/release/data/annotation/manuals/GO.db/man/GO.db.pdf) to retrieve all descendants of GO:0007411.
```{r bp_children}
bp_children <- as.list(GOBPCHILDREN)
bp_children[["GO:0007411"]]
```
We can include these GO terms in our `select` query.
```{r fetch_children}
my_keys <- c("GO:0007411", bp_children[["GO:0007411"]])
go_to_entrez_children <- select(org.Hs.eg.db,
keys = my_keys,
columns = "ENTREZID",
keytype = "GO")
length(unique(go_to_entrez_children$ENTREZID))
```
We are still missing some genes that are associated with GO:0007411, which is probably due to the exclusion of descendants in the descendants of GO:0007411. We need to recursively search all terms that are descendants of GO:0007411.
```{r hypergeometric_test_take_two}
params <- new('GOHyperGParams',
geneIds = unique(go_to_entrez_children$ENTREZID),
universeGeneIds = universe,
ontology = 'BP',
pvalueCutoff = 0.001,
conditional = FALSE,
testDirection = 'over',
annotation = "org.Hs.eg.db"
)
my_test2 <- hyperGTest(params)
head(summary(my_test2))
```
## clusterProfiler
The `enrichGO` function in the `clusterProfiler` package can also perform a GOEA with FDR control.
```{r enrich_go}
my_test3 <- enrichGO(axon_gene,
org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pvalueCutoff = 0.001,
pAdjustMethod = "BH",
universe,
qvalueCutoff = 0.1,
minGSSize = 10,
maxGSSize = 500,
readable = FALSE)
head(data.frame(my_test3))
```
The output now includes adjusted p-values and the geneIDs that are associated with a given GO ID. Note that the full list of genes for GO:0007411 is 276 again.
In addition to performing the GOEA, `clusterProfiler` also has some nice plotting functions.
Bar plot showing each enriched GO term coloured by the adjusted p-value.
```{r barplot}
barplot(my_test3, showCategory=10)
```
Dot plot showing each enriched GO term with associated statistics.
```{r dotplot}
dotplot(my_test3, showCategory=10)
```
Heat plot showing the enriched GO terms on the y-axis and the genes on the x-axis. Genes with the associated GO term are highlighted.
```{r heatplot}
heatplot(my_test3, showCategory=10)
```
Enrichment map organises enriched terms into a network with edges connecting overlapping gene sets.
```{r emapplot}
emapplot(my_test3, showCategory = 10)
```
`goplot` shows the gene ontology graph with the enriched GO terms highlighted.
```{r goplot, fig.height=10, fig.width=12}
goplot(my_test3)
```
## What if my gene list IDs are not Entrez gene IDs?
We can use the `biomaRt` package for converting between different gene identifiers and in this example, we will convert Ensembl gene IDs to Entrez gene IDs.
```{r load_biomart}
if (!"biomaRt" %in% installed.packages()){
BiocManager::install("biomaRt")
}
library("biomaRt")
```
We will fetch every Ensembl gene ID and randomly select 10 IDs to convert into Entrez gene IDs.
```{r fetch_all_ensembl}
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
my_chr <- c(1:22, 'M', 'X', 'Y')
my_ensembl_gene <- getBM(attributes = 'ensembl_gene_id',
filters = 'chromosome_name',
values = my_chr,
mart = ensembl)
head(my_ensembl_gene)
```
Select 10 Ensembl gene IDs.
```{r sample}
set.seed(1984)
to_convert <- sample(x = my_ensembl_gene$ensembl_gene_id, size = 10, replace = FALSE)
```
Now to convert the IDs.
```{r to_entrez}
to_entrez <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id'),
filters = 'ensembl_gene_id',
values = to_convert,
mart = ensembl)
to_entrez
```
Note that not all Ensembl IDs have Entrez IDs. We can find out how many Ensembl IDs do not have Entrez IDs.
```{r ensembl_to_entrez}
my_entrez_gene <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id'),
filters = 'ensembl_gene_id',
values = my_ensembl_gene,
mart = ensembl)
table(is.na(my_entrez_gene$entrezgene_id))
```
`r sum(is.na(my_entrez_gene$entrezgene_id))` out of `r length(my_entrez_gene$entrezgene_id)` Ensembl gene IDs do not have corresponding Entrez gene IDs. To learn more about the missing Entrez ID values from the Ensembl conversion see [this useful post](https://www.biostars.org/p/16505/) on BioStars.