analysis/go_enrichment.Rmd

---
title: "Gene Ontology Enrichment Analysis"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Getting started

The Gene Ontology Enrichment Analysis (GOEA) is a typical analysis carried out on transcriptome data. Online tools for performing a GOEA include [DAVID](https://david.ncifcrf.gov/), [Enrichr](https://maayanlab.cloud/Enrichr/), and [PANTHER](http://www.pantherdb.org/) just to name a few. While web-based tools are easy to use, it becomes tedious when you have to analyse (or re-analyse) lots of datasets. Therefore, it is preferable to use a programmatic approach and in this post we will check out some Bioconductor packages that allow to perform a GOEA.

First install the following packages, if necessary, and then load them.

```{r install_and_or_load_packages, message=FALSE, warning=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

my_packages <- c("clusterProfiler",
                 "GOstats",
                 "GO.db",
                 "org.Hs.eg.db")

to_install <- my_packages[!my_packages %in% installed.packages()]

# install missing packages, if any
if (length(to_install) > 0){
  BiocManager::install(pkgs = to_install)
}

# load all packages and suppress output of sapply
invisible(sapply(my_packages, library, character.only = TRUE))
```

Create a positive control where the gene set are composed of genes that are all associated with `GO:0007411` (axon guidance); we will use the `org.Hs.eg.db` package to achieve this based on [the vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf).

```{r ls_org_hs_eg_db, include=FALSE}
ls("package:org.Hs.eg.db")
```

Methods that can be applied to `AnnotationDbi` objects such as `org.Hs.eg.db` include: `columns`, `keytypes`, `keys`, and `select`.

Use `columns` to find out what data can be retrived using `select`.

```{r columns}
columns(org.Hs.eg.db)
```

Use `keytypes` to find out what fields we can use as keys to query the database.

```{r keytypes}
keytypes(org.Hs.eg.db)
```

Select all genes with `GO:0007411`.

```{r go_to_entrez}
go_to_entrez <- select(org.Hs.eg.db,
                       keys = "GO:0007411",
                       columns = "ENTREZID",
                       keytype = "GO")

axon_gene <- unique(go_to_entrez$ENTREZID)
length(axon_gene)
```

Note that we can also use `select` on `GO.db` to fetch more information on `GO:0007411`.

```{r select_go_db}
select(GO.db,
       keys = "GO:0007411",
       columns = columns(GO.db),
       keytype = "GOID")
```

To perform the GOEA we need to create a gene background called the `universe` and we will use all genes with a GO term. Normally the `universe` should be the list of genes that were actually assayed in your transcriptome analysis.

```{r universe}
all_go_terms <- keys(org.Hs.eg.db, keytype = "GO")
all_go <- select(org.Hs.eg.db, keys = all_go_terms, columns = c("ENTREZID", "GO"), keytype = "GO")
universe <- unique(all_go$ENTREZID)
length(universe)
```

The function `hyperGTest` will perform the GOEA based on a set of parameters; in this example, we are testing for the over-representation of biological process (BP) terms and using a p-value cutoff of 0.001 or less.

```{r hypergeometric_test}
params <- new('GOHyperGParams',
              geneIds = axon_gene,
              universeGeneIds = universe,
              ontology = 'BP',
              pvalueCutoff = 0.001,
              conditional = FALSE,
              testDirection = 'over',
              annotation = "org.Hs.eg.db"
             )
 
my_test <- hyperGTest(params)
my_test
```

Use `summary` to get a summary of the results. The summary contains the `GOID`, `Pvalue`, `OddsRatio`, `ExpCount`, `Count`, and `Size`.

* `ExpCount` is the expected count
* `Count` is how many instances of that term were actually observed in your gene list
* `Size` is the number that could have been found in your gene list if every instance had turned up

```{r summary}
head(summary(my_test))
```

GO terms associated to axons are enriched as expected. Note that the `Count` and `Size` for GO:0007411 is not identical even though we had selected all genes associated with GO:0007411. If we manually select Entrez gene IDs using `org.Hs.egGO`, we still get the same list of genes.

```{r check_go_0007411}
my_df <- as.data.frame(org.Hs.egGO)
my_idx <- my_df$go_id == "GO:0007411"
length(unique(my_df[my_idx, "gene_id"])) == length(axon_gene)
```

This is probably because genes containing GO terms that are descendants of GO:0007411 are also included.

## Relations in the Gene Ontology

Gene ontologies (GO) are structured as a directed acyclic graph with GO terms as nodes and their relationships as edges. The most commonly used relationships in GO are:

* is a
* part of
* has part
* regulates
* negatively regulates
* positively regulates

Below is [an example](http://geneontology.org/docs/ontology-relations/) of the `is a` and `part of` relationships.

![](assets/diag-graph-example.gif)

We can use the `GOBPCHILDREN` [annotation map or Bimap](https://rdrr.io/bioc/AnnotationDbi/man/Bimap.html) from the [GO.db package](https://www.bioconductor.org/packages/release/data/annotation/manuals/GO.db/man/GO.db.pdf) to retrieve all descendants of GO:0007411.

```{r bp_children}
bp_children <- as.list(GOBPCHILDREN)
bp_children[["GO:0007411"]]
```

We can include these GO terms in our `select` query.

```{r fetch_children}
my_keys <- c("GO:0007411", bp_children[["GO:0007411"]])

go_to_entrez_children <- select(org.Hs.eg.db,
                                keys = my_keys,
                                columns = "ENTREZID",
                                keytype = "GO")

length(unique(go_to_entrez_children$ENTREZID))
```

We are still missing some genes that are associated with GO:0007411, which is probably due to the exclusion of descendants in the descendants of GO:0007411. We need to recursively search all terms that are descendants of GO:0007411.

```{r hypergeometric_test_take_two}
params <- new('GOHyperGParams',
              geneIds = unique(go_to_entrez_children$ENTREZID),
              universeGeneIds = universe,
              ontology = 'BP',
              pvalueCutoff = 0.001,
              conditional = FALSE,
              testDirection = 'over',
              annotation = "org.Hs.eg.db"
             )
 
my_test2 <- hyperGTest(params)
head(summary(my_test2))
```

## clusterProfiler

The `enrichGO` function in the `clusterProfiler` package can also perform a GOEA with FDR control.

```{r enrich_go}
my_test3 <- enrichGO(axon_gene,
                     org.Hs.eg.db,
                     keyType = "ENTREZID",
                     ont = "BP",
                     pvalueCutoff = 0.001,
                     pAdjustMethod = "BH",
                     universe,
                     qvalueCutoff = 0.1,
                     minGSSize = 10,
                     maxGSSize = 500,
                     readable = FALSE)

head(data.frame(my_test3))
```

The output now includes adjusted p-values and the geneIDs that are associated with a given GO ID. Note that the full list of genes for GO:0007411 is 276 again.

In addition to performing the GOEA, `clusterProfiler` also has some nice plotting functions.

Bar plot showing each enriched GO term coloured by the adjusted p-value.

```{r barplot}
barplot(my_test3, showCategory=10)
```

Dot plot showing each enriched GO term with associated statistics.

```{r dotplot}
dotplot(my_test3, showCategory=10)
```

Heat plot showing the enriched GO terms on the y-axis and the genes on the x-axis. Genes with the associated GO term are highlighted.

```{r heatplot}
heatplot(my_test3, showCategory=10)
```

Enrichment map organises enriched terms into a network with edges connecting overlapping gene sets.

```{r emapplot}
emapplot(my_test3, showCategory = 10)
```

`goplot` shows the gene ontology graph with the enriched GO terms highlighted.

```{r goplot, fig.height=10, fig.width=12}
goplot(my_test3)
```

## What if my gene list IDs are not Entrez gene IDs?

We can use the `biomaRt` package for converting between different gene identifiers and in this example, we will convert Ensembl gene IDs to Entrez gene IDs.

```{r load_biomart}
if (!"biomaRt" %in% installed.packages()){
  BiocManager::install("biomaRt")
}

library("biomaRt")
```

We will fetch every Ensembl gene ID and randomly select 10 IDs to convert into Entrez gene IDs.

```{r fetch_all_ensembl}
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
 
my_chr <- c(1:22, 'M', 'X', 'Y')
my_ensembl_gene <- getBM(attributes = 'ensembl_gene_id',
                         filters = 'chromosome_name',
                         values = my_chr,
                         mart = ensembl)
 
head(my_ensembl_gene)
```

Select 10 Ensembl gene IDs.

```{r sample}
set.seed(1984)
to_convert <- sample(x = my_ensembl_gene$ensembl_gene_id, size = 10, replace = FALSE)
```

Now to convert the IDs.

```{r to_entrez}
to_entrez <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id'),
                   filters = 'ensembl_gene_id',
                   values = to_convert,
                   mart = ensembl)

to_entrez
```

Note that not all Ensembl IDs have Entrez IDs. We can find out how many Ensembl IDs do not have Entrez IDs.

```{r ensembl_to_entrez}
my_entrez_gene <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id'),
                        filters = 'ensembl_gene_id',
                        values = my_ensembl_gene,
                        mart = ensembl)

table(is.na(my_entrez_gene$entrezgene_id))
```

`r sum(is.na(my_entrez_gene$entrezgene_id))` out of `r length(my_entrez_gene$entrezgene_id)` Ensembl gene IDs do not have corresponding Entrez gene IDs. To learn more about the missing Entrez ID values from the Ensembl conversion see [this useful post](https://www.biostars.org/p/16505/) on BioStars.