analysis/3_Enrichment_HomVsHet.Rmd

---
title: "Enrichment Analysis: Homozygous Vs Heterozygous Mutants"
author: "Steve Pederson"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(
    autodep = TRUE,
    echo = TRUE,
    warning = FALSE,
    message = FALSE,
    fig.align = "center"
)
```


# Setup

```{r loadPackages}
library(tidyverse)
library(magrittr)
library(edgeR)
library(scales)
library(pander)
library(msigdbr)
library(AnnotationDbi)
library(RColorBrewer)
library(ngsReports)
library(fgsea)
library(metap)
```

```{r setOpts}
theme_set(theme_bw())
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
panderOptions("big.mark", ",")
```


```{r samplesAndLabels}
samples <- here::here("data/samples.csv") %>%
  read_csv() %>%
  distinct(sampleName, .keep_all = TRUE) %>%
  dplyr::select(sample = sampleName, sampleID, genotype) %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
    mutant = genotype %in% c("Het", "Hom"),
    homozygous = genotype == "Hom"
  )
genoCols <- samples$genotype %>%
  levels() %>%
  length() %>%
  brewer.pal("Set1") %>%
  setNames(levels(samples$genotype))
```


```{r loadFits}
dgeList <- here::here("data/dgeList.rds") %>% read_rds()
cpmPostNorm <- here::here("data/cpmPostNorm.rds") %>% read_rds()
entrezGenes <- dgeList$genes %>%
  dplyr::filter(!is.na(entrezid)) %>%
  unnest(entrezid) %>%
  dplyr::rename(entrez_gene = entrezid)
```

```{r topTables}
deTable <- here::here("output", "psen2HomVsHet.csv") %>% 
  read_csv() %>%
  mutate(
    entrezid = dgeList$genes$entrezid[gene_id]
  )
```


```{r formatP}
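# Format p-values for tables: scientific notation below the threshold m,
# fixed notation with four decimal places otherwise
formatP <- function(p, m = 0.0001){
  out <- rep("", length(p))
  out[p < m] <- sprintf("%.2e", p[p<m])
  out[p >= m] <- sprintf("%.4f", p[p>=m])
  out
}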
```


# Introduction

Enrichment analysis for this dataset presents some challenges.
Despite normalisation to account for gene length and GC content, some bias still appeared to be present in the final results.
In addition, the confounding of incomplete rRNA removal with genotype may lead to other distortions in both DE genes and ranking statistics.

As the list of DE genes for this comparison was small ($n_{\text{DE}} = `r nrow(dplyr::filter(deTable, DE))`$), enrichment testing was only performed using ranked-list approaches.
Testing for enrichment *with ranked lists* was performed using:

1. `fry` as this can take into account inter-gene correlations. Values supplied will be fitted values for each gene/sample as these have been corrected for GC and length biases.
2. `camera`, which also accommodates inter-gene correlations. Values supplied will be fitted values for each gene/sample as these have been corrected for GC and length biases.
3. `fgsea`, which is an R implementation of GSEA. This approach simply takes a ranked list and doesn't directly account for correlations. However, the ranked list will be derived from analysis using CQN normalisation, so it should be robust to these technical artefacts.

In the case of `camera`, inter-gene correlations will be calculated for each gene-set prior to analysis to ensure more conservative p-values are obtained.
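
As a rough illustration of why allowing for inter-gene correlation gives more conservative results, the variance of the mean of $m$ statistics with mean pairwise correlation $\bar{\rho}$ is inflated by a factor of $1 + (m - 1)\bar{\rho}$ relative to the independent case. The sketch below uses only base R; the helper name `vif` and the example values are illustrative assumptions and are not part of this analysis.

```r
# Variance inflation factor for the mean of m equally-correlated statistics
# (hypothetical helper, for illustration only)
vif <- function(m, rho) {
  1 + (m - 1) * rho
}

set_sizes <- c(10, 50, 200)

# No correlation: no inflation for any set size
vif_indep <- vif(set_sizes, rho = 0)

# Even a small mean correlation inflates the variance for large gene sets
vif_cor <- vif(set_sizes, rho = 0.01)
```

Larger inflation factors correspond to wider null distributions for the set-level statistic, and hence larger p-values for the same observed shift.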

# Databases used for testing

Data was sourced using the `msigdbr` package.
The initial database used for testing was the Hallmark Gene Sets, with mappings from gene-set to EntrezGene IDs performed by the package authors. 

## Hallmark Gene Sets

```{r hm}
hm <- msigdbr("Danio rerio", category = "H")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
hmByGene <- hm %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
hmByID <- hm %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```

Mappings are required from gene to pathway, and Ensembl identifiers were used for this, based on the previously used annotations (Ensembl Release 98).
A total of `r comma(length(hmByGene))` Ensembl IDs were mapped to pathways from the hallmark gene sets.
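
The two lookup lists built above (gene to gene sets, and gene set to genes) can be sketched on a toy table using base R only. The object names and identifiers below are made up for illustration, and `split()` on a vector is used here as a base R stand-in for the `split()` and `lapply()` steps in the chunk above.

```r
# Toy long-format table: one row per gene-set/gene pair (made-up IDs)
toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B"),
  gene_id = c("gene1", "gene2", "gene2", "gene3"),
  stringsAsFactors = FALSE
)

# Gene -> gene sets containing that gene
toyByGene <- split(toy$gs_name, toy$gene_id)

# Gene set -> genes in that set
toyByID <- split(toy$gene_id, toy$gs_name)

# Number of genes mapped to at least one gene set
nMapped <- length(toyByGene)
```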

## KEGG Gene Sets


```{r kg}
kg <- msigdbr("Danio rerio", category = "C2", subcategory = "CP:KEGG")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
kgByGene <- kg  %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
kgByID <- kg  %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```

The same mapping process was applied to KEGG gene sets.
A total of `r comma(length(kgByGene))` Ensembl IDs were mapped to pathways from the KEGG gene sets.

## Gene Ontology Gene Sets

```{r goSummaries}
goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
  readRDS() %>%
  mutate(
    Term = Term(id),
    gs_name = Term %>% str_to_upper() %>% str_replace_all("[ -]", "_"),
    gs_name = paste0("GO_", gs_name)
    )
minPath <- 3
```


```{r go}
go <- msigdbr("Danio rerio", category = "C5") %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  left_join(goSummaries) %>% 
  dplyr::filter(shortest_path >= minPath) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
goByGene <- go %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
goByID <- go %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```


For analysis of gene-sets from the GO database, gene-sets were restricted to those with `r minPath` or more steps back to the ontology root terms.
A total of `r comma(length(goByGene))` Ensembl IDs were mapped to pathways from the restricted database of `r comma(length(goByID))` GO gene sets.
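
A minimal base R sketch of this restriction is given below, using made-up GO terms and path lengths. The names `toyGo` and `minSteps` are hypothetical, with `minSteps` playing the role of `minPath`. Terms with no path information are dropped by the comparison, which is also how `dplyr::filter()` treats `NA`.

```r
# Made-up GO terms with their shortest path (in steps) to an ontology root
toyGo <- data.frame(
  gs_name = c("GO_TERM_A", "GO_TERM_B", "GO_TERM_C", "GO_TERM_D"),
  shortest_path = c(1L, 3L, 5L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical threshold, mirroring minPath
minSteps <- 3

# Keep only terms at least minSteps away from the root; NA rows are dropped
kept <- subset(toyGo, shortest_path >= minSteps)
keptNames <- kept$gs_name
```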

```{r gsSizes}
gsSizes <- bind_rows(hm, kg, go) %>% 
  dplyr::select(gs_name, gene_symbol, gene_id) %>% 
  chop(c(gene_symbol, gene_id)) %>%
  mutate(gs_size = vapply(gene_symbol, length, integer(1)))
```


# Enrichment Testing on Ranked Lists


```{r rnk}
rnk <- structure(
  -sign(deTable$logFC)*log10(deTable$PValue), 
  names = deTable$gene_id
) %>% sort()
np <- 1e5
```


Genes were ranked by $-\text{sign}(\log \text{FC}) \times \log_{10}(p)$ for approaches which required a ranked list.
Each approach was first applied individually, with the results then combined into the final integrated set of results.
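
The behaviour of this ranking statistic can be seen on a small made-up table: the sign comes from the direction of change and the magnitude from the strength of evidence, so strongly down-regulated genes sit at one end of the list and strongly up-regulated genes at the other. The values below are invented for illustration only.

```r
# Made-up differential expression results
toyDE <- data.frame(
  gene_id = c("gene1", "gene2", "gene3", "gene4"),
  logFC   = c(2.0, -1.5, 0.3, -0.1),
  PValue  = c(1e-6, 1e-4, 0.2, 0.9),
  stringsAsFactors = FALSE
)

# Signed ranking statistic: direction from logFC, magnitude from -log10(p)
toyRnk <- setNames(
  -sign(toyDE$logFC) * log10(toyDE$PValue),
  toyDE$gene_id
)

# Sorted from most down-regulated to most up-regulated
toyRnkSorted <- sort(toyRnk)
```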


## Hallmark Gene Sets

```{r hmFry}
hmFry <- cpmPostNorm %>%
  fry(
    index = hmByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For analysis under `camera`, inter-gene correlations were calculated for each gene set, giving a more conservative result.

```{r hmCamera}
hmCamera <- cpmPostNorm %>%
  camera(
    index = hmByID,
    design = dgeList$design,
    contrast = "homozygous",
    inter.gene.cor = NULL
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For GSEA using the ranked list, `r comma(np)` permutations were conducted.

```{r hmGsea}
hmGsea <- fgsea(
  pathways = hmByID, 
  stats = rnk,
  nperm = np
) %>%
  as_tibble() %>%
  dplyr::rename(gs_name = pathway, PValue = pval) %>%
  arrange(PValue)
```

The p-values from all three analyses were then combined using Wilkinson's method.
For a conservative approach, under $m$ tests, the $(m - 1)^{\text{th}}$ smallest p-value was chosen.
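
As a sketch of the idea behind this combination, the $r^{\text{th}}$ smallest of $n$ independent uniform p-values follows a Beta$(r, n - r + 1)$ distribution under the null, so a combined p-value can be read from the Beta distribution function. The helper `wilkinson_sketch` and the p-values below are hypothetical and assume independent tests; the analysis itself uses `wilkinsonp()` from `metap`.

```r
# Combined p-value from the r-th smallest of a set of independent p-values
# (hypothetical helper, for illustration only)
wilkinson_sketch <- function(p, r) {
  p <- p[!is.na(p)]
  n <- length(p)
  stopifnot(r >= 1, r <= n)
  # r-th order statistic of n uniform values is Beta(r, n - r + 1)
  pbeta(sort(p)[[r]], r, n - r + 1)
}

# Made-up p-values from three methods
toyP <- c(fry = 0.01, camera = 0.2, gsea = 0.04)

# With m = 3 tests, r = m - 1 = 2 uses the second smallest p-value (0.04)
combined <- wilkinson_sketch(toyP, r = length(toyP) - 1)

# Equivalent binomial form: probability that at least 2 of 3 p-values <= 0.04
binomForm <- sum(dbinom(2:3, size = 3, prob = 0.04))
```

Choosing $r = m - 1$ rather than $r = 1$ means a gene set needs support from at least two of the three methods before the combined p-value becomes small.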

```{r hmMeta}
hmMeta <- hmFry %>%
  dplyr::select(gs_name, fry = PValue) %>%
  left_join(
    dplyr::select(hmCamera, gs_name, camera = PValue)
  ) %>%
  left_join(
    dplyr::select(hmGsea, gs_name, gsea = PValue)
  ) %>%
  nest(p = one_of(c("fry", "camera", "gsea"))) %>%
  mutate(
    n_p = vapply(p, function(x){sum(!is.na(unlist(x)))}, integer(1)), 
    wilkinson_p = vapply(p, function(x){
      x <- unlist(x)
      x <- x[!is.na(x)]
      wilkinsonp(x, length(x) - 1)$p
    }, numeric(1)),
    FDR = p.adjust(wilkinson_p, "fdr"), 
    adjP = p.adjust(wilkinson_p, "bonferroni")
  ) %>% 
  arrange(wilkinson_p) %>% 
  unnest(p) %>%
  left_join(gsSizes) %>%
  mutate(
    DE = lapply(gene_id, intersect, dplyr::filter(deTable, DE)$gene_id),
    DE = lapply(DE, unique),
    nDE = vapply(DE, length, integer(1))
  )
```

```{r}
hmMeta %>%
  dplyr::filter(FDR < 0.1) %>%
  mutate_at(vars(one_of(c("wilkinson_p", "FDR", "adjP"))), formatP) %>%
  dplyr::select(`Gene Set` = gs_name, `Number DE` = nDE, `Set Size` = gs_size, `Wilkinson~p~` = wilkinson_p, `p~FDR~` = FDR, `p~bonf~` = adjP) %>%
  pander(
    caption = "Results from combining all above approaches for the Hallmark Gene Sets. All terms are significant to an FDR of 0.1, with none passing the initial filter of FDR < 0.05",
    justify = "lrrrrr"
  )
```

## KEGG Gene Sets

```{r kgFry}
kgFry <- cpmPostNorm %>%
  fry(
    index = kgByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For analysis under `camera`, inter-gene correlations were calculated for each gene set, giving a more conservative result.

```{r kgCamera}
kgCamera <- cpmPostNorm %>%
  camera(
    index = kgByID,
    design = dgeList$design,
    contrast = "homozygous",
    inter.gene.cor = NULL
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For GSEA using the ranked list, `r comma(np)` permutations were conducted.

```{r kgGsea}
kgGsea <- fgsea(
  pathways = kgByID, 
  stats = rnk,
  nperm = np
) %>%
  as_tibble() %>%
  dplyr::rename(gs_name = pathway, PValue = pval) %>%
  arrange(PValue)
```

The p-values from all three analyses were then combined using Wilkinson's method.
For a conservative approach, under $m$ tests, the $(m - 1)^{\text{th}}$ smallest p-value was chosen.

```{r kgMeta}
kgMeta <- kgFry %>%
  dplyr::select(gs_name, fry = PValue) %>%
  left_join(
    dplyr::select(kgCamera, gs_name, camera = PValue)
  ) %>%
  left_join(
    dplyr::select(kgGsea, gs_name, gsea = PValue)
  )  %>%
  nest(p = one_of(c("fry", "camera", "gsea"))) %>%
  mutate(
    n_p = vapply(p, function(x){sum(!is.na(unlist(x)))}, integer(1)), 
    wilkinson_p = vapply(p, function(x){
      x <- unlist(x)
      x <- x[!is.na(x)]
      wilkinsonp(x, length(x) - 1)$p
    }, numeric(1)),
    FDR = p.adjust(wilkinson_p, "fdr"), 
    adjP = p.adjust(wilkinson_p, "bonferroni")
  ) %>% 
  arrange(wilkinson_p) %>% 
  unnest(p) %>%
  left_join(gsSizes) %>%
  mutate(
    DE = lapply(gene_id, intersect, dplyr::filter(deTable, DE)$gene_id),
    DE = lapply(DE, unique),
    nDE = vapply(DE, length, integer(1))
  )
```

```{r}
kgMeta %>%
  dplyr::filter(FDR < 0.01) %>%
  mutate_at(vars(one_of(c("wilkinson_p", "FDR", "adjP"))), formatP) %>%
  dplyr::select(`Gene Set` = gs_name, `Number DE` = nDE, `Set Size` = gs_size, `Wilkinson~p~` = wilkinson_p, `p~FDR~` = FDR, `p~bonf~` = adjP) %>%
  pander(
    caption = "Results from combining all above approaches for the KEGG Gene Sets. All terms are significant to an FDR of 0.01.",
    justify = "lrrrrr"
  )
```

## GO Gene Sets

```{r goFry}
goFry <- cpmPostNorm %>%
  fry(
    index = goByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For analysis under `camera`, inter-gene correlations were calculated for each gene set, giving a more conservative result.

```{r goCamera}
goCamera <- cpmPostNorm %>%
  camera(
    index = goByID,
    design = dgeList$design,
    contrast = "homozygous",
    inter.gene.cor = NULL
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

For GSEA using the ranked list, `r comma(np)` permutations were conducted.

```{r goGsea}
goGsea <- fgsea(
  pathways = goByID, 
  stats = rnk,
  nperm = np
) %>%
  as_tibble() %>%
  dplyr::rename(gs_name = pathway, PValue = pval) %>%
  arrange(PValue)
```

The p-values from all three analyses were then combined using Wilkinson's method.
For a conservative approach, under $m$ tests, the $(m - 1)^{\text{th}}$ smallest p-value was chosen.

```{r goMeta}
goMeta <- goFry %>%
  dplyr::select(gs_name, fry = PValue) %>%
  left_join(
    dplyr::select(goCamera, gs_name, camera = PValue)
  ) %>%
  left_join(
    dplyr::select(goGsea, gs_name, gsea = PValue)
  )  %>%
  nest(p = one_of(c("fry", "camera", "gsea"))) %>%
  mutate(
    n_p = vapply(p, function(x){sum(!is.na(unlist(x)))}, integer(1)), 
    wilkinson_p = vapply(p, function(x){
      x <- unlist(x)
      x <- x[!is.na(x)]
      wilkinsonp(x, length(x) - 1)$p
    }, numeric(1)),
    FDR = p.adjust(wilkinson_p, "fdr"), 
    adjP = p.adjust(wilkinson_p, "bonferroni")
  ) %>% 
  arrange(wilkinson_p) %>% 
  unnest(p) %>%
  left_join(gsSizes) %>%
  mutate(
    DE = lapply(gene_id, intersect, dplyr::filter(deTable, DE)$gene_id),
    DE = lapply(DE, unique),
    nDE = vapply(DE, length, integer(1))
  )
```

```{r}
goMeta %>%
  dplyr::filter(adjP < 0.05) %>%
  mutate_at(vars(one_of(c("wilkinson_p", "FDR", "adjP"))), formatP) %>%
  dplyr::select(`Gene Set` = gs_name, `Number DE` = nDE, `Set Size` = gs_size, `Wilkinson~p~` = wilkinson_p, `p~FDR~` = FDR, `p~bonf~` = adjP) %>%
  pander(
    caption = "Results from combining all above approaches for the GO Gene Sets. All terms are significant to a bonferroni-adjusted p-value < 0.05.",
    justify = "lrrrrr"
  )
```

# Data Export

All enriched gene sets with an FDR-adjusted p-value < 0.05 were exported as a single CSV file.

```{r}
add_prefix <- function(x, pre = "p_"){
  paste0(pre, x)
}
bind_rows(
  hmMeta,
  kgMeta,
  goMeta
) %>%
  dplyr::filter(FDR < 0.05) %>%
  mutate(
    DE = lapply(DE, function(x){dplyr::filter(deTable, gene_id %in% x)$gene_name}),
    DE = lapply(DE, unique),
    DE = vapply(DE, paste, character(1), collapse = ";")
  ) %>%
  arrange(wilkinson_p) %>%
  dplyr::select(
    `Gene Set` = gs_name, 
    `Nbr Detected Genes` = gs_size, 
    `Nbr DE Genes` = nDE, 
    combined = wilkinson_p, FDR, 
    fry, camera, gsea, 
    `DE Genes` = DE
  ) %>%
  rename_at(
    vars(combined, fry, camera, gsea),
    add_prefix
  ) %>%
  write_csv(
    here::here("output", "Enrichment_Hom_V_Het.csv")
  )
```