analysis/3_Enrichment_HomVsHet.Rmd

---
title: "Enrichment Analysis: Homozygous Vs Heterozygous Mutants"
author: "Steve Pederson"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(
    autodep = TRUE,
    echo = TRUE,
    warning = FALSE,
    message = FALSE,
    fig.align = "center"
)
```


# Setup

```{r loadPackages}
library(tidyverse)
library(magrittr)
library(edgeR)
library(scales)
library(pander)
library(msigdbr)
library(AnnotationDbi)
library(RColorBrewer)
library(ngsReports)
```

```{r setOpts}
theme_set(theme_bw())
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
panderOptions("big.mark", ",")
```


```{r samplesAndLabels}
samples <- here::here("data/samples.csv") %>%
  read_csv() %>%
  distinct(sampleName, .keep_all = TRUE) %>%
  dplyr::select(sample = sampleName, sampleID, genotype) %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
    mutant = genotype %in% c("Het", "Hom"),
    homozygous = genotype == "Hom"
  )
genoCols <- samples$genotype %>%
  levels() %>%
  length() %>%
  brewer.pal("Set1") %>%
  setNames(levels(samples$genotype))
```


```{r loadFits}
dgeList <- here::here("data/dgeList.rds") %>% read_rds()
cpmPostNorm <- here::here("data/cpmPostNorm.rds") %>% read_rds()
entrezGenes <- dgeList$genes %>%
  dplyr::filter(!is.na(entrezid)) %>%
  unnest(entrezid) %>%
  dplyr::rename(entrez_gene = entrezid)
```

```{r topTables}
deTable <- here::here("output", "psen2HomVsHet.csv") %>% 
  read_csv() %>%
  mutate(
    entrezid = dgeList$genes$entrezid[gene_id]
  )
```


```{r formatP}
formatP <- function(p, m = 0.0001){
  out <- rep("", length(p))
  out[p < m] <- sprintf("%.2e", p[p<m])
  out[p >= m] <- sprintf("%.4f", p[p>=m])
  out
}
```


# Introduction

Enrichment analysis for this dataset present some challenges.
Despite normalisation to account for gene length and GC bias, some appeared to still be present in the final results.
In addition, the confounding of incomplete rRNA removal with genotype may lead to other distortions in both DE genes and ranking statistics.

As the list of DE genes for this comparison was small ($n_{\text{DE}} = `r nrow(dplyr::filter(deTable, DE))`$), enrichment testing was only performed using ranked-list approaches.
For enrichment within larger gene lists,  `fry` can take into account inter-gene correlations. 
Values supplied will be logCPM for each gene/sample after being adjusted for GC and length biases.

# Databases used for testing

Data was sourced using the `msigdbr` package.
The initial database used for testing was the Hallmark Gene Sets, with mappings from gene-set to EntrezGene IDs performed by the package authors. 

## Hallmark Gene Sets

```{r hm}
hm <- msigdbr("Danio rerio", category = "H")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
hmByGene <- hm %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
hmByID <- hm %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```

Mappings are required from gene to pathway, and Ensembl identifiers were used to map from gene to pathway, based on the mappings in the previously used annotations (Ensembl Release 98).
A total of `r comma(length(hmByGene))` Ensembl IDs were mapped to pathways from the hallmark gene sets.

## KEGG Gene Sets


```{r kg}
kg <- msigdbr("Danio rerio", category = "C2", subcategory = "CP:KEGG")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
kgByGene <- kg  %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
kgByID <- kg  %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```

The same mapping process was applied to KEGG gene sets.
A total of `r comma(length(kgByGene))` Ensembl IDs were mapped to pathways from the KEGG gene sets.

## Gene Ontology Gene Sets

```{r goSummaries}
goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
  readRDS() %>%
  mutate(
    Term = Term(id),
    gs_name = Term %>% str_to_upper() %>% str_replace_all("[ -]", "_"),
    gs_name = paste0("GO_", gs_name)
    )
minPath <- 3
```


```{r go}
go <- msigdbr("Danio rerio", category = "C5") %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  left_join(goSummaries) %>% 
  dplyr::filter(shortest_path >= minPath) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
goByGene <- go %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
goByID <- go %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
```


For analysis of gene-sets from the GO database, gene-sets were restricted to those with `r minPath` or more steps back to the ontology root terms.
A total of `r comma(length(goByGene))` Ensembl IDs were mapped to pathways from restricted database of `r comma(length(goByID))` GO gene sets.

```{r gsSizes}
gsSizes <- bind_rows(hm, kg, go) %>% 
  dplyr::select(gs_name, gene_symbol, gene_id) %>% 
  chop(c(gene_symbol, gene_id)) %>%
  mutate(
    gs_size = vapply(gene_symbol, length, integer(1)),
    de_id = lapply(
      X = gene_id, 
      FUN = intersect, 
      y = dplyr::filter(deTable, DE)$gene_id
      ),
    de_size = vapply(de_id, length, integer(1))
  )
```


# Enrichment Testing on Ranked Lists


## Hallmark Gene Sets

```{r hmFry}
hmFry <- cpmPostNorm %>%
  fry(
    index = hmByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

Using the directional results from `fry`, no Hallmark Gene Sets were detected as different between mutant genotypes, with a minimum FDR being found as `r percent(min(hmFry$FDR))`.
Similarly for a non-directional approach, no Hallmark Gene Sets were detected as different, with a minimum `FDR.Mixed` being `r percent(min(hmFry$FDR.Mixed))`.

## KEGG Gene Sets

```{r kgFry}
kgFry <-cpmPostNorm%>%
  fry(
    index = kgByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

Using the directional results from `fry`, no KEGG Gene Sets were detected as different between mutant genotypes, with a minimum FDR being found as `r percent(min(kgFry$FDR))`.

```{r}
kgFry %>% 
  arrange(PValue.Mixed) %>%
  dplyr::select(gs_name, NGenes, contains("Mixed")) %>%
  set_names(str_remove(names(.), ".Mixed")) %>%
  dplyr::filter(FDR < 0.05) %>%
  dplyr::select(
    `KEGG Gene Set` = gs_name,
    `Expressed Genes` = NGenes, 
    PValue, FDR
  ) %>%
  mutate_at(
    vars(PValue, FDR), formatP
  ) %>%
  pander(
    caption = "The only KEGG Gene Set detected as different between mutants."
  )
```

Using a non-directional (i.e. Mixed) approach, one KEGG gene set was detected as different.
However, an FDR of 0.047 in this instance does not provide much confidence to this observation.

## GO Gene Sets

```{r goFry}
goFry <- cpmPostNorm %>%
  fry(
    index = goByID,
    design = dgeList$design,
    contrast = "homozygous",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()
```

No GO Gene Sets were detected as different between mutant genotypes, with a minimum FDR being found as `r percent(min(goFry$FDR))`.

```{r}
goFry %>% 
  arrange(PValue.Mixed) %>%
  dplyr::select(gs_name, NGenes, contains("Mixed")) %>%
  set_names(str_remove(names(.), ".Mixed")) %>%
  dplyr::filter(FDR < 0.05) %>%
  dplyr::select(
    `GO Gene Set` = gs_name,
    `Expressed Genes` = NGenes, 
    PValue, FDR
  ) %>%
  mutate_at(
    vars(PValue, FDR), formatP
  ) %>%
  pander(
    caption = "The only GO Gene Set detected as different between mutants."
  )
```

Again, using a non-directional (i.e. Mixed) approach, one GO gene set was detected as different.
However, an FDR of 0.046 in this instance does not provide much confidence to this observation.

# Data Export

All enriched gene sets terms with an FDR adjusted p-value < 0.05 were exported as a single csv file.

```{r}
bind_rows(
  hmFry,
  kgFry,
  goFry
) %>%
  dplyr::filter(FDR.Mixed < 0.05) %>%
  left_join(gsSizes) %>%
  dplyr::select(
    gs_name, NGenes, 
    PValue = PValue.Mixed, 
    FDR = FDR.Mixed,
    gene_symbol, de_size
  ) %>%
  mutate(
    DE = lapply(
      X = gene_symbol,
      FUN =  intersect, 
      y = dplyr::filter(deTable, DE)$gene_name
    ),
    DE = vapply(DE, paste, character(1), collapse = ";")
  ) %>%
  dplyr::select(
    `Gene Set` = gs_name, 
    `Nbr Detected Genes` = NGenes, 
    `Nbr DE Genes` = de_size, 
     PValue, FDR, 
    `DE Genes` = DE
  ) %>%
  write_csv(
    here::here("output", "Enrichment_Hom_V_Het.csv")
  )
```