# Advanced topics in scRNAseq

What we'll cover:

 * Classifying cell types with a reference atlas with `TransferAnchors` from `Seurat`
 * Controlling for Type I error in estimated clusters with `clusterpval`
 * Gene set enrichment analysis with `fgsea`
 * Joint estimation of differential expression and gene set enrichment with `iDEA`
 
(Joselynn: pyscenic? multimodal inference?)

# Gene set enrichment analysis

Gene set enrichment analysis attempts to identify genes that are over-represented in a set of genes that make up pathways associated with various mechanisms. You can think of it like this: if a particular pathway consists of 300 genes, and 280 of those 300 genes are in our list of differentially expressed genes (DEGs), then is that pathway relevant to the differences between our groups of interest?

One of the more basic approaches to answering this question is Overrepresentation Analysis (ORA) (citation: https://yulab-smu.top/biomedical-knowledge-mining-book/references.html#ref-boyle2004). It takes as input a list of genes (aka the background) with pvalues indicating which genes are differentially expressed, and a pathway / set of genes, computes the number of DEGs that are found in the pathway, and then computes a p-value using a hypergeometric distribution based on that number along with the number of background genes and number of genes in the pathway.

## GSEA

ORA is solely based on the number of DEGs out of the , and thus will find pathways that have a lot of DEGs out of the background. However, ORA will miss smaller but coordinated changes in pathways, for example when a pathway does not contain many DEGs, but all genes in the pathway change together. A more complex approach to address this question is Gene Set Enrichment Analysis (GSEA), which utilizes all genes in the list rather than just the DEGs. GSEA takes a ranked list of genes in the set instead. How to create the ranked list is an open question, but typically log fold change, the test statistic, or a combination of log fold change and pvalue from a differential expression analysis are used.

We will not get into how GSEA method works here, but the core results from such an analysis are:

 * Enrichment score (ES) - a representation of how overrepresented a gene set is at the top or bottom of the ranked list.
 * p-value of ES - the significance of the score for that gene set
 * Adjusted p-value - an adjusted p-value to account for multiple hypothesis testing of multiple gene sets

We also run a more standard GSEA analysis, on just the Hallmark gene sets. There is an open question of how to create a ranked list for this analysis. We started with `avg_logFC` as was suggested in [issue#50](https://github.com/ctlab/fgsea/issues/50) on the `fgsea` repo, but it returned no significant pathways at all, which does not seem right for our comparisons also given Table \@ref(tab:de-markers). This may be because in the discussion `avg_logFC` was suggested when comparing markers just between cell type clusters, but we are also actually comparing between different treatment types. Another option was to use `-sign(avg_log2FC)*log10(p_val_adj)`, which incorporates both logfoldchange and the pvalue. This approach seemed more reasonable and yielded results more similar to `iDEA`.

## iDEA

GSEA and ORA are dependent on an initial differential expression analysis. However, the differential expression of an individual gene also has an obvious dependence on gene set enrichment, as gene sets contain information about the individual genes within the set. iDEA (integrative Differental expression and gene set Enrichment Analysis) addresses this joint dependence through a model that jointly estimates both differential expression and gene set enrichment. The input to this method is still a list of per-gene summary statistics from an initial differential expression analysis, but iDEA has increased power and provides updated results for detection of both differentially expressed genes and gene sets.

### notes for brainstorming

 * Show how to pull out hallmark gene sets (or any) with msigdbr
 * Show how to get markers with necessary options to get all genes
 * Show how to compute number of DE genes out of total
 * Show how to do an ORA with hypergeometric distribution
 * Show how to do an ORA with ClusterProfiler
 * Show how to make ranked list with log2FC and -sign(log2FC)xlog10(pval) for GSEA with fgsea and compare
 * Show how to make summary statistics list with iDEA
 * graphic for iDEA
 * Don't run iDEA but show code

Problem... what data should I use? Can use 1-5kpbmc, but not sure any pathways will be represented... might as well run and see though. If not, iDEA has example data that we can also use for GSEA analysis

In [6]:
set.seed(61)
.libPaths(c('/usr/local/lib/R/site-library', '/usr/local/lib/R/library'))
library(RColorBrewer)
library(Seurat)
library(patchwork)
library(ggplot2)
library(dplyr)
#library(hdf5r)
library(stringr)
#library(biomaRt)
#library(kableExtra)
#library(SeuratDisk)

#data_dir <- '/gpfs/data/cbc/scrna_r_workshop'
pbmc.1k <- Read10X_h5(paste0(data_dir, '/data/pbmc_1k_v3_filtered_feature_bc_matrix.h5'))
pbmc.5k <- Read10X_h5(paste0(data_dir, '/data/5k_pbmc_v3_filtered_feature_bc_matrix.h5'))

ERROR: Error in Read10X_h5(paste0(data_dir, "/data/pbmc_1k_v3_filtered_feature_bc_matrix.h5")): Please install hdf5r to read HDF5 files
