analysis/gsva.Rmd

---
title: "GSVA: gene set variation analysis"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Following the [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.html).

>Gene set variation analysis (GSVA) is a particular type of gene set enrichment method that works on single samples and enables pathway-centric analyses of molecular data by performing a conceptually simple but powerful change in the functional unit of analysis, from genes to gene sets. The GSVA package provides the implementation of four single-sample gene set enrichment methods, concretely zscore, plage, ssGSEA and its own called GSVA. While this methodology was initially developed for gene expression data, it can be applied to other types of molecular profiling data. In this vignette we illustrate how to use the GSVA package with bulk microarray and RNA-seq expression data.
>
>Gene set variation analysis (GSVA) provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix.
>
>This resulting expression data matrix can be then used with classical analytical methods such as differential expression, classification, survival analysis, clustering or correlation analysis in a pathway-centric manner. One can also perform sample-wise comparisons between pathways and other molecular data types such as microRNA expression or binding data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).

## Package

Install GSVA. (Dependencies are listed in the Imports section in the [DESCRIPTION](https://github.com/rcastelo/GSVA/blob/devel/DESCRIPTION) file.)

```{r install_fgsea, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("GSVA", quietly = TRUE))
  BiocManager::install("GSVA")

if (!require("GSVAdata", quietly = TRUE))
  BiocManager::install("GSVAdata")
```

Load package.

```{r load_gsva, message=FALSE, warning=FALSE}
library(GSVA)
packageVersion("GSVA")
```

## Quick start

Generate example expression matrix.

```{r eg_mat}
p <- 10000
n <- 30
set.seed(1984)
X <- matrix(
  rnorm(p*n),
  nrow=p,
  dimnames=list(paste0("g", 1:p), paste0("s", 1:n))
)
X[1:5, 1:5]
```

Generate 100 gene sets that are contain from 10 to up to 100 genes sampled from `1:p`.

```{r eg_gs}
set.seed(1984)
gs <- as.list(sample(10:100, size=100, replace=TRUE))

gs <- lapply(gs, function(n, p){
  paste0("g", sample(1:p, size=n, replace=FALSE))
}, p)
names(gs) <- paste0("gs", 1:length(gs))

sapply(gs, length)
```

Calculate GSVA enrichment scores using the `gsva()` function, which does all the work and requires the following two input arguments:

1. A **normalised** gene expression dataset, which can be provided in one of the following containers:
    * A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    * An ExpressionSet object; see package Biobase.
    * A SummarizedExperiment object, see package SummarizedExperiment.
2. A collection of gene sets; which can be provided in one of the following containers:
    * A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    * A GeneSetCollection object; see package GSEABase.

>The first argument to the `gsva()` function is the gene expression data matrix and the second the collection of gene sets. The `gsva()` function can take the input expression data and gene sets using different specialized containers that facilitate the access and manipulation of molecular and phenotype data, as well as their associated metadata. Another advanced features include the use of on-disk and parallel backends to enable, respectively, using GSVA on large molecular data sets and speed up computing time.


The `gsva()` function will apply the following filters before the actual calculations take place:

1. Discard genes in the input expression data matrix with constant expression.
2. Discard genes in the input gene sets that do not map to a gene in the input gene expression data matrix.
3. Discard gene sets that, after applying the previous filters, do not meet a minimum and maximum size, which by default is 1 for the minimum size and `Inf` for the maximum size.

When `method="gsva"` is used (the default), the following parameters can be tuned:

* `kcdf`: The first step of the GSVA algorithm brings gene expression profiles to a common scale by calculating an expression statistic through a non-parametric estimation of the CDF across samples. Such a non-parametric estimation employs a kernel function and the `kcdf` parameter allows the user to specify three possible values for that function:

1. "Gaussian", the default value, which is suitable for continuous expression data, such as microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs, log-RPKMs or log-TPMs units of expression;
2. "Poisson", which is suitable for integer counts, such as those derived from RNA-seq alignments;
3. "none", which will enforce a direct estimation of the CDF without a kernel function.

* `mx.diff`: The last step of the GSVA algorithm calculates the gene set enrichment score from two Kolmogorov-Smirnov random walk statistics. This parameter is a logical flag that allows the user to specify two possible ways to do such calculation:

1. `TRUE`, the default value, where the enrichment score is calculated as the magnitude difference between the largest positive and negative random walk deviations;
2. `FALSE`, where the enrichment score is calculated as the maximum distance of the random walk from zero.

* `abs.ranking`: Logical flag used only when `mx.diff=TRUE`. By default, `abs.ranking=FALSE` and it implies that a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When `abs.ranking=TRUE` the original Kuiper statistic is used, by which the largest positive and negative random walk deviations are added together. In this case, gene sets with genes enriched on either extreme (high or low) will be regarded as highly activated.

* `tau`: Exponent defining the weight of the tail in the random walk. By default `tau=1`. When `method="ssgsea"`, this parameter is also used and its default value becomes then `tau=0.25` to match the methodology described in (Barbie et al. 2009).

In general, the default values for the previous parameters are suitable for most analysis settings, which usually consist of some kind of normalized continuous expression values.

```{r eg_gsva}
es_gsva <- gsva(gsvaParam(X, gs), verbose=FALSE)
dim(es_gsva)
```

Median enrichment scores.

```{r eg_median_es}
apply(es_gsva, 2, median)
```

>ssgsea (Barbie et al. 2009). Single sample GSEA (ssGSEA) is a non-parametric method that calculates a gene set enrichment score per sample as the normalized difference in empirical cumulative distribution functions (CDFs) of gene expression ranks inside and outside the gene set. By default, the implementation in the GSVA package follows the last step described in (Barbie et al. 2009, online methods, pg. 2) by which pathway scores are normalized, dividing them by the range of calculated values. This normalization step may be switched off using the argument ssgsea.norm in the call to the gsva() function; see below.

```{r eg_ssgsea}
es_ssgsea <- gsva(ssgseaParam(X, gs), verbose=FALSE)
apply(es_ssgsea, 2, median)
```

## Investigating the scores

Create another test matrix.

```{r create_test_matrix}
p <- 10000
n <- 2
set.seed(1984)
X <- matrix(
  rnorm(n = p*n, mean = 10, sd = 10),
  nrow=p,
  dimnames=list(paste0("g", 1:p), paste0("s", 1:n))
)

X[1:50, 's1'] <- rnorm(n = 50, mean = 50, sd = 55)
X[51:100, 's1'] <- rnorm(n = 50, mean = 2, sd = 2)

X[1:50, 's2'] <- rnorm(n = 50, mean = 100, sd = 5)
X[51:100, 's2'] <- rnorm(n = 50, mean = 2, sd = 0.5)

X[1:5, ]
```

Create testing gene lists with higher and lower expression patterns and check their enrichment scores.

```{r test_gene_set}
gene_set <- list(
  gs1 = paste0("g", 1:50),
  gs2 = paste0("g", 51:100),
  gs3 = paste0("g", 101:150)
)
```

_ssgsea_ (Barbie et al. 2009). Single sample GSEA (ssGSEA) is a non-parametric method that calculates a gene set enrichment score per sample as the normalized difference in empirical cumulative distribution functions (CDFs) of gene expression ranks inside and outside the gene set. By default, the implementation in the GSVA package follows the last step described in (Barbie et al. 2009, online methods, pg. 2) by which pathway scores are normalized, dividing them by the range of calculated values. This normalization step may be switched off using the argument `ssgsea.norm` in the call to the `gsva()` function.

```{r ssgsea_test_matrix}
test_ssgsea <- gsva(ssgseaParam(X, gene_set), verbose=FALSE)
test_ssgsea
```

No normalisation.

```{r ssgsea_no_norm_test_matrix}
test_ssgsea <- gsva(ssgseaParam(X, gene_set, normalize = FALSE), verbose=FALSE)
test_ssgsea
```

gsva (Hanzelmann, Castelo, and Guinney 2013). This is the default method of the package and similarly to ssGSEA, is a non-parametric method that uses the empirical CDFs of gene expression ranks inside and outside the gene set, but it starts by calculating an expression-level statistic that brings gene expression profiles with different dynamic ranges to a common scale.

```{r gsva_test_matrix}
test_gsva <- gsva(gsvaParam(X, gene_set), verbose=FALSE)
test_gsva
```

_zscore_ (Lee et al. 2008). The z-score method standardizes expression profiles over the samples and then, for each gene set, combines the standardized values as follows. Given a gene set $\gamma = \{1, \ldots ,k\}$ with standardized values $\{z_1,\ldots,z_k\}$ for each gene in a specific sample, the combined z-score $Z_\gamma$ for the gene set $\gamma$ is defined as:

$$ Z_\gamma = \frac{\sum^k_{i=1} z_i}{\sqrt{k}}.$$

```{r zscore_test_matrix}
test_zscore <- gsva(zscoreParam(X, gene_set), verbose=FALSE)
test_zscore
```

_plage_ (Tomfohr, Lu, and Kepler 2005). Pathway level analysis of gene expression (PLAGE) standardizes expression profiles over the samples and then, for each gene set, it performs a singular value decomposition (SVD) over its genes. The coefficients of the first right-singular vector are returned as the estimates of pathway activity over the samples. Note that, because of how SVD is calculated, the sign of its singular vectors is arbitrary.

```{r plage_test_matrix}
test_plage <- gsva(plageParam(X, gene_set), verbose=FALSE)
test_plage
```

## Enrichment scores on microarray and RNA-seq data

Gene expression data of lymphoblastoid cell lines (LCL) from HapMap individuals that have been profiled using microarray and RNA-seq.

```{r c2_broad_sets, message=FALSE, warning=FALSE}
library(Biobase)
library(GSVAdata)
data(c2BroadSets)
data(commonPickrellHuang)

stopifnot(
  identical(
    featureNames(huangArrayRMAnoBatchCommon_eset),
    featureNames(pickrellCountsArgonneCQNcommon_eset)
  )
)

stopifnot(
  identical(
    sampleNames(huangArrayRMAnoBatchCommon_eset),
    sampleNames(pickrellCountsArgonneCQNcommon_eset)
  )
)

pickrellCountsArgonneCQNcommon_eset
```

For the current analysis we use the subset of canonical pathways from the C2 collection of MSigDB Gene Sets. These correspond to the following pathways from KEGG, REACTOME and BIOCARTA.

```{r canonical_c2_broad_sets}
canonicalC2BroadSets <- c2BroadSets[
  c(
    grep("^KEGG", names(c2BroadSets)),
    grep("^REACTOME", names(c2BroadSets)),
    grep("^BIOCARTA", names(c2BroadSets))
  )
]

canonicalC2BroadSets
```

We calculate the GSVA enrichment scores for these gene sets using first the normalized microarray data and then the normalized RNA-seq integer count data. Note that the only requirement to do the latter is to set the argument `kcdf="Poisson"`, which is `"Gaussian"` by default. However, if the RNA-seq normalized expression levels is continuous, such as log-CPMs, log-RPKMs or log-TPMs, use `"Gaussian"`.

Microarray.

```{r esmicro}
huangPar <- gsvaParam(
  exprData = huangArrayRMAnoBatchCommon_eset,
  geneSets = canonicalC2BroadSets,
  minSize=5,
  maxSize=500
)

esmicro <- gsva(huangPar, verbose=FALSE)

exprs(esmicro)[1:6, 1:6]
```

RNA-seq.

```{r esrnaseq}
pickrellPar <- gsvaParam(
  exprData = pickrellCountsArgonneCQNcommon_eset,
  geneSets = canonicalC2BroadSets,
  minSize=5,
  maxSize=500,
  kcdf="Poisson"
)

esrnaseq <- gsva(pickrellPar, verbose=FALSE)

exprs(esrnaseq)[1:6, 1:6]
```

Correlation of enrichment scores between the two technologies.

```{r correlation_enrichment_scores, fig.width=12, fig.height=12}
library(corrplot)
corrplot(cor(exprs(esrnaseq), exprs(esmicro)), type="lower")
```

Correlation of enrichment scores between just the RNA-seq samples ordered using hierarchical clustering.

```{r correlation_enrichment_scores_rnaseq, fig.width=12, fig.height=12}
corrplot(cor(exprs(esrnaseq), exprs(esrnaseq)), type="lower", diag = FALSE, order = "hclust")
```