# Single-sample Gene Set Enrichment Analysis (ssGSEA) Guide

## Introduction

Single-sample Gene Set Enrichment Analysis (ssGSEA) quantifies the enrichment of a gene set within an individual sample. Unlike traditional GSEA, which compares groups of samples, ssGSEA scores each sample independently. For each sample, it ranks the genes by expression and compares the cumulative distribution of genes inside the gene set with that of genes outside it along the ranked list.

## Principle

The steps of ssGSEA are as follows:

1. **Gene Ranking**: For each sample, rank all genes based on their expression values in descending order.
2. **Calculate Cumulative Distribution Functions (CDFs)**: Walk down the ranked list and track two running totals.
   - **Hit CDF**: the weighted fraction of gene set members encountered so far.
   - **Miss CDF**: the fraction of genes outside the gene set encountered so far.
3. **Enrichment Score (ES)**: The ES is the maximum deviation between the Hit CDF and the Miss CDF over all positions in the ranked list.
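
The ranking step and the hit/miss labelling that the two CDFs rely on can be sketched as follows. This is a minimal illustration only: the gene names and expression values are hypothetical and are not taken from the dataset used later in this guide.

```python
import pandas as pd

# Hypothetical expression values for one sample (illustrative only)
sample = pd.Series({"GeneA": 2.0, "GeneB": 9.0, "GeneC": 5.0, "GeneD": 1.0})
gene_set = ["GeneB", "GeneD"]

# Step 1: rank genes by expression, highest first
ranked = sample.sort_values(ascending=False)

# Steps 2 and 3 need to know, at each rank position, whether the gene is a hit
is_hit = ranked.index.isin(gene_set)

print(list(ranked.index))  # ['GeneB', 'GeneC', 'GeneA', 'GeneD']
print(is_hit.tolist())     # [True, False, False, True]
```

The boolean mask marks the positions where the Hit CDF steps up; the Miss CDF steps up at every other position.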

### Formula Explanation

1. **Gene Ranking**:
   - Rank gene expression values in descending order.

2. **Cumulative Distribution Functions**:
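   - **Hit CDF**:
     $$
     P_{\text{hit}}(i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{w_j}{\sum_{g_k \in S} w_k}
     $$
     where \(S\) is the gene set, \(g_j\) is the gene at position \(j\) of the ranked list, and \(w_j\) is the weight of that gene. The example code in this guide uses the expression value of \(g_j\) as \(w_j\).

   - **Miss CDF**:
     $$
     P_{\text{miss}}(i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - |S|}
     $$
     where \(N\) is the total number of genes and \(|S|\) is the size of the gene set.

3. **Enrichment Score (ES)**:
   - The ES is the maximum deviation between the Hit CDF and the Miss CDF over all positions \(i\):
     $$
     ES = \max_{1 \le i \le N} \left( P_{\text{hit}}(i) - P_{\text{miss}}(i) \right)
     $$
     This guide uses the maximum deviation. The originally published ssGSEA statistic instead sums the differences over all positions \(i\) and weights hits by rank.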
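
To make the formulas concrete, here is a small numeric trace. It assumes a hypothetical ranked list of four genes, with expression values used as the hit weights \(w_j\); the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical ranked list (already sorted in descending order) and hit mask
weights = np.array([9.0, 5.0, 2.0, 1.0])       # expression values used as weights w_j
is_hit = np.array([True, False, False, True])  # membership in the gene set S

# Hit CDF: cumulative share of the total hit weight seen up to position i
p_hit = np.cumsum(np.where(is_hit, weights, 0.0)) / weights[is_hit].sum()

# Miss CDF: every gene outside S adds 1 / (N - |S|)
p_miss = np.cumsum(~is_hit) / (~is_hit).sum()

# Enrichment score: largest value of P_hit(i) - P_miss(i)
es = np.max(p_hit - p_miss)

print(p_hit)   # [0.9 0.9 0.9 1. ]
print(p_miss)  # [0.  0.5 1.  1. ]
print(es)      # 0.9
```

Both CDFs end at 1, so the difference at the last position is always 0. The score is driven by how early in the ranking the hits accumulate.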

## Implementation Steps

The example below implements these steps on a small toy dataset: ranking gene expression values, computing the hit and miss cumulative distributions, and computing the enrichment score for each sample.

### Example Code


```python
import numpy as np
import pandas as pd

# Example gene expression matrix (rows: genes, columns: samples)
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4', 'Gene5'],
    'Sample1': [5.1, 3.2, 2.5, 7.8, 3.4],
    'Sample2': [6.3, 2.1, 4.3, 6.7, 5.1],
    'Sample3': [5.4, 3.8, 2.9, 8.0, 4.2]
}

# Create DataFrame indexed by gene name
df = pd.DataFrame(data)
df.set_index('Gene', inplace=True)

# Example gene set
gene_set = ['Gene1', 'Gene3', 'Gene4']

# Compute the ssGSEA enrichment score for one sample
def ssGSEA(sample_values, gene_set):
    # Rank genes by expression value, highest first
    ranked_genes = sample_values.sort_values(ascending=False)
    in_set = ranked_genes.index.isin(gene_set)
    n_genes = len(ranked_genes)
    n_set = int(in_set.sum())

    # Per-position increments of the hit and miss distributions
    hit_scores = np.zeros(n_genes)
    miss_scores = np.zeros(n_genes)

    # Hits are weighted by expression value; each miss contributes equally
    hit_sum = ranked_genes[in_set].sum()
    miss_const = 1.0 / (n_genes - n_set)

    for i, gene in enumerate(ranked_genes.index):
        if in_set[i]:
            hit_scores[i] = ranked_genes[gene] / hit_sum
        else:
            # Only genes outside the gene set advance the miss distribution
            miss_scores[i] = miss_const

    # Turn the increments into cumulative distributions
    hit_scores = np.cumsum(hit_scores)
    miss_scores = np.cumsum(miss_scores)

    # Enrichment Score (ES): maximum deviation between the two CDFs
    es = np.max(hit_scores - miss_scores)
    return es

# Compute enrichment scores for each sample
es_scores = {}
for sample in df.columns:
    es_scores[sample] = ssGSEA(df[sample], gene_set)

# Output results, rounded to four decimal places
print({sample: round(float(es), 4) for sample, es in es_scores.items()})
```

```
{'Sample1': 0.8377, 'Sample2': 0.7514, 'Sample3': 0.8221}
```
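
As an alternative to the explicit loop, the same maximum-deviation score can be written with vectorized NumPy operations and applied to every sample column at once. This is a sketch under stated assumptions rather than a reference implementation: `enrichment_score` is a hypothetical helper name, expression values are assumed to be non-negative so they can serve as weights, and the Miss CDF advances only at genes outside the gene set.

```python
import numpy as np
import pandas as pd

def enrichment_score(sample_values: pd.Series, gene_set) -> float:
    """Maximum deviation between hit and miss CDFs (hypothetical helper)."""
    ranked = sample_values.sort_values(ascending=False)
    is_hit = ranked.index.isin(list(gene_set))
    n_hit, n_miss = int(is_hit.sum()), int((~is_hit).sum())
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, of the genes")

    # Hit CDF: expression-weighted; Miss CDF: equal steps of 1 / n_miss
    hit_weights = np.where(is_hit, ranked.to_numpy(), 0.0)
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()
    p_miss = np.cumsum(~is_hit) / n_miss
    return float(np.max(p_hit - p_miss))

# Same toy expression matrix as in the example above
df = pd.DataFrame(
    {
        "Sample1": [5.1, 3.2, 2.5, 7.8, 3.4],
        "Sample2": [6.3, 2.1, 4.3, 6.7, 5.1],
        "Sample3": [5.4, 3.8, 2.9, 8.0, 4.2],
    },
    index=["Gene1", "Gene2", "Gene3", "Gene4", "Gene5"],
)

# DataFrame.apply passes each column (sample) to the function as a Series
scores = df.apply(enrichment_score, gene_set=["Gene1", "Gene3", "Gene4"])
print(scores.round(4))
```

Raising an error for an empty or all-covering gene set avoids a silent division by zero; whether to raise or to return `NaN` in that case is a design choice left open here.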
