# <div align="center"><b>Pathway enrichment analysis with DecoupleR-py</b></div>

The last thing we will do in bootcamp (😢) is perform pathway enrichment analysis using the decoupleR-py package. This package is useful for summarizing the results of a differential expression analysis and identifying the biological pathways most affected by the changes in gene expression.

This notebook is based on the decoupleR-py tutorial for bulk RNA-seq data found [here](https://decoupler.readthedocs.io/en/latest/notebooks/bulk/rna.html).

# 0) Packages

In [None]:
# As usual we will start by importing useful packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scanpy as sc
import decoupler as dc
from anndata import AnnData

# 1) Load and preprocess data

In [None]:
# It can also be useful to specify all your paths here so it is clear where things are coming from
# TODO: Make sure this matches the path of your DESeq2 results file
path_deseq2 = "~/scratch/differential_analysis/deseq2_results.csv"
path_out = '~/scratch/pathway_analysis/'

In [None]:
# Load the deseq2 results using pandas
res = pd.read_csv(path_deseq2, index_col=0)  # note that this is actually a csv file because we saved it as such!
res.head()

In [None]:
# We again need to clean up the dataframe to use gene symbols as the index, matching the decoupler databases
res.set_index("Symbol", inplace=True)
res = res[~res.index.isna()]
res = res[~res.index.duplicated()]
mat = res[['stat']].T.rename(index={'stat': 'Persister.vs.Parental'})
mat

In [None]:
# As a sanity check, let's look at the volcano plot showing specific hits again to make sure it matches the last notebook

# Plot the volcano without labeling
fig, ax = plt.subplots()
dc.pl.volcano(
    res,  # The results table
    x='log2FoldChange',  # The column with the log2 fold changes will be on the x-axis
    y='padj',  # The column with the adjusted p-values will be on the y-axis
    top=1,  # The number of top genes to label
    figsize=(5, 5),  # The size of the figure
    thr_sign=0.05,  # The significance threshold to use for padj
    thr_stat=0.5,  # The log2 fold change threshold to use
    ax=ax,
    return_fig=True
)

# Remove the automatically labeled gene
if ax.texts:
    ax.texts[-1].remove()

# Define our genes to highlight, in this case the hits discussed in the paper.
genes_of_interest = [
    "PROM1", "CD44", # PROM1 encodes CD133
    "GPX4"
]

# Annotate manually
for gene in genes_of_interest:
    if gene in res.index:
        logfc = res.at[gene, "log2FoldChange"]
        neg_log10_padj = -np.log10(res.at[gene, "padj"])  # the volcano y-axis is on the -log10 scale
        ax.scatter(logfc, neg_log10_padj, color="black")
        ax.text(logfc, neg_log10_padj, gene, fontsize=8)

plt.show()

# 2) Definition of enrichment analysis
Enrichment analysis tests whether a specific set of omics features is “overrepresented” or “coordinated” in the measured data compared to a background distribution. These sets are predefined based on existing biological knowledge and may vary depending on the omics technology used.

Enrichment analysis requires the use of an enrichment method, and several options are available. In the original decoupler manuscript [BiMVSB+22], the authors benchmarked multiple methods and found that the univariate linear model (ulm) outperformed the others.

The scores from decoupler.mt.ulm should be interpreted such that larger magnitudes indicate greater significance, while the sign reflects whether the features in the set are overrepresented (positive) or underrepresented (negative) compared to the background.
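
To make the sign and magnitude of a ulm score more concrete, here is a minimal sketch of a univariate linear model on toy data, written with NumPy only. It assumes the score is the t-statistic of the slope obtained when regressing the gene-level statistics on a gene set's membership weights (genes outside the set get a weight of zero). The helper `ulm_score`, the gene statistics, and the two gene sets are made up for illustration, and this is not decoupler's own implementation.

```python
import numpy as np


def ulm_score(stats, weights):
    """t-statistic of the slope when regressing stats on weights (toy helper)."""
    stats = np.asarray(stats, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = stats.size
    # For a simple linear regression, the slope's t-statistic can be
    # computed from the Pearson correlation r and the sample size n.
    r = np.corrcoef(weights, stats)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r**2))


# Hypothetical gene-level statistics (for example, a DESeq2 'stat' column)
gene_stats = np.array([5.0, 4.0, 6.0, -0.5, 0.3, -0.2, 0.1, -4.0, -5.5, -4.5])

# Hypothetical gene sets: 1 = gene belongs to the set, 0 = it does not
up_set = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
down_set = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)

up_score = ulm_score(gene_stats, up_set)
down_score = ulm_score(gene_stats, down_set)
print(f"up_set score: {up_score:.2f}, down_set score: {down_score:.2f}")
```

In this toy example the set made of strongly upregulated genes gets a positive score and the set made of strongly downregulated genes gets a negative one, which is the pattern to look for in the real results.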

# 3) Look for enrichment of PROGENy pathways

**PROGENy** is a comprehensive resource containing a curated *collection of pathways* and their target genes, with weights for each interaction. For this example we will use the human weights (other organisms are available) and the top 500 responsive genes per pathway, ranked by p-value. Here is a brief description of each pathway:

**Androgen**: involved in the growth and development of the male reproductive organs.

**EGFR**: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells.

**Estrogen**: promotes the growth and development of the female reproductive organs.

**Hypoxia**: promotes angiogenesis and metabolic reprogramming when O2 levels are low.

**JAK-STAT**: involved in immunity, cell division, cell death, and tumor formation.

**MAPK**: integrates external signals and promotes cell growth and proliferation.

**NFkB**: regulates immune response, cytokine production and cell survival.

**p53**: regulates cell cycle, apoptosis, DNA repair and tumor suppression.

**PI3K**: promotes growth and proliferation.

**TGFb**: involved in development, homeostasis, and repair of most tissues.

**TNFa**: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.

**Trail**: induces apoptosis.

**VEGF**: mediates angiogenesis, vascular permeability, and cell migration.

**WNT**: regulates organ morphogenesis during development and tissue repair.
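
As a rough illustration of what "top responsive genes ranked by p-value" means, the sketch below builds a toy weighted network in long format and keeps the most significant targets per pathway. The column names (`source`, `target`, `weight`, `pval`), the gene names, and all values are assumptions made up for this example; inspect the real `progeny` table in the next cell for the actual layout.

```python
import pandas as pd

# Hypothetical long-format network: one row per pathway-gene interaction
toy_net = pd.DataFrame({
    "source": ["p53", "p53", "p53", "Hypoxia", "Hypoxia", "Hypoxia"],
    "target": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"],
    "weight": [2.5, -1.2, 0.4, 3.1, 0.9, -0.3],
    "pval": [1e-8, 1e-5, 1e-2, 1e-9, 1e-3, 1e-4],
})

# Keep the top_n most significant targets for each pathway
top_n = 2
top_net = (
    toy_net.sort_values("pval")      # most significant interactions first
    .groupby("source")
    .head(top_n)                     # first top_n rows within each pathway
    .sort_values(["source", "pval"])
    .reset_index(drop=True)
)
print(top_net)
```

A larger `top_n` keeps more target genes per pathway, including ones with weaker evidence, so the choice is a trade-off between coverage and specificity.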

In [None]:
# Retrieve PROGENy model weights
progeny = dc.op.progeny(top=500)
progeny

In [None]:
# Explore unique list of pathways.
progeny['source'].unique()

In [None]:
# Infer pathway activities with mlm
pathway_acts, pathway_pvals = dc.mt.mlm(
    data=mat,
    net=progeny,
    verbose=True
)
pathway_acts

In [None]:
# We can now plot the pathway activity scores as a barplot
dc.pl.barplot(
    pathway_acts,
    'Persister.vs.Parental',
    top=25,
    vertical=False,
    figsize=(6, 3)
)

In [None]:
# We can even look at the specific genes in a pathway and what their weights are
dc.pl.source_targets(
    data=res,
    x='weight', y='stat',
    net=progeny,
    name='p53',
    top=15,
    max_x=20,  # Cap the network weight (x-axis) to tame an outlier on the negative side; try commenting this line out and rerunning to see what happens.
    figsize=(6, 6)
)

# 4) Functional enrichment of biological terms in MSigDB

The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes. This will likely be discussed in more detail on the final day of bootcamp, but for now we will use the MSigDB gene sets to perform functional enrichment analysis.

In [None]:
# Load the MSigDB hallmark gene sets; `hallmark` is the net used in the enrichment steps below
hallmark = dc.op.hallmark(organism="human")

In [None]:
# Grab the MSigDB database using the decoupler package
msigdb = dc.op.resource('MSigDB')
msigdb

Not every gene set in MSigDB is useful for every analysis, so we will keep only the gene sets most relevant to our data. For this example we will use the hallmark gene sets.

In [None]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

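# Remove duplicated entries (copy so the rename below modifies an independent DataFrame)
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])].copy()

# Rename: strip the 'HALLMARK_' prefix from the gene set names
msigdb.loc[:, 'geneset'] = [name.split('HALLMARK_')[1] for name in msigdb['geneset']]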

msigdb

In [None]:
# We use only significant differentially expressed genes for the analysis
top_genes = res[res['padj'] < 0.05]
top_genes = mat.loc[:, top_genes.index.array] # Expected input format

In [None]:
# Run pathway scores with ulm.
hm_acts, hm_padj = dc.mt.ulm(data=top_genes, net=hallmark)

# Keep only gene sets with a significant adjusted p-value
msk = (hm_padj.T < 0.05).iloc[:, 0]
hm_acts = hm_acts.loc[:, msk]

hm_acts

In [None]:
# Transform to a long-format df for plotting
df = hm_acts.melt(value_name="score").merge(
    hm_padj.melt(value_name="pvalue")
    .assign(padj=lambda x: x["pvalue"].clip(2.22e-16, 1))  # avoid log10(0)
    .assign(padj=lambda x: (-np.log10(x["padj"])).clip(0, 10))  # -log10 scale, capped at 10 for the dot size
)

In [None]:
dc.pl.dotplot(
    df=df,
    x="score", y="variable",
    s="padj",
    c="score",
    # vcenter=0, # Didn't work. :(
    top=30, scale=0.3,
    # dot_max=0.5,
    figsize=(10, 6)
)

Note that the dotplot above tells us that a pathway is enriched, but not whether it is up- or downregulated. To get a better view of this, we can plot something called a running score:

In [None]:
# Plot running score for the epithelial-mesenchymal transition pathway (discussed in the paper).
# Can you confirm whether the expected type of enrichment (either upregulation or downregulation) can be observed?
dc.pl.leading_edge(
    df=res,
    stat='stat',
    net=hallmark,
    name='EPITHELIAL_MESENCHYMAL_TRANSITION'
)

In [None]:
# Plot running score for E2F_TARGETS. Note that it's the pathway with the greatest significance (in terms of P-value)!
dc.pl.leading_edge(
    df=res,
    stat='stat',
    net=hallmark,
    name='E2F_TARGETS'
)
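
To make the idea of a running score more concrete, here is a simplified sketch on toy data. It assumes an unweighted, GSEA-style walk: genes are ranked by their statistic, the score steps up at every gene that belongs to the set and steps down otherwise, and the largest deviation from zero gives the direction of enrichment. The helpers `running_score` and `enrichment_score`, the gene names, and the values are hypothetical, and the plotting function above may compute its curve differently.

```python
import numpy as np


def running_score(ranked_genes, gene_set):
    """Cumulative walk over a ranked gene list: up on hits, down on misses."""
    in_set = np.array([gene in gene_set for gene in ranked_genes])
    n_hit = in_set.sum()
    n_miss = in_set.size - n_hit
    # Step sizes are chosen so that the walk returns to zero at the end
    steps = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)
    return np.cumsum(steps)


def enrichment_score(curve):
    """Value of the curve at its largest deviation from zero (sign is kept)."""
    return curve[np.argmax(np.abs(curve))]


# Hypothetical gene-level statistics
gene_stats = {"G1": 6.0, "G2": 4.5, "G3": 3.0, "G4": 0.5,
              "G5": -0.2, "G6": -2.0, "G7": -3.5, "G8": -5.0}

# Rank genes from most upregulated to most downregulated
ranked = sorted(gene_stats, key=gene_stats.get, reverse=True)

up_curve = running_score(ranked, {"G1", "G2", "G3"})
down_curve = running_score(ranked, {"G6", "G7", "G8"})

print("up set:", round(enrichment_score(up_curve), 2))
print("down set:", round(enrichment_score(down_curve), 2))
```

A curve that peaks early and above zero points to a gene set concentrated among the upregulated genes, while a curve that dips below zero towards the end of the ranking points to a set concentrated among the downregulated genes.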

# 5) Some potential exercises

1. What pathways are enriched in only the upregulated genes? In only the downregulated genes?
2. Try different gene sets other than hallmark. What are the differences? Do you notice any trends?


# DONE!

---