# <div align="center"><b>Pathway enrichment analysis with DecoupleR-py</b></div>

The last thing we will do in bootcamp (😢) is to perform pathway enrichment analysis using the DecoupleR-py package. This package is incredibly useful for summarizing the results of a differential expression analysis and identifying the biological pathways that are most affected by the changes in gene expression.

This notebook is based on the decoupleR-py tutorial found here: https://decoupler-py.readthedocs.io/en/latest/notebooks/bulk.html

In [None]:
# As usual we will start by importing useful packages
import numpy as np
import pandas as pd
import seaborn as sns

# Packages specific to this analysis
import scanpy as sc
import decoupler as dc
from anndata import AnnData

# 1) Load data

In [None]:
# It can also be useful to specify all your paths here so it is clear where things are coming from
# TODO: Make sure this matches the path of your DESeq2 results file from the last notebook
path_deseq2 = "scratch/differential_analysis/deseq2_results.csv"
path_out = '~/scratch/pathway_analysis/'

In [None]:
# Load the deseq2 results using pandas
res = pd.read_csv(path_deseq2, index_col=0)  # note that this is actually a csv file because we saved it as such!
res.head()

In [None]:
# As a sanity check, let's look at the volcano plot again to make sure it matches the last notebook
dc.plot_volcano_df(
    res,
    x='log2FoldChange',
    y='padj',
    top=20,
    figsize=(5, 5),
    sign_thr=0.05,
    lFCs_thr=1,
)

In [None]:
# We again need to clean up the dataframe to use symbols as the index to match the decoupler database
res.set_index("Symbol", inplace=True)
res = res[~res.index.isna()]
res = res[~res.index.duplicated()]
mat = res[['stat']].T.rename(index={'stat': 'Persister.vs.Parental'})
mat
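
To see what each cleanup step does, here is a minimal sketch on a made-up table. The gene IDs and values are invented for illustration; only the `Symbol` and `stat` column names and the contrast name come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy DESeq2-like table; all IDs and values are invented for illustration
toy = pd.DataFrame(
    {
        "Symbol": ["TP53", np.nan, "MYC", "MYC"],
        "stat": [2.5, 0.3, -1.2, 4.0],
    },
    index=["ENSG_a", "ENSG_b", "ENSG_c", "ENSG_d"],
)

toy = toy.set_index("Symbol")
toy = toy[~toy.index.isna()]        # drop genes without a symbol
toy = toy[~toy.index.duplicated()]  # keep only the first row of each repeated symbol

# One row per contrast, one column per gene
toy_mat = toy[["stat"]].T.rename(index={"stat": "Persister.vs.Parental"})
print(toy_mat)
```

Note that `duplicated()` keeps the first occurrence, so which row survives for a repeated symbol depends on the row order of the table.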

# 2) Look for enrichment of PROGENy pathways

PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. For this example we will use the human weights (other organisms are available) and we will use the top 500 responsive genes ranked by p-value. Here is a brief description of each pathway:

Androgen: involved in the growth and development of the male reproductive organs.

EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells.

Estrogen: promotes the growth and development of the female reproductive organs.

Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low.

JAK-STAT: involved in immunity, cell division, cell death, and tumor formation.

MAPK: integrates external signals and promotes cell growth and proliferation.

NFkB: regulates immune response, cytokine production and cell survival.

p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression.

PI3K: promotes growth and proliferation.

TGFb: involved in development, homeostasis, and repair of most tissues.

TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.

Trail: induces apoptosis.

VEGF: mediates angiogenesis, vascular permeability, and cell migration.

WNT: regulates organ morphogenesis during development and tissue repair.

In [None]:
# Retrieve PROGENy model weights
progeny = dc.get_progeny(top=500)
progeny
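
As a rough illustration of what "top 500 responsive genes ranked by p-value" means, the sketch below keeps the most significant targets of each pathway in a tiny made-up network. The helper `top_targets` is hypothetical, the weights and p-values are invented, and the column names (`source`, `target`, `weight`, `p_value`) are assumed to mirror the table shown above; decoupler's own filtering may differ in its details.

```python
import pandas as pd

# Made-up long-format network: one row per pathway-gene interaction
toy_net = pd.DataFrame(
    {
        "source": ["p53", "p53", "p53", "Hypoxia", "Hypoxia", "Hypoxia"],
        "target": ["CDKN1A", "MDM2", "GENE_X", "VEGFA", "CA9", "GENE_Y"],
        "weight": [5.1, 4.2, 0.3, 6.0, 5.5, -0.2],
        "p_value": [1e-9, 1e-7, 0.2, 1e-12, 1e-10, 0.4],
    }
)


def top_targets(net, top):
    """Hypothetical helper: keep the `top` most significant targets per pathway."""
    ranked = net.sort_values("p_value")
    return ranked.groupby("source").head(top).reset_index(drop=True)


toy_top = top_targets(toy_net, top=2)
print(toy_top)
```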

In [None]:
# Infer pathway activities with mlm
pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny, verbose=True)
pathway_acts
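
The idea behind a multivariate linear model (mlm) can be sketched with NumPy alone: regress the gene-level statistics on the weights of all pathways at once, and use the t-value of each pathway's slope as its activity score. Everything below (the simulated weights, the effect sizes, the helper `mlm_t_values`) is an assumption made for illustration; it is not decoupler's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: 200 genes, 3 pathways with sparse random weights
n_genes = 200
pathways = ["A", "B", "C"]
weights = rng.normal(size=(n_genes, 3)) * (rng.random((n_genes, 3)) < 0.3)

# Simulated gene statistics: pathway A is up, B is down, C is inactive
stat = 2.0 * weights[:, 0] - 1.5 * weights[:, 1] + rng.normal(scale=0.5, size=n_genes)


def mlm_t_values(stat, weights):
    """Fit stat ~ intercept + weights and return the t-value of each slope."""
    X = np.column_stack([np.ones(len(stat)), weights])
    coef, *_ = np.linalg.lstsq(X, stat, rcond=None)
    resid = stat - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof  # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return (coef / se)[1:]  # drop the intercept


t_values = mlm_t_values(stat, weights)
for name, t in zip(pathways, t_values):
    print(f"{name}: {t:+.2f}")
```

A large positive score suggests the pathway's targets move in the direction of their weights, a large negative score the opposite, and a score near zero little evidence either way.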

In [None]:
# We can now plot the pathway activity scores as a barplot
dc.plot_barplot(
    pathway_acts,
    'Persister.vs.Parental',
    top=25,
    vertical=False,
    figsize=(6, 3)
)

In [None]:
# We can even look at the specific genes in a pathway and what their weights are
dc.plot_targets(res, stat='stat', source_name='p53', net=progeny, top=15)

# 3) Functional enrichment of biological terms in MSigDB

The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes. This will likely be discussed in more detail on the final day of bootcamp, but for now we will use the MSigDB gene sets to perform functional enrichment analysis.

In [None]:
# Grab the MSigDB database using the decoupler package
msigdb = dc.get_resource('MSigDB')
msigdb

Not every gene set in MSigDB is useful for every analysis, so we will use the gene sets that are most relevant to our data. For this example we will use the hallmark gene sets.

In [None]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]

# Rename
msigdb.loc[:, 'geneset'] = [name.split('HALLMARK_')[1] for name in msigdb['geneset']]

msigdb

In [None]:
# We use only significant differentially expressed genes for the analysis
top_genes = res[res['padj'] < 0.05]

In [None]:
# Run functional enrichment analysis with ORA
enr_pvals = dc.get_ora_df(
    df=top_genes,
    net=msigdb,
    source='geneset',
    target='genesymbol'
)
enr_pvals.head()
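
Over-representation analysis (ORA) asks whether the significant genes overlap a gene set more than expected by chance. A minimal sketch of that idea, using a one-sided hypergeometric test from the standard library, is shown below. The helper `ora_pvalue` and all the numbers (including the background of 20,000 genes) are assumptions for illustration, and the exact test decoupler applies may differ.

```python
from math import comb


def ora_pvalue(n_overlap, n_set, n_selected, n_background):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random."""
    total = comb(n_background, n_selected)
    upper = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_background - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: a 200-gene set, 500 significant genes, 20,000 background genes
expected_overlap = 200 * 500 / 20000   # overlap expected by chance alone
p_enriched = ora_pvalue(30, 200, 500, 20000)  # far more overlap than expected
p_chance = ora_pvalue(5, 200, 500, 20000)     # about what chance would give
print(expected_overlap, p_enriched, p_chance)
```

Because the test only counts how many significant genes fall in the set, it ignores the direction of change, which is why the result says nothing about up- versus downregulation.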

In [None]:
# Plot a dotplot of the top 15 enriched pathways
dc.plot_dotplot(
    enr_pvals.sort_values('Combined score', ascending=False).head(15),
    x='Combined score',
    y='Term',
    s='Odds ratio',
    c='FDR p-value',
    scale=1.5,
    figsize=(3, 6)
)

Note that the dotplot above tells us that a pathway is enriched, but not whether it is up- or downregulated. To get a better view of this, we can plot something called a running score:

In [None]:
# Plot running score for E2F_TARGETS
dc.plot_running_score(
    df=res,
    stat='stat',
    net=msigdb,
    source='geneset',
    target='genesymbol',
    set_name='E2F_TARGETS'
)
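
To build some intuition for how to read a running score, here is a minimal GSEA-style sketch in plain Python: walk down the genes ranked by `stat`, step up when a gene belongs to the set and step down otherwise. The helper `running_score`, the gene names and the statistics are all made up, and the exact weighting used by `dc.plot_running_score` may differ.

```python
def running_score(ranked_genes, stats, gene_set):
    """GSEA-style running sum over genes sorted from highest to lowest stat."""
    in_set = [g in gene_set for g in ranked_genes]
    hit_total = sum(abs(s) for s, hit in zip(stats, in_set) if hit)
    n_miss = len(ranked_genes) - sum(in_set)
    score, curve = 0.0, []
    for s, hit in zip(stats, in_set):
        # Step up (weighted by |stat|) on a set member, step down otherwise
        score += abs(s) / hit_total if hit else -1.0 / n_miss
        curve.append(score)
    return curve


# Made-up ranking: genes already sorted by decreasing stat
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
stats = [4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0]

curve_up = running_score(genes, stats, {"g1", "g2"})    # set members at the top
curve_down = running_score(genes, stats, {"g7", "g8"})  # set members at the bottom

# The enrichment score is the point of the curve farthest from zero
es_up = max(curve_up, key=abs)
es_down = max(curve_down, key=abs)
print(es_up, es_down)
```

A curve that peaks above zero near the start of the ranking points to a set concentrated among upregulated genes; a curve that dips below zero near the end points to a set concentrated among downregulated genes.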

In [None]:
# TODO: Replace 'E2F_TARGETS' below with a gene set that has a positive running score
dc.plot_running_score(
    df=res,
    stat='stat',
    net=msigdb,
    source='geneset',
    target='genesymbol',
    set_name='E2F_TARGETS'
)

# 4) Some potential exercises

1. What pathways are enriched in only the upregulated genes? In only the downregulated genes?
2. Try different gene sets other than hallmark. What are the differences? Do you notice any trends?


# DONE!

---