### Programming for Biomedical Informatics
#### Week 6 - Differential Gene Expression Analysis

This is a basic example of how to use PyDESeq2 (the `pydeseq2` package) to perform differential expression analysis on a small synthetic dataset that comes with the package. The package website, https://pydeseq2.readthedocs.io/en/latest/, has excellent documentation. In Thursday's session we will go through a real-world example.

In [1]:
# import libraries
import os
import pickle as pkl

# import pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import *

In [None]:
import pandas as pd

# import small synthetic dataset
DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"

## NB this is only for 10 genes!
# DATA_PATH is a URL ending in "/", so plain concatenation is enough (os.path.join is meant for file paths)
counts_df = pd.read_csv(DATA_PATH + "test_counts.csv", index_col=0)
counts_df.head()

In [None]:
# transpose the counts matrix so that we have samples as rows and genes as columns
counts_df = counts_df.T
counts_df.head()

In [None]:
## load the sample metadata (one row per sample, indexed by sample name)
metadata = pd.read_csv(DATA_PATH + "test_metadata.csv", index_col=0)
metadata.head()
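
Before fitting, it is worth checking that the counts table and the metadata describe the same samples in the same order. The following is a minimal sketch on a toy table; the names `raw`, `toy_counts` and `toy_metadata` and all values are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# Toy genes x samples table (hypothetical values)
raw = pd.DataFrame(
    {"sample1": [10, 0, 5], "sample2": [12, 1, 7]},
    index=["gene1", "gene2", "gene3"],
)

# Transpose so that samples are rows and genes are columns
toy_counts = raw.T

# Toy metadata, indexed by sample name
toy_metadata = pd.DataFrame(
    {"condition": ["A", "B"]}, index=["sample1", "sample2"]
)

# True only if both tables list the same samples in the same order
same_samples = toy_counts.index.equals(toy_metadata.index)
print(toy_counts.shape, same_samples)
```

The same check can be run on the real tables with `counts_df.index.equals(metadata.index)`.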

In [10]:
## filter out samples with missing condition
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
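
The filter above applies one boolean mask to both tables so that they stay aligned. A small sketch of the same pattern on made-up data (the sample names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy metadata in which sample s2 has no condition label
toy_metadata = pd.DataFrame(
    {"condition": ["A", np.nan, "B"]}, index=["s1", "s2", "s3"]
)
toy_counts = pd.DataFrame(
    {"gene1": [5, 3, 8], "gene2": [0, 2, 1]}, index=["s1", "s2", "s3"]
)

# Boolean mask indexed by sample name: True where a condition is present
keep = ~toy_metadata["condition"].isna()

# Apply the same mask to both tables so the rows still match one-to-one
toy_counts = toy_counts.loc[keep]
toy_metadata = toy_metadata.loc[keep]
print(list(toy_counts.index))
```

Because the mask is built from the metadata index, this relies on the counts table having samples as rows with the same index.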

In [11]:
## remove genes with fewer than 10 reads in total across all samples
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
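
To see what the threshold does, here is a sketch on a toy samples x genes table (gene names and counts are invented). Note that the comparison is `>=`, so a gene with exactly 10 reads in total is kept.

```python
import pandas as pd

# Toy samples x genes table (hypothetical values)
toy_counts = pd.DataFrame(
    {"geneA": [0, 3, 2], "geneB": [50, 40, 60], "geneC": [4, 6, 0]},
    index=["s1", "s2", "s3"],
)

# Total reads per gene, summed over samples (rows)
totals = toy_counts.sum(axis=0)

# Keep only the columns (genes) whose total reaches the threshold
kept = toy_counts.loc[:, totals >= 10]
print(totals.to_dict())
print(list(kept.columns))
```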

In [12]:
## set up the DeseqDataSet object; n_cpus controls how many CPUs are used for fitting
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",
    refit_cooks=True,
    inference=inference,
)

In [None]:
## do the fitting: this runs the whole pipeline in one go (we break it down in the next notebook)
dds.deseq2()
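
One of the first steps of a DESeq2-style fit is to estimate a size factor per sample so that samples sequenced to different depths become comparable. The sketch below illustrates the median-of-ratios idea with NumPy on a made-up matrix. It is a simplified illustration, not the package's implementation: it assumes every count is positive, whereas real data needs genes containing zeros to be excluded from the reference.

```python
import numpy as np

# Toy samples x genes matrix; rows 2 and 3 are row 1 scaled by 2 and 0.5
counts = np.array(
    [
        [10.0, 20.0, 30.0],
        [20.0, 40.0, 60.0],
        [5.0, 10.0, 15.0],
    ]
)

log_counts = np.log(counts)

# Log of the per-gene geometric mean across samples (the reference sample)
log_geo_means = log_counts.mean(axis=0)

# Log ratio of each sample to the reference, gene by gene
log_ratios = log_counts - log_geo_means

# Size factor = median ratio over genes, one value per sample
size_factors = np.exp(np.median(log_ratios, axis=1))

# Dividing by the size factors puts all samples on a common scale
normalized = counts / size_factors[:, None]
print(size_factors)
```

In this toy case the three samples differ only by sequencing depth, so after normalization every row equals the first one.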

In [None]:
## inspect the fitted dataset; printing it lists the fields that the fit has added
print(dds)

In [None]:
## access the dispersion estimates
print(dds.varm["dispersions"])
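
The dispersion controls how much more variable a gene's counts are than a Poisson model would predict. In the negative binomial model used by DESeq2, the variance of a count with mean `mu` and dispersion `alpha` is `mu + alpha * mu**2`. The helper below is a hypothetical illustration of that formula, not a function from the package.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mean + dispersion * mean**2


# With zero dispersion the variance equals the mean (the Poisson case)
poisson_like = nb_variance(100.0, 0.0)

# A dispersion of 0.1 inflates the variance far beyond the mean
overdispersed = nb_variance(100.0, 0.1)
print(poisson_like, overdispersed)
```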

In [None]:
## access the log fold changes (stored here on the natural-log scale; the summary below reports log2)
print(dds.varm["LFC"])
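
As far as I understand the PyDESeq2 documentation, the fold changes stored in `varm["LFC"]` are on the natural-log scale, while the results table reports them as log2. Converting between the two is a division by ln(2); the helper name below is made up for illustration.

```python
import math


def ln_to_log2(value):
    """Convert a natural-log fold change to a log2 fold change."""
    return value / math.log(2)


# A doubling of expression is ln(2) on the natural-log scale and 1 on the log2 scale
doubling = ln_to_log2(math.log(2))

# A four-fold decrease is -2 on the log2 scale
quartering = ln_to_log2(math.log(0.25))
print(doubling, quartering)
```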

In [None]:
## now we can run the statistical tests; summary() prints fold changes, p-values and adjusted p-values per gene
stat_res = DeseqStats(dds, inference=inference)
stat_res.summary()
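
The summary reports a p-value and an adjusted p-value for each gene. The sketch below shows, in plain Python, the two ideas usually behind those columns in a DESeq2-style analysis: a two-sided Wald test on the fold change and a Benjamini-Hochberg correction for multiple testing. The function names and numbers are hypothetical, and the package may apply extra steps (such as filtering) that this sketch ignores.

```python
import math


def wald_pvalue(lfc, se):
    """Two-sided p-value for the null hypothesis LFC = 0, using a normal approximation."""
    z = lfc / se
    # erfc(|z| / sqrt(2)) equals 2 * P(Z > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_example = wald_pvalue(1.96, 1.0)
padj_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_example, padj_example)
```

Adjusted p-values are never smaller than the raw ones, which is why fewer genes pass a given cut-off after correction.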

In [None]:
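## draw an MA plot (log2 fold change against mean normalized count); remember, only 10 genes!
## first re-run the summary, testing |LFC| > 0.1 instead of the default LFC = 0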
stat_res.summary(lfc_null=0.1, alt_hypothesis="greaterAbs")
stat_res.plot_MA(s=20)
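
Setting `lfc_null=0.1` with `alt_hypothesis="greaterAbs"` asks whether the absolute fold change exceeds 0.1 rather than whether it differs from zero. The sketch below follows the formula used for this test in the original DESeq2, as I understand it; it is an illustration with made-up numbers, and it assumes the fold change, its standard error and the threshold are all on the same log scale.

```python
import math


def greater_abs_pvalue(lfc, se, threshold):
    """P-value for the alternative |LFC| > threshold (normal approximation)."""
    x = (abs(lfc) - threshold) / se
    # erfc(x / sqrt(2)) equals 2 * P(Z > x); cap at 1 when |LFC| is below the threshold
    return min(1.0, math.erfc(x / math.sqrt(2)))


# A fold change inside the threshold gives no evidence against the null
p_small = greater_abs_pvalue(0.05, 0.5, 0.1)

# A large fold change remains significant, though less so than against LFC = 0
p_large = greater_abs_pvalue(2.0, 0.5, 0.1)
p_large_vs_zero = greater_abs_pvalue(2.0, 0.5, 0.0)
print(p_small, p_large, p_large_vs_zero)
```

A thresholded test is stricter than the default, so genes with tiny but precisely estimated fold changes no longer count as differentially expressed.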