# Differential expression

In [None]:
import os
import pickle as pkl

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data


OUTPUT_PATH = "tmp/"
os.makedirs(OUTPUT_PATH, exist_ok=True) 

## QC on raw counts

Before we do any differential expression analysis, it's good to do some QC on the count data to ensure the sequencing samples cluster as expected. For our experiment, we expect 2 clusters (PZQ vs. Control) each with 5 points.

### Gene filtering based on total read count and the number of samples the gene is found in

Before we do anything, we are going to filter the count data to remove genes with low total expression values (anything less than 10 is not going to be informative) and genes that were only expressed in 3 or fewer samples. First, we read in the counts data and convert numeric columns to integers:

In [None]:
import pandas as pd

counts_df = pd.read_csv("/data/classes/2025/fall/biol343/course_files/counts/star_counts.tsv", sep="\t", header=1)
counts_df.iloc[:, 6:16] = counts_df.iloc[:, 6:16].astype('int')
counts_df

Next, we do the filtering. The first operation sums the expression across all samples and removes any gene with a total count of less than 10. The second operation counts, for each gene, the number of samples with a count >0. Any gene that wasn't expressed in more than 3 samples is removed entirely.

In [None]:
# take only the Geneid and count columns
star_cols = [c for c in counts_df.columns if "star.bam" in c]
counts_subset = counts_df[["Geneid"] + star_cols].copy()

# rename count columns
prefix = "/data/classes/2025/fall/biol343/course_files/dedup/star.bam:"
def strip_prefix(name: str) -> str:
    return name.replace(prefix, "")
counts_subset.rename(columns={c: strip_prefix(c) for c in star_cols}, inplace=True)

Now we're going to sum the counts (row-wise). This will create the new column `total_counts`:

In [None]:

# compute total_counts across numeric sample columns (row-wise)
# treat missing or non-numeric values as 0 so they don't propagate into the sum
sample_cols = [c for c in counts_subset.columns if c != "Geneid"]
counts_numeric = counts_subset[sample_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
counts_subset["total_counts"] = counts_numeric.sum(axis=1)

counts_subset 

Filter to only keep genes with `total_counts` of at least 10:

In [None]:
counts_summary = counts_subset.loc[counts_subset["total_counts"] >= 10].copy()
counts_summary

Previously, we had 9,920 rows, where each row is a gene. Now we have 9,587 rows, so we removed 333 lowly expressed genes.

We're now going to find genes that have a count of >0 in 3 or fewer samples and remove them.

In [None]:
positive_counts_mask = counts_numeric.loc[counts_summary.index] > 0
n_positive_samples = positive_counts_mask.sum(axis=1)
genes_to_remove = counts_summary.loc[n_positive_samples <= 3, "Geneid"]
genes_to_remove

10 genes are going to be removed. Let's remove them:

In [None]:
counts_filt = (
    counts_summary.loc[~counts_summary["Geneid"].isin(genes_to_remove)]
    .sort_values(by="Geneid")
    .drop(columns=["total_counts"])
)
counts_filt

This filtered dataset of 9577 genes will be used for all downstream analyses. 
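
To make the two filters concrete, here is a minimal sketch on a toy table. The gene and sample names (`gene_a`, `s1`, ...) are hypothetical and are not taken from the real dataset; the thresholds mirror the ones used above (total count of at least 10, expressed in more than 3 samples).

```python
import pandas as pd

# Hypothetical toy counts: 4 genes x 5 samples (names are illustrative only)
toy = pd.DataFrame(
    {
        "Geneid": ["gene_a", "gene_b", "gene_c", "gene_d"],
        "s1": [0, 5, 40, 0],
        "s2": [1, 6, 0, 0],
        "s3": [2, 7, 0, 50],
        "s4": [0, 8, 0, 0],
        "s5": [3, 9, 0, 60],
    }
)
sample_cols = [c for c in toy.columns if c != "Geneid"]

# Filter 1: row-wise sum of counts per gene
total_counts = toy[sample_cols].sum(axis=1)
# Filter 2: number of samples in which each gene has a count > 0
n_positive = (toy[sample_cols] > 0).sum(axis=1)

# Keep genes that pass both filters
keep = (total_counts >= 10) & (n_positive > 3)
toy_filt = toy.loc[keep].reset_index(drop=True)
print(toy_filt["Geneid"].tolist())
```

Only `gene_b` survives: `gene_a` has too few total counts, while `gene_c` and `gene_d` have high totals but are expressed in too few samples.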

### Clustering for QC
First, we'll measure the Euclidean distance between each sample and then plot the results as a heatmap. We'll first convert the data from a data frame to a matrix, and then measure the distance.

We'll first get the counts, create a matrix-like object, and set the row names to `Geneid`:

In [None]:
import numpy as np

count_cols = [c for c in counts_filt.columns if c != "Geneid"]
counts_m = counts_filt[count_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
counts_m.index = counts_filt["Geneid"].values
counts_m

We'll use scikit-learn to calculate the distances. We have to transpose the matrix (samples x genes instead of genes x samples), then we'll calculate the distances:

In [None]:
from sklearn.metrics import pairwise_distances

X = counts_m.to_numpy().T  # samples x genes
sample_names = counts_m.columns.tolist()

D = pairwise_distances(X, metric="euclidean")
dists_df = pd.DataFrame(D, index=sample_names, columns=sample_names)

dists_df.round(3)
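
As a sanity check on what a Euclidean distance matrix contains, the following sketch computes one by hand with NumPy on a tiny made-up matrix. The values are hypothetical; the point is only that each entry is the square root of the summed squared differences between two samples.

```python
import numpy as np

# Hypothetical toy matrix: 3 samples x 2 genes
X_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [6.0, 8.0]])

# Pairwise differences via broadcasting: shape (3 samples, 3 samples, 2 genes)
diff = X_toy[:, None, :] - X_toy[None, :, :]

# Euclidean distance: square, sum over genes, take the square root
D_toy = np.sqrt((diff ** 2).sum(axis=-1))
print(D_toy)
```

The result is symmetric with zeros on the diagonal, which is why the heatmap of a distance matrix always has a mirrored appearance.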

You can see the distance matrix, which shows how similar in expression each sample is to every other sample. For QC, we want to confirm that the PZQ samples are all closer to each other than to the CTRL samples. Here's a heatmap showing that:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.clustermap(
    dists_df,
    cmap="viridis",
    method="average",
    z_score=None,
    standard_scale=None,
    figsize=(8, 8)
)


We clearly see two clusters: one for treated samples and one for control samples. Let's see what the PCA looks like. 

We first log-transform the count data and scale the data across samples:

In [None]:

from sklearn.preprocessing import StandardScaler

# log transform and transpose to samples x genes
X = np.log10(counts_m.to_numpy() + 1.0).T  # shape: n_samples x n_genes
sample_names = counts_m.columns.tolist()

# scale features (genes) across samples
scaler = StandardScaler(with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(X)
X_scaled

We're also going to build a DataFrame that includes sample metadata. This will be used to annotate the PCA and will also be used for the DESeq2 analyses later.

In [None]:
sample_ids = list(counts_m.columns)

# create a DataFrame with sample_id
metadata = pd.DataFrame({"sample_id": sample_ids})
parts = metadata["sample_id"].str.split("_", expand=True)
parts.columns = ["stage", "treatment", "rep"]

metadata = pd.concat([metadata, parts], axis=1)
metadata = metadata.set_index("sample_id")
metadata = metadata[["treatment", "rep"]]
metadata

Next we run the PCA and plot the points:

In [None]:

from sklearn.decomposition import PCA

pca = PCA(n_components=10, random_state=0)  
scores = pca.fit_transform(X_scaled)        
# explained = pca.explained_variance_ratio_

scores_df = pd.DataFrame(
    scores[:, :2],  # PC1 and PC2 for plotting
    index=sample_names,
    columns=["PC1", "PC2"]
)

plot_df = scores_df.join(metadata[["treatment", "rep"]], how="left")

# plot
plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=plot_df,
    x="PC1",
    y="PC2",
    hue="treatment",     # color by treatment
    style="rep", 
    s=80
)
plt.title("PCA: PC1 vs PC2")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0.)
plt.tight_layout()
plt.show()


Perfect! This is exactly what we want. Two clear clusters based on the treatment. This means all 10 samples should be included in our differential expression analysis.
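
When reading a PCA plot it also helps to know how much variance each component captures (the quantity scikit-learn exposes as `explained_variance_ratio_`). The sketch below derives it from a singular value decomposition in NumPy; the data are random numbers with an artificial group shift, so the matrix and its dimensions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 samples x 50 genes of random noise
X_toy = rng.normal(size=(10, 50))
# Shift the first 5 samples in 10 genes to mimic a treatment effect
X_toy[:5, :10] += 3.0

# Center each gene (column), then decompose
Xc = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance carried by each principal component
explained = S**2 / np.sum(S**2)
# Sample coordinates on the principal components
scores_toy = U * S

print(explained[:2].round(3))
```

These fractions can be added to the axis labels, for example `f"PC1 ({explained[0]:.0%})"`, so readers can judge how much of the separation PC1 accounts for.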

## Differential expression analysis
To perform differential expression analysis, DESeq2 requires two types of input:

1. A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers) 
2. Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data into cohorts.

The output of featureCounts needs to be converted to a matrix, the column names should be simplified, and a few unnecessary columns need to be removed. We did all of this above when we built `counts_m`, and we also made the `metadata` DataFrame earlier.

To be sure, we'll check to confirm the metadata and count columns from `counts_m` correspond:

In [None]:
set(metadata.index) == set(counts_m.columns)
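
Note that comparing sets ignores order. A stricter check compares the names position by position and, if needed, reorders the metadata to follow the count columns. The sketch below uses hypothetical sample names (`s1`, `s2`, `s3`) and is only an illustration of the idea, not a statement about how PyDESeq2 aligns its inputs internally.

```python
import pandas as pd

# Hypothetical toy inputs (genes x samples counts, per-sample metadata)
counts_toy = pd.DataFrame(
    [[5, 0, 12], [3, 8, 1]],
    index=["gene_a", "gene_b"],
    columns=["s1", "s2", "s3"],
)
metadata_toy = pd.DataFrame(
    {"treatment": ["PZQ", "CTRL", "CTRL"]},
    index=["s3", "s1", "s2"],  # same names as the count columns, different order
)

# Same names regardless of order
same_names = set(metadata_toy.index) == set(counts_toy.columns)
# Same names in the same order
same_order = list(metadata_toy.index) == list(counts_toy.columns)
print(same_names, same_order)

# Reorder the metadata rows to follow the count columns
metadata_aligned = metadata_toy.loc[counts_toy.columns]
print(metadata_aligned)
```

In this notebook both objects were built from `counts_m.columns`, so the order already agrees; the explicit check is just a cheap safeguard when metadata comes from a separate file.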

### Single factor analyses

DESeq2 allows for single-factor analyses (only one independent variable, i.e., just treatment) or multifactorial analyses (more than one independent variable, which is not applicable for this dataset). We will analyze the data to see if there are differences between CTRL and PZQ worms. To do so, we create a `DeseqDataSet` (or DDS object), which incorporates the counts and the metadata. We can then run the `deseq2()` method on the DDS object to fit dispersions and log-fold changes:

In [None]:
# PyDESeq2 single-factor setup: one design factor (treatment)

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference

inference = DefaultInference(n_cpus=32)

dds = DeseqDataSet(
    counts=counts_m.T,        # transposed to samples x genes, integer counts
    metadata=metadata,         # DataFrame indexed by sample_id
    design_factors="treatment",  # single factor name present in metadata
    refit_cooks=True,
    inference=inference
)

dds


The DDS object is based on the AnnData object. As with any Python object, we can access its fields:

In [None]:
print(dds.obsm)
dds.obsm['design_matrix']

Now we can run the `deseq2()` method, which fits the dispersion and log-fold changes and therefore now adds new fields to the DDS object:

In [None]:
dds.deseq2()
dds

Now, for example, we can access the gene-level log-fold changes (LFCs):

In [None]:
dds.varm["LFC"]

The DDS object with the dispersions and LFCs allows us to perform statistical tests. The `DeseqStats` class wraps the DDS object and calculates p-values and adjusted p-values. These data are stored in `results_df`. We can view them with the `.summary()` method:

In [None]:
stat_res = DeseqStats(dds, inference=inference)
stat_res.summary()
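
The adjusted p-values (`padj`) correct for testing thousands of genes at once; DESeq2 uses the Benjamini-Hochberg procedure for this by default. The following is a minimal NumPy sketch of that procedure on four made-up p-values. The helper name `bh_adjust` is hypothetical, and the sketch ignores any additional filtering the library may apply before adjustment.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    ranked = p[order] * n / np.arange(1, n + 1)  # p * n / rank
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)    # restore the original order
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
padj_toy = bh_adjust(pvals)
print(padj_toy)  # → [0.02 0.04 0.04 0.02]
```

Every adjusted value is at least as large as its raw p-value, which is why fewer genes pass `padj < 0.05` than pass `pvalue < 0.05`.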

One of the featured differentially expressed genes from the paper was the ABC transporter (which detoxifies drugs by transporting them out of the cell) gene Smp_089200, which decreased in expression after drug treatment. We can search for that gene to see we reproduce the finding:

In [None]:
summary = stat_res.results_df
summary['gene_id'] = summary.index
summary = summary.reset_index()
summary.loc[summary['gene_id'] == 'Smp_089200']


We can see that the log2 fold change (L2FC) of Smp_089200 is about -1, which means its expression in treated samples is half of that in control samples (i.e., 2^-1 = 0.5). This is consistent with the figure from the paper:

<img src="../8_counting/assets/example_genes.png" width="400">
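
Converting between log2 fold changes and plain fold changes is a one-line calculation. The sketch below shows it for a few round values; the helper name `fold_change` is hypothetical.

```python
import math

def fold_change(log2_fc: float) -> float:
    """Convert a log2 fold change to a linear fold change (illustrative helper)."""
    return 2.0 ** log2_fc

for lfc in (-2, -1, 0, 1, 2):
    print(f"log2FC = {lfc:+d} -> fold change = {fold_change(lfc)}")

# Going the other way: a halving of expression is a log2 fold change of -1
print(math.log2(0.5))  # → -1.0
```

A log2 fold change of 0 means no change, and each unit step up or down doubles or halves expression.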

### Volcano plots

Typically in RNA-seq analyses, we are interested in genes that are significantly differentially expressed (padj <0.05) and that have large differences (for example, an absolute log2 fold change of at least 2, which is a four-fold change in either direction). Volcano plots are a good way to visualize all the genes that satisfy both conditions.

In [None]:
x = summary['log2FoldChange'].to_numpy()
y = -np.log10(summary['padj'].to_numpy())

sig = (summary['padj'] < 0.05) & (summary['log2FoldChange'].abs() >= 2)
plt.figure(figsize=(6, 4))
plt.axhline(-np.log10(0.05), color='red', linestyle='--', linewidth=1)
plt.axvline(2, color='gray', linestyle='--', linewidth=1)
plt.axvline(-2, color='gray', linestyle='--', linewidth=1)
plt.scatter(x[~sig], y[~sig], s=3, color='steelblue', alpha=0.7, label='NS')
plt.scatter(x[sig], y[sig], s=6, color='tomato', alpha=0.8, label='DE (padj<0.05 & |LFC|≥2)')
plt.legend(frameon=False)
plt.xlabel('log2FoldChange'); plt.ylabel('-log10(padj)')
plt.title('Volcano plot'); plt.tight_layout(); plt.show()


In a plot like this, the higher points represent genes that are most likely to be truly significantly differentially expressed (a small p-value). The points far to the right or left represent genes that have large expression differences (fold changes) between CTRL and PZQ. The points in the top right/left areas (red), then, are the genes in which we're interested.
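
The same two thresholds can be used to list the genes of interest, split by direction. This sketch runs on a made-up results table with the same column names as `results_df` (`log2FoldChange`, `padj`); the gene names and values are hypothetical.

```python
import pandas as pd

# Hypothetical results table with the same column names as results_df
res_toy = pd.DataFrame(
    {
        "log2FoldChange": [2.5, -3.0, 0.4, 2.2, -2.1],
        "padj": [0.001, 0.01, 0.0001, 0.2, float("nan")],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"],
)

# NaN compared with a number is False, so genes without a padj are excluded
sig = res_toy["padj"] < 0.05

# Split significant genes by direction of change
up = res_toy.index[sig & (res_toy["log2FoldChange"] >= 2)].tolist()
down = res_toy.index[sig & (res_toy["log2FoldChange"] <= -2)].tolist()
print("up:", up)      # → up: ['gene_a']
print("down:", down)  # → down: ['gene_b']
```

Here `gene_c` is significant but changes too little, `gene_d` changes a lot but is not significant, and `gene_e` has no adjusted p-value, so none of the three is reported. Which condition counts as "up" depends on the reference level of the contrast.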

## QC on transformed data

Now that we have the `dds` object, we can do another PCA on counts that have been normalized and transformed. The first PCA we did was on log-transformed raw counts, but it's often helpful to see how the samples cluster on normalized/transformed counts. We first transform the read counts with DESeq2's variance-stabilizing transformation, which will add a new `vst_counts` layer to the `dds` object:

In [None]:
dds.vst()


Now we run the same PCA code, but this time using the normalized/transformed counts:

In [None]:
X = dds.layers["vst_counts"]   # shape: (samples, genes)
obs = dds.obs  # sample metadata DataFrame

pca = PCA(n_components=10, random_state=0)  
scores = pca.fit_transform(X)        
# explained = pca.explained_variance_ratio_

scores_df = pd.DataFrame(
    scores[:, :2],  # PC1 and PC2 for plotting
    index=sample_names,
    columns=["PC1", "PC2"]
)

plot_df = scores_df.join(metadata[["treatment", "rep"]], how="left")

# plot
plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=plot_df,
    x="PC1",
    y="PC2",
    hue="treatment",     # color by treatment
    style="rep", 
    s=80
)
plt.title("PCA: PC1 vs PC2")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0.)
plt.tight_layout()
plt.show()



The clustering is similar to the previous PCA, but now the spread is much smaller (the x and y axes have smaller ranges).