# Differential expression

In [None]:
import os
import pickle as pkl

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data


OUTPUT_PATH = "tmp/"
os.makedirs(OUTPUT_PATH, exist_ok=True) 

To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

- A count matrix of shape ‘number of samples’ x ‘number of genes’, containing read counts (non-negative integers).
- Metadata (or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data into cohorts.

In [None]:
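import pandas as pd

# featureCounts writes a comment line first, so the column names are on the second line.
counts_df = pd.read_csv("star_counts.tsv", sep="\t", header=1)
# Cast the sample columns to integers. Assigning by column label (rather than
# through .iloc) replaces the columns, so the new dtype is actually stored.
sample_cols = counts_df.columns[6:18]
counts_df[sample_cols] = counts_df[sample_cols].astype(int)
counts_df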

The output of featureCounts needs to be transposed, and a few unnecessary columns need to be removed:

In [None]:
counts_reshape = counts_df.drop(columns=["Chr", "Start", "End", "Strand", "Length"])
counts_reshape = counts_reshape.set_index("Geneid")
counts_reshape = counts_reshape.transpose()
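# str.strip() treats its argument as a set of characters, not a literal prefix/suffix,
# and could trim the sample names themselves. Remove the path prefix and BAM suffix explicitly
# (assumes column names of the form <prefix><sample>[./_]Aligned.sortedByCoord.out.bam).
counts_reshape.index = counts_reshape.index.str.replace(
    r"^\.\./6_alignment/alignment/star/|[./_]?Aligned\.sortedByCoord\.out\.bam:?$", "", regex=True
)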
counts_reshape
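
A note on trimming column names: `str.strip` (the built-in as well as the pandas `.str` accessor) treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix. The sketch below uses a hypothetical sample name (`int_6w_1`) and assumes the BAM suffix follows the sample name directly; neither is taken from this dataset. It contrasts `strip` with `removeprefix`/`removesuffix` (Python 3.9+).

```python
# Hypothetical featureCounts column name; the real paths in this dataset may differ.
prefix = "../6_alignment/alignment/star/"
suffix = "Aligned.sortedByCoord.out.bam"
column = prefix + "int_6w_1" + suffix

# strip() removes any leading/trailing characters found in the given set,
# so it can also eat into the sample name itself.
stripped = column.strip(prefix + suffix)

# removeprefix()/removesuffix() only remove the exact literal strings.
cleaned = column.removeprefix(prefix).removesuffix(suffix)

print(stripped)  # w_1
print(cleaned)   # int_6w_1
```

Whether `strip` damages a name depends on which characters the name starts and ends with, so it is worth checking the resulting index by eye.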

Now we create the metadata:

In [None]:
metadata = pd.DataFrame(counts_reshape.index, columns=['Sample'])
metadata['Age'] = metadata['Sample'].str[4:6]
metadata['Replicate'] = metadata['Sample'].str[-1:]
metadata['Tissue'] = metadata['Sample'].str[0:3]
metadata['Tissue_Age'] = metadata['Tissue'] + "_" + metadata['Age']
metadata = metadata.set_index("Sample")
metadata.index.name = None
print(metadata)
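
Positional slicing silently produces wrong annotations if a sample name deviates from the expected layout. As a minimal sketch, assuming names of the form `<tissue>_<age>_<replicate>` (the names below are hypothetical, not the ones in this dataset), a regular expression can extract the same fields and fail loudly on unexpected names:

```python
import re

# Hypothetical sample names following a <tissue>_<age>_<replicate> pattern.
samples = ["Liv_6w_1", "Liv_6w_2", "Int_6w_1", "Int_9w_3"]

pattern = re.compile(r"^(?P<Tissue>[A-Za-z]{3})_(?P<Age>\w{2})_(?P<Replicate>\d)$")

rows = []
for name in samples:
    match = pattern.match(name)
    if match is None:
        raise ValueError(f"Unexpected sample name: {name!r}")
    fields = match.groupdict()
    # For well-formed names these agree with the positional slices.
    assert fields["Tissue"] == name[0:3]
    assert fields["Age"] == name[4:6]
    assert fields["Replicate"] == name[-1:]
    fields["Tissue_Age"] = f"{fields['Tissue']}_{fields['Age']}"
    rows.append(fields)

print(rows[0])
```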

Remove genes that have <10 reads in total:

In [None]:
genes_to_keep = counts_reshape.columns[counts_reshape.sum(axis=0) >= 10]
counts_filt = counts_reshape[genes_to_keep]
counts_filt.index.name = None
print(counts_filt)
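
To see exactly what this filter does, and how the number of removed genes can be counted, here is a small sketch on a made-up count matrix (the gene and sample names are illustrative only):

```python
import pandas as pd

# Toy count matrix (samples x genes); values are made up for illustration.
toy_counts = pd.DataFrame(
    {"geneA": [0, 1, 2], "geneB": [5, 3, 2], "geneC": [120, 98, 143]},
    index=["s1", "s2", "s3"],
)

total_reads = toy_counts.sum(axis=0)  # per-gene totals across all samples
keep = total_reads >= 10              # a gene with exactly 10 reads is kept
toy_filt = toy_counts.loc[:, keep]

n_removed = int((~keep).sum())
print(f"Removed {n_removed} of {toy_counts.shape[1]} genes")
```

Applied to the real data, the same `(~keep).sum()` expression gives the number of genes dropped by the filter.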

We removed ~600 very lowly expressed genes.

We can start by testing for differences between liver and intestine eggs. To do so, we create a `DeseqDataSet` (or DDS object), which incorporates the counts and the metadata. Note that the design below uses the combined `Tissue_Age` column, so each tissue/age combination is treated as a separate group; passing `design_factors="Tissue"` instead would ignore the age of the eggs. We can then run the `deseq2()` method on the DDS object to fit dispersions and log-fold changes:

In [None]:
inference = DefaultInference(n_cpus=32)
dds = DeseqDataSet(
    counts=counts_filt,
    metadata=metadata,
    design_factors="Tissue_Age",
    refit_cooks=True,
    inference=inference
)
dds


The DDS object is based on the AnnData object. As with any Python object, we can access its fields:

In [None]:
print(dds.obsm)
dds.obsm['design_matrix']
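
For intuition about what a design matrix for a single categorical factor contains, the following sketch builds one by hand with pandas. It is not PyDESeq2's internal code, and the group labels and column names are hypothetical; it only illustrates the usual "intercept plus indicator columns" layout.

```python
import pandas as pd

# Hypothetical group labels for four samples.
groups = pd.Series(
    ["Int_6w", "Int_6w", "Liv_6w", "Liv_6w"],
    index=["s1", "s2", "s3", "s4"],
    name="Tissue_Age",
)

# One indicator column per non-reference level; the first level
# (alphabetically) is the reference and is absorbed into the intercept.
design = pd.get_dummies(groups, drop_first=True, dtype=int)
design.insert(0, "intercept", 1)
print(design)
```

Each non-intercept column then corresponds to one coefficient, interpreted as the change relative to the reference group.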

Now we can run the `deseq2()` method, which fits the dispersions and log-fold changes and adds the corresponding fields to the DDS object:

In [None]:
dds.deseq2()
dds

Now, for example, we can access the gene-level LFCs:

In [None]:
dds.varm["LFC"]
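
One point to keep in mind when reading these values: to our understanding, PyDESeq2 stores the coefficients in `dds.varm["LFC"]` on the natural-log scale, whereas the results table reports log2 fold changes. This is worth confirming against the documentation of the installed version. Under that assumption, the conversion is a simple change of base, shown here with a made-up coefficient:

```python
import math

ln_fc = 1.2  # made-up coefficient, assumed to be on the natural-log scale

log2_fc = ln_fc / math.log(2)  # change of base: log2(x) = ln(x) / ln(2)
fold_change = 2 ** log2_fc     # back on the linear scale; equals exp(ln_fc)

print(round(log2_fc, 3), round(fold_change, 3))
```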

The DDS object, now containing the dispersions and LFCs, allows us to perform statistical tests. The `DeseqStats` class wraps the DDS object and computes p-values and adjusted p-values when its `summary()` method is called. The results are stored in `results_df`.

In [None]:
stat_res = DeseqStats(dds, inference=inference)
stat_res.summary()
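
The adjusted p-values in DESeq2-style analyses come from the Benjamini-Hochberg procedure. The sketch below is a plain-Python illustration of that single step on made-up p-values. It is not PyDESeq2's implementation, and it leaves out other steps the library may apply, such as independent filtering.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
padj = benjamini_hochberg(pvals)
print([round(p, 4) for p in padj])
```

Each p-value is scaled by `m / rank`, which is why a gene's adjusted p-value depends on how many genes were tested.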

One of the featured differentially expressed genes from the Winners vs. Losers paper was Smp_245390, which encodes the immunomodulatory molecule IPSE/alpha-1. We can search for that gene to see whether we reproduce the finding:

In [None]:
summary = stat_res.results_df
summary['gene_id'] = summary.index
summary = summary.reset_index()
print(summary)
summary.loc[summary['gene_id'] == 'Smp_245390']

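In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Volcano plot: effect size (log2 fold change) against significance (-log10 adjusted p-value).
plt.scatter(x=summary['log2FoldChange'], y=-np.log10(summary['padj']), s=1)
plt.xlabel('log2 fold change')
plt.ylabel('-log10(adjusted p-value)')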
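
A volcano plot is usually read together with explicit cut-offs. As a minimal sketch, the following selects "significant" genes from a toy table whose column names mirror `results_df`. The values are made up, and both thresholds are assumptions that should be chosen for the study at hand.

```python
import numpy as np
import pandas as pd

# Toy results table; column names mirror `results_df`, values are made up.
toy = pd.DataFrame(
    {
        "log2FoldChange": [2.5, -0.2, -1.8, 0.9],
        "padj": [1e-6, 0.8, 0.01, np.nan],
    },
    index=["g1", "g2", "g3", "g4"],
)

PADJ_MAX = 0.05  # assumed significance threshold
LFC_MIN = 1.0    # assumed threshold on |log2 fold change|

# Comparisons with NaN evaluate to False, so genes without an
# adjusted p-value are never selected.
is_significant = (toy["padj"] < PADJ_MAX) & (toy["log2FoldChange"].abs() > LFC_MIN)
significant = toy[is_significant]
print(significant)
```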