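# A step-by-step PyDESeq2 workflow

This notebook details all the steps of the PyDESeq2 pipeline.

It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.

If this is your first contact with PyDESeq2, we recommend that you first look at the [standard workflow example](https://pydeseq2.readthedocs.io/en/latest/auto_examples/plot_minimal_pydeseq2_pipeline.html).

We start by importing the required packages and setting up an optional path to save the results.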

In [1]:
import os
import pickle as pkl
import sys
sys.path.insert(0, "/slurm/home/admin/nlp/DL/97-bioinformatics/bio_package/pydeseq2")
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

SAVE = False  # whether to save the outputs of this notebook

if SAVE:
    # Replace this with the path to directory where you would like results to be
    # saved
    OUTPUT_PATH = "/slurm/home/admin/nlp/DL/results/synthetic_example"
    os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

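## Data loading
Note that this step is also explained in the getting-started example. To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

- A count matrix of shape (number of samples × number of genes), containing read counts (non-negative integers).
- Metadata (or "column" data) of shape (number of samples × number of variables), containing the sample annotations that will be used to split the data into cohorts.

Both should be provided as pandas DataFrames.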
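
As a minimal sketch of these requirements, the hypothetical helper below (`check_inputs` is not part of PyDESeq2) checks that both inputs are DataFrames, that they share the same sample index, and that the counts are non-negative integers. The toy values are made up for illustration.

```python
import pandas as pd


def check_inputs(counts_df: pd.DataFrame, metadata: pd.DataFrame) -> None:
    """Hypothetical sanity checks; this is not a PyDESeq2 function."""
    for frame in (counts_df, metadata):
        if not isinstance(frame, pd.DataFrame):
            raise TypeError("counts and metadata must both be pandas DataFrames")
    # Rows of both frames must describe the same samples, in the same order.
    if not counts_df.index.equals(metadata.index):
        raise ValueError("counts and metadata must share the same sample index")
    # Read counts must be non-negative integers.
    values = counts_df.to_numpy()
    if not pd.api.types.is_integer_dtype(values.dtype) or (values < 0).any():
        raise ValueError("counts must be non-negative integers")


# Toy data (made up): 3 samples x 2 genes, plus one annotation column.
toy_counts = pd.DataFrame(
    {"gene1": [12, 1, 4], "gene2": [21, 44, 4]},
    index=["sample1", "sample2", "sample3"],
)
toy_metadata = pd.DataFrame(
    {"condition": ["A", "A", "B"]},
    index=["sample1", "sample2", "sample3"],
)
check_inputs(toy_counts, toy_metadata)  # raises nothing for valid inputs
```

Running a check like this before building the dataset tends to surface transposed matrices or misaligned sample names early.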

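To illustrate the required data format, a synthetic example dataset is available through the PyDESeq2 API via `utils.load_example_data()`. In the cell below, those calls are commented out and the counts are instead read from a local CSV file, then transposed so that rows are samples. You may replace this with your own dataset.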

In [2]:
# counts_df = load_example_data(
#     modality="raw_counts",
#     dataset="synthetic",
#     debug=False,
# )

# metadata = load_example_data(
#     modality="metadata",
#     dataset="synthetic",
#     debug=False,
# )

count_file = "../../data/test_counts.csv"
# Transpose so that rows are samples and columns are genes.
counts_df = pd.read_csv(count_file, index_col=0).T
counts_df

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gene7,gene8,gene9,gene10
sample1,12,21,4,130,18,0,16,54,49,3
sample2,1,44,2,63,11,10,70,32,57,9
sample3,4,4,11,180,21,3,28,34,65,2
sample4,1,10,2,100,44,9,28,16,33,9
sample5,1,11,6,135,16,2,32,29,31,5
...,...,...,...,...,...,...,...,...,...,...
sample96,7,26,3,67,11,4,41,44,54,1
sample97,1,14,3,71,33,5,19,42,25,4
sample98,10,36,2,72,11,2,66,27,16,9
sample99,18,14,3,66,53,11,32,19,79,11


In [3]:
metadata_file = "../../data/test_metadata.csv"
metadata = pd.read_csv(metadata_file, index_col=0)
metadata.head()

Unnamed: 0,condition,group
sample1,A,X
sample2,A,Y
sample3,A,X
sample4,A,Y
sample5,A,X


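## Read counts modeling

Read counts are modeled with the `DeseqDataSet` class.

The `DeseqDataSet` class has two mandatory arguments, `counts` and `metadata`, as well as a set of optional keyword arguments, including:
- `design_factors`: the name of the metadata column(s) to use as design variables
- `refit_cooks`: whether to refit Cook's outliers, which is generally recommended.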



In [4]:
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",  # compare samples based on the "condition"
    # column ("B" vs "A")
    refit_cooks=True,
    inference=inference,
)

## Compute normalization factors

In [5]:
dds.fit_size_factors()

dds.obsm["size_factors"]

Fitting size factors...
... done in 0.00 seconds.



array([1.22898094, 1.18877375, 0.99722229, 1.00215773, 0.83457743,
       1.10730382, 0.8999001 , 1.15343785, 0.68163849, 1.29764537,
       1.04491511, 1.45930946, 1.14588441, 0.8049275 , 0.88402672,
       0.88402672, 1.32879767, 0.82564657, 1.5978062 , 1.29764537,
       1.31940196, 0.69919197, 1.10697146, 1.10214803, 1.19152118,
       1.0624452 , 0.98548229, 0.76881428, 0.8939601 , 1.27135863,
       1.61101905, 1.55084302, 0.83601298, 0.98213727, 1.27270212,
       1.0510719 , 1.76078144, 1.08132885, 1.50390106, 1.0510719 ,
       0.80280751, 0.70955247, 1.32602392, 0.98031899, 1.1078077 ,
       0.68792508, 0.90429564, 1.56411155, 0.81918767, 1.19364837,
       0.79492024, 1.84963565, 0.79694628, 0.79708276, 0.97287297,
       1.16248554, 1.50489413, 1.41929759, 1.04612122, 1.05720226,
       0.99635345, 1.84224912, 1.03801163, 0.89633874, 0.72952979,
       1.33453944, 0.93968061, 1.14016425, 1.59166589, 1.08554239,
       0.72370261, 0.91558563, 1.14183629, 1.33857618, 0.94450
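
For intuition, the median-of-ratios normalization introduced with DESeq can be sketched in a few lines of NumPy. This is an illustrative version under that assumption, not PyDESeq2's own code; `median_of_ratios` and the toy counts are hypothetical.

```python
import numpy as np


def median_of_ratios(counts):
    """Illustrative size factors for a (samples x genes) count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Only genes with no zero count have a finite log geometric mean.
    keep = (counts > 0).all(axis=0)
    log_counts = np.log(counts[:, keep])
    # Per-gene log geometric mean across samples (the pseudo-reference).
    log_geo_means = log_counts.mean(axis=0)
    # Per-sample median of the log ratios to the pseudo-reference.
    return np.exp(np.median(log_counts - log_geo_means, axis=1))


# Toy data (made up): the second sample is sequenced twice as deeply.
toy_counts = np.array([[10, 20, 30], [20, 40, 60]])
toy_size_factors = median_of_ratios(toy_counts)
```

In this toy case the second size factor is twice the first, so dividing each row by its size factor puts both samples on the same scale.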

## Fit dispersion trend coefficients


In [6]:
dds.fit_dispersion_trend()


Fitting dispersions...
... done in 0.02 seconds.

Fitting dispersion trend curve...
... done in 0.01 seconds.



array([0.65142434, 0.31300062, 1.04986539, 0.13414536, 0.264005  ,
       0.97812827, 0.25676459, 0.20575044, 0.21602633, 0.50274561])

In [7]:
dds.uns["trend_coeffs"]


a0    0.086109
a1    4.828540
dtype: float64

In [8]:
dds.varm["fitted_dispersions"]

array([0.65142434, 0.31300062, 1.04986539, 0.13414536, 0.264005  ,
       0.97812827, 0.25676459, 0.20575044, 0.21602633, 0.50274561])
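
The fitted dispersions above are consistent with a parametric trend of the form `a0 + a1 / mean`, as used in DESeq2. The sketch below assumes that form; `fitted_dispersion` is a hypothetical helper, the coefficients are the `trend_coeffs` printed above, and the mean counts are the `baseMean` values of gene1 and gene4 from the results table of this notebook.

```python
def fitted_dispersion(normed_mean, a0=0.086109, a1=4.828540):
    """Parametric dispersion trend (assumed form): a0 + a1 / mean."""
    return a0 + a1 / normed_mean


# baseMean values of gene1 and gene4, taken from this notebook's results.
trend_gene1 = fitted_dispersion(8.541317)
trend_gene4 = fitted_dispersion(100.517961)
```

Up to rounding, these reproduce the first and fourth entries of `fitted_dispersions` (0.65142434 and 0.13414536): genes with low mean counts sit on the steep part of the curve, while highly expressed genes approach the asymptote `a0`.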

## Dispersion priors

In [9]:
dds.fit_dispersion_prior()

In [10]:
print(
    f"logres_prior={dds.uns['_squared_logres']}, sigma_prior={dds.uns['prior_disp_var']}"
)

logres_prior=0.055924936547501455, sigma_prior=0.25
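
The value `sigma_prior=0.25` can be understood with a short sketch. It assumes the DESeq2 rule in which the prior variance is the squared log-residual minus the expected sampling variance `trigamma((n_samples - n_coeffs) / 2)`, floored at 0.25. Both function names are hypothetical, and `n_samples=100` is an illustrative assumption, since the full sample list is not shown above.

```python
def trigamma(x: float) -> float:
    """Trigamma function via recurrence plus an asymptotic series."""
    result = 0.0
    # Shift x upwards until the asymptotic expansion is accurate.
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi1(x) ~ 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    series = inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))
    return result + inv + 0.5 * inv2 + series


def prior_dispersion_variance(squared_logres, n_samples, n_coeffs, floor=0.25):
    """Assumed DESeq2-style rule; not copied from PyDESeq2."""
    expected_var = trigamma((n_samples - n_coeffs) / 2)
    return max(squared_logres - expected_var, floor)


# n_samples=100 is an assumption; n_coeffs=2 (intercept + condition).
prior_var = prior_dispersion_variance(0.055924936547501455, 100, 2)
```

Because the squared log-residual is already below 0.25, the floor applies here whatever the sample count, which matches the printed value.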


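## MAP dispersions

The `fit_MAP_dispersions` method filters the genes to which dispersion shrinkage is applied. Indeed, for genes whose MLE dispersions lie too far above the trend curve, the original MLE value is kept. The final dispersion values used in downstream analysis are stored in `dds.varm["dispersions"]`.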

In [11]:
dds.fit_MAP_dispersions()

Fitting MAP dispersions...
... done in 1.10 seconds.



In [12]:
dds.varm["MAP_dispersions"]


array([0.88259824, 0.22257849, 0.83723751, 0.15897038, 0.24992574,
       0.97364737, 0.23515474, 0.19878066, 0.18652019, 0.63189957])

In [13]:
dds.varm["dispersions"]

array([0.88259824, 0.22257849, 0.83723751, 0.15897038, 0.24992574,
       0.97364737, 0.23515474, 0.19878066, 0.18652019, 0.63189957])
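
In the output above, `MAP_dispersions` and `dispersions` coincide, which suggests that no gene was flagged as a dispersion outlier in this run. The selection rule can be sketched as follows, assuming the DESeq2 convention that a gene keeps its MLE when its log dispersion exceeds the log trend by more than two residual standard deviations. `final_dispersions` and the toy values are hypothetical.

```python
import numpy as np


def final_dispersions(genewise, fitted, map_disp, squared_logres):
    """Keep the MLE for dispersion outliers, the MAP estimate otherwise."""
    genewise = np.asarray(genewise, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    map_disp = np.asarray(map_disp, dtype=float)
    # Assumed rule: outlier if more than 2 residual SDs above the trend.
    threshold = np.log(fitted) + 2 * np.sqrt(squared_logres)
    outlier = np.log(genewise) > threshold
    return np.where(outlier, genewise, map_disp)


# Toy values (made up): the second gene lies far above the trend curve.
toy_final = final_dispersions(
    genewise=[0.9, 5.0],
    fitted=[0.65, 0.31],
    map_disp=[0.88, 0.60],
    squared_logres=0.0559,
)
```

Here the first gene is shrunk to its MAP value, while the second keeps its MLE, since shrinking it towards the trend would understate its variability.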

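## Fit log fold changes

Note that in the `DeseqDataSet` object, log-fold changes (LFCs) are stored on the natural log scale, whereas the results DataFrame output by the `summary` method of `DeseqStats` displays LFCs on the log2 scale (see later on).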

In [14]:
dds.fit_LFC()


Fitting LFCs...
... done in 0.01 seconds.



In [15]:
dds.varm["LFC"]

Unnamed: 0,intercept,condition_B_vs_A
gene1,1.891436,0.438632
gene2,2.851662,0.373296
gene3,1.78778,-0.438645
gene4,4.741958,-0.285647
gene5,3.077798,0.403457
gene6,1.678536,0.00101
gene7,3.291025,0.093116
gene8,3.785129,-0.187604
gene9,3.682882,-0.147443
gene10,2.300515,0.267562
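
Converting between the two scales only requires dividing by `ln(2)`. A small sketch, using the `condition_B_vs_A` coefficient of gene1 shown above (`ln_to_log2` is a hypothetical helper):

```python
import math


def ln_to_log2(lfc_natural: float) -> float:
    """Convert a natural-log fold change to the log2 scale."""
    return lfc_natural / math.log(2)


# Natural-log LFC of gene1, as stored in dds.varm["LFC"].
gene1_log2_lfc = ln_to_log2(0.438632)
```

Up to rounding, this agrees with the `log2FoldChange` of 0.632812 reported for gene1 in the results table of this notebook.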


## Calculate Cook's distances and refit

Note that this step is optional.

In [16]:
dds.calculate_cooks()
if dds.refit_cooks:
    # Replace outlier counts
    dds.refit()

Calculating cook's distance...
... done in 0.01 seconds.

Replacing 0 outlier genes.



## Save everything


In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(dds, f)

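## Statistical analysis

Statistical analysis is performed with the `DeseqStats` class. The `DeseqStats` class has a single mandatory argument, `dds`, which should be a fitted `DeseqDataSet` object, as well as a set of optional keyword arguments, including:
- `alpha`: the p-value and adjusted p-value significance threshold
- `cooks_filter`: whether to filter p-values based on Cook's outliers
- `independent_filter`: whether to perform independent filtering to correct p-value trends.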

In [17]:
stat_res = DeseqStats(dds, alpha=0.05, cooks_filter=True, independent_filter=True)

## Wald tests

In [18]:
stat_res.run_wald_test()
stat_res.p_values

Running Wald tests...
... done in 1.07 seconds.



gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64
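
A Wald test compares the estimated LFC to its standard error and reads a two-sided p-value from the standard normal distribution. The sketch below uses only the standard library; `wald_test` is a hypothetical helper, and the inputs are gene1's `log2FoldChange` and `lfcSE` from the results table of this notebook.

```python
import math


def wald_test(lfc: float, lfc_se: float):
    """Return the Wald statistic and its two-sided normal p-value."""
    stat = lfc / lfc_se
    # For a standard normal Z, P(|Z| > |stat|) = erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# gene1: log2FoldChange and lfcSE as reported by stat_res.summary().
gene1_stat, gene1_p = wald_test(0.632812, 0.289101)
```

Up to rounding, this matches the `stat` (2.188898) and p-value (0.028604) reported for gene1. The ratio is the same on the natural-log and log2 scales, because both the LFC and its standard error are rescaled by the same constant.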

## Cooks filtering

This step is optional.

In [19]:
if stat_res.cooks_filter:
    stat_res._cooks_filtering()
stat_res.p_values

gene1     0.028604
gene2     0.000329
gene3     0.032075
gene4     0.000513
gene5     0.000168
gene6     0.996253
gene7     0.370297
gene8     0.047227
gene9     0.110391
gene10    0.114518
dtype: float64

## P-value adjustment

In [20]:
if stat_res.independent_filter:
    stat_res._independent_filtering()
else:
    stat_res._p_value_adjustment()

stat_res.padj

gene1     0.064150
gene2     0.001646
gene3     0.064150
gene4     0.001710
gene5     0.001646
gene6     0.996253
gene7     0.411441
gene8     0.078711
gene9     0.143147
gene10    0.143147
Name: 0, dtype: float64
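
The adjusted p-values above coincide with a plain Benjamini-Hochberg correction applied to all ten genes, which suggests that independent filtering did not exclude any gene in this run. A minimal NumPy sketch of that correction follows; `benjamini_hochberg` is a hypothetical helper, not the PyDESeq2 implementation.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


# Wald test p-values of gene1..gene10, as printed above.
p_values = [0.028604, 0.000329, 0.032075, 0.000513, 0.000168,
            0.996253, 0.370297, 0.047227, 0.110391, 0.114518]
padj = benjamini_hochberg(p_values)
```

Because the printed p-values are rounded to six decimals, the result agrees with `stat_res.padj` only up to rounding in the last digit.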

## Building a results DataFrame
This DataFrame is stored in the `results_df` attribute of the `DeseqStats` class.

In [21]:
stat_res.summary()

Log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.632812  0.289101  2.188898  0.028604  0.064150
gene2    21.281239        0.538552  0.149963  3.591236  0.000329  0.001646
gene3     5.010123       -0.632830  0.295236 -2.143476  0.032075  0.064150
gene4   100.517961       -0.412102  0.118629 -3.473868  0.000513  0.001710
gene5    27.142450        0.582065  0.154706  3.762409  0.000168  0.001646
gene6     5.413043        0.001457  0.310311  0.004696  0.996253  0.996253
gene7    28.294023        0.134338  0.149945  0.895917  0.370297  0.411441
gene8    40.358344       -0.270656  0.136401 -1.984261  0.047227  0.078711
gene9    37.166183       -0.212715  0.133243 -1.596437  0.110391  0.143147
gene10   11.589325        0.386011  0.244588  1.578207  0.114518  0.143147
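
A common next step is to keep only the genes that pass the significance threshold. The sketch below rebuilds two columns of the table above by hand so that it runs on its own; in the notebook, `stat_res.results_df` would be used directly instead of the stand-in `results_df`.

```python
import pandas as pd

# Stand-in for stat_res.results_df, copied from the summary above.
results_df = pd.DataFrame(
    {
        "log2FoldChange": [0.632812, 0.538552, -0.632830, -0.412102, 0.582065,
                           0.001457, 0.134338, -0.270656, -0.212715, 0.386011],
        "padj": [0.064150, 0.001646, 0.064150, 0.001710, 0.001646,
                 0.996253, 0.411441, 0.078711, 0.143147, 0.143147],
    },
    index=[f"gene{i}" for i in range(1, 11)],
)

alpha = 0.05  # same threshold as passed to DeseqStats above
significant = results_df[results_df["padj"] < alpha]
```

With these values, `significant` contains gene2, gene4 and gene5. Note that gene1, gene3 and gene8 have raw p-values below 0.05 but do not survive the multiple-testing adjustment.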


If `SAVE` is set to `True`, save everything.

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results_detailed_pipe.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

## LFC Shrinkage
For visualization or post-processing purposes, it may be suitable to perform LFC shrinkage. This is implemented by the `lfc_shrink` method.

In [22]:
stat_res.lfc_shrink(coeff="condition_B_vs_A")

Shrunk log2 fold change & Wald test p-value: condition B vs A
          baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
gene1     8.541317        0.408253  0.294276  2.188898  0.028604  0.064150
gene2    21.281239        0.480145  0.151201  3.591236  0.000329  0.001646
gene3     5.010123       -0.396066  0.300796 -2.143476  0.032075  0.064150
gene4   100.517961       -0.374191  0.118703 -3.473868  0.000513  0.001710
gene5    27.142450        0.521487  0.156210  3.762409  0.000168  0.001646
gene6     5.413043        0.000716  0.239203  0.004696  0.996253  0.996253
gene7    28.294023        0.103421  0.141496  0.895917  0.370297  0.411441
gene8    40.358344       -0.226288  0.133477 -1.984261  0.047227  0.078711
gene9    37.166183       -0.175746  0.129138 -1.596437  0.110391  0.143147
gene10   11.589325        0.239935  0.231986  1.578207  0.114518  0.143147


Fitting MAP LFCs...
... done in 0.01 seconds.



## Save everything

In [None]:
if SAVE:
    with open(
        os.path.join(OUTPUT_PATH, "shrunk_stat_results_detailed_pipe.pkl"), "wb"
    ) as f:
        pkl.dump(stat_res, f)
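
To reuse saved results in a later session, the pickled objects can be loaded back with `pickle.load`. The round trip below is a self-contained sketch that writes to a temporary directory and uses a plain dictionary as a stand-in for `dds` or `stat_res`.

```python
import os
import pickle as pkl
import tempfile

# Stand-in object; in the notebook this would be `dds` or `stat_res`.
stand_in = {"padj": {"gene2": 0.001646}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stat_results_detailed_pipe.pkl")
    with open(path, "wb") as f:
        pkl.dump(stand_in, f)
    # Reload the object exactly as it was saved.
    with open(path, "rb") as f:
        restored = pkl.load(f)
```

Only unpickle files from sources you trust, and keep in mind that a pickle written with one version of a library may fail to load with another.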