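pingouin is a Python package for statistical analysis that implements many of the statistical methods used in psychological research. It is not as comprehensive as the statistical tooling available in R, but it covers many everyday needs that scipy.stats does not. For that reason, Lu would like to introduce pingouin to the Chinese-speaking community.  
Official pingouin website: https://pingouin-stats.org  
pingouin source code on GitHub: https://github.com/raphaelvallat/pingouin  
Article to cite when using pingouin: Vallat, R. (2018). Pingouin: statistics in Python. Journal of Open Source Software, 3(31), 1026, https://doi.org/10.21105/joss.01026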

Installing pingouin:  
Run in a terminal: pip install pingouin  
That completes the installation.  

Getting started with pingouin

This post walks through the pingouin implementations of a few commonly used statistical methods.

(1) T-test

In [1]:
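# Randomly generate two groups of data (condition A and condition B), 30 samples each
import numpy as np
A = np.random.normal(loc=0, size=30) # draw 30 samples for condition A
                              # from a normal distribution with mean 0
B = np.random.normal(loc=1, size=30) # draw 30 samples for condition B
                              # from a normal distribution with mean 1
# (Bayesian) T-test
import pingouin as pg
from pingouin import ttest
# Independent-samples t-test
# Computed with the ttest() function: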
ttest(A, B)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-4.860054,58,two-sided,9e-06,"[-1.53, -0.64]",1.254861,1797.699,0.997585


In [2]:
# Paired-samples t-test
ttest(A, B, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-5.178556,29,two-sided,1.5e-05,"[-1.51, -0.66]",1.254861,1418.591,1.0


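The columns show, in order: the t-value, the degrees of freedom, whether the test is one- or two-sided, the p-value, the 95% confidence interval, the effect size (Cohen's d), the Bayes factor (BF10), and the statistical power.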
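
To make the T and dof columns concrete, here is a minimal sketch that computes both statistics by hand with the Python standard library. The toy lists `a` and `b` are made up for illustration and are not the random data above. The independent-samples version assumes equal variances (the pooled Student formula), which is consistent with the 58 degrees of freedom reported for two groups of 30.

```python
import math
from statistics import mean, variance

# Toy data, made up for illustration (not the random A and B above)
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
n_a, n_b = len(a), len(b)

# Independent samples: pooled-variance (Student) t statistic
pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
t_ind = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
dof_ind = n_a + n_b - 2

# Paired samples: one-sample t statistic on the pairwise differences
diffs = [x - y for x, y in zip(a, b)]
t_paired = mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
dof_paired = len(diffs) - 1

print(t_ind, dof_ind)
print(t_paired, dof_paired)
```

The paired test has fewer degrees of freedom (29 instead of 58 in the outputs above) because it works on the 30 differences rather than on 60 separate observations.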

To work with the results in code, capture the return value:

In [3]:
stats = ttest(A, B, paired=True)
# stats is a pandas.DataFrame; to access a specific value,
# taking the t-value as an example:
print(stats['T'])

T-test   -5.178556
Name: T, dtype: float64
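
Note that `stats['T']` returns a one-element pandas Series rather than a plain number. Below is a minimal sketch of pulling out a scalar. It uses a hand-built stand-in DataFrame (an assumption for illustration, with values copied from the paired t-test output above) so that it runs without pingouin.

```python
import pandas as pd

# Stand-in for the DataFrame returned by ttest(); values copied from the output above
stats = pd.DataFrame(
    {"T": [-5.178556], "dof": [29], "p-val": [1.5e-05]},
    index=["T-test"],
)

t_value = stats.loc["T-test", "T"]   # select by row label and column name
p_value = stats["p-val"].iloc[0]     # or take the first row by position
print(float(t_value), float(p_value))
```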


(2) Correlation analysis

Pearson correlation is the default:

In [4]:
# Computed with the corr() function:
pg.corr(A, B)

Unnamed: 0,n,r,CI95%,p-val,BF10,power
pearson,30,0.119835,"[-0.25, 0.46]",0.528203,0.274,0.096748


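The columns give, in order: the sample size, the correlation coefficient, the 95% confidence interval, the p-value, the Bayes factor (BF10), and the statistical power.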
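
As a sanity check on the CI95% column, the interval can be reproduced from `r` and `n` alone. This is a sketch using the Fisher z-transform, a standard method for correlation confidence intervals. Whether pingouin takes exactly these steps internally is an assumption here, although the result agrees with the output above after rounding.

```python
import math
from statistics import NormalDist

r, n = 0.119835, 30   # values copied from the Pearson output above

z = math.atanh(r)                      # Fisher z-transform of r
se = 1 / math.sqrt(n - 3)              # standard error of z
z_crit = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value (about 1.96)

ci_low = math.tanh(z - z_crit * se)    # transform the bounds back to the r scale
ci_high = math.tanh(z + z_crit * se)
print(round(ci_low, 2), round(ci_high, 2))
```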

Other correlation methods can be selected through the method parameter, for example 'spearman', 'kendall' or 'bicor'. Taking Spearman correlation as an example:

In [5]:
pg.corr(A, B, method='spearman')

Unnamed: 0,n,r,CI95%,p-val,power
spearman,30,0.010456,"[-0.35, 0.37]",0.956267,0.049835
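
Spearman's correlation is Pearson's correlation computed on the ranks of the data, which is why it captures any monotonic relationship rather than only linear ones. Here is a minimal standard-library sketch with made-up data. The helper names `ranks` and `pearson` are hypothetical (they are not pingouin functions), and the ranking helper assumes there are no tied values.

```python
import math

def ranks(values):
    """Rank values from 1 to n (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r_pearson = pearson(x, y)
r_spearman = pearson(ranks(x), ranks(y))
print(r_pearson, r_spearman)
```

Because the ordering of `x` and `y` is identical, the rank-based coefficient reaches 1 (up to floating-point error) while the Pearson coefficient stays below it.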


(3) Analysis of variance (ANOVA)

In [6]:
# Load one of the built-in example datasets
data = pg.read_dataset('mixed_anova')
# The data has four columns: Scores, Time, Group and Subject
print(data)

       Scores    Time       Group  Subject
0    5.971435  August     Control        0
1    4.309024  August     Control        1
2    6.932707  August     Control        2
3    5.187348  August     Control        3
4    4.779411  August     Control        4
..        ...     ...         ...      ...
175  6.176981    June  Meditation       55
176  8.523692    June  Meditation       56
177  6.522273    June  Meditation       57
178  4.990568    June  Meditation       58
179  7.822986    June  Meditation       59

[180 rows x 4 columns]


In [7]:
# One-way ANOVA
pg.anova(data=data, dv='Scores', between='Group', detailed=True)
# Computed with the anova() function.
# In the call above, dv is the dependent variable (Y) column and between is the between-subject factor column

Unnamed: 0,Source,SS,DF,MS,F,p-unc,np2
0,Group,5.459963,1,5.459963,5.243656,0.0232,0.028616
1,Within,185.342729,178,1.041251,,,
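
The columns of this table are linked by simple arithmetic: each mean square (MS) is SS / DF, F is the ratio of the two mean squares, and np2 (partial eta-squared) is the effect's share of the effect-plus-error sum of squares. A quick check, using only numbers copied from the table above:

```python
# Values copied from the one-way ANOVA table above
ss_group, df_group = 5.459963, 1
ss_within, df_within = 185.342729, 178

ms_group = ss_group / df_group            # mean square of the effect
ms_within = ss_within / df_within         # mean square of the error term
f_value = ms_group / ms_within            # F statistic
np2 = ss_group / (ss_group + ss_within)   # partial eta-squared

print(round(ms_within, 6), round(f_value, 4), round(np2, 6))
```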


In [8]:
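# Repeated-measures ANOVA
pg.rm_anova(data=data, dv='Scores', within='Time', subject='Subject', detailed=True)
# Computed with the rm_anova() function.
# In the call above, dv is the dependent variable (Y) column, within is the within-subject factor column, and subject is the subject identifier column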

Unnamed: 0,Source,SS,DF,MS,F,p-unc,np2,eps
0,Time,7.628428,2,3.814214,3.912796,0.022629,0.062194,0.998751
1,Error,115.027023,118,0.974805,,,,


In [9]:
# Two-way mixed ANOVA (one between-subject factor, one within-subject factor)
pg.mixed_anova(data=data, dv='Scores', between='Group', within='Time', subject='Subject')
# Computed with the mixed_anova() function

Unnamed: 0,Source,SS,DF1,DF2,MS,F,p-unc,np2,eps
0,Group,5.459963,1,58,5.459963,5.051709,0.02842,0.08012,
1,Time,7.628428,2,116,3.814214,4.027394,0.020369,0.064929,0.998751
2,Interaction,5.167192,2,116,2.583596,2.727996,0.069545,0.044922,


(4) Multiple comparisons

In [10]:
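# Example: pairwise comparisons across the levels of Time
# Note: recent pingouin releases rename this function to pairwise_tests();
# use that name if pairwise_ttests() is unavailable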
pg.pairwise_ttests(data=data, dv='Scores', within='Time', subject='Subject')

Unnamed: 0,Contrast,A,B,Paired,Parametric,T,dof,alternative,p-unc,BF10,hedges
0,Time,August,January,True,True,-1.74037,59.0,two-sided,0.087008,0.582,-0.327583
1,Time,August,June,True,True,-2.743238,59.0,two-sided,0.008045,4.232,-0.482547
2,Time,January,June,True,True,-1.02362,59.0,two-sided,0.310194,0.232,-0.16952


In [11]:
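# Correct the results for multiple comparisons; the padjust parameter selects the correction method. Bonferroni correction as an example:
pg.pairwise_ttests(data=data, dv='Scores', within='Time', subject='Subject', padjust='bonf')
# Other correction methods can also be passed to padjust,
# such as 'sidak', 'holm', 'fdr_bh' and 'fdr_by'; the default, 'none', applies no correction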

Unnamed: 0,Contrast,A,B,Paired,Parametric,T,dof,alternative,p-unc,p-corr,p-adjust,BF10,hedges
0,Time,August,January,True,True,-1.74037,59.0,two-sided,0.087008,0.261023,bonf,0.582,-0.327583
1,Time,August,June,True,True,-2.743238,59.0,two-sided,0.008045,0.024134,bonf,4.232,-0.482547
2,Time,January,June,True,True,-1.02362,59.0,two-sided,0.310194,0.930581,bonf,0.232,-0.16952
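
The p-corr column follows directly from p-unc: Bonferroni multiplies each uncorrected p-value by the number of comparisons and caps the result at 1. The sketch below reproduces that arithmetic from the values in the table and adds the Holm step-down variant for comparison. The helper names `bonferroni` and `holm` are hypothetical, not pingouin functions.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]

def holm(pvals):
    """Holm step-down: scale the k-th smallest p-value (k from 0) by (m - k)
    and keep the adjusted values non-decreasing."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, idx in enumerate(order):
        running_max = max(running_max, min(pvals[idx] * (m - k), 1.0))
        adjusted[idx] = running_max
    return adjusted

# p-unc values copied from the table above (3 pairwise comparisons)
p_unc = [0.087008, 0.008045, 0.310194]

p_bonf = bonferroni(p_unc)
p_holm = holm(p_unc)
print([round(p, 6) for p in p_bonf])
print([round(p, 6) for p in p_holm])
```

The last digit can differ from the table because the p-unc values shown there are already rounded.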


(5) Multiple linear regression

In [12]:
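# Example: fitting C ~ A + B (main effects only, no interaction term)
C = np.random.normal(loc=5, size=30)
# Computed with the linear_regression() function;
# A and B are stacked as the two predictor columns (shape 30 x 2):
pg.linear_regression(np.column_stack([A, B]), C)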

Unnamed: 0,names,coef,se,T,pval,r2,adj_r2,CI[2.5%],CI[97.5%]
0,Intercept,4.648561,0.293522,15.837195,3.431727e-15,0.063433,-0.005942,4.046304,5.250818
1,x1,0.203245,0.202106,1.005636,0.3235189,0.063433,-0.005942,-0.211442,0.617932
2,x2,0.173755,0.2236,0.777078,0.4438682,0.063433,-0.005942,-0.285035,0.632545
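
The call above fits main effects only, which is why the output lists just Intercept, x1 and x2. To model an interaction, the product of the two predictors has to be added as its own column. Below is a minimal NumPy sketch with noise-free synthetic data (the coefficients 2, 1, 3 and 0.5 are made up so that the fit can be checked). It uses np.linalg.lstsq rather than pingouin so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=0, size=30)
b = rng.normal(loc=1, size=30)

# Synthetic outcome with known coefficients and no noise (made up for illustration)
c = 2.0 + 1.0 * a + 3.0 * b + 0.5 * a * b

# Predictor matrix: main effects plus an explicit interaction column
X = np.column_stack([a, b, a * b])

# Ordinary least squares; the leading column of ones models the intercept
design = np.column_stack([np.ones(len(c)), X])
coef, *_ = np.linalg.lstsq(design, c, rcond=None)
print(np.round(coef, 6))   # order: intercept, a, b, a*b
```

The same three-column matrix `X` can be passed as the first argument of pg.linear_regression(), which adds the intercept itself by default.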


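Other features: pingouin also includes many more statistical functions as well as plotting functions, too many to list in this post. See the Functions page on the official site: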
https://pingouin-stats.org/api.html