[This article draws on Python's scipy.stats](https://mp.weixin.qq.com/s/qjuSeoodOaom51lcRRBh1g)

A brief overview of when to use three correlation coefficients (Pearson, Spearman, and Kendall's tau rank correlation) and how to compute each of them in Python.

In [None]:
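1. Common variable types in statistics
To make the rest easier to follow, here is a brief review of the variable types commonly used in statistics.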

In [None]:
2. Pearson correlation coefficient

scipy.stats.pearsonr(x, y)
The Pearson correlation coefficient measures the linear relationship between two datasets 「i.e., how linearly related the two datasets are」.

The calculation of the p-value relies on the assumption that each dataset is normally distributed 「both datasets are assumed to be normally distributed, which in practice means the data must be continuous」.

Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. 「Pearson's coefficient ranges from -1 to 1: negative values mean negative correlation, 0 means no linear correlation, and positive values mean positive correlation. The figure below illustrates this relationship.」

In [2]:
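from scipy import stats
import numpy as np

# Example data: a binary variable and a linearly increasing one
a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)

# pearsonr returns the correlation coefficient and the two-sided p-value
c, p = stats.pearsonr(a, b)
print(f'Pearson correlation: {c}')
print(f'p-value: {p}')

Pearson correlation: 0.8660254037844386
p-value: 0.011724811003954649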
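
To make the definition concrete, the following minimal sketch recomputes the coefficient by hand as the covariance of the two variables divided by the product of their standard deviations, and compares it with `stats.pearsonr`. The helper name `pearson_from_definition` is hypothetical and introduced only for illustration; it is not part of scipy.

```python
import numpy as np
from scipy import stats


def pearson_from_definition(x, y):
    """Pearson r: sum of co-deviations over the product of deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)

r_manual = pearson_from_definition(a, b)
r_scipy = stats.pearsonr(a, b)[0]  # first element is the coefficient

# The two values should agree up to floating-point rounding
print(np.isclose(r_manual, r_scipy))
```

The manual version only reproduces the coefficient; the p-value still comes from scipy and still rests on the normality assumption quoted above.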


In [None]:
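3. Spearman correlation coefficient
When to use: consider this coefficient when any of the prerequisites for the Pearson correlation coefficient is not met.

Spearman's coefficient is computed much like Pearson's, except that both variables are first converted to ranks. The scipy documentation describes it as follows: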

scipy.stats.spearmanr(a, b=None, axis=0, nan_policy='propagate')
Calculate a Spearman correlation coefficient with associated p-value. The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets 「the paired values of each variable are sorted and replaced by their ranks」. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed 「the two datasets need not be normally distributed」.

Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases 「the coefficient ranges from -1 to 1: negative values mean negative correlation and 0 means no correlation」.
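
As an illustration of "exact monotonic relationship", the sketch below uses made-up data (y is the cube of x, chosen only for this example) that is perfectly monotonic but not linear. Spearman's coefficient should come out at 1, while Pearson's stays below 1.

```python
import numpy as np
from scipy import stats

# Illustrative data (an assumption for this example): strictly increasing but nonlinear
x = np.arange(1, 11)
y = x ** 3

rho = stats.spearmanr(x, y)[0]      # rank-based: only the ordering matters
r_linear = stats.pearsonr(x, y)[0]  # sensitive to the departure from a straight line

print(f'Spearman: {rho}')
print(f'Pearson: {r_linear}')
```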

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so 「the p-value is probably only trustworthy when the dataset has more than about 500 elements」.

In [3]:
# Using Python's scipy
from scipy import stats

# spearmanr returns the rank correlation coefficient and the p-value
s, p1 = stats.spearmanr([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(f'Spearman correlation: {s}')
print(f'p-value: {p1}')

Spearman correlation: 0.8207826816681233
p-value: 0.08858700531354381
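
The statement that Spearman is "Pearson on ranks" can be checked directly. This sketch ranks both variables with `stats.rankdata` (tied values receive the average of their ranks) and feeds the ranks to `stats.pearsonr`. The variable names are illustrative only.

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]

# Convert each variable to ranks; the two 7s in y share the average rank 3.5
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

rho_via_ranks = stats.pearsonr(rank_x, rank_y)[0]
rho_scipy = stats.spearmanr(x, y)[0]

print(rank_y)
print(abs(rho_via_ranks - rho_scipy) < 1e-12)
```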


In [None]:
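4. Kendall rank correlation coefficient (Kendall's tau)
When to use: unlike the previous two coefficients, Kendall's tau is intended for ordered categorical (ordinal) data and measures their ordinal association. The scipy documentation describes it as follows: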

scipy.stats.kendalltau(x, y, initial_lexsort=None, nan_policy='propagate', method='auto')

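Calculate Kendall’s tau, a correlation measure for ordinal data 「assesses the correlation of ordered categorical variables (ordinal data)」. Kendall’s tau is a measure of the correspondence between two rankings 「measures the rank correlation between two sets of variables」. Values close to 1 indicate strong agreement, values close to -1 indicate strong disagreement 「a coefficient near 1 means the two rankings agree strongly; a coefficient near -1 means they are strongly reversed, which is a strong negative association rather than an absence of correlation」.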

In [4]:
from scipy import stats

x1 = [12, 2, 1, 12, 2]
x2 = [1, 4, 7, 1, 0]

# kendalltau returns the tau statistic and the p-value
k, p2 = stats.kendalltau(x1, x2)
print(f"Kendall's tau correlation: {k}")
print(f'p-value: {p2}')

Kendall's tau correlation: -0.4714045207910316
p-value: 0.2827454599327748
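
To show where this number comes from, here is a minimal sketch of the tau-b variant, assuming that is the variant `stats.kendalltau` computes by default. It counts concordant and discordant pairs and corrects for ties. The helper `kendall_tau_b` is a hypothetical name written for this example; it is quadratic in the number of observations and meant for illustration, not for large datasets.

```python
from itertools import combinations
from math import sqrt

from scipy import stats


def kendall_tau_b(x, y):
    """Tau-b from pair counts: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both variables order the pair the same way
        else:
            discordant += 1  # the variables order the pair in opposite ways
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + ties_x_only) * (untied + ties_y_only)
    )


x1 = [12, 2, 1, 12, 2]
x2 = [1, 4, 7, 1, 0]

tau_manual = kendall_tau_b(x1, x2)
tau_scipy = stats.kendalltau(x1, x2)[0]
print(abs(tau_manual - tau_scipy) < 1e-12)
```

For this data there are 2 concordant and 6 discordant pairs, one pair tied only in `x1`, and one pair tied in both, giving (2 - 6) / sqrt(9 * 8), roughly -0.4714.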
