# Chi-Squared Goodness-Of-Fit Test

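* Goodness-of-fit test: compares **two distributions** (observed vs. expected). Only one categorical variable is involved, but that variable has several levels.
      * stats.chisquare(f_obs=observed, f_exp=expected)
          -- The chi-square test tests the null hypothesis that the categorical data has the given frequencies.

* Test of independence: examines **two variables** (whether they are associated). Within a single sample, the levels of the two variables are cross-tabulated against each other.
      * stats.chi2_contingency()
        -- Chi-square test of independence of variables in a contingency table.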

In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for **categorical variables**:

**It tests whether the distribution of sample categorical data matches an expected distribution.**


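For example, it can check whether the racial demographics of one state match the racial distribution of the whole U.S. population.

Handling categorical data: work with the counts of each category rather than the raw observations (labels such as "male" and "female" have no numeric meaning).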
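
As a small illustration of replacing labels with counts, the sketch below tallies a hypothetical list of labels with pandas. The `labels` data are made up for the example and are not part of the data sets used later.

```python
import pandas as pd

# Hypothetical raw observations of a categorical variable
labels = pd.Series(["female", "male", "female", "female", "male"])

# The labels themselves carry no numeric meaning; their frequencies do
counts = labels.value_counts().sort_index()
print(counts)
```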

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [28]:
# Build the race data sets:
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)
           

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")  # tabulate category frequencies with a cross-tabulation (contingency table)
minnesota_table = pd.crosstab(index=minnesota[0], columns="count")

print( "National")
print(national_table)
print(" ")
print( "Minnesota")
print(minnesota_table)

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000
 
Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


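The chi-squared test is based on the chi-squared test statistic, which measures how far the observed counts of a categorical variable are from the expected counts: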
![chi-s](chi-test.png) ![chi](chi-square.png)
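
In case the images do not render: the statistic they refer to is presumably the standard Pearson chi-squared statistic. It is written here with $O_i$ for the observed count and $E_i$ for the expected count of category $i$, over $k$ categories. This notation is an assumption and is not taken from the images.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$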

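Null hypothesis: the racial distribution of Minnesota matches the national racial distribution.

Alternative hypothesis: the two distributions differ.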

In [29]:
observed = minnesota_table

national_ratios = national_table/len(national)  # Get population ratios

expected = national_ratios * len(minnesota)  # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum() # sum the contributions of all categories

print(chi_squared_stat)

col_0
count    18.194805
dtype: float64
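
The same statistic can be reproduced with plain NumPy arrays, which makes each category's contribution visible. This is a sketch built from the counts printed above; the variable names are chosen for illustration only.

```python
import numpy as np

# Counts in the order asian, black, hispanic, other, white (from the tables above)
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
minnesota_counts = np.array([75, 250, 300, 150, 600])

# Expected counts: national proportions scaled to the Minnesota sample size
expected_counts = national_counts / national_counts.sum() * minnesota_counts.sum()

# Per-category contributions (O - E)^2 / E, then their sum
contributions = (minnesota_counts - expected_counts) ** 2 / expected_counts
chi2_stat = contributions.sum()

print(contributions.round(3))
print(chi2_stat)
```

The "white" and "other" categories account for most of the total, so they are where Minnesota departs most from the national proportions.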


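t-test: test statistic t -> t distribution -> compare with the critical value / p-value -> significant or not

chi-squared test: test statistic chi-squared -> chi-squared distribution -> compare with the critical value / p-value -> significant or not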

stats.chi2:

In [30]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for the 95% confidence level
                      df = 4)   # Df = number of variable categories - 1

print('Critical value', crit)

p_value = 1-stats.chi2.cdf(x=chi_squared_stat, # Find the p-value
                           df=4)

print('p-value', p_value)

Critical value 9.48772903678
p-value [ 0.00113047]


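Comparison: the statistic exceeds the critical value (equivalently, the p-value is below 0.05), so we reject the null hypothesis. The two distributions differ significantly.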
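
A sketch of the same decision using `stats.chi2.sf`, the survival function, which equals `1 - cdf` but is numerically more accurate for very small p-values. The statistic is copied from the output above, and `alpha = 0.05` is assumed to match the 95% level used here.

```python
import scipy.stats as stats

chi2_stat = 18.194805   # statistic copied from the output above
df = 4                  # 5 categories - 1
alpha = 0.05            # significance level matching the 95% confidence level

crit = stats.chi2.ppf(1 - alpha, df)    # critical value
p_value = stats.chi2.sf(chi2_stat, df)  # upper-tail probability

reject = chi2_stat > crit               # equivalent to p_value < alpha
print(crit, p_value, reject)
```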

In [31]:
# A more direct way to run the chi-squared goodness-of-fit test: scipy.stats.chisquare():
stats.chisquare(f_obs=observed,
                f_exp=expected)

Power_divergenceResult(statistic=array([ 18.19480519]), pvalue=array([ 0.00113047]))
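
The array-valued result above comes from passing one-column DataFrames. As a sketch, passing 1-D arrays instead yields scalar values; `f_obs` and `f_exp` below are rebuilt from the printed counts rather than taken from the notebook variables.

```python
import numpy as np
import scipy.stats as stats

# Counts in the order asian, black, hispanic, other, white
f_obs = np.array([75, 250, 300, 150, 600])
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
f_exp = national_counts / national_counts.sum() * f_obs.sum()

# With the default ddof=0, the p-value uses df = number of categories - 1
statistic, pvalue = stats.chisquare(f_obs=f_obs, f_exp=f_exp)
print(statistic, pvalue)
```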

# Chi-Squared Test of Independence

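Independence describes whether two variables are related, that is, whether knowing the value of one tells you anything about the value of the other.

Null hypothesis H0: the two variables are independent ("=")

Alternative hypothesis Ha: the two variables are not independent ("≠")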

In [47]:
# Example: two variables, race and party affiliation; test whether they are independent
# Build the data:
np.random.seed(10)

# Sample data randomly at fixed probabilities
voter_race = np.random.choice(a= ["asian","black","hispanic","other","white"],
                              p = [0.05, 0.15 ,0.25, 0.05, 0.5],
                              size=1000)

# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a= ["democrat","independent","republican"],
                              p = [0.4, 0.2, 0.4],
                              size=1000)

voters = pd.DataFrame({"race":voter_race, 
                       "party":voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins = True) # add row/column margins (totals)

voter_tab.columns = ["democrat","independent","republican","row_totals"] # rename the margin column, originally "All"

voter_tab.index = ["asian","black","hispanic","other","white","col_totals"]

observed = voter_tab.iloc[0:5,0:3]   # Get table without totals for later use
voter_tab

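# The independence test works on a contingency table: one table that lists every combination of the two variables, race and party. It is therefore also called a "contingency table test".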

| | democrat | independent | republican | row_totals |
|---|---|---|---|---|
| asian | 21 | 7 | 32 | 60 |
| black | 65 | 25 | 64 | 154 |
| hispanic | 107 | 50 | 94 | 251 |
| other | 15 | 8 | 15 | 38 |
| white | 189 | 96 | 212 | 497 |
| col_totals | 397 | 186 | 417 | 1000 |


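The first cell means: among Asian voters, 21 are Democrats.

When the independence assumption holds, the expected count of each cell in the observed contingency table is computed as: ![expected](expected.png)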
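
In case the image does not render: under independence, the expected count of a cell is conventionally its row total times its column total, divided by the grand total, which is what the next cell computes. The symbols $E_{ij}$, $R_i$, $C_j$, and $N$ are notation assumed here, not taken from the image.

$$
E_{ij} = \frac{R_i \, C_j}{N}
$$

For the asian/democrat cell this gives $60 \times 397 / 1000 = 23.82$.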

In [52]:
expected = np.outer(voter_tab["row_totals"].iloc[0:5],  # the outer product of the marginal totals gives every cell's expected count
                    voter_tab.loc["col_totals"].iloc[0:3]) / 1000

expected = pd.DataFrame(expected) 

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected

| | democrat | independent | republican |
|---|---|---|---|
| asian | 23.82 | 11.16 | 25.02 |
| black | 61.138 | 28.644 | 64.218 |
| hispanic | 99.647 | 46.686 | 104.667 |
| other | 15.086 | 7.068 | 15.846 |
| white | 197.309 | 92.442 | 207.249 |


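Next, as before, we compute the chi-squared test statistic, the critical value, and the p-value. The only difference is that the table is now two-dimensional, so we sum twice (first down each column, then across the column sums):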
![twice](twicesum.png)

In [53]:
# Compute the test statistic
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)

7.169321280162059
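
As a cross-check, the sketch below recomputes the statistic with NumPy arrays built from the tables above. On an `ndarray`, a single `.sum()` adds every cell, so the double `.sum().sum()` is only needed for DataFrames. The names `obs` and `exp` are illustrative.

```python
import numpy as np

# Observed counts copied from the table above
# rows: asian, black, hispanic, other, white
# columns: democrat, independent, republican
obs = np.array([[21, 7, 32],
                [65, 25, 64],
                [107, 50, 94],
                [15, 8, 15],
                [189, 96, 212]])

# Expected counts under independence: outer product of the margins / grand total
exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

chi2_stat = ((obs - exp) ** 2 / exp).sum()  # sums over all 15 cells at once
print(chi2_stat)
```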


In [54]:
# Critical value (quantile) of the chi-squared distribution at the 95% confidence level

crit = stats.chi2.ppf(q=0.95,  # Find the critical value for the 95% confidence level
                      df=8)    # (5-1) x (3-1) = 8

print('critical: ',crit)

p_value = 1-stats.chi2.cdf(x=chi_squared_stat,
                           df=8)

print('p_value: ', p_value)
                           

critical:  15.5073130559
p_value:  0.518479392949


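p_value > 0.05, so we fail to reject H0. The data show no significant association between the two variables, which is consistent with independence.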

#### Use stats.chi2_contingency():

In [55]:
stats.chi2_contingency(observed=observed)

(7.1693212801620589,
 0.51847939294884204,
 8,
 array([[  23.82 ,   11.16 ,   25.02 ],
        [  61.138,   28.644,   64.218],
        [  99.647,   46.686,  104.667],
        [  15.086,    7.068,   15.846],
        [ 197.309,   92.442,  207.249]]))
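
The four returned values are, in order, the test statistic, the p-value, the degrees of freedom, and the expected counts. A sketch that unpacks them by name, using the observed counts copied from the table above (the variable names are illustrative):

```python
import numpy as np
import scipy.stats as stats

# rows: asian, black, hispanic, other, white
# columns: democrat, independent, republican
obs = np.array([[21, 7, 32],
                [65, 25, 64],
                [107, 50, 94],
                [15, 8, 15],
                [189, 96, 212]])

# Yates' continuity correction is only applied when dof == 1, so it has no effect here
chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(obs)

print(chi2_stat, p_value, dof)
print(expected_counts.round(3))
```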

# Wrap up

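The chi-squared test is used to study:
1. whether the distribution of a categorical variable in a sample differs from its expected (population) distribution
2. whether categorical variables are independent (or associated)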