Review hypothesis testing and statistical significance, and use scipy to run Student's t-tests.

In [1]:
# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats

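SciPy is one of the most useful Python libraries for data science. It is built on top of numpy, and the broader SciPy ecosystem also includes pandas, the plotting library matplotlib, and other scientific packages.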

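When we do hypothesis testing, we are really concerned with two statements. The first is the explanation we actually want to argue for, called the alternative hypothesis. The second is that this explanation is not supported, i.e. there is no real effect, called the null hypothesis. The test decides whether the data give us enough evidence to reject the null hypothesis; it does not prove that the null hypothesis is true.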

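For example, given several groups of data, the null hypothesis is that they come from the same population. If we find that they really do differ, we can reject the null hypothesis and accept the alternative. Consider the following grades data: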

In [2]:
df = pd.read_csv(
    "C:\\Users\\asus\\Applied Data Science with Python\\Introduction to Data Science in Python\\grades.csv")
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [3]:
# Looking at the data, there are 6 different assignments. Let's check the size of this DataFrame.
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [4]:
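# Let's split the sample into two groups: students who submitted assignment 1 before 2016
# are the "early finishers"; those who submitted it in 2016 or later are the "late finishers"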

early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


How do we extract late_finishers as well? There are several ways to do it.

In [5]:
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


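There are in fact many other ways to extract late_finishers. For example, we could reuse the approach used for early_finishers with the comparison flipped, but then changing the cutoff date later would mean editing two places. We could also join df with early_finishers; a left join keeps every row of the left DataFrame, so the rows without a match in early_finishers are the late finishers. Or we could write a function that decides which group a student belongs to and filter with .apply().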
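The alternatives described above can be sketched as follows. This is only an illustration on a made-up four-row table, not the real grades.csv; the names `toy`, `cutoff`, `is_early` and `label` are hypothetical. Keeping the cutoff in one variable avoids having to edit two places when the split date changes.

```python
import pandas as pd

# Hypothetical miniature stand-in for grades.csv (values are made up)
toy = pd.DataFrame({
    "assignment1_grade": [90.0, 80.0, 70.0, 60.0],
    "assignment1_submission": [
        "2015-11-02 06:55:34", "2016-01-09 05:36:02",
        "2015-12-13 17:06:10", "2016-04-30 06:50:39",
    ],
})

cutoff = "2016"  # the single place to change the split date
is_early = pd.to_datetime(toy["assignment1_submission"]) < cutoff
early = toy[is_early]

# Option 1: negate the same boolean mask
late_by_mask = toy[~is_early]

# Option 2: keep the rows whose index is not among the early finishers
late_by_index = toy[~toy.index.isin(early.index)]

# Option 3: label each row with a function, then filter on the label
def label(ts):
    return "early" if pd.Timestamp(ts) < pd.Timestamp(cutoff) else "late"

late_by_apply = toy[toy["assignment1_submission"].apply(label) == "late"]

print(late_by_mask.index.tolist())
```

All three options select the same rows; the mask version is usually the shortest and avoids a Python-level function call per row.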

In [6]:
# We can call .mean() directly on a DataFrame column

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


The two means look very close, but are they the same? And how do we define "close"? Let's apply Student's t-test, with a significance level of $\alpha = 0.05$.

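We will use the ttest_ind() function, which runs an independent-samples t-test (meaning the two groups are assumed to be drawn independently of each other). ttest_ind() returns the t-statistic and the p-value.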
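To make the two returned numbers less of a black box, here is a minimal sketch that recomputes them by hand. It assumes the default `equal_var=True` behaviour of `ttest_ind` (the pooled-variance Student's t-test) and uses made-up normal samples rather than the grades data; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)        # seeded so the sketch is repeatable
a = rng.normal(75, 10, size=200)      # made-up "grades" for group A
b = rng.normal(74, 10, size=300)      # made-up "grades" for group B

n1, n2 = len(a), len(b)
# Pooled sample variance (ddof=1 gives the unbiased per-group estimates)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
# t-statistic: difference of means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The hand-computed values should agree with scipy's up to floating-point error.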


In [7]:
# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)

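The p-value is about 0.19, which is greater than 0.05, so we cannot reject the null hypothesis. In other words, we do not have enough evidence that the two groups come from populations with different means.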

In [8]:
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


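Judging from the data, we do not have enough evidence that the two populations differ in their grades. It is still worth looking at the p-values, since they can inform how we design future experiments. For example, the p-value for assignment 3 is about 0.1, so had we set $\alpha$ to 0.11 the result would have counted as significant. Note that $\alpha$ has to be fixed before looking at the data; raising it afterwards to reach significance is not valid, so a near-threshold p-value like this one is only a hint that the question may deserve a follow-up study.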

In [9]:
# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
# Note: each np.random.random(100) call returns an array of 100 random values in [0, 1);
# the list comprehension collects 100 such arrays, one per row.
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.963518,0.380284,0.283449,0.006967,0.587608,0.092339,0.809381,0.654596,0.641369,0.327932,...,0.805923,0.415677,0.072538,0.973503,0.231374,0.381781,0.92558,0.736738,0.627818,0.261326
1,0.778083,0.187219,0.137161,0.408284,0.541722,0.814571,0.073905,0.983607,0.3789,0.980328,...,0.859519,0.656729,0.300894,0.774476,0.721381,0.106257,0.077079,0.76076,0.283446,0.539332
2,0.796281,0.834948,0.948555,0.195366,0.828227,0.737327,0.1906,0.712246,0.878975,0.032049,...,0.751039,0.984251,0.861334,0.685844,0.59623,0.362369,0.021154,0.832481,0.008408,0.373545
3,0.610745,0.398445,0.892263,0.415803,0.733456,0.251302,0.244772,0.017722,0.583763,0.804718,...,0.667052,0.516686,0.906091,0.943773,0.052648,0.809766,0.582744,0.360599,0.863814,0.594701
4,0.833679,0.944092,0.514317,0.029887,0.320819,0.251497,0.908326,0.457224,0.989538,0.493501,...,0.309956,0.585676,0.804713,0.323178,0.709659,0.69432,0.578297,0.169062,0.540295,0.68514


In [25]:
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])

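Are these two DataFrames the same? Suppose our significance level is 0.1, i.e. $\alpha = 10\%$. We compare each column of df1 with the same-numbered column of df2.
The null hypothesis is that the two columns have the same mean; when the p-value is greater than 0.1, we do not have enough evidence to reject it.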

In [26]:
# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff = 0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col],df2[col])
        # and we check the pvalue versus the alpha
        if pval <= alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("""
            Col {} is statistically significantly different at alpha={}, pval={}
            """.format(col, alpha, pval))
            num_diff = num_diff + 1
    # and let's print out some summary stats
    print("""
    Total number different was {}, which is {}%
    """.format(num_diff, float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()


            Col 16 is statistically significantly different at alpha=0.1, pval=0.060500476343567505
            

            Col 17 is statistically significantly different at alpha=0.1, pval=0.06299325352965593
            

            Col 18 is statistically significantly different at alpha=0.1, pval=0.011821114048170649
            

            Col 21 is statistically significantly different at alpha=0.1, pval=0.08160398033424848
            

            Col 29 is statistically significantly different at alpha=0.1, pval=0.06966095662020229
            

            Col 39 is statistically significantly different at alpha=0.1, pval=0.0590541205970791
            

            Col 44 is statistically significantly different at alpha=0.1, pval=0.010820018871614955
            

            Col 52 is statistically significantly different at alpha=0.1, pval=0.05388080218646875
            

            Col 79 is statistically significantly different at alpha=0.1, pval=0.077807760474

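The share of columns flagged as different is very close to the alpha we chose. This is expected rather than a coincidence: both DataFrames are drawn from the same distribution, so the null hypothesis is true for every column, and a test at level alpha then rejects by chance about alpha of the time.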
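The link between alpha and the share of "different" columns can be checked with a larger simulation. The sketch below is an illustration under stated assumptions, not part of the original notebook: it draws 10,000 column pairs from the same uniform distribution (the counts `n_tests` and `n_obs` are arbitrary choices) and uses the `axis` argument of `ttest_ind` to run one test per column.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)   # seeded so the sketch is repeatable
n_tests, n_obs = 10_000, 100      # arbitrary: 10,000 column pairs of 100 values

# Both samples come from the same distribution, so the null hypothesis is true
x = rng.random((n_obs, n_tests))
y = rng.random((n_obs, n_tests))

# axis=0 runs one independent t-test per column
_, pvals = ttest_ind(x, y, axis=0)

for alpha in (0.1, 0.05, 0.01):
    frac = (pvals <= alpha).mean()
    print(f"alpha={alpha}: fraction significant = {frac:.3f}")
```

Each printed fraction should land near its alpha, which is why running many comparisons always turns up some "significant" results even when nothing is going on.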

In [29]:
# We can test some other alpha values as well
test_columns(0.05)


            Col 18 is statistically significantly different at alpha=0.05, pval=0.011821114048170649
            

            Col 44 is statistically significantly different at alpha=0.05, pval=0.010820018871614955
            
Total number different was 2, which is 2.0%


In [30]:
# Just for fun, let's recreate that second dataframe using a different distribution;
# I'll arbitrarily choose chi-squared
df2 = pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()


            Col 0 is statistically significantly different at alpha=0.1, pval=0.0014564442149353708
            

            Col 1 is statistically significantly different at alpha=0.1, pval=4.140194625696831e-05
            

            Col 2 is statistically significantly different at alpha=0.1, pval=0.0034943840072365864
            

            Col 3 is statistically significantly different at alpha=0.1, pval=1.5390930147860945e-05
            

            Col 4 is statistically significantly different at alpha=0.1, pval=0.003183743513907044
            

            Col 5 is statistically significantly different at alpha=0.1, pval=0.0003611554120692914
            

            Col 6 is statistically significantly different at alpha=0.1, pval=4.834755575790957e-05
            

            Col 7 is statistically significantly different at alpha=0.1, pval=0.0019255350975535134
            

            Col 8 is statistically significantly different at alpha=0.1, pval=1.4641805

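Now almost every column is significantly different at the 0.1 level. That makes sense: the uniform values in df1 have mean 0.5, while a chi-squared distribution with 1 degree of freedom has mean 1.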
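A quick way to see why the test fires so often here is to compare the two distributions directly. The following sketch is an illustration with arbitrary sample sizes (the names `u`, `c` and `n_cols` are hypothetical): it first estimates the two means from large samples, then repeats the column-wise comparison on many fresh column pairs.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)    # seeded so the sketch is repeatable

# Large samples to estimate the population means
u = rng.random(100_000)                   # uniform on [0, 1), mean 0.5
c = rng.chisquare(df=1, size=100_000)     # chi-squared with 1 dof, mean 1
print(u.mean(), c.mean())

# Repeat the notebook's comparison on many column pairs of 100 values each
n_cols = 2_000
x = rng.random((100, n_cols))
y = rng.chisquare(df=1, size=(100, n_cols))
_, pvals = ttest_ind(x, y, axis=0)
frac_rejected = (pvals <= 0.1).mean()
print(f"fraction of columns flagged at alpha=0.1: {frac_rejected:.3f}")
```

Because the true means differ, most columns are flagged; the few that are not reflect the limited power of a test on only 100 values per column.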