## Basic Statistical Testing

The goal of hypothesis testing is to determine whether, for instance, two different conditions in an experiment have produced different outcomes.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
from scipy import stats

In [4]:
# scipy is an interesting collection of scientific computing routines, and you will use many of them for data science.
# It builds on numpy and is part of a wider ecosystem that includes pandas and plotting libraries such as matplotlib;
# scipy itself contributes a number of scientific library functions, including the statistical tests in scipy.stats.

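Hypothesis testing involves two statements of interest:
1. The alternative hypothesis, which is our actual explanation (for example, that the two groups differ).
2. The null hypothesis, which is that our explanation is not sufficient (for example, that there is no difference between the groups).

The test asks whether the data give us enough evidence to reject the null hypothesis. 
If we find a statistically significant difference between the groups, we reject the null hypothesis and accept the alternative.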
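
The decision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not part of the original analysis: the two groups are synthetic, the helper name `decide` is hypothetical, and the threshold of 0.05 is an assumed significance level.

```python
import numpy as np
from scipy.stats import ttest_ind


def decide(pvalue, alpha=0.05):
    """Return the decision for a test at significance level alpha (assumed 0.05)."""
    if pvalue <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Synthetic groups whose true means differ by 10 points (made-up data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=60.0, scale=10.0, size=200)
group_b = rng.normal(loc=70.0, scale=10.0, size=200)

# Two-sample t-test, then apply the decision rule to its p-value
result = ttest_ind(group_a, group_b)
print(decide(result.pvalue))
```

Note that "fail to reject" is deliberately not phrased as "accept the null hypothesis": a large p-value only means the evidence was not strong enough.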

In [5]:
df = pd.read_csv("assets/grades.csv")
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


There are six different assignments. Let's look at some summary statistics for this DataFrame.

In [6]:
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [10]:
# Let's segment this population into two pieces
# Those who finish the first assignment by the end of Dec 2015 are early finishers
# Those who finish it sometime after that are late finishers

early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [14]:
late_finishers_try1 = df[pd.to_datetime(df['assignment1_submission']) >= '2016']
late_finishers_try1.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [13]:
# Want everything in the df which is not in early_finishers
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


There are other ways to build late_finishers that would have worked just as well:
1. Join the DataFrame df with early_finishers. A left join keeps every row of the left DataFrame, so the rows of df that find no match in early_finishers are the late finishers.
2. Write a function that determines whether someone is early or late, call it with .apply() on the DataFrame, and add the result as a new column.
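
Both alternatives can be sketched on a small stand-in DataFrame. The four rows below are made up for illustration, and the names `toy`, `cutoff`, and `finish_label` are hypothetical; the `indicator=True` argument of `merge` is one way to find rows that have no match in a left join.

```python
import pandas as pd

# Toy data standing in for grades.csv (values are made up for illustration)
toy = pd.DataFrame({
    "student_id": ["a", "b", "c", "d"],
    "assignment1_submission": pd.to_datetime(
        ["2015-11-02", "2016-01-09", "2015-12-13", "2016-04-30"]
    ),
})
cutoff = pd.Timestamp("2016-01-01")
early = toy[toy["assignment1_submission"] < cutoff]

# Approach 1: left merge with an indicator column.
# Rows marked "left_only" exist in toy but not in early, so they are late.
merged = toy.merge(early[["student_id"]], on="student_id", how="left", indicator=True)
late_via_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")


# Approach 2: label every row with a function applied row by row
def finish_label(row):
    return "early" if row["assignment1_submission"] < cutoff else "late"


labeled = toy.assign(finisher=toy.apply(finish_label, axis=1))
late_via_apply = labeled[labeled["finisher"] == "late"]

print(late_via_merge["student_id"].tolist())
print(late_via_apply["student_id"].tolist())
```

The index-based approach used in the notebook is shorter; the merge variant is mainly useful when the two DataFrames do not share an index.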

In [16]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


In [17]:
# The two means look similar, but what do we mean by similar?
# This is where Student's t-test comes in. It allows us to form the alternative hypothesis as well as the 
# null hypothesis and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how large a 
# chance of wrongly rejecting the null hypothesis we're willing to accept. This significance level is typically called alpha.

In [18]:
from scipy.stats import ttest_ind

ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)

The p-value is about 0.186, which is above our alpha value of 0.05. This means that we cannot reject the null hypothesis.

The null hypothesis was that the two populations have the same mean, and our evidence is not strong enough (because the p-value is greater than alpha) to come to a conclusion to the contrary. This doesn't mean that we have proven the populations are the same.
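
To make the test statistic less of a black box, here is a minimal sketch that recomputes what `ttest_ind` reports under its default equal-variance setting, using the pooled-variance form of Student's t-test. The data is synthetic and the helper name `pooled_t_test` is hypothetical; it is an illustration, not the library's implementation.

```python
import numpy as np
from scipy import stats

# Synthetic grades for two groups of different sizes (made-up data)
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=300)


def pooled_t_test(x, y):
    """Two-sided Student's t-test assuming equal variances in both groups."""
    nx, ny = len(x), len(y)
    dof = nx + ny - 2
    # Pooled sample variance: weighted average of the two sample variances
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / dof
    # Difference in means divided by its standard error
    t = (np.mean(x) - np.mean(y)) / np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
    # Two-sided p-value from the t distribution's survival function
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p


t_manual, p_manual = pooled_t_test(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The p-value is the probability, if the null hypothesis were true, of seeing a t statistic at least this far from zero, which is why it is compared against alpha.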

In [19]:
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.781037,0.57803,0.95709,0.863923,0.023813,0.321775,0.98598,0.875277,0.823929,0.865984,...,0.730081,0.89398,0.182539,0.717723,0.707501,0.65783,0.642496,0.682527,0.361978,0.536748
1,0.276768,0.386151,0.296504,0.746178,0.262391,0.722245,0.033479,0.22207,0.839301,0.914214,...,0.460948,0.870931,0.286714,0.52893,0.953518,0.274413,0.021581,0.519306,0.723609,0.875261
2,0.57577,0.045751,0.941714,0.237793,0.650395,0.222964,0.377363,0.700825,0.677531,0.522036,...,0.942535,0.492913,0.804804,0.108795,0.329727,0.270312,0.348645,0.686009,0.148566,0.030767
3,0.110826,0.022395,0.650159,0.32916,0.042542,0.860804,0.098154,0.364955,0.412324,0.614992,...,0.916943,0.312286,0.17,0.170893,0.14915,0.923076,0.7315,0.578633,0.032146,0.204681
4,0.822043,0.437353,0.922702,0.949546,0.721122,0.929068,0.428405,0.911523,0.632004,0.019619,...,0.413681,0.731408,0.971511,0.689614,0.876563,0.445081,0.16191,0.556171,0.818006,0.188925


In [20]:
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.320641,0.794225,0.111155,0.743778,0.115088,0.983482,0.257087,0.692234,0.813613,0.579759,...,0.881761,0.537904,0.406771,0.114959,0.970291,0.582301,0.05414,0.750849,0.332425,0.62758
1,0.719144,0.085116,0.369136,0.842764,0.952956,0.668071,0.116102,0.78861,0.313235,0.988071,...,0.948449,0.304528,0.11842,0.560352,0.18292,0.729624,0.615499,0.504834,0.66398,0.041319
2,0.376177,0.620183,0.686659,0.881225,0.878423,0.713644,0.961147,0.178687,0.281547,0.580469,...,0.820664,0.20219,0.495658,0.879154,0.035058,0.979497,0.117621,0.263765,0.00015,0.78292
3,0.242707,0.762029,0.760593,0.32897,0.758009,0.673745,0.679601,0.317576,0.597414,0.168291,...,0.412177,0.041766,0.476118,0.184581,0.277603,0.456813,0.881678,0.65445,0.795106,0.015345
4,0.017931,0.341615,0.01694,0.517245,0.568095,0.01726,0.816769,0.35956,0.074111,0.071266,...,0.569583,0.514131,0.392947,0.942193,0.8535,0.372454,0.265007,0.370098,0.113914,0.672143


In [24]:
# Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the same 
# as the corresponding column inside of df2?

# Significance level (alpha) is 0.1, or 10%

def test_columns(alpha=0.1):
    # Keep track of how many differ
    num_diff = 0
    # Iterate over the columns
    for col in df1.columns:
        # Run ttest_ind on this column of the two DataFrames
        teststat, pval = ttest_ind(df1[col], df2[col])
        # Check the p-value against alpha
        if pval <= alpha:
            print("Col {} is statistically significantly different at alpha = {}, pval = {}".format(col, alpha, pval))
            num_diff = num_diff + 1
    # Print some summary stats
    print("Total number different was {}, which is {}%".format(num_diff, float(num_diff)/len(df1.columns)*100))

test_columns()

Col 9 is statistically significantly different at alpha = 0.1, pval = 0.022255617901449525
Col 10 is statistically significantly different at alpha = 0.1, pval = 0.04300000339167113
Col 17 is statistically significantly different at alpha = 0.1, pval = 0.04655201384057509
Col 53 is statistically significantly different at alpha = 0.1, pval = 0.0774223826190771
Col 62 is statistically significantly different at alpha = 0.1, pval = 0.0871131782042013
Col 85 is statistically significantly different at alpha = 0.1, pval = 0.0807164137263577
Total number different was 6, which is 6.0%


All the t-test does is check whether two samples differ at some chosen significance level, in our case 10%.

The more random comparisons you do, the more will appear different just by chance. In this example, we checked 100 columns drawn from the same distribution, so with an alpha of 0.1 we would expect roughly 10 of them to be flagged as different.

In [25]:
test_columns(0.05)

Col 9 is statistically significantly different at alpha = 0.05, pval = 0.022255617901449525
Col 10 is statistically significantly different at alpha = 0.05, pval = 0.04300000339167113
Col 17 is statistically significantly different at alpha = 0.05, pval = 0.04655201384057509
Total number different was 3, which is 3.0%
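
The column-by-column experiment above can also be run in one vectorized call, which makes it easy to see how a stricter threshold changes the count of chance findings. The sketch below is an illustration under stated assumptions: it uses freshly generated, seeded random data rather than df1 and df2, and it applies a Bonferroni-style threshold (alpha divided by the number of tests), which is one common correction and not something the original analysis uses.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two independent 100 x 100 arrays drawn from the same uniform distribution
rng = np.random.default_rng(42)
n_rows, n_cols = 100, 100
x = rng.random((n_rows, n_cols))
y = rng.random((n_rows, n_cols))

# axis=0 runs one t-test per column and returns one p-value per column
_, pvals = ttest_ind(x, y, axis=0)

alpha = 0.1
# Columns flagged when each test is judged on its own
n_uncorrected = int(np.sum(pvals <= alpha))

# Bonferroni-style threshold: divide alpha by the number of comparisons
bonferroni_alpha = alpha / n_cols
n_bonferroni = int(np.sum(pvals <= bonferroni_alpha))

print("Flagged at alpha = {}: {}".format(alpha, n_uncorrected))
print("Flagged at alpha / {} = {}: {}".format(n_cols, bonferroni_alpha, n_bonferroni))
```

Because both arrays come from the same distribution, every flagged column here is a false positive, so the uncorrected count shows how many spurious "differences" the chosen alpha lets through.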


Keep this in mind when you are doing statistical tests like the t-test, which produce a p-value. Understand that the threshold you compare that p-value against isn't magic: it is a cutoff you choose when reporting results and trying to answer your hypothesis. 

What's a reasonable threshold? It depends on your question, and you need to engage domain experts to better understand what they would consider significant.

In [26]:
df2 = pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])

In [27]:
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.014095,0.165917,0.548317,0.736558,0.011588,0.017947,1.971273,0.751031,0.89205,0.002533,...,0.322931,0.364089,0.76499,0.658565,1.092446,3.389651,0.013253,0.850113,4.259518,0.203684
1,0.082443,0.364093,1.47538,0.22648,0.00111,0.57439,0.927932,0.001655,0.174158,4.974089,...,0.496773,0.143406,2.807119,2.716798,0.322123,0.026221,1.624929,0.00027,0.022333,1.67769
2,0.816413,0.088087,0.201689,0.091542,1.2e-05,0.332167,0.000643,1.065374,0.964419,0.117485,...,0.422728,0.978301,0.571465,0.21081,0.391875,0.27545,0.57689,0.067704,0.023073,1.885851
3,0.009786,0.001924,1.271808,0.49239,0.242373,3.826152,0.033403,3.060671,1.138687,3.997034,...,0.088515,0.012039,0.334683,0.264116,0.01743,0.326187,1.392091,0.165286,0.117668,0.093462
4,2.468948,0.010017,0.001198,0.110663,9.6e-05,1.376623,0.071788,0.339618,0.108358,7.367742,...,1.462841,0.164284,0.393093,0.147552,0.020886,0.288106,0.781177,1.28643,0.973932,0.556855
