In this lecture we're going to review some of the basics of statistical testing in Python. We're going to
talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

In [1]:
# We use statistics in a lot of different ways in data science, and in this lecture, I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats

In [2]:
# Now, SciPy is an interesting collection of libraries for scientific computing, and you'll use many or perhaps
# all of them. It builds on numpy and sits alongside pandas and plotting libraries such as matplotlib in the
# wider scientific Python ecosystem. The scipy.stats module we just imported holds the statistical functions.

In [3]:
# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to determine whether the data
# give us enough evidence to reject the null hypothesis. If we find a difference between groups that would be
# unlikely if the null hypothesis were true, then we reject the null hypothesis in favor of our alternative.

# Let's see an example of this; we're going to use some grade data
df=pd.read_csv('datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [4]:
# If we take a look inside the DataFrame, we see we have six different assignments. Let's look at some
# summary statistics for this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [5]:
# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it 
# sometime after that, we'll call them late finishers.

early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
# Here pd.to_datetime converts the assignment1_submission column from strings to datetimes. We could have done
# this when we read in the CSV file as well. We then keep only the rows where that time is before 2016.
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000
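The comment above notes that the date conversion could also happen when the CSV file is read. Here is a minimal sketch of that idea, assuming a tiny in-memory stand-in for grades.csv. The three rows and the names `sample`, `cutoff`, `early`, and `late` are made up for illustration and are not part of the lecture data.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasets/grades.csv with only the columns we need
csv_text = """student_id,assignment1_grade,assignment1_submission
A,92.7,2015-11-02 06:55:34
B,85.5,2016-01-09 05:36:02
C,64.8,2015-12-13 17:06:10
"""

# parse_dates asks read_csv to convert the named column to datetimes on load
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['assignment1_submission'])

# An explicit Timestamp makes the cutoff unambiguous: midnight on 1 January 2016
cutoff = pd.Timestamp('2016-01-01')
early = sample[sample['assignment1_submission'] < cutoff]
late = sample[sample['assignment1_submission'] >= cutoff]

print(sample.dtypes)
print(len(early), "early and", len(late), "late")
```

Keeping the cutoff in one named variable also means there is only one place to change it later.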


In [6]:
# You have lots of skills with pandas now, so how would you go about getting the late_finishers dataframe?
# Why don't you pause the video and give it a try?

In [7]:
# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers
late_finishers=df[~df.index.isin(early_finishers.index)]
# So late_finishers is our original DataFrame filtered by the inverse of df.index.isin(early_finishers.index).
# Here the tilde is a bitwise complement, so on a boolean mask it turns all of our True values into False
# and all of our False values into True.
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [8]:
# There are lots of other ways to do this. For instance, you could just copy and paste the first filter
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also do a join of
# the dataframe df with early_finishers. A left join keeps every row of the left dataframe, so you would then
# keep only the rows that found no match in early_finishers. You also could have written a function that
# determines if someone is early or late, and then called .apply() to add a new column to the dataframe. This
# is a pretty reasonable answer as well. So there are a number of different ways to create this DataFrame.
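The alternatives described above can be sketched side by side. This is an illustration on a small made-up DataFrame, not the lecture's grades data, and the names `sample`, `cutoff`, `is_early`, `label_finisher`, and `late_via_merge` are hypothetical. The merge variant relies on the `indicator` option of `DataFrame.merge` to mark rows that had no match.

```python
import pandas as pd

# Small made-up stand-in for the grades data
sample = pd.DataFrame({
    'assignment1_grade': [92.7, 86.8, 85.5, 86.0, 64.8],
    'assignment1_submission': pd.to_datetime([
        '2015-11-02', '2015-11-29', '2016-01-09', '2016-04-30', '2015-12-13']),
})
cutoff = pd.Timestamp('2016-01-01')

# Option 1: store the condition once and reuse it, so the date lives in one place
is_early = sample['assignment1_submission'] < cutoff
early = sample[is_early]
late = sample[~is_early]

# Option 2: a function plus .apply() that labels each submission
def label_finisher(submitted):
    return 'early' if submitted < cutoff else 'late'

labels = sample['assignment1_submission'].apply(label_finisher)

# Option 3: a left merge keeps every row of sample; indicator=True adds a
# _merge column, and 'left_only' marks rows with no match in early
merged = sample.merge(early, how='left', indicator=True)
late_via_merge = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(late)
print(labels.value_counts())
print(late_via_merge)
```

Option 1 is usually the least work, while the label column from option 2 is handy if you later want to group by finisher type.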

In [9]:
# As you've seen, the pandas DataFrame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the DataFrame, we see that the mean of each assignment is
# calculated. Let's compare the means for our two populations

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


In [10]:
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much risk of
# wrongly rejecting a true null hypothesis we're willing to accept. This significance level is typically
# called alpha. For this example, let's use a threshold of 0.05 for our alpha, or 5%. Now this is a commonly
# used number but it's really quite arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations in the two groups are not related to one another). The results of ttest_ind() are the
# t-statistic and a p-value. It's this latter value which is most important to us, as it indicates the
# probability (between 0 and 1) of seeing a difference at least as large as the one we observed if the null
# hypothesis were true. It is not the probability that the null hypothesis itself is true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
# So ttest_ind() takes two samples: we pass the assignment1_grade column of early_finishers and the
# assignment1_grade column of late_finishers.

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)
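To see where those two numbers come from, here is a minimal sketch that computes the statistic and p-value by hand and compares them with `ttest_ind()`. It assumes the function's default setting, which pools the variances of the two groups, and it uses made-up normally distributed grades rather than the lecture data. The names `a`, `b`, `pooled_var`, `t_manual`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up samples standing in for the two groups of grades
rng = np.random.default_rng(0)
a = rng.normal(75, 10, size=200)
b = rng.normal(74, 10, size=300)

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2

# Pooled sample variance, weighting each group's variance by its degrees of freedom
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t-statistic: difference in means divided by its estimated standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value: probability of a t at least this extreme under the null
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The hand-computed pair should match SciPy's to floating-point precision, which makes the p-value's meaning concrete: it is a tail probability of the t distribution, computed under the assumption that the null hypothesis holds.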

In [11]:
# So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have strong enough evidence (because the p-value is greater than alpha) to come to a conclusion to
# the contrary. This doesn't mean that we have proven the populations are the same.

In [12]:
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
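Since the five calls above differ only in the column name, the same checks can be written as a loop. This is a sketch on made-up data: the `early` and `late` DataFrames here are random stand-ins for early_finishers and late_finishers, and `grade_cols` and `results` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Random stand-ins for the two groups, with the same column names as the grades data
rng = np.random.default_rng(42)
grade_cols = ['assignment{}_grade'.format(i) for i in range(1, 7)]
early = pd.DataFrame(rng.normal(70, 10, size=(50, 6)), columns=grade_cols)
late = pd.DataFrame(rng.normal(70, 10, size=(60, 6)), columns=grade_cols)

# Run one independent t-test per assignment and remember the p-values
results = {}
for col in grade_cols:
    statistic, pvalue = ttest_ind(early[col], late[col])
    results[col] = pvalue
    print("{}: statistic={:.3f}, pvalue={:.3f}".format(col, statistic, pvalue))
```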


In [13]:
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a
# p-value around 0.1. This means that if we had chosen an alpha of 0.11, accepting an 11% chance of seeing
# such a difference when there is none, this would have been considered statistically significant. As a
# researcher, this would suggest to me that there is something here worth following up on. For instance, if
# we had a small number of participants (and we don't here) or if there was something unique about this
# assignment as it relates to our experiment (whatever the experiment was) then there may be followup
# experiments we could run to better understand the phenomenon.

In [14]:
# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.594582,0.934674,0.928005,0.402981,0.899311,0.27685,0.00771,0.369291,0.993799,0.346297,...,0.121825,0.929949,0.876649,0.858907,0.722868,0.590758,0.422649,0.751324,0.512489,0.965033
1,0.941025,0.560229,0.005066,0.239995,0.893428,0.947562,0.263387,0.895551,0.397754,0.299527,...,0.057007,0.561166,0.312636,0.047455,0.329718,0.63262,0.045496,0.407757,0.495057,0.750729
2,0.903976,0.441911,0.395131,0.106077,0.372937,0.315134,0.366005,0.836094,0.849736,0.341996,...,0.469593,0.518654,0.027216,0.545869,0.997112,0.942015,0.184176,0.221051,0.313018,0.01964
3,0.495397,0.27028,0.329554,0.489409,0.138365,0.251533,0.380508,0.137015,0.158889,0.394645,...,0.488811,0.485901,0.008634,0.853493,0.985143,0.527601,0.864648,0.90502,0.527978,0.723729
4,0.67203,0.071926,0.326345,0.526676,0.743603,0.248494,0.476739,0.072954,0.373388,0.460512,...,0.514914,0.331583,0.271667,0.158583,0.447563,0.8938,0.182033,0.541518,0.439124,0.506304


In [15]:
# Pause this and reflect -- do you understand the list comprehension and how I created this DataFrame? You
# don't have to use a list comprehension to do this, but you should be able to read this and figure out how it
# works, as this is a commonly used approach on web and help forums.

In [16]:
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])
# What I'm saying here is that I want to call np.random.random(100), which generates an array of 100 random
# values, once for each x in range(100), and collect the results in a list. I'm actually not using x here.
# It's just being thrown away, because the data that I'm using comes from np.random.random.
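If the list comprehension is hard to read, it may help to see it unrolled. The sketch below builds the same kind of DataFrame three ways: with the comprehension, with an explicit loop, and by asking numpy for a 100 by 100 array directly. The names `df_comp`, `df_loop`, and `df_direct` are illustrative, and because the values are random the three frames share a shape, not their contents.

```python
import numpy as np
import pandas as pd

# The list comprehension used in the lecture; _ signals the loop variable is unused
df_comp = pd.DataFrame([np.random.random(100) for _ in range(100)])

# The same thing as an explicit loop: build a list of 100 arrays, one per row
rows = []
for _ in range(100):
    rows.append(np.random.random(100))
df_loop = pd.DataFrame(rows)

# numpy can also produce the whole 100 x 100 block in one call
df_direct = pd.DataFrame(np.random.random((100, 100)))

print(df_comp.shape, df_loop.shape, df_direct.shape)
```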

In [17]:
# Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the
# same as that same column inside of df2?

# Let's take a look. Let's say our significance level, alpha, is 0.1, or 10%. And we're going to compare each
# column in df1 to the same numbered column in df2. And we'll report when the p-value is at or below 10%,
# which means that we have sufficient evidence to say that the columns are different.

# Let's write this in a function called test_columns
def test_columns(alpha=0.1): # alpha is a parameter with a default of 0.1, so we can change it later
    # I want to keep track of how many columns actually differ
    num_diff=0
    # And now we can just iterate over the list of columns in df1
    for col in df1.columns:
        # We run our ttest_ind between the matching columns of the two dataframes. Remember ttest_ind
        # returns two values, so we can use tuple unpacking here to get the test statistic and the
        # p-value in their own variables.
        teststat,pval=ttest_ind(df1[col],df2[col])
        # and we check the p-value against alpha
        if pval<=alpha:
            # If they are different we print that out and increment num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats after we're done testing all the columns
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now lets actually run this
test_columns()

Col 14 is statistically significantly different at alpha=0.1, pval=0.039019091921267735
Col 15 is statistically significantly different at alpha=0.1, pval=0.06834933904206675
Col 25 is statistically significantly different at alpha=0.1, pval=0.00882920260763997
Col 27 is statistically significantly different at alpha=0.1, pval=0.08199711298912607
Col 40 is statistically significantly different at alpha=0.1, pval=0.07259126589129088
Col 54 is statistically significantly different at alpha=0.1, pval=0.055508713415907115
Col 67 is statistically significantly different at alpha=0.1, pval=0.09089143199602412
Col 87 is statistically significantly different at alpha=0.1, pval=0.03854988481732605
Total number different was 8, which is 8.0%


In [18]:
# Interesting, so we see that there are a bunch of columns that are flagged as different! In fact, the
# percentage flagged looks a lot like the alpha value we chose. So what's going on - shouldn't all of the
# columns be the same? Remember that all ttest_ind does is check whether two samples differ by more than we
# would expect from chance, at the significance level we chose, in our case 10%. The more random comparisons
# you do, the more will just happen to look different by chance. In this example, we checked 100 columns, so
# with an alpha of 0.1 we would expect roughly 10 of them to be flagged as different.

# We can test some other alpha values as well
test_columns(0.05) # So let's do test_columns with 0.05 and you could try other values as well.

Col 14 is statistically significantly different at alpha=0.05, pval=0.039019091921267735
Col 25 is statistically significantly different at alpha=0.05, pval=0.00882920260763997
Col 87 is statistically significantly different at alpha=0.05, pval=0.03854988481732605
Total number different was 3, which is 3.0%
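The arithmetic behind that expectation can be checked with a larger simulation. This is a sketch under stated assumptions: it uses 2000 column pairs instead of 100 so the observed rate is less noisy, a seeded generator so it is repeatable, and the `axis` argument of `ttest_ind()` to test every column in one call. The names `false_positive_rate` and `familywise` are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.1
num_tests = 2000

# Two blocks drawn from the same uniform distribution, so the null hypothesis
# is true for every column pair and any "significant" result is a false positive
a = rng.random((100, num_tests))
b = rng.random((100, num_tests))

# With axis=0, ttest_ind compares column i of a with column i of b for every i
_, pvals = ttest_ind(a, b, axis=0)
false_positive_rate = np.mean(pvals <= alpha)
print("Fraction flagged as different:", false_positive_rate)

# Probability that at least one of 100 independent tests is flagged by chance
familywise = 1 - (1 - alpha) ** 100
print("Chance of at least one false positive in 100 tests:", familywise)
```

The fraction flagged should land close to alpha, and the second number shows why running 100 tests at alpha=0.1 makes at least one chance finding almost certain.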


In [19]:
# So, keep this in mind when you are doing statistical tests like the t-test, which produce a p-value.
# Understand that the p-value isn't magic: you compare it against a threshold, alpha, that you choose when
# reporting results and trying to answer your question. What's a reasonable threshold? Depends on your
# question, and you need to engage domain experts to better understand what they would consider significant.

# Just for fun, let's recreate that second dataframe using a different distribution. I'll arbitrarily choose
# chi squared. You can try some other ones if you'd like.
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
# Now chi-squared is a distribution that takes a parameter, the degrees of freedom. I'll just set it to one
# here. You can read about that, or maybe you already know about the chi-squared distribution. I want 100
# values in each array and we build 100 arrays, as before. Let's just test that.
test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=8.226366738129916e-05
Col 1 is statistically significantly different at alpha=0.1, pval=9.34146813697836e-06
Col 2 is statistically significantly different at alpha=0.1, pval=0.003359155422247822
Col 3 is statistically significantly different at alpha=0.1, pval=0.0005312100156933493
Col 4 is statistically significantly different at alpha=0.1, pval=0.0012956198286605428
Col 5 is statistically significantly different at alpha=0.1, pval=0.0010090863382767557
Col 6 is statistically significantly different at alpha=0.1, pval=4.881843184733509e-05
Col 7 is statistically significantly different at alpha=0.1, pval=0.0011750414350145807
Col 8 is statistically significantly different at alpha=0.1, pval=0.015531300167084079
Col 9 is statistically significantly different at alpha=0.1, pval=0.0006032687535588861
Col 10 is statistically significantly different at alpha=0.1, pval=0.0008877899863427575
Col 11 is statistically significan

In [20]:
# Now we see that all or most of the columns test as statistically significantly different at the 10% level.
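One way to understand this result is that the t-test compares means, and the two distributions have different means: values from np.random.random are uniform on [0, 1) with mean 0.5, while a chi-squared distribution with one degree of freedom has mean 1. The sketch below checks this on large made-up samples, using a seeded generator and illustrative names `uniform_sample` and `chisq_sample`.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Large samples so the sample means sit close to the theoretical means
uniform_sample = rng.random(100_000)
chisq_sample = rng.chisquare(df=1, size=100_000)

print("uniform mean:", uniform_sample.mean())
print("chi-squared mean:", chisq_sample.mean())

# With truly different means, the t-test has real evidence against the null
statistic, pvalue = ttest_ind(uniform_sample, chisq_sample)
print(statistic, pvalue)
```

So unlike the earlier simulation, the columns flagged here are not false positives: the underlying populations really do differ.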

The following statements (numbered 1 and 2) about statistical testing are incorrect, but the notes that accompany them are correct:
1. The p value reports the probability we can get the observed data under the null hypothesis, so with a larger p value (by larger we mean it is closer to 1, instead of saying that it is more extreme), we have less evidence to reject the null hypothesis.

>Note 1: It is important to understand the definition of significance here: significance means the evidence we observed from the data against the null hypothesis, and the p-value is a measure of that significance.

2. With a larger alpha, we reject the null hypothesis less carefully.

>Note 2: It is important to understand the definition of significance here: significance means the evidence we observed from the data against the null hypothesis, and the alpha level is a measure of our tolerance of that significance.

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you
to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical
issues which arise from looking for statistical significance. There's much more to learn about hypothesis
testing. For instance, there are different tests to use depending on the shape of your data, and different
ways to report on the results instead of just p-values, such as confidence intervals or Bayesian analyses. But
this should give you a basic idea of where to start when comparing two populations for differences, which is a
common task for data scientists.