# Basic Statistical Testing

In this lecture we're going to review some of the basics of statistical testing in Python. We're going to talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

In [1]:
# We use statistics in a lot of different ways in data science, and in this lecture I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats

In [2]:
# Now, SciPy is an interesting collection of libraries for data science and you'll use most or perhaps all of
# these libraries. The broader SciPy ecosystem includes numpy and pandas, plotting libraries such as
# matplotlib, and a number of scientific library functions as well. The scipy package itself builds on numpy,
# and the statistical tests we need live in its scipy.stats module.

In [3]:
# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to ask whether the data are
# unlikely enough under the null hypothesis that we should reject it. If we find sufficient evidence of a
# difference between groups, then we reject the null hypothesis in favor of our alternative.

# Let's see an example of this; we're going to use some grade data
df = pd.read_csv('datasets/grades.csv')
df

# df.describe()
# df.shape
# df.info()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.813800,2015-12-13 17:06:10.750000000,51.491040,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2310,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,77.684611,2016-03-07 02:52:24.378000000,69.916150,2016-03-11 22:02:39.161000000,69.916150,2016-03-17 07:30:09.261000000,69.916150,2016-03-18 18:01:24.525000000,55.932920,2016-03-20 06:38:12.120000000,50.339628,2016-03-25 11:00:06.923000000
2311,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,75.367870,2015-11-29 02:43:27.932000000,59.934296,2015-12-03 05:30:39.218000000,48.687437,2015-12-09 15:56:44.895000000,43.008693,2015-12-13 06:18:01.342000000,38.707824,2015-12-20 02:39:39.248000000,38.707824,2015-12-22 13:34:42.931000000
2312,EFDA9F93-D0C3-864F-B0F6-2E9AA3E05E31,73.269463,2015-10-20 08:09:27.418000000,58.255570,2015-11-18 19:07:06.930000000,58.955570,2015-12-10 08:54:54.871000000,52.250013,2015-11-23 19:40:00.434000000,41.800010,2015-11-29 14:23:43.659000000,41.800010,2015-12-04 09:56:07.156000000
2313,1F51E050-78F7-F270-1B90-ED1BC0376763,87.268366,2016-04-03 09:04:51.646000000,87.268366,2016-04-08 19:24:29.095000000,87.268366,2016-04-12 05:43:33.853000000,69.814693,2016-04-14 10:43:58.104000000,55.851754,2016-04-19 05:37:19.322000000,55.851754,2016-04-23 03:44:06.813000000


In [4]:
# If we take a look inside the data frame, we see we have six different assignments. Let's look at some
# summary statistics for this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))
# print(f'There are {df.shape[0]} rows and {df.shape[1]} columns')

There are 2315 rows and 13 columns


In [6]:
# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it 
# sometime after that, we'll call them late finishers.

early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()
early_finishers.shape

(1259, 13)

In [44]:
# My solution: flip the comparison to select everyone who submitted in 2016 or later
# late_fin = df[pd.to_datetime(df['assignment1_submission']) >= '2016']
# late_fin

In [6]:
# So, you have lots of skills now with pandas, how would you go about getting the late_finishers dataframe?
# Why don't you pause the video and give it a try.

In [7]:
# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers

late_finishers=df[~df.index.isin(early_finishers.index)]
late_finishers.head()
late_finishers.shape

(1056, 13)

In [8]:
# There are lots of other ways to do this. For instance, you could just copy and paste the first selection
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also do a left join
# of the dataframe df with early_finishers. A left join keeps every row of the left dataframe, so you would
# then keep only the rows that found no match in early_finishers. You also could have written a function that
# determines if someone is early or late, and then called .apply() on the dataframe and added a new column to
# the dataframe. This is a pretty reasonable answer as well.
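
Two of these alternatives can be sketched briefly. The example below is a minimal illustration on a tiny made-up DataFrame (the name `toy`, its ids, and its dates are hypothetical, not taken from grades.csv). It first keeps the cutoff in one place by reusing a single boolean mask, then shows the left join with `indicator=True`. The join version assumes the key uniquely identifies each row, which you would want to verify before using it on real data.

```python
import pandas as pd

# A tiny, made-up stand-in for grades.csv (values are illustrative only)
toy = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': ['2015-11-02 06:55:34', '2016-01-09 05:36:02',
                               '2015-12-13 17:06:10', '2016-04-30 06:50:39'],
})

# Option 1: build the boolean mask once, so the cutoff date lives in one place
cutoff = pd.Timestamp('2016-01-01')
is_early = pd.to_datetime(toy['assignment1_submission']) < cutoff
early = toy[is_early]
late = toy[~is_early]

# Option 2: left join with indicator=True, then keep rows found only in the left frame
merged = toy.merge(early[['student_id']], on='student_id', how='left', indicator=True)
late_via_join = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(list(late['student_id']))
print(list(late_via_join['student_id']))
```

Both approaches select the same rows here; the mask version is usually the simpler of the two.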

In [46]:
# As you've seen, the pandas data frame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the data frame, we see that the mean of each assignment is
# calculated. Let's compare the means for our two populations

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065


In [50]:
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a
# chance of wrongly rejecting a true null hypothesis we're willing to accept. This significance level is
# typically called alpha. For this example, let's use a threshold of 0.05 for our alpha or 5%. Now this is a
# commonly used number but it's really quite arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations are not related to one another). The results of ttest_ind() are the t-statistic and a p-value.
# It's this latter value, the probability, which is most important to us, as it indicates the chance (between
# 0 and 1) of seeing a difference at least as large as the one we observed if the null hypothesis were true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])



TtestResult(statistic=1.3223540853721598, pvalue=0.18618101101713855, df=2313.0)

In [51]:
# So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have strong enough evidence (the p-value is greater than alpha) to come to a conclusion to
# the contrary. This doesn't mean that we have proven the populations are the same.
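
To make the t-statistic and p-value less of a black box, here is a minimal sketch that computes both by hand and compares them with `ttest_ind()`. It uses synthetic, normally distributed samples (the arrays `a` and `b` and their parameters are assumptions for illustration, not the grades data) and the pooled-variance form of the test, which is what `ttest_ind()` does by default.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for the two groups (not the grades data)
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=180)

# Pooled-variance t statistic, matching the default equal_var=True behaviour
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: chance of a |t| at least this large if the null were true
dof = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(a, b)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The hand-computed values should agree with SciPy's up to floating-point error. The p-value is a statement about how surprising the data would be under the null hypothesis, not the probability that the null hypothesis is true.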

In [52]:
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

TtestResult(statistic=1.2514717608216366, pvalue=0.2108889627004424, df=2313.0)
TtestResult(statistic=1.6133726558705392, pvalue=0.10679998102227865, df=2313.0)
TtestResult(statistic=0.049671157386456125, pvalue=0.960388729789337, df=2313.0)
TtestResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492, df=2313.0)
TtestResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656, df=2313.0)


In [53]:
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a
# p-value around 0.11. This means that if we had accepted an alpha of 11% or more, this would have been
# considered statistically significant. As a researcher, this would suggest to me that there is something here
# worth considering following up on. For instance, if we had a small number of participants (we don't) or if
# there was something unique about this assignment as it relates to our experiment (whatever it was) then
# there may be followup experiments we could run.

In [67]:
# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.120385,0.407988,0.804251,0.525036,0.544048,0.949361,0.866966,0.10044,0.527606,0.143048,...,0.024097,0.545157,0.836405,0.166604,0.73657,0.062372,0.471603,0.949147,0.582226,0.96918
1,0.017565,0.538153,0.350936,0.605714,0.827211,0.662022,0.277209,0.203106,0.284408,0.015791,...,0.245692,0.497351,0.697067,0.35588,0.815989,0.961239,0.498042,0.362651,0.075699,0.565474
2,0.127092,0.654804,0.224736,0.723047,0.720521,0.262354,0.278217,0.368777,0.016694,0.056108,...,0.878273,0.124723,0.444822,0.7408,0.942802,0.750333,0.269538,0.961819,0.2421,0.48357
3,0.378014,0.846372,0.594806,0.782161,0.46443,0.152882,0.033473,0.277175,0.185035,0.535932,...,0.273499,0.399496,0.423392,0.308991,0.897937,0.549577,0.535614,0.06443,0.434074,0.634874
4,0.511195,0.648137,0.452764,0.738311,0.920918,0.758556,0.920987,0.986382,0.012695,0.298222,...,0.450571,0.661597,0.773159,0.623088,0.538666,0.221494,0.599308,0.347215,0.905429,0.240725


In [68]:
# Pause this and reflect -- do you understand the list comprehension and how I created this DataFrame? You
# don't have to use a list comprehension to do this, but you should be able to read this and figure out how it
# works as this is a commonly used approach on web forums.
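
If the comprehension is hard to read, it may help to see it unrolled. The sketch below shows the same construction as an explicit loop, plus a single NumPy call that draws the whole grid at once. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# The comprehension builds a list of 100 arrays; each array becomes one row
rows = [np.random.random(100) for _ in range(100)]

# The same thing written as an explicit loop
rows_loop = []
for _ in range(100):
    rows_loop.append(np.random.random(100))

frame_from_list = pd.DataFrame(rows)
frame_from_loop = pd.DataFrame(rows_loop)

# NumPy can also draw the whole 100x100 grid in a single call
frame_direct = pd.DataFrame(np.random.random((100, 100)))

print(frame_from_list.shape, frame_from_loop.shape, frame_direct.shape)
```

All three frames have the same shape and hold uniform random values in [0, 1); the numbers themselves differ because each call draws fresh values.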


In [69]:
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])

In [71]:
# Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the
# same as the column inside df2?

# Let's take a look. Let's say our significance level is 0.1, or an alpha of 10%. And we're going to compare
# each column in df1 to the same numbered column in df2. And we'll report when the p-value is at or below 10%,
# which means that we have sufficient evidence to say that the columns are different.


# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    
    # And now we can just iterate over the columns
    for col in df1.columns:
        
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col],df2[col])
        
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
            
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()

Col 8 is statistically significantly different at alpha=0.1, pval=0.047333436555205236
Col 9 is statistically significantly different at alpha=0.1, pval=0.0985244483817065
Col 14 is statistically significantly different at alpha=0.1, pval=0.09114561122502587
Col 18 is statistically significantly different at alpha=0.1, pval=0.00432747890660054
Col 19 is statistically significantly different at alpha=0.1, pval=0.0604831989778226
Col 36 is statistically significantly different at alpha=0.1, pval=0.013454773927010581
Col 38 is statistically significantly different at alpha=0.1, pval=0.06330338524955584
Col 41 is statistically significantly different at alpha=0.1, pval=0.07648227629536573
Col 60 is statistically significantly different at alpha=0.1, pval=0.09847851029312407
Col 73 is statistically significantly different at alpha=0.1, pval=0.0873238089054977
Col 75 is statistically significantly different at alpha=0.1, pval=0.025021695007052053
Col 98 is statistically significantly differe

In [73]:
# Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a
# lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember
# that all the t-test does is check whether two samples differ by more than we would expect from chance, given
# some significance level, in our case 10%. The more random comparisons you do, the more will just happen to
# look different by chance. In this example, we checked 100 columns, so we would expect roughly 10 of them to
# be flagged if our alpha was 0.1.

# We can test some other alpha values as well
test_columns(0.05)

Col 8 is statistically significantly different at alpha=0.05, pval=0.047333436555205236
Col 18 is statistically significantly different at alpha=0.05, pval=0.00432747890660054
Col 36 is statistically significantly different at alpha=0.05, pval=0.013454773927010581
Col 75 is statistically significantly different at alpha=0.05, pval=0.025021695007052053
Total number different was 4, which is 4.0%
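
The "roughly 10 out of 100" intuition is simple arithmetic, and it can be written down directly. This is a small sketch with hypothetical helper names, and it assumes the tests are independent and that every null hypothesis is actually true, as in the random-data simulation above.

```python
# Assumption: the tests are independent and every null hypothesis is true
def expected_false_positives(alpha, num_tests):
    """Average number of tests that fall below alpha purely by chance."""
    return alpha * num_tests


def chance_of_at_least_one(alpha, num_tests):
    """Probability that at least one test falls below alpha by chance."""
    return 1 - (1 - alpha) ** num_tests


for alpha in (0.1, 0.05):
    print(alpha, expected_false_positives(alpha, 100), chance_of_at_least_one(alpha, 100))
```

With 100 comparisons, seeing at least one "significant" column is close to certain at either alpha, which is why a single small p-value among many tests is weak evidence on its own.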


In [75]:
# So, keep this in mind when you are doing statistical tests like the t-test which has a p-value. Understand
# that this p-value isn't magic, and that alpha is just a threshold you choose when reporting results and
# trying to answer your hypothesis. What's a reasonable threshold? Depends on your question, and you need to
# engage domain experts to better understand what they would consider significant.

# Just for fun, let's recreate that second dataframe using a different, non-uniform distribution. I'll
# arbitrarily choose chi squared with one degree of freedom
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=0.0006378210245149429
Col 1 is statistically significantly different at alpha=0.1, pval=0.00018900069454866493
Col 2 is statistically significantly different at alpha=0.1, pval=0.003733355022906926
Col 3 is statistically significantly different at alpha=0.1, pval=0.0016406718893629644
Col 4 is statistically significantly different at alpha=0.1, pval=0.00020896240916878174
Col 5 is statistically significantly different at alpha=0.1, pval=5.359333211387088e-05
Col 6 is statistically significantly different at alpha=0.1, pval=4.535623893934582e-05
Col 7 is statistically significantly different at alpha=0.1, pval=0.00033102266035165356
Col 8 is statistically significantly different at alpha=0.1, pval=4.8858628853425766e-05
Col 9 is statistically significantly different at alpha=0.1, pval=0.00047547794348675457
Col 10 is statistically significantly different at alpha=0.1, pval=4.830398595713815e-05
Col 11 is statistically sig

In [76]:
# Now we see that all or most columns test to be statistically significant at the 10% level.
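
This result is expected rather than a fluke, because the two distributions really do have different means: uniform values on [0, 1) average 0.5, while a chi-squared variable with one degree of freedom averages 1. The short check below draws large samples to show this. The seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Large samples so the sample means sit close to the theoretical ones
uniform_sample = rng.random(100_000)               # uniform on [0, 1): mean 0.5
chisq_sample = rng.chisquare(df=1, size=100_000)   # chi-squared, 1 dof: mean 1

print(uniform_sample.mean(), chisq_sample.mean())
```

Here the t-test is detecting a genuine difference in means, unlike the earlier simulation where every flagged column was a false positive.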

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical issues which arise from looking for statistical significance. There's much more to learn about hypothesis testing. For instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this should give you a basic idea of where to start when comparing two populations for differences, which is a common task for data scientists.