
## Basic Statistical Testing

Hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

In [20]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import ttest_ind

# SciPy is a library for scientific computing that builds on NumPy. The broader
# SciPy ecosystem also includes pandas, matplotlib for plotting, and many more scientific libs.

## Hypothesis testing

Hypothesis testing is a core data analysis activity that underpins experimentation. Its goal is to determine whether, for instance, two different conditions in an experiment have resulted in different impacts.

When we do hypothesis testing, we have two statements of interest: the first is our actual explanation, which is called the alternative hypothesis, and the second is that the explanation we have is not sufficient, and it is called the null hypothesis.

The goal of a testing method is to determine whether the data provide enough evidence to reject the null hypothesis. If a statistically significant difference between groups is found, the null hypothesis is rejected in favor of the alternative.

In [4]:
# Loading data
grades = pd.read_csv('grades.csv')
grades.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [15]:
# Dividing the population into two groups.
# Students who submitted their first assignment by the end of
# December 2015 will be called 'early finishers', while
# students who submitted it
# sometime after that will be called 'late finishers'.

early_finishers = grades[pd.to_datetime(grades['assignment1_submission']) < '2016']
late_finishers = grades[~grades.index.isin(early_finishers.index)]

In [17]:
mean_early = early_finishers['assignment1_grade'].mean()
mean_late = late_finishers['assignment1_grade'].mean()

mean_early, mean_late

(74.94728457024304, 74.0450648477065)

That's where Student's t-test comes in. Are these two means the same? The t-test allows us to form the alternative hypothesis 'THESE ARE DIFFERENT' as well as the null hypothesis 'THESE ARE THE SAME' and then test that null hypothesis.

When doing hypothesis testing, we need to choose a significance level as a threshold for how much of a chance of a false positive we are willing to accept. This significance level is typically called alpha.

In [23]:
# Adopting a threshold of 5%, an alpha of 0.05 (it's a commonly used number but it's quite arbitrary)
alpha = 0.05

The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing in Python. Here we'll use the ttest_ind() function, which performs an independent t-test, meaning that the two samples are not related to one another.


The results of ttest_ind() are the t-statistic and the p-value. It's this latter value, a probability between zero and one, that matters most to us: it is the chance of seeing a difference at least as large as the one observed if the null hypothesis were true.
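As a rough sketch of what `ttest_ind()` computes with its default settings (equal variances assumed), the pooled-variance t-statistic and its two-sided p-value can be worked out by hand and compared with SciPy's result. The two samples below are synthetic stand-ins rather than the grades data, and the names `group_a` and `group_b` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumption: not the grades data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=75, scale=10, size=200)
group_b = rng.normal(loc=74, scale=10, size=300)

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2

# Pooled sample variance, using the unbiased (ddof=1) variance of each group
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's result exposes the same numbers as named attributes
result = stats.ttest_ind(group_a, group_b)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Reading `result.statistic` and `result.pvalue` by name is an alternative to indexing the result by position.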

In [26]:
# Bringing in the ttest_ind function
pack = ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
probability = pack[1]

probability < alpha, probability, alpha

(False, 0.18618101101713855, 0.05)

The p-value is above our alpha value. This means we cannot reject the null hypothesis that the two populations are the same; we don't have enough certainty in our evidence to conclude otherwise.

This doesn't mean it has been proven that the populations are the same. We can also check the other assignments' grades.

In [27]:
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

TtestResult(statistic=1.2514717608216366, pvalue=0.2108889627004424, df=2313.0)
TtestResult(statistic=1.6133726558705392, pvalue=0.10679998102227865, df=2313.0)
TtestResult(statistic=0.049671157386456125, pvalue=0.960388729789337, df=2313.0)
TtestResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492, df=2313.0)
TtestResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656, df=2313.0)


It appears we don't have enough evidence in this data to suggest the populations differ with respect to grade.

If we take a closer look at assignment three and its p-value of around 0.1, we see that had we accepted an alpha of 11%, this result would have been considered statistically significant. This suggests there may be something here worth following up on.
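The five near-identical `ttest_ind()` calls could also be folded into a loop that collects the results in one table. The sketch below uses synthetic stand-in frames (`early_demo` and `late_demo` are hypothetical names) so that it runs on its own; in the notebook the same loop would presumably be applied to `early_finishers` and `late_finishers`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (assumption: not the real grades)
rng = np.random.default_rng(1)
columns = [f'assignment{i}_grade' for i in range(1, 7)]
early_demo = pd.DataFrame(rng.normal(75, 10, size=(120, 6)), columns=columns)
late_demo = pd.DataFrame(rng.normal(75, 10, size=(150, 6)), columns=columns)

alpha = 0.05
rows = []
for col in columns:
    # One independent t-test per assignment column
    result = ttest_ind(early_demo[col], late_demo[col])
    rows.append({
        'assignment': col,
        'statistic': result.statistic,
        'pvalue': result.pvalue,
        'reject_null': result.pvalue < alpha,
    })

summary = pd.DataFrame(rows)
print(summary)
```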

P-values have come under fire for being insufficient for telling us enough about the interactions that are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used more regularly.

One issue with p-values is their fragility. The more tests you run, the more likely you are to get a value that is statistically significant just by chance.
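To put a number on that fragility, here is a small sketch of the chance of seeing at least one false positive across several tests. It assumes the tests are independent and that every null hypothesis is true, which will not hold exactly for related measurements such as grades from the same students. The helper name `familywise_error_rate` is made up for this example.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

# How the chance of a spurious "significant" result grows with the number of tests
for m in (1, 6, 20, 100):
    print(m, round(familywise_error_rate(0.05, m), 3))
```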

In [28]:
# Let's simulate it.
random = pd.DataFrame([np.random.random(100) for x in range(100)])
random.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.533168,0.922108,0.426075,0.09474,0.321487,0.335258,0.825352,0.33224,0.300093,0.439383,...,0.568197,0.150114,0.062735,0.607959,0.696388,0.671451,0.75252,0.194396,0.409206,0.022066
1,0.043451,0.657803,0.836316,0.011588,0.316258,0.784021,0.584552,0.285052,0.040222,0.318279,...,0.375984,0.928132,0.934407,0.712306,0.610277,0.30574,0.145727,0.554339,0.485878,0.327824
2,0.490451,0.106278,0.98546,0.762957,0.988016,0.230659,0.2019,0.053925,0.967673,0.749211,...,0.856546,0.382869,0.42677,0.243841,0.301867,0.721501,0.996882,0.749405,0.665976,0.397752
3,0.873466,0.914549,0.912621,0.668618,0.352162,0.332661,0.331353,0.670239,0.026584,0.115365,...,0.152971,0.621251,0.335966,0.706975,0.305718,0.941191,0.505898,0.494598,0.067513,0.197382
4,0.691262,0.890532,0.384648,0.092986,0.121481,0.568642,0.27851,0.801916,0.380063,0.565975,...,0.284117,0.855121,0.826111,0.555276,0.461351,0.327758,0.472509,0.262414,0.162318,0.033486


In [29]:
# Now a second DataFrame
random2 = pd.DataFrame([np.random.random(100) for x in range(100)])
random2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.60421,0.56706,0.14036,0.304626,0.666864,0.662157,0.710947,0.925766,0.528579,0.858471,...,0.775051,0.74631,0.070835,0.810937,0.979292,0.789904,0.254106,0.244679,0.235292,0.691939
1,0.44301,0.035649,0.582926,0.874777,0.90355,0.268179,0.572179,0.146317,0.229842,0.399989,...,0.463498,0.881958,0.027967,0.630738,0.587873,0.99858,0.53463,0.488407,0.306134,0.11654
2,0.731826,0.219762,0.572853,0.841965,0.387434,0.707536,0.199051,0.807222,0.018745,0.07222,...,0.220114,0.718653,0.566082,0.878779,0.539614,0.601953,0.70994,0.042251,0.447737,0.440269
3,0.077495,0.326206,0.437881,0.864916,0.463126,0.130097,0.737082,0.988303,0.877723,0.652264,...,0.885351,0.969221,0.825446,0.587213,0.53126,0.920977,0.077343,0.936986,0.650788,0.645601
4,0.82922,0.745201,0.173735,0.169196,0.504753,0.331583,0.959117,0.408756,0.021748,0.99628,...,0.180034,0.074035,0.4914,0.589693,0.879114,0.321259,0.890047,0.833302,0.359426,0.860456


Question: for a given column inside random, is it the same as the matching column inside random2?

We'll adopt an alpha of 10%, compare each column in random to the same-numbered column in random2, and report when the p-value is less than or equal to 10%, which means we have sufficient evidence to say the columns are different.


In [31]:
def test_columns(alpha=0.1):

  # Keeping track of how many differ
  num_diff = 0

  # Iterating over the columns
  for col in random.columns:

    # Testing
    teststat, pval = ttest_ind(random[col], random2[col])

    # Checking alpha versus probability
    if pval <= alpha:
      print(f'Col {col} is statistically significantly different at alpha = {alpha} and pval = {pval}')
      num_diff += 1

  print(f'Total number different was {num_diff}, which is {float(num_diff)/len(random.columns)*100}')

test_columns()

Col 4 is statistically significantly different at alpha = 0.1 and pval = 0.01896061355496562
Col 6 is statistically significantly different at alpha = 0.1 and pval = 0.01096246305299877
Col 10 is statistically significantly different at alpha = 0.1 and pval = 0.07352968834808218
Col 13 is statistically significantly different at alpha = 0.1 and pval = 0.09305276753574451
Col 17 is statistically significantly different at alpha = 0.1 and pval = 0.011684416456636417
Col 36 is statistically significantly different at alpha = 0.1 and pval = 0.001568120064132054
Col 49 is statistically significantly different at alpha = 0.1 and pval = 0.08418662740212511
Col 56 is statistically significantly different at alpha = 0.1 and pval = 0.04510457963917731
Col 90 is statistically significantly different at alpha = 0.1 and pval = 0.006233007872892748
Col 92 is statistically significantly different at alpha = 0.1 and pval = 0.04488085796185302
Total number different was 10, which is 10.0


A bunch of columns are flagged as different, and their number looks a lot like the alpha value we chose. So what's going on? Shouldn't all of the columns be the same?

All the t-test does is check whether two samples are similar at some significance level, in our case 10 percent. The more random comparisons you do, the more will appear different just by chance. In this example, we checked 100 columns.

So we would expect roughly 10 of them to be flagged as different if our alpha was 0.1. We can test some other alpha values as well.

In [32]:
test_columns(0.05)

Col 4 is statistically significantly different at alpha = 0.05 and pval = 0.01896061355496562
Col 6 is statistically significantly different at alpha = 0.05 and pval = 0.01096246305299877
Col 17 is statistically significantly different at alpha = 0.05 and pval = 0.011684416456636417
Col 36 is statistically significantly different at alpha = 0.05 and pval = 0.001568120064132054
Col 56 is statistically significantly different at alpha = 0.05 and pval = 0.04510457963917731
Col 90 is statistically significantly different at alpha = 0.05 and pval = 0.006233007872892748
Col 92 is statistically significantly different at alpha = 0.05 and pval = 0.04488085796185302
Total number different was 7, which is 7.000000000000001
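A single run can land above or below the expected count, as the 7 out of 100 at alpha = 0.05 shows. One way to check that the flagged fraction tracks alpha on average is to repeat the whole experiment many times. The sketch below is a seeded simulation, not the notebook's own data; the seed and `n_trials` are arbitrary choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.1
n_trials = 200  # arbitrary number of repetitions for this sketch

fractions = []
for _ in range(n_trials):
    # Two independent 100x100 tables of uniform random numbers
    a = rng.random((100, 100))
    b = rng.random((100, 100))
    # axis=0 runs one t-test per column, returning 100 p-values
    pvals = ttest_ind(a, b, axis=0).pvalue
    fractions.append(np.mean(pvals <= alpha))

# Average share of columns flagged as different; expected to sit close to alpha
print(np.mean(fractions))
```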


Understand that the p-value isn't magic: it is compared against a threshold that you choose when reporting results and testing your hypothesis. What's a reasonable threshold? That depends on your question, and you need to engage domain experts to better understand what they would consider significant.

In [33]:
# For fun, let's recreate the second DataFrame using a different distribution (chi-squared, 1 degree of freedom)
random2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha = 0.1 and pval = 0.0008068254028278329
Col 1 is statistically significantly different at alpha = 0.1 and pval = 0.005871166779670925
Col 2 is statistically significantly different at alpha = 0.1 and pval = 0.0004897019794074245
Col 3 is statistically significantly different at alpha = 0.1 and pval = 0.0001259531968645479
Col 4 is statistically significantly different at alpha = 0.1 and pval = 7.279254898360494e-05
Col 5 is statistically significantly different at alpha = 0.1 and pval = 0.0001450080275995744
Col 6 is statistically significantly different at alpha = 0.1 and pval = 0.00017026737276517698
Col 7 is statistically significantly different at alpha = 0.1 and pval = 0.03369754138943464
Col 8 is statistically significantly different at alpha = 0.1 and pval = 8.57950693902678e-05
Col 9 is statistically significantly different at alpha = 0.1 and pval = 0.0007103029479145262
Col 10 is statistically significantly different at al

All of the columns test as statistically significantly different at the 10% level.
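A likely reason is that the two distributions have different means: uniform random numbers on [0, 1) average 0.5, while a chi-squared distribution with one degree of freedom averages 1. The sketch below checks this on large synthetic samples; the seed and sample sizes are arbitrary.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
uniform_sample = rng.random(10_000)              # uniform on [0, 1), mean 0.5
chisq_sample = rng.chisquare(df=1, size=10_000)  # chi-squared, 1 dof, mean 1

# The sample means should sit near 0.5 and 1 respectively
print(uniform_sample.mean(), chisq_sample.mean())

# With means this far apart, the t-test rejects the null decisively
print(ttest_ind(uniform_sample, chisq_sample).pvalue)
```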