# ***Hypothesis testing: Statistical significance & Student's t test using SciPy***

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

### **Hypothesis testing involves two hypotheses**
- Null hypothesis ($H_{0}$) - the default position that there is no real effect or difference between the groups
- Alternative hypothesis ($H_{1}$) - our actual explanation, i.e. that there is a difference

***Hypothesis testing checks whether the observed data are compatible with the Null hypothesis. If the difference between the groups is statistically significant, we reject the Null hypothesis in favour of our alternative hypothesis; otherwise we fail to reject it.***

In [20]:
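# parse_dates lists the positions of the six submission timestamp columns
# note: infer_datetime_format is deprecated since pandas 2.0 and can be dropped there
grades = pd.read_csv(r"D:/Introduction-to-Data-Science-in-Python/week-4/datasets/grades.csv", parse_dates = [2, 4, 6, 8, 10, 12],
                    infer_datetime_format = True)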

In [21]:
grades.head()

| | student_id | assignment1_grade | assignment1_submission | assignment2_grade | assignment2_submission | assignment3_grade | assignment3_submission | assignment4_grade | assignment4_submission | assignment5_grade | assignment5_submission | assignment6_grade | assignment6_submission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B73F2C11-70F0-E37D-8B10-1D20AFED50B1 | 92.733946 | 2015-11-02 06:55:34.282 | 83.030552 | 2015-11-09 02:22:58.938 | 67.164441 | 2015-11-12 08:58:33.998 | 53.011553 | 2015-11-16 01:21:24.663 | 47.710398 | 2015-11-20 13:24:59.692 | 38.168318 | 2015-11-22 18:31:15.934 |
| 1 | 98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1 | 86.790821 | 2015-11-29 14:57:44.429 | 86.290821 | 2015-12-06 17:41:18.449 | 69.772657 | 2015-12-10 08:54:55.904 | 55.098125 | 2015-12-13 17:32:30.941 | 49.588313 | 2015-12-19 23:26:39.285 | 44.629482 | 2015-12-21 17:07:24.275 |
| 2 | D0F62040-CEB0-904C-F563-2F8620916C4E | 85.512541 | 2016-01-09 05:36:02.389 | 85.512541 | 2016-01-09 06:39:44.416 | 68.410033 | 2016-01-15 20:22:45.882 | 54.728026 | 2016-01-11 12:41:50.749 | 49.255224 | 2016-01-11 17:31:12.489 | 44.329701 | 2016-01-17 16:24:42.765 |
| 3 | FFDF2B2C-F514-EF7F-6538-A6A53518E9DC | 86.030665 | 2016-04-30 06:50:39.801 | 68.824532 | 2016-04-30 17:20:38.727 | 61.942079 | 2016-05-12 07:47:16.326 | 49.553663 | 2016-05-07 16:09:20.485 | 49.553663 | 2016-05-24 12:51:18.016 | 44.598297 | 2016-05-26 08:09:12.058 |
| 4 | 5ECBEEB6-F1CE-80AE-3164-E45E99473FB4 | 64.8138 | 2015-12-13 17:06:10.750 | 51.49104 | 2015-12-14 12:25:12.056 | 41.932832 | 2015-12-29 14:25:22.594 | 36.929549 | 2015-12-28 01:29:55.901 | 33.236594 | 2015-12-29 14:46:06.628 | 33.236594 | 2016-01-05 01:06:59.546 |


In [22]:
# contains submission dates & grades for 6 different assignments

grades.columns

Index(['student_id', 'assignment1_grade', 'assignment1_submission',
       'assignment2_grade', 'assignment2_submission', 'assignment3_grade',
       'assignment3_submission', 'assignment4_grade', 'assignment4_submission',
       'assignment5_grade', 'assignment5_submission', 'assignment6_grade',
       'assignment6_submission'],
      dtype='object')

In [23]:
grades.shape

(2315, 13)

In [24]:
[column for column in grades.columns if "submission" in column]

['assignment1_submission',
 'assignment2_submission',
 'assignment3_submission',
 'assignment4_submission',
 'assignment5_submission',
 'assignment6_submission']

In [25]:
grades.dtypes

student_id                        object
assignment1_grade                float64
assignment1_submission    datetime64[ns]
assignment2_grade                float64
assignment2_submission    datetime64[ns]
assignment3_grade                float64
assignment3_submission    datetime64[ns]
assignment4_grade                float64
assignment4_submission    datetime64[ns]
assignment5_grade                float64
assignment5_submission    datetime64[ns]
assignment6_grade                float64
assignment6_submission    datetime64[ns]
dtype: object

In [27]:
# let's categorize students based on the first assignment submission date
# submitted before 31 December 2015 (the cutoff is midnight at the start of that day) -> Early finishers
# others -> Late finishers

pd.to_datetime(grades.assignment1_submission, format = "%Y-%m-%d %H:%M:%S") \
                    .apply(lambda date: "Early Finisher" if date < pd.Timestamp(year = 2015, month = 12, day = 31) else "Late Finisher")

0       Early Finisher
1       Early Finisher
2        Late Finisher
3        Late Finisher
4       Early Finisher
             ...      
2310     Late Finisher
2311    Early Finisher
2312    Early Finisher
2313     Late Finisher
2314    Early Finisher
Name: assignment1_submission, Length: 2315, dtype: object
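The same labelling can be done without a per-row lambda. The following is a minimal sketch, assuming a small hypothetical `submissions` series in place of `grades.assignment1_submission`; it uses `np.where` on a vectorised comparison against the cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for grades.assignment1_submission
submissions = pd.Series(pd.to_datetime([
    "2015-11-02 06:55:34.282",
    "2015-12-30 23:59:59.000",
    "2015-12-31 00:00:00.000",
    "2016-01-09 05:36:02.389",
]))

# Same cutoff as in the notebook: midnight at the start of 31 December 2015
deadline = pd.Timestamp(year=2015, month=12, day=31)

# The comparison is evaluated for the whole column at once
labels = pd.Series(
    np.where(submissions < deadline, "Early Finisher", "Late Finisher"),
    index=submissions.index,
)
print(labels.tolist())
```

Note that a submission stamped exactly at the cutoff is labelled as late, because the comparison is a strict `<`.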

In [42]:
deadline = pd.Timestamp(year = 2015, month = 12, day = 31)

In [43]:
early_finishers = grades.loc[pd.to_datetime(grades.assignment1_submission,  format = "%Y-%m-%d %H:%M:%S") < deadline, :]

In [45]:
late_finishers = grades.loc[pd.to_datetime(grades.assignment1_submission,  format = "%Y-%m-%d %H:%M:%S") >= deadline, :]

In [46]:
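# the ~ operator negates a boolean array element-wise, so the mask for late finishers
# could also be written as the negation of the mask for early finishers
bools = np.array([True, True, False, True, False, False])
~bools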

array([False, False,  True, False,  True,  True])

In [47]:
f"Late finishers:: Average grade: {late_finishers.assignment1_grade.mean()}, Standard deviation: {late_finishers.assignment1_grade.std()} & \
Median: {late_finishers.assignment1_grade.median()}."

'Late finishers:: Average grade: 74.01742892115759, Standard deviation: 16.738814907901745 & Median: 76.61331847067096.'

In [48]:
f"Early finishers:: Average grade: {early_finishers.assignment1_grade.mean()}, Standard deviation: {early_finishers.assignment1_grade.std()} & \
Median: {early_finishers.assignment1_grade.median()}."

'Early finishers:: Average grade: 74.9727408643377, Standard deviation: 16.014613849987388 & Median: 77.48239466865009.'
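The two summaries above can also be produced in one step. This is a minimal sketch, assuming a small hypothetical `toy` dataframe with the same two column names as `grades`; it groups the rows by an early/late label and aggregates the grade column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical stand-in for the grades dataframe (values are made up)
toy = pd.DataFrame({
    "assignment1_grade": rng.uniform(30, 100, size=8),
    "assignment1_submission": pd.to_datetime([
        "2015-11-02", "2015-11-29", "2016-01-09", "2016-04-30",
        "2015-12-13", "2016-02-01", "2015-10-15", "2016-03-03",
    ]),
})
deadline = pd.Timestamp(year=2015, month=12, day=31)

# One label per row, then one row of statistics per label
group = np.where(toy["assignment1_submission"] < deadline, "early", "late")
summary = toy.groupby(group)["assignment1_grade"].agg(["count", "mean", "std", "median"])
print(summary)
```

Keeping both groups in one table makes it easier to compare their means and spreads side by side before running a test.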

## ***Hypothesis testing***

### ***Null hypothesis: The two averages are the same (no significant difference).***
### ***Alternative hypothesis: The two averages are different.***

<a style="font-size:20px">***Hypothesis testing requires a significance level: the probability of wrongly rejecting a true Null hypothesis that we are willing to tolerate.  
This is typically denoted by $\alpha$.<br> For this test, let's use an $\alpha$ of 0.05 (which is 5%).***</a>

In [52]:
# let's do a t test
# ttest_ind() is the independent t test, which assumes that the samples in the two groups are not related to one another
# the ttest_ind() function from scipy evaluates the two series of values and returns a t statistic & a p value
# the p value is the probability of seeing a t statistic at least this extreme if the Null hypothesis were true
# (it is not the probability of the Null hypothesis being true)
# once we have the p value, we can compare it with our alpha

stats.ttest_ind(early_finishers.assignment1_grade, late_finishers.assignment1_grade)

Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
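To make the two returned numbers less opaque, here is a minimal sketch that recomputes them by hand. It assumes two hypothetical samples `a` and `b` (random values, not the grades data) and uses the pooled-variance form of Student's t test, which is what `ttest_ind()` computes with its default `equal_var=True`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable

# Hypothetical samples standing in for the two groups of grades
a = rng.normal(loc=75.0, scale=16.0, size=200)
b = rng.normal(loc=74.0, scale=17.0, size=150)

n_a, n_b = len(a), len(b)
df = n_a + n_b - 2  # degrees of freedom

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df

# t statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p value: probability of a |t| at least this large under the Null hypothesis
p_manual = 2 * stats.t.sf(abs(t_manual), df)

result = stats.ttest_ind(a, b)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

If the two groups cannot be assumed to share a variance, `stats.ttest_ind(a, b, equal_var=False)` runs Welch's t test instead.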

In [53]:
# the p value 0.16148283016 is well above the alpha of 0.05, so we fail to reject the Null hypothesis
# to reject the Null hypothesis (and accept the alternative), the p value must be less than 0.05
# (we treat a result as significant only if it would occur by chance less than 5% of the time when the Null hypothesis is true)

# This however does not prove that the groups are the same

In [58]:
# repeat this for other assignments

for grade_column in [column for column in grades.columns if "grade" in column]:
    print(f"{grade_column.replace('_grade', '')}: {stats.ttest_ind(early_finishers.loc[:, grade_column], late_finishers.loc[:, grade_column])}")

assignment1: Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
assignment2: Ttest_indResult(statistic=1.323986822091257, pvalue=0.18563824610067958)
assignment3: Ttest_indResult(statistic=1.7116160037010733, pvalue=0.08710151634155668)
assignment4: Ttest_indResult(statistic=0.16232182017140787, pvalue=0.8710666110447575)
assignment5: Ttest_indResult(statistic=0.06063973879942835, pvalue=0.9516513635792874)
assignment6: Ttest_indResult(statistic=-0.009767754757653123, pvalue=0.9922074255698552)


**In this data, there isn't enough evidence to suggest that the two groups differ in their grades. Since no assignment has a p value less than 5% (0.05), we cannot reject the Null hypothesis in favour of our explanation.      
Assignments 4, 5 & 6 have fairly large p values (0.87 to 0.99, close to the maximum of 1), so in these assignments the observed differences between the two groups are entirely consistent with chance. In assignment 3, the p value is 0.0871, the smallest of the six, which hints at a difference between the groups but still exceeds our tolerance threshold of 5%.        
If we had chosen $\alpha$ = 0.1 instead, the grades of the third assignment would count as a statistically significant difference between the two groups!**

**A major disadvantage of the p value is that it says nothing about the size or nature of the difference between the two groups. A p value can also fall below the threshold just by chance, especially when many tests are run! To overcome this, confidence intervals and Bayesian analyses are used.**
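As an illustration of the confidence interval mentioned above, here is a minimal sketch that computes a 95% interval for the difference between two group means. The samples `early` and `late` are hypothetical random values, not the grades data, and the interval uses the same pooled-variance assumption as the default `ttest_ind()`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # seeded so the sketch is repeatable

# Hypothetical samples standing in for the two groups of grades
early = rng.normal(loc=75.0, scale=16.0, size=300)
late = rng.normal(loc=74.0, scale=17.0, size=300)

n_e, n_l = len(early), len(late)
df = n_e + n_l - 2
diff = early.mean() - late.mean()

# Standard error of the difference, using the pooled variance
pooled_var = ((n_e - 1) * early.var(ddof=1) + (n_l - 1) * late.var(ddof=1)) / df
se = np.sqrt(pooled_var * (1 / n_e + 1 / n_l))

# Critical t value for a two-sided 95% interval
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

p_value = stats.ttest_ind(early, late).pvalue
print(f"difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_value:.3f}")
```

Unlike the p value alone, the interval shows how large the difference plausibly is. It excludes zero exactly when the two-sided p value is below 0.05.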

In [70]:
# a simulation to demonstrate the shortcomings of t tests
# let's create two 100 x 10 dataframes of random integers drawn from the same distribution

dummy_1 = pd.DataFrame(np.random.randint(0, 100, 1000).reshape(100, 10))
dummy_2 = pd.DataFrame(np.random.randint(0, 100, 1000).reshape(100, 10))

In [74]:
# print the p value for each pair of columns and check how many fall below 0.1

for i in range(10):
    print("P value of column {} of dummy dataframe 1 & column {} of dummy dataframe 2 is {}".format(
                           i, i, stats.ttest_ind(dummy_1.loc[:, i], dummy_2.loc[:, i])[1]))

P value of column 0 of dummy dataframe 1 & column 0 of dummy dataframe 2 is 0.1909439773428309
P value of column 1 of dummy dataframe 1 & column 1 of dummy dataframe 2 is 0.38791883467736343
P value of column 2 of dummy dataframe 1 & column 2 of dummy dataframe 2 is 0.8594123019889872
P value of column 3 of dummy dataframe 1 & column 3 of dummy dataframe 2 is 0.623258818729865
P value of column 4 of dummy dataframe 1 & column 4 of dummy dataframe 2 is 0.05517316762733526
P value of column 5 of dummy dataframe 1 & column 5 of dummy dataframe 2 is 0.8835800850764407
P value of column 6 of dummy dataframe 1 & column 6 of dummy dataframe 2 is 0.7876615916589355
P value of column 7 of dummy dataframe 1 & column 7 of dummy dataframe 2 is 0.34603320164454465
P value of column 8 of dummy dataframe 1 & column 8 of dummy dataframe 2 is 0.8858942989876363
P value of column 9 of dummy dataframe 1 & column 9 of dummy dataframe 2 is 0.3921258801318731


In [75]:
# let's widen the range of the random values and try this again

dummy_1 = pd.DataFrame(np.random.randint(0, 1000, 1000).reshape(100, 10))
dummy_2 = pd.DataFrame(np.random.randint(0, 1000, 1000).reshape(100, 10))

In [77]:
for i in range(10):
    print("P value of column {} of dummy_1 & column {} of dummy_2 is {}".format(
                           i, i, stats.ttest_ind(dummy_1.loc[:, i], dummy_2.loc[:, i])[1]))

P value of column 0 of dummy_1 & column 0 of dummy_2 is 0.855192102708796
P value of column 1 of dummy_1 & column 1 of dummy_2 is 0.6475661186975601
P value of column 2 of dummy_1 & column 2 of dummy_2 is 0.267575167291769
P value of column 3 of dummy_1 & column 3 of dummy_2 is 0.4924944617793594
P value of column 4 of dummy_1 & column 4 of dummy_2 is 0.2963549997815036
P value of column 5 of dummy_1 & column 5 of dummy_2 is 0.38925103459730304
P value of column 6 of dummy_1 & column 6 of dummy_2 is 0.8558039869029401
P value of column 7 of dummy_1 & column 7 of dummy_2 is 0.30325543759497514
P value of column 8 of dummy_1 & column 8 of dummy_2 is 0.12507044229267342
P value of column 9 of dummy_1 & column 9 of dummy_2 is 0.44958734065507233


In [78]:
dummy_1 = pd.DataFrame(np.random.randint(0, 10000, 1000).reshape(100, 10))
dummy_2 = pd.DataFrame(np.random.randint(0, 10000, 1000).reshape(100, 10))

for i in range(10):
    print("P value of column {} of dummy_1 & column {} of dummy_2 is {}".format(
                           i, i, stats.ttest_ind(dummy_1.loc[:, i], dummy_2.loc[:, i])[1]))

P value of column 0 of dummy_1 & column 0 of dummy_2 is 0.545081779544611
P value of column 1 of dummy_1 & column 1 of dummy_2 is 0.8262302709831132
P value of column 2 of dummy_1 & column 2 of dummy_2 is 0.09721320567926282
P value of column 3 of dummy_1 & column 3 of dummy_2 is 0.9315415104018732
P value of column 4 of dummy_1 & column 4 of dummy_2 is 0.6416805974993413
P value of column 5 of dummy_1 & column 5 of dummy_2 is 0.013428800213645285
P value of column 6 of dummy_1 & column 6 of dummy_2 is 0.29741573197537247
P value of column 7 of dummy_1 & column 7 of dummy_2 is 0.9030139763428984
P value of column 8 of dummy_1 & column 8 of dummy_2 is 0.40574093793008525
P value of column 9 of dummy_1 & column 9 of dummy_2 is 0.8644205446232408
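The three runs above use only ten column pairs each, so how often a small p value appears by chance is hard to judge. The following is a minimal sketch that repeats the same experiment many times; the seed, `alpha` and `n_trials` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # seeded so the sketch is repeatable
alpha = 0.05
n_trials = 2000

# Each column is one trial: two samples of 100 integers from the same distribution
x = rng.integers(0, 100, size=(100, n_trials))
y = rng.integers(0, 100, size=(100, n_trials))

# One t test per column pair
p_values = stats.ttest_ind(x, y, axis=0).pvalue

# Fraction of trials flagged as "significant" although there is no real difference
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

Because both samples always come from the same distribution, every p value below `alpha` is a false positive, and the printed fraction should land close to `alpha` itself.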


In [80]:
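# both dataframes always come from the same distribution, yet in the last run column 5 has a p value of 0.0134 (< 0.05): a false positive
# the threshold for the p value (alpha) is not a universal standard, and heavily depends on the research context.
# One needs to engage domain experts to determine a sensible threshold.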