In this lecture we're going to review some of the basics of **statistical testing** in Python.

We're going to talk about **hypothesis testing**, **statistical significance**, and **using SciPy to run t-tests**.

We use statistics in a lot of different ways in data science.

**Hypothesis testing** is a core **data analysis** activity **behind experimentation**.

**The goal of hypothesis testing** is to determine **if**, for instance, **the two different conditions we have in an experiment** have resulted in **different impacts**.

**SciPy**:

* is the name of both a library and an **ecosystem of libraries** for **data science**, and we'll use most or perhaps all of these libraries.
* as an ecosystem, **includes numpy and pandas**, **plotting libraries such as matplotlib**, and the scipy library itself, which provides a number of **scientific library functions**.

In [1]:
# pip install numpy==1.23.0rc3  # install a numpy version compatible with the scipy library

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

When we do **hypothesis testing**, we actually have two statements of interest: the first is our **actual explanation**, which we call **the alternative hypothesis**, and the second is that **the explanation we have is not sufficient**, and we call this **the null hypothesis**.

Our **actual testing method** is to determine whether the data give us enough evidence to reject **the null hypothesis**.

If we find that **there is a difference between groups**, then we can **reject the null hypothesis** and we **accept the alternative hypothesis**.

In [3]:
df = pd.read_csv('datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [4]:
"There are {} rows and {} columns in df".format(df.shape[0], df.shape[1])

'There are 2315 rows and 13 columns in df'

For the purpose of this lecture, we're going to segment this population into two pieces: **those who finish the first assignment by the end of December 2015**, whom we'll call **early finishers**, and **those who finish it sometime after that**, whom we'll call **late finishers**.

In [5]:
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [6]:
(early_finishers.shape[0], early_finishers.shape[1])

(1259, 13)

Now we need the late finishers, and there are several ways to build that dataframe.

Solution 1:

**The dataframe df and early_finishers share index values**, so we really just **want everything** in **df that is not in early_finishers**.

In [8]:
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


Solution 2:

In this solution, we just copy and paste the first filter and change the comparison from less than to greater than or equal to.

Solution 2 is OK, but if we want to change the date we have to change it in two places.

In [9]:
late_finishers = df[pd.to_datetime(df['assignment1_submission']) >= '2016']
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000
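
One way to avoid changing the date in two places, sketched here on a small made-up frame rather than on grades.csv, is to build the boolean mask once and negate it for the second group. The names `toy`, `cutoff`, and `is_early` are illustrative assumptions, not part of the lecture code.

```python
import pandas as pd

# Toy stand-in for grades.csv (values are made up for illustration)
toy = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': [
        '2015-11-02 06:55:34', '2016-01-09 05:36:02',
        '2015-12-31 23:59:59', '2016-04-30 06:50:39',
    ],
})

cutoff = pd.Timestamp('2016-01-01')   # the date now lives in one place
is_early = pd.to_datetime(toy['assignment1_submission']) < cutoff

early = toy[is_early]
late = toy[~is_early]                 # ~ negates the mask, giving the complement
```

Because both groups come from the same mask, every row lands in exactly one of them.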


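Solution 3:

We can also join df with early_finishers. A left join keeps every row of the left dataframe, so rows of df with no match in early_finishers come back with missing values in the right-hand columns, and those rows are the late finishers. This would have been a good answer too.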

In [12]:
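# Left join on the index: every row of df is kept, and overlapping columns
# get the suffixes _x (from df) and _y (from early_finishers)
merged_df = pd.merge(df, early_finishers, how='left', left_index=True, right_index=True)
# Rows with no match in early_finishers have NaN in the _y columns
late_finishers = merged_df[merged_df['assignment1_submission_y'].isna()]
# Keep only the columns that came from df
late_finishers = late_finishers.loc[:, :'assignment6_submission_x']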
late_finishers.head()

Unnamed: 0,student_id_x,assignment1_grade_x,assignment1_submission_x,assignment2_grade_x,assignment2_submission_x,assignment3_grade_x,assignment3_submission_x,assignment4_grade_x,assignment4_submission_x,assignment5_grade_x,assignment5_submission_x,assignment6_grade_x,assignment6_submission_x
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000
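
As a variation on Solution 3, pandas can label where each merged row came from through the `indicator` parameter of `merge`, which avoids the suffixed columns. The sketch below runs on a small made-up frame and assumes `student_id` is unique per row; `toy`, `early`, and `late` are illustrative names.

```python
import pandas as pd

# Toy stand-in for grades.csv (values are made up for illustration)
toy = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': pd.to_datetime(
        ['2015-11-02', '2016-01-09', '2015-12-31', '2016-04-30']),
})
early = toy[toy['assignment1_submission'] < pd.Timestamp('2016-01-01')]

# Join on the key column only; indicator=True adds a '_merge' column
# whose values are 'left_only', 'right_only' or 'both'
merged = toy.merge(early[['student_id']], on='student_id',
                   how='left', indicator=True)

# Rows found only in the left frame are the late finishers
late = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```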


solution 4:

We can also write a function that determines whether someone is early or late, and then call .apply() on the dataframe to add a new column. This is a pretty reasonable answer as well.

In [14]:
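def early_late(record):
    # The timestamps are ISO-formatted strings, so comparing them as strings orders them by date
    if record['assignment1_submission'] < '2016':
        return 'early finishers'
    return 'late finishers'

# axis=1 passes each row to the function
df['type'] = df.apply(early_late, axis=1)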

early_finishers = df[df['type'] == 'early finishers']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission,type
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000,early finishers
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000,early finishers
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000,early finishers
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000,early finishers
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000,early finishers


In [15]:
late_finishers = df[df['type'] == 'late finishers']
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission,type
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000,late finishers
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000,late finishers
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000,late finishers
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000,late finishers
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000,late finishers


In [16]:
(late_finishers.shape[0], late_finishers.shape[1])

(1056, 14)
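
The row-by-row .apply() in Solution 4 can also be written as a single vectorized step. The sketch below uses `np.where` on a small made-up frame; it is an alternative shown for comparison, not the approach used in the lecture.

```python
import numpy as np
import pandas as pd

# Toy stand-in for grades.csv (values are made up for illustration)
toy = pd.DataFrame({
    'assignment1_submission': ['2015-11-02 06:55:34',
                               '2016-01-09 05:36:02',
                               '2015-12-13 17:06:10'],
})

submitted = pd.to_datetime(toy['assignment1_submission'])
# np.where picks the first label where the condition holds, else the second
toy['type'] = np.where(submitted < pd.Timestamp('2016-01-01'),
                       'early finishers', 'late finishers')
```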

As we've seen, **the pandas data frame object** has **a variety of statistical functions** associated with it. If we call the mean function on the assignment1_grade column of each group, we get the mean grade for that group.

In [19]:
print("the early_finishers grade mean for assignment 1: {}".format(early_finishers['assignment1_grade'].mean()))
print("the late_finishers grade mean for assignment 1: {}".format(late_finishers['assignment1_grade'].mean()))

the early_finishers grade mean for assignment 1: 74.94728457024303
the late_finishers grade mean for assignment 1: 74.0450648477065


**Let's compare the means for our two populations.**

Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where Student's t-test comes in.

**The t-test is a statistical test**.

**The t-test allows us to form the alternative hypothesis ("These are different")** as well as **the null hypothesis ("These are the same")**, and then **test the null hypothesis**.

**When doing hypothesis testing**, we have to choose **a significance level as a threshold for how much of a chance of wrongly rejecting the null hypothesis we're willing to accept**. This significance level is typically called **alpha**.

For this example, let's use **a threshold of 0.05 or 5% for our alpha**. Now this is a commonly used number but it's really quite arbitrary.

The **SciPy** library **contains a number of different statistical tests and forms a basis for hypothesis testing** in Python, and we're going to use the **ttest_ind()** function, which does an **independent t-test** (meaning **the populations are not related to one another**). **The results of ttest_ind()** are the **t-statistic** and a **p-value**.

**The p-value is a probability**, a number between 0 and 1, which **indicates the chance of seeing a difference at least as large as the one we observed if the null hypothesis were true**. It is not the probability that the null hypothesis itself is true.

**If the p-value is greater than the alpha threshold (pval > alpha), we cannot reject the null hypothesis**, but **this doesn't mean that we have proven the populations are the same**.

**If the p-value is less than or equal to the alpha threshold (pval <= alpha), we can reject the null hypothesis**. In this case, **the populations are statistically significantly different at that alpha**.

**The p-value is the most important value** that the ttest_ind() function will return for us.
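
To make the t-statistic and p-value less of a black box, here is a minimal sketch that computes both by hand and checks them against ttest_ind(). It uses seeded synthetic data, not the grades, and assumes the default equal-variance form of the test; the names `a`, `b`, `t_manual`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # seeded synthetic data, not the grades
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=180)

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2

# Pooled variance: by default ttest_ind assumes both groups share one variance
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value: chance, under the null, of a |t| at least this large
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The decision rule is then just a comparison of the p-value against the chosen alpha.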

In [20]:
from scipy.stats import ttest_ind

In [31]:
'assignment1_grade : {}'.format(ttest_ind(a= early_finishers['assignment1_grade'], b= late_finishers['assignment1_grade']))

'assignment1_grade : Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)'

So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This means that we cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we don't have enough evidence (because the p-value is greater than alpha) to come to a conclusion to the contrary. This doesn't mean that we have proven the populations are the same.

In [35]:
print("assignment2_grade : {}".format(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade'])))
print("assignment3_grade : {}".format(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade'])))
print("assignment4_grade : {}".format(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade'])))
print("assignment5_grade : {}".format(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade'])))
print("assignment6_grade : {}".format(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade'])))

assignment2_grade : Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
assignment3_grade : Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
assignment4_grade : Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
assignment5_grade : Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
assignment6_grade : Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with respect to grade. 
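
The repeated ttest_ind() calls can be collapsed into a loop that gathers the p-values in one pandas Series. The sketch below runs on seeded synthetic frames standing in for early_finishers and late_finishers, so its numbers are not the lecture's results; `early`, `late`, and `pvalues` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
cols = ['assignment{}_grade'.format(i) for i in range(1, 7)]

# Synthetic stand-ins for early_finishers and late_finishers
early = pd.DataFrame(rng.normal(70, 12, size=(120, 6)), columns=cols)
late = pd.DataFrame(rng.normal(70, 12, size=(100, 6)), columns=cols)

# One independent t-test per assignment column, keeping only the p-value
pvalues = pd.Series(
    {col: ttest_ind(early[col], late[col]).pvalue for col in cols},
    name='pvalue',
)
print(pvalues.round(3))
```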

Let's take a look at those p-values for a moment though, because they are saying things that can inform experimental design down the road.

**P-values have come under fire** recently **for being insufficient for telling us enough about the interactions that are happening**. For this reason, two other techniques, **confidence intervals** and **Bayesian analyses**, are being used more regularly.

One **issue** with **p-values** is that **as we run more tests we are likely to get a value which is statistically significant (pval <= alpha)** just by chance.
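
The arithmetic behind this is short. Assuming the tests are independent and every null hypothesis is true, the chance that at least one of m tests comes out with pval <= alpha is 1 - (1 - alpha)^m. The helper name `familywise_error` below is an illustrative assumption.

```python
def familywise_error(alpha, m):
    """Chance of at least one pval <= alpha across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print("{:>3} tests at alpha=0.05: {:.3f}".format(m, familywise_error(0.05, m)))
```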


In [91]:
df1 = pd.DataFrame([np.random.random(100) for row in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.578137,0.055896,0.940719,0.515093,0.950226,0.798664,0.922082,0.811472,0.354317,0.808405,...,0.462997,0.879078,0.173821,0.010979,0.75814,0.544931,0.607247,0.745306,0.057758,0.528566
1,0.747673,0.264812,0.753508,0.779665,0.856873,0.720847,0.263684,0.860284,0.027922,0.679667,...,0.771362,0.047675,0.308786,0.217226,0.833489,0.123287,0.769133,0.331656,0.427071,0.392213
2,0.296223,0.611617,0.343132,0.328514,0.394963,0.574821,0.313154,0.076682,0.085898,0.178595,...,0.620473,0.048829,0.029286,0.818644,0.621244,0.122123,0.020953,0.521337,0.621387,0.844049
3,0.160014,0.931669,0.984996,0.998118,0.442335,0.765138,0.542032,0.485719,0.196573,0.870543,...,0.214488,0.466068,0.895359,0.399128,0.15824,0.842644,0.845924,0.060311,0.486765,0.870758
4,0.45716,0.157134,0.455518,0.824685,0.36024,0.407491,0.672141,0.810924,0.846987,0.984509,...,0.248278,0.591765,0.979,0.402819,0.147105,0.723205,0.819794,0.472688,0.537232,0.669485


We created 100 rows, and each row holds 100 random numbers drawn uniformly from the interval [0, 1).

In [90]:
df2 = pd.DataFrame([np.random.random(100) for row in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.078932,0.941384,0.162811,0.340975,0.943668,0.758863,0.713388,0.813184,0.393019,0.57806,...,0.918239,0.404948,0.742721,0.495851,0.693143,0.281523,0.49995,0.411942,0.144587,0.046447
1,0.72307,0.339,0.82233,0.611427,0.264291,0.897197,0.555187,0.478302,0.399623,0.547246,...,0.176336,0.135427,0.267326,0.732678,0.982897,0.679624,0.410152,0.036417,0.54763,0.248511
2,0.270537,0.396375,0.233692,0.864959,0.119104,0.005373,0.554679,0.753271,0.832355,0.888091,...,0.621286,0.824854,0.526625,0.453082,0.449709,0.416995,0.925189,0.236002,0.53373,0.667975
3,0.903233,0.906638,0.782551,0.292318,0.945687,0.565829,0.567964,0.648442,0.276738,0.383173,...,0.113058,0.857359,0.671848,0.184256,0.87697,0.194758,0.560202,0.582845,0.337425,0.946504
4,0.983914,0.427219,0.4694,0.614732,0.593415,0.498634,0.246977,0.563126,0.026379,0.584474,...,0.936541,0.252743,0.961804,0.567942,0.426562,0.636063,0.129378,0.426015,0.050385,0.160094


Is a column in df1 the same as the same-numbered column in df2?

With an alpha value of 0.1, or 10%, we're going to compare each column in df1 to the same-numbered column in df2, and we'll report when the p-value is less than or equal to 0.1, which means that we have sufficient evidence to say that the columns are different.

In [84]:
def test_columns(alpha=0.1):
    num_diff = 0
    for col in df1.columns:
        t_stats, pval = ttest_ind(a= df1[col], b= df2[col])
        if pval <= alpha:
            print("column {} is statistically significantly different at alpha= {} and pval= {}".format(
                                            col, alpha, pval))
            num_diff += 1
            
    print("the total of number different {}, which is {}%".format(
                    num_diff, (num_diff / df1.shape[1]) * 100))

In [85]:
test_columns()

column 0 is statistically significantly different at alpha= 0.1 and pval= 0.014532334841064466
column 3 is statistically significantly different at alpha= 0.1 and pval= 0.08859188129656598
column 46 is statistically significantly different at alpha= 0.1 and pval= 0.09628103683733866
column 47 is statistically significantly different at alpha= 0.1 and pval= 0.049687588593923915
column 48 is statistically significantly different at alpha= 0.1 and pval= 0.03296485307462091
column 54 is statistically significantly different at alpha= 0.1 and pval= 0.08320639826975887
column 57 is statistically significantly different at alpha= 0.1 and pval= 0.002248901557707801
column 72 is statistically significantly different at alpha= 0.1 and pval= 0.06560601172484788
column 96 is statistically significantly different at alpha= 0.1 and pval= 0.013850835577966106
column 99 is statistically significantly different at alpha= 0.1 and pval= 0.09451164048740684
the total of number different 10, which is 10.0%

We see that there are a bunch of columns that are different. In fact, **the number of columns flagged as different grows with the alpha value we choose**.

In this example, we checked 100 columns, so we would expect **roughly 10 of them to be flagged as different if our alpha was 0.1**, even though both dataframes were drawn from the same distribution.

**The more random comparisons** we do, the more will just happen to be **significantly different** by chance.

In the following example, we checked 100 columns, so we would expect **roughly 5 of them to be flagged as different if our alpha was 0.05**.

In [86]:
test_columns(0.05)

column 0 is statistically significantly different at alpha= 0.05 and pval= 0.014532334841064466
column 47 is statistically significantly different at alpha= 0.05 and pval= 0.049687588593923915
column 48 is statistically significantly different at alpha= 0.05 and pval= 0.03296485307462091
column 57 is statistically significantly different at alpha= 0.05 and pval= 0.002248901557707801
column 96 is statistically significantly different at alpha= 0.05 and pval= 0.013850835577966106
the total of number different 5, which is 5.0%
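
To check that these counts are what chance alone predicts, the experiment can be repeated many more times. The sketch below is a seeded simulation under the assumption that both samples always come from the same uniform distribution; it uses the `axis` argument of ttest_ind() to run one test per column. The names `n_trials` and `false_positive_rate` are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 2000

# Each column is one experiment: two samples of 100 values from the same distribution
a = rng.random((100, n_trials))
b = rng.random((100, n_trials))

pvals = ttest_ind(a, b, axis=0).pvalue        # one p-value per column
false_positive_rate = np.mean(pvals <= alpha)
print(false_positive_rate)
```

The fraction of flagged columns should land close to alpha, which is the sense in which alpha is the false positive rate we agreed to accept.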


Keep this in mind when we are doing **statistical tests** like the **t-test, which reports a p-value**.

Understand that this **p-value isn't magic**; we compare it against a **threshold** when **reporting results** and trying **to answer a hypothesis**.

**What's a reasonable threshold for the p-value?**

It **depends on our question**, and we need to engage domain experts to better understand what they would consider significant.

Finally, let's rebuild df2 from a different distribution, a chi-square distribution with one degree of freedom, and run the comparison against df1 again.

In [89]:
df2 = pd.DataFrame([np.random.chisquare(df= 1, size= 100) for row in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.024104,0.591245,2.298736,0.305441,0.317571,0.404399,3.220644,0.165943,0.081436,0.113075,...,1.302274,0.758885,0.228342,0.399818,3.699864,1.349894,0.027041,1.703285,0.445159,2.685345
1,3.567365,7.91686,2.366448,0.257058,1.817855,0.32885,0.619564,2.650853,2.477572,12.006922,...,0.195634,0.160681,2.475874,1.093944,0.082976,0.006347,5.262316,0.292405,0.508539,0.013906
2,1.185825,1.00563,0.414641,0.060052,0.028205,0.047683,0.022563,0.79825,0.182035,0.274736,...,0.001957,0.531805,0.079884,1.899458,0.100359,0.002018,0.011806,5.2e-05,3.602904,0.004656
3,0.007949,2.390962,2.841913,6.573791,0.154919,4.161776,0.027403,0.000937,0.000198,0.05308,...,0.025837,1e-06,0.771446,0.284561,0.526343,0.011005,0.396242,0.390841,4.014505,0.00024
4,0.019233,0.16858,0.364683,0.352259,2.649488,6.411575,0.861237,0.135524,0.128791,3.718605,...,0.090782,0.090508,1.40854,0.931067,3.037156,1.754095,0.397025,0.100988,0.03963,0.02112


In [88]:
test_columns()

column 0 is statistically significantly different at alpha= 0.1 and pval= 0.0009338572262928515
column 1 is statistically significantly different at alpha= 0.1 and pval= 0.001471513866899381
column 2 is statistically significantly different at alpha= 0.1 and pval= 0.00010960192411109297
column 3 is statistically significantly different at alpha= 0.1 and pval= 0.0008350060452161352
column 4 is statistically significantly different at alpha= 0.1 and pval= 0.002286381483393534
column 5 is statistically significantly different at alpha= 0.1 and pval= 0.00034796419039537495
column 6 is statistically significantly different at alpha= 0.1 and pval= 0.0001279707488908462
column 7 is statistically significantly different at alpha= 0.1 and pval= 0.0019968928678215643
column 8 is statistically significantly different at alpha= 0.1 and pval= 0.00012325834110144126
column 9 is statistically significantly different at alpha= 0.1 and pval= 0.0008958008782596941
column 10 is statistically significantl

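Now the columns are flagged one after another, which is what we'd expect: the uniform values in df1 have a mean of 0.5, while a chi-square distribution with one degree of freedom has a mean of 1.

There's much more to learn about hypothesis testing.

For instance, there are different tests depending on the shape of our data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses.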
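
As a pointer for further study, here is a hedged sketch of two such alternatives on seeded synthetic, skewed data: a rank-based Mann-Whitney U test, which does not assume normally distributed values, and a 95% confidence interval for the difference in means using the Welch approximation. Neither technique is covered in this lecture, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.chisquare(df=1, size=100)     # skewed synthetic samples
b = rng.chisquare(df=1, size=100)

# Rank-based test: compares the two samples without assuming normality
u_stat, u_p = stats.mannwhitneyu(a, b, alternative='two-sided')

# 95% confidence interval for the difference in means (Welch approximation)
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
diff = a.mean() - b.mean()
se = np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
half_width = stats.t.ppf(0.975, dof) * se
ci = (diff - half_width, diff + half_width)

print("Mann-Whitney p-value: {:.3f}".format(u_p))
print("95% CI for the mean difference: ({:.3f}, {:.3f})".format(ci[0], ci[1]))
```

An interval reports the size and uncertainty of the difference, not only whether it crosses a threshold.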