# Hypothesis Testing Demo

In [1]:
import pandas as pd

In [2]:
URL = 'http://scipy-lectures.org/_downloads/brain_size.csv'

In [3]:
df = pd.read_csv(URL, sep=';', na_values=".", index_col=0)

In [4]:
df.head(12)

Unnamed: 0,Gender,FSIQ,VIQ,PIQ,Weight,Height,MRI_Count
1,Female,133,132,124,118.0,64.5,816932
2,Male,140,150,124,,72.5,1001121
3,Male,139,123,150,143.0,73.3,1038437
4,Male,133,129,128,172.0,68.8,965353
5,Female,137,132,134,147.0,65.0,951545
6,Female,99,90,110,146.0,69.0,928799
7,Female,138,136,131,138.0,64.5,991305
8,Female,92,90,98,175.0,66.0,854258
9,Male,89,93,84,134.0,66.3,904858
10,Male,133,114,147,172.0,68.8,955466


In [5]:
df.describe()

Unnamed: 0,FSIQ,VIQ,PIQ,Weight,Height,MRI_Count
count,40.0,40.0,40.0,38.0,39.0,40.0
mean,113.45,112.35,111.025,151.052632,68.525641,908755.0
std,24.082071,23.616107,22.47105,23.478509,3.994649,72282.05
min,77.0,71.0,72.0,106.0,62.0,790619.0
25%,89.75,90.0,88.25,135.25,66.0,855918.5
50%,116.5,113.0,115.0,146.5,68.0,905399.0
75%,135.5,129.75,128.0,172.0,70.5,950078.0
max,144.0,150.0,150.0,192.0,77.0,1079549.0


In [6]:
# importing scipy

from scipy import stats

# One-Sample t-Test

I'm curious whether the average IQs in this sample differ from the standard average IQ, which I happen to know is 100. The null hypothesis is that the mean IQ of the population from which this sample is drawn is 100, and the alternative hypothesis is that it is not.

Let's use 5% as our significance level, alpha (that is, alpha = 0.05).

In [7]:
IQ_column_names = ['FSIQ', 'VIQ', 'PIQ']

for IQ_column in IQ_column_names:
  print(stats.ttest_1samp(df[IQ_column], 100))

Ttest_1sampResult(statistic=3.532307014238269, pvalue=0.0010766792736967715)
Ttest_1sampResult(statistic=3.3074146385401786, pvalue=0.002030117404781822)
Ttest_1sampResult(statistic=3.1030246997178783, pvalue=0.0035555593418294417)


All three p-values are smaller than alpha, so we reject the null hypothesis for each score. Sample means of about 113, 112, and 111 (for FSIQ, VIQ, and PIQ) would be unlikely if the population mean were really 100, so the gap is probably due to something other than random sampling variation.
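To make the test less of a black box, here is a minimal sketch of the arithmetic behind `stats.ttest_1samp`: the t statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors, and the two-sided p-value comes from the t distribution with n - 1 degrees of freedom. To stay self-contained, it assumes only the ten FSIQ values shown in the `df.head` output rather than all 40 rows, so its numbers will not match the results printed above. The names `fsiq`, `popmean`, `t_manual`, and `p_manual` are made up for this illustration.

```python
import numpy as np
from scipy import stats

# First ten FSIQ values from the df.head output (NOT the full sample of 40)
fsiq = np.array([133, 140, 139, 133, 137, 99, 138, 92, 89, 133], dtype=float)
popmean = 100.0  # hypothesized population mean under the null hypothesis

n = fsiq.size
sample_mean = fsiq.mean()
sample_std = fsiq.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic: how many standard errors the sample mean lies from popmean
t_manual = (sample_mean - popmean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual computation
result = stats.ttest_1samp(fsiq, popmean)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

Both printed lines should agree up to floating-point rounding, which shows that the library call is doing nothing more exotic than this formula.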

Speculating as to why these IQs are above average, I imagined a scenario in which subjects are recruited for a study at a university. Many of these subjects would naturally be students.

As it turns out, this speculation was correct, as you can confirm by looking at the article linked at the top of this notebook.

# Two-Sample t-Test

Suppose we want to compare the IQs of men and women.

In [8]:
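# Group rows by gender, then print each group's mean for every IQ column.
# The output is ordered FSIQ, VIQ, PIQ, with Female before Male in each pair.
groupby_gender = df.groupby('Gender')
for IQ_column in IQ_column_names:
  for gender, value in groupby_gender[IQ_column]:
    print((gender, value.mean()))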

('Female', 111.9)
('Male', 115.0)
('Female', 109.45)
('Male', 115.25)
('Female', 110.45)
('Male', 111.6)
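
The group means differ somewhat, so a natural next step is to ask whether the gaps exceed what sampling variation would produce. One possible way to do that is `stats.ttest_ind`, sketched below. To stay self-contained, the sketch assumes only the ten rows shown in the `df.head` output, so its result says nothing about the full data set. Passing `equal_var=False` (Welch's t-test) is a choice made for this illustration, not something the notebook prescribes, and the names `sample`, `female`, and `male` are hypothetical.

```python
import pandas as pd
from scipy import stats

# Ten rows copied from the df.head output; the full data set has 40 rows
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female",
               "Female", "Female", "Female", "Male", "Male"],
    "FSIQ": [133, 140, 139, 133, 137, 99, 138, 92, 89, 133],
})

# Split the FSIQ scores into the two independent groups
female = sample.loc[sample["Gender"] == "Female", "FSIQ"]
male = sample.loc[sample["Gender"] == "Male", "FSIQ"]

# Welch's t-test (equal_var=False) does not assume equal group variances
result = stats.ttest_ind(female, male, equal_var=False)
print(result.statistic, result.pvalue)
```

As with the one-sample test, the decision rule is to reject the null hypothesis of equal population means only if the p-value falls below alpha.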
