# Welcome to Inference!
In this notebook, we take our first steps into inferential statistics, the branch of statistics that uses sample data to test a hypothesis, typically about a change or difference in a statistic between samples or groups.

We will study the independent t-test, which tests for a difference between the means of two independent samples. We will compute statistical significance and effect size, two key quantities in inferential statistics.
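
To make the t-test less of a black box, here is a minimal sketch of what it computes. It assumes the equal-variance (pooled) form, which is the default behavior of `stats.ttest_ind`. The two samples are synthetic and made up for illustration, and `pooled_t_statistic` is a hypothetical helper name, not part of scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples, made up for illustration only
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.3, scale=1.0, size=250)

def pooled_t_statistic(x, y):
    """Student's t statistic for two independent samples, assuming equal variances."""
    nx, ny = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    # Standard error of the difference between the two sample means
    standard_error = np.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (np.mean(x) - np.mean(y)) / standard_error

t_manual = pooled_t_statistic(group_a, group_b)
# Two-sided p-value from the t distribution with nx + ny - 2 degrees of freedom
dof = len(group_a) + len(group_b) - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print('manual:', t_manual, p_manual)
print('scipy: ', t_scipy, p_scipy)
```

The manual and scipy values should agree up to floating-point error. When the two groups have very different sizes or variances, passing `equal_var=False` to `stats.ttest_ind` (Welch's t-test) is a common alternative.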

In [1]:
import pandas as pd
from scipy import stats

In [14]:
df = pd.read_csv('health_data.csv')
df[df['CholCheck'] == 0]

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Diabetes,Hypertension,Stroke
77,8.0,1.0,1.0,0.0,37.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
121,1.0,1.0,0.0,0.0,36.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,30.0,0.0,0.0,0.0,0.0,0.0
172,6.0,0.0,0.0,0.0,31.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0
187,5.0,1.0,1.0,0.0,30.0,0.0,0.0,1.0,1.0,1.0,1.0,4.0,15.0,30.0,0.0,0.0,0.0,0.0
188,7.0,0.0,0.0,0.0,25.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69403,13.0,0.0,0.0,0.0,26.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,2.0,5.0,0.0,1.0,1.0,0.0
69544,8.0,1.0,0.0,0.0,30.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,1.0
69738,9.0,0.0,1.0,0.0,30.0,1.0,0.0,1.0,0.0,0.0,0.0,5.0,10.0,30.0,1.0,1.0,1.0,0.0
69853,10.0,1.0,1.0,0.0,30.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0


To reduce my risk of heart attacks, should I focus on improving my mental health or my physical health?
*DISCLAIMER:* THIS IS NOT MEDICAL ADVICE

In [8]:
a = df[df['MentHlth'] == 1]['HeartDiseaseorAttack']
b = df[df['MentHlth'] == 5]['HeartDiseaseorAttack']

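# lower is better; values well above 5 (e.g. 30.0) appear in this column,
# so 1 vs. 5 compares two specific levels rather than best vs. worst
t, p = stats.ttest_ind(a, b)
print('p-value:', p) # typically, we say the relationship is significant if p < 0.05
effect = a.mean() - b.mean() # raw difference in incidence between the two groups
print('effect size', effect)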

p-value: 2.3786212680130074e-09
effect size -0.05947601737279369


In [9]:
a = df[df['PhysHlth'] == 1]['HeartDiseaseorAttack']
b = df[df['PhysHlth'] == 5]['HeartDiseaseorAttack']

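# lower is better; values well above 5 (e.g. 30.0) appear in this column,
# so 1 vs. 5 compares two specific levels rather than best vs. worst
t, p = stats.ttest_ind(a, b)
print('p-value:', p) # typically, we say the relationship is significant if p < 0.05
effect = a.mean() - b.mean() # raw difference in incidence between the two groups
print('effect size', effect)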

p-value: 2.7097531316707e-23
effect size -0.09095072503500573
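
The "effect size" printed above is a raw difference in means. A commonly used standardized alternative is Cohen's d, which divides that difference by the pooled standard deviation. The sketch below is one way to compute it; `cohens_d` is a hypothetical helper written for this notebook, and the two small groups are made-up 0/1 outcomes, not rows from `health_data.csv`.

```python
import numpy as np

def cohens_d(x, y):
    """Difference in means divided by the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance, using the unbiased (ddof=1) sample variances
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Made-up binary outcomes for two groups (illustration only)
group_a = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
group_b = [0, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print('raw difference:', np.mean(group_a) - np.mean(group_b))
print("Cohen's d:", cohens_d(group_a, group_b))
```

With the notebook's data, the same function could be called as `cohens_d(a, b)`. For a 0/1 outcome, the raw difference is often easier to interpret, since it reads directly as a difference in incidence.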


Question: Does getting your cholesterol checked *change* the risk of a heart attack?

Hypotheses (to test the question):

Null hypothesis - the incidence of HeartDiseaseorAttack is the same in the population that got its cholesterol checked and the population that did not.

Alternative hypothesis - the incidence of HeartDiseaseorAttack differs between the two populations.
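
Because HeartDiseaseorAttack is a 0/1 outcome, these hypotheses are statements about two proportions. As a cross-check on any t-test result, one option is a chi-square test of independence on the 2x2 table of counts. The sketch below assumes a tiny made-up DataFrame called `toy` that only borrows the notebook's column names; the counts are invented.

```python
import pandas as pd
from scipy import stats

# Made-up data that reuses the notebook's column names (illustration only)
toy = pd.DataFrame({
    'CholCheck':            [0] * 20 + [1] * 40,
    'HeartDiseaseorAttack': [1] * 2 + [0] * 18 + [1] * 12 + [0] * 28,
})

# Rows: CholCheck (0/1), columns: HeartDiseaseorAttack (0/1)
table = pd.crosstab(toy['CholCheck'], toy['HeartDiseaseorAttack'])
print(table)

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default
chi2, p, dof, expected = stats.chi2_contingency(table)
print('chi2:', chi2)
print('p-value:', p)
```

On the real data, the same two lines would take `df['CholCheck']` and `df['HeartDiseaseorAttack']` in place of the `toy` columns.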

In [10]:
a = df[df['CholCheck'] == 0]['HeartDiseaseorAttack']
b = df[df['CholCheck'] == 1]['HeartDiseaseorAttack']

# 0 = cholesterol has not been checked, 1 = cholesterol has been checked
t, p = stats.ttest_ind(a, b)
print('p-value:', p) # typically, we say the relationship is significant if p < 0.05
effect = a.mean() - b.mean() # raw difference in incidence between the two groups
print('effect size', effect)

p-value: 5.829364932154651e-31
effect size -0.09938284215147875


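Apparently so! But look at the direction: a is the group that did not get checked, so the negative difference means the checked group has the higher incidence. Does this seem right? We have to be careful...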

In [11]:
no_check_mask = df['CholCheck'] == 0
df[no_check_mask].mean()

Age                      6.762150
Sex                      0.481990
HighChol                 0.256146
CholCheck                0.000000
BMI                     27.818182
Smoker                   0.488851
HeartDiseaseorAttack     0.050886
PhysActivity             0.726701
Fruits                   0.558605
Veggies                  0.787879
HvyAlcoholConsump        0.077187
GenHlth                  2.423099
MentHlth                 4.297885
PhysHlth                 3.628359
DiffWalk                 0.131504
Diabetes                 0.137793
Hypertension             0.241852
Stroke                   0.028016
dtype: float64

In [12]:
check_mask = df['CholCheck'] == 1
df[check_mask].mean()

Age                      8.630274
Sex                      0.456363
HighChol                 0.532541
CholCheck                1.000000
BMI                     29.908707
Smoker                   0.474929
HeartDiseaseorAttack     0.150269
PhysActivity             0.702435
Fruits                   0.613144
Veggies                  0.788797
HvyAlcoholConsump        0.041846
GenHlth                  2.847584
MentHlth                 3.738190
PhysHlth                 5.865773
DiffWalk                 0.255806
Diabetes                 0.509189
Hypertension             0.571617
Stroke                   0.063038
dtype: float64

Does it make sense that older people, heavier people, and people who already have high cholesterol would all be more likely to get their cholesterol checked? Could these pre-existing factors be the real reason we see this trend?
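
One simple way to probe this kind of confounding is to compare the two groups within levels of the suspected factor instead of across the whole dataset. The sketch below is an illustration under stated assumptions, not an analysis of `health_data.csv`. It simulates a toy world in which age (coded 1 to 13, a hypothetical coding) raises both the chance of getting checked and the chance of heart disease, while the check itself has no effect at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000

# Simulated toy data: age drives BOTH variables, the check has no causal effect
age = rng.integers(1, 14, size=n)                # age category 1..13 (assumed coding)
chol_check = rng.random(n) < 0.5 + 0.035 * age   # older -> more likely to be checked
heart = rng.random(n) < 0.02 * age               # older -> more heart disease

toy = pd.DataFrame({
    'Age': age,
    'CholCheck': chol_check.astype(int),
    'HeartDiseaseorAttack': heart.astype(int),
})

# Naive comparison, ignoring age
overall = toy.groupby('CholCheck')['HeartDiseaseorAttack'].mean()
overall_gap = overall.loc[1] - overall.loc[0]

# Same comparison made separately inside each age category
by_age = toy.groupby(['Age', 'CholCheck'])['HeartDiseaseorAttack'].mean().unstack('CholCheck')
within_age_gap = by_age[1] - by_age[0]

# Average the within-age gaps, weighting each age category by its size
stratum_size = toy.groupby('Age').size()
adjusted_gap = (within_age_gap * stratum_size).sum() / stratum_size.sum()

print('gap ignoring age:      ', overall_gap)
print('gap within age groups: ', adjusted_gap)
```

In this simulation the naive gap is clearly positive even though the check does nothing, and it mostly disappears once we compare people of the same age. The same `groupby` pattern could be applied to `df`, but it only adjusts for the factors we choose to stratify on.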