# Welcome to Inference!
In this notebook, we take our first steps into inferential statistics: the branch of statistics that uses sample data to test a hypothesis, typically about a change or difference in a statistic between samples or groups.

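We will study the independent t-test, which tests for a difference between the means of two independent samples. For each comparison we will compute statistical significance (the p-value) and effect size, both of which are key in inferential statistics. In this notebook, "effect size" means the raw difference between the two group means, not a standardized measure.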
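To make the two quantities concrete before touching the real data, here is a minimal sketch on synthetic samples. The `cohens_d` helper and the simulated groups are assumptions for illustration only and are not part of the notebook's analysis. Cohen's d is one common standardized effect size, shown here next to the raw difference in means that the rest of the notebook reports.

```python
import numpy as np
from scipy import stats


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation.

    Hypothetical helper for illustration; the notebook itself reports
    the raw difference in means instead.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Synthetic data (an assumption): two independent groups whose true means differ by 0.5.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=500)
group_b = rng.normal(loc=0.5, scale=1.0, size=500)

# Independent two-sample t-test; the p-value measures statistical significance.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

raw_difference = group_a.mean() - group_b.mean()  # effect size in the data's own units
d = cohens_d(group_a, group_b)                    # effect size in standard-deviation units

print('p-value:', p_value)
print('raw difference in means:', raw_difference)
print("Cohen's d:", d)
```

A tiny p-value only says the difference is unlikely to be zero; the effect size says how large the difference is. With large samples, even a very small difference can be highly significant.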

In [3]:
import pandas as pd
from scipy import stats

In [8]:
df = pd.read_csv('health_data.csv')
df.head()

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Diabetes,Hypertension,Stroke
0,4.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0,5.0,30.0,0.0,0.0,1.0,0.0
1,12.0,1.0,1.0,1.0,26.0,1.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0
2,13.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,10.0,0.0,0.0,0.0,0.0
3,11.0,1.0,1.0,1.0,28.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,3.0,0.0,0.0,1.0,0.0
4,8.0,0.0,0.0,1.0,29.0,1.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0


To reduce my risk of heart attacks, should I focus on improving my mental health or my physical health?

In [65]:
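# Heart disease/attack indicator (0 or 1) for two groups: MentHlth == 1 vs. MentHlth == 5
a = df[df['MentHlth'] == 1]['HeartDiseaseorAttack']
b = df[df['MentHlth'] == 5]['HeartDiseaseorAttack']
# Independent t-test for a difference between the two group means
t, p = stats.ttest_ind(a, b)
print('p-value:', p)
# Raw difference in means (group a minus group b); the outcome is 0/1, so this is a difference in rates
print('effect size:', a.mean() - b.mean())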

p-value: 2.3786212680130074e-09
effect size: -0.05947601737279369


In [64]:
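# Same comparison, now grouping by PhysHlth == 1 vs. PhysHlth == 5
a = df[df['PhysHlth'] == 1]['HeartDiseaseorAttack']
b = df[df['PhysHlth'] == 5]['HeartDiseaseorAttack']
# Independent t-test for a difference between the two group means
t, p = stats.ttest_ind(a, b)
print('p-value:', p)
# Raw difference in means (group a minus group b)
print('effect size:', a.mean() - b.mean())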

p-value: 2.7097531316707e-23
effect size: -0.09095072503500573
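The two cells above follow the same pattern: select two groups, run the t-test, and print the p-value and the difference in means. One way to avoid repeating it is a small helper. This is only a sketch; `compare_groups` and the toy DataFrame below are hypothetical and are not part of the original notebook.

```python
import pandas as pd
from scipy import stats


def compare_groups(data, group_col, value_a, value_b, outcome_col):
    """Run an independent t-test on outcome_col for two values of group_col.

    Hypothetical helper: returns the group sizes, the p-value, and the
    raw difference in means (group a minus group b).
    """
    a = data.loc[data[group_col] == value_a, outcome_col]
    b = data.loc[data[group_col] == value_b, outcome_col]
    t_stat, p_value = stats.ttest_ind(a, b)
    return {
        'n_a': len(a),
        'n_b': len(b),
        'p_value': p_value,
        'mean_difference': a.mean() - b.mean(),
    }


# Toy data (an assumption) that reuses the notebook's column names.
toy = pd.DataFrame({
    'MentHlth': [1, 1, 1, 1, 5, 5, 5, 5],
    'HeartDiseaseorAttack': [0, 0, 0, 1, 0, 1, 1, 1],
})

result = compare_groups(toy, 'MentHlth', 1, 5, 'HeartDiseaseorAttack')
print(result)
```

With the real data, the same call would be `compare_groups(df, 'MentHlth', 1, 5, 'HeartDiseaseorAttack')`. Returning the group sizes alongside the p-value makes it easier to spot comparisons that rest on very few rows.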


Does getting your cholesterol checked increase the risk of a heart attack?

In [66]:
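# Group a: no cholesterol check (CholCheck == 0); group b: cholesterol checked (CholCheck == 1)
a = df[df['CholCheck'] == 0]['HeartDiseaseorAttack']
b = df[df['CholCheck'] == 1]['HeartDiseaseorAttack']
# Independent t-test for a difference between the two group means
t, p = stats.ttest_ind(a, b)
print('p-value:', p)
# Raw difference in means (unchecked minus checked); negative means the checked group has the higher rate
print('effect size:', a.mean() - b.mean())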

p-value: 5.829364932154651e-31
effect size: -0.09938284215147875


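Apparently so! The heart disease or attack rate is about 10 percentage points higher among people who had their cholesterol checked, and the difference is highly significant. But does this seem right? We have to be careful: a significant difference between two groups does not tell us what caused it.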

In [72]:
no_check_mask = df['CholCheck'] == 0
df[no_check_mask].mean()

Age                      6.762150
Sex                      0.481990
HighChol                 0.256146
CholCheck                0.000000
BMI                     27.818182
Smoker                   0.488851
HeartDiseaseorAttack     0.050886
PhysActivity             0.726701
Fruits                   0.558605
Veggies                  0.787879
HvyAlcoholConsump        0.077187
GenHlth                  2.423099
MentHlth                 4.297885
PhysHlth                 3.628359
DiffWalk                 0.131504
Diabetes                 0.137793
Hypertension             0.241852
Stroke                   0.028016
dtype: float64

In [73]:
check_mask = df['CholCheck'] == 1
df[check_mask].mean()

Age                      8.630274
Sex                      0.456363
HighChol                 0.532541
CholCheck                1.000000
BMI                     29.908707
Smoker                   0.474929
HeartDiseaseorAttack     0.150269
PhysActivity             0.702435
Fruits                   0.613144
Veggies                  0.788797
HvyAlcoholConsump        0.041846
GenHlth                  2.847584
MentHlth                 3.738190
PhysHlth                 5.865773
DiffWalk                 0.255806
Diabetes                 0.509189
Hypertension             0.571617
Stroke                   0.063038
dtype: float64

In [75]:
df[check_mask].mean() - df[no_check_mask].mean()

Age                     1.868124
Sex                    -0.025627
HighChol                0.276395
CholCheck               1.000000
BMI                     2.090525
Smoker                 -0.013922
HeartDiseaseorAttack    0.099383
PhysActivity           -0.024266
Fruits                  0.054539
Veggies                 0.000918
HvyAlcoholConsump      -0.035341
GenHlth                 0.424485
MentHlth               -0.559695
PhysHlth                2.237414
DiffWalk                0.124302
Diabetes                0.371396
Hypertension            0.329765
Stroke                  0.035022
dtype: float64

Does it make sense that older people, heavier people, and people who already have high cholesterol would all be more likely to get their cholesterol checked? Could these pre-existing factors, rather than the check itself, be the real reason we see this trend?
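One way to probe this question is to compare checked and unchecked people within groups that share the same pre-existing factor. The sketch below uses simulated data, not `health_data.csv`: the `Old` column and all probabilities are assumptions chosen so that age drives both the chance of being checked and the chance of heart disease, while the check itself has no effect at all.

```python
import numpy as np
import pandas as pd

# Simulated population (an assumption for illustration, not the notebook's data).
rng = np.random.default_rng(0)
n = 200_000
old = rng.integers(0, 2, size=n)  # 1 = older, 0 = younger

# Older people are more likely to get checked AND more likely to have heart disease.
# The check has no causal effect on disease in this simulation.
check = rng.random(n) < np.where(old == 1, 0.90, 0.50)
disease = rng.random(n) < np.where(old == 1, 0.20, 0.05)

toy = pd.DataFrame({
    'Old': old,
    'CholCheck': check.astype(int),
    'HeartDiseaseorAttack': disease.astype(int),
})

# Naive comparison: disease rate by CholCheck, ignoring age.
overall = toy.groupby('CholCheck')['HeartDiseaseorAttack'].mean()
overall_gap = overall[1] - overall[0]

# Stratified comparison: the same gap, computed separately within each age group.
by_age = toy.groupby(['Old', 'CholCheck'])['HeartDiseaseorAttack'].mean().unstack()
within_gaps = by_age[1] - by_age[0]

print('overall gap (checked - unchecked):', overall_gap)
print('gap within each age group:')
print(within_gaps)
```

In this simulation the overall gap is clearly positive even though the gap within each age group is close to zero, which is the signature of a confounder. On the real data, a similar stratified comparison by `Age`, `BMI`, or `HighChol` would be a first check, though it cannot rule out factors that were never measured.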