### Confidence Interval - Difference In Means

Here you will work through the example from the last video, and then go a couple of steps further to see what might actually be going on with this data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('../data/coffee_dataset.csv')
sample_data = full_data.sample(200)

`1.` For 10,000 iterations, bootstrap sample your sample data and compute the difference in the average heights for coffee and non-coffee drinkers. Build a 99% confidence interval using your sampling distribution. Use your interval to start answering the first quiz question below.

In [10]:
diff_coffee = []
for _ in range(10000):
    # Resample the 200 rows with replacement
    sample = sample_data.sample(200, replace=True)
    coffee_mean = sample[sample['drinks_coffee']==True]['height'].mean()
    nocoffee_mean = sample[sample['drinks_coffee']==False]['height'].mean()
    # Positive values mean coffee drinkers are taller on average
    diff_coffee.append(coffee_mean-nocoffee_mean)

In [11]:
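# The 2.5th and 97.5th percentiles bound the middle 95% of the bootstrap differences;
# a 99% interval would use the 0.5th and 99.5th percentiles instead.
np.percentile(diff_coffee, 2.5), np.percentile(diff_coffee, 97.5)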

(0.42348750310031807, 2.264930717442466)
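The link between a confidence level and the percentiles passed to `np.percentile` is easy to get wrong, so here is a minimal sketch of one way to wrap it in a helper. The function names `percentile_bounds` and `bootstrap_mean_diff_ci` and the `toy` DataFrame are hypothetical and introduced only for illustration. The data is synthetic, not the coffee dataset, and the default of 1,000 iterations is chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd


def percentile_bounds(level):
    """Return the lower and upper percentiles for a two-sided interval."""
    tail = (1 - level) / 2 * 100
    return tail, 100 - tail


def bootstrap_mean_diff_ci(df, group_col, value_col, level=0.95, n_iter=1000, seed=0):
    """Percentile bootstrap interval for mean(group True) - mean(group False)."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_iter):
        # Resample row positions with replacement, keeping the original sample size.
        idx = rng.integers(0, len(df), size=len(df))
        boot = df.iloc[idx]
        in_group = boot[group_col].to_numpy(dtype=bool)
        values = boot[value_col].to_numpy()
        diffs.append(values[in_group].mean() - values[~in_group].mean())
    lower, upper = percentile_bounds(level)
    return np.percentile(diffs, lower), np.percentile(diffs, upper)


# Synthetic stand-in data (an assumption, not the real coffee dataset).
data_rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "drinks_coffee": data_rng.random(200) < 0.6,
    "height": data_rng.normal(67, 3, size=200),
})

ci_95 = bootstrap_mean_diff_ci(toy, "drinks_coffee", "height", level=0.95)
ci_99 = bootstrap_mean_diff_ci(toy, "drinks_coffee", "height", level=0.99)
print("95%:", ci_95)
print("99%:", ci_99)
```

Because both calls use the same seed, they draw the same bootstrap samples, so the 99% interval is at least as wide as the 95% interval.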

`2.` For 10,000 iterations, bootstrap sample your sample data and compute the difference in the average heights for those 21 and older and those under 21. Build a 99% confidence interval using your sampling distribution. Use your interval to finish answering the first quiz question below.

In [13]:
diff_age = []
for _ in range(10000):
    # Resample the 200 rows with replacement
    sample = sample_data.sample(200, replace=True)
    u21_mean = sample.query('age=="<21"').height.mean()
    oe21_mean = sample.query('age==">=21"').height.mean()
    # Positive values mean the 21-and-older group is taller on average
    diff_age.append(oe21_mean-u21_mean)

In [14]:
np.percentile(diff_age,0.5),np.percentile(diff_age,99.5)

(3.384624971838698, 5.1051788925373289)

`3.` For 10,000 iterations, bootstrap your sample data and compute the difference between the average height of coffee drinkers and the average height of non-coffee drinkers for individuals under 21 years old. Using your sampling distribution, build a 95% confidence interval. Use your interval to start answering question 2 below.

In [32]:
diff_u21_coffee = []
for _ in range(10000):
    # Resample the 200 rows with replacement, then compare only the under-21 rows
    sample = sample_data.sample(200, replace = True)
    coffee = sample[(sample['age']=='<21') & (sample['drinks_coffee']==True)]['height'].mean()
    nocoffee = sample[(sample['age']=='<21') & (sample['drinks_coffee']==False)]['height'].mean()
    # Negative values mean non-coffee drinkers are taller on average
    diff_u21_coffee.append(coffee-nocoffee)

In [33]:
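# The 0.5th and 99.5th percentiles give a 99% interval;
# a 95% interval would use the 2.5th and 97.5th percentiles instead.
np.percentile(diff_u21_coffee, 0.5), np.percentile(diff_u21_coffee, 99.5)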

(-2.8581301454263999, -0.82934312803391042)

`4.` For 10,000 iterations, bootstrap your sample data and compute the difference between the average height of coffee drinkers and the average height of non-coffee drinkers for individuals 21 and older. Using your sampling distribution, build a 95% confidence interval. Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [35]:
diff_o21_coffee = []
for _ in range(10000):
    # Resample the 200 rows with replacement, then compare only the 21-and-older rows
    sample = sample_data.sample(200, replace = True)
    coffee_o21 = sample[(sample['age']=='>=21') & (sample['drinks_coffee']==True)]['height'].mean()
    nocoffee_o21 = sample[(sample['age']=='>=21') & (sample['drinks_coffee']==False)]['height'].mean()
    # Negative values mean non-coffee drinkers are taller on average
    diff_o21_coffee.append(coffee_o21-nocoffee_o21)

In [36]:
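# Both bounds must come from diff_o21_coffee. The upper bounds in the outputs recorded
# below were computed from diff_u21_coffee, so rerun these cells to get the upper
# bounds for the 21-and-older group.
np.percentile(diff_o21_coffee, 0.5), np.percentile(diff_o21_coffee, 99.5)

(-4.8109488329749963, -0.82934312803391042)

In [37]:
np.percentile(diff_o21_coffee, 2.5), np.percentile(diff_o21_coffee, 97.5)

(-4.3801456069175968, -1.0854810109996373)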

Within both the under-21 and the 21-and-older groups, non-coffee drinkers were taller on average. But when the groups were combined, coffee drinkers were taller on average. This is again **Simpson's paradox**: the coffee drinkers in the dataset are disproportionately adults, who are taller on average, so the combined comparison makes coffee drinkers look taller, which is a misleading result.
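The reversal can be reproduced with a small hand-built example. The sketch below uses hypothetical toy heights, not the coffee dataset, and the helper `mean_height` is introduced only for illustration. It is set up so that most adults drink coffee and most under-21s do not.

```python
import pandas as pd

# Hypothetical toy data (not the coffee dataset): heights in inches.
rows = (
    [("<21", True, 61.0)] * 2      # few young coffee drinkers
    + [("<21", False, 63.0)] * 8   # many young non-coffee drinkers
    + [(">=21", True, 68.0)] * 8   # many adult coffee drinkers
    + [(">=21", False, 70.0)] * 2  # few adult non-coffee drinkers
)
toy = pd.DataFrame(rows, columns=["age", "drinks_coffee", "height"])


def mean_height(df, coffee, age=None):
    """Mean height for coffee or non-coffee drinkers, optionally within one age group."""
    mask = df["drinks_coffee"] == coffee
    if age is not None:
        mask = mask & (df["age"] == age)
    return df.loc[mask, "height"].mean()


# Coffee minus non-coffee, combined and within each age group.
diff_all = mean_height(toy, True) - mean_height(toy, False)
diff_u21 = mean_height(toy, True, "<21") - mean_height(toy, False, "<21")
diff_o21 = mean_height(toy, True, ">=21") - mean_height(toy, False, ">=21")

# Share of each age group that drinks coffee: this imbalance drives the reversal.
coffee_share = toy.groupby("age")["drinks_coffee"].mean()

print(f"combined: {diff_all:+.1f}, under 21: {diff_u21:+.1f}, 21 and older: {diff_o21:+.1f}")
print(coffee_share)
```

In this toy data, non-coffee drinkers are taller by 2 inches within each age group, yet the combined difference favors coffee drinkers because 80% of the adults drink coffee compared with 20% of the under-21s.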

The broader idea behind this is that of confounding variables: here, age is related to both coffee drinking and height. You will learn even more about these in the regression section of the course.