### Confidence Interval - Difference In Means

Here you will look through the example from the last video, but you will also go a couple of steps further into what might actually be going on with this data.

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('doc/coffee_dataset.csv')
sample_data = full_data.sample(200)

In [20]:
full_data.head()

Unnamed: 0,user_id,age,drinks_coffee,height
0,4509,<21,False,64.538179
1,1864,>=21,True,65.824249
2,2060,<21,False,71.319854
3,7875,>=21,True,68.569404
4,6254,<21,True,64.020226


`1.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for coffee and non-coffee drinkers.  Build a 99% confidence interval using your sampling distribution.  Use your interval to start answering the first quiz question below.

In [21]:
boot_diffs = []
for value in range(10000):
    bootsample = sample_data.sample(200,replace = True)
    coff_mean = bootsample[bootsample['drinks_coffee']== True]['height'].mean()
    nocoff_mean = bootsample[bootsample['drinks_coffee']== False]['height'].mean()
    boot_diff = coff_mean - nocoff_mean
    boot_diffs.append(boot_diff)
    
np.percentile(boot_diffs, 0.5), np.percentile(boot_diffs, 99.5)

(0.10258900080919674, 2.5388333707966284)

`2.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for those older than 21 and those younger than 21.  Build a 99% confidence interval using your sampling distribution.  Use your interval to finish answering the first quiz question below.  

In [27]:
age_diffs = []
for value in range(10000):
    bootsample = sample_data.sample(200 , replace=True)
    under21_mean = bootsample[bootsample['age'] == '<21']['height'].mean()
    over21_mean = bootsample[bootsample['age'] == '>=21']['height'].mean()
    age_diff =  over21_mean - under21_mean
    age_diffs.append(age_diff)
np.percentile(age_diffs, 0.5), np.percentile(age_diffs, 99.5)

(3.3502745897258372, 5.109059900189735)

`3.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **under** 21 years old.  Using your sampling distribution, build a 95% confidence interval.  Use your interval to start answering question 2 below.

In [31]:
diffs_coff_over21 = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace = True)
    over21_coff_mean = bootsamp.query("age != '<21' and drinks_coffee == True")['height'].mean()
    over21_nocoff_mean = bootsamp.query("age != '<21' and drinks_coffee == False")['height'].mean()
    diffs_coff_over21.append(over21_nocoff_mean - over21_coff_mean)

In [33]:
under21_diffs = []
for value in range(10000):
    bootsample = sample_data.sample(200 , replace=True)
    coff_mean = bootsample.query("age == '<21' and drinks_coffee == True")['height'].mean()
    noncoff_mean = bootsample.query("age == '<21' and drinks_coffee == False")['height'].mean()
    under21_diff =  noncoff_mean - coff_mean
    under21_diffs.append(under21_diff)
np.percentile(under21_diffs, 2.5), np.percentile(under21_diffs, 98.5)

(1.0865706186573634, 2.6847440744854763)

`4.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **over** 21 years old.  Using your sampling distribution, build a 95% confidence interval. Use your interval to finish answering the second quiz question below. As well as the following questions. 

In [35]:
over21_diffs = []
for value in range(10000):
    bootsample = sample_data.sample(200 , replace=True)
    coff_mean = bootsample.query("age != '<21' and drinks_coffee == True")['height'].mean()
    noncoff_mean = bootsample.query("age != '<21' and drinks_coffee == False")['height'].mean()
    over21_diff =  noncoff_mean - coff_mean
    over21_diffs.append(over21_diff)
np.percentile(over21_diffs, 2.5), np.percentile(over21_diffs, 98.5)

(1.7836677437342696, 4.502295585482247)

Within the under 21 and over 21 groups, we saw that on average non-coffee drinkers were taller. But, when combined, we saw that on average coffee drinkers were on average taller. This is again Simpson's paradox, and essentially there are more adults in the dataset who were coffee drinkers. So these individuals made it seem like coffee drinkers were on average taller - which is a misleading result.