### Confidence Interval - Difference In Means

Here you will work through the example from the last video, then go a couple of steps further to see what might actually be going on with this data.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)
sample_data.head()

| | user_id | age | drinks_coffee | height |
|---|---|---|---|---|
| 2402 | 2874 | <21 | True | 64.357154 |
| 2864 | 3670 | >=21 | True | 66.859636 |
| 2167 | 7441 | <21 | False | 66.659561 |
| 507 | 2781 | >=21 | True | 70.166241 |
| 1817 | 2875 | >=21 | True | 71.36912 |


`1.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between coffee drinkers and non-coffee drinkers.  Build a 99% confidence interval from the resulting sampling distribution.  Use your interval to start answering the first quiz question below.

In [4]:
diff = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace=True)
    drink_h = bootsample.query('drinks_coffee')['height'].mean()
    nodrink_h = bootsample.query('drinks_coffee == False')['height'].mean()
    diff.append(drink_h - nodrink_h)

np.percentile(diff, 0.5), np.percentile(diff, 99.5)

(0.10258900080921117, 2.538833370796657)
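
The same loop is repeated in each step of this notebook with a different pair of groups, so it can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: the function name `bootstrap_diff_ci` and its parameters are hypothetical, and the toy DataFrame is randomly generated only so the block runs without `coffee_dataset.csv`.

```python
import numpy as np
import pandas as pd


def bootstrap_diff_ci(df, expr_a, expr_b, value_col="height",
                      n_boot=10000, ci=99, seed=42):
    """Percentile bootstrap interval for mean(group A) - mean(group B).

    expr_a and expr_b are pandas expressions that select the two groups,
    e.g. 'drinks_coffee' and 'drinks_coffee == False'.
    """
    rng = np.random.default_rng(seed)
    values = df[value_col].to_numpy()
    # Evaluate group membership once, then resample row positions
    in_a = df.eval(expr_a).to_numpy(dtype=bool)
    in_b = df.eval(expr_b).to_numpy(dtype=bool)
    n = len(df)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Draw n row positions with replacement (one bootstrap sample)
        idx = rng.integers(0, n, size=n)
        vals = values[idx]
        diffs[i] = vals[in_a[idx]].mean() - vals[in_b[idx]].mean()
    tail = (100 - ci) / 2
    return np.percentile(diffs, tail), np.percentile(diffs, 100 - tail)


# Toy data for illustration only (not the coffee dataset)
toy_rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "drinks_coffee": toy_rng.random(200) < 0.6,
    "height": toy_rng.normal(67, 3, size=200),
})
low, high = bootstrap_diff_ci(toy, "drinks_coffee", "drinks_coffee == False",
                              n_boot=1000)
print(low, high)
```

With the real data, a call such as `bootstrap_diff_ci(sample_data, 'drinks_coffee', 'drinks_coffee == False')` should behave like the loop above, although the exact bounds will differ slightly because the random draws are generated differently.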

`2.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between those 21 and older and those younger than 21.  Build a 99% confidence interval from the resulting sampling distribution.  Use your interval to finish answering the first quiz question below.

In [9]:
diff = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace=True)
    old_h = bootsample.query('age == ">=21"')['height'].mean()
    young_h = bootsample.query('age == "<21"')['height'].mean()
    diff.append(old_h - young_h)

np.percentile(diff, 0.5), np.percentile(diff, 99.5)


(3.3652749452554938, 5.0932450670661495)

`3.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** between the average height of coffee drinkers and the average height of non-coffee drinkers for individuals **under** 21 years old.  Using your sampling distribution, build a 95% confidence interval.  Use your interval to start answering question 2 below.

In [11]:
diff = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace=True)
    drink_h = bootsample.query('age == "<21" and drinks_coffee')['height'].mean()
    nodrink_h = bootsample.query('age == "<21" and drinks_coffee == False')['height'].mean()
    # Non-coffee minus coffee: the reverse of the order used in step 1
    diff.append(nodrink_h - drink_h)

np.percentile(diff, 2.5), np.percentile(diff, 97.5)

(1.0809572510875, 2.6258697660461725)

`4.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** between the average height of coffee drinkers and the average height of non-coffee drinkers for individuals **21 and older**.  Using your sampling distribution, build a 95% confidence interval.  Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [12]:
diff = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace=True)
    drink_h = bootsample.query('age == ">=21" and drinks_coffee')['height'].mean()
    nodrink_h = bootsample.query('age == ">=21" and drinks_coffee == False')['height'].mean()
    # Non-coffee minus coffee, matching step 3
    diff.append(nodrink_h - drink_h)

np.percentile(diff, 2.5), np.percentile(diff, 97.5)

(1.828156731814163, 4.408029942439456)

Within both the under-21 and the 21-and-over groups, non-coffee drinkers were taller on average. But when the groups were combined, coffee drinkers were taller on average. This is again Simpson's paradox: the coffee drinkers in the dataset are disproportionately adults, and adults are taller on average (as the interval in step 2 shows). These individuals made it seem like coffee drinkers were taller on average, which is a misleading result.
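
To see how the reversal can arise, here is a minimal sketch on a tiny, made-up table. The heights and group sizes are invented purely to reproduce the pattern and are not taken from `coffee_dataset.csv`; only the column names match the notebook.

```python
import pandas as pd

# Invented values for illustration only: most adults drink coffee,
# most under-21s do not, and adults are taller.
toy = pd.DataFrame({
    "age": ["<21"] * 5 + [">=21"] * 5,
    "drinks_coffee": [True, False, False, False, False,
                      True, True, True, True, False],
    "height": [62.0, 64.0, 64.0, 64.0, 64.0,
               68.0, 68.0, 68.0, 68.0, 70.0],
})

coffee = toy[toy["drinks_coffee"]]
no_coffee = toy[~toy["drinks_coffee"]]

# Combined: coffee minus non-coffee is positive (coffee drinkers look taller)
overall_diff = coffee["height"].mean() - no_coffee["height"].mean()

# Within each age group: coffee minus non-coffee is negative
within_diff = (coffee.groupby("age")["height"].mean()
               - no_coffee.groupby("age")["height"].mean())

# Group sizes show the imbalance that drives the reversal
counts = pd.crosstab(toy["age"], toy["drinks_coffee"])

print(overall_diff)
print(within_diff)
print(counts)
```

Running `pd.crosstab(sample_data['age'], sample_data['drinks_coffee'])` on the real sample is one way to check whether the same kind of imbalance is present there.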
