
## Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34mg of caffeine, and a 12-oz serving of Pepsi has 37.6mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12 oz of each soft drink. The data is found in the Soda.xlsx dataset.

Because individuals can be sensitive to caffeine – and because the manufacturers are interested in product consistency – we wish to estimate the mean caffeine content in 12oz of Coke served in franchise restaurants using a 95% confidence interval. 

You can find the Coke data here: 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv'

The first variable is the sample ID and the second variable is the caffeine content in the 12-oz sample measured in mg.

Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.




###1) Load the dataset and print the first few rows

In [42]:
#Import statements

import pandas as pd
import numpy as np

drinks = pd.read_csv('https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv')

In [43]:
#Print the top 5 rows

print(drinks.shape)
drinks.head()

(50, 2)


| | Drink | Caffeine |
|---|---|---|
| 0 | 1 | 47.32 |
| 1 | 2 | 43.78 |
| 2 | 3 | 48.12 |
| 3 | 4 | 43.25 |
| 4 | 5 | 46.42 |


###2) Calculate the mean, SD, and SE of the caffeine content, and n, for the sample.  Summarize your results in a sentence or two.

In [44]:
#Calculations here

caff_mean = drinks['Caffeine'].mean()
caff_std = drinks['Caffeine'].std()
caff_n = drinks['Caffeine'].count()
caff_se = caff_std / (caff_n**(1/2))
print(caff_mean)
print(caff_std)
print(caff_n)
print(caff_se)

37.9402
5.243756828216712
50
0.7415792024250598


Our sample of 50 drinks has a mean caffeine content of 37.9mg, with a standard deviation of 5.24mg. The estimated standard error is 0.742mg.
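
As a check on the arithmetic, the standard error can be worked out by hand from the printed values. Note that pandas `.std()` uses the sample (n - 1) denominator by default, so `caff_std` is the sample standard deviation s:

$$
\mathrm{SE} = \frac{s}{\sqrt{n}} = \frac{5.2438}{\sqrt{50}} \approx \frac{5.2438}{7.0711} \approx 0.7416\ \text{mg}
$$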

###3) Find t* for a 95% confidence interval.  

Use the starter code below and fill in the degrees of freedom.

In [45]:
from scipy.stats import t

#0.975 is used because a two-sided 95% interval leaves 2.5% in each tail,
#so t* is the cutoff below which 97.5% of the t-distribution lies.
#(The t.interval function used later handles this step automatically.)

#Here n = 50, so df = n - 1 = 49.
t_star = t.ppf(0.975,df=49)
print('t_star =', t_star)

t_star = 2.009575234489209
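
The 0.975 passed to `t.ppf` comes from splitting the leftover 5% evenly between the two tails. The sketch below checks this, assuming only `scipy.stats.t` and the df = 49 used above; the names `confidence`, `upper_quantile`, and `middle_area` are illustrative and not part of the assignment.

```python
from scipy.stats import t

confidence = 0.95
df = 49  # n - 1 for the 50 sampled drinks

# A two-sided interval leaves (1 - confidence) / 2 in each tail.
upper_quantile = 1 - (1 - confidence) / 2
t_star = t.ppf(upper_quantile, df=df)

# The t-distribution is symmetric, so the lower cutoff is -t_star,
# and the area between the two cutoffs equals the confidence level.
middle_area = t.cdf(t_star, df=df) - t.cdf(-t_star, df=df)

print("upper quantile =", upper_quantile)
print("t_star =", t_star)
print("area between -t* and +t* =", middle_area)
```

Changing `confidence` to 0.90 or 0.99 gives the cutoffs for the other intervals in this assignment.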


###4) Calculate the margin of error for a 95% confidence interval for the mean caffeine content in a 12-oz Coke.



In [46]:
#Calculations here

# Margin of error equals t* times SE

margin_error = t_star*caff_se
margin_error

1.49025919960566

For a 95% confidence interval, the margin of error is 1.49mg.

###5) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

In [47]:
#Calculations here

[caff_mean-margin_error, caff_mean+margin_error]

[36.44994080039434, 39.43045919960566]

###6) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke using the t-interval function in Python.

In [48]:
#Calculations here

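# The confidence level is passed positionally because newer SciPy releases
# renamed the `alpha` keyword to `confidence`.
t.interval(0.95, df=caff_n-1, loc=caff_mean, scale=caff_se)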

(36.44994080039434, 39.43045919960566)

###7) Compare the two confidence intervals you calculated.  Do they match?  Should they?

They match exactly, as they should: both are computed as the sample mean ± t* × SE, using the same sample statistics, the same degrees of freedom (49), and the same confidence level.
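
The agreement can also be checked in code. This is a minimal sketch that re-enters the summary statistics printed earlier, so it runs without downloading the data. `t_interval_from_summary` is a hypothetical helper written for this illustration, not a SciPy function.

```python
import numpy as np
from scipy.stats import t

# Summary statistics printed earlier in this notebook.
caff_mean = 37.9402
caff_std = 5.243756828216712
caff_n = 50


def t_interval_from_summary(mean, sd, n, confidence=0.95):
    """Return (lower, upper) for mean +/- t* x SE (hypothetical helper)."""
    se = sd / np.sqrt(n)
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_star * se
    return mean - margin, mean + margin


# Interval from the formula versus SciPy's built-in interval.
manual = t_interval_from_summary(caff_mean, caff_std, caff_n)
builtin = t.interval(0.95, df=caff_n - 1, loc=caff_mean,
                     scale=caff_std / np.sqrt(caff_n))

print(manual)
print(builtin)
print("match:", np.allclose(manual, builtin))
```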

###8) Interpret the meaning of the 95% confidence interval for the mean caffeine content in a 12-oz fountain Coke in a sentence or two.

We are 95% confident that the mean caffeine content of the entire population of 12-oz fountain Cokes is between 36.4mg and 39.4mg.
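
One way to see what "95% confident" means is to repeat the sampling many times and count how often the interval captures the true mean. The simulation below is a sketch under stated assumptions: the population mean (38.0) and SD (5.0) are made-up values chosen to be roughly on the scale of this sample, and the population is assumed normal. None of these come from the study.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# Assumed population, for illustration only (not taken from the study).
true_mean, true_sd = 38.0, 5.0
n, reps = 50, 2000

# Draw `reps` independent samples of size n and build a 95% CI from each.
samples = rng.normal(true_mean, true_sd, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_star = t.ppf(0.975, df=n - 1)

lower = means - t_star * ses
upper = means + t_star * ses

# Fraction of the simulated intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land close to 0.95. The confidence level describes how often the procedure succeeds across repeated samples, not the chance that any single computed interval is right.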

###9) Using the t-interval Python function, calculate a 90% confidence interval for the mean caffeine content in a 12-oz Coke.  Is this estimate more accurate or more precise (pick one) than the 95% confidence interval?



In [49]:
#Calculations here

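# The confidence level is passed positionally because newer SciPy releases
# renamed the `alpha` keyword to `confidence`.
t.interval(0.9, df=caff_n-1, loc=caff_mean, scale=caff_se)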

(36.696904726749196, 39.1834952732508)

This interval, being a bit narrower, is more precise than the 95% interval. However, at only 90% confidence it is less accurate: intervals built this way capture the population mean in about 90% of samples rather than 95%.

###10) Using the t-interval Python function, calculate a 99% confidence interval for the mean caffeine content in a 12-oz Coke.  Is this estimate more accurate or more precise (pick one) than the 95% confidence interval?



In [50]:
#Calculations here

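# The confidence level is passed positionally because newer SciPy releases
# renamed the `alpha` keyword to `confidence`.
t.interval(0.99, df=caff_n-1, loc=caff_mean, scale=caff_se)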

(35.95280335285685, 39.92759664714315)

This is the widest interval so far, and thus the least precise. However, we are 99% confident that it contains the true mean, so it is clearly the most accurate.
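
The trade-off between precision and confidence is easier to see with the three intervals side by side. This sketch reuses the mean and standard error printed earlier; the `widths` dictionary is an illustrative name.

```python
from scipy.stats import t

# Sample mean, standard error, and degrees of freedom from earlier cells.
caff_mean, caff_se, df = 37.9402, 0.7415792024250598, 49

widths = {}
for level in (0.90, 0.95, 0.99):
    low, high = t.interval(level, df=df, loc=caff_mean, scale=caff_se)
    widths[level] = high - low
    print(f"{level:.0%}: ({low:.2f}, {high:.2f})  width = {high - low:.2f}")
```

Every interval is centered on the same sample mean. Only the width changes, growing as the confidence level rises.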

## Stretch goals:

###1) The correspondence between confidence intervals and hypothesis tests.

Read [this article](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels#:~:text=If%20a%20hypothesis%20test%20produces,corresponding%20confidence%20level%20is%2095%25.&text=If%20the%20confidence%20interval%20does,the%20results%20are%20statistically%20significant.) about the correspondence between confidence intervals and hypothesis tests.  Feel free to read the whole article, but the relevant part can be found under the heading "Why P Values and Confidence Intervals Always Agree About Statistical Significance".

Imagine you work for quality control at Coke and are tasked with making sure that the caffeine content in the fountain beverages served in restaurants is the same as in a 12-oz can of Coke (34mg).  If you believe that the mean caffeine content in fountain Coke is not 34mg, you must re-train the franchise managers to make sure the Coke served has the correct caffeine level.

Based on the confidence interval you calculated in the assignment, do you believe that the mean caffeine content is statistically significantly different from 34 mg in a 12-oz serving?


Yes! Calculating a 95% confidence interval is equivalent to running a two-sided one-sample t-test at the 0.05 significance level: any hypothesized mean outside the interval produces a p-value below 0.05. Since 34mg is well below the lower bound of our interval (36.4mg), we reject the null hypothesis that the mean caffeine content of fountain Coke equals that of a 12-oz can, and the managers should be re-trained.
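
The corresponding test can be run directly from the summary statistics. This is a minimal sketch of a two-sided one-sample t-test, with the mean and standard error re-entered from the earlier cells; `null_mean`, `t_stat`, and `p_value` are illustrative names.

```python
from scipy.stats import t

# Sample mean, standard error, and sample size from earlier cells.
caff_mean, caff_se, caff_n = 37.9402, 0.7415792024250598, 50
null_mean = 34.0  # caffeine content of a 12-oz can of Coke, in mg

# Test statistic: how many standard errors the sample mean is from 34 mg.
t_stat = (caff_mean - null_mean) / caff_se

# Two-sided p-value: probability in both tails beyond |t_stat|.
p_value = 2 * t.sf(abs(t_stat), df=caff_n - 1)

print("t =", t_stat)
print("p =", p_value)
print("reject at 0.05:", p_value < 0.05)
```

With the raw data loaded, `scipy.stats.ttest_1samp(drinks['Caffeine'], 34)` should give the same statistic and p-value.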

###2) If we increased the sample size from 50 to 100 but the sample mean and SD remained the same, describe **two** ways the margin of error would change.  Would the margin of error become smaller or larger?

Increasing the sample size first reduces the standard error, since n (the sample size) is in the denominator of the standard error formula (SE = SD/√n). Second, a larger sample means more degrees of freedom, which means a lower t* (the t-distribution moves closer to the normal distribution and so has thinner tails).

Lower standard error and lower t* mean lower margin of error.

In [51]:
#Calculations here.

new_SE = caff_std / (100**(1/2))
new_t_star = t.ppf(0.975,df=99)

new_margin_error = new_SE * new_t_star
new_margin_error

1.0404751188137005

A sample size of 50 gave us a margin of error of 1.49mg.

With our new sample of 100 drinks, our margin of error is reduced to 1.04mg.
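
To see how the margin of error keeps shrinking, the same calculation can be repeated for several sample sizes. This sketch assumes the sample SD stays fixed at the value printed earlier, as the question stipulates. `margin_of_error` is a hypothetical helper, and the sizes 200 and 400 are added purely for illustration.

```python
import numpy as np
from scipy.stats import t

caff_std = 5.243756828216712  # sample SD, assumed unchanged as n grows


def margin_of_error(sd, n, confidence=0.95):
    """Margin of error t* x SE for a t-based interval (hypothetical helper)."""
    se = sd / np.sqrt(n)
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t_star * se


margins = {n: margin_of_error(caff_std, n) for n in (50, 100, 200, 400)}
for n, m in margins.items():
    print(f"n = {n:3d}  margin of error = {m:.3f} mg")
```

Because the standard error scales with 1/√n, quadrupling the sample size cuts the margin of error to a little under half, with the small extra reduction coming from the lower t*.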