## Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34 mg of caffeine, and a 12-oz serving of Pepsi has 37.6 mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12 oz of each soft drink. The data are in the Soda.xlsx dataset.

Because individuals can be sensitive to caffeine, and because the manufacturers are interested in product consistency, we wish to estimate the mean caffeine content in 12 oz of Coke served in franchise restaurants using a 95% confidence interval.

You can find the Coke data here: 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv'

The first variable is the sample ID and the second variable is the caffeine content in the 12-oz sample measured in mg.

Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.




###1) Load the dataset and print the first few rows

In [1]:
#Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
# Assign Dataframe
df = pd.read_csv('https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv')

In [3]:
#Print the top 5 rows
df.head(5)

|   | Drink | Caffeine |
|---|-------|----------|
| 0 | 1 | 47.32 |
| 1 | 2 | 43.78 |
| 2 | 3 | 48.12 |
| 3 | 4 | 43.25 |
| 4 | 5 | 46.42 |


###2) Calculate the mean, SD, and SE of the caffeine content, and n, for the sample.  Summarize your results in a sentence or two.

In [4]:
#Calculations here
n = len(df['Caffeine'])
cafmean = df['Caffeine'].mean()
df_std = df['Caffeine'].std()
df_se = (df_std) / (n**(1/2))

print('The mean is: ', cafmean)
print('The STD is: ', df_std)
print('The Standard Error is: ', df_se)
print('The N is: ', n)

The mean is:  37.9402
The STD is:  5.243756828216712
The Standard Error is:  0.7415792024250598
The N is:  50


Sentence or two summary.
---
The sample of n = 50 servings has a mean caffeine content of 37.94 mg, with a standard deviation of 5.24 mg and a standard error of 0.74 mg.
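
As a quick cross-check of the standard error formula SE = SD / √n, the sketch below compares the manual calculation with `scipy.stats.sem`. It uses only the five rows printed by `df.head(5)` above as an illustrative sample, so the numbers will not match the full-sample results.

```python
import numpy as np
from scipy import stats

# First five caffeine values shown by df.head(5); illustration only,
# not the full 50-row sample.
sample = np.array([47.32, 43.78, 48.12, 43.25, 46.42])

n = sample.size
sd = sample.std(ddof=1)        # sample SD (n - 1 denominator), same default as pandas .std()
se_manual = sd / np.sqrt(n)    # SE = SD / sqrt(n)
se_scipy = stats.sem(sample)   # scipy's standard error of the mean (ddof=1 by default)

print(se_manual, se_scipy)
```

Note that NumPy's `std` defaults to `ddof=0`, while pandas' `.std()` defaults to `ddof=1`, so the `ddof=1` argument is what keeps the two approaches consistent.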

###3) Find t* for a 95% confidence interval.  

Use the starter code below and fill in the degrees of freedom.

In [5]:
from scipy.stats import t

#The 0.975 comes from wanting the *middle* 95% of the t-distribution:
#the remaining 5% is split as 2.5% in each tail, so t* is the 97.5th percentile.

#Degrees of freedom are n - 1 = 49 for the n = 50 Coke samples.
t_star = t.ppf(0.975, df=n-1)
print('t_star =', t_star)

t_star = 2.009575234489209
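
To see where the 0.975 comes from, here is a minimal sketch that derives the percentile from the confidence level and then checks that the area between -t* and +t* is 95%. The variable names `confidence`, `dof`, and `tail` are illustrative and not part of the assignment's starter code.

```python
from scipy.stats import t

confidence = 0.95
dof = 49                            # n - 1 for the n = 50 Coke samples
tail = (1 - confidence) / 2         # 0.025 left in each tail
t_star = t.ppf(1 - tail, df=dof)    # equivalent to t.ppf(0.975, df=49)

# Area under the t-distribution between -t* and +t*
middle_area = t.cdf(t_star, df=dof) - t.cdf(-t_star, df=dof)

print('t_star =', t_star)
print('middle area =', middle_area)
```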


###4) Calculate the margin of error for a 95% confidence interval for the mean caffeine content in a 12-oz Coke.



In [7]:
#Calculations here
cafmoe = t_star * df_se
print(cafmoe)

1.49025919960566


Summary here.

---
At the 95% confidence level, with a standard error of about 0.74 mg and a t* of about 2.01, the margin of error is about 1.49 mg of caffeine. We are therefore 95% confident that the population mean is between about 36.45 and 39.43 mg of caffeine.
---
*   (Mean: 37.9402)
*   (Upper Limit: 37.9402 + 1.4903 = 39.4305)
*   (Lower Limit: 37.9402 - 1.4903 = 36.4499)
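
The arithmetic above (mean ± t* × SE) can be wrapped in a small helper that works directly from summary statistics. This is a sketch: the function name `t_interval_from_summary` is hypothetical, and the inputs are the mean, SD, and n reported earlier in this notebook.

```python
from math import sqrt
from scipy.stats import t

def t_interval_from_summary(mean, sd, n, confidence=0.95):
    """Return (lower, upper, margin) for a one-sample t interval."""
    se = sd / sqrt(n)                                    # standard error
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)   # critical value
    margin = t_star * se                                 # margin of error
    return mean - margin, mean + margin, margin

# Summary statistics reported above for the Coke sample
lower, upper, margin = t_interval_from_summary(37.9402, 5.243756828216712, 50)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

With these inputs the helper reproduces the interval of roughly 36.45 to 39.43 mg and the margin of error of roughly 1.49 mg.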

###5) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

In [8]:
# Calculations here
# Lower Limit CI
CIL = cafmean - (t_star*df_se)

# Upper Limit CI
CIU = cafmean + (t_star*df_se)

# Print LL and UL CI
print('The lower limit is: ', CIL)
print('The upper limit is: ', CIU)

The lower limit is:  36.44994080039434
The upper limit is:  39.43045919960566


###6) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke using the t-interval function in Python.

In [9]:
#Calculations here
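# CI with stats.t.interval (confidence level passed positionally, because newer
# SciPy releases renamed the alpha keyword to confidence)
CI = stats.t.interval(0.95, df = n-1, loc=cafmean, scale=df_se)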

# Print CI
print('95% Confidence Interval calculated with stats.t.interval \n', CI[0].round(3), CI[1].round(3))

95% Confidence Interval calculated with stats.t.interval 
 36.45 39.43


###7) Compare the two confidence intervals you calculated.  Do they match?  Should they?

Answer here.

---
The confidence intervals from #5 and #6 match once rounding is taken into account. If both are calculated correctly they should always match, because they apply the same formula (mean ± t* × SE with n - 1 degrees of freedom): one is computed by hand, and the other uses a library function that wraps the same calculation.


###8) Interpret the meaning of the 95% confidence interval for the mean caffeine content in a 12-oz fountain Coke in a sentence or two.

Interpretation here.

---

We are 95% confident that the population mean caffeine content of a 12-oz fountain Coke served in franchise restaurants is between 36.45 and 39.43 mg.
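
One way to make "95% confident" concrete is a small simulation: draw many samples from a population with a known mean, build a 95% t-interval from each, and count how often the interval contains that mean. The sketch below assumes a normal population with made-up parameters (mean 38 mg, SD 5 mg); these are hypothetical values chosen for illustration, not estimates from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population parameters and simulation settings
true_mean, true_sd, n, reps = 38.0, 5.0, 50, 2000
t_star = stats.t.ppf(0.975, df=n - 1)

# Each row is one simulated sample of n servings
samples = rng.normal(true_mean, true_sd, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Does each sample's 95% interval contain the true mean?
covered = (means - t_star * ses <= true_mean) & (true_mean <= means + t_star * ses)
coverage = covered.mean()

print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, which is what the confidence level describes: the long-run success rate of the procedure, not the probability attached to any single interval.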

###9) Using the t-interval Python function, calculate a 90% confidence interval for the mean caffeine content in a 12-oz Coke.  Is this estimate more accurate or more precise (pick one) than the 95% confidence interval?



In [10]:
# Calculations here
# CI with stats.t.interval (confidence level passed positionally)
CI = stats.t.interval(0.90, df = n-1, loc=cafmean, scale=df_se)

# Print CI
print('90% Confidence Interval calculated with stats.t.interval \n', CI[0].round(3), CI[1].round(3))

90% Confidence Interval calculated with stats.t.interval 
 36.697 39.183


Interpretation here.

---
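We are 90% confident that the population mean caffeine content of a 12-oz fountain Coke is between 36.697 and 39.183 mg. This interval is narrower than the 95% interval, so the estimate is more precise, at the cost of lower confidence that it captures the true mean.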


###10) Using the t-interval Python function, calculate a 99% confidence interval for the mean caffeine content in a 12-oz Coke.  Is this estimate more accurate or more precise (pick one) than the 95% confidence interval?



In [11]:
# Calculations here
# CI with stats.t.interval (confidence level passed positionally)
CI = stats.t.interval(0.99, df = n-1, loc=cafmean, scale=df_se)

# Print CI
print('99% Confidence Interval calculated with stats.t.interval \n', CI[0].round(3), CI[1].round(3))

99% Confidence Interval calculated with stats.t.interval 
 35.953 39.928


Interpretation here.

---
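We are 99% confident that the population mean caffeine content of a 12-oz fountain Coke is between 35.953 and 39.928 mg. This interval is wider than the 95% interval, so the estimate is more accurate (more likely to capture the true mean) but less precise.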
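
The trade-off between confidence level and interval width can be seen side by side. The sketch below reuses the sample mean and standard error reported earlier (hard-coded here so the block runs on its own) and passes the confidence level to `stats.t.interval` positionally.

```python
from scipy import stats

# Summary statistics reported earlier in this notebook
cafmean, df_se, n = 37.9402, 0.7415792024250598, 50

widths = {}
for level in (0.90, 0.95, 0.99):
    lo, hi = stats.t.interval(level, df=n - 1, loc=cafmean, scale=df_se)
    widths[level] = hi - lo
    print(f"{level:.0%} CI: ({lo:.3f}, {hi:.3f}), width {hi - lo:.3f}")
```

The width grows with the confidence level: a higher chance of capturing the true mean is paid for with a less precise estimate.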


## Stretch goals:

###1) The correspondence between confidence intervals and hypothesis tests.

Read [this](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels#:~:text=If%20a%20hypothesis%20test%20produces,corresponding%20confidence%20level%20is%2095%25.&text=If%20the%20confidence%20interval%20does,the%20results%20are%20statistically%20significant.) article about the correspondence between confidence intervals and hypothesis tests.  Feel free to read the whole article, but the relevant part can be found under the heading "Why P Values and Confidence Intervals Always Agree About Statistical Significance".

Imagine you work for quality control at Coke and are tasked with making sure that the caffeine content in the fountain beverages served in restaurants is the same as in a 12-oz can of Coke (34 mg).  If you believe that the mean caffeine content in fountain Coke is not 34 mg, you must re-train the franchise managers to make sure the Coke served has the correct caffeine level.

Based on the confidence interval you calculated in the assignment, do you believe that the mean caffeine content is statistically significantly different from 34 mg in a 12-oz serving?


Answer here.

---
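
One possible way to check the correspondence numerically is sketched below: it tests whether 34 mg falls inside the 95% confidence interval and, separately, runs a two-sided one-sample t-test from the summary statistics. The values for `cafmean`, `df_se`, and `n` are the ones reported earlier, hard-coded so the block runs on its own; `mu0` and the other names are illustrative.

```python
from scipy import stats

# Summary statistics reported earlier in this notebook
cafmean, df_se, n = 37.9402, 0.7415792024250598, 50
mu0 = 34.0   # hypothesized mean: caffeine in a 12-oz can of Coke (mg)

# Two-sided one-sample t-test computed from summary statistics
t_stat = (cafmean - mu0) / df_se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval and whether it excludes the hypothesized mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=cafmean, scale=df_se)
outside_ci = not (ci_low <= mu0 <= ci_high)

print('t statistic:', t_stat)
print('p-value:', p_value)
print('34 mg outside the 95% CI:', outside_ci)
```

The two checks agree by construction: the hypothesized mean lies outside the 95% interval exactly when the two-sided p-value is below 0.05.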



###2) If we increased the sample size from 50 to 100 but the sample mean and SD remained the same, describe **two** ways the margin of error would change.  Would the margin of error become smaller or larger?

Answer to what would change here.

---
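The margin of error would become smaller, for two reasons: the standard error (SD / √n) shrinks because n is larger, and t* shrinks slightly because the degrees of freedom increase from 49 to 99.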

In [17]:
#Calculations here.
t_star = t.ppf(0.975,df=49)
t_star * df_se

1.49025919960566

In [18]:
#Recompute the standard error for n = 100 (same SD), then the margin of error
se_100 = df_std / np.sqrt(100)
t_star_100 = t.ppf(0.975, df=99)
print(round(t_star_100 * se_100, 4))

1.0405

Compare MEs here.

---
The margin of error for a sample of 50 is about 1.4903 mg; for a sample of 100 with the same SD it is about 1.0405 mg. The larger sample gives the smaller margin of error, mostly because the standard error shrinks from about 0.74 to about 0.52 mg, and slightly because t* drops from about 2.010 to about 1.984.
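
To see how the margin of error responds when both t* and the standard error are recomputed for each sample size, here is a short sketch. It assumes the sample SD stays fixed at the value reported earlier; the sample sizes beyond 50 and 100 are hypothetical.

```python
from math import sqrt
from scipy.stats import t

sd = 5.243756828216712   # sample SD reported earlier, assumed unchanged

margins = {}
for n in (50, 100, 200, 400):
    se = sd / sqrt(n)                    # standard error shrinks as n grows
    t_star = t.ppf(0.975, df=n - 1)      # critical value shrinks slightly
    margins[n] = t_star * se
    print(f"n = {n:3d}: SE = {se:.4f}, t* = {t_star:.4f}, ME = {margins[n]:.4f}")
```

Doubling n shrinks the standard error by a factor of √2, which accounts for most of the reduction; the change in t* is a much smaller effect.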


