<a href="https://colab.research.google.com/github/Frank-Howd/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/LS_DS_123_Confidence_Intervals_autograde.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set-up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Lambda School Data Science - Unit 1 Sprint 2 Module 3

## Module Project: Sampling and Confidence Intervals

### Objectives

* Objective 01 - explain the concepts of statistical estimate, precision, and standard error as they apply to inferential statistics
* Objective 02 - explain the implications of the central limit theorem in inferential statistics
* Objective 03 - explain the purpose of and identify applications for confidence intervals
* Objective 04 - demonstrate how to build a confidence interval around a sample estimate
* Objective 05 - visualize a confidence interval in order to communicate the precision of sample estimates

## Introduction

### Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34mg of caffeine, and a 12-oz serving of Pepsi has 37.6mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. 

In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12oz of each soft drink.

Because individuals can be sensitive to caffeine – and because the manufacturers are interested in product consistency – **we wish to estimate the mean caffeine content in 12oz of Coke served in franchise restaurants using a 95% confidence interval.** 


### Data set

The data set for Coke is available at the link provided below. The first variable is the sample ID and the second variable is the caffeine content (in mg) in the 12oz sample.

*Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.*

**Task 1** - Load the data

Load the dataset using the provided URL

* Read in your data as a pandas DataFrame with the variable name `coke_df`
* Use the `.head()` method to take a look at the DataFrame

In [3]:
# Task 1

# Imports
import pandas as pd
import numpy as np

# URL for the data
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Soda/Soda.csv'

# YOUR CODE HERE
coke_df = pd.read_csv(data_url)

# Look at your DataFrame
print(coke_df.shape)
coke_df.head()

(50, 2)


| | Drink | Caffeine |
|---|---|---|
| 0 | 1 | 47.32 |
| 1 | 2 | 43.78 |
| 2 | 3 | 48.12 |
| 3 | 4 | 43.25 |
| 4 | 5 | 46.42 |


**Task 1 - Test**

In [4]:
# Task 1 Test

assert isinstance(coke_df, pd.DataFrame), 'Have you created a DataFrame named `coke_df`?'


**Task 2** - Descriptive statistics

Calculate the following statistical quantities for the `Caffeine` content column. Name your variables as indicated (they need to be an exact match to pass the tests)

* mean - `mean_caffeine`
* standard deviation - `std_caffeine`
* standard error - `se_caffeine`
* number of samples - `n_caffeine`

Summarize your results in a sentence or two.
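
As a reminder of the standard definition (the symbols $s$ for the sample standard deviation and $n$ for the number of samples are notation assumed here, not names from the notebook), the standard error of the mean is the sample standard deviation divided by the square root of the sample size:

$$
SE = \frac{s}{\sqrt{n}}
$$

Note that the divisor is $\sqrt{n}$, not the degrees of freedom $n - 1$; the $n - 1$ only enters through the sample standard deviation itself.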

In [19]:
# Task 2

# YOUR CODE HERE

n_caffeine = coke_df['Caffeine'].count()
print('N samples: ', n_caffeine)

mean_caffeine = coke_df['Caffeine'].mean()
print('Mean caffeine: {:.3f}'.format(mean_caffeine))

# pandas .std() defaults to the sample standard deviation (ddof=1)
std_caffeine = coke_df['Caffeine'].std()
print('Std Dev, caffeine: {:.3f}'.format(std_caffeine))

# Standard error of the mean: s / sqrt(n); n_caffeine must be defined first
se_caffeine = std_caffeine / np.sqrt(n_caffeine)
print('Std Error, caffeine: {:.3f}'.format(se_caffeine))


N samples:  50
Mean caffeine: 37.940
Std Dev, caffeine: 5.244
Std Error, caffeine: 0.742


**Task 2 - Test**

In [20]:
# Task 2 Test

assert n_caffeine == 50, 'Did you correctly calculate the number of samples?'


**Task 2** - ANSWER

Using the statistics you calculated above, write out your answer in words. Use the following format:

*Example: The mean caffeine content is XXmg per 12oz serving with a standard error of XXmg. The sample size is XX.*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER

The mean caffeine content is 37.94mg per 12oz serving with a standard error of 0.742mg. The sample size is 50.

**Task 3** - Calculate t*

For this task you will calculate t* for a 95% confidence interval.

* set the variable `deg_free` equal to the degrees of freedom for the `Caffeine` variable
* set the variable `t_star` equal to t* using `t.ppf(q, df)` with `q=0.975` and `df = deg_free`

Note: The 0.975 value arises because a two-sided 95% interval leaves 2.5% of the t-distribution in each tail, so the upper cut-off sits at the 97.5th percentile. We're going to learn how to calculate the 95% confidence interval an easier way in a later task.
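
The following is a minimal sketch of where the 0.975 comes from. The names `confidence`, `tail`, `q`, and `df_example` are illustrative and are not part of the assignment; `df_example = 49` assumes a sample of 50.

```python
from scipy.stats import t

confidence = 0.95                  # desired two-sided confidence level
tail = (1 - confidence) / 2        # probability left in each tail
q = 1 - tail                       # upper cut-off percentile passed to t.ppf

df_example = 49                    # hypothetical degrees of freedom (n - 1 for n = 50)
t_upper = t.ppf(q, df_example)     # upper critical value, i.e. t*
t_lower = t.ppf(tail, df_example)  # lower critical value

# The t-distribution is symmetric, so the two cut-offs mirror each other
print(q, t_upper, t_lower)
```

Because of that symmetry, only the upper value is needed: the interval is the estimate plus or minus t* times the standard error.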

In [46]:
# Task 3

from scipy.stats import t

# YOUR CODE HERE
deg_free = n_caffeine-1
print('DOF: ', deg_free)

t_star = t.ppf(0.975, deg_free)

# View your answer
print('t_star = {:.6f}'.format(t_star))

DOF:  49
t_star = 2.009575


**Task 3 - Test**

In [22]:
# Task 3 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 4** - Margin of error

In this task you'll calculate the margin of error for a 95% confidence interval (CI) for the mean caffeine content in a 12-oz Coke.

* Assign the margin of error for a 95% CI to the variable `margin_err`

Hint: You already have the value for t* for a 95% CI and the standard error
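
Written out as formulas (a sketch of the standard t-based expressions; $\bar{x}$ for the sample mean and $ME$ for the margin of error are notation assumed here), the hint amounts to:

$$
ME = t^* \cdot SE, \qquad CI = \bar{x} \pm ME
$$

The margin of error is half the width of the confidence interval, which is why Task 5 only needs the mean and this one number.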

In [23]:
# Task 4

# YOUR CODE HERE
margin_err = t_star*se_caffeine

# View your answer
print('Margin of error = {:.6f}'.format(margin_err))

Margin of error = 1.490259


**Task 4 - Test**

In [24]:
# Task 4 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 4** - ANSWER

Using the margin of error you calculated above, write out your answer in words. Use the following format:

*Example: The margin of error is XXmg of caffeine per 12oz serving*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER

The margin of error is +/- 1.49mg of caffeine per 12oz serving.

**Task 5** - Calculate a confidence interval

For this task, you are going to calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

* Calculate the lower confidence limit and assign it to `lower_cl`
* Calculate the upper confidence limit and assign it to `upper_cl`

In [27]:
# Task 5

# YOUR CODE HERE
upper_cl = mean_caffeine+margin_err
lower_cl = mean_caffeine-margin_err

# View your answers
print ('Lower confidence limit = {:.6f}'.format(lower_cl))
print ('Upper confidence limit = {:.6f}'.format(upper_cl))

Lower confidence limit = 36.449941
Upper confidence limit = 39.430459


**Task 5 - Test**

In [28]:
# Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 5** - ANSWER

Write out the confidence interval you just calculated. Use the following format:

*Example: The true mean of the caffeine content is between [lower CL, upper CL]*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER

The true mean of the caffeine content is between 36.45mg and 39.43mg, at the 95% confidence level.

**Task 6** - 95% confidence interval using t-interval

As promised in Task 3, we're going to calculate the confidence interval the easy way. We'll use the `t.interval()` function to calculate the 95% confidence interval.

* Assign the confidence interval to `t_int_95`
* `alpha` should be set equal to the confidence level (SciPy 1.9 renamed this argument to `confidence`; passing the value as the first positional argument works with either name)
* `df` is the degrees of freedom
* `loc` is the sample mean
* `scale` is the standard error of the mean (the standard deviation of the sampling distribution)
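
To see how these arguments map onto the manual calculation from Tasks 3 to 5, here is a small sketch on made-up numbers. The `sample` array is hypothetical and is not the Coke data; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import t

# Hypothetical measurements, made up for illustration only
sample = np.array([36.1, 38.4, 35.2, 40.3, 37.7, 39.0, 34.8, 38.9])

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
df = n - 1

# Manual interval: mean +/- t* x SE
t_star_example = t.ppf(0.975, df)
manual = (mean - t_star_example * se, mean + t_star_example * se)

# Same interval from t.interval; the confidence level is passed positionally
auto = t.interval(0.95, df, loc=mean, scale=se)

print(manual)
print(auto)
```

Both approaches evaluate the same formula, so the two printed intervals should agree up to floating-point error.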

In [38]:
# Task 6

# YOUR CODE HERE
# Confidence level is passed positionally: the `alpha` keyword was
# renamed `confidence` in SciPy 1.9
t_int_95 = t.interval(0.95, df=deg_free, loc=mean_caffeine,
                      scale=se_caffeine)

# View your answer
print('95% CL: {:.6f}, {:.6f}'.format(t_int_95[0], t_int_95[1]))


95% CL: 36.449941, 39.430459


**Task 6 - Test**

In [37]:
# Task 6 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 7** - Compare and interpret confidence intervals

(This part is not graded and is practice for writing out your results.)

Q1 - In this task, you are going to do your own test. Look at the two confidence intervals you calculated; are they equal? Should they be equal?

The two confidence intervals are equal, and they should be: `t.interval()` evaluates the same formula (sample mean +/- t* times the standard error) with the same 95% confidence level, degrees of freedom, mean, and standard error.

Q2 - Interpret the meaning of the 95% confidence interval for the mean caffeine content in the 12oz fountain Coke in a sentence or two.

ANSWER

We are 95% confident that the population mean caffeine content of a 12oz serving of Coke in franchise restaurants lies between 36.45mg and 39.43mg.

**Task 8** - 90% confidence interval using t-interval

Now that we've calculated a confidence interval at the 95% level, we'll repeat the calculation for a 90% confidence level.

* assign the confidence interval to `t_int_90`
* `alpha` is the confidence level
* `df`, `loc`, `scale` are the same as for the first calculation

In [40]:
# Task 8

# YOUR CODE HERE

# Confidence level is passed positionally: the `alpha` keyword was
# renamed `confidence` in SciPy 1.9
t_int_90 = t.interval(0.90, df=deg_free, loc=mean_caffeine,
                      scale=se_caffeine)

# View your answer
print('90% CL: {:.3f}, {:.3f}'.format(t_int_90[0], t_int_90[1]))


90% CL: 36.697, 39.183


**Task 8 - Test**

In [41]:
# Task 8 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 9** - 99% confidence interval using t-interval

And, we'll complete one more confidence interval calculation, this time at the 99% level.

* assign the confidence interval to `t_int_99`
* `alpha` is the confidence level
* `df`, `loc`, `scale` are the same as for the first two calculations

In [43]:
# Task 9

# YOUR CODE HERE
# Confidence level is passed positionally: the `alpha` keyword was
# renamed `confidence` in SciPy 1.9
t_int_99 = t.interval(0.99, df=deg_free, loc=mean_caffeine, scale=se_caffeine)

# View your answer
print('99% CL: {:.3f}, {:.3f}'.format(t_int_99[0], t_int_99[1]))

99% CL: 35.953, 39.928


**Task 9 - Test**

In [None]:
# Task 9 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 10** - Summarize confidence interval calculations

This part is not autograded and is practice for writing out your results!

Q1 -  Is the 90% confidence interval more accurate or more precise (pick one) than the 95% confidence interval?

ANSWER

The 90% confidence interval is more precise and less accurate than the 95% confidence interval: it is narrower, but less likely to contain the true mean.

Q2 -  Is the 99% confidence interval more accurate or more precise (pick one) than the 95% confidence interval?

ANSWER

The 99% confidence interval is more accurate and less precise than the 95% confidence interval: it is wider, so it is more likely to contain the true mean but pins it down less tightly.
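
The trade-off between confidence level and interval width can be checked numerically. The sketch below uses a made-up `sample` array (hypothetical values, not the Coke data) and illustrative variable names; it only demonstrates that, for fixed data, a higher confidence level gives a wider interval.

```python
import numpy as np
from scipy.stats import t

# Hypothetical measurements, made up for illustration only
sample = np.array([36.1, 38.4, 35.2, 40.3, 37.7, 39.0, 34.8, 38.9])

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# Interval width at each confidence level, same data throughout
widths = {}
for level in (0.90, 0.95, 0.99):
    low, high = t.interval(level, n - 1, loc=mean, scale=se)
    widths[level] = high - low
    print('{:.0%} CI width: {:.3f}'.format(level, widths[level]))
```

A narrower interval is a more precise statement about the mean, while a wider one is more likely to capture it.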

## Stretch goals:

### Stretch Task 1

**The correspondence between confidence intervals and hypothesis tests.**

Read [this](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels#:~:text=If%20a%20hypothesis%20test%20produces,corresponding%20confidence%20level%20is%2095%25.&text=If%20the%20confidence%20interval%20does,the%20results%20are%20statistically%20significant.) article about the correspondence between confidence intervals and hypothesis tests.  Feel free to read the whole article, but the relevant part can be found under the heading *Why P Values and Confidence Intervals Always Agree About Statistical Significance*.

Imagine you work for quality control at Coke and are tasked with making sure that the caffeine content in the fountain beverages served in restaurants is the same as in a 12-oz can of Coke (34mg).  If you believe that the mean caffeine content in fountain coke is not 34mg, you must re-train the franchise managers to make sure the Coke served has the correct caffeine level.

Based on the confidence interval you calculated in the assignment, do you believe that the mean caffeine content is statistically significantly different from 34 mg in a 12-oz serving?

**Answer the question before viewing the solution in the next cell:**

YOUR ANSWER

34mg of caffeine is outside of the 95% confidence interval, so the mean caffeine content is statistically significantly different from 34mg in a 12-oz serving; we can reject $H_0$.


**Stretch Task 1 Solution**

Because 34mg is not in the bounds of the 95% confidence interval, we can reject the null hypothesis that the mean caffeine content in 12-oz of fountain Coke is equal to 34mg.  Instead, we conclude it is between about 36.4 and 39.4 mg.
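
The agreement between the confidence interval and the hypothesis test can be illustrated with a short sketch. The `sample` array is hypothetical (not the Coke data), and pairing a 95% interval with a two-sided one-sample t-test at the 0.05 level is the assumption being demonstrated; `null_mean` is the 34mg value from the text.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Hypothetical measurements, made up for illustration only
sample = np.array([36.1, 38.4, 35.2, 40.3, 37.7, 39.0, 34.8, 38.9])
null_mean = 34.0   # caffeine (mg) in a 12-oz can, the null-hypothesis value

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% confidence interval for the mean
low, high = t.interval(0.95, n - 1, loc=mean, scale=se)
outside_ci = not (low <= null_mean <= high)

# Two-sided one-sample t-test against the same null value
result = ttest_1samp(sample, popmean=null_mean)
significant = result.pvalue < 0.05

print('Null value outside 95% CI:', outside_ci)
print('p-value below 0.05:       ', significant)
```

The two flags should always match: a null value falls outside the 95% interval exactly when the two-sided test rejects it at the 5% level.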

### Stretch Task 2

If we increased the sample size from 50 to 100 but the sample mean and SD remained the same, describe **two** ways the margin of error would change.  Would the margin of error become smaller or larger?

**Answer the question before viewing the solution in the next cell:**

YOUR ANSWER

The margin of error will be changed in two ways:
$t^*$ will move closer to 1.96 (the degrees of freedom increase from 49 to 99), and the standard error will shrink (its denominator, $\sqrt{n}$, increases from $\sqrt{50}$ to $\sqrt{100}$). Both changes make the margin of error smaller.

**Stretch Task 2 Solution**

Both t* and n would change.

In [47]:
# Stretch Task 2 Solution - demonstrated in code

# Use new names so the results from Tasks 2-4 are not overwritten
n_100 = 100
t_star_100 = t.ppf(0.975, df=n_100 - 1)
se_100 = std_caffeine / np.sqrt(n_100)

ME = t_star_100 * se_100
print('Margin of error = {:.2f}'.format(ME))

Margin of error = 1.04


**Stretch Task 2 Solution**

The ME for n = 100 is 1.04 compared to 1.49 when n = 50.  Increasing the sample size decreases the margin of error (thus increasing the precision of the estimate) while maintaining the same level of confidence.
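
The comparison can be reproduced from summary statistics alone. This is a sketch that assumes the sample standard deviation stays at the rounded value printed in Task 2 (5.244); the helper `margin_of_error` is a hypothetical name introduced here, not part of the assignment.

```python
import numpy as np
from scipy.stats import t

s = 5.244  # sample standard deviation as printed in Task 2 (rounded)

def margin_of_error(n, sd=s, confidence=0.95):
    """Margin of error of a t-based confidence interval for the mean."""
    q = 1 - (1 - confidence) / 2          # e.g. 0.975 for a 95% interval
    return t.ppf(q, n - 1) * sd / np.sqrt(n)

for n in (50, 100):
    print('n = {}: margin of error = {:.2f}'.format(n, margin_of_error(n)))
```

This prints 1.49 for n = 50 and 1.04 for n = 100. Both factors contribute to the drop, though the $\sqrt{n}$ in the standard error accounts for most of it.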