<a href="https://colab.research.google.com/github/Collin-Campbell/DS-Unit-1-Sprint-2-Statistics/blob/master/module3/LS_DS_123_Confidence_Intervals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Confidence Intervals

## Objectives: 

- Explain the concepts of statistical estimate, precision, and standard error as they apply to inferential statistics
- Explain the implications of the central limit theorem in inferential statistics
- Explain the purpose of confidence intervals and identify applications for their use
- Demonstrate how to build a confidence interval around a sample estimate


## Some Useful Definitions

**Statistical Estimation**: The process by which one makes inferences about a population, based on information obtained from a sample.

- **Point Estimate (of a population parameter)**: A single value of a statistic (e.g. the sample mean $\bar{X}$ is a point estimate of the population mean $\mu$).
- **Interval Estimate (of a population parameter)**: Two values, between which a population parameter is said to lie.

**Precision vs Accuracy**

- **Precision**: How close two or more measurements are to each other. (e.g. if you consistently measure your height as 5’0″ with a yardstick, your measurements are precise.)
- **Accuracy**: How close you are to the true value. (e.g. suppose your true height is exactly 5’9″.)
    - You measure yourself with a yardstick and get 5’0″. Your measurement is not accurate.
    - You measure yourself again with a laser yardstick and get 5’9″. Your measurement is accurate.

**Standard Error**: Measures how precisely a sample statistic (here, the sample mean) estimates the corresponding population parameter.
- SD is the measure of variation within a single set of measurements (a sample)
- SE is the variation in the **means** from multiple sets of measurements

### $SE = \frac{\sigma}{\sqrt{n}}$
- $SE$ = Standard Error
- $\sigma$ = **Population** Standard Deviation
- $n$ = Sample Size

*Standard error increases when the spread of values within the population increases.*

*As **$n$** increases, the standard error falls, allowing us to **infer** specific claims about a population with greater confidence.*
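To make the formula concrete, here is a minimal simulation sketch using only the Python standard library. The population (normal, mean 100, SD 15), the sample sizes, and the helper name `sd_of_sample_means` are all made up for illustration. It draws many samples of each size and compares the spread of their means with $\frac{\sigma}{\sqrt{n}}$.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: normal with mean 100 and standard deviation 15.
POP_MEAN, POP_SD = 100.0, 15.0

def sd_of_sample_means(n, n_samples=2000):
    """Draw n_samples samples of size n and return the SD of their means."""
    means = [
        statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)

results = {}
for n in (10, 40, 160):
    theoretical_se = POP_SD / math.sqrt(n)   # sigma / sqrt(n)
    simulated_se = sd_of_sample_means(n)     # spread of simulated sample means
    results[n] = (theoretical_se, simulated_se)
    print(f"n={n:>3}  sigma/sqrt(n)={theoretical_se:.3f}  simulated={simulated_se:.3f}")
```

The two columns should agree closely, and quadrupling the sample size should roughly halve the standard error.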

## Central Limit Theorem

#### "Even if you're not normal, your mean is normal!"



---



- Have a look at the normal distribution:

![The Normal Distribution](https://tk-assets.lambdaschool.com/14a07636-1e22-414d-8186-d860828e47df_Screenshot_2020-09-07_at_21.42.37.png)

*95.44% of observations fall between +2 and -2 standard deviations from the mean.*

The Central Limit Theorem (CLT) tells us that if you repeatedly draw random samples of a sufficiently large size (about 40 is a workable rule of thumb) from a population (of most distributions\*) with finite variance, the means of those samples will be approximately normally distributed, whatever the shape of the population itself.

*\*Requires that a mean can be calculated for the distribution (which is possible with almost all, but not all, distributions).*

[Check out this demonstration!](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_3_cltmean.html)


**Why do we care?**
- Because the sample means approximate a normal distribution we can expect 95% of sample means to be within about 2 standard errors of the population mean!
- This helps us be **confident** that our sample means will approximate our population means despite us not knowing the population's distribution.
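The claim about 95% of sample means can be checked with a small simulation. The sketch below is an illustration under assumed settings, not part of the original lesson: it uses a strongly skewed population (exponential with rate 1, so both its mean and its SD equal 1), a sample size of 40, and 5,000 repeated samples.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 40              # size of each sample
n_samples = 5000    # number of repeated samples
pop_mean = 1.0      # an exponential population with rate 1 has mean 1
pop_sd = 1.0        # and standard deviation 1
se = pop_sd / math.sqrt(n)

# The population is far from normal, but we only look at the sample means.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

within_2se = sum(abs(m - pop_mean) <= 2 * se for m in sample_means) / n_samples
print(f"Share of sample means within 2 SE of the population mean: {within_2se:.3f}")
```

The printed share should land close to 0.95, even though individual observations from this population are heavily right-skewed.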


Let's look at a picture:


![alt text](https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/T-dist%201.png)


**The CLT tells us to expect 95% of sample means to be within about 2 standard errors of the population mean.**

The reason that we say "about" 2 is that the exact number of standard errors we need to add and subtract to be 95% confident about the population mean will depend on our sample size in the form of the degrees of freedom (sample size - 1).

The CLT allows us to be a little bit more specific about our confidence interval formula.

We can now say that:

"We are 95% confident that the true population mean falls between about 2SEs below the sample mean and about 2SEs above the sample mean."

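Because the population standard deviation is unknown and we have to estimate it with the sample standard deviation $s$, we use the t-distribution instead of the normal distribution:
- The t-distribution has fatter tails to account for our higher level of uncertainty
- The shape of the t-distribution depends on our sample size (through the degrees of freedom); as the sample size grows, it approaches the normal distribution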

[Have a look at how t-distribution changes with sample size](https://media.geeksforgeeks.org/wp-content/uploads/20200525113955/f126.png)

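When the sample mean is standardized using the estimated standard error, $\frac{\bar{X} - \mu}{s/\sqrt{n}}$, the result follows a **t-distribution** with n-1 degrees of freedom (exactly for a normal population, approximately otherwise when the sample is large). In other words, the sample mean is centered on the population mean, with an estimated standard deviation of $\frac{s}{\sqrt{n}}$.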
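As a quick illustration of how the multiplier depends on sample size, the sketch below uses `scipy.stats` to compare the 97.5th percentile of the t-distribution with that of the normal distribution. The sample sizes are arbitrary choices, apart from 223, which matches the body temperature data used later.

```python
from scipy.stats import norm, t

# Multiplier for the middle 95% of a normal distribution.
z_star = norm.ppf(0.975)
print(f"normal: {z_star:.3f}")

# Matching multipliers from the t-distribution for several sample sizes.
t_stars = {}
for n in (5, 10, 30, 223, 1000):
    t_stars[n] = t.ppf(0.975, df=n - 1)
    print(f"n={n:>4}  df={n - 1:>3}  t*={t_stars[n]:.3f}")
```

Small samples need a noticeably larger multiplier than 2, and the value shrinks toward the normal multiplier as the sample size grows.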

## Confidence Intervals

### What are Confidence Intervals?

> A range of values, computed from a sample, built so that a stated proportion of such ranges (e.g. 95%) would contain the population parameter if the sampling were repeated many times.


Believe it or not, you already use the concept of a confidence interval all the time in your daily life.

*   "How long will it take the brownies to bake?"  About 35 - 40 minutes.
*   "How many cookies should I bake for the bake sale?"  Probably around 2 - 3 dozen.
*   "How many loaves of bread will the bakery sell today?" Around 90 - 100.

The true answer to all of these questions is unknowable ahead of time.  Exactly how long it will take the brownies to bake probably depends on a zillion factors including how well your oven is working, how hot and humid it is, the age of the eggs you use, etc.

However, based on personal experience - and perhaps the recipe - we can be very confident that it will take between 35 - 40 minutes for the brownies to be done.  

Estimates in the other two scenarios work similarly.

*Note that there are some analogous procedures for confidence intervals for chi-square tests and two-sample t-tests, but we are going to focus on a confidence interval for a single population mean here.*

**The reason that we make confidence intervals to estimate population parameters rather than relying on point estimates is that it greatly increases our chances of making an accurate estimate.**

### The Anatomy of a Confidence Interval

Here comes the math part!

When we make brownies, we make an informal confidence interval for how long it will take for them to bake based on the recipe and our baking knowledge.

**When we make a confidence interval for a population mean, we will use information from our sample and mathematical properties of the t-distribution.**

### The formula for a confidence interval for a population mean is 

### $\bar{X} \pm t^* \frac{s}{\sqrt{n}}$



- $\bar{X}$ = sample mean
- $s$ = sample standard deviation
- $n$ = sample size

Note that $\frac{s}{\sqrt{n}}$ is the estimated standard error: it stands in for the true standard error of the sample mean, $\frac{\sigma}{\sqrt{n}}$, which we cannot compute because $\sigma$ is unknown.

**Another name for the quantity $t^* \frac{s}{\sqrt{n}}$ is the margin of error.**

Nearly all of the information we need to estimate the population mean using a confidence interval comes from our sample.  The only thing we don't get from the sample is t*.
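The formula can be sketched as a small helper function. The name `mean_confidence_interval` and the example measurements are hypothetical, chosen only for illustration. The function takes the mean, standard deviation, and size from the sample, and asks `scipy.stats.t` for the multiplier.

```python
import math
import statistics
from scipy.stats import t

def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) for a t-based CI around the sample mean."""
    n = len(values)
    x_bar = statistics.fmean(values)   # sample mean
    s = statistics.stdev(values)       # sample SD (n - 1 in the denominator)
    se = s / math.sqrt(n)              # estimated standard error
    # Multiplier that leaves (1 - confidence) / 2 in each tail.
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_star * se               # margin of error
    return x_bar - margin, x_bar + margin

# Made-up measurements, for illustration only.
example_values = [4.0, 5.0, 6.0, 7.0, 8.0]
lower, upper = mean_confidence_interval(example_values)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With only five observations the multiplier is about 2.78 rather than 2, so the interval is fairly wide relative to the spread of the data.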

Now is the time to introduce some data.

# Estimate the mean healthy adult human body temperature.

Everyone knows that 98.60 F (37.00 C) is the normal human body temperature.  But is that actually correct, and – come to think of it – how does everyone know that in the first place?

A German physician named Carl Reinhold August Wunderlich is generally credited with originating this idea, which was based on – reportedly – more than one million axillary temperature readings taken from 25,000 subjects and was published in his 1868 book Das Verhalten der Eigenwärme in Krankheiten (which translates to The Behavior of the Self-Warmth in Diseases). But was he correct? History tells that his thermometer was a foot long and took 20 minutes to determine a subject’s temperature. For a measure that is used so often to determine general health, it would be a good idea to use modern instruments to confirm or refute his results.

In 1992, three physicians from the University of Maryland School of Medicine set out to do just that, measuring body temperatures for 223 healthy men and women aged 18-40 one to four times a day for three consecutive days using an electronic digital thermometer. The mean body temperature was computed for each individual, and this summary measure is recorded in the Bodytemp.csv dataset. 

**We wish to estimate the population mean healthy human body temperature.**

Source: Mackowiak, P. A., Wasserman, S. S., and Levine, M. M.  (1992), "A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich," _Journal of the American Medical Association_, 268, 1578-1580.


In [1]:
import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Bodytemp.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

print(df.shape)
df.head()


(223, 2)


|   | ID | Body_temp |
|---|---|---|
| 0 | 36 | 96.7 |
| 1 | 254 | 96.9 |
| 2 | 282 | 97.0 |
| 3 | 286 | 97.0 |
| 4 | 302 | 97.0 |


The two variables in the dataset are participant ID and body temperature measured in degrees F.

We can use the pandas `mean`, `std` and `count` methods to calculate and save the sample mean body temp ($\bar{X}$ in the CI formula), the sample standard deviation ($s$ in the CI formula) and the sample size ($n$ in the formula).

In [4]:
#Calculate mean

mean_temp = df['Body_temp'].mean()

mean_temp

98.16502242152464

In [5]:
#Calculate SD

sd_temp = df['Body_temp'].std()

sd_temp

0.5273047859946364

In [7]:
#Calculate n

n_temp = df['Body_temp'].count()

n_temp

223

We can calculate the standard error of the sample mean by dividing the sd of the sample by the square root of the sample size.

In [9]:
#Calculate SE

se_temp = sd_temp / (n_temp**(1/2))

se_temp

0.035310940220425246

Let's plug these quantities into the CI formula.  We're almost there!

### $\bar{X} \pm t^* \frac{s}{\sqrt{n}}$


98.17 $\pm$ $t^* \frac{0.53}{\sqrt{223}}$ 

98.17 $\pm$ $t^* * 0.035$ 

So what is $t^*$?

Let's look carefully at what the CI equation is saying.

We are going to use our sample mean as the starting point for estimating the population mean.  That seems like a good idea.

Then we are going to add and subtract some number of standard errors from the sample mean to get the range of values that will be our confidence interval.

**$t^*$ is the number that tells us how many standard errors to add and subtract from the sample mean in the CI formula.**

**We can use Python to get the exact value of $t^*$ for the CI formula.**

In [14]:
from scipy.stats import t

#The 0.975 comes from wanting the *middle* 95% of the t-distribution:
#that leaves 2.5% in each tail, so the upper cutoff is the 97.5th percentile.
#We're going to learn how to calculate a 95% CI the easy way in just a minute.

#Recall that n = 223 for the body temp problem.
#Remember: degrees of freedom is equal to n-1
t_star = t.ppf(0.975, df=222)

t_star

1.9707073953190277
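For readers curious about the 0.975, the sketch below derives it from the confidence level and checks that the resulting cutoffs really do enclose the middle 95% of the t-distribution. The variable names are illustrative only.

```python
from scipy.stats import t

confidence = 0.95
dof = 222                     # n - 1 for the body temperature data

tail = (1 - confidence) / 2   # 2.5% left over in each tail
upper_q = 1 - tail            # 0.975, the quantile passed to ppf

t_star = t.ppf(upper_q, df=dof)   # upper cutoff
t_lower = t.ppf(tail, df=dof)     # lower cutoff, the mirror image of t_star

# Probability mass between the two cutoffs.
middle = t.cdf(t_star, df=dof) - t.cdf(t_lower, df=dof)
print(upper_q, t_star, t_lower, middle)
```

Because the t-distribution is symmetric around zero, the lower cutoff is just the negative of $t^*$, which is why the CI formula can use a single $\pm$ term.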

### Back to body temperature!

We left off in the body temperature example with:

A confidence interval of 

98.17 $\pm$ $t^* * 0.035$ 

We know from the CLT that t* should be somewhere around 2 for a 95% confidence interval.

And further, using Python, we know that t* for a 95% confidence interval for the body temperature data = 1.97 (quick check - this is very close to 2).

98.17 $\pm$ 1.97 * 0.035

The margin of error = 1.97 * 0.035 = 0.07.



### Want to see how to do this the easy way?

In [15]:
#We can use the t.interval function to calculate the CI.
#The first argument is the confidence level: 0.95 means we want a 95% CI.
#(Older SciPy releases named this argument alpha and newer ones name it
#confidence, so we pass it positionally to work with either.)
#We set the loc parameter equal to the mean and the
#scale parameter equal to the SE.

t.interval(0.95, df=222, loc=mean_temp, scale=se_temp)

(98.09543489049658, 98.2346099525527)

Working by hand with `t_star` gives exactly the same limits:

In [19]:
#Recall that we calculated mean_temp, se_temp and t_star above.

#Calculate the lower confidence limit

mean_temp - t_star * se_temp

98.09543489049658

In [20]:
#Calculate the upper confidence limit

mean_temp + t_star * se_temp

98.2346099525527

In conclusion, we are 95% confident that the population mean healthy human body temperature is between 98.10 and 98.23 degrees F.

### But I want to be X% confident!

Because the confidence level is determined by t* and that's just a number, you can use any confidence level you want.  However, most commonly, we choose to be 90%, 95% or 99% confident.

We can tune the confidence level by changing the confidence-level argument of the `t.interval` function.

In [16]:
#90% Confidence Interval

t.interval(0.90, df=222, loc=mean_temp, scale=se_temp)

(98.106697704594, 98.22334713845528)

In [17]:
#95% Confidence Interval

t.interval(0.95, df=222, loc=mean_temp, scale=se_temp)

(98.09543489049658, 98.2346099525527)

In [18]:
#99% Confidence Interval

t.interval(0.99, df=222, loc=mean_temp, scale=se_temp)


(98.0732790800297, 98.25676576301957)

### But I want to be 100% confident!

Let's take a very, very close look at the confidence intervals above.  This is subtle - out in the 100ths place of the decimals.

The 90% confidence interval is the *narrowest* - it includes the smallest range of values - and the 99% confidence interval is the *widest* - it includes the widest range of values.

Confidence intervals are a trade-off between accuracy and precision.

**The more confident you want to be, the less precise your CI will be.**

**The less confident you are willing to be, the more precise your estimate can be.**

If we insisted on being 100% confident, our CI would be so wide that it would be meaningless.
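The trade-off can be checked numerically. The sketch below reuses the sample mean and standard error printed earlier in this notebook and computes the interval width at several confidence levels; the 99.9% level and the variable names are additions for illustration.

```python
import math
from scipy.stats import t

# Summary values copied from the body temperature cells in this notebook.
mean_temp = 98.16502242152464
se_temp = 0.035310940220425246
dof = 222  # n - 1

widths = {}
for confidence in (0.90, 0.95, 0.99, 0.999):
    t_star = t.ppf(1 - (1 - confidence) / 2, df=dof)
    lower = mean_temp - t_star * se_temp
    upper = mean_temp + t_star * se_temp
    widths[confidence] = upper - lower
    print(f"{confidence:.1%} CI: ({lower:.3f}, {upper:.3f})  width = {widths[confidence]:.3f}")

# At 100% confidence the multiplier is infinite, so no finite interval exists.
t_star_100 = t.ppf(1.0, df=dof)
print(t_star_100)
```

Each step up in confidence buys a wider interval, and pushing all the way to 100% would require an interval covering every possible value.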

### In terms of brownies:

*   We are 90% confident the brownies will be done in 35 - 36 minutes.
*   We are 95% confident the brownies will be done in 34 - 37 minutes.
*   We are 99% confident the brownies will be done in 33 - 39 minutes.
*   We are 100% confident the brownies will be done in 0 - 60 minutes.

The 100% confidence interval is true... but not very helpful.


## Common Errors with the Interpretation of Confidence Intervals

**Correct statement:** We are 95% confident that the population mean healthy human body temperature is between 98.10 and 98.23 degrees F.

**Incorrect statement:** We are 95% confident that the sample mean healthy human body temperature is between 98.10 and 98.23 degrees F.


*   We already *know* the sample mean is in the CI because we use the sample mean to compute the CI.  We want to make a statement about the population mean.


**Incorrect statement:** We are 95% confident that the sample mean healthy human body temperature is 98.17 degrees F.

*   The CI is a statement about the likelihood of the population mean being in our CI, not being equal to the sample mean.

**Incorrect statement:** 95% of samples will have a mean that is between 98.10 and 98.23 degrees F.

*   Based on the CLT, we know that 95% of all 95% CIs will contain the true population mean.  However, we don't expect those CIs to be exactly equal to the one we created here.

### It can help to look at a picture

We expect 95% of all 95% CIs to contain the true population mean.  

That means, if we took 20 random samples from the same population and built a 95% CI from each one, the intervals would all be different, but we'd expect about 19/20 of them to contain the true population mean and only about 1/20 not to contain it.

![alt text](https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/CI%20image.png)
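The same idea can be simulated. This is a minimal sketch under assumed settings: a normal population with a made-up true mean of 98.2 and SD of 0.5, samples of size 25, and 10,000 repetitions. It builds a 95% CI from every sample and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable

true_mean, true_sd = 98.2, 0.5    # hypothetical population values
n, n_intervals = 25, 10_000
t_star = t.ppf(0.975, df=n - 1)

# One row per simulated sample.
samples = rng.normal(true_mean, true_sd, size=(n_intervals, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # ddof=1 gives the sample SD

lower = means - t_star * ses
upper = means + t_star * ses

# Share of intervals that contain the true population mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Share of 95% CIs containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many repetitions.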


## Final Challenge!

### Let's do one more quick example.

Use the Titanic dataset to create a 95% confidence interval for the population mean age of all Western Europeans traveling to the US in the early 1900s (assume the Titanic passengers are representative of this population).

In [21]:
import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Titanic.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

print(df.shape)
df.head()

(887, 8)


|   | Survived | Pclass | Name | Sex | Age | Siblings/Spouses_Aboard | Parents/Children_Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |


In [24]:
#calculate mean, sd, n, se here.

mean_age = df['Age'].mean()

print(mean_age)

sd_age = df['Age'].std()

print(sd_age)

n_age = df['Age'].count()

print(n_age)

se_age = sd_age / (n_age**(1/2))

print(se_age)

29.471443066516347
14.121908405462555
887
0.4741672781626


In [25]:
#Calculate t-interval here.

t.interval(0.95, df=886, loc=mean_age, scale=se_age)

(28.540820985809045, 30.40206514722365)

Interpretation of the CI:

We are 95% confident that the population mean age of all Western Europeans traveling to the US in the early 1900s is between 28.54 and 30.40 years old.