# Confidence Intervals
<hr>

A confidence interval (CI) is a range in which we are confident, to a certain degree, that a specific population parameter falls.

For example, according to [glassdoor](https://www.glassdoor.com/Salaries/data-scientist-salary-SRCH_KO0,14.htm), as of April 7th 2020, the mean salary for data scientists is \\$113,309. Now obviously glassdoor hasn't polled every working data scientist that exists to get to this number, meaning \\$113,309 is an estimate of the true mean.

Suppose we constructed a 95% CI for this average and (just making up the numbers) got \\$100,000 - \\$115,000. This means that if we were to draw many __random__ samples of data scientists and build a 95% CI from each one, then about 95% of those intervals would contain the true mean salary.
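One way to see what the 95% refers to is to simulate the repeated sampling. The sketch below is only an illustration: the population mean, standard deviation, and sample size are made-up values, the salaries are assumed to be normally distributed, and the interval uses the known-σ formula from the "Population variance known" section.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical population parameters, made up for this illustration.
true_mean = 110_000
sigma = 15_000
n = 70              # size of each random sample
n_repeats = 10_000  # how many times we repeat the "poll"
alpha = 0.05

rng = np.random.default_rng(seed=0)
# Each row is one random sample of n salaries.
samples = rng.normal(loc=true_mean, scale=sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# Margin of error of a two-sided CI when sigma is known.
margin = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
lower = sample_means - margin
upper = sample_means + margin

# Fraction of the intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f'{coverage:.1%} of the {n_repeats:,} intervals contain the true mean')
```

The printed fraction should land close to 95%. Each individual interval either contains the true mean or it doesn't; the 95% describes how often the procedure succeeds.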

Please note most of the data used in this notebook for the examples originates from [this udemy course](https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/). I highly recommend it!

In [210]:
import pandas as pd
import numpy as np
# we'll use both a normal distribution and a T distribution
from scipy.stats import norm, t

## General notes

__Note:__ In this notebook every $\sigma$ represents a population standard deviation while $s$ represents a sample standard deviation. Also $\mu$ generally denotes the mean of some population, while $\bar{x}$ denotes a sample mean.

CIs can be either two-sided or one-sided. A two-sided CI means a population parameter falls between two values, i.e.

$$
\bar{x} - \text{ME} \le \mu \le \bar{x} + \text{ME}
$$

while a one-sided CI bounds the parameter on one side only, either from above or from below:
$$
\mu \le \bar{x} + \text{ME} \\
\mu \ge \bar{x} - \text{ME}
$$

where $\text{ME}$ = Margin of Error.
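As a rough sketch of the difference, the snippet below computes both kinds of bounds with a normal distribution. The numbers are made up for illustration. It also relies on a standard fact not spelled out above: a one-sided interval puts all of α in a single tail, so its critical value is $z_{\alpha}$ rather than $z_{\alpha/2}$.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical numbers, chosen only for illustration.
x_bar = 100.0   # sample mean
sigma = 15.0    # population standard deviation, assumed known
n = 36
alpha = 0.05

std_error = sigma / np.sqrt(n)

# Two-sided: alpha is split evenly between both tails.
me_two_sided = norm.ppf(1 - alpha / 2) * std_error
two_sided = (x_bar - me_two_sided, x_bar + me_two_sided)

# One-sided: all of alpha sits in one tail, so the margin is smaller.
me_one_sided = norm.ppf(1 - alpha) * std_error
upper_bound = x_bar + me_one_sided   # mu <= x_bar + ME
lower_bound = x_bar - me_one_sided   # mu >= x_bar - ME

print(f'Two-sided: {two_sided[0]:.2f} <= mu <= {two_sided[1]:.2f}')
print(f'One-sided (upper): mu <= {upper_bound:.2f}')
print(f'One-sided (lower): mu >= {lower_bound:.2f}')
```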

Another thing to be aware of when calculating CIs is whether to use a Z (normal) distribution or a T distribution. In general, when the population variance is known, or the sample is large enough (30+ data points is a common rule of thumb), you use a Z distribution. Otherwise, when the population variance is unknown and the sample is small, use a T distribution.

Lastly, when using a T distribution you need to specify how many [Degrees of Freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) it has, which in general is just $\text{the number of data points in the set} - 1$.
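To get a feel for why the sample size matters, here is a small sketch comparing the critical values of the two distributions for a 95% two-sided CI. The sample sizes are arbitrary picks for illustration.

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)
print(f'z critical value: {z_crit:.3f}')

t_crits = {}
for n in (5, 10, 30, 100, 1000):
    dof = n - 1  # degrees of freedom for a single sample of size n
    t_crits[n] = t.ppf(1 - alpha / 2, dof)
    print(f'n = {n:>4}, dof = {dof:>3}: t critical value = {t_crits[n]:.3f}')
```

The T critical value is always larger than the Z one, which widens the interval to account for the extra uncertainty of estimating the standard deviation from a small sample. As the degrees of freedom grow, the two converge.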

Now let's get into it.

## Population variance known

$$
\large \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}
$$
Let's break this down:

    α is known as the significance level and is just 1 - confidence level. For a confidence level of 90%, α = 10% or 0.1; for a confidence level of 95%, α = 0.05; and for a confidence level of 99%, α = 0.01.
    
    n is the number of samples in our data.
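The relationship between the confidence level, α, and the critical value $z_{\alpha/2}$ can be sketched with `norm.ppf`. The three confidence levels are the ones mentioned above; the dictionary name is just for illustration.

```python
from scipy.stats import norm

z_scores = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    # alpha/2 sits in each tail, so we look up the 1 - alpha/2 quantile.
    z_scores[confidence] = norm.ppf(1 - alpha / 2)
    print(f'confidence = {confidence:.0%}, alpha = {alpha:.2f}, '
          f'z = {z_scores[confidence]:.3f}')
```

A higher confidence level gives a larger critical value and therefore a wider interval.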

Let's calculate a 95% CI using a dataset with the salaries of 70 data scientists. __We'll assume for the sake of this example that the population standard deviation is \\$15,000.__

In [229]:
data = pd.read_csv('data/salaries_extended.csv')
print(f'There are {len(data)} samples in the dataset.')
data

There are 70 samples in the dataset.


| Unnamed: 0 | Salaries |
|---|---|
| 0 | 120643 |
| 1 | 131248 |
| 2 | 108833 |
| 3 | 127776 |
| 4 | 114564 |
| ... | ... |
| 65 | 112276 |
| 66 | 85927 |
| 67 | 102848 |
| 68 | 121200 |


In [228]:
# in jupyter you can easily type 
# any greek symbol like α by typing
# \alpha and hitting the tab key.
α = 0.05
n = len(data)

mean_salary = data.Salaries.mean()
σ = 15_000

print(f'The sample mean salary is ${mean_salary:,.2f}')
print(f'The population standard deviation is ${σ:,}')

# note we're using a normal distribution
z_score = norm.ppf(1 - α/2)
print(f'The z-score for a {int((1-α)*100)}% CI is {z_score:.2f}')
print('\n', '='*45, '\n', sep='')
lower_bound = mean_salary - z_score * σ / np.sqrt(n)
upper_bound = mean_salary + z_score * σ / np.sqrt(n)

print(f'The {int((1-α)*100)}% CI ranges from ${lower_bound:,.2f} to ${upper_bound:,.2f}')

The sample mean salary is $113,309.00
The population standard deviation is $15,000
The z-score for a 95% CI is 1.96

=============================================

The 95% CI ranges from $109,795.09 to $116,822.91
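If you need this calculation more than once, it can be wrapped in a small helper. The function below is a hypothetical convenience (its name and signature are not from any library), assuming a two-sided interval and a known population standard deviation. For a quick demonstration it is applied to just the first five salaries shown in the table above.

```python
import numpy as np
from scipy.stats import norm

def z_confidence_interval(values, sigma, confidence=0.95):
    """Two-sided CI for the mean when the population std (sigma) is known."""
    values = np.asarray(values, dtype=float)
    alpha = 1 - confidence
    margin = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(len(values))
    mean = values.mean()
    return mean - margin, mean + margin

# First five rows of the dataset, used here only as a small demo sample.
first_five = [120643, 131248, 108833, 127776, 114564]
lower, upper = z_confidence_interval(first_five, sigma=15_000)
print(f'95% CI from ${lower:,.2f} to ${upper:,.2f}')
```

With only five data points the interval is much wider than the one computed from all 70 salaries, since the margin of error shrinks with $\sqrt{n}$.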
