In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Intro
- A single point estimate gives a rough idea of a population parameter, but it is subject to sampling error
- Taking many samples reduces this error, but doing so is not always feasible  
$\to$ Confidence intervals

# Confidence intervals
- Confidence interval = a range of values above and below a point estimate
    + It is likely to contain the true value of the population parameter
- The interval is associated with a **confidence level**
    + The higher the confidence level, the wider the interval
    + The confidence level is chosen before calculating the confidence interval
    + Confidence intervals are usually reported with point estimates to show how reliable the estimates are

- The intervals are calculated from samples, so each sample gives a different interval
    + Example: A 95% confidence level
        + The procedure produces an interval that contains the true population parameter for about 95% of samples
        + If we take 100 samples and calculate a 95% confidence interval for each of them
            + we expect about 95 of these intervals to contain the true population parameter
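The repeated-sampling interpretation above can be checked with a small simulation. This is a minimal sketch under assumptions: the population is synthetic (normally distributed values generated with NumPy, not the notebook's dataset), its standard deviation is treated as known, and the interval uses the z-based margin of error described later in this notebook. The names `population`, `n_samples`, and `coverage` are illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Hypothetical population: 30,000 synthetic values (not the notebook's dataset)
population = rng.normal(loc=35, scale=9, size=30_000)
true_mean = population.mean()
sigma = population.std()  # population std (NumPy defaults to ddof=0)

n, n_samples, cl = 100, 1_000, 0.95
z = st.norm.ppf(1 - (1 - cl) / 2)
margin = z * sigma / np.sqrt(n)

# Draw many samples and count how often the interval contains the true mean
hits = 0
for _ in range(n_samples):
    sample = rng.choice(population, size=n, replace=False)
    low, high = sample.mean() - margin, sample.mean() + margin
    hits += int(low <= true_mean <= high)

coverage = hits / n_samples
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The printed share should be close to the chosen confidence level, though it varies from run to run when the seed changes.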

## Calculating confidence intervals
- Confidence intervals are calculated from point estimates
    + Add and subtract a margin of error from the point estimate to get the upper and lower boundaries of the interval
    + The margin of error depends on
        + The confidence level chosen
        + The sample size
        + The spread of the data (std)

- There are 2 ways of calculating confidence intervals, depending on whether we know the standard deviation of the population
    + z-values: the population std is known
    + t-values: the population std is unknown

In [2]:
df = pd.read_csv('./data/UCI_Credit_Card.csv')

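# Population (the full dataset is treated as the population)
pop_mean = df['AGE'].mean()
# Note: pandas .std() defaults to ddof=1 (sample formula); for a large
# dataset the difference from the population formula (ddof=0) is negligible
pop_std = df['AGE'].std()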

# Sample
sample_size = 100
sample = df['AGE'].sample(
    n=sample_size,
    random_state=5)
sample_mean = sample.mean()
sample_std = sample.std()

#### Z-values
- $\text{margin of error} = z^* \frac{\sigma}{\sqrt{n}}$
    + $z^*$: critical z-value
    + $\sigma$: standard deviation of the population
    + $n$: sample size
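To make the formula above concrete, here is a small sketch of how the margin of error responds to the sample size. The helper `margin_of_error` and the values `z_star = 1.96` (an approximate critical value for 95%) and `sigma = 9.0` are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np

def margin_of_error(z_star, sigma, n):
    """Margin of error for a mean when the population std (sigma) is known."""
    return z_star * sigma / np.sqrt(n)

z_star = 1.96   # approximate critical z-value for a 95% confidence level
sigma = 9.0     # hypothetical population standard deviation

for n in (25, 100, 400):
    print(f"n = {n:>3}: margin of error = {margin_of_error(z_star, sigma, n):.3f}")

moe_100 = margin_of_error(z_star, sigma, 100)
moe_400 = margin_of_error(z_star, sigma, 400)
```

Because $n$ appears under a square root, quadrupling the sample size only halves the margin of error.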

- Critical z-value
    + The critical z-value $z^*$ for a 95% confidence level is the number of standard deviations we must go above and below the mean of a standard normal distribution to enclose a central area of 95%

<img src="./assets/1.png" width="550"/>

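- Calculate the critical z-value with `scipy.stats` (imported as `st`)
    + `z_value = st.norm.ppf(q=quantile)`
    + $\text{quantile} = 1 - \frac{1-CL}{2}$
        + $CL$: confidence level
        + The remaining $1 - CL$ is split evenly between the two tails, so the upper cut-off sits at this quantile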
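As a quick illustration of the quantile rule above, the following sketch computes critical z-values for a few common confidence levels. The helper name `critical_z` is hypothetical; `st.norm.ppf` is the same call used in this notebook.

```python
import scipy.stats as st

def critical_z(cl):
    """Critical z-value for a two-sided interval at confidence level cl."""
    quantile = 1 - (1 - cl) / 2
    return st.norm.ppf(q=quantile)

z_values = {cl: critical_z(cl) for cl in (0.90, 0.95, 0.99)}
for cl, z in z_values.items():
    print(f"{cl:.0%} confidence level -> z* = {z:.3f}")
```

This gives roughly 1.645, 1.960 and 2.576, which shows why a higher confidence level leads to a wider interval.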

In [3]:
import scipy.stats as st

In [4]:
# Z-value
cl = 0.95
quantile = 1 - ((1-cl)/2)
z_value = st.norm.ppf(q=quantile)
print("Z-value for 95% confidence level : ", z_value)

# Confidence interval calculations
margin_error = z_value * (pop_std / np.sqrt(sample_size))
CI = (sample_mean - margin_error, sample_mean + margin_error)

print("Sample Mean : ", sample_mean)
print("margin of error : ", margin_error)
print("Confidence Interval : ", CI)
print("Population Mean : ", pop_mean)

Z-value for 95% confidence level :  1.959963984540054
Sample Mean :  35.57
margin of error :  1.806675998640202
Confidence Interval :  (33.7633240013598, 37.3766759986402)
Population Mean :  35.4855


#### t-values
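- t-values: in most cases, we do not know the population std
    + Use the sample std $s$ instead, and account for the extra uncertainty this introduces
    + $\text{margin of error} = t^* \frac{s}{\sqrt{n}}$
    + t-values are drawn from a t-distribution with $n - 1$ degrees of freedom
        + Similar in shape to a standard normal distribution
        + Wider (heavier tails) when the sample size is small
        + For large sample sizes, the t-distribution is very close to a normal distribution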
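The convergence described above can be seen by comparing critical t-values with the critical z-value at the same quantile. This is a small sketch; the sample sizes are arbitrary choices for illustration.

```python
import scipy.stats as st

quantile = 0.975  # upper cut-off for a 95% confidence level
z_value = st.norm.ppf(quantile)

# Critical t-values for increasing sample sizes (degrees of freedom = n - 1)
t_values = {n: st.t.ppf(quantile, df=n - 1) for n in (5, 30, 100, 1000)}

print(f"z* = {z_value:.4f}")
for n, t in t_values.items():
    print(f"n = {n:>4}: t* = {t:.4f} (difference from z*: {t - z_value:.4f})")
```

The critical t-value is always larger than the critical z-value, and the gap shrinks as the sample size grows.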

In [5]:
# t-value
cl = 0.95
quantile = 1 - ((1-cl)/2)
t_value = st.t.ppf(
    q=quantile,
    df=sample_size-1)
print("t-value for 95% confidence level : ", t_value)

# Confidence interval calculations
margin_error = t_value * (sample_std / np.sqrt(sample_size))
CI = (sample_mean - margin_error, sample_mean + margin_error)

print('Sample Mean : ', sample_mean)
print('margin of error : ', margin_error)
print('Confidence Interval : ', CI)
print('Population Mean : ', pop_mean)

t-value for 95% confidence level :  1.9842169515086827
Sample Mean :  35.57
margin of error :  1.9820622046591403
Confidence Interval :  (33.58793779534086, 37.55206220465914)
Population Mean :  35.4855
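As a cross-check on the manual t-based calculation, `scipy.stats` also offers an `interval` method on its distributions. The sketch below compares the two on a synthetic sample (generated with NumPy, not drawn from the notebook's dataset); the variable names are illustrative, and the confidence level is passed positionally because the name of that parameter has differed between SciPy versions.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(5)
sample = rng.normal(loc=35, scale=9, size=100)  # hypothetical sample

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error from the sample std

# Manual interval: mean +/- t* * standard error
t_value = st.t.ppf(0.975, df=n - 1)
manual = (mean - t_value * sem, mean + t_value * sem)

# Same interval via scipy's helper
builtin = st.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print("Manual  :", manual)
print("Built-in:", builtin)
```

Both approaches should agree up to floating-point precision, since the helper uses the same quantiles of the t-distribution.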
