<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 23: Confidence Intervals

Associated Textbook Sections: [13.0, 13.1, 13.2](https://inferentialthinking.com/chapters/13/Estimation.html)

## Outline

* [Percentiles](#Percentiles)
* [Estimation](#Estimation)
* [The Bootstrap](#The-Bootstrap)
* [Confidence Intervals](#Confidence-Intervals)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Percentiles

### Computing Percentiles

The Xth percentile is the first value on the sorted list that is at least as large as X% of the elements.

* The 80th percentile of `[1, 7, 3, 9, 5]` is 7, the 4th element of the sorted list `[1, 3, 5, 7, 9]`
* The 4th element is used because $(80/100) \cdot 5 = 4$
* When X% of the list size is not a whole number, round up and take the element at the next position
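
The rule above can be sketched in plain Python. This is only an illustration of the definition, not the `datascience` implementation; the helper name `percentile_by_definition` is hypothetical.

```python
import math

def percentile_by_definition(p, values):
    """Return the smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    # 1-based position in the sorted list, rounded up when p% of n is not whole.
    position = math.ceil(p * len(ordered) / 100)
    # Guard so that p = 0 still returns the smallest element.
    position = max(position, 1)
    return ordered[position - 1]

s = [1, 7, 3, 9, 5]
print(percentile_by_definition(80, s))  # 7: (80/100) * 5 = 4, the 4th sorted element
print(percentile_by_definition(50, s))  # 5: (50/100) * 5 = 2.5, rounded up to position 3
```

Computing `p * n / 100` rather than `(p / 100) * n` is a small design choice that keeps the arithmetic exact for whole-number inputs before rounding up.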

### The percentile Function

* The pth percentile is the smallest value in a set that is at least as large as p% of the elements in the set
* Function in the datascience module: `percentile(p, values)`
    * `p` is between 0 and 100
    * `values` is an array, list, etc.
    * Returns the pth percentile of the array
* For example, with `s = make_array(1, 7, 3, 9, 5)`, `percentile(80, s)` returns `7`

### Demo: Percentiles

Compute the 55th percentile of the following array.

In [None]:
x = make_array(43, 20, 51, 7, 28, 34)

In [None]:
...

Sort the array

In [None]:
...

Calculate the position in the sorted array that corresponds to the percentile, rounding up to a whole number.

In [None]:
...

## Estimation

### Inference: Estimation

How do we calculate the value of an unknown parameter?
* If you have a census (that is, the whole population): Just calculate the parameter and you're done
* If you don't have a census:
    * Take a random sample from the population
    * Use a statistic as an estimate of the parameter
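
A minimal sketch of the two cases, assuming a synthetic population stored in a NumPy array. The gamma-distributed values, the seed, and the variable names are all hypothetical and are not the SF compensation data.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Hypothetical population of 10,000 compensation values (synthetic).
population = rng.gamma(shape=4.0, scale=30_000, size=10_000)

# With a census, the parameter is computed directly.
parameter = np.median(population)

# Without a census, draw a random sample and use its median as the estimate.
sample = rng.choice(population, size=300, replace=False)
estimate = np.median(sample)

print("Population median (parameter):", round(parameter))
print("Sample median (estimate):     ", round(estimate))
```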

### Demo: Sample Median Estimation

Load the [2022 Employee Compensation data from data.sfgov.org](https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd).

In [None]:
sf = Table.read_table('https://data.sfgov.org/resource/88g8-5mnd.csv?year=2022')
sf.show(3)

Reduce the table to the `job`, `total_salary`, and `total_compensation` columns.

In [None]:
sf = sf.select('job', 'total_salary', 'total_compensation')
sf.show(3)

Who is making the most money?

In [None]:
sf.sort('total_compensation', descending=True).show(5)

Visually explore the distribution of compensations.

In [None]:
...

Filter out the bottom 5% and top 5% of compensations.

In [None]:
bottom_5 = ...
top_5 = ...
sf = ...
sf.select('total_compensation').hist(bins=30)

Calculate the median total compensation of this population.

In [None]:
pop_median = ...
pop_median

Randomly sample 300 employees from the population and calculate the sample median total compensation.

In [None]:
our_sample = ...
our_sample.show(5)

In [None]:
...

Visually compare the distribution of total compensation values for the population and random sample.

In [None]:
....hist('total_compensation', bins=10)
plots.title('Population Distribution');

In [None]:
....hist('total_compensation', bins=10)
plots.title('Sample Distribution');

### Variability of the Estimate

* One sample $\implies$ One estimate
* But the random sample could have come out differently
* And so the estimate could have been different
* **Big question**: How different would it be if we did it again?


### Demo: Variability of the Estimate

Create a function that samples randomly from the `sf` table and returns the sample median for `total_compensation`.

In [None]:
def generate_sample_median(samp_size):
    ...
    return ...

In [None]:
sample_median = generate_sample_median(300)
sample_median

Compute the error if the sample median is used to estimate the population median for total compensation. Re-run the above function to see how the error varies.

In [None]:
error = ...
error

### Quantifying Uncertainty

* The estimate is usually not exactly right: `Estimate = Parameter + Error`
* How accurate is the estimate, usually?
* How big is a typical error?
* When we have a census, we can do this by simulation
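
When the whole population is available, the simulation might look like the following sketch. It assumes a synthetic population in a NumPy array rather than the `sf` table, and the helper `one_sample_median` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a population we have a census of.
population = rng.gamma(shape=4.0, scale=30_000, size=10_000)
pop_median = np.median(population)

def one_sample_median(population, samp_size, rng):
    """Draw one random sample without replacement and return its median."""
    sample = rng.choice(population, size=samp_size, replace=False)
    return np.median(sample)

# Repeat the sampling process many times to see how the estimate varies.
sample_medians = np.array(
    [one_sample_median(population, 300, rng) for _ in range(1000)]
)

# Estimate = Parameter + Error, so Error = Estimate - Parameter.
errors = sample_medians - pop_median

print("Average size of the error:", round(np.mean(np.abs(errors))))
print("Largest error observed:   ", round(np.max(np.abs(errors))))
```

The average absolute error is one reasonable answer to "how big is a typical error?"; other summaries, such as the spread of the middle 95% of errors, would serve as well.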


### Demo: Quantifying Uncertainty

Generate 1000 random samples of size 300 and calculate the median total compensation for each sample. Store the median values in an array and visualize the distribution of the sample medians along with the population median.

In [None]:
...

In [None]:
Table().with_column(
    'Sample Medians', ...
).hist(bins = 20)

plots.scatter(pop_median, 0, color="red", s=60, zorder=3);

Visualize the distribution of errors found from using the sample medians to estimate the population median.

In [None]:
Table().with_column(
    'Errors', ...
).hist(bins=20)

plots.scatter(0, 0, color="red", s=60, zorder=3);

### Where to Get Another Sample?

* We want to understand errors of our estimate
* Given the population, we could simulate... but we only have the sample!
* To get many values of the estimate, we needed many random samples
* We can't go back and sample again from the population:
    * No time, no money
    * Stuck?


## The Bootstrap

### The Bootstrap

* A technique for simulating repeated random sampling
* All that we have is the original sample... which is large and random
* Therefore, it probably resembles the population
* So we sample at random from the original sample!

### Why the Bootstrap Works

All of the resamples look pretty **similar**, most likely.

<img src="./img/why_the_bootstrap_works.png" width=90%>

### Why We Need the Bootstrap

<img src="./img/why_we_need_the_bootstrap.png" width=90%>

### The Bootstrap Principle

* The bootstrap principle: Bootstrap-world sampling $\approx$ Real-world sampling 
* Not always true! ... but reasonable if the sample is large enough
* We hope that the following are similar to what they are in the real world
    * Variability of bootstrap estimate
    * Distribution of bootstrap errors



### Key to Resampling

* From the original sample,
    * draw at random
    * with replacement
    * as many values as the original sample contained
* The size of the new sample has to be the same as the original one, so that the two estimates are comparable
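
The three resampling rules above can be sketched with NumPy. The original sample here is synthetic and the variable names are hypothetical; in the notebook, the sample would come from the `sf` table instead.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the one random sample we actually have.
original_sample = rng.gamma(shape=4.0, scale=30_000, size=300)

# Draw at random, with replacement, as many values as the original sample contains.
resample = rng.choice(original_sample, size=len(original_sample), replace=True)

print("Original sample median: ", round(np.median(original_sample)))
print("Bootstrap sample median:", round(np.median(resample)))

# Drawing with replacement means some values repeat and others are left out.
print("Distinct values in the resample:",
      len(np.unique(resample)), "of", len(resample))
```

Without `replace=True`, a resample of the same size would just be a reshuffling of the original sample and would always give the same median.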


### The Bootstrap Process

#### One Random Sample

True but unknown distribution (population) → Random sample (the original sample)



#### Bootstrap

Empirical distribution of original sample (“population”) → Bootstrap sample 1
* → Estimate 1
* → Bootstrap sample 2
* → Estimate 2
* ...
* → Bootstrap sample 1000
* → Estimate 1000
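
The process above might be sketched as a loop over resamples. This assumes a synthetic original sample in a NumPy array, and the helper name `resample_median` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the original random sample.
original_sample = rng.gamma(shape=4.0, scale=30_000, size=300)

def resample_median(sample, rng):
    """Draw one bootstrap sample (same size, with replacement) and return its median."""
    boot = rng.choice(sample, size=len(sample), replace=True)
    return np.median(boot)

# Bootstrap sample 1 -> Estimate 1, ..., Bootstrap sample 1000 -> Estimate 1000
bootstrap_medians = np.array(
    [resample_median(original_sample, rng) for _ in range(1000)]
)

print("Original sample median:     ", round(np.median(original_sample)))
print("Smallest bootstrap estimate:", round(bootstrap_medians.min()))
print("Largest bootstrap estimate: ", round(bootstrap_medians.max()))
```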

### Demo: Bootstrap

Take a bootstrap (re)sample (**WITH replacement**) of size 300 from the most recent random sample drawn from the `sf` table.

In [None]:
boot_sample = ...
....hist('total_compensation', bins=20)
plots.title('Bootstrap sample');
 
print("Population Median =       ", pop_median)
print("Our Sample Median =       ", sample_median)
print("Bootstrap Sample Median = ", 
      np.median(boot_sample.column('total_compensation')))

Explore the distribution of the medians of 1000 bootstrap resamples drawn from the one sample, in relation to the sample median and population median.

In [None]:
def one_bootstrap_median():
    ...
    return ...

In [None]:
...

In [None]:
Table().with_column(
    'Bootstrap Medians', ...
).hist('Bootstrap Medians', bins=20)

plots.scatter(pop_median, 0, color="red", s=60, zorder=3);
plots.scatter(sample_median, 0, color="blue", s=60, zorder=3);

## Confidence Intervals

### 95% Confidence Interval

* Interval of estimates of a parameter
* Based on random sampling
* 95% is called the confidence level
    * Could be any percent between 0 and 100
    * Higher level means wider intervals
* The confidence is in the process that gives the interval: It generates a "good" interval (one that contains the parameter) about 95% of the time.
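
A minimal sketch of a bootstrap percentile interval, assuming a synthetic sample and a hypothetical helper `bootstrap_interval`. It uses `np.percentile`, which interpolates between values by default, so its endpoints can differ slightly from those returned by the `datascience` `percentile` function.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the original random sample.
original_sample = rng.gamma(shape=4.0, scale=30_000, size=300)

# Medians of 1000 bootstrap resamples (same size, with replacement).
bootstrap_medians = np.array([
    np.median(rng.choice(original_sample, size=len(original_sample), replace=True))
    for _ in range(1000)
])

def bootstrap_interval(estimates, level):
    """Return the endpoints of the middle `level` percent of the estimates."""
    tail = (100 - level) / 2  # e.g. 2.5% in each tail for a 95% interval
    return np.percentile(estimates, tail), np.percentile(estimates, 100 - tail)

left_95, right_95 = bootstrap_interval(bootstrap_medians, 95)
left_80, right_80 = bootstrap_interval(bootstrap_medians, 80)

print("95% interval:", round(left_95), "to", round(right_95))
print("80% interval:", round(left_80), "to", round(right_80))
```

Comparing the two intervals illustrates the bullet above: the 95% interval is at least as wide as the 80% interval because it keeps more of the bootstrap medians.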

### Demo: Confidence Intervals

Make an interval based on the middle 95% of the bootstrap medians.

In [None]:
left = ...
right = ...

Table().with_column(
    'Bootstrap Medians', bootstrap_medians
).hist('Bootstrap Medians', bins=20)

plots.plot([left, right], [0,0], color="gold",lw=6, zorder=3);
plots.scatter(pop_median, 0, color="red", s=60, zorder=4);
plots.scatter(sample_median, 0, color="blue", s=60, zorder=4);

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>