In [None]:
from datascience import *
%matplotlib inline
path_data = '../../../assets/data/'
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

# Lecture 23 #

## A. Percentiles ##

In [None]:
# Example: Manually compute the 55th percentile for the array x
x = make_array(43, 20, 51, 7, 28, 34)

In [None]:
# Step 1. Sort the data using np.sort()
...

In [None]:
# Step 2. Figure out where the 55th percentile would be, as discussed in the slides
(55/100) 
...

In [None]:
# Step 3. 3.3 rounds up to 4; the 4th item (i.e., index 3) of the
# sorted array is the 55th percentile
...

In [None]:
# Alternatively: Find the answer with one line of code using `percentile`
...

We will generally use the `percentile` function to find percentiles. But to be able to interpret the results we get back when calling this function, we need to understand the definition it's based on. That's why we went through the "long way" to find a percentile for a numerical array, just to be sure we're all clear on the idea of percentiles.
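The three manual steps above can be sketched in plain Python. This is a minimal illustration of the definition used in this lecture, not the `datascience` library's own code: `nearest_rank_percentile` is a hypothetical helper name, and the sketch assumes `0 < p <= 100`.

```python
import math

def nearest_rank_percentile(p, values):
    """Return the smallest value that is at least as large as p% of the values.

    Hypothetical helper for illustration only; assumes 0 < p <= 100.
    """
    ordered = sorted(values)                  # Step 1: sort the data
    k = math.ceil(p * len(ordered) / 100)     # Step 2: find the position, rounding up
    return ordered[k - 1]                     # Step 3: the k-th item sits at index k - 1

x = [43, 20, 51, 7, 28, 34]
result = nearest_rank_percentile(55, x)
print(result)
```

With six values, 55% of 6 is 3.3, which rounds up to 4, so the result is the 4th item of the sorted data. A percentile computed this way is always one of the values in the array, never an average of two of them.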

**Back to Slides...**

## B. Discussion Question about Percentiles

In [None]:
s = make_array(1, 3, 5, 7, 9)

In [None]:
percentile(10, s) == 0

In [None]:
percentile(39, s) == percentile(40, s)

In [None]:
percentile(40, s) == percentile(41, s)

In [None]:
percentile(50, s) == 5

**Back to slides...**

## C. Estimation: Total Compensation in Population 

We have data for City of San Francisco employees in 2019, which we've looked at previously. Consider the variable "Total Compensation".

In [None]:
sf = Table.read_table('san_francisco_2019.csv')
sf.show(3)

In [None]:
# Which employees made the most money?
sf.sort('Total Compensation', descending=True).show(5)

In [None]:
# Which employees earned the least?
sf.sort('Total Compensation', descending=False).show(5)

In [None]:
# Some rows in the table show $0 total compensation.
# Let's keep only employees whose salary is above the amount that
# would correspond to $15/hr, 20 hr/wk, for 50 weeks:

min_salary = 15 * 20 * 50
sf = sf.where('Salary', are.above(min_salary))

In [None]:
# Here's the size of our population
sf.num_rows

In [None]:
# Visualize the population
sf_bins = np.arange(0, 726000, 25000)
sf.hist('Total Compensation', bins=sf_bins)

### Parameter: Median Total Compensation 

The *median* of a numerical distribution is the 50th percentile. Let's say we're interested in the median total compensation for our population.

In [None]:
pop_median = ...
pop_median
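One detail worth keeping in mind: under the percentile definition used in this lecture, the 50th percentile is always an element of the data, while many "median" functions (including Python's `statistics.median` and NumPy's `np.median`) average the two middle values when the number of values is even. The sketch below compares the two on small made-up lists; `fiftieth_percentile` is a hypothetical helper, not a library function.

```python
import math
import statistics

def fiftieth_percentile(values):
    """50th percentile under the round-up definition (hypothetical helper)."""
    ordered = sorted(values)
    k = math.ceil(50 * len(ordered) / 100)    # position of the 50th percentile
    return ordered[k - 1]

odd = [1, 3, 5, 7, 9]     # odd length: both definitions agree
even = [1, 3, 5, 7]       # even length: the definitions can differ

odd_pair = (fiftieth_percentile(odd), statistics.median(odd))
even_pair = (fiftieth_percentile(even), statistics.median(even))
print(odd_pair, even_pair)
```

For large data sets such as this population the two versions are usually very close, so either one serves as "the median" for estimation purposes.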

Now imagine we don't have the population data, only a random sample from the population. How could we use the sample data to estimate the median total compensation for the population?

### Estimating the Parameter (Pretend It Is Unknown)

In [None]:
# Take a random sample of size 400, without replacement, from `sf`
our_sample = sf.sample(400, with_replacement=False)

# Visualize the distribution of our sample
our_sample.hist('Total Compensation', bins=sf_bins)

Compare the histogram for the sample and the histogram for the population. Do they look the same?

In [None]:
# Calculate a sample statistic to estimate the population median 
estimate = ...
estimate

**Back to slides...**

## D. Sampling Variability

In [None]:
# Run this cell repeatedly to see how the sample median varies over
# different random samples from the population
new_sample = sf.sample(400, with_replacement=False)
np.median(new_sample.column('Total Compensation'))

We see that, in theory, we could sample the population over and over again to make a distribution of estimates for the population median. That would give us a good sense of how far off any one estimate is likely to be.
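To make the thought experiment concrete, here is a small simulation that uses only the standard library. The population is synthetic (a made-up skewed distribution standing in for compensation values, not the San Francisco data), and the sample size of 400 and the 200 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)    # fixed seed so the run is repeatable

# Hypothetical skewed population of 10,000 "compensation" values
population = [rng.lognormvariate(11.5, 0.5) for _ in range(10_000)]
pop_median = statistics.median(population)

# Draw many samples from the population and record each sample median
sample_medians = []
for _ in range(200):
    sample = rng.sample(population, 400)    # random sample without replacement
    sample_medians.append(statistics.median(sample))

print(round(pop_median))
print(round(min(sample_medians)), round(max(sample_medians)))
```

The spread between the smallest and largest sample median gives a rough sense of how far a single estimate could land from the parameter. This works only because the simulation has access to the whole population.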

But in the real world we can't keep going back to the population to draw new samples over and over; it's too expensive in time and money.

Question: How can we generate a new random sample *without going back to the population?*

**Back to slides...**

## E. The Bootstrap

Sample randomly
 - from the original sample
 - with replacement
 - the same number of times as the original sample size
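The three rules above can be sketched with the standard library's `random.choices`, which draws with replacement. The "original sample" here is synthetic data made up for illustration, not the San Francisco sample.

```python
import random

rng = random.Random(1)    # fixed seed so the run is repeatable

# Hypothetical original sample of 400 values (a stand-in for real sample data)
original_sample = [rng.lognormvariate(11.5, 0.5) for _ in range(400)]

# One bootstrap resample: drawn from the sample, with replacement, same size
resample = rng.choices(original_sample, k=len(original_sample))

distinct = len(set(resample))
print(len(resample), distinct)
```

Because the draws are made with replacement, some original values appear more than once and others not at all, so the resample contains fewer distinct values than the original sample. That is what makes each resample differ from the original sample even though both have the same size.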

In [None]:
# Resample and Visualize
size = our_sample.num_rows
bootstrap_sample = our_sample.sample(size, with_replacement=True)
# Note: could also call it as our_sample.sample()

bootstrap_sample.hist('Total Compensation', bins=sf_bins)

### Bootstrap Sample Median
This is one estimate of the population median.

In [None]:
# Record the process of resampling and finding the median
def one_bootstrap_median(our_sample):
    # draw the bootstrap sample
    size = ...
    resample = ...
    # return the median total compensation in the bootstrap sample
    return ...

one_bootstrap_median(our_sample)

In [None]:
# Generate an array of the medians of 1000 bootstrap samples
num_repetitions = 1000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
    stat = one_bootstrap_median(our_sample)
    bstrap_medians = np.append(bstrap_medians, stat)

In [None]:
# Visualize with a histogram
resampled_medians = (
    Table()
    .with_column('Bootstrap Sample Median', bstrap_medians)
)
median_bins = np.arange(120000, 160000, 2000)
resampled_medians.hist(bins=median_bins)

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');

### Percentile Method: Middle 95% of the Bootstrap Estimates

To quantify the amount of error we anticipate when using our one sample median to estimate the population median, we can take the middle 95% of the bootstrap estimates. This gives us an interval estimate for the population median rather than just a single number.

In [None]:
# Find the endpoints of the middle 95% of the bootstrap medians
left = ...
right = ...

make_array(left, right)
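The middle 95% leaves out 2.5% of the bootstrap estimates in each tail, so its endpoints are the 2.5th and 97.5th percentiles. Below is an end-to-end sketch on synthetic data using only the standard library. `nearest_rank_percentile` is a hypothetical helper that follows the round-up definition from Part A, and the sample values are made up, so the printed interval is not an answer for the San Francisco data.

```python
import math
import random
import statistics

def nearest_rank_percentile(p, values):
    """Percentile under the round-up definition (hypothetical helper)."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)
    return ordered[k - 1]

rng = random.Random(2)    # fixed seed so the run is repeatable

# Hypothetical original sample of 400 values
original_sample = [rng.lognormvariate(11.5, 0.5) for _ in range(400)]

# Median of each of 1000 bootstrap resamples
bootstrap_medians = []
for _ in range(1000):
    resample = rng.choices(original_sample, k=len(original_sample))
    bootstrap_medians.append(statistics.median(resample))

# Endpoints of the middle 95% of the bootstrap medians
left = nearest_rank_percentile(2.5, bootstrap_medians)
right = nearest_rank_percentile(97.5, bootstrap_medians)
inside = sum(left <= m <= right for m in bootstrap_medians)
print(left, right, inside)
```

The count `inside` confirms that roughly 95% of the 1000 bootstrap medians fall between the two endpoints.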

Because this was based on the middle 95% of the bootstrap medians, we refer to it as a **95% confidence interval** for the population median. With 95% confidence (we'll talk more about what that means, exactly), we report that the population median total compensation is between ________ and ________ dollars.

In [None]:
# Show the interval in yellow below the histogram of bootstrap medians
resampled_medians.hist(bins = median_bins)

# These next lines are plotting parameters; you can ignore
plots.ylim(-0.000005, 0.00014)
plots.plot([left, right], [0, 0], color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);

**One last slide...**