In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import warnings
warnings.simplefilter("ignore")

## New material

From your text:

> [SF OpenData](https://datasf.org/opendata/) is a website where the City and County of San Francisco make some of their data publicly available. One of the data sets contains compensation data for employees of the City. These include medical professionals at City-run hospitals, police officers, fire fighters, transportation workers, elected officials, and all other employees of the City.

In [None]:
population = Table.read_table('san_francisco_2019.csv') 

In [None]:
population.sample(5, with_replacement = False)

In [None]:
population.num_rows

### Exploration

In [None]:
population.sort('Total Compensation', descending=True)

**Discussion Question 1: Visualization**

In [None]:
population.hist('Total Compensation', bins = np.arange(0, 800000, 25000))

In [None]:
population.sort('Total Compensation')

#### How can we decide what people to include in today's problem? 

- One option: keep only employees whose salary exceeds what a half-time worker earning minimum wage would make in a year:
    - $15/hr, 20 hr/wk, 50 weeks

In [None]:
min_salary = 15 * 20 * 50
min_salary

In [None]:
population = population.where('Salary', are.above(min_salary))
population

In [None]:
population.hist('Total Compensation', bins = np.arange(0, 800000, 25000))

- Population parameter for today: The *median* total compensation of all City employees of San Francisco (in 2019).
- If you have the entire population, just calculate the parameter. 

In [None]:
pop_median = percentile(50, population.column('Total Compensation'))
pop_median

**STOP**

### Let's change the problem statement slightly:

> *Can we find a range of values that we strongly believe contains the population parameter?*

A sample from the population:

In [None]:
sample_size = 400
a_sample = population.sample(sample_size, with_replacement=False)
a_sample.hist('Total Compensation', bins = np.arange(0, 800000, 25000))

In [None]:
percentile(50, a_sample.column('Total Compensation'))

**STOP**

### How should we draw a bootstrap sample?

In [None]:
bootstrap_sample = a_sample.sample(k = a_sample.num_rows, 
                                   with_replacement = True)

In [None]:
bootstrap_sample.hist('Total Compensation', bins = np.arange(0, 800000, 25000))

In [None]:
percentile(50, bootstrap_sample.column('Total Compensation'))
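
Why resample *with* replacement? A minimal sketch, using only the Python standard library and a small made-up list (`toy_sample` is a hypothetical stand-in for the 'Total Compensation' column of `a_sample`, not real data). Drawing a full-size sample without replacement only reshuffles the same values, so its median never changes. Drawing with replacement repeats some values and leaves others out, so the median can vary from one resample to the next.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for a_sample's 'Total Compensation' values
toy_sample = [52000, 61000, 75000, 88000, 93000,
              104000, 120000, 131000, 150000, 210000]

# Without replacement, at full size: a reshuffle of exactly the same values
reshuffled = random.sample(toy_sample, k=len(toy_sample))

# With replacement, at full size: some values repeat, others are left out
resampled = random.choices(toy_sample, k=len(toy_sample))

# The reshuffle has the same values, hence the same median, every time
print(sorted(reshuffled) == sorted(toy_sample))
print(statistics.median_low(reshuffled) == statistics.median_low(toy_sample))

# The resample is the same size but usually a different mix of values
print(sorted(resampled))
print(statistics.median_low(resampled))
```

This is why the cell above passes `with_replacement = True` and keeps `k` equal to the size of the original sample.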

In [None]:
def one_bootstrap_median():
    # draw the bootstrap sample
    resample = a_sample.sample(k = a_sample.num_rows, with_replacement = True)
    # return the median total compensation in the bootstrap sample
    return percentile(50, resample.column('Total Compensation'))

In [None]:
bootstrapped_sample_medians = make_array()
num_resamples = 1000

for i in np.arange(num_resamples):
    new_median = one_bootstrap_median()
    bootstrapped_sample_medians = np.append(bootstrapped_sample_medians, new_median)

In [None]:
bootstrapped_sample_medians

Because this dataset is our entire population, we already know the true parameter. We can use it to check whether our estimation method is doing a good job.

In [None]:
bootstrapped_median_table = Table().with_column('Bootstrapped Sample Median', bootstrapped_sample_medians)
bootstrapped_median_table.hist(bins = np.arange(125000, 155000, 2000))

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Do our bootstrapped medians cover the true value?');

**STOP**

### Percentiles help us describe ordered lists

**Discussion Questions** 

- Which statements are true when `s = array([1, 5, 7, 3, 9])`?

1. The 50th percentile of `s` is 5.
2. The 50th percentile of `s` is 6.
3. The 39th percentile of `s` is the same as the 40th percentile of `s`. 
4. The 40th percentile of `s` is the same as the 41st percentile of `s`. 

In [None]:
s = make_array(1,5,7,3,9)

In [None]:
percentile(50, s) == 5

In [None]:
percentile(50, s) == 6

In [None]:
percentile(39, s) == percentile(40, s)

In [None]:
percentile(40, s) == percentile(41, s)
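
To see why these checks come out the way they do, here is a minimal sketch of the rule, assuming `percentile` follows the definition in the course text: the p-th percentile is the smallest value that is at least as large as p% of the values. The name `percentile_sketch` is hypothetical and is not part of the `datascience` library.

```python
import math

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p percent of the values."""
    ordered = sorted(values)
    # 1-indexed position of the answer in the sorted list;
    # max(1, ...) makes the 0th percentile return the minimum
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

s = [1, 5, 7, 3, 9]          # sorted: 1, 3, 5, 7, 9

# 50% of 5 is 2.5, rounded up to position 3
print(percentile_sketch(50, s))                                # 5

# 39% of 5 is 1.95 and 40% of 5 is exactly 2: both give position 2
print(percentile_sketch(39, s) == percentile_sketch(40, s))    # True

# 41% of 5 is 2.05, which rounds up to position 3
print(percentile_sketch(40, s) == percentile_sketch(41, s))    # False
```

Under this rule a percentile is always one of the values in the list, so the 50th percentile of `s` cannot be 6.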

In [None]:
left = percentile(2.5, bootstrapped_sample_medians)
right = percentile(97.5, bootstrapped_sample_medians)

make_array(left, right)
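
The 2.5th and 97.5th percentiles mark off the middle 95% of the bootstrapped medians. The whole procedure can be sketched end to end without the `datascience` library. Everything here is an assumption made for illustration: `toy_sample` is synthetic data (not the San Francisco table), and `percentile_sketch` is a hypothetical helper that mimics the course definition of a percentile.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p percent of the values."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

# Hypothetical sample: 400 synthetic, right-skewed compensation values
toy_sample = [random.lognormvariate(11.7, 0.5) for _ in range(400)]

# Median of each of 1,000 resamples drawn with replacement
medians = []
for _ in range(1000):
    resample = random.choices(toy_sample, k=len(toy_sample))
    medians.append(percentile_sketch(50, resample))

# Middle 95%: cut 2.5% off each tail of the bootstrapped medians
left = percentile_sketch(2.5, medians)
right = percentile_sketch(97.5, medians)
print(left, right)
```

Cutting 2.5% from each end leaves 95% in the middle. A 90% interval would use the 5th and 95th percentiles instead.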

_____

In [None]:
t = make_array(1,3,3,7,9)

In [None]:
percentile(40, t)

In [None]:
percentile(60, t)

**STOP**

### The confidence interval is a tool for estimation

In [None]:
bootstrapped_median_table.hist(bins = np.arange(125000, 150000, 2000))

# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);
plots.title('We are 95 percent confident that \n the parameter lies within the yellow bounds');

In practice, when we make confidence intervals, we do not know the true value of the parameter. We have it today only for teaching purposes. In a real analysis we cannot check whether the interval built from the bootstrapped statistics covers the parameter; we can only hope that it does.
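
One way to see what "95 percent confident" is claiming is to repeat the whole process many times on a population where the parameter is known, and count how often the interval covers it. The following is a rough sketch under stated assumptions: `toy_population` is synthetic, `percentile_sketch` and `bootstrap_interval` are hypothetical helpers, and the sample size and number of resamples are kept small so it runs quickly.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p percent of the values."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

def bootstrap_interval(sample, num_resamples=200):
    """Middle 95% of bootstrapped medians from one sample."""
    medians = [
        percentile_sketch(50, random.choices(sample, k=len(sample)))
        for _ in range(num_resamples)
    ]
    return percentile_sketch(2.5, medians), percentile_sketch(97.5, medians)

# Hypothetical population with a known median
toy_population = [random.lognormvariate(11.7, 0.5) for _ in range(5000)]
true_median = percentile_sketch(50, toy_population)

# Draw a fresh sample each time, build an interval, and check coverage
num_intervals = 50
covered = 0
for _ in range(num_intervals):
    sample = random.sample(toy_population, k=100)
    left, right = bootstrap_interval(sample)
    if left <= true_median <= right:
        covered += 1

print(covered / num_intervals)  # proportion of intervals that cover the median
```

If the method works as intended, this proportion should be in the neighborhood of 0.95, though with only 50 intervals it will vary noticeably from run to run.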

**Discussion Question 6**