In [None]:
from datascience import *
%matplotlib inline
path_data = '../../../assets/data/'
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

# Lecture 24 : Confidence Intervals

## Review: Bootstrap 
We have developed a method for estimating a parameter by using random sampling and the bootstrap. Our method produces an interval of estimates, to account for chance variability in the random sample. By providing an interval of estimates instead of just one estimate, we give ourselves some wiggle room.

In the previous example we saw that our process of estimation produced a good interval about 95% of the time, a “good” interval being one that contains the parameter. We say that we are 95% confident that the process results in a good interval. Our interval of estimates is called a **95% confidence interval** for the parameter, and 95% is called the **confidence level** of the interval.

The method is called the **bootstrap percentile method** because the interval is formed by picking off two percentiles of the bootstrapped estimates.
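
As a rough illustration of the idea, here is a minimal sketch of the percentile method written with plain NumPy instead of the `datascience` table functions used in this notebook. The sample is made-up data, and the name `bootstrap_percentile_interval` is hypothetical. Note that `np.percentile` interpolates between values by default, so its endpoints can differ slightly from those of the `percentile` function used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 200 draws from a skewed distribution (not the notebook's data)
sample = rng.exponential(scale=50_000, size=200)

def bootstrap_percentile_interval(values, statistic, level=95, repetitions=1000):
    """Return the (left, right) endpoints of a bootstrap percentile interval."""
    estimates = np.empty(repetitions)
    for i in range(repetitions):
        # Resample the same number of values, with replacement
        resample = rng.choice(values, size=len(values), replace=True)
        estimates[i] = statistic(resample)
    # Pick off the two percentiles that bound the middle `level` percent
    tail = (100 - level) / 2
    left, right = np.percentile(estimates, [tail, 100 - tail])
    return left, right

left, right = bootstrap_percentile_interval(sample, np.median)
print(left, right)
```

Running the sketch with a different seed gives a different interval, which is the chance variability the interval is meant to account for.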

In [None]:
# Read the data -- City of San Francisco Employees
sf = Table.read_table('san_francisco_2019.csv')

# Eliminate rows with very low salary figures
min_salary = 15 * 20 * 50
sf = sf.where('Salary', are.above(min_salary))

In [None]:
# Our population has 37103 individuals
sf.num_rows

In [None]:
# Visualize the population distribution for Total Compensation
sf_bins = np.arange(0, 726000, 25000)
sf.hist('Total Compensation', bins=sf_bins)

Make a function `median_comp()` that takes a table `t` which includes a column labeled 'Total Compensation'; the function returns the **median** for the 'Total Compensation' column.

In [None]:
def median_comp(t):
    return ...   # return the appropriate percentile

# call the function to find the population median for our current example
pop_median = median_comp(sf)
pop_median

## Bootstrap Estimates of the Parameter (Pretend it is Unknown) 
We don't need to estimate the population median in this case -- we just computed it exactly! But the idea is to pretend we don't know it, make a confidence interval estimate, and then compare that estimate with the actual computed value. This lets us practice the Bootstrap Percentile Method for finding a confidence interval, in a context where we actually know the right answer.

In [None]:
# Warm up: collect a random sample of size 100 (without replacement) and find the 
# sample median
our_sample = ...
print("Sample median is:", ...)


# run the cell several times -- you should see random variation in the results

Define a function, `confidence_interval_95`, which uses the Bootstrap Percentile Method with 1000 bootstrap resamples to find the 95% confidence interval for the population median based on a given sample size.

In [None]:
def confidence_interval_95(sample_size):
    # Collect one random sample from the population (without replacement)
    our_sample = ...
    
    # Generate the medians of 1000 bootstrap samples
    num_repetitions = 1000
    bstrap_medians = make_array()
    for ...
        # Resample the same number of rows, with replacement
        resample = ...
        bstrap_medians = ...

    # Find the middle 95% of the medians; that's a confidence interval
    left = percentile(..., bstrap_medians)
    right = percentile(..., bstrap_medians)
    return make_array(left, right)

# Run this cell several times and notice the random variation
confidence_interval_95(100)

In [None]:
# Multiple times, compute a 95% confidence interval with sample size 40
# Keep an array of lower bounds and another array of upper bounds
lower = make_array()
upper = make_array()
reps = 50
for ...
    ci = ...
    lower = ...
    upper = ...

print(lower)
print(upper)

In [None]:
# A confidence interval is "correct" if it captures the population parameter. Here is
# a helper function for determining the correctness of an entire array of lower bounds
# and upper bounds
def is_correct(L, U, A):
    '''
    Parameters: L, U, A are numbers
    Returns:    True if A is between L and U, otherwise False
    '''
    return ...

is_correct(3, 5, 9)

In [None]:
# NumPy syntax: np.ones() is helpful for making an array of 1's
np.ones(5)

In [None]:
# We can use it to make an array of length `reps`, all the same value
np.ones(reps) * 17

In [None]:
# Make a table showing the population parameter, each interval's bounds, and 
# whether the interval is "correct"
# A correct interval contains the parameter (pop_median) within its bounds
actual_parameter = np.ones(reps) * pop_median
intervals_tbl = (Table().with_column('Lower', lower)
                 .with_column('Parameter', actual_parameter)
                 .with_column('Upper', upper))
intervals_tbl

In [None]:
# Make a boolean array for the 'Correct' column and add it to the intervals_tbl
correct = ...
intervals_tbl = intervals_tbl.with_column('Correct', correct)
intervals_tbl

**Claim**: This process of estimation captures the parameter about 95% of the time.

To do: Use `tbl.group()` to check on the success rate we achieved in practice:

In [None]:
intervals_tbl.group('Correct')

Is our actual success rate compatible with "about 95%"?
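
One way to think about this question is to simulate how much the count of good intervals could vary by chance alone. The sketch below assumes that each of the `reps` intervals is good with chance exactly 0.95, independently of the others; that is an idealization, not something measured from the data above.

```python
import numpy as np

rng = np.random.default_rng(3)

reps = 50            # number of intervals, as in the cells above
chance_good = 0.95   # assumed chance that any single interval is good

# Simulate the number of good intervals out of `reps`, 10,000 times over
good_counts = rng.binomial(reps, chance_good, size=10_000)

# Typical value and a range that holds the middle 95% of simulated counts
print(np.mean(good_counts))
print(np.percentile(good_counts, [2.5, 97.5]))
```

If the count in your own table falls inside the printed range, it is consistent with the 95% claim under these assumptions.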

## Confidence Interval for an Unknown Population Mean
The whole point of statistics is to estimate **unknown** population parameters. If we already know the parameter, we don't need to estimate it. 

Let's try an example of estimating an unknown population mean. Start by loading the `baby.csv` data:

In [None]:
# Treat this as a random sample of mother-newborn pairs
births = Table.read_table('baby.csv')

In [None]:
# Visualize the sample 
births.hist('Maternal Age')

In [None]:
# Calculate the sample mean maternal age
...

This is a single-number estimate of the unknown population mean.

### Question
What is the average age of ALL the mothers in the population? Let's use the Bootstrap Percentile Method, just as we did previously for a population median. We can calculate a 95% confidence interval for the population mean by resampling from our sample many times and computing percentiles.

Start by making a function for a single bootstrap mean (resample from the sample, WITH replacement, then return the average maternal age from the resample).

In [None]:
def one_bootstrap_mean():
    resample = ...
    resample_mean = ...
    return resample_mean

In [None]:
# Try calling the function several times; there should be random variation
one_bootstrap_mean()

In [None]:
# Generate means from 3000 bootstrap samples
repetitions = 3000
bstrap_means = make_array()
for ...
    bstrap_means = ...

### Bootstrap Percentile Method for Confidence Interval

The interval of estimates is the "middle 95%" of the bootstrap estimates.

This is called a *95% confidence interval* for the mean age in the population.

In [None]:
# Use `percentile()` to get the endpoints of the 95% confidence interval
left = ...
right = ...

make_array(left, right)

In [None]:
# Visualize the means in a histogram
resampled_means = Table().with_columns(
    'Bootstrap Sample Mean', bstrap_means
)
resampled_means.hist(bins=15)

# Use a bold yellow line to show the confidence interval (central 95% of the distribution)
plots.plot([left, right], [0, 0], color='yellow', lw=8);

In [None]:
# Visualize the confidence interval for the population mean below
# the distribution of the sample (instead of the distribution of the bootstrap means)
births.hist('Maternal Age')
plots.plot([left, right], [0, 0], color='yellow', lw=8);

## Using Confidence Intervals for Testing Hypotheses
When the alternative hypothesis is two-sided, we can use a confidence interval to decide a hypothesis test. 
  - If we are looking for a p-value below 5% in order to reject the null, we use a 95% confidence interval to decide the question.
  - What confidence level would we use for our interval if we were looking for a p-value below 1%?
  
For example, consider these hypotheses in the context of mothers and babies:

  - **Null:** The average age of mothers in the population is 25 years; the random sample average is different from 25 just due to chance.

  - **Alternative:** The average age of the mothers in the population is not 25 years.

Suppose you use the 5% cutoff for the p-value.

Based on the confidence interval, which hypothesis would you pick? What would be a reasonable conclusion?
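
The decision rule can be sketched as a small check: with a 5% cutoff and a 95% confidence interval, the null is rejected when the value it specifies falls outside the interval. The function name `null_is_rejected` and the interval endpoints below are hypothetical and are not computed from the births data.

```python
def null_is_rejected(null_value, left, right):
    """Two-sided test using a confidence interval:
    reject the null when its value lies outside [left, right]."""
    return not (left <= null_value <= right)

# Hypothetical 95% interval endpoints, for illustration only
hypothetical_left, hypothetical_right = 10.2, 11.6

print(null_is_rejected(11.0, hypothetical_left, hypothetical_right))  # False
print(null_is_rejected(12.5, hypothetical_left, hypothetical_right))  # True
```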

## Estimating a Population Proportion
In the sample, 39% of the mothers smoked during pregnancy:

In [None]:
births.where('Maternal Smoker', are.equal_to(True)).num_rows / births.num_rows

Remember that a proportion is an average of zeros and ones. So the proportion of mothers who smoked could also be calculated using array operations as follows.
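
To see why this works, here is a small sketch on a made-up Boolean array (not the births data). When NumPy averages Booleans, `True` counts as 1 and `False` as 0, so the average equals the proportion of `True` values.

```python
import numpy as np

# Hypothetical Boolean array: True marks a "yes"
flags = np.array([True, False, False, True, False])

# Counting approach: number of True values divided by the total
by_counting = np.count_nonzero(flags) / len(flags)

# Averaging approach: True is treated as 1 and False as 0
by_averaging = np.average(flags)

print(by_counting, by_averaging)
```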

In [None]:
np.average(...)

Let's use the Bootstrap Percentile Method to estimate the proportion of mothers in the population who smoked during pregnancy. As usual, we start with a function to find one bootstrap proportion.

In [None]:
def one_bootstrap_proportion():
    resample = births.sample(...)
    smoking = resample.column('Maternal Smoker')
    return ...

In [None]:
# Test the new function -- there should be random variation
one_bootstrap_proportion()

In [None]:
# Generate proportions from 5000 bootstrap samples
bstrap_proportions = make_array()
num_repetitions = 5000
for ...
    bstrap_proportions = ...

In [None]:
# Get the endpoints of the 95% confidence interval
left = ...
right = ...

make_array(left, right)

We estimate that somewhere between ___ and ___ percent of mothers in the population smoked during pregnancy.

Visualize the interval relative to the bootstrap proportions:

In [None]:
resampled_proportions = Table().with_columns(
    'Bootstrap Sample Proportion', bstrap_proportions
)
resampled_proportions.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);

## Take Care in Using the Bootstrap Percentile Method
The bootstrap is an elegant and powerful method. But before using it, it is important to keep some points in mind.

### Point 1
Use a **large random sample**. If you don't use a large sample, your results will be unreliable.

### Point 2
To get an accurate estimate for the probability distribution of a statistic, it is a good idea to replicate the resampling procedure **as many times as possible**.

### Point 3
Check that the probability distribution of the bootstrap statistics is roughly **bell-shaped**. If it's not, use a different method.

### Point 4
The bootstrap should not be used to try to estimate:

  - Maximum value or minimum value in the population
  - Any population parameter which is greatly influenced by "rare" elements in the population
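
A short sketch shows why the maximum is a poor target. The population and sample below are made-up data. No resample can contain a value larger than the largest value in the sample, so the bootstrap interval for the maximum can never reach above the sample maximum, even when the population maximum is larger.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population and a random sample drawn from it without replacement
population = rng.uniform(0, 100, size=100_000)
sample = rng.choice(population, size=500, replace=False)

# Bootstrap the sample maximum 2000 times
bstrap_maxes = np.array([
    rng.choice(sample, size=len(sample), replace=True).max()
    for _ in range(2000)
])

left, right = np.percentile(bstrap_maxes, [2.5, 97.5])

# The interval's right end is capped by the sample maximum
print(right <= sample.max())             # True
print(sample.max() <= population.max())  # True
```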

### Point 5
Do not misinterpret the interval. For example, suppose a 95% confidence interval for the mean height (in inches) of Hanover College (HC) students, based on a random sample, is reported to be (68.3, 70.1).

  - Correct interpretation: We are 95% confident that the mean height of HC students is in the range from 68.3 inches to 70.1 inches.
  - **Incorrect interpretation: About 95% of all HC students are between 68.3 and 70.1 inches tall.**
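
The difference between the two interpretations can be checked on made-up data. The sketch below uses hypothetical heights (not real HC measurements), builds a bootstrap 95% confidence interval for the mean, and then counts what fraction of the individual heights fall inside that interval.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample of 400 heights, in inches
heights = rng.normal(loc=69, scale=3, size=400)

# Bootstrap the sample mean 2000 times
bstrap_means = np.array([
    rng.choice(heights, size=len(heights), replace=True).mean()
    for _ in range(2000)
])
left, right = np.percentile(bstrap_means, [2.5, 97.5])

# Fraction of individual heights that lie inside the interval for the mean
inside = np.mean((heights >= left) & (heights <= right))
print(left, right, inside)
```

In this sketch the interval is narrow and holds only a small fraction of the individuals, because it describes where the mean is likely to be, not where individual values lie.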