In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Uncertainty and Confidence Intervals

### Estimation

Let's pretend that 20% of the Californian population has or has had COVID-19. We can simulate a random
sample of 1,000 people from the population using the code below.

In [None]:
# create a table for the probability distribution
covid_dist = Table().with_columns(
    'COVID-19', make_array('Positive', 'Negative'),
    'Population %', make_array(20, 80)
)
covid_dist

In [None]:
# Take a random sample of 1,000 people from the population.
# The result keeps the 'Population %' column and adds the sample counts in a 'Sample' column.
covid_dist = covid_dist.sample_from_distribution('Population %', 1000).relabel('Population % sample', 'Sample')
covid_dist
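
To make the sampling step less of a black box, here is a minimal sketch of the same idea using only Python's standard library instead of the `datascience` table API. It assumes that sampling from the distribution amounts to drawing 1,000 independent people who are each 'Positive' with probability 0.2. The helper name `draw_sample_counts` and the seed are illustrative choices, not part of this notebook.

```python
import random
from collections import Counter

def draw_sample_counts(categories, weights, size, seed=None):
    """Draw `size` independent category labels and return the count per category.

    Hypothetical helper for illustration; `weights` need not sum to 1.
    """
    rng = random.Random(seed)
    draws = rng.choices(categories, weights=weights, k=size)
    counts = Counter(draws)
    # Report counts in the same order as `categories`, including zeros.
    return [counts[c] for c in categories]

counts = draw_sample_counts(['Positive', 'Negative'], [20, 80], 1000, seed=0)
print(counts)                    # two counts that always sum to 1000
print(counts[0] / sum(counts))   # sample proportion of 'Positive'
```

Changing or removing the seed gives a different sample, and therefore different counts, each time.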

Now, let's forget that we know the true parameter. Let's attempt to estimate it.

First, we'll define our estimator, and then we'll compute the estimate using our sample from above.

In [None]:
# define the estimator
def sample_prop(tbl, sample_col):
    """Computes the sample proportion."""
    counts = tbl.column(sample_col)
    return counts.item(0) / sum(counts)

In [None]:
# compute the estimate
sample_covid = sample_prop(covid_dist, 'Sample')
sample_covid

If we run through all this code again, will the value of our estimate change? Yes, because it is based on a random sample!
Recall that
$$\text{Estimator}(\text{Sample}) = \text{Estimate} = \text{Parameter} + \text{Error}.$$
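
The relationship above can be illustrated with a small simulation. This is only a sketch, written with the standard library and under the assumption that each person in a sample is positive independently with probability 0.2. The names `TRUE_PROP`, `SAMPLE_SIZE`, and `simulate_estimate` are introduced here for illustration.

```python
import random

TRUE_PROP = 0.2      # the parameter (known here only because we are simulating)
SAMPLE_SIZE = 1000

def simulate_estimate(rng):
    """Return the proportion of positives in one simulated random sample."""
    positives = sum(rng.random() < TRUE_PROP for _ in range(SAMPLE_SIZE))
    return positives / SAMPLE_SIZE

rng = random.Random(42)  # seeded so the illustration is repeatable
estimates = [simulate_estimate(rng) for _ in range(5)]
errors = [est - TRUE_PROP for est in estimates]

for est, err in zip(estimates, errors):
    # Each line satisfies estimate = parameter + error.
    print(f"estimate = {est:.3f}, error = {err:+.3f}")
```

Each of the five samples produces a slightly different estimate, so the error changes from sample to sample even though the parameter stays fixed.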

So, how can we evaluate the variability of an estimator?

### The Bootstrap

Let's use the Bootstrap method to get a sense of how variable our estimator is.

In [None]:
# first, let's create a new table based on our initial sample
covid_sample_dist = Table().with_columns(
    'COVID-19', make_array('Positive', 'Negative'),
    'Sample %', make_array(sample_covid*100, (1-sample_covid)*100)
)
covid_sample_dist

In [None]:
# next, let's create a function that generates a bootstrap sample
def one_bootstrap_sample(tbl, size):
    """Generate a single bootstrap sample from a table."""
    return tbl.sample_from_distribution('Sample %', size).relabel('Sample % sample', 'Sample')

# make sure that the size of the bootstrap sample is identical to that of the original sample
one_bootstrap_sample(covid_sample_dist, 1000)

In [None]:
# finally, let's generate 1000 bootstrap samples, and compute an estimate for each

bootstrap_props = make_array()
repetitions = 1000

for i in np.arange(repetitions):
    boot_sample_tbl = one_bootstrap_sample(covid_sample_dist, 1000)
    boot_estimate = sample_prop(boot_sample_tbl, 'Sample')
    bootstrap_props = np.append(bootstrap_props, boot_estimate)
    
Table().with_column("Bootstrap Estimates", bootstrap_props).hist()

The distribution of our bootstrap estimates is centered on the sample proportion, which here lies close to the true
proportion of COVID-19 cases in the population (20%). The estimates also fall within a narrow range, so the estimator
does not appear to be particularly variable. We'll talk about more quantitative ways to measure this variability in Module 5.2.
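
The bootstrap cells above draw from the table of sample percentages. Another way to picture the same procedure, assuming the original sample is available as individual 0/1 outcomes, is to resample those 1,000 outcomes with replacement. The sketch below uses only the standard library. `bootstrap_proportions` and the 210/790 split are made-up illustrative values, not results from this notebook.

```python
import random

def bootstrap_proportions(observations, repetitions, seed=None):
    """Resample `observations` (a list of 0/1 values) with replacement.

    Returns one resampled proportion per repetition. Hypothetical helper
    for illustration only.
    """
    rng = random.Random(seed)
    n = len(observations)
    props = []
    for _ in range(repetitions):
        resample = rng.choices(observations, k=n)  # same size as the original sample
        props.append(sum(resample) / n)
    return props

# An assumed original sample: 210 positives (1) and 790 negatives (0).
original_sample = [1] * 210 + [0] * 790
boot_props = bootstrap_proportions(original_sample, repetitions=1000, seed=1)

print(min(boot_props), max(boot_props))   # range of the bootstrap estimates
print(sum(boot_props) / len(boot_props))  # close to the sample proportion, 0.21
```

Note that the bootstrap estimates cluster around the sample proportion (0.21 in this made-up example), not around the unknown population proportion.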

<br>

## Confidence Intervals

The Bootstrap method can also be used to compute a new kind of estimate: the confidence interval (CI). A CI is an
estimate that takes the form of a range of values rather than a single number.

Let's start by computing the approximate 95% CI for the true proportion of Californians with COVID-19 using the sample
from before:

In [None]:
bootstrap_props = make_array()
repetitions = 500

for i in np.arange(repetitions):
    boot_sample_tbl = one_bootstrap_sample(covid_sample_dist, 1000)
    boot_estimate = sample_prop(boot_sample_tbl, 'Sample')
    bootstrap_props = np.append(bootstrap_props, boot_estimate)

print("The approximate 95% CI is [", str(percentile(2.5, bootstrap_props)), ",",
      str(percentile(97.5, bootstrap_props)), "]")

Table().with_column("Bootstrap Estimates", bootstrap_props).hist()
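
The interval printed above is the middle 95% of the bootstrap estimates: everything between the 2.5th and 97.5th percentiles. The sketch below shows one way such percentiles can be computed by hand. It assumes a nearest-rank definition (the smallest value at least as large as p% of the data); the `percentile` function imported from `datascience` may differ in detail, and the names `nearest_rank_percentile` and `percentile_interval` are hypothetical.

```python
import math

def nearest_rank_percentile(p, values):
    """Return the smallest value that is at least as large as p% of `values`.

    Illustrative nearest-rank definition; other percentile conventions
    (for example, linear interpolation) can give slightly different answers.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the sorted list
    return ordered[rank - 1]

def percentile_interval(values, level=95):
    """Middle `level`% of `values`, e.g. the 2.5th to 97.5th percentiles for 95."""
    tail = (100 - level) / 2
    return (nearest_rank_percentile(tail, values),
            nearest_rank_percentile(100 - tail, values))

# Toy "bootstrap estimates": 0.001, 0.002, ..., 0.200
toy_estimates = [k / 1000 for k in range(1, 201)]
print(percentile_interval(toy_estimates))  # (0.005, 0.195)
```

With 200 toy values, the interval drops the 4 smallest and the 5 largest values, keeping roughly the middle 95%.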

The process that we defined above gives us a recipe to generate an estimate, the CI. That means we can
store it in a function, and define it as an interval *estimator*.

In [None]:
def ci_95_estimator(sample_tbl, sample_size, boot_reps):
    """Given a sample, this function computes an approximate 95% CI for the population proportion."""
    bootstrap_props = make_array()
    
    for i in np.arange(boot_reps):
        boot_sample_tbl = one_bootstrap_sample(sample_tbl, sample_size)
        boot_estimate = sample_prop(boot_sample_tbl, 'Sample')
        bootstrap_props = np.append(bootstrap_props, boot_estimate)
        
    return make_array(percentile(2.5, bootstrap_props), percentile(97.5, bootstrap_props))

ci_95_estimator(covid_sample_dist, 1000, 500)

We now know how to compute confidence intervals, but how do we interpret them?

For a 95% CI, we say that if we were to draw many, many random samples and compute a 95% CI from each one,
about 95% of those CIs would contain the true population proportion.

Here's an example:

In [None]:
contains_true_param = make_array()
iterations = 500

# draw 500 fresh random samples and compute a 95% CI for the true proportion from each
for i in np.arange(iterations):
    covid_sample = covid_dist.sample_from_distribution('Population %', 1000).relabel('Population % sample', 'Sample')
    sample_covid = sample_prop(covid_sample, 'Sample')
    covid_sample_dist = Table().with_columns(
        'COVID-19', make_array('Positive', 'Negative'),
        'Sample %', make_array(sample_covid*100, (1-sample_covid)*100)
    )
    ci_95 = ci_95_estimator(covid_sample_dist, 1000, 500)
    contains_true_param = np.append(contains_true_param, ci_95.item(0) < 0.2 < ci_95.item(1))

# Compute the number of intervals that do not contain the true parameter. We expect 25 on average.
iterations - np.count_nonzero(contains_true_param)
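
The figure of 25 in the comment above comes from simple arithmetic, worked through in the sketch below. It rests on an idealized assumption: each interval misses the true parameter independently with probability exactly 0.05. Bootstrap percentile intervals only approximately achieve this, so the count you observe may sit a little further from 25 than the calculation suggests.

```python
import math

iterations = 500      # number of confidence intervals computed above
miss_rate = 0.05      # a 95% CI misses the parameter about 5% of the time

# Assumption: each interval misses independently with probability 0.05,
# so the number of misses follows a Binomial(500, 0.05) distribution.
expected_misses = iterations * miss_rate
sd_misses = math.sqrt(iterations * miss_rate * (1 - miss_rate))

print(expected_misses)        # 25.0
print(round(sd_misses, 2))    # 4.87
```

Under these assumptions, a count anywhere from roughly 15 to 35 misses (about two standard deviations either side of 25) would be unsurprising.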