## Estimation

Adapted from "Teaching statistical inference with resampling," Copyright 2018 Allen Downey
License: http://creativecommons.org/licenses/by/4.0/

In [None]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline

# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)

Suppose we want to estimate the average height of men in the U.S.

We can use data from the [BRFSS](https://www.cdc.gov/brfss/index.html):

"The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services."

The following function reads the most recent data from BRFSS (2016), selects just the columns we need, and saves the results in an HDF file.

After we run this function, it is much faster to read the HDF file than the original data.

In [None]:
import pandas as pd

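def read_brfss():
    """Read the BRFSS dataset, select columns, and save as HDF.
    """
    # Read the SAS transport (XPT) file; this is the slow step
    df = pd.read_sas('LLCP2016.XPT')
    # Keep only the columns this notebook needs
    columns = ['SEX', 'HTM4', 'WTKG3', '_LLCPWT']
    selected = df[columns]
    # Pass the key by keyword; newer pandas versions require it
    selected.to_hdf('LLCP2016.HDF', key='brfss')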

In [None]:
read_brfss()

Now we can read the HDF file.  The result is a Pandas DataFrame.

In [None]:
df = pd.read_hdf('LLCP2016.HDF')
df.head()

Males are coded as SEX == 1 and females as SEX == 2. Select the male respondents into a new DataFrame:

In [None]:
males = ...
males.head()
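
As a hedged illustration of this kind of row selection, here is a minimal sketch using a boolean mask on a small made-up DataFrame. The frame `toy` and its values are invented for this example and are not BRFSS data; only the column names `SEX` and `HTM4` come from the text above.

```python
import numpy as np
import pandas as pd

# Made-up respondents: SEX is 1 (male) or 2 (female), HTM4 is a height.
toy = pd.DataFrame({
    'SEX': [1, 2, 1, 2, 1],
    'HTM4': [178.0, 163.0, np.nan, 170.0, 185.0],
})

# Comparing a column to a value yields a boolean Series (the mask).
is_male = toy['SEX'] == 1

# Indexing with the mask keeps only the rows where it is True.
toy_males = toy[is_male]
print(toy_males)
```

The original row labels are preserved in the result, so the selected rows can still be matched against the full DataFrame.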

Select the height data (in cm):

In [None]:
data = ...
data.head()

How many values are missing?
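
One possible way to count missing values, sketched on a made-up Series rather than the real height data. The name `toy_heights` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up heights in cm; NaN marks a missing measurement.
toy_heights = pd.Series([178.0, np.nan, 165.0, np.nan, 182.0])

# isna() flags the missing entries; summing the flags counts them.
n_missing = toy_heights.isna().sum()
print(n_missing, 'of', len(toy_heights), 'values are missing')
```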

The mean and standard deviation, ignoring missing data, are:

In [None]:
print('Mean male height in cm =', np.nanmean(data))
print('Std male height in cm =', np.nanstd(data))

## Quantifying precision

At this point we have an estimate of the average adult male height.  We'd like to know how accurate this estimate is, and how precise.  In the context of estimation, these words have a [technical distinction](https://en.wikipedia.org/wiki/Accuracy_and_precision):

*"Given a set of data points from repeated measurements of the same quantity, the set can be said to be precise if the values are close to each other, while the set can be said to be accurate if their average is close to the true value of the quantity being measured."*

Usually accuracy is what we really care about, but it's hard to measure accuracy unless you know the true value.  And if you know the true value, you don't have to estimate it.

Quantifying precision is not as useful, but it is much easier.  Here's one way to do it:

1.  Use the data you have to make a model of the population.

2.  Use the model to simulate the data collection process.

3.  Use the simulated data to compute an estimate.

4.  Repeat steps 2-3 many times and collect the results.

To model the population, we can use **resampling**; that is, we can treat the observed measurements as if they were taken from the entire population, and then draw random samples from them.

Here's a function that takes observed measurements and returns a new set of measurements with the same sample size.

With `replace=True`, we sample with replacement, which means that some measurements might be chosen more than once, and some might not be chosen at all.

If we sample *without* replacement at the same sample size, the resampled data is always just a reordering of the original, so every statistic comes out the same. That's no good.

In [None]:
def resample(data):
    size = len(data)
    return np.random.choice(data, size, 
                            replace=True)

To simulate an experiment, we run `resample` to generate data and `nanmean` to compute the mean (ignoring missing data).

In [None]:
resampled_data = resample(data)
np.nanmean(resampled_data)

Simulate 1000 experiments and collect the results.

In [None]:
sampling_dist_mean = ...
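
One way such a simulation might look, sketched on synthetic data so that it runs on its own. The array `toy_heights` is invented (normally distributed values with a few missing entries), and `toy_means` is a hypothetical name; `resample` repeats the definition given above.

```python
import numpy as np

def resample(data):
    # Same idea as the `resample` function above: draw len(data)
    # values with replacement.
    return np.random.choice(data, len(data), replace=True)

np.random.seed(4)

# Synthetic stand-in for the height data (invented values, in cm),
# with a few entries marked as missing.
toy_heights = np.random.normal(178, 8, size=500)
toy_heights[:10] = np.nan

# Each iteration simulates one experiment and records its mean.
toy_means = np.array([np.nanmean(resample(toy_heights))
                      for _ in range(1000)])
print(len(toy_means), toy_means.mean(), toy_means.std())
```

Collecting the results in a NumPy array makes it easy to pass them to functions like `np.std` and `np.percentile` later.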

The result is the "sampling distribution", which shows how much the results of the experiment would vary if we ran it many times. Represent the sampling distribution graphically with a histogram:
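
The plotting cells in this section call `plot_hist` and `fill_hist`, which are not defined here. The following is a minimal sketch of what such helpers might look like; these definitions are assumptions written for illustration, not necessarily the ones the original notebook used.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(values):
    """Plot a step histogram and return its outline (hypothetical helper)."""
    # With histtype='step', plt.hist returns a list holding one polygon.
    _, _, patches = plt.hist(values, density=True, histtype='step',
                             linewidth=3, alpha=0.5, color='C0')
    plt.ylabel('Density')
    return patches[0]

def fill_hist(low, high, patch):
    """Shade the span from low to high, clipped to the histogram outline."""
    plt.axvspan(low, high, clip_path=patch, alpha=0.5, color='C0')
```

Returning the outline from `plot_hist` lets `fill_hist` clip the shaded band to the shape of the histogram instead of filling the full height of the axes.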

In [None]:
plot_hist(sampling_dist_mean)
plt.title('Sampling distribution of the mean')
plt.xlabel('Mean adult male height, U.S. (cm)');

The width of this distribution shows how much the results vary from one experiment to the next. We can quantify this variability by computing the standard deviation of the sampling distribution, which is called "standard error".


In [None]:
std_err = np.std(sampling_dist_mean)

We can also summarize the sampling distribution with a "confidence interval", which is a range that contains a specified fraction, like 90%, of the values in `sampling_dist_mean`.

The central 90% confidence interval is between the 5th and 95th percentiles of the sampling distribution.

In [None]:
ci_90 = np.percentile(sampling_dist_mean, [5, 95])

The following function plots a histogram and shades the 90% confidence interval.

In [None]:
def plot_sampling_dist(dist):
    patch = plot_hist(dist)
    low, high = np.percentile(dist, [5, 95])
    fill_hist(low, high, patch)
    print('Mean = ', np.mean(dist))
    print('Std error = ', np.std(dist))
    print('90% CI = ', (low, high))

Here's what it looks like for the sampling distribution of mean adult height:

In [None]:
plot_sampling_dist(sampling_dist_mean)
plt.xlabel('Mean adult male height, U.S. (cm)');

For an experiment like this, we can compute the standard error analytically.
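
For reference, this is the standard formula for the standard error of a sample mean. The symbols are notation introduced here, not taken from the notebook: $s$ is the standard deviation of the observed values, $n$ is the number of non-missing values, and $\bar{x}$ is the sample mean.

$$\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}$$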

In [None]:
def analytic_stderr(data):
    # Count only the non-missing values, consistent with nanmean/nanstd
    size = np.count_nonzero(~np.isnan(data))
    return np.nanstd(data) / np.sqrt(size)

The result is close to what we observed computationally.

In [None]:
analytic_stderr(data), std_err

## Other sample statistics

One nice thing about using computation is that it is easy to compute the sampling distribution for other statistics.

For example, suppose we want to estimate the coefficient of variation for adult male height (standard deviation as a percentage of the mean).  We can define a function to compute it:

In [None]:
def coef_var(data):
    return np.nanstd(data) / np.nanmean(data) * 100

And estimate the sampling distribution by running simulated experiments.

In [None]:
sampling_dist_cv = ...
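
Because the statistic is just a function of the resampled data, the simulation can be written once and reused. Here is a hedged sketch on synthetic data; `sampling_dist` is a hypothetical helper, `toy_heights` holds invented values, and `coef_var` repeats the definition above.

```python
import numpy as np

def coef_var(data):
    # Same definition as above: standard deviation as a percentage of the mean.
    return np.nanstd(data) / np.nanmean(data) * 100

def sampling_dist(data, statistic, iters=1000):
    """Apply `statistic` to `iters` resamples of `data` (hypothetical helper)."""
    size = len(data)
    return np.array([statistic(np.random.choice(data, size, replace=True))
                     for _ in range(iters)])

np.random.seed(4)
toy_heights = np.random.normal(178, 8, size=500)  # invented values, in cm

toy_cv = sampling_dist(toy_heights, coef_var)
print(np.percentile(toy_cv, [5, 95]))
```

Passing `np.nanmean` instead of `coef_var` would give the sampling distribution of the mean with no other changes.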

Here's what the sampling distribution of CV looks like:

In [None]:
plot_sampling_dist(sampling_dist_cv)
plt.title('Sampling distribution of CV %')
plt.xlabel('CV adult male height, U.S. (%)');

## Weighted resampling

Another nice thing about resampling is that we can extend it to handle the case where the data are weighted.  In fact, the BRFSS deliberately oversamples some groups, so each respondent has a weight that indicates how many people in the population they represent.

The variable `_LLCPWT` contains these weights, which we can normalize so they add up to 1.

In [None]:
def compute_sampling_weights(df):
    # Normalize without modifying the column of `df` in place
    p = df['_LLCPWT']
    return p / p.sum()

In [None]:
sampling_weights = compute_sampling_weights(males);

We can pass these weights to `np.random.choice`:

In [None]:
def resample_weighted(data, p):
    # Use the length of the data passed in, not the global `df`
    size = len(data)
    return np.random.choice(data, size,
                            replace=True, p=p)

In [None]:
sampling_dist_mean_weighted = ...
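
To see why the weights matter, here is a sketch on an invented sample in which one group is deliberately oversampled. All values, group sizes, and weights below are assumptions chosen for illustration; `resample_weighted` mirrors the function above.

```python
import numpy as np

np.random.seed(4)

# Invented sample: group A (mean 170) supplies 800 of 1000 respondents,
# although both groups are assumed to be equally large in the population.
heights = np.concatenate([np.random.normal(170, 5, size=800),
                          np.random.normal(180, 5, size=200)])

# Each group B respondent stands for four times as many people.
weights = np.concatenate([np.full(800, 1.0), np.full(200, 4.0)])

# Normalize so the weights can serve as sampling probabilities.
p = weights / weights.sum()

def resample_weighted(data, p):
    return np.random.choice(data, len(data), replace=True, p=p)

weighted_means = np.array([resample_weighted(heights, p).mean()
                           for _ in range(1000)])
print('Unweighted mean:', heights.mean())
print('Weighted estimate:', weighted_means.mean())
```

In this toy setup the unweighted mean is pulled toward the oversampled group, while the weighted estimate lands near the midpoint of the two group means, as the assumed population shares would suggest.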

If we take the mean of the sampling distribution, we get an estimate of the average male height, taking account of the sampling weights.

In [None]:
xbar_weighted = np.mean(sampling_dist_mean_weighted)

And we can compare to the unweighted version:

In [None]:
xbar = np.mean(sampling_dist_mean)

What's your conclusion about the differences between the weighted and unweighted estimates? What does that mean?