<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Probability and Statistics 02 - Inferential Statistics and Bootstrapping

In [None]:
%matplotlib notebook

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import math
from pathlib import Path
import random

import IPython.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import utils.plot

In [None]:
sns.set_context(context="notebook", font_scale=1.7)

## Statistics are used primarily for two purposes:

- _description_, or summarization of the data that was observed

- _inference_, or moving beyond the data to knowledge of what data might be observed in the future, including data you would never actually be able to observe

Today's lecture focuses on inferential statistics.

It should perhaps be unsurprising, based on that description, that inferential statistics is
1) much harder than descriptive statistics and
2) philosophically slippery.

## Statistical terminology reveals the history of statistics

The term "Statistics" comes from a Latin word meaning
["matters of the state"](https://en.wiktionary.org/wiki/statistics).

While probability was invented in the 16th century by gamblers and mathematicians,
statistics was invented in the 18th century by bureaucrats.

## The motivating problem of early statistics was

how do I track the commonwealth of the nation without talking to everybody in the nation?

Some early work in probability and statistics was done by
[J.S. Mill](https://en.wikipedia.org/wiki/John_Stuart_Mill),
better known now for his work on
[utilitarianism](https://en.wikipedia.org/wiki/Utilitarianism),
the ethical theory that says that one should
achieve the greatest good for the greatest number.

For these classical liberals, many of them economists,
this led to the question:
how do you measure that good, when the numbers are too great?

There is a total _population_ under the rule of my government.

I would like to know how it's doing.

### Example: Housing Markets of ≈8000 of the largest cities in the US

In [None]:
houses_2010_df = pd.read_csv("data/houses_2010.csv", index_col=0)
houses_2017_df = pd.read_csv("data/houses_2017.csv", index_col=0)

These two dataframes contain information about the total number of homes for sale
in each of the ~8000 largest cities in the United States,
one during a week at the end of 2010 and the other at the end of 2017.

Each one contains the value of the observation "Total Number of Homes for Sale At This Time"
for every single member of a population, the 8000 largest cities in the US.

Taken from the
[Zillow Economics dataset](https://www.kaggle.com/zillow/zecon),
on [kaggle.com](https://kaggle.com).
There are something like
[16,000 - 20,000](https://www2.census.gov/geo/pdfs/reference/GARM/Ch9GARM.pdf) cities in the United States,
depending on your definition,
but the vast majority of city-dwellers are in one of these 8,000 cities.

Measuring something like this is possible now because of ubiquitous, networked computing power.
In the past, doing something like this would've been impossible,
and hence the need for statistics.

There remain plenty of problems where the population is still too big
to work with sensibly:
e.g., deciding who most eligible Americans would vote for if an election were held today.
Doing statistics on this is still a [hard problem](https://fivethirtyeight.com/).

In [None]:
houses_2010_df.sample(5)

## When we have the _entire_ population of interest, all statistics is descriptive.

The distribution of the entire population is called the _population distribution_.

What I called the "true distribution" in the previous lecture,
the distribution you would see if you observed an infinite number of data points,
is also called, by some folks, the population distribution.

In [None]:
plt.figure()
sns.distplot(houses_2010_df["InventoryRaw_AllHomes"], kde=False, norm_hist=True)
sns.distplot(houses_2017_df["InventoryRaw_AllHomes"], kde=False, norm_hist=True); plt.tight_layout();

Distributions like the above, where there are some rare very large values
and lots of common small or intermediate values, are hard to work with,
so let's use a common trick and just ask what the _order of magnitude_,
or _logarithm base 10_, of the number of houses for sale is.

In [None]:
houses_2010_df["log_homes"] = np.log10(houses_2010_df["InventoryRaw_AllHomes"])
houses_2017_df["log_homes"] = np.log10(houses_2017_df["InventoryRaw_AllHomes"])

In [None]:
plt.figure()
sns.distplot(houses_2010_df["log_homes"], kde=False, norm_hist=True)
sns.distplot(houses_2017_df["log_homes"], kde=False, norm_hist=True); plt.tight_layout();

When you have access to the population distribution,
the only thing to do is _describe_ it.

In [None]:
print(houses_2010_df[["log_homes"]].describe())

In [None]:
print(houses_2017_df[["log_homes"]].describe())

So just about every basic question you can ask about this data is easy to answer:
- did the number of houses for sale increase or decrease, on average?
- did the variability in the number of houses for sale increase or decrease?
- did the largest number observed increase or decrease?

Just look at the data, and you're done!

But note, the more interesting questions are about inferences, like:
- how large of a dataset would you have to give me before I could decide whether
the values are from 2010 or 2017?

These can still be answered from the population distribution,
but they use techniques and tools of inferential thinking.

In [None]:
true_means = {2010: houses_2010_df.log_homes.mean(), 2017: houses_2017_df.log_homes.mean()}
true_vars = {2010: houses_2010_df.log_homes.var(ddof=0), 2017: houses_2017_df.log_homes.var(ddof=0)}

## Getting the whole population is often hard or impossible.

- Hard: we would like to make statements about large populations (millions, billions)

- Impossible: sometimes we want to make statements about populations with no finite size

Example: if we make a general statement about human psychology,
we're trying to talk about every human that ever lived or will live.
Good luck surveying all of those!

## The random sample was devised to get around this problem.

Idea: don't measure _every_ member of the population,
just measure a _sufficiently large number of examples_.

From _example_ is derived the word _sample_.

Intuitively, the entire population can't be too different from those examples,
if they are chosen _at random_.

Exceptions: finding the minimum and maximum will, in some cases,
require looking at every member of the population.

Imagine you were trying to figure out how wealthy the wealthiest person was.
Unless you lucked out and happened to talk to
one of
[the top 10 of these people](https://en.wikipedia.org/wiki/List_of_richest_people_in_the_world),
you'd be off by a factor of 2, if not far more.

Unintuitively, _sufficiently large_ is often quite small!

In [None]:
subsample_2010 = houses_2010_df.log_homes.sample(100)
subsample_2017 = houses_2017_df.log_homes.sample(100)

The `.sample` method, with no arguments but a size,
performs a random sample _without replacement_,
mimicking the process of selecting some examples at random from a population.
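
To make the difference between sampling with and without replacement concrete, here is a minimal sketch on a made-up toy population (the `population` Series and its values are assumptions for illustration, not part of the housing data):

```python
import pandas as pd

# Hypothetical toy population: ten made-up values.
population = pd.Series(range(10, 110, 10), name="toy_values")

# Without replacement (the default): every row appears at most once.
without = population.sample(5, random_state=0)

# With replacement: the same row can be drawn more than once.
with_repl = population.sample(20, replace=True, random_state=0)

print(without.index.is_unique)    # True: no row can repeat
print(with_repl.index.is_unique)  # False: 20 draws from only 10 rows must repeat
```

Drawing 20 rows from a population of 10 is only possible with `replace=True`.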

In [None]:
print(subsample_2010.mean(), subsample_2017.mean())

In [None]:
true_means[2010], true_means[2017]

## The Law of Large Numbers guarantees this will work.

Loosely:
the value of a descriptive statistic on a random sample gets closer
to the value of that descriptive statistic on the population
as the sample size increases.

This theorem is originally due to
[Jacob Bernoulli](https://en.wikipedia.org/wiki/Law_of_large_numbers#History),
the 17th century mathematician who discovered the importance of `math.e`.

Alternatively:
the _sampling distribution of a statistic_ on a random sample
gets tighter around, and the center gets closer to, the true value
as the sample size increases.

This is the _key trick_ in applications of probability,
e.g. information theory, statistical physics:
when we look at the behavior of really big collections of random things,
they often behave very predictably.
In its general form, it's known as _concentration_.

The direction and speed of each particle in my coffee cup is random,
but I never need to worry that they'll all suddenly zoom upwards and spill my drink on me.

## Let's visualize the Law of Large Numbers in Action

First, we pick a bunch of sample sizes, with values between 2 and about 500 - 1000.

In [None]:
sample_sizes = np.ceil(np.random.lognormal(mean=3, size=10000)).astype(int) + 1

This random selection method was chosen because it results in more small sizes than large sizes.

Then, we draw samples of those sizes, and calculate our descriptive statistics on those samples.

In [None]:
replace = False
means_2010 = [houses_2010_df["log_homes"].sample(sample_size, replace=replace).mean()
              for sample_size in sample_sizes]
means_2017 = [houses_2017_df["log_homes"].sample(sample_size, replace=replace).mean()
              for sample_size in sample_sizes]

These are _list comprehensions_.

They are like "one-line `for` loops", designed to make it easy to make lists.
Read more about comprehensions to build lists
and other things, like dictionaries,
[here](https://www.geeksforgeeks.org/comprehensions-in-python/).
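
As a small illustration with toy numbers, a comprehension and the loop it replaces look like this:

```python
# A plain for loop that builds a list of squares...
squares_loop = []
for n in range(5):
    squares_loop.append(n ** 2)

# ...and the equivalent list comprehension.
squares = [n ** 2 for n in range(5)]

# The same idea with curly braces and a key: value pair builds a dictionary.
square_of = {n: n ** 2 for n in range(5)}

print(squares)    # [0, 1, 4, 9, 16]
print(square_of)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```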

In [None]:
vars_2010 = [houses_2010_df["log_homes"].sample(sample_size, replace=replace).var(ddof=0)
                   for sample_size in sample_sizes]
vars_2017 = [houses_2017_df["log_homes"].sample(sample_size, replace=replace).var(ddof=0)
                   for sample_size in sample_sizes]

Let's look at our results for the 2010 data.

In [None]:
utils.plot.plot_samples(sample_sizes, means_2010, ylabel="Sample Mean"); plt.tight_layout();

This plot shows the mean of each sample, plotted against the sample size.

Notice how quickly the mean gets to within a tight band,
then how slowly that band gets tighter.

This phenomenon is an important part of how homeworks and labs are autograded.

You will code up a model,
and I will check whether the statistics of that model's samples
have the right value, for large sample sizes.

The Law of Large Numbers predicts the shape we saw above:

In [None]:
utils.plot.make_LLN_plot(sample_sizes, means_2010, true_means[2010]); plt.tight_layout();

The shape is that of a _square root_:
the width of the band shrinks like one over the square root of the sample size.
Loosely, this means that the benefit of taking 10 samples relative to taking just 1
is the same as the benefit of taking 100 samples relative to just 10, and 1000 relative to 100.

This doesn't just apply to the mean, but to a wide variety of statistics that 
are calculated by averaging some function over every value in the sample.
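
The square-root shape can be checked on simulated data. The sketch below assumes draws from a standard normal distribution (spread 1) rather than the housing data, and compares the spread of the sample means against `1 / sqrt(n)`:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_repeats = 2000                # how many samples to draw at each size

spreads = {}
for n in [1, 10, 100, 1000]:
    # Each row is one simulated sample of size n from a standard normal (sd = 1).
    samples = rng.normal(size=(n_repeats, n))
    # Spread (standard deviation) of the n_repeats sample means.
    spreads[n] = samples.mean(axis=1).std()
    print(f"n={n:4d}  spread={spreads[n]:.3f}  1/sqrt(n)={1 / np.sqrt(n):.3f}")
```

Each tenfold increase in sample size should shrink the spread by roughly the same factor, about 3.2.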

This has a profound effect on the way statistics is done:
unless you're taking more than 5000 samples,
almost all of the improvement in the accuracy of your guess will come in the first 100 samples.
If those first 100 aren't sufficient, you'll be waiting a very long time!

This has often led to an obsessive focus on making the other factors that influence that curve,
what is called in the code here `spread_scaling`,
as small as possible:
reducing the variability in the data, etc.

Now let's compare the sample means for samples from our two datasets:

In [None]:
ax = utils.plot.plot_samples(sample_sizes, means_2010, alpha=0.05, label="2010 data")
ax = utils.plot.plot_samples(sample_sizes, means_2017, alpha=0.05, label="2017 data",
                  ylabel="Sample Means", ax=ax);
ax.legend(); plt.tight_layout();

Notice: for larger sample sizes, we don't see any sample means from 2010 that are lower than those from 2017,
though it seems to happen for sample sizes smaller than about 100.

Zoom in on the region of interest with
```python
ax.set_xlim([0, 100])
```

Reduce the transparency by making `alpha` larger.

Repeat all the above, but with `replace=True`.
Notice that almost nothing changes.

So for sample sizes much smaller than the population,
drawing with replacement is almost the same as drawing without replacement.

Put another way,
our procedure of sampling is not that different from
generating random numbers according to the population distribution.

## Two ways your guess can be wrong

- almost always: for a sample smaller than population, sampling distribution has values other than true value.
- sometimes: for a sample smaller than population, mean of sampling distribution is wrong


The former is usually operationalized as the _variance_ of the sampling distribution,
and so is called the variance of an estimate or estimator.

The latter is known as bias.

### Lots of intuitive choices lead to biased guesses

Example: computing the variance on a sample as we've defined it gives a biased estimate of the true variance.

In [None]:
ax = utils.plot.make_LLN_plot(sample_sizes, vars_2010, true_vars[2010], ylabel="Variance", spread_scaling=1);
ax.set_xlim([0, 10]);

### The bias for the variance can be corrected very simply.

In [None]:
corrected_vars_2010 = [houses_2010_df["log_homes"].sample(sample_size, replace=True).var(ddof=1)
                      for sample_size in sample_sizes]

Note: `ddof=1` is the default in pandas! Instead of dividing by `len(series)`, it divides the sum of squared differences by `len(series) - 1`.
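
The effect of the correction is easiest to see on simulated data where the true variance is known. This is a minimal sketch, assuming normally distributed data with true variance 1 and a deliberately tiny sample size of 5 (neither comes from the housing data):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n, n_repeats = 5, 100_000

# Simulated data with a known true variance of 1 (an assumption of this sketch).
samples = rng.normal(size=(n_repeats, n))

# Average each estimator over many samples to find the center of its sampling distribution.
uncorrected = samples.var(axis=1, ddof=0).mean()  # divides by n
corrected = samples.var(axis=1, ddof=1).mean()    # divides by n - 1

print(f"uncorrected: {uncorrected:.3f}  (theory: (n - 1) / n = {(n - 1) / n:.3f})")
print(f"corrected:   {corrected:.3f}  (theory: 1.000)")
```

The uncorrected estimator is too small by a factor of `(n - 1) / n`, which is why the bias matters most for small samples and fades as `n` grows.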

In [None]:
ax = utils.plot.make_LLN_plot(sample_sizes, corrected_vars_2010, true_vars[2010], ylabel="Variance", spread_scaling=1);
ax.set_xlim([0, 10]);

This bias is small enough that to be sure it's there, we need to be quantitative.

In [None]:
sds_2010_df = pd.DataFrame({"sample_size": 2 * list(sample_sizes),
                            "var": list(vars_2010)  + list(corrected_vars_2010),
                            "corrected": len(sample_sizes) * [False] + len(sample_sizes) * [True]})

In [None]:
sds_2010_df.sample(10)

In [None]:
print(true_vars[2010])
sds_2010_df.groupby(["sample_size", "corrected"])["var"].mean().head(10)

This is a groupby on multiple columns: the `sample_size` and whether the estimate was `corrected` or not.
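
For readers new to multi-column groupbys, here is a sketch on a tiny made-up table (the numbers in `toy_df` are invented for illustration). Each distinct pair of `sample_size` and `corrected` becomes one group:

```python
import pandas as pd

# Made-up values, two rows for each (sample_size, corrected) pair.
toy_df = pd.DataFrame({
    "sample_size": [2, 2, 2, 2, 3, 3, 3, 3],
    "corrected": [False, False, True, True, False, False, True, True],
    "var": [0.1, 0.3, 0.2, 0.6, 0.2, 0.4, 0.3, 0.6],
})

# One mean per (sample_size, corrected) pair; the result is indexed by both columns.
group_means = toy_df.groupby(["sample_size", "corrected"])["var"].mean()
print(group_means)
```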

Notice that the values for `corrected==True` are sometimes above, sometimes below the true value,
while the values for `corrected==False` are uniformly below.
This is _bias_.

The exception is often `sample_size==2`,
where there aren't enough samples to reliably estimate the center of the sampling distribution.
Try regenerating both `vars_2010` and `corrected_vars_2010`
multiple times and you should see the value for `corrected==True` bouncing around,
both above and below the correct value, while the value for `corrected==False` is almost always below.

## We apply an _estimator_ on the sample to estimate what the value of some _statistic_ would be on the population.

The _estimator_ is a function/method, like `.mean` or `.var` with `ddof=1`, but it need not be the same as the _statistic_, which might be `.mean` or `.var` with `ddof=0`.

## Once we have an estimate, we also need to determine our uncertainty.

We're almost always pretty sure our estimate is wrong.

In [None]:
print(f"does {true_means[2010]} == {subsample_2010.mean()}?")

No, but we know that the true mean should be "about" the value we measured.

The challenge is figuring out how big "about" is.

## Our uncertainty comes from the sampling distribution

So we can determine our uncertainty by estimating the sampling distribution.

## First Pass: Let's just draw more from the sampling distribution

We draw more samples, calculate our estimate on all of those samples,
and then plot the distribution of our estimates on those samples.

Importantly, the samples should be of the same size!
Otherwise the sampling distribution won't be the same.

In [None]:
subsamples_2010 = [houses_2010_df["log_homes"].sample(100) for _ in range(100)]
means = [subsample.mean() for subsample in subsamples_2010]

In [None]:
plt.figure()
sns.distplot(means, rug=True, label="New Samples"); plt.xlabel("Means"); plt.legend()
plt.tight_layout()

By visualizing our estimate of the sampling distribution, we get a loose idea of what our uncertainty is:
we can say that certain values are definitely plausible, while others are definitely implausible.

In [None]:
plt.figure()
sns.distplot(means, rug=True, label="New Samples"); plt.xlabel("Means");
plt.vlines([1.2, 1.8, 1.9, 2.0], 0, 5, color="C1", lw=2, label="Which of these are plausible?");
plt.legend(loc="upper left"); plt.tight_layout()

But there will be some cases where it's unclear, and reasonable people might disagree.
To avoid that, we need a way to quantify our uncertainty.

Visualizing the sampling distribution and saying which values are plausible and implausible
is sufficient to complete the lab for this week,
though we'll be doing it with bootstrapping,
as discussed later.

We'll leave quantifying uncertainty, as we discuss next,
for the homework.

## The most common way to quantify uncertainty is with an interval.

Once we have an estimate of the sampling distribution,
we then build an interval that covers most of that distribution.

How do we construct it, given samples from an estimate of the sampling distribution?

Below, we'll see that some estimates of the sampling distribution don't come with samples,
and so confidence intervals must be constructed differently.
We'll talk about those when we come across concrete examples.

We build the interval,
[Pied Piper-style](https://news.mlh.io/i-hacked-the-middle-out-compression-from-silicon-valley-06-16-2015),
from the middle out.

### In Python, we can use the `np.percentile` function

In [None]:
np.percentile(range(0, 101), [5, 95])

In [None]:
means_CI = np.percentile(means, [2.5, 97.5])
means_CI

This function comes from the `numpy`, or `num`erical `Py`thon, library.
This is one of the most important libraries in Python,
especially for data science and statistics.

For more on numpy, see
[this tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html).

There is also a similar `.quantile` method in pandas, but watch out:
it takes numbers between 0 and 1, not 0 and 100.
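
A quick side-by-side on toy data (the integers from 0 to 100) shows the two conventions giving the same answer:

```python
import numpy as np
import pandas as pd

values = pd.Series(range(0, 101))  # the integers 0, 1, ..., 100

# NumPy takes percentages between 0 and 100...
from_numpy = np.percentile(values, [2.5, 97.5])

# ...while pandas takes fractions between 0 and 1.
from_pandas = values.quantile([0.025, 0.975]).to_numpy()

print(from_numpy, from_pandas)  # both are 2.5 and 97.5
```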

In [None]:
plt.figure()
sns.distplot(means, rug=True, label="New Samples"); plt.xlabel("Means"); 
plt.hlines(5, *means_CI, color="k", lw=4, label="95% CI");
plt.vlines(true_means[2010], 0, 6, color="C3", lw=2, label="True Mean");
plt.legend(loc=[0.7, 0.64])
plt.tight_layout()

## This is an example of a _Confidence Interval_

A confidence interval is any _interval-valued statistic_
that has the property that for some known fraction of possible samples,
the population parameter is inside that interval.

The typical choice for that known fraction is _95%_, for essentially historical reasons.

It goes back to a few off-hand comments by
[Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher),
one of the founders of modern statistics.

> It is usual and convenient for researchers to take 5 percent as a standard level

## Resampling to construct a confidence interval is almost never done.

- It increases the budget for the experiment dramatically

- You can compute the statistic on all of the samples together and get a new estimate. A confidence interval built from the smaller samples _vastly_ overstates what your real uncertainty is.

## We could instead _model_ the sampling distribution.

1. Mathematically: write down equations for the sampling distribution

2. Computationally: simulate your data, then look at the sampling distribution of your simulation

### These approaches are hard!

Hard enough that it will take us much of the rest of this course to learn to use these.

The mathematical approach is made easier than it sounds by the Central Limit Theorem:
many sampling distributions have the same shape, a bell curve,
once the sample size gets large enough.

But the Central Limit Theorem won't always save you.

For smaller sample sizes, there's no choice but to do some really hard math,
which you must repeat with each new kind of data distribution.

More on the Central Limit Theorem later.
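
In the meantime, here is a rough sketch of the effect, assuming data drawn from a strongly skewed exponential distribution (a choice made for illustration only). The helper `skewness` is defined here, not taken from a library; it is 0 for a symmetric bell curve:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def skewness(x):
    """Sample skewness: 0 for a symmetric distribution such as a bell curve."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


skews = {}
for n in [1, 10, 100]:
    # 20,000 sample means, each computed from n draws of an exponential distribution.
    sample_means = rng.exponential(size=(20_000, n)).mean(axis=1)
    skews[n] = skewness(sample_means)
    print(f"n={n:3d}  skewness of sample means = {skews[n]:.2f}")
```

The skewness of the sampling distribution should shrink toward 0 as the sample size grows, even though the data themselves stay just as skewed.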

## There's an easier way: 

Just pull yourself up by your bootstraps!

## We approximate the process of sampling from the population by sampling from our data.

In [None]:
bootstrap_sample = subsample_2010.sample(frac=1, replace=True)
bootstrap_sample.mean()

This cell computes a single bootstrap sample, then takes its mean.

Notice that, in general,
the statistic on the bootstrap sample
will be different from the statistic on the original sample,
and different from run to run.

In [None]:
bootstrap_samples = [subsample_2010.sample(frac=1, replace=True) for _ in range(100)]
bootstrap_means = [sample.mean() for sample in bootstrap_samples]

The above cell computes multiple bootstrap samples, then computes the mean of each,
to be plotted below.

In [None]:
plt.figure(figsize=(8, 8))
sns.distplot(means, rug=True, label="actual resamples");
sns.distplot(bootstrap_means, rug=True, label="bootstrap resamples");
plt.xlabel("mean"); plt.legend(loc="upper left"); plt.tight_layout()

Key idea:
the _spread_ of this distribution is generally about correct,
even though the center is wrong,
since it's centered around the value of the statistic on the original sample.

Now, we build a confidence interval the same way we would have if we'd actually gone back and done resampling.

In [None]:
bootstrap_CI = np.percentile(bootstrap_means, [2.5, 97.5])
bootstrap_CI
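
The whole recipe (resample the data with replacement, recompute the statistic, take percentiles from the middle out) can be wrapped in one function. This is a sketch, not the course's reference implementation: the name `bootstrap_ci`, its parameters, and the made-up `toy_sample` are all assumptions for illustration.

```python
import numpy as np


def bootstrap_ci(data, statistic=np.mean, n_boot=1000, level=95, seed=None):
    """Percentile bootstrap interval for `statistic`, built from the middle out."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample the data with replacement, keeping the original sample size.
    estimates = [statistic(rng.choice(data, size=len(data), replace=True))
                 for _ in range(n_boot)]
    tail = (100 - level) / 2
    return np.percentile(estimates, [tail, 100 - tail])


# Made-up stand-in for a sample of 100 observations.
toy_sample = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=100)
low, high = bootstrap_ci(toy_sample, seed=1)
print(low, toy_sample.mean(), high)
```

Passing a different `statistic`, such as `np.median`, gives an interval for that statistic without any new math.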

In [None]:
plt.figure()
sns.distplot(bootstrap_means, rug=True, color="C1", label="Bootstraps"); plt.xlabel("Means"); 
plt.hlines(5, *bootstrap_CI, color="k", lw=4, label="Bootstrap CI");
plt.vlines(true_means[2010], 0, 6, color="C3", lw=2, label="True Mean"); plt.legend(loc=[0.6, 0.6])
plt.tight_layout()

Note: technically, these intervals will be too small:
they are a biased estimate of the true confidence interval,
especially for small sample sizes.
So if you really want to have 95% confidence,
you need to make the interval slightly wider.
This is fairly minor in most settings, but a huge pain to calculate, so we won't bother with it.
If you are supremely concerned and widening the intervals changes your conclusions,
then consider looking into
[bootstrap corrections](http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf).
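
One way to see how much this matters is to simulate data with a known mean and count how often the bootstrap interval actually contains it. The sketch below assumes normally distributed data and a deliberately small sample size of 10, both chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n, n_boot, n_trials = 10, 500, 1000
true_mean = 0.0  # known because the data are simulated

covered = 0
for _ in range(n_trials):
    sample = rng.normal(loc=true_mean, size=n)
    # Draw n_boot bootstrap resamples at once: each row is one resample.
    resamples = rng.choice(sample, size=(n_boot, n), replace=True)
    low, high = np.percentile(resamples.mean(axis=1), [2.5, 97.5])
    covered += int(low <= true_mean <= high)

coverage = covered / n_trials
print(f"nominal coverage: 0.95, observed coverage: {coverage:.3f}")
```

With samples this small, the observed coverage tends to come out somewhat below the nominal 95%; increasing `n` should bring it closer.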

In [None]:
IPython.display.Image("img/how_bootstrapping_works.jpeg")

The inspiration for the term bootstrapping:
in the [_Adventures of Baron Munchausen_](https://en.wikipedia.org/wiki/Baron_Munchausen),
the eponymous hero claims to have once saved himself from drowning in a swamp by pulling himself out by his ponytail.

It's believed that this story was eventually mangled slightly into a version in which someone
pulls themselves out using the straps on their boots:

In [None]:
IPython.display.Image("img/bootstraps.jpg")

Image Credit: [Wikipedia](https://en.wikipedia.org/wiki/M%C3%BCnchhausen_trilemma).

## But what if there's no population?

As noted above,
it's more typical now to work with cases where there is no fixed, finite-size population.

The process of sampling from the population distribution is replaced with the process of _generating random numbers according to a distribution_.

Because of this connection, generating random numbers is often called _sampling_.

## Be careful interpreting confidence intervals.

It is tempting,
since confidence intervals represent our uncertainty, to say that,
when the 95% confidence interval is `[0, 1]`,
we are 95% sure that the value of the statistic on the true distribution is inside that interval.

In this case, we can see that this interpretation is wrong, because we know the true mean,
so it is either inside the interval or it isn't:

In [None]:
print(f"is {true_means[2010]} inside the interval {bootstrap_CI}?")

This misapprehension about confidence intervals is very common,
enough that if you make it publicly or in a meeting,
you're very likely to get [well, actually'd](https://twitter.com/search?q=well%2C+actually%27d).

## There's a simple counterexample: what is `1 + 1`?

Here's a perfectly acceptable procedure for answering this question
using a statistic, an estimator, and a confidence interval:

- roll a die

- if that die comes up 1, guess `1`, with the confidence interval `[1, 1]`

- otherwise, guess `2`, with the confidence interval `[2, 2]`

In [None]:
def one_plus_one_CI(data):
    if data != 1:
        return [2, 2]
    else:
        return [1, 1]

In [None]:
one_plus_one_CI(random.randint(1, 20))

For a 20-sided die, this is a 95% Confidence Interval,
but no one would say that there's a 95% chance that the answer to `1+1` is inside the interval `[1, 1]`.
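
The 95% figure can be checked by simulation. This sketch repeats the function from above so that it runs on its own, rolls the 20-sided die many times, and counts how often the interval contains the true answer, `2`:

```python
import random


def one_plus_one_CI(data):
    # Same procedure as above: only a roll of 1 produces the wrong interval.
    if data != 1:
        return [2, 2]
    else:
        return [1, 1]


random.seed(0)  # fixed seed so the sketch is repeatable
n_rolls = 100_000
intervals = [one_plus_one_CI(random.randint(1, 20)) for _ in range(n_rolls)]

# An interval "covers" the true answer when 2 lies between its endpoints.
coverage = sum(low <= 2 <= high for low, high in intervals) / n_rolls
print(f"fraction of intervals containing 2: {coverage:.3f}")
```

The fraction should land close to 0.95: the procedure has the advertised coverage, even though each individual interval is either certainly right or certainly wrong.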

## Alternative: Credible Intervals

We will learn to build models of data and use those to construct
both sampling distributions of statistics
and something much like a confidence interval, but which can be interpreted as
"there's a 95% chance that the true value is in this interval".

We will also see that the alignment between these
credible intervals and traditional confidence intervals can be very high,
and so things are typically not so bad as in the examples above,
and all of the hemming and hawing about how to interpret Confidence Intervals is a bit off base.