In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 15 – Parameters and Statistics

## DSC 10, Fall 2021

### Announcements

- The Midterm Exam has been graded; see Gradescope for your grade. Submit regrade requests by **Wednesday 11/3 at 11:59pm**.
    - Let us know if you want to talk about your performance!
- The Midterm Project is due on **Tuesday 11/2 at 11:59pm**.
- Lab 5 will be due on **Saturday, 11/6 at 11:59pm**.
- Homework 5 will be released soon and will be due on **Tuesday, 11/9 at 11:59pm**.
    - Moving forward: homeworks due Tuesdays, labs due Saturdays.
- Lots of office hours changes. Check the Canvas calendar.

### Agenda

- Statistical inference.
- Empirical distributions of statistics.
- Bias and variance.
- Statistical models.

### How much progress have you made on the Midterm Project?

### To answer, go to [menti.com](https://menti.com) and enter the code 1959 5357.

## Statistical inference

### Statistical inference

**Statistical inference** draws conclusions using data from random samples.

Terminology:
- **Parameter**: A number associated with the population.
    - Example: the population mean.
- **Statistic**: A number calculated from the sample.
    - Example: the sample mean.
- A statistic can be used as an **estimate** for a parameter.

_To remember: **p**arameter and **p**opulation both start with p, **s**tatistic and **s**ample both start with s._
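To make the distinction concrete, here is a minimal sketch. The population of 300 serial numbers, the sample size of 30, and the seed are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Hypothetical population: serial numbers 1 through 300.
population = np.arange(1, 301)

# Parameter: a number computed from the entire population.
population_mean = population.mean()

# Statistic: the same calculation, but on a random sample of 30.
sample = rng.choice(population, size=30, replace=False)
sample_mean = sample.mean()

print("Parameter (population mean):", population_mean)
print("Statistic (sample mean):    ", sample_mean)
```

The parameter is a fixed number, while the statistic changes every time a new sample is drawn.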

### Example: estimating the number of German tanks in WWII

<center><img src='data/tank.jpg' width=500></center>

What we're about to study is known as the [German Tank Problem](https://en.wikipedia.org/wiki/German_tank_problem).

### Population and sample

- **Population:** all German tanks (unknown).
- **Sample:** the tanks we've seen (captured or destroyed).

### Setup

* There are N tanks, each with a serial number 1, 2, 3, ..., N.
* We don’t know N.
* We would like to estimate N based on the serial numbers of the tanks that we see.

### Discussion Question

If you saw these serial numbers, what would be your estimate (best guess) of N?
```
170  271  285  290   48
235   24   90  291   19
```

A. 291

B. 353

C. 438

D. 487

### To answer, go to [menti.com](https://menti.com) and enter the code 1959 5357.

### Approach #1: The largest number observed

* Is it likely to be close to the total number of tanks, N?
    - How likely?
    - How close?

### Creating some data

* We'll manufacture an unknown number of tanks (between 200 and 400).
* We'll see a random sample of the tanks (and their serial numbers).
* From the sample, we'll try to guess how many tanks were manufactured.
* Then we'll see if our guesses were any good.

### The main assumption
We're assuming that the serial numbers of the tanks that we see are a uniform random sample drawn without replacement from 1, 2, 3, …, N.

In [None]:
# Manufacture tanks: N is a random integer from 200 to 399
# (the upper bound of np.random.randint is exclusive)
N = np.random.randint(200, 400)
serial_nos = bpd.DataFrame().assign(SerialNumber=np.arange(1, N+1))

### Estimate: approach #1

- Our sample: 30 tanks.
- Our statistic: the biggest serial number in our sample.
- Our sample is random, so the biggest seen serial number is also random.

In [None]:
# The biggest serial number
serial_nos.sample(30, replace=False).get('SerialNumber').max()

In [None]:
# what was N?
N

If we instead acquired a different random sample, the value of our statistic (the max of the serial numbers in our sample) might also be different.

How can we understand the **empirical distribution of the statistic**?

## Empirical distribution of a statistic

**Plan:** Repeatedly collect samples of size 30, determine the max serial number in each sample, and look at the resulting distribution of maxes.

In [None]:
repetitions = 1000 # Start small!
sample_size = 30
maxes = np.array([])
for i in np.arange(repetitions):
    one_max = serial_nos.sample(sample_size, replace=False).get('SerialNumber').max()
    maxes = np.append(maxes, one_max)

maxes

In [None]:
# Plot the distribution
bpd.DataFrame().assign(maxes=maxes) \
               .plot(kind='hist', bins=np.arange(N-50, N+5, 5), density=True, ec='w', figsize=(10, 5));
plt.axvline(N, color='C2');

### Discussion Question

How often is our guess within 5 of the actual number of tanks, N?

A. 30% of the time  
B. 40% of the time  
C. 50% of the time  
D. 60% of the time

### To answer, go to [menti.com](https://menti.com) and enter the code 1959 5357.

### Verdict on the estimate

* The largest serial number observed is likely to be close to N.
* But it is also likely to underestimate N.
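Both claims can be checked exactly with a counting argument rather than a simulation. The sketch below assumes N = 300 and a sample of 30 for illustration (in the notebook, N is random), and relies on the main assumption that every set of 30 serial numbers is equally likely.

```python
from math import comb

# Assumed values for illustration; in the notebook, N is random and hidden.
N = 300   # total number of tanks
n = 30    # sample size

total_samples = comb(N, n)

# P(max <= m) = C(m, n) / C(N, n): all n sampled tanks must come from 1..m.
# The guess is within 5 of N exactly when max >= N - 5.
prob_within_5 = 1 - comb(N - 6, n) / total_samples

# P(max = m) = C(m - 1, n - 1) / C(N, n): tank m is in the sample and the
# other n - 1 tanks come from 1..m-1. Average m over that distribution.
expected_max = sum(m * comb(m - 1, n - 1) for m in range(n, N + 1)) / total_samples

print("P(max within 5 of N):", round(prob_within_5, 3))
print("Average value of max:", round(expected_max, 2))
```

Under these assumptions the max lands within 5 of N a little under half the time, and its average is about 291.3, roughly 9 below the true N. The max can never exceed N, so every error is in the same direction.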

### Approach #2: double the mean
* Idea: the mean of the observed serial numbers should be close to $\frac{N}{2}$.
* Let's try to estimate the number of tanks, N, using twice the sample mean.
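The reasoning behind this idea can be written out. This short derivation uses only the fact that the serial numbers are 1, 2, ..., N:

$$\text{population mean} = \frac{1 + 2 + \cdots + N}{N} = \frac{N(N+1)/2}{N} = \frac{N+1}{2}$$

So twice the population mean is N + 1. Since the mean of a random sample tends to be close to the population mean, doubling the sample mean should give a guess near N (too high by just 1 on average).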

In [None]:
# The sample mean, times 2
serial_nos.sample(30, replace=False).get('SerialNumber').mean() * 2

In [None]:
# remember what the right answer was?
N

### Empirical distribution of approach #2's statistic

In [None]:
repetitions = 1000
sample_size = 30
twice_means = np.array([])
for i in np.arange(repetitions):
    m = serial_nos.sample(sample_size, replace=False).get('SerialNumber').mean() * 2 
    twice_means = np.append(twice_means, m)

In [None]:
bpd.DataFrame().assign(twice_means=twice_means) \
               .plot(kind='hist', bins=np.arange(N-100, N+100, 5), density=True, ec='w', figsize=(10, 5));
plt.axvline(N, color='C2');

Which of these two estimation strategies seems "better" to you? Hold on to that thought.

## More on statistics

### Probability distribution of a statistic

- The value of a statistic, e.g. the sample mean, is random, because it depends on a random sample.
- Like other random quantities, we can study the "probability distribution" of the statistic (also known as its "sampling distribution").
    - This describes all possible values of the statistic and all the corresponding probabilities.
- Unfortunately, this can be hard to calculate exactly.
    - Option 1: do the math by hand.
    - Option 2: generate **all** possible samples and calculate the statistic on each sample.
    - Both approaches are hard.
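To see what Option 2 looks like when it is feasible, here is a sketch on a deliberately tiny, hypothetical setup: 10 tanks and samples of size 3. These values are assumptions chosen so that all samples can be listed.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Tiny hypothetical setup so that every possible sample can be listed.
N = 10   # population: serial numbers 1..10
n = 3    # sample size

# Count how many of the equally likely samples produce each value of the max.
counts = {}
for sample in combinations(range(1, N + 1), n):
    biggest = max(sample)
    counts[biggest] = counts.get(biggest, 0) + 1

# Convert counts to exact probabilities.
total = comb(N, n)
distribution = {value: Fraction(count, total) for value, count in sorted(counts.items())}

for value, prob in distribution.items():
    print(value, prob)
```

With only 120 possible samples this runs instantly. For 30 tanks out of 300 the number of possible samples is astronomically large, so listing them all is not practical.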

### Empirical distribution of a statistic
- The empirical distribution of a statistic is based on simulated values of the statistic. It describes
    - all the observed values of the statistic, and
    - the proportion of times each value appeared.
- The empirical distribution of a statistic can be a good approximation to the probability distribution of the statistic, **if the number of repetitions in the simulation is large**.
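As a rough illustration, the sketch below builds an empirical distribution of the sample max and compares one of its proportions to a probability that is easy to compute exactly. The values N = 300, 10,000 repetitions, and the seed are assumptions, and it uses plain NumPy rather than `babypandas` so that it stands alone.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Assumed values for illustration.
N = 300
sample_size = 30
repetitions = 10_000
population = np.arange(1, N + 1)

# Simulate the statistic (the sample max) many times.
maxes = np.array([
    rng.choice(population, size=sample_size, replace=False).max()
    for _ in range(repetitions)
])

# Empirical distribution: each observed value and the proportion of times it appeared.
values, counts = np.unique(maxes, return_counts=True)
proportions = counts / repetitions

# The max equals N exactly when tank N is in the sample, which has
# probability sample_size / N under uniform sampling without replacement.
empirical_top = np.mean(maxes == N)
print("Empirical proportion of max == N:", empirical_top)
print("Exact probability of max == N:   ", sample_size / N)
```

Rerunning with a smaller `repetitions` (say, 100) should make the empirical proportion noticeably less stable from run to run.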

### Estimating the number of tanks
- We've used two statistics so far: `max` and `2 * mean`.
- A probability distribution for `max`, for example, would allow us to calculate the probability that the `max` of a sample of 30 tanks out of 300 is equal to $X$, for any value of $X$.
- An empirical distribution for `max` can be seen in the histograms we drew.

## Bias and variance

**Which statistic was "better"?**

### Bias
- Bias is **systematic error in one direction**.
- A biased estimate is one where, on average across all possible samples, the estimate is either too high or too low.
- Good estimates typically have low bias.

### Variability

- Variability measures the degree to which the value of an estimate varies from one sample to another.
- High variability makes it hard to estimate accurately.
- Good estimates typically have low variability.

### The "bias-variance trade-off"

- The `max` has low variability, but it is biased, because on average it is an underestimate.
- `2 * mean` has little bias, as it is very nearly correct on average, but it is highly variable.
- Achieving both low bias and low variance is rare in practice.
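These two properties can be put into numbers. The sketch below is one possible way to do so, assuming N = 300 and using the standard deviation as the measure of variability (the lecture does not commit to a specific measure, so that choice is an assumption).

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded for reproducibility

# Assumed values for illustration.
N = 300
sample_size = 30
repetitions = 5_000
population = np.arange(1, N + 1)

maxes = np.empty(repetitions)
twice_means = np.empty(repetitions)
for i in range(repetitions):
    sample = rng.choice(population, size=sample_size, replace=False)
    maxes[i] = sample.max()
    twice_means[i] = 2 * sample.mean()

# Estimated bias: the average estimate minus the true value.
bias_max = maxes.mean() - N
bias_twice_mean = twice_means.mean() - N

# Variability: how spread out the estimates are from sample to sample.
sd_max = maxes.std()
sd_twice_mean = twice_means.std()

print("max:      bias =", round(bias_max, 2), " SD =", round(sd_max, 2))
print("2 * mean: bias =", round(bias_twice_mean, 2), " SD =", round(sd_twice_mean, 2))
```

The max should show a clearly negative bias with a small spread, while `2 * mean` should show a bias near zero with a much larger spread.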

In [None]:
bpd.DataFrame().assign(maxes=maxes, twice_means=twice_means) \
               .plot(kind='hist', bins=np.arange(N-99, N+100, 5), density=True, alpha=0.65, ec='w', figsize=(10, 5));
plt.axvline(N, color='C2');

Any **ideas** to help us achieve the best of both worlds?

### Approach #3: max + min

In [None]:
repetitions = 1000
sample_size = 30
maxes_plus_mins = np.array([])
for i in np.arange(repetitions):
    serials = serial_nos.sample(sample_size, replace=False).get('SerialNumber')
    m = serials.max()+serials.min()
    maxes_plus_mins = np.append(maxes_plus_mins, m)

In [None]:
bpd.DataFrame().assign(maxes_plus_mins=maxes_plus_mins) \
               .plot(kind='hist', bins=np.arange(N-99, N+100, 5), density=True, ec='w', figsize=(10, 5));
plt.axvline(N, color='C2');

In [None]:
bpd.DataFrame().assign(maxes=maxes, twice_means=twice_means, maxes_plus_mins=maxes_plus_mins) \
               .plot(kind='hist', bins=np.arange(N-99, N+100, 5), density=True, alpha=0.65, ec='w', figsize=(10, 5));
plt.axvline(N, color='C2');
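One way to see why max + min lands near N is a symmetry argument, sketched here under the same uniform-sampling assumption. Relabeling every serial number k as N + 1 - k turns a uniform random sample into another uniform random sample, but swaps the roles of the min and the max. So, on average, the gap below the smallest observed serial number matches the gap above the largest:

$$E[\min] - 1 = N - E[\max] \quad\Longrightarrow\quad E[\max + \min] = N + 1$$

Adding the min to the max therefore corrects the max's tendency to underestimate, leaving an estimate that is too high by only 1 on average.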

## Statistical models

### Models

- A model is a set of assumptions about how data was generated.

    <center><img src='data/box.jpg' width=500></center>

- We want a way to assess the quality of a given model.

### Example

<center><img src="https://upload.wikimedia.org/wikipedia/commons/e/e5/Pisa_experiment.png" width=500></center>

[Galileo's Leaning Tower of Pisa Experiment](https://en.wikipedia.org/wiki/Galileo%27s_Leaning_Tower_of_Pisa_experiment)

### Recall: Swain vs. Alabama, 1965
- Robert Swain was a Black man convicted of a crime in Talladega County, Alabama.
- He appealed the jury's decision all the way to the Supreme Court, on the grounds that Talladega County systematically excluded Black people from juries.
- At the time, only men 21 years or older were allowed to serve on juries. 26% of this population was Black.
- But of the 100 men on Robert Swain's jury panel, only 8 were Black.

### Assessing a model

- One **model** is that members of the jury panel were selected uniformly at random from the eligible population, and it is just pure chance that only 8 of the 100 members of that sample were Black.
- We now have the tools necessary to determine whether or not this is a reasonable model.
    - Spoiler alert: it's not.
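A quick simulation gives a sense of why. The sketch below is one possible check, not necessarily the approach the course will use. It assumes the eligible population is large enough that each of the 100 panelists is effectively an independent draw with a 26% chance of being Black.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded for reproducibility

panel_size = 100
prop_black = 0.26      # proportion of the eligible population that was Black
observed_count = 8     # number of Black men on Swain's jury panel
repetitions = 10_000

# Assumption: independent draws, so the number of Black panelists in one
# simulated panel follows a binomial distribution.
simulated_counts = rng.binomial(panel_size, prop_black, size=repetitions)

# How often does the model produce a panel with 8 or fewer Black panelists?
prop_as_extreme = np.mean(simulated_counts <= observed_count)

print("Average simulated count: ", simulated_counts.mean())
print("Smallest simulated count:", simulated_counts.min())
print("Proportion with <= 8:    ", prop_as_extreme)
```

If the random-selection model were reasonable, panels like the observed one should turn up at least occasionally. In this simulation they essentially never do.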

## Summary, next time

### Summary

- A parameter is a number associated with a population, and a statistic is a number associated with a sample.
- We can use statistics calculated on random samples to estimate population parameters.
    - For example, to estimate the max of a population, we can compute the max of a sample, or twice the mean of the sample, or the sum of the max and min of the sample, and so on.
- Estimates have bias and variance, each of which measures a different way in which an estimate can be "wrong".
    - Bias is systematic error in one direction. If an estimate has bias, it is wrong on average.
    - Variability measures how much an estimate changes between samples.
    - Ideally, we'd like our estimates to have low bias and low variance, but in practice this is hard to accomplish.
- A model is a set of assumptions about how data were generated.
- **Next time**: more on models and how to assess them (hypothesis testing!).