# Probability vs. Statistics

What's probability got to do with statistics? Probability seems to be about games of chance, while statistics is about answering questions with data. How could they possibly be related?

In statistics, we typically answer questions by assuming that data is generated from some random process, which we describe with a probability model.

- In probability, we know the model and want to know what kind of data is likely to be generated.
- In statistics, we have observed the data and want to answer questions about the model that generated it.

In other words, probability and statistical inference are inverses of one another, as shown in the following diagram.

![](img/prob_stat.png)

In [None]:
!pip install -q symbulate
from symbulate import *
NSIM = 10000

## Example 1: Coin Tossing

A coin is tossed 100 times.

**Probability Question:** Suppose the coin has probability $0.5$ of coming up heads. What is the probability of observing exactly 60 heads in the 100 tosses?

In [None]:
# 1 = heads, 0 = tails
model = BoxModel([0, 1], size=100, replace=True)
model.sim(5)

In [None]:
# Adding up the 0s and 1s gives the number of 1s (i.e., heads)
X = RV(model, sum)
X.sim(5)

In [None]:
# Simulate many instances and compute the fraction that are equal to 60.
# NSIM was defined in the setup cell at the top.
X.sim(NSIM).count_eq(60) / NSIM
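
As a cross-check on the simulation, this probability can also be computed exactly from the binomial formula $\binom{100}{60} (0.5)^{60} (0.5)^{40}$. The following is a minimal sketch that uses Python's standard `math.comb` rather than Symbulate; the variable names are illustrative and not part of the notebook's setup.

```python
from math import comb

n_tosses = 100  # number of tosses
n_heads = 60    # number of heads we are asking about
p = 0.5         # probability of heads for a fair coin

# Binomial probability: C(n, k) * p^k * (1 - p)^(n - k)
exact_prob = comb(n_tosses, n_heads) * p**n_heads * (1 - p)**(n_tosses - n_heads)
print(exact_prob)
```

The simulated proportion should land close to this value, with the gap shrinking as `NSIM` grows.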

**Statistics Question:** The coin may or may not be fair; it has some probability $p$ of coming up heads. But we observed 60 heads in 100 tosses. Based on this data, how do we estimate $p$?

Intuitively, $\hat p = 60 / 100 = 0.6$. Is this estimate good or not? It's hard to say for certain whether any individual estimate is good. After all, what if the coin had come up heads in all 100 tosses? This is theoretically possible (although highly improbable). In that case, the estimate would be $\hat p = 100 / 100 = 1$, which would probably be a terrible estimate.

Even though we cannot evaluate individual estimates, we can evaluate the _procedure_ for coming up with the estimate, given data. This procedure is called the **estimator**.

The procedure that we followed in coming up with $\hat p = 0.6$ is this: take the number of heads in the data and divide by the number of tosses. Let's evaluate this estimator by simulation.

In [None]:
# Suppose the coin is fair (p = 0.5)
model = BoxModel([0, 1], size=100, replace=True)

# Define the estimator
def estimator(data):
    # number of heads divided by the number of tosses
    return sum(data) / len(data)

p_hat = RV(model, estimator)

# Now simulate many estimates.
p_hat.sim(5)

In [None]:
# Make a plot of these estimates.
p_hat.sim(NSIM).plot()

We simulated the data from a model where the true probability of heads was $p = 0.5$. We see that the estimated probability is not always $0.5$ exactly. It is sometimes more, sometimes less. But in expectation, it is equal to $0.5$. Let's check this.

In [None]:
p_hat.sim(NSIM).mean()

The difference between this expectation and the truth is called the **bias** of the estimator.

$$ \text{bias} = E[\text{estimate} ] - \text{truth} $$

The simulated mean is very close to $0.5$, so the estimated bias of $\hat p$ is approximately $0$ (at least when the true $p = 0.5$). We call an estimator **unbiased** if it has a bias of exactly $0$. So the simulation suggests that $\hat p$ is unbiased when $p = 0.5$. However, the simulation is not conclusive, nor does it provide evidence that $\hat p$ is unbiased for any other value of the true $p$.
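
One way to gather evidence for other values of $p$ is to repeat the bias check over a grid of values. The sketch below does this with Python's standard `random` module instead of Symbulate, so it runs on its own. The grid of $p$ values, the repetition count `N_REPS`, and the helper `simulate_p_hat` are all illustrative assumptions, not part of the notebook above.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

N_TOSSES = 100
N_REPS = 2000  # assumed repetition count, kept small for speed


def simulate_p_hat(p, n_tosses=N_TOSSES):
    """Toss a coin with heads probability p; return the fraction of heads."""
    heads = sum(random.random() < p for _ in range(n_tosses))
    return heads / n_tosses


# Estimated bias = average estimate minus the true p, for each assumed p.
estimated_bias = {}
for p in [0.1, 0.3, 0.5, 0.9]:
    estimates = [simulate_p_hat(p) for _ in range(N_REPS)]
    estimated_bias[p] = sum(estimates) / N_REPS - p
    print(p, estimated_bias[p])
```

Estimated biases near $0$ for every $p$ in the grid are consistent with $\hat p$ being unbiased in general, but this is still evidence from simulation rather than a proof.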

## Example 2: The German Tank Problem

In World War II, the Allies wanted to know how many German tanks there were. Fortunately for them, each German tank had a serial number that corresponded to its position on the production line. For simplicity, suppose the first tank had a serial number of 1, the second tank a serial number of 2, and so on. If it were possible to see all the tanks, then the number of tanks would simply be the largest serial number.

However, the Allies did not see all the German tanks. Instead, each time they encountered a German tank, they would secretly record the serial number. We'll assume that they were equally likely to encounter any of the existing German tanks. That is, if there are $N$ tanks, each of them has probability $1 / N$ of being encountered. We'll also assume that no tank can be encountered twice.

**Probability Question:** Suppose there are $N = 271$ tanks, of which the Allies encounter 10. What is the probability that the maximum serial number is 221 _or greater_?

In [None]:
model = BoxModel(list(range(1, 272)), size=10, replace=False)

# calculate the maximum
X = RV(model, max)

# YOUR CODE HERE
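
A simulated answer to this question can be checked against an exact count. The maximum falls below 221 only when all 10 serial numbers come from $1, \ldots, 220$, so the probability is $1 - \binom{220}{10} / \binom{271}{10}$. The following sketch uses Python's standard `math.comb`, not Symbulate; the variable names are illustrative assumptions.

```python
from math import comb

N = 271          # total number of tanks
k = 10           # number of tanks encountered (without replacement)
threshold = 221  # we ask for P(max >= threshold)

# The maximum is below the threshold only if all k serial numbers
# fall among 1, ..., threshold - 1.
prob_max_below = comb(threshold - 1, k) / comb(N, k)
prob_max_at_least = 1 - prob_max_below
print(prob_max_at_least)
```

This is meant as a reference value for comparison, not as a replacement for the simulation the exercise asks for.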

**Statistics Question:** There are $N$ tanks, where $N$ is unknown. We encounter 10 tanks, whose serial numbers are

193, 221, 129, 31, 169, 6, 33, 30, 151, 192.

One possible estimator of $N$ is the maximum serial number. So based on this data, our estimate of $N$ would be $\hat N = 221$. Is this a good estimator or not? Estimate its bias using simulation. (You will need to assume some value of $N$ in your simulation.) Is it unbiased for that value of $N$?

In [None]:
# YOUR CODE HERE

Can you come up with a different estimator that is better?

In [None]:
# YOUR CODE HERE