<a href="https://colab.research.google.com/github/dlsun/Stat305-S20/blob/master/colab/notebooks/1_The_Bias_of_an_Estimator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This lesson is the first in a series of lessons that explain how probability is applied to statistics and data science.

Save a copy of this notebook in your own Google Drive (File > Save a copy in Drive).

This notebook consists of a mix of:

1. explanations
2. code cells that you should run
3. questions that you should answer

I encourage you to work through this notebook with a partner so that you can discuss your answers. You should meet over an application such as Discord or Zoom. One person can share their screen with this notebook open.

In [None]:
# This is a code cell.
# To run the code in this cell, click on it and press the "Play" button.
!pip install -q symbulate
from symbulate import *
import matplotlib.pyplot as plt

# Example 1: Skew Dice

Suppose we have skew dice, shown below, and we want to know the probability of rolling an ace, $p_1$. The different faces may not all be equally likely, so $p_1$ is not necessarily $1/6$.

<img src="https://images-na.ssl-images-amazon.com/images/I/41fdyCI4IoL._AC_SX425_.jpg" height="150px"/>

How do we **estimate** $p_1$? We collect some data. We roll the skew die 50 times and observe that aces come up 10 times.

## Probability vs. Statistics

By now, you are very accustomed to _probability questions_, where you know the values of parameters like $p_1$ and are asked to calculate a probability.

**Question 1.** Suppose $p_1 = 1/6$. If we roll the die 50 times, what is the probability that we get (exactly) 10 aces?

**YOUR ANSWER HERE**

In the _probability question_ above, the probability model was specified exactly. That is, we knew that the number of aces would be binomial, and we knew all of the parameters of that binomial distribution (e.g., $p_1$). But we had no data; the 50 rolls were hypothetical.

A _statistics question_ is precisely the opposite of a _probability question_. In statistics, we have data (e.g., 10 aces in 50 rolls), but there are parameters of our model that we do not know (e.g., $p_1$). We want to use the data to help us estimate those parameters.

In other words, probability and statistics are inverses of one another, as illustrated in the following diagram.

![](https://github.com/dlsun/Stat425F19/blob/master/notes/img/prob_stat.png?raw=1)

# Estimates and Estimators

You might intuitively guess that a good estimate of $p_1$ based on the 10 aces we observed is 
$$ \hat p_1 = \frac{10}{50} = 0.20. $$
But this is not the only possible estimate. Some statisticians argue that you should add four "fake" observations (two aces and two non-aces) to the data before calculating the proportion, in which case the estimate is:
$$ \tilde p_1 = \frac{10 + 2}{50 + 4} = 0.222. $$
Which estimate is better? How do we decide?
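
As a quick check of the arithmetic, the two estimates can be computed directly. This is only a small sketch; the variable names (`aces`, `rolls`, `p_hat`, `p_tilde`) are illustrative and not part of the original notebook.

```python
# Observed data from the example: 10 aces in 50 rolls.
aces = 10
rolls = 50

# Estimate 1: the sample proportion of aces.
p_hat = aces / rolls

# Estimate 2: add two fake aces and two fake non-aces before
# calculating the proportion.
p_tilde = (aces + 2) / (rolls + 4)

print(f"p_hat   = {p_hat:.3f}")
print(f"p_tilde = {p_tilde:.3f}")
```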

First, it is impossible to say how good an individual estimate is because there is always the chance that the outcome we happened to observe is an anomaly.

For example, our estimate of $\hat p_1 = 0.20$ would be quite good if $p_1 = 0.18$. But what if $p_1 = 0.99$? It is still (theoretically) possible to observe 10 aces in 50 rolls, in which case our estimate of $\hat p_1 = 0.20$ would be terrible.
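
To see how an anomalous outcome can be "theoretically possible" yet extremely unlikely, the sketch below evaluates the binomial probability of exactly 10 aces in 50 rolls under the two values of $p_1$ mentioned above. The helper `binomial_pmf` is a hypothetical name introduced here, written with only the Python standard library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of observing exactly 10 aces in 50 rolls
# under each hypothetical value of p1.
for p1 in (0.18, 0.99):
    prob = binomial_pmf(10, 50, p1)
    print(f"p1 = {p1}: P(10 aces in 50 rolls) = {prob:.3e}")
```

Under $p_1 = 0.18$ the probability is of ordinary size, while under $p_1 = 0.99$ it is astronomically small but still greater than zero.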

Since we do not know the value of the true $p_1$, there is no way to tell whether an individual estimate is good or not. But probability can help us evaluate our _procedure_ for coming up with estimates.

Notice that $\hat p_1$ and $\tilde p_1$ are random variables, since they depend on the random outcome of the dice rolls. In other words, they describe a _procedure_ for coming up with an estimate. If we had gotten a different number of aces, then the values of $\hat p_1$ and $\tilde p_1$ would be slightly different. The definitions of $\hat p_1$ and $\tilde p_1$ specify how to obtain those values from the data.

To be precise, we know that the number of aces, $X$, is a $\text{Binomial}(n=50, p_1)$ random variable. This is the data. The estimators $\hat p_1$ and $\tilde p_1$ are defined in terms of the data as 
\begin{align*}
\hat p_1 &= \frac{X}{50} \\
\tilde p_1 &= \frac{X + 2}{54}.
\end{align*}

When $\hat p_1$ and $\tilde p_1$ are regarded as random variables (instead of as specific values), we will call them **estimators**.
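
One way to picture an estimator as a procedure is to write it as a function of the data. The sketch below does this for both estimators and applies them to a few hypothetical values of $X$; the function names `p_hat` and `p_tilde` and the default `n=50` are assumptions made for illustration.

```python
def p_hat(x, n=50):
    """Sample proportion of aces in n rolls."""
    return x / n

def p_tilde(x, n=50):
    """Proportion after adding two fake aces and two fake non-aces."""
    return (x + 2) / (n + 4)

# Hypothetical numbers of aces we might have observed in 50 rolls.
for x in (5, 10, 15):
    print(f"X = {x:2d}: p_hat = {p_hat(x):.3f}, p_tilde = {p_tilde(x):.3f}")
```

The same two functions produce different estimates for different data, which is why the estimators are random variables.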


# Simulating Estimators and Bias

First, let's simulate the distributions of the estimators $\hat p_1$ and $\tilde p_1$ and compare them. To do so, we need to assume a value of $p_1$. Let's start by assuming $p_1 = 0.10$. 

In [None]:
p1 = 0.1
X = RV(Binomial(n=50, p=p1))

# Simulate and plot the two estimators.
est1 = (X / 50).sim(10000)
est2 = ((X + 2) / 54).sim(10000)

est1.plot(type="bar")
est2.plot(type="bar")
plt.legend(["Estimator 1", "Estimator 2"]);

Since the data came from a model where $p_1 = 0.10$, the ideal estimator would be "close" to $0.10$. Clearly, both $\hat p_1$ and $\tilde p_1$ vary quite a bit around $p_1$. However, _on average_, they are both close to $p_1$, with $\hat p_1$ being closer.



In [None]:
est1.mean(), est2.mean()

The average value of a random variable is its _expected value_. So, from our simulation, it seems that $E[\hat p_1]$ is a bit closer to the truth than $E[\tilde p_1]$.

The discrepancy between the expected value of an estimator, $E[\hat\theta]$, and the true value of the parameter, $\theta$, is called its **bias**.
$$ \text{bias of $\hat\theta$} = E[\hat\theta] - \theta. $$
We generally want bias to be as close to 0 as possible.

Based on our simulations, we can approximate the bias of our estimators at $p_1 = 0.10$.
\begin{align*}
\text{bias of $\hat p_1$ (at $p_1=0.10$)} &= E[\hat p_1] - 0.10 \approx 0 \\
\text{bias of $\tilde p_1$ (at $p_1=0.10$)} &= E[\tilde p_1] - 0.10 \approx 0.130 - 0.10 = 0.030
\end{align*}
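
The definition of bias can also be evaluated numerically at $p_1 = 0.10$, without simulation, by summing $g(x) \cdot P(X = x)$ over all possible values of $X$. This is a sketch using only the Python standard library rather than Symbulate; the helper names `binomial_pmf` and `expected_value` are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def expected_value(g, n, p):
    """E[g(X)] for X ~ Binomial(n, p), summing g(x) * P(X = x)."""
    return sum(g(x) * binomial_pmf(x, n, p) for x in range(n + 1))

n, p1 = 50, 0.10

# Bias = expected value of the estimator minus the true parameter.
bias_hat = expected_value(lambda x: x / n, n, p1) - p1
bias_tilde = expected_value(lambda x: (x + 2) / (n + 4), n, p1) - p1

print(f"bias of p_hat   at p1 = {p1}: {bias_hat:.4f}")
print(f"bias of p_tilde at p1 = {p1}: {bias_tilde:.4f}")
```

Up to floating-point rounding, these are the values that the simulated means approximate. They apply only at $p_1 = 0.10$.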


The simulation assumed that $p_1 = 0.10$. But we do not know what $p_1$ is! The next question asks you to repeat the above steps for other values of $p_1$.

**Question 2.** Try at least 10 different values of $p_1$ between 0 and 1. Calculate the bias of the two estimators $\hat p_1$ and $\tilde p_1$ at these values of $p_1$. What do you notice?

In [None]:
# YOUR CODE HERE

**YOUR ANSWER HERE**

Simulation is a poor way to answer this question, for at least two reasons:

- There are infinitely many possible values of $p_1$, and we can't possibly try all of them.
- Because simulations are random, we can only ever calculate the _approximate_ bias from simulations.
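
The second point can be illustrated by repeating the same simulation with different random seeds. The sketch below uses Python's standard `random` module rather than Symbulate, and the function name `simulate_mean_p_hat` and the seeds are arbitrary choices made for illustration.

```python
import random

def simulate_mean_p_hat(p1, n=50, reps=10_000, seed=0):
    """Average of p_hat = X / n over `reps` simulated experiments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # X counts the aces in n independent rolls.
        x = sum(rng.random() < p1 for _ in range(n))
        total += x / n
    return total / reps

seeds = (1, 2, 3)
means = [simulate_mean_p_hat(0.10, seed=s) for s in seeds]
for s, m in zip(seeds, means):
    print(f"seed {s}: simulated E[p_hat] = {m:.4f}, approx. bias = {m - 0.10:+.4f}")
```

Each run typically gives a slightly different approximate bias, even though the underlying model is the same.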

Fortunately, for these two estimators, we can obtain the exact bias.

**Question 3.** Calculate $E[\hat p_1]$ and $E[\tilde p_1]$ using properties of expected value. (Your answer should depend on $p_1$.) Use this to calculate the bias exactly, for all values of $p_1$.

**YOUR ANSWER HERE**

Hopefully you noticed that the bias of $\hat p_1$ is 0, no matter what $p_1$ is. An estimator with a bias of 0 for all values of the parameter is called **unbiased**.

# Submission Instructions

1. [Go here](https://canvas.calpoly.edu/courses/25458/groups), and add yourself and your partner (if applicable) to one of the STAT 305 Groups.
2. Export this Colab notebook to PDF. The easiest way is File > Print > Save as PDF.
3. Double check that the PDF rendered properly (i.e., nothing is cut off).
4. Upload the PDF [to Canvas](https://canvas.calpoly.edu/courses/25458/assignments/111116). Only one of you needs to upload the PDF.