<a href="https://colab.research.google.com/github/dlsun/Stat305-S20/blob/master/colabs/notebooks/STAT_305_Notebook_2_More_on_Bias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I encourage you to work through this notebook with a partner so that you can discuss your answers. You should meet over an application such as Discord or Zoom. One person can share their screen with this notebook open.

In [None]:
# This is a code cell.
# To run the code in this cell, click on it and press the "Play" button.
!pip install -q symbulate
from symbulate import *
import matplotlib.pyplot as plt

# Example 2: Estimating the Rate of a Poisson Process

Suppose we want to estimate the background radiation levels in town. That is, we know that radioactive particles should hit a Geiger counter according to a Poisson process, and our goal is to estimate the rate $\lambda$ (in particles per second). 

We record the number of particles in each of 10 consecutive 1-second intervals:
$$ 0, 3, 1, 0, 0, 1, 0, 2, 0, 4. $$
We estimate $\lambda$ by the sample mean:
$$ \hat\lambda = \frac{0 + 3 + 1 + 0 + 0 + 1 + 0 + 2 + 0 + 4}{10} = 1.1. $$
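
The arithmetic above can be reproduced in plain Python. This is a minimal sketch; the names `counts` and `lam_hat_observed` are ours and are chosen so that they do not clash with the `lam_hat` random variable defined later in the notebook.

```python
# Observed particle counts in the ten 1-second intervals (data from above).
counts = [0, 3, 1, 0, 0, 1, 0, 2, 0, 4]

# The estimate is the sample mean: total count divided by the number of intervals.
lam_hat_observed = sum(counts) / len(counts)
print(lam_hat_observed)  # 1.1
```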

Is this a good estimate or not?

In the last lesson, you learned that we cannot tell whether any individual estimate is good or bad. We can only evaluate the procedure for coming up with estimates. The procedure is called an **estimator**.

In this case, the estimator based on data $X_1, X_2, \ldots, X_{10}$ is 
$$ \hat\lambda = \bar X \overset{\text{def}}{=} \frac{X_1 + X_2 + \cdots + X_{10}}{10}. $$
The $X_i$s in this case are i.i.d. $\text{Poisson}(\mu = \lambda \cdot 1)$ random variables, since they represent the number of arrivals on non-overlapping intervals of length 1 second.

Let's simulate the distribution of $\hat\lambda$ to get a sense of how good it is. Of course, to do the simulation, we need to assume a value for the true rate $\lambda$. Let's start by assuming $\lambda = 0.8$.

In [None]:
lam = 0.8
lam_hat = RV(Poisson(lam) ** 10, mean)
lam_hat.sim(10000).plot()

The estimator seems to be centered around the true rate $\lambda = 0.8$, which is good. We can check this by using simulation to approximate the expected value of $\hat\lambda$.

In [None]:
lam_hat.sim(10000).mean()
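
For readers who want to see what the Symbulate cells are doing under the hood, here is a rough cross-check written with NumPy instead of Symbulate. It is only a sketch: the seed and the names `lam_true`, `sim_counts`, and `lam_hat_sims` are our own choices, and the printed value will differ slightly from the Symbulate result because the random draws are different.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary) so the run is reproducible

lam_true = 0.8      # assumed true rate, matching the cells above
n_intervals = 10    # ten 1-second intervals per experiment
n_sims = 10000      # number of simulated experiments

# Each row is one simulated experiment: ten independent Poisson(lam_true) counts.
sim_counts = rng.poisson(lam_true, size=(n_sims, n_intervals))

# Apply the estimator (the sample mean) to every simulated experiment.
lam_hat_sims = sim_counts.mean(axis=1)

# The average of the simulated estimates approximates E[lam_hat].
print(lam_hat_sims.mean())
```

Changing `lam_true` and rerunning is one way to explore how the estimator behaves for other rates.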

**Question 1.** Simulate the distribution of $\hat\lambda$ for at least 10 different values of $\lambda$. What appears to be the bias of $\hat\lambda$ at each value of $\lambda$ you tried?

In [None]:
# YOUR CODE HERE

**YOUR EXPLANATION HERE**

Of course, the only way to be sure that $\hat\lambda$ is unbiased is to calculate the expected value exactly, rather than relying on simulation.

**Question 2.** Calculate the bias of $\hat\lambda$, using properties of expected value (especially linearity of expectation). Recall that $E[X_i] = \mu$ for a $\text{Poisson}(\mu)$ distribution.

**YOUR EXPLANATION HERE**

# Example 3: Measurement Error Model

Before completing this example, make sure you have read [this excerpt about how the National Bureau of Standards estimates the weight of a kilogram](https://github.com/dlsun/Stat305-S20/raw/master/MeasurementError.pdf).

The true weight of NB10 is an unknown number $\mu$. It is very close to 10 grams (between 9.999 and 10.000 grams), so we will report all of our measurements in micrograms below 10 grams.

The 100 measurements of the weight of NB10 produced the following data (in micrograms below 10 grams).

In [None]:
data = [409,400,406,399,402,406,401,403,401,403,398,403,407,402,401,399,400,401,405,402,408,399,399,402,399,397,407,401,399,401,403,400,410,401,407,423,406,406,402,405,405,409,399,402,407,406,413,409,404,402,404,406,407,405,411,410,410,410,401,402,404,405,392,407,406,404,403,408,404,407,412,406,409,400,408,404,401,404,408,406,408,406,401,412,393,437,418,415,404,401,401,407,412,375,409,406,398,406,403,404]

To estimate the true weight of NB10 from these 100 measurements, it makes sense to take their mean.

In [None]:
mean(data)

Our estimate for the weight of NB10 is 404.59 micrograms below 10 grams, or 9.99959541 grams.
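
The unit conversion in the previous sentence can be checked with a few lines of Python. This is just a sketch of the arithmetic; the variable names are ours.

```python
# Estimate in micrograms below 10 grams (the sample mean reported above).
estimate_micrograms_below = 404.59

# One microgram is 1e-6 grams, so subtract the shortfall from 10 grams.
estimate_grams = 10 - estimate_micrograms_below * 1e-6
print(f"{estimate_grams:.8f}")  # 9.99959541
```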

Is this a good estimate or not? It is hard to tell. All we can do is evaluate the estimator, that is, our procedure for coming up with this estimate.

Let's assume that our 100 measurements $X_1, X_2, \ldots, X_{100}$ are i.i.d. draws from a $\text{Normal}(\mu, \sigma)$ distribution, where both $\mu$ and $\sigma$ are unknown.

- $\mu$ corresponds to the true weight of NB10.
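- $\sigma$ is the standard deviation of the measurements, which reflects the precision of our machine (a smaller $\sigma$ means more precise measurements).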

Our estimator of $\mu$ is the sample mean 
$$ \hat\mu = \bar X \overset{\text{def}}{=} \frac{\sum_{i=1}^{100} X_i}{100}. $$
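
To make the distinction between the estimator (a procedure) and an estimate (one number) concrete, here is a small NumPy sketch of the measurement model. The function name `simulate_mu_hat`, the seed, and the parameter values `mu_assumed = 405.0` and `sigma_assumed = 6.0` are hypothetical choices made only for illustration; the true $\mu$ and $\sigma$ are unknown.

```python
import numpy as np

def simulate_mu_hat(mu, sigma, n=100, rng=None):
    """Draw n Normal(mu, sigma) measurements and return their sample mean."""
    if rng is None:
        rng = np.random.default_rng()
    measurements = rng.normal(loc=mu, scale=sigma, size=n)
    return measurements.mean()

rng = np.random.default_rng(305)  # arbitrary seed for reproducibility

# Hypothetical parameter values, in micrograms below 10 grams.
mu_assumed, sigma_assumed = 405.0, 6.0

# Repeating the whole 100-measurement experiment gives a different estimate each time.
estimates = [simulate_mu_hat(mu_assumed, sigma_assumed, rng=rng) for _ in range(5)]
print(estimates)
```

Trying other values of `mu_assumed` and `sigma_assumed` may help you form a conjecture before doing the exact calculation.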

**Question 3.** Calculate the expected value and bias of the estimator $\hat\mu$. Is it unbiased? Does your answer depend on what $\sigma$ is?

**YOUR ANSWER HERE**

# General Theory

If $X_1, X_2, \ldots, X_n$ are i.i.d. from _any_ distribution, then the sample mean 
$$\bar X \overset{\text{def}}{=} \frac{X_1 + X_2 + \cdots + X_n}{n} $$
is an unbiased estimator of the true mean $\mu \overset{\text{def}}{=} E[X_1]$.
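
As an informal illustration that the result is not specific to the Poisson or normal distributions, the sketch below simulates the sample mean of i.i.d. Exponential data using NumPy. The choice of distribution, the true mean of 2.0, the sample size, and the seed are all assumptions made for this example, and a simulation like this can only suggest unbiasedness, not prove it.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

true_mean = 2.0   # hypothetical true mean of the Exponential distribution
n = 10            # sample size in each simulated experiment
n_sims = 100000   # number of simulated experiments

# Each row is one i.i.d. sample of size n from an Exponential with mean true_mean.
samples = rng.exponential(scale=true_mean, size=(n_sims, n))

# Sample mean of each simulated experiment.
xbar_sims = samples.mean(axis=1)

# The average of the sample means should be close to true_mean.
print(xbar_sims.mean())
```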

**Question 4.** This general theory directly implies that the estimators $\hat\lambda$ in Example 2 and $\hat\mu$ in Example 3 are unbiased. It also implies that the estimator $\hat p$ in Example 1 (from the previous notebook) is unbiased. Explain why.

**YOUR ANSWER HERE**

# Submission Instructions

1. If you worked with a different partner on this notebook than on the previous notebook, [go here](https://canvas.calpoly.edu/courses/25458/groups) and add both of you (if applicable) to one of the STAT 305 Groups.
2. Export this Colab notebook to PDF. The easiest way is File > Print > Save as PDF.
3. Double-check that the PDF rendered properly (i.e., nothing is cut off).
4. Upload the PDF [to Canvas](https://canvas.calpoly.edu/courses/25458/assignments/111361). Only one of you needs to upload the PDF.