<a href="https://colab.research.google.com/github/dlsun/Stat305-S20/blob/master/colabs/notebooks/STAT_305_Notebook_3_Estimating_the_Variance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I encourage you to work through this notebook with a partner so that you can discuss your answers. You should meet over an application such as Discord or Zoom. One person can share their screen with this notebook open.

In [None]:
# This is a code cell.
# To run the code in this cell, click on it and press the "Play" button.
!pip install -q symbulate
from symbulate import *
import matplotlib.pyplot as plt

# An Interlude

Before we get to the main meat of today's lesson, let's learn a fact that is interesting in its own right and useful for today's lesson.

Given a series of numbers, $x_1, x_2, \ldots, x_n$, what value $m$ minimizes the _sum of squared distances to the numbers_? That is, we want the value of $m$ that minimizes 
$$  \sum_{i=1}^n (x_i - m)^2. $$

**Question 1.** To find the value of $m$ that minimizes $\sum_{i=1}^n (x_i - m)^2$, we will first rewrite this expression. (Fill in the intermediate steps in the calculation below. You may want paper and pencil handy for this one.)

\begin{align}
\sum_{i=1}^n (x_i - m)^2 &= \sum_{i=1}^n ((x_i - \bar x) - (m - \bar x))^2 & \text{(add and subtract $\bar x$)} \\
&= ... \\
&= ... \\
&= \sum_{i=1}^n (x_i - \bar x)^2 + n(m - \bar x)^2
\end{align}

From the final form of the expression, it is clear that the $m$ that minimizes the expression is $????$ because the first term does not depend on $m$ and the second term cannot be negative and can be made equal to zero by setting $m = ????$. (Replace $????$ with the appropriate expression.)
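If you want to sanity-check your algebra, the first and last lines of the calculation can be compared numerically. The sketch below is only an illustration: the data `xs` and the trial value `m` are made up, and it uses plain Python rather than Symbulate.

```python
import random

random.seed(0)  # fixed seed so the check is reproducible

# Hypothetical data and trial value of m, chosen only for illustration.
xs = [random.gauss(0, 1) for _ in range(10)]
m = 0.7

n = len(xs)
x_bar = sum(xs) / n

# Left-hand side: sum of squared distances from the data to m.
lhs = sum((x - m) ** 2 for x in xs)

# Right-hand side: the final form of the expression in Question 1.
rhs = sum((x - x_bar) ** 2 for x in xs) + n * (m - x_bar) ** 2

print(lhs, rhs)  # the two values should agree up to floating-point rounding
```

Try a few different values of `m`; the two sides should match every time.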

# Estimating the Variance

In this lesson, we discuss how to estimate the variance parameter $\sigma^2 \overset{\text{def}}{=} \text{Var}[X]$ from i.i.d. data. This is useful, for example, if we want to estimate the precision of our measuring device in the measurement error example (Example 3 from the previous notebook). 

For concreteness, suppose we observe $X_1, X_2, \ldots, X_n$ from a $\text{Normal}(\mu, \sigma)$ distribution. We will consider estimating $\sigma^2$ in two cases:

1. $\mu$ is known.
2. $\mu$ is not known.

## Case 1. $\mu$ known

When $\mu \overset{\text{def}}{=} E[X]$ is known, we might as well use it. Since variance is defined as 
$$ \text{Var}[X] = E[(X - E[X])^2] = E[(X - \mu)^2], $$
it seems reasonable to estimate it by 
$$ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2. $$

Let's do a simulation for $n=3$. We arbitrarily set $\mu=5.2$ and $\sigma=1.2$, and see if the estimator above is able to recover $\sigma^2 = 1.44$.

In [None]:
n = 3
mu = 5.2
sigma = 1.2

X = RV(Normal(mu, sigma) ** n)

def estimator(data):
  return mean((data - mu) ** 2)

estimates = X.apply(estimator).sim(100000)

estimates.plot()
estimates.mean()

**Question 2.** Based on the simulations above, does the estimator $\hat\sigma^2$ seem to be unbiased? Can you _prove_ it? (Hint: Use linearity of expectation and the fact that $E[(X_i - \mu)^2] = \text{Var}[X_i] = \sigma^2$ by the definition of variance.)

_YOUR EXPLANATION HERE_

## Case 2. $\mu$ unknown

In most situations, we don't know $\mu$. For example, in the measurement error example, $\mu$ would be the true weight of NB 10. But we don't know the true weight of NB 10---that's why we're measuring it in the first place!

If $\mu$ is not known, it makes sense to try to estimate it. In the last notebook, we saw that the sample mean $\bar X$ is an unbiased estimate of $\mu$. So, as a first guess, let's try the estimator:

$$ S_0^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 $$

**Question 3.** Modify the simulation above to estimate the distribution of $S_0^2$. Does it seem to be unbiased for estimating $\sigma^2$?

In [None]:
# YOUR CODE HERE

_YOUR EXPLANATION HERE_

Question 1 makes it clear why $S_0^2$ will consistently underestimate the variance. In Question 1, we saw that $\bar X$ is the value of $m$ that makes $\sum_{i=1}^n (X_i - m)^2$ smallest. 

In particular, this means that $S_0^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2$ can never be larger than $\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2$, and it is strictly smaller whenever $\bar X \neq \mu$. Since $\hat\sigma^2$ is unbiased, $S_0^2$ cannot also be unbiased; it will be too small on average.
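The comparison between $S_0^2$ and $\hat\sigma^2$ can also be checked sample by sample. The following is a minimal sketch in plain Python (not Symbulate), reusing $n=3$, $\mu=5.2$, and $\sigma=1.2$ from Case 1; the number of repetitions `reps` and the variable names are arbitrary choices made for this illustration.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

n, mu, sigma = 3, 5.2, 1.2  # same values as the Case 1 simulation
reps = 20000                # number of simulated samples (arbitrary choice)

count_violations = 0        # samples where S_0^2 exceeds sigma-hat^2
total_s0, total_hat = 0.0, 0.0

for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(xs) / n
    s0_sq = sum((x - x_bar) ** 2 for x in xs) / n      # centers at the sample mean
    sigma_hat_sq = sum((x - mu) ** 2 for x in xs) / n  # centers at the true mean
    if s0_sq > sigma_hat_sq + 1e-12:                   # small tolerance for rounding
        count_violations += 1
    total_s0 += s0_sq
    total_hat += sigma_hat_sq

mean_s0 = total_s0 / reps
mean_hat = total_hat / reps

print(count_violations)   # number of samples that violate the inequality
print(mean_s0, mean_hat)  # average of each estimator over all samples
```

Compare the two averages with $\sigma^2 = 1.44$.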

Let's try again. We will show that 
$$ E\big[\sum_{i=1}^n (X_i - \bar X)^2 \big] = c \sigma^2 $$
for some constant $c$. Then, $c$ will be the number we have to divide by, to obtain an unbiased estimator of $\sigma^2$.

**Question 4.** Complete the calculation below. What is $c$? (You may want paper and pencil handy for this one.)

\begin{align}
E\big[ \sum_{i=1}^n (X_i - \bar X)^2 \big] &= ... \\
&= ... \\
&= ...
\end{align}

Hints:

- Use the expression you derived in Question 1 to rewrite $\sum_{i=1}^n (X_i - \bar X)^2$ in terms of $\sum_{i=1}^n (X_i - \mu)^2$ and $n(\bar X - \mu)^2$.
- Note that $E[(\bar X - \mu)^2]$ is just the definition of $\text{Var}[\bar X]$, which we have calculated many times before.

Your calculation in Question 4 shows that an unbiased estimator of the variance $\sigma^2$ is the **sample variance**

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2. $$

(Perhaps you have always wondered why we divide by $n-1$ in the definition of the sample standard deviation. Now you know! It is to make $S^2$ unbiased for estimating $\sigma^2$.)
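Python's standard library offers both divisors: `statistics.variance` divides by $n-1$ and `statistics.pvariance` divides by $n$. The sketch below compares them with the formulas for $S^2$ and $S_0^2$ on a small made-up data set (the list `xs` is hypothetical and is not the NB 10 data).

```python
import statistics

xs = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, for illustration only

n = len(xs)
x_bar = sum(xs) / n
ss = sum((x - x_bar) ** 2 for x in xs)  # sum of squared deviations from the sample mean

s0_sq = ss / n        # divides by n
s_sq = ss / (n - 1)   # divides by n - 1 (the sample variance)

print(s0_sq, statistics.pvariance(xs))  # these two should agree
print(s_sq, statistics.variance(xs))    # and so should these two
```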

# Application

The 100 measurements of the weight of NB 10 produced the following data (in micrograms below 10 grams).

In [None]:
data = [409,400,406,399,402,406,401,403,401,403,398,403,407,402,401,399,400,401,405,402,408,399,399,402,399,397,407,401,399,401,403,400,410,401,407,423,406,406,402,405,405,409,399,402,407,406,413,409,404,402,404,406,407,405,411,410,410,410,401,402,404,405,392,407,406,404,403,408,404,407,412,406,409,400,408,404,401,404,408,406,408,406,401,412,393,437,418,415,404,401,401,407,412,375,409,406,398,406,403,404]

**Question 5.** Calculate an unbiased estimate of the precision $\sigma^2$ of the measuring device.

Note: The term "unbiased estimate" is a misnomer. Unbiasedness is a property of the procedure (i.e., the estimator), not of an individual estimate. However, statisticians use the term "unbiased estimate" to refer to an estimate that comes from an unbiased estimator.

In [None]:
# YOUR CODE HERE

_YOUR EXPLANATION HERE_

# General Theory

If $X_1, X_2, \ldots, X_n$ are i.i.d. from _any_ distribution with $\text{Var}[X_1] = \sigma^2$, then 
$$ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 $$
is unbiased for estimating $\sigma^2$.
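To see that nothing here is special to the normal distribution, we can repeat the experiment with a skewed distribution. The sketch below is one possible check, not a proof: it assumes an Exponential distribution with rate 1 (so $\sigma^2 = 1$), uses plain Python instead of Symbulate, and the values of `n` and `reps` are arbitrary.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is reproducible

n = 3           # sample size (arbitrary choice)
reps = 20000    # number of simulated samples (arbitrary choice)
true_var = 1.0  # variance of an Exponential distribution with rate 1

s_sq_values = []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    s_sq_values.append(statistics.variance(xs))  # divides by n - 1

mean_s_sq = sum(s_sq_values) / reps
print(mean_s_sq)  # should land close to true_var
```

The average of the simulated values of $S^2$ should be near $1$, up to simulation error.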

# Submission Instructions

1. If you worked with a different partner on this notebook than on the previous notebooks, [go here](https://canvas.calpoly.edu/courses/25458/groups) and add both you and your partner (if applicable) to one of the STAT 305 Groups.
2. Export this Colab notebook to PDF. The easiest way is File > Print > Save as PDF.
3. Double check that the PDF rendered properly (i.e., nothing is cut off).
4. Upload the PDF [to Canvas](https://canvas.calpoly.edu/courses/25458/assignments/111815). Only one of you needs to upload the PDF.