
In [0]:
#@title Import Symbulate
!pip install -q symbulate
from symbulate import *

# The Sample Variance

$$
\def\Var{\text{Var}}
\def\Cov{\text{Cov}}
$$

## Warm-Up

Suppose $X_1, \ldots, X_n$ are i.i.d. observations from some distribution with mean $\mu$ and variance $\sigma^2$. 

Define $\displaystyle\bar X = \frac{1}{n} \sum_{i=1}^n X_i$ to be the sample mean.

What is $\Var[\bar X]$?

\begin{align}
\Var[\bar X] &= \Var\left[\frac{1}{n} \sum_{i=1}^n X_i\right] \\
&= \Cov\left[\frac{1}{n} \sum_{i=1}^n X_i, \frac{1}{n} \sum_{i=1}^n X_i\right] \\
&= \frac{1}{n^2} \Cov\left[\sum_{i=1}^n X_i, \sum_{i=1}^n X_i\right] \\
&= \frac{1}{n^2} \left( \Cov[X_1, X_1] + \ldots + \Cov[X_n, X_n] + \underbrace{\sum_{i \neq j} \Cov[X_i, X_j]}_{0 \text{ by independence}} \right) \\
&= \frac{1}{n^2} \cdot n \Var[X_1] \\
&= \frac{\sigma^2}{n}
\end{align}

In particular, the standard deviation of $\bar X$ is $\frac{\sigma}{\sqrt{n}}$, often called the standard error of the mean. Many statistical methods, such as confidence intervals for $\mu$, rely on this formula.
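The formula $\Var[\bar X] = \sigma^2 / n$ can be checked with a quick simulation. The sketch below uses only the Python standard library rather than Symbulate, and the sample size, standard deviation, number of repetitions, and seed are all arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(425)  # arbitrary seed, only for reproducibility

n = 25        # sample size (assumed for illustration)
sigma = 2.0   # true standard deviation (assumed for illustration)
reps = 20000  # number of simulated samples

# Draw `reps` samples of size n from a Normal(0, sigma) distribution
# and record the sample mean of each one.
sample_means = [
    statistics.fmean([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(reps)
]

# Variance of the simulated sample means vs. the formula sigma^2 / n.
simulated_var = statistics.pvariance(sample_means)
theoretical_var = sigma ** 2 / n

print(simulated_var, theoretical_var)
```

The two printed numbers should be close to each other, and the agreement should improve as `reps` grows.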

## Estimating the Variance

If you know the mean $\mu$, then the variance $\sigma^2$ can be estimated by 

$$ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2. $$

Show that $\hat\sigma^2$ is unbiased for $\sigma^2$.
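One way to carry out this exercise uses only linearity of expectation and the definition of variance, $\Var[X_i] = E\left[(X_i - \mu)^2\right]$:

\begin{align}
E[\hat\sigma^2] &= \frac{1}{n} \sum_{i=1}^n E\left[(X_i - \mu)^2\right] \\
&= \frac{1}{n} \sum_{i=1}^n \Var[X_i] \\
&= \frac{1}{n} \cdot n \sigma^2 \\
&= \sigma^2
\end{align}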

But what if we do not know the mean $\mu$? We have to estimate the mean $\mu$ by the sample mean $\bar X$. That is, our estimator is of the form 

$$ c \sum_{i=1}^n (X_i - \bar X)^2, $$

where $c$ is some constant. What value of $c$ makes this estimator unbiased for $\sigma^2$?

We need to find $c$ to make 

$$ E\left[c\sum_{i=1}^n (X_i - \bar X)^2\right] = \sigma^2. $$

Let's calculate $E\left[\sum_{i=1}^n (X_i - \bar X)^2\right]$ first.

Start by expanding the expression inside the expected value, adding and subtracting $\mu$. The following calculation is just algebra; no probability is involved at all!
\begin{align}
\sum_{i=1}^n (X_i - \bar X)^2 &= \sum_{i=1}^n (X_i - \mu + \mu - \bar X)^2 \\
&= \sum_{i=1}^n (X_i - \mu)^2 + \underbrace{\sum_{i=1}^n (\bar X - \mu)^2}_{n(\bar X - \mu)^2}  - 2 \underbrace{\sum_{i=1}^n (X_i - \mu)}_{n(\bar X - \mu)}(\bar X - \mu) \\
&= \sum_{i=1}^n (X_i - \mu)^2 + n(\bar X - \mu)^2  - 2 n(\bar X - \mu)^2 \\
&= \sum_{i=1}^n (X_i - \mu)^2 - n(\bar X - \mu)^2
\end{align}
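Because this identity is pure algebra, it holds for any list of numbers and any value of $\mu$. The short check below illustrates this in plain Python; the data values and the choice `mu = 4.0` are made up for the example.

```python
import math

x = [2.0, 3.5, 5.0, 7.5]  # arbitrary numbers, chosen for illustration
mu = 4.0                   # any value works; the identity does not need the true mean

n = len(x)
xbar = sum(x) / n

# Left side: squared deviations from the sample mean.
lhs = sum((xi - xbar) ** 2 for xi in x)

# Right side: squared deviations from mu, minus n * (xbar - mu)^2.
rhs = sum((xi - mu) ** 2 for xi in x) - n * (xbar - mu) ** 2

print(lhs, rhs)
```

Changing `mu` to any other number changes both terms on the right side, but their difference still matches the left side.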

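Now we calculate expected values, using linearity of expectation. Since $E[\bar X] = \mu$, the term $E\left[(\bar X - \mu)^2\right]$ is exactly $\Var[\bar X]$, which equals $\frac{\sigma^2}{n}$ by the warm-up: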
\begin{align}
E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] &= \sum_{i=1}^n E\left[(X_i - \mu)^2\right] - n E\left[(\bar X - \mu)^2\right] \\
&= \sum_{i=1}^n \Var[X_i] - n\Var[\bar X] \\
&= n \sigma^2 - n \frac{\sigma^2}{n} \\
&= (n - 1) \sigma^2
\end{align}

So $c$ must be $\frac{1}{n-1}$ to cancel out the extra factor of $(n-1)$.


To summarize, we have just shown that 

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 $$

is an unbiased estimator of $\sigma^2$. $S^2$ is typically called the **sample variance**.
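For a small discrete distribution, the claim can be checked exactly by enumerating every possible sample. The sketch below assumes, purely for illustration, that each $X_i$ is a fair coin flip (0 or 1 with probability 1/2, so $\sigma^2 = 1/4$) and that $n = 3$. It uses exact fractions from the Python standard library, and the helper name `sample_variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

n = 3  # sample size (assumed for illustration)

# All 2^n equally likely samples of n fair coin flips.
outcomes = list(product([0, 1], repeat=n))

def sample_variance(x):
    """S^2: squared deviations from the sample mean, divided by n - 1."""
    xbar = Fraction(sum(x), len(x))
    return sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)

# Exact expected value of S^2: average over the equally likely outcomes.
expected_s2 = sum(sample_variance(x) for x in outcomes) / len(outcomes)

print(expected_s2)
```

The exact expectation matches the true variance of a fair coin flip, $1/4$.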

Let's do a simulation to confirm that 

1. dividing by $n$ yields an unbiased estimator when you know the mean $\mu$
2. dividing by $n-1$ yields an unbiased estimator when you estimate $\mu$ by $\bar X$.

In [0]:
# Simulate two independent observations from a standard normal.
# We know the true variance in this case is 1.
X = RV(Normal(0, 1) ** 2)

def var_known_mean(x):
  n = len(x)
  return sum((x - 0) ** 2) / n

def var_unknown_mean(x):
  n = len(x)
  return sum((x - mean(x)) ** 2) / (n - 1)

var_known = X.sim(10000).apply(var_known_mean)
var_known.plot()
var_unknown = X.sim(10000).apply(var_unknown_mean)
var_unknown.plot()

# If the estimators are unbiased, both simulated means should be close to 1.
var_known.mean(), var_unknown.mean()

But if we estimate the mean by $\bar X$ and divide by $n$ (instead of $n-1$), then we end up with a biased estimate of the variance.
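The size of the bias follows from the expected value computed earlier, $E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] = (n-1)\sigma^2$:

\begin{align}
E\left[\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2\right] &= \frac{1}{n} (n - 1) \sigma^2 \\
&= \frac{n-1}{n} \sigma^2
\end{align}

So dividing by $n$ underestimates $\sigma^2$ on average. With $n = 2$ and $\sigma^2 = 1$, as in the simulation, the expected value is $\frac{1}{2}$.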

In [0]:
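def var_biased(x):
  n = len(x)
  # Centers at the sample mean but divides by n instead of n - 1.
  return sum((x - mean(x)) ** 2) / n

# Store the results under a new name so the function var_biased is not overwritten.
var_biased_sims = X.sim(10000).apply(var_biased)
var_biased_sims.plot()

# An unbiased estimator would average 1, but this one averages
# about (n - 1) / n = 0.5 for n = 2.
var_biased_sims.mean()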