A large part of statistical inference deals with estimating properties of a random process from given data. In the previous notebook we talked about the maximum likelihood estimator, the estimator obtained by maximizing the likelihood of the data $x$ under a process with parameters $\theta$.


$$
\hat{\theta}_{mle}=\arg\max_{\theta}\left\{ p\left(x|\theta\right)\right\} 
$$

There are other estimators we can use. We talked about how we can
approximate (i.e. estimate!) the variance of a random process using
$n$ samples $\left\{ x_{1}\ldots x_{n}\right\} $ as:

$$
s_{n}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}_{n}\right)^{2}
$$

And we talked about how we could alter that estimator using Bessel's
correction to yield a different estimator of the variance: 
$$
s_{n-1}=\frac{n}{n-1}s_{n}=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\overline{x}_{n}\right)^{2}
$$

We talked about how this correction might be desirable (sometimes)
because it yields an unbiased estimator. 
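
To see the bias concretely, here is a minimal simulation sketch of the two variance estimators above. The Gaussian population, its variance of 4, the sample size of 5 and the seed are all arbitrary assumptions chosen for illustration; `ddof` is NumPy's argument for the divisor ($n-\text{ddof}$).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
true_var = 4.0                   # assumed population variance (sigma = 2)
n = 5                            # small sample size, where the bias is most visible
n_trials = 100_000               # number of repeated samples

# Each row is one sample of size n drawn from N(0, true_var)
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

s_n = samples.var(axis=1, ddof=0)          # divides by n
s_n_minus_1 = samples.var(axis=1, ddof=1)  # divides by n - 1 (Bessel's correction)

# Averaging over many samples approximates the expected value of each estimator
mean_s_n = s_n.mean()
mean_s_n_minus_1 = s_n_minus_1.mean()
print("average s_n     :", mean_s_n)
print("average s_{n-1} :", mean_s_n_minus_1)
```

Under these assumptions the average of $s_{n}$ should land near $\frac{n-1}{n}\sigma^{2}$, below the true variance, while the average of $s_{n-1}$ should land near $\sigma^{2}$.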

# Method of moments 


We can build estimators in other ways. For example, we can use the
method of moments, where we estimate the parameters of the distribution
by matching the process moments with the sample moments. This might be
more evident with a few examples:

## Bernoulli Distribution

Let's say we have a sample of size $n$, $\mathbf{x}=\left\{ x_{1},x_{2},\ldots,x_{n}\right\}$, drawn from a coin with probability
$p$ of landing heads, $X\sim Bern\left(p\right)$. We want to estimate the value of the parameter $p$. The first
moment of the process (or population) is

$$ \mu=E\left(X\right)=p $$

And the sample mean is:

$$ \overline{x}_{n}=\frac{1}{n}\sum_{i=1}^{n}x_{i}=\frac{n_{1}}{n_{0}+n_{1}} $$

where $n_{0}$ and $n_{1}$ are the number of tails (zeros) and heads
(ones). The method of moments involves matching the process and sample moments. Setting $\mu=\overline{x}_{n}$, we get

$$ \hat{p}_{mm}=\frac{n_{1}}{n_{0}+n_{1}} $$

Notice that in this case, the method of moments estimator $\hat{p}_{mm}$
is the same as the maximum likelihood estimator $\hat{p}_{mle}$.
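
The Bernoulli case can be sketched in a few lines. The value `p_true = 0.3`, the sample size and the seed are assumptions made up for this illustration. The grid search at the end is only a crude numerical check that the log-likelihood $n_{1}\log p+n_{0}\log(1-p)$ peaks at the same place as the method of moments estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3                           # assumed probability of heads
n = 10_000
x = rng.binomial(1, p_true, size=n)    # 1 = heads, 0 = tails

n1 = int(x.sum())                      # number of heads
n0 = n - n1                            # number of tails

# Method of moments: set the population mean p equal to the sample mean
p_mm = n1 / (n0 + n1)

# Crude check of the MLE: evaluate the Bernoulli log-likelihood on a grid
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = n1 * np.log(p_grid) + n0 * np.log(1.0 - p_grid)
p_grid_best = p_grid[np.argmax(log_lik)]

print("method of moments :", p_mm)
print("grid-search MLE   :", p_grid_best)
```

The two printed values should agree up to the grid spacing of 0.001.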

The same holds for a Gaussian distribution, where $\hat{\mu}_{mle}=\overline{x}_{n}$
and $\hat{\sigma}^{2}_{mle}=s_{n}$. But that is not universally true:

## Gamma Distribution

### Method of moments

The process/population moments of $X\sim\Gamma\left(\alpha,\beta\right)$
are

$$ \begin{cases}
m_{1}=E(X)=\mu=\frac{\alpha}{\beta}\\
m_{2}=E\left(X^{2}\right)=E\left(X^{2}\right)-E(X)^{2}+E(X)^{2}=\sigma^{2}+\mu^{2}=\frac{\alpha}{\beta^{2}}+\frac{\alpha^{2}}{\beta^{2}}
\end{cases} $$

By matching these with the sample moments we get the equations for the
method of moments estimators:

$$
\begin{cases}
\hat{\mu}=\overline{x}_{n}=\frac{\hat{\alpha}_{mm}}{\hat{\beta}_{mm}}\\
\hat{\sigma}^{2}+\hat{\mu}^{2}=\overline{x}_{n}^{2}+s_{n}=\frac{\hat{\alpha}_{mm}^{2}}{\hat{\beta}_{mm}^{2}}+\frac{\hat{\alpha}_{mm}}{\hat{\beta}_{mm}^{2}}
\end{cases}
$$

which yields

$$
\begin{cases}
\hat{\alpha}_{mm}=\frac{\overline{x}_{n}^{2}}{s_{n}}\\
\hat{\beta}_{mm}=\frac{\overline{x}_{n}}{s_{n}}
\end{cases}
$$
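
As a quick sanity check of these closed-form expressions, the sketch below draws synthetic gamma data and applies them. The true values $\alpha=3$, $\beta=2$, the sample size and the seed are assumptions chosen for illustration. NumPy's sampler is parameterized by shape and scale, so we pass `scale = 1 / beta` to match the rate parameterization used here.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true, beta_true = 3.0, 2.0       # assumed shape and rate
n = 50_000

# NumPy uses (shape, scale); the rate beta corresponds to scale = 1 / beta
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=n)

x_bar = x.mean()                       # first sample moment
s_n = x.var(ddof=0)                    # variance estimator that divides by n

# Closed-form method of moments estimators
alpha_mm = x_bar**2 / s_n
beta_mm = x_bar / s_n

print("alpha_mm:", alpha_mm)
print("beta_mm :", beta_mm)
```

With a sample this large, both estimates should be close to the assumed true values, and by construction $\hat{\alpha}_{mm}/\hat{\beta}_{mm}$ reproduces the sample mean.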

### Maximum likelihood

By contrast, the maximum likelihood estimator is obtained by maximizing
the log-likelihood, written here in the same rate parameterization
(mean $\alpha/\beta$) used above:

$$
\log\mathcal{L}\left(\alpha,\beta\right)=\sum_{i=1}^{n}\left(\alpha\log\beta-\log\Gamma(\alpha)+\left(\alpha-1\right)\log x_{i}-\beta x_{i}\right)
$$

and the maximum likelihood estimators for $\alpha$ and $\beta$ are obtained by
setting the partial derivatives to zero. Thus the maximum likelihood estimators satisfy

$$
\begin{cases}
\frac{\partial\log\mathcal{L}}{\partial\alpha}\left(\hat{\alpha}{}_{mle},\hat{\beta}{}_{mle}\right)=0\\
\frac{\partial\log\mathcal{L}}{\partial\beta}\left(\hat{\alpha}{}_{mle},\hat{\beta}{}_{mle}\right)=0
\end{cases}
$$

$$
\begin{cases}
n\left(\ln\hat{\beta}{}_{mle}-\frac{d}{d\alpha}\ln\Gamma\left(\hat{\alpha}{}_{mle}\right)\right)+\sum_{i=1}^{n}\ln\left(x_{i}\right)=0\\
n\frac{\hat{\alpha}{}_{mle}}{\hat{\beta}{}_{mle}}-\sum_{i=1}^{n}x_{i}=0
\end{cases}
$$

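Solving the second equation for $\hat{\beta}{}_{mle}$, substituting it into the first, and writing $\overline{\ln\left(x\right)}_{n}=\frac{1}{n}\sum_{i=1}^{n}\ln\left(x_{i}\right)$ for the sample mean of the logs, this can be rewritten as: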

$$
\begin{cases}
n\left(\ln\hat{\alpha}{}_{mle}-\ln\overline{x}_{n}-\frac{d}{d\alpha}\ln\Gamma\left(\hat{\alpha}{}_{mle}\right)+\overline{\ln\left(x\right)}_{n}\right)=0\\
\hat{\beta}{}_{mle}=\frac{\hat{\alpha}{}_{mle}}{\overline{x}_{n}}
\end{cases}
$$
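
The first equation has no closed-form solution for $\hat{\alpha}_{mle}$, so in practice it is solved numerically. Below is a minimal sketch using a root finder: `digamma` from `scipy.special` is the function $\frac{d}{d\alpha}\ln\Gamma(\alpha)$, and `brentq` from `scipy.optimize` finds a root inside a bracket. The true parameters, the seed and the bracket `[1e-3, 1e3]` are assumptions for this illustration; the bracket must contain the root, which it does for data like this but is not guaranteed in general.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(3)
alpha_true, beta_true = 3.0, 2.0       # assumed shape and rate
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=50_000)

x_bar = x.mean()                       # sample mean
mean_log_x = np.log(x).mean()          # sample mean of the logs

def score_alpha(alpha):
    """Left-hand side of the first equation above, divided by n."""
    return np.log(alpha) - np.log(x_bar) - digamma(alpha) + mean_log_x

# The function is positive for small alpha and negative for large alpha,
# so a sign change exists inside this (assumed wide enough) bracket.
alpha_mle = brentq(score_alpha, 1e-3, 1e3)

# The second equation then gives beta directly
beta_mle = alpha_mle / x_bar

print("alpha_mle:", alpha_mle)
print("beta_mle :", beta_mle)
```

Running the method of moments formulas on the same data gives estimates that are close to, but not equal to, these values.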

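Notice that for the gamma distribution, the maximum likelihood and method of moments
estimators are not identical. Indeed, the `stats.gamma.fit` function
in scipy can fit the parameters with either approach through its `method`
argument (`"MLE"` or `"MM"`). Note that scipy parameterizes the gamma
distribution by shape and scale, so its scale corresponds to $1/\beta$. 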
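
A short sketch of fitting with both approaches in scipy follows. It assumes a reasonably recent SciPy in which `fit` accepts the `method` keyword, and it fixes the location at zero with `floc=0` so that only the shape and scale are fitted. The true parameters and the seed are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha_true, beta_true = 3.0, 2.0       # assumed shape and rate
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=20_000)

# fit returns (shape, loc, scale); floc=0 pins loc at zero
a_mle, _, scale_mle = stats.gamma.fit(x, floc=0, method="MLE")
a_mm, _, scale_mm = stats.gamma.fit(x, floc=0, method="MM")

# Convert scipy's scale to the rate beta used in this notebook
beta_mle = 1.0 / scale_mle
beta_mm = 1.0 / scale_mm

print("MLE: alpha =", a_mle, " beta =", beta_mle)
print("MM : alpha =", a_mm, " beta =", beta_mm)
```

The two fits should give similar, but generally not identical, parameter values.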

We can build yet other estimators. Later in the course we will spend
a lot of time looking at a class of estimators called regularized
estimators. But in principle, we could build any kind of estimator.
For example, we could always estimate the mean of the population
to be zero, regardless of the data. That would be a very poor estimator,
but nonetheless it would be an estimator. 
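
To make "very poor" a little more concrete, one option is to compare mean squared errors in a simulation. The sketch below assumes a hypothetical Gaussian population with mean 2 and standard deviation 1, and samples of size 20; none of these values come from the course material.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma, n = 2.0, 1.0, 20       # hypothetical population and sample size
n_trials = 20_000

# Each row is one sample of size n
samples = rng.normal(mu_true, sigma, size=(n_trials, n))

sample_mean_est = samples.mean(axis=1)  # the usual estimator of the mean
zero_est = np.zeros(n_trials)           # the "always guess zero" estimator

# Mean squared error of each estimator around the true mean
mse_sample_mean = np.mean((sample_mean_est - mu_true) ** 2)
mse_zero = np.mean((zero_est - mu_true) ** 2)

print("MSE of sample mean    :", mse_sample_mean)
print("MSE of zero estimator :", mse_zero)
```

The zero estimator has no variance at all, but its error is entirely bias ($\mu^{2}$), so it only does well when the true mean happens to be near zero.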



