## ML Basics: Maximum Likelihood Estimation

#### Introduction

This is the first in a set of little blog-style posts that I'm creating to get a better grasp on machine learning concepts. I'm mainly following the book "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville [1], but if any other resources are used I'll cite them underneath this introduction. While a lot of these examples are taken from [1], I think it sometimes helps to provide some context or explanation for an equation, which is what I'm going to try to do throughout this series. If you see any mistakes please feel free to email me at adibfixeshismistakes@gmail.com.


#### Additional References
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 [2] (Jonny Brooks-Bartlett)

https://www.probabilitycourse.com/chapter8/8_2_0_point_estimation.php [3] (Hossein Pishro-Nik)

https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55 [4] (Maksym Zavershynskyi)

#### Estimators

Estimators are functions that attempt to provide the single best estimate ($\hat{\theta}$) of some quantity of interest (${\theta}$), where the true value $\theta$ is some fixed quantity for the distribution. If you're thinking "wow this is a very vague definition", you're right, it is! Let {$x^{(1)}, x^{(2)}, ..., x^{(m)}$} be a set of independent, identically distributed data points collected by sampling some random variable $X$. A *point estimator* is any function $g$ of the data such that:

$$\hat{\theta} = g(x^{(1)}, x^{(2)}, ..., x^{(m)})$$

which means that pretty much any function of the data can be considered an estimator. If we are sampling some random variable ${X}$ with an unknown parametric probability density, we can estimate the parameters of the model by making an educated guess about the type of distribution the sample data best resembles.

Using an example from [1], suppose we have our set of samples and they are distributed according to some Gaussian distribution with unknown parameters $\mu$ and $\sigma^2$. We have:

$$ P(x^{(i)}; \mu, \sigma^2) = N(x^{(i)}; \mu, \sigma^2) $$

$$ P(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right)$$

We don't know $\mu$ or $\sigma^2$, just that the sampled data may be modelled by a Gaussian distribution, but we can estimate $\mu$ by taking the average value of all of the sampled points. This gives us the estimator $\hat{\mu}$:

$$\hat{\mu} = \frac{1}{m}\sum^{m}_{i=1}x^{(i)} $$
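As a quick sanity check, here is a minimal sketch of this estimator in plain Python. The values `true_mu` and `true_sigma`, the seed and the sample size are assumptions I picked purely for illustration; in a real problem they would be unknown.

```python
import random

def sample_mean(samples):
    """Point estimator g(x^(1), ..., x^(m)) for the Gaussian mean."""
    return sum(samples) / len(samples)

rng = random.Random(0)            # fixed seed so the run is repeatable
true_mu, true_sigma = 2.0, 1.5    # assumed "unknown" parameters (illustrative)

# Draw m = 10,000 i.i.d. samples from N(true_mu, true_sigma^2).
samples = [rng.gauss(true_mu, true_sigma) for _ in range(10_000)]

mu_hat = sample_mean(samples)
print(f"mu_hat = {mu_hat:.3f} (true mu = {true_mu})")
```

With this many samples the estimate should land close to the true mean, though never exactly on it.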

#### Evaluating Estimators

Notice that while this example has a pretty reasonable estimator for $\mu$, the definition of an estimator makes no guarantee that the estimator will accurately predict the value that it's trying to estimate. So we need some measure of how well an estimator will perform, or more importantly how close it will come to the true value of $\theta$. The *bias* of an estimator measures its expected offset from the true value of $\theta$, and its *variance* measures how much the estimate will vary as we apply the estimator to multiple independently sampled data sets. They are defined as:

$$ Bias(\hat{\theta}) = E[\hat{\theta}] - \theta $$

$$ Variance = Var(\hat{\theta})$$

Continuing the example from before, the bias of our estimator $\hat{\mu}$ can be calculated as:

$$
\begin{aligned}
Bias(\hat{\mu}) &= E[\hat{\mu}] - \mu \\
                &= E[\frac{1}{m}\sum^{m}_{i=1}x^{(i)}] - \mu \\
                &= \frac{1}{m}\sum^{m}_{i=1}E[x^{(i)}] - \mu \\
                &= \frac{1}{m}\sum^{m}_{i=1}\mu - \mu \\
                &= \mu - \mu \\
                &= 0
\end{aligned}
$$

This shows that using the sample mean as an estimate for the Gaussian mean parameter results in an unbiased estimator.
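The zero-bias result can be checked roughly by simulation: draw many independent data sets, apply the estimator to each, and compare the average estimate with the true mean. This is only a sketch; `true_mu`, `true_sigma`, `m` and `n_datasets` are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(1)            # fixed seed so the run is repeatable
true_mu, true_sigma = 2.0, 1.5    # assumed true parameters (illustrative)
m, n_datasets = 20, 20_000        # samples per data set, number of data sets

# Apply the sample-mean estimator to many independently sampled data sets.
estimates = []
for _ in range(n_datasets):
    data = [rng.gauss(true_mu, true_sigma) for _ in range(m)]
    estimates.append(sum(data) / m)

# Empirical stand-in for E[mu_hat] - mu.
empirical_bias = sum(estimates) / n_datasets - true_mu
print(f"empirical bias = {empirical_bias:+.4f}")
```

The printed value will not be exactly zero because we only average over a finite number of data sets, but it should be small compared to `true_sigma`.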

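Similarly, we can calculate the variance of the estimator. Two facts are needed: pulling a constant out of a variance squares it, so $Var(aX) = a^2Var(X)$, and because the samples are independent the variance of their sum is the sum of their variances.

$$
\begin{aligned}
Var(\hat{\mu}) &= Var(\frac{1}{m}\sum^{m}_{i=1}x^{(i)}) \\
               &= \frac{1}{m^2}Var(\sum^{m}_{i=1}x^{(i)}) \\
               &= \frac{1}{m^2}\sum^{m}_{i=1}Var(x^{(i)}) \\
               &= \frac{1}{m^2}m\sigma^2 \\
               &= \frac{\sigma^2}{m}
\end{aligned}
$$

So the variance of the sample mean shrinks as we collect more samples.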
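A rough simulation can check how the variance of the sample mean behaves. For independent samples the standard result is $Var(\hat{\mu}) = \sigma^2/m$, so the empirical variance should drop as $m$ grows. The helper name `variance_of_sample_mean` and all the numbers below are my own illustrative choices.

```python
import random

def variance_of_sample_mean(m, n_datasets, rng, mu=2.0, sigma=1.5):
    """Empirical variance of the sample mean over many data sets of size m."""
    estimates = []
    for _ in range(n_datasets):
        data = [rng.gauss(mu, sigma) for _ in range(m)]
        estimates.append(sum(data) / m)
    mean_est = sum(estimates) / n_datasets
    return sum((e - mean_est) ** 2 for e in estimates) / n_datasets

rng = random.Random(2)   # fixed seed so the run is repeatable
sigma = 1.5              # assumed true standard deviation (illustrative)

results = {}
for m in (5, 20, 80):
    results[m] = variance_of_sample_mean(m, 10_000, rng, sigma=sigma)
    print(f"m={m:3d}  empirical={results[m]:.4f}  sigma^2/m={sigma**2 / m:.4f}")
```

The two columns should agree to within sampling noise, and each fourfold increase in `m` should cut the variance by roughly a factor of four.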

While these metrics are useful, ultimately what we want when choosing between estimators is to pick the one with the lowest error between $\hat{\theta}$ and $\theta$. Mean squared error measures exactly this, and is defined as:

$$
\begin{aligned}
MSE(\hat{\theta}) &= E[(\hat{\theta} - \theta)^2] \\
                  &= E[\hat{\theta}^2 - 2 \hat{\theta}\theta + \theta^2] \\
                  &= E[\hat{\theta}^2] - 2E[\hat{\theta}\theta] + E[\theta^2]
\end{aligned}
$$

I expanded the equation a little bit to prepare for showing how the MSE depends on the bias and variance of an estimator. To complete this derivation we need to work out two other expansions.

$$
\begin{aligned}
Bias(\hat{\theta})^2 &= (E[\hat{\theta}] - \theta)^2 \\
                     &= (E[\hat{\theta}])^2 - 2E[\hat{\theta}\theta] + E[\theta^2] \\
\end{aligned}
$$

We assume that the real value $\theta$ isn't a random variable, so its expectation is equal to its value. This leaves us with:

$$
\begin{aligned}
Bias(\hat{\theta})^2 &= (E[\hat{\theta}])^2 - 2\theta E[\hat{\theta}] + \theta^2 \\
\end{aligned}
$$

The second derivation we need is a decomposition of the variance of an estimator.

$$
\begin{aligned}
Var(\hat{\theta}) &= E[(\hat{\theta} - E[\hat{\theta}])^2] \\
                  &= E[\hat{\theta}^2 - 2\hat{\theta}E[\hat{\theta}] + (E[\hat{\theta}])^2] \\
                  &= E[\hat{\theta}^2] - 2E[\hat{\theta}E[\hat{\theta}]] + E[(E[\hat{\theta}])^2] \\
                  &= E[\hat{\theta}^2] - 2(E[\hat{\theta}])^2 + (E[\hat{\theta}])^2 \\
                  &= E[\hat{\theta}^2] - (E[\hat{\theta}])^2
\end{aligned}
$$

A potentially non-obvious trick used in the derivation above is that $E[\hat{\theta}]$ is a fixed number rather than a random variable, so it can be pulled out of an outer expectation, and $E[E[\hat{\theta}]]$ is just $E[\hat{\theta}]$. This is how we are able to convert $2E[\hat{\theta}E[\hat{\theta}]] = 2(E[\hat{\theta}])^2$ and likewise, $E[(E[\hat{\theta}])^2] = (E[\hat{\theta}])^2$.

Now finally, if we put these two derivations together:

$$
\begin{aligned}
Var(\hat{\theta}) + Bias(\hat{\theta})^2 &= E[\hat{\theta}^2] - (E[\hat{\theta}])^2 + (E[\hat{\theta}])^2 - 2E[\hat{\theta}\theta] + E[\theta^2] \\
                                        &= E[\hat{\theta}^2] - 2E[\hat{\theta}\theta] + E[\theta^2] \\
                                        &= MSE(\hat{\theta})
\end{aligned}
$$
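The identity $MSE = Var + Bias^2$ can also be checked numerically. The sketch below uses a deliberately biased, hypothetical estimator (`shrunk_mean`, the sample mean scaled by 0.5) so that both terms are non-zero; the parameter values are arbitrary assumptions. Note that the empirical versions of the three quantities satisfy the same identity exactly, up to floating-point rounding.

```python
import random

rng = random.Random(3)                       # fixed seed so the run is repeatable
theta, sigma = 0.5, 2.0                      # assumed true mean and std (illustrative)
m, n_datasets = 5, 20_000

def shrunk_mean(data, shrink=0.5):
    """Hypothetical biased estimator: the sample mean scaled toward zero."""
    return shrink * sum(data) / len(data)

estimates = [
    shrunk_mean([rng.gauss(theta, sigma) for _ in range(m)])
    for _ in range(n_datasets)
]

mean_est = sum(estimates) / n_datasets
bias = mean_est - theta                                               # E[theta_hat] - theta
variance = sum((e - mean_est) ** 2 for e in estimates) / n_datasets   # Var(theta_hat)
mse = sum((e - theta) ** 2 for e in estimates) / n_datasets           # E[(theta_hat - theta)^2]

print(f"variance + bias^2 = {variance + bias ** 2:.6f}")
print(f"mse               = {mse:.6f}")
```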

This might seem like a long walk for a small drink of water, but this derivation shows that when comparing the viability of two separate estimators we don't really care about the variance or bias independently, but rather the balance between them that achieves the lowest MSE (obviously ideally we want both of them to be low!).
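To make the comparison concrete, here is a small sketch that scores two estimators by empirical MSE on the same simulated data sets. Both `shrunk_mean` and the parameter values are hypothetical choices of mine, picked so that the true mean is small relative to the noise.

```python
import random

def empirical_mse(estimator, theta, sigma, m, n_datasets, rng):
    """Average squared error of `estimator` over many simulated data sets."""
    total = 0.0
    for _ in range(n_datasets):
        data = [rng.gauss(theta, sigma) for _ in range(m)]
        total += (estimator(data) - theta) ** 2
    return total / n_datasets

def sample_mean(data):
    """Unbiased estimator of the mean."""
    return sum(data) / len(data)

def shrunk_mean(data):
    """Hypothetical biased estimator: half the sample mean."""
    return 0.5 * sample_mean(data)

theta, sigma, m = 0.5, 2.0, 5   # assumed true parameters (illustrative)

# Same seed for both runs, so both estimators see identical data sets.
mse_mean = empirical_mse(sample_mean, theta, sigma, m, 20_000, random.Random(4))
mse_shrunk = empirical_mse(shrunk_mean, theta, sigma, m, 20_000, random.Random(4))

print(f"MSE of sample mean: {mse_mean:.4f}")
print(f"MSE of shrunk mean: {mse_shrunk:.4f}")
```

In this particular setup the biased estimator should come out ahead, because the variance it saves outweighs the squared bias it adds. With a larger true mean or more samples per data set the unbiased sample mean wins instead, which is exactly the balance described above.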