# Parameter Estimation
## Maximum Likelihood Estimation (MLE)
### Algorithm
<ol>
  <li> Decide on a model for the distribution of your samples, and define the PMF/PDF for a single sample. </li>
  <li> Write out the log likelihood function. </li>
  <li> State that the optimal parameters are the argmax of the log likelihood function. </li>
  <li> Use an optimization algorithm to calculate the argmax. </li>
</ol>

### Maximum Likelihood
Likelihood:
$$ L(\theta) = \prod_{i = 1}^nf(X_i\,|\,\theta) $$
Log Likelihood:
$$ LL(\theta) = \sum_{i = 1}^n\log{f(X_i\,|\,\theta)} $$
Parameter:
$$ \hat{\theta} = \textrm{argmax}_\theta LL(\theta) $$
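
As a rough illustration of the three definitions above, the sketch below evaluates both $L(\theta)$ and $LL(\theta)$ on a grid of candidate parameters and checks that they peak at the same place. The Bernoulli PMF, the four data points, and the helper names (`likelihood`, `log_likelihood`, `grid_argmax`) are assumptions made for this example, and the grid search is only a stand-in for a real optimizer.

```python
import math

def bernoulli_pmf(x, p):
    # Assumed model for this sketch: P(X = 1) = p, P(X = 0) = 1 - p.
    return p if x == 1 else 1.0 - p

def likelihood(pmf, data, theta):
    # L(theta): product of the per-sample probabilities.
    return math.prod(pmf(x, theta) for x in data)

def log_likelihood(pmf, data, theta):
    # LL(theta): sum of the per-sample log probabilities.
    return sum(math.log(pmf(x, theta)) for x in data)

def grid_argmax(objective, candidates):
    # Brute-force argmax: evaluate every candidate, keep the best one.
    return max(candidates, key=objective)

data = [1, 1, 0, 1]                        # hypothetical observations
grid = [i / 100 for i in range(1, 100)]    # theta in 0.01 .. 0.99

theta_from_l = grid_argmax(lambda t: likelihood(bernoulli_pmf, data, t), grid)
theta_from_ll = grid_argmax(lambda t: log_likelihood(bernoulli_pmf, data, t), grid)
print(theta_from_l, theta_from_ll)
```

Both searches land on the same $\theta$ because the log is monotonic. With many samples the product in `likelihood` can underflow to zero while the sum in `log_likelihood` stays finite, which is one practical reason to work with $LL(\theta)$.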

### How do you compute $\textrm{argmax}$?

#### Computation using calculus
$$\hat{x}  = \textrm{arg max}_x f(x)$$
This can be done by finding the value of $x$ where the derivative vanishes.
$$ \textrm{Suppose } f(x) = -x^2 + 4, \qquad\textrm{where }-2 < x < 2$$
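$$ \frac{d}{dx}f(x) = \frac{d}{dx}(-x^2 + 4) = -2x$$
Setting $-2x = 0$ gives $\hat{x} = 0$, which lies inside the interval. The second derivative is $-2 < 0$, so this point is a maximum.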
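
The same idea can be checked numerically. The sketch below approximates the derivative with a finite difference and uses bisection to locate the point where it vanishes. The helper names and the bisection approach are choices made for this illustration; it assumes the derivative is positive at the left end of the bracket, negative at the right end, and crosses zero once.

```python
def f(x):
    return -x**2 + 4

def derivative(func, x, h=1e-6):
    # Central finite-difference approximation of d/dx func(x).
    return (func(x + h) - func(x - h)) / (2 * h)

def find_stationary_point(func, lo, hi, tol=1e-9):
    # Bisection on the derivative: keep the half of the bracket
    # in which the derivative still changes sign.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if derivative(func, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_hat = find_stationary_point(f, -2.0, 2.0)
print(x_hat, f(x_hat))
```

The result is within floating-point noise of $x = 0$, where $f$ reaches its maximum value of 4.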

### Example: MLE for Poisson
Suppose we have $n = 12$ data points, $X_1, ..., X_n$, each sampled IID from an unknown Poisson distribution, that is:
* $X_i \sim Poi(\lambda)$
* PMF can be written as: $f(x_i\,|\,\lambda) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$
* Likelihood: $L(\lambda) = f(x_1, ..., x_n\,|\,\lambda) = \prod_{i = 1}^{n}f(x_i\,|\,\lambda) = \prod_{i = 1}^n\frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$
* Log Likelihood: $LL(\lambda) = \log \prod_{i = 1}^n\frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \sum_{i = 1}^n \log \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \sum_{i = 1}^n \left(-\lambda + x_i\log \lambda - \log x_i!\right) $

Now we are left with the task of finding the $\textrm{arg max}$ of our log likelihood.

* Differentiate with respect to $\lambda$ and set to 0:
$$\frac{\partial LL(\lambda)}{\partial \lambda} = 0 = -n + \frac{1}{\lambda}\sum_{i = 1}^nx_i$$

Solving for $\lambda$ gives:
$$\lambda_{_{MLE}} = \frac{1}{n}\sum_{i = 1}^nx_i$$

This is rather frustrating, as we did an entire page's worth of mathematics to come to the simple result that the MLE of the Poisson parameter $\lambda$ is the mean of all the data points.
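
As a sanity check on the derivation, the sketch below compares the closed-form answer with a crude numerical argmax of the Poisson log likelihood. The twelve counts are made up for this example, and the grid search is just a simple stand-in for an optimizer.

```python
import math

def poisson_log_likelihood(lam, data):
    # LL(lambda) = sum_i ( -lambda + x_i * log(lambda) - log(x_i!) )
    # math.lgamma(x + 1) equals log(x!).
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in data)

# Twelve hypothetical counts, made up for this sketch.
data = [2, 4, 3, 1, 5, 3, 2, 4, 3, 0, 6, 3]

lam_closed_form = sum(data) / len(data)

# Crude numerical argmax over lambda in 0.01 .. 10.00.
grid = [i / 100 for i in range(1, 1001)]
lam_numeric = max(grid, key=lambda lam: poisson_log_likelihood(lam, data))

print(lam_closed_form, lam_numeric)
```

Both values agree (up to the resolution of the grid), which matches the result that the MLE is the sample mean.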

### Example: MLE for Bernoulli
A disclaimer: the PMF of the Bernoulli in its usual piecewise form ($p$ if $x_i = 1$, $1 - p$ if $x_i = 0$) is awkward to differentiate. We therefore rewrite the Bernoulli PMF as a single expression that can be differentiated to calculate the MLE:
$$f(x_i\,|\,p) = p^{x_i}(1-p)^{1-x_i} $$
If we were to apply the math once again, we would end up at the result:
$$p_{_{MLE}} = \frac{1}{n}\sum_{i = 1}^nX_i$$
This is also the sample mean of our data.
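
For completeness, here is a sketch of the omitted steps, following the same recipe as the Poisson example and writing $x_i$ for the observed values:
$$LL(p) = \sum_{i = 1}^n \log\left(p^{x_i}(1-p)^{1-x_i}\right) = \sum_{i = 1}^n \left(x_i\log p + (1 - x_i)\log(1 - p)\right)$$
$$\frac{\partial LL(p)}{\partial p} = 0 = \frac{1}{p}\sum_{i = 1}^n x_i - \frac{1}{1 - p}\sum_{i = 1}^n (1 - x_i)$$
Multiplying through by $p(1 - p)$ and collecting terms:
$$(1 - p)\sum_{i = 1}^n x_i = p\left(n - \sum_{i = 1}^n x_i\right) \quad\Rightarrow\quad p = \frac{1}{n}\sum_{i = 1}^n x_i$$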

### Maximum Likelihood with Gaussian
Consider a sample of n iid random variables $X_1, X_2, X_3, ..., X_n$
* Let $X_i \sim \mathcal{N}(\mu, \sigma^2)$
* $f(X_i\,|\,\mu,\,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(X_i - \mu)^2}{2\sigma^2}}$
Carrying out the same calculation again, we would have the following:
$$\mu_{_{MLE}} = \frac{1}{n}\sum_{i = 1}^nX_i$$
$$\sigma^2_{_{MLE}} = \frac{1}{n}\sum_{i = 1}^n(X_i - \mu_{_{MLE}})^2 $$
It is reasonable that the MLE of the mean is the sample mean. However, it is also noticeable that the MLE of the variance isn't the usual sample variance: it divides by $n$ rather than $n - 1$. Therefore, we say that the MLE of the variance is a biased estimator; on average it underestimates the true variance.
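
The difference between the two variance estimates is easy to see numerically. The sketch below computes the Gaussian MLE by hand and compares it with `statistics.variance` from the Python standard library, which divides by $n - 1$. The four data points and the function name `gaussian_mle` are made up for this example.

```python
import statistics

def gaussian_mle(data):
    # Closed-form MLE for a Gaussian: the sample mean, and the average
    # squared deviation from that mean (divides by n, not n - 1).
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

data = [4.0, 7.0, 13.0, 16.0]    # hypothetical sample

mu_mle, var_mle = gaussian_mle(data)
var_unbiased = statistics.variance(data)    # divides by n - 1

print(mu_mle, var_mle, var_unbiased)
```

For this sample the MLE variance is 22.5 while the $n - 1$ version is 30.0. The ratio between them is $(n - 1)/n$, so the gap shrinks as the sample grows.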

### Gradient Ascent
$$\theta_j^{\textrm{ new}} = \theta_j^{\textrm{ old}} + \eta \cdotp \frac{\partial LL(\theta^{\textrm{ old}})}{\partial \theta_j^{\textrm{ old}}}$$
Here, $\eta$ is the step size constant, or learning rate. Note that in this algorithm we are finding the $\textrm{argmax}$ of the log likelihood function, that is, we want the parameters at which it attains its highest value.\
If we were implementing something like linear regression, we would flip the sign of the update (subtract the gradient term instead of adding it) and practice something known as **Gradient Descent**, where we want the error, the difference between the actual value and our estimated value, to be the smallest.\
Another thing we could do is find the parameters at the minimum using gradient descent **on the negative log likelihood function**, which yields the same parameters.
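
A minimal sketch of the update rule, applied to the Poisson log likelihood whose gradient is $-n + \frac{1}{\lambda}\sum x_i$. The data, the starting point, the step size, and the number of steps are all assumptions chosen so that this particular example converges; a step size that is too large would overshoot.

```python
def poisson_ll_gradient(lam, data):
    # d/d(lambda) of the Poisson log likelihood: -n + (1 / lambda) * sum(x_i)
    return -len(data) + sum(data) / lam

def gradient_ascent(gradient, theta, eta=0.01, steps=1000):
    # Repeatedly move theta a small step in the uphill direction.
    for _ in range(steps):
        theta = theta + eta * gradient(theta)
    return theta

def gradient_descent(gradient, theta, eta=0.01, steps=1000):
    # Same loop with the sign flipped: move downhill instead.
    for _ in range(steps):
        theta = theta - eta * gradient(theta)
    return theta

# Twelve hypothetical counts, made up for this sketch.
data = [2, 4, 3, 1, 5, 3, 2, 4, 3, 0, 6, 3]

# Ascent on LL, and descent on the negative LL, from the same start.
lam_ascent = gradient_ascent(lambda lam: poisson_ll_gradient(lam, data), theta=1.0)
lam_descent = gradient_descent(lambda lam: -poisson_ll_gradient(lam, data), theta=1.0)

print(lam_ascent, lam_descent, sum(data) / len(data))
```

Both runs approach the sample mean, which illustrates that ascending the log likelihood and descending the negative log likelihood are the same procedure.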

## MLE could Benefit from Priors
Consider iid random variables $X_1, X_2, ..., X_n$.
* $X_i \sim Uni(0, 1)$
* Observe Data:
  * 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75

The problem with MLE is that it overfits. Overfitting means that the parameters you choose describe your dataset too well, which places very strict constraints on what future data can look like. In the above case, fitting a $Uni(\alpha, \beta)$ model by MLE gives $\alpha = 0.15$ and $\beta = 0.75$, the smallest and largest observations, even though the data actually came from $Uni(0, 1)$.\
To combat this, we could use Bayesian probability. We could have a really strong belief about the parameters before we start looking at the data. This is the same as placing a prior distribution (a Beta distribution, for example) on the parameters before we start seeing the data.
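
The overfitting is easy to reproduce. In the sketch below, `uniform_mle` returns the tightest interval containing the data, and the fitted model then assigns zero density to a perfectly plausible new observation. The function names and the new observation 0.80 are assumptions made for this illustration.

```python
def uniform_mle(data):
    # For Uni(alpha, beta), the likelihood is maximized by the tightest
    # interval that still contains every observation.
    return min(data), max(data)

def uniform_pdf(x, alpha, beta):
    # Density of Uni(alpha, beta): constant inside the interval, zero outside.
    return 1.0 / (beta - alpha) if alpha <= x <= beta else 0.0

data = [0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75]

alpha_hat, beta_hat = uniform_mle(data)
print(alpha_hat, beta_hat)                       # 0.15 0.75
print(uniform_pdf(0.80, alpha_hat, beta_hat))    # 0.0
```

A value of 0.80 is entirely possible under $Uni(0, 1)$, yet the MLE fit declares it impossible.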

## Maximum A Posteriori
$$\hat{\theta}_{_{MAP}} = \textrm{argmax}_\theta f(\Theta = \theta\,|\,X^{(1)} = x^{(1)}, ..., X^{(n)} = x^{(n)})$$
Instead of choosing the parameters which make the data most likely, MAP chooses the parameters that are most likely given the observed data. With $g(\theta)$ denoting the prior on the parameters:
$$\hat{\theta}_{_{MAP}} = \textrm{argmax}_\theta f(x^{(1)}, ..., x^{(n)}\,|\,\theta)g(\theta)$$
$$ = \textrm{argmax}_\theta\,g(\theta)\prod_{i = 1}^nf(x^{(i)}\,|\,\theta)$$
$$ \hat{\theta}_{_{MAP}} = \textrm{argmax}_\theta\,\left(\log(g(\theta)) + \sum_{i = 1}^n\log(f(x^{(i)}\,|\,\theta))\right)$$
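
The step from the posterior to the product of likelihood and prior is Bayes' theorem. Writing $h$ for the marginal density of the data (a symbol introduced here only for this step):
$$f(\Theta = \theta\,|\,x^{(1)}, ..., x^{(n)}) = \frac{f(x^{(1)}, ..., x^{(n)}\,|\,\theta)\,g(\theta)}{h(x^{(1)}, ..., x^{(n)})}$$
The denominator does not depend on $\theta$, so it can be dropped from the $\textrm{argmax}$. The product form then uses the iid assumption, and taking the log leaves the $\textrm{argmax}$ unchanged because the log is monotonic.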

Different likelihood distributions call for different prior distributions. The goal is to find a prior that is a conjugate of the likelihood, so that we can fold our observations into the prior without having to derive an entirely new function for the posterior.
### Quick MAP for Bernoulli
$Beta(a, b)$ is a conjugate prior for the probability of success in Bernoulli and Binomial Distributions.\
* Prior: $Beta(a,b)$
* Experiment: Observe $n + m$ new trials: $n$ successes and $m$ failures.
* Posterior: $Beta(a + n, b + m)$
* MAP (the mode of the posterior): $p_{_{MAP}} = \frac{a+n-1}{a+b+n+m-2}$

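One common choice is to add one imaginary success and one imaginary failure to the actual data. This corresponds to a $Beta(2, 2)$ prior and is known as the Laplace prior; the MAP estimate then becomes $p = \frac{n+1}{n+m+2}$.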
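
The MAP formula above is a one-liner in code. The sketch below compares the MLE with the MAP estimate under two priors; the function name `bernoulli_map` and the counts of 7 successes and 3 failures are made up for this example.

```python
def bernoulli_map(a, b, successes, failures):
    # Mode of the Beta(a + n, b + m) posterior.
    # Assumes a + n > 1 and b + m > 1 so that the mode lies inside (0, 1).
    return (a + successes - 1) / (a + b + successes + failures - 2)

successes, failures = 7, 3    # hypothetical experiment

p_mle = successes / (successes + failures)
p_flat = bernoulli_map(1, 1, successes, failures)       # Beta(1, 1): uniform prior
p_laplace = bernoulli_map(2, 2, successes, failures)    # Beta(2, 2): one extra success and one extra failure

print(p_mle, p_flat, p_laplace)
```

With a $Beta(1, 1)$ prior the MAP estimate coincides with the MLE, while the $Beta(2, 2)$ prior pulls the estimate toward 0.5.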

### Brute force Bayes Classifier
$$\hat{y} = \textrm{arg max}_{y \in \{0, 1\}}P(y\,|\,x)$$
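
One possible reading of "brute force" here is to estimate $P(y\,|\,x)$ directly by counting how often each $(x, y)$ pair occurs in the training data. The sketch below follows that reading; the function names, the training pairs, and the tie-breaking rule are all assumptions made for this illustration.

```python
from collections import Counter

def fit_joint_counts(samples):
    # Count how often each (x, y) pair appears in the training data.
    return Counter(samples)

def predict(counts, x):
    # P(y | x) is proportional to count(x, y), so the argmax over y
    # only needs the raw counts. Ties and unseen x fall back to y = 0.
    return max((0, 1), key=lambda y: counts[(x, y)])

# Hypothetical training pairs (x, y), where x is a tuple of binary features.
samples = [
    ((1, 0), 1), ((1, 0), 1), ((1, 0), 0),
    ((0, 1), 0), ((0, 1), 0), ((0, 1), 1),
]

counts = fit_joint_counts(samples)
print(predict(counts, (1, 0)), predict(counts, (0, 1)))
```

The table needs one entry per combination of feature values and label, so it grows exponentially with the number of features. That cost is what makes this approach brute force.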