# Maximum Likelihood
Maximum likelihood is an important principle in statistics that is used to estimate the unknown parameters of a probability distribution from a data set.

Assume that we have a set of $n$ data points denoted by $X = \{x_1, x_2, \ldots, x_n\}$, which are generated from some probability distribution $P(X \mid \theta)$ with unknown parameters $\theta$. For example, the data points might be drawn from a Gaussian (normal) distribution, where the parameters $\theta$ are the mean $\mu$ and the standard deviation $\sigma$ of the distribution.

We also assume that the points are independent and identically distributed (i.i.d. for short), which means that all the points are mutually independent (independent) and that they are all sampled from the same distribution (identically distributed).

Our goal is to find a model (represented by $\theta$) that makes the observed data most probable, or in other words, a model that maximizes the likelihood of obtaining the data points $X$ if we were sampling them from the distribution $P$. This process is often referred to as **maximum likelihood estimation (MLE).**

Formally, the **likelihood** of the model (represented by $\theta$) is defined as the probability of obtaining the observed data $X$ given the model:

$$\mathcal{L}(\theta \mid X) = P(X \mid \theta)$$

Notice the change in the direction of conditionality here: when the model $\theta$ is known (fixed), the function $P(X \mid \theta)$ is the probability density function (PDF) of the points (or the probability mass function, if the data are discrete), but when the points $X$ are known (fixed), the same function becomes a likelihood function of the model $\theta$.

Since the points in $X$ are independent and identically distributed, their joint probability factorizes, and we can write the likelihood function as a product of the probabilities of the individual data points in $X$:

$$\mathcal{L}(\theta \mid X) = P(x_1, \cdots, x_n \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$$

Writing an explicit expression for the likelihood function can be quite complex (depending on the probability $P$). To simplify the function, we typically take its logarithm, which allows us to convert the product of probabilities into a sum of logarithms. The resultant function is called the **log likelihood:**

$$l(\theta \mid X) = \log \mathcal{L}(\theta \mid X) = \sum_{i=1}^n \log{(P(x_i \mid \theta))}$$

Maximizing the log likelihood is the same as maximizing the likelihood, since the logarithm function is monotonically increasing.
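To illustrate why the logarithm is convenient in practice, here is a minimal sketch that compares the two functions on a small grid of candidate parameters. The coin-flip (Bernoulli) model, the data, and the function names are assumptions made up for this illustration; they are not part of the derivation above.

```python
import math

def likelihood(p, data):
    """Product of the per-point probabilities P(x_i | p) for 0/1 outcomes."""
    result = 1.0
    for x in data:
        result *= p if x == 1 else (1.0 - p)
    return result

def log_likelihood(p, data):
    """Sum of the per-point log probabilities log P(x_i | p)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

# Hypothetical data: 7 successes in 10 trials.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Candidate values for p: 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# Both criteria select the same candidate, because log is monotonically increasing.
best_by_likelihood = max(grid, key=lambda p: likelihood(p, data))
best_by_log_likelihood = max(grid, key=lambda p: log_likelihood(p, data))

# With many data points, the raw product underflows in floating point,
# while the sum of logarithms remains a finite number.
big_data = data * 500  # 5,000 points
underflowed = likelihood(0.7, big_data)
stable = log_likelihood(0.7, big_data)
```

Besides turning the product into a sum that is easier to differentiate, working with logarithms avoids the numerical underflow that a product of many small probabilities runs into.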

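To find the parameters $\theta$ that maximize the (log) likelihood, we compute the derivative of the (log) likelihood with respect to each parameter in $\theta$, set these derivatives to zero, and solve the resulting system of equations. Strictly speaking, this yields candidate points, so we should also check that the solution is a maximum rather than a minimum or a saddle point.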

## Example

There are 10 balls in a bag. Each ball is either red or green. Let $\theta$ be the number of red balls. In order to estimate $\theta$, we draw 5 balls from the bag with replacement, that is, we return each ball to the bag before drawing the next one. The balls that we get are "red", "red", "green", "red" and "green" (in that order). What is the maximum likelihood estimate (MLE) for $\theta$?

To answer this question, we first need to write the likelihood function of $\theta$. According to our previous definitions, the likelihood of $\theta$ is the probability of getting this specific sequence of colors (our "data points"), or in mathematical notation:

$$\mathcal{L}(\theta) = P(\text{red, red, green, red, green} \mid \theta)$$

Since we have $\theta$ red balls out of 10 in the bag, the probability of drawing a red ball from the bag is $\frac{\theta}{10}$ and the probability of drawing a green ball from the bag is $\frac{10-\theta}{10}$.

Therefore, the probability of getting this specific sequence of colors is:

$$\mathcal{L}(\theta) = \left(\frac{\theta}{10}\right)^3 \left(\frac{10-\theta}{10}\right)^2$$

To simplify this function, we take its logarithm to obtain the log likelihood:

$$l(\theta) = \log \mathcal{L}(\theta) = 3(\log \theta - \log 10) + 2 (\log(10 - \theta) - \log 10) = 3 \log\theta + 2 \log(10-\theta) - 5 \log 10$$

We now compute the derivative of the log likelihood and set it to $0$:

$$\frac{\partial{l}}{\partial{\theta}} = \frac{3}{\theta} - \frac{2}{10-\theta} = 0$$

Therefore we get:

$$3(10-\theta) = 2\theta \Rightarrow 5\theta = 30 \Rightarrow \theta=6$$

The model that best explains our data is the one with 6 red balls out of the 10 in the bag. This result matches our intuition, since 3 of the 5 balls we drew (a fraction of 0.6) were red.
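Because $\theta$ can only take the integer values 0 through 10, the result can also be checked by brute force. The following is a minimal sketch that evaluates the likelihood for every possible number of red balls; the function and variable names are chosen for this illustration only.

```python
def likelihood(theta, n_balls=10, n_red=3, n_green=2):
    """Probability of the observed sequence (3 reds, 2 greens) given theta red balls."""
    p_red = theta / n_balls
    return p_red ** n_red * (1 - p_red) ** n_green

# Evaluate the likelihood for every possible number of red balls in the bag.
likelihoods = {theta: likelihood(theta) for theta in range(0, 11)}

# The MLE is the candidate with the highest likelihood.
theta_mle = max(likelihoods, key=likelihoods.get)

for theta, value in likelihoods.items():
    print(f"theta = {theta:2d}  likelihood = {value:.5f}")
print("MLE:", theta_mle)
```

The enumeration agrees with the calculus-based solution: the likelihood peaks at $\theta = 6$, where it equals $0.6^3 \cdot 0.4^2 = 0.03456$.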

## Estimating the Parameters of a Normal Distribution

Assume that we have $n$ sample points generated from a one-dimensional Gaussian distribution, and we would like to estimate the parameters $\mu$ and $\sigma$ of this distribution.

In this case, the likelihood of the parameters $\mu$ and $\sigma$ is the product of the normal probability density function (PDF) evaluated at each of the sample points:

$$\mathcal{L}(\mu, \sigma \mid X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Therefore, the log likelihood is:

$$\begin{aligned}
l(\mu, \sigma \mid X) &= \sum_{i=1}^n \log \left[\frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] \\
&= n \log \frac{1}{\sqrt{2\pi} \sigma} - \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \\
&= -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} - n \log\sigma - \frac{n}{2} \log(2\pi)
\end{aligned}$$
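As a sanity check on this simplification, the sketch below compares the sum of log densities with the simplified closed-form expression. The sample values and parameter values are arbitrary numbers chosen for illustration, and `NormalDist` is the normal distribution class from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist

def log_likelihood_direct(mu, sigma, xs):
    """Sum of log N(x_i; mu, sigma) using the library PDF."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in xs)

def log_likelihood_closed_form(mu, sigma, xs):
    """Simplified expression: -sum((x_i - mu)^2) / (2 sigma^2) - n log(sigma) - (n/2) log(2 pi)."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return (-squared_error / (2 * sigma ** 2)
            - n * math.log(sigma)
            - (n / 2) * math.log(2 * math.pi))

# Arbitrary illustrative sample and parameter values.
xs = [2.1, 1.7, 3.4, 2.9, 2.5]
direct = log_likelihood_direct(2.0, 1.5, xs)
closed = log_likelihood_closed_form(2.0, 1.5, xs)
print(direct, closed)
```

The two values should agree up to floating-point rounding, which suggests that no term was lost when expanding the logarithm.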

To find the parameters $\mu$ and $\sigma$ that yield the maximum likelihood, we now take the partial derivatives of the log likelihood with respect to each one of them and set them to $0$.
First, we take the partial derivative of the log likelihood with respect to $\mu$:

$$\begin{aligned}
\frac{\partial l}{\partial \mu} &= - \sum_{i=1}^n \frac{-2(x_i - \mu)}{2\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\
&\Rightarrow \sum_{i=1}^n x_i - n\mu = 0 \\
&\Rightarrow \mu = \frac{\sum_{i=1}^n x_i}{n}
\end{aligned}$$

As expected, we get that the best mean that describes our data is just the sample mean of the given data points!

We now do the same for $\sigma$:

$$\begin{aligned}
\frac{\partial l}{\partial \sigma} &= - \sum_{i=1}^n \frac{-2(x_i - \mu)^2}{2\sigma^3} - \frac{n}{\sigma} = \sum_{i=1}^n \frac{(x_i - \mu)^2}{\sigma^3} - \frac{n}{\sigma} = 0 \\
&\Rightarrow \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{\sigma} \\
&\Rightarrow \sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}}
\end{aligned}$$

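Similarly, the MLE of $\sigma$ is the standard deviation of our sample points, computed with $n$ in the denominator (and with $\mu$ replaced by its MLE, the sample mean). Note that this differs from the commonly used sample standard deviation, which divides by $n-1$.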
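The two closed-form estimators can be tried out on simulated data. The following is a minimal sketch using only the Python standard library; the "true" parameter values, the sample size, and the random seed are arbitrary assumptions for this illustration.

```python
import math
import random
import statistics

def gaussian_mle(xs):
    """Return the maximum likelihood estimates (mu, sigma) for a 1-D Gaussian."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # Note the division by n (not n - 1), as in the derivation.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat

# Simulate data from a Gaussian with hypothetical "true" parameters.
random.seed(0)
true_mu, true_sigma = 5.0, 2.0
xs = [random.gauss(true_mu, true_sigma) for _ in range(10_000)]

mu_hat, sigma_hat = gaussian_mle(xs)
print("MLE of mu:   ", mu_hat)
print("MLE of sigma:", sigma_hat)

# statistics.pstdev divides by n, while statistics.stdev divides by n - 1.
print("pstdev:", statistics.pstdev(xs))
print("stdev: ", statistics.stdev(xs))
```

With a sample this large, the estimates should land close to the parameters used for the simulation, and `sigma_hat` should match `statistics.pstdev` rather than `statistics.stdev`.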