In [0]:
#@title Imports
!pip install -q symbulate
from symbulate import *

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
#@title Define Plotting Functions

def plot_discrete_function(f, xlim=(0, 10), xlabel=r"$\theta$", ylabel="Likelihood"):
  xs = np.arange(np.ceil(xlim[0]), np.floor(xlim[1]) + 1, dtype=int)
  ys = [f(x) for x in xs]
  plt.plot(xs, ys, "o-")
  plt.xlabel(xlabel, fontsize=18)
  plt.ylabel(ylabel, fontsize=18)
  plt.xlim(*xlim)

def plot_continuous_function(f, xlim=(0, 1), xlabel=r"$\theta$", ylabel="Likelihood"):
  xs = np.linspace(xlim[0], xlim[1], 1000)
  ys = [f(x) for x in xs]
  plt.plot(xs, ys, "-")
  plt.xlabel(xlabel, fontsize=18)
  plt.ylabel(ylabel, fontsize=18)
  plt.xlim(*xlim)

# The Calculus of Maximum Likelihood


The number of radioactive particles that reach a Geiger counter (per minute) is a $\text{Poisson}(\mu)$ random variable. Suppose that over a period of one minute, we record 7 radioactive particles hitting the Geiger counter. What is the MLE of $\mu$?

The likelihood is 
$$ L(\mu) = e^{-\mu} \frac{\mu^7}{7!}. $$

The maximum likelihood estimate (MLE) $\hat\mu$ is the value of $\mu$ that maximizes this function. In mathematical notation, 
$$ \hat\mu = \underset{\mu}{\arg\max}\ L(\mu). $$

So far, we've solved for $\hat\mu$ by taking the derivative and setting it equal to $0$, i.e.,
$$ L'(\mu) = 0. $$
The derivative is a mess because we have a _product_ of terms involving $\mu$ (i.e., $e^{-\mu}$ and $\mu^7$). We have to apply the product rule from calculus.
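
As a reminder of what that computation looks like, here is a sketch of the product-rule step for this particular likelihood:

$$ L'(\mu) = \frac{1}{7!}\left( -e^{-\mu}\mu^7 + 7 e^{-\mu}\mu^6 \right) = \frac{e^{-\mu}\mu^6}{7!}\,(7 - \mu). $$

For $\mu > 0$, this is zero only at $\mu = 7$.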


## Transforming the Likelihood

Instead of maximizing $L(\mu)$, we can maximize $g(L(\mu))$, where $g$ is _any_ monotone increasing function.


### Definition (Monotone Increasing)

A function $g(x)$ is said to be **monotone increasing** if for any two values $x_1 < x_2$, we have $g(x_1) < g(x_2)$.

In other words, $g$ always increases as you move from left to right. Examples of monotone increasing functions include $g(x) = \log(x)$, $g(x) = x^3$, and $g(x) = x$. Functions like $x^2$ and $\sin(x)$ are not monotone increasing because they sometimes go up and sometimes go down.

In [0]:
#@title Graph Monotone Increasing Functions

plot_continuous_function(log, xlim=(1e-4, 4),
                         xlabel="$x$", ylabel="$g(x)$")
plot_continuous_function(lambda x: x ** 3, xlim=(-2, 2),
                         xlabel="$x$", ylabel="$g(x)$")
plot_continuous_function(lambda x: x, xlim=(-2, 2),
                         xlabel="$x$", ylabel="$g(x)$")

plt.legend([r"$\log(x)$", r"$x^3$", r"$x$"])
plt.ylim(-6, 6)

### Theorem (Monotone Transformations of the Likelihood)

Let $L$ be any function and $g$ be a monotone increasing function. If $\hat\theta$ maximizes $g(L(\theta))$, then $\hat\theta$ also maximizes $L(\theta)$.


#### Proof

In order for $\hat\theta$ to maximize $L$, we need to show that $L(\hat\theta) \geq L(\theta^*)$ for all values $\theta^*$.

But we know that $\hat\theta$ maximizes $g(L(\theta))$. So we know that $g(L(\hat\theta)) \geq g(L(\theta^*))$ for all values $\theta^*$. 

Now suppose, for the sake of contradiction, that $\hat\theta$ did _not_ maximize $L$. That would mean that $L(\hat\theta) < L(\theta^*)$ for some $\theta^*$. Since $g$ is monotone increasing, that would mean that $g(L(\hat\theta)) < g(L(\theta^*))$ for that $\theta^*$. But that contradicts the statement above that $g(L(\hat\theta)) \geq g(L(\theta^*))$ for all $\theta^*$. So $\hat\theta$ must maximize $L$.

_Q.E.D._
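
The theorem can also be checked numerically. The sketch below is only an illustration, not part of the proof: it uses the standard-library `math` module instead of Symbulate, an arbitrary grid of candidate values, and the hypothetical helper `argmax_on_grid`. It compares the maximizer of the Poisson likelihood from the Geiger counter example with the maximizers of two monotone increasing transformations of it.

```python
import math

def likelihood(mu):
    # Poisson likelihood for an observed count of 7 (the Geiger counter example).
    return math.exp(-mu) * mu ** 7 / math.factorial(7)

# Candidate values 0.01, 0.02, ..., 20.00 (an arbitrary illustrative grid).
grid = [k / 100 for k in range(1, 2001)]

def argmax_on_grid(f):
    # Return the grid point at which f takes its largest value.
    return max(grid, key=f)

mu_L = argmax_on_grid(likelihood)
mu_log = argmax_on_grid(lambda mu: math.log(likelihood(mu)))   # g(x) = log(x)
mu_cube = argmax_on_grid(lambda mu: likelihood(mu) ** 3)       # g(x) = x^3

print(mu_L, mu_log, mu_cube)
```

All three searches should land on the same grid point, as the theorem predicts.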

### Which Monotone Transformation to Choose?

We choose a monotone increasing function that makes it easy to take derivatives. It turns out that taking the logarithm is usually the best transformation:
$$ \ell(\theta) \overset{\text{def}}{=} \log L(\theta). $$

Notice that $\log$ here is the _natural logarithm_ (which some people notate as $\ln$). In most science and engineering fields, base $e$ is the default, not base $10$. If you don't believe me, just check out what Python does by default:

In [0]:
log(e), log(10)

Now let's plot the likelihood and the log-likelihood for the Poisson example above.

In [0]:
def likelihood(mu):
  return Poisson(mu).pmf(7)

def log_likelihood(mu):
  return log(likelihood(mu))
  
plot_continuous_function(likelihood, xlim=(1, 15),
                         xlabel=r"$\mu$", ylabel="")
plot_continuous_function(log_likelihood, xlim=(1, 15),
                         xlabel=r"$\mu$", ylabel="")

plt.ylim(-4, 0.4)
plt.legend([r"$L(\mu)$", r"$\log(L(\mu))$"])

Notice that the likelihood and the log-likelihood are very different functions. However, they both achieve their maximum at the same value of $\mu$. So maximizing one is equivalent to maximizing the other.

Besides making the calculus easier (as we will see in a second), the graph above illustrates another benefit of taking logs. The values of the likelihood are very small numbers, and taking logs spreads those small numbers out over a wider range.
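
To make the difference in scale concrete, here is a small sketch that tabulates the likelihood and the log-likelihood at a few values of $\mu$. The values of $\mu$ are chosen arbitrarily, and the code uses the standard-library `math` module rather than Symbulate so that it is self-contained.

```python
import math

def likelihood(mu):
    # Poisson likelihood for an observed count of 7.
    return math.exp(-mu) * mu ** 7 / math.factorial(7)

mus = [1, 4, 7, 10, 15]                      # arbitrary illustrative values
values = [likelihood(mu) for mu in mus]      # all squeezed between 0 and 1
logs = [math.log(v) for v in values]         # spread over several units

for mu, L, logL in zip(mus, values, logs):
    print(f"mu = {mu:2d}   L = {L:.6f}   log L = {logL:.3f}")
```

The likelihood values all sit in a narrow band near zero, while their logs differ by several units, yet both columns peak at the same $\mu$.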

## Maximizing the Log-Likelihood

### Step 1. Calculate the Log-Likelihood

Now let's try maximizing the log-likelihood. First, we calculate and simplify the log-likelihood using properties of logarithms:

\begin{align}
\ell(\mu) = \log L(\mu) &= \log \left( e^{-\mu} \frac{\mu^7}{7!} \right) \\
&= \log e^{-\mu} + \log \mu^7 - \log(7!) \\
&= -\mu + 7 \log \mu - \log(7!)
\end{align}
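
As a sanity check on the algebra, the sketch below compares the log of the original likelihood with the simplified expression at a few arbitrary test points. The function names are hypothetical, and only the standard-library `math` module is assumed.

```python
import math

def log_of_likelihood(mu):
    # Take the log of the likelihood directly, with no simplification.
    return math.log(math.exp(-mu) * mu ** 7 / math.factorial(7))

def simplified_log_likelihood(mu):
    # The simplified form: -mu + 7 log(mu) - log(7!).
    return -mu + 7 * math.log(mu) - math.log(math.factorial(7))

test_points = [0.5, 1.0, 3.0, 7.0, 12.0]     # arbitrary positive values of mu
agree = all(
    math.isclose(log_of_likelihood(mu), simplified_log_likelihood(mu),
                 rel_tol=1e-9, abs_tol=1e-12)
    for mu in test_points
)
print(agree)
```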

### Refresher on the Properties of Logarithms

1. $\log(ab) = \log(a) + \log(b)$
2. $\log(a^b) = b \log(a)$
3. $\log(a / b) = \log(a) - \log(b)$ (Actually, this property follows from properties 1 and 2.)

Also, it is useful to know the derivative of the (natural) log:
$$ \frac{d}{dx} \log(x) = \frac{1}{x}. $$

### Step 2. Take the Derivative and Set it Equal to Zero

We showed above that the log-likelihood is 
$$\ell(\mu) = -\mu + 7 \log \mu - \log(7!). $$

Now we take the derivative of the log-likelihood and set it equal to 0.

\begin{align}
\ell'(\mu) &= -1 + \frac{7}{\mu} - 0 = 0
\end{align}

Solving for $\mu$, we see that the MLE is $\hat\mu = 7$. (Since $\ell''(\mu) = -7/\mu^2 < 0$ for all $\mu > 0$, this critical point is indeed a maximum.)

By the way, the _derivative of the log-likelihood_ $\ell'$ turns out to be a very important quantity in mathematical statistics. It is called the **score**, and the equation $\ell'(\mu) = 0$ is called the **score equation**. You do not have to worry about it now, but it will play an important role later in this course.
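
The score equation can also be solved numerically, which is useful when no closed-form solution exists. The sketch below uses bisection on the score for the Geiger counter example. The helper `bisect_root` and the bracketing interval $[1, 15]$ are assumptions made for illustration, not something prescribed by the method.

```python
def score(mu, x=7):
    # Derivative of the Poisson log-likelihood: l'(mu) = -1 + x / mu.
    return -1 + x / mu

def bisect_root(f, lo, hi, tol=1e-10):
    # Plain bisection; assumes f(lo) and f(hi) have opposite signs.
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid    # root lies in the upper half
        else:
            hi = mid                 # root lies in the lower half
    return (lo + hi) / 2

mu_hat = bisect_root(score, 1.0, 15.0)
print(mu_hat)
```

The score is positive to the left of the root and negative to the right, so the log-likelihood rises and then falls there.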

## Maximum Likelihood Estimator (as a function of $X$)

We calculated the MLE of $\mu$ when $X = 7$. What if $X$ had been $5$? We would have to go through the entire process again: writing down the likelihood (for $X=5$), taking the log, setting the derivative equal to 0, and solving for $\mu$.

However, if we simply leave $X$ in the likelihood, then we will get a formula for the MLE in terms of $X$. This is an _estimator_ because we can say what the estimate will be for any value of $X$.

Let's start by leaving the random variable $X$ in the likelihood:

$$ L(\mu) = e^{-\mu} \frac{\mu^X}{X!}. $$

Next, we calculate the log-likelihood:

$$ \ell(\mu) = \log L(\mu) = -\mu + X \log \mu - \log(X!). $$

Now, we set the derivative equal to 0. Be careful here! Although it may be tempting to take the derivative with respect to $X$, remember that the variable here is $\mu$. 

$$ \frac{\partial\ell}{\partial\mu} = -1 + \frac{X}{\mu} - 0 = 0. $$

Finally, solving for $\mu$, we see that 

$$ \hat\mu = X. $$

So the maximum likelihood estimator for the Poisson distribution can be described as follows: however many radioactive particles $X$ we record in a single minute, that number is our estimate for $\mu$.
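
The estimator can be written as a function of the observed count and compared against a brute-force search. This is a minimal sketch: the function names, the grid, and the example counts are all hypothetical, and it is restricted to positive counts so that the maximum falls inside the grid.

```python
import math

def log_likelihood(mu, x):
    # l(mu) = -mu + x log(mu) - log(x!), using lgamma(x + 1) = log(x!).
    return -mu + x * math.log(mu) - math.lgamma(x + 1)

def mle_closed_form(x):
    # The estimator derived above: mu-hat = X.
    return x

def mle_grid_search(x):
    # Brute-force maximization over 0.01, 0.02, ..., 30.00.
    grid = [k / 100 for k in range(1, 3001)]
    return max(grid, key=lambda mu: log_likelihood(mu, x))

counts = [3, 5, 7, 12]                       # arbitrary example observations
results = {x: (mle_closed_form(x), mle_grid_search(x)) for x in counts}

for x, (closed, numeric) in results.items():
    print(f"X = {x:2d}   formula: {closed}   grid search: {numeric}")
```

For each count, the grid search should agree with the formula, without redoing the calculus for each new value of $X$.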