<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

In [23]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd .. 
import sys
sys.path.append("..")
import statnlpbook.util as util

# Maximum Likelihood Estimator

The Maximum Likelihood Estimator (MLE) is one of the simplest, and often most intuitive, ways to determine the parameters of a probabilistic model based on some training data. Under favourable conditions the MLE has several useful properties. One such property is consistency: if you sample enough data from a distribution with certain parameters, the MLE will recover these parameters with arbitrary precision. In our [structured prediction recipe](structured_prediction.ipynb) MLE can be seen as the most basic form of continuous optimization for parameter estimation. 

In this section we will focus on MLE for _discrete distributions_ and _continuous parameters_. We will assume a distribution \\(\prob_\params(\x)\\) with \\(\x = (x_1,\ldots, x_n)\\) that factorizes in the following way:

\begin{equation}
  \prob_\params(\x) = \prod_{i=1}^n \prob_\params(x_i|\phi_i(\x)) 
                    = \prod_{i=1}^n \param_{x_i|\phi_i(\x)}
\end{equation}

TODO: Define simple toy example to later illustrate the objective with.

In [5]:
def prob(x, theta1, theta2):
    return theta1 if x else theta2

Here the functions \\(\phi_i\\) provide a context on which to condition the probability of \\(x_i\\). For example, in a trigram language model this could be the bigram history for word \\(i\\), and hence \\(\phi_i(\x) = (x_{i-1},x_{i-2})\\). Notice that this function should not consider the variable \\(x_i\\) itself. 
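To make the context function concrete, here is a minimal sketch for the trigram case. The helper name `trigram_context` and the padding symbol `"<s>"` for positions before the start of the sequence are assumptions made for illustration; the text above does not prescribe how boundaries are handled.

```python
def trigram_context(x, i, pad="<s>"):
    """Return phi_i(x) = (x[i-1], x[i-2]) without looking at x[i] itself.

    Positions before the start of the sequence are filled with `pad`
    (a hypothetical boundary symbol chosen for this sketch).
    """
    prev1 = x[i - 1] if i - 1 >= 0 else pad
    prev2 = x[i - 2] if i - 2 >= 0 else pad
    return (prev1, prev2)

sentence = ["the", "cat", "sat"]
contexts = [trigram_context(sentence, i) for i in range(len(sentence))]
```

Each entry of `contexts` is the bigram history for the word at the same position, so the pair `(sentence[i], contexts[i])` is one event \\((x_i, \phi_i(\x))\\) of the model.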

The Maximum Likelihood estimate \\(\params^*\\) for this model, given some training data \\(\train = (x_1,\ldots, x_n)\\), is defined as the solution to the following optimization problem:

\begin{equation}\label{eq:mle}
  \params^* = \argmax_{\params} \prob_\params(\train) = \argmax_{\params} \log \prob_\params(\train) 
\end{equation}

TODO: Mention/define IID?

Here the second equality stems from the monotonicity of the \\(\log\\) function, and is useful because the \\(\log\\) expression is easier to optimize. In words, the maximum likelihood estimate is the set of parameters that assigns maximal probability to the training sample.  

In [24]:
from math import log
def ll(data, theta1, theta2):
    return sum([log(prob(x, theta1, theta2)) for x in data])
ll([True, False, True], 0.1, 0.9)

-4.710530701645917

In [36]:
import matplotlib.pyplot as plt
import mpld3
import numpy as np

N = 100
x = np.linspace(0.001, 1.0, N)
y = np.linspace(0.001, 1.0, N)

xx, yy = np.meshgrid(x, y)

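def create_ll_plot(data):
    # Evaluate the log-likelihood at every (theta1, theta2) grid point
    np_ll = np.vectorize(lambda t1, t2: ll(data, t1, t2))
    z = np_ll(xx, yy)
    fig = plt.figure()
    # Contour lines of the log-likelihood; levels are chosen automatically
    contour = plt.contour(x, y, z)
    # The normalization constraint theta1 + theta2 = 1
    plt.plot(x, 1 - x)
    plt.clabel(contour)
    return fig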

util.Carousel([mpld3.display(create_ll_plot([True,False])),
               mpld3.display(create_ll_plot([True,True,True,False]))])

As it turns out, the solution to \\(\ref{eq:mle}\\) has a _closed form_: we can write the result as a direct function of \\(\train\\) without the need for any iterative optimization algorithm. The result is simply:

\begin{equation}\label{eq:counts}
  \param_{x|\phi} = \frac{\counts{\train}{x,\phi}}{\counts{\train}{\phi}}
\end{equation}

where \\(\counts{\train}{x,\phi}\\) is the number of times we have seen the value \\(x\\) paired with the context \\(\phi\\) in the data \\(\train\\), and \\(\counts{\train}{\phi}\\) the number of times we have seen the context \\(\phi\\).   
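The count ratio can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `mle_estimate` and the representation of the data as a list of `(x, phi)` pairs are assumptions, not part of the accompanying library.

```python
from collections import Counter

def mle_estimate(pairs):
    """Estimate theta[x | phi] = count(x, phi) / count(phi).

    `pairs` is an iterable of (x, phi) observations (an assumed data format).
    Returns a dict mapping (x, phi) to the estimated probability.
    """
    pairs = list(pairs)
    joint_counts = Counter(pairs)                          # count(x, phi)
    context_counts = Counter(phi for _, phi in pairs)      # count(phi)
    return {(x, phi): c / context_counts[phi]
            for (x, phi), c in joint_counts.items()}

# Toy data: each pair is (word, previous word)
data = [("cat", "the"), ("dog", "the"), ("cat", "the"), ("sat", "cat")]
theta = mle_estimate(data)
```

For this toy data the context `"the"` occurs three times, twice followed by `"cat"`, so `theta[("cat", "the")]` is 2/3. Events that never occur in the data are simply absent from the dictionary, which corresponds to a probability of zero.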

Notice that, just as we can represent the context of a variable using a function \\(\phi\\) and hence map contexts to more coarse-grained equivalence classes, we can also map the values \\(x_i\\) to a more coarse-grained representation \\(\gamma(x_i)\\). For example, in a language model we could decide to only care about the syntactic type (Verb, Noun, etc.) of a word and use \\(\gamma(x) = \mbox{syn-type}(x)\\). In this case the MLE only changes in the way we count: instead of counting the times we see \\(x\\) paired with the context \\(\phi\\), we count how often we see \\(\gamma(x)\\) paired with the context \\(\phi\\). 
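The following sketch illustrates counting over coarse-grained values. The lookup table `SYN_TYPE` is a hypothetical stand-in for a real syntactic tagger, and the function names are chosen for this example only.

```python
from collections import Counter

# Hypothetical word-to-syntactic-type table, for illustration only
SYN_TYPE = {"cat": "Noun", "dog": "Noun", "sat": "Verb", "ran": "Verb"}

def gamma(x):
    """Map a word to its coarse-grained representation."""
    return SYN_TYPE[x]

def coarse_mle(pairs, gamma):
    """Count gamma(x) paired with phi instead of x paired with phi."""
    coarse = [(gamma(x), phi) for x, phi in pairs]
    joint_counts = Counter(coarse)
    context_counts = Counter(phi for _, phi in coarse)
    return {(g, phi): c / context_counts[phi]
            for (g, phi), c in joint_counts.items()}

data = [("cat", "the"), ("dog", "the"), ("ran", "the"), ("sat", "cat")]
theta = coarse_mle(data, gamma)
```

Here `"cat"` and `"dog"` fall into the same class, so the two observations are pooled into a single count for `("Noun", "the")`.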

## Derivation
It is easy to derive the estimate for the discrete distributions described above. First let us reformulate the log-likelihood \\(L\\) in terms of dataset counts:

\begin{equation}
  \newcommand{\duals}{\boldsymbol{\lambda}}
  \newcommand{\lagrang}{\mathcal{L}}
  L(\train,\params) = \log \prob_\params(\train) 
            = \sum_{x,\phi} \counts{\train}{x,\phi} \log \param_{x|\phi}
\end{equation}
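As a quick sanity check of the count-based form, the sketch below compares it with the per-item sum of log probabilities on a toy sample. It assumes a single (empty) context, so the parameters are indexed by the value \\(x\\) alone; the variable names are chosen for this example.

```python
from collections import Counter
from math import log

# Parameters theta[x] for a single, empty context (assumed toy setting)
theta = {True: 0.1, False: 0.9}
data = [True, False, True]

# Log-likelihood as a sum over individual data points
ll_per_item = sum(log(theta[x]) for x in data)

# Log-likelihood as a sum over distinct events, weighted by their counts
counts = Counter(data)
ll_counts = sum(c * log(theta[x]) for x, c in counts.items())
```

Both quantities agree up to floating-point rounding, since grouping identical terms of the sum does not change its value.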

Next, remember that we want, for a given $\phi$, the parameters $\param_{\cdot|\phi}$ to represent a conditional probability distribution $\prob_\params(\cdot|\phi)$. This requires positivity (which falls out naturally later) and, crucially, a normalization constraint. In particular, we need $\sum_x \param_{x|\phi} = 1$. 

We hence have to solve a *constrained* optimization problem. A standard technique to solve such problems relies on the notion of the *Lagrangian* $\lagrang$: a version of the objective in which constraints are added as soft constraints weighted by the *Lagrange multipliers* $\duals$: 

\begin{equation}
  \lagrang(\params,\duals) = L(\train,\params) + \sum_\phi \lambda_\phi (1 - \sum_x \param_{x|\phi})
\end{equation}

If $\params^*$ is a solution to the original optimization problem then there exists a set of multipliers $\duals^*$ such that $\params^*,\duals^*$ is a *stationary point* of $\lagrang$. By setting $\nabla_\params \lagrang = 0$ and \\(\nabla_\duals \lagrang = 0\\) we can find such points. 

We first set \\(\nabla_\params \lagrang = 0\\):

\begin{equation}
  \frac{\partial \lagrang}{\partial \param_{x|\phi}} = \counts{\train}{x,\phi} \frac{1}{\param_{x|\phi}} - \lambda_\phi = 0 
\end{equation}

This means that each parameter needs to be proportional to the count of its corresponding event:

\begin{equation}
  \param_{x|\phi} = \frac{\counts{\train}{x,\phi}}{\lambda_\phi}
\end{equation}

Setting $\nabla_\duals \lagrang = 0$ recovers the original constraints: $\sum_x \param_{x|\phi} = 1$. Plugging the above expression for $\param_{x|\phi}$ into this constraint gives $\lambda_\phi = \sum_x \counts{\train}{x,\phi} = \counts{\train}{\phi}$ and hence equation $\ref{eq:counts}$. Notice that there is only a single stationary point. Because the log-likelihood is concave in $\params$ and the constraints are linear, the parameters $\params$ at this point are the optimal ones.
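The closed-form result can be checked numerically on a two-outcome toy sample. The sketch below is an illustration under stated assumptions: it enforces the normalization constraint by setting the second parameter to one minus the first, and it uses a simple grid search (not part of the derivation) to locate the maximum.

```python
from math import log

def constrained_ll(data, theta1):
    """Log-likelihood with the constraint theta2 = 1 - theta1 built in."""
    return sum(log(theta1 if x else 1.0 - theta1) for x in data)

data = [True, True, True, False]

# Closed form: count of True divided by the total count
closed_form = sum(data) / len(data)

# Brute-force search over a grid of candidate values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: constrained_ll(data, t))
```

On this grid the search lands on the same value as the count ratio, which is what the single stationary point of the Lagrangian predicts.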
