# Reparametrization And Cross Entropy Error

In the previous notebook, I said that we want to find a better model than Naive Bayes.

**(Run this cell to define useful Latex macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

## Equations From Last Time

Here is our Naive Bayes model equation:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\]

And we saw we could rewrite this as:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\end{align}
\\]

Finally, we saw that we could then write:

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\\
\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\end{align}
\\]

### Eliminating Absence Features

Writing in the absence features $\phi_i^{\prime 1 - w_i}$ is driving me crazy. Can I show you how to get rid of them?

\\[
\begin{align}
\omega_0
&:=
\phi_0
\prod_{i = 1}^M
\phi_i^\prime
\\
\omega_i
&:=
\frac{
    \phi_i
}{
    \phi_i^\prime
}
\end{align}
\\]


If I do this, then:

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\\
&=
\frac{
    \omega_0
    \prod_{i = 1}^M
    \omega_i^{w_i}
}{
    1
    +
    \left(
        \omega_0
        \prod_{i = 1}^M
        \omega_i^{w_i}
    \right)
}
\end{align}
\\]

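Instead of switching over to $\omega$, I'm going to keep writing $\phi$. From here on, $\phi_0$ and $\phi_i$ play the roles of $\omega_0$ and $\omega_i$. I'm not going to use the absence features anymore. They were unnecessary, because their effect can be absorbed into the other parameters.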
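
Here is a small numeric check of that substitution. It is only a sketch: the parameter values are made up, and the function names `prob_spam_phi` and `prob_spam_omega` are hypothetical helpers, not anything from the previous notebook.

```python
import itertools
import math

def prob_spam_phi(phi0, phi, phi_prime, w):
    """P(S = 1 | w) using presence ratios phi and absence ratios phi_prime."""
    odds = phi0
    for p, pp, wi in zip(phi, phi_prime, w):
        odds *= p if wi == 1 else pp
    return odds / (1.0 + odds)

def prob_spam_omega(omega0, omega, w):
    """P(S = 1 | w) using only omega0 and the per-word ratios omega."""
    odds = omega0
    for o, wi in zip(omega, w):
        if wi == 1:
            odds *= o
    return odds / (1.0 + odds)

# Made-up parameters for a vocabulary of three words.
phi0 = 0.5
phi = [3.0, 0.2, 1.5]
phi_prime = [0.8, 1.1, 0.9]

# The definitions from the text.
omega0 = phi0 * math.prod(phi_prime)
omega = [p / pp for p, pp in zip(phi, phi_prime)]

# Compare both forms on every possible email over three words.
checks = []
for w in itertools.product([0, 1], repeat=3):
    a = prob_spam_phi(phi0, phi, phi_prime, w)
    b = prob_spam_omega(omega0, omega, w)
    checks.append(math.isclose(a, b))

print(all(checks))
```

Both forms should agree (up to floating point rounding) on all eight possible emails.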

### Maximizing Likelihood

From the last notebook, we saw that we want to maximize:

\\[
\begin{align}
\pprob{\phi}{\mathcal{D}}
&=
\prod_{i = 1}^N
\pprob{\phi}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\
&=
\prod_{i = 1}^N
\left(
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{s_i}
\left(
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{1 - s_i}
\\
&=
\prod_{i = 1}^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]

I've written the $\phi$ as a subscript of $\Pr$ to emphasize that the "probability" is what our model thinks, and that it depends on our choice of $\phi$. Our job is going to be to pick the $\phi$ that maximizes this probability of the dataset.
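
To make the formula concrete, here is a minimal sketch that evaluates the likelihood on a tiny made-up dataset. The data, the parameter values, and the names `prob_spam` and `likelihood` are all assumptions for illustration.

```python
def prob_spam(phi0, phi, w):
    """Model probability that S = 1 given the word indicators w."""
    odds = phi0
    for p, wi in zip(phi, w):
        odds *= p ** wi  # a word contributes phi_j only when it is present
    return odds / (1.0 + odds)

def likelihood(phi0, phi, emails, labels):
    """Probability the model assigns to the whole dataset."""
    total = 1.0
    for w, s in zip(emails, labels):
        p1 = prob_spam(phi0, phi, w)
        total *= (p1 ** s) * ((1.0 - p1) ** (1 - s))
    return total

# Four made-up emails over a two-word vocabulary, with spam labels.
emails = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]

# With every parameter equal to 1, each email gets probability 1/2.
uninformed = likelihood(1.0, [1.0, 1.0], emails, labels)

# A choice that treats word 1 as a spam signal fits this data better.
informed = likelihood(0.5, [8.0, 1.0], emails, labels)

print(uninformed, informed)
```

The second choice of $\phi$ assigns a higher probability to this dataset than the first, which is exactly the sense in which it is a "better" $\phi$.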

### Maximizing Log Likelihood

We may have thousands of datapoints. Each datapoint will be assigned some probability $\pprob{\phi}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}} < 1$.

Multiplying many numbers less than one quickly yields a very, very small number. Computers have difficulty representing numbers like $\left(\frac{1}{2}\right)^{2048}$ in floating point. A number like this will end up being rounded to zero, and even before that point a great deal of precision is lost.

To avoid this problem, it is common to work in *log space*, where the product of many probabilities becomes a sum of log probabilities. Here we go:

\\[
\begin{align}
\log
\pprob{\phi}{\mathcal{D}}
&=
\log
\prod_{i = 1}^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\\
&=
\sum_{i = 1}^N
s_i
\left(
    \log
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)
+
(1 - s_i)
\left(
    \log
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)
\end{align}
\\]

Since the $\log$ function is *[monotonic][monotonic]*, maximizing the log probability is the same as maximizing the probability.
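
The underflow is easy to reproduce. This sketch assumes standard double-precision floats (which is what Python's `float` uses on typical platforms) and a made-up dataset where every datapoint gets probability one half.

```python
import math

# 2048 made-up datapoints, each assigned probability 1/2 by the model.
probs = [0.5] * 2048

# Multiplying directly underflows: 2**-2048 is far below the smallest
# positive double (about 4.9e-324), so the result is rounded to zero.
product = 1.0
for p in probs:
    product *= p

# Summing logs stays comfortably in range.
log_likelihood = sum(math.log(p) for p in probs)

print(product)
print(log_likelihood)
```

The product comes out as exactly `0.0`, while the sum of logs is an ordinary number near $2048 \log \frac{1}{2}$.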

[monotonic]: https://en.wikipedia.org/wiki/Monotonic_function

### Log Probability Calculation Is Still Troublesome

The log likelihood of the dataset, which we want to maximize, is a sum of log probabilities. Each log probability looks like this:

\\[
\log
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
=
\log
\frac{
    \phi_0
    \prod_{j = 1}^M
    \phi_j^{w_{j}}
}{
    1
    +
    \left(
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{j}}
    \right)
}
\\]

There are a couple of problems with this. Once again, we see a product of many $\phi_j$ values. We know computers don't do well with that, because they can lose precision.

There is another problem. This formula will work very poorly with Gradient Ascent. Gradient Ascent works best when the second partial derivatives are of similar size for all parameters. This has to do with the relationship between Gradient Ascent and Newton's Method. If the second derivatives are very different with respect to different parameters, no single learning rate will work well for all the parameters.

Our parameterization has this problem. The second partial derivatives of our formula above will be very different when taken with respect to different parameters.

One way to intuit the problem is like this. Consider the same change of adding $\epsilon$ to a parameter $\phi_i$ and to a parameter $\phi_j$, where both words are present. If $\epsilon \ll \phi_i$, then this change has very little impact on the product $\phi_0 \prod_k \phi_k^{w_k}$. On the other hand, if $\epsilon \gg \phi_j$, then the same change has a big impact on the product.

This isn't quite exactly what screws up Gradient Ascent, but it is closely related.
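
A quick numeric illustration of how unevenly the product responds to the same additive step. The parameter values and the helper name `odds` are made up for this sketch.

```python
def odds(phi0, phi, w):
    """The product phi0 * prod_j phi_j ** w_j."""
    result = phi0
    for p, wi in zip(phi, w):
        result *= p ** wi
    return result

phi0 = 1.0
phi = [100.0, 0.001]  # one large parameter, one small parameter
w = [1, 1]            # both words are present
eps = 0.01            # the same additive step for both parameters

base = odds(phi0, phi, w)
bump_large = odds(phi0, [phi[0] + eps, phi[1]], w)
bump_small = odds(phi0, [phi[0], phi[1] + eps], w)

# Factor by which the product changes in each case.
ratio_large = bump_large / base
ratio_small = bump_small / base

print(ratio_large, ratio_small)
```

Adding `eps` to the large parameter changes the product by a factor of about 1.0001, while adding the same `eps` to the small parameter changes it by a factor of about 11.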


### Reparameterizing the model

Let's change the way we calculate the log probability. We'll avoid multiplying a bunch of terms, but come out with effectively the same model.

Let's define a bunch of values:

\\[
\theta_i := \log \phi_i
\\]

Then, we have that:

\\[
\log
\frac{
    \phi_0
    \prod_{j = 1}^M
    \phi_j^{w_{j}}
}{
    1
    +
    \left(
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{j}}
    \right)
}
=
\log
\frac{
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}
\\]

What I have done is called *reparameterization*. Instead of defining my model in terms of $\phi$, I will define it in terms of $\theta$. But it's obvious how to go from $\phi$ to $\theta$ (take the log), and back again (exponentiate).

What I have done is really nothing at all. But it has the important benefit of avoiding a bunch of multiplications by turning them into one sum, plus one exponentiation operation. This plays a lot better with how computers work. It is faster and it is more numerically accurate.

Though I won't explain exactly why, this form also results in better behavior when applying Gradient Ascent. One suggestive difference from before is that changing the parameter of any present word by $\epsilon$ has the same effect as changing the parameter of any other present word by $\epsilon$: the odds get multiplied by $e^\epsilon$. (You can verify that.)
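
Here is one way to check both claims numerically: that the $\theta$ form computes the same odds as the product form, and that an $\epsilon$ step has the same effect for every present word. The values and the helper name `log_odds` are assumptions for this sketch.

```python
import math

def log_odds(theta0, theta, w):
    """theta0 + sum_j theta_j * w_j: one sum instead of many multiplications."""
    return theta0 + sum(t * wi for t, wi in zip(theta, w))

# Made-up phi values, including one large and one small parameter.
phi0 = 0.5
phi = [100.0, 0.001, 3.0]
w = [1, 1, 0]  # the third word is absent

# Reparameterize: theta = log(phi).
theta0 = math.log(phi0)
theta = [math.log(p) for p in phi]

# The two ways of computing the odds agree up to rounding.
odds_product = phi0 * math.prod(p ** wi for p, wi in zip(phi, w))
odds_exp = math.exp(log_odds(theta0, theta, w))

# Bump each theta_j by the same eps and record how the odds change.
eps = 0.01
factors = []
for j in range(len(theta)):
    bumped = list(theta)
    bumped[j] += eps
    factors.append(math.exp(log_odds(theta0, bumped, w)) / odds_exp)

print(odds_product, odds_exp)
print(factors)
```

For the two present words the factor is $e^{0.01}$ in both cases, even though the underlying $\phi$ values differ by five orders of magnitude. For the absent word the factor is 1, since its parameter does not enter the sum.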

There is a handy function called the *logistic function*. It is $\sigma(z) = \frac{e^z}{1 + e^z}$. So we can write:

\\[
\frac{
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}
=
\sigma\left(
    \theta_0
    +
    \sum_{j = 1}^M
        \theta_j w_j
\right)
\\]

A handy fact about the logistic function (also called the *sigmoid function*, hence the symbol $\sigma$):

\\[
1 - \sigma(z) = \sigma(-z)
\\]

You should prove this yourself.
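
A numeric check is not a proof, but it is a good sanity test. The sketch below uses a common implementation trick that this notebook does not prescribe: branching on the sign of $z$ so that `math.exp` is only ever called on a non-positive argument, which avoids overflow for large $z$.

```python
import math

def sigmoid(z):
    """Logistic function e^z / (1 + e^z), arranged to avoid overflow."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^{-z}); e^{-z} <= 1 here.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, e^z < 1, so the original form is safe.
    e = math.exp(z)
    return e / (1.0 + e)

# Check 1 - sigma(z) == sigma(-z) at a few made-up points.
identity_holds = all(
    math.isclose(1.0 - sigmoid(z), sigmoid(-z), rel_tol=1e-9, abs_tol=1e-12)
    for z in (-30.0, -2.5, 0.0, 0.7, 30.0)
)

print(identity_holds)
print(sigmoid(1000.0), sigmoid(-1000.0))
```

With the naive formula, `math.exp(1000.0)` would raise an `OverflowError`. The branching version returns `1.0` and `0.0` at those extremes instead.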

### The Reparameterized Likelihood Function

So here is our new problem!

\\[
\begin{align}
\log
\pprob{\theta}{\mathcal{D}}
&=
\sum_{i = 1}^N
s_i
\left(
    \log
    \sigma\left(
        \theta_0
        +
        \sum_{j = 1}^M \theta_j w_{i, j}
    \right)
\right)
+
(1 - s_i)
\left(
    \log
    \sigma\left(
        -\theta_0
        -
        \sum_{j = 1}^M \theta_j w_{i, j}
    \right)
\right)
\end{align}
\\]

This is the reparameterized problem! We are ready to try to apply Gradient Ascent to it!

Before we do so, let's turn this into an error function!

\\[
E(\theta) = - \log \pprob{\theta}{\mathcal{D}}
\\]

I didn't do anything exciting here. I just flipped the sign, so instead of maximizing with Gradient Ascent I'll minimize with Gradient Descent.

This error function is called the *cross entropy error*. Minimizing the cross entropy error is equivalent to maximizing the likelihood.
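
Here is a minimal sketch of computing this error on a tiny made-up dataset. The data, the parameter values, and the names `log_sigmoid` and `cross_entropy_error` are assumptions for illustration. Computing $\log \sigma(z)$ through `math.log1p` is a common stability trick, not something derived above.

```python
import math

def log_sigmoid(z):
    """log(sigma(z)) = -log(1 + e^{-z}), arranged to avoid overflow."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def cross_entropy_error(theta0, theta, emails, labels):
    """E(theta) = -log Pr_theta[D] for binary word-indicator emails."""
    total = 0.0
    for w, s in zip(emails, labels):
        z = theta0 + sum(t * wi for t, wi in zip(theta, w))
        # Uses 1 - sigma(z) = sigma(-z) for the S = 0 case.
        total += s * log_sigmoid(z) + (1 - s) * log_sigmoid(-z)
    return -total

# Four made-up emails over a two-word vocabulary, with spam labels.
emails = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]

# All-zero parameters give every email probability 1/2.
baseline = cross_entropy_error(0.0, [0.0, 0.0], emails, labels)

# Parameters that treat word 1 as a spam signal should do better.
better = cross_entropy_error(-1.0, [2.0, 0.0], emails, labels)

print(baseline, better)
```

With all parameters at zero the error is $4 \log 2$, one $\log 2$ per datapoint. The second choice of $\theta$ gives a lower error, which corresponds to a higher likelihood.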

There is a cool interpretation of the cross entropy that comes out of a field called [Information Theory][itheory], but I will leave that as a bonus notebook.

[itheory]: https://en.wikipedia.org/wiki/Information_theory