# Reparametrization And Cross Entropy Error

In the previous notebook, I said that we want to find a better model than Naive Bayes.

**(Run this cell to define useful Latex macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

## Equations From Last Time

Here is our Naive Bayes model equation:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\]

And we saw we could rewrite this as:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\end{align}
\\]

Finally, we saw that we could then write:

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\\
\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\end{align}
\\]

### Eliminating Absence Features

Writing in the absence features $\phi_i^{\prime 1 - w_i}$ is driving me crazy. We used presence and absence features with the Naive Bayes model because Naive Bayes calculates each one separately as a feature probability ratio.

This is how Naive Bayes works, but it isn't strictly necessary to do things this way.

Let me choose some new parameters that will work *the same* as Naive Bayes. I will use these to simplify my equations above.

\\[
\begin{align}
\omega_0
&:=
\phi_0
\prod_{i = 1}^M
\phi_i^\prime
\\
\omega_i
&:=
\frac{
    \phi_i
}{
    \phi_i^\prime
}
\\
\omega_i^\prime
&:=
1.0
\end{align}
\\]


Let me show you that this calculates the same Naive Bayes probabilities, even though the absence feature factor is always 1.0:

\\[
\begin{align}
\frac{
    \pprob{\omega}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\omega}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\omega_0
\prod_{i = 1}^M
\omega_i^{w_i}
\omega_i^{\prime (1 - w_i)}
\\
&=
\left(
    \phi_0
    \prod_{i = 1}^M
        \phi_i^\prime
\right)
\prod_{i = 1}^M
\left(
    \frac{
        \phi_i
    }{
        \phi_i^\prime
    }
\right)^{w_i}
1^{1 - w_i}
\\
&=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\left(
    \phi_i^{\prime -1}
\right)^{w_i}
\left(
    \phi_i^\prime
\right)^1
\\
&=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime (1 - w_i)}
\end{align}
\\]


What have I shown here?

There is always a choice of $\omega$ that is *equivalent* to the $\phi$ that are chosen by Naive Bayes. Since these $\omega$ always have $\omega_i^\prime = 1.0$, those parameters are redundant and unnecessary. I will drop all mention of $\omega_i^\prime$ from any future equations.
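If you want to convince yourself numerically, here is a minimal Python sketch of the check. The parameter values are made up for illustration, and the helper names `odds_phi` and `odds_omega` are my own, not something from the previous notebook.

```python
import math

def odds_phi(phi0, phi, phi_prime, w):
    """Odds of S = 1 using presence (phi) and absence (phi_prime) factors."""
    result = phi0
    for p, pp, wi in zip(phi, phi_prime, w):
        result *= p ** wi * pp ** (1 - wi)
    return result

def odds_omega(omega0, omega, w):
    """Odds of S = 1 using presence factors only (absence factor is 1.0)."""
    result = omega0
    for o, wi in zip(omega, w):
        result *= o ** wi
    return result

# Hypothetical Naive Bayes style parameters for M = 3 words.
phi0 = 0.25
phi = [3.0, 0.5, 1.2]
phi_prime = [0.8, 1.1, 0.9]

# The reparameterization defined above.
omega0 = phi0 * math.prod(phi_prime)
omega = [p / pp for p, pp in zip(phi, phi_prime)]

# Both parameterizations should give the same odds for any word vector.
for w in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    assert math.isclose(odds_phi(phi0, phi, phi_prime, w),
                        odds_omega(omega0, omega, w))
```

The assertions pass silently, which is what "equivalent" means here: both versions produce the same odds for every email.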

Likewise, for any setting of $\omega$, I can always go back to a setting of the $\phi$ values:

\\[
\begin{align}
\phi_0
&:=
\omega_0
\\
\phi_i
&:=
\omega_i
\\
\phi_i^\prime
&:=
1.0
\end{align}
\\]


What I am showing is that $\phi$ and $\omega$ are *interchangeable*. Any model in terms of $\phi$ is equal to a model in terms of some $\omega$, and vice versa.

This is just like the interchangeability of feet and meters. It doesn't really matter which one I use. If I find it easier to solve a problem about distance in terms of meters, then there is no need for me to use feet for my calculations, right?

Likewise, using my $\omega$ values is easier than using $\phi$ values. The equations will be simpler, because the $\omega$ version will not have any absence features.

Instead of trying to find the very best $\phi$ values, I will try to find the best $\omega$ values. This is like finding the answer in meters rather than in feet. If for *some* reason you really want me to give you the answer in $\phi$ values, I can convert back and give it to you like that.


Instead of writing $\omega$ everywhere I used to write $\phi$, I'm just going to keep using the $\phi$ variable name, but with no more $\phi_i^\prime$ variables. I showed those were unnecessary.

Meet the new equations!

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
    \right)
}
\\
\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
    \right)
}
\end{align}
\\]

If I do this, my Naive Bayes settings won't work:

\\[
\begin{align}
    \phi_0
&:=
    \frac{
        \prob{\text{S} = 1}
    }{
        \prob{\text{S} = 0}
    }
    \quad\quad\text{(THESE ARE NOT CORRECT ANYMORE)}
\\
    \phi_i
&:=
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
    }{
        \prob{W_i = 1 \condbar \text{S} = 0}
    }
\end{align}
\\]

These old definitions wouldn't work, because there are no longer any $\phi_i^\prime$ variables, and our Naive Bayes formulas were using them. But sure enough, we can find a new setting that works for our Naive Bayes model without using any $\phi_i^\prime$:

\\[
\begin{align}
    \phi_0
&:=
    \frac{
        \prob{\text{S} = 1}
    }{
        \prob{\text{S} = 0}
    }
    \prod_{i = 1}^M
        \frac{
            \prob{W_i = 0 \condbar \text{S} = 1}
        }{
            \prob{W_i = 0 \condbar \text{S} = 0}
        }
\\
    \phi_i
&:=
        \frac{
            \prob{W_i = 1 \condbar \text{S} = 1}
        }{
            \prob{W_i = 1 \condbar \text{S} = 0}
        }
    \Big/
        \frac{
            \prob{W_i = 0 \condbar \text{S} = 1}
        }{
            \prob{W_i = 0 \condbar \text{S} = 0}
        }
\end{align}
\\]

That would give you the same old Naive Bayes probability calculations. And that is the whole point of the parameterization change: you never needed those $\phi_i^\prime$ parameters.
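Here is a small sketch that checks this claim. It assumes some made-up word probabilities for two words; all the variable and function names are hypothetical, chosen just for this example.

```python
import math

# Hypothetical probabilities, made up for illustration.
p_spam = 0.3                      # Pr[S = 1]
p_word_given_spam = [0.6, 0.1]    # Pr[W_i = 1 | S = 1]
p_word_given_ham = [0.2, 0.3]     # Pr[W_i = 1 | S = 0]

def naive_bayes_posterior(w):
    """Pr[S = 1 | W = w] computed directly from the Naive Bayes factors."""
    spam = p_spam
    ham = 1 - p_spam
    for ps, ph, wi in zip(p_word_given_spam, p_word_given_ham, w):
        spam *= ps if wi else 1 - ps
        ham *= ph if wi else 1 - ph
    return spam / (spam + ham)

# The new settings, which use no absence parameters.
absence_ratios = [(1 - ps) / (1 - ph)
                  for ps, ph in zip(p_word_given_spam, p_word_given_ham)]
phi0 = p_spam / (1 - p_spam) * math.prod(absence_ratios)
phi = [(ps / ph) / a
       for ps, ph, a in zip(p_word_given_spam, p_word_given_ham, absence_ratios)]

def reparameterized_posterior(w):
    """Pr[S = 1 | W = w] from phi0 and phi alone."""
    odds = phi0 * math.prod(p ** wi for p, wi in zip(phi, w))
    return odds / (1 + odds)

# The two calculations should agree on every possible word vector.
for w in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert math.isclose(naive_bayes_posterior(w), reparameterized_posterior(w))
```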

However, I no longer really care about the Naive Bayes model values for $\phi_0, \phi_i$. I want to choose my own values. It doesn't really matter any more what Naive Bayes thinks these should be set to, because I'm going to set them for myself.

### Maximizing Likelihood

From the last notebook, we saw that we want to maximize:

\\[
\begin{align}
\pprob{\phi}{\mathcal{D}}
&=
\prod_i^N
\pprob{\phi}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\
&=
\prod_i^N
\left(
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{s_i}
\left(
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{1 - s_i}
\\
&=
\prod_i^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]

I've written the $\phi$ as a subscript of $\Pr$ to try to emphasize that the "probability" is what our model thinks, and it depends on our choice of $\phi$. Our job is going to be to pick the $\phi$ that maximizes this probability of the dataset.
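To make the formula concrete, here is a minimal sketch that evaluates the likelihood on a tiny dataset. The dataset and both parameter settings are invented for illustration, and I represent each datapoint as a hypothetical `(w, s)` pair of a 0/1 word vector and a 0/1 label.

```python
def prob_spam(phi0, phi, w):
    """Model probability that S = 1 given the word vector w."""
    odds = phi0
    for p, wi in zip(phi, w):
        odds *= p ** wi
    return odds / (1 + odds)

def likelihood(phi0, phi, dataset):
    """Probability the model assigns to the labels of the whole dataset."""
    total = 1.0
    for w, s in dataset:
        p1 = prob_spam(phi0, phi, w)
        total *= p1 ** s * (1 - p1) ** (1 - s)
    return total

# Toy dataset: two words, three emails.
toy_dataset = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1)]

print(likelihood(1.0, [2.0, 0.5], toy_dataset))
print(likelihood(1.0, [1.0, 1.0], toy_dataset))
```

The first setting assigns the toy dataset a higher probability than the second, so under the maximum likelihood criterion it is the better choice of $\phi$.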

### Maximizing Log Likelihood

We may have thousands of datapoints. Each datapoint will be assigned some probability $\pprob{\phi}{S = s_i \condbar W = w} < 1.0$.

Multiplying many numbers less than one quickly yields a very, very small number. Computers have difficulty representing numbers like $\left(\frac{1}{2}\right)^{2048}$ in floating point representation. A number like this will end up being rounded to zero. Even when the product does not round all the way to zero, a great deal of precision is lost.
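The underflow is easy to reproduce. Here is a minimal sketch, assuming 2048 made-up datapoint probabilities of 0.5 each, that compares the running product with a running sum of logarithms.

```python
import math

probabilities = [0.5] * 2048  # hypothetical per-datapoint probabilities

# Multiply the probabilities directly.
product = 1.0
for p in probabilities:
    product *= p

# Add up the logarithms of the same probabilities.
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # underflows to 0.0 in 64-bit floating point
print(log_sum)  # an ordinary negative number, 2048 * log(0.5)
```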

To avoid this problem, it is common to work in *log space*, where the product of many probabilities becomes a sum of log probabilities. Here we go:

\\[
\begin{align}
\log
\pprob{\phi}{\mathcal{D}}
&=
\log
\prod_i^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\\
&=
\sum_i^N
s_i
\left(
    \log
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)
+
(1 - s_i)
\left(
    \log
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
        \right)
    }
\right)
\end{align}
\\]

Since the $\log$ function is *[monotonic][monotonic]*, maximizing the log probability is the same as maximizing the probability.

[monotonic]: https://en.wikipedia.org/wiki/Monotonic_function

### Log Probability Calculation Is Still Troublesome

The log likelihood of the dataset that we want to maximize is a sum of log probabilities. Each log probability looks like this:

\\[
\log
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
=
\log
\frac{
    \phi_0
    \prod_{j = 1}^M
    \phi_j^{w_{j}}
}{
    1
    +
    \left(
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{j}}
    \right)
}
\\]

There are a couple of problems with this. Once again, we see a product of many $\phi_j$ values. We know computers don't do well with that, because they can lose precision.

**Bonus**: There is another problem. This formula will work very poorly with Gradient Ascent. Gradient Ascent works best when the second partial derivative is the same with respect to all parameters. This has to do with the relationship between Gradient Ascent and Newton's Method. If the second derivatives are very different with respect to different parameters, no learning rate will work well for all the parameters.

Our parameterization has this problem. The second partial derivatives of our formula above will be very different when taken with respect to different parameters.

One way to intuit the problem is like this. Consider the same change of adding $\epsilon$ to a parameter $\phi_i$ and to a parameter $\phi_j$, where both words are present. Adding $\epsilon$ to $\phi_i$ multiplies the product $\phi_0 \prod_k \phi_k^{w_k}$ by $1 + \epsilon / \phi_i$. If $\epsilon \ll \phi_i$, then this change has very little impact on the product. On the other hand, if $\epsilon \gg \phi_j$, then the same change has a big impact on the product.

This isn't quite exactly what screws up Gradient Ascent, but it is closely related.


### Reparameterizing the model (again)

We've already reparameterized the model once before by dropping a bunch of unnecessary parameters (the absence features).

Let's do it all over again to avoid multiplying a bunch of terms together. Here's what I'm going to do:

\\[
\begin{align}
\theta_0
&:=
\log \phi_0
\\
\theta_i
&:=
\log \phi_i
\end{align}
\\]


Using this definition, we can go back like so:

\\[
\begin{align}
\phi_0
&=
\exp(\theta_0)
\\
\phi_i
&=
\exp(\theta_i)
\end{align}
\\]

This is because exponentiation and logarithms are inverse operations.

Let's use this reparameterization to write our equation for the log probability in terms of $\theta$:

\\[
\begin{align}
\log
\pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\log
\frac{
    \phi_0
    \prod_{j = 1}^M
    \phi_j^{w_{j}}
}{
    1
    +
    \left(
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{j}}
    \right)
}
\\
&=
\log
\frac{
    \exp(\theta_0)
    \prod_{j = 1}^M
    \exp(\theta_j)^{w_j}
}{
    1
    +
    \left(
        \exp(\theta_0)
        \prod_{j = 1}^M
        \exp(\theta_j)^{w_j}
    \right)
}
\\
&=
\log
\frac{
    \exp(\theta_0)
    \prod_{j = 1}^M
    \exp(\theta_j w_j)
}{
    1
    +
    \left(
        \exp(\theta_0)
        \prod_{j = 1}^M
        \exp(\theta_j w_j)
    \right)
}
\\
&=
\log
\frac{
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}
\end{align}
\\]


Again, what I have done is really nothing at all. Every setting of the $\phi$ parameters corresponds to a setting of the $\theta$ parameters, and vice versa. This is another reparameterization.

The equations using this reparameterization are convenient because they no longer contain a long product of parameters. The product has turned into one sum, plus one exponentiation operation. This plays a lot better with how computers work. It is faster and it is more numerically accurate.

Though I won't explain exactly why, this form also results in better behavior when applying Gradient Ascent. One suggestive difference from before is that changing any parameter by $\epsilon$ has the same result as changing any other parameter by $\epsilon$. (You can verify that.)
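As a sanity check that the $\theta$ form really is the same model, here is a minimal sketch comparing the odds computed as a product of $\phi$ factors against the odds computed as a single exponentiated sum. The parameter values and the word vector are made up for illustration.

```python
import math

# Hypothetical parameter values for M = 3 words.
theta0 = -1.0
theta = [0.7, -0.3, 1.5]
w = (1, 0, 1)

# phi parameterization: a product of factors.
phi0 = math.exp(theta0)
phi = [math.exp(t) for t in theta]
odds_product = phi0 * math.prod(p ** wi for p, wi in zip(phi, w))

# theta parameterization: one sum, then one exponentiation.
z = theta0 + sum(t * wi for t, wi in zip(theta, w))
odds_sum = math.exp(z)

assert math.isclose(odds_product, odds_sum)
```

In this form, adding $\epsilon$ to any $\theta_j$ whose word is present multiplies the odds by the same factor $e^\epsilon$, no matter which $j$ you pick.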

There is a handy function called the *logistic function*. It is $\sigma(z) = \frac{e^z}{1 + e^z}$. So we can write:

\\[
\frac{
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        +
        \sum_{j = 1}^M
        \theta_j w_j
    \right)
}
=
\sigma\left(
    \theta_0
    +
    \sum_{j = 1}^M
        \theta_j w_j
\right)
\\]

A handy fact about the logistic function (also known as the sigmoid function):

\\[
1 - \sigma(z) = \sigma(-z)
\\]

This is because $\sigma(-z)$ simplifies as follows:

\\[
\begin{align}
\frac{
    e^{-z}
}{
    1 + e^{-z}
}
&=
\frac{
    e^z e^{-z}
}{
    e^z (1 + e^{-z})
}
\\
&=
\frac{
    1
}{
    e^z + 1
}
\\
&=
1
-
\frac{
    e^z
}{
    e^z + 1
}
\\
&=
1 - \sigma(z)
\end{align}
\\]
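The identity $1 - \sigma(z) = \sigma(-z)$ can also be checked numerically. Below is a minimal sketch of the logistic function. Splitting on the sign of $z$ is my own choice here, made so that `math.exp` is never called on a large positive number (which would raise an overflow error in Python).

```python
import math

def sigmoid(z):
    """Logistic function e^z / (1 + e^z), arranged to avoid overflow."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^{-z}); e^{-z} is at most 1 here.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z is negative, so this is at most 1
    return ez / (1.0 + ez)

# Check 1 - sigma(z) == sigma(-z) at a few sample points.
for z in [-30.0, -2.0, 0.0, 0.5, 30.0]:
    assert math.isclose(1.0 - sigmoid(z), sigmoid(-z),
                        rel_tol=1e-9, abs_tol=1e-12)
```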


### The Reparameterized Likelihood Function

So here is our new problem!

\\[
\begin{align}
\log
\pprob{\theta}{\mathcal{D}}
&=
\sum_i^N
s_i
\left(
    \log
    \sigma\left(
        \theta_0
        +
        \sum_{j = 1}^M \theta_j w_{i, j}
    \right)
\right)
+
(1 - s_i)
\left(
    \log
    \sigma\left(
        -\theta_0
        -
        \sum_{j = 1}^M \theta_j w_{i, j}
    \right)
\right)
\end{align}
\\]

This is the reparameterized problem! We are ready to try to apply Gradient Ascent to it!

Before we do so, let's turn this into an error function!

\\[
E(\theta) = - \log \pprob{\theta}{\mathcal{D}}
\\]

I didn't do anything exciting here. I just flipped the sign, so now I'll do Gradient Descent instead of Gradient Ascent.

This error function is called the *cross entropy error*. Minimizing the cross entropy error is equivalent to maximizing the likelihood.
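Here is a minimal sketch of how $E(\theta)$ could be computed. The toy dataset and parameter values are invented for illustration, each datapoint is a hypothetical `(w, s)` pair, and the helper `log_sigmoid` is my own addition: it computes $\log \sigma(z)$ directly so that we never take the log of a probability that has rounded to zero.

```python
import math

def log_sigmoid(z):
    """log(sigma(z)), computed without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def cross_entropy_error(theta0, theta, dataset):
    """E(theta) = -log Pr_theta[D] for (w, s) pairs with 0/1 entries."""
    error = 0.0
    for w, s in dataset:
        z = theta0 + sum(t * wi for t, wi in zip(theta, w))
        error -= s * log_sigmoid(z) + (1 - s) * log_sigmoid(-z)
    return error

# Toy dataset: two words, three emails.
toy_dataset = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1)]

print(cross_entropy_error(0.0, [0.0, 0.0], toy_dataset))
print(cross_entropy_error(0.0, [math.log(2.0), math.log(0.5)], toy_dataset))
```

The second setting has the lower error, which is the same as saying it assigns the toy dataset a higher likelihood.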

There is a cool interpretation of the cross entropy that comes out of a field called [Information Theory][itheory], but I will leave that as a bonus notebook.

[itheory]: https://en.wikipedia.org/wiki/Information_theory