### Linear Classification

In the previous notebook, I said that we want to find a better model than Naive Bayes.

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

Here is our Naive Bayes model equation:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\]

I want to rewrite this using the $\phi$ variables I defined earlier:

\\[
\phi_i
=
\frac{
    \prob{W_i = 1 \condbar \text{S} = 1}
}{
    \prob{W_i = 1 \condbar \text{S} = 0}
}
\\
\phi'_i
=
\frac{
    \prob{W_i = 0 \condbar \text{S} = 1}
}{
    \prob{W_i = 0 \condbar \text{S} = 0}
}
\\
\phi_0
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\\]

To do that, I'm going to use a little trick:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\
&=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\left(
    \frac{
        \prob{W_1 = 1 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 1 \condbar \text{S} = 0}
    }
\right)^{w_1}
\left(
    \frac{
        \prob{W_1 = 0 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_1}
\\
&\quad
\cdots
\left(
    \frac{
        \prob{W_M = 1 \condbar \text{S} = 1}
    }{
        \prob{W_M = 1 \condbar \text{S} = 0}
    }
\right)^{w_M}
\left(
    \frac{
        \prob{W_M = 0 \condbar \text{S} = 1}
    }{
        \prob{W_M = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_M}
\end{align}
\\]

See the trick I did? For each word, there is now a *pair* of factors. One is the feature probability ratio for when the word is present, and the other is for when the word is absent. Because of the exponents, if $w_i = 1$, we raise the presence factor to the first power (it stays the same) and the absence factor to the zeroth power (it becomes one). If $w_i = 0$, the roles swap.

I use this trick because it lets me get the $w_i$ *outside* the probabilities. In particular, it lets me write:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\phi_1^{w_1}
\left(\phi_1^\prime\right)^{1 - w_1}
\cdots
\phi_M^{w_M}
\left(\phi_M^\prime\right)^{1 - w_M}
\end{align}
\\]
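To convince myself that the exponent trick really selects the right factor, here is a minimal Python sketch. The function names and the numeric ratios are made up for illustration; they are not values from any real dataset.

```python
def odds_direct(phi0, phi, phi_prime, words):
    """Pick the presence or absence ratio for each word with an if/else."""
    odds = phi0
    for ratio_present, ratio_absent, w in zip(phi, phi_prime, words):
        odds *= ratio_present if w == 1 else ratio_absent
    return odds


def odds_exponent_trick(phi0, phi, phi_prime, words):
    """Same computation, but select the factor with exponents w and 1 - w."""
    odds = phi0
    for ratio_present, ratio_absent, w in zip(phi, phi_prime, words):
        odds *= ratio_present ** w * ratio_absent ** (1 - w)
    return odds


# Made-up ratios for a three-word vocabulary (illustrative values only).
phi0 = 0.5
phi = [4.0, 0.25, 2.0]        # ratios used when a word is present
phi_prime = [0.5, 2.0, 1.0]   # ratios used when a word is absent
words = [1, 0, 1]             # word 1 present, word 2 absent, word 3 present

print(odds_direct(phi0, phi, phi_prime, words))
print(odds_exponent_trick(phi0, phi, phi_prime, words))
```

Both functions multiply the same factors, so they should agree for any 0/1 word vector.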

### Finding A Better $\phi$

The equation above tells me how to compute the odds assuming that the words are conditionally independent given the email's class (spam or not spam).

I want to explore other possible settings of $\phi$. I want to try to find a better choice. From here on out, we should not assume that $\phi$ is set like Naive Bayes says.

To make clear that the odds I calculate will depend on the $\phi$ I choose, I will write:

\\[
\begin{align}
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\phi_1^{w_1}
\left(\phi_1^\prime\right)^{1 - w_1}
\cdots
\phi_M^{w_M}
\left(\phi_M^\prime\right)^{1 - w_M}
=
\phi_0
\prod_{i = 1}^{M}
\phi_i^{w_i}
\left(\phi_i^\prime\right)^{1 - w_i}
\end{align}
\\]

The $\phi$ is a subscript of $\Pr$ to show that the calculated probability depends on our choice of $\phi$.

We have an equation to compute the odds. Since the odds are $p / (1 - p)$, we can turn them back into a probability with $p = \text{odds} / (1 + \text{odds})$:

\\[
\pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
=
\frac{
    \phi_0
    \prod_{i = 1}^{M}
    \phi_i^{w_i}
    \left(\phi_i^\prime\right)^{1 - w_i}
}{
    1 + \left(
        \phi_0
        \prod_{i = 1}^{M}
        \phi_i^{w_i}
        \left(\phi_i^\prime\right)^{1 - w_i}
    \right)
}
\\
\pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
=
\frac{
    1
}{
    1 + \left(
        \phi_0
        \prod_{i = 1}^{M}
        \phi_i^{w_i}
        \left(\phi_i^\prime\right)^{1 - w_i}
    \right)
}
\\]

Great! Now we have an equation for the probability that an email is spam (or not spam) given the words that appear in the email.
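The conversion from odds to probabilities can be sketched in a few lines of Python. The function names are my own, and the odds value is just an example number.

```python
def spam_probability(odds):
    """Convert odds p / (1 - p) back into the probability p."""
    return odds / (1.0 + odds)


def not_spam_probability(odds):
    """Probability of the other class: 1 / (1 + odds)."""
    return 1.0 / (1.0 + odds)


odds = 8.0  # example value: spam is eight times as likely as not spam
p_spam = spam_probability(odds)
p_not_spam = not_spam_probability(odds)

print(p_spam, p_not_spam)
```

The two probabilities should always add up to one, and odds of exactly 1 should give a probability of one half.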

Remember our criterion for choosing the best $\phi$: we want to maximize the likelihood of the dataset. The likelihood is:

\\[
\pprob{\phi}{\mathcal{D}}
=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\pprob{\phi}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\]

I'm going to use the same trick as before:

\\[
\begin{align}
\pprob{\phi}{\mathcal{D}}
&=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\left(
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{s_i}
\left(
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{1 - s_i}
\\
&=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^{M}
        \phi_j^{w_{i, j}}
        \left(\phi_j^\prime\right)^{1 - w_{i, j}}
    }{
        1 + \left(
            \phi_0
            \prod_{j = 1}^{M}
            \phi_j^{w_{i, j}}
            \left(\phi_j^\prime\right)^{1 - w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1 + \left(
            \phi_0
            \prod_{j = 1}^{M}
            \phi_j^{w_{i, j}}
            \left(\phi_j^\prime\right)^{1 - w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]

Here we are! This is the formula that I want to *maximize*. I want to choose the $\phi$ values that make it as large as possible.
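Here is a minimal sketch of that likelihood in Python, assuming the dataset is a list of `(words, s)` pairs where `words` is a list of 0/1 values and `s` is 1 for spam. The tiny dataset and the $\phi$ values are invented purely for illustration.

```python
def odds(phi0, phi, phi_prime, words):
    """Odds of spam for one email, using the exponent trick."""
    result = phi0
    for ratio_present, ratio_absent, w in zip(phi, phi_prime, words):
        result *= ratio_present ** w * ratio_absent ** (1 - w)
    return result


def likelihood(phi0, phi, phi_prime, dataset):
    """Product over emails of the probability assigned to the true label."""
    total = 1.0
    for words, s in dataset:
        o = odds(phi0, phi, phi_prime, words)
        p_spam = o / (1.0 + o)
        total *= p_spam ** s * (1.0 - p_spam) ** (1 - s)
    return total


# Two made-up emails over a two-word vocabulary.
dataset = [([1, 0], 1), ([0, 1], 0)]

# Two candidate settings of phi (illustrative values only).
good = likelihood(1.0, [3.0, 1.0], [1.0, 1.0], dataset)
bad = likelihood(1.0, [1.0 / 3.0, 1.0], [1.0, 1.0], dataset)

print(good, bad)
```

The first setting says word 1 makes spam more likely, which matches the spam email containing it, so it earns the higher likelihood.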

### Products Are Hard To Optimize

Right now, we could turn Gradient *Ascent* loose on the likelihood function above and try to find the $\phi$ values that do the best job. In theory that should work, but in practice Gradient Ascent will do a bad job.

Here is why. Consider a sum of numbers: $\sum_i n_i$. Say I decide to tweak the number $n_1$ by decreasing its value by $\epsilon$. How much does that change the sum? It changes it by $\epsilon$, no matter what the numbers are. Therefore, I can choose $\epsilon$ to be small, and know that small changes to the $n_i$ values will produce small changes in the sum.

On the other hand, consider a product: $\prod_i n_i$. Say I decide to tweak $n_1$ by decreasing its value by $\epsilon$. How does that change the product?

It can be much more extreme. The product gets multiplied by $\left(1 - \frac{\epsilon}{n_1}\right)$, so the effect depends on $\left|\frac{\epsilon}{n_1}\right|$. If that is small, then the change to $n_1$ produces only a small percentage change in the overall product.

However, if $\left|\frac{\epsilon}{n_1}\right|$ is large, then the percentage change to the product will be very great. And here is the problem: no $\epsilon$ value can be considered truly "small," because it's entirely relative to the size of $n_1$. $\epsilon$ can be small, but if $n_1$ is smaller still, then this is actually a *relatively* large value of $\epsilon$.

An $\epsilon$ that is huge for $n_1$ might actually be very small for $n_2$. So there is no consistent notion of what is a "small step."

Gradient Ascent (and Descent) rely on making small changes to the parameters $\phi_0, \phi_i, \phi^\prime_i$. But I'm telling you that for a product, there is no consistent definition of a small step.
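A quick numeric sketch makes the contrast concrete. The numbers and helper names below are made up; the point is only to compare how the same $\epsilon$ affects a sum and a product.

```python
import math


def tweak(numbers, index, epsilon):
    """Return a copy of numbers with one entry decreased by epsilon."""
    tweaked = list(numbers)
    tweaked[index] -= epsilon
    return tweaked


def sum_change(numbers, index, epsilon):
    """Absolute change in the sum after the tweak."""
    return sum(numbers) - sum(tweak(numbers, index, epsilon))


def relative_product_change(numbers, index, epsilon):
    """Fraction of the product that is lost after the tweak."""
    return 1.0 - math.prod(tweak(numbers, index, epsilon)) / math.prod(numbers)


numbers = [0.001, 5.0, 200.0]  # illustrative values of very different sizes
epsilon = 0.0005

for index in range(len(numbers)):
    print(
        index,
        sum_change(numbers, index, epsilon),
        relative_product_change(numbers, index, epsilon),
    )
```

The sum moves by about $\epsilon$ every time. The product loses roughly half its value when the tiny first number is tweaked, and almost nothing when the large last number is tweaked.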

### Turn A Product Into A Sum

Since sums are easy to optimize using Gradient Ascent/Descent, and products are hard, the natural choice is to turn a product into a sum! The way to do this is to use the logarithm function.

Let's start with the initial odds function. Let's turn that into a *log odds* function. Since the odds was a product, the log odds will be a sum.

\\[
\begin{align}
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\phi_0
\prod_{i = 1}^{M}
\phi_i^{w_i}
\left(\phi_i^\prime\right)^{1 - w_i}
\\
\log
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\log\left(
    \phi_0
    \prod_{i = 1}^{M}
    \phi_i^{w_i}
    \left(\phi_i^\prime\right)^{1 - w_i}
\right)
\\
&=
\log\phi_0
+
\sum_{i = 1}^M
\left(
    w_i \log\phi_i
    +
    (1 - w_i) \log\phi_i^\prime
\right)
\\
&=
\left(
    \log\phi_0 + \sum_{i = 1}^M \log\phi_i^\prime
\right)
+
\sum_{i = 1}^M
    w_i (\log\phi_i - \log\phi_i^\prime)
\end{align}
\\]


This shows that the log odds is a linear function of the $w_i$. In fact, let me replace the $\phi_0, \phi_i, \phi_i^\prime$ with some new parameters:

\\[
\begin{align}
\theta_0
&=
\log\phi_0 + \sum_{i = 1}^M \log\phi_i^\prime
\\
\theta_i
&=
\log\phi_i - \log\phi_i^\prime
\end{align}
\\]

With these definitions:

\\[
\log
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\theta_0
+ \sum_{i=1}^M w_i \theta_i
\\]
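As a sanity check, the sketch below computes $\theta$ from some made-up $\phi$ values and confirms that the linear form gives the same log odds as the product form for every possible email. The function names and numbers are my own illustration.

```python
import itertools
import math


def theta_from_phi(phi0, phi, phi_prime):
    """Apply the definitions of theta_0 and theta_i from the text."""
    theta0 = math.log(phi0) + sum(math.log(q) for q in phi_prime)
    theta = [math.log(p) - math.log(q) for p, q in zip(phi, phi_prime)]
    return theta0, theta


def log_odds_from_phi(phi0, phi, phi_prime, words):
    """Log of the product form of the odds."""
    odds = phi0
    for p, q, w in zip(phi, phi_prime, words):
        odds *= p ** w * q ** (1 - w)
    return math.log(odds)


def log_odds_from_theta(theta0, theta, words):
    """Linear form: theta_0 plus the theta_i of the words that are present."""
    return theta0 + sum(w * t for w, t in zip(words, theta))


# Made-up ratios for a three-word vocabulary (illustrative values only).
phi0, phi, phi_prime = 0.5, [4.0, 0.25, 2.0], [0.5, 2.0, 1.0]
theta0, theta = theta_from_phi(phi0, phi, phi_prime)

# Largest disagreement between the two forms over all 8 possible emails.
max_gap = max(
    abs(
        log_odds_from_phi(phi0, phi, phi_prime, words)
        - log_odds_from_theta(theta0, theta, words)
    )
    for words in itertools.product([0, 1], repeat=3)
)
print(max_gap)
```

Any gap that shows up should be on the order of floating-point rounding error.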

I will now forget about the $\phi$ values entirely. I will instead just try to pick the best $\theta$. For any setting of $\theta$, there is a setting of $\phi$ that corresponds to that $\theta$. So thinking of this problem as *parameterized* by $\theta$ really doesn't change anything. This is called *reparameterization*.
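To see that every $\theta$ has a matching $\phi$, here is one possible construction, sketched in Python. Fixing every $\phi_i^\prime$ at 1 is my own choice for the illustration; many other $\phi$ settings map to the same $\theta$.

```python
import math


def phi_from_theta(theta0, theta):
    """One of many phi settings that reproduce theta (assumes phi'_i = 1)."""
    phi0 = math.exp(theta0)
    phi = [math.exp(t) for t in theta]
    phi_prime = [1.0] * len(theta)
    return phi0, phi, phi_prime


def theta_from_phi(phi0, phi, phi_prime):
    """The definitions of theta_0 and theta_i from the text."""
    theta0 = math.log(phi0) + sum(math.log(q) for q in phi_prime)
    theta = [math.log(p) - math.log(q) for p, q in zip(phi, phi_prime)]
    return theta0, theta


# An arbitrary theta (illustrative values only).
theta0, theta = -1.2, [0.7, -2.0, 0.0]

# Map theta to a phi, then back again.
recovered_theta0, recovered_theta = theta_from_phi(*phi_from_theta(theta0, theta))

print(recovered_theta0, recovered_theta)
```

The round trip should return the original $\theta$, up to floating-point rounding.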


### The Reparameterized Problem

I now want to choose $\theta$ to maximize:

\\[
\pprob{\theta}{\mathcal{D}}
=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\pprob{\theta}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\]

By definition,

\\[
\log
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\theta_0
+ \sum_{i=1}^M w_i \theta_i
\\]

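It's a little tricky to write the likelihood function in terms of these log odds. Let me instead make a note: for a single email, maximizing the probability of its label is the same as maximizing the odds of that label, because the odds $p / (1 - p)$ grow whenever $p$ grows. So let's try to maximize the product of those odds: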

\\[
\begin{align}
&\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\frac{
    \pprob{\theta}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}{
    \pprob{\theta}{\text{S} = 1 - s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}
&
\\
&=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\left(
    \frac{
        \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
    }{
        \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
    }
\right)^{s_i}
\left(
    \frac{
        \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
    }{
        \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
    }
\right)^{1 - s_i}
&
\end{align}
\\]


And you remember how I hate optimizing products, right? Let's take the log of this formula:

\\[
\begin{align}
&
\log\left(
    \prod_{
        ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
    }
    \left(
        \frac{
            \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
        }{
            \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
        }
    \right)^{s_i}
    \left(
        \frac{
            \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
        }{
            \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
        }
    \right)^{1 - s_i}
\right)
&
\\
&=
\sum_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\left(
s_i
\log
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}
+
(1 - s_i)
\log
\frac{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
}
\right)
&
\end{align}
\\]

### Likelihood In Terms of $\theta$

Now, let's not lose sight of our goal. It is to maximize:

\\[
\pprob{\theta}{\mathcal{D}}
=
\prod_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\pprob{\theta}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\]

I've changed how I parameterize this problem, but I haven't really fundamentally changed anything. My goal remains the same.

I need to write this likelihood function directly in terms of $\theta$ so I can try to optimize it. Before I do that, let me change this into a sum by considering the *log likelihood*:

\\[
\begin{align}
\log \pprob{\theta}{\mathcal{D}}
&=
\log \left(
    \prod_{
        ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
    }
    \pprob{\theta}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)
\\
&=
\sum_{
    ((w_{i, 1}, \ldots, w_{i, M}), s_i) \in \mathcal{D}
}
\log \pprob{\theta}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\end{align}
\\]
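The step from the log of a product to a sum of logs is easy to check numerically. In this sketch, `probabilities` stands for the probability the model assigns to each email's true label; the three values are made up.

```python
import math


def log_likelihood(probabilities):
    """Sum of the logs of the per-email probabilities."""
    return sum(math.log(p) for p in probabilities)


# Made-up probabilities assigned to the true labels of three emails.
probabilities = [0.75, 0.5, 0.9]

as_sum = log_likelihood(probabilities)
as_log_of_product = math.log(math.prod(probabilities))

print(as_sum, as_log_of_product)
```

The two numbers should match up to rounding, and the sum form is the one that Gradient Ascent handles well.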



Let's see how to turn a log odds back into a probability.

\\[
\begin{align}
\log
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\theta_0
+ \sum_{i=1}^M w_i \theta_i
\\
\exp\left(
    \log
    \frac{
        \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }{
        \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }
\right)
&=
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \left(1 - \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}\right)
}
&=
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\left(
    1 - \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
\right)
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
\left(
    1
    +
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
\right)
&=
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
}
\end{align}
\\]
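Putting the last formula into code gives a sketch like the one below. The function names, the tiny dataset, and the $\theta$ values are all assumptions for illustration, and the `log_likelihood` helper simply plugs this probability into the log likelihood sum from earlier. Note that `math.exp` overflows for very large inputs, so this direct form is only meant for moderate log odds.

```python
import math


def spam_probability(theta0, theta, words):
    """exp(z) / (1 + exp(z)), where z is the linear log odds."""
    z = theta0 + sum(w * t for w, t in zip(words, theta))
    return math.exp(z) / (1.0 + math.exp(z))


def log_likelihood(theta0, theta, dataset):
    """Sum over emails of the log probability of the true label."""
    total = 0.0
    for words, s in dataset:
        p_spam = spam_probability(theta0, theta, words)
        total += math.log(p_spam if s == 1 else 1.0 - p_spam)
    return total


# Two made-up emails over a two-word vocabulary; s = 1 means spam.
dataset = [([1, 0], 1), ([0, 1], 0)]

# Illustrative parameters: word 1 multiplies the odds of spam by 3.
theta0, theta = 0.0, [math.log(3.0), 0.0]

p_first = spam_probability(theta0, theta, [1, 0])
total = log_likelihood(theta0, theta, dataset)

print(p_first, total)
```

With all parameters at zero the log odds are zero, so every email gets a spam probability of one half.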
