## Naive Bayes Discriminator Is A Linear Model

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

When we studied linear regression, we fit coefficients $\theta_i$ to parameterize a linear model.

I want to show you how we can turn our Naive Bayes discriminator into a linear model with an intercept term and coefficients. We'll see why this is useful very soon.

First, let's return to our standard definition:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_k = w_k \condbar \text{S} = 1}
}{
    \prob{W_k = w_k \condbar \text{S} = 0}
}
\\]


Next, let's write this in a form where the $w_i$ don't appear inside any of the probabilities.

\\[
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\left(
    \frac{
        \prob{W_1 = 1 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 1 \condbar \text{S} = 0}
    }
\right)^{w_1}
\left(
    \frac{
        \prob{W_1 = 0 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_1}
\cdots
\left(
    \frac{
        \prob{W_k = 1 \condbar \text{S} = 1}
    }{
        \prob{W_k = 1 \condbar \text{S} = 0}
    }
\right)^{w_k}
\left(
    \frac{
        \prob{W_k = 0 \condbar \text{S} = 1}
    }{
        \prob{W_k = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_k}
\\]

Remember, either $w_i = 0$, or $w_i = 1$. So for each pair of feature probability ratios, only one remains, and the other is raised to the zeroth power and becomes one.


Next, let's bring the $w_i, 1 - w_i$ terms down from the exponent by taking the log.

\\[
\begin{align}
    &
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
\\
    &+
    w_1
    \left(
        \log\prob{W_1 = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_1 = 1 \condbar \text{S} = 0}
    \right)
    +
    (1 - w_1)
    \left(
        \log\prob{W_1 = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_1 = 0 \condbar \text{S} = 0}
    \right)
\\
    &\cdots
\\
    &+
    w_k
    \left(
        \log\prob{W_k = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_k = 1 \condbar \text{S} = 0}
    \right)
    +
    (1 - w_k)
    \left(
        \log\prob{W_k = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_k = 0 \condbar \text{S} = 0}
    \right)
\end{align}
\\]


Next, let's simplify by multiplying out the $1 - w_i$ factors and collecting the terms that multiply each $w_i$.

\\[
\begin{align}
    &
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
    +
    \sum_{i = 1}^k
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
\\
    &+
    \sum_{i = 1}^k
    w_i
    \Big[
        \left(
            \log\prob{W_i = 1 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 1 \condbar \text{S} = 0}
        \right)
        -
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
    \Big]
\end{align}
\\]

Now we can define a $\theta_0$ for an intercept and $\theta_i$ for the coefficients to the $w_i$ variables.

\\[
\theta_0
=
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
    +
    \sum_{i = 1}^k
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
\\
\theta_i
=
    \left(
        \log\prob{W_i = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_i = 1 \condbar \text{S} = 0}
    \right)
    -
    \left(
        \log\prob{W_i = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_i = 0 \condbar \text{S} = 0}
    \right)
\\]


We can substitute back in. Remember that we started out with an equation for the odds of spam versus not spam, but when we took the log this became the *log odds*.

\\[
\log
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
=
\theta_0
+
\sum_{i=1}^k
\theta_i w_i
\\]
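To check the algebra numerically, here is a minimal sketch that computes the log odds two ways: directly from the Naive Bayes product, and through $\theta_0 + \sum_i \theta_i w_i$. The three-word vocabulary and all probabilities are made-up numbers, and the function names are my own, chosen only for illustration.

```python
import math

# Hypothetical Naive Bayes estimates for a 3-word vocabulary (made-up numbers).
prior_spam = 0.4                      # Pr[S = 1]
p_word_given_spam = [0.8, 0.1, 0.5]   # Pr[W_i = 1 | S = 1]
p_word_given_ham = [0.2, 0.3, 0.5]    # Pr[W_i = 1 | S = 0]


def naive_bayes_log_odds(w):
    """Log odds taken directly from the product of probability ratios."""
    log_odds = math.log(prior_spam) - math.log(1 - prior_spam)
    for w_i, p1, p0 in zip(w, p_word_given_spam, p_word_given_ham):
        if w_i == 1:
            log_odds += math.log(p1) - math.log(p0)
        else:
            log_odds += math.log(1 - p1) - math.log(1 - p0)
    return log_odds


def linear_parameters():
    """Intercept theta_0 and coefficients theta_i as defined in the text."""
    theta_0 = math.log(prior_spam) - math.log(1 - prior_spam)
    thetas = []
    for p1, p0 in zip(p_word_given_spam, p_word_given_ham):
        present = math.log(p1) - math.log(p0)          # ratio when W_i = 1
        absent = math.log(1 - p1) - math.log(1 - p0)   # ratio when W_i = 0
        theta_0 += absent
        thetas.append(present - absent)
    return theta_0, thetas


def linear_log_odds(w):
    """Log odds from the linear form theta_0 + sum_i theta_i * w_i."""
    theta_0, thetas = linear_parameters()
    return theta_0 + sum(t * w_i for t, w_i in zip(thetas, w))


w = [1, 0, 1]
print(naive_bayes_log_odds(w), linear_log_odds(w))
```

The two printed numbers should agree up to floating point round-off for any choice of 0/1 values in `w`.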


Sometimes the $\theta_i$ are called a *log odds ratio*. Let me show you why. First step:

\\[
\theta_i
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
    }{
        \prob{W_i = 1 \condbar \text{S} = 0}
    }
-
\log
    \frac{
        \prob{W_i = 0 \condbar \text{S} = 1}
    }{
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 1 \condbar \text{S} = 0}
    }{
        \prob{W_i = 0 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
\\]

I just used the fact that the difference of logs is the log of the ratio. That's the same rule I've been using all along. I used this rule twice.

Next, since I have a ratio of ratios, I can move the denominator of the top down, and bring the numerator of the bottom up. See:

\\[
\theta_i
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 1}
    }{
        \prob{W_i = 1 \condbar \text{S} = 0}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
\\]

So now I have the log of the ratio of two odds. The top is the odds that a spam email contains the word $W_i$, while the bottom is the odds that a non-spam email contains the word $W_i$.
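As a small illustration, the sketch below computes $\theta_i$ for a single word both ways: as the log of a ratio of two odds, and as the difference of two log probability ratios from the original definition. The probabilities are invented for the example, and `odds` is a helper name I'm introducing here.

```python
import math


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Hypothetical estimates for one word (made-up numbers).
p_word_given_spam = 0.8  # Pr[W_i = 1 | S = 1]
p_word_given_ham = 0.2   # Pr[W_i = 1 | S = 0]

# theta_i as the log of a ratio of two odds.
theta_as_odds_ratio = math.log(odds(p_word_given_spam) / odds(p_word_given_ham))

# theta_i as originally defined: a difference of two log probability ratios.
theta_as_defined = (
    (math.log(p_word_given_spam) - math.log(p_word_given_ham))
    - (math.log(1 - p_word_given_spam) - math.log(1 - p_word_given_ham))
)

print(theta_as_odds_ratio, theta_as_defined)
```

With these numbers the odds are 4 for spam and 0.25 for non-spam, so both expressions come out to $\log 16$.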

### Are these the best choice of $\theta_i$?

We see that Naive Bayes suggests a way to define a linear function to estimate the log odds that an email is spam.

The Naive Bayes classifier assumes that features are conditionally independent given the class, but we know that isn't strictly true. The further the data is from satisfying this assumption, the worse the $\theta_0, \theta_i$ specified by Naive Bayes will do at predicting the log odds.

For example, suppose $w_{123}$ is *always* present in an email when $w_{456}$ is, and vice versa. Then if we set

\\[
\theta_{123}
=
\log
    \frac{
        \prob{W_{123} = 1 \condbar \text{S} = 1}
        \big/
        \prob{W_{123} = 0 \condbar \text{S} = 1}
    }{
        \prob{W_{123} = 1 \condbar \text{S} = 0}
        \big/
        \prob{W_{123} = 0 \condbar \text{S} = 0}
    }
\\]

then we should set

\\[
\theta_{456} = 0.0
\\]

The reason is that the feature $w_{456}$ adds no new information beyond $w_{123}$. But Naive Bayes won't know to do this, because it is assuming $w_{456}$ is independent of $w_{123}$ given the class. In fact, the exact opposite is true: the presence of one depends *entirely* on the presence of the other.

Basically, Naive Bayes fails insofar as the occurrence of words is not conditionally independent given the class spam/ham.
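Here is a small sketch of what goes wrong when two features are perfect copies of each other. All numbers are invented: a 50/50 prior, and a word that appears in 80% of spam and 20% of non-spam. The correct posterior counts the word's evidence once, while Naive Bayes counts it once per copy.

```python
import math


def logistic(z):
    """Turn log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical numbers for illustration only.
prior_spam = 0.5      # Pr[S = 1]
p_given_spam = 0.8    # Pr[W = 1 | S = 1], shared by both copies of the word
p_given_ham = 0.2     # Pr[W = 1 | S = 0], shared by both copies of the word

prior_log_odds = math.log(prior_spam / (1 - prior_spam))
evidence = math.log(p_given_spam / p_given_ham)  # log ratio when the word is present

# Both copies are present. The second copy carries no new information,
# so the correct log odds count the evidence once.
true_log_odds = prior_log_odds + evidence

# Naive Bayes treats the copies as independent and counts the evidence twice.
naive_bayes_log_odds = prior_log_odds + 2 * evidence

true_prob = logistic(true_log_odds)
naive_bayes_prob = logistic(naive_bayes_log_odds)
print(true_prob, naive_bayes_prob)
```

With these numbers the correct probability of spam is 0.8, while the double-counted Naive Bayes estimate is $16/17 \approx 0.94$, so the model is overconfident.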

### Toward An Error Function

This suggests that we might try to pick the $\theta_0, \theta_i$ more freely, without assumption. But if we do that, we need to define an error function so that we can have a criterion that defines a "best" choice of $\hat\theta_0, \hat\theta_i$. What could that be?

Well, to start, it makes sense that if in our training dataset we have a spam email, we want:

\\[
\theta_0 + \sum w_i \theta_i
\\]

to be as large as possible. That's because this number is supposed to be the log odds:

\\[
\log
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
\\]

The higher the log odds are, the more likely the model thinks this email is spam, which is the correct answer.

Likewise, if the email weren't spam, we would want the estimated log odds to be as small as possible.

### Log Odds to Probability

These log odds are making my head hurt. More simply, I want $\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}$ to be as close to 1.0 as possible when the email really is spam, and as close to 0.0 as possible when the email isn't.

So, as a first step, let's turn log odds into probability. The key is to remember how to convert odds to probability: $\frac{\text{odds}}{1 + \text{odds}}$. So let's call 

\\[
z := \theta_0 + \sum w_i \theta_i
\\]

Since $z$ is the log odds (not a log odds *ratio*), we can undo the log by exponentiating:

\\[
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
=
e^z
\\]

Then

\\[
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
=
\frac{
e^z
}{
1 + e^z
}
=
\frac{
1
}{
1 + e^{-z}
}
\\]

I like the form $\frac{e^z}{1 + e^z}$. However, you will often see the equivalent form $\frac{1}{e^{-z} + 1}$. Because of some subtleties of round-off error, this second way is better for computers.

The function $f(z) = \frac{e^z}{1 + e^z} = \frac{1}{e^{-z} + 1}$ is called the *logistic function*. It's the function for turning log odds into a probability.
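Below is a minimal sketch of the logistic function in plain Python. It is one common way to sidestep overflow, not the only one: it picks whichever of the two equivalent forms keeps the argument of `math.exp` non-positive. The branching is my own choice for this example. In plain Python, `math.exp` raises `OverflowError` for very large arguments, so either single formula can fail on one side.

```python
import math


def logistic(z):
    """Turn log odds z into a probability, avoiding overflow in exp."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is below 1, so this form is the safe one.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in [-1000.0, -2.0, 0.0, 2.0, 1000.0]:
    print(z, logistic(z))
```

Log odds of zero map to a probability of one half, and `logistic(z) + logistic(-z)` is 1 for any `z`, which matches the idea that the spam and non-spam probabilities must sum to one.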

### Sum of Squared Errors For Probabilities (not ideal)

So we now know that

\\[
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
=
\frac{1}{1 + e^{-z}}
=
\frac{1}{1 + e^{
    -\left(\theta_0 + \sum \theta_i w_i\right)
}}
\\]

As discussed, if the email in question truly is spam, we want this to be as close to 1.0 as possible. On the other hand, we want it to be close to 0.0 if the email is not spam.

We're now in a similar spot to when we did linear regression. We know what we want our output to be close to, but we don't know exactly how to sum up our errors.

One (not very good) way to make an error function would be:

\\[
E(\theta)
=
\sum_{i = 1}^N
    \left(
        \frac{1}{1 + e^{-z_i}}
        -
        y_i
    \right)
    ^2
\\]

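That is, use the sum of squared errors loss on the probabilities calculated by the model. Here $z_i$ is the log odds the model computes for the $i$th training email, and $y_i$ is 1 if that email is spam and 0 if it isn't.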
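For concreteness, here is a sketch of this squared-error function on a tiny made-up dataset with two binary features. The data, the parameter values, and the function names are all assumptions for illustration.

```python
import math


def logistic(z):
    """Turn log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta_0, thetas, x):
    """Model probability that the example with features x is spam."""
    z = theta_0 + sum(t * x_j for t, x_j in zip(thetas, x))
    return logistic(z)


def sum_squared_error(theta_0, thetas, xs, ys):
    """Sum over examples of (predicted probability - label)^2."""
    return sum((predict(theta_0, thetas, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical training set: each row is (w_1, w_2); label 1 means spam.
xs = [[1, 0], [1, 1], [0, 1], [0, 0]]
ys = [1, 1, 0, 0]

print(sum_squared_error(0.0, [0.0, 0.0], xs, ys))   # every prediction is 0.5
print(sum_squared_error(-1.0, [2.0, 0.0], xs, ys))  # leans on the first word
```

With all parameters at zero, every prediction is 0.5 and each example contributes 0.25 to the error. The second parameter choice pushes each prediction toward its label, so its error is smaller.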

But there is a better way.

### Cross Entropy Error

Here is the primary idea. A good choice of $\theta$ should make the observed dataset *likely*. This is basically synonymous with saying: choose the $\theta$ that explains the observed dataset best, leaving the least possible to random chance.

This principle is called the *maximum likelihood principle*. Let's apply it to our classification problem. We want to maximize

\\[
\prod_{i = 1}^N \pprob{\theta}{Y = y^i \condbar X = x^i}
\\]

I'm slightly shifting my notation. $y^i$ means the $i$th label; I'm not taking the "power" of $y$. $Y$ is the variable "spam or not spam."

$x^i$ is the vector of observed predictor values for the $i$th example. We can use a subscript to number each of these features. The model's probability that example $i$ is spam is then:

\\[
\pprob{\theta}{Y = 1 \condbar X = x^i}
=
\frac{
    1
}{
    1 + \exp \left(
        -\theta_0 - \sum_j \theta_j x^i_j
    \right)
}
\\]

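Here $x^i_j$ is 1 if feature $j$ is present in training example number $i$, and 0 otherwise.

Okay, let's use my fancy exponentiation trick again. Since $y^i$ is either 0 or 1, only one of the two factors for each example survives: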

\\[
\pprob{\theta}{\mathcal{D}}
=
\prod_{i = 1}^N \pprob{\theta}{Y = y^i \condbar X = x^i}
=
\prod_{i = 1}^N
    \left(
        \pprob{\theta}{Y = 1 \condbar X = x^i}
    \right)
    ^{y^i}
    \cdot
    \left(
        1 - \pprob{\theta}{Y = 1 \condbar X = x^i}
    \right)
    ^{1 - y^i}
\\]

This is the *likelihood* of the dataset for the model parameterized by $\theta$. I denoted it by $\pprob{\theta}{\mathcal{D}}$. $\mathcal{D}$ is supposed to mean "the dataset."

Okay, it is very typical for us to turn products of probabilities into sums of log probabilities. This is very practical, because multiplying many probabilities gives a result that is very close to zero. Floating point numbers have a smallest representable magnitude, so a long enough product of probabilities underflows to exactly zero.

Thus, a common trick in computing is to work with sums of log probabilities. The log of a number close to zero is a large negative number; computers are happy with this. The log of a number close to 1.0 is still close to zero, but this is being *added*, not *multiplied*. So that is also good.
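A quick sketch of the problem, using 200 made-up probabilities that are all equal to 0.01. The true product is $10^{-400}$, which is smaller than any positive double-precision number, while the sum of logs is an ordinary negative number.

```python
import math

# Hypothetical per-example probabilities (made-up numbers).
probs = [0.01] * 200

# Multiplying directly: the running product eventually underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: the result is a moderately sized negative number.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Once the product has underflowed to zero, every choice of parameters that is this unlikely looks equally bad, so there is nothing left to compare. The log sum keeps that information.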

Therefore:

\\[
\begin{align}
\log \pprob{\theta}{\mathcal{D}}
&=
\log
\prod_{i = 1}^N
    \left(
        \pprob{\theta}{Y = 1 \condbar X = x^i}
    \right)
    ^{y^i}
    \cdot
    \left(
        1 - \pprob{\theta}{Y = 1 \condbar X = x^i}
    \right)
    ^{1 - y^i}
\\
&=
\sum_{i = 1}^N
    y^i
    \log \left( \pprob{\theta}{Y = 1 \condbar X = x^i} \right)
    +
    (1 - y^i)
    \log \left( 1 - \pprob{\theta}{Y = 1 \condbar X = x^i} \right)
\end{align}
\\]

Because this is a log probability, we want it to be as large (close to zero) as possible. Or we can flip things around:

\\[
-\log \pprob{\theta}{\mathcal{D}}
=
\sum_{i = 1}^N
    y^i
    \left(
        -\log \left( \pprob{\theta}{Y = 1 \condbar X = x^i} \right)
    \right)
    +
    (1 - y^i)
    \left(
        -\log \left( 1 - \pprob{\theta}{Y = 1 \condbar X = x^i} \right)
    \right)
\\]

Because this is the negative of a log probability, we want it to be as *small* as possible. We want it to be close to zero. That means it is suitable as an error function to minimize.

This is the *cross-entropy error*. In the next notebook, we'll train a model to minimize this error by using gradient descent.
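As a preview, here is a minimal sketch of the cross-entropy error on a tiny made-up dataset. The data, parameter values, and function names are assumptions for illustration. The sketch does no training, and it does not guard against a predicted probability of exactly 0 or 1, where `math.log` would fail.

```python
import math


def logistic(z):
    """Turn log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(theta_0, thetas, xs, ys):
    """Negative log likelihood of the labels ys under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        # Model probability that this example is spam.
        p = logistic(theta_0 + sum(t * x_j for t, x_j in zip(thetas, x)))
        # Only one of the two terms is nonzero, because y is 0 or 1.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Hypothetical training set: each row is (x_1, x_2); label 1 means spam.
xs = [[1, 0], [1, 1], [0, 1], [0, 0]]
ys = [1, 1, 0, 0]

print(cross_entropy(0.0, [0.0, 0.0], xs, ys))   # every prediction is 0.5
print(cross_entropy(-1.0, [2.0, 0.0], xs, ys))  # leans on the first feature
```

With all parameters at zero, each example contributes $\log 2$ to the error. The second parameter choice assigns a higher probability to every correct label, so its error is lower.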