## Naive Bayes Discriminator Is A Linear Model

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\\]

When we studied linear regression, we fit coefficients $\theta_i$ that parameterize a linear model.

I want to show you how we can turn our Naive Bayes discriminator into a linear model with an intercept term and coefficients. We'll see why this is useful very soon.

First, let's return to our standard expression for the odds of spam given the words in an email:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_k = w_k \condbar \text{S} = 1}
}{
    \prob{W_k = w_k \condbar \text{S} = 0}
}
\\]


Next, let's write this in a form where the $w_i$ don't appear inside any of the probabilities.

\\[
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\left(
    \frac{
        \prob{W_1 = 1 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 1 \condbar \text{S} = 0}
    }
\right)^{w_1}
\left(
    \frac{
        \prob{W_1 = 0 \condbar \text{S} = 1}
    }{
        \prob{W_1 = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_1}
\cdots
\left(
    \frac{
        \prob{W_k = 1 \condbar \text{S} = 1}
    }{
        \prob{W_k = 1 \condbar \text{S} = 0}
    }
\right)^{w_k}
\left(
    \frac{
        \prob{W_k = 0 \condbar \text{S} = 1}
    }{
        \prob{W_k = 0 \condbar \text{S} = 0}
    }
\right)^{1 - w_k}
\\]

Remember, each $w_i$ is either $0$ or $1$. So in each pair of feature probability ratios, one is raised to the first power and remains, while the other is raised to the zeroth power and becomes one.
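Here is a minimal sketch of the exponent trick for a single binary feature. The function names and the example probabilities are made up for illustration; `p1_spam` stands for $\prob{W_i = 1 \condbar \text{S} = 1}$ and `p1_ham` for $\prob{W_i = 1 \condbar \text{S} = 0}$.

```python
import math

def ratio_direct(w, p1_spam, p1_ham):
    """Pr[W = w | S = 1] / Pr[W = w | S = 0], picking the case by hand."""
    if w == 1:
        return p1_spam / p1_ham
    return (1 - p1_spam) / (1 - p1_ham)

def ratio_exponent(w, p1_spam, p1_ham):
    """Same ratio, but w only appears as an exponent."""
    present = p1_spam / p1_ham               # ratio when the word is present
    absent = (1 - p1_spam) / (1 - p1_ham)    # ratio when the word is absent
    return present ** w * absent ** (1 - w)

# Hypothetical probabilities, chosen only for this example.
for w in (0, 1):
    assert math.isclose(ratio_direct(w, 0.8, 0.2), ratio_exponent(w, 0.8, 0.2))
```

Both functions agree for $w = 0$ and $w = 1$, which is all we need since the features are binary.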


Next, let's bring the $w_i, 1 - w_i$ terms down from the exponent by taking the log.

\\[
\begin{align}
    &
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
\\
    &+
    w_1
    \left(
        \log\prob{W_1 = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_1 = 1 \condbar \text{S} = 0}
    \right)
    +
    (1 - w_1)
    \left(
        \log\prob{W_1 = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_1 = 0 \condbar \text{S} = 0}
    \right)
\\
    &\cdots
\\
    &+
    w_k
    \left(
        \log\prob{W_k = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_k = 1 \condbar \text{S} = 0}
    \right)
    +
    (1 - w_k)
    \left(
        \log\prob{W_k = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_k = 0 \condbar \text{S} = 0}
    \right)
\end{align}
\\]


Next, let's simplify by multiplying out the $1 - w_i$ factors and collecting the terms that don't depend on any $w_i$.

\\[
\begin{align}
    &
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
    +
    \sum_{i = 1}^k
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
\\
    &+
    \sum_{i = 1}^k
    w_i
    \Big[
        \left(
            \log\prob{W_i = 1 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 1 \condbar \text{S} = 0}
        \right)
        -
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
    \Big]
\end{align}
\\]

Now we can define a $\theta_0$ for an intercept and $\theta_i$ for the coefficients to the $w_i$ variables.

\\[
\theta_0
=
    \left(
        \log\prob{\text{S} = 1}
        -
        \log\prob{\text{S} = 0}
    \right)
    +
    \sum_{i = 1}^k
        \left(
            \log\prob{W_i = 0 \condbar \text{S} = 1}
            -
            \log\prob{W_i = 0 \condbar \text{S} = 0}
        \right)
\\
\theta_i
=
    \left(
        \log\prob{W_i = 1 \condbar \text{S} = 1}
        -
        \log\prob{W_i = 1 \condbar \text{S} = 0}
    \right)
    -
    \left(
        \log\prob{W_i = 0 \condbar \text{S} = 1}
        -
        \log\prob{W_i = 0 \condbar \text{S} = 0}
    \right)
\\]
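As a rough sketch, these two definitions translate into a few lines of Python. The function name `naive_bayes_to_linear` and its arguments are my own for this example: `prior_spam` is $\prob{\text{S} = 1}$, and the two lists hold $\prob{W_i = 1 \condbar \text{S} = 1}$ and $\prob{W_i = 1 \condbar \text{S} = 0}$ for each word. It assumes every probability is strictly between 0 and 1, so none of the logs blow up.

```python
import math

def naive_bayes_to_linear(prior_spam, p_word_given_spam, p_word_given_ham):
    """Return (theta_0, [theta_1, ..., theta_k]) for binary word features."""
    # Start the intercept with the log prior odds.
    theta_0 = math.log(prior_spam) - math.log(1 - prior_spam)
    thetas = []
    for p1, p0 in zip(p_word_given_spam, p_word_given_ham):
        present = math.log(p1) - math.log(p0)          # W_i = 1 term
        absent = math.log(1 - p1) - math.log(1 - p0)   # W_i = 0 term
        theta_0 += absent                  # absent terms go to the intercept
        thetas.append(present - absent)    # coefficient on w_i
    return theta_0, thetas

# Hypothetical numbers, chosen only for this example.
theta_0, thetas = naive_bayes_to_linear(0.4, [0.8, 0.1], [0.2, 0.3])
print(theta_0, thetas)
```

A word that is equally likely in spam and non-spam gets a coefficient of zero, since both of its log differences vanish.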


We can substitute back in. Remember that we started out with an equation for the odds of spam versus not spam; when we took the log, this became the *log odds*.

\\[
\log
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_k = w_k}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_k = w_k}
}
=
\theta_0
+
\sum_{i=1}^k
\theta_i w_i
\\]
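If you want to convince yourself that nothing was lost along the way, here is a small numerical check. It computes the log odds of spam given the words two ways, directly from the Naive Bayes product and from the linear form, for every possible $w$. All the probabilities are made-up values and the names are my own.

```python
import itertools
import math

# Hypothetical model with k = 3 words.
prior_spam = 0.4
p_spam = [0.8, 0.1, 0.5]   # Pr[W_i = 1 | S = 1]
p_ham = [0.2, 0.3, 0.5]    # Pr[W_i = 1 | S = 0]

def log_odds_direct(w):
    """Log of the prior odds times the per-word likelihood ratios."""
    total = math.log(prior_spam / (1 - prior_spam))
    for w_i, p1, p0 in zip(w, p_spam, p_ham):
        if w_i == 1:
            total += math.log(p1 / p0)
        else:
            total += math.log((1 - p1) / (1 - p0))
    return total

# Intercept and coefficients, following the definitions of theta_0 and theta_i.
present = [math.log(p1 / p0) for p1, p0 in zip(p_spam, p_ham)]
absent = [math.log((1 - p1) / (1 - p0)) for p1, p0 in zip(p_spam, p_ham)]
theta_0 = math.log(prior_spam / (1 - prior_spam)) + sum(absent)
theta = [p - a for p, a in zip(present, absent)]

def log_odds_linear(w):
    """theta_0 + sum_i theta_i * w_i."""
    return theta_0 + sum(t * w_i for t, w_i in zip(theta, w))

# Largest disagreement over all 2^3 binary feature vectors.
max_gap = max(
    abs(log_odds_direct(w) - log_odds_linear(w))
    for w in itertools.product([0, 1], repeat=3)
)
print(max_gap)
```

The gap should be zero up to floating point rounding.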


Each $\theta_i$ is sometimes called a *log odds ratio*. Let me show you why. First step:

\\[
\theta_i
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
    }{
        \prob{W_i = 1 \condbar \text{S} = 0}
    }
-
\log
    \frac{
        \prob{W_i = 0 \condbar \text{S} = 1}
    }{
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 1 \condbar \text{S} = 0}
    }{
        \prob{W_i = 0 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
\\]

I just used the fact that the difference of logs is the log of the ratio. That's the same rule I've been using all along. I used it twice to turn each difference into the log of a ratio, and then once more to combine the two logs.

Next, since I have a ratio of ratios, I can move the denominator of the top down, and bring the numerator of the bottom up. See:

\\[
\theta_i
=
\log
    \frac{
        \prob{W_i = 1 \condbar \text{S} = 1}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 1}
    }{
        \prob{W_i = 1 \condbar \text{S} = 0}
        \big/
        \prob{W_i = 0 \condbar \text{S} = 0}
    }
\\]

So now I have the log of the ratio of two odds. The top is the odds that a spam email contains word $i$ (that is, $W_i = 1$), while the bottom is the odds that a non-spam email contains word $i$.
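To make the log odds ratio concrete, here is a short sketch with made-up numbers. The helper names `odds` and `log_odds_ratio` are my own. Suppose a word appears in 80% of spam emails and 20% of non-spam emails.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def log_odds_ratio(p_word_given_spam, p_word_given_ham):
    """Log of (odds the word appears in spam) / (odds it appears in non-spam)."""
    return math.log(odds(p_word_given_spam) / odds(p_word_given_ham))

# Hypothetical probabilities, chosen only for this example.
p1, p0 = 0.8, 0.2

# theta_i written as a difference of two log ratios.
theta_i = (math.log(p1) - math.log(p0)) - (math.log(1 - p1) - math.log(1 - p0))

assert math.isclose(theta_i, log_odds_ratio(p1, p0))
print(theta_i)
```

Here the odds are $4$ to $1$ in spam and $1$ to $4$ in non-spam, so the odds ratio is $16$ and $\theta_i = \log 16$.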