### Evaluating Naive Bayes

I want to evaluate the performance of our Naive Bayes model. We used false positive rates and recall to look at the performance, but I want to talk about some other approaches.

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

### Introducing Parameters

Consider an email with a word vector like $(w_1, w_2, \ldots, w_M)$. Here $w_i = 1$ means the email contains the word assigned the numeric code $i$, and $w_i = 0$ means it does not. If there are $M$ words in the vocabulary, then there are $M$ entries in the vector.

We know how to compute the Naive Bayes estimate of the odds that this email is spam. It is:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\]

Let's define some parameters to make this equation simpler. First, one for the presence of each word:

\\[
\phi_i
:=
\frac{
    \prob{W_i = 1 \condbar \text{S} = 1}
}{
    \prob{W_i = 1 \condbar \text{S} = 0}
}
\\]

And one for the absence of the feature:

\\[
\phi'_i
:=
\frac{
    \prob{W_i = 0 \condbar \text{S} = 1}
}{
    \prob{W_i = 0 \condbar \text{S} = 0}
}
\\]

There is also a special parameter for the prior odds:

\\[
\phi_0
:=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\\]

Using this, we have:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\phi_1^{w_1}
\phi_1^{\prime 1 - w_1}
\cdots
\phi_M^{w_M}
\phi_M^{\prime 1 - w_M}
=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\end{align}
\\]
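As a minimal sketch of this formula in Python (the function name `spam_odds` and all the numbers are made up for illustration; they are not parameters from a trained model):

```python
def spam_odds(phi_0, phi, phi_prime, w):
    """Odds that an email is spam: phi_0 * prod_i phi_i^w_i * phi'_i^(1 - w_i).

    phi[i] and phi_prime[i] are the ratios for word i being present/absent;
    w[i] is 1 if the email contains word i and 0 otherwise.
    """
    odds = phi_0
    for phi_i, phi_prime_i, w_i in zip(phi, phi_prime, w):
        # Exactly one of the two factors is active, depending on w_i.
        odds *= phi_i ** w_i * phi_prime_i ** (1 - w_i)
    return odds


# Hypothetical parameters for a 3-word vocabulary.
phi_0 = 0.5                  # prior odds of spam
phi = [4.0, 0.5, 2.0]        # ratios when each word is present
phi_prime = [0.5, 2.0, 1.0]  # ratios when each word is absent

# An email that contains words 1 and 3 but not word 2.
odds = spam_odds(phi_0, phi, phi_prime, [1, 0, 1])
print(odds)  # 0.5 * 4.0 * 2.0 * 2.0 = 8.0
```

Note how the exponents $w_i$ and $1 - w_i$ act as switches: each word contributes either $\phi_i$ or $\phi'_i$, never both.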


### Odds To Probability

We have a formula for the odds. Let's turn it into a formula for the probability that an email is spam. We'll use the rule that probability is equal to $\frac{\text{odds}}{1 + \text{odds}}$:

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \frac{
        \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }{
        \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }
}{
    1
    +
    \frac{
        \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }{
        \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
    }
}
\\
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\end{align}
\\]

By the same reasoning, we have:

\\[
\begin{align}
\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\end{align}
\\]
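Here is a small Python sketch of the conversion, assuming we already have the odds for some email (the value `8.0` and the name `odds_to_prob` are hypothetical, chosen only for illustration):

```python
def odds_to_prob(odds):
    """Convert odds to a probability using p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


odds = 8.0  # hypothetical odds that an email is spam

p_spam = odds_to_prob(odds)  # Pr[S = 1 | words]
p_ham = 1.0 / (1.0 + odds)   # Pr[S = 0 | words]

print(p_spam, p_ham)
```

The two probabilities always sum to 1, since the numerators $\text{odds}$ and $1$ add up to the shared denominator $1 + \text{odds}$.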

### Maximizing The Likelihood Of An Example

What does it mean if our Naive Bayes model does a good job at predicting the class of an email?

It means that if the email truly is spam, then the Naive Bayes model should calculate $\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}$ to be very high.

On the other hand, if the email is not spam, then we want the Naive Bayes model to calculate $\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}$ to be very high.

Basically: the model is doing a good job if it thinks the right answers are very likely. The more likely the model thinks the right answer is, the better.

Here is another way to think of the same idea. The model calculates a probability that the email is spam. If it then *randomly guessed* whether the email was spam/ham using that probability, we want the model to be most likely to guess correctly.


### Maximizing Likelihood Of A Dataset

I showed you what we want for a single example: we want the model to think that the right answer is very likely. The more likely the better.

The same principle applies across the entire dataset. You want the model to think the results of the entire dataset are as likely as possible. The probability of the dataset is denoted $\prob{\mathcal{D}}$.

Here is another way to say the same thing. Let's say the model went through each of the training emails one-by-one. For each email, it randomly guesses whether the email is spam/ham based on $\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}$. For instance, if $\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M} = 0.90$, then the model will randomly choose "spam" or "ham" with a 90%/10% probability.

If we use this strategy, what is the probability that the model is correct about *every single email*? We calculate it by *multiplying* together the probabilities that it is correct for each individual email.

This is probably very small, because even if the model is very good at guessing, there are lots of emails that we must get correct. All the same, we call this $\prob{\mathcal{D}}$.

Our goal is that $\prob{\mathcal{D}}$ be as high as possible. Therefore the probability we want to maximize is:

\\[
\begin{align}
\prob{\mathcal{D}}
&=
\prod_{i = 1}^N
\prob{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\
&=
\prod_{i = 1}^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
            \phi_j^{\prime 1 - w_{i, j}}
    }{
        1
        +
        \phi_0
        \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
            \phi_j^{\prime 1 - w_{i, j}}
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
            \phi_j^{\prime 1 - w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]

This assumes there are $N$ emails. $s_i$ is 1 when the $i$th email is spam; it is 0 otherwise. The $(w_{i, 1}, w_{i, 2}, \ldots, w_{i, M})$ is the word vector for the $i$th email.
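The following is a minimal Python sketch of this calculation on a made-up dataset. The function names, the parameter values, and the three toy emails are all assumptions for illustration. Because a product of many probabilities quickly becomes too small for floating point, the sketch adds up logarithms instead; the logarithm is increasing, so whatever maximizes $\log \prob{\mathcal{D}}$ also maximizes $\prob{\mathcal{D}}$.

```python
import math


def prob_spam(phi_0, phi, phi_prime, w):
    """Pr[S = 1 | word vector w] via odds / (1 + odds)."""
    odds = phi_0
    for phi_j, phi_prime_j, w_j in zip(phi, phi_prime, w):
        odds *= phi_j ** w_j * phi_prime_j ** (1 - w_j)
    return odds / (1.0 + odds)


def log_likelihood(phi_0, phi, phi_prime, emails, labels):
    """log Pr[D]: sum over emails of log Pr[S = s_i | word vector i]."""
    total = 0.0
    for w, s in zip(emails, labels):
        p = prob_spam(phi_0, phi, phi_prime, w)
        # Probability the model assigns to the true label of this email.
        total += math.log(p if s == 1 else 1.0 - p)
    return total


# Hypothetical toy data: 2-word vocabulary, 3 emails.
emails = [[1, 0], [0, 1], [1, 1]]
labels = [1, 0, 1]  # s_i: 1 = spam, 0 = ham

# Hypothetical parameter values.
phi_0, phi, phi_prime = 1.0, [3.0, 0.5], [0.5, 1.0]

ll = log_likelihood(phi_0, phi, phi_prime, emails, labels)
print(ll, math.exp(ll))  # log Pr[D] and Pr[D]
```

With these numbers the model gives the correct labels probabilities of 0.75, 0.8, and 0.6, so $\prob{\mathcal{D}} = 0.75 \times 0.8 \times 0.6 = 0.36$.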

The higher $\prob{\mathcal{D}}$, the more likely the model thinks the dataset is. It makes sense to want the model to think the dataset is likely. The dataset is the only way we can learn the model; why should the model learn things that make it think the observed results are unusual or weird or not revealing of the true mechanism of the system?

This principle that the best model is the one that assigns the greatest probability to the dataset is called the *maximum likelihood principle*.


### Is Naive The Best We Can Do?

Now that we have a measure of how good a job a model is doing, this opens up a new possible question: is the Naive Bayes model doing the best job possible?

In particular, could I tweak the values of $\phi_0, \phi_i, \phi_i^\prime$ so that they do a better job? Could some other values for the parameters assign a higher likelihood to the dataset?


### Why Naive Bayes May Not Choose The Best $\phi$

We have seen that Naive Bayes makes an assumption. It assumes:

\\[
\frac{
    \prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 0}
}
=
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\frac{
    \prob{W_2 = w_2 \condbar \text{S} = 1}
}{
    \prob{W_2 = w_2 \condbar \text{S} = 0}
}
\\]

Basically, Naive Bayes assumes that the feature probability ratio for a pair of words is the same as the product of the feature probability ratios for each of the individual words.

We saw that this derives from the naive conditional independence assumption:

\\[
\prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 1}
=
\prob{W_1 = w_1 \condbar \text{S} = 1}
\prob{W_2 = w_2 \condbar \text{S} = 1}
\\]

Basically, the naive assumption is that there is no relationship between the pair of words $W_1$ and $W_2$ given that you know whether the email is spam (or not spam).

When the naive assumption is wrong, then Naive Bayes will not calculate the correct probability. The more wrong the naive assumption is, the worse job Naive Bayes will tend to do.

### Duplicate Features

Let's see the most extreme example of where Naive Bayes can go wrong.

Say that $W_1$ always occurs if $W_2$ occurs, and vice versa. This is extreme, but there are definitely pairs of words that tend to co-occur: for instance, "limited" and "time" (as in "limited time offer").

If two words always appear together, then we know:

\\[
\prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 1}
=
\prob{W_1 = w_1 \condbar \text{S} = 1}
=
\prob{W_2 = w_2 \condbar \text{S} = 1}
\\]

Our independence assumption could not be more wrong! If we use the formula:

\\[
\begin{align}
\frac{
    \prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \wedge W_2 = w_2 \condbar \text{S} = 0}
}
&=
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\frac{
    \prob{W_2 = w_2 \condbar \text{S} = 1}
}{
    \prob{W_2 = w_2 \condbar \text{S} = 0}
}
\\
&=
\left(
    \frac{
        \prob{W_1 = w_1 \condbar \text{S} = 1}
    }{
        \prob{W_1 = w_1 \condbar \text{S} = 0}
    }
\right)^2
= \phi_1^2
\\
&=
\left(
    \frac{
        \prob{W_2 = w_2 \condbar \text{S} = 1}
    }{
        \prob{W_2 = w_2 \condbar \text{S} = 0}
    }
\right)^2
=
\phi_2^2
\end{align}
\\]

then we are effectively *double counting* the effect of $W_1, W_2$. In this case, we really ought to set $\phi_2 := 1.0$ (which leaves the odds unchanged) if we're going to keep $\phi_1$ the same.

Or we can change $\phi_1$ and keep $\phi_2$. Or we can split the difference and set:

\\[
\phi_1 := \phi_2 := \sqrt{
    \frac{
        \prob{W_1 = w_1 \condbar \text{S} = 1}
    }{
        \prob{W_1 = w_1 \condbar \text{S} = 0}
    }
}
\\]
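To make the double counting concrete, here is a small Python sketch on a made-up dataset in which word 2 is an exact copy of word 1. The dataset, the helper names, and the choice to take the square root of the "word absent" ratios $\phi'_i$ as well are assumptions for illustration. It compares the log-likelihood of the dataset under the Naive Bayes parameters with the split-the-difference setting.

```python
import math


def prob_spam(phi_0, phi, phi_prime, w):
    """Pr[S = 1 | word vector w] via odds / (1 + odds)."""
    odds = phi_0
    for phi_j, phi_prime_j, w_j in zip(phi, phi_prime, w):
        odds *= phi_j ** w_j * phi_prime_j ** (1 - w_j)
    return odds / (1.0 + odds)


def log_likelihood(phi_0, phi, phi_prime, emails, labels):
    """log Pr[D]: sum of log-probabilities of the true labels."""
    total = 0.0
    for w, s in zip(emails, labels):
        p = prob_spam(phi_0, phi, phi_prime, w)
        total += math.log(p if s == 1 else 1.0 - p)
    return total


# Hypothetical dataset: word 2 always equals word 1.
# 5 spam emails (4 contain the words), then 5 ham emails (1 contains them).
emails = [[1, 1]] * 4 + [[0, 0]] + [[1, 1]] + [[0, 0]] * 4
labels = [1] * 5 + [0] * 5

phi_0 = 1.0           # 5 spam vs. 5 ham
ratio_present = 4.0   # (4/5) / (1/5): word present
ratio_absent = 0.25   # (1/5) / (4/5): word absent

# Naive Bayes uses the full ratio for BOTH copies of the word.
naive = log_likelihood(
    phi_0, [ratio_present] * 2, [ratio_absent] * 2, emails, labels
)

# Split the difference: each copy gets the square root of the ratio.
split = log_likelihood(
    phi_0,
    [math.sqrt(ratio_present)] * 2,
    [math.sqrt(ratio_absent)] * 2,
    emails,
    labels,
)

print(naive, split)  # the split setting gives the higher log-likelihood
```

Under the Naive Bayes parameters, an email containing the words gets odds of $4 \times 4 = 16$ even though only 4 out of 5 such emails are spam. The split setting gives odds of $2 \times 2 = 4$, which matches the data.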

### When You Can Beat Naive Bayes

We see now when you can set the $\phi$ values differently from Naive Bayes and do better at predicting the training dataset.

That happens when the conditional independence assumption is not true. It happens when there are correlations (either positive or negative) between the occurrence of one feature and the occurrence of another.

If the conditional independence assumption really is true, then the $\phi$ calculated for Naive Bayes really are optimal. It's only because the naive assumption is typically false that we can typically do a better job.

### How To Find A Better $\phi$

We now know that Naive Bayes doesn't always give you the best choice of $\phi$. How can we find a better setting of $\phi$?

The answer is to use gradient descent to maximize our likelihood function (that is, to minimize its negative). The algorithm starts with random $\phi$ values. At each step, it looks at every $\phi_i$ value and asks: "Would I do better if I were to decrease $\phi_i$ a little bit? Would I do better if I were to increase $\phi_i$ a little bit?"

It repeats this for many steps, and hopefully ends up with a better setting of $\phi$.
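As a rough preview of the idea, here is a toy Python sketch. It is not necessarily how a real implementation works, and it rests on several assumptions of mine: it works with $\log \phi$ rather than $\phi$ so that every $\phi$ stays positive, it climbs the log-likelihood (the same as descending its negative), it estimates each slope by literally nudging a parameter up and down a little, and the dataset, step size, and function names are made up.

```python
import math
import random


def log_likelihood(theta, emails, labels):
    """log Pr[D], with theta = [log phi_0, log phi_1..M, log phi'_1..M]."""
    m = (len(theta) - 1) // 2
    total = 0.0
    for w, s in zip(emails, labels):
        # Log-odds: log phi_0 plus log phi_j (present) or log phi'_j (absent).
        log_odds = theta[0]
        for j in range(m):
            log_odds += theta[1 + j] if w[j] == 1 else theta[1 + m + j]
        p_spam = 1.0 / (1.0 + math.exp(-log_odds))
        total += math.log(p_spam if s == 1 else 1.0 - p_spam)
    return total


def climb(theta, emails, labels, steps=200, rate=0.1, eps=1e-5):
    """Repeatedly nudge every parameter in the direction that raises log Pr[D]."""
    theta = list(theta)
    for _ in range(steps):
        grad = []
        for k in range(len(theta)):
            up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
            down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
            # Finite-difference estimate of the slope for parameter k.
            slope = (
                log_likelihood(up, emails, labels)
                - log_likelihood(down, emails, labels)
            ) / (2 * eps)
            grad.append(slope)
        # Move every parameter a small step uphill.
        theta = [t + rate * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: word 2 is an exact duplicate of word 1.
emails = [[1, 1]] * 4 + [[0, 0]] + [[1, 1]] + [[0, 0]] * 4
labels = [1] * 5 + [0] * 5

random.seed(0)
start = [random.uniform(-0.5, 0.5) for _ in range(5)]  # random starting point
best = climb(start, emails, labels)

print(log_likelihood(start, emails, labels))  # before
print(log_likelihood(best, emails, labels))   # after: higher
print([math.exp(t) for t in best])            # the phi values themselves
```

On this toy dataset the search ends up assigning a spam probability of about 0.8 to emails containing the words and about 0.2 to emails without them, which matches the observed frequencies.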

In the next notebook, we'll see how to do this!