## Naive Bayes

Our task for today will be to classify emails as spam or not spam. We will use the [Enron Email Corpus](https://en.wikipedia.org/wiki/Enron_Corpus). The dataset contains email text along with a label of whether that text was spam or not.

**(Run this cell to define useful Latex macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\\]

Below, we do a number of things:

1. We write a class called `WordEncodingDictionary` that will assign a numeric "code" to each English word. For instance, "offer" might have the code 123. Codes can be stored more efficiently than words. The `WordEncodingDictionary` will keep track of what codes correspond to what words, and vice versa.
2. We write an `Email` class. It has a method to read the email contents from the filesystem.
3. We define a `Dataset` class. It will keep track of our `WordEncodingDictionary`, and it will have two lists of `Email`s: of ham `Email`s and spam `Email`s.
4. We write a bunch of methods to download the Enron dataset, unzip it, and build the dataset by encoding the emails into a set of integers: one for each unique word in the email.

In [2]:
import os
import os.path

DATA_DIR = os.path.join(
    os.getcwd(),
    "data/"
)


In [3]:
from sortedcontainers import SortedSet

# WordEncodingDictionary keeps a bidirectional map between numeric codes and the words they stand for.
class WordEncodingDictionary:
    def __init__(self):
        self.word_to_code_dict = {}
        self.code_to_word_dict = {}

    def word_to_code(self, word):
        if word not in self.word_to_code_dict:
            code = len(self.word_to_code_dict)
            self.word_to_code_dict[word] = code
            self.code_to_word_dict[code] = word

        return self.word_to_code_dict[word]

    def code_to_word(self, code):
        if code not in self.code_to_word_dict:
            # Raising a bare string is a TypeError in Python 3; raise a real exception.
            raise KeyError(f"Code {code} not recorded!")

        return self.code_to_word_dict[code]

    def encode_text(self, text):
        codes = SortedSet()
        for word in text.split():
            codes.add(self.word_to_code(word))
        return codes

    def decode_codes_set(self, codes):
        cls = type(codes)
        return cls(map(self.code_to_word, codes))


In [4]:
import os

class Email:
    def __init__(self, path, content, word_encoding_dictionary, label):
        self.path = path
        self.codes = word_encoding_dictionary.encode_text(content)
        self.label = label
        self.word_encoding_dictionary = word_encoding_dictionary

    def text_content(self):
        return type(self).read_text_content(self.path)

    def words_set(self):
        return self.word_encoding_dictionary.decode_codes_set(self.codes)

    @classmethod
    def read(cls, path, word_encoding_dictionary, label):
        return Email(
            path = path,
            content = cls.read_text_content(path),
            word_encoding_dictionary = word_encoding_dictionary,
            label = label
        )

    @classmethod
    def read_text_content(cls, path):
        full_path = os.path.join(DATA_DIR, path)
        # Grr! Emails are encoded in Latin-1, not UTF-8. Python
        # (rightly) freaks out.
        with open(full_path, "r", encoding = "iso-8859-1") as f:
            try:
                return f.read()
            except:
                print(f"Error with: {path}")
                raise


In [5]:
import os.path
import pickle

class Dataset:
    def __init__(
            self, word_encoding_dictionary, ham_emails, spam_emails
    ):
        self.word_encoding_dictionary = word_encoding_dictionary
        self.ham_emails = ham_emails
        self.spam_emails = spam_emails

    INSTANCE = None
    @classmethod
    def get(cls):
        if not cls.INSTANCE:
            with open(os.path.join(DATA_DIR, 'data.p'), 'rb') as f:
                cls.INSTANCE = pickle.load(f)
        return cls.INSTANCE


In [6]:
import os
import os.path
import pickle
from urllib.request import urlretrieve

ENRON_SPAM_URL = (
    "http://csmining.org/index.php/"
    "enron-spam-datasets.html"
    "?file=tl_files/Project_Datasets/Enron-Spam%20datasets/Preprocessed"
    "/enron1.tar.tar"
)

TAR_FILE_NAME = "enron1.tar.tar"
ENRON_DATA_DIR_NAME = "enron1"

def download_tarfile():
    tarfile_path = os.path.join(DATA_DIR, TAR_FILE_NAME)
    if os.path.isfile(tarfile_path):
        print("Tarfile already downloaded!")
        return

    print("Downloading enron1.tar.tar")
    urlretrieve(ENRON_SPAM_URL, tarfile_path)
    print("Download complete!")

def extract_tarfile():
    tarfile_path = os.path.join(DATA_DIR, TAR_FILE_NAME)
    enron_data_dir = os.path.join(DATA_DIR, ENRON_DATA_DIR_NAME)
    if os.path.isdir(enron_data_dir):
        print("Tarfile already extracted!")
        return

    print("Extracting enron1.tar.tar")
    os.system(f"tar -xf {tarfile_path} -C {DATA_DIR}")
    print("Extraction complete!")

def read_emails_dir(word_encoding_dictionary, path, label):
    emails = []
    for email_fname in os.listdir(os.path.join(DATA_DIR, path)):
        email_path = os.path.join(path, email_fname)
        email = Email.read(
            path = email_path,
            word_encoding_dictionary = word_encoding_dictionary,
            label = label
        )
        emails.append(email)

    return emails

def build_dataset():
    word_encoding_dictionary = WordEncodingDictionary()
    ham_emails = read_emails_dir(
        word_encoding_dictionary = word_encoding_dictionary,
        path = os.path.join(ENRON_DATA_DIR_NAME, "ham"),
        label = 0
    )
    spam_emails = read_emails_dir(
        word_encoding_dictionary = word_encoding_dictionary,
        path = os.path.join(ENRON_DATA_DIR_NAME, "spam"),
        label = 1
    )

    return Dataset(
        word_encoding_dictionary = word_encoding_dictionary,
        ham_emails = ham_emails,
        spam_emails = spam_emails
    )

def save_dataset(dataset):
    # Write to the same location that Dataset.get() reads from.
    with open(os.path.join(DATA_DIR, "data.p"), "wb") as f:
        pickle.dump(dataset, f)

def build_and_save_dataset():
    if os.path.isfile(os.path.join(DATA_DIR, "data.p")):
        print("Dataset already processed!")
        return

    print("Reading and processing emails!")
    dataset = build_dataset()
    save_dataset(dataset)
    print("Dataset created!")

download_tarfile()
extract_tarfile()
build_and_save_dataset()


Tarfile already downloaded!
Tarfile already extracted!
Dataset already processed!


Now that we have built the `Dataset`, let's look at what some of the ham and spam emails look like.

In [7]:
DATASET = Dataset.get()

print()
print(">>> HAM EMAIL:")
print("=" * 72)
print(DATASET.ham_emails[5].text_content())

print()
print(">>> SPAM EMAIL:")
print("=" * 72)
print(DATASET.spam_emails[10].text_content())


>>> HAM EMAIL:
Subject: mcmullen gas for 11 / 99
jackie ,
since the inlet to 3 river plant is shut in on 10 / 19 / 99 ( the last day of
flow ) :
at what meter is the mcmullen gas being diverted to ?
at what meter is hpl buying the residue gas ? ( this is the gas from teco ,
vastar , vintage , tejones , and swift )
i still see active deals at meter 3405 in path manager for teco , vastar ,
vintage , tejones , and swift
i also see gas scheduled in pops at meter 3404 and 3405 .
please advice . we need to resolve this as soon as possible so settlement
can send out payments .
thanks

>>> SPAM EMAIL:
Subject: re : rdd , the auxiliary iturean
free cable @ tv
dabble bam servomechanism ferret canopy bookcase befog seductive elapse ballard daphne acrylate deride decadent desolate else sequestration condition ligament ornately yaqui giblet emphysematous woodland lie segovia almighty coffey shut china clubroom diagnostician
cheer leadsman abominate cambric oligarchy mania woodyard quake tetrachlor

You can see that the email text is all lower case, and each *token* (words, but also symbols like "@") is separated by a space. The subject line is not technically part of the email body, but I will leave it in anyway.

For the purposes of our algorithm, we will convert emails into a set of words, throwing away the order of the words, and also how frequently they occur in the email. Every token will be represented with an integer, rather than the word itself.

We will represent whether an email is spam or not with a 1 for spam and a 0 for not spam. This is called the *label*.

Below I look at the set of words and the set of codes for an email. I also look at its label.

In [8]:
DATASET = Dataset.get()

print(DATASET.spam_emails[10].words_set())
print(DATASET.spam_emails[10].codes)
print(DATASET.spam_emails[10].label)

SortedSet([',', ':', '@', 'Subject:', 'abominate', 'acrylate', 'alexander', 'almighty', 'annette', 'auxiliary', 'ballard', 'bam', 'banks', 'befog', 'bookcase', 'byroad', 'cable', 'cambric', 'canadian', 'canopy', 'charlie', 'cheer', 'china', 'cloister', 'clubroom', 'coffey', 'condition', 'contiguous', 'dabble', 'daphne', 'decadent', 'depressive', 'deride', 'desolate', 'diagnostician', 'elapse', 'else', 'emphysematous', 'ferret', 'free', 'giblet', 'gnaw', 'iturean', 'leadsman', 'lie', 'ligament', 'mania', 'oligarchy', 'ornately', 'quake', 'rdd', 're', 'seductive', 'segovia', 'sequestration', 'servomechanism', 'shut', 'synaptic', 'tetrachloride', 'the', 'trauma', 'tv', 'welsh', 'woodland', 'woodyard', 'yaqui'], key=None, load=1000)
SortedSet([0, 7, 13, 27, 69, 209, 306, 679, 1076, 1466, 1492, 3200, 3498, 3880, 4518, 5332, 5673, 6040, 6075, 7845, 10796, 11378, 11981, 12835, 14532, 16407, 17924, 20486, 20487, 20488, 20489, 20490, 20491, 20492, 20493, 20494, 20495, 20496, 20497, 20498, 20499

We call the preprocessing of the dataset *featurization*. The machine learning algorithm will interact with the *featurized* emails (the set of numbers and the 0/1 label), rather than the raw emails themselves.

It is not uncommon to throw away word order and word counts. This representation of text is called the *bag of words model*. Obviously, a lot of information is lost with this representation. For some tasks, like document retrieval based on keyword matching, the bag of words model can still be useful. For tasks like spam/not-spam classification, bag of words performs well.
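To make the information loss concrete, here is a minimal sketch. The helper `bag_of_words` is a hypothetical stand-in for `WordEncodingDictionary.encode_text` (it keeps the words themselves rather than integer codes), and the two example strings are made up.

```python
def bag_of_words(text):
    """Return the set of whitespace-separated tokens in `text`.

    Word order and repeat counts are both discarded.
    """
    return frozenset(text.split())

# Two made-up emails: same words, different order and different counts.
first = bag_of_words("free cable tv free offer")
second = bag_of_words("offer tv cable free")

print(first == second)  # True
```

Both strings map to the same set, so any classifier that only sees the bag of words cannot tell them apart.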

For tasks which need deeper *semantic* understanding of a document (understanding what it means), we would want to use techniques which can exploit the information contained in the word ordering. Naive Bayes would not be appropriate for tasks like that.

Luckily, Naive Bayes does very well for classifying emails as spam/not spam.

### Word Probabilities

To detect which emails are spam and which aren't, we will use the observation that some words are more likely to appear in a spam email than in a non-spam email. A word that is more likely to occur in a spam email than in a non-spam email is an indicator of spamminess.

Let's make this precise using probabilities.

Consider the word "offer." Offer probably occurs more commonly in spam emails than in non-spam emails. Let's write this in notation:

\\[
\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
>
\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}
\\]

Here $\text{OFFER} = 1$ means "the word 'offer' is in the email." To say "the word 'offer' is *not* in the email," we would write $\text{OFFER} = 0$.

Likewise, $\text{SPAM} = 1$ means "the email is spam," versus $\text{SPAM} = 0$ which means "the email is not spam."

The *probability* that a randomly sampled email is spam is written:

\\[
\prob{\text{SPAM} = 1}
\\]

If we have a very large dataset, this probability can be defined as

\\[
\prob{\text{SPAM} = 1} = \frac{\text{# of emails that are spam}}{\text{# of emails in the dataset}}
\\]


I want to predict how probable it is that an email is spam, *given* that it contains the word "offer." That is, I want to calculate

\\[
\frac{
\text{# of emails that are spam and contain the word "offer"}
}{
\text{# of emails that contain the word "offer"}
}
\\]

We write this like so:

\\[
\prob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
\\]

The symbol $\condbar$ separates the result we're asking about ($\text{SPAM} = 1$) from the *condition* ($\text{OFFER} = 1$). This is called a *conditional probability*.
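As a sketch of what "counting and dividing" means here, the snippet below computes the conditional probability on a tiny made-up dataset. The emails in `toy_emails` and the function name `conditional_spam_prob` are assumptions for illustration only; they are not part of the Enron data or the classes defined above.

```python
# Hypothetical toy dataset: (set of words, label), with 1 = spam and 0 = ham.
toy_emails = [
    ({"limited", "offer"}, 1),
    ({"offer", "free"}, 1),
    ({"free", "tv"}, 1),
    ({"meeting", "offer"}, 0),
    ({"meeting", "gas"}, 0),
    ({"gas", "meter"}, 0),
    ({"meter", "deal"}, 0),
    ({"deal", "thanks"}, 0),
]

def conditional_spam_prob(emails, word):
    """Estimate Pr[SPAM = 1 | word present] by counting.

    Raises ZeroDivisionError if no email contains `word`.
    """
    labels = [label for words, label in emails if word in words]
    # Numerator: spam emails containing the word.
    # Denominator: all emails containing the word.
    return sum(labels) / len(labels)

p_spam_given_offer = conditional_spam_prob(toy_emails, "offer")
print(p_spam_given_offer)
```

Three of the toy emails contain "offer" and two of those are spam, so the estimate is two thirds.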

There is another way to write a probability like this:

\\[
\prob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
=
\frac{
\prob{\text{SPAM} = 1}\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
}{
\prob{\text{OFFER} = 1}
}
\\]

Let me prove it's true. Replace each probability by its definition:

\\[
\begin{align}
\frac{
    \frac{
        \text{# of SPAM emails}
    }{
        \text{# of total emails}
    }
    \frac{
        \text{# of SPAM emails with the word OFFER}
    }{
        \text{# of SPAM emails}
    }
}{
    \frac{
        \text{# of emails with the word OFFER}
    }{
        \text{# of total emails}
    }
}
&=
\frac{
    \frac{
        \text{# of SPAM emails with the word OFFER}
    }{
        \text{# of total emails}
    }
}{
    \frac{
        \text{# of emails with the word OFFER}
    }{
        \text{# of total emails}
    }
}
\\
&=
\frac{
    \text{# of SPAM emails with the word OFFER}
}{
    \text{# of emails with the word OFFER}
}
\\
&=
\prob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
\end{align}
\\]

This rule is called *Bayes' Rule*.
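The identity can also be checked numerically. The counts below are made up for illustration (they are not Enron counts), and `fractions.Fraction` is used so that the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

# Hypothetical counts, chosen only to illustrate the identity.
n_total = 8           # emails in the dataset
n_spam = 3            # emails that are spam
n_offer = 3           # emails that contain "offer"
n_spam_and_offer = 2  # emails that are spam AND contain "offer"

p_spam = Fraction(n_spam, n_total)
p_offer = Fraction(n_offer, n_total)
p_offer_given_spam = Fraction(n_spam_and_offer, n_spam)

# Right-hand side of Bayes' Rule.
via_bayes = p_spam * p_offer_given_spam / p_offer

# Left-hand side, computed directly from its counting definition.
direct = Fraction(n_spam_and_offer, n_offer)

print(via_bayes, direct)  # 2/3 2/3
```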

I can apply the same equation to $\text{SPAM} = 0$. That gives me:

\\[
\prob{\text{SPAM} = 0 \condbar \text{OFFER} = 1}
=
\frac{
    \prob{\text{SPAM} = 0}
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}
}{
    \prob{\text{OFFER} = 1}
}
\\]

It is frequently convenient to consider the *odds* that something is true, rather than the probability. If the probability of $X$ is $p$, then the *odds* of $X$ are $\frac{p}{1-p}$. For instance, a probability of $2/3 \approx 0.67$ corresponds to odds of $2$, sometimes written $2:1$.
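A small sketch of the conversion in both directions follows. The function names `prob_to_odds` and `odds_to_prob` are hypothetical helpers, not part of the notebook's classes.

```python
def prob_to_odds(p):
    """Convert a probability p (with 0 <= p < 1) to odds p / (1 - p)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert non-negative, finite odds back to a probability."""
    return odds / (1 + odds)

two_to_one = prob_to_odds(2 / 3)   # approximately 2, i.e. odds of 2:1
even_odds = prob_to_odds(0.5)      # exactly 1, i.e. odds of 1:1
round_trip = odds_to_prob(prob_to_odds(0.25))

print(two_to_one, even_odds, round_trip)
```

Odds of 1 correspond to a coin flip; odds above 1 mean the event is more likely than not.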

Let's compute the odds that an email is spam, given that it has the word "offer" in it.

\\[
\frac{
    \prob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
}{
    \prob{\text{SPAM} = 0 \condbar \text{OFFER} = 1}
}
=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}
}
\\]

The first factor in the product is what I call the *base odds*. It is the odds of an email being spam if it is randomly sampled and we know nothing else about it.

The second factor in the product I will call the *feature probability ratio*. It is the ratio of the probabilities that the feature will occur conditional on $\text{SPAM} = 1$ versus $\text{SPAM} = 0$.

When this ratio is $>1$, the probability that the word occurs in a spam email is higher than the probability that it occurs in a non-spam email. In that case, knowing that an email contains the word "offer" raises the odds that the email is spam, because the base odds are multiplied by a number greater than one.

Likewise, if the feature probability ratio is $<1$, then knowing that an email contains the word "offer" makes it less likely that the email is spam.

Let's use code to calculate

\\[
\frac{
    \prob{W_i = 1 \condbar \text{SPAM} = 1}
}{
    \prob{W_i = 1 \condbar \text{SPAM} = 0}
}
\\]

for every word $W_i$ in our vocabulary. We'll calculate the ratio for the word "offer," for the word "Enron," and for the word "baldness."

The words with the highest ratio are those that are the greatest indicators of spam. Those words multiply the base odds by the greatest feature probability ratio, so that the conditional odds that an email is spam are greatest.

The code below just counts up these numbers. It counts how many spam emails contain a word (each email is a set of words, so a word is counted at most once per email). It divides by the number of spam emails. This is the value of the numerator.

It does the same thing for the denominator: it counts how many non-spam emails contain the word, and divides by the number of non-spam emails.

Last, it divides these two probabilities, which is the feature probability ratio defined above.

In [10]:
import numpy as np

# Keeps track of how many times a word occurs in ham or spam emails.
class Counts:
    def __init__(self, ham_count = 0, spam_count = 0):
        self.ham_count, self.spam_count = (
            ham_count, spam_count
        )

    def total_count(self):
        return self.ham_count + self.spam_count

    def __repr__(self):
        return self.__dict__.__repr__()

# Represents the unconditional class probabilities Pr(SPAM = 1), Pr(SPAM = 0)
class PriorClassProbabilities:
    def __init__(self, class_counts):
        self.ham_prior_prob = (
            class_counts.ham_count / class_counts.total_count()
        )
        self.spam_prior_prob = (
            class_counts.spam_count / class_counts.total_count()
        )

# Calculates the feature probability ratio defined above.
class ConditionalFeatureProbabilityRatio:
    def __init__(self, feature_counts, class_counts):
        self.prob_feature_given_ham = (
            feature_counts.ham_count / class_counts.ham_count
        )
        self.prob_feature_given_spam = (
            feature_counts.spam_count / class_counts.spam_count
        )

        if (self.prob_feature_given_ham != 0):
            self.feature_probability_ratio = (
                self.prob_feature_given_spam
                / self.prob_feature_given_ham
            )
        else:
            self.feature_probability_ratio = np.inf

    def __repr__(self):
        return self.__dict__.__repr__()

# Keeps a map of codes to feature probability ratio.
class FeatureProbabilities:
    def __init__(self):
        self.class_counts = Counts()
        self.code_counts = {}
        
    @classmethod
    def from_dataset(cls, dataset):
        return cls.from_emails(
            ham_emails = dataset.ham_emails,
            spam_emails = dataset.spam_emails
        )

    @classmethod
    def from_emails(cls, ham_emails, spam_emails):
        fps = cls()

        for ham_email in ham_emails:
            fps.add_email(ham_email, True)
        for spam_email in spam_emails:
            fps.add_email(spam_email, False)

        return fps

    def add_email(self, email, is_ham_email):
        if is_ham_email:
            self.class_counts.ham_count += 1
        else:
            self.class_counts.spam_count += 1

        for code in email.codes:
            self._check_code_added(code)

            if is_ham_email:
                self.code_counts[code].ham_count += 1
            else:
                self.code_counts[code].spam_count += 1

    def class_prior_probs(self):
        return PriorClassProbabilities(self.class_counts)

    # Gives you Pr(W_i = 1 | SPAM = 1) / Pr(W_i = 1 | SPAM = 0)
    def code_prob_ratio(self, code):
        return ConditionalFeatureProbabilityRatio(
            feature_counts = self.code_counts[code],
            class_counts = self.class_counts
        )

    def no_code_counts(self, code):
        code_counts = self.code_counts[code]
        return Counts(
            ham_count = (
                self.class_counts.ham_count - code_counts.ham_count
            ),
            spam_count = (
                self.class_counts.spam_count - code_counts.spam_count
            )
        )

    # Gives you Pr(W_i = 0 | SPAM = 1) / Pr(W_i = 0 | SPAM = 0)
    def no_code_prob_ratio(self, code):
        return ConditionalFeatureProbabilityRatio(
            feature_counts = self.no_code_counts(code),
            class_counts = self.class_counts
        )

    def _check_code_added(self, code):
        if code in self.code_counts: return
        self.code_counts[code] = Counts(
            ham_count = 0,
            spam_count = 0,
        )

In [11]:
DATASET = Dataset.get()
feature_probabilities = FeatureProbabilities.from_dataset(DATASET)
feature_probabilities.code_prob_ratio(
    code = DATASET.word_encoding_dictionary.word_to_code('offer')
)

{'prob_feature_given_ham': 0.01661220043572985, 'prob_feature_given_spam': 0.094, 'feature_probability_ratio': 5.658491803278688}

Here we see that

\\[
\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1} = 0.094
\\
\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0} = 0.0166
\\
\frac{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}
}
= 5.66
\\]

Therefore, we can use our equation from above:

\\[
\begin{align}
\frac{
    \prob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
}{
    \prob{\text{SPAM} = 0 \condbar \text{OFFER} = 1}
}
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}
}
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
5.66
\end{align}
\\]

Whatever the base odds of a randomly selected email being spam, once you know it contains the word "offer," the odds that it is indeed spam are $5.66$ times greater.
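To see what multiplying by $5.66$ does in practice, here is a short sketch. The ratio is the value measured above for "offer"; the base odds of $0.25$ (one spam email for every four ham emails) are a made-up assumption, not a number taken from the Enron data.

```python
FEATURE_PROBABILITY_RATIO = 5.66  # measured above for the word "offer"
HYPOTHETICAL_BASE_ODDS = 0.25     # assumption: 1 spam email per 4 ham emails

def update_odds(base_odds, feature_probability_ratio):
    """Posterior odds = base odds * feature probability ratio."""
    return base_odds * feature_probability_ratio

def odds_to_prob(odds):
    """Convert finite odds to a probability."""
    return odds / (1 + odds)

posterior_odds = update_odds(HYPOTHETICAL_BASE_ODDS, FEATURE_PROBABILITY_RATIO)
posterior_prob = odds_to_prob(posterior_odds)

print(f"prior probability:     {odds_to_prob(HYPOTHETICAL_BASE_ODDS):.2f}")
print(f"posterior odds:        {posterior_odds:.3f}")
print(f"posterior probability: {posterior_prob:.2f}")
```

Under these assumed base odds, seeing "offer" moves the probability of spam from 0.20 to roughly 0.59.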

### Empirical Quantities Can Be Inaccurate

To calculate the feature probability ratio, we are dividing two conditional probabilities: $\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}$ and $\prob{\text{OFFER} = 1 \condbar \text{SPAM} = 0}$.

To calculate these probabilities, we're just using the emails we have in our dataset. We are just counting emails and dividing:

\\[
\eprob{\text{OFFER} = 1 \condbar \text{SPAM} = 1}
=
\frac{
\text{# of emails that are spam and contain the word "offer"}
}{
\text{# of emails that are spam}
}
\\]

I have put a "hat" on the probability, because this is an *estimate* of the probability. A hat means estimate. It isn't necessarily the true probability we would get if we considered every email ever written. We are only looking at a much smaller *sample* dataset.

For instance, let's say that "xylophone" never occurs in a spam email in our dataset. Then we calculate:

\\[
\begin{align}
\eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
&=
\frac{
\text{# of emails that are spam and contain the word "xylophone"}
}{
\text{# of emails that are spam}
}
\\
&=
\frac{
    0
}{
    \text{# of emails that are spam}
}
\\
&=
0
\end{align}
\\]

If our calculation above is accurate, it means that no spam email will ever have the word "xylophone" in it. Now, that is true *in our dataset*, but is it true that no spam email *ever* contained the word "xylophone"? If not, then this estimate is wrong. The true probability is surely very small, but it can't be *exactly* zero.

That said, surely "xylophone" is very rare in spam emails. Let's presume that it is equally rare in ham emails. In that case:

\\[
\prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
=
\prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
\\
\Rightarrow
\frac{
    \prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
}
= 1.0
\\]

Notice I am not using hats here, because I'm talking about the *true* probabilities we would measure if we had every email ever written. If the true feature probabilities are equal, then the feature probability ratio should be one, which means that "xylophone" doesn't change the odds that an email is spam. That's what it means to be equally likely in spam and non-spam emails.


However, suppose that by random chance "xylophone" does appear in one non-spam email in our dataset, even though it never appears in a spam email. That is:

\\[
\text{# of non-spam emails with the word "xylophone"} = 1
\\
\text{# of spam emails with the word "xylophone"} = 0
\\]

If that is true, then:

\\[
\eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
>
0
\\
\eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
=
0
\\]

And that would mean:

\\[
\frac{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
}{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
}
=
\frac{
    0
}{
    \text{some number} > 0
}
=
0
\\]

So this means that 

\\[
\frac{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
}{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
}
\ne
\frac{
    \prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
}
\\]

Now, any *estimate* will always be a little inaccurate. But the problem here is that the estimate is *very* wrong about the odds.

Now, let's say you want to predict if some future email is spam. Let's say this new email has the word xylophone. Then:

\\[
\begin{align}
\frac{
    \eprob{\text{SPAM} = 1 \condbar \text{XYLOPHONE} = 1}
}{
    \eprob{\text{SPAM} = 0 \condbar \text{XYLOPHONE} = 1}
}
&=
\frac{
    \eprob{\text{SPAM} = 1}
}{
    \eprob{\text{SPAM} = 0}
}
\cdot
\frac{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 1}
}{
    \eprob{\text{XYLOPHONE} = 1 \condbar \text{SPAM} = 0}
}
\\
&=
\frac{
    \eprob{\text{SPAM} = 1}
}{
    \eprob{\text{SPAM} = 0}
}
\cdot
0
\\
&=
0
\end{align}
\\]

That would mean we think any future email that contains the word "xylophone" has a 0% chance of being spam.
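The collapse to zero can be reproduced with a few lines of arithmetic. The class counts below (1000 spam, 3000 ham) are made up for illustration; the function mirrors the convention used in `ConditionalFeatureProbabilityRatio`, which returns infinity when the ham probability is zero.

```python
def feature_probability_ratio(spam_with_word, n_spam, ham_with_word, n_ham):
    """Estimate Pr[W = 1 | SPAM = 1] / Pr[W = 1 | SPAM = 0] from counts."""
    p_given_spam = spam_with_word / n_spam
    p_given_ham = ham_with_word / n_ham
    if p_given_ham == 0:
        return float("inf")
    return p_given_spam / p_given_ham

# Hypothetical class counts, for illustration only.
N_SPAM, N_HAM = 1000, 3000
base_odds = N_SPAM / N_HAM

# "xylophone": seen in one ham email and in no spam emails.
ratio = feature_probability_ratio(
    spam_with_word=0, n_spam=N_SPAM, ham_with_word=1, n_ham=N_HAM
)
posterior_odds = base_odds * ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(ratio, posterior_odds, posterior_prob)  # 0.0 0.0 0.0
```

A single ham email is enough to drive the estimated spam probability all the way to zero, no matter what the base odds are.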


Here's what we're seeing:

(1) If a word is very rare, our estimate of the feature probability ratio can be *very* wrong.

(2) Therefore, if we use the feature probability ratio to calculate the conditional odds 

\\[
\frac{
    \eprob{\text{SPAM} = 1 \condbar \text{XYLOPHONE} = 1}
}{
    \eprob{\text{SPAM} = 0 \condbar \text{XYLOPHONE} = 1}
}
\\]

then this estimate can be very wrong.

Below, I show an example with the word "bacterial." This occurs once in a spam email, and never in a non-spam email. Therefore, we'll think that any new email with the word "bacterial" has a 100% chance of being spam.

In [30]:
DATASET = Dataset.get()

bacterial_code = DATASET.word_encoding_dictionary.word_to_code('bacterial')
print(f"Counts | {feature_probabilities.code_counts[bacterial_code]}")
print(f"Feature Probability Ratio | {feature_probabilities.code_prob_ratio(bacterial_code)}")

Counts | {'ham_count': 0, 'spam_count': 1}
Feature Probability Ratio | {'prob_feature_given_ham': 0.0, 'prob_feature_given_spam': 0.0006666666666666666, 'feature_probability_ratio': inf}


Rather than think that any email with the word "bacterial" is 100% for sure spam, maybe we should just *ignore* the word "bacterial." Basically: we don't know enough about the word "bacterial" to know whether it really does indicate spam or ham. It is simply too rare a word for us to know without more data.

### Discarding Low Reach Features

The code below goes through all the features, and it throws out any of those with low *reach*. That is: it throws out words which occur in fewer than one hundred emails. One hundred is chosen arbitrarily.

The idea is: if a word $W$ occurs in at least 100 emails, then we have enough occurrences to fairly accurately estimate the true $\prob{W = 1 \condbar \text{SPAM} = 1}$.
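One rough way to see why reach matters is to ask how much a single additional email would move the estimated ratio. This is only a sketch: the class counts (1000 spam, 1000 ham) and the word counts are made up, and `feature_probability_ratio` is a simplified stand-in for the class defined earlier.

```python
def feature_probability_ratio(spam_with_word, ham_with_word, n_spam, n_ham):
    """Estimated Pr[W = 1 | SPAM = 1] / Pr[W = 1 | SPAM = 0]."""
    return (spam_with_word / n_spam) / (ham_with_word / n_ham)

# Hypothetical class counts, for illustration only.
N_SPAM, N_HAM = 1000, 1000

# A rare word: seen in 1 spam and 1 ham email, then in one more spam email.
rare_before = feature_probability_ratio(1, 1, N_SPAM, N_HAM)
rare_after = feature_probability_ratio(2, 1, N_SPAM, N_HAM)

# A common word: seen in 50 spam and 50 ham emails, then in one more spam email.
common_before = feature_probability_ratio(50, 50, N_SPAM, N_HAM)
common_after = feature_probability_ratio(51, 50, N_SPAM, N_HAM)

print(f"rare word:   {rare_before:.2f} -> {rare_after:.2f}")
print(f"common word: {common_before:.2f} -> {common_after:.2f}")
```

One extra email doubles the ratio for the rare word but changes it by only about 2% for the word with a reach of 100.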

After throwing out these low reach words, I'll list all the words that are the best predictors of spam. That is: I'll show all the words $W$ where

\\[
\frac{
    \prob{W = 1 \condbar \text{SPAM} = 1}
}{
    \prob{W = 1 \condbar \text{SPAM} = 0}
}
\\]

is the greatest.

In [10]:
# A helper class to explore best ham/spam features.
class FeatureProbabilitiesExplorer:
    @classmethod
    def best_spam_features(
        cls, fps, limit = 20, present_features = True
    ):
        return cls.best_features(
            fps, limit, -1, present_features = present_features
        )

    @classmethod
    def best_ham_features(
        cls, fps, limit = 20, present_features = True
    ):
        return cls.best_features(
            fps, limit, +1, present_features = present_features
        )

    @classmethod
    def best_features(cls, fps, limit, multiplier, present_features):
        if present_features:
            prob_ratio_fn = fps.code_prob_ratio
            code_counts = lambda code: fps.code_counts[code]
        else:
            prob_ratio_fn = fps.no_code_prob_ratio
            code_counts = fps.no_code_counts

        codes = list(fps.code_counts.keys())
        code_prob_ratios = [{
            'code': code,
            'reach': code_counts(code),
            'feature_probability_ratio': (
                prob_ratio_fn(code).feature_probability_ratio
            )
        } for code in codes]
        code_prob_ratios.sort(key = lambda code_prob_ratio: (
            multiplier * code_prob_ratio['feature_probability_ratio']
        ))
        return code_prob_ratios[:limit]

    @classmethod
    def print_features_list(
        cls, features_list, word_encoding_dictionary
    ):
        for code_prob_ratio in features_list:
            code, reach, feature_probability_ratio = (
                code_prob_ratio['code'],
                code_prob_ratio['reach'],
                code_prob_ratio['feature_probability_ratio']
            )
            word = word_encoding_dictionary.code_to_word(code)
            print(
                f"{code} | {word} | reach: {reach} | "
                f"feature_probability_ratio: {feature_probability_ratio:0.2f}:1"
            )


In [14]:
def filter_feature_probabilities(fps, reach_limit):
    filtered_fps = FeatureProbabilities()
    filtered_fps.class_counts = Counts(
        ham_count = fps.class_counts.ham_count,
        spam_count = fps.class_counts.spam_count
    )

    for (code, counts) in fps.code_counts.items():
        if counts.total_count() < reach_limit: continue
        filtered_fps.code_counts[code] = Counts(
            ham_count = counts.ham_count,
            spam_count = counts.spam_count
        )

    return filtered_fps

DATASET = Dataset.get()
filtered_feature_probabilities = filter_feature_probabilities(
    FeatureProbabilities.from_dataset(DATASET),
    reach_limit = 100
)
best_spam_features = FeatureProbabilitiesExplorer.best_spam_features(
    filtered_feature_probabilities
)
FeatureProbabilitiesExplorer.print_features_list(
    best_spam_features, DATASET.word_encoding_dictionary
)

17077 | 2004 | reach: {'ham_count': 1, 'spam_count': 121} | feature_probability_ratio: 296.21:1
5969 | microsoft | reach: {'ham_count': 11, 'spam_count': 98} | feature_probability_ratio: 21.81:1
3104 | investment | reach: {'ham_count': 11, 'spam_count': 96} | feature_probability_ratio: 21.36:1
3522 | results | reach: {'ham_count': 18, 'spam_count': 98} | feature_probability_ratio: 13.33:1
370 | v | reach: {'ham_count': 26, 'spam_count': 134} | feature_probability_ratio: 12.62:1
3951 | million | reach: {'ham_count': 20, 'spam_count': 97} | feature_probability_ratio: 11.87:1
680 | stop | reach: {'ham_count': 31, 'spam_count': 147} | feature_probability_ratio: 11.61:1
3900 | software | reach: {'ham_count': 22, 'spam_count': 101} | feature_probability_ratio: 11.24:1
5621 | 80 | reach: {'ham_count': 23, 'spam_count': 104} | feature_probability_ratio: 11.07:1
4002 | dollars | reach: {'ham_count': 26, 'spam_count': 113} | feature_probability_ratio: 10.64:1
4611 | remove | reach: {'ham_count':

This list is pretty good! Most of the words intuitively fit with what we think might be in a spam email. "2004" is weird, but let's ignore that.

### Using all the words

So far we have calculated probabilities like:

\\[
\eprob{\text{SPAM} = 1 \condbar \text{OFFER} = 1}
\\]

This tells us the probability that a randomly drawn email is spam, if we know it contains the word "offer."

What if you tell me the email contains *both* the words "offer" and "limited?" That probably makes it *more* likely to be spam, because "limited" is probably a spammy word too.

I want to calculate:

\\[
\eprob{\text{SPAM} = 1 \condbar \text{OFFER} = 1 \wedge \text{LIMITED} = 1}
\\]

The wedge $\wedge$ means "AND."

Eventually, I want to use *all* the words in an email. When I classify an email as ham or spam, I get to look at *all* the words.

\\[
\prob{\text{SPAM} = 1 \condbar \text{all the words in the email}}
\\]

My expectation is that the more words I use, the better my guess about whether the email is spam. More words is more information.

Let's start with just two words. I want to calculate:

\\[
\eprob{\text{SPAM} = 1 \condbar \text{OFFER} = 1 \wedge \text{LIMITED} = 1}
\\]

One problem I'll encounter is that, even if $>100$ emails have the word "offer," and $>100$ emails have the word "limited," it may not be true that $>100$ emails have *both* words. That would be bad, because:

\\[
\begin{align}
&\eprob{\text{SPAM} = 1 \condbar \text{OFFER} = 1 \wedge \text{LIMITED} = 1}
&
\\
&=
\frac{
    \text{# of emails with both the words offer and limited, AND is spam}
}{
    \text{# of emails with both the words offer and limited}
}
&
\end{align}
\\]

Again, if there are very few emails with both, this estimate can be very wrong. For instance, if the number of emails with both the words offer and limited is just one, then this estimated probability will be either 0% or 100%, which are the most extreme possibilities.

The fundamental problem is this. The more words you try to use from an email, the smaller the count of the denominator. That means our estimate is likely to be highly inaccurate.
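To make the shrinking denominator concrete, here is a minimal sketch on a handful of made-up bags of words. The toy emails and the helper `estimate_spam_prob` are illustrative assumptions, not part of the notebook's dataset code.

```python
# Toy bags of words with spam labels (hypothetical data, not the Enron corpus).
emails = [
    ({"limited", "offer", "now"}, True),
    ({"offer", "meeting"}, False),
    ({"offer", "free"}, True),
    ({"limited", "budget"}, False),
    ({"offer", "price"}, True),
    ({"limited", "time"}, True),
]


def estimate_spam_prob(emails, required_words):
    """Return (estimate, reach) for emails containing all required_words.

    estimate is the fraction of matching emails that are spam;
    reach is the number of matching emails (the denominator).
    """
    matches = [is_spam for words, is_spam in emails if required_words <= words]
    if not matches:
        return None, 0
    return sum(matches) / len(matches), len(matches)


print(estimate_spam_prob(emails, {"offer"}))             # reach of 4
print(estimate_spam_prob(emails, {"limited"}))           # reach of 3
print(estimate_spam_prob(emails, {"offer", "limited"}))  # reach of only 1
```

With both words required, only one email matches, so the estimate is forced to be exactly 0% or 100%.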

We need to come up with a trick to get around this problem.

### Conditional Independence

Let's try to get around the problem. We'll start with using Bayes' Rule:

\\[
\prob{\text{SPAM} = 1\condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
=
\frac{
    \prob{\text{SPAM} = 1}\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}
\\]

We can write this as odds:

\\[
\begin{align}
&
\frac{
    \prob{\text{SPAM} = 1\condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}{
    \prob{\text{SPAM} = 0\condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 0}
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\frac{
    \prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 0}
}
&
\end{align}
\\]

Next, let's make an *assumption*. Let's assume:

\\[
\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
=
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\\]

Similarly, let's assume:

\\[
\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 0}
=
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 0}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 0}
\\]

These assumptions are *not true*. Let's just pretend they are true, though. I'll discuss them more in a moment.

If these equalities hold, then we can substitute into the odds equation above:

\\[
\begin{align}
&
\frac{
    \prob{\text{SPAM} = 1\condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}{
    \prob{\text{SPAM} = 0\condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
    \cdot
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 0}
    \cdot
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 0}
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 0}
}
&
\end{align}
\\]

The good news is that we now have an equation for the odds where no probability involves *both* "investment" and "quality". That's good, because if "investment" and "quality" are both individually high reach features, these estimated probabilities will be accurate, like we said before.

So, *if our assumption is true*, we have found a good way to calculate the odds. It doesn't involve a probability about two words at once, which can be low reach even if the individual features are high reach. This makes the odds calculation much more accurate, *assuming our assumption is true.*
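As a worked example of the factored odds, here is a small sketch with invented numbers. The prior and the per-word probabilities are hypothetical, and `naive_bayes_odds` and `odds_to_probability` are names chosen only for this illustration.

```python
def naive_bayes_odds(prior_spam, word_probs):
    """Multiply the base odds by one probability ratio per word.

    word_probs is a list of (P(word | spam), P(word | ham)) pairs.
    """
    odds = prior_spam / (1.0 - prior_spam)
    for p_given_spam, p_given_ham in word_probs:
        odds *= p_given_spam / p_given_ham
    return odds


def odds_to_probability(odds):
    """Convert odds a:1 into the probability a / (a + 1)."""
    return odds / (1.0 + odds)


# Hypothetical values: 25% of emails are spam; "investment" appears in 10% of
# spam and 1% of ham; "quality" appears in 8% of spam and 4% of ham.
odds = naive_bayes_odds(0.25, [(0.10, 0.01), (0.08, 0.04)])
print(f"odds: {odds:.2f}:1, probability: {odds_to_probability(odds):.3f}")
```

Here the base odds of 1:3 are multiplied by ratios of roughly 10 and 2, giving odds of about 6.67:1.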

### What Did We Assume?

I already showed you how to break up a probability that involves a conjunction:

\\[
\prob{X = x \wedge Y = y} = \prob{X = x} \prob{Y = y \condbar X = x}
\\]

This is always true. It is a law. I proved it above. For the same reason, this law is also true:

\\[
\prob{X = x \wedge Y = y \condbar Z = z}
=
\prob{X = x \condbar Z = z}
\prob{Y = y \condbar X = x \wedge Z = z}
\\]

If you are skeptical, prove it for yourself.


Using that rule, I know:

\\[
\begin{align}
&
\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
&\\
&=
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
&
\end{align}
\\]


Now, let's return to what we assumed:

\\[
\begin{align}
&
\prob{\text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1 \condbar \text{SPAM} = 1}
&
\\
&=
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
&
\end{align}
\\]

So our assumption is the same as assuming:

\\[
\begin{align}
&
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
&
\\
&=
\prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
&
\end{align}
\\]

Which is the same as assuming:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
=
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\\]


### Conditional Independence

The property we assumed is called *conditional independence*. Conditional independence means:

\\[
\prob{X = x \condbar Y = y \wedge Z = z} = \prob{X = x \condbar Y = y}
\\]

In this case, we say that "X is conditionally independent of Z, given Y".

Let's talk in terms of our example:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
=
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\\]

This says: a spam email is no more or less likely to contain the word "quality", regardless of whether it contains the word "investment".


### Examples When Unconditional Independence Is Violated

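I want to give you an intuition about conditional independence. Let's talk about *unconditional* independence first. Unconditional independence would require the two probabilities below to be equal, but in general they are not:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{INVESTMENT} = 1}
\ne
\prob{\text{QUALITY} = 1}
\\]

Here's why unconditional independence doesn't hold in general.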

1. The presence of the word "investment" suggests that the email is spam.
2. If the email is spam, then it is more likely to contain the word "quality" than the average email.
3. That means that the presence of the word "investment" makes the presence of the word "quality" more likely.
4. In conclusion, the words "investment" and "quality" are not unconditionally independent.

In fact, by that reasoning,

\\[
\prob{\text{QUALITY} = 1 \condbar \text{INVESTMENT} = 1}
>
\prob{\text{QUALITY} = 1}
\\]
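The four-step argument can be checked with a little arithmetic. The sketch below assumes a toy world where the two words are conditionally independent given the class by construction; all probabilities are made up for illustration.

```python
# Hypothetical class prior and per-class word probabilities.
p_class = {"spam": 0.3, "ham": 0.7}
p_investment = {"spam": 0.20, "ham": 0.01}
p_quality = {"spam": 0.15, "ham": 0.02}

# Marginal probabilities, summing over the two classes.
p_q = sum(p_class[c] * p_quality[c] for c in p_class)
p_i = sum(p_class[c] * p_investment[c] for c in p_class)

# Joint probability of both words. Within each class the words are treated as
# independent, so the per-class joint is just the product.
p_q_and_i = sum(p_class[c] * p_quality[c] * p_investment[c] for c in p_class)

p_q_given_i = p_q_and_i / p_i

print(f"Pr[QUALITY = 1]                  = {p_q:.4f}")
print(f"Pr[QUALITY = 1 | INVESTMENT = 1] = {p_q_given_i:.4f}")
```

Even though the words are independent within each class, seeing "investment" raises the probability of "quality", because it shifts our belief toward the spam class.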


### Reason for Conditional Independence

I've said that "investment" indicates spam, and spam indicates "quality", so "investment" indicates "quality."

The reason this happens is entirely because "investment" tells me what kind of email this is. Let's say I start out knowing whether an email is spam. My question then is, if I learn the email contains the word "investment", will my belief in the probability that the word "quality" appears change?

That is:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\stackrel{?}{=}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
\\]

My old argument doesn't apply anymore. Even though "investment" normally indicates an email is spam, in this case I started out knowing that the email was spam. So learning that "investment" is in the email doesn't add *anything* to my knowledge about whether the email is spam in this case: I already knew.

In that case, "investment" doesn't change my belief in the probability of the word "quality" being present. Unless some other new reasoning applies, the equation above is true:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
=
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
\\]

In that case, "quality" is conditionally independent of "investment" given that an email is spam.


### Examples When Conditional Independence Is Violated

The question then becomes: does the presence of the word "investment" change the probability of "quality" appearing for *any other* reason?

Now, it may be possible that the words "quality" and "investment" often appear in the same spam emails because "investment" frequently appears as part of the compound phrase "a quality investment".

That is, "investment" may indicate "quality" for a reason that isn't merely through "investment" indicating an email is "spam." It may be because *those words simply go together*.

If that were true, then:

\\[
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
>
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\\]

This is the kind of thing we're assuming *doesn't happen.*

On the other hand, considering another pair of words, it may be that "baldness" is less likely given the presence of the word "investment" in spam emails. That might be because a spam email either pitches an investment or a baldness cure, but not typically both.

In that case, the presence of "investment" might *inhibit* "baldness." Those words *don't* go together.

If this is true:

\\[
\prob{\text{BALDNESS} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
<
\prob{\text{BALDNESS} = 1 \condbar \text{SPAM} = 1}
\\]

Again: we will assume that *does not happen.*
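Both kinds of violation can be measured directly by counting. This is a minimal sketch over a few invented spam bags of words; the data and the helper `cond_prob` are assumptions made for illustration.

```python
# Hypothetical spam emails, each reduced to its set of words.
spam_emails = [
    {"quality", "investment", "now"},
    {"quality", "investment", "returns"},
    {"investment", "quality"},
    {"investment", "stocks"},
    {"baldness", "cure"},
    {"baldness", "quality"},
    {"watches", "cheap"},
    {"pills", "cheap"},
]


def cond_prob(emails, word, given=frozenset()):
    """Fraction of emails containing every word in `given` that also contain `word`."""
    pool = [e for e in emails if given <= e]
    return sum(word in e for e in pool) / len(pool)


for word in ("quality", "baldness"):
    base = cond_prob(spam_emails, word)
    with_investment = cond_prob(spam_emails, word, {"investment"})
    print(f"{word}: {base:.2f} among spam, {with_investment:.2f} among spam with 'investment'")
```

In this toy data "investment" makes "quality" more likely and "baldness" less likely, so neither pair is conditionally independent given spam.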

### Let's Just Pretend

We've seen that, independent of their effect at hinting at whether an email is spam, a word like "investment" may (1) indicate the presence of "quality" or (2) indicate the absence of "baldness."

We've said we're just going to assume that doesn't happen. We will assume:

\\[
\begin{align}
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
&=
\prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
\\
\prob{\text{BALDNESS} = 1 \condbar \text{SPAM} = 1 \wedge \text{INVESTMENT} = 1}
&=
\prob{\text{BALDNESS} = 1 \condbar \text{SPAM} = 1}
\end{align}
\\]

It is typically the case that pairs of features are *not* conditionally independent. So our assumption is typically wrong. What's incredible is that we'll see that the Naive Bayes model can still do a good job with this false assumption.

This is what makes Naive Bayes *naive*. Remember why we did this in the first place: using the naive assumption, we can calculate the odds that an email is spam using all the words in an email, while still avoiding the problem that as we include more words, it becomes rarer for all those words to occur together.

In sum: Naive Bayes lets us use all the features, in a way which avoids inaccurate probability estimates, except that we base our model on an untrue assumption about conditional independence. And everything turns out okay!


## The Naive Bayes Model

Let's remember where we have gotten to. By assuming conditional independence, we now know:

\\[
\begin{align}
&\frac{
    \prob{\text{SPAM} = 1 \condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}{
    \prob{\text{SPAM} = 0 \condbar \text{INVESTMENT} = 1 \wedge \text{QUALITY} = 1}
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{INVESTMENT} = 1 \condbar \text{SPAM} = 0}
}
\cdot
\frac{
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{\text{QUALITY} = 1 \condbar \text{SPAM} = 0}
}
&
\end{align}
\\]

More generally, let's use *all* the words. Let's number the words. Let's say that $W_1$ means "OFFER", $W_2$ means "LIMITED", et cetera. Let's say there are $N$ words in the *vocabulary*.

An email's bag of words contains a *subset* of those $N$ words. Let's say the email contains $k$ unique words. Let's say they are words number $i_1, i_2, \ldots, i_k$. Each $i_j$ is a different index of a different word in the email.

Let's use our Naive Bayes equation now to calculate the odds that a randomly sampled email with these words would be spam:

\\[
\begin{align}
&
\frac{
    \prob{\text{SPAM} = 1 \condbar W_{i_1} = 1, \ldots, W_{i_k} = 1 }
}{
    \prob{\text{SPAM} = 0 \condbar W_{i_1} = 1, \ldots, W_{i_k} = 1 }
}
&
\\
&=
\frac{
    \prob{\text{SPAM} = 1}
}{
    \prob{\text{SPAM} = 0}
}
\cdot
\frac{
    \prob{W_{i_1} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{W_{i_1} = 1 \condbar \text{SPAM} = 0}
}
\cdot
\frac{
    \prob{W_{i_2} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{W_{i_2} = 1 \condbar \text{SPAM} = 0}
}
\cdots
\frac{
    \prob{W_{i_k} = 1 \condbar \text{SPAM} = 1}
}{
    \prob{W_{i_k} = 1 \condbar \text{SPAM} = 0}
}
&
\end{align}
\\]

This shows how to calculate the odds that an email is spam. Take the base odds, and then just multiply by each feature probability ratio for each word in the email.
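A practical aside, offered as a sketch rather than as part of the notebook's model: when an email has many words, a long product of ratios can overflow or underflow floating point, so a common trick is to add logarithms instead of multiplying. The function `spam_log_odds` and the example ratios are illustrative assumptions. Note that a ratio of exactly zero or infinity would need special handling before taking a logarithm.

```python
import math


def spam_log_odds(prior_spam, ratios):
    """Sum of the log base odds and the log of each feature probability ratio."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for ratio in ratios:
        log_odds += math.log(ratio)
    return log_odds


# Illustrative ratios: two spammy words and one word that is twice as common in ham.
ratios = [21.36, 0.5, 11.24]

direct = (0.25 / 0.75) * 21.36 * 0.5 * 11.24
via_logs = math.exp(spam_log_odds(0.25, ratios))

print(f"direct product: {direct:.4f}")
print(f"via log-odds:   {via_logs:.4f}")
```

Both routes give the same odds; the log form just stays well behaved when there are thousands of factors. A positive log-odds means the email is more likely spam than ham.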

Below I wrote some code to do this. Let's go!

In [17]:
import numpy as np

class NaiveBayesModel:
    def __init__(self, fps, use_negative_features):
        self.fps = fps
        self.use_negative_features = use_negative_features

    def _base_spam_score(self):
        return (
            self.fps.class_prior_probs().spam_prior_prob
            / self.fps.class_prior_probs().ham_prior_prob
        )

    # This calculates the prob_ratios once so that we don't need to
    # repeat this relatively costly computation
    def _build_feature_weights(self):
        # Note that because of filtering, some entries will remain
        # 1.0. That's fine.
        positive_code_prob_ratios = np.ones(
            max(self.fps.code_counts.keys()) + 1
        )
        negative_code_prob_ratios = np.ones(
            len(positive_code_prob_ratios)
        )

        for code in self.fps.code_counts:
            positive_code_prob_ratios[code] = (
                self.fps.code_prob_ratio(code).feature_probability_ratio
            )
            negative_code_prob_ratios[code] = (
                self.fps.no_code_prob_ratio(code).feature_probability_ratio
            )

        return (positive_code_prob_ratios, negative_code_prob_ratios)

    def score_email(
        self,
        email,
        base_spam_score,
        positive_code_prob_ratios,
        negative_code_prob_ratios
    ):
        spam_score = base_spam_score
        for code in self.fps.code_counts.keys():
            if code in email.codes:
                spam_score *= positive_code_prob_ratios[code]
            elif self.use_negative_features:
                spam_score *= negative_code_prob_ratios[code]

        return spam_score

    def score_emails(self, emails):
        base_spam_score = self._base_spam_score()
        (positive_code_prob_ratios, negative_code_prob_ratios) = (
            self._build_feature_weights()
        )

        return map(
            lambda email: self.score_email(
                email = email,
                base_spam_score = base_spam_score,
                positive_code_prob_ratios = positive_code_prob_ratios,
                negative_code_prob_ratios = negative_code_prob_ratios,
            ),
            emails
        )


This code below will test out how good a job our Naive Bayes model does at predicting whether an email is spam.

In [20]:
import numpy as np

# Helper class (see below)
class RecallResult:
    def __init__(self, score_cutoff, num_spams_identified, recall):
        self.score_cutoff, self.num_spams_identified, self.recall = (
            score_cutoff, num_spams_identified, recall
        )

# Determines what percentage of spam emails are detected if we can tolerate a given false positive rate.
# Does this for multiple false positive rate limits.
def recall_for_false_positive_rates(model, dataset, limits):
    ham_scores = list(model.score_emails(dataset.ham_emails))
    ham_scores.sort(key = lambda score: -score)
    spam_scores = list(model.score_emails(dataset.spam_emails))

    def calculate_result(limit):
        score_cutoff = ham_scores[int(len(ham_scores) * limit)]
        num_spams_identified = sum(
            [1 if s > score_cutoff else 0 for s in spam_scores]
        )
        recall = (
            num_spams_identified / len(dataset.spam_emails)
        )

        return RecallResult(
            score_cutoff = score_cutoff,
            num_spams_identified = num_spams_identified,
            recall = recall,
        )

    return [
        (limit, calculate_result(limit)) for limit in limits
    ]

Let's run the code and see how it does!

In [21]:
DATASET = Dataset.get()

model = NaiveBayesModel(
    filter_feature_probabilities(
        FeatureProbabilities.from_dataset(DATASET),
        reach_limit = 100
    ),
    use_negative_features = False
)

FALSE_POSITIVE_RATES = [0.001, 0.01, 0.02, 0.04, 0.08, 0.16]
results = recall_for_false_positive_rates(
    model,
    DATASET,
    FALSE_POSITIVE_RATES
)

for (false_positive_rate, result) in results:
    print(f"False Positive Rate {false_positive_rate:0.3f} | Recall {result.recall:0.2f}")

False Positive Rate 0.001 | Recall 0.08
False Positive Rate 0.010 | Recall 0.62
False Positive Rate 0.020 | Recall 0.88
False Positive Rate 0.040 | Recall 0.98
False Positive Rate 0.080 | Recall 1.00
False Positive Rate 0.160 | Recall 1.00


The *false positive rate* is the percentage of ham emails that were marked as spam. *Recall* is the percentage of spam emails that were identified as spam. Recall is the same as the *true positive rate*. Obviously the ideal is to have a false positive rate of zero and a recall of one.

These results are encouraging. They say that if we're okay with marking one in a hundred ham emails as spam, we'll catch 62% of the spam. And if we are okay with two in a hundred ham emails, we'll catch 88% of spam.

This isn't quite good enough for real life systems, but it isn't that bad!
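To pin down the two definitions, here is a small sketch that computes both rates for one score cutoff. The scores and the function name `false_positive_rate_and_recall` are hypothetical; an email is marked as spam when its score exceeds the cutoff.

```python
def false_positive_rate_and_recall(ham_scores, spam_scores, cutoff):
    """Return (false positive rate, recall) when scores above cutoff are marked spam."""
    false_positives = sum(1 for s in ham_scores if s > cutoff)
    true_positives = sum(1 for s in spam_scores if s > cutoff)
    return (
        false_positives / len(ham_scores),
        true_positives / len(spam_scores),
    )


# Made-up spam scores (odds) for four ham and four spam emails.
ham_scores = [0.01, 0.2, 0.5, 3.0]
spam_scores = [0.8, 5.0, 40.0, 120.0]

fpr, recall = false_positive_rate_and_recall(ham_scores, spam_scores, cutoff=1.0)
print(f"False Positive Rate {fpr:0.2f} | Recall {recall:0.2f}")
```

Raising the cutoff lowers the false positive rate but also lowers recall, which is the tradeoff the table above walks through.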

### Word Absence (Sometimes) Matters Too!

See that `use_negative_features = False`? Right now we've only been using words that were observed: words whose presence indicates spam or not spam.

Another question is what about the *absence* of words? What if all spam emails contain a word? If it is absent, then we know this is a ham email. But our calculation isn't using that kind of information.

The way we've written our equations, it's as if we know that some words are present, but *don't know* whether the other words are *absent*. So let's rewrite slightly. Assume that $w_i = 1$ when the word $W_i$ is present, and $w_i = 0$ otherwise. Then:

\\[
\frac{
    \prob{W_1 = w_1, \ldots, W_{\card{W}} = w_{\card{W}} \condbar \text{SPAM} = 1}
}{
    \prob{W_1 = w_1, \ldots, W_{\card{W}} = w_{\card{W}} \condbar \text{SPAM} = 0}
}
=
\frac{
    \prob{W_1 = w_1 \condbar \text{SPAM} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{SPAM} = 0}
}
\cdots
\frac{
    \prob{W_{\card{W}} = w_{\card{W}} \condbar \text{SPAM} = 1}
}{
    \prob{W_{\card{W}} = w_{\card{W}} \condbar \text{SPAM} = 0}
}
\\]
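In code, the factor for an absent word uses the probability that the word is *missing* in each class, which is one minus the probability that it is present. The sketch below assumes a tiny made-up vocabulary; `score_with_absence` and its inputs are illustrative names, not the notebook's API.

```python
def score_with_absence(email_words, vocabulary_probs, prior_odds):
    """Spam odds using one factor per vocabulary word, present or absent.

    vocabulary_probs maps word -> (P(word present | spam), P(word present | ham)).
    """
    odds = prior_odds
    for word, (p_spam, p_ham) in vocabulary_probs.items():
        if word in email_words:
            odds *= p_spam / p_ham
        else:
            odds *= (1.0 - p_spam) / (1.0 - p_ham)
    return odds


# Hypothetical probabilities: "enron" is rare in spam and common in ham.
vocabulary_probs = {
    "enron": (0.02, 0.40),
    "offer": (0.30, 0.05),
}

odds = score_with_absence({"offer"}, vocabulary_probs, prior_odds=1.0)
print(f"spam odds: {odds:.2f}:1")
```

The missing "enron" contributes a factor of 0.98 / 0.60, so its absence pushes the odds toward spam even though the word never appears in the email.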

Let's see if the use of these features can help our model predict spam:

In [22]:
DATASET = Dataset.get()

model = NaiveBayesModel(
    filter_feature_probabilities(
        FeatureProbabilities.from_dataset(DATASET),
        reach_limit = 100
    ),
    use_negative_features = True
)

FALSE_POSITIVE_RATES = [0.001, 0.01, 0.02, 0.04, 0.08, 0.16]
results = recall_for_false_positive_rates(
    model,
    DATASET,
    FALSE_POSITIVE_RATES
)

for (false_positive_rate, result) in results:
    print(f"False Positive Rate {false_positive_rate:0.3f} | Recall {result.recall:0.2f}")

False Positive Rate 0.001 | Recall 0.09
False Positive Rate 0.010 | Recall 0.71
False Positive Rate 0.020 | Recall 0.93
False Positive Rate 0.040 | Recall 0.98
False Positive Rate 0.080 | Recall 1.00
False Positive Rate 0.160 | Recall 1.00


This seems to have made a fairly substantial improvement! Let's look at what the best word omission features were!

In [25]:
DATASET = Dataset.get()

filtered_feature_probabilities = filter_feature_probabilities(
    FeatureProbabilities.from_dataset(DATASET),
    reach_limit = 100
)

print("===BEST SPAM OMISSION FEATURES===")
best_spam_features = FeatureProbabilitiesExplorer.best_spam_features(
    filtered_feature_probabilities,
    present_features = False
)
FeatureProbabilitiesExplorer.print_features_list(
    best_spam_features, DATASET.word_encoding_dictionary
)

print("===BEST HAM OMISSION FEATURES===")
best_ham_features = FeatureProbabilitiesExplorer.best_ham_features(
    filtered_feature_probabilities,
    present_features = False
)
FeatureProbabilitiesExplorer.print_features_list(
    best_ham_features, DATASET.word_encoding_dictionary
)


===BEST SPAM OMISSION FEATURES===
0 | Subject: | reach: {'ham_count': 0, 'spam_count': 0} | feature_probability_ratio: inf:1
9 | . | reach: {'ham_count': 131, 'spam_count': 253} | feature_probability_ratio: 4.73:1
7 | , | reach: {'ham_count': 447, 'spam_count': 423} | feature_probability_ratio: 2.32:1
42 | for | reach: {'ham_count': 893, 'spam_count': 678} | feature_probability_ratio: 1.86:1
100 | enron | reach: {'ham_count': 2210, 'spam_count': 1500} | feature_probability_ratio: 1.66:1
610 | 2000 | reach: {'ham_count': 2162, 'spam_count': 1460} | feature_probability_ratio: 1.65:1
55 | / | reach: {'ham_count': 1270, 'spam_count': 836} | feature_probability_ratio: 1.61:1
24 | on | reach: {'ham_count': 1554, 'spam_count': 978} | feature_probability_ratio: 1.54:1
19 | - | reach: {'ham_count': 893, 'spam_count': 558} | feature_probability_ratio: 1.53:1
70 | cc | reach: {'ham_count': 2393, 'spam_count': 1490} | feature_probability_ratio: 1.52:1
73 | subject | reach: {'ham_count': 2275, 'spa

Spam emails seem to be more likely to lack punctuation. And it looks like spam emails tend not to mention Enron: this is clearly specific to our training dataset!

It looks clear that emails that don't contain a link are more likely to be ham. That's presumably because spammers want you to go to their website.

### Training And Test Sets

The way we are measuring performance is a little too lenient. We are supposed to train a model that predicts whether future emails are ham/spam, but we're measuring performance on the *training set*. This is often problematic because sometimes models will just "memorize" the training dataset in a way that doesn't lead to any future good performance.

For instance, say our model was able to record an exact map from a bag of words to a label of ham or spam. Then, since every email in the training set probably has a unique bag of words, the model would be able to just record an exact mapping of email to label. But when we go to evaluate new emails, new emails won't match any of those bags of words. Thus there would be no ability to predict labels for future emails.
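That hypothetical memorizing model is easy to sketch. The class `MemorizingModel`, the helper `accuracy`, and the toy emails below are all invented for illustration.

```python
class MemorizingModel:
    """Stores an exact map from a bag of words to its label, and nothing else."""

    def __init__(self):
        self.table = {}

    def fit(self, examples):
        for words, is_spam in examples:
            self.table[frozenset(words)] = is_spam

    def predict(self, words):
        # Returns None when this exact bag of words was never seen in training.
        return self.table.get(frozenset(words))


def accuracy(model, examples):
    correct = sum(model.predict(words) == label for words, label in examples)
    return correct / len(examples)


train = [
    ({"offer", "now"}, True),
    ({"meeting", "tuesday"}, False),
    ({"cheap", "pills"}, True),
]
test = [
    ({"offer", "today"}, True),
    ({"meeting", "friday"}, False),
]

model = MemorizingModel()
model.fit(train)
print(f"training accuracy: {accuracy(model, train):.2f}")
print(f"test accuracy:     {accuracy(model, test):.2f}")
```

The model is perfect on the emails it memorized and useless on new ones, which is exactly what measuring on the training set would hide.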

To make sure our model *generalizes* well, it is common to split our data into two parts: the *training set* and the *test set*. The training set is fed to the machine learning algorithm, and then we use the test set to measure performance of the learned model. Since the ML algorithm never has seen the test set before, this should be a fair test of its ability to detect spam.

Let's train on 80% of the data, and leave 20% for testing. There is a conflict of interest when picking these proportions. The more data you train on, the better your model will be. But the more data in your testing set, the more accurate your estimate of how the model will generalize.

When you have more data, you might use a smaller fraction for testing, figuring this amount will still be sufficient. On the other hand, 80/20 is a pretty common ratio to use.

There are fancy techniques like [cross-validation][0], but I won't talk about those here.

[0]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

In [30]:
import zlib

class DatasetSplitter:
    @classmethod
    def split(cls, dataset, ratio):
        datasetA = cls._split(dataset, ratio, 0)
        datasetB = cls._split(dataset, ratio, 1)
        return (datasetA, datasetB)

    @classmethod
    def _split(cls, dataset, ratio, mode):
        split_ham_emails, split_spam_emails = [], []
        emails_pairs = [
            (dataset.ham_emails, split_ham_emails),
            (dataset.spam_emails, split_spam_emails)
        ]

        for (emails, split_emails) in emails_pairs:
            for email in emails:
                # This is a fancy way to pseudorandomly but
                # deterministically select emails. That way we always
                # pick the same set of emails for reproducibility
                # across program runs.
                h = zlib.crc32(email.path.encode())
                p = h / (2**32 - 1)
                if (mode == 0 and p < ratio) or (mode == 1 and p >= ratio):
                    split_emails.append(email)

        return Dataset(
            dataset.word_encoding_dictionary,
            ham_emails = split_ham_emails,
            spam_emails = split_spam_emails
        )


In [32]:
DATASET = Dataset.get()
(training_set, test_set) = DatasetSplitter.split(DATASET, ratio = 0.80)

print(f"total_number of emails: {len(DATASET.ham_emails) + len(DATASET.spam_emails)}")
print(f"number of training emails: {len(training_set.ham_emails) + len(training_set.spam_emails)}")
print(f"number of test emails: {len(test_set.ham_emails) + len(test_set.spam_emails)}")

model = NaiveBayesModel(
    filter_feature_probabilities(
        FeatureProbabilities.from_dataset(training_set),
        reach_limit = 100
    ),
    use_negative_features = True
)

FALSE_POSITIVE_RATES = [0.001, 0.01, 0.02, 0.04, 0.08, 0.16]
results = recall_for_false_positive_rates(
    model,
    test_set,
    FALSE_POSITIVE_RATES
)

for (false_positive_rate, result) in results:
    print(f"False Positive Rate {false_positive_rate:0.3f} | Recall {result.recall:0.2f}")


total_number of emails: 5172
number of training emails: 4181
number of test emails: 991
False Positive Rate 0.001 | Recall 0.08
False Positive Rate 0.010 | Recall 0.74
False Positive Rate 0.020 | Recall 0.92
False Positive Rate 0.040 | Recall 0.97
False Positive Rate 0.080 | Recall 0.98
False Positive Rate 0.160 | Recall 0.98


### Naive Bayes Generalizes Well

As you can see, the Naive Bayes model has generalized very well to the test set. The recall numbers at each false positive rate are mostly in line with those calculated over the entire dataset. That shows that we aren't just memorizing the training dataset.

Overfitting is not normally a major problem for Naive Bayes models. This is because Naive Bayes is very simple. The way overfitting normally happens with other models is that parameters are carefully set so that if just the right set of features (those of the memorized example) come in, then the right answer goes out (the label of the memorized example).

To do this, a model typically needs to coordinate many parameters so they are "just right" for the training example.

However, the Naive Bayes model does no such coordination. It sets each parameter separately. It has no way to tweak parameters so they have combined effects that are any different than their individual effects.

Just because Naive Bayes is good at avoiding overfitting doesn't mean that it always gives high performance. It relies on this assumption that we *know* isn't true: that the presence of a word $w_i$ is independent of a word $w_j$ given the class of the example. In many contexts, understanding these interaction effects will be very important.

For instance, if we want to classify emails based on *sentiment*, then words like "good", "great", "awesome" are positive signals, while "terrible", "bad", "awful" are negative signals. But what about "This movie is not good?" How is the Naive Bayes model supposed to know this is negative? It can't just make "not" a very negative word, because I could also say: "This movie is not bad."
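One rough way to see the difficulty is to write the model in log-odds form, where each word adds a fixed weight to the score. The symbols here are hypothetical names introduced for this sketch: $b$ is the log base odds of positive sentiment, $w_{\text{not}}$, $w_{\text{good}}$ and $w_{\text{bad}}$ are the per-word log ratios, and the words shared by both sentences are ignored. To get both sentences right we would need:

\\[
\begin{align}
b + w_{\text{not}} + w_{\text{good}} &< 0
\\
b + w_{\text{not}} + w_{\text{bad}} &> 0
\end{align}
\\]

Subtracting the first line from the second gives $w_{\text{bad}} > w_{\text{good}}$. But to treat "good" on its own as positive and "bad" on its own as negative, we need:

\\[
b + w_{\text{good}} > 0 > b + w_{\text{bad}}
\\]

which gives $w_{\text{good}} > w_{\text{bad}}$. No choice of per-word weights satisfies both requirements, because the effect of "not" depends on the word it is paired with.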

### Bias Versus Variance

We will later explore a classic tradeoff of machine learning models: *bias* versus *variance*. We say a model class is very *biased* if it imposes a very simple, inflexible structure on the model. For instance, a model class of linear models is much more biased than the model class of all polynomial functions.

On the other hand, we say a model class has *high* variance if we learn very different models depending on what data we feed it. For instance: if I feed the first, third, et cetera emails into the Naive Bayes model, I will get mostly the same model as feeding in the second, fourth, et cetera emails. We say Naive Bayes has *low* variance. The idea is that the Naive Bayes model doesn't vary much with the sample you feed it.
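The odd/even experiment can be sketched for a single Naive Bayes parameter: the estimated probability that one word appears. The simulation below is a toy under stated assumptions (a made-up true frequency of 0.3, 2000 synthetic emails, a fixed random seed), not a measurement on the Enron data.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
true_p = 0.3            # hypothetical true frequency of the word

# One boolean per synthetic email: does it contain the word?
contains_word = [rng.random() < true_p for _ in range(2000)]

# Split into the first, third, ... emails and the second, fourth, ... emails.
odd_half = contains_word[0::2]
even_half = contains_word[1::2]

est_odd = sum(odd_half) / len(odd_half)
est_even = sum(even_half) / len(even_half)

print(f"estimate from odd emails:  {est_odd:.3f}")
print(f"estimate from even emails: {est_even:.3f}")
```

With a thousand emails in each half, the two estimates should land close to each other and to the true frequency. Since each Naive Bayes parameter is a simple frequency like this one, the two fitted models end up nearly the same.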

Models that exhibit high variance do so because they have more *capacity*. That means: they can capture more complex relationships. This additional flexibility often allows the model to capture relationships that "aren't really there."

We saw an example of this before with fitting the line to the Gaussian noise. A relationship like this is called *spurious*. Luckily, a linear model is very biased (has low capacity), so the spurious relationship the linear model thinks it found in the noise is very weak.

The ability to capture spurious relationships often means that a model *thinks* it is doing a good job on the training dataset, but when we test it on the test dataset, the model fails badly. This happens because the things it thought it learned turn out to be false.

All else equal, you want to minimize bias in your model class. Bias is basically simplifying assumptions about what kind of model explains the data. But those assumptions are typically not quite correct, and so your model may not perform well if your assumptions are too simplistic.

On the other hand, when you have small amounts of data, you will prefer models with less capacity, because training a high capacity model with small amounts of data leads to high variance in outcome. Basically: there isn't enough data to convince the model not to use its additional capacity to model noise.