# Chapter 13. Naive Bayes

In [135]:
from __future__ import division
from collections import Counter, defaultdict
from machine_learning import split_data
import math, random, re, glob

DataSciencester has a popular feature that allows members to send messages to other members.  
The VP of DataSciencester has tasked you with building a spam filter for those messages.

## A Really Dumb Spam Filter

Remember Bayes' Theorem?  

${\large\displaystyle P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}}$  

where A and B are events and P(B) ≠ 0.

Now, imagine a 'universe' that consists of receiving a message chosen randomly from all possible messages.  
Let **`S`** be the event "the message is spam" and **`V`** be the event "the message contains the word *viagra*."  
Bayes' Theorem tells us that the probability that the message is spam *conditional* on containing the word viagra is:  

${\large\displaystyle P(S\mid V)={\frac {P(V\mid S)\,P(S)}{P(V\mid S)\,P(S) + P(V\mid \neg S)P(\neg S)}}}$  


The numerator is the probability that a message is spam *and* contains 'viagra', while the denominator is the probability that a message contains 'viagra'.  
Think of this calculation as representing the proportion of 'viagra' messages that are spam.

If we have a large corpus of messages that we know are spam, and a large collection of messages that we know are *not* spam, then we can estimate ${P(V\mid S)}$ and ${P(V\mid \neg S)}$.  
If we further assume that any message is equally likely to be spam or not-spam ( ${P(S) = 0.5}$ and ${P(\neg S) = 0.5}$ ), then:  

${\large\displaystyle P(S\mid V)={\frac {P(V\mid S)\,}{P(V\mid S) + P(V\mid \neg S)}}}$

For example, if 50% of spam messages have the word *viagra*, but only 1% of nonspam messages do, then the probability that any given *viagra*-containing email is spam is:  

${\large\displaystyle {\frac {0.5}{0.5 \,+\, 0.01} \approx {98\%}}}$
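
To make the arithmetic concrete, here is a minimal sketch of the calculation above. The function name `posterior_spam` and its `p_spam` prior parameter are made up for illustration; they are not part of the chapter's code.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """P(S | V) via Bayes' Theorem; p_spam is the prior P(S) (hypothetical helper)."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    return numerator / denominator

# 50% of spam and 1% of nonspam messages contain the word, equal priors
print(posterior_spam(0.5, 0.01))              # roughly 0.98
# same word frequencies, but assume only 10% of all messages are spam
print(posterior_spam(0.5, 0.01, p_spam=0.1))  # roughly 0.85
```

The second call shows how much the equal-priors assumption matters: with a lower prior, the same evidence gives a noticeably lower spam probability.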

## A More Sophisticated Spam Filter

Imagine that we have a vocabulary, or corpus, of many words $w_1, w_2, ..., w_n$.  
To move this into the realm of probability theory, we'll write $X_i$ for the event "a message contains the word $w_i$."  
Also imagine that we've come up with an estimate ${P(X_i \mid S)}$ for the probability that a spam message contains the *i*th word, and a similar estimate ${P(X_i \mid \neg S)}$ for the probability that a non-spam message contains the *i*th word.

The key to Naive Bayes is the assumption that the presence (or absence) of each word is independent of the presence (or absence) of every other word, conditional on a message being spam or not.  
Intuitively, this assumption means that knowing whether a certain spam message contains the word *viagra* gives you no information about whether or not that same message contains the word *rolex*.  
In math terms, this means that:  
${P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n \mid S) = P(X_1 = x_1 \mid S)\;\times\;...\;\times\;P(X_n = x_n \mid S)}$

This is an extreme assumption.  
Imagine that our vocabulary consists *only* of the words 'viagra' and 'rolex', and that half of all spam messages are for 'cheap viagra' and that the other half are for 'authentic rolex'.  
In this case, the Naive Bayes estimate that a spam message contains both *viagra* and *rolex* is:  

${P(X_1 = 1, X_2 = 1 \mid S) = P(X_1 = 1 \mid S)P(X_2 = 1 \mid S) = .5 \times .5 = .25}$  

since we've assumed away the knowledge that *viagra* and *rolex* actually never occur together.  
Although this assumption may seem unrealistic and unreasonable, this model often performs well and is used in actual spam filters.
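
The 'viagra'/'rolex' example can be checked with a few lines of code. This is only a sketch on a made-up toy corpus (`spam_messages` and `fraction_containing` are hypothetical names), comparing the Naive Bayes estimate with what actually happens in the data.

```python
# Hypothetical toy corpus: half the spam is "cheap viagra", half "authentic rolex"
spam_messages = ["cheap viagra", "authentic rolex"] * 50

def fraction_containing(words, messages):
    """Fraction of messages that contain every word in `words`."""
    hits = sum(1 for m in messages if all(w in m.split() for w in words))
    return hits / float(len(messages))

p_viagra = fraction_containing(["viagra"], spam_messages)
p_rolex = fraction_containing(["rolex"], spam_messages)

# Naive Bayes multiplies the two marginal probabilities...
naive_joint = p_viagra * p_rolex
# ...while in this corpus the two words never appear together
actual_joint = fraction_containing(["viagra", "rolex"], spam_messages)

print(naive_joint)   # 0.25
print(actual_joint)  # 0.0
```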

The same Bayes' Theorem reasoning we used for our 'viagra-only' spam filter tells us that we can calculate the probability that a message is spam using the equation:  

${\normalsize\displaystyle {P(S \mid X = x)} = {\frac {P(X = x \mid S)}{P(X = x \mid S) + P(X = x \mid \neg S)}}}$  

The Naive Bayes assumption allows us to calculate each of the probabilities on the right simply by multiplying together the individual probability estimates for each vocabulary word.

In practice, you usually want to avoid multiplying lots of probabilities together, to avoid a problem called [underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow).  
Basically, underflow is a result of computers not dealing well with floating-point numbers that are too close to zero.  
Recalling from algebra that ${\log(ab) = \log a + \log b}$ and that ${\exp(\log x) = x}$, we usually calculate ${p_1 \times p_2 \times \dots \times p_n}$ as the equivalent:  

${\large\displaystyle {\exp\,(\log(p_1)\;+\;...\;+\;\log(p_n))}}$  
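
A quick way to see underflow in action is to multiply many small numbers and compare the result with the sum of their logs. The probabilities below are invented purely for illustration.

```python
import math

probs = [1e-5] * 100   # hypothetical per-word probabilities

# naive product: the true value is 1e-500, far below the smallest positive double
product = 1.0
for p in probs:
    product *= p

# sum of logs: an ordinary, perfectly representable number
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

Note that calling `math.exp` on a sum this negative would itself return 0.0, so in extreme cases it can be safer to compare the log values directly and exponentiate only at the end, if at all.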

The only challenge left is coming up with estimates for ${P(X_i \mid S)}$ and ${P(X_i \mid \neg S)}$, which are the probabilities that a spam message (or nonspam message) contains the word ${w_i}$.  
If we have a fair number of 'training' messages labeled as spam and not-spam, an obvious first try is to estimate ${P(X_i \mid S)}$ simply as the fraction of spam messages containing word ${w_i}$.

This causes a big problem, though.  
Imagine that in our training set the word 'data' only occurs in nonspam messages.  
In that case, we would estimate ${P("data" \mid S) = 0}$.  
The result is that our Naive Bayes classifier would always assign spam probability 0 to *any* message containing the word 'data', even a message like "data on cheap viagra and authentic rolex watches."  
To avoid this problem, we usually use some kind of [smoothing](https://en.wikipedia.org/wiki/Additive_smoothing).  
In particular, we'll choose a [pseudocount](https://en.wikipedia.org/wiki/Pseudocount) -- *k* -- and estimate the probability of seeing the *i*th word in a spam message as:  

${\large P(X_i \mid S) = {\frac {k\;+\;\text{number of spam messages containing } w_i}{2k\;+\;\text{number of spam messages}}}}$

Similarly for ${P(X_i \mid \neg S)}$.  
When calculating the spam probabilities for the *i*th word, we assume that we also saw *k* additional spams containing the word and *k* additional spams *not* containing the word.  
For example, if 'data' occurs in 0/98 spam emails, and if *k* is 1, we can estimate:  
${P("data" \mid S)}$ as 1/100 = 0.01,  
which allows our classifier to still assign some nonzero spam probability to messages that contain the word 'data'.
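
As a sanity check on the smoothing formula, here is a tiny sketch that reproduces the 'data' example. The helper name `smoothed_estimate` is an assumption made for this illustration.

```python
def smoothed_estimate(num_containing, num_messages, k=1.0):
    """(k + count) / (2k + total): additive smoothing with pseudocount k."""
    return (num_containing + k) / (num_messages + 2.0 * k)

# 'data' appears in 0 of 98 spam messages
print(smoothed_estimate(0, 98, k=1))   # 1/100 = 0.01
print(smoothed_estimate(0, 98, k=0))   # 0.0, the unsmoothed estimate
```

Larger values of *k* pull every estimate toward 0.5, so *k* controls how much we trust the training counts.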

Everything in this section is quite a bit to take in, so read it again before moving on to the next section.

## Implementation

Let's build this thing.  
First, we'll create a function to tokenize messages into distinct words by:
- converting each message to lowercase,
- using `re.findall()` to extract the 'words' consisting of letters, numbers, and apostrophes,
- using `set()` to get just the distinct words.

In [136]:
def tokenize(message):
    # convert to lowercase
    message = message.lower()
    # extract the words
    all_words = re.findall("[a-z0-9']+", message)
    # remove duplicates
    return set(all_words)

Our second function will count the words in a labeled training set of messages.  
We'll have it return a dictionary whose keys are words, and whose values are two-element lists `[spam_count, non_spam_count]` recording how many spam messages and how many non-spam messages contained that word:

In [137]:
def count_words(training_set):
    """ training set consists of pairs (message, is_spam) """
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

The next step is to turn these counts into estimated probabilities using the smoothing described above.  
The function will return a list of triplets containing
- each word, 
- the probability of seeing that word in a spam message, 
- and the probability of seeing that word in a non-spam message:

In [138]:
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """ turn the word_counts into a list of triplets w, p(w|spam), and p(w|not_spam) """
    return [(w,
            (spam + k) / (total_spams + 2 * k),
            (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]

The last piece is to use these word probabilities (and our Naive Bayes assumptions) to assign probabilities to messages:

In [139]:
def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0
    # iterate through each word in the corpus/vocabulary
    for word, prob_if_spam, prob_if_not_spam in word_probs:
        # if *word* appears in the message, add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)
        # if *word* doesn't appear in the message, add the log probability of *not*
        # seeing it, which is log(1 - probability of seeing it)
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
        
    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
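
One caveat about the last three lines of `spam_probability`: with a large vocabulary, both `math.exp` calls can underflow to 0.0, and the final division then fails. A possible workaround, sketched here under the hypothetical name `posterior_from_logs`, is to work with the *difference* of the two log probabilities instead. This is a variant for illustration, not the chapter's implementation.

```python
import math

def posterior_from_logs(log_prob_if_spam, log_prob_if_not_spam):
    """P(spam) = 1 / (1 + exp(log_ham - log_spam)), computed without
    exponentiating either log probability on its own."""
    diff = log_prob_if_not_spam - log_prob_if_spam
    if diff > 700:   # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# both exp(-1000) and exp(-1002) underflow to 0.0, but the ratio is still fine
print(posterior_from_logs(-1000.0, -1002.0))   # roughly 0.88
```

Algebraically this is the same quantity as `prob_if_spam / (prob_if_spam + prob_if_not_spam)`, just divided through by `prob_if_spam`.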

Now we can put all of this together into our Naive Bayes Classifier:

In [140]:
class NaiveBayesClassifier:
    
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []
        
    def train(self, training_set):
        # count spam and non-spam messages
        num_spams = len([is_spam for message, is_spam in training_set if is_spam])
        num_non_spams = len(training_set) - num_spams
        # run the training data
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts, num_spams, num_non_spams, self.k)
        
    def classify(self, message):
        return spam_probability(self.word_probs, message)

## Testing Our Model

To test our model, we'll be using the [SpamAssassin public corpus](https://spamassassin.apache.org/publiccorpus/) (an oldie but a goodie).  
If you want to play along, download the files prefixed with `20021010` and unzip them.  
There should be three folders: `spam`, `easy_ham`, and `hard_ham`.  
Each folder contains many emails, each contained in a single file.  
In order to keep things *really* simple, we are only going to look at the subject lines of each email.

How do we identify the subject line?  
Looking through the files, the subject lines all seem to start with "Subject:", so let's look for that:

In [141]:
import glob, re

# modify the path to wherever you put the files
path = r"spam_email_data/*/*"
data = []
# glob.glob returns every filename that matches the wildcarded path
for fn in glob.glob(path):
    is_spam = "ham" not in fn
    with open(fn,'r') as file:
        for line in file:
            if line.startswith("Subject:"):
                # remove the leading "Subject: " and keep what's left
                subject = re.sub(r"^Subject: ", "", line).strip()
                data.append((subject, is_spam))
data[:10]

[('Re: New Sequences Window', False),
 ('[zzzzteana] RE: Alexander', False),
 ('[zzzzteana] Moscow bomber', False),
 ("[IRR] Klez: The Virus That  Won't Die", False),
 ('Re: Insert signature', False),
 ('Re: [zzzzteana] Nothing like mama used to make', False),
 ('Re: [zzzzteana] Nothing like mama used to make', False),
 ('[zzzzteana] Playboy wants to go out with a bang', False),
 ('Re: [zzzzteana] Nothing like mama used to make', False),
 ('[zzzzteana] Meaningful sentences', False)]

Now we can split the data into training data and test data, and then we're ready to build a classifier:

In [142]:
random.seed(0)
train_data, test_data = split_data(data, 0.75)
classifier = NaiveBayesClassifier()
classifier.train(train_data)
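
The `split_data` function comes from the `machine_learning` module, which isn't shown in this chapter. If you don't have that module handy, the following is a guess at what it does, assuming it randomly sends each row to the first list with probability `prob`; treat it as a stand-in rather than the original.

```python
import random

def split_data(data, prob):
    """Split data into two lists, with roughly a fraction `prob` in the first.
    (Assumed behavior; the real machine_learning.split_data is not shown here.)"""
    results = [], []
    for row in data:
        # each row independently lands in the first list with probability prob
        results[0 if random.random() < prob else 1].append(row)
    return results

random.seed(0)
train, test = split_data(list(range(1000)), 0.75)
print(len(train) + len(test))   # 1000
```

Because the split is random per row, the training set will be close to, but not exactly, 75% of the data.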

Then check how the model does:

In [143]:
# triplets (subject, actual is_spam, predicted spam probability)
classified = [(subject, is_spam, classifier.classify(subject)) for subject, is_spam in test_data]
# assume that spam_probability > 0.5 corresponds to spam prediction and
# count the combinations of (actual is_spam, predicted is_spam)
counts = Counter((is_spam, spam_probability > 0.5) for _, is_spam, spam_probability in classified)
counts

Counter({(False, False): 704,
         (False, True): 33,
         (True, False): 38,
         (True, True): 101})

A review of the results:
- 704 True Negatives (ham classified as 'ham')
- 33 False Positives (ham classified as 'spam')
- 38 False Negatives (spam classified as 'ham')
- 101 True Positives (spam classified as 'spam')

Precision is  101 / (101 + 33) = 75%  
Recall is  101 / (101 + 38) = 73%
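
The same numbers can be computed directly from the `Counter`. This sketch copies the counts shown above so that it runs on its own; the F1 score and accuracy are extra metrics added for illustration.

```python
from collections import Counter

# confusion counts keyed by (actual is_spam, predicted is_spam), copied from above
counts = Counter({(False, False): 704, (False, True): 33,
                  (True, False): 38, (True, True): 101})

tp = counts[(True, True)]     # spam classified as spam
fp = counts[(False, True)]    # ham classified as spam
fn = counts[(True, False)]    # spam classified as ham
tn = counts[(False, False)]   # ham classified as ham

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + fp + fn + tn)

print(round(precision, 2))   # 0.75
print(round(recall, 2))      # 0.73
print(round(f1, 2))          # 0.74
print(round(accuracy, 2))    # 0.92
```

Accuracy looks much better than precision or recall here because most of the test messages are ham, which is why accuracy alone is a poor summary for this kind of problem.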

Let's also look at the most misclassified:

In [144]:
# sort by spam_probability from smallest to largest
classified.sort(key=lambda row: row[2])
# the highest predicted spam probabilities among the non_spams
spammiest_hams = [row for row in classified if not row[1]][-5:]
print("spammiest hams: " + str(spammiest_hams))
print("")
# the lowest predicted spam probabilities among the actual spams
hammiest_spams = [row for row in classified if row[1]][:5]
print("hammiest_spams: " + str(hammiest_spams))

spammiest hams: [('Attn programmers: support offered [FLOSS-Sarai Initiative]', False, 0.975612960514201), ('2000+ year old Greek computer reinterpreted', False, 0.983535500810437), ('What to look for in your next smart phone (Tech Update)', False, 0.9898719206903349), ('[ILUG-Social] Re: Important - reenactor insurance needed', False, 0.9995349057803377), ('[ILUG-Social] Re: Important - reenactor insurance needed', False, 0.9995349057803377)]

hammiest_spams: [('Re: girls', True, 0.0009525186158414711), ('Introducing Chase Platinum for Students with a 0% Introductory APR', True, 0.0012566691211091526), ('.Message report from your contact page....//ytu855 rkq', True, 0.0015109358288617285), ('Testing a system, please delete', True, 0.0026920538836874555), ('Never pay for the goodz again (8SimUgQ)', True, 0.00591162322193142)]


The two spammiest hams both have the words 'needed' (77 times more likely to appear in spam),  
'insurance' (30 times more likely to appear in spam),  
and 'important' (10 times more likely to appear in spam).

The hammiest spam is too short ('Re: girls') to make much of a judgment, and the second is a credit card offer with many words not included in the training set.

We can also look at the spammiest *words*:

In [145]:
def p_spam_given_word(word_prob):
    """ use Bayes' Theorem to calculate p(spam | message contains word) """
    # word_prob is one of the triplets produced by word_probabilities
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

words = sorted(classifier.word_probs, key=p_spam_given_word)

spammiest_words = words[-5:]
print("The spammiest words are: " + str(spammiest_words))
print("")
hammiest_words = words[:5]
print("The hammiest words are: " + str(hammiest_words))

The spammiest words are: [('year', 0.028767123287671233, 0.00022893772893772894), ('sale', 0.031506849315068496, 0.00022893772893772894), ('rates', 0.031506849315068496, 0.00022893772893772894), ('systemworks', 0.036986301369863014, 0.00022893772893772894), ('money', 0.03972602739726028, 0.00022893772893772894)]

The hammiest words are: [('spambayes', 0.0013698630136986301, 0.04601648351648352), ('users', 0.0013698630136986301, 0.036401098901098904), ('razor', 0.0013698630136986301, 0.030906593406593408), ('zzzzteana', 0.0013698630136986301, 0.029075091575091576), ('sadev', 0.0013698630136986301, 0.026785714285714284)]


### Ways to improve model performance  
- More data. Nuff said.
- Look at the message content, not just the subject line. Be careful how you deal with the message headers.
- Our classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional `min_count` threshold and ignore tokens that don't appear at least that many times.
- The tokenizer has no notion of similar words (e.g. 'cheap' and 'cheapest'). Modify the classifier to take an optional `stemmer` function that converts words to [equivalence classes](https://en.wikipedia.org/wiki/Equivalence_class) of words, like the [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/). 
- Although our features are all of the form "message contains word $w_i$", there's no reason why this has to be the case. In our implementation, we could add extra features like "message contains a number" by creating phony tokens like *contains:number* and modifying the `tokenize` function to emit them when appropriate.
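
A couple of these ideas can be sketched in a few lines. The names `tokenize_with_features` and `drop_rare_words` are hypothetical, and the "stemmer" in the usage example is a deliberately crude stand-in (it just truncates words), not a real stemming algorithm.

```python
import re

def tokenize_with_features(message, stemmer=None):
    """Like tokenize, plus a phony 'contains:number' token and optional stemming."""
    message = message.lower()
    words = set(re.findall("[a-z0-9']+", message))
    if stemmer is not None:
        # map each word to its equivalence class
        words = set(stemmer(w) for w in words)
    if re.search("[0-9]", message):
        # ':' never matches the word pattern, so this can't collide with a real word
        words.add("contains:number")
    return words

def drop_rare_words(counts, min_count=2):
    """Keep only words seen in at least min_count messages (spam plus non-spam)."""
    return dict((w, c) for w, c in counts.items() if sum(c) >= min_count)

print(sorted(tokenize_with_features("Save 50% on Rolex")))
# crude stand-in stemmer: keep the first five letters of each word
print(sorted(tokenize_with_features("cheap cheapest", stemmer=lambda w: w[:5])))
```

To use these, `count_words` would need to call the new tokenizer, and `train` would filter the counts before turning them into probabilities.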

## For Further Exploration

- Paul Graham's articles [A Plan for Spam](http://www.paulgraham.com/spam.html) and [Better Bayesian Filtering](http://www.paulgraham.com/better.html) offer insight into the ideas behind building spam filters.  
- scikit-learn contains a [BernoulliNB](http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes) model that implements a Naive Bayes algorithm similar to the one implemented here.