In [1]:
import numpy as np
import pandas as pd

## 1. Review of Probability Concepts.

For and a review of probability concepts and a warm-up for today's main exercise, download and work through this notebook from a previous AI4ALL session: https://github.com/abisee/sailors2017/blob/master/lesson3_naivebayes_exercises.ipynb

## 2. Naive Bayes Spam Classifier.

Now, let's create the spam classifier we learned about in today's lecture.

### a. Download and load the data.

Download a spam classification dataset from here: https://www.kaggle.com/benvozza/spam-classification/data

Then run the following code to read in the data:

In [2]:
data = pd.read_csv("spam.csv", header=0, encoding='latin-1')

The next cell splits the dataset into a train and test set. We'll start with using 80% of the data for training and 20% for testing -- if you have time at the end of the session, you can come back and experiment with different train-test splits.

In [3]:
num_examples = data.shape[0]
indices = np.random.permutation(num_examples)

train_indices = indices[:int(num_examples * 4 / 5)]
test_indices = indices[int(num_examples * 4 / 5):]

train_data = data.iloc[train_indices]
test_data = data.iloc[test_indices]

How many train and test examples do we have?

In [4]:
print(train_data.shape)
print(test_data.shape)

(4457, 5)
(1115, 5)


In this dataset, the 'v1' column contains the labels and the 'v2' column contains the inputs.

In [5]:
train_labels = train_data['v1']
train_inputs = train_data['v2']

What are the possible labels?

In [6]:
print(np.unique(train_labels))

['ham' 'spam']


### b. Write functions to compute the probabilities we'll need for our classifier.

Complete these functions to calculate p(word|spam) and p(word|not spam) for each word.

In [7]:
spam_train_indices = np.where(train_labels == 'spam')
spam_inputs = train_inputs.iloc[spam_train_indices]

ham_train_indices = np.where(train_labels == 'ham')
ham_inputs = train_inputs.iloc[ham_train_indices]

In [8]:
all_spam_words = np.concatenate([example.split(" ") for example in spam_inputs.ravel()])
all_ham_words = np.concatenate([example.split(" ") for example in ham_inputs.ravel()])
all_words = np.concatenate((all_spam_words, all_ham_words))

In [9]:
def compute_p_word_given_spam(all_spam_words, word):
    num_spam_words = len(all_spam_words)
    num_word_occurrences = np.count_nonzero(all_spam_words == word)
    return num_word_occurrences / num_spam_words

In [10]:
def compute_p_word_given_ham(all_ham_words, word):
    num_ham_words = len(all_ham_words)
    num_word_occurrences = np.count_nonzero(all_ham_words == word)
    return num_word_occurrences / num_ham_words

Now compute p(spam), p(ham).

In [11]:
num_spam = len(spam_inputs)
num_ham = len(ham_inputs)
num_total = num_spam + num_ham

In [12]:
p_spam = num_spam / num_total
p_ham = num_ham / num_total

Next, complete the following function to compute p(word).

In [13]:
def compute_p_word(all_words, word):
    num_word_occurrences = np.count_nonzero(all_words == word)
    num_words = len(all_words)
    return num_word_occurrences / num_words

Calculate p(spam|word) and p(ham|word).

In [14]:
def compute_p_spam_given_word(all_spam_words, word, all_words, p_spam):
    p_word_given_spam = compute_p_word_given_spam(all_spam_words, word)
    p_word = compute_p_word(all_words, word)
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    return p_spam_given_word

In [15]:
def compute_p_ham_given_word(all_ham_words, word, all_words, p_ham):
    p_word_given_ham = compute_p_word_given_ham(all_ham_words, word)
    p_word = compute_p_word(all_words, word)
    p_ham_given_word = p_word_given_ham * p_ham / p_word
    return p_ham_given_word

Fill in the following functions to compute p(email|spam), p(email|ham), and p(email).

In [16]:
def compute_p_email_given_spam(all_spam_words, email):
    p_email_given_spam = 1.0
    for word in email:
        p_word_given_spam = compute_p_word_given_spam(all_spam_words, word)
        p_email_given_spam *= p_word_given_spam
    return p_email_given_spam

In [17]:
def compute_p_email_given_ham(all_ham_words, email):
    p_email_given_ham = 1.0
    for word in email:
        p_word_given_ham = compute_p_word_given_ham(all_ham_words, word)
        p_email_given_ham *= p_word_given_ham
    return p_email_given_ham

In [18]:
def compute_p_email(all_words, email):
    p_email = 1.0
    for word in email:
        p_word = compute_p_word(all_words, word)
        p_email *= p_word
    return p_email

Fill in the following classifier, which makes predictions by comparing p(spam|email) to p(not spam|email).

In [19]:
def classify_email(email, all_spam_words, all_ham_words, all_words, p_spam, p_ham):
    p_email_given_spam = compute_p_email_given_spam(all_spam_words, email)
    p_email_given_ham = compute_p_email_given_ham(all_ham_words, email)
    p_email = compute_p_email(all_words, email)
    spam_probability = p_email_given_spam * p_spam
    ham_probability = p_email_given_ham * p_ham
    if spam_probability > ham_probability:
        return 'spam'
    else:
        return 'ham'

Fill in the following function, which checks the test accuracy for our classifier.

In [24]:
def test_classifier_accuracy(test_data, all_spam_words, all_ham_words, all_words, p_spam, p_ham):
    num_test_examples = len(test_data)
    num_correct = 0
    for test_index in range(num_test_examples):
        if test_index % 100 == 1:
            print("index: %d, current accuracy: %f" % (test_index, num_correct / test_index)) # TODO: Fill in here.
        test_example = test_data.iloc[test_index]
        test_label = test_example['v1']
        test_input = test_example['v2']
        prediction = classify_email(test_input, all_spam_words, all_ham_words, all_words, p_spam, p_ham)
        if prediction == test_label:
            num_correct += 1
    return num_correct / num_test_examples

### c. Time to test our classifier!

Run the following code to test the classifier. It'll take awhile to run, but the function periodically prints our test accuracy.

In [25]:
test_classifier_accuracy(test_data, all_spam_words, all_ham_words, all_words, p_spam, p_ham)

index: 1, current accuracy: 1.000000
index: 101, current accuracy: 0.891089
index: 201, current accuracy: 0.875622
index: 301, current accuracy: 0.877076
index: 401, current accuracy: 0.882793
index: 501, current accuracy: 0.882236
index: 601, current accuracy: 0.876872
index: 701, current accuracy: 0.873039
index: 801, current accuracy: 0.876404
index: 901, current accuracy: 0.866815
index: 1001, current accuracy: 0.866134
index: 1101, current accuracy: 0.867393


0.8681614349775785

What is the final test accuracy?

### Extra Challenge.

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

1. Try calculating the accuracy of our classifier on the train set. How does this compare to the test set?
2. Try finding some examples of inaccurately classified examples (or try writing your own input that tricks the classifier). Can you think of any way to improve our classifier?