# Naive Bayes Spam Filter

In this project we will build a spam filter using the multinomial Naive Bayes algorithm. Our goal is a classification accuracy of 80% or higher. The dataset contains 5,572 SMS messages labeled as *spam* or *ham* and is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

We'll start by reading in the dataset.

In [1]:
import pandas as pd

sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print('Number of rows: {}\nNumber of Columns: {}'.format(sms.shape[0], sms.shape[1]))

sms.head()

Number of rows: 5572
Number of Columns: 2


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
# Percentage of spam and ham
# Index by label: value_counts() sorts by frequency, so position 0 is ham, not spam
label_shares = sms['Label'].value_counts(normalize=True)
spam = label_shares['spam'].round(4) * 100
ham = label_shares['ham'].round(4) * 100

print('Percentage of spam: {}%\nPercentage of ham: {}%'.format(spam, ham))


Percentage of spam: 13.41%
Percentage of ham: 86.59%


## Training and Test Set

Now that we know what we're working with, we can split the data into a training set (80% of the messages) and a test set (the remaining 20%). We'll leave the test set alone for now and focus on the training set.

In [3]:
# Randomize the dataset
sms_random = sms.sample(frac=1, random_state=1)

# Create test and train sets
training_test_index = round(len(sms_random) * 0.2)

training_set = sms_random[training_test_index:].reset_index(drop=True)
test_set = sms_random[:training_test_index].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [4]:
# Percentage of spam and ham - training set
# Index by label: value_counts() sorts by frequency, so position 0 is ham, not spam
train_shares = training_set['Label'].value_counts(normalize=True)
spam = train_shares['spam'] * 100
ham = train_shares['ham'] * 100

print('Training Set\n------------\nPercentage of spam: {}%\nPercentage of ham: {}%'.format(spam, ham))

# Percentage of spam and ham - test set
test_shares = test_set['Label'].value_counts(normalize=True)
spam = test_shares['spam'] * 100
ham = test_shares['ham'] * 100

print('\nTest Set\n------------\nPercentage of spam: {}%\nPercentage of ham: {}%'.format(spam, ham))

Training Set
------------
Percentage of spam: 13.458950201884253%
Percentage of ham: 86.54104979811575%

Test Set
------------
Percentage of spam: 13.195691202872531%
Percentage of ham: 86.80430879712748%


## Lettercase and Punctuation

Earlier, we saw that our messages contain all kinds of punctuation and spellings, so we'll clean the data by removing all punctuation and converting every word to lowercase. This way, variants such as "Free" and "free!" are counted as the same word, which should improve accuracy.

In [5]:
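# Replace non-word characters (punctuation) with spaces, then lowercase
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()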
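
As a quick illustration of what this cleaning step does to a single message, here is a minimal sketch using Python's `re` module rather than pandas. The helper name `clean_message` and the sample text are hypothetical, chosen only for illustration.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split into words.

    Hypothetical helper for illustration; it mirrors the cleaning applied
    to the 'SMS' column but works on one plain string.
    """
    # \W matches anything that is not a letter, digit, or underscore
    text = re.sub(r'\W', ' ', text)
    return text.lower().split()

sample = "Free entry!! Text WIN to 80082, don't miss out."
print(clean_message(sample))
# ['free', 'entry', 'text', 'win', 'to', '80082', 'don', 't', 'miss', 'out']
```

Note that the apostrophe in "don't" is also replaced, so the word is split into `don` and `t`. Digits are kept, because `\W` only matches non-word characters.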

## Creating the Vocabulary

In [6]:
# Split SMS into word list
training_set['SMS'] = training_set['SMS'].str.split()

# Initialize unique vocabulary list
vocabulary = []

# Add words to vocabulary
for message in training_set['SMS']:
    for word in message:
        vocabulary.append(word)
        
# Remove duplicates
vocabulary = list(set(vocabulary))

In [7]:
len(vocabulary)

7753

## Finalize Training Set

In [8]:
# Create dictionary of word counts
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# Use `message` as the loop variable so the original `sms` dataframe is not overwritten
for index, message in enumerate(training_set['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

# Transform into dataframe
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

# Merge datasets
training_set_clean = pd.concat([training_set, word_counts_per_sms], axis=1)
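
To make the structure of `word_counts_per_sms` concrete, the sketch below builds the same kind of table for a tiny, made-up corpus using only the standard library. The three messages and the names `toy_messages`, `toy_vocabulary`, and `toy_counts` are assumptions for illustration, not rows from the dataset.

```python
# Made-up, already cleaned and split messages
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'secret', 'prize', 'now', 'now'],
]

# Sorted so the word order is reproducible
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, one slot per message, all starting at zero
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Count how often each word occurs in each message
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

for word, counts in toy_counts.items():
    print(f'{word:>7}: {counts}')
```

Each key plays the role of a column and each list position the role of a row, which is the shape `pd.DataFrame` expects when it is given a dictionary of lists.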

In [9]:
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | [yeah, do, don, t, stand, to, close, tho, you,... |
| 1 | ham | [hi, where, are, you, we, re, at, and, they, r... |
| 2 | ham | [if, you, r, home, then, come, down, within, 5... |
| 3 | ham | [when, re, you, guys, getting, back, g, said, ... |
| 4 | ham | [tell, my, bad, character, which, u, dnt, lik,... |


## Calculating Algorithm Constants

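Multinomial Naive Bayes relies on a few values that stay the same for every message we classify: P(Spam) and P(Ham), the total number of words in spam messages (N_Spam) and in ham messages (N_Ham), the size of the vocabulary (N_Vocabulary), and the Laplace smoothing parameter alpha. We'll calculate those here.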
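
For reference, the formulas these constants feed into can be written as follows, where $w_1, \dots, w_n$ are the words of a message and $N_{w_i \mid Spam}$ is the number of times $w_i$ occurs in spam messages. The notation is an assumption chosen to match the variable names used in the code. The same formulas apply to ham with $Spam$ replaced by $Ham$.

```latex
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
```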

In [10]:
# Split into spam and ham datasets
spam_set = training_set_clean[training_set_clean['Label'] == 'spam']
ham_set = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_set) / len(training_set_clean)
p_ham = len(ham_set) / len(training_set_clean)

# N_spam
n_words_per_spam = spam_set['SMS'].apply(len)
n_spam = n_words_per_spam.sum()

# N_ham
n_words_per_ham = ham_set['SMS'].apply(len)
n_ham = n_words_per_ham.sum()

# N_vocabulary
n_vocabulary = len(vocabulary)

# Laplace Smoothing
alpha = 1

## Calculating Parameters

The per-word probabilities P(word|Spam) and P(word|Ham) also stay fixed once the training set is fixed, so we can calculate them ahead of time and store them in two dictionaries, one per class.

In [11]:
# Initialize parameters
spam_parameters = {unique_word:0 for unique_word in vocabulary}
ham_parameters = {unique_word:0 for unique_word in vocabulary}

# Calculate P(word|spam) and P(word|ham) with Laplace smoothing
for word in vocabulary:
    # P(word|spam)
    n_word_given_spam = spam_set[word].sum()
    p_word_given_spam = ((n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary))
    
    # P(word|ham)
    n_word_given_ham = ham_set[word].sum()
    p_word_given_ham = ((n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary))
    
    # Update parameters
    spam_parameters[word] = p_word_given_spam
    ham_parameters[word] = p_word_given_ham
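
To see why the smoothing term matters, here is a small worked sketch on made-up data. The toy messages and the helper `smoothed_probability` are hypothetical and only illustrate the formula used above. `Fraction` is used so the results can be checked by hand.

```python
from fractions import Fraction

# Toy training data (made up for illustration)
toy_spam = [['win', 'money', 'now'], ['win', 'prize']]
toy_ham = [['see', 'you', 'now']]

toy_vocabulary = {word for message in toy_spam + toy_ham for word in message}

alpha = 1
n_spam = sum(len(message) for message in toy_spam)  # 5 words in spam
n_ham = sum(len(message) for message in toy_ham)    # 3 words in ham
n_vocabulary = len(toy_vocabulary)                  # 6 unique words

def smoothed_probability(word, messages, n_class):
    """P(word|class) with Laplace smoothing, kept as an exact fraction."""
    n_word = sum(message.count(word) for message in messages)
    return Fraction(n_word + alpha, n_class + alpha * n_vocabulary)

print(smoothed_probability('win', toy_spam, n_spam))  # 3/11
print(smoothed_probability('win', toy_ham, n_ham))    # 1/9
```

Without smoothing, `win` would have probability 0 in ham because it never appears there, and any message containing it would get P(Ham|message) = 0 no matter what its other words are.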

## Classifying a New Message

Now that we've calculated the constants and the per-word probabilities, we can create our classification function. P(Spam|message) and P(Ham|message) are not constants: they depend on the words in the message being classified, so we calculate them inside the function. We'll do a quick verification to check that it works before moving on to the test set.

In [12]:
import re

def classify(message):
    # Clean and split message
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Initialize P(Spam|message), P(Ham|message)
    p_spam_given_message = p_spam # Start from the prior P(Spam)
    p_ham_given_message = p_ham # Start from the prior P(Ham)

    # Multiply by P(word|class) for each known word; words outside the vocabulary are skipped
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [13]:
# Verify Spam
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.2784957584472927e-25
P(Ham|message): 2.5841428475044265e-27
Label: Spam


In [14]:
# Verify Ham
classify("Sounds good, Tom, then see u there")

P(Spam|message): 4.774748444294843e-25
P(Ham|message): 3.455584370145657e-21
Label: Ham
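
The probabilities above are already very small for short messages. SMS texts are short enough that this is not a problem here, but for much longer texts the product can underflow to 0.0 for both classes. A common variant, which is not what `classify` does, is to add logarithms instead of multiplying probabilities. The sketch below assumes a hypothetical function `classify_log` and made-up priors and parameters.

```python
import math

def classify_log(words, p_spam, p_ham, spam_parameters, ham_parameters):
    """Compare log P(Spam|message) and log P(Ham|message) for a list of cleaned words."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in spam_parameters:
            log_spam += math.log(spam_parameters[word])
        if word in ham_parameters:
            log_ham += math.log(ham_parameters[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'tie'

# Made-up priors and parameters, for illustration only
toy_spam_parameters = {'win': 0.3, 'see': 0.05}
toy_ham_parameters = {'win': 0.05, 'see': 0.3}

long_message = ['win'] * 1000

# Multiplying 1,000 small probabilities underflows to 0.0 for both classes
product_spam = 0.2 * math.prod(toy_spam_parameters[w] for w in long_message)
product_ham = 0.8 * math.prod(toy_ham_parameters[w] for w in long_message)
print(product_spam, product_ham)  # 0.0 0.0

# The sums of logarithms stay finite and can still be compared
print(classify_log(long_message, 0.2, 0.8, toy_spam_parameters, toy_ham_parameters))  # spam
```

Because the logarithm is monotonic, comparing the two sums gives the same label as comparing the two products whenever the products do not underflow.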


## Measuring Spam Accuracy

Now that we have created our algorithm and done a quick verification, it's time to put it to use! First we need to modify our function so that it only returns either 'ham', 'spam', or neither. Afterwards, we can measure the accuracy to see how good our classification was.

In [15]:
def test_classifier(message):
    # Clean and split message
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Initialize P(Spam|message), P(Ham|message)
    p_spam_given_message = p_spam # Start from the prior P(Spam)
    p_ham_given_message = p_ham # Start from the prior P(Ham)

    # Multiply by P(word|class) for each known word; words outside the vocabulary are skipped
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'I can\'t tell!'

In [16]:
# Classify Test Set
test_set['Predicted'] = test_set['SMS'].apply(test_classifier)

# Verify
test_set.head()

| | Label | SMS | Predicted |
|---|---|---|---|
| 0 | ham | Yep, by the pretty sculpture | ham |
| 1 | ham | Yes, princess. Are you going to make me moan? | ham |
| 2 | ham | Welp apparently he retired | ham |
| 3 | ham | Havent. | ham |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... | ham |


In [28]:
# Initialize accuracy variables
correct = 0
total = len(test_set)

# Measure accuracy
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['Predicted']:
        correct += 1
accuracy = round((correct/total * 100), 2)
                 
print('Accuracy: {}%'.format(accuracy),
     '\nTotal Correct:', correct,
     '\nTotal Incorrect:', total - correct)

Accuracy: 98.83% 
Total Correct: 1101 
Total Incorrect: 13


With only 13 incorrect classifications out of 1,114 test messages, our spam filter reaches an accuracy of 98.83%, well above our goal of 80%.
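
Accuracy alone does not show which kind of mistake the filter makes. One possible follow-up is to count the errors by direction: ham flagged as spam, and spam let through as ham. The sketch below uses made-up labels. In the notebook, the two lists would come from `test_set['Label']` and `test_set['Predicted']`. The variable names are assumptions for illustration.

```python
from collections import Counter

# Made-up labels for illustration only
actual =    ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']

# Count each (actual, predicted) combination
outcomes = Counter(zip(actual, predicted))

correct = sum(count for (a, p), count in outcomes.items() if a == p)
accuracy = round(correct / len(actual) * 100, 2)

print('Accuracy: {}%'.format(accuracy))
print('Ham flagged as spam:', outcomes[('ham', 'spam')])
print('Spam let through as ham:', outcomes[('spam', 'ham')])
```

For a spam filter the two error types are usually not equally costly, since a legitimate message that lands in the spam folder may never be read.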