# Building a Spam Filter with Naive Bayes

In this project we build a spam filter for SMS messages from scratch. We use the multinomial Naive Bayes algorithm and a dataset of 5,572 SMS messages collected by Tiago A. Almeida and José María Gómez Hidalgo. The dataset can be downloaded from the <a href=https://archive.ics.uci.edu/ml/index.php>UCI Machine Learning Repository</a> or directly from <a href=https://archive.ics.uci.edu/ml/machine-learning-databases/00228/>this link</a>.

It was developed as a guided project in a <a href=https://www.dataquest.io/>Dataquest</a> course on *Conditional Probability*.

## Exploring the Data

In [1]:
import pandas as pd
import re

from collections import defaultdict

In [2]:
collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
collection.shape

(5572, 2)

In [3]:
collection.head(10)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
collection['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

Each row of the dataset consists of a raw SMS message and its classification: spam or ham (i.e. non-spam). About 13.4% of the messages are spam.

## Splitting into Training and Test Sets

In [5]:
shuffled = collection.sample(frac=1, random_state=1)

In [6]:
# use 80% of the dataset to train the model
train_index = round(shuffled.shape[0] * 0.8)
train_df = shuffled[:train_index].reset_index(drop=True)
train_df.shape

(4458, 2)

In [7]:
test_df = shuffled[train_index:].reset_index(drop=True)
test_df.shape

(1114, 2)

In [8]:
train_df['Label'].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
test_df['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The share of spam is about 13.5% in the training set and 13.2% in the test set, both close to the 13.4% in the full dataset. Both sets therefore appear to be representative samples.

## Cleaning the Data

In [10]:
# clean the SMS column in the training set
def clean_message(message):
    message = re.sub(r'[^A-Za-z0-9 ]+', ' ', message)
    message = message.lower()
    message = message.split()
    
    return message

train_df['SMS'] = train_df['SMS'].apply(clean_message)
train_df.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, all, smth, there, s, a, ca..."
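
To see what the cleaning step does to a single message, here is a small standalone sketch. It repeats the regex logic of `clean_message` on one of the sample sentences used later in this notebook; the name `clean_message_demo` is only for illustration.

```python
import re

def clean_message_demo(message):
    # replace every run of characters that are not letters, digits or spaces with one space
    message = re.sub(r'[^A-Za-z0-9 ]+', ' ', message)
    # lowercase the text and split it on whitespace
    return message.lower().split()

tokens = clean_message_demo("WINNER!! This is the secret code to unlock the money: C3421.")
print(tokens)
# ['winner', 'this', 'is', 'the', 'secret', 'code', 'to', 'unlock', 'the', 'money', 'c3421']
```

Because apostrophes are also replaced by spaces, contractions are split into two tokens (for example `don't` becomes `don` and `t`), which is why single letters such as `s` appear in the cleaned messages.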


In [11]:
# find all unique words in the training set and count their occurrences
vocabulary = defaultdict(int)
for sms in train_df['SMS']:
    for word in sms:
        vocabulary[word] += 1
list(vocabulary.items())[:5]

[('yep', 9), ('by', 144), ('the', 1077), ('pretty', 12), ('sculpture', 1)]

In [12]:
len(vocabulary)

7776

There are 7,776 unique words in all the messages of our training set.
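
As an aside, the same word counting can be written with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the code used in this notebook, and the two toy messages are made up for illustration.

```python
from collections import Counter

# hypothetical stand-in for the cleaned train_df['SMS'] column
toy_messages = [
    ['free', 'entry', 'to', 'win'],
    ['are', 'you', 'free', 'tonight'],
]

toy_vocabulary = Counter()
for sms in toy_messages:
    # update() adds one count per word occurrence in the list
    toy_vocabulary.update(sms)

print(len(toy_vocabulary))     # number of unique words: 7
print(toy_vocabulary['free'])  # total occurrences of 'free': 2
```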

In [13]:
# transform the training set: one column per unique word, one row per message
word_counts_per_sms = {unique_word: [0] * len(train_df['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_df['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_sms)
train_df_clean = pd.concat([train_df, word_counts], axis=1)
train_df_clean.shape

(4458, 7778)

In [14]:
train_df_clean.head()

Unnamed: 0,Label,SMS,yep,by,the,pretty,sculpture,yes,princess,are,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, all, smth, there, s, a, ca...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


After this transformation every unique word has its own column holding its count in each message, which makes the model parameters easier to calculate.


## Calibrating the Algorithm

In [15]:
train_ham = train_df_clean[train_df_clean['Label'] == 'ham']
train_spam = train_df_clean[train_df_clean['Label'] == 'spam']

In [16]:
total = train_df_clean.shape[0]
ham = train_ham.shape[0]
spam = train_spam.shape[0]
total, ham, spam

(4458, 3858, 600)

In [17]:
# calculate unconditional probabilities
p_ham = ham / total
p_spam = spam / total
p_ham, p_spam

(0.8654104979811574, 0.13458950201884254)

In [18]:
# number of unique words
n_vocabulary = len(vocabulary)
n_vocabulary

7776

In [19]:
# total number of words in all ham messages
n_ham = train_ham['SMS'].apply(len).sum()
n_ham

57101

In [20]:
# total number of words in all spam messages
n_spam = train_spam['SMS'].apply(len).sum()
n_spam

15193

In [21]:
# using Laplace smoothing
alpha = 1

In [22]:
# calculate conditional probabilities
cond_ham = defaultdict(int)
cond_spam = defaultdict(int)

for word in vocabulary:
    n_word_ham = train_ham[word].sum()
    cond_ham[word] = (n_word_ham + alpha) / (n_ham + alpha * n_vocabulary)
    
    n_word_spam = train_spam[word].sum()
    cond_spam[word] = (n_word_spam + alpha) / (n_spam + alpha * n_vocabulary)
    
print(list(cond_ham.items())[:3])
print(list(cond_spam.items())[:3])

[('yep', 0.00015413783004762858), ('by', 0.0017109299135286773), ('the', 0.014196094147386594)]
[('yep', 4.35369410945187e-05), ('by', 0.0015237929383081544), ('the', 0.006878836692933954)]
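
In formula form, the values computed above are the multinomial Naive Bayes estimates with additive (Laplace) smoothing. The symbols are shorthand introduced here: $N_{w \mid \text{Spam}}$ corresponds to `n_word_spam`, $N_{\text{Spam}}$ to `n_spam`, $N_{\text{Vocabulary}}$ to `n_vocabulary`, and $\alpha$ to `alpha`.

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

As a check against the printed output, the value for `yep` in spam is consistent with a word that never occurs in the spam training messages:

$$
P(\text{yep} \mid \text{Spam}) = \frac{0 + 1}{15193 + 1 \cdot 7776} = \frac{1}{22969} \approx 4.35 \times 10^{-5}
$$

A message $w_1, \dots, w_n$ is then scored for each class by multiplying the class prior with the conditional probabilities of its words:

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$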


## Classifying a New Message

With all the parameters calculated we can create the spam filter. It is basically a function that:
1. Takes a new message as a raw string and cleans it into a list of words.
2. Calculates the probability of each class, spam or ham, given the words.
3. Decides whether the message is ham or spam, or needs human help to classify.

In [23]:
def classify(message, verbose=False):

    message = clean_message(message)
    
    # initialize the conditional probabilities with the unconditional (prior) probabilities
    p_ham_given_message = p_ham    
    p_spam_given_message = p_spam
    
    # calculate the conditional probabilities using the product formula
    for word in message:
        if word in cond_ham:
            p_ham_given_message *= cond_ham[word]        
        if word in cond_spam:
            p_spam_given_message *= cond_spam[word]

    if verbose:
        print('P(Ham|message):', p_ham_given_message)
        print('P(Spam|message):', p_spam_given_message)    

    # classify the message
    if p_ham_given_message > p_spam_given_message:
        if verbose:
            print('Label: Ham')
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        if verbose:
            print('Label: Spam')
        return 'spam'
    else:
        if verbose:
            print('Equal probabilities, have a human classify this!')
        return 'needs human classification'

In [24]:
# test the above function
classify("WINNER!! This is the secret code to unlock the money: C3421.", verbose=True);
print('-' * 10)
classify("Sounds good, Tom, then see u there", verbose=True);

P(Ham|message): 1.9755668428218808e-27
P(Spam|message): 1.3502434564954932e-25
Label: Spam
----------
P(Ham|message): 3.744803676008355e-21
P(Spam|message): 2.440210195572668e-25
Label: Ham
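
The scores printed above are already around 1e-21 to 1e-27 for short messages, because each word multiplies the running product by a small number. For much longer texts such products can underflow to zero. A common variant sums logarithms instead of multiplying probabilities. The sketch below illustrates that variant with made-up toy parameters; `classify_log` and every `toy_` name are hypothetical and not part of this notebook.

```python
import math

# hypothetical toy parameters, NOT the values estimated in this notebook
toy_p_ham, toy_p_spam = 0.8, 0.2
toy_cond_ham = {'free': 0.01, 'win': 0.005, 'see': 0.05}
toy_cond_spam = {'free': 0.08, 'win': 0.06, 'see': 0.004}

def classify_log(words):
    # start from the log of the class priors
    log_ham = math.log(toy_p_ham)
    log_spam = math.log(toy_p_spam)
    for word in words:
        # words outside the vocabulary are skipped
        if word in toy_cond_ham:
            log_ham += math.log(toy_cond_ham[word])
            log_spam += math.log(toy_cond_spam[word])
    # comparing log scores gives the same decision as comparing the products
    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log(['free', 'win']))  # spam
print(classify_log(['see', 'you']))   # ham
```

Since the logarithm is strictly increasing, the comparison between the two classes is unchanged; only the numeric range of the scores differs.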


## Measuring the Accuracy of the Spam Filter

In [25]:
# classify the test set
test_df['predicted'] = test_df['SMS'].apply(classify)
test_df.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [26]:
# calculate the accuracy
total = test_df.shape[0]
correct = (test_df['Label'] == test_df['predicted']).sum()

print('Correct: ', correct)
print('Incorrect: ', total - correct)
print('Accuracy: ', correct/total)

Correct:  1100
Incorrect:  14
Accuracy:  0.9874326750448833


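The spam filter classifies about 98.74% of the test messages correctly (1,100 of 1,114). For comparison, labeling every message as ham would reach only about 86.8% on this test set, since that is its share of ham messages.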
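
Because the classes are imbalanced, accuracy alone does not show how the errors are distributed between spam and ham. As a supplementary sketch, precision and recall for the spam class could be computed as follows. The function `spam_metrics` and the toy labels are hypothetical; no such metrics were computed in this notebook.

```python
def spam_metrics(actual, predicted):
    pairs = list(zip(actual, predicted))
    # spam correctly flagged as spam
    tp = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    # non-spam wrongly flagged as spam
    fp = sum(a != 'spam' and p == 'spam' for a, p in pairs)
    # spam that was not flagged as spam
    fn = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# made-up labels for illustration only
toy_actual = ['ham', 'spam', 'spam', 'ham', 'ham']
toy_predicted = ['ham', 'spam', 'ham', 'spam', 'ham']

precision, recall = spam_metrics(toy_actual, toy_predicted)
print(precision, recall)  # 0.5 0.5
```

On the real data the same function could be called as `spam_metrics(test_df['Label'], test_df['predicted'])`.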

## Conclusion

In [27]:
# look at the wrong predictions
test_df[test_df['Label'] != test_df['predicted']]

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
302,ham,No calls..messages..missed calls,spam
319,ham,We have sent JD for Customer Service cum Accou...,spam
504,spam,Oh my god! I've found your number again! I'm s...,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham


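In this project we built a spam filter for SMS messages from scratch using the multinomial Naive Bayes algorithm. The filter reached an accuracy of 98.74% on the test set. The misclassified messages listed above include spam labeled as ham, ham labeled as spam, and one message with equal probabilities that was left for human classification.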