# Building a Spam Filter with Naive Bayes

In this project we build a spam filter for SMS messages from scratch. We use the multinomial Naive Bayes algorithm and a dataset of 5,572 SMS messages collected by Tiago A. Almeida and José María Gómez Hidalgo. The dataset can be downloaded from the <a href=https://archive.ics.uci.edu/ml/index.php>UCI Machine Learning Repository</a> or directly from <a href=https://archive.ics.uci.edu/ml/machine-learning-databases/00228/>this link</a>.

This was a guided project from a <a href=https://www.dataquest.io/>Dataquest</a> course on *Conditional Probability*.

## Exploring the Data

In [28]:
import pandas as pd
from collections import defaultdict

In [13]:
collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
collection.shape

(5572, 2)

In [14]:
collection.head(10)

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [15]:
collection['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

Each row of the dataset consists of a raw SMS message and its classification, spam or ham (i.e. non-spam). About 13.4% of the messages are spam.

## Splitting into Training and Test Sets

In [16]:
shuffled = collection.sample(frac=1, random_state=1)

In [17]:
train_index = int(shuffled.shape[0] * 0.8)

In [21]:
train_df = shuffled[:train_index].reset_index(drop=True)
train_df.shape

(4457, 2)

In [22]:
test_df = shuffled[train_index:].reset_index(drop=True)
test_df.shape

(1115, 2)

In [23]:
train_df['Label'].value_counts(normalize=True) * 100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [24]:
test_df['Label'].value_counts(normalize=True) * 100

ham     86.816143
spam    13.183857
Name: Label, dtype: float64

Spam makes up about 13.5% of the training set and 13.2% of the test set, close to the 13.4% observed in the full dataset. Both sets therefore appear to be representative samples.
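The shuffle-then-slice split used above can also be sketched without pandas. This is only a minimal illustration on made-up labels: `split_train_test`, `spam_share`, and the toy rows are hypothetical names and data, not part of the original notebook.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows and slice it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def spam_share(rows):
    """Percentage of rows labelled 'spam'."""
    return 100 * sum(1 for label, _ in rows if label == "spam") / len(rows)

# Toy data: 87 ham and 13 spam messages (made up for illustration).
toy_rows = [("ham", f"message {i}") for i in range(87)]
toy_rows += [("spam", f"offer {i}") for i in range(13)]

toy_train, toy_test = split_train_test(toy_rows)
print(len(toy_train), len(toy_test))
print(round(spam_share(toy_train), 1), round(spam_share(toy_test), 1))
```

With so few rows the two spam shares can differ noticeably. On the full dataset of 5,572 messages a random split tends to keep them much closer, as the figures above show.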

## Cleaning the Data

In [26]:
# clean the SMS column in the training set
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
train_df['SMS'] = train_df['SMS'].str.replace(r'\W', ' ', regex=True)
train_df['SMS'] = train_df['SMS'].str.lower()
train_df['SMS'] = train_df['SMS'].str.split()
train_df.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
| 3 | ham | [havent] |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |


In [36]:
# find all unique words in the training set and count their occurrences
vocabulary = defaultdict(int)
for sms in train_df['SMS']:
    for word in sms:
        vocabulary[word] += 1
list(vocabulary.items())[:5]

[('yep', 9), ('by', 144), ('the', 1077), ('pretty', 12), ('sculpture', 1)]

In [37]:
len(vocabulary)

7782

There are 7,782 unique words in all the messages of our training set.
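The cleaning and vocabulary steps can be illustrated on two made-up messages using only the standard library. This sketch assumes the same rule as above (replace every non-word character with a space, lowercase, split on whitespace). `tokenize`, `toy_messages`, and `word_totals` are hypothetical names introduced for the example.

```python
import re
from collections import Counter

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    return re.sub(r"\W", " ", message).lower().split()

# Made-up messages, for illustration only.
toy_messages = ["Free entry!! Win a FREE prize", "Ok lar... see you"]

# Count how often each cleaned word occurs across all messages.
word_totals = Counter()
for message in toy_messages:
    word_totals.update(tokenize(message))

print(len(word_totals))     # number of unique words
print(word_totals["free"])  # "Free" and "FREE" are counted together
```

Because apostrophes are non-word characters, a contraction such as `there's` is split into `there` and `s`, which is why single-letter tokens like `s` appear in the cleaned training set.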

In [42]:
# transform the training set: one count column per vocabulary word
word_counts_per_sms = {unique_word: [0] * len(train_df['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_df['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_sms)
train_df_clean = pd.concat([train_df, word_counts], axis=1)
train_df_clean.shape

(4457, 7784)

In [43]:
train_df_clean.head()

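| | Label | SMS | yep | by | the | pretty | sculpture | yes | princess | are | ... | prakesh | beauty | hides | secrets | n8 | jewelry | related | trade | arul | bx526 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |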


This transformation gives each vocabulary word its own column of per-message counts, which makes the model parameters easier to calculate.
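A small pure-Python sketch of the same transformation may make the layout clearer. The three tokenized messages below are made up, and `toy_sms`, `toy_vocabulary`, and `counts_per_sms` are hypothetical names; the loop mirrors the one used on the training set.

```python
# Made-up tokenized messages, for illustration only.
toy_sms = [["free", "prize", "free"], ["see", "you"], ["free", "you"]]

# Every unique word becomes one column.
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# Start every column at zero, then count each word in its message's row.
counts_per_sms = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        counts_per_sms[word][index] += 1

for word, column in counts_per_sms.items():
    print(word, column)
```

Summing a column gives the total number of times that word occurs, for example `sum(counts_per_sms["free"])` is 3 here. Restricting the rows to one class before summing gives the per-class word count that the conditional probabilities need.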


## Calibrating the Algorithm

In [49]:
train_ham = train_df_clean[train_df_clean['Label'] == 'ham']
train_spam = train_df_clean[train_df_clean['Label'] == 'spam']

In [50]:
total = train_df_clean.shape[0]
ham = train_ham.shape[0]
spam = train_spam.shape[0]
total, ham, spam

(4457, 3857, 600)

In [51]:
# calculate unconditional probabilities
p_ham = ham / total
p_spam = spam / total
p_ham, p_spam

(0.8653803006506618, 0.13461969934933812)

In [52]:
# number of unique words
n_vocabulary = len(vocabulary)
n_vocabulary

7782

In [54]:
# total number of words in all ham messages
n_ham = train_ham['SMS'].apply(len).sum()
n_ham

57233

In [55]:
# total number of words in all spam messages
n_spam = train_spam['SMS'].apply(len).sum()
n_spam

15190

In [57]:
# using Laplace smoothing
alpha = 1

In [61]:
# calculate conditional probabilities with additive (Laplace) smoothing
cond_ham = defaultdict(int)
cond_spam = defaultdict(int)

for word in vocabulary:
    # each class is normalized by its own total word count
    n_word_ham = train_ham[word].sum()
    cond_ham[word] = (n_word_ham + alpha) / (n_ham + alpha * n_vocabulary)

    n_word_spam = train_spam[word].sum()
    cond_spam[word] = (n_word_spam + alpha) / (n_spam + alpha * n_vocabulary)

# show the first three entries, rounded to 8 decimal places
print([(word, round(float(p), 8)) for word, p in list(cond_ham.items())[:3]])
print([(word, round(float(p), 8)) for word, p in list(cond_spam.items())[:3]])

[('yep', 0.00015381), ('by', 0.0017073), ('the', 0.01416596)]
[('yep', 4.353e-05), ('by', 0.00152359), ('the', 0.00687794)]
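For reference, the usual additive-smoothing estimate for multinomial Naive Bayes is P(word | class) = (count of word in class + alpha) / (total words in class + alpha * vocabulary size), where each class is divided by its own word total. The sketch below applies it with the totals reported above. `smoothed_probability` is a hypothetical helper, and the word counts 9 and 0 are illustrative assumptions rather than values taken from the notebook.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """Additive (Laplace) smoothing estimate of P(word | class)."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Totals reported above: words in ham, words in spam, vocabulary size.
n_ham, n_spam, n_vocabulary = 57233, 15190, 7782

# Hypothetical word seen 9 times in ham messages and never in spam.
p_word_given_ham = smoothed_probability(9, n_ham, n_vocabulary)
p_word_given_spam = smoothed_probability(0, n_spam, n_vocabulary)

print(p_word_given_ham, p_word_given_spam)
```

Smoothing keeps the spam probability above zero even though the word never occurs in spam, so one unseen word cannot force the probability of a whole message to zero.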
