# Building a Spam Filter with Naive Bayes
In this notebook we use the ``Multinomial Naive Bayes Algorithm`` to build a spam filter for SMS messages. The data comes from the UCI Machine Learning Repository and contains about 5K SMS messages, each labeled as spam or ham.

In [234]:
import pandas as pd

In [243]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [244]:
sms.shape

(5572, 2)

In [245]:
sms['Label'].value_counts(normalize = True).round(2)*100

ham     87.0
spam    13.0
Name: Label, dtype: float64

The dataset contains 5572 rows. Of those, 87\% of the SMS messages are **not spam (ham)** and 13\% are **spam**.

To measure the accuracy of our model, we need to present it with new, unseen data. Since we cannot generate more data, we split the dataset into two sets: one for training the classifier and one for testing it. We use an 80/20 split, where 80\% of the data is used for training and 20\% for testing. We achieve this by shuffling the dataset first and then slicing it into two new dataframes.

In [246]:
sms_r = sms.sample(frac=1, random_state=1)

In [247]:
split_tresh = round(len(sms_r)*0.8)

sms_r_train = sms_r[:split_tresh].reset_index(drop=True)
sms_r_test  = sms_r[split_tresh:].reset_index(drop=True)

In [248]:
sms_r_test.shape[0]+sms_r_train.shape[0]

5572

In [249]:
sms_r_train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [250]:
sms_r_test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

After dividing the dataset into two disjoint, shuffled sets, we can see that the proportions of spam and ham labels are nearly the same in both sets, even though they differ in size. Now we need to take a look at the messages themselves. We need to perform some cleaning before applying Naive Bayes.

In [251]:
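# Replace every non-word character (anything other than letters, digits, or underscore)
# with a space, then lowercase. regex=True is needed in recent pandas versions,
# where str.replace treats the pattern as a literal string by default.

sms_r_train['SMS'] = sms_r_train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()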
sms_r_train['SMS'] = sms_r_train['SMS'].str.split()
sms_r_train.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."
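
To make the cleaning step concrete, here is a minimal sketch that applies the same three operations (replace non-word characters, lowercase, split) to a single message. The sample text and the `tokenize` helper are made up for illustration and are not part of the dataset or the notebook.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()

sample = "Don't miss out!! Call 0800-123 NOW"  # made-up message
tokens = tokenize(sample)
print(tokens)
# ['don', 't', 'miss', 'out', 'call', '0800', '123', 'now']
```

Note that apostrophes are also replaced, so contractions such as "don't" are split into two tokens, which is why single letters like `s` and `t` show up in the vocabulary.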


In [252]:
vocabulary = list()
for sentence in sms_r_train['SMS']:
    for word in sentence:
        vocabulary.append(word)

In [253]:
vocabulary = list(set(vocabulary))

In [254]:
len(vocabulary)

7783

In [255]:
word_counts_per_sms = {unique_word: [0] * len(sms_r_train['SMS']) for unique_word in vocabulary}

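# Use `message` as the loop variable so the `sms` dataframe loaded earlier is not overwritten
for index, message in enumerate(sms_r_train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1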

In [256]:
counts_train_df = pd.DataFrame(word_counts_per_sms)

In [258]:
counts_train_df.head()

Unnamed: 0,pansy,soryda,smash,costa,sec,2morrow,write,city,nickey,ts,...,picked,dual,toilet,maga,mens,news,09066380611,expects,missin,retired
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
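
The structure of this word-count table may be easier to see on a tiny input. The sketch below builds the same kind of table (one row per message, one column per vocabulary word) for two made-up tokenized messages, using `collections.Counter` from the standard library instead of the dictionary of lists above. The names `toy_messages`, `toy_vocabulary`, and `rows` are illustrative only.

```python
from collections import Counter

# Made-up tokenized messages standing in for sms_r_train['SMS']
toy_messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'now', 'now'],
]

# Sorted list of unique words, so the column order is reproducible
toy_vocabulary = sorted({word for tokens in toy_messages for word in tokens})

# One row of counts per message, aligned with toy_vocabulary
rows = []
for tokens in toy_messages:
    counts = Counter(tokens)  # missing words count as 0
    rows.append([counts[word] for word in toy_vocabulary])

print(toy_vocabulary)
for row in rows:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero, just as in the table above.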


In [259]:
train_clean_df = pd.concat([sms_r_train,counts_train_df], axis = 1)

In [260]:
train_clean_df

Unnamed: 0,Label,SMS,pansy,soryda,smash,costa,sec,2morrow,write,city,...,picked,dual,toilet,maga,mens,news,09066380611,expects,missin,retired
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,ham,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Creating the Spam Filter

First we calculate the prior probabilities for our classifier.

In [261]:
# Calculate the priors P(Spam) and P(Ham) as the fraction of training messages with each label

p_spam_train = len(train_clean_df[train_clean_df['Label'] == 'spam']) / len(train_clean_df)
p_ham_train  = len(train_clean_df[train_clean_df['Label'] == 'ham']) / len(train_clean_df)

In [262]:
p_spam_train

0.13458950201884254

In [263]:
p_ham_train

0.8654104979811574

In [264]:
# Now we calculate the values needed for the Laplace smoothing formula: N_ham, N_spam, and N_vocabulary

N_ham = train_clean_df[train_clean_df['Label'] == 'ham']['SMS'].apply(len).sum()

N_spam = train_clean_df[train_clean_df['Label'] == 'spam']['SMS'].apply(len).sum()

N_vocabulary = len(vocabulary)

alpha = 1

print('The total number of words in non-spam messages (N_ham) is {}'.format(N_ham))

print('The total number of words in spam messages (N_spam) is {}'.format(N_spam))

print('The total number of unique words in the training set (N_vocabulary) is {}'.format(N_vocabulary))

print('The smoothing parameter (alpha) is {}'.format(alpha))


The total number of words in non-spam messages (N_ham) is 57237
The total number of words in spam messages (N_spam) is 15190
The total number of unique words in the training set (N_vocabulary) is 7783
The smoothing parameter (alpha) is 1


Now that we have the initial parameters for our formula, the next step is to calculate beforehand the probabilities of every word in the vocabulary, given ham and given spam.
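
For reference, the smoothed per-word probabilities take the usual additive (Laplace) smoothing form, written here with the notebook's variable names. The symbol $N_{w_i|Spam}$ is introduced here for the number of times the word $w_i$ occurs in spam messages (`n_wi_spam` in the code), and likewise $N_{w_i|Ham}$ for ham:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

With $\alpha = 1$, a word that never occurs in spam still gets the small non-zero probability $1 / (N_{Spam} + N_{Vocabulary})$, so a single unseen word cannot force the whole product to zero.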

In [265]:
# Create an empty dictionary to store all the probabilities in the vocabulary.

ham_voc = {unique_word: 0 for unique_word in vocabulary}
spam_voc = {unique_word: 0 for unique_word in vocabulary}

In [266]:
spam_train = train_clean_df[train_clean_df['Label'] == 'spam']
ham_train  = train_clean_df[train_clean_df['Label'] == 'ham']

In [267]:
#  Here we find the probabilities for all words in spam sms
for unique_word in vocabulary:
    n_wi_spam = spam_train[unique_word].sum()
    p_wi_spam = (n_wi_spam + alpha)/(N_spam + (alpha * N_vocabulary))
    spam_voc[unique_word] = p_wi_spam
    
    
    # We do the same for ham
    n_wi_ham = ham_train[unique_word].sum()
    p_wi_ham = (n_wi_ham + alpha)/(N_ham + (alpha * N_vocabulary))
    ham_voc[unique_word] = p_wi_ham
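
As a sanity check on the smoothing arithmetic, the following sketch applies the same expression to a made-up four-word vocabulary. All names and numbers here are hypothetical toy values, not taken from the SMS data.

```python
from collections import Counter

# Made-up tokens from all spam messages in a toy training set
toy_spam_tokens = ['free', 'prize', 'free', 'call']
# Toy vocabulary; 'hello' never occurs in the spam tokens
toy_vocabulary = ['call', 'free', 'hello', 'prize']
toy_alpha = 1

counts = Counter(toy_spam_tokens)
n_spam = len(toy_spam_tokens)   # total words in spam: 4
n_vocab = len(toy_vocabulary)   # unique words: 4

# Same expression as p_wi_spam above, applied to the toy values
p_word_given_spam = {
    word: (counts[word] + toy_alpha) / (n_spam + toy_alpha * n_vocab)
    for word in toy_vocabulary
}

print(p_word_given_spam)
# {'call': 0.25, 'free': 0.375, 'hello': 0.125, 'prize': 0.25}
print(sum(p_word_given_spam.values()))
# 1.0
```

The unseen word `hello` gets 1/8 rather than 0, and the smoothed probabilities still sum to 1 over the vocabulary.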
    
    

In [268]:
sorted(spam_voc.items(), key=lambda x: x[1], reverse= True)

[('to', 0.023810560222870324),
 ('a', 0.013407043050537588),
 ('call', 0.012623514560571106),
 ('you', 0.011143516301745527),
 ('your', 0.009228224437383015),
 ('free', 0.00766116745745005),
 ('now', 0.0073999912941278894),
 ('2', 0.007225873851913115),
 ('the', 0.006877638967483567),
 ('for', 0.006703521525268794),
 ('or', 0.006442345361946633),
 ('is', 0.005963522395856005),
 ('txt', 0.0057458755930875375),
 ('u', 0.005658816871980151),
 ('ur', 0.005092935184782136),
 ('on', 0.004918817742567362),
 ('have', 0.004875288382013668),
 ('4', 0.0045705828581378135),
 ('stop', 0.004483524137030427),
 ('from', 0.004439994776476734),
 ('and', 0.00426587733426196),
 ('mobile', 0.00426587733426196),
 ('text', 0.004091759892047186),
 ('claim', 0.003961171810386106),
 ('1', 0.003961171810386106),
 ('with', 0.0038741130892787183),
 ('reply', 0.0038741130892787183),
 ('www', 0.0034823488442954774),
 ('of', 0.0034823488442954774),
 ('prize', 0.003351760762634397),
 ('t', 0.0031341139598659294),
 ...]

## Creating the classifier function

Here we wrap everything into a function. The function cleans and splits an incoming message, looks up each word in both dictionaries, and multiplies the stored probabilities together with the class priors to score the message as spam and as ham. Words that are not in the vocabulary are ignored. The scores follow these formulas:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}
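
One practical note on these products: multiplying many probabilities smaller than 1 gives very small numbers, and for long inputs the result can underflow to zero in floating point. A common workaround, which this notebook does not use, is to compare sums of logarithms instead, since the logarithm preserves the ordering of the two scores. The sketch below illustrates the idea with made-up priors and likelihoods; `toy_prior`, `toy_likelihood`, and `log_score` are hypothetical names.

```python
import math

# Made-up priors and smoothed likelihoods for a two-word toy vocabulary
toy_prior = {'spam': 0.2, 'ham': 0.8}
toy_likelihood = {
    'spam': {'free': 0.5, 'hello': 0.1},
    'ham': {'free': 0.1, 'hello': 0.5},
}

def log_score(tokens, label):
    """Return log P(label) plus the sum of log P(word | label), skipping unknown words."""
    score = math.log(toy_prior[label])
    for word in tokens:
        if word in toy_likelihood[label]:
            score += math.log(toy_likelihood[label][word])
    return score

tokens = ['free', 'free', 'hello']
scores = {label: log_score(tokens, label) for label in toy_prior}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Here the plain products would be 0.2 · 0.5 · 0.5 · 0.1 = 0.005 for spam and 0.8 · 0.1 · 0.1 · 0.5 = 0.004 for ham, so both approaches pick spam. SMS messages are short, so the direct product used in the rest of this notebook works in practice.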

In [269]:
import re

def classify_print(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam = 1
    p_ham = 1

    for word in message:
        if word in spam_voc:
            p_spam *= spam_voc[word]
        if word in ham_voc:
            p_ham *= ham_voc[word]

    p_spam_given_message = p_spam_train * p_spam
    p_ham_given_message = p_ham_train * p_ham
        
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')




In [270]:
classify_print('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.936804902858988e-27
Label: Spam


Since the function above only prints its result and does not return a value, we modify it for testing purposes so that it returns the label as a string.

In [271]:

import re

def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam_train
    p_ham_given_message = p_ham_train
    
    for word in message:
        if word in spam_voc:
            p_spam_given_message *= spam_voc[word]
        if word in ham_voc:
            p_ham_given_message *= ham_voc[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now we can test the accuracy of the algorithm on the test data that we prepared before.

In [273]:
sms_r_test['predicted'] = sms_r_test['SMS'].apply(classify)

sms_r_test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [297]:
correct = 0
total = sms_r_test.shape[0]

for index, row in sms_r_test.iterrows():
    if row['Label'] == row['predicted']:
        correct +=1
        
acc = round((correct/total), 3)*100

print('The model classified {} of {} Messages correctly. The accuracy for this model is {}%'.format(correct, total, acc))
    

The model classified 1100 of 1114 Messages correctly. The accuracy for this model is 98.7%
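
Because about 87\% of the test messages are ham, a classifier that always answers `ham` would already be right about 87\% of the time, so it can help to look at which kind of mistake the model makes and not only at the overall accuracy. The sketch below counts every (true label, predicted label) pair for a handful of made-up labels. `toy_true` and `toy_pred` are hypothetical stand-ins for the `Label` and `predicted` columns and are not results from this notebook.

```python
from collections import Counter

# Made-up labels standing in for sms_r_test['Label'] and sms_r_test['predicted']
toy_true = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
toy_pred = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Count each (true, predicted) combination
pairs = Counter(zip(toy_true, toy_pred))

# Accuracy is the share of pairs where both labels agree
accuracy = sum(n for (true, pred), n in pairs.items() if true == pred) / len(toy_true)

for (true, pred), n in sorted(pairs.items()):
    print('true={:<4} predicted={:<4} count={}'.format(true, pred, n))
print('accuracy = {:.3f}'.format(accuracy))
```

In a spam filter the two error types are not equally costly: a ham message flagged as spam may never be read, while a spam message that slips through is only an annoyance.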
