# Spam Filter for SMS Messages

In this project, we'll create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

In order to classify messages as spam or non-spam, the computer:
1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
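
The decision rule in step 3 can be sketched as a small function. This is only an illustration: the name `decide` and the example scores are hypothetical and are not part of the project code.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Pick a label by comparing the two class scores (step 3 above)."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    # Equal scores: the filter cannot decide on its own
    return 'needs human classification'

# Hypothetical scores, chosen only to exercise the three branches
print(decide(0.7, 0.3))
print(decide(0.2, 0.8))
print(decide(0.5, 0.5))
```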

Therefore, our first task is to "teach" the computer how to classify messages. To do that, we'll use an existing dataset of 5,572 SMS messages that have already been classified by humans. The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

# Exploring the Dataset

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', 
                  header=None, names=['Label', 'SMS'])
print(sms.shape)
sms.head()

(5572, 2)


|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
sms['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

As we can see, about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

# Training and Test Set

Once our spam filter is done, we'll need to test how well it classifies new messages. To make that possible, we first split our dataset into two sets before building the filter:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
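
As a quick check of the split sizes above, the arithmetic can be reproduced in a few lines. The variable names here are ours, chosen for illustration.

```python
n_messages = 5572        # total number of rows, as reported by sms.shape
train_fraction = 0.8     # share of the data kept for training

n_train = round(n_messages * train_fraction)   # 4457.6 rounds to 4458
n_test = n_messages - n_train

print(n_train, n_test)   # prints: 4458 1114
```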

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

First, let's randomize the entire dataset and split it into a training and a test set.

In [4]:
sms_random = sms.sample(frac=1, random_state=1)
sms_train = sms_random.iloc[:4458,:].reset_index()
sms_test = sms_random.iloc[4458:,:].reset_index()
print(sms_train.head(3))
print(sms_test.head(3))

   index Label                                            SMS
0   1078   ham                   Yep, by the pretty sculpture
1   4028   ham  Yes, princess. Are you going to make me moan?
2    958   ham                     Welp apparently he retired
   index Label                                                SMS
0   2131   ham          Later i guess. I needa do mcat study too.
1   3418   ham             But i haf enuff space got like 4 mb...
2   3424  spam  Had your mobile 10 mths? Update to latest Oran...


In [5]:
print(sms_train['Label'].value_counts(normalize=True)*100)
print(sms_test['Label'].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


The percentages of spam and ham in the training and test sets are close to those in the full dataset (about 87% ham and 13% spam), so both sets are representative samples.

# Letter Case and Punctuation

Since the Naive Bayes algorithm uses all the words in all the messages in the dataset, we first need to clean the data and bring it into a format that lets us easily extract the information we need.

First, we'll remove all punctuation from the messages and convert every letter to lower case.

In [6]:
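# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
sms_train['SMS'] = sms_train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()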
sms_train.head(3)

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | yep by the pretty sculpture |
| 1 | 4028 | ham | yes princess are you going to make me moan |
| 2 | 958 | ham | welp apparently he retired |
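
To see what this cleaning step does to a single message, here is a minimal sketch using the standard `re` module. The example message is made up for illustration. Note that `\W` matches every non-word character, so an apostrophe splits "You've" into two tokens.

```python
import re

# Made-up message, used only to illustrate the cleaning step
raw = "WINNER!! You've won a prize: call 0800-123 now."

# Replace every non-word character with a space, then lower-case
cleaned = re.sub(r'\W', ' ', raw).lower()

# Splitting on whitespace collapses the runs of spaces left by the substitution
tokens = cleaned.split()
print(tokens)
```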


# Creating the Vocabulary

The end goal of this data cleaning process is to bring our training set into a format where every word is a column, so that we can count how many times each word appears in each message. As a first step, we'll create a list of all the unique words that occur in the messages of our training set.

In [7]:
sms_train['SMS'] = sms_train['SMS'].str.split()
vocabulary = []
for m in sms_train['SMS']:
    for w in m:
        vocabulary.append(w)
# Transforming the list into a set to remove duplicates from the list
vocabulary_set = set(vocabulary)
vocabulary = list(vocabulary_set)
print(vocabulary[:20])

['okey', 'kettoda', 'uawake', 'truly', 'jsco', 'med', 'sweetest', 'edward', 'sisters', 'goodnight', 'secret', 'paru', 'rebtel', 'rentl', 'more', 'hallaq', 'placed', 'sir', 'props', 'desparate']
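
The same idea on a tiny, made-up corpus may make the result easier to inspect. This sketch uses a set comprehension and `sorted`, which gives a deterministic order (the order of `list(set(...))` above can differ between runs).

```python
# Two made-up, already tokenized messages
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# Collect every word once, then sort for a reproducible order
toy_vocabulary = sorted({word for message in toy_messages for word in message})
print(toy_vocabulary)
```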


# Creating the Final Training Set

In [8]:
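# One list of per-message counts for every word in the vocabulary
word_counts_per_sms = {unique_word: [0] * len(sms_train['SMS']) for unique_word in vocabulary}

# Use 'message' as the loop variable so the sms DataFrame isn't overwritten
for index, message in enumerate(sms_train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1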

In [9]:
word_counts_df = pd.DataFrame(word_counts_per_sms)
word_counts_df.head(3)

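|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |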


In [10]:
train_set = pd.concat([sms_train, word_counts_df], axis=1)
train_set.head(2)

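|   | index | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1078 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4028 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |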
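
Because the real table has thousands of mostly zero columns, the transformation is easier to follow on a small, made-up example. The sketch below applies the same counting logic to two toy messages; all names prefixed with `toy_` are hypothetical.

```python
import pandas as pd

# Two made-up, already tokenized messages
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['secret', 'party', 'secret'],
]
toy_vocabulary = sorted({w for m in toy_messages for w in m})

# One column per word, one row per message, filled with zeros at first
toy_counts = {w: [0] * len(toy_messages) for w in toy_vocabulary}
for i, message in enumerate(toy_messages):
    for w in message:
        toy_counts[w][i] += 1

toy_counts_df = pd.DataFrame(toy_counts)
print(toy_counts_df)
```

Each row sums to the length of the corresponding message, which is a handy sanity check on the real `word_counts_df` as well.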


# Calculating Constants

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter.
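First, we need to calculate the constants that appear in the Naive Bayes equations: P(Spam), P(Ham), N_Spam, N_Ham and N_Vocabulary. We also set the smoothing parameter alpha to 1 (Laplace smoothing):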

In [11]:
p_spam = len(train_set[train_set['Label'] == 'spam'])/len(train_set['Label'])
p_ham = len(train_set[train_set['Label'] == 'ham'])/len(train_set['Label'])
print(p_spam, p_ham)

0.13458950201884254 0.8654104979811574


In [12]:
# Number of words in each message (each SMS is already a list of words)
train_set['len_SMS'] = train_set['SMS'].apply(len)
train_set_spam = train_set[train_set['Label'] == 'spam']
train_set_ham = train_set[train_set['Label'] == 'ham']
N_spam = train_set_spam['len_SMS'].sum()
N_ham = train_set_ham['len_SMS'].sum()
N_vocabulary = len(vocabulary)
alpha = 1
print(N_spam, N_ham, N_vocabulary)

15190 57237 7783


# Calculating Parameters
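
For reference, the parameters computed in this section correspond to the standard multinomial Naive Bayes equations with additive smoothing. The symbols mirror the variable names used in the code (`N_w_spam`, `N_spam`, `N_vocabulary`, `alpha`); the notation is our own shorthand.

$$
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

The equations for ham are identical, with Ham in place of Spam. Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, and $N_{Vocabulary}$ is the number of unique words in the training set.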

In [13]:
spam_params = {}
ham_params = {}
for w in vocabulary:
    #Calculating the parameters for spam messages
    N_w_spam = train_set_spam[w].sum()
    P_w_spam = (N_w_spam + alpha) / (N_spam + (alpha * N_vocabulary))
    spam_params[w] = P_w_spam
    
    #Calculating the parameters for ham messages
    N_w_ham = train_set_ham[w].sum()
    P_w_ham = (N_w_ham + alpha) / (N_ham + (alpha * N_vocabulary))
    ham_params[w] = P_w_ham
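
To get a feel for the size of these parameters, the sketch below plugs the constants printed earlier (`N_spam = 15190`, `N_vocabulary = 7783`, `alpha = 1`) into the smoothed formula. The helper name `smoothed_probability` and the word counts passed to it are hypothetical.

```python
alpha = 1
N_spam = 15190        # total number of words in spam messages (printed above)
N_vocabulary = 7783   # number of unique words in the training set (printed above)

def smoothed_probability(n_w_given_spam):
    """P(w|Spam) with additive smoothing, as in the loop above."""
    return (n_w_given_spam + alpha) / (N_spam + alpha * N_vocabulary)

# A word that never occurs in spam still gets a small non-zero probability,
# so a single unseen word cannot force the whole product to zero
print(smoothed_probability(0))

# A word seen 10 times in spam (hypothetical count) gets a larger one
print(smoothed_probability(10))
```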

# Classifying a New Message

In [14]:
def classify(message):
    # Formatting the message to fit in the calculations
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    # Calculating P(Spam|w1,w2,...,wn) and P(Ham|w1,w2,...,wn) 
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for w in message:
        if w in spam_params:
            p_spam_given_message *= spam_params[w]
        if w in ham_params:
            p_ham_given_message *= ham_params[w]
    
    # Printing the pseudo-probabilities of being spam or ham message
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    # Comparing these pseudo-probabilities to classify the message
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [15]:
# Testing our function (classify prints its results and returns nothing,
# so we call it directly instead of wrapping it in print)
classify('WINNER!! This is the secret code to unlock the money: C3421.')
print('\n')
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
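
The values printed above are already very small (around 1e-25) for short messages. For much longer messages, a product of many small probabilities could in principle underflow to zero. A common variation, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. The sketch below is self-contained: the function name `classify_log` and the toy parameter values are hypothetical stand-ins for `p_spam`, `p_ham`, `spam_params` and `ham_params`.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_params, ham_params):
    """Same decision rule as classify, but computed with log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for w in words:
        # Words outside the vocabulary are ignored, as in classify
        if w in spam_params:
            log_spam += math.log(spam_params[w])
        if w in ham_params:
            log_ham += math.log(ham_params[w])

    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'needs human classification'

# Toy parameters, made up for illustration only
toy_spam_params = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_ham_params = {'win': 0.002, 'money': 0.003, 'see': 0.03}

print(classify_log('Win money!!', 0.13, 0.87, toy_spam_params, toy_ham_params))
print(classify_log('See you', 0.13, 0.87, toy_spam_params, toy_ham_params))
```

Because the logarithm is strictly increasing, comparing the sums of logs gives the same label as comparing the products, as long as the products themselves have not underflowed.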


# Measuring the Spam Filter's Accuracy

The two results above look promising, but let's see how well the filter does on our test set, which has 1,114 messages.

We'll start by rewriting the function above so that it returns classification labels instead of printing them.

In [16]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for w in message:
        if w in spam_params:
            p_spam_given_message *= spam_params[w]

        if w in ham_params:
            p_ham_given_message *= ham_params[w]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now, let's create a new column with the predicted results of the test set using our algorithm.

In [17]:
sms_test['predicted'] = sms_test['SMS'].apply(classify_test_set)
sms_test.head()

|   | index | Label | SMS | predicted |
|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | ham |


Finally, we can measure the accuracy of the spam filter.

In [18]:
correct = 0
total = len(sms_test['Label'])
# iterrows() yields (index, Series) tuples; we only need the row Series
for _, row in sms_test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

accuracy = (correct/total)*100
print(accuracy)

98.74326750448833


The accuracy is about 98.74%, well above our 80% target. The spam filter looked at 1,114 messages that it had not seen in training and classified 1,100 of them correctly.
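
The same accuracy calculation can also be written without an explicit loop by comparing the two columns directly. Here is a minimal sketch on a made-up four-row DataFrame; on the real data, `sms_test` would take the place of `toy_test`.

```python
import pandas as pd

# Made-up labels and predictions, for illustration only
toy_test = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# The comparison gives a boolean Series; its mean is the share of matches
toy_accuracy = (toy_test['Label'] == toy_test['predicted']).mean() * 100
print(toy_accuracy)   # prints: 75.0
```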

# Next Steps

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached an accuracy of 98.74% on our test set, far above our initial goal of more than 80%.

Next steps include:

- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm got them wrong.
- Make the filtering process more complex by making the algorithm sensitive to letter case.
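
As a starting point for the first item, the misclassified rows can be pulled out with a boolean filter, and `pd.crosstab` shows which kind of error is more common. The sketch below runs on a made-up DataFrame; on the real data, `sms_test` would take the place of `toy_test`.

```python
import pandas as pd

# Made-up test results, for illustration only
toy_test = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam'],
    'SMS':       ['see you soon', 'win cash now', 'call me now', 'free prize'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# Rows where the prediction disagrees with the human label
misclassified = toy_test[toy_test['Label'] != toy_test['predicted']]
print(misclassified)

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])
print(confusion)
```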