### Building a Spam Filter Using the Naive Bayes Algorithm

![Image](https://www.moneycrashers.com/wp-content/uploads/2017/07/ways-stop-spam-email-message-robocalls-2136x1427.jpg)


The aim of this project is to build a spam filter for SMS messages using the Naive Bayes algorithm.

To perform this task, I use a dataset of 5,572 messages that are already classified by humans. This dataset has been prepared by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

I will split the dataset into two parts: one for training the algorithm and another for testing it. After calculating the word probabilities given spam or ham messages on the training set, I will apply these probabilities to classify the SMS messages in the test set.
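As a compact reference, the quantities this approach compares can be written as follows. This is the standard multinomial Naive Bayes formulation with additive smoothing. The symbols are my own notation, chosen to mirror the variables `n_spam`, `n_vocabulary` and `alpha` used in the code, and the same expressions apply to ham with the labels swapped.

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
$$

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set. A message is labeled spam when the spam score is larger than the ham score.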

In [1]:
import pandas as pd

#Read the dataset
collection = pd.read_csv("SMSSpamCollection",sep='\t', header=None, names=['Label', 'SMS'])

collection.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
#Compute the Spam and Ham probabilities
p_ham = collection["Label"].value_counts()["ham"] / collection["Label"].count()
p_spam = 1 - p_ham

print("p_spam:",round(p_spam,4))
print("p_ham:", round(p_ham,4))

p_spam: 0.1341
p_ham: 0.8659


In [3]:
#Randomization of the entire dataset
collection_randomized = collection.sample(frac=1, random_state=1)

#Calculate index for split
training_test_index = round(len(collection_randomized) * 0.8)

#Split the dataset into a training and a test set
training_set = collection_randomized[:training_test_index].reset_index(drop=True)
test_set = collection_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [4]:
#Calculate the ham and spam probabilities for the training set
p_ham_training = training_set["Label"].value_counts()["ham"] / training_set["Label"].count()
p_spam_training = 1 - p_ham_training

print("p_ham_training:",p_ham_training, "\np_spam_training:", p_spam_training)

p_ham_training: 0.8654104979811574 
p_spam_training: 0.13458950201884257


From the results above, we can see that the randomization of the dataset has been successful.
The ham and spam proportions in the training set (about 86.5% and 13.5%) are nearly identical to those of the full SMS collection (86.6% and 13.4%).
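The same kind of check can be run on every split at once. The sketch below uses a small toy DataFrame rather than the real collection, and the names `toy`, `toy_train`, `toy_test` and `label_shares` are hypothetical helpers introduced only for illustration.

```python
import pandas as pd

# Toy labels standing in for the real data (illustrative values only)
toy = pd.DataFrame({"Label": ["ham"] * 8 + ["spam"] * 2})

# Shuffle, then split 80/20, mirroring the approach used above
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
toy_train = shuffled[:cut].reset_index(drop=True)
toy_test = shuffled[cut:].reset_index(drop=True)

def label_shares(df):
    """Return the share of each label as a dict, e.g. {'ham': 0.8, 'spam': 0.2}."""
    return df["Label"].value_counts(normalize=True).to_dict()

# Print the label shares of the full data and of each split side by side
for name, part in [("all", toy), ("train", toy_train), ("test", toy_test)]:
    print(name, label_shares(part))
```

With only ten rows the shares of the splits can differ noticeably from the full data; on a few thousand messages they should be close.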

In [5]:
# Before transforming the dataset to be easy to handle, 
# we need to clean the dataset

# Removal of the punctuation from the SMS column
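# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_set["SMS"] = training_set["SMS"].str.replace(r'\W', ' ', regex=True)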

# Transforming every letter to lowercase
training_set["SMS"] = training_set["SMS"].str.lower()

In [6]:
#Creation of a vocabulary in the training set containing every unique word
training_set["SMS"] = training_set["SMS"].str.split()

#Creation of an empty list that will contain the vocabulary
vocabulary = []

#Iteration over each message and addition of every new word to the vocabulary list
for sms in training_set["SMS"]:
    for word in sms:
        vocabulary.append(word)

#Transform the vocabulary list into a set to remove the duplicates from the vocabulary list
#Then transform it back into a list
vocabulary = list(set(vocabulary))

print("Number of unique words:", len(vocabulary))

Number of unique words: 7783


The training dataset contains 7,783 unique words.

In [7]:
#Creation of a dictionary that we'll use to count words per SMS
word_counts_per_sms = {unique_word:[0] * len(training_set['SMS']) 
                       for unique_word in vocabulary }

#Loop over the SMS column to count words
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [8]:
#Transform the word counts dictionary into a DataFrame
word_counts = pd.DataFrame(word_counts_per_sms)

#Concatenation of the word counts DF with the training DF
training_set_clean = pd.concat([training_set,word_counts], axis=1)

In [9]:
training_set_clean.head()

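| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |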
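A word-count table of this shape can also be built without pre-allocating a list per vocabulary word. The sketch below is an alternative I am suggesting, not the approach used in this notebook: it relies on `collections.Counter` and lets pandas align the columns. The `toy_sms` messages are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Two tokenized toy messages (illustrative, not taken from the dataset)
toy_sms = [["free", "prize", "free"], ["see", "you", "soon"]]

# One Counter per message; pandas creates one column per distinct word
# and leaves NaN where a word is absent, which we replace with 0
counts = pd.DataFrame([Counter(tokens) for tokens in toy_sms]).fillna(0).astype(int)

print(counts)
```

Each row is a message and each column a word, as in the table above.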


In [10]:
# Isolating spam and ham messages
ham_messages = training_set_clean[training_set_clean["Label"] == "ham"]
spam_messages = training_set_clean[training_set_clean["Label"] == "spam"]

# Calculation of the ham and spam probabilities in the training dataset
p_ham_training = len(ham_messages) / len(training_set_clean)
p_spam_training = 1 - p_ham_training

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1
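To see why the smoothing parameter matters, here is a small worked example with made-up counts. The function name `smoothed_probability` and the numbers are assumptions for illustration only. Without smoothing, a word that never appears in a class gets probability 0 and wipes out the whole product for that class.

```python
def smoothed_probability(n_word, n_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical class containing 10 words in total, with a vocabulary of 5 words
unseen = smoothed_probability(n_word=0, n_class=10, n_vocabulary=5)  # (0 + 1) / (10 + 5)
seen = smoothed_probability(n_word=3, n_class=10, n_vocabulary=5)    # (3 + 1) / (10 + 5)

# With alpha=0 the unseen word would get probability 0
unsmoothed = smoothed_probability(n_word=0, n_class=10, n_vocabulary=5, alpha=0)

print(unseen, seen, unsmoothed)
```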

In [11]:
# Initialisation of 2 dictionaries we'll use to compute the word
# probabilities given ham or spam message

parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham = {unique_word: 0 for unique_word in vocabulary}

# Iterate over the vocabulary and compute probabilities given ham and spam
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham 

In [12]:
# Creation of a classification function that will classify messages as
# spam or ham depending on the computed probabilities

import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Initiate the probabilities
    p_spam_given_message = p_spam_training
    p_ham_given_message = p_ham_training
    
    # Looping over the message to compute probabilities
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
    
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [14]:
# Modification of the function to return labels
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors estimated on the training set only
    p_spam_given_message = p_spam_training
    p_ham_given_message = p_ham_training

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [13]:
# Test 1 
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300844e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [17]:
# Test 2 
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888126e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
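The scores above are already as small as 1e-27 for a single short message, because many probabilities below 1 are multiplied together. For much longer messages such products can underflow to 0.0 in floating point. A common remedy, which is not what this notebook does, is to add logarithms instead of multiplying. The sketch below uses hypothetical parameters (`toy_prior`, `toy_params`), not the values learned above.

```python
import math

# Hypothetical parameters for illustration (not the values learned above)
toy_prior = {"spam": 0.13, "ham": 0.87}
toy_params = {
    "spam": {"win": 0.02, "money": 0.015, "see": 0.001},
    "ham": {"win": 0.001, "money": 0.002, "see": 0.01},
}

def log_score(words, label):
    """Sum of the log prior and the log word likelihoods; unknown words are skipped."""
    score = math.log(toy_prior[label])
    for word in words:
        if word in toy_params[label]:
            score += math.log(toy_params[label][word])
    return score

def classify_log(words):
    """Compare log scores; the ordering matches that of the raw products."""
    spam = log_score(words, "spam")
    ham = log_score(words, "ham")
    if spam > ham:
        return "spam"
    if ham > spam:
        return "ham"
    return "needs human classification"

print(classify_log(["win", "money"]))
print(classify_log(["see", "you"]))
# A 500-word message: the raw product 0.02 ** 500 underflows to 0.0,
# but the log score stays finite and the comparison still works
print(classify_log(["win"] * 500))
```

Because the logarithm is monotonic, comparing log scores yields the same label as comparing the products whenever the products do not underflow.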


In [15]:
# Application of the classification function to the test dataset
test_set["predicted"] = test_set['SMS'].apply(classify_test_set)
test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [16]:
# Measure the accuracy of the classifier on the test set
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
accuracy = correct / total

print(" Correctly identified:", correct, "\n", "Incorrectly identified:",
      total - correct,"\n", "Accuracy:", round(accuracy,4))

 Correctly identified: 1100 
 Incorrectly identified: 14 
 Accuracy: 0.9874
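Accuracy alone does not show which kind of error the filter makes. For a spam filter, a ham message flagged as spam is usually more costly than a spam message that slips through. One way to break the errors down is a confusion table built with `pd.crosstab`. The sketch below runs on made-up labels (`toy_results`), not on the test set above.

```python
import pandas as pd

# Toy labels and predictions (illustrative values only)
toy_results = pd.DataFrame({
    "Label":     ["ham", "ham", "ham", "spam", "spam", "spam"],
    "predicted": ["ham", "ham", "spam", "spam", "spam", "ham"],
})

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(toy_results["Label"], toy_results["predicted"])
print(confusion)

true_spam = confusion.loc["spam", "spam"]
# Precision: of the messages flagged as spam, the share that really is spam
precision = true_spam / confusion["spam"].sum()
# Recall: of the actual spam messages, the share that was caught
recall = true_spam / confusion.loc["spam"].sum()

print("precision:", round(precision, 4), "recall:", round(recall, 4))
```

On the real test set, any message returned as 'needs human classification' would appear as an extra column in the table.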


## Conclusion
The goal of this project was to build a spam filter that classifies new messages based on their content.

The filter, built with the multinomial Naive Bayes algorithm, correctly classified 1,100 of the 1,114 messages in the test set, an accuracy of 98.74%. For comparison, about 87% of the messages in the full dataset are ham, so a filter that labels every message as ham would be expected to reach roughly that accuracy.