# Introduction

Building on my recent studies of probability and conditional probability, today I will attempt to build a spam filter using Naive Bayes.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [2]:
# importing libraries

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import urllib.request
import re

from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile

In [3]:
# loading the dataset

uci_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/"
data_name = "smsspamcollection.zip"
response = urllib.request.urlopen(uci_url + urllib.request.quote(data_name))

zipfile = ZipFile(BytesIO(response.read()))

data = TextIOWrapper(zipfile.open('SMSSpamCollection'), encoding= 'utf-8')

df = pd.read_csv(data, header= None, sep='\t')

df

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
df.rename(columns=
          {0:'label',
           1:'SMS'},
          inplace= True)

In [5]:
df.describe()

Unnamed: 0,label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [6]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [7]:
stats= df['label'].value_counts(normalize=True)*100
stats

ham     86.593683
spam    13.406317
Name: label, dtype: float64

We can see that about 86.6% of the entries in the dataset are `ham` (non-spam) and the remaining 13.4% are `spam`.

The most common message in the dataset is "Sorry, I'll call later", which appears 30 times. It is clearly not a spam message and is most likely a quick-reply SMS text.

# Training and Test Set

Before we start building the spam filter, it is important to decide how we will test it. Creating the spam-detecting software first and designing the test afterwards could introduce bias into the test design.

To test the spam filter, we're first going to split our dataset into two categories:

- A **training set**, which we'll use to "train" the computer how to classify messages.
- A **test set**, which we'll use to test how good the spam filter is with classifying new messages.

We will use an 80-20 split between training and test data. The split also lets us measure how good the spam filter actually is at predicting whether a message is spam. Because the test set (20% of the data) has already been classified by a human, once the spam filter is ready we can have it treat these messages as new and compare the algorithm's classification with the human one.


In [8]:
# randomizing the dataset
data_random = df.sample(frac= 1, random_state= 1)

# calculating index for split
train_size = 0.8
train_end = round(len(data_random) * train_size)

# Train/Test split
df_train = data_random[:train_end].reset_index(drop= True)
df_test = data_random[train_end:].reset_index(drop= True)

print(df_train.shape)
print(df_test.shape)
print('\n')

print("The percentage of spam in train set is", "\n", df_train['label'].value_counts(normalize=True)*100)
print("\n")
print("The percentage of spam in test set is", "\n", df_test['label'].value_counts(normalize=True)*100)


(4458, 2)
(1114, 2)


The percentage of spam in train set is 
 ham     86.54105
spam    13.45895
Name: label, dtype: float64


The percentage of spam in test set is 
 ham     86.804309
spam    13.195691
Name: label, dtype: float64


In [9]:
df_train.head()

Unnamed: 0,label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


Both the training and test sets have similar percentages of `ham` and `spam` messages. They preserve the ratio found in the initial dataset, which is what we want for training and evaluating the algorithm.

# Data Cleaning
## Letter Case and Punctuation

Before applying the Naive Bayes algorithm, we need to bring the data into a form that is easy to work with.

A message may be written entirely in capitals, or contain punctuation or non-Latin characters. To handle this, we will normalize the text in the `SMS` column, split each message into words, and transform it into a series of new columns, where each column represents a unique word from the vocabulary.

In [10]:
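# Cleaning the SMS column: replace non-word characters with spaces, lowercase,
# collapse repeated spaces, and strip leading/trailing whitespace.
# regex=True is passed explicitly because recent pandas versions treat the
# pattern as a literal string by default.

df_train['SMS'] = (df_train['SMS'].str.replace(r'\W', ' ', regex=True)
                                   .str.lower()
                                   .str.replace(r' +', ' ', regex=True)
                                   .str.strip())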
df_train.head()

Unnamed: 0,label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on da...
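
The same cleaning steps can be tried on a single string with only the standard library. This is a minimal sketch; the helper name `clean_sms` is hypothetical and not part of the notebook.

```python
import re

def clean_sms(text):
    """Replace non-word characters with spaces, collapse spaces, lowercase, strip."""
    text = re.sub(r"\W", " ", text)   # punctuation and symbols become spaces
    text = re.sub(r" +", " ", text)   # runs of spaces collapse to one space
    return text.lower().strip()

print(clean_sms("Yep, by the pretty sculpture"))
print(clean_sms("There's a card on da table..."))
```

Note that `\W` keeps letters such as `ü`, which is why they survive in the cleaned column above.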


# Creating the Vocabulary

Now that we have removed the punctuation and converted all letters to lowercase, our end goal is to transform the table above into one where each word has its own column that holds the number of times the word occurs in each `spam` and `ham` message.

In [11]:
# Splitting the words in the message in order to be able to iterate over them later
df_train['SMS'] = df_train['SMS'].str.split()

vocabulary = []

for sms in df_train['SMS']:
    for word in sms:
        vocabulary.append(word)

print(len(vocabulary))

72427


In [12]:
print("There are", len(vocabulary), "words in the vocabulary. Since the split was done for each message, it is very likely that many words appear more than once. Thus, we will remove those duplicates.")

There are 72427 words in the vocabulary. Since the split was done for each message, it is very likely that many words appear more than once. Thus, we will remove those duplicates.


In [13]:
# removing duplicates using set() function and then turning it back to a list
vocabulary = list(set(vocabulary))

print("There are ", len(vocabulary)," unique words in the vocabulary now.")

There are  7783  unique words in the vocabulary now.


# The Final Training Set

We will create a dictionary called `word_counts_per_message` where each key is a unique word from the vocabulary and each value is a list as long as the training set, with every element initialized to `0`.

Then we loop over `df_train['SMS']` together with the index of each message and, for every word in the message, increment the count stored at that index in the word's list.

Finally, the dictionary is transformed into a dataframe with one column per word and one row per message. The shared row index serves as the connection point for concatenating it with the `df_train` dataframe.
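
To see the layout on a small scale, here is a sketch of the same counting scheme applied to three made-up messages. The names `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical and exist only for this illustration.

```python
# Toy training set: three already-cleaned, already-split messages (made-up data).
toy_messages = [
    ["free", "prize", "free"],
    ["see", "you", "later"],
    ["free", "later"],
]

# Unique words; sorted only to make the printed order deterministic.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of the current message for every word it contains.
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

for word, counts in toy_counts.items():
    print(word, counts)
```

Each list reads as one column of the final table: `free` occurs twice in the first message and once in the third, so its column is `[2, 0, 1]`.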



In [14]:
word_counts_per_message = {unique_word: [0] * len(df_train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(df_train['SMS']):
    for word in sms:
        word_counts_per_message[word][index] += 1

# transforming the dictionary in a dataframe
word_counts = pd.DataFrame(word_counts_per_message)
print(word_counts.head())

   0  00  000  000pes  008704050406  0089  01223585334  02  0207  02072069400  \
0  0   0    0       0             0     0            0   0     0            0   
1  0   0    0       0             0     0            0   0     0            0   
2  0   0    0       0             0     0            0   0     0            0   
3  0   0    0       0             0     0            0   0     0            0   
4  0   0    0       0             0     0            0   0     0            0   

  ...  zindgi  zoe  zogtorius  zouk  zyada  é  ú1  ü  〨ud  鈥  
0 ...       0    0          0     0      0  0   0  0    0  0  
1 ...       0    0          0     0      0  0   0  0    0  0  
2 ...       0    0          0     0      0  0   0  0    0  0  
3 ...       0    0          0     0      0  0   0  0    0  0  
4 ...       0    0          0     0      0  0   0  2    0  0  

[5 rows x 7783 columns]


In [15]:
# validating the dictionary count

print('sculpture:', word_counts_per_message['sculpture'][0:5])
print('welp:', word_counts_per_message['welp'][0:5])


sculpture: [1, 0, 0, 0, 0]
welp: [0, 0, 1, 0, 0]


In [16]:
# Concatenating the dataframes so that we keep the label and SMS columns
df_train = pd.concat([df_train, word_counts], axis= 1)
df_train.head()

Unnamed: 0,label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [17]:
df_train.shape

(4458, 7785)

In [18]:
df_train.sum(numeric_only=True).sum()

72427

Above we confirmed that the total word count in the dataframe (72,427) is the same as the number of words in the vocabulary before eliminating the duplicates.

# Calculating Constants First

We're now done with cleaning the training set, and we can begin creating the spam filter. To classify a new message, the Naive Bayes algorithm needs to evaluate these two expressions:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Some of the terms in the four equations above have the same value for every new message. We can calculate these terms once and avoid repeating the computation when a new message comes in. Below, we'll use our training set to calculate:

\begin{equation} P(Spam)\ \text{and}\ P(Ham) \end{equation}

\begin{equation} N_{Spam},\ N_{Ham},\ N_{Vocabulary} \end{equation}

Here $N_{Spam}$ and $N_{Ham}$ are the total numbers of words in all spam and all ham messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary. We'll also use Laplace smoothing and set $\alpha = 1$.
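
As a worked example of the smoothed formula, the sketch below applies it to a tiny made-up spam corpus. The data and the helper `p_word_given_spam` are hypothetical; only the formula itself comes from the equations above.

```python
from collections import Counter

# Made-up data: every word occurrence seen in the spam training messages.
spam_words = ["free", "prize", "free", "call"]
# Made-up vocabulary built from ALL training messages (spam and ham).
toy_vocabulary = ["call", "free", "later", "prize", "see"]

alpha = 1
n_spam = len(spam_words)            # N_Spam: total words in spam messages
n_vocabulary = len(toy_vocabulary)  # N_Vocabulary: number of unique words
spam_counts = Counter(spam_words)   # N_{w_i|Spam} for each word

def p_word_given_spam(word):
    """Smoothed estimate (N_{w|Spam} + alpha) / (N_Spam + alpha * N_Vocabulary)."""
    return (spam_counts[word] + alpha) / (n_spam + alpha * n_vocabulary)

for word in toy_vocabulary:
    print(word, p_word_given_spam(word))

# The smoothed probabilities over the whole vocabulary still add up to 1.
print(sum(p_word_given_spam(word) for word in toy_vocabulary))
```

The word `later` never occurs in the toy spam messages, yet it gets probability 1/9 instead of 0. Without smoothing, a single unseen word would force the whole product for that class to zero.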

In [19]:
# isolating the spam and ham messages
spam_messages = df_train[df_train['label'] == 'spam']
ham_messages = df_train[df_train['label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(df_train)
p_ham = len(ham_messages) / len(df_train)

# Calculating NSpam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# Calculating NHam
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# Calculating NVocabulary
n_vocabulary = len(vocabulary)


# Laplace smoothing
alpha = 1

print("Probability of spam:", p_spam)
print("Probability of non-spam:", p_ham)
print('\n')
print('Number of words in spam messages:', n_spam)
print('Number of words in non-spam messages:', n_ham)
print('\n')
print('Number of words in the vocabulary:', n_vocabulary)

Probability of spam: 0.13458950201884254
Probability of non-spam: 0.8654104979811574


Number of words in spam messages: 15190
Number of words in non-spam messages: 57237


Number of words in the vocabulary: 7783


# Calculating Parameters

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter is a conditional probability value associated with one word in the vocabulary.

The parameters are calculated using the formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Because of some initial errors in my code, I decided to take a manual approach to calculating the parameters.

In [20]:
# # Test for one word (msg) the calculation of conditional probabilities
# test_word = 'msg'
# test_n_word_given_spam = spam_messages[test_word].sum()
# test_p_word_given_spam = (test_n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
# test_word_count_total = df_train[test_word].sum()

# test_n_word_given_ham = ham_messages[test_word].sum()
# test_p_word_given_ham = (test_n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)

# print("Word counts:")
# print('spam: ', test_n_word_given_spam)
# print('ham: ', test_n_word_given_ham)
# print('total: ', test_word_count_total)
# print('_________________')
# print('Proportional probabilities:')
# print('P(word|spam): ', test_p_word_given_spam)
# print('P(word|ham): ', test_p_word_given_ham)

In [21]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculating the parameters using the additive smoothing technique
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
        

# Classifying a New Message

With the needed constants and parameters calculated, we can start creating the actual spam filter. The spam filter can be expressed as a function that:
- Takes as input a new message ($w_1$, $w_2$,...,$w_n$), as a string
- Performs data cleaning on the new message
- Calculates $P(Spam|w_1,w_2,...,w_n)$ and $P(Ham|w_1,w_2,...,w_n)$
- Compares the values of $P(Spam|w_1,w_2,...,w_n)$ and $P(Ham|w_1,w_2,...,w_n)$ and:
    - if $P(Ham|w_1,w_2,...,w_n)$ > $P(Spam|w_1,w_2,...,w_n)$, then the message is classified as ham
    - if $P(Ham|w_1,w_2,...,w_n)$ < $P(Spam|w_1,w_2,...,w_n)$, then the message is classified as spam
    - if $P(Ham|w_1,w_2,...,w_n)$ = $P(Spam|w_1,w_2,...,w_n)$, the algorithm may request human intervention.

In [22]:
def classify(message):

    # Using the same cleaning method as in the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    # print(message) <- commented after verifying the cleaning

    # Initializing P(Spam|message) and P(Ham|message) with the priors
    # p_spam and p_ham
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Iterate over the words in the message. If a word is known, multiply by
    # its conditional probabilities; unknown words are ignored.
    # The dictionary lookup is equivalent to checking the vocabulary list,
    # since both hold the same words, but it is much faster.
    for word in message:
        if word in parameters_spam:
            p_spam_given_message = p_spam_given_message * parameters_spam[word]
            p_ham_given_message = p_ham_given_message * parameters_ham[word]
#         else:
#             print("The word '", word, "' was not found. Calculations ignored.")

    # Classification of the message
    if p_ham_given_message > p_spam_given_message:
        result = 'Not Spam'
    elif p_ham_given_message < p_spam_given_message:
        result = 'Spam'
    else:
        result = 'Needs human check'

    # Return order matches how the result is unpacked and indexed below:
    # spam probability first, then ham probability, then the label
    return p_spam_given_message, p_ham_given_message, result
    

In [23]:
test_message1 = 'WINNER!! This is the secret code to unlock the money: C3421.'

print("The message is:", test_message1)
message_p_spam, message_p_not_spam, classification = classify(test_message1)
print("Probability that this message is spam:", message_p_spam)
print("Probability that this message isn't spam:", message_p_not_spam)
print("Classification:", classification)

print('\n')

test_message2 = "Sounds good, Tom, then see u there"

print("The message is:", test_message2)
message_p_spam, message_p_not_spam, classification = classify(test_message2)
print("Probability that this message is spam:", message_p_spam)
print("Probability that this message isn't spam:", message_p_not_spam)
print("Classification:", classification)

The message is: WINNER!! This is the secret code to unlock the money: C3421.
Probability that this message is spam: 1.3481290211300841e-25
Probability that this message isn't spam: 1.9368049028589875e-27
Classification: Spam


The first message is classified as spam, as expected. The output originally recorded for the second message repeated these values because `test_message1` was passed to `classify` twice, so it is not shown here. Let's now try the function on the test set.

# Measuring the Spam Filter's Accuracy


Now we will use the `df_test` dataframe that we created at the beginning to test the spam filter. We know that a human classified each message. Our algorithm will see the test dataset for the first time and will attempt to classify each message, so we will be able to assess how well our spam filter performs.

We will create 3 new columns:
- `p_spam` and `p_ham`, where we will insert the two calculated probabilities
- `classification` column where we will input the final result

In [24]:
# initializing the columns
df_test['p_spam'] = 0
df_test['p_ham'] = 0
df_test['classification'] = 'to do'

df_test.head()

Unnamed: 0,label,SMS,p_spam,p_ham,classification
0,ham,Later i guess. I needa do mcat study too.,0,0,to do
1,ham,But i haf enuff space got like 4 mb...,0,0,to do
2,spam,Had your mobile 10 mths? Update to latest Oran...,0,0,to do
3,ham,All sounds good. Fingers . Makes it difficult ...,0,0,to do
4,ham,"All done, all handed in. Don't know if mega sh...",0,0,to do


In [25]:
# Filling the columns with the values from the classify function
df_test['p_spam'] = df_test['SMS'].apply(lambda SMS:classify(SMS)[0])
df_test['p_ham'] = df_test['SMS'].apply(lambda SMS: classify(SMS)[1])
df_test['classification'] = df_test['SMS'].apply(lambda SMS: classify(SMS)[2])

# Summary statistics
print(df_test['classification'].value_counts())
print('\n')
print(df_test['classification'].value_counts(normalize=True)*100)

Not Spam             969
Spam                 144
Needs human check      1
Name: classification, dtype: int64


Not Spam             86.983842
Spam                 12.926391
Needs human check     0.089767
Name: classification, dtype: float64


In [26]:
# The message that needs a human check
df_test[df_test['classification']=='Needs human check']

Unnamed: 0,label,SMS,p_spam,p_ham,classification
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,0.0,0.0,Needs human check


The proportions are what we were expecting them to be. There is one message that a human would need to check and classify, because its computed probability of being spam is equal to its probability of not being spam (both are reported as `0.0`).

In [27]:
df_test.head(10)

Unnamed: 0,label,SMS,p_spam,p_ham,classification
0,ham,Later i guess. I needa do mcat study too.,3.4831069999999996e-26,4.2532449999999994e-19,Not Spam
1,ham,But i haf enuff space got like 4 mb...,3.113881e-34,9.669411e-29,Not Spam
2,spam,Had your mobile 10 mths? Update to latest Oran...,7.54855e-83,4.338466e-98,Spam
3,ham,All sounds good. Fingers . Makes it difficult ...,3.608708e-34,1.481496e-28,Not Spam
4,ham,"All done, all handed in. Don't know if mega sh...",2.7643950000000002e-68,6.581143e-58,Not Spam
5,ham,But my family not responding for anything. Now...,3.003832e-110,1.396866e-88,Not Spam
6,ham,U too...,6.630543e-08,1.536822e-05,Not Spam
7,ham,Boo what time u get out? U were supposed to ta...,1.675016e-44,9.822271e-39,Not Spam
8,ham,Genius what's up. How your brother. Pls send h...,1.2938389999999999e-42,5.7006599999999994e-36,Not Spam
9,ham,I liked the new mobile,1.029835e-15,6.458119e-15,Not Spam


In the sample above, all the messages appear to be classified correctly.

We will add a new column to the dataframe that records whether the spam filter's classification agrees with the human label. This will let us isolate the examples where the classification was not done correctly.

In [28]:
def is_correct(row):
    output = 'Not decided'
    
    if row['label'] == 'spam':
        if row['classification'] == 'Spam':
            output = 'Correct'
        if row['classification'] == 'Not Spam':
            output = 'Incorrect'

    if row['label'] == 'ham':
        if row['classification'] == 'Spam':
            output = 'Incorrect'
        if row['classification'] == 'Not Spam':
            output = 'Correct'

    return output

In [29]:
is_correct(df_test.iloc[0])

'Correct'

In [30]:
df_test['correctness'] = 'To be checked'
df_test['correctness'] = df_test.apply(is_correct, axis=1)

In [31]:
df_test.head(10)

Unnamed: 0,label,SMS,p_spam,p_ham,classification,correctness
0,ham,Later i guess. I needa do mcat study too.,3.4831069999999996e-26,4.2532449999999994e-19,Not Spam,Correct
1,ham,But i haf enuff space got like 4 mb...,3.113881e-34,9.669411e-29,Not Spam,Correct
2,spam,Had your mobile 10 mths? Update to latest Oran...,7.54855e-83,4.338466e-98,Spam,Correct
3,ham,All sounds good. Fingers . Makes it difficult ...,3.608708e-34,1.481496e-28,Not Spam,Correct
4,ham,"All done, all handed in. Don't know if mega sh...",2.7643950000000002e-68,6.581143e-58,Not Spam,Correct
5,ham,But my family not responding for anything. Now...,3.003832e-110,1.396866e-88,Not Spam,Correct
6,ham,U too...,6.630543e-08,1.536822e-05,Not Spam,Correct
7,ham,Boo what time u get out? U were supposed to ta...,1.675016e-44,9.822271e-39,Not Spam,Correct
8,ham,Genius what's up. How your brother. Pls send h...,1.2938389999999999e-42,5.7006599999999994e-36,Not Spam,Correct
9,ham,I liked the new mobile,1.029835e-15,6.458119e-15,Not Spam,Correct


In [32]:
print(df_test['correctness'].value_counts())
print('\n')

# this also gives our accuracy for the model, but for learning purposes 
# I will use another method
# print(df_test['correctness'].value_counts(normalize=True)*100)

Correct        1100
Incorrect        13
Not decided       1
Name: correctness, dtype: int64




The results seem to be quite good. Let's investigate the rows where the classification was incorrect and not decided.

In [33]:
incorrect_classification = df_test[df_test['correctness'] != 'Correct'].sort_values(['label', 'classification'])
incorrect_classification

Unnamed: 0,label,SMS,p_spam,p_ham,classification,correctness
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,0.0,0.0,Needs human check,Not decided
152,ham,Unlimited texts. Limited minutes.,1.054585e-11,1.385268e-12,Spam,Incorrect
159,ham,26th OF JULY,5.32843e-12,1.319153e-12,Spam,Incorrect
284,ham,Nokia phone is lovly..,3.230228e-09,3.341645e-10,Spam,Incorrect
302,ham,No calls..messages..missed calls,9.486266e-18,3.030822e-18,Spam,Incorrect
319,ham,We have sent JD for Customer Service cum Accou...,2.949867e-58,1.386787e-60,Spam,Incorrect
114,spam,Not heard from U4 a while. Call me now am here...,1.999536e-94,1.181554e-84,Not Spam,Incorrect
135,spam,More people are dogging in your area now. Call...,1.0041889999999999e-78,2.0353389999999998e-78,Not Spam,Incorrect
504,spam,Oh my god! I've found your number again! I'm s...,2.637158e-73,1.627774e-66,Not Spam,Incorrect
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",4.0536029999999997e-100,5.839167e-95,Not Spam,Incorrect


It is interesting to look at these examples sorted by the classification assigned by the algorithm. For some messages the two probabilities are quite close to each other, while for others the exponents differ by 10 (for example `E-61` versus `E-71`), which means that one probability is 10,000,000,000 times larger than the other.

The example where correctness was `Not decided` looks straightforward, since the two probabilities are equal. Or so it may seem: both are reported as `0.0`, so it is likely that multiplying so many small probabilities produced values too small for floating-point numbers to represent (underflow). Since it is just 1 of the 1,114 test messages, we can overlook it.
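
A common way around this kind of underflow is to add logarithms instead of multiplying probabilities. The sketch below is not part of the notebook's classifier. It uses made-up probabilities and a hypothetical helper `log_score` to show the effect.

```python
import math

# Made-up per-word probabilities for a 100-word message.
spam_word_probs = [2e-4] * 100
ham_word_probs = [1e-4] * 100

def product(prior, probs):
    """Direct product, as in the classifier above."""
    result = prior
    for p in probs:
        result *= p
    return result

def log_score(prior, probs):
    """Same quantity in log space: log(prior) + sum of log(p)."""
    return math.log(prior) + sum(math.log(p) for p in probs)

# Both direct products are far below the smallest positive float,
# so they underflow to 0.0 and the comparison ends in a tie.
print(product(0.13, spam_word_probs), product(0.87, ham_word_probs))

# The log scores stay finite and can still be compared.
spam_score = log_score(0.13, spam_word_probs)
ham_score = log_score(0.87, ham_word_probs)
print("Spam" if spam_score > ham_score else "Not Spam")
```

Because the logarithm is monotonic, comparing log scores gives the same decision as comparing the products would with unlimited precision.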

Let's now calculate the accuracy of our classifier using another method.

In [37]:
total = len(df_test)
spam_total = len(df_test[df_test['label'] == 'spam'])
nonspam_total = len(df_test[df_test['label'] == 'ham'])

# Count the correctly classified messages
correct_spam_total = len(df_test[(df_test['label'] == 'spam') & (df_test['classification'] == 'Spam')])
correct_nonspam_total = len(df_test[(df_test['label'] == 'ham') & (df_test['classification'] == 'Not Spam')])
correct_total = correct_spam_total + correct_nonspam_total

#Calculate the accuracy
accuracy_overall = correct_total / total
accuracy_spam = correct_spam_total / spam_total
accuracy_nonspam = correct_nonspam_total / nonspam_total

print("Overall accuracy is {:.2%}".format(accuracy_overall))
print("Accuracy for spam messages is {:.2%}".format(accuracy_spam))
print("Accuracy for non-spam messages is {:.2%}".format(accuracy_nonspam))

Overall accuracy is 98.74%
Accuracy for spam messages is 94.56%
Accuracy for non-spam messages is 99.38%


The overall accuracy is almost 99% for the messages in the test set. That is quite a good score, and I am pleased with it.

As we saw with the misclassified spam messages above, spammers can get quite good at crafting a message that passes a spam filter. With the current accuracy, a spam message still has a 5.44% chance of passing as non-spam.
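
The per-class figures can also be read as recall and precision for the spam class. The sketch below assumes counts inferred from the outputs above rather than recomputed from `df_test`: 147 spam messages in the test set (13.1957% of 1,114), 139 of them classified correctly (94.56% of 147), and 144 messages labelled `Spam` in total.

```python
# Counts inferred from the reported outputs (assumptions, not recomputed here).
spam_total = 147        # spam messages in the test set
correct_spam = 139      # spam messages classified as 'Spam'
flagged_as_spam = 144   # all messages classified as 'Spam'

false_negatives = spam_total - correct_spam        # spam that slipped through
false_positives = flagged_as_spam - correct_spam   # ham wrongly flagged as spam

recall = correct_spam / spam_total          # share of spam that was caught
precision = correct_spam / flagged_as_spam  # share of flagged messages that are spam

print("Recall: {:.2%}".format(recall))
print("Precision: {:.2%}".format(precision))
print("Missed spam:", false_negatives, "| Ham flagged as spam:", false_positives)
```

Recall matches the spam accuracy reported above, while precision shows how often a message flagged as spam really is spam, which matters because a wrongly blocked ham message is usually the more costly error.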

# Conclusion and next steps

The project's aim was to build a spam filter for SMS messages using the Naive Bayes algorithm. We processed a dataset of 5,572 SMS messages, split it into training and test sets, cleaned the data, and used the individual words in the training set to create a vocabulary, to which we applied the Naive Bayes algorithm to calculate the probability that a message is `spam` or `ham`. In my personal experience, a model accuracy of 80% or above is good, and the accuracy of this algorithm is close to 99%.

## Next steps


- Isolate the 14 messages that were not classified correctly (13 incorrect and 1 undecided) and try to figure out why the algorithm reached the wrong conclusions.
- Make the filtering process more complex by making the algorithm sensitive to letter case.


I will get back to this project at a later stage; for now, I have to finish another project about K-Means Clustering.