# Building a Spam Filter with Naive Bayes

In this project, I'm going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.
3. Classifies a new message based on these probability values: if the probability for non-spam is greater, it classifies the message as non-spam. Otherwise, it classifies it as spam.

So the first task is to "teach" the computer how to classify messages. To do that, I'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly [from this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

# Exploring the Data

Let's start by reading in the dataset.

In [64]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [65]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. This sample looks representative, since in practice most messages that people receive are ham.
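This class balance also gives a useful reference point for later: a filter that labels every message as ham would already be right about 87% of the time. Here is a minimal sketch of that baseline, using a made-up list of labels rather than the real dataset (the function name `majority_baseline_accuracy` is hypothetical):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels with roughly the same balance as the dataset
toy_labels = ["ham"] * 87 + ["spam"] * 13
baseline = majority_baseline_accuracy(toy_labels)
print(f"{baseline:.2f}")
```

Any accuracy figure for the filter is best read against this baseline rather than against zero.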

## Training and Test Set

I'm now going to split our dataset into a training and a test set using the `sklearn.model_selection.train_test_split()` function, where the training set accounts for 80% of the data, and the test set for the remaining 20%.

In [66]:
#Splitting the data
training_set, test_set = train_test_split(sms_spam, test_size=0.2, random_state=1)

print(training_set.shape)
print(test_set.shape)

(4457, 2)
(1115, 2)


I'll now analyze the percentage of spam and ham messages in the training and test sets. I expect the percentages to be close to what I have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

In [67]:
training_set['Label'].value_counts(normalize=True)

ham     0.86538
spam    0.13462
Name: Label, dtype: float64

In [68]:
test_set['Label'].value_counts(normalize=True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64

The results look good! I'll now move on to cleaning the data set.

# Data Preprocessing

To calculate all the probabilities required by the algorithm, I'll first need to perform a bit of data cleaning to bring the data into a format that makes it easy to extract all the information I need.

## Data Cleaning

Essentially, I want to change the dataset to a *bag-of-words* format using the `sklearn.feature_extraction.text.CountVectorizer` class.
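To make the bag-of-words format concrete, here is a library-free sketch on two made-up messages. It mimics `CountVectorizer`'s default tokenization (lowercasing, then keeping only tokens of two or more word characters). The helper names `tokenize` and `bag_of_words` are hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters (so "u" and "2" are dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(messages):
    # Sorted vocabulary: one column per unique token
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    rows = []
    for msg in messages:
        counts = Counter(tokenize(msg))
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

# Made-up messages, not taken from the dataset
toy_messages = ["Free entry, FREE prize", "See u at 2, see you there"]
vocab, rows = bag_of_words(toy_messages)
print(vocab)
for row in rows:
    print(row)
```

Each row is one message and each column is one token, so the first message gets a 2 in the `free` column and zeros for every token it doesn't contain.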

In [69]:
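# Create the bag-of-words matrix (one row per message, one column per unique token)
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(training_set["SMS"])
# Convert the sparse matrix to a DataFrame labelled with the token names
word_counts = pd.DataFrame(word_counts.toarray(), columns=count_vect.get_feature_names_out(), index=training_set.index)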

print(word_counts.shape)
word_counts.head()

(4457, 7714)


Unnamed: 0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,02073162414,02085076972,021,03,04,0430,05,050703,0578,06,07,07008009200,07090201529,07090298926,07123456789,07732584351,07734396839,07742676969,07753741225,0776xxxxxxx,07781482378,07786200117,078,07801543489,07808,07808247860,07808726822,07815296484,07821230901,07880867867,...,yogasana,yor,yorge,you,youdoing,youi,youphone,your,youre,yourinclusive,yourjob,yours,yourself,youwanna,yowifes,yoyyooo,yr,yrs,ything,yummmm,yummy,yun,yunny,yuo,yuou,yup,zac,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
1642,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2899,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
480,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3485,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
157,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [70]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)

print(training_set_clean.shape)
training_set_clean.head()

(4457, 7716)


Unnamed: 0,Label,SMS,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,02073162414,02085076972,021,03,04,0430,05,050703,0578,06,07,07008009200,07090201529,07090298926,07123456789,07732584351,07734396839,07742676969,07753741225,0776xxxxxxx,07781482378,07786200117,078,07801543489,07808,07808247860,07808726822,07815296484,...,yogasana,yor,yorge,you,youdoing,youi,youphone,your,youre,yourinclusive,yourjob,yours,yourself,youwanna,yowifes,yoyyooo,yr,yrs,ything,yummmm,yummy,yun,yunny,yuo,yuou,yup,zac,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
1642,ham,"Hi , where are you? We're at and they're not ...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2899,ham,If you r @ home then come down within 5 min,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
480,ham,When're you guys getting back? G said you were...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3485,ham,Tell my bad character which u Dnt lik in me. ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
157,ham,I'm leaving my house now...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Getting the Vocabulary

Let's now get the vocabulary as a dataframe. In this context, the vocabulary is a dictionary that maps every unique word in our training set to its column index in the bag-of-words matrix (stored below as `n_word`).

In [71]:
vocabulary = pd.DataFrame.from_dict(count_vect.vocabulary_, orient="index", columns=["n_word"])

print(vocabulary.shape)
vocabulary.head()

(7714, 1)


| | n_word |
|---|---|
| hi | 3426 |
| where | 7445 |
| are | 1053 |
| you | 7677 |
| we | 7373 |


# Calculating Naive Bayes Parameters

Now, I can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}


Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, I'll need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

where:

- $N_{w_i|Spam}$ is the number of times the word $w_i$ occurs in spam messages;
- $N_{w_i|Ham}$ is the number of times the word $w_i$ occurs in non-spam messages;
- $N_{Spam}$ is the total number of words in spam messages;
- $N_{Ham}$ is the total number of words in non-spam messages;
- $N_{Vocabulary}$ is the number of unique words in the vocabulary;
- $\alpha$ is the smoothing parameter.

Some of the terms in the four equations above will have the same value for every new message. I can calculate the value of these terms once and avoid doing the computations again when a new message comes in.

I'll also use Laplace smoothing and set $\alpha = 1$.
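To see why smoothing matters, consider a word that never appears in spam in the training data. Without smoothing, its conditional probability is exactly zero, which would zero out the whole product for any message containing it. Here is a small sketch with made-up counts (none of these numbers come from the dataset, and `smoothed_probability` is a hypothetical helper):

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w|class) with additive smoothing, following the formula above."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts for a word that never occurs in spam
n_word_given_spam = 0
n_spam = 90        # total words across spam messages
n_vocabulary = 10  # unique words in the vocabulary

unsmoothed = smoothed_probability(n_word_given_spam, n_spam, n_vocabulary, alpha=0)
smoothed = smoothed_probability(n_word_given_spam, n_spam, n_vocabulary, alpha=1)
print(unsmoothed)  # 0.0
print(smoothed)    # 0.01
```

With $\alpha = 1$ the unseen word gets a small but non-zero probability, so a single unfamiliar word can no longer veto a class on its own.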

In [0]:
# Isolating spam and ham messages
spam_messages = training_set_clean.loc[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean.loc[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = spam_messages.shape[0] / training_set_clean.shape[0]
p_ham = ham_messages.shape[0] / training_set_clean.shape[0]

# N_w|Spam
vocabulary["n_word_given_spam"] = spam_messages.iloc[:,2:].sum()

# N_Spam
n_spam = vocabulary["n_word_given_spam"].sum()

# N_w|Ham
vocabulary["n_word_given_ham"] = ham_messages.iloc[:,2:].sum()

# N_Ham
n_ham = vocabulary["n_word_given_ham"].sum()

# N_Vocabulary
n_vocabulary = vocabulary.shape[0]

# Laplace smoothing
alpha = 1

Now that I have the constant terms calculated above, I can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

In [73]:
# Calculate parameters
vocabulary["p_word_given_spam"] = (vocabulary["n_word_given_spam"]+alpha)/(n_spam+alpha*n_vocabulary)
vocabulary["p_word_given_ham"] = (vocabulary["n_word_given_ham"]+alpha)/(n_ham+alpha*n_vocabulary)

print(vocabulary.shape)
vocabulary.head()

(7714, 5)


| | n_word | n_word_given_spam | n_word_given_ham | p_word_given_spam | p_word_given_ham |
|---|---|---|---|---|---|
| hi | 3426 | 14 | 96 | 0.00069 | 0.001679 |
| where | 7445 | 2 | 91 | 0.000138 | 0.001593 |
| are | 1053 | 65 | 323 | 0.003038 | 0.005609 |
| you | 7677 | 234 | 1550 | 0.010817 | 0.02685 |
| we | 7373 | 43 | 271 | 0.002025 | 0.004709 |


# Classification of New Messages

Now that I have all parameters calculated, I can start creating the spam filter `message_classification()`. The spam filter can be understood as a function that:

- Takes in as input a new message $(w_1, w_2, ..., w_n)$.
- Calculates $P(Spam|w_1, w_2, ..., w_n)$ and $P(Ham|w_1, w_2, ..., w_n)$.
- Compares the values of $P(Spam|w_1, w_2, ..., w_n)$ and $P(Ham|w_1, w_2, ..., w_n)$, and:
    - If $P(Ham|w_1, w_2, ..., w_n) > P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as ham.
    - If $P(Ham|w_1, w_2, ..., w_n) \leq P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as spam.
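The steps above can be sketched end to end on a tiny made-up corpus. This is a library-free illustration of the same formulas, not the notebook's implementation: the corpus, the names `train` and `classify`, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

ALPHA = 1  # Laplace smoothing

# Made-up training data: (label, message)
toy_corpus = [
    ("spam", "win money now"),
    ("spam", "win a prize now"),
    ("ham", "see you now"),
    ("ham", "call you later"),
    ("ham", "see you later"),
]

def train(corpus):
    """Count words per class, messages per class, and collect the vocabulary."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, message in corpus:
        label_counts[label] += 1
        word_counts[label].update(message.split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary

def classify(message, word_counts, label_counts, vocabulary):
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_label = sum(word_counts[label].values())  # N_Spam or N_Ham
        score = label_counts[label] / total_messages  # prior P(label)
        for word in message.split():
            if word not in vocabulary:
                continue  # ignore words unseen in training
            score *= (word_counts[label][word] + ALPHA) / (n_label + ALPHA * len(vocabulary))
        scores[label] = score
    # Ties go to spam, matching the rule described above
    return "ham" if scores["ham"] > scores["spam"] else "spam"

model = train(toy_corpus)
print(classify("win money", *model))
print(classify("see you later", *model))
```

On this toy corpus the first message scores higher for spam and the second for ham. The scores are proportional to the posteriors rather than equal to them, which is enough for a comparison.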

In [0]:
# Classification function
def message_classification(message):

  # Transform the message with the training vocabulary;
  # words not seen in training are ignored
  message = count_vect.transform([message])
  message = pd.Series(message.toarray()[0], index=count_vect.get_feature_names_out())

  # Calculate the naive (unnormalized) probabilities
  p_spam_given_message = vocabulary["p_word_given_spam"].pow(message).prod() * p_spam
  p_ham_given_message = vocabulary["p_word_given_ham"].pow(message).prod() * p_ham

  # Ties are classified as spam
  if p_ham_given_message > p_spam_given_message:
    return 'ham'
  else:
    return 'spam'

In [89]:
message_classification("WINNER!! This is the secret code to unlock the money: C3421.")

'spam'

In [90]:
message_classification("Sounds good, Tom, then see u there")

'ham'
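One practical caveat with multiplying probabilities directly: each factor is much smaller than 1, so for a long enough message the product can underflow to `0.0` for both classes. A common workaround is to compare sums of logarithms instead, which preserves the ordering. The sketch below uses made-up probabilities to show the effect; `log_score` is a hypothetical helper, not part of the filter above.

```python
import math

def log_score(prior, word_probabilities):
    """log(prior) + sum of log P(w_i|class); same ordering as the raw product."""
    return math.log(prior) + sum(math.log(p) for p in word_probabilities)

# Made-up example: 200 words, each with a small conditional probability
spam_word_probs = [1e-3] * 200
ham_word_probs = [1e-4] * 200

raw_spam = 0.13 * math.prod(spam_word_probs)
raw_ham = 0.87 * math.prod(ham_word_probs)
print(raw_spam, raw_ham)  # both underflow to 0.0, so they cannot be compared

log_spam = log_score(0.13, spam_word_probs)
log_ham = log_score(0.87, ham_word_probs)
print(log_spam > log_ham)  # True: the spam score is larger
```

SMS messages are short, so this is unlikely to matter for the dataset used here, but it becomes relevant for longer texts such as emails.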

## Measuring the Spam Filter's Results

The two results above look promising, but let's see how well the filter does on the test set.

Let's create a new column in our test set with classification results.

In [96]:
# Add the predictions as a new column (assign returns a new DataFrame)
test_set = test_set.assign(predicted=test_set['SMS'].apply(message_classification))

test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture | ham |
| 4028 | ham | Yes, princess. Are you going to make me moan? | ham |
| 958 | ham | Welp apparently he retired | ham |
| 4642 | ham | Havent. | ham |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... | ham |


Now, let's evaluate the spam filter to find out how well it does.

I'm going to use two functions for this goal, `sklearn.metrics.confusion_matrix()` and `sklearn.metrics.classification_report()`.

In [106]:
unique_label = np.unique([test_set["Label"], test_set["predicted"]])
cmtx = pd.DataFrame(
  confusion_matrix(test_set["Label"], test_set["predicted"], labels=unique_label),
  index=['true:{:}'.format(x) for x in unique_label], 
  columns=['pred:{:}'.format(x) for x in unique_label]
)
print("Confusion Matrix\n")
print(cmtx)
print("\n------------------------------------------------\n")
print("Classification Report\n")
print(classification_report(test_set["Label"], test_set["predicted"]))

Confusion Matrix

           pred:ham  pred:spam
true:ham        965          3
true:spam         8        139

------------------------------------------------

Classification Report

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       968
        spam       0.98      0.95      0.96       147

    accuracy                           0.99      1115
   macro avg       0.99      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115



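The accuracy and the weighted-average f1-score are both 99%, which is really good. Our spam filter looked at 1,115 messages that it hadn't seen in training and classified 1,104 of them correctly. The errors are not symmetric, though: it let 8 spam messages through as ham and flagged 3 ham messages as spam, so the f1-score for the spam class alone is 0.96.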
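The figures in the classification report can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below uses the four counts printed above and treats spam as the positive class; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above (spam = positive class)
true_ham_pred_ham = 965
true_ham_pred_spam = 3     # ham wrongly flagged as spam
true_spam_pred_ham = 8     # spam that slipped through
true_spam_pred_spam = 139

total = (true_ham_pred_ham + true_ham_pred_spam
         + true_spam_pred_ham + true_spam_pred_spam)
accuracy = (true_ham_pred_ham + true_spam_pred_spam) / total
spam_precision = true_spam_pred_spam / (true_spam_pred_spam + true_ham_pred_spam)
spam_recall = true_spam_pred_spam / (true_spam_pred_spam + true_spam_pred_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.2f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
```

The rounded values match the `spam` row and the `accuracy` row of the report.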

# Conclusion

In this project, I built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 99% on the test set I used, which is a pretty good result. My initial goal was an accuracy of over 80%, and I managed to do way better than that.