# Building a Spam Filter with Naive Bayes

In this project, we'll explore the practical side of the multinomial Naive Bayes algorithm by building a spam filter for SMS messages. Our goal will be to write a program that classifies new messages as spam or not-spam with an accuracy greater than 95%.

To train the algorithm, we'll use a dataset of 5,572 SMS messages that was put together by Tiago A. Almeida and José María Gómez Hidalgo. The dataset can be downloaded directly from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

## Exploring the Dataset

We will start by reading in and exploring the data.

In [1]:
import pandas as pd

# Data points are tab separated with no header row
sms_data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_data.shape)
sms_data.head()

(5572, 2)


|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
sms_data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We can see that about 87% of the messages are labeled ham (not-spam) and about 13% are labeled spam.

## Splitting the Data into the Training Set & Test Set

Next, we'll split our data into a training set and a test set. We'll make the training data 80% of the dataset, and the remaining 20% will be used to test how good our spam filter is at classifying new messages.

In [3]:
# Randomize the dataset
data_randomized = sms_data.sample(frac=1, random_state=1)

# Calculate index for the split
data_index = round(len(data_randomized) * 0.8)

# Split into Train and Test
training_set = data_randomized[:data_index].reset_index(drop=True)
test_set = data_randomized[data_index:].reset_index(drop=True)

print("Train:", training_set.shape)
print("Test:", test_set.shape)

Train: (4458, 2)
Test: (1114, 2)


Now, we'll check the percentage of spam and not-spam messages in the training and test sets to be sure that the numbers are close to the ratio we had for the full dataset.

In [4]:
print("Train:", "\n", training_set['Label'].value_counts(normalize=True))
print("Test:", "\n", test_set['Label'].value_counts(normalize=True))

Train: 
 ham     0.86541
spam    0.13459
Name: Label, dtype: float64
Test: 
 ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The results look similar to the full dataset.

## Data Cleaning

Our next step is to clean the data. We are going to transform the dataset so that each unique word found in the messages becomes its own column, and each row holds the number of times that word appears in the corresponding message.
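
To make the target shape concrete, here is a minimal sketch of the same transformation on two made-up messages. The names `toy_messages`, `toy_vocabulary`, and `toy_rows` are illustrative only and are not used elsewhere in this notebook; the real transformation is done with pandas in the sections that follow.

```python
from collections import Counter

# Two made-up messages, already lowercased and stripped of punctuation
toy_messages = ["win cash now", "see you now now"]
tokenized = [message.split() for message in toy_messages]

# Sorted so that the column order is reproducible
toy_vocabulary = sorted({word for words in tokenized for word in words})

# One row per message, one count per vocabulary word
toy_rows = []
for words in tokenized:
    counts = Counter(words)
    toy_rows.append([counts[word] for word in toy_vocabulary])

print(toy_vocabulary)  # ['cash', 'now', 'see', 'win', 'you']
for row in toy_rows:
    print(row)         # [1, 1, 0, 1, 0] then [0, 2, 1, 0, 1]
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows to thousands of words.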

### Letter Case and Punctuation

We will begin by removing punctuation and making all words lowercase.

In [5]:
# Before cleaning
training_set.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [6]:
# After cleaning
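# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)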
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


### Creating the Vocabulary

Now, we'll create the vocabulary, which in this context means a list with all of the unique words found in the training set.

In [7]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
# Transforms vocabulary into a set to remove duplicates, and then back into a list        
vocabulary = list(set(vocabulary))

In [8]:
# View number of unique words in the training set
len(vocabulary)

7783

### The Final Training Set

Now we can use the vocabulary that we just created to make the data transformation we want.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)

In [10]:
word_counts.head()

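|   | bathing | progress | tmw | croydon | pataistha | amplikater | picsfree1 | tirupur | belly | spile | ... | i | dept | definite | doggy | oreo | solve | despite | shade | suggestions | pin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |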


In [11]:
# Adding the word counts back to the original training set
training_set_clean = pd.concat([training_set, word_counts], axis=1)

training_set_clean.head()

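|   | Label | SMS | bathing | progress | tmw | croydon | pataistha | amplikater | picsfree1 | tirupur | ... | i | dept | definite | doggy | oreo | solve | despite | shade | suggestions | pin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |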


## Calculating Constants First

Now that we're done cleaning and preparing the data, we can begin creating the spam filter. To classify a new message, the Naive Bayes algorithm needs to evaluate these two expressions:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

To calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ in the formulas above, we'll need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

We'll start by calculating:
* P(Spam) & P(Ham)
* N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>

We'll use Laplace smoothing, which means setting $\alpha = 1$.
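
As a small worked example of the smoothing formula above, the sketch below uses made-up counts (10 words across all spam messages and a 4-word vocabulary). The helper `smoothed_probability` and the `toy_` names are hypothetical and exist only for this illustration.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w|class) with additive smoothing, following the formula above."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts of each vocabulary word across all spam messages
toy_word_counts = {"free": 3, "win": 5, "call": 2, "hello": 0}
n_spam_toy = sum(toy_word_counts.values())   # 10 words in total
n_vocabulary_toy = len(toy_word_counts)      # 4 unique words

p_seen = smoothed_probability(3, n_spam_toy, n_vocabulary_toy)    # (3 + 1) / (10 + 4)
p_unseen = smoothed_probability(0, n_spam_toy, n_vocabulary_toy)  # (0 + 1) / (10 + 4)

# The smoothed probabilities over the whole vocabulary still sum to 1
toy_probabilities = [
    smoothed_probability(count, n_spam_toy, n_vocabulary_toy)
    for count in toy_word_counts.values()
]
print(p_seen, p_unseen, sum(toy_probabilities))
```

Without smoothing, the word "hello" would get a probability of 0 for spam, and a single occurrence of it would force the whole product for that class to 0.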

In [12]:
# Isolate the spam and not-spam messages
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) & P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N-Spam
n_spam_words = spam_messages['SMS'].apply(len)
n_spam = n_spam_words.sum()

# N-Ham
n_ham_words = ham_messages['SMS'].apply(len)
n_ham = n_ham_words.sum()

# N-Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

## Calculating Parameters

Now that we've calculated the constant terms, we can calculate the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$ which will be the conditional probability values associated with each word in the vocabulary. 

We'll calculate these parameters using these formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Because we calculate all these values before we begin our classification of new messages, the Naive Bayes algorithm ends up being very fast. When a new message comes in, most of the computations are already done, so the algorithm can almost instantly classify it.

In [13]:
# Initialize two dictionaries as our parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_spam = spam_messages[word].sum()
    p_word_spam = (n_word_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_spam
    
    n_word_ham = ham_messages[word].sum()
    p_word_ham = (n_word_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_ham

## Classifying a New Message

We're finally ready to start creating our spam filter now that the parameters have all been calculated.

The spam filter will be a function that takes a new message as input, calculates values proportional to $P(Spam|message)$ and $P(Ham|message)$, then compares those values and classifies the message as ham or spam, or asks the user for help deciding.

To write the code for calculating `p_spam_message` and `p_ham_message`, we'll need to use these equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

In [14]:
import re

def classify(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_message = p_spam
    p_ham_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_message *= parameters_ham[word]
            
    print('P(Spam|message): ', p_spam_message)
    print('P(Ham|message): ', p_ham_message)
    
    if p_ham_message > p_spam_message:
        print('Label: Ham')
    elif p_ham_message < p_spam_message:
        print('Label: Spam')
    else:
        print('Not sure if Ham or Spam. Need user input.')

In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message):  1.3481290211300841e-25
P(Ham|message):  1.9368049028589875e-27
Label: Spam


In [16]:
classify("Sounds good, Tom, then see u there")

P(Spam|message):  2.4372375665888117e-25
P(Ham|message):  3.687530435009238e-21
Label: Ham
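
The values printed above are already around 1e-25 for short messages, and a long enough message could push both products down to exactly 0.0. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below is one possible variant, not the function used in this notebook; `classify_log` and the `toy_` parameters are made up for illustration and stand in for `p_spam`, `p_ham`, `parameters_spam`, and `parameters_ham`.

```python
import math
import re

# Hypothetical toy parameters standing in for the values computed earlier
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {"win": 0.05, "money": 0.04, "see": 0.001, "you": 0.01}
toy_parameters_ham = {"win": 0.001, "money": 0.002, "see": 0.02, "you": 0.03}

def classify_log(message):
    """Compare the logs of the two products instead of the products themselves."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_spam > log_ham:
        return 'spam'
    return 'needs user classification'

print(classify_log("Win money!!"))   # spam
print(classify_log("see you"))       # ham
print(0.05 ** 1000)                  # 0.0: the plain product underflows
print(classify_log("win " * 1000))   # spam: the log sums are still comparable
```

Because the logarithm is strictly increasing, comparing the log sums gives the same label as comparing the products whenever the products do not underflow.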


## Measuring Accuracy

The results of our classification algorithm look good from what we tested above, but we'll need to see how well our filter works on the test set. First, we'll write a variant of the `classify()` function, `classify_test_set()`, that returns the classification labels instead of printing them.

In [17]:
def classify_test_set(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_message = p_spam
    p_ham_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_message *= parameters_ham[word]

    if p_ham_message > p_spam_message:
        return 'ham'
    elif p_spam_message > p_ham_message:
        return 'spam'
    else:
        return 'needs user classification'

In [18]:
# Using our new function to create a new column with the predicted labels
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [19]:
# Measure the accuracy of the function on the test set
total = test_set.shape[0]

# Count the rows where the predicted label matches the true label
correct = (test_set['Label'] == test_set['predicted']).sum()
        
print('Correct: ', correct)
print('Incorrect: ', total - correct)
print('Accuracy: ', correct / total)

Correct:  1100
Incorrect:  14
Accuracy:  0.9874326750448833


The accuracy of our model is almost 99%! The spam filter looked at 1,114 messages that it hadn't seen and classified 1,100 of them correctly. 
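
Because about 87% of the messages are ham, accuracy alone can hide how the filter treats each class, so it may be worth counting the two kinds of error separately. The sketch below uses made-up label pairs rather than the real test set; in this notebook the pairs could presumably be built with `zip(test_set['Label'], test_set['predicted'])`. All names here are illustrative.

```python
from collections import Counter

# Made-up (actual, predicted) label pairs standing in for the test set
toy_pairs = [
    ('ham', 'ham'), ('ham', 'ham'), ('ham', 'ham'), ('ham', 'spam'),
    ('spam', 'spam'), ('spam', 'spam'), ('spam', 'ham'), ('ham', 'ham'),
]

confusion = Counter(toy_pairs)
true_spam = confusion[('spam', 'spam')]
false_spam = confusion[('ham', 'spam')]    # ham wrongly flagged as spam
missed_spam = confusion[('spam', 'ham')]   # spam that slipped through

accuracy = sum(actual == predicted for actual, predicted in toy_pairs) / len(toy_pairs)
precision = true_spam / (true_spam + false_spam)   # share of flagged messages that are spam
recall = true_spam / (true_spam + missed_spam)     # share of spam that was caught

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
```

For a spam filter, flagging a legitimate message is usually the more costly mistake, which is why precision on the spam class is often reported next to accuracy.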

## Conclusion & Next Steps

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. It predicted which messages were spam with 98.74% accuracy on the test set.

To take this project further, we could:

* Isolate the 14 messages that were misclassified and try to figure out why.
* Make the filtering process more complex by making the algorithm sensitive to letter case.
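
As a starting point for the first item, a boolean mask is one way to pull out the rows where the prediction and the true label disagree. The sketch below runs on a made-up stand-in, `toy_test_set`, with the same three columns as `test_set`; the same comparison should work on `test_set` itself.

```python
import pandas as pd

# Made-up stand-in for test_set after the 'predicted' column has been added
toy_test_set = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'SMS': ['see you soon', 'win cash now', 'call me later', 'free entry today'],
    'predicted': ['ham', 'spam', 'spam', 'ham'],
})

# Keep only the rows where the prediction disagrees with the true label
misclassified = toy_test_set[toy_test_set['Label'] != toy_test_set['predicted']]
print(misclassified)
```

Reading the misclassified messages side by side with their labels is often the quickest way to spot patterns, such as very short messages or words missing from the vocabulary.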

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Conditional Probability** course.