# Building a Spam Filter with the Multinomial Naive Bayes Algorithm

In this guided project, we're going to study the pracitcal side of the multinomial Naive Bayes algorithm by building a spam filter for SMS (text) messages.

To classify messages as spam or non-spam, the computer will:

1. Learns how humans classify messages.
2. Use that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classify a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition). The dataset can also be downloaded directly [from this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more details on [this page](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), where some of the authors' papers can also be found.

We'll now read in the dataset and have a look at what we are working with.

In [1]:
# Import libraries we'll be using
import pandas as pd

# Read in the SMS messages dataset
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

#  A look at the first few rows and dataset dimensions/variables
print(str(sms.shape[0]) + ' rows', "\n" + str(sms.shape[1]) + ' columns', '\n',
      """unique values in "label" column: """ + str(sms['Label'].unique()), '\n')

# Checking for datatypes and missing data
print(sms.info(), '\n')

# Checking proportions of spam and non-spam messages
print(sms['Label'].value_counts(normalize=True) * 100)
sms.head()


5572 rows 
2 columns 
 unique values in "label" column: ['ham' 'spam'] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB
None 

ham     86.593683
spam    13.406317
Name: Label, dtype: float64


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


The dataset contains 5572 messages, each labeled as either "spam" or "ham" ("ham" means non-spam). ~87% of messages are labels as "ham" (non-spam) and ~13% are labeled as spam.

## Creating a training  and testing set

When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

* A *training set*, which we'll use to "train" the computer how to classify messages.
* A *test set*, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The *training set will have 4,458 messages* (about 80% of the dataset).
* The *test set will have 1,114 messages* (about 20% of the dataset).

For this project, **our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%** — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [2]:

# Randomize the entire dataset
random_sms = sms.sample(frac=1, random_state=1)

# Assign 80% of the randomized dataset as the training set
training_sms = random_sms[:4458].reset_index(drop=True)

# Assign the remaining 20% of the dataset as the test set
test_sms = random_sms[4458:].reset_index(drop=True)

# Check the results of our split
print(training_sms.shape)
print(test_sms.shape, '\n')

# Check if percentages of spam/non-spam are the same as the original dataset
print('Training set percentages:')
training_sms['Label'].value_counts(normalize=True) * 100

(4458, 2)
(1114, 2) 

Training set percentages:


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [3]:
# Check if percentages of spam/non-spam are the same as the original dataset
print('Test set percentages:')
test_sms['Label'].value_counts(normalize=True) * 100

Test set percentages:


ham     86.804309
spam    13.195691
Name: Label, dtype: float64

We've succesfully split our original dataset of 5,572 messages:
* 4,458 messages have been assigned the training set, `training_sms`.
* The remaining 1,114 messages have been assigned to the test set, `test_sms`.

The ratios of spam/non-spam text messages have been retained across the orginal, training, and testing datasets (~87% spam, ~13% ham(non-spam)).

We'll now move on to cleaning the dataset.

## Cleaning/Reformatting the SMS column

We need our data in a format that allows us to easily extract all the information we need.

We essentially want to convert the data into this format: <br><br>
![cleaning_sms](https://camo.githubusercontent.com/27a4a0a699bd8f0713d73347abe2929c267a03d5/68747470733a2f2f64712d636f6e74656e742e73332e616d617a6f6e6177732e636f6d2f3433332f637067705f646174617365745f332e706e67)
<br><br>

In [4]:
# Data before cleaning
training_sms.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


We'll remove all punctation, convert letters to lowercase, and convert the message to a list of words by spitting each word at the space character.

In [5]:
# Delete all non-alphanumeric characters, transform to lowercase, and convert to list
training_sms['SMS'] = training_sms['SMS'].str.replace('\W', ' ').str.lower().str.split()

# Data after cleaning
training_sms.sample(5)

Unnamed: 0,Label,SMS
2942,ham,"[mostly, sports, type, lyk, footbl, crckt]"
2484,ham,"[the, house, is, on, the, water, with, a, dock..."
771,ham,"[sounds, better, than, my, evening, im, just, ..."
1445,ham,[ok]
4323,ham,"[mm, that, time, you, dont, like, fun]"


### Creating a vocabulary

We'll iterate through every word in the training set and create a vocabulary for our spam filter

In [6]:
# Collect all unique words across all messages with this list
vocabulary = []

# Iterate over each word in each message and add it to the vocabulary list
for row in training_sms['SMS']:
    for word in row:
        vocabulary.append(word)

# Convert the vocab list into a set to get rid of duplicates, then back into a list        
vocabulary = list(set(vocabulary))
print(len(vocabulary))

7783


There are 7,783 unique vocabularly words in our training set.

### Creating the final training dataset
Now we'll use the vocabularly we created to make the data transformation we need: <br><br>
![cleaning_sms](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png)

In [7]:
# Dictionary where each key is a unique word from the vocabulary and each value is list
# of the length of the training set, where each element in the list is a 0
word_counts_per_sms = {unique_word: [0] * len(training_sms['SMS']) for unique_word in vocabulary}

# Count the number of occurences of each word in each message/index
for index, sms in enumerate(training_sms['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

# Convert the newly built dictionary into a dataframe
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [8]:
# Concatenate the new dataframe to our training set
training_sms_clean = pd.concat([training_sms, word_counts], axis=1)
training_sms_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


##  Calculating constants for equations

We're done with data cleaning and are ready to use the Naive Bayes algorithm to create our spam filter. We'll need to know the probability values for the two equations below to be able to classify new messages: <br><br>

![equation1](https://render.githubusercontent.com/render/math?math=P%28Spam%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Spam%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CSpam%29&mode=display)
![equation2](https://render.githubusercontent.com/render/math?math=P%28Ham%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Ham%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CHam%29&mode=display)<br><br>

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, we need to use these equations: <br><br>

![equation3](https://render.githubusercontent.com/render/math?math=P%28w_i%7CSpam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CSpam%7D%20%2B%20%5Calpha%7D%7BN_%7BSpam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)
![equation4](https://render.githubusercontent.com/render/math?math=P%28w_i%7CHam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CHam%7D%20%2B%20%5Calpha%7D%7BN_%7BHam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)<br><br>

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:
* P(Spam) and P(Ham)
* N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>

We'll also use Laplace smoothing and set $\alpha = 1$.

In [9]:
# Isolate spam and ham (non-spam) messages
spam_messages = training_sms_clean[training_sms_clean['Label'] == 'spam']
ham_messages = training_sms_clean[training_sms_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_sms_clean)
p_ham = len(ham_messages) / len(training_sms_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocab = len(vocabulary)

# Laplace smoothing
alpha = 1

## Calculating parameters

Now that we have our contants, we'll use these equations to calculate all the parameters:<br><br>

![parameters1](https://render.githubusercontent.com/render/math?math=P%28w_i%7CSpam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CSpam%7D%20%2B%20%5Calpha%7D%7BN_%7BSpam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)
![parameters1](https://render.githubusercontent.com/render/math?math=P%28w_i%7CHam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CHam%7D%20%2B%20%5Calpha%7D%7BN_%7BHam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)

In [10]:
# Initiate dictionaries to fill in our calculated parameters
p_w_spam = {unique_word: 0 for unique_word in vocabulary}
p_w_ham= {unique_word: 0 for unique_word in vocabulary}

# Calculate parameters for spam and ham (non-spam)
for word in vocabulary:
    n_word_spam = spam_messages[word].sum()
    p_word_spam = (n_word_spam + alpha) / (n_spam + (alpha * n_vocab))
    p_w_spam[word] = p_word_spam
    
    n_word_ham = ham_messages[word].sum()
    p_word_ham = (n_word_ham + alpha) / (n_ham + (alpha * n_vocab))
    p_w_ham[word] = p_word_ham

## Building the spam filter function

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:
* Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
* Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Hamw<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
* Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:
    * If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.
    * If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.
    * If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help.
    
Again, we'll use the following equations to calculate probabilites for spam/non-spam given a set of words: <br><br>

![spam](https://render.githubusercontent.com/render/math?math=P%28Spam%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Spam%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CSpam%29&mode=display)
![ham](https://render.githubusercontent.com/render/math?math=P%28Ham%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Ham%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CHam%29&mode=display)

In [11]:
# Defining the spam filter function
import re
def classify(message):
    
    '''message: a string'''
    
    message = message.replace('\W', ' ').lower().split() # Cleaning
    
    # Initialize probabilites with constants
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Iterate through every word in the message an calculate probabilities
    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]
        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    # Output
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [12]:
# Testing spam filter on a spam message
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.0164097981708963e-18
P(Ham|message): 1.8195638182330266e-19
Label: Spam


In [13]:
# Testing spam filter on a ham (non-spam) message
classify('Sounds, good, Tom, then see u there')

P(Spam|message): 1.2312316178212498e-13
P(Ham|message): 1.5219566401449865e-10
Label: Ham


Our spam filter appears to be working correctly. Let's test it on our *test dataset* next.

## Measuring spam filter accuracy

Now that we appear to have a working spam fiter, lets see how well it does against the 1,114 messsages in our test dataset. 

We first need to rewrite the function to return values instead of print them.

In [14]:
# Rewrite the spam filter to return values in stead of printing them
def classify_test_set(message):
    
    '''message: a string.'''

    message = message.replace('\W', ' ').lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]

        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [15]:
# Create a new column, labeling the messages with out spam filter
test_sms['predicted'] = test_sms['SMS'].apply(classify_test_set)

# View our modification
test_sms.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


We'll now calculate the accuracy of the spam filter

In [16]:
# Initialize variables to calculate spam filter accuracy
correct = 0
total = len(test_sms)

# Iterate over every message in the test dataset and compare human and spam-filter classifications
for row in test_sms.itertuples():
    human_label = row[1]
    filter_label = row[3]
    if human_label == filter_label:
        correct += 1

# Print the results
print('Correct predictions: ', correct)
print(' Incorrect predictions: ', total - correct)
print('Spam filter accuracy', 100 * correct / total)

Correct predictions:  1095
 Incorrect predictions:  19
Spam filter accuracy 98.29443447037701


Our spam filter did a phenomenal job at correctly labeling the spam messages in our test dataset. It correctly labels spam messags with a ~98% accuracy! It correctly labeled 1,085 out of 1,114 messages.

We initially aimed for an accuracy of over 80%, but we managed to do much bettet than that :)

## Next steps

Our spam filter has great accuracy. But, so long as the accuracy is not at 100%, there's room for improvement. Next steps include: 

* Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.
* Make the filtering process more complex by making the algorithm sensitive to letter case.

In [17]:
# Isolate the 19 messages the spam filter classified incorrectly
incorrectly_labeled = test_sms[test_sms['Label'] != test_sms['predicted']]
incorrectly_labeled

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
287,spam,Cashbin.co.uk (Get lots of cash this weekend!)...,ham
319,ham,We have sent JD for Customer Service cum Accou...,spam
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject:...,ham
466,spam,You won't believe it but it's true. It's Incre...,ham
492,ham,"Madam,regret disturbance.might receive a refer...",spam
504,spam,Oh my god! I've found your number again! I'm s...,ham


Here are the 19 messages the spam filter labled incorrectly. Figuring out why they were labeled incorrectly is beyond the scope of this project, as we only aim to practice and use our understandning of the Naive Bayes algorithm.

## Conclusion

We built a spam filter for SMS (text) messages to study the practical side of the (multinomial) 
Naive Bayes algorithm. We initially aimed to build a filter with greater than 80% accuracy but we managed to do much better, as our spam filter had a 98.29% accuracy on the test dataset we used. 