# Building a Spam Filter with Naive Bayes

We're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous lesson that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans. You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
sms.shape

(5572, 2)

In [3]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Training and Test Set

Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

However, before creating it, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing.

In [4]:
sms_random = sms.sample(frac=1, random_state=1)
size = sms_random.shape[0]

In [5]:
training = sms_random.iloc[:int(size/100*80)].copy()
training.reset_index()
training['Label'].value_counts(normalize=True) * 100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [6]:
test = sms_random.iloc[int(size/100*80):].copy()
test.reset_index()
test['Label'].value_counts(normalize=True) * 100

ham     86.816143
spam    13.183857
Name: Label, dtype: float64

We can see that the two datasets we extracted have approximately the same proportion of spam and non-spam messages as the original dataset.

## Letter Case and Punctuation

To calculate all the probabilities, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need. 

<img src="transformation.png"/>

We'll start with: 
- All words in the vocabulary are in lower case, so `SECRET` and `secret` come to be considered to be the same word.
- Punctuation is not taken into account anymore.

In [7]:
training['SMS'] = training['SMS'].str.replace('\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.replace('\s{2}', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower().copy()
training['SMS']

1078                          yep by the pretty sculpture
4028          yes princess are you going to make me moan 
958                            welp apparently he retired
4642                                              havent 
4674    i forgot 2 ask ü all smth  there s a card on d...
                              ...                        
4255                 how about clothes jewelry and trips 
1982    sorry i ll call later in meeting any thing rel...
5180    babe i fucking love you too  you know fuck it ...
4020    u ve been selected to stay in 1 of 250 top bri...
371     hello my boytoy   geeee i miss you already and...
Name: SMS, Length: 4457, dtype: object

## Creating the Vocabulary

We'll create a list with all of the unique words (**vocabulary**) that occur in the messages of our training set

In [23]:
vocabulary = []
for sms in training['SMS'].str.split():
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
vocabulary[:10]

['steed',
 'vibrant',
 'took',
 'hme',
 'everybody',
 'tablets',
 'invite',
 'token',
 'lk',
 'ahmad']

## The Final Training Set

Now we're going to use the vocabulary to make the data transformation we need:

<img src="transformation.png"/>

In [24]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}
for i, sms in enumerate(training['SMS']):
    for word in sms:
        if word in word_counts_per_sms:
            word_counts_per_sms[word][i] += 1 

In [25]:
words_count = pd.DataFrame(word_counts_per_sms)
words_count

Unnamed: 0,steed,vibrant,took,hme,everybody,tablets,invite,token,lk,ahmad,...,pub,skateboarding,goes,audiitions,corrct,entire,endowed,unmits,advice,window
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
training_final = pd.concat([training,words_count], axis=1)
training_final

Unnamed: 0,Label,SMS,steed,vibrant,took,hme,everybody,tablets,invite,token,...,pub,skateboarding,goes,audiitions,corrct,entire,endowed,unmits,advice,window
0,ham,go until jurong point crazy available only in...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,ok lar joking wif u oni,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,u dun say so early hor u c already then say,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,nah i don t think he goes to usf he lives arou...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...,,,,,,,,,...,,,,,,,,,,
5568,ham,will ü b going to esplanade fr home,,,,,,,,,...,,,,,,,,,,
5569,ham,pity was in mood for that so any other sugge...,,,,,,,,,...,,,,,,,,,,
5570,ham,the guy did some bitching but i acted like i d...,,,,,,,,,...,,,,,,,,,,


## Calculating Constants First

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, recall that we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:

- P(Spam) and P(Ham)
- NSpam, NHam, NVocabulary

In [30]:
p_spam = len(training_final[training_final['Label'] == 'spam']) / len(training_final) 
p_spam

0.11233851338700618

In [45]:
p_ham = 1 - p_spam
p_ham

0.8876614866129938

In [64]:
spam_words = training_final[training_final['Label'] == 'spam']['SMS'].str.split().apply(len)
n_spam = spam_words.sum()
n_spam

15190

In [66]:
ham_words = training_final[training_final['Label'] == 'ham']['SMS'].str.split().apply(len)
n_ham = ham_words.sum()
n_ham

57233

In [58]:
n_vocabulary = len(vocabulary)
n_vocabulary

7782

In [53]:
alpha = 1

## Calculating Parameters

For P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

Although both P(wi|Spam) and P(wi|Ham) vary depending on the word, the probability for each individual word is constant for every new message.

For instance, let's say we receive two new messages:

- "secret code"
- "secret party 2night"

We'll need to calculate P("secret"|Spam) for both these messages, and we can use the training set to get the values.

In more technical language, the probability values that P(wi|Spam) and P(wi|Ham) will take are called parameters.


In [70]:
spam_parameters = {word: 0 for word in vocabulary}
ham_parameters = {word: 0 for word in vocabulary}

In [85]:
spam_messages = training_final[training_final['Label'] == 'spam']
ham_messasges = training_final[training_final['Label'] == 'ham']

for word in vocabulary:
    n_word_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_spam + alpha) / (n_spam + (alpha * n_vocabulary))
    spam_parameters[word]  = p_word_given_spam
    
    n_word_ham = ham_messasges[word].sum()
    p_word_given_ham = (n_word_ham + alpha) / (n_ham + (alpha * n_vocabulary)) 
    ham_parameters[word] = p_word_given_ham