# Building a Spam Filter with Naive Bayes

We're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous lesson that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans. You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
sms.shape

(5572, 2)

In [3]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Training and Test Set

Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

However, before creating it, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing.

In [4]:
sms_random = sms.sample(frac=1, random_state=1)
size = sms_random.shape[0]

In [5]:
training = sms_random.iloc[:int(size/100*80)].copy()
training.reset_index()
training['Label'].value_counts(normalize=True) * 100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [6]:
test = sms_random.iloc[int(size/100*80):].copy()
test.reset_index()
test['Label'].value_counts(normalize=True) * 100

ham     86.816143
spam    13.183857
Name: Label, dtype: float64

We can see that the two datasets we extracted have approximately the same proportion of spam and non-spam messages as the original dataset.

## Letter Case and Punctuation

To calculate all these probabilities, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need. 

We'll start with: 
- All words in the vocabulary are in lower case, so `SECRET` and `secret` come to be considered to be the same word.
- Punctuation is not taken into account anymore.

In [7]:
training['SMS'] = training['SMS'].str.replace('\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.replace('\s{2}', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower().copy()
training['SMS']

1078                          yep by the pretty sculpture
4028          yes princess are you going to make me moan 
958                            welp apparently he retired
4642                                              havent 
4674    i forgot 2 ask ü all smth  there s a card on d...
                              ...                        
4255                 how about clothes jewelry and trips 
1982    sorry i ll call later in meeting any thing rel...
5180    babe i fucking love you too  you know fuck it ...
4020    u ve been selected to stay in 1 of 250 top bri...
371     hello my boytoy   geeee i miss you already and...
Name: SMS, Length: 4457, dtype: object

## Creating the Vocabulary

We'll create a list with all of the unique words (**vocabulary**) that occur in the messages of our training set

In [14]:
vocabulary = []
for sms in training['SMS'].str.split():
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
vocabulary[:10]

['',
 'vibrant',
 'took',
 'steed',
 'everybody',
 'tablets',
 'hme',
 'invite',
 'token',
 'lk']

In [15]:
len(vocabulary)

7783

## The Final Training Set

Now we're going to use the vocabulary to make the data transformation we need:

<img src="transformation.png"/>

In [17]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}
for i, sms in enumerate(training['SMS']):
    for word in sms:
        if word in word_counts_per_sms:
            word_counts_per_sms[word][i] += 1 

In [18]:
words_count = pd.DataFrame(word_counts_per_sms)
words_count

Unnamed: 0,Unnamed: 1,vibrant,took,steed,everybody,tablets,hme,invite,token,lk,...,pub,skateboarding,goes,audiitions,corrct,entire,endowed,unmits,advice,window
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
training_final = pd.concat([training,words_count], axis=1)
training_final.head()

Unnamed: 0,Label,SMS,Unnamed: 3,vibrant,took,steed,everybody,tablets,hme,invite,...,pub,skateboarding,goes,audiitions,corrct,entire,endowed,unmits,advice,window
0,ham,go until jurong point crazy available only in...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,ok lar joking wif u oni,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,u dun say so early hor u c already then say,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,nah i don t think he goes to usf he lives arou...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
