# Building a Spam Filter with Naive Bayes

In this project we explore the practical side of the multinomial Naive Bayes algorithm by building a spam filter for SMS messages. Our goal is to write a program that classifies new messages as spam or non-spam with an accuracy greater than 95%.

To train the algorithm, we will use a dataset of 5,572 SMS messages that can be found here: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.

We will start by reading in and exploring the dataset.

## Exploring the Dataset

In [27]:
import pandas as pd

# data points are tab separated with no header row
sms_data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_data.shape)
sms_data.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [28]:
sms_data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We can see that about 87% of the messages are labeled ham (non-spam) and about 13% are labeled spam.


## Splitting the Data into the Training Set & Test Set

Next we are going to split our data into a training set and a test set. We will make the training data 80% of the dataset, and the remaining 20% will be used to test how good our spam filter is at classifying new messages.

In [29]:
# Randomize the dataset
data_randomized = sms_data.sample(frac=1, random_state=1)

# Calculate index for the split
data_index = round(len(data_randomized) * 0.8)

# Split into Train and Test
training_set = data_randomized[:data_index].reset_index(drop=True)
test_set = data_randomized[data_index:].reset_index(drop=True)

print("Train:", training_set.shape)
print("Test:", test_set.shape)

Train: (4458, 2)
Test: (1114, 2)
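
The two set sizes follow directly from the rounding step in the cell above. As a minimal check of that arithmetic (the variable names here are illustrative and not part of the notebook), the same numbers can be reproduced without loading the data:

```python
# Reproduce the split arithmetic using the row count reported by sms_data.shape.
total_messages = 5572

# Same expression as data_index in the notebook: 80% of the rows, rounded.
train_size = round(total_messages * 0.8)

# Everything after the split index goes to the test set.
test_size = total_messages - train_size

print(train_size, test_size)
```

Because the split is a plain slice of the shuffled rows, the two sizes always add up to the full dataset.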


Now we will check the percentage of spam and non-spam messages in the training and test sets to be sure that the numbers are close to the ratio we had for the full dataset.

In [30]:
print("Train:", "\n", training_set['Label'].value_counts(normalize=True))
print("Test:", "\n", test_set['Label'].value_counts(normalize=True))

Train: 
 ham     0.86541
spam    0.13459
Name: Label, dtype: float64
Test: 
 ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The results look similar to the full dataset. Next we will clean the data.

## Data Cleaning

We are going to transform the dataset so that each unique word found in the messages becomes its own column, holding the number of times that word appears in each message.
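
To make the target shape concrete, here is a small sketch on two made-up messages (they are not from the dataset, and the names `toy_messages`, `toy_vocabulary`, and `toy_rows` are hypothetical). Each row corresponds to a message and each position in the row to a vocabulary word:

```python
# Two invented, already-clean messages used only for illustration.
toy_messages = ["secret prize claim now", "coming to my secret party"]

# Split each message into words.
tokenized = [message.split() for message in toy_messages]

# Unique words across all messages, sorted so the column order is stable.
toy_vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per message: how often each vocabulary word occurs in it.
toy_rows = [[tokens.count(word) for word in toy_vocabulary] for tokens in tokenized]

print(toy_vocabulary)
for row in toy_rows:
    print(row)
```

Most entries are zero because any single message uses only a small part of the vocabulary, which is also what the real table looks like.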

### Letter Case and Punctuation

We will begin by removing punctuation and making all words lower case.

In [31]:
# Before cleaning
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [33]:
# After cleaning
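# regex=True is needed in recent pandas versions, where str.replace matches literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)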
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
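
To see what the substitution does to a single message, the sketch below applies the same pattern with Python's `re` module to the first training message shown above. This assumes `re.sub` behaves like the pandas string method for this pattern; the variable names are illustrative.

```python
import re

# First message from the training set, as displayed before cleaning.
sample = "Yep, by the pretty sculpture"

# Replace every non-word character (anything other than letters, digits, underscore)
# with a space, then lowercase the result.
cleaned = re.sub(r"\W", " ", sample).lower()

# repr() makes the whitespace visible: the comma becomes a second space.
print(repr(cleaned))

# Splitting on whitespace collapses the repeated spaces into clean tokens.
tokens = cleaned.split()
print(tokens)
```

Each punctuation mark is replaced by a space rather than deleted, so runs of spaces can appear. They are harmless here because the later `str.split()` call treats any run of whitespace as a single separator. Non-ASCII letters such as `ü` count as word characters and are kept.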


### Creating the Vocabulary

Now we will create the vocabulary, which in this context means a list with all unique words found in the training set.

In [34]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
# Transforms vocabulary into a set to remove duplicates, and then back into a list        
vocabulary = list(set(vocabulary))

In [36]:
# View number of unique words in the training set
len(vocabulary)

7783
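
The loop above collects every word and then removes duplicates. As an alternative sketch (not the notebook's method), a set comprehension does both in one step. The toy input below reuses three tokenized messages from the table above plus one made-up message with a repeated word; `tokenized_messages` and `toy_vocabulary` are hypothetical names.

```python
# Tokenized messages: three from the training set preview, one invented.
tokenized_messages = [
    ["yep", "by", "the", "pretty", "sculpture"],
    ["welp", "apparently", "he", "retired"],
    ["havent"],
    ["by", "the", "by"],  # invented message with repeated words
]

# A set keeps each word once; sorting gives a reproducible order.
toy_vocabulary = sorted({word for sms in tokenized_messages for word in sms})

print(len(toy_vocabulary))
print(toy_vocabulary)
```

Sorting is optional. Without it, the order of words coming out of a set can differ between runs, which only affects the column order of the word-count table and not the counts themselves.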

### Creating the Final Training Set

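We will now use the vocabulary to build the word-count columns. A dictionary maps each unique word to a list of zeros, one per message, and we then increment the matching entry for every word occurrence in every message.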

In [37]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)

In [38]:
word_counts.head()

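| | machan | thanx | posted | ie | 09058094597 | death | paris | 09066364311 | thm | call | ... | bone | jordan | week | throw | bthere | stairs | someday | renewal | patients | accounts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |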


In [39]:
# Adding the word counts back to the original training set
training_set_clean = pd.concat([training_set, word_counts], axis=1)

training_set_clean.head()

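| | Label | SMS | machan | thanx | posted | ie | 09058094597 | death | paris | 09066364311 | ... | bone | jordan | week | throw | bthere | stairs | someday | renewal | patients | accounts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |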
