# Spam filter using Naive Bayes

## Introduction

In this project I am going to build a probabilistic classifier using the multinomial Naive Bayes algorithm. The aim is to classify messages as spam or non-spam. To train the model I will use a dataset from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). This dataset contains over 5,000 SMS messages that humans have labeled as spam or not.

### Summary of results

Using the Naive Bayes algorithm we are able to achieve an accuracy of over 98% on the test set, which is a satisfying result.

## Data exploration

Let's start by reading the dataset into a pandas DataFrame.

In [2]:
import pandas as pd
import string
import re

In [3]:
data_sms = pd.read_csv('data//SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])

In [51]:
# Number of rows and columns
data_sms.shape

(5572, 2)

In [52]:
data_sms.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [53]:
# % of spam and non-spam 
data_sms.Label.value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

As we can see, there are 5,572 SMS messages, each labeled as spam or ham (meaning non-spam). About 87% of the messages are ham and the remaining 13% are spam.


## Train/test validation

Let's now randomly split our data into two datasets:
- 80% training
- 20% testing

We use the training data to fit our probability model and the remaining data to check how accurate it is. A common rule of thumb is to split the data into 80% training and 20% testing. On the one hand we want as much data as possible to train the model; on the other hand we also need enough data to test it.



In [54]:
# randomizing dataset
randomized_data = data_sms.sample(frac=1,random_state=1)

In [55]:
training = randomized_data.head(int(len(randomized_data)*(80/100))).copy()
training.shape

(4457, 2)

In [56]:
test = randomized_data.iloc[training.shape[0]:].copy()
test.shape

(1115, 2)

In [57]:
# % of spam and non-spam in training data
training.Label.value_counts(normalize=True)

ham     0.86538
spam    0.13462
Name: Label, dtype: float64

In [58]:
# % of spam and non-spam in test data
test.Label.value_counts(normalize=True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64

The data is now randomly split. Approximately 13% of the messages are labeled spam in both datasets, which matches the proportion in the full dataset, so both subsets are representative of it.

## Data transformation

Let's transform our datasets so that we can check how many times each word occurs in each message.

In [59]:
# Basic cleaning

# Removing punctuation (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
punctuation = string.punctuation
training['SMS'] = training['SMS'].str.replace(r'[{}]+'.format(re.escape(punctuation)), ' ', regex=True).str.lower().str.strip()

# Building vocabulary of unique words
vocabulary=[]
for sms in training['SMS']:
    for word in sms.split():
        if word not in vocabulary:
            vocabulary.append(word)

In [60]:
# # Cleaning using Natural Language processing

# import nltk
# sms_words=[]
# vocabulary=[]
# useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)

# for index, sms in enumerate(training['SMS']):
#     bag_of_words = nltk.word_tokenize(sms)
#     bag_of_words_clean = []
#     for word in bag_of_words:
#         word = word.lower()
#         if word not in useless_words:
#             bag_of_words_clean.append(word)
#             if word not in vocabulary:
#                 vocabulary.append(word)
#     training['SMS'].iloc[index] = ' '.join(bag_of_words_clean)

In [61]:
len(vocabulary)

7857

We have developed two approaches:
- Basic cleaning: removing only punctuation and splitting words on whitespace,
- Natural language processing: tokenizing text and removing stop words as well as punctuation.

To our surprise, basic cleaning gave better results, so we stick with it.
We have created a vocabulary list containing all unique words across the messages in the training dataset. There are 7,857 unique words.

Let's now transform our dataset so that each vocabulary word becomes a column, with the value in each row corresponding to how many times that word occurs in the SMS column.

In [62]:
# Creating a dict that maps each unique word to a list of zeros, one entry per message
word_counts_per_sms={}
for unique_word in vocabulary:
    word_counts_per_sms[unique_word] = [0] * len(training['SMS'])

# Incrementing corresponding row-column value by 1
for index, sms in enumerate(training['SMS']):
    for word in sms.split():
#         if word in word_counts_per_sms: # new line
        word_counts_per_sms[word][index] += 1

In [63]:
training_exp = pd.DataFrame(word_counts_per_sms)

In [64]:
training_transformed = pd.concat([training.reset_index(drop=True),training_exp], axis=1)
training_transformed.head(3)

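| | Label | SMS | yep | by | the | pretty | sculpture | yes | princess | are | ... | prakesh | beauty | hides | secrets | n8 | jewelry | related | trade | arul | bx526 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | yep by the pretty sculpture | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |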


We have transformed the training dataset by adding one new column for each unique word in the vocabulary. For every row, the value in a word's column is the number of times that word occurs in the SMS column; if a word doesn't occur in that message, the value is 0.
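The same transformation can be illustrated on a tiny example. This is a minimal sketch using only the standard library; the three messages are made up for illustration and are assumed to be already lower-cased and free of punctuation.

```python
from collections import Counter

# Toy messages (illustrative only, not taken from the dataset)
messages = ["free prize call now", "call me now", "see you at lunch"]

# Vocabulary in order of first appearance; dict keys preserve insertion order
vocabulary = list(dict.fromkeys(word for sms in messages for word in sms.split()))

# One row per message, one count per vocabulary word
count_rows = []
for sms in messages:
    counts = Counter(sms.split())
    count_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

Each printed row lines up with the vocabulary list, just as each row of `training_transformed` lines up with the word columns.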

## Naive Bayes model

Next we can start building our model. We need to compare the probability that a given message is spam with the probability that it is non-spam. We can do this by comparing these two equations:

\begin{equation*}
P(Spam|w_{1},w_{2},...,w_{n}) \propto P(Spam) \times \prod_{i=1}^{n} P(w_{i}|Spam)
\end{equation*}

\begin{equation*}
P(Ham|w_{1},w_{2},...,w_{n}) \propto P(Ham) \times \prod_{i=1}^{n} P(w_{i}|Ham)
\end{equation*}

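To calculate the probability of a word given spam we use the formula below, where $N_{w_{i}|Spam}$ is the number of times the word $w_{i}$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, $N_{Vocabulary}$ is the number of unique words in the vocabulary and $\alpha$ is the smoothing parameter. The formula for ham is analogous.
\begin{equation}
P(w_{i}|Spam) = \frac{N_{w_{i}|Spam}+ \alpha}{N_{Spam} + \alpha \times N_{Vocabulary}}
\end{equation}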
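To see what the smoothing term does, here is a small worked example of the formula above. The counts are hypothetical (a spam class with 10 words in total and a 5-word vocabulary), and `smoothed_probability` is an illustrative helper, not part of the notebook.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w | class) with additive (Laplace) smoothing."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical per-word counts in spam; they sum to N_Spam = 10
word_counts = [3, 0, 2, 4, 1]
n_spam = sum(word_counts)
n_vocabulary = len(word_counts)

p_seen = smoothed_probability(word_counts[0], n_spam, n_vocabulary)    # (3 + 1) / (10 + 5)
p_unseen = smoothed_probability(word_counts[1], n_spam, n_vocabulary)  # (0 + 1) / (10 + 5)

# The smoothed probabilities over the whole vocabulary still sum to 1
total = sum(smoothed_probability(c, n_spam, n_vocabulary) for c in word_counts)

print(p_seen, p_unseen, total)
```

Without smoothing, the word that never occurs in spam would get probability 0, and a single such word would force the whole product for spam to 0.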


Let's start by calculating **P(Spam)** and **P(Ham)**, along with the word totals and the smoothing parameter.

In [65]:
# Prior probabilities P(Spam) and P(Ham); index by label rather than by position
p_ham = training_transformed['Label'].value_counts(normalize=True)['ham']
p_spam = 1 - p_ham

# Number of words in Spam dataset
training_spam = training_transformed[training_transformed['Label']=='spam']
n_words_spam = 0
for sms in training_spam['SMS']:
    n_words_spam += len(sms.split())

# Number of words in Ham dataset
training_ham = training_transformed[training_transformed['Label']=='ham']
n_words_ham = 0
for sms in training_ham['SMS']:
    n_words_ham += len(sms.split())
        
# Vocabulary size
n_words_voc = len(vocabulary)

# Alpha value
alpha = 1

Let's now create two dictionaries, one for spam and one for ham messages, where each key is a unique word from our vocabulary and each value is the probability of that word given the class.

In [66]:
# Calculating the probability of words given spam and ham
words_given_spam = {}
words_given_ham = {}

for word in vocabulary:
    words_given_spam[word] = (training_spam[word].sum() + alpha)/(n_words_spam+(alpha*n_words_voc))
    words_given_ham[word] = (training_ham[word].sum() + alpha)/(n_words_ham+(alpha*n_words_voc))

We have now computed all the parameters needed for the equations above. **P(words|Spam)** and **P(words|Ham)** are stored in the words_given_spam and words_given_ham dictionaries respectively.

## Message classification

In [73]:
def classify_message(sms):

    # transform the sms into a list of lower-case words without punctuation
    message = re.sub(r'[{}]+'.format(re.escape(punctuation)), ' ', sms)
    message = message.lower().strip().split()

    # start from the priors and multiply in the probability of each known word
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        # words outside the training vocabulary are ignored (dict lookup is faster than a list scan)
        if word in words_given_spam:
            p_spam_given_message *= words_given_spam[word]
            p_ham_given_message *= words_given_ham[word]

    # return the label with the higher score
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

The function *classify_message()* takes a message and returns *spam* if the probability of spam given the message is greater than the probability of ham given the message, and *ham* in the opposite case. If the two probabilities are equal, the program can't classify the message, so human assistance is required.
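One caveat with multiplying many probabilities is that the product can underflow to 0.0 for very long messages, which would make both scores equal. A common variant compares sums of logarithms instead. The sketch below is not the notebook's method; `classify_log` and the toy parameter values are hypothetical, and the function takes an already tokenized list of words.

```python
import math

def classify_log(words, p_spam, p_ham, words_given_spam, words_given_ham):
    """Compare log-scores instead of raw products; unknown words are skipped."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in words_given_spam:
            log_spam += math.log(words_given_spam[word])
            log_ham += math.log(words_given_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Toy parameters (made up for illustration, not estimated from the dataset)
toy_spam = {'free': 0.20, 'call': 0.10, 'lunch': 0.01}
toy_ham = {'free': 0.02, 'call': 0.08, 'lunch': 0.10}

print(classify_log(['free', 'call'], 0.13, 0.87, toy_spam, toy_ham))
print(classify_log(['lunch'], 0.13, 0.87, toy_spam, toy_ham))
# A raw product over 1000 words would underflow to 0.0 for both classes
print(classify_log(['free'] * 1000, 0.13, 0.87, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing log-scores picks the same label as comparing the products whenever the products are still representable.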

Let's test this function on two messages.

In [68]:
# test based on two messages
sms1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
sms2 = 'Sounds good, Tom, then see u there'
print('Label for sms1: ' +classify_message(sms1))
print('Label for sms2: ' +classify_message(sms2))

Label for sms1: spam
Label for sms2: ham


It looks like both messages were classified correctly. Let's now check the accuracy of our model against the test dataset.

In [75]:
test['predicted'] = test['SMS'].apply(classify_message)
test.head(4)

| | Label | SMS | predicted |
|---|---|---|---|
| 3482 | ham | Wherre's my boytoy ? :-( | ham |
| 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |


In [74]:
correct = (test['Label'] == test['predicted']).sum()
total = test.shape[0]
print('Accuracy of model: ' + str(correct/total*100))

Accuracy of model: 98.7443946188341
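Since about 87% of the test messages are ham, it can also be useful to look at how the errors are distributed between the two classes. The sketch below shows one way to do that with the standard library; the labels and predictions are toy values, not the notebook's results.

```python
from collections import Counter

# Toy labels and predictions (illustrative, not the notebook's results)
actual = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
predicted = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Count each (actual, predicted) pair
pairs = Counter(zip(actual, predicted))

# Overall accuracy, as computed in the cell above
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Share of actual spam messages that were predicted as spam
spam_total = pairs[('spam', 'spam')] + pairs[('spam', 'ham')]
spam_caught = pairs[('spam', 'spam')] / spam_total

print(pairs)
print(accuracy, spam_caught)
```

The same counts could be obtained from the `Label` and `predicted` columns of `test`.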


## Conclusion

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result.