# <p style="text-align: center;">Building a Spam Filter for SMS Messages with Naive Bayes</p>

This project uses the Multinomial Naive Bayes algorithm to build a spam filter for SMS messages.
The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [this link](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

## Introduction

The dataset contains 5,572 messages that are already classified by humans. Our goal is to build a spam filter with an accuracy of at least 85%.

We start by getting familiar with the dataset.

In [1]:
import pandas as pd

df=pd.read_csv("SMSSpamCollection",sep='\t',header=None, names=['Label','SMS'])

In [2]:
print(df.shape)

(5572, 2)


In [3]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Here is a summary of the data:
- There are two columns: __Label__ and __SMS__
- Around 87% of messages are classified as 'ham' (non-spam), while 13% of messages are 'spam'

## Split the Dataset into Training and Test Set

To build and test the spam filter effectively, we need to split the dataset into a training set and a test set. The rule of thumb is:
- The training set contains 80% of the data
- The test set contains 20% of the data

Because the messages may have been entered in a non-random order, we first shuffle the entire dataset so that 'spam' and 'ham' messages are spread evenly throughout it.
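The shuffle-then-split idea can be sketched in plain Python, independent of pandas. This is only an illustration under stated assumptions: the `rows` list and the `shuffle_and_split` helper are hypothetical and are not part of the notebook, which uses `DataFrame.sample` and `DataFrame.drop` instead.

```python
import random

# Hypothetical stand-in for the dataset: 10 (label, text) rows.
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", "win a prize"), ("spam", "free entry")]

def shuffle_and_split(data, train_fraction=0.8, seed=1):
    """Shuffle a copy of `data`, then cut it into a training and a test part."""
    shuffled = list(data)                  # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_rows, test_rows = shuffle_and_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

The two parts are disjoint and together contain every row exactly once, which is the property the pandas-based split in the following cells also relies on.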

In [5]:
#Randomize the dataset to ensure the proper spread of 'spam' and 'ham'
df=df.sample(frac=1,random_state=1)

In [6]:
df.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [7]:
# extract 80% of data as train set
train=df.sample(frac=0.8,random_state=1)


In [8]:
#The rest of dataset as test set
test=df.drop(train.index.tolist())

In [9]:
train.shape

(4458, 2)

In [10]:
train.head()

Unnamed: 0,Label,SMS
3404,ham,Good night my dear.. Sleepwell&amp;Take care
4781,ham,Sen told that he is going to join his uncle fi...
484,ham,Thank you baby! I cant wait to taste the real ...
502,ham,When can ü come out?
3898,ham,No. Thank you. You've been wonderful


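Next, the indexes of both the training and test sets are reset so that the row labels start from 0 again. This matters later, when the training set is concatenated with the word-count table.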

In [11]:
train=train.reset_index(drop=True)

In [12]:
train.head()

Unnamed: 0,Label,SMS
0,ham,Good night my dear.. Sleepwell&amp;Take care
1,ham,Sen told that he is going to join his uncle fi...
2,ham,Thank you baby! I cant wait to taste the real ...
3,ham,When can ü come out?
4,ham,No. Thank you. You've been wonderful


In [13]:
test=test.reset_index(drop=True)

In [14]:
test.head()

Unnamed: 0,Label,SMS
0,ham,Welp apparently he retired
1,ham,Dai what this da.. Can i send my resume to thi...
2,ham,I am late. I will be there at
3,spam,Congrats! 2 mobile 3G Videophones R yours. cal...
4,ham,Ooooooh I forgot to tell u I can get on yovill...


In [15]:
train['Label'].value_counts(normalize=True)

ham     0.866756
spam    0.133244
Name: Label, dtype: float64

In [16]:
test['Label'].value_counts(normalize=True)

ham     0.862657
spam    0.137343
Name: Label, dtype: float64

After examining the training and test sets:
- In both sets, 'ham' makes up about 86-87% of the messages and 'spam' about 13-14%, which is consistent with the original data.

Now we can proceed to the next step.

## Create the Vocabulary

The basic idea of Naive Bayes for a spam filter is to estimate how likely each word is to occur in 'spam' or 'non-spam' messages. Therefore, we first need to isolate the individual words and create a vocabulary that contains the unique words from all SMS messages in the training set. There are two main steps:
- 'Normalize' each message: remove the punctuation, convert it to lowercase, and split it into individual words
- Create the vocabulary for the training set
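The two steps above can be sketched on plain strings as follows. The `normalize` helper and the second message are assumptions made for illustration; the first message is the first row of the training set shown earlier. Sorting the vocabulary is also an addition here, used only to make the printed order reproducible.

```python
import re

def normalize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r"\W", " ", message).lower().split()

messages = [
    "Good night my dear.. Sleepwell&amp;Take care",  # from the training set
    "Take care, dear!",                              # hypothetical message
]

tokenized = [normalize(m) for m in messages]

# The vocabulary is the set of unique words across all messages.
vocabulary = sorted({word for words in tokenized for word in words})

print(tokenized[0])
# ['good', 'night', 'my', 'dear', 'sleepwell', 'amp', 'take', 'care']
print(vocabulary)
# ['amp', 'care', 'dear', 'good', 'my', 'night', 'sleepwell', 'take']
```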

In [17]:
#Remove the punctuation (regex=True is required in recent pandas versions)
train['SMS']=train['SMS'].str.replace(r'\W',' ',regex=True)

In [18]:
train.head()

Unnamed: 0,Label,SMS
0,ham,Good night my dear Sleepwell amp Take care
1,ham,Sen told that he is going to join his uncle fi...
2,ham,Thank you baby I cant wait to taste the real ...
3,ham,When can ü come out
4,ham,No Thank you You ve been wonderful


In [19]:
train['SMS']=train['SMS'].str.lower()

In [20]:
train['SMS']=train['SMS'].str.split()

In [21]:
train.iloc[0,1]

['good', 'night', 'my', 'dear', 'sleepwell', 'amp', 'take', 'care']

In [22]:
#Build the vocabulary
vocabulary=[]
for sms in train['SMS']:
    for w in sms:
        vocabulary.append(w)

vocabulary=list(set(vocabulary))
    
    

In [23]:
len(vocabulary)

7712

There are **7,712** unique words in the training set's vocabulary.

In [24]:
len(train)

4458

## Data Transformation
We already have the full vocabulary of the training messages. Now we need to transform the training set into a table that records how many times each vocabulary word occurs in each SMS.
- We start by initializing a dictionary named word_counts_per_sms, where each key is a unique word (a string) from the vocabulary, and each value is a list as long as the training set, with every element set to 0.

  - The code [0] * 5 outputs [0, 0, 0, 0, 0], so [0] * len(train['SMS']) outputs a list of length len(train['SMS']) in which every element is 0.
  
- We loop over train['SMS'] with the enumerate() function to get both the index and the SMS message (index and sms).

   - Using a nested loop, we loop over sms (a list of strings, where each string is a word in the message).
   - We increment word_counts_per_sms[word][index] by 1.
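To make the resulting structure concrete, here is the same procedure applied to two tiny, hypothetical tokenized messages (these are not rows from the dataset). Each vocabulary word maps to one count per message.

```python
# Hypothetical tokenized messages (already lowercased and split).
tokenized = [["secret", "prize", "secret"], ["hello", "prize"]]
vocabulary = sorted({word for sms in tokenized for word in sms})

# One list of zeros per vocabulary word, with one slot per message.
word_counts_per_sms = {word: [0] * len(tokenized) for word in vocabulary}

# Count every occurrence of every word in the slot of its message.
for index, sms in enumerate(tokenized):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms)
# {'hello': [0, 1], 'prize': [1, 1], 'secret': [2, 0]}
```

Read column-wise, position 0 of every list describes the first message ('secret' twice, 'prize' once), which is the row layout the data frame has after conversion.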

In [25]:
word_counts_per_sms={unique_word: [0]*len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] +=1

In [26]:
#transform word_counts_per_sms into a Pandas data frame

word_counts=pd.DataFrame(word_counts_per_sms)

In [33]:
word_counts['subs'].value_counts()

0    4455
1       3
Name: subs, dtype: int64

Once we have the 'word_counts' dataframe, we concatenate it with the original 'train' dataframe so that all features are in one dataframe.

In [27]:
train_f=pd.concat([train,word_counts],axis=1)

In [28]:
train_f.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,0121,01223585236,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,ham,"[good, night, my, dear, sleepwell, amp, take, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[sen, told, that, he, is, going, to, join, his...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[thank, you, baby, i, cant, wait, to, taste, t...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[when, can, ü, come, out]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,ham,"[no, thank, you, you, ve, been, wonderful]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
#checking the dataframe, it looks correct
train_f.loc[0,'good']

1

## Calculate the Constants First
Now that we have the training set in this form, we can begin creating the spam filter. To classify a new message, we compare two quantities computed from the words it contains:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

To calculate P(wi|Spam) and P(wi|Ham) inside the formulas above,  we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

For every new message, several terms in these equations stay the same, so we compute them once:
- P(Spam) and P(Ham)
- NSpam, NHam, and NVocabulary

- NSpam is the total number of words in all the spam messages
- NHam is the total number of words in all the non-spam messages
- NVocabulary is the number of unique words in the vocabulary
- We'll also use Laplace smoothing and set α = 1
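A small worked example may help show what the smoothing term does. The counts below are hypothetical (they are not taken from the dataset), and `smoothed_probability` is an illustrative helper name; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction

def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w|class) = (N_w|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return Fraction(n_word_given_class + alpha, n_class + alpha * n_vocabulary)

# Hypothetical counts: 7 words across all spam messages, 9 words in the vocabulary.
p_seen = smoothed_probability(2, 7, 9)    # word occurs twice in spam
p_unseen = smoothed_probability(0, 7, 9)  # word never occurs in spam

print(p_seen, p_unseen)  # 3/16 1/16
```

Without smoothing (α = 0) the unseen word would get probability 0, and a single such word would force the whole product for that class to 0.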


In [30]:
# p(spam)
spam=train_f[train_f['Label']=='spam']
p_spam=len(spam)/len(train_f)

# p(ham)
ham=train_f[train_f['Label']=='ham']
p_ham=len(ham)/len(train_f)

In [33]:
# Number of words in all the spam messages
spam_n_words=[]
for sms in spam['SMS']:
    for w in sms:
        spam_n_words.append(w)

In [35]:
#Number of word in all the ham messages:
ham_n_words=[]
for sms in ham['SMS']:
    for w in sms:
        ham_n_words.append(w)

In [37]:
n_spam=len(spam_n_words)
n_ham=len(ham_n_words)
n_vocabulary=len(vocabulary)
alpha=1

## Calculate the Probability of Word Occurrences Given 'Spam' or 'Ham'

Now that we have the constants, we can calculate the probability of each unique word given 'spam' and given 'ham':

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

In [38]:
# Initiate the dictionaries for each category
p_words_given_spam={unique_words:0 for unique_words in vocabulary}
p_words_given_ham={unique_words:0 for unique_words in vocabulary}

In [39]:
for word in vocabulary:
    #for 'spam'
    n_word_given_spam=spam[word].sum()
    p_words_given_spam[word]=(n_word_given_spam + alpha)/(n_spam + alpha * n_vocabulary)
    
    #for 'ham'
    n_word_given_ham=ham[word].sum()
    p_words_given_ham[word]=(n_word_given_ham + alpha)/(n_ham + alpha * n_vocabulary)
    

## Build the Classifier Based on the Probability

Now that we have the constants and the per-word probabilities for 'spam' and 'ham', we can classify a message. The spam filter works as follows:
1. Takes a message as input and splits it into individual words;
2. Calculates P(spam|w1.....wn) and P(ham|w1....wn), skipping words that are not in the vocabulary;
3. Compares the two values and returns the label with the larger one, or asks for human classification if they are equal.

In [44]:
import re

def classify(message):
    message=re.sub(r'\W',' ',message)
    message=message.lower()
    message=message.split()
    
    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    
    for word in message:
        if word in p_words_given_spam:
            p_spam_given_message *= p_words_given_spam[word]
        
    
        if word in p_words_given_ham:
            p_ham_given_message *=p_words_given_ham[word]
        
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Need human classification'
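One practical caveat of multiplying many small probabilities is floating-point underflow: for a very long message both products can round to 0.0 and end in a tie. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below assumes small, hand-picked probabilities (hypothetical values, not the ones computed from the training set) and an illustrative function name `classify_log`.

```python
import math
import re

# Hypothetical parameters standing in for the values learned from the training set.
p_spam, p_ham = 0.5, 0.5
p_words_given_spam = {"secret": 0.4, "prize": 0.4, "hello": 0.2}
p_words_given_ham = {"secret": 0.2, "prize": 0.1, "hello": 0.7}

def classify_log(message):
    """Same decision rule as the classifier above, but computed in log space."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the original function.
        if word in p_words_given_spam:
            log_spam += math.log(p_words_given_spam[word])
        if word in p_words_given_ham:
            log_ham += math.log(p_words_given_ham[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "Need human classification"

long_message = "prize " * 1000
print(0.4 ** 1000)                 # 0.0 (the raw product underflows)
print(classify_log(long_message))  # spam
print(classify_log("hello"))       # ham
```

Because the logarithm is monotonic, comparing sums of logs gives the same ordering as comparing the products whenever the products are representable.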

## Measuring the Spam Filter Accuracy on Test Set

We already built the Spam Filter. Now we can apply it to our test dataset and investigate its accuracy.

In [45]:
#Apply the spam filter to the test set
test['predicted']=test['SMS'].apply(classify)

In [46]:
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Welp apparently he retired,ham
1,ham,Dai what this da.. Can i send my resume to thi...,ham
2,ham,I am late. I will be there at,ham
3,spam,Congrats! 2 mobile 3G Videophones R yours. cal...,spam
4,ham,Ooooooh I forgot to tell u I can get on yovill...,ham


In [58]:
#check the correctness of the prediction
correctness=(test['Label']==test['predicted'])

In [59]:
#Percentage of the correct predictions
correctness.value_counts(normalize=True)

True     0.987433
False    0.012567
dtype: float64

In [79]:
correctness.value_counts()

True     1100
False      14
dtype: int64

- The accuracy of the spam filter on the test set is about __98.74%__.
- It correctly classified 1,100 messages and misclassified 14.
- This comfortably exceeds our initial goal of 85% accuracy.
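The accuracy figure is simply the share of correct predictions. As a quick check, the sketch below recomputes it from the counts reported above and also shows a generic helper; the `accuracy` function and the toy label lists are illustrative assumptions, not part of the notebook.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    return correct / len(labels)

# Counts reported for the test set: 1100 correct, 14 incorrect.
n_correct, n_wrong = 1100, 14
print(round(n_correct / (n_correct + n_wrong), 6))  # 0.987433

# Toy example with hypothetical labels: 3 of 4 predictions are right.
toy_labels = ["ham", "ham", "spam", "spam"]
toy_predictions = ["ham", "spam", "spam", "spam"]
print(accuracy(toy_labels, toy_predictions))  # 0.75
```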

## Further Investigation
So far, the spam filter has shown high accuracy on the test set. We still want to look at the messages it classified incorrectly.

In [61]:
test_f=test.copy()

In [62]:
test_f['correctness']=correctness

In [63]:
test_f.head()

Unnamed: 0,Label,SMS,predicted,correctness
0,ham,Welp apparently he retired,ham,True
1,ham,Dai what this da.. Can i send my resume to thi...,ham,True
2,ham,I am late. I will be there at,ham,True
3,spam,Congrats! 2 mobile 3G Videophones R yours. cal...,spam,True
4,ham,Ooooooh I forgot to tell u I can get on yovill...,ham,True


In [74]:
wrong_p=test_f[~test_f['correctness']]

In [76]:
wrong_p['predicted'].value_counts()

ham                          8
spam                         4
Need human classification    2
Name: predicted, dtype: int64

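Out of the 14 misclassified messages, 2 were flagged as 'Need human classification', while the other 12 received the wrong label (8 predicted as 'ham' and 4 as 'spam').
- One possible reason for misclassification is __limited information__: words that do not appear in the training vocabulary are ignored by the classifier.
- Collecting a larger dataset might improve the filter's performance.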
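To see in which direction the errors go, one option is to tally (actual, predicted) pairs for the misclassified messages. The sketch below uses short, hypothetical label lists rather than the real test set, so the counts it prints are not results from this project.

```python
from collections import Counter

# Hypothetical true labels and predictions for five messages.
labels = ["ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "ham", "spam", "spam", "Need human classification"]

# Keep only the pairs where the prediction differs from the true label.
errors = Counter(
    (actual, guess) for actual, guess in zip(labels, predicted) if actual != guess
)

for (actual, guess), count in errors.items():
    print(f"actual={actual!r} predicted={guess!r}: {count}")
```

A spam message predicted as 'ham' reaches the inbox, while a ham message predicted as 'spam' is hidden from the user, so the two kinds of error are worth inspecting separately.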