# Building a Spam Filter with Naive Bayes

## 1. Introduction

In this project we are going to build a **spam filter** for SMS messages using the **multinomial Naive Bayes algorithm**. We will apply this algorithm to a dataset called *`SMSSpamCollection`*, which contains *5572* SMS messages that have already been classified by humans. The dataset can be downloaded from __[The UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)__.

Let us import the necessary libraries and read the dataset.

In [1]:
#import necessary libraries

import pandas as pd
import re

In [2]:
SMS_collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])
SMS_collection.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
SMS_collection.shape

(5572, 2)

In [4]:
round(SMS_collection['Label'].value_counts(normalize=True)*100, 2)

ham     86.59
spam    13.41
Name: Label, dtype: float64

The dataset has **5572 rows** and **2 columns**. The SMS messages are labelled as *spam* or *ham*. Among them, **86.59%** are *ham* messages and **13.41%** are *spam* messages.

## 2. Training and Test Set

Before creating the *spam filter*, it is best to *design the test* first, so that the evaluation is not biased by what we learn while building the filter.

To test the spam filter, we need to split the dataset into two parts:

  - **Training Set:** used to train the computer to classify messages.
  - **Test Set:** used to test how well the spam filter classifies new messages.

We keep *80%* of the data for training and *20%* of the data for testing, which is a common split.

Our dataset has *5572* messages, which means:

  * The training set will have *4,458* messages.
  * The test set will have *1,114* messages.
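The two sizes follow from rounding 80% of the row count. A minimal check of that arithmetic (the variable names are illustrative, not part of the notebook):

```python
# Sizes of an 80/20 split for 5572 messages.
n_total = 5572
n_train = round(n_total * 0.8)   # 4457.6 rounds to 4458
n_test = n_total - n_train       # the remainder goes to the test set
print(n_train, n_test)
```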

First, we are going to randomize the entire dataset, fixing the random seed so that the results are reproducible.

In [5]:
Sample = SMS_collection.sample(frac=1, random_state=1)
Sample.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


Next, we split the randomized dataset into a *training set* with *80%* of the data and a *testing set* with the remaining *20%*. We then calculate the percentage of *spam* and *ham* messages in each of the two sets.

In [6]:
Training = Sample.sample(frac=0.8, random_state=1)

Testing = Sample.drop(Training.index)

In [7]:
Training.head()

Unnamed: 0,Label,SMS
3404,ham,Good night my dear.. Sleepwell&amp;Take care
4781,ham,Sen told that he is going to join his uncle fi...
484,ham,Thank you baby! I cant wait to taste the real ...
502,ham,When can ü come out?
3898,ham,No. Thank you. You've been wonderful


In [8]:
Testing.head()

Unnamed: 0,Label,SMS
958,ham,Welp apparently he retired
2498,ham,Dai what this da.. Can i send my resume to thi...
4259,ham,I am late. I will be there at
4517,spam,Congrats! 2 mobile 3G Videophones R yours. cal...
5392,ham,Ooooooh I forgot to tell u I can get on yovill...


In [9]:
print("Number of messages in the Training Set")
Training.shape[0]

Number of messages in the Training Set


4458

In [10]:
print("Number of messages in the Testing Set")
Testing.shape[0]

Number of messages in the Testing Set


1114

The numbers of messages in the *Training* and *Testing* sets are as expected.
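Sampling 80% of the rows and then dropping the sampled index labels should leave two parts that do not overlap. The sketch below checks this on a small made-up frame (`toy`, `train` and `test` are illustrative names, and the messages are invented):

```python
import pandas as pd

# Toy frame standing in for the SMS data (contents are made up).
toy = pd.DataFrame({"Label": ["ham"] * 8 + ["spam"] * 2,
                    "SMS": [f"message {i}" for i in range(10)]})

shuffled = toy.sample(frac=1, random_state=1)       # shuffle all rows
train = shuffled.sample(frac=0.8, random_state=1)   # 80% of the rows
test = shuffled.drop(train.index)                   # the remaining 20%

# The two parts share no index label and together cover every row.
print(set(train.index).isdisjoint(test.index))
print(len(train), len(test))
```

Because `drop` works on index labels, this only behaves as intended while the labels are unique, which is the case for a freshly read dataset.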

In [11]:
print("Training Set")
Training['Label'].value_counts(normalize=True)

Training Set


ham     0.866756
spam    0.133244
Name: Label, dtype: float64

In [12]:
print("Testing Set")
Testing['Label'].value_counts(normalize=True)

Testing Set


ham     0.862657
spam    0.137343
Name: Label, dtype: float64

As the output above shows, both the *Training Set* and the *Testing Set* contain about *86-87%* *ham* messages and *13-14%* *spam* messages, in line with the full dataset.

## 3. Cleaning the Dataset

Before proceeding with our analysis, we need to clean our dataset. That is:

 - We need to remove all punctuation from the *SMS* column.
 - We need to convert every letter in the *SMS* column to lower case.
 
We achieve this with a *regex* pattern and the *Series.str.replace()* and *Series.str.lower()* methods.

In [13]:
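# \W matches any non-word character (anything other than letters, digits and underscore)
pattern = r"\W"
# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
Training['SMS'] = Training['SMS'].str.replace(pattern, ' ', regex=True)
Training['SMS'] = Training['SMS'].str.lower()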
Training.head()

Unnamed: 0,Label,SMS
3404,ham,good night my dear sleepwell amp take care
4781,ham,sen told that he is going to join his uncle fi...
484,ham,thank you baby i cant wait to taste the real ...
502,ham,when can ü come out
3898,ham,no thank you you ve been wonderful


We have cleaned the *Training* dataset.
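To see what the cleaning does to a single message, here is a small sketch using the standard `re` module on one of the training messages shown earlier. The helper name `clean` is hypothetical and only used for illustration:

```python
import re

def clean(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r"\W", " ", text).lower()

raw = "Good night my dear.. Sleepwell&amp;Take care"
print(clean(raw).split())
# ['good', 'night', 'my', 'dear', 'sleepwell', 'amp', 'take', 'care']
```

Note how the HTML entity `&amp;` leaves the stray token `amp` behind, since only the `&` and `;` are replaced.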

## 4. Creating the Vocabulary

In this section we will create a list of all the unique words that occur in the messages of our *Training Set*. We do this by:

 - Transforming each message in the SMS column into a list of words using the Series.str.split() method
 - Iterating over the resulting lists and appending each word to an initially empty list, `vocabulary`
 - Transforming the `vocabulary` list into a set in order to remove the duplicates
 - Converting the `vocabulary` set back into a list

In [14]:
SMS_lst = Training["SMS"].str.split()
vocabulary = []
for row in SMS_lst:
    for value in row:
        vocabulary.append(value)

vocabulary = set(vocabulary)
vocabulary = list(vocabulary)    
vocabulary

['picking',
 'badrith',
 'com1win150ppmx3age16subscription',
 '7250',
 'wit',
 'survey',
 'rcb',
 'announced',
 'another',
 'mre',
 'trips',
 'escape',
 'clear',
 'rang',
 'woke',
 '2docd',
 'pen',
 'pushbutton',
 'dependents',
 'join',
 'prompts',
 'said',
 'catching',
 '09050001808',
 'meanwhile',
 '2stoptx',
 'five',
 'will',
 'collecting',
 'tsandcs',
 'motor',
 'studdying',
 'promptly',
 'celebrated',
 '150p16',
 'fix',
 'urmom',
 'regretted',
 'sarcastic',
 'paid',
 'bang',
 'woulda',
 'yetty',
 'not',
 'tells',
 '0quit',
 'ignorant',
 'trusting',
 'thecd',
 'dresser',
 'macedonia',
 'rebooting',
 'major',
 'jeans',
 'm60',
 'breath',
 'outsider',
 'partnership',
 'vale',
 'millions',
 'rightly',
 'smoothly',
 'machi',
 'come',
 'vic',
 '07781482378',
 '30ish',
 'cuck',
 'ibh',
 'blogging',
 'contention',
 'hypotheticalhuagauahahuagahyuhagga',
 'videosound',
 'spaces',
 'dedicated',
 'turned',
 'nalla',
 'leo',
 'who',
 'expert',
 'mila',
 'welp',
 'cardiff',
 'tke',
 'luxury',
 

In [15]:
len(vocabulary)

7712

We have achieved our objective of creating a list of all the unique words occurring in the messages of our training set. The list contains *7712* unique words.
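The same idea can be written more compactly with a set comprehension. This is a sketch on two of the cleaned training messages, not a replacement for the cell above; `messages` and `vocab` are illustrative names:

```python
# Two cleaned messages taken from the training sample shown earlier.
messages = ["when can ü come out", "no thank you you ve been wonderful"]

# A set keeps each word once; sorting gives a stable, readable order.
vocab = sorted({word for sms in messages for word in sms.split()})
print(len(vocab))
```

Sorting is optional, but it makes the order of the words repeatable between runs, which a plain `list(set(...))` does not guarantee.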

## 5. Building the Final Training Set

In this section we will use the *vocabulary* list created in the previous section to transform the data. First, we create a dictionary called *word_counts_per_sms*, whose *keys* are the *unique words* and whose *values* are lists holding the *number of times each word appears in each SMS*. We then convert this dictionary into a dataframe and concatenate it with the *Training* dataset.

In [16]:
word_counts_per_sms = {unique_word: [0] * len(Training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(Training['SMS']):
    for word in sms.split():
        word_counts_per_sms[word][index] += 1   

Let us transform the dictionary into a dataframe for easier analysis.

In [17]:
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The above dataframe has *7712* columns, exactly the number of *unique words* in the *vocabulary* list. Next, we concatenate the two dataframes *Training* and *word_counts_per_sms* so that the *Label* and *SMS* columns sit alongside the word counts. Because `pd.concat(..., axis=1)` aligns rows by index label, and *Training* still carries the shuffled labels of the original dataset while *word_counts_per_sms* is numbered from 0, we reset the index of *Training* first.

In [18]:
# Reset the shuffled index so that each message lines up with its own row of counts
Final_Set = pd.concat([Training.reset_index(drop=True), word_counts_per_sms], axis=1)
Final_Set.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,0121,01223585236,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,ham,good night my dear sleepwell amp take care,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,sen told that he is going to join his uncle fi...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,thank you baby i cant wait to taste the real ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,when can ü come out,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,ham,no thank you you ve been wonderful,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
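When concatenating column-wise, pandas matches rows by index label rather than by position. The sketch below, on a made-up two-message frame with non-sequential labels, shows how this can introduce extra rows and missing values, and how resetting the index avoids it. All names here are illustrative:

```python
import pandas as pd

# Toy training frame with non-sequential index labels, as left by sampling.
train = pd.DataFrame({"SMS": ["free prize free", "see you"]}, index=[7, 3])
vocab = sorted({w for sms in train["SMS"] for w in sms.split()})

# One list of per-message counts for every word in the vocabulary.
counts = {w: [0] * len(train) for w in vocab}
for i, sms in enumerate(train["SMS"]):
    for w in sms.split():
        counts[w][i] += 1
counts = pd.DataFrame(counts)            # index labels are 0 and 1

# Labels 7, 3 and 0, 1 do not match, so the union of labels is used.
misaligned = pd.concat([train, counts], axis=1)
# Resetting the index makes both frames share the labels 0 and 1.
aligned = pd.concat([train.reset_index(drop=True), counts], axis=1)
print(misaligned.shape, aligned.shape)
```

A quick symptom of misalignment is integer counts turning into floats, because the missing entries are filled with `NaN`.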


## 6. Calculating the Constants

<a id="5"></a>
The **Naive Bayes algorithm** for the **spam filter** is given below:

$P(Spam|w_1, w_2,......,w_n) \propto P(Spam) \cdot \prod_{i=1}^n P(w_i|Spam)\;\;\;\;\;\;\;\;\;\;\;$(1)

$P(Ham|w_1, w_2,......,w_n) \propto P(Ham) \cdot \prod_{i=1}^n P(w_i|Ham)\;\;\;\;\;\;\;\;\;\;\;\;\;$(2)


To evaluate equations (1) and (2), we first need to calculate the following, where $\alpha$ is the smoothing parameter:


$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}\;\;\;\;\;\;\;\;\;$(3)


$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}\;\;\;\;\;\;\;\;\;\;$(4)
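As a worked illustration of equation (3), the sketch below plugs in the totals reported elsewhere in this notebook ($N_{Spam} = 15142$, $N_{Vocabulary} = 7712$). The function name `smoothed_probability` is hypothetical:

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Equations (3)/(4): additive smoothing of a word's class-conditional probability."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# A word that never appears in spam still gets a small non-zero probability.
p_unseen = smoothed_probability(0, 15142, 7712)
# A word counted 13 times in spam gets a proportionally larger one.
p_seen = smoothed_probability(13, 15142, 7712)
print(p_unseen, p_seen)
```

Without the $\alpha$ term, a single unseen word would set the whole product in equation (1) or (2) to zero.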



In this section we are going to calculate the following constants: 
 
 - $P(Spam)$ and $P(Ham)$
 - $N_{Spam}$ - Total number of words in all the spam messages.
 - $N_{Ham}$ - Total number of words in all the ham messages.
 - $N_{Vocabulary}$ - Total number of unique words in the vocabulary built from the training messages.
 
Let us first calculate the probability of Spam and Ham messages in the *Training* dataset.

In [19]:
print("Probability of Spam and Ham messages")
round(Training['Label'].value_counts(normalize=True), 2)

Probability of Spam and Ham messages


ham     0.87
spam    0.13
Name: Label, dtype: float64

*87%* of the messages are *Ham* and *13%* are *Spam*. Let us calculate the number of words in the *Spam* and *Ham* messages.

In [20]:
spam = Training[Training['Label']=='spam']
spam.head()

Unnamed: 0,Label,SMS
2124,spam,123 congratulations in this week s competit...
368,spam,here is your discount code rp176781 to stop f...
2420,spam,sms services for your inclusive text credits p...
5482,spam,urgent we are trying to contact you last weeke...
907,spam,all the lastest from stereophonics marley di...


In [21]:
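# Collapse runs of whitespace (regex=True is needed on pandas 2.0+), then count the words per message
n_spam = spam['SMS'].str.strip().str.replace(r"\s+", " ", regex=True).str.split(" ").apply(len)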
n_spam.head(10)

2124    29
368     19
2420    25
5482    27
907     26
517     32
487     24
766     24
3468    27
5       36
Name: SMS, dtype: int64

In [22]:
print("Total number of words in spam messages is")
N_Spam = n_spam.sum()
N_Spam

Total number of words in spam messages is


15142

In [23]:
ham = Training[Training['Label']=='ham']
ham.head()

Unnamed: 0,Label,SMS
3404,ham,good night my dear sleepwell amp take care
4781,ham,sen told that he is going to join his uncle fi...
484,ham,thank you baby i cant wait to taste the real ...
502,ham,when can ü come out
3898,ham,no thank you you ve been wonderful


In [24]:
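# Collapse runs of whitespace (regex=True is needed on pandas 2.0+), then count the words per message
n_ham = ham['SMS'].str.strip().str.replace(r"\s+", " ", regex=True).str.split(" ").apply(len)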
n_ham.head(10)

3404     8
4781    13
484     11
502      5
3898     7
96       6
2177     6
2841    21
993     16
3590    28
Name: SMS, dtype: int64

In [25]:
print("Total number of words in ham messages is")
N_Ham = n_ham.sum()
N_Ham

Total number of words in ham messages is


57142

In [26]:
print("Total number of unique words in all the messages is")
N_Vocabulary = len(vocabulary)
N_Vocabulary

Total number of unique words in all the messages is


7712

In [27]:
alpha = 1

We have calculated the required constants and also initialized the smoothing parameter *alpha*. With *alpha = 1*, this is Laplace smoothing.

## 7. Calculating the Parameters

In the previous section we calculated the *constants*. In this section we are going to calculate the *parameters*, i.e. $P({w_i|Spam})$ and $P({w_i|Ham})$. 

Let us first initialize two dictionaries, one for *Spam* and one for *Ham* messages, with the words in *vocabulary* as *dictionary keys* and *0* as *dictionary values*. 

In [28]:
dict_spam = {}
for values in vocabulary:
    dict_spam[values] = 0
dict_spam

{'picking': 0,
 '500': 0,
 'mate': 0,
 'badrith': 0,
 'com1win150ppmx3age16subscription': 0,
 '7250': 0,
 'hey': 0,
 'portege': 0,
 'accent': 0,
 'wit': 0,
 'warwick': 0,
 'benefits': 0,
 'survey': 0,
 'bob': 0,
 'announced': 0,
 'dialogue': 0,
 'another': 0,
 'mre': 0,
 'woo': 0,
 'karnan': 0,
 'trips': 0,
 'escape': 0,
 'clear': 0,
 'rang': 0,
 'woke': 0,
 '2docd': 0,
 'pen': 0,
 'load': 0,
 'treats': 0,
 'dependents': 0,
 'cos': 0,
 'join': 0,
 'words': 0,
 'detroit': 0,
 'prompts': 0,
 'said': 0,
 'jeevithathile': 0,
 'catching': 0,
 'keris': 0,
 '2309': 0,
 'zogtorius': 0,
 'ultimately': 0,
 '09050001808': 0,
 'meanwhile': 0,
 '2stoptx': 0,
 'locaxx': 0,
 'five': 0,
 'nasty': 0,
 'collecting': 0,
 'tsandcs': 0,
 'prasad': 0,
 'motor': 0,
 'clas': 0,
 'jia': 0,
 'sacrifice': 0,
 'janinexx': 0,
 'studdying': 0,
 'folks': 0,
 'camry': 0,
 'friendships': 0,
 'promptly': 0,
 'abbey': 0,
 'celebrated': 0,
 'daddy': 0,
 '150p16': 0,
 'culdnt': 0,
 'sumfing': 0,
 'ctxt': 0,
 'brighten': 0

In [29]:
dict_ham = {}
for values in vocabulary:
    dict_ham[values] = 0
dict_ham

{'picking': 0,
 '500': 0,
 'mate': 0,
 'badrith': 0,
 'com1win150ppmx3age16subscription': 0,
 '7250': 0,
 'hey': 0,
 'portege': 0,
 'accent': 0,
 'wit': 0,
 'warwick': 0,
 'benefits': 0,
 'survey': 0,
 'bob': 0,
 'announced': 0,
 'dialogue': 0,
 'another': 0,
 'mre': 0,
 'woo': 0,
 'karnan': 0,
 'trips': 0,
 'escape': 0,
 'clear': 0,
 'rang': 0,
 'woke': 0,
 '2docd': 0,
 'pen': 0,
 'load': 0,
 'treats': 0,
 'dependents': 0,
 'cos': 0,
 'join': 0,
 'words': 0,
 'detroit': 0,
 'prompts': 0,
 'said': 0,
 'jeevithathile': 0,
 'catching': 0,
 'keris': 0,
 '2309': 0,
 'zogtorius': 0,
 'ultimately': 0,
 '09050001808': 0,
 'meanwhile': 0,
 '2stoptx': 0,
 'locaxx': 0,
 'five': 0,
 'nasty': 0,
 'collecting': 0,
 'tsandcs': 0,
 'prasad': 0,
 'motor': 0,
 'clas': 0,
 'jia': 0,
 'sacrifice': 0,
 'janinexx': 0,
 'studdying': 0,
 'folks': 0,
 'camry': 0,
 'friendships': 0,
 'promptly': 0,
 'abbey': 0,
 'celebrated': 0,
 'daddy': 0,
 '150p16': 0,
 'culdnt': 0,
 'sumfing': 0,
 'ctxt': 0,
 'brighten': 0

From equations (3) & (4) in [Section 6](#5), we need $N_{w_i|Spam}$ (the number of times the word $w_i$ occurs in all the Spam messages) and $N_{w_i|Ham}$ (the number of times the word $w_i$ occurs in all the Ham messages). We go through the *spam* and *ham* messages and update the count of each vocabulary word. Note that the cells below add 1 for every message that contains the word, so a word repeated within a single message is counted once for that message.

In [30]:
# Split each message once instead of once per vocabulary word
for row in spam["SMS"].str.split():
    # set(row) makes each message contribute at most 1 to the count of a word
    for word in set(row):
        dict_spam[word] += 1
        
dict_spam

{'picking': 0,
 '500': 34,
 'mate': 0,
 'badrith': 0,
 'com1win150ppmx3age16subscription': 1,
 '7250': 1,
 'hey': 4,
 'portege': 0,
 'accent': 0,
 'wit': 0,
 'warwick': 0,
 'benefits': 1,
 'survey': 1,
 'bob': 1,
 'announced': 0,
 'dialogue': 0,
 'another': 2,
 'mre': 1,
 'woo': 0,
 'karnan': 0,
 'trips': 0,
 'escape': 0,
 'clear': 0,
 'rang': 0,
 'woke': 0,
 '2docd': 0,
 'pen': 0,
 'load': 0,
 'treats': 0,
 'dependents': 0,
 'cos': 0,
 'join': 13,
 'words': 0,
 'detroit': 1,
 'prompts': 1,
 'said': 0,
 'jeevithathile': 0,
 'catching': 0,
 'keris': 0,
 '2309': 1,
 'zogtorius': 0,
 'ultimately': 0,
 '09050001808': 2,
 'meanwhile': 0,
 '2stoptx': 1,
 'locaxx': 0,
 'five': 2,
 'nasty': 1,
 'collecting': 0,
 'tsandcs': 1,
 'prasad': 0,
 'motor': 0,
 'clas': 0,
 'jia': 0,
 'sacrifice': 0,
 'janinexx': 1,
 'studdying': 0,
 'folks': 1,
 'camry': 0,
 'friendships': 0,
 'promptly': 0,
 'abbey': 0,
 'celebrated': 0,
 'daddy': 0,
 '150p16': 2,
 'culdnt': 0,
 'sumfing': 0,
 'ctxt': 1,
 'brighten':

In [31]:
# Split each message once instead of once per vocabulary word
for row in ham["SMS"].str.split():
    # set(row) makes each message contribute at most 1 to the count of a word
    for word in set(row):
        dict_ham[word] += 1
        
dict_ham

{'picking': 7,
 '500': 0,
 'mate': 9,
 'badrith': 1,
 'com1win150ppmx3age16subscription': 0,
 '7250': 0,
 'hey': 83,
 'portege': 1,
 'accent': 1,
 'wit': 10,
 'warwick': 1,
 'benefits': 1,
 'survey': 1,
 'bob': 0,
 'announced': 1,
 'dialogue': 1,
 'another': 30,
 'mre': 0,
 'woo': 1,
 'karnan': 1,
 'trips': 1,
 'escape': 4,
 'clear': 2,
 'rang': 1,
 'woke': 5,
 '2docd': 1,
 'pen': 2,
 'load': 1,
 'treats': 1,
 'dependents': 1,
 'cos': 58,
 'join': 9,
 'words': 17,
 'detroit': 1,
 'prompts': 0,
 'said': 65,
 'jeevithathile': 1,
 'catching': 2,
 'keris': 1,
 '2309': 0,
 'zogtorius': 1,
 'ultimately': 1,
 '09050001808': 0,
 'meanwhile': 2,
 '2stoptx': 0,
 'locaxx': 1,
 'five': 3,
 'nasty': 1,
 'collecting': 3,
 'tsandcs': 0,
 'prasad': 1,
 'motor': 1,
 'clas': 1,
 'jia': 2,
 'sacrifice': 1,
 'janinexx': 0,
 'studdying': 1,
 'folks': 0,
 'camry': 1,
 'friendships': 1,
 'promptly': 1,
 'abbey': 1,
 'celebrated': 1,
 'daddy': 8,
 '150p16': 0,
 'culdnt': 1,
 'sumfing': 1,
 'ctxt': 0,
 'bright

Finally, we calculate $P({w_i|Spam})$ and $P({w_i|Ham})$ by iterating over each word in the vocabulary.

In [32]:
Parameter_Spam = {}
for word in vocabulary:
    Parameter_Spam[word] = (dict_spam[word] + alpha) / (N_Spam + (alpha * N_Vocabulary))
    Parameter_Spam[word] = '{:.3e}'.format(Parameter_Spam[word])
for key, value in Parameter_Spam.items():
    Parameter_Spam[key] = float(value)
print(Parameter_Spam)



In [33]:
Parameter_Ham = {}
for word in vocabulary:
    Parameter_Ham[word] = (dict_ham[word] + alpha) / (N_Ham + (alpha * N_Vocabulary))
    Parameter_Ham[word] = "{:.3e}".format(Parameter_Ham[word])
for key, value in Parameter_Ham.items():
    Parameter_Ham[key] = float(value)
print(Parameter_Ham)



We have successfully calculated the *constants* and *parameters* beforehand. Computing them before any new message arrives is what makes the *Naive Bayes algorithm* fast at classification time: most of the computation is already done, so classifying a new message only requires looking up and multiplying the stored values.
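Equation (3) defines $N_{w_i|Spam}$ as the number of times a word occurs across all spam messages. The sketch below, on two made-up messages, contrasts that definition with counting the number of messages that contain the word; the two differ whenever a word repeats within a message. It uses `collections.Counter` from the standard library, and all variable names are illustrative:

```python
from collections import Counter

# Made-up, already cleaned spam messages.
spam_messages = ["free prize free entry", "win a prize"]

occurrences = Counter()    # every occurrence of a word
message_hits = Counter()   # number of messages containing the word
for sms in spam_messages:
    words = sms.split()
    occurrences.update(words)
    message_hits.update(set(words))

print(occurrences["free"], message_hits["free"])
print(occurrences["prize"], message_hits["prize"])
```

Which of the two counts is used changes the numerators in equations (3) and (4), so it is worth deciding deliberately.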

## 8. Classifying a New Message

In this section we are going to build a *spam filter* using the *Naive Bayes algorithm* given in equations (1) & (2) of [Section 6](#5).

The *spam filter* we are going to build will be called *classify*, a function that:

 - Takes a new message as input, presumably a string. After data cleaning we end up with the list of words in the message, i.e. $[w_1,w_2,...w_n]$.
 
 
 - Calculates $P(Spam|w_1,w_2,...w_n)$ and $P(Ham|w_1,w_2,...w_n)$.
 For this we build two functions, *P_Spam_message* and *P_Ham_message*, which take the above list as input and return a value proportional to the corresponding probability. 
 
 
 - Compares the values of $P(Spam|w_1,w_2,...w_n)$ i.e. (P_Spam_message) and $P(Ham|w_1,w_2,...w_n)$ i.e. (P_Ham_message)
     - If $P(Ham|w_1,w_2,...w_n)$ > $P(Spam|w_1,w_2,...w_n)$, then the message is classified as *Ham*.
     - If $P(Ham|w_1,w_2,...w_n)$ < $P(Spam|w_1,w_2,...w_n)$, then the message is classified as *Spam*.
     - If $P(Ham|w_1,w_2,...w_n)$ = $P(Spam|w_1,w_2,...w_n)$, then the algorithm may request human help.
     
Let us first create *P_Spam_message* and *P_Ham_message* functions.

In [34]:
def P_Spam_message(message):
    likelihood = 1.0
    for word in message:
        # Words that are not in the training vocabulary are skipped
        if word in Parameter_Spam:
            likelihood = likelihood * Parameter_Spam[word]
    # Multiply by the prior P(Spam), rounded to 0.13 (see In [19])
    return likelihood * 0.13

def P_Ham_message(message):
    likelihood = 1.0
    for word in message:
        # Words that are not in the training vocabulary are skipped
        if word in Parameter_Ham:
            likelihood = likelihood * Parameter_Ham[word]
    # Multiply by the prior P(Ham), rounded to 0.87 (see In [19])
    return likelihood * 0.87

Let us build the main function now, *classify*. 

In [35]:
def classify(message):

    # Clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Evaluate each score once and reuse it
    p_spam = P_Spam_message(message)
    p_ham = P_Ham_message(message)

    print('P(Spam|message):', p_spam)
    print('P(Ham|message):', p_ham)

    if p_ham > p_spam:
        print('Label: Ham')
    elif p_ham < p_spam:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let us check our function with two messages.

In [36]:
message = 'WINNER!! This is the secret code to unlock the money: C3421.'
classify(message)

P(Spam|message): 3.677062352722708e-26
P(Ham|message): 8.60053171370046e-28
Label: Spam


In [37]:
message = "Sounds good, Tom, then see u there"
classify(message)

P(Spam|message): 8.107224427538784e-26
P(Ham|message): 2.4040998007721775e-21
Label: Ham


Our function classified both messages correctly. 
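The scores printed above are already as small as 1e-28 for messages of only a few words, and a product of many such factors can eventually round to zero in floating point. A common variation, not used in this notebook, is to add logarithms instead of multiplying probabilities. The sketch below uses hypothetical parameters for a three-word vocabulary; `param_spam`, `param_ham`, `log_score` and `classify_log` are illustrative names:

```python
import math

# Hypothetical smoothed parameters for a three-word vocabulary.
param_spam = {"win": 3e-3, "money": 2e-3, "see": 1e-4}
param_ham = {"win": 2e-4, "money": 3e-4, "see": 4e-3}
p_spam, p_ham = 0.13, 0.87

def log_score(words, prior, params):
    """log(prior) plus the sum of log P(w|class), skipping unknown words."""
    return math.log(prior) + sum(math.log(params[w]) for w in words if w in params)

def classify_log(words):
    spam = log_score(words, p_spam, param_spam)
    ham = log_score(words, p_ham, param_ham)
    if ham > spam:
        return "ham"
    if spam > ham:
        return "spam"
    return "needs human classification"

# Multiplying 100 small factors underflows to 0.0; summing their logs does not.
product = 1.0
for _ in range(100):
    product *= 1e-5
print(product, 100 * math.log(1e-5))
print(classify_log(["win", "money"]), classify_log(["see", "you"]))
```

Since the logarithm is monotonic, comparing the log scores gives the same label as comparing the products whenever the products are still representable.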

## 9. Measuring the Spam Filter's Accuracy 

In this section we are going to measure the accuracy of our spam filter. For this purpose we use our *test dataset*: the algorithm outputs a classification label for every message in the test set, which we can then compare with the actual label.

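Here we modify the *classify()* function to use *return* statements instead of *print*, and call the new function *classify_test_set()*. Note that this version starts both products at 1 rather than multiplying by $P(Spam)$ and $P(Ham)$, so it compares the likelihood terms only.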

In [38]:
def classify_test_set(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = 1
    p_ham_given_message = 1
    
    for word in message:
        if word in Parameter_Spam:
            p_spam_given_message *= Parameter_Spam[word]
            
        if word in Parameter_Ham:
            p_ham_given_message *= Parameter_Ham[word]
            
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Let us apply the above function to the *test dataset* and create a new column called *predicted* in order to store the predicted labels for comparison.

In [39]:
Testing['predicted'] = Testing['SMS'].apply(classify_test_set)
Testing.head()

Unnamed: 0,Label,SMS,predicted
958,ham,Welp apparently he retired,ham
2498,ham,Dai what this da.. Can i send my resume to thi...,ham
4259,ham,I am late. I will be there at,ham
4517,spam,Congrats! 2 mobile 3G Videophones R yours. cal...,spam
5392,ham,Ooooooh I forgot to tell u I can get on yovill...,ham


Now we can compare the predicted labels with the actual labels in order to measure how good our spam filter is. We do this using the following measure.

$$ Accuracy = \frac{number\;of\;correctly\;classified\;messages}{total\;number\;of\;classified\;messages}$$

In [40]:
correct = 0
total = 0

for index, row in Testing.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
    total += 1
# Derive the number of misclassified messages from the counters instead of a hard-coded 1114
print("incorrect", ":", total - correct)
    
Accuracy = (correct / total) * 100
print("Accuracy", ":", round(Accuracy, 2),"%")

incorrect : 18
Accuracy : 98.38 %


The accuracy of our spam filter on the test set is *98.38%* (*18* of *1,114* messages misclassified), which is a strong result.
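The same accuracy measure can be computed without an explicit loop by comparing the two columns directly, and `pd.crosstab` additionally shows which kind of mistake was made. The sketch below uses a small made-up frame rather than the real test set, so its numbers are not the notebook's results:

```python
import pandas as pd

# Made-up labels and predictions, for illustration only.
results = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "spam", "ham"],
    "predicted": ["ham", "spam", "spam", "ham", "ham"],
})

# The mean of a boolean Series is the fraction of True values.
accuracy = (results["Label"] == results["predicted"]).mean() * 100
# Rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(results["Label"], results["predicted"])
print(round(accuracy, 2))
print(confusion)
```

With a dataset that is roughly 87% ham, the cross-tabulation is a useful complement to accuracy, since it separates spam that slipped through from ham that was wrongly flagged.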

## Conclusion

In this project our objective was to build a *spam filter* using the *multinomial Naive Bayes algorithm*. For this purpose we used a dataset called *SMSSpamCollection*, containing *5572* SMS messages classified by humans. Among these, *~86.59%* are *Ham* messages and *~13.41%* are *Spam* messages. 

**We performed the following operations on our dataset:**

  * To avoid bias in evaluating our spam filter, we split the dataset into a *Training set* containing *80%* of the data, i.e. *4458* SMSs, and a *Testing set* containing *20%* of the data, i.e. *1114* SMSs.
    - We found that the percentages of *Ham* and *Spam* messages in the *Training* and *Testing* datasets closely match those of the initial dataset.
  * We cleaned the data by removing all punctuation from the *SMS* column and converting every letter to lower case for the *Training* dataset.
  * We created a list containing all the unique words from the *SMS* column of the *Training* dataset. We also created a dataframe with the unique words as column names and concatenated it with the *Training* set.
  * We calculated the *constants* $P(Spam)$, $P(Ham)$, $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$ and the *parameters* $P(w_i|Spam)$, $P(w_i|Ham)$ required by the *Naive Bayes algorithm* from the *Training* dataset.  
  * We built the *spam filter*, a function called *classify()*, using the *Naive Bayes algorithm*. This function classifies a given message as *Spam* or *Ham*, or asks for human intervention if the two probabilities are equal. 
  
**Our observation:**

  We applied our function to the *Testing* dataset and calculated the accuracy of our *spam filter*. The accuracy is *98.38%*, which is a strong result for a *spam filter* of this simplicity.