# A Naive Bayes Spam Filter

In this project we build a Naive Bayes classifier that sorts SMS messages into spam and non-spam ('ham'). It is a guided project from the Dataquest data science program.

The goal is to build a filter that is at least 80% accurate. The finished filter turned out to be almost 99% accurate on the test set.

The dataset used in this project was put together by Tiago A. Almeida and José María Gómez Hidalgo, and was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

## Exploring the Dataset

In [8]:
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

sms = pd.read_csv('SMSSpamCollection', sep='\t', header = None, names = ['Label', 'SMS'])

In [9]:
sms.shape

(5572, 2)

In [10]:
sms.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [11]:
#what percent is spam and ham ('non-spam')

per_spam_ham = sms.Label.value_counts(normalize = True)*100
print(per_spam_ham)

ham     86.593683
spam    13.406317
Name: Label, dtype: float64


We see that about 87% of the text messages are ham and about 13% are spam.

## Training and Test sets
Before writing the spam filter software, we will split the data into a training set and a test set. 80% of the dataset will be used for training and 20% will be used to test the spam filter.

In [12]:
#randomize the dataset
randomized = sms.copy()
randomized = randomized.sample(frac = 1, random_state = 1)


In [13]:
#calculating index values to use for split
training_index = round(len(randomized)*0.8)

#splitting into training and test
training = randomized[:training_index].reset_index(drop = True)
test = randomized[training_index:].reset_index(drop = True)

print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [14]:
#checking the percentage of spam and ham in the training and test sets

training_ham_spam = training.Label.value_counts(normalize = True)*100
test_ham_spam = test.Label.value_counts(normalize = True)*100

print(training_ham_spam)
print(test_ham_spam)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


The training and test sets have roughly the same percentages of spam and ham as each other and as the full dataset (about 87% ham and 13% spam).

## Cleaning the data
We will need to clean the training data to get the relevant data for the spam filter.

In [15]:
#training dataset before cleaning
training.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


We will want to remove all non-word characters (such as punctuation) from the SMS column and make the text all lowercase, so that 'SECRET' is not treated as a different word from 'secret'.
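To illustrate what this cleaning does to a single message, here is a minimal sketch using Python's `re` module. The helper name `clean_text` and the sample strings are made up for illustration and are not part of the project code.

```python
import re

def clean_text(text):
    """Hypothetical helper: replace non-word characters with spaces, then lowercase."""
    # \W matches any character that is not a letter, digit, or underscore
    return re.sub(r'\W', ' ', text).lower()

# Punctuation becomes whitespace, which split() later discards
print(clean_text("SECRET prize!! Call now...").split())

# After cleaning, differently cased spellings map to the same token
print(clean_text("SECRET!").split() == clean_text("secret").split())
```

Replacing punctuation with a space rather than deleting it keeps words such as "calls..messages" from being glued together into one token.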

In [16]:
#removing non-word characters (raw string and regex=True make the pattern explicit)
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)

#changing to all lowercase
training['SMS'] = training['SMS'].str.lower()

#training set after cleaning
training.head()


|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [17]:
# creating a vocabulary 

#transform the SMS message into a list
training['SMS'] = training['SMS'].str.split()

In [18]:
training.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
| 3 | ham | [havent] |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |


In [19]:
vocabulary = []
#the loop variable is named message so it does not overwrite the sms DataFrame
for message in training['SMS']:
    for word in message:
        vocabulary.append(word)

print(vocabulary[:5])
        

['yep', 'by', 'the', 'pretty', 'sculpture']


In [20]:
#transform the list to a set to remove duplicates, then back to list
vocabulary = list(set(vocabulary))

In [21]:
len(vocabulary)

7783

There are 7,783 unique words in the training dataset vocabulary.

## The final training set
We'll create a dictionary in which each key is a unique word from the vocabulary and each value is a list with one entry per message in the training set, holding the number of times that word appears in each message. For example, if the training set contained only three messages, and 'secret' appeared twice in the first message, once in the second message and not at all in the third message, then the dictionary entry for that word would be 'secret': [2,1,0].
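The three-message example can be reproduced with a small sketch. The toy messages and the `toy_` names below are invented for illustration; the project itself applies the same idea to `training['SMS']`.

```python
# Toy training set of three already-tokenized messages (made-up data)
toy_messages = [
    ['secret', 'prize', 'secret'],
    ['secret', 'code'],
    ['see', 'you', 'soon'],
]

# Unique words across all messages
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of the current message each time a word occurs
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts['secret'])  # [2, 1, 0]
```

Because every list has the same length, a dictionary of this shape can be passed directly to `pd.DataFrame` to get one column per word.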

In [22]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}


In [23]:
#the loop variable is named message so it does not overwrite the sms DataFrame
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [24]:
word_count = pd.DataFrame(word_counts_per_sms)

In [25]:
training_clean = pd.concat([training, word_count], axis = 1)

In [26]:
training_clean.head()

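|   | Label | SMS | edward | pls | removed | weiyi | em | chocolate | evrey | planet | warwick | support | mth | contact | musta | 08700435505150p | palm | wherever | subscribe | snuggles | thinkin | reache | theacusations | fondly | somewhere | ... | somewhat | entry | ls1 | 255 | embarrassed | 3qxj9 | answers | amp | aeronautics | booking | mess | poly | wiv | idea | free | each | wiskey | lecturer | calld | cme | sometime | bruv | 021 | summers | heading |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |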


## Calculating the probability values
The Naive Bayes algorithm will need these probability values to classify the messages:

(1) $ P(Spam|w_{1},w_{2},...,w_{n})\propto P(Spam)\cdot \prod \limits_{i=1}^n P(w_{i}|Spam) $

(2) $ P(Ham|w_{1},w_{2},...,w_{n})\propto P(Ham)\cdot \prod \limits_{i=1}^n P(w_{i}|Ham) $


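We will need $P(Spam)$ and $P(Ham)$, as well as the values of $P(w_{i}|Spam)$ and $P(w_{i}|Ham)$ for every word in the vocabulary:

(3) $ P(w_{i}|Spam) = \frac{N_{w_{i}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$

(4) $ P(w_{i}|Ham) = \frac{N_{w_{i}|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$

Here $N_{w_{i}|Spam}$ is the number of times the word $w_{i}$ appears in spam messages, $N_{Spam}$ is the total number of words in all spam messages (counting repeats), $N_{Vocabulary}$ is the number of unique words in the vocabulary, and $\alpha$ is the smoothing parameter. The ham quantities are defined in the same way.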
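As a worked example of equations (3) and (4), the sketch below plugs in made-up counts. The function name `smoothed_probability` and all numbers are assumptions for illustration only; the arguments stand for the word's count within the class, the total word count of the class, the vocabulary size, and the smoothing parameter.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Equations (3)/(4): additive smoothing of a word's conditional probability."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts: 10 words in all spam messages, vocabulary of 5 unique words
n_spam_toy = 10
n_vocabulary_toy = 5

# A word seen 3 times in spam: (3 + 1) / (10 + 5)
p_seen = smoothed_probability(3, n_spam_toy, n_vocabulary_toy)

# A word never seen in spam still gets a non-zero value: (0 + 1) / (10 + 5)
p_unseen = smoothed_probability(0, n_spam_toy, n_vocabulary_toy)

print(p_seen)
print(p_unseen)
```

The second case shows why the smoothing term matters: without it, a single word that never occurred in spam would force the whole product in equation (1) to zero.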

In [27]:
# dividing the clean training set into spam and ham

spam_messages = training_clean[training_clean['Label'] == 'spam']
ham_messages = training_clean[training_clean['Label']=='ham']


In [28]:
#calculating P(Spam)
p_spam = len(spam_messages)/len(training_clean)
print(p_spam)

0.13458950201884254


In [29]:
#calculating P(Ham)
p_ham = len(ham_messages)/len(training_clean)
print(p_ham)

0.8654104979811574


In [30]:
#Nspam is the number of words in all the spam messages (not unique words)
num_s_words = spam_messages['SMS'].apply(len)
num_spam= num_s_words.sum()

In [31]:
#Nham is the number of words in all the ham messages (not unique words)
num_h_words = ham_messages['SMS'].apply(len)
num_ham = num_h_words.sum()

In [32]:
#Nvocabulary is the number of unique words in the set
num_vocabulary = len(vocabulary)

In [33]:
#Laplace smoothing
alpha = 1

We have calculated the constant terms $P(Ham)$ and $P(Spam)$, as well as $N_{Spam}$, $N_{Ham}$ and $N_{Vocabulary}$. Now we need to find the parameters $P(w_{i}|Spam)$ and $P(w_{i}|Ham)$ using equations (3) and (4) above.

In [34]:
#initializing two dictionaries for parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

#calculating parameter values
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum() #spam_messages defined above
    p_word_given_spam = (n_word_given_spam + alpha)/(num_spam + (alpha*num_vocabulary))
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum() #ham_messages defined above
    p_word_given_ham = (n_word_given_ham + alpha)/(num_ham + (alpha*num_vocabulary))
    parameters_ham[word] = p_word_given_ham
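
The loop above sums one column at a time. As an alternative that is not part of the original project, the same spam parameters could be computed for all words at once with pandas column sums. The sketch below uses a made-up three-message table, and the `toy_` names are hypothetical.

```python
import pandas as pd

# Made-up word-count table in the same layout as training_clean
toy = pd.DataFrame({
    'Label': ['spam', 'ham', 'spam'],
    'secret': [2, 0, 1],
    'see': [0, 1, 0],
    'prize': [1, 0, 0],
})
toy_vocabulary = ['secret', 'see', 'prize']
alpha = 1

toy_spam = toy[toy['Label'] == 'spam']

# Column sums give the count of every word in spam messages at once
word_counts_spam = toy_spam[toy_vocabulary].sum()

# Total number of words in spam messages (every word here is in the vocabulary)
n_spam = word_counts_spam.sum()

# Equation (3) applied to the whole Series, then converted to a dictionary
toy_parameters_spam = (
    (word_counts_spam + alpha) / (n_spam + alpha * len(toy_vocabulary))
).to_dict()

print(toy_parameters_spam)
```

With smoothing applied to every vocabulary word, the resulting values sum to 1, which is a quick sanity check on the calculation.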

In [35]:
import re

def classify(message):
    '''
    Classify a single message and print the result.

    message = a string
    '''
    #same cleaning steps that were applied to the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    #words that are not in the vocabulary are ignored
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [36]:
sample_spam = 'WINNER!! This is the secret code to unlock the money: C3421.'
sample_ham = "Sounds good, Tom, then see u there"

In [37]:
classify(sample_spam)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [38]:
classify(sample_ham)

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
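
The values printed above are already as small as 1e-21 to 1e-27 for short messages. For much longer messages, a product of many small factors can underflow to 0.0 in floating point. A common workaround, which this project does not use, is to add logarithms instead of multiplying probabilities. The sketch below assumes made-up priors and word probabilities, and `classify_log` is a hypothetical name.

```python
import math
import re

# Made-up priors and word probabilities, standing in for p_spam, p_ham,
# parameters_spam and parameters_ham from the project
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'secret': 0.02, 'money': 0.03, 'see': 0.001}
toy_parameters_ham = {'secret': 0.001, 'money': 0.002, 'see': 0.02}

def classify_log(message):
    """Hypothetical variant of classify that compares log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Secret money!!'))
print(classify_log('See you'))
```

Since the logarithm is increasing, comparing the sums gives the same ordering as comparing the products, as long as the products themselves have not underflowed.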


## Testing the spam filter
The filter worked well on the sample text we used above, but we'll see now how well it works with the test dataset.

We'll modify the classify function to return the result rather than print it, and then add the returned classification as a new column in the test dataset.

In [39]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [40]:
#adding new column to the test set

test['predicted'] = test['SMS'].apply(classify_test_set)

In [41]:
test.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


## Measuring accuracy

$ \text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}} $
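
The formula can be checked on a tiny made-up example. The sketch below, with the hypothetical names `toy_test` and `toy_accuracy`, compares the two columns directly instead of looping over rows.

```python
import pandas as pd

# Made-up labels and predictions for four messages
toy_test = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'ham', 'ham'],
})

# Comparing the columns yields True/False per row; the mean of booleans
# is the fraction of True values, which is the accuracy
toy_accuracy = (toy_test['Label'] == toy_test['predicted']).mean()
print(toy_accuracy)  # 0.75
```

Three of the four toy predictions match their labels, so the result is 3/4.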

In [42]:
correct = 0
total = test.shape[0] #number of messages in the test set

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('correct = ', correct)
print('incorrect= ', total - correct)
accuracy = correct/total
print('accuracy = ', accuracy)

correct =  1100
incorrect=  14
accuracy =  0.9874326750448833


The spam filter classified 1,100 messages correctly and 14 messages incorrectly, so it is 98.74% accurate. This is much more accurate than I had anticipated.
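
Accuracy alone does not show which kind of mistake the filter makes. One way to break the errors down, sketched here on made-up label pairs with the hypothetical names `toy_pairs` and `error_counts`, is to tally each (actual, predicted) combination that disagrees.

```python
from collections import Counter

# Made-up (actual, predicted) label pairs
toy_pairs = [
    ('ham', 'ham'), ('ham', 'ham'), ('ham', 'spam'),
    ('spam', 'spam'), ('spam', 'ham'), ('spam', 'ham'),
]

# Keep only the pairs where the prediction differs from the actual label
error_counts = Counter(pair for pair in toy_pairs if pair[0] != pair[1])

print(error_counts[('spam', 'ham')])  # spam that slipped through as ham
print(error_counts[('ham', 'spam')])  # ham wrongly flagged as spam
```

On the project's data, the pairs could be built with `zip(test['Label'], test['predicted'])`.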

### Going a bit further...
Let's take a quick look at the 14 messages that were classified incorrectly to see if anything jumps out as to why.

In [43]:
incorrect = test.loc[test['Label'] != test['predicted']]

In [44]:
print(incorrect)

    Label                                                SMS  \
114  spam  Not heard from U4 a while. Call me now am here...   
135  spam  More people are dogging in your area now. Call...   
152   ham                  Unlimited texts. Limited minutes.   
159   ham                                       26th OF JULY   
284   ham                             Nokia phone is lovly..   
293   ham  A Boy loved a gal. He propsd bt she didnt mind...   
302   ham                   No calls..messages..missed calls   
319   ham  We have sent JD for Customer Service cum Accou...   
504  spam  Oh my god! I've found your number again! I'm s...   
546  spam  Hi babe its Chloe, how r u? I was smashed on s...   
741  spam  0A$NETWORKS allow companies to bill for SMS, s...   
876  spam           RCT' THNQ Adrian for U text. Rgds Vatian   
885  spam                                      2/2 146tf150p   
953  spam  Hello. We need some posh birds and chaps to us...   

                      predicted  
114  

In [47]:
print(incorrect.loc[152]['SMS'])

Unlimited texts. Limited minutes.


Of the 14 mislabeled messages, 8 are actually spam and 6 are actually ham. After reviewing them, nothing jumps out at me as to why they were mislabeled. I'd have to research this more to be able to make an educated guess.

However, I'll note that I was surprised by how accurate the filter is.