# Guided Project: Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

In [1]:
import pandas as pd
import numpy as np

df=pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [2]:
label_perc=df['Label'].value_counts(normalize=True)*100
label_perc

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Spam Filter Testing

To test the spam filter, we're first going to split our dataset into two categories:

A training set, which we'll use to "train" the computer how to classify messages.
A test set, which we'll use to test how good the spam filter is with classifying new messages.

In [3]:
sample=df.sample(frac=1,random_state=1)

training_set=sample.iloc[:4458,:].copy()
training_set.reset_index(inplace=True)

test_set=sample.iloc[4459:,:].copy()
test_set.reset_index(inplace=True)

Now, we'll find the percentage of spam and ham in both the training and the test set to make sure it resembles the original data set distribution

In [4]:
print(training_set['Label'].value_counts(normalize=True)*100)
print('\n')
print(test_set['Label'].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64


ham     86.792453
spam    13.207547
Name: Label, dtype: float64


We can see the spam and non-spam % are OK

Now, we'll clean and re arrange the dataset, so the SMS column is splitted and placed as columns

In [5]:
training_set['modified_SMS']=training_set['SMS'].replace(to_replace ='\W', value = ' ', regex = True).str.lower()
training_set.head()

Unnamed: 0,index,Label,SMS,modified_SMS
0,1078,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture
1,4028,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan
2,958,ham,Welp apparently he retired,welp apparently he retired
3,4642,ham,Havent.,havent
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...


In [6]:
training_set['splitted_SMS']=training_set['modified_SMS'].str.split()
training_set.head()


Unnamed: 0,index,Label,SMS,modified_SMS,splitted_SMS
0,1078,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]"
1,4028,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,..."
2,958,ham,Welp apparently he retired,welp apparently he retired,"[welp, apparently, he, retired]"
3,4642,ham,Havent.,havent,[havent]
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [7]:
vocabulary=[]

for row in training_set['splitted_SMS']:
    for word in row:
        vocabulary.append(word)

print(vocabulary[:5])

['yep', 'by', 'the', 'pretty', 'sculpture']


In [8]:
# for removing list duplicates, we convert it to a set
vocabulary=set(vocabulary)
vocabulary=list(vocabulary)
print(vocabulary[:5])

['stylish', 'valuable', 'arranging', 'o2fwd', '09095350301']


In [9]:
word_counts_per_sms = {unique_word: [0] * len(training_set['splitted_SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['splitted_SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Now, we concatenate the DataFrame we just built above with the DataFrame containing the training set (this way, we'll also have the Label and the SMS columns)

In [10]:
training_concat=pd.concat([training_set,word_counts],axis=1)
training_concat.head()

Unnamed: 0,index,Label,SMS,modified_SMS,splitted_SMS,0,00,000,000pes,008704050406,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,Welp apparently he retired,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,Havent.,havent,[havent],0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter

In [11]:
#P(Spam)
p_spam=training_set['Label'].value_counts(normalize=True)[0]
p_ham=training_set['Label'].value_counts(normalize=True)[1]

print('P(Spam): ',p_spam,'\n', 'P(Ham): ', p_ham)

P(Spam):  0.8654104979811574 
 P(Ham):  0.13458950201884254


In [12]:
# Nspam
spam_words=[]

spam_df=training_concat[training_concat['Label']=='spam']

for row in spam_df['splitted_SMS']:
    for word in row:
        spam_words.append(word)

N_spam=len(spam_words)
print(N_spam)

15190


In [13]:
# N_ham
ham_words=[]

ham_df=training_concat[training_concat['Label']=='ham']

for row in ham_df['splitted_SMS']:
    for word in row:
        ham_words.append(word)

N_ham=len(ham_words)
print(N_ham)

57237


In [14]:
#N_vocabulary
N_vocabulary=len(vocabulary)
print(N_vocabulary)

7783


In [15]:
alpha=1

## Word Probabilities

In [16]:
spam_dict = { word : 0 for word in vocabulary }
ham_dict = { word : 0 for word in vocabulary }

for row in vocabulary:
    spam_dict[row]=spam_df[row].sum()
    
for row in vocabulary:
    ham_dict[row]=ham_df[row].sum()

   

## Now that we've calculated all the constants and parameters we need, we can start creating the spam filter

In [17]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message=p_spam
    for word in message:
        if word in vocabulary:
            p_spam_given_message*=(spam_dict[word]+alpha)/(N_spam+N_vocabulary)
    
    p_ham_given_message=p_ham
    for word in message:
        if word in vocabulary:
            p_ham_given_message*=(ham_dict[word]+alpha)/(N_ham+N_vocabulary)
    
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [18]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 8.66846960586644e-25
P(Ham|message): 3.0121382626111793e-28
Label: Spam


In [19]:
classify('"Sounds good, Tom, then see u there"')

P(Spam|message): 1.567143755316606e-24
P(Ham|message): 5.734884035784196e-22
Label: Ham


On the previous screen, we managed to create a spam filter, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

In [23]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message=p_spam
    for word in message:
        if word in vocabulary:
            p_spam_given_message*=(spam_dict[word]+alpha)/(N_spam+N_vocabulary)
    
    p_ham_given_message=p_ham
    for word in message:
        if word in vocabulary:
            p_ham_given_message*=(ham_dict[word]+alpha)/(N_ham+N_vocabulary)
    
        
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [24]:
correct=0
total=test_set['SMS'].count()
print(total)

1113


In [25]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,index,Label,SMS,predicted
0,3418,ham,But i haf enuff space got like 4 mb...,ham
1,3424,spam,Had your mobile 10 mths? Update to latest Oran...,spam
2,1538,ham,All sounds good. Fingers . Makes it difficult ...,ham
3,5393,ham,"All done, all handed in. Don't know if mega sh...",ham
4,2744,ham,But my family not responding for anything. Now...,ham


In [27]:
for index,row in test_set.iterrows():
    if row['Label']==row['predicted']:
        correct+=1
        
accuracy=correct/total
print(accuracy)

0.9523809523809523
