# Building a Spam Filter with Naive Bayes

In this project we'll build a spam filter that classifies SMS messages as spam or non-spam. We'll use the SMSSpamCollection dataset. Our first task is to teach the computer how to classify messages using the Naive Bayes algorithm.
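
As a brief sketch of the rule the classifier relies on (the notation is ours and is only meant as orientation): for a message made of the words w_1, ..., w_n, Naive Bayes treats the words as conditionally independent given the class and compares the two quantities below, picking the class with the larger value.

```latex
\begin{align}
P(\text{Spam} \mid w_1, \ldots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \ldots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{align}
```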

In [2]:
import pandas as pd
sms= pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.shape  # Number of rows and columns

(5572, 2)

In [4]:
sms['Label'].value_counts(normalize=True)*100  # Percentage of messages that are ham vs. spam ("ham" means non-spam).

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [5]:
# Shuffle the whole dataset (fixed random_state for reproducibility)
sms_sample = sms.sample(frac=1, random_state=1)

# Calculate index for split
index = round(len(sms_sample) * 0.8)

# Training/Test split and reset index.
sms_training = sms_sample[:index].reset_index(drop=True)
sms_test = sms_sample[index:].reset_index(drop=True)

print(sms_training.shape)
print(sms_test.shape)

(4458, 2)
(1114, 2)


In [6]:
sms_training['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [7]:
sms_test['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The percentages of spam and ham in both the training set and the test set are similar to those in the full dataset.

In [8]:
sms_training.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [9]:
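# Replace every non-word character (punctuation) in the SMS column with a space.
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default.
sms_training['SMS'] = sms_training['SMS'].str.replace(r'\W', ' ', regex=True)

# Lowercase every message.
sms_training['SMS'] = sms_training['SMS'].str.lower()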

sms_training['SMS'].head() # After cleaning

0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object

## Creating the Vocabulary

In [10]:
sms_training['SMS']= sms_training['SMS'].str.split()
sms_training['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object
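
The cleaning and tokenizing steps above can be sketched on a single string with the standard library. The helper name `clean_message` is hypothetical (it does not appear in the notebook); it only mirrors the replace, lowercase, and split steps applied to the `SMS` column.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split into tokens."""
    text = re.sub(r'\W', ' ', text)
    return text.lower().split()

print(clean_message("Yep, by the pretty sculpture"))
# ['yep', 'by', 'the', 'pretty', 'sculpture']
```

Note that an apostrophe counts as a non-word character, so a contraction such as "There's" is split into two tokens, `there` and `s`, as seen in row 4 above.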

In [11]:
vocabulary= []
for msg in sms_training['SMS']:
    for word in msg:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary)) # Set to remove duplicate words.
vocabulary[:5]

['machi', '08704439680ts', '09066364311', 'whom', 'gibe']

In [12]:
len(vocabulary)

7783

In [13]:
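# Build a dictionary that maps each vocabulary word to a list of per-message counts.
word_counts_per_sms = {unique_word: [0] * len(sms_training['SMS']) for unique_word in vocabulary}

# The loop variable is named `message` so that it does not overwrite the `sms` DataFrame.
for index, message in enumerate(sms_training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1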

In [14]:
word_count = pd.DataFrame(word_counts_per_sms) # Transform word_counts_per_sms into a DataFrame.
word_count.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
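
To make the layout of this table concrete, here is a minimal sketch on three made-up tokenized messages. It uses `collections.Counter` rather than the nested loop above, which is a swapped-in technique and not what the notebook does; the toy `messages` list is an assumption for illustration only.

```python
from collections import Counter

# Toy tokenized messages (illustrative, not taken from the dataset)
messages = [['free', 'entry', 'free'], ['see', 'u', 'there'], ['free', 'u']]

# Sorted list of unique words across all messages
vocabulary = sorted({word for msg in messages for word in msg})

# One Counter per message, then one list of counts per vocabulary word
counts = [Counter(msg) for msg in messages]
word_counts_per_sms = {word: [c[word] for c in counts] for word in vocabulary}

print(vocabulary)                   # ['entry', 'free', 'see', 'there', 'u']
print(word_counts_per_sms['free'])  # [2, 0, 1]
```

Each key becomes a column and each list position a row, which is the shape `pd.DataFrame` receives in the cell above.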


In [15]:
training_clean= pd.concat([sms_training, word_count], axis=1)
training_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [16]:
# Extracted Spam messages.
spam_msg= training_clean[training_clean['Label']=='spam']
p_spam= len(spam_msg)/len(training_clean)
p_spam # Prob of spam

0.13458950201884254

In [17]:
# Extracted ham messages.
ham_msg= training_clean[training_clean['Label']=='ham']
p_ham= len(ham_msg)/len(training_clean)
p_ham    # Prob of ham

0.8654104979811574

In [18]:
n_words_per_spam_message= spam_msg['SMS'].apply(len)
n_spam= n_words_per_spam_message.sum()
n_spam   #number of words in all the spam messages

15190

In [19]:
n_words_per_ham_message= ham_msg['SMS'].apply(len)
n_ham= n_words_per_ham_message.sum()
n_ham   #number of words in all the ham messages

57237

In [20]:
n_vocabulary= len(vocabulary)
alpha=1

In [21]:
spam_dict = {unique_word: 0 for unique_word in vocabulary}
ham_dict = {unique_word: 0 for unique_word in vocabulary}

In [22]:
spam_msg.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
16,spam,"[freemsg, why, haven, t, you, replied, to, my,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,"[congrats, 2, mobile, 3g, videophones, r, your...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,"[free, message, activate, your, 500, free, tex...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,"[call, from, 08702490080, tells, u, 2, call, 0...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,"[someone, has, conacted, our, dating, service,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
ham_msg.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [24]:
for word in vocabulary:
    n_word_given_spam= spam_msg[word].sum()
    p_word_given_spam= (n_word_given_spam + alpha) / (n_spam + n_vocabulary*alpha)
    spam_dict[word]= p_word_given_spam

    n_word_given_ham= ham_msg[word].sum()
    p_word_given_ham= (n_word_given_ham + alpha) / (n_ham + n_vocabulary*alpha )  
    ham_dict[word]= p_word_given_ham
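
Written out as formulas, the loop above computes the following smoothed estimates (additive smoothing with `alpha`; the symbol names are ours and map to `n_word_given_spam`, `n_spam`, `n_word_given_ham`, `n_ham`, and `n_vocabulary` in the code):

```latex
\begin{align}
P(w \mid \text{Spam}) &= \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\
P(w \mid \text{Ham}) &= \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{align}
```

With the values computed earlier (`alpha = 1`, `n_spam = 15190`, `n_vocabulary = 7783`), a word that never occurs in a spam message gets a probability of 1 / (15190 + 7783) = 1 / 22973, roughly 4.35e-05, instead of zero, so a single unseen word cannot wipe out the whole product.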

## Classifying A New Message

In [25]:
import re

def classify(message):
    
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message= p_spam
    p_ham_given_message= p_ham
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Needs human classification!')

In [26]:
# Eg 1
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [27]:
# Eg 2
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
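
The products printed above are already as small as 1e-27 for short messages, and for much longer messages such products can underflow to zero. A common alternative, shown here only as a sketch and not as what this notebook does, is to add log probabilities instead of multiplying probabilities. The function name `classify_log` and all parameter values below are made up for illustration.

```python
import math

# Toy parameters (illustrative values, not the ones estimated in this notebook)
p_spam, p_ham = 0.2, 0.8
spam_probs = {'free': 0.05, 'win': 0.04, 'see': 0.005}
ham_probs = {'free': 0.005, 'win': 0.002, 'see': 0.03}

def classify_log(tokens):
    """Compare classes using sums of log probabilities instead of products."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in tokens:
        # Words outside the vocabulary are skipped, as in classify()
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log(['free', 'win']))  # spam
print(classify_log(['see']))          # ham
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products are representable.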


### Applying the function to the entire test set

In [28]:
import re
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [29]:
sms_test['predicted'] = sms_test['SMS'].apply(classify_test_set)
sms_test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


## Measuring the Spam Filter's Accuracy

In [30]:
correct= 0
total= len(sms_test)

# iterrows() yields (index, row) pairs; the index is not needed here.
for _, row in sms_test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833




The accuracy is approximately 98.7%, which is really good: of the 1,114 messages in sms_test that we passed to our spam filter, 1,100 were classified correctly.
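
The same accuracy figure can also be computed without an explicit loop. The sketch below uses a small made-up `results` table (a hypothetical stand-in for `sms_test`) and relies on the fact that the mean of a boolean Series is the fraction of `True` values.

```python
import pandas as pd

# Toy results table (illustrative rows, not the notebook's test set)
results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'spam', 'ham'],
})

# Element-wise comparison yields a boolean Series; its mean is the accuracy
is_correct = results['Label'] == results['predicted']
correct = int(is_correct.sum())
accuracy = float(is_correct.mean())

print('Correct:', correct)                   # Correct: 4
print('Incorrect:', len(results) - correct)  # Incorrect: 1
print('Accuracy:', accuracy)                 # Accuracy: 0.8
```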

In [32]:
wrong_classified= sms_test[sms_test['Label'] != sms_test['predicted']]
wrong_classified

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
302,ham,No calls..messages..missed calls,spam
319,ham,We have sent JD for Customer Service cum Accou...,spam
504,spam,Oh my god! I've found your number again! I'm s...,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham



In the table above, the predicted column for index 293 shows "needs human classification"; reading that message, we can confirm that it is not spam. On inspection, several of the other messages seem to be labeled more sensibly by our algorithm than by the existing labels, so for now we can leave these misclassified messages as they are.
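
When reviewing misclassifications like these, it can help to count how often each kind of error occurs. The sketch below does this with `collections.Counter` on made-up (actual, predicted) pairs; the `pairs` list is illustrative and does not reproduce the table above.

```python
from collections import Counter

# Toy (actual, predicted) pairs; illustrative only, not the notebook's results
pairs = [
    ('spam', 'ham'),
    ('ham', 'spam'),
    ('ham', 'spam'),
    ('ham', 'needs human classification'),
    ('ham', 'ham'),
    ('spam', 'spam'),
]

# Count only the pairs where the prediction disagrees with the label
errors = Counter((actual, predicted) for actual, predicted in pairs if actual != predicted)

for (actual, predicted), n in errors.most_common():
    print(f'actual={actual!r} predicted={predicted!r}: {n}')
```

Separating spam predicted as ham from ham predicted as spam matters for a filter, since the two kinds of mistake have different costs for the user.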