<h1>Building a spam filter for SMS messages using the Naive Bayes algorithm</h1>

For this project we'll be looking at a dataset from the UCI Machine Learning Repository. The dataset contains SMS messages, each labelled as spam or not spam, which we will use to build a filter that classifies new messages. Our goal is to classify at least 80% of messages correctly. (Note: non-spam messages are called "ham" in this dataset.)

In [1]:
import pandas as pd
import re

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Let's explore the data a little.

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We now need to split the data into a training set and a test set. We'll shuffle the dataset, use a random 80% of the entries to train our algorithm, and keep the remaining 20% as a test set to see how well the algorithm works.

In [3]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


Let's just check and make sure the random samples are representative of our initial data when it comes to the ratio of spam to ham.

In [4]:
training_set['Label'].value_counts(normalize=True)


ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [5]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [6]:
# Before cleaning
training_set.head()

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [7]:
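# After cleaning: replace punctuation with spaces and lower-case the words for easier comparison.
# regex=True is needed in pandas 2.0 and later, where str.replace defaults to a literal match.
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)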
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


Now we need to collect all the unique words in the training set so we can use them as the vocabulary for the naive Bayes algorithm.

In [8]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

# Convert to a set and back to a list to remove duplicates.
vocabulary = list(set(vocabulary))

In [9]:
len(vocabulary)

7783

Now we need to expand each SMS message in the training set into per-word counts so we can sum how many times each word is used.
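
To see what this expansion produces, here is a minimal sketch on two made-up messages. The names `toy_messages`, `toy_vocabulary`, `toy_counts`, and `toy_word_counts` are illustrative assumptions, and the messages are not rows from the dataset. Each vocabulary word becomes a column, and each row holds how often that word appears in the corresponding message.

```python
import pandas as pd

# Hypothetical, already cleaned and tokenized messages (not from the dataset).
toy_messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'now', 'now'],
]

# Sorted so the column order is deterministic.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

In the second row the `now` column holds 2, because that word appears twice in the second message.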

In [10]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [11]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [12]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Sum the word counts across each row to get the total number of words in each message, and store the result in a new column of the dataframe for later analysis.

In [13]:
training_set_clean['sum_of_rows'] = training_set_clean[vocabulary].sum(axis=1)
print(training_set_clean.head())

  Label                                                SMS  0  00  000  \
0   ham                  [yep, by, the, pretty, sculpture]  0   0    0   
1   ham  [yes, princess, are, you, going, to, make, me,...  0   0    0   
2   ham                    [welp, apparently, he, retired]  0   0    0   
3   ham                                           [havent]  0   0    0   
4   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...  0   0    0   

   000pes  008704050406  0089  01223585334  02     ...       zoe  zogtorius  \
0       0             0     0            0   0     ...         0          0   
1       0             0     0            0   0     ...         0          0   
2       0             0     0            0   0     ...         0          0   
3       0             0     0            0   0     ...         0          0   
4       0             0     0            0   0     ...         0          0   

   zouk  zyada  é  ú1  ü  〨ud  鈥  sum_of_rows  
0     0      0  0   0  0    0  0

In [14]:
# We need to split up the spam and ham to get the two sides of our comparison.

training_set_spam = training_set_clean[training_set_clean['Label'] == 'spam']
training_set_ham = training_set_clean[training_set_clean['Label'] == 'ham']
print(training_set_spam.head())

   Label                                                SMS  0  00  000  \
16  spam  [freemsg, why, haven, t, you, replied, to, my,...  0   0    0   
18  spam  [congrats, 2, mobile, 3g, videophones, r, your...  0   0    0   
56  spam  [free, message, activate, your, 500, free, tex...  0   0    0   
60  spam  [call, from, 08702490080, tells, u, 2, call, 0...  0   0    0   
61  spam  [someone, has, conacted, our, dating, service,...  0   0    0   

    000pes  008704050406  0089  01223585334  02     ...       zoe  zogtorius  \
16       0             0     0            0   0     ...         0          0   
18       0             0     0            0   0     ...         0          0   
56       0             0     0            0   0     ...         0          0   
60       0             0     0            0   0     ...         0          0   
61       0             0     0            0   0     ...         0          0   

    zouk  zyada  é  ú1  ü  〨ud  鈥  sum_of_rows  
16     0      0  0 

In [15]:
# The next few cells gather the numbers we need for the naive Bayes calculation.
# n_spam and n_ham are the total word counts across all spam and all ham messages.

n_spam = training_set_spam['sum_of_rows'].sum()
n_ham = training_set_ham['sum_of_rows'].sum()

In [16]:
n_vocabulary = len(vocabulary)
smoothing = 1

In [17]:
p_spam = len(training_set_spam)/len(training_set_clean)
p_ham = len(training_set_ham)/len(training_set_clean)

This is the function we will use to decide whether a new message is spam, given what we learned from the training set. We need to "clean" the SMS message to match our format, look at each word, and calculate its effect on the probability that the whole message is spam.
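
Written out, the comparison the function makes is the usual multinomial naive Bayes rule with additive (Laplace) smoothing. The symbols below are notation introduced here for illustration. They map onto the notebook's variables as follows: $\alpha$ is `smoothing`, $P(\text{Spam})$ is `p_spam`, $N_{\text{Spam}}$ is `n_spam`, $N_{\text{Vocabulary}}$ is `n_vocabulary`, and $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in the spam messages of the training set.

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}), \\
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
\end{aligned}
```

The same two expressions apply to ham, with `p_ham` and `n_ham` in place of the spam values. The message is labelled with whichever class gives the larger product.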

In [18]:
def spam_filter(sms_message):
    # Clean the message the same way as the training data.
    sms_message = re.sub(r'\W', ' ', sms_message)
    sms_message = sms_message.lower()
    sms_message_words = sms_message.split()

    # Start from the class priors.
    p_sms_spam = p_spam
    p_sms_ham = p_ham

    for word in sms_message_words:
        # Words never seen in training carry no information, so skip them.
        if word not in vocabulary:
            continue
        # Multiply in the smoothed probability of the word given each class.
        p_sms_spam *= (training_set_spam[word].sum() + smoothing) / (n_spam + smoothing * n_vocabulary)
        p_sms_ham *= (training_set_ham[word].sum() + smoothing) / (n_ham + smoothing * n_vocabulary)

    if p_sms_spam > p_sms_ham:
        outcome = 'spam'
    elif p_sms_spam < p_sms_ham:
        outcome = 'ham'
    elif p_sms_spam == p_sms_ham:
        outcome = 'equal'
    else:
        # Only reachable if one of the probabilities is NaN.
        outcome = 'error'

    return outcome
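
The function above recomputes `training_set_spam[word].sum()` for every word of every message, and multiplying many small probabilities can underflow on long messages. One possible variation, sketched here on made-up counts, precomputes the per-word probabilities once and adds their logarithms instead of multiplying. This is not the approach used in the rest of this notebook. The names `toy_spam_counts`, `toy_ham_counts`, `build_parameters`, and `classify_log` are hypothetical, and the toy numbers are not taken from the dataset.

```python
import math
import re

# Hypothetical word counts per class (not from the dataset).
toy_spam_counts = {'free': 3, 'prize': 2, 'call': 1, 'now': 0}
toy_ham_counts = {'free': 0, 'prize': 0, 'call': 2, 'now': 3}
toy_p_spam, toy_p_ham = 0.5, 0.5
alpha = 1  # Laplace smoothing, same role as `smoothing` above


def build_parameters(counts, alpha):
    """Return {word: log P(word | class)} using additive smoothing."""
    n_words = sum(counts.values())
    n_vocab = len(counts)
    return {
        word: math.log((count + alpha) / (n_words + alpha * n_vocab))
        for word, count in counts.items()
    }


# Computed once, then reused for every message.
log_params_spam = build_parameters(toy_spam_counts, alpha)
log_params_ham = build_parameters(toy_ham_counts, alpha)


def classify_log(sms_message):
    """Classify a message by comparing summed log probabilities."""
    words = re.sub(r'\W', ' ', sms_message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in spam_filter.
        if word in log_params_spam:
            log_spam += log_params_spam[word]
            log_ham += log_params_ham[word]
    if log_spam > log_ham:
        return 'spam'
    if log_spam < log_ham:
        return 'ham'
    return 'equal'


print(classify_log('FREE prize!!'))  # spam
print(classify_log('call now'))      # ham
```

Because the logarithm is monotonic, comparing sums of logs gives the same decision as comparing the products, as long as no product has underflowed to zero.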
        

In [19]:
# Let's test the algorithm on the first few rows of the test set.
# .copy() gives an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning.
test_test_set = test_set.head().copy()

test_test_set['predicted_outcome'] = test_test_set['SMS'].apply(spam_filter)
print(test_test_set)

  Label                                                SMS predicted_outcome
0   ham          Later i guess. I needa do mcat study too.               ham
1   ham             But i haf enuff space got like 4 mb...               ham
2  spam  Had your mobile 10 mths? Update to latest Oran...              spam
3   ham  All sounds good. Fingers . Makes it difficult ...               ham
4   ham  All done, all handed in. Don't know if mega sh...               ham


In [20]:
# Looks like it is working. Let's test it on the entire test_set.

test_set['predicted_outcome'] = test_set['SMS'].apply(spam_filter)

In [21]:
correct_test = test_set[test_set['predicted_outcome'] == test_set['Label']]

correct = len(correct_test)
total = len(test_set)
percent_accuracy = round((correct / total) * 100,2)

print('The number of correct predictions was', correct)
print('The number of SMS messages analysed in this test set was', total)
print('The percentage accuracy of this spam filter is', percent_accuracy)

The number of correct predictions was 1100
The number of SMS messages analysed in this test set was 1114
The percentage accuracy of this spam filter is 98.74


The spam filter we built is quite accurate: almost 99 out of 100 test messages were classified correctly, well above our 80% goal. Still, 14 of the 1,114 test messages were misclassified. If we were to continue this project, that is where we would focus our attention, to identify why the filter got those 14 messages wrong.
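
As a starting point for that follow-up, the misclassified rows can be pulled out and broken down by the kind of mistake. The sketch below uses a small made-up frame in place of the real `test_set`. The name `toy_results` and its contents are hypothetical, and the frame is only assumed to have the same `Label`, `SMS`, and `predicted_outcome` columns as the test set above.

```python
import pandas as pd

# Hypothetical results standing in for test_set (not real predictions).
toy_results = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam', 'ham'],
    'SMS': ['see you at 5', 'win a free prize', 'call me',
            'hi there', 'free for lunch?'],
    'predicted_outcome': ['ham', 'spam', 'ham', 'ham', 'spam'],
})

# Rows where the prediction disagrees with the true label.
misclassified = toy_results[toy_results['Label'] != toy_results['predicted_outcome']]
print(misclassified)

# Cross-tabulate true labels against predictions to see which errors dominate.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted_outcome'])
print(confusion)
```

Reading the cross-tabulation by row shows whether the filter mostly lets spam through or mostly flags legitimate messages, which are two different problems to investigate.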