We'll be using the multinomial Naive Bayes algorithm to train the computer to classify SMS messages using a dataset of 5,572 messages provided by Tiago A. Almeida and José María Gómez Hidalgo.

In [1]:
import pandas as pd

data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [3]:
data.head()

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [4]:
data.shape

(5572, 2)

In [5]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 86.6% of the messages are ham (non-spam) and 13.4% are spam. The dataset has 5,572 rows and two columns. We will now shuffle the rows so that the split into training and test sets is random.

In [6]:
data_new = data.sample(frac=1, random_state=1)

Split the randomized data into training (80% of data) and test (20% of data).

In [7]:
#Calculate the index to use for the split

splitting_index = round(len(data) * 0.80)

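#Split the dataset into training and test sets; .copy() makes each one
#an independent DataFrame, so later column assignments do not warn
training = data_new[:splitting_index].copy()
test = data_new[splitting_index:].copy()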

#Show number of rows and columns for training and test
print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)
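
As a minimal sketch of the shuffle-and-split step, here is the same pattern on a toy frame. The `toy` DataFrame and its contents are made up for illustration and are not taken from the SMS dataset. Taking `.copy()` of each slice is one way (an assumption here, not necessarily the only fix) to keep pandas from warning when columns are assigned on the pieces later.

```python
import pandas as pd

# Toy data standing in for the SMS dataset (hypothetical values)
toy = pd.DataFrame({
    'Label': ['ham', 'spam'] * 5,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Shuffle all rows reproducibly, then cut at the 80% mark
toy_shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(toy_shuffled) * 0.80)

# .copy() makes each piece an independent DataFrame
toy_training = toy_shuffled[:cut].copy()
toy_test = toy_shuffled[cut:].copy()

print(toy_training.shape)
print(toy_test.shape)
```

Every row lands in exactly one of the two pieces, so the row counts add up to the size of the original frame.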


Now we will reset the index labels for both data sets.

In [8]:
training.reset_index(inplace=True)
test.reset_index(inplace=True)

In [9]:
print(training['Label'].value_counts(normalize=True))
print(test['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


We see that the percentage of spam and ham messages in both sets is similar to the original dataset.

In [10]:
print(training.head(5))
print(test.head(5))

   index Label                                                SMS
0   1078   ham                       Yep, by the pretty sculpture
1   4028   ham      Yes, princess. Are you going to make me moan?
2    958   ham                         Welp apparently he retired
3   4642   ham                                            Havent.
4   4674   ham  I forgot 2 ask ü all smth.. There's a card on ...
   index Label                                                SMS
0   2131   ham          Later i guess. I needa do mcat study too.
1   3418   ham             But i haf enuff space got like 4 mb...
2   3424  spam  Had your mobile 10 mths? Update to latest Oran...
3   1538   ham  All sounds good. Fingers . Makes it difficult ...
4   5393   ham  All done, all handed in. Don't know if mega sh...


In [11]:
import re

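#Replace every non-word character with a space, then lowercase.
#regex=True is needed in recent pandas, where str.replace treats the pattern literally by default
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower()
print(training.head(5))
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.lower()
print(test.head(5))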

   index Label                                                SMS
0   1078   ham                       yep  by the pretty sculpture
1   4028   ham      yes  princess  are you going to make me moan 
2    958   ham                         welp apparently he retired
3   4642   ham                                            havent 
4   4674   ham  i forgot 2 ask ü all smth   there s a card on ...
   index Label                                                SMS
0   2131   ham          later i guess  i needa do mcat study too 
1   3418   ham             but i haf enuff space got like 4 mb   
2   3424  spam  had your mobile 10 mths  update to latest oran...
3   1538   ham  all sounds good  fingers   makes it difficult ...
4   5393   ham  all done  all handed in  don t know if mega sh...
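
To see what the cleaning does to a single string, here is a small sketch using the `re` module directly. The helper name `clean` is hypothetical and is not defined anywhere else in this notebook.

```python
import re

def clean(text):
    """Replace non-word characters with spaces and lowercase the result."""
    return re.sub(r'\W', ' ', text).lower()

# Punctuation becomes a space, so ", " turns into two spaces
print(clean("Yep, by the pretty sculpture"))

# Apostrophes are removed too, which splits contractions into two tokens
print(clean("Don't know").split())

# Letters such as ü count as word characters in Python 3 and are kept
print(clean("ask ü all"))
```

This is why tokens like `t` and `s` show up as separate words in the cleaned messages.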



In [12]:
#Split each message into a list of words
training['SMS'] = training['SMS'].str.split()

vocabulary = []
for msg in training['SMS']:
    for word in msg:
        vocabulary.append(word)

#Transform vocabulary into a set to remove duplicates,
#then transform the set back into a list
vocabulary = list(set(vocabulary))
print(len(vocabulary))



7783


There are 7,783 unique words in the training dataset. Below we count how many times each unique word appears in each training message.

In [13]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
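
The counting pattern above is easier to follow on a tiny example. The two messages below are invented for illustration; the structure (one list of per-message counts for each vocabulary word) is the same.

```python
# Two already-tokenized toy messages (hypothetical data)
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'later']]
toy_vocabulary = sorted({word for msg in toy_messages for word in msg})

# One list per word, with one slot per message, all starting at zero
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for i, msg in enumerate(toy_messages):
    for word in msg:
        counts[word][i] += 1

print(counts['free'])   # [2, 0]
print(counts['later'])  # [0, 1]
```

Each dictionary key later becomes a column, and each list position becomes a row, when the dictionary is turned into a DataFrame.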

Transform word_counts_per_sms into a dataframe.

In [14]:
df1 = pd.DataFrame(data = word_counts_per_sms) 

In [15]:
comb_training = pd.concat([training, df1], axis=1)
comb_training.head(5)

   index Label                                                SMS  0  00  000  000pes  008704050406  0089  01223585334  ...  zindgi  zoe  zogtorius  zouk  zyada  é  ú1  ü  〨ud  鈥
0   1078   ham                  [yep, by, the, pretty, sculpture]  0   0    0       0             0     0            0  ...       0    0          0     0      0  0   0  0    0  0
1   4028   ham  [yes, princess, are, you, going, to, make, me,...  0   0    0       0             0     0            0  ...       0    0          0     0      0  0   0  0    0  0
2    958   ham                    [welp, apparently, he, retired]  0   0    0       0             0     0            0  ...       0    0          0     0      0  0   0  0    0  0
3   4642   ham                                           [havent]  0   0    0       0             0     0            0  ...       0    0          0     0      0  0   0  0    0  0
4   4674   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...  0   0    0       0             0     0            0  ...       0    0          0     0      0  0   0  2    0  0


In [16]:
comb_training.shape

(4458, 7786)

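Calculate P(Spam) and P(Ham), along with N_spam, N_ham, and N_vocabulary: the total number of words in spam messages, the total number of words in ham messages, and the size of the vocabulary.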

In [17]:
spam_count = 0
ham_count = 0
spam_msgs = comb_training[comb_training['Label'] == 'spam']
ham_msgs = comb_training[comb_training['Label'] == 'ham']
for t in comb_training['Label']:
    if t == 'spam':
        spam_count +=1
    elif t == 'ham':
        ham_count +=1
        
print("Spam messages: ", spam_count, "Nonspam messages: ", ham_count)        
p_spam = spam_count / (spam_count + ham_count)
print("P(Spam): ", p_spam)
p_ham = ham_count / (spam_count + ham_count)
print("P(Ham): ", p_ham)
words_per_spam_msg = spam_msgs['SMS'].apply(len)
N_spam = words_per_spam_msg.sum()
words_per_ham_msg = ham_msgs['SMS'].apply(len)
N_ham = words_per_ham_msg.sum()
N_vocabulary = len(vocabulary)
print("Nspam: ", N_spam, "Nham: ", N_ham, "Nvocabulary: ", N_vocabulary)

#Add Laplace smoothing
alpha = 1

Spam messages:  600 Nonspam messages:  3858
P(Spam):  0.13458950201884254
P(Ham):  0.8654104979811574
Nspam:  15190 Nham:  57237 Nvocabulary:  7783
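
For reference, the values computed above plug into the standard multinomial Naive Bayes formulas. The symbols mirror the variable names in the code (`N_spam`, `N_vocabulary`, `alpha`). The symbol for the number of times word w_i occurs in spam messages is notation introduced here for illustration.

```latex
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
```

The ham versions replace Spam with Ham throughout. With alpha = 1 and the counts printed above, a word that never occurs in a spam message gets probability 1 / (15190 + 7783) = 1 / 22973 instead of zero, so a single unseen word cannot wipe out the whole product.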


In [18]:
#Initialize dictionaries for spam and ham
spam_dict = {unique_word:0 for unique_word in vocabulary}
ham_dict = {unique_word:0 for unique_word in vocabulary}

#Spam and ham already isolated above
for word in vocabulary:
    count_word_given_spam = spam_msgs[word].sum()
    p_word_given_spam = (count_word_given_spam + alpha) / (N_spam + (alpha * N_vocabulary))
    spam_dict[word] = p_word_given_spam
    
    count_word_given_ham = ham_msgs[word].sum()
    p_word_given_ham = (count_word_given_ham + alpha) / (N_ham + (alpha * N_vocabulary))
    ham_dict[word] = p_word_given_ham
        
#print(spam_dict)        
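
Here is a small sketch of the smoothed estimate on made-up counts. All of the numbers and the `toy_` names are hypothetical; only the formula matches the cell above.

```python
toy_alpha = 1
toy_n_vocabulary = 4   # number of distinct words
toy_n_spam = 5         # total words across all spam messages

# How often each vocabulary word occurs in spam (hypothetical counts)
toy_counts_in_spam = {'free': 3, 'prize': 2, 'see': 0, 'you': 0}

toy_p_word_given_spam = {
    word: (count + toy_alpha) / (toy_n_spam + toy_alpha * toy_n_vocabulary)
    for word, count in toy_counts_in_spam.items()
}

print(toy_p_word_given_spam['free'])  # (3 + 1) / 9
print(toy_p_word_given_spam['see'])   # (0 + 1) / 9, not zero
print(sum(toy_p_word_given_spam.values()))
```

Words with a zero count still get a small positive probability, and the smoothed probabilities over the vocabulary still add up to 1 (up to floating-point rounding).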

Write a spam filter function that classifies a single new message.

In [19]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #Start from the priors P(Spam) and P(Ham), then multiply in
    #P(word|Spam) and P(word|Ham) for each word found in the vocabulary
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
   
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')  

In [20]:
#Example messages to check classify function
message1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
message2 = "Sounds good, Tom, then see u there"

#Classify the example messages above (classify prints its result and returns nothing)
classify(message1)
classify(message2)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


The first message (message1) is classified as spam and the second as ham. By inspection, this looks correct.
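
The products printed above are already around 1e-25 for messages of only a few words, so much longer messages could underflow to zero on both sides. A common alternative, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. The sketch below uses small hand-made parameters (`toy_p_spam`, `toy_spam_dict`, and so on, all hypothetical) in place of the values computed from the training set.

```python
import math
import re

# Hypothetical parameters standing in for p_spam, p_ham, spam_dict, ham_dict
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_dict = {'win': 0.30, 'money': 0.30, 'see': 0.05, 'you': 0.35}
toy_ham_dict = {'win': 0.05, 'money': 0.05, 'see': 0.40, 'you': 0.50}

def classify_log(message):
    """Compare log P(Spam) + sum(log P(w|Spam)) with the ham equivalent."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_spam_dict:
            log_spam += math.log(toy_spam_dict[word])
        if word in toy_ham_dict:
            log_ham += math.log(toy_ham_dict[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win money!!'))
print(classify_log('See you'))
```

Because the logarithm is increasing, comparing the two sums gives the same label as comparing the two products whenever the products do not underflow.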

Below we will measure the accuracy of our spam filter. But first we will adapt the classify function so that it returns the label instead of printing it, which lets us apply it to every message in the test set.

In [23]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Using the classify_test_set function, we will predict spam or ham for each message in our test data.

In [26]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head(5)



   index Label                                                SMS predicted
0   2131   ham          later i guess  i needa do mcat study too        ham
1   3418   ham             but i haf enuff space got like 4 mb          ham
2   3424  spam  had your mobile 10 mths  update to latest oran...      spam
3   1538   ham  all sounds good  fingers   makes it difficult ...       ham
4   5393   ham  all done  all handed in  don t know if mega sh...       ham


In [30]:
correct = 0
total = len(test['SMS'])
#print(total)

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1

accuracy = correct / total
print(accuracy)

0.9874326750448833


The accuracy is 98.7%, well above the 86.8% we would get on this test set by labeling every message as ham. So our spam filter is highly effective!
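
Accuracy alone does not show which kind of mistake the filter makes. As a sketch on invented labels (the `toy_results` frame is hypothetical and is not the real test set), the same accuracy can be computed without a loop, and `pd.crosstab` breaks the results down by actual and predicted label.

```python
import pandas as pd

# Hypothetical actual and predicted labels for five messages
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham'],
})

# The mean of a boolean Series is the fraction of True values
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print(toy_accuracy)

# Rows are the actual labels, columns are the predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)
```

In this toy case the one error is a spam message predicted as ham. The same table on the real test set would show how the filter's errors split between missed spam and ham wrongly flagged as spam.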