## Classifying Messages As Spam or Not Spam Using the Naive Bayes Algorithm

We are going to use the Naive Bayes Algorithm to classify messages as spam or not spam. We'll be using a dataset of human-classified SMS messages put together by Tiago A. Almeida and José María Gómez Hidalgo, available at the UCI Machine Learning Repository.

In [1]:
import pandas as pd

spam_collection = pd.read_csv("SMSSpamCollection", sep = '\t', header = None, names = ['Label', 'SMS'])

spam_collection.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
# Count rows and columns: 5572 rows, 2 columns

spam_collection.shape

# Calculate percentage of spam/non-spam ('ham')

print(spam_collection['Label'].value_counts(normalize = True)*100)

ham     86.593683
spam    13.406317
Name: Label, dtype: float64


We now know that our data has two columns. The "Label" column contains the classification (spam/ham), and the "SMS" column contains the message itself. There are 5572 rows (messages) in the data; 86.6% are not spam and the remaining 13.4% are spam.

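Our next step is to shuffle the data and split it into a "training" set (the first 4458 messages, about 80%) to build our algorithm, and a "test" set (the remaining 1114 messages) to check our algorithm on. We also confirm that both sets keep roughly the same proportion of spam as the full dataset.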

In [3]:
random_data = spam_collection.sample(frac = 1, random_state = 1)

train = random_data[:4458].reset_index(drop = True)
test = random_data[4458:].reset_index(drop = True)

print(train['Label'].value_counts(normalize = True)*100)
print(test['Label'].value_counts(normalize = True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


### Transforming the Training Data

We're going to transform the training data now so that each word is a column, and each row is a message. The entries in each column should tell us how many times that word appears in that message.
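
To make the target layout concrete, here is a minimal sketch on three made-up messages. The names `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical and not part of the notebook; the dictionary of per-word count lists mirrors the structure that the cells below build before handing it to `pd.DataFrame`.

```python
# Toy, already-tokenized messages standing in for the cleaned 'SMS' column
# (hypothetical data, not taken from the dataset).
toy_messages = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["winner", "claim", "prize", "prize"],
]

# Vocabulary: every distinct word, sorted so the column order is reproducible.
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word; position i holds how often the word occurs in message i.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts["prize"])   # [1, 0, 2]
print(toy_counts["secret"])  # [1, 1, 0]
```

Each key becomes a column and each list position becomes a row, which is exactly the shape described above.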

In [4]:
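# Replace non-word characters (punctuation) with spaces and convert to lower case.
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default.

train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex = True).str.lower()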
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [5]:
train['SMS'] = train['SMS'].str.split()

vocabulary = []
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [6]:
word_counts_per_sms = {}

for elt in vocabulary:
    word_counts_per_sms[elt] = [0]*len(train['SMS'])
    
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [7]:
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

In [8]:
train_clean = pd.concat([train, word_counts_per_sms], axis = 1)

In [9]:
train_clean.head()

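| | Label | SMS | definitely | tookplace | _ | ab | 2price | okies | mix | vegas | ... | evey | kick | reserves | dload | lux | completely | receivea | morphine | necessary | prof |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |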


### Naive Bayes Algorithm

We're now ready to create our spam filter. In the next code blocks we'll work through the calculations needed to apply the Naive Bayes Algorithm.

The Naive Bayes Algorithm starts from Bayes' theorem: P(message is spam | word 1, word 2, etc) = P(word 1, word 2, etc | spam) x P(spam) / P(word 1, word 2, etc). By the same logic, P(message is not spam | word 1, word 2, etc) = P(word 1, word 2, etc | not spam) x P(not spam) / P(word 1, word 2, etc). The denominator is the same in both expressions, so we don't need to calculate it! We just compare the numerators to see whether spam or not spam is more likely.

We then pretend that words appear independently of one another within each category (this is the "naive" part), so P(message is spam | word 1, word 2, etc) is proportional to P(spam) x P(word 1 | spam) x P(word 2 | spam) etc. The value of P(spam) is the proportion of spam messages in the training data. To estimate P(word | spam) we use additive smoothing: (number of times the word appears in spam messages + alpha) / (total number of words in spam messages + alpha x size of vocabulary). Note: smoothing prevents a word that appears in only one category (spam or not spam) from turning the whole product for the other category into 0.
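
As a small worked example of the smoothing formula, the sketch below plugs in made-up counts. The helper `smoothed_probability` and the numbers (7 words in spam, a 9-word vocabulary) are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class)."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Hypothetical counts: 7 words across all spam messages, 9 words in the vocabulary.
p_prize = smoothed_probability(2, 7, 9)   # word seen twice in spam: (2 + 1) / (7 + 9)
p_party = smoothed_probability(0, 7, 9)   # word never seen in spam: (0 + 1) / (7 + 9)

# With alpha = 0 (no smoothing) the unseen word would get probability 0,
# which would wipe out the whole product for that category.
p_party_unsmoothed = smoothed_probability(0, 7, 9, alpha=0)

print(p_prize)             # 0.1875
print(p_party)             # 0.0625
print(p_party_unsmoothed)  # 0.0
```

With alpha = 1 (the value used in the cells below) this is often called Laplace smoothing.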

In [10]:
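# Probability of spam vs. not spam, taken from the training label proportions
counts_table = train['Label'].value_counts(normalize = True)

# Index by label rather than by position so the two values cannot be swapped
p_ham = counts_table['ham']
p_spam = counts_table['spam']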

In [11]:
# Counts of spam, not spam, and words in general

# The easy one: how many words in vocab.
n_vocab = len(vocabulary)

# The trickier one: how many words in spam/not spam
spam = train_clean[train_clean['Label'] == "spam"]
ham = train_clean[train_clean['Label'] == "ham"]

n_spam = spam['SMS'].apply(len).sum()
n_ham = ham['SMS'].apply(len).sum()

alpha = 1

In [12]:
spam_dict = {word : 0 for word in vocabulary}
ham_dict = {word : 0 for word in vocabulary}

for word in vocabulary:
    n_word_spam = spam[word].sum()
    n_word_ham = ham[word].sum()
    spam_dict[word] = (n_word_spam + alpha)/(n_spam + alpha * n_vocab)
    ham_dict[word] = (n_word_ham + alpha)/(n_ham + alpha * n_vocab)

In [13]:
import re
import numpy as np

def classify(message):

    # Same cleaning as the training data; the raw string avoids an invalid-escape warning
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    message_spam_probs = []
    message_ham_probs = []
    
    # Words never seen in training are skipped: they tell us nothing either way
    for word in message:
        if word in spam_dict:
            message_spam_probs.append(spam_dict[word])
        if word in ham_dict:
            message_ham_probs.append(ham_dict[word])

    # np.prod replaces np.product, which was removed in NumPy 2.0
    p_spam_given_message = p_spam * np.prod(message_spam_probs)
    p_ham_given_message = p_ham * np.prod(message_ham_probs)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [14]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [15]:
classify("Sounds good, Tom, then see u there")

'ham'
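
One caveat worth sketching: `classify` multiplies many numbers smaller than 1, and for a long enough message such a product can underflow to 0.0 in floating point, for both categories at once. A common workaround is to compare sums of logarithms instead of products. The snippet below illustrates the idea with made-up probabilities; `log_score` is a hypothetical helper, not part of the notebook, and nothing here suggests the SMS-length messages in this dataset actually trigger the problem.

```python
import math

def log_score(prior, word_probs):
    """Return log(prior) + sum of log(p); ranks categories like the product does."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Hypothetical long message: 400 words, each with a small conditional probability.
spam_probs = [1e-4] * 400
ham_probs = [1e-5] * 400

# The direct products are too small to represent and both collapse to 0.0 ...
direct_spam = 0.5 * math.prod(spam_probs)
direct_ham = 0.5 * math.prod(ham_probs)
print(direct_spam, direct_ham)

# ... while the log scores stay finite and can still be compared.
log_spam = log_score(0.5, spam_probs)
log_ham = log_score(0.5, ham_probs)
print('spam' if log_spam > log_ham else 'ham')
```

Because the logarithm is strictly increasing, comparing log scores gives the same decision as comparing the products whenever the products are representable.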

We've now created our classification function! We'll test it on our test data to see how well it does. 

In [16]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [17]:
correct = 0
total = len(test)

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print(correct/total)

0.9874326750448833
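
The accuracy loop above can also be written as a single vectorized comparison, and a cross-tabulation shows which kind of mistake the filter makes. This is a sketch on a tiny made-up frame (`toy_test` is hypothetical); it does not reproduce the notebook's result.

```python
import pandas as pd

# Hypothetical labels and predictions, standing in for the notebook's test frame.
toy_test = pd.DataFrame({
    'Label':     ['ham', 'ham',  'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'spam', 'ham'],
})

# Fraction of rows where the prediction matches the label.
accuracy = (toy_test['Label'] == toy_test['predicted']).mean()
print(accuracy)

# Rows are true labels, columns are predictions.
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])
print(confusion)
```

On the notebook's data the same expression would be `(test['Label'] == test['predicted']).mean()`. Since about 87% of messages are ham, the confusion table is a useful complement to the single accuracy figure.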
