# Building a Spam Filter with Naive Bayes

The purpose of this project is to build a spam filter using probabilities derived from Bayes' theorem. It uses a UCI dataset of 5,572 SMS messages that are already labeled as spam or non-spam (referred to as ham). We will use part of this data as a training set to estimate the probability that a message is spam or ham, then apply the resulting classifier to the test set to measure its accuracy.

In [1]:
import pandas as pd

# importing pandas

In [2]:
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# importing data as a pandas dataframe

### Data Profiling

In [3]:
sms_spam.head()

# reviewing first five rows

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
counts = sms_spam["Label"].value_counts()
counts / sum(counts)

# proportion of ham vs. spam messages (ham means not spam)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### Splitting Data into Training vs Test

In [5]:
data_randomized = sms_spam.sample(frac=1, random_state=1)

# randomizes data before splitting

In [6]:
data_randomized.head()

# confirms data randomization

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [7]:
print(len(data_randomized) *.20)
print(len(data_randomized))

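# determining the indices for each set. A typical split is 80% training and 20% test; note that the slices used below (0:1114 and 1114:5572) do the reverse, putting about 20% of the rows in the training set and about 80% in the test set.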

1114.4
5572


In [8]:
training_set = data_randomized[0:1114].reset_index(drop = True)
test_set = data_randomized[1114:5572].reset_index(drop = True)

# creating training and test datasets and resetting index
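
The slices in this notebook put 1,114 rows in the training set and 4,458 in the test set. For comparison, here is a minimal sketch of how the boundary for the more common 80/20 split could be computed. It uses plain Python lists of row positions rather than the DataFrame, and the helper name `train_test_split_indices` is hypothetical, not part of the original notebook.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.8, seed=1):
    """Return shuffled row positions split into (train, test) lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)      # reproducible shuffle
    split_at = round(n_rows * train_fraction)   # boundary between the two sets
    return positions[:split_at], positions[split_at:]

# 5,572 is the number of rows in the SMS dataset
train_idx, test_idx = train_test_split_indices(5572)
print(len(train_idx), len(test_idx))
```

With 5,572 rows this yields 4,458 training positions and 1,114 test positions, which could then be passed to `iloc` on the original DataFrame.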

In [9]:
training_set["Label"].value_counts()/1114

# confirms the ham vs. spam proportions in the training set are very close to those of the full dataset

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [10]:
test_set["Label"].value_counts()/4458

# confirms the ham vs. spam proportions in the test set are very close to those of the full dataset

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

### Cleaning SMS Messages

In [11]:
training_set.head()

# reviewing initial 5 SMS messages

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [12]:
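training_set["SMS"] = training_set["SMS"].str.replace(r'\W', ' ', regex=True).str.lower()

# Replaces punctuation with spaces and puts all words in lowercase (regex=True is needed in recent pandas versions, where str.replace treats the pattern literally by default)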
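
To see what the `\W` pattern does on a single string, here is a small standalone sketch using Python's `re` module instead of the pandas string accessor. The helper name `clean_message` and the second sample message are made up for illustration.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

# Punctuation becomes a space; letters, digits, and underscores are kept.
print(clean_message("Yep, by the pretty sculpture"))
print(clean_message("WINNER!! Call 0800-123 now."))

# Splitting afterwards collapses any runs of spaces left behind.
print(clean_message("Yep, by the pretty sculpture").split())
```

Because `\W` is Unicode-aware for Python strings, characters such as `ü` count as word characters and survive the cleaning step.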

In [13]:
training_set.head()

# confirming removal of punctuation and conversion to lowercase

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


### Creating DF for Word Counts

In [14]:
training_set['SMS'] = training_set['SMS'].str.split()

# creates a list for each SMS message where each element is one word

In [15]:
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

# creates a list for all words used in the messages and removes duplicates with the set

In [16]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
# creates columns to show word counts for each message/row of data
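
The same word-count structure can be illustrated on a tiny, made-up corpus. This sketch uses `collections.Counter` as an alternative to the nested loops above; the three toy messages are hypothetical and only serve to show the shape of the resulting dictionary.

```python
from collections import Counter

# Toy tokenized messages (hypothetical, not from the dataset)
messages = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["winner", "claim", "prize", "prize"],
]

# Unique words across all messages
vocabulary = sorted({word for sms in messages for word in sms})

# One Counter per message; a Counter returns 0 for words it has not seen
per_message = [Counter(sms) for sms in messages]

# word -> list with one count per message, like the notebook's dictionary
word_counts_per_sms = {
    word: [counts[word] for counts in per_message] for word in vocabulary
}

print(word_counts_per_sms["prize"])
print(word_counts_per_sms["secret"])
```

Each dictionary value has one entry per message, so the dictionary converts directly into a DataFrame with one row per message and one column per word.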

In [17]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

# converts dictionary to DF and prints results

Unnamed: 0,00,000,000pes,0089,02,03,04,05,050703,0578,...,yoville,yr,ystrday,yun,yup,yupz,z,zogtorius,ú1,ü
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [18]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

# concats word counts DF with the original DF

Unnamed: 0,Label,SMS,00,000,000pes,0089,02,03,04,05,...,yoville,yr,ystrday,yun,yup,yupz,z,zogtorius,ú1,ü
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


### Creating Spam & Ham Probability DFs for each Word

In [19]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham) which is the percent of messages that are spam and ham
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam which is the total number of words in all spam messages
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham which is the total number of words in all ham messages
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary which is the total unique vocab words
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

# initial variables required for Bayes' theorem

In [20]:
# calculates P(word|Spam) and P(word|Ham) for each unique word, using Laplace smoothing

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
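
To make the smoothing arithmetic concrete, here is a small worked sketch on a made-up corpus of four tokenized messages. The helper `word_probabilities` is hypothetical and works on plain lists rather than the notebook's DataFrames, but it applies the same formula: (count of word in class + alpha) / (total words in class + alpha * vocabulary size).

```python
# Toy tokenized messages (hypothetical, not from the dataset)
spam_messages = [["win", "prize", "now"], ["win", "cash"]]
ham_messages = [["see", "you", "now"], ["ok"]]

vocabulary = sorted({w for sms in spam_messages + ham_messages for w in sms})
alpha = 1  # Laplace smoothing

def word_probabilities(messages, vocabulary, alpha):
    """Return P(word | class) for every vocabulary word, with smoothing."""
    n_words = sum(len(sms) for sms in messages)       # total words in the class
    denominator = n_words + alpha * len(vocabulary)
    return {
        word: (sum(sms.count(word) for sms in messages) + alpha) / denominator
        for word in vocabulary
    }

parameters_spam = word_probabilities(spam_messages, vocabulary, alpha)
parameters_ham = word_probabilities(ham_messages, vocabulary, alpha)

# "win" appears twice in 5 spam words; the vocabulary has 7 words: (2 + 1) / (5 + 7)
print(parameters_spam["win"])
# "win" never appears in ham, but smoothing keeps it non-zero: (0 + 1) / (4 + 7)
print(parameters_ham["win"])
```

Smoothing matters because a single zero factor would otherwise wipe out the whole product for any message containing a word never seen in one of the classes.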

### Function to Predict Spam vs Ham

In [21]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [22]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 5.421713188997406e-26
P(Ham|message): 1.0813140422305892e-26
Label: Spam
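
The scores above are already around 1e-26 for a short message. For much longer messages, a product of many small factors can underflow to 0.0 in floating point, which would turn both scores into a tie. A common remedy is to add logarithms instead of multiplying. The sketch below is one possible variant, not the notebook's method: the function name `classify_log` is hypothetical, and the parameter values are made up so the block runs on its own.

```python
import math
import re

def classify_log(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Classify a message by comparing summed log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the notebook
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Made-up parameters for illustration only
toy_spam = {"win": 0.30, "now": 0.10, "ok": 0.05}
toy_ham = {"win": 0.02, "now": 0.10, "ok": 0.30}

print(classify_log("WIN now!", 0.13, 0.87, toy_spam, toy_ham))
print(classify_log("ok", 0.13, 0.87, toy_spam, toy_ham))

# A very long message: the plain product underflows, the log sum does not
long_message = "win " * 1000
print(0.13 * 0.30 ** 1000)
print(classify_log(long_message, 0.13, 0.87, toy_spam, toy_ham))
```

Since the logarithm is monotonic, comparing log scores gives the same label as comparing the raw products whenever those products can be represented.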


### Testing the Function Accuracy

In [23]:
# updating function to return spam or ham only based on the probabilities

def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [24]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

# creates classification column in our test set

Unnamed: 0,Label,SMS,predicted
0,ham,Yeah do! Don‘t stand to close tho- you‘ll catc...,ham
1,ham,"Hi , where are you? We're at and they're not ...",ham
2,ham,If you r @ home then come down within 5 min,ham
3,ham,When're you guys getting back? G said you were...,ham
4,ham,Tell my bad character which u Dnt lik in me. ...,ham


In [25]:
correct_count = sum(test_set['predicted']== test_set['Label'])
percent_correct = correct_count / len(test_set)

print(percent_correct)

0.9784656796769852
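
Because roughly 87% of the messages are ham, accuracy alone does not show how well the filter catches spam specifically. One way to look closer is to compute precision and recall for the spam class. The sketch below uses made-up label lists and a hypothetical helper `spam_metrics`; it does not report results for the actual test set.

```python
def spam_metrics(actual, predicted):
    """Return (accuracy, precision, recall), treating 'spam' as the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall

# Made-up labels for illustration only
actual    = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam']
predicted = ['ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'ham']

accuracy, precision, recall = spam_metrics(actual, predicted)
print(accuracy, precision, recall)
```

On the real data, the same function could be called with `test_set['Label']` and `test_set['predicted']`.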


### Conclusions

After running our classify function on every message in the test set, our spam filter predicted spam vs. ham with about 97.8% accuracy!

To recap the formula, the Bayes probability used to predict spam is the product of two parts:

part 1 :
the share of messages that are spam (using counts of messages in the Label column)

part 2 :
the probability that a word is used given the message is spam  =  (total times the word appears in all spam messages + 1) / (total words in all spam messages + number of unique words × 1). **These per-word probabilities are multiplied together for every word in a message to get the (unnormalized) probability that the message is spam.**

The same formula is then run for ham (non-spam messages), and the message is assigned whichever label has the higher result.
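
Written compactly, and using symbols that mirror the notebook's variable names (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`), the recap corresponds to the following. The notation itself is an assumption for illustration; the ham version is obtained by swapping Spam for Ham throughout.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}, \qquad \alpha = 1
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set.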