# Spam Filter

Goal is to create a Spam filter that classifies new messages with an accuracy greater than 0%.

Dataset was obtained from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) and is available at the following [link](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

In [40]:
# Develop a Spam Filter to Classify SMS Messages

## Import Dataset

In [41]:
import pandas as pd
import re
file_path = '../github/Data/SMSSpamCollection'
sms_spam = pd.read_csv(file_path, sep='\t', header=None, names=['Label', 'Message'])
sms_spam.head()

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Explore Dataset

In [42]:
sms_spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [43]:
# Review Percentage of Spam Messages in Dataset

(sms_spam['Label'].value_counts(normalize=True)*100).round(2)

Label
ham     86.59
spam    13.41
Name: proportion, dtype: float64

In [44]:
# Divide the Dataset Randomly for Training Purposes

random_sms_spam = sms_spam.sample(frac=1, random_state=1)
split_index = round(len(random_sms_spam)*0.8)
training_set = random_sms_spam[:split_index].reset_index(drop=True)
test_set = random_sms_spam[split_index:].reset_index(drop=True)

print(f'The training set contains {len(training_set)} messages.')
print()
print(f'The test set contains {len(test_set)} messages.')

The training set contains 4458 messages.

The test set contains 1114 messages.


In [45]:
# Is the spam distributed evenly between the Test Set and Training Set

(training_set['Label'].value_counts(normalize=True)*100).round(2)

Label
ham     86.54
spam    13.46
Name: proportion, dtype: float64

In [46]:
(test_set['Label'].value_counts(normalize=True)*100).round(2)

Label
ham     86.8
spam    13.2
Name: proportion, dtype: float64

## Training Data

In [47]:
# Remove Punctuation (Non-alphanumberic characters)
# Change All Text to Lower Case
# Split Message into a List of Substrings

def clean_msg(text):
    clean_text = text.str.replace('\W',' ',regex=True)
    clean_text = clean_text.str.lower()
    clean_text = clean_text.str.split()
    return clean_text

training_set['Message'] = clean_msg(training_set['Message'])

In [48]:
training_set.head()

Unnamed: 0,Label,Message
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [49]:
# Capture Vocabulary from Training Dataset
# Calculate N_vocab

vocabulary = []

for sms in training_set['Message']:
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))
n_vocab = len(vocabulary)
print(f'There are {n_vocab} in the vocabulary for the training set.')

There are 7783 in the vocabulary for the training set.


In [50]:
# Count the Occurence of Words in Each Message

word_count_dict = {unique_word: [0] * len(training_set['Message']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['Message']):
    for word in sms:
        word_count_dict[word][index] += 1



In [51]:
# Convert the Word Count Dictionary to a DataFrame

word_counts = pd.DataFrame(word_count_dict)
word_counts.head()

Unnamed: 0,lar,acc,frying,txtauction,6th,else,agent,88066,69876,secs,...,walsall,shoppin,held,netun,jocks,soup,snowman,va,accounts,08717509990
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
# Join the Word Counts Dataframe with the Training Set on Horizontal Axis

clean_training_set = pd.concat([training_set, word_counts],axis=1)

In [53]:
clean_training_set.head()

Unnamed: 0,Label,Message,lar,acc,frying,txtauction,6th,else,agent,88066,...,walsall,shoppin,held,netun,jocks,soup,snowman,va,accounts,08717509990
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Calculate Probability with Naive Bayes

We will classify messages based on probabilities calculated with the Naive Bayes algorithm. The probability each message is either spam or ham is calculated.

$
P(Spam | w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod\nolimits_{i=1}^{n} P(x_i | C_k)
$

*The symbol ∝ means "directly proportional too"*

*The symbol $\prod$ represents the product of the probabilities for each feature $x_i$*

In [54]:
# Segment Training Set by Labels (Spam or Ham)

spam_msg = clean_training_set[clean_training_set['Label']== 'spam']
ham_msg = clean_training_set[clean_training_set['Label']== 'ham']

# Calculate P(Spam) and P(Ham)

p_of_spam = len(spam_msg) / len(clean_training_set)
p_of_ham = len(ham_msg) / len(clean_training_set)

# N_Spam

n_words_per_spam_msg = spam_msg['Message'].apply(len)
n_spam = n_words_per_spam_msg.sum()

# N_Ham

n_words_per_ham_msg = ham_msg['Message'].apply(len)
n_ham = n_words_per_ham_msg.sum()

alpha = 1



In [55]:
# Calculate Parameters P(Wi|Spam) and P(Wi|Ham)
# Each parameter is the conditional probability value associated with each word in the vocabulary.

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_msg[word].sum()
    p_word_given_spam = (n_word_given_spam+alpha)/(n_spam+alpha+n_vocab)
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_msg[word].sum()
    p_word_given_ham = (n_word_given_ham+alpha)/(n_ham+alpha+n_vocab)
    parameters_ham[word] = p_word_given_ham
    

In [56]:
# Define a Function to Classify Messages with Naive Bayes

def classify(message):
    # message: a string

    # Clean and Format Message
    message = re.sub('\W',' ',message)
    message = message.lower()
    message = message.split()

    # Set Initial Probabilities for Spam and Ham
    p_spam_given_message = p_of_spam
    p_ham_given_message = p_of_ham

    # Calculate Conditional Probabilities for Each Word
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    # Print Probabilities
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    # Classify the message as Spam or Ham based on the probabilities
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal Probability')

In [61]:
# Spam Test Message
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3476009873135234e-25
P(Ham|message): 1.9365368329766623e-27
Label: Spam


In [62]:
# Ham Test Message
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4364950561289247e-25
P(Ham|message): 3.687133462921691e-21
Label: Ham


In [64]:
def classify_test_set(message):
    # message: a string

    # Clean and Format Message
    message = re.sub('\W',' ',message)
    message = message.lower()
    message = message.split()

    # Set Initial Probabilities for Spam and Ham
    p_spam_given_message = p_of_spam
    p_ham_given_message = p_of_ham

    # Calculate Conditional Probabilities for Each Word
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    # Classify the message as Spam or Ham based on the probabilities
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'equal probability'

In [71]:
test_set['Prediction'] = test_set['Message'].apply(classify_test_set)
test_set.head()


Unnamed: 0,Label,Message,prediction,Prediction
0,ham,Later i guess. I needa do mcat study too.,ham,ham
1,ham,But i haf enuff space got like 4 mb...,ham,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham,ham


In [75]:
total = test_set.shape[0]
correct = 0

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['Prediction']:
        correct += 1

print(f'Correct:', correct)
print()
print(f'Incorrect:', total-correct)
print()
print(f'Accuracy:', correct / total)

Correct: 1100

Incorrect: 14

Accuracy: 0.9874326750448833
