## Building a Spam Filter With Naive Bayes (from scratch)

## Import Libraries

In [94]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Explore Dataset

In [95]:
# load the dataset
messages = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'SMS'])
messages.head()

| | label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [96]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


    The dataset does not contain any missing values.

In [97]:
# sample message (select by row label and column name)
print(messages.loc[0, 'SMS'])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


In [98]:
# distribution of the label column
messages['label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: label, dtype: float64

The dataset is highly imbalanced: about 87% of the messages are non-spam ("ham" means non-spam) and the remaining 13% are spam. Imbalance is commonly addressed by upsampling or downsampling, but neither technique is used in this project. The classifier is a probabilistic model, and the class priors `P(Spam)` and `P(Ham)` estimated from the training set are expected to account for the difference in class frequencies.

### Training and Test Set
We split the dataset into a training set (80%) and a test set (20%).

In [99]:
# shuffle dataset
messages_sh = messages.sample(frac=1, random_state=1)

test_size = 0.2
divider = int(test_size * len(messages_sh))    # number of rows reserved for the test set

# split dataset
train_set = messages_sh[: -divider].reset_index(drop=True)
test_set = messages_sh[-divider:].reset_index(drop=True)

A sample has to be representative of the population it is drawn from; otherwise the results can be skewed. It is therefore important to check that both splits preserve the class proportions before moving forward with the project.

In [100]:
print(train_set['label'].value_counts(normalize=True))
print(test_set['label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: label, dtype: float64
ham     0.868043
spam    0.131957
Name: label, dtype: float64


Both the training and test sets have approximately the same percentages of non-spam and spam messages as the full dataset: 87% and 13% respectively.
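
The shuffle-and-slice split above happens to preserve the class proportions. As a hedged alternative, the proportions can be enforced by sampling within each class (stratified sampling). The sketch below uses a hypothetical `toy` dataframe, not the SMS dataset, and is not the split used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the messages dataframe: 80% ham, 20% spam
toy = pd.DataFrame({
    'label': ['ham'] * 80 + ['spam'] * 20,
    'SMS': [f'message {i}' for i in range(100)],
})

# Draw 20% of the rows from each label group for the test set,
# then use the remaining rows as the training set
toy_test = toy.groupby('label').sample(frac=0.2, random_state=1)
toy_train = toy.drop(toy_test.index)

print(toy_train['label'].value_counts(normalize=True))
print(toy_test['label'].value_counts(normalize=True))
```

Because the sampling is done per class, both parts keep the 80/20 ratio exactly in this toy case.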

### Cleaning the SMS column
- First, remove all punctuation and convert the text to lowercase
- Then, create the vocabulary (the set of unique words in the training messages)
- Finally, create one column per vocabulary word holding that word's frequency in each message
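
The three steps above can be sketched on two made-up messages before applying them to the full training set. The names `toy_sms`, `cleaned`, `toy_vocabulary` and `toy_counts` are illustrative only and do not appear elsewhere in this notebook.

```python
import re
from collections import Counter

# Two made-up messages (not taken from the dataset)
toy_sms = ["Free entry!! Win a FREE prize", "Ok lar... see u"]

# 1. Replace non-word characters with spaces and lowercase the text
cleaned = [re.sub(r'\W', ' ', sms).lower() for sms in toy_sms]

# 2. Build the vocabulary: every unique word across all messages
toy_vocabulary = sorted({word for sms in cleaned for word in sms.split()})

# 3. Count how often each word occurs in each message
toy_counts = [Counter(sms.split()) for sms in cleaned]

print(toy_vocabulary)
print(toy_counts[0]['free'])   # occurrences of "free" in the first message
```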

#### Letter case and punctuation

In [101]:
train_set['SMS']

0                            Yep, by the pretty sculpture
1           Yes, princess. Are you going to make me moan?
2                              Welp apparently he retired
3                                                 Havent.
4       I forgot 2 ask ü all smth.. There's a card on ...
                              ...                        
4453    Sorry, I'll call later in meeting any thing re...
4454    Babe! I fucking love you too !! You know? Fuck...
4455    U've been selected to stay in 1 of 250 top Bri...
4456    Hello my boytoy ... Geeee I miss you already a...
4457                             Wherre's my boytoy ? :-(
Name: SMS, Length: 4458, dtype: object

In [102]:
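# replace all punctuation with spaces and convert the messages to lowercase
# (regex=True is required in recent pandas versions, where the pattern is otherwise treated literally)
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()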
train_set['SMS']

0                            yep  by the pretty sculpture
1           yes  princess  are you going to make me moan 
2                              welp apparently he retired
3                                                 havent 
4       i forgot 2 ask ü all smth   there s a card on ...
                              ...                        
4453    sorry  i ll call later in meeting any thing re...
4454    babe  i fucking love you too    you know  fuck...
4455    u ve been selected to stay in 1 of 250 top bri...
4456    hello my boytoy     geeee i miss you already a...
4457                             wherre s my boytoy      
Name: SMS, Length: 4458, dtype: object

#### Creating the Vocabulary

In [103]:
# split the sms in each row into a list of words and add the words to a general list - the vocabulary
vocabulary = []

for word_list in train_set['SMS'].str.split():
    for word in word_list:
        vocabulary.append(word)
        
# remove duplicate words
vocabulary = list(set(vocabulary))
print('There are {} unique words in the train_set messages'.format(len(vocabulary)))

There are 7783 unique words in the train_set messages


Next, we count the number of times each word in the vocabulary occurs in each SMS.

In [104]:
word_counts_per_sms = {unique_word: ([0] * len(train_set['SMS'])) for unique_word in vocabulary}

for index, sms in enumerate(train_set['SMS'].str.split()):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [105]:
word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms_df.head()

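| | ready | callertune | maintain | murderer | disk | patients | 930 | hanumanji | 83383 | mac | ... | mone | subs | unhappiness | 88800 | synced | ikea | now | ruining | 07090298926 | pobox365o4w45wq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |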


    The dataframe above has one row per message and one column per vocabulary word; each cell holds the number of times that word occurs in that message.

In [106]:
train_set_final = pd.concat([train_set, word_counts_per_sms_df], axis=1)
train_set_final.head()

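| | label | SMS | ready | callertune | maintain | murderer | disk | patients | 930 | hanumanji | ... | mone | subs | unhappiness | 88800 | synced | ikea | now | ruining | 07090298926 | pobox365o4w45wq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | yep by the pretty sculpture | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | havent | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |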


Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. 

## Creating the Spam Filter
    The general idea of this classifier: for every word in the input, we estimate the probability of that word appearing in spam messages and in non-spam messages. These per-word probabilities are multiplied together with the class prior to give one score per class. The input is classified as spam if the spam score is higher, and as non-spam if the non-spam score is higher.
    
To be able to classify new messages, the Naive Bayes algorithm will need to know the probability values of these two equations:

\begin{equation} P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\ P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham) \end{equation}

where, 
\begin{equation} P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} \end{equation}

***Note***: 
- P(A|B) is read as "probability of A given B"
- So `P(Spam|w1, w2, ..., wn)` is the probability that a message is spam given that it contains the words `w1` to `wn`
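
To make the smoothed estimate concrete, here is a small worked sketch with made-up counts. The helper `smoothed_probability` is a hypothetical name introduced only for this illustration; it mirrors the formula for `P(w_i|Spam)` above.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Additive (Laplace) smoothing: (N_w|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts: the class contains 7 words in total and the vocabulary has 9 words
p_unseen = smoothed_probability(0, 7, 9)            # word never seen in this class: 1 / 16
p_seen = smoothed_probability(3, 7, 9)              # word seen 3 times in this class: 4 / 16
p_no_smoothing = smoothed_probability(0, 7, 9, alpha=0)

print(p_unseen, p_seen, p_no_smoothing)
```

With `alpha = 0`, a word that never occurred in a class gets probability 0, which would zero out the whole product; with `alpha = 1` it gets a small non-zero probability instead.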

### Calculating Constants First
Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:
- `P(Spam)` and `P(Ham)`
- `NSpam`, `NHam`
- `NVocabulary`: number of words in the vocabulary


`NSpam` (or `NHam`) is the total number of words across all spam (or non-spam) messages. It is not the number of spam (non-spam) messages, and it is not the number of unique words in those messages.
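
The distinction between these three quantities is easy to mix up, so here is a short sketch on two made-up, already tokenized spam messages (`toy_spam` is a hypothetical name):

```python
# Two made-up spam messages, already split into words
toy_spam = [['win', 'free', 'prize'], ['free', 'entry', 'free']]

n_spam_messages = len(toy_spam)                                # number of messages
n_spam_words = sum(len(message) for message in toy_spam)       # NSpam: total number of words
n_spam_unique = len({word for message in toy_spam for word in message})  # unique words

print(n_spam_messages, n_spam_words, n_spam_unique)
```

Only the middle quantity, the total word count, plays the role of `NSpam` in the formulas.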

In [107]:
train_set_final['SMS'] = train_set_final['SMS'].str.split()
print(f"Total number of words in all the messages: {train_set_final['SMS'].str.len().sum()}")

Total number of words in all the messages: 72427


In [108]:
# get spam and non-spam messages
spam_messages = train_set_final[train_set_final.label == 'spam']
ham_messages = train_set_final[train_set_final.label == 'ham']

# calculate P(spam) and P(ham)
p_spam = len(spam_messages) / len(train_set_final)
p_ham = len(ham_messages) / len(train_set_final)

# calculate n_spam, n_ham and n_vocabulary
n_spam = spam_messages['SMS'].str.len().sum()
n_ham = ham_messages['SMS'].str.len().sum()
n_vocabulary = len(vocabulary)

# Laplace smoothing parameter
alpha = 1

### Calculating Parameters
Let's now calculate all the parameters using the last two equations above:

In [109]:
spam_parameter = {unique_word: 0 for unique_word in vocabulary}
ham_parameter = {unique_word: 0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()                                    # frequency of a word in spam messages
    p_w_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    spam_parameter[word] += p_w_given_spam
    
    n_word_given_ham = ham_messages[word].sum()                                     # frequency of a word in ham messages
    p_w_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    ham_parameter[word] += p_w_given_ham

#### Building the message classifier

In [125]:
import re

def message_classifier(message, verbose):
    """Classify a message as 'spam' or 'ham'.

    When verbose is truthy, the computed probabilities and the chosen label
    are printed, which is useful for debugging or for following the computation.
    """
    message = re.sub(r'\W+', ' ', message)                      # replace punctuation with spaces
    message = message.lower().split()                           # convert the text to lowercase and split into a list of words
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_parameter:
            p_spam_given_message *= spam_parameter[word]
        
        if word in ham_parameter:
            p_ham_given_message *= ham_parameter[word]

    if verbose:
        print('P(Spam|message):', p_spam_given_message)
        print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        if verbose:
            print('Label: ham')
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        if verbose:
            print('Label: spam')
        return 'spam'
    else:
        if verbose:
            print('Equal probabilities, have a human classify this!')
        return 'human classification needed'

### Classifying a New Message
Some new messages will contain words that are not part of the vocabulary; we simply ignore these words when calculating the probabilities.
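
Multiplying many small probabilities can underflow to zero for long messages. A common workaround, shown here as a sketch rather than as the method used in this notebook, is to add log-probabilities instead. The function `log_scores` and the `toy_*` parameters are hypothetical; the sketch also shows unknown words being skipped.

```python
import math
import re

# Made-up priors and per-word parameters (not the values learned above)
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_parameter = {'free': 0.3, 'prize': 0.2, 'see': 0.05}
toy_ham_parameter = {'free': 0.05, 'prize': 0.05, 'see': 0.3}

def log_scores(message, p_spam, p_ham, spam_parameter, ham_parameter):
    """Return (log score for spam, log score for ham) for a message."""
    words = re.sub(r'\W+', ' ', message).lower().split()
    log_spam, log_ham = math.log(p_spam), math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored for both classes
        if word in spam_parameter and word in ham_parameter:
            log_spam += math.log(spam_parameter[word])
            log_ham += math.log(ham_parameter[word])
    return log_spam, log_ham

log_spam, log_ham = log_scores('FREE prize, xyzzy!', toy_p_spam, toy_p_ham,
                               toy_spam_parameter, toy_ham_parameter)
print('spam' if log_spam > log_ham else 'ham')

# Why logs help: a product of 100 factors of 1e-5 underflows to 0.0,
# while the equivalent sum of logs stays finite
underflowed = 1e-5 ** 100
log_sum = 100 * math.log(1e-5)
```

Since the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the two products.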

In [130]:
message_classifier('WINNER!! This is the secret code to unlock the money: C3421.',verbose=1)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: spam


'spam'

In [131]:
message_classifier("Sounds good, Tom, then see u there", 0)

'ham'

In [132]:
message_classifier("""'Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. 
                   Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out'""", 0)

'spam'

In [129]:
test_set['predicted'] = test_set['SMS'].apply(message_classifier, verbose=0)  # set verbose=1 to print the probabilities for every message
test_set.head()

| | label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [89]:
test_set['predicted'].value_counts(normalize=True)

ham                            0.869838
spam                           0.129264
human classification needed    0.000898
Name: predicted, dtype: float64

    Out of the entire test set, the model classified about 87% of the messages as ham and about 13% as spam. These proportions closely match the actual class proportions in the test set, which is encouraging, but matching proportions say nothing about which individual messages were misclassified.

### Measuring the Spam Filter Accuracy

Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. We'll use accuracy as the metric:

\begin{equation} \text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}} \end{equation}

In [119]:
correct = sum(test_set.label == test_set.predicted)
total = test_set.shape[0]
accuracy = correct / total * 100

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy Score: ', accuracy)

Correct: 1100
Incorrect: 14
Accuracy Score:  98.74326750448833
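
Because the classes are imbalanced, an accuracy figure is easier to interpret next to a trivial baseline that labels every message with the majority class. The sketch below uses a hypothetical `toy_labels` series with roughly the class mix reported earlier (87% ham, 13% spam), not the actual test set.

```python
import pandas as pd

# Made-up labels with an 87/13 class mix
toy_labels = pd.Series(['ham'] * 87 + ['spam'] * 13)

# Baseline: always predict the most frequent class
majority_class = toy_labels.value_counts().idxmax()
baseline_accuracy = (toy_labels == majority_class).mean() * 100

print(majority_class, baseline_accuracy)
```

A classifier is only useful on this kind of data if its accuracy is clearly above that baseline.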


In [120]:
from sklearn.metrics import classification_report

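# classification report (precision, recall and F1 score per class)
# note: scikit-learn expects (y_true, y_pred); the predictions are passed first here,
# so the precision and recall columns are swapped and support counts the predicted labels
print(classification_report(test_set.predicted, test_set.label, zero_division=1))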

                             precision    recall  f1-score   support

                        ham       0.99      0.99      0.99       969
human classification needed       1.00      0.00      0.00         1
                       spam       0.95      0.97      0.96       144

                   accuracy                           0.99      1114
                  macro avg       0.98      0.65      0.65      1114
               weighted avg       0.99      0.99      0.99      1114



    The model reaches about 99% accuracy on the test set, with similarly high precision and recall for both the ham and spam classes.
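
Precision and recall can also be derived by hand from a confusion matrix, which makes it clearer what each number measures. The sketch below uses a hypothetical eight-message `toy` dataframe and `pd.crosstab`; the values are made up and are not the results of this notebook.

```python
import pandas as pd

# Made-up actual and predicted labels
toy = pd.DataFrame({
    'label':     ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham'],
    'predicted': ['spam', 'spam', 'ham',  'ham', 'ham', 'ham', 'spam', 'ham'],
})

# Rows are actual labels, columns are predicted labels
confusion = pd.crosstab(toy['label'], toy['predicted'])

true_positives = confusion.loc['spam', 'spam']    # spam correctly flagged
false_positives = confusion.loc['ham', 'spam']    # ham wrongly flagged as spam
false_negatives = confusion.loc['spam', 'ham']    # spam that slipped through

spam_precision = true_positives / (true_positives + false_positives)
spam_recall = true_positives / (true_positives + false_negatives)

print(confusion)
print(spam_precision, spam_recall)
```

For a spam filter, low precision means legitimate messages are being flagged, while low recall means spam is reaching the inbox.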

In [74]:
misclassified = test_set[test_set.label != test_set.predicted]
misclassified

| | label | SMS | predicted |
|---|---|---|---|
| 114 | spam | Not heard from U4 a while. Call me now am here... | ham |
| 135 | spam | More people are dogging in your area now. Call... | ham |
| 152 | ham | Unlimited texts. Limited minutes. | spam |
| 159 | ham | 26th OF JULY | spam |
| 284 | ham | Nokia phone is lovly.. | spam |
| 293 | ham | A Boy loved a gal. He propsd bt she didnt mind... | human classification needed |
| 302 | ham | No calls..messages..missed calls | spam |
| 319 | ham | We have sent JD for Customer Service cum Accou... | spam |
| 504 | spam | Oh my god! I've found your number again! I'm s... | ham |
| 546 | spam | Hi babe its Chloe, how r u? I was smashed on s... | ham |
