# Introduction

Building on my recent studies of probability and conditional probability, today I will attempt to build a spam filter using Naive Bayes.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
# importing libraries

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import urllib.request
import re

from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile

In [2]:
# loading the dataset

uci_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/"
data_name = "smsspamcollection.zip"
response = urllib.request.urlopen(uci_url + urllib.request.quote(data_name))

zipfile = ZipFile(BytesIO(response.read()))

data = TextIOWrapper(zipfile.open('SMSSpamCollection'), encoding= 'utf-8')

df = pd.read_csv(data, header= None, sep='\t')

df

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
df.rename(columns=
          {0:'label',
           1:'message'},
          inplace= True)

In [4]:
df.describe()

Unnamed: 0,label,message
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [6]:
stats= df['label'].value_counts(normalize=True)*100
stats

ham     86.593683
spam    13.406317
Name: label, dtype: float64

We can see that about 86.6% of the entries in the dataset are `ham` (non-spam) and the remaining 13.4% are `spam`.

The most common message is "Sorry, I'll call later", which appears 30 times. It is plainly not a spam message and is most likely a quick-reply SMS text.

# Training and Test Set

Before we start building the spam filter, it is important to decide how we will test it. Creating the spam-detecting software first and designing the test afterwards could bias the test design.

To test the spam filter, we're first going to split our dataset into two categories:

- A **training set**, which we'll use to "train" the computer how to classify messages.
- A **test set**, which we'll use to test how good the spam filter is at classifying new messages.

We will use an 80-20 split between training and test data. Because the test set, 20% of the data, has already been classified by a human, we can treat its messages as new messages once the spam filter is ready and compare the algorithm's classification with the human one. This will tell us how good the spam filter actually is at predicting whether a message is spam or not.


In [7]:
# randomizing the dataset
data = df.sample(frac= 1, random_state= 1)

# calculating index for split
train_size = 0.8
train_end = int(len(data) * train_size)

# Train/Test split
df_train = data[:train_end].reset_index(drop= True)
df_test = data[train_end:].reset_index(drop= True)

print("The percentage of spam in train set is", "\n", df_train['label'].value_counts(normalize=True)*100)
print("\n")
# the test-set proportions are computed from df_test
print("The percentage of spam in test set is", "\n", df_test['label'].value_counts(normalize=True)*100)


The percentage of spam in train set is 
 ham     86.53803
spam    13.46197
Name: label, dtype: float64


The percentage of spam in test set is 
 ham     86.816143
spam    13.183857
Name: label, dtype: float64


Both the training and test sets have similar percentages of `ham` and `spam` messages (86.5% and 86.8% `ham`, respectively), close to the 86.6% in the full dataset. The split therefore preserves the class ratio, which makes it a sound basis for training and evaluating the algorithm.

# Data Cleaning
## Letter Case and Punctuation

Before we can apply the Naive Bayes algorithm, we need to bring the data into a form that is easy to work with.

A message may be written entirely in capitals, or contain punctuation and other special characters. We will therefore replace every character that is not a letter, digit or underscore with a space and convert the text to lowercase. Afterwards, we will split the messages in the `message` column and transform them into a series of new columns, where each column represents a unique word from the vocabulary.

In [8]:
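# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
df_train['message'] = df_train['message'].str.replace(r'\W', ' ', regex=True).str.lower()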
df_train.head()

Unnamed: 0,label,message
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


# Creating the Vocabulary

Now that we have removed the punctuation and converted all letters to lowercase, our end goal is to transform the table above into one where each word has its own column and is counted for every `spam` and `ham` message.

In [9]:
# transforming each word into a column

df_train['message'] = df_train['message'].str.split()

vocabulary = []

for sms in df_train['message']:
    for word in sms:
        vocabulary.append(word)

# removing duplicates using set() function and then turning it back to a list
vocabulary = list(set(vocabulary))

There are 7,782 unique words in the vocabulary:

In [10]:
print(len(vocabulary))

7782
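
As a side note, the same vocabulary can be built in a single pass with a set comprehension. The sketch below uses a tiny made-up list of tokenized messages (`toy_messages` is a hypothetical stand-in for `df_train['message']`) and sorts the result so that the order is reproducible.

```python
# Hypothetical tokenized messages standing in for df_train['message']
toy_messages = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]

# The set comprehension removes duplicates; sorted() fixes the order
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)       # ['entry', 'free', 'lar', 'ok']
print(len(toy_vocabulary))  # 4
```

Sorting is optional: a plain `list(set(...))` works equally well for the filter, but its order can change between runs.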


# The Final Training Set

We will create a dictionary called `word_counts_per_message` where each key is a unique word from the vocabulary and each value is a list as long as the training set, with every element initialised to `0`.

Then we loop over `df_train['message']` and, for every word in a message, increment the counter stored at that message's index in the word's list.



In [11]:
word_counts_per_message = {unique_word: [0] * len(df_train['message']) for unique_word in vocabulary}

for index, sms in enumerate(df_train['message']):
    for word in sms:
        word_counts_per_message[word][index] += 1
        
# transforming the dictionary in a dataframe
word_counts = pd.DataFrame(word_counts_per_message)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
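
The real table is too wide to inspect by eye, so here is the same procedure on three made-up messages (all `toy_` names are hypothetical). It is only a sketch of the counting logic, small enough to verify by hand.

```python
import pandas as pd

# Hypothetical tokenized messages
toy_messages = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of the current message for every word it contains
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

Each row corresponds to one message and each column to one word: the first row reads 1, 2, 0, 0 for `entry`, `free`, `lar`, `ok`, because "free" occurs twice in the first message.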


In [None]:
# Concatenating the dataframes so that the label and message columns sit next to the word counts
df_train = pd.concat([df_train, word_counts], axis= 1)
df_train.head()

Unnamed: 0,label,message,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Calculating Constants First

We're now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Some of the terms in the four equations above will have the same value for every new message. We can calculate the value of these terms once and avoid repeating the computations when a new message comes in. Below, we'll use our training set to calculate:

\begin{equation} P(Spam)\ \text{and}\ P(Ham) \end{equation}

\begin{equation} N_{Spam},\ N_{Ham},\ N_{Vocabulary} \end{equation}

We'll also use Laplace smoothing and set $\alpha = 1$.
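
To see what the smoothing does, here is a small worked example with made-up counts (the function name `smoothed_probability` and all numbers are assumptions for illustration). It evaluates the smoothed formula for a class containing 10 words in total and a vocabulary of 5 unique words.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical per-word counts inside one class; they add up to 10 words
toy_word_totals = [3, 4, 2, 1, 0]
toy_n_class = sum(toy_word_totals)
toy_n_vocabulary = len(toy_word_totals)

toy_probabilities = [
    smoothed_probability(n, toy_n_class, toy_n_vocabulary) for n in toy_word_totals
]

# A word seen 3 times gets (3 + 1) / (10 + 5) = 4/15,
# a word never seen in the class still gets (0 + 1) / (10 + 5) = 1/15
print(toy_probabilities[0], toy_probabilities[-1])
print(sum(toy_probabilities))
```

Without smoothing, the unseen word would have probability 0 and wipe out the whole product in the classification formulas. With $\alpha = 1$ it keeps a small non-zero probability, and the probabilities over the vocabulary still sum to 1 (up to floating-point rounding).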

In [None]:
# isolating the spam and ham messages by label
# (columns 0 and 1 are selected by position: after the concatenation, a
# vocabulary word such as "message" may share its name with these columns)
labels = df_train.iloc[:, 0]
spam_messages = df_train[labels == 'spam']
ham_messages = df_train[labels == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(df_train)
p_ham = len(ham_messages) / len(df_train)

# Calculating N_Spam: total number of words in all spam messages
n_words_per_spam_message = spam_messages.iloc[:, 1].apply(len)
n_spam = n_words_per_spam_message.sum()

# Calculating N_Ham: total number of words in all ham messages
n_words_per_ham_message = ham_messages.iloc[:, 1].apply(len)
n_ham = n_words_per_ham_message.sum()

# Calculating NVocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

# Calculating Parameters

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter is a conditional probability value associated with one word in the vocabulary.

The parameters are calculated using the formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

In [None]:
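# initializing every parameter with 0
parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham = {unique_word: 0 for unique_word in vocabulary}

# word counts restricted to the spam and ham rows; word_counts has exactly
# one column per vocabulary word, so lookups by word are unambiguous
spam_word_counts = word_counts.loc[spam_messages.index]
ham_word_counts = word_counts.loc[ham_messages.index]

# Calculating the parameters
for word in vocabulary:
    n_word_given_spam = spam_word_counts[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_word_counts[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham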
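
With the parameters in place, the two proportionality formulas from the "Calculating Constants First" section can be evaluated for a message. The following is only a sketch under stated assumptions: `toy_classify`, the toy priors and the toy parameter dictionaries are all hypothetical, and words missing from the vocabulary are simply skipped, which is a choice of this sketch rather than something fixed by the text above.

```python
import re

def toy_classify(message, p_spam, p_ham, params_spam, params_ham):
    """Compare P(Spam) * prod P(w_i|Spam) with P(Ham) * prod P(w_i|Ham)."""
    # Same cleaning as for the training set: strip non-word characters, lowercase, split
    words = re.sub(r"\W", " ", message).lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in words:
        # Assumption: words that are not in the vocabulary are ignored
        if word in params_spam:
            p_spam_given_message *= params_spam[word]
        if word in params_ham:
            p_ham_given_message *= params_ham[word]

    if p_spam_given_message > p_ham_given_message:
        return "spam"
    if p_ham_given_message > p_spam_given_message:
        return "ham"
    return "undecided"

# Hypothetical priors and parameters, chosen by hand for illustration
toy_params_spam = {"free": 0.4, "win": 0.4, "ok": 0.1, "later": 0.1}
toy_params_ham = {"free": 0.1, "win": 0.1, "ok": 0.4, "later": 0.4}

toy_result_1 = toy_classify("Free WIN!!", 0.2, 0.8, toy_params_spam, toy_params_ham)
toy_result_2 = toy_classify("ok, later", 0.2, 0.8, toy_params_spam, toy_params_ham)
print(toy_result_1, toy_result_2)  # spam ham
```

For the first message the spam score is 0.2 · 0.4 · 0.4 = 0.032 against a ham score of 0.8 · 0.1 · 0.1 = 0.008, so it is labelled spam even though the prior favours ham. For long messages the products become very small, so summing logarithms instead of multiplying probabilities is a common alternative.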
