**Building a Spam Filter with Naive Bayes**

This project works with a dataset of SMS messages available from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). Details on how the data was collected can be found on this [page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

We will use the dataset to build a spam filter using the Naive Bayes algorithm. 

In [23]:
from google.colab import files
uploaded = files.upload()

Saving SMSSpamCollection to SMSSpamCollection (1)


In [24]:
import pandas as pd
import numpy as np
import io

df = pd.read_csv(io.StringIO(uploaded['SMSSpamCollection'].decode('utf-8')), sep='\t', header=None)
df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [26]:
print(df[0].value_counts())
print('\n')
print(df[0].value_counts(normalize=True))

ham     4825
spam     747
Name: 0, dtype: int64


ham     0.865937
spam    0.134063
Name: 0, dtype: float64


Having read in the data, we can see that it contains 5,572 SMS messages, of which 747 (13.4%) are labelled `spam`. The remaining 4,825 (86.6%) are labelled `ham`, which means non-spam.

**Training and Test Sets**

To enable appropriate testing of the algorithm we will design, the data should be split into two sets:

- Training set, which will make up 80% of the full dataset and be used to train the filter to classify messages

- Testing set, which contains the remaining 20% of the full dataset and will be used to test how effective the filter is at classifying new messages
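
As a rough sketch of the split described above, the helper below shuffles a frame and cuts it at a given fraction instead of a hard-coded row number. The function name `split_train_test` and the toy frame are assumptions made for illustration; unlike the notebook cell that follows, it also resets the index after shuffling.

```python
import pandas as pd

def split_train_test(frame, train_frac=0.8, seed=1):
    """Shuffle `frame` and split it into (train, test) by `train_frac`."""
    # Shuffle all rows with a fixed seed, then renumber them from 0
    shuffled = frame.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Row position at which the training set ends
    cutoff = round(len(shuffled) * train_frac)
    return shuffled[:cutoff].copy(), shuffled[cutoff:].copy()

# Hypothetical toy data: 8 ham and 2 spam messages
toy_df = pd.DataFrame({'Label': ['ham'] * 8 + ['spam'] * 2, 'SMS': ['text'] * 10})
toy_train, toy_test = split_train_test(toy_df)
print(len(toy_train), len(toy_test))
```

For the 5,572-row dataset, `round(5572 * 0.8)` gives the same cutoff of 4,458 that is used in the next cell.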

In [27]:
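#Shuffling the full dataset (fixed seed for reproducibility)
df = df.sample(frac=1, random_state=1)
#First 4,458 rows (80%) for training, remaining 1,114 rows (20%) for testing
#.copy() avoids a SettingWithCopyWarning when the slices are modified later
train = df[:4458].copy()
test = df[4458:].copy()
df = df.reset_index()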

In [28]:
#Checking both datasets have similar ratios of spam and non-spam messages
print(train[0].value_counts(normalize=True), '\n', 
      test[0].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: 0, dtype: float64 
 ham     0.868043
spam    0.131957
Name: 0, dtype: float64


**Data Cleaning**

Before we can proceed, our datasets need to be cleaned into a format that the algorithm can work with easily. Specifically, we will transform the dataset so that each word in the messages becomes a column name, and each row contains the counts of those words in one message.

For example, the message "You won the jackpot!" would appear in the dataframe as:

| you | won | the | jackpot |
|-----|-----|-----|---------|
| 1 | 1 | 1 | 1 |
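
As a rough illustration of this transformation, the sketch below turns a single message into word counts using only the standard library. The helper name `count_words` is hypothetical and not part of the notebook; it mirrors the cleaning steps applied later (replace non-word characters, lower-case, split).

```python
import re
from collections import Counter

def count_words(message):
    """Return a dict mapping each lower-cased word in `message` to its count."""
    # Replace every non-word character (punctuation, symbols) with a space
    cleaned = re.sub(r'\W', ' ', message)
    # Lower-case and split on whitespace to get the individual words
    words = cleaned.lower().split()
    return dict(Counter(words))

counts = count_words("You won the jackpot!")
print(counts)
```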

In [29]:
train = train.rename(columns={0:'Label', 1:'SMS'})
test = test.rename(columns={0:'Label', 1:'SMS'})

train.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [30]:
#Replacing every non-word character (anything other than letters, digits or underscore) with a space
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)

#Changing all characters to lower case
train['SMS'] = train['SMS'].str.lower()

In [31]:
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.lower()
print(train.head(), '\n', test.head())

     Label                                                SMS
1078   ham                       yep  by the pretty sculpture
4028   ham      yes  princess  are you going to make me moan 
958    ham                         welp apparently he retired
4642   ham                                            havent 
4674   ham  i forgot 2 ask ü all smth   there s a card on ... 
      Label                                                SMS
2131   ham          later i guess  i needa do mcat study too 
3418   ham             but i haf enuff space got like 4 mb   
3424  spam  had your mobile 10 mths  update to latest oran...
1538   ham  all sounds good  fingers   makes it difficult ...
5393   ham  all done  all handed in  don t know if mega sh...


**Creating the vocabulary**

We need to create a vocabulary containing every unique word in the training messages. Its size, $N_{Vocabulary}$, appears in the equations we will use to classify a message as spam or not.

In [32]:
#Creating a list from words in SMS column
vocabulary = []
train['SMS'] = train['SMS'].str.split()
for sms in train['SMS']:
  for word in sms:
    vocabulary.append(word)

#Removing duplicates
vocabulary = list(set(vocabulary))

In [43]:
print(len(vocabulary))

7783


In [44]:
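#One list of zeros per vocabulary word, with one slot per training message
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

#Counting how often each word occurs in each message
for index, sms in enumerate(train['SMS']):
  for word in sms:
    word_counts_per_sms[word][index] += 1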

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,callback,nevamind,blankets,years,wanted,sorrows,mum,easy,adventuring,07090298926,utter,umma,rooms,offense,texts,frmcloud,lengths,eek,ki,tomorw,hurricanes,versus,aslamalaikkum,tuesday,89123,mmsto,landlineonly,wake,latelyxxx,loooooool,freak,dust,scared,ü,dubsack,instituitions,watching,nuther,aroundn,am,...,vargu,roses,seconds,enjoyin,mylife,gals,factory,ur,converter,freek,88066,aren,spoke,disconnected,click,bout,toughest,eveb,cld,slo,ppl,mahal,activate,world,operator,unlimited,or,087187262701,join,drugdealer,especially,clearer,regalportfolio,dizzamn,eg,mp3,valuable,gist,steyn,colany
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [45]:
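#Note: concat aligns rows on index labels. train keeps its shuffled index while word_counts
#is numbered from 0, so rows without a matching label are filled with NaN (see row 2 below)
train_clean = pd.concat([train, word_counts], axis=1)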
train_clean.head()

Unnamed: 0,Label,SMS,callback,nevamind,blankets,years,wanted,sorrows,mum,easy,adventuring,07090298926,utter,umma,rooms,offense,texts,frmcloud,lengths,eek,ki,tomorw,hurricanes,versus,aslamalaikkum,tuesday,89123,mmsto,landlineonly,wake,latelyxxx,loooooool,freak,dust,scared,ü,dubsack,instituitions,watching,nuther,...,vargu,roses,seconds,enjoyin,mylife,gals,factory,ur,converter,freek,88066,aren,spoke,disconnected,click,bout,toughest,eveb,cld,slo,ppl,mahal,activate,world,operator,unlimited,or,087187262701,join,drugdealer,especially,clearer,regalportfolio,dizzamn,eg,mp3,valuable,gist,steyn,colany
0,ham,"[go, until, jurong, point, crazy, available, o...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,"[ok, lar, joking, wif, u, oni]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,"[u, dun, say, so, early, hor, u, c, already, t...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
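
The empty row at index 2 above suggests that `train` and `word_counts` do not share the same index: `train` still carries the shuffled labels from the original dataset, whereas `word_counts` is numbered from 0. `pd.concat` with `axis=1` aligns rows on index labels, so unmatched labels are filled with `NaN`. The toy frames below (hypothetical data, not from the dataset) sketch the effect and one possible fix, which is to reset the index before concatenating.

```python
import pandas as pd

# Toy stand-in for `train`: the index is shuffled, as after df.sample(frac=1)
toy_train = pd.DataFrame({'Label': ['ham', 'spam', 'ham']}, index=[7, 2, 5])
# Toy stand-in for `word_counts`: default index 0, 1, 2
toy_counts = pd.DataFrame({'free': [0, 2, 0]})

# Aligns on index labels: only label 2 exists in both frames
misaligned = pd.concat([toy_train, toy_counts], axis=1)
print(misaligned.shape)
print(misaligned['Label'].isna().sum())

# Resetting the index makes both frames share 0, 1, 2, so rows line up by position
aligned = pd.concat([toy_train.reset_index(drop=True), toy_counts], axis=1)
print(aligned)
```

In the misaligned result, the row count grows to the union of both indexes, which would also distort any probability computed by dividing by `len()` of the combined frame.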


**Calculating Constants**

Now that the training data has been cleaned, we can begin to create the spam filter.

The [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) algorithm needs to compute two probabilities in order to classify a new message:

- The probability of the message being spam given the words it contains
- The probability of the message being non-spam given the words it contains

The two calculations above can be represented using the formulae below:

![](https://drive.google.com/uc?export=view&id=12ah4EoGx6oXrI3rXp8fHvgRuWhILHpuU)

To calculate $P(w_i|Spam)$ and $P(w_i|Non-Spam)$ within the formulae above, we also need the following equations:

![](https://drive.google.com/uc?export=view&id=1ko0esu2_ObB5dntp2rrUzslP_O-Wx97M)
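
In case the formula images do not render, the following is a LaTeX restatement reconstructed from the code in this notebook (a product of per-word probabilities with additive smoothing). The symbols $N_{w_i|Spam}$ and $N_{w_i|Ham}$, the number of times word $w_i$ occurs in spam and non-spam messages, are notation assumed here and may differ from the images.

$$P(Spam|w_1,w_2,\ldots,w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i|Spam)$$

$$P(Non-Spam|w_1,w_2,\ldots,w_n) \propto P(Non-Spam) \cdot \prod_{i=1}^{n} P(w_i|Non-Spam)$$

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i|Non-Spam) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$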

Let's start by calculating:

- $P(Spam)$, $P(Non-Spam)$ - probability of the SMS being spam or non-spam respectively
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$ - number of words in spam SMSs, non-spam SMSs and the vocabulary respectively

In [47]:
# Filtering for spam and non-spam messages
spam_messages = train_clean[train_clean['Label'] == 'spam']
non_spam_messages = train_clean[train_clean['Label'] == 'ham']

# Calculating P(Spam) and P(Non-Spam)
p_spam = len(spam_messages) / len(train_clean)
p_non_spam = len(non_spam_messages) / len(train_clean)

print(p_spam, '\n', p_non_spam)

# N_Spam: total number of words across all spam messages
n_spam = spam_messages['SMS'].apply(len).sum()

# N_Ham: total number of words across all non-spam messages
n_non_spam = non_spam_messages['SMS'].apply(len).sum()

# N_Vocabulary: number of unique words in the training vocabulary
n_vocab = len(vocabulary)

# Laplace smoothing constant
alpha = 1

0.11233851338700618 
 0.7223366410784497


**Calculating Parameters**

Now that we have calculated the constants, we need to compute and store the parameters for our two probability formulae:

$P(w_i|Spam)$ and $P(w_i|Non-Spam)$

In [56]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_non_spam = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
  n_word_given_spam = spam_messages[word].sum()
  p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocab)
  parameters_spam[word] = p_word_given_spam

  n_word_given_non_spam = non_spam_messages[word].sum()
  p_word_given_non_spam = (n_word_given_non_spam + alpha) / (n_non_spam + alpha * n_vocab)
  parameters_non_spam[word] = p_word_given_non_spam


In [57]:
print(n_word_given_spam, '\n',
      n_word_given_non_spam, '\n',
      p_word_given_spam, '\n',
      p_word_given_non_spam)

0.0 
 1.0 
 4.3529360553693465e-05 
 3.075976622577668e-05
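
To make the role of the smoothing constant concrete, here is a minimal sketch with made-up counts. The numbers and the helper `smoothed_probability` are illustrative assumptions, not values from the dataset. It shows that a word never seen in a class still receives a small non-zero probability, so a single unseen word does not force the whole product to zero.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocab, alpha=1):
    """Additive (Laplace) smoothing: (count + alpha) / (n_class + alpha * n_vocab)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocab)

# Hypothetical totals: 90 words across all spam messages, 10 distinct vocabulary words
unseen = smoothed_probability(0, n_class=90, n_vocab=10)   # word never seen in spam
seen = smoothed_probability(4, n_class=90, n_vocab=10)     # word seen 4 times in spam
unsmoothed = smoothed_probability(0, n_class=90, n_vocab=10, alpha=0)  # no smoothing
print(unseen, seen, unsmoothed)
```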


**Classifying a New Message**

Now that we have all of our constants and parameters calculated, we can begin to classify messages as either spam or non-spam.

We will create a filter as a function that:

* Takes a new message as input
* Calculates $P(Spam|w_1,...,w_n)$ and $P(Non-Spam|w_1,...,w_n)$
* Compares the values of $P(Spam|w_1,...,w_n)$ and $P(Non-Spam|w_1,...,w_n)$ and:
  - If $P(Spam|w_1,...,w_n)$ > $P(Non-Spam|w_1,...,w_n)$, classifies the message as spam
  - If $P(Spam|w_1,...,w_n)$ < $P(Non-Spam|w_1,...,w_n)$, classifies the message as non-spam
  - If $P(Spam|w_1,...,w_n)$ = $P(Non-Spam|w_1,...,w_n)$, requests human help

In [58]:
import re

def classify(message):

  #Applying the same cleaning as for the training data
  message = re.sub(r'\W', ' ', message)
  message = message.lower()
  message = message.split()

  #Starting from the priors P(Spam) and P(Non-Spam)
  p_spam_given_message = p_spam
  p_non_spam_given_message = p_non_spam

  #Multiplying in P(w_i|Spam) and P(w_i|Non-Spam); words outside the vocabulary are skipped
  for word in message:
    if word in parameters_spam:
      p_spam_given_message *= parameters_spam[word]

    if word in parameters_non_spam:
      p_non_spam_given_message *= parameters_non_spam[word]

  print('P(Spam|message):', p_spam_given_message)
  print('P(Ham|message):', p_non_spam_given_message)

  if p_non_spam_given_message > p_spam_given_message:
      print('Label: Ham')
  elif p_non_spam_given_message < p_spam_given_message:
      print('Label: Spam')
  else:
      print('Equal probabilities, have a human classify this!')

In [60]:
#Testing the filter

classify('WINNER!! This is the secret code to unlock the money: C3421.')
classify("Sounds good, Tom, see u there then")

P(Spam|message): 4.1578268320785084e-29
P(Ham|message): 1.284446514392249e-25
Label: Ham
P(Spam|message): 2.4449942502228983e-24
P(Ham|message): 4.287484325865619e-22
Label: Ham
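
The products above are already as small as 1e-29 for short messages; for longer messages, multiplying many small probabilities can underflow to 0.0. A common workaround, which is a swapped-in technique and not what this notebook does, is to sum logarithms instead of multiplying probabilities. The sketch below assumes toy parameter dictionaries (`toy_spam`, `toy_ham`) and priors invented purely for illustration.

```python
import math
import re

# Hypothetical parameters for illustration only (not computed from the dataset)
toy_spam = {'win': 0.05, 'money': 0.04, 'see': 0.005}
toy_ham = {'win': 0.005, 'money': 0.01, 'see': 0.04}
toy_p_spam, toy_p_ham = 0.13, 0.87

def classify_log(message):
    """Classify by comparing summed log-probabilities; unknown words are skipped."""
    words = re.sub(r'\W', ' ', message).lower().split()
    # Start from the log of each prior
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Adding logs is equivalent to multiplying the probabilities
        if word in toy_spam:
            log_spam += math.log(toy_spam[word])
        if word in toy_ham:
            log_ham += math.log(toy_ham[word])
    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'needs human classification'

print(classify_log('Win money!'))
print(classify_log('See you'))
```

Because the logarithm is monotonic, comparing the two sums gives the same ordering as comparing the two products.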


**Measuring the Spam Filter's Accuracy**

Now that we have created our spam filter, it's time to see how it performs on our test set.

We will update the algorithm to return labels, which we can then feed into the dataset to make direct comparisons.

In [69]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_non_spam_given_message = p_non_spam

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_non_spam:
            p_non_spam_given_message *= parameters_non_spam[word]

    if p_non_spam_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_non_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [74]:
test['predicted'] = test['SMS'].apply(classify_test_set)

#Measuring accuracy
total = test.shape[0]
#Comparing the two columns element-wise and counting the matches
correct = int((test['Label'] == test['predicted']).sum())
accuracy = correct/total
print('Accuracy:', accuracy)
print('Incorrect:', total-correct)
print('Correct:', correct)

Accuracy: 0.8662477558348295
Incorrect: 149
Correct: 965
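
Accuracy alone can hide how the errors are distributed between the two classes. As a hedged sketch, the snippet below uses `pd.crosstab` on a small hand-made frame (the labels are invented for illustration) to count each actual/predicted combination; the same call could be applied to the `Label` and `predicted` columns of `test`.

```python
import pandas as pd

# Hypothetical labels for illustration only
toy = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'ham', 'ham',  'spam', 'spam'],
})

# Rows are actual labels, columns are predicted labels
confusion = pd.crosstab(toy['Label'], toy['predicted'])
print(confusion)

# Share of actual spam that was caught (recall for the spam class)
spam_recall = confusion.loc['spam', 'spam'] / confusion.loc['spam'].sum()
print(spam_recall)
```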


**Next Steps**

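Our filter is almost 87% accurate on the test set. However, 86.8% of the test messages are ham, so a filter that labelled every message as ham would score about the same; both example messages classified earlier, including the "WINNER!!" one, were also labelled ham. In addition, the printed values of $P(Spam)$ and $P(Non-Spam)$ sum to about 0.83 rather than 1, and the `train_clean.head()` output contains a row with no label, which points to misaligned rows when `train` and `word_counts` were concatenated.

To improve on this, we could:

- Reset the index of `train` before concatenating it with `word_counts`, then recompute the constants and parameters.
- Isolate the messages incorrectly classified and attempt to understand why the filter produced the wrong output.
- Enhance the filtering process by making the algorithm case-sensitive.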