In [1]:
import pandas as pd
import numpy as np
import timeit

In [3]:
# Load the data into a Python data frame
df = pd.read_table("SMSSpamCollection", header = None, names = ["label", "text"])

# Pre-process the SMS messages
# lower case
data1= df.apply(lambda x: x.astype(str).str.lower())

# removing punctuations
data1["text"] = data1["text"].str.replace('[^\w\s]','')

# removing numbers
data1["text"] = data1["text"].str.replace('[0-9]','')

In [4]:
# Shuffle the messages and split them into a training set, validation set, and testing set         
dfs = data1.sample(frac=1).reset_index(drop=True)
training = data1.iloc[:2500, :]
validation = data1.iloc[2500:3500, :]
test = data1.iloc[3500:, :]

In [5]:
# Build a simple Naıve Bayes classiﬁer from scratch

class NaiveBayesForSpam:
    def train(self, hamMessages, spamMessages): 
        self.words = set(''.join(list(hamMessages) + list(spamMessages)).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages) 
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages) 
            self.likelihoods.append([min(prob1, 0.95) , min(prob2, 0.95)])
        self.likelihoods = np.array(self.likelihoods).T

    
    def train2(self, hamMessages, spamMessages):
        self.words = set(''.join(list(hamMessages) + list(spamMessages)).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0] 
        self.likelihoods = []
        spamkeywords = [ ]
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages) 
            if prob1 * 20 < prob2: 
                self.likelihoods.append([min(prob1 , 0.95) , min(prob2 , 0.95)])
                spamkeywords.append(w) 
        self.words = spamkeywords
        self.likelihoods = np.array(self.likelihoods).T
        
    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate(self.words):
            if w in message.lower(): 
                posteriors *= self.likelihoods[:,i] 
            else:
                posteriors *= np.ones(2) - self.likelihoods[:,i] 
            posteriors = posteriors / np.linalg.norm(posteriors, ord = 1) 
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]] 

    def score(self, messages, labels):
        confusion = np.zeros(4).reshape(2, 2) 
        for m, l in zip(messages, labels):
            if self.predict(m)[0] == 'ham' and l == 'ham': 
                confusion[0 ,0] += 1
            elif self.predict(m)[0] == 'ham' and l == 'spam': 
                confusion[0 ,1] += 1
            elif self.predict(m)[0] == 'spam' and l == 'ham': 
                confusion[1 ,0] += 1
            elif self.predict(m)[0] == 'spam' and l == 'spam': 
                confusion[1 ,1] += 1
        return (confusion[0,0] + confusion[1,1]) / float(confusion.sum()), confusion



In [6]:
# Use your training set to train the classiﬁers ‘train’ and ‘train2’
classiﬁer = NaiveBayesForSpam()

classiﬁer.train(training[training["label"] == "ham"]["text"], training[training["label"] == "spam"]["text"])
classiﬁer.train2(training[training["label"] == "ham"]["text"], training[training["label"] == "spam"]["text"])

In [7]:
# Using the validation set, explore how each of the two classiﬁers performs out of sample
'''Using train Function'''
# start timer
start = timeit.default_timer() 

# using train function with training data set
classiﬁer.train(training[training["label"] == "ham"]["text"], training[training["label"] == "spam"]["text"])

# calculate accuracy and confusion matrix
accuracy_1, confusion_matrix_1 = classiﬁer.score(pd.Series.tolist(validation["text"]), pd.Series.tolist(validation["label"]))

In [8]:
# stop timer and print
stop = timeit.default_timer() 
print(accuracy_1, confusion_matrix_1, stop - start)


0.966 [[ 862.   20.]
 [  14.  104.]] 207.40554137714022


In [9]:
'''Using train2 function'''
# start timer
start = timeit.default_timer()

# using train2 function with training data set
classiﬁer.train2(training[training["label"] == "ham"]["text"], training[training["label"] == "spam"]["text"])

# calculate accuracy and confusion matrix
accuracy_2, confusion_matrix_2 = classiﬁer.score(pd.Series.tolist(validation["text"]), pd.Series.tolist(validation["label"]))

# stop timer and print
stop = timeit.default_timer()
print(accuracy_2, confusion_matrix_2, stop - start)

0.969 [[ 871.   26.]
 [   5.   98.]] 7.11251009274892


In [10]:
# Run the ‘train2’ classiﬁer on the test set and report its performance using a confusion matrix.
start = timeit.default_timer()
classiﬁer.train2(training[training["label"] == "ham"]["text"], training[training["label"] == "spam"]["text"])
accuracy_3, confusion_matrix_3 = classiﬁer.score(pd.Series.tolist(test["text"]), pd.Series.tolist(test["label"]))
stop = timeit.default_timer()
print(accuracy_3, confusion_matrix_3, stop - start)

0.969594594595 [[ 1786.    55.]
 [    8.   223.]] 12.290132271242982


### Discussion

Train function:
This is the first step of the learning process where we trained our model. The function essentially prepares available data to be used using Bayes Theorem. This is achieved by filling in tables for prior probability of each words being a spam or a ham. Then, it calculates likelihood based on total number of messages and frequency of words in these messages and build frequency table. The function also uses Laplace Estimator approach to ensure that each feature has a non-zero probability of occurring with each class; here 1 has been added. Similarly, it eliminates situations
where probability equals 1 by replacing it with 0.95

Train2 function:
Train2 is almost identical to train function. The key difference is that it checks whether the probability of a word being spam is 20 times higher than being a ham, then that word will be assigned as spam keyword. Justification is to ensure that only strong candidate for a spam word
will be identified as spam.

Predict function:
This function acted as part of score function (describe below). This function will look at the new message, checks if the words inside the message are in frequency table. If it is, then posteriors are being calculated; if not- it assigns values by subtracting likelihood from 1. The posterior values are being normalised and are classified as spam or ham based on its value comparing to 0.5.

Score function:
This function creates confusion matrix as part of a validation process by comparing the predicted labels against actual labels. This function uses ‘predict’ function described above.

### Speed and accuracy of classifiers

In case of ‘train’ function, the algorithm compares words of new messages to frequency tables resulted from training set. While ‘train 2’ function starts the algorithm in the same way, the difference comes where the function compares the probability of being SPAM against the probability of being HAM multiplied by 20. When such an event occurs, the function appends the word to the list of ‘spam’ key words. This list of key words is being used to identify whether words in new messages are spam or ham. The key efficiency comes from the fact that the algorithm instead of going through the large list of words and their probabilities (as the  case in ‘train’) for every word it is trying to classify, it goes through a shorter list of spam key words.

In ‘train’ function- we get some words, which appear both in spam and ham messages, whereas in ‘train 2’ function it deals with words which have a very high likelihood of being spam.
I also would like to note that in some instances train function gives a better accuracy, however in the long run (with higher iterations) the train2 will perform better.