## Machine Learning
### Assignment 2

*07 February, 2017*  
*Georgios Pastakas*

#### Group Assignment: Creating a SMS Message Spam Filter

In this assignment we will use the data file [SMSSpamCollection](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/), which contains SMS messages, each labelled as one of two types, *ham* or *spam*. Using this data, we will build a Naive Bayes spam filter.

In [242]:
import numpy as np
import pandas as pd
import string
from sklearn.model_selection import train_test_split

# Set of symbols we want to exclude from the messages
exclude = set(string.punctuation + string.digits + "£")

**1.** First, we load the data into a Python data frame.

In [233]:
data = pd.read_csv("./smsspamcollection/SMSSpamCollection", names = ['Type', 'Message'], sep = "\t")

**2.** Before moving on, we will pre-process the SMS messages, by:

1. Removing all punctuation and numbers from the SMS messages
2. Changing all messages to lower case

In [241]:
data["Clear Message"] = list(map(lambda msg: ''.join(ch for ch in msg if ch not in exclude).lower(), data["Message"]))
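
To illustrate what this cleaning step does to a single message, here is a minimal sketch that applies the same rule to a made-up string. The sample text and the helper name `clean` are assumptions introduced for illustration only; they are not part of the notebook.

```python
import string

# Same exclusion set as in the notebook: punctuation, digits and the pound sign
exclude = set(string.punctuation + string.digits + "£")

def clean(msg):
    """Drop excluded characters, then lower-case the message (hypothetical helper)."""
    return ''.join(ch for ch in msg if ch not in exclude).lower()

# Made-up sample message
sample = "WINNER!! Claim your £1000 prize NOW, call 09061701461."
print(clean(sample))
```

Note that removed characters are not replaced by spaces, so a token such as `£1000` disappears entirely while the spaces around it remain.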

**3.** Next, we shuffle the messages and split them into

* a training set (2,500 messages)
* a validation set (1,000 messages) and 
* a test set (2,072 messages).

In [239]:
# Randomly select 3,500 messages out of the 5,572 and use the remaining 2,072 messages as the "test" set
train_and_validation, test = train_test_split(data, train_size = 3500, random_state = 42)

# Split "train_and_validation" set into "train" set and "validation" set
train, validation = train_test_split(train_and_validation, train_size = 2500, random_state = 42)

**4.** While Python’s scikit-learn library has a Naive Bayes classifier, it works with continuous probability distributions and assumes numerical features. Although it is possible to transform categorical variables into numerical features using a binary encoding, we will instead build a simple Naive Bayes classifier from scratch, which includes the following methods:

* `train()`
* `train2()`
* `predict()`
* `score()`

In [256]:
class NaiveBayesForSpam:
  
    def train(self, hamMessages, spamMessages):
        self.words = set(' '.join(hamMessages + spamMessages).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
        self.likelihoods = np.array(self.likelihoods).T

    def train2(self, hamMessages, spamMessages):
        self.words = set(' '.join(hamMessages + spamMessages).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            if prob1 * 20 < prob2:
                self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
                spamkeywords.append(w)
        self.words = spamkeywords
        self.likelihoods = np.array(self.likelihoods).T
    
    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate(self.words):
            if w in message.lower(): # convert to lower-case
                posteriors *= self.likelihoods[:, i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            posteriors = posteriors / np.linalg.norm(posteriors, ord = 1) # normalise
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]
    
    def score(self, messages, labels):
        confusion = np.zeros(4).reshape(2, 2)
        for m, l in zip(messages, labels):
            if self.predict(m)[0] == 'ham' and l == 'ham':
                confusion[0, 0] += 1
            elif self.predict(m)[0] == 'ham' and l == 'spam':
                confusion[0, 1] += 1
            elif self.predict(m)[0] == 'spam' and l == 'ham':
                confusion[1, 0] += 1
            elif self.predict(m)[0] == 'spam' and l == 'spam':
                confusion[1, 1] += 1
        return (confusion[0, 0] + confusion[1, 1]) / float(confusion.sum()), confusion

**5.** The methods of the `NaiveBayesForSpam` class work as follows:

#### `train()`

The `train()` function takes two lists as arguments, `hamMessages` and `spamMessages`, which contain the ham and spam messages, respectively. It merges the two lists and builds the set of all words that appear in any message.

Next, it calculates the prior probabilities $P(ham)$ and $P(spam)$ of a message being ham or spam. The results are stored in an array of length 2, named `priors`.

After that, for each word $W$ in the set of words, it calculates the likelihood of the word appearing in a message given that the message is ham or spam, that is, $P(W \mid ham)$ and $P(W \mid spam)$. Each likelihood is estimated as the fraction of ham (or spam) messages that contain the word, with 1 added to the count to avoid probabilities equal to zero (Laplace estimator). Finally, it caps the calculated probabilities at 0.95 by replacing any larger value with 0.95. The result is stored in the array `likelihoods`, which has two rows (ham and spam) and one column per unique word.
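
The likelihood estimate can be traced by hand on a few messages. The sketch below mirrors the rule used in `train()` (add 1 to the count, divide by the number of messages, cap at 0.95) on made-up data; the helper name `likelihood` and the toy messages are assumptions for illustration.

```python
# Toy training data (made up for illustration)
ham_messages = ["see you at lunch", "call me later", "lunch at noon"]
spam_messages = ["free prize call now", "win a free prize"]

def likelihood(word, messages, cap=0.95):
    """Smoothed fraction of messages containing `word`, capped at `cap`.

    Uses the same substring test (`word in message`) as the notebook.
    """
    count = sum(1 for m in messages if word in m)
    return min((1.0 + count) / len(messages), cap)

for w in ["free", "lunch", "call"]:
    print(w, likelihood(w, ham_messages), likelihood(w, spam_messages))
```

In this toy data the uncapped estimate for "free" in spam would be $(1 + 2) / 2 = 1.5$, so the cap is also what keeps the value a valid probability. Because the test is a substring check, a short word such as "a" would also be counted inside "call" or "later".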

#### `train2()`

The `train2()` function takes the same arguments as `train()` and serves the same purpose. The only difference is that it keeps only those words from the initial set whose probability of appearing in spam messages is more than 20 times their probability of appearing in ham messages, that is, $20 \times P(W \mid ham) < P(W \mid spam)$. Again, it caps the likelihood values of these words at 0.95 by replacing any larger value with 0.95. It also creates an additional list, named `spamkeywords`, which holds these words and replaces the full word set in `self.words`.
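
A minimal sketch of this filtering rule, applied to made-up likelihood pairs. The function name `is_spam_keyword` and the numbers are assumptions for illustration, not values from the data set.

```python
def is_spam_keyword(p_word_given_ham, p_word_given_spam, ratio=20):
    """Keep a word only if it is more than `ratio` times likelier in spam."""
    return p_word_given_ham * ratio < p_word_given_spam

# Made-up (P(W | ham), P(W | spam)) pairs, for illustration only
toy_likelihoods = {
    "free": (0.002, 0.30),
    "call": (0.050, 0.40),
    "txt":  (0.001, 0.15),
}

spam_keywords = [w for w, (p_ham, p_spam) in toy_likelihoods.items()
                 if is_spam_keyword(p_ham, p_spam)]
print(spam_keywords)
```

Here "call" is eight times likelier in spam than in ham, yet it is still dropped because it does not reach the factor of 20.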

####  `predict()`

The `predict()` function takes as its argument a string, `message`, which represents a new SMS message to be classified as ham or spam. First, it makes a copy of the prior probabilities $P(ham)$ and $P(spam)$, named `posteriors`, which will become the posterior probabilities of the new message being ham or spam, that is,

$$P(ham \mid message) = \frac{P(message \mid ham) \times P(ham)}{P(message)}$$

and
$$P(spam \mid message) = \frac{P(message \mid spam) \times P(spam)}{P(message)}$$

where 
$$P(message \mid ham) = P(W_1 \mid ham) \times P(W_2 \mid ham) \times \cdots \times P(W_n \mid ham)$$

and
$$P(message \mid spam) = P(W_1 \mid spam) \times P(W_2 \mid spam) \times \cdots \times P(W_n \mid spam)$$

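if the message contains all words $W_1, W_2, ..., W_n$. For each word $W_i$ that is not contained in the message, $P(W_i \mid ham)$ and $P(W_i \mid spam)$ are replaced by $P(\neg W_i \mid ham) = 1 - P(W_i \mid ham)$ and $P(\neg W_i \mid spam) = 1 - P(W_i \mid spam)$, respectively.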

The function does this computation by initialising $P(ham \mid message) = P(ham)$ and $P(spam \mid message) = P(spam)$. Then it iterates through all the words we have in our classifier and checks whether each word $W_i$ exists in the new message or not. If so, it multiplies the posterior probabilities by $P(W_i \mid ham)$ and $P(W_i \mid spam)$; otherwise it multiplies them by $P(\neg W_i \mid ham) = 1 - P(W_i \mid ham)$ and $P(\neg W_i \mid spam) = 1 - P(W_i \mid spam)$, respectively. After each word, it divides both posterior probabilities by their sum (their L1 norm), which, once all words have been processed, amounts to dividing by

$$P(message) = P(message \mid ham) \times P(ham) + P(message \mid spam) \times P(spam)$$

Up to this point, the function applies Bayes' Theorem. At the end, if the first posterior probability is greater than 0.5, that is, $P(ham \mid message) > 0.5$, the function returns ham as the predicted class of the new message together with its posterior probability of being ham. Otherwise, that is, if $P(ham \mid message) \leq 0.5$, the function returns spam as the class of the new message together with its posterior probability of being spam.
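
The update loop can be followed on a two-word model. This is a pure-Python sketch of the same steps, under the assumption of made-up priors and likelihoods; the function name `posterior` and the dictionary layout are illustrative and differ from the NumPy arrays used in the class.

```python
def posterior(message, words, likelihoods, priors):
    """Return [P(ham | message), P(spam | message)] for a toy model.

    `likelihoods[w]` is a pair (P(w | ham), P(w | spam)).
    """
    post = list(priors)
    for w in words:
        p_ham, p_spam = likelihoods[w]
        if w in message:
            post[0] *= p_ham
            post[1] *= p_spam
        else:
            post[0] *= 1.0 - p_ham
            post[1] *= 1.0 - p_spam
        total = post[0] + post[1]  # renormalise so the pair sums to 1
        post = [post[0] / total, post[1] / total]
    return post

# Made-up model, for illustration only
words = ["free", "lunch"]
likelihoods = {"free": (0.05, 0.60), "lunch": (0.30, 0.02)}
priors = (0.85, 0.15)

spam_like = posterior("free entry now", words, likelihoods, priors)
ham_like = posterior("lunch at noon", words, likelihoods, priors)
print(spam_like, ham_like)
```

Normalising after every word does not change the final ratio between the two posteriors; it only keeps the running products from shrinking towards zero when the vocabulary contains thousands of words.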

#### `score()`

The `score()` function takes two lists as arguments, `messages` and `labels`, which contain the messages to classify and their true labels, respectively. It first creates a $2 \times 2$ matrix of zeros, which serves as the confusion matrix (rows correspond to the predicted class, columns to the actual class). Then, for each pair of message and true label, it compares the predicted class with the true label and increases the corresponding element of the confusion matrix. That is, if

* $predicted = ham$ and $actual = ham$: Increase ***True Negatives*** (***TN***) by 1
* $predicted = ham$ and $actual = spam$: Increase ***False Negatives*** (***FN***) by 1 
* $predicted = spam$ and $actual = ham$: Increase ***False Positives*** (***FP***) by 1 
* $predicted = spam$ and $actual = spam$: Increase ***True Positives*** (***TP***) by 1

Finally, the function returns the accuracy of the model together with the confusion matrix, where the accuracy is calculated as

$$accuracy = \frac{TN + TP}{TN + FN + FP + TP}$$
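
As a small worked example of this bookkeeping, the sketch below builds the same $2 \times 2$ layout (rows for the predicted class, columns for the actual class) from made-up predictions. The function name `confusion_and_accuracy` and the sample labels are assumptions for illustration.

```python
def confusion_and_accuracy(predicted, actual):
    """Count outcomes in a 2x2 table and return (accuracy, table)."""
    # Rows: predicted class, columns: actual class (0 = ham, 1 = spam)
    index = {"ham": 0, "spam": 1}
    confusion = [[0, 0], [0, 0]]
    for p, a in zip(predicted, actual):
        confusion[index[p]][index[a]] += 1
    correct = confusion[0][0] + confusion[1][1]  # TN + TP
    return correct / float(len(predicted)), confusion

# Made-up predictions and true labels
predicted = ["ham", "ham", "spam", "spam", "ham"]
actual = ["ham", "spam", "spam", "ham", "ham"]

accuracy, confusion = confusion_and_accuracy(predicted, actual)
print(accuracy, confusion)
```

With these five messages there are 2 true negatives, 1 false negative, 1 false positive and 1 true positive, so the accuracy is $3 / 5 = 0.6$.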

**6.** Use your training set to train the classifiers `train()` and `train2()`. Note that the interfaces of our classifiers require you to pass the ham and spam messages separately.

In [296]:
# Separate messages of training set to hams and spams
train_hams = list(train[train["Type"] == "ham"]["Clear Message"])
train_spams = list(train[train["Type"] == "spam"]["Clear Message"])

# Build Naive Bayes classifier using "train()" function
# (train() stores its results on the object and returns nothing)
NB_1 = NaiveBayesForSpam()
NB_1.train(train_hams, train_spams)

# Build Naive Bayes classifier using "train2()" function
NB_2 = NaiveBayesForSpam()
NB_2.train2(train_hams, train_spams)

**7.** After training the two classifiers, we will explore how each of them performs out of sample by using the validation set.

In [290]:
# Get messages and labels of the training set
train_messages = list(train["Clear Message"])
train_labels = list(train["Type"])

# Get messages and labels of the validation set 
validation_messages = list(validation["Clear Message"])
validation_labels = list(validation["Type"])

In [292]:
%%time
# Execution time for "train()" classifier on the training set
acc_train_1, cm_train_1 = NB_1.score(train_messages, train_labels)

Wall time: 11min 35s


In [293]:
%%time
# Execution time for "train()" classifier on the validation set
acc_validation_1, cm_validation_1 = NB_1.score(validation_messages, validation_labels)

Wall time: 4min 40s


In [297]:
%%time
# Execution time for "train2()" classifier on the training set
acc_train_2, cm_train_2 = NB_2.score(train_messages, train_labels)

Wall time: 29 s


In [298]:
%%time
# Execution time for "train2()" classifier on the validation set
acc_validation_2, cm_validation_2 = NB_2.score(validation_messages, validation_labels)

Wall time: 10.8 s


> *Comment on accuracy of validation set*

**8.** After using the two classifiers on both the training and the validation set, we will first evaluate them on both their execution time and their accuracy.

#### Accuracy

> *Comment on accuracy in general*

#### Execution Time

The execution time of the `train()` classifier is 00:11:35 for the training set of 2,500 messages and 00:04:40 for the validation set of 1,000 messages, while for the `train2()` classifier it is 00:00:29 for the training set and 00:00:11 for the validation set. The `train2()` classifier is therefore roughly 25 times faster than the `train()` classifier on both sets.

This difference in execution times is expected: in the `train2()` classifier, the only words used to compute the posterior probabilities of a new message being ham or spam are those that satisfy the inequality $20 \times P(W \mid ham) < P(W \mid spam)$, and these are far fewer than the total number of words.


**9.** We will now look at specific classification results. More precisely, we will compare the false positives (ham messages classified as spam messages) and false negatives (spam messages classified as ham messages).

The confusion matrices of the two classifiers are as follows 

> *Continue with parts **9.** and **10.***