# Spam Classifier

This is a supervised Naïve Bayes classifier that predicts whether a given email is spam or ham (not spam).

## Training Data
The training data is shown below and has 1000 rows. A separate test set of 500 rows has the same format as the training data.

In [2]:
import numpy as np
from IPython.display import HTML,Javascript, display

training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


The training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and indicates whether a message is spam (`1`) or not (`0`). The remaining 54 columns are _features_ that are used to build the classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
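
As a minimal sketch of this layout, the labels and features can be separated by slicing off the first column. The array `toy` below is a made-up stand-in with only three feature columns, not the real data.

```python
import numpy as np

# Hypothetical stand-in for the real data:
# 4 messages, 1 label column followed by 3 binary feature columns.
toy = np.array([
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

labels = toy[:, 0]     # response variable: 1 = spam, 0 = ham
features = toy[:, 1:]  # keyword indicators: 1 = present, 0 = absent

print("labels:", labels)
print("features shape:", features.shape)
print("fraction of spam:", labels.mean())
```

The same two slices applied to `training_spam` would give a length-1000 label vector and a $1000 \times 54$ feature matrix.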

There is also a separate set of *test data* with 500 rows. It contains the same 55 columns.

In [3]:
testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]


The classifier takes input data and returns class predictions. The input is a single $n \times 54$ numpy array of features (without the label column), and the classifier returns a numpy array of length $n$ containing the predicted class (`0` or `1`) for each row.
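
The decision rule implemented in the cell below can be summarised as follows. This is read off the code rather than stated by the original text, and the symbols $n_{c,i}$ (how often feature $i$ is set in training messages of class $c$) and $n_c = \sum_i n_{c,i}$ are shorthand introduced here. $x_i$ is the value of feature $i$ in the message being classified, $\alpha$ is the smoothing parameter, and $k = 54$ is the number of features.

$$
\hat{y} = \arg\max_{c \in \{0, 1\}} \left[ \log P(c) + \sum_{i=1}^{k} x_i \log \theta_{c,i} \right],
\qquad
\theta_{c,i} = \frac{n_{c,i} + \alpha}{n_c + k\alpha}
$$

Working with sums of logarithms instead of products of probabilities avoids numerical underflow. When both classes score equally, the code chooses ham (`0`).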

In [4]:
class SpamClassifier:
    def __init__(self, k):
        self.k = k
        
    def estimate_log_class_priors(self, data):
        # extracting the class labels
        class_labels = data[:, 0]

        # counting the occurrences of 0s and 1s in the left-most column
        count_0s = np.sum(class_labels == 0)
        count_1s = np.sum(class_labels == 1)

        # finding the number of samples
        n_samples = len(class_labels)

        # calculating the logarithms of the empirical class priors (0s and 1s)
        log_prob_c0 = np.log(count_0s / n_samples)
        log_prob_c1 = np.log(count_1s / n_samples)

        return np.array([log_prob_c0, log_prob_c1])

    def estimate_log_class_conditional_likelihoods(self, data, alpha=1.0):
        # find and separate spam and ham messages into different arrays
        ham_data = data[data[:, 0] == 0][:, 1:]
        spam_data = data[data[:, 0] == 1][:, 1:]

        # counting the occurrences of spam and ham messages
        count_hams = len(ham_data)
        count_spams = len(spam_data)

        # calculate the number of times that each feature(word) appears in spam and ham messages
        n_of_words_ham = ham_data.sum(axis=0)
        n_of_words_spam = spam_data.sum(axis=0)

        # calculating total number of words in spam and ham messages
        total_n_of_words_ham = n_of_words_ham.sum()
        total_n_of_words_spam = n_of_words_spam.sum()

        # Laplace-smoothed log-likelihood of each feature given the class,
        # computed for all features at once instead of looping
        theta_ham = np.log((n_of_words_ham + alpha) / (total_n_of_words_ham + self.k * alpha))
        theta_spam = np.log((n_of_words_spam + alpha) / (total_n_of_words_spam + self.k * alpha))

        theta = np.array([theta_ham, theta_spam])
        return theta
        
    def train(self):
        self.log_class_priors = self.estimate_log_class_priors(training_spam)
        self.log_class_conditional_likelihoods = self.estimate_log_class_conditional_likelihoods(training_spam, alpha=1.0)
        
    def predict(self, new_data):
        # sum of the ham log-likelihoods weighted by each word's binary value in the new data (absent words contribute 0)
        ham_likelihoods_sum = np.dot(new_data, self.log_class_conditional_likelihoods[0])
        # log of the numerator of the posterior probability that the message is ham
        ham_results = self.log_class_priors[0] + ham_likelihoods_sum

        # sum of the spam log-likelihoods weighted by each word's binary value in the new data (absent words contribute 0)
        spam_likelihoods_sum = np.dot(new_data, self.log_class_conditional_likelihoods[1])
        # log of the numerator of the posterior probability that the message is spam
        spam_results = self.log_class_priors[1] + spam_likelihoods_sum

        # maximum a posteriori estimate for each row (message); ties are classified as ham (0)
        class_predictions = np.where(ham_results >= spam_results, 0, 1)
        return class_predictions

    
def create_classifier():
    classifier = SpamClassifier(k=54)
    classifier.train()
    return classifier

classifier = create_classifier()
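
To see the estimation and prediction steps on numbers small enough to check by hand, here is a self-contained sketch of the same Laplace-smoothed scheme. It is an illustration under assumptions, not the notebook's class: the helper `log_likelihoods` and the toy arrays are hypothetical, and the number of features is inferred from the data rather than passed in as `k`.

```python
import numpy as np

def log_likelihoods(features, labels, alpha=1.0):
    """Laplace-smoothed log-likelihoods, one row per class (0 = ham, 1 = spam)."""
    k = features.shape[1]  # number of features
    theta = np.empty((2, k))
    for c in (0, 1):
        counts = features[labels == c].sum(axis=0)  # per-feature counts in class c
        theta[c] = np.log((counts + alpha) / (counts.sum() + k * alpha))
    return theta

# Hypothetical toy data: 4 messages, 3 keyword features.
features = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]])
labels = np.array([1, 0, 1, 0])

theta = log_likelihoods(features, labels)
log_priors = np.log(np.bincount(labels, minlength=2) / labels.size)

# Score every message against both classes at once: result has shape (n, 2).
scores = log_priors + features @ theta.T
predictions = np.argmax(scores, axis=1)  # argmax picks class 0 (ham) on ties
print(predictions)
```

With $\alpha = 1$, a keyword that never appears in one class still receives a small non-zero probability, so a single unseen keyword cannot force the score of that class to $-\infty$.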

### Testing Details
The classifier is tested against hidden data drawn from the same source as the training data, and its accuracy (the fraction of messages classified correctly) is reported.
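
As a small sketch of what this measurement involves, accuracy is the mean of the element-wise comparison between predictions and true labels. The helper name `accuracy` and the two arrays below are made up for illustration. Counting false positives and false negatives separately is an optional extra that shows which kind of mistake the classifier makes.

```python
import numpy as np

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return np.mean(predictions == labels)

# Hypothetical predictions and true labels for five messages.
predictions = np.array([1, 0, 1, 1, 0])
labels = np.array([1, 0, 0, 1, 0])

acc = accuracy(predictions, labels)
false_positives = np.sum((predictions == 1) & (labels == 0))  # ham flagged as spam
false_negatives = np.sum((predictions == 0) & (labels == 1))  # spam let through

print(f"accuracy={acc}, false positives={false_positives}, false negatives={false_negatives}")
```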

In [13]:
SKIP_TESTS = True

if not SKIP_TESTS:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")

In [10]:
import sys
import pathlib

fail = False;

success = '\033[1;32m[✓]\033[0m'
issue = '\033[1;33m[!]'
error = '\033[1;31m\t✗'

#######
##
## Skip Tests check.
##
## Test to ensure the SKIP_TESTS variable is set to True to prevent it slowing down the automarker.
##
#######

if not SKIP_TESTS:
    fail = True;
    print("{} \'SKIP_TESTS\' is incorrectly set to False.\033[0m".format(issue))
    print("{} You must set the SKIP_TESTS constant to True in the cell above.\033[0m".format(error))
else:
    print('{} \'SKIP_TESTS\' is set to true.\033[0m'.format(success))

#######
##
## File Name check.
##
## Test to ensure file has the correct name. This is important for the marking system to correctly process the submission.
##
#######
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("{} The notebook name is incorrect.\033[0m".format(issue))
    print("{} This notebook file must be named spamclassifier.ipynb\033[0m".format(error))
else:
    print('{} The notebook name is correct.\033[0m'.format(success))

#######
##
## Create classifier function check.
##
## Test that checks the create_classifier function exists. The function should train the classifier and return it so that it can be evaluated by the marking system.
##
#######

if "create_classifier" not in dir():
    fail = True;
    print("{} The create_classifier function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a create_classifier function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    print('{} The create_classifier function has been defined.\033[0m'.format(success))

#######
##
## Classifier variable check.
##
## Test that checks the classifier variable exists. The marking system will use this variable to make predictions based on a set of random features you have not seen. Your score will be based on how well your classifier predicts the hidden labels.
##
#######

if 'classifier' not in vars():
    fail = True;
    print("{} The classifer variable has not been defined.\033[0m".format(issue))
    print("{} Your code must create a variable called \'classifier\' as described in the coursework specification.\033[0m".format(error))
    print("{} This variable should contain the trained classifier you have created.\033[0m".format(error))
else:
    print('{} The classifer variable has been correctly defined.\033[0m'.format(success))

#######
##
## Accuracy Estimation check.
##
## Test that checks the accuracy estimation function exists and is a reasonable value. This is a requirement of the coursework specification and is used by the marking system when generating your final grade.
##
#######

if "my_accuracy_estimate" not in dir():
    fail = True;
    print("{} The my_accuracy_estimate function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a my_accuracy_estimate function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    if my_accuracy_estimate() == 0.5:
        print("{} my_accuracy_estimate function warning.\033[0m".format(issue))
        print("{} my_accuracy_estimate function returns a value of 0.5 - Your classifier is no better than random chance.\033[0m".format(error))
        print("{} Are you sure this is correct.\033[0m".format(error))
    else:
        print('{} The my_accuracy_estimate function has been defined correctly.\033[0m'.format(success))

#######
##
## Test set check.
##
## Test that checks your classifier actually works. The calls made here are the same made by the automarker - albeit with different data. If your work fails this test it will score 0 in the automarker.
##
#######

try:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]
    
    try:
        predictions = classifier.predict(test_data)
        accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
        print('{0} Success running test set - Accuracy was {1:.2f}%.\033[0m'.format(success, (accuracy*100)))
    except Exception as e:
        fail = True
        print("{} Error running test set.\033[0m".format(issue))
        print("{} Your code produced the following error. This error will result in a zero from the automarker, please fix.\033[0m".format(error))
#         print("{} {}\033[0m".format(error, e))
        print(e)
except:
    sys.stderr.write("Unable to run one test as the file \'data/testing_spam.csv\' could not be found.")

#######
##
## Final Summary
##
## Prints the final results of the submission tests.
##
#######

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("\033[1m\n\n")
    print("╔═══════════════════════════════════════════════════════════════╗")
    print("║                        Congratulations!                       ║")
    print("║                                                               ║")
    print("║            Your work meets all the required criteria          ║")
    print("║                   and is ready for submission.                ║")
    print("╚═══════════════════════════════════════════════════════════════╝")
    print("\033[0m")
    

[1;32m[✓][0m 'SKIP_TESTS' is set to true.[0m
[1;32m[✓][0m The notebook name is correct.[0m
[1;32m[✓][0m The create_classifier function has been defined.[0m
[1;32m[✓][0m The classifer variable has been correctly defined.[0m
[1;32m[✓][0m The my_accuracy_estimate function has been defined correctly.[0m
[1;32m[✓][0m Success running test set - Accuracy was 89.80%.[0m
[1m


╔═══════════════════════════════════════════════════════════════╗
║                        Congratulations!                       ║
║                                                               ║
║            Your work meets all the required criteria          ║
║                   and is ready for submission.                ║
╚═══════════════════════════════════════════════════════════════╝
[0m


In [8]:
# This is a test cell. Please do not modify or delete.