# Spam Classifier

This is a supervised Naïve Bayes classifier that predicts whether a given email is spam or ham (not spam).

## Training Data
The training data, shown below, has 1000 rows. A separate test set of 500 rows is functionally identical to the training data.

In [1]:
import numpy as np
from IPython.display import HTML, Javascript, display

# passing the path directly lets NumPy open and close the file itself
training_spam = np.loadtxt("data/training_spam.csv", delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


The training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam (`1`) or not (`0`). The remaining 54 columns are _features_ that are used to build the classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword or character appears in the message and `0` otherwise.
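
The column layout described above can be sketched as follows. This is a minimal illustration on a made-up array with one label column and four feature columns instead of 54; the names `toy`, `labels`, and `features` are hypothetical and not part of the notebook.

```python
import numpy as np

# Toy stand-in for training_spam: column 0 is the label, the rest are features.
toy = np.array([
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
])

labels = toy[:, 0]      # response variable: 1 = spam, 0 = ham
features = toy[:, 1:]   # binary keyword/character indicators

print(labels.shape, features.shape)
```

The same slicing applied to `training_spam` would give a label vector of length 1000 and a feature matrix with 54 columns.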

As mentioned, there is also a 500-row set of *test data*. It contains the same 55 columns.

In [2]:
# passing the path directly lets NumPy open and close the file itself
testing_spam = np.loadtxt("data/testing_spam.csv", delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]


The classifier takes input data and returns class predictions. The input is a single $n \times 54$ numpy array (the feature columns only, without the label column); the classifier returns a numpy array of length $n$ containing the predicted class for each row.
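
Read off from the `SpamClassifier` code below, the decision rule appears to be the following. This is a sketch; the symbols $n_{c,j}$ and $\theta_{c,j}$ are notation introduced here for illustration and do not appear in the original.

$$
\hat{c}(\mathbf{x}) = \arg\max_{c \in \{0,1\}} \left[ \log p(c) + \sum_{j=1}^{k} x_j \log \theta_{c,j} \right],
\qquad
\theta_{c,j} = \frac{n_{c,j} + \alpha}{\sum_{j'=1}^{k} n_{c,j'} + k\alpha}
$$

Here $x_j \in \{0, 1\}$ is the value of feature $j$ in the message, $p(c)$ is the fraction of training messages with class $c$, $n_{c,j}$ is the number of class-$c$ training messages in which feature $j$ appears, $k = 54$ is the number of features, and $\alpha = 1$ is the Laplace smoothing constant. When both classes score equally, the code picks ham (`0`).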

In [3]:
class SpamClassifier:
    def __init__(self, k):
        self.k = k
        
    def estimate_log_class_priors(self, data):
        # extracting the class labels
        class_labels = data[:, 0]

        # counting the occurrences of 0s and 1s in the left-most column
        count_0s = np.sum(class_labels == 0)
        count_1s = np.sum(class_labels == 1)

        # finding the number of samples
        n_samples = len(class_labels)

        # calculating the logarithms of the empirical class priors (0s and 1s)
        log_prob_c0 = np.log(count_0s / n_samples)
        log_prob_c1 = np.log(count_1s / n_samples)

        return np.array([log_prob_c0, log_prob_c1])

    def estimate_log_class_conditional_likelihoods(self, data, alpha=1.0):
        # find and separate spam and ham messages into different arrays (label column dropped)
        ham_data = data[data[:, 0] == 0][:, 1:]
        spam_data = data[data[:, 0] == 1][:, 1:]
        # note: self.k is the number of features and is expected to equal data.shape[1] - 1

        # counting the spam and ham messages
        count_hams = len(ham_data)
        count_spams = len(spam_data)

        # calculate the number of times that each feature (word) appears in spam and ham messages
        n_of_words_ham = ham_data.sum(axis=0)
        n_of_words_spam = spam_data.sum(axis=0)

        # calculating total number of words in spam and ham messages
        total_n_of_words_ham = n_of_words_ham.sum()
        total_n_of_words_spam = n_of_words_spam.sum()

        # Laplace-smoothed log-likelihoods, computed for all features at once
        theta_ham = np.log((n_of_words_ham + alpha) / (total_n_of_words_ham + self.k * alpha))
        theta_spam = np.log((n_of_words_spam + alpha) / (total_n_of_words_spam + self.k * alpha))

        theta = np.array([theta_ham, theta_spam])
        return theta
        
    def train(self):
        self.log_class_priors = self.estimate_log_class_priors(training_spam)
        self.log_class_conditional_likelihoods = self.estimate_log_class_conditional_likelihoods(training_spam, alpha=1.0)
        
    def predict(self, new_data):
        # sum of the ham log-likelihoods weighted by each word's binary value in the new data (0 if not present)
        ham_likelihoods_sum = np.dot(new_data, self.log_class_conditional_likelihoods[0])
        # log of the numerator of the posterior probability that the message is ham
        ham_results = self.log_class_priors[0] + ham_likelihoods_sum

        # sum of the spam log-likelihoods weighted by each word's binary value in the new data (0 if not present)
        spam_likelihoods_sum = np.dot(new_data, self.log_class_conditional_likelihoods[1])
        # log of the numerator of the posterior probability that the message is spam
        spam_results = self.log_class_priors[1] + spam_likelihoods_sum

        # maximum a posteriori estimate for each row (each message); ties go to ham (0)
        class_predictions = np.where(ham_results >= spam_results, 0, 1)
        return class_predictions

    
def create_classifier():
    classifier = SpamClassifier(k=54)
    classifier.train()
    return classifier

classifier = create_classifier()
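
Once trained, the classifier could be checked against the test set by comparing its predictions with the label column. The sketch below shows one way to do that; the helper `accuracy` is hypothetical and not part of the notebook, and the toy arrays are made-up values standing in for `classifier.predict(testing_spam[:, 1:])` and `testing_spam[:, 0]`, not real results.

```python
import numpy as np

def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predictions == true_labels))

# Made-up stand-ins for the classifier output and the true test labels.
toy_predictions = np.array([1, 0, 1, 1])
toy_labels = np.array([1, 0, 0, 1])

print(accuracy(toy_predictions, toy_labels))  # 3 of the 4 toy entries match
```

With the real data, the call would presumably look like `accuracy(classifier.predict(testing_spam[:, 1:]), testing_spam[:, 0])`, since `predict` expects the 54 feature columns without the label.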