# Spam Classifier
## Assignment Preamble
Please ensure you carefully read all of the details and instructions on the assignment page, this section, and the rest of the notebook. If anything is unclear at any time please post on the forum or ask a tutor well in advance of the assignment deadline.

In addition to all of the instructions in the body of the assignment below, you must also follow these technical instructions for all assignments in this unit. *Failure to do so may result in a grade of zero.*
* [At the bottom of the page](#Submission-Test) is some code which checks you meet the submission requirements. You **must** ensure that this runs correctly before submission.
* Do not modify or delete any of the cells that are marked as test cells, even if they appear to be empty.
* Do not duplicate any cells in the notebook – this can break the marking script. Instead, insert a new cell (e.g. from the menu) and copy across any contents as necessary.

Remember to save and backup your work regularly, and double-check you are submitting the correct version.

This notebook is the primary reference for your submission. You may write code in separate `.py` files but it must be clearly imported into the notebook so that it runs without needing to reference those files, and you must explain clearly what functionality is contained in those files (through comments, markdown cells, etc).

As always, **the work you submit for this assignment must be entirely your own.** Do not copy or work with other students. Do not copy answers that you find online. These assignments are designed to help improve your understanding first and foremost – the process of doing the assignment is part of *learning*. They are also used to assess your ability, and so you must uphold academic integrity. Submitting plagiarised work risks your entire place on your degree.

**The pass mark for this assignment is 40%.** We expect that students, on average, will be able to produce a submission which gets a mark between 50% and 70% within the normal workload allocation for the unit, but this will vary depending on individual backgrounds. Please ask for help if you are struggling.

## Getting Started
Spam refers to unwanted email, often in the form of advertisements. In the literature, an email that is **not** spam is called *ham*. Most email providers offer automatic spam filtering, where spam emails will be moved to a separate inbox based on their contents. Of course this requires being able to scan an email and determine whether it is spam or ham, a classification problem. This is the subject of this assignment.

This assignment has two parts. Each part is worth 50% of the overall grade for this assignment.

For part one you will write a supervised learning based classifier to determine whether a given email is spam or ham. You must write and submit the code in this notebook. The training data is provided for you. You may use any classification method. Marks will be awarded primarily based on the accuracy of your classifier on unseen test data, but there are also marks for estimating how accurate you think your classifier will be.

In part two you will produce a short video explaining your implementation, any decisions or extensions you made, and what parameter values you used. This part is explained in more detail on the assignment page. The video file must be submitted with your assignment.

### Choice of Algorithm
While the classification method is a completely free choice, the assignment folder includes [a separate notebook file](data/naivebayes.ipynb) which can help you implement a Naïve Bayes solution. If you do use this notebook, you are still responsible for porting your code into *this* notebook for submission. A good implementation should give a high enough accuracy to get a good grade on this section (50-70%).

You could also consider a k-nearest neighbour algorithm, but this may be less accurate. Logistic regression is another option that you may wish to consider.

If you are looking to go beyond the scope of the unit, you might be interested in building something more advanced, like an artificial neural network. This is possible just using `numpy`, but will require significant self-directed learning. *Extensions like this are left unguided and are not factored into the unit workload estimates.*

**Note:** you may use helper functions in libraries like `numpy` or `scipy`, but you **must not** import code which builds entire models for you. This includes but is not limited to use of libraries like `scikit-learn`, `tensorflow`, or `pytorch` – there will be plenty of opportunities for these libraries in later units. The point of this assignment is to understand and code the actual algorithm yourself. ***If you are in any doubt about any particular library or function please ask a tutor.*** Submissions which ignore this will receive penalties or even zero marks.

If you choose to implement more than one algorithm, please feel free to include your code and talk about it in part two (your video presentation), but only the code in this notebook will be used in the automated testing.

## Training Data
The training data is described below and has 1000 rows. There is also a 500 row set of test data. These are functionally identical to the training data; they are just in a separate CSV file to encourage you to split out your training and test data. You should consider how best to make use of all available data without overfitting, and how to produce an unbiased estimate of your classifier's accuracy.

The cell below loads the training data into a variable called `training_spam`.

In [1]:
import numpy as np
from IPython.display import HTML,Javascript, display

training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


Your training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that you will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
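
As a rough illustration of this layout, the sketch below separates the labels from the features and computes how often each feature appears in spam and in ham. The array `toy` is a small made-up stand-in with only three features, not the real `training_spam` data, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in with the same layout as training_spam:
# column 0 is the label (1 = spam, 0 = ham), the rest are binary features.
toy = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
])

labels = toy[:, 0]       # response variable
features = toy[:, 1:]    # one column per keyword or special character

spam_share = labels.mean()                          # proportion of spam rows
rate_in_spam = features[labels == 1].mean(axis=0)   # share of spam rows with feature = 1
rate_in_ham = features[labels == 0].mean(axis=0)    # share of ham rows with feature = 1

print("spam share:", spam_share)
print("feature rate in spam:", rate_in_spam)
print("feature rate in ham:", rate_in_ham)
```

Features whose rates differ strongly between the two classes are the ones most likely to be useful to a classifier.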

As mentioned there is also a 500 row set of *test data*. It contains the same 55 columns.

In [2]:
testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]
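
One possible way to use all available rows while still getting an honest accuracy estimate is k-fold cross-validation. The sketch below only generates the row indices for each fold. The function name `k_fold_indices`, the seed, and the idea of pooling both files into 1500 rows are assumptions made for illustration, not requirements of the assignment.

```python
import numpy as np

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every row is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    # array_split copes with n_rows that is not an exact multiple of k
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical usage: 1500 rows, e.g. the training and test files stacked together.
splits = list(k_fold_indices(n_rows=1500, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each pair of index arrays can then be used to slice a data array, train on one part, and measure accuracy on the other; averaging the fold accuracies gives the estimate.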


## Part One
Write all of the code for your classifier below this cell. There is some very rough skeleton code in the cell directly below. You may insert more cells below this if you wish, but you must not duplicate any cells as this can break the grading script.

### Submission Requirements
Your code must provide a variable with the name `classifier`. This object must have a method called `predict` which takes input data and returns class predictions. The input will be a single $n \times 54$ numpy array, and your classifier should return a numpy array of length $n$ containing the classifications. There is a demo in the cell below, and a test you can run before submitting to check your code is working correctly.
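
To make the interface concrete, here is a minimal sketch of an object that satisfies it by labelling every message as ham. The class name `AlwaysHamClassifier` and the variable `demo_classifier` are hypothetical; a real submission would assign a trained model to `classifier` instead.

```python
import numpy as np

class AlwaysHamClassifier:
    """Hypothetical baseline: labels every message as ham (0)."""

    def predict(self, data):
        data = np.atleast_2d(data)                # accept a single row as well
        return np.zeros(data.shape[0], dtype=int)

demo_classifier = AlwaysHamClassifier()
demo_input = np.zeros((3, 54), dtype=int)         # three made-up messages
demo_output = demo_classifier.predict(demo_input)

# The contract: a 1-D numpy array with one label per input row.
assert isinstance(demo_output, np.ndarray)
assert demo_output.shape == (demo_input.shape[0],)
print(demo_output)
```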

Your code must run on our test machine in under 30 seconds. If you wish to train a more complicated model (e.g. neural network) which will take longer, you are welcome to save the model's weights as a file and then load these in the cell below so we can test it. You must include the code which computes the original weights, but this must not run when we run the notebook – comment out the code which actually executes the routine and make sure it is clear what we need to change to get it to run. Remember that we will be testing your final classifier on additional hidden data.
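
If you do train offline, one way to store and reload parameters is with `np.savez` and `np.load`. This is only a sketch under assumptions: the weight values are made up, the file name `model_weights.npz` is hypothetical, and a temporary directory is used so the example runs anywhere.

```python
import os
import tempfile
import numpy as np

# Made-up weights standing in for the result of a slow training routine.
weights = np.array([0.25, -1.5, 3.0])
bias = np.array(0.1)

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "model_weights.npz")

# Run once, offline: store the trained parameters.
np.savez(path, weights=weights, bias=bias)

# Run in the submitted notebook: load the parameters instead of retraining.
with np.load(path) as stored:
    loaded_weights = stored["weights"]
    loaded_bias = stored["bias"]

print(np.array_equal(weights, loaded_weights), float(loaded_bias))
```

In a submission the saved file would need to sit alongside the notebook and be included in what you hand in.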

In [3]:
# This cell implements a decision tree classifier that chooses splits
# by information gain.
#
# The tree is learned by the train method, stored as a nested
# dictionary, and converted into rules that predict follows for each
# email. The min_sample and max_depth parameters control when the
# tree stops growing.

# ------------------------------------------------------------
class SpamClassifier:
    def __init__(self):
        self.tree=None
        self.rules=None
        
    # ------------------------------------------------------------
    def entropy(self, prob_yes):
        '''
        Entropy is a measure of the impurity in the data.
        It's 0 if every record has the same value (either NO or YES).
        It's 1 if there is an equal number of records with YES and NO.
        
        Parameters
        ----------
        prob_yes : float

        Returns
        -------
        entropy : float
            defined as E = -(P_yes * log2(P_yes) + P_no * log2(P_no))
            if prob_yes is 0 or 1, entropy is 0 
        '''
        if prob_yes in [0,1]: # only works for binary outcomes because 1 is the max
            return 0.0

        return -(prob_yes*np.log2(prob_yes)+(1-prob_yes)*np.log2(1-prob_yes)).round(5)

    
    # ------------------------------------------------------------
    def conditional_entropy(self, data, attribute, values=[0,1]):
        '''
        Estimate the entropy of a given attribute. An attribute with two distinct values divides
        the training set into two subsets. Each subset has pk positive examples and nk negative examples.
        Here, the remainder(attribute) is calculated as the sum of the calculated entropy of .. 
        .. the "yes probability" (email is spam) for each subset, weighted by the probability of ..
        .. (pk+nk)/training set size. For more information see ..
        .. p. 662, Chapter 19, AI A Modern Approach 4th Edition by Russell & Norvig
        
        Parameters
        ----------
        data : numpy array of N rows x M=55 columns,
               where M_0 refers to the response variable
               and M_1 to M_54 refer to the attributes
        
        attribute : int ranging from 1 to 54 inclusive
        
        values : a list of values (this implementation supports only binary values)
        
        Returns
        -------
        remainder : float          
        '''
        remainder = 0
        # for each attribute possible value (i.e. 0 and 1)
        for value in values:
            # create a filter for the given attribute e.g. data[:, 2]==0
            filt = data[:,attribute]==value
            # and reduce the (response variable) data to those records only
            filtered_data = data[:,0][filt]
            # calculate entropy for that set if the array is non-empty
            if filtered_data.shape[0]>0:
                # first calculate the probability of "yes" for the current attribute value
                prob = filtered_data.sum()/filtered_data.shape[0]
                # calculate its entropy and weight by the number of responses in the data
                remainder += (filtered_data.shape[0]/data.shape[0])*self.entropy(prob)
            # if the array is empty, move on
            else:
                remainder += 0

        return remainder

    # ------------------------------------------------------------
    def info_gain(self, data, avail_attributes):
        '''
        Estimate the information gain which is defined as the expected reduction in entropy.
        That is IG(S,A) = Entropy(S) - Remainder(S, A), where A is an attribute and S the dataset.
        So, for a given training set S we must select the A for which the IG is the biggest.
        
        Parameters
        ----------
        data : numpy array of N rows x M=55 columns,
               where M_0 refers to the response variable
               and M_1 to M_54 refer to the attributes
        
        avail_attributes : list of ints, ranging from 1 to 54 inclusive
                           attributes are dropped as the tree is being built
        
        Returns
        -------
        a list of tuples where each tuple is (information gain, attribute), for example (0.2, 54)
        '''
        
        try:
            # the training set size (number of records)
            total_counts = data[:,0].shape[0]

            if total_counts>0:
                # count spam instances
                positive = data[:,0].sum()
                # calc spam share as % of total
                pos_prob = positive/total_counts
                # calculate entropy
                B = self.entropy(pos_prob)
                
                # calc ham share as % of total
                # neg_prob = 1-pos_prob
                # B = -(pos_prob*np.log2(pos_prob) + neg_prob*np.log2(neg_prob))
                
                info_gain_list = []
                # for every attribute on the list
                for att in avail_attributes:
                    # calculate the remainder value
                    remainder = self.conditional_entropy(data, att)
                    # store in the list
                    info_gain_list.append(((B-remainder).round(4),att))
                
                # return a sorted list of attributes (based on IG)
                return sorted(info_gain_list,reverse=True)
            
            # if there are no records return an empty list
            else:
                return []
        except:
            return []

    # ------------------------------------------------------------
    def most_common_value(self, numpy_bin_arr):
        '''
        Return the most common value in a binary array.
        Specifically, it returns 1 if the sum of the array is greater than or equal to ..
        .. half the size of the array (so ties resolve to 1) and 0 otherwise.
        Parameters
        ----------
        numpy_bin_arr : numpy array comprised of 0s and 1s
        
        Returns
        -------
        return 1 or 0
        '''
        return (numpy_bin_arr.sum()>=numpy_bin_arr.shape[0]*0.5)*1
    
    # ------------------------------------------------------------
    def decision_tree_learning(self, examples, attributes, parent_examples, min_sample, max_depth, counter):
        '''
        This function estimates the decision tree. It uses the information gain criterion to select the 
        attribute that reduces entropy the most and then splits the dataset based on the selected attribute.
        Recursion is used to continue selecting the best attribute and splitting the training set into subsets.
        This process continues until one of the following stopping criteria are met:
        - all examples in the subset have the same class
        - there are no attributes to split the data with
        - the number of observations in the subset is less than a minimum number of observations (min_sample)
        - maximum depth is reached
        
        Parameters
        ----------
        examples : numpy array of N rows x M=55 columns,
                   where M_0 refers to the response variable
                   and M_1 to M_54 refer to the attributes
        
        attributes : list of ints, ranging from 1 to 54 inclusive
                     attributes are dropped as the tree is being built
                     
        parent_examples : int, 1 or 0 (provided by the most_common_value function)
                          represents the default value in case no records exist for the selected
                          combination of attribute values
        
        min_sample : int, for example 50
        
        max_depth : int, for example 4
        
        counter : int, for example 0
        
        Returns
        -------
        tree : a nested dictionary with the following keys
                feature, samples, ig, level, condition, prediction, reason
        '''
        n_obs = examples.shape[0]
        
        #print('counter',counter,'\n','list',str(attributes),'\n')
        #print('missing', [i for i in range(1,55) if i not in attributes])
        #print('-------')
        
        # if the number of records is less than the min sample required, return the default class
        if n_obs < min_sample: #this condition contains the case where n_obs = 0 as min_sample>0
            return {'prediction':parent_examples,'level':counter,'samples':n_obs,'reason':'not adequate sample'}

        # if every example has the same class, return that class
        elif np.all(examples[:,0]==examples[0,0]):
            return {'prediction':examples[0,0],'level':counter,'samples':n_obs,'reason':'single class'}

        # if there are no attributes left to check, return the most common (examples) class
        elif len(attributes)==0:
            # this returns 1 if the "1s" make up at least 50% of the observations, and 0 otherwise
            return {'prediction':self.most_common_value(examples[:,0]),'level':counter,
                    'samples':n_obs,'reason':'no attributes left'}

        else:
            # if the max depth hasn't been reached
            if max_depth>0:
                # calculate the information gain for each attribute
                list_of_att_by_ig = self.info_gain(examples, attributes)

                # extract the best attribute and gain
                best_info_gain, best_attribute = list_of_att_by_ig[0]

                tree = {'feature':best_attribute,
                         'samples':n_obs,
                         'ig':best_info_gain,
                         'level':counter
                         }

                # remove the attribute from the list of attributes to consider
                attributes_temp = [a for a in attributes if a!=best_attribute]

                # the attribute's value is equal to either 0 or 1
                mapp = {0:'left',1:'right'}
                for v in mapp.keys():

                    # reduce examples using the best attribute's values
                    examples_temp = examples[examples[:,best_attribute]==v]


                    # go down the branch for the subset given v (recursion)
                    subtree = self.decision_tree_learning(
                                         examples=examples_temp,
                                         attributes=attributes_temp,
                                         parent_examples=self.most_common_value(examples_temp[:,0]),#examples[:,0]
                                         min_sample=min_sample,
                                         max_depth=max_depth-1,# reduce remaining depth
                                         counter=counter+1 # increase level ie counter
                                        )
                    # store the tree and add "condition" which is a label of the attribute's column
                    # and the current value, e.g. Feat3==1
                    tree[mapp[v]] = {**subtree,
                                         'condition':f"Feat{str(tree['feature'])}=="+str(v)}

                return tree

            # when/if the max depth is reached
            else:

                return {'prediction':self.most_common_value(examples[:,0]),
                        'level':counter,'samples':n_obs,'reason':'max depth reached'}

    # ------------------------------------------------------------
    def extract_rules(self, tree):
        '''
        Parse the tree to generate its rules.
        
        Parameters
        ----------
        tree : a nested dictionary with the following keys
                feature, samples, ig, level, condition, prediction, reason
        
        Returns
        -------
        rules_dict : a nested dictionary with rules for each tree attribute
        
        '''
        rules_dict = dict()
        if 'prediction' in tree:
            feat = tree['condition'].replace("Feat","")
            rules_dict[feat] = tree['prediction']
            return rules_dict
        else:
            if 'condition' not in tree:
                tree['condition'] = None
            if tree['condition'] is not None:
                feat = tree['condition'].replace("Feat","")
                rules_dict[feat] = {**self.extract_rules(tree['left']), **self.extract_rules(tree['right'])}
                return rules_dict
            else:
                return {**self.extract_rules(tree['left']),**self.extract_rules(tree['right'])}

    # ------------------------------------------------------------
    def predict_one(self, testing_data_arr, rules):
        '''
        Predict the response variable based on the test data for a single record.
        Uses recursion to read the parsed rules.
        
        Parameters
        ----------
        testing_data_arr : 1-D numpy array of M=55 values,
               where M_0 refers to the response variable (or a placeholder)
               and M_1 to M_54 refer to the attributes
        
        rules : dictionary of the rules (as provided by the function extract_rules)
        
        Returns
        -------
        int : 0 or 1 (None if no rule matches the record)
        '''
        for i in rules.keys():
            col = int(i.split('==')[0])
            val = int(i.split('==')[1])

            if testing_data_arr[col] == val:
                if isinstance(rules[i], dict):
                    return self.predict_one(testing_data_arr, rules[i])

                else:
                    return rules[i]

    # ------------------------------------------------------------
    def predict(self, data):
        '''
        Predict the response variable based on the test data.
        
        Parameters
        ----------
        data : numpy array of N rows x 54 columns with attributes
        
        Returns
        -------
        a numpy array of size N
        '''
        # if the data is one dimensional [N=1, or shape = (54,)] then reshape into (1,54)
        if data.ndim==1:
            data = data.reshape(1,data.shape[0]).copy()
    
        # IMPORTANT!
        # data is an N x 54 array
        # so we need to add an additional column at position 0 (it will not be used but must be there ..
        # .. to ensure rules consistency, otherwise a rule which is based for example on column 1 ..
        # .. will be read as a rule at column 0 and so forth)
        revised_data = np.column_stack([np.zeros(data.shape[0]),data])
        
        predictions = []
        # predictions are made individually for each email record
        for row in range(revised_data.shape[0]):
            predictions.append(self.predict_one(revised_data[row], self.rules))
        return np.array(predictions)
    
    # ------------------------------------------------------------
    def train(self, training_data, min_sample=None, max_depth=None):
        
        if max_depth is None:
            max_depth = 100 #training_data.shape[0]
        if min_sample is None:
            min_sample = 2
        
        
        # learn the tree
        # the attributes param is a list of integer values [1,2,3,...,54] representing attributes columns
        tree = self.decision_tree_learning(training_data,
                                           attributes=[i for i in range(1,training_data.shape[1])],
                                           parent_examples=None,#self.most_common_value(training_data[:,0])
                                           min_sample=min_sample, # nodes with fewer records become leaves
                                           max_depth=max_depth, # maximum number of splits along any branch
                                           counter=0 # initialise the counter at 0
                                          )
        # parse rules
        self.rules = self.extract_rules(tree)
        
        # set internal tree parameter
        self.tree = tree


# ------------------------------------------------------------
# def cross_validation(data, k, min_sample=None, max_depth=None):
#     '''
#     Model cross validation based on training dataset data
#     '''
#     # split data into k folds
#     N = data.shape[0]
#     fold_size = N//k
    
#     # shuffle indices
#     indices = np.random.permutation(N)
    
#     accuracy_ls = []
#     for fold in range(k):
#         # create validation test
#         validation_indices = indices[fold * fold_size: (fold + 1) * fold_size]
#         validation_set = data[validation_indices,:]
        
#         # create training set
#         training_indices = np.concatenate([indices[:fold * fold_size], indices[(fold + 1) * fold_size:]])
#         training_set = data[training_indices,:]
        
#         if len(set(validation_indices)&set(training_indices))>0:
#             print("Common indices found")
        
#         cl = create_classifier(training_set, min_sample, max_depth)
#         pred = cl.predict(validation_set[:,1:])
#         labels = validation_set[:,0]
#         accur = np.count_nonzero(pred == labels)/labels.shape[0]
#         accuracy_ls.append(accur)

#     cval_res = [k, min_sample, max_depth, fold_size, np.mean(accuracy_ls).round(3), np.median(accuracy_ls).round(3),
#                 np.std(accuracy_ls).round(3), np.max(accuracy_ls).round(3), np.min(accuracy_ls).round(3),
#                 accuracy_ls
#                ]
#     return cval_res

# ------------------------------------------------------------
# def print_tree(tree,prefix='-'):
#     if 'condition' not in tree:
#         tree['condition'] = None
#     if 'feature' not in tree:
#         tree['feature'] = None
#     if 'ig' not in tree:
#         tree['ig'] = None

        
#     if 'prediction' not in tree.keys():
#         print(prefix[:-1]+"|","Lvl:", tree['level'], '| Cond:', tree['condition'], 
#               '| Feat:', tree['feature'], '| IG:',tree['ig'],'| Samples:',tree['samples'])
#     if 'left' in tree.keys():
#         print_tree(tree['left'],prefix=prefix+'-')
#     if 'right' in tree.keys():
#         print_tree(tree['right'],prefix=prefix+'-')
#     if 'prediction' in tree.keys():
#         print(prefix[:-1]+'>',"Lvl:",tree['level'], '| Cond:', tree['condition'], 
#               '| Feat:', tree['feature'], '| IG:',tree['ig'],'| Samples:',
#               tree['samples'],' | Prediction:',tree['prediction'],'| Reason:',tree['reason'])

# ------------------------------------------------------------
def create_classifier(training_data, min_sample=None, max_depth=None):
    
    classifier = SpamClassifier()
    
    classifier.train(training_data=training_data, 
                     min_sample=min_sample,
                     max_depth=max_depth)
    
    return classifier
# ------------------------------------------------------------
classifier = create_classifier(training_spam, min_sample=50, max_depth=5)

# print out tree and rules
# print("Rules:","\n", classifier.rules,"\n")
# print("Tree:")
# print_tree(classifier.tree)
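
To make the entropy and information gain calculations used by `SpamClassifier` concrete, here is a small standalone sketch on a made-up dataset. The helper names `binary_entropy` and `information_gain` are hypothetical and are not part of the class above; they follow the same definitions, IG(S, A) = Entropy(S) - Remainder(S, A), without the rounding.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a binary variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def information_gain(data, attribute):
    """Entropy of the labels (column 0) minus the weighted entropy after splitting."""
    labels = data[:, 0]
    remainder = 0.0
    for value in (0, 1):
        subset = labels[data[:, attribute] == value]
        if subset.size > 0:
            remainder += subset.size / labels.size * binary_entropy(subset.mean())
    return binary_entropy(labels.mean()) - remainder

# Made-up data: column 0 is the label, column 1 predicts it perfectly,
# column 2 carries no information about it.
toy = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
])

gain_perfect = information_gain(toy, 1)   # 1.0 bit: the split removes all uncertainty
gain_useless = information_gain(toy, 2)   # 0.0 bits: both subsets stay 50/50
print(gain_perfect, gain_useless)
```

A tree builder that picks the attribute with the largest gain would therefore split on column 1 first.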

### Accuracy Estimate
In the cell below there is a function called `my_accuracy_estimate()` which returns `0.5`. Before you submit the assignment, write your best guess for the accuracy of your classifier into this function, as a proportion between `0` and `1`. So if you think you will get 80% of inputs correct, return the value `0.8`. This will form a small part of the marking criteria for the assignment, to encourage you to test your own code.

In [4]:
def my_accuracy_estimate():
    return 0.85

Write all of the code for your classifier above this cell.

### Testing Details
Your classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods. At the very high end of the grading scale, your accuracy will also be compared to the best submissions from other students (in your own cohort and others!). Your estimate from the cell above will also factor in, and you will be rewarded for being close to your actual accuracy (overestimates and underestimates will be treated the same).
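
When choosing an estimate, it can help to see how much an accuracy measured on a few hundred rows might vary. The sketch below computes accuracy together with a rough 95% interval using the normal approximation to the binomial. This is an assumption chosen for illustration, not the method used by the marking system, and the labels and predictions are made up.

```python
import numpy as np

def accuracy_with_interval(predictions, labels, z=1.96):
    """Return accuracy and an approximate 95% interval (normal approximation)."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    n = labels.shape[0]
    acc = np.count_nonzero(predictions == labels) / n
    std_err = np.sqrt(acc * (1 - acc) / n)
    return acc, acc - z * std_err, acc + z * std_err

# Made-up example: 500 labels with 65 deliberately wrong predictions.
labels = np.zeros(500, dtype=int)
predictions = labels.copy()
predictions[:65] = 1

acc, low, high = accuracy_with_interval(predictions, labels)
print(f"accuracy={acc:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
```

The width of the interval shrinks as the number of test rows grows, which is one reason to evaluate on as much held-out data as you can spare.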

#### Test Cell
The following code will run your classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

The original skeleton code above classifies every row as ham, but once you have written your own classifier you can run this cell again to test it. So long as your code sets up a variable called `classifier` with a method called `predict`, the test code will be able to run. 

Of course you may wish to test your classifier in additional ways, but you *must* ensure this version still runs before submitting.

**IMPORTANT**: you must set `SKIP_TESTS` back to `True` before submitting this file!

In [5]:
SKIP_TESTS = True

if not SKIP_TESTS:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")

In [6]:
import sys
import pathlib

fail = False;

success = '\033[1;32m[✓]\033[0m'
issue = '\033[1;33m[!]'
error = '\033[1;31m\t✗'

#######
##
## Skip Tests check.
##
## Test to ensure the SKIP_TESTS variable is set to True to prevent it slowing down the automarker.
##
#######

if not SKIP_TESTS:
    fail = True;
    print("{} \'SKIP_TESTS\' is incorrectly set to False.\033[0m".format(issue))
    print("{} You must set the SKIP_TESTS constant to True in the cell above.\033[0m".format(error))
else:
    print('{} \'SKIP_TESTS\' is set to true.\033[0m'.format(success))

#######
##
## File Name check.
##
## Test to ensure file has the correct name. This is important for the marking system to correctly process the submission.
##
#######
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("{} The notebook name is incorrect.\033[0m".format(issue))
    print("{} This notebook file must be named spamclassifier.ipynb\033[0m".format(error))
else:
    print('{} The notebook name is correct.\033[0m'.format(success))

#######
##
## Create classifier function check.
##
## Test that checks the create_classifier function exists. The function should train the classifier and return it so that it can be evaluated by the marking system.
##
#######

if "create_classifier" not in dir():
    fail = True;
    print("{} The create_classifier function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a create_classifier function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    print('{} The create_classifier function has been defined.\033[0m'.format(success))

#######
##
## Classifier variable check.
##
## Test that checks the classifier variable exists. The marking system will use this variable to make predictions based on a set of random features you have not seen. Your score will be based on how well your classifier predicts the hidden labels.
##
#######

if 'classifier' not in vars():
    fail = True;
    print("{} The classifer variable has not been defined.\033[0m".format(issue))
    print("{} Your code must create a variable called \'classifier\' as described in the coursework specification.\033[0m".format(error))
    print("{} This variable should contain the trained classifier you have created.\033[0m".format(error))
else:
    print('{} The classifer variable has been correctly defined.\033[0m'.format(success))

#######
##
## Accuracy Estimation check.
##
## Test that checks the accuracy estimation function exists and is a reasonable value. This is a requirement of the coursework specification and is used by the marking system when generating your final grade.
##
#######

if "my_accuracy_estimate" not in dir():
    fail = True;
    print("{} The my_accuracy_estimate function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a my_accuracy_estimate function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    if my_accuracy_estimate() == 0.5:
        print("{} my_accuracy_estimate function warning.\033[0m".format(issue))
        print("{} my_accuracy_estimate function returns a value of 0.5 - Your classifier is no better than random chance.\033[0m".format(error))
        print("{} Are you sure this is correct.\033[0m".format(error))
    else:
        print('{} The my_accuracy_estimate function has been defined correctly.\033[0m'.format(success))

#######
##
## Test set check.
##
## Test that checks your classifier actually works. The calls made here are the same made by the automarker - albeit with different data. If your work fails this test it will score 0 in the automarker.
##
#######

try:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]
    
    try:
        predictions = classifier.predict(test_data)
        accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
        print('{0} Success running test set - Accuracy was {1:.2f}%.\033[0m'.format(success, (accuracy*100)))
    except Exception as e:
        fail = True
        print("{} Error running test set.\033[0m".format(issue))
        print("{} Your code produced the following error. This error will result in a zero from the automarker, please fix.\033[0m".format(error))
#         print("{} {}\033[0m".format(error, e))
        print(e)
except:
    sys.stderr.write("Unable to run one test as the file \'data/testing_spam.csv\' could not be found.")

#######
##
## Final Summary
##
## Prints the final results of the submission tests.
##
#######

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("\033[1m\n\n")
    print("╔═══════════════════════════════════════════════════════════════╗")
    print("║                        Congratulations!                       ║")
    print("║                                                               ║")
    print("║            Your work meets all the required criteria          ║")
    print("║                   and is ready for submission.                ║")
    print("╚═══════════════════════════════════════════════════════════════╝")
    print("\033[0m")
    

[✓] 'SKIP_TESTS' is set to true.
[✓] The notebook name is correct.
[✓] The create_classifier function has been defined.
[✓] The classifer variable has been correctly defined.
[✓] The my_accuracy_estimate function has been defined correctly.
[✓] Success running test set - Accuracy was 87.00%.



╔═══════════════════════════════════════════════════════════════╗
║                        Congratulations!                       ║
║                                                               ║
║            Your work meets all the required criteria          ║
║                   and is ready for submission.                ║
╚═══════════════════════════════════════════════════════════════╝


In [7]:
# This is a test cell. Please do not modify or delete.