# Assignment 02 Part 2: Neural Net Template

For this assignment I tried a few different approaches to building neural network digit classifiers. I built a binary classifier, working similarly to the perceptron but with hidden layers, a one-hot encoded softmax classifier, and a multi-class real-number classifier.

The scores are compiled below the code, alongside reflections.

## Data + Imports

In [None]:
'''Imports'''
import numpy as np
import random
from sklearn import metrics
import math
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import pandas as pd


'''Data'''
#loading training data inputs + labels 
train_data = np.loadtxt("mnist_train.csv", delimiter = ",")
train_input = [ d[1:] for d in train_data ]
#stacking 1s on as biases
train_input = (np.hstack((np.ones((60000, 1)),train_input)))

#loading testing data inputs + labels 
test_data = np.loadtxt("mnist_test.csv", delimiter = ",")
test_input = [ d[1:] for d in test_data ]
#stacking 1s on as biases
test_input = (np.hstack((np.ones((10000, 1)),test_input)))

'''Binary'''
target_digit = 7
train_label_binary = np.array([ int(d[0] == target_digit) for d in train_data]).T
test_label_binary = np.array([ int(d[0] == target_digit) for d in test_data]).T

'''MultiClass'''
train_label_multi = np.array([ int(d[0]) for d in train_data]).T
test_label_multi = np.array([ int(d[0]) for d in test_data]).T

'''One Hot'''
train_label_one_hot = np.asarray(train_data[:, :1], dtype=float)
test_labels_one_hot = np.asarray(test_data[:, :1], dtype=float)

# transform labels into one hot representation
# comparing the (n, 1) label column with the (10,) digit range broadcasts to (n, 10)
lr = np.arange(10)
train_labels_one_hot = (lr==train_label_one_hot).astype(float)
test_labels_one_hot = (lr==test_labels_one_hot).astype(float)

## Neural Network Class

In [None]:
random.seed(42)
np.random.seed(42) #the weights below are drawn with np.random, so seed it as well

class ANN:
    
    '''2.2. Complete the initialisation of the neural net.'''
    
    def __init__(self, no_inputs = 784, #input pixels; the bias column is stacked on separately
                 max_iterations = 65,
                 config  = (70, 50, 50, 50, 28, 1), #nodes per layer; the last entry is the output layer
                 learning_rate = 0.0003, #learning rate
                 activation_function = "relu", #"relu" (rectifier) or "sigmoid"
                 target = 7): #target digit for binary classification
                
        self.max_iterations = max_iterations
        self.learning_rate = learning_rate
        self.config = config
        self.activation_function = activation_function 
        self.target = target
        self.no_inputs = no_inputs
        
        #initialising weights
        #uniform random values in [-1, 1) scaled by 1/fan-in, with a column of 1s stacked on as initial bias weights
        self.w = []
        for i in range(len(self.config)):
            self.w.append(None)
            
        for i in range(len(self.config)):
            if i == 0: 
                #self.w[i] = np.array(np.random.random((config[i], no_inputs)))
                self.w[i] = (2*np.random.random((config[i], no_inputs)) - 1)/ no_inputs
                self.w[i] = np.hstack((np.ones((config[i], 1)), self.w[i]))
            else:
                #self.w[i] = np.array(np.random.random(-1, 1, (config[i], config[i-1])))/config[i-1]
                self.w[i] = (2*np.random.random((config[i], config[i-1])) - 1) / config[i-1]
                self.w[i] = (np.hstack((np.ones((config[i], 1)), self.w[i])))                         

        self.last_output = []
        for i in range(len(self.w)):
            self.last_output.append(None)
            
        self.error = []
        for i in range(len(self.w)):
            self.error.append(None)
        
        self.gv = []
        for i in range(len(self.w)):
            self.gv.append(None)
    
    ##============###
    ##  Functions ###
    ##============### 
    
    def print_details(self):
        print("Target:\t" + str(self.target))
        print("No. inputs:\t" + str(self.no_inputs))
        print("Max iterations:\t" + str(self.max_iterations))
        print("Learning rate:\t" + str(self.learning_rate))
        print("Architecture:\t" + str(self.config))

    
    def sigmoid(self, a):
        return 1/ (1 + np.exp(-1 * a))
    
    '''2.5. Implement the rectifier activation function.
    
    The forward phase can be updated to use the alternate activation function via activation 
    parameter.
    Backpropagation is also updated to use the alternative derivative via this activation 
    parameter.
    > 95% accuracy is achieved with the binary and multiclass neural networks
    
    '''
    
    def rectifier(self, a):
        return ((a > 0) * a)
        
    def dfrectifier(self, a):
        return ((a > 0) * 1)
    #===============================#
    #feed forward
    def prediction(self,x): 
        for i in range(len(self.w)):
            #if first layer
            if i == 0: 
                #a = weights * inputs
                a = np.dot(x, self.w[i].T)
            #for all other layers
            else: 
                #a = weights * outputs of previous layers
                a = np.dot(self.last_output[i-1], self.w[i].T)

            #run through activation + store
            if i < len(self.w)-1:
                if self.activation_function == "sigmoid":
                    self.last_output[i] = (np.hstack((np.ones((len(a),1)),self.sigmoid(a))))
                else: #ReLu
                    self.last_output[i] = (np.hstack((np.ones((len(a),1)),self.rectifier(a)))) 
            else:
                self.last_output[i] = np.around(a) 
        return(self.last_output[len(self.w)-1])
    #===============================#
    
    '''2.3. Complete the training implementation '''
    
    #sgd backpropagation
    def train(self, train_inputs, train_labels):
        #shuffle
        train_inputs, train_labels = shuffle(train_inputs, train_labels) 
        #for each iter
        for iteration in range(self.max_iterations):
            #for each input (n is kept separate from the layer index i used below)
            for n in range(len(train_inputs)):
                o = np.array([train_inputs[n]])
                t = train_labels[n]
                #for each layer
                for i in range(len(self.w),0,-1):
                    #if output layer (if i == 4)
                    if i == len(self.w):
                        #error = o-t
                        self.error[i-1] = self.prediction(o) - t
                        #gv = error * inputs
                        self.gv[i-1] = np.dot(self.error[i-1].T, self.last_output[i-2])
                    
                    else: #if hidden layer 
                        error = self.error[i]#next error
                        weights = self.w[i][:, 1:] #next weights
                        #derivation of activation
                        if self.activation_function == "sigmoid":
                            df = self.last_output[i-1][:, 1:] * (1 - self.last_output[i-1][:, 1:])
                        else: #rectifier
                            df = self.dfrectifier(self.last_output[i-1][:, 1:])
    
                        error1 = error.dot(weights)
                        self.error[i-1] = np.multiply(error1, df)
                        
                        #gv calculations
                        if i == 1: 
                            self.gv[i-1] = np.dot(self.error[i-1].T, o)
                        else:
                            self.gv[i-1] = np.dot(self.error[i-1].T,  self.last_output[i-2])
                #weight updates
                for i in range(len(self.gv)):
                    self.w[i] = self.w[i] - (self.learning_rate * self.gv[i])
                
    #===============================#
    
    '''2.4. (3 marks) Complete the testing implementation.
    
    The forward phase is calculated for each item in the testing set by the prediction function.
    The accuracy, precision, and recall are printed.
    > 95% accuracy is achieved for the binary neural network.
    '''
    
    
    #testing
    def test(self, testing_data, labels):
        assert len(testing_data) == len(labels)
        o = self.prediction(testing_data)
        example = []
        for i in o.flatten()[:10]:
            example.append(int(i))
        print(" ")
        print("Predictions", example)
        t = np.array([labels]).T
        examplet = []
        for i in t.flatten()[:10]:
            examplet.append(int(i))
        print("True Labels", examplet)

        print("Accuracy:\t", metrics.accuracy_score(t, o))  
        print("Precision:\t", metrics.precision_score(t, o, average = 'macro', 
                                                     zero_division = 1))
        print("Recall:\t", metrics.recall_score(t, o, average = 'macro', 
                                                zero_division = 1))
        print(" ")


### Main method

### 1. Binary Classifier

This binary method is a direct comparison with the perceptron, which classified each target digit as 0 or 1 and iterated through the ten digits. A ReLU NN (~99%) outperformed both the perceptron (~98%) and a sigmoid NN (~96%); the sigmoid NN perhaps struggled with oscillating weights due to the increased number of neurons and layers.

Sigmoid: 1 hidden layer of 35 nodes with a learning rate of 0.02, trained for 17 iterations, scores ~96% accuracy.

ReLU: 3 hidden layers of 28 nodes with a learning rate of 0.001, trained for 5 iterations, scores ~99% accuracy in identifying digits through binary classification.
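
Since each binary network treats one digit as positive and the other nine as negative, accuracy on its own can look high even for a weak model, which is why `test` also prints precision and recall. The sketch below uses made-up labels and a hypothetical helper, `binary_scores`, with the plain positive-class definitions; it is not the macro-averaged `sklearn` call used in `test`.

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    accuracy = float(np.mean(y_pred == y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Illustrative labels: 1 marks the target digit, 0 marks any other digit
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that never predicts the target digit
never_target = np.zeros_like(y_true)

print(binary_scores(y_true, never_target))  # (0.8, 0.0, 0.0)
```

Here the classifier is right 80% of the time without ever finding the target digit, so recall exposes what accuracy hides.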

In [None]:
"""Create 10 NNs to identify 10 digits"""
Nodes = []
for i in range(10):
    Nodes.append(ANN(activation_function = "sigmoid", config = (35, 1), 
        max_iterations = 17, learning_rate = 0.02, target = i))
    
"""Iterate through and train, test and print outputs"""
for net in Nodes:
    train_label = [ int(d[0] == net.target) for d in train_data ]
    test_label = [ int(d[0] == net.target) for d in test_data ]

    print("-----")
    net.print_details()
    net.test(test_input, test_label)
    net.train(train_input, train_label)
    net.test(test_input, test_label)

In [None]:
"""Create 10 NNs to identify 10 digits"""
Nodes = []
for i in range(10):
    Nodes.append(ANN(activation_function = "relu", config = (28, 28, 28, 1), 
        max_iterations = 5, learning_rate = 0.001, target = i))
    
"""Iterate through and train, test and print outputs"""
for net in Nodes:
    train_label = [ int(d[0] == net.target) for d in train_data ]
    test_label = [ int(d[0] == net.target) for d in test_data ]

    print("-----")
    net.print_details()
    net.test(test_input, test_label)
    net.train(train_input, train_label)
    net.test(test_input, test_label)

### 2. One-Hot/SoftMax

This method encodes each label as a vector of ten numbers, one per digit, with a 1 in the position of the true digit and 0 elsewhere (one-hot encoding). Each of the ten outputs then indicates how likely the input is to be that digit, for all digits 0-9. This resembles a softmax classification approach, although the output layer here is linear and rounded rather than passed through a softmax function.
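
The label encoding used in the data cell can be illustrated on a small scale. This is a minimal sketch with made-up labels; `labels`, `digits` and `decoded` are names chosen for the example.

```python
import numpy as np

# Toy label column shaped like train_data[:, :1] (values are illustrative)
labels = np.array([[3.0], [0.0], [9.0]])
digits = np.arange(10)

# Broadcasting (3, 1) against (10,) gives a (3, 10) boolean matrix,
# which becomes a 0/1 float matrix with a single 1 per row
one_hot = (digits == labels).astype(float)

# argmax recovers the original digit from each row
decoded = one_hot.argmax(axis=1)

print(one_hot[0])  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(decoded)     # [3 0 9]
```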

A softmax method is outlined fully in *Python Machine Learning Tutorial*'s neural network chapter. I did not follow this tutorial in full, as I only came to the softmax approach late on in the assignment, and so most of my code deviates from theirs. For that reason, I have very different scores. 
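
For reference, a softmax function turns a vector of per-digit scores into positive values that sum to 1. The sketch below is a generic formulation with made-up `scores`, not code from this notebook, whose `prediction` method rounds a linear output instead.

```python
import numpy as np

def softmax(scores):
    """Map a 1-D score vector to positive values that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([1.0, 3.0, 0.5])  # illustrative output-layer scores
probs = softmax(scores)

# The predicted class is the index with the largest probability
predicted = int(np.argmax(probs))
print(predicted)  # 1
```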

Sigmoid: 1 hidden layer of 140 nodes, a learning rate of 0.0005 and 10 iterations reaches 92% accuracy.

ReLU: 4 hidden layers, with 50, 28, 28 and 28 nodes, a learning rate of 0.01 and 20 iterations reaches 95% accuracy.

In [None]:
net = ANN(activation_function = "sigmoid", config = (140, 10), learning_rate = 0.0005, 
         max_iterations = 10)

#calling testing + training methods
print("-----")
net.test(test_input, test_labels_one_hot)
net.train(train_input, train_labels_one_hot)
net.test(test_input, test_labels_one_hot)

In [None]:
net = ANN(activation_function = "relu", config = (50, 28, 28, 28, 10), learning_rate = 0.01, 
         max_iterations = 20)

#calling testing + training methods
print("-----")
net.test(test_input, test_labels_one_hot)
net.train(train_input, train_labels_one_hot)
net.test(test_input, test_labels_one_hot)

### 3. Multi-Class

This is a more complex approach, in which a single output predicts the digit's value directly as a real number (rounded to the nearest integer), rather than a probability per digit. This was a harder challenge, and took a long time to train. Sigmoid activation struggled here and could not reach a respectable score, perhaps because of exploding gradients and oscillating weights given the complexity of the problem and the number of layers needed. ReLU did manage it, however.

A complex configuration of 5 hidden layers of sizes 70, 50, 50, 50 and 28, a learning rate of 0.0003 and 65 iterations with ReLU activation scores 96% accuracy in identifying digits through multi-class classification.
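
With a single real-valued output, the predicted digit is the output rounded to the nearest integer, as `prediction` does with `np.around`. The sketch below uses made-up outputs; the `np.clip` step is an assumption added here to keep predictions within 0-9, not something the notebook does.

```python
import numpy as np

raw = np.array([6.7, 0.2, 9.8, -0.6])  # illustrative network outputs

# Round to the nearest integer, as the output layer in prediction() does
rounded = np.around(raw)

# Assumption for this sketch: clamp out-of-range values to valid digits
digits = np.clip(rounded, 0, 9).astype(int)

print(rounded)  # [ 7.  0. 10. -1.]
print(digits)   # [7 0 9 0]
```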

In [None]:
net = ANN(activation_function = "relu", no_inputs = 784, 
                 max_iterations = 65,
                 config  = (70, 50, 50, 50, 28, 1),
                 learning_rate = 0.0003)

#calling testing + training methods
print("-----")
net.test(test_input, test_label_multi)
net.train(train_input, train_label_multi)
net.test(test_input, test_label_multi)

In [None]:
net = ANN(activation_function = "sigmoid", no_inputs = 784, 
                 max_iterations = 80,
                 config  = (28, 28, 1),
                 learning_rate = 0.03)

#calling testing + training methods
print("-----")
net.test(test_input, test_label_multi)
net.train(train_input, train_label_multi)
net.test(test_input, test_label_multi)

### Results

In [9]:
Results = {
    'Perceptron Binary Classifier': ['Sigmoid: 98%', 'Step: 98%'],
    'NN Binary Classifier':  ['Sigmoid: 96%', 'Relu: 99%'],
    'NN OneHot': ['Sigmoid: 92.15%', 'Relu: 95.5%'],
    'NN MultiClass Classifier': ['Sigmoid: --', 'Relu: 96%'],
        
        }
df = pd.DataFrame(Results, columns = ['Perceptron Binary Classifier',
                                      'NN Binary Classifier',
                                      'NN OneHot',
                                      'NN MultiClass Classifier', 
                                      ])

In [10]:
df

| | Perceptron Binary Classifier | NN Binary Classifier | NN OneHot | NN MultiClass Classifier |
|---|---|---|---|---|
| 0 | Sigmoid: 98% | Sigmoid: 96% | Sigmoid: | Sigmoid: -- |
| 1 | Step: 98% | Relu: 99% | Relu: 96 | Relu: 96% |


### Reflections
#### 2.4. Complete the testing implementation.

How much better are the results for digit recognition, compared to the single-layer perceptron?

* For binary digit recognition, the optimal NN was more accurate, achieving ~99% accuracy, whereas the perceptron achieved just under 98% accuracy. 
* However, the main advantage of the NN over the perceptron is the option for multiple outputs, complex configurations and multiple classifications. With this, though it takes much longer to train than the single-layer perceptron, a single NN is able to classify every digit from 0 to 9 with 96% accuracy. 

How did you modify the initial weights, learning rate, and iterations to achieve this?

* The optimisation of the final binary model did not take too long, since the model learned fairly quickly with ReLU. Parameterising the configuration into a single vector made the process of finding the optimal architecture much easier than with my original code (for which I had to manually change and update the architecture throughout the code). 
* I started with the configuration, finding an optimum number of layers, and nodes. Then I gradually decreased the learning rate and increased the epochs. This was to ensure the gradient descent of error was measured and accurate, rather than wild and inconsistent.

How much faster/slower is the training time, compared to the single-layer perceptron?

* The binary neural network with sigmoid activation trained faster than the single-layer perceptron. Although the perceptron has far fewer parameters to tune, my perceptron did not make use of matrix multiplication, whereas the NN does, which sped up training significantly. 
* However, the single-layer perceptron was far quicker than the multi-class classifier. With the optimal multi-class architecture above, there are approximately 65,000 weights to update for each of the 60,000 inputs in each of the 65 iterations. So, no wonder it takes a long time!
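
The figure of roughly 65,000 parameters can be checked by counting the weights layer by layer. This is a small sketch; `count_parameters` is a hypothetical helper, and it assumes one bias weight per node, matching the column of 1s stacked onto each weight matrix in `__init__`.

```python
def count_parameters(no_inputs, config):
    """Count weights, including one bias weight per node, for a layer config."""
    total = 0
    fan_in = no_inputs
    for nodes in config:
        total += nodes * (fan_in + 1)  # +1 for the bias column
        fan_in = nodes
    return total

# Multi-class architecture from section 3
print(count_parameters(784, (70, 50, 50, 50, 28, 1)))  # 65057
```

Most of these weights (70 x 785 = 54,950) sit in the first layer, which connects the input pixels to the first hidden layer.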

How much quicker/slower does the learning converge, compared to the single-layer perceptron?

* The binary classifier NN with sigmoid, trained with online learning, needed about 1/3 of the iterations (15) that the single-layer perceptron needed with batch learning (40). So the binary neural network was quicker to learn and converge. 

#### 2.5 Implement the rectifier activation function

How much better are the results for digit recognition, compared to the sigmoid activation function?

* The results are much better and are reached much more quickly. For instance, with an identical configuration and parameters, sigmoid would achieve 90% accuracy while ReLU would reach 99% accuracy. Since sigmoid took longer to train, it needed more iterations and a simpler structure to converge.
* The best score achieved by ReLU was over 99%, while the sigmoid struggled to reach over 97%.
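
One commonly cited reason for such a gap is that the sigmoid's derivative is at most 0.25 and shrinks towards zero for large inputs, while the rectifier's derivative is 1 for any positive input. The sketch below illustrates this with standalone versions of the two derivatives; the sample inputs are arbitrary, and whether this fully explains the scores above is not something the notebook tests.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def dsigmoid(a):
    """Derivative of the sigmoid, expressed through its output."""
    s = sigmoid(a)
    return s * (1 - s)

def drectifier(a):
    """Derivative of the rectifier: 1 for positive inputs, 0 otherwise."""
    return (a > 0) * 1.0

a = np.array([-5.0, 0.5, 5.0])  # arbitrary pre-activation values

print(dsigmoid(a))    # small everywhere, close to zero at -5 and 5
print(drectifier(a))  # [0. 1. 1.]
```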

How much quicker/slower does the learning converge, compared to the sigmoid activation function?

* The binary classifier NN with ReLU took fewer than a third of the iterations (5) that the sigmoid needed (17). So the ReLU network was much quicker to learn and converge. 

How did you modify the initial weights, learning rate, and iterations to achieve this?

* As above, I started with the configuration, then tuned the learning rate and number of iterations so that gradient descent moved steadily towards a minimum of the error, which increased accuracy.

## References

Python Machine Learning Tutorial (n.d.) *Neural Network*. Available at https://www.python-course.eu/neural_network_mnist.php [Accessed 30/03/2021].