**Class**: CT5133 Deep Learning

**Student id**: 20230852

## Logistic Regression with Stochastic Gradient Descent

### Task 1 - Implement Logistic Regression
Logistic Regression is a classification algorithm used to classify categorical data. The Sigmoid function or the Hypothesis function to calculate the probabilities of the input data passed to the function. In this implementation of the algorithm, Stochastic Gradient Descent is used to determine model's parameter values that minimises the cost occurred as much as possible to reach the local minima [1]. In Stochastic Gradient Descent, we select one sample from the training cases at random and update weight and bias values as per that case. There are two hyper-parameters learning rate (alpha) and the number of epochs which are passed to the fit function. The size of the steps we take to reach the local minima is decided by the learning rate alpha. 

In [1]:
import numpy as np
import pandas as pd
from numpy import random
from sklearn import metrics
from sklearn.model_selection import train_test_split
import random
import cv2

In [2]:
def weighted_sum(X,w,b):
    # Weighted Sum function takes in the feature list (X) along with weights (w) &  
    # bias (b) and returns the value from below calculation
    
    y = np.dot(w.T,X)+b
    return float(y)

In [3]:
def sigmoid(y):
    # The Sigmoid function calculates the probabilities of the input data passed
    # to the function. The output of this function will be a value between 0
    # and 1.
    
    z = 1 / (1 + np.exp(-y))
    return z    

In [4]:
def fit(X,Y,a,epochs):
    # The fit function takes in the features list and label from the train dataset 
    # along with the learning rate (alpha) and epochs. The features and labels of 
    # the validation set is also taken as input. Stochastic Gradient descent is 
    # implemented to update model parameters - weights and bias. 
    
    #initialization
    np.random.seed(50) #for ensuring data can be reproduced
    
    alpha = a
    max_iterations = epochs

    w = np.random.rand(len(X.columns),1)
    b = np.random.rand() 
    del_w = np.empty(w.shape)
    
    for _ in range(max_iterations):
        #selecting 1 random training example for Stochastic Gradient Descent
        random_index = np.random.randint(0, len(X), 1) 
        
        #extracting 1 row from training example based on index found above 
        x = X.iloc[random_index,:]
        y = Y.iloc[random_index,:]
        
        x=x.squeeze()
        y=np.array(y)

        #Forward propogation calculation
        z = sigmoid(weighted_sum(x,w,b))
        
        if (z>=0.5):
            y_hat = 1
        else: y_hat = 0
                
        #Gradient Descent calculation
        for j in range(len(w)):
            del_w[j] = (y_hat - y) * x[j]
        del_b = (y_hat - y)
        for j in range(len(w)):
            w[j] = w[j] - alpha * del_w[j]
        b = b - alpha * del_b
                
    return w,b

In [5]:
def predict(X,theta):
    # Predict function takes in the feature list passed to the function. The theta 
    # value in the argument is a list which contains the model parameters - weights 
    # and bias. This function returns the predicted label value based on the 
    # probability returned by the sigmoid function.
    
    #extracting weights and bias from theta
    w = theta[0]
    b = theta[1]
    y_hat = np.empty((len(X),1))
    
    for i in range(len(X)):
        z = sigmoid(weighted_sum(X.iloc[i,:],w,b))
        if (z>=0.5):
            y_hat[i] = 1
        else: y_hat[i] = 0
            
    return y_hat
    

### Task 2 - Checking implementation on 2 datasets

In [6]:
def StandardScaler(X):
    # StandardScaler function performs normalization on values in X and return 
    # new X dataframe 
    
    X_std = pd.DataFrame(X)
    n = X.shape[0]
    for col in range(X.shape[1]):
        col_sum = X.iloc[:,col].sum()
        col_mean = X.iloc[:,col].mean()
        sd = X.iloc[:,col].std()
        for row in range(X.shape[0]):
            x =  X.iloc[row,col]
            y = (x - col_mean)/sd
            X_std.iloc[row,col] = y
    
    return X_std 

In [7]:
# Read Moons CSV file as a dataframe
moons_data = pd.read_csv("moons400.csv")

# The y values are those labelled 'Class': extract their values
y_moons = moons_data['Class'].values
y_moons=pd.DataFrame(y_moons)

# The x values are all other columns
del moons_data['Class']   # drop the 'Class' column from the dataframe
X_moons = moons_data.values # convert the remaining columns to a numpy array

#normalizing feature list using defined Standard Scalar function
scaled_X_moons = StandardScaler(pd.DataFrame(X_moons)) 

X_train_val_moons,X_test_moons,y_train_val_moons,y_test_moons=train_test_split(
                                                                scaled_X_moons,
                                                                y_moons,
                                                                test_size=0.15,
                                                                random_state=50)
X_train_moons,X_val_moons,y_train_moons,y_val_moons=train_test_split(X_train_val_moons,
                                                        y_train_val_moons,
                                                        test_size=0.15,
                                                        random_state=50)

In [8]:
## For Moons dataset
alpha=0.001
epochs = 10000

print("***TRAINING MODEL***")
theta_moons = fit(X_train_moons,y_train_moons,alpha,epochs)
print()

#VALIDATION
y_pred_val_moons = predict(X_val_moons,theta_moons)

#TESTING
y_pred_test_moons = predict(X_test_moons,theta_moons)

simple_val_moons_accr = metrics.accuracy_score(y_val_moons, y_pred_val_moons)*100
print("Validation Accuracy = ",round(simple_val_moons_accr,2),"%")

simple_test_moons_accr = metrics.accuracy_score(y_test_moons, y_pred_test_moons)*100
print("Test Accuracy = ",round(simple_test_moons_accr,2),"%")

***TRAINING MODEL***

Validation Accuracy =  92.16 %
Test Accuracy =  83.33 %


##### Observation:
For Moons dataset, the data is not linearly separable. The model gives 92% accuracy on validation data and only 83% accuracy on test data. The model is overfitting the data a little bit as the validation accuracy is a bit higher than the test accuracy.

The model gave best accuracy when following hyperparameters were used:<br>
Learning rate (alpha) = 0.001<br>
Epochs = 10,000

In [9]:
# Read Blobs CSV file as a dataframe
blobs_data = pd.read_csv("blobs250.csv")

# The y values are those labelled 'Class': extract their values
y_blobs = blobs_data['Class'].values
y_blobs=pd.DataFrame(y_blobs)

# The x values are all other columns
del blobs_data['Class']   # drop the 'Class' column from the dataframe
X_blobs = blobs_data.values     # convert the remaining columns to a numpy array

#normalizing feature list using defined Standard Scalar function
scaled_X_blobs = StandardScaler(pd.DataFrame(X_blobs))  

X_train_val_blobs, X_test_blobs, y_train_val_blobs, y_test_blobs = train_test_split(
                                                                    scaled_X_blobs,
                                                                    y_blobs,
                                                                    test_size=0.15,
                                                                    random_state=50)
X_train_blobs, X_val_blobs, y_train_blobs, y_val_blobs=train_test_split(
                                                            X_train_val_blobs,
                                                            y_train_val_blobs,
                                                            test_size=0.15,
                                                            random_state=50)

In [10]:
## For Blobs dataset
alpha = 0.01
epochs = 10000

print("***TRAINING MODEL***")
theta_blobs = fit(X_train_blobs,y_train_blobs,alpha,epochs)
print()

#VALIDATION
y_pred_val_blobs = predict(X_val_blobs,theta_blobs)

#TESTING
y_pred_test_blobs = predict(X_test_blobs,theta_blobs)

simple_val_blobs_accr = metrics.accuracy_score(y_val_blobs, y_pred_val_blobs)*100
print("Validation Accuracy = ",round(simple_val_blobs_accr,2),"%")

simple_test_blobs_accr = metrics.accuracy_score(y_test_blobs, y_pred_test_blobs)*100
print("Test Accuracy = ",round(simple_test_blobs_accr,2),"%")

***TRAINING MODEL***

Validation Accuracy =  100.0 %
Test Accuracy =  100.0 %


##### Observation:
Blobs dataset is a linearly separable dataset. Based on above results, the Logistic Regression model is giving good accuracy (100%) on Blobs dataset for both validation and test data. The model is not overfitting as both validation and tests sets has 100% accuracy. 

The model gave best accuracy when following hyperparameters are used:<br>
Learning rate (alpha) = 0.01 <br>
Epochs = 10,000 <br>

### Task 3 - Shallow Neural Network Implementation

Artificial Neural Networks are a computational model that is similar to the way human brain process information using biological neural networks [2]. Neural networks consists of input nodes which taken in the feature vectors. There can be one or more hidden layers which perform intermediate computation. These hidden layers can consist of one or more hidden nodes. The output layer uses activation function such as sigmoid to map the result to the required output. Similar to a simple machine learning mode, neural network also have weights and biases in each layer of the network architecture.

In this assignment, Logistic Regression with Stochastic Gradient Descent algorithm is implemented using a shallow Neural network with a single hidden layer. In forward propogation, we use the weights and bias of the hidden layer to get the weighted sum for the selected training case. In backpropogation, we propogate the errors from the output through the network and calculate the derivatives of weights and bias with respect to cost [3]. Using Stochastic Gradient Descent, we adjust the weight and bias values in the entire architecture.

To avoid overfitting, early stopping is implemented for neural networks. In this technique, while training the model we check the performace of the validation set with current model parameters at regular intervals. If the current performance is better, we save the model performance and when we finish training, instead of returning the final parameters, we return the last saved parameters [4].

In [11]:
def fit_nn(X,Y,a,epochs,neurons,X_val,y_val):
    # The fit_nn function takes in the features list and labels from the train dataset 
    # along with the learning rate (alpha) and epochs and the number of neurons in the 
    # hidden layer. The features and labels of the validation set is also taken as  
    # input. Stochastic Gradient descent(SGD) is implemented to update model parameters   
    # - weights and bias. Using early stopping, the last saved weight and bias values   
    # are returned from this function.
    
    #initialization
    alpha = a
    max_iterations = epochs
    hidden_nodes = neurons
    
    np.random.seed(10) #for ensuring data can be reproduced
    
    a_hidden = np.empty((hidden_nodes,1))
    w_sum = np.empty((hidden_nodes,1))
    
    #Intializing weights and bias with random values
    w_input = np.random.rand(hidden_nodes,len(X.columns))
    w_hidden = np.random.rand(hidden_nodes,1)
    b_input = np.random.rand(1,hidden_nodes)
    b_hidden = np.random.rand()
        
    del_w_input = np.empty((hidden_nodes,len(X.columns)))
    del_w_hidden = np.empty((hidden_nodes,1))
    del_b_input = np.empty((1,hidden_nodes))
    del_z_input = np.empty((hidden_nodes,1))
 
    current_accuracy= 0
    previous_accuracy= 0
    
    for _ in range(max_iterations):
        #selecting 1 random index from the training example for SGD
        random_index = np.random.randint(0, len(X), 1) 
        
        #extracting 1 row from training example based on index found above 
        x = X.iloc[random_index,:] 
        y = Y.iloc[random_index,:]
        
        x=x.squeeze() #removing dimension from feature list X
        y=np.array(y)

        #Forward propogation - Hidden Layer calculation
        for i in range(hidden_nodes):
            w_sum[i] = weighted_sum(x,w_input[i,:],b_input[:,i])
            a_hidden[i] = sigmoid(w_sum[i])
        
        #Forward propogation - Output Layer calculation
        z = sigmoid(weighted_sum(a_hidden,w_hidden,b_hidden))
        if (z>=0.5):  
            y_hat = 1
        else: y_hat = 0
        
        #Backward Propogation - output layer calculation
        del_z_hidden = y_hat - y
        for i in range(hidden_nodes):
            del_w_hidden[i] = del_z_hidden * a_hidden[i]
        del_b_hidden = del_z_hidden
        
        #Backward Propogation - hidden layer calculation
        for i in range(len(w_sum)):
            z = w_sum[i]
            sigmoid_prime = sigmoid(z)*(1-sigmoid(z))
            del_z_input[i] = sigmoid_prime * del_z_hidden * w_hidden[i]

        x = np.array(x)
        x = x.reshape(1,len(X.columns))

        for i in range(hidden_nodes):
            for j in range(len(X.columns)):
                del_w_input[i][j] = del_z_input[i] * x[0][j]
    
        del_b_input = w_sum
        
        #Gradient Descent calculation
        w_input = w_input - alpha * del_w_input
        w_hidden = w_hidden - alpha * del_w_hidden
        
        for i in range(hidden_nodes):
            b_input[:,i] = b_input[:,i] - alpha * del_b_input[i,:]
        b_hidden = b_hidden - alpha * del_b_hidden
        
        #Implementing Early Stopping to avoid Overfitting
        if(_ % 1000 == 0):
            theta = [w_input,b_input,w_hidden,b_hidden]
            
            #testing Validation data with currect weight & bias values
            y_pred = predict_nn(X_val,tuple(theta)) 
            
            current_accuracy = metrics.accuracy_score(y_val, y_pred)
            if(current_accuracy > previous_accuracy):
                saved_w_input = w_input
                saved_b_input = b_input
                saved_w_hidden = w_hidden
                saved_b_hidden = b_hidden
            
    return saved_w_input, saved_b_input, saved_w_hidden, saved_b_hidden

In [12]:
def predict_nn(X,theta):
    # The predict_nn function takes in the feature list passed to the function. 
    # The theta value in the argument is a list which contains the model parameters 
    # - weights and bias for all layers. This function returns the predicted label 
    # value based on the probability returned by the sigmoid function.

    #extracting weights and bias from theta
    w_input = theta[0]
    b_input = theta[1]
    w_hidden = theta[2]
    b_hidden = theta[3]
    
    a_hidden = np.empty((len(w_input),1))
    y_hat = np.empty((len(X),1))
    
    for i in range(len(X)):
        x = X.iloc[i,:]
        x=x.squeeze()
        
        #Forward propogation - Hidden Layer calculation
        for j in range(len(a_hidden)):
            a_hidden[j] = sigmoid(weighted_sum(x,w_input[j,:],b_input[:,j]))
            
        #Forward propogation - Output Layer calculation
        z = sigmoid(weighted_sum(a_hidden,w_hidden,b_hidden))
        if (z>=0.5):
            y_hat[i] = 1
        else: y_hat[i] = 0

    return y_hat
    

In [13]:
## For Blobs dataset
alpha = 0.01
epochs = 10000
neurons_in_hidden_layer = 2

print("***TRAINING MODEL***")
theta_blobsNN = fit_nn(X_train_blobs,y_train_blobs,alpha,epochs,
                       neurons_in_hidden_layer,X_val_blobs,y_val_blobs)

#VALIDATION
y_pred_val_blobsNN = predict_nn(X_val_blobs,theta_blobsNN)

#TESTING
y_pred_test_blobsNN = predict_nn(X_test_blobs,theta_blobsNN)

nn_val_blobs_accr = metrics.accuracy_score(y_val_blobs, y_pred_val_blobsNN)*100
print("Validation Accuracy = ",round(nn_val_blobs_accr,2),"%")

nn_test_blobs_accr = metrics.accuracy_score(y_test_blobs, y_pred_test_blobsNN)*100
print("Test Accuracy = ",round(nn_test_blobs_accr,2),"%")

***TRAINING MODEL***
Validation Accuracy =  100.0 %
Test Accuracy =  97.37 %


##### Observation:
For linearly seperable data such as blobs dataset, the Logistic Regression model implemented with shallow neural network is giving 100% accuracy for validation data and has 97% accuracy for test data. The performance of the shallow neural network is a bit lower (97%) than that of the simple logistic regression model which gave 100% accuracy for test data as well. 

The model gave best accuracy when following hyperparameters were used:<br>
Learning rate (alpha) = 0.01 <br>
Epochs = 10,000 <br>
Number of neurons in the hidden layer = 2 <br>
Validation accuracy for Early stopping is checked at every 1,000. 

In [14]:
## For Moons dataset
alpha = 0.001
epochs = 10000
neurons_in_hidden_layer = 2

print("***TRAINING MODEL***")
theta_moonsNN = fit_nn(X_train_moons,y_train_moons,alpha,epochs,
                       neurons_in_hidden_layer,X_val_moons,y_val_moons)

#VALIDATION
y_pred_val_moonsNN = predict_nn(X_val_moons,theta_moonsNN)

#TESTING
y_pred_test_moonsNN = predict_nn(X_test_moons,theta_moonsNN)

nn_val_moons_accr = metrics.accuracy_score(y_val_moons, y_pred_val_moonsNN)*100
print("Validation Accuracy = ",round(nn_val_moons_accr,2),"%")

nn_test_moons_accr = metrics.accuracy_score(y_test_moons, y_pred_test_moonsNN)*100
print("Test Accuracy = ",round(nn_test_moons_accr,2),"%")

***TRAINING MODEL***
Validation Accuracy =  90.2 %
Test Accuracy =  86.67 %


##### Observation:
For linearly inseperable data such as Moons dataset, the Logistic Regression model implemented with shallow neural network is giving 90% accuracy on validation data and only 86% accuracy on test data. However, when compared to the performance of simple logistic regression model implementation for the same data, we see a improvement in the accuracy of the test data from 83% to 86%. However, the accuracy of the validation set seems to have decreased from 92% in simple implementation to 90% in the neural network implementation. 

The model gave best accuracy when following hyperparameters were used:<br>
Learning rate (alpha) = 0.001 <br>
Epochs = 10,000 <br>
Number of neurons in the hidden layer = 2 <br>
Validation accuracy for Early stopping is checked at every 1,000. 

## TASK 4

The enhancement implemented below for neural network implementation of Logistic Regression with Stochastic Gradient model is **Backprop with Momentum Algorithm**. Stochastic Gradient descent has issues traversing ravines which usually occur near the local minimas. Ravines are areas where the surface curves more steeply in one dimension as compared to others. Stochastic gradient tends to oscillate across the slope of ravines and only  slow progressing towards the local optima [5]. Using momentum we use previous changes in the parameter values to infulence the current direction of parameter updates [6] which helps stochastic gradient to accelerate in the right direction.

In [26]:
def fit_nn_opt(X,Y,a,epochs,neurons,X_val,y_val):
    # The fit_nn_opt function takes in the features list and labels from the train 
    # dataset along with the learning rate (alpha) and epochs and the number of 
    # neurons in the hidden layer. The features and labels of the validation set is 
    # also taken as input. Stochastic Gradient descent is implemented to update model 
    # parameters - weights and bias. Using early stopping, the last saved weight and 
    # bias values are returned from this function. For backprop with momentum, a new
    # hyperparameter beta1 introduced which is set to 0.9 in below implementation.  
    
    #initialization
    alpha = a
    max_iterations = epochs
    hidden_nodes = neurons
    
    np.random.seed(23)
    
    a_hidden = np.empty((hidden_nodes,1))
    w_sum = np.empty((hidden_nodes,1))
        
    w_input = np.random.rand(hidden_nodes,len(X.columns))
    w_hidden = np.random.rand(hidden_nodes,1)

    b_input = np.random.rand(1,hidden_nodes)
    b_hidden = np.random.rand()
    
    del_w_input = np.empty((hidden_nodes,len(X.columns)))
    del_w_hidden = np.empty((hidden_nodes,1))
    del_b_input = np.empty((1,hidden_nodes))
    del_z_input = np.empty((hidden_nodes,1))
    
    V_del_w_input = np.zeros((hidden_nodes,len(X.columns)))
    V_del_w_hidden = np.zeros((hidden_nodes,1))
    V_del_b_input = np.zeros((1,hidden_nodes))
    V_del_b_hidden = 0
    
    beta1 = 0.9

    current_accuracy=0
    previous_accuracy=0
    
    for _ in range(max_iterations):
        t = _
        #selecting 1 random training example for Stochastic Gradient Descent
        random_index = np.random.randint(0, len(X), 1) 
        
        x = X.iloc[random_index,:]
        y = Y.iloc[random_index,:]
        
        x=x.squeeze()
        y=np.array(y)

        #Forward propogation - Hidden Layer calculation
        for i in range(hidden_nodes):
            w_sum[i] = weighted_sum(x,w_input[i,:],b_input[:,i])
            a_hidden[i] = sigmoid(w_sum[i])
        
        #Forward propogation - Output Layer calculation
        z = sigmoid(weighted_sum(a_hidden,w_hidden,b_hidden))
        if (z>=0.5):
            y_hat = 1
        else: y_hat = 0
        
        #Backward Propogation - output layer calculation
        del_z_hidden = y_hat - y
        for i in range(hidden_nodes):
            del_w_hidden[i] = del_z_hidden * a_hidden[i]
        del_b_hidden = del_z_hidden
        
        #Momentum - output layer calculation
        V_del_w_hidden = (1 - beta1) * del_w_hidden + beta1 * V_del_w_hidden
        V_del_b_hidden = (1-beta1) * del_b_hidden + beta1 * V_del_b_hidden
        
        #Backward Propogation - hidden layer calculation
        for i in range(len(w_sum)):
            z = w_sum[i]
            sigmoid_prime = sigmoid(z)*(1-sigmoid(z))
            del_z_input[i] = sigmoid_prime * del_z_hidden * w_hidden[i]

        x = np.array(x)
        x = x.reshape(1,len(X.columns))

        for i in range(hidden_nodes):
            for j in range(len(X.columns)):
                del_w_input[i][j] = del_z_input[i] * x[0][j]
    
        del_b_input = w_sum.T
        
        #Momentum - hidden layer calculation
        V_del_w_input = (1-beta1)*del_w_input + beta1 * V_del_w_input
        V_del_b_input = (1-beta1)*del_b_input + beta1 * V_del_b_input

        #Gradient Descent calculation
        w_input = w_input - alpha * V_del_w_input
        w_hidden = w_hidden - alpha * V_del_w_hidden
        
        b_hidden = b_hidden - alpha * V_del_b_hidden
        b_input = b_input - alpha * V_del_b_input
        
        #Implementing Early Stopping to avoid Overfitting
        if(_ % 1000 == 0):
            theta = [w_input,b_input,w_hidden,b_hidden]
            
            #testing Validation data with currect weight & bias values
            y_pred = predict_nn(X_val,tuple(theta)) 
            
            current_accuracy = metrics.accuracy_score(y_val, y_pred)
            if(current_accuracy > previous_accuracy):
                saved_w_input = w_input
                saved_b_input = b_input
                saved_w_hidden = w_hidden
                saved_b_hidden = b_hidden
            
    return saved_w_input, saved_b_input, saved_w_hidden, saved_b_hidden

In [27]:
## For Blobs dataset
alpha = 0.01
epochs = 10000
neurons_in_hidden_layer = 2

print("***TRAINING MODEL***")
theta_blobsNN = fit_nn_opt(X_train_blobs,y_train_blobs,alpha,epochs,
                           neurons_in_hidden_layer,X_val_blobs,y_val_blobs)

#print("***VALIDATION***")
y_pred_val_blobsNN = predict_nn(X_val_blobs,theta_blobsNN)

#print("***TESTING***")
y_pred_test_blobsNN = predict_nn(X_test_blobs,theta_blobsNN)

opt_val_blobs_accr = metrics.accuracy_score(y_val_blobs, y_pred_val_blobsNN)*100
print("Validation Accuracy = ",round(opt_val_blobs_accr,2),"%")

opt_test_blobs_accr = metrics.accuracy_score(y_test_blobs, y_pred_test_blobsNN)*100
print("Test Accuracy = ",round(opt_test_blobs_accr,2),"%")

***TRAINING MODEL***
Validation Accuracy =  100.0 %
Test Accuracy =  100.0 %


##### Observation:
For linearly seperable data such as blobs dataset, the Logistic Regression model implemented with shallow neural network & backprop with momentum is giving good accuracy (100%) for both validation and test data. The model is not overfitting as both validation and tests sets has 100% accuracy. Thus, we can see the model is giving similar performance to that of simple logistic regression model for the same dataset.

The model gave best accuracy when following hyperparameters are used:<br>
Learning rate (alpha) = 0.01<br>
Epochs = 10,000 <br>
Number of neurons in the hidden layer = 2 <br>
Beta for momentum calculation = 0.9 <br>
Validation accuracy for Early stopping is checked at every 1,000. 

In [28]:
## For Moons dataset
alpha = 0.001
epochs = 10000
neurons_in_hidden_layer = 2

print("***TRAINING MODEL***")
theta_moonsNN = fit_nn_opt(X_train_moons,y_train_moons,alpha,epochs,
                           neurons_in_hidden_layer,X_val_moons,y_val_moons)

y_pred_val_moonsNN = predict_nn(X_val_moons,theta_moonsNN)

y_pred_test_moonsNN = predict_nn(X_test_moons,theta_moonsNN)

opt_val_moons_accr = metrics.accuracy_score(y_val_moons, y_pred_val_moonsNN)*100
print("Validation Accuracy = ",round(opt_val_moons_accr,2),"%")

opt_test_moons_accr = metrics.accuracy_score(y_test_moons, y_pred_test_moonsNN)*100
print("Test Accuracy = ",round(opt_test_moons_accr,2),"%")

***TRAINING MODEL***
Validation Accuracy =  94.12 %
Test Accuracy =  88.33 %


##### Observation:
For linearly inseperable data such as Moons dataset, the Logistic Regression model implemented with shallow neural network & backprop with momentum is giving 94% accuracy on validation data and 88% accuracy on test data. The model is not overfitting as accuracy of both validation and tests sets is around 90%. However, when compared to the performance of simple logistic regression model & shallow neural netowork implementation for the same data, we see accuracy has improved with implementation of backprop with momentum to 88% on the test data from 83% of simple implementation & 86% of neural net implementation. The accuracy of the validation set seems to have improved as well from 92% & 90% in simple implementation and neural net implementation respectively to 94% when momentum is implemented.

The model gave best accuracy when following hyperparameters are used:<br>
Learning rate (alpha) = 0.001 <br>
Epochs = 10,000 <br>
Number of neurons in the hidden layer = 2 <br>
Beta for momentum calculation = 0.9 <br>
Validation accuracy for Early stopping is checked at every 1,000. 

In [30]:
#For images dataset
alpha = 0.01
epochs = 10000
neurons_in_hidden_layer = 2

print("***TRAINING MODEL***")
theta_img = fit_nn_opt(X_train_img,y_train_img,alpha,epochs,
                       neurons_in_hidden_layer,X_val_img,y_val_img)

#VALIDATION
y_pred_val_img = predict_nn(X_val_img,theta_img)

#TESTING on train data
y_pred_test_img = predict_nn(X_test_img,theta_img)

#TESTING on test data
test_y_pred_img = predict_nn(test_X_scaled,theta_img)

opt_val_img_accr = metrics.accuracy_score(y_val_img, y_pred_val_img)*100
print("Validation Accuracy on train data = ",round(opt_val_img_accr,2),"%")

opt_test_img_accr = metrics.accuracy_score(y_test_img, y_pred_test_img)*100
print("Test Accuracy on train data = ",round(opt_test_img_accr,2),"%")

opt_test_data_accr = metrics.accuracy_score(test_y, test_y_pred_img)*100
print("Test Accuracy on test data = ",round(opt_test_data_accr,2),"%")


***TRAINING MODEL***
Validation Accuracy on train data =  55.62 %
Test Accuracy on train data =  54.39 %
Test Accuracy on test data =  52.4 %


##### Observation:
For CIFAR image dataset, the Logistic Regression model implemented with shallow neural network & backprop with momentum is giving around 55% accuracy on validation data and 54% on test data from training dataset and 52% on the test dataset. The model is underfitting as accuracy of both validation and tests sets is around 55%. The performance of the model has same as that of the shallow neural network so there is no improvment in accuracy after implementing backprop with momentum for image dataset.

The model gave best accuracy when following hyperparameters are used:<br>
Learning rate (alpha) = 0.01 <br>
Epochs = 10,000 <br>
Number of neurons in the hidden layer = 2 <br>
Beta for momentum calculation = 0.9 <br>
Validation accuracy for Early stopping is checked at every 1,000. 

### References
[1] Niklas Donges. (2020, December 5). Gradient Descent: An Introduction To 1 Of Machine Learning’s Most Popular Algorithms. Retrieved from https://builtin.com/data-science/gradient-descent <br>
[2] David Fumo. (2017, August 4). A Gentle Introduction to Neural Networks Series - Part 1. Retrieved from https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b90b87795bc#:~:text=A%20feedforward%20neural%20network%20is,or%20loops%20in%20the%20network <br>
[3] Dr. Michael Madden (2021). Deep Learning [Lecture notes Week 3]. Retrieved from https://nuigalway.blackboard.com/ <br>
[4] Dr. Michael Madden (2021). Deep Learning [Lecture notes Week 4]. Retrieved from https://nuigalway.blackboard.com/ <br>
[5] Sebastian Ruder. (2016, January 19). An overview of gradient descent optimization algorithms. Retrieved from https://ruder.io/optimizing-gradient-descent/index.html#momentum <br>
[6] Dr. Michael Madden (2021). Deep Learning [Lecture notes Week 5]. Retrieved from https://nuigalway.blackboard.com/