# MTH 4320 / 5320 - Homework 2

## Dense Neural Networks and Keras

**Deadline**: Oct 3

**Points**: 50

### Instructions

Submit **one** Python notebook file for grading. Your file must include **text explanations** of your work, **well-commented code**, and the **outputs** from your code.

### Problems

---

#### Gradients

1. [10 points] Consider a single neuron with 3 inputs and PReLU activation function. Find the mathematical formula for the gradient of the activated output with respect to its incoming weights **and** the learnable PReLU parameter.

The weighted sum $z$ of a single neuron with 3 inputs can be expressed as follows:

$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$

Where $w$ is the weight of neuron, $x$ is the input of neuron, and $b$ is the bias term. The gradient of the weighted sum is then the gradient with respect to each weight:

$$\frac{\partial z}{\partial w_1} = x_1 $$

$$\frac{\partial z}{\partial w_2} = x_2 $$

$$\frac{\partial z}{\partial w_3} = x_3 $$

Next, let's find the gradient of the PReLU function. The PReLU function is defined as follows:

$$ f(x) = \left\{ 
    \begin{array}{ll}
    x & x > 0 \\
    \alpha x & x \le 0 \\
    \end{array}
\right. $$

The gradient with respect to the learnable parameter $\alpha$ is:

$$ \frac{\partial f(x)}{\partial \alpha} = \left\{ 
    \begin{array}{ll}
    x & \alpha x > x \\
    0 & \alpha x \le x\\
    \end{array}
\right. $$

The gradient with respect to the weighted sum $z$ is:

$$ \frac{\partial f(z)}{\partial z} = \left\{ 
    \begin{array}{ll}
    \alpha & \alpha z > z \\
    1 & \alpha z \le z\\
    \end{array}
\right. $$

We then need to find the gradient with respect to the weights. This can be done with the chain rule:

$$ \frac{\partial f(z)}{\partial w_1} = \frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial w_1} $$

$$ \frac{\partial f(z)}{\partial w_2} = \frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial w_2} $$

$$ \frac{\partial f(z)}{\partial w_3} = \frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial w_3} $$

We will also need is the gradient with respect to the learnable parameter $\alpha$, which can once again be done with chain rule:

$$ \frac{\partial f(z)}{\partial \alpha} = \frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial \alpha} $$

$$ \frac{\partial z}{\partial \alpha} = \frac{\partial \alpha x}{\partial \alpha} = x$$

$$ \frac{\partial f(z)}{\partial \alpha} = \frac{\partial f(z)}{\partial z} x $$

Now we can put it all together to get our final formula for the gradient of the activated output with respect to its incoming weights and the learnable PReLU parameter.

$$ \frac{\partial f(z)}{\partial w_1} = \left\{ 
    \begin{array}{ll}
    x_1 \alpha & \alpha z > z \\
    x_1 & \alpha z \le z \\
    \end{array}
\right. $$

$$ \frac{\partial f(z)}{\partial w_2} = \left\{ 
    \begin{array}{ll}
    x_2 \alpha & \alpha z > z \\
    x_2 & \alpha z \le z \\
    \end{array}
\right. $$

$$ \frac{\partial f(z)}{\partial w_3} = \left\{ 
    \begin{array}{ll}
    x_3 \alpha & \alpha z > z \\
    _3 & \alpha z \le z \\
    \end{array}
\right. $$

$$ \frac{\partial f(z)}{\partial \alpha} = \left\{ 
    \begin{array}{ll}
    x z & \alpha z > z \\
    0 & \alpha z \le z \\
    \end{array}
\right. $$

---

#### Dense Neural Networks

2. [40 points]. Use a feedforward NN with SGD to classify the CIFAR-10 dataset, and tune its hyperparameters as best you can. **You must use Keras or PyTorch**. Requirements below. 

Randomly split the dataset into 60\%/20/\%/20\% training/validation/testing sets. When tuning hyperparameters, test on the validation set. After you find the best hyperparameters, run your code **once** with these settings on the test. Use `random_state = 1` before splitting data.

Start with a 1-node classifier as a benchmark.

You must run **at least one experiment** using all major techniques (5 points each):

* Normalization/Standardization
* Weight Initialization
* Architectures
* Activation functions
* Loss functions
* Regularization (must include dropout)

**For each experiment, document why you chose to run this experiment, training accuracy/loss, validation accuracy/loss, epoch number with best validation accuracy (see the `EarlyStopping` callback), and training runtime.**

Training takes significant time, so brute force is *not* feasible. Make *informed decisions* on how to proceed and write your reasoning in your report. Include all fruitful experiments you run along the way. More importantly than the results, I want to see that you are *thinking well* and making good decisions. Good results will come from eventually if you *understand what you are doing*.

**Explanations and reasoning for your progression = [10 points]**

**Recommended:** Use small training sets for your initial tests so it works more quickly and then scale up when you results get better.

**Bonus:** Top 3 highest classification accuracy submissions earn +5 points.

In [357]:
# Overfitting => Use regularization

Let's begin by importing all the necessary libraries from tensorflow. 

In [417]:
# Imports
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l1_l2
from tensorflow.keras.datasets import mnist
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.layers import PReLU 

import matplotlib.pyplot as plt
import numpy as np

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Let's also define some parameters we will be using throughout out code. 

In [359]:
# Define parameters
learning_rate = 0.0001
epochs = 20
batch_size = 64

The dataset itself can be directly loaded from keras. 

In [360]:
# Load the CIFAR-10 dataset
(trainX, trainY), (testX, testY) = cifar10.load_data()

The data is then split into 60% training, 20% validation, and 20% testing, all using `random_state = 1`.

In [361]:
# Split the data into training, validation, and testing sets
trainX, tempX, trainY, tempY = train_test_split(trainX, trainY, test_size=0.4, random_state=1)
validX, testX, validY, testY = train_test_split(tempX, tempY, test_size=0.5, random_state=1)

# Check the shapes of the resulting sets
print("Training data shape:", trainX.shape)
print("Validation data shape:", validX.shape)
print("Testing data shape:", testX.shape)

Training data shape: (30000, 32, 32, 3)
Validation data shape: (10000, 32, 32, 3)
Testing data shape: (10000, 32, 32, 3)


Next, we want to ensure the labels are properly setup with one-hot enconding.

In [362]:
# Convert labels to one-hot encoding
num_classes = 10
trainY = tf.keras.utils.to_categorical(trainY, num_classes)
testY = tf.keras.utils.to_categorical(testY, num_classes)
validY = tf.keras.utils.to_categorical(validY, num_classes)

Now we can set up a 1-node classifier as a baseline.

In [363]:
# Define a 1-node classifier
baselineModel = Sequential([
    Flatten(input_shape=(32,32,3)),
    Dense(1, activation='relu'),
    Dense(num_classes, activation='softmax')
]) 

# Compile the model
baselineModel.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

We want to also define early stopping in case our model sees no improvement. 

In [364]:
# Define the early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

Let's try running the model with no optimizations to get our baseline results.

In [365]:
# Train the model on the training data
baselineModel.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = baselineModel.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1017
           1       0.00      0.00      0.00       990
           2       0.00      0.00      0.00      1061
           3       0.00      0.00      0.00       994
           4       0.00      0.00      0.00       936
           5       0.00      0.00      0.00      1009
           6       0.00      0.00      0.00      1034
           7       0.00      0.00      0.00      1005
           8       0.10      1.00      0.17       956
           9       0.00      0.00      0.00       998

    accuracy                           0.10     10000
   macro avg       0.01      0.10      0.02     10000
weighted avg       0.01      0.10      0.02     10000



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


As we can see here, the baseline compiles, but there is not much improvement to be seen, and the accuracy ends up around 10%. This shows the necesity to have a more complex and layered neural network. 

For this purpose, we'll construct a feedforward neural network, in which we'll be using ReLU and softmax layers with categorical crossentropy loss and the SGD optimizer.

In [366]:
# Create a feedforward neural net
model = Sequential()

# Create the layers
model.add(Flatten(input_shape=(32,32,3)))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(10, activation = 'softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

Let's see what happens when we run this model. 

In [367]:
# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for Default Model:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for Default Model:
              precision    recall  f1-score   support

           0       0.10      0.01      0.01      1017
           1       0.13      0.01      0.01       990
           2       0.17      0.00      0.01      1061
           3       0.20      0.06      0.10       994
           4       0.07      0.00      0.00       936
           5       0.27      0.03      0.05      1009
           6       0.15      0.81      0.26      1034
           7       0.16      0.01      0.01      1005
           8       0.18      0.56      0.27       956
           9       0.26      0.21      0.23       998

    accuracy                           0.17     10000
   macro avg       0.17      0.17      0.10     10000
weighted avg       0.17 

Here we get some better results than our baseline, and some of the numbers are pretty well predicted. With this in mind, we can now start experimenting with hyperparamters and tuning towards a final model. 

##### Normalization/Standardization

Let's try normalizing our data, and see if that yields similar results.

In [368]:
# Create a feedforward neural net
model = Sequential()

# Create the layers
model.add(Flatten(input_shape=(32,32,3)))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(10, activation = 'softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [369]:
# Normalize the pixel values to the range [0, 1]
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0
validX = validX.astype('float32') / 255.0

In [370]:
# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for Normalized Model:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for Normalized Model:
              precision    recall  f1-score   support

           0       0.33      0.48      0.39      1017
           1       0.43      0.12      0.18       990
           2       0.22      0.07      0.11      1061
           3       0.18      0.12      0.14       994
           4       0.24      0.31      0.27       936
           5       0.23      0.39      0.29      1009
           6       0.28      0.20      0.24      1034
           7       0.27      0.15      0.19      1005
           8       0.33      0.50      0.40       956
           9       0.31      0.52      0.39       998

    accuracy                           0.28     10000
   macro avg       0.28      0.28      0.26     10000
weighted avg       0.

Here the results have gotten much better. However, we notice that the learning rate is still increasing throughout the epochs and there may be room to grow. We could try adding more epochs, but seeing how slow the loss is changing, let's instead try changing the learning rate to something a bit higher. It is important to note here that we are seeing the training loss and accuaracy being larger than the valid loss and accuracy. This is a sign of overfitting and needs to be addressed, however, with a new learning rate this may change so lets first experiment with that. 

In [371]:
# Change learning rate
learning_rate = 0.001 

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [372]:
# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for Normalized Model:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for Normalized Model:
              precision    recall  f1-score   support

           0       0.41      0.51      0.45      1017
           1       0.44      0.45      0.44       990
           2       0.33      0.25      0.29      1061
           3       0.30      0.25      0.27       994
           4       0.34      0.33      0.33       936
           5       0.36      0.36      0.36      1009
           6       0.39      0.41      0.40      1034
           7       0.45      0.42      0.43      1005
           8       0.46      0.56      0.50       956
           9       0.47      0.45      0.46       998

    accuracy                           0.40     10000
   macro avg       0.39      0.40      0.39     10000
weighted avg       0.

Beautiful! Here we see a hige spike in accuracy thanks to our increased learning rate. However, we still see the training loss and accuracy being better than our valid loss and accuracy, once again displaying overfitting. To fix this overfitting, the best course of action is to implement regularization. 

##### Regularization

Let's try regularization using $L^1$ and $L^2$ kernal regularizers. We'll also include a droput rate of 0.5 after the dense layer of 256 units, and a dropout rate of 0.3 after the dense latyer of 128 units. This wil help prevent overfitting by randomly dropping out 50% and 30% of the neurons respectively. 

In [410]:
# Create a feedforward neural net with regularization
regularModel = Sequential()

# Create the layers
regularModel.add(Flatten(input_shape=(32,32,3)))
regularModel.add(Dense(256, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
regularModel.add(Dropout(0.5))
regularModel.add(Dense(128, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
regularModel.add(Dropout(0.3))
regularModel.add(Dense(10, activation = 'softmax', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))

# Compile the model
regularModel.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [411]:
# Train the model on the training data
regularModel.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = regularModel.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for Regularization:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for Regularization:
              precision    recall  f1-score   support

           0       0.45      0.41      0.43      1017
           1       0.41      0.38      0.39       990
           2       0.33      0.13      0.18      1061
           3       0.31      0.11      0.16       994
           4       0.32      0.29      0.30       936
           5       0.36      0.38      0.37      1009
           6       0.32      0.51      0.39      1034
           7       0.34      0.40      0.36      1005
           8       0.41      0.61      0.49       956
           9       0.41      0.49      0.45       998

    accuracy                           0.37     10000
   macro avg       0.36      0.37      0.35     10000
weighted avg       0.36

Here we've finally fixed our problem of overfitting, at the cost of some accuracy. This is an overall a step in the right directions, and we can now try experimenting with other parameters to further increase our accuracy. 

##### Weight Initialization

Let's try experimenting with weight initialization. For this purpose, we're gonna use the `kernal_initializer` parameters in our dense layers. We're gonna apply some of the most common initializations, which are `glorot_uniform`, `he_normal`, and `lecun_normal`.

In [412]:
# Define a list of different weight initializations to test
initializations = ['glorot_uniform', 'he_normal', 'lecun_normal']

# Loop through different weight initializations and train models
for initialization in initializations:
    # Create a feedforward neural net
    newModel = Sequential()

    # Create the layers with the chosen weight initialization
    newModel.add(Flatten(input_shape=(32, 32, 3)))
    newModel.add(Dense(256, activation='relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001), kernel_initializer=initialization))
    newModel.add(Dropout(0.5))
    newModel.add(Dense(128, activation='relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001), kernel_initializer=initialization))
    newModel.add(Dropout(0.3))
    newModel.add(Dense(10, activation='softmax'))

    # Compile the model
    newModel.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

    # Train the model on the training data
    model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

    # Use the trained model to make predictions on the valid set
    valid_predictions = newModel.predict(validX)

    # Convert predictions to class labels
    valid_predictions_labels = np.argmax(valid_predictions, axis=1)
    valid_true_labels = np.argmax(validY, axis=1)

    # Generate a classification report for each initialization
    classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

    # Print the classification report for each initialization
    print(f"Classification Report for {initialization}:")
    print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Classification Report for glorot_uniform:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1017
           1       0.10      0.86      0.18       990
           2       0.00      0.00      0.00      1061
           3       0.00      0.00      0.00       994
           4       0.00      0.00      0.00       936
           5       0.25      0.00      0.00      1009
           6       0.00      0.00      0.00      1034
           7       0.11      0.18      0.14      1005
           8       0.00      0.00      0.00       956
           9       0.05      0.00      0.00       998

    accuracy                           0.10     10000
   macro avg       0.05      0.10      0.03     10000
weighted avg       0.05      0.10      0.03     10000



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Classification Report for he_normal:
              precision    recall  f1-score   support

           0       0.07      0.05      0.06      1017
           1       0.08      0.54      0.14       990
           2       0.00      0.00      0.00      1061
           3       0.14      0.02      0.04       994
           4       0.05      0.04      0.04       936
           5       0.06      0.00      0.00      1009
           6       0.06      0.00      0.01      1034
           7       0.00      0.00      0.00      1005
           8       0.11      0.13      0.12       956
           9       0.14      0.07      0.09       998

    accuracy                           0.09     10000
   macro avg       0.07      0.09      0.05     10000
weighted avg       0.07      0.09      0.05     10000

Epoch 1/20
Epoch 2/20
Epoch 3/20


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Well that was unfortunate. All these initializers seemeed really promising, showing us upwards of 50% accuracy during the epochs. However, they all ended at awful 10% accuraccies. We are also seeing some warnings showing up in the terminal indicating a division by zero is occuring. This probably has to do with numbers being too small within logarithms, and honestly I don't know how to fix it, so we're just gonna accept this loss and move on. 

##### Architectures

There are many ways we could modify the architecture, from adding convolutional layers to using a pre-trained model. However, a lot of these are beyond the scope of this project, so instead we'll simply add an additional hidden layer to see if that has an impact. This should increase the model's capacity to learn, however, it is important to keep in mind that our parameters might not be tuned for this additional layer. 

In [413]:
# Create a feedforward neural net with additional layer
increasedModel = Sequential()

# Create the layers
increasedModel.add(Flatten(input_shape=(32,32,3)))
increasedModel.add(Dense(256, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
increasedModel.add(Dropout(0.5))
increasedModel.add(Dense(128, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
increasedModel.add(Dropout(0.3))
increasedModel.add(Dense(64, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
increasedModel.add(Dropout(0.1))
increasedModel.add(Dense(10, activation = 'softmax', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))

# Compile the model
increasedModel.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [414]:
# Train the model on the training data
increasedModel.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = increasedModel.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for Default Model:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for Default Model:
              precision    recall  f1-score   support

           0       0.36      0.47      0.41      1017
           1       0.47      0.16      0.24       990
           2       0.25      0.23      0.24      1061
           3       0.26      0.12      0.17       994
           4       0.26      0.36      0.30       936
           5       0.34      0.36      0.35      1009
           6       0.31      0.33      0.32      1034
           7       0.33      0.23      0.27      1005
           8       0.41      0.47      0.44       956
           9       0.36      0.58      0.44       998

    accuracy                           0.33     10000
   macro avg       0.34      0.33      0.32     10000
weighted avg       0.33 

Surprisingly enough, we see the model preforms worse here with an additional layer. We can suspect that this is due to the rest of the parameters not being a good fit, and in the interest of time we'll just stick to our previous model. 

##### Activation functions

Our current ReLU activation seems to be doing decently well, so maybe we can try a variation of it to seek improvements. For that we'll be using a Leaky ReLU. This fixes one of the main problems that a normal ReLU function faces, which is the "dying ReLU" problem, where neurons can get stuck during training and never activate again. This is done by allowing a small gradient for negative outputs, determined by the $\alpha$ parameter. For the purposes of our experiment, we'll leave this $\alpha$ as a very small number, `0.1`, for slight improvement. Note that the Leaky ReLU will only be applied to the itermediate layers, while the output remains as `softmax`. 

In [418]:
# Create a feedforward neural net with regularization
model = Sequential()

# Create the layers
model.add(Flatten(input_shape=(32,32,3)))
model.add(Dense(256, activation=LeakyReLU(alpha=0.1), kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.5))
model.add(Dense(128, activation=LeakyReLU(alpha=0.1), kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.3))
model.add(Dense(10, activation = 'softmax', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [419]:
# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for New Activation:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for New Activation:
              precision    recall  f1-score   support

           0       0.46      0.38      0.41      1017
           1       0.38      0.36      0.37       990
           2       0.29      0.22      0.25      1061
           3       0.30      0.15      0.20       994
           4       0.29      0.35      0.32       936
           5       0.34      0.39      0.36      1009
           6       0.34      0.45      0.39      1034
           7       0.44      0.27      0.33      1005
           8       0.36      0.68      0.47       956
           9       0.43      0.40      0.42       998

    accuracy                           0.36     10000
   macro avg       0.37      0.36      0.35     10000
weighted avg       0.37

Not much improvement to be seen here. We've achieved similar accuracy as before without overfitting so that is a plus.

Let's try PReLU, which is similar to LeakyReLU, with the key difference being the slope of the negative part of the function is learned during training, rather than being a fixed hyperparameter.

In [420]:
# Create a feedforward neural net with regularization
model = Sequential()

# Create the layers
model.add(Flatten(input_shape=(32,32,3)))
model.add(Dense(256, activation=PReLU(), kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.5))
model.add(Dense(128, activation=PReLU(), kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.3))
model.add(Dense(10, activation = 'softmax', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [421]:
# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report for New Activation:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report for New Activation:
              precision    recall  f1-score   support

           0       0.40      0.41      0.40      1017
           1       0.40      0.39      0.39       990
           2       0.28      0.21      0.24      1061
           3       0.27      0.17      0.21       994
           4       0.29      0.30      0.30       936
           5       0.33      0.38      0.35      1009
           6       0.34      0.44      0.38      1034
           7       0.44      0.26      0.33      1005
           8       0.40      0.64      0.49       956
           9       0.42      0.41      0.42       998

    accuracy                           0.36     10000
   macro avg       0.36      0.36      0.35     10000
weighted avg       0.36

Not much improvement here, but it was worth a try either way.

##### Loss functions

Previously, we were using the `categorical_crossentropy` loss function, which is known to work well with the dataset we are using. However, we can also try modifying the loss function to see if there is any effect on the accuracy. For our experiment, we will be using `MeanSquaredError` loss function. This loss function is generally used for regression, but can work with classification as well so it may suit our needs. 

In [422]:
# Create a feedforward neural net with regularization
model = Sequential()

# Create the layers
model.add(Flatten(input_shape=(32,32,3)))
model.add(Dense(256, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.5))
model.add(Dense(128, activation = 'relu', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))
model.add(Dropout(0.3))
model.add(Dense(10, activation = 'softmax', kernel_regularizer = l1_l2(l1 = 0.0, l2 = 0.0001)))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

In [423]:
# Compile the model with MeanSquaredError as the loss function
model.compile(loss=MeanSquaredError(), optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])

# Train the model on the training data
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, validation_data=(validX, validY), callbacks=[early_stopping])

# Use the trained model to make predictions on the valid set
valid_predictions = model.predict(validX)

# Convert predictions to class labels
valid_predictions_labels = np.argmax(valid_predictions, axis=1)
valid_true_labels = np.argmax(validY, axis=1)

# Generate a classification report
classification_rep = classification_report(valid_true_labels, valid_predictions_labels)

# Print the classification report
print("Classification Report with Mean Squared Error Loss:")
print(classification_rep)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Classification Report with Mean Squared Error Loss:
              precision    recall  f1-score   support

           0       0.16      0.34      0.22      1017
           1       0.12      0.06      0.08       990
           2       0.14      0.10      0.11      1061
           3       0.16      0.07      0.10       994
           4       0.09      0.03      0.04       936
           5       0.21      0.32      0.25      1009
           6       0.16      0.07      0.10      1034
           7       0.24      0.19      0.21      1005
           8       0.21      0.60      0.31       956
           9       0.25      0.12      0.16       998

    accuracy                           0.19     10000
   macro avg       0.17      0.19      0.16     10000
weighted avg 

A significant accuracy decrease here, which is to be expected as categorical crossentropy is known to be the better loss function in this usage case. Still very happy to see up to this point that our regularization is keeping the overfitting in check throughout the process.

##### Final Model

After all our trials, we found the following to work best:
* SGD Learning Rate: 0.001
* Batch Size: 64
* Normalization
* Regularization: $L^2$ = 0.0001 with 0.5 and 0.3 dropout
* Activation: `ReLU`
* Loss Function: `categorical_crossentropy`

All of this achieved a best case accuracy of 37%. Overall not an accuracy to write home about, but keeping the model in check and preventing overfitting throughout is definitely something to be content with. Given more time, I would have experimented more with various parameters such as architecture and initialization which I believe have potential to bring the accuracy all the way up to at least a 50%. 