## REI602M Machine Learning - Homework 8
### Due: *Thursday* 18.3.2021

**Objectives**: Neural networks, gradient descent, convolutional neural networks, Keras.

**Name**: Alexander Guðmundsson, **email:** alg35@hi.is, **collaborators:** (if any)

Problem 1 on this week's assignment is taken from the CS231n machine learning course at Stanford University. Solutions to the problem are easily available on the web, but I trust that you will not look them up and will solve the problem on your own (feel free to collaborate though). This problem will take some time, but when you complete it you will have acquired a solid understanding of the backpropagation algorithm for training neural networks. The second problem is more straightforward. Its solution depends on the Keras package, though, so you will have to either install TensorFlow or use Google's Colab.

Please provide your solutions by filling in the appropriate cells in this notebook, creating new cells as needed. Hand in your solution on Gradescope, taking care to locate the appropriate page numbers in the PDF document. Make sure that you are familiar with the course rules on collaboration (encouraged) and copying (very, very bad).

### 1\. [Implementing a Neural network, 60 points]
In this problem you will implement a feedforward neural network with a single hidden layer and use it to classify images from the CIFAR-10 dataset. The hidden layer has weights $W_1$ and biases $b_1$, and the output layer has weights $W_2$ and biases $b_2$. The nodes in the hidden layer use ReLU activation, $g(z)=\max(0,z)$, and the output layer uses the vector-valued softmax function,
$$
f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}
$$
where $z$ is a real valued vector. The softmax function is a generalization of the sigmoid function for multiple classes. It transforms the values in $z$ (called class scores) to values between zero and one that sum to one. The $f$ values can then be interpreted as class probabilities during prediction.
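
The definition above can be sketched in NumPy as follows. This is a minimal illustration, not part of the provided course code; the function name `softmax` and the example scores are chosen here for demonstration. Subtracting the row maximum before exponentiating is a common stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax for an array of class scores with shape (N, C)."""
    # Subtracting the row maximum leaves the ratio unchanged (the common
    # factor cancels) but keeps np.exp from overflowing on large scores.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # a naive exp would overflow here
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))  # each row sums to one
```

Equal scores, as in the second row, map to equal class probabilities.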

The objective to be minimized is regularized cross-entropy loss,
$$
L = \frac{1}{N}\sum_{i=1}^N L_i + \lambda (||W_1||_F^2 + ||W_2||_F^2)
$$
where $N$ is the number of training examples, and $L_i$ is the loss for element $i$,
$$
L_i = -\log \left( \frac{e^{z_{y_i}}}{\sum_j e^{z_j}} \right) = -z_{y_i} + \log{\sum_j e^{z_j}}
$$
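
As a worked check of the two formulas above, the sketch below evaluates the second form of $L_i$ (score of the correct class plus log-sum-exp) and adds the squared Frobenius norm penalty. It is a hedged illustration only: the function name `cross_entropy_loss` and its arguments are hypothetical and are not the interface of `TwoLayerNet.loss`.

```python
import numpy as np

def cross_entropy_loss(scores, y, weights, reg):
    """Mean softmax cross-entropy over N examples plus an L2 penalty.

    scores:  (N, C) array of class scores z
    y:       (N,) array of integer labels
    weights: list of weight matrices to penalize (e.g. W1 and W2)
    reg:     regularization strength lambda
    """
    n = scores.shape[0]
    # Shifting by the row maximum does not change L_i but avoids overflow.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=1))
    # L_i = -z_{y_i} + log(sum_j exp(z_j)), evaluated on the shifted scores
    per_example = -shifted[np.arange(n), y] + log_sum_exp
    penalty = reg * sum(np.sum(W * W) for W in weights)
    return per_example.mean() + penalty

uniform = np.zeros((2, 4))          # identical scores for 4 classes
labels = np.array([0, 3])
loss = cross_entropy_loss(uniform, labels, weights=[np.ones((2, 2))], reg=0.5)
print(loss)  # log(4) from the data term plus 0.5 * 4 from the penalty
```

With identical scores for $C$ classes the data term equals $\log C$, which is a useful sanity check for a freshly initialized network.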

Training uses *mini-batch gradient descent* where a single gradient update is based on multiple examples, instead of all the examples (batch gradient descent) or a single example (stochastic gradient descent).
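
The difference between the three variants comes down to how many rows are drawn per update. The sketch below shows one way to draw a mini-batch with NumPy, using random toy data and a hypothetical `batch_size`; sampling without replacement is a choice made here for illustration, and sampling with replacement is also common.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for a training set: 1000 examples with 8 features.
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 3, size=1000)

# Hypothetical value: 1 would give SGD, len(X) would give batch gradient descent.
batch_size = 200

# Draw random row indices and use them to slice data and labels together,
# so that each example stays paired with its label.
idx = rng.choice(X.shape[0], size=batch_size, replace=False)
X_batch, y_batch = X[idx], y[idx]
print(X_batch.shape, y_batch.shape)
```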

*Comments*:

1) The lecture notes from CS231n on Softmax http://cs231n.github.io/linear-classify/#loss, feedforward neural networks http://cs231n.github.io/neural-networks-1/, backpropagation http://cs231n.github.io/optimization-2/ and mini-batch gradient descent http://cs231n.github.io/optimization-1/#gd will be helpful. Andrew Ng's notes from CS229 http://cs229.stanford.edu/notes2019fall/cs229-notes-deep_learning.pdf may also come in handy with derivation of the gradient updates.

2) To classify an example $x$, it is sent through the network and assigned to the class with the highest probability value.

3) The CIFAR-10 dataset is described here: http://www.cs.toronto.edu/~kriz/cifar.html

In [1]:
# A bit of setup

import numpy as np
import matplotlib.pyplot as plt

from nn.neural_net import TwoLayerNet

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


We will use the class `TwoLayerNet` in the file `nn/neural_net.py` to represent instances of our network. The network parameters are stored in the instance variable `self.params` where keys are string parameter names and values are numpy arrays. 

**While working on the problem it is probably best to edit `neural_net.py` outside the notebook. Once you are done, copy the relevant parts into the cells below**.

In [None]:
# Code from TwoLayerNet.loss
    
        #############################################################################
        # TODO: Perform the forward pass, computing the class scores for the input. #
        # Store the result in the scores variable, which should be an array of      #
        # shape (N, C).                                                             #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # If the targets are not given then jump out, we're done
        if y is None:
            return scores

        # Compute the loss
        loss = None
        #############################################################################
        # TODO: Finish the forward pass, and compute the loss. This should include  #
        # both the data loss and L2 regularization for W1 and W2. Store the result  #
        # in the variable loss, which should be a scalar. Use the Softmax           #
        # classifier loss.                                                          #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Backward pass: compute gradients
        grads = {}
        #############################################################################
        # TODO: Compute the backward pass, computing the derivatives of the weights #
        # and biases. Store the results in the grads dictionary. For example,       #
        # grads['W1'] should store the gradient on W1, and be a matrix of same size #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

In [None]:
# Code from TwoLayerNet.train

            #########################################################################
            # TODO: Create a random minibatch of training data and labels, storing  #
            # them in X_batch and y_batch respectively.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            pass

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            # Compute loss and gradients using the current minibatch
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)

            #########################################################################
            # TODO: Use the gradients in the grads dictionary to update the         #
            # parameters of the network (stored in the dictionary self.params)      #
            # using stochastic gradient descent. You'll need to use the gradients   #
            # stored in the grads dictionary defined above.                         #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            pass

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

In [None]:
# Code from TwoLayerNet.predict

        ###########################################################################
        # TODO: Implement this function; it should be VERY simple!                #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

Below, we initialize toy data and a toy model that we will use to develop your implementation.

In [None]:
# Create a small net and some toy data to check your implementations.
# Note that we set the random seed for repeatable experiments.

input_size = 4
hidden_size = 10
num_classes = 3
num_inputs = 5

def init_toy_model():
    np.random.seed(0)
    return TwoLayerNet(input_size, hidden_size, num_classes, std=1e-1)

def init_toy_data():
    np.random.seed(1)
    X = 10 * np.random.randn(num_inputs, input_size)
    y = np.array([0, 1, 2, 2, 1])
    return X, y

net = init_toy_model()
X, y = init_toy_data()

### Forward pass: compute scores
Open the file `nn/neural_net.py` and look at the method `TwoLayerNet.loss`. The function takes the data and weights and computes the class scores (weighted sums at the output nodes), the loss, and the gradients on the parameters.

Implement the first part of the forward pass which uses the weights and biases to compute the scores for all inputs.

In [None]:
scores = net.loss(X)
print('Your scores:')
print(scores)
print()
print('correct scores:')
correct_scores = np.asarray([
  [-0.81233741, -1.27654624, -0.70335995],
  [-0.17129677, -1.18803311, -0.47310444],
  [-0.51590475, -1.01354314, -0.8504215 ],
  [-0.15419291, -0.48629638, -0.52901952],
  [-0.00618733, -0.12435261, -0.15226949]])
print(correct_scores)
print()

# The difference should be very small. We get < 1e-7
print('Difference between your scores and correct scores:')
print(np.sum(np.abs(scores - correct_scores)))

### Forward pass: compute loss
In the same function, implement the second part that computes the data and regularization loss.

In [None]:
loss, _ = net.loss(X, y, reg=0.05)
correct_loss = 1.30378789133

# should be very small, we get < 1e-12
print('Difference between your loss and correct loss:')
print(np.sum(np.abs(loss - correct_loss)))

### Backward pass
Implement the rest of the function. This will compute the gradient of the loss with respect to the variables `W1`, `b1`, `W2`, and `b2`. Now that you (hopefully!) have a correctly implemented forward pass, you can debug your backward pass using a numeric gradient check:

In [None]:
from nn.gradient_check import eval_numerical_gradient

# Use numeric gradient checking to check your implementation of the backward pass.
# If your implementation is correct, the difference between the numeric and
# analytic gradients should be less than 1e-8 for each of W1, W2, b1, and b2.

loss, grads = net.loss(X, y, reg=0.05)

# these should all be less than 1e-8 or so
for param_name in grads:
    f = lambda W: net.loss(X, y, reg=0.05)[0]
    param_grad_num = eval_numerical_gradient(f, net.params[param_name], verbose=False)
    print('%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))
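
A numeric gradient check of this kind is usually built on centered differences. The following is a minimal sketch under that assumption; `numerical_gradient` is a hypothetical helper written for illustration and is not the course's `eval_numerical_gradient`, whose implementation may differ.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx for a scalar-valued f.

    x is perturbed in place one entry at a time and restored afterwards.
    """
    grad = np.zeros_like(x)
    for ix in np.ndindex(x.shape):
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)
        x[ix] = old - h
        f_minus = f(x)
        x[ix] = old  # restore the original value
        grad[ix] = (f_plus - f_minus) / (2 * h)
    return grad

# Check against a function with a known gradient: f(w) = sum(w^2), df/dw = 2w.
w = np.array([[1.0, -2.0], [0.5, 3.0]])
num = numerical_gradient(lambda v: np.sum(v ** 2), w)
print(np.max(np.abs(num - 2 * w)))
```

Because the array is modified in place, a function that reads the same array through another reference sees each perturbation. That would explain why the lambda in the cell above can ignore its argument `W`, assuming the course helper works the same way.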

### Train the network
To train the network we will use stochastic gradient descent (SGD), similar to the SVM and logistic regression classifiers. Look at the function `TwoLayerNet.train` and fill in the missing sections to implement the training procedure. You will also have to implement `TwoLayerNet.predict`, as the training process periodically performs prediction to keep track of accuracy over time while the network trains.

Once you have implemented the method, run the code below to train a two-layer network on toy data. You should achieve a training loss less than 0.02.

In [None]:
net = init_toy_model()
stats = net.train(X, y, X, y,
            learning_rate=1e-1, reg=5e-6,
            num_iters=100, verbose=False)

print('Final training loss: ', stats['loss_history'][-1])

# plot the loss history
plt.plot(stats['loss_history'])
plt.xlabel('iteration')
plt.ylabel('training loss')
plt.title('Training Loss history')
plt.show()

### Load the CIFAR-10 data set
Now that you have implemented a two-layer network that passes gradient checks and works on toy data, it's time to load up our favorite CIFAR-10 data so we can use it to train a classifier on a real dataset.

Download the data from http://notendur.hi.is/steinng/kennsla/2021/ml/data/cifar-10-batches-py.zip and extract into the `nn/datasets` folder.

In [None]:
from nn.data_utils import load_CIFAR10

def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for the two-layer neural net classifier. These are the same steps as
    we used for the SVM, but condensed to a single function.  
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'nn/datasets/cifar-10-batches-py'
    
    # Clean up variables to prevent loading data multiple times (which may cause memory issues)
    try:
        del X_train, y_train
        del X_test, y_test
        print('Clear previously loaded data.')
    except NameError:
        pass

    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
        
    # Subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean image
    mean_image = np.mean(X_train, axis=0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image

    # Reshape data to rows
    X_train = X_train.reshape(num_training, -1)
    X_val = X_val.reshape(num_validation, -1)
    X_test = X_test.reshape(num_test, -1)

    return X_train, y_train, X_val, y_val, X_test, y_test


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

### Train a network
To train our network we will use SGD. In addition, we will adjust the learning rate with an exponential learning rate schedule as optimization proceeds; after each epoch, we will reduce the learning rate by multiplying it by a decay rate.
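
The schedule described above is a geometric sequence: after $k$ epochs the learning rate is the initial rate multiplied by the decay rate $k$ times. The sketch below prints the first few values using the initial rate and decay rate from this assignment's training call; `num_epochs` is a hypothetical value picked for illustration.

```python
learning_rate = 1e-4         # initial learning rate used in this assignment
learning_rate_decay = 0.95   # decay rate used in this assignment
num_epochs = 5               # hypothetical number of epochs, for illustration

schedule = []
for epoch in range(num_epochs):
    schedule.append(learning_rate)
    # After each epoch, shrink the learning rate by the decay factor.
    learning_rate *= learning_rate_decay

for epoch, lr in enumerate(schedule):
    print(f'epoch {epoch}: learning rate {lr:.3e}')
```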

In [None]:
input_size = 32 * 32 * 3
hidden_size = 50
num_classes = 10
net = TwoLayerNet(input_size, hidden_size, num_classes)

# Train the network
stats = net.train(X_train, y_train, X_val, y_val,
            num_iters=1000, batch_size=200,
            learning_rate=1e-4, learning_rate_decay=0.95,
            reg=0.25, verbose=True)

# Predict on the validation set
val_acc = (net.predict(X_val) == y_val).mean()
print('Validation accuracy: ', val_acc)

### Debug the training
With the default parameters we provided above, you should get an accuracy of about 0.29 on the validation set. This isn't very good.

One strategy for getting insight into what's wrong is to plot the loss function and the accuracies on the training and validation sets during optimization.

Another strategy is to visualize the weights that were learned in the first layer of the network. In most neural networks trained on visual data, the first layer weights typically show some visible structure when visualized.

In [None]:
# Plot the loss function and train / validation accuracies
plt.subplot(2, 1, 1)
plt.plot(stats['loss_history'])
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(stats['train_acc_history'], label='train')
plt.plot(stats['val_acc_history'], label='val')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.legend()
plt.show()

In [None]:
from nn.vis_utils import visualize_grid

# Visualize the weights of the network

def show_net_weights(net):
    W1 = net.params['W1']
    W1 = W1.reshape(32, 32, 3, -1).transpose(3, 0, 1, 2)
    plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))
    plt.gca().axis('off')
    plt.show()

show_net_weights(net)

### Tune your hyperparameters

**What's wrong?** Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity, and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.

**Tuning**. Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using neural networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including hidden layer size, learning rate, number of training epochs, and regularization strength. You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.

**Approximate results**. You should aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.

**Experiment**: Your goal in this exercise is to get as good a result on CIFAR-10 as you can (52% could serve as a reference) with a fully connected neural network. Feel free to implement your own techniques (e.g. PCA to reduce dimensionality, adding dropout, or adding features to the solver).

**Explain your hyperparameter tuning process below.**

$\color{blue}{\textit{Your Answer:}}$

In [None]:
best_net = None # store the best model into this 

#################################################################################
# TODO: Tune hyperparameters using the validation set. Store your best trained  #
# model in best_net.                                                            #
#                                                                               #
# To help debug your network, it may help to use visualizations similar to the  #
# ones we used above; these visualizations will have significant qualitative    #
# differences from the ones we saw above for the poorly tuned network.          #
#                                                                               #
# Tweaking hyperparameters by hand can be fun, but you might find it useful to  #
# write code to sweep through possible combinations of hyperparameters          #
# automatically like we did on the previous exercises.                          #
#################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

pass

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
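
The automatic sweep suggested in the TODO comment can be organized as a random search. The sketch below shows only the loop structure: `validation_accuracy` is a hypothetical stand-in with a made-up formula so that the example runs on its own, and the search ranges are illustrative assumptions rather than recommended values. In a real sweep that function would train a network and evaluate it on the validation set.

```python
import numpy as np

def validation_accuracy(hidden_size, learning_rate, reg):
    """Hypothetical stand-in for 'train a network, return validation accuracy'.

    The formula is invented purely so the loop below is runnable.
    """
    return (0.5
            - 0.01 * abs(np.log10(learning_rate) + 3)
            - 0.01 * abs(np.log10(reg) + 1)
            + 0.0001 * min(hidden_size, 100))

rng = np.random.default_rng(0)
best_acc, best_config = -1.0, None

for _ in range(20):
    config = {
        'hidden_size': int(rng.choice([50, 100, 150])),
        # Sample learning rate and regularization strength on a log scale,
        # since their useful values span several orders of magnitude.
        'learning_rate': 10 ** rng.uniform(-4, -2),
        'reg': 10 ** rng.uniform(-3, 0),
    }
    acc = validation_accuracy(**config)
    if acc > best_acc:
        best_acc, best_config = acc, config

print(best_config, best_acc)
```

Keeping the best configuration (and, in a real sweep, the trained model) inside the loop avoids retraining once the search is finished.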


In [None]:
# visualize the weights of the best network
show_net_weights(best_net)

### Run on the test set
When you are done experimenting, you should evaluate your final trained network on the test set; you should get above 48%.

In [None]:
test_acc = (best_net.predict(X_test) == y_test).mean()
print('Test accuracy: ', test_acc)

**Inline Question**

Now that you have trained a Neural Network classifier, you may find that your testing accuracy is much lower than the training accuracy. In what ways can we decrease this gap? Select all that apply.

1. Train on a larger dataset.
2. Add more hidden units.
3. Increase the regularization strength.
4. None of the above.

$\color{blue}{\textit{Your Answer:}}$

$\color{blue}{\textit{Your Explanation:}}$



### 2\. [Image classification with convolutional neural networks, 40 points]

Here you construct three convolutional neural networks using the Keras library in TensorFlow and apply them to the CIFAR-10 image data set (see comments below) from problem 1 (where you should have obtained approx. 50% accuracy). Human accuracy on CIFAR-10 is approximately 94%, while state-of-the-art CNNs achieve around 99% accuracy.

Starting with a very simple convolutional network you move on to more sophisticated architectures with the aim of improving classifier accuracy.

In the following, INPUT denotes the input layer, FC-$n$ denotes a fully connected layer with $n$ nodes, CONV-$m$ represents a 2D convolutional layer with $m$ filters, POOL corresponds to a 2D pooling layer, RELU to ReLU activation units, and [...]\*n denotes $n$ repetitions of the units inside the brackets. The last layer (FCS) denotes a fully connected layer with 10 nodes and softmax activation (this is the classification step). Use dropout for regularization, and apply it only after FC layers.

For each of the networks below, report the training, validation and test set accuracy. Summarize the results in a table.
Monitor the accuracy during training and stop when the validation accuracy no longer improves. Visualize the loss and training/validation error during training (you can include these graphs as screenshots in the notebook).
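
The stopping rule described above can be made precise with a patience counter. The following is a plain-Python sketch of that logic, not the Keras implementation; the function `early_stopping_epoch` is hypothetical, the parameter name `patience` is chosen to echo the Keras `EarlyStopping` callback, and the accuracy values are made up.

```python
def early_stopping_epoch(val_acc_history, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation accuracy has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens, the
    index of the last epoch is returned.
    """
    best_acc = float('-inf')
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_acc_history):
        if acc > best_acc:
            best_acc = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_acc_history) - 1

# Made-up validation accuracies, for illustration only.
history = [0.40, 0.48, 0.55, 0.54, 0.55, 0.53, 0.52, 0.51]
stop = early_stopping_epoch(history, patience=3)
print(stop)  # 5: epochs 3, 4 and 5 all fail to beat the 0.55 reached at epoch 2
```

A tie with the best value does not count as an improvement in this sketch, which is a design choice made here.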

a) INPUT -> CONV-12 -> RELU -> POOL -> FCS (minimalistic CNN).

b) INPUT -> CONV-32 -> RELU -> CONV-64 -> RELU -> POOL -> FC-128 -> FCS

c) [CONV-32 -> RELU]\*2 -> POOL -> [CONV-64 -> RELU]\*2 -> POOL -> FC-512 -> RELU -> FCS (simplified VGGnet).

d) Retrain the network from c) using *data augmentation*. Use the `ImageDataGenerator` class in Keras to generate new examples during training. Comment briefly on the type of mistakes that your final network makes, e.g. by inspecting a confusion matrix for the test set. How does your classifier compare to the results listed here: https://benchmarks.ai/cifar-10 ?
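
For the error analysis in d), a confusion matrix can be built directly with NumPy. The sketch below is one possible way to do it; the helper name `confusion_matrix` and the three-class labels are made up for illustration and stand in for the true and predicted classes of the test set.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Entry [i, j] counts examples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when the same (i, j) pair repeats.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels and predictions for a 3-class problem, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)

# Zero the diagonal (correct predictions) to find the most frequent mistake.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(np.argmax(errors), errors.shape)
print(f'most common mistake: true class {true_cls} predicted as {pred_cls}')
```

The diagonal holds the correct predictions, so large off-diagonal entries point to the pairs of classes that the network confuses most often.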

*Comments*:
* Implement your networks using TensorFlow.Keras. You can see examples of fully connected and convolutional networks in `Code examples (12.3).zip` on Canvas.
* Use GPU acceleration if possible (Colab: Runtime menu - Change runtime type).
* To load the data you can use the `get_CIFAR10_data` function from problem 1 with `num_training=45000, num_validation=5000, num_test=10000`, after removing the three lines that reshape the data to rows (Keras expects each image as a 3D array).
* If you use Colab and have problems with the `get_CIFAR10_data` function on Google drive, you can use `(x_tr, y_tr), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()` instead. The training set has 50,000 examples. Set 5,000 of these examples aside as a validation set.
* Subtract the mean from the images prior to training and divide by the standard deviation:
```python
mean_image = np.mean(x_train, axis=0)
std_image = np.std(x_train)
x_train = (x_train - mean_image) / std_image
x_val = (x_val - mean_image) / std_image
x_test = (x_test - mean_image) / std_image
```

* Use the Adam optimizer.
* Regularization of convolutional layers does not seem to be very effective. Fully connected layers need regularization to prevent overfitting. Dropout with $p=0.5$ is usually quite effective.
* Use `padding="same"` to zero-pad the input to convolutional layers.
* You can continue training a model by calling `model.fit` repeatedly.
* To save a model use `model.save(filename)`. You may also want to look into model checkpoints and early stopping. See `ModelCheckpoint` and `EarlyStopping` in the Keras documentation.
* The CIFAR-10 "high score" was obtained by training giant deep networks on huge image databases in order to learn feature maps relevant to image classification. The networks were then fine-tuned on CIFAR-10 (this is an example of *transfer learning*).
* When the amount of training data is small in relation to the number of parameters in a model, overfitting becomes an issue. In many specialized image recognition tasks such as tumor classification, the amount of labeled data is often quite limited and a state-of-the-art convolutional network is likely to severely overfit the data set. Data augmentation refers to techniques that create additional training examples from the original data set. For image data it is possible to create additional training examples by simple operations such as reflection, cropping and translation as well as by changing the color palette.
* You need to think about the parameter settings for `ImageDataGenerator` in the context of CIFAR-10. See e.g. https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
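
To make the reflection and translation operations mentioned in the comments concrete, here is a NumPy-only sketch applied to a single image. It is an illustration of the idea and not how `ImageDataGenerator` is implemented; the function `augment`, the `max_shift` value and the random stand-in image are all assumptions made for this example.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Return a randomly flipped and translated copy of an (H, W, C) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal reflection
    # Translate by zero-padding the border and cropping a shifted window
    # of the original size.
    h, w, _ = out.shape
    padded = np.pad(out, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # random stand-in for one CIFAR-10 image
augmented = augment(image, rng)
print(augmented.shape)
```

Horizontal flips and small shifts keep a CIFAR-10 object recognizable, whereas vertical flips or large rotations may not, which is the kind of consideration the last comment refers to.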

In [None]:
# a) Insert your code here
# ...

In [None]:
# b) Insert your code here
# ...

In [None]:
# c) Insert your code here
# ...

In [None]:
# d) Insert your code here
# ...