# Assignment 4 - Questions
## Q1
### a)
The main difference between the 'convolutions' in CNNs and the convolutions in other computer vision algorithms lies in how the kernel filters are constructed. In most computer vision algorithms, the kernel is designed by the user to perform a specific task (e.g. the derivative filters used by the Harris corner detector to find corners in the image). CNNs, on the other hand, learn the weights of the kernel filters from data, so the function of a given filter is not necessarily known to an outside observer.
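
As a rough illustration of this distinction, the sketch below applies a hand-designed Sobel kernel and a randomly initialized kernel to a toy image. The helper `conv2d_valid`, the toy image, and the random kernel are assumptions made for this example; a real CNN would go on to update the random kernel by gradient descent.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation, the sliding-window operation used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Hand-designed kernel: Sobel filter that responds to horizontal intensity changes.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Stand-in for a learned kernel: a CNN starts from random values like these
# and adjusts them during training.
rng = np.random.default_rng(0)
learned_init = rng.normal(scale=0.1, size=(3, 3))

# Toy image: dark left half, bright right half (one vertical edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edge_response = conv2d_valid(image, sobel_x)
random_response = conv2d_valid(image, learned_init)
print(edge_response)
```

The Sobel response is non-zero only around the vertical edge, which is what the filter was designed for, whereas the response of the random kernel has no predictable meaning until it has been trained.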
### b)
The main advantage of user-constructed filters is their algorithmic applicability. That is to say, because each filter and its functionality are known, different kernels can be selectively chosen to perform specific tasks depending on the algorithm. In the SIFT algorithm, for example, Gaussian and difference-of-Gaussian filters are used to detect keypoints in an image, which are used to represent fundamental features of the image. However, in many cases, human-defined characterizations of which features are fundamental to an image tend not to be effective. This is where CNNs have their main advantage. By simply defining a loss and an architecture, we allow a network to learn the fundamental features of an image from data, leading to locally optimal results for the given architecture. Of course, the main disadvantage of this approach is the lack of interpretability for a human observer, which makes neural network approaches 'black box' approaches.
## Q2
A locally connected MLP will likely have worse results than one that is densely connected. Each perceptron in a locally connected MLP takes as input only the outputs of the perceptrons closest to it in the previous layer, in contrast to a densely connected layer, where each perceptron takes all previous outputs. Although a locally connected MLP may require fewer operations, resulting in a faster network, it is likely that the locally connected features alone are insufficient for making a classification. In a famous parable, a group of blind men who have never come across an elephant try to imagine what an elephant is like by touching it. However, each can only touch one body part. Unsurprisingly, each man comes to a different conclusion about the elephant depending on the body part: the man who touches the trunk compares it to a snake, the man who touches the leg compares it to a tree, etc. We can think of each of these blind men as one of the learned convolutional feature maps in a CNN, where each convolutional filter learns a different feature of the image. A densely connected layer is important for combining the global context from every convolutional filter in order to make a decision.
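
To make the trade-off concrete, the following sketch counts the weights in one layer and tracks how far a unit can "see" through stacked locally connected layers. The sizes (784 inputs, as in a flattened 28x28 image, and a window of 9) and the function names are illustrative assumptions, and the model is simplified to a 1D arrangement of units.

```python
import math

def dense_weights(n_in, n_out):
    """Every output unit connects to every input unit."""
    return n_in * n_out

def local_weights(n_out, window):
    """Every output unit connects only to `window` neighbouring inputs."""
    return n_out * window

def receptive_field(num_layers, window):
    """Number of original inputs that can influence one unit after stacking layers."""
    return 1 + num_layers * (window - 1)

def layers_to_cover(n_in, window):
    """Locally connected layers needed before one unit can see all n_in inputs."""
    return math.ceil((n_in - 1) / (window - 1))

print(dense_weights(784, 784))    # 614656 weights
print(local_weights(784, 9))      # 7056 weights
print(receptive_field(2, 9))      # 17 inputs visible after two layers
print(layers_to_cover(784, 9))    # 98 layers to see the whole input
```

Under these assumptions the locally connected layer is far cheaper, but a single unit only gains global context after many layers, which is the gap a densely connected layer closes in one step.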
## Q3
Learning rate, batch size, and training time are all important hyperparameters to select when training a neural network. The learning rate is the step size used when moving along the negative gradient towards a minimum of the loss. A large step size may cause the network to overshoot the minimum and oscillate or diverge, never settling at an optimum. On the other hand, a small step size is much less likely to overshoot a local or global minimum but results in much slower training, since more steps are required to converge. Batch size is the number of training samples used to compute each gradient update; an epoch is one full pass over all batches. Large batch sizes are generally preferred for throughput since they allow faster computation through GPU parallelization, although they require more memory. A larger batch also gives a less noisy estimate of the true gradient, but it does not guarantee convergence to the global minimum of a non-convex objective. Moreover, it is found empirically that too large a batch size can lead to slower convergence per epoch as well as poorer generalization. Smaller batch sizes give noisier updates, which often helps the network escape poor minima and generalize well, at the cost of less stable convergence and less efficient use of the hardware. Lastly, training time denotes the number of epochs for which the network trains on the dataset. Training must be long enough for the network to learn the dataset; however, training for too long results in overfitting, or memorization of the training data, and thus poor generalization.
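
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, whose minimum is at $w = 0$; the function name, the starting point, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

small = gradient_descent(lr=0.01)  # moves towards 0, but is still far away after 20 steps
good = gradient_descent(lr=0.4)    # reaches the minimum quickly
large = gradient_descent(lr=1.1)   # every step overshoots by more than the last one

print(small, good, large)
```

With the large learning rate each update multiplies `w` by a factor of magnitude greater than one, so the iterate moves away from the minimum instead of towards it.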
## Q4
Effect of $1\times 1\times d$ max pooling layer
- Increases computational cost of training                  F
- Decreases computational cost of training                  T
- Increases computational cost of testing                   F
- Decreases computational cost of testing                   T

- Increases overfitting                                     F
- Decreases overfitting                                     T
- Increases underfitting                                    T
- Decreases underfitting                                    F

- Increases the nonlinearity of the decision function       T
- Decreases the nonlinearity of the decision function       F

- Provides local rotational invariance                      T
- Provides global rotational invariance                     T
- Provides local scale invariance                           F
- Provides global scale invariance                          T
- Provides local translational invariance                   T
- Provides global translational invariance                  T

## Q5
The single layer perceptron with 10 neurons is implemented as follows:

In [None]:
import numpy as np
import random
from code import hyperparameters as hp

def train_nn(self):
    # This is our training data as indices into an image storage array
    indices = list(range(self.train_images.shape[0]))

    # These are our storage variables for our update gradients.
    # delta_W is the matrix of gradients on the weights of our neural network
    #     Each column is a different neuron (with its own weights)
    # delta_b is the vector of gradients on the biases (one per neuron)
    delta_W = np.zeros((self.input_size, self.num_classes))
    delta_b = np.zeros((1, self.num_classes))

    # Iterate over the number of epochs declared in the hyperparameters.py file
    for epoch in range(hp.num_epochs):
        # Overall per-epoch sum of the loss
        loss_sum = 0
        # Shuffle the data before training each epoch to remove ordering bias
        random.shuffle(indices)

        # For each image in the dataset:
        for index in range(len(indices)):
            # Get input training image and ground truth label
            i = indices[index]
            img = self.train_images[i]
            gt_label = self.train_labels[i]

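            # Forward pass: class scores for this image, shape (1, num_classes)
            out = np.matmul(img, self.W) + self.b

            # Softmax turns the scores into class probabilities
            # (the small constant guards against division by zero)
            prob = np.exp(out) / (np.sum(np.exp(out)) + 1E-10)

            # One-hot encoding of the ground truth label
            y = np.zeros(self.num_classes)
            y[gt_label] = 1
            # Cross-entropy loss: negative log probability of the true class
            loss_over_all_classes = - np.log(prob + 1E-10)
            loss_sum += loss_over_all_classes[:, gt_label]

            # Gradients of the cross-entropy loss w.r.t. the biases and weights
            delta_b = prob - y
            delta_W = np.outer(img, delta_b)

            # Stochastic gradient descent step on this single image
            self.b -= self.learning_rate * delta_b
            self.W -= self.learning_rate * delta_W

        # Report the summed loss for this epoch
        print("Epoch " + str(epoch) + ": Total loss: " + str(loss_sum))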
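
The update rule above is consistent with the standard gradient of the softmax cross-entropy loss. As a brief sketch, write $x$ for `img`, $z$ for `out`, $p$ for `prob`, $y$ for the one-hot label, $c$ for `gt_label`, and $\eta$ for the learning rate (this notation is introduced here for illustration and ignores the small `1E-10` constants):

$$
z = xW + b, \qquad p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L = -\log p_c
$$

$$
\frac{\partial L}{\partial z_k} = p_k - y_k
\quad\Rightarrow\quad
\frac{\partial L}{\partial b} = p - y, \qquad
\frac{\partial L}{\partial W} = x^\top (p - y)
$$

$$
b \leftarrow b - \eta\,(p - y), \qquad W \leftarrow W - \eta\, x^\top (p - y)
$$

Here $x^\top (p - y)$ is the outer product computed by `np.outer(img, delta_b)`.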

The following are the performances on the different datasets with and without the SVM. The batch size is chosen to
be 4 to decrease runtime, the random weights used to initialize the neurons are
divided by 10 to keep the initial scores small and avoid overflow in the softmax and cross-entropy, and the
learning rate is set to 0.01 for the same reason.
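
The overflow concern can be illustrated with a small sketch. Large scores make `np.exp` overflow in a directly computed softmax; scaling down the initial weights keeps the scores small, and subtracting the maximum score is a common alternative safeguard. The functions `softmax_naive` and `softmax_stable` and the example scores are assumptions for illustration, not part of the assignment code.

```python
import numpy as np

def softmax_naive(logits):
    """Direct softmax; np.exp overflows to inf for large scores."""
    exps = np.exp(logits)
    return exps / np.sum(exps)

def softmax_stable(logits):
    """Same function mathematically, but shifted so the largest score is 0."""
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

large_logits = np.array([1000.0, 999.0, 998.0])

# Silence the overflow warnings so the nan result can be inspected.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large_logits)

stable = softmax_stable(large_logits)
print(naive)
print(stable)
```

The naive version returns `nan` for every class because it divides infinity by infinity, while the shifted version still produces valid probabilities.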
#### NN on MNIST
0 epochs

In [2]:
%cd code

C:\Users\Pan Lab\Documents\SYDE671ASS4\Question\code


In [3]:
#0 epochs
!python main.py -data mnist -mode nn

Epoch 0: Total loss: [22324.35980181]
nn model training accuracy: 91%


In [15]:
#9 epochs
!python main.py -data mnist -mode nn

Epoch 0: Total loss: [22361.69047881]
Epoch 1: Total loss: [18944.67126444]
Epoch 2: Total loss: [18158.54682276]
Epoch 3: Total loss: [17764.28092771]
Epoch 4: Total loss: [17609.96183968]
Epoch 5: Total loss: [17373.16453262]
Epoch 6: Total loss: [17325.65309289]
Epoch 7: Total loss: [17155.56097669]
Epoch 8: Total loss: [16978.27277752]
Epoch 9: Total loss: [16876.34491097]
nn model training accuracy: 92%


The accuracy stays largely the same between 1 epoch and 10 epochs (91% vs. 92%), despite the loss decreasing.
This is likely because the learning rate is small and because the accuracy is already
high after the first epoch, leaving little room for improvement.
#### NN+SVM on MNIST

In [9]:
# 0 epochs
!python main.py -data mnist -mode nn+svm


Epoch 0: Total loss: [22194.22858588]
nn+svm model training accuracy: 91%


In [11]:
#9 epochs
!python main.py -data mnist -mode nn+svm


Epoch 0: Total loss: [22349.5672069]
Epoch 1: Total loss: [18873.20031704]
Epoch 2: Total loss: [18183.25423491]
Epoch 3: Total loss: [17787.47851326]
Epoch 4: Total loss: [17570.2050617]
Epoch 5: Total loss: [17345.27009409]
Epoch 6: Total loss: [17247.77519733]
Epoch 7: Total loss: [17139.15738445]
Epoch 8: Total loss: [16968.56722361]
Epoch 9: Total loss: [16945.29344995]
nn+svm model training accuracy: 91%


As can be seen, performance doesn't improve with the SVM (91% with the SVM, compared to 91% to 92% without).
This is likely because the dataset is very simple
and the neural network alone is sufficient for classification.

#### NN on scenerec

In [12]:
# 0 epochs
!python main.py -data scenerec -mode nn

Epoch 0: Total loss: [11703.76179277]
nn model training accuracy: 12%


In [13]:
# 9 epochs
!python main.py -data scenerec -mode nn

Epoch 0: Total loss: [11637.92271927]
Epoch 1: Total loss: [10614.0663993]
Epoch 2: Total loss: [10041.60088724]
Epoch 3: Total loss: [9327.49023939]
Epoch 4: Total loss: [8931.12291809]
Epoch 5: Total loss: [8331.23461914]
Epoch 6: Total loss: [7726.60796492]
Epoch 7: Total loss: [7259.54056056]
Epoch 8: Total loss: [7178.83979744]
Epoch 9: Total loss: [6916.56043638]
nn model training accuracy: 15%


In this case, there is a slight increase in training accuracy between
the first epoch and the tenth (epochs 0 and 9 in the output above), from 12% to 15%, although performance remains
relatively poor.
#### NN+SVM on scenerec

In [3]:
# 0 epochs
!python main.py -data scenerec -mode nn+svm

Epoch 0: Total loss: [11774.73438813]
nn+svm model training accuracy: 24%


In [4]:
# 9 epochs
!python main.py -data scenerec -mode nn+svm

Epoch 0: Total loss: [11974.44627687]
Epoch 1: Total loss: [10898.54619586]
Epoch 2: Total loss: [9944.78743449]
Epoch 3: Total loss: [9118.73405276]
Epoch 4: Total loss: [8649.65384342]
Epoch 5: Total loss: [8248.76335881]
Epoch 6: Total loss: [8164.13652592]
Epoch 7: Total loss: [7338.03317666]
Epoch 8: Total loss: [7028.96348365]
Epoch 9: Total loss: [7015.46219634]
nn+svm model training accuracy: 21%


In this case, the SVM shows an improvement over the NN-only
case (24% vs. 12% after one epoch, 21% vs. 15% after ten). However, further training of the model results in
poorer performance, even though the performance is quite poor already.
We can attribute the overall poor performance of the model to its lack
of complexity: the model is not complex enough to represent the complex
dataset. This is in contrast to the simple MNIST dataset, where 91% accuracy
was easily achieved. The decrease in performance while training can perhaps
be attributed to random movement, where the network
cannot find the proper gradient in the feature space and thus moves in an essentially
random direction.
