# PS5 Coding #

This assignment will have us build a deep convolutional residual neural network that also incorporates dropout.

This network will use the softmax function to perform 10-class image classification on the MNIST data set (the original MNIST, not the fashion_mnist we've been working with thus far).


But before you get started, please make sure you have the following packages installed.
## Packages to install:
1. numpy
2. keras
3. tensorflow
4. matplotlib

For keras and tensorflow, please refer to this link (https://docs.floydhub.com/guides/environments/) to make sure you install versions that are compatible with each other. I would highly recommend getting tensorflow==1.14.1 and the compatible keras version. The exact python version, as long as it's python3+, should not impact your ability to use these two packages.

## Structure of Assignment ##

What's new compared to PS4:
1. **Skip Layer Residual Connections**
2. **Dropout**


## Terminology
Please look over the power point under Piazza > Resources > ConvolutionalNetwork.ppt to make sure you understand exactly what I mean when I type the following terms:
1. Applying a kernel/filter
2. Kernel/Filter
3. Max Pool
4. Feature map
5. Convolution

## Network Architecture
![title](./ps5_img/ResNet1.png)

It is quite similar to PS4, and in fact I recommend that the first step you take is to:

## copy your code from PS4 in the cells below


Remember the following equation: **out_shape is an integer greater than 1**, where $$out\_shape = \frac{input\_shape + 2 * padding - kernel\_shape}{stride} + 1$$
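
As a quick sanity check of the equation above, here is a minimal sketch that computes `out_shape` and rejects settings that do not give an integer greater than 1. The function name `conv_out_shape` and its default arguments (stride 1, no padding) are assumptions made for illustration, and the 12x12 input in the second call is just an example value.

```python
def conv_out_shape(input_shape, kernel_shape, stride=1, padding=0):
    """Output side length of a square convolution, per the equation above."""
    numerator = input_shape + 2 * padding - kernel_shape
    # The division by stride must come out even, otherwise out_shape is not an integer
    if numerator % stride != 0:
        raise ValueError("(input + 2*padding - kernel) must be divisible by stride")
    out_shape = numerator // stride + 1
    if out_shape <= 1:
        raise ValueError("out_shape must be an integer greater than 1")
    return out_shape


# A 28x28 MNIST image with a 4x4 kernel, stride 1, no padding
first_conv = conv_out_shape(28, 4)
# A 12x12 feature map with the same kernel settings
second_conv = conv_out_shape(12, 4)
print(first_conv, second_conv)  # prints "25 9"
```

Running a check like this before building the kernel matrices makes shape mismatches show up early rather than in the middle of forward propagation.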


## Assignment Grading and Procedure Recommendation ##
1. If you correctly implement the residual skip layer connections with a single kernel at each conv layer, you will receive 75%
2. Correct implementation of dropout will receive 20%
3. Experimentation section will give the last 5%

Here is how I recommend going about this assignment:
1. Copy over your code from PS4
2. Modify code so that it has the residual connections
3. Implement dropout feature
4. Experiment with parameter tuning of your convolutional residual network!

### A note on "different kernel shape": 
For a convolutional network, all kernels applied at the same layer will be the **same shape**. When I say that your network should be able to handle different kernel shapes, I mean that if you change the kernel shape at a given layer, **all kernels applied on that layer will adopt the new shape**.

For instance, Kernel1 begins as a 4x4 kernel. This means if I wanted to apply multiple Kernel1's to the input layer, then I will apply multiple **4x4 kernels** (they are all the same shape). If I change my network such that Kernel1 is now a **6x6 kernel** that means ALL applications of kernel1 will now be 6x6.
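
One way to make this rule hard to break is to create all kernels for a layer from a single shape argument. The sketch below assumes a hypothetical helper, `make_layer_kernels`, and uses plain standard-normal values in place of whatever initialization you choose.

```python
import numpy as np


def make_layer_kernels(num_kernels, kernel_size, rng):
    """Return num_kernels square kernels that all share one shape."""
    return [rng.standard_normal((kernel_size, kernel_size))
            for _ in range(num_kernels)]


rng = np.random.default_rng(0)

# Kernel1 starts out as three 4x4 kernels
kernel1 = make_layer_kernels(num_kernels=3, kernel_size=4, rng=rng)

# Changing the layer's kernel shape changes every kernel applied at that layer
kernel1_resized = make_layer_kernels(num_kernels=3, kernel_size=6, rng=rng)

print([k.shape for k in kernel1])
print([k.shape for k in kernel1_resized])
```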

### Data Format

You will notice that this assignment has very few headers and comments. I am leaving it up to you to decide exactly what information each function needs as parameters, and what each function does and returns. Feel free to use the previous problem sets as models for how to structure your code. I recommend you continue to format your data as N x M, where:

1. N = number of features
2. M = number of data points
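
To illustrate the N x M layout, here is a minimal sketch that flattens a stack of images into one column per data point. It assumes the images arrive as an array of shape (M, height, width); the helper name `to_features_by_samples` and the tiny fake input are illustrative only.

```python
import numpy as np


def to_features_by_samples(images):
    """Convert (M, H, W) images into an (H*W, M) matrix scaled to [0, 1]."""
    num_samples = images.shape[0]
    # Flatten each image into a row, then transpose so each column is one image
    flat = images.reshape(num_samples, -1).T.astype(np.float64)
    max_value = flat.max()
    # Guard against an all-zero input to avoid dividing by zero
    return flat / max_value if max_value > 0 else flat


# Two fake 3x3 "images" with pixel values 0..17
fake_images = np.arange(2 * 3 * 3).reshape(2, 3, 3)
x = to_features_by_samples(fake_images)
print(x.shape)  # (9, 2): N = 9 features, M = 2 data points
```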


## Residual Skip Layer connections

Your model will have two skip layer connections: the first adds the output of maxpool_1 to the output of conv_3, and the second adds the output of conv_3 to the output of conv_5.

Remember that at each residual connection, you simply add the two outputs together. For instance, at the first residual connection, the output of conv_3 would become: $$Conv3 = Conv3 + MaxPool1$$.

At the second residual connection, we would get the following equation: $$Conv5 = Conv5 + (Conv3 + MaxPool1)$$

Your task is to implement these skip layer connections. 

Remember that the skip layer connection itself is not learnable, it is simply an identity mapping. 
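
Because the connection is a plain addition, its backward pass simply hands the incoming gradient to both branches unchanged. The sketch below shows this on small arrays; the names `skip_add` and `skip_add_backward` are hypothetical, and the array values are made up for illustration.

```python
import numpy as np


def skip_add(main_path, skip_path):
    """Forward pass of a residual connection: element-wise addition."""
    if main_path.shape != skip_path.shape:
        raise ValueError("both inputs to a skip connection must share a shape")
    return main_path + skip_path


def skip_add_backward(upstream_grad):
    """Backward pass: d(a + b)/da = d(a + b)/db = 1, so both branches get the same gradient."""
    return upstream_grad, upstream_grad


maxpool1 = np.array([[1.0, 2.0], [3.0, 4.0]])
conv3 = np.array([[0.5, -1.0], [2.0, 0.0]])

# Conv3 = Conv3 + MaxPool1
conv3_out = skip_add(conv3, maxpool1)

# The gradient arriving at the sum flows to both the conv path and the skip path
grad_conv3, grad_maxpool1 = skip_add_backward(np.ones_like(conv3_out))
print(conv3_out)
```

In a full network, the gradient reaching maxpool_1 is then the sum of the gradient from the skip path and the gradient that travels back through the convolutional path.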

## Dropout

The core idea behind dropout is quite simple: for each node in the network, flip a weighted coin. If the coin returns heads, keep the node in the network, else "drop" it during training.

Practically, this looks like the following: after the final computation of a layer's output (including AFTER the residual addition), compute a binary vector, $p$, in N-space, where N is the number of nodes in that layer. Each entry of $p$ should be drawn independently at random (for example, by thresholding a uniform random number), with $\mu$ chance of $p_i = 1$, and $1-\mu$ chance of $p_i = 0$. You then element-wise multiply this binary vector with the output of that layer.

The binary mask should be generated randomly for every layer, during each epoch, and for each batch of data. 

This binary mask will also be used in back propagation, similar to the binary mask from max-pooling, but note that you should NOT apply dropout to max-pool layers.

Your task is to incorporate dropout into your convolutional residual network. 
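
The description above can be sketched as follows. This is a minimal illustration, not a required design: the function names, the `keep_prob` parameter (playing the role of $\mu$), and the use of `np.random.default_rng` are all assumptions. It follows the text literally and does not rescale the kept values; rescaling by `1 / keep_prob` ("inverted dropout") is a common variant that the description above does not mention.

```python
import numpy as np


def dropout_forward_sketch(layer_output, keep_prob, rng):
    """Draw a fresh binary mask and apply it to a layer's output."""
    # Each entry is 1 with probability keep_prob and 0 otherwise
    mask = (rng.random(layer_output.shape) < keep_prob).astype(layer_output.dtype)
    return layer_output * mask, mask


def dropout_backward_sketch(upstream_grad, mask):
    """Dropped nodes pass no gradient; kept nodes pass it through unchanged."""
    return upstream_grad * mask


rng = np.random.default_rng(0)
layer_output = np.arange(1.0, 13.0).reshape(3, 4)

# Call this once per layer, per batch, per epoch so the mask is always new
dropped, mask = dropout_forward_sketch(layer_output, keep_prob=0.5, rng=rng)
grad = dropout_backward_sketch(np.ones_like(layer_output), mask)
```

Returning the mask from the forward pass is what makes it available later in back propagation.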

## Data management

We will be using two data sets for this problem set.
1. MNIST (the most popular computer vision data set)
2. Dummy data (for testing purposes)

## Forward and BackProp functions

![title](./ps5_img/backprop.png)


In [1]:
#### Left here for a copy of your PS4 code

from keras.datasets import cifar10
import tensorflow as tf
#from tensorflow import keras
import keras
from matplotlib import pyplot
import numpy as np

def gen_dummy():
    '''
    dummy data is exceptionally useful to test whether or not your network behaves as expected.
    For dummy data, you should generate a few (<= 5) input/output pairs that you can use to test
    your forward and backward propagation algorithms
    
    output: 
    dummy_x = a NxM np matrix, both dimensions of your choosing of very simple data
    dummy_y = a (M, ) np array with the corresponding labels
    '''
    dummy_x = []
    dummy_y = []
    ####BEGIN CODE HERE####
    
    N = 28 * 28
    M = 5
    
    dummy_x = np.ones((N, M))
    dummy_y = np.ones(M)
    ####END CODE HERE####

    return dummy_x, dummy_y

def load_mnist():
    ''' 
    look up how to load the mnist data set via keras
    '''
    mnist = keras.datasets.mnist
    (mnist_train_images, mnist_train_labels), (mnist_test_images, mnist_test_labels) = mnist.load_data()
    return mnist_train_images, mnist_train_labels, mnist_test_images, mnist_test_labels


def flatten_normalize(images):
    '''
    convert the image from a N1xN1xM to NxM format where N1 = square_root(N) and normalize
    '''
    images = np.reshape(images, (-1, images.shape[-1]))
    # divide by the global maximum; a per-row max of shape (N,) does not broadcast against NxM
    images = images / np.max(images)
    return images


def subset_mnist_training():
    '''
    Return 100 training samples from each of the 10 classes, 1000 samples all together
    '''
    mnist_train_images, mnist_train_labels, mnist_test_images, mnist_test_labels = load_mnist()
    
    images = []
    labels = []
    
    for i in np.unique(mnist_train_labels):
        train = mnist_train_images[mnist_train_labels == i]
        train = train[np.random.choice(train.shape[0], 100, False)]
        images.extend(train)
        labels.extend([i for k in range(100)])
        
    return images, labels


def subset_mnist_testing():
    '''
    Return 20 testing samples from each of the 10 classes, 200 samples all together
    '''
    mnist_train_images, mnist_train_labels, mnist_test_images, mnist_test_labels = load_mnist()
    
    images = []
    labels = []
    
    for i in np.unique(mnist_test_labels):
        test = mnist_test_images[mnist_test_labels == i]
        test = test[np.random.choice(test.shape[0], 20, False)]
        images.extend(test)
        labels.extend([i for k in range(20)])
        
    return images, labels

def load_data():
    ''' 
    look up how to load the mnist data set via keras
    '''
    mnist = keras.datasets.mnist
    (tx, ty), (ex, ey) = mnist.load_data()
    return (tx/255, ty), (ex/255, ey)
    
def kernel_initialization(w1, w2):
    '''
    returns a kernel with specified height and width. 
    The values of the kernel should be initialized using the same formula as the
    He_initialize_weight() function
    '''
    ret = np.random.rand(w1, w2) * np.sqrt(2 / max(w1, w2))
    return ret

def He_initialize_weight(w1, w2):
    '''
    (same as PS3)
    returns a weight matrix with the passed in dimensions
    '''
    ret = np.random.rand(w1, w2) * np.sqrt(2 / max(w1, w2))
    return ret

def bias_initialization(w1):
    ''' (same as PS3)
    returns a bias matrix of the passed in dimensions
    '''
    ret = np.random.rand(w1) * np.sqrt(2 / w1)
    return ret


def log_cost(label, prediction):
    '''
    computes the log cost of the current predictions using the labels (same as PS3)
    '''
    cost = 0
    for i in range(label.shape[0]):
        cost += -np.log(prediction[int(label[i]), i])
    return cost


def softmax(x):
    '''
    computes the softmax of the input (same as PS3)
    '''
    ret = np.zeros(x.shape)
    for col in range(x.shape[1]):
        total = np.sum(np.exp(x[:, col]))
        ret[:, col] = np.exp(x[:, col]) / total
    return ret


def ReLU(x):
    '''
    computes the ReLU of the input (same as PS3)
    '''
    return np.maximum(0, x)


def ReLU_prime(x):
    '''
    computes the ReLU' of the input (same as PS3)
    '''
    ret = np.copy(x)
    ret[ret > 0] = 1
    ret[ret <= 0] = 0
    return ret


def kernel_to_matrix(kernal: np.ndarray, input_size):
    '''
    converts a kernel to its matrix form
    '''
    kernal_size = kernal.shape[0]
    output_size = input_size - kernal_size + 1

    kernal = kernal.flatten()
    to_right = input_size - kernal_size
    ret = np.zeros((output_size * output_size, input_size * input_size))
    t0 = 0
    for i in range(output_size * output_size):
        t1 = i
        for j in range(kernal.shape[0]):
            ret[i, j + t0 + t1] = kernal[j]
            if (j + 1) % kernal_size == 0:
                t1 += to_right
        if (i + 1) % output_size == 0:
            t0 += (kernal_size - 1)
    return ret


def max_pool(X):
    '''
    applies max-pooling to an input image
    '''
    size = int(np.sqrt(X.shape[0]))
    out_size = size // 2
    ret = np.empty((out_size, out_size, X.shape[-2], X.shape[-1]))
    for i in range(X.shape[1]):
        for j in range(X.shape[2]):
            t1 = np.reshape(X[:, i, j], (size, size))
            for x in range(out_size):
                for y in range(out_size):
                    ret[x, y, i, j] = np.max(t1[x * 2: (x + 1) * 2, y * 2: (y + 1) * 2])
    return np.reshape(ret, (out_size * out_size, X.shape[-2], X.shape[-1]))


def max_pool_backwards(delta_el, z):
    '''
    takes the output of a maxpool, and projects back to the original shape.
    see PPT slides on convolutional backprop if you have no idea what I'm talking about.
    '''
    small_size = int(np.sqrt(delta_el.shape[0]))
    big_size = small_size * 2 + 1
    origin = np.reshape(delta_el, (small_size, small_size, delta_el.shape[-2], delta_el.shape[-1]))
    z = np.reshape(z, (big_size, big_size, z.shape[-2], z.shape[-1]))
    ret = np.zeros((big_size, big_size, delta_el.shape[-2], delta_el.shape[-1]))
    for b in range(origin.shape[-2]):
        for layer in range(origin.shape[-1]):
            for i in range(small_size):
                for j in range(small_size):
                    index = np.unravel_index(np.argmax(z[i * 2: (i + 1) * 2, j * 2: (j + 1) * 2, b, layer].flatten()), (2, 2))
                    ret[i * 2 + index[0], j * 2 + index[1], b, layer] = origin[i, j, b, layer]
    return ret.reshape((-1, ret.shape[-2], ret.shape[-1]))


def delta_Last(predictions, labels):
    '''
    task: compute error term for ONLY output layer
    '''
    return predictions - labels


def delta_el(weight_d_plus_1, delta_d_plus_1, input_d):
    '''
    task: compute error term for any hidden layer
    '''
    ####BEGIN CODE HERE ####
    return np.matmul(weight_d_plus_1, delta_d_plus_1) * ReLU_prime(input_d)
    ####END CODE HERE####


def dW(delta_d, a_d_minus_1):
    '''
    task: compute gradient for any weight matrix
    '''
    return np.matmul(delta_d, a_d_minus_1.T)


def db(delta_d):
    '''
    task: compute gradient for any bias term
    '''
    return np.sum(delta_d, axis=1, keepdims=True)


def weight_upate(weights, gradients, learning_rate):
    '''

    task: update each of the weight matrices, and return them in a variable, name of your choosing
    '''
    for i in range(len(weights)):
        weights[i] -= learning_rate * gradients[i]
    return weights


def bias_update(bias, bias_grad, learning_rate):
    '''
    task : update each of the bias terms, return them in a variable, name of  your choosing
    '''
    for i in range(len(bias)):
        bias[i] -= learning_rate * bias_grad[i]
    return bias


In [2]:
def predict(x, weights, bias, activations, dropout):
    '''
    minimum output: the predictions made by the network
    
    You are free to return more things from this function if you see fit
    
    hint: you will need to return all intermediate computations, not just the output
    To figure out what you need to return, look at what intermediate results you need
    to compute backpropagation
    
    you can return more than one variable with the following syntax:
    
    return var1, var2, ..., varN
    '''
    out = x
    z = [x]
    a = [x]
    for i in range(len(activations)):
        if activations[i] == "max" and isinstance(weights[i], list):
            if len(out.shape) <= 2:
                out = out[..., np.newaxis]
            layer_num = 1 if len(out.shape) <= 2 else out.shape[2]
            t = np.empty((weights[i][0].shape[0], out.shape[1], len(weights[i]) * layer_num))
            for j in range(len(weights[i])):
                for k in range(layer_num):
                    t[:, :, j * layer_num + k] = np.matmul(weights[i][j], out[:, :, k])
            out = t
            if activations[i + 1] == 'relu':
                z.append(np.transpose(out, (0, 2, 1)).reshape((-1, out.shape[-2])))
            else:
                z.append(out)
            out = max_pool(out)
            a.append(out)
        elif activations[i] == 'relu':
            if len(out.shape) > 2:
                out = np.transpose(out, (0, 2, 1))
                out = out.reshape((-1, out.shape[-1]))
                a[-1] = out
            out = np.add(np.matmul(weights[i].T, out), bias[i][..., np.newaxis])
            z.append(out)
            out = ReLU(out)
            a.append(out)
        elif activations[i] == "softmax":
            out = softmax(out)
            a.append(out)
        elif activations[i] == "dropout":
            # dropout_forward returns (output, mask); keep the mask if you need it for back propagation
            out, _mask = dropout_forward(out, dropout)
            a.append(out)
    
    return out, z, a

In [3]:
def train(x, y, weights, bias, activations, layers, epoch, lr, dropout):
    
    num_samples = x.shape[1]

    for i in range(epoch):
        print('--------epoch', i, "--------------")
        out, z, a = predict(x, weights, bias, activations, dropout)

        cost = log_cost(y, out)
        print("epoch: ", i, " cost: ", cost)
        # backward
        delta_l = delta_Last(out, y)
        
        d_W = [np.zeros_like(w) for w in weights]
        d_b = [np.zeros_like(w) for w in bias]
        for j in range(2, len(activations) - 2):
            if activations[-j] == "relu":
                d_W_t = dW(delta_l, a[-j - 1]).T / num_samples
                d_b_t = db(delta_l) / num_samples
                d_W.append(d_W_t)
                d_b.append(d_b_t)
                if activations[-j - 1] != "relu":
                    delta_l = delta_el(weights[-j + 1], delta_l, a[-j - 1])
                else:
                    delta_l = delta_el(weights[-j + 1], delta_l, z[-j - 1])
            elif activations[-j] == "max":
                layer = 1
                pre_z = z[-j + 1]
                for k in range(len(weights) - j + 2):
                    layer = layer * len(weights[k])
                if activations[-j + 1] != "max":
                    delta_l = delta_l.reshape((-1, layer, delta_l.shape[-1]))
                    pre_z = pre_z.reshape((-1, layer, delta_l.shape[-1]))
                delta_l = max_pool_backwards(delta_l, pre_z)
                if layers[-j + 1] == "residual":
                    delta_l = delta_l.reshape((-1, layer, delta_l.shape[-1]))
                    pre_z = pre_z.reshape((-1, layer, delta_l.shape[-1]))
                delta_l = max_pool_backwards(delta_l, pre_z)
                d_W_t = []
                d_b_t = []
                for k in range(layer):
                    d_W_t.append(dW(delta_l[:, k, :], a[-j - 1][:, :, k // len(weights[-j + 1])]).T / num_samples)
                    d_b_t.append(db(delta_l[:, k, :]) / num_samples)
                d_W.append(d_W_t)
                d_b.append(d_b_t)
                if j == len(activations):
                    continue
                t_delta_l = []
                for k in range(layer // len(weights[-j + 1])):
                    t_delta_l.append(delta_el(weights[-j][k], delta_l[:, k * len(weights[-j+1]), :], z[-j - 1]))
                    t_delta_l.append(delta_el(weights[-j][k], delta_l[:, k * len(weights[-j+1]) + 1, :], z[-j - 1]))
            elif activations[-j] == "dropout":
                layer = 1
                pre_z = z[-j + 1]
                for k in range(len(weights) - j + 2):
                    layer = layer * len(weights[k])
                if activations[-j + 1] != "dropout":
                    delta_l = delta_l.reshape((-1, layer, delta_l.shape[-1]))
                    pre_z = pre_z.reshape((-1, layer, delta_l.shape[-1]))
                delta_l = max_pool_backwards(delta_l, pre_z)
                d_W_t = []
                d_b_t = []
            else:
                raise Exception()
            weights = weight_upate(weights, d_W, lr)
            bias = bias_update(bias, d_b, lr)
    return weights, bias

In [4]:
# residual skip layer
def residual_forward(x1, x2):
    return x1+x2


def residual_backward(g, x1, x2):
    # the forward pass is a plain addition, so the gradient flows unchanged to both inputs
    return g, g


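# dropout layer
def dropout_forward(x, prob):
    """
    :param x: layer output, x.shape should be [n,n,m]
    :param prob: probability of keeping each node
    :return: the masked, rescaled output and the binary mask that was applied
    """
    mask = (np.random.rand(*x.shape) < prob)
    x = x * mask / prob
    return x, mask


def dropout_backward(g, mask, prob):
    # reuse the forward mask and scaling so the gradient matches the forward pass
    return g * mask / prob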

In [5]:
output_nodes = 10
dropout = 0.1
lr = 0.01
epoch = 10

####BEGIN CODE HERE####
bias_0 = bias_initialization(2)
bias_1 = bias_initialization(2)
bias_2 = bias_initialization(10)
bias = [bias_0, bias_1, bias_2]

weights_0 = []
ratio_0 = 2
for i in range(ratio_0):
    t = kernel_initialization(4, 4)
    weights_0.append(kernel_to_matrix(t, 28))

weights_1 = []
ratio_1 = 2
for i in range(ratio_1):
    t = kernel_initialization(4, 4)
    weights_1.append(kernel_to_matrix(t, 12))

weights_2 = He_initialize_weight(64, 10)
weights = [weights_0, weights_1, weights_2]

activations = ['max', 'max', 'relu', 'softmax']
layers = ['conv', 'maxpool', 
          'conv', 'conv', 'residual', 
          'conv', 'conv', 'residual',
         'fc', 'fc', 'fc']
x, y = gen_dummy()

train(x, y, weights, bias, activations, layers, epoch, lr, dropout)


--------epoch 0 --------------
epoch:  0  cost:  108.8673757619556
--------epoch 1 --------------
epoch:  1  cost:  108.8673757619556
--------epoch 2 --------------
epoch:  2  cost:  108.8673757619556
--------epoch 3 --------------
epoch:  3  cost:  108.8673757619556
--------epoch 4 --------------
epoch:  4  cost:  108.8673757619556
--------epoch 5 --------------
epoch:  5  cost:  108.8673757619556
--------epoch 6 --------------
epoch:  6  cost:  108.8673757619556
--------epoch 7 --------------
epoch:  7  cost:  108.8673757619556
--------epoch 8 --------------
epoch:  8  cost:  108.8673757619556
--------epoch 9 --------------
epoch:  9  cost:  108.8673757619556


([[array([[0.51395805, 0.34728995, 0.59089185, ..., 0.        , 0.        ,
           0.        ],
          [0.        , 0.51395805, 0.34728995, ..., 0.        , 0.        ,
           0.        ],
          [0.        , 0.        , 0.51395805, ..., 0.        , 0.        ,
           0.        ],
          ...,
          [0.        , 0.        , 0.        , ..., 0.27927116, 0.        ,
           0.        ],
          [0.        , 0.        , 0.        , ..., 0.22538358, 0.27927116,
           0.        ],
          [0.        , 0.        , 0.        , ..., 0.08118517, 0.22538358,
           0.27927116]]),
   array([[0.1793265 , 0.4533772 , 0.05489483, ..., 0.        , 0.        ,
           0.        ],
          [0.        , 0.1793265 , 0.4533772 , ..., 0.        , 0.        ,
           0.        ],
          [0.        , 0.        , 0.1793265 , ..., 0.        , 0.        ,
           0.        ],
          ...,
          [0.        , 0.        , 0.        , ..., 0.04564939, 0.  

## Experimentation explanation:

What are you experimenting on? Any aspect of your network that is not learned by SGD, and is not the number of layers of the network, is free for experimentation. Learning rate, number of training samples, epochs, nodes per layer etc etc are a few examples.

There are a lot of ways to get full credit on this experimentation portion, as there is no dedicated format I am requesting. There is no benchmark. There is no "you must get 100% accuracy". I simply ask you to see how well your models can do with the restrictions in place.

Best of luck, 
Ryan

Experimentation notes:

    This cell has been left in text format for you to freely edit and keep track of your experiments.