# Exercise 9.3 Convolutional Neural Networks (6 points)
## DO NOT EDIT THIS FILE
You will implement the building blocks of a convolutional neural network. Each function comes with detailed instructions that walk you through the required steps. In this exercise you are going to implement the following functions:

- Zero padding
- Maxpool2D functions:
    - Max pooling forward `__call__()`
    - Pooling backward function `grad()`
- Convolution2D functions:
    - Convolution `__call__()`
    - Convolution `grad()`

 

In [None]:
import numpy as np
from activations import ReLU, LeakyReLU, Tanh, Softmax, Sigmoid
from losses import CrossEntropy, MSELoss
from layers import Linear
from layers import L2regularization, Dropout
from layers import Conv2D, zero_padding
from layers import Maxpool2D
from model import Model
import matplotlib.pyplot as plt

numpy_random_seed = 42

## 9.3.a - Zero-Padding (1 point)


Adding zero padding to images keeps the outputs from shrinking as we build deeper networks.

Implement the `zero_padding()` function in `layers/utils.py`, which pads all the images in a batch of examples X with zeros. [Use np.pad](https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html). 
Note that the padding is applied to the borders of each image.
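
As a rough illustration of how `np.pad` can do this, here is a minimal sketch. It assumes a channels-last layout `(batch, height, width, channels)`, and the name `zero_padding_sketch` is hypothetical; the signature expected in `layers/utils.py` may differ.

```python
import numpy as np

def zero_padding_sketch(X, pad):
    """Pad height and width of a (batch, height, width, channels) array with zeros."""
    # Only axes 1 and 2 (height, width) are padded; batch and channel axes stay as they are.
    pad_width = ((0, 0), (pad, pad), (pad, pad), (0, 0))
    return np.pad(X, pad_width, mode="constant", constant_values=0)

X_demo = np.arange(2 * 3 * 3 * 1, dtype=float).reshape(2, 3, 3, 1)
X_demo_padded = zero_padding_sketch(X_demo, 2)
print(X_demo_padded.shape)  # (2, 7, 7, 1)
```

Each spatial dimension grows by `2 * pad`, while the original values sit unchanged in the center of the padded array.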

In [None]:
np.random.seed(numpy_random_seed)

# Create a dummy image batch
image_batch = np.random.randn(5, 3, 3, 3)

# Pad the batch using zero_padding function
image_batch_padded = zero_padding(image_batch, 2)

print("image_batch.shape =", image_batch.shape)
print("image_batch_padded.shape =", image_batch_padded.shape)
print("image_batch[1,1] =", image_batch[1,1])
print("image_batch_padded[1,1] =", image_batch_padded[1,1])

fig, ax = plt.subplots(1, 2)
ax[0].set_title('image')
ax[0].imshow(image_batch[0,:,:,0])
ax[1].set_title('padded_image')
ax[1].imshow(image_batch_padded[0,:,:,0])

## 9.3.b Max Pooling 2D forward pass (1 point)
Pooling layers shrink the height and width of their input. The max pooling operation slides a $k \times k$ kernel window over the input and stores the maximum value of each window.

Pooling layers have no parameters for backpropagation to train. However, they have hyperparameters such as the window size $k$. Another pooling operation is global average pooling, but we are not going to cover its implementation.

Now you are going to implement the max pooling operation in `Maxpool2D.py`. 
Implement the forward pass of the pooling layer in `__call__()`. Follow the hints in the documentation of max pooling.
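
The sliding-window idea can be sketched as follows. This is only an illustration under assumptions: a channels-last input `(batch, height, width, channels)`, no padding, and hypothetical parameter names and defaults (`kernel_size=2`, `stride=2`) that are not taken from the `Maxpool2D` class.

```python
import numpy as np

def maxpool2d_forward_sketch(X, kernel_size=2, stride=2):
    """Max pooling over the height and width of a channels-last batch."""
    n, h_in, w_in, c = X.shape
    h_out = (h_in - kernel_size) // stride + 1
    w_out = (w_in - kernel_size) // stride + 1
    out = np.zeros((n, h_out, w_out, c))
    for i in range(h_out):
        for j in range(w_out):
            h0, w0 = i * stride, j * stride
            # Window for all examples and channels at output position (i, j)
            window = X[:, h0:h0 + kernel_size, w0:w0 + kernel_size, :]
            out[:, i, j, :] = window.max(axis=(1, 2))
    return out

X_demo = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
pooled = maxpool2d_forward_sketch(X_demo)
print(pooled[0, :, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

The loop here runs only over output positions and vectorizes over examples and channels; an explicit loop over all four dimensions gives the same result.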

In [None]:
np.random.seed(numpy_random_seed)
X = np.random.randn(5, 5, 3, 2)
maxpool2D = Maxpool2D()
out = maxpool2D(X)
print("out =", out)

## 9.3.c Max Pooling 2D backward pass (2 points)
Next, let's implement the backward pass for the pooling layer. Even though a pooling layer has no parameters for backprop to update, you still need to backpropagate the gradient through the pooling layer in order to compute gradients for the layers that came before it. 

Before jumping into the backpropagation of the pooling layer, you are going to build a helper function called `create_mask()` in the `Maxpool2D` class, which does the following: 

$$ X = \begin{bmatrix}
3 & 1 \\
2 & 4
\end{bmatrix} \quad \rightarrow  \quad M =\begin{bmatrix}
0 & 0 \\
0 & 1
\end{bmatrix}$$

This function creates a mask which keeps track of where the maximum is in the matrix.
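
One possible way to build such a mask is a comparison against the window maximum. The function name below is hypothetical, and the tie handling (every maximal entry is marked) is an assumption rather than something the exercise specifies.

```python
import numpy as np

def create_mask_sketch(x):
    """Return a boolean mask that is True where x equals its maximum."""
    # If the maximum occurs more than once, every occurrence is marked.
    return x == np.max(x)

x_demo = np.array([[3, 1],
                   [2, 4]])
mask_demo = create_mask_sketch(x_demo)
print(mask_demo.astype(int))
# [[0 0]
#  [0 1]]
```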

Implement the `grad()` function by following the instructions given in the file. You will use 4 for-loops (iterating over training examples, height, width, and channels). 
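
The four loops could look roughly like the sketch below. It assumes that the forward input `X` is available (for example cached during the forward pass), a channels-last layout, and hypothetical parameters `kernel_size` and `stride`; the actual `grad()` signature in the file may differ.

```python
import numpy as np

def maxpool2d_backward_sketch(X, in_gradient, kernel_size=2, stride=2):
    """Route each incoming gradient entry to the maximum of its pooling window."""
    n, h_out, w_out, c = in_gradient.shape
    dX = np.zeros(X.shape, dtype=float)
    for b in range(n):                  # training examples
        for i in range(h_out):          # output height
            for j in range(w_out):      # output width
                for ch in range(c):     # channels
                    h0, w0 = i * stride, j * stride
                    window = X[b, h0:h0 + kernel_size, w0:w0 + kernel_size, ch]
                    mask = window == np.max(window)
                    # Accumulate, since windows overlap when stride < kernel_size
                    dX[b, h0:h0 + kernel_size, w0:w0 + kernel_size, ch] += (
                        mask * in_gradient[b, i, j, ch]
                    )
    return dX

X_demo = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
grad_demo = np.ones((1, 2, 2, 1))
dX_demo = maxpool2d_backward_sketch(X_demo, grad_demo)
print(dX_demo[0, :, :, 0])
# [[0. 0. 0. 0.]
#  [0. 1. 0. 1.]
#  [0. 0. 0. 0.]
#  [0. 1. 0. 1.]]
```

Only the positions that held a window maximum receive gradient; all other entries stay zero.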

In [None]:
np.random.seed(numpy_random_seed)
X_in_grad = np.random.randn(5, 4, 2, 2)
print(np.mean(maxpool2D.grad(X_in_grad)[0]))

## 9.3.d Convolutional Neural Networks - Forward pass (1 point)

In the forward pass, you will take many filters and convolve each of them with the input. Each convolution gives you a 2D matrix output. You will then stack these outputs to get a 3D volume.


Implement the `__call__()` function in `layers/Conv2D.py` to convolve the filters W with a tensor X. The class takes as input the number of input channels, the number of output channels, the stride, the kernel size and the padding.

Follow the instructions in the file `layers/Conv2D.py` to implement the forward pass.
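
For orientation, a naive forward pass might look like the sketch below. The weight layout `(kernel, kernel, in_channels, out_channels)`, the channels-last input and the function name are assumptions made for this illustration and need not match `layers/Conv2D.py`.

```python
import numpy as np

def conv2d_forward_sketch(X, W, b, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation) on a channels-last batch."""
    n, h_in, w_in, _ = X.shape
    k, _, _, c_out = W.shape
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1
    X_pad = np.pad(X, ((0, 0), (padding, padding), (padding, padding), (0, 0)),
                   mode="constant", constant_values=0)
    Z = np.zeros((n, h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            h0, w0 = i * stride, j * stride
            window = X_pad[:, h0:h0 + k, w0:w0 + k, :]  # (n, k, k, c_in)
            # Contract the window with every filter over (k, k, c_in), then add the bias
            Z[:, i, j, :] = np.tensordot(window, W, axes=([1, 2, 3], [0, 1, 2])) + b
    return Z

X_demo = np.ones((1, 4, 4, 1))
W_demo = np.ones((3, 3, 1, 1))
b_demo = np.zeros(1)
Z_demo = conv2d_forward_sketch(X_demo, W_demo, b_demo, stride=1, padding=1)
print(Z_demo.shape)        # (1, 4, 4, 1)
print(Z_demo[0, :, :, 0])  # corners are 4, edges 6, interior 9
```

With a 3x3 kernel, `padding=1` and `stride=1`, the output keeps the spatial size of the input, which is the effect zero padding is meant to have.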

In [None]:
np.random.seed(numpy_random_seed)
conv_layer = Conv2D(in_channels=3, out_channels=8, padding=1, stride=1)
np.random.seed(numpy_random_seed)
X = np.random.randn(14, 32, 32, 3)

Z = conv_layer(X)
print("Mean Z = ", np.mean(Z))
print(Z.shape)

## 9.3.e Conv backward pass (2 points)

In previous exercises we computed the gradients of most layers, which can be used to update the parameters. In this exercise we are not going to perform the update step, because convolutions take too long to train on CPUs. Hence this exercise stops at calculating the gradients, without updating the model parameters. Nevertheless, it is instructive to know the math behind backpropagation in CNNs. The `grad()` function is supposed to return three values:

- Gradient with respect to the input of the layer
- Gradient with respect to the weights of the layer
- Gradient with respect to the biases of the layer

Let's start by implementing the backward pass for a CONV layer. In the following description, $dL/dZ$ corresponds to the `in_gradient` of the layer that you receive as an argument. With a slight abuse of notation we write it as $dLZ$.

#### Computing dL/dX:
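For a certain filter $W_c$ and a given training example, $dL/dX$ is accumulated as follows, where $dLZ_{h,w}$ is the scalar gradient at output position $(h, w)$ and each product is added into the slice of $dL/dX$ covered by the corresponding input window: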

$$ (dL/dX) += \sum _{h=0} ^{n_H} \sum_{w=0} ^{n_W} W_c \times dLZ_{h,w}$$

#### Computing dL/dW:
We calculate the derivative of the loss with respect to one filter $W_c$ by

$$ (dL/dW)_c  += \sum _{h=0} ^{n_H} \sum_{w=0} ^ {n_W} X_{window} \times dLZ_{h,w} $$

$X_{window}$ corresponds to the slice of the input which was used to generate the activation $Z_{h,w}$. 

#### Computing dL/db:

To compute $dL/db$ for a certain filter $W_c$, sum $dLZ$ over the corresponding output channel. In this case, you are just summing all the incoming gradients of that channel.

$$ (dL/db) = \sum_h \sum_w dLZ_{h,w}$$
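
The three formulas can be put together in a sketch like the one below. It assumes a channels-last input, weights of shape `(kernel, kernel, in_channels, out_channels)`, and gradients summed (not averaged) over the batch; the function name and argument list are hypothetical and not the interface of `Conv2D.grad()`.

```python
import numpy as np

def conv2d_backward_sketch(X, W, dLZ, stride=1, padding=0):
    """Return (dL/dX, dL/dW, dL/db) for a naive channels-last convolution."""
    n, h_out, w_out, c_out = dLZ.shape
    k = W.shape[0]
    h_in, w_in = X.shape[1], X.shape[2]
    X_pad = np.pad(X, ((0, 0), (padding, padding), (padding, padding), (0, 0)),
                   mode="constant", constant_values=0)
    dX_pad = np.zeros(X_pad.shape, dtype=float)
    dW = np.zeros(W.shape, dtype=float)
    # dL/db: sum dLZ over examples and spatial positions, one value per filter
    db = dLZ.sum(axis=(0, 1, 2))
    for i in range(h_out):
        for j in range(w_out):
            h0, w0 = i * stride, j * stride
            window = X_pad[:, h0:h0 + k, w0:w0 + k, :]  # (n, k, k, c_in)
            grad = dLZ[:, i, j, :]                      # (n, c_out)
            # dL/dW += X_window * dLZ_{h,w}, summed over the batch
            dW += np.tensordot(window, grad, axes=([0], [0]))
            # dL/dX window += sum over filters of W_c * dLZ_{h,w}
            dX_pad[:, h0:h0 + k, w0:w0 + k, :] += np.tensordot(grad, W, axes=([1], [3]))
    # Drop the padded border so dX has the shape of X
    dX = dX_pad[:, padding:padding + h_in, padding:padding + w_in, :]
    return dX, dW, db

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((1, 3, 3, 1))
W_demo = rng.standard_normal((3, 3, 1, 1))
dLZ_demo = np.full((1, 1, 1, 1), 2.0)  # a single output position with gradient 2
dX_demo, dW_demo, db_demo = conv2d_backward_sketch(X_demo, W_demo, dLZ_demo)
print(np.allclose(dX_demo[0], 2.0 * W_demo[:, :, :, 0]))  # True
print(np.allclose(dW_demo[:, :, :, 0], 2.0 * X_demo[0]))  # True
print(db_demo)                                            # [2.]
```

In this small case there is only one window, so $dL/dX$ is the filter scaled by the incoming gradient and $dL/dW$ is the input window scaled by it, which matches the formulas term by term.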

In [None]:
np.random.seed(numpy_random_seed)
print(Z.shape)
dLdx, dLdW, dLdb = conv_layer.grad(Z)
print("dX mean =", np.mean(dLdx))
print("dW mean =", np.mean(dLdW))
print("db mean =", np.mean(dLdb))