# Exercise #3 - Training

Finally, we are going to train our network so it can actually predict something useful.

You will first implement the whole backpropagation chain to get the gradients of the loss with respect to the parameters of the neural network. Then we will implement the training loop that updates the parameters and let it run for a while.

Let's start with importing the basic necessities. We will use reference implementations of the functions you implemented in the previous exercises.

In [None]:
import numpy as np
from siouxdnn import dense, relu, sigmoid, binary_cross_entropy_loss

## Loss function

Since we are going from the back to the front of the network we start with the last block, the loss function. The loss function computes the loss based on the predictions of the network. So when we go back through the network we have to compute the partial derivative of the loss value with respect to those predictions. In other words, we want to compute:

$$\frac{\partial{L}}{\partial{\hat{y}}}$$

This loss is computed in two parts: the binary cross entropy loss and the reduction (mean). We can compute the partial derivatives of both parts and combine them using the chain rule to compute what we need.

$$\frac{\partial{L}}{\partial{\hat{y}}} = \frac{\partial{L}}{\partial{L'}} \cdot \frac{\partial{L'}}{\partial{\hat{y}}}$$

##### Reduction

The partial derivative for the reduction part is simply a scalar value.

$$\frac{\partial{L}}{\partial{L'}} = \frac{1}{N} $$

Where $N$ is the number of samples in the batch.

##### Binary cross entropy loss

The partial derivative for the binary cross entropy loss part is defined as follows.

$$\frac{\partial{L'}}{\partial{\hat{y}}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$$
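For reference, this expression can be derived in a few steps. The sketch below assumes the standard per-sample definition of the binary cross entropy, which this exercise does not restate; if the reference implementation uses a different definition, the intermediate steps would differ.

$$L' = -\left(y \log{\hat{y}} + (1-y) \log{(1-\hat{y})}\right)$$

$$\frac{\partial{L'}}{\partial{\hat{y}}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{-y(1-\hat{y}) + (1-y)\hat{y}}{\hat{y}(1-\hat{y})} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$$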


##### Implementation
Implement these computations in the `binary_cross_entropy_loss_backward` function.

Its input parameters are defined as before.
* `y_true`: the ground truth $y$.
* `y_pred`: the output $\hat{y}$ of the network.

The variables you need to compute are:
* `dl_dlosses`: the partial derivative of the loss with respect to the binary cross entropy losses, i.e. $\frac{\partial{L}}{\partial{L'}}$.
* `dlosses_dy_pred`: the partial derivative of the binary cross entropy losses with respect to the network output, i.e. $\frac{\partial{L'}}{\partial{\hat{y}}}$
* `dl_dy_pred`: the final partial derivative of the loss with respect to the network output, i.e. $\frac{\partial{L}}{\partial{\hat{y}}}$.


**Hint:**
You can simply use the basic operators like `*` and `-` here, because we need element-wise computations. No matrix multiplications are needed here.

In [None]:
def binary_cross_entropy_loss_backward(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1-1e-7) # Prevent division by zero
    N = len(y_pred) # Number of samples in batch
    #### BEGIN IMPLEMENTATION ####
    dl_dlosses = 
    dlosses_dy_pred = 
    dl_dy_pred = 
    #### END IMPLEMENTATION ####
    return dl_dy_pred

Let's test your implementation with some predictions.

In [None]:
y_true = np.array([[1.0], [1.0], [0.0], [1.0], [0.0], [0.0], [0.0], [1.0], [1.0], [0.0]])
y_pred = np.array([[0.6], [0.4], [0.2], [0.7], [0.1], [0.2], [0.5], [0.9], [0.8], [0.6]])
dl_dy_pred = binary_cross_entropy_loss_backward(y_true, y_pred)
print(dl_dy_pred)

If you implemented it correctly the output should be:

    [[-0.16666667]
     [-0.25      ]
     [ 0.125     ]
     [-0.14285714]
     [ 0.11111111]
     [ 0.125     ]
     [ 0.2       ]
     [-0.11111111]
     [-0.125     ]
     [ 0.25      ]]

Otherwise, check your implementation.
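If the numbers do not match, a finite-difference check can help locate the problem. The sketch below is not part of the exercise: `numerical_gradient` and `mse` are hypothetical helpers, and a mean squared error is used on purpose so that the solution is not given away. The same helper could be pointed at any scalar loss function.

```python
import numpy as np

def numerical_gradient(loss_fn, values, eps=1e-6):
    """Approximate d loss / d values with central differences."""
    values = values.astype(float)
    grad = np.zeros_like(values)
    for idx in np.ndindex(*values.shape):
        plus = values.copy()
        minus = values.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    return grad

# Illustration with a mean squared error (NOT the loss of this exercise).
demo_true = np.array([[1.0], [0.0], [1.0]])
demo_pred = np.array([[0.6], [0.2], [0.9]])

def mse(pred):
    return np.mean((pred - demo_true) ** 2)

analytic = 2 * (demo_pred - demo_true) / len(demo_pred)  # hand-derived gradient
numeric = numerical_gradient(mse, demo_pred)             # numerical estimate
print(np.allclose(analytic, numeric, atol=1e-6))
```

When the analytic and the numerical gradient agree, the backward function is very likely correct.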

## Training intuition

With this single function we can already get some intuition on how training will work. The function `binary_cross_entropy_loss_backward` returns the gradient of the loss with respect to the predictions. So if we change the predictions themselves slightly in the opposite direction of that gradient, the loss decreases and the predictions move closer to the ground truth.

Let's test this. We compute (and print) the loss at the start, then we do a number of small steps in the opposite direction of the gradients, followed by a final loss computation. The result should be that the loss ends up close to 0.

In [None]:
print(f'loss at start {binary_cross_entropy_loss(y_true, y_pred)}')
dl_dy_pred = binary_cross_entropy_loss_backward(y_true, y_pred)

for _ in range(500):
    y_pred = y_pred - 1e-2*dl_dy_pred
    dl_dy_pred = binary_cross_entropy_loss_backward(y_true, y_pred)

print(f'loss at end {binary_cross_entropy_loss(y_true, y_pred)}')

The final training of the network will work in a similar fashion. Instead of changing the predictions we will then change the parameters `w` and `b` of each layer. But we need a few other building blocks before we can do that.

## Activation functions

Next up are the partial derivatives for the activation functions. The activation functions `g()` are used as follows.

$$a_n = g(z_n)$$

The input and output shapes are the same, so we can use the chain rule to compute the partial derivatives of the loss with respect to the inputs of the activation functions. In other words we want to compute the following chain:

$$\frac{\partial{L}}{\partial{z_n}} = \frac{\partial{L}}{\partial{a_n}} \cdot \frac{\partial{a_n}}{\partial{z_n}} = \frac{\partial{L}}{\partial{a_n}} \cdot \frac{\partial{g(z_n)}}{\partial{z_n}}$$


The values for $\frac{\partial{L}}{\partial{a_n}}$ are given by the previous backpropagation step. We computed that for the last layer in the previous section. Remember that $\hat{y} = a_3$, so $\frac{\partial{L}}{\partial{a_3}} = \frac{\partial{L}}{\partial{\hat{y}}}$.

### Sigmoid

Let's begin with backpropagating the `sigmoid` function. The sigmoid function has the nice property that its derivative can be expressed in terms of its own output.

$$\frac{\partial{g(z_n)}}{\partial{z_n}} = \frac{\partial{sig(z_n)}}{\partial{z_n}} = sig(z_n)(1-sig(z_n)) = a_n (1-a_n)$$
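This identity can be checked in a few lines. The sketch assumes the standard logistic definition of the sigmoid, which is not restated in this exercise.

$$sig(z) = \frac{1}{1+e^{-z}}$$

$$\frac{\partial{sig(z)}}{\partial{z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = sig(z)\left(1 - sig(z)\right)$$

The last step uses $\frac{e^{-z}}{1+e^{-z}} = 1 - \frac{1}{1+e^{-z}}$.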

Implement it in the function below. It should return the output of the whole chain, i.e. it should return the value for $\frac{\partial{L}}{\partial{z_n}}$ (stored in `dl_dz`). To do that you need the value for $\frac{\partial{L}}{\partial{a_n}}$, which is given as parameter `dl_da`.

In [None]:
def sigmoid_backward(z, dl_da):
    #### BEGIN IMPLEMENTATION ####
    a = 
    dl_dz = 
    #### END IMPLEMENTATION ####
    return dl_dz

Let's test your implementation.

In [None]:
z = np.array([[ 0.90145683, 0.35448664, 0.12282405, 0.14473946],
              [ 0.90185776,-0.83611207,-0.74306445, 0.75256026],
              [-0.68165728, 0.46613136, 0.22696806,-0.34379635]])
dl_da = np.array([[ 0.09298582,-0.13436976, 0.02718292, 0.02473067],
               [-0.12145152, 0.17550431,-0.03550441,-0.03230145],
               [-0.15272906, 0.22070212,-0.0446479 ,-0.04062008]])
dl_dz = sigmoid_backward(z, dl_da)
print(dl_dz)

The output should be equal to:

    [[ 0.01909686 -0.03255884  0.00677016  0.0061504 ]
     [-0.02493875  0.03702021 -0.0077554  -0.00703186]
     [-0.03406903  0.0522837  -0.01101945 -0.00986076]]


If not, check your implementation.

### ReLU

The other activation function, ReLU, is even simpler. The ReLU function returns the input value `z` if it is greater than 0; otherwise it returns 0. That means that the gradient is 1 where `z` is greater than 0, and 0 everywhere else.

$$\frac{\partial{g(z_n)}}{\partial{z_n}} = \frac{\partial{relu(z_n)}}{\partial{z_n}} = \begin{cases} 1, & z_n \gt 0 \\ 0, & z_n \le 0 \end{cases}$$

Again it should return the output of the whole chain ($\frac{\partial{L}}{\partial{z_n}}$) and store it in `dl_dz`. $\frac{\partial{L}}{\partial{a_n}}$ is given as parameter `dl_da`.

There are a few ways to compute the result. The simplest is to make a copy of `dl_da` (stored in `dl_dz`) and set its values to 0 where the corresponding value of `z` is less than or equal to 0. It must be a copy, otherwise the original `dl_da` array is modified in place, which breaks other computations later on.

**Hint:**
You can use boolean arrays as index for NumPy arrays, so you can do things like `a[a == 0] = c`, then all places where `a` has value 0 are set to value `c` and all others remain untouched.
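The boolean mask does not have to come from the array that is being modified. The small sketch below uses made-up arrays (`readings` and `mask_source` are illustrative names only) to show a mask built from one array being applied to a copy of another.

```python
import numpy as np

readings = np.array([[ 1.5, -0.2,  3.0],
                     [-4.0,  2.5,  7.0]])
mask_source = np.array([[ 1, -1,  1],
                        [ 1,  1, -1]])

cleaned = readings.copy()        # work on a copy, the original stays intact
cleaned[mask_source < 0] = 0.0   # the mask is built from a different array
print(cleaned)
```

Only the two positions where `mask_source` is negative are overwritten; `readings` itself is unchanged.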

In [None]:
def relu_backward(z, dl_da):
    dl_dz = np.array(dl_da, copy=True) # Create a copy
    #### BEGIN IMPLEMENTATION ####
    dl_dz[...] = 
    #### END IMPLEMENTATION ####
    return dl_dz

Testing, testing.

In [None]:
z = np.array([[-0.3841685 , 0.96313277, 0.01752332, 0.70375869],
              [-0.90051653, 0.36881432,-0.87937234,-0.07766394],
              [ 0.99909255,-0.28151873, 0.21100903, 0.96628448]])
dl_da = np.array([[ 0.10135023,-0.09567351, 0.00798541, 0.01288583],
               [ 0.12338271,-0.11647193, 0.00972135, 0.01568707],
               [ 0.18808731,-0.17755236, 0.01481944, 0.02391371]])
dl_dz = relu_backward(z, dl_da)
print(dl_dz)

The output should be:

    [[ 0.         -0.09567351  0.00798541  0.01288583]
     [ 0.         -0.11647193  0.          0.        ]
     [ 0.18808731  0.          0.01481944  0.02391371]]

## Dense

And last but not least, the backpropagation of the `dense` function. This function is more involved, because the shapes of the input and output are not the same. It also uses the network parameters `w` and `b`, and in the end those are the partial derivatives we are interested in. In addition, we need the partial derivative of this function with respect to its input $a_{n-1}$, so we can continue the chain backwards to the previous layer.

So instead of computing only one derivative, this function should return three.

#### Previous layer

The partial derivative of the loss with respect to the output of the previous layer can be computed as follows (see the lectures for the derivation of this).

$$\frac{\partial{L}}{\partial{a_{n-1}}} = \frac{\partial{L}}{\partial{z_n}} w_n^\mathsf{T} $$

Note that this is a matrix multiplication!

#### Parameters

The partial derivatives of the loss with respect to the parameters are defined as follows (again see the lectures for the derivation of this).

$$\frac{\partial{L}}{\partial{w_n}} = a_{n-1}^\mathsf{T} \frac{\partial{L}}{\partial{z_n}}$$

$$\frac{\partial{L}}{\partial{b_n}} = \begin{pmatrix}
\sum_{n}^{N} \frac{\partial{L}}{\partial{z_{n,0}}} & \sum_{n}^{N} \frac{\partial{L}}{\partial{z_{n,1}}} & \cdots & \sum_{n}^{N} \frac{\partial{L}}{\partial{z_{n,K-1}}}
\end{pmatrix}$$

#### Implementation

Thanks to backpropagation the term $\frac{\partial{L}}{\partial{z_n}}$ is already computed and given as input parameter `dl_dz`. Implement the rest in the function below.

Remember that the shapes of the outputs must match the shapes of the inputs. So `dl_dw` must have the same shape as `w`, `dl_db` the same as `b`, and `dl_da_prev` the same as `a_prev`. Take good care of this when you multiply matrices. If needed, add statements like `print(dl_dw.shape)` to verify you're doing it right. And remember the matrix multiplication rule $(n,k) \cdot (k,m) \rightarrow (n,m)$!
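The shape rule can be tried out on random matrices. This is only an illustration with arbitrary sizes; the names `left`, `right` and `other` are made up and have nothing to do with the network.

```python
import numpy as np

rng = np.random.default_rng(0)
left = rng.normal(size=(3, 4))     # shape (n, k)
right = rng.normal(size=(4, 5))    # shape (k, m)

product = np.matmul(left, right)   # inner dimensions (4 and 4) match
print(product.shape)               # (3, 5)

# A transpose swaps the two axes, which is often what makes
# the inner dimensions line up.
other = rng.normal(size=(3, 5))
transposed_product = np.matmul(left.T, other)   # (4, 3) times (3, 5)
print(transposed_product.shape)                 # (4, 5)
```

If the inner dimensions do not match, `np.matmul` raises a `ValueError`, which is usually a sign that a transpose is missing or the operands are in the wrong order.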

**Hint:**
You will need to use [np.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html) for the matrix multiplications.

Computing the gradient for `b` is a bit tricky. It only depends on `dl_dz`, which has shape `(S, N)`, where `S` is the number of samples in the batch and `N` the number of output units of this layer. However, `dl_db` must have the same shape as `b`, which is `(1, N)`. So you need to use the function [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) with some extra parameters. If you set parameter `keepdims` to `True`, then the output of the summation will have the same _number_ of dimensions as the input array, which is needed in this case. The parameter `axis` can be used to perform the summation over only one axis of the array. For this implementation you will need `axis=0` rather than the default `axis=None`, which sums over all elements.
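The effect of `axis` and `keepdims` is easiest to see on a small array. The array `values` below is made up for illustration.

```python
import numpy as np

values = np.arange(6.0).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

total = np.sum(values)                                   # all elements, a scalar
per_column = np.sum(values, axis=0)                      # shape (2,)
per_column_2d = np.sum(values, axis=0, keepdims=True)    # shape (1, 2)

print(total)                 # 15.0
print(per_column.shape)      # (2,)
print(per_column_2d.shape)   # (1, 2)
```

With `keepdims=True` the summed axis is kept with length 1, so the result remains a two-dimensional array.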

In [None]:
def dense_backward(a_prev, w, dl_dz):
    #### BEGIN IMPLEMENTATION ####
    dl_dw = 
    dl_db = 
    dl_da_prev = 
    #### END IMPLEMENTATION ####
    return dl_dw, dl_db, dl_da_prev

Let's test it with some dummy input. The layer has 4 input units and 5 output units, and we use 3 samples.

In [None]:
a_prev = np.array([[ 0.35375751,-0.3776729 , 0.32942281,-0.27827347],
 [-0.93693319,-0.085921  , 0.01547009, 0.80320843],
 [-0.74478941, 0.22313203,-0.21071294, 0.86850543]])
dl_dz = np.array([[ 0.        , 0.03582932, 0.        , 0.05249304,-0.08647048],
 [ 0.04884253, 0.06242368, 0.11415199, 0.09145607, 0.        ],
 [ 0.04169974, 0.05329477, 0.09745826, 0.07808142, 0.        ]])
w = np.array([[-0.89889975,-0.47854249,-0.52593611,-0.90020105, 0.63363267],
 [ 0.14985559,-0.36020986,-0.85147906,-0.68566762, 0.1925108 ],
 [-0.54756299, 0.80574929,-0.81764706,-0.4837164 , 0.66329728],
 [-0.6358417 ,-0.13549578,-0.08654791, 0.03401955,-0.01993571]])
dl_dw, dl_db, dl_da_prev = dense_backward(a_prev, w, dl_dz)
print('dw', dl_dw)
print('db', dl_db)
print('da_prev', dl_da_prev)

The output should be:

    dw [[-0.07681971 -0.08550531 -0.17953867 -0.12527264 -0.03058958]
     [ 0.00510795 -0.0070035   0.01193801 -0.01026073  0.03265756]
     [-0.00803108  0.0015388  -0.01876977  0.00225447 -0.02848535]
     [ 0.07544718  0.08645567  0.17633087  0.126665    0.02406244]]
    db [[ 0.09054227  0.15154777  0.21161025  0.22203053 -0.08647048]]
    da_prev [[-0.11919066 -0.06554535 -0.05387793 -0.00134508]
     [-0.21614243 -0.17507279 -0.11402137 -0.04628258]
     [-0.18453349 -0.14946993 -0.09734673 -0.03951416]]

# Complete backpropagation

Now that we have all the building blocks available we can build the whole backpropagation chain for our model.

Below is the code of the model as it was implemented in the previous exercises. You need to implement the new `get_gradients` function.

In this function, the forward pass is computed first, so all the intermediate results are available. Then the backpropagation starts. Use the right building blocks you implemented above with the right parameters.

You're almost done. Good luck!

In [None]:
class Model(object):
    def __init__(self):
        N0, N1, N2, N3 = 5, 64, 64, 1
        self.w1 = np.random.uniform(-0.5, 0.5, size=(N0, N1))
        self.b1 = np.zeros((1, N1))
        self.w2 = np.random.uniform(-0.5, 0.5, size=(N1, N2))
        self.b2 = np.zeros((1, N2))
        self.w3 = np.random.uniform(-0.5, 0.5, size=(N2, N3))
        self.b3 = np.zeros((1, N3))
        
    def predict(self, x):
        a0 = x
        z1 = dense(a0, self.w1, self.b1)
        a1 = relu(z1)
        z2 = dense(a1, self.w2, self.b2)
        a2 = relu(z2)
        z3 = dense(a2, self.w3, self.b3)
        a3 = sigmoid(z3)
        y_pred = a3
        return y_pred
    
    def evaluate(self, x, y_true):
        y_pred = self.predict(x)
        loss = binary_cross_entropy_loss(y_true, y_pred)
        return loss

    def get_gradients(self, x, y_true):
        
        # Forward propagation
        a0 = x
        z1 = dense(a0, self.w1, self.b1)
        a1 = relu(z1)
        z2 = dense(a1, self.w2, self.b2)
        a2 = relu(z2)
        z3 = dense(a2, self.w3, self.b3)
        a3 = sigmoid(z3)
        y_pred = a3
        # Loss
        loss = binary_cross_entropy_loss(y_true, y_pred)
        
        #### BEGIN IMPLEMENTATION ####
        
        # Backprop loss to network output
        dl_da3 = dl_dy_pred = ...
        
        # backprop three layers
        dl_dz3 = 
        dl_dw3, dl_db3, dl_da2 = 
        dl_dz2 = 
        dl_dw2, dl_db2, dl_da1 = 
        dl_dz1 = 
        dl_dw1, dl_db1, dl_da0 = 
       
        #### END IMPLEMENTATION ####

        dl_dx = dl_da0
        
        return loss, dl_dx, dl_dw1, dl_db1, dl_dw2, dl_db2, dl_dw3, dl_db3

We could add a test here to check whether you implemented it correctly, but we would need quite some data for that and we are almost done, so let's postpone that.

# Training

The forward propagation is implemented, so we can predict values. The backward propagation is now implemented, so we can compute the gradients of the parameters. We are finally ready to actually train the model!

We will train the model on the same dataset as before. This time we will use both the training and the validation set. So let's import that first.

In [None]:
from siouxdnn import load_data
X_train, Y_train, X_val, Y_val = load_data()
print('training set', X_train.shape, Y_train.shape)
print('validation set', X_val.shape, Y_val.shape)

Now we implement the training loop. This is actually pretty simple. You first have to compute the gradients using the function you just implemented. Then update every parameter by taking a step in the opposite direction of its gradient, with the `learning_rate` as the scaling factor of that step.

In [None]:
def train(model, x, y_true, learning_rate):
    loss, dl_dx, dl_dw1, dl_db1, dl_dw2, dl_db2, dl_dw3, dl_db3 = model.get_gradients(x, y_true)
    #### BEGIN IMPLEMENTATION ####
    model.w1 = 
    model.b1 = 
    model.w2 = 
    model.b2 = 
    model.w3 = 
    model.b3 = 
    #### END IMPLEMENTATION ####
    return loss
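The same update rule can be watched in isolation on a toy problem. The sketch below is not the network: it minimises the made-up function $f(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$, and the names `theta`, `step_size` and `grad_f` are chosen for this illustration only.

```python
def grad_f(theta):
    """Gradient of f(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)

theta = 0.0        # arbitrary starting point
step_size = 0.1    # plays the role of the learning rate

for _ in range(100):
    # Step against the gradient, scaled by the step size.
    theta = theta - step_size * grad_f(theta)

print(theta)       # ends up very close to the minimum at 3
```

Each step moves `theta` a fraction of the way towards the minimum. A step size that is too large would make the value overshoot and possibly diverge.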

Alright, this is it. Time to train your model. Run the following code and it will create a new model and run 1000 training steps, each time training on the whole training set at once (i.e. no batches).

The training set is used to actually train the network on, so the weights will be adjusted according to that input and ground truth. The validation set will be used after every training step to evaluate the performance of the network on data it hasn't "seen" yet.

A nice plot will be displayed while training. You should see both the training loss and the validation loss going down.

In [None]:
%matplotlib notebook
from loss_plot import init_loss_plot, add_loss_to_plot, finish_loss_plot

from siouxdnn import reset_seed
reset_seed(123)
model = Model()

learning_rate = 1e-2

init_loss_plot()
for epoch in range(1000):
    train_loss = train(model, X_train, Y_train, learning_rate)
    val_loss = model.evaluate(X_val, Y_val)

    add_loss_to_plot(train_loss, val_loss, epoch%50 == 0)
finish_loss_plot()

# Validation

The model is now trained!

Let's check how well it performs by inspecting the metrics on the validation dataset.

In [None]:
from siouxdnn import get_metrics

y_pred = model.predict(X_val)

accuracy, precision, recall = get_metrics(Y_val, y_pred)
print(f'accuracy {accuracy:.3f}, precision {precision:.3f}, recall {recall:.3f}')

The output should be `accuracy 0.922, precision 0.886, recall 1.000`.

As you can see, the model does pretty well: an accuracy of 92.2%, with high precision and recall.
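How `get_metrics` works internally is not shown in this exercise. As a rough sketch of what these three numbers mean, the hypothetical helper `compute_metrics_sketch` below thresholds the predictions at 0.5 (an assumption) and counts true and false positives and negatives.

```python
import numpy as np

def compute_metrics_sketch(labels, scores, threshold=0.5):
    """Hypothetical helper, not the siouxdnn implementation."""
    predicted = scores >= threshold   # assumed decision threshold
    actual = labels >= 0.5
    tp = np.sum(predicted & actual)     # true positives
    fp = np.sum(predicted & ~actual)    # false positives
    fn = np.sum(~predicted & actual)    # false negatives
    tn = np.sum(~predicted & ~actual)   # true negatives
    accuracy = (tp + tn) / actual.size
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

demo_labels = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])
demo_scores = np.array([[0.9], [0.4], [0.3], [0.8], [0.7]])
demo_acc, demo_prec, demo_rec = compute_metrics_sketch(demo_labels, demo_scores)
print(f'accuracy {demo_acc:.3f}, precision {demo_prec:.3f}, recall {demo_rec:.3f}')
```

Precision is the fraction of positive predictions that are correct, and recall is the fraction of actual positives that were found. A recall of 1.000 therefore means that no positive sample was missed.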

# End

We are done. We have written a complete neural network (of 3 layers) all from scratch and trained it on a dataset. And it even performs quite well. Congratulations!