## Day 6 - Implementing Backpropagation



We start with an implementation of the neural network from last week. You can either use the implementation below or your own. 

In [None]:
import numpy as np
from matplotlib import pyplot as plt

# Dummy data
data_x = np.array([   # 4 attributes 
    [0.5, -0.2, 0.1, 0.4],
    [1.5,  0.2, 1.1, -0.4],
    [0.3,  0.8, 0.5, 0.7],
    [0.6,  0.3, -0.9, 1.0],
    [1.0, -0.1, 0.2, -0.3]
])

# There are 3 classes 0, 1, 2.
data_y = np.array([0, 2, 1, 1, 0])

The point of the dummy data is that it will allow us to test our implementation. We should be able to learn the training data perfectly!

Here is a neural network with 4 inputs, 3 output units (softmax activated), and 2 hidden layers (10 units each). The hidden layers should use the sigmoid activation function. 

In [None]:
# Initialize weights and biases
W1 = np.random.rand(10,4) * 0.01   # we make the initial weights smaller
b1 = np.random.rand(10) * 0.01

W2 = np.random.rand(10,10) * 0.01
b2 = np.random.rand(10) * 0.01

# I forgot to provide these in the initial notebook -- with two hidden layers you need the parameters for 
# each hidden layer PLUS the parameters for the output layer.
W3 = np.random.rand(3, 10) * 0.01
b3 = np.random.rand(3) * 0.01

In [None]:
# Define the activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # np.exp works element-wise, so this accepts scalars and arrays

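def softmax(z):
    # Subtract the max before exponentiating: the result is unchanged, but large z no longer overflows
    exp_z = np.exp(z - np.max(z))
    # then element-wise divide e^z by the denominator (the sum of all exponentials)
    return exp_z / np.sum(exp_z)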

Test the activation functions: 

In [None]:
sigmoid(10)

In [None]:
sigmoid(-5)

In [None]:
sigmoid(0)

In [None]:
zvec = np.array([0.9, 0.1, 0.7, 0.4])
probs = softmax(zvec)
print(probs)
print(np.sum(probs))

In [None]:
# Forward pass for a single input vector
def forward(x):
    z1 = np.matmul(W1, x) + b1 
    a1 = sigmoid(z1)
    
    z2 = np.matmul(W2,a1) + b2
    a2 = sigmoid(z2)
    
    z3 = np.matmul(W3,a2) + b3
    out = softmax(z3)
    
    return out  # returns probabilities over 3 classes

Testing the forward pass

In [None]:
for i, x in enumerate(data_x):
    probs = forward(x)
    pred_class = np.argmax(probs)
    print(f"Example {i}: softmax: {probs}, predicted:{pred_class}, target:{data_y[i]}")

We will also implement accuracy to measure the performance of the model. 

In [None]:
def accuracy(data_x, data_y):
    correct = 0
    for x, y in zip(data_x, data_y):
        pred = np.argmax(forward(x))
        if pred == y:
            correct += 1
    return correct / len(data_y)

print("Accuracy on dummy data:", accuracy(data_x, data_y))

## Implementing the Backward Pass

The loss function we will use is the negative log-likelihood (categorical cross-entropy): $loss(y',y) = -\sum_{i=1}^d y_{(i)} \log y'_{(i)}$, where $y'$ is the prediction and $y$ is the target. Note that in a classification problem the target is a one-hot vector, so only one of the dimensions contributes to the sum (the one for which the one-hot target is 1). 
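To make the formula concrete, here is a minimal sketch of one way it could be computed. The name `cross_entropy` and the small constant `eps` are assumptions made for this illustration (`eps` only guards against `log(0)` and is not part of the formula).

```python
import numpy as np

def cross_entropy(prediction, y_one_hot, eps=1e-12):
    # Sum over all dimensions; the one-hot target zeroes out every term
    # except the one belonging to the correct class.
    # eps is an assumption added for this sketch to avoid log(0).
    return -np.sum(y_one_hot * np.log(prediction + eps))

# Hypothetical prediction over 3 classes; the correct class is 0
prediction = np.array([0.7, 0.2, 0.1])
y_one_hot = np.array([1.0, 0.0, 0.0])

example_loss = cross_entropy(prediction, y_one_hot)
print(example_loss)  # equals -log(0.7), up to the tiny eps term
```

Because the target is one-hot, the loss only depends on the probability the model assigns to the correct class: the closer that probability is to 1, the closer the loss is to 0.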

In [None]:
def loss(prediction, y_one_hot): 
    return 

We also need a function to compute a one-hot vector from the integer class label for the target. 
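One possible way to build such a vector is sketched here. The name `one_hot_sketch` is hypothetical, chosen so it does not clash with the function you write in the cell that follows.

```python
import numpy as np

def one_hot_sketch(class_label, num_classes):
    # Start with all zeros, then set the entry for the target class to 1
    vec = np.zeros(num_classes)
    vec[class_label] = 1.0
    return vec

encoded = one_hot_sketch(2, 3)
print(encoded)  # [0. 0. 1.]
```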

In [None]:
def one_hot(class_label, num_classes):
    
    
# one_hot(data_y[0],3)  # test it like this

In [None]:
# loss(forward(data_x[0]), one_hot(data_y[0],3))  #and test the loss

We also need the derivative of the sigmoid function 
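The derivative of the sigmoid can be written in terms of the sigmoid itself: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The sketch below (with the hypothetical name `sigmoid_deriv_sketch`) uses this identity and compares it against a finite-difference approximation, which is a handy sanity check for any derivative you implement.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv_sketch(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

# Compare against a central finite difference at a few points
z = np.array([-2.0, 0.0, 3.0])
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_diff = np.max(np.abs(numeric - sigmoid_deriv_sketch(z)))
print(max_diff)  # should be very close to 0
```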

In [None]:
def sigmoid_deriv(z):
    return 

Now for the backward step -- this is for a single input vector and target class. Because we need to keep track of the forward activations (and pre-activations), we will just duplicate the forward pass code below. 

Alternatively, we could call the forward function above and then return the activations. 
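If you get stuck, the sketch below shows one possible way to compute the gradients for this architecture. It is not the only way to organize the code. The function `loss_and_grads`, the `params` list, and the random test parameters are assumptions made for this illustration; the sketch deliberately does not touch the global weights and does not perform the update. It relies on the fact that for a softmax output combined with cross-entropy loss, the gradient with respect to the output pre-activation is simply `out - y`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_and_grads(params, x, y_one_hot):
    """Hypothetical helper: returns the loss and the gradients for one example."""
    W1, b1, W2, b2, W3, b3 = params

    # Forward pass (same structure as the network above)
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    z3 = W3 @ a2 + b3
    out = softmax(z3)
    loss_val = -np.sum(y_one_hot * np.log(out))

    # Output layer: softmax + cross-entropy gives dLoss/dz3 = out - y
    deltaL = out - y_one_hot
    dW3 = np.outer(deltaL, a2)      # column vector times row vector -> (3, 10)
    db3 = deltaL

    # Pass back through W3, then through the sigmoid of hidden layer 2
    da2 = W3.T @ deltaL
    delta2 = da2 * a2 * (1 - a2)    # sigmoid'(z2) written in terms of a2
    dW2 = np.outer(delta2, a1)      # (10, 10)
    db2 = delta2

    # Pass back through W2, then through the sigmoid of hidden layer 1
    da1 = W2.T @ delta2
    delta1 = da1 * a1 * (1 - a1)
    dW1 = np.outer(delta1, x)       # (10, 4)
    db1 = delta1

    return loss_val, [dW1, db1, dW2, db2, dW3, db3]

# Random parameters just for testing the gradients (not the notebook's globals)
rng = np.random.default_rng(0)
params = [rng.normal(size=(10, 4)), rng.normal(size=10),
          rng.normal(size=(10, 10)), rng.normal(size=10),
          rng.normal(size=(3, 10)), rng.normal(size=3)]
x = np.array([0.5, -0.2, 0.1, 0.4])
y = np.array([1.0, 0.0, 0.0])

_, grads = loss_and_grads(params, x, y)

# Finite-difference check on a single entry of W1
h = 1e-6
params[0][2, 1] += h
loss_plus, _ = loss_and_grads(params, x, y)
params[0][2, 1] -= 2 * h
loss_minus, _ = loss_and_grads(params, x, y)
params[0][2, 1] += h               # restore the original value
numeric = (loss_plus - loss_minus) / (2 * h)
grad_error = abs(numeric - grads[0][2, 1])
print(grad_error)  # should be very close to 0
```

The finite-difference check at the end is worth keeping around while you write your own `step` function: if the analytic and numeric gradients disagree, there is a bug in the backward pass.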

In [None]:
def step(x, target, learning_rate = 0.1):
        
    global W1, b1, W2, b2, W3, b3  #ensure that parameter changes survive outside the scope of this function
    
    # Forward pass
    z1 = W1 @ x + b1   # might want to use the @ notation instead of .matmul for readability
    a1 = sigmoid(z1)
    
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    
    z3 = W3 @ a2 + b3
    out = softmax(z3)
    
    
    y = one_hot(target,3)
    loss_val = loss(out, y)
    

    # Backward pass
    
    # *** Output layer ***
    #deltaL = ...
    
    # weight update
    #dW3 = ...             # you need deltaL and a2 and you need to reshape them into a column vector and a row vector. 
    #db3 = ...
    

    # pass back 
    # da2 = ...           # compute da2 from deltaL
    # delta2 = ..         # then compute delta2 from da2 (note: use the derivative of the sigmoid)
    
    # *** Hidden layer 2 ***
    # weight update
    #dW2 = ...
    #db2 = ...
    
    #pass back
    #da1 = ...
    #delta1 = ...
    
    
    # *** Hidden layer 1 ***
    # weight update
    #dW1 = ...
    #db1 = ...
    
    
    # It's customary to actually update the weights in the end, once the gradients are computed
    W3 -= learning_rate * dW3
    b3 -= learning_rate * db3    

    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    
    return loss_val # we will return the loss so we can see it decrease 

Now we can run an epoch -- a single pass over the entire training data.

In [None]:
def epoch(data_x, data_y, lr=0.1):
    total_loss = 0
    for x, y in zip(data_x, data_y):
        total_loss += step(x,y, lr)
    print(f"Epoch loss: {total_loss}")

And now we train the model (for 100 epochs):

In [None]:
learning_rate = 1 # I found that you need to set the learning rate higher to fit the data set

for i in range(100):    
    epoch(data_x, data_y, learning_rate)
    print(accuracy(data_x, data_y))

As mentioned above, you should be able to learn the dummy data _perfectly_. The resulting model will be overfitted, but we have demonstrated that we are learning. 

Optional next step: Train a model for the penguin data! 