In [None]:
import numpy as np

# Gradient Descent with NumPy 

The goal of this notebook is to implement your own version of a multilayer perceptron (neural network) with NumPy and tune its weights with gradient descent. The target is an XOR perceptron.

### XOR
An XOR gate (exclusive OR) is a logic gate that returns true (1) when exactly one of its inputs is true. Otherwise it returns false (0).

| A | B | A XOR B | 
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Whereas other logical operators (such as AND and OR) can be modeled with a single-layer perceptron, XOR is not linearly separable, so it must be modeled with a multi-layer perceptron.
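
To make this concrete, here is a small sketch of single-layer perceptrons for AND and OR, followed by a second layer that combines them into XOR. The weights and biases are hand-picked assumptions (not learned), a step activation is used instead of a sigmoid, and the helper name `step_perceptron` is hypothetical.

```python
import numpy as np

# All four input combinations for a two-input gate.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def step_perceptron(x, weights, bias):
    """Single-layer perceptron with a step activation: fires when x.w + b > 0."""
    return (x @ weights + bias > 0).astype(int)

# Hand-picked (not learned) parameters, chosen for illustration only.
and_out = step_perceptron(inputs, np.array([1.0, 1.0]), -1.5)
or_out = step_perceptron(inputs, np.array([1.0, 1.0]), -0.5)

# A second layer on top of the OR and AND units yields XOR ("OR but not AND").
hidden = np.column_stack([or_out, and_out])
xor_out = step_perceptron(hidden, np.array([1.0, -1.0]), -0.5)

print("AND:", and_out)  # [0 0 0 1]
print("OR: ", or_out)   # [0 1 1 1]
print("XOR:", xor_out)  # [0 1 1 0]
```

No single choice of `weights` and `bias` in `step_perceptron` reproduces the XOR column directly, which is why the extra layer is needed.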

### Step 1. Encode your data
Encode your input data **X** and expected output data *y*. What is your input data? What is your output data? How can you encode this with a NumPy array? 

*Hint*: Ensure your two arrays have the same number of dimensions. Check this with `X.ndim` and `y.ndim`.
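
As an illustration of the hint, the sketch below encodes an AND gate (deliberately not XOR, so the exercise stays open) and shows how a flat target array can be reshaped into a column so that both arrays are 2-D. The names `X_and`, `y_flat`, and `y_and` are made up for this example; the column layout is an assumption based on the evaluation code in Step 5, which indexes `y[i][0]`.

```python
import numpy as np

# Illustrative data for an AND gate: one row per input pair.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # shape (4, 2)

y_flat = np.array([0, 0, 0, 1])     # shape (4,), only 1 dimension
y_and = y_flat.reshape(-1, 1)       # shape (4, 1), one target per row

print(X_and.ndim, y_flat.ndim, y_and.ndim)  # 2 1 2
print(X_and.shape, y_and.shape)             # (4, 2) (4, 1)
```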

In [None]:
# %load ../answers/perceptron_encode.py



### Step 2.  Create your weights
Create a multi-layered perceptron with three layers: input layer, hidden layer and output layer, and define the training hyperparameters. 

- What number of neurons should the input layer have? 
- What number of neurons should the output layer have? 
- What number of neurons should the hidden layer have? 


Initialize your weights. What shape should your weights have? What type of initialization would you use? 
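
One way to reason about the shapes: a weight matrix between two layers has one row per neuron in the incoming layer and one column per neuron in the outgoing layer. The sketch below uses hypothetical layer sizes (3, 5, and 2, chosen only so the shapes are easy to tell apart) and a uniform random initialization in [-1, 1), which is one possible choice rather than the required one.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Hypothetical layer sizes, not the answer to the exercise.
n_inputs, n_hidden, n_outputs = 3, 5, 2

# Each matrix maps (n_samples, n_in) activations to (n_samples, n_out).
w_hidden_demo = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
w_output_demo = rng.uniform(-1.0, 1.0, size=(n_hidden, n_outputs))

batch = rng.uniform(size=(4, n_inputs))  # 4 made-up samples
print((batch @ w_hidden_demo).shape)                  # (4, 5)
print((batch @ w_hidden_demo @ w_output_demo).shape)  # (4, 2)
```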

In [None]:
# %load ../answers/perceptron_weights.py




### Step 3. Define your activation
Define your activation function. In this case, we will be using a sigmoid activation:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Because backpropagation needs it, we also implement the derivative of the sigmoid function: 

$\frac{d}{dx}\sigma(x) = \sigma'(x) = \sigma(x)(1 - \sigma(x))$

In this notebook, however, the derivative function is applied to the *activated* output of a layer, that is, to $a = \sigma(x)$ rather than to $x$ itself. Since $\sigma'(x) = \sigma(x)(1 - \sigma(x)) = a(1 - a)$, the value passed to `sigmoid_derivative` is _already activated_, and the implementation simplifies to $a * (1 - a)$, where $a$ is the function's argument.
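
A quick numerical sanity check of this simplification, as a sketch: the derivative computed from the activated value should match a finite-difference estimate of the slope of the sigmoid. The `_demo` function names are hypothetical and kept separate from the `sigmoid` and `sigmoid_derivative` you are asked to write.

```python
import numpy as np

def sigmoid_demo(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_demo(activated):
    # Expects a = sigmoid(x), not x itself.
    return activated * (1.0 - activated)

x = np.linspace(-4.0, 4.0, 9)
a = sigmoid_demo(x)

# Central finite difference as an independent estimate of the slope at x.
h = 1e-6
numeric = (sigmoid_demo(x + h) - sigmoid_demo(x - h)) / (2 * h)
analytic = sigmoid_derivative_demo(a)

print(np.allclose(numeric, analytic, atol=1e-6))  # True
```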


In [None]:
def sigmoid(x): 
    return ...
    
def sigmoid_derivative(x):
    return ...

In [None]:
# %load ../answers/perceptron_sigmoids.py


### Step 4: Tune the weights
Next, let's train our model! This consists of three steps: 
1. Forward pass
2. Backpropagation
3. Update the weights

#### Forward pass
For each consecutive layer in the multi-layer perceptron, the layer's input is multiplied by the layer's weights. (A bias term could be added at this point as well; the formulas below leave it out.) The result is passed through the activation function (sigmoid) to give the layer's output, which is then used as the input for the next layer.

$\text{output}_\text{hidden} = \sigma(x \cdot w_\text{hidden})$ 

$\text{output}_\text{output} = \sigma(\text{output}_\text{hidden} \cdot w_\text{output} )$
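
The two formulas can be sketched in NumPy as follows. The inputs, layer sizes, and random weights are made-up assumptions meant only to show how the shapes flow through the network; the `_demo` names are hypothetical.

```python
import numpy as np

def sigmoid_demo(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Made-up data: 4 samples with 2 binary features, 3 hidden neurons, 1 output.
x_demo = rng.integers(0, 2, size=(4, 2)).astype(float)
w_hidden_demo = rng.uniform(-1.0, 1.0, size=(2, 3))
w_output_demo = rng.uniform(-1.0, 1.0, size=(3, 1))

# Forward pass: matrix product with the weights, then the activation.
output_hidden = sigmoid_demo(x_demo @ w_hidden_demo)          # shape (4, 3)
output_output = sigmoid_demo(output_hidden @ w_output_demo)   # shape (4, 1)

print(output_hidden.shape, output_output.shape)  # (4, 3) (4, 1)
```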

#### Backpropagation
For backpropagation, calculate the error for each layer, then obtain that layer's delta by multiplying the error element-wise ($\times$) with the sigmoid derivative of the layer's output. It is called _back_ propagation because you start with the error (and delta) of the last layer and work your way back. 

Output layer: 
* $\text{error}_\text{output} = y_\text{true} - y_\text{predicted}$

* $\delta_\text{output} = \text{error}_\text{output} \times \sigma'(\text{output}_\text{output})$ 

Hidden layer:
* $\text{error}_\text{hidden} = \delta_\text{output} \cdot w_\text{output}^T$
 
* $\delta_\text{hidden} = \text{error}_\text{hidden} \times \sigma'(\text{output}_\text{hidden})$ 
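
The four quantities above can be sketched as follows, reading $\times$ as element-wise multiplication (`*`) and $\cdot$ as a matrix product (`@`). The data, layer sizes, and weights are made-up assumptions, and the `_demo` names are hypothetical.

```python
import numpy as np

def sigmoid_demo(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_demo(activated):
    return activated * (1.0 - activated)

rng = np.random.default_rng(seed=2)  # arbitrary seed

# Made-up data: 2 samples, 2 features, 3 hidden neurons, 1 output.
x_demo = np.array([[0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([[1.0], [0.0]])
w_hidden_demo = rng.uniform(-1.0, 1.0, size=(2, 3))
w_output_demo = rng.uniform(-1.0, 1.0, size=(3, 1))

# Forward pass (needed to obtain the layer outputs).
output_hidden = sigmoid_demo(x_demo @ w_hidden_demo)
output_output = sigmoid_demo(output_hidden @ w_output_demo)

# Output layer: error, then delta (element-wise product).
error_output = y_demo - output_output                                   # (2, 1)
delta_output = error_output * sigmoid_derivative_demo(output_output)    # (2, 1)

# Hidden layer: push the output delta back through the output weights.
error_hidden = delta_output @ w_output_demo.T                           # (2, 3)
delta_hidden = error_hidden * sigmoid_derivative_demo(output_hidden)    # (2, 3)

print(delta_output.shape, delta_hidden.shape)  # (2, 1) (2, 3)
```

Each delta has the same shape as the output of its layer, which is a handy check while debugging.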


#### Update the weights
Use the deltas calculated in the previous step to update the weights.

* $w_\text{hidden} = w_\text{hidden} + (x^T \cdot \delta_\text{hidden})$
* $w_\text{output} = w_\text{output} + (\text{output}_\text{hidden}^T \cdot \delta_\text{output})$
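
These updates amount to a gradient descent step with a learning rate of 1 on the loss $\frac{1}{2}\sum (y_\text{true} - y_\text{predicted})^2$. The sketch below checks this numerically on made-up data by comparing the update terms with a finite-difference estimate of the negative gradient. The helper names `half_squared_error` and `numeric_gradient` and all `_demo` variables are hypothetical.

```python
import numpy as np

def sigmoid_demo(x):
    return 1.0 / (1.0 + np.exp(-x))

def half_squared_error(w_hidden, w_output, x, y):
    """Loss 0.5 * sum((y - y_hat)^2) for the two-layer network without biases."""
    y_hat = sigmoid_demo(sigmoid_demo(x @ w_hidden) @ w_output)
    return 0.5 * np.sum((y - y_hat) ** 2)

def numeric_gradient(loss_fn, w, eps=1e-6):
    """Central-difference estimate of d loss / d w, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_up, w_down = w.copy(), w.copy()
        w_up[idx] += eps
        w_down[idx] -= eps
        grad[idx] = (loss_fn(w_up) - loss_fn(w_down)) / (2 * eps)
    return grad

rng = np.random.default_rng(seed=3)  # arbitrary seed
x_demo = rng.uniform(size=(4, 2))
y_demo = rng.uniform(size=(4, 1))
w_hidden_demo = rng.uniform(-1.0, 1.0, size=(2, 3))
w_output_demo = rng.uniform(-1.0, 1.0, size=(3, 1))

# Forward pass and deltas, following the formulas in this section.
output_hidden = sigmoid_demo(x_demo @ w_hidden_demo)
output_output = sigmoid_demo(output_hidden @ w_output_demo)
delta_output = (y_demo - output_output) * output_output * (1 - output_output)
delta_hidden = (delta_output @ w_output_demo.T) * output_hidden * (1 - output_hidden)

# The terms that get added to the weights.
step_hidden = x_demo.T @ delta_hidden          # shape (2, 3)
step_output = output_hidden.T @ delta_output   # shape (3, 1)

grad_hidden = numeric_gradient(
    lambda w: half_squared_error(w, w_output_demo, x_demo, y_demo), w_hidden_demo)
grad_output = numeric_gradient(
    lambda w: half_squared_error(w_hidden_demo, w, x_demo, y_demo), w_output_demo)

print(np.allclose(step_hidden, -grad_hidden, atol=1e-7))  # True
print(np.allclose(step_output, -grad_output, atol=1e-7))  # True
```

If training turns out to be unstable, scaling the update terms by a learning rate smaller than 1 is a common adjustment.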

_Tip:_ the `@` operator performs matrix multiplication on NumPy arrays (it calls `np.matmul`). For the 2-D arrays used in this notebook, it gives the same result as `np.dot` ([see documentation](https://numpy.org/doc/stable/reference/generated/numpy.dot.html)).

_Tip:_ `x.T` will give you the transpose of `x` if `x` is a `numpy` array.
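
A short demonstration of both tips on small made-up arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
b = np.array([[1, 0], [0, 1]])           # shape (2, 2), identity matrix

# For 2-D arrays, `@` and np.dot produce the same matrix product.
print(np.array_equal(a @ b, np.dot(a, b)))  # True

# `.T` swaps the two axes, so the inner dimensions line up for a.T @ a.
print(a.T.shape)         # (2, 3)
print((a.T @ a).shape)   # (2, 2)
```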

In [None]:
epochs = 10000
for _ in range(epochs):
    # Forward pass. 
    hidden_layer = ...
    hidden_activated = ...

    output_layer = ...
    output_activated = ...
    y_hat = output_activated
    
    # Backpropagation / error calculation
    error_output = ...
    delta_output = ...
    
    error_hidden = ...
    delta_hidden = ...
    
    # Update weights. 
    output_weights += ...
    hidden_weights += ...
    
print(y)
y_hat


In [None]:
# %load ../answers/perceptron_network.py


### Step 5: Evaluate
Evaluate! Note that in our evaluation code, we've called the output of the final layer of the neural network `y_hat`. Feel free to give the variable a different name, but change it accordingly in the evaluation code.
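
Besides printing each prediction, you could summarize the result as an accuracy by rounding the sigmoid outputs to 0 or 1. The sketch below uses hypothetical targets and predictions (`y_demo` and `y_hat_demo` are made-up values, not real network output).

```python
import numpy as np

# Hypothetical targets and network outputs, made up for illustration.
y_demo = np.array([[0], [1], [1], [0]])
y_hat_demo = np.array([[0.08], [0.93], [0.91], [0.62]])

# Rounding acts as a threshold at 0.5 on the sigmoid output.
predicted_labels = np.round(y_hat_demo).astype(int)
accuracy = np.mean(predicted_labels == y_demo)

print(predicted_labels.ravel())  # [0 1 1 1]
print(accuracy)                  # 0.75
```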

In [None]:
# Step 5. Evaluate
for i, input_pair in enumerate(X):
    target = y[i][0]
    predicted_output = y_hat[i][0]
    print(f'Input: {input_pair}, target: {target}, predicted output: {predicted_output} ({np.round(predicted_output).astype(np.int16)})')

In [None]:
%load ../answers/perceptron.py