# Part 2 - The hidden layer

Welcome to Part 2!

Here, we'll implement a neural network with 2 layers (a hidden layer and an output layer) and walk through feedforward and backpropagation. We'll use NumPy, a library that provides fast alternatives to math operations in Python and is designed to work efficiently with groups of numbers, such as matrices.

## Index

- [Multi-layer perceptron](#multi-layer-perceptron)
- [The weights](#the-weigts)
- [Feedforward](#feedforward)
  - The hidden layer
  - The output layer
- Backpropagation
  - Calculating the error
  - Learning
- Full implementation in a real case

## Multi-layer perceptron <a id='multi-layer-perceptron'></a>

In [Part 1](Part1.ipynb) we implemented the simplest neural network, a perceptron. That network doesn't have a hidden layer, so it can only draw a linear decision boundary and can't make good predictions for more complex problems.

When we combine perceptrons so that the output of one becomes the input of another one, we form a multi-layer perceptron or a neural network.

Neural networks have a certain special architecture with layers:

![Neural network](img/part2/1.png)

- **Input layer:** contains the inputs $x_1$, $x_2$, $...$, $x_n$, $1$.
- **Hidden layer:** set of linear models created with the first input layer.
- **Output layer:** where the linear models get combined to obtain a nonlinear model.

Now, not all neural networks look like the one above. They can be way more complicated. In particular, we can do the following things:
- Add more nodes to the input, hidden and output layers.
- Add more layers.

The following image shows the network with which we will work, with its input units, labeled $x_1$, $x_2$, and $x_3$, its hidden nodes labeled $h_1$ and $h_2$, and all of the weights between the input layer and the hidden layer, labeled with their appropriate $w_{ij}$ indices:

![Neural network for this exercise](img/part2/2.png)

## The weights <a id='the-weigts'></a>

The weights need to be stored in a **matrix**, indexed as $w_{ij}$. Each **row** in the matrix will correspond to the weights **leading out** of a **single input unit**, and each **column** will correspond to the weights **leading in** to a **single hidden unit**.

For our three input units and two hidden units, the weights matrix looks like this:

![Weights matrix](img/part2/3.png)
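
To make the row and column convention concrete, here is a small sketch. The matrix values are made up for illustration, and the names `example_weights`, `out_of_x1`, and `into_h2` are hypothetical, not part of the network we build below.

```python
import numpy as np

# Hypothetical 3x2 weights matrix: 3 input units (rows), 2 hidden units (columns).
# Entry [i, j] is the weight from input unit i+1 to hidden unit j+1
# (NumPy indices start at 0, the w_ij notation starts at 1).
example_weights = np.array([
    [0.1, 0.2],   # w_11, w_12: weights leading out of x_1
    [0.3, 0.4],   # w_21, w_22: weights leading out of x_2
    [0.5, 0.6]])  # w_31, w_32: weights leading out of x_3

# Row 0: all weights leading out of the first input unit
out_of_x1 = example_weights[0, :]

# Column 1: all weights leading in to the second hidden unit
into_h2 = example_weights[:, 1]

print(out_of_x1)
print(into_h2)
```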

To initialize these weights in NumPy, we have to provide the shape of the matrix. If `features` is a 2D array containing the input data:

In [1]:
import numpy as np

# Use the same seed to make debugging easier
np.random.seed(21)

features = np.array([
    [1, 2, 3], 
    [4, 5, 6]])

# Number of records and input units
n_records, n_inputs = features.shape

# Input to hidden weights
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))

# Hidden to output weights
weights_hidden_to_output = np.random.normal(0, n_inputs**-0.5, size=n_hidden)

This creates a 2D array (i.e. matrix) named `weights_input_to_hidden` with dimensions `n_inputs` by `n_hidden`:

In [2]:
print(weights_input_to_hidden)

[[-0.03000157 -0.06419907]
 [ 0.60148166 -0.72557877]
 [ 0.43034978 -0.98787735]]


and a 1D array named `weights_hidden_to_output` with length `n_hidden`:

In [3]:
print(weights_hidden_to_output)

[-0.11885586 -0.1354298 ]
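
The second argument of `np.random.normal` is the standard deviation, so the weights above are drawn around 0 with a spread of `n_inputs**-0.5`, that is $1/\sqrt{n}$. The sketch below only checks that scale empirically. It assumes a separate generator (`np.random.default_rng`) with an arbitrary seed and a large sample size, neither of which is used elsewhere in this notebook.

```python
import numpy as np

# Assumption: a local generator with an arbitrary seed, used only for this check
rng = np.random.default_rng(0)

n_inputs = 3
scale = n_inputs ** -0.5  # about 0.577

# Draw many samples so the empirical statistics are meaningful
sample = rng.normal(0, scale, size=100_000)

print('target std:    ', scale)
print('empirical std: ', sample.std())
print('empirical mean:', sample.mean())
```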


## Feedforward <a id='feedforward'></a>

Feedforward is the process neural networks use to turn the input into an output:

![Feedforward](img/part2/4.png)

In a multi-layer perceptron or neural network, to calculate a prediction $\hat{y}$ we start with the input vector $x$, then apply the first matrix $W^{(1)}$ and a sigmoid activation function to get the values in the second layer. Then we apply the second matrix $W^{(2)}$ and another sigmoid function to get the values in the third layer, and so on, until we get our final prediction $\hat{y}$. This is the feedforward process that neural networks use to obtain the prediction from the input vector.
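
The general idea can be sketched as a loop over weight matrices. This is a minimal illustration, not the implementation used in the rest of this notebook: the function name `feedforward` is hypothetical, biases are left out, and the all-zero weights are chosen only because they make the result easy to predict.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, weight_matrices):
    """Apply each weights matrix followed by a sigmoid, layer by layer."""
    activation = x
    for weights in weight_matrices:
        activation = sigmoid(np.dot(activation, weights))
    return activation

# Hypothetical network: 3 inputs -> 2 hidden units -> 1 output, all weights zero
x = np.array([1.0, 2.0, 3.0])
w1 = np.zeros((3, 2))
w2 = np.zeros(2)

# With zero weights every unit receives 0, and sigmoid(0) = 0.5
print(feedforward(x, [w1, w2]))  # 0.5
```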

Let's walk through this process step by step for our neural network with two layers (a hidden layer and an output layer).

### The hidden layer

The input to a hidden unit is the sum of all the inputs multiplied by the hidden unit's weights. So for each hidden layer unit, $h_j$, we need to calculate the following:
$$
h_j = \sum_i w_{ij}x_i
$$

To do that, we need to use **matrix multiplication**.

In this case, we're multiplying the inputs (a row vector here) by the weights. To do this, you take the dot (inner) product of the inputs with each column in the weights matrix. For example, to calculate the input to the first hidden unit, $j=1$, you'd take the dot product of the inputs with the first column of the weights matrix, like so:

![Input to the first hidden unit](img/part2/5.png)

$$
h_1 = w_{11}x_1 + w_{21}x_2 + w_{31}x_3
$$

For the second hidden unit, you calculate the dot product of the inputs with the second column, and so on for any further hidden units.
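
Written out with explicit loops, the sum $h_j = \sum_i w_{ij}x_i$ looks like the sketch below. The input and weight values are made up for illustration, and the short names `x`, `w`, and `h` are hypothetical.

```python
import numpy as np

# Hypothetical inputs and weights (3 input units, 2 hidden units)
x = np.array([1.0, 2.0, 3.0])
w = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6]])

n_inputs, n_hidden = w.shape
h = np.zeros(n_hidden)

# h_j = sum_i w_ij * x_i, written out unit by unit
for j in range(n_hidden):
    for i in range(n_inputs):
        h[j] += w[i, j] * x[i]

print(h)  # approximately [2.2 2.8]
```

The nested loops compute one dot product per column of `w`, which is exactly what a single matrix multiplication does in one call.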

In NumPy, you can do this for all the inputs and all the outputs at once using `np.dot`:

In [7]:
input = features[0]

# Calculate the inputs for the hidden layer
hidden_inputs = np.dot(input, weights_input_to_hidden)

print('input')
print(input)
print('\nweights_input_to_hidden')
print(weights_input_to_hidden)
print('\nhidden_inputs')
print(hidden_inputs)

input
[1 2 3]

weights_input_to_hidden
[[-0.03000157 -0.06419907]
 [ 0.60148166 -0.72557877]
 [ 0.43034978 -0.98787735]]

hidden_inputs
[ 2.46401109 -4.47898866]
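
The same call also works on every record at once. The sketch below passes the whole `features` array instead of a single row, assuming the weights are copied from the printed output above (so they are rounded to 8 decimals). The name `all_hidden_inputs` is hypothetical.

```python
import numpy as np

features = np.array([
    [1, 2, 3],
    [4, 5, 6]])

# Weights copied from the printed output above (rounded to 8 decimals)
weights_input_to_hidden = np.array([
    [-0.03000157, -0.06419907],
    [ 0.60148166, -0.72557877],
    [ 0.43034978, -0.98787735]])

# One row of hidden inputs per record: shape (n_records, n_hidden)
all_hidden_inputs = np.dot(features, weights_input_to_hidden)

print(all_hidden_inputs.shape)
print(all_hidden_inputs[0])  # matches hidden_inputs for features[0]
```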


Now that you have the inputs for the hidden layer, you calculate the outputs of that hidden layer by passing the inputs through an activation function, in this case the sigmoid function:

In [8]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden_outputs = sigmoid(hidden_inputs)
hidden_outputs

array([ 0.92158004,  0.01121762])

After this process we have our neural network with values in each of its hidden units:

![Neural network with values in each of its hidden units](img/part2/6.png)

### The output layer

Now that you have the outputs of the hidden layer, it's time to calculate the input to the output unit. The process is the same as for the hidden layer, but instead of multiplying the inputs by the hidden units' weights, we multiply the hidden layer's outputs by the output unit's weights:

In [9]:
output_inputs = np.dot(hidden_outputs, weights_hidden_to_output)

print('hidden_outputs')
print(hidden_outputs)
print('\nweights_hidden_to_output')
print(weights_hidden_to_output)
print('\noutput_inputs')
print(output_inputs)

hidden_outputs
[ 0.92158004  0.01121762]

weights_hidden_to_output
[-0.11885586 -0.1354298 ]

output_inputs
-0.11105438401


And finally, we calculate the sigmoid of the result to have our prediction $\hat y$:

In [10]:
output = sigmoid(output_inputs)
output

0.47226490306194757

![Prediction](img/part2/7.png)
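
Both steps can be wrapped into one function. This is a minimal sketch for a single record and a single output unit; the function name `forward_pass` and the short names `w_ih` and `w_ho` are hypothetical, and the weights are copied from the printed output above, so the results agree only up to rounding.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_pass(x, weights_input_to_hidden, weights_hidden_to_output):
    """Return (hidden_outputs, output) for one record."""
    hidden_outputs = sigmoid(np.dot(x, weights_input_to_hidden))
    output = sigmoid(np.dot(hidden_outputs, weights_hidden_to_output))
    return hidden_outputs, output

# Weights copied from the printed output above (rounded to 8 decimals)
w_ih = np.array([
    [-0.03000157, -0.06419907],
    [ 0.60148166, -0.72557877],
    [ 0.43034978, -0.98787735]])
w_ho = np.array([-0.11885586, -0.1354298])

hidden_outputs, output = forward_pass(np.array([1, 2, 3]), w_ih, w_ho)
print(hidden_outputs)
print(output)  # close to 0.4723
```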

## Backpropagation

Training is the process that looks for the parameters a neural network should have on its edges (weights) in order to model our data well. One method used for training a neural network is backpropagation.

In a nutshell, backpropagation consists of:
- Calculating the error of the prediction.
- Running the feedforward operation backward (backpropagation) to spread the error to each of the weights.
- Using this to update the weights and get a better model (learning).
- Repeating this until we have a model that is good enough.

### Calculating the error

To update the weights leading into the hidden layers, we use an algorithm called gradient descent. To do this, you need to know how much each hidden unit contributed to the error in the final output. Since the output of a layer is determined by the weights between layers, the error resulting from each unit is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backward to the hidden layers.

You can view this process as flipping the network over and using the error as the input:

![Backpropagation](img/part2/8.png)

In the output layer, you have error terms $\delta_k^o$ attributed to each output unit $k$ (the superscript $o$ stands for "output"):

$$
\delta_k^o = (y_k - \hat{y}_k) \, f'(\text{output}_k)
$$

Remember that we are using the sigmoid for the activation function $f(h) = 1/(1+e^{-h})$ and:

$$
f'(h) = f(h)(1 - f(h))
$$
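
The identity $f'(h) = f(h)(1 - f(h))$ can be checked numerically. The sketch below compares it against a central finite difference; the helper name `sigmoid_prime`, the test points, and the step size `eps` are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    # Derivative expressed through the sigmoid itself
    return sigmoid(h) * (1 - sigmoid(h))

h = np.array([-2.0, 0.0, 2.0])
eps = 1e-6

# Central finite difference approximation of f'(h)
numerical = (sigmoid(h + eps) - sigmoid(h - eps)) / (2 * eps)

print(sigmoid_prime(h))
print(numerical)
```

This is why the code only needs the unit's output to get the gradient: once `output = sigmoid(...)` is known, the derivative is `output * (1 - output)`.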

In [11]:
# The correct value we're trying to predict.
target = 0.6

# Calculate the network's output error
error = target - output

# Calculate the output layer's error term
output_error_term = error * output * (1 - output)

print('output_error_term')
print(output_error_term)

output_error_term
0.0318355158503


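Then, the error attributed to hidden unit $j$ is the sum of the output error terms $\delta_k^o$, each scaled by the weight $W_{jk}$ between hidden unit $j$ and output unit $k$, multiplied by the gradient of the activation function at that hidden unit:

$$
\delta_j^h = \sum_k W_{jk} \, \delta_k^o \, f'(h_j)
$$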

In [12]:
# Calculate the hidden layer's error term
hidden_error = output_error_term * weights_hidden_to_output
hidden_error_term =  hidden_error * hidden_outputs * (1 - hidden_outputs)

print('weights_hidden_to_output')
print(weights_hidden_to_output)
print('\noutput_error_term')
print(output_error_term)
print('\nhidden_error')
print(hidden_error)
print('\nhidden_outputs')
print(hidden_outputs)
print('\nhidden_error_term')
print(hidden_error_term)

weights_hidden_to_output
[-0.11885586 -0.1354298 ]

output_error_term
0.0318355158503

hidden_error
[-0.00378384 -0.00431148]

hidden_outputs
[ 0.92158004  0.01121762]

hidden_error_term
[ -2.73458970e-04  -4.78219744e-05]


![Neural network with error terms](img/part2/9.png)
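
Our network has a single output unit, so the sum over $k$ has only one term. With several output units the same formula becomes a matrix-vector product. The sketch below assumes a hypothetical network with 2 hidden units and 3 output units; all values are made up for illustration.

```python
import numpy as np

# Hypothetical weights: entry [j, k] connects hidden unit j to output unit k
weights_hidden_to_output = np.array([
    [0.1, -0.2,  0.3],
    [0.4,  0.5, -0.6]])

# Made-up error terms for the 3 output units and activations of the 2 hidden units
output_error_terms = np.array([0.05, -0.01, 0.02])
hidden_outputs = np.array([0.9, 0.2])

# sum_k W_jk * delta_k for every hidden unit j
hidden_error = np.dot(weights_hidden_to_output, output_error_terms)

# Multiply by f'(h_j), written as f(h_j) * (1 - f(h_j))
hidden_error_term = hidden_error * hidden_outputs * (1 - hidden_outputs)

print(hidden_error)
print(hidden_error_term)
```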

### Learning

"Learning" takes our errors and tells each weight how it should change to reduce them.

Then, the change to the weights will be:

$$
\Delta w_{ij} = \eta \delta_j^h x_i
$$

where $w_{ij}$ are the weights between the inputs and hidden layer and $x_i$ are the input unit values. This form holds for however many layers there are. The weight steps are equal to the learning rate times the output error of the layer times the values of the inputs to that layer:

$$
\Delta w_{pq} = \eta \delta_{output} V_{in}
$$

Here, you get the output error, $\delta_{output}$, by propagating the errors backward from higher layers. The input values, $V_{in}$, are the inputs to the layer. For the weights leading into the output unit, for example, they are the hidden layer activations.

In [13]:
learnrate = 0.1

# Calculate change in weights for hidden layer to output layer
delta_weights_hidden_to_output = learnrate * output_error_term * hidden_outputs

# Calculate change in weights for input layer to hidden layer
delta_weights_input_to_hidden = learnrate * hidden_error_term * input[:, None]

print('output_error_term =', output_error_term)
print('hidden_outputs = ', hidden_outputs)
print('Change in weights for hidden layer to output layer:')
print(delta_weights_hidden_to_output)

print('\nhidden_error_term = ', hidden_error_term)
print('input = ', input)
print('Change in weights for input layer to hidden layer:')
print(delta_weights_input_to_hidden)

output_error_term = 0.0318355158503
hidden_outputs =  [ 0.92158004  0.01121762]
Change in weights for hidden layer to output layer:
[  2.93389758e-03   3.57118668e-05]

hidden_error_term =  [ -2.73458970e-04  -4.78219744e-05]
input =  [1 2 3]
Change in weights for input layer to hidden layer:
[[ -2.73458970e-05  -4.78219744e-06]
 [ -5.46917940e-05  -9.56439489e-06]
 [ -8.20376911e-05  -1.43465923e-05]]
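
The expression `hidden_error_term * input[:, None]` relies on broadcasting: a column of shape `(3, 1)` times a row of shape `(2,)` gives a `(3, 2)` matrix. The sketch below shows that this matches an explicit outer product via `np.outer`. The error terms are copied from the printed output above, and the names `via_broadcasting` and `via_outer` are hypothetical.

```python
import numpy as np

learnrate = 0.1
x = np.array([1, 2, 3])

# Hidden error terms copied from the printed output above
hidden_error_term = np.array([-2.73458970e-04, -4.78219744e-05])

# Broadcasting: a (3, 1) column times a (2,) row gives a (3, 2) matrix
via_broadcasting = learnrate * hidden_error_term * x[:, None]

# The same matrix as an explicit outer product: entry [i, j] = x_i * delta_j
via_outer = learnrate * np.outer(x, hidden_error_term)

print(via_outer.shape)
print(np.allclose(via_broadcasting, via_outer))
```

Either form gives one weight step per connection, in the same layout as `weights_input_to_hidden`.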


With the changes in weights computed, our new weights are:

In [14]:
print('Old weights for hidden layer to output layer:')
print(weights_hidden_to_output)

print('\nOld weights for input layer to hidden layer:')
print(weights_input_to_hidden)

# Calculate new weights
new_weights_hidden_to_output = weights_hidden_to_output + delta_weights_hidden_to_output
new_weights_input_to_hidden = weights_input_to_hidden + delta_weights_input_to_hidden

print('\n\nNew weights for hidden layer to output layer:')
print(new_weights_hidden_to_output)

print('\nNew weights for input layer to hidden layer:')
print(new_weights_input_to_hidden)

Old weights for hidden layer to output layer:
[-0.11885586 -0.1354298 ]

Old weights for input layer to hidden layer:
[[-0.03000157 -0.06419907]
 [ 0.60148166 -0.72557877]
 [ 0.43034978 -0.98787735]]


New weights for hidden layer to output layer:
[-0.11592196 -0.13539409]

New weights for input layer to hidden layer:
[[-0.03002892 -0.06420385]
 [ 0.60142697 -0.72558833]
 [ 0.43026774 -0.9878917 ]]
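
As a sanity check, a forward pass with the new weights should land slightly closer to the target than before. The sketch below assumes the old and new weights are copied from the printed output above (rounded to 8 decimals); the helper `predict` and the short variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(x, w_ih, w_ho):
    """Feedforward through the hidden layer and the output unit."""
    return sigmoid(np.dot(sigmoid(np.dot(x, w_ih)), w_ho))

x = np.array([1, 2, 3])
target = 0.6

# Old and new weights copied from the printed output above
old_w_ih = np.array([
    [-0.03000157, -0.06419907],
    [ 0.60148166, -0.72557877],
    [ 0.43034978, -0.98787735]])
old_w_ho = np.array([-0.11885586, -0.1354298])

new_w_ih = np.array([
    [-0.03002892, -0.06420385],
    [ 0.60142697, -0.72558833],
    [ 0.43026774, -0.9878917 ]])
new_w_ho = np.array([-0.11592196, -0.13539409])

old_error = target - predict(x, old_w_ih, old_w_ho)
new_error = target - predict(x, new_w_ih, new_w_ho)

print('error before update:', old_error)
print('error after update: ', new_error)
```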


You may think that this was a tiny change in our weights, but this process was carried out with the input data of just one record. To find a better model (with better weights and predictions), you have to repeat this process over your whole dataset many times. We'll do this next.
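
As a small preview of that iteration, the sketch below repeats the same update many times on the single record used above. It is a simplification under stated assumptions: a real training loop would go over every record in the dataset, the step count of 1000 is arbitrary, and the starting weights are copied from the printed output above. The short names `w_ih`, `w_ho`, and `errors` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([1.0, 2.0, 3.0])
target = 0.6
learnrate = 0.1

# Starting weights copied from the printed output above (rounded to 8 decimals)
w_ih = np.array([
    [-0.03000157, -0.06419907],
    [ 0.60148166, -0.72557877],
    [ 0.43034978, -0.98787735]])
w_ho = np.array([-0.11885586, -0.1354298])

errors = []
for step in range(1000):  # arbitrary number of repetitions
    # Feedforward
    hidden_outputs = sigmoid(np.dot(x, w_ih))
    output = sigmoid(np.dot(hidden_outputs, w_ho))

    # Error terms (computed before any weight is changed)
    error = target - output
    errors.append(abs(error))
    output_error_term = error * output * (1 - output)
    hidden_error_term = output_error_term * w_ho * hidden_outputs * (1 - hidden_outputs)

    # Weight updates
    w_ho += learnrate * output_error_term * hidden_outputs
    w_ih += learnrate * hidden_error_term * x[:, None]

print('first error:', errors[0])
print('last error: ', errors[-1])
```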

The takeaway is that you understand how all the pieces that allow a neural network to "learn" fit together.

# Feedback