# Backpropagation

This notebook demonstrates how the backpropagation algorithm computes the gradients used to update the weights of a neural network. The hand-derived gradients are compared against those computed by PyTorch, a popular machine learning framework.

In [None]:
import numpy as np
import torch
import torch.nn as nn
from collections import OrderedDict

## A Simple Regression Problem

Given a dataset made up of 8 points sampled along the line $y = x$, our goal is to fit a neural network with 1 hidden layer consisting of a single node followed by an output layer with a single node. We will use mean-squared error as our error function.

In [None]:
data = np.linspace(-1, 1, 8)
targets = data

# Create a simple network in PyTorch
class MyModel(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.hidden = nn.Linear(1, 1)
        self.activation = nn.Sigmoid()
        self.output = nn.Linear(1, 1)

    def forward(self, x):
        a1 = self.hidden(x)
        z1 = self.activation(a1)
        a2 = self.output(z1)

        return [a1, z1, a2]

model = MyModel()

weights_dict = OrderedDict(
    {
        # Use float literals so every tensor matches the float32 parameters
        'hidden.weight': torch.tensor([[-0.1]]),
        'hidden.bias': torch.tensor([0.0]),
        'output.weight': torch.tensor([[-0.1]]),
        'output.bias': torch.tensor([0.0])
    }
)

model.load_state_dict(weights_dict)

loss_fn = nn.MSELoss()

In [None]:
data_pt = torch.from_numpy(data.astype(np.float32))
targets_pt = torch.from_numpy(targets.astype(np.float32))
[a1, z1, y_hat] = model(data_pt.unsqueeze(1))
print(y_hat)

loss = loss_fn(y_hat, targets_pt.unsqueeze(1))
loss.retain_grad()
print(loss)

# Compute the backward pass
model.zero_grad()
loss.backward()

print('** PyTorch Gradients **')
print(f'Loss = {loss.grad}')
print(f'Hidden Weight = {model.hidden.weight.grad.item():.3}')
print(f'Hidden Bias = {model.hidden.bias.grad.item():.3}')
print(f'Output Weight = {model.output.weight.grad.item():.3}')
print(f'Output Bias = {model.output.bias.grad.item():.3}')

# Manual computation of the same gradients via the chain rule.
# Each quantity below is per-sample; the mean taken when printing
# supplies the 1/n factor of the MSE.
print('\n** Manual Computation **')
dLdy_hat = 2 * (y_hat - targets_pt.unsqueeze(1))
dLdw2 = dLdy_hat * z1
dLdb2 = dLdy_hat * 1
# y_hat = w2 * z1 + b2, so d(y_hat)/d(z1) is the OUTPUT layer weight
dy_hatdz1 = model.output.weight.item()

# Computing back to layer 1
dz1da1 = z1 * (1 - z1) # sigmoid derivative
da1dw1 = data_pt.unsqueeze(1)
da1db1 = 1

# Combining Results
dLdw1 = dLdy_hat * dy_hatdz1 * dz1da1 * da1dw1
dLdb1 = dLdy_hat * dy_hatdz1 * dz1da1 * da1db1

print(f'Hidden Weight = {dLdw1.mean().item():.3}')
print(f'Hidden Bias = {dLdb1.mean().item():.3}')
print(f'Output Weight = {dLdw2.mean().item():.3}')
print(f'Output Bias = {dLdb2.mean().item():.3}')

# Computing the Gradients Manually

Now let's compute the gradients of each function.
If we do this correctly, the result will match the output of PyTorch.

## The Loss Function

The loss function we are using is Mean Squared Error:

$$
\mathcal{L} = \text{MSE}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

In the example above, our `loss` is `0.4289`. This is the value that is produced by $\mathcal{L}$.
The first gradient is

$$
\frac{\partial \mathcal{L}}{\partial \mathcal{L}} = 1,
$$

confirmed by printing `loss.grad` in the code above.
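To make the loss formula concrete, the forward pass and the MSE can be reproduced without PyTorch. The following is a minimal NumPy sketch, assuming the same data and initial parameters as the model above; the names `sigmoid`, `w1`, `b1`, `w2`, and `b2` are introduced here for illustration only.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, the hidden-layer activation."""
    return 1.0 / (1.0 + np.exp(-a))

# Same data and initial parameters as the PyTorch model above.
x = np.linspace(-1, 1, 8)
y = x
w1, b1 = -0.1, 0.0  # hidden layer weight and bias
w2, b2 = -0.1, 0.0  # output layer weight and bias

# Forward pass: hidden pre-activation, activation, then output.
a1 = w1 * x + b1
z1 = sigmoid(a1)
y_hat = w2 * z1 + b2

# Mean squared error: average of the squared residuals.
mse = np.mean((y_hat - y) ** 2)
print(f"MSE = {mse:.4f}")
```

The printed value should agree with the `loss` reported by PyTorch above, up to float32 rounding.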

## Output Layer

There are two inputs into $\mathcal{L}$: 
1. The output of our output layer $\hat{\mathbf{y}}$.
2. The targets $\mathbf{y}$.

This means there are two gradients we could compute. The gradient with respect to the targets $\mathbf{y}$ is not useful here, so we will only consider

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{\hat{y}}} = \frac{2}{n} (\mathbf{\hat{y}} - \mathbf{y}).
$$
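Differentiating the MSE definition term by term gives $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$. One way to sanity-check this is a central finite-difference approximation. The sketch below assumes arbitrary random predictions (they are not the network's outputs), and the helper `mse` is a hypothetical name used only here.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

rng = np.random.default_rng(0)
y = np.linspace(-1, 1, 8)
y_hat = rng.normal(size=8)  # arbitrary predictions, for illustration only

# Analytic gradient: d(MSE)/d(y_hat_i) = (2 / n) * (y_hat_i - y_i)
analytic = 2.0 * (y_hat - y) / y.size

# Numerical gradient: perturb one prediction at a time.
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(y_hat.size):
    step = np.zeros_like(y_hat)
    step[i] = eps
    numeric[i] = (mse(y_hat + step, y) - mse(y_hat - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Because the MSE is quadratic in each prediction, the central difference agrees with the analytic gradient up to floating-point rounding.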

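To get the gradients with respect to the output layer's weight and bias, we first need the local derivatives of $\mathbf{\hat{y}} = \mathbf{w}^{(2)} \mathbf{z}^{(1)} + \mathbf{b}^{(2)}$:

$$
\begin{align*}
\frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{w}^{(2)}} &= \mathbf{z}^{(1)}\\[0.5em]
\frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{b}^{(2)}} &= 1
\end{align*}
$$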

We now have the values needed to compute the gradients for the output layer.

$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}^{(2)}} &= \frac{\partial \mathcal{L}}{\partial \mathbf{\hat{y}}} \frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{w}^{(2)}} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial \hat{y}_i} z_i^{(1)}\\[0.5em]
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(2)}} &= \frac{\partial \mathcal{L}}{\partial \mathbf{\hat{y}}} \frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{b}^{(2)}} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial \hat{y}_i}
\end{align*}
$$
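The output-layer chain rule can also be checked numerically. The sketch below is a NumPy approximation of the setup above, not the notebook's PyTorch code: it assumes the same data and initial parameters, uses $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$ from the MSE definition, and compares the result with central finite differences. The helper `loss` and the scalar names `w1`, `b1`, `w2`, `b2` are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, the hidden-layer activation."""
    return 1.0 / (1.0 + np.exp(-a))

# Same data and initial parameters as the PyTorch model above.
x = np.linspace(-1, 1, 8)
y = x
w1, b1, w2, b2 = -0.1, 0.0, -0.1, 0.0

def loss(w2, b2):
    """MSE as a function of the output-layer parameters only."""
    z1 = sigmoid(w1 * x + b1)
    return np.mean((w2 * z1 + b2 - y) ** 2)

# Forward pass.
z1 = sigmoid(w1 * x + b1)
y_hat = w2 * z1 + b2

# Upstream gradient dL/dy_hat, then the chain rule for each parameter.
dL_dy_hat = 2.0 * (y_hat - y) / x.size
dL_dw2 = np.sum(dL_dy_hat * z1)  # d(y_hat_i)/d(w2) = z1_i
dL_db2 = np.sum(dL_dy_hat)       # d(y_hat_i)/d(b2) = 1

# Central finite differences as an independent check.
eps = 1e-6
num_dw2 = (loss(w2 + eps, b2) - loss(w2 - eps, b2)) / (2 * eps)
num_db2 = (loss(w2, b2 + eps) - loss(w2, b2 - eps)) / (2 * eps)

print(f"dL/dw2: chain rule {dL_dw2:.6f}, finite difference {num_dw2:.6f}")
print(f"dL/db2: chain rule {dL_db2:.6f}, finite difference {num_db2:.6f}")
```

If the derivation is right, both pairs of numbers agree, and they should also be close to the `Output Weight` and `Output Bias` gradients printed by PyTorch.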