# Backpropagation in a Multi-Layer Neural Network

We explore backpropagation for a single training example whose features are $\begin{bmatrix}-10 & 20\end{bmatrix}$ and whose correct category is $1$, one-hot encoded as $\begin{bmatrix}0 & 1\end{bmatrix}$.

The parameters of the multi-layer neural network (the weights and biases) that minimize the loss are found by gradient descent: each parameter is repeatedly moved a small step against the gradient of the loss with respect to that parameter.
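
As a reminder of what one descent step looks like, the update can be written as below. This is a generic sketch: the learning rate $\eta$ and the symbols $W_1, b_1, W_2, b_2$ (standing for `weights1`, `bias1`, `weights2`, `bias2`) are notation introduced here for illustration, and the notebook itself does not fix a value for $\eta$.

$$
\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}, \qquad \theta \in \{W_1,\ b_1,\ W_2,\ b_2\}
$$

Backpropagation supplies the partial derivatives $\partial L / \partial \theta$ that this update needs.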

In exercises 710 through 730, we calculated the partial derivatives for a single-layer neural network.

In this exercise, we compute partial derivatives for a multi-layer neural network.
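
For reference, the partial derivatives for the network used in this exercise (one ReLU hidden layer followed by a softmax with cross-entropy loss) follow from the chain rule. The notation is introduced here for illustration and assumes a row-vector convention: $x$ is the feature vector, $y$ the one-hot target, $W_1, b_1, W_2, b_2$ the parameters, $a$, $h$ and $c$ the intermediate values, and $p$ the softmax output.

$$
\begin{aligned}
a &= x W_1 + b_1, & h &= \operatorname{relu}(a), \\
c &= h W_2 + b_2, & p &= \operatorname{softmax}(c), \\
L &= -\sum_k y_k \log p_k
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial L}{\partial c} &= p - y, \\
\frac{\partial L}{\partial W_2} &= h^{\top} (p - y), & \frac{\partial L}{\partial b_2} &= p - y, \\
\frac{\partial L}{\partial h} &= (p - y)\, W_2^{\top}, \\
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial h} \odot \mathbb{1}[a > 0], \\
\frac{\partial L}{\partial W_1} &= x^{\top} \frac{\partial L}{\partial a}, & \frac{\partial L}{\partial b_1} &= \frac{\partial L}{\partial a}
\end{aligned}
$$

Here $\odot$ is the element-wise product and $\mathbb{1}[a > 0]$ is $1$ where the pre-activation is positive and $0$ elsewhere.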


In [1]:
import torch
import torch.nn.functional as F

weights1 = torch.nn.Parameter(torch.Tensor([[1,1],[2,2]]))
print("Parameters (weights) 1st layer: "+str(weights1.data))

weights2 = torch.nn.Parameter(torch.Tensor([[2, 1],[2,1]]))
print("Parameters (weights) 2nd layer: "+str(weights2.data))

bias1 = torch.nn.Parameter(torch.Tensor([[-1,-1]]))
print("Parameters (bias) 1st layer: "+str(bias1.data))

bias2 = torch.nn.Parameter(torch.Tensor([[5,5]]))
print("Parameters (bias) 2nd layer: "+str(bias2.data))

inputs = torch.autograd.Variable(torch.Tensor([[-10, 20]]))

target = torch.autograd.Variable(torch.LongTensor([1]))
#print(target)

one_hot_target = torch.autograd.Variable(torch.Tensor([[0, 1]]))
#print(one_hot_target)


Parameters (weights) 1st layer: 
 1  1
 2  2
[torch.FloatTensor of size 2x2]

Parameters (weights) 2nd layer: 
 2  1
 2  1
[torch.FloatTensor of size 2x2]

Parameters (bias) 1st layer: 
-1 -1
[torch.FloatTensor of size 1x2]

Parameters (bias) 2nd layer: 
 5  5
[torch.FloatTensor of size 1x2]



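At the start, we assume the weights are $\begin{bmatrix}1 & 1 \\ 2 & 2\end{bmatrix}$ for the first (lower) layer and $\begin{bmatrix}2 & 1 \\ 2 & 1\end{bmatrix}$ for the second (higher) layer, with biases $\begin{bmatrix}-1 & -1\end{bmatrix}$ and $\begin{bmatrix}5 & 5\end{bmatrix}$ respectively.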

The training is assumed to proceed one data point at a time (batch size of 1).

For this iteration of training, the features are $\begin{bmatrix}-10 & 20\end{bmatrix}$ and the correct category is $1$, which is one-hot encoded as $\begin{bmatrix}0 & 1\end{bmatrix}$.
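
The arithmetic for this data point is small enough to check by hand. The sketch below uses plain Python rather than PyTorch so that every multiplication is visible. The helper names `affine` and `forward` and the step size `eps` are illustrative choices, not part of the original notebook. The parameter values are the ones defined in the first cell.

```python
import math

# Values taken from the first cell of the notebook.
x = [-10.0, 20.0]                 # features
y = 1                             # index of the correct category
W1 = [[1.0, 1.0], [2.0, 2.0]]     # first-layer weights
b1 = [-1.0, -1.0]                 # first-layer biases
W2 = [[2.0, 1.0], [2.0, 1.0]]     # second-layer weights
b2 = [5.0, 5.0]                   # second-layer biases


def affine(v, W, b):
    """Return the row vector v @ W + b (hypothetical helper)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
            for j in range(len(b))]


def forward(W1, b1, W2, b2):
    """Run the forward pass and return (a, h, c, p, loss)."""
    a = affine(x, W1, b1)                  # pre-activation
    h = [max(0.0, v) for v in a]           # ReLU
    c = affine(h, W2, b2)                  # logits
    m = max(c)                             # subtract the max for stability
    log_z = m + math.log(sum(math.exp(v - m) for v in c))
    p = [math.exp(v - log_z) for v in c]   # softmax
    loss = log_z - c[y]                    # cross entropy for category y
    return a, h, c, p, loss


a, h, c, p, loss = forward(W1, b1, W2, b2)

# Backward pass via the chain rule.
dc = [p[j] - (1.0 if j == y else 0.0) for j in range(2)]          # dL/dc
dW2 = [[h[i] * dc[j] for j in range(2)] for i in range(2)]         # h^T dc
dh = [sum(dc[j] * W2[i][j] for j in range(2)) for i in range(2)]   # dc W2^T
da = [dh[i] if a[i] > 0 else 0.0 for i in range(2)]                # ReLU mask
dW1 = [[x[i] * da[j] for j in range(2)] for i in range(2)]         # x^T da

# Central finite-difference check on one entry of W1 (eps is arbitrary).
eps = 1e-3
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus[0][0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[4]
           - forward(W1_minus, b1, W2, b2)[4]) / (2 * eps)

print("a =", a)
print("c =", c)
print("loss =", loss)
print("dW2 =", dW2)
print("dW1 =", dW1)
print("dL/dW1[0][0]: analytic", dW1[0][0], "numeric", numeric)
```

The pre-activations, logits and loss printed here should agree with the values `29 29`, `121 63` and `58` in the notebook output, and the finite-difference estimate gives an independent check on one of the analytic gradients.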

In [2]:
# Forward Pass

hidden = torch.mm(inputs, weights1) + bias1
print("a = "+str(hidden))

hidden_relu = F.relu(hidden)
print("h = relu(a) = "+str(hidden_relu))

result = torch.mm(hidden_relu, weights2) + bias2
print("c = "+str(result))

softmax_result = F.softmax(result, dim=1)
print("softmax(c) = "+str(softmax_result))

loss = F.cross_entropy(result, target)
print("Cross entropy loss: "+str(loss))

# Backward Pass

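# dL/dc for softmax + cross entropy: softmax(c) - one-hot target
grad_softmax = (softmax_result.data - one_hot_target.data)
# dL/dh: propagate back through the second layer's weights
grad_hidden_activation = grad_softmax.mm(weights2.data.t())
# dL/da: ReLU passes the gradient only where the pre-activation is positive
grad_hidden_pre_activation = grad_hidden_activation.clone()
grad_hidden_pre_activation[hidden.data <= 0] = 0
# dL/dW2 uses the post-ReLU activations h, which are the inputs to the second layer
grad_weight2 = hidden_relu.data.t().mm(grad_softmax)
grad_bias2 = grad_softmax
# dL/dW1 and dL/db1 follow from the gradient at the pre-activation a
grad_weight1 = inputs.data.t().mm(grad_hidden_pre_activation)
grad_bias1 = grad_hidden_pre_activation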

print("\tThe manually computed gradient of the loss with respect to the weights of the second layer is "+str(grad_weight2))

print("\tThe manually computed gradient of the loss with respect to the weights of the first layer is "+str(grad_weight1))

print("\tThe manually computed gradient of the loss with respect to the biases of the second layer is "+str(grad_bias2))

print("\tThe manually computed gradient of the loss with respect to the biases of the first layer is "+str(grad_bias1))

# Autograd

loss.backward()

gradient = weights2.grad

print("\tThe automatically computed gradient of the loss with respect to the weights of the second layer is "+str(gradient.data))

gradient = weights1.grad

print("\tThe automatically computed gradient of the loss with respect to the weights of the first layer is "+str(gradient.data))

gradient = bias2.grad

print("\tThe automatically computed gradient of the loss with respect to the biases of the second layer is "+str(gradient.data))

gradient = bias1.grad

print("\tThe automatically computed gradient of the loss with respect to the biases of the first layer is "+str(gradient.data))

if weights1.grad is not None:
    weights1.grad.data.zero_()
if weights2.grad is not None:
    weights2.grad.data.zero_()
if bias1.grad is not None:
    bias1.grad.data.zero_()
if bias2.grad is not None:
    bias2.grad.data.zero_()

a = Variable containing:
 29  29
[torch.FloatTensor of size 1x2]

h = relu(a) = Variable containing:
 29  29
[torch.FloatTensor of size 1x2]

c = Variable containing:
 121   63
[torch.FloatTensor of size 1x2]

softmax(c) = Variable containing:
 1.0000e+00  6.4702e-26
[torch.FloatTensor of size 1x2]

Cross entropy loss: Variable containing:
 58
[torch.FloatTensor of size 1]

	The manually computed gradient of the loss with respect to the weights of the second layer is 
 29 -29
 29 -29
[torch.FloatTensor of size 2x2]

	The manually computed gradient of the loss with respect to the weights of the first layer is 
-10 -10
 20  20
[torch.FloatTensor of size 2x2]

	The manually computed gradient of the loss with respect to the biases of the second layer is 
 1 -1
[torch.FloatTensor of size 1x2]

	The manually computed gradient of the loss with respect to the biases of the first layer is 
 1  1
[torch.FloatTensor of size 1x2]

	The automatically computed gradient of the loss with respect to th