# Assignment 3a: Automatic Differentiation with Tensors

The purpose of this assignment is to help you understand automatic differentiation with Tensors. 

For the first part of the assignment, you will define a two-layer neural network following the equations in Section 5.3.1 of the textbook by Zhang et al.

For the second part of the assignment you will use PyTorch to verify the backpropagation equations in Section 5.3.2.

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
import numpy as np
from d2l import torch as d2l

Weight decay using the L2 norm

In [2]:
def l2_penalty(w):
    return (w ** 2).sum() / 2
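
With this definition, the gradient of the penalty with respect to `w` is `w` itself, since the derivative of $\frac{1}{2}\sum w^2$ is $w$. The following is a minimal sketch of that property; the tensor `w` is an arbitrary illustrative value, not part of the assignment.

```python
import torch

def l2_penalty(w):
    return (w ** 2).sum() / 2

# Illustrative weights (an assumption for this example only)
w = torch.tensor([[1.0, -2.0], [3.0, 0.5]], requires_grad=True)
l2_penalty(w).backward()

# d/dw (sum(w^2) / 2) = w
grad_matches = torch.allclose(w.grad, w.detach())
print(grad_matches)
```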

Model Parameters

In [3]:
num_inputs = 4
num_hiddens = 3
num_outputs = 2

Minibatch data

In [4]:
num_batch_size = 5

torch.manual_seed(0)
X = torch.randn(num_batch_size, num_inputs)
y = torch.randint(num_outputs, [num_batch_size])

## Define a simple two-layer network (TODO)

Follow the equations in Section 5.3.1 and define a simple two-layer network with cross-entropy loss and weight decay.

* Use `torch.randn()` to create two weight matrices.
* Use `torch.relu()` as the activation function.

In [5]:
class SimpleClassifier:
    def __init__(self, num_inputs, num_hiddens, num_outputs, lambda1=0.1):
        self.num_inputs = num_inputs
        self.num_hiddens = num_hiddens
        self.num_outputs = num_outputs
        self.lambda1 = lambda1

        # Create and initialize the weight matrices using random normals, torch.randn().
        torch.manual_seed(0)
        # Begin - insert your code
        sigma = 1.0
        
        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
        self.b2 = nn.Parameter(torch.zeros(num_outputs))

        self.cross_entropy = nn.CrossEntropyLoss()
        # End - insert your code
        

    def forward(self, X):

        # Follow the equations in section 5.3.1 to define the forward pass. Use torch.relu() as the activation function
        # Begin - insert your code
        batch_size = X.shape[0]
        X = X.reshape((-1, self.num_inputs))
        
        self.z = torch.matmul(X, self.W1) + self.b1
        self.h = torch.relu(self.z)
        self.o = torch.matmul(self.h, self.W2) + self.b2

        assert self.z.shape == (batch_size, self.num_hiddens)
        assert self.h.shape == (batch_size, self.num_hiddens)
        assert self.o.shape == (batch_size, self.num_outputs)
        # End - insert your code

        # Save intermediate gradient computations
        self.z.retain_grad()
        self.h.retain_grad()
        self.o.retain_grad()

        return self.o
        
    def loss(self, y):

        # Follow the equations in section 5.3.1 to define the cross entropy loss with l2 weight decay.
        # Use the provided l2_penalty function.
        # Begin - insert your code
        self.L = self.cross_entropy(self.o, y)
        self.s = self.lambda1 * (l2_penalty(self.W1) + l2_penalty(self.W2))
        self.J = self.L + self.s
        # End - insert your code

        # Save intermediate gradient computations
        self.L.retain_grad()
        self.s.retain_grad()

        return self.J

## Create an instance

Create an instance, run the forward pass, and calculate the loss.

In [6]:
dJ =  SimpleClassifier(num_inputs, num_hiddens, num_outputs)
dJ.forward(X)

tensor([[-4.1042,  1.7542],
        [-0.9505,  0.5568],
        [-2.1101,  1.3511],
        [-3.2251,  4.1435],
        [-2.5274,  3.2354]], grad_fn=<AddBackward0>)

In [7]:
dJ.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

## Do backward pass from variable J

In [8]:
dJ.J.backward()

Typically we are only interested in $\frac{\partial J}{\partial W1}$ and $\frac{\partial J}{\partial W2}$:

In [9]:
dJ.W1.grad

tensor([[ 1.0557, -0.1049, -0.4492],
        [ 1.1706, -0.0147, -0.3028],
        [ 1.1757,  0.5057, -0.1072],
        [-0.3526, -0.2323, -0.1262]])

In [10]:
dJ.W2.grad

tensor([[-1.4173,  1.4417],
        [-0.4229,  0.3281],
        [-0.8347,  0.8154]])

Since we invoked `retain_grad()` on all the intermediate variables, we can examine the partial derivatives of J with respect to the intermediate variables: $\frac{\partial J}{\partial L}$, $\frac{\partial J}{\partial s}$, $\frac{\partial J}{\partial o}$, $\frac{\partial J}{\partial h}$, and $\frac{\partial J}{\partial z}$.
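
By default, autograd keeps `.grad` only for leaf tensors, which is why `retain_grad()` is needed here. The following is a minimal sketch with illustrative tensors `x`, `a`, and `b` (names chosen for this example only): `a` and `b` are computed the same way, but only `b` retains its gradient.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
a = x * 3   # non-leaf; its gradient is not kept after backward()
b = x * 3   # non-leaf, but we ask autograd to keep its gradient
b.retain_grad()

(a.sum() + b.sum()).backward()

print(a.retains_grad, b.retains_grad)  # False True
print(b.grad)                          # tensor([1., 1.])
print(x.grad)                          # tensor([6., 6.])
```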

According to Equation 5.3.8 of Zhang et al., $\frac{\partial J}{\partial L} = 1$ and $\frac{\partial J}{\partial s} = 1$.

Let's check.

In [11]:
dJ.L.grad

tensor(1.)

In [12]:
dJ.s.grad

tensor(1.)

We can check the other intermediate variables as well:

In [13]:
dJ.o.grad

tensor([[-0.1994,  0.1994],
        [ 0.0363, -0.0363],
        [ 0.0061, -0.0061],
        [-0.1999,  0.1999],
        [-0.1994,  0.1994]])
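
One way to interpret these values: `nn.CrossEntropyLoss` averages over the minibatch by default, and for that mean-reduced loss the gradient with respect to the logits is $(\mathrm{softmax}(o) - \mathrm{onehot}(y)) / n$, where $n$ is the batch size. The following is a minimal standalone sketch; the random `logits` and `labels` are stand-ins for `dJ.o` and `y`, not the notebook's actual values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, num_classes = 5, 2
logits = torch.randn(n, num_classes, requires_grad=True)  # stand-in for o
labels = torch.randint(num_classes, [n])                  # stand-in for y

loss = F.cross_entropy(logits, labels)  # default reduction is "mean"
loss.backward()

# Closed-form gradient of the mean-reduced cross-entropy w.r.t. the logits
expected = (torch.softmax(logits.detach(), dim=1)
            - F.one_hot(labels, num_classes=num_classes)) / n
ce_grad_matches = torch.allclose(logits.grad, expected, atol=1e-6)
print(ce_grad_matches)
```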

In [14]:
dJ.h.grad

tensor([[ 0.3903,  0.2381,  0.1874],
        [-0.0710, -0.0433, -0.0341],
        [-0.0119, -0.0073, -0.0057],
        [ 0.3912,  0.2386,  0.1878],
        [ 0.3902,  0.2380,  0.1873]])

In [15]:
dJ.z.grad

tensor([[ 0.0000,  0.2381,  0.1874],
        [ 0.0000, -0.0433, -0.0341],
        [-0.0119, -0.0073,  0.0000],
        [ 0.3912,  0.0000,  0.0000],
        [ 0.3902,  0.2380,  0.0000]])
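
The zeros in `dJ.z.grad` appear where the pre-activation `z` is not positive: the derivative of ReLU is 1 for positive inputs and 0 otherwise, so $\frac{\partial J}{\partial z}$ is $\frac{\partial J}{\partial h}$ masked elementwise by $z > 0$. The following is a minimal standalone sketch, with random tensors standing in for `z` and for the upstream gradient $\frac{\partial J}{\partial h}$.

```python
import torch

torch.manual_seed(0)
z = torch.randn(5, 3, requires_grad=True)  # stand-in for the pre-activations
h = torch.relu(z)
upstream = torch.randn(5, 3)               # stand-in for dJ/dh

# Backpropagate the upstream gradient through ReLU
h.backward(gradient=upstream)

# Elementwise mask: gradient passes only where z > 0
expected = upstream * (z.detach() > 0).float()
relu_grad_matches = torch.allclose(z.grad, expected)
print(relu_grad_matches)
```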

## Verify equation 5.3.9

According to Equation 5.3.9, $\frac{\partial J}{\partial o} = \mathrm{prod}\left( \frac{\partial J}{\partial L}, \frac{\partial L}{\partial o} \right) = \frac{\partial L}{\partial o}$.

Create a new instance, and run backward from variable `L` instead of variable `J`.

In [16]:
dL =  SimpleClassifier(num_inputs, num_hiddens, num_outputs)
dL.forward(X)
dL.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

In [17]:
dL.L.backward()

In [18]:
# dJ/do (from the backward pass on J) should equal prod(dJ/dL, dL/do)
torch.isclose(dJ.o.grad, dJ.L.grad * dL.o.grad)

tensor([[True, True],
        [True, True],
        [True, True],
        [True, True],
        [True, True]])

## Verify equations 5.3.10 (TODO)

Create a new instance `ds`, and perform forward and backward pass from variable `s`.

In [19]:
# Begin - insert your code
lambda1 = 0.1
ds = SimpleClassifier(num_inputs, num_hiddens, num_outputs, lambda1)
ds.forward(X)
ds.loss(y)
ds.s.backward()
# End - insert your code

Use `torch.isclose` to verify the equations in 5.3.10.

In [20]:
# Begin - insert your code
torch.isclose(ds.W1.grad, lambda1 * dJ.W1)
# End - insert your code

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

In [21]:
# Begin - insert your code
torch.isclose(ds.W2.grad, lambda1 * dJ.W2)
# End - insert your code

tensor([[True, True],
        [True, True],
        [True, True]])

## Verify the right-hand side of equation 5.3.11 (TODO)

Verify the right-hand side of equation 5.3.11, i.e. verify

$$ \frac{\partial J}{\partial W2} = h^T\frac{\partial J}{\partial o} + \lambda W2$$


In [22]:
dJ.W2.grad

tensor([[-1.4173,  1.4417],
        [-0.4229,  0.3281],
        [-0.8347,  0.8154]])

In [23]:
# Begin - insert your code
lambda1 = 0.1
dJ_dW2 = torch.matmul(dJ.h.T, dJ.o.grad) + lambda1 * dJ.W2
# End - insert your code

In [24]:
torch.isclose(dJ.W2.grad, dJ_dW2)

tensor([[True, True],
        [True, True],
        [True, True]])

## Verify the middle part of equation 5.3.11 (TODO)

Verify


$$ \frac{\partial J}{\partial W2} = \mathrm{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial W2}\right) + 
\mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W2}\right)$$


The sum of the two terms on the right-hand side should be `dJ.W2.grad`.

In [25]:
dJ.W2.grad

tensor([[-1.4173,  1.4417],
        [-0.4229,  0.3281],
        [-0.8347,  0.8154]])

Let's focus on the second term $\mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W2}\right)$.

Create a new instance `ds`, run the forward pass, and compute the loss.

In [26]:
ds =  SimpleClassifier(num_inputs, num_hiddens, num_outputs)
ds.forward(X)
ds.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

Run backward from variable `s`.

In [27]:
ds.s.backward()

Compute  $\mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W2}\right)$.

In [28]:
# Begin - insert your code
term2 = dJ.s.grad * ds.W2.grad
# End - insert your code

Now focus on the first term of the equation, $\mathrm{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial W2}\right)$, by creating a new instance `do`, running the forward pass, and computing the loss.

In [29]:
do =  SimpleClassifier(num_inputs, num_hiddens, num_outputs)
do.forward(X)
do.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

In [30]:
dJ.o.shape, dJ.W2.shape

(torch.Size([5, 2]), torch.Size([3, 2]))

**Notice:** $W2$ is a second-order tensor and $o$ is a vector (if the minibatch size is 1), so the matrix derivative
$$\frac{\partial o}{\partial W2}$$
would be a third-order tensor.

PyTorch autograd does not explicitly return third-order matrix derivatives. This is because in deep learning we almost always immediately multiply these intermediate matrix derivatives by an upstream gradient. In this case, we compute
$\mathrm{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial W2}\right)$.
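
To make the third-order tensor concrete, the following sketch builds it explicitly for a single example using `torch.autograd.functional.jacobian`, then contracts it with an upstream gradient. The tensors `h`, `W2`, and `upstream` are random stand-ins, and the bias is omitted for brevity; both are assumptions of this example.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
h = torch.randn(3)          # one hidden vector (minibatch size 1)
W2 = torch.randn(3, 2)      # second-layer weights
upstream = torch.randn(2)   # stand-in for dJ/do

# Full derivative do/dW2: output shape (2,) + input shape (3, 2)
jac = jacobian(lambda W: h @ W, W2)
print(jac.shape)            # torch.Size([2, 3, 2])

# prod(dJ/do, do/dW2): contract the upstream gradient with the Jacobian
vjp = torch.einsum("k,kij->ij", upstream, jac)

# The contraction collapses to an outer product of h and dJ/do
expected = torch.outer(h, upstream)
vjp_matches = torch.allclose(vjp, expected)
print(vjp_matches)
```

The contraction has the same shape as `W2`, which is the quantity `backward` accumulates into `.grad` without ever materializing `jac`.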

Read the following links to understand the `gradient` parameter of the `backward` method.

* [Optional Reading - Vector Calculus using autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#optional-reading-vector-calculus-using-autograd)
* [torch.Tensor.backward](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html)
* [The “gradient” argument in Pytorch’s “backward” function — explained by examples](https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29)
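
In short, calling `backward(gradient=v)` on a non-scalar tensor computes a vector-Jacobian product: `v` is contracted with the Jacobian of that tensor with respect to each leaf. The following is a minimal sketch on an elementwise function; the tensors `x` and `v` are illustrative values, not part of the assignment.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                            # Jacobian dy/dx is diag(2 * x)
v = torch.tensor([0.1, 1.0, 10.0])    # plays the role of an upstream gradient

# y is not a scalar, so backward() needs the `gradient` argument
y.backward(gradient=v)

# v^T J = v * 2x for this diagonal Jacobian
expected = v * 2 * x.detach()
backward_matches = torch.allclose(x.grad, expected)
print(backward_matches)
```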

Call the `backward` method of variable `do.o` with the appropriate `gradient` argument.

In [31]:
# Begin - insert your code
do.o.backward(gradient=dJ.o.grad)
# End - insert your code

In [32]:
# Begin - insert your code
term1 = do.W2.grad
# End - insert your code

In [33]:
torch.isclose(dJ.W2.grad, term1 + term2)

tensor([[True, True],
        [True, True],
        [True, True]])

## Verify the right-hand side of Equation 5.3.12  (TODO)

Verify

$$ \frac{\partial J}{\partial h} = \frac{\partial J}{\partial o} W2^T$$


In [34]:
# Begin - insert your code
prod = torch.matmul(dJ.o.grad, dJ.W2.T)
# End - insert your code

In [35]:
torch.isclose(dJ.h.grad, prod)

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

## Verify the middle part of Equation 5.3.12 (TODO)

Verify

$$ \frac{\partial J}{\partial h} = \mathrm{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial h}\right) $$


In [36]:
# Begin - insert your code
# do.h.grad = dJ/do * do/dh
torch.isclose(dJ.h.grad, do.h.grad)
# End - insert your code

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

## Verify the middle part of Equation 5.3.13 (TODO)

Verify

$$ \frac{\partial J}{\partial z} = \mathrm{prod}\left(\frac{\partial J}{\partial h}, \frac{\partial h}{\partial z}\right) $$


In [37]:
dh = SimpleClassifier(num_inputs, num_hiddens, num_outputs)
dh.forward(X)
dh.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

Call the `backward` method of `dh.h` with the appropriate `gradient` argument.

In [38]:
# Begin - insert your code
dh.h.backward(dJ.h.grad)
# End - insert your code

Verify by using the `torch.isclose` function.

In [39]:
# Begin - insert your code
# dh.z.grad = dJ/dh * dh/dz
torch.isclose(dJ.z.grad, dh.z.grad)
# End - insert your code

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

## Verify the middle part of Equation 5.3.14 (TODO)

Verify


$$ \frac{\partial J}{\partial W1} = \mathrm{prod}\left(\frac{\partial J}{\partial z}, \frac{\partial z}{\partial W1}\right) + 
\mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W1}\right)$$



In [40]:
dz = SimpleClassifier(num_inputs, num_hiddens, num_outputs)
dz.forward(X)
dz.loss(y)

tensor(4.6537, grad_fn=<AddBackward0>)

Call the `backward` method of `dz.z` with the appropriate `gradient` argument.

In [41]:
# Begin - insert your code
# Backpropagate from z, seeding with dJ/dz from the full backward pass
dz.z.backward(gradient=dJ.z.grad)
# End - insert your code

Verify by using the `torch.isclose` function.

In [42]:
# Begin - insert your code
# dz.W1.grad = dJ/dz * dz/dW1
torch.isclose(dJ.W1.grad, dz.W1.grad + dJ.s.grad * ds.W1.grad)
# End - insert your code

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])
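
For comparison, the first term can also be written in closed form. In the row-per-example layout used in this notebook, $\mathrm{prod}\left(\frac{\partial J}{\partial z}, \frac{\partial z}{\partial W1}\right)$ should reduce to $X^T \frac{\partial J}{\partial z}$. The following is a minimal standalone sketch under stated assumptions: the linear map has no bias, a random tensor stands in for $\frac{\partial J}{\partial z}$, and `torch.autograd.grad` with `grad_outputs` is used in place of `backward`.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 4)                        # minibatch of inputs
W1 = torch.randn(4, 3, requires_grad=True)   # first-layer weights
z = X @ W1                                   # pre-activations (bias omitted)
upstream = torch.randn(5, 3)                 # stand-in for dJ/dz

# Vector-Jacobian product: prod(dJ/dz, dz/dW1)
(vjp_W1,) = torch.autograd.grad(z, W1, grad_outputs=upstream)

# Closed form in the row-per-example layout
expected = X.T @ upstream
w1_matches = torch.allclose(vjp_W1, expected, atol=1e-6)
print(w1_matches)
```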