In [8]:
import torch as t
import numpy as np

Let's say $z$ is a scalar-valued function of $\mathbf x$ and $\mathbf w$, which is the most common setup for loss functions in ML. Here $^{\circ 2}$ denotes the element-wise square.

$$
z = 3 \mathbf x^T (\mathbf w + 2)^{\circ 2} \\
$$

If $\mathbf x$ is a 3-element vector -
$$
\mathbf x = \begin{bmatrix}
x_1 \\ 
x_2 \\ 
x_3
\end{bmatrix} \\
$$

and correspondingly $\mathbf w$ is also a 3-element vector -
$$
\mathbf w = \begin{bmatrix}
w_1 \\
w_2 \\
w_3 \\
\end{bmatrix} \\
$$

Then $z$ can be written as -
$$
\begin{align}
z &= 3 \cdot \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \cdot \begin{bmatrix}
(w_1 + 2)^2 \\
(w_2 + 2)^2 \\
(w_3 + 2)^2 \\
\end{bmatrix} \\
&= 3x_1(w_1 + 2)^2 + 3x_2(w_2 + 2)^2 + 3x_3(w_3 + 2)^2 \\
\end{align}
$$

And the Jacobian of $z$ with respect to $\mathbf w$ is a row vector -
$$
\begin{align}
\nabla_{\mathbf w} z &= \begin{bmatrix} \frac{\partial z}{\partial w_1} & \frac{\partial z}{\partial w_2} & \frac{\partial z}{\partial w_3} \end{bmatrix} \\
&= \begin{bmatrix} 6x_1(w_1 + 2) & 6x_2(w_2 + 2) & 6x_3(w_3 + 2) \end{bmatrix} \\
\end{align}
$$

This Jacobian is the gradient of $z$ with respect to $\mathbf w$, colloquially known as just the gradient of $\mathbf w$. Keeping $\mathbf x$ constant at $\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \end{bmatrix}$ the gradient becomes -
$$
\nabla_{\mathbf w} z = \begin{bmatrix} 3(w_1 + 2) & 3(w_2 + 2) & 3(w_3 + 2) \\ \end{bmatrix}
$$

Now for different values of $\mathbf w$, we will have different values of the gradient. E.g., when $\mathbf w = \begin{bmatrix}1 \\ 1 \\ 1 \end{bmatrix}$, $z$ evaluates to -

$$
\begin{align}
z &= 3 \times 0.5 \times (1 + 2)^2 + 3 \times 0.5 \times (1 + 2)^2 + 3 \times 0.5 \times (1 + 2)^2 \\
&= 13.5 + 13.5 + 13.5 \\
&= 40.5
\end{align}
$$

And the gradient of $\mathbf w$ becomes -
$$
\begin{align}
\nabla_{\mathbf w} z &= \begin{bmatrix} 3(1 + 2) & 3(1 + 2) & 3(1 + 2) \end{bmatrix} \\
&= \begin{bmatrix} 9 & 9 & 9 \\ \end{bmatrix}
\end{align}
$$
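
The hand-derived values can be sanity-checked numerically before bringing in autograd. The following is a minimal sketch in NumPy, assuming a central-difference estimate is good enough for this quadratic function; the helper names `z_fn`, `analytic_grad`, and `numeric_grad` are illustrative and not part of any library.

```python
import numpy as np

def z_fn(x, w):
    """z = 3 * x^T (w + 2)^2, with the square applied element-wise."""
    return 3.0 * np.dot(x, (w + 2.0) ** 2)

def analytic_grad(x, w):
    """Gradient derived by hand: dz/dw_i = 6 * x_i * (w_i + 2)."""
    return 6.0 * x * (w + 2.0)

def numeric_grad(x, w, eps=1e-6):
    """Central-difference estimate of dz/dw, one component at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (z_fn(x, w + step) - z_fn(x, w - step)) / (2 * eps)
    return grad

x_np = np.full(3, 0.5)
w_np = np.ones(3)

print("z             =", z_fn(x_np, w_np))
print("analytic grad =", analytic_grad(x_np, w_np))
print("numeric grad  =", numeric_grad(x_np, w_np))
```

If the derivation is right, the two gradient printouts should agree up to floating-point noise.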

By setting the `requires_grad` flag on `w` we are telling PyTorch that this is the variable we'll differentiate with respect to, so it will have a gradient. `x` does not set the flag, so it will not have a gradient.

In [12]:
x = t.full((3,), 0.5)
w = t.ones(3, requires_grad=True)
w2 = (w + 2) ** 2
z = 3 * t.dot(x, w2)
z

tensor(40.5000, grad_fn=<MulBackward0>)

Calling `Tensor.backward()` runs backpropagation on this compute graph, which calculates the gradients for the leaves of the graph that require them and stores each one alongside its tensor in the `.grad` attribute.

In [19]:
z.backward()
w.grad

tensor([9., 9., 9.])

In [34]:
x.grad is None

True

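If I try to backprop through this compute graph again it'll fail, because the saved intermediate values are freed after the first `backward` call. I need to rebuild the graph by rerunning the forward pass (here with a new $\mathbf w$) and then I can run `backward` again.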

In [23]:
try:
    z.backward()
except RuntimeError as err:
    print(err)

Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.


In [26]:
w = t.full((3,), 2., requires_grad=True)
w2 = (w + 2) ** 2
z = 3 * t.dot(x, w2)
print("z = ", z)
z.backward()
print("grad w = ", w.grad)

z =  tensor(72., grad_fn=<MulBackward0>)
grad w =  tensor([12., 12., 12.])


If for some reason I do need to backprop through the same graph multiple times, the gradients accumulate on the leaf nodes. I can circumvent the above error by setting the `retain_graph` flag. Normally this is not needed, and according to the [documentation](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html#torch-tensor-backward) I should avoid setting this flag. But for demonstrating this concept I'll use it.

In [31]:
w = t.ones((3,), requires_grad=True)
w2 = (w + 2) ** 2
z = 3 * t.dot(x, w2)
print("z = ", z)
z.backward(retain_graph=True)
print("grad w = ", w.grad)
z.backward(retain_graph=True)
print("grad w = ", w.grad)

z =  tensor(40.5000, grad_fn=<MulBackward0>)
grad w =  tensor([9., 9., 9.])
grad w =  tensor([18., 18., 18.])
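
The error message earlier also mentions `autograd.grad()`. As a hedged aside, the sketch below uses `torch.autograd.grad`, which returns the gradients as a tuple instead of accumulating them into `.grad`, so repeated calls give the same value each time. The variable names are illustrative, and `retain_graph=True` is still needed on the first call because the graph is traversed twice.

```python
import torch

x_const = torch.full((3,), 0.5)
w_leaf = torch.ones(3, requires_grad=True)
z_val = 3 * torch.dot(x_const, (w_leaf + 2) ** 2)

# autograd.grad returns a tuple with one gradient per requested input.
(grad_first,) = torch.autograd.grad(z_val, w_leaf, retain_graph=True)
(grad_second,) = torch.autograd.grad(z_val, w_leaf)

print("first call  =", grad_first)
print("second call =", grad_second)
# Nothing was accumulated on the leaf itself.
print("w_leaf.grad is None:", w_leaf.grad is None)
```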


The gradient is just another tensor that can be changed to anything I want.

In [32]:
print("grad w so far = ", w.grad)
w.grad = t.tensor([1., 2., 3.])  # assign directly; going through .data is discouraged
print("grad w after I manually changed it = ", w.grad)
z.backward(retain_graph=True)
print("grad w after backprop = ", w.grad)

grad w so far =  tensor([18., 18., 18.])
grad w after I manually changed it =  tensor([1., 2., 3.])
grad w after backprop =  tensor([10., 11., 12.])


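Unlike the example above, when the output is a vector, `backward` needs to be given the initial gradient to start from: a tensor of the same shape as the output, passed as the `gradient` argument. For a scalar output PyTorch implicitly starts from 1. I think this is because the implementation is the same as when starting the backprop from somewhere in the middle of the compute graph, where some upstream gradient has to flow in. Here the function is applied element-wise: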

$$
z = 3(x+2)^2
$$

$$
\frac{dz}{dx} = 6(x+2)
$$

$$x = 0, \quad \frac{dz}{dx} = 12$$

$$x = 1, \quad \frac{dz}{dx} = 18$$

$$x = 2, \quad \frac{dz}{dx} = 24$$

In [2]:
import torch  # the remaining cells use the full module name instead of the `t` alias

# x = torch.zeros(4, requires_grad=True)
x = torch.ones(4, requires_grad=True)
# x = torch.tensor([2.0, 2.0, 2.0, 2.0], requires_grad=True)
y = x + 2
y2 = y**2
z = 3*y2
print("x = ", x)
print("z = ", z)

x =  tensor([1., 1., 1., 1.], requires_grad=True)
z =  tensor([27., 27., 27., 27.], grad_fn=<MulBackward0>)


In [3]:
dz_dz = torch.ones_like(z)
z.backward(dz_dz, retain_graph=True)

In [4]:
print(x)
print(x.grad)

tensor([1., 1., 1., 1.], requires_grad=True)
tensor([18., 18., 18., 18.])
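
An all-ones `dz_dz` hides what the argument actually does. A minimal sketch, using a hypothetical non-uniform upstream gradient `v` and illustrative variable names, suggests that `backward(v)` computes a vector-Jacobian product, which is the same gradient as backpropagating from the scalar $\sum_i v_i z_i$:

```python
import torch

# Without a gradient argument, backward() on a vector output is rejected.
x_try = torch.ones(4, requires_grad=True)
z_try = 3 * (x_try + 2) ** 2
try:
    z_try.backward()
    raised = False
except RuntimeError:
    raised = True
print("backward() without an argument raised:", raised)

# Hypothetical upstream gradient: one weight per element of z.
v = torch.tensor([1.0, 2.0, 3.0, 4.0])

x_vec = torch.ones(4, requires_grad=True)
z_vec = 3 * (x_vec + 2) ** 2  # element-wise, so z_i depends only on x_i
z_vec.backward(v)
vjp_grad = x_vec.grad.clone()

# Reference: reduce to a scalar with the same weights, then backprop.
x_ref = torch.ones(4, requires_grad=True)
weighted_sum = (v * 3 * (x_ref + 2) ** 2).sum()
weighted_sum.backward()

print("backward(v)       =", vjp_grad)
print("weighted-sum grad =", x_ref.grad)
```

Each entry of the result is $v_i \cdot 6(x_i + 2)$, so with $x_i = 1$ the gradient is just `v` scaled by 18.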


In [5]:
# Every time I backpropagate, the gradients are accumulated
# i.e., dz_dx := dz_dx(old) + dz_dx(new)
z.backward(dz_dz, retain_graph=True)
print(x.grad)

z.backward(dz_dz, retain_graph=True)
print(x.grad)

tensor([36., 36., 36., 36.])
tensor([54., 54., 54., 54.])


In [6]:
# The grads themselves are fully mutable, so I can reset their values to 0 (say)
x.grad = torch.zeros_like(x)
x.grad

tensor([0., 0., 0., 0.])

In [7]:
z.backward(dz_dz, retain_graph=True)
print(x.grad)

tensor([18., 18., 18., 18.])
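
Putting the pieces together, the usual way to avoid both `retain_graph` and unwanted accumulation is to rerun the forward pass and clear the gradient before each backward pass. This is a small sketch under that assumption; `x_leaf`, `z_out`, and `grads_seen` are illustrative names, and setting `.grad` to `None` is one of several ways to reset it.

```python
import torch

x_leaf = torch.ones(4, requires_grad=True)

grads_seen = []
for _ in range(3):
    # Drop the stale gradient so this pass does not accumulate onto it.
    x_leaf.grad = None
    # A fresh forward pass builds a new graph, so retain_graph is not needed.
    z_out = 3 * (x_leaf + 2) ** 2
    z_out.backward(torch.ones_like(z_out))
    grads_seen.append(x_leaf.grad.clone())

for step, grad in enumerate(grads_seen):
    print("pass", step, "grad =", grad)
```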
