
<center><font size=5 face="Helvetica" color=#EE4B2B><b>
PyTorch Tutorial: Automatic Differentiation with torch.autograd
</b></font></center>

<center><font face="Helvetica" size=3><b>Ang Chen</b></font></center>
<center><font face="Helvetica" size=3>July, 2024</font></center>

***

When training neural networks, the most frequently used algorithm is **back propagation**.
In this algorithm, parameters (model weights) are adjusted according to the **gradient** of the loss
function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called $\texttt{torch.autograd}$.
It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input $\texttt{x}$, parameters $\texttt{w}$ and $\texttt{b}$, and some loss function.
It can be defined in PyTorch in the following manner:

In [1]:
import torch

x = torch.ones(5) # Input tensor
y = torch.zeros(3) # Expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b

print('x:', x)
print('y:', y)
print('w:', w)
print('b:', b)
print('z:', z)

x: tensor([1., 1., 1., 1., 1.])
y: tensor([0., 0., 0.])
w: tensor([[ 1.2457,  2.2194,  0.1104],
        [ 0.1583,  1.2008, -0.3096],
        [ 0.4335,  0.6922, -0.3115],
        [ 0.0470, -0.9832,  0.2616],
        [ 0.0829,  0.3779, -1.0898]], requires_grad=True)
b: tensor([0.5424, 1.3166, 0.1436], requires_grad=True)
z: tensor([ 2.5098,  4.8237, -1.1953], grad_fn=<AddBackward0>)


In [2]:
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
print('loss:', loss)

loss: tensor(2.5613, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
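
As a sanity check on what $\texttt{binary\_cross\_entropy\_with\_logits}$ computes, the sketch below recomputes the loss by hand. It assumes the default mean reduction and uses its own seed and illustrative names ($\texttt{z\_demo}$, $\texttt{y\_demo}$), so the numbers differ from the cell above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)      # arbitrary seed, only to make this sketch reproducible
z_demo = torch.randn(3)   # stand-in logits (illustrative name)
y_demo = torch.zeros(3)   # same targets as in the tutorial

builtin = F.binary_cross_entropy_with_logits(z_demo, y_demo)

# Manual version: apply the sigmoid, then average the per-element cross-entropy.
p = torch.sigmoid(z_demo)
manual = -(y_demo * torch.log(p) + (1 - y_demo) * torch.log(1 - p)).mean()

print(torch.allclose(builtin, manual, atol=1e-6))
```

The built-in function fuses the sigmoid and the logarithm, which is generally more numerically stable than the two-step version for logits of large magnitude.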


# Tensors, Functions and Computational Graphs

This code defines the following **computational graph**:

<img src="./figures/computational_graph.png" alt="Computational graph" width="800">

In this network, $w$ and $b$ are **parameters**, which we need to optimize.
Thus, we need to be able to compute the gradients of the loss function with respect to those variables.
To do that, we set the $\texttt{requires\_grad}$ property of those tensors to $\texttt{True}$.

A function that we apply to tensors to construct a computational graph is in fact an object of class $\texttt{Function}$.
This object knows how to compute the function in the forward direction, and also how to compute its derivative during the *backward propagation* step.
A reference to the backward propagation function is stored in the $\texttt{grad\_fn}$ property of a tensor.
You can find more information on $\texttt{Function}$ in the documentation.

In [3]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x105962d90>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x105962280>
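
Each $\texttt{grad\_fn}$ node also links to the nodes that feed it through its $\texttt{next\_functions}$ attribute, so the backward graph can be walked by hand. The helper $\texttt{describe}$ below is a hypothetical name introduced for this sketch, and the exact node names it prints may vary between PyTorch versions.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

def describe(fn, depth=0):
    """Return the backward graph rooted at `fn` as a list of indented lines."""
    if fn is None:  # inputs that do not require gradients have no node
        return []
    lines = ["  " * depth + type(fn).__name__]
    for next_fn, _ in fn.next_functions:
        lines.extend(describe(next_fn, depth + 1))
    return lines

graph_lines = describe(z.grad_fn)
print("\n".join(graph_lines))
```

The walk ends at $\texttt{AccumulateGrad}$ nodes, which are the points where gradients are written into the $\texttt{.grad}$ attribute of leaf tensors such as $\texttt{w}$ and $\texttt{b}$.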


# Computing Gradients

To optimize the weights of the neural network, we need to compute the derivatives of our loss function with respect to the parameters; namely, we need $\frac{\partial\text{loss}}{\partial w}$ and $\frac{\partial\text{loss}}{\partial b}$ for some fixed values of $x$ and $y$.
To compute those derivatives, we call $\texttt{loss.backward()}$, and then retrieve the values from $\texttt{w.grad}$ and $\texttt{b.grad}$:

In [4]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.3083, 0.3307, 0.0774],
        [0.3083, 0.3307, 0.0774],
        [0.3083, 0.3307, 0.0774],
        [0.3083, 0.3307, 0.0774],
        [0.3083, 0.3307, 0.0774]])
tensor([0.3083, 0.3307, 0.0774])
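
The values above can be cross-checked against a closed form. For the mean binary cross-entropy with logits, $\frac{\partial\text{loss}}{\partial z_j} = (\sigma(z_j) - y_j)/3$, and since $z = xw + b$, the chain rule gives $\frac{\partial\text{loss}}{\partial w_{ij}} = x_i \frac{\partial\text{loss}}{\partial z_j}$ and $\frac{\partial\text{loss}}{\partial b_j} = \frac{\partial\text{loss}}{\partial z_j}$. The sketch below rebuilds the network with its own seed (so the numbers differ from the output above) and compares both sides; the names $\texttt{dz}$, $\texttt{expected\_w\_grad}$ and $\texttt{expected\_b\_grad}$ are introduced only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed for reproducibility of this sketch
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Closed-form gradient of the mean loss with respect to the logits z
dz = (torch.sigmoid(z.detach()) - y) / z.numel()
expected_w_grad = torch.outer(x, dz)  # d loss / d w[i, j] = x[i] * dz[j]
expected_b_grad = dz                  # d loss / d b[j]    = dz[j]

print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
```

Because every entry of $x$ is 1, each row of $\texttt{w.grad}$ equals $\texttt{b.grad}$, which is the pattern visible in the printed output.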


# Disabling Gradient Tracking

By default, all tensors with $\texttt{requires\_grad=True}$ track their computational history and support gradient computation.
However, there are cases where we do not need this, for example, when we have trained the model and just want to apply it to some input data, i.e., we only want to do *forward* computations through the network.
We can stop tracking computations by surrounding our computation code with a $\texttt{torch.no\_grad()}$ block:

In [5]:
z = torch.matmul(x, w) + b
z.requires_grad

True

In [6]:
with torch.no_grad():
    z = torch.matmul(x, w) + b
z.requires_grad

False

Another way to achieve the same result is to use the $\texttt{detach()}$ method on the tensor:

In [7]:
z = torch.matmul(x, w) + b
print(z.requires_grad)
z_det = z.detach()
z_det.requires_grad

True


False

There are several reasons you might want to disable gradient tracking:
 * To mark some parameters in your neural network as **frozen parameters**.
 * To **speed up computations** when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
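
The first use case can be sketched as follows. The two-layer model is a hypothetical stand-in, and freezing is done here with $\texttt{requires\_grad\_(False)}$ on the parameters of the first layer; other approaches exist.

```python
import torch
from torch import nn

# Hypothetical model, used only to illustrate freezing
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer: its weight and bias stop receiving gradients.
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.ones(2, 5))
out.sum().backward()

print(model[0].weight.grad)              # stays None because the layer is frozen
print(model[2].weight.grad is not None)  # the last layer still gets a gradient
```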

# More on Computational Graphs

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects.
In this DAG, leaves are the input tensors, roots are the output tensors.
By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:
 * runs the requested operation to compute a resulting tensor,
 * maintains the operation’s gradient function in the DAG.

The backward pass kicks off when $\texttt{.backward()}$ is called on the DAG’s root.
$\texttt{autograd}$ then:
 * computes the gradients from each $\texttt{.grad\_fn}$,
 * accumulates them in the respective tensors’ $\texttt{.grad}$ attribute,
 * using the chain rule, propagates all the way to the leaf tensors.
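
The distinction between leaf and non-leaf tensors can be inspected directly. In the sketch below, $\texttt{is\_leaf}$ reports whether a tensor is a leaf, and $\texttt{retain\_grad()}$ asks autograd to also keep the gradient of an intermediate tensor, which is otherwise not stored. The quadratic loss is an arbitrary choice made for illustration.

```python
import torch

w = torch.randn(5, 3, requires_grad=True)  # created directly by the user: a leaf
x = torch.ones(5)
z = torch.matmul(x, w)                      # result of an operation: not a leaf
z.retain_grad()                             # opt in to keeping z.grad after backward
loss = z.pow(2).sum()                       # arbitrary scalar loss for illustration
loss.backward()

print(w.is_leaf, z.is_leaf)
print(w.grad.shape)
print(torch.allclose(z.grad, 2 * z.detach()))  # d(sum z^2)/dz = 2z
```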

# Optional Reading: Tensor Gradients and Jacobian Products

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters.
However, there are cases when the output function is an arbitrary tensor.
In this case, PyTorch allows you to compute a so-called **Jacobian product**, and not the actual gradient.

For a vector function $\vec{y}=f(\vec{x})$, where $\vec{x}=\langle x_1, \ldots, x_n\rangle$ and $\vec{y}=\langle y_1, \ldots, y_m\rangle$, the gradient of $\vec{y}$ with respect to $\vec{x}$ is given by the **Jacobian matrix**:
$$
\begin{equation*}
J = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}.
\end{equation*}
$$
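
For a small function, the full matrix can be materialized with $\texttt{torch.autograd.functional.jacobian}$. The sketch below uses a hypothetical map $f:\mathbb{R}^3\to\mathbb{R}^2$ chosen only for illustration and compares the result with the derivatives worked out by hand.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Hypothetical map from R^3 to R^2, chosen only for illustration
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)

# Hand-derived Jacobian: one row per output, one column per input
J_expected = torch.tensor([[2.0, 1.0, 0.0],
                           [0.0, 1.0, 6.0]])

print(J.shape)
print(torch.allclose(J, J_expected))
```
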
Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the **Jacobian product** $v^\text{T}\cdot J$ for a given input vector
$v=(v_1, \ldots, v_m)$.
This is achieved by calling $\texttt{backward}$ with $v$ as an argument.
The size of $v$ should be the same as the size of the tensor on which $\texttt{backward}$ is called (here, $\texttt{out}$):

In [18]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
print(inp.grad)
out.backward(torch.ones_like(out), retain_graph=True)
print(inp.grad)
out.backward(torch.ones_like(out), retain_graph=True)
print(inp.grad)
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
inp.grad

None
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])


tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])
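
As a cross-check on the first result printed above, the sketch below compares the product computed by autograd with an explicit contraction of $v$ against the full Jacobian. It uses $\texttt{torch.autograd.grad}$, which returns the product directly and leaves $\texttt{inp.grad}$ untouched; the wrapper $\texttt{f}$ and the names $\texttt{vjp}$ and $\texttt{explicit}$ are introduced only for this illustration.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Same computation as in the cell above
    return (t + 1).pow(2).t()

inp = torch.eye(4, 5, requires_grad=True)
out = f(inp)
v = torch.ones_like(out)

# Product v^T J computed by autograd, returned instead of accumulated
(vjp,) = torch.autograd.grad(out, inp, grad_outputs=v)

# Full Jacobian: its shape is out.shape + inp.shape = (5, 4, 4, 5)
J = jacobian(f, inp.detach())
explicit = torch.einsum("ij,ijkl->kl", v, J)

print(J.shape)
print(torch.allclose(vjp, explicit))
print(torch.allclose(vjp, 2 * (inp.detach() + 1)))  # analytic derivative of (t + 1)^2
```

The full Jacobian here already has 400 entries for a 20-element input, which suggests why the product form is the practical choice for larger tensors.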

Note that when we call $\texttt{backward}$ for the second time with the same argument, the value of the gradient is different.
This happens because when doing $\texttt{backward}$ propagation, PyTorch **accumulates the gradients**, i.e., the value of the computed gradient is added to the $\texttt{grad}$ attribute of all leaf nodes of the computational graph.
If you want to compute the proper gradients, you need to zero out the $\texttt{grad}$ attribute beforehand.
In real-life training, an optimizer helps us do this.
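
A minimal sketch of that pattern, assuming plain SGD on the one-layer network from the beginning of this tutorial; the learning rate, the seed and the number of steps are arbitrary choices made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed for reproducibility of this sketch
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.1)  # learning rate chosen arbitrarily

losses = []
for step in range(3):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    z = torch.matmul(x, w) + b
    loss = F.binary_cross_entropy_with_logits(z, y)
    loss.backward()        # fill w.grad and b.grad for this step only
    optimizer.step()       # update w and b in place
    losses.append(loss.item())

print(losses)
```

Calling $\texttt{optimizer.zero\_grad()}$ at the start of every iteration keeps each update based on the gradient of the current step only, rather than on a running sum over all previous steps.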