# [PyTorch - Learning the Basics](https://pytorch.org/tutorials/beginner/basics/intro.html)

In part five we'll cover automatic differentiation.

## Automatic differentiation with `torch.autograd`

**Back propagation** is the most frequently used algorithm when training neural networks. In back propagation, parameters, or model weights, are adjusted according to the **gradient** of the loss function with respect to the given parameter. Here's a quick overview of some terms for clarification:

- The **loss function** is a formula that measures **how bad the model's prediction is** compared to the actual target. It's essentially a score that we want to be low, so training aims to minimize it.
- The **gradient** is the **slope of the loss function** with respect to the model's parameters (weights and biases). We can use it to learn **how to change the parameters** to reduce the loss. If the gradient is positive, we want to decrease the weight and if it's negative, we want to increase it. It's calculated using **calculus** (specifically, partial derivatives).
- **Back propagation** is the algorithm used to **efficiently compute all gradients** of the loss with respect to every weight in the network. In it, we do a **forward pass**, which is used for computing predictions and loss, and a **backward pass**, where we apply the chain rule to propagate gradients from output to input. The gradients returned from back propagation are used to **update the weights**, typically via gradient descent.
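
The three terms above can be seen working together in a minimal sketch of gradient descent on a one-parameter model. Everything here is an illustrative assumption (the model `w * x`, the squared-error loss, the learning rate, and the helper names); the derivative is worked out by hand rather than by back propagation.

```python
# Minimal gradient descent on a one-parameter model: prediction = w_demo * x.
# All names and values are illustrative assumptions, not part of the tutorial.

def squared_error(w_value, x_value, y_value):
    """Loss: squared difference between the prediction and the target."""
    return (w_value * x_value - y_value) ** 2

def squared_error_grad(w_value, x_value, y_value):
    """Derivative of the loss with respect to the weight, derived by hand."""
    return 2 * x_value * (w_value * x_value - y_value)

x_demo, y_demo = 2.0, 6.0   # one training example; the ideal weight is 3.0
w_demo = 0.0                # initial guess
learning_rate = 0.1

loss_history = []
for step in range(20):
    slope = squared_error_grad(w_demo, x_demo, y_demo)  # gradient at the current weight
    w_demo = w_demo - learning_rate * slope             # step against the gradient
    loss_history.append(squared_error(w_demo, x_demo, y_demo))

print(f"final weight = {w_demo:.4f}, final loss = {loss_history[-1]:.6f}")
```

Because the initial weight is too small, the gradient is negative and each update increases the weight, which matches the sign rule in the list above.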

In PyTorch, we use the built-in differentiation engine `torch.autograd` to compute the gradients. It supports automatic computation of gradients for any computational graph.
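
As a first taste, here is a minimal sketch that lets `torch.autograd` differentiate a small scalar expression. The expression and the name `t` are arbitrary choices made for illustration.

```python
import torch

t = torch.tensor(2.0, requires_grad=True)  # illustrative scalar input
expr = t ** 3 + 2 * t                       # any differentiable expression

expr.backward()                             # autograd computes d(expr)/dt
print(t.grad)                               # analytic value: 3 * t**2 + 2 = 14
```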

As an example, let's consider the simplest one-layer neural network, with input `x`, parameters `w` and `b`, and some loss function. It can be defined in PyTorch like so:

In [1]:
import torch

x = torch.ones(5)       # input tensor
y = torch.zeros(3)      # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

### Tensors, functions, and computational graph

The code above defines the following **computational graph**:

```mermaid
flowchart LR
 subgraph s1["Parameters"]
        n7["w"]
        n8["b"]
  end
    A["x"] --> n1["times"]
    n1 --> n2["plus"]
    n2 --> n3["z"]
    n3 --> n4["CE"]
    n4 --> n5["loss"]
    n6["y"] --> n4
    n7 --> n1
    n8 --> n2
    n7@{ shape: rounded}
    n8@{ shape: rounded}
    A@{ shape: rounded}
    n1@{ shape: rounded}
    n2@{ shape: rounded}
    n3@{ shape: rounded}
    n4@{ shape: rounded}
    n5@{ shape: rounded}
    n6@{ shape: rounded}
```

So, in this network `w` and `b` are **parameters** that need to be optimized. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. To do this, we set the `requires_grad` property of those tensors.

> **！Note**
>
> You can set the value of `requires_grad` when creating a tensor, or later by using the `x.requires_grad_(True)` method.

A function that we apply to tensors to construct a computational graph is in fact an object of class `Function`. This object knows how to compute the function in the *forward* direction, and also how to compute its derivative during the *backward propagation* step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor. Read more about the [`Function` in the documentation](https://pytorch.org/docs/stable/autograd.html#function).

In [2]:
print(f"Gradient function for z: {z.grad_fn}")
print(f"Gradient function for loss: {loss.grad_fn}")

Gradient function for z: <AddBackward0 object at 0x111275240>
Gradient function for loss: <BinaryCrossEntropyWithLogitsBackward0 object at 0x1112757b0>
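
The sketch below contrasts a leaf tensor with the result of an operation, and shows `requires_grad_()` switching tracking on after creation. The tensor names are assumptions chosen so they do not clash with the tutorial's variables.

```python
import torch

leaf = torch.randn(3)                    # created without gradient tracking
print(leaf.requires_grad, leaf.is_leaf)  # False True

leaf.requires_grad_(True)                # switch tracking on in place
result = (leaf * 2).sum()                # produced by operations on a tracked tensor

print(leaf.grad_fn)                      # None: a leaf is not the output of an operation
print(result.grad_fn)                    # a SumBackward0 object
print(result.is_leaf)                    # False
```

Only tensors produced by an operation carry a `grad_fn`; tensors created directly by the user are the leaves of the graph.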


### Computing gradients

Let's get into actually computing the gradients. To optimize the parameters of the neural network, we need to compute the derivatives of our loss function with respect to them; namely, we need $\frac{\partial \text{loss}}{\partial w}$ and $\frac{\partial \text{loss}}{\partial b}$ under some fixed values of `x` and `y`. We call `loss.backward()` to compute those derivatives and then retrieve the values from `w.grad` and `b.grad`:

In [3]:
loss.backward()

print(w.grad)
print(b.grad)

tensor([[0.3077, 0.1131, 0.0441],
        [0.3077, 0.1131, 0.0441],
        [0.3077, 0.1131, 0.0441],
        [0.3077, 0.1131, 0.0441],
        [0.3077, 0.1131, 0.0441]])
tensor([0.3077, 0.1131, 0.0441])
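
To see where such numbers come from, the following sketch rebuilds the same network and compares autograd's result with a hand-derived formula. It assumes the default `reduction="mean"` of `binary_cross_entropy_with_logits`; the fixed seed and the `expected_*` names are additions for illustration, so the random values will differ from the output above.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradients for the mean-reduced loss:
# d(loss)/dz = (sigmoid(z) - y) / number_of_elements
with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / z.numel()
    expected_b_grad = dz                  # z depends on b through the identity
    expected_w_grad = torch.outer(x, dz)  # dz_j / dw_ij = x_i

print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
```

Since every entry of `x` is 1, each row of `w.grad` equals `b.grad`, which explains the repeated rows in the printed gradient.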


> **！Note**
>
> - We can only obtain the `grad` properties for the leaf nodes of the computational graph, which have the `requires_grad` property set to `True`. For all other nodes in our graph, gradients will not be available.
> - We can only perform gradient calculations using `backward` once on a given graph, for performance reasons. If we need to do several `backward` calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.
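
Both points in the note can be illustrated with a small sketch. It uses `retain_grad()` to keep the gradient of an intermediate tensor and `retain_graph=True` to allow a second backward pass; the tensor names and values are illustrative assumptions.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
h = p * 3                                          # intermediate (non-leaf) tensor
h.retain_grad()                                    # ask autograd to keep h.grad too
out_sum = (h ** 2).sum()

out_sum.backward(retain_graph=True)  # keep the saved buffers for another pass
print(h.grad)                        # d(out_sum)/dh = 2 * h = [6., 12.]
print(p.grad)                        # d(out_sum)/dp = 3 * 2 * h = [18., 36.]

out_sum.backward()                   # second pass works; gradients accumulate
print(p.grad)                        # [36., 72.]
```

Without `retain_graph=True` on the first call, the second `backward()` would raise a `RuntimeError`, because the buffers saved during the forward pass are freed after the first pass.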

### Disabling gradient tracking

All tensors with `requires_grad=True` track their computational history and support gradient computation by default. In some cases we do not need that, for example when we have trained the model and just want to apply it to some input data, i.e. we only want to do *forward* computations through the network. We can stop tracking computations by surrounding our computation code with a `torch.no_grad()` block:

In [4]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b

print(z.requires_grad)

True
False


We can also use the `detach()` method on the tensor to achieve the same results:

In [5]:
z = torch.matmul(x, w) + b
z_det = z.detach()

print(z_det.requires_grad)

False
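
The two approaches differ in scope: `torch.no_grad()` applies to a block of code, while `detach()` applies to one tensor. A short sketch with assumed tensor names also shows that a detached tensor shares storage with its source rather than copying it.

```python
import torch

src = torch.ones(3, requires_grad=True)
tracked = src * 2                # recorded in the graph
cut_off = tracked.detach()       # same values, no link to the graph

print(cut_off.grad_fn)                            # None
print(cut_off.data_ptr() == tracked.data_ptr())   # True: storage is shared

with torch.no_grad():
    untracked = src * 2          # nothing is recorded inside the block
print(untracked.requires_grad)   # False
```

Because the storage is shared, an in-place change to the detached tensor also changes the original values.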


**Here are some reasons why you might want to disable gradient tracking:**

- To mark some parameters in your neural network as **frozen parameters**.
- To **speed up computations** when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
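
Freezing can be sketched as follows. The two-layer model and its sizes are hypothetical; the point is only that parameters with `requires_grad` set to `False` receive no gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical two-layer model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer so that only the last layer stays trainable.
for param in model[0].parameters():
    param.requires_grad_(False)

batch_loss = model(torch.randn(16, 4)).sum()
batch_loss.backward()

print(model[0].weight.grad)              # None: frozen, no gradient computed
print(model[2].weight.grad is not None)  # True: still trainable
```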

### More on computational graphs

Autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of [Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function) objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things at the same time:

- run the requested operation to compute a resulting tensor, and
- maintain the operation's *gradient function* in the DAG.

The backward pass kicks off when `.backward()` is called on the DAG root. `autograd` then:

- computes the gradients from each `.grad_fn`,
- accumulates them in the respective tensor's `.grad` attribute, and
- using the chain rule, propagates all the way to the leaf tensors.
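
The recorded graph can be inspected directly. This sketch walks from a root back to a leaf through `grad_fn.next_functions`, following only the first input of each node; the tensor names are assumptions, and the walk is a simplification that suits this linear chain.

```python
import torch

u = torch.tensor(2.0, requires_grad=True)  # leaf
s = u * 3                                   # recorded as a multiplication node
r = s + 1                                   # recorded as an addition node (the root)

node = r.grad_fn
chain = []
while node is not None:
    chain.append(type(node).__name__)
    # next_functions holds (node, input_index) pairs; follow the first input
    node = node.next_functions[0][0] if node.next_functions else None

print(" -> ".join(chain))

r.backward()
print(u.grad)  # d(r)/du = 3
```

The last node in the chain is the one that accumulates the result into the leaf's `.grad` attribute.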

> **！Note**
>
> **DAGs are dynamic in PyTorch.** An important thing to note is that the graph is recreated from scratch; after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size, and operations at every iteration if needed.
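
A minimal sketch of data-dependent control flow: the hypothetical helper below doubles its input until the norm reaches 10, so the number of recorded operations, and therefore the gradient, depends on the input values.

```python
import torch

def double_until_large(t):
    """Double t until its norm reaches 10; the loop count depends on the data."""
    doublings = 0
    while t.norm() < 10:
        t = t * 2
        doublings += 1
    return t.sum(), doublings

results = {}
for start in (1.0, 4.0):
    start_tensor = torch.full((2,), start, requires_grad=True)
    total, doublings = double_until_large(start_tensor)
    total.backward()
    # each doubling multiplies the gradient by 2
    results[start] = (doublings, start_tensor.grad)
    print(start, doublings, start_tensor.grad)
```

Starting from 1.0 the loop runs three times and the gradient is 8 per element; starting from 4.0 it runs once and the gradient is 2.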

### Tensor gradients and Jacobian products

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute a so-called **Jacobian product**, and not the actual gradient.

For a vector function $\vec{y}=f(\vec{x})$, where $\vec{x}=\langle x_1,...,x_n\rangle$ and $\vec{y}=\langle y_1,...,y_m\rangle$, a gradient of $\vec{y}$ with respect to $\vec{x}$ is given by the **Jacobian matrix**:

$$
J = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the **Jacobian product** $v^T \cdot J$ for a given input vector $v=(v_1 \cdots v_m)$. This is achieved by calling `backward` with $v$ as an argument. The size of $v$ must match the size of the tensor on which `backward` is called, i.e. the output tensor whose Jacobian product we want to compute.
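
To make the product concrete, the sketch below compares the result of `backward(v)` with an explicitly built Jacobian from `torch.autograd.functional.jacobian`. The matrix `A`, the function `f`, and the vectors are illustrative assumptions; building the full Jacobian is done only for the comparison.

```python
import torch

A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])      # illustrative 2x3 matrix

def f(vec):
    """Vector function from R^3 to R^2."""
    return (A @ vec) ** 2

x_vec = torch.tensor([1.0, 0.0, -1.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])             # one weight per output component

y_vec = f(x_vec)
y_vec.backward(v)                        # stores v^T J in x_vec.grad

# Full Jacobian (shape 2x3), built explicitly only for comparison
J = torch.autograd.functional.jacobian(f, x_vec.detach())

print(x_vec.grad)   # tensor([-12., -18., -24.])
print(v @ J)        # same values
```

Here `v` has two entries because the output has two components, while the resulting product has three entries, one per input component.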

In [6]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp + 1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")

out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")

inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])


Notice that when we call `backward` for the second time with the same argument, the value of the gradient is different. This happens because during backward propagation PyTorch **accumulates the gradients**, i.e. the value of the computed gradients is added to the `grad` property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the `grad` property beforehand. In real-life training, an *optimizer* helps us to do this.
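
A minimal sketch of how an optimizer takes care of this in a training loop. The choice of `torch.optim.SGD`, the learning rate, and the toy loss are assumptions made for illustration.

```python
import torch

weight = torch.tensor([1.0, -1.0], requires_grad=True)
optimizer = torch.optim.SGD([weight], lr=0.1)  # illustrative optimizer and learning rate

grads_seen = []
for step in range(3):
    optimizer.zero_grad()               # clear gradients left over from the previous step
    step_loss = (weight ** 2).sum()
    step_loss.backward()                # fresh gradient: 2 * weight
    grads_seen.append(weight.grad.clone())
    optimizer.step()                    # weight <- weight - lr * grad

print(grads_seen)
print(weight)
```

Each recorded gradient equals twice the current weight. Without the `zero_grad()` call, the gradients from earlier steps would be added on top.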

> **！Note**
>
> Previously we were calling the `backward()` function without parameters. This is essentially equivalent to calling `backward(torch.tensor(1.0))`, which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.
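
The equivalence in the note can be checked with a short sketch. The helper function and its argument name are hypothetical; it rebuilds the same scalar each time so that gradients do not accumulate between calls.

```python
import torch

def grad_of_sum_of_squares(seed_gradient=None):
    """Return d(sum(q**2))/dq, optionally passing an explicit argument to backward."""
    q = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    scalar = (q ** 2).sum()
    if seed_gradient is None:
        scalar.backward()
    else:
        scalar.backward(seed_gradient)
    return q.grad

implicit = grad_of_sum_of_squares()
explicit = grad_of_sum_of_squares(torch.tensor(1.0))
scaled = grad_of_sum_of_squares(torch.tensor(2.0))

print(implicit)   # tensor([2., 4., 6.])
print(explicit)   # tensor([2., 4., 6.])
print(scaled)     # tensor([4., 8., 12.])
```

Passing a value other than 1.0 simply scales the gradient, which is the scalar case of the Jacobian product.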