# Credits

This material is heavily influenced by, or copied from, the PyTorch tutorials: https://github.com/pytorch/tutorials

# Autograd: automatic differentiation

Central to all neural networks in PyTorch is the `autograd` package.
Let’s briefly look at it first, and then move on to training our first neural network.

The `autograd` package provides automatic differentiation for all operations on Tensors.
It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.
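To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides which operations get recorded. The function `f` and its two branches are illustrative assumptions, not part of the original tutorial; `requires_grad` and `.backward()` are explained in the next section.

```python
import torch

def f(x):
    # Hypothetical function: the branch taken depends on the data,
    # so the recorded graph can differ from one call to the next.
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient: 2 * x
    return (3 * x).sum()        # gradient: 3 for every element

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([-1.0, -2.0], requires_grad=True)

f(a).backward()
f(b).backward()

print(a.grad)  # gradient of the quadratic branch
print(b.grad)  # gradient of the linear branch
```

The same Python function yields two different backward computations, because the graph is built while the code runs.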

Let us see this in simpler terms with some examples.

## 1. Tensor

`torch.Tensor` is the central class of the package. Setting the attribute `.requires_grad` to `True` will make the tensor "record" all operations on it. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into the `.grad` attribute.

![autograd.Variable](../static_files/autograd-variable.png)

One more class is central to the autograd implementation: `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph that encodes the complete history of the computation. Each tensor has a `.grad_fn` attribute that references the `Function` that created it (except for tensors created directly by the user, whose `grad_fn` is `None`).

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to pass any arguments to `backward()`. If it has more elements, you need to pass a `gradient` argument: a tensor of matching shape.
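The scalar and non-scalar cases can be sketched as follows. The variable names `scalar_case`, `raised` and `vector_case` are ours, chosen only for this illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)

# Scalar result: backward() needs no arguments.
(x * 2).sum().backward()
scalar_case = x.grad.clone()

# Non-scalar result: backward() without arguments raises a RuntimeError.
x.grad = None  # reset the accumulated gradient
y = x * 2
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

# Passing a gradient tensor of matching shape works.
y = x * 2
y.backward(gradient=torch.ones_like(y))
vector_case = x.grad.clone()

print(scalar_case)
print(raised)
print(vector_case)
```

Passing a tensor of ones as `gradient` gives the same result as summing `y` first and calling `backward()` on the scalar sum.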

In [2]:
import torch

Create a tensor:

In [3]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


Do a tensor operation:

In [7]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


`y` was created as a result of an operation, so it has a `grad_fn`.

In [8]:
print(y.grad_fn)

<AddBackward0 object at 0x000001FF58FDFAF0>


Do more operations on `y`:

In [9]:
z = y * y * 3
out = z.mean()

print(z)
print(out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)
tensor(27., grad_fn=<MeanBackward0>)


# Assignments

1. Create a tensor of size (5, 5) with `requires_grad=True`
2. Sum the values in the Tensor

In [21]:
a = torch.randn(5, 5, requires_grad=True)
print(a)
print(f"Sum of the values: {a.sum():.3f}")

tensor([[-2.6417,  1.0485,  0.6656, -0.7590, -0.7866],
        [-0.6268, -0.6789, -0.8367,  0.3465,  0.0406],
        [-0.0577,  0.3424, -0.4682,  0.1584, -1.3160],
        [-0.2772, -0.0488,  0.2400,  0.4549, -0.5794],
        [-1.3434, -0.2605, -0.9742, -0.0462,  0.0401]], requires_grad=True)
Sum of the values: -8.364


## 2. Gradients

Let’s backprop now. Because `out` contains a single scalar, `out.backward()` is equivalent to `out.backward(torch.tensor(1.0))`.

In [24]:
out.backward()

Calling `out.backward()` a second time on the same graph, for example by re-running this cell, fails:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
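The error message mentions `retain_graph=True`. As a minimal sketch, assuming the same computation as above rebuilt from scratch, the snippet below runs two backward passes over one graph and shows that the gradients accumulate into `.grad`. The names `first` and `second` are ours.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()

# Keep the saved intermediate values so a second pass is possible.
out.backward(retain_graph=True)
first = x.grad.clone()

# The second pass succeeds, and its gradients are added to x.grad.
out.backward()
second = x.grad.clone()

# Reset the accumulated gradient before any further backward pass.
x.grad.zero_()

print(first)
print(second)
```

Because gradients accumulate, `second` is twice `first`. This is why training loops usually zero the gradients between iterations.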

Print the gradients d(out)/dx:

In [25]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


You should get a matrix filled with `4.5`. Let’s denote the tensor `out` with $o$.

We have:
$o = \frac{1}{4}\sum_i z_i$, $z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.

Therefore, $\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$,
hence $\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
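The closed-form derivative can be checked against autograd at other points as well. The sketch below uses an arbitrary input of our choosing and compares `x.grad` with $\frac{3}{2}(x_i+2)$; the name `analytic` is ours.

```python
import torch

# Arbitrary 2x2 input (an assumption for this check, not from the tutorial).
x = torch.tensor([[0.0, 1.0], [2.0, 3.0]], requires_grad=True)

# Same computation as in the text: o = mean(3 * (x + 2)^2).
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Derivative from the formula above: 3/2 * (x + 2).
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```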

You can do many crazy things with autograd!

In [29]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x * 2
while y.detach().norm() < 1000:
    y = y * 2

print(y)

tensor([ 0.0475, -0.6505, -1.0864], requires_grad=True)
tensor([   48.6576,  -666.0714, -1112.4949], grad_fn=<MulBackward0>)


In [30]:
gradients = torch.tensor([0.1, 1.0, 0.0001])
y.backward(gradients)

print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
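One way to read this result: `y.backward(gradients)` computes a vector-Jacobian product. Here `y = x * 2**k` for some number of doublings `k`, so the Jacobian is `2**k` times the identity and `x.grad` equals `gradients * 2**k`. In the run shown the factor is 1024, i.e. `k = 10`. The sketch below checks this relation with a fixed input of our choosing; the names `k` and `v` are ours.

```python
import torch

# Fixed input so the number of doublings is reproducible.
x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

y = x * 2
k = 1  # number of doublings applied so far
while y.detach().norm() < 1000:
    y = y * 2
    k += 1

# The vector v plays the role of `gradients` in the cell above.
v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)

print(k)
print(torch.allclose(x.grad, v * 2 ** k))
```

The loop length depends on the data, yet autograd still differentiates through exactly the operations that were executed.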


**Read later** \
*Documentation* \
`Tensor`: https://pytorch.org/docs/stable/tensors.html \
`Function`: http://pytorch.org/docs/autograd

# Assignments

1. Define a tensor and set `requires_grad` to `True`
2. Multiply the tensor by 2 and assign the result to a new Python variable (i.e. `x = result`)
3. Sum the variable's elements and assign the result to a new Python variable
4. Print the gradients of all the variables
5. Now perform a backward pass on the last variable (NOTE: for each new Python variable that you define, call `.retain_grad()`)
6. Print all gradients again
  - what did you notice?

In [36]:
import torch

# Step 1: Create a tensor with requires_grad=True
tensor = torch.randn(3, 3, requires_grad=True)

# Step 2: Multiply the tensor by 2 and assign the result to a new variable
x = tensor * 2

# Step 3: Sum the elements of the new variable and assign it to another variable
sum_result = x.sum()

# Step 4: Print the gradients of all variables
print("Gradients before backward pass:")
print("Gradient of tensor:", tensor.grad)
print("Gradient of x:", x.grad)
print("Gradient of sum_result:", sum_result.grad)

# Step 5: Perform a backward pass on the last variable
sum_result.retain_grad()  # Retain gradients for sum_result
sum_result.backward()

# Step 6: Print gradients again after the backward pass
print("\nGradients after backward pass:")
print("Gradient of tensor:", tensor.grad)
print("Gradient of x:", x.grad)
print("Gradient of sum_result:", sum_result.grad)


Gradients before backward pass:
Gradient of tensor: None
Gradient of x: None
Gradient of sum_result: None

Gradients after backward pass:
Gradient of tensor: tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
Gradient of x: None
Gradient of sum_result: tensor(1.)



The reason for this behavior is that, by default, PyTorch does not store gradients for intermediate (non-leaf) tensors such as `x` during a backward pass. The `.grad` attribute is populated only for leaf tensors with `requires_grad=True` (here `tensor`) and for non-leaf tensors on which you explicitly call `.retain_grad()` (here `sum_result`). This saves memory.

So, after the backward pass, `tensor.grad` is a matrix of `2`s and `sum_result.grad` is `1`, but `x.grad` remains `None` because `.retain_grad()` was never called on `x`.

Calling `.retain_grad()` on every intermediate tensor before the backward pass makes all gradients available. In that case `x.grad` is a matrix of ones, since each element of `x` contributes to the sum with weight 1.
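As a sketch of that last point, the snippet below repeats the assignment but also calls `.retain_grad()` on the intermediate tensor `x` before the backward pass. It reuses the variable names from the solution above.

```python
import torch

tensor = torch.randn(3, 3, requires_grad=True)

x = tensor * 2
x.retain_grad()            # keep the gradient of the intermediate tensor

sum_result = x.sum()
sum_result.retain_grad()   # keep the gradient of the final scalar as well

sum_result.backward()

print("Gradient of tensor:", tensor.grad)          # all 2s
print("Gradient of x:", x.grad)                    # all 1s
print("Gradient of sum_result:", sum_result.grad)  # 1
```

With both calls in place, none of the three gradients is `None` after the backward pass.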