# Automatic Differentiation with Autograd

When training neural networks, the most frequently used algorithm is backpropagation.

In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.
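To make the idea of adjusting a parameter along its gradient concrete, here is a minimal sketch of a single gradient-descent step. The toy loss (sum of squares), the learning rate `lr`, and the starting values are assumptions chosen for illustration, not part of the network used in the rest of this section.

```python
import torch

# Hypothetical toy parameter and loss, chosen only for illustration
w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()        # loss = w0^2 + w1^2 = 5.0

loss.backward()              # autograd fills w.grad with d(loss)/dw = 2 * w
print(w.grad)                # tensor([ 2., -4.])

lr = 0.1                     # assumed learning rate
with torch.no_grad():        # the update itself must not be recorded in the graph
    w -= lr * w.grad         # w is now approximately [0.8, -1.6]

new_loss = (w ** 2).sum()
print(new_loss.item() < loss.item())  # True: the step reduced the loss
```

In practice an optimizer object usually performs the update step, but the arithmetic is the same.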

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch as follows:

In [1]:
import torch

In [2]:
x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b

loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

## Tensors, functions and computational graph

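This code defines the following computational graph: x and w feed a matrix multiplication, b is added to the result to produce z, and z together with y feeds the cross-entropy function that produces loss.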

In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.

A function that we apply to tensors to construct a computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step.
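To illustrate what "knows how to compute forward and backward" means, the sketch below defines a custom operation by subclassing torch.autograd.Function. The class name `Square` and the operation itself are hypothetical; the built-in operations work along the same lines but are implemented inside PyTorch.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example op: y = x * x, with a hand-written derivative."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; the backward step needs it to compute dy/dx = 2 * x
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: incoming gradient times the local derivative
        return grad_output * 2 * x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)      # custom Functions are invoked through .apply
y.sum().backward()
print(x.grad)            # tensor([2., 4., 6.])
```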

A reference to the backward propagation function is stored in the grad_fn property of a tensor.

In [3]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x109a08130>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x109a084f0>
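The distinction between tensors created by the user and tensors produced by an operation can be inspected directly. The sketch below rebuilds the same x, w, b and z and prints a few attributes. The exact node names shown by next_functions may differ between PyTorch versions, so treat that part of the output as indicative only.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

for name, t in [("x", x), ("w", w), ("b", b), ("z", z)]:
    # Leaf tensors were created directly by the user and have no grad_fn;
    # z was produced by an operation, so it carries a backward function.
    print(f"{name}: is_leaf={t.is_leaf}, requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")

# The backward node of z links to the nodes of its inputs
# (node names are version-dependent)
print(z.grad_fn.next_functions)
```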


## Computing Gradients

To optimize the weights of the parameters in the neural network, we need to compute the derivatives of our loss function with respect to the parameters; namely, we need $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$ under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad.

- We can only obtain the grad properties for the leaf nodes of the computational graph that have the requires_grad property set to True. For all other nodes in our graph, gradients will not be available.

- For performance reasons, we can only perform gradient calculations using backward once on a given graph. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.
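Both notes can be checked with a short experiment. The sketch below uses the same network as before; the flag name `third_call_failed` is introduced only for this example. It also uses retain_grad(), an opt-in that asks autograd to keep the gradient of a non-leaf tensor, which is not kept by default.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
z.retain_grad()  # opt in: without this, z.grad would stay unavailable (z is not a leaf)
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward(retain_graph=True)  # keep the graph buffers for another pass
first = w.grad.clone()

loss.backward()                   # allowed, because the first call retained the graph

try:
    loss.backward()               # the previous call freed the graph buffers
    third_call_failed = False
except RuntimeError:
    third_call_failed = True

print(third_call_failed)                   # True
print(torch.allclose(w.grad, 2 * first))   # True: the two successful calls accumulated
print(z.grad.shape)                        # torch.Size([3])
```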

In [4]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.3169, 0.0088, 0.3204],
        [0.3169, 0.0088, 0.3204],
        [0.3169, 0.0088, 0.3204],
        [0.3169, 0.0088, 0.3204],
        [0.3169, 0.0088, 0.3204]])
tensor([0.3169, 0.0088, 0.3204])
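The printed gradients can be cross-checked by hand. Assuming the default mean reduction of binary_cross_entropy_with_logits, the derivative of the loss with respect to each logit is (sigmoid(z) - y) / 3, and since z = x·w + b, the gradient for w is the outer product of x with that vector. This also explains why all rows of w.grad are identical: every entry of x is 1. The names `dloss_dz`, `expected_w` and `expected_b` are introduced only for this sketch.

```python
import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradient of the mean-reduced loss with respect to the logits z
dloss_dz = (torch.sigmoid(z.detach()) - y) / y.numel()
expected_b = dloss_dz                  # z = x @ w + b, so dz/db is the identity
expected_w = torch.outer(x, dloss_dz)  # dloss/dw[i, j] = x[i] * dloss/dz[j]

print(torch.allclose(b.grad, expected_b, atol=1e-6))  # True
print(torch.allclose(w.grad, expected_w, atol=1e-6))  # True
```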


## Disabling Gradient Tracking

By default, all tensors with requires_grad=True track their computational history and support gradient computation. However, there are cases when we do not need this, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network.

We can stop tracking computations by surrounding our computation code with a torch.no_grad() block:

In [5]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

True


In [6]:
with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

False


Another way to achieve the same result is to use the detach() method on the tensor:

In [7]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False
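The effect of detach() shows up most clearly when the detached tensor is used inside a larger expression: autograd then treats it as a constant. The sketch below compares the two cases on a toy tensor; the values and the names `grad_full` and `grad_det` are assumptions made for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

# Full graph: loss = sum(2 * x * x), so d(loss)/dx = 4 * x
loss_full = (y * x).sum()
(grad_full,) = torch.autograd.grad(loss_full, x)
print(grad_full)   # tensor([ 4.,  8., 12.])

# Detached: y is treated as a constant, so d(loss)/dx = y = 2 * x
loss_det = (y.detach() * x).sum()
(grad_det,) = torch.autograd.grad(loss_det, x)
print(grad_det)    # tensor([2., 4., 6.])

# detach() does not copy data; the result shares storage with the original
print(y.detach().data_ptr() == y.data_ptr())  # True
```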


There are reasons you might want to disable gradient tracking:
- to mark some parameters in your neural network as frozen parameters
- to speed up computations when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient
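Freezing parameters, the first reason in the list above, can be sketched as follows. The two-layer model and its layer sizes are hypothetical; the point is only that parameters with requires_grad set to False receive no gradient during the backward pass.

```python
import torch
from torch import nn

# Hypothetical small model; the layer sizes are arbitrary
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: its parameters will no longer be tracked by autograd
for p in model[0].parameters():
    p.requires_grad_(False)

out = model(torch.randn(3, 4))
out.sum().backward()

print(model[0].weight.grad)               # None: frozen, no gradient computed
print(model[2].weight.grad is not None)   # True: still trainable

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)                          # ['2.weight', '2.bias']
```

When combining this with an optimizer, a common choice is to pass it only the parameters that still require gradients.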

## More on Computational Graph

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects.

In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:
- runs the requested operation to compute a resulting tensor
- maintains the operation's gradient function in the DAG

The backward pass kicks off when .backward() is called on the DAG root. autograd then:
- computes the gradients from each .grad_fn
- accumulates them in the respective tensor's .grad attribute
- using the chain rule, propagates all the way to the leaf tensors
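The chain-rule propagation described in these steps can be reproduced by hand on a two-operation graph. The sketch below uses torch.autograd.grad to obtain each local derivative separately and compares their product with what backward() stores in x.grad. The function z = sin(x²) and the starting value are arbitrary choices for illustration.

```python
import torch

x = torch.tensor(1.5, requires_grad=True)
y = x ** 2           # first operation
z = torch.sin(y)     # second operation; z is the root of the graph

# Local derivatives, one edge of the graph at a time
(dz_dy,) = torch.autograd.grad(z, y, retain_graph=True)  # cos(y)
(dy_dx,) = torch.autograd.grad(y, x, retain_graph=True)  # 2 * x

# Full backward pass from the root to the leaf
z.backward()

print(torch.allclose(x.grad, dz_dy * dy_dx))  # True: dz/dx = dz/dy * dy/dx
```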

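DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is what allows you to use control flow statements in your model, and to change the shape, size and operations at every iteration if needed.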
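Because the graph is built while the code runs, ordinary Python control flow decides which operations end up in it. The sketch below uses a hypothetical function `f` with a data-dependent branch; each call records a different graph, and the gradients differ accordingly.

```python
import torch

def f(x):
    # Hypothetical function with a data-dependent branch:
    # each call builds a fresh graph containing only the operations that ran
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient: 2 * x
    return (3 * x).sum()        # gradient: 3 for every element

a = torch.tensor([1.0, 2.0], requires_grad=True)
f(a).backward()
print(a.grad)   # tensor([2., 4.])

b = torch.tensor([-1.0, -2.0], requires_grad=True)
f(b).backward()
print(b.grad)   # tensor([3., 3.])
```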

## Tensor Gradients and Jacobian Products

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor.

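In this case, PyTorch allows you to compute the so-called Jacobian product, and not the actual gradient.

For a vector function $\vec{y} = f(\vec{x})$, where $\vec{x} = \langle x_1, \dots, x_n \rangle$ and $\vec{y} = \langle y_1, \dots, y_m \rangle$, the gradient of $\vec{y}$ with respect to $\vec{x}$ is given by the Jacobian matrix $J$ with entries $J_{ij} = \frac{\partial y_i}{\partial x_j}$.

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the Jacobian product $v^T \cdot J$ for a given input vector $v = (v_1, \dots, v_m)$. This is achieved by calling backward with $v$ as the argument. The size of $v$ should be the same as the size of the original tensor with respect to which we want to compute the product.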
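As a check on the relationship between backward(v) and the Jacobian, the sketch below builds the full Jacobian of a small vector function with torch.autograd.functional.jacobian and compares the explicit product with the result of backward. The function `f`, the input `x` and the vector `v` are made up for illustration.

```python
import torch

def f(x):
    # Hypothetical vector function from R^3 to R^3
    return torch.stack([x[0] * x[1], x[1] ** 2, x[0] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])   # same size as the output of f

# Jacobian product via backward: stores v^T . J in x.grad
f(x).backward(v)

# Full Jacobian, computed explicitly for comparison; J[i, j] = dy_i / dx_j
J = torch.autograd.functional.jacobian(f, x.detach())

print(J)
# tensor([[2., 1., 0.],
#         [0., 4., 0.],
#         [1., 0., 1.]])
print(x.grad)                          # tensor([4., 3., 2.])
print(torch.allclose(x.grad, v @ J))   # True
```

For large inputs and outputs the full Jacobian is expensive to materialize, which is why backward only ever computes the product.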

In [8]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])


Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because, when doing backward propagation, PyTorch accumulates the gradients, i.e. the value of the computed gradients is added to the grad property of all leaf nodes of the computational graph.

If you want to compute the proper gradients, you need to zero out the grad property beforehand. In real-life training, an optimizer helps us to do this.
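A minimal sketch of how an optimizer takes care of this, using the one-layer network from the beginning of this section. The choice of SGD, the learning rate and the number of steps are assumptions made for illustration.

```python
import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.1)  # assumed optimizer and learning rate

losses = []
for step in range(5):
    optimizer.zero_grad()   # clear gradients left over from the previous step
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    loss.backward()         # fresh gradients for this step only
    optimizer.step()        # update w and b using w.grad and b.grad
    losses.append(loss.item())

print(losses)  # the values decrease from step to step
```

Leaving out the zero_grad() call would make every step use the sum of all gradients computed so far, which is the accumulation effect shown above.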