### Automatic Differentiation with <i>torch.autograd</i>

When training neural networks, the most frequently used algorithm is <b>backpropagation</b>. In this algorithm, parameters (model weights) are adjusted according to the <b>gradient</b> of the loss function with respect to each parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called <i>torch.autograd</i>. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

In [6]:
import torch
import torch.nn.functional as F

torch.set_printoptions(sci_mode=False, linewidth=300)

x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output

w = torch.randn(5, 3, requires_grad=True) # weights, tracked by autograd
b = torch.randn(3, requires_grad=True) # bias, tracked by autograd

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y) # compare the logits z (not the input x) with y

### Tensors, Functions, and Computational Graph

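This code defines a <b>computational graph</b>: x and w feed a matrix multiplication, b is added to the result to produce z, and z and y feed the cross-entropy function that produces loss.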

In this network, w and b are <b>parameters</b>, which we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. To do that, we set the <i>requires_grad</i> property of those tensors to <b>True</b>.

A function that we apply to tensors to construct the computational graph is in fact an object of class <i>Function</i>. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the <i>grad_fn</i> property of a tensor.

In [None]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")
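To make the forward and backward roles of a <i>Function</i> concrete, here is a minimal sketch of a user-defined one. The class name `Square` and the tensor `t` are hypothetical, chosen only for illustration; the built-in operations used above have their own internal implementations.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example Function computing inp ** 2."""

    @staticmethod
    def forward(ctx, inp):
        # Save the input because the derivative 2 * inp needs it later
        ctx.save_for_backward(inp)
        return inp * inp

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: incoming gradient times the local derivative 2 * inp
        (inp,) = ctx.saved_tensors
        return 2 * inp * grad_output

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = Square.apply(t).sum()
print(out.grad_fn)  # the backward node recorded for the sum

out.backward()
print(t.grad)  # tensor([2., 4., 6.])
```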

### Computing Gradients

To optimize the parameters of the neural network, we need to compute the derivatives of our loss function with respect to the parameters, namely <i>dloss/dw</i> and <i>dloss/db</i>, under some fixed values of x and y. To compute those derivatives, we call <i>loss.backward()</i>, and then retrieve the values from <i>w.grad</i> and <i>b.grad</i>.

In [None]:
loss.backward()
print(w.grad)
print(b.grad)
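As a sanity check, the gradients from autograd can be compared with a hand-derived formula. The sketch below assumes the default mean reduction of the loss, for which dloss/dz = (sigmoid(z) - y) / N with N the number of elements in z. The seed and the names `dloss_dz`, `manual_w_grad`, and `manual_b_grad` are ours, introduced only for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradients, computed without building a graph
with torch.no_grad():
    dloss_dz = (torch.sigmoid(z) - y) / z.numel()
    manual_b_grad = dloss_dz                  # z depends on b one-to-one
    manual_w_grad = torch.outer(x, dloss_dz)  # dz_j/dw_ij = x_i

print(torch.allclose(w.grad, manual_w_grad, atol=1e-6))
print(torch.allclose(b.grad, manual_b_grad, atol=1e-6))
```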

By default, we can only obtain the <i>grad</i> property for <b>leaf nodes</b> of the computational graph that have the <i>requires_grad</i> property set to <b>True</b>. For all other nodes in our graph, gradients will not be available.

Leaf nodes are typically tensors created by the user, such as the weights in a neural network. They are not derived from other nodes in the computational graph.

In [None]:
print(z.grad) # z is NOT a leaf node because it is created by a tensor operation on other tensors.
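Whether a tensor is a leaf can be checked with its <i>is_leaf</i> attribute. If the gradient of an intermediate tensor such as z is needed, for example for debugging, one option is to call <i>retain_grad()</i> on it before the backward pass. A minimal sketch, rebuilding the same small network:

```python
import torch
import torch.nn.functional as F

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
z.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor

loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

print(w.is_leaf, b.is_leaf, z.is_leaf)  # True True False
print(z.grad)  # populated because of retain_grad()
```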

By default, we can only perform gradient calculations using <i>backward</i> once on a given graph, for performance reasons. If we need to do several <i>backward</i> calls on the same graph, we need to pass <i>retain_graph=True</i> to the <i>backward</i> call.

That's because the computational graph is freed after you call <i>backward</i>, unless you ask to "retain" it.
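The sketch below illustrates this behavior on the same small network. Note that gradients accumulate in <i>.grad</i> across calls, so after two backward passes the stored gradient is twice that of a single pass. The names `first_grad` and `third_call_failed` are ours, used only for this illustration.

```python
import torch
import torch.nn.functional as F

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)

loss.backward(retain_graph=True)  # keep the graph for another pass
first_grad = w.grad.clone()

loss.backward()  # works because the graph was retained; now it is freed
print(torch.allclose(w.grad, 2 * first_grad))  # gradients were accumulated

try:
    loss.backward()  # the graph is gone, so this call raises
    third_call_failed = False
except RuntimeError as err:
    third_call_failed = True
    print(f"Third backward failed: {err}")
```

In a training loop, the accumulated gradients are usually reset between iterations, for instance by setting <i>w.grad</i> to None or by calling the optimizer's <i>zero_grad()</i> method.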

### Disabling Gradient Tracking

By default, all tensors with <i>requires_grad=True</i> track their computational history and support gradient computation. However, there are some cases when we do not need that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do <i>forward</i> computations through the network.

We can stop tracking computations by surrounding our computation code with a <i><b>torch.no_grad()</b></i> block:

In [None]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)

Another way to achieve the same result is to use the <i><b>detach</b></i> method on the tensor.

In [None]:
# Redefine w and b
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)



A "detached" tensor doesn't know the history of how it was created. Even though in this case it is obvious to us that <i><b>z_det</b></i> holds the result of <i><b>torch.matmul(x, w) + b</b></i>, <i><b>z_det</b></i> regards itself as a fresh new tensor filled with some values.

Therefore, any <i>backward</i> call through <i>z_det</i> will NOT update the <i>.grad</i> attribute of w or b.

In [None]:
z_det.requires_grad_(True) # Need to turn on gradient calculation for z_det (in-place method, note the trailing underscore)
loss = F.binary_cross_entropy_with_logits(z_det, y)
loss.backward()

In [None]:
print(w.grad)
print(b.grad)

Here are two reasons you might want to disable gradient tracking:
- To mark some parameters in your neural network as <b>frozen parameters</b>. This is a very common scenario for finetuning a pretrained network.
- To <b>speed up computations</b> when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
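The frozen-parameter case can be sketched as follows. The two-layer `model` here is a hypothetical stand-in for a pretrained network, and freezing its first layer is just one possible choice.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical small model standing in for a pretrained network
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer: autograd stops tracking its parameters
for param in model[0].parameters():
    param.requires_grad_(False)

x = torch.ones(1, 5)   # a batch with one input
y = torch.zeros(1, 3)  # expected output

loss = F.binary_cross_entropy_with_logits(model(x), y)
loss.backward()

print(model[0].weight.grad)        # None: frozen, no gradient computed
print(model[2].weight.grad.shape)  # the unfrozen layer still gets a gradient
```

When finetuning, it is common to pass only the parameters that still have <i>requires_grad=True</i> to the optimizer.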