Best resource to understand Automatic Differentiation: https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/

# Automatic Differentiation with torch.autograd

When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the __gradient__ of the loss function with respect to each parameter.

What exactly is meant here by __gradient__?

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer network, with input _x_, parameters _w_ and _b_, and some _loss function_. It can be defined in PyTorch in the following manner.

In [3]:
import torch
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss=torch.nn.functional.binary_cross_entropy_with_logits(z, y)

In [4]:
x, y, w, b, z, loss

(tensor([1., 1., 1., 1., 1.]),
 tensor([0., 0., 0.]),
 tensor([[ 0.1120, -0.0213,  0.5249],
         [-0.3198, -0.0042, -0.7752],
         [ 0.8590,  1.2993, -0.1937],
         [-0.1327, -0.9119,  0.6002],
         [ 0.4720, -1.5287,  0.9750]], requires_grad=True),
 tensor([ 0.6012, -0.2897,  2.4452], requires_grad=True),
 tensor([ 1.5918, -1.4565,  3.5764], grad_fn=<AddBackward0>),
 tensor(1.8635, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>))

### Note: You can set the value of requires_grad when creating a tensor, or later by using the x.requires_grad_(True) method.
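
A minimal sketch of both options from the note. The tensor names are only illustrative; the in-place method is spelled with a trailing underscore.

```python
import torch

# Option 1: request gradient tracking when the tensor is created.
w = torch.randn(5, 3, requires_grad=True)

# Option 2: switch it on later with the in-place method requires_grad_().
b = torch.randn(3)
print(b.requires_grad)   # False
b.requires_grad_(True)
print(b.requires_grad)   # True
```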

In [5]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7fe2ebac6340>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fe2ebd0df40>


In autograd, if any input _Tensor_ of an operation has _requires_grad=True_, the computation will be tracked. After computing the backward pass, the gradient w.r.t. this tensor is accumulated into its .grad attribute.
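
"Accumulated" is meant literally: a second backward pass adds to .grad rather than overwriting it. A small sketch to convince myself, using a toy sum instead of the loss above (my own simplification):

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)

# First backward pass: w.grad goes from None to the gradient of the sum.
torch.matmul(x, w).sum().backward()
first = w.grad.clone()

# Second backward pass on a fresh graph: the new gradient is added to w.grad.
torch.matmul(x, w).sum().backward()
print(torch.allclose(w.grad, 2 * first))  # True

# Reset by hand before the next step so old gradients do not leak in.
w.grad.zero_()
```

This is why training loops clear the gradients between steps.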

In the forward phase, the autograd tape will remember all the operations it executed, and in the backward phase, it will replay the operations.

If you want to compute the derivatives, you can call .backward() on a _Tensor_. If the _Tensor_ is a scalar, you don't need to specify any arguments to backward. However, if it has more elements, you need to pass a _gradient_ argument that is a tensor of matching shape.
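
A sketch of the non-scalar case, reusing the shapes from the one-layer example. Passing a tensor of ones through the gradient keyword of backward is my choice here; it weights every element of z equally, which is equivalent to calling z.sum().backward().

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b   # z has 3 elements, so it is not a scalar

# The gradient argument must have the same shape as z.
z.backward(gradient=torch.ones_like(z))

print(b.grad)  # tensor([1., 1., 1.])
print(w.grad)  # a 5x3 tensor of ones, because x is all ones
```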

## Second Day on this Automatic Differentiation thing

I ain't scared of nothing.

## Understanding Graphs, Automatic Differentiation and Autograd

Automatic Differentiation is a building block of not only PyTorch, but every DL library out there. In my opinion, PyTorch's automatic differentiation engine, called _Autograd_, is a brilliant tool to understand how automatic differentiation works. This will not only help you understand PyTorch better, but also other DL libraries.

Modern neural network architectures can have millions of learnable parameters. From a computational point of view, training a neural network consists of two phases:

- A forward pass to compute the value of the loss function. 
- A backward pass to compute the gradients of the learnable parameters.

The forward pass is pretty straightforward. The output of one layer is the input to the next and so forth. 

The backward pass is a bit more complicated, since it requires us to use the chain rule to compute the gradients of the loss function with respect to the weights.
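
To make the chain rule concrete, here is a tiny sketch with a single weight. The numbers and the squared-error loss are made up for illustration; the point is that the hand-computed chain rule matches what backward produces.

```python
import torch

x, t = torch.tensor(3.0), torch.tensor(1.0)
w = torch.tensor(2.0, requires_grad=True)

y = w * x            # forward: y = 6
L = (y - t) ** 2     # forward: L = 25
L.backward()         # backward: fills w.grad

# Chain rule by hand: dL/dw = dL/dy * dy/dw = 2 * (y - t) * x
manual = 2 * (y.detach() - t) * x
print(manual, w.grad)  # tensor(30.) tensor(30.)
```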

The nodes on the Computational Graph are basically __operators__. These operators are basically the mathematical operators except for one case, where we need to represent the creation of a user-defined variable.

In [2]:
import torch
tsr = torch.Tensor(3, 5)  # uninitialized 3x5 tensor: the values are whatever was in memory (zeros here by chance)
tsr

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

When initializing the variables that are going to be part of the computation, one should set requires_grad = True to make sure they are tracked in the computation graph.

In [4]:
t1 = torch.randn((3,3), requires_grad = True)
t2 = torch.FloatTensor(3,3) # we will use requires_grad outside the initialization
t2.requires_grad = True

In [5]:
t1

tensor([[ 0.0759,  1.0211,  0.0208],
        [-1.6086, -0.2280, -1.8427],
        [-0.3586, -0.8975,  0.4918]], requires_grad=True)

### Note:

requires_grad is __contagious__. It means that when a _Tensor_ is created by operating on other _Tensors_, the requires_grad of the resulting Tensor is set to __True__ if at least one of the tensors used to create it has its requires_grad set to True.

Each _Tensor_ has an attribute called __grad_fn__, which refers to the mathematical operator that created the variable. Notice that it is only set for tensors produced by an operation whose inputs require gradients. For leaf tensors created by the user, or when requires_grad is False, it is None.
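
Both points in this note can be checked directly. A short sketch, with p and q as throwaway names:

```python
import torch

p = torch.randn(3, 3)                      # requires_grad defaults to False
q = torch.randn(3, 3, requires_grad=True)

print((p * p).requires_grad)        # False: no input requires grad
print((p * q).requires_grad)        # True: q requires grad, so the result does too

print(q.grad_fn)                    # None: q is a leaf created by the user
print((p * p).grad_fn)              # None: nothing is being tracked
print((p * q).grad_fn is not None)  # True: the multiplication was recorded
```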

In [7]:
a = torch.randn((3,3), requires_grad = True)
a

tensor([[-0.1827, -0.7592,  0.0770],
        [ 1.1466, -1.2832, -0.3279],
        [ 0.7889, -0.2127, -0.4810]], requires_grad=True)

In [8]:
w1 = torch.randn((3, 3), requires_grad = True)
w2 = torch.randn((3, 3),requires_grad = True)
w3 = torch.randn((3, 3), requires_grad = True)
w4 = torch.rand((3, 3), requires_grad = True)

b = w1 * a
c = w2 * a
d = w3*b + w4*c 
L = 10 - d

In [9]:
print(f"The grad fn for a is, {a.grad_fn}")
print(f"The grad fn for d is, {d.grad_fn}")

The grad fn for a is, None
The grad fn for d is, <AddBackward0 object at 0x7f7e7ffd2ee0>


# Very important explanation:

In our example, where d = f(w3b, w4c), d's grad function is the addition operation, as shown by the computational graph. However, if our Tensor is a leaf node (initialized by the user), then its grad_fn is None.

One can use the member function __is_leaf__ to determine whether a variable is a leaf _Tensor_ or not.

In [10]:
c.grad_fn

<MulBackward0 at 0x7f7e7ffd2880>

## Function

All mathematical operations in PyTorch are implemented through the __torch.autograd.Function__ class. This class has two important member functions we need to look at.

The first is the __forward__ function, which simply computes the output from its inputs (i.e. when calculating the loss function).

The __backward__ function takes the incoming gradient from the part of the network in front of it. The gradient to be backpropagated from a function _f_ is basically the __gradient that is backpropagated to _f_ from the layers in front of it, multiplied by the local gradient of the output of _f_ with respect to its inputs.__ This is exactly what the __backward__ function does.
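
To see forward and backward side by side, here is a sketch of a custom operation. The Square class is a hypothetical example of mine, not something PyTorch ships; it only illustrates the "incoming gradient times local gradient" rule.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example op computing y = x * x."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; backward needs it for the local gradient.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # incoming gradient * local gradient dy/dx = 2x
        return grad_output * 2 * x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)  # tensor([2., 4., 6.])
```

Here the incoming gradient from sum() is a tensor of ones, so x.grad is just the local gradient 2x.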

In [12]:
"""Example: backward on d receives the gradient dL/dd. d is a
non-leaf tensor, though, so d.grad is None unless d.retain_grad()
was called before the backward pass."""
print("Gradient of L wrt d, as stored in the grad attribute of d:")
d.grad

Gradient of L wrt d, as stored in the grad attribute of d:


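The code above is basically telling me that, by default, __the .grad attribute is only populated for nodes that are leaves!__ For a non-leaf tensor such as d, .grad stays None unless retain_grad() is called on it before the backward pass. That answers the question I had yesterday regarding the use of grad. Now proving my point with the leaves _w1, w2, w3, w4_.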
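
A small sketch of the leaf versus non-leaf behaviour. It uses a cut-down graph with one weight (my simplification) and retain_grad() to keep the gradient of the intermediate tensor.

```python
import torch

a = torch.randn(3, 3, requires_grad=True)
w1 = torch.randn(3, 3, requires_grad=True)

b = w1 * a          # non-leaf: created by an operation
b.retain_grad()     # ask autograd to keep b.grad after backward
L = (10 - b).sum()
L.backward()

print(w1.is_leaf, w1.grad is not None)  # True True: leaves get .grad by default
print(b.is_leaf)                        # False
print(b.grad)                           # every entry is -1, since dL/db = -1
```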

# WAIT GENIUS!!!

1) In order to compute derivatives in our NN, we generally call __backward__ on the __Tensor__ representing our loss. 
2) We backtrack through the graph starting from the node representing the __grad_fn__ of our loss. 

As described above, the backward function is called recursively throughout the graph as we backtrack. Once we reach a leaf node, its grad_fn is None, so we stop backtracking through that path.

One thing to note here is that PyTorch gives an error if you call backward() without arguments on a vector-valued Tensor. This means the argument-free form of backward only works on a scalar-valued Tensor. In our example, a is a 3x3 Tensor, so L = 10 - d is not a scalar either, and calling backward on L throws an error.

In [17]:
print("Overwriting the existing data:")


a = torch.randn((3,3), requires_grad = True)

w1 = torch.randn((3,3), requires_grad = True)
w2 = torch.randn((3,3), requires_grad = True)
w3 = torch.randn((3,3), requires_grad = True)
w4 = torch.randn((3,3), requires_grad = True)

b = w1*a 
c = w2*a

d = w3*b + w4*c 

L = (10 - d)
d.grad_fn

Overwriting the existing data:


<AddBackward0 at 0x7f7e7b6e76a0>

In [18]:
L.grad_fn

<RsubBackward1 at 0x7f7e806847f0>

In [19]:
L.is_leaf

False

In [20]:
a.is_leaf

True

In [21]:
L.backward()

RuntimeError: grad can be implicitly created only for scalar outputs

# Note: 

This is because, by definition, a gradient is the derivative of a scalar-valued function with respect to its inputs. You can't exactly take the gradient of a vector with respect to another vector. The mathematical entity used for such cases is called a Jacobian, the discussion of which is beyond the scope of this article.

There are two ways to overcome this.

If you just make a small change in the above code, setting L to be the sum of all the errors, our problem will be solved.

In [22]:
print("Overwriting L")
L = (10 - d).sum()
L.backward()

Overwriting L


In [24]:
w1.grad

tensor([[-0.0447, -0.1020,  0.2793],
        [-0.0362,  0.3686,  0.0630],
        [-0.4692, -0.6442,  1.2812]])

In [26]:
w2.grad

tensor([[ 0.0232, -0.2550, -2.9517],
        [-0.2709, -1.1645, -0.0170],
        [ 2.1593, -0.0626, -1.5296]])
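
These gradients can be checked by hand. Since L = sum(10 - (w3*w1*a + w4*w2*a)), the derivatives are dL/dw1 = -w3*a and dL/dw2 = -w4*a. A sketch that rebuilds the graph with fresh random values (so the numbers differ from the output above) and compares:

```python
import torch

a = torch.randn((3, 3), requires_grad=True)
w1 = torch.randn((3, 3), requires_grad=True)
w2 = torch.randn((3, 3), requires_grad=True)
w3 = torch.randn((3, 3), requires_grad=True)
w4 = torch.randn((3, 3), requires_grad=True)

d = w3 * (w1 * a) + w4 * (w2 * a)
L = (10 - d).sum()
L.backward()

# Compare autograd's result with the hand-derived expressions.
print(torch.allclose(w1.grad, -(w3 * a).detach()))  # True
print(torch.allclose(w2.grad, -(w4 * a).detach()))  # True
```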

# How are PyTorch's graphs different from TensorFlow graphs?

PyTorch creates something called a Dynamic Computation Graph, which means that the graph is generated on the fly.
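
"On the fly" means ordinary Python control flow decides which operations get recorded on each run. A sketch with a made-up forward function: the same weight ends up with different gradients depending on which branch ran.

```python
import torch

def forward(x, w):
    # Which branch runs, and so which graph is built, depends on the data.
    if x.sum() > 0:
        return (w * x).sum()
    return (w ** 2).sum()

w = torch.tensor([1.0, 2.0], requires_grad=True)

out_pos = forward(torch.tensor([1.0, 1.0]), w)
out_neg = forward(torch.tensor([-1.0, -1.0]), w)

# torch.autograd.grad returns the gradients instead of accumulating into w.grad.
(g_pos,) = torch.autograd.grad(out_pos, w)
(g_neg,) = torch.autograd.grad(out_neg, w)
print(g_pos)  # tensor([1., 1.])  gradient of sum(w * x) is x
print(g_neg)  # tensor([2., 4.])  gradient of sum(w ** 2) is 2w
```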

Until the forward function of a Variable is called, there exists no node for the Tensor (its grad_fn) in the graph.

The graph is created as a result of the forward functions of many Tensors being invoked. Only then are the buffers for the non-leaf nodes and intermediate values (used for computing gradients later) allocated for the graph. When you call backward, as the gradients are computed, these buffers (for non-leaf variables) are essentially freed, and the graph is destroyed (in the sense that you can't backpropagate through it again, since the buffers holding the values needed to compute the gradients are gone).

The next time you call forward on the same set of tensors, the leaf node buffers from the previous run will be shared, while the non-leaf node buffers will be created again.
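
The buffer lifetime described above can be observed with the retain_graph flag of backward, which asks autograd to keep the buffers alive for another pass. A minimal sketch on a two-tensor graph (my own simplification):

```python
import torch

a = torch.randn(3, 3, requires_grad=True)
w = torch.randn(3, 3, requires_grad=True)

L = (w * a).sum()
L.backward(retain_graph=True)  # buffers are kept alive
L.backward()                   # works, and frees the buffers this time

second_failed = False
try:
    L.backward()               # the saved values are gone now
except RuntimeError:
    second_failed = True

print(second_failed)                           # True
print(torch.allclose(w.grad, 2 * a.detach()))  # True: two passes accumulated
```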

If you call backward more than once on a graph with non-leaf nodes, you'll be met with the following error.