# PyTorch Computational Graphs

Modern neural network architectures can have millions of learnable parameters. From a computational point of view, training a neural network consists of two phases:

* A forward pass to compute the value of the loss function.

* A backward pass to compute the gradients of the learnable parameters.

The forward pass is straightforward: the output of one layer is the input to the next, and so on.

The backward pass is a bit more complicated, since it requires the chain rule to compute the gradients of the loss function w.r.t. the weights.

In PyTorch:

* The autograd package provides automatic differentiation, which automates the computation of the backward pass in neural networks.

* The forward pass of your network defines the computational graph:

    * nodes in the graph are Tensors;

    * edges are the functions that produced the output Tensors from the input Tensors.

* Back-propagation through this graph then gives the gradients.
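
As a minimal sketch of the two phases described above, the snippet below runs a forward pass that records the graph and a backward pass that applies the chain rule. The one-parameter "network" and the names `x`, `target`, `w` and `b` are illustrative assumptions, not part of the original notebook.

```python
import torch

# Illustrative one-parameter "network": y = w * x + b
x = torch.tensor(2.0)                       # input (no gradient needed)
target = torch.tensor(7.0)                  # desired output
w = torch.tensor(3.0, requires_grad=True)   # learnable weight
b = torch.tensor(0.5, requires_grad=True)   # learnable bias

# Forward pass: computes the loss and records the graph
y = w * x + b
loss = (y - target) ** 2

# Backward pass: chain rule from the loss back to the leaves
loss.backward()

# By hand: dloss/dy = 2 * (y - target), dy/dw = x, dy/db = 1
dloss_dy = 2 * (y.item() - target.item())
print(w.grad.item(), dloss_dy * x.item())   # both are dloss/dw
print(b.grad.item(), dloss_dy)              # both are dloss/db
```

The values printed on each line should agree, which is a quick way to convince yourself that autograd is doing nothing more than the chain rule.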



## Tensors: Basic Building Blocks of PyTorch

A Tensor is the fundamental data structure of PyTorch. Tensors behave much like NumPy arrays, except that they can also live on a GPU and take advantage of its parallel computation capabilities. Much of the Tensor syntax mirrors that of NumPy arrays.

In [1]:
import torch

x = torch.Tensor(3, 5)  # allocates a 3x5 tensor without initialising its values
print(x)

tensor([[4.6977e+02, 4.5836e-41, 4.6155e+02, 4.5836e-41, 4.7412e+02],
        [4.5836e-41, 4.6894e+02, 4.5836e-41, 4.6893e+02, 4.5836e-41],
        [4.6156e+02, 4.5836e-41, 4.7194e+02, 4.5836e-41, 4.6157e+02]])
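
`torch.Tensor(3, 5)` returns uninitialised memory, which is why the values above look arbitrary. The sketch below shows a few explicit alternatives and the NumPy round-trip. The variable names are illustrative, and the GPU move is guarded because a CUDA device may not be present.

```python
import numpy as np
import torch

zeros = torch.zeros(3, 5)                        # explicitly initialised to 0
data = torch.tensor([[1.0, 2.0], [3.0, 4.0]])    # built from Python data

# NumPy-style indexing and reductions carry over
col_sum = data[:, 0].sum()                       # 1.0 + 3.0

# Round-trip with NumPy (on the CPU the two objects share memory)
arr = np.ones((2, 2), dtype=np.float32)
from_np = torch.from_numpy(arr)
back = from_np.numpy()

# Move to the GPU only if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = data.to(device)
print(zeros.shape, col_sum.item(), on_device.device)
```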


### Leaf Tensor

All Tensors that have `requires_grad` set to `False` are leaf Tensors by convention. Tensors that have `requires_grad` set to `True` are leaf Tensors if they were created by the user (e.g. the weights of your neural network). This means that they are not the result of an operation, and so their `grad_fn` is `None`.

In short: a Tensor is a leaf if `requires_grad` is `False`, or if `requires_grad` is `True` and it was created directly by the user.
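
The two cases can be checked directly with the `is_leaf` attribute. This is a small sketch, and the tensor names are illustrative.

```python
import torch

x = torch.randn(2)                        # requires_grad=False -> leaf by convention
w = torch.randn(2, requires_grad=True)    # created by the user -> leaf
y = w * 2                                 # result of an operation -> not a leaf
z = y.detach()                            # detached from the graph -> leaf again

print(x.is_leaf, w.is_leaf, y.is_leaf, z.is_leaf)
print(w.grad_fn, y.grad_fn is not None)
```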

### requires_grad

On its own, a Tensor is just like a NumPy ndarray: a data structure that lets you do fast linear algebra operations.

Every Tensor in PyTorch has a flag, `requires_grad`, that allows for fine-grained exclusion of subgraphs from gradient computation and can increase efficiency. If `x` is a Tensor with `x.requires_grad=True`, then after a backward pass `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.

The API can be a bit confusing here. There are multiple ways to initialise tensors in PyTorch. Some let you set `requires_grad` in the constructor itself, while others require you to set it manually after creating the Tensor.

In [7]:
t1 = torch.randn((3,3), requires_grad = True) 

t2 = torch.FloatTensor(3,3) # Legacy constructor: no way to specify requires_grad at creation
t2.requires_grad = True

print(t2.grad)  # None: no backward pass has been run yet

None
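
One common use of this flag is freezing part of a model. The sketch below assumes two small `nn.Linear` layers (called `frozen` and `head` here purely for illustration) and uses the in-place `requires_grad_()` method to switch gradient tracking on or off.

```python
import torch
from torch import nn

# requires_grad_() flips the flag in place on an existing leaf tensor
t = torch.zeros(3, 3)
t.requires_grad_(True)

# Freezing a sub-module: its parameters are excluded from gradient computation
frozen = nn.Linear(4, 4)
head = nn.Linear(4, 1)
for p in frozen.parameters():
    p.requires_grad_(False)

out = head(frozen(torch.randn(2, 4))).sum()
out.backward()

print(frozen.weight.grad)        # None: no gradient was computed for the frozen layer
print(head.weight.grad.shape)    # gradient exists for the trainable layer
```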


`requires_grad` is contagious: when a Tensor is created by operating on other Tensors, its `requires_grad` is set to `True` if at least one of the input Tensors has its `requires_grad` set to `True`.

In [4]:
import torch

x = torch.randn(3,3) # requires_grad=False by default
y = torch.randn(3,3) #requires_grad=False by default
z = torch.randn((3,3),requires_grad=True)
a = x+y # since both x and y don't require gradients, a also doesn't require gradients
print(a.requires_grad) #output: False
b = a+z #since z requires gradient, b also requires gradient
print(b.requires_grad) #output: True

False
True


As seen from the above example, if a single input to an operation requires gradient, its output will also require gradient. Conversely, the output does not require gradient only if none of the inputs do.

### grad_fn

Each Tensor has an attribute called `grad_fn`, which refers to the mathematical operator that created the Tensor. If `requires_grad` is `False`, `grad_fn` is `None`.

If a Tensor is a leaf node (initialised by the user), then the `grad_fn` is also None.

In [6]:
print(x.grad_fn)    # x is a leaf node, no grad_fn
print(y.grad_fn)    # y is a leaf node, no grad_fn
print(z.grad_fn)    # z is a leaf node, no grad_fn
print(a.grad_fn)    # a's requires_grad is False, no grad_fn
print(b.grad_fn)

None
None
None
None
<AddBackward0 object at 0x7fc581b8c100>
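
A `grad_fn` object also links back to the nodes that produced its inputs through its `next_functions` attribute. The sketch below recreates tensors like `a`, `z` and `b` from the cells above and walks one step back through the graph. The helper list `names` is introduced only for illustration.

```python
import torch

a = torch.randn(3, 3)                       # does not require grad
z = torch.randn(3, 3, requires_grad=True)   # leaf that requires grad
b = a + z

print(type(b.grad_fn).__name__)             # the node that produced b

# One entry per input of the addition: (backward node, output index).
# Inputs that do not require grad have no node (None); leaves that do
# are represented by a node that accumulates into their .grad attribute.
names = [None if node is None else type(node).__name__
         for node, _ in b.grad_fn.next_functions]
print(names)
```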


## Function

All mathematical operations in PyTorch are implemented by the `torch.autograd.Function` class. This class has two important member functions we need to look at.

1. The first is its `forward` function, which simply computes the output from its inputs.

2. The second is its `backward` function, which takes the incoming gradient from the part of the network in front of it and computes the gradients with respect to its own inputs.

These concepts can be represented by the following diagram.

<img src="figs/0_p9_fUhKXCf0LWAxh.png">

One thing to note here is that PyTorch raises an error if you call `backward` without arguments on a vector-valued Tensor. By default, you can only call `backward` on a scalar-valued Tensor.

In [8]:
import torch 

a = torch.randn((3,3), requires_grad = True)

w1 = torch.randn((3,3), requires_grad = True)
w2 = torch.randn((3,3), requires_grad = True)
w3 = torch.randn((3,3), requires_grad = True)
w4 = torch.randn((3,3), requires_grad = True)

b = w1*a 
c = w2*a

d = w3*b + w4*c 

L = (10 - d)

L.backward()  # L is a 3x3 Tensor, not a scalar, so this raises

RuntimeError: grad can be implicitly created only for scalar outputs

There are two ways to overcome this.

1. Make a small change to the above code and set `L` to be the sum of all the errors, so that it becomes a scalar.

2. If, for some reason, you absolutely have to call `backward` on a vector-valued Tensor, pass it a `torch.ones` Tensor with the same shape as the Tensor you are calling `backward` on.

In [9]:
import torch 

a = torch.randn((3,3), requires_grad = True)

w1 = torch.randn((3,3), requires_grad = True)
w2 = torch.randn((3,3), requires_grad = True)
w3 = torch.randn((3,3), requires_grad = True)
w4 = torch.randn((3,3), requires_grad = True)

b = w1*a 
c = w2*a

d = w3*b + w4*c 

# Replace L = (10 - d) by
L = (10 - d).sum()

L.backward()

In [10]:
import torch 

a = torch.randn((3,3), requires_grad = True)

w1 = torch.randn((3,3), requires_grad = True)
w2 = torch.randn((3,3), requires_grad = True)
w3 = torch.randn((3,3), requires_grad = True)
w4 = torch.randn((3,3), requires_grad = True)

b = w1*a 
c = w2*a

d = w3*b + w4*c 

# Keep L vector-valued and pass the upstream gradient explicitly
L = (10 - d)

L.backward(torch.ones(L.shape))
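
The two approaches give the same gradients, because summing a Tensor has a derivative of one with respect to each of its elements. The sketch below checks this on a smaller graph; the names `grad_from_sum` and `grad_from_ones` are illustrative.

```python
import torch

w = torch.randn(3, 3, requires_grad=True)
a = torch.randn(3, 3)

# Option 1: reduce to a scalar first
(10 - w * a).sum().backward()
grad_from_sum = w.grad.clone()

# Option 2: pass the upstream gradient explicitly
w.grad = None                      # clear the accumulated gradient
L = 10 - w * a
L.backward(torch.ones_like(L))
grad_from_ones = w.grad.clone()

print(torch.allclose(grad_from_sum, grad_from_ones))
print(torch.allclose(grad_from_sum, -a))   # analytically, dL/dw = -a
```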

In this way, we obtain gradients for every leaf Tensor that requires them, and we can update those Tensors using the optimisation algorithm of our choice.

In [11]:
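learning_rate = 0.5
with torch.no_grad():                 # keep the update itself out of the graph
    w1 -= learning_rate * w1.grad     # in-place update, so w1 stays a leaf Tensor
w1.grad.zero_()                       # reset the gradient, which would otherwise accumulate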
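
The same kind of update is usually delegated to an optimizer. The sketch below assumes a toy one-parameter regression problem; the data and the names `x`, `y` and `w` are illustrative, and `torch.optim.SGD` stands in for whichever optimisation algorithm you prefer.

```python
import torch

x = torch.linspace(-1, 1, 20)
y = 3.0 * x                                  # toy target: the true weight is 3

w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.5)

for _ in range(50):
    optimizer.zero_grad()                    # gradients accumulate, so reset them
    loss = ((w * x - y) ** 2).mean()
    loss.backward()                          # populates w.grad
    optimizer.step()                         # w <- w - lr * w.grad

print(w.item())                              # should be close to 3
```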

## Dynamic Computation Graph

PyTorch creates something called a <b>Dynamic Computation Graph</b>, which means that the graph is generated on the fly.

Until the forward function of an operation is called, there is no node for the resulting Tensor (its `grad_fn`) in the graph.

In [12]:
a = torch.randn((3,3), requires_grad = True)   #No graph yet, as a is a leaf

w1 = torch.randn((3,3), requires_grad = True)  #Same logic as above

b = w1*a   #Graph with node `MulBackward0` is created.
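
Because the graph is built as the code runs, ordinary Python control flow can change its shape from one call to the next. The sketch below uses a hypothetical `forward` helper whose loop length depends on the input data.

```python
import torch

def forward(x, w):
    # Ordinary Python control flow decides how many ops are recorded
    out = x
    steps = 0
    while out.norm() < 10:       # data-dependent stopping condition
        out = out * w
        steps += 1
    return out.sum(), steps

w = torch.tensor(2.0, requires_grad=True)

loss_a, steps_a = forward(torch.tensor([1.0, 1.0]), w)
loss_b, steps_b = forward(torch.tensor([4.0, 4.0]), w)
print(steps_a, steps_b)          # different graph depth for different inputs

loss_a.backward()                # differentiates through exactly steps_a multiplications
print(w.grad)
```

Here `loss_a` equals `2 * w**3`, so its gradient at `w = 2` is `6 * w**2 = 24`.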

The graph is created as a result of the `forward` functions of many Tensors being invoked. Only then are the buffers for the non-leaf nodes and the intermediate values (used for computing gradients later) allocated. When you call `backward`, these buffers (for non-leaf variables) are freed as the gradients are computed, and the graph is effectively destroyed (in the sense that you can no longer backpropagate through it, since the buffers holding the values needed to compute the gradients are gone).

The next time you call `forward` on the same set of tensors, the leaf node buffers from the previous run are shared, while the non-leaf node buffers are created again.

If you call backward more than once on a graph with non-leaf nodes, you'll be met with the following error.

```text
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
```

This is because the non-leaf buffers are destroyed the first time `backward()` is called, so there is no path back to the leaves when `backward()` is invoked a second time. You can disable this buffer-freeing behaviour by passing the `retain_graph=True` argument to the `backward` function.

In [None]:
loss.backward(retain_graph = True)  # `loss` stands for the scalar output of your forward pass
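
The sketch below reproduces both behaviours on a tiny graph. It is illustrative only, and the flag `second_call_failed` is a name introduced here. Note that gradients from repeated backward passes are summed into `.grad`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Without retain_graph, the saved tensors are freed by the first backward()
y = (x * x).sum()
y.backward()
try:
    y.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)

# With retain_graph=True the graph survives the first pass
x.grad = None
y = (x * x).sum()
y.backward(retain_graph=True)
y.backward()
print(x.grad)    # 2 * (2 * x), because the two passes are accumulated
```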

## torch.no_grad()

When we are computing gradients, we need to cache input values and intermediate features, as they may be required to compute the gradients later. This affects the memory footprint of the network.

During inference, we don't compute gradients and thus don't need to store these values. In fact, no graph needs to be created during inference at all, as it would only consume memory needlessly.

PyTorch offers a context manager, called `torch.no_grad()` for this purpose.

In [None]:
with torch.no_grad():
    pass  # inference code goes here

No graph is defined for operations executed under this context manager.
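
The effect is easy to observe on the outputs. The sketch below uses a small `nn.Linear` layer as a stand-in for a trained network; the names `model`, `batch`, `tracked` and `untracked` are illustrative.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)          # stand-in for a trained network
batch = torch.randn(8, 4)

tracked = model(batch)           # normal forward pass: the graph is recorded
with torch.no_grad():
    untracked = model(batch)     # same computation, nothing is recorded

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn)
```

Both calls produce the same values; only the bookkeeping differs.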

## Autograd 

Conceptually, autograd records all of the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule (back-propagation).

Every primitive autograd operator is two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar and computes the gradient of the input Tensors with respect to that same scalar.

To summarize, Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of the computation. Each tensor has a `.grad_fn` attribute that references the Function that created the Tensor (except for Tensors created by the user, since their `grad_fn` is `None`). If you want to compute the derivatives, you can call `.backward()` on a Tensor. After the call to `backward()`, the gradient values are stored as tensors in the `grad` attribute of the leaf Tensors.

For example, suppose you create two Tensors `a` and `b`, at least one of which requires gradient, followed by `c = a/b`. The `grad_fn` of `c` is then `DivBackward0`, the backward function for the `/` operator. As discussed earlier, the collection of these `grad_fn` objects makes up the backward graph. The `forward` and `backward` functions are members of `torch.autograd.Function`. You can define your own autograd operator by defining a subclass of `torch.autograd.Function`.
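
As a sketch of what such a subclass can look like, the hypothetical `Square` operator below implements `x ** 2` with a hand-written backward pass. The class name and the choice of operation are illustrative assumptions, not something defined in the original notebook.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example operator computing x ** 2."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)         # cache the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x       # upstream gradient times d(x^2)/dx

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)                      # custom Functions are invoked via .apply
y.sum().backward()
print(x.grad)                            # 2 * x

c = x / torch.tensor(2.0)
print(type(c.grad_fn).__name__)          # backward node of the built-in / operator
```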

### is_leaf and retain_grad()

`is_leaf`: All Tensors that have `requires_grad` set to `False` are leaf Tensors by convention. Tensors that have `requires_grad` set to `True` are leaf Tensors if they were created by the user. This means that they are not the result of an operation, and so their `grad_fn` is `None`. Only leaf Tensors have their `grad` populated during a call to `backward()`. To get `grad` populated for non-leaf Tensors, you can use `retain_grad()`.

In [14]:
import torch

# Define the graph a,b,c,d are leaf nodes and e is the root node
# The graph is constructed with every line since the 
# computational graphs are dynamic in PyTorch
a = torch.tensor([2.0],requires_grad=True)
b = torch.tensor([3.0],requires_grad=True)
c = torch.tensor([5.0],requires_grad=True)
d = torch.tensor([10.0],requires_grad=True)
u = a*b
t = torch.log(d)
v = t*c
t.retain_grad()
e = u+v

In [18]:
print(f"a is leaf: {a.is_leaf}")
print(f"a grad_fn: {a.grad_fn}")
print(f"a grad: {a.grad}")
print()

print(f"e is leaf: {e.is_leaf}")
print(f"e grad_fn: {e.grad_fn}")
print(f"e grad: {e.grad}")
print()

print(f"t is leaf: {t.is_leaf}")
print(f"t grad_fn: {t.grad_fn}")
print(f"t grad: {t.grad}")

a is leaf: True
a grad_fn: None
a grad: None

e is leaf: False
e grad_fn: <AddBackward0 object at 0x7fc581aef2e0>
e grad: None

t is leaf: False
t grad_fn: <LogBackward0 object at 0x7fc581aef2e0>
t grad: None




The leaves don't have a `grad_fn` but will have gradients. Non-leaf nodes have a `grad_fn` but don't have gradients, unless `retain_grad()` was called on them. Before `backward()` is called, there are no `grad` values at all.

In [19]:
from IPython.display import display, Math

e.backward()
display(Math(fr'\frac{{\partial e}}{{\partial a}} = {a.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial b}} = {b.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial c}} = {c.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial d}} = {d.grad.item()}'))

<IPython.core.display.Math object>

<IPython.core.display.Math object>

<IPython.core.display.Math object>

<IPython.core.display.Math object>
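
The four values displayed by this cell can be checked by hand. The derivation below is worked from the definitions in the cell that built the graph, with $a = 2$, $b = 3$, $c = 5$, $d = 10$, $u = ab$, $t = \ln d$ and $v = tc$ (`torch.log` is the natural logarithm):

$$
\begin{aligned}
e &= u + v = ab + c\,\ln d,\\
\frac{\partial e}{\partial a} &= b = 3, \qquad
\frac{\partial e}{\partial b} = a = 2,\\
\frac{\partial e}{\partial c} &= \ln d = \ln 10 \approx 2.3026,\\
\frac{\partial e}{\partial d} &= \frac{\partial e}{\partial t}\,\frac{\partial t}{\partial d}
  = c \cdot \frac{1}{d} = \frac{5}{10} = 0.5 .
\end{aligned}
$$

The intermediate factor $\partial e / \partial t = c = 5$ is the value that `retain_grad()` makes available as `t.grad`.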

In [20]:
print(f"a is leaf: {a.is_leaf}")
print(f"a grad_fn: {a.grad_fn}")
print(f"a grad: {a.grad}")
print()

print(f"e is leaf: {e.is_leaf}")
print(f"e grad_fn: {e.grad_fn}")
print(f"e grad: {e.grad}")
print()

print(f"t is leaf: {t.is_leaf}")
print(f"t grad_fn: {t.grad_fn}")
print(f"t grad: {t.grad}")

a is leaf: True
a grad_fn: None
a grad: tensor([3.])

e is leaf: False
e grad_fn: <AddBackward0 object at 0x7fc5819ad5b0>
e grad: None

t is leaf: False
t grad_fn: <LogBackward0 object at 0x7fc5819ad5b0>
t grad: tensor([5.])




## References

* [PyTorch 101, Part 1: Understanding Graphs, Automatic Differentiation and Autograd](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)