### leaf nodes:
* a PyTorch leaf node is a tensor which is not created by any operation tracked by the autograd engine
* leaf nodes:
    * store gradients
    * are usually the inputs or weights to the forward graph
    * are not created by operations that can be traced back to any tensor that has requires_grad=True
* the other type of node in a PyTorch compute graph is the intermediate node; intermediate nodes:
    * have a grad_fn field that points to a node in the backward graph
    * do not store gradients by default, unless retain_grad() is called on them before the backward pass
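
The distinction above can be checked directly through the `is_leaf` and `grad_fn` attributes. The following is a minimal sketch; the tensor names `w`, `x`, `h`, and `loss` are made up for illustration and do not come from the cells below.

```python
import torch

# Leaf tensors: created directly by the user, not by a tracked operation.
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)  # requires_grad is False here; x is still a leaf

# Intermediate tensor: produced by an operation on a tensor that requires grad.
h = w * x
h.retain_grad()  # ask autograd to keep h.grad after the backward pass
loss = h ** 2

print(w.is_leaf, w.grad_fn)              # True None
print(x.is_leaf, x.grad_fn)              # True None
print(h.is_leaf, h.grad_fn is not None)  # False True

loss.backward()
print(w.grad)  # tensor(24.)  d(loss)/dw = 2*h*x = 2*6*2
print(h.grad)  # tensor(12.)  d(loss)/dh = 2*h, kept only because of retain_grad()
print(x.grad)  # None, since x does not require grad
```

Without the `h.retain_grad()` call, `h.grad` would stay `None` after the backward pass.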


In [None]:
""" leaf nodes & intermediate nodes """

import torch

# leaf node A
a = torch.tensor(1.0, requires_grad=True)
# leaf node B
b = torch.tensor(2.0, requires_grad=True)
# intermediate node C
c = a*b
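c.backward()
print(c.grad)   # None (with a UserWarning): c is an intermediate node
print(a.grad)   # tensor(2.), since dc/da = b
print(b.grad)   # tensor(1.), since dc/db = a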


In [None]:
""" autograd on a tensor """

import torch

x1 = torch.randn((2, 4), requires_grad=True)
x2 = torch.randn((2, 4), requires_grad=True)
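y = x1 + x2
# y is not a scalar, so backward() needs a gradient argument of the same shape;
# here y itself is passed as the upstream gradient
y.backward(y)
print(x1.grad)  # same values as y, because dy/dx1 is the identity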

In [None]:
""" autograd on the output tensor of a layer. """

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
)
x = torch.randn((2, 10), requires_grad=True)
out = model(x)
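# out is not a scalar, so it is passed as its own upstream gradient
out.backward(out)
print(x.grad)    # populated: x is a leaf with requires_grad=True
print(out.grad)  # None (with a UserWarning): out is an intermediate node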

In [None]:
""" .retain_grad() vs requires_grad=True """

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
)

# 1. gradients are populated for neither the leaf tensor x nor the intermediate tensor out
x = torch.randn((2, 10))
out = model(x)
out.backward(out)
print(x.grad)       # None
print(out.grad)     # None

# 2. gradients are populated for the intermediate tensor out1 by calling out1.retain_grad() before the backward pass
x1 = torch.randn((2, 10))
out1 = model(x1)
out1.retain_grad()
# out1.requires_grad=True -> error: can only change requires_grad flag for leaf nodes
# x1.retain_grad() -> error: can't retain_grad() on a tensor that has requires_grad=False
out1.backward(out1)
print(x1.grad)      # None
print(out1.grad)

# 3. gradients are populated for the leaf tensor x2 by setting requires_grad=True
x2 = torch.randn((2, 10), requires_grad=True)
out2 = model(x2)
out2.backward(out2)
print(x2.grad)      # populated, shape (2, 10)
print(out2.grad)    # None

# model parameters (weights and biases) by default have requires_grad=True
# access all (named) parameters in model in the following way:
for name, param in model.named_parameters():
    print(name)
    print(param.shape)
    print(param.requires_grad)  # all True

#### .retain_grad() vs requires_grad=True
* .retain_grad()
    * usually called on intermediate tensors (e.g. outputs);
    * if called on a leaf tensor (e.g. an input), `requires_grad=True` must be set on that tensor first, and in that case the call is redundant because the leaf already stores its gradient;
* requires_grad=True
    * usually set for leaf tensors;
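    * the argument of the `requires_grad_()` method defaults to `True` (see the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad_.html)), but tensors created by factory functions such as `torch.randn` or `torch.tensor` default to `requires_grad=False`;
        * Q: why doesn't `x` have gradients populated by default in the previous example?
            * because `x` was created with the default `requires_grad=False`: input tensors are leaf nodes but usually do not need gradients for training; only `nn.Parameter` tensors (weights / biases) default to `requires_grad=True`;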
* usually in my model, weights (and inputs, when I set the flag explicitly) are leaf tensors with `requires_grad=True`, so their gradients are populated in `.grad` after the backward pass;
* the outputs & hidden_states / activations are intermediate tensors; their gradients are not stored by default; instead they have a field called `grad_fn` that points to the corresponding node in the backward graph so that backprop can pass through them;
    * this mechanism probably saves memory, as only parameter gradients are really needed for training;
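
The defaults discussed above can be confirmed with a short check. This is a minimal sketch, assuming a single `nn.Linear` layer (named `layer` here purely for illustration) in place of the `nn.Sequential` model used earlier.

```python
import torch
import torch.nn as nn

# Tensors from factory functions do not track gradients unless asked to.
x = torch.randn(2, 10)
print(x.requires_grad)  # False

# nn.Parameter (and therefore layer weights and biases) defaults to True.
layer = nn.Linear(10, 20)
print(layer.weight.requires_grad)  # True

# requires_grad_() is the in-place method; its argument defaults to True.
x.requires_grad_()
print(x.requires_grad)  # True

out = layer(x)
out.sum().backward()

# Leaf tensors that require grad have .grad populated...
print(x.grad.shape)             # torch.Size([2, 10])
print(layer.weight.grad.shape)  # torch.Size([20, 10])
# ...while the intermediate tensor carries grad_fn instead of a stored gradient.
print(out.is_leaf, out.grad_fn is not None)  # False True
```

Reducing the output with `out.sum()` gives a scalar, so `backward()` needs no explicit gradient argument here, unlike the `out.backward(out)` calls in the cells above.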