# Extra lab 1 - Basics of Autograd in PyTorch

In this extra laboratory, we will learn a few concepts about autograd in PyTorch and how to build modules with a custom backward pass.

## Building a computational graph and obtaining derivatives

PyTorch (PT for short) builds a computational graph automatically whenever we apply operations to one or more tensors.

Let us reproduce the example from prof. Manzoni's lecture:

![](img/compgraph.png)

Input (leaf) tensors are shown as yellow circles, intermediate tensors as gray circles, and output tensors as blue circles; operations are shown as black squares.

In [None]:
import torch
from matplotlib import pyplot as plt

We construct each leaf tensor by specifying `requires_grad=True` in the constructor. If we don't, autograd does not track the tensor and no gradient is stored for it.

In [None]:
x_1 = torch.tensor([5.0], requires_grad=True)
x_2 = torch.tensor([-2.1], requires_grad=True)
print(x_1)
print(x_2)
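
As a quick check of the statement above, the following minimal sketch contrasts a leaf built with `requires_grad=True` and one built without it. The names `w`, `c` and `out` are illustrative and not part of the lab.

```python
import torch

w = torch.tensor([5.0], requires_grad=True)  # tracked leaf
c = torch.tensor([5.0])                      # untracked leaf (requires_grad defaults to False)

out = (w * c).sum()
out.backward()

print(w.is_leaf, w.requires_grad, w.grad)  # a gradient is stored on the tracked leaf
print(c.is_leaf, c.requires_grad, c.grad)  # nothing is stored on the untracked leaf
```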

By obtaining and printing `a`, we can see that the tensor has a gradient function (`grad_fn`) attached, which records the operation that produced it.

In [None]:
a = x_1 * x_2
print(a)

In [None]:
b = x_1.cos()
y_1 = a + b
print(y_1)

In [None]:
d = x_1.log()
e = x_2 ** 3
g = d + e
y_2 = g - x_2
print(y_2)

In [None]:
f = torch.stack((y_1, y_2))
print(f)

We ask PyTorch to calculate the gradient with the `backward` method.

Note: without arguments, `backward()` may be called only on tensors holding a single element (scalars)! The call below therefore raises a `RuntimeError`.

In [None]:
f.backward()

Since `f` is not a scalar, we call `backward()` on each of the two scalars composing it instead. We pass `retain_graph=True` to the first call because, each time a vanilla `backward()` is called, PT deletes the underlying computational graph for efficiency reasons.

In [None]:
f[0].backward(retain_graph=True) # equivalent to y_1.backward()
f[1].backward()
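
An alternative to calling `backward()` once per component is to pass an explicit `gradient` argument, which weights each output. The sketch below rebuilds the same graph on fresh tensors (`u_1`, `u_2` and `h` are illustrative names, chosen so that the gradients stored on `x_1` and `x_2` are left untouched) and weights every output by 1.

```python
import torch

u_1 = torch.tensor([5.0], requires_grad=True)
u_2 = torch.tensor([-2.1], requires_grad=True)

h = torch.stack((
    u_1 * u_2 + u_1.cos(),
    u_1.log() + u_2 ** 3 - u_2,
))

# Weighting every output by 1 sums the gradients of the two components,
# which is what two separate scalar backward() calls accumulate.
h.backward(gradient=torch.ones_like(h))

print(u_1.grad)
print(u_2.grad)
```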

Let us analyze the gradients:

In [None]:
e.grad

`e.grad` is `None`: the grad is stored (again, for efficiency reasons) only on the leaf tensors. You may use [hooks](https://www.youtube.com/watch?v=syLFCVYua6Q) to also retain intermediate grads.
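
Besides hooks, a tensor can be asked to keep its own gradient with `retain_grad()`. The following minimal sketch uses illustrative names (`p`, `q`, `r`) that mirror `x_2`, `e` and `y_2 - d` from the example above.

```python
import torch

p = torch.tensor([-2.1], requires_grad=True)  # leaf tensor
q = p ** 3                                     # intermediate (non-leaf) tensor
q.retain_grad()                                # ask autograd to keep q.grad
r = q - p
r.backward()

print(q.grad)  # derivative of r w.r.t. q
print(p.grad)  # derivative of r w.r.t. p, i.e. 3 * p**2 - 1
```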

In [None]:
print(x_1.grad, "\n")
print(x_2.grad)

Let's do it the quick way:

In [None]:
ff = torch.stack([
    x_1 * x_2 + x_1.cos(),
    x_1.log() + x_2 ** 3 - x_2
])
ff[0].backward(retain_graph=True)
ff[1].backward()
print(ff)
print(x_1.grad, "\n")
print(x_2.grad)

Before, we had this

```
tensor([-0.9411]) 

tensor([17.2300])
```

now, this

```
tensor([-1.8822]) 

tensor([34.4600])
```

**Q**: who knows what happened?
*Hint*: it has nothing to do with the fact that we did it _the quick way_.
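
If you want to test your answer, the following minimal sketch may help. It runs the same backward pass several times on a single leaf and resets the stored gradient before the last run; the names `t`, `first`, `second` and `third` are illustrative and not part of the lab.

```python
import torch

t = torch.tensor([5.0], requires_grad=True)

(t ** 2).sum().backward()
first = t.grad.clone()

(t ** 2).sum().backward()   # second backward pass, on a freshly built graph
second = t.grad.clone()

t.grad = None               # reset the stored gradient
(t ** 2).sum().backward()
third = t.grad.clone()

print(first, second, third)
```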

## Building a custom non-parametric module

Basically, we want to create a module which is not controlled by any parameter, be it trainable or non-trainable.

As an example, we might have the Leaky ReLU, an activation function which can be used in place of the better-known ReLU.

$\text{Leaky_ReLU}(x) = \max\{0.01\cdot x, x\}$

![](https://i1.wp.com/clay-atlas.com/wp-content/uploads/2019/10/image-37.png?resize=640%2C480&ssl=1)

We could do it with the basic PyTorch Tensor methods, like we did at the end of Lab2. Suppose though that, for any reason, we did not have an automatic gradient calculation: we would need to build an autograd Function to implement our Leaky ReLU.

An autograd Function inherits from `torch.autograd.Function` and has two compulsory methods: `forward` and `backward`, whose meaning should be obvious to all.

Both methods have a compulsory first argument, the **context** (`ctx` for brevity).
From the context we can retrieve information about the entities involved in the calculation of the gradient.
The context is populated during the `forward` call so that, during the `backward` call, we can recover information such as which tensors were used in `forward` and whether a given tensor requires the grad.

Moreover, the `backward` method needs an additional argument, `grad_output`, which carries the gradient that is _entering_ the Function (be mindful, we're running _backward_, so the gradient _enters the Function_ from its output side, i.e. downstream w.r.t. the forward pass).

In [None]:
class LeakyReLU_Fun(torch.autograd.Function):
    @staticmethod # mind the decorator
    def forward(ctx, input_):
        ctx.save_for_backward(input_) # tensors we will need in backward to compute the gradient
        return torch.max(input_, input_ * 0.01)
    
    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors # the tensors stored in forward (only the input)
        # the local gradient is 1 for non-negative x's, 0.01 for negative x's
        grad_input = torch.ones_like(input_)
        grad_input[input_ < 0] = 0.01
        # chain rule: rescale by the incoming gradient
        grad_input *= grad_output
        # a valid alternative (maybe better performing?):
        #   grad_input = grad_output.clone()
        #   grad_input[input_ < 0] *= 0.01
        return grad_input
        

In [None]:
fun = LeakyReLU_Fun.apply
x = torch.linspace(-5,5,11, requires_grad=True)
y = fun(x)
z = y.sum()
z.backward()

In [None]:
x.grad
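
A hand-written `backward` is easy to get wrong, so it may be worth comparing it against finite differences with `torch.autograd.gradcheck`. The sketch below redefines the Function (using the alternative `backward` mentioned in the comment above) so that it runs on its own; the test points `pts` are an arbitrary choice that avoids the kink at 0, and double precision is used because `gradcheck` expects it.

```python
import torch

class LeakyReLU_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_):
        ctx.save_for_backward(input_)
        return torch.max(input_, input_ * 0.01)

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= 0.01
        return grad_input

# Illustrative test points, away from the non-differentiable point x = 0.
pts = torch.tensor([-2.0, -0.5, 0.7, 3.0], dtype=torch.double, requires_grad=True)

# Compares the analytical gradient from backward with a numerical estimate.
ok = torch.autograd.gradcheck(LeakyReLU_Fun.apply, (pts,), eps=1e-6, atol=1e-4)
print(ok)
```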

In [None]:
class LeakyReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, X):
        return LeakyReLU_Fun.apply(X)

In [None]:
LeakyReLU()(x)
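
As a sanity check, we can verify that the module really is non-parametric and that it agrees with PyTorch's built-in `torch.nn.LeakyReLU` (whose `negative_slope` is set to 0.01 here to match our formula). This is only a sketch: the Function and module are redefined so the block runs on its own, and `pts`, `ours` and `builtin` are illustrative names.

```python
import torch

class LeakyReLU_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_):
        ctx.save_for_backward(input_)
        return torch.max(input_, input_ * 0.01)

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= 0.01
        return grad_input

class LeakyReLU(torch.nn.Module):
    def forward(self, X):
        return LeakyReLU_Fun.apply(X)

pts = torch.tensor([-3.0, -1.0, 0.5, 2.0])
ours = LeakyReLU()
builtin = torch.nn.LeakyReLU(negative_slope=0.01)

print(list(ours.parameters()))                   # the module registers no parameters
print(torch.allclose(ours(pts), builtin(pts)))   # same forward values as the built-in
```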

## Building a custom parametric module

We wish to extend our Leaky ReLU module to the Parametric ReLU: $\text{Param_ReLU}(x) = \max\{\alpha\cdot x, x\}, \alpha \in [0,1)$.

![](https://pytorch.org/docs/stable/_images/PReLU.png)

Parametric ReLU with $\alpha=0.25$

In [None]:
class ParamReLU_Fun(torch.autograd.Function):
    @staticmethod # mind the decorator
    def forward(ctx, input_, alpha:float):
        assert 0 <= alpha < 1, f"alpha should be >= 0 and < 1. Found {alpha}."
        ctx.save_for_backward(input_) # tensors we will need in backward to compute the gradient
        ctx.alpha = alpha # non-tensor values are stored directly on ctx (there is no self in a static method)
        return torch.max(input_, input_ * alpha)
    
    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors # the tensors stored in forward (only the input)
        grad_input = grad_output.clone()
        grad_input[input_<0] *= ctx.alpha
        # one return value per argument of forward: alpha is a plain float, so it gets no gradient
        return grad_input, None

In [None]:
class ParamReLU(torch.nn.Module):
    def __init__(self, alpha):
        super().__init__()
        self.alpha = alpha
    
    def forward(self, X):
        return ParamReLU_Fun.apply(X, self.alpha)

In [None]:
prelu = ParamReLU(0.25)
x = torch.linspace(-5,5,11, requires_grad=True)
y = prelu(x)
z = y.sum()
z.backward()
print(x.grad)
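
The printed gradient should equal $\alpha$ where the input is negative and 1 elsewhere. The following minimal sketch checks this against a closed-form expression built with `torch.where`; the Function is redefined so that the block runs on its own, and `pts` and `expected` are illustrative names.

```python
import torch

class ParamReLU_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_, alpha: float):
        ctx.save_for_backward(input_)
        ctx.alpha = alpha
        return torch.max(input_, input_ * alpha)

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= ctx.alpha
        # One return value per forward argument; the float alpha gets None.
        return grad_input, None

alpha = 0.25
pts = torch.linspace(-5, 5, 11, requires_grad=True)
ParamReLU_Fun.apply(pts, alpha).sum().backward()

# Closed form, using the same convention as backward: alpha for negative inputs, 1 otherwise.
expected = torch.where(pts.detach() < 0, torch.tensor(alpha), torch.tensor(1.0))
print(torch.allclose(pts.grad, expected))
```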

## Building a custom parametric module, with trainable parameters

What if our $\alpha$ within the parametric ReLU was a trainable parameter? I.e., what if the optimizer could update the values of $\alpha$ during training?

In this case, we will not have a single parameter $\alpha$, but a vector $\mathbf{a}$ of the same size as the input of the function.

Moreover, we no longer enforce any condition on $\mathbf{a}$, so we must extend our Parametric ReLU formula to also cover the cases in which $\alpha<0$ or $\alpha>1$. The formula becomes:

$\text{Parametric_ReLU}(x) = \max(0, x) + \alpha \min(0, x)$
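
To see why the formula needs extending, the sketch below compares the two definitions numerically. The helper names `prelu_max` and `prelu_general` are illustrative, and the values of $\alpha$ are arbitrary picks inside and outside $[0,1)$.

```python
import torch

def prelu_max(x, alpha):
    # Formula used so far: max(alpha * x, x)
    return torch.max(alpha * x, x)

def prelu_general(x, alpha):
    # Extended formula: max(0, x) + alpha * min(0, x)
    zeros = torch.zeros_like(x)
    return torch.max(x, zeros) + alpha * torch.min(x, zeros)

pts = torch.linspace(-5, 5, 11)

same_inside = torch.allclose(prelu_max(pts, 0.25), prelu_general(pts, 0.25))
same_outside = torch.allclose(prelu_max(pts, 1.75), prelu_general(pts, 1.75))
print(same_inside)   # the two formulas agree for alpha = 0.25
print(same_outside)  # they disagree for alpha = 1.75
```

For $\alpha = 1.75$ and $x = -1$, the first formula returns $\max(-1.75, -1) = -1$, while the extended one returns $1.75 \cdot (-1) = -1.75$.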

In [None]:
class ParamReLU_Trainable_Fun(torch.autograd.Function):
    @staticmethod # mind the decorator
    def forward(ctx, input_:torch.Tensor, alpha:torch.Tensor):
        # we are not enforcing anymore the condition on alpha
        ctx.save_for_backward(input_, alpha) # tensors we will need in backward to compute the gradients
        zeros = torch.zeros_like(input_)
        return torch.max(input_, zeros) + alpha * torch.min(input_, zeros)
    
    @staticmethod
    def backward(ctx, grad_output):
        input_, alpha = ctx.saved_tensors # the tensors stored in forward (input and alpha)
        # gradient w.r.t. the input: 1 for non-negative input_, alpha for negative input_
        grad_input = grad_output.clone()
        grad_input[input_<0] *= alpha.expand_as(input_)[input_<0]
        
        # gradient w.r.t. alpha: input_ where input_ < 0; elsewhere the function does not depend on alpha -> zero derivative
        grad_alpha = grad_output.clone()
        grad_alpha[input_<0] *= input_[input_<0]
        grad_alpha[input_>=0] = 0
        return grad_input, grad_alpha

In [None]:
# plot the function for various "anomalous" levels of alpha
prelu_fun = ParamReLU_Trainable_Fun.apply
y1 = prelu_fun(x, torch.full_like(x, -0.25))
y2 = prelu_fun(x, torch.full_like(x, 1.75))
y3 = prelu_fun(x, torch.full_like(x, -1.25))
plt.plot(x.detach().numpy(), y1.detach().numpy())
plt.plot(x.detach().numpy(), y2.detach().numpy())
plt.plot(x.detach().numpy(), y3.detach().numpy())

In [None]:
class ParamReLU_Trainable(torch.nn.Module):
    def __init__(self, in_features):
        super().__init__()
        # one trainable alpha per input feature, initialized uniformly in [0, 1)
        self.alpha = torch.nn.Parameter(torch.empty(in_features))
        torch.nn.init.uniform_(self.alpha, 0, 1)
    
    def forward(self, X):
        return ParamReLU_Trainable_Fun.apply(X, self.alpha)

In [None]:
prelu = ParamReLU_Trainable(x.shape)
print(prelu.alpha)

In [None]:
y = prelu(x)
print(y)

In [None]:
z = y.sum() * 2
z.backward()
print(prelu.alpha.grad)
print(x.grad)
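
Since this Function now returns two gradients, a numerical check with `torch.autograd.gradcheck` may be reassuring. The sketch below redefines the Function so that it runs on its own; `pts` and `slopes` are illustrative double-precision tensors of the same shape, with inputs chosen away from 0 and slopes chosen outside $[0,1)$ as well.

```python
import torch

class ParamReLU_Trainable_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_, alpha):
        ctx.save_for_backward(input_, alpha)
        zeros = torch.zeros_like(input_)
        return torch.max(input_, zeros) + alpha * torch.min(input_, zeros)

    @staticmethod
    def backward(ctx, grad_output):
        input_, alpha = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= alpha.expand_as(input_)[input_ < 0]
        grad_alpha = grad_output.clone()
        grad_alpha[input_ < 0] *= input_[input_ < 0]
        grad_alpha[input_ >= 0] = 0
        return grad_input, grad_alpha

pts = torch.tensor([-2.0, -0.5, 0.7, 3.0], dtype=torch.double, requires_grad=True)
slopes = torch.tensor([-0.25, 1.75, 0.3, -1.25], dtype=torch.double, requires_grad=True)

# Checks both returned gradients (w.r.t. the input and w.r.t. alpha) against finite differences.
ok = torch.autograd.gradcheck(ParamReLU_Trainable_Fun.apply, (pts, slopes), eps=1e-6, atol=1e-4)
print(ok)
```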

Let us put this inside our MLP and see how things work out...

In [None]:
from scripts import train
from scripts import mnist

In [None]:
class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28*28, 16),
            ParamReLU_Trainable(16),

            torch.nn.BatchNorm1d(num_features=16),
            torch.nn.Linear(16, 32),
            ParamReLU_Trainable(32),

            torch.nn.BatchNorm1d(num_features=32),
            torch.nn.Linear(32, 24),
            ParamReLU_Trainable(24),

            torch.nn.BatchNorm1d(num_features=24),
            torch.nn.Linear(24, 10)
        )
        
    def forward(self, X):
        return self.layers(X)

In [None]:
net = MLP()

First, let us inspect alphas

In [None]:
def inspect_alphas(net):
    for name, param in net.named_parameters():
        if "alpha" in name:
            print(name, "\n", param, "\n")
inspect_alphas(net)

Then, train our network...

In [None]:
optim = torch.optim.Adam(net.parameters())
loss = torch.nn.CrossEntropyLoss()
num_epochs = 5
trainloader, testloader, _, _ = mnist.get_data()
train.train_model(net, trainloader, loss, optim, num_epochs)

In [None]:
train.test_model(net, testloader)

Actually, our model is not performing badly at all.

Let us check the values of the $\alpha$s after training:

In [None]:
inspect_alphas(net)
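
Comparing two printouts by eye is tedious, so one option is to snapshot the $\alpha$s before training and measure how far they moved. The sketch below is self-contained and purely illustrative: `TinyPReLU`, `snapshot_alphas`, the toy network and the random data are hypothetical stand-ins for the lab's modules, the `scripts` helpers and MNIST, and a single SGD step replaces the full training loop.

```python
import torch

class TinyPReLU(torch.nn.Module):
    # Hypothetical stand-in for ParamReLU_Trainable, built from plain tensor ops.
    def __init__(self, in_features):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.rand(in_features))

    def forward(self, X):
        return torch.clamp(X, min=0) + self.alpha * torch.clamp(X, max=0)

def snapshot_alphas(net):
    # Detached copies, so later optimizer steps do not modify the snapshot.
    return {name: p.detach().clone()
            for name, p in net.named_parameters() if "alpha" in name}

torch.manual_seed(0)
toy_net = torch.nn.Sequential(torch.nn.Linear(4, 3), TinyPReLU(3))
before = snapshot_alphas(toy_net)

# One optimization step on random data, standing in for the real training.
optim = torch.optim.SGD(toy_net.parameters(), lr=0.1)
loss = toy_net(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optim.step()

after = snapshot_alphas(toy_net)
for name in before:
    print(name, (after[name] - before[name]).abs().max().item())
```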