# Machine Learning

We recommend going through the notebook using Google Colaboratory.

# Tutorial 4: PyTorch and Deep Learning

In this tutorial, we will cover:

- PyTorch
- Autograd, back-propagation
- Modules, `torch.nn`

Authors:

- Prof. Emanuele Rodolà
- Based in part on original material by Dr. Antonio Norelli and Dr. Luca Moschella

Course:

- Lectures and notebooks at https://github.com/erodola/ML-s2-2024/

Today we'll cover a new library: [PyTorch](https://pytorch.org/), one of the most popular Python frameworks for deep learning. We'll then use it to implement our first deep neural network (a Multi-Layer Perceptron). Brace yourselves for the ride!

# PyTorch tensors

Similarly to Numpy's multidimensional arrays we have used so far, PyTorch also provides a data structure for storing high-dimensional data. It's explicitly called a tensor (`torch.Tensor`), and it's a close cousin of our beloved `numpy.ndarray`.

In fact, since we already know how to manipulate Numpy's multidimensional arrays, shifting to tensors is not a problem! There are a few differences though, worth going through.

## Tensor basics

In [None]:
import torch
import numpy as np

torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.int32)

In [None]:
torch.zeros((3,5))

In [None]:
torch.ones((2,5), dtype=torch.float64)

In [None]:
torch.eye(4)

In [None]:
torch.rand(2,2)  # note: in Numpy, we did np.random.rand(2,2)

**Pro tip**: Bookmark the [PyTorch docs](https://pytorch.org/docs/stable/).

In [None]:
torch.randint(0, 100, (3,3))

In [None]:
t = torch.rand((3, 3))
torch.ones_like(t)

⚠️ The **transpose** operation behaves differently in PyTorch than in NumPy!

In [None]:
# in numpy, transpose() is actually permuting *all* the dimensions:
a = np.ones((2, 3, 6))
a.transpose(0, 2, 1).shape

In [None]:
# in pytorch, this is done by using permute():
a = torch.ones((2, 3, 6))
a.permute(0, 2, 1).shape

In [None]:
# instead, pytorch's transpose() is more intuitive and only swaps *two* dimensions:
a = torch.ones((2, 3, 6))
a.transpose(1, 2).shape

In [None]:
# in numpy, you can call transpose() without arguments to transpose 2d arrays
a = np.ones((2, 3))
a.transpose()

In [None]:
# in pytorch, this doesn't work at all!
a = torch.ones((2, 3))
a.transpose()

In [None]:
# use T instead:
a.T

If you are a fan of `einsum`, you can still use it in PyTorch:

In [None]:
A = torch.rand((2, 3))
b = torch.rand(3)
torch.einsum('ij, j -> i', A, b)

**Reshaping, indexing, slicing, broadcasting, stacking, concatenating**, and **adding new dimensions** work as in Numpy:

In [None]:
a = torch.arange(12)
a.reshape(2, 2, 3)

In [None]:
a.reshape(1, -1)

In [None]:
a.reshape(-1)

Differently from Numpy, you can add dimensions using `unsqueeze`:

In [None]:
b = a.unsqueeze(0)  # add a new dimension at the beginning
b.shape

In [None]:
b = a.unsqueeze(-1)  # add a new dimension at the end
b.shape

In Numpy, we would have done it by using `np.newaxis` or `None`:

In [None]:
b = a[:, None]
b.shape

Also, the `axis` argument of Numpy functions is called `dim` in PyTorch:

In [None]:
a = np.random.rand(2, 3)
np.sum(a, axis=1)

In [None]:
a = torch.rand((2, 3))
torch.sum(a, dim=1)

Torch also adds new functions for tensor manipulation that we don't have in Numpy, such as [gather](https://pytorch.org/docs/stable/generated/torch.gather.html).
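
As a small illustration of `gather`, the sketch below picks one entry per row of a matrix, given a column index for each row. The tensors `scores` and `labels` are made-up values for this example, not data used elsewhere in the notebook.

```python
import torch

# Hypothetical scores: 2 samples (rows) x 3 classes (columns)
scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])
labels = torch.tensor([1, 0])  # the column to pick in each row

# index must have the same number of dimensions as the input, hence unsqueeze(1)
picked = torch.gather(scores, dim=1, index=labels.unsqueeze(1)).squeeze(1)
print(picked)  # tensor([0.7000, 0.5000])
```

For this simple case, the same result can be obtained with advanced indexing: `scores[torch.arange(2), labels]`.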

**Type conversion** of tensors is also made easier in PyTorch:

In [None]:
a = torch.rand(3, 3) + 0.5

In [None]:
a.int()

In [None]:
a.long()

In [None]:
a.float()

In [None]:
a.double()

In [None]:
a.bool()

In [None]:
a.to(torch.double)

In [None]:
a.to(torch.uint8)

In [None]:
a.bool().int()

Finally, one can easily **convert** to/from Numpy arrays:

In [None]:
t = torch.rand((3, 3), dtype=torch.float32)
t.numpy()

In [None]:
n = np.random.rand(3,3).astype(np.float16)
torch.from_numpy(n)
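
One detail worth keeping in mind: for CPU tensors, `torch.from_numpy` and `Tensor.numpy()` do not copy the data, so the array and the tensor share the same memory. The short sketch below (with made-up variable names) shows the effect, and contrasts it with `torch.tensor`, which makes a copy.

```python
import numpy as np
import torch

n = np.zeros(3, dtype=np.float32)

shared = torch.from_numpy(n)  # shares memory with n
copied = torch.tensor(n)      # independent copy of the data

n[0] = 1.0                    # modify the numpy array in place

print(shared)  # the change is visible here
print(copied)  # still all zeros

back = shared.numpy()         # also shares memory with the tensor
shared[1] = 2.0               # modify the tensor in place
print(back)                   # the change is visible in the array
```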

If Numpy's `ndarray` is so similar to Torch's `tensor`, why should we prefer the latter to do Deep Learning?

In fact, there are two important distinctions:

- ``Tensor`` supports GPU computations.
- ``Tensor`` may store extra information needed for **back-propagation**:
  - A `grad` attribute storing the gradient of the loss w.r.t. the tensor.
  - A node representing an operation in the computational graph that produced this tensor.

The **device** of a tensor indicates the memory in which the tensor is currently stored: RAM (denoted as ``cpu``) or GPU memory (denoted as ``cuda``).

In [None]:
t = torch.rand((3,5))
t.device

## Using the GPU

Thanks to the explosion of the videogame industry in the last 50 years, the performance of the chips specialized in rendering and processing graphics --known as GPUs-- has dramatically improved.

In 2007 NVIDIA realized the potential of parallel GPU computing outside the videogame world, and released the first version of the CUDA framework, allowing software developers to use GPUs for general purpose processing.

Graphics operations are mostly linear algebra operations, and accelerating them can prove very useful in many other fields.

In 2012 Krizhevsky, Sutskever, and Hinton [demonstrated](https://en.wikipedia.org/wiki/AlexNet) the huge potential of GPUs in training deep neural networks, starting *de facto* the glorious days of deep learning.

In [None]:
# Check if the GPU is available
torch.cuda.is_available()

In [None]:
# If available use the GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

In [None]:
t = torch.rand((2,3,7))
t = t.to(device)  # Note that we are assigning back to t, otherwise t won't be updated!
t

In [None]:
# Construct tensors directly on the target device (GPU memory, if available)
t = torch.ones((5, 5), device=device)
t

In [None]:
t = torch.rand((3,3))

# you can also do this, but be careful: the code will not run if a GPU is not available
t = t.cuda()
t

In [None]:
t = t.cpu()
t

> **EXERCISE**
>
> Build a tensor $X \in \mathbb{R}^{k \times k}$ **on GPU**, filled with zeros and the sequence $[0, ..., k-1]$ along the diagonal.

In [None]:
# ✏️ your solution
k = 12

In [None]:
# @title 👀 Solution

k = 12
X = torch.diag(torch.arange(k)).to('cuda')
X

## Reproducibility

As mentioned in the previous notebooks, we are going to ensure that all the RNGs used in different parts of this notebook produce the same sequence of numbers each time. Let's add PyTorch's generators as well:

In [None]:
import torch
import numpy as np
import random
torch.manual_seed(42)      # PyTorch CPU
np.random.seed(42)         # NumPy
random.seed(0)             # Python built-in
torch.cuda.manual_seed(0)  # PyTorch GPU

For some operations, cuDNN (NVIDIA's library for deep neural networks) uses algorithms that can produce different results on different runs, *even* with the same inputs and the same seed. Below we set `deterministic = True`, forcing cuDNN to use deterministic algorithms where possible. This might limit cuDNN to a subset of algorithms that might not be  the most efficient, but will produce the same results across different runs.

In [None]:
torch.backends.cudnn.deterministic = True

Finally, when `benchmark = True`, cuDNN will automatically find the most efficient algorithms for your specific operations based on your network architecture and input sizes. This can greatly improve performance. However, since the selection of algorithms might change from one run to another, this can lead to non-deterministic behavior. Setting it to `benchmark = False` prevents cuDNN from dynamically selecting algorithms, thus improving reproducibility at the cost of potential performance gains.

In [None]:
torch.backends.cudnn.benchmark = False
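
The settings above can be grouped in a single helper, so that one call re-seeds everything. This is only a sketch: the name `seed_everything` is our own choice (not a PyTorch function), and it uses the same seed for all generators.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed all the RNGs used in this notebook (hypothetical helper)."""
    random.seed(seed)                 # Python built-in
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch CPU
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU(s); ignored if CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Re-seeding with the same value reproduces the same random numbers
seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```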

# PyTorch Datasets

So far we have handled the data either ourselves by hand, or by using Scikit-learn's provided datasets. PyTorch provides convenient classes to handle the data and make its manipulation as painless as possible!

[`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset) is an abstract class representing a dataset. Your custom dataset should inherit from `Dataset` and override the following methods:

- `__len__`: so that `len(dataset)` returns the size of the dataset
- `__getitem__`: so that `dataset[i]` returns the $i$-th sample from the dataset.

Let's create a toy dataset:

In [None]:
from torch.utils.data import Dataset

class ToyDataset(Dataset):

  def __init__(self, n_points: int = 20, noise: float = .1):
    super().__init__()

    self.n_points = n_points

    # these two lines pre-load the entire dataset in memory
    self.x = torch.linspace(-1, 1, n_points)
    self.y = self.x ** 3 + noise * torch.randn(n_points)

  def __len__(self):
    return self.n_points

  def __getitem__(self, idx):
    return {
        'x': self.x[idx],
        'y': self.y[idx]
    }

In this case the dataset is composed of simple pairs:

In [None]:
toydataset = ToyDataset(20, noise=.1)
toydataset[2]

In [None]:
import plotly.express as px
fig = px.scatter(x=toydataset.x.numpy(), y=toydataset.y.numpy())
fig.update_layout(width=400, height=300)
fig.show()

> **NOTE**
>
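> Small Python reminder. Every object that implements `__getitem__` with integer indices starting from 0 can be iterated: Python falls back to the legacy sequence protocol (see the [iterator protocol](https://www.python.org/dev/peps/pep-0234/#python-api-specification)), calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. It means that you can **iterate** the dataset: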

In [None]:
from tqdm.notebook import tqdm as tqdm   # just a progress bar

for sample in tqdm(toydataset):  # wrap the iterable in tqdm and you're done
  pass

⚠️ In `ToyDataset` we stored the whole dataset **in memory**, i.e. in the attributes `x`, `y` of the `ToyDataset` class.
This is the fastest and simplest way to implement a dataset, but it is **not always feasible**.
What if you must train a neural network on 500GB of images?
The whole dataset does not fit in memory!

We can instead implement *lazy loading*: we load each item **only when it's needed** -- and even apply some preprocessing on the fly.

Example:

```python
from pathlib import Path
from typing import Sequence

from torch.utils.data import Dataset

class LazyDataset(Dataset):

  def __init__(self, file_paths: Sequence[Path]):
    super().__init__()  
    self.file_paths = file_paths

  def __len__(self):
    return len(self.file_paths)

  def __getitem__(self, idx):
    sample_path = self.file_paths[idx]

    # -> Load sample_path in memory
    # -> Perform some lightweight preprocessing
    # -> Generate (sample_input, sample_output)

    return {
        'x': sample_input,
        'y': sample_output
    }

```
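
To make the lazy-loading idea concrete, here is a minimal runnable sketch. It assumes, purely for illustration, that each sample is a small tensor saved to its own `.pt` file in a temporary folder; the class name `LazyTensorDataset` and the file layout are hypothetical.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset

class LazyTensorDataset(Dataset):
    """Loads one tensor file from disk only when the item is requested."""

    def __init__(self, file_paths):
        super().__init__()
        self.file_paths = list(file_paths)  # only the paths are kept in memory

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        x = torch.load(self.file_paths[idx])  # disk access happens here
        return {'x': x, 'y': x.sum()}         # toy target computed on the fly

# Create a few dummy files to read from (illustrative data only)
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(5):
    path = tmp_dir / f'sample_{i}.pt'
    torch.save(torch.full((3,), float(i)), path)
    paths.append(path)

lazy_dataset = LazyTensorDataset(paths)
item = lazy_dataset[2]
print(len(lazy_dataset), item)
```

Nothing is read from disk when the dataset is constructed: the cost is paid per item, inside `__getitem__`.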

The dataset can return any type of object, i.e. you are *not* forced to return a dictionary of tensors:

In [None]:
from torch.utils.data import Dataset

class AnotherDataset(Dataset):
  def __init__(self):
    super().__init__()
    self.myitems = torch.arange(100)

  def __len__(self):
    return len(self.myitems)

  def __getitem__(self, idx):
    return f'Sample{idx}', self.myitems[idx], None, 3.5

dataset = AnotherDataset()
dataset[5]

However, returning a dictionary makes your code more readable, and makes the creation of mini-batches (for SGD) way easier through the `DataLoader` (next section).

⚠️ Do *not* return tensors that are stored on the GPU memory, as it [causes problems](https://pytorch.org/docs/stable/data.html#multi-process-data-loading) with the multiprocessing behavior of the DataLoader. There's a better way to achieve this via _memory pinning_, as we'll see further below.

# PyTorch DataLoader

[`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) is an iterable that provides:

- Data batching
- Data shuffling
- Parallel data loading using `multiprocessing` workers: while the GPU is performing some computation on a batch, the next batch can be loaded in parallel.


Creating a dataloader from a dataset is straightforward. Here's an example that highlights some of the most used parameters:

In [None]:
from torch.utils.data import DataLoader

toydataset = ToyDataset(200)
toyloader = DataLoader(toydataset,
                       batch_size=8,    # number of elements in each batch
                       shuffle=True,    # shuffle the dataset
                       num_workers=4,   # number of worker subprocesses loading batches in parallel
                       pin_memory=True  # return memory-pinned tensors, see below for an explanation!
                       )

# 200 iterations
for sample in tqdm(toydataset):
  pass
print(f"SAMPLE: {sample}")

# 25 iterations
for batch in tqdm(toyloader):  # there is some overhead when using multiple workers!
  pass
print(f"BATCH: {batch}")

Did you notice how the batch samples were put together by the `DataLoader`? The batch is not simply a list of samples; rather, the data loader **collated** the samples into a dictionary with two keys (`x` and `y`) and populated these with tensors. This is why making your `Dataset` return dictionaries is useful!

If your `Dataset` returns something else, you must manually specify *how* to put the samples together to form a batch. You do this by defining a custom collate function, and passing it to the `collate_fn` parameter of the `DataLoader` (see the [docs](https://pytorch.org/docs/stable/data.html#loading-batched-and-non-batched-data)).
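
As an example of a custom collate function, consider samples of different lengths, which cannot be stacked into a single tensor directly. The sketch below pads them to a common length; the dataset `VarLenDataset` and the function `pad_collate` are hypothetical names made up for this illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class VarLenDataset(Dataset):
    """Toy dataset whose samples are 1D tensors of different lengths."""

    def __init__(self):
        super().__init__()
        self.items = [torch.arange(n) for n in (2, 5, 3, 4)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

def pad_collate(samples):
    # samples is the list of items returned by __getitem__ for this batch
    lengths = torch.tensor([len(s) for s in samples])
    padded = pad_sequence(samples, batch_first=True, padding_value=0)
    return {'x': padded, 'length': lengths}

loader = DataLoader(VarLenDataset(), batch_size=2, collate_fn=pad_collate)
first_batch = next(iter(loader))
print(first_batch['x'].shape)   # torch.Size([2, 5])
print(first_batch['length'])    # tensor([2, 5])
```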

Remember: a `Dataset` can be directly indexed:

In [None]:
toydataset = ToyDataset(20, noise=.1)
toydataset[2]

But a `DataLoader` is an _iterable_ and can't be directly indexed:

In [None]:
toyloader = DataLoader(toydataset, batch_size=8)

try:
  toyloader[0]  # NOT OK
except Exception as e:
  print('Error:', e)

print('')

for b in toyloader:  # OK
  print(b)

## 📖 Memory pinning

PyTorch tensors support [memory pinning](https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/):

![](https://devblogs.nvidia.com/wp-content/uploads/2012/12/pinned-1024x541.jpg)

Pinned tensors enable:
- **Much faster copies** from CPU to GPU.
- **Asynchronous GPU copies**: *while* the tensor is being transferred, the CPU code continues if it doesn't need that tensor! To enable this, just pass an additional `non_blocking=True` argument to a `to()` or a `cuda()` call.

Note that differently from data transfer, GPU operations (e.g. tensor product) are [asynchronous by default](https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution).

Let's see how to pin a tensor by hand:

In [None]:
t = torch.rand(100)
t.is_pinned()

In [None]:
# be sure to use a GPU runtime in Colab!

t = t.pin_memory()  # reassigning to t, because pin_memory() is not in-place

In [None]:
t.is_pinned()

From now on, whenever there is the necessity to transfer `t` from CPU to GPU, the transfer will be more efficient. In addition, if the transfer is done using the `non_blocking=True` option in a `to()` call, the transfer will happen asynchronously!
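
Putting the two pieces together, a transfer of a pinned tensor could look like the sketch below. It is written so that it also runs on a CPU-only machine, in which case the tensor is not pinned and the `to()` call is a no-op; the variable names are our own.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data = torch.rand(1024, 1024)
if torch.cuda.is_available():
    data = data.pin_memory()  # pinning requires a CUDA-enabled runtime

# With a pinned source tensor and a GPU target, this call returns immediately
# and the copy proceeds in the background.
on_device = data.to(device, non_blocking=True)

# CPU work that does not depend on `on_device` can be done here...

result = on_device.sum()  # later GPU operations are queued after the copy
print(result.item())      # .item() waits for the result to be ready
```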

Passing `pin_memory=True` to a `DataLoader` will automatically put the fetched data tensors in pinned memory. Note that `DataLoader` only knows how to pin standard types like Tensor, Map and Sequence of Tensors. If you want to pin some custom type, read more [here](https://pytorch.org/docs/stable/data.html#memory-pinning) (tldr: define a `pin_memory()` method on your custom type(s)).

In [None]:
batch['x'].is_pinned()

## Potential bottlenecks

In general, you want to make sure that data loading is not a bottleneck in your pipeline. **Your GPU must not wait for data**!

- **Check resource usage**

Check the GPU (or CPU) usage: if it is ~$100\%$, it's being fully used. This is good!

Otherwise you may have a bottleneck somewhere, or your data operations may not be GPU friendly (e.g. small batches).

- **Check data loading speed**

Iterate over the `Dataset` or `DataLoader` and check the data loading speed by counting the number of items that are loaded per second. Then compare this to how many items per second are processed by the rest of the pipeline.

If you can load more items than you can process in the training loop, it means you _don't_ have a bottleneck in your data loading. The GPU is not waiting for you, good job!

You can use the [tqdm](https://github.com/tqdm/tqdm) package to easily check the iteration speed of any iterable with a minimal overhead.

In [None]:
# Checking the loading speed
from tqdm.notebook import tqdm as tqdm

toydataset = ToyDataset(20000)
toyloader = DataLoader(toydataset,
                       batch_size=8,
                       shuffle=True,
                       num_workers=4,
                       pin_memory=True
                       )
for batch in tqdm(toyloader):
  pass

# Example:
# tqdm reports 350.00it/s (iterations per second) for the loader.
# tqdm reports 280.12it/s for the training step given the data.
# -> data loading is NOT a bottleneck!
#
# This scenario means your computation time is the primary factor in how long
# each iteration takes, and the data loading process is efficient enough to keep
# up with the computational demands.

How to fix a bottleneck in the data loading?

1) If your dataset fits in memory, load it beforehand.

2) Tune the `num_workers` and `batch_size` parameters of the `DataLoader`, paying attention that the batch size will have a direct impact on the training. A good default for `num_workers` is the number of cores in your CPU.

3) Do not preprocess data on the fly, but save the preprocessed files to disk.

4) Consider changing the way in which the data files are stored on disk (e.g. another [format](https://www.h5py.org/) or a [database](https://github.com/google/leveldb)).

🌐 Keep in mind that the [`torchvision`](https://pytorch.org/vision/stable/index.html) package provides some common datasets and transforms. We'll use it in this notebook!

> **EXERCISE**
>
> Suppose you are creating a neural network to restore noisy images from the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset. Given as input a corrupted image, the model will output a corresponding uncorrupted version.
>
> In this exercise, you are only concerned with the data loading steps.
>
> 1) Simulate the corruption by applying your favorite among: random pixel noise, random black patches, random crop, random reflections, or all the previous together.
>
> 2) Create the corresponding `Dataset` and `DataLoader`.
>
> 3) Plot the images in a **batch** to ensure you are doing everything right.
>
>  *hint*: you may use [`torchvision.datasets`](https://pytorch.org/vision/stable/datasets.html) and [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html)

In [None]:
# ✏️ your code here

# Logistic regression with PyTorch

We have seen how to do this with Scikit-learn, or by implementing everything by ourselves from scratch. Let's now do this with PyTorch. Why? Because it brings some interesting gimmicks along!

## Data loading

Instead of downloading the MNIST data and writing our own dataset code, we can use torchvision `datasets.MNIST`, which already inherits from torch `Dataset`, to do the job more quickly. Let's do that!

In [None]:
import torch
import torchvision
from torchvision import datasets, transforms
from tqdm import tqdm

train_dataset = datasets.MNIST(
    './',
    train=True,
    download=True,
    transform=transforms.Compose([
        # transforming images to pytorch tensors (pixel values scaled to [0, 1])
        transforms.ToTensor(),

        # normalizing the tensors: subtract 0.1307 and divide by 0.3081, i.e. the mean and stddev
        # of the whole MNIST training set, so that pixel values have roughly zero mean and unit stddev.
        # Check https://stats.stackexchange.com/questions/211436/why-normalize-images-by-subtracting-datasets-image-mean-instead-of-the-current
        # for an intuition.
        transforms.Normalize((0.1307,), (0.3081,))
    ])
)

test_dataset = datasets.MNIST('./', train=False,
                    transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])
                )
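
To see what the normalization step does, recall that `transforms.Normalize(mean, std)` computes `(x - mean) / std` for each channel. The sketch below mimics this with plain tensor operations on a random stand-in batch (not real MNIST data), using the statistics of the batch itself.

```python
import torch

torch.manual_seed(0)

# Stand-in for a batch of grayscale images with values in [0, 1]
fake_images = torch.rand(64, 1, 28, 28)

# Statistics computed over the whole "dataset", as done once for MNIST
mean = fake_images.mean()
std = fake_images.std()

# Same arithmetic as transforms.Normalize((mean,), (std,))
normalized = (fake_images - mean) / std

print(normalized.mean().item())  # close to 0
print(normalized.std().item())   # close to 1
```

Individual samples will generally not have exactly zero mean and unit standard deviation; only the dataset as a whole does.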

It is always a good idea to look at some entries in the dataset.

**Note:** `train_dataset` and `test_dataset` are objects of the `torchvision.datasets.MNIST` class. Check the [docs](https://pytorch.org/vision/0.16/generated/torchvision.datasets.MNIST.html) to see how to use them.

In [None]:
import plotly.express as px

mnist_example = train_dataset[42][0][0].numpy()  # what are all those indices? investigate!
print('A MNIST sample has size', mnist_example.shape)
fig = px.imshow(mnist_example)
fig.update_layout(width=400, height=300)
fig.show()

**Is this a 1 or a 7?**

This is the existential question we will try to answer today.

Let's proceed by selecting only the 1 and 7 samples from the MNIST dataset.

In [None]:
for dataset in [train_dataset, test_dataset]:
    mask_sevens = dataset.targets == 7
    mask_ones = dataset.targets == 1

    # re-map 7s to have label 0 and 1s to have label 1
    dataset.targets[mask_sevens] = 0
    dataset.targets[mask_ones] = 1

    # only keep 7s and 1s
    dataset.targets = dataset.targets[mask_sevens | mask_ones]
    dataset.data = dataset.data[mask_sevens | mask_ones]

Let's wrap the dataset in a pytorch DataLoader.

Notice that by using `batch_size=len(dataset)`, each batch contains the entire dataset. This is not common; we are doing this here just because we don't really need smaller batches in this part of the notebook.

In [None]:
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=len(train_dataset), shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=len(test_dataset), shuffle=True)

**7 or 1**?

How difficult is this task?

It is always important to have an idea of how much *intelligence* you expect from your AI application.

>**EXERCISE (warm-up)**: Combine several samples in a grid and plot a big picture. Then try to classify each one as a 1 or a 7, using your own judgement. What is your classification accuracy?

In [None]:
# ✏️ your code here
# hint: use `torchvision.utils.make_grid` and `px.imshow`

images = next(iter(train_dataloader))  # we only get one batch of data, since it contains the entire dataset
# ...

In [None]:
# @title 👀 Solution  { run: "auto" }

cols = 4  #@param {type:"slider", min:1, max:8, step:1}
rows = 3 #@param {type:"slider", min:1, max:4, step:1}

grid_images = images[0][:rows*cols, ...]
resolved_grid = torchvision.utils.make_grid(grid_images, padding=4, nrow=cols, normalize=True, value_range=(0, 1))
px.imshow(resolved_grid.permute(1, 2, 0))

## Training

What's PyTorch's counterpart for Scikit-learn's `fit()`?

Before we find out, we are going to implement our own SGD, just like we did in the previous notebook -- however, this time around **we won't have to compute the gradient by hand**. Gradients will be computed automagically🪄 with the `backward()` method.

We can reuse the `model` function from our previous notebook:

In [None]:
def model(xb):
  return torch.nn.functional.log_softmax(xb @ weights + bias, dim=1)

See how we are using torch's implementation of `log_softmax`. This is convenient, since it also takes care of numerical errors that might arise when using exponential and logarithms (they did in my tests!). In fact, let's use torch's `nll_loss` as well, instead of our own implementation:

In [None]:
X_train = images[0].reshape(-1, 28*28)
y_train = images[1]
n_classes = int(y_train.max()) + 1  # plain Python int, so it can be used as a tensor size

weights = torch.rand(X_train.shape[1], n_classes)
bias = torch.rand(1, n_classes)

preds = model(X_train)
torch.nn.functional.nll_loss(model(X_train), y_train)  # nll_loss wants log-probabilities as input
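
The numerical issue mentioned above is easy to reproduce. The sketch below compares a naive log-softmax, written directly from its definition, with torch's `log_softmax` on deliberately large, made-up logits.

```python
import torch

# Made-up logits, large enough for exp() to overflow in float32
logits = torch.tensor([[1000., 0.],
                       [0., 1000.]])

# Naive implementation: exp(1000) overflows to inf, and inf / inf gives nan
exp = torch.exp(logits)
naive = torch.log(exp / exp.sum(dim=1, keepdim=True))

# torch's implementation avoids the overflow
stable = torch.nn.functional.log_softmax(logits, dim=1)

print(naive)   # contains nan and -inf
print(stable)  # finite values: 0 for the large logit, -1000 for the other
```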

> **EXERCISE:** Adapt your NLL code from the previous notebook to work with torch tensors, and compare the loss value you get with the one obtained by torch `nll_loss`.

All that's missing is the gradient, and then we'll have all we need to start training with gradient descent. Let's start over by initializing the weights and biases at random, this time with a small modification:

In [None]:
weights = torch.rand((X_train.shape[1], n_classes), requires_grad=True)
bias = torch.rand((1, n_classes), requires_grad=True)

The parameter `requires_grad=True` is telling torch that later we will compute the gradient of some function (i.e. the loss) with respect to `weights` and `bias`. In other words, these will be leaves🍃 of a computational graph!
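
A quick way to check which tensors are leaves is the `is_leaf` attribute, together with `grad_fn`, which points to the operation that produced a tensor. The sketch below uses small made-up tensors rather than the actual `weights` and `bias`.

```python
import torch

w = torch.rand(3, requires_grad=True)  # created by the user: a leaf
y = (w * 2).sum()                      # produced by operations on w: not a leaf

print(w.is_leaf, w.grad_fn)  # True None
print(y.is_leaf)             # False
print(y.requires_grad)       # True: y depends on a tensor that requires grad
print(y.grad_fn)             # the node in the graph that produced y
```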

Here's our implementation of gradient descent. Read the comments!

In [None]:
import matplotlib.pyplot as plt

lr = 0.01
max_iters = 200
losses = torch.zeros(max_iters)

for it in tqdm(range(max_iters)):

  # -> forward pass of autodiff
  loss = torch.nn.functional.nll_loss(model(X_train), y_train)

  # <- backward pass of autodiff
  loss.backward()

  # Note that loss is a rank-0 tensor (not a plain scalar).

  # After executing backward(), the .grad attributes of weights and bias
  # (which had requires_grad=True) contain the gradient of the loss wrt them.

  # Clearly, we are not interested in computing the gradient *of the update operations themselves*.
  # It sounds obvious, but if we don't tell PyTorch, it will attempt to compute those derivatives!
  # We use the torch.no_grad() context manager to do so.

  with torch.no_grad():  # Disable gradient computation since we are only updating the parameters

    losses[it] = loss.item()

    # downhill step
    weights -= lr * weights.grad
    bias -= lr * bias.grad

    # reset the gradients. otherwise, the next backward() call will accumulate!
    weights.grad = None
    bias.grad = None

plt.figure(figsize=(6, 3))
plt.plot(losses, color='black', label='GD')
plt.xlabel('epochs')
plt.ylabel('training loss')
plt.legend()
plt.show()

> **EXERCISE:** Implement SGD with mini-batches of size $m=50$ and add the SGD curve to the previous plot. Use `train_dataloader` to load the mini-batches, and remember to set the correct batch size.

### Optimizers

While SGD is very effective, PyTorch also provides several alternative **optimizers** that turn out to be useful in many practical cases. We don't need to implement them as we did with SGD, but rather we can use the [`torch.optim`](https://pytorch.org/docs/stable/optim.html) library. We'll try the [Adam optimizer](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/), which many suggest as the default optimization method for deep learning:

In [None]:
weights = torch.rand((X_train.shape[1], n_classes), requires_grad=True)
bias = torch.rand((1, n_classes), requires_grad=True)

lr = 0.01
max_iters = 200
losses = torch.zeros(max_iters)

adam = torch.optim.Adam([weights, bias], lr=lr)  # Instantiate the optimizer

for it in tqdm(range(max_iters), desc='Training'):

  # ->
  loss = torch.nn.functional.nll_loss(model(X_train), y_train)
  # <-
  loss.backward()

  # no need to manually update the parameters or reset the gradients!
  adam.step()
  adam.zero_grad()

  losses[it] = loss.item()

plt.figure(figsize=(6, 3))
plt.plot(losses, color='black', label='Adam')
plt.xlabel('epochs')
plt.ylabel('training loss')
plt.legend()
plt.show()

> **EXERCISE:**
> - Add _momentum_ to the previous optimization.
> - Compare with the `RMSprop` optimizer, with and without momentum.
>
> _Hint:_ Check the docs!

### Learning rate scheduler

When using some optimizers it may be useful to introduce a learning rate **decay** policy. This way, the learning rate will not be fixed for each step but will vary throughout the training epochs.

Some optimizers adapt their effective step size on their own (e.g. Adam, which rescales the step of each parameter), while others leave everything in our hands (e.g. SGD).

PyTorch provides some easy-to-use classes to manage the decay policy. The [`torch.optim.lr_scheduler`](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) package provides several methods to adjust the learning rate based on the number of epochs.

Learning rate scheduling should be applied **after the optimizer’s update**; e.g., you should write your code this way:

```python
scheduler = ...
for epoch in range(n_epochs):
    out = fn(...)
    out.backward()
    opt.step()
    opt.zero_grad()
    scheduler.step() # AFTER opt.step(): breaking change with PyTorch 1.1.0
```

If you are also testing your model on a validation dataset during the training loop, this is how it should look:

```python
scheduler = ...
for epoch in range(n_epochs):
    train(...)
    validate(...)
    scheduler.step()
```

Examples of such policies are [`lr_scheduler.ExponentialLR`](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ExponentialLR.html#torch.optim.lr_scheduler.ExponentialLR) and [`lr_scheduler.CosineAnnealingLR`](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR).

Let's see the `ExponentialLR` in action! This is a multiplicative decay where, at each epoch, the learning rate is multiplied by a `gamma` value (smaller than one).
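
To get a feeling for the decay before plugging it into a training loop, the sketch below records the learning rate over a few epochs. It uses a single dummy parameter and no actual loss, and the values `lr=0.1` and `gamma=0.5` are chosen only to make the numbers easy to read.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

dummy_param = torch.zeros(1, requires_grad=True)  # stand-in for model parameters
opt = SGD([dummy_param], lr=0.1)
scheduler = ExponentialLR(opt, gamma=0.5)

lrs = []
for epoch in range(4):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used in this epoch
    opt.step()        # no gradients here, so the parameter is left untouched
    scheduler.step()  # multiply the learning rate by gamma

print(lrs)  # approximately [0.1, 0.05, 0.025, 0.0125]
```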

> **EXERCISE:**
> - Complete the following code.
> - Also _validate_ your training model, and plot two loss curves (training and validation) across the iterations.
>
> Remember that we already loaded the validation data in `test_dataset` and we have the `test_dataloader`.

In [None]:
from torch.optim.lr_scheduler import ExponentialLR
from torch.optim import SGD

weights = torch.rand((X_train.shape[1], n_classes), requires_grad=True)
bias = torch.rand((1, n_classes), requires_grad=True)

lr = 0.01
max_iters = 200

opt = SGD([weights, bias], lr=lr)
scheduler = ExponentialLR(opt, gamma=.8)

for i in tqdm(range(max_iters), desc="Training"):
    #
    # ✏️ your code here
    #
    pass

> **EXERCISE:** Time to switch to the full MNIST dataset! Run your latest code on the complete MNIST (not just 1 and 7). Can you reach **>90%** accuracy?

# Autograd



The `backward` call that computed the gradients for us in the previous parts of this notebook uses the ``autograd`` package. As you can imagine, it provides *automatic differentiation* for all operations on tensors.

As we have seen in theory class, when we execute PyTorch operations on tensors, the framework constructs a computational graph behind the scenes. The graph will then be used for *reverse-mode automatic differentiation*, also called **backpropagation**.

## Basics

Let's start by defining a tensor $x$ that may appear in some computation like $f(x) = x^2 + x^3$.

Suppose we want to calculate its derivative at the point $x=42$:
$$\frac{\partial f}{\partial x}\Bigr\rvert_{x=42}$$

In [None]:
x = torch.tensor(42., requires_grad=True)  # we'll compute the gradient w.r.t. this variable!
x2 = x ** 2
x3 = x ** 3
f = x2 + x3

Now the backward pass, where we'll appreciate the *automatic* part of the differentiation: you just need to call the `backward()` method from your output $f$.

In [None]:
f.backward()

Now you have $\frac{\partial f}{\partial x}\Bigr\rvert_{x=42}$ in the `grad` attribute of $x$:

In [None]:
x.grad

In [None]:
2 * 42 + 3 * 42**2  # Yep, it's correct.

This is enough for training very standard models on PyTorch.

Nevertheless, the design principles behind the PyTorch `autograd` package are not always as straightforward. For instance, what do you think will happen executing `backward()` a second time?





In [None]:
# f.backward()  # try!


To fully understand the world of the `Autograd` package we must go deep down the rabbit hole.

You do not need to grasp everything we are going to mention from now on at the first pass. There are explanations of advanced concepts and some PyTorch internals, which are usually not needed but can be useful (e.g. in debugging or complex implementations).

Feel free to refer back to this notebook when needed!

## Aggressive buffer freeing

### The second backward
So, what was the problem with the second backward?

When we called `backward()` the first time, the intermediate buffers saved during the computation of $f$ (needed to compute its gradient) were freed to save memory. So PyTorch does not have the necessary information to do backward from $f$ a second time.

`Autograd` has an aggressive buffer freeing policy to be very memory efficient!



If you want to prevent this, you can use `.backward(retain_graph=True)`.

Let's redo from scratch the previous computation:

In [None]:
x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
x3 = x ** 3
f = x2 + x3
f.backward(retain_graph=True)
f.backward()

So we did backward two times. Let's check again the gradient of $x$:

In [None]:
x.grad

It's doubled!

The reason is that `Autograd` keeps accumulating into the `grad` attribute. This means that multiple `backward()` calls **will sum up previously computed gradients** if they are not explicitly zeroed out.
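
A minimal sketch of the usual remedy, assuming we simply reset `x.grad` in place between two backward passes (the names `first` and `second` are ours, for illustration only):

```python
import torch

x = torch.tensor(42., requires_grad=True)

f = x ** 2 + x ** 3
f.backward()
first = x.grad.clone()   # gradient after the first backward pass

x.grad.zero_()           # reset the accumulated gradient in place

f = x ** 2 + x ** 3      # rebuild the graph: the old buffers were freed
f.backward()
second = x.grad.clone()  # same value as `first`, not doubled

print(first, second)
```

Optimizers expose the same idea through their `zero_grad()` method.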

### Intermediate gradients are not kept by default
Intermediate gradients are other victims of PyTorch's aggressive buffer freeing policy.

We do not have access to the gradient with respect to $x_2$, even if we actually computed it to calculate the one with respect to $x$.

In [None]:
x2.requires_grad  # we *require* the grad w.r.t. x2, in order to compute the one w.r.t. x...

In [None]:
x2.grad is None  # ...but we had asked PyTorch to only compute the gradient w.r.t. x, so the one w.r.t. x2 is not kept in memory!

Did you read the user warning up there? That should already give you an intuition of what a _leaf_ tensor is 🍃!
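
If you do need such an intermediate gradient, one option is `retain_grad()`, a standard `Tensor` method that asks autograd to keep the `.grad` of a non-leaf tensor. A minimal sketch on the same computation:

```python
import torch

x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
x2.retain_grad()   # ask autograd to keep the gradient of this non-leaf tensor
x3 = x ** 3
f = x2 + x3
f.backward()

print(x2.grad)     # df/dx2 = 1, since f = x2 + x3
print(x.grad)      # the leaf gradient is populated as usual
```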

### Sick of being tracked? 🍪

You can call `detach()` to get a tensor that is **removed from the computational graph**: it shares its data with the original tensor, but it will _not_ be used for computing the gradient and will not take part in the chain rule.

We saw one example earlier in this notebook, where we were implementing gradient descent by ourselves and we didn't want to compute gradients of the descent steps! Another classical example is when you run a trained model just for inference, which means you already know you won't call `backward()` at all.

In [None]:
x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
x2sig = x2
print(x2sig.requires_grad)
x2nog = x2.detach()
print(x2nog.requires_grad)

Of course, if a tensor is `detach()`ed, a gradient won't be computed for it and thus `requires_grad` will be `False`.

As a "blanket solution", you can also wrap the code block in a context `with torch.no_grad()`. This is equivalent to calling `detach()` everywhere:

In [None]:
x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
print(x2.requires_grad)
with torch.no_grad():
    x2nog = x ** 2
    x3nog = (x2 + 7) ** 3
    print(x2nog.requires_grad)
    print(x3nog.requires_grad)

`torch.no_grad()` is particularly useful for inference, when you are certain that you won't ever call `.backward()`.

Clearly, you won't be able to backpropagate through a detached tensor because it was removed from the graph:

In [None]:
try:
  x2nog.sum().backward()
except Exception as e:
  print(e)

In [None]:
# backward() still works for the tensor that we didn't detach:
x2sig.sum().backward()
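
To see concretely how a detached tensor drops out of the chain rule, here is a small sketch (the names `full` and `partial` are ours): a detached factor is treated as a constant by `backward()`.

```python
import torch

x = torch.tensor(3., requires_grad=True)

y = x * x                 # y = x^2, so dy/dx = 2x = 6
y.backward()
full = x.grad.clone()

x.grad = None             # discard the accumulated gradient

y = x * x.detach()        # the second factor is a constant (3) for autograd
y.backward()
partial = x.grad.clone()  # dy/dx = 3: only the tracked factor contributes

print(full, partial)
```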

## Tensors 🎲



``torch.Tensor`` is the central class of the `autograd` package.


In order to understand in detail how autograd works, it is necessary to dissect some of the most relevant attributes of the Tensors:

---

- **`data`**:

It is the data stored in the tensor. Usually you do not need to access this attribute directly.

In [None]:
t = torch.rand(4, 4)
t.data


---

- **`requires_grad`**:

  - If `True`, the gradient with respect to this tensor will be computed.
  - If `True` and the tensor is a _leaf_, the gradient will also be saved in the `.grad` attribute.
  - If `False`, the gradient with respect to this tensor will _not_ be computed.

In [None]:
x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
x3 = x ** 3
f = x2 + x3

x.requires_grad, x2.requires_grad, x3.requires_grad, f.requires_grad

Note how all the tensors involved in the computation above have `requires_grad=True`. This means that a gradient will be computed for all of them; these intermediate gradients are all needed by the chain rule when we call `f.backward()`!

You can't force any of the intermediate tensors to _not_ have their gradient computed, because this would break the computation of the entire gradient from `f` back to `x`.

In [None]:
x = torch.tensor(42., requires_grad=True)
x2 = x ** 2
x3 = x ** 3
try:
  x3.requires_grad = False
except RuntimeError as e:
  print(f"Error: {e}")
f = x2 + x3
f.backward()

---

- **`grad`**:

This attribute is `None` by default; it actually becomes a Tensor when `backward()` is called. The attribute will then contain the computed gradient, and future calls to `backward()` will accumulate (add) gradients into it. Only the leaf nodes of the computational graph with `requires_grad=True` will have the `grad` attribute populated.




---

- **`grad_fn`**:

The backward function that `autograd` will use to compute the gradient. For example, if we sum two tensors during the forward pass, then the `grad_fn` attribute of the result will indicate that it was created as a result of an addition operation.

In [None]:
t3 = x + x2
t3.grad_fn

When we call `backward()` on a tensor, PyTorch will traverse the computational graph from the tensor backward to its inputs, using these `grad_fn` functions to calculate gradients along the way.
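
The traversal can be made visible with a small sketch. It relies on the `next_functions` attribute of `grad_fn` objects, which links each backward function to the ones feeding into it; the helper `walk` and the list `names` are hypothetical names introduced here for illustration.

```python
import torch

x = torch.tensor(42., requires_grad=True)
f = x ** 2 + x ** 3

names = []

def walk(fn, depth=0):
    """Recursively print the backward graph starting from `fn`."""
    if fn is None:  # inputs that do not require grad have no backward node
        return
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(f.grad_fn)
```

The nodes printed at the deepest level are the ones that accumulate the result into the `.grad` attribute of the leaf `x`.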


---

- **`is_leaf`**: a boolean. You cannot set it; it is read-only.

🍃 **Only *leaf* tensors with `requires_grad=True` will have their `grad` populated during a call to `backward()`**. To get `grad` populated for non-leaf tensors, you can use `retain_grad()`.
Keep in mind that:
  - All tensors that have `requires_grad=False` will be leaf tensors by default.
  - For tensors that have `requires_grad=True`, they will be leaf tensors if their `grad_fn` is `None`. This means that they are not the result of an operation of tracked tensors, but rather they were created directly by the user.

**NOTE:** Make sure you are on a GPU runtime before running the following.

In [None]:
a = torch.rand(10, requires_grad=True)
a.is_leaf, a.requires_grad

In [None]:
a = torch.rand(10, requires_grad=True) + 2
a.is_leaf, a.requires_grad  # was created by the addition operation

In [None]:
a = torch.rand(10, requires_grad=True, device="cuda")
a.is_leaf, a.requires_grad  # requires grad, directly created by the user

In [None]:
a = torch.rand(10).cuda()
a.is_leaf, a.requires_grad  # requires_grad=False, thus it is a leaf by default

In [None]:
a = torch.rand(10, requires_grad=True).cuda()
a.is_leaf, a.requires_grad  # Was created by the operation that casts a cpu tensor into a cuda tensor.
                            # Since we are moving a cpu tensor that requires gradients, this is creating a new version of the tensor in GPU.
                            # Therefore 'a' is not a leaf, but the cpu tensor was.

In [None]:
a = torch.rand(10).cuda().requires_grad_()  # Here we move a cpu tensor that does not require gradients, so it stays a leaf, and then modify it.
a.is_leaf, a.requires_grad  # requires gradients and has `grad_fn=None`

In [None]:
a = torch.rand(10, requires_grad=True, device="cuda")
b = a + 2                          # non leaf, since requires grad and it is produced by an operation
print(b.is_leaf, b.requires_grad)
c = b.detach()                     # leaf, it has been detached and now has requires_grad=False
print(c.is_leaf, c.requires_grad)

---

- **`backward()`**:

Computes the gradient of the current tensor w.r.t. the leaves of the computational graph.

> 🧠 **MEMO**: Remember, the graph is created on the fly during the forward pass, as operations are performed on tensors. When you call `backward()` on the final tensor (usually the rank-0 tensor representing the loss value), PyTorch traverses the computational graph back to the leaf tensors (usually the network parameters), calculating the gradient with respect to them and storing it in their `.grad` attribute.

> ### Leaves recap
>
> Let's recap the answer to the following question:
>
> *What are the nodes that will have the `.grad` attribute populated?*
>
> Here's a computational graph:
>
> ![](https://raw.githubusercontent.com/erodola/DLAI-s2-2021/main/labs/05/pics/leaves.svg)
>
> 1. Take the subgraph of nodes with `requires_grad=True` *(green and blue nodes)*
> 2. Take the leaves of this subgraph *(green nodes)*
>
> The nodes selected with this procedure *(green nodes)* will have their `.grad` attribute populated.

## Gradients

Let's look at one last example.

Create a tensor and set ``requires_grad=True`` to track operations:

In [None]:
x = torch.ones(2, 2, requires_grad=True)
x

Do some operation:


In [None]:
y = x + 2
y

``y`` was created as a result of a tracked operation, so it has a ``grad_fn``:



In [None]:
y.grad_fn

Do more operations on `y`:

In [None]:
z = y * y * 3
out = z.mean()

print(z, out)

In [None]:
out.backward()

With this operation we computed $\frac{\partial \, \text{out}}{\partial \, x}$ as well as all the intermediate partial derivatives, but the only one we can actually read is $\frac{\partial \, \text{out}}{\partial \, x}$:


In [None]:
x.grad

Let's double-check why `x.grad` is a `2x2` tensor full of $4.5$.

The output is defined as:

$$ \mathrm{out} = \frac{1}{4} \sum_i 3(x_i + 2)^2 \: \text{ with } x_i = 1 \, \forall i$$

We have the partial derivatives:

$$
\frac{\partial \mathrm{out}}{\partial x_i}
= \frac{3 \times 2}{4} (x_i + 2)
= \frac{3}{2} (x_i + 2)
$$

*(Note: the terms of the sum with index $j \neq i$ do not depend on $x_i$, so their derivative w.r.t. $x_i$ is zero.)*


Thus, since $x_i=1$ for all $i$ in the input, we obtain $\frac{\partial \mathrm{out}}{\partial x_i} = \frac{9}{2} = 4.5$.
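
As a quick sanity check of the formula (a sketch with a random input rather than all ones; `analytic` is our own name), we can compare the closed-form derivative $\frac{3}{2}(x_i + 2)$ with what autograd computes:

```python
import torch

x = torch.rand(2, 2, requires_grad=True)

out = (3 * (x + 2) ** 2).mean()    # same function as above
out.backward()

analytic = 1.5 * (x.detach() + 2)  # closed-form derivative, element-wise

print(x.grad)
print(analytic)
print(torch.allclose(x.grad, analytic))
```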

> **EXERCISE:**
>
> Understanding if a tensor is a leaf or not is surprisingly tricky, but it is very important to be able to distinguish leaf tensors: **only leaf tensors with `requires_grad=True` will have the `grad` attribute populated**. The leaves will be the parameters of our neural networks.
>
> Consider the two following scenarios and try to understand if `a.grad` and/or `b.grad` will be populated.
>
> **Scenario 1**
>
> ```python
> a = torch.randn(2, 2, requires_grad=True)
> b = a ** 2                                
> b.requires_grad_(True)                    
> b.sum().backward()                        
> ```
> - [ ] `a.grad` is populated (it is not `None`)
> - [ ] `b.grad` is populated (it is not `None`)
>
>
> **Scenario 2**
>
> ```python
> a = torch.randn(2, 2, requires_grad=False)
> b = a ** 2                                
> b.requires_grad_(True)                    
> b.sum().backward()                        
> ```
> - [ ] `a.grad` is populated (it is not `None`)
> - [ ] `b.grad` is populated (it is not `None`)

In [None]:
# @title Solution 👀

if False:  # Change to true to enable the prints
  # 1)
  a = torch.randn(2, 2, requires_grad=True)  # leaf tensor that requires grad

  b = a ** 2                                 # non leaf tensor: requires grad and produced by an op
  b.requires_grad_(True)                     # it already requires a grad!

  print(f'a.is_leaf: {a.is_leaf} \t a.requires_grad: {a.requires_grad}  \t a.grad_fn: {a.grad_fn}')
  print(f'b.is_leaf: {b.is_leaf} \t b.requires_grad: {b.requires_grad}  \t b.grad_fn: {b.grad_fn}')

  b.sum().backward()                         # just a sample backprop

  print("\nGradients:")
  print(f'a.grad: {a.grad}')                 # a is a leaf, thus it will have .grad
  print(f'b.grad: {b.grad}')                 # b is not a leaf, thus it will not have .grad

  print('\n\n---\n\n')

  # 2)
  a = torch.randn(2, 2, requires_grad=False) # leaf tensor that does not require grad

  b = a ** 2                                 # leaf tensor, because it does not require grad
  b.requires_grad_(True)                     # now it requires a grad and has grad_fn=None! It is a leaf

  print(f'a.is_leaf: {a.is_leaf} \t a.requires_grad: {a.requires_grad}  \t a.grad_fn: {a.grad_fn}')
  print(f'b.is_leaf: {b.is_leaf} \t b.requires_grad: {b.requires_grad}  \t b.grad_fn: {b.grad_fn}')

  b.sum().backward()                         # just a sample backprop

  print("\nGradients:")
  print(f'a.grad: {a.grad}')                 # a is a leaf but does not require grad, thus it will not have .grad
  print(f'b.grad: {b.grad}')                 # b is a leaf and requires grad, thus it will have .grad

  print('\n\n---\n\n')

> **EXERCISE:**
>
> Consider the following expression:
>
> $$ z = \frac{\sqrt{x^2 +1} - \sqrt{y - 1}}{\sqrt{x^2 + y^2}} + \sqrt{y - 1} $$
>
> Compute the gradients $\frac{\partial z}{\partial x}$, $\frac{\partial z}{\partial y}$, $\frac{\partial z}{\partial \sqrt{x^2 +1}}$ and $\frac{\partial z}{\partial \sqrt{y-1}}$ at $x=2$, $y=10$

In [None]:
# Expected results, respectively:
# x.grad: 0.08914636820554733
# y.grad: 0.15752650797367096
# x3.grad: 0.0980580672621727
# y2.grad: 0.9019419550895691

# ✏️ your solution here


## Autograd Mechanics 🧑‍🔧



### Custom `Function` 📖

Look at this simple example:


In [None]:
t = torch.rand(4, 4, requires_grad=True)
t2 = torch.rand(4, 4)

t3 = t + t2
t3.grad_fn

That `AddBackward0` is a `Function` object. It indicates that `t3` was created by a sum operation, but there is more to it: together with the `Tensor` class, `Function` makes up the graph that encodes a complete history of computation.

Every differentiable operation in PyTorch is recorded in this graph as such a function object; you can define your own by subclassing `torch.autograd.Function`.

📜 **Story time**

Once upon a time, we needed to backpropagate through the operation `lambda = eig(X)`, which computes the eigenvalues of a matrix `X`. But the `eig()` operation was not a `Function`! 😱

So we implemented our own `Function` and defeated the evil derivative.

**Good ending!** Our heroes make their way directly into the sun 🌅.


Our heroes had to implement these two methods:

- `forward()`: the code that performs the operation. It can take as many arguments as you want. All Python objects are accepted as input. _Any input of the `Tensor` type should be explicitly `detach()`ed inside the `forward()` call, so that whatever happens inside the function will not affect the computational graph_; recall that we are going to manually implement the gradient anyway! You can return either a single `Tensor` or a tuple of `Tensor`. Refer to the docs of `Function` to find descriptions of useful methods that can be called only from `forward()`.

- `backward()`: the gradient formula. It receives as many `Tensor` arguments as `forward()` returned outputs, each containing the gradient w.r.t. that output. It should return as many `Tensor`s as there were inputs in `forward()`, with each of them containing the gradient w.r.t. its corresponding input. If your inputs didn't require a gradient (`needs_input_grad`, in the `ctx` argument, is a tuple of booleans indicating whether each input needs gradient computation), or were non-Tensor objects, you can return `None`. Also, if you have optional arguments to `forward()` you can return more gradients than there were inputs, as long as they're all `None`.

Confused? Let's see an example.


We are going to implement our own ReLU from scratch.

$$f(x) = \max \{0, x \} $$

The _forward_ pass is easy to implement: just write the operation above, and return the result. We'll also need the value of $x$ for computing the derivative $\frac{\partial f}{\partial x}$, so `forward()` must save $x$ for later use. Scroll down to see how we implemented the forward.

The _backward_ pass is a bit more tricky. Reverse-mode autodiff requires us to compute the _derivative of the **loss** with respect to $x$_:

$$ {\color{blue}{\frac{\partial\ell}{\partial x}}} = {\color{green}{\frac{\partial \ell}{\partial f}}} {\color{red}{\frac{\partial f}{\partial x}}} $$

In particular, `backward()` will receive ${\color{green}{\frac{\partial \ell}{\partial f}}}$ as input, and must produce ${\color{blue}{\frac{\partial\ell}{\partial x}}}$ in the output. All we must do is compute the portion:

$${\color{red}{\frac{\partial f}{ \partial x}}} =  \begin{cases} 1 & \text{if } x > 0\\ 0 & \text{if } x \le 0 \end{cases}$$

and simply output the product ${\color{green}{\frac{\partial \ell}{\partial f}}} {\color{red}{\frac{\partial f}{\partial x}}}$. Note how, as promised, we are also using $x$ for this calculation.



In [None]:
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, x):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a "context" object that can be used
        to stash information for backward computation. You can cache tensors
        for use in the backward pass using the ctx.save_for_backward method;
        other objects can be stored directly as attributes on ctx.
        """
        ctx.save_for_backward(x)

        # The operation we do here can be even external to PyTorch, like playing a Mario🥸 level and recording the final score.
        # We're going simple here: let's implement a standard ReLU.
        x_device = x.device
        x_dtype = x.dtype
        xnumpy = x.cpu().detach().numpy()  # detach() ensures that operations done here do not interfere with the autograd
        xnumpy = xnumpy.clip(min=0)

        return torch.tensor(xnumpy, dtype=x_dtype, device=x_device)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors  # unpack the tuple to its only element

        grad_input = torch.zeros_like(grad_output)
        grad_input[input > 0] = 1
        grad_input *= grad_output

        # Alternatively, to avoid the element-wise product:
        # grad_input = grad_output.clone()  # deep copy
        # grad_input[input <= 0] = 0

        return grad_input

myrelu = MyReLU.apply  # not really needed, but useful to have an alias for future use

Let's test this out:

In [None]:
x = torch.rand(50, requires_grad=True)

In [None]:
out = myrelu(x - 0.5)
print(out)  # grad_fn=<MyReLUBackward>
out.sum().backward()
x.grad

In [None]:
x.grad.zero_()  # usually you should not use this method

# -> Let's check our implementation against torch.relu
out = torch.relu(x - 0.5)
print(out)  # grad_fn=<ReluBackward0>
out.sum().backward()
x.grad      # Negative numbers get zeroed, and their grad is zero
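
Beyond eyeballing the outputs, `torch.autograd.gradcheck` compares the gradients returned by `backward()` against finite differences. The sketch below is not the `MyReLU` above: it assumes a hypothetical pure-PyTorch variant, `TorchReLU`, and uses double-precision inputs chosen away from the kink at $0$, where finite differences would be unreliable.

```python
import torch

class TorchReLU(torch.autograd.Function):
    """Hypothetical pure-PyTorch ReLU, used only to illustrate gradcheck."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # derivative is 1 where x > 0 and 0 elsewhere
        return grad_output * (x > 0).to(grad_output.dtype)

# gradcheck expects double precision for reliable finite differences
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(TorchReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```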

> **EXERCISE**
>
> Implement your own "ReCU", defined as:
>
> $$ f(x) = \max \{0, x^3\} $$
>
> Write the `forward()` and `backward()` functions, and test them out.

In [None]:
# your solution here ✏️


In [None]:
# @title 👀 Solution

class MyReCU(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x):

        ctx.save_for_backward(x)

        x_device = x.device
        x_dtype = x.dtype
        xnumpy = x.cpu().detach().numpy() ** 3
        xnumpy = xnumpy.clip(min=0)

        return torch.tensor(xnumpy, dtype=x_dtype, device=x_device)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # no cloning necessary, since we are not modifying grad_output directly
        grad_input = grad_output * 3 * (input**2) * (input > 0).float()
        return grad_input

myrecu = MyReCU.apply

# testing

x = torch.rand(50, requires_grad=True)

out = myrecu(10 * x - 5)
print(out)

out.sum().backward()
x.grad
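
The rule about returning `None` for non-Tensor inputs, stated in the description of `backward()` earlier, can be sketched with a hypothetical `Scale` function that multiplies a tensor by a plain Python number. Storing the number as an attribute on `ctx` is a common pattern for non-tensor state.

```python
import torch

class Scale(torch.autograd.Function):
    """Hypothetical example: multiply a tensor by a Python float."""

    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor  # non-tensor state: stored as a ctx attribute
        return x * factor

    @staticmethod
    def backward(ctx, grad_output):
        # one return value per forward() input: a gradient for x,
        # and None for the non-tensor `factor`
        return grad_output * ctx.factor, None

x = torch.ones(3, requires_grad=True)
out = Scale.apply(x, 2.5)
out.sum().backward()
print(x.grad)
```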

### Excluding subgraphs from backward

The `requires_grad` flag allows for fine-grained exclusion of subgraphs from gradient computation and can increase efficiency. As a reminder, if any input tensor of an operation has `requires_grad=True`, the output tensor automatically gets `requires_grad=True` as well.

In [None]:
x = torch.randn(5, 5)  # requires_grad=False by default
y = torch.randn(5, 5)  # requires_grad=False by default
z = torch.randn((5, 5), requires_grad=True)

a = x + y
b = a + z

a.requires_grad, b.requires_grad

Explicitly setting certain tensors to `requires_grad=False` is useful when you want to **❄️ freeze a subset of parameters of your model** so they are not updated during training. This would be done, for instance, to **finetune** the last layer of a pretrained CNN: simply set `requires_grad=False` for all the parameter tensors except the ones in the last layer.

Let's do it:

In [None]:
import torchvision
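# load ImageNet-pretrained weights (on torchvision < 0.13, use pretrained=True instead)
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)  # no need to understand this right now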

In [None]:
# compute some random prediction from this pretrained network
random_prediction = model(torch.rand(2, 3, 224, 224))

# dummy loss, just to get some gradients
f = random_prediction.sum()

# compute the gradients
f.backward()

The model's parameters, together with the gradient of `f` with respect to them, are stored in...

In [None]:
model.parameters()

For example, we can look for all the parameters having a nonzero gradient (based on our dummy loss function):

In [None]:
grads = list(x.grad for x in model.parameters() if x.grad.bool().any())
len(grads)

Let's now freeze the pretrained model except for the last layer:

In [None]:
import torch.nn as nn
import torch.optim as optim

# Clear the previous gradients to avoid undue accumulation later
model.zero_grad()

# Freeze the pretrained model
for param in model.parameters():
    param.requires_grad = False  # you can do this, because they are all leaves!

# Replace the last fully-connected layer
# These parameters have requires_grad=True by default
model.fc = nn.Linear(512, 100)

# Configure an optimizer for the last layer only.
# NOTE: we don't actually optimize, this is just to show you how we would setup the training.
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
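
To check that freezing behaves as intended without downloading a pretrained network, here is a sketch on a tiny stand-in model (the `nn.Sequential` below and the names `backbone`, `trainable` and `with_grad` are assumptions made for illustration, not part of the ResNet example):

```python
import torch
import torch.nn as nn

# tiny stand-in for a pretrained network
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# freeze everything
for p in backbone.parameters():
    p.requires_grad = False

# replace the last layer: new parameters require grad by default
backbone[2] = nn.Linear(16, 3)

trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
print(trainable)

# dummy loss, just to get some gradients
backbone(torch.rand(5, 8)).sum().backward()

with_grad = [n for n, p in backbone.named_parameters() if p.grad is not None]
print(with_grad)  # only the new head received gradients
```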

### In-place operations

From our discussion so far, you might suppose that in-place operations on PyTorch tensors can potentially **overwrite values required to compute gradients**. This is true: with an in-place operation, we may break the backpropagation mechanism.

Here's an example:

```python
x = torch.rand(5, requires_grad=True)
y = x * 2
y.add_(torch.sqrt(y * x))
```

What happens to the internal attributes of `y` as we keep overwriting it?

Each in-place operation actually rewrites the computational graph. This can be tricky, especially if there are many `Tensors` that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other `Tensor`. In contrast, **out-of-place versions simply allocate new objects and keep references** to the old graph.

#### In-place correctness checks 📖

Every tensor keeps a _version counter_, incremented each time the tensor is marked as "dirty" by an in-place operation. When a `Function` uses `save_for_backward()` to save references to any tensors for its backward pass, the current version counter of each saved `Tensor` is stored as well. When you later access `ctx.saved_tensors`, the version is checked: if it is greater than the saved value, an error is raised. This ensures that if you're using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.

In [None]:
x = torch.rand(10, requires_grad=True)
o = x * 10
o.retain_grad()
o2 = o + 10
o2.retain_grad()
y = torch.rand(10)

In [None]:
o._version  # the version counter is initialized to zero

In [None]:
o.add_(-1)  # dirty edit, increase the version counter
o._version

In [None]:
z = x + y  # it does not modify x in place
x._version

In [None]:
x = x + x  # x is a new tensor
x._version

In [None]:
# 😈 Let's break autodiff with in-place operations

try:
  x = torch.ones(5, requires_grad=True)
  x2 = (x + 1).sqrt()
  z = (x2 - 10)
  x2[0] = -1
  z.sum().backward()
except Exception as e:
  print(e)
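
One way out, sketched below under the assumption that we only need a modified copy and not the modified original, is to edit a `clone()` so that the tensor saved for the backward pass stays untouched (`x2_mod` is our own name):

```python
import torch

x = torch.ones(5, requires_grad=True)
x2 = (x + 1).sqrt()   # sqrt saves its output for the backward pass
z = x2 - 10

x2_mod = x2.clone()   # out-of-place copy with its own storage
x2_mod[0] = -1        # in-place edit on the copy only

print(x2._version)    # still 0: the saved tensor was never modified
z.sum().backward()    # works: nothing needed by backward was overwritten
print(x.grad)
```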

References:

- [PyTorch docs](https://pytorch.org/docs/stable/index.html)
- [Autograd tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)
- [Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html)
- [Extending PyTorch](https://pytorch.org/docs/stable/notes/extending.html)
- Nice [blogpost](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)
- Nice [blogpost](https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95) number two

# The `torch.nn` package

Finally, let's implement a **Deep Neural Network**, beyond the simple logistic regression model 🚀


PyTorch provides the elegantly designed modules and classes
[`torch.nn`](https://pytorch.org/docs/stable/nn.html),
[`torch.optim`](https://pytorch.org/docs/stable/optim.html),
[`Dataset`](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset),
and [`DataLoader`](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader)
to help you create and train neural networks.
You have already seen how to use `torch.optim`, `Dataset` and `DataLoader`. In this section we will review all these classes together with the new `torch.nn` package to understand how they work together to simplify our life.

To develop this understanding, we will first train a basic neural net on the MNIST dataset _without_ using any of these modules: we will just use the most basic PyTorch tensor functionality.

---

Our final goal is to reach an elegant, general structure suitable for most problems and models with minor tweaks:

```python
# load data
# instantiate model
# instantiate optimizer

# for each epoch:
  # train the model on the training set
  # evaluate the model on one or more evaluation sets
  # log metrics (e.g. accuracy)
```

For the weights, we set `requires_grad` **after** the initialization, since we don't want the initialization step to be included in the gradient computation. (Remember that a trailing `_` in PyTorch means that the operation is performed in-place.)

We are initializing the weights with a simplified version of
[Xavier initialization](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), i.e. by multiplying by $\frac{1}{\sqrt{n}}$, where $n$ is the number of inputs (here $n = 784$).


In [None]:
import math
import torch

weights = torch.randn(784, 10) / math.sqrt(784)  # Xavier init.
weights.requires_grad_()                         # Start to track the weights
bias = torch.zeros(10, requires_grad=True)       # Initialize the bias with zeros
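
A quick way to see what the $\frac{1}{\sqrt{n}}$ scaling buys us (a sketch on random inputs with a fixed seed; `w`, `w_unscaled`, `x` and `out` are hypothetical names): with unit-variance inputs, the outputs of the linear map also have roughly unit variance instead of growing with $n$.

```python
import math
import torch

torch.manual_seed(0)

n = 784
w = torch.randn(n, 10) / math.sqrt(n)  # scaled initialization, as above
w_unscaled = torch.randn(n, 10)        # no scaling, for comparison

x = torch.randn(256, n)                # random stand-in for a batch of inputs

out = x @ w
out_unscaled = x @ w_unscaled

print(out.std().item())                # close to 1
print(out_unscaled.std().item())       # close to sqrt(784) = 28
```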

For these tests we are going to use the entire MNIST dataset, where each image has 784 pixels (check the size of `weights` above).

We are also going to use PyTorch's implementations of the loss and activation functions. Previously we used `torch.nn.functional.log_softmax` and `torch.nn.functional.nll_loss`; we are now going to simplify this further:

In [None]:
import torch.nn.functional as F

loss_func = F.cross_entropy  # log_softmax and nll_loss all in one, for better numerical stability!

def model(xb):
  return xb @ weights + bias  # we don't explicitly apply log-softmax anymore
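
The claim that `F.cross_entropy` combines the two previous functions can be checked directly. A sketch on random logits (`logits` and `targets` are made-up stand-ins for model outputs and labels):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(64, 10)            # stand-in for model(xb)
targets = torch.randint(0, 10, (64,))   # stand-in for yb

two_steps = F.nll_loss(F.log_softmax(logits, dim=1), targets)
one_step = F.cross_entropy(logits, targets)

print(two_steps.item(), one_step.item())
print(torch.allclose(two_steps, one_step))
```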

## Refactor: use `nn.Module`

Next up, we'll use ``nn.Module`` and ``nn.Parameter``, for a clearer and more
concise training loop. We subclass ``nn.Module`` to create a class that
holds our weights, bias, and method for the forward step.  ``nn.Module`` has a
number of attributes and methods (such as ``.parameters()`` and ``.zero_grad()``)
which we will be using.

In [None]:
from torch import nn

class Mnist_Logistic(nn.Module):
  def __init__(self):
    super().__init__()
    self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
    self.bias = nn.Parameter(torch.zeros(10))

  def forward(self, xb):
    return xb @ self.weights + self.bias

Since we're now using an object instead of just using a function (our old `model()`), we first have to instantiate our model:



In [None]:
model = Mnist_Logistic()

Now we can calculate the loss in the same way as before. Note that
``nn.Module`` objects are used as if they are functions (i.e. they are
*callable*), but behind the scenes PyTorch will call our ``forward``
method automatically.

> The `__call__` method of a `Module` internally calls the `forward` method and *does other stuff* (e.g. it runs any registered hooks; you can check the implementation [here](https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module)). Thus, you should always run the forward pass with `model(inputs)` and never call `model.forward(inputs)` directly.

Previously in our training loop we had to update `weights` and `bias` explicitly and manually zero out the grads, like this:

```python
  with torch.no_grad():
      weights -= weights.grad * lr
      bias -= bias.grad * lr
      weights.grad.zero_()
      bias.grad.zero_()
```

Now we can take advantage of `model.parameters()` and `model.zero_grad()` (which are both defined by PyTorch for ``nn.Module``) to make these steps more concise and less prone to error:

```python
  with torch.no_grad():
      for p in model.parameters(): p -= p.grad * lr
      model.zero_grad()
```
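
Put together, one full manual update step might look like the following sketch. It uses random data in place of MNIST and a hypothetical `TinyLogistic` module mirroring `Mnist_Logistic`; the learning rate is an arbitrary small value.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyLogistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias

model = TinyLogistic()
xb = torch.randn(64, 784)         # random stand-in for a batch of images
yb = torch.randint(0, 10, (64,))  # random stand-in for the labels
lr = 0.05

loss_before = F.cross_entropy(model(xb), yb)
loss_before.backward()

with torch.no_grad():
    for p in model.parameters():
        p -= p.grad * lr          # in-place update, not tracked by autograd
model.zero_grad()                 # clear gradients before the next step

loss_after = F.cross_entropy(model(xb), yb)
print(loss_before.item(), loss_after.item())
```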


## Refactor: use `nn.Linear`



We continue to refactor our code. Instead of manually defining and
initializing ``self.weights`` and ``self.bias``, and calculating
``xb @ self.weights + self.bias``, we will use the PyTorch class
[`nn.Linear`](https://pytorch.org/docs/stable/nn.html#linear-layers) for a
linear layer, which does all that for us.

PyTorch has many predefined layers that can greatly simplify our code, often making it faster too.

In [None]:
class Mnist_Logistic(nn.Module):
  def __init__(self):
    super().__init__()
    self.lin = nn.Linear(784, 10)

  def forward(self, xb):
    return self.lin(xb)

model = Mnist_Logistic()
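
A small sketch to connect `nn.Linear` with our hand-written layer: it stores its weight with shape `(out_features, in_features)`, i.e. transposed with respect to our `weights`, and applies it as `x @ W.T + b`. The names `lin`, `xb` and `manual` are ours.

```python
import torch
from torch import nn

torch.manual_seed(0)

lin = nn.Linear(784, 10)
print(lin.weight.shape, lin.bias.shape)  # weight is (out_features, in_features)

xb = torch.randn(4, 784)                 # random stand-in for a batch of images
manual = xb @ lin.weight.T + lin.bias    # what the layer computes

print(torch.allclose(lin(xb), manual, atol=1e-5))
```

Note that `nn.Linear` also applies its own default initialization, which differs from the scaled `randn` we used by hand.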

## Refactor: use `torch.optim`

We already know the `torch.optim` package; we can use the ``step`` method to take an optimization step (i.e. update the parameters), instead of manually updating each parameter.

This will let us replace our previous custom optimization step:

```python
with torch.no_grad():
  for p in model.parameters(): p -= p.grad * lr
  model.zero_grad()
```

and instead use just:
```python
opt.step()
opt.zero_grad()
```
where `opt` can be any fancy optimizer.

In [None]:
from torch import optim

Note that we will **not replace the training loop**. The optimizer only helps us for a single step; we still have to run the training loop ourselves.
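
For plain SGD without momentum, `opt.step()` performs the same update as the manual loop over `model.parameters()`. The sketch below checks this on two identical copies of a small linear layer (the names `manual` and `auto`, the layer sizes and the random data are assumptions for illustration):

```python
import copy
import torch
from torch import nn, optim
import torch.nn.functional as F

torch.manual_seed(0)

manual = nn.Linear(20, 3)
auto = copy.deepcopy(manual)      # identical starting parameters
xb = torch.randn(16, 20)
yb = torch.randint(0, 3, (16,))
lr = 0.1

# manual update
F.cross_entropy(manual(xb), yb).backward()
with torch.no_grad():
    for p in manual.parameters():
        p -= p.grad * lr
manual.zero_grad()

# the same step through the optimizer
opt = optim.SGD(auto.parameters(), lr=lr)
F.cross_entropy(auto(xb), yb).backward()
opt.step()
opt.zero_grad()

same = all(torch.allclose(p, q) for p, q in zip(manual.parameters(), auto.parameters()))
print(same)
```

With momentum or adaptive optimizers the update rule changes, but the two calls `opt.step()` and `opt.zero_grad()` stay the same.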

## Refactor: use `Dataset` and `DataLoader`

We already know these as well. PyTorch's [`TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset)
is a `Dataset` that wraps tensors. This also gives us a way to easily iterate over, index, and slice the dataset with more compact code.

In [None]:
import numpy as np

!wget https://s3.amazonaws.com/img-datasets/mnist.npz

def load_data_impl():
    # file retrieved by:
    #   wget https://s3.amazonaws.com/img-datasets/mnist.npz -O code/dlgo/nn/mnist.npz
    # code based on:
    #   site-packages/keras/datasets/mnist.py
    path = 'mnist.npz'
    f = np.load(path)
    x_train, y_train = f['x_train'].reshape(-1, 784), f['y_train']
    x_test, y_test = f['x_test'].reshape(-1, 784), f['y_test']
    f.close()
    return (x_train.astype(np.float32), y_train), (x_test.astype(np.float32), y_test)

(x_train, y_train), (x_valid, y_valid) = load_data_impl()

x_train = (x_train / 255 - 0.13) / 0.3  # data normalization
x_valid = (x_valid / 255 - 0.13) / 0.3

# Convert to PyTorch tensors
x_train, y_train, x_valid, y_valid = map(
  torch.tensor, (x_train, y_train, x_valid, y_valid)
)
y_train = y_train.long()  # PyTorch wants int64 as indices
y_valid = y_valid.long()
print(x_train, y_train)
print(x_train.shape)
print(y_train.min(), y_train.max())

Both ``x_train`` and ``y_train`` can be combined in a single ``TensorDataset``,
which will be easier to iterate over and slice:



In [None]:
from torch.utils.data import TensorDataset
train_ds = TensorDataset(x_train, y_train)

``DataLoader`` can then provide us with mini-batches automatically.


In [None]:
from torch.utils.data import DataLoader

bs = 64
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
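
To see what the loader yields, here is a sketch on a synthetic dataset of 150 samples (random tensors standing in for MNIST; all names below are ours):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

xs = torch.randn(150, 784)         # random stand-in for the images
ys = torch.randint(0, 10, (150,))  # random stand-in for the labels

ds = TensorDataset(xs, ys)
dl = DataLoader(ds, batch_size=64, shuffle=True)

print(len(ds), len(dl))            # number of samples, number of batches

sizes = [xb.shape[0] for xb, yb in dl]
print(sizes)                       # the last batch is smaller: 150 = 64 + 64 + 22

x0, y0 = ds[0]                     # indexing returns one (input, label) pair
print(x0.shape, y0)
```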

Thanks to PyTorch's ``nn.Module``, ``nn.Parameter``, ``Dataset``, and ``DataLoader``, our training loop is now dramatically smaller and easier to understand:

In [None]:
epochs = 3
lr = 0.5

opt = optim.SGD(model.parameters(), lr=lr)

for epoch in range(epochs):

  for xb, yb in train_dl:

    pred = model(xb)
    loss = loss_func(pred, yb)

    loss.backward()
    opt.step()
    opt.zero_grad()

print(loss_func(model(xb), yb))
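
A side note on why the loop clears the gradients after every step: `backward()` adds to the existing `.grad` buffers rather than overwriting them. The sketch below shows this on a single toy tensor `w` (a made-up name, not part of the model above).

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(3 * w) with respect to w is [3, 3].
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one.
(3 * w).sum().backward()
accumulated = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([6., 6.])

# Clearing the buffer by hand; opt.zero_grad() clears the gradients
# of every parameter handled by the optimizer.
w.grad.zero_()
```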

## Add: validation set

As usual, we should use the validation set to check the quality of our model and potentially identify overfitting.

**A note about shuffling:** Shuffling the training data is important to prevent correlation between batches and overfitting. On the other hand, the validation loss will be identical whether we shuffle the validation set or not. Since shuffling takes extra time and makes qualitative comparisons more difficult, _it makes no sense to shuffle the validation data_.

Still, we'll build mini-batches for the validation set as well, for efficiency reasons (e.g. to avoid the memory bottleneck of loading the entire validation set at once). We'll use a batch size for the validation set that is twice as large as that for the training set: validation needs no backpropagation, so no gradients have to be stored and each batch takes less memory.




In [None]:
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)
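
The memory argument relies on autograd being switched off during validation. As a minimal sketch, assuming a single `nn.Linear` layer as a stand-in for the model, we can check that outputs computed under `torch.no_grad()` carry no computation graph:

```python
import torch
from torch import nn

layer = nn.Linear(784, 10)   # stand-in for the model
xb = torch.randn(128, 784)   # hypothetical validation batch

out_tracked = layer(xb)      # autograd records the graph for backpropagation

with torch.no_grad():
    out_untracked = layer(xb)  # no graph is recorded here

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```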

We will calculate and print the validation loss at the end of each epoch.

In the code below, we also call `model.train()` before training, and `model.eval()` before inference. Layers such as ``nn.BatchNorm2d`` and ``nn.Dropout`` behave differently in the two modes and need these calls to work correctly, so it's good practice to always make them.

In [None]:
for epoch in range(epochs):

  model.train()

  for xb, yb in train_dl:

    pred = model(xb)
    loss = loss_func(pred, yb)

    loss.backward()
    opt.step()
    opt.zero_grad()

  model.eval()

  with torch.no_grad():
    valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl) / len(valid_dl)

  print(epoch, valid_loss)
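
Our current model has no mode-dependent layers, so here is a small sketch with a standalone `nn.Dropout` (not part of the model above) to show what `train()` and `eval()` actually change:

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
out_train_mode = drop(x)  # entries are zeroed at random, survivors are scaled by 1 / (1 - p) = 2

drop.eval()
out_eval_mode = drop(x)   # dropout is disabled: the input passes through unchanged

print(out_train_mode.unique())          # only the values 0 and 2 appear
print(torch.equal(out_eval_mode, x))    # True
```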

Is the loss always going down? Try a few times!

> **EXERCISE:** Plot training and validation curves for our latest model.

## 🎉 Our first MLP

We are now going to build a deep network with two fully-connected layers. Let's start with a single layer:

In [None]:
class Mnist_MLP(nn.Module):
  def __init__(self):
    super().__init__()
    self.lin1 = nn.Linear(784, 10)

  def forward(self, xb):
    xb = self.lin1(xb)
    return xb

model = Mnist_MLP()
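
As a sanity check, the sketch below repeats the class definition (so that it runs on its own, under the made-up name `toy_model`) and inspects the parameters that `nn.Module` registered for us:

```python
import torch
from torch import nn

# Same single-layer model as above, repeated so the snippet is self-contained.
class Mnist_MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin1(xb)

toy_model = Mnist_MLP()

# nn.Linear holds a weight matrix and a bias vector.
for name, p in toy_model.named_parameters():
    print(name, tuple(p.shape))

n_params = sum(p.numel() for p in toy_model.parameters())
print(n_params)  # 784 * 10 weights + 10 biases = 7850

# A hypothetical batch of 64 flattened images gives 64 rows of 10 class scores.
out = toy_model(torch.randn(64, 784))
print(out.shape)
```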

In [None]:
bs = 50

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=128)

In [None]:
import matplotlib.pyplot as plt

epochs = 20
lr = 0.01

opt = optim.Adam(model.parameters(), lr=lr)

valid_accuracy = torch.zeros(epochs)

for epoch in range(epochs):

  model.train()

  for xb, yb in train_dl:

    pred = model(xb)
    loss = loss_func(pred, yb)

    loss.backward()
    opt.step()
    opt.zero_grad()

  model.eval()

  with torch.no_grad():
    valid_accuracy[epoch] = sum((model(xb).argmax(dim=1) == yb).int().sum() for xb, yb in valid_dl) / len(valid_ds)

  if not epoch % 10:
    print(epoch, loss.item())

plt.figure(figsize=(6, 3))
plt.plot(valid_accuracy, label='validation', color='red')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(-1, epochs + 1)
plt.ylim(0.7, 1)
plt.legend()
plt.grid(True)
plt.show()

print(f"final validation accuracy: {valid_accuracy[-1]*100:.2f}%")
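
The accuracy computation in the loop is compact, so here is the same idea unrolled on a tiny hand-made example (four made-up samples, three classes):

```python
import torch

# Hypothetical raw model outputs (one row per sample, one column per class).
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 1, 2])

preds = logits.argmax(dim=1)              # predicted class = index of the largest score
correct = (preds == labels).sum().item()  # number of matching predictions
accuracy = correct / len(labels)

print(preds)     # tensor([0, 2, 1, 0])
print(accuracy)  # 0.75
```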

You should get around 88% accuracy on the validation set. Looks ok, but we can definitely do better. Instead of playing with hyperparameters such as batch size, optimizer, learning rate, and so forth, let's do one simple modification: **change the network architecture**📐.

> **EXERCISE:** Now it's your turn to appreciate the power of _deep_ networks. Right now we have a one-layer network going from 784 (i.e. number of pixels per image) to 10 features (i.e. the class predictions). Add a layer to the previous network, such that the feature dimensions change as 784 → 512 → 10. Use ReLU as an activation function after the first layer. What validation accuracy do you reach?

``torch.nn`` has another handy class we can use to simplify our code: [`Sequential`](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential). A ``Sequential`` object runs each of the modules contained within it, in a
sequential manner. This is a simpler way of writing a neural network.

For example:

In [None]:
model = nn.Sequential(
  nn.Linear(784, 100),
  nn.ReLU(),
  nn.Linear(100, 50),
  nn.Tanh(),
  nn.Linear(50, 10)
  )

⚠️ `nn.Sequential` takes _modules_ as input. Therefore, we pass an instance of the module `nn.ReLU` (i.e. `nn.ReLU()`) rather than the functional `F.relu`, and similarly for tanh.
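
To make the module-versus-functional distinction concrete, the sketch below builds a small `Sequential` (layer sizes chosen only for illustration) and checks that it computes the same thing as chaining the layers by hand with `F.relu`:

```python
import torch
from torch import nn
import torch.nn.functional as F

lin1, lin2 = nn.Linear(784, 100), nn.Linear(100, 10)

# Module version: nn.ReLU() is an object that Sequential can store and call.
seq = nn.Sequential(lin1, nn.ReLU(), lin2)

x = torch.randn(4, 784)
out_seq = seq(x)

# Functional version: the same computation written out explicitly.
out_manual = lin2(F.relu(lin1(x)))

print(torch.allclose(out_seq, out_manual))  # True
print(seq[0] is lin1)                       # True: Sequential supports indexing
```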

## Use: your GPU

If you're lucky enough to have access to a CUDA-capable GPU, you can
use it to speed up your code. First check that your GPU is working in
PyTorch:



In [None]:
print(torch.cuda.is_available())

And then create a device object for it:



In [None]:
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

We can then move our model to the GPU:



In [None]:
model.to(dev)

Further, we should also move our data to the GPU, and then re-initialize the optimizer with the GPU-stored model. If we did everything correctly, the training should run faster!
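
One possible way to arrange this is sketched below on synthetic data. The names `toy_model`, `toy_opt`, `toy_dl`, and `toy_loss_func` are made up for the example, and `nn.CrossEntropyLoss` is an assumption standing in for the loss used earlier. The snippet falls back to the CPU when no GPU is available, so it runs anywhere.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Synthetic data standing in for the MNIST tensors.
x = torch.randn(256, 784)
y = torch.randint(0, 10, (256,))
toy_dl = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

# Move the model first, then create the optimizer from the moved parameters.
toy_model = nn.Linear(784, 10).to(dev)
toy_opt = optim.SGD(toy_model.parameters(), lr=0.1)
toy_loss_func = nn.CrossEntropyLoss()

for xb, yb in toy_dl:
    # Each mini-batch must live on the same device as the model.
    xb, yb = xb.to(dev), yb.to(dev)

    loss = toy_loss_func(toy_model(xb), yb)
    loss.backward()
    toy_opt.step()
    toy_opt.zero_grad()

print(next(toy_model.parameters()).device)
```

Moving one batch at a time keeps GPU memory usage low. For a dataset as small as MNIST, moving the full tensors to the device once before building the `TensorDataset` is another option.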



> **EXERCISE:** Measure the training time of your MLP from the previous exercise. Then move it to the GPU and measure the training time again. What's the gain in performance?
>
> You can use python's builtin `time` module for timing.