<img src=https://upload.wikimedia.org/wikipedia/commons/c/c6/PyTorch_logo_black.svg width="300"></br>

# Introduction to PyTorch

PyTorch is a Python library specifically oriented to **developing Deep Learning models**. In particular, it allows us to build and train deep models in an efficient and intuitive way, leaving most of the mechanical yet complex operations to be carried out **automatically**. These include, in particular:
* GPU support
* computation of gradients for the back-propagation

Today we are going to take a look at the main tools that this scientific library provides:
* tensors (concept of computational graph)
* modules 
* loss functions
* optimizers
* datasets and dataloaders


## Tensors and Computational Graphs
Computing the gradients of a given function for the optimization process requires **tracking** its inputs and the operations that are applied to them. This tracking results in an object that goes by the name of **Computational Graph**.

<img src="https://colah.github.io/posts/2015-08-Backprop/img/tree-eval-derivs.png" width="500"></br></br>

For each operation that is executed, a new node is appended to the graph, allowing us to exploit the **chain rule** in order to compute all the derivatives in a single back-propagation pass. In order to efficiently support this functionality, PyTorch provides a specific class called **Tensor**. A tensor can encode scalar values as well as multidimensional arrays, and supports a wide variety of operations. Let us look at some examples.

The tools provided by PyTorch can be made accessible by simply importing `torch`

In [None]:
import torch

Let us now create two tensors containing scalar values and define some operations on them.

In [None]:
# create the tensors. By default tensors do not require gradients, so we enable them
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

# the result c is also a tensor
c = a * b
print(c)

All the intermediate values of the computation performed in the background by the machine are tracked in order to enable back-propagation. Our computational graph will have a node for `a`, one for `b` and one for the `*` operation. Let us now compute the gradients of `c = a * b` with respect to its inputs.

In [None]:
# Create the tensors. By default tensors do not require gradients, so we enable them
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

# The result c is also a tensor
c = a * b

print("Gradient before computation")
print(a.grad)
print(b.grad)
c.backward()
print("Gradient after computation")
print(a.grad)
print(b.grad)

Calling the `backward()` method computes the gradients and adds them to the `grad` attribute of each tensor that requires them: gradients are accumulated rather than overwritten. By default it also frees the computational graph, releasing the memory used to store it. It is easy to see how this automatic differentiation saves a lot of coding. Take as an example the code that would be needed to manually implement gradient computation in NumPy for a linear layer followed by a sigmoid activation:


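```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# define derivative of the activation
def derivative_sigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

# =========== FORWARD =========== #
# toy data (illustrative): x (in_features,), W (out_features, in_features)
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W, b = rng.normal(size=(2, 3)), np.zeros(2)

z = W @ x + b
y = sigmoid(z)
L = np.sum((t - y) ** 2)

# =========== BACKWARD =========== #

# compute gradient from L to z
dL_dy = -2 * (t - y)
dL_dz = dL_dy * derivative_sigmoid(z)

# compute gradient w.r.t. input
dL_dx = np.dot(dL_dz, W)

# compute gradient w.r.t. parameters
dL_dW = np.outer(dL_dz, x)
dL_db = dL_dz
```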
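For comparison, here is a minimal sketch of the same computation written with PyTorch tensors. It assumes a single input vector, a sigmoid activation and a sum-of-squared-errors loss, which matches the derivative `-2 * (t - y)` used above; the shapes and the random values are illustrative assumptions. A single `backward()` call produces all three gradients, and the manual formulas are evaluated only to check them.

```python
import torch

torch.manual_seed(0)

# toy shapes (assumed for illustration): 3 input features, 2 output features
x = torch.randn(3, requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.zeros(2, requires_grad=True)
t = torch.randn(2)

# forward pass: linear layer, sigmoid activation, squared-error loss
z = W @ x + b
y = torch.sigmoid(z)
loss = ((t - y) ** 2).sum()

# backward pass: autograd fills x.grad, W.grad and b.grad
loss.backward()

# manual gradients, following the chain rule step by step
with torch.no_grad():
    dL_dz = -2 * (t - y) * y * (1 - y)
    manual_dx = dL_dz @ W
    manual_dW = torch.outer(dL_dz, x)
    manual_db = dL_dz

print(torch.allclose(x.grad, manual_dx))
print(torch.allclose(W.grad, manual_dW))
print(torch.allclose(b.grad, manual_db))
```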

## Working with different devices
Until now, the operations that we performed were executed on the **CPU**. PyTorch, however, supports a range of processors for execution, including **GPUs** and **TPUs**. These are called **Devices**. Using the correct device can have a huge impact on **performance**. In order to execute an operation on a specific device, we first need to move all the involved tensors to the memory of that device. The operation will then be automatically executed on that device.

In [None]:
# fetch the first GPU. In case of multiple GPUs, the index specifies which GPU to use
# fall back to the CPU when no GPU is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# create some tensors in GPU memory
a = torch.tensor([2.0], device=device)  # directly create the tensor on GPU
b = torch.tensor([3.0]).to(device)  # creates the tensor on CPU and moves it to GPU

# the result c is also on GPU
c = a * b

print(c.device)

Different operations can also be executed on different devices, and PyTorch automatically keeps track of this in the Computational Graph. We now have all the ingredients we need to create and train a deep model. However, working with operations at such a low level is inconvenient and error-prone. We will now look at additional functionality offered by the library to speed up development.
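As a small sketch of how device transfers interact with gradient tracking, the example below moves a tensor with `.to()` and back-propagates through the transfer. The fallback to the CPU when no GPU is present is an assumption added so that the snippet runs on any machine.

```python
import torch

# pick the first GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# a lives in CPU memory and requires gradients
a = torch.tensor([2.0], requires_grad=True)

# .to() is itself recorded in the graph, so gradients flow back across devices
b = a.to(device) * 3.0
b.sum().backward()

print(device)
print(a.grad)  # the gradient is stored on the same device as a
```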

## Tensor operations
PyTorch offers a wide variety of methods to create and manipulate tensors, and many NumPy functions have a direct PyTorch counterpart. Let us start with an overview of some methods for tensor creation.

In [None]:
# create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a)

# creates a (2, 3) tensor with all 0s
b = torch.zeros((2, 3), dtype=torch.float32)
print(b)

# creates a (2, 3) tensor with all 1s
c = torch.ones((2, 3), dtype=torch.int32, device="cuda:0")
print(c)

# creates a (1, 4, 3) tensor with values from a normal distribution
# note the extra initial dimension in the output
d = torch.randn((1, 4, 3))
print(d)

PyTorch supports the basic Python operators, which are applied elementwise to the tensors.

In [None]:
# create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a)

# create a (2, 3) tensor with all 1s
b = torch.ones((2, 3))
print(b)

a = a * 2
print("a * 2")
print(a)

print("(a * 2) + b")
a = a + b
print(a)

print("((a * 2) + b)^2")
a = a**2
print(a)

Other operations, instead, operate on entire dimensions of the tensors and can change their size.

In [None]:
# create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a)

# sum the values in a along the dimension of the rows
b = torch.sum(a, dim=0)
print("torch.sum(a, dim=0)")
print(b)
print(b.size())

# sum the values in a along the dimension of the columns
c = torch.sum(a, dim=1)
print("torch.sum(a, dim=1)")
print(c)
print(c.size())

# sum the values in a
d = torch.sum(a)
print("torch.sum(a)")
print(d)
print(d.size())
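Reductions also accept a `keepdim` argument, which keeps the reduced dimension with size 1 instead of removing it. The sketch below uses it to divide each row by its own sum; it relies on the size-1 dimension being expanded automatically, the broadcasting mechanism described later in this section.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# keepdim=True keeps the reduced dimension with size 1: (2, 3) -> (2, 1)
row_sums = torch.sum(a, dim=1, keepdim=True)
print(row_sums.size())

# the (2, 1) result lines up with the (2, 3) input, so each row can be
# divided by its own sum
normalized = a / row_sums
print(normalized)
print(normalized.sum(dim=1))
```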

Tensor indexing in PyTorch is quite similar to NumPy

In [None]:
# create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a)

# index a specific scalar
print(a[0, 0])

# index row 0
print(a[0])

# index column 0
print(a[:, 0])

# index columns 0 and 1
print(a[:, 0:2])

# index the elements greater than or equal to 3.0
print(a[a >= 3.0])

# slices such as a[0] share memory with the original tensor; a boolean mask
# returns a copy, but assigning through it still updates a in place
a[a >= 3.0] += 10
print(a)
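The difference between views and copies is easy to get wrong, so here is a minimal sketch that checks it directly. Indexing with integers and slices returns a view, while indexing with a boolean mask returns a copy; assigning through the mask on the left-hand side still writes into the original tensor.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# basic indexing (integers and slices) returns a view of the same memory
row = a[0]
row[0] = 100.0
print(a[0, 0])  # the change is visible through a

# boolean-mask indexing returns a copy
selected = a[a >= 3.0]
selected[0] = -1.0
print(a)  # a is unchanged by the write to the copy

# assigning through the mask on the left-hand side does modify a
a[a >= 3.0] = 0.0
print(a)
```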

PyTorch supports a concept called **Tensor Broadcasting**, which is designed to automatically deal with operations involving tensors of **different sizes**. This often happens in practice, as in the case where an entire tensor is multiplied by a single scalar.

Given two tensors, we say that they are "broadcastable" if, when iterating over their dimensions starting from the last one and proceeding towards the first, one of these conditions holds for the sizes of each pair of dimensions:

1. **they match**: no special treatment is needed in this case
2. **one of them is 1**: the dimension of size 1 is replicated to make it reach the size of the corresponding dimension in the other tensor
3. **one of them does not exist**: the dimension is created with size 1 and the former rule applies
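The three rules above can be sketched as a short function that computes the shape resulting from broadcasting two shapes. The helper `broadcast_shape` is a hypothetical name introduced here for illustration; its result is compared with `torch.broadcast_shapes`, which PyTorch provides for the same purpose.

```python
import torch


def broadcast_shape(shape_a, shape_b):
    """Compute the broadcast shape of two shapes (hypothetical helper)."""
    result = []
    # walk the dimensions from the last one towards the first one
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        # rule 3: a missing dimension is treated as size 1
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a == dim_b:
            # rule 1: the sizes match
            result.append(dim_a)
        elif dim_a == 1 or dim_b == 1:
            # rule 2: the size-1 dimension is replicated
            result.append(dim_b if dim_a == 1 else dim_a)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not broadcastable")
    return tuple(reversed(result))


print(broadcast_shape((2, 1, 3), (4, 1)))
print(broadcast_shape((2, 3), ()))

# compare against PyTorch's own implementation
print(tuple(torch.broadcast_shapes((2, 1, 3), (4, 1))))
```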

Let us see some examples.

In [None]:
# create a (2, 3) tensor
a = torch.ones((2, 3))
print(a)
print(a.size())

# create a scalar value with no dimensions
b = torch.ones(())
print(b)
print(b.size())

# a (2, 3)
# b ( ,  )
# We apply rule 3 twice and b is treated as
# b (1, 1)
# We then apply rule 2 on both dimensions and b is replicated to
# b (2, 3)
# Then the sum is performed elementwise
c = a + b

print("Result")
print(c)
print(c.size())

In [None]:
# create a (2, 1, 3) tensor
a = torch.ones((2, 1, 3))
print(a)
print(a.size())

# create a (4, 1) tensor
b = torch.randn((4, 1))
print(b)
print(b.size())

# a (2, 1, 3)
# b ( , 4, 1)
# We start from the last dimension and apply rule 2. Tensor b is replicated 3 times on the last dimension
# We proceed and apply rule 2. Tensor a is replicated 4 times on the second dimension
# We then apply rule 3. An additional initial dimension is created in b and is repeated 2 times
# a (2, 4, 3)
# b (2, 4, 3)
# Then the sum is performed elementwise
c = a + b

print("Result")
print(c)
print(c.size())

The `squeeze()` and `unsqueeze()` methods are often used in conjunction with broadcasting. These methods respectively remove or add a dimension with size 1 in the tensor on which they are called.

In [None]:
# create a (2) tensor
a = torch.arange(0, 2)  # [0, 1]
print(a)
print(a.size())

# create a (3) tensor
b = torch.arange(0, 3)  # [0, 1, 2]
print(b)
print(b.size())

# Say we want to multiply each element in a with each element in b, getting a (2, 3) tensor
# Leveraging broadcast, we can add a trailing 1 dimension to a and multiply

print("\nUnsqueezed a")
a = a.unsqueeze(dim=-1)
print(a)
print(a.size())

# a is (2, 1)
# b is ( , 3)
# c is (2, 3)
c = a * b
print("\nResult")
print(c)
print(c.size())

PyTorch adopts some conventions on the shape of the tensors expected by its modules. 1D data is typically represented in the `(batch_size, features_count)` format. 2D data instead is represented in the `(batch_size, channels, height, width)` format.
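A short sketch of these two conventions, using the built-in `nn.Linear` and `nn.Conv2d` modules (modules are introduced in the next section). The layer sizes are arbitrary choices made for illustration.

```python
import torch
from torch import nn

batch_size = 4

# 1D data: (batch_size, features_count)
vectors = torch.randn((batch_size, 8))
linear = nn.Linear(in_features=8, out_features=16)
print(linear(vectors).size())  # (batch_size, 16)

# 2D data: (batch_size, channels, height, width)
images = torch.randn((batch_size, 3, 32, 32))
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, padding=1)
print(conv(images).size())  # (batch_size, 6, 32, 32)
```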

## Modules
Let us use what we learned so far to implement a simple linear layer:
`y = Wx + b`

In [None]:
batch_size = 4
in_features = 8
out_features = 16


def apply_linear_layer(x, W, b):
    """Apply a linear layer on an input.

    Args:
      x (batch_size, in_features)
      W (out_features, in_features)
      b (out_features)
    """

    # (batch_size, in_features, 1)
    x = x.unsqueeze(-1)
    # (1, out_features, in_features)
    W = W.unsqueeze(0)

    # (batch_size, out_features, 1)
    product = torch.matmul(W, x).squeeze(-1)

    # (batch_size, out_features)
    result = product + b
    return result


x = torch.randn((batch_size, in_features))

# Weights of the linear layer
W = torch.randn((out_features, in_features), requires_grad=True)
b = torch.zeros((out_features), requires_grad=True)

output = apply_linear_layer(x, W, b)
print("Result")
print(output)
print(output.size())

While the layer is functional, instantiating multiple such layers quickly becomes **unmanageable**. The main problem lies in the fact that the weights of the layer **are not tied** to the computational logic: creating a **class** for the layer would solve this problem. PyTorch provides a base class (`torch.nn.Module`) for this purpose, which offers a variety of functionalities.

Let's implement our linear layer in PyTorch style.

In [None]:
import torch
from torch import nn


class Linear(nn.Module):
    def __init__(self, in_features, out_features):
        """Linear layer.

        Args:
          in_features: number of input features
          out_features: number of output features
        """
        super(Linear, self).__init__()

        # Creates tensors for the weights
        W = torch.randn((out_features, in_features))
        b = torch.zeros((out_features))

        # Uses the Parameter class (subclass of Tensor) to create parameters for the module
        # When assigned to a member of self, Parameter tensors are automatically registered
        # Require gradient by default
        # Other Module objects are also automatically registered if assigned to self
        self.W = nn.Parameter(W)
        self.b = nn.Parameter(b)

    def forward(self, x):
        """Method executed when the object is called.

        Args:
          x (batch_size, in_features)

        Return:
          tensors (batch_size, out_features)
        """

        # (batch_size, in_features, 1)
        x = x.unsqueeze(-1)
        # Note that Parameters support the same operations as regular Tensors
        # (1, out_features, in_features)
        W = self.W.unsqueeze(0)

        # (batch_size, out_features, 1)
        product = torch.matmul(W, x).squeeze(-1)

        # (batch_size, out_features)
        result = product + self.b
        return result


batch_size = 4
in_features = 8
out_features = 16

x = torch.randn((batch_size, in_features))

# Creates an instance of our linear layer
linear_layer = Linear(in_features, out_features)
# Computes the results. the forward method is internally called
output = linear_layer(x)

print("Result")
print(output)
print(output.size())

Note how the parameters, their initialization and the processing logic are now all contained in the class, so multiple instances can be handled more conveniently. The Module class provides a range of additional functionalities.

In [None]:
# obtain all the parameters in the model. Useful for model optimization
parameters = linear_layer.parameters()
for parameter in parameters:
    print(parameter)

# move all the tensors associated with the model to the specified device
# recursively applies to other Module objects contained in the instance
linear_layer.to("cuda:0")

# obtain a representation of the model and save it
print("Saving model")
saved_model = linear_layer.state_dict()
torch.save(saved_model, "save.pth")

# load the saved model
print("Loading model")
loaded_state_dict = torch.load("save.pth")
linear_layer.load_state_dict(loaded_state_dict)
print("Model loaded")
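The automatic registration of parameters and submodules can be inspected with `named_parameters()`. The sketch below uses a hypothetical `TwoLayerNet` module with arbitrary layer sizes, and saves the state dict to an in-memory buffer instead of a file so that it leaves nothing on disk.

```python
import io

import torch
from torch import nn


class TwoLayerNet(nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        # submodules assigned to self are registered automatically
        self.hidden = nn.Linear(8, 16)
        self.output = nn.Linear(16, 4)
        # a plain Parameter is registered as well
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x))) * self.scale


model = TwoLayerNet()

# named_parameters() yields (name, parameter) pairs, including those of submodules
for name, parameter in model.named_parameters():
    print(name, tuple(parameter.size()))

# total number of trainable values
print(sum(p.numel() for p in model.parameters()))

# state_dict round trip through an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
restored = TwoLayerNet()
restored.load_state_dict(torch.load(buffer))

x = torch.randn((2, 8))
print(torch.allclose(model(x), restored(x)))
```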

PyTorch provides a wide range of `Module` implementations representing the most common computational blocks. These include:

*   Linear layers
*   1D, 2D and 3D Convolutions
*   Transposed Convolutions
*   Batch Normalization layers
*   RNN, LSTM, GRU cells
*   ...

Moreover, many common networks are implemented as `Module`s, for example in the `torchvision.models` package:

*   AlexNet
*   VGG
*   ResNet
*   DenseNet
*   ...
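Built-in blocks can be chained without writing a custom class by using `nn.Sequential`. The following is a minimal sketch of a small image classifier assembled only from modules that ship with PyTorch; the layer sizes and the number of classes are arbitrary assumptions.

```python
import torch
from torch import nn

# a small network assembled only from built-in modules
# (layer sizes are arbitrary choices for illustration)
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # (batch_size, 8, 1, 1)
    nn.Flatten(),  # (batch_size, 8)
    nn.Linear(8, 10),
)

images = torch.randn((4, 3, 32, 32))
logits = model(images)
print(logits.size())
```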


## Loss functions

PyTorch provides a wide range of already implemented loss functions as part of `torch.nn`:

*   L1
*   MSE
*   Cross Entropy
*   Binary Cross Entropy

Let us see an example.

In [None]:
import torch
from torch import nn

# create some tensors for the loss
x = torch.zeros((2, 4))
y = torch.ones((2, 4)) * 2

# instantiate the loss
l1_loss = nn.L1Loss()
mse_loss = nn.MSELoss()

# compute the loss functions
loss = l1_loss(x, y)
print(loss)

loss = mse_loss(x, y)
print(loss)
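Classification losses follow a slightly different calling convention. As a sketch, `nn.CrossEntropyLoss` takes raw scores (logits) and integer class indices rather than two tensors of the same shape; the values below are made up for illustration, and the loss is recomputed by hand to show what it measures.

```python
import torch
from torch import nn

# raw, unnormalized scores (logits) for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]])
# the target holds the index of the correct class for each sample
targets = torch.tensor([0, 2])

cross_entropy_loss = nn.CrossEntropyLoss()
loss = cross_entropy_loss(logits, targets)
print(loss)

# the same value computed by hand: mean negative log-probability of the correct class
log_probabilities = torch.log_softmax(logits, dim=1)
manual_loss = -log_probabilities[torch.arange(2), targets].mean()
print(manual_loss)
```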

## Optimizers

PyTorch implements a wide range of optimizers as part of the `torch.optim` package. When an optimizer is created, it must be given the sequence of tensors to optimize. The optimizer then uses the `grad` attribute of each tensor to update its value.

A typical optimization cycle is made of the following steps:

1.   perform the computations that build the Computational Graph
2.   compute the loss term
3.   use `backward()` to compute gradients for each tensor in the Computational Graph that requires them
4.   perform an optimization step using the optimizer
5.   zero the gradient in all tensors for the next optimization cycle using `zero_grad()`

Let us see an example.

In [None]:
import torch
from torch import nn

in_features = 8
out_features = 4
batch_size = 4

learning_rate = 1e-4

# create tensors for the loss
x = torch.zeros((batch_size, in_features))
y = torch.ones((batch_size, out_features))

# create the model to optimize
model = nn.Linear(in_features, out_features)

# the tensors whose value will be optimized
parameters = model.parameters()
# instantiate the optimizer
optimizer = torch.optim.SGD(parameters, learning_rate)

# instantiate the loss
l1_loss = nn.L1Loss()

# =========== perform an optimization step =========== #

# 1. perform computations
y_pred = model(x)

# 2. compute the loss term
loss = l1_loss(y_pred, y)

# 3. compute the gradients on the loss term
#    all model parameters involved in the computation now have a
#    .grad value
loss.backward()

# 4. perform the optimization step with the gradient values in .grad
optimizer.step()

# 5. reset all .grad attributes for the next optimization cycle
optimizer.zero_grad()

# ==================================================== #

print(loss)
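In practice the five steps are repeated many times. The following sketch wraps them in a loop on a synthetic regression problem; the data (`y = 3x - 1`), the learning rate and the number of steps are assumptions chosen so that the example converges quickly.

```python
import torch
from torch import nn

torch.manual_seed(0)

# synthetic regression data (an assumption for illustration): y = 3x - 1
x = torch.linspace(-1, 1, 32).unsqueeze(-1)  # (32, 1)
y = 3 * x - 1

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mse_loss = nn.MSELoss()

losses = []
for step in range(200):
    y_pred = model(x)  # 1. perform computations
    loss = mse_loss(y_pred, y)  # 2. compute the loss term
    loss.backward()  # 3. compute the gradients
    optimizer.step()  # 4. update the parameters
    optimizer.zero_grad()  # 5. reset gradients for the next cycle
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(model.weight.item(), model.bias.item())
```

The learned weight and bias should approach 3 and -1 respectively, the values used to generate the data.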

## Datasets and Dataloaders

While we could manually load training data into input tensors, doing so would be a major performance bottleneck in the training of a deep model. For this reason, PyTorch provides a range of utilities in the `torch.utils.data` package that help us deal efficiently with data. The most relevant ones are the `Dataset` and `DataLoader` classes.

* the `Dataset` class represents our training data and contains the logic to load a single element. We typically subclass it when creating a new dataset.

* the `DataLoader` class is a utility class that efficiently loads batches of data from a dataset. Multiprocessing can be used to speed up data processing and to overlap the processing of the next batch with the current model computations.

Subclassing the `Dataset` class requires overriding the `__len__` and `__getitem__` methods to return, respectively, the number of elements in the dataset and the item at a specified position. Any object type can be returned by the `__getitem__` method.


In [None]:
import torch
from torch.utils.data import Dataset


class SimpleDataset(Dataset):
    """A simple dataset representing the numbers from 0 to size-1"""

    def __init__(self, size):
        super(SimpleDataset, self).__init__()

        self.size = size

    def __getitem__(self, idx):
        """Get an item given its id.

        Args:
          idx: the integral index of the element to retrieve

        Returns:
          element at index idx
        """
        return torch.tensor([idx], dtype=torch.float32)

    def __len__(self):
        """Get the length of the dataset.

        Returns:
          number of elements that compose the dataset
        """
        return self.size


size = 10

# instantiate the dataset
dataset = SimpleDataset(size)

# fetch the length of the dataset (__len__ method)
length = len(dataset)
print(f"Dataset length: {length}")

# get each element of the dataset through indexing
# (__getitem__ method)
for idx in range(len(dataset)):
    print(f"- {idx}: {dataset[idx]}")

A range of methods is provided to conveniently work with datasets.

In [None]:
train_size = 6
val_size = 2
test_size = 2

# split the dataset into training, validation and test sets
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [train_size, val_size, test_size]
)

# print all the splits
for current_dataset in [train_dataset, val_dataset, test_dataset]:
    current_length = len(current_dataset)
    print(f"Current length: {current_length}")
    for idx in range(current_length):
        print(f"- {idx}: {current_dataset[idx]}")
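Since the split is random, two runs generally produce different subsets. A sketch of one way to make it repeatable is to pass a seeded `torch.Generator` to `random_split`; the `TensorDataset` below is a stand-in used to keep the snippet self-contained, and the seed value is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# a stand-in dataset holding the numbers from 0 to 9
dataset = TensorDataset(torch.arange(10, dtype=torch.float32))


def split(seed):
    # a seeded generator makes the random assignment repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [6, 2, 2], generator=generator)


first = split(seed=42)
second = split(seed=42)

# Subset objects expose the selected positions through .indices
for subset_a, subset_b in zip(first, second):
    print(subset_a.indices, subset_a.indices == subset_b.indices)
```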

Once a `Dataset` instance is available, a `DataLoader` object can be used to efficiently gather batches of data from the dataset. We just need to specify the size of the batch we would like to retrieve, the number of parallel workers to use for data processing and whether or not we would like batch elements to be sampled randomly from the dataset.

The `DataLoader` object is iterable and yields a batch of data at each iteration. Internally, when a batch is requested, the dataloader uses the `Dataset` `__getitem__` method to retrieve each item in the batch. If the object type returned by this method is known to PyTorch (e.g. it is a Tensor), the items are automatically combined into a single object representing the batch. For example, if the returned type is Tensor, PyTorch stacks all the Tensors composing the batch into a single Tensor with an additional initial dimension of size equal to the batch size. If the dataset returns custom data types instead, a `collate_fn` function can be specified that takes as input a list of objects returned by the dataset and returns a single object representing the entire batch.

Because of `DataLoader` parallelism, PyTorch recommends that objects returned by datasets be placed in CPU memory, given the subtleties of handling objects placed in GPU memory from multiple processes.

In [None]:
from torch.utils.data import DataLoader

# Creates a dataloader for our dataset instance.
# Does not randomize the order of elements and returns the last batch even if
# it is not of size batch_size
dataloader = DataLoader(
    dataset, num_workers=2, batch_size=4, shuffle=False, drop_last=False
)

print("Unshuffled DataLoader")
for idx, batch in enumerate(dataloader):
    print(f"Batch {idx}:")
    print(batch)
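The `collate_fn` mechanism described above becomes necessary when items cannot simply be stacked, for instance when they have different lengths. The sketch below uses a hypothetical `VariableLengthDataset` and a hypothetical `pad_collate` function that pads each batch with zeros; both names are introduced here only for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset whose items are sequences of different lengths."""

    def __init__(self, size):
        super().__init__()
        self.size = size

    def __getitem__(self, idx):
        # item idx is the sequence [0, 1, ..., idx]
        return torch.arange(idx + 1, dtype=torch.float32)

    def __len__(self):
        return self.size


def pad_collate(items):
    """Pad the sequences in a batch with zeros up to the longest one."""
    lengths = torch.tensor([len(item) for item in items])
    batch = torch.zeros((len(items), int(lengths.max())))
    for row, item in enumerate(items):
        batch[row, : len(item)] = item
    return batch, lengths


# num_workers=0 loads data in the main process, which keeps the sketch simple
dataloader = DataLoader(
    VariableLengthDataset(5),
    batch_size=2,
    shuffle=False,
    num_workers=0,
    collate_fn=pad_collate,
)

for batch, lengths in dataloader:
    print(batch.size(), lengths.tolist())
```

Returning the original lengths alongside the padded batch lets downstream code ignore the padding.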

## Transformations

The `Dataset` class gives us the freedom to insert data augmentation strategies directly inside the `__getitem__` method implementation. Doing so, however, is inconvenient, since for the same dataset we may want to apply different augmentations, for example during training and during evaluation.

For this reason, a typical pattern in PyTorch is to pass a `transform` function to the Dataset constructor. The Dataset class then applies the desired transformations **before** returning from `__getitem__`, so that each item is returned already transformed.

In [None]:
import torch
from torch.utils.data import Dataset


class TransformableDataset(Dataset):
    """A simple dataset class representing the numbers from 0 to size - 1"""

    def __init__(self, size, transform=None):
        super(TransformableDataset, self).__init__()

        self.transform = transform
        self.size = size

    def __getitem__(self, idx):
        """Get an item given its id.

        Args:
          idx: the integral index of the element to retrieve

        Returns:
          element at index idx
        """
        result = torch.tensor([idx], dtype=torch.float32)

        # if a transformation is available, we apply it
        if self.transform is not None:
            result = self.transform(result)

        return result

    def __len__(self):
        """Get the length of the dataset.

        Returns:
          number of elements that compose the dataset
        """
        return self.size


# a simple transformation
def square(input):
    return input**2


size = 10

# instantiates the dataset
dataset = TransformableDataset(size, transform=square)

# Gets the length of the dataset (__len__ method)
length = len(dataset)
print(f"Dataset length: {length}")

# Gets each element of the dataset through indexing
# (__getitem__ method)
for idx in range(len(dataset)):
    print(f"- {idx}: {dataset[idx]}")
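The benefit of the pattern shows when the same data needs different processing in different phases. In the sketch below, `NumbersDataset` is a hypothetical stand-in for the dataset above, and `add_noise`, `normalize` and `compose` are illustrative helpers: the training set applies a random augmentation followed by normalization, while the evaluation set applies normalization only.

```python
import torch
from torch.utils.data import Dataset


class NumbersDataset(Dataset):
    """Hypothetical stand-in for the dataset above: numbers from 0 to size - 1."""

    def __init__(self, size, transform=None):
        super().__init__()
        self.size = size
        self.transform = transform

    def __getitem__(self, idx):
        result = torch.tensor([idx], dtype=torch.float32)
        if self.transform is not None:
            result = self.transform(result)
        return result

    def __len__(self):
        return self.size


def add_noise(x):
    # random augmentation: only meaningful during training
    return x + 0.1 * torch.randn_like(x)


def normalize(x):
    # deterministic preprocessing: needed in both phases
    return x / 10.0


def compose(*functions):
    """Chain several transformations into a single callable."""

    def composed(x):
        for function in functions:
            x = function(x)
        return x

    return composed


train_dataset = NumbersDataset(10, transform=compose(add_noise, normalize))
eval_dataset = NumbersDataset(10, transform=normalize)

print(train_dataset[5], eval_dataset[5])
```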

Many transformations are available in PyTorch. In particular, the `torchvision.transforms` package contains a range of transformations designed for images, together with utilities to compose a complex chain of transformations into a single pipeline.

When designing a transformation, it is important to consider that torchvision datasets typically return images represented by the PIL Image class, which is the format expected by many of the transformations in the `torchvision.transforms` package. The `ToTensor` transformation can be used to convert PIL Images to Tensors.

Let us see an example in which we use an image dataset provided by torchvision that returns PIL Images, and apply some typical transformations to it.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
from torchvision import transforms

# Obtains the CIFAR10 dataset, downloading it if necessary
# Each returned element is a tuple (PIL Image, int) with the int representing the label of the image
# transform is applied only to the first element in the tuple
# target_transform can be specified for the second element
dataset = torchvision.datasets.CIFAR10(root="cifar", transform=None, download=True)

# Plots the first image
sample_image, sample_label = dataset[0]
plt.imshow(np.asarray(sample_image))
print(f"Label: {sample_label}")

In [None]:
from torchvision import transforms

# build a transformation that will apply a random affine transformation
affine_transformation = transforms.RandomAffine(degrees=20, translate=(0.1, 0.1))

# transform and plot the first image
transformed_image = affine_transformation(sample_image)
plt.imshow(np.asarray(transformed_image))

# build a chain of transformations
transformations_sequence = [
    affine_transformation,
    # random changes in pixel colors
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    # the former transformations accept and return PIL Image objects, now convert to Tensor
    transforms.ToTensor(),
    # apply normalization
    transforms.Normalize(mean=[0.4913, 0.4821, 0.4465], std=[0.2470, 0.2434, 0.2615]),
]

composed_transformation = transforms.Compose(transformations_sequence)

dataset = torchvision.datasets.CIFAR10(
    root="cifar", transform=composed_transformation, download=True
)
# show the first image
transformed_image, sample_label = dataset[0]
print("Sample image")
print(transformed_image)
print(f"Label: {sample_label}")

## Logging

Understanding the training behavior of deep models is often challenging. A wide variety of metrics, however, can give us clues on why a certain behavior is shown. Moreover, when working on a deep learning project, multiple configurations and architecture variations are typically tested, generating a large quantity of data. Being able to explore these data and to compare them among different configurations is thus of primary importance.

Multiple tools are available to achieve this goal. In this unit we will cover two main logging utilities: TensorBoard and WandB. The idea behind these tools is simple: when training or evaluating a model, we log some metrics at each step, and the tool provides us with a web interface where plots showing the dynamics of our model are automatically populated.

Let us start with a TensorBoard example.


In [None]:
! rm -rf runs

In [None]:
import torch
from torch.utils.tensorboard import SummaryWriter

# ====== write fake data representing a first experiment ====== #


# creates a logger for the experiment
writer = SummaryWriter(log_dir="runs/exp1")

# simulate 100 training steps
for training_step in range(100):
    # log training metrics
    writer.add_scalar("train/quantity_a", training_step * 0.5, training_step)
    writer.add_scalar("train/quantity_b", training_step**1.5, training_step)
    writer.add_scalar("train/quantity_c", 1 / (1 + training_step), training_step)

# close the logger
writer.close()

# ============================================================== #

# ====== write fake data representing a second experiment ====== #

writer = SummaryWriter(log_dir="runs/exp2")

for training_step in range(100):
    writer.add_scalar("train/quantity_a", training_step * 0.4, training_step)
    writer.add_scalar("train/quantity_b", training_step**1.4, training_step)
    writer.add_scalar("train/quantity_c", 1 / (1 + 2 * training_step), training_step)

writer.close()

# ============================================================== #

In [None]:
# if getting an error, enable third party cookies!
%load_ext tensorboard
%tensorboard --logdir=runs

Let's now try out WandB

In [None]:
%pip install wandb -q

In [None]:
# import the library
import wandb

wandb.login()

wandb.init(project="lab_01_intro", name="exp1")

# simulate 100 training steps
for training_step in range(100):
    # log training metrics
    wandb.log(
        {
            "train/quantity_a": training_step * 0.5,
            "train/quantity_b": training_step**1.5,
            "train/quantity_c": 1 / (1 + training_step),
        }
    )

# close the first run before starting the second one
wandb.finish()

# ====== write fake data representing a second experiment ====== #

wandb.init(project="lab_01_intro", name="exp2")

for training_step in range(100):
    wandb.log(
        {
            "train/quantity_a": training_step * 0.4,
            "train/quantity_b": training_step**1.3,
            "train/quantity_c": 1 / (1 + 2 * training_step),
        }
    )

# close the second run
wandb.finish()