# PyTorch

PyTorch is composed of three main components:
1. PyTorch tensors = NumPy-like arrays + GPU support
2. Autograd (automatic differentiation engine), which computes the gradients of tensor operations, e.g. for backpropagation.
3. A deep learning library that contains pre-trained models, loss functions, etc.

We will go through each component in turn.

## PyTorch tensors

In [1]:
import torch
torch.__version__


'2.7.0+cu126'

In [2]:
print(torch.cuda.is_available())

True


In [4]:
# 0D tensor (scalar)
tensor0d = torch.tensor(1)
print(tensor0d)
# 1D tensor (vector)
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d)
# 2D tensor (matrix)
tensor2d = torch.tensor([[1, 2], [3, 4]])
print(tensor2d)
# 3D tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(tensor3d)

tensor(1)
tensor([1, 2, 3])
tensor([[1, 2],
        [3, 4]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [5]:
# default 64-bit integer
print(tensor1d.dtype)

torch.int64


In [6]:
# default 32-bit precision
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


In [7]:
# change type
tensor1d_float = tensor1d.to(torch.float32)
print(tensor1d_float.dtype)

torch.float32


In [8]:
# shape of a tensor
print(tensor0d.shape)
print(tensor1d.shape)
print(tensor2d.shape)
print(tensor3d.shape)

torch.Size([])
torch.Size([3])
torch.Size([2, 2])
torch.Size([2, 2, 2])


In [10]:
# reshape a tensor
tensor2d.reshape(4, 1)

tensor([[1],
        [2],
        [3],
        [4]])

In [11]:
# reshape a tensor with view (the more common method; it needs contiguous memory)
tensor2d.view(4, 1)

tensor([[1],
        [2],
        [3],
        [4]])
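
The two cells above give the same result, but `view` and `reshape` are not fully interchangeable. The sketch below uses a small illustrative tensor `t` (not one of the notebook's variables) to show a case where `view` fails because the memory layout is not contiguous, while `reshape` succeeds by copying the data.

```python
import torch

t = torch.tensor([[1, 2], [3, 4]])
transposed = t.T  # shares storage with t, but reads it in a different order

# A transposed matrix is generally not contiguous in memory.
print(transposed.is_contiguous())

# view() needs a compatible memory layout, so it raises an error here.
try:
    transposed.view(4)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)

# reshape() falls back to copying the data when a view is not possible.
flat = transposed.reshape(4)
print(flat)
```

In practice, `reshape` is the safer default, and `view` is useful when you want an error rather than a silent copy.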

In [12]:
# Transpose
tensor2d.T

tensor([[1, 3],
        [2, 4]])

In [13]:
# matmul 1
tensor2d.matmul(tensor2d.T)

tensor([[ 5, 11],
        [11, 25]])

In [14]:
# matmul 2
tensor2d @ tensor2d.T

tensor([[ 5, 11],
        [11, 25]])

## PyTorch autograd engine

In [16]:
# Suppose we have a model with the weight w1 and the bias b.
# To compute the gradients, PyTorch builds a computation graph in the
# background, as shown in the figure below.
import torch.nn.functional as F

y = torch.tensor([1.0]) # true label
x1 = torch.tensor([1.1]) # input
w1 = torch.tensor([2.2]) # weight
b = torch.tensor([0.0]) # bias

z = x1 * w1 + b
a = torch.sigmoid(z) # predicted label

loss = F.binary_cross_entropy(a, y)
print("[a]", a)
print("[y]", y)
print("[loss]", loss)

[a] tensor([0.9183])
[y] tensor([1.])
[loss] tensor(0.0852)


The following figure illustrates the computation graph of the 'model' above.

As long as at least one of the leaf nodes (here `w1` or `b`) has its `requires_grad` attribute set to True, PyTorch builds the graph needed to compute the gradients.

PyTorch computes the gradients from right to left, which is called backpropagation: it starts at the output (the loss) and works backward to the input layer.

In this way, PyTorch computes the gradient of the loss with respect to each parameter (weights and biases) so that these parameters can be updated during training.

![pytorch_automatic_differentiation.png](./images/pytorch_automatic_differentiation.png)


In [17]:
# In the previous code PyTorch didn't build the graph because no
# leaf tensor had requires_grad set to True. In this code, the
# graph is built.

# This is where the automatic differentiation engine matters:
# given the graph, the engine can compute the gradients using the
# grad function.

import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)
# by default, the graph is deleted after the gradients are computed
# we retain it to use it later
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)
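
The values printed above can be checked by hand. For a sigmoid output followed by binary cross-entropy, the chain rule simplifies to dL/dz = a - y, so dL/dw1 = (a - y) * x1 and dL/db = a - y. The sketch below redoes the computation with plain Python floats instead of tensors, reusing the same input values.

```python
import math

# Same values as in the cell above, as plain floats
y, x1, w1, b = 1.0, 1.1, 2.2, 0.0

# Forward pass
z = x1 * w1 + b
a = 1.0 / (1.0 + math.exp(-z))  # sigmoid
loss = -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))  # binary cross-entropy

# Backward pass via the chain rule: dL/dz = a - y
dL_dz = a - y
grad_w1 = dL_dz * x1  # dz/dw1 = x1
grad_b = dL_dz        # dz/db = 1

print(round(loss, 4), round(grad_w1, 4), round(grad_b, 4))
# 0.0852 -0.0898 -0.0817
```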


In [18]:
# the more common way to compute the gradients is the backward
# method; the results are stored in the grad attribute of each leaf tensor
loss.backward()
print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
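
One detail worth knowing about `backward()` is that it adds to the `grad` attribute instead of overwriting it. The following minimal sketch uses an illustrative tensor `w` (not part of the model above) to show the accumulation and how to reset it.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(3 * w)/dw = 3
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass on a fresh graph: another 3 is added to w.grad
(3 * w).sum().backward()
second = w.grad.clone()

# Reset the stored gradient in place before the next pass
w.grad.zero_()
print(first, second, w.grad)
```

This accumulation is why training loops reset the gradients before each backward pass.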


## PyTorch as a deep learning library

Create an MLP with PyTorch, similar to the one in the image below. This network will have:

- 50 inputs for a data point with 50 features
- 30 neurons in the 1st layer, resulting in:
    - 50x30 weights to calculate
    - 30 linear equations, thus 30 biases
- 20 neurons in the 2nd layer, resulting in:
    - 30x20 weights to calculate
    - 20 linear equations, thus 20 biases
- 3 outputs, resulting in:
    - 20x3 weights to calculate
    - 3 biases

Counting them all gives 2213 parameters to learn.
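
The count can be reproduced with a few lines of plain Python. This is only an arithmetic sketch; the list `layer_sizes` is an illustrative name that encodes the layer widths listed above.

```python
# Layer widths: inputs, 1st hidden layer, 2nd hidden layer, outputs
layer_sizes = [50, 30, 20, 3]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out  # one weight per input/neuron pair
    biases = n_out          # one bias per neuron
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")
    total += weights + biases

print(total)  # 2213
```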

![mlp.png](./images/mlp.png)



In [39]:
# our class inherits from torch.nn.Module because it allows us to encapsulate
# the layers and operations and to track the model's parameters
class NeuralNetwork(torch.nn.Module):
    # to define the network layers
    def __init__(self, num_inputs, num_outputs):
        # calls the Module class constructor
        super().__init__()
        # encapsulate all the layers
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer of logits
            torch.nn.Linear(20, num_outputs)
        )
    # to define how the input data passes through the network
    def forward(self, x):
        logits = self.layers(x)
        return logits

In [40]:
# reproducibility
torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


In [41]:
# same result as calculated before
num_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print('[Parameters]', num_params)

[Parameters] 2213


In [42]:
# As indicated above, the trainable parameters have requires_grad
# set to True. They live in the Linear layers; for example, here are
# the weights of the first linear layer, which were initialized randomly:
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)


In [43]:
print(model.layers[0].weight.shape)
print(model.layers[0].bias.shape)

torch.Size([30, 50])
torch.Size([30])
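
Note that the weight matrix is stored as `[out_features, in_features]`, i.e. 30x50 rather than 50x30. A `Linear` layer computes `x @ W.T + b`, which the sketch below checks on a standalone layer (the names `layer`, `x`, and `manual` are illustrative and separate from the model above).

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(50, 30)
x = torch.rand(1, 50)  # one data point with 50 features

# Manual forward pass: transpose the weights so the shapes line up
# (1, 50) @ (50, 30) + (30,) -> (1, 30)
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)
print(torch.allclose(layer(x), manual, atol=1e-6))
```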


In [44]:
# forward pass without training

# tensor of inputs
X = torch.rand((1, 50))
out = model.forward(X)
print(out)

tensor([[-0.1670,  0.1001, -0.1219]], grad_fn=<AddmmBackward0>)


The previous result also shows `grad_fn`, the last function used to compute the tensor in the graph; PyTorch uses this information during backpropagation. Here, `Addmm` means a matrix multiplication (`mm`) followed by an addition (`Add`).

When we use a model only for inference rather than training, we don't need the graph; building it would be a waste of resources. In this case there is a better way:

In [45]:
with torch.no_grad():
    out = model.forward(X)
print(out)

tensor([[-0.1670,  0.1001, -0.1219]])


**Common practice**
Create models that return `logits` as outputs, without a final activation function; the `logits` are unbounded real numbers. The reason is that PyTorch's common loss functions combine the activation function and the loss into a single operation for efficiency and numerical stability (using cancellation tricks), so these combined functions expect `logits` as input, not probabilities.
E.g.:
- CrossEntropyLoss = LogSoftmax + NLLLoss
- BCEWithLogitsLoss = Sigmoid + BCELoss
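
Both equivalences can be checked numerically with the functional API. The sketch below uses random illustrative logits and labels, not the model defined above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class case: cross_entropy == log_softmax followed by nll_loss
logits = torch.randn(4, 3)            # 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])   # class indices
ce = F.cross_entropy(logits, labels)
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Binary case: binary_cross_entropy_with_logits == sigmoid followed by BCE
binary_logits = torch.randn(4)
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
bce = F.binary_cross_entropy_with_logits(binary_logits, targets)
bce_manual = F.binary_cross_entropy(torch.sigmoid(binary_logits), targets)

print(torch.allclose(ce, ce_manual, atol=1e-6))
print(torch.allclose(bce, bce_manual, atol=1e-6))
```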

In [46]:
# apply the activation function outside the model definition;
# softmax in particular ensures that all output values are positive
# and sum to 1
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.2983, 0.3896, 0.3121]])
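
To see what `torch.softmax` does, the probabilities can be recomputed in plain Python from the logits printed earlier. Because those printed logits are rounded to four decimals, the result should agree with the output above only up to rounding.

```python
import math

# Logits as printed by the forward pass (rounded to 4 decimals)
logits = [-0.1670, 0.1001, -0.1219]

# Subtracting the maximum improves numerical stability and does not
# change the result, because softmax is invariant to constant shifts
shifted = [v - max(logits) for v in logits]
exps = [math.exp(v) for v in shifted]
total = sum(exps)
probs = [e / total for e in exps]

print([round(p, 4) for p in probs])
print(sum(probs))
```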


## DataLoader

Before we can train the neural network defined above, we have to create an efficient data loader.

As shown in the figure, we first create a custom `Dataset` class that implements the methods to retrieve individual records, then we instantiate a training and a test `Dataset`. Both are passed to a `DataLoader`, which defines how the data is shuffled and assembled into batches.

![dataloader.png](./images/dataloader.png)


In [47]:
# To simulate the creation of a DataLoader, we can create a dataset
# with train and test data
x_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

x_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6]
])

y_test = torch.tensor([0, 1])

In [48]:
# Creation of the custom Dataset
# IMPORTANT: class labels must start at 0, and the largest label
# value can't exceed the number of output nodes minus 1
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    # set up the attributes; they don't have to be tensors, they
    # could be files, database connectors, etc.
    def __init__(self, x, y):
        self.features = x
        self.labels = y
    # access a specific individual record
    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(x_train, y_train)
test_ds = ToyDataset(x_test, y_test)

In [50]:
# Instantiate the DataLoader with the custom Datasets
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2, # number of records retrieved in each iteration
    shuffle=True,
    num_workers=0
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

In [52]:
# To iterate over the train data
for idx, (x, y) in enumerate(train_loader):
    print("[BATCH]", idx+1)
    print(x, y)

[BATCH] 1
tensor([[ 2.3000, -1.1000],
        [-1.2000,  3.1000]]) tensor([1, 0])
[BATCH] 2
tensor([[-0.9000,  2.9000],
        [ 2.7000, -1.5000]]) tensor([0, 1])
[BATCH] 3
tensor([[-0.5000,  2.6000]]) tensor([0])


As you can see, the train data has 5 records and each one is retrieved exactly once. The problem is the last batch, which is smaller than the others; a much smaller final batch can disturb convergence during training, so we can simply drop it.

In [53]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2, # number of records retrieved in each iteration
    shuffle=True,
    num_workers=0,
    drop_last=True
)

for idx, (x, y) in enumerate(train_loader):
    print("[BATCH]", idx+1)
    print(x, y)


[BATCH] 1
tensor([[-0.9000,  2.9000],
        [-1.2000,  3.1000]]) tensor([0, 0])
[BATCH] 2
tensor([[-0.5000,  2.6000],
        [ 2.7000, -1.5000]]) tensor([0, 1])
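
The number of batches per epoch follows directly from the dataset size and the batch size. The helper `num_batches` below is hypothetical (it is not a PyTorch function); it only mirrors the arithmetic behind the two outputs above.

```python
import math

def num_batches(num_records, batch_size, drop_last):
    # Hypothetical helper, not part of PyTorch
    if drop_last:
        # an incomplete final batch is discarded
        return num_records // batch_size
    # an incomplete final batch is kept
    return math.ceil(num_records / batch_size)

print(num_batches(5, 2, drop_last=False))  # 3
print(num_batches(5, 2, drop_last=True))   # 2
```

With `drop_last=True`, one of the 5 records is skipped in every epoch; since `shuffle=True`, the skipped record can differ from epoch to epoch.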


Another important parameter of the `DataLoader` is `num_workers`. When it is set to 0, the data loading happens in the main process. So, as seen in the figure, in each iteration the CPU has to interrupt the processing of the model to load the data while the GPU sits idle.

This bottleneck is solved by using multiple workers: several processes are launched to load the data in parallel, leaving the main process free for the model. In this way, the data is already queued up in the background when the model needs it.

![num_workers.png](./images/num_workers.png)

> IMPORTANT: increasing `num_workers` with tiny datasets or in Jupyter notebooks can be harmful. With tiny datasets, the overhead of starting the worker processes outweighs the benefit; in Jupyter notebooks, it can cause issues related to sharing resources between processes.

A common choice is `num_workers=4`.

## Training

Now that we have the neural network defined and the DataLoader ready, we can combine them.

> Validation dataset: in practice, a third dataset is used, which is similar to the test dataset. The difference is that the validation dataset is used to tweak hyperparameters.

In [57]:
import torch.nn.functional as F

torch.manual_seed(123)
model = NeuralNetwork(
    num_inputs=2, # because each input has 2 features
    num_outputs=2 # because labels in our data are 0 or 1
)

# stochastic gradient descent with learning rate of 0.5
# learning rate is a tunable parameter, we need a value
# where the loss converges after some epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# another tunable parameter
num_epochs = 3

for epoch in range(num_epochs):
    # train() and eval() change the mode of the model, which is
    # important for components that behave differently during
    # training and inference, such as dropout or batch normalization;
    # since we don't use those here, train() and eval() are redundant
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model.forward(features)
        # loss function + activation function (softmax)
        loss = F.cross_entropy(logits, labels)

        # set gradients to zero to avoid accumulation
        optimizer.zero_grad()
        # calculate gradients
        loss.backward()
        # use the gradients to update the model parameters
        # to minimize the loss by multiplying the gradients
        # with the learning rate and adding the negative
        # result to the parameters
        optimizer.step()

        print(f"[EPOCH] {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx+1:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:0.2f}")
    model.eval()

[EPOCH] 001/003 | Batch 001/002 | Train/Val Loss: 0.75
[EPOCH] 001/003 | Batch 002/002 | Train/Val Loss: 0.65
[EPOCH] 002/003 | Batch 001/002 | Train/Val Loss: 0.44
[EPOCH] 002/003 | Batch 002/002 | Train/Val Loss: 0.13
[EPOCH] 003/003 | Batch 001/002 | Train/Val Loss: 0.03
[EPOCH] 003/003 | Batch 002/002 | Train/Val Loss: 0.00
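
The update that `optimizer.step()` performs for plain SGD is `w = w - lr * grad`. The sketch below checks this on a tiny illustrative parameter `w` with a made-up quadratic loss, not on the network trained above.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.5)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
optimizer.zero_grad()
loss.backward()

# Manual update, computed before step() modifies w in place
expected = (w - 0.5 * w.grad).detach().clone()

optimizer.step()
print(w.detach(), expected)
```

Both printed tensors should match, since SGD without momentum or weight decay applies exactly this rule.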


In [62]:
# now we can make predictions
model.eval()

torch.set_printoptions(sci_mode=False) # only changes notation
with torch.no_grad():
    # the outputs are logits, so we apply softmax to get the probabilities
    probs = torch.softmax(model(x_train), dim=1)

print(probs)


tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])


In [63]:
# get the index of the highest value in each row (dim=1 runs across the columns)
predictions = torch.argmax(probs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


In [65]:
# Check the results manually
print(predictions == y_train)

print(torch.sum(predictions == y_train))

tensor([True, True, True, True, True])
tensor(5)


We can check the results manually as above, but a reusable function generalizes this to any model and `DataLoader`:

In [66]:
def compute_accuracy(model, dataloader):
    model = model.eval()
    correct = 0.0
    total_examples = 0
    # process the data in small parts due to memory limitations
    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model.forward(features)
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)
    return (correct / total_examples).item()

In [68]:
print(f'[TRAIN ACCURACY] {compute_accuracy(model, train_loader) * 100}%')
print(f'[TEST ACCURACY] {compute_accuracy(model, test_loader) * 100}%')

[TRAIN ACCURACY] 100.0%
[TEST ACCURACY] 100.0%


## Save and load models

In [69]:
# to save a model we use its state_dict, which maps each layer in the
# model to its weights and biases
torch.save(model.state_dict(), "model.pth")

In [70]:
# to use the saved model we have to define a network whose architecture
# matches the original model
model = NeuralNetwork(2, 2)
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>
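
To see what a `state_dict` contains and that a round trip restores the same parameters, here is a minimal sketch. It uses a single illustrative `Linear` layer and an in-memory buffer as a stand-in for the `model.pth` file, so nothing is written to disk.

```python
import io
import torch

torch.manual_seed(0)
original = torch.nn.Linear(2, 2)

# The state_dict maps parameter names to tensors
print(list(original.state_dict().keys()))

# Save to an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# Same architecture, but a different random initialization
restored = torch.nn.Linear(2, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))

same = all(
    torch.equal(p, q)
    for p, q in zip(original.state_dict().values(),
                    restored.state_dict().values())
)
print(same)
```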

## Optimizing training with a GPU

In PyTorch, a device is where computation occurs and where data resides. Examples are the CPU and the GPU.

In [71]:
# with cpu
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)

tensor([5., 7., 9.])


In [72]:
# with gpu
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)

tensor([5., 7., 9.], device='cuda:0')


Now we can use the GPU to train our model with only a few changes:

In [74]:
import torch.nn.functional as F

torch.manual_seed(123)
model = NeuralNetwork(
    num_inputs=2, # because each input has 2 features
    num_outputs=2 # because labels in our data are 0 or 1
)

# NEW: DEVICE AS CUDA -------------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# NEW: TRANSFER MODEL TO GPU -----------------------
model.to(device)

# stochastic gradient descent with learning rate of 0.5
# learning rate is a tunable parameter, we need a value
# where the loss converges after some epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# another tunable parameter
num_epochs = 3

for epoch in range(num_epochs):
    # train() and eval() change the mode of the model, which is
    # important for components that behave differently during
    # training and inference, such as dropout or batch normalization;
    # since we don't use those here, train() and eval() are redundant
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        # NEW: TRANSFER DATA TO GPU ----------------------
        features, labels = features.to(device), labels.to(device)

        logits = model.forward(features)
        # loss function + activation function (softmax)
        loss = F.cross_entropy(logits, labels)

        # set gradients to zero to avoid accumulation
        optimizer.zero_grad()
        # calculate gradients
        loss.backward()
        # use the gradients to update the model parameters
        # to minimize the loss by multiplying the gradients
        # with the learning rate and adding the negative
        # result to the parameters
        optimizer.step()

        print(f"[EPOCH] {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx+1:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:0.2f}")
    model.eval()


[EPOCH] 001/003 | Batch 001/002 | Train/Val Loss: 0.75
[EPOCH] 001/003 | Batch 002/002 | Train/Val Loss: 0.65
[EPOCH] 002/003 | Batch 001/002 | Train/Val Loss: 0.44
[EPOCH] 002/003 | Batch 002/002 | Train/Val Loss: 0.13
[EPOCH] 003/003 | Batch 001/002 | Train/Val Loss: 0.03
[EPOCH] 003/003 | Batch 002/002 | Train/Val Loss: 0.00


The code above runs on only one device. To reduce training time even further, we can use multiple GPUs with distributed training.

The most basic strategy is DistributedDataParallel (DDP). It works as follows:
- PyTorch launches a separate process on each GPU
- Each process receives a copy of the model
- The DataLoader sends a unique minibatch to every GPU
- Each model copy computes different outputs and gradients; to keep the copies synchronized, the gradients are averaged across GPUs so that every copy applies the same weight update
- In this way, the training time should decrease as the number of GPUs increases.

> DDP doesn't work in Jupyter notebooks because it needs to spawn multiple processes, each with its own Python interpreter instance.
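
The gradient-averaging step can be illustrated without any GPUs. The sketch below is plain Python, not the DDP API: it assumes a toy one-parameter model `w * x` with a mean squared error loss, and two made-up minibatches standing in for the data seen by two GPUs.

```python
def mse_gradient(w, batch):
    # Gradient of mean((w * x - y) ** 2) with respect to w
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5  # every simulated replica starts with the same weight

# Each simulated GPU receives a different minibatch of (x, y) pairs
batch_gpu0 = [(1.0, 2.0), (2.0, 4.0)]
batch_gpu1 = [(3.0, 6.0), (4.0, 8.0)]

g0 = mse_gradient(w, batch_gpu0)
g1 = mse_gradient(w, batch_gpu1)

# Averaging the per-replica gradients
averaged = (g0 + g1) / 2

# Gradient a single device would compute on the combined batch
full_batch = mse_gradient(w, batch_gpu0 + batch_gpu1)

print(averaged, full_batch)  # -22.5 -22.5
```

When the minibatches have equal size, the averaged gradient equals the gradient of the combined batch, so every replica ends up applying the same update as a single device processing a larger batch.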

In [None]:
# Code for DDP is in pytorch_one_hour_DDP.py

## Resources:

https://sebastianraschka.com/teaching/pytorch-1h/