# PyTorch

PyTorch has three main components:
1. PyTorch tensors: NumPy-like arrays with GPU support.
2. Autograd, an automatic differentiation engine that computes the gradients of tensor operations (used, for example, in backpropagation).
3. A deep learning library that provides pre-trained models, loss functions, etc.

The sections below go through each component.

## PyTorch tensors

In [1]:
import torch
torch.__version__


'2.7.0+cu126'

In [2]:
print(torch.cuda.is_available())

True
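
As a minimal sketch of how the availability check is typically used, the snippet below picks a device and moves a tensor onto it. The names `device`, `cpu_tensor`, and `dev_tensor` are illustrative and not part of the original notebook; on a machine without a GPU the code simply stays on the CPU.

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cpu_tensor = torch.tensor([1.0, 2.0, 3.0])  # tensors are created on the CPU by default
dev_tensor = cpu_tensor.to(device)          # same values, now on `device`

# Operands must live on the same device; the result stays on that device.
result = dev_tensor + dev_tensor
print(result.device.type)
```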


In [4]:
# 0D tensor (scalar)
tensor0d = torch.tensor(1)
print(tensor0d)
# 1D tensor (vector)
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d)
# 2D tensor (matrix)
tensor2d = torch.tensor([[1, 2], [3, 4]])
print(tensor2d)
# 3D tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(tensor3d)

tensor(1)
tensor([1, 2, 3])
tensor([[1, 2],
        [3, 4]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [5]:
# default 64-bit integer
print(tensor1d.dtype)

torch.int64


In [6]:
# floats default to 32-bit precision
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


In [7]:
# change type
tensor1d_float = tensor1d.to(torch.float32)
print(tensor1d_float.dtype)

torch.float32
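
The dtype can also be set when the tensor is created, and element-wise operations between integer and floating-point tensors promote the result to floating point. The following is a small sketch of both points; the variable names are illustrative.

```python
import torch

ints = torch.tensor([1, 2, 3])                         # inferred as torch.int64
floats = torch.tensor([1, 2, 3], dtype=torch.float32)  # dtype set at creation

# Element-wise operations promote integer + float operands to a float result.
mixed = ints + floats
print(ints.dtype, floats.dtype, mixed.dtype)
```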


In [8]:
# shape of a tensor
print(tensor0d.shape)
print(tensor1d.shape)
print(tensor2d.shape)
print(tensor3d.shape)

torch.Size([])
torch.Size([3])
torch.Size([2, 2])
torch.Size([2, 2, 2])


In [10]:
# reshape a tensor
tensor2d.reshape(4, 1)

tensor([[1],
        [2],
        [3],
        [4]])

In [11]:
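# view is the more common way to reshape; unlike reshape, it never
# copies the data and therefore requires a compatible memory layout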
tensor2d.view(4, 1)

tensor([[1],
        [2],
        [3],
        [4]])
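
The practical difference between `view` and `reshape` shows up with non-contiguous tensors. The sketch below, which uses its own small tensor `t` rather than the notebook's variables, illustrates that a view shares memory with the original and that `view` fails on a transposed tensor while `reshape` falls back to a copy.

```python
import torch

t = torch.tensor([[1, 2], [3, 4]])

# A view shares storage with the original tensor: writing through it changes `t`.
v = t.view(4)
v[0] = 100
print(t[0, 0].item())

# Transposing only swaps strides, so the result is no longer contiguous in memory.
tt = t.T
print(tt.is_contiguous())

# view() cannot express this layout without copying, so it raises ...
try:
    tt.view(4)
    view_failed = False
except RuntimeError:
    view_failed = True

# ... while reshape() falls back to making a copy.
flat = tt.reshape(4)
print(view_failed, flat)
```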

In [12]:
# Transpose
tensor2d.T

tensor([[1, 3],
        [2, 4]])

In [13]:
# matrix multiplication with the matmul method
tensor2d.matmul(tensor2d.T)

tensor([[ 5, 11],
        [11, 25]])

In [14]:
# same matrix multiplication with the @ operator
tensor2d @ tensor2d.T

tensor([[ 5, 11],
        [11, 25]])
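
Matrix multiplication should not be confused with element-wise multiplication. As a short sketch (the tensors `m`, `a`, and `b` are illustrative), `*` multiplies matching entries, while `@` requires the inner dimensions of the two operands to agree:

```python
import torch

m = torch.tensor([[1, 2], [3, 4]])

elementwise = m * m  # multiplies matching entries
matrix = m @ m       # rows of the left operand times columns of the right

a = torch.ones(2, 3)
b = torch.ones(3, 4)
product = a @ b      # inner dimensions (3 and 3) match, result is 2 x 4

print(elementwise)
print(matrix)
print(product.shape)
```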

## PyTorch autograd engine

In [16]:
# Suppose we have a model with the weight w1 and the bias b.
# To compute the gradients, PyTorch builds a computation graph in the
# background, as shown in the figure below
import torch.nn.functional as F

y = torch.tensor([1.0]) # true label
x1 = torch.tensor([1.1]) # input
w1 = torch.tensor([2.2]) # weight
b = torch.tensor([0.0]) # bias

z = x1 * w1 + b
a = torch.sigmoid(z) # predicted label

loss = F.binary_cross_entropy(a, y)
print("[a]", a)
print("[y]", y)
print("[loss]", loss)

[a] tensor([0.9183])
[y] tensor([1.])
[loss] tensor(0.0852)


The following figure illustrates the graph of the above 'model'.

As long as at least one leaf node of the graph (here the parameters `w1` and `b`) has its `requires_grad` attribute set to `True`, PyTorch builds the graph needed to compute the gradients of the final node, in this case `loss = L(a,y)`.

PyTorch computes the gradients from right to left: it starts at the output (the loss) and works backward to the inputs. This procedure is called backpropagation.

In this way, PyTorch obtains the gradient of the loss with respect to each parameter (weights and biases), which is then used to update the parameters during training.

![pytorch_automatic_differentiation.png](./images/pytorch_automatic_differentiation.png)


In [17]:
# In the previous cell PyTorch did not build the graph because no leaf
# tensor had requires_grad set to True. Here w1 and b do, so the graph
# is built.

# This is where the automatic differentiation engine matters:
# given the graph, the engine can compute the gradients with the
# grad function.

import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)
# by default, the graph is deleted after the gradients are computed
# we retain it to use it later
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)
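
These values can be checked by hand with the chain rule. For a sigmoid followed by binary cross-entropy, the derivative of the loss with respect to `z` simplifies to `a - y`, and since `z = x1 * w1 + b`, the gradients are `(a - y) * x1` for `w1` and `a - y` for `b`. The sketch below recomputes the same toy model and compares the manual result with autograd; the names `manual_w1`, `manual_b`, `auto_w1`, and `auto_b` are introduced here for illustration.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)

# Gradients from the autograd engine, both inputs in one call.
auto_w1, auto_b = torch.autograd.grad(loss, (w1, b))

# Chain rule by hand: dL/dz = a - y for sigmoid + binary cross-entropy,
# dz/dw1 = x1 and dz/db = 1.
with torch.no_grad():
    manual_b = a - y
    manual_w1 = (a - y) * x1

print(manual_w1, auto_w1)
print(manual_b, auto_b)
```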


In [18]:
# In practice, gradients are usually computed with the backward
# method, which stores the results in the grad attribute of each leaf tensor
loss.backward()
print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
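
One detail worth knowing about `backward` is that it accumulates into the `grad` attribute rather than overwriting it, which is why training loops reset gradients between steps. The following sketch uses a hypothetical one-parameter loss `w ** 2` (not from the notebook) to show the effect:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added to the old one.
(w ** 2).sum().backward()
second = w.grad.clone()

# Reset the accumulated gradient in place before the next step.
w.grad.zero_()
after_reset = w.grad.clone()

print(first, second, after_reset)
```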


## PyTorch as a deep learning library

Create an MLP with PyTorch similar to the one in the image below. The network has:

- 50 inputs
- 30 neurons in the 1st layer, resulting in:
    - 50x30 weights to learn
    - 30 linear equations, hence 30 biases
- 20 neurons in the 2nd layer, resulting in:
    - 30x20 weights to learn
    - 20 linear equations, hence 20 biases
- 3 outputs, resulting in:
    - 20x3 weights to calculate
    - 3 biases

In total, the network has 1530 + 620 + 63 = 2213 parameters to learn.
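
The count above can be reproduced with plain arithmetic: each fully connected layer contributes `n_in * n_out` weights plus `n_out` biases. The list `layer_sizes` below is an illustrative encoding of the architecture described in the bullet points.

```python
# Layer widths from input to output: 50 -> 30 -> 20 -> 3
layer_sizes = [50, 30, 20, 3]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = n_in * n_out  # one weight per input/neuron pair
    biases = n_out          # one bias per neuron
    total += weights + biases
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")

print(total)
```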

![mlp.png](./images/mlp.png)



In [39]:
# our class subclasses torch.nn.Module, which lets us encapsulate the
# layers and operations and keeps track of the model's parameters
class NeuralNetwork(torch.nn.Module):
    # to define the network layers
    def __init__(self, num_inputs, num_outputs):
        # calls the Module class constructor
        super().__init__()
        # encapsulate all the layers
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs)
        )
    # to define how the input data passes through the network
    def forward(self, x):
        logits = self.layers(x)
        return logits

In [40]:
# reproducibility
torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


In [41]:
# same result as the count computed by hand above
num_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print('[Parameters]', num_params)

[Parameters] 2213


In [42]:
# As the filter above assumes, trainable parameters have requires_grad
# set to True. This is the case for the Linear layers, for example the
# weights of the first Linear layer, which are initialized randomly:
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)


In [43]:
print(model.layers[0].weight.shape)
print(model.layers[0].bias.shape)

torch.Size([30, 50])
torch.Size([30])
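
Note that the weight matrix is stored as `(out_features, in_features)`, i.e. 30x50 rather than 50x30. A `Linear` layer computes `x @ W.T + b`, which the following sketch checks on a standalone layer (the names `layer`, `out_layer`, and `out_manual` are illustrative and independent of the model above):

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(50, 30)
x = torch.rand(1, 50)

out_layer = layer(x)                          # what the layer computes
out_manual = x @ layer.weight.T + layer.bias  # the same computation written out

print(out_layer.shape)
print(torch.allclose(out_layer, out_manual, atol=1e-6))
```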


In [44]:
# forward pass without training

# tensor of inputs
X = torch.rand((1, 50))
out = model(X)  # calling the module runs forward()
print(out)

tensor([[-0.1670,  0.1001, -0.1219]], grad_fn=<AddmmBackward0>)


The `grad_fn` shown in the previous output is the last function used to compute the result in the graph; PyTorch uses this information during backpropagation. Here, `Addmm` stands for a matrix multiplication (`mm`) followed by an addition (`Add`).

When a model is used only for inference rather than training, the graph is not needed, and building it wastes memory and computation. In that case it is better to disable graph construction:

In [45]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1670,  0.1001, -0.1219]])
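
The effect of `torch.no_grad()` can be verified by inspecting the output tensor. In this sketch a single `Linear` layer stands in for the full model (an assumption made to keep the example self-contained): the tracked output carries a `grad_fn`, the untracked one does not.

```python
import torch

layer = torch.nn.Linear(50, 3)  # stand-in for the full model
X = torch.rand(1, 50)

tracked = layer(X)              # graph is recorded
with torch.no_grad():
    untracked = layer(X)        # graph recording is disabled

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn)
```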


**Common practice**
Create models that return `logits` (unbounded real numbers) as outputs, without a final activation function. The reason is that PyTorch's loss functions combine the activation function and the loss for efficiency and numerical stability (terms cancel when the two are fused), so these combined functions expect `logits` as input rather than probabilities.
For example:
- CrossEntropyLoss = LogSoftmax + NLLLoss
- BCEWithLogitsLoss = Sigmoid + BCELoss
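
Both identities can be checked numerically with the functional versions of these losses. The sketch below uses random logits and made-up targets (illustrative data, not from the notebook) and compares the fused loss with the two-step computation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class case: cross_entropy vs. log_softmax followed by nll_loss
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
ce = F.cross_entropy(logits, targets)
ce_two_steps = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Binary case: BCE with logits vs. sigmoid followed by BCE
z = torch.randn(4)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
bce_logits = F.binary_cross_entropy_with_logits(z, labels)
bce_two_steps = F.binary_cross_entropy(torch.sigmoid(z), labels)

print(torch.allclose(ce, ce_two_steps, atol=1e-6))
print(torch.allclose(bce_logits, bce_two_steps, atol=1e-6))
```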

In [46]:
# apply the activation function outside the model definition;
# softmax ensures that all output values are positive and
# sum to 1
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.2983, 0.3896, 0.3121]])
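
Because softmax preserves the ordering of its inputs, the predicted class can be read from either the probabilities or the raw logits. A minimal sketch, reusing the logits printed earlier as a hard-coded tensor:

```python
import torch

# Logits copied from the forward-pass output above
logits = torch.tensor([[-0.1670, 0.1001, -0.1219]])
probs = torch.softmax(logits, dim=1)

# The index of the largest value is the predicted class.
pred_from_probs = torch.argmax(probs, dim=1)
pred_from_logits = torch.argmax(logits, dim=1)

print(probs.sum(dim=1))
print(pred_from_probs, pred_from_logits)
```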
