# Introduction to PyTorch

The current notebook covers:
- An overview of the PyTorch deep learning library
- Setting up an environment and workspace for deep learning
- Tensors as a fundamental data structure for deep learning
- The mechanics of training deep neural networks
- Training models on GPUs

## 1. What is PyTorch?

**PyTorch** (https://pytorch.org/) is an open-source Python-based deep learning
library. 

According to Papers With Code (https://paperswithcode.com/trends),
a platform that tracks and analyzes research papers, PyTorch has been the
most widely used deep learning library for research since 2019 by a wide
margin.

Firstly, PyTorch is a tensor library that extends the array-oriented
programming model of NumPy with accelerated computation on GPUs,
providing a seamless switch between CPUs and GPUs.

Secondly, PyTorch is an automatic differentiation engine, also known as
autograd, which enables the automatic computation of gradients for tensor
operations, simplifying backpropagation and model optimization.

Finally, PyTorch is a deep learning library, meaning that it offers modular,
flexible, and efficient building blocks (including pre-trained models, loss
functions, and optimizers) for designing and training a wide range of deep
learning models, catering to both researchers and developers.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/1.webp" width="600px">

In this lab, we are going to use PyTorch 2.0.1.

Instructions on how to install PyTorch locally on your machine can be found at: https://pytorch.org/get-started/previous-versions/#v201

We will continue with the installation of PyTorch in the next cell.

(it can take a few minutes to install)

In [None]:
## only if you don't already have pytorch installed
%pip install torch==2.0.1

After installing PyTorch, we need to restart the kernel to use it.

In [1]:
# import PyTorch and check the version
import torch
print(torch.__version__)

2.0.1+cu117


If you are running the code on Google Colab, you can change the runtime and run the code on the GPU. To do this, go to the Runtime tab, click on Change runtime type, and select GPU from the Hardware accelerator dropdown.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/5.webp" width="700px">

Whether you are using Google Colab or your local machine, we can check if the GPU is available by running the following code:

In [2]:
print(torch.cuda.is_available())

True


If the command returns True, you are all set. If the command returns False, your computer may not have a compatible GPU, or PyTorch does not recognize it. However, you can still run the code on the CPU, as a GPU is not required for the main part of this lab.

## 2. Understanding tensors

Tensors represent a mathematical concept that generalizes vectors and
matrices to potentially higher dimensions. In other words, tensors are
mathematical objects that can be characterized by their order (or rank), which
provides the number of dimensions. For example, a scalar (just a number) is a
tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank
2.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/6.webp" width="400px">
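
As a quick check of the terminology above, the following sketch reads the rank of a few tensors through the ``.ndim`` attribute and the size of each dimension through ``.shape``. The variable names are chosen for illustration only.

```python
import torch

scalar = torch.tensor(1)                  # rank 0
vector = torch.tensor([1, 2, 3])          # rank 1
matrix = torch.tensor([[1, 2], [3, 4]])   # rank 2

for t in (scalar, vector, matrix):
    # .ndim is the number of dimensions, .shape lists the size of each one
    print(t.ndim, tuple(t.shape))
```

A scalar has an empty shape because it has no dimensions to index.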

### 2.1. Scalars, vectors, matrices, and tensors

We can create objects of PyTorch's Tensor class using the torch.tensor
function as follows:

In [3]:
import torch
import numpy as np

# create a 0D tensor (scalar) from a Python integer
tensor0d = torch.tensor(1)

# create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])

# create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2], 
                         [3, 4]])

# create a 3D tensor from a nested Python list
tensor3d_1 = torch.tensor([[[1, 2], [3, 4]], 
                           [[5, 6], [7, 8]]])

# create a 3D tensor from NumPy array
ary3d = np.array([[[1, 2], [3, 4]], 
                  [[5, 6], [7, 8]]])
tensor3d_2 = torch.tensor(ary3d)  # Copies NumPy array
tensor3d_3 = torch.from_numpy(ary3d)  # Shares memory with NumPy array

In [4]:
ary3d[0, 0, 0] = 999
print(tensor3d_2) # remains unchanged

tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [5]:
print(tensor3d_3) # changes because of memory sharing

tensor([[[999,   2],
         [  3,   4]],

        [[  5,   6],
         [  7,   8]]])


### 2.2. Tensor data types

PyTorch adopts the default 64-bit integer data type from Python. We can
access the data type of a tensor via the ``.dtype`` attribute of a tensor:

In [6]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)

torch.int64


If we create tensors from Python floats, PyTorch creates tensors with a 32-bit
precision by default, as we can see below:

In [7]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


This choice is primarily due to the balance between precision and
computational efficiency. A 32-bit floating point number offers sufficient
precision for most deep learning tasks, while consuming less memory and
computational resources than a 64-bit floating point number. Moreover, GPU
architectures are optimized for 32-bit computations, and using this data type
can significantly speed up model training and inference.
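
To make the memory argument concrete, the sketch below compares the storage needed by the same three values in 32-bit and 64-bit precision. It relies on ``element_size()``, which reports the number of bytes per element; the variable names are illustrative.

```python
import torch

vec32 = torch.tensor([1.0, 2.0, 3.0])                        # float32 by default
vec64 = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)   # explicit 64-bit floats

# bytes per element times number of elements
bytes32 = vec32.element_size() * vec32.numel()
bytes64 = vec64.element_size() * vec64.numel()
print(bytes32, bytes64)
```

The 64-bit tensor takes twice the memory, which adds up quickly for models with millions of parameters.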

Moreover, it is possible to readily change the precision using a tensor's .to
method. The following code demonstrates this by changing a 64-bit integer
tensor into a 32-bit float tensor:

In [8]:
floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)

torch.float32


### 2.3. Common PyTorch tensor operations

In [9]:
tensor2d = torch.tensor([[1, 2, 3], 
                         [4, 5, 6]])
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

Using ``.shape`` attribute allows us to access the shape of a tensor:

In [10]:
tensor2d.shape

torch.Size([2, 3])

As you can see above, ``.shape`` returns [2, 3], which means that the tensor
has 2 rows and 3 columns. To reshape the tensor into a 3 by 2 tensor, we can use the ``.reshape`` method:

In [11]:
tensor2d.reshape(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

However, note that the more common command for reshaping tensors in
PyTorch is ``.view()``:

In [12]:
tensor2d.view(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

Next, we can use ``.T`` to transpose a tensor, which means flipping it across its
diagonal. Note that this is not the same as reshaping a tensor, as you can see
from the result below:

In [13]:
tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])
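
The following sketch, with illustrative variable names, makes two points about the operations above. First, transposing and reshaping to the same shape give different element orders. Second, ``.view()`` only works when the requested shape is compatible with the tensor's memory layout, whereas ``.reshape()`` falls back to copying the data when it is not.

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# Both results have shape (3, 2), but the elements are ordered differently
same_as_reshape = torch.equal(t.T, t.reshape(3, 2))
print(same_as_reshape)

# The transposed tensor is not contiguous in memory, so .view() refuses to
# flatten it, while .reshape() copies the data instead
try:
    t.T.view(6)
    view_worked = True
except RuntimeError:
    view_worked = False

flat = t.T.reshape(6)
print(view_worked, flat)
```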

Lastly, the common way to multiply two matrices in PyTorch is the ``.matmul``
method:

In [14]:
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

However, we can also adopt the ``@`` operator, which accomplishes the same
thing more compactly:

In [15]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])

## 3. Seeing models as computation graphs

PyTorch's automatic differentiation engine, autograd, provides functions to compute gradients in dynamic computational graphs automatically.

A computational graph (or computation graph in short) is a directed graph
that allows us to express and visualize mathematical expressions. In the
context of deep learning, a computation graph lays out the sequence of
calculations needed to compute the output of a neural network -- we will need
this later to compute the required gradients for backpropagation, which is the
main training algorithm for neural networks.

### 3.1. A logistic regression forward pass

Let's look at a concrete example to illustrate the concept of a computation
graph. The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer
neural network, returning a score between 0 and 1 that is compared to the true
class label (0 or 1) when computing the loss:

In [16]:
import torch.nn.functional as F

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)


The point of this example is to illustrate how we can think of a sequence of computations as a
computation graph, as shown below:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/7.webp" width="600px">

PyTorch builds a computation graph in the background, and we
can use this to calculate gradients of a loss function with respect to the model
parameters (here ``w1`` and ``b``) to train the model, which is the topic of the
upcoming sections.

## 4. Automatic differentiation made easy

If we carry out computations in PyTorch, it will build such a graph internally by
default if one of its terminal nodes has the ``requires_grad`` attribute set to
``True``. This is useful if we want to compute gradients. Gradients are required
when training neural networks via the popular backpropagation algorithm,
which can be thought of as an implementation of the chain rule from calculus
for neural networks, as illustrated below:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/8.webp" width="600px">

### 4.1. Partial derivatives and gradients

The above figure shows partial derivatives, which measure the rate at which a
function changes with respect to one of its variables. A gradient is a vector
containing all of the partial derivatives of a multivariate function, a function
with more than one variable as input.

If you are not familiar with or don't remember partial derivatives, gradients, or
the chain rule from calculus, don't worry. On a high level, all you need to
know is that the chain rule is a way to compute gradients of a
loss function with respect to the model's parameters in a computation graph.
This provides the information needed to update each parameter in a way that
minimizes the loss function, which serves as a proxy for measuring the
model's performance, using a method such as gradient descent.

### 4.2. Computing gradients via autograd

By tracking every operation performed on tensors, PyTorch's autograd engine
constructs a computational graph in the background. Then, calling the ``grad``
function, we can compute the gradients of the loss with respect to the model
parameters ``w1`` and ``b`` as follows:

In [17]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b 
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)


Above, we have been using the grad function "manually," which can be
useful for experimentation, debugging, and demonstrating concepts. But in
practice, PyTorch provides even more high-level tools to automate this
process. For instance, we can call ``.backward`` on the loss, and PyTorch will
compute the gradients of all the leaf nodes in the graph, which will be stored
via the tensors' ``.grad`` attributes:

In [18]:
loss.backward()

print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
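
As a sanity check, the gradients above can be reproduced by applying the chain rule by hand. For a sigmoid output combined with the binary cross-entropy loss, the derivative of the loss with respect to the net input ``z`` simplifies to ``a - y``, so the gradient for ``w1`` is ``(a - y) * x1`` and the gradient for ``b`` is ``a - y``. The sketch below compares these expressions with autograd and then performs a single gradient descent update. The helper ``forward_loss`` and the learning rate of 0.5 are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

def forward_loss():
    # forward pass of the logistic regression example
    a = torch.sigmoid(x1 * w1 + b)
    return a, F.binary_cross_entropy(a, y)

a, loss_before = forward_loss()
loss_before.backward()

# Chain rule by hand: dL/dz = a - y, dz/dw1 = x1, dz/db = 1
manual_grad_w1 = ((a - y) * x1).detach()
manual_grad_b = (a - y).detach()
print(torch.allclose(w1.grad, manual_grad_w1),
      torch.allclose(b.grad, manual_grad_b))

# One gradient descent step with a hypothetical learning rate
lr = 0.5
with torch.no_grad():
    w1 -= lr * w1.grad
    b -= lr * b.grad

_, loss_after = forward_loss()
print(loss_before.item(), loss_after.item())
```

Stepping against the gradient lowers the loss, which is the update that an optimizer automates in a training loop.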


## 5. Implementing multilayer neural networks

This section focuses on PyTorch as a library for implementing
deep neural networks. We focus on a multilayer perceptron, which is
a fully connected neural network.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/9.webp" width="500px">

When implementing a neural network in PyTorch, we typically subclass the
``torch.nn.Module`` class to define our own custom network architecture. This
Module base class provides a lot of functionality, making it easier to build and
train models. For instance, it allows us to encapsulate layers and operations
and keep track of the model's parameters.


Within this subclass, we define the network layers in the ``__init__``
constructor and specify how they interact in the forward method. The ``forward``
method describes how the input data passes through the network and comes
together as a computation graph.

In contrast, the ``backward`` method (shown before), which we typically do not need to
implement ourselves, is used during training to compute gradients of the loss
function with respect to the model parameters.

### 5.1. A multilayer perceptron with two hidden layers

The following code implements a classic multilayer perceptron with two
hidden layers to illustrate a typical usage of the Module class:

In [19]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(
                
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

We can then instantiate a new neural network object as follows:

In [20]:
model = NeuralNetwork(50, 3)

But before using this new model object, it is often useful to call ``print`` on the
model to see a summary of its structure:

In [21]:
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


Note that we used the ``Sequential`` class when we implemented the ``NeuralNetwork`` class. Using ``Sequential`` is not required, but it can make our
life easier if we have a series of layers that we want to execute in a specific
order, as is the case here. This way, after instantiating ``self.layers =
Sequential(...)`` in the ``__init__`` constructor, we just have to call
``self.layers`` instead of calling each layer individually in the
``NeuralNetwork``'s ``forward`` method.

Let's check the total number of trainable parameters of this model:

In [22]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


Note that each parameter for which ``requires_grad=True`` counts as a
trainable parameter and will be updated during training.

In the case of our neural network model with the two hidden layers above,
these trainable parameters are contained in the ``torch.nn.Linear`` layers. A
linear layer multiplies the inputs with a weight matrix and adds a bias vector.
This is sometimes also referred to as a feedforward or fully connected layer.
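
The sketch below spells out what a linear layer computes and where the parameter count comes from. It uses a standalone ``torch.nn.Linear(50, 30)`` layer as a stand-in for the first hidden layer; the variable names are illustrative.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(50, 30)
x = torch.rand(1, 50)

# weight has shape (out_features, in_features), hence the transpose
manual_out = x @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x), manual_out, atol=1e-6)

# 50 * 30 weights plus 30 biases
layer_params = sum(p.numel() for p in layer.parameters())

# The same arithmetic for all three Linear layers of the model above
total_params = (50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3)
print(matches, layer_params, total_params)
```

The total agrees with the 2213 trainable parameters reported for the model.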

Based on the ``print(model)`` call we executed above, we can see that the first
``Linear`` layer is at index position 0 in the ``layers`` attribute. We can access the
corresponding weight parameter matrix as follows:

In [23]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 0.1182,  0.0606, -0.1292,  ..., -0.1126,  0.0735, -0.0597],
        [-0.0249,  0.0154, -0.0476,  ..., -0.1001, -0.1288,  0.1295],
        [ 0.0641,  0.0018, -0.0367,  ..., -0.0990, -0.0424, -0.0043],
        ...,
        [ 0.0618,  0.0867,  0.1361,  ..., -0.0254,  0.0399,  0.1006],
        [ 0.0842, -0.0512, -0.0960,  ..., -0.1091,  0.1242, -0.0428],
        [ 0.0518, -0.1390, -0.0923,  ..., -0.0954, -0.0668, -0.0037]],
       requires_grad=True)


Since this is a large matrix that is not shown in its entirety, let's use the
``.shape`` attribute to show its dimensions:

In [25]:
print(model.layers[0].weight.shape)

torch.Size([30, 50])


The weight matrix above is a 30x50 matrix, and we can see that the
``requires_grad`` is set to True, which means its entries are trainable -- this is
the default setting for weights and biases in ``torch.nn.Linear``.

Note that if you execute the code above on your computer, the numbers in the
weight matrix will likely differ from those shown above. This is because the
model weights are initialized with small random numbers, which are different
each time we instantiate the network. In deep learning, initializing model
weights with small random numbers is desired to break symmetry during
training -- otherwise, the nodes would be just performing the same operations
and updates during backpropagation, which would not allow the network to
learn complex mappings from inputs to outputs.

However, while we want to keep using small random numbers as initial
values for our layer weights, we can make the random number initialization
reproducible by seeding PyTorch's random number generator via
``manual_seed``:

In [24]:
torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
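
The effect of seeding can be checked directly. In the sketch below, two layers created right after the same ``manual_seed`` call receive identical initial weights, while a third layer created without reseeding does not. The layer names are illustrative.

```python
import torch

torch.manual_seed(123)
layer_a = torch.nn.Linear(50, 30)

torch.manual_seed(123)
layer_b = torch.nn.Linear(50, 30)   # same seed, same initial weights

layer_c = torch.nn.Linear(50, 30)   # generator has advanced, so weights differ

print(torch.equal(layer_a.weight, layer_b.weight))
print(torch.equal(layer_a.weight, layer_c.weight))
```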


Let's see how we use the ``NeuralNetwork`` model in a ``forward`` pass:

In [26]:
torch.manual_seed(123)

X = torch.rand((1, 50))
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In the code above, we generated a single random training example ``X`` as a toy
input (note that our network expects 50-dimensional feature vectors) and fed
it to the model, returning three scores. When we call ``model(X)``, it will
automatically execute the forward pass of the model.

The forward pass refers to calculating output tensors from input tensors. This
involves passing the input data through all the neural network layers, starting
from the input layer, through hidden layers, and finally to the output layer.

These three numbers returned above correspond to a score assigned to each
of the three output nodes. Notice that the output tensor also includes a
``grad_fn`` value.

``grad_fn=<AddmmBackward0>`` represents the last-used function to
compute a variable in the computational graph. It specifies the
operation that was performed. In this case, it is an Addmm operation. Addmm
stands for matrix multiplication (mm) followed by an addition (Add).

When we use a model for inference (for instance, making predictions) rather than
training, it is a best practice to use the ``torch.no_grad()`` context manager, as
shown below. This tells PyTorch that it doesn't need to keep track of the
gradients, which can result in significant savings in memory and
computation.

In [27]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])
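
One way to confirm that no graph is built inside ``torch.no_grad()`` is to inspect the ``requires_grad`` and ``grad_fn`` attributes of the output. The sketch below uses a single ``torch.nn.Linear`` layer as a small stand-in for the model above.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(50, 3)   # small stand-in for the model above
X = torch.rand(1, 50)

out_tracked = layer(X)           # graph is recorded
with torch.no_grad():
    out_untracked = layer(X)     # no graph is recorded

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```

Both calls return the same values; only the bookkeeping for backpropagation differs.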


In PyTorch, it's common practice to code models such that they return the
outputs of the last layer (logits) without passing them to a nonlinear
activation function. That's because PyTorch's commonly used loss functions
combine the ``softmax`` (or ``sigmoid`` for binary classification) operation with the
negative log-likelihood loss in a single class. The reason for this is numerical
efficiency and stability. So, if we want to compute class-membership
probabilities for our predictions, we have to call the ``softmax`` function
explicitly:

In [28]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.3113, 0.3934, 0.2952]])


The values can now be interpreted as class-membership probabilities that
sum up to 1. The values are roughly equal for this random input, which is
expected for a randomly initialized model without training.
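
The statement that the loss function already includes the softmax step can be verified with the functional API. In the sketch below, ``F.cross_entropy`` applied to raw logits gives the same value as a log-softmax followed by the negative log-likelihood loss. The target label is a hypothetical value chosen for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])  # scores from the example above
target = torch.tensor([1])                           # hypothetical true class label

loss_from_logits = F.cross_entropy(logits, target)
loss_two_steps = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(loss_from_logits, loss_two_steps))
```

Passing softmax probabilities to ``F.cross_entropy`` would therefore apply the softmax twice, which is why the model returns logits.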

## 6. Setting up efficient data loaders

We have defined a custom neural network model above. Before
we can train this model, we have to briefly talk about creating efficient data
loaders in PyTorch, which we will iterate over when training the model.

We will implement a
custom ``Dataset`` class that we will use to create a training and a test dataset
that we'll then use to create the data loaders.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/10.webp" width="600px">

### 6.1. Creating a small toy dataset

Let's start by creating a simple toy dataset of five training examples with two
features each. Accompanying the training examples, we also create a tensor
containing the corresponding class labels: three examples belong to class 0,
and two examples belong to class 1.

##### Class label numbering

PyTorch requires that class labels start with label 0, and the largest class label
value should not exceed the number of output nodes minus 1 (since Python
index counting starts at 0). So, if we have class labels 0, 1, 2, 3, and 4, the
neural network output layer should consist of 5 nodes.

In [29]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

In addition, we also make a test set
consisting of two entries.

In [30]:
X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

### 6.2. Defining a custom Dataset class

We create a custom dataset class, ``ToyDataset``, by subclassing from
PyTorch's ``Dataset`` parent class.

In PyTorch, the three main components of a custom Dataset class are the
``__init__`` constructor, the ``__getitem__`` method, and the ``__len__`` method.

In the ``__init__`` method, we set up attributes that we can access later in the
``__getitem__`` and ``__len__`` methods. This could be file paths, file objects,
database connectors, and so on. Since we created a tensor dataset that sits in
memory, we are simply assigning X and y to these attributes, which are
placeholders for our tensor objects.

In the ``__getitem__`` method, we define instructions for returning exactly one
item from the dataset via an index. This means the features and the class
label corresponding to a single training example or test instance.

Finally, the ``__len__`` method contains instructions for retrieving the length
of the dataset. Here, we use the ``.shape`` attribute of a tensor to return the
number of rows in the feature array.

In [31]:
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]        
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

In the case of the training dataset, we
have five rows, which we can double-check as follows:

In [32]:
len(train_ds)

5
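
For tensors that already sit in memory, PyTorch also provides a ready-made ``TensorDataset`` class with the same indexing and length behavior as the custom class above. The sketch below is an alternative shown for comparison, not a replacement for ``ToyDataset`` in the rest of this lab.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6], [2.3, -1.1], [2.7, -1.5]])
y = torch.tensor([0, 0, 0, 1, 1])

ds = TensorDataset(X, y)
features, label = ds[3]   # indexing calls __getitem__(3)
print(len(ds), features, label)
```

A custom ``Dataset`` subclass remains the more flexible choice when examples have to be read from files or transformed on the fly.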

### 6.3. Instantiating data loaders

The purpose of this custom ``ToyDataset`` class is to instantiate a PyTorch
``DataLoader``, which is used to sample from the ``ToyDataset``.

In [33]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

In [34]:
test_ds = ToyDataset(X_test, y_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

After instantiating the training data loader, we can iterate over it as shown
below.

In [35]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])


As we can see based on the output above, the ``train_loader`` iterates over the
training dataset visiting each training example exactly once. This is known as
a training epoch. Since we seeded the random number generator using
``torch.manual_seed(123)`` above, you should get the exact same shuffling
order of training examples as shown above. However, if you iterate over the
dataset a second time, you will see that the shuffling order will change. This
is desired to prevent deep neural networks from getting caught in repetitive update
cycles during training.

Note that we specified a batch size of 2 above, but the 3rd batch only
contains a single example. That's because we have five training examples,
which is not evenly divisible by 2. In practice, having a substantially smaller
batch as the last batch in a training epoch can disturb the convergence during
training. To prevent this, it's recommended to set ``drop_last=True``, which
will drop the last batch in each epoch, as shown below:

In [36]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

Now, iterating over the training loader, we can see that the last batch is
omitted:

In [37]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
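
The effect of ``drop_last`` on the number of batches can also be read off with ``len()``. The sketch below uses five dummy examples wrapped in a ``TensorDataset``; the loader names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(5))   # five dummy examples

keep_last = DataLoader(ds, batch_size=2, drop_last=False)
drop_last = DataLoader(ds, batch_size=2, drop_last=True)

# len() of a DataLoader is the number of batches per epoch
print(len(keep_last), len(drop_last))

# batch sizes seen when the incomplete batch is kept
sizes = [len(batch[0]) for batch in keep_last]
print(sizes)
```

Note that with ``drop_last=True`` and shuffling enabled, a different example is left out in each epoch, so no example is excluded permanently.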


### 6.4. Data loading with and without multiple workers

The ``num_workers`` setting in the ``DataLoader`` is crucial for parallelizing data
loading and preprocessing. When ``num_workers`` is set to 0, the data loading
will be done in the main process and not in separate worker processes. This
might seem unproblematic, but it can lead to significant slowdowns during
model training when we train larger networks on a GPU. This is because
instead of focusing solely on the processing of the deep learning model, the
CPU must also take time to load and preprocess the data. As a result, the
GPU can sit idle while waiting for the CPU to finish these tasks. In contrast,
when ``num_workers`` is set to a number greater than zero, multiple worker
processes are launched to load data in parallel, freeing the main process to
focus on training your model and better utilizing your system's resources.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/11.webp" width="600px">

## 7. A typical training loop

We've discussed all the requirements for training neural networks: PyTorch's tensor library, autograd, the Module API, and efficient data loaders.
Let's now combine all these things and train a neural network on the toy
dataset from the previous section.

### 7.1. Neural network training in PyTorch

We will initialize a model with two inputs and two outputs. That's
because the toy dataset from the previous section has two input features and
two class labels to predict. We use a stochastic gradient descent (SGD)
optimizer with a learning rate (``lr``) of 0.5. The learning rate is a
hyperparameter, meaning it's a tunable setting that we have to experiment
with based on observing the loss. Ideally, we want to choose a learning rate
such that the loss converges after a certain number of epochs -- the number of
``epochs`` is another hyperparameter to choose.

It is important to include an ``optimizer.zero_grad()`` call in each update
round to reset the gradients to zero. Otherwise, the gradients will accumulate,
which may be undesired.

In [38]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):
    
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)
        
        loss = F.cross_entropy(logits, labels) # Loss function
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00


The loss drops to approximately zero after 3 epochs, a sign that the model
has converged on the training set.
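
To see why the ``optimizer.zero_grad()`` call in the training loop matters, the sketch below calls ``backward`` twice on the same toy expression without resetting the gradient in between. The tensor ``w`` and the expression ``3 * w`` are made up for this illustration.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()     # gradient of 3 * w with respect to w is 3

(3 * w).sum().backward()   # no reset in between
second = w.grad.clone()    # the new gradient was added to the old one

w.grad.zero_()             # reset by hand
print(first, second, w.grad)
```

Without the reset, each update step in the training loop would use the sum of the gradients from all previous batches instead of the gradient of the current batch.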

After training the model, we can use it to make predictions, as shown
below:

In [39]:
model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])


In the above cell, we can see the logits for the train samples.

To obtain the class membership probabilities, we can then use PyTorch's
softmax function. 

Let's consider the first row in the code output below. Here, the first value
(column) means that the training example has a 99.91% probability of
belonging to class 0 and a 0.09% probability of belonging to class 1. 

(The
``set_printoptions`` call is used here to make the outputs more legible.)

We can convert the probabilities into class label predictions using PyTorch's
``argmax`` function, which returns the index position of the highest value in each
row if we set ``dim=1`` (setting ``dim=0`` would return the index position of the
highest value in each column instead):

In [40]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
tensor([0, 0, 0, 1, 1])


Note that it is unnecessary to compute ``softmax`` probabilities to obtain the
class labels. We could also apply the ``argmax`` function to the logits (outputs)
directly:

In [41]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


Above, we computed the predicted labels for the training dataset. Since the
training dataset is relatively small, we could compare it to the true training
labels by eye and see that the model is 100% correct. We can double-check
this using the ``==`` comparison operator:

In [42]:
predictions == y_train

tensor([True, True, True, True, True])

Using ``torch.sum``, we can count the number of correct predictions as follows:

In [43]:
torch.sum(predictions == y_train)

tensor(5)

Since the dataset consists of 5 training examples, we have 5 out of 5
predictions that are correct, which equals 5/5 × 100% = 100% prediction
accuracy.

### 7.2. A function to compute the prediction accuracy

To generalize the computation of the prediction accuracy, let's
implement a ``compute_accuracy`` function.

The function iterates over a data loader to compute the
number and fraction of the correct predictions. This is because when we work
with large datasets, we typically can only call the model on a small part of the
dataset due to memory limitations. The ``compute_accuracy`` function below is
a general method that scales to datasets of arbitrary size since, in each
iteration, the dataset chunk that the model receives is the same size as the
batch size seen during training.

In [44]:
def compute_accuracy(model, dataloader):

    model = model.eval()  # switch off training-only behavior such as dropout
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():  # gradients are not needed for evaluation
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions  # True where the prediction is correct
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

We can then apply the function to the training set as follows:

In [45]:
compute_accuracy(model, train_loader)

1.0

Similarly, we can apply the function to the test set as follows:

In [46]:
compute_accuracy(model, test_loader)

1.0

## 8. Saving and loading models

Here's the recommended way to save models in PyTorch:

In [47]:
torch.save(model.state_dict(), "model.pth")

The model's ``state_dict`` is a Python dictionary object that maps each layer in
the model to its trainable parameters (weights and biases). Note that
``"model.pth"`` is an arbitrary filename for the model file saved to disk. We can
give it any name and file ending we like; however, ``.pth`` and ``.pt`` are the most
common conventions.
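
To get a feel for what a ``state_dict`` contains, we can print its keys and the shapes of the stored tensors. The sketch below uses a small `torch.nn.Sequential` model named `tiny_model` as a hypothetical stand-in; it is not the `NeuralNetwork` class defined earlier in this notebook:

```python
import torch

# Hypothetical stand-in model (not the notebook's NeuralNetwork class)
tiny_model = torch.nn.Sequential(
    torch.nn.Linear(2, 3),
    torch.nn.ReLU(),        # has no parameters, so it adds no entries
    torch.nn.Linear(3, 2),
)

# Each key names a parameter tensor; each value is the tensor itself
for name, tensor in tiny_model.state_dict().items():
    print(name, tuple(tensor.shape))
```

The keys are built from the position of each layer inside the model and the parameter name, which is why the model we load the parameters into must have the same structure.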

Once we have saved the model, we can restore it from disk as follows:

In [48]:
model = NeuralNetwork(2, 2) # needs to match the original model exactly
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>

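The ``torch.load("model.pth", weights_only=True)`` call reads the file ``"model.pth"`` and
reconstructs the Python dictionary object containing the model's parameters,
while ``model.load_state_dict()`` applies these parameters to the model,
effectively restoring its learned state from when we saved it. The
``weights_only=True`` argument restricts loading to tensors and other basic
types, which guards against executing arbitrary code from an untrusted file.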
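
As a sanity check, the save-and-restore round trip can be sketched end to end. This is a minimal sketch under a few assumptions: a single `torch.nn.Linear` layer stands in for the trained model, and the file name `example_model.pth` and the temporary directory are chosen only for illustration:

```python
import os
import tempfile

import torch

torch.manual_seed(123)
original = torch.nn.Linear(2, 2)  # hypothetical stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pth")  # illustrative file name
    torch.save(original.state_dict(), path)

    # Same architecture, but freshly (and differently) initialized
    restored = torch.nn.Linear(2, 2)
    restored.load_state_dict(torch.load(path, weights_only=True))

# After loading, every parameter tensor should match the original
same = all(
    torch.equal(p_orig, p_rest)
    for p_orig, p_rest in zip(original.state_dict().values(),
                              restored.state_dict().values())
)
print(same)  # True
```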

## 9. Optimizing training performance with GPUs

In this section, we will see how we can utilize GPUs,
which will accelerate deep neural network training compared to regular
CPUs. First, we will introduce the main concepts behind GPU computing in
PyTorch. Then, we will train a model on a single GPU.

### 9.1. PyTorch computations on GPU devices

In this part of the lab, we will learn how to move tensors and models to GPU devices and how to perform computations on GPUs. To do so, we first check again whether a GPU is available.

In [None]:
print(torch.cuda.is_available())

True


Now, suppose we have two tensors that we can add as follows -- this
computation will be carried out on the CPU by default:

In [None]:
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

print(tensor_1 + tensor_2)

tensor([5., 7., 9.])


We can now use the ``.to()`` method to transfer these tensors onto a GPU
and perform the addition there:

In [None]:
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")

print(tensor_1 + tensor_2)

tensor([5., 7., 9.], device='cuda:0')


Notice that the resulting tensor now includes the device information,
``device='cuda:0'``, which means that the tensors reside on the first GPU. If
your machine hosts multiple GPUs, you have the option to specify which
GPU you'd like to transfer the tensors to. You can do this by indicating the
device ID in the transfer command. For instance, you can use
``.to("cuda:0")``, ``.to("cuda:1")``, and so on.
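
To find out which device IDs are valid on a given machine, we can query the number of visible GPUs. The following is a minimal sketch; on a machine without a CUDA GPU it reports zero devices and the loop body never runs:

```python
import torch

num_gpus = torch.cuda.device_count()  # 0 on a machine without a CUDA GPU
print(f"Visible CUDA devices: {num_gpus}")

# Valid device strings are "cuda:0" up to f"cuda:{num_gpus - 1}"
for device_id in range(num_gpus):
    print(f"cuda:{device_id} ->", torch.cuda.get_device_name(device_id))
```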

However, it is important to note that all tensors must be on the same device.
Otherwise, the computation will fail, as shown below, where one tensor
resides on the CPU and the other on the GPU:

In [None]:
tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
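
One way to avoid this error is to move one operand to the device of the other before computing. The sketch below uses new illustrative tensors `a` and `b` (not `tensor_1` and `tensor_2` from above) and falls back to the CPU when no GPU is available, so it runs on either kind of machine:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1., 2., 3.])             # created on the CPU
b = torch.tensor([4., 5., 6.]).to(device)  # lives on the GPU if one is available

# Move `a` to wherever `b` lives, then add
result = a.to(b.device) + b

print(result.cpu())  # tensor([5., 7., 9.])
```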

### 9.2. A training loop on a GPU

We are using the same ToyDataset, DataLoader, and NeuralNetwork as before, but we will train the model on a GPU this time.

We can modify the training loop from section 7 to run on a GPU.
This requires changing only three lines of code, as shown
below.

In [None]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # NEW
model = model.to(device) # NEW

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):

    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        features, labels = features.to(device), labels.to(device) # NEW
        logits = model(features)
        loss = F.cross_entropy(logits, labels) # Loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00


We can also pass the device string directly, ``model.to("cuda")``, instead of first creating a device object with ``device = torch.device("cuda")`` and then calling ``model = model.to(device)``.

We also added the line ``device = torch.device("cuda" if torch.cuda.is_available() else "cpu")`` to make the code executable on a
CPU if a GPU is not available, which is usually considered best practice.

In the case of the modified training loop above, we probably won't see a
speed-up because the model and dataset are tiny, so the cost of transferring
data from CPU to GPU memory outweighs the faster computation. However,
we can expect a significant speed-up when training deep neural networks,
especially large language models.
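
To see how the workload size affects this trade-off, we can time a larger operation on each device. The following is a rough sketch rather than a rigorous benchmark: the helper `time_matmul`, the matrix size, and the number of repeats are all assumptions made for illustration, and the measured times depend on the hardware:

```python
import time

import torch


def time_matmul(device, size=512, repeats=10):
    """Return the average seconds per (size x size) matrix multiplication."""
    x = torch.randn(size, size, device=device)
    y = torch.randn(size, size, device=device)

    _ = x @ y  # warm-up run, excluded from the timing
    if device.type == "cuda":
        torch.cuda.synchronize()  # GPU kernels run asynchronously

    start = time.perf_counter()
    for _ in range(repeats):
        _ = x @ y
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until all GPU work has finished
    return (time.perf_counter() - start) / repeats


print(f"cpu:  {time_matmul(torch.device('cpu')):.6f} s")
if torch.cuda.is_available():
    print(f"cuda: {time_matmul(torch.device('cuda')):.6f} s")
```

The calls to `torch.cuda.synchronize()` matter because GPU operations are launched asynchronously; without them, the timer could stop before the computation has actually finished.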