# APPENDIX A


This file covers installing PyTorch and its basics.



**A.1 What is PyTorch?**



PyTorch is useful because it works with tensors, automatically computes gradients for tensor operations, and has many built-in loss functions and optimizers.
Deep learning is a type of machine learning.

Installing it from the terminal
-pip3 install torch torchvision torchaudio (the package is named torch; pip install pytorch does not install it)
-pip show torch reports the installed version (2.4.0 for these notes)

In [2]:
import torch
torch.cuda.is_available()

False

**A.2 Understanding tensors**

Tensors are a generalization of matrices to higher dimensions.
They are data containers for array-like structures.

In [4]:
import torch

# 0 dimensional tensor
tensor0d = torch.tensor(1)

# 1 dimensional tensor
tensor1d = torch.tensor([1,2,3])

# 2 dimensional tensor
tensor2d = torch.tensor([[1,2],[3,4]])

# 3 dimensional tensor
tensor3d = torch.tensor([[[1,2],[3,4]],[[5,6],[7,8]]])

In [None]:
# Data types: both tensors are 64-bit integers, the default for Python ints. 64 bits covers a wider range of values than 32 bits but uses more memory.
print(tensor0d.dtype, tensor1d.dtype)

torch.int64 torch.int64
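Floating-point tensors behave differently from the integer case above. The following is a minimal sketch, using illustrative variable names that are not part of the original notes, of the default float type and of converting an integer tensor with `.to()`:

```python
import torch

# Python floats become 32-bit floating-point tensors by default
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

# An integer tensor can be converted to another data type with .to()
intvec = torch.tensor([1, 2, 3])
converted = intvec.to(torch.float32)
print(converted.dtype)
```

32-bit floats are the usual choice for deep learning because they offer enough precision while using half the memory of 64-bit floats.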


In [12]:
# Operations

# Obtaining the tensor
print(tensor0d)

# Obtaining the size
print(tensor2d.shape)

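# Reshaping the tensor
print(tensor3d.reshape(4,2))
# .view is more commonly used for this; it requires contiguous memory, while .reshape copies the data if needed

# Transposing the tensor by reversing its dimensions
# (.T is deprecated for tensors with more than 2 dimensions, so permute is used instead)
print(tensor3d.permute(2, 1, 0))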


tensor(1)
torch.Size([2, 2])
tensor([[1, 2],
        [3, 4],
        [5, 6],
        [7, 8]])
tensor([[[1, 5],
         [3, 7]],

        [[2, 6],
         [4, 8]]])
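For 2-dimensional tensors the same operations are simpler. A short sketch, assuming the same `tensor2d` values as above and an illustrative matrix product that the notes do not include:

```python
import torch

tensor2d = torch.tensor([[1, 2], [3, 4]])

# .view reshapes without copying (the tensor must be contiguous in memory)
flat = tensor2d.view(4)
print(flat)

# .T transposes a 2-dimensional tensor
transposed = tensor2d.T
print(transposed)

# @ (equivalent to torch.matmul) multiplies two matrices
product = tensor2d @ transposed
print(product)
```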




**A.3 Seeing models as computation graphs**

Autograd is PyTorch's built-in automatic differentiation engine, which computes gradients automatically.

In [None]:
# Logistic regression classifier

import torch.nn.functional as F

y = torch.tensor([1.0]) # label
x1 = torch.tensor([1.1]) # input
w1 = torch.tensor([2.2]) # weight
b = torch.tensor([0.0]) # bias
z = x1 * w1 + b # net input
a = torch.sigmoid(z) # sigmoid activation, squashes any number into the range between 0 and 1
loss = F.binary_cross_entropy(a, y) # loss, measures how wrong the prediction is
print(loss)

tensor(0.0852)
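To see what the graph computes, the same forward pass can be written out by hand. This is a plain-Python sketch using only the `math` module; it should reproduce the loss printed above up to rounding:

```python
import math

x1, w1, b, y = 1.1, 2.2, 0.0, 1.0

z = x1 * w1 + b                    # net input
a = 1.0 / (1.0 + math.exp(-z))     # sigmoid activation
# binary cross-entropy for a single example
loss = -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))
print(round(loss, 4))
```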


**A.4 Automatic Differentiation Made Easy**

Setting the requires_grad attribute to True on a tensor makes PyTorch build a computation graph internally, which is what allows gradients to be computed.
Gradients are partial derivatives of the loss with respect to each parameter, obtained by applying the chain rule from right to left (output to input) through the computation graph.

In [18]:
# Computing gradients with autograd
import torch.nn.functional as F
from torch.autograd import grad


y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True) # Parameter requires grad set to True
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True) # loss is a scalar value representing the model's error
grad_L_b = grad(loss, b, retain_graph=True) # retain_graph keeps the computation graph in memory so it can be reused for further gradient computations

print(grad_L_w1)
print(grad_L_b)

# loss.backward() computes the gradients for all tensors that have requires_grad=True at once and stores them in their .grad attributes
loss.backward()
b.grad


(tensor([-0.0898]),)
(tensor([-0.0817]),)


tensor([-0.0817])
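The chain rule can be checked by hand for this small example. For a sigmoid output combined with binary cross-entropy, the derivative of the loss with respect to z simplifies to a - y, so the gradients are (a - y) * x1 for the weight and a - y for the bias. A sketch comparing that manual result with autograd (the `manual_` names are illustrative):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()

# Chain rule by hand: dL/dz = a - y, dz/dw1 = x1, dz/db = 1
with torch.no_grad():
    manual_grad_w1 = (a - y) * x1
    manual_grad_b = a - y

print(manual_grad_w1, w1.grad)
print(manual_grad_b, b.grad)
```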

**A.5 Implementing Multilayer Neural Networks**

We subclass torch.nn.Module to define our own network architecture.

In [10]:
# Defining a multilayer perceptron with two hidden layers
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = torch.nn.Sequential( # Sequential is used because the layers are applied one after another
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30), # A Linear layer takes the number of inputs and outputs (here 30)
            torch.nn.ReLU(), # Nonlinear activation function
            # 2nd hidden layer
            torch.nn.Linear(30, 20), # The number of inputs has to match the previous layer's outputs
            torch.nn.ReLU(),
            # output layer
            torch.nn.Linear(20, num_outputs), # Final layer maps the 20 hidden units to the desired number of outputs
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

# Instantiating a neural network object
model = NeuralNetwork(50,3)

print(model)

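# Because the layers are stored in a Sequential container, we can access all of them through model.layers

# As the printed model shows, the first Linear layer is at index 0
print(model.layers[0].weight)
print(model.layers[0].weight.shape)

# For reproducible random numbers, we can set a seed with torch.manual_seed(123).
# It only affects values generated afterwards, so it must be called before creating the model to reproduce the initial weights.
torch.manual_seed(123)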

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)
Parameter containing:
tensor([[ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        [-0.0920, -0.0480,  0.0105,  ..., -0.0923,  0.1201,  0.0330],
        ...,
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509],
        [-0.1250,  0.0513,  0.0366,  ..., -0.1370,  0.1074, -0.0704]],
       requires_grad=True)
torch.Size([30, 50])


<torch._C.Generator at 0x1eb52a88110>

Each layer's weights are randomly initialized to small values. If every weight started from the same value, all neurons in a layer would receive identical gradients and identical updates, so they could never learn different features. The values are kept small so that gradients do not grow too large and explode, but they should not be too small either, as that would cause the gradients to vanish.
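A quick way to inspect a model like this is to count its trainable parameters and look at the size of the initial weights. The sketch below rebuilds the same layer stack inline so that it runs on its own. The bound of 1/sqrt(in_features) reflects my understanding of the default torch.nn.Linear initialization and should be treated as an assumption rather than something these notes establish:

```python
import math
import torch

torch.manual_seed(123)

# Same layer stack as NeuralNetwork(50, 3), written inline
layers = torch.nn.Sequential(
    torch.nn.Linear(50, 30),
    torch.nn.ReLU(),
    torch.nn.Linear(30, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 3),
)

# Count every trainable parameter (weights and biases)
num_params = sum(p.numel() for p in layers.parameters() if p.requires_grad)
print("Trainable parameters:", num_params)

# Largest initial weight magnitude in the first layer vs. 1/sqrt(in_features)
max_weight = layers[0].weight.abs().max().item()
bound = 1 / math.sqrt(50)
print(max_weight, bound)
```

The count follows from the layer sizes: (50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3) = 2213.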

In [13]:
# Simple training example
torch.manual_seed(123)
X = torch.rand((1, 50))
out = model(X)
print(out)

# If we only want to predict, not train the model
with torch.no_grad():
    out = model(X)
print("prediction:",out)
# This way, no computation graph is built for gradients, saving both memory and computation

# To obtain class probabilities, we apply the softmax activation function
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print("softmax:",out)

tensor([[-0.0879,  0.1729,  0.1534]], grad_fn=<AddmmBackward0>)
prediction: tensor([[-0.0879,  0.1729,  0.1534]])
softmax: tensor([[0.2801, 0.3635, 0.3565]])
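Softmax itself is a short formula: exponentiate each logit and divide by the row sum. A sketch that applies it by hand to the logits printed above (copied as rounded values, so the result matches only to about three decimals):

```python
import torch

# Logits copied from the output above
logits = torch.tensor([[-0.0879, 0.1729, 0.1534]])

# Softmax by hand: exponentiate, then normalize each row
exp_logits = torch.exp(logits)
manual = exp_logits / exp_logits.sum(dim=1, keepdim=True)

builtin = torch.softmax(logits, dim=1)
print(manual)
print(manual.sum(dim=1))  # each row of probabilities sums to 1
```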


**A.6 Setting up efficient data loaders**

Data loaders are crucial for handling training and testing data.
First a dataset class is created to hold the training and testing data; after that, the data loaders are created.

In [14]:
# Creating a small toy dataset
X_train = torch.tensor([
 [-1.2, 3.1],
 [-0.9, 2.9],
 [-0.5, 2.6],
 [2.3, -1.1],
 [2.7, -1.5]
])
y_train = torch.tensor([0, 0, 0, 1, 1])
X_test = torch.tensor([
 [-0.8, 2.8],
 [2.6, -1.6],
])
y_test = torch.tensor([0, 1])

Now we define a custom ToyDataset class by subclassing PyTorch's built-in Dataset class.

In [16]:
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index): # Returns one data example and its label
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0] # Returns the total number of examples
 
train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)
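The two special methods are what make the dataset usable with `len()` and indexing. A self-contained sketch that repeats the class in compact form and calls both (index 3 is an arbitrary choice for illustration):

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

X_train = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6], [2.3, -1.1], [2.7, -1.5]])
y_train = torch.tensor([0, 0, 0, 1, 1])
train_ds = ToyDataset(X_train, y_train)

print(len(train_ds))           # calls __len__
features, label = train_ds[3]  # calls __getitem__
print(features, label)
```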

With this done, we can now create the data loaders.

In [32]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds, # Uses our custom dataset
    batch_size=2, # Loads two training examples at a time
    shuffle=True, # Reshuffles the data at every epoch so batches differ between epochs
    num_workers=0 # Number of subprocesses for parallel loading; 0 loads the data in the main process
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False, # Test data is not shuffled, which keeps evaluation consistent between runs
    num_workers=0
)

# Iterating over our data
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

# With 5 examples and a batch size of 2, the last batch contains only 1 example.
# drop_last=True drops this incomplete batch, since a much smaller final batch can disturb training
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])
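The effect of drop_last is easiest to see by counting batches. This sketch uses the built-in TensorDataset with made-up feature values as a stand-in for ToyDataset, purely to keep the example short:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 5 examples with 2 features each (placeholder values)
X = torch.arange(10, dtype=torch.float32).reshape(5, 2)
y = torch.tensor([0, 0, 0, 1, 1])
ds = TensorDataset(X, y)

loader_keep = DataLoader(ds, batch_size=2, shuffle=False)
loader_drop = DataLoader(ds, batch_size=2, shuffle=False, drop_last=True)

sizes_keep = [len(labels) for _, labels in loader_keep]
sizes_drop = [len(labels) for _, labels in loader_drop]
print(len(loader_keep), sizes_keep)
print(len(loader_drop), sizes_drop)
```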


**A.7 A typical training loop**

In this simple case we will use stochastic gradient descent (SGD). There are many other optimizers; what follows is only a brief explanation of this one.

Stochastic means random: instead of computing the gradient with all the data, each update uses a small random subset (a mini-batch), which can be as small as a single instance.
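Each update then follows one rule: new parameter = old parameter - learning rate * gradient. A minimal sketch with a single made-up parameter and a toy loss, checking that torch.optim.SGD (without momentum) performs exactly this step:

```python
import torch

lr = 0.5
w = torch.tensor([2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()   # toy loss whose gradient is 2 * w
loss.backward()

# Manual update rule: w_new = w - lr * gradient
expected = w.detach() - lr * w.grad

optimizer.step()
print(w.detach(), expected)
```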

In [None]:
import torch.nn.functional as F


torch.manual_seed(123)

model = NeuralNetwork(num_inputs=2, num_outputs=2)

optimizer = torch.optim.SGD( # Optimizer is Stochastic Gradient Descent
 model.parameters(), lr=0.5 # Specifying which parameters to optimize, lr <- learning rate
)

num_epochs = 3
for epoch in range(num_epochs):

    model.train() # Training mode; matters because components such as dropout behave differently in evaluation mode

    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features) # Forward pass: obtaining the outputs

        loss = F.cross_entropy(logits, labels) # Cross-entropy loss, applies softmax internally

        optimizer.zero_grad() # Resets gradients from the previous iteration so they do not accumulate
        loss.backward() # Computes gradients of the loss
        optimizer.step() # Optimizer uses the gradients to update the parameters

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train Loss: {loss:.2f}")

    model.eval() # Switches to evaluation mode, e.g. for an optional evaluation step after each epoch

# Both the learning rate and the number of epochs are hyperparameters we can tune
# In this case the training loss drops to about 0.00 (rounded to two decimals) after 3 epochs


Epoch: 001/003 | Batch 000/002 | Train Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train Loss: 0.00
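The loss function in this loop takes raw logits because it applies (log-)softmax internally. A sketch with made-up logits and labels that rebuilds the loss by hand and compares it with F.cross_entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # made-up values
labels = torch.tensor([0, 1])

# Manual version: log-softmax, pick the log-probability of the true class, average
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(labels)), labels].mean()

builtin_loss = F.cross_entropy(logits, labels)
print(manual_loss, builtin_loss)
```

This is why the model's forward method returns logits rather than probabilities.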


In [None]:
# Evaluating
model.eval()
with torch.no_grad():
 outputs = model(X_train)
print(outputs)

# Probabilities
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

# Each row is an instance: the left column is the probability of class 0, the right column that of class 1

# To return the predicted class for each instance
predictions = torch.argmax(probas, dim=1) # dim=1 returns the index of the highest value in each row
print(predictions)
# Applying softmax is unnecessary here: argmax on the logits gives the same predictions



tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])
tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
tensor([0, 0, 0, 1, 1])
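Softmax preserves the ordering of the values in each row, so the largest logit is also the largest probability. A sketch using the logits printed above (copied as rounded values):

```python
import torch

# Logits copied from the output above
logits = torch.tensor([
    [ 2.8569, -4.1618],
    [ 2.5382, -3.7548],
    [ 2.0944, -3.1820],
    [-1.4814,  1.4816],
    [-1.7176,  1.7342],
])

from_logits = torch.argmax(logits, dim=1)
from_probas = torch.argmax(torch.softmax(logits, dim=1), dim=1)
print(from_logits, from_probas)
```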


In [37]:
# Evaluating the model's performance
def compute_accuracy(model, dataloader):
    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model(features) # Obtaining outputs

        # Accumulate the counts for every batch
        predictions = torch.argmax(logits, dim=1) # Predicted class labels
        compare = labels == predictions # Returns a True/False tensor indicating whether the values match
        correct += torch.sum(compare) # Counts the number of True values
        total_examples += len(compare)

    return (correct / total_examples).item()

# Obtaining accuracy for both training and testing
print(compute_accuracy(model, train_loader))
print(compute_accuracy(model, test_loader))

1.0
1.0


**A.8 Saving and loading models**

In [None]:
# torch.save(model.state_dict(), "model.pth")
# state_dict maps each layer to its parameters
# model.pth is an arbitrary filename

# model = NeuralNetwork(2, 2)
# An instance of the model with the same architecture must exist in memory before the saved parameters can be loaded
# model.load_state_dict(torch.load("model.pth"))
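A runnable round trip might look like the sketch below. It uses a small torch.nn.Linear as a stand-in for NeuralNetwork(2, 2) and writes to a temporary directory so that no file is left behind; both choices are assumptions made for illustration:

```python
import os
import tempfile
import torch

torch.manual_seed(123)
model = torch.nn.Linear(2, 2)  # small stand-in for NeuralNetwork(2, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")
    torch.save(model.state_dict(), path)

    # A fresh instance starts with different random weights
    restored = torch.nn.Linear(2, 2)
    # weights_only=True restricts loading to tensors and plain containers
    restored.load_state_dict(torch.load(path, weights_only=True))

# After loading, both models hold identical parameters
same = all(
    torch.equal(p, q)
    for p, q in zip(model.parameters(), restored.parameters())
)
print(same)
```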


**A.9 Optimizing training performance with GPUs**

My computer does not support these methods.
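Even without a GPU, the code can be written in a device-agnostic way so that the same script uses a GPU whenever one is available. A minimal sketch, with a small torch.nn.Linear standing in for the model (an assumption made for brevity):

```python
import torch

# Use the GPU when PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(2, 2).to(device)            # moves the parameters
features = torch.tensor([[-1.2, 3.1]]).to(device)   # moves the data

with torch.no_grad():
    logits = model(features)
print(logits.device)
```

The model and its inputs have to live on the same device, otherwise the forward pass raises an error.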

**SUMMARY**

- PyTorch is an open source library with three core components: a tensor library, automatic differentiation functions, and deep learning utilities.

- PyTorch's tensor library is similar to array libraries like NumPy.

- In the context of PyTorch, tensors are array-like data structures representing scalars, vectors, matrices, and higher-dimensional arrays.

- PyTorch tensors can be executed on the CPU, but one major advantage of PyTorch's tensor format is its GPU support to accelerate computations.

- The automatic differentiation (autograd) capabilities in PyTorch allow us to conveniently train neural networks using backpropagation without manually deriving gradients.

- The deep learning utilities in PyTorch provide building blocks for creating custom deep neural networks.

- PyTorch includes Dataset and DataLoader classes to set up efficient data-loading pipelines.

- It's easiest to train models on a CPU or single GPU.

- Using DistributedDataParallel is the simplest way in PyTorch to accelerate the training if multiple GPUs are available.