# Introduction to Deep Learning with PyTorch

## Basic information about tensors

In [None]:
import torch
import numpy as np

In [None]:
# create a tensor from a list
array = [[1,2,3],[4,5,6]]
tensor = torch.tensor(array)
tensor

In [None]:
# create a tensor from a NumPy array
np_array = np.array(array)
np_tensor = torch.from_numpy(np_array)
np_tensor

In [None]:
# tensor attributes
tensor.device, tensor.shape, tensor.dtype

In [None]:
# tensor operations (element-wise, so the shapes of a and b must match or be broadcastable)
a = torch.tensor([[1,1],[2,2]])
b = torch.tensor([[2,2],[3,3]])
# addition/subtraction/multiplication/division
a+b, a-b, a*b, a/b
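
The operators above work element by element, which is why the shapes have to be compatible. As a minimal sketch, reusing the values of `a` and `b` from the cell above, the following contrasts element-wise multiplication with matrix multiplication and shows one broadcasting case. The names `elementwise`, `matmul`, `row`, `broadcast_sum` and `quotient` are illustrative and not part of the original notebook.

```python
import torch

a = torch.tensor([[1, 1], [2, 2]])
b = torch.tensor([[2, 2], [3, 3]])

# Element-wise product: each entry of a is multiplied by the matching entry of b
elementwise = a * b

# Matrix product: rows of a are combined with columns of b
matmul = a @ b

# Shapes do not have to be identical if they can be broadcast:
# the (1, 2) row is stretched across both rows of a
row = torch.tensor([[10, 20]])
broadcast_sum = a + row

# True division of integer tensors returns a floating-point tensor
quotient = a / b

print(elementwise, matmul, broadcast_sum, quotient.dtype)
```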

## Roadmap to training a neural network

1. Create an architecture of the neural network (see an example below) ![nn_architecture](nn_architecture.png)
2. Load data using PyTorch
3. Define the loss function (measures the difference between model predictions and true labels and is used to assess the model; its value depends on the weights and biases, the parameters learned during training, which denote the strength and direction of the connections between individual neurons)
4. Set up an optimizer (the optimizer updates the network weights during training; SGD is a common choice)
5. Define a training loop (run a forward pass, i.e. pass input data through the network to get predictions; compute the loss with the loss function; run backpropagation to compute gradients, which measure how much each parameter contributed to the overall error in the final output; update weights and biases using the gradients)
6. Test the trained network on a separate dataset to evaluate performance (define metrics that measure the usefulness of the predictions on validation and test sets)
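
The six steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data is random, the labels are made up from the feature sum, and the layer sizes, learning rate and epoch count are arbitrary choices that do not come from the notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data (an assumption for illustration): 64 samples, 8 features, binary labels
X = torch.randn(64, 8)
y = (X.sum(dim=1, keepdim=True) > 0).float()

# 1. Architecture
model = nn.Sequential(nn.Linear(8, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
# 2. Data loading
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)
# 3. Loss function (binary cross entropy on sigmoid outputs)
criterion = nn.BCELoss()
# 4. Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# 5. Training loop
epoch_losses = []
for epoch in range(20):
    running = 0.0
    for features, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()
        running += loss.item()
    epoch_losses.append(running / len(loader))

# 6. Evaluation (on the training data only, since this sketch has no held-out set)
model.eval()
with torch.no_grad():
    accuracy = ((model(X) > 0.5).float() == y).float().mean().item()

print(epoch_losses[0], epoch_losses[-1], accuracy)
```

The sections below go through each of these steps in more detail.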


### 1. Create a neural network

**PyTorch implementation of a neural network with three layers.**

- The input has 8 features. nn.Linear(8, 4) takes an input with 8 features and outputs a tensor with 4 values, which form the first hidden layer.
- nn.Sigmoid() applies the sigmoid activation function to the output of the first linear layer.
- The second hidden layer has 2 neurons and is represented by nn.Linear(4, 2), which takes the output of the first hidden layer as input and outputs a tensor with 2 values.
- The second hidden layer is followed by another nn.Sigmoid() activation function.
- The output layer has only 1 neuron and is represented by nn.Linear(2, 1), which takes the output of the second hidden layer as input and produces a single value per sample.
- The final nn.Sigmoid() activation function is applied to the output of the output layer, which maps the output to a value between 0 and 1, making it suitable for binary classification tasks.

The network below has three layers with sigmoid activation functions, which makes it suitable for classification problems where the output is a binary variable.

In [None]:
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4),
                      nn.Sigmoid(),
                      nn.Linear(4, 2),
                      nn.Sigmoid(),
                      nn.Linear(2, 1),
                      nn.Sigmoid())

input_tensor = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8]], dtype=torch.float32)
output = model(input_tensor)

Note: A neural network with a single linear layer followed by a sigmoid activation has the same form as logistic regression: a weighted sum of the inputs plus a bias, passed through a sigmoid. Adding hidden layers is what introduces additional trainable parameters.
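
To make the comparison with logistic regression concrete, here is a small sketch. The name `logreg_net` and the random input `x` are illustrative assumptions. It recomputes the output of a one-layer network by hand as sigmoid(x W^T + b) and counts its parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One linear layer followed by a sigmoid: the same functional form as logistic regression
logreg_net = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())

x = torch.randn(5, 8)  # 5 illustrative samples with 8 features each

# Recompute the output by hand: sigmoid(x W^T + b)
linear = logreg_net[0]
manual = torch.sigmoid(x @ linear.weight.T + linear.bias)

# 8 weights + 1 bias = 9 trainable parameters
n_params = sum(p.numel() for p in logreg_net.parameters())

print(torch.allclose(logreg_net(x), manual), n_params)
```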

The dimensions of the input and output layers are fixed: the input layer depends on the number of features (n_features) and the output layer on the number of categories (n_classes). Adding hidden layers adds parameters, which increases the model capacity.

### 2. Load data in PyTorch

In [None]:
# Import libraries
from torch.utils.data import Dataset, DataLoader, TensorDataset
import pandas as pd

In [None]:
# Create a PyTorch Dataset class that reads a CSV file and provides a method for extracting features and labels for a given index.

class MyDataset(Dataset):
    def __init__(self, csv_path):
        """
        A PyTorch dataset for loading data from a CSV file.
        
        Parameters:
            csv_path (str): The path to the CSV file.
        
        Returns:
            None
        """
        super(MyDataset, self).__init__()
        self.data = pd.read_csv(csv_path).to_numpy()
        
    def __len__(self):
        """
        Get the number of samples in the dataset.
        
        Parameters:
            None
        
        Returns:
            int: The number of samples in the dataset.
        """
        return self.data.shape[0]
    
    def __getitem__(self, index):
        """
        Get the features and label of a sample at a given index.
        
        Parameters:
            index (int): The index of the sample to retrieve.
        
        Returns:
            tuple: A tuple containing the features (as a tensor) and label (as a tensor) of the sample.
        """
        features, label = self.extract_features_and_label(index)
        return features, label
    
    def extract_features_and_label(self, index):
        """
        Extract the features and label of a sample at a given index.
        
        Parameters:
            index (int): The index of the sample to retrieve.
        
        Returns:
            tuple: A tuple containing the features (as a tensor) and label (as a tensor) of the sample.
        """
        row = self.data[index]
        features = row[:-1].astype(np.float32)
        # Build the label tensor once, as float32 to match the features
        label = torch.tensor(row[-1], dtype=torch.float32)
        return torch.from_numpy(features), label

In [None]:
# Create an instance of the MyDataset class and a DataLoader to load the data in batches.
dataset = MyDataset('water_potability.csv')
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate over the DataLoader to extract features and labels in batches, and print the shape of the features and labels in each batch.
for batch_idx, (features, label) in enumerate(dataloader):
    print(f'Batch {batch_idx}: {features.shape}, {label.shape}')

# Extract the first batch of features and labels using the next function.
x, y = next(iter(dataloader))

In [None]:
# Create a model using the nn.Sequential API
model = nn.Sequential(nn.Linear(4, 16), nn.Linear(16, 1))

### 3. Define the loss function

The cross entropy loss is the most commonly used loss for classification problems.

In [None]:
import torch.nn.functional as F

y = [2]
scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])

# Create a one-hot encoded vector of the label y
one_hot_label = F.one_hot(torch.tensor(y), scores.shape[1])

# Create the cross entropy loss function
loss = nn.CrossEntropyLoss()

# Calculate the cross entropy loss
loss(scores.double(), one_hot_label.double())

Note: 
This code is calculating the cross entropy loss for a single sample with four classes.
First, it defines the ground truth label y as class 2. Then, it creates a tensor scores of shape (1, 4) with the predicted scores for each class.
Next, it uses the F.one_hot function to create a one-hot encoded tensor of y with the same number of classes as scores. If the ground truth label y is class 2, it means that the correct class for the given sample is the one represented by the third element (index 2) of the scores tensor. The resulting one_hot_label tensor is of shape (1, 4) and has a 1 at index 2 (the third position, corresponding to the ground truth class) and 0s elsewhere.
Finally, it creates an instance of the nn.CrossEntropyLoss() class and applies it to the scores tensor and the one_hot_label tensor using the .double() method to ensure that both tensors have the same data type (double precision).
The output of the loss function is a scalar tensor representing the cross entropy loss between the predicted scores and the one-hot encoded ground truth label.
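
As a cross-check of the explanation above, the same value can be computed by hand: cross entropy is the negative log of the softmax probability assigned to the true class. The sketch below reuses the `scores` from the cell above; `target`, `manual_loss`, `index_loss` and `one_hot_loss` are illustrative names. It also shows that nn.CrossEntropyLoss accepts class indices directly, so the one-hot encoding is optional.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])
target = torch.tensor([2])  # class index of the ground truth

# Cross entropy = negative log of the softmax probability of the true class
log_probs = F.log_softmax(scores, dim=1)
manual_loss = -log_probs[0, target[0]]

# Built-in loss with a class-index target
index_loss = nn.CrossEntropyLoss()(scores, target)

# Built-in loss with a one-hot (probability) target, as in the cell above
one_hot_loss = nn.CrossEntropyLoss()(scores, F.one_hot(target, scores.shape[1]).float())

print(manual_loss.item(), index_loss.item(), one_hot_loss.item())
```

All three values agree (about 8.06). The loss is large because the model gives class 2 the lowest score.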

### 4. Set up an optimizer

In PyTorch, an optimizer takes care of weight updates. The most common optimizer is stochastic gradient descent (SGD).

### Manual optimization

In [None]:
# Initialize the model and loss function
model = nn.Sequential(nn.Linear(16, 8),
                      nn.Linear(8, 4),
                      nn.Linear(4, 2))
criterion = nn.CrossEntropyLoss()

# Define the input data and target labels
inputs = torch.randn(2, 16)
labels = torch.LongTensor([0, 1])

# Compute the forward pass
outputs = model(inputs)

# Compute the loss
loss = criterion(outputs, labels)

# Compute the gradients using backpropagation
loss.backward()

# Learning rate is typically small
lr = 0.001

# Update the weights and biases of the first layer in place
# (no_grad keeps the update itself out of the autograd graph)
weight = model[0].weight
bias = model[0].bias
with torch.no_grad():
    weight -= lr * weight.grad
    bias -= lr * bias.grad

print(weight, bias)

### Using the PyTorch optimizer

In [None]:
import torch.optim as optim

# Create the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Update the model's parameters using the optimizer
optimizer.step()

# Print the first layer's parameters after the optimizer step
print(model[0].weight, model[0].bias)

Note: In the code above, the optimizer is an instance of the stochastic gradient descent (SGD) algorithm, which is a popular optimization algorithm used in deep learning. The learning rate (lr) is set to 0.001, which determines how much the optimizer should adjust the parameters based on the gradients of the loss function.
After defining the optimizer, the step() method is called to update the model's parameters based on the gradients computed during backpropagation. This step is typically executed inside a training loop, where the model is iteratively trained on mini-batches of data.
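
To connect the manual update with the optimizer, the following sketch checks that a single optimizer.step() with plain SGD (no momentum, no weight decay) gives the same result as weight - lr * gradient. The model mirrors the one defined above; the random inputs and the name `expected` are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 8), nn.Linear(8, 4), nn.Linear(4, 2))
criterion = nn.CrossEntropyLoss()
lr = 0.001
optimizer = optim.SGD(model.parameters(), lr=lr)

inputs = torch.randn(2, 16)
labels = torch.tensor([0, 1])

# Clear old gradients, then run a forward and a backward pass
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()

# What plain SGD should produce for the first layer's weights
expected = model[0].weight.detach() - lr * model[0].weight.grad

# Let the optimizer update every parameter of the model
optimizer.step()

print(torch.allclose(model[0].weight, expected))
```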

### 5. Define a training loop 

In [None]:
# Create an instance of the MyDataset class and a DataLoader to load the data in batches.
dataset = MyDataset('data/ds_salaries_clean.csv')
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Create the model
model = nn.Sequential(nn.Linear(4, 2),nn.Sigmoid(),nn.Linear(2, 1))

# Create the loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

num_epochs = 10

# Loop through the dataset multiple times
for epoch in range(num_epochs):
    for data in dataloader:
        # Set the gradients to zero
        optimizer.zero_grad()
        # Get feature and target from the data loader
        feature, target = data
        # Run a forward pass
        pred = model(feature)
        # Compute loss and gradients (match the shape and dtype of pred and target)
        loss = criterion(pred.squeeze(-1), target.float())
        loss.backward()
        # Update the parameters
        optimizer.step()

def show_results(model, dataloader):
    """
    Print the ground truth and predicted values for each batch in a DataLoader.

    Parameters:
        model: a PyTorch model that takes features as input and outputs predicted values.
        dataloader: a PyTorch DataLoader that yields batches of input features and target salaries.
    Returns:
            None.
    """
    with torch.no_grad():
        for data in dataloader:
            feature, target = data
            pred = model(feature)
            for i in range(len(feature)):
                print(f"Truth value: {target[i]:.3f}. Predicted value: {pred[i][0]:.3f}.")
                
show_results(model, dataloader)

Note: The code above trains a neural network model using PyTorch to predict salaries based on a dataset of employee features. The dataset is loaded into an instance of the MyDataset class and a DataLoader is created to load the data in batches. The neural network model consists of two linear layers with sigmoid activation function between them. The mean squared error loss function and stochastic gradient descent optimizer are used to train the model for 10 epochs. The show_results function is then called to display the predicted and ground truth salaries for each data point in the dataset.

### 6. Test the trained network on a separate dataset to evaluate performance

In [None]:
# Set the model to evaluation mode
# (validationloader is assumed to be a DataLoader over held-out data)
model.eval()
validation_loss = 0.0
with torch.no_grad():
    for data in validationloader:
        outputs = model(data[0])
        loss = criterion(outputs, data[1])
        # Add the current loss to the validation_loss variable
        validation_loss += loss.item()
# Calculate the mean loss value
validation_loss_epoch = validation_loss / len(validationloader)
# Set the model back to training mode
model.train()

In [None]:
import torchmetrics

# Create an accuracy metric using torchmetrics
metric = torchmetrics.Accuracy(task="multiclass", num_classes=3)
for data in dataloader:
    features, labels = data
    outputs = model(features)
    
    # Calculate accuracy over the batch
    acc = metric(outputs.softmax(dim=-1), labels.argmax(dim=-1))
    
# Calculate accuracy over the whole epoch
acc = metric.compute()

# Reset the metric for the next epoch 
metric.reset()
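
The metric object above accumulates results across batches, and compute() then reports the accuracy over everything it has seen. A rough hand-written equivalent in plain PyTorch could look like the sketch below. It assumes the labels are class indices rather than one-hot vectors, and the random `batches` stand in for real model outputs.

```python
import torch

torch.manual_seed(0)

# Illustrative batches of (logits, class-index labels) for a 3-class problem
batches = [(torch.randn(4, 3), torch.randint(0, 3, (4,))) for _ in range(5)]

correct = 0
total = 0
for logits, labels in batches:
    # The predicted class is the one with the highest score
    preds = logits.argmax(dim=-1)
    correct += (preds == labels).sum().item()
    total += labels.numel()

# Accuracy over the whole epoch, not the mean of per-batch accuracies
epoch_accuracy = correct / total
print(epoch_accuracy)
```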

### Improve performance

Steps to maximize performance: overfit the training set, reduce overfitting, fine-tune the hyperparameters.

In [None]:
# overfit the training set

features, labels = next(iter(trainloader))
for i in range(1000):
    # Reset the gradients so they do not accumulate across iterations
    optimizer.zero_grad()
    outputs = model(features)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

In [None]:
# reduce overfitting (the model does not generalize to unseen data)

In [None]:
# 1. If the dataset is not large enough, get more data or use data augmentation

from torchvision import transforms

# Create a data augmentation strategy using at least one transform
augmentation = nn.Sequential(transforms.RandomHorizontalFlip(p=0.5))

# Create a data augmentation strategy using all three transforms.
augmentation = nn.Sequential(transforms.RandomHorizontalFlip(p=0.5),
                             transforms.RandomResizedCrop(size=32, scale=(0.3, 1.5)),
                             transforms.RandomRotation(degrees=20))

In [None]:
# 2. If the model has too much capacity, reduce model size or add dropout
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Dropout(p=0.5))
features = torch.randn((1, 8))
model(features)

# Dropout behaves differently during training and evaluation. Do not forget to switch modes using model.train() and model.eval()
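
The difference between the two modes can be seen directly on a standalone dropout layer. A minimal sketch with an input of ones chosen for illustration: in training mode each entry is either zeroed or scaled by 1 / (1 - p), and in evaluation mode the input passes through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: about half of the entries are zeroed, the rest are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op and the input passes through unchanged
dropout.eval()
eval_out = dropout(x)

print(train_out.unique(), torch.equal(eval_out, x))
```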

In [None]:
# 3. If weights are too large, weight decay
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

# weight_decay takes values between 0 and 1 and is typically very small (e.g. 1e-4). The higher the value, the stronger the regularization and the less likely the model is to overfit.
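
To see what the weight_decay argument does in optim.SGD, the sketch below applies one step to a single hand-made parameter with a known gradient. The values are illustrative. The decay term weight_decay * w is added to the gradient before the step, which pulls the weights toward zero.

```python
import torch
import torch.optim as optim

lr, weight_decay = 0.1, 0.01

# A single illustrative parameter with a known gradient
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
w.grad = torch.tensor([0.5, 0.5])

# Expected update: the decay term weight_decay * w is added to the gradient
expected = w.detach() - lr * (w.grad + weight_decay * w.detach())

optimizer = optim.SGD([w], lr=lr, weight_decay=weight_decay)
optimizer.step()

print(w.detach(), expected)
```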

In [None]:
# fine-tune hyperparameters

In [None]:
# implement random search

values = []
for idx in range(10):
    # Randomly sample a learning rate between 0.01 and 0.0001 (exponent between 2 and 4)
    factor = np.random.uniform(2, 4)
    lr = 10 ** -factor
    
    # Randomly sample a momentum between 0.85 and 0.99
    momentum = np.random.uniform(0.85, 0.99)
    
    values.append((lr, momentum))

In [None]:
# create the best model (mean, std and train_and_evaluate are not defined in this notebook and are assumed to exist)
model = nn.Sequential(transforms.Normalize(mean, std),
                      transforms.RandomHorizontalFlip(p=0.5),
                      transforms.RandomResizedCrop(size=32, scale=(0.8, 1.2)),
                      transforms.RandomRotation(degrees=10),
                      nn.Flatten(),
                      nn.Linear(3072, 3))

# Pick a learning rate
lr = 0.01
train_and_evaluate(model, learning_rate=lr)

# Adjust other parameters
train_and_evaluate(model, learning_rate=0.01, num_epochs=20, momentum=0.9, weight_decay=1e-4)

## More about neural networks

### Activation functions

In [None]:
# Implement ReLU in NumPy
def relu_numpy(x):
  # Implement the ReLU function
  return np.maximum(x,0)

# Create a ReLU function with PyTorch
relu_pytorch = nn.ReLU()

# Calculate the gradient of the ReLU function for x
x = torch.tensor(-1.0, requires_grad=True)
y = relu_pytorch(x)
y.backward()
gradient = x.grad

# Implementing leaky ReLU
def leaky_relu_python(x, slope):
  # Implement the leaky_relu function
  if x >=0:
    return x
  else:
    return x * slope

# Create a leaky relu function in PyTorch
leaky_relu_pytorch = nn.LeakyReLU(negative_slope = 0.05)

y = torch.tensor(-2.0)
# Call the above function on the tensor y
output = leaky_relu_pytorch(y)


### Model capacity

In [None]:
def calculate_capacity(model):
  total = 0
  for p in model.parameters():
    total += p.numel()
  return total

n_features = 8
n_classes = 2

# Create a neural network with fewer than 120 parameters
model = nn.Sequential(nn.Linear(n_features, 8),
                      nn.Linear(8, 4),
                      nn.Linear(4, n_classes))
print(calculate_capacity(model))

# Create a neural network with more than 120 parameters
model = nn.Sequential(nn.Linear(n_features, 8),
                      nn.Linear(8, 6),
                      nn.Linear(6, n_classes))
calculate_capacity(model)
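
The counts returned by calculate_capacity can be reproduced by hand: an nn.Linear(in, out) layer has in * out weights plus out biases. A short sketch for the first model above, where the helper `linear_param_count` is a hypothetical name introduced for illustration:

```python
import torch.nn as nn

def linear_param_count(in_features, out_features):
    # One weight per input-output pair plus one bias per output
    return in_features * out_features + out_features

# Layer sizes of the first model above: 8 -> 8 -> 4 -> 2
sizes = [8, 8, 4, 2]
by_formula = sum(linear_param_count(i, o) for i, o in zip(sizes[:-1], sizes[1:]))

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4), nn.Linear(4, 2))
by_pytorch = sum(p.numel() for p in model.parameters())

print(by_formula, by_pytorch)  # both 118: 72 + 36 + 10
```

The same arithmetic for the second model (8 -> 8 -> 6 -> 2) gives 72 + 54 + 14 = 140 parameters.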

### Learning rate and momentum

SGD has two main parameters: the learning rate, which controls the step size, and the momentum, which controls the inertia of the optimizer. Bad values can lead to long training times and poor overall performance (poor accuracy).

- Learning rate: controls the step size; too small leads to long training times; too high leads to poor performance; typical values between 10^-2 and 10^-4.
- Momentum: controls the inertia; zero momentum can lead to the optimizer being stuck in a local minimum; non-zero momentum can help find the function minimum; typical values between 0.85 and 0.99.
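
The "inertia" can be made explicit by writing the momentum update by hand. The sketch below minimizes the toy function f(w) = w^2, an illustrative choice, and compares optim.SGD with a manual loop. With PyTorch's default settings (no dampening, no Nesterov), the velocity is updated as momentum * velocity + gradient and the parameter then moves by lr * velocity.

```python
import torch
import torch.optim as optim

lr, momentum = 0.1, 0.9

# Minimize f(w) = w^2 starting from w = 1.0; the gradient is 2w
w = torch.nn.Parameter(torch.tensor(1.0))
optimizer = optim.SGD([w], lr=lr, momentum=momentum)

# Hand-written version of the same update rule
w_manual, velocity = 1.0, 0.0
for step in range(3):
    optimizer.zero_grad()
    loss = w ** 2
    loss.backward()
    optimizer.step()

    grad = 2 * w_manual
    # The velocity accumulates past gradients, which gives the update its inertia
    velocity = momentum * velocity + grad
    w_manual = w_manual - lr * velocity

print(w.item(), w_manual)
```

After three steps both versions reach roughly 0.062, whereas plain SGD with the same learning rate would be at 0.8^3 = 0.512.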

### Layer initialization, transfer learning and fine-tuning

Fine-tuning process: find a model trained on a similar task; load the pre-trained weights; freeze (or not) some of the layers in the model; train with a smaller learning rate; look at the loss values to see whether the learning rate needs to be adjusted.

In [None]:
# Layer initialization (weights are initialized to small values)
layer = nn.Linear(64, 128)
nn.init.uniform_(layer.weight)
print(layer.weight.min(), layer.weight.max())

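# Transfer learning: reusing a model trained on a first task for a second similar task, to accelerate the training process.
# Save the state_dict (the recommended way to persist weights) and load it into a layer of the same shape
torch.save(layer.state_dict(), 'layer.pth')
new_layer = nn.Linear(64, 128)
new_layer.load_state_dict(torch.load('layer.pth'))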

# Fine-tuning (A type of transfer learning; Smaller learning rate; Not every layer is trained (we freeze some of them); Rule of thumb: freeze early layers of network and fine-tune layers closer to output layer)
model = nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 256))
for name, param in model.named_parameters():
    # Check if the parameters belong to the first layer (both '0.weight' and '0.bias')
    if name.startswith('0.'):
        # Freeze the parameters
        param.requires_grad = False
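
As a quick check that freezing has the intended effect, the sketch below freezes the first layer, hands only the trainable parameters to the optimizer, and runs one step. The random data, the squared-output loss and the learning rate of 1e-4 are illustrative assumptions. The frozen weights stay identical while the second layer changes.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 256))

# Freeze every parameter of the first layer (weight and bias)
for param in model[0].parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer, with a small learning rate
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-4)

frozen_before = model[0].weight.clone()
head_before = model[1].weight.clone()

# One illustrative training step on random data
loss = model(torch.randn(4, 64)).pow(2).sum()
loss.backward()
optimizer.step()

print(torch.equal(model[0].weight, frozen_before), torch.equal(model[1].weight, head_before))
```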