# Visualizing Training Process of Neural Networks

* by: [Jannik Mettner](mailto:jmettner@stud.hs-heilbronn.de)
* Heilbronn University, Germany
* Created: March 2024,  Last Edited: March 2024

In this notebook I visualize the training process of a basic neural network that learns to approximate the cosine function. By the end of the notebook you should be able to:
* create a PyTorch dataset
* create a basic PyTorch neural network
* implement the main loop for training the network on the dataset

## 0 - Basic imports
If you want to run PyTorch on your GPU, follow this guide: https://pytorch.org/get-started/locally/
CUDA-enabled GPUs from Nvidia and Apple Silicon GPUs are supported. To use your Nvidia GPU, select the highest CUDA version in the guide; to use Apple Silicon, make sure to select the Preview (Nightly) build, otherwise only the CPU is used. Installing with anything but Conda led to an erroneous installation for me, and at the time of writing Conda was the officially recommended way to install PyTorch.

In [1]:
#!pip install numpy torch matplotlib Pillow torchvision tqdm ipython scikit-learn torchview graphviz
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

from IPython import display

import torch
from torch import optim, nn
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
from torchview import draw_graph

from sklearn.preprocessing import MinMaxScaler

if torch.cuda.device_count() > 0: # Check if Nvidia GPU is used
    device = "cuda"
    print("The device used for training is", torch.cuda.get_device_name(0))
elif torch.backends.mps.is_available(): # Check if Apple Silicon GPU is used
    device = "mps"
    x=torch.ones(1).to(device)
    print("The device used for training is Apple Silicon GPU")
else:
    device = "cpu"
    print("The device used for training is the cpu")


## 1 - The Dataset
Neural networks are considered to be universal function approximators. Here we want to approximate the simple cosine function $ y=\cos(x) $, $ f: \mathbb{R} \rightarrow [-1,1] $.
First of all, we define the dataset that is used for training the neural net. It consists of three methods that have to be defined. The first is the $ \texttt{__init__} $ method, which creates the dataset object and stores important variables like the datapoints and the scalers used for the datapoints. The second method, $ \texttt{__len__} $, simply returns the length of the dataset, and the third, $ \texttt{__getitem__} $, returns the x and y values of the datapoint at position $ \texttt{idx} $.
The cosine function is defined for infinitely many inputs, but the dataset for training a neural network has to be finite. Therefore, the function is sampled in the interval $ [ \texttt{min} , \texttt{max} ] $ at $ \texttt{n_datapoints} $ points that are drawn uniformly at random from that interval, with a fixed seed so that the dataset is reproducible.
Since $ y $ is already in the range of -1 to 1, rescaling it changes almost nothing. The code below nevertheless passes it through its own scaler, which leaves the values nearly unchanged as long as the samples come close to both extremes. However, x is in the range of min to max, and to train the neural network efficiently we normalize it to the range -1 to 1.
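
To make the normalization step concrete, here is a minimal sketch of the linear mapping onto the range -1 to 1 and its inverse, written in plain Python so the arithmetic is visible. The helper names `scale_to_unit_range` and `unscale_from_unit_range` are made up for this illustration. Note one assumption: the sketch uses the interval bounds 0 and 40 directly, whereas a scaler that is fitted on data uses the smallest and largest value it actually sees.

```python
def scale_to_unit_range(value, lo, hi):
    """Map a value from [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def unscale_from_unit_range(scaled, lo, hi):
    """Inverse mapping from [-1, 1] back to [lo, hi]."""
    return (scaled + 1.0) / 2.0 * (hi - lo) + lo


lo, hi = 0.0, 40.0  # assumed bounds, matching the interval used in this notebook
for x in (0.0, 10.0, 20.0, 40.0):
    s = scale_to_unit_range(x, lo, hi)
    print(x, "->", s, "->", unscale_from_unit_range(s, lo, hi))
```

The round trip through both functions returns the original value, which is the same idea as undoing the scaling before plotting.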

In [None]:
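def function(x):
    # The target function that the network should learn
    return np.cos(x)

class CosineDataset(Dataset):
    def __init__(self, n_datapoints, min, max):
        np.random.seed(1)  # fixed seed so that the same datapoints are drawn on every run
        # Draw the x values uniformly at random from the interval [min, max]
        self.x_values = np.random.uniform(min, max, n_datapoints)
        self.x_scaler = MinMaxScaler(feature_range=(-1,1))

        self.y_values = function(self.x_values)
        self.y_scaler = MinMaxScaler(feature_range=(-1,1))

        # The scalers expect 2D input, so both arrays are reshaped to (n_datapoints, 1)
        self.x_values = self.x_scaler.fit_transform(self.x_values.reshape(-1,1))
        self.y_values = self.y_scaler.fit_transform(self.y_values.reshape(-1,1))

    def __len__(self):
        # Number of datapoints in the dataset
        return len(self.x_values)

    def __getitem__(self, idx):
        # Each returned x and y is an array of shape (1,)
        x = self.x_values[idx]
        y = self.y_values[idx]
        return x, y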
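
The two methods `__len__` and `__getitem__` are all Python needs to treat an object like a sequence. The following sketch shows this with a made-up toy class, `SquaresDataset`, that has nothing to do with PyTorch; it is only meant to illustrate the protocol. As far as I can tell, this is also why a plain `for x, y in dataset` loop works on a dataset like the one above.

```python
class SquaresDataset:
    """Hypothetical toy dataset: x = 0..n-1 and y = x**2."""

    def __init__(self, n_datapoints):
        self.x_values = list(range(n_datapoints))
        self.y_values = [x * x for x in self.x_values]

    def __len__(self):
        return len(self.x_values)

    def __getitem__(self, idx):
        # Indexing past the end raises IndexError, which ends a for-loop
        return self.x_values[idx], self.y_values[idx]


toy = SquaresDataset(5)
print(len(toy))   # number of datapoints
print(toy[3])     # a single (x, y) pair

# Without an __iter__ method, iteration falls back to __getitem__(0), (1), ...
pairs = [(x, y) for x, y in toy]
print(pairs)
```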

In [None]:
# Create the dataset object
dataset = CosineDataset(n_datapoints=100, min=0, max=40)

# Now a dataloader is defined that uses the previously defined dataset to pull multiple datapoints simultaneously from the dataset to create a mini-batch.
# The size of the mini-batches has to be specified for that. Also, datapoints are shuffled to make the model train more efficiently.
batch_size = 32
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
for x, y in loader:
    print(x.shape, y.shape)
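
What the dataloader does with the indices can be sketched in a few lines of plain Python. This is only an illustration of the idea under my own assumptions, not how PyTorch implements it; `make_batches` and its `seed` parameter are names invented for this example.

```python
import random


def make_batches(n_datapoints, batch_size, shuffle=True, seed=0):
    """Return one list of dataset indices per mini-batch."""
    indices = list(range(n_datapoints))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seeded only to make the sketch reproducible
    return [indices[i:i + batch_size] for i in range(0, n_datapoints, batch_size)]


batches = make_batches(n_datapoints=100, batch_size=32)
print([len(b) for b in batches])
```

With 100 datapoints and a batch size of 32, the last mini-batch is smaller than the others. That matters later when the per-batch losses are averaged over an epoch.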

Let's visualize the dataset by looping over it and storing all values for x and y.

In [None]:
points = np.array([[x, y] for x, y in dataset])
# Visualize the data as it is seen by the neural network
plt.scatter(points[:,0], points[:,1])
plt.ylabel("function(x)")
plt.xlabel("x")
plt.ylim(-5,5)
plt.title("Neural network's view on the data")
plt.show()
# Visualize the data with the correct and unscaled values for x.
plt.scatter(dataset.x_scaler.inverse_transform(points[:,0].reshape(-1,1)), dataset.y_scaler.inverse_transform(points[:,1].reshape(-1,1)) ) # undo scaling with scaler.inverse_transform
plt.ylabel("function(x)")
plt.xlabel("x")
plt.ylim(-5,5)
plt.title("Unscaled view on the data")
plt.show()

## 2 - The Neural Network
Now we will define the neural network that should solve the task. Feel free to play around with the layers and hyperparameters and observe their effects on the training output.
To create the neural network, only two methods have to be defined. The $\texttt{__init__}$ method simply initializes and stores all the layers given their hyperparameters, and the $\texttt{forward}$ method defines the order in which a datapoint passes through the layers.
For this task we create 4 fully connected / dense / linear layers. The important thing here is to keep track of the shape/size/features of datapoints that go in and out of each layer. Since we want to approximate the cosine function, which takes one number and outputs one number, the first layer's $\texttt{in_features}$ is 1 and the last layer's $\texttt{out_features}$ is 1. For the hidden layers, it is only important that the $\texttt{out_features}$ of the previous layer is the same as the $\texttt{in_features}$ of the current one.
In the forward method the input x passes through the previously defined dense layers with the activation function $\texttt{leaky_relu}$ between them. The last function applied is the $\texttt{tanh}$ function, which makes sure that the value y can only be between -1 and 1, as it should, because the cosine function can also only output values between -1 and 1. If the sigmoid function were used instead, the values of y would be in the range 0 to 1, which would make it impossible to reproduce the negative values of the cosine function.
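
The output ranges of the activation functions mentioned above can be checked with a small plain-Python sketch. The helper functions are written out by hand for illustration; the slope of 0.01 for negative inputs is the default I assume for $\texttt{leaky_relu}$.

```python
import math


def leaky_relu(z, negative_slope=0.01):
    """Pass positive values through, scale negative values down."""
    return z if z >= 0 else negative_slope * z


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  leaky_relu={leaky_relu(z):+.3f}  "
          f"tanh={math.tanh(z):+.3f}  sigmoid={sigmoid(z):.3f}")
```

For negative inputs, tanh approaches -1 while the sigmoid stays above 0, which is why tanh is the better fit for a target in the range -1 to 1.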

In [None]:
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel,self).__init__()
        self.linear1 = nn.Linear(in_features=1, out_features=100)
        self.linear2 = nn.Linear(in_features=100, out_features=100)
        self.linear3 = nn.Linear(in_features=100, out_features=100)
        self.linear4 = nn.Linear(in_features=100, out_features=1)
        
    def forward(self, x):
        block1 = F.leaky_relu(self.linear1(x))
        block2 = F.leaky_relu(self.linear2(block1))
        block3 = F.leaky_relu(self.linear3(block2))
        y = F.tanh(self.linear4(block3))
        return y
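
To get a feeling for the size of this model, the number of trainable parameters can be counted by hand. The sketch below assumes that every linear layer has a weight matrix plus one bias per output feature; `linear_param_count` is a helper made up for this illustration.

```python
def linear_param_count(in_features, out_features):
    # Weight matrix of shape (out_features, in_features) plus one bias per output
    return in_features * out_features + out_features


# (in_features, out_features) of the four layers defined above
layer_sizes = [(1, 100), (100, 100), (100, 100), (100, 1)]

# The out_features of each layer must match the in_features of the next one
for (_, out_prev), (in_next, _) in zip(layer_sizes, layer_sizes[1:]):
    assert out_prev == in_next

counts = [linear_param_count(i, o) for i, o in layer_sizes]
total_params = sum(counts)
print(counts, total_params)
```

Most of the parameters sit in the two hidden layers, so changing the hidden width has a much larger effect on the model size than changing the input or output layer.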

In [None]:
# Create the model
model = SimpleModel()

# To visualize the model, Graphviz has to be installed on your system. Otherwise, comment out the last two lines of this cell.
# Windows: winget install graphviz          Then add C:\Program Files\Graphviz\bin to the PATH environment variable and restart the IDE
# macOS: brew install graphviz
# Linux (Debian/Ubuntu): sudo apt-get install graphviz
# The input size is (batch_size, number of input features).
model_graph = draw_graph(model, input_size=(32,1), expand_nested=False)
model_graph.visual_graph

## 3 - Training the Neural Network
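
The training cell below uses the mean squared error and accumulates it per epoch by multiplying each batch's mean loss with the number of datapoints in that batch. The following plain-Python sketch, with made-up numbers, shows why this weighting gives the same result as computing the MSE over all datapoints at once, even when the last batch is smaller.

```python
def mse(predictions, targets):
    """Mean squared error of two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up values, only for illustration
targets = [0.0, 1.0, -1.0, 0.5, -0.5]
predictions = [0.1, 0.8, -0.7, 0.5, 0.0]
toy_batch_size = 2  # 5 datapoints -> batches of size 2, 2 and 1

epoch_loss = 0.0
for start in range(0, len(targets), toy_batch_size):
    batch_p = predictions[start:start + toy_batch_size]
    batch_t = targets[start:start + toy_batch_size]
    # Mean loss of the batch times its size = summed squared error of the batch
    epoch_loss += mse(batch_p, batch_t) * len(batch_t)

average_loss = epoch_loss / len(targets)
print(average_loss, mse(predictions, targets))
```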

In [None]:
model = model.to(device) # move the model to the gpu, if one exists
# Create the loss function and optimizer
loss_func = nn.MSELoss() # Define the loss function. The Mean Squared Error can be used for such a regression task.

# Define the optimizer. One of the first ones was stochastic gradient descent, but the Adam optimizer is now one of the most widely used ones.
# The learning rate (lr) is also specified here.
optimizer = optim.Adam(model.parameters(), lr=0.01) 


# Preparing the plot that visualizes the training process 
fig, ax = plt.subplots(2,1, figsize=(10,10))

num_epochs = 1000
losses = []
# Train the model
for epoch in range(num_epochs):
    epoch_loss = 0 # To determine the average loss of all datapoints in that epoch
    
    # Wrap the dataloader in a progress bar that shows the current epoch, mini-batch and MSE in the Jupyter output.
    progress_bar = tqdm(loader, desc=f'Epoch {epoch+1}/{num_epochs}') 
    for i, (x, y) in enumerate(progress_bar):
        # Convert the data to float32 and move it to the gpu, if available. The batches already have the shape (batch, 1), so no extra dimension is needed.
        x, y = x.to(torch.float32).to(device), y.to(torch.float32).to(device)

        # Forward pass
        y_pred = model(x)

        # Compute loss
        loss = loss_func(y_pred, y)
        epoch_loss += loss.item() * x.size(0) # Mean loss of the batch times the number of datapoints in the batch to get the accumulated epoch loss

        # Backward pass and optimization
        optimizer.zero_grad() # Reset gradients to 0 so the next backward() call can add gradients of this batch
        loss.backward() # Backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.01) # The gradients of each parameter are stored together with the parameter. We clip their overall norm so they do not explode.
        
        optimizer.step() # The optimizer goes one step into the optimal direction on the error surface to reduce the error.
        
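        n_seen = min((i + 1) * batch_size, len(loader.dataset)) # Number of datapoints processed so far in this epoch (the last batch can be smaller)
        progress_bar.set_postfix(MSE=epoch_loss / n_seen) # Update the progress bar to include the running average loss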
    losses.append(epoch_loss/len(loader.dataset))
    
    
    # Updating the figure to visualize current prediction accuracy of the neural net
    ax[0].cla()
    ax[1].cla()
    display.clear_output(wait=True)
    ax[0].scatter(dataset.x_scaler.inverse_transform(points[:,0].reshape(-1,1)), points[:,1], label="Training points")
    x_values = np.linspace(-1.3,1.3,200) 
    outputs = model(torch.tensor(x_values).to(torch.float32).unsqueeze(1).to(device)).cpu().detach().numpy()
    ax[0].plot(dataset.x_scaler.inverse_transform(x_values.reshape(-1,1)), outputs, c="red", label="Prediction")
    ax[0].legend()
    ax[0].set_title("Neural network predicting the cosine function")
    ax[0].set_ylim([-2.5,2.5])
    ax[0].set_ylabel("function(x)")
    ax[0].set_xlabel("x")
    ax[1].set_xlim([0, num_epochs])
    ax[1].set_ylim([0, 1])
    ax[1].plot(losses, label="MSE", c="red")
    ax[1].set_title("Mean Squared Error")
    ax[1].set_ylabel("MSE")
    ax[1].set_xlabel("Epoch")
    ax[1].legend()
    display.display(fig)

    # Log the average loss per epoch
    print(f'Epoch {epoch+1}, Average Loss: {epoch_loss/len(loader.dataset):.4f}')
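
The gradient clipping used in the training loop limits the overall length (L2 norm) of all gradients taken together. The plain-Python sketch below illustrates that idea on a made-up gradient with two entries. It mirrors the principle only and is not meant as PyTorch's exact implementation; the function name `clip_by_global_norm` and the small `eps` term are my own assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: gradients that are already small enough stay unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [3.0, -4.0]  # made-up gradient with norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.01)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Clipping keeps the direction of the gradient and only shortens it. With a threshold as small as 0.01, almost every update is shortened, so this value is worth experimenting with.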