# Modules, Optimizers and Losses

The training loop that we implemented in the last section works, but there are better ways to structure the same solution. Once our neural network architectures get more complex, we will be glad to have a more modular training approach.

While there is a lot to improve from the last chapter, many of the components will also feel familiar.

First we import the necessary libraries.

In [1]:
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms

We get a dataset.

In [2]:
dataset = MNIST(root="../datasets", train=True, download=True, transform=transforms.ToTensor())

We set the hyperparameters.

In [3]:
# parameters
DEVICE = ("cuda:0" if torch.cuda.is_available() else "cpu")
NUM_EPOCHS=10
BATCH_SIZE=32

#number of hidden units in the first and second hidden layer
HIDDEN_SIZE_1 = 100
HIDDEN_SIZE_2 = 50
NUM_LABELS = 10
NUM_FEATURES = 28*28
ALPHA = 0.1

And we initialize the DataLoader object.

In [4]:
dataloader = DataLoader(dataset=dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        drop_last=True,
                        num_workers=4)

This time around we will start by looking at our final product, the training loop, in order to understand what we need to make our code clean, modular and scalable. Instead of calculating one layer after another, we will calculate our forward pass using a single call to `model`. The model is essentially the function that generates the predictions we are interested in. In our case the model will contain all the matrix multiplications and activation functions needed to predict the probability that the features belong to a certain digit. The criterion is the loss function, in our case the cross-entropy. The optimizer loops through the parameters and applies gradient descent when we call `optimizer.step()`, and it clears all the gradients when we call `optimizer.zero_grad()`. The rest of the implementation is the same as before, but we hope that you notice how much more understandable the code gets.

In [5]:
def train(dataloader, model, criterion, optimizer):
    for epoch in range(NUM_EPOCHS):
        loss_sum = 0
        batch_nums = 0
        for batch_idx, (features, labels) in enumerate(dataloader):
            # flatten the images and move features and labels to DEVICE
            features = features.view(-1, NUM_FEATURES).to(DEVICE)
            labels = labels.to(DEVICE)

            # ------ FORWARD PASS --------
            probs = model(features)

            # ------CALCULATE LOSS --------
            loss = criterion(probs, labels)

            # ------BACKPROPAGATION --------
            loss.backward()

            # ------GRADIENT DESCENT --------
            optimizer.step()

            # ------CLEAR GRADIENTS --------
            optimizer.zero_grad()

            # ------TRACK LOSS --------
            batch_nums += 1
            loss_sum += loss.detach().cpu()

        print(f'Epoch: {epoch+1} Loss: {loss_sum / batch_nums}')
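
The `view` call at the top of the inner loop deserves a short illustration. The sketch below assumes that one MNIST batch produced by `transforms.ToTensor()` and the `DataLoader` has the shape `(32, 1, 28, 28)`; a random tensor stands in for real data so that the snippet runs without downloading the dataset.

```python
import torch

# stand-in for one MNIST batch: (batch, channel, height, width)
batch = torch.rand(32, 1, 28, 28)

# -1 lets PyTorch infer the batch dimension; each image becomes one row
flat = batch.view(-1, 28 * 28)
print(flat.shape)
```

Each row of `flat` now holds the 784 pixel values of one image, which is the layout our linear layers expect.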

Now the time has finally come to talk about modules, optimizers and losses.

In order to make our calculations more modular, we will create a `Module` class. You can think of a module as a piece of a neural network. Usually modules are those pieces of a network that we apply over and over again. In essence you create a neural network by defining and stacking modules. As we need to apply linear transformations several times, we put the logic of a linear layer into a separate class and call that class `Module`. This module initializes a weight matrix $\mathbf{W}$ and a bias vector $\mathbf{b}$. For easier access at a later point we create an attribute `parameters`, which is just a list holding the weights and the bias. We also implement the `__call__` method, which contains the logic for the forward pass.

In [8]:
class Module:
    
    def __init__(self, in_features, out_features):
        self.W = torch.normal(mean=0, 
                              std=0.1, 
                              size=(out_features, in_features), 
                              requires_grad=True, 
                              device=DEVICE, 
                              dtype=torch.float32)
        self.b = torch.zeros(1, 
                             out_features, 
                             requires_grad=True, 
                             device=DEVICE, 
                             dtype=torch.float32)
        self.parameters = [self.W, self.b]
                
    def __call__(self, features):
        return features @ self.W.T + self.b

Using the module could, for example, look as follows.

In [9]:
features = torch.randn(10, 2).to(DEVICE)
linear_module = Module(2, 5)
print(linear_module(features))

tensor([[ 0.0735, -0.0272,  0.0012,  0.1392, -0.0333],
        [ 0.0268, -0.0219,  0.0403, -0.1075, -0.0814],
        [ 0.1632, -0.0679,  0.0277,  0.2101, -0.1172],
        [ 0.0622, -0.0454,  0.0755, -0.1778, -0.1576],
        [ 0.0847, -0.0413,  0.0345,  0.0291, -0.0958],
        [ 0.1066, -0.0480,  0.0302,  0.0892, -0.0976],
        [ 0.0538, -0.0309,  0.0373, -0.0423, -0.0875],
        [-0.0066, -0.0063,  0.0289, -0.1275, -0.0474],
        [-0.0408,  0.0046,  0.0342, -0.2158, -0.0421],
        [ 0.1061, -0.0353, -0.0116,  0.2543, -0.0248]], device='cuda:0',
       grad_fn=<AddBackward0>)


Our model also needs a sigmoid and a softmax activation function. These are the same implementations as in the previous section. Here we define the activations as separate functions in order to be able to reuse them.

In [10]:
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))

In [11]:
def softmax(z):
    numerator = torch.exp(z)
    denominator = numerator.sum(dim=1, keepdim=True)
    return numerator / denominator
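
One caveat: `torch.exp` overflows for large inputs, so the softmax above can return `nan` when the logits are large. A common remedy, sketched below, is to subtract the row-wise maximum before exponentiating, which leaves the result mathematically unchanged. The name `stable_softmax` is our own and is not used elsewhere in this chapter.

```python
import torch

def stable_softmax(z):
    # Subtracting the row-wise maximum does not change the result
    # mathematically, but keeps torch.exp from overflowing.
    shifted = z - z.max(dim=1, keepdim=True).values
    numerator = torch.exp(shifted)
    denominator = numerator.sum(dim=1, keepdim=True)
    return numerator / denominator

# the first row would overflow in the plain implementation
logits = torch.tensor([[1000.0, 0.0], [1.0, 2.0]])
probs = stable_softmax(logits)
print(probs)
```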

The model initializes three linear modules. In the `__call__` method we implement the full forward pass. So when we call `model(features)`, the features are processed by the neural network and the outputs are returned to the caller. Additionally we implement the `parameters` method, which just returns all parameters of the model.

In [12]:
class Model:
    
    def __init__(self):
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, NUM_LABELS)
        
    def __call__(self, features):
        x = self.linear_1(features)
        x = sigmoid(x)
        x = self.linear_2(x)
        x = sigmoid(x)
        x = self.linear_3(x)
        x = softmax(x)
        return x
    
    def parameters(self):
        parameters = [*self.linear_1.parameters,
                      *self.linear_2.parameters,
                      *self.linear_3.parameters]
        return parameters

Applying the forward pass of a predefined model feels more intuitive than our previous implementation.

In [14]:
features = torch.randn(BATCH_SIZE, NUM_FEATURES).to(DEVICE)
model = Model()
output = model(features)

The optimizer class is responsible for applying gradient descent and for clearing the gradients. Our class needs the learning rate (alpha) and the parameters of the model. When we call `step()` we loop over all parameters and apply gradient descent, and when we call `zero_grad()` we clear all the gradients. Notice that the optimizer logic works independently of the exact architecture of the model, which makes the code more manageable.

In [15]:
class SGDOptimizer:
    
    def __init__(self, alpha, parameters):
        self.alpha = alpha
        self.parameters = parameters
    
    def step(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.sub_(self.alpha * parameter.grad)
                
    def zero_grad(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.grad.zero_()
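
To see the two methods in isolation, the following sketch minimizes a one-parameter quadratic with a compact re-implementation of the optimizer. It is an illustration under assumptions: the class name `TinySGD`, the use of `torch.no_grad()` in place of `torch.inference_mode()`, and the guard against `grad` being `None` are our choices and not part of the code above.

```python
import torch

class TinySGD:  # hypothetical name, mirrors SGDOptimizer above
    def __init__(self, alpha, parameters):
        self.alpha = alpha
        self.parameters = list(parameters)

    def step(self):
        # the update itself must not be recorded by autograd
        with torch.no_grad():
            for p in self.parameters:
                if p.grad is not None:
                    p.sub_(self.alpha * p.grad)

    def zero_grad(self):
        for p in self.parameters:
            # grad stays None until the first backward() call
            if p.grad is not None:
                p.grad.zero_()

w = torch.tensor([5.0], requires_grad=True)
optimizer = TinySGD(alpha=0.1, parameters=[w])

for _ in range(100):
    loss = ((w - 2.0) ** 2).sum()  # minimum at w = 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(w)
```

After 100 steps `w` should sit very close to 2, the minimum of the quadratic. Without the call to `zero_grad()` the gradients of successive iterations would accumulate and the updates would no longer correspond to plain gradient descent.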

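Finally we implement the loss function. Once again the calculation of the loss is independent of the model and the optimizer. When we change one of the components, we do not introduce any breaking changes. If we replace the negative log likelihood by mean squared error, our training loop will keep working. Two details of the implementation below are worth noting. The function relies on the global `BATCH_SIZE`, which only works because we set `drop_last=True` in the `DataLoader`. And `.mean()` averages over all `BATCH_SIZE * NUM_LABELS` entries, so the reported values are smaller by a factor of `NUM_LABELS` than a per-sample average would be.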

In [16]:
def nll_loss(outputs, labels):
    one_hot_labels = torch.zeros(BATCH_SIZE, NUM_LABELS).to(DEVICE)
    for sample_idx, label in enumerate(labels):
        one_hot_labels[sample_idx][label] = 1
        
    return -(one_hot_labels * torch.log(outputs)).mean()
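
The Python loop over samples and the dependence on `BATCH_SIZE` can be avoided. The sketch below builds the one-hot matrix with `torch.nn.functional.one_hot` and reads the number of classes from the input. The function name `nll_loss_vectorized` is hypothetical, and the random probabilities are placeholder data. The reduction is kept as a mean over all entries so that the result matches the function above.

```python
import torch
import torch.nn.functional as F

def nll_loss_vectorized(outputs, labels):
    # one_hot returns an integer tensor of shape (batch, num_classes)
    one_hot_labels = F.one_hot(labels, num_classes=outputs.shape[1])
    one_hot_labels = one_hot_labels.to(outputs.dtype)
    # same reduction as above: mean over all batch * class entries
    return -(one_hot_labels * torch.log(outputs)).mean()

# placeholder probabilities for 4 samples and 10 classes
probs = torch.softmax(torch.randn(4, 10), dim=1)
labels = torch.tensor([3, 0, 9, 1])
loss = nll_loss_vectorized(probs, labels)
print(loss)
```

Multiplying this value by the number of classes gives the per-sample average of the negative log probabilities of the correct classes.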

Now we have all the components that are required by our training loop.

In [17]:
model = Model()
optimizer = SGDOptimizer(ALPHA, model.parameters())
criterion = nll_loss

In [18]:
train(dataloader, model, criterion, optimizer)

Epoch: 1 Loss: 0.22587831318378448
Epoch: 2 Loss: 0.2053816169500351
Epoch: 3 Loss: 0.15154482424259186
Epoch: 4 Loss: 0.10453948378562927
Epoch: 5 Loss: 0.0807894766330719
Epoch: 6 Loss: 0.0671185776591301
Epoch: 7 Loss: 0.05817437916994095
Epoch: 8 Loss: 0.05199269950389862
Epoch: 9 Loss: 0.04752832278609276
Epoch: 10 Loss: 0.044177427887916565


You can probably guess that PyTorch provides such classes and functions out of the box. `torch.nn` contains the base `Module` class and `torch.nn.functional` contains activation functions.

In [19]:
import torch.nn as nn
import torch.nn.functional as F

When we write custom modules we need to subclass `nn.Module`. All trainable parameters are wrapped in the `nn.parameter.Parameter()` class. This tells PyTorch to register those tensors in the module's parameter list (which is passed to the optimizer), and the tensors are automatically tracked for gradient computation. Instead of defining `__call__` as we did before, we define the `forward` method. PyTorch calls this method automatically when we call the module object. You should not call `forward` directly, because calling the module also runs additional logic, such as registered hooks (instead of `module.forward(features)` use `module(features)`).

In [22]:
class Module(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.parameter.Parameter(torch.normal(mean=0, std=0.1, 
                              size=(out_features, in_features)))
        self.b = nn.parameter.Parameter(torch.zeros(1, out_features))

    def forward(self, features):
        return features @ self.W.T + self.b
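
As a quick check of the registration mechanism, the sketch below wraps one tensor in `nn.Parameter` (the shorter alias of `nn.parameter.Parameter`) and leaves another as a plain attribute. Only the wrapped tensor shows up in `named_parameters()`. The class `Demo` and its attribute names are made up for this illustration.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.registered = nn.Parameter(torch.zeros(3))
        self.plain = torch.zeros(3)  # plain tensor, not registered

    def forward(self, x):
        return x + self.registered + self.plain

demo = Demo()
names = [name for name, _ in demo.named_parameters()]
print(names)
print(demo.registered.requires_grad)
```

A plain tensor attribute like `plain` would therefore never be updated by an optimizer that receives `demo.parameters()`.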

The great thing about PyTorch modules is their composability. Modules created earlier can be used inside subsequent modules. Below, for example, we use the `Module` defined above in the `Model` module. In later chapters we will see how we can create blocks of arbitrary complexity using this simple approach.

Notice that we use the `log_softmax` activation function instead of a simple softmax. According to the PyTorch documentation, `log_softmax` is faster and numerically more stable than computing the softmax and the logarithm separately.

In [27]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, NUM_LABELS)
        
    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = F.log_softmax(x, dim=1)
        return x
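
The numerical advantage of `log_softmax` mentioned above can be made visible with an extreme input. In this sketch the logits are chosen by us so that one probability underflows to zero in 32-bit floating point.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]])

# softmax followed by log: the second probability underflows to 0,
# and log(0) is -inf
naive = torch.log(torch.softmax(logits, dim=1))

# log_softmax computes the log probabilities directly
stable = F.log_softmax(logits, dim=1)

print(naive)
print(stable)
```

The two-step version loses the second log probability entirely, while `log_softmax` returns a finite value that a loss function can still work with.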

PyTorch also provides loss functions and optimizers. The `NLLLoss` calculates the negative log likelihood loss based on the log probabilities from our model and the true labels. You can find all optimizers in `torch.optim`. For now we will use stochastic gradient descent, but there are many more optimizers that we will encounter soon. The interface of the loss functions and optimizers matches the one we implemented above, so the training loop stays unchanged.

In [28]:
model = Model().to(DEVICE)
criterion = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
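
To connect `torch.optim.SGD` with our handwritten optimizer, the following sketch performs a single update on a toy parameter and compares it with the manual rule. The tensor `w` and the quadratic loss are made up for this illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
loss.backward()

# manual gradient descent step, computed before the optimizer runs
expected = w.detach() - 0.1 * w.grad

optimizer.step()
optimizer.zero_grad()

print(w)
print(torch.allclose(w.detach(), expected))
```

With the default settings (no momentum, no weight decay) the built-in optimizer applies the same update as our `SGDOptimizer`.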

In [29]:
train(dataloader, model, criterion, optimizer)

Epoch: 1 Loss: 1.0678449869155884
Epoch: 2 Loss: 0.36083656549453735
Epoch: 3 Loss: 0.2820141315460205
Epoch: 4 Loss: 0.23701903223991394
Epoch: 5 Loss: 0.20419947803020477
Epoch: 6 Loss: 0.17840386927127838
Epoch: 7 Loss: 0.15804895758628845
Epoch: 8 Loss: 0.14154206216335297
Epoch: 9 Loss: 0.12750738859176636
Epoch: 10 Loss: 0.1162610873579979


PyTorch provides a lot of modules out of the box. A linear transformation is such a common operation that you should use `nn.Linear` instead of implementing your own solution from scratch.

In [34]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = nn.Linear(HIDDEN_SIZE_2, NUM_LABELS)
    
    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = F.log_softmax(x, dim=1)
        return x

In [35]:
model = Model().to(DEVICE)
criterion = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)

In [36]:
train(dataloader, model, criterion, optimizer)

Epoch: 1 Loss: 1.466644525527954
Epoch: 2 Loss: 0.4674857556819916
Epoch: 3 Loss: 0.3282727301120758
Epoch: 4 Loss: 0.2696901261806488
Epoch: 5 Loss: 0.22721980512142181
Epoch: 6 Loss: 0.19520707428455353
Epoch: 7 Loss: 0.17021693289279938
Epoch: 8 Loss: 0.15134933590888977
Epoch: 9 Loss: 0.13562604784965515
Epoch: 10 Loss: 0.12347173690795898


To finish this chapter let us discuss a couple more PyTorch conveniences that you will find useful. You might have noticed that all modules and activation functions are called one after another, where the output of one module (or activation) is used as the input to the next. In that case we can pack all modules and activations into an `nn.Sequential` object. When we call that object, the components are executed in sequential order.

In [41]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
                nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_2, NUM_LABELS),
            )
    
    def forward(self, features):
        return self.layers(features)
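
An `nn.Sequential` container can also be used and inspected on its own. The sketch below rebuilds the same stack of layers with the layer sizes written out as literals (784, 100, 50 and 10, matching the hyperparameters of this chapter), indexes into it, and counts its parameters.

```python
import torch
import torch.nn as nn

layers = nn.Sequential(
    nn.Linear(28 * 28, 100),
    nn.Sigmoid(),
    nn.Linear(100, 50),
    nn.Sigmoid(),
    nn.Linear(50, 10),
)

# components can be accessed by position
print(layers[0])

# weights and biases of all three linear layers
num_params = sum(p.numel() for p in layers.parameters())
print(num_params)

# a random batch stands in for flattened MNIST images
out = layers(torch.randn(32, 28 * 28))
print(out.shape)
```

The parameter count follows directly from the layer sizes: 784 * 100 + 100 for the first layer, 100 * 50 + 50 for the second and 50 * 10 + 10 for the third.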

Above we have omitted the `nn.LogSoftmax()` activation function, and below we replace the `nn.NLLLoss()` loss function with `nn.CrossEntropyLoss()`. We do that because `nn.CrossEntropyLoss()` combines a log softmax layer with the negative log likelihood loss. The model therefore outputs raw, unnormalized scores (logits), and the combination is expected to be numerically more stable.

In [42]:
model = Model().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
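
The claim that `nn.CrossEntropyLoss()` combines a log softmax with the negative log likelihood loss can be checked directly. In this sketch the random logits and the labels are placeholder data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)  # raw, unnormalized model outputs
labels = torch.tensor([3, 0, 9, 1])

# one call on the logits
combined = nn.CrossEntropyLoss()(logits, labels)

# log softmax first, then the negative log likelihood loss
two_step = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(combined, two_step))
```

Both paths yield the same loss up to floating point tolerance, which is why the model can drop its final activation when it is trained with `nn.CrossEntropyLoss()`.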

In [43]:
train(dataloader, model, criterion, optimizer)

Epoch: 1 Loss: 1.4635180234909058
Epoch: 2 Loss: 0.4384552836418152
Epoch: 3 Loss: 0.32038575410842896
Epoch: 4 Loss: 0.265774130821228
Epoch: 5 Loss: 0.22523467242717743
Epoch: 6 Loss: 0.1935357302427292
Epoch: 7 Loss: 0.170162171125412
Epoch: 8 Loss: 0.15130649507045746
Epoch: 9 Loss: 0.13578613102436066
Epoch: 10 Loss: 0.12304723262786865


We know that this chapter contained a lot of new information, but rest assured that all those concepts will become second nature as you move along with your studies of PyTorch.