# Training Loop

In this chapter we will finally learn to classify the MNIST dataset. We will implement the full training loop: we loop over the number of epochs, get a batch of data from the dataset, perform a forward pass, run the backpropagation algorithm and apply gradient descent. In this section we implement many of those steps manually. In the next section we will use some helpful PyTorch classes and show how they make our code more efficient.

In [3]:
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms

Below we download the MNIST dataset. When we download the data, we use so-called `transforms` from the torchvision library for the first time. Transforms allow us to process images in a predetermined way. We will cover transforms in more detail later. For now you should know that the `ToTensor()` transform converts the image into a PyTorch `Tensor` and rescales the pixel values from the original range of 0 to 255 to the range of 0 to 1. This rescaling is important to make sure that training of the neural network actually works. We will discuss in the next section why this technique, called feature scaling, is important.
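To make the rescaling concrete, here is a minimal hand-written sketch of what `ToTensor()` roughly does to an 8-bit image. The tiny 2x2 "image" and the variable names are made up for illustration, and the sketch mimics the transform with plain tensor operations instead of calling torchvision.

```python
import torch

# Hypothetical 2x2 grayscale image with raw 8-bit pixel values,
# stored as (height, width, channels) like a typical image array.
raw = torch.tensor([[[0], [64]], [[128], [255]]], dtype=torch.uint8)

# Roughly what ToTensor() does for 8-bit images:
# move the channel dimension to the front and divide by 255.
scaled = raw.permute(2, 0, 1).to(torch.float32) / 255

print(scaled.shape)                               # (channels, height, width)
print(scaled.min().item(), scaled.max().item())   # smallest and largest pixel value
```

The result is a float tensor with the channel dimension first and all values between 0 and 1, which is the form the MNIST images have when they arrive from the dataset.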

In [4]:
dataset = MNIST(root="../datasets", train=True, download=True, transform=transforms.ToTensor())

Usually it is good practice to save the parameters that were used in the training process. There are more convenient and efficient ways to do that, but a simple list of constants is good enough at the moment.

In [5]:
# parameters
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
NUM_EPOCHS = 10
BATCH_SIZE = 32

# number of hidden units in the first and second hidden layer
HIDDEN_SIZE_1 = 100
HIDDEN_SIZE_2 = 50
NUM_LABELS = 10
NUM_FEATURES = 28 * 28
# learning rate for gradient descent
ALPHA = 0.1

We create a vanilla `DataLoader` that shuffles the data in each epoch, drops the last batch if it has fewer than 32 samples and uses several worker processes to load the data.

In [6]:
dataloader = DataLoader(dataset=dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        drop_last=True,
                        num_workers=4)
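The effect of `drop_last` is easiest to see on a dataset whose size is not a multiple of the batch size. The sketch below uses a made-up toy dataset of 100 samples (not MNIST) and `num_workers=0`, which loads the data in the main process and is enough for such a small example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (an assumption for illustration): 100 samples with 4 features each.
toy_features = torch.arange(400, dtype=torch.float32).view(100, 4)
toy_labels = torch.arange(100) % 10
toy_dataset = TensorDataset(toy_features, toy_labels)

# Same settings as above, except for drop_last and the number of workers.
loader_keep = DataLoader(toy_dataset, batch_size=32, shuffle=True,
                         drop_last=False, num_workers=0)
loader_drop = DataLoader(toy_dataset, batch_size=32, shuffle=True,
                         drop_last=True, num_workers=0)

# Collect the number of samples in every batch of one epoch.
keep_sizes = [len(labels) for _, labels in loader_keep]
drop_sizes = [len(labels) for _, labels in loader_drop]

print(keep_sizes)  # [32, 32, 32, 4]
print(drop_sizes)  # [32, 32, 32]
```

With `drop_last=True` every batch has exactly 32 samples, which is why the training loop further down can allocate its one-hot tensor with a fixed `BATCH_SIZE`.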

We need to somehow initialize our weights. PyTorch allows us to initialize a Tensor with random numbers.

`torch.rand(size)` for example initializes a tensor of shape `size` with uniform random values between 0 and 1. 

In [7]:
# size determines the shape of the tensor
torch.rand(size=(2, 3))

tensor([[0.8546, 0.8391, 0.5011],
        [0.3893, 0.3010, 0.3057]])

`torch.randn(size)` on the other hand initializes the tensor with values drawn from the standard normal distribution.

In [8]:
torch.randn(size=(2, 3))

tensor([[ 1.0298, -0.9004, -0.8039],
        [ 0.1111, -1.7512, -0.9334]])

Additionally we can use `torch.normal(mean, std, size)`, which draws values from a normal distribution with the given mean and standard deviation. This is the function we will use to initialize our weights.

In [9]:
torch.normal(mean=0, std=1, size=(2,3))

tensor([[-0.6043,  1.4194, -0.1698],
        [-1.3896, -1.0160, -2.1488]])
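The `std` argument controls how far the random values spread around the mean. As a rough check, the sketch below draws two large tensors and compares their empirical standard deviations. The seed and the tensor sizes are arbitrary choices for this illustration.

```python
import torch

# Fix the seed so that repeated runs draw the same numbers.
torch.manual_seed(0)

# Two tensors with the same mean but different standard deviations.
wide = torch.normal(mean=0.0, std=1.0, size=(1000, 100))
narrow = torch.normal(mean=0.0, std=0.1, size=(1000, 100))

# The empirical standard deviations should be close to 1.0 and 0.1.
print(wide.std().item())
print(narrow.std().item())
```

A smaller `std` keeps the initial weights close to zero, which is the setting used for the weights of our network.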

Our neural network has two hidden layers. That means that we need three sets of weights and biases. 

`features -> hidden_1 -> hidden_2 -> outputs`

We initialize the weights with the normal distribution and set the biases to 0. Because we want autodiff to track those variables we set `requires_grad=True`.

In [10]:
# we create a set of weights and biases
W_1 = torch.normal(mean=0, std=0.1, size=(HIDDEN_SIZE_1, NUM_FEATURES), requires_grad=True, device=DEVICE)
b_1 = torch.zeros(1, HIDDEN_SIZE_1, requires_grad=True, device=DEVICE)

W_2 = torch.normal(mean=0, std=0.1, size=(HIDDEN_SIZE_2, HIDDEN_SIZE_1), requires_grad=True, device=DEVICE)
b_2 = torch.zeros(1, HIDDEN_SIZE_2, requires_grad=True, device=DEVICE)

W_3 = torch.normal(mean=0, std=0.1, size=(NUM_LABELS, HIDDEN_SIZE_2), requires_grad=True, device=DEVICE)
b_3 = torch.zeros(1, NUM_LABELS, requires_grad=True, device=DEVICE)
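Before training it can help to check that the shapes of the three layers fit together. The following sketch rebuilds the parameters on the CPU with local, illustrative names and pushes a random dummy batch through the linear transformations only. The activation functions are left out because they do not change any shape.

```python
import torch

# Sizes copied from the parameter list above; the names are local to this sketch.
batch_size, num_features = 32, 28 * 28
hidden_size_1, hidden_size_2, num_labels = 100, 50, 10

W_1 = torch.normal(mean=0.0, std=0.1, size=(hidden_size_1, num_features), requires_grad=True)
b_1 = torch.zeros(1, hidden_size_1, requires_grad=True)
W_2 = torch.normal(mean=0.0, std=0.1, size=(hidden_size_2, hidden_size_1), requires_grad=True)
b_2 = torch.zeros(1, hidden_size_2, requires_grad=True)
W_3 = torch.normal(mean=0.0, std=0.1, size=(num_labels, hidden_size_2), requires_grad=True)
b_3 = torch.zeros(1, num_labels, requires_grad=True)

# Dummy batch standing in for 32 flattened images.
x = torch.rand(batch_size, num_features)

# Each weight matrix is stored as (outputs, inputs), hence the transpose.
h_1 = x @ W_1.T + b_1      # (32, 784) @ (784, 100) -> (32, 100)
h_2 = h_1 @ W_2.T + b_2    # (32, 100) @ (100, 50)  -> (32, 50)
out = h_2 @ W_3.T + b_3    # (32, 50)  @ (50, 10)   -> (32, 10)

print(h_1.shape, h_2.shape, out.shape)
print(out.requires_grad)   # True, because the parameters are tracked by autograd
```

The bias tensors have a leading dimension of 1, so broadcasting adds the same bias to every sample in the batch.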

In the training loop below we will use some PyTorch functionality we have not covered yet. To demonstrate those new methods we create a simple Tensor object with shape (3, 2).

In [11]:
a = torch.tensor([[1,2], [3, 4], [6, 5]], dtype=torch.float32)
print(a)
print(a.shape)

tensor([[1., 2.],
        [3., 4.],
        [6., 5.]])
torch.Size([3, 2])


Often we want to reshape a tensor. In PyTorch this is done with the method `view`. `a.view(1, 6)` for example takes the tensor of shape (3, 2) and returns a tensor with the same data, but with 1 row and 6 columns.

In [12]:
a.view(1, 6)

tensor([[1., 2., 3., 4., 6., 5.]])

If we use `-1` for one of the dimensions, PyTorch automatically infers the size of that dimension from the number of elements and the remaining dimensions. `a.view(-1, 3)` returns a (2, 3) tensor.

In [13]:
a.view(-1, 3)

tensor([[1., 2., 3.],
        [4., 6., 5.]])
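The same mechanism is what flattens a batch of images. The sketch below applies `view` to a random tensor that merely has the shape of an MNIST batch; the values are not real images.

```python
import torch

# Random stand-in for a batch of 32 grayscale images of 28x28 pixels.
images = torch.rand(32, 1, 28, 28)

# -1 lets PyTorch infer the batch dimension: 32 * 1 * 28 * 28 / 784 = 32.
flat = images.view(-1, 28 * 28)
print(flat.shape)  # torch.Size([32, 784])

# view does not copy the data: both tensors point to the same memory.
print(flat.data_ptr() == images.data_ptr())  # True

# The first row of the flat tensor holds all pixels of the first image.
print(torch.equal(flat[0], images[0].flatten()))  # True
```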

We will often apply so called reduction operations, like `sum()`, `mean()` or `max()`. Those methods reduce the dimensionality of the Tensor.

In [14]:
print(a.sum())
print(a.mean())
print(a.max())

tensor(21.)
tensor(3.5000)
tensor(6.)


Those operations are rarely applied to the whole tensor. Usually you want to apply those functions along a certain dimension. Our tensor has the shape (3, 2). Dimension 0 is the batch dimension and dimension 1 is the feature dimension. The `dim` argument names the dimension that gets reduced. If you set `dim=0`, you reduce over the batch and get one value per feature. If you set `dim=1`, you reduce over the features and get one value per sample. For the most part you want such a per-sample calculation.

In [15]:
print(a.sum(dim=1))
print(a.mean(dim=1))
# max returns a tuple
# the first position contains actual max values
# the second position contains the indices of max values
print(a.max(dim=1))

tensor([ 3.,  7., 11.])
tensor([1.5000, 3.5000, 5.5000])
torch.return_types.max(
values=tensor([2., 4., 6.]),
indices=tensor([1, 1, 0]))
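For comparison, the sketch below reduces the same tensor along `dim=0` and also shows the `keepdim` argument, which the training loop relies on for the softmax. The tensor is redefined so that the example runs on its own.

```python
import torch

a = torch.tensor([[1, 2], [3, 4], [6, 5]], dtype=torch.float32)

# dim=0 reduces over the batch: one value per feature (column).
col_sum = a.sum(dim=0)
col_mean = a.mean(dim=0)
print(col_sum)   # tensor([10., 11.])
print(col_mean)  # tensor([3.3333, 3.6667])

# keepdim=True keeps the reduced dimension with size 1,
# so the result has shape (3, 1) instead of (3,).
row_sum = a.sum(dim=1, keepdim=True)
print(row_sum.shape)  # torch.Size([3, 1])

# The (3, 1) shape broadcasts against (3, 2): each row is divided by its own sum.
normalized = a / row_sum
print(normalized.sum(dim=1))  # every row now sums to 1
```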


Finally, we often apply mathematical functions like `exp()` or `log()` to all elements of a tensor at once. Think of an activation function, which has to be applied to every element of a batch simultaneously.

In [16]:
print(a.exp())
print(a.log())

tensor([[  2.7183,   7.3891],
        [ 20.0855,  54.5981],
        [403.4288, 148.4132]])
tensor([[0.0000, 0.6931],
        [1.0986, 1.3863],
        [1.7918, 1.6094]])
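As a small illustration of such an element-wise activation, the sketch below writes the sigmoid with `exp()` and compares it with PyTorch's built-in `torch.sigmoid`. The input values are arbitrary.

```python
import torch

# Arbitrary pre-activation values for one sample with three units.
z = torch.tensor([[-2.0, 0.0, 2.0]])

# Sigmoid written out with element-wise operations.
manual = 1 / (1 + torch.exp(-z))

# Built-in version for comparison.
builtin = torch.sigmoid(z)

print(manual)
print(torch.allclose(manual, builtin))  # True
```

Every element is squashed into the range between 0 and 1, and an input of 0 maps to exactly 0.5.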


The code for the actual training loop below is somewhat lengthy, but it is relatively straightforward. The numbers of the following steps match the numbered comments in the code.

1. We iterate over the number of epochs. In our case we train for 10 epochs.
2. In each epoch we iterate over the whole dataset, batch by batch.
3. Each batch contains 32 images and 32 labels.
4. We reshape the features from shape (32, 1, 28, 28) into (32, 784), because our neural network needs a (batch_size, num_features) shape as input. The original features shape has 32 samples, 1 channel, and a height and width of 28 pixels. A single channel means that the image is grayscale. Colored images have 3 channels (red, green and blue).
5. The labels that we receive contain the correct class of the handwritten digit, a number from 0 to 9. In order to calculate the cross-entropy loss at a later step, we have to transform those classes into one-hot vectors. That means that the labels are transformed from shape (32,) to (32, 10). The tensor contains the number 1 in the slot that corresponds to the correct class of the sample and 0 elsewhere.
6. We run the forward pass: in each layer we multiply the input $\mathbf{X}$ (or the activations $\mathbf{A^{<l-1>}}$ of the previous layer) with the transposed weight matrix $\mathbf{W}$ and add the bias vector $\mathbf{b}$. In the first two layers we apply the sigmoid activation function. In the last layer we use the softmax activation.
7. We calculate the cross-entropy loss.
8. We use the backpropagation algorithm to calculate the gradients.
9. We apply gradient descent.
10. Finally we clear the gradients.
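Steps 5 to 7 can be cross-checked against built-in PyTorch functions. The sketch below uses three made-up labels and random logits instead of real network outputs. It compares the one-hot loop with `torch.nn.functional.one_hot` and the manual loss with `torch.nn.functional.cross_entropy`. The comparison assumes the standard definition of the cross-entropy, which sums over the classes and averages over the samples.

```python
import torch
import torch.nn.functional as F

num_labels = 10
labels = torch.tensor([3, 0, 9])   # made-up labels for a batch of 3 samples

# Step 5: one-hot labels, once with a loop and once with the built-in function.
one_hot_loop = torch.zeros(len(labels), num_labels)
for sample_idx, label in enumerate(labels):
    one_hot_loop[sample_idx][label] = 1
one_hot_builtin = F.one_hot(labels, num_classes=num_labels).to(torch.float32)
print(torch.equal(one_hot_loop, one_hot_builtin))  # True

# Step 6 (last layer only): softmax over random stand-in logits.
torch.manual_seed(0)
logits = torch.randn(len(labels), num_labels)
numerator = torch.exp(logits)
softmax = numerator / numerator.sum(dim=1, keepdim=True)

# Step 7: sum over the classes, then average over the samples.
per_sample_loss = -(one_hot_builtin * torch.log(softmax)).sum(dim=1).mean()
builtin_loss = F.cross_entropy(logits, labels)
print(torch.allclose(per_sample_loss, builtin_loss, atol=1e-6))  # True

# Taking .mean() over all entries divides by num_labels as well,
# so that value is smaller by exactly this constant factor.
all_entries_loss = -(one_hot_builtin * torch.log(softmax)).mean()
print(torch.allclose(all_entries_loss * num_labels, per_sample_loss, atol=1e-6))  # True
```

A constant factor in the loss does not change which weights are optimal. It only scales the gradients and therefore acts like a different learning rate.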

In [17]:
# 1. Iterate over epochs
for epoch in range(NUM_EPOCHS):
    # variables to track progress
    loss_sum = 0
    batch_nums = 0
    # 2. Iterate over the dataset
    # 3. And receive features and labels tensors
    for batch_idx, (features, labels) in enumerate(dataloader):
        # 4. Reshape features and move the tensor to the device (GPU if available)
        features = features.view(-1, NUM_FEATURES).to(DEVICE)
        
        # 5. Create one-hot labels on the device (GPU if available)
        one_hot_labels = torch.zeros(BATCH_SIZE, NUM_LABELS).to(DEVICE)
        for sample_idx, label in enumerate(labels):
            one_hot_labels[sample_idx][label] = 1
        
        # 6. Forward pass
        # first linear transformation
        hidden_1 = features @ W_1.T + b_1
        # sigmoid activation
        hidden_1 = 1 / (1 + torch.exp(-hidden_1))
        # second linear transformation
        hidden_2 = hidden_1 @ W_2.T + b_2
        # sigmoid activation
        hidden_2 = 1 / (1 + torch.exp(-hidden_2))
        # third linear transformation
        logits = hidden_2 @ W_3.T + b_3
        # softmax activation
        numerator = torch.exp(logits)
        denominator = numerator.sum(dim=1, keepdim=True)
        softmax = numerator / denominator
    
        # 7. Calculate the cross-entropy loss
        # note: mean() averages over all BATCH_SIZE * NUM_LABELS entries
        loss = -(one_hot_labels * torch.log(softmax)).mean()

        # 8. Apply Backprop
        loss.backward()

        # 9. Gradient Descent
        with torch.inference_mode():
            W_1.sub_(ALPHA * W_1.grad)
            b_1.sub_(ALPHA * b_1.grad)
            W_2.sub_(ALPHA * W_2.grad)
            b_2.sub_(ALPHA * b_2.grad)
            W_3.sub_(ALPHA * W_3.grad)
            b_3.sub_(ALPHA * b_3.grad)
            

        # 10. Clear Gradients
        W_1.grad.zero_()
        W_2.grad.zero_()
        W_3.grad.zero_()
        b_1.grad.zero_()
        b_2.grad.zero_()
        b_3.grad.zero_()

        
        # ------TRACK LOSS --------
        batch_nums += 1
        loss_sum += loss.detach().cpu()
    
    print(f'Epoch: {epoch+1} Loss: {loss_sum / batch_nums}')

Epoch: 1 Loss: 0.22500242292881012
Epoch: 2 Loss: 0.20238369703292847
Epoch: 3 Loss: 0.15146227180957794
Epoch: 4 Loss: 0.10374166071414948
Epoch: 5 Loss: 0.07694655656814575
Epoch: 6 Loss: 0.06303787231445312
Epoch: 7 Loss: 0.05480561777949333
Epoch: 8 Loss: 0.049296267330646515
Epoch: 9 Loss: 0.04533310979604721
Epoch: 10 Loss: 0.042320940643548965
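Steps 8 to 10 are easier to follow on a problem that is small enough to check by hand. The sketch below minimizes the made-up function $(w - 3)^2$ with the same pattern of `backward()`, in-place update and gradient reset. It wraps the update in `torch.no_grad()`, a commonly used alternative to the `torch.inference_mode()` context from the loop above.

```python
import torch

# Toy problem (an assumption for illustration): minimize (w - 3)^2, starting at w = 0.
w = torch.zeros(1, requires_grad=True)
alpha = 0.1

for step in range(100):
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()             # fills w.grad with d(loss)/dw = 2 * (w - 3)
    with torch.no_grad():       # the update itself should not be tracked by autograd
        w.sub_(alpha * w.grad)  # gradient descent step
    w.grad.zero_()              # reset, otherwise the next backward() adds to the old gradient

print(w.item())  # very close to 3.0

# Why the reset is needed: gradients accumulate across backward() calls.
v = torch.ones(1, requires_grad=True)
(v * 2).sum().backward()
(v * 2).sum().backward()
print(v.grad)  # tensor([4.]), twice the gradient of a single call
```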


The cross-entropy loss decreases significantly, but we are also interested in the classification accuracy, which we calculate below. Note that we reuse the same `dataloader`, so the accuracy is measured on the training data. To calculate the accuracy we first determine the class that the model predicts. This is relatively straightforward: all we have to do is look up the category with the highest probability from the softmax activation function. When you look through the code below, you will notice that we do not actually work with the probabilities from softmax, but with the logits (the values that are used as input to the softmax activation are called logits). We do that because the softmax is not necessary for the actual predictions: the class with the highest logit is also the class with the highest probability. We apply the `argmax()` method to the logits and end up with the predicted class (argmax returns the index of the max value).

When we use the expression `labels == predictions`, PyTorch returns a tensor that contains `True` where the prediction equals the actual label and `False` otherwise. The `sum()` method treats `True` as 1 and `False` as 0, thereby counting the number of correct predictions. The `cpu()` method moves the tensor to the CPU and the `item()` method turns a Tensor object into a plain Python number (this only works if the Tensor holds exactly one value).

In [24]:
# calculate the accuracy (on the training data)
num_samples = 0
num_correct = 0
for batch_idx, (features, labels) in enumerate(dataloader):
    with torch.inference_mode():
        features = features.view(-1, NUM_FEATURES).to(DEVICE)
        labels = labels.to(DEVICE)
    
        # ------ FORWARD PASS --------
        # first linear transformation
        hidden_1 = features @ W_1.T + b_1
        # sigmoid activation
        hidden_1 = 1 / (1 + torch.exp(-hidden_1))
        # second linear transformation
        hidden_2 = hidden_1 @ W_2.T + b_2
        # sigmoid activation
        hidden_2 = 1 / (1 + torch.exp(-hidden_2))
        # third linear transformation
        logits = hidden_2 @ W_3.T + b_3
        
        predictions = logits.argmax(dim=1)
        num_samples += len(features)
        num_correct += (labels == predictions).sum().detach().cpu().item()
        
accuracy = num_correct / num_samples
print(f'The accuracy is {accuracy*100:.3f}%')

The accuracy is 88.848%
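The claim that logits and probabilities lead to the same prediction can be verified on a few hand-picked numbers. The logits and labels in the sketch below are made up and do not come from the trained network.

```python
import torch

# Made-up logits for 3 samples and 3 classes, plus the "true" labels.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0],
                       [1.0, 4.0, 0.0]])
labels = torch.tensor([0, 2, 0])

# Softmax keeps the order of the values within each row,
# so argmax gives the same class for logits and probabilities.
probabilities = torch.softmax(logits, dim=1)
pred_from_logits = logits.argmax(dim=1)
pred_from_probs = probabilities.argmax(dim=1)
print(pred_from_logits)                                 # tensor([0, 2, 1])
print(torch.equal(pred_from_logits, pred_from_probs))   # True

# Comparing with the labels gives booleans, and sum() counts the True values.
correct = labels == pred_from_logits
print(correct)                # tensor([ True,  True, False])
num_correct = correct.sum().item()
accuracy = num_correct / len(labels)
print(num_correct, accuracy)  # 2 correct out of 3
```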


The accuracy is close to 90%. While this is not too bad, we will learn techniques over the next chapters that will bring us much closer to 100%. In the next section we will essentially redo the same calculations using built-in PyTorch classes. This will save us a lot of time in the chapters to come.