In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import torch
import torchvision

# Implementing LeNet

* In 1998, **LeNet** was among the first published convolutional neural networks to gather wide attention for its performance on computer vision tasks. In this lab session, you will use high-level PyTorch functionalities to implement the LeNet convolutional neural network.

* You will use the [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which consists of labeled images of clothing and accessories.

* The following function is used to load the dataset. You don't need to understand it for now.


In [None]:
# You don't need to understand this function for now.
def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [torchvision.transforms.ToTensor()]
    if resize:
        trans.insert(0, torchvision.transforms.Resize(resize))
    trans = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)
    return (torch.utils.data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=2),
            torch.utils.data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=2))

In [None]:
batch_size = 256 # Defines the batch size
train_iter, test_iter = load_data_fashion_mnist(batch_size) # Loads the fashion MNIST dataset. `train_iter` and `test_iter` are `DataLoader` objects.

In [None]:
X, y = next(iter(train_iter)) # Requests the first training batch
print(X.size()) # 256 images per batch. Each image is represented by a 1 x 28 x 28 tensor (number of channels x height x width). The images are grayscale, so there is a single channel.
print(y.size()) # 256 targets. Each target is an integer between 0 and 9. The classification problem has 10 classes.

* The following code displays some images from the first training batch.


In [None]:
from google.colab.patches import cv2_imshow

class_labels = ['top', 'trouser', 'pullover', 'dress', 'coat', 'sandal', 'shirt', 'sneaker', 'bag', 'boot'] # Pre-defined class labels

for i in range(3):
    print(f'\nImage {i} ({class_labels[int(y[i])]}):\n') # Prints the index `i` and the label associated with the `i`-th image.
    cv2_imshow(X[i].numpy().transpose(1, 2, 0) * 255) # Converts and displays the `i`-th image in the batch.

## Architecture

* Your task is to implement the LeNet architecture described below in a class that inherits from `torch.nn.Module`.

* The architecture receives a $1 \times 28 \times 28$ image and outputs a vector with $10$ elements.  The following layers are employed:
   * A convolutional layer with $6$ kernels, each a $1 \times 5 \times 5$ tensor, padding $2$, stride $1$, and a sigmoid activation function. **Output**: a $6 \times 28 \times 28$ tensor.
   * An average pooling layer with windows of size $2 \times 2$ and stride $2$. **Output**: a $6 \times 14 \times 14$ tensor.
   * A convolutional layer with $16$ kernels, each a $6 \times 5 \times 5$ tensor, padding $0$, stride $1$, and a sigmoid activation function. **Output**: a $16 \times 10 \times 10$ tensor.
   * An average pooling layer with windows of size $2 \times 2$ and stride $2$. **Output**: a $16 \times 5 \times 5$ tensor.
   * A fully connected layer with $120$ units and a sigmoid activation function. The input is flattened into a vector with $16 \cdot 5 \cdot 5 = 400$ elements. **Output**: a vector with $120$ elements.
   * A fully connected layer with $84$ units and a sigmoid activation function. **Output**: a vector with $84$ elements.
   * A fully connected layer with $10$ units and a softmax activation function. **Output**: a vector with $10$ elements. *Note: the original LeNet used a so-called Gaussian activation layer, which is rarely used today.*

![LeNet-5 architecture.](https://drive.google.com/uc?export=view&id=1qht6z0oT0TGBYQl-aLiSRhBaPINi4rzE)
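
* The output sizes listed above follow from the usual formula for the spatial size of a convolution or pooling output, $\lfloor (n + 2p - k) / s \rfloor + 1$, where $n$ is the input size, $k$ the window size, $p$ the padding, and $s$ the stride. The sketch below checks the arithmetic in plain Python. The helper name `conv_output_size` is an assumption introduced for illustration and is not part of PyTorch.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Spatial output size along one dimension of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                                     # input height and width
size = conv_output_size(size, kernel=5, padding=2, stride=1)  # first convolution
print(size)  # 28
size = conv_output_size(size, kernel=2, stride=2)             # first average pooling
print(size)  # 14
size = conv_output_size(size, kernel=5, padding=0, stride=1)  # second convolution
print(size)  # 10
size = conv_output_size(size, kernel=2, stride=2)             # second average pooling
print(size)  # 5

flattened = 16 * size * size                                  # 16 channels of 5 x 5
print(flattened)  # 400
```

* The final value, $400$, is the number of input features of the first fully connected layer.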

* The class `torch.nn.Module` requires implementing the method `forward`, which should define the forward pass for a batch of images.
    * Because the batch size was set to $256$, a batch of images is represented by a $256 \times 1 \times 28 \times 28$ tensor.
    * The method `forward` should compute the $256 \times 10$ logits matrix $\mathbf{O}$, not the prediction matrix $\mathbf{\hat{Y}} = \text{softmax}(\mathbf{O})$. In other words, the last fully connected layer does not need an (explicit) softmax activation function.

* Use the code presented below as a starting point for implementing the convolutional neural network and fill the lines labeled with `TODO`.
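
* Before filling in the LeNet template, it may help to see the `torch.nn.Module` pattern on a much smaller network. The sketch below is not LeNet: `TinyClassifier` and its layer sizes are hypothetical and chosen only to show how layers created in `__init__` are applied in `forward` to produce a logits matrix.

```python
import torch

class TinyClassifier(torch.nn.Module):
    """A hypothetical two-layer network, used only to illustrate the pattern."""
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        self.flatten = torch.nn.Flatten()                   # (B, 1, 28, 28) -> (B, 784)
        self.hidden = torch.nn.Linear(num_inputs, num_hidden)
        self.sigmoid = torch.nn.Sigmoid()
        self.out = torch.nn.Linear(num_hidden, num_outputs)

    def forward(self, x):
        x = self.flatten(x)
        x = self.sigmoid(self.hidden(x))                    # activation after the hidden layer
        return self.out(x)                                  # logits: no softmax here

tiny = TinyClassifier(num_inputs=28 * 28, num_hidden=32, num_outputs=10)
tiny_batch = torch.rand(256, 1, 28, 28)                     # stand-in for a batch of images
tiny_logits = tiny(tiny_batch)                              # calling the module runs `forward`
print(tiny_logits.shape)  # torch.Size([256, 10])
```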




In [None]:
class LeNet(torch.nn.Module):
    def __init__(self, num_outputs):
        super(LeNet, self).__init__()
        self.num_outputs = num_outputs

        self.Sigmoid = None # TODO: Create a `torch.nn.Sigmoid` object to represent the sigmoid activation function

        self.Convl1 = None # TODO: Create a `torch.nn.Conv2d` object to represent the first convolutional layer
        self.Avg1 = None # TODO: Create a `torch.nn.AvgPool2d` object to represent the first pooling layer
        self.Convl2 = None # TODO: Create a `torch.nn.Conv2d` object to represent the second convolutional layer
        self.Avg2 = None # TODO: Create a `torch.nn.AvgPool2d` object to represent the second pooling layer

        self.Flatten = None # TODO: Create a `torch.nn.Flatten` object to represent the operation that transforms a batch of images into a batch of vectors
        self.Linear1 = None # TODO: Create a `torch.nn.Linear` object to represent the first fully connected layer
        self.Linear2 = None # TODO: Create a `torch.nn.Linear` object to represent the second fully connected layer
        self.Linear3 = None # TODO: Create a `torch.nn.Linear` object to represent the third fully connected layer

    def forward(self, x):
        # TODO: Apply each `torch.nn.Module` created above to the batch of images `x`.
        # Hint: Don't forget to apply the sigmoid function after each convolutional/fully connected layer (except the last)
        return None

In [None]:
# Applies Xavier initialization if the `torch.nn.Module` is `torch.nn.Linear` or `torch.nn.Conv2d`
def init_weights(m):
    if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
        torch.nn.init.xavier_uniform_(m.weight)

num_outputs = 10
model = LeNet(num_outputs)
model.apply(init_weights) # Applies `init_weights` to every `torch.nn.Module` inside `model`

## Loss function

* The *convolutional neural network* defined above computes the logits matrix $\mathbf{O}$, not the prediction matrix $\mathbf{\hat{Y}} = \text{softmax}(\mathbf{O})$.

* This is because PyTorch provides a class called `CrossEntropyLoss` that implements the desired cross entropy loss but requires a logits matrix $\mathbf{O}$ instead of the prediction matrix $\mathbf{\hat{Y}}$.

* The class `CrossEntropyLoss` implements the cross entropy loss in a way that avoids numerical instabilities that would result from a naive implementation.
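
* The relation between logits and the loss can be checked on a small example. The sketch below, which uses made-up logits and targets, shows that `CrossEntropyLoss` applied to logits matches the average negative log-probability of the correct class computed from `log_softmax`, and that a naive softmax overflows for large logits while the built-in one does not.

```python
import torch

torch.manual_seed(0)
example_logits = torch.randn(4, 10)                # made-up logits: 4 examples, 10 classes
example_targets = torch.tensor([3, 0, 9, 1])       # made-up class indices

builtin = torch.nn.CrossEntropyLoss()(example_logits, example_targets)

# Manual version: mean negative log-probability of the correct class.
log_probs = torch.log_softmax(example_logits, dim=1)
manual = -log_probs[torch.arange(4), example_targets].mean()
print(torch.allclose(builtin, manual))  # True

# A naive softmax overflows for large logits, because exp(1000) is inf in float32.
large_logits = torch.tensor([[1000.0, 0.0]])
naive = torch.exp(large_logits) / torch.exp(large_logits).sum(dim=1, keepdim=True)
stable = torch.softmax(large_logits, dim=1)
print(naive)   # the first entry is nan
print(stable)  # finite probabilities
```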

In [None]:
loss = torch.nn.CrossEntropyLoss()

## Optimization Algorithm

* We will employ minibatch stochastic gradient descent with a learning rate of $0.9$ as the optimization algorithm.

* Because we implemented a subclass of `torch.nn.Module`, the model parameters can be accessed through the method `parameters`.
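
* To make the role of `parameters` and of the optimizer concrete, the sketch below uses a small hypothetical `torch.nn.Linear` layer instead of LeNet. It lists the layer's parameters and checks that one step of plain SGD (no momentum or weight decay) subtracts the learning rate times the gradient from each parameter.

```python
import torch

layer = torch.nn.Linear(3, 2)                      # hypothetical stand-in for a model
for name, p in layer.named_parameters():
    print(name, tuple(p.shape))                    # weight (2, 3), then bias (2,)

sgd = torch.optim.SGD(layer.parameters(), lr=0.9)
before = layer.weight.detach().clone()             # copy of the weights before the update

out = layer(torch.ones(1, 3)).sum()                # any scalar that depends on the parameters
sgd.zero_grad()
out.backward()                                     # fills `layer.weight.grad`
grad = layer.weight.grad.clone()
sgd.step()                                         # w <- w - lr * grad

print(torch.allclose(layer.weight.detach(), before - 0.9 * grad))  # True
```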

In [None]:
lr = 0.9
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

## Evaluation

* Recall that the largest element of a logits vector determines which class will be predicted.

* We can use this to compute the number of correct predictions per batch.
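
* The reason logits are enough for prediction is that softmax preserves the ordering of the elements in each row, so the position of the maximum does not change. A small check on made-up logits:

```python
import torch

demo_logits = torch.tensor([[2.0, -1.0, 0.5],
                            [0.1, 0.2, 3.0]])      # made-up logits: 2 examples, 3 classes
demo_probs = torch.softmax(demo_logits, dim=1)     # each row sums to 1

print(demo_logits.argmax(dim=1))  # tensor([0, 2])
print(demo_probs.argmax(dim=1))   # tensor([0, 2])
```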

In [None]:
def correct(logits, y):
    y_hat = logits.argmax(dim=1) # Finds the column with the highest value for each row of `logits`.
    return (y_hat == y).float().sum() # Computes the number of times that `y_hat` and `y` match.

# Example: 1 correct classification. The first row predicts class 2 (correct); the second row predicts class 0, but the target is 1.
y = torch.tensor([2, 1])
logits = torch.tensor([[0.1, 0.3, 0.6], [0.5, 0.2, 0.3]])
print(correct(logits, y))

* We can use the previous function to compute the accuracy of our model on a given dataset by accumulating the number of correct predictions across batches and then dividing that number by the number of examples in the dataset.

In [None]:
def evaluate_metric(model, data_iter, metric):
    """Compute the average `metric` of the model on a dataset."""
    c = torch.tensor(0.)
    n = torch.tensor(0.)
    for X, y in data_iter:
        logits = model(X)
        c += metric(logits, y)
        n += len(y)

    return c / n

In [None]:
print(f'Training accuracy: {evaluate_metric(model, train_iter, correct)}. Testing accuracy: {evaluate_metric(model, test_iter, correct)}.')

## Training

* The following code implements the training loop for the convolutional neural network.

* The training/testing dataset accuracy is displayed after each epoch and stored for plotting.

* **Important:** it is a methodological mistake to compute performance metrics on the *testing* dataset for the purposes of hyperparameter tuning. A *validation* dataset should be used for that purpose, even if it requires splitting the original training dataset into a training dataset and a validation dataset. The *test* dataset should only be used to evaluate the performance of the final set of hyperparameters, in order to assess generalization.
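
* One possible way to obtain a validation dataset is `torch.utils.data.random_split`. The sketch below is an assumption about how this could be done and is not part of the lab code: it splits a random stand-in dataset (so that it runs without downloading anything) and builds one `DataLoader` per subset. The names `full_train`, `train_subset`, `val_subset` and the split sizes are hypothetical.

```python
import torch

# Stand-in dataset with the same tensor shapes as Fashion-MNIST (random data, 1000 examples).
full_train = torch.utils.data.TensorDataset(
    torch.rand(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))

# Fixed seed so that the split is reproducible.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = torch.utils.data.random_split(
    full_train, [900, 100], generator=generator)

train_loader = torch.utils.data.DataLoader(train_subset, batch_size=256, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_subset, batch_size=256, shuffle=False)

print(len(train_subset), len(val_subset))  # 900 100
```

* In this notebook, the same call could be applied to `mnist_train` inside `load_data_fashion_mnist`, with split sizes that add up to its $60000$ training examples. Hyperparameters would then be compared on the validation loader, and the test loader would be used only once at the end.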

In [None]:
losses = [] # Stores the loss for each training batch
train_accs = [] # Stores the training accuracy after each epoch
test_accs = [] # Stores the testing accuracy after each epoch

num_epochs = 20
for epoch in range(num_epochs):
    print(f'\nEpoch {epoch + 1}/{num_epochs}.')
    for X, y in train_iter:
        logits = model(X) # Computes the logits for the batch of images `X`

        l = loss(logits, y) # Computes the loss given the `logits` and the class vector `y`
        optimizer.zero_grad() # Zeroes the gradients stored in the model parameters
        l.backward() # Computes the gradient of the loss `l` with respect to the model parameters

        optimizer.step() # Updates the model parameters based on the gradients stored inside them

        losses.append(float(l)) # Stores the loss for this batch

    with torch.no_grad(): # Computing performance metrics does not require gradients
        train_accs.append(evaluate_metric(model, train_iter, correct))
        test_accs.append(evaluate_metric(model, test_iter, correct))
        print(f'Training accuracy: {train_accs[-1]}. Testing accuracy: {test_accs[-1]}.') # Computes and displays training/testing dataset accuracy.

plt.plot(losses) # Plots the loss for each training batch
plt.xlabel('Training batch')
plt.ylabel('Cross entropy loss')
plt.show()

plt.plot(train_accs, label='Training accuracy')
plt.plot(test_accs, label='Testing accuracy')
plt.legend(loc='best')
plt.xlabel('Epoch')
plt.show()