**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**

In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install skorch

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from skorch import NeuralNetClassifier
import torch.nn as nn
import torch

# Classifying MNIST Digits

This example will illustrate how to construct a simple convolutional network for image classification on the MNIST handwritten digits dataset.

## Loading the Dataset

We will start by loading the MNIST dataset. This step is very simple: all we need to do is call the ``fetch_openml`` function from ``scikit-learn``, which will download the data for us. The dataset comes with a standard train/test split: to recover it, we will use the first 60000 samples for training and the remaining 10000 samples for testing.

At the same time we will cast the data into the appropriate data types and ensure that the arrays have the correct shapes. Our data is composed of $28 \times 28$ pixel images with a single colour channel. In ``PyTorch`` colour channels come first in each image (they form the zeroth dimension of a single image and the first dimension of a batch), which is why we are reshaping each image into (1, 28, 28).

When converting the types, we also divide the data by 255 so as to transform from integers between 0 and 255 to floats between 0 and 1.
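
As a minimal sketch of the reshaping and scaling described above, the snippet below applies the same operations to a small synthetic array. The array ``flat`` and its random contents are made up for illustration; only the operations mirror what we do with the real data.

```python
import numpy as np

# Hypothetical stand-in for the flat MNIST data: 3 images of 784 integer pixels.
rng = np.random.default_rng(0)
flat = rng.integers(0, 256, size=(3, 784))

# -1 lets NumPy infer the number of samples; the rest is (channels, height, width).
images = flat.reshape((-1, 1, 28, 28)).astype('float32') / 255

print(images.shape)   # (3, 1, 28, 28)
print(images.dtype)   # float32
```

After the division every pixel value lies between 0 and 1.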

In [None]:
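# as_frame=False returns NumPy arrays (newer scikit-learn versions
# return a pandas DataFrame by default, which has no reshape method)
mnist = fetch_openml('mnist_784', as_frame=False)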

In [None]:
X = mnist.data.reshape((-1, 1, 28, 28)).astype('float32') / 255
Y = mnist.target.astype('int64')
X_train = X[:60000]
Y_train = Y[:60000]
# The remaining 10000 samples form the test set
X_test = X[60000:]
Y_test = Y[60000:]

We can now display a few randomly selected examples from the train set:

In [None]:
num_rows = 4; num_cols = 4
fig, axes = plt.subplots(num_rows, num_cols)

for row in axes:
    for ax in row:
        # randint's upper bound is exclusive, so len(X_train) covers every index
        idx = np.random.randint(0, len(X_train))
        ax.imshow(X_train[idx, 0], cmap='Greys')
        ax.set_xticks([])
        ax.set_yticks([])

## Constructing the Convolutional Network

When constructing a convolutional net, the usual procedure is to study the literature that deals with similar tasks and use that knowledge to design a similar neural architecture for the problem at hand (and possibly to tune it).

Given that the MNIST dataset is not especially difficult, we will use it to illustrate an even simpler approach:
* We will keep chaining blocks of convolutional layers, ReLU functions and pooling layers.
* We will keep going until the dimensions of the inputs have decreased sufficiently.
* Once that happens we will append one or several standard linear layers and ReLUs.
* Finally, we will append the output layer with the softmax activation function.

To make it easier to keep track of the output dimensions after each layer has been applied, we will not wrap our layers into a class just yet: we will instead experiment with them freely first. To this end we will take a few samples from the dataset and use them as a dummy input. We need to convert these into a tensor before we feed them into ``PyTorch``: we will use the ``torch.as_tensor`` function to do this. So far we have been able to avoid this step because the ``skorch`` interface took care of it for us.
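
As a small illustration of the conversion, the sketch below applies ``torch.as_tensor`` to a made-up NumPy array (``dummy`` is a hypothetical name, not part of the dataset). It also shows that, for a NumPy input with a matching data type, the resulting tensor shares memory with the array rather than copying it.

```python
import numpy as np
import torch

# Hypothetical dummy batch: 5 blank single-channel 28x28 images.
dummy = np.zeros((5, 1, 28, 28), dtype='float32')

t = torch.as_tensor(dummy)
print(t.shape)   # torch.Size([5, 1, 28, 28])
print(t.dtype)   # torch.float32

# The tensor is a view of the NumPy buffer, so changes to the array show through.
dummy[0, 0, 0, 0] = 1.0
print(t[0, 0, 0, 0].item())   # 1.0
```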

In [None]:
y = torch.as_tensor(X_train[:5])

Let us now create our first block and apply it to tensor ``y``. We will first create our 2D convolutional layer using class ``nn.Conv2d``. We need to specify a few parameters: namely the number of input and output channels and the kernel size. The number of input channels is 1, of course, because as we mentioned, we have a single colour channel. The number of output channels is a hyperparameter – we are going to use 32.

Convolutional kernels can be of different sizes, but the conventional wisdom based on empirical evidence is that $3 \times 3$ kernels tend to work well. We want to avoid making the kernel unnecessarily large, because the number of weights and the number of multiplications per output value both grow with the area of the kernel.

After the convolutional layer we apply the ReLU activation function and max-pooling, for which we again need to specify a kernel size. With pooling, the larger the kernel size, the more rapidly our data will be downsampled. We are therefore using a small $2 \times 2$ kernel. A number of modern architectures have now dispensed with the use of pooling layers altogether and use strides or dilations in the convolutional layer to downsample the data.
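
To anticipate how each layer changes the spatial size, the standard output-size formula for convolution and pooling can be evaluated by hand. The helper ``output_size`` below is a hypothetical name introduced for illustration; it assumes square inputs and kernels, no dilation, and rounding down, which is the ``PyTorch`` default.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Side length after a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28
after_conv = output_size(side, kernel=3)                   # 3x3 convolution, stride 1
after_pool = output_size(after_conv, kernel=2, stride=2)   # 2x2 max-pooling
print(after_conv, after_pool)   # 26 13

# A 3x3 convolution with stride 2 downsamples to the same size without pooling.
print(output_size(side, kernel=3, stride=2))   # 13
```

The pooling call uses a stride equal to the kernel size because that is what ``torch.max_pool2d`` does when no stride is given.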

In [None]:
conv1 = nn.Conv2d(
    in_channels=1, out_channels=32,
    kernel_size=(3, 3))

y = conv1(y)
y = torch.relu(y)
y = torch.max_pool2d(y, kernel_size=(2, 2))

After we have constructed our first block, let us check what effect this had on the dimensionality of our data.

In [None]:
# np.prod replaces np.product, which was removed in NumPy 2.0
np.prod(y.shape[1:])

Alas, with $32 \times 13 \times 13 = 5408$ values per sample, our data still has too many dimensions and we need to reduce its dimensionality further. Let's apply one more block to it, this time with a somewhat lower number of output channels.

In [None]:
conv2 = nn.Conv2d(32, 16, (3, 3))
y = conv2(y)
y = torch.relu(y)
y = torch.max_pool2d(y, (2, 2))

In [None]:
np.prod(y.shape[1:])

The dimensionality is much more reasonable now: $16 \times 5 \times 5 = 400$ values per sample. We can now flatten the output (transform each sample from a 3-dimensional array of channels, rows and columns into a 1-dimensional vector) and apply some standard linear layers and ReLUs. Again we make sure that the dimension of the data decreases gradually and that the change from one layer to the next is not too drastic. The output layer is going to have 10 output neurons, because we are going to classify into 10 classes: the digits. We will use softmax as its activation function.
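
The flattening step and the softmax output can be checked on dummy tensors. The sketch below assumes the shape that follows from the layer sizes used here (16 channels of $5 \times 5$ values after the second block) and uses random numbers as a stand-in for the output of the last linear layer; the names ``features`` and ``logits`` are introduced only for this illustration.

```python
import torch

# Dummy batch shaped like the output of the second block:
# (batch, channels, height, width).
features = torch.zeros(5, 16, 5, 5)

# start_dim=1 keeps the batch dimension and merges the rest: 16 * 5 * 5 = 400.
flat = torch.flatten(features, 1)
print(flat.shape)   # torch.Size([5, 400])

# Random stand-in for the 10 raw outputs of the last linear layer.
torch.manual_seed(0)
logits = torch.randn(5, 10)
probs = torch.softmax(logits, dim=1)

# Each row is a probability distribution over the 10 digits.
print(torch.allclose(probs.sum(dim=1), torch.ones(5)))   # True
```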

In [None]:
y = torch.flatten(y, 1)

fc1 = nn.Linear(400, 128)
y = fc1(y)
y = torch.relu(y)

fc2 = nn.Linear(128, 10)
y = fc2(y)
y = torch.softmax(y, dim=1)

Now that we have designed our architecture, we need to wrap it in a class again. As usual, layers with parameters need to be constructed in ``__init__`` (in our case this applies to all layers from ``nn.``, but not to the parameter-free functions from ``torch.``) and then reused whenever ``forward`` is called.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
num_outputs = 10

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, (3, 3))
        self.conv2 = nn.Conv2d(32, 16, (3, 3))
        self.fc1 = nn.Linear(400, 128)
        self.fc2 = nn.Linear(128, num_outputs)

    def forward(self, y):
        y = self.conv1(y)
        y = torch.relu(y)
        y = torch.max_pool2d(y, kernel_size=(2, 2))
        
        y = self.conv2(y)
        y = torch.relu(y)
        y = torch.max_pool2d(y, kernel_size=(2, 2))
        
        y = torch.flatten(y, 1)
        
        y = self.fc1(y)
        y = torch.relu(y)

        y = self.fc2(y)
        y = torch.softmax(y, dim=1)
        
        return y
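
Because the four ``nn.`` layers are stored as attributes in ``__init__``, ``PyTorch`` registers their weights as parameters of the module. As a rough sanity check, the sketch below rebuilds the same four layers in an ``nn.ModuleList`` (a stand-in for the class above, used only so that the snippet is self-contained) and counts the trainable parameters.

```python
import torch.nn as nn

layers = nn.ModuleList([
    nn.Conv2d(1, 32, (3, 3)),    # 32 * (1 * 3 * 3) weights + 32 biases = 320
    nn.Conv2d(32, 16, (3, 3)),   # 16 * (32 * 3 * 3) weights + 16 biases = 4624
    nn.Linear(400, 128),         # 400 * 128 weights + 128 biases = 51328
    nn.Linear(128, 10),          # 128 * 10 weights + 10 biases = 1290
])

num_params = sum(p.numel() for p in layers.parameters())
print(num_params)   # 57562
```

Most of the parameters sit in the first linear layer, which is one reason for reducing the dimensionality before flattening.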

## Constructing and Training the Classifier

The construction of a ``NeuralNetClassifier`` and its training will be analogous to what we did in our previous examples.

In [None]:
net = NeuralNetClassifier(
    Net,
    max_epochs=5,
    batch_size=128,
    optimizer=torch.optim.Adam,
    train_split=None,
    device=device
)

In [None]:
net.fit(X_train, Y_train)

## Testing

Finally, we apply our standard testing procedure for classifiers: we display the confusion matrix and the accuracy on the testing set.

In [None]:
y_test = net.predict(X_test)

In [None]:
cm = pd.crosstab(Y_test, y_test,
    rownames=['actual'],
    colnames=['predicted']
)
print(cm)

In [None]:
acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))
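
To make the two metrics concrete, the sketch below computes a confusion matrix and the accuracy by hand with NumPy on a handful of made-up labels. The arrays ``actual`` and ``predicted`` are hypothetical and unrelated to the results of the network.

```python
import numpy as np

actual    = np.array([0, 1, 2, 2, 1, 0])
predicted = np.array([0, 1, 2, 1, 1, 0])

num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
# Rows index the actual class, columns the predicted class.
np.add.at(cm, (actual, predicted), 1)
print(cm)
# [[2 0 0]
#  [0 2 0]
#  [0 1 1]]

# Accuracy is the fraction of samples that lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Here five of the six samples lie on the diagonal; the single off-diagonal entry is a 2 that was predicted as a 1.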