---
# Introduction

This is material for the tutorial on Topic 1 in the **Machine Learning in Mathematics & Theoretical Physics** Summer School in Oxford in 2023.

Below we will learn to build and train neural networks with PyTorch.

On the way we will also go into a little more detail on some of the basic aspects of machine learning that there was not time to cover in the lecture.

---
## Jupyter notebooks and online platforms

This file is a Python notebook (which has the file extension .ipynb), and in this tutorial we'll be working entirely within this notebook.

To open and work with a Python notebook, you can either (1) install Jupyter Notebook on your local machine, or (2) use an online interface.

### Online interfaces

Two of the most popular online interfaces for working with Python notebooks are [Google Colab](https://colab.research.google.com/) and [Kaggle notebooks](https://www.kaggle.com/docs/notebooks). You can also use Jupyter's own [implementation](https://jupyter.org/try).

(Also, if there is an online notebook that you don't want to interact with but simply view, you can do that with nbviewer like [this](https://nbviewer.org/github/callum-ryan-brodie/oxford-ml-physmath-school/blob/main/oxford_ml_physmath_school_notebook_1.ipynb).)

In particular, this notebook is written to work in Google Colab, and you may run into some errors if you're working with another platform. If you haven't already, you can open it in Google Colab with [this link](https://colab.research.google.com/github/callum-ryan-brodie/oxford-ml-physmath-school/blob/main/oxford_ml_physmath_school_notebook_1.ipynb).

### Cells

A Python notebook is made up of a series of cells, of which there are two types: text cells (like this one), and code cells.

You can add cells by hovering over a gap between two cells and clicking **+ Code** or **+ Text**. (This and the below are specific to Google Colab.)

You can edit a text cell by double-clicking on it. (For example, if you do it with this cell, you can edit the text, and you'll see how headings and other formatting are represented in **Markdown**.)

You can edit a code cell by simply clicking on it.

To run a code cell, click it and then either press Shift-Enter or click the arrow button at the left of the cell. When you're done editing a text cell (like this one), press Shift-Enter to render it. For more information take a look at the Colab [welcome page](https://colab.research.google.com/).

### Online computing platforms

When you open a Python notebook using local software like Jupyter Notebook, any code cells you run are executed on your local machine (and so have access to the packages that you have installed).

However, in Google Colab and Kaggle Notebooks, the code cells are run on a remote machine. This has a couple of advantages: (1) Many widely-used packages are already installed and so can simply be imported, (2) You can install any other packages you need and this is likely to work a little more smoothly than if you do it locally, and (3) Of particular importance for machine learning, this machine has a GPU and it is straightforward to utilise it. (We'll discuss the use of a GPU in more detail below.)

---
# Building neural networks with PyTorch

In this tutorial, we'll be using PyTorch. Other popular frameworks for machine learning are TensorFlow and, more recently, JAX.

First let's import what we need from PyTorch, by running the following code cell. (We'll also import NumPy and Matplotlib.)


In [None]:
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import matplotlib.pyplot as plt
import numpy as np

One of the most straightforward ways to build a neural network in PyTorch is by using the ```nn.Sequential``` model. The easiest way to understand how this works is to look at an example:

In [None]:
ex_seq_model = nn.Sequential(
     nn.Linear(3, 2)
    ,nn.ReLU()
    ,nn.Linear(2,3)
    ,nn.Sigmoid()
)

ex_seq_model

As the above example makes clear, the way we use ```nn.Sequential``` is to pass it a series of operations. In the above example, these operations are as follows:
1.   A 'linear map' ```nn.Linear(3,2)``` (from $\mathbb{R}^3$ to $\mathbb{R}^2$)
2.   A ReLU activation function ```nn.ReLU()```
3.   A 'linear map' ```nn.Linear(2,3)``` (from $\mathbb{R}^2$ to $\mathbb{R}^3$)
4.   A sigmoid activation function ```nn.Sigmoid()```

Together, this sequence of operations can be seen to constitute a neural network, and this network is what ```nn.Sequential``` creates for us.

In our above network we have used three types of operation. For a list of other operations we can include, see the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html).

**Q: In the above example, how many layers are there in the network? How many hidden layers are there? How many nodes are in each layer? How does the architecture of this network compare to the one shown in the first few lecture slides?**

**Q: The following call of ```nn.Sequential``` has two problems: one technical, and one conceptual. What are the problems? Fix the code so that it defines a sensible neural network.**

(PyTorch won't throw an error if you run the below code as is. If you're stuck, you can try passing a tensor to this network; see below under 'Initialisation' for how to do this.)

In [None]:
### FIX THIS CODE CELL ###

ex_seq_model_2 = nn.Sequential(
     nn.Linear(4, 3)
    ,nn.Linear(2, 1)
    ,nn.Sigmoid()
)

ex_seq_model_2

---

### Aside: Bias

Note that, by default, the 'linear maps' specified by ```nn.Linear(n,m)``` include a bias term (we see that ```bias=True``` in the above model summary), which means that they act as
$$
\mathbb{R}^n \to \mathbb{R}^m :~ \vec{v} \mapsto W \vec{v} + \vec{b} \,.
$$
(They are hence not strictly linear maps, but rather affine transformations. However they are frequently called 'linear' operations in this context.)
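
As a small check of the formula above, the following sketch compares the output of a standalone ```nn.Linear``` layer with $W \vec{v} + \vec{b}$ computed by hand from its ```weight``` and ```bias``` attributes. The names ```toy_layer``` and ```v``` are just illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fix the random initialisation so the run is reproducible

toy_layer = nn.Linear(3, 2)          # W has shape (2, 3) and b has shape (2,)
v = torch.tensor([1.0, 2.0, 3.0])

with torch.no_grad():
    by_layer = toy_layer(v)
    by_hand = toy_layer.weight @ v + toy_layer.bias   # W v + b

print(toy_layer.weight.shape, toy_layer.bias.shape)
print(torch.allclose(by_layer, by_hand))
```

Passing ```bias=False``` to ```nn.Linear``` drops the $\vec{b}$ term, in which case the layer is a linear map in the strict sense.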


Note that there is another, distinct, way in which the word 'bias' is used within machine learning. Namely, we say that a machine learning model has a large 'bias' if it is too simple, i.e. has too few parameters, to fit the data. Conversely, if a model has too many parameters and we train it so much that it overfits the data, we say it has a large 'variance'. We'll discuss this further below.

---
## Initialisation

In creating our ```nn.Sequential``` model, we have defined a neural network architecture, but nowhere have we specified the weights $W$ and biases $\vec{b}$ of the linear layers. In fact, these have been initialised randomly.

Let's take a look at the random output resulting from an input to our network. In order for an object to be accepted as input by PyTorch, it needs to have the type ```torch.Tensor```. We can turn a list or array into a tensor by using ```torch.tensor(...)```.

**Q: Try completing the code cell below to pass a tensor to the above network**.

You might also want to check the type of the input and output with ```type(...)```.

(Note: the layers in our network expect ```float``` type entries. ```torch.tensor(...)``` will happily build a tensor from a list or array with all ```int``` type entries, but the result is an integer tensor, and passing it to the network will give an error. You can use ```torch.tensor(..., dtype=torch.float32)``` to convert the entries to ```float```.)
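
To make the point about types concrete, here is a minimal sketch using a standalone layer (```toy_layer``` is an illustrative name, not part of the network above). It shows the type PyTorch infers from a list of integers, and that a linear layer rejects such a tensor.

```python
import torch
from torch import nn

int_tensor = torch.tensor([1, 2, 3])                         # all-integer entries
float_tensor = torch.tensor([1, 2, 3], dtype=torch.float32)  # explicitly converted

print(int_tensor.dtype, float_tensor.dtype)

toy_layer = nn.Linear(3, 2)
try:
    toy_layer(int_tensor)
    int_input_failed = False
except RuntimeError as err:
    # The layer's weights are floats, so an integer input is rejected
    int_input_failed = True
    print("Error:", err)
```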

In [None]:
### COMPLETE THIS CODE CELL ###

ex_tensor = ...

# We can run the tensor through the model like this:
ex_seq_model(ex_tensor)

We can also pick out individual operations in the sequential model with ```ex_seq_model[i]```.

**Q: Complete the following code cell to pass an input tensor through the first two operations of ```ex_seq_model```.**

In [None]:
### COMPLETE THIS CODE CELL ###

---
# Training a model

We saw above that our neural network has been initialised with random values for the weights and biases.

What we might now want to do is 'train' the network to perform some task. In particular, we'll consider the case of supervised learning, and specifically below we'll use supervised learning to train a network to perform a classification task.

---
## Interlude: supervised vs. unsupervised learning, and classification vs. regression tasks

In supervised learning, we have a set of data whose outputs ('labels') are known, and we want to train a model to give the correct outputs for unseen inputs. In contrast, in 'unsupervised learning' one trains a model to recognise patterns in an unlabelled dataset.

**Q: Which of the following are natural tasks for supervised learning, and which are tasks for unsupervised learning?**

- Image classification (e.g. classifying images of distracted vs. attentive drivers)
- Clustering (e.g. determining if a customer-base naturally decomposes into distinct groups)
- Language translation (e.g. translating between English and French)
- Dimensionality reduction (e.g. compressing an image by replacing the colour of each pixel by the closest of a small set of colours)
- Anomaly / outlier detection (e.g. flagging manufactured items with unusual results during quality control tests)

Within supervised learning, we can distinguish between 'classification' tasks, in which the outputs of the model (and the training labels) take values from a discrete set, and 'regression' tasks, in which the outputs (and the training labels) take continuous values.

**Q: Which of the following are natural classification tasks, and which are regression tasks?**

- Time series forecasting (e.g. predicting the future behaviour of the stock market based on its past behaviour)
- Sentiment analysis (e.g. telling whether a product review is essentially positive or negative)
- Housing price prediction (e.g. based on the number of rooms, the postcode, etc.)

A neural network takes as inputs, and gives as outputs, arrays of numbers. But then how do we represent categorical outputs? A simple way is to use what's called 'one-hot' encoding, in which we index the categories and represent the first as $[1,0,0,\ldots]$, the second as $[0,1,0,\ldots]$, and so on. We'll use this below when we train a neural network to recognise handwritten digits.
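
As an illustration, here is a small sketch of one-hot encoding for a made-up problem with three categories, using ```torch.nn.functional.one_hot```. The labels are invented for the example.

```python
import torch
import torch.nn.functional as F

toy_labels = torch.tensor([0, 2, 1])           # category indices for three examples
one_hot = F.one_hot(toy_labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])

# The category index can be recovered as the position of the 1
recovered = one_hot.argmax(dim=1)
print(recovered)
```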

**Q: Below we will train a neural network to recognise handwritten digits (0-9) from greyscale $28 \times 28$ pixel images. For a given image, what form will the input take, and what form will the expected output take (using one-hot encoding)?**


---
## Training set

In supervised learning, in order to train our neural network, we pass it inputs whose labels are known, measure how far away the outputs are from the labels, and adjust the weights and biases slightly to make the difference smaller. The collection of known input-output pairs that we use is called the 'training set'.

The specific measure of difference we use, called the 'loss function', depends on the task we're trying to learn (among other factors). To decrease the loss, we use gradient descent to update the weights and biases (via backpropagation as explained in the lectures).

As we train the model on more and more examples, its performance on the training set will improve. However, if the model is too flexible, i.e. has too many parameters, then after a lot of training it will learn fine details of this specific dataset which won't typically be present in unseen data. In this case we say the model has 'overfit' the data and won't 'generalise' well. (We also say the model has high 'variance'.)

Conversely, if the model doesn't have enough flexibility to fit the training set well, then we say it has 'underfit' the data. (We also say the model has high 'bias'.)

**Q: Consider a set of 10 data points $(x_i,y_i)$ sampled from a quadratic function, $\overline{y} = Q(x)$ plus a small amount of noise, in some interval $x \in [-a,a]$. Suppose we fit the data with a degree-$k$ polynomial. For which values of $k$ would you expect underfitting, and for which values would you expect overfitting?**

---
## Validation set

The performance of a model on the training set is hence not necessarily a good indicator of performance on new data. A better indicator is how the model performs on a separate set of data which it has not been trained on, which we call the 'validation set'. Hence we keep aside some portion of the given labelled dataset, typically around $10\%$ to $20\%$ of it.
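
One possible way to hold out a validation set in PyTorch is ```torch.utils.data.random_split```. The sketch below applies it to a made-up dataset of 100 random points, keeping $20\%$ aside. The data and the names ```toy_full_set```, ```toy_train_set``` and ```toy_val_set``` are purely illustrative.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A made-up labelled dataset: 100 inputs in R^3 with binary labels
toy_inputs = torch.randn(100, 3)
toy_labels = torch.randint(0, 2, (100,))
toy_full_set = TensorDataset(toy_inputs, toy_labels)

n_val = len(toy_full_set) // 5            # keep 20% aside for validation
n_train = len(toy_full_set) - n_val

# A seeded generator makes the random split reproducible
generator = torch.Generator().manual_seed(0)
toy_train_set, toy_val_set = random_split(
    toy_full_set, [n_train, n_val], generator=generator
)

print(len(toy_train_set), len(toy_val_set))
```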

When we train a model below, it will be the loss on the validation set that we use as a measure of performance.

**Q: As we train, the loss on the training set and the loss on the validation set will change. From looking at a joint plot of the training and validation losses against the training time, could you see the signs of overfitting? What about underfitting?**

**Q: Since the validation loss is a good measure of the model's performance, we can use it for 'model selection', choosing (say) the neural network architecture which, when trained on the same training set for the same amount of time, gives the lowest validation loss. Does this present a secondary overfitting problem? If it does, what could you do about it?**

---
## The loss vs. the metric

A steady decrease in the loss on the validation set is a reasonable measure of the improving performance of the model. However, the absolute value of the loss does not give an intuitive measure of how well the latest model performs. In the case of a classification problem, we might consider instead the accuracy of the model on the validation set, meaning the fraction of correctly classified examples (or we might consider the error rate, which is one minus the accuracy).

This intuitive measure of performance is often called a 'metric', as opposed to the loss function. When we train an image classifier below, after each epoch we will print the accuracy of the model on the validation set.
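
As a minimal sketch of how the accuracy is computed, consider four made-up examples in a three-category problem. The scores and labels below are invented for illustration.

```python
import torch

# Model outputs for four examples (one row each) and three categories
toy_scores = torch.tensor([[2.0, 0.1, 0.3],
                           [0.2, 1.5, 0.1],
                           [0.9, 0.8, 0.1],
                           [0.1, 0.2, 3.0]])
toy_true_labels = torch.tensor([0, 1, 1, 2])

predicted = toy_scores.argmax(dim=1)   # index of the largest score in each row
accuracy = (predicted == toy_true_labels).float().mean().item()
error_rate = 1 - accuracy

print(predicted.tolist(), accuracy, error_rate)
```

Here the third example is misclassified, so the accuracy is $3/4$.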

**Q: Why couldn't we directly use the accuracy (in a classification problem) as the loss function for training?**

---
# Training neural networks with PyTorch

---
## Deep learning with a CPU vs a GPU (vs a TPU)

Graphics processing units (GPUs) were developed as hardware to run video games. However, they are useful for machine learning because they are very good at running many tasks in parallel, as opposed to a central processing unit (CPU), which runs tasks quickly but sequentially. This can result in a huge speed-up when parallelising computations over a large dataset, and when a neural network contains many nodes. (However, note that [not all](https://www.run.ai/guides/multi-gpu/cpu-vs-gpu) machine learning applications benefit from the use of a GPU.)

(More recently, tensor processing units (TPUs) have been developed, which are specifically designed for use in machine learning.)

Getting set up to use the GPU of your local machine can be fiddly. However, if you're running this notebook in a cloud-hosted environment, it's very easy, as we explain below.

### Google Colab

If you're using Google Colab, you can switch between a CPU, a GPU, or a TPU by clicking 'Change runtime type' under 'Runtime' in the menu bar.

Note that unless you pay for Colab Pro / Pro+ (or pay individually for compute units), Google places a quota on how much GPU / TPU compute time you can use in a given time period. (The precise quotas aren't published, ostensibly because they tend to vary over time.)

(You can also use a GPU or a TPU in a Kaggle notebook, by opening the sidebar (by clicking on the tiny arrow in the bottom right), and expanding 'Notebook options'.)

### Ending your session

If you're using Google Colab with a GPU / TPU, then to conserve your compute allowance you may want to make sure you end your session when you're done. This can be done either by clicking 'Disconnect and delete runtime' under 'Runtime' in the menu bar, or in a cell with the following code:

```python
from google.colab import runtime
runtime.unassign()
```

---
## Telling PyTorch to use the GPU

We can check that we have a GPU available by running the following code cell. If the GPU is available, it will say that we're using a 'cuda device'.

In [None]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

By default, even if a GPU is available, PyTorch will perform computations on the CPU. We can tell it to instead work with a GPU as follows.

First, note that what PyTorch really sends to the GPU are tensors, and putting the neural network on the GPU means sending the weight and bias tensors to the GPU. We can see that by default a tensor lives on the CPU by running the following code cell.

In [None]:
ex_tensor.device

To send it to the GPU instead, we can use ```to()```:

In [None]:
ex_tensor = ex_tensor.to(device)

ex_tensor

By default, the neural network (i.e. its weight and bias tensors) lives on the CPU, as we can see by running:

In [None]:
for n, p in ex_seq_model.named_parameters():
  print(p.device," ",n)


The network and the tensors being passed to it need to be on the same device, or else we will get an error.

For example, if we try to pass a tensor living on the GPU to our network which currently lives on the CPU, we will get an error:

In [None]:
try:
  ex_seq_model(ex_tensor)
except Exception as e:
  print(e)

We can move our network to the GPU with the same syntax we used for an individual tensor.

**Q: Complete the following code cell to move ```ex_seq_model``` to the GPU, and then check that the network is indeed on the correct device.**

In [None]:
### COMPLETE THIS CODE CELL ###

Below when we train our digit-recogniser model, we will run everything on the GPU if it's available.

---
## Mini-batches, epochs, and shuffling

As discussed in the lectures, it is typical to split up the training dataset into a number of smaller pieces, called 'mini-batches'. During training, we pass a mini-batch to the model, compute the loss, and perform gradient descent, and then do the same with the next mini-batch.

Sometimes it may be necessary to do this, because the full training set doesn't fit in memory, but additionally it is typically more efficient to do this, as it means the network weights are updated more frequently, so that gradient descent occurs more quickly.

When all of the mini-batches, i.e. the entire training set, have been passed through once, this completes one 'epoch'.

When we train a model below, we will choose the number of epochs, and after each epoch we'll output the performance of the updated network on the validation set.

At the start of a new epoch, it is typical to first shuffle the data before separating it into a new set of mini-batches, as this improves the performance of gradient descent, for example by making it more likely to escape a local (i.e. not global) minimum.
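
To see what shuffling does in practice, here is a small sketch using PyTorch's ```DataLoader``` (discussed in the next section) on a made-up dataset consisting of the integers 0 to 7. The dataset and the seed are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_points = TensorDataset(torch.arange(8))   # eight data points: 0, 1, ..., 7
shuffled_loader = DataLoader(
    toy_points,
    batch_size=4,
    shuffle=True,                                  # reshuffle at the start of every epoch
    generator=torch.Generator().manual_seed(0),    # seeded for reproducibility
)

epochs_seen = []
for epoch in range(2):
    # Each batch is a list with one tensor, since the dataset has a single field
    batches = [batch[0].tolist() for batch in shuffled_loader]
    epochs_seen.append(batches)
    print(f"epoch {epoch}: {batches}")
```

Each epoch still contains every data point exactly once, but how the points are grouped into mini-batches will typically differ from one epoch to the next.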

**Q: Consider again the example of data points $(x_i,y_i)$ sampled from a quadratic function plus a small amount of noise. Suppose we have 1000 data points, and imagine the data is batched into 2 sets of 500 points, such that by chance in one set all the points are above the quadratic, while in the other they are all below. What will happen during one training epoch? Would shuffling help here?**

---
## Dataloaders

One could straightforwardly write code to manually shuffle the dataset, batch it, and pass these batches into the training process. However, PyTorch's built-in ```DataLoader``` functionality can do this for us. The syntax for ```DataLoader``` is very simple (here we imagine we have already defined ```train_set``` and ```validation_set```):

```python
BATCH_SIZE = 64

train_dataloader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
validation_dataloader = DataLoader(validation_set, batch_size=BATCH_SIZE, shuffle=True)
```

**Q: Batching the training data can improve the performance of gradient descent during training. When might it also be necessary to batch the validation set?**

Below when we write the training / validation loops over batches, we will see how the batched data is fetched from the ```DataLoader```.

(Note that to pass your own custom data to ```DataLoader```, it needs to be prepared in a certain format. This is however very straightforward - see for example [this video](https://youtu.be/zN49HdDxHi8).)
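
For data that is already held in tensors, one simple option is ```torch.utils.data.TensorDataset```. The sketch below wraps 1000 made-up points sampled from $y = x^2$ and batches them; the names ```toy_set``` and ```toy_loader``` are illustrative, and this is only one of several ways to prepare custom data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 made-up (x, y) pairs with y = x^2, each stored with shape (1000, 1)
xs = torch.linspace(-1, 1, 1000).unsqueeze(1)
ys = xs ** 2

toy_set = TensorDataset(xs, ys)
toy_loader = DataLoader(toy_set, batch_size=64, shuffle=True)

# Iterating over the loader yields one (inputs, labels) pair per mini-batch
batch_sizes = [len(X) for X, y in toy_loader]
print(len(toy_loader), batch_sizes[0], batch_sizes[-1])
```

Since 1000 is not a multiple of 64, there are 16 mini-batches and the last one contains only the remaining 40 points.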

---
## Optimizers

As discussed in the lecture, during training the way the parameters of the model are updated to improve performance is by performing gradient descent on the loss function, i.e. updating the parameters in the direction that will best reduce the difference between the model outputs and the correct labels.

There are however a number of ways one can make improvements to a naive gradient descent process. For example, one might use a variable learning rate, i.e. slowly reduce the size of the step taken for the same gradient, since by starting with a large learning rate and slowly reducing it, we might hope to better find global minima.

The different possible choices define different 'optimizers', of which naive gradient descent is merely the simplest.

Unsurprisingly, there has been significant work towards finding better and better optimizers. In PyTorch, there are a number of available optimizers, which we can use through ```torch.optim```. Below we will use one of the most widely-used optimizers, called 'Adam', which we call with ```torch.optim.Adam```. For the full range of options available in PyTorch, see the [documentation](https://pytorch.org/docs/stable/optim.html).
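
To illustrate that different optimizers take different steps for the same gradient, here is a small sketch performing a single update of one parameter $w = 2$ on the toy loss $w^2$ (whose gradient is $2w = 4$), first with plain gradient descent (```torch.optim.SGD```) and then with Adam. The helper ```one_step``` and the toy loss are made up for this example.

```python
import torch

def one_step(optimizer_class):
    """Take one optimizer step on the toy loss w^2, starting from w = 2."""
    w = torch.tensor([2.0], requires_grad=True)
    toy_optimizer = optimizer_class([w], lr=0.1)
    loss = (w ** 2).sum()
    loss.backward()            # computes d(loss)/dw = 2w = 4
    toy_optimizer.step()       # updates w using the optimizer's own rule
    toy_optimizer.zero_grad()
    return w.item()

w_sgd = one_step(torch.optim.SGD)    # plain gradient descent: 2 - 0.1 * 4
w_adam = one_step(torch.optim.Adam)  # Adam rescales the step by the gradient's size
print(w_sgd, w_adam)
```

Plain gradient descent moves $w$ to approximately $1.6$, while the first Adam step has size approximately equal to the learning rate, giving approximately $1.9$.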

---
## Training and validation loops

We're now ready to perform training (and validation) of our neural network.

To perform a single epoch of training, we loop over all of the batches, and perform for each batch a gradient descent step (with our chosen optimizer).

After each epoch, we perform a single validation step, by simply checking how the model performs on the validation set (which is also batched).

Below, we have defined functions which perform a single training or a single validation loop.

**Q: Check that you understand the inputs being passed to ```train_loop``` and ```validation_loop```.**

Some of the steps in the below loops are somewhat non-obvious:

- ```model.train()``` and ```model.eval()``` put the network into training and evaluation mode, respectively (in practice this is only necessary if the network contains some [specific](https://stackoverflow.com/questions/66534762) types of layers, but it's good practice to always include these lines)
- Since in practice the way the gradients are computed for use in gradient descent is by adding up contributions as more inputs are fed in, ```zero_grad``` is used to reset the gradient to zero
- ```item()``` simply takes a tensor containing a single element and outputs that element, e.g. ```torch.tensor([1]).item()``` produces ```1```.
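- The output ```pred``` of the neural network will be, for each input in the batch, a vector of scores, one for each category, where a higher score corresponds to a higher probability $p_i$ that the input belongs to the $i$th category. (With the cross-entropy loss that we use below, the scores are converted into probabilities inside the loss function.) Hence to get a prediction of the most likely category, we want the index $i$ of the highest score, to compare with the known labels during the validation loop. This is what ```pred.argmax(1)``` does.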
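
The accumulation of gradients mentioned above can be seen in a minimal sketch with a single made-up parameter ```w```. Here we reset the gradient directly with ```w.grad.zero_()```; in the loops below, ```optimizer.zero_grad()``` plays this role for all of the parameters of the model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()          # gradient of 3w with respect to w is 3

(3 * w).backward()
accumulated = w.grad.item()    # the new gradient is added to the old one: 3 + 3

w.grad.zero_()                 # reset the stored gradient
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```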

**Q: Check that you understand the steps in the training and validation loops below.**

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def validation_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    validation_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            validation_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    validation_loss /= num_batches
    correct /= size
    print(f"Validation Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {validation_loss:>8f} \n")

---
# Example: Building and training an image classifier

Now that we've covered all of the individual steps, we're ready to build and train a neural network in PyTorch. In particular, we will train a model to classify images.

The images we'll consider are of handwritten digits. This is a classic example, and the classic dataset of such images is called the ['MNIST' database](https://en.wikipedia.org/wiki/MNIST_database).

As with a number of other classic datasets, PyTorch has built-in support for the MNIST database, and we can simply download the data directly as follows:



In [None]:
train_set = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

validation_set = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

Let's take a look at the structure of the data. The individual elements are images, and we can look at one by plotting it:

In [None]:
image, label = train_set[0]
image.shape

plt.imshow(image.squeeze(), cmap='gray');

This is a greyscale image, and it is $28 \times 28$ pixels. In particular, this image is encoded as a tensor whose dimensions we can check as follows, where the first dimension being 1 corresponds to this being a greyscale image.

(Note that when we plotted the image above, we used `squeeze()`, which removes the first index of the tensor, as you can see by running `image.squeeze().shape`.)

**Q: What would the dimensions of the tensor be if the image were instead in colour?**

In [None]:
image.shape

We can also check what the possible labels are for this dataset as follows.

In [None]:
train_set.classes

As we discussed above, we put our train and validation data into a dataloader, as follows.

In [None]:
BATCH_SIZE = 64

train_dataloader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
validation_dataloader = DataLoader(validation_set, batch_size=BATCH_SIZE, shuffle=True)

We can also take a look at all the images in a batch:

In [None]:
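def imshow(img):
    # The images come from ToTensor(), so pixel values already lie in [0, 1]
    # and no rescaling is needed before plotting
    npimg = img.numpy()
    # Reorder from (channels, height, width) to (height, width, channels) for Matplotlib
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.tick_params(left=False, bottom=False, labelbottom=False, labelleft=False)
    plt.show()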

dataiter = iter(train_dataloader)
images, labels = next(dataiter)

imshow(torchvision.utils.make_grid(images))

Next we will construct the neural network which we will train to perform this image recognition task.

For this purpose we will construct a neural network with a single hidden layer. (The use of hidden layers makes this an example of 'deep learning'.) We will see that this gives the network enough flexibility to achieve a high accuracy.

(Note that in the context of image recognition there is a widely-used and powerful class of neural networks called 'convolutional neural networks' (CNNs). However, the present task is simple enough that we can achieve good results by using only linear layers.)

The input will be the list of intensities of the pixels of the image - here we will 'unroll' the two-dimensional tensor representing the grid of pixels into a single one-dimensional tensor. The output will be the list of probabilities $p_i$ that the image is of the digit $n_i$.

**Q: How many nodes do we then need in the input layer? And how many in the output layer?**

While the input and output layers are fixed by the classification problem, the number of nodes in the hidden layer is free for us to choose, and its choice will determine the number of parameters in the network and hence to a large degree its performance.
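
To get a feeling for how the number of parameters depends on the size of the hidden layer, here is a small sketch with made-up layer sizes (12 inputs and 3 outputs, unrelated to the digit problem). The helper ```count_parameters``` is a hypothetical name introduced for this example.

```python
from torch import nn

def count_parameters(n_in, n_hidden, n_out):
    """Total number of weights and biases in a network with one hidden layer."""
    toy_model = nn.Sequential(
        nn.Linear(n_in, n_hidden),
        nn.ReLU(),
        nn.Linear(n_hidden, n_out),
    )
    return sum(p.numel() for p in toy_model.parameters())

for n_hidden in (2, 5, 10):
    print(n_hidden, count_parameters(12, n_hidden, 3))
```

The count agrees with the formula $(n_\mathrm{in} + 1)\, n_\mathrm{hidden} + (n_\mathrm{hidden} + 1)\, n_\mathrm{out}$, where the $+1$ terms account for the biases, so it grows linearly with the number of hidden nodes.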

**Q: Would it make sense to make the hidden layer have more nodes than the input layer? Or is it more sensible to give it fewer nodes?**

**Q: Would it make sense for the hidden layer to use fewer nodes than the output layer in this context? What would be the interpretation of doing that?**

**Q: Complete the below code cell by choosing a sensible number ```num_hidden``` of nodes in the hidden layer.**


In [None]:
### COMPLETE THIS CODE CELL ###

num_in = image.numel()
num_out = len(train_set.classes)

num_hidden = ...

We can now straightforwardly define our neural network with PyTorch's ```Sequential``` functionality.

**Q: Complete the following code cell to define a neural network with three layers, with respectively ```num_in```, ```num_hidden```, and ```num_out``` nodes, and with a ReLU activation function at the hidden layer.**

(Note that below we have included first a ```Flatten``` layer to unroll the two-dimensional tensor defining the image, as discussed above.)
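
To see what ```Flatten``` does to the shape of a batch, here is a small sketch on a made-up batch of 5 single-channel $4 \times 4$ 'images' (deliberately not the MNIST shape).

```python
import torch
from torch import nn

toy_batch = torch.zeros(5, 1, 4, 4)          # (batch, channels, height, width)
flat = nn.Flatten(start_dim=1)(toy_batch)    # keep index 0, unroll the rest

print(flat.shape)  # torch.Size([5, 16])
```

The batch index is left alone, so each image in the batch is unrolled separately into a one-dimensional tensor.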

(Also note that unlike the example ```Sequential``` model we constructed above, we will not need to include an activation function like ```Sigmoid``` at the final layer, because the conversion of the outputs into probabilities (a softmax) is already applied inside the cross-entropy loss function that we will utilise below. Hence the last operation should simply be linear.)


In [None]:
seq_model = nn.Sequential(
     nn.Flatten(start_dim=1) # Flattens the input (but leaves the 0th index unflattened)
    ,...
)

seq_model

Next we set up the gradient descent process:

In [None]:
learning_rate = 1e-3

optimizer = torch.optim.Adam(seq_model.parameters(), lr=learning_rate)

For a multi-class classification problem, we choose to use the cross-entropy loss function. For some more information see for example [this post](https://medium.com/swlh/cross-entropy-loss-in-pytorch-c010faf97bab).
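
As a small check of what this loss computes, the following sketch compares ```nn.CrossEntropyLoss``` on a single made-up example with minus the logarithm of the softmax probability assigned to the correct category. The scores and target are invented for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

toy_logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs for one example
toy_target = torch.tensor([0])                  # index of the correct category

builtin = nn.CrossEntropyLoss()(toy_logits, toy_target)

# By hand: apply a (log-)softmax to the raw outputs, then pick out the correct category
by_hand = -F.log_softmax(toy_logits, dim=1)[0, toy_target[0]]

print(builtin.item(), by_hand.item())
print(torch.allclose(builtin, by_hand))
```

Note that the loss takes the raw outputs of the last linear layer together with the category indices, rather than probabilities and one-hot vectors.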

In [None]:
loss_fn = nn.CrossEntropyLoss()

We can now finally train our model. (Note we already set up ```train_loop``` and ```validation_loop``` above.)

In [None]:
epochs = 5 # We go through the training set 5 times (you can try doing more, or fewer, epochs)
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, seq_model, loss_fn, optimizer)
    validation_loop(validation_dataloader, seq_model, loss_fn)
print("Done!")

How does your final trained network perform? For a sensible choice of ```num_hidden```, you should see that accuracy on the validation set is above 95%.

**Q: If your model isn't performing well, go back and try to change ```num_hidden```, or fix errors in defining the architecture of the network.**

---
### Interpretability of an image classification network

How does a neural network learn to extract sensible features out of image data? By looking at which nodes in a given layer are activated by a given input, we can see how each layer picks out particular features, say line segments, or spirals.

For some further study in this direction, you can check out [this video](https://youtu.be/aircAruvnKk) by 3Blue1Brown, which is in the specific context of handwritten digit recognition, or take a look at [this paper](https://arxiv.org/abs/1311.2901) (and specifically Figure 2), which discusses interpretability in the broader context of image recognition.