# 💪 Part 3: Building and Training Your Model

In [20]:
# Import relevant libraries
import torch 
import torchvision
from torch import nn 
from torchvision import transforms
from torch.utils import data
import random
import matplotlib.pyplot as plt
import time
from IPython import display
import numpy as np
from helper import helper

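# Seed Python's `random` module so that our random samples are the same every time we run the code.
# Note: PyTorch keeps its own random state (seeded with torch.manual_seed), which this call does not affect.
random.seed(2021)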

## ⚗️ The Data Science Pipeline
*This section will be repeated in both Part 2 and Part 3*

> What on earth is data science?! -- George Washington (probably not)

Seriously though, in today's data-rich world, data science has become the new buzzword, the new cool kid on the block. But what exactly is it? Unfortunately, no one can really pin down a [rigorous definition](https://hdsr.mitpress.mit.edu/pub/jhy4g6eg/release/7) of data science. At a high level:

> Data science is the systematic extraction of novel information from data.

Good enough! With this definition, most practitioners can somewhat agree on a pipeline or flow. Here are the steps:
1. Identify your problem (What are you trying to do?)
2. Obtain your data (What resource do we have to work with?)
3. Explore your data (What does our data actually look like?)
4. Prepare your data (How do we clean/wrangle our data to make it ingestible?)
5. Model your data (How do we automate the process of drawing out insights?)
6. Evaluate your model (How good are our predictions?)
7. Deploy your model (How can the wider-user base access these insights?)

The 7th step is out of scope for this workshop, but we will be exploring the other steps to varying degrees:
* Steps 1-4 will be explored in Part 2.
* Steps 5-6 will be explored in Parts 3 and 4.


## 🧢 Recap
Let's review what we have done so far:

|Pipeline | Our Problem |
|---| --- |
|1. Identify Your Problem | Classify images of items of clothing |
|2. Obtain Your Data | 70,000 labelled images (10 different types) of clothes |
|3. Explore Your Data | Class distribution perfectly equal across classes |
|4. Prepare Your Data | Split 70,000 into 60,000 train and 10,000 test set |

Note that we didn't have to do much cleaning because the data we have is close to *perfect* in many regards. For more on the intricacies of this process, this excellent [textbook](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) covers all the nitty-gritty details.

We need to re-run the data reading part of the tutorial from the last notebook. Please set your variables accordingly.

In [21]:
# First define the function without running it
def load_data_fashion_mnist(batch_size, n_workers):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root="../data",
                                                    train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="../data",
                                                   train=False,
                                                   transform=trans,
                                                   download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=n_workers),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=n_workers))

# Then execute the function here
batch_size = 1024  # Set to 256 on your own device
n_workers = 0      # Set to 4 on your own device
train_iter, test_iter = load_data_fashion_mnist(batch_size=batch_size, n_workers = n_workers)

## Step 5A: Setting Up Your Model
So we have our data fully prepared and ready for training. What is our model going to look like? Recall that neural networks are made of building blocks called 'perceptrons'. Arrange the perceptrons in the right way and we get ourselves a nice neural network. We won't go into the intricate details of how to choose an architecture. Instead, we will use one that has already been designed and tested.

### 🪟 Convolutional Neural Network
Recall that for image data, a simple neural network such as the multi-layer perceptron (also called a dense or fully-connected neural network) can work, but it is usually not powerful enough to capture the spatial information in pictures. Instead we use **convolutional** neural networks. The mathematical concept of [convolution](https://en.wikipedia.org/wiki/Convolution#Visual_explanation) can take a bit of time to get used to, so the best way to think about it is as a small window that 'looks' at the image one chunk at a time.

![](../images/convolution.gif)  
[source](https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Full_padding_no_strides.gif)
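To make the sliding-window picture concrete, here is a minimal sketch that moves a 2x2 window over a tiny 3x3 'image' by hand. The function name `corr2d` and the toy values are chosen purely for illustration. Strictly speaking this operation is called cross-correlation, and it is what PyTorch's convolution layers compute, so the hand-written loop can be compared against `torch.nn.functional.conv2d`.

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    """Slide the window K over the image X (no padding, stride 1).

    At each position, multiply the window with the image patch underneath
    it element by element, then sum the products.
    """
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.arange(9, dtype=torch.float32).reshape(3, 3)  # toy 3x3 image
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])              # toy 2x2 window

Y_manual = corr2d(X, K)
# F.conv2d expects tensors shaped (batch, channels, height, width)
Y_builtin = F.conv2d(X.reshape(1, 1, 3, 3), K.reshape(1, 1, 2, 2)).reshape(2, 2)

print(Y_manual)   # values: [[19, 25], [37, 43]]
print(Y_builtin)  # same values
```

In a real convolution layer the values in the window are not fixed by hand like this; they are the weights that the network learns during training.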

### LeNet
The particular convolutional neural network architecture we will use is called LeNet. It was one of the first successful convolutional neural network architectures, conceived by Yann LeCun while he worked at Bell Labs. Here is the [original paper](https://www.researchgate.net/publication/2985446_Gradient-Based_Learning_Applied_to_Document_Recognition) if you are interested. In this instance, we will build a slightly adapted version that the d2l.ai textbook outlines in [this chapter](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html). (The key difference is that we drop the Gaussian activation function in the final layer.)

We will be using the two diagrams below to construct our neural network:

![lenet](../images/lenet.svg)

**Figure 1:** The architecture of LeNet. ([source](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html))

![lenetsimple](../images/lenet-vert.svg)

**Figure 2:** Compact version of the architecture of LeNet. ([source](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html))

#### Convolution Layers
Recall from the presentation that the convolution layer takes in 4 hyperparameters. The first convolution layer from Figure 2 provides us with all the key answers to these:

| Hyperparameter | Description | Value |
| --- | --- | --- |
| Kernel Size | Size of window | 5 (5x5) | 
| Output Channels | Number of filters | 6 |
| Padding | Size of the '0' ring | 2 |
| Stride | How far to slide the window | 1 |

Note that if the stride is not stated, the default value of 1 is used.

<font color='#F89536'> **Discussion:** </font> If the input image is 28x28, what are the output dimensions of the layer? (Hint: With a padding of 2 on every side, the image becomes 32x32. How many positions can the 5x5 window occupy as it slides, one pixel at a time, from the left-hand side to the right-hand side?)

Since our images are greyscale, there is only a single input channel. In code, this looks like:  
`nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)`.

**Note:** If you want to get a more conceptual understanding of what is happening, [this Stanford University course](https://cs231n.github.io/convolutional-networks/) has an animated figure which you can toggle on and off.
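The output size of a convolution layer follows the standard formula $\lfloor (n + 2p - k) / s \rfloor + 1$, where $n$ is the input width, $p$ the padding, $k$ the kernel size and $s$ the stride. The sketch below applies this formula to the first layer and checks it against what PyTorch actually produces. The helper name `conv_output_size` is made up for this example.

```python
import torch
from torch import nn

def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Width (or height) of a convolution output for an input of width n."""
    return (n + 2 * padding - kernel_size) // stride + 1

# First LeNet convolution layer: 28x28 input, 5x5 kernel, padding 2, stride 1
expected = conv_output_size(28, kernel_size=5, padding=2)

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
out = conv(torch.rand(1, 1, 28, 28))  # one random greyscale "image"

print(expected)   # 28
print(out.shape)  # torch.Size([1, 6, 28, 28])
```

With a padding of 2 and a 5x5 kernel, the layer keeps the 28x28 size and only changes the number of channels from 1 to 6.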

<font color='red'>Examples! Maybe discuss [cross-correlation](http://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html?highlight=cross%20correlation) also</font>.

#### Pooling Layers
All pooling in LeNet is average pooling. It takes in two hyperparameters. The first pooling layer (bottom of Figure 2) gives:

| Hyperparameter | Description | Value |
| --- | --- | --- |
| Kernel Size | Size of window | 2 (2x2) | 
| Stride | How far to slide the window | 2 |

Putting this together, the code becomes: `nn.AvgPool2d(kernel_size=2, stride=2)`.
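As a quick illustration of what average pooling does, the sketch below runs this layer on a toy 4x4 input (the values are made up for the example). Each non-overlapping 2x2 block is replaced by its mean, so the height and width are halved.

```python
import torch
from torch import nn

pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Toy input shaped (batch, channels, height, width) holding the values 0..15
X = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
Y = pool(X)

print(X[0, 0])
print(Y[0, 0])   # values: [[2.5, 4.5], [10.5, 12.5]]
print(Y.shape)   # torch.Size([1, 1, 2, 2])
```

For example, the top-left block holds 0, 1, 4 and 5, whose average is 2.5.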


#### Linear/Dense Layer
Let's look at the FC(120) layer in Figure 2. Its input is 16 channels of 5x5 feature maps. How many values is that? $16 \times 5 \times 5 = 400$. We flatten these 400 values into a single vector and feed it into a fully connected layer with 120 outputs.

Putting this together, the code becomes: `nn.Linear(in_features = 16 * 5 * 5, out_features = 120)`.
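Here is a small sketch of this step on random data. It flattens 16 feature maps of size 5x5 into 400 values, passes them through the linear layer, and counts the layer's learnable parameters (one weight per input-output pair, plus one bias per output).

```python
import torch
from torch import nn

feature_maps = torch.rand(1, 16, 5, 5)   # (batch, channels, height, width)
flat = nn.Flatten()(feature_maps)        # flattened to shape (1, 400)

fc = nn.Linear(in_features=16 * 5 * 5, out_features=120)
out = fc(flat)

# 400 * 120 weights + 120 biases
n_params = sum(p.numel() for p in fc.parameters())

print(flat.shape)  # torch.Size([1, 400])
print(out.shape)   # torch.Size([1, 120])
print(n_params)    # 48120
```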

<font color='red'>Examples!</font>.


#### Tying Loose Ends
Between each layer, we will use a sigmoid function `nn.Sigmoid()` as our activation function. Note we do not need to put activation functions after pooling layers, only after convolutional and linear layers. 
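For a feel of what the sigmoid does, the sketch below applies `nn.Sigmoid()` to a few toy values and compares the result with the formula $1 / (1 + e^{-x})$. Every input is squashed into the range between 0 and 1, and the shape of the tensor is unchanged.

```python
import torch
from torch import nn

sigmoid = nn.Sigmoid()
x = torch.tensor([-5.0, 0.0, 5.0])   # toy inputs

y = sigmoid(x)
manual = 1 / (1 + torch.exp(-x))     # the same function written out by hand

print(y)       # roughly [0.0067, 0.5, 0.9933]
print(manual)  # matches y
```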

To convert from a 2D image representation to a 1D linear representation, we use the `nn.Flatten()` function. Try filling out the `?` below according to the architecture.

In [10]:
# Initialise LeNet Architecture
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(?, ?, kernel_size=?), ?,
                    ?, nn.Flatten(),
                    nn.Linear(16 * 5 * 5, 120), ?,
                    ?, ?, 
                    nn.Linear(84, 10))

SyntaxError: invalid syntax (Temp/ipykernel_13272/2565157279.py, line 4)

In [22]:
# Initialise LeNet Architecture (ans)
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
                    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
                    nn.Linear(120, 84), nn.Sigmoid(), 
                    nn.Linear(84, 10))

Let's have a look at the output shape of each layer:

In [23]:
# Show layers
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:    \t', X.shape)

Conv2d output shape:    	 torch.Size([1, 6, 28, 28])
Sigmoid output shape:    	 torch.Size([1, 6, 28, 28])
AvgPool2d output shape:    	 torch.Size([1, 6, 14, 14])
Conv2d output shape:    	 torch.Size([1, 16, 10, 10])
Sigmoid output shape:    	 torch.Size([1, 16, 10, 10])
AvgPool2d output shape:    	 torch.Size([1, 16, 5, 5])
Flatten output shape:    	 torch.Size([1, 400])
Linear output shape:    	 torch.Size([1, 120])
Sigmoid output shape:    	 torch.Size([1, 120])
Linear output shape:    	 torch.Size([1, 84])
Sigmoid output shape:    	 torch.Size([1, 84])
Linear output shape:    	 torch.Size([1, 10])


## Step 5B: Training Your Model

### Initialising weights
When you first set up your neural network `net`, the weights are random and nonsensical. There are, however, certain systematic ways of randomly initialising your weights (as weird as that sentence sounds). A sensible starting point is using the [Xavier Uniform](https://pytorch.org/docs/stable/nn.init.html#) distribution as outlined in [this paper](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). No further justification will be given in this tutorial, although initial weight values of neural networks can spark fascinating discussions in and of themselves.

In [24]:
def init_weights(m):
    # Only linear and Conv2d layers have weights to initialise; pooling and activation layers do not
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)

net.apply(init_weights) # apply() takes a function as input and calls it on every layer of net

Sequential(
  (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (1): Sigmoid()
  (2): AvgPool2d(kernel_size=2, stride=2, padding=0)
  (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (4): Sigmoid()
  (5): AvgPool2d(kernel_size=2, stride=2, padding=0)
  (6): Flatten(start_dim=1, end_dim=-1)
  (7): Linear(in_features=400, out_features=120, bias=True)
  (8): Sigmoid()
  (9): Linear(in_features=120, out_features=84, bias=True)
  (10): Sigmoid()
  (11): Linear(in_features=84, out_features=10, bias=True)
)
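To see what Xavier uniform initialisation does in practice, here is a small sketch on a single linear layer of the same size as our first dense layer. With the default gain of 1, `nn.init.xavier_uniform_` draws each weight uniformly from $[-a, a]$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The seed is only there to make the example repeatable.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)  # assumption: any seed works, this just fixes the draw

fan_in, fan_out = 400, 120
layer = nn.Linear(in_features=fan_in, out_features=fan_out)
nn.init.xavier_uniform_(layer.weight)

# Theoretical bound of the uniform distribution (gain = 1)
bound = math.sqrt(6.0 / (fan_in + fan_out))
largest = layer.weight.abs().max().item()

print(round(bound, 4))  # 0.1074
print(largest)          # just below the bound
```

Layers with more inputs and outputs get a smaller bound, which keeps the size of the signal roughly stable as it passes through the network.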

### Setting Hyperparameters
There are certain parameters that we allow the data to define (e.g. the weights of the neural network). There are other parameters we need to provide to the model ourselves (e.g. the exact architecture of the neural network). The latter are called **hyperparameters**. We have already set the network architecture, but there are three more decisions we need to make:

| Hyperparameter | Description | Selected Value |
| --- | --- | --- |
| Learning Rate | How big a step the optimiser takes at each weight update. Too large and we might *overshoot* the optimal weights. Too small and training will take a long time | $0.9$ |
| Optimiser | What algorithm do we use to find the optimal weights? | [Stochastic Gradient Descent (SGD)](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD) |
| Loss Function | How do we measure the *correctness* of our predictions? | [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) |

We define these below:

In [25]:
lr = 0.9 
optimizer = torch.optim.SGD(net.parameters(), lr=lr) 
loss = nn.CrossEntropyLoss() 
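To build some intuition for the cross entropy loss, consider a model that has learnt nothing and gives every class the same score. The sketch below (toy scores and labels, made up for the example) shows that the loss is then $\ln(10) \approx 2.30$ for 10 classes. It also recomputes the loss by hand as the average negative log-probability of the correct class.

```python
import math
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.zeros(4, 10)            # 4 examples, equal scores for all 10 classes
labels = torch.tensor([0, 3, 5, 9])    # toy labels

uniform_loss = loss_fn(logits, labels).item()

# By hand: softmax -> log -> pick the correct class -> negate -> average
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(4), labels].mean().item()

print(uniform_loss)   # approx. 2.3026
print(manual_loss)    # same value
print(math.log(10))   # approx. 2.3026
```

A loss close to 2.3 is therefore a useful reference point: it is what pure guessing looks like on a 10-class problem.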

### Training the Model

To train the model we must:
1. Set the 'mode' of the network to `train`
2. Select a single mini-batch to train on
3. Conduct a forward pass to make predictions
4. Calculate the loss (lack of 'correctness') of these predictions
5. Calculate the gradients required for back propagation (according to the loss function)
6. Update weights according to gradient descent

The lines below correspond to these steps, in order. The one extra line, `optimizer.zero_grad()`, resets the gradients before the forward pass.

In [26]:
net.train() # This doesn't actually train the network; it only switches it to training mode
X, y = next(iter(train_iter)) # Take a single mini-batch (the training data is shuffled, so it is a random one)
optimizer.zero_grad() # Reset the gradients before the forward/backward pass (otherwise they accumulate)
y_hat = net(X) # Forward pass on the data to make predictions
l = loss(y_hat, y) # Calculate the loss
l.backward() # Backward pass: backpropagation calculates the gradient of the loss for each weight
optimizer.step() # Take one gradient descent step, using those gradients to update the weights
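To see what `optimizer.step()` and `optimizer.zero_grad()` do under the hood, here is a minimal sketch on a toy problem: two weights and a made-up loss (the sum of squared weights), rather than our network. Plain SGD moves each weight by `lr` times its gradient, and calling `backward()` twice without resetting adds the gradients together.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)   # toy weights
lr = 0.9
optimizer = torch.optim.SGD([w], lr=lr)

# One SGD step
optimizer.zero_grad()
toy_loss = (w ** 2).sum()              # toy loss; its gradient is 2 * w
toy_loss.backward()
expected = w.detach() - lr * w.grad    # update rule worked out by hand
optimizer.step()
print(w)                               # values: [-0.8, 1.6]

# Why we reset: gradients accumulate across backward() calls
optimizer.zero_grad()
(w ** 2).sum().backward()
first = w.grad.clone()
(w ** 2).sum().backward()              # no zero_grad() in between
accumulated = w.grad.clone()
print(first, accumulated)              # accumulated is twice the first gradient
```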

🎉🎉🎉 Congratulations! You have trained your first neural network in PyTorch 🎉🎉🎉

...well not quite. This model is going to be quite terrible, since we only trained on a small sample of our dataset. In the next part we will look into scaling this procedure up. But first, let's see how we went.


## Step 6: Evaluate Your Model
It's all well and good to train a model, but that's pretty useless if you can't see how well it does. Recall that we want two things from our model:
* Predictions to be correct (Accuracy)
* Model to generalise to unseen data (No Overfitting)

Thus we should measure both the training accuracy (how well the model performs on the data it was trained on) and the test accuracy (how well the model performs on unseen/independent data).

### Training Accuracy
First we calculate the training accuracy.

In [29]:
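# Use a new name here: reusing `loss` would overwrite the loss function defined earlier
batch_loss = l * X.shape[0]  # CrossEntropyLoss returns the mean over the batch, so this is the summed loss
n_correct = helper.accuracy(y_hat, y)  # Note: y_hat was computed before the weight update
n_total = X.shape[0] 


print("1. The mini-batch loss is: \t\t\t\t", batch_loss)
print("2. The number of correct training predictions is: \t", n_correct)
print("3. The number of total training predictions is: \t", n_total)

print("This means we get a training accuracy of ", n_correct/n_total)
print("The average loss for each example is ", float(batch_loss/n_total))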

1. The mini-batch loss is: 				 tensor(2604.3523, grad_fn=<MulBackward0>)
2. The number of correct training predictions is: 	 93.0
3. The number of total training predictions is: 	 1024
This means we get a training accuracy of  0.0908203125
The average loss for each example is  2.5433127880096436
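The `helper` module is not shown in this notebook, so the following is only a guess at what a function like `helper.accuracy` might do: take the class with the highest score for each example and count how many match the true labels. The function name `count_correct` and the toy scores are made up for this sketch.

```python
import torch

def count_correct(y_hat, y):
    """Hypothetical stand-in for helper.accuracy: number of correct predictions."""
    predicted = y_hat.argmax(dim=1)          # index of the highest score per example
    return float((predicted == y).sum())

# Toy scores for 4 examples and 3 classes
y_hat = torch.tensor([[0.1, 2.0, 0.3],
                      [1.5, 0.2, 0.1],
                      [0.0, 0.1, 3.0],
                      [0.9, 0.8, 0.7]])
y = torch.tensor([1, 0, 1, 0])               # true labels

n_correct = count_correct(y_hat, y)
print(n_correct)               # 3.0
print(n_correct / len(y))      # 0.75
```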


### Testing Accuracy
Then we calculate the testing accuracy.

In [30]:
test_accuracy = helper.evaluate_accuracy(net, test_iter)
print("The testing accuracy is: ", test_accuracy)

The testing accuracy is:  0.1
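Again, the code behind `helper.evaluate_accuracy` is not shown here. A plausible sketch, under that assumption, is a loop over every mini-batch in the iterator with the network in evaluation mode and gradient tracking switched off. The function name `evaluate_accuracy_sketch`, the tiny network and the random data are all stand-ins so that the example runs on its own.

```python
import torch
from torch import nn
from torch.utils import data

def evaluate_accuracy_sketch(net, data_iter):
    """Hypothetical stand-in for helper.evaluate_accuracy."""
    net.eval()                       # switch to evaluation mode
    n_correct, n_total = 0, 0
    with torch.no_grad():            # no gradients needed when only predicting
        for X, y in data_iter:
            n_correct += (net(X).argmax(dim=1) == y).sum().item()
            n_total += y.shape[0]
    return n_correct / n_total

# Random stand-in data shaped like Fashion-MNIST: 64 images, 10 classes
torch.manual_seed(0)
fake_images = torch.rand(64, 1, 28, 28)
fake_labels = torch.randint(0, 10, (64,))
fake_iter = data.DataLoader(data.TensorDataset(fake_images, fake_labels), batch_size=16)

tiny_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # untrained toy model
acc = evaluate_accuracy_sketch(tiny_net, fake_iter)
print(acc)  # a fraction between 0 and 1
```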


The train and test accuracy both hover around 10%. That means the model gets the right label about 1 in 10 times. With 10 equally likely classes, this is no better than randomly picking a label for each image! However, we have only trained on a single mini-batch of data, so of course the performance is going to be low. In reality we need to run over the entire training data at least once. One full pass over the training data is called an epoch. In the next section we will talk about scaling this up to more training examples.