# Diving into Deep Learning

Let's load in any libraries we will use in this notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torchvision
import torchvision.transforms as transforms

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import tqdm

# Part 0: Prepare the Data
## Loading the Data
The torchvision library already has a function for the FashionMNIST dataset that lets us load it in really easily!

There's nothing for you to change in the cell below, but have a close look at what I'm doing as it'll be relevant for next week's practical and for Project 1.

First, I'm creating a 'transform': a transformation that gets applied to the raw data. I'm doing 2 things: 
1. **converting the data to a Tensor**: this is the format that all data needs to be in before it can be used with the PyTorch library. For images, this step also rescales the pixel values to lie between 0 and 1.
2. **normalizing the data**: the first value is the mean of the data and the second is the standard deviation. Note here that I've just assumed these values are 0.5 -- it's a reasonable assumption, since these are images with values between 0 and 1. In practice, I'd probably want to find the real mean and standard deviation for best performance.

Next, I'm loading the FashionMNIST train and test subsets. Note the arguments I'm using: *root* is where the data will be stored, *train* selects the train subset (True) or the test subset (False), *download* allows the library to download the data to the root if it's not already there, and *transform* applies my defined transform to the data.
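
To make the normalisation step described above concrete, here is a minimal sketch of what normalising with a mean of 0.5 and a standard deviation of 0.5 does to pixel values between 0 and 1, and how the real statistics could be measured. The `images` tensor is random stand-in data (an assumption for illustration), not FashionMNIST, so the printed mean and standard deviation are not the dataset's real values.

```python
import torch

torch.manual_seed(0)

# Stand-in for a stack of 100 grayscale 28x28 images already scaled to [0, 1]
# (random data used purely for illustration, not FashionMNIST)
images = torch.rand(100, 1, 28, 28)

# Normalising computes (x - mean) / std for every pixel,
# so mean=0.5 and std=0.5 map the range [0, 1] onto [-1, 1]
normalized = (images - 0.5) / 0.5
print(f'Range after normalising: {normalized.min().item():.3f} to {normalized.max().item():.3f}')

# Measuring the statistics directly, instead of assuming 0.5
measured_mean = images.mean().item()
measured_std = images.std().item()
print(f'Measured mean: {measured_mean:.3f}, measured std: {measured_std:.3f}')
```

On the real data, the same two `.mean()` and `.std()` calls over the training images would give the values to pass to the normalisation step.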

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))]) #one-element tuples: one mean and one std for the single grayscale channel

trainval_dataset = torchvision.datasets.FashionMNIST(root = '../data/', train=True, download=True, transform = transform)
test_dataset = torchvision.datasets.FashionMNIST(root = '../data/', train=False, download=True, transform = transform)

## Creating Data Subsets

Notice that we didn't have a validation dataset? FashionMNIST only has training and testing subsets, so we'll need to create our own validation dataset by carving some out of the training subset. I've defined the number of data points I want in the training and validation datasets with the train_size and val_size variables.

**Your turn:** Investigate the torch.utils.data.random_split() function, and use it to randomly split the trainval_dataset into a train_dataset and val_dataset.

If you've done this correctly, you should have a train dataset of size 48000, a val dataset of size 12000, and a test dataset of size 10000.
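
As a pointer for the exercise, here is a minimal sketch of how `torch.utils.data.random_split()` behaves on a toy dataset. The names `toy_dataset`, `first_part` and `second_part` are made up for this illustration, and the seeded generator is optional: it is only there to make the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 10 (input, label) pairs -- a stand-in, not FashionMNIST
toy_inputs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
toy_labels = torch.arange(10)
toy_dataset = TensorDataset(toy_inputs, toy_labels)

# Optional: a seeded generator makes the random split reproducible
generator = torch.Generator().manual_seed(0)

# Integer lengths must add up to len(toy_dataset)
first_part, second_part = random_split(toy_dataset, [7, 3], generator=generator)

print(len(first_part), len(second_part))
# Each part is a Subset that remembers which indices of the original it holds
print(sorted(first_part.indices), sorted(second_part.indices))
```

Every sample ends up in exactly one of the two parts, which is the property we want for a train/validation split.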

In [None]:
train_portion = 0.8
train_size = int(train_portion*len(trainval_dataset))
val_size = len(trainval_dataset)-train_size

##### Your code goes here ######

print(f'Size of train dataset: {len(train_dataset)}')
print(f'Size of val dataset: {len(val_dataset)}')
print(f'Size of test dataset: {len(test_dataset)}')


## Inspecting the Data
Below, I'm visualising the balance of different classes between the training and validation datasets. There's nothing for you to change here, just observe the histogram -- are the classes generally balanced between the training and validation datasets?

In [None]:
train_labels = [data[1] for data in train_dataset] #this is a list comprehension -- one of the coolest things in python. Google it if you don't know what it is.
val_labels = [data[1] for data in val_dataset]

#normalize the histogram as we don't care about absolute numbers here, but a relative comparison
plt.hist([train_labels, val_labels], bins=np.arange(11) - 0.5, density = True) #bin edges at -0.5, 0.5, ..., 9.5 centre each bar group on its class
plt.xlabel('Class')
plt.xticks(range(10)) #10 classes
plt.ylabel('Normalized count')
plt.show()

Last step before we move onto creating a neural network -- let's visually inspect the data!

There's one major take-away here: a neural network needs a vector of input values, but an image is a matrix of pixels (we'll learn more about this next week).
For now, we're going to flatten that matrix into one long vector of pixel values -- weird, right? But you'll be surprised at how well the neural network can learn from this.

In [None]:
fig, ax = plt.subplots(2, 4)
for idx in range(4):
    #undo the normalisation (mean 0.5, std 0.5) so pixel values are back between 0 and 1 for display
    train_image = train_dataset[idx][0].numpy().reshape((28, 28))*0.5 + 0.5
    ax[0, idx].imshow(train_image, cmap = 'gray')
    network_input = train_image.reshape(1, -1)
    ax[1, idx].imshow(network_input, cmap = 'gray', aspect='auto')

    ax[0, idx].set_axis_off()
    ax[1, idx].set_axis_off()
    
ax[0, 1].set_title('Images from dataset')
ax[1, 1].set_title('Input to the network -- a vector of pixel values')
plt.show()

print('Example input to the network without visualisation (first 100 values):')
print(train_dataset[0][0].flatten()[:100])

input_size = len(train_dataset[0][0].flatten())
print(f'Size of input: {input_size}')

# Part 1: A neural network with 1 hidden layer

In this section, I'm going to show you how to build a neural network with 1 hidden layer. Most of the code is done for you, but make sure to read through and understand what's going on, because in the next part you'll be following this process to create a deep neural network!

## Building the network
In the cell below, there's a class called *SimpleNet*.     
This is a neural network with 1 hidden layer, where the hidden layer has 128 neurons with a ReLU activation function. The output layer has 10 neurons and no activation function.

The class definitions for a PyTorch model generally follow this minimum structure:
1. an **\__init__()** function where we create the model architecture.
    1. **nn.Linear** creates a layer of linear neurons (no activation function - yet). This will be initialised with a weight matrix holding n×m values, where n is the size of the input to the layer and m is the output size of the layer (i.e. number of neurons in the layer), plus one bias value per neuron. PyTorch stores this matrix with shape (m, n).  
    Note the input arguments -- the first argument is the input size, the second argument is the number of neurons in the layer. I've initialised the first layer with an input size of 784 -- above, we saw that was the size of an image when it was flattened to a vector. The second layer has an output size of 10 -- this is because we have 10 classes. I chose the hidden layer to have 128 neurons (output size of 128) because it seemed like a reasonable middle ground between 784 and 10.  
    Read more about this layer type here: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
    2. **nn.ReLU()** creates a ReLU function - the non-linear activation function we will apply to all neurons after their linear operation.  
    Read more about this layer type here: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
2. a **forward()** function where we pass an input through our model and return its output.
    1. In the \__init__ function, we created the layers that make our architecture. Now we must apply them sequentially.
    2. Note how we use the ReLU in between the 2 linear layers.
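
To check the layer sizes described above, this small sketch builds the same two `nn.Linear` layers outside of a class and passes a dummy batch through them. The variable names and the random `dummy_batch` are only for illustration.

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(784, 128)   # 784 inputs -> 128 hidden neurons
fc2 = nn.Linear(128, 10)    # 128 hidden neurons -> 10 class scores
relu = nn.ReLU()

# PyTorch stores the weights with shape (output size, input size)
print(fc1.weight.shape, fc1.bias.shape)

# A dummy batch of 5 flattened "images" (random values, for shape checking only)
dummy_batch = torch.rand(5, 784)
scores = fc2(relu(fc1(dummy_batch)))
print(scores.shape)

# Total number of learnable parameters: weights plus biases of both layers
n_params = sum(p.numel() for layer in (fc1, fc2) for p in layer.parameters())
print(n_params)
```

The parameter count works out as 784×128 + 128 for the first layer and 128×10 + 10 for the second.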

In [None]:
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        y = self.fc2(x)

        return y

Now that we've defined the network, let's create an instance of it. We're also *hopefully* using a GPU node at the moment, so we'll load the network onto the GPU if we are. If not, don't worry -- the network will stay on the CPU.

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') #this line checks if we have a GPU available

#create network and load onto device
net = SimpleNet()
net.to(device)

print(net)

## Creating dataloaders

Dataloaders are PyTorch's way of handling data. They automatically split the dataset into batches of the size you want, and can also randomly shuffle the data for you if you choose.

1. *torch.utils.data.DataLoader()* You can read the documentation here: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
    1. First argument is the dataset.
    2. Optional argument *batch_size* is the number of samples in each batch.
    3. Optional argument *shuffle* controls whether the data is randomly reshuffled at the start of every epoch.
    4. Optional argument *num_workers* is how many subprocesses are used to load data from the dataset -- it can make loading the data faster.
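
To see the batching and shuffling in action, here is a minimal sketch on a toy dataset. The `toy_dataset` and `toy_loader` names and the random data are assumptions for this illustration; `num_workers=0` is used here only because the example is tiny.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 10 samples with 3 features each (a stand-in for illustration)
toy_dataset = TensorDataset(torch.rand(10, 3), torch.arange(10))

# num_workers=0 loads data in the main process, which is enough for a tiny example
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

batch_sizes = []
seen_labels = []
for inputs, labels in toy_loader:
    batch_sizes.append(len(inputs))
    seen_labels += labels.tolist()

# 10 samples in batches of 4 gives batches of 4, 4 and 2
print(batch_sizes)
# Shuffling changes the order, but every sample still appears exactly once per epoch
print(sorted(seen_labels))
```

Note that the last batch can be smaller than *batch_size* when the dataset size is not a multiple of it.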

In [None]:
batch_size = 128

trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size,
                                          shuffle=True, num_workers = 2)
valloader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size,
                                          shuffle=False, num_workers = 2)
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size,
                                          shuffle=False, num_workers = 2)

## Creating a loss function and optimizer

We're going to use Cross Entropy Loss (a very standard loss for classification tasks) and perform mini-batch Stochastic Gradient Descent.

Nothing too exciting here:
1. *nn.CrossEntropyLoss* - plain Cross Entropy Loss. You can read the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
2. *optim.SGD* - stochastic gradient descent! Documentation here: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
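    1. It will work with however many examples you pass in (anywhere from 1 to the entire dataset). 
    2. *lr* is the learning rate. Generally somewhere between 0.001 and 0.01 is a sensible first guess.
    3. *momentum* carries part of the previous update over into the current step; I've used 0.9 in the cell below.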
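
One detail worth seeing in code: `nn.CrossEntropyLoss` takes the raw class scores from the network and applies the softmax internally, which fits with the output layer having no activation function. The sketch below uses made-up scores for a batch of 2 examples and 3 classes and compares the result against computing the loss by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Made-up raw scores for 2 examples and 3 classes, with their true class labels
scores = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

loss = criterion(scores, labels)

# By hand: log-softmax over classes, pick the entry of the true class,
# negate, then average over the batch
log_probs = F.log_softmax(scores, dim=1)
manual_loss = -log_probs[torch.arange(2), labels].mean()

print(loss.item(), manual_loss.item())
```

The two printed values should agree, up to floating-point rounding.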

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

## Time to train!

In the cell below, we're training the neural network (NN) for 10 epochs. This should take about 2 minutes if you're on the GPU, and about 4 minutes if you're on the CPU.

Notice the general flow of the training process:
1. For every epoch:
    1. We iterate through each batch in our trainloader
        1. We separate the inputs from the ground-truth (GT) class labels
        2. Move inputs and GT labels to the GPU if available
        3. Do any remaining preprocessing to the data (in this case, flatten an image into a vector).
        4. Zero the parameter gradients -- this is to ensure we start 'fresh' for each step of gradient descent.
        5. Forward pass through the network to find the prediction
        6. Calculate the loss between the prediction and GT labels
        7. Calculate the gradients with a backwards pass
        8. Change the parameters based on the gradients, using the optimizer
        9. Record any data we want to save
        
This is the general flow for the training process, with some important elements missing... before we get to that, run the training process, and observe the loss curve produced at the end.

In [None]:
losses = {'train': []}
total_epochs = 10

for epoch in tqdm.tqdm(range(total_epochs), desc="Training progress"):    
    
    train_loss = []
    
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        
        #move the inputs and labels to the GPU if available
        inputs = inputs.to(device)
        labels = labels.to(device)
        
        #flatten the images into a vector
        inputs = inputs.reshape(len(inputs), -1)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass to find the outputs
        outputs = net(inputs)
        
        #calculate the loss
        loss = criterion(outputs, labels)
        
        #backward pass to calculate the gradient
        loss.backward()
        
        #take a step with gradient descent to change the parameters
        optimizer.step()

        #let's keep track of the loss
        train_loss += [loss.cpu().item()]

    #record the mean loss over the entire epoch
    epoch_loss = np.mean(train_loss)
    losses['train'] += [epoch_loss]
    
#let's plot the loss for each epoch to observe how it changes over 10 epochs
plt.plot(losses['train'], label = 'Train')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

### Your turn!
There are a number of key elements missing from the cell above. I've pasted the same code below. One-by-one, add each of these elements to the below cell. At the end, you should produce 2 plots (one for loss, one for accuracy) that look similar to the ones below.

<p align="center">
  <img src="figures/simplenet_curves.png" />
</p>

1. There is currently no regularisation being applied to the weights. The optimizer has a useful argument that does this for us -- look at the documentation for the SGD optimizer and the weight_decay argument. Use a value between 0.0001 and 0.001.
2. The loss itself is not very interpretable. Create another plot that shows the average accuracy on the training data at each epoch.
    1. You will need to find the predicted class by using the maximum class score in outputs for each input. The torch.argmax() function will be useful.
    2. You will need to compare the predicted class to the GT labels stored in the labels variable.
3. Comparison to the validation dataset in each epoch.
    1. After the loop that iterates over the trainloader, add a loop that iterates over the valloader.
    **It is critical that you do not perform a backward pass or optimizer step on the validation data. We do not want to learn from the validation data, only measure performance.**
    2. Store the loss and accuracy for the validation data, and plot them on the same plots as the training data.

After you've done the above, consider: At what point has the model converged?
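
As a starting point for items 2 and 3, here is one possible way to measure loss and accuracy without learning from the data. The helper name `evaluate`, the tiny `toy_net` and the random `toy_loader` are assumptions made for this sketch, not part of the notebook. `torch.no_grad()` switches off gradient tracking, and there is no `backward()` or `optimizer.step()` call anywhere in the helper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(net, loader, criterion, device):
    """Return (mean batch loss, accuracy) over a loader without updating the model."""
    batch_losses = []
    n_correct = 0
    n_total = 0
    with torch.no_grad():  # no gradients are needed when only measuring
        for inputs, labels in loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            inputs = inputs.reshape(len(inputs), -1)  # flatten, as in training

            outputs = net(inputs)
            batch_losses.append(criterion(outputs, labels).item())

            # predicted class = index of the largest class score
            predictions = torch.argmax(outputs, dim=1)
            n_correct += (predictions == labels).sum().item()
            n_total += len(labels)
    return sum(batch_losses) / len(batch_losses), n_correct / n_total

# Toy stand-ins so the sketch runs on its own (random data, untrained model)
torch.manual_seed(0)
device = torch.device('cpu')
toy_net = nn.Linear(784, 10).to(device)
toy_data = TensorDataset(torch.rand(20, 1, 28, 28), torch.randint(0, 10, (20,)))
toy_loader = DataLoader(toy_data, batch_size=8, shuffle=False)

toy_loss, toy_accuracy = evaluate(toy_net, toy_loader, nn.CrossEntropyLoss(), device)
print(f'Loss: {toy_loss:.3f}, accuracy: {toy_accuracy:.2%}')
```

In the training cell, a call like this on the valloader would fit after the loop over the trainloader in each epoch, with the returned values appended to the lists you plot.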


In [None]:
#Re-create network and load onto device
net = SimpleNet()
net.to(device)

#because we have re-created the network, we need to re-initialise the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

losses = {'train': []}

total_epochs = 10

for epoch in tqdm.tqdm(range(total_epochs), desc="Training progress"):    
    
    train_loss = []
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        
        #move the inputs and labels to the GPU if available
        inputs = inputs.to(device)
        labels = labels.to(device)
        
        #flatten the images into a vector
        inputs = inputs.reshape(len(inputs), -1)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass to find the outputs
        outputs = net(inputs)
        
        #calculate the loss
        loss = criterion(outputs, labels)
        
        #backward pass to calculate the gradient
        loss.backward()
        
        #take a step with gradient descent to change the parameters
        optimizer.step()

        #let's keep track of the loss
        train_loss += [loss.cpu().item()]

    #record the mean loss over the entire epoch
    epoch_loss = np.mean(train_loss)
    losses['train'] += [epoch_loss]
    
    
#let's plot the loss for each epoch to observe how it changes over 10 epochs
plt.plot(losses['train'], label = 'Train')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Deep Neural Networks
## Build a better model
Building off the code from above, create a deep neural network with at least 4 fully connected layers.

As a starting point, you may choose to use:

However, the choice is yours if you want to try something different!

Train your model and explore these different variables, observing how they influence the attainable validation accuracy. 
1. Experiment with changing the learning rate. Observe how it changes performance.
    1. Experiment with changing the learning rate while learning, e.g. after 10 epochs, re-initialise the optimizer with a lower learning rate.
2. Try changing the network depth and layer widths and observe any changes in performance.
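
For the learning-rate experiment, re-initialising the optimizer is one option. As an alternative (not what the exercise text describes), PyTorch's `optim.lr_scheduler.StepLR` can lower the learning rate on a schedule while keeping the optimizer's momentum state. The sketch below uses a placeholder model and no real training, just to show how the learning rate changes; the `step_size` and `gamma` values are arbitrary choices for illustration.

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model, only here so the optimizer has parameters to manage
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Multiply the learning rate by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(12):
    # record the learning rate used during this epoch
    lrs.append(optimizer.param_groups[0]['lr'])

    # ... the usual loop over the trainloader would go here ...
    optimizer.step()   # stands in for the training steps of this epoch

    scheduler.step()   # called once per epoch, after the optimizer steps

print(lrs)
```

The first 10 epochs run at a learning rate of 0.01, and from the 11th epoch onwards it drops to 0.001.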


In [None]:
#### Update this model #####

class DeepNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        y = self.fc2(x)

        return y

In [None]:
#### Enter your training code from earlier here #####
#### Make sure to use the new DeepNet class #####

#Re-create network and load onto device
net = DeepNet()
net.to(device)


## Evaluate your best model on the test dataset
Once you're satisfied that you've found the optimal hyperparameters, train your model with those hyperparameters (making sure to stop before overfitting) and test for accuracy on the test dataset.

The best model I could get was 89.75%. 

In [None]:
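### Enter your code for testing accuracy with the testloader here.
test_accuracy = None #placeholder name -- replace None with the accuracy you compute on the testloader

print(f'Accuracy on test dataset: {test_accuracy}')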