<h1>Residual and Skip Connections</h1>
As we've seen so far, neural networks can learn a lot of interesting things! But most of the data we have used has been fairly simple. In this lab we are going to train on something a bit more complicated: the CIFAR10 dataset. CIFAR10 images are much more complex than MNIST images, and even though they are only 3x32x32, each one holds about 4x as many values as a 28x28 MNIST image! Now imagine using high-resolution images!<br>
So let's just use bigger neural networks, right? In general there are two ways to increase the size of the networks we have seen so far: increasing the width (parameters per layer) or increasing the depth (number of layers).<br>
So which is better?<br>
Well..... it's complicated<br>
Empirically, increasing a model's width improves performance on a validation set up to a point. After that the model, with its huge number of parameters, starts to overfit the training set and performance on the validation set DECREASES without ever getting close to 100%. Increasing the DEPTH of a model has instead often been found to be far more effective. The verdict is STILL out on why this is, but theories include:<br>
-Every layer performs an independent "operation" (like a step in a program), so more steps are better<br>
-Information is "distilled" from layer to layer, so each layer receives a refined version of the input and is less able to overfit<br>
-Adding a new layer creates more paths for the data to flow to the output than adding more width does
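
To make the width/depth distinction concrete, here is a small sketch that counts the parameters of a fully connected network grown in each direction. The helper `mlp_param_count` and the layer sizes are made up for illustration (only the 3072 input values and 10 classes come from CIFAR10), and the counts say nothing about which network would actually perform better.

```python
def mlp_param_count(layer_sizes):
    """Count the weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# Hypothetical networks for a flattened 3x32x32 image and 10 classes
base = [3072, 256, 10]          # one hidden layer
wider = [3072, 512, 10]         # double the width of the hidden layer
deeper = [3072, 256, 256, 10]   # add a second hidden layer instead

for name, sizes in (("base", base), ("wider", wider), ("deeper", deeper)):
    print(name, mlp_param_count(sizes))
```

In this toy setup doubling the width roughly doubles the parameter count, while adding a layer of the same width adds comparatively few parameters.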

So we'll just add more layers!! Well... it's not that simple

![alt text](https://cdn-images-1.medium.com/max/1000/1*aqmUx_ONo8KqKNEYsjM8eA.png)

[Why ResNets?](https://mc.ai/what-are-deep-residual-networks-or-why-resnets-are-important/)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import DataLoader as dataloader
import os
import random
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output

In [None]:
#the size of our mini batches
batch_size     = 64
#How many iterations of our dataset
num_epochs     = 10
#optimizer learning rate
learning_rate  = 1e-4
#initialise what epoch we start from
start_epoch    = 0
#initialise best valid accuracy 
best_valid_acc = 0
#where to load/save the dataset from 
data_set_root = "data"

start_from_checkpoint = False
save_dir = 'Models'
model_name = 'Res_net'

In [None]:
#Set device to GPU_indx if GPU is available
GPU_indx = 0
device = torch.device(GPU_indx if torch.cuda.is_available() else 'cpu')

<h3> Create a transform for the input data </h3>

In [None]:
#Prepare a composition of transforms
#transforms.Compose will perform the transforms in order
#NOTE some transforms only take in a PIL image, others only a Tensor
#E.g. Resize and ToTensor take in a PIL Image, Normalize takes in a Tensor
#Refer to documentation
#https://pytorch.org/docs/stable/torchvision/transforms.html
transform = transforms.Compose([
            transforms.Resize(32),
            transforms.ToTensor()])

<h3> Create the training, testing and validation data</h3>

In [None]:
#Let's use the CIFAR10 dataset!
train_data = datasets.CIFAR10(data_set_root, train=True, download=True, transform=transform)
test_data = datasets.CIFAR10(data_set_root, train=False, download=True, transform=transform)

#We are going to split the test dataset into a test and validation set 50/50
validation_split = 0.5

#Determine the number of samples for each split
n_test_examples = int(len(test_data)*validation_split)
n_valid_examples = len(test_data) - n_test_examples

#The function random_split will take our dataset and split it randomly and give us datasets
#that are the sizes we gave it
#Note: we can split it into more than two pieces!
test_data, valid_data = torch.utils.data.random_split(test_data, [n_test_examples, n_valid_examples])

<h3> Create the dataloader</h3>

In [None]:
#Create the training, Validation and Evaluation/Test Datasets
#It is best practice to separate your data into these three Datasets
#Though depending on your task you may only need Training + Evaluation/Test or maybe only a Training set
#(It also depends on how much data you have)
#https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
train_loader = dataloader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = dataloader(valid_data, batch_size=batch_size)
test_loader  = dataloader(test_data, batch_size=batch_size)

<h3> Check the lengths of all the datasets</h3>

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

<h3>Let's visualize some data!</h3>
CIFAR10 is a dataset of 32*32 colour images with 10 classes of animals and vehicles!

In [None]:
#create a dataloader iterable object
dataiter = iter(train_loader)
#sample from the iterable object
#(the built-in next() works across PyTorch versions, dataiter.next() was removed in newer releases)
images, labels = next(dataiter)

In [None]:
#Let's visualise an entire batch of images!
plt.figure(figsize = (20,10))
out = torchvision.utils.make_grid(images, 8)
plt.imshow(out.numpy().transpose((1, 2, 0)))

<h2>Creating Deep Networks</h2>
So we'll just make our Networks deeper!<br>
Well, it's not that simple. Not only does adding more layers make our model more sequential (rather than parallel, meaning forward and backward passes are slower), but we now face the problem of "Vanishing Gradients".<br>
When we create larger and larger networks, something funny happens when we try to train them: the gradients that are backpropagated from the output become tiny (near zero) for the layers closest to the input. They seem to "vanish"! But why!? In most models gradients become smaller as they backpropagate through the network. This is easiest to understand by looking at our network's parameters and thinking about how gradients are backpropagated. By the chain rule, the gradient reaching a layer is a product of terms from every layer after it, and each term involves that layer's weights (and the slope of its activation function). As the weights of our models are typically small (much less than one in magnitude), multiplying many of these terms together gives a VERY small result. The problem gets worse the deeper the network is! As a result the layers closest to the input barely move from their random initialisations and in effect aren't trained!
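
A quick way to see this effect is to measure the gradient that reaches the first layer of a plain network as it gets deeper. The sketch below is only an illustration under some assumptions: it uses small fully connected layers with sigmoid activations (not the convolutional blocks of this lab), and the helper `first_layer_grad_norm`, the width, and the depths are made up for the demonstration.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, width=32, seed=0):
    """Norm of the gradient at the first layer's weights of a plain sigmoid MLP."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    # A random batch stands in for real data, the sum of the outputs stands in for a loss
    x = torch.randn(16, width)
    net(x).sum().backward()
    return net[0].weight.grad.norm().item()

# The gradient reaching the first layer shrinks rapidly as the network gets deeper
for depth in (2, 8, 16):
    print(f"depth {depth:2d}: first layer gradient norm {first_layer_grad_norm(depth):.3e}")
```

The exact numbers depend on the initialisation and activation function, but the trend (smaller gradients at the first layer for deeper plain networks) is the point.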

<h3>Enter the Skip and Residual Connection!</h3>
Skip and Residual connections let us have our deep networks and train them too!<br>
So what are they?<br>
In simple terms, we take the output of some layer, "skip" some number of layers, and combine it with the output of a much later layer. One result of this is that, during backpropagation, the gradients have a shorter minimum path to the layers near the input, reducing the impact of vanishing gradients!<br>
There are a couple of ways to combine hidden layers: adding them together or concatenating the tensors.<br>
Adding the hidden layers together (often called a Residual Connection) means that the layers must be the same size, which has not been the case for the networks we've seen so far (size usually decreases). However, with residual connections we don't necessarily need to add the hidden layers directly. For example, we can take a hidden layer and skip two layers by passing it through a single layer (that transforms it to the right size), halving the length of the path for gradients.<br>
Concatenating hidden layers involves simply "sticking together" the tensors. This not only helps with the vanishing gradient problem but also helps information from the input penetrate deeper into the network.
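
The two ways of combining hidden layers can be sketched with a few tensors. Everything here is illustrative: the tensor names and sizes are made up, and the 1x1 convolution used as the "side layer" is just one common choice for matching shapes, not necessarily the one used later in this lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two hypothetical hidden feature maps: batch of 4, 8 channels, 16x16 pixels
h_early = torch.randn(4, 8, 16, 16)
h_late = torch.randn(4, 8, 16, 16)

# Residual connection: element-wise addition, so the shapes must match exactly
added = h_early + h_late
print(added.shape)         # torch.Size([4, 8, 16, 16])

# Skip connection by concatenation along the channel dimension (dim=1)
concatenated = torch.cat((h_early, h_late), dim=1)
print(concatenated.shape)  # torch.Size([4, 16, 16, 16])

# If the later layer has a different number of channels, a single "side layer"
# can transform the early features to the right size before adding
h_wide = torch.randn(4, 16, 16, 16)
project = nn.Conv2d(8, 16, kernel_size=1)
added_projected = project(h_early) + h_wide
print(added_projected.shape)  # torch.Size([4, 16, 16, 16])
```

Addition keeps the number of channels fixed, while concatenation grows it, which is why concatenating blocks often reduce the channel count internally first.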


![alt text](https://miro.medium.com/max/1140/1*D0F3UitQ2l5Q0Ak-tjEdJg.png)
A simple "Identity" residual connection
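
To get a feel for why an identity connection like this helps, the sketch below compares the gradient that reaches the input of a plain stack of layers with the same stack where each layer's input is added back to its output. It is a toy setup under stated assumptions: fully connected sigmoid layers rather than the blocks built in this lab, and the helper `input_grad_norm` and all sizes are made up for illustration.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth, residual, width=32, seed=0):
    """Norm of the gradient at the input of a stack of sigmoid layers."""
    torch.manual_seed(seed)
    layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
    x = torch.randn(16, width, requires_grad=True)

    h = x
    for layer in layers:
        out = torch.sigmoid(layer(h))
        # With residual=True the layer input is added back to its output,
        # so the gradient always has a direct path around the layer
        h = h + out if residual else out

    # The sum of the outputs stands in for a loss
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(depth=16, residual=False)
skip = input_grad_norm(depth=16, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

With the identity path each layer contributes a factor of roughly "one plus something small" to the gradient rather than just "something small", so the product no longer collapses towards zero.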

<h3>Modules in Modules</h3>
To simplify the creation of our residual and skip networks we will create separate nn.Module classes for the skip and residual "blocks" and then build our "top level" network out of them!<br>
NOTE: For simplicity all these blocks return an output the same size as their input, though this does not have to be the case!<br>
NOTE: We also introduce "Batch Normalisation" layers here. For more info:

[Batch Normalisation](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c)
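
As a rough illustration of what a Batch Normalisation layer does in training mode, the sketch below passes a badly scaled random batch through `nn.BatchNorm2d` and checks the statistics of each channel afterwards. The input tensor and its scale and offset are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A hypothetical batch of feature maps with a large scale and a non-zero mean
x = torch.randn(64, 3, 8, 8) * 5.0 + 2.0

bn = nn.BatchNorm2d(3)
bn.train()  # in train mode the statistics of the current batch are used
y = bn(x)

# Statistics are computed per channel, over the batch and spatial dimensions
per_channel_mean = y.mean(dim=(0, 2, 3))
per_channel_std = y.std(dim=(0, 2, 3), unbiased=False)

print(per_channel_mean)  # each entry is close to 0
print(per_channel_std)   # each entry is close to 1
```

After `bn.eval()` the layer stops using the current batch and normalises with the running statistics it accumulated during training, which is why switching between `net.train()` and `net.eval()` matters for networks that contain these layers.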

In [None]:
#First block demonstrates a simple identity residual connection

############TO DO#############
#One possible solution, following the layer descriptions in the comments
##############################
class Res_block(nn.Module):
    def __init__(self, channels):
        #Call the __init__ function of the parent nn.Module class
        super(Res_block, self).__init__()
        
        #Conv2d channels -> channels//2, kernel 3x3, stride 1, padding 1
        self.conv1 = nn.Conv2d(channels, channels//2, kernel_size=3, stride=1, padding=1)
        #Batch normalisation layer, channels in = channels out
        self.bn1 = nn.BatchNorm2d(channels//2)
        #Conv2d channels//2 -> channels, kernel 3x3, stride 1, padding 1
        self.conv2 = nn.Conv2d(channels//2, channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        
    def forward(self, x):
        #Store the input
        x0 = x
        #Perform conv1, batchnorm1, relu
        x = F.relu(self.bn1(self.conv1(x)))
        #Perform conv2, batchnorm2
        x = self.bn2(self.conv2(x))
        #Add the input to the output of batchnorm2
        x = x + x0
        #relu the output
        x = F.relu(x)
        
        return x
#################################
#################################
#Second block demonstrates how we can use a "side layer" in our residual block to
#change the shape of the tensors so they match later layers
#The channels change in this case but you could also create one where the feature map size changes
class ResDown_block(nn.Module):
    def __init__(self, channels_in, channels_out):
        #Call the __init__ function of the parent nn.module class
        super(ResDown_block, self).__init__()
        #how to handle channel width change
        self.conv1 = nn.Conv2d(channels_in, channels_out, kernel_size=3, stride=1, padding = 1)
        self.conv2 = nn.Conv2d(channels_out, channels_out, kernel_size=3, stride=1, padding = 1)
        self.bn1 = nn.BatchNorm2d(channels_out)
        
        self.convRes = nn.Conv2d(channels_in, channels_out, kernel_size=3, stride=1, padding = 1)
        self.bn2 = nn.BatchNorm2d(channels_out)

    def forward(self, x):
        x0 = self.convRes(x)
        
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
         
        x = F.relu(x + x0)
        return x

#Third block is a simple skip connection
#The layers reduce the number of channels to half that of the input
#and then the first hidden layer (x1) is concatenated with the last output (x2) along the channels
#creating a tensor that is the same shape as the input
class Skip_block(nn.Module):
    def __init__(self, channels):
        #Call the __init__ function of the parent nn.module class
        super(Skip_block, self).__init__()
        
        self.conv1 = nn.Conv2d(channels, channels//2,  kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels//2)
        self.conv2 = nn.Conv2d(channels//2, channels//2,  kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels//2)
        
    def forward(self, x):
        x1= F.relu(self.bn1(self.conv1(x)))
        x2= F.relu(self.bn2(self.conv2(x1)))
        return torch.cat((x1,x2),1)

#We will use the above blocks to create a "Deep" neural network with many layers!
class Deep_CNN(nn.Module):
    def __init__(self, channels_in, num_blocks = 2, layer_type = Res_block):
        #Call the __init__ function of the parent nn.module class
        super(Deep_CNN, self).__init__()
        self.conv1 = nn.Conv2d(channels_in, 16, kernel_size=4, stride=2)
        #Batch normalisation is very common in deep learning 
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, stride=2)
        self.bn3 = nn.BatchNorm2d(64)

        self.layers = self.create_blocks(num_blocks, layer_type, channels = 64)

        self.linear1 = nn.Linear(64*3*3, 10)

    #This function will create a nn.Sequential block from a list of Pytorch layers
    #A forward pass through the Sequential block will perform a forward pass
    #through the layers in the order they appear in the list
    def create_blocks(self, num_blocks, block_type, channels = 64):
        blocks = []
        
        #We will add some number of the res/skip blocks!
        #NOTE: block_type is called with a single argument, so this works for
        #Res_block and Skip_block; ResDown_block needs channels_in and channels_out
        for _ in range(num_blocks):
            blocks.append(block_type(channels))

        return nn.Sequential(*blocks)
        
    def forward(self, x):
        #Pass input through conv layers
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        
        #Pass through the block of res/skip blocks!
        x = self.layers(x)
        
        #Flatten it for the final linear layer!
        x = x.view(x.shape[0], -1)
        
        #Output the class activations!
        x = self.linear1(x)

        return x

<h3>Creating our Network</h3>
When creating an instance of our network we will also specify the type of block we will use!<br>
The next bit of code should be familiar to you. Try experimenting with the different layer types and see the different results!
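
Before creating the network, it can help to check where the `64*3*3` input size of the final linear layer in `Deep_CNN` comes from. The sketch below applies the standard output-size formula for a convolution (without dilation) to the three strided convolutions; the helper name `conv_out_size` is made up for this example.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 32                                            # CIFAR10 images are 32x32
size = conv_out_size(size, kernel_size=4, stride=2)  # conv1 -> 15
size = conv_out_size(size, kernel_size=3, stride=2)  # conv2 -> 7
size = conv_out_size(size, kernel_size=3, stride=2)  # conv3 -> 3

# The 3x3 convolutions inside the blocks use stride 1 and padding 1,
# so they leave the feature map size unchanged
block_size = conv_out_size(size, kernel_size=3, stride=1, padding=1)

# 64 channels of 3x3 feature maps are flattened for the linear layer
flattened = 64 * block_size * block_size
print(size, block_size, flattened)
```

If you change the input resolution, kernel sizes or strides, the flattened size changes too and the linear layer needs to be updated to match.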

In [None]:
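#create an instance of our network
#set channels_in to the number of channels of the dataset images (3 for CIFAR10)
#One possible solution: two residual blocks, moved to the device we are training on
net = Deep_CNN(channels_in = 3, num_blocks = 2, layer_type = Res_block).to(device)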

In [None]:
#Lets have a look at our network structure!
net

In [None]:
#pass image through network
out = net(images.to(device))
#check output
out.shape

In [None]:
#Pass our network parameters to the optimiser set our lr as the learning_rate
optimizer = optim.Adam(net.parameters(), lr = learning_rate)
#Define a Cross Entropy Loss
Loss_fun = nn.CrossEntropyLoss()

In [None]:
#Create Save Path from save_dir and model_name, we will save and load our checkpoint here
Save_Path = os.path.join(save_dir, model_name + ".pt")

#Create the save directory if it does not exist
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

#Load Checkpoint
if start_from_checkpoint:
    #Check if checkpoint exists
    if os.path.isfile(Save_Path):
        #load Checkpoint
        #map_location makes sure the tensors are loaded onto the device we are using
        check_point = torch.load(Save_Path, map_location=device)
        #Checkpoint is saved as a python dictionary
        #https://www.w3schools.com/python/python_dictionaries.asp
        #here we unpack the dictionary to get our previous training states
        net.load_state_dict(check_point['model_state_dict'])
        optimizer.load_state_dict(check_point['optimizer_state_dict'])
        start_epoch = check_point['epoch']
        best_valid_acc = check_point['valid_acc']
        print("Checkpoint loaded, starting from epoch:", start_epoch)
    else:
        #Raise Error if it does not exist
        raise ValueError("Checkpoint Does not exist")
else:
    #If checkpoint does exist and start_from_checkpoint = False
    #Raise an error to prevent accidental overwriting
    if os.path.isfile(Save_Path):
        raise ValueError("Warning Checkpoint exists")
    else:
        print("Starting from scratch")

In [None]:
def calculate_accuracy(fx, y):
    preds = fx.max(1, keepdim=True)[1]
    correct = preds.eq(y.view_as(preds)).sum()
    acc = correct.float()/preds.shape[0]
    return acc

In [None]:
#This function should perform a single training epoch using our training data
def train(net, device, loader, optimizer, Loss_fun, loss_logger):
    
    #initialise counters
    epoch_loss = 0
    epoch_acc = 0
    
    #Set Network in train mode
    net.train()
    
    for i, (x, y) in enumerate(loader):
        
        #load images and labels to device
        x = x.to(device) # x is the image
        y = y.to(device) # y is the corresponding label
                
        #Forward pass of image through network and get output
        fx = net(x)
        
        #Calculate loss using loss function
        loss = Loss_fun(fx, y)
        
        #calculate the accuracy
        acc = calculate_accuracy(fx, y)

        #Zero Gradients
        optimizer.zero_grad()
        #Backpropagate Gradients
        loss.backward()
        #Do a single optimization step
        optimizer.step()
        
        #create the cumulative sum of the loss and acc
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        #log the loss for plotting
        loss_logger.append(loss.item())

        #clear_output is a handy function from the IPython.display module
        #it simply clears the output of the running cell
        
        clear_output(True)
        print("TRAINING: | Iteration [%d/%d] | Loss %.2f |" %(i+1 ,len(loader) , loss.item()))
        
    #return the average loss and acc from the epoch as well as the logger array       
    return epoch_loss / len(loader), epoch_acc / len(loader), loss_logger

In [None]:
#This function should perform a single evaluation epoch and will be passed our validation or evaluation/test data
#it WILL NOT be used to train our model
def evaluate(net, device, loader, Loss_fun, loss_logger = None):
    
    epoch_loss = 0
    epoch_acc = 0
    
    #Set network in evaluation mode
    #Layers like Dropout will be disabled
    #Layers like Batchnorm will stop updating their running mean and variance
    #and use the currently stored values instead
    net.eval()
    
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            
            #load images and labels to device
            x = x.to(device)
            y = y.to(device)
            
            #Forward pass of image through network
            fx = net(x)
            
            #Calculate loss using loss function
            loss = Loss_fun(fx, y)
            
            #calculate the accuracy
            acc = calculate_accuracy(fx, y)
            
            #log the cumulative sum of the loss and acc
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
            #log the loss for plotting if we passed a logger to the function
            if loss_logger is not None:
                loss_logger.append(loss.item())
            
            clear_output(True)
            #report the running average accuracy over the batches seen so far
            print("EVALUATION: | Iteration [%d/%d] | Loss %.2f | Accuracy %.2f%% |" %(i+1 ,len(loader), loss.item(), 100*(epoch_acc/(i+1))))
    
    #return the average loss and acc from the epoch as well as the logger array       
    return epoch_loss / len(loader), epoch_acc / len(loader), loss_logger

In [None]:
#This cell implements our training loop
training_loss_logger = []
validation_loss_logger = []

for epoch in range(start_epoch, num_epochs):
    
    #call the training function and pass training dataloader etc
    train_loss, train_acc, training_loss_logger = train(net, device, train_loader, optimizer, Loss_fun, training_loss_logger)
    
    #call the evaluate function and pass validation dataloader etc
    valid_loss, valid_acc, validation_loss_logger = evaluate(net, device, valid_loader, Loss_fun, validation_loss_logger)

    #If this model has the highest performance on the validation set 
    #then save a checkpoint and remember the new best accuracy
    #{} define a dictionary, each entry of the dictionary is indexed with a string
    if (valid_acc > best_valid_acc):
        print("Saving Model")
        best_valid_acc = valid_acc
        torch.save({
            'epoch':                 epoch,
            'model_state_dict':      net.state_dict(),
            'optimizer_state_dict':  optimizer.state_dict(), 
            'train_acc':             train_acc,
            'valid_acc':             valid_acc,
        }, Save_Path)
    
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:05.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:05.2f}% |')

In [None]:
plt.figure(figsize = (10,10))
train_x = np.linspace(0, num_epochs, len(training_loss_logger))
plt.plot(train_x, training_loss_logger, c = "y")
valid_x = np.linspace(0, num_epochs, len(validation_loss_logger))
plt.plot(valid_x, validation_loss_logger, c = "k")

plt.title(model_name)
plt.legend(["Training Loss", "Validation Loss"])


In [None]:
#call the evaluate function and pass the evaluation/test dataloader etc
test_loss, test_acc, _ = evaluate(net, device, test_loader, Loss_fun)
print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:05.2f}% |')