<img src="https://drive.google.com/uc?id=1YLNtm8gNsviTEnVXzfiby2VMKrc0XzLP" width="500"/>

---


# **Recap (1)**

#### **Morning contents/agenda**

1. Overview of contents covered

2. A design and training guide

3. Key network operations

4. U-Net

5. Backpropagation

<br>

#### **Afternoon contents/agenda**

1. Building a U-Net from scratch

<br/>

---

<br/>

## 1. Overview of contents covered

The basic ingredients of Deep Learning are what we covered during Week 1:

1. Fully-connected/Dense/Feed-forward layers (you might also see these called Multi-Layer Perceptron or MLP)
2. Convolutional layers - usually the preferred choice over other types of layers when images are involved
3. Max Pooling layers
4. Transposed convolution layers (and other un-pooling methods)

<br>

To these, we add specific regularisation methods:

1. Dropout
2. Batch normalisation
3. Layer normalisation

<br>

Every model is then composed using these operations, depending on the type of data used and the task being solved:

1. Classification problems - usually solved with FFNs and CNNs
2. Domain-mapping problems (e.g. segmentation) - usually solved using architectures like U-Nets
3. Generative problems - which we will see in Week 2
4. Sequential-data processing/generation - which we will see in Week 3

<br>

---

<br>

## 2. A design and training guide

There are a huge number of hyperparameters in Deep Learning, and it can be daunting to know where to start to tune them appropriately. It is impossible to give you hard and fast rules to decide which parameters to tune, because they are always task and dataset dependent, and they largely require trial and error. However, I will try to give you some ideas to guide you in your designs.

<br>

### **Before starting**

**Before** starting to design any networks or testing any code, we should always do the following:

1. **Explore your dataset** to figure out the type of data you have, how it is organised, and how many data samples it contains. This is a good time to familiarise yourself with the dataset, look for outliers like corrupted images, and understand the statistical properties of your samples. Larger, more diverse datasets will require deeper, more feature-rich networks. Very small datasets might not be suitable for Deep Learning or might require using transfer learning.

2. **Understand the problem** that you are trying to solve. What will be the inputs of your model? What will your model output? Put another way, what type of problem is it: classification, regression, dimensionality reduction, generative modelling? Write these down on a piece of paper before attempting to write any code.

3. **Research existing solutions** to similar tasks. I would recommend exploring blog posts, the latest literature, GitHub repositories, and examples/tutorials in PyTorch, TensorFlow, and Keras. Based on this research, you should find and keep notes of: 1) how have the authors prepared or processed their dataset? 2) what evaluation metrics have the authors used to measure model performance? 3) what model architecture (and hyperparameters in general) have the authors used? If possible, rank the models you consider suitable by level of complexity.

4. **Write the pipeline code** to: 1) load and prepare the dataset and the training, validation and test sets; 2) run the training, validation and test loops for any possible model; 3) prepare a series of metrics that you will use to evaluate your models (and that you will not use for training).

5. **Establish a procedure to keep track of results**. You will likely be running many different tests with your models, so having a streamlined and reliable way to keep track of these will be critical. This could be by using a notebook, a spreadsheet or something more advanced like [Weights & Biases](https://wandb.ai/site/).

<br>

#### *Determining network inputs and outputs*

Network inputs are determined by the task and architecture:

1. FFN-based architectures require flat vectors as inputs `(batch, input_dim)`
2. CNN-based architectures require image-like inputs, including a channel dimension `(batch, channel, H, W)`

<br>

If the network deals with sequential data, an additional dimension will be required (we will see this in Week 3):

1. For FFN-based architectures, including Transformers `(batch, seq_length, input_dim)`
2. For CNN-based architectures `(batch, seq_length, channel, H, W)`

<br>

Network outputs are determined by the specific task (we will see some of these in Week 2 and Week 3):

1. For **classification tasks**, the output should be a vector of probabilities with one entry per class

2. For **generative or regression tasks**, the output should have dimensions determined by the data type: `(batch, output_dim)` for vectorised data and `(batch, channel, H, W)` for image-like data

3. For **discrete generative tasks**, such as text generation, the output should be a vector of probabilities with one entry per possible discrete output (e.g. per possible word in text generation)

4. A `seq_length` dimension can be added for **sequential outputs**: `(batch, seq_length, output_dim)` for vectorised data and `(batch, channel, seq_length, H, W)` for image-like data
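
As a quick check of the shapes listed above, here is a minimal sketch in PyTorch. The sizes (`784`, `28`, 10 classes) and the tiny layers are assumptions chosen purely for illustration, not a recommended architecture.

```python
import torch
import torch.nn as nn

batch, n_classes = 8, 10  # hypothetical sizes for illustration

# FFN-based: flat vectors of shape (batch, input_dim)
x_flat = torch.randn(batch, 784)
ffn = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, n_classes))
logits = ffn(x_flat)                   # shape (batch, n_classes)
probs = torch.softmax(logits, dim=1)   # one probability per class, rows sum to 1

# CNN-based: image-like inputs of shape (batch, channel, H, W)
x_img = torch.randn(batch, 1, 28, 28)
cnn = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
feature_maps = cnn(x_img)              # shape (batch, 4, 28, 28)

print(logits.shape, probs.sum(dim=1), feature_maps.shape)
```

Printing shapes like this before writing any training code is a cheap way to confirm that the inputs and outputs match what the task requires.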

<br>

<center>
<img src="https://drive.google.com/uc?id=1nisb0AaclwvxsyFDWvrV6U89Z_9sBeIZ" width="800"/>
</center>

<br><br>


### **Bootstrapping the design**

There are a number of choices to make in order to bootstrap your design:

1. The first step is always to **choose a base model** from which to work. To do this, use the research that you have conducted before. **Always start with the simplest model** before trying the latest state-of-the-art (even if that simple model is a CNN with 3 layers). Once the first model on the list works, you can then start going down your ranking to understand how each of the models performs. This way you will have a baseline against which to benchmark any improvements you make.

2. To **select an optimiser**, start with a simple, popular optimiser: I would always recommend starting with Adam with default parameters. If the paper you are reproducing has used a different optimiser, you might also want to try the optimiser they suggest with the parameters that they recommend.

3. The **loss function** can be initially chosen based on the problem you're trying to solve: MSE for regression problems and Cross-Entropy for classification problems. If the paper you are reproducing suggests a different loss function, and this loss function is already implemented or easily implemented, you can also start using that instead.

4. The **batch size** is often chosen as the largest one supported by the available hardware and, as a rule of thumb, usually falls in the range 32-512. However, large batch sizes can also produce more unstable training, as suggested [here](https://arxiv.org/abs/1804.07612), so you should watch out for that. The batch size mainly governs the training speed and shouldn't be used to directly tune the validation set performance.

5. The **learning rate** is critical to training the model. A learning rate that is too small shows up as slow progress in the loss over epochs. A learning rate that is too large shows up as a loss that becomes unstable or erratic, or that diverges. A reasonable procedure to find a learning rate is to start with a small number like `1e-6` and progressively increase the rate by multiplying it by `10` until the training diverges. You can use the last learning rate value before it started diverging.

6. Choose the **number of epochs** by balancing the need to see some convergence of the training against computational efficiency. You will likely be running a significant number of tests, so you don't want each test to take hours to run. Always try to fail fast and leave the long training until after you have found a good set of hyperparameters. We often start with epochs in the range 5-100 depending on the problem.
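
The learning-rate procedure in step 5 can be sketched as a small sweep. This is only an illustration under stated assumptions: `lr_sweep` is a hypothetical helper, the model is a toy linear regressor, plain SGD is used instead of Adam because its divergence is easier to provoke on a toy problem, and "diverged" is taken to mean that the final loss is non-finite or larger than the initial loss.

```python
import math
import torch
import torch.nn as nn

def lr_sweep(x, y, start_lr=1e-6, factor=10.0, max_lr=1e3, steps=50):
    """Multiply the learning rate by `factor` until training diverges.

    Returns the last learning rate that did not diverge and the
    (lr, final_loss) history. Hypothetical helper for illustration.
    """
    history, last_stable, lr = [], None, start_lr
    criterion = nn.MSELoss()
    while lr <= max_lr:
        torch.manual_seed(0)                  # same initial weights for every trial
        model = nn.Linear(x.shape[1], 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        first = criterion(model(x), y).item()  # loss before any update
        for _ in range(steps):
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        final = criterion(model(x), y).item()
        history.append((lr, final))
        # Assumed divergence criterion: non-finite loss or loss that grew
        if not math.isfinite(final) or final > first:
            break
        last_stable = lr
        lr *= factor
    return last_stable, history

# Toy regression data (made up for this example)
torch.manual_seed(0)
x = torch.randn(256, 3)
y = x @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.1 * torch.randn(256, 1)

best_lr, history = lr_sweep(x, y)
for lr, final in history:
    print(f"lr={lr:.0e}  final loss={final:.4g}")
print("last stable learning rate:", best_lr)
```

On a real problem you would swap in your own model, optimiser and a handful of batches from your dataloader, and you may prefer to inspect the loss curves by eye rather than rely on an automatic criterion.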

<br>

### **Testing your setup**

Once you have bootstrapped your design, I suggest that you thoroughly test it before proceeding with the actual training or hyperparameter optimisation.

To do this, I recommend that you take one single batch from your dataset and try to overfit your model to this batch only. If you cannot overfit to this single batch, you should troubleshoot your model and code, as it is very likely that you have made a mistake somewhere (even if you don't get an error message).
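
A minimal sketch of this single-batch check is shown below. The random batch, the small MLP and the learning rate are assumptions for illustration; in practice you would take one batch from your own dataloader and use your own model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One single (made-up) batch: 16 samples, 20 features, 3 classes
x_batch = torch.randn(16, 20)
y_batch = torch.randint(0, 3, (16,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train repeatedly on the same batch; the loss should approach zero
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()

accuracy = (model(x_batch).argmax(dim=1) == y_batch).float().mean().item()
print(f"final loss {loss.item():.4f}, batch accuracy {accuracy:.2f}")
```

If the loss stalls well above zero or the batch accuracy stays low, that points to a bug in the model, the loss or the data pipeline rather than to a tuning problem.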

Use this chance to also print the shape of every output of every significant operation to ensure they are as expected.

I also recommend extensive plotting across your training, validation and testing procedures to ensure that they make sense. Plot everything: the images in your dataset, the batches in your dataloader, the outputs of your network, the gradients in your layers, the filters in your `Conv2d`s, etc. Familiarise yourself with every plot that you make and what it means.

At this point, you can introduce the full dataset (or a larger portion of it) and train your model.

<br>

### **Fine tune your design**

Once you have a baseline run, we will start exploring the behaviour of the model as it responds to different hyperparameters and fine tuning its performance. We will do this by devising experiments and always following concrete evidence.

We could try to use an automated algorithm to explore the whole space of possible hyperparameters. Unfortunately, that is not possible in Deep Learning: there are way too many hyperparameters over too large a range to do this effectively.

Instead, we will use well-defined sets of experiments to understand what hyperparameters are important and what ranges for those hyperparameters are relevant for our problem. We can later use this information to run automated hyperparameter tuning (if time allows).

Our focus during the experiments, especially at first, should **not** be on obtaining the best possible performance, but to explore and understand the importance and effect of different hyperparameters for our specific problem. We need to develop an intuition for our problem, and running these experiments will allow us to do this.

Our procedure should then be:

1. Look at the current results of our model: this includes training curves, performance metrics, and actual quality of the outputs (for example, by plotting output images of our network).

2. Identify a hypothesis for the next round of experiments based on the evidence provided by the current results of our model. For example, a hypothesis could be that the network is overfitting.

3. Design a set of experiments that are determined by the hypothesis. The experiments should have a clear goal and be sufficiently narrow in scope: if we try to add multiple features, test multiple hyperparameters or answer multiple questions at once, we may not be able to disentangle the separate effects on the results. For example, if our hypothesis is that the network is overfitting, we could design experiments to explore the impact of a regulariser like BatchNorm.

4. Use the results of the model after each experiment to learn as much as possible about your hypothesis and the impact of different hyperparameters. Make sure to keep track of your experiments, including notes of what you learn from them. You will likely want to revisit these notes later on.

5. Consider whether to update your current network design to a new best configuration.

6. Go back to step 1. Repeat this process as much as needed or for as long as you can.

<br>

At this point, you can perform a formal hyperparameter tuning (using grid or random search) for those parameters that you feel will have the largest impact on the results.

Finally, you can use this final set of hyperparameters to train your model for a larger number of epochs. The total number of epochs will be determined by the observed loss and accuracy: you want to train for as long as these values keep improving both in the training and validation sets. You also want to ensure that you check the outputs of your network often to see if you can observe the improvement.

<br><br>

Here are some examples of loss and accuracy outputs you might obtain, and what hypotheses you might extract from them:

<br>

<p align = "center"><img src="https://drive.google.com/uc?id=14OxdJTla4FMPS7pe-seG8JVsaETU0oyW" width="600"/></p><p align = "center">
<i>The model seems to be training, but with instability, which makes the training risky. We could address this by using a smaller learning rate, changing the batch size, or adding some regularisation.</i>
</p>

<br>

<p align = "center"><img src="https://drive.google.com/uc?id=1-ZGoWxDpIWGs7YcK3ZeljFHrbfTu-ho6" width="600"/></p><p align = "center">
<i>The model does not have enough capacity and is underfitting. We could address this by increasing the number of epochs, trying to increase the learning rate, or increasing the model capacity.</i>
</p>

<br>

<p align = "center"><img src="https://drive.google.com/uc?id=1pYHiHoOr-HDs6GLIQovkC4zvCAmkYvrS" width="600"/></p><p align = "center">
<i>The model has too much flexibility and is overfitting. We could address this by adding regularisers, or decreasing model capacity.</i>
</p>

<br><br>

Here are some things to consider in this process:

- If underfitting, you should increase the complexity of the network. Usually, you will get more of a performance boost from adding more layers than adding more neurons/channels in each layer.
- If overfitting, you should start by using regularisation approaches: Dropout, BatchNorm, LayerNorm, data augmentation, and explicit regularisers. If none of these are enough, you should reduce the complexity of your network.
- Overfit first and then regularise. I would recommend starting with a model large enough that it can overfit (i.e. focus on training loss) and then regularising it appropriately (give up some training loss to improve the validation loss).
- At this stage, the loss function is determined by the problem being solved. You should use MSE for regression-like problems and cross-entropy for classification problems (with added KL terms in the case of VAEs). Once you feel comfortable with designing networks, you can explore other types of loss functions.
- I would recommend considering [learning-rate scheduling](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate), but only later on in the training process.
- Adam is a great optimiser, so I would use that as a default.
- If you are struggling to diagnose what is causing your model to perform poorly, I recommend accessing the gradients of every layer in your network and plotting their norm over the training process. This will help you understand whether your network is suffering from vanishing or exploding gradients. If it is suffering from either of the two, you should consider: (1) changing your activation functions to a flavour of ReLU; (2) adding BatchNorm; (3) adding skip connections; (4) gradient clipping, which we have not covered but is another possible approach.
- Only test one change at a time!
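
To illustrate the gradient-norm diagnostic mentioned in the list above, here is a minimal sketch. `gradient_norms` is a hypothetical helper, and the deep sigmoid network is an assumption chosen only because it tends to show small gradients in its early layers.

```python
import torch
import torch.nn as nn

def gradient_norms(model):
    """Return {parameter name: L2 norm of its gradient}.

    Call this after loss.backward(). Hypothetical helper for illustration.
    """
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

torch.manual_seed(0)

# Made-up deep network with sigmoid activations
layers = []
for _ in range(8):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x, y = torch.randn(64, 32), torch.randn(64, 1)
loss = nn.MSELoss()(model(x), y)
loss.backward()

norms = gradient_norms(model)
for name, value in norms.items():
    print(f"{name:12s} {value:.3e}")
```

In a real training loop you would call the helper after every `loss.backward()` (or every few steps), append the result to a list, and plot one curve per layer over the course of training.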

<br><br>

For a more in-depth discussion, I would recommend this [tuning playbook](https://github.com/google-research/tuning_playbook). If you want to get a pro's take on this process, [this blog](https://karpathy.github.io/2019/04/25/recipe/) is very good.

<br>


### **PyTorch implementation steps**

Here is a high-level overview of the steps that you always need to take:

1. Create `Dataset` object(s) that will contain your complete dataset (either a custom one, by creating a class that inherits from `Dataset`, or one taken directly from `torch` or `torchvision`).
2. Create `DataLoader` object(s) to assemble batches from the `Dataset`. At this point, you might want to use scikit-learn's `StratifiedShuffleSplit` to split the data into training and validation sets, each with its own loader.
3. Create a model, either by using a new `nn.Module` class or by downloading a predefined network like AlexNet.
4. Instantiate an appropriate criterion like MSE or CrossEntropy.
5. Instantiate an optimizer like SGD or Adam.
6. Create the train, validation, and test loops.
7. Run over all epochs, alternating between the train and validation loops.
8. Run the test loop on the trained model.
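
Steps 1 to 5 above can be sketched as follows. Everything specific here is an assumption for illustration: `ToyDataset` is a hypothetical dataset of random vectors, the split sizes are arbitrary, and `random_split` is used in place of `StratifiedShuffleSplit` to keep the example free of extra dependencies (it does not stratify by class).

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

class ToyDataset(Dataset):
    """Hypothetical dataset: random vectors with a binary label."""
    def __init__(self, n_samples=200, input_dim=10):
        g = torch.Generator().manual_seed(0)
        self.x = torch.randn(n_samples, input_dim, generator=g)
        self.y = (self.x.sum(dim=1) > 0).long()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# 1. Dataset, split into training, validation and test subsets
dataset = ToyDataset()
train_set, val_set, test_set = random_split(
    dataset, [140, 30, 30], generator=torch.Generator().manual_seed(0))

# 2. DataLoaders assemble the batches
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)
test_loader = DataLoader(test_set, batch_size=32)

# 3-5. Model, criterion and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

inputs, targets = next(iter(train_loader))
print(inputs.shape, targets.shape)
```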

<br>

The train loop always has the same steps as well:

1. Set the model to `model.train()` to ensure operations like dropout are correctly set up.
2. Iterate through every batch.
3. Send the batch to the device.
4. Zero out the gradients by calling `optimizer.zero_grad()`.
5. Apply the model to the batch.
6. Compare the output of the model and the expected target using the criterion.
7. Calculate any other relevant metrics like accuracy.
8. Calculate the gradients using backpropagation by calling `loss.backward()`.
9. Update the parameters with the gradient using `optimizer.step()`.
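
The nine steps above map onto code roughly as follows. This is a sketch rather than a definitive implementation: `train_one_epoch` is a hypothetical function name, and the toy data and linear model exist only so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()                                       # 1. training mode
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:                      # 2. iterate over batches
        inputs, targets = inputs.to(device), targets.to(device)  # 3. to device
        optimizer.zero_grad()                           # 4. zero the gradients
        outputs = model(inputs)                         # 5. forward pass
        loss = criterion(outputs, targets)              # 6. compare with target
        correct += (outputs.argmax(dim=1) == targets).sum().item()  # 7. accuracy
        loss.backward()                                 # 8. backpropagation
        optimizer.step()                                # 9. parameter update
        total_loss += loss.item() * inputs.size(0)
        seen += inputs.size(0)
    return total_loss / seen, correct / seen

# Toy usage with made-up data
torch.manual_seed(0)
x = torch.randn(128, 10)
y = (x[:, 0] > 0).long()
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

device = torch.device("cpu")
model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

history = [train_one_epoch(model, loader, criterion, optimizer, device)
           for _ in range(20)]
print("first epoch:", history[0], "last epoch:", history[-1])
```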

<br>

Validation and testing follow a similar procedure:

1. Set the model to `model.eval()` to ensure operations like dropout are correctly set up.
2. Iterate through every batch.
3. Send the batch to the device.
4. Apply the model to the batch.
5. Compare the output of the model and the expected target using the criterion if there is a target available.
6. Calculate any other relevant metrics like accuracy.
7. For testing, save the outputs of the model for verification.
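
A matching sketch for validation and testing is given below. `evaluate` is a hypothetical function name, and the toy data and model are made up. One addition that is not in the list above: the function is wrapped in `torch.no_grad()`, a common practice that avoids tracking gradients when they are not needed.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()  # assumption: gradients are not needed for evaluation
def evaluate(model, loader, criterion, device):
    model.eval()                                        # 1. evaluation mode
    total_loss, correct, seen, saved = 0.0, 0, 0, []
    for inputs, targets in loader:                      # 2. iterate over batches
        inputs, targets = inputs.to(device), targets.to(device)  # 3. to device
        outputs = model(inputs)                         # 4. forward pass
        total_loss += criterion(outputs, targets).item() * inputs.size(0)  # 5.
        correct += (outputs.argmax(dim=1) == targets).sum().item()         # 6.
        seen += inputs.size(0)
        saved.append(outputs.cpu())                     # 7. keep the outputs
    return total_loss / seen, correct / seen, torch.cat(saved)

# Toy usage with made-up data and an untrained model
torch.manual_seed(0)
x = torch.randn(64, 10)
y = (x[:, 0] > 0).long()
loader = DataLoader(TensorDataset(x, y), batch_size=16)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2))

val_loss, val_acc, outputs = evaluate(model, loader, nn.CrossEntropyLoss(),
                                      torch.device("cpu"))
print(val_loss, val_acc, outputs.shape)
```

Because `model.eval()` disables dropout, calling `evaluate` twice on the same data returns the same outputs, which is not the case in training mode.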

<br>

---

<br>

## 3. Key network operations

### **Linear layers**

<br>

<center><img src="https://drive.google.com/uc?id=1AtQhgLLuGKiLi8neVLNo1xQ91rfW7XLj" width="800"/></center>

<img src="https://drive.google.com/uc?id=1rAh2U6ejO54rptSXbHTYqiTYg78mvOTe" width="800"/>

<br>

### **Convolutional layers**

<center><img src="https://drive.google.com/uc?id=1ME4bFACe5hE9pSYyIvclE1lE542YbEL3" width="800"/></center>

<br>

<p align = "center"><img src="https://drive.google.com/uc?id=1jyD_4d3HvulHQ5obeKJ57ANt6-U66din" width="800"/></p><p align = "center">
<i>3-channel input (RGB image) and 1-channel output example</i>
</p>

<br>

<p align = "center"><img src="https://drive.google.com/uc?id=1400AvvRTZkRH06-13XHDR19GILqBxwTP" width="800"/></p><p align = "center">
<i>3-channel input (RGB image) and 2-channel output example</i>
</p>
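
To connect the figures above with PyTorch, here is a small sketch of the 3-channel input, 2-channel output case. The image size and kernel size are assumptions for illustration. The manual computation reproduces a single output pixel to show that each output channel sums over all input channels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)                    # (batch, channel, H, W), RGB-like
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
out = conv(x)

# One kernel of shape (in_channels, k, k) per output channel
print(conv.weight.shape)                       # (out_channels, in_channels, k, k)
print(out.shape)                               # (batch, out_channels, H-2, W-2)

# Top-left pixel of output channel 0, computed by hand:
# multiply the 3x3 patch of every input channel by the kernel, sum, add the bias
patch = x[0, :, 0:3, 0:3]
manual = (patch * conv.weight[0]).sum() + conv.bias[0]
print(torch.allclose(manual, out[0, 0, 0, 0], atol=1e-5))
```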

<br>

### **Max pooling**

<img src="https://benmoseley.blog/uploads/teaching/2024-ESE-DL/images/slides/Slide15.png" width="600"/>

<br>

### **Transposed convolutional layers**

Transposed convolutions can be computed by following an easy recipe:

<center><img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*54-7typHLLXhdvAhlku9SQ.png" width="900"/></center>


where we know that:
- `s`: stride
- `p`: padding
- `k`: kernel size

and we use these hyperparameters to calculate:
- `z`: how many zeros to insert in between the pixels of the input
- `p'`: how much padding to add around the image

But with the added caveat that, **as the name indicates**, we need to **transpose the kernel** before using it to convolve with the input. Transposing the kernel in this case implies flipping the kernel along each of its axes.
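
The recipe can be checked against PyTorch's own transposed convolution. The sketch below assumes the standard relations `z = s - 1` and `p' = k - p - 1` (I am assuming these match the figure above), a single input and output channel, a square kernel, and `p <= k - 1`. `transposed_conv_by_recipe` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def transposed_conv_by_recipe(x, w, s, p):
    """Transposed convolution via zero insertion, padding and a flipped kernel.

    Assumes one input and one output channel and a square kernel.
    """
    k = w.shape[-1]
    z = s - 1                  # zeros inserted between input pixels (assumed formula)
    p_prime = k - p - 1        # padding added around the image (assumed formula)

    n, c, h, wd = x.shape
    # Insert z zeros between neighbouring pixels
    up = x.new_zeros(n, c, (h - 1) * (z + 1) + 1, (wd - 1) * (z + 1) + 1)
    up[:, :, ::s, ::s] = x
    # Pad with p' zeros on every side
    up = F.pad(up, (p_prime, p_prime, p_prime, p_prime))
    # Flip the kernel along both spatial axes, then apply a stride-1 convolution
    w_flipped = torch.flip(w, dims=[2, 3])
    return F.conv2d(up, w_flipped, stride=1)

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)
w = torch.randn(1, 1, 3, 3)
s, p = 2, 1

by_recipe = transposed_conv_by_recipe(x, w, s, p)
reference = F.conv_transpose2d(x, w, stride=s, padding=p)
print(by_recipe.shape, torch.allclose(by_recipe, reference, atol=1e-5))
```

With more than one channel, the kernel's input and output channel axes would also need to be swapped before calling `F.conv2d`, which this sketch leaves out.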

<br>

### **Activation functions**

Historically, the three most commonly used are:

- tanh
- sigmoid
- ReLU (Rectified Linear Unit)


<img src="https://miro.medium.com/max/1190/1*f9erByySVjTjohfFdNkJYQ.jpeg" width="210"/>

<img src="https://miro.medium.com/max/4800/1*XxxiA0jJvPrHEJHD4z893g.png" width="400"/>

But over time, many more have been developed:

<center><img src="https://drive.google.com/uc?id=1H6XZ-jCIbXZpff00CwqL0TcXbW57nlGb" width="800"/></center>

A special case is the set of activation functions that involve the outputs of several neurons in each layer:

<center><img src="https://drive.google.com/uc?id=1xjF1HkWG3pQhZ0OFCmiHCPUDMSrwaji3" width="400"/></center>

[Link to wikipedia entry](https://en.wikipedia.org/wiki/Activation_function) about activation functions.


<br>

### **Skip connections**

<img src="https://www.researchgate.net/publication/348555917/figure/fig4/AS:991956234694659@1613512204672/Conceptualized-architecture-of-the-skip-connection-51-The-sub-figure-on-the-left-shows.ppm" width="600"/>

<br><br>

---

<br>

## 4. U-Net

<p align = "center"><img src="https://drive.google.com/uc?id=1glxl06_zsq-2Off0E21VOQmG5k22R6_z" width="800"/></p><p align = "center">
<i> sources: <a href="https://arxiv.org/pdf/1505.04597.pdf">original unet</a>, <a href="https://www.kaggle.com/c/tgs-salt-identification-challenge"> seismic segmentation</a></i>
</p>

<br>

---

<br>

## 5. Backpropagation

<center><p>Step 1</p><img src="https://drive.google.com/uc?id=1OO0M9ZBPHle0XwMsGiYMmspe17Lb2T-4" width="800"/></center>

<br>

Here it is worth noting that $a_4$ depends on all the model parameters $w_i,b_i$.

<br>

<center><p>Step 2</p><img src="https://drive.google.com/uc?id=1GdVX-e8Jn70is2m2j06SFdbz2EYR2CLa" width="800"/></center>

<br>

where we add a $\frac{1}{2}$ in front of the loss to simplify the calculation:

$$\require{cancel}$$
$$\frac{\partial C}{\partial a_4} = \frac{1}{\cancel{2}} \cancel{2} (a_4 - y)$$

and then we continue with the chain rule:

<br>

<center><p>Step 3</p><img src="https://drive.google.com/uc?id=13SS5XgL-BXsEy37gsbsP_PvxO_bD6mi7" width="800"/></center>
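
The derivatives above can be checked numerically with autograd. The network below is an assumption made for illustration: a chain of four scalar neurons with sigmoid activations, $a_i = \sigma(w_i a_{i-1} + b_i)$, and the cost $C = \frac{1}{2}(a_4 - y)^2$. The exact network in the figures may differ.

```python
import torch

torch.manual_seed(0)
x, y = torch.tensor(0.5), torch.tensor(1.0)
w = torch.randn(4, requires_grad=True)   # w_1 ... w_4
b = torch.randn(4, requires_grad=True)   # b_1 ... b_4

# Forward pass through the assumed chain a_i = sigmoid(w_i * a_{i-1} + b_i)
activations = [x]
for i in range(4):
    activations.append(torch.sigmoid(w[i] * activations[-1] + b[i]))
a3, a4 = activations[3], activations[4]
a4.retain_grad()                         # keep dC/da_4 for inspection

C = 0.5 * (a4 - y) ** 2
C.backward()

# Chain rule by hand
manual_dC_da4 = (a4 - y).detach()                                # dC/da_4 = a_4 - y
manual_dC_dw4 = (manual_dC_da4 * a4 * (1 - a4) * a3).detach()    # times da_4/dw_4

print(torch.allclose(a4.grad, manual_dC_da4))
print(torch.allclose(w.grad[3], manual_dC_dw4))
```

The same pattern extends to the earlier layers: each step back through the chain multiplies by one more factor of the form $\sigma'(z_i)\, w_i$.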

<br>

---

<br>