
# Network Architectures & Transfer Learning

![footer_logo](../images/logo.png)


## Goal

We will discuss some of the most important and popular **deep learning architectures** for CNNs. Afterwards, we discuss the concept of **transfer learning**.

## Program

Benchmarking dataset
- [ImageNet]()

Famous networks
- [AlexNet]()
- [VGG]()
- [Inception]()
- [ResNet]()

Transfer Learning
- [Transfer Learning]()
    - [Fine-tuning]()
    - [Feature extraction]()

# ImageNet

The full dataset consists of about 14 million images in roughly 20,000 categories.

![](../images/network_architechture/imagenet.jpeg)

# ImageNet


A benchmark for image classification

![](../images/network_architechture/benchmarks.png)

# AlexNet

<center><img src="../images/network_architechture/alexnet.png" width="800"></center>

# [AlexNet](https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)

- Published by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012
- One of the first fast GPU implementations of a CNN to win an image recognition contest.
- Considered one of the most influential papers published in computer vision, having spurred many more papers published employing CNNs and GPUs to accelerate deep learning.
- As of 2020, the AlexNet paper has been cited over 70,000 times according to Google Scholar.


# AlexNet

18.2% top-5 error rate on ImageNet

<center><img src="../images/network_architechture/alexnet_drawing.png" width="1000"></center>

[Gavves, E. (2019)](https://uvadlc.github.io/lectures/apr2019/lecture4-convnets.pdf)

*Top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model*
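The definition above can be turned into a few lines of code. This is a minimal sketch in plain Python; the function name `top_k_error` and the toy scores are made up for illustration and are not taken from the AlexNet paper.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to the best k.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy example: 2 samples, 6 classes (made-up scores).
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # class 0 has the lowest score
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # class 5 has the lowest score
]
labels = [0, 2]

print(top_k_error(scores, labels, k=5))  # 0.5: only the first sample is a miss
print(top_k_error(scores, labels, k=1))  # 1.0: neither label is the top prediction
```

The top-5 error is always at most the top-1 error, which is why it is the more forgiving metric on a dataset with many fine-grained classes.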

## Removing layers

- layer 7: 16 million fewer parameters, 1.1% drop in performance
- layer 6 & 7: 50 million fewer parameters, 5.7% drop in performance
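As a rough sanity check on the parameter savings quoted above, we can count the weights of the fully connected layers. The sketch assumes the layer sizes of the original AlexNet paper (9216 pooled features, then 4096, 4096 and 1000 units); the ablation quoted here may have used a slightly different variant, so the numbers only need to agree approximately.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumed sizes: the last pooling layer outputs 6 x 6 x 256 = 9216 values,
# followed by fully connected layers 6, 7 and 8 with 4096, 4096 and 1000 units.
POOL5, FC6, FC7, N_CLASSES = 6 * 6 * 256, 4096, 4096, 1000

full = dense_params(POOL5, FC6) + dense_params(FC6, FC7) + dense_params(FC7, N_CLASSES)
# Without layer 7, layer 6 feeds the classifier directly.
without_7 = dense_params(POOL5, FC6) + dense_params(FC6, N_CLASSES)
# Without layers 6 and 7, the pooled features feed the classifier directly.
without_6_and_7 = dense_params(POOL5, N_CLASSES)

print(f"saved by removing layer 7:      {full - without_7:,}")        # 16,781,312
print(f"saved by removing layers 6 & 7: {full - without_6_and_7:,}")  # 49,414,144
```

Under these assumptions the savings come out at roughly 17 and 49 million parameters, in line with the figures on the slide.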


<center><img src="../images/network_architechture/alexnet_drawing.png" width="1000"></center>

## Removing layers: convolutions are important

- layer 7: 16 million fewer parameters, 1.1% drop in performance
- layer 6 & 7: 50 million fewer parameters, 5.7% drop in performance
- layer 3 & 4: 1 million fewer parameters, 3% drop in performance


<center><img src="../images/network_architechture/alexnet_drawing.png" width="1000"></center>

## Removing layers: depth is important!

Removing layers 3,4,6 & 7 results in a 33.5% drop in performance!


<center><img src="../images/network_architechture/alexnet_drawing.png" width="1000"></center>

# [VGG-16](https://arxiv.org/pdf/1409.1556.pdf)

- 7.3% top-5 error rate on ImageNet (compared to 18.2% with AlexNet)


<center><img src="../images/network_architechture/vgg16.png" width="400"></center>

[image source](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2)

## 3x3 Convolutions

- The smallest filter that still captures the notions of “up”, “down”, “left”, “right” and “center” of a region.
- Two back-to-back 3x3 convolutions have the same effective receptive field as a single 5x5 convolution. Here’s a visualization of two stacked 3x3 convolutions covering a 5x5 region.


<center><img src="../images/network_architechture/convolutions_55_33.png" width="500"></center>

[image source](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2)

- 1 large filter can be replaced by a deeper stack of successive smaller filters!
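The receptive field argument, and the reason the stack is also cheaper, can be checked with a little arithmetic. This is a small sketch assuming stride-1 convolutions with the same number of input and output channels; the channel count `C = 64` is a hypothetical value.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the field by k - 1
    return field


def conv_weights(kernel_size, channels):
    """Weight count of one conv layer with `channels` inputs and outputs (biases ignored)."""
    return kernel_size * kernel_size * channels * channels


C = 64  # hypothetical channel count

print(stacked_receptive_field([3, 3]))     # 5, same as a single 5x5 filter
print(stacked_receptive_field([3, 3, 3]))  # 7, same as a single 7x7 filter
print(2 * conv_weights(3, C), "vs", conv_weights(5, C))  # 73728 vs 102400
```

Two 3x3 layers cover the same 5x5 region with 18 instead of 25 weights per channel pair, and they add an extra non-linearity in between.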

# [Inception](https://arxiv.org/abs/1409.4842v1)

- 6.67% top-5 error rate on ImageNet (compared to 18.2% with AlexNet)
- GoogLeNet (Version 1) has 22 layers


<center><img src="../images/network_architechture/inception.png" width="1000"></center>

[image source](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

**Motivation behind network architecture**:

The salient parts of an image can vary greatly in size. Hence, the receptive fields should vary in size accordingly.


<center><img src="../images/network_architechture/dogs.jpeg" width="800"></center>
    
[image source](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

Also, 
- Naively stacking convolutional operations is expensive
- Very deep nets are prone to overfitting.

**Naive Solution**:

Multiple kernel filters of different sizes (1 × 1, 3 × 3, 5 × 5)


<center><img src="../images/network_architechture/inception_module.png" width="700"></center>

[image source](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

Still computationally expensive!

**Better Solution**:

Before applying the expensive 3x3 and 5x5 convolutions, reduce the number of input channels with 1x1 convolutions.

<center><img src="../images/network_architechture/inception_module.png" width="700"></center>

[image source](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)
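To see why the 1x1 convolutions help, we can count multiply-accumulate operations for a single 5x5 branch with and without a 1x1 reduction in front of it. This is a back-of-the-envelope sketch; the feature map size and channel counts below are assumptions chosen for illustration.

```python
def conv_macs(height, width, c_in, c_out, kernel_size):
    """Multiply-accumulate operations of a stride-1, 'same'-padded convolution."""
    return height * width * c_out * kernel_size * kernel_size * c_in


# Assumed sizes: a 28 x 28 feature map with 192 channels, a 5 x 5 branch
# producing 32 channels, and a 1 x 1 reduction down to 16 channels.
H, W, C_IN, C_OUT, C_REDUCED = 28, 28, 192, 32, 16

naive = conv_macs(H, W, C_IN, C_OUT, 5)
reduced = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

print(f"5x5 directly:        {naive:,}")    # 120,422,400
print(f"1x1 first, then 5x5: {reduced:,}")  # 12,443,648
print(f"reduction factor:    {naive / reduced:.1f}")
```

With these sizes the 1x1 reduction cuts the cost of the branch by almost a factor of ten, because the expensive 5x5 filters now operate on 16 instead of 192 channels.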


# [ResNet](https://arxiv.org/abs/1512.03385)

<center><img src="../images/network_architechture/resnet.jpeg" width="1000"></center>


[image source](https://www.kaggle.com/keras/resnet50)

## ResNet

- The first truly Deep Network, going deeper than 1,000 layers
- More importantly, the first Deep Architecture that proposed a novel concept on how to gracefully go deeper than a few dozen layers (Not simply getting more GPUs, more training time, etc.)
- Smashed ImageNet, with a 3.57% top-5 error rate
- Won a variety of challenges: object classification, detection, segmentation, etc. 

## Motivation: Going deeper had its limits

<center><img src="../images/network_architechture/toodeep.png" width="900"></center>


## What is the problem?

- Very deep networks stop learning after a while
- Accuracy reaches a plateau, then the network saturates and performance starts to degrade
- The signal gets lost when passing through so many layers

<center><img src="../images/network_architechture/toodeep.png" width="900"></center>


## Solution: The residual block

Adding the input of a block back to its output (a skip connection) means we can train deeper networks

<center><img src="../images/network_architechture/residual_connection.png" width="700"></center>
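A residual block computes `y = relu(F(x) + x)`, where `x` is the input of the block and `F` is a small stack of layers. The sketch below is a toy version in plain Python: it uses two small dense layers for `F` instead of the convolutions and batch normalization of the real ResNet, and all names are made up for illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    f_x = linear(w2, relu(linear(w1, x)))
    # The skip connection: add the unchanged input back to the output of F.
    return relu([f + xi for f, xi in zip(f_x, x)])


x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block passes its non-negative input through.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 0.5]
```

The block only has to learn the difference (the residual) between its input and the desired output. Doing nothing is easy to represent, so stacking more blocks should not make the network worse.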


## Residual connections

Without residual connections, very deep networks are much harder to train

<center><img src="../images/network_architechture/deep_residual.png" width="900"></center>


# Transfer Learning

Assume we have two datasets, A and B:

**Dataset A** has plenty of images and we have already been able to train an accurate model on it (e.g. think of a ResNet model trained on the ImageNet dataset).

**Dataset B** has far fewer images. We would struggle to learn an accurate model with this data alone.

We can use the model learned on dataset A to learn a better model on dataset B!

Even if the image classes of B do not (necessarily) overlap with A.

This is called transfer learning!

## Why use transfer learning

The most powerful CNNs have millions of parameters... but our datasets are not always as large. Training such a network from scratch on a small dataset can result in overfitting, and transfer learning can help us avoid this!

There are two main approaches to transfer learning:
1. Fine-tuning
2. Feature extraction




## 1. Fine-tuning

When fine-tuning, we assume the parameters of the pre-trained model are already close to the optimum for the new dataset.

We use the weights as a starting point for the parameters of the new model and fine-tune from there.

Best used when the new dataset B is relatively big, *e.g. a dataset with more than a few thousand images*.


## 2. Feature extraction

This is similar to fine-tuning, but we freeze the pre-trained layers and train only the final (classification) layer.

Essentially, we use the network as a pre-trained feature extractor.

Best used when:
- The target dataset B is small and fine-tuning any of the layers might cause overfitting,
- Or when we don’t have the resources to train a deep net,
- Or when we don’t care for the best possible accuracy.
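The idea of feature extraction can be illustrated with a toy example in plain Python. The "pre-trained network" here is just a stand-in with two fixed weight rows, and the data is made up; in practice the frozen part would be a real network such as a ResNet trained on ImageNet. Only the small classifier on top is trained.

```python
import math

# Stand-in for a pre-trained network: fixed weights that are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen 'backbone': one linear layer followed by a ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def predict(head, features):
    """Logistic-regression head on top of the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.1):
    # Features are computed once, because the backbone never changes.
    features = [extract_features(x) for x in samples]
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for f, y in zip(features, labels):
            error = predict(head, f) - y  # gradient of the log loss w.r.t. z
            head["weights"] = [w - lr * error * fi for w, fi in zip(head["weights"], f)]
            head["bias"] -= lr * error
    return head


# Tiny made-up "dataset B": class 1 when the first coordinate exceeds the second.
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

head = train_head(samples, labels)
predictions = [round(predict(head, extract_features(x))) for x in samples]
print(predictions)
```

Because the backbone is frozen, only three numbers (two weights and a bias) are learned, which is why this approach is hard to overfit and cheap to train.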

## It is also possible to do something in between, e.g. fine-tune the last few layers!

![](../../images/finetuning.png)

## Transfer Learning is the norm!

## Not the exception!


## Summary

In this notebook we have covered,
- ImageNet (a famous dataset for training models and benchmarking performance)
- The architecture of famous networks
    - [AlexNet]()
    - [VGG]()
    - [Inception]()
    - [ResNet]()
- Transfer Learning

# Transfer Learning: Exercise
[Exercise: Keras advanced](../exercises/02_03_transfer_learning.ipynb)
