# Deep CNNs 

### LeNet
Proposed in 1998 by LeCun et al.

Structurally consists of 2 conv layers + 2 fully connected layers with a total of ~50K tunable weights

Used average pooling, with tanh (rather than ReLU) as the nonlinear activation

Trained on the MNIST dataset, whose training set consists of 60K images of handwritten digits

### AlexNet
Proposed by Krizhevsky et al. in 2012

Structurally consists of 5 conv layers + 3 fully connected layers with a total of ~60M tunable weights

Used max pooling, with ReLU as the nonlinear activation

Used dropout to mitigate overfitting

Trained on a subset of the ImageNet dataset consisting of ~1.2M images

### VGGNet
Proposed by Simonyan and Zisserman in 2014

Structurally consists of 13 conv layers + 3 fully connected layers (in its 16-layer configuration) with a total of ~140M tunable weights

Used ReLU as nonlinear activation
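
To see where a weight count of this size comes from, the sketch below tallies the weights and biases layer by layer. It assumes the commonly cited 16-layer VGG layout (3x3 convolutions that preserve spatial size, 2x2 max pooling, a 224x224 RGB input, and 1000 output classes); none of these specifics are stated above, and the names `CONV_LAYOUT`, `FC_SIZES`, and `count_weights` are purely illustrative.

```python
# Assumed 16-layer VGG layout: output channels of each 3x3 conv layer,
# with "M" marking a 2x2 max pooling step that halves the image size.
CONV_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # assumed fully connected layer widths


def count_weights(in_channels=3, image_size=224, kernel=3):
    """Return (number of conv layers, conv weights, fully connected weights)."""
    n_conv, conv_total = 0, 0
    for item in CONV_LAYOUT:
        if item == "M":
            image_size //= 2                      # pooling has no weights
        else:
            # one kernel x kernel filter per (input, output) channel pair, plus biases
            conv_total += kernel * kernel * in_channels * item + item
            in_channels = item
            n_conv += 1

    fc_total = 0
    n_in = in_channels * image_size * image_size  # flattened final feature map
    for n_out in FC_SIZES:
        fc_total += n_in * n_out + n_out          # weight matrix plus biases
        n_in = n_out
    return n_conv, conv_total, fc_total


n_conv, conv_total, fc_total = count_weights()
print(n_conv, conv_total, fc_total, conv_total + fc_total)
```

Under these assumptions the fully connected layers account for the large majority of the total, with the first of them alone contributing over 100M weights.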

<figure>
    <img src= '../../mlrefined_images/convnet_images/CNN_architectures.png' width="220%"  height="auto" alt=""/>
</figure> 

# Transfer Learning

- Modern CNNs can have tens or even hundreds of layers with hundreds of millions of tunable weights.

- Without large datasets (with hundreds of thousands or millions of data points), training such architectures tends to lead to severe overfitting.

- Additionally, training such deep architectures from scratch requires extensive computational resources and training time (e.g., AlexNet: two GPUs for about 6 days; VGGNet: four GPUs for about 2 weeks).

When your dataset is smaller than ideal and/or you have limited time to train a network from scratch, you can still leverage pre-trained models like AlexNet or VGGNet by 'transferring' some of the knowledge gained by these models to yours. This is typically called <strong>transfer learning</strong>.

For instance, you can simply re-use one of these pre-trained models by keeping all of its weights untouched except for those of the final layer, which you learn using your own dataset. In other words, you only tune the weights $w_0$ through $w_B$ in your model

\begin{equation}
\text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} +  f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_B\left(\mathbf{x}\right)w_B
\end{equation}

while using the features $f_1$ through $f_B$, as provided by deep CNN architectures.
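
The model above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real pipeline: the frozen feature extractor `features` is a hypothetical stand-in (a fixed random projection followed by ReLU) for a pre-trained CNN, the data are synthetic, and the least squares cost and step length are assumptions chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, B = 64, 16

# Hypothetical stand-in for a pre-trained CNN: these weights are never updated.
W_frozen = rng.normal(size=(N_INPUT, B)) / np.sqrt(N_INPUT)


def features(X):
    """Return f_1(x), ..., f_B(x) for each row of X using the frozen weights."""
    return np.maximum(X @ W_frozen, 0.0)


def model(X, w):
    """w[0] is the bias w_0; w[1:] holds the final-layer weights w_1..w_B."""
    return w[0] + features(X) @ w[1:]


def cost(X, y, w):
    """Least squares cost (an assumed choice for this sketch)."""
    return 0.5 * np.mean((model(X, w) - y) ** 2)


def fit_final_layer(X, y, steps=500, lr=0.1):
    """Tune only w_0..w_B by gradient descent; the features stay fixed."""
    F = features(X)                    # computed once, since nothing inside f changes
    w = np.zeros(B + 1)
    for _ in range(steps):
        residual = w[0] + F @ w[1:] - y
        grad = np.concatenate(([residual.mean()], F.T @ residual / len(y)))
        w -= lr * grad
    return w


# Synthetic data, for illustration only
X = rng.normal(size=(200, N_INPUT))
y = model(X, rng.normal(size=B + 1))
w = fit_final_layer(X, y)
```

Because the features are fixed, they only need to be computed once, and tuning $w_0$ through $w_B$ reduces to fitting a linear model on top of them.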

Depending on the size of your data, you may take this idea one step further and also learn some of the weights inside each $f$, usually those belonging to the later layers of the CNN architecture.  

Rather than being initialized randomly, these weights are initialized with the values learned by the pre-trained model and then tuned further on your own data.
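
A minimal sketch of this second strategy follows, again in NumPy. Everything here is illustrative: the two-layer "pre-trained" network is a hypothetical stand-in with random weights (in practice these would be loaded from a trained CNN), the data are synthetic, and the least squares cost, step length, and names such as `pretrained` and `fine_tune` are assumptions. The earlier layer `W1` stays frozen, while the later layer `W2` starts from its pre-trained values and is updated together with the new final-layer weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, H, B = 20, 32, 16

# Hypothetical "pre-trained" weights standing in for a trained network.
pretrained = {
    "W1": rng.normal(size=(N_INPUT, H)) / np.sqrt(N_INPUT),  # earlier layer
    "W2": rng.normal(size=(H, B)) / np.sqrt(H),              # later layer
}


def relu(z):
    return np.maximum(z, 0.0)


def fine_tune(X, y, steps=300, lr=0.01):
    W1 = pretrained["W1"]            # frozen: never updated below
    W2 = pretrained["W2"].copy()     # initialized from the pre-trained values
    w0, w = 0.0, np.zeros(B)         # new final layer, learned from scratch
    H1 = relu(X @ W1)                # frozen activations, computed once
    costs = []
    for _ in range(steps):
        Z2 = H1 @ W2
        F = relu(Z2)                 # current features f_1..f_B
        r = w0 + F @ w - y           # residuals
        costs.append(0.5 * np.mean(r ** 2))
        # Gradients of the least squares cost with respect to the tuned weights
        dZ2 = np.outer(r, w) * (Z2 > 0) / len(y)
        grad_W2 = H1.T @ dZ2
        grad_w = F.T @ r / len(y)
        grad_w0 = r.mean()
        W2 -= lr * grad_W2
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return W2, w0, w, costs


# Synthetic data, for illustration only
X = rng.normal(size=(200, N_INPUT))
y = relu(relu(X @ pretrained["W1"]) @ pretrained["W2"]).sum(axis=1)
W2, w0, w, costs = fine_tune(X, y)
```

How many of the later layers to unfreeze is a judgment call that depends on how much data you have: the more weights you tune, the more data is needed to avoid overfitting.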