# Pretrained CNNs in PyTorch

Welcome to the `08_pretrained_cnns` notebook. This notebook is part of a portfolio designed to showcase essential concepts and techniques in PyTorch, with a focus on utilizing pretrained convolutional neural networks (CNNs). 

Pretrained models, which have been trained on large datasets like ImageNet, offer a powerful starting point for various image recognition tasks. This notebook explores topics such as loading and exploring pretrained models, preparing and preprocessing datasets, performing inference, and comparing different models. It also covers visualization techniques for model predictions and learned features.

## Table of contents

1. [Understanding pretrained CNNs](#understanding-pretrained-cnns)
2. [Setting up the environment](#setting-up-the-environment)
3. [Loading and exploring pretrained models](#loading-and-exploring-pretrained-models)
4. [Preparing and preprocessing the dataset](#preparing-and-preprocessing-the-dataset)
5. [Performing inference with pretrained models](#performing-inference-with-pretrained-models)
6. [Comparing different pretrained models](#comparing-different-pretrained-models)
7. [Visualizing model predictions and features](#visualizing-model-predictions-and-features)
8. [Conclusion](#conclusion)

## Understanding pretrained CNNs

Pretrained CNNs are models that have already been trained on large-scale datasets, such as ImageNet, for tasks like image classification. Reusing them shortens development time and often gives better results than training from scratch, especially when the available data is limited. Pretrained CNNs are widely used in transfer learning, where the knowledge learned from one task (e.g., classifying objects in ImageNet) is applied to another, often related, task (e.g., classifying different types of medical images).

### **Why use pretrained CNNs?**

There are several key reasons to use pretrained CNNs:

- **Avoiding long training times**: Training deep CNNs from scratch is computationally expensive and can take days or even weeks. Pretrained models have already been trained on vast datasets, saving time.
- **Better performance on small datasets**: When a dataset is small, training a CNN from scratch can lead to overfitting, as the model may not generalize well. A pretrained CNN, having learned general visual features from a large dataset, can apply this knowledge to new tasks, often with better generalization.
- **Effective feature extraction**: CNNs trained on large datasets like ImageNet can learn to extract powerful features that are useful for a wide range of tasks, even when transferred to new datasets or domains.

### **Key concepts in pretrained CNNs**

#### **Transfer learning**

Transfer learning is the main concept behind using pretrained CNNs. In essence, transfer learning involves using a model trained on one task and applying it to a different but related task. There are two common approaches to transfer learning with CNNs:

- **Feature extraction**: In this approach, the convolutional layers of the pretrained CNN are used as fixed feature extractors. These layers are not fine-tuned during training; instead, the pretrained filters are used to extract features from the new dataset, and a new fully connected classifier is trained on top of these features.
- **Fine-tuning**: In this approach, the pretrained CNN is not only used as a feature extractor but is also fine-tuned on the new dataset. The entire network, or part of it, is retrained to adapt the learned features to the new task.

#### **Feature extraction vs fine-tuning**

When deciding between feature extraction and fine-tuning, several factors come into play:

- **Feature extraction** is faster since only the classifier is trained, and the pretrained convolutional layers remain unchanged. This is particularly useful when the target dataset is small or when computational resources are limited.
- **Fine-tuning**, on the other hand, allows the model to adjust the pretrained filters to the specific characteristics of the new dataset, often leading to better performance. However, it requires more computational power and can be prone to overfitting if the target dataset is small.

In both cases, the early layers of a pretrained CNN are typically retained as they capture low-level, generic features like edges and textures, which are useful for a wide range of tasks. The later layers, which learn more task-specific features, are either replaced or fine-tuned depending on the method used.

### **Popular pretrained CNN architectures**

Several CNN architectures have been widely adopted for transfer learning due to their effectiveness and availability as pretrained models:

- **AlexNet**: One of the earliest successful CNNs, AlexNet consists of five convolutional layers followed by three fully connected layers. It was trained on ImageNet and sparked the popularity of deep CNNs.
- **VGG**: VGG networks are known for their simplicity and uniformity. They consist of small filters (3x3) and deep architectures, which capture fine details in images.
- **ResNet**: Residual Networks (ResNet) introduced the idea of residual connections, which help mitigate the vanishing gradient problem in deep networks. ResNet models are highly popular for transfer learning due to their strong performance across a variety of tasks.
- **Inception (GoogLeNet)**: Inception networks utilize "Inception modules," which apply multiple convolutional filters of different sizes in parallel. This architecture allows the model to capture features at various scales, making it highly versatile.
- **DenseNet**: DenseNet networks have dense connections between layers, where each layer receives input from all preceding layers. This architecture improves the flow of gradients during training and makes better use of features learned by earlier layers.

### **Transfer learning strategies**

There are different strategies to employ when working with pretrained CNNs, depending on the size of the dataset and the similarity between the source (pretraining) and target tasks:

- **Small dataset, similar task**: Use feature extraction. Freezing the convolutional layers and training only the classifier is usually sufficient.
- **Large dataset, similar task**: Fine-tuning is more appropriate, as the large amount of data can justify retraining part or all of the model.
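- **Small dataset, dissimilar task**: Feature extraction with a new fully connected classifier is the preferred approach, because fine-tuning on little data tends to overfit. Features taken from earlier layers often transfer better in this case, since the later layers are more specific to the source task.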
- **Large dataset, dissimilar task**: Fine-tuning is beneficial since the large dataset can help the model learn new patterns and features specific to the task.

### **Advantages of pretrained CNNs**

Using pretrained CNNs has several advantages:

- **Faster convergence**: Since the model has already been trained on a large dataset, its weights are well-initialized, leading to faster convergence on new tasks.
- **Better generalization**: Pretrained models often generalize well to new tasks, especially when the tasks are related to the original pretraining task.
- **Resource efficiency**: Pretrained models reduce the computational resources needed, as they leverage the knowledge learned from previous tasks.

### **Limitations of pretrained CNNs**

Despite their advantages, pretrained CNNs also have some limitations:

- **Bias from pretraining data**: The features learned by the model are biased toward the dataset on which it was trained. For instance, models pretrained on ImageNet may not perform as well on images that look very different from ImageNet's photographs of everyday objects and animals, or on tasks other than image classification.
- **Domain adaptation**: When the target domain differs significantly from the source domain (e.g., medical images vs. natural images), the pretrained features might not be as useful. Fine-tuning may be required, but this can increase the risk of overfitting if the new dataset is small.

### **Maths**

#### **Transfer learning in CNNs**

In transfer learning, we leverage a pretrained model and adapt it to a new task. Mathematically, this involves taking the weights $ W $ learned during the pretraining phase (e.g., on ImageNet) and applying them to a new dataset. There are two main approaches:

##### **Feature extraction**

In feature extraction, the weights of the convolutional layers are frozen. These layers serve as fixed feature extractors, and only the final classification layer is retrained for the new task.

If we denote the weights of the convolutional layers as $ W_{\text{conv}} $ and the weights of the new fully connected layer as $ W_{\text{fc}} $, the forward pass during transfer learning is:

$$
Z_{\text{conv}} = f_{\text{conv}}(X; W_{\text{conv}})
$$

$$
Z_{\text{fc}} = f_{\text{fc}}(Z_{\text{conv}}; W_{\text{fc}})
$$

Here, $ X $ is the input, $ Z_{\text{conv}} $ represents the features extracted by the frozen convolutional layers, and $ Z_{\text{fc}} $ is the output from the new classifier. During training, only $ W_{\text{fc}} $ is updated, while $ W_{\text{conv}} $ remains fixed. The gradient of the loss with respect to $ W_{\text{conv}} $ is not zero in general; it is simply never computed or applied, so the update to the frozen weights is zero:

$$
\Delta W_{\text{conv}} = 0
$$

where $ \Delta W_{\text{conv}} $ is the change applied to the convolutional weights at each training step. With $ L $ denoting the loss function, the only gradients computed are those for the final classification layer:

$$
\frac{\partial L}{\partial W_{\text{fc}}} = \frac{\partial L}{\partial Z_{\text{fc}}} \cdot \frac{\partial Z_{\text{fc}}}{\partial W_{\text{fc}}}
$$
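
The split between frozen and trainable weights described above can be checked with a minimal sketch. The tiny `conv` and `fc` modules are hypothetical stand-ins for a pretrained backbone and a new classifier (no real pretrained weights are loaded); the point is only to show what PyTorch stores after a backward pass when the backbone is frozen.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for f_conv (pretrained backbone) and f_fc (new head).
conv = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
fc = nn.Linear(8, 4)

# Freeze W_conv: autograd will not compute gradients for these parameters.
for p in conv.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 3, 16, 16)   # X: a batch of two RGB images
y = torch.tensor([0, 3])        # target class indices

z_conv = conv(x)                # Z_conv = f_conv(X; W_conv)
z_fc = fc(z_conv)               # Z_fc = f_fc(Z_conv; W_fc)
loss = nn.functional.cross_entropy(z_fc, y)
loss.backward()

conv_grads = [p.grad for p in conv.parameters()]
fc_grads = [p.grad for p in fc.parameters()]
print("backbone gradients stored:", any(g is not None for g in conv_grads))
print("head gradients stored:", all(g is not None for g in fc_grads))
```

Because no gradient is ever stored for the frozen parameters, an optimizer built over `fc.parameters()` alone leaves the backbone untouched.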

##### **Fine-tuning**

In fine-tuning, some or all of the layers of the pretrained model are unfrozen, allowing their weights to be adjusted during training. This involves recalculating gradients for both the pretrained layers and the new fully connected layer. The overall loss function $ L $ now depends on both $ W_{\text{conv}} $ and $ W_{\text{fc}} $:

$$
L = L(Y, \hat{Y}(X; W_{\text{conv}}, W_{\text{fc}}))
$$

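The gradients for both sets of weights are computed by backpropagation, with $ \frac{\partial L}{\partial Z_{\text{conv}}} $ obtained by propagating the error back through the new classifier; the optimizer then uses these gradients to update the weights: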

$$
\frac{\partial L}{\partial W_{\text{conv}}} = \frac{\partial L}{\partial Z_{\text{conv}}} \cdot \frac{\partial Z_{\text{conv}}}{\partial W_{\text{conv}}}
$$

$$
\frac{\partial L}{\partial W_{\text{fc}}} = \frac{\partial L}{\partial Z_{\text{fc}}} \cdot \frac{\partial Z_{\text{fc}}}{\partial W_{\text{fc}}}
$$

The distinction between fine-tuning and feature extraction is that fine-tuning allows the model to learn new patterns by modifying pretrained weights, while feature extraction reuses the pretrained weights without modification.
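
A minimal sketch of partial fine-tuning follows, assuming a toy three-stage model in place of a real pretrained network. The stage names (`early`, `late`, `head`) and the two learning rates are illustrative assumptions, not values taken from this notebook; a smaller step for pretrained weights than for the new head is a common choice, not a requirement.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained network: two conv stages and a new head.
early = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
late = nn.Sequential(
    nn.Conv2d(8, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(16, 4)
model = nn.Sequential(early, late, head)

# Keep the generic early stage frozen; fine-tune the later stage and the head.
for p in early.parameters():
    p.requires_grad_(False)

# Assumed learning rates: smaller for pretrained weights, larger for the new head.
optimizer = torch.optim.SGD(
    [
        {"params": late.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)

early_before = early[0].weight.detach().clone()
late_before = late[0].weight.detach().clone()

x = torch.randn(4, 3, 16, 16)
y = torch.tensor([0, 1, 2, 3])

# One training step: backpropagation reaches `late` and `head`, but not `early`.
optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

early_changed = not torch.equal(early_before, early[0].weight)
late_changed = not torch.equal(late_before, late[0].weight)
print("early stage changed:", early_changed)
print("late stage changed:", late_changed)
```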

#### **Weight initialization in transfer learning**

In a typical CNN trained from scratch, the weights $ W $ are initialized randomly (e.g., using Xavier or He initialization). In transfer learning, however, the pretrained weights $ W_{\text{pretrained}} $ are used as the starting point:

$$
W = W_{\text{pretrained}} + \Delta W
$$

This weight initialization brings two benefits:

- **Faster convergence**: Since $ W_{\text{pretrained}} $ is already a good approximation (having been trained on a large dataset), the total update $ \Delta W $ required to adapt to the new task is typically much smaller than when starting from random weights, leading to faster convergence.
- **Better generalization**: Pretrained weights have already captured important low-level and mid-level features (such as edges, textures, and shapes), which often generalize well to new datasets.

#### **Pretrained CNN architectures: A mathematical view**

Different CNN architectures handle the flow of information and the learning of features in unique ways. Here are the mathematical concepts behind some of the most popular architectures used in transfer learning:

##### **ResNet (Residual Networks)**

ResNet introduced the idea of **residual connections** or **skip connections**, which are defined as:

$$
y = f(x) + x
$$

Here, $ f(x) $ represents the transformation applied by a block of layers (such as convolutional layers), and $ x $ is the input to the block. This formulation allows the network to learn residuals rather than directly mapping inputs to outputs. The residual connections help mitigate the vanishing gradient problem by allowing gradients to flow directly through the skip connections during backpropagation:

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot (1 + \frac{\partial f(x)}{\partial x})
$$

As a result, ResNet models can be much deeper without suffering from performance degradation, making them ideal for transfer learning on complex tasks.
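
The effect of the extra $ 1 $ in the gradient can be illustrated with a toy calculation. The sketch below assumes each "layer" is the scalar map $ f(x) = w x $ with a small weight, which is a deliberate simplification rather than a real convolutional block, and compares the gradient reaching the input with and without skip connections.

```python
import torch

def input_gradient(depth, weight, residual):
    """Return d(output)/d(input) for a chain of `depth` scalar layers f(h) = weight * h."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(depth):
        f = weight * h
        h = f + h if residual else f   # y = f(x) + x versus y = f(x)
    h.backward()
    return x.grad.item()

# Assumed toy values: 10 layers, each scaling its input by 0.1.
plain = input_gradient(depth=10, weight=0.1, residual=False)   # product of 0.1 terms
skip = input_gradient(depth=10, weight=0.1, residual=True)     # product of (1 + 0.1) terms

print(f"plain chain gradient:    {plain:.3e}")
print(f"residual chain gradient: {skip:.3e}")
```

Without skip connections the gradient is a product of small factors and shrinks geometrically with depth; with them, each factor is $ 1 + \frac{\partial f}{\partial x} $, so the signal survives.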

##### **VGG**

VGG networks are known for their simplicity, with layers composed of small filters (3x3) applied consecutively. For a single input channel, each 3x3 convolution computes the following (with multiple input channels, the sum also runs over the channels):

$$
S(i,j) = \sum_{m=0}^{2} \sum_{n=0}^{2} W(m,n) \cdot X(i+m,j+n) + b
$$

This uniform architecture makes VGG networks deep but computationally expensive. In transfer learning, VGG’s deep layers are excellent for capturing fine-grained details, which can be transferred to new tasks. However, their depth also means they require more computational power to fine-tune compared to shallower networks like AlexNet.
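
As a quick check of the single-channel 3x3 formula above, the sketch below evaluates the double sum directly and compares it with `torch.nn.functional.conv2d`. The input size, random values, and bias are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

X = torch.randn(6, 6)     # single-channel input
W = torch.randn(3, 3)     # one 3x3 filter
b = torch.tensor(0.5)     # bias term

# Direct evaluation of S(i, j) = sum_m sum_n W(m, n) * X(i + m, j + n) + b
S = torch.zeros(4, 4)
for i in range(4):
    for j in range(4):
        S[i, j] = (W * X[i:i + 3, j:j + 3]).sum() + b

# The same operation through PyTorch (stride 1, no padding).
# conv2d expects (batch, channels, height, width) inputs and a 1-D bias.
S_torch = F.conv2d(X[None, None], W[None, None], bias=b[None])[0, 0]

matches = torch.allclose(S, S_torch, atol=1e-5)
print("manual sum matches conv2d:", matches)
```

Note that the formula, like `conv2d`, slides the filter without flipping it, which is the cross-correlation convention used by deep learning libraries.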

##### **DenseNet**

DenseNet introduces **dense connections**, where each layer receives input from all preceding layers. If we denote the output of the $ l $-th layer as $ x_l $ (with $ x_0 $ being the input to the block), it is computed as:

$$
x_l = H_l([x_0, x_1, \dots, x_{l-1}])
$$

where $ H_l $ represents the transformation applied by the layer, and $ [x_0, x_1, \dots, x_{l-1}] $ is the concatenation of the feature maps from all preceding layers. This results in a highly connected network, where information from earlier layers is directly passed to later layers. In transfer learning, DenseNet’s dense connections improve gradient flow and make the network more efficient at reusing learned features, which can be beneficial when fine-tuning on new tasks.
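
The concatenation pattern can be sketched as a small module. `TinyDenseBlock` is a hypothetical, simplified block (a single convolution plus ReLU per layer, without the batch normalization and bottleneck layers of the published architecture), meant only to show how the channel count grows as each $ x_l $ is appended.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Simplified dense block: layer l sees the concatenation of x_0 ... x_{l-1}."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        # Layer i receives the block input plus i earlier outputs of growth_rate channels each.
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]                                    # x_0
        for layer in self.layers:                         # H_l
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)                          # x_l
        return torch.cat(features, dim=1)

block = TinyDenseBlock(in_channels=8, growth_rate=4, num_layers=3)
out = block(torch.randn(1, 8, 16, 16))
print(out.shape)   # output channels: 8 + 3 * 4 = 20
```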

#### **Gradient flow in fine-tuning**

When fine-tuning a pretrained CNN, the behavior of the gradients is crucial to understanding how the model adapts to the new task. In a standard CNN, gradients can vanish or explode as they propagate through deep layers. However, in architectures like ResNet or DenseNet, residual or dense connections help gradients flow more smoothly:

- In **ResNet**, residual connections allow gradients to bypass several layers, reducing the chance of vanishing gradients.
- In **DenseNet**, every layer in a dense block is directly connected to all later layers in that block, which gives gradients short paths back to earlier layers and supports fine-tuning even on complex tasks.

In both cases, the smooth gradient flow results in more effective fine-tuning, as the pretrained weights can be adjusted more easily without encountering significant training challenges.

## Setting up the environment

##### **Q1: How do you install the necessary libraries for working with pretrained CNNs in PyTorch?**


##### **Q2: How do you import the required modules for loading and using pretrained models in PyTorch?**


##### **Q3: How do you set the device to GPU if available, otherwise default to CPU in PyTorch?**
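
One common pattern, shown here as a minimal sketch, is to build a `torch.device` from `torch.cuda.is_available()` and move tensors (and later, models) to it. The example tensor is arbitrary.

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and models are moved to the chosen device with .to(device).
x = torch.ones(2, 3).to(device)
print(f"Using device: {device}; tensor lives on: {x.device}")
```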


##### **Q4: How do you check the current version of PyTorch installed in your environment?**

## Loading and exploring pretrained models

##### **Q5: How do you load a pretrained model like ResNet-50 from torchvision in PyTorch?**


##### **Q6: How do you inspect the architecture of a loaded pretrained model in PyTorch?**


##### **Q7: How do you modify the input layer of a pretrained model to match the input dimensions of your dataset?**


##### **Q8: How do you extract and print out the names and shapes of all layers in a pretrained model using PyTorch?**


##### **Q9: How do you load a pretrained model with its weights frozen, so they are not updated during training?**


##### **Q10: How do you replace the final fully connected layer of a pretrained model to fit the number of classes in your dataset?**

## Preparing and preprocessing the dataset

##### **Q11: How do you load and preprocess an image dataset using `torchvision.datasets` and `transforms` to be compatible with a pretrained model?**


##### **Q12: How do you apply normalization to an image dataset to match the input requirements of a pretrained model in PyTorch?**


##### **Q13: How do you resize all images in your dataset to the required input size for a pretrained model in PyTorch?**


##### **Q14: How do you create DataLoaders for training and validation datasets in PyTorch?**


##### **Q15: How do you apply data augmentation techniques such as random horizontal flip and random crop to your training dataset in PyTorch?**

## Performing inference with pretrained models

##### **Q16: How do you perform inference on a single image using a pretrained model in PyTorch?**


##### **Q17: How do you batch process multiple images for inference using a pretrained model in PyTorch?**


##### **Q18: How do you decode the output of a pretrained model to obtain the predicted class labels?**


##### **Q19: How do you calculate and display the top-5 predicted classes for an image using a pretrained model in PyTorch?**
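
A sketch of the top-5 step on its own, assuming the model has already produced a row of logits. The random `logits` and the placeholder `class_names` list are hypothetical; in practice the logits would come from `model(batch)` and the names from the label list that accompanies the pretrained weights.

```python
import torch

torch.manual_seed(0)

# Hypothetical logits standing in for model(batch): 1 image, 1000 classes.
logits = torch.randn(1, 1000)
# Hypothetical label list; real names come with the pretrained weights.
class_names = [f"class_{i}" for i in range(1000)]

# Convert logits to probabilities, then take the five largest per image.
probs = torch.softmax(logits, dim=1)
top_probs, top_idx = torch.topk(probs, k=5, dim=1)

for p, idx in zip(top_probs[0].tolist(), top_idx[0].tolist()):
    print(f"{class_names[idx]}: {p:.4f}")
```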


##### **Q20: How do you evaluate the accuracy of a pretrained model on an entire validation dataset in PyTorch?**
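
One way to compute top-1 accuracy is sketched below. `evaluate_accuracy` is a hypothetical helper name, and the small linear model and random `TensorDataset` are stand-ins so the sketch runs without a real dataset or pretrained weights.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_accuracy(model, loader, device):
    """Return the fraction of samples whose top-1 prediction matches the label."""
    model.eval()                      # disable dropout, use running batch-norm stats
    correct, total = 0, 0
    with torch.no_grad():             # no gradients needed for evaluation
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for a pretrained model and a validation set.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5)).to(device)
dataset = TensorDataset(torch.randn(20, 3, 8, 8), torch.randint(0, 5, (20,)))
loader = DataLoader(dataset, batch_size=8)

accuracy = evaluate_accuracy(model, loader, device)
print(f"Validation accuracy: {accuracy:.2%}")
```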

## Comparing different pretrained models

##### **Q21: How do you load and compare the architectures of different pretrained models such as VGG16, ResNet50, and InceptionV3 in PyTorch?**


##### **Q22: How do you measure and compare the inference time for different pretrained models on the same dataset in PyTorch?**
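
A rough timing sketch, assuming wall-clock time per forward pass is the quantity of interest. `mean_inference_time` is a hypothetical helper, the two small networks stand in for real pretrained models, and the run counts are arbitrary; measured values depend heavily on hardware and batch size.

```python
import time
import torch
import torch.nn as nn

def mean_inference_time(model, batch, runs=10, warmup=2):
    """Average seconds per forward pass, after a few untimed warm-up runs."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()   # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()   # make sure all timed work has finished
        elapsed = time.perf_counter() - start
    return elapsed / runs

# Hypothetical stand-ins for two pretrained models of different size.
small = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
large = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
)
batch = torch.randn(4, 3, 32, 32)

timings = {name: mean_inference_time(m, batch) for name, m in [("small", small), ("large", large)]}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms per batch")
```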


##### **Q23: How do you evaluate and compare the accuracy of different pretrained models on the same validation dataset in PyTorch?**


##### **Q24: How do you calculate and compare the number of parameters in different pretrained models using PyTorch?**
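
Counting parameters needs only `numel()` summed over `model.parameters()`. In the sketch below, `count_parameters` is a hypothetical helper and the two small networks are stand-ins; the same function can be applied unchanged to torchvision models.

```python
import torch.nn as nn

def count_parameters(model):
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Hypothetical stand-ins for two architectures to compare.
small = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
large = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 2))

# Freeze the first layer of `large` to show the total/trainable distinction.
for p in large[0].parameters():
    p.requires_grad = False

for name, model in [("small", small), ("large", large)]:
    total, trainable = count_parameters(model)
    print(f"{name}: {total} parameters, {trainable} trainable")
```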


##### **Q25: How do you visualize and compare the feature maps generated by different pretrained models for the same input image in PyTorch?**

## Visualizing model predictions and features

##### **Q26: How do you visualize the predictions of a pretrained model for a batch of test images in PyTorch?**


##### **Q27: How do you plot a confusion matrix for the predictions made by a pretrained model in PyTorch?**


##### **Q28: How do you visualize the activation maps of specific layers in a pretrained model using techniques like Grad-CAM in PyTorch?**


##### **Q29: How do you generate and visualize t-SNE plots for the features extracted by a pretrained model in PyTorch?**


##### **Q30: How do you overlay the Grad-CAM activation map onto the original input image to interpret the model's focus areas?**

## Conclusion