In [None]:
from os.path import join as ospj
from warnings import warn
from copy import deepcopy

In [None]:
# Our imports
from src.utils.pipeline import build_pipeline

In [None]:
### If using Colab, uncomment the two following lines to mount your Google Drive.

# from google.colab import drive
# drive.mount('/content/drive')


### If using Colab, change the PROJECT_ROOT to where you've uploaded the project.
### E.g. PROJECT_ROOT='/content/drive/MyDrive/TeamX/'
### You may also need to change the `data_dir`, `save_dir`, paths in the `cfgs/exercise_3/` configs.

PROJECT_ROOT='./'
# import sys
# sys.path.append(PROJECT_ROOT)

In [None]:
# Just so that you don't have to restart the notebook with every change.
%load_ext autoreload
%autoreload 2 

In Exercise 3, you will implement a convolutional neural network to perform image classification and explore methods to improve the training performance and generalization of these networks.
We will use the CIFAR-10 dataset as a benchmark for our networks, similar to the previous exercise. This dataset consists of 50000 training images of 32x32 resolution with 10 object classes, namely airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The task is to implement a convolutional network to classify these images using the PyTorch library. The four questions are:

- Implementing a convolutional neural network, training it, and visualizing its weights (Question 1).
- Experiment with batch normalization and early stopping (Question 2).
- Data augmentation and dropout to improve generalization (Question 3).
- Implement transfer learning from an ImageNet-pretrained model (Question 4).

Before we begin, here are a few remarks regarding the codebase for this assignment.


For every experiment, you define a config dictionary (see the dictionary in `./cfgs/exercise_3/cnn_cifar10.py`). Every config dictionary contains the configuration for
- data (e.g., batch size, shuffle, which DataModule to use, splitting)
- model (e.g., which class module to use and what arguments to pass to it)
- training (e.g., type of optimizer, lr_scheduler, n_epochs, etc.)
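
As a rough illustration of this layout, a config might look like the sketch below. The top-level keys `name`, `datamodule`, `data_args`, `trainer_module`, and `trainer` are used elsewhere in this notebook; everything else (the string placeholders for classes, `model_arch`, `model_args`, `epochs`, and the concrete values) is an assumption made for illustration, so check `cnn_cifar10.py` for the actual fields.

```python
# Hypothetical sketch of an experiment config; the real one lives in
# cfgs/exercise_3/cnn_cifar10.py and may use different keys and values.
example_experiment = dict(
    name="CIFAR10_CNN_example",      # experiment name, used for logs and checkpoints
    # --- data ---
    datamodule="CIFAR10DataModule",  # placeholder string: the real config stores the class itself
    data_args=dict(
        batch_size=200,
        shuffle=True,
        heldout_split=0.1,           # fraction of the training set held out for validation
        training=True,
    ),
    # --- model (key names assumed) ---
    model_arch="ConvNet",
    model_args=dict(num_classes=10),
    # --- training ---
    trainer_module="CNNTrainer",     # placeholder string: the real config stores the class itself
    trainer=dict(
        epochs=10,                   # assumed key name
        monitor="off",
        early_stop=0,
    ),
)

print(sorted(example_experiment))
```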

The DataModules are located at `src/data_loaders/` and they inherit from a base_data_module that handles things such as splitting the data (see `src/data_loaders/base_data_modules.py`). A sample datamodule may inherit from this class (e.g., `src/data_loaders/data_modules.py`). The main concern is that datamodule initialization should get everything ready, so that one can simply get the dataloaders for train/held-out sets from it (see `get_loader` and `get_heldout_loader` in BaseDataModule). The data augmentations are also done in a preset fashion. One defines the preset in `utils/transform_presets.py` and simply specifies the *preset key* in the config for the datamodule.

The models are defined in `src/models/` (see for instance `src/models/cnn/model.py`). These are typical PyTorch nn.Modules that we also saw in Assignment 2. They might additionally have extra methods such as `VisualizeFilter` in `model.py`.

The Trainer glues everything together. It creates the model, sets up the optimizer, lr_scheduler, etc., and has the option to `train()` or `evaluate()` a model over the given dataloaders. It also logs everything in `Logs/YOUR_EXP_NAME.log` and saves the checkpoints under `Saved/YOUR_EXP_NAME/`. Please familiarize yourself with the `__init__` and methods of both `trainers/base_trainer.py` and `trainers/cnn_trainer.py` before continuing with the assignment.

Lastly, for tracking different metrics (top(1/5) (train/val) accuracy or losses), we use a MetricTracker object defined in `src/utils/utils.py`. A single tracker keeps track of multiple metric keys and can `update()` their history by adding new values to a list. In the end, it can be used to return an average of a metric.
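
To make the idea concrete, here is a minimal sketch of such a tracker. The class name `SimpleMetricTracker` and its methods `avg()` and `result()` are hypothetical and only mirror the behavior described above; the actual `MetricTracker` in `src/utils/utils.py` may differ.

```python
class SimpleMetricTracker:
    """Hypothetical stand-in for MetricTracker; not the course implementation."""

    def __init__(self, *keys):
        # One history list per metric key, e.g. "loss", "top1", "top5".
        self.history = {key: [] for key in keys}

    def update(self, key, value):
        # Append a new observation (e.g. one per batch) to the key's history.
        self.history[key].append(value)

    def avg(self, key):
        # Mean of everything recorded so far for this key.
        values = self.history[key]
        return sum(values) / len(values) if values else 0.0

    def result(self):
        # Averages for all tracked keys, as a metric_key -> value dictionary.
        return {key: self.avg(key) for key in self.history}


tracker = SimpleMetricTracker("loss", "top1")
for loss, top1 in [(1.2, 0.50), (0.8, 0.70)]:  # made-up per-batch values
    tracker.update("loss", loss)
    tracker.update("top1", top1)
print(tracker.result())
```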


Feel free to ask questions on the forum if part of the codebase is confusing.


### Question 1: Implement Convolutional Network (10 points)

In this question, we will implement a five-layered convolutional neural network architecture as well as the loss function to train it. Refer to the comments in the code for the exact places where you need to fill in the code.

![Failed to load the image. Please view it yourself at ./data/exercise-3/fig1_resized.png](./data/exercise-3/fig1_resized.png)

Our architecture is shown in Fig. 1. It has five convolution blocks. Each block consists of convolution, max pooling, and ReLU operations, in that order. We will use 3×3 kernels in all convolutional layers. Set the padding and stride of the convolutional layers so that they maintain the spatial dimensions. Max pooling operations are done with 2×2 kernels, with a stride of 2, thereby halving the spatial resolution each time. Finally, stacking these five blocks leads to a 512 × 1 × 1 feature map. Classification is achieved by a fully connected layer. We will train convolutional neural networks on the CIFAR-10 dataset. Implement a class ConvNet to define the model described. The ConvNet takes 32 × 32 color images as inputs, has 5 hidden layers with 128, 512, 512, 512, 512 filters, and produces a 10-class classification.
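
The shape bookkeeping can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, assuming padding 1 and stride 1 for the 3×3 convolutions (the usual choice that keeps the spatial size); the helper names `conv_out` and `block_out` are made up for this example.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Standard output-size formula for a convolution or pooling window.
    return (size + 2 * padding - kernel) // stride + 1


def block_out(size):
    # A 3x3 conv with padding 1 and stride 1 keeps the size;
    # a 2x2 max pool with stride 2 then halves it.
    after_conv = conv_out(size, kernel=3, stride=1, padding=1)
    return conv_out(after_conv, kernel=2, stride=2, padding=0)


filters = [128, 512, 512, 512, 512]
size = 32  # CIFAR-10 images are 32x32
shapes = []
for channels in filters:
    size = block_out(size)
    shapes.append((channels, size, size))
print(shapes)
```

The last entry is `(512, 1, 1)`, which matches the 512 × 1 × 1 feature map fed to the fully connected layer.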

a) Please implement the above network (initialization and forward pass) in class `ConvNet` in `models/cnn/model.py`. The code to train the model is already provided in the `trainers/base_trainer.py`'s train() and `trainers/cnn_trainer`'s _train_epoch(). Train the above model and report the training and validation accuracies. (5 points)

b) Implement the method `__str__` in `models/base_model.py`, which should give a string representation of the model. The string should show the number of `trainable` parameters for each layer. This gives us a measure of model capacity. At the end, it should also print the total number of trainable parameters for the entire model. (2 points)
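
As a sanity check for your `__str__` output, the expected counts for the plain Q1 architecture can be worked out by hand. The sketch below assumes every convolution and the final linear layer have a bias term and that no batch normalization is used; the layer names (`conv1`, ..., `fc`) are placeholders. In PyTorch, the equivalent number is typically obtained by summing `p.numel()` over the parameters with `requires_grad=True`.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    # Each output channel has in_ch * kernel * kernel weights, plus one bias.
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)


def linear_params(in_features, out_features, bias=True):
    # One weight per input/output pair, plus one bias per output.
    return in_features * out_features + (out_features if bias else 0)


channels = [3, 128, 512, 512, 512, 512]  # RGB input followed by the five conv layers
counts = {
    f"conv{i + 1}": conv_params(channels[i], channels[i + 1])
    for i in range(5)
}
counts["fc"] = linear_params(512, 10)  # 512 x 1 x 1 features -> 10 classes
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total trainable parameters: {total:,}")
```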

c) Implement a function `VisualizeFilter` in `models/cnn/model.py`, which visualizes the filters of the first convolution layer implemented in Q1.a. In other words, you need to show 128 filters of size 3x3 as color images (since each filter has three input channels). Stack these 3x3 color images into one large image. You can use the `imshow` function from the `matplotlib` library to visualize the weights. See an example in Fig. 2.

![Failed to load the image. Please view it yourself at ./data/exercise-3/fig2_resized.png](./data/exercise-3/fig2_resized.png)

Compare the filters before and after training. Do you see any patterns? (3 points). Please attach your output images before and after training in a cell with your submission.

In [None]:
from cfgs.exercise_3 import cnn_cifar10
q1_config = cnn_cifar10.q1_experiment

datamodule_class = q1_config['datamodule']
data_args = q1_config['data_args']

dm = datamodule_class(**data_args)

# Based on the heldout_split in the config file, 
# the datamodule will break the dataset into two splits
train_data_loader = dm.get_loader()
valid_data_loader = dm.get_heldout_loader()

# Test loader is the same as train loader
# except that training=False, shuffle=False, and no splitting is done 
# So we use the exact config from training and just modify these arguments
test_data_args = deepcopy(data_args) # copy the args
test_data_args['training'] = False
test_data_args['shuffle'] = False
test_data_args['heldout_split'] = 0.0

# Now we initialize the test module with the modified config
test_dm = datamodule_class(**test_data_args)
test_loader = test_dm.get_loader() # and get the loader from it

In [None]:
trainer_class = q1_config['trainer_module']
trainer_cnn = trainer_class(
    config = q1_config, 
    log_dir = ospj(PROJECT_ROOT,'Logs'),
    train_loader=train_data_loader, 
    eval_loader=valid_data_loader,
)

trainer_cnn.model.VisualizeFilter()
trainer_cnn.train()
trainer_cnn.model.VisualizeFilter()

In [None]:
# Change this to the experiment you want to evaluate
path = './Saved/CIFAR10_CNN/last_model.pth'

trainer_cnn.load_model(path=path)

result = trainer_cnn.evaluate(loader=test_loader)

print(result)

After running with default config we got the following metrics:

<div align="center">

| Metric                | Training        | Validation      |
|-----------------------|-----------------|-----------------|
| Loss                  | 0.4848          | 0.62818         |
| Top 1 Accuracy        | 83.116%         | 78.84%          |
| Top 5 Accuracy        | 99.304%         | 98.76%          |

</div>

The first image shows the filters before training, and the second image shows the filters after training. The filters before training seem to be random which makes sense since the filters were initialized as such, while the filters after training have learned to detect some patterns. For example, some filters have learned to detect edges, while others have learned to detect colors.

![Before training](./results/before_training.png)

![After training](./results/after_training.png)

### Question 2: Improve training of Convolutional Networks (15 points)

a) Batch normalization is a widely used operation in neural networks that often speeds up convergence and leads to higher performance. You can read the paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” for more theoretical details.
In practice, these operations are implemented in most toolboxes, such as PyTorch and TensorFlow. Add batch normalization in the model of Q1.a (You can use PyTorch's implementation). Please keep other hyperparameters the same, but only add batch normalization. The ConvNet with batch normalization still uses the same class with Q1.a but different arguments. Check the code for details. In each block, the computations should be in the order of **[convolution -> batch normalization -> pooling -> ReLU]**. Compare the loss curves and accuracy using batch normalization to its counterpart in Q1.a. (5 points)
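
For intuition, the training-time computation of batch normalization on a single channel can be sketched in plain Python as below. This is a simplified illustration, not how you should implement it in the model (use PyTorch's `nn.BatchNorm2d` there): it ignores the running statistics used at evaluation time, and the function name and sample values are made up.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize one channel's activations across the batch, then scale and shift.
    mean = sum(values) / len(values)
    # Biased (population) variance, as used for normalization in training mode.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one channel
normalized = batch_norm(activations)
print([round(v, 3) for v in normalized])
```

After normalization the values have approximately zero mean and unit variance; the learnable `gamma` and `beta` let the network undo this if that helps.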

In order to run this experiment, please create a new config dictionary in `cnn_cifar10.py` under the name `q2a_normalization_experiment` (Hint: most of it should be similar to Q1's config). Don't forget to assign the config a new name, so that it doesn't overwrite previous experiments. Similar to the above cells, import the config and run the experiment. 

You can also add extra code to `base_trainer.py` or `cnn_trainer.py` so that they return extra information after the training is finished. For example, recall that in assignment 2's `models/twolayernet/model.py` we had a train method that would return the history of loss values, and then in the notebook the history was plotted with matplotlib. Feel free to make adjustments that let you better understand what's happening. This also applies to next questions. Right now the code only uses tensorboard and wandb for plotting (if enabled in config).

In [None]:
from cfgs.exercise_3 import cnn_cifar10
q2_config = cnn_cifar10.q2a_normalization_experiment

datamodule_class = q2_config['datamodule']
data_args = q2_config['data_args']

dm = datamodule_class(**data_args)

# Based on the heldout_split in the config file, 
# the datamodule will break the dataset into two splits
train_data_loader = dm.get_loader()
valid_data_loader = dm.get_heldout_loader()

# Test loader is the same as train loader
# except that training=False, shuffle=False, and no splitting is done 
# So we use the exact config from training and just modify these arguments
test_data_args = deepcopy(data_args) # copy the args
test_data_args['training'] = False
test_data_args['shuffle'] = False
test_data_args['heldout_split'] = 0.0

# Now we initialize the test module with the modified config
test_dm = datamodule_class(**test_data_args)
test_loader = test_dm.get_loader() # and get the loader from it

trainer_class = q2_config['trainer_module']
trainer_cnn = trainer_class(
    config = q2_config, 
    log_dir = ospj(PROJECT_ROOT,'Logs'),
    train_loader=train_data_loader, 
    eval_loader=valid_data_loader,
)

trainer_cnn.train()

In [None]:
# Change this to the experiment you want to evaluate
path = './Saved/CIFAR10_CNN_BN/last_model.pth'

trainer_cnn.load_model(path=path)

result = trainer_cnn.evaluate(loader=test_loader)
print(result)

After running with batch normalization(q2a_normalization_experiment):

<div align="center">

| Metric                | Training        | Validation      |
|-----------------------|-----------------|-----------------|
| Loss                  | 0.342           | 0.597           |
| Top 1 Accuracy        | 88.28%          | 80.96%          |
| Top 5 Accuracy        | 99.65%          | 98.94%          |

</div>

As expected, batch normalization significantly improved performance across all metrics. Both training and validation loss are lower compared to the model without batch normalization. Additionally, top-1 and top-5 accuracies are higher for both the training and validation sets in the model utilizing batch normalization.

b) Throughout training, we optimize our parameters on the training set. This does not guarantee that with every step we also improve on the validation and test sets! Hence, there is no reason for our latest training checkpoint (the checkpoint after the last epoch) to be the best one to keep. One simple idea is to save a checkpoint of the best model for the validation set throughout the training. This means that as the training proceeds, we keep checking our **validation** accuracy after each epoch (or every N epochs) and save the best model. This can mitigate overfitting: if the model overfits to the training data (and accuracy on the validation set drops), we still have access to the best model checkpoint! Note that you **should not** do this on the test set, as we are not allowed to optimize **anything** (including the checkpoint selection) on the test set.

For this task, you need to add the logic for saving the `best model` during training. In `src/trainers/base_trainer`, the method `train()` already has the call to `self.evaluate()`. All you need to add is code that processes the returned result (a dictionary of metric_key -> metric_value) and decides whether you should save a checkpoint of the model. If yes, then you can save a checkpoint at `self.checkpoint_dir` under `best_val_model.pth` or a similar name, using the `save_model()` method. Feel free to define additional class attributes or methods if needed. 

We also recommend adding a few prints, such as the epochs that you save the best model at. You can also use the `self.logger` object.
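
One possible shape for this logic is sketched below, outside of the trainer so that it runs on its own. The class `BestModelTracker` and the sample accuracies are hypothetical; in the real trainer the comparison would sit right after the call to `self.evaluate()`, and the print would be accompanied by a call to `save_model()`.

```python
class BestModelTracker:
    """Hypothetical helper; the assignment leaves the actual design up to you."""

    def __init__(self, metric_key="eval_top1", mode="max"):
        self.metric_key = metric_key
        self.mode = mode          # "max" for accuracies, "min" for losses
        self.best_value = None
        self.best_epoch = None

    def is_improvement(self, eval_result, epoch):
        # eval_result is a dictionary of metric_key -> metric_value.
        value = eval_result[self.metric_key]
        if self.best_value is None:
            improved = True
        elif self.mode == "max":
            improved = value > self.best_value
        else:
            improved = value < self.best_value
        if improved:
            self.best_value, self.best_epoch = value, epoch
        return improved


tracker = BestModelTracker()
val_history = [0.62, 0.71, 0.69, 0.74, 0.73]  # made-up validation accuracies
for epoch, acc in enumerate(val_history, start=1):
    if tracker.is_improvement({"eval_top1": acc}, epoch):
        # In the trainer, this is where save_model() would write best_val_model.pth.
        print(f"Saved best model at epoch {epoch} with eval_top1={acc}")
```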

Please also implement `should_evaluate()` in `trainers/base_trainer.py`, which allows the evaluation on the validation set to be done at intervals, based on the config.
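
A minimal sketch of such a check is shown below as a free function. The parameter name `eval_period` is an assumption; use whatever key the trainer config actually provides, and adapt it to a method on the trainer.

```python
def should_evaluate(epoch, eval_period=1):
    # Evaluate on epochs eval_period, 2 * eval_period, ... (epochs counted from 1).
    return eval_period > 0 and epoch % eval_period == 0


eval_epochs = [epoch for epoch in range(1, 11) if should_evaluate(epoch, eval_period=3)]
print(eval_epochs)
```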


Increase the training epochs to 50 in Q1.a and Q2.a (simply edit their config dictionaries), and compare the **best model** and **latest model** on the **training set** and **validation set**. Due to the randomness, you can train multiple times to verify and observe overfitting and early stopping. (5 points)


Feel free to add any needed train/evaluation code below for this task. 

In [None]:
from cfgs.exercise_3 import cnn_cifar10

configs = [cnn_cifar10.q1_experiment, cnn_cifar10.q2a_normalization_experiment]

for config in configs: build_pipeline(config)[0].train()

We indeed observed the phenomenon of overfitting. In the first experiment, where no batch normalization was used, the model clearly overfitted the training data, as evidenced by an increasing validation loss. However, in the second experiment with batch normalization, overfitting was mitigated. The model's validation loss plateaued at a lower value, suggesting that batch normalization helped prevent the model from fitting too closely to the training data, thus improving its generalization capability. The better generalization is also backed up by the performance metrics obtained from the experiments: the latest model with batch normalization achieved a validation top-1 accuracy of 83.26%, while the latest model without batch normalization reached 79.82%. This improvement of 3.44 percentage points underscores the effectiveness of batch normalization in enhancing the model's ability to generalize to unseen data. 

In the first experiment, the following metrics were obtained for the best and latest models:

<div align="center">

| Metric                         | Best Model      | Latest Model     |
|--------------------------------|-----------------|------------------|
| Training Loss                  | 0.04            | 0.02             |
| Validation Loss                | 0.85            | 0.89             |
| Training Top 1 Accuracy        | 99.42%          | 99.99%           |
| Validation Top 1 Accuracy      | 80.36%          | 79.82%           |

</div>

In the second experiment, the following metrics were obtained for the best and latest models:

<div align="center">

| Metric                         | Best Model      | Latest Model     |
|--------------------------------|-----------------|------------------|
| Training Loss                  | 0.00270         | 0.00243          |
| Validation Loss                | 0.59842         | 0.60637          |
| Training Top 1 Accuracy        | 99.99%          | 99.99%           |
| Validation Top 1 Accuracy      | 84.22%          | 83.26%           |

</div>

In the first experiment, the best model (achieved on epoch 26) has a higher training loss but a lower validation loss than the latest model. This is expected, since the model overfits the training data. 

In the second experiment, the best model (achieved on epoch 38) has a training loss (0.00270) and validation loss (0.59842) similar to those of the latest model (training loss 0.00243, validation loss 0.60637). 

We see that after some number of epochs the model plateaus and no longer improves significantly. This is a good time to stop the training. Thus, we can use early stopping to stop the training when the model stops improving.

Note: CIFAR10_CNN_BN run also includes Batch Normalization

<img src="./results/q2b_train_loss_2.svg" alt="Training loss" width="48%"/>
<img src="./results/q2b_eval_loss_2.svg" alt="Validation loss" width="48%"/>

<img src="./results/q2b_train_top1_2.svg" alt="Training top1 accuracy" width="48%"/>
<img src="./results/q2b_eval_top1_2.svg" alt="Validation top1 accuracy" width="48%"/>


c) While in part `b` we save the best model, we still run as many epochs as indicated in the config file. This is inconvenient, as the overfitting steps waste time and compute and do not affect the best model. Hence, Early Stopping can be helpful, where we **stop** the training after a few non-improving steps! Early stopping logic should be considered after every training epoch is finished, to see whether we should do more epochs or not. Therefore, the logic should be implemented at the end of the loop over epochs in the `train()` method of `base_trainer.py` (which takes care of running multiple epochs).

Once implemented, you need a new config dictionary to enable early stopping. Simply create a new one at the bottom of `cfgs/exercise_3/cnn_cifar10.py`. It should be mostly similar to the previous config, with the following modification:
```Python
q2c_earlystop_experiment = dict(
    name = 'Some New Name', # Otherwise it will overwrite previous experiment!
    ...
    trainer = dict(
        ...
        monitor = "off", # -> change to "max eval_top1"
        early_stop = 0, #  -> change to 4
    ),
)
```
This will enable the early stopping to be considered for `eval_top1` metric and the maximum number of non-improving steps will be set to 4.
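
The patience logic itself can be sketched independently of the trainer. The class `EarlyStopper` below is hypothetical and only illustrates one way to interpret the `monitor` string and the `early_stop` count from the config; the accuracy values are made up.

```python
class EarlyStopper:
    """Hypothetical helper illustrating the patience logic; not the course code."""

    def __init__(self, monitor="max eval_top1", early_stop=4):
        # The monitor string is "<mode> <metric_key>", e.g. "max eval_top1".
        self.mode, self.metric_key = monitor.split()
        self.patience = early_stop
        self.best = None
        self.not_improved = 0

    def should_stop(self, eval_result):
        value = eval_result[self.metric_key]
        better = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if better:
            self.best = value
            self.not_improved = 0
        else:
            self.not_improved += 1
        # Stop once the metric has failed to improve `patience` times in a row.
        return self.not_improved >= self.patience


stopper = EarlyStopper()
accuracies = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69, 0.75]  # made-up values
stopped_at = None
for epoch, acc in enumerate(accuracies, start=1):
    if stopper.should_stop({"eval_top1": acc}):
        stopped_at = epoch
        break
print(stopped_at)
```

Note that in this sketch a tie with the best value does not count as an improvement, so training stops at epoch 7 and never reaches the later, better value. Whether ties should reset the counter is a design choice.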

Use the cells below to re-run one of the experiments from part `b` in which the best epoch was much lower than the total number of epochs, and see whether early stopping can prevent unnecessary training epochs in that case.

In [None]:
from cfgs.exercise_3 import cnn_cifar10

build_pipeline(cnn_cifar10.q2c_earlystop_experiment)[0].train()

As expected, the model stopped training after 4 epochs of no improvement. The training was stopped after 15 epochs, which is earlier than the 50 epochs specified in the config. This is a good example of how early stopping can prevent unnecessary training epochs and save computational resources.


<img src="./results/q2c_early_stop.svg" alt="Early stopping" width="48%"/>

### Question 3: Improve generalization of Convolutional Networks (10 points)

We saw in Q2 that the model can start over-fitting to the training set if we continue training for long. To prevent over-fitting, there are two main paradigms we can focus on. 

The first is to get more training data. This might be a difficult and expensive process. However, it is generally the most effective way to learn more general models. A cheaper alternative is to perform data augmentation. The second approach is to regularize the model to prevent overfitting. 

In the following sub-questions, we will experiment with each of these paradigms and measure the effect on the model generalization. We recommend disabling Early Stopping from previous question (simply removing it from config file) so that it does not interrupt your experiments with data augmentations and you maintain full control over number of epochs.



a) Data augmentation is the process of creating more training data by applying certain transformations to the training set images. Usually, the underlying assumption is that the label of the image does not change under the applied transformations. This includes geometric transformations like translation, rotation, scaling, flipping, random cropping, and color transformations like greyscale, colorjitter. For every image in the training batch, a random transformation is sampled from the possible ones (e.g., a random number of pixels to translate the image by) and is applied to the image. While designing the data input pipeline, we must choose the hyper-parameters for these transformations (e.g., limits of translation or rotation) based on things we expect to see in the test-set/real world. Your task in this question is to implement the data augmentation for the CIFAR-10 classification task. Many of these transformations are implemented in the `torchvision.transforms` package. Familiarize yourself with the APIs of these transforms, and functions to compose multiple transforms or randomly sample them. Next, implement geometric and color space data augmentations for the CIFAR-10 dataset, by choosing the right functions and order of application. Tune the hyper-parameters of these data augmentations to improve the validation performance. You will need to train the model a bit longer (20-30 epochs) with data augmentation, as the training data is effectively larger now. Discuss which augmentations work well for you in the report. (6 points)

Create as many config dictionaries as you need in `cnn_cifar10.py`. For every augmentation, simply create a new preset under `src/utils/transform_presets.py` and reference its name in your experiment's config dict.



b) Dropout is a popular scheme to regularize the model to improve generalization. The dropout layer works by randomly setting input activations to zero. You can implement Dropout by adding the `torch.nn.Dropout` layer between the conv blocks in your model. The layer has a single hyper-parameter $p$, which is the probability of dropping the input activations. High values of $p$ regularize the model heavily and decrease model capacity, but with low values, the model might overfit. Find the right hyper-parameter for $p$ by training the model for different values of $p$ and comparing training and validation accuracies. You can use the same parameter $p$ for all layers. You can also disable the data augmentation from the previous step while running this experiment, to clearly see the benefit of dropout. Show the plot of training and validation accuracies for different values of dropout (0.1 - 0.9) in the report. Create as many config dictionaries as you need in `cnn_cifar10.py`. (4 points)
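
To see what the layer does numerically, here is a small pure-Python sketch of (inverted) dropout, the variant in which surviving activations are rescaled by $1/(1-p)$ during training, as `torch.nn.Dropout` does. The function below is an illustration only and assumes $p < 1$; in the model you should use the PyTorch layer.

```python
import random


def dropout(values, p, rng, training=True):
    # At evaluation time dropout is the identity.
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    # Zero each activation with probability p; rescale survivors
    # so that the expected value of each activation is unchanged.
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.3, rng=rng)
zero_fraction = sum(v == 0.0 for v in dropped) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print(zero_fraction)  # close to p
print(mean_value)     # close to the original mean of 1.0
```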

In [1]:
# REMOVE
from os.path import join as ospj
from warnings import warn
from copy import deepcopy
from src.utils.pipeline import build_pipeline


# Data Augmentation Experiment
from cfgs.exercise_3 import cnn_cifar10

configs = [cnn_cifar10.q3a_aug3_experiment] # , cnn_cifar10.q3a_aug2_experiment, cnn_cifar10.q3a_aug3_experiment]

for config in configs: build_pipeline(config)[0].train()

transforms for preset CIFAR10_col_aug for split train are Compose(
    RandomApply(
    p=0.5
    ColorJitter(brightness=(0.5, 1.5), contrast=(0.5, 1.5), saturation=(0.5, 1.5), hue=(-0.1, 0.1))
)
    RandomGrayscale(p=0.5)
    RandomEqualize(p=0.5)
    ToTensor()
    Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
)
Files already downloaded and verified
Initialization DataLoader for 45000 samples with {'batch_size': 200, 'shuffle': True, 'num_workers': 6}
Initialization heldout DataLoader 5000 samples with {'batch_size': 200, 'shuffle': False, 'num_workers': 6}
transforms for preset CIFAR10_col_aug for split eval are Compose(
    ToTensor()
    Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
)
Files already downloaded and verified
Initialization DataLoader for 10000 samples with {'batch_size': 200, 'shuffle': False, 'num_workers': 6}
Freeing GPU Memory
Free: 3837 MB	Total: 3899 MB


wandb: Currently logged in as: dhimitrios-duka1 (hwga-cj). Use `wandb login --relogin` to force relogin


Train Epoch: 1 Loss: 1.6053: : 100% 45000/45000 [00:37<00:00, 1194.10it/s]
Eval Loss: 1.7511: : 100% 5000/5000 [00:02<00:00, 2304.88it/s]


Saved best model at epoch 1 with eval_top1=0.3972


Train Epoch: 2 Loss: 1.3808: : 100% 45000/45000 [00:38<00:00, 1181.72it/s]
Eval Loss: 1.1917: : 100% 5000/5000 [00:02<00:00, 2356.66it/s]


Saved best model at epoch 2 with eval_top1=0.5678


Train Epoch: 3 Loss: 1.0022: : 100% 45000/45000 [00:38<00:00, 1169.21it/s]
Eval Loss: 1.0718: : 100% 5000/5000 [00:02<00:00, 2346.67it/s]


Saved best model at epoch 3 with eval_top1=0.6201999999999999


Train Epoch: 4 Loss: 1.0404: : 100% 45000/45000 [00:38<00:00, 1156.05it/s]
Eval Loss: 0.9744: : 100% 5000/5000 [00:02<00:00, 2241.03it/s]


Saved best model at epoch 4 with eval_top1=0.6524000000000001


Train Epoch: 5 Loss: 0.9204: : 100% 45000/45000 [00:39<00:00, 1144.36it/s]
Eval Loss: 0.9022: : 100% 5000/5000 [00:02<00:00, 2122.59it/s]


Saved best model at epoch 5 with eval_top1=0.6809999999999999


Train Epoch: 6 Loss: 0.9051: : 100% 45000/45000 [00:40<00:00, 1124.65it/s]
Eval Loss: 0.8641: : 100% 5000/5000 [00:02<00:00, 2188.31it/s]


Saved best model at epoch 6 with eval_top1=0.6922


Train Epoch: 7 Loss: 0.9030: : 100% 45000/45000 [00:39<00:00, 1133.01it/s]
Eval Loss: 0.7875: : 100% 5000/5000 [00:02<00:00, 2241.35it/s]


Saved best model at epoch 7 with eval_top1=0.7069999999999999


Train Epoch: 8 Loss: 0.8988: : 100% 45000/45000 [00:39<00:00, 1138.23it/s]
Eval Loss: 0.8604: : 100% 5000/5000 [00:02<00:00, 2175.42it/s]
Train Epoch: 9 Loss: 0.6352: : 100% 45000/45000 [00:39<00:00, 1136.63it/s]
Eval Loss: 0.8070: : 100% 5000/5000 [00:02<00:00, 2132.53it/s]


Saved best model at epoch 9 with eval_top1=0.7222000000000002


Train Epoch: 10 Loss: 0.6439: : 100% 45000/45000 [00:39<00:00, 1136.07it/s]
Eval Loss: 0.7976: : 100% 5000/5000 [00:02<00:00, 2324.49it/s]
Train Epoch: 11 Loss: 0.5890: : 100% 45000/45000 [00:39<00:00, 1132.27it/s]
Eval Loss: 0.7623: : 100% 5000/5000 [00:02<00:00, 2187.34it/s]


Saved best model at epoch 11 with eval_top1=0.73


Train Epoch: 12 Loss: 0.6587: : 100% 45000/45000 [00:39<00:00, 1133.26it/s]
Eval Loss: 0.8182: : 100% 5000/5000 [00:02<00:00, 2146.04it/s]
Train Epoch: 13 Loss: 0.6476: : 100% 45000/45000 [00:39<00:00, 1128.60it/s]
Eval Loss: 0.8772: : 100% 5000/5000 [00:02<00:00, 2309.42it/s]
Train Epoch: 14 Loss: 0.6388: : 100% 45000/45000 [00:39<00:00, 1125.24it/s]
Eval Loss: 0.7133: : 100% 5000/5000 [00:02<00:00, 2284.42it/s]


Saved best model at epoch 14 with eval_top1=0.7371999999999999


Train Epoch: 15 Loss: 0.5576: : 100% 45000/45000 [00:40<00:00, 1119.27it/s]
Eval Loss: 0.7230: : 100% 5000/5000 [00:02<00:00, 2175.44it/s]
Train Epoch: 16 Loss: 0.5360: : 100% 45000/45000 [00:40<00:00, 1119.96it/s]
Eval Loss: 0.7336: : 100% 5000/5000 [00:02<00:00, 2107.57it/s]
Train Epoch: 17 Loss: 0.4583: : 100% 45000/45000 [00:40<00:00, 1122.08it/s]
Eval Loss: 0.6561: : 100% 5000/5000 [00:02<00:00, 2308.27it/s]


Saved best model at epoch 17 with eval_top1=0.7465999999999999


Train Epoch: 18 Loss: 0.4426: : 100% 45000/45000 [00:40<00:00, 1122.24it/s]
Eval Loss: 0.7368: : 100% 5000/5000 [00:02<00:00, 2283.46it/s]
Train Epoch: 19 Loss: 0.5719: : 100% 45000/45000 [00:40<00:00, 1120.93it/s]
Eval Loss: 0.7354: : 100% 5000/5000 [00:02<00:00, 2341.56it/s]
Train Epoch: 20 Loss: 0.4124: : 100% 45000/45000 [00:39<00:00, 1129.20it/s]
Eval Loss: 0.7792: : 100% 5000/5000 [00:02<00:00, 2357.63it/s]


Saved best model at epoch 20 with eval_top1=0.7466000000000002


Train Epoch: 21 Loss: 0.4760: : 100% 45000/45000 [00:39<00:00, 1132.16it/s]
Eval Loss: 0.6663: : 100% 5000/5000 [00:02<00:00, 2321.05it/s]
Train Epoch: 22 Loss: 0.4699: : 100% 45000/45000 [00:39<00:00, 1130.54it/s]
Eval Loss: 0.7371: : 100% 5000/5000 [00:02<00:00, 2408.54it/s]
Train Epoch: 23 Loss: 0.3616: : 100% 45000/45000 [00:39<00:00, 1126.18it/s]
Eval Loss: 0.7056: : 100% 5000/5000 [00:02<00:00, 2114.44it/s]
Train Epoch: 24 Loss: 0.3088: : 100% 45000/45000 [00:39<00:00, 1127.88it/s]
Eval Loss: 0.7719: : 100% 5000/5000 [00:02<00:00, 2254.81it/s]
Train Epoch: 25 Loss: 0.4499: : 100% 45000/45000 [00:39<00:00, 1127.28it/s]
Eval Loss: 0.8249: : 100% 5000/5000 [00:02<00:00, 2327.39it/s]
Train Epoch: 26 Loss: 0.3427: : 100% 45000/45000 [00:39<00:00, 1125.88it/s]
Eval Loss: 0.6902: : 100% 5000/5000 [00:02<00:00, 2352.29it/s]


Saved best model at epoch 26 with eval_top1=0.7515999999999999


Train Epoch: 27 Loss: 0.3138: : 100% 45000/45000 [00:40<00:00, 1122.66it/s]
Eval Loss: 0.7770: : 100% 5000/5000 [00:02<00:00, 2305.14it/s]
Train Epoch: 28 Loss: 0.2525: : 100% 45000/45000 [00:40<00:00, 1124.72it/s]
Eval Loss: 0.8191: : 100% 5000/5000 [00:02<00:00, 2204.72it/s]
Train Epoch: 29 Loss: 0.2109: : 100% 45000/45000 [00:40<00:00, 1118.15it/s]
Eval Loss: 0.7530: : 100% 5000/5000 [00:02<00:00, 2171.55it/s]


Saved best model at epoch 29 with eval_top1=0.7532


Train Epoch: 30 Loss: 0.2269: : 100% 45000/45000 [00:40<00:00, 1114.91it/s]
Eval Loss: 0.7151: : 100% 5000/5000 [00:02<00:00, 2187.83it/s]


0,1
epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
eval_loss,█▅▄▃▂▂▂▂▁▂▁▁▂▁▁▁▁▁▂▁▁▂▁▂▂▁▂▂▂▂
eval_top1,▁▄▅▆▇▇▇▇▇▇█▇▇█████████████████
eval_top5,▁▅▅▆▇▇▇▇████▇█████████████████
loss,█▆▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
top1,▁▃▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇██████████
top5,▁▆▇▇▇▇▇▇██████████████████████

0,1
epoch,30.0
eval_loss,0.83446
eval_top1,0.7444
eval_top5,0.9776
loss,0.2985
top1,0.8996
top5,0.9972


In [None]:
# Dropout Experiment
from cfgs.exercise_3 import cnn_cifar10

configs = cnn_cifar10.q3b_dropout_experiments

for config in configs: build_pipeline(config)[0].train()

### Data augmentation
During the augmentation experiments, we experimented with geometric, color space data augmentations and the combination of both.

For the geometric augmentations, we used random horizontal flip, random rotation, and random perspective. We got the best results using the following configuration:

```python
def get_geo_transforms():
    return [
        transforms.RandomHorizontalFlip(p=0.6),
        transforms.RandomRotation(15),
        transforms.RandomPerspective(distortion_scale=0.2, p=0.2)
    ]
```

Using this configuration, we achieved the following metrics:

<div align="center">

| Metric                | Training  | Validation |
|-----------------------|-----------|------------|
| Loss                  | 0.44      | 0.54       |
| Top 1 Accuracy        | 84.37%    | 81.58%     |
| Top 5 Accuracy        | 99.46%    | 98.92%     |

</div>

For the color space augmentations, we tried using color jitter, random grayscale, random sharpness adjustment, random autocontrast and random histogram equalization. We got the best results using the following configuration:

```python
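from torchvision import transforms

def get_col_transforms():
    # Color space augmentations applied to each training image
    return [
        # Apply the color jitter to only half of the images
        transforms.RandomApply([
            transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1)
        ], p=0.5),
        transforms.RandomGrayscale(p=0.5),
        transforms.RandomEqualize(p=0.5),
    ]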
```

Using this configuration, we achieved the following metrics:
+ Training loss: 0.39570
+ Validation loss: 0.78630

+ Training accuracy:
    + Top 1 accuracy: 86.35%
    + Top 5 accuracy: 99.54%

+ Validation accuracy:
    + Top 1 accuracy: 75.94%
    + Top 5 accuracy: 98.00%


However, color space augmentations alone did not improve the model's performance. The model overfit, especially after the 10th epoch, as the following graph shows:

<img src="./results/q3a_color_augmentation_of.svg" alt="Color augmentation" width="48%"/>
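
As a rough numeric check of the overfitting claim, the gap between training and validation top-1 accuracy can be computed from the metrics reported above. This is only a small sketch; the dictionary layout and key names are our own and not part of the codebase.

```python
# Top-1 accuracies (in %) copied from the results reported above.
results = {
    "geometric": {"train_top1": 84.37, "val_top1": 81.58},
    "color": {"train_top1": 86.35, "val_top1": 75.94},
}

# Generalization gap: training accuracy minus validation accuracy.
gaps = {
    name: round(metrics["train_top1"] - metrics["val_top1"], 2)
    for name, metrics in results.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} percentage points")
```

The color space run shows a gap of about 10.4 percentage points against about 2.8 for the geometric run, which is consistent with the stronger overfitting seen in the graph.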


Finally, we combined both geometric and color space augmentations. We achieved the following metrics:

+ Training loss: 0.00000
+ Validation loss: 0.00000

+ Training accuracy:
    + Top 1 accuracy: 100.0%
    + Top 5 accuracy: 100.0%

+ Validation accuracy:
    + Top 1 accuracy: 85.0%
    + Top 5 accuracy: 100.0%

Another interesting approach is using `AutoAugmentPolicy.CIFAR10` as the data augmentation strategy. `AutoAugmentPolicy.CIFAR10` consists of a series of learned geometric and color space augmentation sub-policies, specifically designed for the CIFAR-10 dataset. During training, the transform randomly chooses one of these sub-policies and applies it to each image.

We got the following metrics using AutoAugmentPolicy.CIFAR10:
+ Training loss: 0.54940
+ Validation loss: 0.72996

+ Training accuracy:
    + Top 1 accuracy: 80.94%
    + Top 5 accuracy: 98.38%

+ Validation accuracy:
    + Top 1 accuracy: 75.44%
    + Top 5 accuracy: 97.28%


Note: All experiments were run for 30 epochs, without additional features such as batch normalization or dropout.

### Dropout experiments

### Question 4: Use pretrained networks (10 points)

It has become standard practice in computer vision to use a pre-trained convolutional network as the backbone feature extractor and to train new layers on top for the target task. In this question, we will implement such a model. We will use the `VGG_11_bn` network from the `torchvision.models` library as our backbone network. This model has been trained on ImageNet, achieving a top-5 error rate of 10.19%. It consists of 8 convolutional layers followed by adaptive average pooling and fully-connected layers to perform the classification. We will remove the average pooling and fully-connected layers from the `VGG_11_bn` model and attach our own fully connected layers to perform the CIFAR-10 classification.

a) Instantiate a pretrained version of the `VGG_11_bn` model with ImageNet pre-trained weights. Add two fully connected layers on top, with Batch Norm and ReLU layers in between them, to build the CIFAR-10 10-class classifier. Note that you will need to set the correct mean and variance in the data-loader, to match the mean and variance the data was normalized with when the `VGG_11_bn` was trained. Train only the newly added layers while disabling gradients for the rest of the network. Each parameter in PyTorch has a `requires_grad` flag, which can be turned off to disable gradient computation for it. Get familiar with this gradient control mechanism in PyTorch and train the above model. As a reference point, you will see validation accuracies in the range of 61-65% if implemented correctly. (6 points)
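
The gradient control mechanism can be illustrated with a small sketch. To keep it self-contained and avoid downloading weights, `backbone` below is a tiny hypothetical stand-in for the pretrained VGG feature extractor, and `head` stands in for the newly added classifier; the layer sizes are arbitrary and not the ones used in the assignment.

```python
import torch
from torch import nn

# Hypothetical stand-in for the pretrained convolutional backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
)

# Newly added classifier: two linear layers with Batch Norm and ReLU in between.
head = nn.Sequential(
    nn.Linear(8 * 16 * 16, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 10),
)

# Freeze the backbone: no gradients are computed for these parameters.
for param in backbone.parameters():
    param.requires_grad = False

model = nn.Sequential(backbone, head)

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One backward pass on a dummy batch of 32x32 images.
loss = model(torch.randn(4, 3, 32, 32)).sum()
loss.backward()

frozen_grads = [p.grad for p in backbone.parameters()]
head_grads = [p.grad for p in head.parameters()]
```

After the backward pass, the frozen parameters have no gradient (`p.grad` is `None`), while the head parameters do. Note that turning off `requires_grad` does not stop Batch Norm layers from updating their running statistics in training mode; calling `.eval()` on the frozen part would be needed for that.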


b) We can see that while the ImageNet features are useful, just learning the new layers does not yield better performance than training our own network from scratch. This is due to the domain shift between the ImageNet dataset (224x224 resolution images) and the CIFAR-10 dataset (32x32 images). To improve the performance, we can fine-tune the whole network on the CIFAR-10 dataset, starting from the ImageNet initialization (set `"fine_tune"` to `true` in `vgg_cifar10.py`). To do this, enable gradient computation for the rest of the network and update all the model parameters. Additionally, train a baseline model where the entire network is trained from scratch, without loading the ImageNet weights (set `"weights"` to `None` in `vgg_cifar10.py`). Compare the two models' training curves, validation, and testing performance in the report. (4 points)


If you're using PyTorch 1, the `weights` argument may not be available (it was introduced in torchvision 0.13). In that case, replace the `weights` argument with `pretrained=True` or `pretrained=False`. Feel free to post on the forum if you have any issues.
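
One way to handle both API variants is to pick the keyword argument from the installed torchvision version. The helper below is a hypothetical sketch and not part of the codebase; it assumes torchvision 0.13 as the cutoff at which the `weights` argument became available.

```python
def pretrained_kwargs(torchvision_version: str, use_imagenet: bool) -> dict:
    """Return the keyword argument that loads (or skips) ImageNet weights."""
    # Keep only "major.minor", ignoring local suffixes such as "+cu118".
    base_version = torchvision_version.split("+")[0]
    major, minor = (int(part) for part in base_version.split(".")[:2])

    if (major, minor) >= (0, 13):
        # Newer API: weights are selected by name, None means random init.
        return {"weights": "IMAGENET1K_V1" if use_imagenet else None}
    # Older API: a boolean flag.
    return {"pretrained": use_imagenet}


new_api = pretrained_kwargs("0.15.2+cu118", use_imagenet=True)
old_api = pretrained_kwargs("0.12.0", use_imagenet=False)
print(new_api, old_api)
```

The result could then be passed on as `torchvision.models.vgg11_bn(**pretrained_kwargs(torchvision.__version__, True))`.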

For both questions, feel free to modify the data augmentation by defining a new preset and referring to it in the config file. However, make sure that in your experiments you always change only one thing at a time (i.e., use the same augmentation for both method A and method B if you're comparing them with each other!).
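
A simple way to enforce the "one change at a time" rule is to derive every experiment config from a shared base and override a single entry. The sketch below assumes nested dictionary configs; the keys `data_args`, `model_args`, `preset`, and `fine_tune` are illustrative and may not match the actual config files.

```python
from copy import deepcopy


def make_variant(base_config: dict, name: str, key_path: tuple, value) -> dict:
    """Copy base_config and change exactly one (possibly nested) entry."""
    variant = deepcopy(base_config)  # never mutate the shared base config
    variant["name"] = name
    node = variant
    for key in key_path[:-1]:
        node = node[key]
    node[key_path[-1]] = value
    return variant


# Hypothetical base config for the frozen-backbone run.
base = {
    "name": "vgg_frozen",
    "data_args": {"batch_size": 64, "preset": "none"},
    "model_args": {"fine_tune": False},
}

# Same data pipeline, only the fine-tuning switch differs.
fine_tuned = make_variant(base, "vgg_fine_tuned", ("model_args", "fine_tune"), True)
print(fine_tuned)
```

Since the variant is a deep copy, the base config stays untouched and both runs share identical data arguments.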

In [None]:
from cfgs.exercise_3 import vgg_cifar10
q4_config = vgg_cifar10.q4_dict


datamodule_class = q4_config['datamodule']
data_args = q4_config['data_args']

dm = datamodule_class(**data_args)

# Based on the heldout_split in the config file, 
# the datamodule will break the dataset into two splits
train_data_loader = dm.get_loader()
valid_data_loader = dm.get_heldout_loader()

# Test loader is the same as train loader
# except that training=False, shuffle=False, and no splitting is done 
# So we use the exact config from training and just modify these arguments
test_data_args = deepcopy(data_args) # copy the args
test_data_args['training']=False
test_data_args['shuffle']=False
test_data_args['heldout_split']=0.0

# Now we initialize the test module with the modified config
test_dm = datamodule_class(**test_data_args)
test_loader = test_dm.get_loader() # and get the loader from it

By default, WandB is enabled in the config file `vgg_cifar10.py`. You can set it to false if you don't want to use it. It's not an essential part of the assignment anyway.

In [None]:
# NOTE: We moved this login into the base_trainer class

# wandb_enabled = q4_config['trainer_config']['wandb']
# if wandb_enabled:
#     import wandb
    
#     # change entity to your wandb username/group name. Also feel free to rename project and run names.
#     run = wandb.init(
#         project='hlcv-a3',
#         name=q4_config['name'],
#         config=q4_config,
#         entity="hwga-cj",
#         dir="./"
#     )
#     run.name = run.name + f'-{run.id}'
#     assert run is wandb.run


In [None]:
trainer_class = q4_config['trainer_module']
trainer_vgg = trainer_class(
    config = q4_config, 
    log_dir = ospj(PROJECT_ROOT,'Logs'),
    train_loader=train_data_loader, 
    eval_loader=valid_data_loader,
)

In [None]:
trainer_vgg.train()

In [None]:
# Change this to the experiment you want to evaluate
path = './Saved/CIFAR10_VGG/last_model.pth'

trainer_vgg.load_model(path=path)

result = trainer_vgg.evaluate(loader=test_loader)

print(result)

# if wandb_enabled:
#     for metrics, values in result.items():
#         wandb.run.summary[f"test_{metrics}"] = values

#     # Change the title and message as you wish.
#     # Would only work if you have enabled push notifications for your email/slack in wandb account settings.
#     # Of course not an essential part of the assignment :)
#     wandb.alert(
#         title="Training Finished",
#         text=f'VGG Training has finished. Test results: {result}', level=wandb.AlertLevel.INFO
#     )

#     run.finish()

In [None]:
from cfgs.exercise_3 import vgg_cifar10
from src.utils.pipeline import build_pipeline

configs = [vgg_cifar10.q4_dict_ft, vgg_cifar10.q4_dict_ft_nw]

for config in configs: 
    trainer_vgg, test_loader = build_pipeline(config)

    trainer_vgg.train()

    # Change this to the experiment you want to evaluate
    path = f'./Saved/{config["name"]}/best_val_model.pth'

    trainer_vgg.load_model(path=path)

    result = trainer_vgg.evaluate(loader=test_loader)

    print(f"Result for {config['name']}: {result}")

Training only the newly added classifier layers of the VGG_11_bn model, with the pretrained backbone frozen, yielded the following metrics:

+ Training loss: 0.55604
+ Validation loss: 1.14341
+ Test loss: 1.16853

+ Training accuracy:
    + Top 1 accuracy: 80.078%
    + Top 5 accuracy: 99.339%

+ Validation accuracy:
    + Top 1 accuracy: 64.972%
    + Top 5 accuracy: 96.499%

+ Test accuracy:
    + Top 1 accuracy: 64.540%
    + Top 5 accuracy: 96.198%

Fine-tuning the entire VGG_11_bn model with pre-loaded weights yielded the following metrics:

+ Training loss: 0.06656
+ Validation loss: 0.58274
+ Test loss: 0.61824

+ Training accuracy:
    + Top 1 accuracy: 97.85%
    + Top 5 accuracy: 99.98%

+ Validation accuracy:
    + Top 1 accuracy: 86.25%
    + Top 5 accuracy: 98.79%

+ Test accuracy:
    + Top 1 accuracy: 85.55%
    + Top 5 accuracy: 98.79%

Training the entire VGG_11_bn model from scratch, without pre-loaded weights, yielded the following metrics:

+ Training loss: 0.07170
+ Validation loss: 0.76685
+ Test loss: 0.79863

+ Training accuracy:
    + Top 1 accuracy: 97.53%
    + Top 5 accuracy: 99.98%

+ Validation accuracy:
    + Top 1 accuracy: 82.14%
    + Top 5 accuracy: 98.66%

+ Test accuracy:
    + Top 1 accuracy: 81.05%
    + Top 5 accuracy: 98.72%

<img src="./results/q3b_vgg_fine_tune.svg" alt="VGG11_bn fine-tuning" width="48%"/>

In [None]:
from cfgs.exercise_3 import vgg_cifar10
from src.utils.pipeline import build_pipeline

config_list = [vgg_cifar10.q4_dict_ft_list, vgg_cifar10.q4_dict_ft_nw_list]

for configs in config_list:
    for config in configs: 
        trainer_vgg, test_loader = build_pipeline(config)

        trainer_vgg.train()

        # Change this to the experiment you want to evaluate
        path = f'./Saved/{config["name"]}/best_val_model.pth'

        trainer_vgg.load_model(path=path)

        result = trainer_vgg.evaluate(loader=test_loader)

        print(f"Result for {config['name']}: {result}")