# Homework 2, *part 2*
### (60 points total)

In this part, you will build a convolutional neural network (CNN) to solve (yet another) image classification problem: the Tiny ImageNet dataset (200 classes, 100K training images, 10K validation images). Try to achieve as high accuracy as possible.

**Unlike part 1**, you are now free to use the full power of PyTorch and its subpackages.

## Deliverables

* This file.
* A "checkpoint file" `"checkpoint.pth"` that contains your CNN's weights (you get them from `model.state_dict()`). Obtain it with `torch.save(..., "checkpoint.pth")`. When grading, we will load it to evaluate your accuracy.

**Should you decide to put your `"checkpoint.pth"` on Google Drive, update (edit) the following cell with the link to it:**

### [Dear TAs, I've put my "checkpoint.pth" on Google Drive, download it here](https://drive.google.com/open?id=1unnXVbB-vK6-tW87NyYOdwJo39p_rLR6)

## Grading

* 9 points for reproducible training code and a filled report below.
* 11 points for building a network that gets above 25% accuracy.
* 4 points for using an **interactive** (please don't reinvent the wheel with `plt.plot`) tool for viewing progress, for example Tensorboard ([with this library](https://github.com/lanpa/tensorboardX) and [an extra hack for Colab](https://stackoverflow.com/a/57791702)). In this notebook, insert screenshots of accuracy and loss plots (training and validation) over iterations/epochs/time.
* 6 points for beating each of these accuracy milestones on the private **test** set:
  * 30%
  * 34%
  * 38%
  * 42%
  * 46%
  * 50%
  
*Private test set* means that you won't be able to evaluate your model on it. Rather, after you submit code and checkpoint, we will load your model and evaluate it on that test set ourselves, reporting your accuracy in a comment to the grade.

Note that there is an important formatting requirement, see below near "`DO_TRAIN = True`".

## Restrictions

* No pretrained networks.
* Don't enlarge images (e.g. don't resize them to $224 \times 224$ or $256 \times 256$).

## Tips

* **One change at a time**: never test several new things at once (unless you are super confident). Train a model, introduce one change, train again.
* Google a lot: try to reinvent as few wheels as possible (unlike in part 1 of this assignment).
* Use GPU.
* Use regularization: L2, batch normalization, dropout, data augmentation...
* Pay much attention to accuracy and loss graphs (e.g. in Tensorboard). Track failures early, stop bad experiments early.

In [1]:
# Detect if we are in Google Colaboratory
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

from pathlib import Path
# Determine the locations of auxiliary libraries and datasets.
# `AUX_DATA_ROOT` is where 'notmnist.py', 'animation.py' and 'tiny-imagenet-2020.zip' are.
if IN_COLAB:
    google.colab.drive.mount("/content/drive")
    
    # Change this if you created the shortcut in a different location
    AUX_DATA_ROOT = Path("/content/drive/My Drive/Deep Learning 2020 -- Home Assignment 2")
    
    assert AUX_DATA_ROOT.is_dir(), "Have you forgotten to 'Add a shortcut to Drive'?"
else:
    AUX_DATA_ROOT = Path(".")

The below cell puts training and validation images in `./tiny-imagenet-200/train` and `./tiny-imagenet-200/val`:

In [2]:
# Extract the dataset into the current directory
if not Path("tiny-imagenet-200/train/class_000/00000.jpg").is_file():
    import zipfile
    with zipfile.ZipFile(AUX_DATA_ROOT / 'tiny-imagenet-2020.zip', 'r') as archive:
        archive.extractall()

**You are required** to format your notebook cells so that `Run All` on a fresh notebook:
* trains your model from scratch, if `DO_TRAIN is True`;
* loads your trained model from `"./checkpoint.pth"`, then **computes** and prints its validation accuracy, if `DO_TRAIN is False`.

In [3]:
DO_TRAIN = False

## Train the model

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as torch_data
import torchvision

import matplotlib.pyplot as plt
import numpy as np
import random

In [5]:
!nvidia-smi -L

GPU 0: GeForce RTX 2070 (UUID: GPU-69831a40-2410-16e9-5f2b-087be136b9f4)


In [6]:
# WandB – Install the W&B library
%pip install wandb -q

Note: you may need to restart the kernel to use updated packages.


In [7]:
# Ignore excessive warnings: only log messages of level ERROR and above
import logging
logging.getLogger().setLevel(logging.ERROR)

# WandB – Import the wandb library
import wandb

In [8]:
class Conv2dBlock(nn.Module):
    """
    Full pre-activation.
    """

    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super(Conv2dBlock, self).__init__()

        self.in_channels  = in_channels
        self.out_channels = out_channels
        self.kernel_size  = kernel_size
        self.padding      = (self.kernel_size // 2, self.kernel_size // 2)
        self.stride       = stride

        self.layer_block  = nn.Sequential( 
                                          nn.BatchNorm2d(self.in_channels),
                                          nn.ReLU(),
                                          nn.Conv2d(self.in_channels, self.out_channels,\
                                                    kernel_size=self.kernel_size, stride=self.stride, padding=self.padding, bias=False)
                                         )
    
    def forward(self, x):
        return self.layer_block(x)

In [9]:
class ResBlock(nn.Module):

    def __init__(self, in_channels, out_channels, block_size, downsampling=False):
        super(ResBlock, self).__init__()

        self.in_channels  = in_channels
        self.out_channels = out_channels
        self.block_size   = block_size

        self.downsampling = downsampling
        if self.downsampling:
            self.stride = 2
        else:
            self.stride = 1

        layers = [Conv2dBlock(self.out_channels, self.out_channels, 3) for i in range(1, self.block_size)]

        self.block = nn.Sequential( 
                                   Conv2dBlock(self.in_channels, self.out_channels, 3, stride=self.stride),
                                   *layers
                                  )
        
        self.shortcut = nn.Sequential( 
                                      nn.Conv2d(self.in_channels, self.out_channels, kernel_size=1, stride=self.stride),
                                      nn.BatchNorm2d(self.out_channels)
                                     )
        
    def forward(self, x):
        if self.in_channels != self.out_channels:
            identity = self.shortcut(x)
        else:
            identity = x

        x  = self.block(x)
        x += identity

        return x

In [10]:
class ResNet34(nn.Module):

    def __init__(self, in_channels, num_classes):
        super(ResNet34, self).__init__()

        self.in_channels = in_channels
        self.num_classes = num_classes

        self.conv1 = Conv2dBlock(self.in_channels, 64, 3, stride=1)
        self.conv2 = nn.Sequential(
                                    #nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
                                    ResBlock(64, 64, 6)
                                  )
        self.conv3 = ResBlock(64, 128, 8, downsampling=True)
        self.conv4 = ResBlock(128, 256, 12, downsampling=True)
        self.conv5 = ResBlock(256, 512, 6, downsampling=True)

        self.fc_out = nn.Sequential( 
                                    nn.AdaptiveAvgPool2d((1,1)),
                                    nn.Flatten(),
                                    nn.Linear(512, self.num_classes),
                                    nn.LogSoftmax(dim=1)
                                   )
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.fc_out(x)
        
        return x

In [11]:
def train(net, n_epochs, optimizer, criterion, train_loader, val_loader, device,\
          scheduler=None, early_stop=None, save_dir=None):
    
    for epoch in range(1, n_epochs+1):
        train_loss, val_loss = 0.0, 0.0
        train_correct, val_correct = 0, 0

        # training
        net.train()
        for X, y in train_loader:
            # send data to device
            X, y = X.to(device), y.to(device)

            # make prediction and calculate loss
            y_prob = net(X)
            loss   = criterion(y_prob, y)
            train_loss += loss.item()

            # evaluate accuracy
            y_pred = y_prob.argmax(dim=1)
            train_correct += (y_pred == y).sum().item()

            # update model weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            del loss
        train_loss /= len(train_loader)

        # validation (gradients are not needed here, which saves memory and time)
        net.eval()
        with torch.no_grad():
            for X, y in val_loader:
                X, y = X.to(device), y.to(device)

                y_prob = net(X)
                loss   = criterion(y_prob, y)
                val_loss += loss.item()

                # evaluate accuracy
                y_pred = y_prob.argmax(dim=1)
                val_correct += (y_pred == y).sum().item()
        val_loss /= len(val_loader)

        if scheduler is not None:
            scheduler.step(val_loss)
        
        if early_stop is not None:
            early_stop.step()

        # log results
        wandb.log({
                    "Train loss" : train_loss,
                    "Validation loss" : val_loss,
                    "Train accuracy" : 100 * train_correct / len(train_loader.dataset),
                    "Validation accuracy" : 100 * val_correct / len(val_loader.dataset)
                 })
    
    if save_dir is not None:
        # save only the weights, so that `load_state_dict` can restore them later
        torch.save(net.state_dict(), save_dir)
        wandb.save(save_dir)

In [12]:
from torchvision import transforms

train_transform = transforms.Compose([
                                      transforms.RandomCrop(56),
                                      transforms.ColorJitter(brightness=(0.25, 1.5), saturation=(0.25, 1.5)),
                                      transforms.RandomChoice([transforms.RandomHorizontalFlip(),
                                                               transforms.RandomVerticalFlip()]),
                                      transforms.RandomAffine(degrees=20, scale=(0.8, 1.1), shear=10),
                                      transforms.ToTensor()
                                    ])
val_transform = transforms.Compose([transforms.ToTensor()])

train_dset = torchvision.datasets.ImageFolder(root='./tiny-imagenet-200/train', transform=train_transform)
val_dset   = torchvision.datasets.ImageFolder(root='./tiny-imagenet-200/val', transform=val_transform)
im_channels, num_classes = 3, 200
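
For intuition, the geometric part of the first augmentation can be sketched without any libraries. The snippet below only illustrates what a random 56x56 crop of a 64x64 image does; the helper name `random_crop` and the nested-list "image" are assumptions made for this example, and `torchvision`'s `RandomCrop` remains what the notebook actually uses.

```python
import random

def random_crop(image, size, rng=random):
    """Return a random size x size window of a 2-D nested-list image (hypothetical helper)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

# Toy 64x64 "image" whose pixel value encodes its own (row, column) position
image = [[(r, c) for c in range(64)] for r in range(64)]
crop = random_crop(image, 56, random.Random(0))

n_positions = (64 - 56 + 1) ** 2  # number of distinct crop offsets
print(len(crop), len(crop[0]), n_positions)
```

With 81 possible offsets, and flips, color jitter, and affine changes on top, the network rarely sees exactly the same input twice.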

In [13]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [14]:
# WandB – Login to your wandb account so you can log all your metrics
!wandb login 031af70dba88e746696d15cc5bdddf1dc268ab62

Successfully logged in to Weights & Biases!


wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\asang/.netrc


In [15]:
def fix_seed(seed=0, device=device):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device.type == 'cuda':
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

In [16]:
!nvidia-smi

Fri May 22 20:56:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.12       Driver Version: 441.12       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   39C    P8     7W /  N/A |   3209MiB /  8192MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

In [17]:
if DO_TRAIN:
    wandb.init(project="skoltech_dl_hw2")

    # dataloaders
    batch_size = 128
    train_loader = torch_data.DataLoader(train_dset, batch_size, shuffle=True, pin_memory=True)
    val_loader   = torch_data.DataLoader(val_dset, batch_size, shuffle=False, pin_memory=True)

    # fix seeds for reproducibility
    fix_seed(666, device)

    # test loader
    for X, y in train_loader:
        print(X[0].size(), y[0])
        plt.imshow(X[0,0,:,:].numpy())
        plt.show()
        break

    # model
    net = ResNet34(im_channels, num_classes)

    # training parameters
    n_epochs = 30

    optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, verbose=True, threshold=0.01)
    criterion = nn.NLLLoss()

    print("Begin training...")
    net.to(device)
    train(net, n_epochs, optimizer, criterion, train_loader, val_loader, device, scheduler, save_dir='checkpoint.pth')
    print("Training successful.")
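
The plateau rule configured in the cell above can be mimicked in a few lines of plain Python. This is a simplified sketch of how `ReduceLROnPlateau` behaves when minimizing a loss with a relative threshold and no cooldown, not the library code itself; the class name `PlateauRule` and the loss values are made up for illustration.

```python
class PlateauRule:
    """Simplified stand-in for a reduce-on-plateau schedule (hypothetical class)."""

    def __init__(self, lr, factor=0.1, patience=3, threshold=0.01):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # An epoch counts as an improvement only if the loss beats the
        # best value so far by more than the relative threshold.
        if val_loss < self.best * (1 - self.threshold):
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Decay once the number of bad epochs exceeds the patience.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

rule = PlateauRule(lr=1e-2)
made_up_losses = [3.0, 2.5, 2.49, 2.49, 2.49, 2.49, 2.0]
lrs = [rule.step(loss) for loss in made_up_losses]
print(lrs)
```

In this made-up run the loss stalls from the third epoch on, and the rate drops only on the fourth stalled epoch, since `patience=3` tolerates three of them.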

## Load and evaluate the model

In [18]:
def validate(net, criterion, val_loader, device):
    """Return (mean per-batch loss, accuracy in percent) on `val_loader`."""
    val_loss, val_correct = 0.0, 0

    net.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)

            y_prob    = net(X)
            val_loss += criterion(y_prob, y).item()

            # accuracy
            y_pred = y_prob.argmax(dim=1)
            val_correct += (y_pred == y).sum().item()
    return val_loss / len(val_loader), 100 * val_correct / len(val_loader.dataset)
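
One detail of the averaging above: `val_loss / len(val_loader)` is a mean of per-batch means, so a smaller last batch gets the same weight as a full one. With 10K validation images and `batch_size = 32`, the last batch holds only 16 images. A sample-weighted mean avoids this; the sketch below uses made-up loss values and a hypothetical helper name, and in practice the difference is usually tiny.

```python
def weighted_mean_loss(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each batch by its size (hypothetical helper)."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Made-up per-batch mean losses: two full batches of 32 and a last batch of 16
batch_losses = [1.0, 2.0, 4.0]
batch_sizes = [32, 32, 16]

per_batch_mean = sum(batch_losses) / len(batch_losses)
per_sample_mean = weighted_mean_loss(batch_losses, batch_sizes)
print(per_batch_mean, per_sample_mean)
```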

In [19]:
# Your code here (load the model from "./checkpoint.pth")
# Please use `torch.load("checkpoint.pth", map_location='cpu')`

model = ResNet34(im_channels, num_classes)
model.load_state_dict(torch.load("checkpoint.pth", map_location='cpu'))
model.to(device)
model.eval()

criterion = nn.NLLLoss()

In [20]:
# In case GPU has smaller memory, use smaller batch size
batch_size   = 32
val_loader   = torch_data.DataLoader(val_dset, batch_size, shuffle=False)

val_accuracy = validate(model, criterion, val_loader, device)[1]
assert 0 <= val_accuracy <= 100
print("Validation accuracy: %.2f%%" % val_accuracy)

Validation accuracy: 57.64%


# Report

Below, please mention:

* A brief history of tweaks and improvements.
* Which network architectures have you tried? What is the final one and why?
* What is the training method (batch size, optimization algorithm, number of iterations, ...) and why?
* Which techniques have you tried to prevent overfitting? What were their effects? Which of them worked well?
* Any other insights you learned.

For example, start with:

"I have analyzed these and those conference papers/sources/blog posts. \
I tried this and that to adapt them to my problem. \
The conclusions this task taught me are ..."

### Initial network architecture
First of all, I started by analyzing the paper by A. Canziani et al. (https://arxiv.org/abs/1605.07678), where the authors compare different CNN architectures in terms of computing time and top-1 validation accuracy on the ImageNet dataset. Although this homework uses a different, smaller version of that dataset, I thought that the paper would still give a general idea of how the networks compare. As a result, I chose the **ResNet34** architecture, which offers both a reasonable computational cost and good accuracy.<br>
Secondly, in order to understand how to implement the chosen ResNet34 network, I read the original paper on ResNets by K. He et al. (https://arxiv.org/abs/1512.03385), which presents the idea of residual learning and several ResNet architectures.<br>
While implementing the network, I used the `torchvision` ResNet implementation as a reference point whenever something was unclear.

### Training method and parameters
Final training parameters (for the latest version of the network) are as follows:<br>
1) `batch_size = 128` - mainly because the larger batch size that I used in the beginning (`batch_size = 256`) was giving an "out of memory" error, and also because this value is frequently used in papers and reports on the Tiny ImageNet dataset and is reasonable in terms of computation time;<br>
2) initially I trained the networks with the Adam optimizer, but it only worked well with `learning_rate = 1e-3` and its convergence was actually worse than that of simple SGD with momentum. Hence, I chose SGD with `learning_rate = 1e-2`, `momentum = 0.9`, and `weight_decay = 1e-4` (for $L_2$ regularization). The optimizer in the training cell uses classical momentum, since `nesterov=True` is not set there;<br>
3) I also used a learning rate scheduler: if the validation loss failed to improve on the best value achieved so far by more than 1% (`threshold=0.01`) for more than 3 epochs in a row (`patience=3`), the learning rate was multiplied by $\gamma = 0.1$;<br>
4) the number of epochs `n_epochs = 30` was chosen empirically, by observing that the validation loss usually stops improving significantly after the first learning rate decay, which takes place around 10-20 epochs into the training. This can also be seen on the following plot:

<img src="https://drive.google.com/uc?id=1m9c4E9pl6dJRHp0OPoTv9tDuI9z8Mdf1" width=800>



As can be seen from the plot, the validation loss almost stops improving about 22 epochs into the training. The rapid drop after the 20th epoch corresponds to the first learning rate decay;<br>
5) negative log-likelihood (NLL) loss was used for both training and validation, as it is one of the most widely used loss functions for multiclass classification. Combined with the `LogSoftmax` output layer of the network, it is equivalent to the cross-entropy loss.<br>
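
To make the last point concrete: NLL loss applied to log-softmax outputs is the same quantity as cross-entropy computed on the raw scores. The sketch below checks this for a single example in plain Python; the logits are made up and the function names are chosen for this illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy computed directly from softmax probabilities."""
    exps = [math.exp(x) for x in logits]
    return -math.log(exps[target] / sum(exps))

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
target = 0
loss_a = nll(log_softmax(logits), target)
loss_b = cross_entropy(logits, target)
print(loss_a, loss_b)
```

The two printed values agree up to floating-point rounding, which is why the `LogSoftmax` + `NLLLoss` pair can stand in for a cross-entropy loss on raw scores.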

### Preventing overfitting

1) to begin with, larger models with many trainable parameters are usually more prone to overfitting and "memorizing" the input data, so one of the first decisions was to choose a model with a reasonable number of parameters. While all popular networks (AlexNet, VGG, GoogLeNet, Inception, ResNet etc.) perform reasonably well on the full dataset, in this homework we are dealing with a much smaller dataset of 100,000 images, compared to over 14 million images in the original dataset (as per Wikipedia - https://en.wikipedia.org/wiki/ImageNet). This was one more reason to choose ResNet34 with around 21.4 million parameters over e.g. VGG-16 with 138 (!) million;<br>
2) continuing the topic of a smaller dataset, it should be noted that we have only 500 training images for each of the 200 classes. Even smaller networks can start overfitting heavily after a few training epochs (as evidenced in this report - http://cs231n.stanford.edu/reports/2017/pdfs/12.pdf), hence we need a way to "enlarge" our dataset, and one of the main tricks for that is data augmentation. Based on the aforementioned report, a blog post (https://learningai.io/projects/2017/06/29/tiny-imagenet.html), and one more Stanford report (http://cs231n.stanford.edu/reports/2017/pdfs/931.pdf), I chose to apply the following random changes to the images: a random crop to 56x56 pixels (from 64x64), brightness and saturation jitter, a random horizontal or vertical flip, and random affine transformations (rotation, shear, and scaling). Also, when converting images to tensors, `ToTensor` scales pixel intensities to the range [0, 1]. Data augmentation helped to reduce overfitting and increased the validation accuracy from ~35% to 38% with the original ResNet34 network;<br>
3) one more technique to combat overfitting was $L_2$ regularization (implemented through weight decay in the PyTorch optimizer), which is one of the most widespread ways of reducing overfitting.
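
The parameter count quoted in point 1 can be checked by hand. The sketch below assumes the standard ResNet-34 layout (a 7x7 stem, basic blocks in groups of 3, 4, 6, and 3, bias-free convolutions, and batch normalization with a scale and a shift per channel) with a 200-class head. It does not describe the modified network trained in this notebook, and the helper names are made up.

```python
def conv_params(c_in, c_out, kernel):
    """Weights of a bias-free 2-D convolution."""
    return c_in * c_out * kernel * kernel

def bn_params(channels):
    """Learnable scale and shift of a batch normalization layer."""
    return 2 * channels

def basic_block_params(c_in, c_out, downsample):
    """Two 3x3 conv + BN pairs, plus a 1x1 projection shortcut if needed."""
    n = conv_params(c_in, c_out, 3) + bn_params(c_out)
    n += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if downsample:
        n += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return n

def resnet34_params(num_classes):
    """Trainable parameters of a standard ResNet-34 (assumed layout)."""
    total = conv_params(3, 64, 7) + bn_params(64)  # stem
    c_in = 64
    for c_out, n_blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for i in range(n_blocks):
            needs_projection = (i == 0 and c_in != c_out)
            total += basic_block_params(c_in, c_out, needs_projection)
            c_in = c_out
    total += 512 * num_classes + num_classes  # final linear layer
    return total

print(resnet34_params(200))  # 21387272, i.e. about 21.4 million
```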


### Corrections to the original network
1) **Original ResNet34** - I was able to achieve ~38% validation accuracy after 30 epochs with the original ResNet34 structure.<br>
2) **Full pre-activation ResNet34** - in order to increase the accuracy, I analyzed the follow-up work by the authors of the original ResNet paper - https://arxiv.org/pdf/1603.05027.pdf. There they propose a different structure of a ResNet block. In the original paper the structure is CONV -> BatchNorm -> ReLU -> CONV -> BatchNorm -> Add -> ReLU. In the follow-up paper they compare several schemes and conclude that the full pre-activation scheme, BatchNorm -> ReLU -> CONV -> BatchNorm -> ReLU -> CONV -> Add, provides better accuracy, although the change is significant mostly for larger networks. I implemented the full pre-activation scheme and saw an increase of 1 percentage point in the validation accuracy after 30 epochs, leading to ~39% accuracy.<br>
3) **Tuned ResNet34** - this step was inspired by the analysis of the dataset in the following report (http://cs231n.stanford.edu/reports/2016/pdfs/411_Report.pdf), which states that one of the main problems with the Tiny ImageNet dataset is that the original "big" networks are designed for the much larger images of the original ImageNet, so they rapidly reduce the amount of information in the layers (https://github.com/tjmoon0104/Tiny-ImageNet-Classifier) through big kernels and aggressive downsampling with strides. I made slight changes to the network architecture: the kernel size of the input convolutional layer was changed from 7x7 to 3x3, and its stride was decreased from 2 to 1 (meaning that there is no downsampling). This change made a huge impact, as I was able to achieve ~48% validation accuracy, which is 9 percentage points higher than the previous result. Thus, I decided to dig a little bit more in that direction.<br>
4) **Final model** - the next move was to remove the input max pooling layer. As a result, the input to the first ResNet block changed from `16x16x64` in the original ResNet34 (after the conv1 layer and max pooling) to `64x64x64`. This increased the GPU memory usage, hence I was forced to decrease the batch size from `batch_size = 256` for the 3 networks above to `batch_size = 128`. However, removing the max pooling layer gave even better results, with 57.64% validation accuracy. The effect of changing the batch size should be minimal here, because I tested smaller batch sizes with the pre-activation ResNet34 and the effect was not significant (accuracy even decreased a bit). Hence, the removal of max pooling accounts for most of the accuracy gain here.<br>
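
The effect of the stem changes in points 3 and 4 on spatial resolution can be worked out with the usual output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below applies it to a 64x64 input; the padding values (3 for the 7x7 convolution, 1 for the 3x3 layers) and the helper names are assumptions that follow the standard ResNet stem.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def stem_output(n, conv_kernel, conv_stride, use_max_pool):
    """Size of the feature map entering the first residual stage."""
    n = out_size(n, conv_kernel, conv_stride, conv_kernel // 2)
    if use_max_pool:
        n = out_size(n, kernel=3, stride=2, padding=1)
    return n

original = stem_output(64, conv_kernel=7, conv_stride=2, use_max_pool=True)
tuned = stem_output(64, conv_kernel=3, conv_stride=1, use_max_pool=True)
final = stem_output(64, conv_kernel=3, conv_stride=1, use_max_pool=False)
print(original, tuned, final)  # 16 32 64
```

Each of the three later stages halves the size again, so for a 64x64 input the last feature map is 2x2 in the original layout and 8x8 in the final one.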

### Final model architecture

In [21]:
# Requires the third-party `torchsummary` package (`pip install torchsummary`)
from torchsummary import summary

net = ResNet34(3, 200)
summary(net, input_size=(3,64,64), device='cpu')

ModuleNotFoundError: No module named 'torchsummary'

### Plots for the models 2-4:

<img src="https://drive.google.com/uc?id=1qPdko_rc6zctJUM_xpd-oWNG0kSPllOe" width=800>

<img src="https://drive.google.com/uc?id=17UUzBHX0XyyKv3bSgfcFcrCzER6UNNJv" width=800>

<img src="https://drive.google.com/uc?id=1R5gzuFCUvrpPwIkhNpzidSdz0SgI5fyp" width=800>

<img src="https://drive.google.com/uc?id=1f8-vdD19kYx49hXJlzBk3PF2kh16yDcS" width=800>

The isolated sharp drops and rises after 20+ epochs are due to the learning rate decay from 0.01 to 0.001. As we can see, the final model outperforms the previous ones and there is no serious overfitting: the final train accuracy at the 30th epoch was ~62.8%, which is comparable to the validation accuracy of 57.64%. The same holds for the losses: 1.431 for training and 1.733 for validation after 30 epochs.

### Concluding remarks

This was a very interesting, albeit challenging, homework. In the process of doing it, I gained knowledge in several directions: data augmentation (including its application in practice and its big effect on the training process) and different CNN architectures (as a result of doing a little research while choosing a network for this task). Most importantly, I learned how large the effect of tuning a network architecture for your own task can be: a set of small adaptations to the original ResNet34 architecture increased the validation accuracy by almost 20 percentage points!