# Homework 2, *part 2*
### (60 points total)

In this part, you will build a convolutional neural network (CNN) to solve (yet another) image classification problem: the Tiny ImageNet dataset (200 classes, 100K training images, 10K validation images). Try to achieve as high accuracy as possible.

**Unlike part 1**, you are now free to use the full power of PyTorch and its subpackages.

## Deliverables

* This file.
* A "checkpoint file" `"checkpoint.pth"` that contains your CNN's weights (you get them from `model.state_dict()`). Obtain it with `torch.save(..., "checkpoint.pth")`. When grading, we will load it to evaluate your accuracy.
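
A minimal sketch of the save/load round trip described above. It assumes a tiny stand-in network (`make_model` is a hypothetical helper, not part of the assignment) and writes to a temporary directory; a real submission would save the trained CNN's `state_dict()` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Hypothetical stand-in for the real CNN: 64x64 RGB input, 200 classes.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 200))


trained = make_model()

# Save only the weights (the state dict), not the whole model object.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(trained.state_dict(), path)

# To load, rebuild the same architecture first, then restore the weights.
restored = make_model()
state = torch.load(path, map_location="cpu")
restored.load_state_dict(state)
restored.eval()

# Every tensor in the restored model should match the saved one exactly.
same = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Loading with `map_location="cpu"` lets the checkpoint be restored on a machine without a GPU, even if it was saved from CUDA tensors.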

**Should you decide to put your `"checkpoint.pth"` on Google Drive, update (edit) the following cell with the link to it:**

### [Dear TAs, I've put my "checkpoint.pth" on Google Drive, download it here](https://drive.google.com/file/d/1WLX-J-PbEicia-XTdVW-yzAO0OZ2K4bp/view?usp=sharing)

## Grading

* 9 points for reproducible training code and a filled report below.
* 11 points for building a network that gets above 25% accuracy.
* 4 points for using an **interactive** (please don't reinvent the wheel with `plt.plot`) tool for viewing progress, for example Tensorboard ([with this library](https://github.com/lanpa/tensorboardX) and [an extra hack for Colab](https://stackoverflow.com/a/57791702)). In this notebook, insert screenshots of accuracy and loss plots (training and validation) over iterations/epochs/time.
* 6 points for beating each of these accuracy milestones on the private **test** set:
  * 30%
  * 34%
  * 38%
  * 42%
  * 46%
  * 50%
  
*Private test set* means that you won't be able to evaluate your model on it. Rather, after you submit code and checkpoint, we will load your model and evaluate it on that test set ourselves, reporting your accuracy in a comment to the grade.

Note that there is an important formatting requirement, see below near "`DO_TRAIN = True`".

## Restrictions

* No pretrained networks.
* Don't enlarge images (e.g. don't resize them to $224 \times 224$ or $256 \times 256$).

## Tips

* **One change at a time**: never test several new things at once (unless you are super confident). Train a model, introduce one change, train again.
* Google a lot: try to reinvent as few wheels as possible (unlike in part 1 of this assignment).
* Use GPU.
* Use regularization: L2, batch normalization, dropout, data augmentation...
* Pay much attention to accuracy and loss graphs (e.g. in Tensorboard). Track failures early, stop bad experiments early.

In [1]:
# Detect if we are in Google Colaboratory
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

from pathlib import Path
# Determine the locations of auxiliary libraries and datasets.
# `AUX_DATA_ROOT` is where 'notmnist.py', 'animation.py' and 'tiny-imagenet-2020.zip' are.
if IN_COLAB:
    google.colab.drive.mount("/content/drive")
    
    # Change this if you created the shortcut in a different location
    AUX_DATA_ROOT = Path("/content/drive/My Drive/Deep Learning 2020 -- Home Assignment 2")
    
    assert AUX_DATA_ROOT.is_dir(), "Have you forgotten to 'Add a shortcut to Drive'?"
else:
    AUX_DATA_ROOT = Path(".")

The below cell puts training and validation images in `./tiny-imagenet-200/train` and `./tiny-imagenet-200/val`:

In [2]:
# Extract the dataset into the current directory
if not Path("tiny-imagenet-200/train/class_000/00000.jpg").is_file():
    import zipfile
    with zipfile.ZipFile(AUX_DATA_ROOT / 'tiny-imagenet-2020.zip', 'r') as archive:
        archive.extractall()

**You are required** to format your notebook cells so that `Run All` on a fresh notebook:
* trains your model from scratch, if `DO_TRAIN is True`;
* loads your trained model from `"./checkpoint.pth"`, then **computes** and prints its validation accuracy, if `DO_TRAIN is False`.

In [3]:
DO_TRAIN = True

## Train the model

In [68]:
%reload_ext tensorboard

In [69]:
import torch
import torchvision
import torch.utils.data as data
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision.datasets as datasets
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
import os
from torch.utils.tensorboard import SummaryWriter

In [70]:
os.makedirs("./logs", exist_ok=True)

%tensorboard --logdir {"./logs"}

In [75]:
writer = SummaryWriter(log_dir = 'logs/model4')

In [76]:
def train_model(model, data, size, criterion, optimizer, scheduler, epochs):

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    i = 0
    j = 0
    for epoch in range(epochs):
        print('Epoch {}/{}'.format(epoch+1, epochs))
        print('-' * 10)
        

        for stage in ['train', 'val']:
            
            running_loss = 0.0
            running_corrects = 0
            curr_loss = 0
            if stage == 'train':
                model.train()  
                for (inputs, labels) in data['train']:
                    i += 1
                    inputs = inputs.to(device)
                    labels = labels.to(device)
                    optimizer.zero_grad()
    
                    with torch.set_grad_enabled(True):
                        outputs = model(inputs)
                        prob, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)
                        loss.backward()
                        optimizer.step()
                    
                    curr_loss = loss.item() * inputs.size(0)
                    running_loss += curr_loss
                    writer.add_scalar('Train loss', curr_loss, global_step = i)
                    running_corrects += torch.sum(preds == labels.data)
                
                train_acc = running_corrects.double() / size[0]                
                train_loss = running_loss / size[0]

                print(train_loss, train_acc)
            else:
                model.eval()   
                for (inputs, labels) in data['val']:
                    j += 1
                    inputs = inputs.to(device)
                    labels = labels.to(device)
                    optimizer.zero_grad()
    
                    with torch.set_grad_enabled(False):
                        outputs = model(inputs)
                        prob, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)
                    
                    curr_loss = loss.item() * inputs.size(0)
                    running_loss += curr_loss
                    writer.add_scalar('Val loss', curr_loss, global_step = j)
                    running_corrects += torch.sum(preds == labels.data)
                eval_acc = running_corrects.double() / size[1]
                eval_loss = running_loss / size[1]
                print(eval_loss, eval_acc)
        # Step the LR scheduler once per epoch, not once per stage
        scheduler.step()

    return model
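
A note on the loss bookkeeping in `train_model`: `nn.CrossEntropyLoss` returns the mean loss over a batch by default, so the loop multiplies it by the batch size before accumulating and divides by the dataset size at the end. The plain-Python sketch below, with made-up numbers and a hypothetical helper `dataset_mean_loss`, illustrates why this weighting matters when the last batch is smaller than the others.

```python
def dataset_mean_loss(batch_mean_losses, batch_sizes):
    """Per-sample mean loss recovered from per-batch mean losses."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-sample losses: four easy samples and one hard one,
# split into a batch of 4 and a final batch of 1.
batch_means = [1.0, 5.0]
batch_sizes = [4, 1]

# Weighted by batch size: (4 * 1.0 + 1 * 5.0) / 5
weighted = dataset_mean_loss(batch_means, batch_sizes)

# Unweighted average of batch means over-counts the small batch.
naive = sum(batch_means) / len(batch_means)

print(weighted, naive)
```

With equal batch sizes the two averages coincide; they differ only when some batch (typically the last one) has a different size.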

In [77]:
data_transforms = {
    'train': transforms.Compose([
        torchvision.transforms.ColorJitter(),
        transforms.RandomResizedCrop(size=(64, 64), scale = (0.7, 1.0)),
        #transforms.CenterCrop(50),
        #transforms.Resize((64,64), interpolation=2),
        
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(35),
        transforms.ToTensor(),
    ]),
    'val': transforms.Compose([
        transforms.ToTensor(),
    ]),
}

In [78]:
image_datasets = {x: datasets.ImageFolder(os.path.join('tiny-imagenet-200', x), data_transforms[x]) for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=100, shuffle=True, num_workers=64)
              for x in ['train', 'val']}

dataset_sizes = (len(image_datasets['train']), len(image_datasets['val']))

In [79]:
model_ft = models.resnet18()
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 200)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.05, momentum=0.9, weight_decay=0.005, nesterov = True)
exp_lr_scheduler = lr_scheduler.MultiStepLR(optimizer_ft, milestones=[25, 39], gamma=0.1)

model_ft = train_model(model_ft, dataloaders, dataset_sizes, criterion, optimizer_ft, exp_lr_scheduler, 50)

Epoch 1/50
----------
4.683117496490478 tensor(0.0576, device='cuda:0', dtype=torch.float64)
4.4236318445205685 tensor(0.0773, device='cuda:0', dtype=torch.float64)
Epoch 2/50
----------
4.34192391037941 tensor(0.0906, device='cuda:0', dtype=torch.float64)
4.474054937362671 tensor(0.0781, device='cuda:0', dtype=torch.float64)
Epoch 3/50
----------
4.270411863327026 tensor(0.1009, device='cuda:0', dtype=torch.float64)
4.762742962837219 tensor(0.0619, device='cuda:0', dtype=torch.float64)
Epoch 4/50
----------
4.2413978748321535 tensor(0.1047, device='cuda:0', dtype=torch.float64)
4.526197094917297 tensor(0.0712, device='cuda:0', dtype=torch.float64)
Epoch 5/50
----------
4.218556240558624 tensor(0.1089, device='cuda:0', dtype=torch.float64)
4.522344665527344 tensor(0.0742, device='cuda:0', dtype=torch.float64)
Epoch 6/50
----------
4.210513427734375 tensor(0.1087, device='cuda:0', dtype=torch.float64)
4.955681881904602 tensor(0.0550, device='cuda:0', dtype=torch.float64)
Epoch 7/50
----

KeyboardInterrupt: 
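
For reference, the learning rate produced by a `MultiStepLR` schedule like the one configured above can be written in closed form. The sketch below is plain Python, not the PyTorch implementation; `multistep_lr` is a hypothetical helper, and `step_count` is assumed to be the number of `scheduler.step()` calls made so far.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, step_count):
    """Learning rate after `step_count` scheduler steps.

    The rate is multiplied by `gamma` once for every milestone
    that has already been reached.
    """
    reached = bisect_right(sorted(milestones), step_count)
    return base_lr * gamma ** reached


# Settings from the ResNet-18 run: lr=0.05, milestones=[25, 39], gamma=0.1
for step in (0, 24, 25, 38, 39, 49):
    print(step, multistep_lr(0.05, [25, 39], 0.1, step))
```

Milestones are counted in calls to `scheduler.step()`, so a loop that steps the scheduler more than once per epoch reaches them proportionally earlier.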

In [80]:
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(size=(64, 64), scale = (0.6, 1.0)),
        #transforms.CenterCrop(50),
        #transforms.Resize((64,64), interpolation=2),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(45),
        transforms.ToTensor(),
    ]),
    'val': transforms.Compose([
        transforms.ToTensor(),
    ]),
}

In [None]:
image_datasets = {x: datasets.ImageFolder(os.path.join('tiny-imagenet-200', x), data_transforms[x]) for x in ['train', 'val']}
dataloader = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=100, shuffle=True, num_workers=64)
              for x in ['train', 'val']}

sizes = (len(image_datasets['train']), len(image_datasets['val']))

if DO_TRAIN:
    model = models.densenet121()
    num = model.classifier.in_features
    model.classifier = nn.Linear(num, 200)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    
    optimizer = optim.SGD(model.parameters(), lr=0.07, momentum=0.9, nesterov = True, weight_decay = 0.001)
    multi_scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[25, 35], gamma=0.3)
    
    train_model(model, dataloader, sizes, criterion, optimizer, multi_scheduler, 50)
    # Save the trained weights as the required deliverable
    torch.save(model.state_dict(), "checkpoint.pth")

Epoch 1/50
----------
4.547195558309555 tensor(0.0756, device='cuda:0', dtype=torch.float64)
4.901333947181701 tensor(0.0550, device='cuda:0', dtype=torch.float64)
Epoch 2/50
----------


## Load and evaluate the model

In [8]:
image_datasets = {x: datasets.ImageFolder(os.path.join('tiny-imagenet-200', x),
                                          data_transforms[x]) for x in ['train', 'val']}

In [9]:
img_eval = torch.utils.data.DataLoader(image_datasets['val'], batch_size=100,
                                             shuffle=True, num_workers=64)

In [11]:
# Your code here (load the model from "./checkpoint.pth")
# Please use `torch.load("checkpoint.pth", map_location='cpu')`

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = models.densenet121()
num = model.classifier.in_features
model.classifier = nn.Linear(num, 200)
model.load_state_dict(torch.load("checkpoint.pth", map_location='cpu'))
model = model.to(device)
model.eval()
val_sum = 0
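with torch.no_grad():  # no gradients are needed for evaluation
    for img, labels in img_eval:
        img = img.to(device)
        labels = labels.to(device)
        _, pred_labels = torch.max(model(img), 1)
        val_sum += torch.sum(pred_labels == labels)

# Accuracy in percent over the whole validation set
val_accuracy = (val_sum.double() / len(image_datasets['val'])) * 100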


print(val_accuracy)

tensor(51.4600, device='cuda:0', dtype=torch.float64)


In [13]:
#val_accuracy = # Your code here
assert 0 <= val_accuracy <= 100
print("Validation accuracy: %.2f%%" % val_accuracy)

Validation accuracy: 51.46%


# Report

Below, please mention:

* A brief history of tweaks and improvements.
* Which network architectures have you tried? What is the final one and why?
* What is the training method (batch size, optimization algorithm, number of iterations, ...) and why?
* Which techniques have you tried to prevent overfitting? What were their effects? Which of them worked well?
* Any other insights you learned.

For example, start with:

"I have analyzed these and those conference papers/sources/blog posts. \
I tried this and that to adapt them to my problem. \
The conclusions this task taught me are ..."

From various sources and papers, I learned that standard ResNet architectures without pretraining can give decent results if the hyperparameters are selected carefully.

I tried ResNet-18 and ResNet-34 as baselines. They reached a validation accuracy of about 0.35, after which they started overfitting and plateaued.

To address this, I added several augmentations (crops, flips, rotations) and introduced weight decay in my SGD optimizer, which reduced the overfitting. Experimenting with the optimizer also gave me the idea of gradually reducing the learning rate, which improved the final validation accuracy of ResNet-18; the best result was about 0.45.

![ResNet loss](pict.png)

At that point I read about another family of networks used in image classification: DenseNets. Even though training took much longer, they gave better results from the start. Using the same tricks as with the ResNets, I reached a validation accuracy of 0.5. The last successful decision was to switch to SGD with Nesterov momentum, which gave the final boost to 0.514 validation accuracy with only slight overfitting.
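
To make the Nesterov variant concrete, here is a minimal plain-Python sketch of a single-parameter update. It follows the update form documented for `torch.optim.SGD` with `nesterov=True` and zero dampening, with weight decay left out for brevity; the helper names and the toy objective $f(x) = x^2$ are assumptions for illustration, not part of the training code.

```python
def nesterov_sgd_step(param, grad, buf, lr, momentum):
    """One SGD step with Nesterov momentum (no dampening, no weight decay)."""
    # Update the momentum buffer with the current gradient.
    buf = momentum * buf + grad
    # Nesterov: step along the gradient plus the look-ahead momentum term.
    param = param - lr * (grad + momentum * buf)
    return param, buf


def minimize_quadratic(x0=1.0, lr=0.1, momentum=0.9, steps=200):
    """Minimize f(x) = x**2 (gradient 2x) starting from x0."""
    x, buf = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x
        x, buf = nesterov_sgd_step(x, grad, buf, lr, momentum)
    return x


x_final = minimize_quadratic()
print(x_final)
```

Plain momentum would use `param - lr * buf` in the last line of the step; the extra `grad` term is the only difference in this formulation.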

Among other techniques, I also tried a cyclic learning rate, which proved unstable for CNNs. I also tried different initial learning rates; in my experience, fairly large learning rates (1e-2) gave better results than smaller ones. The best results were typically reached after about 25-30 epochs.
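
The report does not say which cyclic policy was used. As an illustration only, the sketch below implements the commonly used triangular policy in plain Python; `triangular_lr`, the bounds `base_lr` and `max_lr`, and `step_size` (iterations per half cycle) are all assumed names and values.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate.

    The rate rises linearly from base_lr to max_lr over `step_size`
    iterations, then falls back to base_lr over the next `step_size`.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical bounds: oscillate between 1e-3 and 1e-2 every 2000 iterations.
for it in (0, 500, 1000, 1500, 2000):
    print(it, triangular_lr(it, base_lr=1e-3, max_lr=1e-2, step_size=1000))
```

PyTorch offers a built-in scheduler for this, `torch.optim.lr_scheduler.CyclicLR`, which is stepped after every batch rather than every epoch.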

I experimented with batch sizes, but in my case it did not give much improvement.

In conclusion, this task taught me that even standard architectures can give relatively good results with the right set of hyperparameters and augmentations. It also taught me how to prevent overfitting and to believe in myself :)