# HA1 - Cats and dogs

<img src="https://cdn.pixabay.com/photo/2015/05/20/10/03/cat-and-dog-775116_960_720.jpg" alt="Image of cats and dogs" style="width: 500px;"/>

For this home assignment, we'll use the Kaggle dataset for the [Dogs vs. Cats competition](https://www.kaggle.com/c/dogs-vs-cats). It consists of 25,000 colour images of dogs and cats. Our goal with this dataset will be to create a classifier that can tell us if the input image is of a cat or a dog.

## Using your cloud GPU
As a way of helping you speed up the training process, each group gets access to a cloud instance with a GPU. Take a look at the [instructions folder](https://github.com/JulianoLagana/deep-machine-learning/blob/master/instructions/) to understand how to connect to an instance and use our tools there. You're free to use this limited resource as you see fit, but if you spend all your credits, you'll need a late day to obtain more (and you can only do this once).

### Strong recommendation:
To make the most of your GPU hours, first try solving the initial part of this notebook (tasks 0-3) on your own computer (these tasks can be solved on the CPU). Leave most of the available hours for solving tasks 4-5 and for refining your best model further (and, if you have the spare hours, experiment a bit!).

### Working efficiently:
Training for several epochs just to have your code break at the last validation step is incredibly frustrating and inefficient. Good practice is to first test long training runs with a much simpler dry run: a single epoch, a few batches, etc.
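
One way to set up such a dry run is to cap the number of batches drawn from the data loader. The sketch below is only an illustration: the `DRY_RUN` flag, the `limited_batches` helper and the fake loader are hypothetical names, and a plain list stands in for a real `DataLoader`.

```python
from itertools import islice


def limited_batches(loader, max_batches=None):
    """Yield at most `max_batches` batches from any iterable of batches.

    With max_batches=None the loader is iterated in full.
    """
    return islice(loader, max_batches)


# Stand-in for a real DataLoader: any iterable of (inputs, labels) works.
fake_loader = [([0.1, 0.2], [0, 1])] * 100

DRY_RUN = True  # hypothetical flag: flip to False for the real run
num_epochs = 1 if DRY_RUN else 20
max_batches = 3 if DRY_RUN else None

batches_seen = 0
for epoch in range(num_epochs):
    for inputs, labels in limited_batches(fake_loader, max_batches):
        batches_seen += 1  # the training step would go here
    # The validation step would go here, so it is exercised after a few batches.

print(batches_seen)
```

With the flag set, the whole loop, including the validation code at the end of the epoch, runs in seconds, so crashes show up before any GPU hours are spent.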

Requirements:
- Whenever we ask you to plot anything, be sure to add a title and label the axes. If you're plotting more than one curve in the same plot, also add a legend.
- When we ask you to train an architecture, train it for a reasonable number of epochs. "Reasonable" here means you should be fairly confident that training for a higher number of epochs wouldn't impact your conclusions regarding the model's performance. When experimenting, a single epoch is often enough to tell whether your model setup has improved or not.


Hints:
- If you get errors saying you've exhausted the GPU resources, well, then you've exhausted the GPU resources. However, sometimes that's because `pytorch` didn't release a part of the GPU's memory. If you think your CNN should fit in your memory during training, try restarting the kernel and directly training only that architecture.
- Every group has enough cloud credits to complete this assignment. However, this statement assumes you'll use your resources judiciously (e.g. always try the code first on your machine and make sure everything works properly before starting your instances) and **won't forget to stop your instance after using it,** otherwise you might run out of credits.
- Before starting, take a look at the images we'll be using. This is a hard task, so don't get discouraged if your first models perform poorly (several participants in the original competition didn't achieve an accuracy higher than 60%).
- Solving the [computer labs](https://github.com/JulianoLagana/deep-machine-learning/tree/master/computer-labs) is a good way to get prepared for this assignment.

---
## 0. Imports

In the following cell, add all the imports you'll use in this assignment.

In [None]:
### BEGIN SOLUTION ###
import torch
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from torchvision.transforms import Resize, ToTensor, Compose, CenterCrop
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, models, transforms

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
### END SOLUTION ###

---
## 1. Loading the data and preprocessing

The first step is to head to the [Kaggle website for the cats and dogs competition](https://www.kaggle.com/c/dogs-vs-cats/data) and download the data from there. You should download both the test and train folders together in one zip file (there is a `Download all` button at the bottom of the page). Unfortunately, you need to create a Kaggle account for this.

**Only necessary for tasks 4-6**: Downloading the data to your local computer is quite straightforward. Sooner or later you will have to upload the data to the cloud instance, and that is a bit more tricky. There are a few ways to do it:

 - Jupyter Notebook upload function. When starting the notebook server with the command `jupyter notebook` you are directed to a main page. In the top right corner there is an upload button.
 - Using [`scp`](https://linuxize.com/post/how-to-use-scp-command-to-securely-transfer-files/) to copy files via an ssh connection.
 - Using the [Kaggle CLI](https://github.com/Kaggle/kaggle-api). We have added it to the conda environment.

For this assignment we will again need data loaders. Like before we need to create a `Dataset` to give as input to a `DataLoader`. 
Fortunately, this type of image data is quite common so we get some help from `pytorch`. We can use [`ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) to create a `Dataset` for our images. As long as our folder structure for the data conforms to the folder structure expected by `ImageFolder`, we can use it right out of the box and the `DataLoader` class will happily accept it as input.

To use `ImageFolder` you should create a folder structure that resembles the following (obviously, the folder names are up to you):


         small_train             small_val                train                   val
              |                      |                      |                      |
              |                      |                      |                      |
        -------------          -------------          -------------          -------------
        |           |          |           |          |           |          |           |
        |           |          |           |          |           |          |           |
      cats        dogs       cats        dogs       cats        dogs       cats        dogs


The `small_train` and `small_val` folders have the training and validation samples for your smaller subset of the data, while the `train` and `val` folders contain all the samples you extracted from Kaggle's `train.zip`.
This is just a convenient way of having a smaller dataset to play with for faster prototyping.
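
As a rough illustration of the layout above, the following sketch creates the empty folder tree with `pathlib`. The helper name `make_image_folder_tree` is hypothetical, and the temporary directory only stands in for your project folder; copying the images into the folders is not shown.

```python
from pathlib import Path
import tempfile

SPLITS = ("small_train", "small_val", "train", "val")
CLASSES = ("cats", "dogs")


def make_image_folder_tree(root):
    """Create <root>/<split>/<class> for every split and class; return the paths."""
    created = []
    for split in SPLITS:
        for cls in CLASSES:
            folder = Path(root) / split / cls
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# A temporary directory stands in for the real project folder here.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_image_folder_tree(tmp)
    relative = sorted(f.relative_to(tmp).as_posix() for f in folders)

print(relative)
```

`ImageFolder` assigns class indices from the sorted sub-folder names, so with this layout `cats` becomes label 0 and `dogs` becomes label 1.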

We provide you a notebook that shows how to achieve this folder structure (`create_project_notebook_structure.ipynb`), starting from the original `dogs-vs-cats.zip` file that you download from Kaggle. If you do use that notebook, we encourage you to understand how each step is being done, so you can generalize this knowledge to new datasets you'll encounter.

For the smaller dataset, we advise you to use 70% of the data as training data (and thereby the remaining 30% for validation data). However, for the larger dataset, you should decide how to split between training and validation.
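
A 70/30 split can be sketched as a reproducible shuffle followed by a cut. This is only one possible approach, not necessarily what the provided notebook does; `split_filenames` and the generated file names are illustrative assumptions.

```python
import random


def split_filenames(filenames, train_fraction=0.7, seed=0):
    """Shuffle reproducibly, then cut into a training and a validation list."""
    shuffled = sorted(filenames)           # fixed starting order
    random.Random(seed).shuffle(shuffled)  # the same seed gives the same split
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# Illustrative file names; apply the split separately to each class.
cat_files = [f"cat.{i}.jpg" for i in range(1000)]
train_files, val_files = split_filenames(cat_files)

print(len(train_files), len(val_files))
```

Splitting the cat and dog files separately keeps both classes equally represented in the training and validation sets.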

**What percentage of the larger dataset did you decide to use for training?**

**Optional (1 POE):** Did you decide to keep the same ratio split between train and validation sets for the larger dataset? Motivate your decision!


Fill in the dataset paths (to be used later by your data loaders):

In [None]:
# TODO: Change the directories accordingly
train_path = "/your/path"
val_path = "/your/path"
small_train_path = "/your/path"
small_val_path = "/your/path"
### BEGIN SOLUTION
train_path = "train"
val_path = "val"
small_train_path = "small_train"
small_val_path = "small_val"
### END SOLUTION

---
### 1.1 Preprocessing
**(1 POE)** 

Once you have the expected folder structure, create two data loaders for automatically generating batches from the images in your smaller subset of data. It is here we choose how to preprocess the input data. There are multiple reasons for why we preprocess data:

- Some transformations might be needed to actually make the data work with our network (reshaping, permuting dimensions, etc.).
- Make the training more efficient by making the input dimensions smaller, e.g. resizing, cropping.
- Artificially expanding the training data through [data augmentation](https://cartesianfaith.com/2016/10/06/what-you-need-to-know-about-data-augmentation-for-machine-learning/)
- We have some clever idea of how to change the data to make the training process better.

We do not expect you to do data augmentation, but feel free to preprocess the data as you see fit.
Construct an `ImageFolder` dataset like this:

```python
ImageFolder(<path_to_data_folder>, transform=Compose(<list_of_transforms>))
# example (requires `from pathlib import Path`):
ImageFolder(Path.cwd() / "small_train", transform=Compose([ToTensor()]))
```

Hints:
- Take a look at [`ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) and [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) from the pytorch docs.
- To preprocess the data you can use the built-in pytorch [`Transforms`](https://pytorch.org/docs/stable/torchvision/transforms.html)
- The `ImageFolder` dataset provides the data as a python image type. For easy conversion to a `torch.Tensor`, use the [`ToTensor`](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.ToTensor) transformation.
- The specified `batch_size` should be chosen so that you train fast but don't run out of memory. You need to figure this out empirically; start small and increase the batch size until you run out of memory.
- The `DataLoader` constructor takes an optional argument `num_workers`, which defaults to `0` if not provided. Setting a higher number starts that many worker processes, which load batches in parallel. This can speed up training considerably.
- When feeding the images to your CNN, you'll probably want all of them to have the same spatial size, even though the .jpeg files differ in this. Resizing the images can be done using the previously mentioned built-in pytorch Transforms.
- Resizing the images to a smaller size while loading them can be beneficial. The VGG network that is used later in this assignment requires that images are at least 224x224, but before that, use small images to speed up training. The CNNs do surprisingly well on 64x64 or even 32x32 images. Shorter training cycles give you more time to experiment!
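
The "start small and increase" advice for the batch size can be sketched as a doubling search. Everything here is hypothetical scaffolding: `fits` is a callable you would write yourself (for example, one forward and backward pass wrapped in `try`/`except`, since running out of GPU memory typically surfaces as a `RuntimeError`), and `pretend_fits` merely simulates a memory limit.

```python
def largest_fitting_batch_size(fits, start=8, limit=1024):
    """Double the batch size while `fits(batch_size)` succeeds.

    Returns the largest size that fit, or None if even `start` failed.
    """
    best = None
    size = start
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best


def pretend_fits(batch_size):
    """Stand-in: pretend anything above 100 samples exhausts the memory."""
    return batch_size <= 100


print(largest_fitting_batch_size(pretend_fits))  # 64
```

Leaving some headroom below the largest size that fits is usually safer, because memory use can vary slightly between batches.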

We encourage you to explore the data and choose transformations that you believe to be useful. For exploration we provide you with some helper functions to visually compare transformations side by side:

In [None]:
def compare_transforms(transformations, index):
    """Visually compare transformations side by side.
    Takes a list of ImageFolder datasets with different compositions of transformations.
    It then displays the `index`th image of the dataset for each transformed dataset in the list.
    
    Example usage:
        compare_transforms([dataset_with_transform_1, dataset_with_transform_2], 0)
    
    Args:
        transformations (list(ImageFolder)): list of ImageFolder instances with different transformations
        index (int): Index of the sample in the ImageFolder you wish to compare.
    """
    
    # Here we combine two neat functions from basic python to validate the input to the function:
    # - `all` takes an iterable (something we can loop over, like a list) of booleans
    #    and returns True if every element is True, otherwise it returns False.
    # - `isinstance` checks whether a variable is an instance of a particular type (class)
    if not all(isinstance(transf, ImageFolder) for transf in transformations):
        raise TypeError("All elements in the `transformations` list need to be of type ImageFolder")
        
    num_transformations = len(transformations)
    fig, axes = plt.subplots(1, num_transformations)
    
    # This is just a hack to make sure that `axes` is a list of the same length as `transformations`.
    # If we only have one element in the list, `plt.subplots` will not create a list of a single axis
    # but rather just an axis without a list.
    if num_transformations == 1:
        axes = [axes]
        
    for counter, (axis, transf) in enumerate(zip(axes, transformations)):
        axis.set_title("transf: {}".format(counter))
        image_tensor = transf[index][0]
        display_image(axis, image_tensor)

    plt.show()

def display_image(axis, image_tensor):
    """Display a tensor as image
    
    Example usage:
        _, axis = plt.subplots()
        some_random_index = 453
        image_tensor, _ = train_dataset[some_random_index]
        display_image(axis, image_tensor)
    
    Args:
        axis (pyplot axis)
        image_tensor (torch.Tensor): tensor with shape (num_channels=3, height, width)
    """
    
    # See hint above
    if not isinstance(image_tensor, torch.Tensor):
        raise TypeError("The `display_image` function expects a `torch.Tensor`; " +
                        "use the `ToTensor` transformation to convert the images to tensors.")
        
    # The imshow command expects a `numpy array` with shape (height, width, 3).
    # We rearrange the dimensions with `permute` and then convert it to `numpy`
    image_data = image_tensor.permute(1, 2, 0).numpy()
    height, width, _ = image_data.shape
    axis.imshow(image_data)
    axis.set_xlim(0, width)
    # By convention when working with images, the origin is at the top left corner.
    # Therefore, we switch the order of the y limits.
    axis.set_ylim(height, 0)

In [None]:
### BEGIN SOLUTION
# 1 POE for resize, 1 POE for other preprocessing
img_width, img_height = 224, 224
batch_size = 64

transformations = [Resize(size=(img_height, img_width)), ToTensor()]  # Resize expects (height, width)
train_dataset = ImageFolder(small_train_path,transform=Compose(transformations))
train_dataloader = DataLoader(train_dataset,batch_size=batch_size,shuffle=True)

val_dataset = ImageFolder(small_val_path,transform=Compose(transformations))
val_dataloader = DataLoader(val_dataset,batch_size=batch_size,shuffle=False)  # no need to shuffle for validation


pil_type = ImageFolder(small_train_path,transform=Compose([CenterCrop(size=(img_width,img_width))]))


# Sub-optimal transformation for comparison
crop = ImageFolder(small_train_path,transform=Compose([CenterCrop(size=(img_height,img_width)),ToTensor()]))

compare_transforms([train_dataset, crop], 0)

### END SOLUTION

**(2 POE)** How did you select transformations, if any? Briefly explain your reasoning:

---
## 2. Training

**(1 POE)**

Create your first CNN architecture for this task. Start with something as simple as possible, that you're almost sure can get an accuracy better than 50% (we'll improve upon it later).
Naturally, you must also select a loss function and an optimizer.

Hints:

- Training on a CPU is slow, and in the beginning you just want to verify that your architecture actually produces a prediction with the correct shape. Do everything you can to speed up the prototyping phase, e.g. train only for a single epoch and make the images ridiculously small.
- Going from the last CNN layer to the final fully connected layer is not trivial. The convolutions produce "3D" output, which we can think of as an image with many channels, while the fully connected layer expects a row vector as input. Calculate how many output features the convolutions produce and use `.reshape` to make your tensor fit the fully connected layer. (The `.view` method is also commonly used for this; the two do essentially the same thing but differ in internal memory management.) *Hint within the hint:* remember that the fully connected layer expects a *batch* of 1D tensors.
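
The feature count mentioned in the last hint follows from the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below works it out in plain Python for an assumed architecture of two [3x3 convolution, 2x2 max pooling] stages with 10 output channels; the function names are illustrative, and your own architecture will need its own numbers.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (the same formula holds for pooling)."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_features(side, channels=10):
    """Flattened feature count after two [3x3 conv -> 2x2 max-pool] stages."""
    for _ in range(2):
        side = conv_out(side, kernel=3)            # 3x3 conv, no padding
        side = conv_out(side, kernel=2, stride=2)  # 2x2 max-pool, stride 2
    return channels * side * side


print(flat_features(224))  # 10 * 54 * 54 = 29160
print(flat_features(64))   # 10 * 14 * 14 = 1960
```

The result is the `in_features` of the first fully connected layer. In `forward`, a call such as `x.reshape(x.size(0), -1)` then flattens each sample while keeping the batch dimension.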


In [None]:
### BEGIN SOLUTION

class Net(nn.Module):
    """Dog/Cat classifier
    
    Expects square input images with side = `size: int`
    """
    def __init__(self, size):
        super(Net, self).__init__()
        final_size = self._size_reduction(size)
        # 3 input image channels (RGB), 10 output channels, 3x3 square convolution
        # kernel
        # 224 inputs -(conv)-> 222 -(pool)-> 111 -(conv)-> 109 -(pool)-> 54
        self.conv1 = nn.Conv2d(3, 10, 3)
        self.conv2 = nn.Conv2d(10, 10, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(10 * final_size * final_size, 128)  # 10 channels * final_size^2 flattened features
        self.fc2 = nn.Linear(128, 1)
        self.out = nn.Sigmoid()
    
    @staticmethod
    def _size_reduction(x):
        """Heuristic size reduction
        
        Each 3x3 convolution (no padding) shrinks the side by 2,
        each 2x2 max pooling halves it (rounding down).
        ==> floor((floor((x - 2) / 2) - 2) / 2) = floor(x/4 - 3/2)
        """
        
        return int(x/4 - 3/2)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        batch_size = x.size(0)
        x = x.reshape((batch_size, self.num_flat_features(x)))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        # The data loader returns the ground truth batch y with shape (batch_size,).
        # squeeze(1) turns our prediction from shape (batch_size, 1) -> (batch_size,)
        # and, unlike squeeze(), keeps the batch dimension when batch_size == 1.
        x = self.out(x).squeeze(1)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
    
model = Net(size=img_width)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
### END SOLUTION

Train your model using the two data loaders you created earlier. Train for a reasonable number of epochs, so as to get a good sense of how well this architecture performs.

Hints:
- Note that you will need to plot your training and validation losses and accuracies, so make sure that you saved them during training. 
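
One possible way to keep track of these curves is a small container that collects per-batch values and stores one averaged point per logging step. The `History` class below is a hypothetical helper, not part of PyTorch, and the loss values are made up for illustration; in a real loop you would pass in values such as `loss.item()`.

```python
from collections import defaultdict


class History:
    """Collects per-batch values and stores one averaged value per logging step."""

    def __init__(self):
        self._running = defaultdict(list)  # values since the last flush
        self.curves = defaultdict(list)    # one averaged point per flush

    def add(self, name, value):
        self._running[name].append(float(value))

    def flush(self):
        """Average everything added since the last flush and append it to the curves."""
        for name, values in self._running.items():
            if values:
                self.curves[name].append(sum(values) / len(values))
        self._running.clear()


history = History()
for step, loss in enumerate([0.9, 0.7, 0.6, 0.4], start=1):
    history.add("train_loss", loss)  # made-up per-batch losses
    if step % 2 == 0:                # plays the role of a logging interval
        history.flush()

print(history.curves["train_loss"])
```

The lists in `history.curves` can be passed directly to `plt.plot` once training has finished.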

In [None]:
# Any pytorch object (e.g. model, inputs, output, etc.) can 
# be transferred to the current device by running
#       name_of_object.to(device)
# Example:
#       model.to(device)
#
# The following line automatically figures out what device (cpu or gpu)
# you are using and stores the result in `device`.
# Later we can use the `.to(device)` method to move our data or model to the correct device.
device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")


### BEGIN SOLUTION  
from time import time
model.to(device)
num_epochs = 2
steps = 0
running_loss = 0
running_acc = 0
print_every = 50
train_losses, val_losses, train_accs, val_accs = [], [], [], []

start = time()
for epoch in range(num_epochs):
    # Train:   
    for batch_index, (x, y) in enumerate(train_dataloader):
        steps += 1
        inputs, labels = x.to(device), y.to(device)
        optimizer.zero_grad()
        z = model.forward(inputs)
        loss = criterion(z, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        z[z >= 0.5] = 1
        z[z < 0.5] = 0

        equals = z == labels.reshape(z.shape).float()
        running_acc += torch.mean(equals.type(torch.FloatTensor)).item()
        
        if steps % print_every == 0:
            val_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for batch_index, (x, y) in enumerate(val_dataloader):
                    inputs, labels = x.to(device), y.to(device)
                    z = model.forward(inputs)
                    batch_loss = criterion(z, labels.float())
                    val_loss += batch_loss.item()
                    z[z >= 0.5] = 1
                    z[z < 0.5] = 0
                    
                    equals = z == labels.reshape(z.shape).float()
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            train_losses.append(running_loss/print_every)
            val_losses.append(val_loss/len(val_dataloader))    
            train_accs.append(running_acc/print_every) 
            val_accs.append(accuracy/len(val_dataloader)) 
            print(f"Epoch {epoch+1}/{num_epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Train accuracy: {running_acc/print_every:.3f}.. "
                  f"Validation loss: {val_loss/len(val_dataloader):.3f}.. "
                  f"Validation accuracy: {accuracy/len(val_dataloader):.3f}")
            running_loss = 0
            running_acc = 0
            model.train()

end = time()
epoch_time = (end - start) / num_epochs
print("| {} | {:.2f} | {:.3f} |".format(img_width, epoch_time, val_accs[-1]))
### END SOLUTION

Create two plots. In one of them, plot the loss in the training and the validation datasets. In the other one, plot the accuracy in the training and validation datasets.

In [None]:
### BEGIN SOLUTION
xvalues = range(len(train_losses))
xlabel = f"Logging step (every {print_every} batches)"
plt.figure(1)
plt.subplot(211)
plt.plot(xvalues, train_losses, 'r', xvalues, val_losses, 'b')
plt.legend(["Train","Val"])
plt.title("Loss")
plt.xlabel(xlabel)
plt.ylabel("Binary cross-entropy loss")

plt.subplot(212)
plt.plot(xvalues, train_accs, 'r', xvalues, val_accs, 'b')
plt.legend(["Train","Val"])
plt.title("Accuracy")
plt.xlabel(xlabel)
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.6)

plt.show()
### END SOLUTION

**(2 POE)** Based on these, what would you suggest for improving your model? Why?

**Your answer**: (fill in here)

---
## 3. Improving your model

**(1 POE)** Continue to improve your model architecture by comparing the value of the metrics you're interested in for both the training and validation set. Try different ideas! When you're happy with one architecture, copy it in the cell below and train it here. Save the training and validation losses and accuracies. You'll use this later to compare your best model with the one using transfer learning.

**Note**: When trying different ideas, you'll end up with several different models. However, when submitting your solutions to Canvas, the cell below must contain only the definition and training of *one* model. Remove all code related to the models that were not chosen.

In [None]:
### BEGIN SOLUTION

class Net(nn.Module):
    def __init__(self, size):
        super(Net, self).__init__()
        final_size = self._size_reduction(size)
        # 3 input image channels (RGB), 10 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(3, 10, 3)
        self.conv2 = nn.Conv2d(10, 10, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(10 * final_size * final_size, 128)  # 10 channels * final_size^2 flattened features
        self.fc2 = nn.Linear(128, 1)
        self.out = nn.Sigmoid()
        
    @staticmethod
    def _size_reduction(x):
        """Heuristic size reduction
        
        Each 3x3 convolution (no padding) shrinks the side by 2,
        each 2x2 max pooling halves it (rounding down).
        ==> floor((floor((x - 2) / 2) - 2) / 2) = floor(x/4 - 3/2)
        """
        
        return int(x/4 - 3/2)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # Batch size can vary (train vs validation or rest batch at epoch end),
        # set dynamically.
        batch_size = x.size(0)
        x = x.reshape((batch_size, self.num_flat_features(x)))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        # squeeze(1) keeps the batch dimension even when batch_size == 1
        x = self.out(x).squeeze(1)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
    
model = Net(img_width)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")
model.to(device)    

num_epochs = 2
steps = 0
running_loss = 0
running_acc = 0
print_every = 50
train_losses, val_losses, train_accs, val_accs = [], [], [], []

start = time()
for epoch in range(num_epochs):
    # Train:   
    for batch_index, (x, y) in enumerate(train_dataloader):
        steps += 1
        inputs, labels = x.to(device), y.to(device)
        optimizer.zero_grad()
        z = model.forward(inputs)
        loss = criterion(z, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        z[z >= 0.5] = 1
        z[z < 0.5] = 0

        equals = z == labels.reshape(z.shape).float()
        running_acc += torch.mean(equals.type(torch.FloatTensor)).item()
        
        if steps % print_every == 0:
            val_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for batch_index, (x, y) in enumerate(val_dataloader):
                    inputs, labels = x.to(device), y.to(device)
                    z = model.forward(inputs)
                    batch_loss = criterion(z, labels.float())
                    val_loss += batch_loss.item()
                    z[z >= 0.5] = 1
                    z[z < 0.5] = 0
                    
                    equals = z == labels.reshape(z.shape).float()
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            train_losses.append(running_loss/print_every)
            val_losses.append(val_loss/len(val_dataloader))    
            train_accs.append(running_acc/print_every) 
            val_accs.append(accuracy/len(val_dataloader)) 
            print(f"Epoch {epoch+1}/{num_epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Train accuracy: {running_acc/print_every:.3f}.. "
                  f"Validation loss: {val_loss/len(val_dataloader):.3f}.. "
                  f"Validation accuracy: {accuracy/len(val_dataloader):.3f}")
            running_loss = 0
            running_acc = 0
            # Switch back to training mode after validating
            model.train()

end = time()
epoch_time = (end - start) / num_epochs
print(epoch_time)
### END SOLUTION

Create two plots. In one of them, plot the loss in the training and the validation datasets. In the other one, plot the accuracy in the training and validation datasets.

In [None]:
### BEGIN SOLUTION
xvalues = range(len(train_losses))
xlabel = f"Logging step (every {print_every} batches)"
plt.figure(1)
plt.subplot(211)
plt.plot(xvalues, train_losses, 'r', xvalues, val_losses, 'b')
plt.legend(["Train","Val"])
plt.title("Loss")
plt.xlabel(xlabel)
plt.ylabel("Binary cross-entropy loss")

plt.subplot(212)
plt.plot(xvalues, train_accs, 'r', xvalues, val_accs, 'b')
plt.legend(["Train","Val"])
plt.title("Accuracy")
plt.xlabel(xlabel)
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.6)

plt.show()
### END SOLUTION

**(2 POE)** Did your results improve? What problems did your improvements fix? Explain why, or why not. 

**Your answer**: (fill in here)

[Save your model](https://pytorch.org/tutorials/beginner/saving_loading_models.html) to disk (the architecture, weights and optimizer state). This is simply so you can use it again easily in the later parts of the notebook, without having to keep it in memory or retrain it. The actual file you create is not relevant to your submission. The code to save the model is given in the cell below; note that it stores only the weights (the `state_dict`), so the `Net` class definition is still needed to load them.

In [None]:
# Saves the weights of `model` (its state dict) to a file named "my_model"
torch.save(model.state_dict(), "my_model")

---
## 4. Transfer Learning

**From now on, training on the CPU will not be feasible. If your computer has a GPU, try it out! Otherwise, now is the time to connect to your cloud instance.**

Now, instead of trying to come up with a good architecture for this task, we'll use the VGG16 architecture, but with the top layers removed (the fully connected layers + softmax). We'll substitute them with a single fully connected layer, and a classification layer that makes sense for our problem.

However, this model has a very high capacity, and will probably suffer a lot from overfitting if we try to train it from scratch, using only our small subset of data. Instead, we'll start the optimization with the weights obtained after training VGG16 on the ImageNet dataset.

Start by loading the *pretrained* VGG16 model, from the [torchvision.models](https://pytorch.org/docs/stable/torchvision/models.html).

In [None]:
### BEGIN SOLUTION
# In newer torchvision versions (>= 0.13), `pretrained=True` is deprecated in favour of
# `weights=models.VGG16_Weights.IMAGENET1K_V1`.
VGG_model = models.vgg16(pretrained=True)
print(VGG_model.classifier[0].in_features) # 25088 = 512 * 7 * 7
### END SOLUTION

Create a new model with the layers you want to add on top of VGG.

*Hints:*
- You can access and modify the top layers of the VGG model with `vgg_model.classifier`, and the remaining layers with `vgg_model.features`.
- You can get the number of output features of `vgg_model.features` with `vgg_model.classifier[0].in_features`
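
The value reported by `in_features` can be checked by hand. VGG16's convolutions are 3x3 with padding 1, so they keep the spatial size, while each of its five 2x2 max-pooling layers halves it; the last block has 512 channels. The helper below is an illustrative sketch of that arithmetic, not a torchvision function.

```python
def vgg16_feature_shape(side, channels=512, num_pools=5):
    """Shape (channels, height, width) after VGG16's convolutional stack, square input."""
    for _ in range(num_pools):
        side //= 2  # the padded 3x3 convs keep the size; each 2x2 max-pool halves it
    return channels, side, side


channels, height, width = vgg16_feature_shape(224)
print(channels * height * width)  # 512 * 7 * 7 = 25088
```

The torchvision implementation also places an adaptive average pooling layer with a 7x7 output between `features` and `classifier`, so the classifier receives 25088 features for other input sizes as well.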

In [None]:
### BEGIN SOLUTION
top_layers = nn.Sequential(
    nn.Linear(in_features=25088, out_features=256, bias=True),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(in_features=256, out_features=1, bias=True),
    nn.Sigmoid()
)
### END SOLUTION

Now add the new model on top of VGG.

In [None]:
### BEGIN SOLUTION
VGG_model.classifier = top_layers
### END SOLUTION

### 4.1 Using VGG features

Now we're almost ready to train the new model. However, since the top layers of this architecture are being initialized randomly, it's sometimes possible for them to generate large gradients that can wreck the pretraining of the bottom layers. To avoid this, freeze all the VGG layers in your architecture (i.e. signal to the optimizer that these should not be changed during optimization) by setting the attribute `requires_grad` of every parameter in `vgg_model.features` to `False`.

In [None]:
# Freeze bottom
### BEGIN SOLUTION
for param in VGG_model.features.parameters():
    param.requires_grad = False
### END SOLUTION

Perform the transfer learning by training the top layers of your model.

In [None]:
### BEGIN SOLUTION
from time import time

criterion = nn.BCELoss()

device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")
VGG_model.to(device)
optimizer = optim.Adam(VGG_model.parameters())


num_epochs = 2
steps = 0
running_loss = 0
running_acc = 0
print_every = 50
train_losses_vgg, val_losses_vgg, train_accs_vgg, val_accs_vgg = [], [], [], []

start = time()

VGG_model.train()
for epoch in range(num_epochs):
    # Train:   
    for batch_index, (x, y) in enumerate(train_dataloader):
        steps += 1
        inputs, labels = x.to(device), y.to(device)
        optimizer.zero_grad()
        z = VGG_model.forward(inputs).squeeze(1)  # (batch_size, 1) -> (batch_size,)
        loss = criterion(z, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        z[z >= 0.5] = 1
        z[z < 0.5] = 0

        equals = z == labels.reshape(z.shape).float()
        running_acc += torch.mean(equals.type(torch.FloatTensor)).item()
        
        if steps % print_every == 0:
            val_loss = 0
            accuracy = 0
            VGG_model.eval()
            with torch.no_grad():
                for batch_index, (x, y) in enumerate(val_dataloader):
                    inputs, labels = x.to(device), y.to(device)
                    z = VGG_model.forward(inputs).squeeze(1)  # (batch_size, 1) -> (batch_size,)
                    batch_loss = criterion(z, labels.float())
                    val_loss += batch_loss.item()
                    z[z >= 0.5] = 1
                    z[z < 0.5] = 0
                    
                    equals = z == labels.reshape(z.shape).float()
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            train_losses_vgg.append(running_loss/print_every)
            val_losses_vgg.append(val_loss/len(val_dataloader))    
            train_accs_vgg.append(running_acc/print_every) 
            val_accs_vgg.append(accuracy/len(val_dataloader)) 
            print(f"Epoch {epoch+1}/{num_epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Train accuracy: {running_acc/print_every:.3f}.. "
                  f"Validation loss: {val_loss/len(val_dataloader):.3f}.. "
                  f"Validation accuracy: {accuracy/len(val_dataloader):.3f}")
            running_loss = 0
            running_acc = 0
            VGG_model.train()
end = time()
vgg_epoch_time = (end - start) / num_epochs
print("Average epoch time: {}".format(vgg_epoch_time))
### END SOLUTION

Create two plots. In one of them, plot the loss in the training and the validation datasets. In the other one, plot the accuracy in the training and validation datasets.

In [None]:
### BEGIN SOLUTION
xvalues = range(len(train_losses_vgg))
plt.figure(1)
# Two stacked plots: loss on top (211), accuracy below (212).
plt.subplot(211)
plt.plot(xvalues, train_losses_vgg, 'r', xvalues, val_losses_vgg, 'b')
plt.legend(["Train","Val"])
plt.title("Loss")
plt.ylabel("Loss")

plt.subplot(212)
plt.plot(xvalues, train_accs_vgg, 'r', xvalues, val_accs_vgg, 'b')
plt.legend(["Train","Val"])
plt.title("Accuracy")
plt.xlabel("Evaluation step")
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.4)

plt.show()
### END SOLUTION

How does the model perform, compared to the model obtained in step 3? Create one plot with the training accuracy and another with the validation accuracy of the two scenarios.

In [None]:
### BEGIN SOLUTION
plt.figure(1)
# Each curve gets its own x-values, since the two runs may have logged a different number of points.
plt.subplot(211)
plt.plot(range(len(val_losses)), val_losses, 'r', range(len(val_losses_vgg)), val_losses_vgg, 'b')
plt.legend(["First Model","VGG"])
plt.title("Validation loss")
plt.ylabel("Loss")

plt.subplot(212)
plt.plot(range(len(val_accs)), val_accs, 'r', range(len(val_accs_vgg)), val_accs_vgg, 'b')
plt.legend(["First Model","VGG"])
plt.title("Validation accuracy")
plt.xlabel("Evaluation step")
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.4)

plt.show()
### END SOLUTION

**(2 POE)** Compare these results. Which approach worked best, starting from scratch or doing transfer learning? Reflect on whether your comparison is fair or not:

**Your answer**: (fill in here)

**(1 POE)** What are the main differences between the ImageNet dataset and the Dogs vs Cats dataset we used?

**Your answer**: (fill in here)

**Optional (2 POE)** Even though there are considerable differences between these datasets, why is it that transfer learning is still a good idea?

**Your answer**: (fill in here)

**Optional (1 POE)** In which scenario would transfer learning be unsuitable?

**Your answer**: (fill in here)

Save the model to a file.

In [None]:
torch.save(VGG_model.state_dict(), "trans_learning_top_only")
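
A saved `state_dict` only contains the weights, so loading it back requires first building a model with the same architecture. Below is a minimal sketch of the round trip. The small `nn.Sequential` network and the temporary file name are hypothetical stand-ins for `VGG_model` and its checkpoint, chosen only so that the example runs quickly.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn


def build_toy_model():
    # Hypothetical stand-in for VGG_model (assumption, not the assignment's network).
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())


toy_model = build_toy_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = Path(tmp_dir) / "toy_checkpoint"
    torch.save(toy_model.state_dict(), checkpoint_path)

    # A fresh instance has the same architecture but different random weights.
    restored_model = build_toy_model()
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    restored_model.load_state_dict(state_dict)

# After loading, every tensor in the two state dicts should be identical.
weights_match = all(
    torch.equal(saved, restored)
    for saved, restored in zip(toy_model.state_dict().values(),
                               restored_model.state_dict().values())
)
print(weights_match)
```

The `map_location="cpu"` argument lets a checkpoint written on the GPU instance be opened on a machine without a GPU.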

### 4.2 Fine-tuning

Now that we have a better starting point for the top layers, we can train the entire network. Unfreeze the bottom layers by resetting the `requires_grad` attribute to `True`.

In [None]:
# UnFreeze bottom
### BEGIN SOLUTION
for param in VGG_model.features.parameters():
    param.requires_grad = True
### END SOLUTION
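
A quick way to check that the freezing and unfreezing had the intended effect is to count the trainable parameters before and after. The sketch below does this on a tiny made-up network, `ToyNet`, that only mimics the `features`/`classifier` split of VGG16; the helper `count_parameters` is likewise a hypothetical name.

```python
from torch import nn


class ToyNet(nn.Module):
    """Tiny stand-in with the same features/classifier split as VGG16 (assumption)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.ReLU())
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(4 * 6 * 6, 1), nn.Sigmoid())

    def forward(self, x):
        # Expects 8x8 inputs, so that the convolution output is 4 x 6 x 6.
        return self.classifier(self.features(x))


def count_parameters(module):
    """Return (number of trainable parameters, total number of parameters)."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total


toy_net = ToyNet()

# Freeze the bottom layers, as done before training only the top layers.
for param in toy_net.features.parameters():
    param.requires_grad = False
frozen_counts = count_parameters(toy_net)

# Unfreeze them again, as in this step.
for param in toy_net.features.parameters():
    param.requires_grad = True
unfrozen_counts = count_parameters(toy_net)

print(frozen_counts, unfrozen_counts)
```

Running the same helper on `VGG_model` should show the number of trainable parameters jumping to the total after unfreezing.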

Fine tune the model by training all the layers.

Hint:
- Even though we do have a decent starting point for the optimization, it's still possible that a bad hyper-parameter choice wrecks the preinitialization. Make sure to use a small learning rate for this step.
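
One optional variation on this hint, not required by the assignment, is to give the pretrained layers an even smaller learning rate than the newly added top layers by using parameter groups in the optimizer. The sketch below uses two small made-up modules, `features` and `classifier`, and learning rates picked purely for illustration.

```python
from torch import nn, optim

# Hypothetical stand-ins for the pretrained bottom layers and the new top layers.
features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.ReLU())
classifier = nn.Sequential(nn.Flatten(), nn.Linear(4 * 6 * 6, 1), nn.Sigmoid())

# Each dictionary is a parameter group with its own learning rate.
# The values 1e-5 and 1e-4 are illustrative assumptions, not tuned choices.
optimizer = optim.Adam([
    {"params": features.parameters(), "lr": 1e-5},
    {"params": classifier.parameters(), "lr": 1e-4},
])

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Using a single small learning rate for all layers is simpler and is a perfectly valid way to follow the hint.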

In [None]:
### BEGIN SOLUTION

device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")
VGG_model.to(device)
optimizer = optim.Adam(VGG_model.parameters(),lr=0.0001)


num_epochs = 2
steps = 0
running_loss = 0
running_acc = 0
print_every = 50
train_losses_vgg_f, val_losses_vgg_f, train_accs_vgg_f, val_accs_vgg_f = [], [], [], []

start = time()
VGG_model.train()
for epoch in range(num_epochs):
    # Train:   
    for batch_index, (x, y) in enumerate(train_dataloader):
        steps += 1
        inputs, labels = x.to(device), y.to(device)
        optimizer.zero_grad()
        z = VGG_model.forward(inputs).squeeze()
        loss = criterion(z, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        z[z >= 0.5] = 1
        z[z < 0.5] = 0

        equals = z == labels.reshape(z.shape).float()
        running_acc += torch.mean(equals.type(torch.FloatTensor)).item()
        
        if steps % print_every == 0:
            val_loss = 0
            accuracy = 0
            VGG_model.eval()
            with torch.no_grad():
                for batch_index, (x, y) in enumerate(val_dataloader):
                    inputs, labels = x.to(device), y.to(device)
                    z = VGG_model.forward(inputs).squeeze()
                    batch_loss = criterion(z, labels.float())
                    val_loss += batch_loss.item()
                    z[z >= 0.5] = 1
                    z[z < 0.5] = 0
                    
                    equals = z == labels.reshape(z.shape).float()
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            train_losses_vgg_f.append(running_loss/print_every)
            val_losses_vgg_f.append(val_loss/len(val_dataloader))    
            train_accs_vgg_f.append(running_acc/print_every) 
            val_accs_vgg_f.append(accuracy/len(val_dataloader)) 
            print(f"Epoch {epoch+1}/{num_epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Train accuracy: {running_acc/print_every:.3f}.. "
                  f"Validation loss: {val_loss/len(val_dataloader):.3f}.. "
                  f"Validation accuracy: {accuracy/len(val_dataloader):.3f}")
            running_loss = 0
            running_acc = 0
            VGG_model.train()
end = time()
vgg_fine_tune_epoch = (end - start) / num_epochs
print(vgg_fine_tune_epoch)
### END SOLUTION

How does the model perform, compared to the model trained with frozen layers? Create one plot with the training accuracy and another with the validation accuracy of the two scenarios.

In [None]:
### BEGIN SOLUTION
xvalues = range(len(train_losses_vgg))
plt.figure(1)
# Two stacked plots: loss on top (211), accuracy below (212).
plt.subplot(211)
plt.plot(xvalues, val_losses_vgg, 'r', xvalues, val_losses_vgg_f, 'b')
plt.legend(["VGG (frozen)","VGG (unfrozen)"])
plt.title("Validation loss")
plt.ylabel("Loss")

plt.subplot(212)
plt.plot(xvalues, val_accs_vgg, 'r', xvalues, val_accs_vgg_f, 'b')
plt.legend(["VGG (frozen)","VGG (unfrozen)"])
plt.title("Validation accuracy")
plt.xlabel("Evaluation step")
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.4)

plt.show()
### END SOLUTION

**(1 POE)** Did the model's performance improve? Why (why not)?

**Your answer**: (fill in here)

Save the model to file.

In [None]:
torch.save(VGG_model.state_dict(), "trans_learning_full")

### 4.3 Improving the top model (optional)

Improve the architecture for the layers you add on top of VGG16. Try different ideas! When you're happy with one architecture, copy it in the cell below and train it here.

In [None]:
### BEGIN SOLUTION
### END SOLUTION

**(1 POE)** How does the model perform, compared to the model trained in step 4.2? Create one plot with the training accuracy and another with the validation accuracy of the two scenarios.

In [None]:
### BEGIN SOLUTION
### END SOLUTION

Save the model to a file.

In [None]:
torch.save(model.state_dict(), "best_trans_learning")

## 5. Final training

Now we'll train the model that achieved the best performance so far using the entire dataset.

**Note**: start the optimization with the weights you obtained training in the smaller subset, i.e. *not* from scratch.

First, create two new data loaders, one for training samples and one for validation samples. This time, they'll load data from the folders for the entire dataset.

In [None]:
### BEGIN SOLUTION
img_width, img_height = 224, 224
batch_size = 64

train_dataset = ImageFolder(train_path, transform=Compose([Resize(size=[img_height, img_width]), ToTensor()]))
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = ImageFolder(val_path, transform=Compose([Resize(size=[img_height, img_width]), ToTensor()]))
# Shuffling is not needed for validation, since the metrics are averaged over all batches.
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Number of batches per epoch, including a possibly smaller last batch.
batches_per_epoch = len(train_dataloader)
### END SOLUTION

Train your model using the full data. This optimization might take a long time, so live plotting of some metrics is recommended.
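
Whichever metrics are logged or plotted during a long run, the running averages have to account for the last logging window of an epoch usually being shorter than the others. Below is a minimal sketch of one way to handle this. The `RunningAverage` class and the loss values are hypothetical and only illustrate the bookkeeping.

```python
class RunningAverage:
    """Accumulates per-batch values and averages over the batches actually seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def pop_average(self):
        """Return the mean of the values added since the last call, then reset."""
        if self.count == 0:
            raise ValueError("No values have been added since the last reset.")
        average = self.total / self.count
        self.total, self.count = 0.0, 0
        return average


# Made-up per-batch losses: 5 batches per epoch, logging every 2 batches.
batch_losses = [0.9, 0.7, 0.6, 0.4, 0.2]
log_every = 2

tracker = RunningAverage()
logged = []
for batch_index, loss in enumerate(batch_losses, 1):
    tracker.update(loss)
    # Log at regular intervals and once more at the end of the epoch.
    if batch_index % log_every == 0 or batch_index == len(batch_losses):
        logged.append(tracker.pop_average())

print(logged)
```

Here the last logged value is the mean of a single batch. Dividing by `log_every` instead would have halved it.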

In [None]:
### BEGIN SOLUTION

device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")
VGG_model.to(device)
optimizer = optim.Adam(VGG_model.parameters(),lr=0.0001)


num_epochs = 2
running_loss = 0
running_acc = 0
print_every = 1000
train_losses_vgg_all, val_losses_vgg_all, train_accs_vgg_all, val_accs_vgg_all = [], [], [], []

start = time()
VGG_model.train()
for epoch in range(1, num_epochs+1):
    # Train:
    running_metric_ticker = 0
    train_time = time()
    for batch_index, (x, y) in enumerate(train_dataloader, 1):
        inputs, labels = x.to(device), y.to(device)
        optimizer.zero_grad()
        z = VGG_model.forward(inputs).squeeze()
        loss = criterion(z, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        z[z >= 0.5] = 1
        z[z < 0.5] = 0

        equals = z == labels.reshape(z.shape).float()
        running_acc += torch.mean(equals.type(torch.FloatTensor)).item()
        running_metric_ticker += 1
        if batch_index % print_every == 0 or batch_index == batches_per_epoch:
            val_loss = 0
            accuracy = 0
            VGG_model.eval()
            with torch.no_grad():
                for x, y in val_dataloader:
                    inputs, labels = x.to(device), y.to(device)
                    z = VGG_model.forward(inputs).squeeze()
                    batch_loss = criterion(z, labels.float())
                    val_loss += batch_loss.item()
                    z[z >= 0.5] = 1
                    z[z < 0.5] = 0
                    
                    equals = z == labels.reshape(z.shape).float()
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
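            # Average over the batches seen since the last evaluation; the last
            # window of an epoch can contain fewer than print_every batches.
            train_losses_vgg_all.append(running_loss/running_metric_ticker)
            val_losses_vgg_all.append(val_loss/len(val_dataloader))
            train_accs_vgg_all.append(running_acc/running_metric_ticker)
            val_accs_vgg_all.append(accuracy/len(val_dataloader))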
            print("Epoch {}/{}, batch {}/{}"
                  .format(epoch, num_epochs, batch_index, batches_per_epoch))
            print("Train loss: {:.3f}\t Train acc: {:.3f}\tVal loss: {:.3f}\t Val acc: {:.3f}"
                  .format(running_loss / running_metric_ticker,
                          running_acc / running_metric_ticker,
                          val_loss / len(val_dataloader),
                          accuracy / len(val_dataloader)))
            running_loss = 0
            running_acc = 0
            VGG_model.train()
            running_metric_ticker = 0
end = time()
full_train_epoch = (end - start) / num_epochs
print(full_train_epoch)
### END SOLUTION

How does the model perform now when trained on the entire dataset, compared to when only trained on the smaller subset of data? Create one plot with the training accuracy and another with the validation accuracy of the two scenarios.

In [None]:
### BEGIN SOLUTION
plt.figure(1)
# Each curve gets its own x-values: the two runs logged a different number of points.
plt.subplot(211)
plt.plot(range(len(val_losses_vgg_f)), val_losses_vgg_f, 'r',
         range(len(val_losses_vgg_all)), val_losses_vgg_all, 'b')
plt.legend(["VGG (unfrozen)","VGG (all)"])
plt.title("Validation loss")
plt.ylabel("Loss")

plt.subplot(212)
plt.plot(range(len(val_accs_vgg_f)), val_accs_vgg_f, 'r',
         range(len(val_accs_vgg_all)), val_accs_vgg_all, 'b')
plt.legend(["VGG (unfrozen)","VGG (all)"])
plt.title("Validation accuracy")
plt.xlabel("Evaluation step")
plt.ylabel("Accuracy")
plt.subplots_adjust(hspace = 0.4)

plt.show()
### END SOLUTION

**(2 POE)** What can you conclude from these plots? Did you expect what you observe in the plots? Explain!

**Your answer**: (fill in here)

## 6. Evaluation on test set (optional)

Now we'll evaluate your final model, obtained in step 5, on the test set. As mentioned before, the samples in the test set are not labeled, so we can't compute any performance metrics ourselves.

As a bit of fun and to inspire some friendly competition you may instead submit it to Kaggle for evaluation.

Compute the predictions for all samples in the test set according to your best model, and save it in a .csv file with the format expected by the competition.

Hints:
- There is a `sampleSubmission.csv` file included in the zip data. Take a look at it to better understand what the expected format is.
- `pathlib`'s `Path` class has a `glob` function, which returns the filenames of all files in a given path.
- If you don't know how to create and write to files with Python, Google can help.
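
The hints above can be sketched with the standard library alone. The example below lists image files with `Path.glob`, sorts them by their numeric id, and writes one row per image with the `csv` module. The function `write_submission`, the `predict` callback, and the empty placeholder files are all hypothetical. In the real solution, `predict` would run the model on the image.

```python
import csv
import tempfile
from pathlib import Path


def write_submission(image_dir, output_file, predict):
    """Write an 'id,label' CSV with one row per .jpg file in image_dir."""
    # Sort numerically, so that 2.jpg comes before 10.jpg.
    image_paths = sorted(Path(image_dir).glob("*.jpg"), key=lambda p: int(p.stem))
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for path in image_paths:
            writer.writerow([int(path.stem), predict(path)])
    return len(image_paths)


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)
    # Empty placeholder files standing in for the test images.
    for name in ["10.jpg", "1.jpg", "2.jpg"]:
        (tmp_dir / name).touch()

    # A constant prediction of 0.5 stands in for the model output.
    n_written = write_submission(tmp_dir, tmp_dir / "submission.csv",
                                 predict=lambda path: 0.5)
    rows = (tmp_dir / "submission.csv").read_text().splitlines()

print(rows)
```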

In [None]:
### BEGIN SOLUTION

from pathlib import Path
from PIL import Image
size = (224, 224)

submission_file = Path.cwd() / "submission.csv"
if submission_file.exists():
    print("Removing old submission file.")
    submission_file.unlink()
    
test_path = Path.cwd() / "test"
# Need to list the images to get the number of matching files
test_images = list(test_path.glob("*.jpg"))
num_test_images = len(test_images)

predictions = np.empty((num_test_images, ))
to_tensor = ToTensor()
VGG_model.eval()  # Evaluation mode, so that e.g. dropout is disabled
with torch.no_grad():  # No gradients are needed for inference
    for counter, name in enumerate(test_images, 1):
        img = Image.open(name).convert("RGB")
        img = img.resize(size, Image.BILINEAR)
        inputs = to_tensor(img)
        inputs = inputs.unsqueeze(0).to(device)  # Add a batch dimension
        id_ = int(name.stem)
        predictions[id_ - 1] = VGG_model(inputs).item()
        if counter % 1000 == 0 or counter == num_test_images:
            print("Processed {}/{} images".format(counter, num_test_images))
            
with open(submission_file,"w") as f:
    f.write("id,label\n")
    for id_, pred in enumerate(predictions, 1):
        f.write("{},{}\n".format(id_, pred))
        
print("Predictions written to {}".format(submission_file.name))
### END SOLUTION

Now that you created your submission file, submit it to Kaggle for evaluation. The [old competition](https://www.kaggle.com/c/dogs-vs-cats) does not allow submissions any more, but you can submit your file to the [new one](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition) via the "Late submission" button (they use the same data). The Kaggle CLI can be used as well. Kaggle evaluates your submission according to your log-loss score. Which score did you obtain?

**Your answer**: (fill in here)

What was the username you used for this submission?

**Your answer**: (fill in here)
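
To get a feeling for the log-loss score used in the evaluation, the sketch below computes the binary log-loss in plain Python. The function name `binary_log_loss` and the clipping constant `eps` are assumptions made for this illustration, and the labels and predictions are made up.

```python
import math


def binary_log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip the prediction away from 0 and 1 to avoid log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0]

# The third prediction is wrong in both cases, but with different confidence.
hard_predictions = [1.0, 0.0, 1.0]
soft_predictions = [0.99, 0.01, 0.99]

hard_score = binary_log_loss(labels, hard_predictions)
soft_score = binary_log_loss(labels, soft_predictions)
uninformed_score = binary_log_loss([1, 0], [0.5, 0.5])

print(hard_score, soft_score, uninformed_score)
```

A single confidently wrong prediction dominates the score, which is why submitting probabilities that are not exactly 0 or 1 tends to give a lower (better) log-loss than submitting hard class labels.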