# Transfer Learning

In this notebook, you'll learn how to use pre-trained networks to solve challenging computer vision problems, using yesterday's cat vs. dog dataset.

Transfer learning consists of reusing, for the feature extraction part, the weights of a model that has already been trained on another dataset.

Once trained, these models work astonishingly well as feature detectors for images they weren't trained on. Here we'll use transfer learning to train a network that can classify our cat and dog photos with near perfect accuracy.

With `torchvision.models` you can download these pre-trained networks and use them in your applications. We'll include `models` in our imports now.

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.autograd import Variable
from torchvision import datasets, transforms, models

Most of the pre-trained models require the input to be 224x224 images. We also need to match the normalization used when the models were trained. Each color channel was normalized separately: the means are `[0.485, 0.456, 0.406]` and the standard deviations are `[0.229, 0.224, 0.225]`.
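
To make the normalization concrete, here is a minimal pure-Python sketch of the per-channel arithmetic. The helper names `normalize_pixel` and `denormalize_pixel` and the pixel value are made up for illustration; the sketch assumes channel values already scaled to the range [0, 1], as they are after `transforms.ToTensor()`.

```python
# Per-channel statistics quoted above (RGB order).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

def denormalize_pixel(rgb):
    """Invert the normalization, e.g. before displaying an image."""
    return [v * s + m for v, m, s in zip(rgb, MEAN, STD)]

pixel = [0.5, 0.5, 0.5]  # a mid-grey pixel (illustrative value)
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)
```

The inverse step is handy when you want to plot a normalized tensor: without it, the colors look washed out or clipped.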

>**Bonus Exercise**: Try this at the end: build a data augmentation pipeline using the Albumentations library (check the other notebook).

In [2]:
ls ../../../../

AI_Challenge/  buildozer/  Datasets/  GIT_test/  Strive/


In [3]:
## Load and preprocess the Cat vs Dog dataset of yesterday. 
## The images should have size 224x224

data_dir = "../../../../Datasets/cat_vs_dog"

# TODO: Define transforms for the training data and testing data
train_transform = transforms.Compose([transforms.Resize((320,320)),
                                       transforms.RandomRotation(30),
                                       transforms.CenterCrop(255),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406], # remember to check this
                                                            [0.229, 0.224, 0.225])])

test_transform = transforms.Compose([transforms.Resize((224, 224)),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406], # remember to check this
                                                            [0.229, 0.224, 0.225])])



# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transform)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transform)

trainloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=16, shuffle=False)

We can load a pre-trained model such as [DenseNet](http://pytorch.org/docs/0.3.0/torchvision/models.html#id5); the cell below loads SqueezeNet as an example. Let's print out the model architecture so we can see what's going on. You can pick other models as well from here: https://pytorch.org/docs/0.3.0/torchvision/models.html#id5

In [15]:
model = models.squeezenet1_0(pretrained=True)
model

SqueezeNet(
  (features): Sequential(
    (0): Conv2d(3, 96, kernel_size=(7, 7), stride=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
    (3): Fire(
      (squeeze): Conv2d(96, 16, kernel_size=(1, 1), stride=(1, 1))
      (squeeze_activation): ReLU(inplace=True)
      (expand1x1): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      (expand1x1_activation): ReLU(inplace=True)
      (expand3x3): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (expand3x3_activation): ReLU(inplace=True)
    )
    (4): Fire(
      (squeeze): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
      (squeeze_activation): ReLU(inplace=True)
      (expand1x1): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      (expand1x1_activation): ReLU(inplace=True)
      (expand3x3): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (expand3x3_activation): ReLU(inplace=True)
    )
    (5): Fire(
   

A model like this is built out of two main parts, the features and the classifier. The features part is a stack of convolutional layers and overall works as a feature detector whose output can be fed into a classifier. In DenseNet, for example, the classifier part is a single fully-connected layer `(classifier): Linear(in_features=1024, out_features=1000)`; in the ResNet used in the cells below, the same role is played by the `fc` layer, which takes 512 input features. This layer was trained on the ImageNet dataset, so it won't work for our specific problem. That means **we need to replace the classifier, but the features will work well on their own**. *In general, I think about pre-trained networks as amazingly good feature detectors that can be used as the input for simple feed-forward classifiers.*
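
The freeze-then-replace idea can be sketched without downloading any weights. `TinyNet` below is a hypothetical stand-in for a pre-trained network, not a `torchvision` model, and its layer sizes are arbitrary; the point is only to show which parameters remain trainable after the head is swapped.

```python
import torch
from torch import nn

# Tiny stand-in for a pre-trained network (hypothetical; real models come
# from torchvision.models). It has a `features` part and a `classifier` part.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(8, 1000)  # ImageNet-style 1000-way head

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyNet()

# 1) Freeze every existing parameter.
for param in model.parameters():
    param.requires_grad = False

# 2) Replace the head; newly created layers are trainable by default.
model.classifier = nn.Sequential(
    nn.Linear(8, 4),
    nn.ReLU(),
    nn.Linear(4, 2),
    nn.LogSoftmax(dim=1),
)

# Only the parameters of the new head should still require gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

out = model(torch.randn(5, 3, 32, 32))
print(out.shape)
```

Because the new layers are created after the freezing loop, they keep `requires_grad=True`, so an optimizer built from `model.classifier.parameters()` updates only the head.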



In [14]:
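# NOTE: the rest of the notebook works on a ResNet (see the output below), whose
# final layer is called `fc`. resnet18 is an assumption here: its `fc` takes 512 features.
model = models.resnet18(pretrained=True)

# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False

from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
          ('fc1',   nn.Linear(512, 128)),
          ('relu1', nn.ReLU()),
          ('output', nn.Linear(128, 2)),
          ('softmax', nn.LogSoftmax(dim=1))  # log-probabilities, to pair with nn.NLLLoss
]))

# Replace the ImageNet head; the new layers are trainable by default
model.fc = classifier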

model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

With our model built, we need to train the classifier. However, now we're using a **really deep** neural network. If you try to train this on a CPU as usual, it will take a long, long time. Instead, we're going to use the GPU to do the calculations. The linear algebra computations are done in parallel on the GPU, which can make training up to around 100x faster, depending on the model and the hardware. It's also possible to train on multiple GPUs, further decreasing training time.

PyTorch, along with pretty much every other deep learning framework, uses [CUDA](https://developer.nvidia.com/cuda-zone) to efficiently compute the forward and backwards passes on the GPU. In PyTorch, you move your model parameters and other tensors to the GPU memory using `model.to('cuda')`. You can move them back from the GPU with `model.to('cpu')` which you'll commonly do when you need to operate on the network output outside of PyTorch. As a demonstration of the increased speed, I'll compare how long it takes to perform a forward and backward pass with and without a GPU.

In [6]:
import time
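
The comparison described above could look like the following sketch. The helper `time_forward_backward` and the small network inside it are assumptions made for illustration, not the notebook's model; the code times the CPU and, only if one is available, the GPU.

```python
import time
import torch
from torch import nn

def time_forward_backward(device, n_iters=3):
    """Average seconds per forward+backward pass of a small conv net."""
    net = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, 2),
    ).to(device)
    inputs = torch.randn(8, 3, 64, 64, device=device)
    labels = torch.randint(0, 2, (8,), device=device)
    criterion = nn.CrossEntropyLoss()

    start = time.perf_counter()
    for _ in range(n_iters):
        net.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda:0"))

timings = {str(d): time_forward_backward(d) for d in devices}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s per forward/backward pass")
```

With a network this small the GPU may show little or no advantage, since launch overhead dominates; the gap tends to widen with deeper models and larger batches.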

We can check whether a GPU is available with

`torch.cuda.is_available()`.

This lets us write code that is agnostic to the device we are using, simply by defining:

`device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`

This way, `device` will be `"cuda:0"` if a GPU is available, or `"cpu"` if it is not!



In [7]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


So, whenever you get a new Tensor or Module:

```python
input = data.to(device)
model = MyModule(...).to(device)
```

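This tells the machine to move the data or the model to the GPU if one is available, which can speed up your training process a lot! If the data or the model is already on the target device when you run the lines above, nothing happens. Note that for tensors `.to(device)` returns the moved tensor, so you need to keep the result, as in the snippet above.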
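
A small sketch of how `.to(device)` behaves for modules and for tensors; the layer and tensor sizes are arbitrary and chosen only for illustration.

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(4, 2)
returned = layer.to(device)   # modules are moved in place; the same object comes back

data = torch.randn(3, 4)
data = data.to(device)        # tensors are not moved in place: keep the returned tensor
same = data.to(device)        # already on `device`, so no copy is made

output = layer(data)          # works because layer and data live on the same device

print(returned is layer)
print(same is data)
print(output.shape)
```

Forgetting to reassign the tensor (`data.to(device)` on its own line) is a common source of "expected all tensors to be on the same device" errors.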


In [8]:
# Use GPU if it's available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

# model = models.densenet121(pretrained=True)

# Freeze parameters so we don't backprop through them
# for param in model.parameters():
#     param.requires_grad = False

# Change the classifier to make it work with your binary classification problem

# model.classifier = nn.Sequential(# your code here)

criterion = nn.NLLLoss()
# criterion = nn.CrossEntropyLoss

# Only train the classifier parameters, feature parameters are frozen
optimizer = optim.Adam(model.fc.parameters(), lr=0.003)

model.to(device);

cuda:0


>**Exercise**: Fill in the code below to implement the training and validation loops.

In [9]:
epochs = 1
steps = 0
running_loss = 0
print_every = 5
max_accuracy = 0

for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        # Move input and label tensors to the default device
        inputs, labels = inputs.to(device), labels.to(device)

        # write the training loop. call the loss "loss" so that the line below will work
        optimizer.zero_grad()            # Reset gradients from the previous step

        output = model(inputs)           # 1) Forward pass
        loss = criterion(output, labels) # 2) Compute loss
        loss.backward()                  # 3) Backward pass
        optimizer.step()                 # 4) Update model

        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            # REMEMBER TO ACTIVATE THE EVAL MODE
            model.eval()
            with torch.no_grad():
                # Separate names so the training batch is not overwritten
                for test_inputs, test_labels in testloader:
                    test_inputs, test_labels = test_inputs.to(device), test_labels.to(device)

                    logps = model(test_inputs)
                    batch_loss = criterion(logps, test_labels)

                    test_loss += batch_loss.item()

                    # Calculate accuracy
                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == test_labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.float()).item()

            # Save a checkpoint only when the mean test accuracy improves
            mean_accuracy = accuracy / len(testloader)
            if mean_accuracy >= max_accuracy:
                max_accuracy = mean_accuracy
                torch.save(model.state_dict(), 'checkpoint.pth')

            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(testloader):.3f}.. "
                  f"Test accuracy: {mean_accuracy:.3f}")
            running_loss = 0
            model.train()

Epoch 1/1.. Train loss: 0.837.. Test loss: 0.528.. Test accuracy: 0.618
Epoch 1/1.. Train loss: 0.622.. Test loss: 0.315.. Test accuracy: 0.900
Epoch 1/1.. Train loss: 0.413.. Test loss: 0.191.. Test accuracy: 0.960
Epoch 1/1.. Train loss: 0.356.. Test loss: 0.145.. Test accuracy: 0.954
Epoch 1/1.. Train loss: 0.279.. Test loss: 0.144.. Test accuracy: 0.946
Epoch 1/1.. Train loss: 0.375.. Test loss: 0.265.. Test accuracy: 0.883
Epoch 1/1.. Train loss: 0.316.. Test loss: 0.131.. Test accuracy: 0.947
Epoch 1/1.. Train loss: 0.310.. Test loss: 0.138.. Test accuracy: 0.943
Epoch 1/1.. Train loss: 0.244.. Test loss: 0.105.. Test accuracy: 0.956
Epoch 1/1.. Train loss: 0.235.. Test loss: 0.101.. Test accuracy: 0.960
Epoch 1/1.. Train loss: 0.257.. Test loss: 0.124.. Test accuracy: 0.952
Epoch 1/1.. Train loss: 0.184.. Test loss: 0.098.. Test accuracy: 0.958
Epoch 1/1.. Train loss: 0.343.. Test loss: 0.135.. Test accuracy: 0.949
Epoch 1/1.. Train loss: 0.288.. Test loss: 0.145.. Test accuracy

In less than one epoch! 😳🤩 (I manually interrupted the training, this is why there's the error.)

OK, now that you have a great model, it's worth saving it so you can use it again later.

In [10]:
print("Our model: \n\n", model, '\n')
print("The state dict keys: \n\n", model.state_dict().keys())

Our model: 

 ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(in

The simplest thing to do is to save the state dict with `torch.save`. For example, we can save it to a file `'checkpoint.pth'`.

In [18]:
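# The training loop above already wrote 'checkpoint.pth'.
# Uncomment the next line to overwrite it with the current weights instead.
# torch.save(model.state_dict(), 'checkpoint.pth')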

Then we can load the state dict with `torch.load`.

In [11]:
state_dict = torch.load('checkpoint.pth')
print(state_dict.keys())

odict_keys(['conv1.weight', 'bn1.weight', 'bn1.bias', 'bn1.running_mean', 'bn1.running_var', 'bn1.num_batches_tracked', 'layer1.0.conv1.weight', 'layer1.0.bn1.weight', 'layer1.0.bn1.bias', 'layer1.0.bn1.running_mean', 'layer1.0.bn1.running_var', 'layer1.0.bn1.num_batches_tracked', 'layer1.0.conv2.weight', 'layer1.0.bn2.weight', 'layer1.0.bn2.bias', 'layer1.0.bn2.running_mean', 'layer1.0.bn2.running_var', 'layer1.0.bn2.num_batches_tracked', 'layer1.1.conv1.weight', 'layer1.1.bn1.weight', 'layer1.1.bn1.bias', 'layer1.1.bn1.running_mean', 'layer1.1.bn1.running_var', 'layer1.1.bn1.num_batches_tracked', 'layer1.1.conv2.weight', 'layer1.1.bn2.weight', 'layer1.1.bn2.bias', 'layer1.1.bn2.running_mean', 'layer1.1.bn2.running_var', 'layer1.1.bn2.num_batches_tracked', 'layer2.0.conv1.weight', 'layer2.0.bn1.weight', 'layer2.0.bn1.bias', 'layer2.0.bn1.running_mean', 'layer2.0.bn1.running_var', 'layer2.0.bn1.num_batches_tracked', 'layer2.0.conv2.weight', 'layer2.0.bn2.weight', 'layer2.0.bn2.bias', '

And to load the state dict into the network, you do `model.load_state_dict(state_dict)`.


In [12]:
model.load_state_dict(state_dict)

<All keys matched successfully>

Oh, but what does that mean?

First, it means the model you load into must be defined exactly like the one that was saved. In other words, if you save `checkpoint.pth` and send it to a friend, your friend won't be able to use the model *unless* you tell them how the model was defined: which layers it has, what they are called, and so on.

Is it useless then? No, on the contrary! Imagine you're training a model for 1000 epochs, which could take three days. What if, at the 999th epoch, your *Airbnb host* shuts down the Wi-Fi connection? All your progress is lost! However, since your notebook contains the full model definition, you can save a checkpoint every $n$ iterations or *when the validation accuracy improves*.

This is indeed common practice: every time you test your model on the validation set, you can check whether the validation accuracy is higher than the one you have already saved and, if so, save a new checkpoint! I recommend checking this out: https://wandb.ai/site
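
As a rough sketch of this save-on-improvement pattern, the snippet below stores a little more than the weights so that training can be resumed. The dictionary keys (`"epoch"`, `"best_accuracy"`, `"model_state"`, `"optimizer_state"`), the stand-in `nn.Linear` model, and the validation accuracies are all assumptions made up for illustration.

```python
import os
import tempfile
import torch
from torch import nn, optim

model = nn.Linear(4, 2)  # stand-in for the real network
optimizer = optim.Adam(model.parameters(), lr=0.003)
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")

best_accuracy = 0.0
saved_epochs = []
# Made-up validation accuracies, one per epoch, only to drive the logic.
fake_val_accuracies = [0.62, 0.90, 0.88, 0.95]

for epoch, val_accuracy in enumerate(fake_val_accuracies, start=1):
    # ... a real loop would train and evaluate here ...
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        torch.save(
            {
                "epoch": epoch,
                "best_accuracy": best_accuracy,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            checkpoint_path,
        )
        saved_epochs.append(epoch)

# Resume later: rebuild the same architecture, then restore its state.
checkpoint = torch.load(checkpoint_path, map_location="cpu")
resumed = nn.Linear(4, 2)
resumed.load_state_dict(checkpoint["model_state"])
print(saved_epochs, checkpoint["epoch"], checkpoint["best_accuracy"])
```

Passing `map_location` lets a checkpoint written on a GPU machine be opened on a CPU-only one.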

What if you try loading the checkpoints with a model that doesn't match?


In [13]:
# Try this
import fc_model
model = fc_model.Network(784, 10, [400, 200, 100])
# This will throw an error because the tensor sizes are wrong!
model.load_state_dict(state_dict)

ModuleNotFoundError: No module named 'fc_model'

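An error was expected here, so don't worry. Note, though, that the one shown above is a `ModuleNotFoundError`: the helper module `fc_model` is not available in this environment, so the cell never reached `load_state_dict`. With a mismatched model actually defined, `load_state_dict` would instead raise a `RuntimeError` about missing, unexpected, or wrongly sized keys. I know it's always scary to see the red message, and your debug mode has already been triggered, but put it aside.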
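
To see the mismatch error itself without needing `fc_model`, here is a minimal sketch with two small stand-in networks; the layer sizes are arbitrary and exist only for illustration.

```python
import torch
from torch import nn

saved = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
state_dict = saved.state_dict()

# A network with different layer sizes (hypothetical, for illustration only).
different = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

try:
    different.load_state_dict(state_dict)
    load_failed = False
except RuntimeError as err:
    load_failed = True
    print("Loading failed:", str(err).splitlines()[0])

# The same architecture loads without complaint.
same = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
result = same.load_state_dict(state_dict)
print(result)
```

The full error message lists each offending parameter with the shape found in the checkpoint and the shape the model expects, which is usually enough to spot the architectural difference.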

>**Exercise**: Save a checkpoint of your model after 5 epochs of training. Load it back and continue training for as many epochs as you want.