# Recap: What we have been doing


`1-Tensors-in-PyTorch.ipynb`: introduces the tensor representation in PyTorch

`2-Neural-networks-in-PyTorch.ipynb`: introduces a basic framework for defining neural networks via the `nn` module

`3-Training-neural-networks.ipynb`: introduces loss and backprop to improve predictions

`4-Fashion-MNIST.ipynb`: walks through a feed-forward neural network on the Fashion MNIST data

`5-Inference-and-Validation.ipynb`: introduces dropout

`6-Saving-and-Loading-Models.ipynb`: shows how to save and load a model so that you don't have to retrain it from scratch

`7-Loading-Image-Data.ipynb`: shows how to load custom image data and outlines some image augmentation techniques

# Transfer Learning

In this notebook you will learn how to use pre-trained networks to solve a majority of problems in computer vision. Most of the time you won't want to train a whole convolutional network yourself. Training modern convolutional networks on huge datasets like [ImageNet](http://www.image-net.org/) can easily take weeks on multiple GPUs. 
> Instead, most people use a pretrained network either as a fixed feature extractor or as an initial network to fine tune. 

Here, we will be working with networks trained on ImageNet. These pre-trained architectures are available from the torchvision library in the [torchvision.models](https://pytorch.org/docs/0.4.0/torchvision/models.html) module.

We can choose from 6 different pre-trained architectures:

* AlexNet
* VGG
* ResNet
* SqueezeNet
* DenseNet
* Inception v3

From the accuracy tables in the `torchvision.models` documentation you'll notice that, generally, the larger the network, the better the accuracy. At the same time, a larger network takes longer to train and to compute predictions. Therefore, it's worth thinking about the tradeoff between accuracy and speed when choosing an architecture.

ImageNet is a massive dataset with over 1 million labeled images in 1000 categories. Once trained, these models work astonishingly well as feature detectors for images they weren't trained on. Using a pre-trained network on images not in the training set is called **transfer learning**. Here we'll use transfer learning to train a network that can classify our cat and dog photos with near perfect accuracy.




In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# as usual we import necessary libraries
import matplotlib.pyplot as plt
import torch
from torchvision import datasets, transforms
import torchvision.models as models
import helper
from torch import nn, optim
import torch.nn.functional as F
import fc_model

In [2]:
# check if CUDA is available:
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('CUDA is not available. Training on CPU...')
else:
    print('CUDA is available. Training on GPU...')

CUDA is not available. Training on CPU...


# Cats and Dogs Dataset

Again, we will be working with the cats and dogs dataset from Part 7, `Loading-Image-Data.ipynb`. Let's start by downloading our example data, a .zip of 2,000 JPG pictures of cats and dogs, and extracting it locally in `/tmp`.

**NOTE:** The 2,000 images used in this exercise are excerpted from the ["Dogs vs. Cats" dataset](https://www.kaggle.com/c/dogs-vs-cats/data) available on Kaggle, which contains 25,000 images. Here, we use a subset of the full dataset to decrease training time for educational purposes. Be careful, though: `/tmp` is a temporary directory, and its files will be deleted upon reboot.


In [3]:
!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O /tmp/cats_and_dogs_filtered.zip

--2019-01-09 17:08:47--  https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.215.240
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.215.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68606236 (65M) [application/zip]
Saving to: ‘/tmp/cats_and_dogs_filtered.zip’


2019-01-09 17:09:14 (2.54 MB/s) - ‘/tmp/cats_and_dogs_filtered.zip’ saved [68606236/68606236]



In [4]:
import os
import zipfile

local_zip = '/tmp/cats_and_dogs_filtered.zip'
# the context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')
base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

The contents of the .zip are extracted to the base directory `/tmp/cats_and_dogs_filtered`. It contains `train` and `validation` subdirectories for the training and validation datasets, and each of those contains `cats` and `dogs` subdirectories.
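
This folder layout matters because `datasets.ImageFolder` derives the labels from it: class folder names are sorted and numbered from zero. The sketch below mimics that rule using only the standard library on a throwaway directory tree. `class_index_from_folders` is a hypothetical helper written for illustration, not a torchvision function.

```python
import tempfile
from pathlib import Path


def class_index_from_folders(root):
    """Map each class subdirectory name to an integer index, in sorted order."""
    classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Build an empty directory tree with the same shape as the extracted dataset.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "validation"):
        for label in ("dogs", "cats"):
            (Path(tmp) / split / label).mkdir(parents=True)
    train_mapping = class_index_from_folders(Path(tmp) / "train")

print(train_mapping)  # {'cats': 0, 'dogs': 1}
```

On a real `ImageFolder` dataset, the same mapping should be available as the `class_to_idx` attribute.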

*Transforms:* Most of the pretrained models require the input to be 224x224 images. We also need to match the normalization used when the models were trained. Each color channel was normalized separately: the means are `[0.485, 0.456, 0.406]` and the standard deviations are `[0.229, 0.224, 0.225]`.
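
To make the per-channel normalization concrete, here is a minimal sketch of the arithmetic, `(pixel - mean) / std`, applied by hand to a tiny made-up image. The 2x2 image is an assumption for illustration; `transforms.Normalize` applies the same formula to each channel of a real image tensor.

```python
import torch

channel_mean = torch.tensor([0.485, 0.456, 0.406])
channel_std = torch.tensor([0.229, 0.224, 0.225])

# A made-up 3x2x2 image in which every pixel equals its channel mean...
image = channel_mean.view(3, 1, 1).expand(3, 2, 2).clone()
# ...except one fully saturated pixel in the first (red) channel.
image[0, 0, 0] = 1.0

# Broadcast the per-channel statistics over height and width.
normalized = (image - channel_mean.view(3, 1, 1)) / channel_std.view(3, 1, 1)

print(normalized[1])        # all zeros: mean-valued pixels map to 0
print(normalized[0, 0, 0])  # (1.0 - 0.485) / 0.229, roughly 2.25
```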

In [5]:
input_size = [224, 224]
channel_mean = [0.485, 0.456, 0.406]
channel_std = [0.229, 0.224, 0.225]
train_transforms = transforms.Compose([transforms.RandomResizedCrop(input_size[0]),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize(channel_mean, channel_std)]) 
# no data augmentation on the test data: deterministic resize and center crop
test_transforms = transforms.Compose([transforms.Resize(256),
                                      transforms.CenterCrop(input_size[0]),
                                      transforms.ToTensor(),
                                      transforms.Normalize(channel_mean, channel_std)])
# pass the transforms:
train_data = datasets.ImageFolder(train_dir, transform = train_transforms)
test_data = datasets.ImageFolder(validation_dir, transform = test_transforms)

In [6]:
batch_size = 32
trainloader = torch.utils.data.DataLoader(train_data, batch_size = batch_size, shuffle = True)
testloader = torch.utils.data.DataLoader(test_data, batch_size = batch_size)

In [7]:
# We can load in a pre-trained network, such as DenseNet:
model = models.densenet121(pretrained = True)


In [8]:
model

DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplac

The model is built from 2 parts:
* features
* classifier

The **features** part is a stack of convolutional layers that works as a feature detector, and its output can be fed into a classifier. The **classifier** part is a single fully connected layer, `Linear(in_features=1024, out_features=1000, bias=True)`, with 1,024 input features and 1,000 output features, one for each of the 1,000 ImageNet classes. Unfortunately, this setup won't work for our specific problem, since we only need to differentiate between 2 classes: cats and dogs. Therefore, we keep the features part of the pre-trained network fixed and train a new classifier on top of it.

_A general theme around pre-trained networks is that they are really good feature detectors, which can be used as the input for a simple feed-forward classifier._

In [18]:
# first thing we freeze our feature parameters:
for param in model.parameters():
    param.requires_grad = False

With `requires_grad = False`, autograd neither tracks operations on these parameters nor computes gradients for them when we run tensors through the model. This ensures that the pre-trained parameters don't get updated. It also speeds up training, because no gradients are computed for the frozen layers.
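
The effect of freezing can be checked on a toy model. The sketch below uses two small `nn.Linear` layers as hypothetical stand-ins for the features and classifier parts (they are not the DenseNet layers) and freezes only the first one. After a backward pass, the frozen layer has no gradients, while the trainable layer does.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two parts of a pre-trained network.
features = nn.Linear(8, 4)
classifier = nn.Linear(4, 2)

# Freeze the feature parameters only.
for param in features.parameters():
    param.requires_grad = False

inputs = torch.randn(5, 8)
loss = classifier(features(inputs)).sum()
loss.backward()

frozen_grads = [p.grad for p in features.parameters()]
trainable_grads = [p.grad for p in classifier.parameters()]

print(frozen_grads)                                 # [None, None]
print(all(g is not None for g in trainable_grads))  # True
```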

Now we need to replace the pre-trained classifier with our own classifier. We're going to use the `Sequential` module available in PyTorch. We give it a list of operations, and it passes the tensors through them sequentially. We also pass an `OrderedDict` to name each of these layers.

In [13]:
from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 256)),
    ('relu', nn.ReLU()),
    ('dropout1', nn.Dropout(.25)),
    ('fc2', nn.Linear(256, 2)),
    ('output', nn.LogSoftmax(dim = 1))    
]))

# now we attach this new classifier
model.classifier = classifier
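
As a quick sanity check on a head like this one, the sketch below rebuilds it on its own and feeds it random vectors of 1,024 values in place of real DenseNet features. The random input is an assumption for illustration, and dropout is left out so that the result is deterministic. It shows that named layers are reachable by name or by position, and that the `LogSoftmax` output holds log-probabilities, one per class.

```python
from collections import OrderedDict

import torch
from torch import nn

head = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 256)),
    ('relu', nn.ReLU()),
    ('fc2', nn.Linear(256, 2)),
    ('output', nn.LogSoftmax(dim=1)),
]))

# Layers can be accessed by the name given in the OrderedDict or by index.
same_layer = head.fc1 is head[0]

batch = torch.randn(4, 1024)   # 4 fake feature vectors
log_probs = head(batch)        # one row of 2 log-probabilities per input
probs = torch.exp(log_probs)   # exponentiate to get probabilities

print(same_layer)              # True
print(log_probs.shape)         # torch.Size([4, 2])
print(probs.sum(dim=1))        # each row sums to 1 (up to rounding)
```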

With our model built, we need to train the classifier. However, we are now using a really deep neural network, and training it on a CPU as usual would take a very long time. Instead, it's better to use a GPU (if available) for the calculations. The linear algebra computations run in parallel on the GPU, which can make training on the order of 100x faster. It's also possible to train on multiple GPUs, further decreasing training time.

PyTorch, along with pretty much every other deep learning framework, uses CUDA to efficiently compute the forward and backwards passes on the GPU. In PyTorch, you move your model parameters and other tensors to the GPU memory using `model.to('cuda')`. You can move them back from the GPU with `model.to('cpu')` which you'll commonly do when you need to operate on the network output outside of PyTorch.


Another option, and probably the best one, is to write device-agnostic code that automatically uses CUDA if it is available:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```
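
A runnable version of this pattern might look like the sketch below. The small `nn.Linear` layer and the batch of ones are placeholders, not part of the notebook's model. It runs unchanged on a machine with or without a GPU and moves the result back to the CPU before converting it to plain Python values.

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(3, 2).to(device)    # parameters now live on `device`
batch = torch.ones(4, 3).to(device)   # inputs must be on the same device

with torch.no_grad():
    output = layer(batch)

# Bring the result back to the CPU before handing it to non-PyTorch code.
values = output.to("cpu").tolist()

print(output.device.type == device.type)  # True
print(len(values), len(values[0]))        # 4 2
```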

>**Exercise:** Train a pretrained model to classify the cat and dog images. Continue with the DenseNet model, or try ResNet or VGG. Make sure you are only training the classifier and the parameters for the features part are frozen.

In [None]:
## ANSWER:

In [29]:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [37]:
model = models.densenet121(pretrained = True)
model

In [38]:
for param in model.parameters():
    param.requires_grad = False
    
from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 256)),
    ('relu', nn.ReLU()),
    ('dropout1', nn.Dropout(.25)),
    ('fc2', nn.Linear(256, 2)),
    ('output', nn.LogSoftmax(dim = 1))    
]))

model.classifier = classifier
criterion = nn.NLLLoss()
# we train only the classifier since feature parameters are frozen
optimizer = optim.Adam(model.classifier.parameters(), lr = 0.001)
model.to(device)

In [39]:
# Without a local GPU, this takes a very long time to train, so I ran it in
# a Google Colab notebook with a free GPU (8-Transfer-Learning-Google-Colab.ipynb)

epochs = 4
steps = 0
running_loss = 0
print_every = 5
for epoch in range(epochs):
  
    # Training loop:
    for inputs, labels in trainloader:
        steps += 1
        # Move input and label tensors to the default device
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        logps = model(inputs)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        
        # Validation loop after 5 training batches:
        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model(inputs)
                    batch_loss = criterion(logps, labels)
                    
                    test_loss += batch_loss.item()
                    
                    # Calculate accuracy
                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.float()).item()

            print(f'Epoch {epoch + 1}/{epochs}, '
                  f'Train loss: {running_loss / print_every:.3f}, '
                  f'Test loss: {test_loss / len(testloader):.3f}, '
                  f'Test accuracy: {accuracy / len(testloader):.3f}')
            running_loss = 0
            model.train()
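
The accuracy calculation inside the validation loop can be traced by hand on a small batch. The sketch below uses made-up probabilities for 4 images and 2 classes (the values and labels are assumptions for illustration). It also shows why the labels are reshaped with `view` before the comparison: without it, broadcasting compares every prediction against every label.

```python
import torch

# Made-up class probabilities for 4 images and 2 classes.
ps = torch.tensor([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4],
                   [0.3, 0.7]])
logps = torch.log(ps)                 # what a LogSoftmax output would look like
labels = torch.tensor([0, 1, 1, 1])   # the third image will be misclassified

# Most likely class per row; top_class has shape (4, 1).
top_p, top_class = torch.exp(logps).topk(1, dim=1)

# Reshape labels to (4, 1) so the comparison is element-wise.
equals = top_class == labels.view(*top_class.shape)
accuracy = equals.float().mean().item()

print(accuracy)                      # 0.75
print((top_class == labels).shape)   # torch.Size([4, 4]): unintended broadcast
```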

