# Dog Breed Identification (Part II): Breed Classification

We continue with our efforts on the [Dog Breed Identification Kaggle Competition](https://www.kaggle.com/c/dog-breed-identification). This time, we'll focus on building a deep __convolutional neural network__ to model the function between input images, $\{ \mathbf{X}^{(i)} \}_{i = 1}^{N_{\textrm{train}}}$, and their corresponding labels, $\{ y^{(i)} \}_{i = 1}^{N_{\textrm{train}}}$, where $N_{\textrm{train}}$ is the number of _training examples_ available. Each label $y^{(i)} \in \{ 1, \ldots, K \}$ is a proxy for the _object category_ of the input $\mathbf{X}^{(i)}$; e.g., if $\mathbf{X}^{(i)}$ is an image containing an [Alaskan Malamute](https://en.wikipedia.org/wiki/Alaskan_Malamute), its label is the integer mapped to the category "Alaskan Malamute". Every object category has a corresponding integer label.

We also have access to a (smaller) _validation dataset_, on which we periodically evaluate our network to check that we aren't [overfitting](https://en.wikipedia.org/wiki/Overfitting) the training data. We denote the validation data as $\{ (\mathbf{X}^{(i)}, y^{(i)}) \}_{i = 1}^{N_{\textrm{validate}}}$, where $N_{\textrm{validate}}$ is the number of _validation examples_ available, and $N_{\textrm{validate}} \ll N_{\textrm{train}}$.

A neural network can be expressed as a parametric function $f(\mathbf{X}^{(i)}; \theta)$, parametrized by a _parameter vector_ $\theta$. The parameters correspond to the __learned weights__ on the connections between neurons of adjacent layers. The goal is to learn a setting of $\theta$ which minimizes the total discrepancy between predictions and labels, $\sum_{i = 1}^{N_{\textrm{train}}} \ell \left( f(\mathbf{X}^{(i)}; \theta), y^{(i)} \right)$, where $\ell$ is a loss function that penalizes disagreement between a prediction and its label, while simultaneously choosing a reasonable setting of parameters that we expect to generalize well to new (e.g., test) data.

For the Kaggle competition, we are given _test data_ $\{ \mathbf{X}^{(i)} \}_{i = 1}^{N_{\textrm{test}}}$, and we submit $\{ f(\mathbf{X}^{(i)}; \hat{\theta}^*) = \hat{y}^{(i)} \}_{i = 1}^{N_{\textrm{test}}}$, where $\hat{\theta}^*$ is an estimate of the optimal parameters for the particular neural network model. These __predictions__ (or __inferences__) will be compared with the [ground truth](https://en.wikipedia.org/wiki/Ground_truth) categorical labels, and we will be ranked according to the number of test examples our model misclassifies.

At the time of writing, the best __error rate__ listed on the competition's leaderboard is 0.313% (accuracy of 100% - 0.313% = 99.687%). We don't expect to beat this, but obtaining a model with ~2-3% error rate is a realistic and challenging goal.
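
To make the bookkeeping concrete, here is a minimal sketch in plain Python of how per-class scores turn into integer predictions and then into an error rate. The function names and the toy scores are made up for illustration (they are not part of the competition code), and class labels are 0-based here.

```python
def predict_labels(batch_scores):
    # For each example, pick the index of the largest class score.
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in batch_scores]


def error_rate(predicted, targets):
    # Percentage of examples whose predicted label differs from the target.
    n_wrong = sum(p != t for p, t in zip(predicted, targets))
    return 100.0 * n_wrong / len(targets)


# Toy example: 4 images, 3 classes (illustrative numbers only).
toy_scores = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [3.0, 0.0, 0.0],
]
toy_targets = [1, 2, 2, 0]

toy_predicted = predict_labels(toy_scores)
toy_error = error_rate(toy_predicted, toy_targets)
toy_accuracy = 100.0 - toy_error
print(toy_predicted, toy_error, toy_accuracy)
```

Only the second image is misclassified, so the error rate is one in four and the accuracy is its complement.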

## Imports / miscellany

In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.optim
import torch.nn as nn
import torch.nn.functional as F

from torch.autograd import Variable
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import DataLoader

%matplotlib inline

train_path = os.path.join('..', 'data', 'processed_train')
valid_path = os.path.join('..', 'data', 'processed_valid')
test_path = os.path.join('..', 'data', 'processed_test')

# Is a CUDA-enabled GPU device available?
cuda = torch.cuda.is_available()

## Load pre-processed doggos

We've already pre-processed the pupper image data. For now, we will simply load it into memory.

In [6]:
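class ListDataset(torch.utils.data.Dataset):
    '''
    Custom torch.utils.data.Dataset used for reading
    in a list of data files from disk.
    '''
    def __init__(self, data_path):
        self.data_path = data_path
        # sorted() returns a new list, so its result must be stored.
        self.data_files = sorted(os.listdir(data_path))

    def __getitem__(self, idx):
        # os.listdir() returns bare file names; join them with the directory.
        # `load_file` is a helper which must be defined elsewhere; it is
        # expected to return an (input, target) pair for a single file.
        path = os.path.join(self.data_path, self.data_files[idx])
        return load_file(path)

    def __len__(self):
        return len(self.data_files)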
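
`ListDataset` relies on a `load_file` helper that isn't shown in this notebook, and the on-disk format of the pre-processed files isn't stated either. Purely as an illustration, the sketch below assumes each file holds one pickled `(image, label)` pair; that format, the file name, and the toy data are all assumptions, so swap in whatever matches how the data was actually saved.

```python
import os
import pickle
import tempfile


def load_file(path):
    # Hypothetical helper: assumes `path` holds one pickled (image, label) pair.
    with open(path, 'rb') as f:
        image, label = pickle.load(f)
    return image, label


# Round-trip demo on a throwaway file (toy 2x2 "image" with label 7).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'example_0.pkl')
    with open(demo_path, 'wb') as f:
        pickle.dump(([[0.0, 0.5], [1.0, 0.25]], 7), f)
    demo_image, demo_label = load_file(demo_path)

print(demo_label)
```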

In [17]:
# Load training (input, target) data.
train_data = ListDataset(train_path)

# Load validation (input, target) data.
valid_data = ListDataset(valid_path)

In [18]:
# Sanity check: print out training, validation data shapes
print('No. of training data:', len(train_data))
print('No. of validation data:', len(valid_data))

No. of training data: 8177
No. of validation data: 2045


## Define PyTorch neural network model

Here is where things get interesting. We will use the [PyTorch deep learning library](http://pytorch.org/) (which I highly recommend!) to create a convolutional neural network (CNN) to learn an approximate mapping between inputs $\mathbf{X}$ and targets $y$.

In [19]:
class CNN(nn.Module):
    '''
    Defines the convolutional neural network model.
    '''
    def __init__(self, input_size, n_classes):
        '''
        Constructor for the CNN object.
        
        Arguments:
            - input_size (tuple of int): The (height, width) of the input
                images, in pixels. Currently unused by the network.
            - n_classes (int): The number of target categories in the data.
        
        Returns:
            - Instantiated CNN object.
        '''
        super(CNN, self).__init__()
        
        # Convolutional layer portion of the network.
        self.convolutional = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=16, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        
        # Fully-connected layer portion of the network.
        self.dense = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes)
        )
    
    def forward(self, x):
        '''
        Defines the forward pass of the network.
        
        Arguments:
            - x (torch.Tensor): A minibatch of images with shape (M, D, H, W),
                with M = minibatch size, D = no. of channels, H = height, W = width.
        
        Returns:
            - Activations of the final layer of the CNN. That is,
                the representation of the input, learned by the network,
                which is used to disentangle the correct object category.
        '''
        # Get features computed by convolutional portion of the network.
        conv_features = self.convolutional(x)
        
        # Flatten these features from 4D (batch_size, D, H, W)
        # tensor to 2D (batch_size, D * H * W) tensor.
        flat_features = conv_features.view(conv_features.size(0), -1)
                
        # Get prediction computed by the fully-connected portion of the network.
        predictions = self.dense(flat_features)
        
        # Return the processed input data as the predicted target values.
        return predictions
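
The `512` in `nn.Linear(512, 256)` is not arbitrary: it is the number of features left after the six convolution + pooling blocks shrink a 256 x 256 image. As a sanity check, here is a small sketch in plain Python of the standard output-size arithmetic for convolution and pooling layers. The helper names are invented for this example.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    # Output length along one spatial axis for a convolution or pooling layer.
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(height, width, out_channels_per_block):
    # Each block is a 5x5 convolution (stride 1, padding 2) followed by a
    # 2x2 max pool, whose stride defaults to its kernel size.
    for _ in out_channels_per_block:
        height = conv_output_size(height, kernel_size=5, stride=1, padding=2)
        width = conv_output_size(width, kernel_size=5, stride=1, padding=2)
        height = conv_output_size(height, kernel_size=2, stride=2)
        width = conv_output_size(width, kernel_size=2, stride=2)
    # Flattening keeps (channels * height * width) values per image.
    return out_channels_per_block[-1] * height * width


n_features = flattened_features(256, 256, [16, 16, 32, 32, 32, 32])
print(n_features)  # 32 channels * 4 * 4 = 512
```

The padded convolutions preserve the spatial size and each pooling layer halves it, so a different input resolution or number of blocks changes the first `nn.Linear` dimension accordingly.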

Now that we've defined the network, we can instantiate one and train it to identify the dog breeds in our image dataset. But first, we must choose network [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) (parameters that are chosen prior to training which affect how the training operates, as opposed to parameters which are learned during network training). We also store some useful information in workspace-level variables, which can be changed once (here) to affect the rest of the notebook. Finally, we will use `torch.utils.data.DataLoader`s to simplify the presentation of data to the network.

In [20]:
# Hyperparameters
n_epochs = 50  # No. of times to train on the entire training data.
batch_size = 100  # No. of examples used in minibatch stochastic gradient descent (SGD).
print_interval = 10  # No. of minibatch SGD iterations between each progress message.

# Useful information
input_size = (256, 256)  # As defined in "Data Exploration.ipynb".
n_classes = 120  # No. of doggo breeds.

# Data loaders
train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, num_workers=8)
# Shuffling is unnecessary for evaluation; accuracy doesn't depend on ordering.
valid_loader = DataLoader(dataset=valid_data, batch_size=batch_size, shuffle=False, num_workers=8)
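
Conceptually, a data loader with `shuffle=True` permutes the example indices once per epoch and hands them out in chunks of `batch_size`. The sketch below mimics only that index bookkeeping in plain Python; it is not how `DataLoader` is implemented (there is no file loading, collation, or worker processes here), and the function name and `seed` argument are made up for this example.

```python
import random


def minibatch_indices(n_examples, batch_size, shuffle=True, seed=0):
    # Yield lists of example indices, one list per minibatch.
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        # The final slice may be shorter than `batch_size`.
        yield indices[start:start + batch_size]


# 8177 training examples in minibatches of 100.
batches = list(minibatch_indices(8177, 100))
print(len(batches), len(batches[-1]))
```

With 8177 training examples and a batch size of 100, an epoch consists of 82 minibatches, the last of which holds only the 77 leftover examples.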

## Train the doggo recognizer

We're ready to begin network training. We'll instantiate a network, define the criterion we aim to minimize ([Multiclass Log Loss](https://www.kaggle.com/wiki/MultiClassLogLoss)), define the optimization algorithm, and train the model for some number of epochs (passes through the training data, given by `n_epochs`). For the optimizer, we'll use [Adam](https://arxiv.org/abs/1412.6980), short for adaptive moment estimation, a variant of SGD which adapts a separate step size for each parameter using running estimates of the first and second moments of its gradient.
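
For intuition about the criterion, here is a minimal sketch of multiclass log loss in plain Python: a log-softmax over the raw class scores, followed by the average negative log-probability assigned to the correct class. This is the same quantity `nn.CrossEntropyLoss` computes with its default settings, though the helper names below are invented and class indices are 0-based.

```python
import math


def log_softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_norm for s in scores]


def multiclass_log_loss(batch_scores, targets):
    # Average negative log-probability of the correct class over the minibatch.
    total = 0.0
    for scores, target in zip(batch_scores, targets):
        total -= log_softmax(scores)[target]
    return total / len(targets)


# A network that scores all 120 breeds equally pays log(120) per example.
uniform_loss = multiclass_log_loss([[0.0] * 120], [0])

# Raising the score of the correct class lowers the loss.
confident_scores = [0.0] * 120
confident_scores[0] = 5.0
confident_loss = multiclass_log_loss([confident_scores], [0])
print(uniform_loss, confident_loss)
```

The uniform case gives a useful reference point: a training loss near $\log 120 \approx 4.79$ means the network is doing no better than guessing uniformly.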

In [21]:
# Instantiate CNN object.
network = CNN(input_size=input_size, n_classes=n_classes)
if cuda:
    network.cuda()

# Create loss / cost / objective function.
criterion = nn.CrossEntropyLoss()

# Specify optimization routine.
optimizer = torch.optim.Adam(network.parameters(), weight_decay=1e-2)
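
To see roughly what one optimizer step does, here is a sketch of the Adam update for a single scalar parameter in plain Python. It follows the update rule from the Adam paper, with weight decay folded into the gradient as an L2 penalty (which is how `torch.optim.Adam` describes its `weight_decay` argument). The default values match the documented PyTorch defaults; the `adam_step` name, the `state` dictionary, and the toy quadratic objective are assumptions made for illustration.

```python
import math


def adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0):
    beta1, beta2 = betas
    # L2 penalty: add weight_decay * param to the gradient.
    grad = grad + weight_decay * param
    state['step'] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad * grad
    # Bias correction compensates for initializing the averages at zero.
    m_hat = state['m'] / (1 - beta1 ** state['step'])
    v_hat = state['v'] / (1 - beta2 ** state['step'])
    # Per-parameter step: scaled by the root of the second-moment estimate.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem: minimize (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = 0.0
state = {'step': 0, 'm': 0.0, 'v': 0.0}
for _ in range(2000):
    theta = adam_step(theta, 2.0 * (theta - 3.0), state, lr=0.05)
print(theta)
```

In the real optimizer, the same bookkeeping is kept separately for every entry of every weight tensor returned by `network.parameters()`.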

Here is the training loop!

In [22]:
for epoch in range(n_epochs):
    train_correct, train_total = 0, 0
    
    # On each minibatch SGD iteration, we get `batch_size` samples from `train_loader`.
    for idx, (inputs, targets) in enumerate(train_loader):
        # Convert `torch.Tensor`s to `Variable`s.
        if cuda:
            inputs = Variable(inputs.cuda())
            targets = Variable(targets.cuda())
        else:
            inputs = Variable(inputs)
            targets = Variable(targets)
        
        # Zero out gradient buffers left over from the previous iteration.
        optimizer.zero_grad()
        
        # Run forward pass of network to get predictions. Calling the module
        # (rather than `forward` directly) also runs any registered hooks.
        predictions = network(inputs)
        
        # Get integer predictions by selecting the maximal output activation.
        _, predicted = torch.max(predictions.data, 1)
        
        # Add correct classifications to a running sum.
        train_correct += (predicted.cpu() == targets.data.cpu()).sum()
        
        # Add number of items in the minibatch to running sum.
        train_total += targets.size(0)
        
        # Calculate loss (non-negative function of predictions and true targets).
        loss = criterion(predictions, targets)
        
        # Run backward pass (calculate gradient of loss w.r.t. network parameters).
        loss.backward()
        
        # Take optimization step (update network parameters, using the gradient, to reduce the loss).
        optimizer.step()
        
        if idx % print_interval == 0:
            # `.item()` extracts the Python float from the scalar loss (PyTorch 0.4 and later).
            print('Epoch [%d / %d], Iteration [%d / %d], Loss: %.4f' % (epoch + 1, \
                                n_epochs, idx + 1, len(train_loader), loss.item()))
    
    valid_correct, valid_total = 0, 0
    
    # Switch to evaluation mode so that dropout is disabled and
    # batch normalization uses its running statistics.
    network.eval()
    
    # Calculate the accuracy of the network on the
    # validation data at the end of each epoch.
    for idx, (inputs, targets) in enumerate(valid_loader):
        # Convert `torch.Tensor`s to `Variable`s.
        if cuda:
            inputs = Variable(inputs.cuda())
        else:
            inputs = Variable(inputs)
        
        predictions = network(inputs)
        _, predicted = torch.max(predictions.data, 1)
        valid_correct += (predicted.cpu() == targets).sum()
        
        valid_total += targets.size(0)
    
    # Switch back to training mode for the next epoch.
    network.train()
        
    print()
    print('Training accuracy: %.4f' % (100 * train_correct / train_total))
    print('Validation accuracy: %.4f' % (100 * valid_correct / valid_total))
    print()

NotImplementedError: Traceback (most recent call last):
  File "/home/djsaunde/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 40, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/djsaunde/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 40, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/djsaunde/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 13, in __getitem__
    raise NotImplementedError
NotImplementedError
