# Dog Breed Identification (Part I): Data Exploration

We continue with our efforts on the [Dog Breed Identification Kaggle Competition](https://www.kaggle.com/c/dog-breed-identification). This time, we'll focus on building a deep __convolutional neural network__ to model the function between input images, $\{ \mathbf{X}^{(i)} \}_{i = 1}^{N_{\textrm{train}}}$, and corresponding labels, $\{ y^{(i)} \}_{i = 1}^{N_{\textrm{train}}}$, where $N_{\textrm{train}}$ = the number of _training data_ available. The labels $y^{(i)} \in \{ 1, ..., K \}$ are proxy to the _object category_ of the input $\mathbf{X}^{(i)}$; e.g., if some $\mathbf{X}^{(i)}$ is an image containing an [Alaskan Malamute](https://en.wikipedia.org/wiki/Alaskan_Malamute), its label should be the integer which is mapped to the category "Alaskan Malamute". All possible object categories have a corresponding integer label.

We also have access to a (smaller) _validation dataset_, which we periodically evaluate our network on to ensure we don't [overfit](https://en.wikipedia.org/wiki/Overfitting) the training data. We denote the validation data as $(\{ \mathbf{X}^{(i)}, y^{(i)} \})_{i = 1}^{N_{\textrm{validate}}}$, where $N_{\textrm{validate}}$ = the number of _validation data_ available, and $N_{\textrm{validate}} << N_{\textrm{train}}$.

A neural network can be expressed as a parametric function $f(\mathbf{X}^{(i)}; \theta)$; parametrized by a _parameter vector_ $\theta$. The parameters correspond to the __learned weights__ on the connections between neurons of adjacent layers. The goal is to learn a setting of $\theta$ which minimizes the differences between $\sum_{i = 1}^{N_{\textrm{train}}} \left[ f(\mathbf{X}^{(i)}; \theta) - y^{(i)} \right]$, while simultaneously choosing a reasonable setting of parameters that we expect to generalize well to new (e.g., test) data.

For the Kaggle competition, we are given _test data_ $\{ \mathbf{X}^{(i)} \}_{i = 1}^{N_{\textrm{test}}}$, and we submit $\{ f(\mathbf{X}^{(i)}; \hat{\theta}^*) = \hat{y}^{(i)} \}_{i = 1}^{N_{\textrm{test}}}$, where $\hat{\theta}^*$ is an estimate of optimal parameters given the particular neural network model. These __predictions__ (or __inferences__) will be compared with the [ground truth](https://en.wikipedia.org/wiki/Ground_truth) categorical labels, and we will be ranked according to the number of test data our model misclassified.

At the time of writing, the best __error rate__ listed on the competition's leaderboard is 0.313% (accuracy of 100% - 0.313% = 99.687%). We don't expect to beat this, but obtaining a model with ~2-3% error rate is a realistic and challenging goal.

## Imports / miscellany

In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.optim
import torch.nn as nn
import torch.nn.functional as F

from torch.autograd import Variable
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import DataLoader

%matplotlib inline

data_path = os.path.join('..', 'data')

# Are there CUDA-enabled GPU devices available? 
cuda = torch.cuda.device_count() > 0

## Load pre-processed doggos

We've already pre-processed the pupper image data. For now, we will simply load it into memory.

In [2]:
# Load training (input, target) data.
X_train = np.load(os.path.join(data_path, 'X_train.npy')).transpose((0, 3, 2, 1))
y_train = np.load(os.path.join(data_path, 'y_train.npy'))
y_train = np.array([ np.argmax(y_train[idx, :]) for idx in range(y_train.shape[0]) ])

# Load validation (input, target) data.
X_valid = np.load(os.path.join(data_path, 'X_valid.npy')).transpose((0, 3, 2, 1))
y_valid = np.load(os.path.join(data_path, 'y_valid.npy'))
y_valid = np.array([ np.argmax(y_valid[idx, :]) for idx in range(y_valid.shape[0]) ])

In [3]:
# Sanity check: print out training, validation data shapes
print('Training data shapes (X, y):', (X_train.shape, y_train.shape))
print('Validation data shapes (X, y):', (X_valid.shape, y_valid.shape))

Training data shapes (X, y): ((8177, 3, 256, 256), (8177,))
Validation data shapes (X, y): ((2045, 3, 256, 256), (2045,))


## Define PyTorch neural network model

Here is where things get interesting. We will use the [PyTorch deep learning library](http://pytorch.org/) (which I highly recommend!) to create a convolutional neural network (CNN) to learn an approximate mapping between inputs $\mathbf{X}$ and targets $y$.

In [4]:
class CNN(nn.Module):
    '''
    Defines the convolutional neural network model.
    '''
    def __init__(self, input_size, n_classes):
        '''
        Constructor for the CNN object.
        
        Arguments:
            - input_size (int): The number of units in the input "layer".
                Corresponds to the number of pixels in the input images.
            - n_classes (int): The number of target categories in the data.
        
        Returns:
            - Instantiated CNN object.
        '''
        super(CNN, self).__init__()
        
        # Convolutional layer portion of the network.
        self.convolutional = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=16, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=32, \
                    kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        
        # Fully-connected layer portion of the network.
        self.dense = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes)
        )
    
    def forward(self, x):
        '''
        Defines the forward pass of the network.
        
        Arguments:
            - x (np.ndarray): A minibatch of images with
                shape (M, D, H, W), with M = minibatch size.
        
        Returns:
            - Activations of the final layer of the CNN. That is,
                the representation of the input, learned by the network,
                which is used to disentangle the correct object category.
        '''
        # Get features computed by convolutional portion of the network.
        conv_features = self.convolutional(x)
        
        # Flatten these features from 4D (batch_size, D, H, W)
        # tensor to 2D (batch_size, D * H * W) tensor.
        flat_features = conv_features.view(conv_features.size(0), -1)
                
        # Get prediction computed by the fully-connected portion of the network.
        predictions = self.dense(flat_features)
        
        # Return the processed input data as the predicted target values.
        return predictions

Now that we've defined the network, we can instantiate one and train it to identify the dog breeds in our image dataset. But first, we must choose network [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) (parameters that are chosen prior to training which affect how the training operates, as opposed to parameters which are learned during network training). We also store some useful information in workspace-level variables, which can be changed once (here) to affect the rest of the notebook. Finally, we will use `torch.utils.data.DataLoader`s to simplify the presentation of data to the network.

In [5]:
# Hyperparameters
n_epochs = 50  # No. of times to train on the entire training data.
batch_size = 100  # No. of examples used in minibatch stochastic gradient descent (SGD).
print_interval = 10  # No. of minibatch SGD iterations between each progress message.

# Useful information
input_size = (256, 256)  # As defined in "Data Exploration.ipynb".
n_classes = 120  # No. of doggo breeds.

# Cast training, validation data to `torch.Tensor`s.
try:
    X_train, y_train = torch.from_numpy(X_train).float(), torch.from_numpy(y_train)
    X_valid, y_valid = torch.from_numpy(X_valid).float(), torch.from_numpy(y_valid)
except RuntimeError:
    print('Data already cast to torch.Tensor data type.')
    
# Data loaders
train_loader = DataLoader(dataset=TensorDataset(X_train, y_train), batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset=TensorDataset(X_valid, y_valid), batch_size=batch_size, shuffle=True)

## Train the doggo recognizer

We're ready to begin network training. We'll instantiate a network, define the criterion we aim to minimize ([Multiclass Log Loss](https://www.kaggle.com/wiki/MultiClassLogLoss)), define the optimization algorithm (we'll use [Adam](https://arxiv.org/abs/1412.6980), short for adaptive moments, a variant of SGD which dynamically updates individual parameter learning rates during training), and train the model for some number of epochs (the number of passes through the training data, given by `n_epochs`).

In [6]:
# Instantiate CNN object.
network = CNN(input_size=input_size, n_classes=n_classes)
if cuda:
    network.cuda()

# Create loss / cost /objective function.
criterion = nn.CrossEntropyLoss()

# Specify optimization routine.
optimizer = torch.optim.Adam(network.parameters(), weight_decay=1e-2)

Here is the training loop!

In [7]:
for epoch in range(n_epochs):
    train_correct, train_total = 0, 0
    
    # On each minibatch SGD iteration, we get `batch_size` samples from `X_train`.
    for idx, (inputs, targets) in enumerate(train_loader):
        # Convert `torch.Tensor`s to `Variable`s.
        if cuda:
            inputs = Variable(inputs.cuda())
            targets = Variable(targets.cuda())
        else:
            inputs = Variable(inputs)
            targets = Variable(targets)
        
        # Run forward, backward pass of network.
        # Zero out gradient buffer.
        optimizer.zero_grad()
        
        # Run forward pass of network to get predictions.
        predictions = network.forward(inputs)
        
        # Get integer predictions by selecting the maximal output activation.
        _, predicted = torch.max(predictions.data, 1)
        
        # Add correct classifications to a running sum.
        train_correct += (predicted.cpu() == targets.data.cpu()).sum()
        
        # Add number of items in the minibatch to running sum.
        train_total += targets.size(0)
        
        # Calculate loss (non-negative function of predictions and true targets).
        loss = criterion(predictions, targets)
        
        # Run backward pass (calculate gradient of loss w.r.t. network parameters).
        loss.backward()
        
        # Take optimization step (update network parameters in opposite direction of loss).
        optimizer.step()
        
        if idx % print_interval == 0:
            print('Epoch [%d / %d], Iteration [%d / %d], Loss: %.4f' % (epoch + 1, \
                                n_epochs, idx + 1, len(train_loader), loss.data[0]))
    
    valid_correct, valid_total = 0, 0
    
    # Calculate the accuracy of the network on the
    # validation data at the end of each epoch.
    for idx, (inputs, targets) in enumerate(valid_loader):
        # Convert `torch.Tensor`s to `Variable`s.
        if cuda:
            inputs = Variable(inputs.cuda())
        else:
            inputs = Variable(inputs)
        
        predictions = network.forward(inputs)
        _, predicted = torch.max(predictions.data, 1)
        valid_correct += (predicted.cpu() == targets).sum()
        
        valid_total += targets.size(0)
        
    print()
    print('Training accuracy: %.4f' % (100 * train_correct / train_total))
    print('Validation accuracy: %.4f' % (100 * valid_correct / valid_total))
    print()

Epoch [1 / 50], Iteration [1 / 82], Loss: 4.8510
Epoch [1 / 50], Iteration [11 / 82], Loss: 4.8131
Epoch [1 / 50], Iteration [21 / 82], Loss: 4.7949
Epoch [1 / 50], Iteration [31 / 82], Loss: 4.7696
Epoch [1 / 50], Iteration [41 / 82], Loss: 4.7969
Epoch [1 / 50], Iteration [51 / 82], Loss: 4.7720
Epoch [1 / 50], Iteration [61 / 82], Loss: 4.7456
Epoch [1 / 50], Iteration [71 / 82], Loss: 4.7372
Epoch [1 / 50], Iteration [81 / 82], Loss: 4.7110

Training accuracy: 1.3208
Validation accuracy: 2.2005

Epoch [2 / 50], Iteration [1 / 82], Loss: 4.7349
Epoch [2 / 50], Iteration [11 / 82], Loss: 4.7118
Epoch [2 / 50], Iteration [21 / 82], Loss: 4.6357
Epoch [2 / 50], Iteration [31 / 82], Loss: 4.6846
Epoch [2 / 50], Iteration [41 / 82], Loss: 4.6171
Epoch [2 / 50], Iteration [51 / 82], Loss: 4.5101
Epoch [2 / 50], Iteration [61 / 82], Loss: 4.6820
Epoch [2 / 50], Iteration [71 / 82], Loss: 4.6121
Epoch [2 / 50], Iteration [81 / 82], Loss: 4.5589

Training accuracy: 2.3236
Validation accuracy

Epoch [17 / 50], Iteration [11 / 82], Loss: 3.5160
Epoch [17 / 50], Iteration [21 / 82], Loss: 3.6315
Epoch [17 / 50], Iteration [31 / 82], Loss: 3.5694
Epoch [17 / 50], Iteration [41 / 82], Loss: 3.4834
Epoch [17 / 50], Iteration [51 / 82], Loss: 3.5107
Epoch [17 / 50], Iteration [61 / 82], Loss: 3.5055
Epoch [17 / 50], Iteration [71 / 82], Loss: 3.5665
Epoch [17 / 50], Iteration [81 / 82], Loss: 3.5036

Training accuracy: 12.1560
Validation accuracy: 9.3399

Epoch [18 / 50], Iteration [1 / 82], Loss: 3.4409
Epoch [18 / 50], Iteration [11 / 82], Loss: 3.3884
Epoch [18 / 50], Iteration [21 / 82], Loss: 3.5361
Epoch [18 / 50], Iteration [31 / 82], Loss: 3.6354
Epoch [18 / 50], Iteration [41 / 82], Loss: 3.5975
Epoch [18 / 50], Iteration [51 / 82], Loss: 3.5824
Epoch [18 / 50], Iteration [61 / 82], Loss: 3.5674
Epoch [18 / 50], Iteration [71 / 82], Loss: 3.5484
Epoch [18 / 50], Iteration [81 / 82], Loss: 3.5330

Training accuracy: 13.2689
Validation accuracy: 9.4866

Epoch [19 / 50], Ite

Epoch [33 / 50], Iteration [1 / 82], Loss: 2.7532
Epoch [33 / 50], Iteration [11 / 82], Loss: 2.7767
Epoch [33 / 50], Iteration [21 / 82], Loss: 2.5016
Epoch [33 / 50], Iteration [31 / 82], Loss: 2.7273
Epoch [33 / 50], Iteration [41 / 82], Loss: 2.8220
Epoch [33 / 50], Iteration [51 / 82], Loss: 2.6691
Epoch [33 / 50], Iteration [61 / 82], Loss: 2.8386
Epoch [33 / 50], Iteration [71 / 82], Loss: 2.7296
Epoch [33 / 50], Iteration [81 / 82], Loss: 2.9105

Training accuracy: 25.3149
Validation accuracy: 12.0782

Epoch [34 / 50], Iteration [1 / 82], Loss: 2.6561
Epoch [34 / 50], Iteration [11 / 82], Loss: 2.6352
Epoch [34 / 50], Iteration [21 / 82], Loss: 2.6641
Epoch [34 / 50], Iteration [31 / 82], Loss: 2.5209
Epoch [34 / 50], Iteration [41 / 82], Loss: 2.7778
Epoch [34 / 50], Iteration [51 / 82], Loss: 2.7939
Epoch [34 / 50], Iteration [61 / 82], Loss: 2.7237
Epoch [34 / 50], Iteration [71 / 82], Loss: 3.0107
Epoch [34 / 50], Iteration [81 / 82], Loss: 2.7386

Training accuracy: 26.134


Training accuracy: 39.9780
Validation accuracy: 12.6161

Epoch [49 / 50], Iteration [1 / 82], Loss: 1.8853
Epoch [49 / 50], Iteration [11 / 82], Loss: 1.9977
Epoch [49 / 50], Iteration [21 / 82], Loss: 1.9052
Epoch [49 / 50], Iteration [31 / 82], Loss: 1.9446
Epoch [49 / 50], Iteration [41 / 82], Loss: 1.9505
Epoch [49 / 50], Iteration [51 / 82], Loss: 2.1301
Epoch [49 / 50], Iteration [61 / 82], Loss: 2.0906
Epoch [49 / 50], Iteration [71 / 82], Loss: 2.1123
Epoch [49 / 50], Iteration [81 / 82], Loss: 1.9767

Training accuracy: 39.8557
Validation accuracy: 12.7628

Epoch [50 / 50], Iteration [1 / 82], Loss: 1.9892
Epoch [50 / 50], Iteration [11 / 82], Loss: 2.0374
Epoch [50 / 50], Iteration [21 / 82], Loss: 1.9739
Epoch [50 / 50], Iteration [31 / 82], Loss: 1.9396
Epoch [50 / 50], Iteration [41 / 82], Loss: 2.1095
Epoch [50 / 50], Iteration [51 / 82], Loss: 2.0503
Epoch [50 / 50], Iteration [61 / 82], Loss: 2.1246
Epoch [50 / 50], Iteration [71 / 82], Loss: 2.0771
Epoch [50 / 50], It