# Improving Performance

This is largely the same notebook as `train_model.ipynb` but with a number of performance improvements.


Since PyTorch gives you so much control, your decisions can greatly increase or decrease training time.

In [None]:
import copy
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import PIL
from PIL import Image
import random
from sklearn.model_selection import train_test_split
import time

import torch
from torch import nn
import torch.multiprocessing as mp
from torch.utils.data import Dataset
from torch.utils.data.sampler import SubsetRandomSampler
from torchvision import datasets, transforms as T
from torchvision.io import read_image
import torchvision.models as models
from torchvision.transforms import Lambda

## GPUs

The first change is the addition of GPUs. It's best practice to select the device with a command like the one below instead of hard-coding the GPU, so your code can still run (very slowly!) on a CPU-only machine.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
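As a minimal sketch of what this device choice means in practice, the snippet below moves a model and a batch to the selected device. The tiny `nn.Linear` layer and the random batch are stand-ins for illustration only and are not part of this notebook's pipeline.

```python
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Stand-in model and batch, used only for illustration.
toy_model = nn.Linear(4, 2).to(device)   # modules are moved in place and returned
toy_batch = torch.randn(8, 4)            # tensors are created on the CPU by default
toy_batch = toy_batch.to(device)         # returns a tensor on the target device

toy_output = toy_model(toy_batch)        # works because both live on the same device
print(toy_output.device)
```

On a CPU-only machine the same code runs unchanged, because `.to('cpu')` leaves everything where it already is.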

In [None]:
img_dir = 'data'
cache_dir = '/domino/datasets/' + os.environ['DOMINO_STARTING_USERNAME'] + '/' + os.environ['DOMINO_PROJECT_NAME'] + '/scratch'
# to use all of the species classes, read in image_labels.csv
annotations_file = 'labels_reduced_classes.csv'

labels = pd.read_csv(annotations_file)
species = sorted(labels['question__species'].unique())
species_to_idx = dict(zip(species,range(len(species))))
idx_to_species = {v: k for k, v in species_to_idx.items()}

We can set the number of images to use for each class

In [None]:
#num_samples (per class) must be less than 1000
num_samples = 200

small_dfs = []
for animal in species:
    small_df = labels.loc[labels['question__species'] == animal]
    small_df = small_df[['image_name','question__species']].sample(num_samples, random_state=42)
    small_dfs.append(small_df)

reduced_images = pd.concat(small_dfs, ignore_index=True)
reduced_images.to_csv('reduced_images.csv', index=False)


#change annotations file location
annotations_file = 'reduced_images.csv'
labels = pd.read_csv(annotations_file)

labels['question__species'].value_counts()

## Building a Custom Dataset

We use the same custom dataset class as we defined in the last notebook, but this version of `SnapshotSerengetiDataset` has one notable optimization: the `__getitem__` method now calls a `__cacheitem__` method, which resizes an image and saves it as a tensor for quicker use in the future.

In [None]:
class SnapshotSerengetiDataset(Dataset):
    def __init__(self, annotations_file, img_dir, cache_dir, class_dict, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.cache_dir = cache_dir
        self.transform = transform
        self.target_transform = target_transform
        self.class2index = class_dict

    def __len__(self):
        return len(self.img_labels)
    
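    def __cacheitem__(self, idx):
        # Decode and resize once, then save the tensor so later loads skip this work.
        cache_path = os.path.join(self.cache_dir, self.img_labels.iloc[idx, 0])
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)  # ensure the cache folder exists
        image = read_image(img_path)
        image = T.Resize(256)(image)
        torch.save(image, cache_path)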
        
    def __getitem__(self, idx):
        cache_path = os.path.join(self.cache_dir, self.img_labels.iloc[idx, 0])
        if not os.path.isfile(cache_path):
            self.__cacheitem__(idx)
        image = torch.load(cache_path)
        image = torch.mul(image, 1/255.) # scale to [0, 1]
        label = self.class2index[self.img_labels.iloc[idx, 1]]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

## Profiling

In software development, you don't want to put effort into making something faster unless you know that the part you are working on is the bottleneck.

PyTorch has some tools to help with this, but `line_profiler` is a great Jupyter extension for profiling functions line by line. That's how we identified `read_image()` and `T.Resize(256)` as the slowest parts of the `__getitem__` method.

Below is an example of what that profiling output can look like.

In [None]:
%load_ext line_profiler

`transform_bench()` naively loads the image and does all the transforms. You can see that over 80% of the time is spent in the first two commands.

In [None]:
def transform_bench(idx):
    img = read_image(os.path.join(img_dir, labels.iloc[idx, 0]))
    img = T.Resize(256)(img)
    img = T.CenterCrop(224)(img)
    img = T.RandomRotation(30)(img)
    img = T.RandomHorizontalFlip()(img)
    img = T.ConvertImageDtype(torch.float32)(img)
    img = T.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])(img)
    return img

idx = 1
%lprun -f transform_bench transform_bench(idx)

`cache_transform_bench()` loads the cached tensor that has already been converted from the raw image and resized. This makes the image load roughly 10x as fast after the first time we use the image.

In [None]:
def pre_transform(idx):
    img = read_image(os.path.join(img_dir, labels.iloc[idx, 0]))
    img = T.Resize(256)(img)
    torch.save(img, os.path.join(cache_dir, labels.iloc[idx, 0]))

def cache_transform_bench(idx):
    img = torch.load(os.path.join(cache_dir, labels.iloc[idx, 0]))
    img = T.CenterCrop(224)(img)
    img = T.RandomRotation(30)(img)
    img = T.RandomHorizontalFlip()(img)
    img = T.ConvertImageDtype(torch.float32)(img)
    img = T.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])(img)
    return img

idx = 1
pre_transform(idx)
%lprun -f cache_transform_bench cache_transform_bench(idx)
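If `line_profiler` is not installed, a coarser comparison is possible with the standard library alone. The sketch below is one way to do it under that assumption, not how the numbers in this notebook were produced. `time_call` is a hypothetical helper, and the two lambdas are stand-ins for `transform_bench` and `cache_transform_bench`.

```python
import time

def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn(*args)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-ins for an expensive loader and a cheap one.
slow = lambda n: sum(i * i for i in range(n))
fast = lambda n: n

slow_t = time_call(slow, 200_000)
fast_t = time_call(fast, 200_000)
print('slow: {:.6f}s  fast: {:.6f}s'.format(slow_t, fast_t))
```

In this notebook the equivalent calls would be `time_call(transform_bench, idx)` and `time_call(cache_transform_bench, idx)`. This gives only a total per function, so it cannot show which line is the bottleneck the way `%lprun` does.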

Here we have the same image transforms as in `clean_data.ipynb`, as well as a transform to one-hot encode the species index.

Note that `T.Resize(256)` is no longer necessary since it is done the first time an image is loaded via `SnapshotSerengetiDataset.__getitem__` and cached.

In [None]:
train_transform = T.Compose([#T.Resize(256),
                             T.RandomRotation(30),
                             T.RandomHorizontalFlip(),
                             T.CenterCrop(224),
                             T.ConvertImageDtype(torch.float32),
                             T.Normalize(mean=[0.485, 0.456, 0.406],
                                          std=[0.229, 0.224, 0.225])])

val_transform = T.Compose([#T.Resize(256),
                           T.CenterCrop(224),
                           T.ConvertImageDtype(torch.float32),
                           T.Normalize(mean=[0.485, 0.456, 0.406],
                                        std=[0.229, 0.224, 0.225])])

target_transform = Lambda(lambda y: torch.zeros(len(species_to_idx), dtype=torch.float).scatter_(dim=0, index=torch.tensor(y), value=1))
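To make the one-hot step concrete, here is a small sketch with an illustrative class count of 4 and an example index of 2 (both values are assumptions chosen for the demonstration). It also shows that `torch.nn.functional.one_hot` produces the same vector, which may be easier to read. The float one-hot target matters because `nn.BCEWithLogitsLoss`, used later in this notebook, expects float targets with the same shape as the model output.

```python
import torch
import torch.nn.functional as F

num_classes = 4   # illustrative; the notebook uses len(species_to_idx)
y = 2             # an example class index

# Same construction as target_transform: write a 1 at position y.
via_scatter = torch.zeros(num_classes, dtype=torch.float).scatter_(
    dim=0, index=torch.tensor(y), value=1)

# Alternative: one_hot returns integers, so cast to float afterwards.
via_one_hot = F.one_hot(torch.tensor(y), num_classes=num_classes).float()

print(via_scatter)
print(torch.equal(via_scatter, via_one_hot))
```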

Now we prepare our train and validation data the same as before.

In [None]:
random_seed= 42
val_size = .1

def stratified_split(annotations_file, test_size = 0.2):
    img_labels = pd.read_csv(annotations_file)
    indices = img_labels.index
    labels = img_labels[['question__species']]
    train_indices, test_indices, _, _ = train_test_split(indices, labels, stratify=labels, test_size=test_size, random_state=random_seed)
    return train_indices, test_indices

train_indices, val_indices = stratified_split(annotations_file=annotations_file, test_size=val_size)

dataset_size = len(train_indices) + len(val_indices)

train_dataset = SnapshotSerengetiDataset(annotations_file=annotations_file, img_dir=img_dir, cache_dir=cache_dir, class_dict=species_to_idx, transform=train_transform, target_transform=target_transform)
val_dataset = SnapshotSerengetiDataset(annotations_file=annotations_file, img_dir=img_dir, cache_dir=cache_dir, class_dict=species_to_idx, transform=val_transform, target_transform=target_transform)

train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)
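A quick way to check that a split is stratified is to compare the class fractions on each side. The sketch below uses only the standard library; `class_fractions` is a hypothetical helper and the toy labels and indices are made up for the demonstration.

```python
from collections import Counter

def class_fractions(all_labels, indices):
    """Fraction of each class among the rows selected by `indices`."""
    counts = Counter(all_labels[i] for i in indices)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy data: 2 classes with 10 rows each, split 80/20 within each class.
toy_labels = ['zebra'] * 10 + ['lion'] * 10
toy_train = list(range(0, 8)) + list(range(10, 18))
toy_val = [8, 9, 18, 19]

print(class_fractions(toy_labels, toy_train))
print(class_fractions(toy_labels, toy_val))
```

With the variables in this notebook, the same check would be `class_fractions(labels['question__species'].tolist(), train_indices)` and the matching call with `val_indices`. The two dictionaries should be close to each other if the split is stratified.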

## Dataloaders

PyTorch dataloaders not only make it easier to feed data to a model, but also include a number of features that can be used to improve performance.

There are two used here:
* `num_workers` tells a dataloader how many parallel processes should run to prep data. Here they all run on the CPUs to prep batches in parallel. Sometimes the underlying dataset will be configured so that data prep runs on the GPU, but profiling showed it was faster for us to transfer a whole batch at a time.
* `pin_memory` puts data in page-locked memory, which makes copies of prepared batches to the GPU much faster.
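The two options above can be sketched on random stand-in data as follows. `make_loader` is a hypothetical helper, and passing `non_blocking=True` to `.to()` is an optional addition that this notebook does not use; it lets the copy from pinned memory overlap with other work. The sketch uses `num_workers=0` so that it runs anywhere, whereas real training would use a larger value.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size, num_workers, use_cuda):
    """Build a DataLoader that pins memory only when a GPU will consume the batches."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,   # parallel CPU processes preparing batches
        pin_memory=use_cuda,       # page-locked host memory for faster GPU copies
    )

use_cuda = torch.cuda.is_available()
device = 'cuda' if use_cuda else 'cpu'

# Random stand-in data: 32 fake "images" with integer labels.
toy_dataset = TensorDataset(torch.randn(32, 3, 8, 8), torch.randint(0, 4, (32,)))
toy_loader = make_loader(toy_dataset, batch_size=8, num_workers=0, use_cuda=use_cuda)

for inputs, targets in toy_loader:
    # Assumption: non_blocking=True is an extra, not part of this notebook.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

print(len(toy_loader), inputs.shape)
```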

In [None]:
num_workers=6
batch_size = 128

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=num_workers, pin_memory=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, sampler=val_sampler, num_workers=num_workers, pin_memory=True)

## Model Training Loops

The `train_model` function here has a few improvements.

First, we move the `inputs` and `labels` to the GPU with `.to(device)`. Since we checked if the GPU was available earlier, this will do nothing if there isn't one.

We also added a conditional statement to unfreeze all the layers after a certain epoch. You might do this if you want to use a pretrained model, but want to squeeze out more improvement than is possible by training only the last layer. Since the model is largely trained, you can get noticeable improvement with a relatively small number of unfrozen iterations.

In [None]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        if epoch == 2:
            for child in model.children(): # Unfreeze layers
                for param in child.parameters():
                    param.requires_grad = True

        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.argmax(1))
            #if phase == 'train':
                #scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
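To see what freezing and unfreezing does, it can help to count the parameters that will receive gradients. The sketch below uses a tiny stand-in network rather than the ResNet from this notebook, and `count_trainable` is a hypothetical helper.

```python
from torch import nn

def count_trainable(model):
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in for a pretrained backbone plus a new output layer.
toy_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

# Freeze everything except the last layer.
for child in list(toy_model.children())[:-1]:
    for param in child.parameters():
        param.requires_grad = False
frozen_count = count_trainable(toy_model)

# Unfreeze every layer, as train_model does at the chosen epoch.
for param in toy_model.parameters():
    param.requires_grad = True
unfrozen_count = count_trainable(toy_model)

print(frozen_count, unfrozen_count)  # prints 12 67
```

Calling `count_trainable(model)` before and after the unfreezing epoch is a cheap way to confirm that the conditional in `train_model` actually fired.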

## Importing a model

We're using the same model as before, but are now using the `.to(device)` command to load it onto the GPU. PyTorch requires that the model, batch, and labels all be on the same device for training to work.

In [None]:
model = models.resnet50(pretrained=True).to(device)
model.fc = nn.Linear(model.fc.in_features, len(species_to_idx)).to(device) # Rescale output fully-connected layer size
num_frozen_layers = 9

# Freeze `num_frozen_layers`
layer = 0
for child in model.children():
    layer += 1
    if layer <= num_frozen_layers:
        for param in child.parameters():
            param.requires_grad = False
print('Number of unfrozen layers: ' + str(layer-num_frozen_layers))

## Time to Train!

Now we are again ready to train our model. Note how much faster it is even though we are training the exact same thing as before!

You may find that the first epoch is slower, since there are images that haven't been cached yet.
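If the slow first epoch is a problem, one option is to populate the cache ahead of time by touching every item once. This is a sketch of that idea, not something this notebook does: `warm_cache` is a hypothetical helper, and `ToyCachedDataset` is a stand-in that only records which indices were requested.

```python
def warm_cache(dataset):
    """Touch every item once so that any on-disk cache is populated."""
    for idx in range(len(dataset)):
        dataset[idx]

class ToyCachedDataset:
    """Stand-in that records which indices have been 'cached'."""
    def __init__(self, n):
        self.n = n
        self.cached = set()

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.cached.add(idx)   # a real dataset would write a file here
        return idx

toy = ToyCachedDataset(5)
warm_cache(toy)
print(sorted(toy.cached))
```

With the dataset from this notebook the call would be `warm_cache(train_dataset)`. It runs in a single process and also applies the transforms, so it moves the cost rather than removing it.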

In [None]:
dataloaders = {'train': train_loader, 'val': val_loader}
dataset_sizes = {'train': len(train_indices), 'val': len(val_indices)}  # exact counts from the split

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9, weight_decay=5e-3)

model = train_model(model, loss_fn, optimizer, scheduler=None, num_epochs=5)  # scheduler is currently unused inside train_model

In [None]:
def visualize_model(model, num_images=6):
    was_training = model.training
    model.eval()
    images_so_far = 0


    with torch.no_grad():
        for i, (images, labels) in enumerate(dataloaders['val']):
            images = images.to(device)  # also works on a CPU-only machine
            outputs = model(images)
            _, preds = torch.max(outputs, 1)
            
            figure = plt.figure(figsize=(16, 16))
            for j in range(num_images):
                images_so_far += 1
                figure.add_subplot(num_images//2, 2, images_so_far)
                figure.tight_layout()
                plt.axis('off')
                image = images.cpu().data[j]
                mean = np.array([0.485, 0.456, 0.406])
                std = np.array([0.229, 0.224, 0.225])
                image = std * image.permute(1, 2, 0).numpy() + mean
                image = np.clip(image, 0, 1)
                plt.title('predicted: {}'.format(idx_to_species[preds[j].item()]))
                plt.imshow(image)
                
                if images_so_far == num_images:
                    plt.show()  # display the figure before returning early
                    model.train(mode=was_training)
                    return
            plt.show()
                
        model.train(mode=was_training)


visualize_model(model, num_images=6)

## System Utilization

While a model is training, you may want to confirm that it's making full use of your hardware. If it isn't, that's a sign that you can do things like train larger batches, train a larger model, or do more things in parallel.

`nvidia-smi` shows information about the GPUs in your system and their memory/processor usage.

`top` is a commonly used Linux command showing the most active processes.

In [None]:
!nvidia-smi

In [None]:
!top -n 1

You may occasionally need to run the following command to clear cached data from the GPU between model trainings. This does not affect our cache of pretransformed images, since those were saved to disk.

If you see high memory usage in the `nvidia-smi` output above, you can try running it and then running `nvidia-smi` again to see the difference.

In [None]:
torch.cuda.empty_cache()
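The effect can also be inspected from Python. The sketch below reports the memory held by live tensors and the memory held by PyTorch's caching allocator, before and after clearing the cache. `gpu_memory_mb` is a hypothetical helper, and it returns zeros on a machine without a GPU.

```python
import torch

def gpu_memory_mb():
    """Return allocated and reserved GPU memory in MiB (zeros without a GPU)."""
    if not torch.cuda.is_available():
        return {'allocated': 0.0, 'reserved': 0.0}
    return {
        'allocated': torch.cuda.memory_allocated() / 2**20,  # held by live tensors
        'reserved': torch.cuda.memory_reserved() / 2**20,    # held by the caching allocator
    }

before = gpu_memory_mb()
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # releases cached blocks that no tensor is using
after = gpu_memory_mb()
print(before, after)
```

Only the reserved figure is expected to drop. Memory that is still allocated belongs to tensors that are in use, such as the model weights, and is not freed by `empty_cache()`.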