# Exercise multimodal recognition: RGB-D scene recognition

This exercise consists of three parts: two tutorials and the deliverable. You must modify the code in the tutorial parts, then report and discuss the results in the deliverable part, which is the part used to evaluate the exercise.

If you are not familiar with Jupyter notebooks, please check __[this tutorial](https://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html)__ first.

# Part 2 (tutorial): RGB-D baseline

If you haven't followed the single-modality tutorial yet, please run **single.ipynb** first.

In this tutorial, you will build a two-branch RGB-D network using PyTorch. The code is loosely based on the __[PyTorch transfer learning tutorial](http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)__. Just execute the code sequentially, paying attention to the comments.

In [None]:
%matplotlib inline

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.autograd import Variable
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
import itertools

import RGBDutils

plt.ion()   # interactive mode

Load Data
---------

We will use torchvision, torch.utils.data and RGBDutils packages for loading the
data. The dataset is structured hierarchically in splits\modalities\classes (check the folder).
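
As a rough sketch of the hierarchy described above, the snippet below walks a splits/modalities/classes tree and counts the files in each class folder. The modality folder names (`rgb`, `hha`), the class name `bedroom`, the file name, and the helper `summarize_dataset` are assumptions made for illustration only; check the actual folder for the real names.

```python
import os
import tempfile


def summarize_dataset(root):
    """Return {split: {modality: {class_name: n_files}}} for an
    assumed root/split/modality/class/file layout."""
    summary = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        summary[split] = {}
        for modality in sorted(os.listdir(split_dir)):
            modality_dir = os.path.join(split_dir, modality)
            if not os.path.isdir(modality_dir):
                continue
            summary[split][modality] = {
                cls: len(os.listdir(os.path.join(modality_dir, cls)))
                for cls in sorted(os.listdir(modality_dir))
                if os.path.isdir(os.path.join(modality_dir, cls))
            }
    return summary


# Build a tiny fake tree to demonstrate (hypothetical names, not the real dataset)
demo_root = tempfile.mkdtemp()
for split in ['train', 'val']:
    for modality in ['rgb', 'hha']:
        class_dir = os.path.join(demo_root, split, modality, 'bedroom')
        os.makedirs(class_dir)
        open(os.path.join(class_dir, 'img_0001.png'), 'w').close()

demo_summary = summarize_dataset(demo_root)
print(demo_summary)
```

Calling the same helper on the dataset path gives a quick check that every split contains the same classes in both modalities.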

In [None]:
# Data augmentation and normalization for training
RGB_AVG = [0.485, 0.456, 0.406] # Default ImageNet ILSRVC2012
RGB_STD = [0.229, 0.224, 0.225] # Default ImageNet ILSRVC2012
DEPTH_AVG = [0.485, 0.456, 0.406] # Default ImageNet ILSRVC2012
DEPTH_STD = [0.229, 0.224, 0.225] # Default ImageNet ILSRVC2012
data_transforms = {
    'train': RGBDutils.Compose([
        RGBDutils.RandomResizedCrop(227),
        RGBDutils.RandomHorizontalFlip(),
        RGBDutils.ToTensor(),
        RGBDutils.Normalize(RGB_AVG, RGB_STD, DEPTH_AVG, DEPTH_STD)
    ]),
    'val': RGBDutils.Compose([
        RGBDutils.Resize(256),
        RGBDutils.CenterCrop(227),
        RGBDutils.ToTensor(),
        RGBDutils.Normalize(RGB_AVG, RGB_STD, DEPTH_AVG, DEPTH_STD)
    ]),
    'test': RGBDutils.Compose([
        RGBDutils.Resize(256),
        RGBDutils.CenterCrop(227),
        RGBDutils.ToTensor(),
        RGBDutils.Normalize(RGB_AVG, RGB_STD, DEPTH_AVG, DEPTH_STD)
    ]),
}

# Path to the dataset
# data_dir = '/home/mcv/datasets/sunrgbd_lite'
data_dir = 'sunrgbd_lite'

# Preparing dataset and dataloaders
partitions = ['train', 'val', 'test']
image_datasets = {x: RGBDutils.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in partitions}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=64,
                                             shuffle=True, num_workers=4)
              for x in partitions}
dataset_sizes = {x: len(image_datasets[x]) for x in partitions}
class_names = image_datasets['train'].classes

use_gpu = torch.cuda.is_available()

In [None]:
image_datasets

**Visualize a few samples**

Let's visualize a few RGB-D pairs to get a feel for the RGB-D data and the data augmentations.



In [None]:
# Get a batch of training data and visualize the first four pairs
inputsRGB, inputsDepth, classes = next(iter(dataloaders['train']))
inputsRGB, inputsDepth, classes = inputsRGB[0:4], inputsDepth[0:4], classes[0:4]

# Make a grid from batch
outRGB = torchvision.utils.make_grid(inputsRGB)
outDepth = torchvision.utils.make_grid(inputsDepth)

RGBDutils.imshow(outRGB, outDepth, title=[class_names[x] for x in classes],concat_vert=True)

Training the model
------------------

Now, let's write a general function to train a model. Details:

-  Uses the Adam algorithm for gradient descent.
-  Early stopping using the best validation accuracy.
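
The second point can be sketched in isolation. In this notebook, training always runs for the full number of epochs, and "early stopping" means keeping a copy of the weights from the epoch with the best validation accuracy. The helper `keep_best` and the toy accuracies and states below are hypothetical and only illustrate the selection rule.

```python
import copy


def keep_best(val_accuracies, states):
    """Return (best_epoch, best_acc, best_state) for the highest validation accuracy."""
    best_epoch, best_acc, best_state = -1, 0.0, None
    for epoch, (acc, state) in enumerate(zip(val_accuracies, states)):
        if acc > best_acc:  # strict '>' keeps the earliest epoch on ties
            best_epoch, best_acc = epoch, acc
            # deep copy so later updates to the live weights do not alter the snapshot
            best_state = copy.deepcopy(state)
    return best_epoch, best_acc, best_state


# Toy values for illustration only (not measured results)
toy_accs = [0.30, 0.45, 0.42, 0.45]
toy_states = [{'w': epoch} for epoch in range(len(toy_accs))]
best_epoch, best_acc, best_state = keep_best(toy_accs, toy_states)
print(best_epoch, best_acc, best_state)
```

The deep copy matters because `model.state_dict()` returns references to the live parameter tensors, which keep changing as training continues.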

In [None]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            print('Phase %s' % phase)
            if phase == 'train':
                if scheduler is not None:
                    scheduler.step()
                model.train(True)  # Set model to training mode
            else:
                model.train(False)  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0.0

            # Iterate over data.
            for data in dataloaders[phase]:
                # get the inputs
                inputs_rgb, inputs_hha, labels = data
                # wrap them in Variable
                if use_gpu:
                    inputs_rgb = Variable(inputs_rgb.cuda())
                    inputs_hha = Variable(inputs_hha.cuda())
                    labels = Variable(labels.cuda())
                else:
                    inputs_rgb, inputs_hha, labels = Variable(inputs_rgb), Variable(inputs_hha), Variable(labels)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                outputs = model((inputs_rgb, inputs_hha))
                _, preds = torch.max(outputs.data, 1)

                loss = criterion(outputs, labels)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

                # statistics
                # running_loss += loss.data[0] * inputs_rgb.size(0) # Pytorch 0.4
                running_loss += loss.data.item() * inputs_rgb.size(0) # Pytorch 1
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

And now, a function to evaluate the model on a particular set.

In [None]:
def evaluate_model(model, partition, criterion):
    since = time.time()

    model.train(False)  # Set model to evaluate mode

    running_loss = 0.0
    running_corrects = 0.0

    # Iterate over data.
    for data in dataloaders[partition]:
        # get the inputs
        inputs_rgb, inputs_hha, labels = data
        # wrap them in Variable
        if use_gpu:
            inputs_rgb = Variable(inputs_rgb.cuda())
            inputs_hha = Variable(inputs_hha.cuda())
            labels = Variable(labels.cuda())
        else:
            inputs_rgb, inputs_hha, labels = Variable(inputs_rgb), Variable(inputs_hha), Variable(labels)

        # forward
        outputs = model((inputs_rgb, inputs_hha))
        _, preds = torch.max(outputs.data, 1)
        loss = criterion(outputs, labels)

        # statistics
        # running_loss += loss.data[0] * inputs_rgb.size(0) # Pytorch 0.4
        running_loss += loss.data.item() * inputs_rgb.size(0) # Pytorch 1
        running_corrects += torch.sum(preds == labels.data)

    test_loss = running_loss / dataset_sizes[partition]
    test_acc = running_corrects / dataset_sizes[partition]

    
    print()

    time_elapsed = time.time() - since
    print('Tested in {:.0f}m {:.0f}s Loss: {:.4f} Acc: {:.4f}'.format(
        time_elapsed // 60, time_elapsed % 60, test_loss, test_acc))

    return test_acc, test_loss

Building the RGB-D model
----------------------

The architecture of the network is shown in the following figure:
<img src="figures/rgbd_network.png" />

The following code creates the RGB-D network by instantiating two AlexNets whose outputs are concatenated just before the classifier. There are some tricky steps due to the way the pretrained AlexNet is implemented in PyTorch.
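
One of the tricky steps is knowing the tensor sizes at each stage. The following sketch works through the arithmetic, assuming the kernel, stride and padding values of torchvision's AlexNet `features` block, 256 output channels in conv5, and 4096 units in fc7. The helper names `conv_out` and `alexnet_feature_size` are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the conv/pool layers in AlexNet `features`
# (assumed to match torchvision's implementation)
ALEXNET_FEATURES = [
    (11, 4, 2),  # conv1
    (3, 2, 0),   # max pool
    (5, 1, 2),   # conv2
    (3, 2, 0),   # max pool
    (3, 1, 1),   # conv3
    (3, 1, 1),   # conv4
    (3, 1, 1),   # conv5
    (3, 2, 0),   # max pool
]


def alexnet_feature_size(input_size):
    """Side length of the feature map after the last pooling layer."""
    size = input_size
    for kernel, stride, padding in ALEXNET_FEATURES:
        size = conv_out(size, kernel, stride, padding)
    return size


side = alexnet_feature_size(227)       # 227x227 crops, as in the transforms
flat_per_branch = 256 * side * side    # input size of fc6 in each branch
fc7_per_branch = 4096                  # output size of fc7 in each branch
classifier_inputs = 2 * fc7_per_branch # RGB and HHA features concatenated
print(side, flat_per_branch, classifier_inputs)  # 6 9216 8192
```

This is where the `256 * 6 * 6` reshape and the `num_ftrs_rgb + num_ftrs_hha` classifier input size come from.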


In [None]:
# In PyTorch every network is implemented as a nn.Module
class RGBDnet(nn.Module):
    # The parameters are initialized in __init__(self, ...)
    def __init__(self, num_classes):
        super(RGBDnet, self).__init__()
        
        # RGB branch
        model_rgb = torchvision.models.alexnet(pretrained=True)
        self.rgb_convs = model_rgb.features
        c = model_rgb.classifier
        self.rgb_fcs = nn.Sequential(c[0],c[1],c[2],c[3],c[4],c[5])
        num_ftrs_rgb = c[4].out_features

        # HHA branch
        model_hha = torchvision.models.alexnet(pretrained=True)
        self.hha_convs = model_hha.features
        c = model_hha.classifier
        self.hha_fcs = nn.Sequential(c[0],c[1],c[2],c[3],c[4],c[5])
        num_ftrs_hha = c[4].out_features

        # Classifier
        self.classifier = nn.Linear(num_ftrs_rgb+num_ftrs_hha, num_classes)

    # The data flow is defined in forward. No need to specify backward operations (PyTorch takes care of them)
    def forward(self, x):
        x_rgb = self.rgb_convs(x[0])
        x_rgb = x_rgb.view(x_rgb.size(0), 256 * 6 * 6)
        x_hha = self.hha_convs(x[1])
        x_hha = x_hha.view(x_hha.size(0), 256 * 6 * 6)
        x_rgb = self.rgb_fcs(x_rgb)
        x_hha = self.hha_fcs(x_hha)
        x = torch.cat((x_rgb, x_hha), 1)
        x = self.classifier(x)
        return x

In [None]:
# Instantiate the model
num_classes = len(class_names)
model = RGBDnet(num_classes=num_classes)

# You can visualize the resulting network
print(model)

Set up the training/fine tuning parameters
----------------------

The following code creates the optimization criterion and sets per-layer learning rates to better control the fine-tuning and training process. We use a very simple setup in which all layers are frozen (learning rate 0) except the last fully connected one, i.e. the classifier, so it should be easy to improve the performance.
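
To see what a per-layer learning rate does, here is a minimal sketch on a tiny stand-in model. The `backbone` and `head` layers, the random toy data and the learning rate values are assumptions for illustration, not the RGB-D network itself. A parameter group with `lr` set to 0 is left untouched by the optimizer step, while the group with a positive `lr` is updated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model (hypothetical): a "backbone" layer and a "head" layer
backbone = nn.Linear(4, 3)
head = nn.Linear(3, 2)

# One parameter group per layer, each with its own learning rate
demo_optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 0.0},   # effectively frozen
    {'params': head.parameters(), 'lr': 0.01},      # trained
])

# Snapshots of the parameters before the update
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

# One optimization step on random toy data
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(head(torch.relu(backbone(x))), y)
demo_optimizer.zero_grad()
loss.backward()
demo_optimizer.step()

backbone_unchanged = all(torch.equal(p, q)
                         for p, q in zip(backbone.parameters(), backbone_before))
head_changed = any(not torch.equal(p, q)
                   for p, q in zip(head.parameters(), head_before))
print(backbone_unchanged, head_changed)
```

A learning rate of 0 still computes gradients for the frozen layers. Setting `requires_grad = False` on those parameters additionally skips that computation.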

In [None]:
# Here we define the learning rate
for param in model.parameters(): # Freeze all parameters by default
    param.requires_grad = False

if use_gpu:
    model = model.cuda()

criterion = nn.CrossEntropyLoss()

learning_rate = 0.001
    
perlayer_optim = [
    {'params': model.rgb_convs[0].parameters(), 'lr': 0.00}, # conv1 RGB
    {'params': model.rgb_convs[3].parameters(), 'lr': 0.00}, # conv2 RGB
    {'params': model.rgb_convs[6].parameters(), 'lr': 0.00}, # conv3 RGB
    {'params': model.rgb_convs[8].parameters(), 'lr': 0.00}, # conv4 RGB
    {'params': model.rgb_convs[10].parameters(), 'lr': 0.00}, # conv5 RGB
    {'params': model.rgb_fcs[1].parameters(), 'lr': 0.00}, # fc6 RGB
    {'params': model.rgb_fcs[4].parameters(), 'lr': 0.00}, # fc7 RGB
    {'params': model.hha_convs[0].parameters(), 'lr': 0.00}, # conv1 HHA
    {'params': model.hha_convs[3].parameters(), 'lr': 0.00}, # conv2 HHA
    {'params': model.hha_convs[6].parameters(), 'lr': 0.00}, # conv3 HHA
    {'params': model.hha_convs[8].parameters(), 'lr': 0.00}, # conv4 HHA
    {'params': model.hha_convs[10].parameters(), 'lr': 0.00}, # conv5 HHA
    {'params': model.hha_fcs[1].parameters(), 'lr': 0.00}, # fc6 HHA
    {'params': model.hha_fcs[4].parameters(), 'lr': 0.00}, # fc7 HHA
    {'params': model.classifier.parameters(), 'lr': 0.001} # fc8
]
for param in itertools.chain(model.rgb_convs[0].parameters(),model.rgb_convs[3].parameters(),
                             model.rgb_convs[6].parameters(),model.rgb_convs[8].parameters(),
                             model.rgb_convs[10].parameters(),model.rgb_fcs[1].parameters(),
                             model.rgb_fcs[4].parameters(),
                             model.hha_convs[0].parameters(),model.hha_convs[3].parameters(),
                             model.hha_convs[6].parameters(),model.hha_convs[8].parameters(),
                             model.hha_convs[10].parameters(),model.hha_fcs[1].parameters(),
                             model.hha_fcs[4].parameters(),
                             model.classifier.parameters()):
    param.requires_grad = True
    
    
optimizer = torch.optim.Adam(perlayer_optim, lr=learning_rate)

Train and evaluate the model
-----------------

It shouldn't take more than 2 minutes to train on the server's GPU.

In [None]:
# Train
model = train_model(model, criterion, optimizer, None, num_epochs=25)
    
# Evaluate
train_acc, _ = evaluate_model(model, 'train', criterion)
val_acc, _ = evaluate_model(model, 'val', criterion)
test_acc, _ = evaluate_model(model, 'test', criterion)
print('Accuracy. Train: %1.2f%% val: %1.2f%% test: %1.2f%%' % 
      (train_acc*100, val_acc*100, test_acc*100))

# Part 2 (deliverable)

This part will be evaluated as the deliverable. Please check that you include the required results and information. In principle I don't intend to run your code, just check your numbers and descriptions.

* Comparison of RGB, HHA and RGB-D baselines. Include a table with the train, validation and test average accuracies (and standard deviations) over 5 runs for each case (RGB only, HHA only and RGB-D).
* Description of the improvements of the RGB-D network, experimental results and discussion (0.25 points)
* Team work: description of the contribution of each member of the team.

The maximum of the exercise is 0.5 points.
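
For the table of averages and standard deviations, one possible way to aggregate the accuracies of several runs is sketched below. The helpers `summarize_runs` and `format_row` and the toy numbers are hypothetical; replace the numbers with your own measurements. Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1).

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_row(name, accuracies):
    """Format one markdown table row as '| name | mean +/- std |'."""
    mean, std = summarize_runs(accuracies)
    return '| {} | {:.2f} +/- {:.2f} |'.format(name, mean, std)


# Toy numbers for illustration only (NOT real accuracies)
toy_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_mean, toy_std = summarize_runs(toy_runs)
toy_row = format_row('toy', toy_runs)
print(toy_row)  # | toy | 3.00 +/- 1.58 |
```

Whichever estimator you use, state it in the deliverable so that the reported deviations can be compared across teams.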
