## Demo 6 - movie poster classification

Multi-label classification of movie poster images by genre. "Multi-label" means that each poster can have more than one "true" label.

In [1]:
import os
import sys
from skimage import io
from torch import nn
import torch
import numpy as np
import pandas as pd

### Load the data

We use a PyTorch `Dataset` to represent the data, which means we must implement `__init__`, `__len__` and `__getitem__`.  For efficiency's sake, we ideally want to load the image data (movie posters and their genre classifications) and represent them as a `Tensor` in memory.

The movie posters had to be converted to images of the same size and with the same colour channels.  The resizing can be done inside Python but is slow, so the posters were resized on disk using command-line tools. Converting the colour channels is much cheaper than resizing, so that step was done in Python.

Note that we need to permute the dimensions of the `Tensor` we create from `skimage` NumPy arrays. The latter represent the colour channels (three, for red-green-blue) as the innermost dimension (the fourth dimension, or dimension 3, meaning that each pixel is represented as an array of three colour values).  The convolutional layers in PyTorch require the colour channel to be the second dimension (after the batch dimension, meaning that the image is represented as three overlapping images, one for each colour, where each pixel is a single value).  You need to use `Tensor.permute`, not `Tensor.view`: the latter just redraws the "boxes" inside the array over the same underlying order of elements; it doesn't rearrange the data for a different order of dimensions.
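
The difference between `permute` and `view` can be checked on a tiny tensor. This is a minimal sketch with made-up data (a batch of two 2x2 "images"), not part of the original notebook: both calls produce the same shape, but only `permute` keeps each colour plane intact.

```python
import torch

# Hypothetical batch: 2 images, 2x2 pixels, channels last (N, H, W, C).
nhwc = torch.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3)

permuted = nhwc.permute(0, 3, 1, 2)  # (N, C, H, W): dimensions are reordered
viewed = nhwc.view(2, 3, 2, 2)       # same shape, but element order is unchanged

# The red plane (channel 0) of the first image, read from the original layout.
red_plane = nhwc[0, :, :, 0]

permute_ok = torch.equal(permuted[0, 0], red_plane)
view_ok = torch.equal(viewed[0, 0], red_plane)
print(permuted.shape, viewed.shape, permute_ok, view_ok)
```

With `view`, the first "channel" of the first image is simply the first four numbers in memory, which mixes red, green and blue values from neighbouring pixels.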

In [21]:
from torch.utils.data import Dataset
from skimage import transform
from skimage import color

class MoviePosterDataset(Dataset):
    def __init__(self, csvfile, imagedir, device="cpu"):
        self.posterlist = pd.read_csv(csvfile)
        self.imagedir = imagedir
        self.device = torch.device(device)
        
        # Load every poster into memory; filenames come from the "Id" column.
        imageids = list(self.posterlist["Id"])
        imagefiles = ["{}/{}.jpg".format(self.imagedir, x) for x in imageids]
        images = [np.array(io.imread(x)) for x in imagefiles]
        # Greyscale posters have no channel axis, so expand them to RGB.
        images = np.array([color.gray2rgb(x) if len(x.shape) < 3 else x for x in images])
        
        # Every column from the third onwards is used as a genre indicator.
        truths = self.posterlist[self.posterlist.columns[2:]]
        self.truths = torch.Tensor(truths.to_numpy())
    
        # skimage gives (N, H, W, C); PyTorch convolutions expect (N, C, H, W).
        tns = torch.from_numpy(images)
        self.images = tns.permute(0, 3, 1, 2)
        
        if device != "cpu":
            self.images = self.images.to(self.device)
            self.truths = self.truths.to(self.device)
        
    def __len__(self):
        return len(self.posterlist)
    
    def __getitem__(self, idx):            
        truths = self.truths[idx]
        images = self.images[idx]
        
        return images, truths

We keep the `Tensor` in CPU memory because we are restricting ourselves to a single GPU, which only has 10GB of memory to itself.  We will move the data to the GPU in batches instead.

In [3]:
mpd = MoviePosterDataset("Multi_Label_dataset/train.csv", 
                         "Multi_Label_dataset/ImageSmaller", device="cpu")

### Define the model

Defining a model with convolutional layers for images is technically a lot easier than defining an RNN-based model for human language.  The `Conv2d` layer automatically moves a 5x5 filter across the entire image, with no effort required to manage padding, packing, unpacking and other sequence issues. We do have to flatten the output of the `MaxPool2d` layer to feed it to the subsequent `Linear` layers.  The output of the model comes from a `Sigmoid` over the number of classes, so that we have an independent binary classification for each of the 25 movie genre labels.
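
The flattened size `3*90*60` used in the model is hard-coded, so it only fits one input size. As a sanity check, the sketch below pushes a dummy batch through the same convolution and pooling settings. The 450x300 poster size is an assumption inferred from the `3*90*60` figure; the notebook does not state the image dimensions.

```python
import torch
from torch import nn

# Same settings as the first two layers of the classifier.
conv2d = nn.Conv2d(3, 3, 5, padding=2)  # padding=2 keeps height and width unchanged
maxpool = nn.MaxPool2d(5, padding=2)    # stride defaults to the kernel size (5)

# Assumed input: one RGB poster of 450x300 pixels (hypothetical size).
dummy = torch.zeros(1, 3, 450, 300)

with torch.no_grad():
    conv_out = conv2d(dummy)
    pooled = maxpool(conv_out)

# Flatten exactly as the forward pass does.
flat = pooled.view(-1, 3 * 90 * 60)
print(conv_out.shape, pooled.shape, flat.shape)
```

If the posters had a different size, the `view` call would either fail or silently merge several images into one row, so a check like this is worth running before training.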

In [4]:
class PosterClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.conv2d = nn.Conv2d(3,3,5,padding=2)
        self.maxpool = nn.MaxPool2d(5,padding=2)
        self.relu = nn.ReLU()
        self.linear0 = nn.Linear(3*90*60, 3*90*60)
        self.dropout1 = nn.Dropout(dropout)
        self.tanh = nn.Tanh()
        self.linear1 = nn.Linear(3*90*60, 25)
        self.dropout2 = nn.Dropout(dropout)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        output = self.conv2d(x)
        output = self.maxpool(output)
        output = output.view(-1, 3*90*60)
        output = self.relu(output)
        output = self.linear0(output)
        output = self.dropout1(output)
        output = self.tanh(output)
        output = self.linear1(output)
        output = self.dropout2(output)
        output = self.sigmoid(output)
        
        return output

### Arrange the data for the model

We're going to do a 60/40 train/test split using PyTorch's own samplers and data loaders.

In [6]:
len(mpd)

7254

In [7]:
totalindices = list(range(len(mpd)))

In [8]:
import random
import math

random.shuffle(totalindices)
splitindex = math.floor(len(mpd)*0.6)

In [9]:
splitindex

4352

In [10]:
trainingindices = totalindices[:splitindex]
testingindices = totalindices[splitindex:]

In [11]:
trainingsampler = torch.utils.data.SubsetRandomSampler(trainingindices)
testingsampler = torch.utils.data.SubsetRandomSampler(testingindices)

In [12]:
traindl = torch.utils.data.DataLoader(mpd, batch_size=40, 
                                      sampler=trainingsampler, pin_memory=True)
testdl = torch.utils.data.DataLoader(mpd, sampler=testingsampler)
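
The split above changes on every run because `random.shuffle` is unseeded. As an alternative (not what this notebook does), a repeatable split could be sketched with `torch.utils.data.random_split` and a seeded generator. The toy `TensorDataset`, the seed and the batch size here are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 10 fake "images" with 25 fake labels each.
toy = TensorDataset(torch.zeros(10, 3, 4, 4), torch.zeros(10, 25))

# A seeded generator makes the shuffle, and hence the split, repeatable.
generator = torch.Generator().manual_seed(0)
n_train = int(len(toy) * 0.6)
train_set, test_set = random_split(
    toy, [n_train, len(toy) - n_train], generator=generator
)

# Each subset gets its own loader; no sampler is needed.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
test_loader = DataLoader(test_set, batch_size=4)
print(len(train_set), len(test_set))
```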

### Write and run the training loop

We use the training loop to send the data to the GPU, batch by batch.

In [15]:
import torch.optim as optim

def train(dataloader, epochs=3):
    torch.cuda.empty_cache()
    model = PosterClassifier()
    model = model.to("cuda:3")
    optimizer = optim.Adam(model.parameters())
    criterion = nn.BCELoss()
    for epoch in range(epochs):
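        sumloss = 0.0
        batches = 0
        for images, truth in dataloader:
            optimizer.zero_grad()
            # Only the current batch is moved to the GPU.
            output = model(images.float().to("cuda:3"))
            loss = criterion(output, truth.to("cuda:3"))
            # .item() extracts a Python float, so the running sum does not
            # keep every batch's computation graph alive.
            sumloss += loss.item()
            batches += 1
            loss.backward()
            optimizer.step()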

        print("In epoch {}, loss = {}".format(epoch, sumloss/batches))

    return model

In [16]:
trained = train(traindl, epochs=30)

In epoch 0, loss = 0.42183035612106323
In epoch 1, loss = 0.4964887201786041
In epoch 2, loss = 0.6141465902328491
In epoch 3, loss = 0.7202863693237305
In epoch 4, loss = 0.7396678924560547
In epoch 5, loss = 0.8101411461830139
In epoch 6, loss = 0.8151503205299377
In epoch 7, loss = 0.900789737701416
In epoch 8, loss = 0.9067399501800537
In epoch 9, loss = 0.9030411839485168
In epoch 10, loss = 0.8954600095748901
In epoch 11, loss = 0.8842679858207703
In epoch 12, loss = 0.8826567530632019
In epoch 13, loss = 0.8775250911712646
In epoch 14, loss = 0.8838796615600586
In epoch 15, loss = 0.948268473148346
In epoch 16, loss = 1.0035309791564941
In epoch 17, loss = 0.9940931797027588
In epoch 18, loss = 1.0120608806610107
In epoch 19, loss = 1.0030421018600464
In epoch 20, loss = 1.0258008241653442
In epoch 21, loss = 1.0112287998199463
In epoch 22, loss = 1.0018103122711182
In epoch 23, loss = 0.9910273551940918
In epoch 24, loss = 1.0099142789840698
In epoch 25, loss = 0.98990625143051

This is quite bad: the loss gets steadily worse, climbing from about 0.42 in epoch 0 to around 1.0. We need to make adjustments to the model.

### Write and run a testing routine

For memory purposes, we're still keeping the testing data in CPU memory, so the model is moved back to the CPU for testing.  We have to put the model into evaluation mode with `model.eval()`, which turns off dropout and other training-only behaviour, because we want the test to represent a deterministic result of the trained model. The test `DataLoader` was created without a `batch_size`, so it yields one item at a time, and the reported figure is the mean loss over individual test items.

In [19]:
def test(model, dataloader):
    model = model.to("cpu")
    model.eval()
    criterion = nn.BCELoss()
    sumloss = 0.0
    items = 0
    # No gradients are needed for evaluation, so skip building the graph.
    with torch.no_grad():
        for images, truth in dataloader:
            output = model(images.float())
            loss = criterion(output, truth)
            sumloss += loss.item()
            items += 1
    print("Loss on test data = {}".format(sumloss/items))

In [20]:
test(trained, testdl)

Loss on test data = 1.0326223373413086


Loss is not quite the right way to evaluate a model like this.  However, it is "encouraging" that the loss on the test data is not wholly out of line with the loss on the training data.
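
One common alternative for multi-label output is to threshold each sigmoid score and count hits and misses over all (poster, genre) pairs. The following is a minimal sketch, assuming a 0.5 threshold and micro-averaging, neither of which comes from this notebook; `micro_f1` is a hypothetical helper and the scores are made up.

```python
import torch

def micro_f1(probs, truths, threshold=0.5):
    """Micro-averaged precision, recall and F1 over all (item, label) pairs."""
    preds = (probs >= threshold).float()
    tp = (preds * truths).sum().item()        # predicted 1, truly 1
    fp = (preds * (1 - truths)).sum().item()  # predicted 1, truly 0
    fn = ((1 - preds) * truths).sum().item()  # predicted 0, truly 1
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Made-up scores for 2 posters and 3 genres, purely for illustration.
probs = torch.tensor([[0.9, 0.2, 0.6],
                      [0.1, 0.7, 0.4]])
truths = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]])

precision, recall, f1 = micro_f1(probs, truths)
print(precision, recall, f1)
```

In the test routine, `probs` would be the model output and `truths` the label tensor from the data loader. Because most genre labels are 0 for any given poster, a measure like this is harder to satisfy by predicting "no genre" everywhere than plain accuracy would be.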