# Encoder-Decoder Network for Image Captioning

In this notebook we will train the image captioning model, which we will later use for the scene thumbnails of our own dataset. In the following, the training of an Encoder-Decoder network with an LSTM-RNN is explained step by step.

In [1]:
!pip install pycocotools



In [2]:
!pip install nltk



In [1]:
import os
import sys
import torch.backends.cudnn as cudnn

from pycocotools.coco import COCO
import torch
import torchtext.vocab as vocab
from tqdm import tqdm, trange


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


Load GPU if available

In [2]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

SystemError: GPU device not found

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

if device.type == "cuda":
    print("device is GPU:", torch.cuda.get_device_name(0))

else:
    print("device is CPU")

cudnn.benchmark = True              # set to true only if inputs to model are fixed size; otherwise lot of computational overhead

device is CPU


### Configuration for Data-Processing, Models and Training

In [32]:
import os

class Arguments(object):
    
    def __init__(self):
        
        # Dataset Processing Parameters
        self.environment = 'local'            # Location of data
        self.vocab_cutoff = 15                 # Cutoff value for words to get loaded to vocabulary
        self.vocab = True                     # False --> Create vocabulary, True --> Load vocabulary
        self.do_vectorize = True              # True --> Vectorize training set
        self.cpi  = 5                         # CPI = number of annotations per image
        
        # Files
        self.dir_path = os.path.dirname(os.path.realpath('Image Captioning'))
        self.annotations_path = '/dataset/files/annotations/'
        self.data_path = '/Volumes/Samsung_X5/University/Semesters/Semester 6/Interactive Video Retrieval/'
        self.thumbnails = '/Volumes/Samsung_X5/University/Semesters/Semester 6/Interactive Video Retrieval/thumbnails/'
        self.vocab_path = self.dir_path + '/vocab/vocab_cut15.pkl'
        self.caps_token_path = self.dir_path + '/caps_token.pkl'
        self.caps_len_path = self.dir_path + '/caps_len.pkl'
        self.pretrained_model = '/Volumes/Samsung_X5/University/Semesters/Semester 6/Interactive Video Retrieval/Model/checkpoint_img_captioning_coco.pth.tar'
        
        # Training Parameters
        self.batch_size = 100
        self.seed = 42
        self.start_epoch = 0
        self.epochs = 100                   # number of epochs to train for (if early stopping is not triggered)
        self.epochs_since_improvement = 0   # keeps track of number of epochs since there's been an improvement in validation BLEU
        self.total_step =  600              # Number of steps per epoch (training)
        self.test_total_step = 200          # Number of steps in testing
        self.workers = 1                    # for data-loading;
        self.encoder_lr = 1e-4              # learning rate for encoder if fine-tuning
        self.decoder_lr = 0.00035           # learning rate for decoder
        self.grad_clip = 5.                 # clip gradients at an absolute value of
        self.alpha_c = 1.                   # regularization parameter for 'doubly stochastic attention', as in the paper
        self.best_bleu = 0.215371729        # BLEU-4 score right now
        self.print_freq = 100               # print training/validation stats every __ batches
        self.fine_tune_encoder = True       # fine-tune encoder?
        self.checkpoint = '/Volumes/Samsung_X5/University/Semesters/Semester 6/Interactive Video Retrieval/Model/checkpoint_img_captioning_coco.pth.tar' # path to checkpoint, None if none
        self.model_name = 'img_captioning_coco' # model name for saving a checkpoint
    
       
        # Model Parameters
        self.emb_dim = 300                  # dimension of word embeddings. Depends on pre-trained model!!!
        self.attention_dim = 512            # dimension of attention linear layers
        self.decoder_dim = 512              # dimension of decoder RNN
        self.dropout = 0.5
        

In [33]:
args = Arguments()

In [9]:
glove = vocab.GloVe(name='6B', dim=300)

100%|█████████▉| 399999/400000 [00:42<00:00, 9425.00it/s]
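
The loaded GloVe vectors have 300 dimensions, which is why `emb_dim` is set to 300 in the configuration. The following is a minimal sketch of how such vectors could initialise an embedding table for a caption vocabulary. It is an illustration under assumptions, not the code used later in this notebook: a plain dict with toy 4-dimensional vectors stands in for the GloVe lookup, and `build_embedding_matrix`, `pretrained` and `token_to_idx` are hypothetical names.

```python
import random

def build_embedding_matrix(token_to_idx, pretrained, emb_dim, seed=42):
    """Return one embedding row per vocabulary index.

    Tokens found in `pretrained` copy their vector; all others
    (e.g. [BOS], [EOS], [UNK]) get a small random initialisation.
    Assumes the indices in token_to_idx are contiguous and start at 0.
    """
    rng = random.Random(seed)
    matrix = [None] * len(token_to_idx)
    for token, idx in token_to_idx.items():
        vector = pretrained.get(token)
        if vector is None:
            vector = [rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
        if len(vector) != emb_dim:
            raise ValueError("emb_dim must match the pre-trained vectors")
        matrix[idx] = list(vector)
    return matrix

# Toy stand-in for the GloVe lookup (hypothetical 4-dim vectors)
pretrained = {"dog": [0.1, 0.2, 0.3, 0.4], "ball": [0.5, 0.6, 0.7, 0.8]}
token_to_idx = {"[UNK]": 0, "dog": 1, "ball": 2, "[BOS]": 3}

embedding_matrix = build_embedding_matrix(token_to_idx, pretrained, emb_dim=4)
print(len(embedding_matrix), len(embedding_matrix[0]))
```

The dimension check makes the dependency explicit: changing the pre-trained model means changing `emb_dim` as well.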


## 1. Preprocess the dataset
First of all, let's get familiar with the dataset. You can either download the dataset (about 14GB for training and 6GB for validation) or retrieve the images via their URLs, which are listed in the files instances_train2014.json and instances_val2014.json. 

#### DataPreprocess
The training set yields about 600,000 image entries (LoadCoco() creates one entry per instance annotation, so an image can appear more than once), each annotated with 5 captions. We use the class DataPreprocess to load all image names, which are used to construct the path to each image, as well as the corresponding annotation IDs. Once we have extracted them, we load all captions as plain text and process them. We a) construct a vocabulary of all words that appear more often than a certain cutoff value (vocab_cutoff, set to 15 in the configuration above) and b) tokenize all captions in the training set and test (= validation) set using the word index in the vocabulary. You can also load a pre-extracted vocabulary to save time! Before tokenizing, we add a start ([BOS]) and end ([EOS]) token to each caption. Even though vectorizing and the creation of the vocabulary are not implemented in this class, DataPreprocess instantiates the responsible classes and executes them. 
<br> <br>
**_Simply call it and pass the arguments object:_**
<br> data_object = DataPreprocess.ProcessData(args)
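
To make steps a) and b) concrete, here is a small self-contained sketch on three toy captions. It only illustrates the idea and is not the notebook's implementation: the captions, the cutoff of 1 and the helper `tokenize` are made up for this example, and the trailing period is removed with `rstrip` as a simplification.

```python
from collections import Counter

captions = [
    ["A dog runs on the grass.", "A dog plays on the grass."],
    ["A cat sits on the sofa."],
]
cutoff = 1  # keep words that occur more often than this value

# a) Count every word and keep only the frequent ones
counts = Counter(
    word
    for image_caps in captions
    for cap in image_caps
    for word in cap.lower().rstrip(".").split()
)
token_to_idx = {"[UNK]": 0}
for word, count in counts.items():
    if count > cutoff:
        token_to_idx[word] = len(token_to_idx)
for special in ("[BOS]", "[EOS]", "[PAD]"):
    token_to_idx[special] = len(token_to_idx)

# b) Tokenize a caption: [BOS] + word indices + [EOS]; rare words map to [UNK]
def tokenize(caption):
    words = caption.lower().rstrip(".").split()
    ids = [token_to_idx.get(w, token_to_idx["[UNK]"]) for w in words]
    return [token_to_idx["[BOS]"]] + ids + [token_to_idx["[EOS]"]]

tokens = tokenize("A cat runs on the grass.")
print(token_to_idx)
print(tokens)
```

Words such as "cat" and "runs" occur only once, so they fall below the cutoff and are replaced by the index of [UNK].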

In [21]:
import os
import sys
import pickle

import pandas as pd
from pycocotools.coco import COCO


class DataPreprocess(object):

    
    def __init__(self, train_img_names, train_cap_tokens, train_caps, train_caps_len,
                 test_img_names, test_cap_tokens, test_caps, test_caps_len, 
                 train_indices, test_indices,
                 vectorizer, vocab):
        
        # Training and test data
        self.train_img_names = train_img_names
        self.train_cap_tokens = train_cap_tokens
        self.train_caps = train_caps
        self.train_caps_len = train_caps_len
        
        self.test_img_names = test_img_names
        self.test_cap_tokens = test_cap_tokens
        self.test_caps = test_caps
        self.test_caps_len = test_caps_len
        
        # Indices
        self.train_indices = train_indices
        self.test_indices = test_indices
        
        # Utils
        self.vectorizer = vectorizer
        self.vocab = vocab
        
        
    @classmethod
    def ProcessData(cls, args):
        '''
        This is the main method of the DataPreprocess class. Call it to instantiate the class and
        to preprocess the dataset. It comprises:
        - loading the COCO IDs to extract the image names and corresponding annotations with LoadCoco()
        - loading the corresponding captions and the image names into lists with LoadFiles()
        - instantiating the vectorizer and vocabulary classes with ProcessCaptions()
        - vectorizing/tokenizing the annotations of the training and test sets
        - building the index frames of the training and test sets with GetIndices()
        
        Return: a DataPreprocess instance holding train_img_names, train_cap_tokens, train_caps,
                train_caps_len, test_img_names, test_cap_tokens, test_caps, test_caps_len,
                train_indices, test_indices, vectorizer and vocab
        '''
        
        train_dataType = 'train2014'
        test_dataType = 'val2014'
            
        # Load the coco-instances that include the img_ids, the coco_instance and the coco_captions
        train_img_ids, train_coco_inst, train_coco_caps = DataPreprocess.LoadCoco(train_dataType, args)
        test_img_ids, test_coco_inst, test_coco_caps = DataPreprocess.LoadCoco(test_dataType, args)
        
        # Load coco_ids and file_names of every image in the dataset
        train_img_coco_ids, train_img_names, train_cap_coco_ids, train_caps = DataPreprocess.LoadFiles(train_img_ids, train_coco_inst, train_coco_caps)
        test_img_coco_ids, test_img_names, test_cap_coco_ids, test_caps = DataPreprocess.LoadFiles(test_img_ids, test_coco_inst, test_coco_caps)
        
        # Process the captions, i.e. load the vocab and tokenize each word (only training data)
        vectorizer, vocab, train_cap_tokens, train_caps_len = DataPreprocess.ProcessCaptions(train_caps, args)
        
        # Vectorize test set
        test_cap_tokens, test_caps_len = vectorizer.vectorize(test_caps)
        
        # Get indices for training and test set
        train_indices = DataPreprocess.GetIndices(train_img_names, train_cap_tokens, train_caps, args)
        test_indices = DataPreprocess.GetIndices(test_img_names, test_cap_tokens, test_caps, args)
        
       
        return cls(train_img_names, train_cap_tokens, train_caps, train_caps_len, test_img_names,
                   test_cap_tokens, test_caps, test_caps_len,
                   train_indices, test_indices, 
                   vectorizer, vocab)
        
    
    @staticmethod
    def LoadCoco(dataType, args):
        
        # initialize COCO API for instance annotations
        dataDir = args.annotations_path
        instances_annFile = os.path.join(dataDir, 'instances_{}.json'.format(dataType))
        coco_inst = COCO(instances_annFile)

        # initialize COCO API for caption annotations
        captions_annFile = os.path.join(dataDir, 'captions_{}.json'.format(dataType))
        coco_caps = COCO(captions_annFile)

        # get image ids 
        img_ids = list(coco_inst.anns.keys())
              
        return img_ids, coco_inst, coco_caps
        
    
    # So far, the coco_files contain a lot of unnecessary information. Extract only ids and file_names
    @staticmethod
    def LoadFiles(img_ids, coco_inst, coco_caps):
        
        # To store image_names and their ids in the coco files
        img_coco_ids = []
        img_names = []
        
        # Load captions (caps) and their ids in the coco files
        caps_coco_ids = []
        caps = []

        for i in img_ids:
            img_id = coco_inst.anns[i]['image_id']
            img_name = coco_inst.loadImgs(img_id)[0]
    
            img_coco_ids.append(img_name['id'])
            img_names.append(img_name['file_name'])
            
    
        for image in img_coco_ids:
            annIds = coco_caps.getAnnIds(image)
            caps_coco_ids.append(annIds)

            anns = coco_caps.loadAnns(annIds)
            annotations = []

            for annotation in anns:
                annotations.append(annotation['caption'])

            caps.append(annotations)
            
        return img_coco_ids, img_names, caps_coco_ids, caps
        
        
    @staticmethod
    def ProcessCaptions(caps, args):
        
        vectorizer = Vectorizer.CreateVectorizer(caps, args)
        
        if args.do_vectorize:
            
            # Tokenize captions and add start/end token to caption
            caps_token, caps_len = vectorizer.vectorize(caps)
            
            # Save tokenized captions and their lengths. Note that these file names differ from
            # args.caps_token_path and args.caps_len_path, which are read when do_vectorize is False.
            path_token = os.path.join(args.dir_path, 'train_caps_token.pkl')
            path_length = os.path.join(args.dir_path, 'train_caps_len.pkl')
        
            with open(path_token, 'wb') as f:
                pickle.dump(caps_token, f)
                
            with open(path_length, 'wb') as f:
                pickle.dump(caps_len, f)
                
        else:
            
            # Load pre-tokenized captions and their lengths
            with open(args.caps_token_path, 'rb') as file:
                caps_token = pickle.load(file)
            
            with open(args.caps_len_path, 'rb') as file:
                caps_len = pickle.load(file)
            
            
        return vectorizer, vectorizer.vocab, caps_token, caps_len
            
            
    @staticmethod
    def GetIndices(img_names, cap_tokens, caps, args):
        
        print("Start getting indices...")
        
        data = []  # list to store [[cap_idx, cap, img_name, ref_caps],...]
        counter = 0  

        # Transfer img_names into the dimensionality of cap_tokens (one row per caption)
        for i, img_name in enumerate(img_names):

            reference_token = []

            # Get all captions for image and save their ids
            for j in range(len(caps[i])):
                reference_indices = counter + j
                reference_token.append(reference_indices)

            # Add each token id, the corresponding image and the corresponding captions
            for k in range(len(caps[i])):
                cap_index = counter + k
                data.append([cap_index, cap_tokens[cap_index], img_name, reference_token])

            # Increase counter by the length of the captions per image
            counter += len(reference_token)

        # Save all indices into a pd.DataFrame
        indices = pd.DataFrame(data, columns = ['Caption Index', 'Tokenized Caption', 'Image Name', 'Reference Captions'])

        return indices
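
GetIndices() maps every position in the flat caption list back to its image and to the positions of all captions of that image. The following sketch reproduces this bookkeeping on toy data. It is a simplified illustration: `build_index_rows` is a hypothetical helper that returns plain tuples instead of a pandas DataFrame and leaves out the tokenized caption.

```python
def build_index_rows(img_names, caps_per_image):
    """Return (caption_index, image_name, reference_indices) for every caption."""
    rows = []
    counter = 0  # position of the current image's first caption in the flat list
    for img_name, n_caps in zip(img_names, caps_per_image):
        references = list(range(counter, counter + n_caps))
        for cap_index in references:
            rows.append((cap_index, img_name, references))
        counter += n_caps
    return rows

# Two images with 2 and 3 captions respectively
rows = build_index_rows(["img_a.jpg", "img_b.jpg"], [2, 3])
for row in rows:
    print(row)
```

Each caption of the second image shares the reference list [2, 3, 4], which is what an evaluation against all reference captions of an image needs.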
        

#### Vectorizer Class
The _Vectorizer Class_ loads all captions as plain text, iterates over them to extract every word, keeps only the words that occur more often than a cutoff threshold (set in the arguments class), and builds the vocabulary. Once the vocabulary is built, the _Vectorizer Class_ is used to tokenize the words in each annotation/caption using their index in the vocabulary.
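
Vocabulary construction and tokenization should see exactly the same words, otherwise a word that is in the vocabulary can still end up as [UNK]. One way to guarantee this is a single normalization function used by both steps. The sketch below is an assumption about how this could look, not the class's current code: `normalize_caption` is a hypothetical helper, and it removes everything in `string.punctuation`, which is a broader set than the characters replaced in the class.

```python
import string

# Translation table that deletes all ASCII punctuation characters
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_caption(caption):
    """Lower-case a caption, drop punctuation and split it into words."""
    return caption.lower().translate(_PUNCT_TABLE).split()

words = normalize_caption("A man's dog/puppy sits on the #1 sofa.")
print(words)
```

Calling the same function when counting words and when looking up tokens keeps both steps consistent by construction.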

In [22]:
import pandas as pd
import pickle
import os


class Vectorizer(object):
    
    def __init__(self, vocab, args):
        
        self.vocab = vocab
        self.args = args
        
    
    @classmethod
    def CreateVectorizer(cls, caps, args):
        """ Instantiates the vectorizer from the captions
        Arguments:
            caps = nested list with the captions of every image
            args = Arguments instance (uses vocab, vocab_path, vocab_cutoff and dir_path)
        Returns:
            an instance of the class Vectorizer
        """
        
        # If vocab exists, load vocab
        if args.vocab:
            
            print("Load existing vocabulary")
            
            with open(args.vocab_path, 'rb') as file:
                vocab = pickle.load(file)
            
        
        # If vocab doesn't exist, create new vocab
        else:
        
            print("Create new vocabulary")
            # Instantiate new vocabulary
            vocab = Vocabulary(add_unk = True)

            # Retrieve all words from the captions-file and lower them 
            all_words = []

            for cap in caps:
                for annotation in cap:
                    
                    # Remove punctuation and lower the words
                    annotation = annotation[:(len(annotation)-1)]
                    annotation = annotation.lower()
                    annotation = annotation.replace('@','')
                    annotation = annotation.replace('/','')
                    annotation = annotation.replace('#','')
                    annotation = annotation.replace('_','')
                    annotation = annotation.replace('|','')
                    annotation = annotation.replace("'",'')
                    annotation = annotation.replace('.','')

                    words = annotation.split()
                    all_words.extend(words)

            # Convert list with words to dataframe
            all_words_df = pd.DataFrame(all_words)

            # This returns a dataframe series with the counts of every single words
            single_words_counts = all_words_df.apply(pd.Series.value_counts)
            single_words_list = single_words_counts.index.tolist()

            # Add words to vocab that occur more often than cutoff value
            print("Start adding words to vocab")
            for words in single_words_list:

                if single_words_counts.loc[words].values > args.vocab_cutoff:
                    vocab.add_token(words)
            
            # Add start and end token to the vocab
            vocab.add_token('[BOS]')
            vocab.add_token('[EOS]')
            vocab.add_token('[PAD]')
            
            # Save vocab for later usage
            dir_path = args.dir_path
            file_name = '/vocab.pkl'
            path = dir_path + file_name
        
            with open(path, 'wb') as f:
                pickle.dump(vocab, f)
        
            
        return cls(vocab, args)
    
    
    # Not tested yet
    def vectorize(self, caps, MAX_LEN = 128):
        
        print("Start Tokenization")
        
        # Containers for the nested and flat token lists and the caption lengths
        caps_tokens = []
        caps_tokens_flat = []
        caps_length  = []
        caps_length_flat = []
        
        start_token = self.vocab.lookup_token('[BOS]')
        end_token = self.vocab.lookup_token('[EOS]')
        pad_token = self.vocab.lookup_token('[PAD]')
        
        for cap in caps:
            
            cap_tokens = []
            cap_length = []
            
            for annotation in cap:
                
                annotation_token = []
                
                # Remove punctuation and lower the words
                annotation = annotation[:(len(annotation)-1)]
                annotation = annotation.lower()
                words = annotation.split()
                
                # Add start token to each caption
                annotation_token.append(start_token)
            
                # For each word in the caption, retrieve the token
                for word in words:
                    annotation_token.append(self.vocab.lookup_token(word))
                 
                # Add end token to each caption
                annotation_token.append(end_token)
                cap_length.append(len(annotation_token))
                caps_length_flat.append(len(annotation_token))
                
                for i in range(MAX_LEN - len(annotation_token)):
                    annotation_token.append(pad_token)
                
                # Add tokens to array (looks like [img1[[cap11],[cap12],[cap13]], img2[[cap21]...]...])
                cap_tokens.append(annotation_token)
                
                # Add tokens to flat_array (looks like [[cap11],[cap12],[cap13],[cap21]...])
                caps_tokens_flat.append(annotation_token)
                
            caps_tokens.append(cap_tokens)
            caps_length.append(cap_length)
        
        return caps_tokens_flat, caps_length_flat 
    
    
    # Updates vocab and saves new vocab file
    def update_vocab(self, word):
        
        self.vocab.add_token(word)
        vocab = self.vocab
        
        # Save vocab
        dir_path = self.args.dir_path
        file_name = '/vocab.pkl'
        path = dir_path + file_name
        
        with open(path, 'wb') as f:
            pickle.dump(vocab, f)
        
        

#### Vocabulary Class
The _Vocabulary Class_ is typically called by the _Vectorizer Class_. With the functions _add_token()_ and _lookup_token()_ you can add a word and retrieve its index.

In [23]:
class Vocabulary(object):
    
    def __init__(self, token_to_idx = None, add_unk = True, unk_token = "[UNK]"):
        
        if token_to_idx is None: # if no token map exists yet, create a new one
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        
        self._idx_to_token = {idx: token
                             for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1 # if we don't have an unknown token, then unk_index = -1
        if add_unk:  # register the unknown token and store its index
            self.unk_index = self.add_token(unk_token)
        
    
    def add_token(self, token):
        """ Update the mapping dictionaries based on the token
        Arguments:
            token = string/ word that should be inserted to dictionary
        Returns:
            index(int) = corresponding int-index for token
        """
        
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
            
        return index
        
    def lookup_token(self, token):
        """ Retrieves the index of a token or, if the token is not present, the index of '[UNK]'
        Arguments:
            token(str) = the token for which the index should be retrieved
        Return:
            index(int) = the index associated with the token
        """
        if self._add_unk:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
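
The class keeps an index-to-token map but does not expose a lookup in that direction, which is needed to turn predicted indices back into text. The sketch below shows one possible way to do this on a toy vocabulary. `lookup_index` and `decode` are hypothetical helpers and not part of the class above; the dictionaries stand in for `_token_to_idx` and `_idx_to_token`.

```python
token_to_idx = {"[UNK]": 0, "a": 1, "dog": 2, "[BOS]": 3, "[EOS]": 4, "[PAD]": 5}
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

def lookup_index(index):
    """Hypothetical inverse of lookup_token: index -> token."""
    if index not in idx_to_token:
        raise KeyError("the index (%d) is not in the vocabulary" % index)
    return idx_to_token[index]

def decode(indices, specials=("[BOS]", "[EOS]", "[PAD]")):
    """Turn a padded index sequence back into a readable caption."""
    tokens = [lookup_index(i) for i in indices]
    return " ".join(t for t in tokens if t not in specials)

caption = decode([3, 1, 2, 0, 4, 5, 5])
print(caption)
```

The start, end and padding tokens are dropped, while [UNK] is kept so that unknown words remain visible in the decoded caption.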
    

## 2. Create Dataset using Dataloader
Now that we have pre-processed the annotations/captions and loaded the names of all images using the DataPreprocess class, we can build our dataset. Unlike in other settings, in which the entire training set is loaded into the dataloader, it is more feasible here to load the images batch-by-batch. For this, we will use two classes: <br> 
1. **_LoadData_**: takes an instance of _DataPreprocess()_ to load the image names and the tokenized captions. Using the function _CreateDataloader()_, we call the class _CoCoDataset()_ to instantiate a dataset object, which is then passed to PyTorch's DataLoader. We sample the instances, i.e. images and captions, by index: randomly in training (function _Get_Random_Train_Indices()_) and by caption length in testing (function _Get_Train_Indices()_).
2. **_CoCoDataset_**: loads the images for a given caption index (as PIL images), transforms them and returns each image together with the corresponding caption to the DataLoader. 
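
Sampling by caption length, as done in _Get_Train_Indices()_, means picking one length at random and then drawing only captions of that length, so that a batch needs no padding. A minimal sketch of this idea with the standard library only; `sample_same_length_indices` is a hypothetical stand-in and uses a seeded `random.Random` instead of NumPy so that the example is reproducible.

```python
import random

def sample_same_length_indices(caps_len, batch_size, rng=random):
    """Pick one caption length at random, then sample indices of that length."""
    sel_length = rng.choice(caps_len)
    candidates = [i for i, length in enumerate(caps_len) if length == sel_length]
    # Sampling with replacement, so small groups can still fill a batch
    return [rng.choice(candidates) for _ in range(batch_size)]

caps_len = [12, 9, 12, 15, 9, 12]
batch = sample_same_length_indices(caps_len, batch_size=4, rng=random.Random(42))
print(batch, [caps_len[i] for i in batch])
```

Because the length is drawn from the list of all caption lengths, frequent lengths are selected more often than rare ones.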

In [24]:
from torchvision import transforms
from PIL import Image
from tqdm import tqdm
import numpy as np
import random 
import torch.utils.data as data   # needed for the sampler and DataLoader in CreateDataloader()


class LoadData(object):
    
    
    def __init__(self, data_object, args):
        
        # Load training and test data
        self.train_img_names = data_object.train_img_names
        self.test_img_names = data_object.test_img_names
        
        self.train_cap_tokens = data_object.train_cap_tokens
        self.test_cap_tokens = data_object.test_cap_tokens
        
        self.train_caps_len = data_object.train_caps_len
        self.test_caps_len = data_object.test_caps_len
        
        self.train_indices = data_object.train_indices
        self.test_indices = data_object.test_indices
        
        self.args = args
        self.num_workers = 0
        self.transform = transforms.Compose([ 
                transforms.Resize(256),                          # smaller edge of image resized to 256
                transforms.RandomCrop(224),                      # get 224x224 crop from random location
                transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
                transforms.ToTensor(),                           # convert the PIL Image to a tensor
                transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                                     (0.229, 0.224, 0.225))])
        
    
    def CreateDataloader(self, mode):
        
        if mode == 'train':
            
            batch_size = self.args.batch_size
            indices = LoadData.Get_Random_Train_Indices(self, batch_size, mode)
            img_folder = self.args.data_path + 'train2014/'
            indices, img_names, cap_tokens, caps_len, cap_references = LoadData.Get_Data(self, indices, mode)
              
                
        elif mode == 'test':
            
            batch_size = 1
            indices = LoadData.Get_Train_Indices(self, batch_size, mode)
            img_folder = self.args.data_path + 'val2014/'
            indices, img_names, cap_tokens, caps_len, cap_references = LoadData.Get_Data(self, indices, mode)
            
    
        # Create new dataset instance, which will be fed into the dataloader
        dataset = CoCoDataset(transform = self.transform,
                                mode = mode,
                                batch_size = batch_size,
                                img_path = img_folder,
                                img_names =  img_names,
                                cap_tokens = cap_tokens,
                                caps_len = caps_len,
                                cap_references = cap_references,
                                args = self.args)


        initial_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        
        data_loader = data.DataLoader(dataset = dataset, 
                                      num_workers = self.num_workers,
                                      batch_sampler=data.sampler.BatchSampler(sampler=initial_sampler,
                                                                              batch_size=dataset.batch_size,
                                                                              drop_last=False))
        
        return data_loader
            
    
    # It is way faster to get random indices and only pass a subset of the data to the dataset instance.
    @staticmethod    
    def Get_Random_Train_Indices(self, batch_size, mode):
        
        if mode == 'train':
            all_indices = random.sample(range(len(self.train_cap_tokens)), batch_size)
        
        else:
            all_indices = random.sample(range(len(self.test_cap_tokens)), batch_size)
            
        return all_indices
    
    
    # Just for test purposes
    @staticmethod
    def Get_Data(self, indices, mode):
        
        sub_img_names = []
        sub_cap_tokens = []
        sub_caps_len = []
        sub_indices = []
        cap_references = []
        
        # Pick the index frame and lists that belong to the requested split
        if mode == 'train':
            frame, tokens, lengths = self.train_indices, self.train_cap_tokens, self.train_caps_len
        else:
            frame, tokens, lengths = self.test_indices, self.test_cap_tokens, self.test_caps_len
        
        for index in indices:
            
            try:
                img_name = frame.loc[index, 'Image Name']
                references = frame.loc[index, 'Reference Captions']
                cap_token = tokens[index]
                cap_len = lengths[index]
            
            except (KeyError, IndexError):
                # Skip indices without a complete record so that all lists stay aligned
                continue
            
            sub_img_names.append(img_name)
            sub_cap_tokens.append(cap_token)
            sub_caps_len.append(cap_len)
            cap_references.append(references)
            
            # Position of this sample within the subset
            sub_indices.append(len(sub_img_names) - 1)
        
        return sub_indices, sub_img_names, sub_cap_tokens, sub_caps_len, cap_references
       
    
    #If captions are not padded, we can use this function to derive captions with equal length.
    @staticmethod    
    def Get_Train_Indices(self, batch_size, mode):
        
        if mode == 'train':
            sel_length = np.random.choice(self.train_caps_len)
            all_indices = np.where([self.train_caps_len[i] == sel_length for i in np.arange(len(self.train_caps_len))])[0]
        
        else:
            sel_length = np.random.choice(self.test_caps_len)
            all_indices = np.where([self.test_caps_len[i] == sel_length for i in np.arange(len(self.test_caps_len))])[0]
            
        
        indices = list(np.random.choice(all_indices, size=batch_size))
        
        return indices


_CoCoDataset()_ is used as the Dataset class for the DataLoader. It is called every time a new batch is retrieved during training.

In [25]:
import torch
import torch.utils.data as data

class CoCoDataset(data.Dataset):
    
    def __init__(self, transform, mode, batch_size, img_path, img_names, cap_tokens, caps_len, cap_references, args):
        
        self.transform = transform
        self.mode = mode
        self.batch_size = batch_size
        self.img_path = img_path
        self.cap_references = cap_references
        
        self.img_names = img_names
        self.cap_tokens = cap_tokens
        self.caps_len = caps_len
        self.cpi = args.cpi
        
        
    def __getitem__(self, index):
        
        img_idx = index   # position of the sample within the subset passed to the dataset
        
        if self.mode == 'train':
            
            # Get images
            img_name = self.img_names[img_idx]
            image_path = self.img_path + img_name

            # Convert image to tensor and pre-process using transform
            image = Image.open(image_path).convert('RGB')
            image = self.transform(image)

            # Convert caption to tensor of word ids.
            caption = self.cap_tokens[index]
            caption = torch.tensor(caption, dtype=torch.long)
            caption_len = self.caps_len[index]
            caption_len = torch.tensor(caption_len, dtype=torch.long)

            # return pre-processed image and caption tensors
            return image, caption, caption_len
        

        # obtain image if in test mode
        else:
            
            # Get images
            img_name = self.img_names[img_idx]
            image_path = self.img_path + img_name

            # Convert image to tensor and pre-process using transform
            PIL_image = Image.open(image_path).convert('RGB')
            image = self.transform(PIL_image)
            
            # Convert caption to tensor of word ids.
            caption = self.cap_tokens[index]
            caption = torch.tensor(caption, dtype=torch.long)
            caption_len = self.caps_len[index]
            caption_len = torch.tensor(caption_len, dtype=torch.long)
            
            cap_reference = self.cap_references

            # return pre-processed image, caption tensors and the reference caption indices
            return image, caption, caption_len, cap_reference
        

## 3. Network Architecture
The image captioning task follows an **Encoder-Decoder** architecture. The **Encoder** is a pre-trained **ResNet-101** model that encodes each image into a feature representation. To retrieve these features, we have to cut off the last layers (pooling and linear) of the ResNet model. The output of the Encoder is a feature map of shape (batch_size, 14, 14, 2048). <br> <br>
This representation is then used as the start state/hidden value of the **Decoder**, which is an LSTM-RNN with attention. 

![alt text](model_figure.png)
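
With its pooling and linear layers removed, the encoder in this notebook returns a spatial feature map rather than a single vector. The shapes can be traced by hand, as in the sketch below. It only does the arithmetic and does not run the network; `encoder_output_shape` is a hypothetical helper, and the factor of 32 assumes an input size divisible by 32, such as the 224x224 crops used here.

```python
def encoder_output_shape(batch_size, image_size, encoded_image_size=14, channels=2048):
    """Trace tensor shapes through the encoder without running the network."""
    # ResNet-101 without pooling and linear layers downsamples by a factor of 32
    resnet_map = (batch_size, channels, image_size // 32, image_size // 32)
    # Adaptive average pooling maps any spatial size to a fixed grid
    pooled = (batch_size, channels, encoded_image_size, encoded_image_size)
    # permute(0, 2, 3, 1) moves the channels to the last axis
    permuted = (pooled[0], pooled[2], pooled[3], pooled[1])
    # Number of image locations the attention network can weight
    num_pixels = encoded_image_size * encoded_image_size
    return resnet_map, permuted, num_pixels

resnet_map, encoder_out, num_pixels = encoder_output_shape(batch_size=100, image_size=224)
print(resnet_map)
print(encoder_out)
print(num_pixels)
```

Each of the 14 x 14 = 196 locations carries a 2048-dimensional feature vector, and these locations are what the attention network distributes its weights over.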

In [26]:
'''
Source: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning/blob/master/models.py
'''

import torch
from torch import nn
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Encoder(nn.Module):

    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        resnet = torchvision.models.resnet101(pretrained=True)  # pretrained ImageNet ResNet-101

        # Remove linear and pool layers as original ResNet Model is trained on image classification
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        # Resize the feature map to a fixed size so that input images of variable size are supported
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        self.fine_tune()

        
    def forward(self, images):
        """
        Forward propagation.
        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images
        """
        output = self.resnet(images)         # (batch_size, 2048, image_size/32, image_size/32)
        output = self.adaptive_pool(output)  # (batch_size, 2048, encoded_image_size, encoded_image_size)
        output = output.permute(0, 2, 3, 1)  # (batch_size, encoded_image_size, encoded_image_size, 2048)
        
        return output

    
    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for convolutional blocks 2 through 4 of the encoder.
        :param fine_tune: Allow?
        """
        for p in self.resnet.parameters():
            p.requires_grad = False
        
        # If fine-tuning, only fine-tune convolutional blocks 2 through 4
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune

            

In [27]:
class Attention(nn.Module):
    """
    Attention Network.
    """

    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        """
        :param encoder_dim: feature size of encoded images
        :param decoder_dim: size of decoder's RNN
        :param attention_dim: size of the attention network
        """
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # linear layer to transform encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # linear layer to transform decoder's output
        self.full_att = nn.Linear(attention_dim, 1)               # linear layer to calculate values to be softmax-ed
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)                          # softmax layer to calculate weights

        
    def forward(self, encoder_out, decoder_hidden):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :param decoder_hidden: previous decoder output, a tensor of dimension (batch_size, decoder_dim)
        :return: attention weighted encoding, weights
        """
        att1 = self.encoder_att(encoder_out)     # (batch_size, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)  # (batch_size, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (batch_size, num_pixels)
        alpha = self.softmax(att)  # (batch_size, num_pixels)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch_size, encoder_dim)

        return attention_weighted_encoding, alpha
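
The arithmetic behind the last two lines of `Attention.forward()` can be sketched in plain Python for a single image. This is only an illustration under simplifying assumptions: the three "pixels", their 2-dim features and the raw scores are made-up numbers, and the scores stand in for the output of `full_att(relu(att1 + att2))`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup (hypothetical numbers): 3 pixels, each with a 2-dim feature vector
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
att_scores = [2.0, 0.0, 0.0]  # one raw attention score per pixel

# alpha: attention weights over the pixels, they sum to 1
alpha = softmax(att_scores)

# Attention-weighted encoding: weighted sum of the pixel features
context = [
    sum(a * pixel[d] for a, pixel in zip(alpha, encoder_out))
    for d in range(len(encoder_out[0]))
]

print(alpha)
print(context)
```

The pixel with the highest score dominates the weighted sum, which is how the Decoder focuses on one image region per generated word.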
    


In [28]:
class DecoderWithAttention(nn.Module):
    """
    Decoder.
    """

    def __init__(self, attention_dim, embed_dim, decoder_dim, vocab_size, encoder_dim=2048, dropout=0.5):
        """
        :param attention_dim: size of attention network
        :param embed_dim: embedding size
        :param decoder_dim: size of decoder's RNN
        :param vocab_size: size of vocabulary
        :param encoder_dim: feature size of encoded images
        :param dropout: dropout
        """
        super(DecoderWithAttention, self).__init__()

        self.encoder_dim = encoder_dim
        self.attention_dim = attention_dim
        self.embed_dim = embed_dim
        self.decoder_dim = decoder_dim
        self.vocab_size = vocab_size
        self.dropout = dropout

        self.attention = Attention(encoder_dim, decoder_dim, attention_dim)  # attention network

        self.embedding = nn.Embedding(vocab_size, embed_dim)  # embedding layer
        self.dropout = nn.Dropout(p=self.dropout)
        self.decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim, bias=True)  # decoding LSTMCell
        self.init_h = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial hidden state of LSTMCell
        self.init_c = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial cell state of LSTMCell
        self.f_beta = nn.Linear(decoder_dim, encoder_dim)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = nn.Sigmoid()
        self.fc = nn.Linear(decoder_dim, vocab_size)  # linear layer to find scores over vocabulary
        self.init_weights()  # initialize some layers with the uniform distribution

        
    def init_weights(self):
        """
        Initializes some parameters with values from the uniform distribution, for easier convergence.
        """
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

        
    def load_pretrained_embeddings(self, weight_matrix):
        """
        Loads embedding layer with pre-trained embeddings.
        :param weight_matrix: pre-trained embeddings, a tensor of dimension (vocab_size, embed_dim)
        """
        self.embedding.weight = nn.Parameter(weight_matrix)

        
    def fine_tune_embeddings(self, fine_tune=False):
        """
        Allow fine-tuning of embedding layer? (Only makes sense to not-allow if using pre-trained embeddings).
        :param fine_tune: Allow?
        """
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

            
    def init_hidden_state(self, encoder_out):
        """
        Creates the initial hidden and cell states for the decoder's LSTM based on the encoded images.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :return: hidden state, cell state
        """
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out)
        return h, c

    
    def forward(self, encoder_out, encoded_captions, caption_lengths):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, enc_image_size, enc_image_size, encoder_dim)
        :param encoded_captions: encoded captions, a tensor of dimension (batch_size, max_caption_length)
        :param caption_lengths: caption lengths, a tensor of dimension (batch_size, 1)
        :return: scores for vocabulary, sorted encoded captions, decode lengths, weights, sort indices
        """

        batch_size = encoder_out.size(0)
        encoder_dim = encoder_out.size(-1)
        vocab_size = self.vocab_size

        # Flatten image
        encoder_out = encoder_out.view(batch_size, -1, encoder_dim)  # (batch_size, num_pixels, encoder_dim)
        num_pixels = encoder_out.size(1)

        # Sort input data by decreasing lengths; why? apparent below
        caption_lengths, sort_ind = caption_lengths.sort(descending=True)
        encoder_out = encoder_out[sort_ind]
        encoded_captions = encoded_captions[sort_ind]

        # Embedding
        embeddings = self.embedding(encoded_captions)  # (batch_size, max_caption_length, embed_dim)

        # Initialize LSTM state
        h, c = self.init_hidden_state(encoder_out)  # (batch_size, decoder_dim)

        # We won't decode at the <end> position, since we've finished generating as soon as we generate <end>
        # So, decoding lengths are actual lengths - 1
        decode_lengths = (caption_lengths - 1).tolist()

        # Create tensors to hold word prediction scores and alphas
        predictions = torch.zeros(batch_size, max(decode_lengths), vocab_size).to(device)
        alphas = torch.zeros(batch_size, max(decode_lengths), num_pixels).to(device)

        # At each time-step, decode by
        # attention-weighing the encoder's output based on the decoder's previous hidden state output
        # then generate a new word in the decoder with the previous word and the attention weighted encoding
        for t in range(max(decode_lengths)):
            batch_size_t = sum([l > t for l in decode_lengths])
            attention_weighted_encoding, alpha = self.attention(encoder_out[:batch_size_t],
                                                                h[:batch_size_t])
            gate = self.sigmoid(self.f_beta(h[:batch_size_t]))  # gating scalar, (batch_size_t, encoder_dim)
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(
                torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
                (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
            preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
            predictions[:batch_size_t, t, :] = preds
            alphas[:batch_size_t, t, :] = alpha

        return predictions, encoded_captions, decode_lengths, alphas, sort_ind
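
The reason for sorting the captions by decreasing length becomes visible when looking at `batch_size_t` in the decoding loop. The sketch below repeats that one line for a made-up batch of three captions (the lengths are assumptions for illustration):

```python
# Hypothetical decode lengths of a batch, already sorted in decreasing order
decode_lengths = [5, 3, 2]

# Number of captions that are still being decoded at each time step t
active_per_step = [
    sum(l > t for l in decode_lengths)
    for t in range(max(decode_lengths))
]

print(active_per_step)  # [3, 3, 2, 1, 1]
```

Because the batch is sorted, the captions that are still active at step `t` are always the first `batch_size_t` rows, so slicing with `[:batch_size_t]` is enough to skip finished captions and their padding.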



## 4. Training Process
The following classes describe the training process. The main class for training is the **_Trainer()_** class. Training is started by passing the data-object, a dataloader-object (= instance of class _LoadData()_ ) and the arguments instance to its Main-method. The _Trainer()_ class has the following structure: <br>
<br>
Structure of **_Trainer()-class_**:
- **Main-method**: coordinates the training and sets up the models (encoder and decoder). Moreover, it calls the method save_checkpoint() after every epoch to save checkpoints.
- **TrainModel**: executes the training for one epoch. It iterates over all steps (as defined in Arguments) and is responsible for forward and backward propagation. For every step, it requests a new batch from the dataloader-instance and passes it to the model.
- **EvalModel**: evaluates the training progress (set the test-steps in Arguments!). For every step, it loads a batch from the test-set (= val2014) with a specific batch-size (currently 1). It then executes the forward pass without calculating gradients and without backpropagation. To calculate the BLEU-score, we retrieve the reference captions of every datapoint and pass them together with the predicted captions to the NLTK function _corpus_bleu()_.
- **Utils-methods**: besides training and evaluation, the _Trainer()_ class also implements various helper methods for adjusting the learning rate, saving checkpoints, clipping gradients and calculating accuracies.
<br> 
<br>
Furthermore, we save all calculated metrics in instances of the class AverageMeter(), which keep track of the losses and accuracies during training and validation.
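
The nesting that _corpus_bleu()_ expects is easy to get wrong: one list of reference captions per image, and one predicted caption per image. The following minimal sketch uses made-up token ids (all values are assumptions for illustration) and assumes that `nltk` is installed as at the top of this notebook.

```python
from nltk.translate.bleu_score import corpus_bleu

# One entry per image; each entry holds all reference captions of that image
references = [
    [[4, 7, 9, 12, 3], [4, 8, 9, 12, 3]],  # image 0: two references
    [[5, 6, 10, 11, 3]],                   # image 1: one reference
]

# One predicted caption (list of token ids) per image, in the same order
hypotheses = [
    [4, 7, 9, 12, 3],
    [5, 6, 10, 11, 3],
]

# Each hypothesis matches one of its references exactly, so the score is 1.0
score = corpus_bleu(references, hypotheses)
print(score)
```

By default, `corpus_bleu` weights 1-gram to 4-gram precision equally, so the reported value corresponds to BLEU-4.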

In [29]:
class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
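
A short usage sketch shows how the weighting by `n` works. The class is repeated in compact form so that the snippet runs on its own, and the loss values and token counts are made up for illustration.

```python
class AverageMeter(object):
    """Compact copy of the class above, repeated so the sketch is self-contained."""

    def __init__(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = AverageMeter()
meter.update(2.0, n=10)  # hypothetical batch: mean loss 2.0 over 10 decoded tokens
meter.update(4.0, n=30)  # hypothetical batch: mean loss 4.0 over 30 decoded tokens

print(meter.val)  # 4.0, the most recent value
print(meter.avg)  # 3.5, the token-weighted mean (2.0 * 10 + 4.0 * 30) / 40
```

Since the training code calls `update()` with `n = sum(decode_lengths)`, the reported averages are per decoded word rather than per batch. Note that a meter keeps accumulating until `reset()` is called.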

In [30]:
import torch.optim
import torch.utils.data
import torchvision.transforms as transforms
import torchtext.vocab as vocab
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence
from nltk.translate.bleu_score import corpus_bleu




class Trainer(object):
    
    '''
    --------------------------------------------------------------------------------------------------
    Utils methods
    --------------------------------------------------------------------------------------------------
    '''
    def Adjust_LR(optimizer, shrink_factor):
        
        print("\nDECAYING learning rate.")
        for param_group in optimizer.param_groups:
            param_group['lr'] = param_group['lr'] * shrink_factor
        print("The new learning rate is %f\n" % (optimizer.param_groups[0]['lr'],))
    
        
    # Function to clip gradients
    def clip_gradient(optimizer, grad_clip):
   
        for group in optimizer.param_groups:
            for param in group['params']:
                if param.grad is not None:
                    param.grad.data.clamp_(-grad_clip, grad_clip)
     
    
    def accuracy(scores, targets, k):
        
        batch_size = targets.size(0)
        _, ind = scores.topk(k, 1, True, True)
        correct = ind.eq(targets.view(-1, 1).expand_as(ind))
        correct_total = correct.view(-1).float().sum()  # 0D tensor
        
        return correct_total.item() * (100.0 / batch_size)
    
    
    def save_checkpoint(data_name, epoch, epochs_since_improvement, encoder, decoder, encoder_optimizer, decoder_optimizer,
                    bleu4, is_best):
        
        # State to be saved
        state = {'epoch': epoch,
                 'epochs_since_improvement': epochs_since_improvement,
                 'bleu-4': bleu4,
                 'encoder': encoder,
                 'decoder': decoder,
                 'encoder_optimizer': encoder_optimizer,
                 'decoder_optimizer': decoder_optimizer}
        
        filename = 'checkpoint_' + data_name + 'glove300d.pth.tar'
        torch.save(state, filename)
        
        # If this checkpoint is the best so far, store a copy so it doesn't get overwritten by a worse checkpoint
        if is_best:
            torch.save(state, 'BEST_' + filename)
            
            
    def Load_Glove(data_object):
        
        print("Load Wordembeddings")
        # Load Glove using Torchtext
        glove = vocab.GloVe(name='6B', dim=300)
        
        matrix_len = len(data_object.vocab._token_to_idx)
        weights_matrix = np.zeros((matrix_len, 300))
        words_found = 0

        for i, word in enumerate(data_object.vocab._token_to_idx):
            try: 
                weights_matrix[i] = glove[word]
                words_found += 1
            except KeyError:
                weights_matrix[i] = np.random.normal(scale=0.6, size=(300, ))
    
        return weights_matrix
    
    
    '''
    --------------------------------------------------------------------------------------------------
    Training methods
    --------------------------------------------------------------------------------------------------
    '''
    
    #Function to execute the forward and backward pass
    def TrainModel(encoder, decoder, criterion, encoder_optimizer,
                   decoder_optimizer, epoch, losses, top5accs, args, dataloader):         
            
            
        # Iterate over the steps and load a new batch for every step
        for i_step in range(1, args.total_step+1):
                
            # Set models to train-mode
            decoder.train()  
            encoder.train()
                
            # Get dataset for each step
            batch = dataloader.CreateDataloader(mode = 'train')
            img, cap, cap_len = next(iter(batch))
                
            # Move data to GPU
            img = img.to(device)
            cap = cap.to(device)
            cap_len = cap_len.to(device)
                
            # Forward propagation
            img = encoder(img)
            scores, caps_sorted, decode_lengths, alphas, sort_ind = decoder(img, cap, cap_len)
                
            # Targets start after [BOS] and end before [EOS]
            targets = caps_sorted[:, 1:]
                
            # Remove timesteps that we didn't decode at, or are pads
            # pack_padded_sequence is an easy trick to do this
        
            scores = pack_padded_sequence(scores, decode_lengths, batch_first=True)
            scores = scores[0]
            targets = pack_padded_sequence(targets, decode_lengths, batch_first=True)
            targets = targets[0]
                
            # Calculate loss
            loss = criterion(scores, targets)

            # Add doubly stochastic attention regularization
            loss += args.alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()

            # Back prop.
            decoder_optimizer.zero_grad()
                
            if encoder_optimizer is not None:
                encoder_optimizer.zero_grad()
                
            loss.backward()
                
            # Clip gradients by calling function clip_gradient()
            if args.grad_clip is not None:
                Trainer.clip_gradient(decoder_optimizer, args.grad_clip)
                    
                if encoder_optimizer is not None:
                    Trainer.clip_gradient(encoder_optimizer, args.grad_clip)

            # Update weights
            decoder_optimizer.step()
                
            if encoder_optimizer is not None:
                encoder_optimizer.step()

            # Keep track of metrics
            top5 = Trainer.accuracy(scores, targets, 5)
            losses.update(loss.item(), sum(decode_lengths))
            top5accs.update(top5, sum(decode_lengths))
            
            
        return encoder, decoder, losses, top5accs 
            
       
        
    #Function to evaluate the model during training on the validation set   
    def EvalModel(encoder, decoder, criterion, encoder_optimizer,
                   decoder_optimizer, epoch, losses, top5accs, args, dataloader):
        
        
        # Lists of reference captions and predictions for calculating the BLEU score
        true_captions = list() 
        predictions = list() 
        
        # Iterate over the steps and load a new batch for every step
        for i_step in range(1, args.test_total_step+1):
            
            decoder.eval()
            encoder.eval()
        
            # Disable gradient calculation for evaluation
            with torch.no_grad():
                
                # Get dataset for each test step
                batch = dataloader.CreateDataloader(mode = 'test')
                img, cap, cap_len, cap_reference = next(iter(batch))

                # Move data to GPU
                img = img.to(device)
                cap = cap.to(device)
                cap_len = cap_len.to(device)

                # Forward propagation
                img = encoder(img)
                scores, caps_sorted, decode_lengths, alphas, sort_ind = decoder(img, cap, cap_len)

                # Targets start after [BOS] and end before [EOS]
                targets = caps_sorted[:, 1:]

                # Remove timesteps that we didn't decode at, or are pads
                # pack_padded_sequence is an easy trick to do this
                
                scores_copy = scores.clone()

                scores = pack_padded_sequence(scores, decode_lengths, batch_first=True)
                scores = scores[0]
                targets = pack_padded_sequence(targets, decode_lengths, batch_first=True)
                targets = targets[0]
                
                # Calculate loss
                loss = criterion(scores, targets)

                # Add doubly stochastic attention regularization
                loss += args.alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()
                
                # Keep track of metrics
                top5 = Trainer.accuracy(scores, targets, 5)
                losses.update(loss.item(), sum(decode_lengths))
                top5accs.update(top5, sum(decode_lengths))
            
                
                # Store references for calculating BLEU scores
                for index in sort_ind:
                    
                    img_captions = []
                    for caption in cap_reference[index]:
                        img_caption = dataloader.test_cap_tokens[caption] # Retrieve the caption based on its index
                        img_caption_len = dataloader.test_caps_len[caption] # Retrieve the length of the caption
                        img_caption = img_caption[1:(img_caption_len)] # Remove start-token and paddings
                        img_captions.append(img_caption) # Add all tokenized captions to the array
                        
                    true_captions.append(img_captions)
                
                # Get predictions
                _, preds = torch.max(scores_copy, dim=2)
                preds = preds.tolist() # Save tensor as list
                temp_preds = []
                
                for i, prediction in enumerate(preds):
                    temp_preds.append(preds[i][:decode_lengths[i]]) # if any, remove paddings
                
                preds = temp_preds
                predictions.extend(preds)
                
        # Calculate BLEU-Scores using NLTK toolkit
        bleu = corpus_bleu(true_captions, predictions)
        
        return bleu, losses, top5accs
            
    
    
    """
    Main-function to be called, which loads the optimizer, then executes the training and evaluation
    """
    def Main(data_object, dataloader, args):
        
        print("Initialize Training by setting-up models")
        # Initialize a new model or load existing checkpoints
        if args.checkpoint is None:
            decoder = DecoderWithAttention(attention_dim = args.attention_dim,
                                       embed_dim = args.emb_dim,
                                       decoder_dim = args.decoder_dim,
                                       vocab_size = len(data_object.vocab._token_to_idx),
                                       dropout= args.dropout)
            decoder_optimizer = torch.optim.Adam(params=filter(lambda p: p.requires_grad, decoder.parameters()),
                                             lr= args.decoder_lr)
            encoder = Encoder()
            encoder.fine_tune(args.fine_tune_encoder)
            encoder_optimizer = torch.optim.Adam(params=filter(lambda p: p.requires_grad, encoder.parameters()),
                                                 lr=args.encoder_lr) if args.fine_tune_encoder else None

        
        else:
            print("Load existing checkpoint")
            checkpoint = torch.load(args.checkpoint)
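            # Restore the training state on args, because the training loop below reads it from there
            args.start_epoch = checkpoint['epoch'] + 1
            
            args.epochs_since_improvement = checkpoint['epochs_since_improvement']
            args.best_bleu = checkpoint['bleu-4']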
            
            decoder = checkpoint['decoder']
            decoder_optimizer = checkpoint['decoder_optimizer']
            
            encoder = checkpoint['encoder']
            encoder_optimizer = checkpoint['encoder_optimizer']
            
            if args.fine_tune_encoder is True and encoder_optimizer is None:
                
                encoder.fine_tune(args.fine_tune_encoder)
                encoder_optimizer = torch.optim.Adam(params=filter(lambda p: p.requires_grad, encoder.parameters()),
                                                 lr=args.encoder_lr)
                
        # Load GloVe weight matrix for the embedding layer and update the weights
        weight_matrix = torch.Tensor(Trainer.Load_Glove(data_object))
        decoder.load_pretrained_embeddings(weight_matrix)
        print("Word embeddings loaded and embedding weights updated")
        
        # Move the models to the GPU
        encoder = encoder.to(device)
        decoder = decoder.to(device)
        
        # Loss function
        criterion = nn.CrossEntropyLoss().to(device)
        
        # Call class AverageMeter() to instantiate metrics
        losses = AverageMeter()     # loss (per word decoded)
        top5accs = AverageMeter()   # top5 accuracy
        losses_val = AverageMeter()
        top5accs_val = AverageMeter()
        
        
        # Start training
        for epoch in tqdm(range(args.start_epoch, args.epochs)):
            
            # Decay learning rate if there is no improvement for 8 consecutive epochs, and terminate training after 20
            if args.epochs_since_improvement == 20:
                break

            if args.epochs_since_improvement > 0 and args.epochs_since_improvement % 8 == 0:

                Trainer.Adjust_LR(decoder_optimizer, 0.8)

                if args.fine_tune_encoder:
                    Trainer.Adjust_LR(encoder_optimizer, 0.8)
                    
                    
            # Now, start training
            encoder, decoder, losses, top5accs = Trainer.TrainModel(encoder, decoder, criterion, encoder_optimizer, decoder_optimizer, 
                                                                    epoch, losses, top5accs, args, dataloader)
             
            print("-----------Metrics for Epoch %s-----------" %epoch)
            print("Average training loss: %s" %losses.avg)
            print("Average training accuracy: %s" %top5accs.avg)
            
            # After training for n-steps in each epoch, evaluate the output
            current_bleu, losses_val, top5accs_val = Trainer.EvalModel(encoder, decoder, criterion, encoder_optimizer, decoder_optimizer, 
                             epoch, losses_val, top5accs_val, args, dataloader)
            
            print("Average validation loss: %s" %losses_val.avg)
            print("Average validation accuracy: %s" %top5accs_val.avg)
            print("BLEU-score: %s" %current_bleu)
            
            
            # Compare BLEU Score with previous scores
            is_best_bleu = current_bleu > args.best_bleu
            args.best_bleu = max(current_bleu, args.best_bleu)
            
            if not is_best_bleu:
                args.epochs_since_improvement += 1
                print("Epochs since improvement: %s" %args.epochs_since_improvement)
                
            else:
                args.epochs_since_improvement = 0
            
            # Save checkpoint
            Trainer.save_checkpoint(args.model_name, epoch, args.epochs_since_improvement, encoder, decoder,
                                    encoder_optimizer, decoder_optimizer, current_bleu, is_best_bleu)
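
Both TrainModel and EvalModel rely on `pack_padded_sequence` to drop padded positions before the loss is computed. The following small sketch with made-up token ids (the values, the pad id 0 and the lengths are assumptions for illustration) shows what remains after packing:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy targets: 3 captions padded with 0 to length 4, sorted by decreasing length
targets = torch.tensor([[11, 12, 13, 14],
                        [21, 22, 23, 0],
                        [31, 32, 0, 0]])
decode_lengths = [4, 3, 2]

packed = pack_padded_sequence(targets, decode_lengths, batch_first=True)
flat_targets = packed.data  # same tensor that the training code reads via packed[0]

print(flat_targets.tolist())        # [11, 21, 31, 12, 22, 32, 13, 23, 14]
print(packed.batch_sizes.tolist())  # [3, 3, 2, 1]
```

The packed data is ordered time step by time step instead of caption by caption. This does not affect the loss, because scores and targets are packed with the same lengths and therefore stay aligned.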
            
            
        

## 5. Instantiate Training
Finally, to start training we need to instantiate an object of the class **_DataPreprocess_**, which we call data_object. Using the data_object, we instantiate an object of the class **_LoadData_**, which provides the dataloader. Lastly, we pass the data_object, the dataloader-object and the arguments instance to the Main()-method of the **_Trainer()-class_** to start the training.

In [25]:
data_object = DataPreprocess.ProcessData(args)

loading annotations into memory...
Done (t=12.06s)
creating index...
index created!
loading annotations into memory...
Done (t=0.87s)
creating index...
index created!
loading annotations into memory...
Done (t=6.58s)
creating index...
index created!
loading annotations into memory...
Done (t=0.43s)
creating index...
index created!
Load existing vocabulary
Start Tokenization
Start Tokenization
Start getting indices...
Start getting indices...


In [30]:
dataloader = LoadData(data_object, args)

In [None]:
train = Trainer.Main(data_object, dataloader, args)

Initialize Training by setting-up models


Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /root/.cache/torch/checkpoints/resnet101-5d3b4d8f.pth


Load Wordembeddings
Word embeddings loaded and embedding weights updated


  0%|          | 0/100 [00:00<?, ?it/s]

-----------Metrics for Epoch 0-----------
Average training loss: 5.4620256498198705
Average training accuracy: 48.058848133514466
Average validation loss: 4.579207232129352
Average validation accuracy: 58.711629917855596
BLEU-score: 0.10537740475960217

  1%|          | 1/100 [19:18<31:52:01, 1158.80s/it]

-----------Metrics for Epoch 1-----------
Average training loss: 5.0299822298420835
Average training accuracy: 53.05985571763859
Average validation loss: 4.414589353679201
Average validation accuracy: 60.27157249233465
BLEU-score: 0.13026540328835762


  2%|▏         | 2/100 [38:35<31:31:39, 1158.16s/it]

-----------Metrics for Epoch 2-----------
Average training loss: 4.801371099721451
Average training accuracy: 55.7058780772306
Average validation loss: 4.29830440969652
Average validation accuracy: 61.463343535927706
BLEU-score: 0.14739830734888754


  3%|▎         | 3/100 [57:36<31:03:57, 1152.96s/it]

-----------Metrics for Epoch 3-----------
Average training loss: 4.65055874203267
Average training accuracy: 57.45861516265166
Average validation loss: 4.190259289562787
Average validation accuracy: 63.13526623305038
BLEU-score: 0.17648947242967197


  4%|▍         | 4/100 [1:16:38<30:39:22, 1149.61s/it]

-----------Metrics for Epoch 4-----------
Average training loss: 4.5383444568536095
Average training accuracy: 58.77304597287373
Average validation loss: 4.121639466857221
Average validation accuracy: 63.94641755530096
BLEU-score: 0.15736580681628304
Epochs since improvement: 1


  5%|▌         | 5/100 [1:35:33<30:13:23, 1145.30s/it]

-----------Metrics for Epoch 5-----------
Average training loss: 4.451857416399863
Average training accuracy: 59.786563287074195
Average validation loss: 4.07681141800937
Average validation accuracy: 64.56531300395837
BLEU-score: 0.16276560617311414
Epochs since improvement: 2


  6%|▌         | 6/100 [1:54:26<29:48:45, 1141.76s/it]

-----------Metrics for Epoch 6-----------
Average training loss: 4.379413872914554
Average training accuracy: 60.646020223009664
Average validation loss: 4.033044104995606
Average validation accuracy: 65.07867132867133
BLEU-score: 0.18070952051567885


  7%|▋         | 7/100 [2:13:57<29:43:02, 1150.35s/it]

-----------Metrics for Epoch 7-----------
Average training loss: 4.319004536728151
Average training accuracy: 61.358949963240065
Average validation loss: 4.008871428314521
Average validation accuracy: 65.32244897959184
BLEU-score: 0.176063208326808
Epochs since improvement: 1


  8%|▊         | 8/100 [2:34:14<29:54:40, 1170.44s/it]

-----------Metrics for Epoch 8-----------
Average training loss: 4.266622715069842
Average training accuracy: 61.97568304356892
Average validation loss: 3.983129996413365
Average validation accuracy: 65.62666279688483
BLEU-score: 0.16494754940612683
Epochs since improvement: 2


  9%|▉         | 9/100 [2:54:03<29:43:25, 1175.89s/it]

-----------Metrics for Epoch 9-----------
Average training loss: 4.220905399037353
Average training accuracy: 62.5139767319949
Average validation loss: 3.9637099953373496
Average validation accuracy: 65.8991991643454
BLEU-score: 0.16127689819954236
Epochs since improvement: 3


 10%|█         | 10/100 [3:13:49<29:28:42, 1179.14s/it]

-----------Metrics for Epoch 10-----------
Average training loss: 4.180401458799817
Average training accuracy: 62.997093947986684
Average validation loss: 3.9379384471622156
Average validation accuracy: 66.1772272133613
BLEU-score: 0.1701431131428906
Epochs since improvement: 4


 11%|█         | 11/100 [3:33:29<29:09:12, 1179.24s/it]

-----------Metrics for Epoch 11-----------
Average training loss: 4.144112720953953
Average training accuracy: 63.43124118974898
Average validation loss: 3.927592958484417
Average validation accuracy: 66.22062663185379
BLEU-score: 0.16053332693429215
Epochs since improvement: 5


 12%|█▏        | 12/100 [3:53:12<28:51:18, 1180.44s/it]

-----------Metrics for Epoch 12-----------
Average training loss: 4.111190110739318
Average training accuracy: 63.820647182842464
Average validation loss: 3.898494093590803
Average validation accuracy: 66.63431287234755
BLEU-score: 0.18175892224047738


 13%|█▎        | 13/100 [4:14:57<29:25:41, 1217.72s/it]

-----------Metrics for Epoch 13-----------
Average training loss: 4.081435444684267
Average training accuracy: 64.17260490405505
Average validation loss: 3.875614231415796
Average validation accuracy: 66.94965326367509
BLEU-score: 0.1828670718636703


 14%|█▍        | 14/100 [4:36:44<29:43:49, 1244.53s/it]

-----------Metrics for Epoch 14-----------
Average training loss: 4.053825528213686
Average training accuracy: 64.50765038597056
Average validation loss: 3.8589352743475325
Average validation accuracy: 67.14517719389826
BLEU-score: 0.17614536809537082
Epochs since improvement: 1


 15%|█▌        | 15/100 [4:58:31<29:49:30, 1263.19s/it]

-----------Metrics for Epoch 15-----------
Average training loss: 4.028273591474722
Average training accuracy: 64.81202701845017
Average validation loss: 3.8414974854278667
Average validation accuracy: 67.34655078761543
BLEU-score: 0.1947429163765393


 16%|█▌        | 16/100 [5:20:10<29:43:48, 1274.15s/it]

-----------Metrics for Epoch 16-----------
Average training loss: 4.004671805695419
Average training accuracy: 65.0937883144438
Average validation loss: 3.8284119077053114
Average validation accuracy: 67.47483521536968
BLEU-score: 0.1979634245773051


 17%|█▋        | 17/100 [5:42:01<29:37:55, 1285.25s/it]

-----------Metrics for Epoch 17-----------
Average training loss: 3.9828530739245287
Average training accuracy: 65.35191556067589
Average validation loss: 3.8091073006917804
Average validation accuracy: 67.65897906011773
BLEU-score: 0.2058748712565859


 18%|█▊        | 18/100 [6:03:56<29:28:24, 1293.96s/it]

-----------Metrics for Epoch 18-----------
Average training loss: 3.962007783906738
Average training accuracy: 65.60099881266717
Average validation loss: 3.7964951560225355
Average validation accuracy: 67.80296106744653
BLEU-score: 0.19787497513871777
Epochs since improvement: 1


 19%|█▉        | 19/100 [6:25:45<29:12:50, 1298.40s/it]

-----------Metrics for Epoch 19-----------
Average training loss: 3.9422114229894625
Average training accuracy: 65.841692452517
Average validation loss: 3.7799538403686053
Average validation accuracy: 68.01641408658648
BLEU-score: 0.1803975204055632
Epochs since improvement: 2


 20%|██        | 20/100 [6:47:29<28:53:36, 1300.20s/it]

-----------Metrics for Epoch 20-----------
Average training loss: 3.9238864055350966
Average training accuracy: 66.06056207652797
Average validation loss: 3.765887196562213
Average validation accuracy: 68.15643613500993
BLEU-score: 0.19478964416259492
Epochs since improvement: 3


 21%|██        | 21/100 [7:09:19<28:35:57, 1303.26s/it]
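
The `Epochs since improvement` counter in the log above is consistent with tracking the best BLEU score seen so far: it resets when an epoch's BLEU beats the previous best and increments otherwise. The following is a minimal sketch of that bookkeeping, using the BLEU values logged for epochs 12 to 20. The function name `epochs_since_improvement` is hypothetical, and the actual training loop may differ in its details.

```python
def epochs_since_improvement(bleu_scores):
    """Return, for each epoch after the first, the epochs elapsed since the best BLEU."""
    best = bleu_scores[0]          # assumption: the first score is the best so far
    counter = 0
    counters = []
    for score in bleu_scores[1:]:
        if score > best:
            best = score           # new best: reset the counter
            counter = 0
        else:
            counter += 1           # no improvement in this epoch
        counters.append(counter)
    return counters


# BLEU scores copied from the log for epochs 12 to 20
logged_bleu = [
    0.18175892224047738,  # epoch 12
    0.1828670718636703,   # epoch 13
    0.17614536809537082,  # epoch 14
    0.1947429163765393,   # epoch 15
    0.1979634245773051,   # epoch 16
    0.2058748712565859,   # epoch 17
    0.19787497513871777,  # epoch 18
    0.1803975204055632,   # epoch 19
    0.19478964416259492,  # epoch 20
]
print(epochs_since_improvement(logged_bleu))
```

The printed counters for epochs 13 to 20 can be compared against the `Epochs since improvement` lines in the log (an epoch without such a line corresponds to a counter of 0).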

## 6. Evaluation
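BLEU-4 alone is not sufficient to judge the quality of the image captioning model, because it only measures n-gram overlap with the reference captions. We therefore also generate captions for manually selected images and inspect them by hand. The following class implements evaluation on the test set as well as inference on a single image and on the thumbnails: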

In [10]:
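import torch.nn.functional as F
import pickle
from os import listdir
from os.path import isfile, join
import csv

# Names used by the class below; harmless if already imported earlier in the notebook
from PIL import Image
from torchvision import transforms
from nltk.translate.bleu_score import corpus_bleu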


class TestModel(object):
    
    def __init__(self):
       
        # Test parameters
        self.transform = transforms.Compose([ 
                    transforms.Resize(256),                          # smaller edge of image resized to 256
                    transforms.CenterCrop(224),                      # deterministic 224x224 center crop (no augmentation at test time)
                    transforms.ToTensor(),                           # convert the PIL Image to a tensor
                    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                                     (0.229, 0.224, 0.225))])

        self.test_mode = ''
        self.beam_size = ''
        self.args = ''
        self.vocab = ''
        self.vocab_size = ''
        self.bleu = list()
        
        # Image for inference
        self.image = ''
        self.image_name = ''
        
        # Thumbnails
        self.thumbnails = []
        
        # Dataloader for inference
        self.dataloader = ''
        
    @classmethod
    def MainTest(cls, test_mode, model_path, beam_size, args, dataloader = None):
        
        # Load parameters
        cls.test_mode = test_mode
        cls.beam_size = beam_size
        cls.args = args
        cls.bleu = list()                     # Evaluate() appends the BLEU score here in test mode
        
        TestModel.LoadVocab(cls)
        
        # Load the encoder and decoder networks
        encoder, decoder = TestModel.LoadModel(model_path)
        
        # If in test_mode, get test-set and evaluate based on it
        if test_mode == 'test':
            cls.dataloader = dataloader
            
        # If in inference_manually mode, ask for an image path and pre-process that image
        elif test_mode == 'inference_manually':
            
            image_path = str(input("Enter path to image here: "))
            cls.image = TestModel.TransformImage(cls, image_path)
            cls.image_name = image_path
            
        elif test_mode == 'inference_thumbnails':
            
            # Load thumbnails 
            path = args.dir_path + '/thumbnails.pkl'
            with open(path, 'rb') as file:
                thumbnails = pickle.load(file)
            
            # Save thumbnails into class variable
            cls.thumbnails = thumbnails
            
        
        # Run pre-trained model
        captions = TestModel.Evaluate(cls, encoder, decoder)
        
        if test_mode == 'inference_thumbnails':
            
            path = args.dir_path + '/captions.csv'
            
            # Save captions
            with open(path, 'w', newline='') as myfile:

                fieldnames = ['caption', 'thumbnail_id']
                wr = csv.DictWriter(myfile, quoting=csv.QUOTE_ALL, fieldnames=fieldnames)
                wr.writeheader()

                for i, row in enumerate(captions):
                    separator = ' '
                    caption = separator.join(row)
                    wr.writerow({'caption': caption, 'thumbnail_id': cls.thumbnails[i]})
    
            
        return cls()
    
    
    @staticmethod
    def LoadVocab(self):
        
        # Read the pickled vocabulary from the path stored in the arguments
        with open(self.args.vocab_path, 'rb') as file:
            self.vocab = pickle.load(file)
        self.vocab_size = len(self.vocab._token_to_idx)
        
        print(self.vocab_size)
            
            
    @staticmethod
    def LoadModel(model_path):
        
        # Load checkpoint from model_path
        checkpoint = torch.load(model_path, map_location = torch.device('cpu')) 
        encoder = checkpoint['encoder']
        decoder = checkpoint['decoder']
        
        # send encoder and decoder to device
        encoder.to(device)
        decoder.to(device)
        
        # set both networks to eval
        encoder.eval()
        decoder.eval()
        
        return encoder, decoder
    
 
    @staticmethod
    def TransformImage(self, image_path):
        
        # Convert image to tensor and pre-process using transform
        image = Image.open(image_path).convert('RGB')
        transform = transforms.Compose([ 
                    transforms.Resize(256),                          # smaller edge of image resized to 256
                    transforms.CenterCrop(224),                      # deterministic 224x224 center crop (no augmentation at inference)
                    transforms.ToTensor(),                           # convert the PIL Image to a tensor
                    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                                     (0.229, 0.224, 0.225))])
        image = transform(image).unsqueeze_(0)
        image = image.to(device)  # .to() is not in-place, so keep the returned tensor
        
        
        return image
    
    
    @staticmethod
    def Evaluate(self, encoder, decoder):
        
        # Depending on the test mode, either evaluate on the test-dataset or on a single image
        if self.test_mode == 'test':
            step_size = self.args.test_total_step
            
            # Add reference list for calculating BLEU scores
            true_captions = list() 
            predictions = list()
            
        # We don't need any reference point for our image
        elif self.test_mode == 'inference_manually':
            step_size = 1
            predictions = list()
            
        elif self.test_mode == 'inference_thumbnails':
            step_size = len(self.thumbnails)
            predictions = list()
            
            
        # Iterate over the steps and load a new dataset for every step
        for i_step in tqdm(range(1, step_size+1)):
        
            # Disable gradient calculation for evaluation
            with torch.no_grad():
                
                # Load beam_size
                k = self.beam_size
                
                # If test, get batch, else get image
                if self.test_mode == 'test':
                    # Get dataset for each test step
                    batch = self.dataloader.CreateDataloader(mode = 'test')
                    img, cap, cap_len, cap_reference = next(iter(batch))

                    # Move data to GPU
                    img = img.to(device)
                    cap = cap.to(device)
                    cap_len = cap_len.to(device)
                    
                elif self.test_mode == 'inference_manually':
                    img_name = self.image_name
                    img = self.image
                    img = img.to(device)
                    
                
                elif self.test_mode == 'inference_thumbnails':
                    img_name = self.thumbnails[i_step-1]
                    img_path = self.args.thumbnails + img_name
                    img = TestModel.TransformImage(self, img_path)
                    
                    img = img.to(device)
                    
                try:
                    # Forward propagation
                    encoder_out = encoder(img)  # (1, enc_image_size, enc_image_size, encoder_dim)
                    enc_image_size = encoder_out.size(1)
                    encoder_dim = encoder_out.size(3)

                    # Flatten encoding
                    encoder_out = encoder_out.view(1, -1, encoder_dim)  # (1, num_pixels, encoder_dim)
                    num_pixels = encoder_out.size(1)

                    # We'll treat the problem as having a batch size of k
                    encoder_out = encoder_out.expand(k, num_pixels, encoder_dim)  # (k, num_pixels, encoder_dim)

                    # Tensor to store top k previous words at each step; now they're just <start>
                    k_prev_words = torch.LongTensor([[self.vocab.lookup_token('[BOS]')]] * k).to(device)  # (k, 1)

                    # Tensor to store top k sequences; now they're just <start>
                    seqs = k_prev_words  # (k, 1)

                    # Tensor to store top k sequences' scores; now they're just 0
                    top_k_scores = torch.zeros(k, 1).to(device)  # (k, 1)

                    # Lists to store completed sequences and scores
                    complete_seqs = list()
                    complete_seqs_scores = list()

                    # Start decoding
                    step = 1
                    h, c = decoder.init_hidden_state(encoder_out)

                    # s is a number less than or equal to k, because sequences are removed from this process once they hit [EOS]
                    while True:

                        embeddings = decoder.embedding(k_prev_words).squeeze(1)  # (s, embed_dim)

                        awe, _ = decoder.attention(encoder_out, h)  # (s, encoder_dim), (s, num_pixels)

                        gate = decoder.sigmoid(decoder.f_beta(h))  # gating scalar, (s, encoder_dim)
                        awe = gate * awe

                        h, c = decoder.decode_step(torch.cat([embeddings, awe], dim=1), (h, c))  # (s, decoder_dim)

                        scores = decoder.fc(h)  # (s, vocab_size)
                        scores = F.log_softmax(scores, dim=1)

                        # Add
                        scores = top_k_scores.expand_as(scores) + scores  # (s, vocab_size)

                        # For the first step, all k points will have the same scores (since same k previous words, h, c)
                        if step == 1:
                            top_k_scores, top_k_words = scores[0].topk(k, 0, True, True)  # (s)

                        else:
                            # Unroll and find top scores, and their unrolled indices
                            top_k_scores, top_k_words = scores.view(-1).topk(k, 0, True, True)  # (s)

                        # Convert unrolled indices to actual indices of scores
                        prev_word_inds = top_k_words // self.vocab_size  # (s), floor division keeps integer indices
                        next_word_inds = top_k_words % self.vocab_size  # (s)

                        # Add new words to sequences
                        seqs = torch.cat([seqs[prev_word_inds], next_word_inds.unsqueeze(1)], dim=1)  # (s, step+1)

                        # Which sequences are incomplete (didn't reach <end>)?
                        incomplete_inds = [ind for ind, next_word in enumerate(next_word_inds) if
                                           next_word != self.vocab.lookup_token('[EOS]')]

                        complete_inds = list(set(range(len(next_word_inds))) - set(incomplete_inds))

                        # Set aside complete sequences
                        if len(complete_inds) > 0:
                            complete_seqs.extend(seqs[complete_inds].tolist())
                            complete_seqs_scores.extend(top_k_scores[complete_inds])

                        k -= len(complete_inds)  # reduce beam length accordingly

                        # Proceed with incomplete sequences
                        if k == 0:
                            break

                        seqs = seqs[incomplete_inds]
                        h = h[prev_word_inds[incomplete_inds]]
                        c = c[prev_word_inds[incomplete_inds]]
                        encoder_out = encoder_out[prev_word_inds[incomplete_inds]]
                        top_k_scores = top_k_scores[incomplete_inds].unsqueeze(1)
                        k_prev_words = next_word_inds[incomplete_inds].unsqueeze(1)

                        # Break if things have been going on too long
                        if step > 50:
                            break

                        step += 1
                    
                
                    i = complete_seqs_scores.index(max(complete_seqs_scores))
                    seq = complete_seqs[i]
                
                    if self.test_mode == 'test':
                        img_captions = []
                        for caption in cap_reference:
                            img_caption = self.dataloader.test_cap_tokens[caption] # Retrieve the caption based on its index
                            img_caption_len = self.dataloader.test_caps_len[caption] # Retrieve the length of the caption
                            img_caption = img_caption[1:(img_caption_len)] # Remove start-token and paddings
                            img_captions.append(img_caption) # Add all tokenized captions to the array

                        true_captions.append(img_captions)

                        # Beam search returns a single unpadded sequence, so only the start token
                        # is dropped to match the references (assumes one image per test step)
                        predictions.append(seq[1:])


                    else:

                        predicted_caption = list()

                        for token in seq:
                            word = self.vocab._idx_to_token[token]
                            predicted_caption.append(word)

                        # Save all predicted captions into predictions
                        predictions.append(predicted_caption)
                        
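                except Exception:
                    # Decoding failed (e.g. no sequence finished within the step limit): store an empty caption
                    predictions.append([])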
                    
                    
        if self.test_mode == 'test':        
            # Calculate BLEU-Scores using NLTK toolkit
            bleu = corpus_bleu(true_captions, predictions)

            self.bleu.append(bleu)
       
        return predictions
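
In the beam-search loop above, the scores of all `s` active beams are flattened into one vector of length `s * vocab_size` before the top `k` entries are selected. Each flat index then has to be split back into the beam it came from (floor division) and the word it selects (remainder). The following pure-Python sketch mirrors only that index arithmetic, not the model; the helper `top_k_flat` and the score table are made up for illustration.

```python
def top_k_flat(scores, k):
    """Pick the k best entries of a (beams x vocab) score table.

    Returns (beam_index, word_index, score) triples, best first.
    """
    vocab_size = len(scores[0])
    flat = [s for row in scores for s in row]           # unroll to beams * vocab_size
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]
    return [(i // vocab_size, i % vocab_size, flat[i]) for i in order]


# Toy log-probabilities for 2 beams over a 4-word vocabulary (illustrative values)
scores = [
    [-1.0, -0.2, -3.0, -2.5],   # continuations of beam 0
    [-0.7, -4.0, -0.1, -2.0],   # continuations of beam 1
]
for beam, word, score in top_k_flat(scores, k=2):
    print(beam, word, score)
```

In recent PyTorch versions, `/` on integer tensors performs true division and returns floats, so the beam index has to be computed with floor division to remain usable as an index.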
            

In [35]:
TestModel.MainTest('inference_thumbnails', args.pretrained_model, 1, args)

15028


100%|██████████| 60381/60381 [6:03:51<00:00,  2.77it/s]   


<__main__.TestModel at 0x1459f0e10>
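
In `inference_thumbnails` mode the captions are written to `captions.csv` with the columns `caption` and `thumbnail_id`. The following is a minimal sketch of reading that file back for manual inspection. The helper `load_captions` and the sample rows are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def load_captions(file_obj):
    """Return a dict mapping thumbnail_id to its caption string."""
    reader = csv.DictReader(file_obj)
    return {row['thumbnail_id']: row['caption'] for row in reader}


# Stand-in for open(args.dir_path + '/captions.csv', newline=''); rows are illustrative
sample = io.StringIO(
    '"caption","thumbnail_id"\r\n'
    '"[BOS] a man riding a horse [EOS]","scene_001.jpg"\r\n'
    '"","scene_002.jpg"\r\n'
)
captions = load_captions(sample)
print(captions['scene_001.jpg'])
```

An empty caption marks a thumbnail for which decoding failed, since the evaluation loop stores an empty token list in that case.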