# Building an Image Captioning System (Part 1 of 2)
Before building a video captioning system, we first build an image captioning system. 
An image captioning system takes an image as input and produces a caption for it. Most modern image captioning systems based on deep learning employ a two-stage model:
1. Extract important features from the image, typically with a CNN.
2. Feed the resulting feature vector into an RNN to generate a caption.

In this notebook we go through the details of how this can be done.

We study the image captioning system built by Yunjey Choi; the corresponding GitHub repository is <a href="https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning">here</a>. 

This notebook is joint work by Alibek Orynbassar and Birzhan Moldagaliyev.

In [0]:
!pip install -q torch==1.0.0 torchvision

# let us try to understand the CNN part first
import torch
import torch.nn as nn
import torchvision.models as models
from torch.nn.utils.rnn import pack_padded_sequence

class Encoder(nn.Module):
  def __init__(self, embed_size):
    super(Encoder, self).__init__()
    resnet18 = models.resnet18(pretrained=True)
    # drop the final fully connected (classification) layer
    modules = list(resnet18.children())[:-1]
    self.resnet = nn.Sequential(*modules)
    # project the pooled ResNet features to embed_size
    self.linear = nn.Linear(resnet18.fc.in_features, embed_size)
    self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)
    
  def forward(self, images):
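    # the pretrained backbone is a fixed feature extractor, so no gradients are tracked
    with torch.no_grad():
      features = self.resnet(images)
    # flatten (batch, channels, 1, 1) to (batch, channels)
    features = features.view(features.size(0), -1)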
    features = self.linear(features)
    features = self.bn(features)
    return features

# let us perform a sanity check
embed_size = 300
images = torch.randn(2, 3, 224, 224)

encoder = Encoder(embed_size)
print(encoder(images))
  


Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.torch/models/resnet18-5c106cde.pth
100%|██████████| 46827520/46827520 [00:01<00:00, 24638957.01it/s]


tensor([[ 2.6998e-01, -9.8909e-01, -3.7659e-01,  5.2393e-01, -6.5969e-01,
         -2.4887e-01, -9.9046e-01, -1.7060e-01,  5.1260e-01,  7.3862e-01,
         -1.1618e-01,  8.6441e-01,  5.3538e-01,  8.0630e-02,  8.9902e-01,
         -1.3806e-01, -5.5699e-02,  2.8007e-01,  2.2073e-01,  4.9998e-01,
          7.6076e-01,  4.4532e-01,  5.4514e-01,  8.3451e-01,  1.7190e-01,
         -9.5332e-01, -4.9066e-01,  8.8761e-01, -9.5369e-01, -3.3473e-01,
          3.8383e-01, -3.8949e-01,  8.5460e-01,  8.9659e-01,  4.9491e-01,
         -5.1683e-01, -4.9997e-01, -3.1168e-01, -5.3843e-01, -9.2439e-02,
          2.1483e-01, -8.5620e-01,  8.2038e-01, -4.9159e-01, -3.8859e-01,
         -2.0815e-01,  3.4762e-01, -4.2014e-01, -7.3013e-01, -1.6654e-01,
         -9.1146e-01, -2.8602e-01,  7.9210e-01, -9.7904e-01, -6.6130e-01,
          1.5624e-01, -5.5582e-01, -9.1548e-01, -4.0081e-01,  1.5074e-01,
          2.8860e-01,  2.5784e-01, -3.5420e-01, -3.9437e-01,  1.2254e-01,
         -4.3627e-01,  7.5056e-01,  4.

In [0]:
# Let us now attempt to understand the decoder

class Decoder(nn.Module):
  def __init__(self, embed_size, hidden_size, vocab_size, num_layers, max_seq_length=20):
    super(Decoder, self).__init__()
    self.embed = nn.Embedding(vocab_size, embed_size)
    self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
    self.linear = nn.Linear(hidden_size, vocab_size)
    self.max_seq_length = max_seq_length
    
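  def forward(self, features, captions, lengths):
    # embed caption tokens and prepend the image feature as the first time step
    embeddings = self.embed(captions)
    embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
    # pack so that the LSTM skips padded positions
    packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
    hiddens, _ = self.lstm(packed)
    # hiddens[0] is the packed data tensor; map each hidden state to vocabulary scores
    outputs = self.linear(hiddens[0])
    return outputs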
    
  def sample(self, features, states=None):
    sample_ids = []
    inputs = features.unsqueeze(1)
    for i in range(self.max_seq_length):
      hiddens, states = self.lstm(inputs, states)
      outputs = self.linear(hiddens.squeeze(1))
      _, predicted = outputs.max(1)
      sample_ids.append(predicted)
      inputs = self.embed(predicted)
      # restore the sequence dimension: (batch, embed_size) -> (batch, 1, embed_size)
      inputs = inputs.unsqueeze(1)
    sample_ids = torch.stack(sample_ids, 1)
    return sample_ids

# Init Part


## embed
Here (embed) maps word indices to lower-dimensional vectors. For example, suppose we have two kinds of words: 'True' and 'False'. We also need a symbol for the empty word, which we denote as < PAD >. So we have three kinds of words in total: 'True', 'False' and '< PAD >'. We index these words from 0 to 2. Then we can use (embed) to map these indices to two-dimensional vectors. 

In [0]:
import numpy as np
# batch_size = 3
raw_stream = [['True', 'True', 'True', 'False', 'False', 'True'], ['True', 'False', 'True'], ['True', 'True']]
stream_dic = {'<PAD>':0, 'True':1, 'False':2}
indexed_stream = [[stream_dic[label] for label in seq] for seq in raw_stream]
lengths = [len(seq) for seq in indexed_stream]
pad_token = stream_dic['<PAD>']
longest_seq = max(lengths)
batch_size = len(indexed_stream)
padded_stream = np.ones((batch_size, longest_seq))*pad_token
for i, seq_len in enumerate(lengths):
  padded_stream[i,:seq_len] = indexed_stream[i]
padded_stream = torch.tensor(padded_stream, dtype=torch.long)
print(padded_stream)

# we can now embed the padded stream into lower dimensional representation
initial_dim = 3
lower_dim = 2
embed = nn.Embedding(initial_dim, lower_dim)
embedded_stream = embed(padded_stream)
print(embedded_stream)

tensor([[1, 1, 1, 2, 2, 1],
        [1, 2, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0]])
tensor([[[ 0.6346,  0.3061],
         [ 0.6346,  0.3061],
         [ 0.6346,  0.3061],
         [ 0.7149, -0.0028],
         [ 0.7149, -0.0028],
         [ 0.6346,  0.3061]],

        [[ 0.6346,  0.3061],
         [ 0.7149, -0.0028],
         [ 0.6346,  0.3061],
         [-0.6702,  0.8191],
         [-0.6702,  0.8191],
         [-0.6702,  0.8191]],

        [[ 0.6346,  0.3061],
         [ 0.6346,  0.3061],
         [-0.6702,  0.8191],
         [-0.6702,  0.8191],
         [-0.6702,  0.8191],
         [-0.6702,  0.8191]]], grad_fn=<EmbeddingBackward>)
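
In the output above, every occurrence of the same index produces the same vector. That is because an embedding layer is a lookup table: the following minimal sketch checks that calling the layer is the same as indexing rows of its weight matrix. The `padding_idx` argument is shown as an optional extra that the original code does not use; `embed_pad` and `looked_up` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Three symbols (<PAD>, True, False) embedded into two dimensions.
embed = nn.Embedding(num_embeddings=3, embedding_dim=2)

indices = torch.tensor([[1, 2, 1, 0], [1, 1, 0, 0]])
embedded = embed(indices)

# Calling the layer is equivalent to selecting rows of the weight matrix.
looked_up = embed.weight[indices]
same = torch.equal(embedded, looked_up)

# Optional: padding_idx initializes the <PAD> row to zeros and keeps it out of gradient updates.
embed_pad = nn.Embedding(3, 2, padding_idx=0)
pad_row_is_zero = bool((embed_pad.weight[0] == 0).all())

print(embedded.shape)          # torch.Size([2, 4, 2])
print(same, pad_row_is_zero)   # True True
```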


## lstm
lstm is an nn.LSTM module, a recurrent layer built from LSTM cells that processes a whole sequence in one call. Let's play around with a small LSTM. Throughout this notebook we assume that the batch comes as the first dimension, i.e. batch_first = True.

In [0]:
# let's try to understand how an LSTM works
input_size = 3
seq_len = 7
hidden_size = 2
batch_size = 3
num_layers = 1
num_directions = 1

lstm_cell = nn.LSTM(input_size=input_size, hidden_size=hidden_size, 
                    num_layers=num_layers, batch_first=True)
h0 = torch.randn(num_layers*num_directions, batch_size, hidden_size)
c0 = torch.randn(num_layers*num_directions, batch_size, hidden_size)
sample = torch.randn(batch_size, seq_len, input_size)
out1, (hn,cn) = lstm_cell(sample, (h0,c0))
print('out1: {}'.format(out1))
# if one omits (h0,c0) they are set to 0 tensors
out2, (hn,cn) = lstm_cell(sample)
print('out2: {}'.format(out2))


out1: tensor([[[ 0.3128, -0.1324],
         [ 0.2009, -0.2255],
         [-0.0066, -0.3524],
         [ 0.1208, -0.2915],
         [ 0.1950, -0.1549],
         [-0.1604, -0.5076],
         [-0.3904, -0.4340]],

        [[-0.2046, -0.4704],
         [-0.2986, -0.2940],
         [ 0.2040, -0.2584],
         [ 0.1660, -0.0242],
         [ 0.1723,  0.2223],
         [ 0.1695,  0.3431],
         [-0.2145, -0.0970]],

        [[ 0.0815, -0.0457],
         [ 0.2478, -0.1699],
         [-0.1294, -0.2739],
         [ 0.2193, -0.1089],
         [ 0.0750, -0.3292],
         [ 0.0517, -0.3054],
         [-0.1402, -0.3411]]], grad_fn=<TransposeBackward0>)
out2: tensor([[[-0.0616, -0.2216],
         [ 0.1166, -0.4554],
         [-0.0496, -0.4591],
         [ 0.1112, -0.3907],
         [ 0.1878, -0.1846],
         [-0.1666, -0.5360],
         [-0.3946, -0.4410]],

        [[-0.3823, -0.2191],
         [-0.3518, -0.2678],
         [ 0.1840, -0.2433],
         [ 0.1615, -0.0217],
         [ 0.1712,  0.
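
The relationship between the returned output and the final states is easy to miss in the printed tensors, so here is a small shape-level sketch under the same sizes as above. It assumes a single-layer, unidirectional LSTM; `last_matches` is an illustrative name.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 3, 7, 3, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
sample = torch.randn(batch_size, seq_len, input_size)

out, (hn, cn) = lstm(sample)

# out holds the hidden state at every time step: (batch, seq_len, hidden_size).
# hn and cn hold only the final states: (num_layers * num_directions, batch, hidden_size).
# Note that batch_first=True affects out but not the layout of hn and cn.
print(out.shape, hn.shape, cn.shape)

# For a single-layer, unidirectional LSTM the last time step of out equals hn.
last_matches = torch.allclose(out[:, -1, :], hn[-1])
print(last_matches)   # True
```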

## linear
A standard fully connected layer: it applies the affine transformation y = xW^T + b to the last dimension of its input.

In [0]:
batch_size = 3
input_size = 3
output_size = 4

linear = nn.Linear(input_size, output_size)
sample_input = torch.randn(batch_size, input_size)
output = linear(sample_input)
print(output)

tensor([[-0.3086, -0.5280,  0.0065,  1.4702],
        [ 0.1297, -0.4131, -0.1351,  0.1972],
        [ 0.2923, -0.6071, -0.6792,  0.4919]], grad_fn=<AddmmBackward>)
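
As a quick check of what the layer computes, the sketch below reproduces its output by hand from the weight and bias. It also shows that the layer only acts on the last dimension, which matters later when it is applied to sequences of hidden states; `manual` and `x_seq` are illustrative names.

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=3, out_features=4)
x = torch.randn(3, 3)

# nn.Linear stores its weight with shape (out_features, in_features).
manual = x @ linear.weight.t() + linear.bias
matches = torch.allclose(linear(x), manual, atol=1e-6)

# The layer acts on the last dimension only, so extra leading dimensions are allowed.
x_seq = torch.randn(2, 7, 3)
print(matches, linear(x_seq).shape)   # True torch.Size([2, 7, 4])
```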


## max_seq_length
max_seq_length sets the upper bound on the number of words in a caption generated by sample.
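
The greedy loop in sample can be sketched with standalone layers to see where max_seq_length enters. This is an illustration with untrained, randomly initialized layers and a random stand-in for the encoder output, so the generated indices are meaningless; only the shapes and control flow are of interest.

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size, max_seq_length = 5, 4, 3, 6

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

features = torch.randn(2, embed_size)   # stand-in for the encoder output
inputs = features.unsqueeze(1)          # (batch, 1, embed_size)
states = None                           # the LSTM starts from zero states
sampled_ids = []

with torch.no_grad():
    for _ in range(max_seq_length):
        hiddens, states = lstm(inputs, states)    # hiddens: (batch, 1, hidden_size)
        scores = linear(hiddens.squeeze(1))       # (batch, vocab_size)
        predicted = scores.argmax(dim=1)          # greedy choice: (batch,)
        sampled_ids.append(predicted)
        inputs = embed(predicted).unsqueeze(1)    # feed the prediction back in

sampled_ids = torch.stack(sampled_ids, dim=1)     # (batch, max_seq_length)
print(sampled_ids.shape)   # torch.Size([2, 6])
```

The loop always runs for exactly max_seq_length steps, so every generated caption has that many tokens.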

# Forward Part
To understand the forward pass, we can feed in some synthetic data and see what happens at each step. Let us try to generate a sequence of nucleotides based on some given feature. There are 4 types of DNA nucleotides, denoted 'G', 'C', 'A' and 'T'. We also have a padding symbol '< PAD >'. So, in total we have 5 symbols. We use the following indexing to turn symbols into integers: {< PAD >: 0, G: 1, C: 2, A: 3, T: 4}.  


In [0]:
# set up data
vocab_size = 5
batch_size = 2
feature_size = 4
embed_size = 4
hidden_size = 3
num_layers = 1
max_seq_length = 6

# Features. Since the batch size is 2, we need to generate 2 feature vectors of size 4
features = torch.randn(batch_size, feature_size) 

# Captions. Suppose that the gene sequences corresponding to these features are as follows
# feature 1: [G, C, T, T, A, C] or [1, 2, 4, 4, 3, 2] using the indexing
# feature 2: [T, G, G, A] or [4, 1, 1, 3] using the indexing
caption_1 = [1, 2, 4, 4, 3, 2]
caption_2 = [4, 1, 1, 3]
# lengths stores the lengths of the captions before padding
lengths = [6, 4]
# pad caption_2 with <PAD> (index 0) so that both captions have the same length
caption_2 = [4, 1, 1, 3, 0, 0]
captions = torch.tensor([caption_1, caption_2], dtype=torch.long)
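
The manual padding above can also be done with a library helper. The sketch below uses `pad_sequence` from `torch.nn.utils.rnn` as an alternative, not as what the original notebook does; `sequences` is an illustrative name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

caption_1 = torch.tensor([1, 2, 4, 4, 3, 2])
caption_2 = torch.tensor([4, 1, 1, 3])

sequences = [caption_1, caption_2]
lengths = [len(seq) for seq in sequences]

# Pads every sequence to the length of the longest one using index 0 (<PAD>).
captions = pad_sequence(sequences, batch_first=True, padding_value=0)

print(lengths)    # [6, 4]
print(captions)
# tensor([[1, 2, 4, 4, 3, 2],
#         [4, 1, 1, 3, 0, 0]])
```

The sequences are listed from longest to shortest here, which is the order that `pack_padded_sequence` expects by default.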





In [0]:
# set up infrastructure
embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)
# max_seq_length (defined above) is only needed for sampling, not for the forward pass

# set up rnn utilities
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

In [0]:
# run the forward pass, printing the dimensions of tensors along the way
embeddings = embed(captions)
print("embeddings' dimensions: {}".format(embeddings.size()))

# we then prepend feature vectors as the first elements of corresponding captions
featured_embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
print("featured_embeddings' dimensions: {}".format(featured_embeddings.size()))

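# we then pack featured_embeddings so that the LSTM skips padded positions;
# with lengths = [6, 4], only the first 6 and first 4 time steps of the two 7-step rows are kept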
packed = pack_padded_sequence(featured_embeddings, lengths, batch_first=True)
print("Packed sequence output:\n{}".format(packed))

# we then run packed sequence through lstm getting a packed output at the end
packed_hiddens, _ = lstm(packed)
print("packed_output {}".format(packed_hiddens))

# in principle, we could carry on using the packed format, provided that we remember the packing order.
packed_final = linear(packed_hiddens[0])
print("dimensions of packed_final: {}".format(packed_final.size()))

# the alternative is to reverse the packing, and only then apply the linear transformation. This way seems 
# more natural, but it is less efficient, because the linear layer also runs on padded positions.
unpacked_hiddens = pad_packed_sequence(packed_hiddens, batch_first=True)[0] # [0] index to get a tensor
print("dimensions for unpacked_hidden: {}".format(unpacked_hiddens.size()))

# we can then apply the linear transformation
unpacked_final = linear(unpacked_hiddens)
print("dimensions for unpacked_final: {}".format(unpacked_final.size()))

embeddings' dimensions: torch.Size([2, 6, 4])
featured_embeddings' dimensions: torch.Size([2, 7, 4])
Packed sequence output:
PackedSequence(data=tensor([[-0.7754,  0.6657,  0.9380,  0.4559],
        [ 0.3809,  0.5994,  0.5325,  1.1054],
        [-0.7672, -0.1533, -0.6961,  2.4644],
        [ 0.8890,  1.3946, -1.1777,  0.2074],
        [ 0.3553, -0.9340, -1.1730, -0.8144],
        [-0.7672, -0.1533, -0.6961,  2.4644],
        [ 0.8890,  1.3946, -1.1777,  0.2074],
        [-0.7672, -0.1533, -0.6961,  2.4644],
        [ 0.8890,  1.3946, -1.1777,  0.2074],
        [ 1.6336,  1.2927,  0.6528, -0.2332]],
       grad_fn=<PackPaddedSequenceBackward>), batch_sizes=tensor([2, 2, 2, 2, 1, 1]))
packed_output PackedSequence(data=tensor([[ 8.9204e-02,  1.1213e-01, -1.0824e-01],
        [ 1.2020e-01,  1.1709e-01, -5.8015e-02],
        [ 1.8218e-01,  3.7615e-01, -1.5257e-01],
        [ 2.4978e-01,  1.3320e-01, -2.1004e-01],
        [ 1.9492e-01,  1.4153e-01,  1.7107e-01],
        [ 2.8571e-01,  3.5327
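
The `batch_sizes=tensor([2, 2, 2, 2, 1, 1])` field in the packed output above records how many sequences are still active at each time step: both sequences for the first four steps, then only the longer one, for 10 rows in total. The following sketch makes the storage order visible on a tiny example with hand-picked values; the tensor `padded` and its contents are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences with one feature per step; 0 marks padding.
# Row 0 has 3 valid steps, row 1 has 2.
padded = torch.tensor([[[1.], [2.], [3.]],
                       [[4.], [5.], [0.]]])
lengths = [3, 2]

packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Data is stored time step by time step:
# step 0 of both rows, step 1 of both rows, then step 2 of row 0 only.
print(packed.data.squeeze(1).tolist())   # [1.0, 4.0, 2.0, 5.0, 3.0]
print(packed.batch_sizes.tolist())       # [2, 2, 1]

# Unpacking restores the padded layout and also returns the lengths.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(torch.equal(unpacked, padded), out_lengths.tolist())   # True [3, 2]
```

Because the packed data is time-major, row i of a tensor computed from it (such as packed_final) does not correspond to caption i; the batch_sizes field is needed to map rows back to sequences.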