@authors
* Arseniy Ashuha (contact: ```ars.ashuha@gmail.com```)
* Based on https://github.com/ebenolson/pydata2015

<h1 align="center"> Part II: Attention mechanism @ Image Captioning </h1> 

<img src="https://s2.postimg.org/pq18f5t7t/deepbb.png" width=480>

In this seminar you'll be going through the image captioning pipeline.

To begin, download the dataset of image features extracted by a pre-trained GoogleNet (see the instructions in the chat).

### Data preprocessing

In [5]:
# Load dataset
import numpy as np

captions = np.load("../download/train-data-captions.npy")
img_codes = np.load("../download/train-data-svdfeatures.npy").astype('float32')

In [6]:
print ("each image code is a 6x6 feature matrix from GoogleNet:", img_codes.shape)
print (img_codes[0,:10,0,0])
print ('\n\n')
print ("for each image there are 5-7 descriptions, e.g.:\n")
print ('\n'.join(captions[0]))

each image code is a 6x6 feature matrix from GoogleNet: (82783, 128, 6, 6)
[-19.53911972   3.23637891   2.08816719   0.66493636  -2.9185071
   1.82758021  -1.3254329   -0.53197509  -3.15473676  -1.8739953 ]



for each image there are 5-7 descriptions, e.g.:

People shopping in an open market for vegetables.
An open market full of people and piles of vegetables.
People are shopping at an open air produce market.
Large piles of carrots and potatoes at a crowded outdoor market.
People shop for vegetables like carrots and potatoes at an open air market.


In [7]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [8]:
# Build a Vocabulary
from collections import Counter
word_counts = Counter()
for img_captions in captions:
    for caption in img_captions:
        word_counts.update(caption)

In [9]:
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
vocab = list(set(vocab))
n_tokens = len(vocab)

assert 12000 <= n_tokens <= 15000

word_to_index = {w: i for i, w in enumerate(vocab)}
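
To see what the frequency threshold does, here is a minimal sketch on a toy corpus. The corpus, the threshold `min_count=2`, and the helper name `build_vocab` are made up for illustration. Sorting the frequent words is an extra choice that is not in the cell above: it keeps token indices the same from run to run, whereas the iteration order of a `set` of strings can change between interpreter sessions.

```python
from collections import Counter

def build_vocab(tokenized_captions, min_count, specials=("#UNK#", "#START#", "#END#")):
    """Hypothetical helper: special tokens first, then frequent words in sorted order."""
    counts = Counter()
    for caption in tokenized_captions:
        counts.update(caption)
    frequent = sorted(w for w, c in counts.items() if c >= min_count and w not in specials)
    return list(specials) + frequent

toy_captions = [
    ["#START#", "a", "dog", "runs", "#END#"],
    ["#START#", "a", "cat", "sleeps", "#END#"],
    ["#START#", "a", "dog", "sleeps", "#END#"],
]
toy_vocab = build_vocab(toy_captions, min_count=2)
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
print(toy_vocab)  # "runs" and "cat" occur once, so they are left out
```

Words that fall below the threshold are later mapped to `#UNK#` when captions are converted to indices.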

We'll use this function to convert sentences into a network-readable matrix of token indices.

When given several sentences of different length, it pads them with -1.

In [10]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')
START_ix = vocab.index("#START#")
END_ix = vocab.index("#END#")

#good old as_matrix for the third time
def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix

def to_string(tokens_ix):
    assert len(np.shape(tokens_ix))==1,"to_string works on one sequence at a time"
    tokens_ix = list(tokens_ix)[1:]
    if END_ix in tokens_ix:
        tokens_ix = tokens_ix[:tokens_ix.index(END_ix)]
    return " ".join([vocab[i] for i in tokens_ix])

In [11]:
#try it out on several descriptions of a random image
as_matrix(captions[1337])

array([[ 9903,  7541,  2673,  9869,  3759,  2030,     0,   961,  1303,
        10923,  2030, 11889, 12796,    -1,    -1,    -1],
       [ 9903,  7541,  2673, 13076,   896,  6449,  8395, 12878,  9404,
          167,  3140, 12796,    -1,    -1,    -1,    -1],
       [ 9903,  7541, 10231,  2054, 13004,  3562,  9840,   961,  4424,
         6075,  7050,  7345, 10447,   474,  2554, 12796],
       [ 9903,  1926,   961,  5786,  4591,  9073,  2565, 10013,  2030,
         3062,  3161,  2030,  1411,  7345,  9451, 12796],
       [ 9903,  7541, 10795,  4960,   961,  5786,  2054, 13004,  3759,
         2030,  2554, 12796,    -1,    -1,    -1,    -1]], dtype=int32)

In [12]:
to_string(as_matrix(captions[1337])[0])

'A woman standing on a  tennis court holding a racquet.'
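
The same round trip can be checked by hand on a tiny vocabulary. This is a sketch only: the six-word vocabulary and the `toy_` names are invented for the example, and the logic mirrors `as_matrix` and `to_string` above.

```python
import numpy as np

toy_vocab = ["#UNK#", "#START#", "#END#", "a", "dog", "sleeps"]
toy_index = {w: i for i, w in enumerate(toy_vocab)}
PAD, UNK, END = -1, toy_index["#UNK#"], toy_index["#END#"]

def toy_as_matrix(sequences, max_len=None):
    """Pack token lists into an int32 matrix, padding short rows with PAD."""
    max_len = max_len or max(len(s) for s in sequences)
    matrix = np.full((len(sequences), max_len), PAD, dtype="int32")
    for i, seq in enumerate(sequences):
        row = [toy_index.get(w, UNK) for w in seq[:max_len]]
        matrix[i, :len(row)] = row
    return matrix

def toy_to_string(row):
    """Drop the leading #START#, cut at the first #END#, and join the words."""
    ids = list(row)[1:]
    if END in ids:
        ids = ids[:ids.index(END)]
    return " ".join(toy_vocab[i] for i in ids)

batch = toy_as_matrix([
    ["#START#", "a", "dog", "sleeps", "#END#"],
    ["#START#", "a", "zebra", "#END#"],      # "zebra" is out of vocabulary
])
print(batch)
print(toy_to_string(batch[1]))
```

The shorter caption is padded with `-1` on the right, and the out-of-vocabulary word comes back as `#UNK#`.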

### The neural network

Since the image encoder CNN is already applied, the only remaining part is to write a sentence decoder.


In [16]:
import theano, theano.tensor as T
import lasagne
from lasagne.layers import *
from lasagne.init import Normal
theano.config.compute_test_value = 'ignore'
theano.config.warn_float64 = 'raise'

class AttentionWeights(MergeLayer):
    def __init__(self, encoder_seq, attn_query, num_units):
        MergeLayer.__init__(self, [encoder_seq, attn_query])
        
        enc_units = encoder_seq.output_shape[2]
        dec_units = attn_query.output_shape[1]
        
        self.W_enc = self.add_param(Normal(), (enc_units, num_units), name='enc_to_hid')
        self.W_query = self.add_param(Normal(), (dec_units, num_units), name='dec_to_hid')
        self.W_out = self.add_param(Normal(), (num_units, ),name='hid_to_logit')
    
    def get_output_for(self, inputs):
        # the encoder_sequence shape = [batch, time, units]
        # the query shape = [batch, units]
        encoder_sequence, query = inputs
        
        # Hidden layer activations, shape [batch,seq_len,hid_units]
        
        query_to_hid = query.dot(self.W_query)[:,None,:]
        
        # Contribution from encoder to hid, shape: [batch, time, units]
        enc_to_hid = encoder_sequence.astype('float32').dot(self.W_enc)
        
        hid = T.tanh(query_to_hid+enc_to_hid)
        
        # Logits from hidden, [batch_size, seq_len]
        logits = T.dot(hid, self.W_out)
        
        assert logits.ndim ==2, "Logits must have shape [batch,time] and be 2-dimensional."\
                                "Current amount of dimensions:"+str(logits.ndim)
        
        attn_weights = T.nnet.softmax(logits)
        return attn_weights
    
    def get_output_shape_for(self,input_shapes):
        enc_shape,query_shape = input_shapes
        return enc_shape[:-1]

class AttentionOutput(MergeLayer):
    def __init__(self, encoder_seq, attn_weights):
        MergeLayer.__init__(self,[encoder_seq,attn_weights])
    
    def get_output_for(self,inputs):
        # encoder_sequence shape = [batch,time,units]
        # attn_weights shape = [batch,time]
        encoder_sequence, attn_weights = inputs
    
        #Reshape attn_weights to make 'em 3-dimensional: [batch,time,1] - so you could multiply by encoder sequence
        attn_weights = attn_weights.reshape([attn_weights.shape[0],attn_weights.shape[1],1])
        
        #Compute attention response by summing encoder elements with weights along time axis (axis=1)
        attn_output = (attn_weights * encoder_sequence.astype('float32')).sum(1)
        return attn_output
    
    def get_output_shape_for(self,input_shapes):
        enc_shape,query_shape = input_shapes
        return (enc_shape[0],enc_shape[-1])
# network shapes. 
EMBEDDING_SIZE = 128    #Change at your will
LSTM_SIZE  = 256        #Change at your will
ATTN_SIZE  = 256        #Change at your will
FEATURES,HEIGHT,WIDTH = img_codes.shape[1:]
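
To make the shapes concrete, the two layers above can be mirrored in plain NumPy. This is a sketch under assumptions, not the Lasagne implementation: the function name `additive_attention`, the random weights, and the sizes (batch of 2, a 16-unit query, 32 hidden units) are chosen only for illustration.

```python
import numpy as np

def additive_attention(encoder_seq, query, W_enc, W_query, w_out):
    """encoder_seq: [batch, time, enc_units]; query: [batch, dec_units]."""
    # Project both inputs into a shared hidden space; the query broadcasts over time.
    hid = np.tanh(encoder_seq @ W_enc + (query @ W_query)[:, None, :])  # [batch, time, hid]
    logits = hid @ w_out                                                # [batch, time]
    # Softmax over the time axis (shifted by the max for numerical stability).
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights = exp / exp.sum(axis=1, keepdims=True)
    # Weighted sum of encoder vectors along the time axis.
    output = (weights[:, :, None] * encoder_seq).sum(axis=1)            # [batch, enc_units]
    return weights, output

rng = np.random.default_rng(0)
batch, channels, height, width, dec_units, hid_units = 2, 128, 6, 6, 16, 32

# [batch, channels, h, w] -> [batch, h*w, channels]: one feature vector per image cell,
# which is the [batch, time, units] layout the attention layers expect.
feature_maps = rng.normal(size=(batch, channels, height, width)).astype("float32")
encoder_seq = feature_maps.transpose(0, 2, 3, 1).reshape(batch, height * width, channels)

weights, output = additive_attention(
    encoder_seq,
    rng.normal(size=(batch, dec_units)).astype("float32"),
    rng.normal(scale=0.01, size=(channels, hid_units)).astype("float32"),
    rng.normal(scale=0.01, size=(dec_units, hid_units)).astype("float32"),
    rng.normal(scale=0.01, size=(hid_units,)).astype("float32"),
)
print(weights.shape, output.shape)
```

Each row of `weights` is a distribution over the 36 image cells, and `output` is the corresponding weighted average of their feature vectors.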


We will define a single LSTM step here. An LSTM step should:
* take previous cell/out and input
* compute next cell/out and next token probabilities
* use attention to work with image features

In [20]:
from agentnet.resolver import ProbabilisticResolver
from agentnet.memory import LSTMCell

temperature = theano.shared(1.)
class decoder:
    prev_word = InputLayer((None,),name='index of previous word')
    image_features = InputLayer((None,FEATURES,HEIGHT,WIDTH),name='img features')

    prev_cell = InputLayer((None,LSTM_SIZE),name='previous LSTM cell goes here')
    prev_out = InputLayer((None,LSTM_SIZE),name='previous LSTM output goes here')
    
    prev_word_emb = EmbeddingLayer(prev_word,len(vocab),EMBEDDING_SIZE)
    
    ###Attention part:
    # Please implement attention part of rnn architecture
    
    #First we reshape image into a sequence of image vectors
    image_features_seq = reshape(dimshuffle(image_features,[0,2,3,1]),[[0],-1,[3]])
    
    #Then we apply attention just as usual
    attn_probs = AttentionWeights(image_features_seq, prev_word_emb, 32)
    attn = AttentionOutput(image_features_seq, attn_probs)

    lstm_input = concat([attn,prev_word_emb],axis=-1)

    new_cell,new_out = LSTMCell(prev_cell,prev_out,lstm_input)
    
    
    output_probs = DenseLayer(new_out,len(vocab),nonlinearity=T.nnet.softmax)

    
    output_probs_scaled = ExpressionLayer(output_probs,lambda p: p**temperature)
    output_tokens = ProbabilisticResolver(output_probs_scaled,assume_normalized=False)
    
    
    # recurrent state transition dict
    # on next step, {key} becomes {value}
    transition = {
        new_cell:prev_cell,
        new_out:prev_out
    }
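
Note that `p**temperature` yields unnormalized scores, which is presumably why the resolver is called with `assume_normalized=False`. The NumPy sketch below shows the effect of that exponent on a toy distribution; the helper name `sharpen` and the probabilities are made up for illustration. With this parameterization a larger exponent makes sampling closer to greedy, and an exponent below 1 flattens the distribution.

```python
import numpy as np

def sharpen(probs, power):
    """Raise probabilities to `power` and renormalize along the last axis."""
    scaled = probs ** power
    return scaled / scaled.sum(axis=-1, keepdims=True)

probs = np.array([0.5, 0.3, 0.2])
for power in (0.5, 1.0, 10.0):
    print(power, np.round(sharpen(probs, power), 3))

# Sampling one token id from the rescaled distribution.
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=sharpen(probs, 10.0))
```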

### Training

During training, we should feed our decoder RNN with reference captions from the dataset. Training then comes down to a simple likelihood maximization problem.

Deep learning people also know this as minimizing cross-entropy.
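
Written out, the objective is the average negative log-likelihood of each next token, skipping padded positions. The notation here is introduced only for this formula and does not appear elsewhere in the notebook: $x_i$ are the features of image $i$, $w_{i,t}$ is the $t$-th token of its caption, and $m_{i,t}$ is 1 when $w_{i,t+1}$ is a real token and 0 when it is padding.

$$
\mathcal{L}(\theta) = -\frac{\sum_{i}\sum_{t} m_{i,t}\,\log p_\theta\left(w_{i,t+1}\mid w_{i,1},\dots,w_{i,t},\,x_i\right)}{\sum_{i}\sum_{t} m_{i,t}}
$$

Minimizing $\mathcal{L}$ is the same as maximizing the likelihood of the reference captions under the model.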

In [21]:
# Inputs for sentences
sentences = T.imatrix("[batch_size x time] of word ids")
l_sentences = InputLayer((None,None),sentences)

# Input layer for image features
image_vectors = T.tensor4("image features [batch,channels,h,w]")
l_image_features = InputLayer((None,FEATURES,HEIGHT,WIDTH),image_vectors)


In [22]:
from agentnet import Recurrence

decoder_trainer = Recurrence(
    input_sequences={decoder.prev_word:l_sentences},
    input_nonsequences={decoder.image_features:l_image_features},
    state_variables=decoder.transition,
    tracked_outputs=[decoder.output_probs],
    unroll_scan = False,
)

In [24]:
#get predictions and define loss
next_token_probs = get_output(decoder_trainer[decoder.output_probs])

next_token_probs = next_token_probs[:,:-1].reshape([-1,len(vocab)])
next_tokens = sentences[:,1:].ravel()

loss = T.nnet.categorical_crossentropy(next_token_probs,next_tokens)

#apply mask
mask = T.neq(next_tokens,PAD_ix)
loss = T.sum(loss*mask.astype('float32'))/T.sum(mask.astype('float32'))
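
The shift-and-mask logic is easy to check in NumPy on a toy batch. This is a sketch with invented data: a vocabulary of 4 tokens and a stand-in "network" that predicts a uniform distribution, so the masked loss should come out as $\log 4 \approx 1.386$.

```python
import numpy as np

PAD = -1
vocab_size = 4

# Two toy captions as token ids; the second is padded with PAD.
sentences = np.array([[1, 3, 2],
                      [1, 2, PAD]])

# Stand-in for network output: uniform probabilities, shape [batch, time, vocab].
probs = np.full((2, 3, vocab_size), 1.0 / vocab_size)

# The prediction at step t is scored against the token at step t + 1.
pred = probs[:, :-1].reshape(-1, vocab_size)      # [batch * (time - 1), vocab]
target = sentences[:, 1:].ravel()                 # [batch * (time - 1)]

mask = (target != PAD).astype("float32")
# PAD (-1) indexes the last vocabulary entry here; the mask removes those terms.
nll = -np.log(pred[np.arange(len(target)), target])
loss = (nll * mask).sum() / mask.sum()
print(loss)
```

Dividing by `mask.sum()` rather than by the number of positions keeps the loss comparable across batches with different amounts of padding.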

In [25]:
#trainable NN weights
weights = get_all_params(decoder_trainer,trainable=True)
updates = lasagne.updates.adam(loss,weights)

In [26]:
#compile functions for training and evaluation
#please note that your functions must accept image features as the FIRST param and sentences as the second one
train_step = theano.function([image_vectors,sentences],loss,updates=updates,allow_input_downcast=True)
val_step   = theano.function([image_vectors,sentences],loss,allow_input_downcast=True)
#for val_step use deterministic=True if you have any dropout/noise

# Training

* You first have to implement a batch generator.
* Then the network is trained the usual way.

In [27]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0,len(images),size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = list(map(choice,captions_for_batch_images))
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix

In [28]:
bx,by = generate_batch(img_codes,captions,3)
bx[0,:10,0,0],by

(array([-2.07359767, -0.97894174, -2.05806088, -1.89956403, -6.82178402,
         2.34008193, -4.10232496, -5.30042219,  0.74239367,  3.04254794], dtype=float32),
 array([[ 9903,  7541,  2673,  3602,  2565,  9840,  7612,  6075,  3759,
          2030,  1521,  4386,  2030,  1082,  4761, 12796],
        [ 9903, 12151,  6826, 10060,  9315, 13113, 11145,  3352,  2030,
          4374,   138, 12796,    -1,    -1,    -1,    -1],
        [ 9903,  1926,  2651,   585,  4591, 10777,  3562,  3331,  8395,
         12098, 12796,    -1,    -1,    -1,    -1,    -1]], dtype=int32))

### Main loop
* We recommend periodically evaluating the network using the "apply trained model" block below.
 *  It's safe to interrupt training, run a few examples, and start training again.

In [38]:
batch_size=50 #adjust me
n_epochs=100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch


In [39]:
from tqdm import tqdm

for epoch in range(n_epochs):
    
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    
    print('Epoch: {}, train loss: {}'.format(epoch, train_loss))

print("Finish :)")

100%|██████████| 50/50 [01:46<00:00,  1.83s/it]
Epoch: 0, train loss: 4.900748100280762

100%|██████████| 50/50 [02:00<00:00,  2.87s/it]
Epoch: 1, train loss: 4.772661609649658

100%|██████████| 50/50 [02:12<00:00,  2.31s/it]
Epoch: 2, train loss: 4.624748344421387

100%|██████████| 50/50 [01:41<00:00,  2.05s/it]
Epoch: 3, train loss: 4.4516128730773925

100%|██████████| 50/50 [01:57<00:00,  2.82s/it]
Epoch: 4, train loss: 4.364750647544861

100%|██████████| 50/50 [01:48<00:00,  2.00s/it]
Epoch: 5, train loss: 4.240622897148132

100%|██████████| 50/50 [01:45<00:00,  2.03s/it]
Epoch: 6, train loss: 4.185291972160339

100%|██████████| 50/50 [01:42<00:00,  2.07s/it]
Epoch: 7, train loss: 4.117165808677673

100%|██████████| 50/50 [01:45<00:00,  2.01s/it]
Epoch: 8, train loss: 4.059547119140625

100%|██████████| 50/50 [1:07:47<00:00, 1176.73s/it]
Epoch: 9, train loss: 4.006289596557617

100%|██████████| 50/50 [02:16<00:00,  2.16s/it]
Epoch: 10, train loss: 3.9884975719451905

100%|██████████| 50/50 [01:45<00:00,  2.24s/it]
Epoch: 11, train loss: 3.8984531450271604

100%|██████████| 50/50 [01:45<00:00,  2.41s/it]
Epoch: 12, train loss: 3.8852854490280153

100%|██████████| 50/50 [01:53<00:00,  2.51s/it]
Epoch: 13, train loss: 3.8390228080749513

100%|██████████| 50/50 [01:47<00:00,  2.04s/it]
Epoch: 14, train loss: 3.7889770793914797

100%|██████████| 50/50 [01:46<00:00,  1.98s/it]
Epoch: 15, train loss: 3.8258581352233887

100%|██████████| 50/50 [01:50<00:00,  2.22s/it]
Epoch: 16, train loss: 3.7717241764068605

100%|██████████| 50/50 [01:47<00:00,  1.99s/it]
Epoch: 17, train loss: 3.733306703567505

100%|██████████| 50/50 [01:50<00:00,  2.34s/it]
Epoch: 18, train loss: 3.6542920112609862

100%|██████████| 50/50 [01:51<00:00,  2.29s/it]
Epoch: 19, train loss: 3.72977352142334

100%|██████████| 50/50 [17:35<00:00, 141.63s/it]
Epoch: 20, train loss: 3.645867781639099

 36%|███▌      | 18/50 [00:42<01:05,  2.05s/it]

KeyboardInterrupt: 
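
The loop above reports only the training loss, although `val_step` and `n_validation_batches` are already defined. One way to use them is sketched below. The helper name `mean_validation_loss` and the stand-in functions are hypothetical, and the sketch assumes a held-out set of images and captions is available, which this notebook does not load.

```python
import numpy as np

def mean_validation_loss(val_step, make_batch, n_batches):
    """Average `val_step` over `n_batches` batches drawn by `make_batch`."""
    losses = [float(val_step(*make_batch())) for _ in range(n_batches)]
    return float(np.mean(losses))

# Stand-ins so the sketch runs on its own; in the notebook these would be the
# compiled `val_step` and a `generate_batch` call on held-out images/captions.
rng = np.random.default_rng(0)
fake_make_batch = lambda: (rng.normal(size=(4, 128, 6, 6)), rng.integers(0, 10, size=(4, 7)))
fake_val_step = lambda images, captions: 0.1 * captions.shape[1]

val_loss = mean_validation_loss(fake_val_step, fake_make_batch, n_batches=5)
print("val loss: {:.3f}".format(val_loss))
```

Calling such a helper once per epoch, right after the training batches, would let you print both losses side by side and spot overfitting.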

### Apply trained model

In [40]:
batch_size = theano.shared(np.int32(1))
MAX_LENGTH = 20         #Change at your will

In [41]:
#set up recurrent network that generates tokens and feeds them back to itself
unroll_dict = dict(decoder.transition)
unroll_dict[decoder.output_tokens] = decoder.prev_word #on next iter, output goes to input

first_output = T.repeat(T.constant(START_ix,dtype='int32'),batch_size)
init_dict = {
    decoder.output_tokens:InputLayer([None],first_output)
}

decoder_applier = Recurrence(
    input_nonsequences={decoder.image_features:l_image_features},
    state_variables=unroll_dict,
    state_init = init_dict,
    tracked_outputs=[decoder.output_probs,decoder.output_tokens],
    n_steps = MAX_LENGTH,
)

In [42]:
theano.config.warn_float64 = 'ignore'
generated_tokens = get_output(decoder_applier[decoder.output_tokens])

generate = theano.function([image_vectors],generated_tokens,allow_input_downcast=True)
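
The feedback idea behind `decoder_applier` can be sketched without Theano: sample a token, feed it back as the next input, repeat. Everything in the example is a stand-in: `sample_caption`, `step_fn`, and the 4-token transition table are invented for illustration. A real LSTM step would also carry the cell and output state, which is omitted here, and unlike the fixed `n_steps = MAX_LENGTH` unroll above this sketch stops as soon as the end token appears.

```python
import numpy as np

def sample_caption(step_fn, start_ix, end_ix, max_length, rng):
    """Feed each sampled token back in as the next input until END or max_length."""
    tokens = [start_ix]
    for _ in range(max_length):
        probs = step_fn(tokens[-1])          # distribution over the next token
        next_token = int(rng.choice(len(probs), p=probs))
        tokens.append(next_token)
        if next_token == end_ix:
            break
    return tokens

# Toy deterministic "model" over 4 tokens: 0=#START#, 1="a", 2="dog", 3=#END#.
transitions = np.array([
    [0.0, 1.0, 0.0, 0.0],   # #START# -> "a"
    [0.0, 0.0, 1.0, 0.0],   # "a"     -> "dog"
    [0.0, 0.0, 0.0, 1.0],   # "dog"   -> #END#
    [0.0, 0.0, 0.0, 1.0],   # #END#   -> #END#
])
caption = sample_caption(lambda tok: transitions[tok], start_ix=0, end_ix=3,
                         max_length=20, rng=np.random.default_rng(0))
print(caption)
```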

In [48]:
from pretrained_lenet import image_to_features
import matplotlib.pyplot as plt
%matplotlib inline

img = plt.imread("./data/Dog-and-Cat.jpg")
plt.imshow(img)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

In [46]:
features = image_to_features(img)

#sample a fresh caption on every iteration
temperature.set_value(10)
for _ in range(100):
    output_ix = generate([features])[0]
    print (to_string(output_ix))

NameError: name 'image_to_features' is not defined

### Some tricks (for further research)

* Initialize LSTM with some function of image features.

* Try other attention functions.

* If you train a large network, it is usually a good idea to make a 2-stage prediction:
    1. (large recurrent state) -> (bottleneck e.g. 256)
    2. (bottleneck) -> (vocabulary size)
    * this way you won't need to store/train a (large_recurrent_state x vocabulary size) matrix
    
* Use [hierarchical softmax](https://gist.github.com/justheuristic/581853c6d6b87eae9669297c2fb1052d) or [byte pair encodings](https://github.com/rsennrich/subword-nmt)
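
To get a feel for the savings from the 2-stage prediction above, here is a quick parameter count. The sizes are illustrative assumptions: a 1024-unit recurrent state, a 256-unit bottleneck, and a 13000-word vocabulary (within the range asserted earlier in the notebook).

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

state_size, bottleneck, vocab_size = 1024, 256, 13000   # illustrative sizes

direct = dense_params(state_size, vocab_size)
two_stage = dense_params(state_size, bottleneck) + dense_params(bottleneck, vocab_size)
print(direct, two_stage, round(direct / two_stage, 2))
```

With these sizes the direct output layer has about 13.3M parameters against about 3.6M for the bottleneck version, roughly a 3.7x reduction.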


