# Assignment 3.3

# Image Caption Retrieval Model

### 1. Data preprocessing

We will use the Microsoft COCO (Common Objects in Context) data set to train our "Image Caption Retrieval Model". The data set provided here consists of precomputed 10-crop VGG19 features (Neural codes) and their corresponding text captions.


In [1]:
from __future__ import print_function

import os
import sys
import numpy as np
import pandas as pd
from collections import OrderedDict

#DATA_PATH = 'img_cap_coco' #(If Google colab)
DATA_PATH = 'data'
EMBEDDING_PATH = 'embeddings'
MODEL_PATH = 'models'

Create the directories above and place the provided data set in the directory 'data'.
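The directories can also be created from Python instead of by hand. The following is a minimal sketch; `base_dir` is a hypothetical temporary location used only for illustration, and in the notebook the paths would normally be relative to the working directory.

```python
import os
import tempfile

# Hypothetical base directory for illustration only.
base_dir = tempfile.mkdtemp()

created = []
for name in ('data', 'embeddings', 'models'):
    path = os.path.join(base_dir, name)
    # exist_ok=True avoids an error if the directory is already there
    os.makedirs(path, exist_ok=True)
    created.append(path)

print(created)
```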

In [6]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!wget https://storage.googleapis.com/trl_data/img_cap_coco.zip
!wget http://images.cocodataset.org/zips/val2014.zip

--2018-04-14 12:53:59--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2018-04-14 12:53:59--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2018-04-14 12:54:34 (23.7 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

--2018-04-14 12:54:35--  https://storage.googleapis.com/trl_data/img_cap_coco.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.141.128, 2607:f8b0:400c:c06::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.141.128|:443... connected.
HTTP request sent, awaiting response... 200 OK

In [20]:
!mkdir glove6B
!mv glove.6B.zip glove6B/

!unzip glove6B/glove.6B;
!unzip img_cap_coco;
!unzip val2014;

!mv glove.6B.100d.txt glove6B/
!mv glove.6B.50d.txt glove6B/
!mv glove.6B.200d.txt glove6B/
!mv glove.6B.300d.txt glove6B/

!ls

#rm -r foldername
#rm filename
#mv oldfoldername newfoldername

datalab  glove6B  img_cap_coco	img_cap_coco.zip  val2014  val2014.zip


In [104]:
!mkdir embeddings
!mkdir models

!ls

datalab     glove6B	  img_cap_coco.zip  val2014
embeddings  img_cap_coco  models	    val2014.zip


#### Reading pairs of image (VGG19 features) - caption data

In [0]:
# DO NOT CHANGE BELOW CODE

import collections

np_train_data = np.load(os.path.join(DATA_PATH,'train_data.npy'))
np_val_data = np.load(os.path.join(DATA_PATH,'val_data.npy'))

train_data = collections.OrderedDict()
for i in range(len(np_train_data.item())):
    cap =  np_train_data.item()['caps']
    img =  np_train_data.item()['ims']
    train_data['caps'] = cap
    train_data['ims'] = img
    
val_data = collections.OrderedDict()
for i in range(len(np_val_data.item())):
    cap =  np_val_data.item()['caps']
    img =  np_val_data.item()['ims']
    val_data['caps'] = cap
    val_data['ims'] = img

In [208]:
# example of caption
train_data['caps'][23]

b'a woman is working in a kitchen carrying a soft toy'

In [209]:
# example of pre-computed VGG19 features
val_data['ims'][1]

array([0.02239205, 0.00604904, 0.02322354, ..., 0.        , 0.00106503,
       0.00711824], dtype=float32)

#### Reading captions and information about their corresponding raw images from the Microsoft COCO website

In [0]:
# DO NOT CHANGE BELOW CODE
# use them for your own additional preprocessing step
# to map precomputed features and location of raw images 

import json

with open(os.path.join(DATA_PATH,'instances_val2014.json')) as json_file:
    coco_instances_val = json.load(json_file)
    
with open(os.path.join(DATA_PATH,'captions_val2014.json')) as json_file:
    coco_caption_val = json.load(json_file)

#### Additional preprocessing

In [211]:
# create your own function to map pairs of precomputed features and filepath of raw images
# this will be used later for visualization part
# simple approach: based on matched text caption (see json file)

# YOUR CODE HERE 
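def return_imagepair(number):
    # NOTE: pairs by position, i.e. it assumes entry `number` of
    # coco_instances_val['images'] belongs to row `number` of val_data['ims']
    img = coco_instances_val['images'][number]['flickr_url']
    features = val_data['ims'][number]
    return(img, features)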

print(return_imagepair(4999))

('http://farm8.staticflickr.com/7396/8750681361_391310447e_z.jpg', array([0.        , 0.        , 0.00112501, ..., 0.        , 0.        ,
       0.        ], dtype=float32))
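The comments in the cell above suggest matching on caption text. A minimal sketch of that idea follows, using toy stand-ins for `coco_caption_val` and `val_data['caps']`. It assumes the preprocessed captions differ from the JSON captions only in letter case, surrounding whitespace and a trailing period; that assumption should be checked against the real data. The helper names `normalize` and `build_caption_to_file` are hypothetical.

```python
# Toy stand-ins; the real objects are coco_caption_val and val_data['caps'].
toy_coco_captions = {
    'images': [
        {'id': 11, 'file_name': 'COCO_val2014_000000000011.jpg'},
        {'id': 42, 'file_name': 'COCO_val2014_000000000042.jpg'},
    ],
    'annotations': [
        {'image_id': 42, 'caption': 'A dog runs on the beach.'},
        {'image_id': 11, 'caption': 'Two giraffes standing near trees'},
    ],
}
toy_val_caps = [b'two giraffes standing near trees', b'a dog runs on the beach']

def normalize(text):
    # Assumed normalization: lowercase, drop a trailing period, collapse spaces
    return ' '.join(text.lower().strip().rstrip('.').split())

def build_caption_to_file(coco_captions):
    # image id -> file name, then normalized caption text -> file name
    id_to_file = {im['id']: im['file_name'] for im in coco_captions['images']}
    return {normalize(a['caption']): id_to_file[a['image_id']]
            for a in coco_captions['annotations']}

caption_to_file = build_caption_to_file(toy_coco_captions)
# None is returned for captions that have no match in the JSON file
files = [caption_to_file.get(normalize(c.decode())) for c in toy_val_caps]
print(files)
```

With five captions per image, looking up the first caption of each block of five would be enough to recover one file name per feature vector.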


#### Build vocabulary index 

In [212]:
# DO NOT CHANGE BELOW CODE

def build_dictionary(text):

    wordcount = OrderedDict()
    for cc in text:
        words = cc.split()
        for w in words:
            if w not in wordcount:
                wordcount[w] = 0
            wordcount[w] += 1
    words = list(wordcount.keys())
    freqs = list(wordcount.values())
    sorted_idx = np.argsort(freqs)[::-1]
    

    worddict = OrderedDict()
    worddict['<pad>'] = 0
    worddict['<unk>'] = 1
    for idx, sidx in enumerate(sorted_idx):
        worddict[words[sidx]] = idx+2  # 0: <pad>, 1: <unk>
    

    return worddict

# use the resulting vocabulary index as your look up dictionary
# to transform raw text into integer sequences

all_captions = []
all_captions = train_data['caps'] + val_data['caps']

# decode bytes to string format
caps = []
for w in all_captions:
    caps.append(w.decode())
    
words_indices = build_dictionary(caps)
print ('Dictionary size: ' + str(len(words_indices)))
indices_words = dict((v,k) for (k,v) in words_indices.items())

Dictionary size: 11473
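As an illustration of how a dictionary such as `words_indices` can turn raw text into an integer sequence, here is a small sketch with a toy vocabulary. `toy_worddict` and `encode_caption` are hypothetical names, and unknown words fall back to the `<unk>` index.

```python
# Toy vocabulary following the same convention: 0 is <pad>, 1 is <unk>.
toy_worddict = {'<pad>': 0, '<unk>': 1, 'a': 2, 'woman': 3, 'kitchen': 4, 'in': 5}

def encode_caption(caption, worddict):
    # Split on whitespace and map out-of-vocabulary words to <unk>
    unk = worddict['<unk>']
    return [worddict.get(w, unk) for w in caption.split()]

encoded = encode_caption('a woman in a kitchen carrying', toy_worddict)
print(encoded)  # [2, 3, 5, 2, 4, 1]
```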


### 2. Image - Caption Retrieval Model

### Image model

In [0]:
from keras.layers import Input, Dense
from keras.models import Model

inputs_image = Input(shape=(4096,))
image_dense = Dense(1024, activation='relu')(inputs_image)


### Caption model

In [214]:
# For embedding layer, initialize with pretrained word embedding (GloVe)

# Set up the glove embedding
GLOVE_DIR = 'glove6B'

embeddings_index = {}
# GloVe files contain non-ASCII tokens, so read them explicitly as UTF-8
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [0]:
# Transform all captions into integer sequences for the NN
#words_indices['rowboat']
#embeddings_index['rowboat']
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(caps)
sequences = tokenizer.texts_to_sequences(caps)
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1

In [0]:
from keras.preprocessing.sequence import pad_sequences

padded_caps = pad_sequences(sequences, maxlen=50)

In [0]:
# Create the embedding matrix
from numpy import zeros

embedding_matrix = zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [0]:
# only the layers used below are imported
from keras.layers import Flatten, Input, Dense, Embedding, LSTM, TimeDistributed, dot

# Create the caption model
#TODO: inputs_caption = Input(shape=(15,))
inputs_caption = Input(shape=(50,))
embed = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(inputs_caption)
lstm = LSTM(256, return_sequences=True)(embed)
dense = TimeDistributed(Dense(256, activation='relu'))(lstm)
flattened = Flatten()(dense)
caption_dense = Dense(1024, activation='relu')(flattened)

### Join model

In [0]:
# YOUR CODE HERE
dotproduct = dot([image_dense, caption_dense], axes=-1)
# layer for computing dot product between tensors
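To make the join step concrete, the sketch below reproduces in NumPy what a dot product along the last axis computes for a batch of 1024-dimensional codes: one similarity score per image-caption pair. The random arrays are placeholders for the outputs of the image and caption branches.

```python
import numpy as np

rng = np.random.RandomState(0)
image_codes = rng.rand(4, 1024)    # placeholder for image_dense outputs
caption_codes = rng.rand(4, 1024)  # placeholder for caption_dense outputs

# Row-wise dot product: multiply element-wise, then sum over the feature axis
scores = np.sum(image_codes * caption_codes, axis=-1, keepdims=True)
print(scores.shape)  # (4, 1)
```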

### Main model for training stage

In [220]:
# YOUR CODE HERE

# define your model input and output
print ("loading the training model")
training_model = Model(inputs=[inputs_image, inputs_caption], outputs=[dotproduct])

loading the training model


### Retrieval model

In [221]:
# YOUR CODE HERE

# define your model input and output
print ("loading sub-models for retrieving Neural codes")
caption_model = Model(inputs=inputs_caption, outputs=caption_dense)
image_model = Model(inputs=inputs_image, outputs=image_dense)

loading sub-models for retrieving Neural codes


### Loss function

We use a max-margin loss, which pushes the score of a positive example above the score of a negative example by a margin of at least 1. If $p_i$ is the score of the positive pair of the $i$-th example and $n_i$ the score of the negative pair of that example, the loss is:

\begin{equation*}
\text{loss} = \sum_i \max(0,\, 1 - p_i + n_i)
\end{equation*}
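A small NumPy sketch of this formula may help when checking a Keras implementation. It assumes the positive and negative scores are available as two separate arrays; `max_margin_loss_np` is a hypothetical helper, not part of the assignment code.

```python
import numpy as np

def max_margin_loss_np(pos_scores, neg_scores, margin=1.0):
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Each example contributes only if the positive score does not beat
    # the negative score by at least the margin
    return float(np.sum(np.maximum(0.0, margin - pos + neg)))

p = [2.0, 0.5, 1.0]
n = [0.5, 0.7, 1.0]
# Per-example terms: max(0, -0.5) = 0, max(0, 1.2) = 1.2, max(0, 1.0) = 1.0
loss_value = max_margin_loss_np(p, n)
print(loss_value)  # approximately 2.2
```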

In [0]:
from keras import backend as K


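def max_margin_loss(y_true, y_pred):
    
    # YOUR CODE HERE
    # y_true holds the dummy labels and is ignored.
    # NOTE: y_pred[0] and y_pred[1] select the first two rows along the batch
    # axis. They are positive and negative scores only if the model output is
    # stacked that way; training_model above outputs one score per pair.
    loss_ = K.sum(K.maximum(0.0, 1.0 - y_pred[0] + y_pred[1]))
    
    return loss_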
   

#### Accuracy metric for max-margin loss
How many times did the positive pair effectively get a higher value than the negative pair?
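The metric can be sketched in NumPy as the fraction of examples whose positive score is strictly higher than the negative score. As with the loss, this assumes the two sets of scores are given as separate arrays, and `margin_accuracy_np` is a hypothetical helper.

```python
import numpy as np

def margin_accuracy_np(pos_scores, neg_scores):
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Mean of a boolean array is the fraction of True entries
    return float(np.mean(pos > neg))

acc = margin_accuracy_np([2.0, 0.5, 1.0, 3.0], [0.5, 0.7, 1.0, 1.0])
print(acc)  # 0.5
```

Ties count as failures here, because the comparison is strict.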

In [0]:
# YOUR CODE HERE
def accuracy(y_true, y_pred):
    
    # YOUR CODE HERE
    accuracy_ = K.mean(y_pred[0] > y_pred[1])
    
    return accuracy_


### Compile model

In [224]:
# DO NOT CHANGE BELOW CODE
print ("compiling the training model")
training_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])

compiling the training model


### 3. Data preparation for training the model

* pad or truncate captions to a fixed maximum length (50 words)
* sample one caption for each image, while shuffling the image data
* encode captions as integer sequences using the look-up vocabulary index
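The fixed-length step can be sketched in plain NumPy. The helper below, `pad_pre` (a hypothetical name), mirrors the default behaviour of Keras `pad_sequences`, which pads with zeros at the front and, for sequences that are too long, drops tokens from the front.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # leave the row fully padded
        trunc = seq[-maxlen:]            # keep the last maxlen tokens
        out[row, -len(trunc):] = trunc   # right-align, padding on the left
    return out

padded = pad_pre([[5, 8, 2], [7, 1, 4, 9, 3, 6]], maxlen=5)
print(padded)
# [[0 0 5 8 2]
#  [1 4 9 3 6]]
```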

In [0]:
# sampling one caption per image
# return image_ids, caption_ids
from random import random
from math import floor

def sampling_img_cap(data):
    
    ims = data['ims']
    image_ids = []
    caption_ids = []
    for ids in range(0, len(ims)):
        # Random caption offset in 0..4 (each image has 5 captions)
        i = floor(random() * 5)
        caption = ids * 5 + i
        image_ids.append(ids)
        caption_ids.append(caption)
    
    image_ids = np.array(image_ids)
    caption_ids = np.array(caption_ids)
    return image_ids, caption_ids


In [0]:
# transform raw text caption into integer sequences of fixed maximum length

def prepare_caption(caption_ids, caption_data):
    
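    # YOUR CODE HERE
    # Captions were already tokenized and padded above (padded_caps), with the
    # training captions first and the validation captions after them.
    caption_seqs = []
    
    for i in caption_ids:
        if (len(caption_data) == 50000):
            # training captions occupy the first 50000 rows of padded_caps
            caption_seqs.append(padded_caps[i])
        else:
            # validation captions follow, so skip the 50000 training rows
            caption_seqs.append(padded_caps[i+50000])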
       
    return np.stack(caption_seqs)

In [0]:
# DO NOT CHANGE BELOW CODE

train_caps = []
for cap in train_data['caps']:
    train_caps.append(cap.decode())

val_caps = []
for cap in val_data['caps']:
    val_caps.append(cap.decode())

In [0]:
# DO NOT CHANGE BELOW CODE

train_image_ids, train_caption_ids = sampling_img_cap(train_data)
val_image_ids, val_caption_ids = sampling_img_cap(val_data)

x_caption = prepare_caption(train_caption_ids, train_caps)
x_image = train_data['ims'][np.array(train_image_ids)]

x_val_caption = prepare_caption(val_caption_ids, val_caps)
x_val_image = val_data['ims'][np.array(val_image_ids)]

### 4. Create noise set for negative examples of image-fake caption and dummy output

Notice that we do not have real labeled output for training the model. Keras expects labels, so we need to create dummy output, which is a NumPy array of zeros. These dummy labels are never used, since we compute the loss from the margin between positive examples (image-real caption) and negative examples (image-fake caption).

In [0]:
# YOUR CODE HERE
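def create_noise_caption_ids(image_ids):
    # image i owns captions i*5 .. i*5+4 (image_ids are 0..N-1 in order)
    num_captions = len(image_ids) * 5
    
    caption_ids = []
    i = 0
    
    # draw one fake caption per image, rejecting the image's own captions
    while i < len(image_ids):
        _id = np.random.randint(0, num_captions)  # upper bound is exclusive
        if (_id >= (i*5) and _id <= (i*5+4)):
            continue
        else:
            caption_ids.append(_id)
            i = i + 1
            
    return np.array(caption_ids)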

train_noise_caption_ids = create_noise_caption_ids(image_ids=train_image_ids)
val_noise_caption_ids = create_noise_caption_ids(image_ids=val_image_ids)

train_noise = prepare_caption(train_noise_caption_ids, train_caps)
val_noise = prepare_caption(val_noise_caption_ids, val_caps)

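# one dummy label per image, sized from the data instead of hard-coded
y_train_labels = np.zeros((len(train_image_ids),), dtype=int)
y_val_labels = np.zeros((len(val_image_ids),), dtype=int)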
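Negative sampling can also be done without a rejection loop. The sketch below is one possible vectorized variant, assuming image `i` owns the consecutive caption ids `i*5` to `i*5+4` and that there are at least two images. The function name `sample_noise_caption_ids` and the `seed` parameter are hypothetical.

```python
import numpy as np

def sample_noise_caption_ids(num_images, caps_per_image=5, seed=None):
    rng = np.random.RandomState(seed)
    total = num_images * caps_per_image
    own_start = np.arange(num_images) * caps_per_image
    # Draw from the (total - caps_per_image) captions of the other images
    draws = rng.randint(0, total - caps_per_image, size=num_images)
    # Draws at or beyond the image's own block are shifted past that block
    return np.where(draws >= own_start, draws + caps_per_image, draws)

noise_ids = sample_noise_caption_ids(num_images=1000, seed=0)
print(noise_ids[:5])
```

Each image gets exactly one fake caption, drawn uniformly from all captions that belong to other images.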

### 5. Training model

In [0]:
# YOUR CODE HERE

X_train = [x_image, x_caption]
Y_train = y_train_labels
X_valid = [x_val_image, x_val_caption]
Y_valid = y_val_labels


In [231]:
# YOUR CODE HERE

# fit the model on training and validation set
batch_size = 250
epochs = 10

training_model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_valid, Y_valid))

Train on 10000 samples, validate on 5000 samples
Epoch 1/10


ResourceExhaustedError: ignored

#### Storing models and weight parameters

In [0]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, 'weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'image_model.h5'))

### 6. Feature extraction (Neural codes)

In [0]:
# YOUR CODE HERE

# Use caption_model and image_model to produce "Neural codes" 
# for both image and caption from validation set
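Once Neural codes are available (for instance from `image_model.predict` and `caption_model.predict`), retrieval reduces to ranking candidates by similarity to the query. The sketch below uses toy 2-dimensional codes and the dot product as the score, to match the training objective. `top_n_by_dot` is a hypothetical helper, not the assignment's `get_caption`.

```python
import numpy as np

def top_n_by_dot(query_code, candidate_codes, n=10):
    # One dot-product score per candidate
    scores = candidate_codes.dot(query_code)
    # Indices of the n highest scores, best first
    order = np.argsort(-scores)[:n]
    return order, scores[order]

toy_caption_codes = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [0.7, 0.7],
                              [-1.0, 0.0]])
toy_image_code = np.array([1.0, 0.2])

idx, top_scores = top_n_by_dot(toy_image_code, toy_caption_codes, n=2)
print(idx)  # [0 2]
```

The same function covers image retrieval by swapping the roles: the query is a caption code and the candidates are image codes.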

### 7. Caption Retrieval

#### Display original image as query and its ground truth caption

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing import image

In [0]:
# YOUR CODE HERE

# choose one image_id from validation set
# use this id to get filepath of image
img_id = 
filepath_image = 

# display original caption
original_caption = 
print(original_caption)

# DO NOT CHANGE BELOW CODE
img = image.load_img(os.path.join(IMAGE_DATA,filepath_image), target_size=(224,224))
plt.imshow(img)
plt.axis("off")
plt.show()

In [0]:
# function to retrieve caption, given an image query

def get_caption(image_filename, n=10):   
    
    # YOUR CODE HERE


In [0]:
# DO NOT CHANGE BELOW CODE
get_caption(filepath_image)

Briefly discuss the results: why or how does caption retrieval work, and where and why do you think it fails?

#### Answer:

=== write your answer here ===

### 8. Image Retrieval

In [0]:
# given text query, display retrieved image, similarity score, and its original caption 

def search_image(text_caption, n=10):
    
    # YOUR CODE HERE
    

Consider using the following settings for the image retrieval task.

* use a real caption from the validation set as the query.
* use part of a caption as the query. For instance, instead of the whole caption sentence, use a key phrase or a combination of words that appears in the corresponding caption.

In [0]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 

# DO NOT CHANGE BELOW CODE
search_image(text1)

In [0]:
# YOUR QUERY-2
text2 = 

# DO NOT CHANGE BELOW CODE
search_image(text2)

Briefly discuss the results: why or how does image retrieval work, and where and why do you think it fails?

#### Answer:

=== write your answer here ===