# Project Introduction 

This project serves as the second mini-project for MATH6380O.

Image caption generation is the task of producing a readable, concise description of the contents of a photograph. It is a challenging artificial intelligence problem because it requires both computer vision techniques to interpret the photograph and natural language processing techniques to generate the textual description. In this project we implement a deep learning approach that has achieved good results on this problem: a pre-trained CNN extracts image features, and an LSTM-based language model generates the caption.

# Team Information

Team members: Chunyan BAI, Yuan CHEN, Haoye CAI, Wenshuo GUO

Distribution of Work: Chunyan BAI and Yuan CHEN are mainly in charge of coding; Haoye CAI and Wenshuo GUO are mainly in charge of writing up the report. All members actively participated in project preparations and discussions.


# Dataset

In this project, we use the Flickr8k dataset, which comprises about 8,000 photos with up to 5 captions for each photo.

This dataset is publicly available at http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html and contains images obtained from the Flickr website. It comes with a pre-defined training set (6,000 images), development/validation set (1,000 images), and test set (1,000 images).

# Generate Training Examples from Image-Caption Pairs
The first step of our project is to prepare the image-text pairs for training a deep image captioning model. The two modalities are preprocessed separately: we first extract features from the images in advance, then preprocess the text data to form image-caption pairs.


## Feature Extraction using VGGNet

We use a pre-trained model to first extract features from the images. There are many models
to choose from. In this case, we use VGG-16. In order to save computational resources, we pre-compute
the image features using the pre-trained model and save them to files. We can then load these
features later and feed them into our model as the interpretation of a given photo in the dataset.

We remove the last layer from the loaded VGG model so that we can get the internal representation of the images
 before the classification layer. These are the features that the model has extracted from the images.

Below is a function `extract_features()` that,
given a directory name, loads each photo, prepares it for VGG, and collects the predicted
features from the VGG model. The image features are 4096-dimensional vectors.
The function returns a dictionary mapping image identifiers to image features. We call this function to prepare the image data for training and evaluating our models, then save the
results to a file named `features.pkl`.


In [1]:
!ls -l Flickr8k_text/

total 9892
-rw-r--r-- 1 mark mark 2918552 Oct 15  2013 CrowdFlowerAnnotations.txt
-rw-r--r-- 1 mark mark  346674 Oct 15  2013 ExpertAnnotations.txt
-rw-r--r-- 1 mark mark   25801 Oct 11  2013 Flickr_8k.devImages.txt
-rw-r--r-- 1 mark mark 3244761 Feb 16  2012 Flickr8k.lemma.token.txt
-rw-r--r-- 1 mark mark   25775 Oct 11  2013 Flickr_8k.testImages.txt
-rw-r--r-- 1 mark mark 3395237 Oct 15  2013 Flickr8k.token.txt
-rw-r--r-- 1 mark mark  154678 Oct 11  2013 Flickr_8k.trainImages.txt
-rw-r--r-- 1 mark mark    1821 Oct 15  2013 readme.txt


In [5]:
from os import listdir
from os import path
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model; drop the last softmax layer corresponding
    # to the 1000 ImageNet classes and output the 4096-d layer before it
    # (model.layers.pop() does not rewire the graph in recent Keras versions)
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = path.join(directory, name)
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
#         print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Features Extracted


## Preprocess the Textual Caption Data
The dataset contains multiple descriptions for each image. First, we load the file containing all of the descriptions. Then we need to do some cleaning to preprocess the data, and prepare our vocabulary dictionary according to the dataset.

In [6]:
!head -10 Flickr8k_text/Flickr8k.token.txt

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3	A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4	A little girl in a pink dress going into a wooden cabin .
1001773457_577c3a7d70.jpg#0	A black dog and a spotted dog are fighting
1001773457_577c3a7d70.jpg#1	A black dog and a tri-colored dog playing with each other on the road .
1001773457_577c3a7d70.jpg#2	A black dog and a white dog with brown spots are staring at each other in the street .
1001773457_577c3a7d70.jpg#3	Two dogs of different breeds looking at each other on the road .
1001773457_577c3a7d70.jpg#4	Two dogs on pavement moving toward each other .


In [7]:
import string
import re

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
             mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

def clean_descriptions(descriptions):
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    for _, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [re_punc.sub('', w) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word)>1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] =  ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a list of all description strings
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
             lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')

Loaded: 8092 
Vocabulary Size: 8763


**Sample captions before preprocessing**

In [8]:
!head -5 Flickr8k_text/Flickr8k.token.txt

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3	A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4	A little girl in a pink dress going into a wooden cabin .


**Sample captions after preprocessing**

In [9]:
!head -5 descriptions.txt

1000268201_693b08cb0e child in pink dress is climbing up set of stairs in an entry way
1000268201_693b08cb0e girl going into wooden building
1000268201_693b08cb0e little girl climbing into wooden playhouse
1000268201_693b08cb0e little girl climbing the stairs to her playhouse
1000268201_693b08cb0e little girl in pink dress going into wooden cabin
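
The cleaning rules can be checked on a single caption. The sketch below is a standalone re-statement of the steps in `clean_descriptions()` (lower-casing, punctuation removal, dropping one-character tokens, dropping tokens that are not purely alphabetic). The helper name `clean_caption` is an assumption made for this illustration and is not part of the notebook.

```python
import re
import string

# Regex matching any single punctuation character
_PUNCT = re.compile('[%s]' % re.escape(string.punctuation))

def clean_caption(caption):
    """Hypothetical helper: apply the cleaning rules to one caption string."""
    # lower-case and tokenize on white space
    words = caption.lower().split()
    # strip punctuation from each token ('.' becomes an empty string)
    words = [_PUNCT.sub('', w) for w in words]
    # drop one-character tokens such as 'a' and the emptied punctuation
    words = [w for w in words if len(w) > 1]
    # drop tokens that contain digits or other non-letters
    words = [w for w in words if w.isalpha()]
    return ' '.join(words)

raw = 'A child in a pink dress is climbing up a set of stairs in an entry way .'
print(clean_caption(raw))
# child in pink dress is climbing up set of stairs in an entry way
```

Note that hyphenated words are merged rather than split, so `tri-colored` becomes `tricolored`.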


Each image has a unique identifier, which is used both in the image filename and in the
text file of descriptions. After the above preprocessing, we step through the entire list of image descriptions. Below, `load_set()` reads a pre-defined list of photo identifiers, and `load_clean_descriptions()` returns a dictionary
that maps each identifier in that set to a list of one or more textual
descriptions, each wrapped in `startseq` and `endseq` tokens. `load_photo_features()` loads the matching pre-computed image features.

In [1]:
from pickle import load

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

Dataset: 6000
Descriptions: train=6000
Photos: train=6000


## Generate Training Examples

Now we have prepared the image-caption pairs. We can then generate training examples.

In [2]:
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
''' The word embedding layer expects input sequences to be comprised of integers. We can map
each word in our vocabulary to a unique integer and encode our input sequences. Later, when
we make predictions, we can convert the prediction to numbers and look up their associated
words in the same mapping. To do this encoding, we will use the Tokenizer class in the Keras API.
First, the Tokenizer must be trained on the entire training dataset, which means it finds
all of the unique words in the data and assigns each a unique integer. We can then use the fit
Tokenizer to encode all of the training sequences, converting each sequence from a list of words
to a list of integers.'''
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

Using TensorFlow backend.


### Generate a training example from an Image-Caption pair

Each image-caption pair serves as a training example that needs to be further adapted to work with our language model.
We can now encode the text. Each description is split into words. The LSTM language model is
given the image features and the words seen so far, and predicts the next word.
One forward pass of the model therefore predicts one word. For example, the caption `little girl running in field` is split into 6 input-output pairs to train the model:



|X1| X2 (text sequence)| y (word)|
|:---:|:---|:---:|
|photo1| startseq |little |
|photo1| startseq, little| girl|
|photo1|startseq, little, girl| running|
|photo1| startseq, little, girl, running| in|
|photo1|startseq, little, girl, running, in |field|
|photo1| startseq, little, girl, running, in, field| endseq|


**Table: Example of how a photo-caption pair (Photo1-`little girl running in field`)  is transformed into input and output training examples**

When the model is used to generate descriptions, the generated words are concatenated
and recursively fed back as input to generate a caption for an image. The function
below, `create_sequences()`, transforms the data into input-output pairs
for training the model, given the tokenizer, a maximum sequence length, and the
dictionaries of descriptions and photo features. Note that it also reads the global `vocab_size`, which is set when the tokenizer is prepared. The model has two inputs, one for the image features and
one for the encoded text, and one output: the encoded next word in
the text sequence.

The input text is encoded as integers, which are then fed to a word embedding layer. The
image features will be fed directly to another part of the model. The model will output a
prediction, which will be a probability distribution over all words in the vocabulary. 
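
The splitting scheme in the table can be sketched without Keras. The example below is a simplified illustration, not the notebook's implementation: the toy `word_index`, the helper name `make_pairs`, and `max_length=8` are assumptions chosen for readability, and the one-hot encoding of the output word is omitted. Inputs are left-padded with zeros, which mirrors the default `'pre'` padding of `pad_sequences`.

```python
def make_pairs(caption, word_index, max_length):
    """Split one caption into (padded input ids, next-word id) pairs."""
    seq = [word_index[w] for w in caption.split()]
    pairs = []
    for i in range(1, len(seq)):
        # everything before position i is the input, the word at i is the target
        in_seq, out_word = seq[:i], seq[i]
        # left-pad with zeros up to the fixed input length
        padded = [0] * (max_length - len(in_seq)) + in_seq
        pairs.append((padded, out_word))
    return pairs

caption = 'startseq little girl running in field endseq'
# Toy vocabulary (assumption); index 0 is reserved for padding
word_index = {w: i + 1 for i, w in enumerate(caption.split())}

pairs = make_pairs(caption, word_index, max_length=8)
for x, y in pairs:
    print(x, '->', y)
```

A caption with 7 tokens (including `startseq` and `endseq`) yields 6 pairs, which is why the number of training samples is much larger than the number of captions.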

In [3]:
# create sequences of images (X1), input sequences of words (X2) and output words (y)

def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)

# define the captioning model based on a LSTM language model  

#  Fit the "Merge" image captioning model

We implement a deep learning architecture based on the `merge-model` described by [Marc Tanti, et al.](http://www.aclweb.org/anthology/W17-3506) in
their 2017 papers.  The merge model has three parts:

1. **Photo Feature Extractor:** This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the images with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
2. **Sequence Processor:** This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. Think of this as an `acceptor`-style RNN.
3. **Acceptor:** This is essentially the acceptor part of the (Sequence Processor) RNN with a little twist. Both the image feature extractor and the sequence processor output a fixed-length vector (of the same length). These are merged together and processed by a Dense layer to make a final prediction.

A plot of the model is created below for better understanding.

The Photo Feature Extractor model expects input image features to be a vector of 4,096 dimensions. These are processed by a Dense layer to produce a 256-dim representation of the photo. The Sequence Processor model expects input sequences with a pre-defined length (34 words), which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units. Both input models produce a 256-dim vector, and both employ 0.5 dropout to reduce overfitting on the training dataset. The acceptor path merges the vectors from both input models using an addition operation. The result is fed to a 256-neuron Dense layer and then
to a final output Dense layer that makes a softmax prediction over the entire output vocabulary
for the next word in the sequence. The function below named `define_model()` defines and returns the model.

![model.png](model.png)
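
As a sanity check on the layer sizes described above, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for Dense, Embedding, and LSTM layers (one bias vector per LSTM gate, which is an assumption about the layer configuration) and the vocabulary size of 7,579 reported by the notebook. Dropout and the addition merge have no parameters.

```python
vocab_size = 7579  # vocabulary size reported by the notebook
feat_dim = 4096    # VGG feature dimension
units = 256        # width of the embedding, LSTM and Dense layers

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

def lstm_params(n_in, n_units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + n_units) * n_units + n_units)

counts = {
    'embedding': vocab_size * units,
    'image_dense': dense_params(feat_dim, units),
    'lstm': lstm_params(units, units),
    'decoder_dense': dense_params(units, units),
    'output_dense': dense_params(units, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print('%-14s %9d' % (name, n))
print('%-14s %9d' % ('total', total))
```

The embedding count (1,940,224) agrees with the `embedding_1` row of the model summary printed by the notebook. The embedding and the output layer together account for most of the parameters, so the vocabulary size dominates the model size.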

In [13]:
def define_model(vocab_size, max_length):
    # image input vector
    # Transformed Image data (VGGNet output)
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # input for sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [15]:
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)


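# load validation/development set (1K); the variables below are named `test`
# but hold the development split, which is used as validation data during training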
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)

# define the model
model = define_model(vocab_size, max_length)

Dataset: 6000
Descriptions: train=6000
Photos: train=6000
Vocabulary Size: 7579
Description Length: 34
Dataset: 1000
Descriptions: test=1000
Photos: test=1000
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 34)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 4096)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 34, 256)      1940224     input_3[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 4096)     

We then fit the model on the training dataset. The model
learns fast and quickly overfits the training dataset. For this reason, we monitor the performance
of the model on the development (validation) dataset. Whenever the validation loss
improves at the end of an epoch, we save the whole model to file. We finally select the model with the best validation set performance (lowest validation loss).


In [19]:
# define checkpoint callback
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [20]:
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], 
          validation_data=([X1test, X2test], ytest))

#Dataset: 6,000
#Descriptions: train=6,000
#Photos: train=6,000
#Vocabulary Size: 7,579
#Description Length: 34
#Dataset: 1,000
#Descriptions: test=1,000
#Photos: test=1,000
#Train on 306,404 samples, validate on 50,903 sample

Train on 306404 samples, validate on 50903 samples
Epoch 1/20
Epoch 00001: val_loss improved from inf to 4.06850, saving model to model.h5
 - 1605s - loss: 4.5107 - val_loss: 4.0685
Epoch 2/20
Epoch 00002: val_loss improved from 4.06850 to 3.92612, saving model to model.h5
 - 1604s - loss: 3.8923 - val_loss: 3.9261
Epoch 3/20
Epoch 00003: val_loss improved from 3.92612 to 3.89365, saving model to model.h5
 - 1602s - loss: 3.7206 - val_loss: 3.8936
Epoch 4/20
Epoch 00004: val_loss improved from 3.89365 to 3.87893, saving model to model.h5
 - 1603s - loss: 3.6378 - val_loss: 3.8789
Epoch 5/20
Epoch 00005: val_loss did not improve
 - 1603s - loss: 3.5957 - val_loss: 3.8919
Epoch 6/20
Epoch 00006: val_loss did not improve
 - 1611s - loss: 3.5690 - val_loss: 3.9050
Epoch 7/20
Epoch 00007: val_loss did not improve
 - 1605s - loss: 3.5517 - val_loss: 3.9201
Epoch 8/20
Epoch 00008: val_loss did not improve
 - 1602s - loss: 3.5382 - val_loss: 3.9278
Epoch 9/20
Epoch 00009: val_loss did not impr

<keras.callbacks.History at 0x1a224bbcc0>
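
The effect of `save_best_only=True` can be traced on the validation losses logged above. The following sketch only replays the selection rule in plain Python (keep a checkpoint whenever the monitored loss reaches a new minimum); it is an illustration of the rule, not the Keras implementation, and it covers only the epochs whose losses are fully visible in the log.

```python
# Validation losses for epochs 1-8, copied from the training log
val_losses = [4.0685, 3.9261, 3.8936, 3.8789, 3.8919, 3.9050, 3.9201, 3.9278]

best = float('inf')
saved_epochs = []
for epoch, loss in enumerate(val_losses, start=1):
    # a checkpoint is written only when the monitored loss improves
    if loss < best:
        best = loss
        saved_epochs.append(epoch)

print('checkpoints written at epochs:', saved_epochs)
print('best epoch: %d (val_loss=%.4f)' % (saved_epochs[-1], best))
```

Within these epochs the file `model.h5` is last overwritten at epoch 4, while the training loss keeps decreasing afterwards, which is the overfitting pattern described earlier.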

# Evaluate Model

Once the model is fit, we can evaluate it on the test set consisting of 1,000 images. Recall that we used the training and validation subsets for training and model selection.

Here we use a well-known metric for our evaluation.


# Metrics: Bilingual Evaluation Understudy (BLEU)
One measure that can be used to evaluate the skill of the model is the BLEU (Bilingual Evaluation Understudy) score, which compares a generated sentence against one or more reference sentences. A perfect match results in a score of 1.0, whereas a
perfect mismatch results in a score of 0.0. The score was developed for evaluating the predictions made by automatic machine translation systems.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper *BLEU: a Method for Automatic Evaluation of Machine Translation*. The approach works by counting n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram or unigram is a single token and a bigram is a pair of adjacent words. Matching n-grams are counted regardless of where they occur in the sentence.
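
The n-gram matching step can be illustrated with a short sketch of the modified (clipped) n-gram precision used by BLEU. This covers only one ingredient of the full score (the brevity penalty and the averaging over n-gram orders are left out), and the function names are chosen for this illustration. The example sentences are the classic degenerate case where the candidate repeats a common word.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # for each n-gram, the largest count observed in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # a candidate n-gram is credited at most as often as it appears in a reference
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = ['the cat is on the mat'.split(),
              'there is a cat on the mat'.split()]
candidate = 'the the the the the the the'.split()

p1 = modified_precision(references, candidate, 1)
p2 = modified_precision(references, candidate, 2)
print(p1, p2)
```

Here only 2 of the 7 occurrences of `the` are credited, because no single reference contains the word more than twice, and no candidate bigram matches at all.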

BLEU score is adopted for many language generation problems such as:

* Language generation.
* Image caption generation.
* Text summarization.
* Speech recognition.
* And much more.

For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper *Where to put the Image in an Image Caption Generator*):


|Metric|Performance Range|
|:----:|:---------------:|
|BLEU-1| 0.401 to 0.578|
|BLEU-2| 0.176 to 0.390|
|BLEU-3| 0.099 to 0.260|
|BLEU-4| 0.059 to 0.170|

A higher score close to 1.0 is better, a score closer to zero is worse. 



### Sentence BLEU Score Example
NLTK provides the `sentence_bleu()` function for evaluating a candidate sentence against one
or more reference sentences. The reference sentences must be provided as a list of sentences
where each reference is a list of tokens. The candidate sentence is provided as a list of tokens.
For example:

In [28]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)

1.0


Running this example prints a perfect score of **1.0** as the candidate matches one of the references exactly.

### Generate a description for a single photo 

We then evaluate our model by generating descriptions for all photos in the test dataset and
evaluating those predictions with our evaluation metric.

First, we need to be able to generate a description for an image using the trained model.
This process involves calling the model recursively: starting from the start-of-sequence token `startseq`, the previously generated words are fed back as input to generate the next word, until the end-of-sequence token
`endseq` is produced or the maximum description length is reached. The function below named
`generate_desc()` implements this and generates a textual description given a trained
model and a prepared image as input. It calls the function `word_for_id()` to
map an integer prediction back to a word.
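
The control flow of this recursive generation can be shown in isolation with a toy stand-in for the trained model. In the sketch below, `TOY_TABLE` and `toy_next_word_probs` are hypothetical and depend only on the previous word; the real model conditions on the photo features and the whole padded sequence. Only the greedy loop (pick the most probable word, append it, stop at `endseq` or at the length limit) reflects the procedure described above.

```python
def greedy_decode(next_word_probs, max_length):
    """Generate words one at a time, always taking the most probable next word."""
    words = ['startseq']
    for _ in range(max_length):
        probs = next_word_probs(words)
        # greedy choice: the word with the highest probability
        word = max(probs, key=probs.get)
        words.append(word)
        # stop once the end-of-sequence token has been produced
        if word == 'endseq':
            break
    return ' '.join(words)

# Hypothetical next-word distributions standing in for model.predict()
TOY_TABLE = {
    'startseq': {'little': 0.7, 'dog': 0.3},
    'little': {'girl': 0.8, 'dog': 0.2},
    'girl': {'running': 0.6, 'endseq': 0.4},
    'running': {'endseq': 0.9, 'girl': 0.1},
}

def toy_next_word_probs(words):
    return TOY_TABLE[words[-1]]

print(greedy_decode(toy_next_word_probs, max_length=34))
# startseq little girl running endseq
```

Greedy selection is the simplest decoding strategy; it commits to one word per step and never revisits earlier choices.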

In [12]:
from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for _ in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# remove start/end sequence tokens from a summary
def cleanup_summary(summary):
    # remove start of sequence token
    index = summary.find('startseq ')
    if index > -1:
        summary = summary[len('startseq '):]
    # remove end of sequence token
    index = summary.find(' endseq')
    if index > -1:
        summary = summary[:index]
    return summary

We will generate predictions for all images in the test dataset. The function below named
`evaluate_model()` evaluates a trained model against a given dataset of image descriptions
and features. The actual and predicted descriptions are collected and evaluated collectively
using the corpus `BLEU score` that summarizes how close the generated text is to the expected
text.

Here, we compare each generated description against all of the
reference descriptions for the photograph. We then calculate cumulative BLEU scores for
1-, 2-, 3- and 4-grams. Scores range from 0 to 1: a score closer to 1.0 is better, and a score closer to zero is worse.
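
To make the metric concrete, here is a minimal pure-Python sketch of cumulative BLEU for a single sentence: clipped n-gram precision, a geometric mean over n-gram orders, and a brevity penalty. It is a simplified illustration under our own assumptions, not the NLTK `corpus_bleu` routine used in this report: it scores one sentence rather than a corpus, applies no smoothing, and the helper names (`ngrams`, `modified_precision`, `sentence_bleu_sketch`) are hypothetical. It uses uniform weights of exactly 1/n, whereas the evaluation code passes 0.3 for each of the three BLEU-3 weights, which sums to 0.9.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # for each n-gram, the largest count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def sentence_bleu_sketch(references, candidate, max_n):
    """Cumulative BLEU up to max_n with uniform weights 1/max_n (no smoothing)."""
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # brevity penalty uses the reference length closest to the candidate length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else exp(1 - r / c)
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# toy example (sentences invented for illustration)
references = ['dog runs on the beach'.split(),
              'a dog is running across the sand'.split()]
candidate = 'dog is running through the water'.split()
bleu1 = sentence_bleu_sketch(references, candidate, 1)
bleu2 = sentence_bleu_sketch(references, candidate, 2)
```

In this toy example four of the six candidate words appear in a reference, but only two of the five bigrams do, so `bleu2` is lower than `bleu1`. The same effect explains why the scores reported in this document shrink as the n-gram order grows.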

In [22]:
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # clean up prediction
        yhat = cleanup_summary(yhat)
        # store actual and predicted
        references = [cleanup_summary(d).split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# load TEST set (1k)
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)


Dataset: 6000
Descriptions: train=6000
Vocabulary Size: 7579
Description Length: 34
Dataset: 1000
Descriptions: test=1000
Photos: test=1000
BLEU-1: 0.372020
BLEU-2: 0.200042
BLEU-3: 0.132851
BLEU-4: 0.054616


Running the example prints the BLEU scores. The scores fall within the range we expect for a model of this kind on the problem, but they are relatively low and drop sharply as the n-gram order increases, from 0.372 for BLEU-1 to 0.055 for BLEU-4. The model's performance is therefore not as satisfactory as we would like, in either accuracy or robustness (reflected by the dropping figures). This shows there is room for improvement.


# Generating Test Samples

Now we set up our model to generate captions for unseen images.
Almost everything we need to generate captions for entirely new photographs is in the model
file. We also need the Tokenizer, to encode the generated words for the model while generating
a sequence, and the maximum length of the input sequences used when we defined the model
(34 in our case).
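
The generation loop that relies on these pieces can be sketched without Keras. The snippet below is a minimal illustration, assuming a hypothetical `predict` callable in place of `model.predict` and a plain `word_index` dictionary in place of the Tokenizer; `pad_left`, `greedy_decode` and `toy_predict` are names introduced here, and the photo features are left out of the toy predictor. It also builds the reverse index once, instead of scanning the vocabulary for every generated word.

```python
def pad_left(sequence, maxlen):
    """Left-pad with zeros (or keep only the last maxlen items) to a fixed length."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

def greedy_decode(predict, word_index, max_length):
    """Repeatedly append the most probable next word until 'endseq' or max_length."""
    index_word = {index: word for word, index in word_index.items()}
    words = ['startseq']
    for _ in range(max_length):
        encoded = pad_left([word_index[w] for w in words], max_length)
        probs = predict(encoded)
        # argmax over the vocabulary
        best = max(range(len(probs)), key=probs.__getitem__)
        word = index_word.get(best)
        if word is None:
            break
        words.append(word)
        if word == 'endseq':
            break
    return ' '.join(words)

# toy stand-in for model.predict: always continues one fixed caption
word_index = {'startseq': 1, 'dog': 2, 'runs': 3, 'endseq': 4}
script = {1: 2, 2: 3, 3: 4}  # last word id -> next word id

def toy_predict(encoded):
    probs = [0.0] * (len(word_index) + 1)  # index 0 is reserved for padding
    probs[script[encoded[-1]]] = 1.0
    return probs

caption = greedy_decode(toy_predict, word_index, max_length=34)
```

Because each step keeps only the single most probable word, this is greedy decoding; the maximum length bounds the loop when the model never emits `endseq`.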

Because the text encoding depends only on the training captions, we can create
the tokenizer once and save it to a file, so that we can load it quickly whenever we need it without
needing the entire Flickr8k dataset. An alternative would be to use our own vocabulary file
and integer-mapping function during training. We create the Tokenizer as before and
save it as a pickle file, `tokenizer.pkl`. The complete example is listed below.
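
The alternative mentioned above, a hand-rolled vocabulary and integer mapping, could look roughly like the following sketch. It is not what this project uses: `build_word_index` and `encode` are hypothetical helpers, ties between equally frequent words are broken alphabetically (an arbitrary choice), and the pickle round trip goes through memory rather than a file to keep the example self-contained.

```python
import pickle
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for line in lines for word in line.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode(text, word_index):
    """Integer-encode a caption, skipping words that are not in the vocabulary."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

lines = ['startseq dog runs endseq', 'startseq black dog swims endseq']
word_index = build_word_index(lines)

# round-trip through pickle, as one would with a file on disk
restored = pickle.loads(pickle.dumps(word_index))
encoded = encode('startseq dog flies endseq', restored)
```

The unknown word `flies` is dropped during encoding, so the vocabulary must be built from the same training captions the model was trained on.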

## Text Preprocessing Pipeline

In [23]:
from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        return file.read()

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# load training dataset so we can create a text preprocessor that can
# be saved and used for test/unseen data
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
with open('tokenizer.pkl', 'wb') as file:
    dump(tokenizer, file)

Dataset: 6000
Descriptions: train=6000


We can now load the tokenizer whenever we need it without having to load the entire training
dataset of annotations. Below we generate a description for a new photograph. 


![example.jpg](example.jpg)
**Figure: `dog is running across the beach`.**

**Note:** Caption was generated using the Merge Deep NN trained above.

## Image Captioning Modules

In [4]:
from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model
    
# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model: drop the final classification layer by
    # taking the output of the second-to-last layer (4,096-element features)
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# remove start/end sequence tokens from a summary
def cleanup_summary(summary):
    # remove start of sequence token
    index = summary.find('startseq ')
    if index > -1:
        summary = summary[len('startseq '):]
    # remove end of sequence token
    index = summary.find(' endseq')
    if index > -1:
        summary = summary[:index]
    return summary

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for _ in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
description = cleanup_summary(description)
print(description)


black dog is running through the water


# Inject Architecture for Image Captioning

Above we implemented a deep learning architecture based on the `merge-model` described by [Marc Tanti, et al.](http://www.aclweb.org/anthology/W17-3506) in their 2017 paper. Following a pipeline similar to the one used for the merge architecture, we now implement versions of the inject architecture, in which the LSTM plays the role of a conditional generator rather than the text encoder it is in the merge architecture.


The inject model has the following parts:

1. **Photo Feature Extractor:** This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
2. **Conditional Generator:** This is a word embedding layer for handling the text input, followed by a conditional Long Short-Term Memory (LSTM) recurrent neural network layer that is conditioned on the photo features.

The Photo Feature Extractor expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256-element representation of the photo. The Conditional Generator expects input sequences with a pre-defined length (34 words), which are fed into an Embedding layer that uses a mask to ignore padded values. The photo representation is added to the embedded sequence, and the result is fed into an LSTM.
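
How the photo representation meets the word embeddings can be sketched in a few lines of plain Python. In the Keras model, `add([fe2, se2])` combines a `(batch, 256)` tensor with a `(batch, 34, 256)` tensor; our reading is that the image vector is broadcast across the 34 time steps, which is what the hypothetical `inject` helper below does on tiny 3-element vectors (the sizes and values are made up for illustration).

```python
def inject(image_vector, word_embeddings):
    """Add the same image vector to the embedding at every time step."""
    return [[w + v for w, v in zip(step, image_vector)]
            for step in word_embeddings]

image_vector = [0.5, -1.0, 2.0]       # stands in for the 256-element Dense output
word_embeddings = [[0.0, 0.0, 0.0],   # padded position
                   [1.0, 1.0, 1.0],   # 'startseq'
                   [0.2, 0.4, 0.6]]   # 'dog'
mixed = inject(image_vector, word_embeddings)
```

The result keeps one vector per time step, so the LSTM that follows sees the image information at every step of the caption prefix.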


In [4]:
def define_inject_model(vocab_size, max_length):
    # image input vector
    # Transformed Image data (VGGNet output)
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # input for sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    #se3 = LSTM(256)(se2)
    # decoder model
    generator1 = add([fe2, se2])
    generator2 = LSTM(256)(generator1)
    generator3 = Dense(256, activation='relu')(generator2)
    outputs = Dense(vocab_size, activation='softmax')(generator3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model_inject.png', show_shapes=True)
    return model

In [5]:
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)


# load validation/development set (1K)
# (variables are named test_* but hold the development split)
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)

# define the model
model_inject = define_inject_model(vocab_size, max_length)

Dataset: 6000
Descriptions: train=6000
Photos: train=6000
Vocabulary Size: 7579
Description Length: 34
Dataset: 1000
Descriptions: test=1000
Photos: test=1000
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 4096)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 34)           0                                            
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 4096)         0           input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 34, 256)  

In [6]:
# define checkpoint callback
checkpoint = ModelCheckpoint('model_inject.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [7]:
# fit model
model_inject.fit([X1train, X2train], ytrain, epochs=3, verbose=2, callbacks=[checkpoint], 
          validation_data=([X1test, X2test], ytest))

Train on 306404 samples, validate on 50903 samples
Epoch 1/3
Epoch 00001: val_loss improved from inf to 4.41615, saving model to model_inject.h5
 - 1754s - loss: 5.1663 - val_loss: 4.4162
Epoch 2/3
Epoch 00002: val_loss improved from 4.41615 to 4.09621, saving model to model_inject.h5
 - 1741s - loss: 4.2332 - val_loss: 4.0962
Epoch 3/3
Epoch 00003: val_loss improved from 4.09621 to 4.00521, saving model to model_inject.h5
 - 1762s - loss: 3.9616 - val_loss: 4.0052


<keras.callbacks.History at 0x11f84e668>

In [10]:
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

In [13]:
from keras.models import load_model

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # clean up prediction
        yhat = cleanup_summary(yhat)
        # store actual and predicted
        references = [cleanup_summary(d).split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
    
max_length = max_length(train_descriptions)

# load the model
filename = 'model_inject.h5'
model_inject = load_model(filename)
# evaluate model
evaluate_model(model_inject, test_descriptions, test_features, tokenizer, max_length)

BLEU-1: 0.329774
BLEU-2: 0.176957
BLEU-3: 0.120197
BLEU-4: 0.049280


# Inject/Merge Model Analysis

The inject model combines the encoded form of the image with each word from the text description generated so far. This approach uses the recurrent neural network as a text generation model that takes a sequence of combined image and word information as input in order to generate the next word in the sequence. The recurrent network therefore has to develop an encoding that incorporates visual and linguistic information together.

The merge model, on the other hand, combines the encoded form of the image input with the encoded form of the text description generated so far. The combination of these two encoded inputs is then used by a very simple decoder model to generate the next word in the sequence. This separates the concerns of modeling the image input, modeling the text input, and combining and interpreting the encoded inputs.
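
The merge data flow can be illustrated with a small plain-Python sketch. It is only a toy under stated assumptions: `encode_text` averages the embeddings as a stand-in for the LSTM encoder, `merge` uses element-wise addition as one possible way to combine the two vectors, and both names and all values are invented for illustration.

```python
def encode_text(word_embeddings):
    """Toy stand-in for the LSTM encoder: average the embeddings over time."""
    steps = len(word_embeddings)
    return [sum(column) / steps for column in zip(*word_embeddings)]

def merge(image_vector, text_vector):
    """Combine the two already-encoded inputs by element-wise addition."""
    return [i + t for i, t in zip(image_vector, text_vector)]

image_vector = [0.5, -1.0, 2.0]
word_embeddings = [[1.0, 1.0, 1.0], [3.0, 5.0, 7.0]]
text_vector = encode_text(word_embeddings)   # the image plays no part here
decoder_input = merge(image_vector, text_vector)
```

The key difference from the inject model is the order of operations: the text is reduced to a single vector before the image is mixed in, so the recurrent part never sees visual information.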

Below is a visualisation of the two models:

![inject.png](inject.png)
**Figure: `The Inject Model Architecture`.**

![merge.png](merge.png)
**Figure: `The Merge Model Architecture`.**


As explained by Tanti et al., in ‘inject’ architectures the image vector (usually derived from the activation values of a hidden layer in a convolutional neural network) is injected into the RNN, for example by treating the image vector on a par with a ‘word’ and including it as part of the caption prefix. In ‘merge’ architectures, the image is left out of the RNN subnetwork, so that the RNN handles only the caption prefix, that is, purely linguistic information. After the prefix has been vectorised, the image vector is merged with the prefix vector in a separate ‘multimodal layer’ that comes after the RNN subnetwork.

Overall, Tanti et al. found the merge architecture to be more effective than the inject approach.

# Further Improvements

## Improvement 1: better generator using attention
Our first improvement is to employ an attention mechanism, which is common in state-of-the-art image-to-text generation methods. The basic idea is to let the model focus its attention on certain parts of its input when producing each word. This is intuitive and coincides with human perception. Below we implement a simple model with attention.
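
Before the Keras version, the core idea can be sketched in plain Python: compute a softmax over the time steps for each feature and use the resulting weights to rescale the inputs. This is a simplified illustration, not the layer used in our model. In particular, the learned Dense transformation is replaced by the identity (an assumption made to keep the example short), and `softmax` and `attention_over_time` are names introduced here.

```python
from math import exp

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_time(inputs):
    """inputs: list of time steps, each a list of features.
    Returns the re-weighted inputs and the per-feature weights over time."""
    columns = list(zip(*inputs))                       # (features, time)
    weights = [softmax(list(col)) for col in columns]  # softmax along the time axis
    weights_by_step = list(zip(*weights))              # back to (time, features)
    weighted = [[x * w for x, w in zip(step, w_step)]
                for step, w_step in zip(inputs, weights_by_step)]
    return weighted, weights_by_step

inputs = [[0.0, 1.0], [0.0, 3.0], [0.0, 1.0]]
weighted, weights = attention_over_time(inputs)
```

For the second feature the middle time step has the largest value, so it receives the largest weight, while the constant first feature is weighted uniformly across the three steps.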

In [42]:
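from keras.layers import Permute, multiply

def attention_mechanism(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    time_steps = int(inputs.shape[1])
    # swap axes so the Dense layer acts along the time dimension
    a = Permute((2, 1))(inputs)
    # softmax over time steps gives one weight per step for each feature
    a = Dense(time_steps, activation='softmax')(a)
    a_probs = Permute((2, 1))(a)
    # re-weight the inputs; multiply() replaces the deprecated merge(mode='mul')
    attention_mul = multiply([inputs, a_probs])
    return attention_mul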

def define_inject1_model(vocab_size, max_length):
    # image input vector
    # Transformed Image data (VGGNet output)
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # input for sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=False)(inputs2)
    se2 = Dropout(0.5)(se1)
    #se3 = LSTM(256)(se2)
    # decoder model
    generator1 = add([fe2, se2])
    attention = attention_mechanism(generator1)
    generator2 = LSTM(256, return_sequences=False)(attention)
    generator3 = Dense(256, activation='relu')(generator2)
    outputs = Dense(vocab_size, activation='softmax')(generator3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model_inject1.png', show_shapes=True)
    return model

In [43]:
"""
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)


# load validation/development set (1K)
# (variables are named test_* but hold the development split)
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)
"""
# define the model
model_inject1 = define_inject1_model(vocab_size, max_length)



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_19 (InputLayer)           (None, 4096)         0                                            
__________________________________________________________________________________________________
input_20 (InputLayer)           (None, 34)           0                                            
__________________________________________________________________________________________________
dropout_19 (Dropout)            (None, 4096)         0           input_19[0][0]                   
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, 34, 256)      1940224     input_20[0][0]                   
__________________________________________________________________________________________________
dense_23 (

In [44]:
# define checkpoint callback
checkpoint = ModelCheckpoint('model_inject1.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [46]:
# fit model
model_inject1.fit([X1train, X2train], ytrain, epochs=3, verbose=2, callbacks=[checkpoint], 
          validation_data=([X1test, X2test], ytest))

Train on 306404 samples, validate on 50903 samples
Epoch 1/3
Epoch 00001: val_loss improved from inf to 4.35661, saving model to model_inject1.h5
 - 1779s - loss: 4.8760 - val_loss: 4.3566
Epoch 2/3
Epoch 00002: val_loss improved from 4.35661 to 4.11538, saving model to model_inject1.h5
 - 1769s - loss: 4.2100 - val_loss: 4.1154
Epoch 3/3
Epoch 00003: val_loss improved from 4.11538 to 4.02391, saving model to model_inject1.h5
 - 1768s - loss: 3.9704 - val_loss: 4.0239


<keras.callbacks.History at 0x20c2fdba20>

In [47]:
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

In [48]:
from keras.models import load_model

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # clean up prediction
        yhat = cleanup_summary(yhat)
        # store actual and predicted
        references = [cleanup_summary(d).split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
    
max_length = max_length(train_descriptions)

# load the model
filename = 'model_inject1.h5'
model_inject1 = load_model(filename)
# evaluate model
evaluate_model(model_inject1, test_descriptions, test_features, tokenizer, max_length)



BLEU-1: 0.430439
BLEU-2: 0.216167
BLEU-3: 0.129352
BLEU-4: 0.052467


As the printed numbers show, the BLEU scores improve over the plain inject model, most clearly for BLEU-1 (0.330 to 0.430). They also confirm our earlier observation that the scores drop as the n-gram order increases. This improvement may be attributed to the new attention mechanism, which helps the LSTM focus on the corresponding region (in the features) rather than the entire image when producing the description words. Visualizing the attention weights would show whether they indeed concentrate on the correct object. This is consistent with human perception: we only need to look at a certain object, not the whole scene, to name it. The merit of such a mechanism is that the model can put more weight on the relevant parts and ignore redundant information that may act as noise and negatively affect its judgment. This may explain the advantage over the vanilla approach.

## Improvement 2: better generator using CNN

Another possible improvement we explored is using a CNN for text generation instead of LSTM-based methods. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit GPU hardware, and optimization is easier because the number of non-linearities is fixed and independent of the input length. Meanwhile, the reported accuracy is comparable with LSTM-based methods. For more details, see [convolutional kernels for text](https://arxiv.org/pdf/1705.03122.pdf).
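
The following plain-Python sketch illustrates the building blocks: a 1D convolution with zero padding that preserves the sequence length, a ReLU, and max pooling, applied with two different kernel widths. It works on a single scalar channel with hand-picked kernels, whereas the real model learns 64 filters per kernel width over 256-element embeddings. The padding split (more zeros on the right for even widths) is our assumption about the 'same' convention, and the helper names are hypothetical.

```python
def conv1d_same(sequence, kernel):
    """1D cross-correlation with zero padding so the output length equals the input length."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(sequence) + [0.0] * right
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(sequence))]

def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values), pool_size)]

sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {}
for kernel in ([1.0, -1.0], [0.25, 0.25, 0.25, 0.25]):
    # every output position depends only on a fixed window of the input,
    # so all positions can be computed independently of each other
    features[len(kernel)] = max_pool1d(relu(conv1d_same(sequence, kernel)))
```

Unlike a recurrent layer, no output position waits for the previous one, which is what allows the computation to be parallelized over the sequence.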

In [38]:
from keras.layers import Conv1D, MaxPooling1D, concatenate, Flatten

def define_inject2_model(vocab_size, max_length):
    # image input vector
    # Transformed Image data (VGGNet output)
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # input for sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=False)(inputs2)
    # using 1D convolutions for different n-grams
    conv_list = []
    kernels = [2, 4, 6, 8]
    for k in kernels:
        conv = Conv1D(64, kernel_size=k, padding='same', activation='relu')(se1)
        conv = MaxPooling1D(padding='same')(conv)
        conv_list.append(conv)
    # 4 kernel sizes x 64 filters = 256 channels, matching the image vector
    conv_final = concatenate(conv_list)
    # decoder model
    generator1 = add([fe2, conv_final])
    generator1 = Flatten()(generator1)
    generator2 = Dense(256, activation='relu')(generator1)
    outputs = Dense(vocab_size, activation='softmax')(generator2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model_inject2.png', show_shapes=True)
    return model

In [None]:
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

In [None]:
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)

# load validation/development set (1K)
# (variables are named test_* but hold the development split)
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)

# define the model
model_inject2 = define_inject2_model(vocab_size, max_length)

Here we define a CNN whose kernels span 2-, 4-, 6- and 8-grams (64 filters each); each convolution is followed by a max pooling layer to reduce the dimensionality. We then use a fully connected layer to transform the output to the desired dimension, achieving text generation using a CNN. The architecture details are printed by `model.summary()` when the model is defined.

In [None]:
# define checkpoint callback
checkpoint = ModelCheckpoint('model_inject2.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')

We then train the model on the Flickr8k dataset (on a different machine) with the default Adam optimizer, and use the standard BLEU evaluation.

In [None]:
# fit model
model_inject2.fit([X1train, X2train], ytrain, epochs=6, verbose=2, callbacks=[checkpoint], 
          validation_data=([X1test, X2test], ytest))

In [None]:
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # clean up prediction
        yhat = cleanup_summary(yhat)
        # store actual and predicted
        references = [cleanup_summary(d).split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
    
# use a separate name so that the max_length() helper is not overwritten
max_len = max_length(train_descriptions)

# load the model
filename = 'model_inject2.h5'
model_inject2 = load_model(filename)
# evaluate model
evaluate_model(model_inject2, test_descriptions, test_features, tokenizer, max_len)

The BLEU scores of the CNN-based method are listed in the Result Summary below. On BLEU-1 it outperforms the vanilla LSTM-based merge and inject models and is close to the attention-based model, although its BLEU-2 to BLEU-4 scores are lower. Meanwhile, the training time per epoch is about half that of the previous methods, which is consistent with the CNN parallelizing the text generation process.
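To make the BLEU-n numbers easier to interpret, here is a small illustrative re-implementation of corpus-level BLEU without smoothing: clipped n-gram precisions are pooled over the corpus, combined as a weighted geometric mean, and multiplied by a brevity penalty. It is a sketch written for this explanation, not the NLTK code used in `evaluate_model`; the function name `simple_corpus_bleu` and the example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_corpus_bleu(list_of_references, hypotheses, weights):
    """Corpus-level BLEU without smoothing (illustrative sketch)."""
    hyp_len = ref_len = 0
    matches = [0] * len(weights)
    totals = [0] * len(weights)
    for references, hyp in zip(list_of_references, hypotheses):
        hyp_len += len(hyp)
        # the brevity penalty uses the reference length closest to the hypothesis length
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in references)[1]
        for n in range(1, len(weights) + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            # clip each n-gram count by its maximum count in any single reference
            max_ref = Counter()
            for r in references:
                for gram, c in Counter(ngrams(r, n)).items():
                    max_ref[gram] = max(max_ref[gram], c)
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0:
        return 0.0
    log_precision = 0.0
    for w, m, t in zip(weights, matches, totals):
        if w == 0:
            continue  # n-gram orders with zero weight do not contribute
        if m == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_precision += w * math.log(m / t)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


# made-up example: one image with two reference captions and one generated caption
refs = [[['a', 'dog', 'runs', 'on', 'grass'], ['the', 'dog', 'is', 'running']]]
hyps = [['a', 'dog', 'is', 'running', 'fast']]
bleu1 = simple_corpus_bleu(refs, hyps, (1.0, 0, 0, 0))      # 4 of 5 unigrams match
bleu2 = simple_corpus_bleu(refs, hyps, (0.5, 0.5, 0, 0))    # geometric mean of 4/5 and 3/4
print(bleu1, bleu2)
```

In this example the generated caption has no 3-gram in common with either reference, so the unsmoothed BLEU-3 and BLEU-4 of the toy corpus are zero. Higher-order n-gram precisions are much harder to satisfy, which is one reason BLEU-4 values are typically far below BLEU-1 values.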

# Result Summary


In [14]:
import pandas as pd

results = pd.DataFrame(columns=["ExpID", "BLEU1Test", "BLEU2Test", "BLEU3Test", "BLEU4Test", "Experiment description"])

results.loc[len(results)] = ["1", "0.372020", "0.200042", "0.132851", "0.054616", "merge architecture"]
results.loc[len(results)] = ["2", "0.329774", "0.176957", "0.120197", "0.049280", "inject architecture"]
results.loc[len(results)] = ["3", "0.430439", "0.216167", "0.129352", "0.052467", "inject with attention"]
results.loc[len(results)] = ["4", "0.411333", "0.137368", "0.068167", "0.021550", "inject using CNN"]
results.loc[len(results)] = ["5", "0.460040", "0.247789", "0.157232", "0.066799", "merge using biLSTM"]

results

| | ExpID | BLEU1Test | BLEU2Test | BLEU3Test | BLEU4Test | Experiment description |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.372020 | 0.200042 | 0.132851 | 0.054616 | merge architecture |
| 1 | 2 | 0.329774 | 0.176957 | 0.120197 | 0.049280 | inject architecture |
| 2 | 3 | 0.430439 | 0.216167 | 0.129352 | 0.052467 | inject with attention |
| 3 | 4 | 0.411333 | 0.137368 | 0.068167 | 0.021550 | inject using CNN |
| 4 | 5 | 0.460040 | 0.247789 | 0.157232 | 0.066799 | merge using biLSTM |


The test results for the different model architectures are summarized in the table above. The merge model with biLSTM outperforms the other models on all four BLEU metrics. Comparing the basic merge and inject architectures, the merge architecture performs better. The success of the merge model for the encoder-decoder architecture suggests that the role of the recurrent neural network is to encode the input rather than to generate the output. Among the inject variants (basic, with attention, and using CNN), the inject architecture with attention performs best and is comparable to the basic merge model. These results agree with our expectations. For instance, in the baseline inject architecture, where the VGGNet embedding serves as the condition for the conditional generator LSTM, the learning capacity can be low. By using more learnable layers, e.g. an LSTM for the generator phase, we indeed get better results.
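The ranking statements above can be checked directly against the reported numbers. The short sketch below copies the BLEU scores from the results table and finds the best model per metric, first over all experiments and then among the inject variants only; the variable names are chosen for illustration.

```python
# BLEU-1..BLEU-4 test scores copied from the results table
results = {
    "merge architecture":    (0.372020, 0.200042, 0.132851, 0.054616),
    "inject architecture":   (0.329774, 0.176957, 0.120197, 0.049280),
    "inject with attention": (0.430439, 0.216167, 0.129352, 0.052467),
    "inject using CNN":      (0.411333, 0.137368, 0.068167, 0.021550),
    "merge using biLSTM":    (0.460040, 0.247789, 0.157232, 0.066799),
}
metrics = ("BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4")


def best_per_metric(scores):
    """Map each metric name to the experiment with the highest score on it."""
    return {
        metric: max(scores, key=lambda name: scores[name][i])
        for i, metric in enumerate(metrics)
    }


best_overall = best_per_metric(results)
inject_only = {name: s for name, s in results.items() if name.startswith("inject")}
best_inject = best_per_metric(inject_only)

for metric in metrics:
    print(metric, "| best overall:", best_overall[metric],
          "| best inject variant:", best_inject[metric])
```

Note that these comparisons are based on a single run per architecture, so small differences (for example on BLEU-3 and BLEU-4 between the basic merge model and the inject model with attention) should be read with some caution.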