# CNN-GRU-MultimodalLayer

This notebook covers the construction, training and testing of a model based on the CNN-LSTM architecture proposed by Vinyals et
al. [15]; here the decoder uses a GRU and a multimodal layer.

The CNN-LSTM model is an encoder-decoder designed for image caption generation.

### Setup of libraries

The next cell imports TensorFlow, which gives access to the GPU hardware used during training, together with the other libraries the notebook needs.

In [2]:
import csv # To read the captions of a csv
import tensorflow as tf # To build and train the ANN
from tensorflow.keras.preprocessing.text import Tokenizer # To use the tokenizer to split the words
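from tensorflow.keras.preprocessing.sequence import pad_sequences # To pad the sequences to a fixed length
import numpy as np # To handle arrays
import pickle # To load and store the tokenizer, the weights and the image embeddings
import pandas as pd # To read the spreadsheets with the captions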

## Reading the data

Now we read the captions and the ids of their associated images from the spreadsheet, and use the ids to build the path of each image.

In [48]:
def idToPath(id_image):
    id_image = str(id_image)
    complete_name=id_image+".jpg"
    while len(complete_name)<16:
        complete_name="0"+complete_name
    return "data/train2017/"+complete_name
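The padding loop in `idToPath` amounts to left-padding the numeric id to 12 digits before appending `.jpg`. As a minimal sketch, the same result can be obtained with `str.zfill`; the name `id_to_path_zfill` and the `root` parameter are illustrative and not part of the notebook.

```python
def id_to_path_zfill(id_image, root="data/train2017/"):
    # 12 digits + ".jpg" gives the 16-character file name that idToPath builds
    return root + str(id_image).zfill(12) + ".jpg"

print(id_to_path_zfill(9))
```

Ids that already have 12 or more digits are left unchanged, which matches the behaviour of the loop.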

In [97]:
PATH = "data/train_machine_spanish.xlsx"
df = pd.read_excel(PATH, names=["id_image","caption"])
df['caption'] = df.apply(lambda x: "smark "+x['caption']+" emark", axis=1)
df['id_image'] = df.apply(lambda x: idToPath(x['id_image']),axis=1)

In [98]:
PATH = "data/validation.xlsx"
val_df = pd.read_excel(PATH, names=["id_image","caption"])
val_df['caption'] = val_df.apply(lambda x: "smark "+x['caption']+" emark", axis=1)
val_df['id_image'] = val_df.apply(lambda x: idToPath(x['id_image']),axis=1)

### Setup of text data

We load the tokenizer that we initialized in the word embeddings notebook.

In [99]:
# loading
with open('items/tokenizer_spanish.pkl', 'rb') as handle:
    tokenizer = pickle.load(handle)

We transform the descriptions into lists of integers, where each integer is the index of a word in the tokenizer's vocabulary.

In [100]:
sentences_x = tokenizer.texts_to_sequences(df['caption'])

The labels are the same lists shifted one position to the left, so that at each step the target is the next word of the caption.

In [101]:
sentences_y = [sentence[1:] for sentence in sentences_x]

We pad every sentence shorter than 15 tokens with zeros on the right; longer sentences are truncated to 15 tokens.

In [102]:
pad_sentences_x = pad_sequences(sentences_x, padding='post',maxlen=15)
pad_sentences_y = pad_sequences(sentences_y, padding='post',maxlen=15)

In [103]:
pad_sentences_x[0]

array([   2,    5,   55,    7,    1,  279,  230,    9, 2988,  450,    8,
        139,    3,    0,    0], dtype=int32)

In [104]:
pad_sentences_y[0]

array([   5,   55,    7,    1,  279,  230,    9, 2988,  450,    8,  139,
          3,    0,    0,    0], dtype=int32)
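The relation between the two arrays above can be reproduced with a small NumPy sketch. It assumes the first caption tokenizes to the 13 indices shown in `pad_sentences_x[0]`; the helper `pad_post` is hypothetical and only mimics `pad_sequences(..., padding='post', maxlen=15)` for this illustration (like the Keras default, it keeps the last `maxlen` tokens of longer sequences).

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    # Hypothetical helper: right-pad with `value`;
    # longer sequences keep only their last `maxlen` tokens
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        out[row, :len(trimmed)] = trimmed
    return out

x = [2, 5, 55, 7, 1, 279, 230, 9, 2988, 450, 8, 139, 3]  # input tokens
y = x[1:]                                                # labels: shifted one position left

x_pad = pad_post([x], 15)[0]
y_pad = pad_post([y], 15)[0]
print(x_pad)
print(y_pad)
```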

### Model building

The embedding layer is loaded from the weights stored in the word embeddings notebook. This model transforms word indexes into the embeddings fed to the decoder.

In [12]:
class Embeddings_Model(tf.keras.Model):
    
    def __init__(self, max_length, embedding_dimension):
        super(Embeddings_Model, self).__init__()
        weights = None
        # Load the weight of the layer embedding pre-trained
        with open('items/embeddingLayerWeights_spanish.pkl', 'rb') as handle:
            weights = pickle.load(handle)
        self.embedding = tf.keras.layers.Embedding(max_length+1, embedding_dimension ,weights=[weights])
        
    def call(self, inputs):
        x = self.embedding(inputs)
        
        return x
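Conceptually, an embedding layer is a lookup table: each word index selects one row of the weight matrix. A minimal NumPy sketch, with a small random table standing in for the pre-trained weights (the sizes are illustrative, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dimension = 10, 4   # illustrative sizes
# One extra row because index 0 is reserved for padding
table = rng.normal(size=(vocab_size + 1, embedding_dimension))

caption = np.array([[2, 5, 7, 3, 0]])   # one padded caption of word indexes
embedded = table[caption]               # row lookup, shape (1, 5, 4)
print(embedded.shape)
```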

We define the encoder, which is the InceptionV3 model. Its inputs are the images and its output (here the 1000-dimensional output of the top layer) is used as the image embedding.

In [108]:
class CNN_Model(tf.keras.Model):
    
    def __init__(self):
        super(CNN_Model, self).__init__()
        # Load the InceptionV3 pre-trained model
        self.input_model = tf.keras.applications.InceptionV3(include_top=True, weights='imagenet',classes=1000)
        
    def call(self, inputs):
        x = self.input_model(inputs)
        
        return x
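InceptionV3 takes 299x299 RGB inputs, and its `preprocess_input` function scales pixel values from [0, 255] to [-1, 1]. A NumPy sketch of that scaling, using a random array as a stand-in for a decoded and resized image:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an image already decoded and resized to 299x299
image = rng.integers(0, 256, size=(299, 299, 3)).astype(np.float32)

scaled = image / 127.5 - 1.0            # same scaling as inception_v3.preprocess_input
batch = np.expand_dims(scaled, axis=0)  # add the batch dimension expected by the model
print(batch.shape, batch.min(), batch.max())
```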

We define the decoder, which is composed of:

- Embedding model: Transforms word indexes into embeddings.
- GRU module: Generates the words step by step.
- Dense layer: Transforms the GRU outputs into a vocabulary-sized vector of scores (without softmax).

In [109]:
class GRU_Model(tf.keras.Model):
    
    def __init__(self, max_length, embedding_dimension, num_result_words):
        super(GRU_Model, self).__init__()
        # Dimension embedding number
        self.embedding_dimension = embedding_dimension
        # Output words number
        self.num_result_words = num_result_words
        # Layer to map from image embedding to GRU dimension
        self.dense = tf.keras.layers.Dense(embedding_dimension)
        # GRU layer that returns the output of every time step
        self.gru = tf.keras.layers.GRU(embedding_dimension, input_shape=(num_result_words, embedding_dimension),
                                         return_sequences=True)
        # Concatenate layer for merge image embeddings and gru output in the multimodal layer
        self.concat = tf.keras.layers.Concatenate()
        # The multimodal layer has one output per vocabulary word, like a softmax layer,
        # but uses a linear activation because the softmax is applied inside the loss function
        self.output_model = tf.keras.layers.Dense(max_length, activation='linear')
        
    def call(self, inputs):
        # input = (captions, initial_state)
        captions, initial_state = inputs[0],inputs[1]
        initial_state = self.dense(initial_state)
        initial_state = tf.reshape(initial_state,[-1,self.embedding_dimension])
        x = self.gru(captions, initial_state=initial_state)
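        # Repeat the image embedding at every time step so it can be used as one of the
        # inputs to the multimodal layer: (batch, steps, embedding_dimension).
        # Tiling along axis 1 keeps each image aligned with its own caption; reshaping a
        # list of (batch, dim) tensors would mix examples when the batch size is above 1.
        image_embedding = tf.tile(tf.expand_dims(initial_state, 1), [1, tf.shape(x)[1], 1])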
        x = self.concat([x,image_embedding])
        x = self.output_model(x)
        
        return x
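The intent of the multimodal step in `call` can be pictured with plain NumPy: the projected image embedding is repeated along the time axis, concatenated with the GRU outputs, and passed through a linear layer that yields one score per vocabulary word. The sketch below uses small random arrays and illustrative sizes; `W` and `b` stand in for the weights of the output `Dense` layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, dim, vocab = 2, 3, 4, 6           # illustrative sizes

gru_out = rng.normal(size=(batch, steps, dim))  # stand-in for the GRU output sequence
image_vec = rng.normal(size=(batch, dim))       # stand-in for the projected image embedding

# Repeat each example's image vector at every time step: (batch, steps, dim)
image_seq = np.repeat(image_vec[:, np.newaxis, :], steps, axis=1)
# Concatenate along the feature axis: (batch, steps, 2 * dim)
merged = np.concatenate([gru_out, image_seq], axis=-1)

W = rng.normal(size=(2 * dim, vocab))
b = np.zeros(vocab)
logits = merged @ W + b                         # linear output; softmax is left to the loss
print(logits.shape)
```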

We combine the encoder and decoder models into a single model.

In [114]:
class CNN_GRU_Model():
    
    # This function declares every layer of the model
    def __init__(self):
        # Number of different words in the tokenizer vocabulary.
        self.max_length = 14276
        # Embedding dimension
        self.embedding_dimension = 512
        # Maximum number of words that the model is able to generate in a caption.
        self.num_result_words = 15
        # Optimizer used
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
        # Initialize the encoder
        self.encoder = CNN_Model()
        # Initialize the embedding layer
        self.embedding = Embeddings_Model(self.max_length, self.embedding_dimension)
        # Initialize the decoder
        self.decoder = GRU_Model(self.max_length, self.embedding_dimension, self.num_result_words)
    
    # This function loads the images so that the encoder can transform them into image embeddings
    # Parameters:
    #            image_paths: Collection of paths to images
    #            load: If True, compute the embeddings with the encoder; otherwise read them from a file
    #            test: If True, compute the embeddings but never store them
    # Return:
    #        A list with the image embeddings
    def encoder_predict(self, image_paths, load=False, test=False):
        predictions = []
        contador=0
        if load or test:
            # If load=True or test=True we predict all image embeddings with the encoder;
            # outside test mode they are stored for future use
            for image_path in image_paths:
                # Visual indicator for long processes
                if contador%1000==0:
                    print("Procesando imagen",contador)
                contador+=1
                # Read the file with tensorflow function
                image = tf.io.read_file(image_path)
                # Decode the JPEG file into an RGB tensor
                image = tf.image.decode_jpeg(image, channels=3)
                # Resize to inception_v3 input size
                image = tf.image.resize(image, (299, 299))
                # Normalize and other transforms
                image = tf.keras.applications.inception_v3.preprocess_input(image)
                # Add a dimension to tensor for the model
                image = np.expand_dims(image, axis=0)
                # Add the result of the prediction to the list
                predictions.append(self.encoder.predict(image))
            # Store the image embeddings list
            if not test and len(predictions)>10000:
                with open('items/image_embeddings_list.pkl', 'wb') as handle:
                    pickle.dump(predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)
        else:
            # Load the image embeddings list
            with open('items/image_embeddings_list.pkl', 'rb') as handle:
                predictions = pickle.load(handle)[:len(image_paths)]
        
        return predictions
    
    # This function preprocesses the data
    # Parameters:
    #           X_image: Image path list
    #           X_caption: Caption list
    #           Y_caption: Label caption list to compare with the predicted caption
    def preprocess_data(self,X_image,X_caption,Y_caption, test=False):
        # Get the image embeddings
        # (`test` is passed positionally, so it binds to the `load` parameter of encoder_predict)
        image_embeddings = np.array(self.encoder_predict(X_image, test))
        
        # Lambda function to map word indexes to word embeddings
        map_embeddings = lambda x: self.embedding.predict(np.array([x]))
        # Map the word embeddings
        X_caption_embeddings = map_embeddings(X_caption)
        
        # Reshape the numpy arrays to the sizes the decoder expects
        Y_caption_embeddings = np.array(Y_caption)
        X_caption_embeddings = X_caption_embeddings.reshape((
            -1, self.num_result_words, self.embedding_dimension))
        Y_caption_embeddings = Y_caption_embeddings.reshape((
            -1, self.num_result_words))
        
        return image_embeddings, X_caption_embeddings, Y_caption_embeddings
    
    # This function trains the model
    # Parameters:
    #           X_image: Image path list
    #           X_caption: Caption list
    #           Y_caption: Label caption list to compare with the predicted caption
    def train(self, X_image, X_caption, Y_caption):
        
        # Reshape of the data
        image_embeddings, X_caption_embeddings, Y_caption_embeddings = self.preprocess_data(
            X_image,
            X_caption,
            Y_caption,
            True)
#         val_image_embeddings, val_X_caption_embeddings, val_Y_caption_embeddings = self.preprocess_data(
#             val_X_image,
#             val_X_caption,
#             val_Y_caption,
#             True)
        
        # Initialize optimizer
        optimizer = tf.keras.optimizers.Adam()
        print("Image embeddings:",image_embeddings.shape)
        print("X_caption_embeddings",X_caption_embeddings.shape)
        print("Y_caption_embeddings",Y_caption_embeddings.shape)
        # Compile the model with sparse_cross_entropy as loss function
        self.decoder.compile(optimizer, loss=sparse_cross_entropy)
        # Train the model
        self.decoder.fit([X_caption_embeddings,
                          image_embeddings],
                         Y_caption_embeddings,
                         epochs=10,
                         batch_size=16,
                         shuffle=True,
                         workers=16)
    
    # This function generates the captions predicted by the trained model
    # Parameters:
    #           X_image: Image path list
    #           X_caption: Caption list
    #           Y_caption: Label caption list to compare with the predicted caption
    def predict(self, X_image, X_caption, Y_caption):
        
        # Reshape of the data
        image_embeddings, X_caption_embeddings, Y_caption_embeddings = self.preprocess_data(
            X_image,
            X_caption,
            Y_caption,
            test=True)
        
        images_list = []
        captions_list = []
        for example in range(0,len(image_embeddings)):
            tokens = []
            input_token = tokenizer.word_index['smark']
            hidden_state = np.array(image_embeddings[example])
            hidden_state = hidden_state.reshape((1, 1, 1000))
            while input_token != tokenizer.word_index['emark'] and len(tokens)<self.num_result_words:
                tokens.append(input_token)
                tokens_array = self.embedding.predict(np.array(tokens))
                tokens_array = tokens_array.reshape((1, -1, self.embedding_dimension))
                token = self.decoder.predict([tokens_array, hidden_state],
                                            workers=16)[0,-1,:]
                input_token = np.argmax(token)
            images_list.append(X_image[example])
            captions_list.append(tokenizer.sequences_to_texts([tokens]))
            if example%100==0:
                print("%d images predicted"%example)
        
        return images_list,captions_list
    
    
    def save(self,model):
        self.decoder.save_weights(model)
    
    def load(self,model):
        self.decoder.load_weights(model)
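The loop in `predict` is a greedy decoder: starting from the start mark, it repeatedly feeds the tokens generated so far to the decoder and keeps the highest-scoring next word until the end mark or the length limit is reached. A minimal sketch of that loop, where `step_fn` is a hypothetical stand-in for the embedding and decoder calls and the token ids are made up:

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len):
    tokens = []
    current = start_id
    while current != end_id and len(tokens) < max_len:
        tokens.append(current)
        logits = step_fn(tokens)           # scores for the next word
        current = int(np.argmax(logits))   # keep the best-scoring word
    return tokens

def toy_step(tokens, vocab=5):
    # Toy model: always predicts the id that follows the last token
    logits = np.zeros(vocab)
    logits[(tokens[-1] + 1) % vocab] = 1.0
    return logits

print(greedy_decode(toy_step, start_id=1, end_id=4, max_len=15))
```

As in `predict`, the end mark itself is not appended to the result, and generation stops after `max_len` tokens even if the end mark never appears.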

In [111]:
def sparse_cross_entropy(y_true, y_pred):
    # Reshape and cast label caption
    y_true = tf.reshape(y_true, [-1,15])
    y_true = tf.dtypes.cast(y_true, tf.int32)
    # Use tensorflow sparse softmax cross entropy loss function
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    
    # Get the scalar of the batch mean loss
    loss_mean = tf.reduce_mean(loss)
    return loss_mean
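For reference, the quantity computed by `tf.nn.sparse_softmax_cross_entropy_with_logits` is the negative log-probability of the true class under a softmax over the logits. A NumPy sketch of the same computation, averaged over all positions as in the function above (padding positions included):

```python
import numpy as np

def sparse_softmax_xent(labels, logits):
    # Numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class at every position
    picked = np.take_along_axis(log_probs, labels[..., np.newaxis], axis=-1)[..., 0]
    return -picked.mean()

labels = np.array([[1, 0, 2]])   # (batch, steps) integer word indexes
logits = np.zeros((1, 3, 4))     # uniform scores over a 4-word vocabulary
print(sparse_softmax_xent(labels, logits))
```

With uniform logits the loss equals the logarithm of the vocabulary size, which is a convenient sanity check for an untrained model.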

In [115]:
model = CNN_GRU_Model()

In [116]:
model.train(df['id_image'], pad_sentences_x, pad_sentences_y)

('Procesando imagen', 0)
('Procesando imagen', 1000)
('Procesando imagen', 2000)
('Procesando imagen', 3000)
('Procesando imagen', 4000)
('Procesando imagen', 5000)
('Procesando imagen', 6000)
('Procesando imagen', 7000)
('Procesando imagen', 8000)
('Procesando imagen', 9000)
('Procesando imagen', 10000)
('Procesando imagen', 11000)
('Procesando imagen', 12000)
('Procesando imagen', 13000)
('Procesando imagen', 14000)
('Procesando imagen', 15000)
('Procesando imagen', 16000)
('Procesando imagen', 17000)
('Procesando imagen', 18000)
('Procesando imagen', 19000)
('Procesando imagen', 20000)
('Procesando imagen', 21000)
('Procesando imagen', 22000)
('Procesando imagen', 23000)
('Procesando imagen', 24000)
('Procesando imagen', 25000)
('Procesando imagen', 26000)
('Procesando imagen', 27000)
('Procesando imagen', 28000)
('Procesando imagen', 29000)
('Procesando imagen', 30000)
('Procesando imagen', 31000)
('Procesando imagen', 32000)
('Procesando imagen', 33000)
('Procesando imagen', 34000

In [117]:
PATH = "data/validation.xlsx"
df_test = pd.read_excel(PATH, names=["id_image","caption"])
df_test['caption'] = df_test.apply(lambda x: "smark "+x['caption']+" emark", axis=1)
df_test['id_image'] = df_test.apply(lambda x: idToPath(x['id_image']),axis=1)

In [118]:
test_sentences_x = tokenizer.texts_to_sequences(df_test['caption'])

In [119]:
test_sentences_y = [sentence[1:] for sentence in test_sentences_x]

In [120]:
test_pad_sentences_x = pad_sequences(test_sentences_x, padding='post',maxlen=15)
test_pad_sentences_y = pad_sequences(test_sentences_y, padding='post',maxlen=15)

In [121]:
img, captions = model.predict(df_test['id_image'], test_pad_sentences_x, test_pad_sentences_y)

('Procesando imagen', 0)
('Procesando imagen', 1000)
('Procesando imagen', 2000)
('Procesando imagen', 3000)
('Procesando imagen', 4000)


KeyboardInterrupt: 

In [None]:
for i in range(len(img)):
    print(img[i])
    print(captions[i])

In [None]:
result_df = pd.DataFrame(columns=["id_image","caption"])
for i in range(len(img)):
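    # Keep only the file name, without the directory and the ".jpg" extension
    image = img[i][:-4].split('/')[-1]
    # Drop the leading "smark " from the generated caption
    caption = captions[i][0][6:]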
    result_df.loc[i] = [image,caption]

In [None]:
result_df.to_csv("results/val_M041_E.csv", encoding = 'utf-8',index=False)

In [25]:
#model.save('./models/model2_pretranslate')

In [85]:
model.load('./models/model2_pretranslate')