# <center>Image annotation<br /> Workshop</center>

![logoEI_1_6.jpg](attachment:logoEI_1_6.jpg)
Designer : Nassim HADDAM
</div>

The aim of this workshop is to give you the skills you need to carry out [image annotation](https://fr.wikipedia.org/wiki/Annotation_automatique_d%27images) (or image captioning). In this workshop, you'll be using CNNs (to create attributes useful for annotation) and RNNs (to do the actual annotation). You'll also need to do **a lot** of pre-processing. This workshop will focus on this part of the process, and will give you an idea of the network architecture that will be used for annotation. We'll start by importing the libraries we're interested in.

You must first re-execute the cells corresponding to data preparation from the previous workshop, from data loading to image and annotation pre-processing.
<br><br>
<b>Imports</b>

In [None]:
import tensorflow as tf

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import collections
import random
import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle
from tqdm import tqdm

<b>Data loading</b>

In [None]:
# Annotation file path
annotation_folder = "/annotations/"
annotation_file = os.path.abspath('.')+"/annotations/captions_train2014.json"

# Path to folder containing images to be annotated
image_folder = '/train2014/'
PATH = os.path.abspath('.') + image_folder

# Read annotation file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# Group all annotations with the same identifier.
image_path_to_caption = collections.defaultdict(list)
for val in annotations['annotations']:
    # mark the beginning and end of each annotation
# PLEASE COMPLETE
    # The image ID is part of the image path
    image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (val['image_id'])
    # Add caption associated with image_path
#PLEASE COMPLETE

# Take first images only
image_paths = list(image_path_to_caption.keys())
train_image_paths = image_paths[:2000]

# List all annotations
train_captions = []
# List of image filenames, each repeated once per annotation of that image
img_name_vector = []

for image_path in train_image_paths:
    caption_list = image_path_to_caption[image_path]
    # Add caption_list to train_captions
#PLEASE COMPLETE
    # Add duplicated image_path len(caption_list) times
#PLEASE COMPLETE

In [None]:
# Download the InceptionV3 model pre-trained for classification on ImageNet
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
# Creation of a variable that will be the input to the new image pre-processing model
new_input = \
# PLEASE COMPLETE
# retrieve the last hidden layer containing the image in compact representation
hidden_layer = \
#PLEASE COMPLETE

# Model that calculates a dense representation of images with InceptionV3
#PLEASE COMPLETE

# Definition of load_image function
def load_image(image_path):
    """
    The load_image function has as input the path of an image and as output a pair
    containing the processed image and its path.
    The load_image function performs the following processing:
        1. Loads the file corresponding to the path image_path
        2. Decodes the image into RGB.
        3. Resize image to size (299, 299).
        4. Normalize image pixels between -1 and 1.
    """
    img = \
#PLEASE COMPLETE
    img = \
#PLEASE COMPLETE
    img = \
#PLEASE COMPLETE
    img = \
#PLEASE COMPLETE
    return img, image_path

# Image pre-processing
# Take image names
encode_train = sorted(set(img_name_vector))

# Create an instance of "tf.data.Dataset" based on image names
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
# Data split into batches after pre-processing by load_image
image_dataset = image_dataset.map(
  load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)

# Browse dataset batch by batch for InceptionV3 pre-processing
for img, path in tqdm(image_dataset):
    # InceptionV3 pre-processes current batch (size (16,8,8,2048))
    batch_features = image_features_extract_model(img)
    # Resize batch size (16,8,8,2048) to (16,64,2048)
    batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))
    # Browse current batch and store path and batch with np.save()
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        # (image path associated with its new representation , image representation)
        np.save(path_of_feature, bf.numpy())
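
To make the two array manipulations used in this cell more concrete, here is a small NumPy sketch on random data. It scales 8-bit pixel values to the range [-1, 1] and flattens an `(8, 8, 2048)` feature map into `(64, 2048)`. The helper names `scale_pixels` and `flatten_positions` and the scaling formula `x / 127.5 - 1` are assumptions made for illustration; this is not the TensorFlow code expected in the exercise.

```python
import numpy as np

def scale_pixels(img_uint8):
    """Map pixel values from [0, 255] to [-1, 1] (assumed scaling: x / 127.5 - 1)."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def flatten_positions(feature_map):
    """Turn a (height, width, channels) map into (height * width, channels)."""
    return feature_map.reshape(-1, feature_map.shape[-1])

rng = np.random.default_rng(0)

# Random stand-in for a resized RGB image
image = rng.integers(0, 256, size=(299, 299, 3), dtype=np.uint8)
scaled = scale_pixels(image)

# Random stand-in for the InceptionV3 output of one image
features = rng.normal(size=(8, 8, 2048)).astype(np.float32)
flat = flatten_positions(features)

print(scaled.min(), scaled.max(), flat.shape)
```

Row `9` of the flattened array corresponds to position `(1, 1)` of the original 8x8 grid, since the reshape keeps the row-major order.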

**Annotation pre-processing**

In [None]:
# Find the maximum size
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Select the 5000 most frequent words in the vocabulary
top_k = 5000
#The Tokenizer class enables text pre-processing for neural networks
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
# Builds a vocabulary based on the train_captions list
tokenizer.fit_on_texts(\
#PLEASE COMPLETE

# Create token to fill annotations to equalize length
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# Creation of vectors (list of integer tokens) from annotations (list of words)
train_seqs = \
#PLEASE COMPLETE

# Fill each vector up to maximum annotation length
cap_vector = \
#PLEASE COMPLETE

# Calculates the maximum length used to store attention weights
# This will be used later for display during evaluation
max_length = calc_max_length(train_seqs)

# 1 Training and test set formation :
Next, you need to separate the dataset into two parts: a training set and a test set. The code that performs these operations is detailed in the next cell.

In [None]:
img_to_cap_vector = collections.defaultdict(list)
# Creation of a dictionary associating image paths (.npy files) with annotations
# Images are duplicated because there are several annotations per image
print(len(img_name_vector), len(cap_vector))
for img, cap in zip(img_name_vector, cap_vector):
    img_to_cap_vector[img].append(cap)

"""
Creation of training and validation datasets using
random 80-20 splitting
"""
# Take the keys (names of processed image files), *these will not be duplicated*.
img_keys = list(img_to_cap_vector.keys())
#PLEASE COMPLETE
# Split indexes into training and test
slice_index = int(len(img_keys)*0.8)
img_name_train_keys, img_name_val_keys = \
#PLEASE COMPLETE

"""
Training and test sets are in the form of
lists containing mappings:(pre-processed image ---> annotation token(word) )
"""

# Loop to build training set
img_name_train = []
cap_train = []
for imgt in img_name_train_keys:
    capt_len = len(img_to_cap_vector[imgt])
    # Duplicate images by number of annotations per image
    img_name_train.extend([imgt] * capt_len)
    cap_train.extend(img_to_cap_vector[imgt])

# Loop to build test set
img_name_val = []
cap_val = []
for imgv in img_name_val_keys:
    capv_len = \
#PLEASE COMPLETE
    # Duplication of images in the number of annotations per image
#PLEASE COMPLETE

len(img_name_train), len(cap_train), len(img_name_val), len(cap_val)

The next cell creates the training dataset as a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) instance built from the base dataset (the file names and annotations of the training set). The `tf.data.Dataset` class is used to represent large datasets and facilitate their pre-processing.

In [None]:
# Feel free to modify these parameters to suit your machine
BATCH_SIZE = 64 # batch size
BUFFER_SIZE = 1000 # buffer size for data mixing
embedding_dim = 256
units = 512 # size of hidden layer in RNN
vocab_size = top_k + 1
num_steps = len(img_name_train) // BATCH_SIZE

# The shape of the vector extracted from InceptionV3 is (64, 2048)
# The following two variables represent the shape of this vector
features_shape = 2048
attention_features_shape = 64

# Function to load numpy files from pre-processed images
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

# Creation of a dataset of tensors (used to represent large datasets)
# The dataset is created from "img_name_train" and "cap_train".
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load numpy files (possibly in parallel)
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Mix data and divide into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

# 2 The model :
Regarding the model, the output of the last convolutional layer of `InceptionV3` has the shape `(8, 8, 2048)`. This tensor was reshaped to `(64, 2048)` when it was stored on disk. It is then passed through the CNN encoder (which consists of a single, fully connected layer). The RNN then predicts the next word of the annotation from this representation. [The image](https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/22_Image_Captioning.ipynb) below shows an example of a very basic annotation system architecture using a CNN and an RNN.

![22_image_captioning_flowchart.png](attachment:22_image_captioning_flowchart.png)

In this example the captioning is done this way:
<ul>
    <li>The image is passed through the CNN to get a compact representation of it. This representation is returned by the `Dense 2` layer of size 4096.</li>
    <li>This representation is reduced by passing it to the dense layer `Dense Map`, and the result is used as the initial hidden state of the RNN cells.</li>
    <li>The RNN part is composed of <a href="https://openclassrooms.com">GRU</a> (Gated Recurrent Unit) cells and is made up of 3 layers. Each layer represents a level of abstraction of the language.</li>
    <li>The RNN has as input the annotation as well as the image in compact form, and returns, for each input word (each column in the figure), the word that follows it.</li>
    <li>The annotations are represented as a list of words. This list cannot be used directly by the RNN, so it is passed to a module that replaces each word with an integer (or integer token), and then to another module that projects each token into a vector whose elements are between -1 and 1.</li>
</ul>
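
The word-to-integer step described in the last item can be sketched in plain Python. This is a simplified illustration on two toy captions, not the Keras `Tokenizer` used in this workshop; the helper names `build_vocab`, `encode` and `pad` are hypothetical, and the choice of index 0 for padding and index 1 for unknown words is an assumption of this sketch.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Keep the top_k most frequent words; 0 is padding, 1 is the unknown word."""
    counts = Counter(word for caption in captions for word in caption.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Replace each word with its integer token (unknown words map to <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in caption.split()]

def pad(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

captions = ["<start> a dog runs <end>", "<start> a cat <end>"]
vocab = build_vocab(captions, top_k=10)
padded = pad([encode(caption, vocab) for caption in captions])
print(padded)
```

The projection of each integer token into a dense vector is a separate step, performed by an embedding layer inside the network.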

Your annotation system will follow, broadly speaking, the same principle as shown in the image above, but with some essential differences. In particular, the system will contain an attention mechanism (explained in [this article](https://arxiv.org/pdf/1502.03044.pdf) and in [the video](https://www.youtube.com/watch?v=uCSTpOLMC48&list=WL&index=257)) whose function is to make the neural network give greater importance, in its annotation predictions, to the most telling and relevant parts of the image.

**The CNN encoder:**

The CNN encoder produces a suitable representation of the image, which it passes on to the RNN decoder for captioning. The CNN has as input the characteristics of images already pre-processed by InceptionV3 and stored on disk.

Please note that in the CNN part of this neural network, the last convolutional layer is not flattened as it was in the previous workshop on CNNs. Remember that the images from InceptionV3 pre-processing were of the form 8x8x2048. These images have now been reshaped to 64x2048. This means that for each of the 64 positions of the pre-processed image, this representation contains the 2048 features extracted by InceptionV3. And so, the input to the CNN encoder is a batch where each element is made up of the 2048 features of the 64 positions of the pre-processed image (which was originally 8x8). The dense layer that follows calculates a new image representation of size 64x256, where each image position has 256 features. The same weights are applied at every position: the dense layer acts on the last axis (the 2048 features) and shares its weight matrix across the 64 positions. This is due to the way the [dense layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) handles matrix operations in TensorFlow.

The advantage of this representation over the flattened representation is that it preserves spatial information at the level of the neural network layers. This will enable the attention mechanism of the RNN part to detect interesting positions in the image and inform the algorithm on which zone it should place the most importance when captioning the image.
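
The behaviour of the dense layer described above can be checked with a few lines of NumPy. The sketch below uses random values for the input and for a hypothetical weight matrix `W` and bias `b`; it only illustrates the shapes and the weight sharing across positions, and is not the Keras layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, positions, in_features, out_features = 2, 64, 2048, 256

# Random stand-ins for a batch of stored features and for the layer parameters
x = rng.normal(size=(batch_size, positions, in_features))
W = rng.normal(size=(in_features, out_features)) * 0.01
b = np.zeros(out_features)

# One matrix product applies the same W and b to every position
y = x @ W + b

# Transforming a single position on its own gives the same values
y_single = x[0, 5] @ W + b

print(y.shape, np.allclose(y[0, 5], y_single))
```

Because the 64 positions are kept as a separate axis, each row of the output can still be traced back to one cell of the 8x8 grid.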

In [None]:
class CNN_Encoder(tf.keras.Model):
    # Since the images are already preprocessed by InceptionV3, they are represented in compact form.
    # The CNN encoder will simply transmit these characteristics to a dense layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # form after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = \
#PLEASE COMPLETE
        x = \
#PLEASE COMPLETE
        return x

**The attention mechanism:**

The attention mechanism is very similar to a [classical RNN](https://fr.wikipedia.org/wiki/R%C3%A9seau_de_neurones_r%C3%A9currents) cell, but with a few differences. The attention part has as input the representation of the pre-processed image returned by the CNN as well as the current value of the hidden state of the GRU, and as output the **context vector** which reflects the most important features of the image. An intermediate step in calculating this vector is to compute the **attention weights**, which represent the importance of each image position (of which there are 64) in predicting its annotation.

The image representation given as input is initially transformed in the same way as for CNN, by passing it to a dense layer of size `units`. Similarly, the hidden state is also passed to a dense layer of size `units`. The new image representation is then added to the hidden state and passed to an activation function of type [`tanh`](https://fr.wikipedia.org/wiki/Tangente_hyperbolique) as for conventional RNN cells. At this level, we'll have a data representation of size `64xunits` containing a mixture of information about the image and the annotation text. A score is then assigned to each position by passing this representation to a dense layer. These scores are normalized with a softmax layer to produce the vector of **attention weights**.

Finally, each feature of the input image representation is multiplied (weighted) by the attention weight of its position. The weighted features are then summed over the positions (the rows of the representation) to form the **context vector**.

Overall, we can say that the attention vector depends on scores that are learned from a spatial and textual representation of the image. This attention vector reflects the relevance of each position and is used to calculate the context vector, which will give us the importance of the image's features.
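
The computation described above can be followed step by step in NumPy. The sketch below is a simplified illustration with random values and small dimensions: the matrices `W1`, `W2` and `v` stand in for the three dense layers, biases are omitted, and the function name `attention` is hypothetical. It is not the Keras implementation expected in the next cell.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(features, hidden, W1, W2, v):
    # features: (batch, 64, embedding_dim), hidden: (batch, hidden_size)
    hidden_with_time_axis = hidden[:, None, :]                   # (batch, 1, hidden_size)
    # Mix image and text information, broadcast over the 64 positions
    mixed = np.tanh(features @ W1 + hidden_with_time_axis @ W2)  # (batch, 64, units)
    score = mixed @ v                                            # (batch, 64, 1)
    # Normalize the scores over the positions
    weights = softmax(score, axis=1)                             # (batch, 64, 1)
    # Weighted sum of the features over the positions
    context = (weights * features).sum(axis=1)                   # (batch, embedding_dim)
    return context, weights

rng = np.random.default_rng(0)
batch_size, positions, embedding_dim, hidden_size, units = 2, 64, 8, 6, 5
features = rng.normal(size=(batch_size, positions, embedding_dim))
hidden = rng.normal(size=(batch_size, hidden_size))
W1 = rng.normal(size=(embedding_dim, units))
W2 = rng.normal(size=(hidden_size, units))
v = rng.normal(size=(units, 1))

context, weights = attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)
```

For each image, the 64 attention weights sum to 1, so the context vector is a weighted average of the features of the 64 positions.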

In [None]:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = \
#PLEASE COMPLETE
        self.W2 = \
#PLEASE COMPLETE
        self.V = \
#PLEASE COMPLETE

    def call(self, features, hidden):
        # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)

        # hidden layer shape == (batch_size, hidden_size)
        hidden_with_time_axis = \
#PLEASE COMPLETE

        attention_hidden_layer = \
#PLEASE COMPLETE

        # This gives you a non-normalized score for each image feature.
        score = \
#PLEASE COMPLETE

        attention_weights = \
#PLEASE COMPLETE

        context_vector = \
#PLEASE COMPLETE
        context_vector = \
#PLEASE COMPLETE

        return context_vector, attention_weights

**The RNN decoder:**

The role of the RNN decoder is to use the pre-processed representation of the image to predict its caption word by word. This RNN has a single cell of type [GRU](https://en.wikipedia.org/wiki/Gated_recurrent_unit). The GRU has a hidden state that represents the memory of the last elements seen by it. The GRU updates its state before returning it, using gating mechanisms that control which parts of the state are kept and which are overwritten.
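
To give an idea of how a GRU updates its hidden state, here is a minimal NumPy sketch of a single step, following one common formulation of the GRU equations. The parameter names (`Wz`, `Uz`, and so on) and the random values are assumptions of this sketch, and conventions differ slightly between libraries, so this should not be read as the exact computation performed by `tf.keras.layers.GRU`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU update for input x (batch, input_dim) and state h (batch, units)."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])   # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])   # reset gate
    h_candidate = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])
    # Blend the old state and the candidate according to the update gate
    return (1.0 - z) * h + z * h_candidate

rng = np.random.default_rng(0)
input_dim, units, batch_size = 4, 3, 2

# Random parameters for the update gate (z), reset gate (r) and candidate (h)
params = {}
for gate in "zrh":
    params["W" + gate] = rng.normal(size=(input_dim, units))
    params["U" + gate] = rng.normal(size=(units, units))
    params["b" + gate] = np.zeros(units)

h0 = np.zeros((batch_size, units))
x = rng.normal(size=(batch_size, input_dim))
h1 = gru_step(x, h0, params)
print(h1.shape)
```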

The decoder is structured as follows: each time the RNN is called, the current word, the image representation and the GRU's hidden state are given as input to the RNN. As words are represented by integers, they must be passed through an [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), which calculates a vector representation of size `output_dim` from the number representing the word.

In addition, the attention mechanism provides a vector representing **the context** of the image, i.e. a vector that tells us about the image's dominant features. This vector is computed by a call to the attention mechanism, providing it with the image features encoded by the CNN and the hidden state of the GRU, which summarizes the history of words seen by the RNN so far.

Next, the current word and context are concatenated to form the GRU's input vector, which in turn calculates the state in the next step. This state is passed through a dense layer of size `units`, then the output of this layer is passed to another dense layer of size `vocab_size`, which returns the score associated with each word in the vocabulary in order to predict the next word.
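
The shapes involved in the embedding and concatenation steps can be illustrated with NumPy. In this sketch the embedding layer is replaced by a random lookup table named `embedding_table`, and the context vector is random; both are stand-ins chosen for illustration, with small dimensions to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, vocab_size, embedding_dim = 2, 10, 4

# Stand-in for the embedding layer: one row of size embedding_dim per word
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([[3], [7]])                          # (batch, 1) current words
context = rng.normal(size=(batch_size, embedding_dim))   # stand-in for the context vector

# Embedding lookup: each integer selects one row of the table
embedded = embedding_table[word_ids]                     # (batch, 1, embedding_dim)

# Add a time axis to the context and concatenate along the feature axis
gru_input = np.concatenate([context[:, None, :], embedded], axis=-1)

print(embedded.shape, gru_input.shape)
```

The GRU therefore receives, for each element of the batch, a single time step whose features are the context followed by the word embedding.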

In [None]:
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = \
#PLEASE COMPLETE
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        #Dense layer with GRU output as input
        self.fc1 = \
#PLEASE COMPLETE
        # Last dense layer
        self.fc2 = \
#PLEASE COMPLETE

        self.attention = \
#PLEASE COMPLETE

    def call(self, x, features, hidden):
        # Attention is defined by a separate model
        context_vector, attention_weights = self.attention(features, hidden)
        # Pass current word to embedding layer
        x = \
#PLEASE COMPLETE
        # Concatenation
        x = \
#PLEASE COMPLETE

        # Passage of the concatenated vector to the gru
        output, state = \
#PLEASE COMPLETE

        # Dense layer
        y = \
#PLEASE COMPLETE

        # Flatten to (batch_size * sequence_length, units) using y's own last dimension
        y = tf.reshape(y, (-1, y.shape[2]))

        # Dense layer
        y = \
#PLEASE COMPLETE

        return y, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

**Combining the encoder and decoder:**

You must complete the code that creates the encoder and the decoder.

In [None]:
# Encoder creation
encoder = \
#PLEASE COMPLETE
# Decoder creation
decoder = \
#PLEASE COMPLETE

In [None]:
# ADAM optimizer
optimizer = \
#PLEASE COMPLETE
# The loss function
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
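
The effect of the mask in `loss_function` can be reproduced on a toy example in NumPy. The sketch below re-implements the cross-entropy from logits by hand under the assumption that token 0 is the padding token; the function name `masked_loss` is hypothetical and the values are chosen so that the result is easy to check.

```python
import numpy as np

def masked_loss(real, logits):
    # Per-example sparse categorical cross-entropy computed from logits
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(real)), real]
    # Padded positions (token 0) contribute nothing to the loss
    mask = (real != 0).astype(per_example.dtype)
    return (per_example * mask).mean()

real = np.array([2, 0, 1])    # the second entry is padding
logits = np.zeros((3, 4))     # uniform predictions over 4 words
loss = masked_loss(real, logits)
print(loss)
```

Note that the mean is taken over all entries of the batch, including the masked ones, so here the loss equals `2 * log(4) / 3` rather than `log(4)`.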

To keep track of your learning and save it, you can use the class [`tf.train.Checkpoint`](https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint).

In [None]:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

Initialize the epoch at which training starts in `start_epoch`. The `tf.train.Checkpoint` class allows you to continue training where you left off if it was interrupted earlier.

In [None]:
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # Restore last checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

# 3 Training and testing:
Next, you'll implement the `train_step` and `evaluate` functions:

- The `train_step` function represents a step in network training. It consists of the encoder's evaluation of the vector pre-calculated by InceptionV3. The output of this step is transmitted to the decoder, which then predicts the annotation word by word. The loop for predicting each word and calculating the associated loss should be implemented in this function.
- The `evaluate` function will be used to evaluate the network's performance on the test set. It is therefore similar to the `train_step` function, except that the calculation part of the loss function is omitted, as it doesn't involve training the network.

Finally, you need to implement the training and testing part of the code. Note that training is done here by image batch.

# 3.1 Training
The function used to complete a training step on a batch of images is `train_step`. The function has as input a batch of pre-processed images and their annotations, and returns the loss associated with this batch.

The hidden state of the RNN part is initialized, as is the start word with the start token. The image features are then extracted by the encoder. After this, the batch is traversed word by word to predict the next word using the decoder. The decoder uses the hidden state, the image features and the previous word to predict the current word. The decoder updates the hidden state and returns it together with the batch predictions. The loss is calculated from the predictions returned by the decoder and the annotations associated with the batch.

Finally, the overall loss and gradient are calculated and the network is updated.
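
The way the correct previous word is fed back to the decoder at each step (often called teacher forcing) can be illustrated on a single caption in plain Python. The helper `teacher_forcing_pairs` is hypothetical and works on words rather than integer tokens to keep the example readable.

```python
def teacher_forcing_pairs(target):
    """Return the (decoder input, expected word) pairs for one caption."""
    return [(target[i - 1], target[i]) for i in range(1, len(target))]

caption = ["<start>", "a", "dog", "runs", "<end>"]
pairs = teacher_forcing_pairs(caption)

for dec_input, expected in pairs:
    print(dec_input, "->", expected)
```

At each step the decoder input is the true word from the annotation, not the word predicted at the previous step.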

In [None]:
loss_plot = []
@tf.function
def train_step(img_tensor, target):
    loss = 0

    # Initialize hidden state for each batch
    hidden = decoder.reset_state(batch_size=target.shape[0])

    # Initialize decoder input
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    with tf.GradientTape() as tape: # Enables loss gradient calculation
        features = encoder(img_tensor)

        for i in range(1, target.shape[1]):
            # Prediction of the i'th word of the batch with the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)

            # The correct word at step i is given as input at step (i+1)
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss, total_loss

The full code containing the training loop is shown below. This loop traverses the training dataset batch by batch and trains the network on each batch.

In [None]:
EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(
              epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
    # saving loss
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
        ckpt_manager.save()

    print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                         total_loss/num_steps))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

# Training curve display
plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

# 3.2 Test
The function used to complete an evaluation step for testing is in the next cell.

In [None]:
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

# Function for representing attention at image level
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

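    len_result = len(result)
    # Square grid large enough to hold one subplot per word, even for short captions
    grid_size = max(int(np.ceil(np.sqrt(len_result))), 1)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, l+1)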
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()
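
In `evaluate`, the next word is drawn at random from the predicted distribution with `tf.random.categorical` rather than taken as the most likely word. The NumPy sketch below contrasts the two strategies on a toy vector of scores; the helper names `sample_id` and `greedy_id` and the score values are assumptions made for illustration.

```python
import numpy as np

def sample_id(logits, rng):
    """Draw a word index at random, with probabilities given by softmax(logits)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

def greedy_id(logits):
    """Always pick the word with the highest score."""
    return int(np.argmax(logits))

logits = np.array([0.1, 2.0, -1.0, 0.5])
rng = np.random.default_rng(0)
samples = [sample_id(logits, rng) for _ in range(1000)]

print(greedy_id(logits), np.bincount(samples, minlength=4))
```

Sampling can return a different caption each time the same image is evaluated, whereas the greedy choice is deterministic.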

Display a few examples of the results returned by the evaluation.

In [None]:
# Display of some annotations in the test set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)