# **Pre-processing Data**
Here we carry out all the steps needed to prepare our data for the final model.

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pickle import load, dump 
import re



## Downloading Datasets
The datasets used here are:

Flicker8k_Dataset: Contains 8092 photographs in jpeg format.
Flickr8k_text: Contains a number of files containing different sources of descriptions for the photographs.

As the official website that hosted these datasets is down, you can download them from the following repository:  
https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip

In [None]:
!wget https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
!wget https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip

--2020-07-21 18:53:04--  https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/124585957/47f52b80-3501-11e9-8f49-4515a2a3339b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200721T185304Z&X-Amz-Expires=300&X-Amz-Signature=096fec3ef8b88ea43b560456310d86cf317a66cf6805e7653a831815dab5fee6&X-Amz-SignedHeaders=host&actor_id=0&repo_id=124585957&response-content-disposition=attachment%3B%20filename%3DFlickr8k_Dataset.zip&response-content-type=application%2Foctet-stream [following]
--2020-07-21 18:53:04--  https://github-production-release-asset-2e65be.s3.amazonaws.com/124585957/47f52b80-3501-11e9-8f49-4515a2a3339b?X-Amz-Algorithm=AWS4-HMAC-SHA

In [None]:
#Unzipping the datasets downloaded above
!unzip Flickr8k_Dataset.zip
!unzip Flickr8k_text.zip

unzip:  cannot find or open Flickr8k_Dataset.zip, Flickr8k_Dataset.zip.zip or Flickr8k_Dataset.zip.ZIP.
unzip:  cannot find or open Flickr8k_text.zip, Flickr8k_text.zip.zip or Flickr8k_text.zip.ZIP.


## Preprocessing Captions Data
The flickr8k text dataset has the following files:

Flickr8k.token.txt - The raw captions of the Flickr8k Dataset. The first column is the ID of the caption which is "image address # caption number"

Flickr8k.lemma.txt - The lemmatized version of the above captions 

Flickr_8k.trainImages.txt - The training images used in our experiments.
Flickr_8k.devImages.txt - The development/validation images used in our experiments.
Flickr_8k.testImages.txt - The test images used in our experiments.
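
To make the caption ID format concrete, here is a minimal sketch of splitting one line of Flickr8k.token.txt into its parts. The helper name `parse_token_line` is hypothetical, and the sketch assumes that the caption ID and the caption text are separated by whitespace.

```python
def parse_token_line(line):
    """Split one line of the token file into (image id, caption number, caption)."""
    # The caption ID and the caption text are separated by the first run of whitespace
    caption_id, caption = line.strip().split(None, 1)
    # The caption ID has the form "<image file name>#<caption number>"
    file_name, _, number = caption_id.rpartition('#')
    # Dropping the file extension leaves the bare image id
    image_id = file_name.split('.')[0]
    return image_id, int(number), caption

example = "1087168168_70280d024a.jpg#2\tA teenage boy is jumping on an inflatable slide ."
print(parse_token_line(example))
```

The caption number is returned only for illustration; the preprocessing in this notebook keeps just the image id and the caption text.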


In [None]:
with open("Flickr8k.token.txt") as f:
  data = f.read()

These are the first few lines of the file we have just opened.

1087168168_70280d024a.jpg#2	A teenage boy is jumping on an inflatable slide .

1087168168_70280d024a.jpg#3	A young boy jumping and a young girl seating in an inflatable pool .

1087168168_70280d024a.jpg#4	Two young kids are playing in the water on an inflated toy .

1087539207_9f77ab3aaf.jpg#0	A family playing on a tractor on a beautiful day .

1087539207_9f77ab3aaf.jpg#1	Children ride a tractor in a field .


As we can see, the text data has image ids along with the captions, so we need to separate the two. We will map each image id to its corresponding captions using a dictionary. 

In [None]:
#Declaring an empty dictionary to store all the captions 
captions = dict()
#Splitting the data in the file into separate lines
for line in data.split('\n'):
  #Splitting each line into words
  words = line.split()
  #Skipping blank lines (such as the trailing one), which would otherwise raise an IndexError
  if not words:
    continue
  #First word contains the image id and the remaining words form the corresponding caption
  img_id, img_cap = words[0], words[1:]
  #Dropping the .jpg file extension and the caption number from the image id
  img_id = img_id.split('.')[0]
  img_cap = ' '.join(img_cap)
  #If the dictionary already has this image id as a key, we append this caption
  #to its list of captions; otherwise we first create an empty list for it
  captions.setdefault(img_id, list()).append(img_cap)


In [None]:
#Generating the list of captions for a random image id
captions["47871819_db55ac4699"]

['A soccer player in blue is chasing after the player in black and white .',
 'The girl in the white strip is falling down as the girl in the blue strip challenges for the soccer ball .',
 'The girls are playing soccer .',
 'Two women in soccer uniforms playing soccer .',
 'Two young women on different teams are playing soccer on a field .']

In [None]:
#The captions contain many contractions (such as "won't" or "it's") that
#we use in day-to-day language. We expand all such occurrences in every
#line of text. Here, we declare a function which will be called on each
#caption read from the text file.
def preprocess(word):
  word = re.sub(r"won't", "will not", word)
  word = re.sub(r"can't", "cannot", word)
  word = re.sub(r"'ll", " will", word)
  word = re.sub(r"'ve", " have", word)
  word = re.sub(r"'s", " is", word)
  word = re.sub(r"n't", " not", word)
  word = re.sub(r"'t", " not", word)
  word = re.sub(r"'m", " am", word)
  word = re.sub(r"'d", " would", word)
  word = re.sub(r"'re", " are", word)
  return word

In [None]:
#In order to preprocess the text in our text file, we read each caption and then preprocess it
for img_id in captions.keys():
  cap_ = list()
  caption = captions[img_id]
  for sentence in caption:
    sentence = preprocess(sentence)
    sentence = sentence.replace("\n", ' ')
    #Replacing every run of characters other than letters and digits with a single space
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    #Adding markers to detect start and end of a sentence; strip() keeps "endseq"
    #a separate token even when the caption has no trailing punctuation
    img_cap = "startseq " + sentence.lower().strip() + " endseq"
    #Adding the preprocessed caption to the list of captions
    cap_.append(img_cap)
  captions[img_id] = cap_

In [None]:
#Generating the list of captions after preprocessing them
captions["47871819_db55ac4699"]

['startseq a soccer player in blue is chasing after the player in black and white endseq',
 'startseq the girl in the white strip is falling down as the girl in the blue strip challenges for the soccer ball endseq',
 'startseq the girls are playing soccer endseq',
 'startseq two women in soccer uniforms playing soccer endseq',
 'startseq two young women on different teams are playing soccer on a field endseq']
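
The cleaning step can also be tried on a single caption in isolation. The following is a minimal sketch, not the notebook's exact cell: the helper name `clean_caption` is hypothetical, contraction expansion is left out, and the sketch strips surrounding whitespace before adding the end marker so that the marker stays a separate word even when a caption has no trailing punctuation.

```python
import re

def clean_caption(caption):
    """Lowercase a caption, drop punctuation and wrap it in start/end markers."""
    # Replace every run of characters other than letters and digits with one space
    text = re.sub('[^A-Za-z0-9]+', ' ', caption)
    # strip() removes the space left behind by trailing punctuation, if any
    return "startseq " + text.lower().strip() + " endseq"

print(clean_caption("The girls are playing soccer ."))
print(clean_caption("Two dogs play"))
```

Both calls produce a caption whose last word is exactly `endseq`, which matters later when the markers are counted and tokenized as words.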

In [None]:
#Pickling the above dictionary into 'captions.pkl' file which can later be accessed when need arises
dump(captions, open("captions.pkl", "wb"))
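
As a small aside on pickling, here is a hedged sketch of a dump/load round trip that uses `with` blocks so the file handles are closed promptly. The dictionary contents and the file name `demo_captions.pkl` are made up for illustration, and a temporary directory is used so that no real file is touched.

```python
import os
import tempfile
from pickle import dump, load

# A tiny stand-in for the captions dictionary
demo_captions = {"img_1": ["startseq a dog runs endseq"]}

# Writing to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "demo_captions.pkl")
with open(path, "wb") as f:
    dump(demo_captions, f)
with open(path, "rb") as f:
    restored = load(f)

# The restored object is an equal copy of the original dictionary
print(restored == demo_captions)
```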

## InceptionV3
For the image processing part, which converts a given image into its corresponding encoding, we will use the InceptionV3 model with weights pre-trained on ImageNet. ImageNet is a standard dataset used for image classification. The full dataset contains more than 14 million images in a little more than 21 thousand classes; the pre-trained Keras weights come from its 1,000-class subset. 

In [None]:
#Importing Libraries
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing.image import img_to_array, load_img
from keras.models import Model

Using TensorFlow backend.


In [None]:
#Loading the InceptionV3 architecture with pre-trained ImageNet weights
inception = InceptionV3(weights = 'imagenet')
inception.summary()

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels.h5
Model: "inception_v3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 299, 299, 3)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 149, 149, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 149, 149, 32) 96          conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 14

In [None]:
#The above model classifies an image into one of the ImageNet classes,
#but we just need the encoding of an image, so we take the output of the
#second-to-last layer and leave out the final softmax layer. Also, we do
#not want to train the weights of this model, so we set the trainable
#attribute of all layers to False.
for layer in inception.layers:
  layer.trainable = False

In [None]:
#Final InceptionV3 model which gives us an image encoding of shape (None, 2048)
model = Model(inputs = inception.input, outputs = inception.layers[-2].output)
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 299, 299, 3)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 149, 149, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 149, 149, 32) 96          conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 149, 149, 32) 0           batch_normalization_1[0][0]      
____________________________________________________________________________________________

In [None]:
#To use the above model elsewhere we can store the model as json file and its corresponding weights in a '.h5' file 
inception_json = model.to_json()
with open("inception.json", "w") as f:
  f.write(inception_json)
model.save_weights("inception.h5")

Now, we will create a dictionary which stores the image id and its corresponding encoding for all the images in the training set.

In [None]:
#Declaring an empty dictionary to store image encodings of training set images 
image_encodings = dict()
with open("Flickr_8k.trainImages.txt", "r") as f:
  train_images = f.read()
for line in train_images.split("\n"):
  #Skipping blank lines (such as the trailing one), which would otherwise give the invalid path 'Flicker8k_Dataset/.jpg'
  if not line.strip():
    continue
  img_id = line.split('.')[0]
  img = load_img("Flicker8k_Dataset/{}.jpg".format(img_id), target_size = (299, 299))
  #Converting the PIL image into a 3-D numpy array 
  img_ar = img_to_array(img)
  #The model expects images in batches, so we need an extra leading
  #dimension for the number of samples in the batch. For this we
  #use np.expand_dims.
  img_dim = np.expand_dims(img_ar, axis = 0)
  #preprocess_input converts the given image into the format required by the model
  img_ = preprocess_input(img_dim)
  #Predicts the encoding of the image
  predict = model.predict(img_)
  #Output of the model is of shape (1, 2048) so we need to reshape it into (2048, )
  image_encodings[img_id] = predict.reshape(predict.shape[1])


In [None]:
#Checking the number of training set images that were encoded
print(len(image_encodings))

6000


In [None]:
#Pickling the training set images into "train_image_encodings.pkl" file 
dump(image_encodings, open("train_image_encodings.pkl", "wb"))

In [None]:
with open("captions.pkl", "rb") as f:
  captions = load(f)
print(captions["2394267183_735d2dc868"])
len(captions)

['startseq a dog goes through an obstacle course while a person looks on endseq', 'startseq a dog is going through a slalom style obstacle course endseq', 'startseq a dog performs a slalom like obstacle while the owner walks along side endseq', 'startseq a dog plays with a man by running around poles endseq', 'startseq the woman is training a white dog to zigzag through metal poles endseq']


8092

In [None]:
#Creating a dictionary that contains captions of all the images in the training set
train_captions = dict()
with open("Flickr_8k.trainImages.txt","r") as f:
  train_cap = f.read()
try:
  for line in train_cap.split("\n"):
    words = line.split(".")
    train_img_id = words[0]
    if train_img_id in captions:
      train_captions[train_img_id] = captions[train_img_id]
except Exception as e:
  print("Exception: \n",e)
print(len(train_captions))

6000


In [None]:
#Pickling the dictionary containing all the captions in training set into "train_captions.pkl" file
dump(train_captions, open("train_captions.pkl", "wb"))

## Sequential Data Preparation
In order to train our caption generator model, we first need to build a text corpus, from which we will create word-to-index and index-to-word dictionaries. Then, we will remove rarely occurring words from our dictionary. Here, we will be using 300-D GloVe vector embeddings to train our model.

In [None]:
#Importing Libraries
from nltk import FreqDist

In [None]:
#Declaring empty string which can be used for storing text corpus used for the captions of images in training set 
corpus = ""

with open("train_captions.pkl", "rb") as f:
  train_captions = load(f)
#Loop to generate all the captions for a particular image
for all_caps in train_captions.values():
  #Loop to generate each caption from all the captions i.e. one line of data
  for one_cap in all_caps:
    corpus += " " + one_cap

In [None]:
#List of all the words (with repetitions) used in the training text corpus
total_words = corpus.split()
#Set of the unique words used in the training text corpus
dictionary = set(total_words)
print(f"There are {len(dictionary)} unique words in the text corpus.")

There are 8148 unique words in the text corpus.


In [None]:
#Creating the frequency distribution for all the words used
freq = FreqDist(total_words)
#Printing the 10 most common words
freq.most_common(10)

[('a', 46782),
 ('startseq', 30000),
 ('endseq', 27183),
 ('in', 14091),
 ('the', 13508),
 ('on', 7992),
 ('is', 7196),
 ('and', 6678),
 ('dog', 6136),
 ('with', 5763)]

Now, we will remove the words which occur very rarely in our text corpus. We will remove all words that occur 5 times or fewer.

In [None]:
#Removing the words that occur 5 times or fewer from the dictionary
for word in list(dictionary):
  if freq[word]<=5:
    dictionary.remove(word)

dict_size = len(dictionary)+1
print(f"There are {len(dictionary)} unique words in the text corpus, after removal of less frequent words.")

There are 2354 unique words in the text corpus, after removal of less frequent words.
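
The threshold in the cell above is `freq[word] <= 5`, so a word seen exactly five times is removed as well. The following minimal sketch shows the same rule on toy data. It uses `collections.Counter` in place of `FreqDist` to stay dependency-free, and the helper name `frequent_words` and the word list are made up for illustration.

```python
from collections import Counter

def frequent_words(words, threshold):
    """Return the set of words that occur more than `threshold` times."""
    counts = Counter(words)
    return {word for word, count in counts.items() if count > threshold}

# "dog" occurs 6 times, "cat" exactly 5 times and "emu" once
demo_words = ["dog"] * 6 + ["cat"] * 5 + ["emu"] * 1
print(frequent_words(demo_words, 5))
```

Only "dog" survives here, because "cat" sits exactly on the threshold.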


In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical 

In [None]:
#Declaring a variable to store the list of all the captions in the training set text corpus
caption_list = []
for all_caps in train_captions.values():
  for one_cap in all_caps:
    caption_list.append(one_cap)
print(f"There are {len(caption_list)} captions present in the training set.")

There are 30000 captions present in the training set.


We need to map each word to an index; only then will the text data be ready to be sent as input to the sequential model. This can be done with the help of Tokenizer. Tokenizer has an argument num_words whose default value is None, which results in tokenizing all the words in the corpus. But, as we don't want the less frequent words to get tokenized, we set num_words to our dictionary size, which is 2355 here (i.e. length of dictionary + 1). Now, only the most frequent 'num_words - 1' words will be kept.   

In [None]:
tokenizer = Tokenizer(num_words = dict_size)
tokenizer.fit_on_texts(caption_list)
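
To illustrate why num_words is set to the dictionary size plus one, here is a plain-Python imitation of the idea: words are ranked by frequency starting at index 1, index 0 stays free for padding, and only indices below num_words are kept. This is a sketch under those assumptions, not the Keras implementation, and the helper name `build_index` and the toy captions are hypothetical.

```python
from collections import Counter

def build_index(texts, num_words):
    """Rank words by frequency (1 = most frequent) and keep ranks below num_words."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left unused so that it can serve as the padding value
    ranked = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
    return {word: idx for word, idx in ranked.items() if idx < num_words}

demo_texts = [
    "startseq a dog runs endseq",
    "startseq a dog sleeps endseq",
    "startseq a cat endseq",
]
print(build_index(demo_texts, num_words=4))
```

With num_words=4 only the three most frequent words keep an index, which mirrors the 'num_words - 1' behaviour described above.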

In [None]:
#Creating a dictionary to map an index to word
idx_to_word = tokenizer.index_word
#Removing less frequent words from the dictionary as discussed above
for idx in list(idx_to_word):
  if(idx>=dict_size):
    idx_to_word.pop(idx, None)

#Creating a dictionary to map each word to an index using idx_to_word dictionary
word_to_idx = dict()
for idx,word in idx_to_word.items():
  word_to_idx[word] = idx

print(len(idx_to_word))
print(len(word_to_idx))

#Pickling both the dictionaries so that they can be used later
dump(idx_to_word, open("idx_to_word", "wb"))
dump(word_to_idx, open("word_to_idx", "wb"))

2354
2354


In [None]:
#Variable to store the maximum number of words in a caption among all the captions in the training set
max_length = 0
for caption in caption_list:
  temp = len(caption.split())
  max_length = max(temp, max_length)

print(f"Maximum caption length: {max_length}")

Maximum caption length: 39


If we feed all the data to the model at once, it will occupy a lot of memory. Instead, we will use a generator that creates and yields the data one batch at a time. Each batch holds the examples derived from a fixed number of images (num_photos_per_batch). 

We loop forever with a while loop and, within it, loop over each image id in the captions dictionary. For each image id, we look up the image encoding and create all of the input-output sequence pairs from the image's captions.

In [None]:
def generator(captions, img_encoding, max_length, dict_size, num_photos_per_batch):
  #Creating a list of inputs and outputs for the sequence model
  X1, X2, Y = list(), list(), list()
  n = 0
  while True:
    for id, caption_list in captions.items():
      n += 1
      #Extracting image encodings from the provided dictionary
      img = img_encoding[id]
      #Looping over all the captions, one caption at a time
      for caption in caption_list:
        #Mapping each word in the caption to its corresponding index using the tokenizer fitted above
        sequence = tokenizer.texts_to_sequences([caption])[0]
        #Starting at 1 so that the input always contains at least the start marker
        for i in range(1, len(sequence)):
          #Our input sequence consists of all the words seen up till now and our output is the next word
          inp_sequence, out_sequence = sequence[:i], sequence[i]
          #Padding the input sequence with 0's at the end to make the input sequence of max_length
          inp_sequence = pad_sequences([inp_sequence], maxlen = max_length, padding = 'post')[0]
          #To generate one hot encoded vector for the output word 
          out_sequence = to_categorical([out_sequence], num_classes = dict_size)[0]
          X1.append(img)
          X2.append(inp_sequence)
          Y.append(out_sequence)
      #If the number of images processed equals num_photos_per_batch then we yield the input and output lists 
      if n == num_photos_per_batch:
        #Yielding is similar to a return statement, but it saves the current state of the function and resumes from the yield on the next call
        yield [[np.array(X1), np.array(X2)], np.array(Y)]
        X1, X2, Y = list(), list(), list()
        n = 0 
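
To see what the generator produces for a single caption, here is a minimal plain-Python sketch of the expansion into input-output pairs. The helper name `caption_to_pairs` and the word indices are hypothetical, and the padding is written out by hand to imitate `padding = 'post'` rather than calling `pad_sequences`.

```python
def caption_to_pairs(sequence, max_length):
    """Expand one encoded caption into (padded input, next word) pairs."""
    pairs = []
    for i in range(1, len(sequence)):
        inp, out = sequence[:i], sequence[i]
        # Post-padding with 0 up to max_length, like padding='post' in pad_sequences
        padded = inp + [0] * (max_length - len(inp))
        pairs.append((padded, out))
    return pairs

# Hypothetical indices standing for "startseq dog runs endseq"
for padded, out in caption_to_pairs([1, 9, 27, 2], max_length=6):
    print(padded, "->", out)
```

A caption of four words yields three training pairs, each asking the model to predict the next word from the words before it.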

Now, we will download the GloVe embeddings and use the 300-D word vectors.
They can be downloaded from http://nlp.stanford.edu/data/glove.6B.zip. 

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-07-21 09:32:56--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-07-21 09:32:56--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-07-21 09:32:56--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [None]:
#Unzipping the GloVe file
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
#Creating a dictionary which maps each word in the GloVe file to its embedding values
glove = dict()
with open("glove.6B.300d.txt", "r", encoding = "utf-8") as f:
  #Iterating over the file object reads one line at a time instead of loading the whole file into memory
  for line in f:
    words = line.split()
    word = words[0]
    coefficients = list(map(float,words[1:]))
    glove[word] = coefficients
print(len(glove))
dump(glove, open("glove.pkl", "wb"))

400000


In [None]:
with open("glove.pkl", "rb") as f:
  glove = load(f)
  glove_words = set(glove.keys())
print(len(glove_words))
embedding_size = 300
#Creating an embedding matrix to store word embeddings of all the words in our text corpus 
embedding_matrix = np.zeros((dict_size, embedding_size))
print(embedding_matrix.shape)

400000
(2355, 300)


In [None]:
#If a word in our text corpus is also present in the GloVe word embeddings then we store its values in embedding_matrix
#All other rows keep the zeros that embedding_matrix was initialised with
for word, idx in word_to_idx.items():
  if word in glove_words:
    embedding_matrix[idx] = glove[word]
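
The embedding lookup can be sketched end to end on toy data. The example below is a simplified illustration that uses nested lists instead of a NumPy array; the helper names `parse_glove_line` and `build_embedding_matrix` are hypothetical, and the 3-D vectors are made up (the real GloVe rows used here are 300-D).

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, list of floats)."""
    parts = line.split()
    return parts[0], [float(value) for value in parts[1:]]

def build_embedding_matrix(word_to_idx, vectors, size):
    """Row idx holds the vector of the word with that index, or zeros if unknown."""
    # One extra row because index 0 is reserved for padding
    matrix = [[0.0] * size for _ in range(len(word_to_idx) + 1)]
    for word, idx in word_to_idx.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

# Made-up 3-D vectors purely for illustration
demo_lines = ["dog 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
demo_vectors = dict(parse_glove_line(line) for line in demo_lines)
demo_matrix = build_embedding_matrix({"dog": 1, "emu": 2}, demo_vectors, size=3)
print(demo_matrix)
```

The row for "emu" stays all zeros because that word has no vector in the toy GloVe data, which is the same fallback the notebook uses.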

In [None]:
#Pickling the embedding matrix and tokenizer
dump(embedding_matrix, open("embedding_matrix.pkl", "wb"))
dump(tokenizer, open("tokenizer.pkl", "wb"))