# Image Caption Generator 

What is image caption generation?
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence. 

This is what we will implement in this Python-based project, combining two deep learning techniques: a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM), which is a type of Recurrent Neural Network.

Source: https://paperswithcode.com/task/image-captioning


### Image Caption Generator with CNN

The objective of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

# The Dataset for This Python-Based Project

For the image caption generator, we will be using the Flickr_8K dataset. 

You can download it from here:

* <a href="https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip">Flickr8k_Dataset.zip</a> (Size: 1 GB).

#### Let's begin coding.

# Import Modules 

In [None]:
import os  # file and path handling
import pickle  # storing the extracted numpy features
import numpy as np
from tqdm.notebook import tqdm  # progress bar showing how much data has been processed

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input  # extract features from image data
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add

**os** - used to handle files using system commands.

**pickle** - used to store the extracted numpy features on disk

**numpy** - used to perform a wide variety of mathematical operations on arrays

**tqdm** - progress bar decorator for iterators. Includes a default range iterator printing to stderr.

**VGG16, preprocess_input** - the pretrained model and its input preprocessing function, used for feature extraction from the image data

**load_img, img_to_array** - used for loading an image and converting it to a numpy array

**Tokenizer** - used to convert the text into sequences of integer tokens

**pad_sequences** - used to bring all sequences to the same length by filling the remaining positions with zeros

**plot_model** - used to visualize the architecture of the model as an image

#### Now we must set the directories to use the data

In [None]:
BASE_DIR = # YOUR CODE HERE
WORKING_DIR = # YOUR CODE HERE

# Extract Image Features

We first have to load the pretrained model and then restructure it.

VGG-16 is a convolutional neural network that is 16 layers deep. You can load a pretrained version of the network trained on more than a million images from the ImageNet database. The pretrained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals.

In [None]:
# Load vgg16 Model
# YOUR CODE HERE

# restructure model
# YOUR CODE HERE

# Summarize
# YOUR CODE HERE

+ The final fully connected (classification) layer of the VGG16 model is not needed; we only need the layers before it to extract the image features.

+ You may include more layers if you prefer, but for quicker results avoid adding unnecessary ones.

# Extract the image features
Now we extract the image features and load the data for preprocessing.

In [None]:
# extract features from image
features = {} # Dictionary 'features' is created and will be loaded with the extracted features of image data
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # load the image from file
    # YOUR CODE HERE
    # convert image pixels to numpy array
    # YOUR CODE HERE
    # reshape data for model
    # YOUR CODE HERE
    # preprocess image for vgg
    # YOUR CODE HERE
    # extract features
    # YOUR CODE HERE
    # get image ID
    # YOUR CODE HERE
    # store feature
    # YOUR CODE HERE

In [None]:
# store features in pickle
# YOUR CODE HERE

Extracted features live only in memory and are not stored on disk, so every new session would have to re-extract them, which extends the running time.

Dump the dictionary to a pickle file so that you can reload it later and save time.
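The dump-and-reload round trip can be sketched as follows. This is only a minimal illustration: the file name `features.pkl`, the temporary folder standing in for `WORKING_DIR`, and the `(1, 4096)` feature shape are assumptions, not values fixed by this notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the real dictionary: image ID -> extracted feature array.
# The (1, 4096) shape is an assumption used only for this demo.
features = {"img_001": np.zeros((1, 4096), dtype=np.float32)}

# A temporary folder plays the role of WORKING_DIR here.
working_dir = tempfile.mkdtemp()
features_path = os.path.join(working_dir, "features.pkl")  # hypothetical file name

# Dump the dictionary to disk once...
with open(features_path, "wb") as f:
    pickle.dump(features, f)

# ...and reload it in later sessions instead of re-running the feature extraction.
with open(features_path, "rb") as f:
    loaded_features = pickle.load(f)

print(loaded_features["img_001"].shape)
```

Opening the file in binary mode (`"wb"` / `"rb"`) is required because pickle writes bytes, not text.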

In [None]:
# load features from pickle
# YOUR CODE HERE

Loading the stored feature data back into your project is much quicker than extracting it again.

## Load the Captions Data

Let us read the captions data from the text file and store it in a variable.

In [None]:
# YOUR CODE HERE

### Now we split the captions data and map each caption to its image

In [None]:
# create mapping of image to captions
mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    # YOUR CODE HERE
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    # YOUR CODE HERE
    # convert caption list to string
    # YOUR CODE HERE
    # create list if needed
    # YOUR CODE HERE
    # store the caption
    # YOUR CODE HERE

+ The dictionary 'mapping' is created with the image_id as key and the list of corresponding caption texts as value

+ The same image may have multiple captions; **if image_id not in mapping: mapping[image_id] = []** creates an empty list the first time an image is seen, so that every caption can be appended to the corresponding image
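As a rough illustration of the mapping described above, here is a small self-contained sketch. The sample text is made up, and it assumes each line has the form `image_name,caption`; the real captions file may differ (for example, it may start with a header line).

```python
# Made-up sample in the assumed "image_name,caption" format.
sample_doc = (
    "1000.jpg,A child climbs the stairs .\n"
    "1000.jpg,A girl goes into a wooden building .\n"
    "2000.jpg,A dog runs on the grass .\n"
)

mapping = {}
for line in sample_doc.split("\n"):
    # Skip empty lines such as the one after the trailing newline.
    if len(line) < 2:
        continue
    tokens = line.split(",")
    image_id, caption = tokens[0], tokens[1:]
    # Remove the file extension from the image ID.
    image_id = image_id.split(".")[0]
    # Re-join the caption in case it was split into several pieces.
    caption = " ".join(caption)
    # Create the list the first time this image is seen.
    if image_id not in mapping:
        mapping[image_id] = []
    mapping[image_id].append(caption)

print(len(mapping))
```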

In [None]:
# Print the number of images loaded
# YOUR CODE HERE

# Preprocess Text Data

In [None]:
def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            # YOUR CODE HERE
            # preprocessing steps
            # convert to lowercase
            # YOUR CODE HERE
            # delete digits, special chars, etc., 
            # YOUR CODE HERE
            # delete additional spaces
            # YOUR CODE HERE
            # add start and end tags to the caption
            # YOUR CODE HERE

This function cleans and normalizes the caption text in place, which speeds up processing and gives better results.
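The cleaning steps listed in the comments could look roughly like this for a single caption. This is a sketch under assumptions: the helper name `clean_caption` is hypothetical, and keeping only the letters a-z is one possible reading of "delete digits, special chars, etc.".

```python
import re


def clean_caption(caption):
    """Hypothetical helper: normalize one caption and add start/end tags."""
    # Convert to lowercase.
    caption = caption.lower()
    # Delete digits and special characters (keep only letters and spaces).
    caption = re.sub(r"[^a-z ]", "", caption)
    # Collapse repeated spaces and trim the ends.
    caption = re.sub(r"\s+", " ", caption).strip()
    # Add the start and end tags used during training and generation.
    return "startseq " + caption + " endseq"


print(clean_caption("A dog  runs, FAST 2 times!"))
```

The `startseq` and `endseq` tags give the model a fixed token to begin from and a token that signals when to stop.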

Let us visualize the text **before** and **after** cleaning

In [None]:
# before preprocess of text
# YOUR CODE HERE

In [None]:
# preprocess the text
# YOUR CODE HERE

In [None]:
# after preprocess of text
# YOUR CODE HERE

In [None]:
# Store the preprocessed captions into a list named "all_captions"
# YOUR CODE HERE

In [None]:
# Print the number of unique captions stored
# YOUR CODE HERE

# Visualize some captions
Print the first ten captions

In [None]:
# YOUR CODE HERE

# Processing of Text Data
Now we start processing the text data

In [None]:
# tokenize the text
# YOUR CODE HERE

In [None]:
# Print the number of unique words
# YOUR CODE HERE

In [None]:
# get maximum length of the caption available
# YOUR CODE HERE

+ The maximum caption length (in words) is used later as the target length when padding the sequences.
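One way to compute this value, assuming `all_captions` is a list of cleaned caption strings (the two captions below are made up for the demo):

```python
# Made-up captions standing in for the real all_captions list.
all_captions = [
    "startseq a dog runs endseq",
    "startseq two children play in the park endseq",
]

# Length is counted in whitespace-separated words, including the start/end tags.
max_length = max(len(caption.split()) for caption in all_captions)
print(max_length)
```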

# Train Test Split

#### After preprocessing the data, we now split it into training and test sets

In [None]:
# YOUR CODE HERE
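A simple split can be sketched like this. The 90/10 ratio and the sample IDs are assumptions for illustration; in the notebook the IDs would come from the keys of `mapping`.

```python
# Made-up image IDs; in the notebook these would be list(mapping.keys()).
image_ids = ["img_%02d" % i for i in range(10)]

# Assumed ratio: the first 90% of the images for training, the rest for testing.
split = int(len(image_ids) * 0.90)
train = image_ids[:split]
test = image_ids[split:]

print(len(train), len(test))
```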

**Now we will define a generator that yields the data in batches and pads the input sequences**

In [None]:
# create data generator to get data in batch (avoids session crash)
def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # loop over images
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:
        for key in data_keys:
            n += 1
            captions = mapping[key]
            # process each caption
            # encode the sequence
            # split the sequence into X, y pairs
            # split into input and output pairs
            # pad input sequence
            # encode output sequence
            # store the sequences

Padding brings every input sequence to the same length (the maximum caption length) by filling the unused positions with zeros, so that the sequences can be stacked into one batch.
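To make the "split into input and output pairs" step concrete, here is a numpy-only sketch of how one encoded caption expands into training pairs. The helper `caption_to_pairs` is hypothetical; it imitates what `pad_sequences` (which pads at the front by default) and `to_categorical` would produce, and it assumes the caption is no longer than `max_length`.

```python
import numpy as np


def caption_to_pairs(seq, max_length, vocab_size):
    """Hypothetical helper: expand one encoded caption into (input, output) pairs."""
    X, y = [], []
    for i in range(1, len(seq)):
        # The input is every word so far; the output is the next word.
        in_seq, out_word = seq[:i], seq[i]
        # Pad the input with zeros at the front (assumes len(seq) <= max_length).
        padded = np.zeros(max_length, dtype=np.int32)
        padded[max_length - len(in_seq):] = in_seq
        # One-hot encode the output word.
        one_hot = np.zeros(vocab_size, dtype=np.float32)
        one_hot[out_word] = 1.0
        X.append(padded)
        y.append(one_hot)
    return np.array(X), np.array(y)


# Made-up token IDs for a four-word caption.
X, y = caption_to_pairs([1, 4, 7, 2], max_length=5, vocab_size=8)
print(X)
print(y.argmax(axis=1))
```

A caption of n words therefore yields n - 1 training pairs, which is why generating the data in batches helps keep memory use under control.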

# Model Creation

In [None]:
# encoder model
# Image feature layers (create Input, dropout and Dense layers)
# Input layer will have the shape of the output length of the features from the VGG model
# YOUR CODE HERE
# sequence feature layers (create Input, Embedding, Dropout and LSTM layers)
# YOUR CODE HERE

# decoder model (add the outputs of the previous layers, then a Dense layer on that result, then a final Dense layer on top)
# YOUR CODE HERE

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
# compile the model (use a loss function suited to categorical outputs)
# YOUR CODE HERE

# plot the model (optional but useful to understand it)
# YOUR CODE HERE

# Train Model
Now let us train the model

In [None]:
# train the model
epochs = 20
batch_size = 32
steps = len(train) // batch_size

for i in range(epochs):
    # create data generator
    # YOUR CODE HERE
    # fit for one epoch
    # YOUR CODE HERE

+ **steps = len(train) // batch_size** - the number of batches per epoch; after each batch the model performs backpropagation and the generator fetches the next batch

+ The loss should decrease gradually over the iterations

+ Increasing the number of epochs may give better results

+ Adjust the number of epochs and the batch size for quicker results


### You can save the model in the working directory for reuse

In [None]:
# save the model
# YOUR CODE HERE

# Generate Captions for the Image

In [None]:
def idx_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

+ Converts the index predicted by the model back into a word
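The `idx_to_word` function above scans the whole vocabulary on every call. As an optional alternative, the dictionary can be inverted once and then queried directly. The sketch below uses a plain, made-up `word_index` dictionary in place of `tokenizer.word_index`, and the name `idx_to_word_fast` is hypothetical.

```python
# Made-up vocabulary standing in for tokenizer.word_index (word -> index).
word_index = {"startseq": 1, "endseq": 2, "dog": 3, "runs": 4}

# Invert the dictionary once: index -> word.
index_word = {index: word for word, index in word_index.items()}


def idx_to_word_fast(integer, index_word):
    """Hypothetical variant: return the word for an index, or None if unknown."""
    return index_word.get(integer)


print(idx_to_word_fast(3, index_word))
```

Since caption generation calls this lookup once per predicted word, a constant-time dictionary lookup can be noticeably quicker on a large vocabulary.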

In [None]:
# generate caption for an image
def predict_caption(model, image, tokenizer, max_length):
    # add start tag for generation process 'startseq'
    # YOUR CODE HERE
    # iterate over the max length of sequence
    # YOUR CODE HERE
    # encode input sequence
    # YOUR CODE HERE
    # pad the sequence
    # YOUR CODE HERE
    # predict next word
    # YOUR CODE HERE
    # get index with high probability
    # YOUR CODE HERE
    # convert index to word
    # YOUR CODE HERE
    # stop if word not found
    # YOUR CODE HERE
    # append word as input for generating next word
    # YOUR CODE HERE
    # stop if we reach end tag 'endseq'
    # YOUR CODE HERE

+ The caption generator appends the predicted words one at a time to build the caption for an image

+ The caption starts with 'startseq', and the model keeps predicting the next word until it produces 'endseq' or the maximum length is reached
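The generation loop can be illustrated without a trained network. In the sketch below, `fake_predict` is a hypothetical stand-in for the model that simply points to the next vocabulary index; the image features and the padding step are left out to keep the example small. Only the greedy word-by-word loop is the point here.

```python
import numpy as np

# Made-up vocabulary (word -> index) and its inverse.
word_index = {"startseq": 1, "a": 2, "dog": 3, "runs": 4, "endseq": 5}
index_word = {index: word for word, index in word_index.items()}


def fake_predict(sequence):
    """Hypothetical stand-in for the model: gives probability 1 to the next index."""
    probs = np.zeros(len(word_index) + 1)
    probs[min(sequence[-1] + 1, len(word_index))] = 1.0
    return probs


def greedy_caption(predict_next, max_length):
    # Start the generation process with the start tag.
    in_text = "startseq"
    for _ in range(max_length):
        # Encode the words generated so far.
        sequence = [word_index[word] for word in in_text.split()]
        # Pick the index with the highest probability.
        yhat = int(np.argmax(predict_next(sequence)))
        word = index_word.get(yhat)
        # Stop if the index does not map to a word.
        if word is None:
            break
        # Append the word as input for the next step.
        in_text += " " + word
        # Stop once the end tag is produced.
        if word == "endseq":
            break
    return in_text


caption = greedy_caption(fake_predict, max_length=10)
print(caption)
```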

# Model Validation
Now we validate the model on the test data using the BLEU score.

In [None]:
from nltk.translate.bleu_score import corpus_bleu
# validate with test data
actual, predicted = list(), list()

for key in tqdm(test):
    # get actual caption
    # YOUR CODE HERE
    # predict the caption for image
    # YOUR CODE HERE
    # split into words
    # YOUR CODE HERE
    # append to the list
    # YOUR CODE HERE

# calculate BLEU score and print for BLEU-1 with weights=(1.0, 0, 0, 0)
# and for BLEU-2 with weights=(0.5, 0.5, 0, 0)
# YOUR CODE HERE

+ The BLEU score is used to evaluate a predicted text against one or more reference texts, each given as a list of tokens.

+ The reference text for each image contains the tokenized words of all its captions from the captions data (actual_captions)

+ A BLEU score above **0.4 is considered a good result**; for a better score, increase the number of epochs accordingly.
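To give an intuition for what BLEU-1 measures, here is a simplified pure-Python re-implementation: clipped unigram precision multiplied by a brevity penalty. The function `corpus_bleu1` is a hypothetical teaching sketch, not NLTK's code; for the actual validation, use `corpus_bleu` as imported in the cell above.

```python
import math
from collections import Counter


def corpus_bleu1(list_of_references, hypotheses):
    """Hypothetical sketch: corpus-level BLEU-1 (unigram precision x brevity penalty)."""
    clipped, total, ref_len = 0, 0, 0
    for references, hypothesis in zip(list_of_references, hypotheses):
        hyp_counts = Counter(hypothesis)
        # For each word, the highest count seen in any single reference.
        max_ref_counts = Counter()
        for reference in references:
            for word, count in Counter(reference).items():
                max_ref_counts[word] = max(max_ref_counts[word], count)
        # A predicted word only counts as often as it appears in a reference.
        clipped += sum(min(count, max_ref_counts[word]) for word, count in hyp_counts.items())
        total += len(hypothesis)
        # Use the reference whose length is closest to the prediction.
        ref_len += min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    if clipped == 0:
        return 0.0
    # Penalize predictions that are shorter than the references.
    brevity_penalty = 1.0 if total > ref_len else math.exp(1 - ref_len / total)
    return brevity_penalty * clipped / total


# Made-up example: one image with one reference caption.
actual = [[["a", "dog", "runs", "fast"]]]
predicted = [["a", "a", "cat", "runs"]]
score = corpus_bleu1(actual, predicted)
print(score)
```

In this made-up example only two of the four predicted words are credited ("a" once, because of clipping, and "runs"), so the unigram precision is 2/4.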

## Visualize the Results

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
def generate_caption(image_name):
    # load the image
    # YOUR CODE HERE
    print('---------------------Actual---------------------')
    # print the actual caption 
    # YOUR CODE HERE
    # predict the caption
    # YOUR CODE HERE
    print('--------------------Predicted--------------------')
    # print the predicted caption
    # YOUR CODE HERE

+ Image caption generator defined

+ First prints the actual captions of the image then prints a predicted caption of the image

In [None]:
# visualize the result for some images 
# YOUR CODE HERE

# Final Thoughts

Please share your thoughts on how to get better and more accurate results.