# Image Caption Generator with CNN & LSTM

When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers worked on this problem for a long time, and it was long considered out of reach. With advances in deep learning, the availability of large datasets, and more computing power, we can now build models that generate captions for an image.

This is what we will implement in this Python project, combining a Convolutional Neural Network (CNN) with a type of Recurrent Neural Network, the LSTM.

## What is an Image Caption Generator?

Image caption generation is a task that combines computer vision and natural language processing to recognize the context of an image and describe it in a natural language such as English.

# The Dataset

For the image caption generator, we will use the Flickr_8K dataset. Larger datasets such as Flickr_30K and MSCOCO exist, but training on them can take weeks, so we use the smaller Flickr8k dataset. The advantage of a larger dataset is that it lets us build better models.

We get the dataset from **Kaggle**; you can download it here: <a href="https://www.kaggle.com/datasets/adityajn105/flickr8k">Kaggle-Flickr8k</a> (Size: 1GB).


In [None]:
import os   # handling the files
import pickle # storing numpy features
import numpy as np
from tqdm.notebook import tqdm # progress bar showing how much data has been processed

from tensorflow.keras.applications.vgg16 import VGG16 # alternative feature extractor (not used below)
from tensorflow.keras.applications.resnet import ResNet101, preprocess_input # extract features from image data
from tensorflow.keras.preprocessing.image import load_img , img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input , Dense , LSTM , Embedding , Dropout , add

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Now we must set the directories to use the data

In [None]:
WORKING_DIR = "/content/drive/MyDrive/ML/"
BASE_DIR = "/content/drive/MyDrive/ML/"

# Extract Image Features

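We load ResNet101 pretrained on ImageNet without its classification head; with `pooling='avg'` it outputs one 2048-dimensional feature vector per image.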


In [None]:
# Load ResNet101 Model
model = ResNet101(include_top=False, weights='imagenet', pooling='avg')

# wrap the network in a Model; with include_top=False the output is already the pooled feature vector
model = Model(inputs=model.inputs, outputs=model.output)

#### Now we extract the image features and load the data for preprocessing

In [None]:
# extract features from image
features = {}
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # load the image from file
    img_path = os.path.join(directory, img_name)
    image = load_img(img_path, target_size=(224, 224))
    # convert image pixels to numpy array
    image = img_to_array(image)
    # reshape data for model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # preprocess image for ResNet
    image = preprocess_input(image)
    # extract features
    feature = model.predict(image, verbose=0)
    # get image ID
    image_id = img_name.split('.')[0]
    # store feature
    features[image_id] = feature

  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
np.array(list(features.values())).shape

(1, 1, 2048)

The `features` dictionary holds one feature vector of length 2048 per image:<br>
* **Key:** image file name without the `.jpg` extension
* **Value:** feature vector
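
The key is derived from the file name. As a small sketch of that step, the snippet below uses `os.path.splitext`, which removes only the final extension, as an alternative to `split('.')[0]`. The helper name `image_id_from_filename` is hypothetical and not part of the notebook.

```python
import os

def image_id_from_filename(img_name):
    """Return the file name without its final extension."""
    stem, _ext = os.path.splitext(img_name)
    return stem

print(image_id_from_filename("3765374230_cb1bbee0cb.jpg"))
```

For Flickr8k file names, which contain a single dot, both approaches give the same ID.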

In [None]:
# store features in pickle
with open(os.path.join(WORKING_DIR, 'resnet101_features.pkl'), 'wb') as f:
    pickle.dump(features, f)

Extracted features live only in memory, so without saving them every new session would have to re-run the slow extraction step.

Dump the dictionary to a pickle file so it can be reloaded later, which saves time.

In [None]:
# load features from pickle
with open(os.path.join(WORKING_DIR, 'resnet101_features.pkl'), 'rb') as f:
    features = pickle.load(f)

In [None]:
# # load features from INPUT pickle
# with open(os.path.join("/kaggle/input/featurespkl", 'features.pkl'), 'rb') as f:
#     features = pickle.load(f)

Loading the stored features avoids re-running the extraction and shortens the runtime.
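
The save-then-reload steps can be combined into a "load if cached, otherwise compute and store" pattern. The following is a minimal sketch under stated assumptions: `load_or_compute` and `fake_extract` are hypothetical names, and a tiny dictionary stands in for the real feature extraction.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_extract():
    # stand-in for the slow feature extraction loop
    calls.append(1)
    return {'img_001': [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'features.pkl')
    first = load_or_compute(cache, fake_extract)   # computes and writes the cache
    second = load_or_compute(cache, fake_extract)  # reads the cache

print(first == second, len(calls))
```

The second call never runs the extraction function, which is the time saving the pickle file is meant to provide.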

In [None]:
features.keys()

dict_keys(['3765374230_cb1bbee0cb'])

## Load the Captions Data

Let us read the captions from the text file; `next(f)` skips the header line.

In [None]:
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)
    captions_doc = f.read()

### Now we split each line and map the captions to their image

In [None]:
# create mapping of image to captions
mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    tokens = line.split(',')
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    image_id = image_id.split('.')[0]
    # convert caption list to string
    caption = " ".join(caption)
    # keep only captions for this one image ID
    if image_id != "3765374230_cb1bbee0cb":
        continue
    # create the list on first use, then append the caption
    mapping.setdefault(image_id, []).append(caption)

  0%|          | 0/40456 [00:00<?, ?it/s]
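
Each line of `captions.txt` has the form `image_name,caption`. Because a caption can itself contain commas, splitting on every comma and re-joining with spaces drops them. The sketch below runs on two made-up lines rather than the real file and splits only on the first comma; the function name `parse_captions` is an assumption for illustration.

```python
def parse_captions(lines):
    """Map image ID -> list of captions, splitting on the first comma only."""
    mapping = {}
    for line in lines:
        if ',' not in line:
            continue  # skip blank or malformed lines
        image_name, caption = line.split(',', 1)
        image_id = image_name.split('.')[0]
        mapping.setdefault(image_id, []).append(caption.strip())
    return mapping

sample = [
    "img_001.jpg,A girl stands in the spray of water, a bicycle behind her .",
    "img_001.jpg,A little girl reacts to a spray of water .",
    "",
]
parsed = parse_captions(sample)
print(parsed)
```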

# Preprocess Text Data

In [None]:
def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # preprocessing steps
            # convert to lowercase
            caption = caption.lower()
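            # intended to delete digits, special chars, etc.
            # NOTE: str.replace matches its pattern literally, so this call
            # and the next one leave the caption unchanged
            caption = caption.replace('[^A-Za-z]', '')
            # intended to delete additional spaces
            caption = caption.replace(r'\s+', ' ')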
            # add start and end tags to the caption
            caption = 'startseq ' + " ".join([word for word in caption.split() if len(word)>1]) + ' endseq'
            captions[i] = caption

`clean` lowercases each caption, drops one-character words, and wraps it in `startseq`/`endseq` tags so the model knows where a caption begins and ends.
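
Note that `str.replace` treats its first argument as a literal string, so the two `replace` calls in `clean` do not behave like regular expressions. A sketch of a regex-based variant using the standard `re` module follows. `clean_caption` is a hypothetical helper that handles one caption, and it is not what produced the outputs shown in this notebook.

```python
import re

def clean_caption(caption):
    """Lowercase, strip non-letters, collapse spaces, and add sequence tags."""
    caption = caption.lower()
    # remove everything except lowercase letters and whitespace
    caption = re.sub(r'[^a-z\s]', '', caption)
    # collapse runs of whitespace into a single space
    caption = re.sub(r'\s+', ' ', caption).strip()
    # drop one-character words such as "a"
    words = [word for word in caption.split() if len(word) > 1]
    return 'startseq ' + ' '.join(words) + ' endseq'

cleaned = clean_caption('"A girl in a swimsuit stands in the spray of water ."')
print(cleaned)
```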

Let us visualize the text **before** and **after** cleaning

In [None]:
# before preprocess of text
mapping['3765374230_cb1bbee0cb']

['"A girl in a swimsuit stands in the spray of water   a bicycle in the background ."',
 'A little girl in a green bathing suit is getting splashed in a water fountain .',
 'A little girl reacts to a spray of water .',
 'A young girl in a green bathing suit getting splashed with water .',
 'A young girl wearing a green bathing suit has water sprayed on her back .']

In [None]:
# preprocess the text
clean(mapping)

In [None]:
# after preprocess of text
mapping['3765374230_cb1bbee0cb']

['startseq "a girl in swimsuit stands in the spray of water bicycle in the background ." endseq',
 'startseq little girl in green bathing suit is getting splashed in water fountain endseq',
 'startseq little girl reacts to spray of water endseq',
 'startseq young girl in green bathing suit getting splashed with water endseq',
 'startseq young girl wearing green bathing suit has water sprayed on her back endseq']

#### Next we will store the preprocessed captions into a list

In [None]:
all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)

In [None]:
len(all_captions)

5

# Processing of Text Data
Now we tokenize the captions, mapping each word to an integer index.

In [None]:
# tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

In [None]:
tokenizer.word_index

{'in': 1,
 'startseq': 2,
 'girl': 3,
 'water': 4,
 'endseq': 5,
 'green': 6,
 'bathing': 7,
 'suit': 8,
 'the': 9,
 'spray': 10,
 'of': 11,
 'little': 12,
 'getting': 13,
 'splashed': 14,
 'young': 15,
 'a': 16,
 'swimsuit': 17,
 'stands': 18,
 'bicycle': 19,
 'background': 20,
 'is': 21,
 'fountain': 22,
 'reacts': 23,
 'to': 24,
 'with': 25,
 'wearing': 26,
 'has': 27,
 'sprayed': 28,
 'on': 29,
 'her': 30,
 'back': 31}

In [None]:
vocab_size = len(tokenizer.word_index) + 1

In [None]:

vocab_size

32

Number of unique words, plus one because index 0 is reserved for padding
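
The `+ 1` in `vocab_size` accounts for index 0, which the word index never assigns to a word. Below is a rough pure-Python approximation of how a frequency-ranked word index can be built. It is only a sketch: the real Keras `Tokenizer` also lowercases and strips punctuation, and `build_word_index` is a hypothetical name.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, starting at 1 so that 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common sorts by count; ties keep first-seen order
    ranked = counts.most_common()
    return {word: rank for rank, (word, _count) in enumerate(ranked, start=1)}

captions = ["startseq little girl endseq", "startseq young girl endseq"]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
print(word_index, vocab_size)
```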

In [None]:
# get maximum length of the caption available
max_length = max(len(caption.split()) for caption in all_captions)
max_length

17

+ The maximum caption length (in words) is used as the target length when padding sequences.

# Train Test Split

#### After preprocessing the data, we split it into training and test sets

In [None]:
image_ids = list(mapping.keys())
# split = int(len(image_ids) * 0.90)
# NOTE: the split is disabled here, so train and test contain the same IDs
train = image_ids
test = image_ids

**Now we will define a batch and include the padding sequence**

In [None]:
# create data generator to get data in batch (avoids session crash)
def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # loop over images
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:
        for key in data_keys:
            n += 1
            captions = mapping[key]
            # process each caption
            for caption in captions:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split the sequence into X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pairs
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq],num_classes=vocab_size)[0]
                    # store the sequences
                    X1.append(features[key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
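            # NOTE: this yields after every image, so batch_size is ignored;
            # compare against batch_size to batch several images together
            if n == 1: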
                X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                yield [X1, X2], y
                X1, X2, y = list(), list(), list()
                n = 0

`pad_sequences` brings every input sequence to `max_length` by filling with zeros (on the left by default), so sequences of different lengths can be stacked into one batch.
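
The generator turns each caption into several training pairs: every prefix of the encoded sequence is an input, and the word that follows it is the target. The pure-Python sketch below shows that expansion with left zero-padding, mimicking the default behaviour of `pad_sequences`. `make_pairs` is a hypothetical helper, and the word indices are made up.

```python
def make_pairs(seq, max_length):
    """Expand one encoded caption into (left-padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        prefix = prefix[-max_length:]  # keep the most recent words if too long
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

pairs = make_pairs([2, 12, 3, 5], max_length=5)
for padded, target in pairs:
    print(padded, '->', target)
```

A caption of `n` tokens therefore contributes `n - 1` training samples, which is why the generator's batches grow quickly with caption length.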

# Model Creation

+ **shape=(2048,)** - length of the feature vector produced by the ResNet101 extractor used above

+ **Dense** - fully connected layer

+ **Dropout()** - regularization that randomly zeroes a fraction of a layer's activations during training to reduce overfitting

+ **model.compile()** - configures the model for training with a loss and an optimizer

+ **loss='categorical_crossentropy'** - loss function for one-hot encoded word targets, as produced by `to_categorical` in the data generator

+ **optimizer='adam'** - adaptive optimizer that adjusts the step size for each parameter during training

+ The model plot shows the image branch and the text branch being merged into a single layer

+ Image features were already extracted with ResNet101, so no CNN is needed inside this model.

# Train Model
Now let us train the model

In [None]:
# load glove vectors for embedding layer
embeddings_index = {}
glove_path = '/content/drive/MyDrive/ML/glove.6B.300d.txt'
# read line by line instead of loading the whole file into memory
with open(glove_path, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip('\n').split(" ")
        if len(values) < 2:
            continue  # skip blank lines
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

emb_dim = 300
emb_matrix = np.zeros((vocab_size, emb_dim))

for word, i in tokenizer.word_index.items():
    emb_vec = embeddings_index.get(word)
    if emb_vec is not None:
        emb_matrix[i] = emb_vec
emb_matrix.shape



(32, 300)

In [None]:
# train the model
epochs = 20
batch_size = 32
steps = len(train) // batch_size

for i in range(epochs):
    # create data generator
    generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
    # fit for one epoch
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

ValueError: ignored

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.utils import to_categorical

# Load the InceptionV3 model pretrained on ImageNet
base_model = InceptionV3(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)

# Function to preprocess the input image
def preprocess_image(img_path):
    img = image.load_img(img_path, target_size=(299, 299))
    img_array = image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array = tf.keras.applications.inception_v3.preprocess_input(img_array)
    return img_array

# Function to generate image features
def generate_image_features(img_path):
    img_array = preprocess_image(img_path)
    features = model.predict(img_array)
    return features

# Load and preprocess the image
img_path = "/content/drive/MyDrive/ML/Images/3765374230_cb1bbee0cb.jpg"
image_features = generate_image_features(img_path)

# Define the captioning model
embedding_dim = 300  # Adjust as needed
vocab_size = 10000  # Adjust as needed

inputs1 = Input(shape=(2048,))
fe1 = Dense(embedding_dim, activation='relu')(inputs1)

inputs2 = Input(shape=(None,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = LSTM(300)(se1)  # Adjust the size to match the Dense layer

decoder1 = tf.keras.layers.add([fe1, se2])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)


# Compile the model
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')

# Load the pre-trained weights for the captioning model (optional)
# caption_model.load_weights('path/to/your/pretrained_weights.h5')

# Define a function to generate a caption for a given image
def generate_caption(photo):
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = [word_to_index[word] for word in in_text.split() if word in word_to_index]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = caption_model.predict([photo, sequence], verbose=0)
        yhat = np.argmax(yhat)
        try:
            word = index_to_word[yhat]
        except KeyError:
            # Handle the case where the index is not in the dictionary
            word = 'UNKNOWN'
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text

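# Example usage
max_length = 34  # Adjust as needed based on your dataset
# NOTE: these placeholder mappings are empty, so every predicted index falls
# back to 'UNKNOWN'; tokenizer.word_index and tokenizer.index_word hold the
# real mappings once the tokenizer has been fitted
word_to_index = {}  # Your word-to-index mapping
index_to_word = {}  # Your index-to-word mapping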

caption = generate_caption(image_features)
print(caption)




startseq UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN


In [None]:
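# load the GloVe vectors into the embedding layer and freeze it
# NOTE: `model` must be the captioning model here, and the Embedding layer's
# shape must match emb_matrix (vocab_size x emb_dim); a mismatch makes
# set_weights raise a ValueError
model.layers[2].set_weights([emb_matrix])
model.layers[2].trainable = False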



ValueError: ignored

In [None]:
x = features['3765374230_cb1bbee0cb']

In [None]:
x.shape

(1, 2048)

+ **steps = len(train) // batch_size** - number of batches drawn from the generator per epoch; each batch triggers one backpropagation update

+ The loss should decrease gradually over the iterations

+ Increase the number of epochs for better results

+ Choose the number of epochs and the batch size to balance training time against quality
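
Floor division yields zero steps when there are fewer images than the batch size, as in the single-image run in this notebook (`1 // 32 == 0`). One possible guard, shown as a sketch with a hypothetical helper `steps_per_epoch`, is to round up and never go below one step:

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover all samples, at least one."""
    return max(1, math.ceil(num_samples / batch_size))

print(1 // 32)                    # floor division: no steps at all
print(steps_per_epoch(1, 32))     # rounded up to a single step
print(steps_per_epoch(8000, 32))  # evenly divisible case
```

Rounding up means the last batch of an epoch can be smaller than `batch_size`, or wrap around into the next pass when the generator loops forever.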


### You can save the model in the working directory for reuse

In [None]:
# save the model
model.save(os.path.join(WORKING_DIR, 'model_with_glove.h5'))


# Generate Captions for the Image

In [None]:
def idx_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

+ Convert the predicted index from the model into a word
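
`idx_to_word` scans the whole vocabulary on every call. A sketch of a constant-time alternative is to invert the mapping once and look indices up directly. The small `word_index` dictionary below is a stand-in for `tokenizer.word_index`, and `idx_to_word_fast` is a hypothetical name.

```python
# stand-in for tokenizer.word_index
word_index = {'in': 1, 'startseq': 2, 'girl': 3, 'water': 4, 'endseq': 5}

# build the reverse mapping once
index_word = {index: word for word, index in word_index.items()}

def idx_to_word_fast(integer, index_word):
    """Return the word for an index, or None if the index is unknown."""
    return index_word.get(integer)

print(idx_to_word_fast(3, index_word))
print(idx_to_word_fast(99, index_word))
```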

In [None]:
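from tensorflow.keras.models import load_model

# load the trained captioning model from disk
model = load_model('/content/drive/MyDrive/ML/model_18.h5')

# generate caption for an image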
def predict_caption(model, image, tokenizer, max_length):
    # add start tag for generation process
    in_text = 'startseq'
    # iterate over the max length of sequence
    for i in range(max_length):
        # encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad the sequence
        sequence = pad_sequences([sequence], max_length)
        # predict next word
        yhat = model.predict([image, sequence], verbose=0)
        # get index with highest probability
        yhat = np.argmax(yhat)
        # convert index to word
        word = idx_to_word(yhat, tokenizer)
        # stop if word not found
        if word is None:
            break
        # append word as input for generating next word
        in_text += " " + word
        # stop if we reach end tag
        if word == 'endseq':
            break
    return in_text

+ The caption generator appends the predicted words one at a time to build the caption for an image

+ The caption starts with 'startseq', and the model keeps predicting the next word until 'endseq' appears or `max_length` is reached
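
The decoding loop can be illustrated without a trained network. In the sketch below, a fixed lookup table (`transitions`) stands in for `model.predict` followed by `argmax`; both it and the names `greedy_decode` and `toy_next_word` are assumptions made for illustration only.

```python
def greedy_decode(next_word, max_length):
    """Append the most likely next word until 'endseq' or max_length words."""
    words = ['startseq']
    for _ in range(max_length):
        word = next_word(words)
        if word is None:  # no prediction available: stop early
            break
        words.append(word)
        if word == 'endseq':
            break
    return ' '.join(words)

# toy stand-in for the model: the next word depends only on the last word
transitions = {'startseq': 'girl', 'girl': 'in', 'in': 'water', 'water': 'endseq'}

def toy_next_word(words):
    return transitions.get(words[-1])

decoded = greedy_decode(toy_next_word, max_length=10)
print(decoded)
```

This is greedy decoding: it always takes the single most likely word at each step and never revisits earlier choices.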

## Visualize the Results

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
def generate_caption(image_name):
    # load the image
    # image_name = "1001773457_577c3a7d70.jpg"
    image_id = image_name.split('.')[0]
    img_path = os.path.join(BASE_DIR, "Images", image_name)
    image = Image.open(img_path)
    captions = mapping[image_id]
    print('---------------------Actual---------------------')
    for caption in captions:
        print(caption)
    # predict the caption
    y_pred = predict_caption(model, features[image_id], tokenizer, max_length)
    print('--------------------Predicted--------------------')
    print(y_pred)
    plt.imshow(image)

+ Image caption generator defined

+ First prints the actual captions of the image then prints a predicted caption of the image

In [None]:
generate_caption("1001773457_577c3a7d70.jpg")

In [None]:
generate_caption("1002674143_1b742ab4b8.jpg")

In [None]:
generate_caption("101669240_b2d3e7f17b.jpg")

# Final Thoughts

+ Training the model by increasing the no. of epochs can give better and more accurate results.


**In this project, we built an Image Caption Generator on the Flickr dataset as an advanced deep learning project, using separate models for image feature extraction and text processing.**