# Image Caption Generator using CNN and LSTM

This notebook demonstrates how to build an image caption generator that combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network. We'll use the Flickr8k dataset, which contains about 8,000 images, each paired with five human-written captions.

## Overview of the Process
1. Download and explore the Flickr8k dataset
2. Preprocess the images using a pre-trained CNN (VGG16)
3. Preprocess the captions (tokenization, vocabulary creation)
4. Build the CNN-LSTM model architecture
5. Train the model
6. Evaluate model performance
7. Generate captions for new images


## 1. Download and Explore the Flickr8k Dataset

First, let's install the necessary packages and download the dataset from Kaggle.

In [None]:
# Install required packages
!pip install numpy pandas matplotlib tensorflow keras pillow nltk tqdm kagglehub

In [None]:
# Download the Flickr8k dataset from Kaggle
import kagglehub
import os

# Download the dataset
path = kagglehub.dataset_download("adityajn105/flickr8k")
print("Path to dataset files:", path)

# List the contents of the dataset directory
dataset_path = path
print("Dataset contents:")
for item in os.listdir(dataset_path):
    print(f"- {item}")

### Explore the Dataset Structure

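The code below expects the Kaggle download to contain:
- An `Images` directory with the image files
- A `captions.txt` file with an `image` column and a `caption` column

The original Flickr8k release is organized differently (a `Flickr8k_Dataset` image directory and `Flickr8k_text` caption files), so adjust the paths below if the listing printed above does not match.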

Let's explore the structure and understand the format of the captions.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import random

# Define paths to images and captions
images_dir = os.path.join(dataset_path, 'Images')
captions_file = os.path.join(dataset_path, 'captions.txt')

# Read the captions file
captions_df = pd.read_csv(captions_file, delimiter=',')
print(captions_df.head())

# Count the number of images and captions
num_images = len(os.listdir(images_dir))
num_captions = len(captions_df)
print(f"Number of images: {num_images}")
print(f"Number of captions: {num_captions}")
print(f"Average captions per image: {num_captions / num_images}")
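
As a cross-check on the counts above, the captions can also be grouped by image using only the standard library. This is a minimal sketch that assumes the file is comma-separated with an `image,caption` header, as the pandas call above implies. The sample text and the `group_captions` helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for captions.txt (assumed header: image,caption).
sample_text = '''image,caption
img_001.jpg,A dog runs across the grass .
img_001.jpg,"A brown dog , mid-stride , on a lawn ."
img_002.jpg,Two children play on a beach .
'''

def group_captions(text):
    """Return {image_name: [caption, ...]} from CSV text."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["image"]].append(row["caption"].strip())
    return dict(grouped)

grouped = group_captions(sample_text)
for name, caps in grouped.items():
    print(f"{name}: {len(caps)} caption(s)")
```

Using a CSV parser rather than splitting on the first comma keeps quoted captions that contain commas intact.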

In [None]:
# Display a few random images with their captions
def display_image_with_captions(image_name, captions):
    img_path = os.path.join(images_dir, image_name)
    img = Image.open(img_path)
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Image: {image_name}")
    plt.show()
    
    print("Captions:")
    for i, caption in enumerate(captions, 1):
        print(f"{i}. {caption}")

# Get a random image and its captions
random_images = captions_df['image'].unique()
random_image = random.choice(random_images)
image_captions = captions_df[captions_df['image'] == random_image]['caption'].tolist()

display_image_with_captions(random_image, image_captions)

## 2. Preprocess Images and Captions

### 2.1 Image Preprocessing

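We'll use a pre-trained CNN (VGG16) as a fixed feature extractor: each image is reduced to the 4,096-dimensional output of the network's second-to-last layer. Reusing ImageNet weights is far cheaper than training a CNN from scratch.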

In [None]:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from tqdm import tqdm

# Load the VGG16 model pre-trained on ImageNet data
base_model = VGG16(weights='imagenet')
# Drop the final classification layer and keep the 4096-dimensional output of the layer before it.
# The extractor gets its own name because `model` is reused for the captioning network later on.
feature_extractor = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)

# Function to extract features from an image
def extract_features(image_path):
    # Load the image with target size (224, 224)
    img = load_img(image_path, target_size=(224, 224))
    # Convert the image to array
    img_array = img_to_array(img)
    # Expand dimensions to match the model's expected input shape
    img_array = np.expand_dims(img_array, axis=0)
    # Apply the same preprocessing that VGG16 was trained with
    img_array = preprocess_input(img_array)
    # Extract features; the result has shape (1, 4096)
    features = feature_extractor.predict(img_array, verbose=0)
    return features

# Extract features for all images and store them
def extract_all_features(images_dir):
    features = {}
    # Get list of all image files
    image_files = os.listdir(images_dir)
    
    print(f"Extracting features for {len(image_files)} images...")
    for image_file in tqdm(image_files):
        # Extract features for the image
        image_path = os.path.join(images_dir, image_file)
        image_features = extract_features(image_path)
        # Store features using the image filename as key
        features[image_file] = image_features
    
    return features

# Extract features for all images (this may take some time)
# Uncomment the line below to run the feature extraction
# features = extract_all_features(images_dir)

# To save time, we'll extract features for just a few images for demonstration
def extract_sample_features(images_dir, num_samples=10):
    features = {}
    image_files = random.sample(os.listdir(images_dir), num_samples)
    
    print(f"Extracting features for {len(image_files)} sample images...")
    for image_file in tqdm(image_files):
        image_path = os.path.join(images_dir, image_file)
        image_features = extract_features(image_path)
        features[image_file] = image_features
    
    return features, image_files

# Extract features for a sample of images
sample_features, sample_images = extract_sample_features(images_dir, num_samples=10)

# Save the features to a file
import pickle
with open('sample_features.pkl', 'wb') as f:
    pickle.dump(sample_features, f)

print("Sample features extracted and saved.")
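
To make the `preprocess_input` step less opaque, here is a NumPy sketch of the "caffe"-style preprocessing that Keras documents for VGG16: channels are reordered from RGB to BGR and each channel is zero-centered with the ImageNet means, without rescaling. The function name `vgg16_style_preprocess` and the tiny red test image are illustrative assumptions. In the notebook itself, keep using the library function.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by "caffe"-style preprocessing.
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(rgb_batch):
    """Approximate VGG16 preprocessing for a batch shaped (N, H, W, 3), RGB, values 0-255."""
    x = np.asarray(rgb_batch, dtype=np.float32)
    x = x[..., ::-1]                 # reorder channels RGB -> BGR
    return x - IMAGENET_BGR_MEANS    # zero-center each channel; no rescaling

# A hypothetical 1x2x2 "image" that is pure red.
red = np.zeros((1, 2, 2, 3), dtype=np.float32)
red[..., 0] = 255.0
out = vgg16_style_preprocess(red)
print(out.shape)
print(out[0, 0, 0])  # blue, green, red channels after centering
```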

### 2.2 Caption Preprocessing

Now, let's preprocess the captions by cleaning the text, creating a vocabulary, and preparing the data for training.

In [None]:
import string
import nltk
from nltk.tokenize import word_tokenize

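# Download NLTK tokenizer data ('punkt_tab' is needed by newer NLTK releases, 'punkt' by older ones)
nltk.download('punkt')
nltk.download('punkt_tab')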

# Function to clean captions
def clean_caption(caption):
    # Convert to lowercase
    caption = caption.lower()
    # Remove punctuation
    caption = caption.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the caption
    tokens = word_tokenize(caption)
    # Keep purely alphabetic tokens (drops numbers and tokens that mix letters and digits)
    tokens = [word for word in tokens if word.isalpha()]
    # Join the tokens back into a string
    caption = ' '.join(tokens)
    return caption

# Process all captions
def process_captions(captions_df):
    # Create a dictionary to store captions for each image
    captions_dict = {}
    
    # Group captions by image
    for index, row in captions_df.iterrows():
        image_name = row['image']
        caption = row['caption']
        
        # Clean the caption
        cleaned_caption = clean_caption(caption)
        
        # Add start and end tokens
        processed_caption = 'startseq ' + cleaned_caption + ' endseq'
        
        # Add to dictionary
        if image_name not in captions_dict:
            captions_dict[image_name] = []
        captions_dict[image_name].append(processed_caption)
    
    return captions_dict

# Process all captions
captions_dict = process_captions(captions_df)

# Display a few examples
for image_name in list(captions_dict.keys())[:3]:
    print(f"Image: {image_name}")
    for caption in captions_dict[image_name]:
        print(f"  - {caption}")
    print()
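
The effect of the cleaning steps is easiest to see on a single caption. The sketch below mirrors `clean_caption` but swaps NLTK's `word_tokenize` for a plain whitespace split so it runs without any downloads. The helper names `clean_caption_simple` and `wrap_caption` and the example sentence are assumptions made for illustration.

```python
import string

def clean_caption_simple(caption):
    """Lowercase, strip punctuation, and keep alphabetic tokens only."""
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in caption.split() if tok.isalpha()]
    return " ".join(tokens)

def wrap_caption(caption):
    """Add the start and end markers used throughout this notebook."""
    return "startseq " + clean_caption_simple(caption) + " endseq"

print(wrap_caption("Two dogs, 3 years old, play in the snow!"))
# startseq two dogs years old play in the snow endseq
```

Note that the digit `3` disappears entirely, which slightly changes the meaning of the caption. That is the trade-off of the alphabetic-only filter.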

In [None]:
# Create vocabulary of all words in captions
def create_vocabulary(captions_dict):
    vocabulary = set()
    
    for image_name, captions in captions_dict.items():
        for caption in captions:
            # Add all words to vocabulary
            vocabulary.update(caption.split())
    
    return vocabulary

# Create vocabulary
vocabulary = create_vocabulary(captions_dict)
print(f"Vocabulary size: {len(vocabulary)}")
print(f"Sample words: {list(vocabulary)[:20]}")

In [None]:
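# Create word-to-index and index-to-word mappings
def create_word_mappings(vocabulary):
    # Index 0 is reserved for a '<pad>' placeholder: pad_sequences pads with 0 and the
    # Embedding layer masks 0, so no real word may share that index.
    # Sorting makes the mapping reproducible across runs.
    words = ['<pad>'] + sorted(vocabulary)
    word_to_idx = {word: idx for idx, word in enumerate(words)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return word_to_idx, idx_to_word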

# Create mappings
word_to_idx, idx_to_word = create_word_mappings(vocabulary)

# Save the mappings
with open('word_to_idx.pkl', 'wb') as f:
    pickle.dump(word_to_idx, f)
with open('idx_to_word.pkl', 'wb') as f:
    pickle.dump(idx_to_word, f)

print("Word mappings created and saved.")
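
A quick round trip shows how the two mappings are used together. This sketch follows one common convention, assumed here, in which real words are numbered from 1 so that index 0 can stand for padding. The helper names and the five-word toy vocabulary are hypothetical.

```python
def build_mappings(words):
    """Map words to indices 1..N, leaving 0 free for padding."""
    word_to_idx = {w: i for i, w in enumerate(sorted(set(words)), start=1)}
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return word_to_idx, idx_to_word

def encode(caption, word_to_idx):
    """Convert a caption to indices, silently dropping unknown words."""
    return [word_to_idx[w] for w in caption.split() if w in word_to_idx]

def decode(indices, idx_to_word):
    """Convert indices back to text, skipping the padding index 0."""
    return " ".join(idx_to_word[i] for i in indices if i != 0)

toy_words = "startseq a dog runs endseq".split()
w2i, i2w = build_mappings(toy_words)
ids = encode("startseq a dog runs endseq", w2i)
print(ids)                         # [5, 1, 2, 4, 3]
print(decode([0, 0] + ids, i2w))   # startseq a dog runs endseq
```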

### 2.3 Prepare Training Data

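Now, let's prepare the data for training the model. Each caption is expanded into several training samples: every prefix of the caption, paired with the image features, is an input, and the word that follows the prefix is the target.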

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Function to create sequences for training
def create_sequences(captions_dict, features, word_to_idx, max_length):
    X1, X2, y = [], [], []
    
    # Process each image and its captions
    for image_name, captions in captions_dict.items():
        # Skip images for which we don't have features
        if image_name not in features:
            continue
        
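        # Get image features as a flat vector; predict() returns shape (1, 4096),
        # but the model's image input expects one 4096-length vector per sample
        image_features = np.asarray(features[image_name]).reshape(-1)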
        
        # Process each caption for the image
        for caption in captions:
            # Convert caption to sequence of word indices
            seq = [word_to_idx[word] for word in caption.split() if word in word_to_idx]
            
            # Create input-output pairs for each word in the caption
            for i in range(1, len(seq)):
                # Input: image features and sequence up to current word
                in_seq = seq[:i]
                # Output: next word (target)
                out_seq = seq[i]
                
                # Pad the input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # One-hot encode the output word
                out_seq = to_categorical([out_seq], num_classes=len(word_to_idx))[0]
                
                # Add to training data
                X1.append(image_features)
                X2.append(in_seq)
                y.append(out_seq)
    
    return np.array(X1), np.array(X2), np.array(y)

# Find the maximum caption length
def get_max_length(captions_dict):
    max_length = 0
    for image_name, captions in captions_dict.items():
        for caption in captions:
            length = len(caption.split())
            if length > max_length:
                max_length = length
    return max_length

# Get maximum caption length
max_length = get_max_length(captions_dict)
print(f"Maximum caption length: {max_length}")

# Create sequences for training (using sample features for demonstration)
# Filter captions_dict to only include images in sample_features
sample_captions_dict = {image: captions_dict[image] for image in sample_images if image in captions_dict}

# Create sequences
X1, X2, y = create_sequences(sample_captions_dict, sample_features, word_to_idx, max_length)

print(f"Number of training samples: {len(X1)}")
print(f"Image features shape: {X1.shape}")
print(f"Text sequence shape: {X2.shape}")
print(f"Output shape: {y.shape}")
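
The expansion performed by `create_sequences` can be traced by hand on a single caption. The following pure-Python sketch reproduces the prefix/target pairing and the left padding that `pad_sequences` applies by default. The helper names and the four-index encoding are illustrative assumptions.

```python
def left_pad(seq, max_length, pad_value=0):
    """Left-pad (or left-truncate) a list to max_length, like pad_sequences' defaults."""
    seq = seq[-max_length:]
    return [pad_value] * (max_length - len(seq)) + seq

def expand_caption(seq, max_length):
    """Turn one encoded caption into (padded prefix, next word) training pairs."""
    return [(left_pad(seq[:i], max_length), seq[i]) for i in range(1, len(seq))]

# Hypothetical encoding of "startseq a dog endseq".
encoded = [5, 1, 2, 3]
pairs = expand_caption(encoded, max_length=6)
for prefix, target in pairs:
    print(prefix, "->", target)
# [0, 0, 0, 0, 0, 5] -> 1
# [0, 0, 0, 0, 5, 1] -> 2
# [0, 0, 0, 5, 1, 2] -> 3
```

A caption of n tokens therefore yields n - 1 samples, which is why the number of training samples is much larger than the number of captions.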

## 3. Build the CNN-LSTM Model

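Now, let's build the model architecture that combines CNN features with an LSTM for caption generation. The image branch projects the 4,096-dimensional VGG16 features down to 256 dimensions, the text branch embeds the partial caption and summarizes it with an LSTM, and the two 256-dimensional vectors are added and decoded into a softmax over the vocabulary.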

In [None]:
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

# Define the model architecture
def build_model(vocab_size, max_length):
    # Image feature input
    inputs1 = Input(shape=(4096,))  # Shape of VGG16 features
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    
    # Sequence input
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    
    # Decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    
    # Combine the inputs and outputs into a single model
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

# Build the model
model = build_model(len(word_to_idx), max_length)
model.summary()
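
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below applies the standard formulas for dense, embedding, and LSTM layers to the layer sizes used above. It assumes the Keras default LSTM (one bias vector per gate), and the vocabulary size of 8,000 is a made-up placeholder, so substitute your own.

```python
def caption_model_param_counts(vocab_size, feature_dim=4096, embed_dim=256, units=256):
    """Trainable parameter counts for each layer of the merge model above."""
    return {
        "image_dense": feature_dim * units + units,
        "embedding": vocab_size * embed_dim,
        # An LSTM has four gates, each with input weights, recurrent weights, and a bias.
        "lstm": 4 * (units * (embed_dim + units) + units),
        "decoder_dense": units * units + units,
        "output_dense": units * vocab_size + vocab_size,
    }

counts = caption_model_param_counts(vocab_size=8000)  # hypothetical vocabulary size
for name, n in counts.items():
    print(f"{name:>14}: {n:,}")
print(f"{'total':>14}: {sum(counts.values()):,}")
```

The embedding and output layers both scale with the vocabulary size, so trimming rare words is the most direct way to shrink the model.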

## 4. Train the Model

Now, let's train the model using the prepared data.

In [None]:
# Define a data generator for training
def data_generator(X1, X2, y, batch_size):
    # Get the number of samples
    num_samples = len(X1)
    indices = np.arange(num_samples)
    
    while True:
        # Shuffle indices for each epoch
        np.random.shuffle(indices)
        
        # Create batches
        for i in range(0, num_samples, batch_size):
            batch_indices = indices[i:i + batch_size]
            
            # Get batch data
            batch_X1 = X1[batch_indices]
            batch_X2 = X2[batch_indices]
            batch_y = y[batch_indices]
            
            # Yield the two inputs as a tuple; some newer Keras releases may reject a list here
            yield (batch_X1, batch_X2), batch_y

# Train the model
# Note: In a real scenario, you would train on the full dataset
# For demonstration, we're using a small sample
batch_size = 32
epochs = 10

# Create a data generator
generator = data_generator(X1, X2, y, batch_size)

# Train the model
# Round up so the final partial batch is included; this also guarantees at least one step
steps_per_epoch = max(1, int(np.ceil(len(X1) / batch_size)))

history = model.fit(
    generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    verbose=1
)

# Save the model
model.save('image_caption_model.h5')
print("Model trained and saved.")
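
The loss reported during training is categorical cross-entropy: for each sample it is the negative log of the probability the model assigned to the true next word. A small sketch with made-up probabilities shows how the value behaves. The function below is a plain-Python illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-7):
    """Cross-entropy for one sample: -sum(t * log(p)), with p clipped away from zero."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, predicted_probs))

target = [0, 0, 1, 0]                   # the true next word is index 2
confident = [0.05, 0.05, 0.85, 0.05]    # most probability on the right word
unsure = [0.25, 0.25, 0.25, 0.25]       # uniform guess over four words

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(round(loss_confident, 4))
print(round(loss_unsure, 4))
```

A uniform guess over a vocabulary of V words gives a loss of log(V), which is a useful reference point when reading the first epoch's loss.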

## 5. Evaluate Model Performance

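Let's inspect the model qualitatively by generating captions for a few of the sample images and comparing them with the reference captions. Because these images were also used for training, this is a sanity check rather than a measure of how well the model generalizes.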

In [None]:
# Function to generate a caption for an image
def generate_caption(model, image_features, word_to_idx, idx_to_word, max_length):
    # Start with the start sequence token
    in_text = 'startseq'
    
    # Generate caption word by word
    for i in range(max_length):
        # Encode the current input sequence
        sequence = [word_to_idx[word] for word in in_text.split() if word in word_to_idx]
        sequence = pad_sequences([sequence], maxlen=max_length)
        
        # Predict the next word
        yhat = model.predict([image_features, sequence], verbose=0)
        yhat = np.argmax(yhat)
        
        # Map the predicted word index to the actual word
        word = idx_to_word[yhat]
        
        # Stop if we reach the end token
        if word == 'endseq':
            break
            
        # Append the word to the caption
        in_text += ' ' + word
    
    # Remove the start token from the final caption (the end token is never appended)
    final_caption = in_text.replace('startseq', '')
    
    return final_caption.strip()

# Generate captions for sample images
for i, image_name in enumerate(sample_images[:3]):
    # Get image features
    image_features = sample_features[image_name]
    
    # Generate caption
    generated_caption = generate_caption(model, image_features, word_to_idx, idx_to_word, max_length)
    
    # Display image and captions
    img_path = os.path.join(images_dir, image_name)
    img = Image.open(img_path)
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Image: {image_name}")
    plt.show()
    
    print("Generated Caption:")
    print(generated_caption)
    
    print("\nActual Captions:")
    if image_name in captions_dict:
        for j, caption in enumerate(captions_dict[image_name], 1):
            # Remove start and end tokens for display
            # (named reference_caption so it does not overwrite the clean_caption function)
            reference_caption = caption.replace('startseq', '').replace('endseq', '').strip()
            print(f"{j}. {reference_caption}")
    print("\n" + "-"*50 + "\n")
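
Side-by-side comparison can be complemented with a simple number. The sketch below computes clipped unigram precision, the building block of BLEU-1: the share of generated words that also appear in a reference caption, with repeated words credited only as often as a reference uses them. It is a simplified stand-in (no brevity penalty, no higher-order n-grams), and the function name and example captions are assumptions.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words found in the references, with counts clipped.

    Each candidate word is credited at most as many times as it appears in
    the single reference that contains it most often.
    """
    cand_counts = Counter(candidate.split())
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return matched / sum(cand_counts.values())

refs = ["a dog runs on the grass", "a brown dog is running outside"]
print(clipped_unigram_precision("a dog runs outside", refs))   # 1.0
print(clipped_unigram_precision("the the the the", refs))      # 0.25
```

The second example shows why clipping matters: without it, a caption that repeats one common word would score perfectly.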

## 6. Implement Caption Generator for New Images

Now, let's create a function to generate captions for new images that weren't part of the training set.

In [None]:
# Function to generate caption for a new image
def caption_image(image_path, model, word_to_idx, idx_to_word, max_length):
    # Extract features from the image
    image_features = extract_features(image_path)
    
    # Generate caption
    caption = generate_caption(model, image_features, word_to_idx, idx_to_word, max_length)
    
    return caption

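# Test the caption generator on an image that was not in the training sample
# You can replace this with any image path
unseen_images = [name for name in os.listdir(images_dir) if name not in sample_features]
test_image_path = os.path.join(images_dir, random.choice(unseen_images))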

# Generate caption
caption = caption_image(test_image_path, model, word_to_idx, idx_to_word, max_length)

# Display the image and caption
img = Image.open(test_image_path)
plt.figure(figsize=(10, 8))
plt.imshow(img)
plt.axis('off')
plt.title("Test Image")
plt.show()

print("Generated Caption:")
print(caption)
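
The decoding loop inside `generate_caption` is greedy: at every step it keeps only the single most probable next word. The sketch below isolates that loop so it can be run without a trained network. The lookup table `TOY_TABLE` and the function `toy_model` are hypothetical stand-ins for the model's softmax output.

```python
def greedy_decode(next_word_probs, max_length, start="startseq", end="endseq"):
    """Repeatedly append the most probable next word until `end` or max_length."""
    words = [start]
    for _ in range(max_length):
        probs = next_word_probs(words)      # {word: probability}
        best = max(probs, key=probs.get)    # greedy choice (argmax)
        if best == end:
            break
        words.append(best)
    return " ".join(words[1:])

# Hypothetical stand-in for the trained network: a fixed lookup keyed on the
# last word generated so far.
TOY_TABLE = {
    "startseq": {"a": 0.7, "the": 0.3},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.8, "endseq": 0.2},
    "runs": {"endseq": 0.9, "a": 0.1},
}

def toy_model(words):
    return TOY_TABLE[words[-1]]

print(greedy_decode(toy_model, max_length=10))  # a dog runs
```

The `max_length` cap matters in practice: a poorly trained model may never emit the end token, and the cap is what stops the loop.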

## 7. Complete Implementation

Let's create a complete implementation that can be used to generate captions for any image.

In [None]:
from tensorflow.keras.models import load_model

class ImageCaptionGenerator:
    def __init__(self, model_path, word_to_idx_path, idx_to_word_path, max_length):
        # Load the model
        self.model = load_model(model_path)
        
        # Load word mappings
        with open(word_to_idx_path, 'rb') as f:
            self.word_to_idx = pickle.load(f)
        with open(idx_to_word_path, 'rb') as f:
            self.idx_to_word = pickle.load(f)
        
        # Set maximum caption length
        self.max_length = max_length
        
        # Load VGG16 model for feature extraction
        base_model = VGG16(weights='imagenet')
        self.feature_extractor = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)
    
    def extract_features(self, image_path):
        # Load the image
        img = load_img(image_path, target_size=(224, 224))
        # Convert to array
        img_array = img_to_array(img)
        # Expand dimensions
        img_array = np.expand_dims(img_array, axis=0)
        # Preprocess the image
        img_array = preprocess_input(img_array)
        # Extract features
        features = self.feature_extractor.predict(img_array, verbose=0)
        return features
    
    def generate_caption(self, image_path):
        # Extract features
        image_features = self.extract_features(image_path)
        
        # Start with start token
        in_text = 'startseq'
        
        # Generate caption word by word
        for i in range(self.max_length):
            # Encode the current input sequence
            sequence = [self.word_to_idx[word] for word in in_text.split() if word in self.word_to_idx]
            sequence = pad_sequences([sequence], maxlen=self.max_length)
            
            # Predict the next word
            yhat = self.model.predict([image_features, sequence], verbose=0)
            yhat = np.argmax(yhat)
            
            # Map the predicted word index to the actual word
            word = self.idx_to_word[yhat]
            
            # Stop if we reach the end token
            if word == 'endseq':
                break
                
            # Append the word to the caption
            in_text += ' ' + word
        
        # Remove start token from the final caption
        final_caption = in_text.replace('startseq', '')
        
        return final_caption.strip()

# Example usage (uncomment when you have a trained model)
'''
# Initialize the caption generator
caption_generator = ImageCaptionGenerator(
    model_path='image_caption_model.h5',
    word_to_idx_path='word_to_idx.pkl',
    idx_to_word_path='idx_to_word.pkl',
    max_length=max_length
)

# Generate caption for an image
image_path = 'path/to/your/image.jpg'
caption = caption_generator.generate_caption(image_path)
print(f"Caption: {caption}")
'''

## 8. Conclusion and Next Steps

In this notebook, we've built an image caption generator using a combination of CNN (VGG16) and LSTM networks. The model extracts features from images using a pre-trained CNN and then generates captions using an LSTM network.

### What we've accomplished:
1. Downloaded and explored the Flickr8k dataset
2. Preprocessed images using VGG16 for feature extraction
3. Preprocessed captions (cleaning, tokenization, vocabulary creation)
4. Built a CNN-LSTM model architecture
5. Trained the model (on a 10-image sample, for demonstration)
6. Inspected generated captions against the reference captions
7. Implemented a caption generator for new images

### Possible improvements:
1. Use a larger dataset (e.g., Flickr30k, MSCOCO) for better performance
2. Try different pre-trained CNN models (e.g., ResNet, Inception)
3. Implement attention mechanisms to focus on relevant parts of the image
4. Use more advanced NLP techniques (e.g., transformers) for caption generation
5. Implement beam search for better caption generation
6. Fine-tune the CNN part of the model for better feature extraction
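
As an illustration of the beam search item, here is a minimal sketch on a toy next-word table. Instead of keeping one hypothesis per step, it keeps the `beam_width` best partial captions ranked by cumulative log-probability. The table, function names, and scoring (no length normalization, so shorter captions are favored) are simplifying assumptions rather than a tuned implementation.

```python
import math

def beam_search(next_word_probs, beam_width, max_length, start="startseq", end="endseq"):
    """Return the highest log-probability caption found with a fixed-width beam."""
    beams = [([start], 0.0)]     # (words, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for words, score in beams:
            for word, p in next_word_probs(words).items():
                if p <= 0:
                    continue
                entry = (words + [word], score + math.log(p))
                if word == end:
                    finished.append(entry)
                else:
                    candidates.append(entry)
        if not candidates:
            break
        # Keep only the best `beam_width` unfinished hypotheses.
        beams = sorted(candidates, key=lambda e: e[1], reverse=True)[:beam_width]
    else:
        finished.extend(beams)   # ran out of steps: accept unfinished beams too
    best_words, _ = max(finished or beams, key=lambda e: e[1])
    return " ".join(w for w in best_words if w not in (start, end))

# Hypothetical next-word table in which the greedy first step is a trap.
TOY = {
    "startseq": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"endseq": 1.0},
    "cat": {"endseq": 1.0},
}

def toy_probs(words):
    return TOY[words[-1]]

print(beam_search(toy_probs, beam_width=1, max_length=10))  # a dog
print(beam_search(toy_probs, beam_width=2, max_length=10))  # the dog
```

With a width of 1 the search is equivalent to greedy decoding and commits to "a" (total probability 0.3), while a width of 2 keeps "the" alive long enough to find the more probable caption (0.36).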

### Resources for further learning:
1. [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/abs/1411.4555)
2. [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044)
3. [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)