# Image Captioning with ViT-GPT2

This notebook demonstrates an image captioning system built from a Vision Transformer (ViT) encoder and a GPT-2 decoder. ViT encodes the input image, and GPT-2 generates a natural-language description conditioned on that encoding.

## Features
- Uses ViT (Vision Transformer) as the image encoder
- Employs GPT-2 as the language model for caption generation
- Built on the Hugging Face `transformers` library
- Provides example usage with sample images
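
To make the encoder side more concrete: a ViT cuts the image into fixed-size square patches and treats each patch as one token, plus one extra classification token. The sketch below computes the resulting sequence length. The 224x224 input and 16x16 patch size are assumptions (the common ViT-Base configuration); this notebook does not state the checkpoint's actual values, and `vit_sequence_length` is a hypothetical helper, not part of `transformers`.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens a ViT encoder sees for a square image.

    The defaults are an assumption (ViT-Base style: 224x224 input, 16x16 patches).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the classification ([CLS]) token


print(vit_sequence_length())  # 14 * 14 patches + 1 [CLS] token = 197
```

The decoder attends over this token sequence while generating each word of the caption.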

## Setup
First, let's import the required libraries and set up our environment.

In [None]:
import torch
import pandas as pd
from PIL import Image
import os
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Set random seed for reproducibility
torch.manual_seed(42)

## Load Model and Processors
We'll use the pre-trained ViT-GPT2 model for image captioning.

In [None]:
def load_model_and_processors():
    model_name = 'nlpconnect/vit-gpt2-image-captioning'
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    feature_extractor = ViTImageProcessor.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, feature_extractor, tokenizer

model, feature_extractor, tokenizer = load_model_and_processors()

## Caption Generation Function
Let's define a function to generate captions for input images.

In [None]:
def generate_caption(image_path, model, feature_extractor, tokenizer):
    image = Image.open(image_path).convert('RGB')
    pixel_values = feature_extractor(image, return_tensors='pt').pixel_values

    generated_ids = model.generate(
        pixel_values,
        max_length=30,
        num_beams=4,
        early_stopping=True
    )

    generated_caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return generated_caption
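
The `num_beams=4` argument above switches generation from greedy decoding to beam search. As a rough illustration of the idea only, the sketch below runs beam search over a tiny hand-written next-token table. `TOY_MODEL` and `beam_search` are hypothetical names invented for this example; the real `model.generate` implementation also handles details omitted here, such as length penalties and batching.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.4, "<eos>": 0.1},
    "the": {"dog": 0.9, "<eos>": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
}


def beam_search(model, num_beams=2, max_length=5):
    """Return the sequence with the highest summed log-probability found."""
    beams = [(0.0, ["<bos>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams best hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    best_tokens = max(finished + beams, key=lambda c: c[0])[1]
    return " ".join(t for t in best_tokens if t not in ("<bos>", "<eos>"))


print(beam_search(TOY_MODEL, num_beams=1))  # greedy decoding
print(beam_search(TOY_MODEL, num_beams=2))  # beam search
```

With one beam (greedy), the search commits to "a" because it is the likeliest first word and ends with "a dog" (probability 0.30). With two beams, it keeps "the" alive long enough to find "the dog" (probability 0.36), which is the motivation for using several beams in `generate_caption`.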

## Example Usage
Let's try the model on some sample images from our dataset.

In [None]:
# Set paths
image_dir = 'Images'

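# Get a few sample images (sorted for a stable order; skip non-image files)
image_extensions = ('.jpg', '.jpeg', '.png')
sample_images = sorted(
    name for name in os.listdir(image_dir)
    if name.lower().endswith(image_extensions)
)[:5]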

# Generate and display captions
for image_name in sample_images:
    image_path = os.path.join(image_dir, image_name)
    
    # Display the image
    image = Image.open(image_path)
    display(image)
    
    # Generate and print the caption
    caption = generate_caption(image_path, model, feature_extractor, tokenizer)
    print(f'Generated caption: {caption}\n')

## Compare with Ground Truth
Let's compare the generated captions with the original captions from our dataset.

In [None]:
# Load original captions
captions_df = pd.read_csv('captions.txt')

# Map each image filename to its list of reference captions
caption_dict = captions_df.groupby('image')['caption'].apply(list).to_dict()

# Compare generated captions with original ones
for image_name in sample_images:
    print(f'Image: {image_name}')
    
    # Display original captions
    print('Original captions:')
    for caption in caption_dict.get(image_name, []):  # empty if the image has no reference captions
        print(f'- {caption}')
    
    # Generate and display new caption
    image_path = os.path.join(image_dir, image_name)
    generated = generate_caption(image_path, model, feature_extractor, tokenizer)
    print(f'\nGenerated caption:\n- {generated}\n')
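
Reading the captions side by side is subjective, so a rough numeric score can help. The sketch below computes clipped unigram precision: the fraction of generated words that also appear in at least one reference caption. This is only the first ingredient of BLEU-1 (no brevity penalty, no higher-order n-grams), and `tokenize` and `unigram_precision` are hypothetical helpers written for this example, not part of any library used above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_precision(generated, references):
    """Clipped unigram precision of `generated` against reference captions.

    A generated word is credited at most as many times as it occurs in
    the single reference that contains it most often.
    """
    gen_counts = Counter(tokenize(generated))
    total = sum(gen_counts.values())
    if total == 0:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for word, count in Counter(tokenize(reference)).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word]) for word, count in gen_counts.items())
    return matched / total


# Made-up captions, for illustration only
references = ["A dog is running through the grass.", "The brown dog plays outside"]
score = unigram_precision("a dog runs on the grass", references)
print(f"Unigram precision: {score:.2f}")
```

In the comparison loop, this could be called as `unigram_precision(generated, caption_dict.get(image_name, []))`. A word-overlap score penalizes valid paraphrases ("runs" vs. "running" above), so it is best treated as a coarse signal rather than a quality verdict.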

## Conclusion
This notebook demonstrated how to use the pre-trained ViT-GPT2 model for image captioning. A ViT encoder turns the image into features, and a GPT-2 decoder generates the caption from them.

Key points:
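- The pre-trained `nlpconnect/vit-gpt2-image-captioning` checkpoint generates captions without any additional training
- Captions are decoded with beam search (`num_beams=4`) and capped at `max_length=30` tokens; beam search typically yields more fluent captions than greedy decoding
- Loading the model and running inference take only a few calls to the Hugging Face `transformers` API, so the code is straightforward to reuse in other applications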

Feel free to experiment with different images and parameters to see how the model performs!