## Image Caption Generator

We will use a pre-trained Transformers model to generate a caption from an image.

### Installation



1. Transformers
2. PyTorch
3. Pillow (PIL), which provides the Image class

To install a package, run `pip install package_name`.

In Colab, PyTorch and Pillow come preinstalled, so you only need to install **transformers** from Hugging Face.
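Outside Colab, it can help to check which of the three packages are already importable before installing anything. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the mapping from import name to pip name assumes the usual distribution names (`transformers`, `torch`, `Pillow`).

```python
import importlib.util

# Import name -> name to pass to `pip install` (assumed usual package names).
REQUIRED = {"transformers": "transformers", "torch": "torch", "PIL": "Pillow"}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    missing = []
    for import_name, pip_name in required.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(import_name) is None:
            missing.append(pip_name)
    return missing


print("Missing packages:", missing_packages() or "none")
```

Anything the helper reports can then be installed with pip in the usual way.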




In [1]:
!pip install transformers



# **Step 1: Import Libraries and Initialize Model Components**
* VisionEncoderDecoderModel: This class lets you combine a vision encoder model (like ViT) and a language decoder model (like GPT-2) for tasks like image captioning.

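* ViTFeatureExtractor: Converts images into the tensor format the model expects (resizing and normalizing them), much as a tokenizer does for text. Recent releases of transformers deprecate this class in favor of ViTImageProcessor.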

* AutoTokenizer: Handles the text encoding and decoding, converting text to tokens the model can process and later decoding tokens back into readable text.

* torch: Used to handle tensors and manage the device (GPU/CPU) on which computations run.
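For intuition about what the ViT encoder receives, the model summary printed later in this notebook shows a patch projection `Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))`, meaning the image is cut into non-overlapping 16x16 patches. The sketch below works out how many tokens that yields. The 224x224 input size is an assumption (a common ViT default, not stated in this notebook), and `vit_sequence_length` is a hypothetical helper name.

```python
def vit_sequence_length(image_size=224, patch_size=16, add_cls_token=True):
    """Number of tokens the ViT encoder sees for a square image.

    image_size=224 is an assumed default; patch_size=16 matches the
    Conv2d kernel and stride shown in the printed model summary.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # ViT prepends one learned [CLS] embedding to the patch sequence.
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 14 * 14 patches + 1 [CLS] token = 197
```

Each of those tokens is a 768-dimensional vector, which is the `768` in the projection layer.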

In [1]:
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image

# **Loading the Pre-trained Model:**

* model: Loads a pre-trained ViT-GPT2 model from Hugging Face’s Model Hub for image captioning.
* feature_extractor: Loads a feature extractor specific to this model to prepare images.
* tokenizer: Loads a tokenizer for the GPT-2 model, which helps convert text to tokens and vice versa.

# **Device Selection:**

* torch.device("cuda" if torch.cuda.is_available() else "cpu") checks if a GPU (CUDA) is available. If so, it assigns computations to the GPU; otherwise, it defaults to the CPU.
* model.to(device) moves the model to the selected device for faster computation if a GPU is available.

In [2]:
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





VisionEncoderDecoderModel(
  (encoder): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_featur

# **Step 2: Define Parameters and Caption Generation Function**

**Setting Parameters for Text Generation:**

* max_length = 16: Sets a maximum length of 16 tokens for generated captions.
* num_beams = 4: Enables beam search with 4 beams, a decoding strategy that improves caption quality by considering multiple possibilities at each step.
* gen_kwargs: Stores these settings to pass them as arguments to the model’s generation method.
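To see why several beams can beat a single greedy choice, here is a toy beam search over a hand-written probability table. It is a simplified sketch, not Hugging Face's implementation: `TOY_MODEL` and `beam_search` are hypothetical, the probabilities are invented for illustration, and `max_length` here simply bounds the number of decoding steps. The real `generate` method adds refinements such as length normalization that are omitted.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
# A real captioning model would compute these from the image and the prefix.
TOY_MODEL = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "<eos>": 0.2},
    "the": {"dog": 0.9, "<eos>": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
}


def beam_search(model, num_beams, max_length):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [(0.0, ["<bos>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "<eos>":
                candidates.append((score, tokens))  # finished, keep as is
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams best partial captions.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    return beams[0][1]


print(beam_search(TOY_MODEL, num_beams=1, max_length=16))  # ['<bos>', 'a', 'dog', '<eos>']
print(beam_search(TOY_MODEL, num_beams=2, max_length=16))  # ['<bos>', 'the', 'dog', '<eos>']
```

With one beam the search commits to "a" (probability 0.6) and ends with a total probability of 0.30. With two beams it also keeps "the" alive and finds "the dog" at 0.36, which is the kind of improvement the num_beams setting is meant to capture.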

# **Image Processing:**

* predict_step(image_paths): A function that takes a list of image file paths as input.
* Loop: For each image path in image_paths, it:
  * Opens the image with Image.open(image_path).
  * Converts the image to RGB mode (if it’s not already in that mode), ensuring compatibility with the model.
  * Appends the processed image to the images list.
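Because the loop opens every path it is given, one missing file stops the whole batch. A possible guard is to screen the paths first. This is a minimal sketch, not part of the original notebook: `split_image_paths` is a hypothetical helper, and the set of accepted extensions is an illustrative choice rather than a limit of PIL.

```python
from pathlib import Path

# Extensions treated as images here are an illustrative choice, not a PIL limit.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def split_image_paths(image_paths):
    """Split paths into (usable, rejected) before any image is opened."""
    usable, rejected = [], []
    for raw_path in image_paths:
        path = Path(raw_path)
        # Accept only existing files whose extension looks like an image.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            usable.append(str(path))
        else:
            rejected.append(str(path))
    return usable, rejected
```

The `usable` list can then be passed to predict_step, while `rejected` can be logged or reported back to the caller.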

# **Feature Extraction:**
* feature_extractor(images=images, return_tensors="pt"): Processes the list of images, returning them as a batch of PyTorch tensors ("pt").
* pixel_values.to(device): Moves the pixel values to the selected device (GPU or CPU) for model processing.
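The arithmetic behind this step can be sketched in plain Python. The numbers used here (mean 0.5, standard deviation 0.5, 224x224 output) are assumptions based on common ViT defaults. The values actually used by this checkpoint can be read from `feature_extractor.image_mean`, `feature_extractor.image_std`, and `feature_extractor.size`. Both function names are hypothetical.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit channel value (0-255) to the normalized range.

    mean and std are assumed defaults, not values read from the model.
    """
    rescaled = value / 255.0        # rescale to [0, 1]
    return (rescaled - mean) / std  # center and scale


def batch_shape(num_images, channels=3, height=224, width=224):
    """Shape of the pixel_values tensor for a batch of RGB images."""
    return (num_images, channels, height, width)


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
print(batch_shape(2))                            # (2, 3, 224, 224)
```

Under these assumptions, every image in the batch ends up with the same shape and value range, which is what lets the model process several images in one call.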

# **Caption Generation:**

* model.generate(pixel_values, **gen_kwargs): Feeds the pixel values into the model to generate captions. The parameters (gen_kwargs) control aspects of the generation, like maximum length and beam search.

# **Decoding and Cleaning the Output:**

* tokenizer.batch_decode(output_ids, skip_special_tokens=True): Converts the generated token IDs back into readable text (captions) and removes special tokens.
* preds = [pred.strip() for pred in preds]: Strips any extra spaces from each caption.
* return preds: Returns a list of captions for the input images.
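The decoding steps above can be imitated with a toy vocabulary to show what skipping special tokens and stripping accomplish. Everything in this sketch is hypothetical: the four-entry vocabulary and the `toy_batch_decode` helper stand in for the real GPT-2 tokenizer, whose tokens usually carry their leading space.

```python
# Hypothetical vocabulary; the real GPT-2 tokenizer has tens of thousands of entries.
ID_TO_TOKEN = {0: "<|endoftext|>", 1: " a", 2: " dog", 3: " runs"}
SPECIAL_IDS = {0}


def toy_batch_decode(batch_ids, skip_special_tokens=True):
    """Turn each list of token IDs into a string, mimicking batch_decode."""
    texts = []
    for ids in batch_ids:
        if skip_special_tokens:
            ids = [i for i in ids if i not in SPECIAL_IDS]
        texts.append("".join(ID_TO_TOKEN[i] for i in ids))
    return texts


raw = toy_batch_decode([[0, 1, 2, 3, 0], [0, 1, 2, 0, 0]])
print(raw)                       # [' a dog runs', ' a dog']
print([t.strip() for t in raw])  # ['a dog runs', 'a dog']
```

The second sequence is padded with an extra end-of-text ID, as shorter captions in a batch typically are. Skipping special tokens removes that padding, and strip() removes the leading space left by the first token.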

In [7]:
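# Generation settings: cap captions at 16 tokens and use beam search with 4 beams.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    """Generate one caption per image path."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The model expects 3-channel input, so convert other modes to RGB.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        images.append(i_image)

    # Preprocess the images into a batch of PyTorch tensors on the chosen device.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate token IDs for each image using the settings above.
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Convert token IDs to text and trim surrounding whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds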

predict_step(['93922153_8d831f7f01.jpg'])

['a woman is sitting on a ledge overlooking a mountain range']
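Since predict_step accepts a list, several images can be captioned in one call, and the captions come back in the same order as the paths. One way to keep each caption attached to its file is sketched below. `captions_by_path` is a hypothetical helper, and `fake_predict` is a stand-in so the sketch runs without the model loaded.

```python
def captions_by_path(image_paths, predict_fn):
    """Pair each input path with its caption, preserving input order."""
    captions = predict_fn(image_paths)
    if len(captions) != len(image_paths):
        raise ValueError("expected one caption per image path")
    return dict(zip(image_paths, captions))


# Stand-in for predict_step so the sketch runs without the model loaded.
def fake_predict(paths):
    return [f"caption for {p}" for p in paths]


for path, caption in captions_by_path(["a.jpg", "b.jpg"], fake_predict).items():
    print(f"{path}: {caption}")
```

In the notebook, passing predict_step in place of `fake_predict` would return real captions keyed by file name.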