<a href="https://colab.research.google.com/github/ansar2019/image-captioning/blob/main/LLaVA_captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Step 1: Install requirements (run once)
!pip install -q torch transformers accelerate bitsandbytes pillow


In [2]:
# Step 2: Import with proper quantization config
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from PIL import Image
import torch

# Step 3: Configure 4-bit quantization properly
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Step 4: Load model without legacy load_in_4bit parameter
model = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config  # Only quantization parameter
)

# Step 5: Load processor with padding config
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    pad_token="<pad>",
    padding_side="right"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
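
The checkpoint for this model ships as three safetensors shards of roughly 4.99, 4.96 and 4.18 GB. As a rough illustration of what the 4-bit configuration saves, the sketch below scales that total down to 4 bits per parameter. It assumes the shards store float16 weights (16 bits per parameter); the names `SHARD_SIZES_GB` and `quantized_weight_gb` are hypothetical helpers for this example. It is an estimate, not a measurement: bitsandbytes keeps some modules unquantized and stores quantization constants, and activations need memory on top of the weights.

```python
# Rough weight-memory estimate for the 4-bit load (a sketch, not a measurement).
SHARD_SIZES_GB = [4.99, 4.96, 4.18]  # safetensors shard sizes of the checkpoint


def quantized_weight_gb(fp16_gb: float, bits_per_param: float = 4.0) -> float:
    """Scale a float16 weight size (16 bits per parameter) to a lower bit width."""
    return fp16_gb * bits_per_param / 16.0


fp16_total_gb = sum(SHARD_SIZES_GB)
nf4_total_gb = quantized_weight_gb(fp16_total_gb)
print(f"float16 weights: ~{fp16_total_gb:.2f} GB")
print(f"4-bit weights:   ~{nf4_total_gb:.2f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, which is why the quantized model fits on a single Colab GPU.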



In [3]:
# Step 6: Working caption generation function
def generate_caption(image_path):
    image = Image.open(image_path).convert("RGB")

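    # LLaVA-1.5 prompt template. It is built from adjacent string literals so
    # that the function's indentation does not leak into the prompt text.
    prompt = (
        "A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions. "
        "USER: <image>\n"
        "Describe this image in detail. "
        "ASSISTANT:"
    )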

    # Tokenize the prompt and preprocess the image in one call.
    # padding="max_length" with max_length=512 is deliberately not used: it caused
    # a failure in this notebook, most likely because a fixed length can cut into
    # the <image> placeholder tokens so they no longer match the image features.
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt",
        truncation=True,  # no effect without max_length; only triggers a tokenizer warning
        return_attention_mask=True
    ).to(model.device)

    # Generate the caption.
    # attention_mask is already included in **inputs, so it is not passed separately.
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        pad_token_id=processor.tokenizer.eos_token_id
    )

    return processor.decode(outputs[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
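
The prompt template and the answer extraction used in `generate_caption` can also be factored into two small pure-Python helpers, which makes them easy to check without loading the model. This is a minimal sketch: `SYSTEM_PROMPT`, `build_prompt` and `extract_reply` are hypothetical names introduced here, and the layout assumes the `USER: <image>\n<question> ASSISTANT:` convention used in this notebook.

```python
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(question: str, system: str = SYSTEM_PROMPT) -> str:
    """Return a single-image prompt with exactly one <image> placeholder."""
    return f"{system} USER: <image>\n{question.strip()} ASSISTANT:"


def extract_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole string if it is absent."""
    return decoded.rsplit(marker, 1)[-1].strip()


prompt = build_prompt("Describe this image in detail.")
reply = extract_reply(prompt + " A cat sleeping on a sofa.")
print(reply)
```

Splitting on the last occurrence of the marker keeps the extraction correct even if the generated text happens to repeat earlier parts of the conversation.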

In [4]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!unzip "/content/unseen-image.zip"

Archive:  /content/unseen-image.zip
   creating: unseen-image/
  inflating: unseen-image/COCO_test2014_000000000870.jpg  
  inflating: unseen-image/COCO_test2014_000000000890.jpg  
  inflating: unseen-image/COCO_test2014_000000000958.jpg  
  inflating: unseen-image/COCO_test2014_000000000970.jpg  
  inflating: unseen-image/COCO_test2014_000000000979.jpg  
  inflating: unseen-image/COCO_test2014_000000001024.jpg  
  inflating: unseen-image/COCO_test2014_000000001035.jpg  
  inflating: unseen-image/COCO_test2014_000000001043.jpg  
  inflating: unseen-image/COCO_test2014_000000001047.jpg  
  inflating: unseen-image/COCO_test2014_000000001076.jpg  
  inflating: unseen-image/COCO_test2014_000000001110.jpg  
  inflating: unseen-image/COCO_test2014_000000001116.jpg  
  inflating: unseen-image/COCO_test2014_000000001118.jpg  
  inflating: unseen-image/COCO_test2014_000000001127.jpg  
  inflating: unseen-image/COCO_test2014_000000001152.jpg  
  inflating: unseen-image/COCO_test2014_000000001156

In [6]:


# Folder containing the extracted images (local Colab storage, not Google Drive)
folder_path = '/content/unseen-image'  # Change this to your folder path
output_file = '/content/drive/MyDrive/captions_results.txt'  # Output text file path

# Supported image extensions
image_extensions = ['.jpg', '.jpeg', '.png', '.bmp', '.gif']

# Get list of image files
image_files = [f for f in os.listdir(folder_path)
              if os.path.splitext(f)[1].lower() in image_extensions]

# Process images and save results
with open(output_file, 'w') as f:
    for image_file in image_files:
        image_path = os.path.join(folder_path, image_file)
        try:
            caption = generate_caption(image_path)
            f.write(f"{image_file}: {caption}\n")
            print(f"Processed {image_file}: {caption}")
        except Exception as e:
            error_msg = f"Error processing {image_file}: {str(e)}"
            f.write(error_msg + "\n")
            print(error_msg)

print(f"\nAll captions saved to: {output_file}")
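
The results file written by this loop holds one `filename: caption` line per image, plus `Error processing ...` lines for failures. A possible way to read it back for later analysis is sketched below. `read_captions` is a hypothetical helper, and it assumes every caption fits on a single line; a caption containing a newline would need a more robust format such as JSON Lines. The demo writes a small temporary file instead of touching the real output.

```python
import os
import tempfile
from pathlib import Path


def read_captions(path):
    """Parse 'filename: caption' lines into a dict and collect error lines."""
    captions, errors = {}, []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("Error processing "):
            errors.append(line)
            continue
        # Split on the first ": " only, so colons inside the caption survive.
        name, sep, caption = line.partition(": ")
        if sep:
            captions[name] = caption
    return captions, errors


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "captions_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("a.jpg: A cat: sitting on a sofa.\n")
        demo_file.write("Error processing b.jpg: cannot identify image file\n")
    captions, errors = read_captions(demo_path)

print(captions)
print(errors)
```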

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processed COCO_test2014_000000007168.jpg: The image features a blender filled with a variety of colorful vegetables, including red peppers, green peppers, and onions. The blender is placed on a table, and the vegetables are arranged in a visually appealing manner. The blender is filled to the brim, indicating that it is ready to be blended. The assortment of vegetables creates a vibrant and healthy-looking mixture, perfect for a nutritious smoothie or a delicious vegetable soup.
Processed COCO_test2014_000000007372.jpg: The image features a small bird standing on a grassy field, eating from a jar or container. The bird is focused on the jar, which is placed on the ground. The scene is set in a lush green field, providing a natural and serene backdrop for the bird's activity.
Processed COCO_test2014_000000001371.jpg: The image features a large brown elephant standing in a grassy field. The elephant is the main focus of the scene, occupying a significant portion of the image. The grassy 