# Qwen2.5-VL Overview
Qwen2.5-VL is a multimodal model that processes images and videos alongside text prompts and generates text responses. This lets it reason about visual content in light of the accompanying text, which makes it suitable for tasks that require both visual comprehension and natural language generation.

## **Image Captioning**

### 1. Imports and Load Model

In [1]:
!pip install qwen-vl-utils  # Install the Qwen-VL helper utilities

Collecting qwen-vl-utils
  Downloading qwen_vl_utils-0.0.11-py3-none-any.whl.metadata (6.3 kB)
Collecting av (from qwen-vl-utils)
  Downloading av-15.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Downloading qwen_vl_utils-0.0.11-py3-none-any.whl (7.6 kB)
Downloading av-15.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (39.7 MB)
Installing collected packages: av, qwen-vl-utils
Successfully installed av-15.0.0 qwen-vl-utils-0.0.11


In [2]:
import os                                # Operating system interface for file paths and environment variables
import textwrap                          # Text wrapping utilities for formatted string display
import io                                # Core tools for working with streams and in-memory buffers
import requests                          # HTTP library for making web requests and downloading content
import random

import numpy as np

import torch
from transformers import (
    Qwen2_5_VLForConditionalGeneration,  # Vision-language model for multimodal tasks
    AutoProcessor,                       # Automatic tokenizer and feature extractor
)
from qwen_vl_utils import process_vision_info  # Extracts image and video inputs from chat messages

from PIL import Image                    # Python Imaging Library for image manipulation
import matplotlib.pyplot as plt          # Plotting library for data visualization
import matplotlib.patches as patches     # Drawing shapes and annotations on plots

import IPython.display as ipd

We will load the 3B-parameter `Qwen2.5-VL-3B-Instruct` model. Its two checkpoint shards total about 7.5 GB (3.98 GB + 3.53 GB), so plan for at least that much VRAM in 16-bit precision. If your GPU does not have enough memory, set `device_map="cpu"` instead of `"auto"`.

In [3]:
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

### 2. Image Preprocessor and Prompt Template
**Preprocessing** serves the same purpose as CLIP's preprocessing: scaling, center cropping, and normalization for images, and tokenization, padding to a uniform length, and format conversion for text. However, the processor expects images and text in a specific format. Qwen2.5-VL provides two helpers that convert a standard JSON-style message list into that format:
- **Text**: The processor's `apply_chat_template` method renders the messages into a prompt string in which each turn has the form `<|im_start|>{role}\n{content}<|im_end|>\n`, ending with `<|im_start|>assistant\n` when `add_generation_prompt=True`.
- **Images**: The `process_vision_info` function (from `qwen_vl_utils`) extracts the images and videos referenced in the messages and prepares them for the processor.
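
To make the template structure concrete, here is a minimal sketch that renders a message list into a ChatML-style string by hand. It is only an illustration: `render_chatml` and the `<image>` placeholder are hypothetical names, and the real chat template shipped with the model may add a default system message and model-specific vision tokens.

```python
def render_chatml(messages, add_generation_prompt=True, image_placeholder="<image>"):
    """Render chat messages into a ChatML-style prompt string (illustrative only)."""
    parts = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            body = content
        else:
            # Keep text items as-is; replace every non-text item with a placeholder.
            body = "".join(
                item["text"] if item["type"] == "text" else image_placeholder
                for item in content
            )
        parts.append(f"<|im_start|>{msg['role']}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical message list, mirroring the structure used in this notebook.
example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "photo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = render_chatml(example)
print(prompt)
```

Printing the string returned by the real `apply_chat_template` call is a quick way to compare it against this simplified version.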

In [4]:
# Load the image and build the chat messages
url = "https://images.unsplash.com/photo-1504208434309-cb69f4fe52b0?w=700"
img = Image.open(io.BytesIO(requests.get(url, timeout=15).content)).convert("RGB")
txt = "Describe this image."
msgs = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img},
            {"type": "text",  "text": txt}
        ],
    }
]

# Load a pre-trained processor
processor = AutoProcessor.from_pretrained(model_name)

# Format Conversion
text_prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)

# Apply the processor
inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,  # None here, because the messages contain no video
    padding=True,
    return_tensors="pt",
).to(model.device)

### 3. Generate the Caption

In [None]:
# Generate the captions (inference mode)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=64)

If we decoded the generated output directly, it would also include the input prompt. We therefore post-process it to keep only the model's response, slicing off the prompt tokens and skipping special tokens.
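
The trimming step can be sketched with plain Python lists standing in for token tensors. The token IDs below are made up, and `strip_prompt_tokens` is a hypothetical helper used only for this illustration.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Return only the tokens that follow the prompt in each sequence."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Made-up token IDs: the generated sequence starts with a copy of the prompt.
prompt_ids = [101, 7, 8, 9]
generated_ids = [prompt_ids + [42, 43, 2]]

new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)  # [[42, 43, 2]]
```

With tensors, the same idea is written as a column slice starting at the prompt length.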

In [None]:
# Decode only the newly generated tokens
caption = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[-1]:],
    skip_special_tokens=True
)[0]

# Display the image and the caption
plt.imshow(img)
plt.axis("off")
plt.show()
formatted_caption = textwrap.fill(caption, 80)  # Wrap the caption at 80 characters
print(formatted_caption)

## **Object Detection**

### 1. Imports and Load Model

In [None]:
!pip install qwen-vl-utils  # Install the Qwen-VL helper utilities

In [None]:
# Standard library imports for core functionality
import os          # Operating system interface for file paths and environment variables
import textwrap    # Text wrapping and formatting utilities
import io          # Core tools for working with streams and in-memory buffers
import requests    # HTTP library for making web requests and downloads
import re          # Regular expression pattern matching and text processing
import json        # JSON encoder and decoder for data serialization
import pprint      # Data pretty-printer for readable output formatting
import random

import numpy as np

import torch
from transformers import (
    Qwen2_5_VLForConditionalGeneration,  # Vision-language model for multimodal tasks
    AutoProcessor,                       # Automatic tokenizer and image processor
)
from qwen_vl_utils import process_vision_info  # Vision processing utilities for Qwen models

# Image processing and computer vision
from PIL import (Image, ImageDraw, ImageFont, ImageColor)

# Data visualization and plotting
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import IPython.display as ipd

We will load the 3B-parameter `Qwen2.5-VL-3B-Instruct` model. Its two checkpoint shards total about 7.5 GB (3.98 GB + 3.53 GB), so plan for at least that much VRAM in 16-bit precision. If your GPU does not have enough memory, set `device_map="cpu"` instead of `"auto"`.

In [None]:
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

### 2. Image Preprocessor and Prompt Template
We repeat the procedure from the previous section, this time adding a system message that tells the model which output format to use.

In [None]:
# Load the image and build the chat messages
url = "https://learnopencv.com/wp-content/uploads/2025/06/elephants.jpg"
img = Image.open(io.BytesIO(requests.get(url, timeout=15).content)).convert("RGB")
system_prompt = 'You are an object detector. Output a valid JSON list with one object per detection, each of the form {"bbox_2d": [x1, y1, x2, y2], "label": "class"}, where class is the name of the class you are detecting.'
user_prompt = "Outline the position of elephants"
msgs = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": system_prompt
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img},
            {
                "type": "text",
                "text": user_prompt
            }
        ],
    }
]

# Load a pre-trained processor
processor = AutoProcessor.from_pretrained(model_name)

# Format Conversion
text_prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)

# Apply the processor
inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,  # None here, because the messages contain no video
    padding=True,
    return_tensors="pt",
).to(model.device)

### 3. Generate the Detections

Similarly, we repeat the inference and output post-processing procedure.

In [None]:
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=1000)

# Decode only the newly generated tokens, dropping special tokens such as the end-of-turn marker
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[-1]:],
    skip_special_tokens=True
)[0]

Next, we define utility functions that parse the generated JSON and draw the detections on the image.

In [None]:
# Process the JSON
def _repair_newlines_inside_strings(txt: str) -> str:
    """
    Escape raw newlines that occur *inside* JSON string literals so the text
    parses, leaving newlines between tokens untouched. Very lightweight: it
    scans the text once, tracking whether it is inside a string and whether
    the previous character was an unescaped backslash.
    """
    out, in_string, escaped = [], False, False
    for ch in txt:
        if in_string and ch == "\n":
            out.append("\\n")  # raw newline -> JSON escape sequence
            escaped = False
            continue
        if ch == '"' and not escaped:
            in_string = not in_string
        escaped = ch == "\\" and not escaped
        out.append(ch)
    return "".join(out)


def extract_json(code_block: str, parse: bool = True):
    """
    Remove Markdown code-block markers (``` or ```json) and return:
      • the parsed Python obj (parse=True, default)
      • the raw JSON string   (parse=False)
    """
    # Look for triple-backtick blocks, optionally tagged with a language (e.g. ```json)
    block_re = re.compile(r"```(?:\w+)?\s*(.*?)\s*```", re.DOTALL)
    m = block_re.search(code_block)
    payload = (m.group(1) if m else code_block).strip()
    if not parse:
        return payload
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # Attempt a mild repair and retry once
        return json.loads(_repair_newlines_inside_strings(payload))


# Integrate into the image
def _text_wh(draw, text, font):
    """
    Return (width, height) of *text* under the given *font*, coping with
    newer Pillow versions (textbbox, added in 8.0) and older ones
    (textsize, removed in 10.0).
    """
    if hasattr(draw, "textbbox"):  # Pillow ≥8.0, preferred
        left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
        return right - left, bottom - top
    elif hasattr(draw, "textsize"):  # Pillow <10.0
        return draw.textsize(text, font=font)
    else:  # Fallback: measure with the font itself
        left, top, right, bottom = font.getbbox(text)
        return right - left, bottom - top


def draw_bboxes(
    img,
    detections,
    box_color="red",
    box_width=3,
    font_size=32,
    text_color="white",
    text_bg="red",
):
    # Create a drawing object for the image
    draw = ImageDraw.Draw(img)
    try:
        # Try to load a TrueType font
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
    except OSError:
        # If the TrueType font is not found, fall back to the default font
        # (the size argument requires Pillow ≥10.1)
        font = ImageFont.load_default(font_size)

    # Iterate through each detected object
    for det in detections:
        # Extract bounding box coordinates
        x1, y1, x2, y2 = det["bbox_2d"]
        # Get the label of the detected object, default to empty string if not present
        label = str(det.get("label", ""))

        # Draw the rectangle (bounding box) on the image
        draw.rectangle([x1, y1, x2, y2], outline=box_color, width=box_width)

        # If a label exists, draw the label text
        if label:
            # Get the width and height of the text label
            tw, th = _text_wh(draw, label, font)
            # Set padding around the text
            pad = 2
            # Calculate the top-left x-coordinate for the text background
            tx1 = x1
            # Calculate the top-left y-coordinate for the text background, ensuring it stays within the top edge of the image
            ty1 = max(0, y1 - th - 2 * pad) # keep inside top edge
            # Calculate the bottom-right x-coordinate for the text background
            tx2 = x1 + tw + 2 * pad
            # Calculate the bottom-right y-coordinate for the text background
            ty2 = ty1 + th + 2 * pad

            # If a text background color is specified, draw the background rectangle
            if text_bg:
                draw.rectangle([tx1, ty1, tx2, ty2],
                               fill=text_bg, outline=box_color)
            # Draw the text label on the image
            draw.text((tx1 + pad, ty1 + pad), label,
                      fill=text_color, font=font)

    # Return the modified image with bounding boxes and labels
    return img
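
To see the JSON extraction step in isolation, here is a minimal self-contained sketch that unwraps a Markdown-fenced reply and parses it. The reply text is invented for illustration and is not real model output; `parse_fenced_json` is a hypothetical, simplified stand-in for `extract_json` without the repair step.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:\w+)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_fenced_json(reply: str):
    """Parse JSON from a reply, unwrapping a Markdown code fence if one is present."""
    match = FENCE_RE.search(reply)
    payload = (match.group(1) if match else reply).strip()
    return json.loads(payload)


# Invented reply in the format requested by the system prompt.
sample_reply = (
    f"{FENCE}json\n"
    '[\n    {"bbox_2d": [10, 20, 110, 220], "label": "elephant"}\n]\n'
    f"{FENCE}"
)
detections = parse_fenced_json(sample_reply)
print(detections[0]["label"], detections[0]["bbox_2d"])
```

The same function also accepts an unfenced JSON string, since it falls back to parsing the whole reply when no fence is found.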

Finally, we apply these functions to parse the model output and draw the detected bounding boxes on the image.

In [None]:
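bounding_boxes = extract_json(output)
# The model may return a single object instead of a list; normalize to a list
if isinstance(bounding_boxes, dict):
    bounding_boxes = [bounding_boxes]
pprint.pprint(bounding_boxes, indent=4)
img_out = draw_bboxes(img.copy(), bounding_boxes)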

# Display the output
plt.figure(figsize=(8, 8))
plt.imshow(img_out)
plt.axis("off")
plt.title("Output")
plt.show()
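
One thing worth checking if the boxes look slightly misplaced: preprocessing may resize the image before the model sees it, and the predicted coordinates may then refer to the resized image rather than the original. Assuming that is the case, the sketch below maps a box back to the original resolution. `rescale_bbox` and the sizes used are hypothetical; the actual processed size would have to be read from the processor outputs.

```python
def rescale_bbox(bbox, model_size, original_size):
    """Map [x1, y1, x2, y2] from the model's input resolution to the original image.

    Both sizes are (width, height) tuples.
    """
    model_w, model_h = model_size
    orig_w, orig_h = original_size
    scale_x = orig_w / model_w
    scale_y = orig_h / model_h
    x1, y1, x2, y2 = bbox
    return [round(x1 * scale_x), round(y1 * scale_y),
            round(x2 * scale_x), round(y2 * scale_y)]


# Hypothetical sizes: the model saw a 448x336 image, the original is 896x672.
scaled = rescale_bbox([10, 20, 110, 220], (448, 336), (896, 672))
print(scaled)  # [20, 40, 220, 440]
```

If the two sizes are identical, the function returns the box unchanged, so it is safe to apply unconditionally.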