<a href="https://colab.research.google.com/github/amir-asari/VLM_Bootcamp/blob/main/Qwen2_5_VL_Image_Classification_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Qwen2.5-VL Image Classification with Hugging Face Transformers
This notebook is a consolidated script for running Qwen2.5-VL image classification in a Google Colab environment with a GPU. Run the cells in order.

----------------------------------------------------------------
1. Setup and Installation
----------------------------------------------------------------

Install the necessary libraries. We install the latest transformers from source
to ensure Qwen2.5-VL compatibility, and bitsandbytes for 4-bit quantization.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git accelerate
!pip install -q -U bitsandbytes
!pip install -q qwen-vl-utils pillow

import torch
import warnings
from PIL import Image
import requests
from io import BytesIO
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor, BitsAndBytesConfig

# Suppress minor warnings for a clean Colab output
warnings.filterwarnings('ignore')

print("Installation and Imports complete.")

Installation and Imports complete.
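
Because `!pip` commands do not raise a Python exception when they fail, the final print statement can appear even if an install step did not succeed. The sketch below is one way to confirm what is actually installed before loading the model. The helper name `installed_versions` and the `PACKAGES` list are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata

# Distribution names matching the pip commands in the setup cell (plus torch,
# which Colab preinstalls). This list is an assumption for illustration.
PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes", "qwen-vl-utils", "pillow"]


def installed_versions(packages):
    """Return {package: version string or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` suggests rerunning the corresponding `!pip install` command and checking its output.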


----------------------------------------------------------------
2. Model and Processor Loading
----------------------------------------------------------------

In [2]:
# We use the 3B model so that it fits comfortably on a Colab GPU.
# The 7B variant is a drop-in alternative if more GPU memory is available:
# MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

# Check for GPU availability and set device
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16 # Use bfloat16 for faster inference if supported
else:
    device = "cpu"
    dtype = torch.float32

print("--- Environment Setup ---")
print(f"Device: {device}")
print(f"Loading Model: {MODEL_ID}")
print("-" * 25)

try:
    # Define 4-bit quantization configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16 # Compute in float16
    )

    # Load the Model and Processor
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # Load model with performance optimizations
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        dtype=dtype,  # `torch_dtype` is deprecated in recent transformers releases
        device_map="auto",
    ).eval()
    print("Model loaded successfully.")

except Exception as e:
    print(f"Error loading model or libraries: {e}")
    print("Please ensure you have a T4 GPU enabled in Colab runtime settings.")
    model = None

--- Environment Setup ---
Device: cuda
Loading Model: Qwen/Qwen2.5-VL-3B-Instruct
-------------------------


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


Model loaded successfully.
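
To get a rough sense of why 4-bit quantization helps here, the checkpoint size can be scaled by precision. The download for the 3B model consists of two shards of about 3.98 GB and 3.53 GB. The sketch below assumes those weights are stored at 16 bits each (an assumption, not stated in the notebook) and uses a hypothetical helper, `estimate_weight_memory_gb`. It covers weights only: bitsandbytes keeps some modules in higher precision, and activations, the KV cache, and CUDA overhead add to the real footprint.

```python
def estimate_weight_memory_gb(checkpoint_gb, checkpoint_bits=16, target_bits=4):
    """Scale a checkpoint size from its stored precision to a target precision."""
    return checkpoint_gb * target_bits / checkpoint_bits


# Shard sizes reported during the model download.
checkpoint_gb = 3.98 + 3.53

# Assumption: 16 bits (2 bytes) per stored weight, so GB / 2 gives billions of parameters.
approx_params_billion = checkpoint_gb / 2

print(f"Approximate parameters: {approx_params_billion:.2f}B")
print(f"Weights at 16-bit: {checkpoint_gb:.2f} GB")
print(f"Weights at 4-bit:  {estimate_weight_memory_gb(checkpoint_gb):.2f} GB (lower bound)")
```

Under these assumptions, the quantized weights need roughly a quarter of the memory of the 16-bit checkpoint, which leaves more room on a 16 GB Colab GPU for image tokens and generation.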


----------------------------------------------------------------
3. Inference Function Definition
----------------------------------------------------------------

In [3]:
def classify_image(image_url: str, question: str):
    """
    Downloads an image, prepares the multimodal prompt, and generates a response
    using the Qwen2.5-VL model.
    """
    if model is None:
        return "Model failed to load. Cannot run inference."

    print("\n--- Running Inference ---")
    print(f"Question: {question}")

    # 1. Download and load the image
    try:
        response = requests.get(image_url, timeout=30)  # Fail instead of hanging on an unresponsive URL
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert("RGB")
        print(f"Image loaded: {image_url}")
    except Exception as e:
        return f"Error loading image from URL: {e}"

    # 2. Construct the chat template (Qwen-VL uses a specific structure)
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ],
        }
    ]

    # 3. Process the input (tokenization and tensor conversion)
    try:
        inputs = processor.apply_chat_template(
            conversation,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(model.device)  # Follow the device selected by device_map="auto"

        # 4. Generate the response
        output_ids = model.generate(
            **inputs,
            do_sample=False,
            max_new_tokens=50,
        )

        # Decode only the newly generated tokens. output_ids also contains the
        # prompt tokens, so slice them off using the prompt length.
        prompt_length = inputs["input_ids"].shape[1]
        generated_ids = output_ids[0][prompt_length:]
        response_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

        return response_text

    except Exception as e:
        return f"Error during model inference: {e}"

----------------------------------------------------------------
4. Execution Examples
----------------------------------------------------------------

In [4]:
# Example 1: Standard Image Classification (What is this?)
image_url_1 = "https://www.activewild.com/wp-content/uploads/2021/12/Malayan-Tapir.jpg"
question_1 = "What kind of animal is this in the image? Be precise and only state the breed."

# Example 2
image_url_2 = "https://asset.kompas.com/crops/tkMmIaj1OiYlxUgX-KbD1CVzyHc=/0x0:4709x3139/1200x800/data/photo/2021/05/07/60956ee89fed2.jpg"
question_2 = "What kind of animal is this in the image? Be precise and only state the breed."

# Result 1
result_1 = classify_image(image_url_1, question_1)
print(f"\n[Image 1 Classification Result]\n{result_1}")
print("=" * 50)

# Result 2
result_2 = classify_image(image_url_2, question_2)
print(f"\n[Image 2 Classification Result]\n{result_2}")
print("=" * 50)


--- Running Inference ---
Question: What kind of animal is this in the image? Be precise and only state the breed.
Image loaded: https://www.activewild.com/wp-content/uploads/2021/12/Malayan-Tapir.jpg


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[Image 1 Classification Result]
Tapir
==================================================

--- Running Inference ---
Question: What kind of animal is this in the image? Be precise and only state the breed.
Image loaded: https://asset.kompas.com/crops/tkMmIaj1OiYlxUgX-KbD1CVzyHc=/0x0:4709x3139/1200x800/data/photo/2021/05/07/60956ee89fed2.jpg

[Image 2 Classification Result]
Tortoise
==================================================

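
The examples above ask an open-ended question, so the model answers in free text. For a classification task with a known set of classes, one option is to list the allowed labels in the prompt and then map the answer back to a label. The sketch below illustrates that idea under stated assumptions: `build_label_prompt`, `match_label`, and the `LABELS` list are hypothetical names introduced for illustration and are not part of the notebook or of any library.

```python
import re

# Hypothetical closed label set; replace with the classes of your task.
LABELS = ["tapir", "tortoise", "elephant"]


def build_label_prompt(labels):
    """Build a question asking for exactly one label from a closed set."""
    options = ", ".join(labels)
    return f"Classify the animal in the image. Answer with exactly one of: {options}."


def match_label(answer, labels):
    """Return the first label that appears as a whole word in the answer, or None."""
    normalized = answer.lower()
    for label in labels:
        if re.search(rf"\b{re.escape(label.lower())}\b", normalized):
            return label
    return None


print(build_label_prompt(LABELS))
print(match_label("It looks like a tortoise.", LABELS))
```

The prompt string could be passed as the `question` argument of `classify_image`, and `match_label` applied to its return value. Returning `None` for answers outside the label set makes unexpected outputs easy to detect and count.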