https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct \
https://github.com/QwenLM/Qwen2.5-VL

In [1]:
import os

In [2]:
# Expose only GPU 0 to this process. Set this before importing torch so that it takes effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Set CUDA_VISIBLE_DEVICES to expose devices 0, 1, 2, and 3
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
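
Inside the process, the visible devices are renumbered starting from zero, so `cuda:0` always refers to the first entry in `CUDA_VISIBLE_DEVICES`. A minimal sketch of that renumbering follows. `physical_gpu_id` is a hypothetical helper written for illustration (it is not part of PyTorch or CUDA), and it assumes the variable holds plain integer indices rather than GPU UUIDs.

```python
import os


def physical_gpu_id(logical_index, env=None):
    """Map a logical CUDA index (as seen by the process) to a physical GPU id.

    Hypothetical helper for illustration only. Assumes CUDA_VISIBLE_DEVICES
    is a comma-separated list of integer indices; UUID entries are not handled.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Variable unset: all GPUs are visible and indices are unchanged.
        return logical_index
    ids = [int(part) for part in visible.split(",") if part.strip()]
    return ids[logical_index]


# With "2,3" exposed, logical device 1 is physical GPU 3.
print(physical_gpu_id(0, {"CUDA_VISIBLE_DEVICES": "0"}))
print(physical_gpu_id(1, {"CUDA_VISIBLE_DEVICES": "2,3"}))
```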

In [3]:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info



In [4]:
torch.cuda.current_device()

0

In [5]:
# flash_attention_2 is recommended for faster inference and lower memory use, especially with multiple images or video.
# It requires the separate flash-attn package to be installed.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Loading checkpoint shards: 100%|██████████| 5/5 [00:04<00:00,  1.17it/s]
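
If the `flash-attn` package is not installed, loading with `attn_implementation="flash_attention_2"` fails. One possible way to guard against that is sketched below, assuming that falling back to PyTorch's scaled dot-product attention (`"sdpa"`) is acceptable for your use case. `pick_attn_implementation` is a hypothetical helper, and it only checks that the package is importable, not that the GPU actually supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Hypothetical helper: it checks only that the package can be found,
    not that the installed build works on the current GPU.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call.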


In [6]:
model.device

device(type='cuda', index=0)

In [7]:
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
processor.tokenizer.padding_side = "left"

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
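
The `28*28` factor in the commented-out lines converts a visual-token budget into a pixel budget. A small sketch of that arithmetic, assuming (as those comments imply) that one visual token covers a 28x28 pixel area; `token_budget_to_pixels` is a hypothetical helper name:

```python
# Assumption taken from the 28*28 factor in the cell above:
# one visual token corresponds to a 28x28 pixel area.
PIXELS_PER_TOKEN = 28 * 28


def token_budget_to_pixels(min_tokens, max_tokens):
    """Convert a (min, max) visual-token budget into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


min_pixels, max_pixels = token_budget_to_pixels(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The resulting values are the ones that would be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`.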


#### Single Image Inference

In [8]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./DeepSeek-VL2/images/visual_grounding_2.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

In [9]:
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [10]:
print(text)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
<|im_start|>assistant
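
To make the layout of the printed prompt easier to follow, here is a rough sketch that rebuilds the same string by hand. `render_chatml` is a hypothetical function, not the processor's actual template: it handles only `"image"` and `"text"` content items, always inserts the default system message seen above, and always appends the assistant generation prompt.

```python
def render_chatml(messages, system="You are a helpful assistant."):
    """Hypothetical re-implementation of the prompt layout printed above.

    Not the real chat template: only "image" and "text" items are handled.
    """
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n")
        for item in message["content"]:
            if item["type"] == "image":
                # A single placeholder; the processor expands it later.
                parts.append("<|vision_start|><|image_pad|><|vision_end|>")
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|im_end|>\n")
    # Equivalent of add_generation_prompt=True.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(render_chatml(example))
```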



In [11]:
image_inputs, video_inputs = process_vision_info(messages)

In [12]:
image_inputs

[<PIL.Image.Image image mode=RGB size=672x616>]

In [13]:
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

In [14]:
type(inputs)

transformers.feature_extraction_utils.BatchFeature

In [15]:
inputs = inputs.to(model.device)

In [16]:
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

In [17]:
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

In [18]:
print(output_text[0])

The image is a cartoon depicting two people in business attire. The person on the left, labeled "我" (I), appears to be smiling and pointing at the other person with a speech bubble above their head that says "你看 又急" (Look, you're always in a hurry). The person on the right, labeled "导" (Guide), looks frustrated or annoyed, with sweat drops and flames around their head, indicating anger or stress. The overall scene suggests a humorous interaction where one person is being scolded for being impatient.


#### Batch Image Inference

In [30]:
messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./DeepSeek-VL2/images/visual_grounding_2.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./DeepSeek-VL2/images/visual_grounding_1.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

messages = [messages1, messages2]

In [31]:
# Preparation for inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages
]

In [32]:
print(texts)

['<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n', '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n']


In [33]:
image_inputs, video_inputs = process_vision_info(messages)

In [34]:
image_inputs

[<PIL.Image.Image image mode=RGB size=672x616>,
 <PIL.Image.Image image mode=RGB size=728x1092>]

In [35]:
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

In [36]:
inputs

{'input_ids': tensor([[151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151644,   8948,    198,  ..., 151644,  77091,    198]]), 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'pixel_values': tensor([[ 1.9303,  1.9303,  1.9303,  ...,  2.1459,  2.1459,  2.1459],
        [ 1.9303,  1.9303,  1.9303,  ...,  2.1459,  2.1459,  2.1459],
        [ 1.9303,  1.9303,  1.9303,  ...,  2.1459,  2.1459,  2.1459],
        ...,
        [-1.4419, -1.4273, -1.4419,  ..., -1.2669, -1.1958, -1.0536],
        [-1.5587, -1.5733, -1.4565,  ..., -0.8261, -1.0110, -1.0110],
        [-1.5295, -1.3689, -1.4711,  ..., -1.2100, -1.1674, -1.1105]]), 'image_grid_thw': tensor([[ 1, 44, 48],
        [ 1, 78, 52]])}
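
The `image_grid_thw` rows explain why the two prompts differ in length and why the first one is left-padded. Each row is (temporal, height, width) in patches. The sketch below assumes 14x14 pixel patches merged 2x2 into one visual token, which is consistent with the image sizes printed earlier (44 x 14 = 616 and 48 x 14 = 672 for the first image). `visual_token_count` is a hypothetical helper, not a library function.

```python
PATCH_SIZE = 14  # assumed patch edge in pixels
MERGE_SIZE = 2   # assumed 2x2 patch merge per visual token


def visual_token_count(grid_thw, merge_size=MERGE_SIZE):
    """Number of visual tokens implied by one image_grid_thw row."""
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)


# Rows copied from the image_grid_thw tensor printed above.
grids = [(1, 44, 48), (1, 78, 52)]
for t, h, w in grids:
    size = (w * PATCH_SIZE, h * PATCH_SIZE)  # (width, height) in pixels
    print(size, visual_token_count((t, h, w)))
# (672, 616) 528
# (728, 1092) 1014
```

Under these assumptions the second image needs almost twice as many visual tokens as the first, so the first sequence is the shorter one and receives the padding.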

In [37]:
inputs = inputs.to(model.device)

In [38]:
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
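
The slice `out_ids[len(in_ids):]` works for a batch because the prompts are left-padded: every padded prompt has the same length, and the new tokens start right after it. A toy sketch with plain lists illustrates this. The pad id and token ids are made up, and `left_pad` and `trim_prompts` are hypothetical helpers that only mimic the shape of the real tensors.

```python
PAD = 0  # made-up pad id for this toy example


def left_pad(rows, pad=PAD):
    """Pad every row on the left to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [[pad] * (width - len(row)) + row for row in rows]


def trim_prompts(input_ids, generated_ids):
    """Drop the (padded) prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompts = [[11, 12], [21, 22, 23, 24]]
input_ids = left_pad(prompts)

# Pretend generation appended two new tokens to each padded prompt.
new_tokens = [[91, 92], [93, 94]]
generated_ids = [row + new for row, new in zip(input_ids, new_tokens)]

print(input_ids)                               # [[0, 0, 11, 12], [21, 22, 23, 24]]
print(trim_prompts(input_ids, generated_ids))  # [[91, 92], [93, 94]]
```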

In [39]:
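# Note: this decodes generated_ids, which still contains the prompt tokens,
# so each output string below starts with the system and user turns.
# Pass generated_ids_trimmed instead to get only the model's responses.
output_texts = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)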

In [40]:
output_texts

['system\nYou are a helpful assistant.\nuser\nDescribe this image.\nassistant\nThe image is a cartoon depicting two characters in business attire. The character on the left, labeled "我" (I), appears to be smiling and pointing at the other character. The character on the right, labeled "导" (Guide), looks frustrated or angry, with flames coming out of their head and a speech bubble saying "你看 又急" (Look, you\'re always in a hurry). The overall scene suggests a humorous or exaggerated interaction between the two characters, possibly highlighting a common workplace scenario where one person is perceived as being impatient or rushed by the other.',
 'system\nYou are a helpful assistant.\nuser\nDescribe this image.\nassistant\nThe image shows two giraffes standing on a grassy field under a clear blue sky. The giraffe in the foreground is larger and appears to be an adult, while the one in the background is smaller, likely a juvenile. Both giraffes have their characteristic long necks and spot