https://github.com/OpenBMB/MiniCPM-o \
https://huggingface.co/openbmb/MiniCPM-V-2_6

In [1]:
import os

In [2]:
# Set CUDA_VISIBLE_DEVICES to expose only device 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Set CUDA_VISIBLE_DEVICES to expose devices 0, 1, 2, and 3
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
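The device selection above can also be wrapped in a small helper. This is a minimal sketch: `set_visible_gpus` is a hypothetical name, not part of any library. It only writes the environment variable, so it has to run before CUDA is initialized in the process; the safest place is before `torch` is imported, as this notebook does.

```python
import os

def set_visible_gpus(device_ids):
    """Expose only the given GPU indices to this process (hypothetical helper)."""
    # CUDA expects a comma-separated list of device indices, e.g. "0,1,2,3".
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

set_visible_gpus([0])  # same effect as the assignment above
```

Passing `[0, 1, 2, 3]` instead corresponds to the commented-out four-device setting.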

In [3]:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

In [4]:
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                   attn_implementation='sdpa', torch_dtype=torch.bfloat16) # use 'sdpa' or 'flash_attention_2'; 'eager' is not supported

model = model.cuda().eval()

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.94it/s]


In [5]:
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

#### Single Image Inference

In [6]:
image = Image.open('./DeepSeek-VL2/images/visual_grounding_1.jpeg').convert('RGB')
question = 'Describe the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

In [7]:
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [8]:
print(res)

The image shows two giraffes in a grassy field with trees in the background. The larger giraffe is prominently featured in the foreground, while the smaller one stands slightly behind and to its right. Giraffes are known for their long necks and legs, which help them reach leaves high up in trees, and their distinctive coat patterns. They inhabit savannas, grasslands, open woodlands, and mountainous areas across Africa. This setting suggests that these giraffes might be in a wildlife reserve or safari park where they can roam freely in an environment similar to their natural habitat.


#### Multiple Images in One Question

In [9]:
image1 = Image.open('./DeepSeek-VL2/images/visual_grounding_1.jpeg').convert('RGB')
image2 = Image.open('./DeepSeek-VL2/images/visual_grounding_2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

In [10]:
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)

In [11]:
print(answer)

The main difference between the two images is their content. Image 1 features a natural scene with giraffes, while image 2 depicts an animated cartoon of people in suits. The colors and overall themes are also distinct, with the former having more earthy tones and the latter featuring bright, contrasting colors typical of cartoons.
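The single-image and multi-image calls in this notebook differ only in how many images precede the question inside `content`. A minimal sketch of a wrapper that covers both cases follows; `ask` is a hypothetical helper name, and it assumes `model.chat` accepts the same keyword arguments used in the cells above.

```python
def ask(model, tokenizer, images, question):
    """Send one user turn containing any number of images (hypothetical helper)."""
    # The images go first in the content list, followed by the text question.
    msgs = [{'role': 'user', 'content': [*images, question]}]
    # Mirrors the notebook's call: images travel inside msgs, so image=None.
    return model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
```

With the objects defined earlier, `ask(model, tokenizer, [image1, image2], question)` would issue the same request as the cell above, and `ask(model, tokenizer, [image], question)` the single-image one.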


#### Batch Inference

In [15]:
image1 = Image.open('./DeepSeek-VL2/images/visual_grounding_1.jpeg').convert('RGB')
image2 = Image.open('./DeepSeek-VL2/images/visual_grounding_2.jpg').convert('RGB')
question = 'Describe the image'

msgs = [
    [{'role': 'user', 'content': [image1, question]}],
    [{'role': 'user', 'content': [image2, question]}],
]
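For more than a couple of images, the batch structure above (one single-turn conversation per image) can be built with a list comprehension. This is a sketch only: `build_batch_msgs` is a hypothetical helper name, and placeholder strings stand in for PIL images so the snippet runs on its own.

```python
def build_batch_msgs(images, question):
    """Return one single-turn conversation per image (hypothetical helper)."""
    # Each batch element is its own message list, matching the structure above.
    return [[{'role': 'user', 'content': [img, question]}] for img in images]

# Placeholder strings are used here in place of PIL images.
batch = build_batch_msgs(['<image1>', '<image2>'], 'Describe the image')
```

In the notebook, `build_batch_msgs([image1, image2], question)` would produce the same `msgs` value as the literal above.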

In [16]:
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)

In [17]:
answer

["The image captures two giraffes in a grassy field, which appears to be a wildlife reserve or safari park. The taller giraffe is prominently featured in the foreground, while the second one stands slightly behind and to the right. Both animals are facing left, with their heads turned towards the camera, giving a sense of direct engagement with the viewer. In the background, there's a hint of other animals grazing and trees that provide a naturalistic setting for these majestic creatures. The clear blue sky suggests it might be midday or early afternoon when the sun is bright but not directly overhead.",
 'The image is a cartoon depicting an interaction between two characters, labeled as "我" (I) and "导" (Guide). The character on the right, presumably the guide, is depicted with a large bald spot on their head, red marks indicating irritation or anger, and flames coming from their head, suggesting they are very angry. They are pointing at the character on the left, who appears calm and 