# Preprocessing multimodal inputs
Ashok Kumar Pant

## Model setup
* Processor: plays the role a tokenizer plays for an LLM; it prepares images (resizing, normalization) and text (tokenization) for the model.
* Model: Blip2ForConditionalGeneration from Hugging Face Transformers.

**BLIP-2**

BLIP-2 (Bootstrapping Language-Image Pre-training, version 2) is a vision-language model developed by Salesforce. It is designed for tasks such as image captioning, visual question answering (VQA), and image-grounded dialogue. BLIP-2 connects a frozen vision encoder to a frozen language model through a lightweight Querying Transformer (Q-Former), so only the Q-Former has to be trained, which keeps training efficient.

In [15]:
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

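# Load the processor (image and text preprocessing) and the model in half precision
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

# Run on the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)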


In [4]:
!pip install torchinfo



In [17]:
# print(model)
from torchinfo import summary
summary(model)

Layer (type:depth-idx)                                                 Param #
Blip2ForConditionalGeneration                                          24,576
├─Blip2VisionModel: 1-1                                                --
│    └─Blip2VisionEmbeddings: 2-1                                      363,264
│    │    └─Conv2d: 3-1                                                829,312
│    └─Blip2Encoder: 2-2                                               --
│    │    └─ModuleList: 3-2                                            984,756,864
│    └─LayerNorm: 2-3                                                  2,816
├─Blip2QFormerModel: 1-2                                               --
│    └─LayerNorm: 2-4                                                  1,536
│    └─Dropout: 2-5                                                    --
│    └─Blip2QFormerEncoder: 2-6                                        --
│    │    └─ModuleList: 3-3                                            105,136
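Several rows of the summary can be reproduced by hand, which is a quick check on how the pieces fit together. The sketch below assumes configuration values that are not stated above (a 224x224 input, 14x14 patches, a vision hidden size of 1408, and 32 learned queries of width 768 in the Q-Former). Treat them as assumptions to verify against `model.config`, not as authoritative values.

```python
# Assumed configuration values (not stated in the text above).
IMAGE_SIZE = 224
PATCH_SIZE = 14
VISION_HIDDEN = 1408
NUM_QUERIES = 32
QFORMER_HIDDEN = 768

# Conv2d patch embedding: one 3 x 14 x 14 kernel per output channel, plus a bias.
conv_params = 3 * PATCH_SIZE * PATCH_SIZE * VISION_HIDDEN + VISION_HIDDEN

# Parameters held directly by the vision embeddings: one class token, plus one
# position vector for each patch and for the class token.
num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2
embedding_params = VISION_HIDDEN + (num_patches + 1) * VISION_HIDDEN

# A LayerNorm holds one weight and one bias per feature.
vision_layernorm_params = 2 * VISION_HIDDEN
qformer_layernorm_params = 2 * QFORMER_HIDDEN

# Learned query tokens held at the top level of the model.
query_token_params = NUM_QUERIES * QFORMER_HIDDEN

print(f"Conv2d 3-1:               {conv_params:,}")               # 829,312
print(f"Blip2VisionEmbeddings:    {embedding_params:,}")          # 363,264
print(f"LayerNorm 2-3 (vision):   {vision_layernorm_params:,}")   # 2,816
print(f"LayerNorm 2-4 (Q-Former): {qformer_layernorm_params:,}")  # 1,536
print(f"Top-level query tokens:   {query_token_params:,}")        # 24,576
```

Under these assumptions the five values match the corresponding rows of the summary, including the 24,576 parameters reported on the top-level `Blip2ForConditionalGeneration` row.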

## Preprocessing multimodal inputs
### Preprocessing Images
The processor resizes the image to 224x224, regardless of its original aspect ratio, and converts it to a PyTorch tensor of shape [1, 3, 224, 224] (batch, channels, height, width).

In [10]:
from PIL import Image
from urllib.request import urlopen

car_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/car.png"
image = Image.open(urlopen(car_path)).convert("RGB")

inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
print(inputs["pixel_values"].shape)  # Output: torch.Size([1, 3, 224, 224])

torch.Size([1, 3, 224, 224])
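To make the resize and normalization steps concrete, here is a minimal pure-Python sketch of the same idea on a synthetic image. It is not the processor's actual implementation: it assumes nearest-neighbour resampling (the real processor interpolates) and CLIP-style per-channel mean and standard deviation constants, and the helper names `resize_nearest`, `preprocess`, and `shape` are hypothetical.

```python
# Assumed normalization constants (CLIP-style mean and std per RGB channel).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_nearest(pixels, out_h, out_w):
    """Resize an H x W grid of RGB tuples to out_h x out_w, ignoring aspect ratio."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def preprocess(pixels, size=224):
    """Return a [1][3][size][size] nested list of normalized floats."""
    resized = resize_nearest(pixels, size, size)
    channels = []
    for c in range(3):
        # Rescale 0..255 to 0..1, then subtract the mean and divide by the std.
        plane = [
            [(px[c] / 255.0 - CLIP_MEAN[c]) / CLIP_STD[c] for px in row]
            for row in resized
        ]
        channels.append(plane)
    return [channels]  # leading batch dimension of 1


def shape(nested):
    """Report the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


# Synthetic 300 x 500 (height x width) RGB image: a horizontal red gradient.
toy_image = [[(x * 255 // 499, 0, 0) for x in range(500)] for _ in range(300)]
batch = preprocess(toy_image)
print(shape(batch))  # [1, 3, 224, 224]
```

The 300x500 input is squashed to a square, which is the "regardless of original aspect ratio" behaviour described above, and the channel axis is moved in front of height and width.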


### Preprocessing Text
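The processor for this checkpoint tokenizes text with a GPT-2 style byte-level BPE tokenizer (GPT2Tokenizer).

A space before a word is attached to the start of that word's first token and is displayed as Ġ (the space marker).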

In [11]:
text = "Her vocalization was remarkably melodic"
token_ids = blip_processor(image, text=text, return_tensors="pt").to(device, torch.float16)["input_ids"][0]
tokens = blip_processor.tokenizer.convert_ids_to_tokens(token_ids)
tokens = [t.replace("Ġ", "_") for t in tokens]  # Illustrative replacement
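# Text tokens: ['</s>', 'Her', '_vocal', 'ization', '_was', '_remarkably', '_mel', 'odic'].
# This processor version also prepends '<image>' placeholder tokens, as the output shows.
print(tokens)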

['<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '<image>', '</s>', 'Her', '_vocal', 'ization', '_was', '_remarkably', '_mel', 'odic']
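The Ġ marker comes from the byte-to-character table that GPT-2 style tokenizers apply before BPE merging: every byte gets a printable stand-in, and the space byte lands on U+0120 (Ġ). The sketch below rebuilds that table in plain Python. The word-splitting pattern is a simplified stand-in for the tokenizer's real pattern, `mark_spaces` is a hypothetical helper, and no BPE merges are applied, so words such as "vocalization" stay whole here.

```python
import re


def bytes_to_unicode():
    """Map each byte value to a printable character, GPT-2 style."""
    printable = (
        list(range(0x21, 0x7F))     # '!' to '~'
        + list(range(0xA1, 0xAD))   # U+00A1 to U+00AC
        + list(range(0xAE, 0x100))  # U+00AE to U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including space) are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def mark_spaces(text):
    """Split into words that keep their leading space, then remap every byte."""
    words = re.findall(r" ?\S+", text)  # simplified pre-tokenization pattern
    return ["".join(BYTE_TO_CHAR[b] for b in w.encode("utf-8")) for w in words]


print(BYTE_TO_CHAR[ord(" ")])  # Ġ
print(mark_spaces("Her vocalization was remarkably melodic"))
# ['Her', 'Ġvocalization', 'Ġwas', 'Ġremarkably', 'Ġmelodic']
```

BPE merging then splits the longer words further, which is why the cell above shows pieces such as `_vocal` and `ization`.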


## Load and preprocess the image


In [12]:
from PIL import Image
from urllib.request import urlopen

# Example image URL: AI-generated supercar
car_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/car.png"
image = Image.open(urlopen(car_path)).convert("RGB")

# Preprocess the image to pixel values expected by BLIP-2
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)

## Generate caption tokens and decode them into text

In [13]:
# Generate caption token IDs (max 20 new tokens)
generated_ids = model.generate(**inputs, max_new_tokens=20)

# Decode tokens to readable text
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

print("Caption:", generated_text)


Caption: a car is driving down the road with a large sign on it
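The roles of `max_new_tokens` and `skip_special_tokens` can be illustrated with a toy decoding loop. This is a sketch under stated assumptions, not how `model.generate` is implemented: it assumes greedy decoding (always taking the single most likely next token), and the `NEXT_TOKEN` table, the `<bos>` and `<eos>` token names, and both helper functions are hypothetical stand-ins for the real model and tokenizer.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical special tokens
SPECIAL = {BOS, EOS}

# Hypothetical stand-in for the model: the most likely token after each token.
NEXT_TOKEN = {
    BOS: "a",
    "a": "car",
    "car": "on",
    "on": "the",
    "the": "road",
    "road": EOS,
}


def generate(prompt, max_new_tokens):
    """Append one token at a time until EOS or the token budget is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN[tokens[-1]]
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


def decode(tokens, skip_special_tokens=True):
    """Join tokens into text, optionally dropping the special tokens."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens).strip()


full = generate([BOS], max_new_tokens=20)
print(decode(full))                             # a car on the road
print(decode(full, skip_special_tokens=False))  # <bos> a car on the road <eos>
print(decode(generate([BOS], max_new_tokens=3)))  # a car on
```

With a budget of 20 the loop stops early at the end-of-sequence token, while a budget of 3 cuts the caption short, which is the trade-off to keep in mind when choosing `max_new_tokens`.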


## Fun experiment: Captioning a Rorschach inkblot


In [14]:
rorschach_url = "https://upload.wikimedia.org/wikipedia/commons/7/70/Rorschach_blot_01.jpg"
image = Image.open(urlopen(rorschach_url)).convert("RGB")
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

print("Caption for Rorschach test:", generated_text)


Caption for Rorschach test: a drawing of a horse with a black and white pattern
