# Comparing Color Attribution: Vision-Language Models vs. Language Models

This notebook demonstrates how color attribution differs between:
1. Vision-Language Models (VLMs) that can directly process images
2. Language Models (LMs) that rely on textual descriptions of images

We'll use a simple banana image to compare how each model identifies the color of the fruit.

## 1. Vision-Language Model Approach (PaliGemma2)

PaliGemma2 can directly analyze the image to determine the color of the fruit.

In [None]:
# Code adapted from https://huggingface.co/google/paligemma2-3b-mix-224

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-3b-mix-224"

url = "https://upload.wikimedia.org/wikipedia/commons/8/8a/Banana-Single.jpg"
image = load_image(url)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

prompt = "answer en which color is the fruit?"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

# -> "yellow"


yellow


## 2. Language Model Approach (Gemma 2)

Gemma 2 cannot see the image directly. Instead, it relies on a textual description of the image to determine the color.

In [None]:
# Code adapted from https://huggingface.co/google/gemma-2-2b-it

# !pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

image_description = "The image shows a bunch of bananas lying on a weathered wooden surface. The bananas are clustered together, still attached at their stems. The wood grain of the surface is visible and adds a rustic feel to the image. "
input_prompt = image_description + "What is the color of the fruit?"
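# Move the inputs to the device chosen by device_map="auto" instead of hardcoding "cuda"
inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True))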

# -> The color of the fruit is **yellow**.

## Conclusion

Both models correctly identify the banana as yellow, but through different mechanisms:
- The VLM (PaliGemma2) directly processes the visual information from the image
- The LM (Gemma 2) relies on the textual description, which mentions bananas but never states a color, so its answer comes from the learned association between bananas and yellow
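
The description passed to Gemma 2 can be checked for explicit color terms. The sketch below is one minimal way to do that; the `COLOR_WORDS` list and the `color_terms_in` helper are illustrative assumptions, not part of either model's API, and the list is far from exhaustive.

```python
import re

# Illustrative, non-exhaustive list of basic color terms (an assumption).
COLOR_WORDS = {
    "yellow", "green", "red", "brown", "black", "white",
    "orange", "blue", "purple", "pink", "gray", "grey",
}


def color_terms_in(text: str) -> list:
    """Return the color terms that appear as whole words in `text`, in order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in COLOR_WORDS]


# Same description as in the Gemma 2 cell above.
image_description = (
    "The image shows a bunch of bananas lying on a weathered wooden surface. "
    "The bananas are clustered together, still attached at their stems. "
    "The wood grain of the surface is visible and adds a rustic feel to the image. "
)

print(color_terms_in(image_description))
# -> []
print(color_terms_in("A ripe yellow banana on a brown table."))
# -> ['yellow', 'brown']
```

An empty result for the description suggests that any color the LM reports has to come from what it learned about bananas during training rather than from the prompt itself.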

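This illustrates how the two model types differ when attributing a property like color: the VLM can ground its answer in the image itself, while the LM can only ground it in the text it is given and in its prior knowledge.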
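
Because the two models phrase their answers differently, comparing them directly as strings would fail. The following is a small sketch of one way to normalize both outputs to a single color term before comparing; `extract_color` and `COLOR_WORDS` are hypothetical helpers introduced here for illustration, and the answer strings are the ones shown in the cells above.

```python
import re
from typing import Optional

# Illustrative, non-exhaustive list of basic color terms (an assumption).
COLOR_WORDS = {
    "yellow", "green", "red", "brown", "black", "white",
    "orange", "blue", "purple", "pink", "gray", "grey",
}


def extract_color(answer: str) -> Optional[str]:
    """Return the first color term found in a model answer, or None."""
    for word in re.findall(r"[a-z]+", answer.lower()):
        if word in COLOR_WORDS:
            return word
    return None


vlm_answer = "yellow"                                # PaliGemma2 output above
lm_answer = "The color of the fruit is **yellow**."  # Gemma 2 output above

print(extract_color(vlm_answer), extract_color(lm_answer))
# -> yellow yellow
print(extract_color(vlm_answer) == extract_color(lm_answer))
# -> True
```

This simple word match ignores negations and multi-word colors, so it is only suitable for short answers like the ones in this notebook.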