# Multimodal Transformer: Image + Text Test

This notebook demonstrates a simple multimodal pipeline using OpenAI's CLIP model through Hugging Face Transformers.
It compares an input image with one or more text prompts and reports a softmax-normalized similarity score for each prompt.

Notes:
- Uses `openai/clip-vit-base-patch32`. On first run, it downloads model weights.
- If your environment has no internet access, run the notebook where the weights are already cached, or download them ahead of time.
- Works on CPU by default; uses GPU automatically if available.

## Setup

If required packages are missing, install them with pip. In Jupyter you can use the `%pip` magic, as in the cell below; skip that cell if everything is already installed.
If your environment has no network access, make sure these packages and the model weights are preinstalled or cached.

```python
%pip install torch torchvision transformers pillow
```
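
To see which packages are actually missing before installing anything, a small check along these lines may help. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the original notebook. Note that Pillow is installed as `pillow` but imported as `PIL`.

```python
import importlib.util

# pip name -> import name (hypothetical helper table; Pillow imports as PIL)
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "pillow": "PIL",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = missing_packages()
if missing:
    print("Missing packages, install with: %pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```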

In [1]:
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [2]:
model_name = 'openai/clip-vit-base-patch32'

# Loads processor and model (downloads weights on first run if not cached)
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)
model = model.to(device)
print(f'Loaded {model_name} on {device}')

Loaded openai/clip-vit-base-patch32 on cuda


In [3]:
from typing import List, Tuple, Union

def clip_similarity(image_path: str, texts: Union[str, List[str]], top_k: int = 5) -> List[Tuple[str, float]]:
    """
    Compute CLIP similarity between an image and one or more text prompts.

    Returns a list of (text, probability) sorted by probability descending.
    """
    if isinstance(texts, str):
        texts = [texts]
    assert len(texts) > 0, 'texts must be non-empty'

    image = Image.open(image_path).convert('RGB')
    inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits_per_image = outputs.logits_per_image  # shape: [batch=1, num_texts]
        probs = logits_per_image.softmax(dim=1)

    probs = probs[0].detach().cpu()
    sorted_idx = probs.argsort(descending=True)[:top_k]
    return [(texts[i], float(probs[i])) for i in sorted_idx]

def pretty_print(results: List[Tuple[str, float]]):
    for text, p in results:
        print(f'{p:.3f}  {text}')

## Try it with multiple candidate texts
Update `image_path` to point to a local image file, then run the cell to see which text best matches the image.

In [5]:
# Change this to a valid local image path
image_path = '/home/dtrad/Downloads/images/hill.png'

candidate_texts = [
    'a dog playing in the park',
    'a cat sitting on a sofa',
    'a person riding a bike',
    'a scenic mountain landscape',
]

results = clip_similarity(image_path, candidate_texts, top_k=len(candidate_texts))
pretty_print(results)

1.000  a scenic mountain landscape
0.000  a person riding a bike
0.000  a dog playing in the park
0.000  a cat sitting on a sofa
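
The near one-hot scores above are expected. `logits_per_image` holds cosine similarities multiplied by a learned temperature (`model.logit_scale.exp()`, which should be close to 100 for this checkpoint), so small gaps in cosine similarity become large gaps after the softmax. The sketch below illustrates the effect with made-up cosine values; the numbers and the `logit_scale` constant are assumptions for illustration, not measurements from this notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities for four prompts (illustrative only).
cosines = [0.30, 0.22, 0.20, 0.18]

# Assumed temperature; in the notebook it can be read from model.logit_scale.exp().
logit_scale = 100.0

plain = softmax(cosines)                               # no temperature: nearly uniform
scaled = softmax([logit_scale * c for c in cosines])   # with temperature: sharply peaked

print("without scaling:", [round(p, 3) for p in plain])
print("with scaling:   ", [round(p, 3) for p in scaled])
```

Because of this sharpening, the probabilities are best read as a ranking among the supplied prompts, not as an absolute measure of how well any single prompt describes the image.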


## Try it with a single text
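You can also test a single text prompt; the function returns a single (text, probability) pair. Be aware that with only one prompt the softmax has a single entry, so the probability is always 1.0 no matter how well the text matches the image. The output below shows this: an image named `parrot.png` scores 1.0 against 'a photo of a black dog'. For a meaningful score, supply at least one alternative prompt or inspect the raw logit instead.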

In [6]:
image_path = '/home/dtrad/Downloads/images/parrot.png'  # <- change this
text = 'a photo of a black dog'
clip_similarity(image_path, text, top_k=1)[0]

('a photo of a black dog', 1.0)
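
The 1.0 above follows from the arithmetic, not from the image. A softmax over one logit is `exp(x) / exp(x)`. The sketch below shows this and how a raw logit could be mapped back to a cosine similarity by dividing by the temperature. The helper names and the `logit_scale=100.0` default are assumptions for illustration; in the notebook the actual value would come from `model.logit_scale.exp()`.

```python
import math

def single_prompt_probability(logit: float) -> float:
    """Softmax over a single logit: exp(x) / exp(x), which is 1.0 for any finite x."""
    return math.exp(logit) / math.exp(logit)

def cosine_from_logit(logit: float, logit_scale: float = 100.0) -> float:
    """Undo the temperature scaling (logit = logit_scale * cosine similarity)."""
    return logit / logit_scale

# Made-up logits standing in for a good, a mediocre, and a poor match.
for logit in (31.0, 18.0, -4.0):
    prob = single_prompt_probability(logit)
    cosine = cosine_from_logit(logit)
    print(f"logit={logit:6.1f}  softmax={prob:.3f}  cosine={cosine:.2f}")
```

In the notebook itself, one option is to pass contrast prompts, for example `clip_similarity(image_path, [text, 'a photo of a parrot'])` (the second prompt is a hypothetical alternative), so that the softmax has something to compare against.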

## Tips
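- The first run downloads the model weights (roughly 600 MB). Subsequent runs use the local cache.
- For offline use, pre-download the model by running this notebook once with internet access, or populate the cache manually, for example with the Hugging Face Hub CLI (`huggingface-cli download openai/clip-vit-base-patch32`).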
- GPU is used automatically if `torch.cuda.is_available()` is true. Otherwise, it runs on CPU.
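
For the offline case, a loader along these lines may be useful once the weights are cached. It is a sketch under assumptions: `load_clip_offline` is a hypothetical helper, not part of the original notebook. It relies on the `local_files_only=True` argument of `from_pretrained` and on the `HF_HUB_OFFLINE` environment variable, which is read when the Hugging Face libraries are first imported, so it should be set before importing `transformers` (ideally in a fresh kernel).

```python
import os

# Ask the Hugging Face libraries not to attempt network access.
# Only effective if set before they are first imported in this process.
os.environ["HF_HUB_OFFLINE"] = "1"

def load_clip_offline(model_name: str = "openai/clip-vit-base-patch32"):
    """Load CLIP strictly from the local cache; fails if the files are not cached."""
    # Imported here so the environment variable above is already in place.
    from transformers import CLIPModel, CLIPProcessor

    processor = CLIPProcessor.from_pretrained(model_name, local_files_only=True)
    model = CLIPModel.from_pretrained(model_name, local_files_only=True)
    return processor, model
```

Calling `processor, model = load_clip_offline()` then stands in for the two `from_pretrained` calls in the loading cell. If the cache is empty, the call fails with an error instead of starting a download.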