
# CoCa + CLIP Reranking: Hybrid Captioning Notebook

This notebook implements an **inference-time reranking** pipeline:

1. **Candidate Generation** — Generate `N` candidate captions for an image using **CoCa** (via `open_clip`) when available.  
   As a robust fallback, we use **BLIP** (Hugging Face `transformers`) to generate `N` candidates with nucleus sampling.
2. **CLIPScore Computation** — Compute **cosine similarity** between CLIP image and text embeddings.
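3. **Hybrid Scoring** — Compute `Score(c) = log P(c | I) + α * CLIPScore(I, c)`, where `P` is the caption likelihood under the generator (CoCa or BLIP). The implementation z-scores both terms across the candidates of an image before combining them, and the likelihood term is currently a placeholder (zeros for CoCa, a length penalty for BLIP).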
4. **Caption Selection** — Choose the caption with the highest hybrid score.
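
Steps 3 and 4 can be sketched without any model, using plain Python. The helper name `pick_best` and all numbers below are hypothetical and chosen only for illustration; the sketch applies the formula on raw, unnormalized values.

```python
from typing import List, Tuple


def pick_best(
    captions: List[str],
    logliks: List[float],
    clip_scores: List[float],
    alpha: float,
) -> Tuple[str, List[float]]:
    """Return the caption maximizing loglik + alpha * clip_score, plus all scores."""
    scores = [ll + alpha * cs for ll, cs in zip(logliks, clip_scores)]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores


# Hypothetical candidates and scores, for illustration only
captions = ["a dog", "a brown dog running on grass", "a cat"]
logliks = [-2.0, -6.0, -2.5]      # sequence log-likelihoods (more negative = less likely)
clip_scores = [0.24, 0.31, 0.15]  # cosine similarities

best_caption, scores = pick_best(captions, logliks, clip_scores, alpha=2.0)
print(best_caption, scores)
```

With these numbers the short caption wins at `alpha=2.0` because the log-likelihood gaps are much larger than the cosine-similarity gaps; the more specific caption only wins once `alpha` is very large. That scale mismatch is why some normalization of the two terms is usually worth considering.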

> ⚠️ **Note**: This notebook is designed to run both with and without internet.  
> - With internet: it will `pip install` missing deps and download models.  
> - Without internet: you can still run **CLIP-only reranking** on your **own candidate list** (e.g., from CoCa in your local repo), provided the packages are installed and the CLIP weights are already cached locally.

---



## Quick Start

**Option A — Full pipeline (internet available):**
1. Run **Setup** to install/load packages.
2. In **Config**, set `GENERATOR_BACKEND = "coca"` (preferred) or `"blip"` (fallback).
3. Run **Demo** on your image(s).

**Option B — Rerank-only (no internet; you already have candidates):**
1. Skip installs if packages are present.
2. In **Provide Your Own Candidates**, paste your list of candidates per image.
3. Run **Rerank + Select** to get the best caption per image.


## Setup

In [1]:

# If you're offline or already have these installed, you can skip the pip cells safely.
INSTALL = True  # set False if installs are not needed (or you have no internet)

if INSTALL:
    try:
        # Torch + torchvision for models and transforms
        import torch, torchvision
    except Exception:
        %pip -q install torch torchvision --index-url https://download.pytorch.org/whl/cpu

    try:
        import open_clip
    except Exception:
        %pip -q install open_clip_torch

    try:
        import transformers
    except Exception:
        %pip -q install transformers pillow

    try:
        import pandas  # used by the batch-inference cell
    except Exception:
        %pip -q install pandas

# Imports
import os, math, json, time, random
from pathlib import Path
from typing import List, Tuple, Dict, Optional

import torch
import torch.nn.functional as F
from PIL import Image

# Optional dependencies: fall back to None so later cells can detect what is missing
try:
    import open_clip
except Exception:
    open_clip = None

try:
    from transformers import BlipProcessor, BlipForConditionalGeneration
except Exception:
    BlipProcessor = BlipForConditionalGeneration = None

from torchvision import transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)




Device: cuda


## Config

In [2]:

class CFG:
    # === Generator choices ===
    # 'coca' (via open_clip) or 'blip' (fallback). If 'coca' fails, code auto-falls back to 'blip'.
    GENERATOR_BACKEND = "coca"
    N_CANDIDATES = 8              # number of candidates per image
    MAX_LEN = 32                  # max tokens for generated caption
    TOP_P = 0.9                   # nucleus sampling (used by BLIP; CoCa if supported)
    TEMPERATURE = 1.0

    # === Scoring ===
    ALPHA = 2.0                   # weight for CLIPScore in hybrid score

    # === Models ===
    # CLIP for scoring
    CLIP_ARCH = "ViT-B-32"
    CLIP_PRETRAINED = "openai"

    # CoCa variant (if available via open_clip)
    COCA_ARCH = "coca_ViT-L-14"
    COCA_PRETRAINED = "mscoco_finetuned_laion2b_s13b_b90k"

    # BLIP HF id (fallback generator)
    BLIP_MODEL_ID = "Salesforce/blip-image-captioning-base"

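CFG = CFG()
# The settings are class attributes, so the instance's own __dict__ is empty;
# list the class attributes instead.
{k: v for k, v in vars(type(CFG)).items() if not k.startswith("_")}
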

## Utilities

In [3]:

def load_image(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    return img

# A standard CLIP preprocessing pipeline (open_clip provides transforms)
def get_clip_preprocess(clip_preprocess):
    # open_clip returns a transform; if missing, provide a basic one
    if clip_preprocess is not None:
        return clip_preprocess
    return transforms.Compose([
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                             std=(0.26862954, 0.26130258, 0.27577711)),
    ])


## Load CLIP (for scoring)

In [4]:

def load_clip(arch: str, pretrained: str):
    if open_clip is None:
        raise RuntimeError("open_clip not available; cannot load CLIP.")
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=pretrained, device=device)
    # No tokenizer is returned: clipscore() tokenizes captions with open_clip.tokenize.
    model.eval()
    return model, preprocess

clip_model, clip_preprocess = load_clip(CFG.CLIP_ARCH, CFG.CLIP_PRETRAINED)
clip_preprocess = get_clip_preprocess(clip_preprocess)
print("Loaded CLIP:", CFG.CLIP_ARCH, CFG.CLIP_PRETRAINED)




Loaded CLIP: ViT-B-32 openai


## CLIPScore (cosine similarity of image & text embeddings)

In [5]:

@torch.no_grad()
def clipscore(model, preprocess, image: Image.Image, captions: List[str]) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Image -> embedding
    img_t = preprocess(image).unsqueeze(0).to(device)
    img_feat = model.encode_image(img_t)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Text -> embeddings
    # open_clip.tokenize is the default CLIP BPE tokenizer (context length 77), which matches
    # the ViT-B-32 / openai checkpoint used here; longer captions are truncated.
    tok = open_clip.tokenize(captions).to(device)
    txt_feat = model.encode_text(tok)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

    # Similarities
    sims = (img_feat @ txt_feat.T).squeeze(0)  # (N,)
    return sims, img_feat, txt_feat
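
The arithmetic inside `clipscore` (L2-normalize both embeddings, then take dot products) can be checked by hand on toy vectors. The sketch below is dependency-free; the helper names `l2_normalize` and `cosine_scores` and the 2-D vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_scores(image_vec: Sequence[float], text_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Cosine similarity between one image vector and each text vector."""
    img = l2_normalize(image_vec)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]


# Toy 2-D "embeddings": same direction, orthogonal, and opposite direction
image_vec = [3.0, 4.0]
text_vecs = [[6.0, 8.0], [4.0, -3.0], [-3.0, -4.0]]

sims = cosine_scores(image_vec, text_vecs)
print(sims)
```

Because both sides are normalized, the result depends only on direction: scaling an embedding (as with `[6.0, 8.0]` versus `[3.0, 4.0]`) leaves the score unchanged, and scores always fall in `[-1, 1]`.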


## Candidate Generation via CoCa (preferred)

In [6]:

def try_load_coca(arch: str, pretrained: str):
    if open_clip is None:
        return None, None, None
    try:
        model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=pretrained, device=device)
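        # Look up the tokenizer by architecture name. The generation path does not use it
        # (captions are decoded with open_clip.decode), so it is returned for convenience only.
        tok = open_clip.get_tokenizer(arch) if hasattr(open_clip, "get_tokenizer") else None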
        model.eval()
        return model, preprocess, tok
    except Exception as e:
        print("[WARN] Failed to load CoCa:", e)
        return None, None, None

coca_model, coca_preprocess, coca_tokenizer = try_load_coca(CFG.COCA_ARCH, CFG.COCA_PRETRAINED)
if coca_model:
    print("Loaded CoCa:", CFG.COCA_ARCH, CFG.COCA_PRETRAINED)
else:
    print("CoCa NOT available. Will fallback to BLIP if requested.")




Loaded CoCa: coca_ViT-L-14 mscoco_finetuned_laion2b_s13b_b90k


In [7]:

@torch.no_grad()
def generate_with_coca(
    model, preprocess, image: Image.Image, n_candidates: int = 8, max_len: int = 32,
    top_p: float = 0.9, temperature: float = 1.0
) -> Tuple[List[str], torch.Tensor]:
    """Generate N candidates and return (captions, log_probs).

    Uses open_clip's CoCa `.generate` with nucleus (top-p) sampling and raises if it is
    unavailable, so the caller can fall back to BLIP. `.generate` returns token ids only,
    so log_probs is a (N,) tensor of zeros and the hybrid score reduces to CLIPScore.
    """
    if model is None:
        raise RuntimeError("CoCa model not loaded.")

    # open_clip CoCa exposes .generate in recent versions. If missing, raise to trigger the fallback.
    if not hasattr(model, "generate"):
        raise RuntimeError("This open_clip CoCa build has no `.generate`.")

    image_t = preprocess(image).unsqueeze(0).to(device)

    # open_clip's CoCa.generate takes `generation_type` and `seq_len`; it has no
    # `num_return_sequences` or `max_len` arguments, so sample one caption per call.
    captions = []
    for _ in range(n_candidates):
        tokens = model.generate(
            image_t,
            generation_type="top_p",
            top_p=top_p,
            temperature=temperature,
            seq_len=max_len,
        )
        # Decode token ids and strip the special start/end markers
        text = open_clip.decode(tokens[0])
        text = text.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
        captions.append(text)

    # No per-sequence likelihoods are available from .generate; use zeros as a placeholder.
    logprobs = torch.zeros(len(captions), device=device)
    return captions, logprobs


## Candidate Generation via BLIP (fallback)

In [8]:

def try_load_blip(model_id: str):
    if BlipProcessor is None or BlipForConditionalGeneration is None:
        return None, None
    try:
        processor = BlipProcessor.from_pretrained(model_id)
        model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
        model.eval()
        return processor, model
    except Exception as e:
        print("[WARN] Failed to load BLIP:", e)
        return None, None

blip_processor, blip_model = try_load_blip(CFG.BLIP_MODEL_ID)
print("BLIP ready?" , blip_model is not None)


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


BLIP ready? True


In [9]:

@torch.no_grad()
def generate_with_blip(
    processor, model, image: Image.Image, n_candidates: int = 8, max_len: int = 32,
    top_p: float = 0.9, temperature: float = 1.0
) -> Tuple[List[str], torch.Tensor]:
    if processor is None or model is None:
        raise RuntimeError("BLIP not loaded.")
    inputs = processor(images=image, return_tensors="pt").to(device)
    # N candidates via nucleus sampling
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
        max_new_tokens=max_len,
        num_return_sequences=n_candidates,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Decode to strings
    captions = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
    # Length-based proxy for the sequence log-likelihood: per-token log-probs are not
    # extracted from `outputs.scores` here, so each caption is scored as
    # -len(caption) / 10 (character count), which mildly penalizes longer strings.
    # This is NOT a model likelihood.
    seq_logprobs = [-len(c) / 10.0 for c in captions]
    logprobs = torch.tensor(seq_logprobs, device=device, dtype=torch.float32)
    return captions, logprobs
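
For reference, a true sequence log-likelihood is the sum of the log-probabilities of the chosen tokens, one per generation step, with padding steps skipped. The sketch below shows that arithmetic on a toy vocabulary in plain Python. The helper names, the vocabulary size of 3, and `pad_id=0` are assumptions for illustration; wiring this to the per-step `scores` requested from `generate` with `output_scores=True` is left out.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_logprob(step_logits: List[List[float]], token_ids: List[int], pad_id: int) -> float:
    """Sum log P(token_t | prefix) over all non-padding steps."""
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        if tok == pad_id:
            continue  # steps after the end of the caption do not count
        total += log_softmax(logits)[tok]
    return total


# Toy example: 3-token vocabulary, pad_id = 0 (hypothetical values)
step_logits = [
    [0.0, 0.0, 0.0],            # uniform: P(token 1) = 1/3
    [0.0, math.log(3.0), 0.0],  # P(token 1) = 3/5
    [5.0, 1.0, 1.0],            # padding step, ignored
]
token_ids = [1, 1, 0]

lp = sequence_logprob(step_logits, token_ids, pad_id=0)
print(lp)  # log(1/3) + log(3/5) = log(1/5)
```

Dividing the sum by the number of non-padding tokens gives a length-normalized variant, which is one common way to avoid favoring short captions.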


## Hybrid Scoring & Selection

In [10]:

def hybrid_score(loglik: torch.Tensor, clips: torch.Tensor, alpha: float) -> torch.Tensor:
    # z-score each term across the candidates of one image so neither scale dominates,
    # then combine: z(loglik) + alpha * z(clips).
    def z(x):
        mu = x.mean()
        # Population std (unbiased=False) stays finite for a single candidate; the default
        # sample std would be NaN there. Constant inputs (e.g. all-zero loglik) map to zeros.
        sd = x.std(unbiased=False).clamp_min(1e-6)
        return (x - mu) / sd
    ll_z = z(loglik)
    cs_z = z(clips)
    return ll_z + alpha * cs_z

def select_best(captions: List[str], scores: torch.Tensor) -> Tuple[str, int]:
    idx = int(scores.argmax().item())
    return captions[idx], idx
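
The effect of z-scoring before combining can be seen on a few hand-picked numbers. The sketch below is a dependency-free version of the same idea; it uses the population standard deviation (an assumption, which differs from a sample standard deviation only by a constant factor across candidates), and the helper names and values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List


def zscore(xs: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale; constant inputs map to zeros thanks to the eps floor."""
    mu = fmean(xs)
    sd = max(pstdev(xs), eps)
    return [(x - mu) / sd for x in xs]


def hybrid(logliks: List[float], clip_scores: List[float], alpha: float) -> List[float]:
    """z(loglik) + alpha * z(clip) per candidate."""
    return [l + alpha * c for l, c in zip(zscore(logliks), zscore(clip_scores))]


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


clip_scores = [0.20, 0.30, 0.25]

# All-zero likelihoods: the likelihood term vanishes, so CLIP alone decides.
flat = hybrid([0.0, 0.0, 0.0], clip_scores, alpha=2.0)

# Likelihoods that disagree with CLIP: alpha decides which term wins.
clip_heavy = hybrid([-2.0, -6.0, -4.0], clip_scores, alpha=2.0)
ll_heavy = hybrid([-2.0, -6.0, -4.0], clip_scores, alpha=0.5)

print(argmax(flat), argmax(clip_heavy), argmax(ll_heavy))
```

After z-scoring, both terms have unit spread, so `alpha` reads directly as "how many standard deviations of likelihood one standard deviation of CLIP similarity is worth": values above 1 favor CLIP, values below 1 favor the likelihood.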


## End-to-end: One Image → Best Caption

In [11]:

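def caption_image(
    image_path: str,
    generator_backend: Optional[str] = None,
    n_candidates: Optional[int] = None,
    max_len: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    alpha: Optional[float] = None
) -> Dict:
    t0 = time.time()
    # Fall back to CFG only when an argument is omitted, so explicit values
    # such as alpha=0.0 are honored (`x or default` would override them).
    generator_backend = CFG.GENERATOR_BACKEND if generator_backend is None else generator_backend
    n_candidates = CFG.N_CANDIDATES if n_candidates is None else n_candidates
    max_len = CFG.MAX_LEN if max_len is None else max_len
    top_p = CFG.TOP_P if top_p is None else top_p
    temperature = CFG.TEMPERATURE if temperature is None else temperature
    alpha = CFG.ALPHA if alpha is None else alpha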

    img = load_image(image_path)

    captions, loglik = [], None
    used_backend = None
    # Try CoCa first if requested
    if generator_backend == "coca":
        try:
            caps, ll = generate_with_coca(coca_model, coca_preprocess or clip_preprocess, img,
                                          n_candidates=n_candidates, max_len=max_len, top_p=top_p, temperature=temperature)
            captions, loglik = caps, ll
            used_backend = "coca"
        except Exception as e:
            print("[WARN] CoCa generation failed:", e)

    # Fallback to BLIP
    if (not captions) and (blip_model is not None):
        caps, ll = generate_with_blip(blip_processor, blip_model, img,
                                      n_candidates=n_candidates, max_len=max_len, top_p=top_p, temperature=temperature)
        captions, loglik = caps, ll
        used_backend = "blip"

    if not captions:
        raise RuntimeError("No generator available. Install/enable CoCa or BLIP.")

    # CLIPScore
    clips, _, _ = clipscore(clip_model, clip_preprocess, img, captions)

    # Hybrid
    scores = hybrid_score(loglik, clips, alpha)

    best_caption, best_idx = select_best(captions, scores)
    elapsed = time.time() - t0
    return {
        "image": image_path,
        "backend": used_backend,
        "captions": captions,
        "loglik": [float(x) for x in loglik.cpu()],
        "clips": [float(x) for x in clips.cpu()],
        "hybrid": [float(x) for x in scores.cpu()],
        "best_caption": best_caption,
        "best_idx": int(best_idx),
        "time_s": elapsed,
    }

# Pretty print helper
def print_result(res: Dict):
    print(f"Image: {res['image']}")
    print(f"Backend: {res['backend']}  |  time: {res['time_s']:.2f}s")
    print("Top choice →", res["best_caption"])
    print("--- Candidates (cap / loglik / clip / hybrid) ---")
    for i, (c, ll, cs, hy) in enumerate(zip(res["captions"], res["loglik"], res["clips"], res["hybrid"])):
        tag = "  <-- BEST" if i == res["best_idx"] else ""
        print(f"[{i:02d}] {c} | ll={ll:+.3f} | clip={cs:+.3f} | hybrid={hy:+.3f}{tag}")


## Batch Inference over a Folder

In [12]:

import pandas as pd

def caption_folder(
    folder: str,
    exts: Tuple[str, ...] = (".jpg", ".jpeg", ".png"),
    limit: Optional[int] = None,
    **kwargs
) -> pd.DataFrame:
    # Sort so that `limit` selects the same files on every run (glob order is not guaranteed)
    paths = sorted(str(p) for p in Path(folder).glob("**/*") if p.suffix.lower() in exts)
    if limit:
        paths = paths[:limit]
    rows = []
    for p in paths:
        try:
            res = caption_image(p, **kwargs)
            rows.append({
                "image": p,
                "backend": res["backend"],
                "best_caption": res["best_caption"],
                "best_idx": res["best_idx"],
                "time_s": res["time_s"],
                "candidates": json.dumps(res["captions"], ensure_ascii=False),
                "loglik": json.dumps(res["loglik"]),
                "clips": json.dumps(res["clips"]),
                "hybrid": json.dumps(res["hybrid"]),
            })
        except Exception as e:
            rows.append({"image": p, "backend": "error", "best_caption": str(e), "best_idx": -1, "time_s": -1, "candidates": "[]", "loglik": "[]", "clips": "[]", "hybrid": "[]"})
    df = pd.DataFrame(rows)
    return df

# Example:
# df = caption_folder("/path/to/images", limit=5, generator_backend="coca")
# df.to_csv("results.csv", index=False)
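
The list-valued columns (`candidates`, `loglik`, `clips`, `hybrid`) are stored as JSON strings, so they need a `json.loads` after the CSV is read back. The sketch below shows that round trip with the standard library only; the single row and its values are hypothetical, and an in-memory buffer stands in for `results.csv`.

```python
import csv
import io
import json

# One hypothetical result row, shaped like the rows built in caption_folder
rows = [{
    "image": "a.jpg",
    "best_idx": 1,
    "candidates": json.dumps(["a dog", "a brown dog, running"], ensure_ascii=False),
    "hybrid": json.dumps([-0.5, 0.5]),
}]

# Write to an in-memory CSV (stand-in for a file on disk)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is a string until it is parsed
buf.seek(0)
loaded = list(csv.DictReader(buf))
candidates = json.loads(loaded[0]["candidates"])
hybrid = json.loads(loaded[0]["hybrid"])
best = candidates[int(loaded[0]["best_idx"])]
print(best, hybrid)
```

Storing lists as JSON keeps commas and quotes inside captions from breaking the CSV layout, at the cost of one extra parsing step on load.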


## Provide Your Own Candidates (Rerank-only mode)

In [13]:

def rerank_only(image_path: str, candidates: List[str], alpha: Optional[float] = None) -> Dict:
    alpha = CFG.ALPHA if alpha is None else alpha  # keep an explicit alpha=0.0
    img = load_image(image_path)
    clips, _, _ = clipscore(clip_model, clip_preprocess, img, candidates)
    # Use zero log-likelihoods if you don't have them
    loglik = torch.zeros(len(candidates), device=device)
    scores = hybrid_score(loglik, clips, alpha)
    best_caption, best_idx = select_best(candidates, scores)
    return {
        "image": image_path,
        "backend": "rerank_only",
        "captions": candidates,
        "loglik": [0.0] * len(candidates),
        "clips": [float(x) for x in clips.cpu()],
        "hybrid": [float(x) for x in scores.cpu()],
        "best_caption": best_caption,
        "best_idx": int(best_idx),
        "time_s": 0.0,
    }

# Example:
# res = rerank_only("example.jpg", ["a dog in grass", "a cat on sofa", "a brown dog running"])
# print_result(res)


## Demo

In [14]:

# 🔧 Set your image path here and run this cell.
DEMO_IMAGE = "/path/to/your/image.jpg"   # <-- change me
if os.path.exists(DEMO_IMAGE):
    out = caption_image(DEMO_IMAGE, generator_backend=CFG.GENERATOR_BACKEND)
    print_result(out)
else:
    print("[Info] Set DEMO_IMAGE to a valid image path and re-run.")


[Info] Set DEMO_IMAGE to a valid image path and re-run.


## Save Results Helper

In [15]:

def save_json(obj, path: str):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    print("Saved:", path)
