# Capability — Image Captioning
Generate concise or detailed captions for any scene, optionally returning grounded regions alongside the text.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericpence/perceptron_repo/blob/main/cookbook/recipes/capabilities/captioning/captioning.ipynb)

## Install dependencies
Install the Perceptron SDK and Pillow. Pillow is used later to draw the grounded regions on top of the source image.

In [None]:
%pip install --upgrade perceptron pillow --quiet

## Configure the Perceptron client
Set your API key once, reuse the configured client for the rest of the notebook, and resolve the captioning assets.

In [None]:
from pathlib import Path

from IPython.display import Image as IPyImage
from IPython.display import display
from PIL import Image, ImageDraw, ImageFont

from cookbook.utils import cookbook_asset
from perceptron import caption, configure

# configure() reads PERCEPTRON_API_KEY from the environment.
configure(
    provider="perceptron",
    # model="isaac-0.1",  # Enable once the SDK supports the model argument.
)

SUBURBAN_STREET = cookbook_asset("capabilities", "caption", "suburban_street.webp")
SOLAR_ARRAY = cookbook_asset("capabilities", "caption", "solar_array.webp")
SOLAR_ANNOTATED = Path("solar_array_annotated.png")

for path in (SUBURBAN_STREET, SOLAR_ARRAY):
    if not path.exists():
        raise FileNotFoundError(f"Missing asset: {path}")
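
Because `configure()` reads `PERCEPTRON_API_KEY` from the environment, a missing key may only surface on the first API call. The following is a minimal sketch of an early check, not part of the SDK; the helper name `require_env` is hypothetical and uses only the standard library.

```python
import os


def require_env(name: str = "PERCEPTRON_API_KEY") -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this notebook; the SDK itself is not involved.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before calling configure().")
    return value
```

Calling `require_env()` just before `configure(...)` would fail fast with a readable message instead of a later authentication error.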

## Describe a suburban street
The simplest call requests a concise caption and returns plain text.

In [None]:
display(IPyImage(filename=str(SUBURBAN_STREET)))
street_caption = caption(str(SUBURBAN_STREET), style="concise", expects="text")
print(street_caption.text)

## Request grounded details
Switch styles and ask for bounding boxes so downstream tooling can highlight each part of the caption.

In [None]:
solar_caption = caption(
    str(SOLAR_ARRAY),
    style="detailed",
    expects="box",
)
print(solar_caption.text)
boxes = solar_caption.points or []
print(f"Returned {len(boxes)} grounded snippets")

In [None]:
img = Image.open(SOLAR_ARRAY).convert("RGB")
draw = ImageDraw.Draw(img)
try:
    font = ImageFont.truetype("arial.ttf", size=20)
except OSError:
    font = ImageFont.load_default()

if boxes:

    def to_px(point):
        # Coordinates are treated as a normalized 0-1000 grid; scale them to pixels.
        return point.x / 1000 * img.width, point.y / 1000 * img.height

    for idx, box in enumerate(boxes):
        top_left = to_px(box.top_left)
        bottom_right = to_px(box.bottom_right)
        draw.rectangle([top_left, bottom_right], outline="orange", width=3)
        label = box.mention or f"snippet {idx + 1}"
        draw.text((top_left[0], max(top_left[1] - 18, 0)), label, fill="orange", font=font)
else:
    print("No grounded regions returned; set expects='box' to request them.")

img.save(SOLAR_ANNOTATED)
display(IPyImage(filename=str(SOLAR_ANNOTATED)))
print(f"Saved annotated caption overlay to {SOLAR_ANNOTATED}")
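
The overlay cell divides each coordinate by 1000 before scaling, which implies the boxes arrive on a normalized 0-1000 grid. As a standalone sketch of that conversion, the hypothetical function below (the name `to_pixel_box` and the `grid` default are assumptions, not SDK features) also clamps values to the image bounds and orders the corners, which the cell above does not do.

```python
def to_pixel_box(x1, y1, x2, y2, width, height, grid=1000):
    """Convert a box on a normalized grid to integer pixel coordinates.

    Assumes coordinates range from 0 to `grid` on both axes, as the
    overlay cell does when it divides by 1000.
    """

    def scale(value, size):
        px = round(value / grid * size)
        # Clamp so slightly out-of-range values stay inside the image.
        return min(max(px, 0), size)

    # Sort each axis so the result is always (left, top, right, bottom).
    left, right = sorted((scale(x1, width), scale(x2, width)))
    top, bottom = sorted((scale(y1, height), scale(y2, height)))
    return left, top, right, bottom
```

For example, `to_pixel_box(500, 250, 250, 750, 800, 400)` maps to pixel corners on an 800x400 image, with the swapped x-coordinates put back in order.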

## Conclusion & next steps
- Adjust the `style` argument (`concise`, `detailed`) to match your UX.
- Keep `expects="text"` for pure narrative outputs or request grounding via `expects="box"` / `"point"`.
- Wrap this logic in a helper that iterates through folders to batch caption datasets.
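
The batch idea in the last bullet could be sketched as follows. This is one possible shape, not a helper shipped with the SDK: `caption_folder`, `IMAGE_SUFFIXES`, and the `caption_fn` parameter are hypothetical names. The captioning call is passed in as a function so the loop stays independent of any client configuration.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp")


def caption_folder(
    folder: str,
    caption_fn: Callable[[str], str],
    suffixes: Iterable[str] = IMAGE_SUFFIXES,
) -> Dict[str, str]:
    """Apply caption_fn to every image file directly inside folder.

    Returns a mapping from file name to caption text. Subfolders and
    non-image files are skipped.
    """
    allowed = {s.lower() for s in suffixes}
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in allowed:
            results[path.name] = caption_fn(str(path))
    return results
```

With the client configured as in this notebook, a call might look like `caption_folder("images", lambda p: caption(p, style="concise", expects="text").text)`, where `"images"` is a placeholder directory.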