# Video Captioning With VLMs

Video captioning and Q&A are far from solved problems. Ideally, we'd have a video model that natively processes the audio and every frame and outputs a response. Unfortunately, that isn't the reality yet: sending every frame as an image would overflow the context window of most current models, even for a 30-second video.
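
A rough back-of-the-envelope estimate makes the scale concrete. The numbers below are assumptions for illustration, not measurements: a 30 fps source, about 256 tokens per encoded image (the figure reported for Gemma 3's vision encoder), and a 128K-token context window.

```python
# Back-of-the-envelope token estimate for feeding every frame to a VLM.
# All three constants are assumptions for illustration.
FPS = 30                   # assumed source frame rate
TOKENS_PER_FRAME = 256     # assumed token cost of one encoded image
CONTEXT_WINDOW = 128_000   # assumed context window, in tokens


def frame_tokens(duration_s: float, fps: float = FPS,
                 tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total image tokens if every frame of the clip is sent to the model."""
    return int(duration_s * fps) * tokens_per_frame


every_frame = frame_tokens(30)   # every frame of a 30-second clip
sampled = 5 * TOKENS_PER_FRAME   # the 5 frames used in this sample
print(every_frame, every_frame > CONTEXT_WINDOW, sampled)
```

Under these assumptions the image tokens alone overflow the window before any text or audio is added, while five sampled frames cost a small fraction of it.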

An alternative is to sample frames at fixed intervals, which is what we'll do in this sample. This is not only feasible but also an order of magnitude cheaper.

The most cost-effective way to caption a video is to sample frames at a set interval and, if available, pass the transcript (or a summary of it) alongside them. In this case we don't have the transcript, so we'll make do with a few frames from a MrBeast YouTube video in which he takes his girlfriend on a date.

First let's install some packages. Since models hosted on Inference are compatible with the OpenAI SDK, we'll be using that to interact with the Inference API.

In [1]:
!pip install yt-dlp opencv-python openai

We'll use yt-dlp to stream a MrBeast video titled '$1 vs $500,000 Romantic Date' and ffmpeg to extract 5 frames from it.

Fun fact: OpenAI is one of the most active maintainers of yt-dlp because they are using it to scrape YT at scale!

In [8]:
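%%bash
mkdir -p keyframes
# Stream the video into ffmpeg: fps=.05 keeps one frame every 20 s, and -frames:v 5 stops after five.
# ffmpeg then closes the pipe mid-download, so yt-dlp's "Broken pipe" error below is expected.
yt-dlp -f "bestvideo[ext=mp4]" -o - "https://www.youtube.com/watch?v=hTSaweR8qMI" \
  | ffmpeg -i pipe: \
           -vf fps=.05 \
           -frames:v 5 \
           keyframes/keyframe_%02d.jpg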

[download]   9.1% of    1.36GiB at   16.87MiB/s ETA 01:14[out#0/image2 @ 0x13a814a30] video:2121KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
frame=    5 fps=0.5 q=1.6 Lsize=N/A time=00:01:40.00 bitrate=N/A speed=10.1x    
[download]   9.1% of    1.36GiB at   15.44MiB/s ETA 01:21

ERROR: unable to write data: [Errno 32] Broken pipe
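
To see what the two ffmpeg flags imply, here is a small sketch of the arithmetic (the helper name `sampling_summary` is hypothetical). A filter value of `fps=.05` means one frame every 1/0.05 = 20 seconds, so five frames consume roughly the first 100 seconds of the video, consistent with the `time=00:01:40.00` in the log above.

```python
def sampling_summary(fps_filter: float, max_frames: int) -> tuple[float, float]:
    """Return (seconds between sampled frames, approximate seconds of video read)."""
    interval_s = 1.0 / fps_filter
    return interval_s, interval_s * max_frames


interval_s, covered_s = sampling_summary(fps_filter=0.05, max_frames=5)
print(f"one frame every {interval_s:.0f} s, about {covered_s:.0f} s of video read")
```

For a longer video, lowering the filter value (say, one frame per minute) spreads the same five frames over more of the runtime.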



Import some packages. Not much to see here.

In [9]:
import os
import json
import base64
from glob import glob  # standard library, no pip install needed
from openai import OpenAI

## Setting Up Video Captioning with Inference API

### Getting Your API Key

First, grab your Inference API key from [https://inference.net/dashboard/api-keys](https://inference.net/dashboard/api-keys).

### Configuration

Set your base URL to `https://api.inference.net/v1` so requests route to Inference instead of OpenAI. We're using the `google/gemma-3-27b-instruct/bf-16` model, which is compact but handles images well.

### System Message

Configure your system message to tell the model it's a captioning service. Basically you're saying "your job is to analyze video frames and write captions describing what's happening."

### Image Quality Matters

Use the highest-quality frames you can. The model does well with text recognition and fine details, but only if it can see them clearly. Blurry or low-resolution images will hurt your results.

In [13]:
API_KEY = os.getenv("INFERENCE_API_KEY")
MODEL   = "google/gemma-3-27b-instruct/bf-16"
SYSTEM_MSG = """
You are a JSON-only image analysis API specializing in YouTube keyframes.
Generate one concise caption that describes what's happening across all these frames.
Respond only with a JSON object:

{"caption": "…"}
""".strip()

client = OpenAI(base_url="https://api.inference.net/v1", api_key=API_KEY)

To pass images to our VLM API, we need to first encode them into base64:

In [14]:
data_uris = []
for filepath in sorted(glob("keyframes/*.jpg")):   # sorted() keeps the frames in chronological order
    with open(filepath, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    data_uris.append(f"data:image/jpeg;base64,{b64}")
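
If the frames might not all be JPEGs, the MIME type in the data URI should match the file. A minimal sketch, assuming the file extension is trustworthy (the helper name `to_data_uri` is hypothetical and not part of any SDK):

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Encode an image file as a base64 data URI, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

With this helper, the loop above reduces to `data_uris = [to_data_uri(p) for p in sorted(glob("keyframes/*.jpg"))]`.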

Now let's generate the caption. We'll define a JSON schema so the model has to return a valid caption object.

To learn more about JSON schemas, check out the [structured output docs](https://docs.inference.net/features/structured-outputs).

In [19]:
resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM_MSG},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here are 5 keyframes from a YouTube video. Generate a single caption."},
                *[
                    {"type": "image_url", "image_url": {"url": uri}}
                    for uri in data_uris
                ]
            ],
        },
    ],
    response_format = {
        "type": "json_schema",
        "json_schema": {
            "name": "video_caption",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "caption": {
                        "type": "string",
                        "description": "A concise caption describing what's happening across all the video frames"
                    }
                },
                "required": ["caption"],
                "additionalProperties": False
            }
        }
    }
)

# message.content is a JSON string, so parse it before pretty-printing
print(json.dumps(json.loads(resp.choices[0].message.content), indent=2))

{
  "caption": "A couple experiences a date night at an amusement park with escalating costs, ultimately leading to a close moment between them."
}
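
In a pipeline you'd usually want the caption as a plain string rather than printed JSON. A minimal sketch, assuming the reply follows the schema above (the helper name `parse_caption` is hypothetical):

```python
import json


def parse_caption(raw: str) -> str:
    """Parse the model's JSON reply and return the caption string."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    caption = data.get("caption") if isinstance(data, dict) else None
    if not isinstance(caption, str) or not caption.strip():
        raise ValueError(f"unexpected response shape: {raw!r}")
    return caption.strip()
```

Calling `parse_caption(resp.choices[0].message.content)` would then return just the caption text, and fail loudly if the model ever replies with something else.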


Great! We got a good caption. Not as good as the one we would get with the full transcript, but still impressive, considering we passed only 5 images.
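
If a transcript were available, one way to include it is as extra text in the same user message. This is a sketch under that assumption, reusing the message format from the request above; the helper name `build_user_content` and the prompt wording for the transcript are hypothetical.

```python
from typing import Optional


def build_user_content(data_uris: list[str],
                       transcript: Optional[str] = None) -> list[dict]:
    """Build the `content` list for one user message: text first, then the frames."""
    prompt = (f"Here are {len(data_uris)} keyframes from a YouTube video. "
              "Generate a single caption.")
    if transcript:
        # Hypothetical extension: append the transcript (or a summary) as plain text.
        prompt += f"\n\nTranscript (or summary):\n{transcript}"
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": uri}} for uri in data_uris]
    return content
```

The returned list can be passed as the `content` of the user message in place of the hand-written one.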

In production, we may want to fine-tune a model to do this task even more cheaply and effectively, depending on the scale!