Video Embeddings with Videoprism

In [1]:
# @title Prepare environment

import os
import sys

# Fetch VideoPrism repository if Python does not know about it and install
# dependencies needed for this notebook.
if not os.path.exists("videoprism_repo"):
  !git clone --quiet --branch=main --depth=1 \
     https://github.com/everettVT/videoprism.git videoprism_repo
  os.chdir('./videoprism_repo')
  !pip install .
  os.chdir('..')

# Append VideoPrism code to Python import path.
if "videoprism_repo" not in sys.path:
  sys.path.append("videoprism_repo")

Processing /content/videoprism_repo
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting einshape (from videoprism==1.0.0)
  Downloading einshape-1.0-py3-none-any.whl.metadata (706 bytes)
Downloading einshape-1.0-py3-none-any.whl (21 kB)
Building wheels for collected packages: videoprism
  Building wheel for videoprism (setup.py) ... [?25l[?25hdone
  Created wheel for videoprism: filename=videoprism-1.0.0-py3-none-any.whl size=40354 sha256=88ee90e916f4db8fa02c8b63e2169a94ff8bf03bf19ce697dced0cf6a9a621e9
  Stored in directory: /tmp/pip-ephem-wheel-cache-s0te1sma/wheels/e3/73/3c/3dc3551ff92b46a1e55f9a893f2d5b8fdc55d670bd73d3b605
Successfully built videoprism
Installing collected packages: einshape, videoprism
Successfully installed einshape-1.0 videoprism-1.0.0


In [2]:
!pip install "daft>=0.6.1" av yt-dlp "jax[cuda12]"

Collecting daft>=0.6.1
  Downloading daft-0.6.1-cp39-abi3-manylinux_2_24_x86_64.whl.metadata (12 kB)
Collecting av
  Downloading av-15.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Collecting yt-dlp
  Downloading yt_dlp-2025.9.5-py3-none-any.whl.metadata (177 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.1/177.1 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvcc-cu12>=12.6.85 (from jax-cuda12-plugin[with_cuda]<=0.5.3,>=0.5.3; extra == "cuda12"->jax[cuda12])
  Downloading nvidia_cuda_nvcc_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Downloading daft-0.6.1-cp39-abi3-manylinux_2_24_x86_64.whl (47.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.4/47.4 MB[0m [31m17.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading av-15.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (39.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m39.9/39.9 MB[0m [31m43.9 MB/s

In [3]:
!hf auth login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: read).
The token `Anyscale Ray Serve LLM` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to 

In [19]:
import os
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

In [20]:
import daft
from daft import col, DataType as dt
import numpy as np
import jax
import jax.numpy as jnp
from jax.extend import backend
import tensorflow as tf
from videoprism import models as vp

In [5]:
print(jax.devices())    # should list a CUDA device

[CudaDevice(id=0)]


- B: batch size (number of videos in a batch).
- T: number of frames per video clip (typically 16).
- N: tokens per frame (for 288×288 with 18×18 patches → 16×16 = 256).
- D: embedding dimension (Base: 768; Large: 1024).

Video-text model returns:
- video_embeddings: [B, D] (global video embeddings).
- text_embeddings: [B, D] (global text embeddings).
- Optional: frame_embeddings [B, T, D]; tokens [B, T×N, D]

Retrieval:
- cosine similarity reduces to dot product since outputs are L2-normalized.
- For a single video vs K texts: [1, D] @ [K, D]^T → [1, K].

In [25]:
PATHS = ["https://www.youtube.com/watch?v=wKOC_w4oKO8"]
B, T, H, W, C = 24, 16, 288, 288, 3
ROW_LIMIT = 2048
MODEL_NAME = 'videoprism_lvt_public_v1_base'
# for 'videoprism_lvt_public_v1_large', set T = 8

In [26]:
df_frames = daft.read_video_frames(
    PATHS,
    image_height=H,
    image_width=W,
).limit(ROW_LIMIT).collect()



🗡️ 🐟 Limit 2048: 00:00 

🗡️ 🐟 PythonFunction Scan: 00:00 

In [27]:
df_frames.show(3)

path Utf8,frame_index Int64,frame_time Float64,frame_time_base Utf8,frame_pts Int64,frame_dts Int64,frame_duration Int64,is_key_frame Boolean,data Image[RGB; 288 x 288]
https://www.youtube.com/watch?v=wKOC_w4oKO8,1935,64.56456456456456,1/11988,774000,774000,400,False,
https://www.youtube.com/watch?v=wKOC_w4oKO8,1936,64.59793126459793,1/11988,774400,774400,400,False,
https://www.youtube.com/watch?v=wKOC_w4oKO8,1937,64.6312979646313,1/11988,774800,774800,400,False,


### Sampling Strategies

In [28]:
df_grouped = (
    df_frames
    .with_column("group_index", df_frames["frame_index"] // T)
    .groupby("path", "group_index")
    .agg_list("data", "frame_index")
)
df_grouped.show(3)

path Utf8,group_index Int64,data List[Image[RGB; 288 x 288]],frame_index List[Int64]
https://www.youtube.com/watch?v=wKOC_w4oKO8,70,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135]"
https://www.youtube.com/watch?v=wKOC_w4oKO8,115,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[1849, 1850, 1851, 1852, 1853, 1854, 1855, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848]"
https://www.youtube.com/watch?v=wKOC_w4oKO8,117,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887]"


### Stack, Normalize, and Cast

In [29]:
@daft.func(return_dtype=dt.tensor(dt.float32(), shape=(16,288, 288, 3)))
def stack_clip(frames: list[np.ndarray], indices: list[int], clip_size: int):
    """Stacks a list of frames into a single numpy array

    Args:
        frames: List[T] of (H,W,3) float32
        indices: List[T] of int

    Returns:
        (1,T,H,W,3) float32 in [0,1]

    In a parallel/distributed groupby, a pre-group sort isn’t guaranteed
    to survive aggregation order; partitions can concatenate in
    non-deterministic order. Additionally, the image dtype is natively a
    list[uint8], so we need to cast to float32 before normalizing from
    [0,255] to [0,1].

    Steps:
    1. Aggregate both image_tensor and frame_index.
    2. Sort by frame_index inside the group-level UDF, then stack.
    3. Normalize and cast in one step.
    4. Add a batch dimension and return.

    """

    # Don't assume frames are sorted already:
    order = np.argsort(np.asarray(indices))

    # Convert Daft Image to np.ndarray
    def to_np(x):
        if hasattr(x, "to_numpy"):
            return x.to_numpy()          # Daft Image -> np.ndarray (H,W,C) uint8
        return np.asarray(x)

    # Sort frames by frame_index
    frames_sorted = [to_np(frames[i]) for i in order]

    # Ensure Tails are padded with duplicates
    if len(order) < clip_size:
        frames_sorted.extend([frames_sorted[-1]] * (clip_size - len(order)))

    # Stack, Normalize, and Cast in one step
    x = np.stack(frames_sorted[:clip_size], axis=0).astype(np.float32) / 255.0 # (T,H,W,3) float32 in [0,1]

    return x # [1,T,H,W,C] where T=clip_size

df_clips = df_grouped.with_column("clip", stack_clip(df_grouped["data"], df_grouped["frame_index"], clip_size=T))
df_clips.show(3)


path Utf8,group_index Int64,data List[Image[RGB; 288 x 288]],frame_index List[Int64],"clip FixedShapeTensor[Float32; [16, 288, 288, 3]]"
https://www.youtube.com/watch?v=wKOC_w4oKO8,97,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1560, 1561, 1562, 1563, 1564, 1565, 1566, 1567]",<FixedShapeTensor>
https://www.youtube.com/watch?v=wKOC_w4oKO8,51,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831]",<FixedShapeTensor>
https://www.youtube.com/watch?v=wKOC_w4oKO8,18,"[<FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>, <FixedShapeImage>]","[288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303]",<FixedShapeTensor>


In [32]:
@daft.udf(
    return_dtype = dt.embedding(dt.float32(), 768),
    batch_size=B, # clips per batch (tune for throughput)
    num_gpus=1,
)
class VideoPrismVideoUDF:
    def __init__(self, model_name: str = "videoprism_lvt_public_v1_base"):
        from videoprism import models as vp
        self.model = vp.get_model(model_name)
        self.params = vp.load_pretrained_weights(model_name)

        @jax.jit
        def vf_b(x):  # [B,T,288,288,3] -> [B,D]
            v, _, _ = self.model.apply(self.params, x, None, None, train=False)
            return v

        self.vf_b = vf_b

        # Warmup both
        _ = self.vf_b(jnp.zeros((B, T, H, W, C), jnp.float32)).block_until_ready()

    def __call__(self,
        clips: list[np.ndarray], # List[T,H,W,C] of len B
    ):
        # Batch Inference
        xb = jnp.stack(clips, axis=0) # [B,T,H,W,C]
        embeddings = self.vf_b(xb) # [B,768]
        np_embeddings = np.asarray(embeddings)  # Back to NumPy
        return [np_embeddings[i].tolist() for i in range(B)]



Previous runs with 24 batches of 16 frame clips processed in 128 sec.

In [33]:
print(f"Video Embeddings will process {B} clips of {T} frame each at {W}x{H}x{3}")

df_clips_few = df_clips.sort("group_index").collect()
df_video_embs = df_clips_few.with_column("video_embeddings", VideoPrismVideoUDF(df_clips_few["clip"])).collect()


Video Embeddings will process 24 clips of 16 frame each at 288x288x3




🗡️ 🐟 InMemorySource: 00:00 

🗡️ 🐟 Project: 00:00 

🗡️ 🐟 GroupedAggregate: 00:00 

🗡️ 🐟 UDF stack_clip: 00:00 

🗡️ 🐟 Sort: 00:00 

🗡️ 🐟 InMemorySource: 00:00 

🗡️ 🐟 UDF VideoPrismVideoUDF: 00:00 

ERROR:daft_local_execution:Error when running pipeline node UDF VideoPrismVideoUDF


UDFException: User-defined function `<__main__.VideoPrismVideoUDF object at 0x7a675548ad50>` failed when executing on inputs:
  - clip (FixedShapeTensor[Float32; [16, 288, 288, 3]], length=8)

In [16]:
df_video_embs.select("group_index","video_embeddings", "clip").count_rows()

🗡️ 🐟 InMemorySource: 00:00 

🗡️ 🐟 Project: 00:00 

🗡️ 🐟 Count: 00:00 

🗡️ 🐟 Project: 00:00 

28

Extract Audio from Video, Transcribe and Embed

In [16]:


@daft.udf(
    return_dtype = dt.embedding(dt.float32(), 768),
    batch_size=B, # clips per batch (tune for throughput)
    num_gpus=1,
)
class VideoPrismTextUDF:
    def __init__(self, model_name: str = "videoprism_lvt_public_v1_base"):
        from videoprism import models as vp
        self.model = vp.get_model(model_name)
        self.params = vp.load_pretrained_weights(model_name)
        self.text_tokenizer = vp.load_text_tokenizer('c4_en')

        @jax.jit
        def vf_b(x):  # [B,T,288,288,3] -> [B,D]
            _, t, _ = self.model.apply(self.params, x, None, None, train=False)
            return t

        self.vf_b = vf_b

        # Warmup both
        _ = self.vf_b(jnp.zeros((B, T, H, W, C), jnp.float32)).block_until_ready()

    def __call__(self,
        clips: list[np.ndarray], # List[T,H,W,C] of len B
    ):
        # Batch Inference
        xb = jnp.stack(clips, axis=0) # [B,T,H,W,C]
        embeddings = self.vf_b(xb) # [B,768]
        np_embeddings = np.asarray(embeddings)  # Back to NumPy
        return [np_embeddings[i].tolist() for i in range(B)]
