# Shot Boundary Detection with SigLip2 and Window Functions

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/everettVT/daft-video-embeddings/blob/main/workload/sbd_image_embeddings_siglip.ipynb)

This notebook demonstrates shot boundary detection in videos using SigLip2 image embeddings and window functions in Daft. Each frame is embedded with SigLip2, a window function compares every frame's embedding with the previous frame's, and a large cosine distance between consecutive frames marks a potential cut boundary.
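To make the idea concrete before introducing Daft, here is a minimal NumPy sketch of the same signal on toy data. The three-dimensional vectors stand in for real SigLip2 embeddings, and the helper name `consecutive_cosine_distance` is hypothetical; only the 0.1 threshold mirrors the value used later in this notebook.

```python
import numpy as np

def consecutive_cosine_distance(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between each row and the row before it.

    The first row has no predecessor, so its distance is NaN.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[1:] * normed[:-1], axis=1)
    return np.concatenate([[np.nan], 1.0 - sims])

# Toy stand-ins for frame embeddings: three similar frames, then a new scene
frames = np.array([
    [1.00, 0.00, 0.0],
    [0.99, 0.01, 0.0],
    [0.98, 0.02, 0.0],
    [0.00, 1.00, 0.0],
    [0.01, 0.99, 0.0],
])

dist = consecutive_cosine_distance(frames)
# Treat the undefined first distance as "no cut"
is_cut = np.nan_to_num(dist, nan=0.0) >= 0.1
print(is_cut)
```

Only the fourth frame, where the toy vectors switch direction, is flagged as a cut.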

In [73]:
!pip install -q "daft[huggingface]" transformers numpy

In [123]:
# General Parameters
MODEL_ID = "google/siglip2-base-patch16-512"
B, T, H, W, C = 2, 16, 288, 288, 3 # Batch Size, Clip Size (# frames), Height, Width, RGB
ROW_LIMIT = 50000

PATHS = [
    "https://www.youtube.com/watch?v=eYXDSuNpKTk", # Life after Apache Spark
]

# Chi-Squared SBD Params
CHSQ_HISTOGRAM_BINS = 32
CHSQ_THRESHOLD = 0.3
CHSQ_MIN_SHOT_DURATION = 0.5 # seconds


image_embedding_column_name = f"img_emb_{MODEL_ID}"

In [124]:
import daft
from daft.functions import embed_image
from daft import col, lit, Window, DataType as dt

import numpy as np

In [126]:
df_frames = daft.read_video_frames(
    PATHS,
    image_height=H,
    image_width=W,
).limit(ROW_LIMIT).collect() # Materialize up to ROW_LIMIT frames so later cells don't re-read from YouTube
df_frames.show(3)

path Utf8,frame_index Int64,frame_time Float64,frame_time_base Utf8,frame_pts Int64,frame_dts Int64,frame_duration Int64,is_key_frame Boolean,data Image[RGB; 288 x 288]
https://www.youtube.com/watch?v=eYXDSuNpKTk,2150,71.7384050717384,1/11988,860000,860000,400,False,
https://www.youtube.com/watch?v=eYXDSuNpKTk,2151,71.77177177177177,1/11988,860400,860400,400,False,
https://www.youtube.com/watch?v=eYXDSuNpKTk,2152,71.80513847180514,1/11988,860800,860800,400,False,


## Generate SigLip2 Embeddings

In [127]:
df_emb = df_frames.with_column(
    image_embedding_column_name,
    embed_image(
        df_frames["data"],
        model_name=MODEL_ID,
        provider="transformers",
    )
)

In [128]:
df_emb = df_emb.collect()




## Shot Boundary Detection with SigLip2

In [129]:
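# Order each video's frames by timestamp
w = Window().partition_by("path").order_by("frame_time")
# Trailing window: the 0.5 s of frame_time up to and including the current frame
w_time = w.range_between(-0.5, Window.current_row)

emb = col(image_embedding_column_name)

df_shots = (
    df_emb
    # Cosine distance between each frame's embedding and the previous frame's
    .with_column("cos_dist", emb.embedding.cosine_distance(emb.lag(1).over(w)))
    # Mean of cos_dist over the trailing 0.5 s window
    .with_column("cos_dist_smooth", col("cos_dist").mean().over(w_time))
)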


In [130]:
df_shots.collect()


path Utf8,frame_index Int64,frame_time Float64,frame_time_base Utf8,frame_pts Int64,frame_dts Int64,frame_duration Int64,is_key_frame Boolean,data Image[RGB; 288 x 288],img_emb_google/siglip2-base-patch16-512 Embedding[Float32; 512],cos_dist Float64,cos_dist_smooth Float64
https://www.youtube.com/watch?v=eYXDSuNpKTk,0,0.0,1/11988,0,0,400,True,,<Embedding>,,
https://www.youtube.com/watch?v=eYXDSuNpKTk,1,0.0333667000333667,1/11988,400,400,400,False,,<Embedding>,0.0006189306497935,0.0006189306497935
https://www.youtube.com/watch?v=eYXDSuNpKTk,2,0.0667334000667334,1/11988,800,800,400,False,,<Embedding>,0.0017487474881988,0.0011838390689962
https://www.youtube.com/watch?v=eYXDSuNpKTk,3,0.1001001001001001,1/11988,1200,1200,400,False,,<Embedding>,0.0012450707720379,0.0012042496366767
https://www.youtube.com/watch?v=eYXDSuNpKTk,4,0.1334668001334668,1/11988,1600,1600,400,False,,<Embedding>,0.0002564746572409,0.0009673058918178
https://www.youtube.com/watch?v=eYXDSuNpKTk,5,0.1668335001668335,1/11988,2000,2000,400,False,,<Embedding>,0.001756438946703,0.0011251325027948
https://www.youtube.com/watch?v=eYXDSuNpKTk,6,0.2002002002002002,1/11988,2400,2400,400,False,,<Embedding>,0.0003478305827535,0.000995582182788
https://www.youtube.com/watch?v=eYXDSuNpKTk,7,0.2335669002335669,1/11988,2800,2800,400,False,,<Embedding>,0.0006832215845906,0.0009509592401883
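The `cos_dist_smooth` values above can be checked by hand. The sketch below is a plain NumPy rendering of a trailing 0.5-second mean that skips missing values, run on the first four rows of the output. The helper `trailing_time_mean` is hypothetical and handles a single video only; it illustrates what the window appears to compute and is not Daft's implementation.

```python
import numpy as np

def trailing_time_mean(times, values, window_s=0.5):
    """Mean of `values` over [t - window_s, t] for each time t, ignoring NaNs."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for i, t in enumerate(times):
        in_window = (times >= t - window_s) & (times <= t)
        window_vals = values[in_window]
        window_vals = window_vals[~np.isnan(window_vals)]
        if window_vals.size:
            out[i] = window_vals.mean()
    return out

# First four rows of frame_time and cos_dist from the output above
frame_time = [0.0, 0.0333667000333667, 0.0667334000667334, 0.1001001001001001]
cos_dist = [np.nan, 0.0006189306497935, 0.0017487474881988, 0.0012450707720379]

smooth = trailing_time_mean(frame_time, cos_dist)
print(smooth)
```

The first frame has no previous frame, so its distance is missing and is left out of every later average. That is why the second row's smoothed value equals its raw `cos_dist`.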


In [141]:
THRESHOLD = 0.1
df_sbd = df_shots.with_column("is_cut_boundary", (col("cos_dist") >= THRESHOLD))
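A per-frame threshold can flag several adjacent frames around a single transition. One possible post-processing step, sketched here on synthetic data, keeps a flagged frame only if it falls at least a minimum shot duration after the previously kept cut. The helper `cut_times` is hypothetical, and reusing the 0.5-second value from `CHSQ_MIN_SHOT_DURATION` is an assumption, since this notebook does not apply that parameter to the embedding-based detector.

```python
import numpy as np

def cut_times(frame_time, is_cut, min_shot_s=0.5):
    """Times of flagged frames, dropping flags within min_shot_s of the last kept cut."""
    kept = []
    for t, flag in zip(frame_time, is_cut):
        if flag and (not kept or t - kept[-1] >= min_shot_s):
            kept.append(float(t))
    return kept

# Synthetic timeline: one frame every 0.25 s, with two adjacent flags and one isolated flag
times = np.arange(10) * 0.25
flags = [False, False, True, True, False, False, False, True, False, False]

cuts = cut_times(times, flags)
print(cuts)
```

In this toy run the flag at 0.75 s is dropped because it follows the cut at 0.5 s by only 0.25 s, while the flag at 1.75 s is kept.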

In [143]:
df_sbd.where(df_sbd["is_cut_boundary"]).select("data").show()

data Image[RGB; 288 x 288]
