# Act 2: Iterating with Data


In this notebook, you'll add another layer of context to your video pipeline. In Act 1, we added visual search based on evenly spaced frames from the video. 

Here we'll add more context in two ways:

> First, we'll apply **content-aware scene detection.**

> Second, we'll **add audio** signals to the mix by extracting the audio channel, transcribing it, then adding an embedding index for the text. 

This typically means chaining together separate tools: scene detection tools, FFmpeg for audio extraction, Whisper for transcription, sentence splitters, text embedding models, and vector databases.

The techniques you'll learn apply to any content that needs breaking down into processable chunks:

- Breaking videos into logical segments (scenes, chapters, clips)
- Processing long-form audio (podcasts, interviews, lectures)
- Chunking documents for better search granularity

## In this notebook
1. **Detect Scene Boundaries** - Automatically identify scene changes using content-aware detection
2. **Iterate Over Scenes** - Create a view with one row per scene for independent processing
3. **Process Each Scene** - Extract audio and generate transcripts with computed columns
4. **Search Across Scenes** - Split transcripts into sentences and build semantic search

**Here's what you'll build:**

```
┌───────────────────────────────────────────────────────────────────────────────────────┐
│  INPUT            DETECT            EXTRACT            TRANSCRIBE         SEARCH      │
│                                                                                       │
│  ┌────────┐      ┌────────┐      ┌─────────┐        ┌────────────┐    ┌───────────┐   │
│  │ Video  │─────▶│ Scenes │─────▶│  Audio  │───────▶│ Transcript │───▶│ Semantic  │   │
│  │  File  │      │        │      │         │        │    Text    │    │  Search   │   │
│  └────────┘      └────────┘      └─────────┘        └────────────┘    └───────────┘   │
│      │               │                │                    │                  │       │
│      │               │                │                    │                  │       │
│  Load video    PySceneDetect     Extract audio        Whisper           Query by      │
│  from URL      finds scene       from video          transcribes        what's said   │
│                boundaries         segments            audio                           │
│                                                                                       │
└───────────────────────────────────────────────────────────────────────────────────────┘
```

In [1]:
import pixeltable as pxt
import pixeltable.functions as pxtf

In [2]:
pxt.list_tables('primetime-workshop')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/alison-pxt/.pixeltable/pgdata


['primetime-workshop/primetime_vids', 'primetime-workshop/video-frame-view']

In [3]:
v = pxt.get_table('primetime-workshop/primetime_vids')

In [4]:
v

0
table 'primetime-workshop/primetime_vids'

Column Name,Type,Computed With
video,Video,
title,String,
promo_img,Image,
promo_text,String,
duration,Float,get_duration(video)


## 01 - Detect Scene Boundaries

In Act 1, we extracted frames at regular intervals to get a specific number of frames. However, those frames might not reflect meaningful scene breaks - they're just evenly spaced snapshots of the video.

Here in Act 2, we'll use **content-aware scene detection** to find actual scene boundaries - places where the content changes significantly, not just arbitrary time intervals. 

Scene detection in Pixeltable is, once again, provided as a set of user-defined functions (UDFs): Pixeltable exposes the [PySceneDetect](https://www.scenedetect.com/features/) package as a collection of UDFs. PySceneDetect is a popular tool for finding the points in a video where meaningful scene changes occur. Alternatively, you could bring your own scene detection Python library and wrap its functions as UDFs that Pixeltable can then execute.

We'll use content-based scene detection with tuned parameters to find these meaningful scene breaks. Learn more about [scene detection functions](https://docs.pixeltable.com/sdk/latest/video) and the [`scene_detect_histogram()`](https://docs.pixeltable.com/sdk/latest/video#udf-scene_detect_histogram) function:

In [5]:
# Add scene detection with tuned parameters
v.add_computed_column(
    scenes=v.video.scene_detect_histogram(
        fps=10,
        threshold=0.6,
        min_scene_len=100
    ),
    if_exists='replace'
)



Added 1 column value with 0 errors in 43.42 s (0.02 rows/s)


1 row updated.

**What happens when you use `scene_detect_histogram()`:**

This UDF samples the video at 10 frames per second and compares consecutive sampled frames to detect visual changes. The algorithm:

- Compares the brightness histograms of consecutive frames to measure visual differences (PySceneDetect's histogram detector works on the Y channel in YUV color space)
- Uses a threshold of 0.6 to decide whether a change is significant enough to mark a scene boundary
- Requires each detected scene to be at least 100 frames long, which filters out brief flashes

This typically takes about a minute for a 6-minute video. The output will be a list of dictionaries, where each dictionary contains scene `start_time`, `start_pts`, and `duration` keys.

**Understanding the parameters:**
- `fps=10`: Analyzes 10 frames per second (balancing speed vs. accuracy)
- `threshold=0.6`: Sensitivity for detecting scene changes (lower = more sensitive)
- `min_scene_len=100`: Minimum scene length in frames (prevents very short false positives)
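
To picture how `threshold` and `min_scene_len` interact, here is a toy detector in plain Python. It is a simplified sketch, not PySceneDetect's or Pixeltable's implementation: the names `brightness_histogram` and `detect_cuts` are hypothetical, frames are flat lists of 0-255 brightness values, and the distance measure (half the L1 distance between normalized histograms) is an assumption chosen so that differences fall between 0 and 1.

```python
from typing import List, Sequence


def brightness_histogram(frame: Sequence[int], bins: int = 4) -> List[float]:
    """Normalized histogram of a frame given as a flat list of 0-255 values."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // 256] += 1
    return [count / len(frame) for count in counts]


def detect_cuts(frames: Sequence[Sequence[int]], threshold: float = 0.6,
                min_scene_len: int = 2) -> List[int]:
    """Return the frame indices at which a new scene starts."""
    cuts: List[int] = []
    last_cut = 0
    previous = brightness_histogram(frames[0])
    for index in range(1, len(frames)):
        current = brightness_histogram(frames[index])
        # Half the L1 distance between two normalized histograms lies in [0, 1].
        difference = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
        # Accept a cut only if the change is large enough AND the scene that
        # would end here is at least min_scene_len frames long.
        if difference > threshold and index - last_cut >= min_scene_len:
            cuts.append(index)
            last_cut = index
        previous = current
    return cuts


dark, bright = [10] * 16, [240] * 16
frames = [dark] * 3 + [bright] + [dark] * 3  # a one-frame flash in a dark scene

cuts_strict = detect_cuts(frames, threshold=0.6, min_scene_len=2)
cuts_loose = detect_cuts(frames, threshold=0.6, min_scene_len=1)
print(cuts_strict, cuts_loose)
```

With `min_scene_len=1` the flash produces two cuts one frame apart; raising it to 2 suppresses the second cut, because the scene that would end there is too short. Lowering `threshold` makes the detector react to smaller histogram changes.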


The `scenes` column contains a JSON array with scene boundaries. Check the updated table schema:


In [6]:
v

0
table 'primetime-workshop/primetime_vids'

Column Name,Type,Computed With
video,Video,
title,String,
promo_img,Image,
promo_text,String,
duration,Float,get_duration(video)
scenes,Json,"video.scene_detect_histogram(fps=10,  threshold=0.6,  min_scene_len=100)"


In [7]:
v.select(v.video, v.scenes).collect()

video,scenes
,"[{""duration"": 28.779, ""start_pts"": 0, ""start_time"": 0.}, {""duration"": 16.85, ""start_pts"": 690690, ""start_time"": 28.779}, {""duration"": 80.122, ""start_pts"": 1095094, ""start_time"": 45.629}, {""duration"": 28.737, ""start_pts"": 3018015, ""start_time"": 125.751}, {""duration"": 25.734, ""start_pts"": 3707704, ""start_time"": 154.488}, {""duration"": 51.301, ""start_pts"": 4325321, ""start_time"": 180.222}, {""duration"": 70.32, ""start_pts"": 5556551, ""start_time"": 231.523}, {""duration"": 7.132, ""start_pts"": 7244237, ""start_time"": 301.843}, {""duration"": 41.458, ""start_pts"": 7415408, ""start_time"": 308.975}, {""duration"": 11.261, ""start_pts"": 8410402, ""start_time"": 350.433}]"


## 02 - Iterate Over Scenes

To create a view with video segments, we need to extract the scene start times from the `scenes` column. The [`video_splitter()`](https://docs.pixeltable.com/sdk/latest/video#iterator-video_splitter) iterator takes `segment_times` - an array of split points where the video should be divided.

The `scenes` column is in JSON format. Pixeltable supports JSON path expressions, which allows us to extract exactly what we need to pass to the video splitter.

Learn more about [iterators](https://docs.pixeltable.com/platform/iterators) and [JSON operations](https://docs.pixeltable.com/platform/type-system).

In [8]:
# Extract every scene's start_time with a JSON path expression
v.select(all_start_times=v.scenes['*'].start_time).collect()

all_start_times
"[0., 28.779, 45.629, 125.751, 154.488, 180.222, 231.523, 301.843, 308.975, 350.433]"


We don't need the first start time (index 0, at time 0), because `video_splitter` automatically starts the first segment at 0. Python slice notation on the JSON path selects only the part of the list we care about: `v.scenes[1:].start_time` returns the split points. For example, split points at [30, 60, 90] produce 4 segments: 0-30, 30-60, 60-90, and 90-end.
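
The relationship between split points and segments can be sketched in plain Python. This is an illustration of the idea only, not how `video_splitter` is implemented: the helper `segments_from_split_points` is hypothetical, and the 120-second duration in the second example is an assumed value for the [30, 60, 90] case.

```python
from typing import Dict, List, Tuple


def segments_from_split_points(split_points: List[float],
                               duration: float) -> List[Tuple[float, float]]:
    """Turn N split points into N + 1 (start, end) segments covering 0..duration."""
    boundaries = [0.0] + list(split_points) + [duration]
    return list(zip(boundaries[:-1], boundaries[1:]))


# Start times taken from the scene detection output in this notebook.
start_times = [0.0, 28.779, 45.629, 125.751, 154.488,
               180.222, 231.523, 301.843, 308.975, 350.433]
scene_list: List[Dict[str, float]] = [{"start_time": t} for t in start_times]

# Plain-Python analog of the JSON path with a slice: skip the scene at index 0.
split_points = [scene["start_time"] for scene in scene_list[1:]]

# The [30, 60, 90] example, assuming a hypothetical 120-second video.
example = segments_from_split_points([30, 60, 90], duration=120)
print(len(split_points), example)
```

Ten detected scenes yield nine split points, and three split points yield four segments.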

In [9]:
# Extract the scene start times, skipping the first one (time 0)
v.select(scene_start_times=v.scenes[1:].start_time).collect()

scene_start_times
"[28.779, 45.629, 125.751, 154.488, 180.222, 231.523, 301.843, 308.975, 350.433]"


In [10]:
scenes = pxt.create_view(
    'primetime-workshop/scene_view',
    v,
    iterator=pxtf.video.video_splitter(
        video=v.video,
        segment_times=v.scenes[1:].start_time, # this is from our scene detection
        mode='fast',
    ),
    if_exists='replace'
)

The view now has one row per scene segment. Let's check the schema and see what the iterator added to this view:


In [11]:
scenes

0
view 'primetime-workshop/scene_view' (of 'primetime-workshop/primetime_vids')

Column Name,Type,Computed With
pos,Required[Int],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
video,Video,
title,String,
promo_img,Image,
promo_text,String,


The [`video_splitter()`](https://docs.pixeltable.com/sdk/latest/video#iterator-video_splitter) iterator adds the following columns to the view:

- `pos`: Position/index of the segment
- `video_segment`: The actual video segment file
- `segment_start`: Start time of the segment in seconds
- `segment_end`: End time of the segment in seconds
- `segment_start_pts` / `segment_end_pts`: Presentation timestamps (for advanced use)
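
A presentation timestamp (pts) counts ticks of the video stream's time base rather than seconds. As a rough sketch: the `start_pts` and `start_time` pairs in the scene detection output earlier in this notebook appear consistent with a time base of 1/24000 second. That value is inferred from those numbers, not stated by the notebook, and other videos will generally use a different time base. The helper `pts_to_seconds` is hypothetical.

```python
from fractions import Fraction

# Assumption: time base inferred from the sample output, not read from the file.
TIME_BASE = Fraction(1, 24000)


def pts_to_seconds(pts: int, time_base: Fraction = TIME_BASE) -> float:
    """Convert a presentation timestamp (in time-base ticks) to seconds."""
    return float(pts * time_base)


# (start_pts, start_time) pairs copied from the scene detection output.
samples = [(690690, 28.779), (1095094, 45.629), (3018015, 125.751)]

# Largest gap between the converted pts and the reported start_time.
max_error = max(abs(pts_to_seconds(pts) - seconds) for pts, seconds in samples)
print(max_error)
```

The pts columns matter when you need frame-exact positions; for most queries the second-based `segment_start` and `segment_end` columns are easier to work with.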

In [12]:
scenes.select(scenes.pos, scenes.segment_start, scenes.segment_end, scenes.video_segment).limit(3).collect()

pos,segment_start,segment_end,video_segment
0,0.0,28.779,
1,28.779,45.629,
2,45.629,125.751,


In [None]:
scenes.count()

## 03 - Process Each Scene

Let's enrich our scene view with audio and transcripts. We'll transcribe at the scene level, then split transcripts into sentences for better embedding granularity.

### Extract & Transcribe Audio

Extract audio from each scene's video segment so we can transcribe it. Pixeltable's `extract_audio()` UDF works for this:


In [13]:
scenes.add_computed_column(
    audio=pxtf.video.extract_audio(scenes.video_segment),
    if_exists='replace'
)

Added 10 column values with 0 errors in 0.41 s (24.21 rows/s)


10 rows updated.

In [14]:
scenes

0
view 'primetime-workshop/scene_view' (of 'primetime-workshop/primetime_vids')

Column Name,Type,Computed With
pos,Required[Int],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
video,Video,
title,String,
promo_img,Image,


Transcribe audio using OpenAI's Whisper model. There are two options:

- **Local Whisper:** Free, no API key needed, but slower
- **OpenAI API:** Faster, but requires an API key and costs money

We'll use local Whisper for this example:


In [15]:
import warnings
warnings.filterwarnings('ignore')

scenes.add_computed_column(
    transcription=pxtf.whisper.transcribe(scenes.audio, model='base'),
    if_exists='replace'
)

Added 10 column values with 0 errors in 19.92 s (0.50 rows/s)


10 rows updated.

In [None]:
# Alternative: Using OpenAI API for transcription (faster, but requires API key)
# Uncomment and use this instead of the built-in whisper function above
# 
# First, install the openai package: pip install openai
# Set your API key: export OPENAI_API_KEY='your-api-key-here'
# 
# scenes.add_computed_column(
#     api_tx=pxtf.openai.transcriptions(scenes.audio, model='whisper-1'),
#     if_exists='replace'
# )

Let's see the schema change since transcription returns JSON:

In [16]:
scenes

0
view 'primetime-workshop/scene_view' (of 'primetime-workshop/primetime_vids')

Column Name,Type,Computed With
pos,Required[Int],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
transcription,Required[Json],"transcribe(audio, model='base')"
video,Video,
title,String,


Let's look at the transcription output:

In [17]:
scenes.select(
    scenes.pos,
    scenes.video_segment,
    scenes.transcription
).where(scenes.pos == 0).limit(1).collect()

pos,video_segment,transcription
0,,"{""text"": "" That check has been the whole point of the sequence beginning with the bishop cutting down the slope of the book by forcing it to a last threatening left. Question is... What will she do now?"", ""language"": ""en"", ""segments"": [{""id"": 0, ""end"": 4.96, ""seek"": 0, ""text"": "" That check has been the whole point of the sequence beginning with the bishop"", ""start"": 0., ""tokens"": [50364, 663, 1520, 575, 668, 264, ..., 8310, 2863, 365, 264, 34470, 50612], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 1, ""end"": 8.16, ""seek"": 0, ""text"": "" cutting down the slope of the book by forcing it to a last threatening left."", ""start"": 4.96, ""tokens"": [50612, 6492, 760, 264, 13525, 295, ..., 257, 1036, 20768, 1411, 13, 50772], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 2, ""end"": 10.16, ""seek"": 0, ""text"": "" Question is..."", ""start"": 8.16, ""tokens"": [50772, 14464, 307, 485, 50872], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 3, ""end"": 12.16, ""seek"": 0, ""text"": "" What will she do now?"", ""start"": 10.16, ""tokens"": [50872, 708, 486, 750, 360, 586, 30, 50972], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}]}"
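
Each entry in `segments` carries `start` and `end` times in seconds. A minimal sketch, assuming those times are measured from the beginning of each scene's audio clip (an assumption, not something this notebook confirms): adding the scene's `segment_start` would map a transcript segment back onto the full video timeline. The helper `to_video_time` is hypothetical, and the segment data is abbreviated from the output above.

```python
from typing import Dict, List


def to_video_time(segments: List[Dict], scene_start: float) -> List[Dict]:
    """Shift clip-relative segment times by the scene's start time."""
    return [
        {
            "text": segment["text"].strip(),
            "start": round(scene_start + segment["start"], 3),
            "end": round(scene_start + segment["end"], 3),
        }
        for segment in segments
    ]


# Abbreviated from the transcription output for the scene at pos 0.
whisper_segments = [
    {"start": 8.16, "end": 10.16, "text": " Question is..."},
    {"start": 10.16, "end": 12.16, "text": " What will she do now?"},
]

# The scene at pos 0 starts at 0.0, so its times are unchanged.
unchanged = to_video_time(whisper_segments, scene_start=0.0)
# Illustration only: the same offsets applied to a scene starting at 28.779 s.
shifted = to_video_time(whisper_segments, scene_start=28.779)
print(unchanged, shifted)
```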


We can extract the text from the transcription JSON:

In [22]:
scenes.select(scenes.transcription, scenes.transcription.text).limit(3).collect()

transcription,transcription_text
"{""text"": """", ""language"": ""en"", ""segments"": []}",
"{""text"": "" The President has invited you to the White House. There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking hi ...... a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport."", ""language"": ""en"", ""segments"": [{""id"": 0, ""end"": 3., ""seek"": 0, ""text"": "" The President has invited you to the White House."", ""start"": 0., ""tokens"": [50364, 440, 3117, 575, 9185, 291, 281, 264, 5552, 4928, 13, 50514], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, {""id"": 1, ""end"": 5., ""seek"": 0, ""text"": "" There'll be a chess board set up in the Oval Office,"", ""start"": 3., ""tokens"": [50514, 821, 603, 312, 257, 24122, ..., 264, 422, 3337, 8935, 11, 50614], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, {""id"": 2, ""end"": 8., ""seek"": 0, ""text"": "" and of course a photo op of you kicking his ass."", ""start"": 5., ""tokens"": [50614, 293, 295, 1164, 257, 5052, 999, 295, 291, 19137, 702, 1256, 13, 50764], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, {""id"": 3, ""end"": 11., ""seek"": 0, ""text"": "" Texas being more of a checker state."", ""start"": 8., ""tokens"": [50764, 7885, 885, 544, 295, 257, 1520, 260, 1785, 13, 50914], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, {""id"": 4, ""end"": 15., ""seek"": 0, ""text"": "" There's a dinner tonight after the reception"", ""start"": 13., ""tokens"": [51014, 821, 311, 257, 6148, 4440, 934, 264, 21682, 51114], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, {""id"": 5, ""end"": 18., ""seek"": 0, ""text"": "" at the Russian chess club in Georgetown."", ""start"": 15., ""tokens"": [51114, 412, 
264, 7220, 24122, 6482, 294, 34848, 13, 51264], ""avg_logprob"": -0.417, ""temperature"": 0., ""no_speech_prob"": 0.225, ""compression_ratio"": 2.148}, ..., {""id"": 11, ""end"": 30., ""seek"": 2800, ""text"": "" There are a lot of visitors belong,"", ""start"": 28., ""tokens"": [50364, 821, 366, 257, 688, 295, 14315, 5784, 11, 50464], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}, {""id"": 12, ""end"": 32., ""seek"": 2800, ""text"": "" so we've prepared a list of talking points."", ""start"": 30., ""tokens"": [50464, 370, 321, 600, 4927, 257, 1329, 295, 1417, 2793, 13, 50564], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}, {""id"": 13, ""end"": 41., ""seek"": 2800, ""text"": "" It's a big deal beating the Soviets at the wrong game."", ""start"": 37., ""tokens"": [50814, 467, 311, 257, 955, 2028, 13497, 264, 41354, 412, 264, 2085, 1216, 13, 51014], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}, {""id"": 14, ""end"": 45., ""seek"": 2800, ""text"": "" Could you stop the car, please?"", ""start"": 43., ""tokens"": [51114, 7497, 291, 1590, 264, 1032, 11, 1767, 30, 51214], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}, {""id"": 15, ""end"": 48., ""seek"": 2800, ""text"": "" I'd like to walk."", ""start"": 47., ""tokens"": [51314, 286, 1116, 411, 281, 1792, 13, 51364], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}, {""id"": 16, ""end"": 50., ""seek"": 2800, ""text"": "" To the airport."", ""start"": 49., ""tokens"": [51414, 1407, 264, 10155, 13, 51464], ""avg_logprob"": -0.33, ""temperature"": 0.8, ""no_speech_prob"": 0.042, ""compression_ratio"": 1.351}]}","The President has invited you to the White House. 
There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass. Texas being more of a checker state. There's a dinner tonight after the reception at the Russian chess club in Georgetown. A lot of prominent visitors belong, so I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. There are a lot of visitors belong, so we've prepared a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport."
"{""text"": "" That check has been the whole point of the sequence beginning with the bishop cutting down the slope of the book by forcing it to a last threatening left. Question is... What will she do now?"", ""language"": ""en"", ""segments"": [{""id"": 0, ""end"": 4.96, ""seek"": 0, ""text"": "" That check has been the whole point of the sequence beginning with the bishop"", ""start"": 0., ""tokens"": [50364, 663, 1520, 575, 668, 264, ..., 8310, 2863, 365, 264, 34470, 50612], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 1, ""end"": 8.16, ""seek"": 0, ""text"": "" cutting down the slope of the book by forcing it to a last threatening left."", ""start"": 4.96, ""tokens"": [50612, 6492, 760, 264, 13525, 295, ..., 257, 1036, 20768, 1411, 13, 50772], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 2, ""end"": 10.16, ""seek"": 0, ""text"": "" Question is..."", ""start"": 8.16, ""tokens"": [50772, 14464, 307, 485, 50872], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}, {""id"": 3, ""end"": 12.16, ""seek"": 0, ""text"": "" What will she do now?"", ""start"": 10.16, ""tokens"": [50872, 708, 486, 750, 360, 586, 30, 50972], ""avg_logprob"": -0.556, ""temperature"": 0., ""no_speech_prob"": 0.364, ""compression_ratio"": 1.384}]}",That check has been the whole point of the sequence beginning with the bishop cutting down the slope of the book by forcing it to a last threatening left. Question is... What will she do now?


The `transcription_text` column has type JSON, which requires us to use a type cast if we want to process it as a string. 

We use `.astype(pxt.String)` to convert it from JSON to a string. This lets us work with the transcript text directly in queries and computed columns.

Pixeltable enforces type safety: if you want to pass the path expression `scenes.transcription.text` to a UDF that expects a string, you would first need to cast it. Otherwise, it would flag a type error.
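
Why a cast is needed can be pictured with a plain-Python analogy. A value pulled out of JSON could be a string, a number, a list, or a nested object, so code that needs a string has to check first. The helper `as_string` below is hypothetical and only illustrates the idea; it does not reproduce how `.astype(pxt.String)` behaves internally, and the sample dictionary is abbreviated from the transcription output.

```python
from typing import Any


def as_string(value: Any) -> str:
    """Return value unchanged if it is a string, otherwise raise TypeError."""
    if not isinstance(value, str):
        raise TypeError(f"expected a string, got {type(value).__name__}")
    return value


# Abbreviated sample shaped like the transcription JSON.
transcription = {
    "text": " Question is... What will she do now?",
    "language": "en",
    "segments": [],
}

# The "text" entry is a string, so the checked cast succeeds.
transcript_text = as_string(transcription["text"])

# The "segments" entry is a list, so the same cast fails loudly.
try:
    as_string(transcription["segments"])
    cast_failed = False
except TypeError:
    cast_failed = True
print(transcript_text, cast_failed)
```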

In [23]:
# Extract the text from the transcription JSON
scenes.add_computed_column(
    transcript_text=scenes.transcription.text.astype(pxt.String),
    if_exists='replace'
)

Added 10 column values with 0 errors in 0.04 s (237.30 rows/s)


10 rows updated.

In [24]:
scenes

0
view 'primetime-workshop/scene_view' (of 'primetime-workshop/primetime_vids')

Column Name,Type,Computed With
pos,Required[Int],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
transcription,Required[Json],"transcribe(audio, model='base')"
transcript_text,String,transcription.text.astype(String)
video,Video,


In [25]:
scenes.select(scenes.pos, scenes.video_segment, scenes.transcript_text).limit(3).collect()

pos,video_segment,transcript_text
4,,
5,,"The President has invited you to the White House. There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass. Texas being more of a checker state. There's a dinner tonight after the reception at the Russian chess club in Georgetown. A lot of prominent visitors belong, so I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. There are a lot of visitors belong, so we've prepared a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport."
9,,С deserves greyом


### Split Scene Transcripts into Sentences

Embedding entire scene transcripts loses semantic granularity. A scene might discuss multiple topics, and embedding the whole transcript creates a single vector that blurs those distinctions.

We'll create another view that splits each scene's transcript into sentences using [`string_splitter()`](https://docs.pixeltable.com/sdk/latest/string). This gives us one row per sentence, which we can then embed and search with better precision.
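
To see what "one row per sentence" means, here is a naive splitter in plain Python. It is a sketch under stated assumptions: it splits on whitespace that follows `.`, `!`, or `?`, which is much cruder than a real sentence segmenter and is not what `string_splitter()` does internally. The function name `split_sentences` is hypothetical; the sample text comes from the transcript shown earlier.

```python
import re
from typing import Dict, List, Union


def split_sentences(text: str) -> List[Dict[str, Union[int, str]]]:
    """Split text after ., ! or ? and return one {'pos', 'text'} row per sentence."""
    parts = [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]
    return [{"pos": pos, "text": part} for pos, part in enumerate(parts)]


rows = split_sentences("Could you stop the car, please? I'd like to walk. To the airport.")
for row in rows:
    print(row["pos"], row["text"])
```

Each row keeps a `pos` value, so the sentences can be put back in their original order.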

In [26]:
# Create a view that splits transcripts into sentences
sentences = pxt.create_view(
    'primetime-workshop/sentences',
    scenes,
    iterator=pxtf.string.string_splitter(
        scenes.transcript_text,
        separators='sentence'
    )
)

The view now has one row per sentence. Let's check the schema:

In [27]:
sentences

0
"view 'primetime-workshop/sentences' (of 'primetime-workshop/scene_view', 'primetime-workshop/primetime_vids')"

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
transcription,Required[Json],"transcribe(audio, model='base')"
transcript_text,String,transcription.text.astype(String)


When we talked about iterator views in Act 1, we noted that each one gets a position column so that the original order can be reconstructed. In this case, the string splitter returns `pos` and `text`.

View some sample sentences:

In [28]:
sentences.select(sentences.pos, sentences.transcript_text, sentences.text).limit(3).collect()

pos,transcript_text,text
0,"The President has invited you to the White House. There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass. Texas being more of a checker state. There's a dinner tonight after the reception at the Russian chess club in Georgetown. A lot of prominent visitors belong, so I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. There are a lot of visitors belong, so we've prepared a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport.",The President has invited you to the White House.
1,"The President has invited you to the White House. There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass. Texas being more of a checker state. There's a dinner tonight after the reception at the Russian chess club in Georgetown. A lot of prominent visitors belong, so I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. There are a lot of visitors belong, so we've prepared a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport.","There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass."
2,"The President has invited you to the White House. There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass. Texas being more of a checker state. There's a dinner tonight after the reception at the Russian chess club in Georgetown. A lot of prominent visitors belong, so I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. I'm going to have to go back to the White House. There are a lot of visitors belong, so we've prepared a list of talking points. It's a big deal beating the Soviets at the wrong game. Could you stop the car, please? I'd like to walk. To the airport.",Texas being more of a checker state.


## 04 - Search Across Scenes

Now let's create an embedding index on the sentence-level text. This gives us much better semantic search precision than embedding entire scene transcripts.

We'll use a sentence transformer model from Hugging Face:

In [29]:
sentences.add_embedding_index(
    sentences.text,
    embedding=pxtf.huggingface.sentence_transformer.using(model_id='sentence-transformers/all-MiniLM-L6-v2'),
    if_exists='replace'
)

Let's see the index in the view schema:

In [None]:
sentences

Now we can perform text-based semantic search on the sentences:

In [30]:
# Search for sentences about chess strategy
sim = sentences.text.similarity(string='chess strategy')

sentences.order_by(
    sim,
    asc=False
).select(
    sentences.video_segment,
    sentences.text,
    score=sim
).limit(3).collect()

video_segment,text,score
,There's a dinner tonight after the reception at the Russian chess club in Georgetown.,0.365
,"There'll be a chess board set up in the Oval Office, and of course a photo op of you kicking his ass.",0.317
,It's a big deal beating the Soviets at the wrong game.,0.269
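
The ranking step behind a similarity query can be sketched in plain Python. This toy version swaps in word-count vectors for the sentence-transformer embeddings, so it only matches exact words and its scores are not comparable to the ones above; what it does share with the real query is the shape of the computation: embed the query, score every sentence by cosine similarity, and keep the top results. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, candidates: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Score every candidate against the query and return the k best."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(text)), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    "I'd like to walk.",
    "There's a dinner tonight after the reception at the Russian chess club in Georgetown.",
    "To the airport.",
]
best = top_k("chess club", corpus, k=1)
print(best)
```

A real embedding model places related phrases close together even when they share no words, which is why the notebook's query for "chess strategy" can surface sentences that never mention "strategy".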


We just built semantic search for video transcripts.

The workflow we defined:
- Detects scene boundaries using content-aware algorithms
- Creates a view with one row per scene segment
- Extracts audio and transcribes each scene automatically
- Splits transcripts into sentences for better embedding granularity
- Builds a searchable embedding index on sentences
- Finds relevant sentences by semantic similarity

Scenes, views, and computed columns work together to make video content searchable by meaning.

## 05 - Add a New Video

Now we are going to add a new video and sit back and watch while Pixeltable populates the views across the entire hierarchy. 

Thus far, we have created a 3-level table hierarchy in this notebook:

```
Videos `primetime_vids`
└─ Scenes view `scene_view`
   └─ Sentences view `sentences`
```

Here they all are (including `video-frame-view` from Act 1):

In [31]:
pxt.list_tables('primetime-workshop')

['primetime-workshop/video-frame-view',
 'primetime-workshop/primetime_vids',
 'primetime-workshop/sentences',
 'primetime-workshop/scene_view']

We started with our `primetime_vids` table, which has the handle `v` in this Python session. We can view the schema:

In [32]:
v

0
table 'primetime-workshop/primetime_vids'

Column Name,Type,Computed With
video,Video,
title,String,
promo_img,Image,
promo_text,String,
duration,Float,get_duration(video)
scenes,Json,"video.scene_detect_histogram(fps=10,  threshold=0.6,  min_scene_len=100)"


And we have been iterating with data from a single row:

In [33]:
v.count()

1

Now it's time to demo our workflow in action. What will happen when we insert a new row into `primetime_vids`?



In [None]:
v.insert([{
    'video': 'source/only-murders-opening.mp4',
    'title': 'Only Murders in the Building',
    'promo_img': 'source/only-murders-img.jpg',
    'promo_text': 'Three strangers share an obsession with true crime and suddenly find themselves wrapped up in one. When a grisly death occurs inside their exclusive Upper West Side apartment building, the trio suspects murder and employs their precise knowledge of true crime to investigate the truth. Perhaps even more explosive are the lies they tell one another. Soon, the endangered trio comes to realize a killer might be living among them as they race to decipher the mounting clues before it is too late.'
}])

The schema won't change, because we just added a row - we didn't make any changes to columns.

In [35]:
v

0
table 'primetime-workshop/primetime_vids'

Column Name,Type,Computed With
video,Video,
title,String,
promo_img,Image,
promo_text,String,
duration,Float,get_duration(video)
scenes,Json,"video.scene_detect_histogram(fps=10,  threshold=0.6,  min_scene_len=100)"


But we do have two computed columns: `duration` and `scenes`. Let's check that Pixeltable ran those computations on insert:

In [36]:
v.collect()

video,title,promo_img,promo_text,duration,scenes
,The Queens Gambit,,"Set during the Cold War era, orphaned chess prodigy Beth Harmon struggles with addiction in a quest to become the greatest chess player in the world.",377.043,"[{""duration"": 28.779, ""start_pts"": 0, ""start_time"": 0.}, {""duration"": 16.85, ""start_pts"": 690690, ""start_time"": 28.779}, {""duration"": 80.122, ""start_pts"": 1095094, ""start_time"": 45.629}, {""duration"": 28.737, ""start_pts"": 3018015, ""start_time"": 125.751}, {""duration"": 25.734, ""start_pts"": 3707704, ""start_time"": 154.488}, {""duration"": 51.301, ""start_pts"": 4325321, ""start_time"": 180.222}, {""duration"": 70.32, ""start_pts"": 5556551, ""start_time"": 231.523}, {""duration"": 7.132, ""start_pts"": 7244237, ""start_time"": 301.843}, {""duration"": 41.458, ""start_pts"": 7415408, ""start_time"": 308.975}, {""duration"": 11.261, ""start_pts"": 8410402, ""start_time"": 350.433}]"
,Only Murders in the Building,,"Three strangers share an obsession with true crime and suddenly find themselves wrapped up in one. When a grisly death occurs inside their exclusive Upper West Side apartment building, the trio suspects murder and employs their precise knowledge of true crime to investigate the truth. Perhaps even more explosive are the lies they tell one another. Soon, the endangered trio comes to realize a killer might be living among them as they race to decipher the mounting clues before it is too late.",248.707,"[{""duration"": 6.84, ""start_pts"": 0, ""start_time"": 0.}, {""duration"": 12.596, ""start_pts"": 164164, ""start_time"": 6.84}, {""duration"": 10.594, ""start_pts"": 466466, ""start_time"": 19.436}, {""duration"": 10.594, ""start_pts"": 720720, ""start_time"": 30.03}, {""duration"": 74.491, ""start_pts"": 974974, ""start_time"": 40.624}, {""duration"": 17.226, ""start_pts"": 2762760, ""start_time"": 115.115}, {""duration"": 19.061, ""start_pts"": 3176173, ""start_time"": 132.341}, {""duration"": 86.42, ""start_pts"": 3633630, ""start_time"": 151.401}]"
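
One way to picture what happens on insert is a toy model in plain Python. This is a conceptual sketch only, not how Pixeltable is implemented: `ToyTable` and `ToyView` are hypothetical classes, computed columns are plain functions of a row, and a view's iterator is a function that expands one base row into several rows.

```python
from typing import Callable, Dict, List


class ToyTable:
    """Rows plus computed columns that are filled in automatically."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []
        self.computed: Dict[str, Callable[[Dict], object]] = {}
        self.views: List["ToyView"] = []

    def add_computed_column(self, name: str, fn: Callable[[Dict], object]) -> None:
        self.computed[name] = fn
        for row in self.rows:  # backfill existing rows
            row[name] = fn(row)

    def insert(self, row: Dict) -> None:
        row = dict(row)
        for name, fn in self.computed.items():  # compute on insert
            row[name] = fn(row)
        self.rows.append(row)
        for view in self.views:  # propagate to dependent views
            view.on_insert(row)


class ToyView(ToyTable):
    """A view that expands each base row into several rows via an iterator."""

    def __init__(self, base: ToyTable, iterator: Callable[[Dict], List[Dict]]) -> None:
        super().__init__()
        self.iterator = iterator
        base.views.append(self)
        for row in base.rows:  # populate from rows that already exist
            self.on_insert(row)

    def on_insert(self, base_row: Dict) -> None:
        for pos, item in enumerate(self.iterator(base_row)):
            self.insert({**base_row, "pos": pos, **item})


videos = ToyTable()
videos.add_computed_column("n_words", lambda r: len(r["transcript"].split()))
videos.insert({"title": "A", "transcript": "Hello there. How are you?"})

sentence_view = ToyView(videos, lambda r: [{"text": s} for s in r["transcript"].split(". ")])
videos.insert({"title": "B", "transcript": "One. Two. Three."})
print(len(videos.rows), len(sentence_view.rows))
```

Inserting the second row fills in its computed column and adds its rows to the dependent view, with no extra calls from the caller. In the toy model, a view built on another view would receive rows the same way.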


We also have a `scene_view` that derives from this base table. It is already bound to the handle `scenes` earlier in this notebook, so we can use that handle to review the schema:

In [37]:
scenes

view 'primetime-workshop/scene_view' (of 'primetime-workshop/primetime_vids')

Column Name,Type,Computed With
pos,Required[Int],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
transcription,Required[Json],"transcribe(audio, model='base')"
transcript_text,String,transcription.text.astype(String)
video,Video,


In [38]:
scenes.select(scenes.promo_img, scenes.video_segment, scenes.transcript_text).tail(2)

promo_img,video_segment,transcript_text
,,"Do you ever see this email from a Bev Mellon about our podcast? Bev Mellon. Must be tough to have a last name that's a fruit. Fiona Apple, Darrell Strawberry, Gilbert Grape, something was always eating him."
,,"So what's next podcast wise, Charles, you mentioned a cold case? Oh, please. It was interesting, wouldn't it cool? But what we need is a hot, fresh dead body, preferably right here, or very near to here. It's true. We've been very lucky with people dying in our building. Yeah, but it is kind of a flaw in our business model. Mm-hmm. Oh my God, there she is. What? Who? Hey. My Malbacita. So, Sass never even made it up here? That's odd. Okay, watch this. Wow, so cool. Like stepping into a sharper image catalog. See? And what's cool? The cork. Oh, jeez! I've got it. Oh, my God. What a stupid thing. Oh, shit. I hope these spots don't stain. Oh, it's good. Maybe it's fine. Maybe a little time off is a good thing. I mean, it's always a good thing when someone doesn't get murdered, right? Yeah, always. Absolutely. But if someone had to get murdered, let's all say who we hope it would be. Not some part, Tess. Oh, no, it's fine. Count of three. One, two. Oliver. Me? One. You"


We also have a `sentences` view, which splits each scene's transcript into individual sentences:

In [41]:
sentences

"view 'primetime-workshop/sentences' (of 'primetime-workshop/scene_view', 'primetime-workshop/primetime_vids')"

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
segment_start,Float,
segment_start_pts,Int,
segment_end,Float,
segment_end_pts,Int,
video_segment,Required[Video],
audio,Required[Audio],extract_audio(video_segment)
transcription,Required[Json],"transcribe(audio, model='base')"
transcript_text,String,transcription.text.astype(String)

Index Name,Column,Metric,Embedding
idx0,text,cosine,"sentence_transformer(text, model_id='sentence-transformers/all-MiniLM-L6-v2', normalize_embeddings=False)"
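The index above uses the `cosine` metric. As a rough illustration of what that metric measures, the sketch below ranks a few made-up 3-dimensional vectors against a query vector in plain Python. The vectors, their labels, and the `cosine_similarity` helper are hypothetical stand-ins for real sentence embeddings; this is not how Pixeltable computes or stores the index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (hypothetical values, for illustration only).
query = [1.0, 0.0, 1.0]
candidates = {
    "close in direction": [2.0, 0.1, 2.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite-ish": [-1.0, 0.5, -1.0],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)

# Scaling a vector changes its length but not its direction,
# so its cosine score against the query stays the same.
scaled = [10 * x for x in candidates["close in direction"]]
print(math.isclose(
    cosine_similarity(query, scaled),
    cosine_similarity(query, candidates["close in direction"]),
))
```

Because cosine similarity divides by both vector lengths, only the direction of each vector affects the ranking.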


In [42]:
sentences.select(sentences.video_segment, sentences.transcript_text, sentences.text).tail(3)

video_segment,transcript_text,text
,"So what's next podcast wise, Charles, you mentioned a cold case? Oh, please. It was interesting, wouldn't it cool? But what we need is a hot, fresh dead body, preferably right here, or very near to here. It's true. We've been very lucky with people dying in our building. Yeah, but it is kind of a flaw in our business model. Mm-hmm. Oh my God, there she is. What? Who? Hey. My Malbacita. So, Sass never even made it up here? That's odd. Okay, watch this. Wow, so cool. Like stepping into a sharper image catalog. See? And what's cool? The cork. Oh, jeez! I've got it. Oh, my God. What a stupid thing. Oh, shit. I hope these spots don't stain. Oh, it's good. Maybe it's fine. Maybe a little time off is a good thing. I mean, it's always a good thing when someone doesn't get murdered, right? Yeah, always. Absolutely. But if someone had to get murdered, let's all say who we hope it would be. Not some part, Tess. Oh, no, it's fine. Count of three. One, two. Oliver. Me? One. You",Me?
,"So what's next podcast wise, Charles, you mentioned a cold case? Oh, please. It was interesting, wouldn't it cool? But what we need is a hot, fresh dead body, preferably right here, or very near to here. It's true. We've been very lucky with people dying in our building. Yeah, but it is kind of a flaw in our business model. Mm-hmm. Oh my God, there she is. What? Who? Hey. My Malbacita. So, Sass never even made it up here? That's odd. Okay, watch this. Wow, so cool. Like stepping into a sharper image catalog. See? And what's cool? The cork. Oh, jeez! I've got it. Oh, my God. What a stupid thing. Oh, shit. I hope these spots don't stain. Oh, it's good. Maybe it's fine. Maybe a little time off is a good thing. I mean, it's always a good thing when someone doesn't get murdered, right? Yeah, always. Absolutely. But if someone had to get murdered, let's all say who we hope it would be. Not some part, Tess. Oh, no, it's fine. Count of three. One, two. Oliver. Me? One. You",One.
,"So what's next podcast wise, Charles, you mentioned a cold case? Oh, please. It was interesting, wouldn't it cool? But what we need is a hot, fresh dead body, preferably right here, or very near to here. It's true. We've been very lucky with people dying in our building. Yeah, but it is kind of a flaw in our business model. Mm-hmm. Oh my God, there she is. What? Who? Hey. My Malbacita. So, Sass never even made it up here? That's odd. Okay, watch this. Wow, so cool. Like stepping into a sharper image catalog. See? And what's cool? The cork. Oh, jeez! I've got it. Oh, my God. What a stupid thing. Oh, shit. I hope these spots don't stain. Oh, it's good. Maybe it's fine. Maybe a little time off is a good thing. I mean, it's always a good thing when someone doesn't get murdered, right? Yeah, always. Absolutely. But if someone had to get murdered, let's all say who we hope it would be. Not some part, Tess. Oh, no, it's fine. Count of three. One, two. Oliver. Me? One. You",You


## Wrap-Up

You built a semantic search pipeline for video transcripts using Pixeltable:

```
┌───────────────────────────────────────────────────────────────────────────────────────┐
│  INPUT            DETECT            EXTRACT            TRANSCRIBE         SEARCH      │
│                                                                                       │
│  ┌────────┐      ┌────────┐      ┌─────────┐        ┌────────────┐    ┌───────────┐   │
│  │ Video  │─────▶│ Scenes │─────▶│  Audio  │───────▶│ Transcript │───▶│ Semantic  │   │
│  │  File  │      │        │      │         │        │    Text    │    │  Search   │   │
│  └────────┘      └────────┘      └─────────┘        └────────────┘    └───────────┘   │
│      │               │                │                    │                  │       │
│      │               │                │                    │                  │       │
│  Load video    PySceneDetect     Extract audio        Whisper           Query by      │
│  from URL      finds scene       from video          transcribes        what's said   │
│                boundaries         segments            audio                           │
│                                                                                       │
└───────────────────────────────────────────────────────────────────────────────────────┘
```

**What you built:**

You added a second search modality to your video pipeline. 

- Act 1 gave you visual search (find frames by what they look like).  
- Act 2 gives you audio search (find scenes by what's being said).

**Each step is declarative:**
- **Input**: Video data loaded from URL
- **Detect**: PySceneDetect finds scene boundaries automatically
- **Extract**: Audio extracted from video segments  
- **Transcribe**: Whisper transcribes audio to searchable text
- **Search**: Query scenes by semantic meaning

**Next up:** In Act 3, you'll use generative AI to create new images and videos from scene data.

## Appendix - JSON Parsing Examples


This section covers JSON parsing in more detail, including step-by-step exploration of the JSON structure and different ways to access query results.


### Exploring JSON Structure

To create a view with video segments, we need to extract the scene start times from our `scenes` column. Let's build up to this step by step, exploring the JSON structure along the way. Learn more about [JSON operations](https://docs.pixeltable.com/platform/type-system).

**Step 1:** First, let's see what the `scenes` column contains:


In [None]:
import pixeltable as pxt
v = pxt.get_table('primetime-workshop/primetime_vids')

In [None]:
# Step 1: Look at the scenes column structure
v.select(v.scenes).collect()

**Step 2:** The `scenes` column contains a JSON array. Let's access the first scene to see its structure:

In [None]:
# Step 2: Access the first scene element
v.select(v.scenes[0]).collect()

**Step 3:** Each scene has properties like `start_time`, `start_pts`, and `duration`. Let's access the `start_time` of the first scene:


In [None]:
# Step 3: Access the start_time property of the first scene
v.select(v.scenes[0].start_time).collect()
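
In the sample `scenes` output earlier in this notebook, each entry carries `duration`, `start_pts`, and `start_time`, with no explicit end time. If you need one, it can be derived as `start_time + duration`. The plain-Python sketch below does this for the first three scenes of that sample; `sample_scenes` and `scene_end` are hypothetical names used only for illustration.

```python
import json

# First three scene entries copied from the sample `scenes` output shown earlier.
scenes_json = """[
  {"duration": 28.779, "start_pts": 0, "start_time": 0.0},
  {"duration": 16.85, "start_pts": 690690, "start_time": 28.779},
  {"duration": 80.122, "start_pts": 1095094, "start_time": 45.629}
]"""
sample_scenes = json.loads(scenes_json)


def scene_end(scene):
    """Hypothetical helper: end time in seconds, rounded to milliseconds."""
    return round(scene["start_time"] + scene["duration"], 3)


ends = [scene_end(scene) for scene in sample_scenes]
print(ends)

# In this sample, each scene ends where the next one starts.
for current, following in zip(sample_scenes, sample_scenes[1:]):
    print(scene_end(current), following["start_time"])
```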


**Step 4:** Now let's slice the array to get all scenes from index 1 onwards (skipping the first scene, which typically starts at 0):


In [None]:
# Step 4: Slice to get scenes from index 1 onwards
v.select(v.scenes[1:]).collect()

**Step 5:** Now access the `start_time` property for all scenes in the slice. Here, we'll also name the column `times`:


In [None]:
# Step 5: Access start_time for all scenes from index 1 onwards
v.select(times=v.scenes[1:].start_time).collect()
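
To picture what `scenes[1:].start_time` yields for a single row, here is a plain-Python equivalent using the start times from the "Only Murders in the Building" sample output shown earlier. The names `sample_scenes`, `boundaries`, and `spans` are hypothetical; the pairing step at the end is only one way to read the values as cut points, not a description of how the splitter works internally.

```python
# Start times copied from the sample `scenes` output for
# "Only Murders in the Building" (other fields omitted for brevity).
sample_scenes = [
    {"start_time": 0.0},
    {"start_time": 6.84},
    {"start_time": 19.436},
    {"start_time": 30.03},
    {"start_time": 40.624},
    {"start_time": 115.115},
    {"start_time": 132.341},
    {"start_time": 151.401},
]

# Slice off the first scene, then project the `start_time` field.
times = [scene["start_time"] for scene in sample_scenes[1:]]
print(times)

# Read as cut points: consecutive boundaries bracket each scene
# up to the last cut (the final scene runs to the end of the video).
boundaries = [sample_scenes[0]["start_time"]] + times
spans = list(zip(boundaries, boundaries[1:]))
print(spans[0], spans[-1])
```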

### Accessing Query Results

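Using `select()`, we compose a query; calling `collect()` runs it and returns the results. Pixeltable gives you a few ways to interact with those results:

1. You can access a row as a dictionary (`result[0]`) or a column as a list of values (`result['times']`)
2. You can index by row/column position (`result[0, 0]`) or by row and column name (`result[0, 'times']`)

**Example 1:** Access rows as dictionaries and columns as lists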


In [None]:
result = v.select(times=v.scenes[1:].start_time).collect()

In [None]:
result  # Displays the full result set as a table

In [None]:
result[0]  # Returns first row as dict

In [None]:
result['times']  # Returns list of times values

**Example 2:** Index by row/column `[0,0]` and by column name

In [None]:
# Index by position [row, column]
first_value = result[0, 0]  # First row, first column
first_value

In [None]:
# Index by column name
first_time = result[0, "times"]  # First row, "times" column
first_time
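
To make the four access patterns concrete without a running database, the sketch below defines a small stand-in class that supports the same indexing forms. `MiniResult` and `demo` are hypothetical names; this is not Pixeltable's implementation, only an illustration of the access semantics described in the cell comments above.

```python
class MiniResult:
    """Hypothetical stand-in that mimics the indexing patterns shown above."""

    def __init__(self, columns, rows):
        self.columns = columns  # column names, in order
        self.rows = rows        # one list of values per row

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # result[row, col], where col is a position or a column name
            row, col = key
            if isinstance(col, str):
                col = self.columns.index(col)
            return self.rows[row][col]
        if isinstance(key, str):
            # result['col'] -> list of that column's values
            col = self.columns.index(key)
            return [row[col] for row in self.rows]
        # result[i] -> row as a dict keyed by column name
        return dict(zip(self.columns, self.rows[key]))


# Two rows with a single `times` column (values taken from the sample output).
demo = MiniResult(["times"], [[[28.779, 45.629]], [[6.84, 19.436]]])

print(demo[0])           # first row as a dict
print(demo["times"])     # all values in the `times` column
print(demo[0, 0])        # first row, first column
print(demo[0, "times"])  # first row, `times` column

# Collect every row as a dict to get a list of dictionaries.
as_dicts = [demo[i] for i in range(len(demo))]
print(as_dicts)
```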

---

## Learn More

### Views & Iterators
- [Views Platform Guide](https://docs.pixeltable.com/platform/views) - Creating and using views
- [Iterators Platform Guide](https://docs.pixeltable.com/platform/iterators) - Working with iterators
- [`video_splitter()`](https://docs.pixeltable.com/sdk/latest/video#iterator-video_splitter) - Video segmentation iterator
- [`string_splitter()`](https://docs.pixeltable.com/sdk/latest/string#iterator-string_splitter) - Text segmentation iterator
- [`create_view()`](https://docs.pixeltable.com/sdk/latest/pixeltable#func-create-view) - View creation API

### Scene Detection
- [Video Functions](https://docs.pixeltable.com/sdk/latest/video) - Scene detection functions
- [`scene_detect_histogram()`](https://docs.pixeltable.com/sdk/latest/video#udf-scene_detect_histogram) - Content-based scene detection

### Audio & Transcription
- [`extract_audio()`](https://docs.pixeltable.com/sdk/latest/video#udf-extract_audio) - Extract audio from video
- [Whisper Functions](https://docs.pixeltable.com/sdk/latest/whisper) - Local Whisper transcription
- [OpenAI Functions](https://docs.pixeltable.com/sdk/latest/openai) - OpenAI API transcription

### JSON Operations
- [Type System](https://docs.pixeltable.com/platform/type-system) - Working with JSON and other types
- [JSON Operations](https://docs.pixeltable.com/platform/type-system) - Accessing and manipulating JSON data

### Embeddings & Search
- [Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes) - Building searchable indexes
- [`sentence_transformer()`](https://docs.pixeltable.com/sdk/latest/huggingface#sentence-transformer) - Text embeddings
- [Similarity Search Cookbooks](https://docs.pixeltable.com/howto/cookbooks/search/search-similar-text) - Text similarity search
- [HuggingFace Integration](https://docs.pixeltable.com/sdk/latest/huggingface) - Working with HuggingFace models

## Functions Used

This notebook uses the following Pixeltable functions:

- [`add_computed_column()`](https://docs.pixeltable.com/sdk/latest/pixeltable#add-computed-column) - Add computed columns to tables
- [`add_embedding_index()`](https://docs.pixeltable.com/sdk/latest/pixeltable#add-embedding-index) - Create embedding indexes for similarity search
- [`create_view()`](https://docs.pixeltable.com/sdk/latest/pixeltable#func-create-view) - Create views from tables
- [`extract_audio()`](https://docs.pixeltable.com/sdk/latest/video#udf-extract_audio) - Extract audio from video segments
- [`get_table()`](https://docs.pixeltable.com/sdk/latest/pixeltable#func-get-table) - Retrieve existing tables
- [`list_tables()`](https://docs.pixeltable.com/sdk/latest/pixeltable#func-list-tables) - List all tables in the database
- [`scene_detect_histogram()`](https://docs.pixeltable.com/sdk/latest/video#udf-scene_detect_histogram) - Detect scene boundaries in video
- [`similarity()`](https://docs.pixeltable.com/sdk/latest/pixeltable#similarity) - Perform similarity search using embedding indexes
- [`string_splitter()`](https://docs.pixeltable.com/sdk/latest/string#iterator-string_splitter) - Iterator to split text into segments
- [`transcribe()`](https://docs.pixeltable.com/sdk/latest/whisper#udf-transcribe) - Transcribe audio using Whisper
- [`video_splitter()`](https://docs.pixeltable.com/sdk/latest/video#iterator-video_splitter) - Iterator to split video into segments