# Video segments: frames, shots and scenes

_Video time segmentation_ is an important data preparation step that can help unlock the full potential of video content for analysis and automation. By breaking down a video into meaningful segments, you can better understand the structure and context of the content, enabling a wide range of applications such as:

* Identifying key events, scenes, or chapters within the video
* Inserting metadata like ad markers or chapter markers
* Reusing relevant clips or segments for new purposes
* Applying advanced analytics and foundation models to specific parts of the video


![Video file decomposed into frames, shots and scenes](./static/images/01-visual-segments.jpg) 

***Figure:** Video file decomposed into frames, shots (numbered), and scenes (colored)*


In this notebook, you'll explore techniques to decompose a video into smaller segments using visual cues. Specifically, you'll:

* Break down the video into frames, shots, and scenes using visual analysis.
* Generate composite images for video segments that can be used in prompts to Foundation Models (FMs) on Amazon Bedrock to generate insights about the content of video clips.
* Practice prompt engineering with a Foundation Model to understand video content.

By the end of this notebook, you'll have a collection of segmentations for your video that can serve as a foundation for further analysis, automation, and reuse of the video assets. 

The outputs of this notebook will be used in the use case sections later on in this workshop.  



<div class="alert alert-block alert-info">
💡 We'll be using some Python libraries to accomplish the segmentation tasks during this part of the workshop.  The libraries are in the <b>lib/</b> folder of this project if you would like to dive deep into the code.  We are not going into all the implementation details for this section because we want to leave more time for focusing on prompt engineering and solving use cases.
</div>

<div class="alert alert-block alert-info">
💡 Video segmentation can be done along the temporal dimension or along the spatial dimension.  In the context of this notebook, the term “segmentation” will always be <i>temporal (time) segmentation</i>.
</div>

### Key terms and definitions

You can refer back to this section if you want the definition of terms used in the notebook.

- **Frame** - frame image extracted from the video content
- **Frame sampling** - the selection of a subset of representative frames from a video
- **Shot** - a continuous sequence of frames between two edits or cuts that defines one action
- **Scene** - continuous sequence of action taking place in a specific location and time, consisting of a series of shots.
- **Frame accurate timestamp** - a timestamp that can be mapped to a specific frame.  Frame accurate timestamps are useful for synchronization of video elements
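
To make the idea of a frame accurate timestamp concrete, here is a minimal sketch of converting between timestamps and frame numbers. It assumes a constant frame rate, and the function names are hypothetical; they are not part of the workshop libraries.

```python
def timestamp_ms_to_frame_index(timestamp_ms, fps):
    """Return the index of the frame on screen at timestamp_ms.

    Assumes a constant frame rate and that frame 0 starts at 0 ms.
    """
    return int(timestamp_ms * fps // 1000)


def frame_index_to_timestamp_ms(frame_index, fps):
    """Return the time, in milliseconds, at which the given frame starts."""
    return frame_index * 1000 / fps


# At 24 frames per second, each frame lasts about 41.67 ms.
print(timestamp_ms_to_frame_index(1041, 24))
print(timestamp_ms_to_frame_index(1042, 24))
print(frame_index_to_timestamp_ms(48, 24))
```

A timestamp is frame accurate when it can be mapped back to exactly one frame in this way. Videos with variable frame rates need the per-frame timing stored in the container instead of this simple arithmetic.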

### Workflow

The purpose of this lab is to give hands-on practice working with the visual elements of video at the frame, shot, and scene level and practice prompt engineering with frame sequences that represent video clips.  We'll be working with AWS services from the SageMaker notebook throughout the activity.

![scene detection and contextualization workflow](./static/images/01-scenes-shots-workflow-w-ouputs-drawio.png)

<div class="alert alert-block alert-info">
💡 Click on the list icon in the left navigation panel in this Jupyter notebook to see the outline of the notebook and where you currently are.
</div>



## Prerequisites

### Import python packages

In [None]:
from pathlib import Path
import os
import json
import boto3
from termcolor import colored
from IPython.display import JSON
from IPython.display import Video
from IPython.display import Image as DisplayImage
from lib.frames import VideoFrames
from lib.shots import Shots
from lib.scenes import Scenes
from lib.transcript import Transcript
from lib import util
import requests
from tqdm import tqdm
from lib import frame_utils
from PIL import Image, ImageDraw, ImageFont
import time
from functools import cmp_to_key

### Retrieve saved values from previous notebooks

To run this notebook, you need to have run the previous notebook: [00_prerequisites.ipynb](./00-prequisites.ipynb), where you installed package dependencies and gathered some information from the SageMaker environment.

In [None]:
# Get variables from the previous notebook
%store -r

In [None]:
%store

### Download the sample video

In [None]:
def download_video(url: str, output_path: str) -> str:
    """
    Download test video if not already present
    """
    
    if os.path.exists(output_path):
        print(f"Video already exists at {output_path}")
        return output_path
    
    print("Downloading test video...")
    response = requests.get(url, stream=True)
    total_size = int(response.headers.get('content-length', 0))
    
    with open(output_path, 'wb') as file, tqdm(
        desc=output_path,
        total=total_size,
        unit='iB',
        unit_scale=True,
        unit_divisor=1024,
    ) as pbar:
        for data in response.iter_content(chunk_size=1024):
            size = file.write(data)
            pbar.update(size)
            
    return output_path

### Choose a sample video

* Meridian, 2016, Mystery from [Netflix](https://opencontent.netflix.com/) - This content is available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode)
* Sintel, 2010, Animation and Fantasy from [The Blender Foundation](https://durian.blender.org/) - This content is available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode)

<div class="alert alert-block alert-info">
We recommend you run the workshop with <b>Meridian</b> the first time through, as some visualization choices are made to work well with that content. 
</div>

In [None]:
video = {}

# Video Alternatives
# Drama
MERIDIAN='Netflix_Open_Content_Meridian.mp4'
SINTEL='Sintel2010-720p.mp4'

video["path"] = MERIDIAN
video["url"] = f"https://ws-assets-prod-iad-r-pdx-f3b3f9f1a7d6a3d0.s3.us-west-2.amazonaws.com/7db2455e-0fa6-4f6d-9973-84daccd6421f/{video['path']}"

video["output_dir"] = Path(video["path"]).stem

download_video(video["url"], video["path"])

Video(url=video["url"], width=640, height=360, html_attributes="controls muted autoplay")

# Sample frames 

In this section, you will sample one frame per second from the video and scale each frame to a maximum resolution of 392×220 pixels.  These settings were chosen through experimentation to balance visual quality against computational cost.

<div class="alert alert-block alert-info">
💡 392×220 pixel resolution is chosen to optimize the number of frames we can present to our chosen Foundation Model, Anthropic Claude 3 Sonnet, while still retaining the level of detail we need for our use cases.  Different use cases may use lower resolutions for lower cost or higher resolutions for higher quality.
</div>

<div class="alert alert-block alert-info">
💡 One frame per second sampling is a design choice that is suitable for the content used here, but can be adjusted for high-motion, high-frame-rate videos such as sports or more static video such as newsroom footage.
</div>
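
The two settings above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch assumes that a maximum resolution means scaling a frame down to fit inside the given box while keeping its aspect ratio. The workshop library may handle these details differently.

```python
def sample_timestamps_ms(duration_ms, sample_rate_fps):
    """List the timestamps (in ms) at which frames would be sampled."""
    interval_ms = 1000 / sample_rate_fps
    timestamps = []
    index = 0
    while index * interval_ms < duration_ms:
        timestamps.append(round(index * interval_ms))
        index += 1
    return timestamps


def fit_within(width, height, max_res):
    """Scale (width, height) down to fit inside max_res, keeping the aspect ratio."""
    max_width, max_height = max_res
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)


# A 3.5 second clip sampled at 1 frame per second
print(sample_timestamps_ms(3500, 1))

# A 1920x1080 source frame scaled to fit within 392x220
print(fit_within(1920, 1080, (392, 220)))
```

Raising the sampling rate increases the number of frames, and therefore the processing cost, in proportion, which is why high-motion content is more expensive to analyze at the same level of detail.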

### Extract frames from the video

You'll be using a Python package, [VideoFrames](./lib/frames.py), to work with the video at the frame level.  This package is available on GitHub.  When you call the main method of VideoFrames, it performs the following steps to prepare the video for frame-based analysis with machine learning:

1. Extract frames from the video, sampling at the specified frame rate, and store the resulting images in the folder `./<video name>/frames/`.
2. The resulting frame metadata contains the following attributes for each frame:

* **timestamp_millis** - the timestamp, in milliseconds, where the frame appears in the video.  We'll use this timestamp to relate video analysis results back to the video timeline.
* **image_file** — the location of the image in the `frames` folder.
* **id** - the unique frame id

Let's give it a try. 

⏳ Generating frames will take a few minutes to process.

In [None]:
# Number of frames to sample per second of video
FRAME_SAMPLING_RATE = 1

video["frames"] = VideoFrames(video["path"], session['bucket'], max_res=(392, 220), sample_rate_fps=FRAME_SAMPLING_RATE, force=False)

In [None]:
display(JSON(video["frames"].frames[0], root="first frame"))

### Visualize the extracted frames

Next, let's visualize the extracted frames of the video.  

<div class="alert alert-block alert-info">
🤔 Do you notice any visual patterns in the frames? Based on the frames, can you predict how many shots are in the video?
</div>

 


In [None]:
video['frames'].display_frames(start=0, end=len(video['frames'].frames))

# Detect shots

A shot is a continuous sequence of frames between two edits or cuts that defines one action.  Usually, a shot represents a single camera position, but sometimes, shots may contain camera movements such as panning or zooming.  Frames that belong to the same shot should be similar.  Therefore, one way to implement shot detection would be to use image embeddings such as Amazon [Titan Multimodal Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) to group similar frames into shots.  However, since we are operating on sampled frames, the accuracy of timestamps for the shots would be limited by our frame sampling rate.
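
As a rough illustration of the embedding-based approach described above, the following sketch starts a new shot whenever two adjacent sampled frames fall below a similarity threshold. The toy two-dimensional vectors stand in for real image embeddings, and the function names and the 0.8 threshold are assumptions made for this example only.

```python
import math


def toy_cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_frames_into_shots(embeddings, min_similarity=0.8):
    """Group consecutive frame indexes; a new shot starts when similarity drops."""
    if not embeddings:
        return []
    shots = [[0]]
    for i in range(1, len(embeddings)):
        if toy_cosine_similarity(embeddings[i - 1], embeddings[i]) >= min_similarity:
            shots[-1].append(i)
        else:
            shots.append([i])
    return shots


# Toy embeddings: frames 0-1 look alike, frames 2-4 look alike
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
print(group_frames_into_shots(toy_embeddings))  # [[0, 1], [2, 3, 4]]
```

With one sampled frame per second, a boundary found this way is only known to within about a second, which is the timestamp accuracy limitation noted above.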

For use cases like ad insertion, editing and search, the ideal is to use **frame accurate** timestamps that identify the exact frame where the shots begin and end.  [Amazon Rekognition's Segment API](https://docs.aws.amazon.com/rekognition/latest/dg/segments.html) is a video analysis service that automatically detects technical cues and shot boundaries in video content and provides frame-accurate timestamps for each shot boundary.

Use the [Shots](lib/shots.py) Python library to generate shots using Amazon Rekognition's Shot Segment API and map the shots to the frames we sampled in the previous step.  We'll use this result to group shots into larger scenes.
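
Mapping shots to sampled frames can be pictured as a simple timestamp filter. The sketch below is not the code in `lib/shots.py`; it assumes each frame has the `id` and `timestamp_millis` attributes described earlier, and the shot boundaries are made-up values.

```python
def frames_in_shot(frames, start_ms, end_ms):
    """Return the ids of sampled frames whose timestamps fall inside the shot."""
    return [
        frame["id"]
        for frame in frames
        if start_ms <= frame["timestamp_millis"] <= end_ms
    ]


# Six frames sampled at one frame per second
sampled_frames = [{"id": i, "timestamp_millis": i * 1000} for i in range(6)]

# Hypothetical frame-accurate shot boundaries in milliseconds
print(frames_in_shot(sampled_frames, 0, 2480))     # [0, 1, 2]
print(frames_in_shot(sampled_frames, 2520, 5000))  # [3, 4, 5]
print(frames_in_shot(sampled_frames, 2100, 2900))  # []
```

The last call shows one consequence of sampling: a shot shorter than the sampling interval may contain no sampled frames at all, so a real implementation needs a rule for that case.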

## Shots library

<div class="alert alert-block alert-info">
<b>Note</b>: The code for shot detection is located in the <b>lib</b> folder of this project if you want to dive deeper, but the objective of this exercise is to understand the concept of a shot.    
</div>

Click the link to open [lib/shots.py](lib/shots.py)

## Run shot detection

In [None]:
video["shots"] = Shots(video["frames"], method="RekognitionShots")

print(f"Number of shots: {len(video['shots'].shots)} from {len(video['frames'].frames)} frames")

Take a moment to look at the metadata for a sample shot from the results.  Each shot contains:

* **method** - the method used to group frames into shots.  Possible values are `SimilarFrames` or `RekognitionShots`
* **start_ms** - the starting timestamp of the shot
* **end_ms** - the ending timestamp of the shot
* **duration_ms** - the duration of the shot
* **video_asset_dir** - the location of the metadata collected for this video
* **start_frame_id** - the frame the shot begins with
* **end_frame_id** - the frame the shot ends with
* **composite_images** - a series of equal sized frame grids containing the frames for the shots.  Composite images can be used as inputs to multi-modal foundation models to generate insights about the shot.
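
To picture how composite images are laid out, here is a small sketch that splits a shot's frame ids into equal-sized grids. The grid dimensions and the function name are hypothetical, and the actual layout produced by the workshop library may differ.

```python
def plan_composite_grids(frame_ids, rows, cols):
    """Split frame ids into grids of at most rows x cols frames each."""
    per_image = rows * cols
    grids = []
    for start in range(0, len(frame_ids), per_image):
        chunk = frame_ids[start:start + per_image]
        # Break the chunk into rows of `cols` frames; the last row may be shorter.
        grids.append([chunk[i:i + cols] for i in range(0, len(chunk), cols)])
    return grids


# A shot with 10 sampled frames laid out on 2x3 grids needs two composite images
for grid in plan_composite_grids(list(range(10)), rows=2, cols=3):
    print(grid)
```

A long shot therefore produces several composite images, each of which can be passed to a multi-modal model as a separate image.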

Show metadata for shot 1

In [None]:
display(JSON(video["shots"].shots[1]))

The `Shots` method creates a set of composite images consisting of the frames in each shot.  We will be using these composite images later on as inputs to Anthropic Claude 3 Sonnet on Amazon Bedrock to generate inferences to understand what is happening in the shots.  For now, you can examine the resulting composite images to visualize the shots. 

Show the composite images for shot 1. 

In [None]:
shot = video["shots"].shots[1]
for idx, composite_image in enumerate(shot['composite_images']):
    
    print (f'\nShot {shot["id"] } Composite image file { idx+1 } of { len(shot["composite_images"]) }: { composite_image["file"] }\n')
    display(DisplayImage(filename=composite_image['file']))


Before we move on to scene detection, let's take a look at all of the generated shots.

<div class="alert alert-block alert-info">
💡 Use the scroll bar in the output box to view the shots. 
</div>

In [None]:
# visualize the shots
for counter, shot in enumerate(video["shots"].shots):
    print(f'\nSHOT {counter}: frames {shot["start_frame_id"] } to {shot["end_frame_id"] } =======\n')
    video['frames'].display_frames(start=shot["start_frame_id"], end=shot["end_frame_id"]+1)
    
    # ALTERNATIVE: Display composite images for each shot
    #for image_file in shot['composite_images']:
    #    display(DisplayImage(filename=image_file['file'], height=75))
    

<div class="alert alert-block alert-info">
💡 Use the scroll bar in the output box to view the shots in the cell above.  You can also drag the lower right corner to increase the size of the output cell.  
</div>

🤔 As you look through the lists of shots, do you notice any segments that have unexpected results?  If you do, you may have discovered some tricky situations for automatic video segmentation.  These include:
* Motion of the subject including rolling credits, cars, etc.
* Motion of the camera in the form of panning shots and zooming shots
* Fades and other transition effects


Before we move on to scene detection, let's play a couple of adjacent shots and observe how they will look as a video clip.  Play `shot[12]`.  

In [None]:
start = video['shots'].shots[12]['start_ms']/1000
end = video['shots'].shots[12]['end_ms']/1000
shot_url = f'{video["url"]}#t={start},{end}'
Video(url=shot_url, width=640, height=360)

Play `shot[13]`

In [None]:
start = video['shots'].shots[13]['start_ms']/1000
end = video['shots'].shots[13]['end_ms']/1000
shot_url = f'{video["url"]}#t={start},{end}'
Video(url=shot_url, width=640, height=360)
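
The `#t=start,end` suffix used in the cells above is a temporal media fragment: it asks the player to play only the part of the video between the two offsets, given in seconds. A small helper such as the following hypothetical `clip_url` function keeps the millisecond-to-second conversion in one place.

```python
def clip_url(video_url, start_ms, end_ms):
    """Build a URL that plays only the segment between start_ms and end_ms."""
    return f"{video_url}#t={start_ms / 1000},{end_ms / 1000}"


print(clip_url("https://example.com/video.mp4", 12500, 15000))
```

The URL in this example is a placeholder. In the notebook you would pass `video["url"]` together with the `start_ms` and `end_ms` values of a shot.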

🤔 If you are running the workshop with Meridian, these shots show two people in the same room having a conversation that spans the shots.  Because these two shots are in the same setting, they belong together in a higher-level grouping.  In the next section, we'll group shots together based on the visual information to get a more holistic view of the scene.

# Detect scenes


Even after identifying individual camera shots, there may still be semantically similar shots depicting the same setting. To further cluster these into distinct scenes, we expand frame comparison beyond adjacent frames. By looking at similar frames across an expanded time window, we can identify shots that are likely part of the same contiguous scene. We calculate pairwise similarity scores between all frames within a given time window. Frames with similarity scores above a certain threshold are considered part of the same scene group. This process is performed recursively across all frames in a shot. 

<div class="alert alert-block alert-info">
💡 The time window size and similarity threshold are parameters that can significantly impact the accuracy of scene boundary detection. In our example, a 30 second time window and 0.80 similarity threshold gave the best scene clustering results across our video samples, but this can be adjusted.
</div>

We accomplish scene grouping by first indexing all video frames using Amazon Titan Multimodal Embeddings (TME) and storing the embeddings, along with their shot information and timestamps, in a vector database, as illustrated in the following figure.

![shots-to-scenes.png](./static/images/01-vectorization.jpg)

We then perform a recursive similarity search against this indexed frame corpus. For each frame, we find all other frames within the configured time window, in both directions, whose similarity score exceeds the configured threshold based on their vector representations. The shot information for these highly similar frames is recorded. This process iterates over all frames in a shot to compile contextually similar shots, and then repeats across all shots. The compiled results look like this example:

    shot 1 –> 2, 3, 4
    shot 2 –> 1, 3
    shot 3 –> 2, 4, 5
    shot 7 –> 8, 9

Finally, we run a reduction process to group shots that are mutually identified as highly similar into distinct scene groups as follows:

    shot 1, 2, 3, 4, 5 → scene 1
    shot 7, 8, 9 → scene 2
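
One way to sketch this reduction step is to treat the compiled results as a graph and collect its connected components with a small union-find structure. This is an illustrative assumption about how the grouping could work; the `Scenes` library may use a different method, and the function name is hypothetical.

```python
def group_shots_into_scenes(similar_shots):
    """Merge shots that are linked, directly or indirectly, into scene groups."""
    parent = {}

    def find(shot):
        parent.setdefault(shot, shot)
        while parent[shot] != shot:
            parent[shot] = parent[parent[shot]]  # path compression
            shot = parent[shot]
        return shot

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the earliest shot as the representative of the group
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for shot, neighbors in similar_shots.items():
        find(shot)
        for neighbor in neighbors:
            union(shot, neighbor)

    groups = {}
    for shot in list(parent):
        groups.setdefault(find(shot), []).append(shot)
    return sorted(sorted(group) for group in groups.values())


# The compiled results from the example above
similar_shots = {1: [2, 3, 4], 2: [1, 3], 3: [2, 4, 5], 7: [8, 9]}
print(group_shots_into_scenes(similar_shots))  # [[1, 2, 3, 4, 5], [7, 8, 9]]
```

Shot 5 joins the first scene even though it was only matched by shot 3, because membership is transitive in this sketch. A shot with no similar neighbors, such as a hypothetical shot 6, would form a scene on its own.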

This allows us to segment the initially detected shot boundaries into higher-level semantic scene boundaries based on visual and temporal coherence. The end-to-end process is illustrated in the following diagram.

### Create frame embeddings

Image embeddings are numerical representations (vectors) of images that capture their essential features and characteristics.  These embeddings make it possible to perform mathematical operations on images and compare them in ways that align with human visual perception.  

We'll create an image embedding for each frame from our video using [Amazon Titan Multimodal Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) in Amazon Bedrock.  We'll be using helper functions in the [lib/frames.py](./lib/frames.py) to accomplish this task.  

The calculated frame embeddings will be added to each frame in the `VideoFrames` object that is stored in the `video` variable.

Calling the method `make_titan_multimodal_embeddings()` from the `VideoFrames` class will create frame embeddings and store them with the metadata for each frame.  



<div class="alert alert-block alert-info">
If you get an <b>AccessDenied</b> error at this point, make sure you completed the step to enable model access for Amazon Titan Multimodal Embeddings and Anthropic Claude 3 Sonnet in the Amazon Bedrock console.
</div>

<div class="alert alert-block alert-info">
⏳ Generating embeddings for our sample videos should take 2-5 minutes.  To speed things up, we will load precomputed embeddings.  
</div>

In [None]:
# The workshop FASTPATH setting uses precomputed embeddings for the video; set FASTPATH = False to regenerate embeddings
FASTPATH = True
if FASTPATH:
    video['frames'].load_titan_multimodal_embeddings()
else:
    video['frames'].make_titan_multimodal_embeddings()

Use the next cell to print the metadata for the first frame and examine the `titan_multimodal_embedding` attribute.   It's a large vector that encodes the content of the frame in the vector space for the `amazon.titan-embed-image-v1` version of the Titan Multimodal Embeddings model.  When we compare this vector with other vectors encoded using the same model version, we can determine if they are similar.

In [None]:
display(JSON(video["frames"].frames[0], root="first frame"))

Next, let's do some comparisons of the first few frames using embeddings.  First, print a few of the sampled frames.

In [None]:
video['frames'].display_frames(start=0, end=4)

In order to compare frames, we need a way to compare embeddings.  We'll implement a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function using the Python numpy package.

In [None]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    cos_sim = dot(a, b) / (norm(a) * norm(b))
    return cos_sim

## Next, let's test comparing some frames.

Compare the first frame (index 0), which is black, to the third frame (index 2), which is a view of a city street.  As expected, the similarity score is low as these frames are not very similar.

In [None]:
frms = video['frames'].frames
cosine_similarity(frms[0]['titan_multimodal_embedding'], frms[2]['titan_multimodal_embedding'])

Now, compare the second frame to the third frame.  The similarity score should be higher, since the main difference in these frames is the lettering with the words "Los Angeles 1947".

In [None]:
cosine_similarity(frms[1]['titan_multimodal_embedding'], frms[2]['titan_multimodal_embedding'])

### Populate a FAISS vector store

We will be using a local FAISS vector store so we can use a single search command to find all the frames that are similar to a particular frame.  Our search function will use the same cosine similarity method we explored in the previous section.  There are a number of different databases on AWS that can be used as a vector store.  One popular choice is [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/). 

In [None]:
video['frames'].make_vector_store()

### Test similarity search using the vector store

Let's use the vector store to find similar frames for the second frame in the video.  This is the first frame of the sequence that displays the words "Los Angeles 1947".  Based on inspection of the frames, we should get 3 _adjacent_ similar frames, but there are also several similar frames that are not adjacent.   

Our similarity search function uses a [cosine similarity function](https://en.wikipedia.org/wiki/Cosine_similarity) to determine the [K nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) in the vector space. There are two parameters that you can adjust to tune the result of the similarity search:

* MIN_SIMILARITY is the similarity threshold.  
* TIME_RANGE is the maximum time range to compare frames to from the input frame

You can try different values of these parameters to get a feel for how the results change.  Frames 0-19 are displayed after the search results to help you visualize them.  Here are some good values to try:

* MIN_SIMILARITY = .85, TIME_RANGE = 30
* MIN_SIMILARITY = .70, TIME_RANGE = 30
* MIN_SIMILARITY = .80, TIME_RANGE = 3
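
To see how the two parameters interact, here is a brute-force sketch of a filtered similarity search over toy data. It is not the FAISS-backed `search_similarity` method. It assumes that the time range is given in seconds and that each frame's `id` equals its position in the list, and the `embedding` key and function names are hypothetical.

```python
import math


def toy_cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search_similar_frames(frames, frame_id, min_similarity, time_range_s):
    """Return ids of frames near frame_id in time and similar in content."""
    query = frames[frame_id]
    matches = []
    for frame in frames:
        if frame["id"] == frame_id:
            continue
        gap_ms = abs(frame["timestamp_millis"] - query["timestamp_millis"])
        if gap_ms > time_range_s * 1000:
            continue  # outside the time window
        score = toy_cosine(query["embedding"], frame["embedding"])
        if score >= min_similarity:
            matches.append((score, frame["id"]))
    # Most similar frames first
    return [frame_id for _, frame_id in sorted(matches, reverse=True)]


toy_frames = [
    {"id": 0, "timestamp_millis": 0, "embedding": [1.0, 0.0]},
    {"id": 1, "timestamp_millis": 1000, "embedding": [0.0, 1.0]},
    {"id": 2, "timestamp_millis": 2000, "embedding": [0.1, 1.0]},
    {"id": 3, "timestamp_millis": 40000, "embedding": [0.0, 1.0]},
]

print(search_similar_frames(toy_frames, 1, min_similarity=0.8, time_range_s=30))  # [2]
print(search_similar_frames(toy_frames, 1, min_similarity=0.8, time_range_s=60))  # [3, 2]
```

Frame 3 is identical to the query frame but is only returned once the time window is wide enough to reach it, which is how the time range keeps visually similar but distant shots out of the same scene.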

In [None]:
MIN_SIMILARITY = .80
TIME_RANGE = 30
FRAME_ID = 1
video['frames'].search_similarity(FRAME_ID, min_similarity = MIN_SIMILARITY, time_range = TIME_RANGE)

In [None]:
video['frames'].display_frames(start=0, end=20)

### Scenes library

<div class="alert alert-block alert-info">
<b>Note</b>: The code for scene detection is located in the <b>lib</b> folder of this project if you want to dive deeper, but the objective of this exercise is to understand the concept of a scene.    
</div>

Click the link to open [lib/scenes.py](lib/scenes.py)


### Execute the Scene detection

Now let's apply this similarity search to all the frames across shots to find similar shots.  If shots are similar within the TIME_RANGE, then they will be grouped into the same scene.


In [None]:
MIN_SIMILARITY = .90

TIME_RANGE = 30

video['scenes'] = Scenes(video['frames'], video['shots'].shots, MIN_SIMILARITY, TIME_RANGE)

### Visualize the scenes

Now let's visualize some scenes using the generated composite images. Note that some scenes will have more than one composite image.


🤔 Do you think the scenes grouped the shots in a way that makes sense?  Is there anything you would want to change?

In [None]:
# visualize the scenes
for counter, scene in enumerate(video["scenes"].scenes):
    print(f'\nScene {counter}: frames {scene["start_frame_id"] } to {scene["end_frame_id"] } =======\n')
    video['frames'].display_frames(start=scene["start_frame_id"], end=scene["end_frame_id"]+1)

<div class="alert alert-block alert-info">
💡 Make sure to use the scroll bar in the output box above to view the scenes.  You can also drag the lower right corner to increase the size of the output cell.  
</div>


Finally, let's play a couple of adjacent scenes and observe how they will look as video clips. As you play the video segments, pay attention to the transition of the _video and the audio_ from one scene to the next.  Because the scenes are created based on only the visual information, the audio can get cut off if you clip on a scene boundary.  In the next notebook, we'll look at the audio segmentation of videos. 

In [None]:
start_scene = 3
start = video['scenes'].scenes[start_scene]['start_ms']/1000
end = video['scenes'].scenes[start_scene]['end_ms']/1000
print(f"scene { start_scene } duration: {video['scenes'].scenes[start_scene]['duration_ms']/1000} seconds\n")
print(f"start time: { start } end time: {end} seconds\n")
shot_url = f'{video["url"]}#t={start},{end}'
Video(url=shot_url, width=640, height=360)

Play the next scene.

In [None]:
start_scene = start_scene + 1
start = video['scenes'].scenes[start_scene]['start_ms']/1000
end = video['scenes'].scenes[start_scene]['end_ms']/1000
print(f"scene { start_scene } duration: {video['scenes'].scenes[start_scene]['duration_ms']/1000} seconds\n")
print(f"start time: { start } end time: {end} seconds\n")
shot_url = f'{video["url"]}#t={start},{end}'
Video(url=shot_url, width=640, height=360)


🤔 As you play adjacent scenes, you may notice scene changes that fall on a visual cue but interrupt the audio.  In Meridian, for example, the change between scenes 9 and 10 happens when the focus of the video turns towards the beach and the ocean.  _However_, this scene change occurs in the middle of the audio of the police dispatcher speaking over the radio.  If we want to identify clean breakpoints in the video to make clips, we should probably consider not only the visual content but the audio as well. We'll explore this topic in the next part of the workshop [01B Combining Audio and Video Segments](./01B-combining-audio-and-video.ipynb).  But, before we move on, let's do our first exploration of prompt engineering with video segments.

# Generate contextual metadata for video segments with Amazon Bedrock

![Prompt engineering with frame sequences](./static/images/01-prompt-engineering-with-frame-sequences.jpg)
TBD

# What's next?

### Store the video metadata so it can be used in the rest of the workshop.

In [None]:
%store video

### Continue to the next section of the workshop

In the next sections of the workshop, you will use Amazon Transcribe to generate a transcript from the speech within the video.  The transcript provides additional information that can be used to find segments based on the context of the video narrative.  The transcript can also be used to provide more input context in prompts for video clips.

Go to the next notebook [Audio Segments](./01B-audio-segments.ipynb).