In [None]:
# Get variables from the previous notebook
%store -r

## Shot Detection
As you may recall, a shot is a continuous sequence of frames between two edits or cuts that defines one action.  Usually, a shot represents a single camera position, but sometimes a shot may contain camera movements such as panning or zooming.  Based on this definition, frames that belong to the same shot should look similar to one another.


### Create frame embeddings

To automatically determine where the shots begin and end in the video, we need a way to group similar frames.  In this step, we'll create an embedding for each frame using Amazon Titan Multimodal Embeddings.  Embeddings encode images, text, or both into the same vector space, so the encoded items can be clustered by similarity.  With embeddings you can implement use cases such as image search, natural language search for images, and multimodal RAG.  We'll be using helper functions in [lib/frames.py](./lib/frames.py) to accomplish this task.

The calculated frame embeddings will be added to each frame in the `Frames` object that is stored in the `video` variable.

Calling the `make_titan_multimodal_embeddings()` method of the `Frames` class will create frame embeddings and store them with the metadata for each frame.
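
The helper's implementation is not shown in this notebook.  As a rough sketch of what a call to the Titan Multimodal Embeddings model through the Amazon Bedrock runtime API could look like, consider the following.  The function names `build_titan_request` and `embed_image` are hypothetical, and `bedrock_runtime` is assumed to be a boto3 `bedrock-runtime` client that you create yourself; the workshop library may do this differently.

```python
import base64
import json

# Model ID mentioned in this notebook for Titan Multimodal Embeddings.
MODEL_ID = "amazon.titan-embed-image-v1"


def build_titan_request(image_bytes, output_length=1024):
    """Build the JSON request body for embedding a single image (hypothetical helper)."""
    return json.dumps({
        # The model expects the image as a base64-encoded string.
        "inputImage": base64.b64encode(image_bytes).decode("utf-8"),
        "embeddingConfig": {"outputEmbeddingLength": output_length},
    })


def embed_image(bedrock_runtime, image_bytes):
    """Return the embedding vector for one image.

    Assumption: `bedrock_runtime` is a boto3 "bedrock-runtime" client.
    """
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=build_titan_request(image_bytes),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing JSON with an "embedding" list.
    return json.loads(response["body"].read())["embedding"]
```

Passing the client in as an argument keeps the sketch free of credentials and region settings, which depend on your own environment.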

⏳ Generating embeddings should take 2-5 minutes.

<div class="alert alert-block alert-info">
If you get an <b>AccessDenied</b> error at this point, make sure you completed the step to enable model access for Amazon Titan Multimodal Embeddings and Anthropic Claude 3 Sonnet in the Amazon Bedrock console.
</div>

In [None]:
video['frames'].make_titan_multimodal_embeddings()

Use the next cell to print the metadata for the first frame and examine the `titan_multimodal_embedding` attribute.   It's a large vector that encodes the content of the frame in the vector space for the `amazon.titan-embed-image-v1` version of the Titan Multimodal Embeddings model.  When we compare this vector with other vectors encoded using the same model version, we can determine if they are similar.

In [None]:
from IPython.display import JSON, display

display(JSON(video["frames"].frames[0], root="first frame"))

In [None]:
video['frames'].display_frames(start=0, end=4)

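To compare frames, we need a way to compare their embeddings.  We'll implement a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function using the Python numpy package.  Cosine similarity measures the angle between two vectors: values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated.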

In [None]:
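from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    # Dot product divided by the product of the vector lengths.
    cos_sim = dot(a, b) / (norm(a) * norm(b))
    return cos_sim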

Next, let's test comparing some frames.
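
Before looking at real frames, a quick check on toy vectors can help build intuition for the range of scores.  This is a minimal sketch that restates the same cosine similarity formula so the snippet runs on its own; the example vectors are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Vectors pointing in the same direction score close to 1, regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Vectors pointing in opposite directions score -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, two embeddings can be compared without normalizing them first.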

Compare the first frame, which is black, to the second frame, which is a view of a city street.  As expected, the similarity score is low because these frames are not very similar.

In [None]:
frms = video['frames'].frames
cosine_similarity(frms[0]['titan_multimodal_embedding'], frms[1]['titan_multimodal_embedding'])

Now, compare the second frame to the third frame.  The similarity score should be higher, since the main difference in these frames is the lettering with the words "Los Angeles 1947".

In [None]:
cosine_similarity(frms[1]['titan_multimodal_embedding'], frms[2]['titan_multimodal_embedding'])

In [None]:
cosine_similarity(frms[2]['titan_multimodal_embedding'], frms[3]['titan_multimodal_embedding'])

### Detect shots using Titan Multimodal Embeddings

In this next activity, we'll use the frame embeddings to group similar frames into shots using a pairwise cosine similarity comparison of the embeddings for adjacent frames.  We'll apply the same process we used for the first few frames to all the frames.  We'll choose a `SIMILARITY_THRESHOLD` to decide whether adjacent frames are part of the same shot or belong to different shots.


Execute the code to detect shots in the video.  Based on testing, 0.85 is a good threshold to use for grouping frames for this type of content, but you may need to adjust it to adapt to other situations.  For example, content with more or less action may require different thresholds.  You can try setting it lower or higher to experiment with the results.
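
The `Shots` class used below comes from the workshop library and its internals are not shown here.  As a minimal sketch of the general idea described above, adjacent-frame grouping could look like the following.  The function names `cosine` and `group_into_shots` are hypothetical, the snippet uses only the standard library so it runs on its own, and treating a score equal to the threshold as "same shot" is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (standard library only)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def group_into_shots(embeddings, min_similarity=0.85):
    """Group frame indices into shots by comparing each frame to the previous one."""
    if not embeddings:
        return []
    shots = [[0]]  # the first frame always starts the first shot
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= min_similarity:
            shots[-1].append(i)  # similar enough: same shot continues
        else:
            shots.append([i])    # similarity dropped: a new shot begins
    return shots


# Toy embeddings: two frames pointing one way, then two pointing another way.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(group_into_shots(toy_embeddings))  # [[0, 1], [2, 3]]
```

Raising the threshold splits the video into more, shorter shots, while lowering it merges more frames into each shot.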

In [None]:
SIMILARITY_THRESHOLD = 0.85

video["shots"] = Shots(video["frames"], method="SimilarFrames", min_similarity=SIMILARITY_THRESHOLD)

print(f"Number of shots: {len(video['shots'].shots)} from {len(video['frames'].frames)} frames")