# Workshop: Scene level contextual understanding of media content using Generative AI

## Notebook configuration
- Image: Data Science 3.0
- Instance Type: ml.m5.2xlarge (Recommended)
- Python version: 3.10

## Rundown of the workshop
1. Prerequisite
    - Import helper modules
    - Installing Python packages (opencv, faiss-cpu, webvtt-py, termcolor)
    - Downloading sample video (Netflix Open Content - Meridian)
2. Generating "chapter points" based on topic changes in conversation
    - Uploading the sample video to Amazon S3 bucket
    - Converting speech to text with Amazon Transcribe
    - Using Anthropic Claude 3 Haiku to analyse the conversation
    - Validating "chapter points" timestamps
3. Creating visual scenes from sample video
    - Frame extraction
    - Grouping frames into shots with Amazon Titan Multimodal Embedding
    - Grouping shots into scenes with vector store
4. Generating chapter level contextual information
    - Grouping scenes into chapters with both visual and audio analysis
    - Contextual understandings with Anthropic Claude 3

## Install Python packages

- opencv for video and image processing
- faiss for vector store
- webvtt-py for parsing subtitle file
- termcolor for formatting output

In [None]:
!pip install opencv-python termcolor faiss-cpu webvtt-py 
# !pip install torch torchvision

In [None]:
from IPython.display import JSON
from IPython.display import Video
from IPython.display import Pretty
import json
import boto3
import sagemaker
from termcolor import colored
import inspect
import time
import cv2
import datetime
from pathlib import Path
import glob
import os
from functools import cmp_to_key


from lib import transcribe_helper as trh
from lib import s3_helper as s3h
from lib import chapters as chpt
from lib import util
from lib import embeddings
from lib import frames

## Download the sample video, Meridian, from Netflix

The open source content is available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode)

In [None]:

mp4_file = 'Netflix_Open_Content_Meridian.mp4'
url = f"https://dx2y1cac29mt3.cloudfront.net/mp4/netflix/{mp4_file}"

!curl {url} -o {mp4_file}

Video(mp4_file, width=640, height=360)

## Generate chapter segments based on the dialog in the video

### Get SageMaker default resources

In [None]:

sagemaker_resources = {}
sagemaker_resources["session"] = sagemaker.Session()
sagemaker_resources["bucket"] = sagemaker_resources["session"].default_bucket()
sagemaker_resources["role"] = sagemaker.get_execution_role()
sagemaker_resources["region"] = sagemaker_resources["session"].boto_region_name

print(sagemaker_resources)

## Upload the sample video to the default Amazon S3 bucket for this notebook

In [None]:

response = s3h.upload_object(sagemaker_resources["bucket"], "contextual_ad", mp4_file) 


## Use Amazon Transcribe to convert speech to text

This section uses Amazon Transcribe to convert the speech to text and generate a WebVTT output.

If you get an `AccessDeniedException`, open the AWS IAM console, find the SageMaker execution IAM role, and attach the following managed policies:
- AmazonTranscribeFullAccess
- AmazonRekognitionFullAccess
- AmazonBedrockFullAccess

Also review the [Amazon Transcribe Pricing](https://aws.amazon.com/transcribe/pricing/) page for rates in the us-east-1 region.

In [None]:
print(f'SageMaker execution IAM Role ARN: {sagemaker_resources["role"]}')

### Start the transcription job and wait for it to complete.

In [None]:
# start transcription job

def transcribe(bucket, path, file, media_format="mp4", language_code="en-US", verbose=True):

    # start transcription job
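    # assumes the helper module exposes start_transcription_job alongside wait_for_transcription_job
    transcribe_response = trh.start_transcription_job(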
        bucket, 
        path,
        file, media_format, language_code)

    # wait for completion
    transcribe_response = trh.wait_for_transcription_job(
        transcribe_response['TranscriptionJob']['TranscriptionJobName'], 
        verbose)

    return transcribe_response

transcribe_response = trh.transcribe(sagemaker_resources["bucket"], "contextual_ad", mp4_file)

### Examine the results from Amazon Transcribe

The response from Amazon Transcribe contains a `results` dictionary. It holds a text-only `transcript` and a collection of `items`, one for each word and punctuation mark in the transcript, along with a confidence score and timestamp for the item. The response also includes the same transcript formatted as subtitles in either WebVTT or SRT format. Let's take a look at these outputs.

We will be using the WebVTT output for our workflow.
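
To make the structure concrete, here is a minimal sketch that walks a hand-written, heavily trimmed stand-in for a Transcribe result. The sample values are invented for illustration, and the field names (`transcripts`, `items`, `alternatives`, `start_time`, `end_time`) reflect the usual Transcribe output layout as far as I know; verify them against the JSON file downloaded in this section. `list_words` is a hypothetical helper, not part of the workshop's `lib` modules.

```python
import json

# Hand-written, trimmed stand-in for a Transcribe result (values are invented).
sample = json.loads("""
{
  "results": {
    "transcripts": [{"transcript": "Hello there."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.50", "end_time": "0.90",
       "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
      {"type": "pronunciation", "start_time": "0.90", "end_time": "1.30",
       "alternatives": [{"confidence": "0.97", "content": "there"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
""")


def list_words(transcribe_result):
    """Return (word, start_s, end_s, confidence) for each spoken word."""
    words = []
    for item in transcribe_result["results"]["items"]:
        # Punctuation items carry no timestamps in this sample, so skip them.
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]
        words.append((
            best["content"],
            float(item["start_time"]),
            float(item["end_time"]),
            float(best["confidence"]),
        ))
    return words


words = list_words(sample)
print(sample["results"]["transcripts"][0]["transcript"])
for word in words:
    print(word)
```

The per-word items are useful when word-level timing is needed; the rest of this workflow relies on the cue-level timing in the WebVTT output instead.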

**Transcript**

In [None]:
transcript_filename = trh.download_transcript(transcribe_response)

JSON(filename=transcript_filename)

**WebVTT Subtitles**

In [None]:
vtt_filename = trh.download_vtt(transcribe_response)

!head {vtt_filename}


### Estimate the cost of the transcription job

In [None]:

transcribe_cost = trh.display_transcription_cost(mp4_file)


In [None]:
print(inspect.getsource(trh.display_transcription_cost))

### Use Amazon Bedrock to generate chapters from the subtitles
This section demonstrates using an LLM to break down the conversation based on topic changes. It uses the Anthropic Claude 3 Haiku model via the Amazon Bedrock service.


In [None]:
# run LLM model
import json
from urllib.request import urlretrieve
import boto3
from termcolor import colored

def make_conversation_example():
    example = {
        'chapters': [
            {
                'start': '00:00:10.000',
                'end': '00:00:32.000',
                'reason': 'It appears the chapter talks about...'
            }
        ]
    }

    return {
        'role': 'user',
        'content': 'JSON format. An example of the output:\n{0}\n'.format(json.dumps(example))
    }

def make_transcript(transcript_file):
    with open(transcript_file) as f:
        transcript = f.read()
    
    return {
        'role': 'user',
        'content': 'Here is the transcripts in <transcript> tag:\n<transcript>{0}\n</transcript>\n'.format(transcript)
    }


def analyze_conversations(transcript_file):
    messages = []

    # transcript
    transcript_message = make_transcript(transcript_file)
    messages.append(transcript_message)

    # output format?
    messages.append({
        'role': 'assistant',
        'content': 'Got the transcript. What output format?'
    })

    # example output
    example_message = make_conversation_example()
    messages.append(example_message)

    # prefill output
    messages.append({
        'role': 'assistant',
        'content': '{'
    })

    # model parameters
    model_id = 'anthropic.claude-3-haiku-20240307-v1:0'
    model_ver = 'bedrock-2023-05-31'
    accept = 'application/json'
    content_type = 'application/json'

    ## system prompt to role play
    system = 'You are a media operation assistant who analyses movie transcripts in WebVTT format and suggest chapter points based on the topic changes in the conversations. It is important to read the entire transcripts.'

    ## setting up the model params
    model_params = {
        'anthropic_version': model_ver,
        'max_tokens': 4096,
        'temperature': 0.1,
        'top_p': 0.7,
        'top_k': 20,
        'stop_sequences': ['\n\nHuman:'],
        'system': system,
        'messages': messages
    }

    bedrock_runtime_client = boto3.client(service_name='bedrock-runtime')

    response = bedrock_runtime_client.invoke_model(
        body=json.dumps(model_params),
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    response_body = json.loads(response.get('body').read())

    # the response continues after the prefilled '{', so restore it if missing, then parse
    response_content = response_body['content'][0]['text']
    if not response_content.startswith('{'):
        response_content = '{' + response_content
    response_content = json.loads(response_content)
    response_body['content'][0]['json'] = response_content

    return response_body

def display_conversation_cost(response):
    # us-east-1 pricing
    input_per_1k = 0.00025
    output_per_1k = 0.00125

    input_tokens = response['usage']['input_tokens']
    output_tokens = response['usage']['output_tokens']

    conversation_cost = (
        input_per_1k * input_tokens +
        output_per_1k * output_tokens
    ) / 1000

    print('\n')
    print('========================================================================')
    print('Estimated cost:', colored(f"${conversation_cost}", 'green'), f"in us-east-1 region with {colored(input_tokens, 'green')} input tokens and {colored(output_tokens, 'green')} output tokens.")
    print('========================================================================')

    return {
        'input_per_1k': input_per_1k,
        'output_per_1k': output_per_1k,
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
        'estimated_cost': conversation_cost,
    }


conversation_response = analyze_conversations(vtt_filename)

# show the conversation cost
conversation_cost = display_conversation_cost(conversation_response)

### Let's take a look at the conversations that were generated from the transcript 

In [None]:
conversations = conversation_response['content'][0]['json']

JSON(conversations)

### Generating "chapter points" #4: Validating "chapter points" timestamps

While an LLM can help break down the conversations, we need to ensure that the timestamps it generates are valid. This can be done by matching the chapter boundaries against the timestamps of the dialogue cues in the original WebVTT file.
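
Matching boundaries is easier on integer milliseconds than on `HH:MM:SS.mmm` strings. The validation code in this section reads `start_ms` and `end_ms` keys and calls a `to_hhmmssms` helper whose source is not shown in this notebook. The following is a minimal sketch of such conversions, assuming the WebVTT-style `HH:MM:SS.mmm` format used in the prompt example; `to_milliseconds` is a hypothetical name, and the `to_hhmmssms` body here is my own illustration, not the workshop library's implementation.

```python
def to_milliseconds(timestamp):
    """Convert 'HH:MM:SS.mmm' to integer milliseconds."""
    hh, mm, rest = timestamp.split(":")
    ss, _, fraction = rest.partition(".")
    # Pad or trim the fractional part to exactly three digits.
    fraction = (fraction + "000")[:3]
    return ((int(hh) * 60 + int(mm)) * 60 + int(ss)) * 1000 + int(fraction)


def to_hhmmssms(milliseconds):
    """Convert integer milliseconds back to 'HH:MM:SS.mmm'."""
    seconds, ms = divmod(int(milliseconds), 1000)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{ms:03d}"


# A chapter shaped like the example given to the model in the prompt.
chapter = {"start": "00:00:10.000", "end": "00:00:32.000"}
chapter["start_ms"] = to_milliseconds(chapter["start"])
chapter["end_ms"] = to_milliseconds(chapter["end"])
print(chapter["start_ms"], chapter["end_ms"])
print(to_hhmmssms(chapter["end_ms"]))
```

Keeping both the string and the millisecond form on each chapter lets the comparison logic work on numbers while the saved JSON stays human-readable.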

In [None]:
## Validating the timestamp boundaries of the conversations against the WebVTT timestamps
def validate_timestamps(chapters, captions):
    ## collect caption timestamps per chapter
    for chapter in chapters:
        chapter_start = chapter['start_ms']
        chapter_end = chapter['end_ms']

        while len(captions) > 0:
            caption = captions[0]

            caption_start = caption['start_ms']
            caption_end = caption['end_ms']

            if caption_start >= chapter_end:
                break

            if caption_end <= chapter_start:
                captions.pop(0)
                continue

            if abs(chapter_end - caption_start) < abs(caption_end - chapter_end):
                break

            if 'timestamps' not in chapter:
                chapter['timestamps'] = []
            chapter['timestamps'].append([caption_start, caption_end])

            captions.pop(0)

    ## align the chapter boundary timestamps with the caption timestamps
    for chapter in chapters:
        if 'timestamps' not in chapter:
            continue
        
        chapter_start = chapter['start_ms']
        chapter_end = chapter['end_ms']

        caption_start = chapter['timestamps'][0][0]
        caption_end = chapter['timestamps'][-1][1]

        if chapter_start != caption_start:
            chapter['start_ms'] = caption_start
            chapter['start'] = to_hhmmssms(caption_start)

        if chapter_end != caption_end:
            chapter['end_ms'] = caption_end
            chapter['end'] = to_hhmmssms(caption_end)

        del chapter['timestamps']

    return chapters

In [None]:
## parse the conversation
conversations = conversation_response['content'][0]['json']

## merge overlapped conversation timestamps
chapters = chpt.merge_chapters(conversations['chapters'])

## validate the conversation timestamps against the caption timestamps
captions = chpt.parse_webvtt(vtt_filename)
chapters = chpt.validate_timestamps(chapters, captions)

conversations['chapters'] = chapters

print(f"Number of chapters: {len(conversations['chapters'])}")
print(conversations['chapters'][0])


## save the conversations
util.save_json_to_file('conversations.json', conversations)


In [None]:
estimated_cost = conversation_cost['estimated_cost'] + transcribe_cost['estimated_cost']
estimated_cost = round(estimated_cost, 4)

print('\n')
print('Generating "chapter points"')
print('========================================================================')
print('Transcribe cost:', colored(f"${round(transcribe_cost['estimated_cost'], 4)}", 'green'), f"with duration of {colored(transcribe_cost['duration'], 'green')}s")
print('Bedrock cost:', colored(f"${round(conversation_cost['estimated_cost'], 4)}", 'green'), f"with {colored(conversation_cost['input_tokens'], 'green')} input tokens and {colored(conversation_cost['output_tokens'], 'green')} output tokens.")
print('-----')
print('Estimated cost:', colored(f"${estimated_cost}", 'green'))
print('========================================================================')


## CHECKPOINT

At this point, we have taken the audio track of the video file, used Amazon Transcribe to convert the speech to text, and used the Anthropic Claude 3 Haiku model on Amazon Bedrock to analyze the conversations.

Let's move on to analyzing the visual part of the video.

### Creating visual scenes from sample video #1: Frame extraction

In this section, we extract one frame per second from the sample video and scale each frame to `392x220`. This resolution is chosen deliberately; the reason is discussed in the "Generating chapter level contextual information" section.
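
Because frames are sampled at roughly one per second and saved with a sequential, zero-padded index (`frames/frame.0000012.jpg`), the index doubles as an approximate timestamp. The sketch below shows that mapping; `frame_file_to_ms` and its `seconds_per_frame` parameter are hypothetical names introduced for illustration.

```python
import os


def frame_file_to_ms(jpeg_file, seconds_per_frame=1):
    """Map a saved frame file name to its approximate timestamp in ms."""
    basename = os.path.basename(jpeg_file)
    # File names look like 'frame.0000012.jpg'; the middle part is the index.
    frame_no = int(basename.split(".")[1])
    return frame_no * seconds_per_frame * 1000


print(frame_file_to_ms("frames/frame.0000012.jpg"))
```

The mapping is only approximate when the video's frame rate is not an integer (for example 23.976 fps), since the sampling step is the frame rate rounded to a whole number of frames.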

In [None]:

def extract_frames(video_file, size = (392, 220)):
    # start fresh
    frames.rmdir('frames')

    cap = cv2.VideoCapture(video_file)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    print(f"  total_frames = {total_frames}, fps = {fps}")

    # re-create frames folder
    frames.mkdir('frames')

    # sample roughly one frame per second; guard against an fps reported as 0
    skip_step = max(1, round(fps))
    processed_frames = 0
    saved_frames = 0

    while True:
        success, frame = cap.read()
        if not success:
            break
        if processed_frames % skip_step == 0:
            # print(f"{saved_frames} / {processed_frames} / {total_frames}")
            name = f"frames/frame.{saved_frames:07d}.jpg"
            scaled = cv2.resize(frame, size)
            cv2.imwrite(name, scaled)
            saved_frames += 1

        processed_frames += 1

    cap.release()

    # return jpeg files
    jpeg_frames = sorted(glob.glob('frames/*.jpg'))
    return jpeg_frames


In [None]:


jpeg_files = extract_frames(mp4_file)

print(f"Frames extracted: {len(jpeg_files)}")

### Creating visual scenes from sample video #2: Grouping frames into shots with Amazon Titan Multimodal Embedding

- Generating frame embeddings with Amazon Titan Multimodal Embedding model
- Grouping frames into shots with cosine similarity
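
Cosine similarity measures the angle between two embedding vectors: values near 1 mean the frames look alike, and a drop between adjacent frames suggests a cut. The following is a minimal pure-Python sketch of that idea on toy 2-dimensional vectors. It is not the workshop's `embeddings.cosine_similarity` implementation, whose source is not shown here, and `split_into_shots` is a hypothetical helper; the 0.80 threshold mirrors the default used later in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_into_shots(vectors, min_similarity=0.80):
    """Start a new shot whenever adjacent vectors fall to or below the threshold."""
    if not vectors:
        return []
    shots = [[0]]
    for i in range(1, len(vectors)):
        if cosine_similarity(vectors[i - 1], vectors[i]) > min_similarity:
            shots[-1].append(i)
        else:
            shots.append([i])
    return shots


# Toy "embeddings": two frames pointing one way, then two pointing another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
shots = split_into_shots(toy)
print(shots)
```

Real Titan embeddings have hundreds of dimensions, so a vectorized implementation (for example with NumPy) would be the more practical choice at scale.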

#### Generating frame embeddings with Amazon Titan Multimodal Embedding model

In [None]:
from PIL import Image

def batch_generate_embeddings(jpeg_files):
    frame_embeddings = []

    titan_model_id = 'amazon.titan-embed-image-v1'
    accept = 'application/json'
    content_type = 'application/json'

    bedrock_runtime_client = boto3.client(service_name='bedrock-runtime')

    for jpeg_file in jpeg_files:
        #print(f"== PROCESSING: {jpeg_file}")

        image = Image.open(jpeg_file)
        input_image = frames.image_to_base64(image)

        model_params = {
            'inputImage': input_image,
            'embeddingConfig': {
                'outputEmbeddingLength': 1024  # alternatives: 384, 256
            }
        }

        body = json.dumps(model_params)

        response = bedrock_runtime_client.invoke_model(
            body=body,
            modelId=titan_model_id,
            accept=accept,
            contentType=content_type
        )
        response_body = json.loads(response.get('body').read())

        basename = os.path.basename(jpeg_file)
        frame_no = int(basename.split('.')[1])
        frame_embeddings.append({
            'file': jpeg_file,
            'frame_no': frame_no,
            'embedding': response_body['embedding']
        })

    return frame_embeddings

def display_embedding_cost(frame_embeddings):
    per_image_embedding = 0.00006
    estimated_cost = per_image_embedding * len(frame_embeddings)

    print('\n')
    print('========================================================================')
    print('Estimated cost:', colored(f"${round(estimated_cost, 4)}", 'green'), f"in us-east-1 region with {len(frame_embeddings)} embeddings")
    print('========================================================================')

    return {
        'per_image_embedding': per_image_embedding,
        'estimated_cost': estimated_cost,
        'num_embeddings': len(frame_embeddings)
    }


In [None]:
frame_embeddings = batch_generate_embeddings(jpeg_files)

## save the frame embeddings
util.save_json_to_file('frame_embeddings.json', frame_embeddings)

frame_embeddings_cost = display_embedding_cost(frame_embeddings)

#### Grouping adjacent frames into shots with cosine similarity

In [None]:

def group_frames_to_shots(frame_embeddings, min_similarity = 0.80):
    shots = []
    current_shot = [frame_embeddings[0]]

    # group frames based on the similarity
    for i in range(1, len(frame_embeddings)):
        prev = current_shot[-1]
        cur = frame_embeddings[i]
        prev_embedding = prev['embedding']
        cur_embedding = cur['embedding']

        similarity = embeddings.cosine_similarity(prev_embedding, cur_embedding)
        cur['similarity'] = similarity

        if similarity > min_similarity:
            current_shot.append(cur)
        else:
            shots.append(current_shot)
            current_shot = [cur]

    if len(current_shot) > 0:
        shots.append(current_shot)

    frames_in_shots = []
    for i in range(len(shots)):
        shot = shots[i]
        frames_ids = [frame['frame_no'] for frame in shot]
        frames_in_shots.append({
            'shot_id': i,
            'frame_ids': frames_ids
        })

    return frames_in_shots


def plot_shots(frame_embeddings, num_shots):
    try:
        os.mkdir('shots')
    except Exception as e:
        print(e)

    shots = [[] for _ in range(num_shots)]
    for frame in frame_embeddings:
        shot_id = frame['shot_id']
        file = frame['file']
        shots[shot_id].append(file)

    for i in range(len(shots)):
        shot = shots[i]
        num_frames = len(shot)
        skipped_frames = frames.skip_frames(shot)
        grid_image = frames.create_grid_image(skipped_frames)
        w, h = grid_image.size
        if h > 440:
            grid_image = grid_image.resize((w // 2, h // 2))
        w, h = grid_image.size
        print(f"Shot #{i:04d}: {num_frames} frames ({len(skipped_frames)} drawn) [{w}x{h}]")
        grid_image.save(f"shots/shot-{i:04d}.jpg")
        display(grid_image)


In [None]:
frames_in_shots = group_frames_to_shots(frame_embeddings)

print(f"Number of shots: {len(frames_in_shots)} from {len(frame_embeddings)} frames")

# update shot_id in frame_embeddings dict
for i in range(len(frames_in_shots)):
    frames_in_shot = frames_in_shots[i]
    for frame_id in frames_in_shot['frame_ids']:
        frame_embeddings[frame_id]['shot_id'] = i

# save to json file
util.save_json_to_file('frames_in_shots.json', frames_in_shots)

# overwrite the embeddings with the shot_id
util.save_json_to_file('frame_embeddings.json', frame_embeddings)

# plot the shot images
plot_shots(frame_embeddings, len(frames_in_shots))

### Creating visual scenes from sample video #3: Grouping shots into scenes with vector store

- Creating a local vector store with Faiss and indexing all frame embeddings
- Searching for similar frames and grouping them into scenes
    - The previous step (grouping frames into shots) compares only adjacent frames. This step compares each frame against the frames of the entire content, which allows us to group frames that are further apart.


#### Creating a local vector store with Faiss and indexing all frame embeddings

In [None]:

def collect_similar_frames(frame_embeddings, frame_ids):
    similar_frames = []
    for frame_id in frame_ids:
        similar_frames_ids = [frame['idx'] for frame in frame_embeddings[frame_id]['similar_frames']]
        similar_frames.extend(similar_frames_ids)
    # unique frames in shot
    return sorted(list(set(similar_frames)))

def collect_related_shots(frame_embeddings, frame_ids):
    related_shots = []
    for frame_id in frame_ids:
        related_shots.append(frame_embeddings[frame_id]['shot_id'])
    # unique related shots
    return sorted(list(set(related_shots)))


def group_shots_in_scenes(frames_in_shots):
    scenes = [
        [
            min(frames_in_shot['related_shots']),
            max(frames_in_shot['related_shots']),
        ] for frames_in_shot in frames_in_shots
    ]

    scenes = sorted(scenes, key=cmp_to_key(embeddings.cmp_min_max))

    stack = [scenes[0]]
    for i in range(1, len(scenes)):
        prev = stack[-1]
        cur = scenes[i]
        prev_min, prev_max = prev
        cur_min, cur_max = cur

        if cur_min >= prev_min and cur_min <= prev_max:
            new_scene = [
                min(cur_min, prev_min),
                max(cur_max, prev_max),
            ]
            stack.pop()
            stack.append(new_scene)
            continue
            
        stack.append(cur)

    return [{
        'scene_id': i,
        'shot_ids': stack[i],
    } for i in range(len(stack))]

def plot_scenes(frame_embeddings, num_scenes):
    # create the output folder if it does not exist yet
    os.makedirs('scenes', exist_ok=True)

    scenes = [[] for _ in range(num_scenes)]
    for frame in frame_embeddings:
        scene_id = frame['scene_id']
        file = frame['file']
        scenes[scene_id].append(file)

    for i in range(len(scenes)):
        scene = scenes[i]
        num_frames = len(scene)
        skipped_frames = frames.skip_frames(scene)
        grid_image = frames.create_grid_image(skipped_frames)
        w, h = grid_image.size
        if h > 440:
            grid_image = grid_image.resize((w // 2, h // 2))
        w, h = grid_image.size
        print(f"Scene #{i:04d}: {num_frames} frames ({len(skipped_frames)} drawn) [{w}x{h}]")
        grid_image.save(f"scenes/scene-{i:04d}.jpg")
        display(grid_image)



#### Searching for similar frames and grouping them into scenes

- Find all similar frames that are within 30 seconds of each frame
- For each shot (a collection of frames), collect the unique similar frames from the previous step, then collect the shots those frames belong to (the related shots)
- Reduce the shots into scenes by merging shots whose related-shot ranges overlap
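
The steps above can be sketched on toy data. This illustration swaps the Faiss vector store for a brute-force scan so that it runs without extra packages; it is not the workshop's `embeddings.search_similarity`, whose source is not shown here. The function names, the `max_gap_s` parameter, and the 0.80 search threshold are assumptions made for the example. Since frames are sampled at about one per second, `frame_no` is treated as seconds.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similar_frames_within(frames, query_idx, max_gap_s=30, min_similarity=0.80):
    """Brute-force stand-in for a vector-store lookup limited to a time window."""
    query = frames[query_idx]
    hits = []
    for idx, frame in enumerate(frames):
        # Ignore frames further than max_gap_s away from the query frame.
        if abs(frame["frame_no"] - query["frame_no"]) > max_gap_s:
            continue
        score = cosine(query["embedding"], frame["embedding"])
        if score >= min_similarity:
            hits.append({"idx": idx, "score": score})
    return hits


def merge_overlapping(ranges):
    """Merge [min, max] shot ranges that overlap into single scene ranges."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged


toy_frames = [
    {"frame_no": 0, "embedding": [1.0, 0.0]},
    {"frame_no": 10, "embedding": [1.0, 0.1]},
    {"frame_no": 50, "embedding": [1.0, 0.0]},  # similar, but 50 s away
]
hits = similar_frames_within(toy_frames, 0)
print([hit["idx"] for hit in hits])

scene_ranges = merge_overlapping([[0, 2], [1, 3], [5, 6], [3, 4]])
print(scene_ranges)
```

A brute-force scan compares every frame with every other frame, which is fine for a toy example; an index such as Faiss is what keeps the lookup practical for long videos.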

In [None]:
## create an index
dimension = len(frame_embeddings[0]['embedding'])
vector_store = embeddings.create_index(dimension)

## indexing all the frames
embeddings.index_frames(vector_store, frame_embeddings)
print(f"Total indexed = {vector_store.ntotal}")

## find similar frames for each of the frames and store in the frame_embeddings
for frame in frame_embeddings:
    similar_frames = embeddings.search_similarity(vector_store, frame)
    frame['similar_frames'] = similar_frames

## find all similar frames that are related to the shots and store in the frames_in_shots
for frames_in_shot in frames_in_shots:
    similar_frames_in_shot = collect_similar_frames(frame_embeddings, frames_in_shot['frame_ids'])
    frames_in_shot['similar_frames_in_shot'] = similar_frames_in_shot

    related_shots = collect_related_shots(frame_embeddings, similar_frames_in_shot)
    frames_in_shot['related_shots'] = related_shots

shots_in_scenes = group_shots_in_scenes(frames_in_shots)

# store the scene_id to all structs
for scene in shots_in_scenes:
    scene_id = scene['scene_id']
    shot_min, shot_max = scene['shot_ids']
    print(f"Scene #{scene_id}: {shot_min} - {shot_max} ({shot_max - shot_min + 1})")
    # update json files
    for shot_id in range(shot_min, shot_max + 1):
        frames_in_shots[shot_id]['scene_id'] = scene_id
        for frame_id in frames_in_shots[shot_id]['frame_ids']:
            frame_embeddings[frame_id]['scene_id'] = scene_id

# update the json files
util.save_json_to_file('shots_in_scenes.json', shots_in_scenes)
util.save_json_to_file('frames_in_shots.json', frames_in_shots)
util.save_json_to_file('frame_embeddings.json', frame_embeddings)

# plot the scene images
plot_scenes(frame_embeddings, len(shots_in_scenes))

print(f"Number of frames: {len(frame_embeddings)}")
print(f"Number of shots: {len(frames_in_shots)}")
print(f"Number of scenes: {len(shots_in_scenes)}")

### Generating chapter level contextual information #1: Grouping scenes into chapters with both visual and audio analysis

In [None]:
import copy

def make_chapter_item(chapter_id, scene_items, text = ''):
    scene_ids = [scene['scene_id'] for scene in scene_items]
    return {
        'chapter_id': chapter_id,
        'scene_ids': [min(scene_ids), max(scene_ids)],
        'text': text,
    }

def group_scenes_in_chapters(conversations, shots_in_scenes, frames_in_shots):
    scenes = copy.deepcopy(shots_in_scenes)

    chapters = []
    for conversation in conversations['chapters']:
        start_ms = conversation['start_ms']
        end_ms = conversation['end_ms']
        text = conversation['reason']

        stack = []
        while len(scenes) > 0:
            scene = scenes[0]
            shot_min, shot_max = scene['shot_ids']
            frame_start = min(frames_in_shots[shot_min]['frame_ids']) * 1000
            frame_end = max(frames_in_shots[shot_max]['frame_ids']) * 1000

            if frame_start > end_ms:
                break

            # scenes before any conversation starts
            if frame_end < start_ms:
                chapter = make_chapter_item(len(chapters), [scene])
                chapters.append(chapter)
                scenes.pop(0)
                continue

            stack.append(scene)
            scenes.pop(0)

        if len(stack) > 0:
            chapter = make_chapter_item(len(chapters), stack, text)
            chapters.append(chapter)

    ## There could be more scenes without conversations, append them
    for scene in scenes:
        chapter = make_chapter_item(len(chapters), [scene])
        chapters.append(chapter)

    return chapters

def plot_chapters(frame_embeddings, num_chapters):
    # create the output folder if it does not exist yet
    os.makedirs('chapters', exist_ok=True)

    chapters = [[] for _ in range(num_chapters)]
    for frame in frame_embeddings:
        chapter_id = frame['chapter_id']
        file = frame['file']
        chapters[chapter_id].append(file)

    for i in range(len(chapters)):
        chapter = chapters[i]
        num_frames = len(chapter)
        skipped_frames = frames.skip_frames(chapter)
        grid_image = frames.create_grid_image(skipped_frames)
        w, h = grid_image.size
        if h > 440:
            grid_image = grid_image.resize((w // 2, h // 2))
        w, h = grid_image.size
        print(f"Chapter #{i:04d}: {num_frames} frames ({len(skipped_frames)} drawn) [{w}x{h}]")
        grid_image.save(f"chapters/chapter-{i:04d}.jpg")
        display(grid_image)


In [None]:
scenes_in_chapters = group_scenes_in_chapters(
    conversations,
    shots_in_scenes,
    frames_in_shots
)

for scenes_in_chapter in scenes_in_chapters:
    chapter_id = scenes_in_chapter['chapter_id']
    scene_min, scene_max = scenes_in_chapter['scene_ids']
    print(f"Chapter #{chapter_id}: {scene_max - scene_min + 1} scenes")

    # update json files
    for scene_id in range(scene_min, scene_max + 1):
        shots_in_scenes[scene_id]['chapter_id'] = chapter_id
        shot_min, shot_max = shots_in_scenes[scene_id]['shot_ids']
        for shot_id in range(shot_min, shot_max + 1):
            frames_in_shots[shot_id]['chapter_id'] = chapter_id
            for frame_id in frames_in_shots[shot_id]['frame_ids']:
                frame_embeddings[frame_id]['chapter_id'] = chapter_id

# update the json files
util.save_json_to_file('scenes_in_chapters.json', scenes_in_chapters)
util.save_json_to_file('shots_in_scenes.json', shots_in_scenes)
util.save_json_to_file('frames_in_shots.json', frames_in_shots)
util.save_json_to_file('frame_embeddings.json', frame_embeddings)

# plot the chapter images
plot_chapters(frame_embeddings, len(scenes_in_chapters))

### Generating chapter level contextual information #2: Contextual understandings with Anthropic Claude 3

- Downloading the IAB Content Taxonomy definitions
- Constructing a prompt for Anthropic Claude 3 that includes the composite images describing each scene along with the conversation

#### Download the IAB Content Taxonomy definition

In [None]:
iab_file = 'iab_content_taxonomy_v3.json'
url = f"https://dx2y1cac29mt3.cloudfront.net/iab/{iab_file}"

!curl {url} -o {iab_file}

In [None]:
def load_iab_taxonomies(file):
    with open(file) as f:
        iab_taxonomies = json.load(f)
    return iab_taxonomies

def get_chapter_frames(frame_embeddings, scenes_in_chapters):
    num_chapters = len(scenes_in_chapters)
    chapters_frames = [{
        'chapter_id': i,
        'text': '',
        'frames': [],
    } for i in range(num_chapters)]

    for frame in frame_embeddings:
        chapter_id = frame['chapter_id']
        file = frame['file']
        chapters_frames[chapter_id]['frames'].append(file)
        chapters_frames[chapter_id]['text'] = scenes_in_chapters[chapter_id]['text']
        
    return chapters_frames



def make_image_message(images):
    # adding the composite image sequences
    image_contents = [{
        'type': 'text',
        'text': 'Here are {0} images containing frame sequence that describes a scene.'.format(len(images))
    }]

    for image in images:
        base64_image = frames.image_to_base64(image)
        image_contents.append({
            'type': 'image',
            'source': {
                'type': 'base64',
                'media_type': 'image/jpeg',
                'data': base64_image
            }
        })

    return {
        'role': 'user',
        'content': image_contents
    }

def make_conversation_message(text):
    message = {
        'role': 'user',
        'content': 'No conversation.'
    }

    if len(text) > 0:
        message['content'] = 'Here is the conversation of the scene in <conversation> tag.\n<conversation>\n{0}\n</conversation>\n'.format(text)

    return message

def make_iab_taxonomoies(iab_list):
    iab = [item['name'] for item in iab_list]
    iab.append('None')

    return {
        'type': 'text',
        'text': 'Here is a list of IAB Taxonomies in <iab> tag:\n<iab>\n{0}\n</iab>\nOnly answer the IAB taxonomy from this list.'.format('\n'.join(iab))
    }

def make_garm_taxonomoies():
    garm = [
        'Adult & Explicit Sexual Content',
        'Arms & Ammunition',
        'Crime & Harmful acts to individuals and Society, Human Right Violations',
        'Death, Injury or Military Conflict',
        'Online piracy',
        'Hate speech & acts of aggression',
        'Obscenity and Profanity, including language, gestures, and explicitly gory, graphic or repulsive content intended to shock and disgust',
        'Illegal Drugs, Tobacco, ecigarettes, Vaping, or Alcohol',
        'Spam or Harmful Content',
        'Terrorism',
        'Debated Sensitive Social Issue',
        'None',
    ]

    return {
        'type': 'text',
        'text': 'Here is a list of GARM Taxonomies in <garm> tag:\n<garm>\n{0}\n</garm>\nOnly answer the GARM taxonomy from this list.'.format('\n'.join(garm))
    }

def make_sentiments():
    sentiments = ['Positive', 'Neutral', 'Negative', 'None']

    return {
        'type': 'text',
        'text': 'Here is a list of Sentiments in <sentiment> tag:\n<sentiment>\n{0}\n</sentiment>\nOnly answer the sentiment from this list.'.format('\n'.join(sentiments))
    }

def make_output_example():
    example = {
        'description': {
            'text': 'The scene describes...',
            'score': 98
        },
        'sentiment': {
            'text': 'Positive',
            'score': 90
        },
        'iab_taxonomy': {
            'text': 'Station Wagon',
            'score': 80
        },
        'garm_taxonomy': {
            'text': 'Online piracy',
            'score': 90
        },
        'brands_and_logos': [
            {
                'text': 'Amazon',
                'score': 95
            },
            {
                'text': 'Nike',
                'score': 85
            }
        ],
        'relevant_tags': [
            {
                'text': 'BMW',
                'score': 95
            }
        ]            
    }
    
    return {
        'role': 'user',
        'content': 'Return JSON format. An example of the output:\n{0}\n'.format(json.dumps(example))
    }

def get_contextual_information(images, text, iab_definitions):
    model_id = 'anthropic.claude-3-haiku-20240307-v1:0'
    model_ver = 'bedrock-2023-05-31'
    accept = 'application/json'
    content_type = 'application/json'

    task_all = 'You are asked to provide the following information: a detailed description of the scene, the most relevant IAB taxonomy, GARM taxonomy, and sentiment, any brands and logos that may appear in the scene, and the five most relevant tags from the scene.'
    task_iab_only = 'You are asked to identify the most relevant IAB taxonomy.'
    system = 'You are a media operation engineer. Your job is to review a portion of video content presented as a sequence of consecutive images. Each image also contains a sequence of frames presented in a 4x7 grid reading from left to right and then from top to bottom. You may also optionally be given the conversation of the scene that helps you to understand the context. {0} It is important to return the results in JSON format and to include a confidence score from 0 to 100. Skip any explanation.'

    messages = []
 
    # adding sequences of composite images to the prompt
    message_images = make_image_message(images)
    messages.append(message_images)

    # adding the conversation to the prompt
    messages.append({
        'role': 'assistant',
        'content': 'Got the images. Do you have the conversation of the scene?'
    })
    message_conversation = make_conversation_message(text)
    messages.append(message_conversation)

    # other information
    messages.append({
        'role': 'assistant',
        'content': 'OK. Do you have other information to provide?'
    })

    other_information = []
    ## iab taxonomy
    iab_list = make_iab_taxonomoies(iab_definitions['tier1'])
    other_information.append(iab_list)

    ## GARM
    garm_list = make_garm_taxonomoies()
    other_information.append(garm_list)

    ## Sentiment
    sentiment_list = make_sentiments()
    other_information.append(sentiment_list)

    messages.append({
        'role': 'user',
        'content': other_information
    })

    # output format
    messages.append({
        'role': 'assistant',
        'content': 'OK. What output format?'
    })
    output_format = make_output_example()
    messages.append(output_format)

    # prefill '{'
    messages.append({
        'role': 'assistant',
        'content': '{'
    })
    
    model_params = {
        'anthropic_version': model_ver,
        'max_tokens': 4096,
        'temperature': 0.1,
        'top_p': 0.7,
        'top_k': 20,
        'stop_sequences': ['\n\nHuman:'],
        'system': system.format(task_all),
        'messages': messages
    }


    bedrock_runtime_client = boto3.client(service_name='bedrock-runtime')
    response = bedrock_runtime_client.invoke_model(
        body=json.dumps(model_params),
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    response_body = json.loads(response.get('body').read())

    # the reply continues from the prefilled '{', so restore it before parsing
    response_content = response_body['content'][0]['text']
    if not response_content.startswith('{'):
        response_content = '{' + response_content
    response_content = json.loads(response_content)
    response_body['content'][0]['json'] = response_content

    return response_body

def display_contextual_cost(usage):
    # us-east-1 pricing
    input_per_1k = 0.00025
    output_per_1k = 0.00125

    input_tokens = usage['input_tokens']
    output_tokens = usage['output_tokens']

    contextual_cost = (
        input_per_1k * input_tokens +
        output_per_1k * output_tokens
    ) / 1000

    print('\n')
    print('========================================================================')
    print('Estimated cost:', colored(f"${round(contextual_cost, 4)}", 'green'), f"in us-east-1 region with {colored(input_tokens, 'green')} input tokens and {colored(output_tokens, 'green')} output tokens.")
    print('========================================================================')

    return {
        'input_per_1k': input_per_1k,
        'output_per_1k': output_per_1k,
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
        'estimated_cost': contextual_cost,
    }
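
Because the last `assistant` turn in `get_contextual_information` is prefilled with `{`, the text returned by the model normally starts right after that brace, which is why the function patches the brace back on before calling `json.loads`. A minimal standalone sketch of that step follows. The helper name `parse_prefilled_json` and the sample reply string are made up for illustration and are not real model output.

```python
import json


def parse_prefilled_json(text, prefill='{'):
    """Re-attach the prefill if the reply omits it, then parse the result as JSON."""
    text = text.strip()
    if not text.startswith(prefill):
        text = prefill + text
    return json.loads(text)


# A made-up reply that continues from the prefilled '{' (the opening brace is missing).
reply = '"sentiment": {"text": "Positive", "score": 90}}'
parsed = parse_prefilled_json(reply)
print(parsed['sentiment']['text'])
```

If the model stops early or adds text after the closing brace, `json.loads` raises `json.JSONDecodeError`, so a production version would likely want to catch that and retry or log the raw reply.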


In [None]:
total_usage = {
    'input_tokens': 0,
    'output_tokens': 0,
}

iab_definitions = load_iab_taxonomies(iab_file)

frames_in_chapters = get_chapter_frames(frame_embeddings, scenes_in_chapters)

for frames_in_chapter in frames_in_chapters:
    chapter_id = frames_in_chapter['chapter_id']
    text = frames_in_chapter['text']
    ch_frames = frames_in_chapter['frames']

    composite_images = frames.create_composite_images(ch_frames)
    num_images = len(composite_images)

    for j in range(num_images):
        composite_image = composite_images[j]
        print(f"Chapter #{chapter_id:02d}: {j + 1} of {num_images} composite images")
        w, h = composite_image.size
        scaled = composite_image.resize((w // 4, h // 4))
        display(scaled)

    contextual_response = get_contextual_information(composite_images, text, iab_definitions)

    usage = contextual_response['usage']
    contextual = contextual_response['content'][0]['json']

    # save the contextual to the chapter
    scenes_in_chapters[chapter_id]['contextual'] = {
        'usage': usage,
        **contextual
    }

    total_usage['input_tokens'] += usage['input_tokens']
    total_usage['output_tokens'] += usage['output_tokens']

    print(f"==== Chapter #{chapter_id:02d}: Contextual information ======")
    for key in ['description', 'sentiment', 'iab_taxonomy', 'garm_taxonomy']:
        print(f"{key.capitalize()}: {colored(contextual[key]['text'], 'green')} ({contextual[key]['score']}%)")

    for key in ['brands_and_logos', 'relevant_tags']:
        items = ', '.join([item['text'] for item in contextual[key]])
        if len(items) == 0:
            items = 'None'
        print(f"{key.capitalize()}: {colored(items, 'green')}")
    print(f"================================================\n\n")

util.save_json_to_file('scenes_in_chapters.json', scenes_in_chapters)

contextual_cost = display_contextual_cost(total_usage)

### Total estimated cost

In [None]:
total_estimated_cost = 0

for estimated_cost in [transcribe_cost, conversation_cost, frame_embeddings_cost, contextual_cost]:
    total_estimated_cost += estimated_cost['estimated_cost']
total_estimated_cost = round(total_estimated_cost, 4)

print('\n')
print('\n== Generating chapter points ===========================================\n')
print('Transcribe cost:', colored(f"${round(transcribe_cost['estimated_cost'], 4)}", 'green'), f"with duration of {colored(transcribe_cost['duration'], 'green')}s")
print('Claude cost:', colored(f"${round(conversation_cost['estimated_cost'], 4)}", 'green'), f"with {colored(conversation_cost['input_tokens'], 'green')} input tokens and {colored(conversation_cost['output_tokens'], 'green')} output tokens.")
print('\n== Generating image embeddings =========================================\n')
print('Titan cost:', colored(f"${round(frame_embeddings_cost['estimated_cost'], 4)}", 'green'), f"with {colored(frame_embeddings_cost['num_embeddings'], 'green')} embeddings.")
print('\n== Chapter contextual information ======================================\n')
print('Claude cost:', colored(f"${round(contextual_cost['estimated_cost'], 4)}", 'green'), f"with {colored(contextual_cost['input_tokens'], 'green')} input tokens and {colored(contextual_cost['output_tokens'], 'green')} output tokens.")
print('\n========================================================================\n')
print('Total estimated cost:', colored(f"${total_estimated_cost}", 'green'))
print('\n========================================================================')
