# AI Video Summarization

Publishers and broadcasters can leverage short-form video across social media platforms such as Facebook, Instagram, and TikTok to attract new audiences and create additional revenue opportunities.

However, generating video summaries is a manual and time-consuming process due to challenges like understanding complex content, maintaining coherence, diverse video types, and lack of scalability when dealing with a large volume of videos. Introducing automation through the use of artificial intelligence (AI) and machine learning (ML) can make this process more viable and scalable with automatic content analysis, real-time processing, contextual adaptation, customization, and continuous AI/ML system improvement.

### High-level workflow

![video summarization diagram](static/images/video-summarization-diagram.png)

In this notebook, we'll break down each step and show you in detail how video summarization can be achieved using AWS native services such as [Amazon Transcribe](https://aws.amazon.com/pm/transcribe), [Amazon Bedrock](https://aws.amazon.com/bedrock), [Amazon Polly](https://aws.amazon.com/polly/) and [AWS Elemental MediaConvert](https://aws.amazon.com/mediaconvert/).

## Prerequisites

To run this notebook, you must first run the previous notebooks, [00-prerequisites](00-prerequisites.ipynb) and [01-video-time-segmentation](01-video-time-segmentation.ipynb), which install package dependencies, gather information from the SageMaker environment, and segment the video using audio, visual, and semantic information.

### Retrieve saved values from previous notebooks




In [None]:
%store -r

### Import python packages

In [None]:
from pathlib import Path
import os
import json
import time
import boto3
from botocore.exceptions import ClientError
import json_repair
from termcolor import colored
from IPython.display import JSON
from IPython.display import Video
from IPython.display import Pretty
from IPython.display import Image as DisplayImage
from lib.frames import VideoFrames
from lib.shots import Shots
from lib.scenes import Scenes
from lib.transcript import Transcript
import base64
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import subprocess
import numpy as np
import re

## Summarize video content from transcript

We use a **large language model (LLM)** with [Amazon Bedrock](https://aws.amazon.com/bedrock/) to summarize the content of the video.

In [None]:
bedrock_client = boto3.client(service_name="bedrock-runtime")
accept = "application/json"
content_type = "application/json"

In [None]:
with open(video['transcript'].transcript_file, 'r') as file:
    transcript_file = json.load(file)
transcript = transcript_file['results']['transcripts'][0]['transcript']

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

prompt = f"""Summarize the key points from the following video content in chronological order:

{transcript} 

\n\nThe summary should only contain information present in the video content. Do not include any new or unrelated information.

Important: Start the summary immediately without any introductory phrases. Begin directly with the first key point."""

body = json.dumps(
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.25,
        "top_p": 0.9,

    }
)
response = bedrock_client.invoke_model(
    body=body, modelId=model_id, accept=accept, contentType=content_type
)
response_body = json.loads(response["body"].read())
summarized_text = response_body["content"][0]["text"]

You can invoke the endpoint with different parameters defined in the payload to impact the text summarization. Two important parameters are `top_p` and `temperature`. While `top_p` is used to control the range of tokens considered by the model based on their cumulative probability, `temperature` controls the level of randomness in the output.

Although there isn't a one-size-fits-all combination of `top_p` and `temperature`, the previous example uses a high `top_p` and a low `temperature`. These values lead to summaries that focus on key information and stay close to the original text while still allowing some variation in wording.
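To make the two parameters concrete, here is a minimal sketch of temperature scaling and nucleus (`top_p`) filtering on a toy distribution. It illustrates the general sampling technique only; the logits are made up, the helper names `apply_temperature` and `top_p_filter` are hypothetical, and nothing here is meant to describe how Amazon Bedrock implements sampling internally.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens

sharp = apply_temperature(toy_logits, 0.25)  # low temperature: mass concentrates on token 0
flat = apply_temperature(toy_logits, 1.0)    # temperature 1.0: plain softmax
nucleus = top_p_filter(flat, 0.9)            # drops the least likely token

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sorted(nucleus))
```

A lower temperature sharpens the distribution toward the most likely token, while `top_p` removes the low-probability tail before sampling.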

Let's check the summarized video content:

In [None]:
summarized_text

## Generate metadata for voice narration

The next step uses [Amazon Polly](https://aws.amazon.com/polly/) to generate speech from the summarized text. In this notebook, Polly produces both MP3 audio and speech marks: line-delimited JSON metadata that records when each sentence starts in the audio stream for a specific Polly voice. ([Speech Synthesis Markup Language (SSML)](https://docs.aws.amazon.com/polly/latest/dg/ssml.html) is used later to add pauses to the narration.) From the speech marks we derive the duration of each spoken sentence, which in turn defines the length of each video segment; in this case, a direct 1:1 correspondence is employed.

In [None]:
polly_client = boto3.client("polly")
voice_id = "Matthew"

In [None]:
response = polly_client.synthesize_speech(
    Engine="neural",
    OutputFormat="json",
    Text=summarized_text + " This video is generated by Video Summarization Hub.",
    TextType="text",
    SpeechMarkTypes=["sentence"],
    VoiceId=voice_id,
)

stream_data = response['AudioStream'].read()
polly_ssml = stream_data.decode('utf-8')

The following is the Amazon Polly speech marks output, with one JSON object per line:

In [None]:
polly_ssml = polly_ssml.split("\n")
polly_ssml

The speech marks provide both the video summary sentences and the time at which each sentence starts, from which we derive the speech durations: the time Amazon Polly takes to vocalize each sentence. We will extract these values to align the synthesized speech with the video timeline in the next several steps.
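As a small worked example, the sketch below pairs each sentence with the gap until the next one. The sample lines are hypothetical data shaped like Polly sentence speech marks (the `time` and `value` fields are the ones the notebook reads), and `sentence_durations` is an illustrative helper, not part of the notebook.

```python
import json

# Hypothetical speech marks: one JSON object per line, times in milliseconds.
sample_marks = "\n".join([
    '{"time": 0, "type": "sentence", "start": 0, "end": 24, "value": "The host opens the show."}',
    '{"time": 1850, "type": "sentence", "start": 25, "end": 53, "value": "A guest explains the recipe."}',
    '{"time": 4200, "type": "sentence", "start": 54, "end": 105, "value": "This video is generated by Video Summarization Hub."}',
    "",
])

def sentence_durations(raw):
    """Return (sentence, duration_ms) pairs; the last mark only acts as an end marker."""
    marks = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return [(cur["value"], nxt["time"] - cur["time"]) for cur, nxt in zip(marks, marks[1:])]

durations = sentence_durations(sample_marks)
print(durations)
```

Because a duration needs the start time of the following mark, the last sentence has no duration of its own, which is why an extra closing sentence is appended to the text sent to Polly.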

In [None]:
summarized_sentences = []
speech_durations = []

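# Each sentence's duration is the gap to the next speech mark, so the final
# (appended) sentence only serves as an end marker and is not stored.
for i in range(len(polly_ssml) - 1):
    curr = polly_ssml[i]
    nxt = polly_ssml[i + 1]  # avoid shadowing the built-in next()
    if curr.strip() == "" or nxt.strip() == "":
        continue
    curr = json.loads(curr)
    nxt = json.loads(nxt)
    summarized_sentences.append(curr["value"])
    speech_durations.append(int(nxt["time"]) - int(curr["time"]))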

## Select most relevant video shots/scenes

We need to select the most relevant video frame sequence to match with every sentence in the summarized content. Thus, we use text embedding to perform the sentence similarity task, which determines how similar two texts are.

Sentence similarity models transform input texts into vectors (embeddings) that capture semantic information and calculate the proximity or similarity between them.

In this step, we use a **text embedding model** with [Amazon Bedrock](https://aws.amazon.com/bedrock/) to create the embeddings for every sentence in the original subtitle and in the video summary.

First, we get the original subtitle file and process it to break it down into sentences with start and end times.
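To show what this processing amounts to, here is a minimal sketch that parses a hypothetical two-cue snippet into start and end times in milliseconds. It assumes numbered cues separated by blank lines; `parse_cues` and `to_ms` are illustrative names, and the notebook's own function additionally merges cues into full sentences.

```python
import re

CUE_TIMES = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[.,](\d{3}) --> (\d{2}):(\d{2}):(\d{2})[.,](\d{3})"
)

def to_ms(h, m, s, ms):
    """Convert hour, minute, second, millisecond strings to milliseconds."""
    return int(h) * 3_600_000 + int(m) * 60_000 + int(s) * 1_000 + int(ms)

def parse_cues(text):
    """Split the text into blank-line-separated cues and read each timing line."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            match = CUE_TIMES.search(line)
            if match:
                g = match.groups()
                cues.append({
                    "start_ms": to_ms(*g[:4]),
                    "end_ms": to_ms(*g[4:]),
                    "text": " ".join(lines[idx + 1:]).strip(),
                })
                break
    return cues

# Hypothetical subtitle snippet (header already removed).
sample_vtt = """1
00:00:00.500 --> 00:00:02.000
Welcome back to

2
00:00:02.000 --> 00:00:04.250
the kitchen.
"""

cues = parse_cues(sample_vtt)
print(cues)
```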

In [None]:
with open(video['transcript'].vtt_file, 'r', encoding='utf-8') as file:
    subtitle = file.read()

if subtitle.startswith("WEBVTT"):
    subtitle = subtitle[len("WEBVTT"):].lstrip()

print(subtitle)

In [None]:
def srt_to_array(s):
    """
    Converts the given transcription in SRT/WEBVTT format to list of sentences and their corresponding timecodes.
    Args:
       s - transcription in SRT/WEBVTT format.
    Returns:
       A list of dictionaries, where each dictionary represents a sentence and its corresponding start time and end time.
    """
    sentences = [line.strip() for line in re.findall(r"\d+\n.*?\n(.*?)\n", s)]

    def get_time(s):
        return re.findall(r"\d{2}:\d{2}:\d{2}[.,]\d{3}", s)

    def time_to_ms(time_str):
        match = re.match(r"(\d+):(\d+):(\d+)[.,](\d+)", time_str)
        h, m, s, ms = match.groups()
        return int(h) * 3600000 + int(m) * 60000 + int(s) * 1000 + int(ms)

    startTimes = get_time(s)[::2]
    endTimes = get_time(s)[1::2]
    startTimes_ms = [time_to_ms(time) for time in startTimes]
    endTimes_ms = [time_to_ms(time) for time in endTimes]

    complete_sentences = []
    complete_startTimes_ms = []
    complete_endTimes_ms = []

    startTime_ms = -1
    endTime_ms = -1
    sentence = ""
    for i in range(len(sentences)):
        if startTime_ms == -1:
            startTime_ms = startTimes_ms[i]
        sentence += " " + sentences[i]
        if (
            sentences[i].endswith(".")
            or sentences[i].endswith("?")
            or sentences[i].endswith("!")
            or i == len(sentences) - 1
        ):
            endTime_ms = endTimes_ms[i]
            complete_sentences.append(sentence)
            complete_startTimes_ms.append(startTime_ms)
            complete_endTimes_ms.append(endTime_ms)
            startTime_ms = -1
            endTime_ms = -1
            sentence = ""
    processed_transcript = []
    for i in range(len(complete_sentences)):
        processed_transcript.append(
            {
                "sentence_startTime": complete_startTimes_ms[i],
                "sentence_endTime": complete_endTimes_ms[i],
                "sentence": complete_sentences[i],
            }
        )
    return processed_transcript

In [None]:
processed_transcript = srt_to_array(subtitle)

Let's visualize some sentences from the video's original transcript:

In [None]:
original_sentences = [item['sentence'] for item in processed_transcript]
original_sentences[:10]

Next, we create the text embeddings for every sentence in the original subtitle and in the video summary. The following code gives an example of how text embedding with the Amazon Bedrock API works.

In [None]:
def find_matching_sentences(original_sentences, summarized_sentences):
    """
    Calculates the similarity between the given original sentences and the summarized sentences.
    Args:
       original_sentences - sentences extracted from the original video
       summarized_sentences - sentences extracted from the video summary
    Return:
       best_matching_indices - list of indices indicating which original sentence best matches each summarized sentence
       similarity_matrix - sentences similarity matrix
    
    """
    def np_cosine_similarity(original_embeddings, summarized_embeddings):
        """
        We use `Cosine similarity` to measure similarities between two vectors.
        """
        dot_products = np.dot(summarized_embeddings, original_embeddings.T)
        summarized_norms = np.linalg.norm(summarized_embeddings, axis=1)
        original_norms = np.linalg.norm(original_embeddings, axis=1)
        similarity_matrix = dot_products / summarized_norms[:, None] / original_norms[None, :]
        return similarity_matrix
        
    model_id = "amazon.titan-embed-image-v1"
    accept = "application/json"
    content_type = "application/json"
    original_embeddings = []
    for sentence in original_sentences:
        body = json.dumps({"inputText": sentence})
        response = bedrock_client.invoke_model(
            body=body, modelId=model_id, accept=accept, contentType=content_type
        )
        response_body = json.loads(response["body"].read())
        original_embeddings.append(response_body.get("embedding"))
    original_embeddings = np.array(original_embeddings)

    summarized_embeddings = []
    for sentence in summarized_sentences:
        body = json.dumps({"inputText": sentence})
        response = bedrock_client.invoke_model(
            body=body, modelId=model_id, accept=accept, contentType=content_type
        )
        response_body = json.loads(response["body"].read())
        summarized_embeddings.append(response_body.get("embedding"))
    summarized_embeddings = np.array(summarized_embeddings)

    similarity_matrix = np_cosine_similarity(original_embeddings, summarized_embeddings)
    best_matching_indices = []
    len_summarized_sentences = len(summarized_sentences)
    len_original_sentences = len(original_sentences)

    # Find the best matching sentences.
    dp = np.zeros([len_summarized_sentences, len_original_sentences], dtype=float)
    for i in range(0, len_summarized_sentences):
        for j in range(0, len_original_sentences):
            if i == 0:
                dp[i][j] = similarity_matrix[i][j]
            else:
                max_score = -1
                for k in range(0, j):
                    if similarity_matrix[i][j] > 0 and dp[i - 1][k] > 0:
                        max_score = max(
                            max_score, similarity_matrix[i][j] + dp[i - 1][k]
                        )
                dp[i][j] = max_score

    j = len_original_sentences

    for i in range(len_summarized_sentences - 1, -1, -1):
        arr = dp[i][:j]
        idx = np.argmax(arr)
        best_matching_indices.append(idx)
        j = idx
    best_matching_indices.reverse()

    return best_matching_indices, similarity_matrix

In [None]:
best_matching_indices, similarity_matrix = find_matching_sentences(original_sentences, summarized_sentences)

This returns the following similarity matrix:

In [None]:
similarity_matrix

You can interpret the prior result as: the first row of the matrix corresponds to the first sentence in the summarized content and all the columns show its similarity scores to the sentences in the original text. Similarity values typically range between -1 and 1, where 1 indicates that the vectors are identical or very similar; 0 indicates that the vectors are orthogonal (not correlated) and have no similarity; -1 indicates that the vectors are diametrically opposed or very dissimilar.
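The three reference values can be checked on tiny hand-made vectors. This is a plain-Python sketch of cosine similarity for illustration; the vectors are made up and much smaller than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # opposite directions

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that only the direction of the vectors matters: scaling a vector by a positive factor leaves the similarity unchanged.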

From the similarity matrix, we match each sentence in the summarized content to a highly similar sentence in the original text while keeping the matched sentences in chronological order. Each sentence in the original text also has its corresponding timestamp (i.e. startTime, endTime) stored in the original subtitle.
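The idea of matching sentences in chronological order can be sketched with a small dynamic program. This is a simplified variant written for illustration, not necessarily identical to the notebook's `find_matching_sentences`: it picks one original sentence per summary sentence, with strictly increasing indices, so that the total similarity is maximal. The matrix values are made up, `align_in_order` is a hypothetical name, and the sketch assumes there are at least as many original sentences as summary sentences.

```python
def align_in_order(similarity):
    """similarity[i][j] scores summary sentence i against original sentence j.
    Returns strictly increasing original indices, one per summary sentence."""
    n_sum, n_orig = len(similarity), len(similarity[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_orig for _ in range(n_sum)]
    back = [[-1] * n_orig for _ in range(n_sum)]
    score[0] = list(similarity[0])

    for i in range(1, n_sum):
        best_prev, best_k = neg_inf, -1  # best score[i-1][k] over k < j
        for j in range(n_orig):
            if best_k != -1:
                score[i][j] = similarity[i][j] + best_prev
                back[i][j] = best_k
            if score[i - 1][j] > best_prev:
                best_prev, best_k = score[i - 1][j], j

    # Backtrack from the best final column.
    j = max(range(n_orig), key=lambda c: score[-1][c])
    path = [j]
    for i in range(n_sum - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

toy = [
    [0.9, 0.2, 0.1, 0.3],
    [0.8, 0.3, 0.7, 0.2],
    [0.1, 0.6, 0.2, 0.5],
]

greedy = [max(range(len(row)), key=row.__getitem__) for row in toy]
ordered = align_in_order(toy)
print(greedy, ordered)
```

Picking the best column per row independently can reuse or reorder original sentences, whereas the ordered alignment keeps the selected clips moving forward through the video.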

By incorporating both the duration of Polly audio for each summarized sentence and the timestamps from the original subtitle file, we can then select the timestamp sequence for the most relevant frames corresponding to each summarized sentence. The length of each selected video segment for a summarized sentence will be aligned with the length of its narration audio.
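The timing arithmetic can be illustrated with made-up numbers. In this sketch, each clip ends where its matched original sentence ends and is as long as its narration; if that would overlap the previous clip, the clip is shifted later by the overlap. `place_clip` is a hypothetical helper that mirrors this idea on plain integers.

```python
def place_clip(match_end_ms, narration_ms, previous_clip_end_ms):
    """Return (start_ms, end_ms) for a clip of length narration_ms that ends at
    the matched sentence, shifted later if it would overlap the previous clip."""
    start = match_end_ms - narration_ms
    carry = max(0, previous_clip_end_ms - start)
    return start + carry, match_end_ms + carry

clips = [(0, 1500)]  # hypothetical intro segment
# (end of matched original sentence, narration duration), both in milliseconds
for match_end_ms, narration_ms in [(6000, 3000), (7000, 2500)]:
    clips.append(place_clip(match_end_ms, narration_ms, clips[-1][1]))

print(clips)
```

The second clip would naturally start at 4500 ms, inside the first clip, so it is pushed back by 1500 ms while keeping its 2500 ms length.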

In [None]:
def get_timecodes(best_matching_indices, idx, endTimes, duration, timecodes):
    """
    Calculate the best start and end time for each summarized sentence aligned with the timecode from the original sentences
    Args:
      best_matching_indices - the indices from the original sentence that is most similar with the summarized sentences.
      idx - index from the summarized sentences to process
      endTimes - the endtime from the original sentences
      duration - speech duration for the synthesized sentences from the summarized text
      timecodes - timecode used for calculating the best placement for the summarized text within the video.
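    Return:
      startTime, endTime - start and end time (in milliseconds) of the clip selected for the summarized sentence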
    """
    best_matching_idx = best_matching_indices[idx]
    startTime = int(endTimes[best_matching_idx]) - duration
    carry = max(0, timecodes[len(timecodes) - 1][1] - startTime)
    startTime += carry
    endTime = int(endTimes[best_matching_idx]) + carry
    return startTime, endTime

def ms_to_timecode(ms, drop_frame=False):
    """
    Convert milliseconds to SMPTE timecode
    Args:
        ms: milliseconds
        drop_frame: Boolean, True for drop frame, False for non-drop frame
    Return:
        string in HH:MM:SS:FF or HH:MM:SS;FF format
    """
    total_frames = int(ms * (29.97 if drop_frame else 30) / 1000)
    frames = total_frames % 30
    
    total_seconds = total_frames // 30
    seconds = total_seconds % 60
    
    total_minutes = total_seconds // 60
    minutes = total_minutes % 60
    
    hours = total_minutes // 60
    separator = ';' if drop_frame else ':'    
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{frames:02d}"

In [None]:
intro_time = float(transcript_file["results"]["items"][0]["start_time"]) * 1000

timecodes = [[0, intro_time]]
for i in range(len(summarized_sentences)):
    startTime, endTime = get_timecodes(
        best_matching_indices,
        i,
        [item['sentence_endTime'] for item in processed_transcript],
        speech_durations[i],
        timecodes,
    )
    timecodes.append([startTime, endTime])
creditTime = endTime + 3500
timecodes.append([endTime, creditTime])
timecodes_text = ""
for timecode in timecodes:
    timecodes_text += (
        ms_to_timecode(timecode[0], True)
        + ","
        + ms_to_timecode(timecode[1], True)
        + "\n"
    )
to_json = lambda s: [
    {"StartTimecode": t1, "EndTimecode": t2}
    for t1, t2 in (line.split(",") for line in s.split("\n") if line.strip())
]
timecodes = to_json(timecodes_text)

Here are the generated timecodes that will be used for AWS Elemental MediaConvert input clipping:

In [None]:
timecodes

You can now generate the audio narration from the video summary in MP3 format using Amazon Polly. Remember to escape special characters in the summarized text for SSML compatibility, and to add SSML breaks of the appropriate length so that the narration starts after the intro.
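One way to sketch both concerns, assuming the 10-second limit per break that the notebook code mentions, is shown below. `build_ssml` is a hypothetical helper; it uses the standard library's `xml.sax.saxutils.escape` for the XML special characters and splits a long pause into several breaks.

```python
from xml.sax.saxutils import escape

MAX_BREAK_MS = 10_000  # per-break limit assumed from the notebook's comment

def build_ssml(text, lead_in_ms):
    """Wrap escaped text in <speak>, preceded by breaks that add up to lead_in_ms."""
    remaining = int(round(lead_in_ms))
    breaks = []
    while remaining > 0:
        chunk = min(remaining, MAX_BREAK_MS)
        breaks.append(f'<break time="{chunk}ms"/>')
        remaining -= chunk
    # escape() handles &, < and >; quotes are added through the entities mapping.
    body = escape(text, {'"': "&quot;", "'": "&apos;"})
    return "<speak>" + "".join(breaks) + body + "</speak>"

ssml_example = build_ssml('Tom & Jerry say "hi"', 23500)
print(ssml_example)
```

Each emitted break stays at or below the limit, and the break durations sum to the requested lead-in time.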

In [None]:
escaped_summarized_text = (
        summarized_text.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("'", "&apos;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )
ssml = "<speak>\n"
break_time = int(round(intro_time))  # whole milliseconds

while break_time > 10000:  # maximum break time in Polly is 10s
    ssml += '<break time="10000ms"/>'  # emit full 10 s breaks until less than 10 s remains
    break_time -= 10000
ssml += '<break time="' + str(break_time) + 'ms"/>'
ssml += escaped_summarized_text
ssml += "</speak>"

response = polly_client.synthesize_speech(
    Engine="neural",
    OutputFormat="mp3",
    Text=ssml,
    TextType="ssml",
    VoiceId=voice_id,
)

if "AudioStream" in response:
    with response["AudioStream"] as stream:
        audio_narration = stream.read()

In [None]:
ssml

We upload the audio narration to an Amazon S3 bucket, ready for the video transcoding step with AWS Elemental MediaConvert.

In [None]:
s3_client = boto3.client("s3")
s3_bucket = session["bucket"]
audio_narration_filename = os.path.splitext(os.path.basename(video['path']))[0] + ".mp3"
s3_client.put_object(
    Body=audio_narration, Bucket=s3_bucket, Key=audio_narration_filename, ContentType="audio/mpeg"
)

## Create MediaConvert assembly workflows

We use the sequence of timecodes as parameters to create an AWS Elemental MediaConvert assembly workflow that performs basic input clipping.

Combining the clipped video with the MP3 audio from Amazon Polly, and optionally background music of your choice, produces the final video summarization output.

Let's start the assembly workflow from the original video input. An assembly workflow is a MediaConvert job that performs basic input clipping and stitching to assemble output assets from one or different sources without requiring separate editing software.
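Input clipping is driven by a list of start and end timecodes. The sketch below converts millisecond ranges into that list shape, assuming an integer frame rate of 30 fps and non-drop-frame `HH:MM:SS:FF` timecodes. The helper names and the sample ranges are hypothetical, and the notebook's own `ms_to_timecode` is called with drop-frame formatting instead.

```python
def ms_to_timecode_ndf(ms, fps=30):
    """Convert milliseconds to a non-drop-frame HH:MM:SS:FF timecode."""
    total_frames = int(ms * fps / 1000)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours = total_seconds // 3600
    minutes = total_seconds // 60 % 60
    seconds = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

def to_input_clippings(clips_ms, fps=30):
    """Build a list of StartTimecode/EndTimecode dictionaries from (start, end) pairs."""
    return [
        {
            "StartTimecode": ms_to_timecode_ndf(start, fps),
            "EndTimecode": ms_to_timecode_ndf(end, fps),
        }
        for start, end in clips_ms
    ]

clippings = to_input_clippings([(0, 1500), (3000, 6000), (61_000, 3_661_500)])
print(clippings)
```

The clips are listed in playback order, and the job stitches them together in that order.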

In [None]:
iam_role = session["MediaConvertRole"]
input_video_path = video["url"]
output_video_path = f"s3://{s3_bucket}/"

In [None]:
video["url"]

In [None]:
media_convert = boto3.client("mediaconvert")
response = media_convert.create_job(
    Queue="Default",
    UserMetadata={},
    Role=iam_role,
    Settings={
        "TimecodeConfig": {"Source": "ZEROBASED"},
        "OutputGroups": [
            {
                "Name": "File Group",
                "Outputs": [
                    {
                        "ContainerSettings": {
                            "Container": "MP4",
                            "Mp4Settings": {},
                        },
                        "VideoDescription": {
                            "CodecSettings": {
                                "Codec": "H_264",
                                "H264Settings": {
                                    "MaxBitrate": 40000000,
                                    "RateControlMode": "QVBR",
                                    "SceneChangeDetect": "TRANSITION_DETECTION",
                                },
                            }
                        }
                    }
                ],
                "OutputGroupSettings": {
                    "Type": "FILE_GROUP_SETTINGS",
                    "FileGroupSettings": {"Destination": output_video_path},
                },
            }
        ],
        "Inputs": [
            {
                "VideoSelector": {},
                "TimecodeSource": "ZEROBASED",
                "FileInput": input_video_path,
                "InputClippings": timecodes,
            }
        ],
    },
    AccelerationSettings={"Mode": "DISABLED"},
    StatusUpdateInterval="SECONDS_60",
    Priority=0,
)

t0 = time.time()

job_complete = False

while not job_complete:
    job_response = media_convert.get_job(Id=response["Job"]["Id"])
    
    job_status = job_response['Job']['Status']
    print(f"MediaConvert job status: {job_status}")
    
    if job_status == 'COMPLETE':
        print("Job is complete!")
        job_complete = True
    elif job_status == 'ERROR':
        print("Job has failed.")
        job_complete = True
    else:
        time.sleep(10)

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")

Next, you create audio tracks in the output and associate a single audio selector with each output track. You can also add subtitles to the final video output. You can generate a subtitle file for the video summary as follows:

In [None]:
video_summary_subtitle = ""
start = intro_time

def split_long_lines(text, max_line_length):
    words = text.split()
    lines = []
    current_line = []
    current_length = 0

    for word in words:
        if current_line and current_length + len(word) > max_line_length:
            lines.append(" ".join(current_line))
            current_line = []
            current_length = 0
        current_line.append(word)
        current_length += len(word) + 1

    if current_line:
        lines.append(" ".join(current_line))

    return lines

def milliseconds_to_subtitleTimeFormat(ms):
    return "{:02d}:{:02d}:{:02d},{:03d}".format(
        int((ms // 3600000) % 24),  # hours
        int((ms // 60000) % 60),  # minutes
        int((ms // 1000) % 60),  # seconds
        int(ms % 1000),  # milliseconds
    )

for i in range(len(summarized_sentences)):
    end = start + speech_durations[i]
    video_summary_subtitle += f"{i+1}\n"
    video_summary_subtitle += f"{milliseconds_to_subtitleTimeFormat(start)} --> {milliseconds_to_subtitleTimeFormat(end)}\n"
    sentence_lines = split_long_lines(summarized_sentences[i], 90)
    for line in sentence_lines:
        video_summary_subtitle += f"{line}\n"
    video_summary_subtitle += "\n"
    start = end

In [None]:
video_summary_subtitle

In [None]:
subtitle_filename = os.path.splitext(os.path.basename(video['path']))[0] + ".srt"
s3_client.put_object(
    Body=video_summary_subtitle, Bucket=s3_bucket, Key=subtitle_filename
)

Finally, you create a MediaConvert job for the final video output.

In [None]:
input_video_path = f"s3://{s3_bucket}/{video['path']}"
audio_file_path = f"s3://{s3_bucket}/{audio_narration_filename}"
subtitle_file_path = f"s3://{s3_bucket}/{subtitle_filename}"
output_video_path = f"s3://{s3_bucket}/"

In the following step, we use an [AWS Elemental MediaConvert](https://aws.amazon.com/mediaconvert/) job to apply the narrated voice and the subtitles to the original video. The output is written to the S3 bucket for downstream consumption.

In [None]:
response = media_convert.create_job(
    Queue="Default",
    UserMetadata={},
    Role=iam_role,
    Settings={
        "TimecodeConfig": {"Source": "ZEROBASED"},
        "OutputGroups": [
            {
                "Name": "File Group",
                "Outputs": [
                    {
                        "ContainerSettings": {
                            "Container": "MP4",
                            "Mp4Settings": {},
                        },
                        "VideoDescription": {
                            "CodecSettings": {
                                "Codec": "H_264",
                                "H264Settings": {
                                    "MaxBitrate": 40000000,
                                    "RateControlMode": "QVBR",
                                    "SceneChangeDetect": "TRANSITION_DETECTION",
                                },
                            }
                        },
                        "NameModifier": "_summary",
                        "AudioDescriptions": [
                            {
                                "AudioSourceName": "Audio Selector Group 1",
                                "CodecSettings": {
                                    "Codec": "AAC",
                                    "AacSettings": {
                                        "Bitrate": 96000,
                                        "CodingMode": "CODING_MODE_2_0",
                                        "SampleRate": 48000,
                                    },
                                },
                            }
                        ],
                        "CaptionDescriptions": [
                            {
                                "CaptionSelectorName": "Captions Selector 1",
                                "DestinationSettings": {
                                    "DestinationType": "BURN_IN",
                                    "BurninDestinationSettings": {
                                        "BackgroundOpacity": 100,
                                        "FontSize": 18,
                                        "FontColor": "WHITE",
                                        "ApplyFontColor": "ALL_TEXT",
                                        "BackgroundColor": "BLACK",
                                    },
                                },
                            }
                        ],
                    }
                ],
                "OutputGroupSettings": {
                    "Type": "FILE_GROUP_SETTINGS",
                    "FileGroupSettings": {"Destination": output_video_path},
                },
            }
        ],
        "Inputs": [
            {
                "AudioSelectors": {
                    "Audio Selector 1": {
                        "DefaultSelection": "NOT_DEFAULT",
                        "ExternalAudioFileInput": audio_file_path,
                    },
                },
                "AudioSelectorGroups": {
                    "Audio Selector Group 1": {
                        "AudioSelectorNames": ["Audio Selector 1"]
                    }
                },
                "VideoSelector": {},
                "TimecodeSource": "ZEROBASED",
                "CaptionSelectors": {
                    "Captions Selector 1": {
                        "SourceSettings": {
                            "SourceType": "SRT",
                            "FileSourceSettings": {"SourceFile": subtitle_file_path},
                        }
                    }
                },
                "FileInput": input_video_path,
            }
        ],
    },
    AccelerationSettings={"Mode": "DISABLED"},
    StatusUpdateInterval="SECONDS_60",
    Priority=0,
)

t0 = time.time()

job_complete = False

while not job_complete:
    job_response = media_convert.get_job(Id=response["Job"]["Id"])
    
    job_status = job_response['Job']['Status']
    print(f"MediaConvert job status: {job_status}")
    
    if job_status == 'COMPLETE':
        print("Job is complete!")
        job_complete = True
    elif job_status == 'ERROR':
        print("Job has failed.")
        job_complete = True
    else:
        time.sleep(10)

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")

## Short-form video output

Here is the final video output generated from our summarization process. We preserve the original video's intro before starting our generated narration.

In [None]:
video_summary = os.path.splitext(os.path.basename(video['path']))[0] + "_summary.mp4"
s3_client.download_file(s3_bucket, video_summary, video_summary)
display(Video(url=video_summary, width=640, height=360, html_attributes="controls muted autoplay"))

## Video summarization with visual and audio understanding

In the previous section, we created a summarized video based solely on the video's transcription by:
1. Extracting and summarizing the original video transcript
2. Finding the best matching video segments by semantically comparing the summary with the original transcript
3. Generating the final video output

Now, let's enhance our approach by combining both video visual understanding and transcript analysis to create a more comprehensive summarized video.

In [None]:
video_shots = video["shots"].shots

Let's analyze each detected shot in our video. For each shot, we will generate a description of the visual content using a large language model in Amazon Bedrock. We will also attach the part of the transcript that was spoken during the shot.
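Matching a transcript to a shot comes down to interval overlap. The following sketch uses made-up times and a hypothetical helper `overlap_ms`; it keeps a sentence only when it shares at least one second with the shot, which is the threshold the notebook applies.

```python
def overlap_ms(a_start, a_end, b_start, b_end):
    """Length in milliseconds of the overlap between two time ranges (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

shot = (4000, 9000)  # hypothetical shot from 4 s to 9 s
sentences = [
    (0, 4500, "Intro sentence."),
    (4500, 8000, "Main sentence."),
    (8000, 12000, "Trailing sentence."),
    (9500, 11000, "Later sentence."),
]

spoken_in_shot = [
    text
    for start, end, text in sentences
    if overlap_ms(start, end, *shot) >= 1000
]
print(spoken_in_shot)
```

The first sentence overlaps the shot by only 500 ms and is dropped, while the third overlaps by exactly 1000 ms and is kept.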

In [None]:
MAX_RETRIES = 50
INITIAL_BACKOFF = 5

def invoke_model_with_retry(body, modelId, accept, contentType):
    retries = 0
    backoff = INITIAL_BACKOFF

    while retries < MAX_RETRIES:
        try:
            response = bedrock_client.invoke_model(
                body=body, modelId=modelId, accept=accept, contentType=contentType
            )
            return response
        except ClientError as e:
            error_code = e.response['Error']['Code']
            print(f"Error: {error_code}. Retrying in {backoff} seconds...")
            time.sleep(backoff)
            retries += 1
            backoff += 1
    
    raise Exception("Max retries reached. Unable to invoke model.")

In [None]:
def generate_shot_description(shot):
    """
    Generate a natural language description of a video shot using LLM in Amazon Bedrock
    Args:
        shot - Dictionary containing shot information including:
                - id: unique identifier for the shot
                - start_ms: start time of the shot in milliseconds
                - end_ms: end time of the shot in milliseconds
                - composite_images: visual representation that combine multiple frames from a single shot into one image
              
    Returns:
        response_body - String containing the generated description of the visual content in the shot based on the analyzed frames
    """

    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
    
    prompt = f"""Provide a concise description of a video shot based on the given frame images. Focus on creating a cohesive narrative of the entire shot rather than describing each frame individually.
        Skip the preamble; go straight into the description."""
        
    body = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
    }

    with open(f"{shot['composite_images'][0]['file']}", "rb") as image_file:
        file_content = image_file.read()
        base64_image_string = base64.b64encode(file_content).decode()
        body["messages"][0]["content"].append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_image_string,
            },
        })
        
    response = invoke_model_with_retry(
        body=json.dumps(body), modelId=model_id, accept=accept, contentType=content_type
    )
    response_body = json.loads(response["body"].read())
    response_body = response_body["content"][0]["text"]

    return response_body

In [None]:
def add_shot_transcript(shot_startTime, shot_endTime, transcript):
    """
    Extract relevant transcript that corresponds to a specific video shot's time range
    Args:
        shot_startTime - Start time of the shot in milliseconds
        shot_endTime - End time of the shot in milliseconds
        transcript - List of dictionaries containing sentence information including:
                    - sentence_startTime: start time of the sentence
                    - sentence_endTime: end time of the sentence
                    - sentence: the transcript text
                    
    Returns:
        relevant_transcript - String containing concatenated sentences that overlap with the shot's time range by at least 1 second
    """
    relevant_transcript = ""
    for item in transcript:
        if item["sentence_startTime"] >= shot_endTime:
            break
        if item["sentence_endTime"] <= shot_startTime:
            continue
        delta_start = max(item["sentence_startTime"], shot_startTime)
        delta_end = min(item["sentence_endTime"], shot_endTime)
        if delta_end - delta_start >= 1000:
            relevant_transcript += item["sentence"] + "; "
    return relevant_transcript
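
To see the one-second overlap rule in isolation, here is a minimal, self-contained sketch. The helper name `overlap_ms` and the sample timings are illustrative assumptions, not part of the notebook.

```python
def overlap_ms(a_start, a_end, b_start, b_end):
    """Length in milliseconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


# Hypothetical shot from 5.0 s to 9.0 s and three transcript sentences around it.
shot_start, shot_end = 5000, 9000
sentences = [
    {"sentence_startTime": 3000, "sentence_endTime": 5500, "sentence": "Ends just after the cut."},
    {"sentence_startTime": 5500, "sentence_endTime": 8000, "sentence": "Fully inside the shot."},
    {"sentence_startTime": 8000, "sentence_endTime": 12000, "sentence": "Starts near the end."},
]

# Keep only sentences that share at least 1000 ms with the shot.
kept = [
    s["sentence"]
    for s in sentences
    if overlap_ms(s["sentence_startTime"], s["sentence_endTime"], shot_start, shot_end) >= 1000
]
print(kept)
```

The first sentence overlaps the shot by only 500 ms, so it is dropped; the other two overlap by 2500 ms and exactly 1000 ms and are kept.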

### Generate shot descriptions

Generate text descriptions of all the shots.

<div class="alert alert-block alert-info">
⏳ Generating shot descriptions for our sample video can take about 10 minutes due to account limits for hosted workshops. To speed things up, set FASTPATH=True in the cell below to load precomputed shot descriptions instead.
</div>


In [None]:
t0 = time.time()

FASTPATH = False

if FASTPATH:
    video["shots"].load_fastpath_results("shots-descriptions.json")
else:
    for counter, shot in enumerate(video_shots, start=1):
        shot['shot_description'] = generate_shot_description(shot)
        shot['shot_transcript'] = add_shot_transcript(shot['start_ms'], shot['end_ms'], processed_transcript)
        
        # print(f'\nSHOT {counter}/{len(video_shots)}: from {shot["start_ms"] }ms to {shot["end_ms"] }ms =======\n')
        # display(DisplayImage(f"{shot['composite_images'][0]['file']}"))
        # print(f'Shot description: {shot["shot_description"]}\n')
        # print(f'Shot transcript: {shot["shot_transcript"]}\n')
    
    # store shot descriptions so they can be loaded when the notebook is re-executed with FASTPATH=True.
    video["shots"].store_fastpath_results("shots-descriptions.json")

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")
print(f"  Shots: {len(video_shots)}")



In [None]:
for counter, shot in enumerate(video_shots, start=1):
    
    print(f'\nSHOT {counter}/{len(video_shots)}: from {shot["start_ms"] }ms to {shot["end_ms"] }ms =======\n')
    display(DisplayImage(f"{shot['composite_images'][0]['file']}"))
    print(f'Shot description: {shot["shot_description"]}\n')
    print(f'Shot transcript: {shot["shot_transcript"]}\n')

Now that we have a description and transcript for each shot, let's store them in an [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/features/serverless/) vector database to enable semantic search.

First, we create an OpenSearch index.

In [None]:
region = sagemaker_resources["region"]
aoss_host = session["AOSSCollectionEndpoint"]
aoss_index = "video-summarization-index"
text_embedding_model = "amazon.titan-embed-text-v2:0"
text_embedding_dimension = 1024

In [None]:
def create_opensearch_index(host, region, index, len_embedding):
    """
    Create an OpenSearch Serverless index with vector search capabilities
    Args:
        host - OpenSearch domain endpoint URL
        region - AWS region where the OpenSearch domain is hosted
        index - Name of the index to create
        len_embedding - Dimension size of the vector embeddings
    Returns:
        client - Configured OpenSearch client object
    """
    host = host.split("://")[1] if "://" in host else host
    credentials = boto3.Session().get_credentials()
    auth = AWSV4SignerAuth(credentials, region, "aoss")

    client = OpenSearch(
        hosts=[{"host": host, "port": 443}],
        http_auth=auth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        pool_maxsize=20,
    )

    exist = client.indices.exists(index)
    if not exist:
        print("Creating index")
        index_body = {
            "mappings": {
                "properties": {
                    "video_name": {"type": "text"},
                    "shot_id": {"type": "text"},
                    "shot_startTime": {"type": "text"},
                    "shot_endTime": {"type": "text"},
                    "shot_description": {"type": "text"},
                    "shot_transcript": {"type": "text"},
                    "shot_desc_vector": {
                        "type": "knn_vector",
                        "dimension": len_embedding,
                        "method": {
                            "engine": "nmslib",
                            "space_type": "cosinesimil",
                            "name": "hnsw",
                            "parameters": {"ef_construction": 512, "m": 16},
                        },
                    },
                    "shot_transcript_vector": {
                        "type": "knn_vector",
                        "dimension": len_embedding,
                        "method": {
                            "engine": "nmslib",
                            "space_type": "cosinesimil",
                            "name": "hnsw",
                            "parameters": {"ef_construction": 512, "m": 16},
                        },
                    }
                }
            },
            "settings": {
                "index": {
                    "number_of_shards": 2,
                    "knn.algo_param": {"ef_search": 512},
                    "knn": True,
                }
            },
        }

        response = client.indices.create(index, body=index_body)

    print("Completed!")
    return client

In [None]:
aoss_client = create_opensearch_index(aoss_host, region, aoss_index, text_embedding_dimension)

Using a text embedding model in Amazon Bedrock, we generate embeddings for the shot descriptions and transcripts, then insert them into the OpenSearch index together with the shot metadata.

In [None]:
def get_text_embedding(text_embedding_model, text):
    """
    Generate vector embeddings for text using Amazon Bedrock's embedding model
    Args:
        text_embedding_model - Model id of the Bedrock embedding model
        text - Input text to generate embeddings
        
    Returns:
        embedding - Result text's vector embedding
    """
    if not text.strip():
        text = "No transcript"
    body = json.dumps({"inputText": text, "dimensions": 1024, "normalize": True})
    response = bedrock_client.invoke_model(
        body=body, modelId=text_embedding_model, accept=accept, contentType=content_type
    )
    response_body = json.loads(response["body"].read())
    return response_body.get("embedding")
        
print("Inserting embeddings into the AOSS index ...")
for counter, shot in enumerate(video_shots, start=1):
    shot_desc_embedding = get_text_embedding(text_embedding_model, shot["shot_description"])
    shot_transcript_embedding = get_text_embedding(text_embedding_model, shot["shot_transcript"])
    embedding_request_body = json.dumps(
        {
            "video_name": video["path"],
            "shot_id": shot["id"],
            "shot_startTime": shot["start_ms"],
            "shot_endTime": shot["end_ms"],
            "shot_description": shot["shot_description"],
            "shot_transcript": shot["shot_transcript"],
            "shot_desc_vector": shot_desc_embedding,
            "shot_transcript_vector": shot_transcript_embedding
        }
    )

    response = aoss_client.index(
        index=aoss_index,
        body=embedding_request_body,
        params={"timeout": 60},
    )

print("Completed!")

For each sentence in the video summary, we search the vector database for relevant shots using both the shot description embeddings and the shot transcript embeddings. The query weights the two fields differently to balance visual and audio information: **75% (a 3.0 boost) for the shot description**, which emphasizes visual content when finding relevant shots, and **25% (a 1.0 boost) for the shot transcript**, which lets the audio content contribute to the results with less influence than the visual description.
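
To make the 3:1 weighting concrete, here is a minimal sketch in plain Python. It assumes the two boosted clause scores are simply added together and uses plain cosine similarity on toy 2-D vectors; the function names and vectors are illustrative, and the raw scores OpenSearch returns may be offset or scaled differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_score(query_vec, desc_vec, transcript_vec, desc_boost=3.0, transcript_boost=1.0):
    """Weighted sum of the description and transcript similarities."""
    return (
        desc_boost * cosine_similarity(query_vec, desc_vec)
        + transcript_boost * cosine_similarity(query_vec, transcript_vec)
    )


query = [1.0, 0.0]
# Shot A matches the query visually, shot B matches it only in the transcript.
shot_a = {"desc": [1.0, 0.0], "transcript": [0.0, 1.0]}
shot_b = {"desc": [0.0, 1.0], "transcript": [1.0, 0.0]}

score_a = combined_score(query, shot_a["desc"], shot_a["transcript"])
score_b = combined_score(query, shot_b["desc"], shot_b["transcript"])
desc_share = 3.0 / (3.0 + 1.0)  # fraction of the total weight given to descriptions

print(score_a, score_b, desc_share)
```

With these toy vectors the visually matching shot scores three times higher than the shot that matches only through its transcript, which is the intended effect of the boosts.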

The total duration of the shots selected for a sentence must also match the speech duration of that sentence.
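
One way to meet this duration constraint is a greedy pass over the ranked shots that trims the last shot to fit. The sketch below is a simplified, self-contained illustration; `pick_shots` and the sample data are hypothetical names, and it keeps the tail of a trimmed shot, mirroring the `find_shots` function defined later in this notebook.

```python
def pick_shots(ranked_shots, duration_ms, used_ids):
    """Greedily take ranked shots until duration_ms is covered, trimming the last one."""
    clips = []
    remaining = duration_ms
    for shot in ranked_shots:
        if remaining <= 0:
            break
        if shot["shot_id"] in used_ids:
            continue  # never reuse a shot that is already in the summary
        length = shot["shot_endTime"] - shot["shot_startTime"]
        take = min(length, remaining)
        # Keep the last `take` milliseconds of the shot.
        clips.append([shot["shot_endTime"] - take, shot["shot_endTime"]])
        used_ids.add(shot["shot_id"])
        remaining -= take
    return clips


# Hypothetical search results, best match first (times in milliseconds).
ranked = [
    {"shot_id": 7, "shot_startTime": 10000, "shot_endTime": 12000},
    {"shot_id": 3, "shot_startTime": 40000, "shot_endTime": 45000},
    {"shot_id": 9, "shot_startTime": 60000, "shot_endTime": 61000},
]
used = set()
clips = pick_shots(ranked, 4500, used)
print(clips)
```

Here a 4.5 s sentence is covered by the full 2 s of the best shot plus the last 2.5 s of the second one; the third shot is not needed.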

First, though, we make sure the data inserted into OpenSearch is ready to be searched.

In [None]:
t0 = time.time()

print("Waiting for the recent inserted data to be searchable in OpenSearch...")

while True:
    try:
        result = aoss_client.search(index=aoss_index, body={"query": {"match_all": {}}})
        # Use >= so the loop also ends if the index already holds documents from an earlier run
        if result['hits']['total']['value'] >= len(video_shots):
            print("\nData is now available for search!")
            break
        else:
            print(".", end="", flush=True)
            time.sleep(5)
    except Exception as e:
        print(".", end="", flush=True)
        time.sleep(5)

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")
print(f"  Shots: {len(video_shots)}")
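
The loop above waits indefinitely. As a hedged alternative, the sketch below shows a generic polling helper with a timeout. `wait_until` and its parameters are hypothetical names, and the `sleep` and `clock` arguments exist only so the helper can be exercised without real delays.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses. Returns True on success."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # treat transient errors as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake readiness check that succeeds on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_check, timeout_s=10, interval_s=0, sleep=lambda _: None)
timed_out = wait_until(lambda: False, timeout_s=0, interval_s=0, sleep=lambda _: None)
print(ready, timed_out)
```

In this notebook, `check` could wrap the `aoss_client.search(...)` call and compare the hit count with `len(video_shots)`.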

In [None]:
def search_by_text(aoss_index, client, user_query):
    """
    Search for relevant video shots using semantic similarity with user's text query
    Args:
        aoss_index - Name of the OpenSearch index
        client - Configured OpenSearch client object
        user_query - Text query from the user
        
    Returns:
        response - List of dictionaries containing matching shots, where each dictionary includes:
                  - shot_id: unique identifier for the shot
                  - shot_startTime: start time of the shot
                  - shot_endTime: end time of the shot
                  - shot_description: visual description of the shot
                  - shot_transcript: transcript text from the shot
                  - score: similarity score of the match
    """
    text_embedding = get_text_embedding(text_embedding_model, user_query)

    aoss_query = {
        "size": 100,
        "query": {
            "bool": {
                "should": [
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "lang": "knn",
                                "source": "knn_score",
                                "params": {
                                    "field": "shot_desc_vector",
                                    "query_value": text_embedding,
                                    "space_type": "cosinesimil",
                                },
                            },
                            "boost": 3.0
                        }
                    },
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "lang": "knn",
                                "source": "knn_score",
                                "params": {
                                    "field": "shot_transcript_vector",
                                    "query_value": text_embedding,
                                    "space_type": "cosinesimil",
                                },
                            },
                            "boost": 1.0
                        }
                    }
                ],
                "minimum_should_match": 1
            }
        },
        "_source": [
            "shot_id",
            "shot_startTime",
            "shot_endTime",
            "shot_description",
            "shot_transcript",
        ],
    }

    response = client.search(body=aoss_query, index=aoss_index)
    hits = response["hits"]["hits"]
    response = []
    for hit in hits:
        if hit["_score"] >= 0:  # Set score threshold
            response.append(
                {
                    "shot_id": hit["_source"]["shot_id"],
                    "shot_startTime": hit["_source"]["shot_startTime"],
                    "shot_endTime": hit["_source"]["shot_endTime"],
                    "shot_description": hit["_source"]["shot_description"],
                    "shot_transcript": hit["_source"]["shot_transcript"],
                    "score": hit["_score"],
                }
            )

    return response

In [None]:
def find_shots(timecodes, sentence, duration):
    """
    Find and select video shots that match a summarized sentence, considering timing constraints
    Args:
        timecodes - List to store selected shot timecodes [[start_time, end_time], ...]
        sentence - Text to search for matching video shots
        duration - Required duration for the shots in milliseconds, or None to take only the best match (used for the intro)

    Note: relies on the global set shot_ids to avoid selecting the same shot twice.
    """
    relevant_shots = search_by_text(aoss_index, aoss_client, sentence)
    if duration is None: # intro
        timecodes.append([relevant_shots[0]["shot_startTime"], relevant_shots[0]["shot_endTime"]])
        shot_ids.add(relevant_shots[0]["shot_id"])
    else:
        i = 0
        while i < len(relevant_shots) and duration > 0:
            if relevant_shots[i]["shot_id"] in shot_ids:
                i += 1
                continue
            shot_duration = relevant_shots[i]["shot_endTime"] - relevant_shots[i]["shot_startTime"]
            # timecodes.append([relevant_shots[i]["shot_startTime"], relevant_shots[i]["shot_startTime"] + min(shot_duration, duration)])
            timecodes.append([relevant_shots[i]["shot_endTime"] - min(shot_duration, duration), relevant_shots[i]["shot_endTime"]])
            shot_ids.add(relevant_shots[i]["shot_id"])
            duration -= shot_duration
            i += 1
            

In [None]:
shot_ids = set()
timecodes = []

find_shots(timecodes, "Meridian", None) # Intro
for i in range(len(summarized_sentences)):
    find_shots(timecodes, summarized_sentences[i], speech_durations[i])

# creditTime = timecodes[-1][1] + 1000
# timecodes.append([timecodes[-1][1], creditTime])
intro_time = timecodes[0][1] - timecodes[0][0]
# Convert [start_ms, end_ms] pairs into the dictionaries MediaConvert expects for input clipping
timecodes = [
    {"StartTimecode": ms_to_timecode(start, True), "EndTimecode": ms_to_timecode(end, True)}
    for start, end in timecodes
]

The result is a list of timecodes that defines our final video segments.
    
Here are the generated timecodes that will be used for AWS Elemental MediaConvert input clipping:

In [None]:
timecodes
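
For reference, MediaConvert input clipping takes timecodes in `HH:MM:SS:FF` form, where `FF` is a frame count. The sketch below shows one way to derive such a timecode from milliseconds. `ms_to_hhmmssff` is a hypothetical helper, the 25 fps frame rate is an assumption, and the notebook's own `ms_to_timecode` may behave differently.

```python
def ms_to_hhmmssff(ms, fps=25):
    """Convert milliseconds to an HH:MM:SS:FF timecode for an integer frame rate."""
    total_seconds, rem_ms = divmod(ms, 1000)
    frames = rem_ms * fps // 1000  # whole frames elapsed within the current second
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


# Hypothetical clip from 1 h 2 min 3.48 s to 1 h 2 min 6 s.
clip = {
    "StartTimecode": ms_to_hhmmssff(3723480),
    "EndTimecode": ms_to_hhmmssff(3726000),
}
print(clip)
```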

Now that we have our timecodes, we can follow the same steps as in the previous section to:
1. Create input clips using AWS Elemental MediaConvert
2. Add the audio narration and subtitles to create the final short-form video
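
Because the narration must stay silent while the intro clip plays, the SSML starts with a pause as long as the intro. A single `<break>` in Amazon Polly is limited to 10 seconds, so longer pauses are split into several tags. The following is a minimal sketch of that idea; `build_ssml` is a hypothetical helper, not the notebook's implementation.

```python
from xml.sax.saxutils import escape

MAX_BREAK_MS = 10_000  # longest pause a single <break> tag can express in Polly


def build_ssml(text, lead_silence_ms):
    """Wrap text in <speak>, preceded by lead_silence_ms of silence split into <=10 s breaks."""
    parts = ["<speak>"]
    remaining = lead_silence_ms
    while remaining > 0:
        chunk = min(remaining, MAX_BREAK_MS)
        parts.append(f'<break time="{chunk}ms"/>')
        remaining -= chunk
    # Escape XML special characters, including both quote styles.
    parts.append(escape(text, {'"': "&quot;", "'": "&apos;"}))
    parts.append("</speak>")
    return "".join(parts)


ssml_example = build_ssml("Tom & Jerry's <big> day", 25_000)
print(ssml_example)
```

A 25-second intro therefore becomes two 10-second breaks followed by one 5-second break.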

In [None]:
escaped_summarized_text = (
        summarized_text.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("'", "&apos;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )
ssml = "<speak>\n"
break_time = intro_time

while break_time > 10000:  # maximum break time in Polly is 10s
    ssml += '<break time="10000ms"/>'  # emit a full 10s pause per iteration
    break_time -= 10000
ssml += '<break time="' + str(break_time) + 'ms"/>'
ssml += escaped_summarized_text
ssml += "</speak>"

response = polly_client.synthesize_speech(
    Engine="neural",
    OutputFormat="mp3",
    Text=ssml,
    TextType="ssml",
    VoiceId=voice_id,
)

if "AudioStream" in response:
    with response["AudioStream"] as stream:
        audio_narration = stream.read()

In [None]:
s3_client.put_object(
    Body=audio_narration, Bucket=s3_bucket, Key=audio_narration_filename, ContentType="audio/mpeg"
)

In [None]:
video_summary_subtitle = ""
start = intro_time
for i in range(len(summarized_sentences)):
    end = start + speech_durations[i]
    video_summary_subtitle += f"{i+1}\n"
    video_summary_subtitle += f"{milliseconds_to_subtitleTimeFormat(start)} --> {milliseconds_to_subtitleTimeFormat(end)}\n"
    sentence_lines = split_long_lines(summarized_sentences[i], 90)
    for line in sentence_lines:
        video_summary_subtitle += f"{line}\n"
    video_summary_subtitle += "\n"
    start = end
s3_client.put_object(
    Body=video_summary_subtitle, Bucket=s3_bucket, Key=subtitle_filename
)
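
The subtitle file uses the SRT format: a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, and a blank line. The sketch below shows how consecutive cues can be laid out from sentence durations. `ms_to_srt` and `build_srt` are hypothetical stand-ins for the notebook's helpers, and line wrapping is omitted for brevity.

```python
def ms_to_srt(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(sentences, durations_ms, offset_ms=0):
    """Build back-to-back SRT cues, starting offset_ms into the video."""
    cues = []
    start = offset_ms
    for index, (sentence, duration) in enumerate(zip(sentences, durations_ms), start=1):
        end = start + duration
        cues.append(f"{index}\n{ms_to_srt(start)} --> {ms_to_srt(end)}\n{sentence}\n")
        start = end  # the next cue begins where this one ends
    return "\n".join(cues)


# Two hypothetical sentences following a 3-second intro.
srt_example = build_srt(["First line.", "Second line."], [2500, 4000], offset_ms=3000)
print(srt_example)
```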

In [None]:
def process_single_clip(media_convert, iam_role, input_video_path, output_video_path, timecode, clip_index):
    """
    Create a MediaConvert job to process a single video clip (one job per clip, because multiple input clippings within a single input need to be in chronological order)
    Args:
        media_convert - MediaConvert client
        iam_role - IAM role ARN for MediaConvert
        input_video_path - S3 path for input video
        output_video_path - S3 path for output video
        timecode - Dictionary containing start and end timecodes for the clip
        clip_index - Index number for the clip
        
    Returns:
        - job_id: MediaConvert job Id
        - clip_output: S3 path of the output clip
    """
    clip_output = f"{output_video_path}{video['output_dir']}_{clip_index}"
    
    response = media_convert.create_job(
        Queue="Default",
        UserMetadata={},
        Role=iam_role,
        Settings={
            "TimecodeConfig": {"Source": "ZEROBASED"},
            "OutputGroups": [
                {
                    "Name": "File Group",
                    "Outputs": [
                        {
                            "ContainerSettings": {
                                "Container": "MP4",
                                "Mp4Settings": {},
                            },
                            "VideoDescription": {
                                "CodecSettings": {
                                    "Codec": "H_264",
                                    "H264Settings": {
                                        "MaxBitrate": 40000000,
                                        "RateControlMode": "QVBR",
                                        "SceneChangeDetect": "TRANSITION_DETECTION",
                                    },
                                }
                            }
                        }
                    ],
                    "OutputGroupSettings": {
                        "Type": "FILE_GROUP_SETTINGS",
                        "FileGroupSettings": {"Destination": clip_output},
                    },
                }
            ],
            "Inputs": [
                {
                    "VideoSelector": {},
                    "TimecodeSource": "ZEROBASED",
                    "FileInput": video["url"],
                    "InputClippings": [timecode],
                }
            ],
        },
        AccelerationSettings={"Mode": "DISABLED"},
        StatusUpdateInterval="SECONDS_60",
        Priority=0,
    )
    
    return response["Job"]["Id"], clip_output

def wait_for_job(media_convert, job_id):
    """
    Wait for a MediaConvert job to complete
    Args:
        media_convert - MediaConvert client
        job_id - MediaConvert job Id
        
    Returns:
        bool - True if the job completed successfully, False if it ended in ERROR or CANCELED
    """
    while True:
        job_response = media_convert.get_job(Id=job_id)
        job_status = job_response['Job']['Status']
        print(f"MediaConvert job status: {job_status}")
        
        # COMPLETE, ERROR and CANCELED are terminal states; keep polling otherwise
        if job_status in ['COMPLETE', 'ERROR', 'CANCELED']:
            return job_status == 'COMPLETE'
        time.sleep(10)

In [None]:
clip_paths = []
t0 = time.time()

for i, timecode in enumerate(timecodes):
    print(f"Processing clip {i+1}/{len(timecodes)}")
    print(timecode)
    
    job_id, clip_output = process_single_clip(
        media_convert, 
        iam_role, 
        input_video_path, 
        output_video_path, 
        timecode, 
        i
    )

    if wait_for_job(media_convert, job_id):
        clip_paths.append(clip_output)
    else:
        print(f"Failed to process clip {i+1}")
        continue

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")

In [None]:
def merge_clips(media_convert, iam_role, clip_paths, output_video_path):
    """
    Merge multiple video clips into a single video
    Args:
        media_convert - MediaConvert client
        iam_role - IAM role ARN for MediaConvert
        clip_paths - List of S3 paths of video clips to merge
        output_video_path - S3 path for video output
        
    Returns:
        - job_id: MediaConvert job Id
        - merged_output: S3 path of the video output
    """
    inputs = []
    
    for clip_path in clip_paths:
        inputs.append({
            "VideoSelector": {},
            "TimecodeSource": "ZEROBASED",
            "FileInput": clip_path + ".mp4",
        })

    merged_output = f"{output_video_path}{video['output_dir']}"
    
    response = media_convert.create_job(
        Queue="Default",
        UserMetadata={},
        Role=iam_role,
        Settings={
            "TimecodeConfig": {"Source": "ZEROBASED"},
            "OutputGroups": [
                {
                    "Name": "File Group",
                    "Outputs": [
                        {
                            "ContainerSettings": {
                                "Container": "MP4",
                                "Mp4Settings": {},
                            },
                            "VideoDescription": {
                                "CodecSettings": {
                                    "Codec": "H_264",
                                    "H264Settings": {
                                        "MaxBitrate": 40000000,
                                        "RateControlMode": "QVBR",
                                        "SceneChangeDetect": "TRANSITION_DETECTION",
                                    },
                                }
                            }
                        }
                    ],
                    "OutputGroupSettings": {
                        "Type": "FILE_GROUP_SETTINGS",
                        "FileGroupSettings": {"Destination": merged_output},
                    },
                }
            ],
            "Inputs": inputs,
        },
        AccelerationSettings={"Mode": "DISABLED"},
        StatusUpdateInterval="SECONDS_60",
        Priority=0,
    )
    
    return response["Job"]["Id"], merged_output

def add_audio_subtitles(media_convert, iam_role, input_video_path, audio_file_path, subtitle_file_path, final_output_path):
    """
    Add audio narration and subtitle into the video
    Args:
        media_convert - MediaConvert client
        iam_role - IAM role ARN for MediaConvert
        input_video_path - S3 path for input video
        audio_file_path - S3 path for audio narration file
        subtitle_file_path - S3 path for SRT subtitle file
        final_output_path - S3 path for video output
        
    Returns:
        job_id: MediaConvert job Id
    """
    
    response = media_convert.create_job(
        Queue="Default",
        UserMetadata={},
        Role=iam_role,
        Settings={
            "TimecodeConfig": {"Source": "ZEROBASED"},
            "OutputGroups": [
                {
                    "Name": "File Group",
                    "Outputs": [
                        {
                            "ContainerSettings": {
                                "Container": "MP4",
                                "Mp4Settings": {},
                            },
                            "VideoDescription": {
                                "CodecSettings": {
                                    "Codec": "H_264",
                                    "H264Settings": {
                                        "MaxBitrate": 40000000,
                                        "RateControlMode": "QVBR",
                                        "SceneChangeDetect": "TRANSITION_DETECTION",
                                    },
                                }
                            },
                            "NameModifier": "_summary_v2",
                            "AudioDescriptions": [
                                {
                                    "AudioSourceName": "Audio Selector Group 1",
                                    "CodecSettings": {
                                        "Codec": "AAC",
                                        "AacSettings": {
                                            "Bitrate": 96000,
                                            "CodingMode": "CODING_MODE_2_0",
                                            "SampleRate": 48000,
                                        },
                                    },
                                }
                            ],
                            "CaptionDescriptions": [
                                {
                                    "CaptionSelectorName": "Captions Selector 1",
                                    "DestinationSettings": {
                                        "DestinationType": "BURN_IN",
                                        "BurninDestinationSettings": {
                                            "BackgroundOpacity": 100,
                                            "FontSize": 18,
                                            "FontColor": "WHITE",
                                            "ApplyFontColor": "ALL_TEXT",
                                            "BackgroundColor": "BLACK",
                                        },
                                    },
                                }
                            ],
                        }
                    ],
                    "OutputGroupSettings": {
                        "Type": "FILE_GROUP_SETTINGS",
                        "FileGroupSettings": {"Destination": final_output_path},
                    },
                }
            ],
            "Inputs": [
                {
                    "VideoSelector": {},
                    "TimecodeSource": "ZEROBASED",
                    "FileInput": input_video_path,
                    "AudioSelectors": {
                        "Audio Selector 1": {
                            "DefaultSelection": "NOT_DEFAULT",
                            "ExternalAudioFileInput": audio_file_path,
                        },
                    },
                    "AudioSelectorGroups": {
                        "Audio Selector Group 1": {
                            "AudioSelectorNames": ["Audio Selector 1"]
                        }
                    },
                    "CaptionSelectors": {
                        "Captions Selector 1": {
                            "SourceSettings": {
                                "SourceType": "SRT",
                                "FileSourceSettings": {"SourceFile": subtitle_file_path},
                            }
                        }
                    },
                }
            ],
        },
        AccelerationSettings={"Mode": "DISABLED"},
        StatusUpdateInterval="SECONDS_60",
        Priority=0,
    )
    
    return response["Job"]["Id"]

Merge all clips with audio and subtitles.

In [None]:
t0 = time.time()
merge_job_id, merged_output = merge_clips(
    media_convert, 
    iam_role, 
    clip_paths, 
    output_video_path
)

if wait_for_job(media_convert, merge_job_id):
    print("Successfully merged video clips")

    # Only add narration and subtitles once the merged video exists
    final_job_id = add_audio_subtitles(
        media_convert,
        iam_role,
        merged_output + ".mp4",
        audio_file_path,
        subtitle_file_path,
        output_video_path
    )

    if wait_for_job(media_convert, final_job_id):
        print("Successfully created final video with audio and subtitle")
    else:
        print("Failed to add audio and subtitle")
else:
    print("Failed to merge video clips; skipping the audio and subtitle step")

t1 = time.time()
print(f"  Elapsed time: {round(t1 - t0, 2)}s")

Here is our final video summary incorporating both visual and audio understanding. Compare it with our earlier version (based only on audio narration) to see the differences in the results.

In [None]:
video_summary_v2 = os.path.splitext(os.path.basename(video['path']))[0] + "_summary_v2.mp4"
s3_client.download_file(s3_bucket, video_summary_v2, video_summary_v2)
print("Short-form video with video and audio understanding\n")
print("=========================================\n")
display(Video(url=video_summary_v2, width=640, height=360, html_attributes="controls muted autoplay"))

In [None]:
print("Short-form video with audio understanding\n")
print("=========================================\n")
display(Video(url=video_summary, width=640, height=360, html_attributes="controls muted autoplay"))

## Clean up

In [None]:
# s3_client.delete_object(Bucket=s3_bucket, Key=audio_narration_filename)
# s3_client.delete_object(Bucket=s3_bucket, Key=video['path'])
# s3_client.delete_object(Bucket=s3_bucket, Key=subtitle_filename)
# s3_client.delete_object(Bucket=s3_bucket, Key=video_summary)
# s3_client.delete_object(Bucket=s3_bucket, Key=video_summary_v2)
# aoss_client.indices.delete(aoss_index)