# Audio segments using speech to text


The audio part of a video provides numerous clues about the structure and narrative of a video. Audio contains:



* Speech, which can be further broken down into different language segments such as words, punctuation, sentences or subtitles
* Sounds such as crowd noise at a sporting event, foley effects or background music
* Sound levels including silences

Figure 1, below, shows a clip of Meridian with the audio waveform that corresponds to the sound levels at different points in time in the video.

![Video with audio waveform](./static/images/01-meridian-audio-wide.png)
**Figure 1. Video with audio waveform**




In this notebook, you'll continue to explore techniques to decompose a video into segments using audio, focusing on the speech in the video. Specifically, you'll:

* Use Amazon Transcribe to extract the audio transcript and identify words, sentences, subtitles, and speech segments.
* Use a Foundation Model to identify segments of video that contain different conversation topics.
* Combine conversation topics with visual scenes to identify chapters in the video.

The outputs of this notebook will be used in the use case sections later in this workshop, for example Ad Break Detection and Contextual Ad Placement, Video Summarization, and more.


### Key terms and definitions

You can refer back to this section if you want the definition of the terms used in the notebook.

- **Transcript** - continuous sequence of words and punctuation representing the speech content of the audio track in a video.
- **Subtitle** - a segment of text representing the speech in a video that is meant to be displayed for viewers as they watch the video.
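- **Speech segment** - a distinct span of speech detected by Amazon Transcribe. Segment boundaries follow natural speech pauses, speaker changes, maximum segment duration limits, and punctuation.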
- **Conversation topic (aka topic)** - A summarization of a group of sentences that contain discussion about a similar topic.
- **WebVTT** - a file format used to store timed text track data, such as subtitles or captions, for video content on the web.

### Workflow

The purpose of this lab is to give hands-on practice working with the audio elements of video using speech to text and to practice prompt engineering with timed text derived from audio.  You will execute AWS service APIs from SageMaker Studio.  The workflow takes a video as input and produces a transcript, sentence segments, WebVTT format subtitles, and conversation topics outputs.

![Audio Workflow](./static/images/01-audio-workflow.png)

<div class="alert alert-block alert-info">
💡 Click on the list icon in the left navigation panel in this Jupyter notebook to see the outline of the notebook and where you currently are.
</div>

## Prerequisites

### Import python packages

In [None]:
from pathlib import Path
import os
import json
import boto3
import json_repair
from termcolor import colored
from IPython.display import JSON
from IPython.display import Video
from IPython.display import Pretty
from IPython.display import Image as DisplayImage
from lib.frames import VideoFrames
from lib.shots import Shots
from lib.scenes import Scenes
from lib.transcript import Transcript
from lib import util
import requests
from tqdm import tqdm
from typing import Tuple
import time


### Retrieve saved values from previous notebooks

To run this notebook, you need to have run the previous notebook: [01A-visual-segments-frames-shots-scenes.ipynb](01A-visual-segments-frames-shots-scenes.ipynb).

In [None]:
# Get variables from the previous notebook
%store -r

In [None]:
%store

# Generate a transcript, audio segments and WebVTT subtitles using Amazon Transcribe


## Transcript library

<div class="alert alert-block alert-info">
<b>Note</b>: The code for the Transcript library is located in the <b>lib</b> folder of this project if you want to dive deeper, but the objective of this exercise is to understand the concept of a transcript.
</div>

Click the link to open [lib/transcript.py](lib/transcript.py)

## Generate the transcript and other derived outputs

In [None]:
video['transcript'] = Transcript(video["path"], session['bucket'])

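The internals of the `Transcript` helper are not shown in this notebook. As a rough sketch of what requesting a transcript with WebVTT subtitles can look like, the snippet below builds the arguments for the boto3 `start_transcription_job` call. The function name `build_transcription_request`, the S3 locations, and the job naming scheme are assumptions for illustration, not the workshop's actual implementation.

```python
import re
import uuid


def build_transcription_request(media_uri, output_bucket, language_code="en-US"):
    """Build keyword arguments for Transcribe's start_transcription_job call."""
    # Job names may only contain letters, digits, '.', '_' and '-'.
    stem = media_uri.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    safe_stem = re.sub(r"[^0-9a-zA-Z._-]", "-", stem)
    return {
        "TranscriptionJobName": f"{safe_stem}-{uuid.uuid4().hex[:8]}",
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": language_code,
        "OutputBucketName": output_bucket,
        # Ask for WebVTT subtitles alongside the JSON transcript.
        "Subtitles": {"Formats": ["vtt"], "OutputStartIndex": 1},
    }


request = build_transcription_request(
    "s3://example-bucket/videos/meridian.mp4",  # hypothetical S3 location
    "example-bucket",                            # hypothetical output bucket
)
print(request["TranscriptionJobName"])

# With AWS credentials configured, the job could then be started and polled:
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(**request)
# job = transcribe.get_transcription_job(
#     TranscriptionJobName=request["TranscriptionJobName"])
# status = job["TranscriptionJob"]["TranscriptionJobStatus"]
```

Keeping the request construction separate from the API call makes the parameters easy to inspect before any AWS resources are used.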
## Examine the results from Amazon Transcribe
The response from Amazon Transcribe contains a `results` dictionary. It holds a text-only transcript and a collection of items, one for each word and punctuation mark in the transcript, along with a confidence score and timestamp for the item. The response also contains the same transcript formatted as subtitles in either WebVTT or SRT format. Let's take a look at these outputs.


##### Transcript JSON output

The transcript `results` attribute contains several interesting and useful outputs:
* **transcripts** - a list of alternative text-only transcripts for the video.  Our job is configured to generate only one alternative, but you can configure Amazon Transcribe to produce more if needed.  Alternatives are just different semantic interpretations of the speech in the video.
* **items** - a time series of `pronunciation` (aka word) and `punctuation` items that Amazon Transcribe inferred from the speech in the video.  Because this is AI inference, there is a _confidence score_ for each item. In Amazon Transcribe, confidence scores represent how confident the service is in the accuracy of each transcribed word.  Finally, there is a start and end time for each item, which you can use to align the timing of the items with other timestamped elements of the video.
* **audio_segments** - a list of distinct speech segments detected by Amazon Transcribe.  Segment boundaries are determined by:
    * Natural speech pauses
    * Speaker changes
    * Maximum segment duration limits
    * Punctuation

Amazon Transcribe also outputs the transcript formatted as subtitles in WebVTT or SRT format.

<div class="alert alert-block alert-info">
    ❓ <b>Ask Amazon Q Developer</b>: What are the attributes of items returned from Amazon Transcribe?
    ❓ <b>Ask Amazon Q Developer</b>: What are confidence scores in Amazon Transcribe?
</div>

Take a moment to examine each of these attributes from the sample video below.

In [None]:
JSON(filename=video['transcript'].transcript_file)

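To make the `items` structure more concrete, here is a minimal sketch that groups items into sentences and lists low-confidence words. The sample items are made up for illustration and only mimic the shape of the Transcribe output; the function name `items_to_sentences` and the 0.8 confidence threshold are assumptions, not part of the workshop library.

```python
def items_to_sentences(items):
    """Group Transcribe items into sentences with start/end times in seconds."""
    sentences, words, start, end = [], [], None, None
    for item in items:
        content = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if start is None:
                start = float(item["start_time"])
            end = float(item["end_time"])
            words.append(content)
        else:
            # Punctuation items carry no timestamps; attach them to the previous word.
            if words:
                words[-1] += content
            if content in {".", "?", "!"} and words:
                sentences.append({"text": " ".join(words), "start": start, "end": end})
                words, start, end = [], None, None
    if words:  # trailing words without closing punctuation
        sentences.append({"text": " ".join(words), "start": start, "end": end})
    return sentences


# Hypothetical items in the shape Amazon Transcribe returns (values are strings).
sample_items = [
    {"type": "pronunciation", "start_time": "0.5", "end_time": "0.9",
     "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": ","}]},
    {"type": "pronunciation", "start_time": "1.0", "end_time": "1.4",
     "alternatives": [{"confidence": "0.97", "content": "detective"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": "."}]},
    {"type": "pronunciation", "start_time": "2.0", "end_time": "2.3",
     "alternatives": [{"confidence": "0.62", "content": "Sit"}]},
    {"type": "pronunciation", "start_time": "2.3", "end_time": "2.6",
     "alternatives": [{"confidence": "0.95", "content": "down"}]},
]

sentences = items_to_sentences(sample_items)

# Words the service was less sure about (threshold chosen arbitrarily).
low_confidence = [
    item["alternatives"][0]["content"]
    for item in sample_items
    if item["type"] == "pronunciation"
    and float(item["alternatives"][0]["confidence"]) < 0.8
]

print(sentences)
print(low_confidence)
```

Because every sentence keeps the start time of its first word and the end time of its last word, the result can be lined up with other timestamped elements such as shots.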
### WebVTT Subtitles

Let's take a look at the first few lines of the WebVTT formatted subtitles generated by Amazon Transcribe.  WebVTT subtitles can be used in video players to display the speech in the video as text on the screen.

In [None]:
!head -n 20 {video['transcript'].vtt_file}

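A WebVTT file is plain text: a `WEBVTT` header followed by cues separated by blank lines, where each cue has an optional identifier, a timing line, and the text to display. The sketch below parses that structure with the standard library only. The sample subtitle text is invented, the name `parse_cues` is hypothetical, and the pattern assumes timestamps always include the hours field (`HH:MM:SS.mmm`).

```python
import re

# Timing line of a cue, e.g. "00:00:35.000 --> 00:00:37.500"
CUE_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}\.\d{3})"
)


def parse_cues(vtt_text):
    """Return a list of {'start', 'end', 'text'} dicts from WebVTT text."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                cues.append({
                    "start": match.group(1),
                    "end": match.group(2),
                    # Everything after the timing line is the cue text.
                    "text": " ".join(lines[i + 1:]).strip(),
                })
                break
    return cues


# Invented sample; real files come from Amazon Transcribe.
sample_vtt = """WEBVTT

1
00:00:35.000 --> 00:00:37.500
Hello, detective.

2
00:00:38.000 --> 00:00:40.250
Sit down.
"""

cues = parse_cues(sample_vtt)
for cue in cues:
    print(cue["start"], cue["end"], cue["text"])
```

The header block contains no timing line, so it is skipped without special handling.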
### Play subtitles with the video

Finally, let's view the video with the generated subtitle track.  Note: we used the shot information to start the video at the shot where the first speech occurs in the video.

In [None]:
# Play the video with subtitles

from IPython.display import HTML

# Shot 8 is the shot where the first speech occurs in this video
start = video['shots'].shots[8]['start_ms']/1000
end = video['shots'].shots[8]['end_ms']/1000
# The #t= media fragment starts playback at the given time in seconds
speech_shot_url = f'{video["url"]}#t={start}'

video_html = f"""
<video width="640" height="360" controls>
    <source src="{speech_shot_url}" type="video/mp4">
    <track src="{video['transcript'].vtt_file}" kind="captions" srclang="en" label="English" default>
</video>
"""

HTML(video_html)

# Generate conversation topics using Amazon Bedrock

In this next section, you will use generative AI to understand the conversation topics that are occurring over time in the video transcript.  This is a text summarization task that can be performed by several Foundation Models.  

<div class="alert alert-block alert-info">
💡 We'll be using Anthropic Claude 3 Sonnet on Amazon Bedrock for this and all the other generative AI tasks throughout this workshop.  We chose Anthropic Claude 3 Sonnet because of its flexibility to perform a variety of tasks on multimodal (image and text) inputs.  In practice, you may want to substitute different FMs for different tasks based on your requirements.  For example, you may find that Anthropic Claude Haiku produces adequate results at a lower price point for a specific use case.
</div>

You will pass the transcript to the Anthropic Claude 3 Sonnet model on Amazon Bedrock. The model analyzes the transcript and suggests conversational topic points in a specific JSON format. In the prompt, you specify that each topic should contain a start and end timestamp along with a reason describing the topic. The prompts for the Sonnet model are shown below.  Note that this prompt uses the [Anthropic Claude Messages API](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-anthropic-claude-messages.html):

**System prompt**

```
You are a media operation assistant who analyses movie transcripts in WebVTT 
format and suggest topic points based on the topic changes in the conversations. 
It is important to read the entire transcript.
```


**Messages**

```
[
    {
        'content': 'Here is the transcripts in <transcript> tag:\n'
                '<transcript>{transcript}\n</transcript>\n',
        'role': 'user'
    },
    {
        'content': 'Got the transcript. What output format?',
        'role': 'assistant'
    },
    {
        'content': 'JSON format. An example of the output:\n'
                '{"topics": [{"start": "00:00:10.000", "end": "00:00:32.000", '
                '"reason": "It appears the topic talks about..."}]}\n',
        'role': 'user'
    },
    {
        'content': '{', 'role': 'assistant'
    }
 ]
```



<div class="alert alert-block alert-info">
    ❓ <b>Ask Amazon Q Developer</b>: What are the inputs for the Anthropic Claude Messages API?
    <br></br>* You may have noticed the Amazon Q Developer menu icon in the side bar menu of this Jupyter notebook.  Anytime you have a question about the code in the notebook or about AWS APIs, try asking Amazon Q.
</div>

#### The code below constructs the prompt for Amazon Bedrock and then calls the Amazon Bedrock API to execute the prompt.

In [None]:
from lib import bedrock_helper as brh
from lib import util

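MODEL_ID = 'anthropic.claude-3-sonnet-20240229-v1:0'
MODEL_VER = 'bedrock-2023-05-31'
# (input, output) price in USD per 1,000 tokens for Claude 3 Sonnet;
# confirm against the current Amazon Bedrock pricing page
CLAUDE_PRICING = (0.003, 0.015)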

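def analyze_conversations(vtt_file):
    """
      Builds the topic detection prompt from a WebVTT file and runs it on Amazon Bedrock.
      Args:
         vtt_file - path to the WebVTT subtitle file
      Returns:
         dict with the 'model_params' that were sent and the parsed model 'response'
    """

    response = {}
    messages = []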

    # transcript
    transcript_message = make_transcript(vtt_file)
    messages.append(transcript_message)

    # output format?
    messages.append({
        'role': 'assistant',
        'content': 'Got the transcript. What output format?'
    })

    # example output
    example_message = make_conversation_example()
    messages.append(example_message)

    # prefill output
    messages.append({
        'role': 'assistant',
        'content': '{'
    })

    ## system prompt to role play
    system = '''
    You are a media operation assistant who analyses movie transcripts in WebVTT format
    and suggest topic points based on the topic changes in the conversations. 
    It is important to read the entire transcript.
    '''

    ## setting up the model params
    model_params = {
        'anthropic_version': MODEL_VER,
        'max_tokens': 4096,
        'temperature': 0.1,
        'top_p': 0.7,
        'top_k': 20,
        'stop_sequences': ['\n\nHuman:'],
        'system': system,
        'messages': messages
    }

    response['model_params'] = model_params
    try:
        response['response'] = inference(model_params)
    except Exception as e:
        print(colored(f"ERR: inference: {str(e)}\n RETRY...", 'red'))
        response['response'] = inference(model_params)
    return response



def make_conversation_example():
    example = {
        'topics': [
            {
                'start': '00:00:10.000',
                'end': '00:00:32.000',
                'reason': 'It appears the topic talks about...'
            }
        ]
    }

    return {
        'role': 'user',
        'content': 'JSON format. An example of the output:\n{0}\n'.format(json.dumps(example))
    }

def make_transcript(vtt_file):
    with open(vtt_file, encoding="utf-8") as f:
        transcript = f.read()
    
    return {
        'role': 'user',
        'content': 'Here is the transcripts in <transcript> tag:\n<transcript>{0}\n</transcript>\n'.format(transcript)
    }

def make_conversation_message(text):
    message = {
        'role': 'user',
        'content': 'No conversation.'
    }

    if text:
        message['content'] = 'Here is the conversation of the scene in <conversation> tag.\n<conversation>\n{0}\n</conversation>\n'.format(text)

    return message

def chapters_to_vtt(chapters, output_file):
    """
      Constructs a webvtt caption file based on the timestamps from the given chapters.
      Args:
         chapters - the topic points
         output_file - output file where the caption webvtt content is stored.
      Returns:
         None
    """
    vtt_lines = 'WEBVTT\n\n'
    for idx, chapter in enumerate(chapters):
        # A blank line must end each cue; otherwise the next cue's identifier
        # is read as part of this cue's text.
        line = f"{idx}\n{chapter['start']} --> {chapter['end']}\n{chapter['reason']}\n\n"
        vtt_lines = vtt_lines+line

    util.save_to_file(output_file, vtt_lines)
    return

def inference(model_params):
    model_id = MODEL_ID
    accept = 'application/json'
    content_type = 'application/json'

    bedrock_runtime_client = boto3.client(service_name='bedrock-runtime')

    response = bedrock_runtime_client.invoke_model(
        body=json.dumps(model_params),
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    response_body = json.loads(response.get('body').read())

    # The prompt prefills the assistant turn with '{', so the model's reply
    # continues from there; restore the opening brace before parsing
    response_content = response_body['content'][0]['text']
    if not response_content.startswith('{'):
        response_content = '{' + response_content

    try:
        response_content = json.loads(response_content)
    except Exception as e:
        print(colored("Malformed JSON response. Try to repair it...", 'red'))
        try:
            response_content = json_repair.loads(response_content, strict=False)
        except Exception as e:
            print(colored("Failed to repair the JSON response...", 'red'))
            print(colored(response_content, 'red'))
            raise e

    response_body['content'][0]['json'] = response_content
    response_body['model_params'] = model_params

    return response_body


#### Execute the prompt and examine the output

The model will return a result formatted the way that was specified using sample output in the prompt `messages`.  In this case it is a list of topics in the video.  Each topic has:

* **start** - the start time of the topic relative to the start of the video in HH:MM:SS.mmm format
* **end** - the end time of the topic relative to the start of the video in HH:MM:SS.mmm format
* **reason** - the summary of the conversation in the time range between `start` and `end`

Run the next cells to execute the prompt and look at the resulting topics.

In [None]:
conversations_response = analyze_conversations(video['transcript'].vtt_file)
video['topics'] = conversations_response['response']['content'][0]['json']['topics']

# show the conversation cost
conversation_cost = brh.display_conversation_cost(conversations_response['response'])

In [None]:
display(JSON(video['topics'], root='topics'))

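The `start` and `end` values are strings, while the shot metadata used earlier in this notebook stores `start_ms` and `end_ms` in milliseconds. To compare the two, a small conversion helper is useful. The functions below are a sketch with hypothetical names, assuming timestamps always follow the `HH:MM:SS.mmm` pattern requested in the prompt.

```python
def timestamp_to_ms(timestamp):
    """Convert 'HH:MM:SS.mmm' to integer milliseconds."""
    hours, minutes, seconds = timestamp.split(":")
    whole, _, frac = seconds.partition(".")
    # Pad or truncate the fractional part to exactly three digits.
    millis = int(frac.ljust(3, "0")[:3])
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + millis


def ms_to_timestamp(ms):
    """Convert integer milliseconds back to 'HH:MM:SS.mmm'."""
    seconds, millis = divmod(int(ms), 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


# The example topic from the prompt above
topic = {"start": "00:00:10.000", "end": "00:00:32.000",
         "reason": "It appears the topic talks about..."}

duration_ms = timestamp_to_ms(topic["end"]) - timestamp_to_ms(topic["start"])
print(duration_ms)
print(ms_to_timestamp(62500))
```

With both values in milliseconds, a topic can be compared directly against the `start_ms` and `end_ms` of a shot.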
#### Finally, let's take one last look at the actual prompt passed to Anthropic Claude with all the parameters filled in.

The `system` prompt outlines the task and constraints for Anthropic Claude, while the `messages` part of the prompt models a conversation with the FM.

In [None]:
JSON(conversations_response['model_params'], root='prompt')

## Generate "topic points" 

To ensure the model's output accurately reflects the original transcript, the output JSON is post-processed to merge any overlapping chapter timestamps and align the chapter boundaries with the actual caption timestamps from the WebVTT file. 

In [None]:
from lib import topics

# merge overlapped conversation timestamps
video['topics'] = topics.merge_topics(video['topics'])

# validate the conversation timestamps against the caption timestamps
captions = topics.parse_webvtt(video['transcript'].vtt_file)
video['topics'] = topics.validate_timestamps(video['topics'], captions)

# save the conversations
util.save_to_file(os.path.join(video["output_dir"], 'topics.json'), video['topics'])

In [None]:
JSON(video['topics'], root='topics')

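The `topics` helper library is not shown in this notebook. As a rough sketch of the post-processing described in this section, the snippet below merges overlapping topic ranges and then widens each topic to the nearest enclosing caption boundaries. The function names and sample data are hypothetical, and the code relies on the assumption that all timestamps are zero-padded `HH:MM:SS.mmm` strings, which sort correctly as plain text. The real `merge_topics` and `validate_timestamps` may behave differently.

```python
def merge_overlapping(topics):
    """Merge topics whose time ranges overlap, keeping earliest start and latest end."""
    merged = []
    for topic in sorted(topics, key=lambda t: t["start"]):
        if merged and topic["start"] < merged[-1]["end"]:
            last = merged[-1]
            last["end"] = max(last["end"], topic["end"])
            last["reason"] = f"{last['reason']} {topic['reason']}"
        else:
            merged.append(dict(topic))  # copy so the input is not modified
    return merged


def snap_to_captions(topics, captions):
    """Widen each topic so it starts and ends on caption boundaries."""
    snapped = []
    for topic in topics:
        # Latest caption start at or before the topic start
        starts = [c["start"] for c in captions if c["start"] <= topic["start"]]
        # Earliest caption end at or after the topic end
        ends = [c["end"] for c in captions if c["end"] >= topic["end"]]
        snapped.append({
            **topic,
            "start": max(starts) if starts else topic["start"],
            "end": min(ends) if ends else topic["end"],
        })
    return snapped


# Hypothetical model output and caption timings
sample_topics = [
    {"start": "00:00:36.000", "end": "00:01:10.000", "reason": "First topic."},
    {"start": "00:01:05.000", "end": "00:01:40.000", "reason": "Second topic."},
    {"start": "00:02:00.500", "end": "00:02:30.000", "reason": "Third topic."},
]
sample_captions = [
    {"start": "00:00:35.000", "end": "00:00:37.500"},
    {"start": "00:01:38.000", "end": "00:01:41.000"},
    {"start": "00:02:00.000", "end": "00:02:04.000"},
    {"start": "00:02:28.000", "end": "00:02:31.250"},
]

aligned = snap_to_captions(merge_overlapping(sample_topics), sample_captions)
for topic in aligned:
    print(topic["start"], topic["end"], topic["reason"])
```

In this sample the first two topics overlap and become one, and every boundary moves outward to a caption start or end. Widening can reintroduce small overlaps between neighbors, so a production version might merge once more after snapping.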
## Visualize the topics
Finally, visualize the topics with the video.  We'll output the chapter summaries as a WebVTT file and play them with the video.

<div class="alert alert-block alert-info">
Note: the topics are distributed across the entire 12-minute video.  You can skip through the video by clicking on different points on the player timeline rather than playing the whole video.  See if you can find all 5 topics.
</div>


In [None]:
video['topics_vtt'] = os.path.join(video['frames'].video_asset_dir(), "topics.vtt")
chapters_to_vtt(video['topics'], video['topics_vtt'])

In [None]:
# Play the video with topic summaries

from IPython.display import HTML

# Start playback at shot 8, where the first speech occurs in this video
start = video['shots'].shots[8]['start_ms']/1000
end = video['shots'].shots[8]['end_ms']/1000
# The #t= media fragment starts playback at the given time in seconds
speech_shot_url = f'{video["url"]}#t={start}'

video_html = f"""
<video width="640" height="360" controls>
    <source src="{speech_shot_url}" type="video/mp4">
    <track src="{video['topics_vtt']}" kind="captions" srclang="en" label="English" default>
</video>
"""

HTML(video_html)

You just used generative AI to make topic summaries of a video.  Neat trick!

🤔 Can you think of other use cases for summarizing different time segments of audio or video content?
<br></br>
🤔 How would you increase the number of topics found by the Foundation Model?


# Store the video metadata so it can be used in the rest of the workshop.

In [None]:
%store video

# What's Next?

Now that you have created some basic segmentations and prompts against video clips, you are ready to apply these techniques to solve some use cases.  From here, you can choose a use case to explore further.

* [Ad break detection and contextual ad targeting](02-ad-breaks-and-contextual-ad-targeting.ipynb) - identify opportunities for ad insertion.  Use a standard taxonomy to match video content to ad content.
* [Video summarization](03-video-summarization.ipynb) - generate short form videos from a longer video
* [Semantic video search](04-semantic-video-search.ipynb) - search video using images and natural language to find relevant clips
