# Visual Understanding

## 🎯 AI-Powered Video Understanding with Adaptive Filmstrip Processing

### What is Visual Understanding in Video?

Visual understanding means teaching AI to "see" and comprehend video content just like humans do. When you watch a video, you naturally understand:
- **What's happening** in each video frame
- **Who** the characters are and what they're doing
- **Where** the action takes place
- **How** the story flows from one moment to the next
- **Why** the visual transitions between shots matter

### Visual Transitions: Understanding Shot Changes

Videos are made up of many **shots** - individual camera angles or scenes. When the video cuts from one shot to another (like from a close-up of a person's face to a wide view of a room), this is called a **shot change**. These transitions are a natural part of storytelling and help us understand:
- **Shot boundaries** - where one part of the story ends and another begins
- **Narrative flow** - how the story is structured and paced
- **Important moments** - shot changes often highlight key story points

For AI to truly understand video, it needs to recognize these natural transitions just like we do.

### Live Stream Visual Understanding

Now imagine applying visual understanding to **live streams** - video content that's happening in real-time. This could be:
- **Live broadcasts** - news, sports, events happening now
- **Video calls** - meetings, interviews, conversations
- **Streaming content** - live shows, gaming, tutorials
- **Security feeds** - monitoring, surveillance, safety systems

Live stream visual understanding means AI can analyze and comprehend video content **as it happens**, providing real-time insights about what's being seen. This opens up powerful possibilities like automatic content moderation, identifying key moments, instant highlight detection, and live content summarization.

### The Multi-Dimensional Challenge

But live stream visual understanding faces challenges across multiple dimensions:

**Data & Performance Challenges:**
- **Too Much Data**: Videos can contain a huge number of frames per minute (1,800 at 30 FPS)
- **Real-time Speed**: We need fast processing without delays
- **Storage Limits**: Can't store every single frame or handle high-resolution content (such as 4K)
- **Quality vs Quantity**: Need to balance detail with coverage

**Model Constraints:**

AI models like Claude have strict limits:
- **File Size**: Maximum 3.75MB per image
- **Dimensions**: Maximum 8000×8000 pixels
- **Image Count**: Limited number of images per analysis (20)
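
These limits can be checked before an image is sent. The following is a minimal sketch of such a pre-flight check. The helper name `fits_model_limits` and the constant names are hypothetical (they are not part of the workshop code), and the byte limit assumes 1 MB = 1024 × 1024 bytes.

```python
# Limits taken from the list above. Treating 1 MB as 1024*1024 bytes is an assumption.
MAX_IMAGE_BYTES = int(3.75 * 1024 * 1024)
MAX_IMAGE_SIDE = 8000
MAX_IMAGES_PER_REQUEST = 20


def fits_model_limits(width, height, size_bytes):
    """Return True if a single image respects the size and dimension limits."""
    return (
        width <= MAX_IMAGE_SIDE
        and height <= MAX_IMAGE_SIDE
        and size_bytes <= MAX_IMAGE_BYTES
    )


def fits_request_limits(images):
    """images: list of (width, height, size_bytes) tuples for one request."""
    return len(images) <= MAX_IMAGES_PER_REQUEST and all(
        fits_model_limits(w, h, b) for w, h, b in images
    )


# A 6400x2880 grid of about 2 MB passes; an 8001-pixel-wide image does not.
print(fits_model_limits(6400, 2880, 2 * 1024 * 1024))
print(fits_model_limits(8001, 2880, 2 * 1024 * 1024))
```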

### Our Solution: Adaptive Filmstrip Processing

We address these challenges with a smart technique:

- **🎬 Film Grid**: Pack multiple video frames into organized grid layouts (like a photo collage)
- **📊 Multiple Film Grids**: Create several grid images to cover the entire video
- **🔍 Shot Change Detection**: Automatically identify natural shot transitions
- **⚡ Adaptive Processing**: Automatically tune the packaging to the source characteristics so that the output fits AI limits while maintaining accuracy

**Let's see how this works in practice!** 🎬

## 1. Configuration Parameters

**Set all processing parameters** - These settings control how the video gets prepared for AI analysis.

In [None]:
# Video Configuration
VIDEO_FILE = '../sample_videos/netflix-2mins.mp4'
OUTPUT_PREFIX = 'output/filmstrip'
OUTPUT_DIR = 'output'

# Grid Configuration
MAX_GRID_WIDTH = 8000
MAX_GRID_HEIGHT = 8000
MAX_GRID_IMAGES = 20
FIXED_GRID_ROWS = 4
FIXED_GRID_COLS = 5
PRESERVE_SOURCE_RESOLUTION = True
MAX_FILE_SIZE_MB = 2.0  # Maximum file size per grid (None = no limit)

# Duration Control
START_TIME = 0.0
PROCESS_DURATION = None  # None = Process entire video

# Visual Configuration
BORDER_THICKNESS = 8
LABEL_HEIGHT = 40
BORDER_COLOR = 'red'
LABEL_BG_COLOR = 'black'
LABEL_TEXT_COLOR = 'white'

# Shot Change Detection
ENABLE_SHOT_DETECTION = True
SHOT_DETECTION_THRESHOLD = 0.3  # Lower = more sensitive (0.0-1.0)

# Bedrock Configuration
CLAUDE_MODEL_ID = 'global.anthropic.claude-sonnet-4-20250514-v1:0'
MAX_GRIDS_TO_ANALYZE = 5
AWS_REGION = 'us-east-1'
print('✅ Configuration loaded')

## 2. Import Libraries

**Load required modules** for video processing, AI analysis, and visualization.


In [None]:
import sys
import os
import base64
import json
import cv2
from IPython.display import Video, display

sys.path.insert(0, '../src')
from shared.filmstrip_processor import AdaptiveFilmstripProcessor
from shared.shot_change_detector import create_fusion_detector

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import boto3

os.makedirs(OUTPUT_DIR, exist_ok=True)
print('✅ Imports successful')

## 3. Preview Source Video

**Check the video** before processing. This shows you basic video information and lets you watch it.

### Source Video

For this workshop, we will use **Meridian** (2016), a mystery title from [Netflix](https://opencontent.netflix.com/#h.fzfk5hndrb9w). This video is available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode).


In [None]:
# Check if video file exists
if os.path.exists(VIDEO_FILE):
    # Get video properties
    cap = cv2.VideoCapture(VIDEO_FILE)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    duration = frame_count / fps if fps > 0 else 0
    cap.release()
    
    print('📹 Video Information:')
    print(f'   File: {VIDEO_FILE}')
    print(f'   Duration: {duration:.1f}s ({duration/60:.1f} minutes)')
    print(f'   Resolution: {width}×{height}px')
    print(f'   FPS: {fps:.1f}')
    print(f'   Total Frames: {frame_count:,}')
    # Clamp the end of the processing window to the video duration
    end_time = min(duration, START_TIME + (PROCESS_DURATION or duration))
    print(f'\n   Processing: {START_TIME:.1f}s to {end_time:.1f}s')
    
    # Display video player
    print('\n🎬 Video Player:')
    display(Video(VIDEO_FILE, width=800, height=450))
else:
    print(f'❌ Video file not found: {VIDEO_FILE}')
    print('   Please check the VIDEO_FILE path in the configuration cell.')

### What We're Looking At

Before we start, let's understand some of the key video information:
- The **duration** tells us how long the video is - longer videos will take more time to process.
- The **resolution** shows the video quality (like 1080p or 720p) - higher quality videos create larger files and may take longer to process. 
- **FPS (frames per second)** tells us how smooth the video is - most videos are 24-30 FPS. 
- The **total frames** shows how much data we have to work with.

## 4. Create Adaptive Filmstrip Processor

**Set up the adaptive processor** that automatically prepares optimal filmstrip grid images from the video for AI analysis.

### 🎯 How This Works

Think of this like creating a photo collage from a long video. Here's what happens:

**Sample Video:**
- 2 minutes long (120 seconds)
- HD quality (1280×720)
- 30 frames per second
- Total: 3,600 individual frames

**The Challenge:**
- AI can only handle images up to 3.75MB
- Images can't be bigger than 8000√ó8000 pixels
- We want to analyze as much video as possible

**Adaptive Filmstrip Process:**

1. **Grid Layout Determination**
   - Use the grid matrix provided by the user, if any (this workshop uses a user-defined 4×5 matrix)
   - Otherwise, calculate the largest matrix that fits: AI limit (8000×8000) ÷ source resolution (1280×720) = at most 6 columns × 11 rows
   - Example: a 4×5 grid matrix (20 frames per image)

2. **Frame Packaging Capacity**
   - AI image limit: 20 images maximum
   - Grid capacity: 4×5 = 20 frames per image
   - Total packageable frames: 20 images × 20 frames = 400 frames

3. **Size Optimization**
   - Downscale overall grid image to fit 2MB limit using compression formulas
   - Balance quality with file size constraints

4. **Frame Sampling Strategy**
   - Source video: 120 seconds × 30 FPS = 3,600 total frames
   - Sampling interval: 3,600 frames ÷ 400 packageable frames = 9
   - Extract every 9th frame for comprehensive coverage
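
The arithmetic in steps 1, 2, and 4 can be reproduced in a few lines. This is only a sketch of the calculation described above, not the processor's actual code: the function name `plan_filmstrip` is hypothetical, borders and labels are ignored when computing the largest grid, and rounding the interval up (so the selected frames never exceed capacity) is an assumption.

```python
def plan_filmstrip(duration_s, fps, src_w, src_h, rows, cols,
                   max_side=8000, max_images=20):
    """Reproduce the layout and sampling arithmetic for a fixed rows x cols grid."""
    total_frames = int(duration_s * fps)

    # Largest grid that fits the dimension limit at source resolution
    # (ignores borders and timestamp labels).
    max_cols = max_side // src_w
    max_rows = max_side // src_h

    frames_per_grid = rows * cols
    capacity = frames_per_grid * max_images

    # Ceiling division: keep every Nth frame so at most `capacity` frames remain.
    interval = max(1, (total_frames + capacity - 1) // capacity)

    return {
        'total_frames': total_frames,
        'max_cols': max_cols,
        'max_rows': max_rows,
        'frames_per_grid': frames_per_grid,
        'capacity': capacity,
        'sampling_interval': interval,
    }


# Sample video from this section: 120 s, 30 FPS, 1280x720, 4x5 grid.
plan = plan_filmstrip(120, 30, 1280, 720, rows=4, cols=5)
print(plan)
```

For the sample video this yields 3,600 total frames, a capacity of 400 frames, and a sampling interval of 9, matching the numbers above.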

**The Result:**
- 20 frames arranged in a 4×5 matrix per grid image
- 20 grid images created covering 400 frames in total
- Each image: ~2MB (fits AI limits perfectly)
- Sample interval: 9 (pick every 9th frame)
- Timestamp footer added in each cell
- Scene changes automatically detected
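
How the processor reaches the ~2MB target is not shown in this notebook. One common first approximation, assumed here purely for illustration, is that encoded file size grows roughly in proportion to pixel count, so the per-side scale factor is the square root of the target-to-current size ratio. The helper `estimate_scale` is hypothetical.

```python
import math


def estimate_scale(current_bytes, target_bytes):
    """Estimate a per-side scale factor, assuming size is proportional to pixel count."""
    if current_bytes <= target_bytes:
        return 1.0  # already small enough, no downscaling needed
    return math.sqrt(target_bytes / current_bytes)


# Example: a grid image that encodes to 8 MB, with a 2 MB target.
scale = estimate_scale(8 * 1024 * 1024, 2 * 1024 * 1024)
new_width, new_height = int(6400 * scale), int(2880 * scale)
print(scale, new_width, new_height)
```

In practice an estimate like this is only a starting point: the image would be re-encoded and measured, and the scale reduced again if the file is still too large.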

### 💡 Why This Matters

Instead of manually figuring out sizes, sampling rates, and formats, the processor does all the math automatically. You just set your preferences, and it handles the complex process of creating filmstrip grid images from the given video!

In [None]:
# Create shot change detector if enabled
shot_detector = None
if ENABLE_SHOT_DETECTION:
    shot_detector = create_fusion_detector(threshold=SHOT_DETECTION_THRESHOLD)
    print(f'‚úÖ Shot detector created (threshold={SHOT_DETECTION_THRESHOLD})')

processor = AdaptiveFilmstripProcessor(
    max_grid_size=(MAX_GRID_WIDTH, MAX_GRID_HEIGHT),
    max_grid_images=MAX_GRID_IMAGES,
    fixed_grid_layout=(FIXED_GRID_ROWS, FIXED_GRID_COLS),
    preserve_source_resolution=PRESERVE_SOURCE_RESOLUTION,
    max_file_size_mb=MAX_FILE_SIZE_MB,
    border_thickness=BORDER_THICKNESS,
    label_height=LABEL_HEIGHT,
    border_color=BORDER_COLOR,
    label_bg_color=LABEL_BG_COLOR,
    label_text_color=LABEL_TEXT_COLOR,
    shot_detector=shot_detector
)

print('‚úÖ Processor created')

## 5. Process Video with the Adaptive Filmstrip Processor

**Run the adaptive processor** - This is where the video gets transformed into grid images for AI analysis!

### What Happens Now

When you run this step, the system automatically:

1. **Analyzes the video** - Looks at duration, resolution, and frame rate
2. **Calculates the optimal packaging** - Determines the grid layout matrix, number of images, and frame sampling based on the video information and AI limits
3. **Picks the frames** - Selects representative frames throughout the video
4. **Adds frame timestamps** - Labels each frame with its exact timestamp in the video
5. **Creates grid images** - Packs multiple frames into organized collages
6. **Detects shot changes** - Finds where the video transitions between shots

💡 **The Result**: Multiple grid images ready for AI analysis, each containing 20 video frames with timestamps, along with the shot change information.

In [None]:
result = processor.create_adaptive_filmstrips(
    video_file=VIDEO_FILE,
    output_prefix=OUTPUT_PREFIX,
    start_time=START_TIME,
    process_duration=PROCESS_DURATION,
    detect_shot_changes=ENABLE_SHOT_DETECTION
)

layout = result['layout']
output_files = result['output_files']

print(f'\n✅ Created {len(output_files)} filmstrip grids')
print(f'   Frames per grid: {layout["frames_per_grid"]}')
print(f'   Total frames: {layout["frames_to_extract"]}')

## 6. Review Adaptive Filmstrip Processor Output

**Check what the processor has created** - Let's see how the video was prepared for AI analysis.

### Understanding the Results

The results show you exactly what happened to the video:

- **Grid Layout**: How frames are arranged (like 4×5 = 20 frames per image)
- **Frame Size**: The dimensions of each individual frame in the grid
- **Frames Extracted**: How many frames were selected vs. total available
- **Sampling Rate**: The pattern used (like "every 10th frame")
- **Coverage**: What percentage of the video is included
- **Shot Changes**: How many frame transitions were found

💡 **Why This Matters**: These numbers help you understand the balance between video coverage and AI limits.

In [None]:
print('\nLayout Summary:')
print(f'  Grid: {layout["grid_rows"]}×{layout["grid_cols"]}')
print(f'  Frame size: {layout["cell_size"][0]}×{layout["cell_size"][1]}px')
print(f'  Frames extracted: {layout["frames_to_extract"]}/{layout["total_frames"]}')
sampling_text = f'Every {layout["sampling_rate"]} frames' if layout["sampling_rate"] > 1 else 'No sampling'
print(f'  Sampling: {sampling_text}')
print(f'  Coverage: {(layout["frames_to_extract"]/layout["total_frames"])*100:.1f}%')

# Display shot change summary
if ENABLE_SHOT_DETECTION:
    total_shots = sum(len(sc['shot_changes']) for sc in result['shot_changes'])
    print(f'\n  Shot changes detected: {total_shots}')

## 7. Visualize Filmstrip Grids

**View filmstrip grids** - See how the entire video is intelligently packed into multiple grid images for AI analysis.

### Multiple Grid Images Created

The system creates **multiple grid images** (not just one), each containing:

- **20 frames per grid** arranged in a 4×5 layout
- **Smart packing** to fit AI model limits (under 2MB each)
- **Sequential coverage** - Grid 1 → Grid 2 → Grid 3 covers the entire video
- **Timestamps and position labels** for precise frame referencing
- **Optimized file sizes** that balance quality with AI constraints

### Why Multiple Grids Work Well

Instead of trying to fit everything in one oversized image, we create multiple optimized grids:
- **Comprehensive coverage** of the entire video
- **AI model compatibility** - each grid fits within size and dimension limits
- **Efficient processing** - each image carries 20 frames, so the video needs 20× fewer images than sending frames individually
- **Perfect for referencing** specific moments across the full video

💡 **Smart approach**: Multiple grids = complete video coverage within AI limits!

In [None]:
# Display the first filmstrip grid with a yellow box highlighting the focused frame
for i, file_path in enumerate(output_files[:1]):
    if os.path.exists(file_path):
        print(f'\nGrid {i+1}: {file_path}')
        img = mpimg.imread(file_path)
        
        # Calculate frame dimensions for the yellow box overlay
        img_height, img_width = img.shape[:2]
        frame_width = (img_width - (FIXED_GRID_COLS + 1) * BORDER_THICKNESS) // FIXED_GRID_COLS
        frame_height_with_label = (img_height - (FIXED_GRID_ROWS + 1) * BORDER_THICKNESS) // FIXED_GRID_ROWS
        
        # Select middle frame (row 2, col 3 for a 4x5 grid)
        focus_row, focus_col = 1, 2  # 0-indexed (displays as [2×3])
        
        # Calculate frame position for the yellow box
        start_x = BORDER_THICKNESS + focus_col * (frame_width + BORDER_THICKNESS)
        start_y = BORDER_THICKNESS + focus_row * (frame_height_with_label + BORDER_THICKNESS)
        end_x = start_x + frame_width
        end_y = start_y + frame_height_with_label
        
        # Display full grid with the yellow box highlight
        fig, ax = plt.subplots(figsize=(16, 12))
        ax.imshow(img)
        
        # Add yellow box around the focused frame
        from matplotlib.patches import Rectangle
        yellow_box = Rectangle((start_x-5, start_y-5), frame_width+10, frame_height_with_label+10, 
                           linewidth=4, edgecolor='yellow', facecolor='none', linestyle='-')
        ax.add_patch(yellow_box)
        
        plt.tight_layout()
        plt.show()

if len(output_files) > 1:
    print(f'\n... and {len(output_files) - 1} more grids')

### What We're Seeing

Each grid image is like a photo collage containing multiple video frames. This section shows you:

- **Individual Frames**: Specific moments from the video
- **Timestamps**: Exactly when each frame occurs in the video
- **Position Labels**: Where each frame sits in the grid (like [2×3])

## 8. Examine Individual Frame Within A Grid Image

**Look at the individual frames up close** - Let's closely examine the frame inside the yellow box in the grid image shown above.

In [None]:
# Display focused frames with detailed analysis
print('Focused Frame Analysis - Detailed View of Individual Frames\n')

for i, file_path in enumerate(output_files[:1]):
    if os.path.exists(file_path):
        img = mpimg.imread(file_path)
        
        # Calculate frame dimensions (accounting for borders and labels)
        img_height, img_width = img.shape[:2]
        frame_width = (img_width - (FIXED_GRID_COLS + 1) * BORDER_THICKNESS) // FIXED_GRID_COLS
        frame_height_with_label = (img_height - (FIXED_GRID_ROWS + 1) * BORDER_THICKNESS) // FIXED_GRID_ROWS
        frame_height = frame_height_with_label - LABEL_HEIGHT
        
        # Select middle frame (row 2, col 3 for a 4x5 grid)
        focus_row, focus_col = 1, 2  # 0-indexed (displays as [2×3])
        
        # Calculate frame position in the image
        start_x = BORDER_THICKNESS + focus_col * (frame_width + BORDER_THICKNESS)
        start_y = BORDER_THICKNESS + focus_row * (frame_height_with_label + BORDER_THICKNESS)
        end_x = start_x + frame_width
        end_y = start_y + frame_height_with_label
        
        # Extract the frame (including footer)
        frame_img = img[start_y:end_y, start_x:end_x]
        
        # Calculate timestamp for this frame (matching the processor's logic)
        frame_index = focus_row * FIXED_GRID_COLS + focus_col
        frames_per_grid = layout['frames_per_grid']
        global_frame_index = i * frames_per_grid + frame_index
        # Match the processor's timestamp calculation: time_offset = i * interval + (interval / 2)
        time_offset = global_frame_index * layout['extraction_interval'] + (layout['extraction_interval'] / 2)
        timestamp = START_TIME + time_offset
        
        print(f'Grid {i+1} - Focused Frame Analysis:')
        print(f'   Position: [{focus_row+1}×{focus_col+1}] (Row {focus_row+1}, Column {focus_col+1})')
        print(f'   Timestamp: {timestamp:.1f}s')
        print(f'   Global Frame Index: {global_frame_index}')
        print(f'   Frame Dimensions: {frame_width}×{frame_height_with_label}px')
        
        # Display the focused frame in large size
        fig, ax = plt.subplots(figsize=(20, 15))
        ax.imshow(frame_img)
        ax.axis('off')
        
        # Add title with position and timestamp info
        title = f'Focused Frame [{focus_row+1}×{focus_col+1}] | {timestamp:.1f}s | Grid {i+1}'
        ax.set_title(title, fontsize=20, fontweight='bold', pad=30)
        
        # Add detailed footer information with red background for emphasis
        footer_text = f'Position: Row {focus_row+1}, Column {focus_col+1} | Timestamp: {timestamp:.1f}s | Frame: {global_frame_index} | Size: {frame_width}×{frame_height_with_label}px'
        plt.figtext(0.5, 0.02, footer_text, ha='center', fontsize=14, fontweight='bold',
                   bbox=dict(boxstyle='round,pad=0.8', facecolor='red', alpha=0.8, edgecolor='darkred'))
        
        plt.tight_layout()
        plt.subplots_adjust(bottom=0.12)  # Make room for footer
        plt.show()
        print('\n' + '='*80 + '\n')

print('Focused frame analysis complete!')

### Why This Helps

When AI analyzes the video, it can reference specific moments by their grid position and timestamp. This makes the analysis much more precise and useful.

💡 **Example**: AI might say "At position [2×3] around 15.5 seconds, the character enters the room"!

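The mapping from a grid position to a timestamp follows the formula noted in the cell above: offset = index × interval + interval / 2. Below is a minimal sketch of that mapping. It assumes the interval is the number of seconds between sampled frames (as `layout['extraction_interval']` appears to be) and that cells fill left to right, top to bottom. The helper name `cell_timestamp` is hypothetical.

```python
def cell_timestamp(grid_index, row, col, cols, frames_per_grid,
                   interval_s, start_time=0.0):
    """Timestamp (seconds) of a grid cell; grid_index, row and col are 0-indexed."""
    global_index = grid_index * frames_per_grid + row * cols + col
    # Each sampled frame is taken from the middle of its interval.
    return start_time + global_index * interval_s + interval_s / 2


# Position [2x3] in the first grid of a 4x5 layout, sampling every 0.3 s
# (9 frames at 30 FPS): global index 7, about 2.25 s into the video.
ts = cell_timestamp(0, row=1, col=2, cols=5, frames_per_grid=20, interval_s=0.3)
print(f'{ts:.2f}s')
```
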
## 9. Shot Change Detection Results

**See where scenes change in the video** - The system automatically finds visual transitions between different frames.

### How This Works:
1. **Compare Frames**: Look at consecutive video frames
2. **Measure Differences**: Check how much the image changes
3. **Find Big Changes**: When the change is significant, it's probably a new shot
4. **Record the Moment**: Save the timestamp and location
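
The four steps can be illustrated with a deliberately simple detector. The internals of `create_fusion_detector` are not shown in this notebook, so the sketch below is not that implementation: it swaps in a plain grayscale histogram correlation (Pearson correlation, the same measure as OpenCV's `HISTCMP_CORREL` used in the visualization cell later in this notebook) on frames represented as flat lists of pixel intensities. All function names are hypothetical.

```python
def histogram(pixels, bins=8, max_value=256):
    """Count pixel intensities (0..max_value-1) into equal-width bins."""
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p // width), bins - 1)] += 1
    return counts


def correlation(a, b):
    """Pearson correlation between two histograms of equal length."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = (var_a * var_b) ** 0.5
    if denom == 0:
        return 1.0 if a == b else 0.0  # flat histogram: correlation undefined
    return cov / denom


def find_shot_changes(frames, threshold=0.3):
    """Return indices of frames whose similarity to the previous frame is below threshold."""
    changes = []
    for i in range(1, len(frames)):
        score = correlation(histogram(frames[i - 1]), histogram(frames[i]))
        if score < threshold:
            changes.append(i)  # record the moment
    return changes


# Two dark frames followed by two bright frames: one change, at index 2.
dark, dark_again = [10] * 64, [12] * 64
bright, bright_again = [240] * 64, [238] * 64
print(find_shot_changes([dark, dark_again, bright, bright_again]))  # prints [2]
```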

### Why This Matters:
- Helps AI understand the story flow
- Makes analysis more accurate
- Provides context about video structure

💡 **Simple**: The system automatically finds shot changes in the video by comparing and measuring differences between frame images.

In [None]:
# Shot changes are now included in the result with timestamps!
shot_changes_info = result['shot_changes']  # Already has timestamps and positions

if ENABLE_SHOT_DETECTION:
    print('\nShot Change Details:')
    print('=' * 80)
    
    # Check if we have the new format with shot_segments
    has_shot_segments = shot_changes_info and 'shot_segments' in shot_changes_info[0]
    
    if has_shot_segments:
        # New format: shot_segments already calculated by processor
        for grid_data in shot_changes_info:
            grid_idx = grid_data['grid_index']
            shot_segments = grid_data['shot_segments']
            
            if shot_segments:
                print(f'\nGrid {grid_idx + 1}:')
                for shot in shot_segments:
                    print(f'  • Shot change at [{shot["row"]}×{shot["col"]}] | {shot["timestamp"]:.1f}s')
        
        print('\n' + '=' * 80)
        total_shots = sum(len(sc['shot_segments']) for sc in shot_changes_info)
        print(f'Total shot changes: {total_shots}')
        print('\n💡 Timestamps and grid positions calculated by AdaptiveFilmstripProcessor')
    else:
        # Old format: need to calculate timestamps (backward compatibility)
        print('\n⚠️  Using old result format. Re-run processing cell for enhanced shot data.')
        print('\nCalculating timestamps from old format...')
        
        for grid_data in shot_changes_info:
            grid_idx = grid_data['grid_index']
            shot_changes = grid_data['shot_changes']
            frame_range = grid_data['frame_range']
            
            if shot_changes:
                print(f'\nGrid {grid_idx + 1}:')
                for shot_idx in shot_changes:
                    row = (shot_idx // FIXED_GRID_COLS) + 1
                    col = (shot_idx % FIXED_GRID_COLS) + 1
                    global_frame_idx = frame_range[0] + shot_idx
                    # Match the processor's timestamp calculation
                    time_offset = global_frame_idx * layout['extraction_interval'] + (layout['extraction_interval'] / 2)
                    timestamp = START_TIME + time_offset
                    print(f'  • Shot change at [{row}×{col}] | {timestamp:.1f}s')
        
        print('\n' + '=' * 80)
        total_shots = sum(len(sc['shot_changes']) for sc in shot_changes_info)
        print(f'Total shot changes: {total_shots}')
else:
    print('\nShot change detection disabled')

## 10. Visualize Shot Change Detection

**See scene changes in action** - Compare what the video looked like before and after a scene change.

### What You'll See:
- **BEFORE Frame**: What the video looked like in the previous scene
- **AFTER Frame**: What the video looks like in the new scene
- **The Difference**: How much the visual content changed

### Understanding the Numbers:
- **Similarity Score**: How similar the frames are (0 = totally different, 1 = identical)
- **Threshold**: The cutoff point for detecting changes (we use 0.3)
- **Detection**: If similarity is below 0.3, we found a scene change!

💡 **Simple Rule**: Big visual changes = new scenes!

In [None]:
import cv2
import matplotlib.pyplot as plt
import numpy as np

if ENABLE_SHOT_DETECTION and shot_changes_info:
    # Find first shot change to visualize
    first_shot = None
    for grid_info in shot_changes_info:
        if grid_info.get('shot_segments'):  # .get() tolerates the old result format
            first_shot = grid_info['shot_segments'][0]
            grid_idx = grid_info['grid_index']
            break
    
    if first_shot:
        print('\n🎬 Visualizing Shot Change Detection')
        print(f'   Location: Grid {grid_idx + 1}, Frame [{first_shot["row"]}×{first_shot["col"]}]')
        print(f'   Timestamp: {first_shot["timestamp"]:.1f}s')
        print(f'   Frame Index: {first_shot["frame_index"]}')
        
        # Open video and extract frames
        cap = cv2.VideoCapture(VIDEO_FILE)
        
        # Calculate frame positions
        frame_idx = first_shot['frame_index']
        # Match the processor's timestamp calculation
        before_time_offset = (frame_idx - 1) * layout['extraction_interval'] + (layout['extraction_interval'] / 2)
        after_time_offset = frame_idx * layout['extraction_interval'] + (layout['extraction_interval'] / 2)
        before_time = START_TIME + before_time_offset
        after_time = START_TIME + after_time_offset
        
        # Extract before frame (frame before shot change)
        cap.set(cv2.CAP_PROP_POS_MSEC, before_time * 1000)
        ret1, frame_before = cap.read()
        
        # Extract after frame (frame where shot change detected)
        cap.set(cv2.CAP_PROP_POS_MSEC, after_time * 1000)
        ret2, frame_after = cap.read()
        
        cap.release()
        
        if ret1 and ret2:
            # Convert BGR to RGB for display
            frame_before_rgb = cv2.cvtColor(frame_before, cv2.COLOR_BGR2RGB)
            frame_after_rgb = cv2.cvtColor(frame_after, cv2.COLOR_BGR2RGB)
            
            # Create side-by-side visualization
            fig, axes = plt.subplots(1, 2, figsize=(16, 6))
            
            # Before frame
            axes[0].imshow(frame_before_rgb)
            axes[0].set_title(f'BEFORE Shot Change\nFrame {frame_idx - 1} | {before_time:.1f}s', 
                            fontsize=14, fontweight='bold', color='blue')
            axes[0].axis('off')
            
            # After frame
            axes[1].imshow(frame_after_rgb)
            axes[1].set_title(f'AFTER Shot Change (Detected)\nFrame {frame_idx} | {after_time:.1f}s', 
                            fontsize=14, fontweight='bold', color='red')
            axes[1].axis('off')
            
            plt.suptitle('Shot Change Detection: Before vs After', 
                        fontsize=16, fontweight='bold', y=0.98)
            plt.tight_layout()
            plt.show()
            
            # Calculate and display histogram difference
            hsv_before = cv2.cvtColor(frame_before, cv2.COLOR_BGR2HSV)
            hsv_after = cv2.cvtColor(frame_after, cv2.COLOR_BGR2HSV)
            
            hist_before = cv2.calcHist([hsv_before], [0, 1, 2], None, [8, 8, 8], [0, 180, 0, 256, 0, 256])
            hist_after = cv2.calcHist([hsv_after], [0, 1, 2], None, [8, 8, 8], [0, 180, 0, 256, 0, 256])
            
            correlation = cv2.compareHist(hist_before, hist_after, cv2.HISTCMP_CORREL)
            
            # Report the comparison with the operator that actually holds
            detected = correlation < SHOT_DETECTION_THRESHOLD
            comparison = '<' if detected else '>='
            print('\n   📊 Detection Metrics:')
            print(f'      Histogram Correlation: {correlation:.4f}')
            print(f'      Threshold: {SHOT_DETECTION_THRESHOLD}')
            print(f'      Shot Detected: {"YES" if detected else "NO"} '
                  f'({correlation:.4f} {comparison} {SHOT_DETECTION_THRESHOLD})')
            print('\n   💡 Lower correlation = more different frames = shot change')
        else:
            print('   ❌ Could not extract frames for visualization')
    else:
        print('\n   No shot changes detected to visualize')
else:
    print('\n   Shot change detection disabled or no changes detected')

## 11. Define Helper Functions

**Create utilities for Claude interaction** - These functions prepare data and prompts for AI analysis.

### What These Functions Do

Think of these as the AI communication toolkit:

1. **Image Encoder**: Converts the grid images into a format AI can read
2. **Prompt Builder**: Creates clear instructions for AI about how to analyze the video
3. **API Communicator**: Sends everything to Claude and gets the analysis back

### Why This Matters

AI needs specific formats and clear instructions to do its best work. These functions handle all the technical details so you don't have to worry about them.
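
As an illustration of how an encoded grid image and a prompt might be combined into one request, the sketch below builds a JSON body in the Anthropic Messages format used with Amazon Bedrock's `invoke_model`. It is a sketch under stated assumptions and not necessarily how this workshop's communicator works: the function name `build_request_body`, the `max_tokens` value, and the JPEG media type are all assumptions.

```python
import base64
import json


def build_request_body(image_bytes_list, prompt, max_tokens=4096,
                       media_type='image/jpeg'):
    """Build a Messages-format request body with images first, then the text prompt."""
    content = []
    for raw in image_bytes_list:
        content.append({
            'type': 'image',
            'source': {
                'type': 'base64',
                'media_type': media_type,  # assumption: grids are JPEG files
                'data': base64.b64encode(raw).decode('utf-8'),
            },
        })
    content.append({'type': 'text', 'text': prompt})

    return json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': max_tokens,  # hypothetical value
        'messages': [{'role': 'user', 'content': content}],
    })


# Placeholder bytes stand in for a real grid image file.
body = build_request_body([b'fake-image-bytes'], 'Describe these frames.')
print(len(json.loads(body)['messages'][0]['content']))  # prints 2
```

A body like this would typically be passed to `invoke_model(modelId=..., body=body)` on a `boto3.client('bedrock-runtime')` client.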

In [None]:
def encode_image_to_base64(image_path):
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def create_analysis_prompt(grid_rows, grid_cols, num_grids, shot_changes_info=None):
    prompt = f"""You are analyzing filmstrip grids for comprehensive VISUAL UNDERSTANDING of video content.

    GRID SPECIFICATIONS:
    • Grid Structure: {grid_rows}×{grid_cols} = {grid_rows * grid_cols} frames per grid
    • Total Grids: {num_grids}
    • Reading Order: LEFT→RIGHT, TOP→BOTTOM within each grid, then Grid 1→2→3 sequentially
    • Frame Labels: [row×col] | timestamp format
    
    VISUAL ANALYSIS FRAMEWORK:
    Examine each frame for objects, people, environments, visual elements, and text. Track how these elements change across time."""

    # Add shot change information if available
    if shot_changes_info:
        prompt += "\n\n🎯 DETECTED SCENE TRANSITIONS:\n"
        for grid_info in shot_changes_info:
            grid_idx = grid_info['grid_index']
            shots = grid_info['shot_segments']
            if shots:
                prompt += f"\nGrid {grid_idx + 1} - Visual Transitions:\n"
                for shot in shots:
                    prompt += f"  ▶ Visual change at {shot['timestamp']:.1f}s [Frame {shot['row']}×{shot['col']}] - New visual scene\n"
        prompt += "\n💡 Use these transitions to identify major visual shifts and scene boundaries.\n"
    
    prompt += """
    REQUIRED VISUAL ANALYSIS:
    
    1. **TEXT RECOGNITION & ANALYSIS**
       - Signs, billboards, street names, building labels
       - On-screen text, titles, captions, subtitles
       - License plates, product names, brand logos
       - Written content in documents, books, newspapers
       - Digital displays, screens, monitors showing text
       - Handwritten text or notes visible in frames
    
    2. **OBJECT DETECTION & IDENTIFICATION**
       - People: Count, positions, actions, clothing, expressions
       - Vehicles: Types, colors, positions, movement
       - Buildings/Architecture: Structures, styles, conditions
       - Natural elements: Trees, sky, weather, terrain
       - Props/Items: Tools, furniture, signs, technology
    
    3. **SPATIAL COMPOSITION**
       - Foreground, middle ground, background elements
       - Object placement and relationships
       - Scale and perspective of objects
       - Depth and layering in scenes
    
    4. **VISUAL ENVIRONMENT**
       - Indoor vs outdoor settings
       - Lighting conditions (natural, artificial, time of day)
       - Weather and atmospheric conditions
       - Geographic or architectural context
    
    5. **MOVEMENT & DYNAMICS**
       - Object motion patterns across frames
       - People walking, vehicles moving, environmental changes
       - Camera movement effects on object positions
       - Temporal changes in object states
    
    6. **COLOR & VISUAL PROPERTIES**
       - Dominant color schemes per scene
       - Object colors and visual characteristics
       - Lighting effects on appearance
       - Visual quality and clarity changes
    
    7. **SCENE CONTEXT & SETTING**
       - Location types (street, building, park, etc.)
       - Time period indicators from visual cues
       - Cultural or regional visual markers
       - Activity contexts from visible elements
    
    8. **CONTENT MODERATION ANALYSIS**
       - Violence: Weapons, fighting, aggressive behavior, injuries
       - Adult content: Nudity, suggestive poses, intimate situations
       - Substance use: Smoking, drinking, drug paraphernalia
       - Disturbing content: Blood, graphic imagery, distressing scenes
       - Inappropriate behavior: Dangerous activities, harmful actions
       - Age-inappropriate elements: Content unsuitable for minors
    
    9. **NARRATIVE INFERENCE**
       - Story context derived from visual and textual cues
       - Character relationships and interactions
       - Plot progression indicated by visual elements
       - Setting and time period from environmental clues
       - Emotional tone and atmosphere
       - Thematic elements suggested by visuals and text
    
    For each element (visual, textual, or narrative), reference specific frames using [row×col] notation and timestamps.
    Combine text recognition with visual analysis to provide comprehensive understanding of the video's content and story.

    OUTPUT FORMAT REQUIREMENT:
    You MUST respond with ONLY valid JSON in the following structure:
    
    {
      "video_analysis": {
        "overview": {
          "title": "Brief title of the video content",
          "duration_analyzed": "Duration in seconds",
          "total_frames_analyzed": "Number of frames analyzed",
          "genre": "Video genre/type",
          "summary": "Comprehensive summary of the entire video content"
        },
        "text_recognition": {
          "details": [ "Details of the text extracted" ]
        },
        "movement_dynamics": {
          "details": [ "Details of the movement and dynamics" ]
        },
        "spatial_compositions": {
          "details": [ "Spatial composition details" ]
        },
        "color_visual_properties": {
          "details": [ "Color and visual details" ]
        },
        "visual_elements": {
          "people_details": [ "Details of the people identified" ],
          "object_details": [ "Object details" ],
          "environment_details": [ "Details of the environment" ]
        },
        "content_moderation": {
          "details": [ "Content moderation details" ]
        },
        "narrative_analysis": {
          "details": [ "Narrative analysis details" ]
        },
        "chapters": [
          {
            "chapter_number": 1,
            "title": "Descriptive chapter title",
            "start_time": "Start timestamp in seconds",
            "end_time": "End timestamp in seconds",
            "duration": "Chapter duration in seconds",
            "description": "Detailed description of what happens in this chapter",
            "key_events": ["List of important events in this chapter"],
            "characters_present": ["List of characters in this chapter"],
            "setting": "Where this chapter takes place",
            "mood": "Emotional mood of this chapter"
          }
        ]
      }
    }
    
    CRITICAL REQUIREMENTS:
    1. Output ONLY valid JSON - no additional text, explanations, or markdown
    2. Do not wrap the JSON in Markdown code fences
    3. Include ALL required fields even if empty (use empty arrays [] or empty strings "")
    4. Create logical chapters based on visual transitions and content changes
    5. Reference specific frames using [row×col] notation and exact timestamps
    6. Make chapters meaningful segments of 10-30 seconds each when possible
    7. Do not add any Markdown syntax to the JSON output
    
    """
        
    return prompt

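def analyze_with_bedrock(image_paths, prompt):
    """Send the grid images and the prompt to Claude on Amazon Bedrock; return the response text."""
    # Build one multimodal user message. Each grid image is followed by a
    # text label so the model can tell the grids apart and read them in order.
    content = []
    for i, img_path in enumerate(image_paths):
        content.append({
            'type': 'image',
            'source': {
                'type': 'base64',
                'media_type': 'image/jpeg',
                'data': encode_image_to_base64(img_path)
            }
        })
        content.append({'type': 'text', 'text': f'--- Grid {i+1} ---'})
    
    # The analysis instructions go last, after all of the images.
    content.append({'type': 'text', 'text': prompt})

    bedrock_runtime = boto3.client(
        service_name='bedrock-runtime',
        region_name=AWS_REGION
    )
    
    # Request body in the Anthropic Messages format used by invoke_model.
    response = bedrock_runtime.invoke_model(
        modelId=CLAUDE_MODEL_ID,
        body=json.dumps({
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 4096,
            'messages': [{'role': 'user', 'content': content}]
        })
    )
    
    # The response body is a stream; the generated text is in the first content block.
    return json.loads(response['body'].read())['content'][0]['text']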

print('✅ Functions defined')

## 12. Analyze Video with Claude

**Send the video filmstrip grids to Claude for analysis** - This is where the magic happens!

### What Happens Now

We're about to send the organized video grids to Claude on Amazon Bedrock for analysis. Here's the process:

1. **Select Grids**: Choose the first few grids (for example, 5 grids of 20 frames each cover 100 video frames)
2. **Add Context**: Include information about the shot changes we detected
3. **Create Instructions**: Tell Claude how to read the cells within each grid and the order in which to read the grids
4. **Send to AI**: Upload everything to Claude on Amazon Bedrock
5. **Get Analysis**: Receive detailed insights about the video
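
Step 3 depends on being able to turn a grid cell into a moment in the video, and that arithmetic can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `GridLayout` and `locate_cell` are hypothetical names, the 4×5 layout is assumed only because it yields 20 frames per grid (so 5 grids hold 100 frames), cells are assumed to be read left to right and top to bottom, and frames are assumed to be sampled at a fixed interval, which adaptive sampling may not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridLayout:
    """Hypothetical description of one filmstrip grid."""
    rows: int = 4                   # assumed; the notebook uses FIXED_GRID_ROWS
    cols: int = 5                   # assumed; the notebook uses FIXED_GRID_COLS
    seconds_per_frame: float = 1.0  # assumed fixed sampling interval

    @property
    def frames_per_grid(self) -> int:
        return self.rows * self.cols


def locate_cell(layout: GridLayout, grid_number: int, row: int, col: int):
    """Map a 1-based grid number and 1-based [row×col] cell to (frame_index, timestamp).

    Assumes row-major reading order: left to right, then top to bottom.
    """
    if grid_number < 1:
        raise ValueError("grid_number is 1-based")
    if not (1 <= row <= layout.rows and 1 <= col <= layout.cols):
        raise ValueError("cell is outside the grid")
    frame_index = (
        (grid_number - 1) * layout.frames_per_grid  # frames in earlier grids
        + (row - 1) * layout.cols                   # frames in earlier rows
        + (col - 1)                                 # frames earlier in this row
    )
    return frame_index, frame_index * layout.seconds_per_frame


layout = GridLayout()
print(5 * layout.frames_per_grid)    # 100 frames across 5 grids
print(locate_cell(layout, 2, 1, 1))  # (20, 20.0): first cell of the second grid
```

With real data, the timestamp would more likely come from whatever per-frame metadata the extraction step recorded than from a fixed interval.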

### What Claude Will Tell You:
- **Video Summary**: What the video is about overall
- **Story Flow**: How the narrative develops over time
- **Key Scenes**: Important moments with specific timestamps
- **Visual Style**: Cinematography and visual elements
- **Shot Changes**: How the detected transitions fit the story

In [None]:
grids_to_analyze = output_files[:MAX_GRIDS_TO_ANALYZE]

# Filter shot changes for grids being analyzed
filtered_shot_changes = [sc for sc in shot_changes_info if sc['grid_index'] < MAX_GRIDS_TO_ANALYZE]

# Create prompt with shot change information
prompt = create_analysis_prompt(
    FIXED_GRID_ROWS, 
    FIXED_GRID_COLS, 
    len(grids_to_analyze),
    shot_changes_info=filtered_shot_changes if ENABLE_SHOT_DETECTION else None
)

print(f'🤖 Analyzing {len(grids_to_analyze)} grids...')
if ENABLE_SHOT_DETECTION:
    total_shots_in_analysis = sum(len(sc['shot_segments']) for sc in filtered_shot_changes)
    print(f'   Including {total_shots_in_analysis} shot changes')

analysis = analyze_with_bedrock(grids_to_analyze, prompt)
print('✅ Analysis complete')
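
The prompt asks for JSON only, but `analysis` is still a plain string at this point, and a model can occasionally surround the JSON with code fences or a stray sentence. A tolerant parser might look like the sketch below. The helper name `parse_model_json` is hypothetical and is not part of the notebook or of any Bedrock SDK; it simply falls back to the outermost `{...}` span when a direct parse fails.

```python
import json


def parse_model_json(raw_text: str):
    """Parse a model response that should contain a single JSON object.

    Tries a direct parse first; if that fails, retries on the text between
    the first '{' and the last '}' to skip code fences or surrounding prose.
    """
    text = raw_text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise ValueError("no JSON object found in model response") from None
        # May still raise json.JSONDecodeError (a ValueError subclass)
        # if the extracted span is not valid JSON.
        return json.loads(text[start:end + 1])


# Usage with a made-up response wrapped in a Markdown code fence
fence = "`" * 3
wrapped = f'{fence}json\n{{"video_analysis": {{"chapters": []}}}}\n{fence}'
parsed = parse_model_json(wrapped)
print(parsed["video_analysis"]["chapters"])  # []
```

Failing loudly on unparseable output is a deliberate choice here: a response truncated by `max_tokens` is easier to spot as an exception than as a half-filled display.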

## 13. Display Analysis Results

**Format the JSON analysis into a user-friendly collapsible display**

In [None]:
import sys
sys.path.append('components')
from display_utils import display_analysis_results

# Display the analysis results with video clips
display_analysis_results(analysis, VIDEO_FILE)

## 14. Review Video Analysis Raw Output from Model

**See what Claude has discovered after analyzing the video** - Review the AI's understanding of the video content.


In [None]:
print('\n' + '='*80)
print('CLAUDE ANALYSIS')
print('='*80)
print()
print(analysis)
print()
print('='*80)
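
Beyond reading the raw text, a quick structural check can confirm that the response follows the schema requested in the prompt. The sketch below assumes the response has already been parsed into a dictionary (for example with `json.loads`); the section names come from the prompt's output format, while `missing_sections` and `chapter_problems` are hypothetical helpers. The 10 to 30 second range mirrors the prompt's chapter-length guideline, which is a preference rather than a hard rule, so the results are best treated as warnings.

```python
REQUIRED_SECTIONS = (
    "overview", "text_recognition", "movement_dynamics",
    "spatial_compositions", "color_visual_properties", "visual_elements",
    "content_moderation", "narrative_analysis", "chapters",
)


def missing_sections(parsed: dict) -> list:
    """Return the schema sections absent from parsed['video_analysis']."""
    body = parsed.get("video_analysis", {})
    return [name for name in REQUIRED_SECTIONS if name not in body]


def chapter_problems(parsed: dict, min_len: float = 10.0, max_len: float = 30.0) -> list:
    """Return human-readable warnings about chapter timestamps and lengths."""
    problems = []
    for chapter in parsed.get("video_analysis", {}).get("chapters", []):
        number = chapter.get("chapter_number")
        try:
            start = float(chapter["start_time"])
            end = float(chapter["end_time"])
        except (KeyError, TypeError, ValueError):
            problems.append(f"chapter {number}: unreadable timestamps")
            continue
        length = end - start
        if not (min_len <= length <= max_len):
            problems.append(f"chapter {number}: length {length:.1f}s outside {min_len:.0f}-{max_len:.0f}s")
    return problems


# Usage with a small made-up response
sample = {
    "video_analysis": {
        "overview": {},
        "chapters": [
            {"chapter_number": 1, "start_time": "0", "end_time": "12"},
            {"chapter_number": 2, "start_time": "12", "end_time": "90"},
        ],
    }
}
print(missing_sections(sample))
print(chapter_problems(sample))
```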

**🎉 Congratulations! You now understand how to perform visual understanding at scale with AI!**