# Create Videos with the Azure OpenAI Sora API

This notebook demonstrates how to generate and manage videos using the Sora REST API. With Sora, you can:

- **Create engaging videos** from text prompts (text-to-video)
- **Generate videos from images** using reference images (image-to-video)
- **Get automatic audio**: all generated videos include synchronized audio
- **Retrieve and track job statuses** to monitor your video generation progress
- **Download videos** to your local environment
- **Manage** your video generation jobs seamlessly

In addition to Sora, this demo uses the Azure OpenAI GPT-4.1 model to generate file names and analyze the content of the generated videos.

### Sora Specifications
- **Resolutions:** 1280x720 (landscape), 720x1280 (portrait)
- **Duration:** 4, 8, or 12 seconds
- **Concurrency:** max 2 pending jobs
- **Jobs expire:** after 24 hours

## Setup

## 📖 Prompt Engineering Guide

Effective prompts are crucial for generating high-quality videos with Sora. According to the [Azure OpenAI documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/video-generation), a well-structured prompt should include:

| Element | Description | Example |
|---------|-------------|---------|
| **Subject** | The main focus of the video | "A sleek silver sports car" |
| **Action** | What the subject is doing | "accelerates down a highway" |
| **Setting** | Environment or background | "through a neon-lit cyberpunk city at night" |
| **Camera Details** | Angles, movements, shot types | "tracking shot following from the side" |
| **Lighting & Mood** | Ambiance and lighting conditions | "dramatic headlights cutting through rain" |

### Example Prompt Structure

```
[Subject] + [Action] + [Setting] + [Camera/Visual Style] + [Mood/Atmosphere]
```

**Good Prompt:**
> "A sleek silver sports car accelerates through rain-slicked streets of a neon-lit city at night. 
> The camera follows from a low angle, capturing reflections on wet pavement. 
> Cinematic lighting with dramatic headlight beams cutting through the mist. 4K, 24fps film look."

**Poor Prompt:**
> "Cool car driving fast"
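
The five elements from the table can also be assembled in code. The following is a minimal sketch, not part of this notebook's helpers or of the Sora API: `build_prompt` is a hypothetical function that joins the elements in the order shown in the prompt structure above.

```python
def build_prompt(subject, action, setting, camera="", mood=""):
    """Join prompt elements into one text prompt (hypothetical helper)."""
    # Subject, action, and setting form the opening sentence.
    opening = " ".join(
        part.strip() for part in (subject, action, setting) if part.strip()
    )
    # Camera and mood details follow as separate sentences when provided.
    sentences = [opening] + [part.strip() for part in (camera, mood) if part.strip()]
    # Make sure every non-empty sentence ends with exactly one period.
    return " ".join(s.rstrip(".") + "." for s in sentences if s)


example = build_prompt(
    subject="A sleek silver sports car",
    action="accelerates down a highway",
    setting="through a neon-lit cyberpunk city at night",
    camera="Tracking shot following from the side",
    mood="Dramatic headlights cutting through rain",
)
print(example)
```

Keeping the elements as separate arguments makes it easy to vary one of them (for example the camera details) while holding the rest of the prompt constant.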

### API Parameters (Set in Code, Not Prompt)

These are specified in your API call, not the prompt text:
- **Model**: `sora-2` or `sora-2-pro`
- **Size**: `1280x720` (landscape) or `720x1280` (portrait)
- **Seconds**: `4`, `8`, or `12`
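
Because these values are fixed, it can help to check them before submitting a job. The sketch below assumes the allowed values are exactly those listed above; `validate_video_params` is a hypothetical helper, not something the API or the `Sora` wrapper provides.

```python
# Allowed values copied from the parameter list above (assumed to be complete).
ALLOWED_SIZES = {"1280x720", "720x1280"}
ALLOWED_SECONDS = {4, 8, 12}


def validate_video_params(size, seconds):
    """Return (size, seconds) as strings for a request payload, or raise ValueError."""
    if size not in ALLOWED_SIZES:
        raise ValueError(f"Unsupported size {size!r}; expected one of {sorted(ALLOWED_SIZES)}")
    seconds_int = int(seconds)  # accepts 8 as well as "8"
    if seconds_int not in ALLOWED_SECONDS:
        raise ValueError(f"Unsupported duration {seconds!r}; expected one of {sorted(ALLOWED_SECONDS)}")
    return size, str(seconds_int)


print(validate_video_params("1280x720", 8))
```

Validating locally avoids spending one of the limited job slots on a request that would be rejected.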

In [None]:
from dotenv import load_dotenv, find_dotenv
from openai import AzureOpenAI
import os
import json
import time
import pandas as pd
import threading
from IPython.display import Video, Image, Markdown

from VideoTools import Sora, VideoExtractor, VideoAnalyzer, get_video_metadata
from Instructions import use_case_prompts, filename_system_message, analyze_video_system_message

In [None]:
if not load_dotenv(find_dotenv()):
    raise IOError("Error: .env file could not be loaded!")

# Sora for video generation
sora_resource_name = os.getenv("SORA_AOAI_RESOURCE")
sora_deployment_name = os.getenv("SORA_DEPLOYMENT")
sora_aoai_api_key = os.getenv("SORA_AOAI_API_KEY")

sora = Sora(sora_resource_name, sora_deployment_name, sora_aoai_api_key)

# Azure OpenAI GPT-4.1 for analyzing videos and generating file names
llm_deployment = os.getenv("LLM_DEPLOYMENT")
llm_aoai_api_key = os.getenv("LLM_AOAI_API_KEY")
llm_resource_name = os.getenv("LLM_AOAI_RESOURCE")

local_video_folder = 'video-generations'

# Initialize the Azure OpenAI client for the LLM deployment
client = AzureOpenAI(
    azure_endpoint=f"https://{llm_resource_name}.openai.azure.com/",
    api_key=llm_aoai_api_key,
    api_version="2025-01-01-preview",
)

video_analyzer = VideoAnalyzer(client, llm_deployment)

The following example shows how to submit a video generation request to the Sora API using `requests`. For convenience, we've provided the `Sora` class as a simple wrapper to interact easily with the API throughout the rest of this notebook.

In [None]:
import requests

url = f"https://{sora_resource_name}.openai.azure.com/openai/v1/videos"

headers = {
    "api-key": sora_aoai_api_key,
    "Content-Type": "application/json"
}

# Sora parameters:
# - size: "1280x720" (landscape) or "720x1280" (portrait)
# - seconds: "4", "8", or "12"
payload = {
    "prompt": "A Minecraft player exploring an ancient ruin at sunset.",
    "model": sora_deployment_name,
    "size": "1280x720",
    "seconds": "8"
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()

print(response.json())

## Helper functions

In [None]:
def llm_completion(prompt, client, model, system_message):
    """LLM chat completion. Used for generating concise video filenames. """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.5,
    )

    return response

def check_status(job_id, polling_interval=30):
    """Check video generation status until job is finished or failed."""
    while True:
        job = sora.get_video_generation_job(job_id=job_id)
        status = job['status']
        if status not in ['queued', 'preprocessing', 'running', 'processing']:
            print(f"\nJob finished with status: {status}")
            break
        time.sleep(polling_interval)
        print('.', end='', flush=True)

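def show_video_insights(video_path):
    """Extract five frames from the video and display the LLM's summary, brands, tags, and suggestions."""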
    video_extractor = VideoExtractor(video_path)
    frames = video_extractor.extract_n_video_frames(n=5)

    llm_insights = video_analyzer.video_chat(frames, system_message=analyze_video_system_message)

    display(Markdown(f"Summary:  \n{llm_insights.get('summary', '')}"))
    display(Markdown(f"Products/brands: {llm_insights.get('products', '')}"))
    display(Markdown(f"Tags: {llm_insights.get('tags', '')}"))
    display(Markdown(f"Suggestions:  \n{llm_insights.get('feedback', '')}"))
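
The polling helper above loops until the job leaves an in-progress state. A variant with an upper time bound might look like the following sketch. `wait_for_job` and its parameters are hypothetical, the in-progress statuses are the ones `check_status` checks for, and the job fetcher is passed in as a callable so the example runs without calling the API.

```python
import time

# Statuses treated as "still working", taken from check_status above.
IN_PROGRESS = {"queued", "preprocessing", "running", "processing"}


def wait_for_job(fetch_job, job_id, interval=30, timeout=900,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until the status is final; raise TimeoutError if too slow."""
    deadline = clock() + timeout
    while True:
        status = fetch_job(job_id)["status"]
        if status not in IN_PROGRESS:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status!r} after {timeout}s")
        sleep(interval)


# Demonstration with a fake fetcher that reports a final status on the third poll.
fake_statuses = iter(["queued", "running", "succeeded"])
final_status = wait_for_job(
    lambda _job_id: {"status": next(fake_statuses)}, "video_demo", interval=0
)
print(final_status)
```

With the wrapper from this notebook, the fetcher could be `lambda jid: sora.get_video_generation_job(job_id=jid)`.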


## Start Video Generation Job (Text-to-Video)

Sora generates videos from text prompts with automatic audio.

- **Resolutions:** 1280x720 (landscape), 720x1280 (portrait, default)

- **Duration:** 4, 8, or 12 seconds (default: 4)

- **Audio:** Automatically included in all generated videos

- **Concurrency:** max 2 pending jobs

- **Content restrictions:** No copyrighted content, no real people, no faces in input images
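
Given the limit of two pending jobs, a client can check how many jobs are still in progress before submitting another. This is a sketch under assumptions: `can_submit` is a hypothetical helper, the pending statuses are the ones `check_status` treats as unfinished, and each job is assumed to be a dict with a `status` key.

```python
# Statuses assumed to count against the concurrency limit.
PENDING_STATUSES = {"queued", "preprocessing", "running", "processing"}
MAX_PENDING_JOBS = 2  # limit stated in the list above


def count_pending(jobs):
    """Count jobs whose status is still in progress."""
    return sum(1 for job in jobs if job.get("status") in PENDING_STATUSES)


def can_submit(jobs, max_pending=MAX_PENDING_JOBS):
    """Return True when another job fits under the pending-job limit."""
    return count_pending(jobs) < max_pending


# Illustrative job list (made-up IDs, not real API output).
sample_jobs = [
    {"id": "video_a", "status": "running"},
    {"id": "video_b", "status": "succeeded"},
    {"id": "video_c", "status": "queued"},
]
print(count_pending(sample_jobs), can_submit(sample_jobs))
```

In this notebook the job list could come from `sora.list_video_generation_jobs(limit=50)['data']`.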

In [None]:
prompt = use_case_prompts['Coolstuff']['Cyberpunk eye reflection']
print(prompt)

In [None]:
# Available sizes: 1280x720 (landscape) or 720x1280 (portrait)
# Duration: 4, 8, or 12 seconds
job = sora.create_video_generation_job(
    prompt=prompt,
    n_seconds=8,
    width=1280,
    height=720
)

job_id = job['id']
print(f"Created job: {job_id}")
print(f"Video will be {job['n_seconds']}s at {job['width']}x{job['height']}")

# Poll status as background process
threading.Thread(target=check_status, args=(job_id,), daemon=True).start()

## Download and Analyze Videos
In addition to the Sora video generation model, we use **GPT-4.1** to:

1. Generate a concise, descriptive filename based on the original text prompt.
2. Analyze the video content: summarize key scenes, identify visible brands, and suggest improvements.

In [None]:
# If you want to retrieve a specific job
# job_id = "video_68ff672709d481908f1fa7c53265d835"

job = sora.get_video_generation_job(job_id=job_id)
print(f"Job status: {job['status']}")

downloaded_videos = []

if job['status'] == 'succeeded' and job['generations']:
    # Generate filename using LLM
    prompt_text = job['prompt']
    result = llm_completion(prompt_text, client, llm_deployment, filename_system_message)
    scene_summary = json.loads(result.choices[0].message.content)['filename']

    video_id = job['generations'][0]['id']
    filename = f"{scene_summary}_{video_id}.mp4"

    _ = sora.get_video_generation_video_content(
        generation_id=video_id, 
        file_name=filename, 
        target_folder=local_video_folder
    )
    print(f'Downloaded {filename} to folder: {local_video_folder}')
    downloaded_videos.append(filename)
else:
    print(f"Job not completed or failed: {job.get('failure_reason', 'N/A')}")
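
The filename stem above comes straight from the LLM, so it may contain spaces or characters that are awkward in file paths. A minimal sanitizing sketch follows; `safe_filename` is a hypothetical helper and is not used by the cell above.

```python
import re


def safe_filename(stem, video_id, extension=".mp4", max_stem_length=60):
    """Build a filesystem-friendly name from an LLM-suggested stem (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, "-" and "_" with "_".
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    # Truncate long stems and fall back to a generic name if nothing is left.
    cleaned = cleaned[:max_stem_length].rstrip("_") or "video"
    return f"{cleaned}_{video_id}{extension}"


print(safe_filename("Cyberpunk eye: neon/reflection", "gen_123"))
```

Keeping the generation ID in the name preserves the link back to the job even when the stem is truncated or replaced.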

In [None]:
# Display and analyze downloaded videos
for videofile in downloaded_videos:
    video_path = os.path.join(local_video_folder, videofile)
    print(f'Video: {videofile}')
    display(Video(video_path))
    show_video_insights(video_path)

## 🖼️ Image-to-Video Generation

Sora supports using a reference image as a **visual anchor for the first frame**. This is useful for:
- Animating product shots
- Creating videos that match existing brand imagery
- Bringing still images to life

According to the [Azure OpenAI docs](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/video-generation):
- **Input Reference**: A single image used as visual anchor for the first frame
- **Supported formats**: `image/jpeg`, `image/png`, `image/webp`
- **Requirement**: Image should match the target video size for best results
- **Restriction**: Images with human faces are not supported
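
To check whether a reference image matches the target video size before uploading, the dimensions can be read from the file header. The sketch below handles PNG files only and uses just the standard library; `png_dimensions` and `matches_target_size` are hypothetical helpers. For other formats, Pillow's `Image.open(path).size` returns the same information.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Read (width, height) from the IHDR chunk at the start of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("Not a PNG file")
    # Width and height are big-endian 32-bit integers at byte offsets 16 and 20.
    return struct.unpack(">II", data[16:24])


def matches_target_size(data, size):
    """Compare PNG dimensions with a size string such as '1280x720'."""
    target_width, target_height = (int(value) for value in size.split("x"))
    return png_dimensions(data) == (target_width, target_height)


# Synthetic 24-byte PNG header for a 1280x720 image (enough for this check only).
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1280, 720)
print(png_dimensions(header), matches_target_size(header, "1280x720"))
```

For a real file, pass the result of `open(path, "rb").read()`; only the first 24 bytes are inspected.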

### API Implementation

The `input_reference` parameter is sent via **multipart form data** (not JSON):
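
As a rough sketch of what such a request could look like without the wrapper, the function below prepares the form fields and the file part in the shape that `requests.post(..., data=..., files=...)` accepts. It assumes the form fields mirror the JSON payload used earlier in this notebook (`prompt`, `model`, `size`, `seconds`); `build_image_to_video_form` is a hypothetical helper and not the `Sora` class's actual implementation.

```python
import os

# MIME types for the supported formats listed above.
SUPPORTED_IMAGE_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}


def build_image_to_video_form(prompt, model, size, seconds, image_path):
    """Return (data, files) for a multipart POST (sketch under the assumptions above)."""
    extension = os.path.splitext(image_path)[1].lower()
    if extension not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"Unsupported image type: {extension or 'none'}")
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Text parameters travel as ordinary form fields instead of a JSON body.
    data = {"prompt": prompt, "model": model, "size": size, "seconds": str(seconds)}
    # requests accepts (filename, content, content_type) tuples for file parts.
    files = {
        "input_reference": (
            os.path.basename(image_path),
            image_bytes,
            SUPPORTED_IMAGE_TYPES[extension],
        )
    }
    return data, files


# Usage with the `requests` package and the variables defined in this notebook:
# data, files = build_image_to_video_form(
#     "Slow rotation around the bag", sora_deployment_name, "1280x720", 8, "images/briefcase.png")
# response = requests.post(url, headers={"api-key": sora_aoai_api_key}, data=data, files=files)
```

Note that the `Content-Type` header is left out of the usage example: when `files` is passed, `requests` sets the multipart content type and boundary itself.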

In [None]:
# Image-to-Video: Animate a reference image
# Using the briefcase.png as the visual anchor for the first frame

reference_image_path = "images/briefcase.png"

# Display the reference image
display(Markdown("### Reference Image"))
display(Image(filename=reference_image_path, width=400))

# Prompt describes the motion/action to apply to the image
image_to_video_prompt = """
A professional dark laptop bag on a clean white background. The camera slowly rotates 
around the bag, showcasing its premium leather texture, metal hardware details, and 
functional compartments. Soft studio lighting highlights the craftsmanship. 
The scene is elegant and minimalist, perfect for an e-commerce product showcase.
Smooth 360-degree rotation. 4K quality, professional product photography style.
"""

print(f"Prompt: {image_to_video_prompt.strip()}")

In [None]:
# Create Image-to-Video generation job
# The Sora wrapper handles multipart form data for input_reference automatically

image_job = sora.create_video_generation_job_with_image(
    prompt=image_to_video_prompt,
    image_path=reference_image_path,
    n_seconds=8,
    width=1280,
    height=720  # Landscape orientation
)

image_job_id = image_job['id']
print(f"Created image-to-video job: {image_job_id}")
print(f"Video will be {image_job['n_seconds']}s at {image_job['width']}x{image_job['height']}")

# Poll status in background
threading.Thread(target=check_status, args=(image_job_id,), daemon=True).start()

In [None]:
# Download and display the image-to-video result
# image_job_id = "video_xxx"  # Uncomment to retrieve a specific job

image_job = sora.get_video_generation_job(job_id=image_job_id)
print(f"Job status: {image_job['status']}")

if image_job['status'] == 'succeeded' and image_job['generations']:
    # Generate descriptive filename
    result = llm_completion(image_to_video_prompt, client, llm_deployment, filename_system_message)
    scene_summary = json.loads(result.choices[0].message.content)['filename']
    
    video_id = image_job['generations'][0]['id']
    img2vid_filename = f"img2vid_{scene_summary}_{video_id}.mp4"
    
    img2vid_path = sora.get_video_generation_video_content(
        generation_id=video_id,
        file_name=img2vid_filename,
        target_folder=local_video_folder
    )
    
    print(f"Downloaded: {img2vid_filename}")
    display(Markdown("### Generated Video from Image"))
    display(Video(img2vid_path))
    
    # Show video insights using VideoAnalyzer
    show_video_insights(img2vid_path)
else:
    print(f"Job not completed: {image_job.get('failure_reason', 'Still processing...')}")

## 📊 Advanced Video Analysis with VideoTools

The `VideoTools.py` module provides utilities for deeper video analysis:

| Class/Function | Purpose |
|----------------|---------|
| `VideoExtractor` | Extract frames at intervals or N equally-spaced frames |
| `VideoAnalyzer` | Send frames to the GPT-4.1 deployment for multimodal analysis |
| `get_video_metadata()` | Get duration, fps, resolution, bitrate |

### Use Cases
1. **Quality Assessment** - Compare prompt intent vs generated output
2. **Content Moderation** - Detect brands, products, inappropriate content
3. **Metadata Generation** - Auto-generate tags for video libraries
4. **A/B Testing** - Compare multiple generations systematically
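
For the A/B testing use case, one simple way to compare generations is to rank them by the mean of their metric scores. The sketch below assumes each result is a dict carrying the score keys used in this notebook's quality assessment (`prompt_adherence`, `visual_quality`, `motion_quality`, `composition`); `rank_generations` is a hypothetical helper and the sample rows are invented for illustration.

```python
from statistics import mean

SCORE_KEYS = ("prompt_adherence", "visual_quality", "motion_quality", "composition")


def average_score(row):
    """Mean of the numeric metric scores in a row; 0.0 if none are present."""
    values = [row[key] for key in SCORE_KEYS if isinstance(row.get(key), (int, float))]
    return mean(values) if values else 0.0


def rank_generations(results):
    """Return results sorted by average score, best first."""
    return sorted(results, key=average_score, reverse=True)


# Invented sample rows, not real assessment output.
sample_results = [
    {"file": "a.mp4", "prompt_adherence": 7, "visual_quality": 8, "motion_quality": 6, "composition": 7},
    {"file": "b.mp4", "prompt_adherence": 9, "visual_quality": 8, "motion_quality": 8, "composition": 7},
]
ranked = rank_generations(sample_results)
print([row["file"] for row in ranked])
```

An unweighted mean treats all four metrics equally; a weighted variant would be a natural extension if, say, prompt adherence matters most.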

In [None]:
# Advanced Analysis: Compare prompt intent vs generated output
# This helps evaluate how well Sora interpreted your prompt

quality_assessment_prompt = """You are a video quality assessor for AI-generated content.
Analyze the video frames and evaluate:

1. **Prompt Adherence** (1-10): How well does the video match the original prompt?
2. **Visual Quality** (1-10): Clarity, consistency, absence of artifacts
3. **Motion Quality** (1-10): Smooth, natural movement without glitches
4. **Composition** (1-10): Camera work, framing, visual appeal

Also identify:
- What was captured well
- What was missed or incorrect
- Suggestions for prompt improvement

Return as JSON:
{
    "scores": {
        "prompt_adherence": <1-10>,
        "visual_quality": <1-10>,
        "motion_quality": <1-10>,
        "composition": <1-10>,
        "overall": <1-10>
    },
    "captured_well": "<what worked>",
    "missed": "<what was missed or incorrect>",
    "prompt_suggestions": "<how to improve the prompt>"
}
"""

def assess_video_quality(video_path: str, original_prompt: str) -> dict:
    """Comprehensive quality assessment comparing video to original prompt."""
    extractor = VideoExtractor(video_path)
    frames = extractor.extract_n_video_frames(n=8)  # More frames for thorough analysis
    
    # Include original prompt context for the assessor
    assessment_context = f"Original prompt used to generate this video:\n{original_prompt}\n\nNow analyze the video:"
    
    # Build message with prompt context
    content_segments = [{"type": "text", "text": assessment_context}]
    for f in frames:
        content_segments.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{f['frame_base64']}", "detail": "auto"}
        })
        content_segments.append({"type": "text", "text": f"timestamp: {f['timestamp']}"})
    
    response = client.chat.completions.create(
        model=llm_deployment,
        messages=[
            {"role": "system", "content": quality_assessment_prompt},
            {"role": "user", "content": content_segments}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)

print("✅ Quality assessment function ready")
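
Requesting a JSON object guarantees parseable JSON but not that every score is a number between 1 and 10. A small post-processing step can guard against that. This is a sketch only; `clamp_scores` is a hypothetical helper that is not called elsewhere in this notebook.

```python
def clamp_scores(scores, low=1, high=10):
    """Keep numeric scores within [low, high] and drop non-numeric values."""
    cleaned = {}
    for name, value in scores.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        cleaned[name] = max(low, min(high, value))
    return cleaned


print(clamp_scores({"prompt_adherence": 11, "visual_quality": 0, "overall": "N/A", "composition": 7}))
```

It could be applied to `assessment.get("scores", {})` before the scores are added to a summary table.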

In [None]:
# Collect all videos we've created in this session for quality assessment

videos_to_assess = []

# Add text-to-video result if available
if 'downloaded_videos' in dir() and downloaded_videos:
    for vf in downloaded_videos:
        videos_to_assess.append({
            "path": os.path.join(local_video_folder, vf),
            "prompt": prompt,
            "type": "Text-to-Video"
        })

# Add image-to-video result if available  
if 'img2vid_path' in dir():
    videos_to_assess.append({
        "path": img2vid_path,
        "prompt": image_to_video_prompt,
        "type": "Image-to-Video"
    })

print(f"Found {len(videos_to_assess)} videos to assess")

In [None]:
# Run comprehensive quality assessment on all videos
assessment_results = []

for video_info in videos_to_assess:
    video_path = video_info["path"]
    if not os.path.exists(video_path):
        print(f"⚠️ Skipping {video_path} - file not found")
        continue
        
    display(Markdown(f"### Assessing: {video_info['type']}"))
    display(Markdown(f"**File:** `{os.path.basename(video_path)}`"))
    
    # Get metadata
    metadata = get_video_metadata(video_path)
    display(Markdown(f"**Metadata:** {metadata['duration']}s, {metadata['resolution']}, {metadata['fps']} fps"))
    
    # Run quality assessment
    assessment = assess_video_quality(video_path, video_info["prompt"])
    assessment_results.append({
        "type": video_info["type"],
        "file": os.path.basename(video_path),
        **assessment.get("scores", {})
    })
    
    # Display scores
    scores = assessment.get("scores", {})
    display(Markdown(f"""
| Metric | Score |
|--------|-------|
| Prompt Adherence | {scores.get('prompt_adherence', 'N/A')}/10 |
| Visual Quality | {scores.get('visual_quality', 'N/A')}/10 |
| Motion Quality | {scores.get('motion_quality', 'N/A')}/10 |
| Composition | {scores.get('composition', 'N/A')}/10 |
| **Overall** | **{scores.get('overall', 'N/A')}/10** |
"""))
    
    display(Markdown(f"**✅ Captured Well:** {assessment.get('captured_well', 'N/A')}"))
    display(Markdown(f"**❌ Missed:** {assessment.get('missed', 'N/A')}"))
    display(Markdown(f"**💡 Prompt Suggestions:** {assessment.get('prompt_suggestions', 'N/A')}"))
    display(Markdown("---"))

# Summary DataFrame
if assessment_results:
    display(Markdown("### 📈 Assessment Summary"))
    summary_df = pd.DataFrame(assessment_results)
    display(summary_df)

## Manage Video Generation Jobs

In [None]:
# List the most recent jobs (up to 50)
jobs = sora.list_video_generation_jobs(limit=50)

df = pd.DataFrame(jobs['data'])
if not df.empty:
    # Calculate duration for completed jobs
    df['duration_s'] = df.apply(
        lambda row: (row['finished_at'] - row['created_at']) 
        if row['finished_at'] and row['created_at'] else None, 
        axis=1
    )
    columns = ['id', 'status', 'created_at', 'finished_at', 'duration_s', 
               'prompt', 'n_seconds', 'height', 'width', 'has_audio', 'failure_reason']
    available_columns = [c for c in columns if c in df.columns]
    df = df[available_columns]
display(df)
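
The subtraction above only yields seconds if `created_at` and `finished_at` are numeric timestamps. The sketch below assumes they are Unix epoch seconds (an assumption, not confirmed by this notebook) and shows how to turn them into readable UTC times; `job_timing` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp):
    """Convert Unix epoch seconds to an ISO 8601 UTC string, or None if missing."""
    if not timestamp:
        return None
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def job_timing(job):
    """Summarize when a job started and finished and how long it took."""
    created = job.get("created_at")
    finished = job.get("finished_at")
    return {
        "created_utc": to_utc_iso(created),
        "finished_utc": to_utc_iso(finished),
        "duration_s": finished - created if created and finished else None,
    }


# Invented timestamps for illustration.
timing = job_timing({"created_at": 1700000000, "finished_at": 1700000095})
print(timing)
```

Jobs that are still running have no finish time, so `finished_utc` and `duration_s` are `None` for them.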

In [None]:
# retrieve a specific job

job_id = df.iloc[1]['id']
job = sora.get_video_generation_job(job_id=job_id)
display(job)

In [None]:
# Extract a frame from a video as a thumbnail/preview
if job['generations']:
    video_id = job['generations'][0]['id']
    
    # Download the video
    temp_video_path = sora.get_video_generation_video_content(
        generation_id=video_id,
        file_name="temp_preview.mp4",
        target_folder=local_video_folder
    )
    
    # Extract first frame as a preview
    video_extractor = VideoExtractor(temp_video_path)
    frames = video_extractor.extract_n_video_frames(n=1)
    if frames:
        import base64
        from PIL import Image as PILImage
        from io import BytesIO
        
        frame_data = base64.b64decode(frames[0]['frame_base64'])
        thumbnail = PILImage.open(BytesIO(frame_data))
        display(thumbnail)

In [None]:
# Stream a video directly into memory (without saving to disk)
if job['generations']:
    video_id = job['generations'][0]['id']

    stream = sora.get_video_generation_video_stream(video_id)

    video_bytes = stream.getvalue()
    display(Video(data=video_bytes, embed=True, mimetype='video/mp4'))

In [None]:
# Download a video from a job
if job['generations']:
    video_id = job['generations'][0]['id']

    file_name = "my_video.mp4"
    target_folder = "video-generations/"

    file_path = sora.get_video_generation_video_content(
        generation_id=video_id,
        file_name=file_name,
        target_folder=target_folder
    ) 

    display(Video(file_path))
    print(get_video_metadata(file_path))