# Multimodality Workshop - Session 1

## Learning Objectives

In this workshop, you'll learn how to work with multiple types of content (text, images, audio, video) using OpenAI's API. We'll look at:

1. **Text + Image combinations** - How to structure multimodal messages
2. **Audio transcription** - Using Whisper for speech-to-text
3. **Video processing** - Extracting frames and audio from videos
4. **Practical application** - Building a Swish payment parser that processes video content into a Swish payment URL

## What is Multimodality?

So far, we've worked with text-only AI interactions. **Multimodality** means AI models can understand and work with different types of content simultaneously:

- **Text** - Written instructions, descriptions, tabular data, etc.
- **Images** - Photos, screenshots, diagrams, charts
- **Audio** - Speech, music, sounds (converted to text via transcription)
- **Video** - Moving images + audio (processed as frames + transcription)

## Beyond this notebook
Note that there are other ways to process multimodal input, particularly video and audio. For instance, OpenAI has a "Realtime" API that feeds speech directly into GPT-4o. Another example is Google's Gemini Live API, which handles text, audio, and video directly and outputs text and audio.

These APIs are still rather expensive, which often makes them economically impractical in production. However, we encourage you to explore them: for the right use case, they may well be useful.

Apart from the uv dependencies, you will need ffmpeg installed on your computer. With Homebrew, it can be installed with:

`brew install ffmpeg`

In [None]:
import os
import tempfile
from openai import OpenAI
from utils import encode_image, extract_audio, sample_frames
from contacts import CONTACTS
from payment_models import PaymentRequest, ProcessedPayment, evaluate_expression

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Text + Image Basics

Let's start by understanding how to combine text and images in API requests. There are three ways to provide images to OpenAI models:

1. **URL** - Link to an image on the internet
2. **Base64** - Encode local images as base64 strings
3. **File ID** - Upload images using the Files API
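As a rough sketch, the three options differ only in the shape of the `input_image` content part. The helper names `image_part` and `user_message` below are hypothetical and exist only for illustration; the dictionary shapes mirror the requests used in this notebook.

```python
from typing import Optional


def image_part(
    *,
    url: Optional[str] = None,
    base64_png: Optional[str] = None,
    file_id: Optional[str] = None,
) -> dict:
    """Build one `input_image` content part from exactly one image source."""
    provided = [value for value in (url, base64_png, file_id) if value is not None]
    if len(provided) != 1:
        raise ValueError("Provide exactly one of url, base64_png, or file_id")
    if file_id is not None:
        # Method 3: reference a file previously uploaded through the Files API
        return {"type": "input_image", "file_id": file_id}
    if base64_png is not None:
        # Method 2: embed the image bytes as a data URL (PNG assumed here)
        return {"type": "input_image", "image_url": f"data:image/png;base64,{base64_png}"}
    # Method 1: plain URL to an image on the internet
    return {"type": "input_image", "image_url": url}


def user_message(text: str, *images: dict) -> dict:
    """Combine a text prompt with any number of image parts in one user message."""
    return {
        "role": "user",
        "content": [{"type": "input_text", "text": text}, *images],
    }


message = user_message(
    "What's in this image?",
    image_part(url="https://example.com/picture.jpg"),  # placeholder URL
)
```

The resulting `message` could be passed as an element of the `input` list in `client.responses.create(...)`.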

### Method 1: Using Image URLs

In [None]:
# Example: Analyze an image from a URL
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What's in this image? Describe it in detail."},
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)

### Method 2: Using Base64 Encoded Images

For local images, we can encode them as base64 strings. This is useful when you have images stored on your system.
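The `encode_image` helper comes from `utils.py`, which is not reproduced here. The following is a minimal sketch of what such a helper might look like, assuming it simply reads the file and base64-encodes its bytes. The names `encode_image_sketch` and `to_data_url` are hypothetical, and the real helper may differ.

```python
import base64
import mimetypes


def encode_image_sketch(image_path: str) -> str:
    """Read a local file and return its bytes as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def to_data_url(image_path: str) -> str:
    """Wrap the base64 string in a data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Assumption: fall back to PNG when the extension is unknown
        mime_type = "image/png"
    return f"data:{mime_type};base64,{encode_image_sketch(image_path)}"
```

Guessing the MIME type avoids hard-coding `image/png` when the local file is, for example, a JPEG.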

In [None]:
# Example with base64 encoding (you'll need to add your own image)

image_path = "data/framnacon.png"
base64_image = encode_image(image_path)

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What's in this image?"},
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{base64_image}",
                },
            ],
        }
    ],
)

print(response.output_text)

### Method 3: Using File IDs

You can also upload files using the Files API and reference them by ID.

In [None]:
def create_file(file_path):
    with open(file_path, "rb") as file_content:
        result = client.files.create(
            file=file_content,
            purpose="vision",
        )
        return result.id

file_id = create_file("data/framnacon.png")

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "what's in this image?"},
                {
                    "type": "input_image",
                    "file_id": file_id,
                },
            ],
        }
    ],
)

print(response.output_text)

### Multiple Images in One Request

You can process multiple images in a single request by including them in the content array:

In [None]:
# Example: Compare two images
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Compare these two nature scenes. What are the similarities and differences?"},
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Boardwalk_-_Jeseniky%2C_Czech_Republic_25.jpg/640px-Boardwalk_-_Jeseniky%2C_Czech_Republic_25.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)

## Part 2: Audio Transcription with Whisper

OpenAI's Whisper model can convert speech to text. This is useful for processing audio content or the audio track from videos.

In [None]:
def transcribe_audio(audio_path: str) -> str:
    """
    Transcribe audio using OpenAI's Whisper model.
    """
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", 
            file=audio_file
        )
    return transcript.text

# Example usage (point this at an audio or video file you have locally):
audio_path = "data/cola.mp4"
transcription = transcribe_audio(audio_path)
print(f"Transcription: {transcription}")
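Before uploading, it can help to check the file size. The sketch below assumes the 25 MB upload limit that OpenAI's speech-to-text documentation states at the time of writing; verify the current value before relying on it. `fits_upload_limit` is a hypothetical helper, not part of the SDK.

```python
import os

# Assumption: 25 MB upload limit for the transcription endpoint
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path: str, limit_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is small enough to upload in one request."""
    return os.path.getsize(path) <= limit_bytes
```

Files above the limit would need to be compressed or split into chunks before transcription.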

## Video Processing
To process a video we will:

1. **Extract frames** at regular intervals
2. **Extract audio** and transcribe it
3. **Combine both** in our AI analysis
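The `sample_frames` and `extract_audio` helpers live in `utils.py` and are not shown here. As a sketch of the two underlying ideas, assuming frames are picked at evenly spaced positions and audio is extracted with the ffmpeg CLI, something like the following could work. All function names below are hypothetical and the real helpers may behave differently.

```python
import subprocess
from typing import List


def evenly_spaced_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick up to `max_frames` frame indices spread evenly across the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    return [i * total_frames // count for i in range(count)]


def ffmpeg_audio_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that drops the video stream and writes mono 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", video_path,  # input video
        "-vn",             # no video in the output
        "-ac", "1",        # one audio channel
        "-ar", "16000",    # 16 kHz sample rate
        audio_path,
    ]


def extract_audio_sketch(video_path: str, audio_path: str) -> str:
    """Run ffmpeg (must be installed) and return the path of the extracted audio."""
    subprocess.run(ffmpeg_audio_command(video_path, audio_path), check=True, capture_output=True)
    return audio_path
```

Sampling a handful of evenly spaced frames keeps the request small while still covering the whole clip.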

In [None]:
import matplotlib.pyplot as plt
import cv2
import numpy as np
import base64

def process_video(video_path: str, max_frames: int = 5):
    encoded_frames = sample_frames(video_path, max_frames)
    audio_path = extract_audio(video_path)
    transcription = transcribe_audio(audio_path)
    
    if os.path.exists(audio_path):
        os.remove(audio_path)
    
    return encoded_frames, transcription

video_path = "data/cola.mp4"
frames, transcript = process_video(video_path)
print(f"Transcript: {transcript}")

def decode_frame(frame_dict):
    """Decode a frame from the API-style dict with base64-encoded image."""
    b64_data = frame_dict["image_url"]["url"].split(",")[1]
    img_bytes = base64.b64decode(b64_data)
    arr = np.frombuffer(img_bytes, np.uint8)
    return cv2.imdecode(arr, cv2.IMREAD_COLOR)

decoded_frames = [decode_frame(f) for f in frames]

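# squeeze=False keeps `axes` two-dimensional even when only one frame was sampled
fig, axes = plt.subplots(1, len(decoded_frames), figsize=(15, 3), squeeze=False)
for i, (ax, img) in enumerate(zip(axes[0], decoded_frames)):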
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    ax.set_title(f"Frame {i+1}")
    ax.axis("off")

plt.tight_layout()
plt.show()

# Practical Application - Swish Payment Parser

Now let's combine this into a simple application. We'll create a system that can:

1. **Process video content** (like someone speaking payment instructions)
2. **Extract payment details** (recipient, amount, message)
3. **Handle complex amounts** (like "split the bill of 240 SEK between three people")
4. **Match contacts** from a contact list

Note that some of the code lives in the scripts `contacts.py`, `payment_models.py`, and `utils.py`.

In [None]:
def parse_payment_from_video(instructions: str, video_path: str) -> ProcessedPayment:
    """
    Parse payment information from a video by analyzing both visual frames
    and audio transcription.
    
    This demonstrates the full multimodal pipeline:
    1. Extract frames from video
    2. Extract and transcribe audio
    3. Send both to AI for structured analysis
    4. Evaluate arithmetic expressions
    """
    encoded_frames, transcript = process_video(video_path, max_frames=5)
    
    content = [
        {"type": "text", "text": "# Key video frames:"},
        # Only the first sampled frame is sent; use *encoded_frames to include all of them
        encoded_frames[0],
        {
            "type": "text",
            "text": f"""
            # User Contacts
            {CONTACTS}

            # Video Transcript
            {transcript}

            # User Instructions
            {instructions}
            """,
        },
    ]
    
    response = client.chat.completions.parse(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an AI assistant that analyzes video content to extract "
                    "structured information for digital payments (Swish). "
                    "You will be provided with key frames from the video, a transcript "
                    "of the audio, user instructions, and contact list. "
                    "Represent amounts as arithmetic expressions using +, -, *, /."
                ),
            },
            {"role": "user", "content": content},
        ],
        response_format=PaymentRequest,
    )

    payment_request: PaymentRequest = response.choices[0].message.parsed
    print(f"Extracted payment request: {payment_request}")
    
    # Evaluate the arithmetic expression
    evaluated_amount = evaluate_expression(payment_request.expression)
    print(f"Evaluated amount: {evaluated_amount}")
    
    return ProcessedPayment(
        phone_number=payment_request.phone_number,
        amount=evaluated_amount,
        message=payment_request.message,
    )

video_path = "data/cola.mp4"
result = parse_payment_from_video(
    "Process this payment request",
    video_path
)
print("\n=== RESULTS ===")
print(result)

### Construct URL
With the data transformed into our data model, we can easily turn it into a URL that deep-links into the Swish app.

In [None]:
from urllib.parse import quote_plus

def build_swish_url(payment: ProcessedPayment) -> str:
    base_url = "https://app.swish.nu/1/p/sw/?"

    encoded_message = quote_plus(payment.message)

    params = (
        f"sw={payment.phone_number}&"
        f"amt={round(payment.amount, 1)}&"
        f"cur=SEK&"
        f"msg={encoded_message}"
    )

    return base_url + params

url = build_swish_url(result)
print(url)
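As an alternative sketch, the standard library's `urlencode` can build the whole query string and escape every value, not just the message. `PaymentSketch` below is a hypothetical stand-in for `ProcessedPayment`, and the phone number is made up; the base URL and parameter names are the ones used in this notebook.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class PaymentSketch:
    """Hypothetical stand-in for ProcessedPayment (same three fields)."""
    phone_number: str
    amount: float
    message: str


def build_swish_url_sketch(payment: PaymentSketch) -> str:
    # urlencode escapes each value with quote_plus by default
    query = urlencode(
        {
            "sw": payment.phone_number,
            "amt": round(payment.amount, 1),
            "cur": "SEK",
            "msg": payment.message,
        }
    )
    return "https://app.swish.nu/1/p/sw/?" + query


example = PaymentSketch(phone_number="0701234567", amount=80.0, message="Cola x2 & chips")
print(build_swish_url_sketch(example))
# https://app.swish.nu/1/p/sw/?sw=0701234567&amt=80.0&cur=SEK&msg=Cola+x2+%26+chips
```

Escaping matters here because an unescaped `&` in the message would otherwise be read as the start of a new query parameter.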


# Including arithmetic
The structured output models also contain a setup for handling arithmetic operations when the LLM requests them. Below is a text-only example.

Try recording your own video where you prompt the model to, for instance, add or subtract amounts!
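The expression models themselves are defined in `payment_models.py`, which is not reproduced here. To illustrate the idea, the following is a minimal stand-in using plain dataclasses rather than the actual structured output models. The names `Num`, `BinOp`, and `evaluate` are hypothetical and chosen so they do not clash with the real imports.

```python
import operator
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    """Leaf node: a literal number."""
    value: float


@dataclass
class BinOp:
    """Inner node: an operator applied to two sub-expressions."""
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Num, BinOp]

_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(expr: Expr) -> float:
    """Recursively evaluate an expression tree to a float."""
    if isinstance(expr, Num):
        return float(expr.value)
    if isinstance(expr, BinOp):
        if expr.op not in _OPERATORS:
            raise ValueError(f"Unsupported operator: {expr.op}")
        return _OPERATORS[expr.op](evaluate(expr.left), evaluate(expr.right))
    raise TypeError(f"Unsupported expression node: {expr!r}")


# "Split the bill of 240 SEK between three people"
print(evaluate(BinOp("/", Num(240), Num(3))))  # 80.0
```

Letting the model emit a tree and evaluating it in code keeps the arithmetic deterministic instead of trusting the model to compute the final amount.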

In [None]:
def parse_payment_from_text(text: str) -> ProcessedPayment:
    """
    Parse payment information from a text prompt only.
    
    This version skips all video/audio processing and relies purely on
    the provided text.
    """
    content = f"""
    # User Contacts
    {CONTACTS}

    # Payment Text
    {text}
    """

    response = client.chat.completions.parse(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an AI assistant that extracts structured payment "
                    "information (Swish). You will be provided with a contact list "
                    "and a text description of the payment. "
                    "Represent amounts as arithmetic expressions using +, -, *, /."
                ),
            },
            {"role": "user", "content": content},
        ],
        response_format=PaymentRequest,
    )

    payment_request: PaymentRequest = response.choices[0].message.parsed
    print(f"Extracted payment request: {payment_request}")

    evaluated_amount = evaluate_expression(payment_request.expression)
    print(f"Evaluated amount: {evaluated_amount}")

    return ProcessedPayment(
        phone_number=payment_request.phone_number,
        amount=evaluated_amount,
        message=payment_request.message,
    )


# Example usage
text_input = "Pay Anna 2 bottles of cola for 15 SEK"
result = parse_payment_from_text(text_input)

print("\n=== RESULTS ===")
print(result)


### Testing with Example Data

For debugging purposes, we can exercise the expression evaluator directly. Below, we test two expressions.

In [None]:
# Test the expression evaluation system
from payment_models import Number, BinaryOperation

# Example: 240 / 3 (splitting a bill)
test_expression = BinaryOperation(
    type="binary_op",
    op="/",
    left=Number(type="number", value=240),
    right=Number(type="number", value=3)
)

result = evaluate_expression(test_expression)
print(f"240 / 3 = {result}")

# Example: (100 + 50) * 0.5 (adding items then taking half)
complex_expression = BinaryOperation(
    type="binary_op",
    op="*",
    left=BinaryOperation(
        type="binary_op",
        op="+",
        left=Number(type="number", value=100),
        right=Number(type="number", value=50)
    ),
    right=Number(type="number", value=0.5)
)

result = evaluate_expression(complex_expression)
print(f"(100 + 50) * 0.5 = {result}")

# Key Takeaways

**And that's it!** You've learned the basics of multimodal AI.

### What We Covered:

1. **Multimodal Message Structure** - How to combine text, images, and transcribed audio
2. **Image Input Methods** - URL, base64, and file upload approaches
3. **Audio Processing** - Using Whisper for speech-to-text transcription