# Prompt Engineering Notebook: OpenAI-Based Multimodal Prompting
*Google Colab-Compatible | Author: ChatGPT (o3) | Date: 2025-07-09*

This notebook demonstrates how to craft image- and audio-aware prompts using the **OpenAI Python SDK** and GPT-4o's native multimodal capabilities.

## Learning Objectives
1. Configure the OpenAI SDK for multimodal requests.
2. Send **image-conditioned** chat completions using `image_url` blocks.
3. Combine **multiple images** and text in one prompt.
4. Transcribe and summarize audio with `audio.transcriptions.create`.
5. Sketch a **Realtime API** loop for low-latency speech-to-speech.
6. Evaluate & debug multimodal outputs with citations and safety checks.

*All examples follow the explain-demo-exercise pattern for classroom use.*

## 0 | Environment Setup

In [None]:
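!pip -q install --upgrade openai pillow python-dotenv
# Upgrade to a recent 1.x release of openai: the vision and transcription calls
# work broadly, but the Realtime helpers sketched in section 5 are missing from early 1.x versions.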

### Add Your OpenAI API Key

In [None]:
import os, getpass
if not os.getenv('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('🔑 OpenAI API Key: ')

## 1 | Quick-Start: Question-Answering over an Image

In [None]:
from openai import OpenAI
client = OpenAI()

img_url = "https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/multimodal/images/puppy.jpeg"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": [
             {"type": "text", "text": "Describe the image in one sentence."},
             {"type": "image_url", "image_url": {"url": img_url}}
         ]}
    ]
)
print(response.choices[0].message.content)


> **How it works:**
A `chat.completions.create` request can embed an `image_url` object directly in the `content` list. The model then "sees" that image when forming its response. (*Syntax from the OpenAI Python SDK README.*)
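
The structure of that `content` list can be made explicit with a small builder. This is a minimal sketch, not part of the SDK: `build_vision_message` is a hypothetical helper name, the example URL is a placeholder, and the optional `detail` field (`"low"`, `"high"`, or `"auto"`) is included on the assumption that you want to trade image fidelity against token cost.

```python
def build_vision_message(text, image_urls, detail="auto"):
    """Return one user message whose content mixes a text part with image parts."""
    # The first part carries the instruction; each later part carries one image.
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append(
            {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        )
    return {"role": "user", "content": content}


# Placeholder URL for illustration only.
msg = build_vision_message(
    "Describe the image in one sentence.",
    ["https://example.com/puppy.jpeg"],
    detail="low",
)
print(msg["content"][1]["image_url"])
```

The returned dict can be passed as one element of `messages` in `client.chat.completions.create`.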

## 2 | Multi-Image & Text Reasoning

In [None]:
# Compare two images and decide which animal looks happier
img1 = "https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/multimodal/images/cat.jpeg"
img2 = "https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/multimodal/images/dog.jpeg"

system_msg = "You are an expert pet behaviorist."
question = "Which animal seems happier and why? Reply in two short sentences."

messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": [
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": img1}},
        {"type": "image_url", "image_url": {"url": img2}}
    ]}
]

resp = client.chat.completions.create(model="gpt-4o", messages=messages)
print(resp.choices[0].message.content)


**Exercise 🖼️:** Swap the images for any two pictures you upload to Colab; observe whether the justification matches the visuals.
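
Files uploaded to Colab live on the local disk, so they have no public URL to put in an `image_url` block. One common workaround is to embed the bytes as a base64 data URL. The sketch below assumes the file extension reflects the real image type; `image_file_to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes


def image_file_to_data_url(path):
    """Encode a local image file as a base64 data URL for an image_url block."""
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

For the exercise, assign the result to `img1` and `img2` (for example `img1 = image_file_to_data_url("/content/my_cat.jpg")`, where the path is whatever you uploaded) and rerun the cell above.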

## 3 | Vision + Function Calling

In [None]:
import json, re
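def parse_bbox(result_text):
    # Toy fallback extractor: returns every run of digits found in free text,
    # in order, as integers (it does not check for brackets or a count of four).
    nums = re.findall(r"\d+", result_text)
    return list(map(int, nums))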

# Define a tool schema:
tools = [
    {
        "type": "function",
        "function": {
            "name": "store_bbox",
            "description": "Save bounding box coordinates (x1,y1,x2,y2)",
            "parameters": {
                "type": "object",
                "properties": {
                    "bbox": {
                        "type": "array",
                        "items": {"type": "integer"}
                    }
                },
                "required": ["bbox"],
            },
        }
    }
]

img_tool_url = img_url  # reuse puppy image

resp = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Locate the puppy's face. Return bbox."},
            {"type": "image_url", "image_url": {"url": img_tool_url}}
        ]}
    ],
    tool_choice="auto"
)
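msg = resp.choices[0].message
if msg.tool_calls:
    # The model called store_bbox; its arguments arrive as a JSON string.
    args = json.loads(msg.tool_calls[0].function.arguments)
    bbox = args["bbox"]
else:
    # With tool_choice="auto" the model may answer in plain text instead.
    bbox = parse_bbox(msg.content or "")
print("Parsed bbox:", bbox)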
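
Coordinates returned by the model are not guaranteed to be well formed or even to lie inside the image, so it is worth checking them before storing or cropping. A minimal sketch follows; `validate_bbox` is a hypothetical helper, and the image size is assumed to be known already (for instance from Pillow's `Image.open(path).size`).

```python
def validate_bbox(bbox, width, height):
    """Return (x1, y1, x2, y2) if the box is plausible for the image, else raise ValueError."""
    if len(bbox) != 4 or not all(isinstance(v, int) for v in bbox):
        raise ValueError(f"Expected four integers, got {bbox!r}")
    x1, y1, x2, y2 = bbox
    # The box must have positive area and stay within the image bounds.
    if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
        raise ValueError(f"Box {bbox!r} does not fit a {width}x{height} image")
    return (x1, y1, x2, y2)


# Illustrative numbers only, not real model output.
print(validate_bbox([40, 30, 200, 180], width=640, height=480))
```

A box that passes this check is only structurally valid; whether it actually frames the puppy's face still needs a visual spot check.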


## 4 | Audio → Text → Summary

*Upload a short MP3/WAV (<25 MB) via the Colab sidebar.*

In [None]:
audio_file = "demo_audio.mp3"  # update after upload

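# With response_format="text" the SDK returns a plain string,
# so there is no .text attribute to read.
with open(audio_file, "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="text"
    )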

summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You summarize transcripts."},
        {"role": "user", "content": f"Summarize in 3 bullet points:\n{transcription}"}
    ]
)
print("\nTRANSCRIPT:\n", transcription[:200], "...")
print("\nSUMMARY:\n", summary.choices[0].message.content)
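
Because uploads must stay under the 25 MB limit noted at the top of this section, a quick size check before calling the API avoids a failed request. This is a sketch under that stated limit; `check_audio_size` and `MAX_UPLOAD_BYTES` are hypothetical names.

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB limit stated in this section


def check_audio_size(path, limit=MAX_UPLOAD_BYTES):
    """Return the file size in bytes, raising ValueError if it reaches the limit."""
    size = os.path.getsize(path)
    if size >= limit:
        raise ValueError(
            f"{path} is {size / 1e6:.1f} MB; split or compress it before uploading"
        )
    return size
```

Calling `check_audio_size(audio_file)` just before the transcription request is enough for this notebook.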


## 5 | Realtime Speech ↔ Speech Skeleton (Advanced)
Below is a **conceptual** loop for the beta Realtime API (speech-to-speech). A full demo requires WebSockets and a microphone stream, which are beyond this notebook's scope.

In [None]:
"""pseudo
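# Conceptual sketch; kept inside a string so the cell does not execute.
# Assumes a recent openai release with the beta Realtime helpers
# (pip install "openai[realtime]"). microphone_stream() and play_audio()
# are hypothetical helpers you would supply.
import base64
from openai import OpenAI
client = OpenAI()

with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as conn:
    for chunk in microphone_stream():  # raw PCM16 audio bytes
        conn.input_audio_buffer.append(audio=base64.b64encode(chunk).decode())
    conn.input_audio_buffer.commit()   # mark the end of the user's turn
    conn.response.create()             # ask the model to reply
    for event in conn:
        if event.type == "response.audio.delta":
            play_audio(base64.b64decode(event.delta))
        elif event.type == "response.done":
            break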
"""

➡️ See the [OpenAI Cookbook realtime example](https://github.com/openai/openai-cookbook) for a complete reference.

## 6 | Evaluation & Safety Checklist
- **Grounding**: Does the answer reference actual image/audio evidence?
- **Hallucination**: Flag when the model guesses unseen details.
- **Privacy**: Strip faces/PII from stored media.
- **Bias**: Test outputs across demographics and accents.
- **Rate limits**: Vision calls are compute-heavy; handle 429 errors.
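
For the rate-limit item, one common approach is to retry with exponential backoff. The sketch below is deliberately generic so it runs without network access: `call_with_backoff` is a hypothetical helper, and the exception type to retry on is passed in by the caller.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retry_on exception, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Delays grow 1x, 2x, 4x, ... with up to 0.5 s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

In this notebook the call might look like `call_with_backoff(lambda: client.chat.completions.create(model="gpt-4o", messages=messages), retry_on=openai.RateLimitError)` after `import openai`. The SDK client also retries some failures on its own (configurable through the `max_retries` option), so an explicit wrapper like this is mainly useful when you want control over the delays.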

## Assignment 🎓
Build a *multimodal diary assistant* that:
1. Accepts a daily photo and voice memo.
2. Generates an uplifting caption plus a 2-sentence reflection.
3. Saves transcripts and captions to a CSV.
4. Implements one safety check (e.g., blur NSFW images or skip them).
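
As a starting point for step 3, the standard `csv` module is enough. This sketch assumes a four-column layout; `FIELDS` and `append_diary_entry` are hypothetical names, and you are free to choose different columns.

```python
import csv
import os

FIELDS = ["date", "caption", "reflection", "transcript"]  # assumed column layout


def append_diary_entry(csv_path, entry):
    """Append one entry (a dict keyed by FIELDS) to the CSV, writing a header for a new file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # newline="" lets the csv module control line endings, which keeps
    # multi-line transcripts correctly quoted.
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(entry)
```

Each day's run would call `append_diary_entry("diary.csv", {...})` with the caption, reflection, and transcript produced by the earlier steps; the file name here is only an example.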

## Further Reading & Resources
- OpenAI Python SDK README (Vision examples)
- GPT-4o Vision function-calling notebook
- Azure/OpenAI Vision how-to guide
- Realtime API docs (speech streaming)
- `audio.transcriptions.create` source reference