# GPT-4o Audio Podcast and Story Example

GPT-4o ("o" for "omni") and GPT-4o mini are multimodal models designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats. GPT-4o mini is the lightweight version of GPT-4o.

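Today we are going to use the `gpt-4o-audio-preview` family of models to generate an expressive podcast and a story. The code cells default to `gpt-4o-mini-audio-preview`; swap in `gpt-4o-audio-preview-2024-12-17` for better quality.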

We'll also showcase how to use the structured outputs SDK feature.

## Getting Started

### Install OpenAI SDK for Python



In [None]:
%pip install --upgrade openai

### Configure the OpenAI client and submit a test request
To set up the client, we need an API key to use with our requests. Skip these steps if you already have an API key.

You can get an API key by following these steps:
1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)
2. [Generate an API key in your project](https://platform.openai.com/api-keys)
3. (RECOMMENDED, BUT NOT REQUIRED) [Set up your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)

Once your API key is set up, let's configure the client.

In [1]:
from openai import OpenAI 
import os
import base64
from datetime import datetime
# For structured outputs later
import json
from pydantic import BaseModel
from typing import List

## Set the API key
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

def get_openai_client(api_key=None):
    if not api_key:
        api_key = os.environ.get("OPENAI_API_KEY", "<Your OpenAI API key if not set as an env var>")
    return OpenAI(api_key=api_key)

## Set the output directory for later
output_dir = "output_audio"
os.makedirs(output_dir, exist_ok=True)
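
With the placeholder fallback above, a missing key only surfaces as an authentication error on the first request. A minimal sketch of failing earlier, assuming the key lives in the `OPENAI_API_KEY` environment variable; `require_api_key` is a hypothetical helper, not part of the OpenAI SDK:

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before running the notebook."
        )
    return key
```

You could then create the client with `OpenAI(api_key=require_api_key())`.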

### First hello world example output
Now let's try exporting a hello world example.

In [None]:
speech_content = "Hello world!!"
voice = "echo" # https://platform.openai.com/docs/guides/text-to-speech#voice-options
try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini-audio-preview",  #gpt-4o-audio-preview-2024-12-17
        modalities=["text", "audio"],
        audio={"voice": voice, "format": "mp3"}, #wav
        max_tokens = 100,
        temperature= 0.2,
        messages=[
            {
                "role": "system",
                "content": f"""You are a helpful assistant. Say the text exactly as provided"""
            },
            {
                "role": "user",
                "content": speech_content
            }
        ],
    )
except Exception as e:
    print(f"Error generating audio: {e}")

# Decode audio and save to file
try:
    mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    speech_file_path = os.path.join(output_dir, f"helloworld_{timestamp}.mp3")
    with open(speech_file_path, "wb") as f:
        f.write(mp3_bytes)
    print(f"Saved audio to {speech_file_path}")
except Exception as e:
    print(f"Error saving audio: {e}")
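
The decode-and-save step above is repeated in every later cell, so it may be worth factoring out. A minimal sketch; `save_base64_audio` and its parameters are hypothetical names introduced here, not part of the OpenAI SDK:

```python
import base64
import os
from datetime import datetime


def save_base64_audio(b64_data: str, out_dir: str, prefix: str, ext: str = "mp3") -> str:
    """Decode base64 audio and write it to a timestamped file; return the path."""
    audio_bytes = base64.b64decode(b64_data)
    os.makedirs(out_dir, exist_ok=True)
    # Microsecond timestamp keeps file names unique across segments
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    path = os.path.join(out_dir, f"{prefix}_{timestamp}.{ext}")
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path
```

In this notebook the call would look like `save_base64_audio(completion.choices[0].message.audio.data, output_dir, "helloworld")`.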

### Controlling Emotion and Accents
GPT-4o is capable of a lot more than that. Let's add accent and emotion parameters to the system message.

Make sure to try changing them to see how they affect the output.

In [None]:
speech_content = "Hello world! Pleased to be here today!"
voice = "echo" # https://platform.openai.com/docs/guides/text-to-speech#voice-options
accent = "posh british male"
emotion = "excited" #sad  
try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini-audio-preview",  #gpt-4o-audio-preview-2024-12-17
        modalities=["text", "audio"],
        audio={"voice": voice, "format": "mp3"}, #wav
        temperature= 0.2,
        messages=[
            {
                "role": "system",
                "content": f"""You are a helpful assistant. Output the text provided, using a {accent} accent and act {emotion}."""
            },
            {
                "role": "user",
                "content": speech_content
            }
        ],
    )
except Exception as e:
    print(f"Error generating audio: {e}")

# Decode audio and save to file
try:
    mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    speech_file_path = os.path.join(output_dir, f"helloworld_{timestamp}.mp3")
    with open(speech_file_path, "wb") as f:
        f.write(mp3_bytes)
    print(f"Saved audio to {speech_file_path}")
except Exception as e:
    print(f"Error saving audio: {e}")
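
To compare several accent and emotion combinations, it can help to build the system messages up front. A small sketch that only constructs the prompts and makes no API calls; `build_system_prompt` is a hypothetical helper and the extra accent and emotion values are examples:

```python
from itertools import product


def build_system_prompt(accent: str, emotion: str) -> str:
    """Build the system message used above for a given accent and emotion."""
    return (
        "You are a helpful assistant. Output the text provided, "
        f"using a {accent} accent and act {emotion}."
    )


# Example values only; swap in whatever you want to compare.
accents = ["posh british male", "scottish female"]
emotions = ["excited", "sad"]

# One prompt per (accent, emotion) pair
prompts = {
    (accent, emotion): build_system_prompt(accent, emotion)
    for accent, emotion in product(accents, emotions)
}

for (accent, emotion), prompt in prompts.items():
    print(f"{accent} / {emotion}: {prompt}")
```

Each prompt can then be passed as the system message in the request above.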

### Putting this together into a podcast or story 
1. Define the speakers we want and add the content.
2. Generate a script using GPT-4o with structured outputs.
3. Process the script and feed each segment to GPT-4o to generate the speech.
4. Assemble the audio outputs into a single audio file, topped and tailed with intro/outro idents.

In [16]:
# Add as many as you want, four used here
predefined_speakers = [
    {"speaker": "Dave", "personality": "News anchor host, comedian","accent": "london british male", "voice": "ash"},
    {"speaker": "Kelly",  "personality": "high energy comedian","accent": "scottish female radio voice", "voice": "nova"},
    {"speaker": "Charlie", "personality": "serious and to the point academic","accent": "sassy american female radio voice", "voice": "sage"},
    {"speaker": "Mike", "personality": "silly and funny","accent": "american male deep raspy", "voice": "echo"}
]
# Add some content - you could load a news article, a PDF or a wikipedia page
textinput = """
The Stargate Project is a new company which intends to invest $500 billion over the next four years building new AI infrastructure for OpenAI in the United States. We will begin deploying $100 billion immediately. This infrastructure will secure American leadership in AI, create hundreds of thousands of American jobs, and generate massive economic benefit for the entire world. This project will not only support the re-industrialization of the United States but also provide a strategic capability to protect the national security of America and its allies.

The initial equity funders in Stargate are SoftBank, OpenAI, Oracle, and MGX. SoftBank and OpenAI are the lead partners for Stargate, with SoftBank having financial responsibility and OpenAI having operational responsibility. Masayoshi Son will be the chairman.

Arm, Microsoft, NVIDIA, Oracle, and OpenAI are the key initial technology partners. The buildout is currently underway, starting in Texas, and we are evaluating potential sites across the country for more campuses as we finalize definitive agreements.

As part of Stargate, Oracle, NVIDIA, and OpenAI will closely collaborate to build and operate this computing system. This builds on a deep collaboration between OpenAI and NVIDIA going back to 2016 and a newer partnership between OpenAI and Oracle.

This also builds on the existing OpenAI partnership with Microsoft. OpenAI will continue to increase its consumption of Azure as OpenAI continues its work with Microsoft with this additional compute to train leading models and deliver great products and services.

All of us look forward to continuing to build and develop AI—and in particular AGI—for the benefit of all of humanity. We believe that this new step is critical on the path, and will enable creative people to figure out how to use AI to elevate humanity.
"""

### Generate the script
Now, taking the article, we generate a script using GPT-4o and the structured outputs SDK feature.

In [None]:
# Define the schema for each script segment
class ScriptSegment(BaseModel):
    speaker: str
    personality: str
    accent: str
    voice: str
    content: str 

class ScriptOutput(BaseModel):
    segments: List[ScriptSegment]

# Convert predefined speakers to JSON string
speakers_json = json.dumps(predefined_speakers, indent=2)

# Prepare system prompt with predefined speakers
system_message = (
    f"""
    Generate a short news podcast script for "OpenAI News" using predefined speakers, each with a specific accent and voice. Ensure the script is engaging and fun, incorporating humor where appropriate.
    Keep the podcast engaging and short.

    - Start with welcoming the listener to "OpenAI News."
    - Use realistic, conversational language that is high-energy and humorous if suitable.
    - Include interactions between speakers like jokes and laughing, to engage the audience.
    - Use the speaker names in the conversation such as "Over to you Mike", "what do you think kelly?"
    - Use short, concise sentences for clarity. maximum 1-3 sentences. 

    # Script expected output

    - **Speaker**: The character or predefined speaker delivering the lines.
    - **Personality** The predefined personality of the voice
    - **Accent**: The predefined accent in which the speaker should deliver their lines.
    - **Voice**: the predefined name of the voice
    - **Content**: The dialogue or script that the speaker will say.

    Predefined Speakers:\n{speakers_json}

    # Steps

    1. Parse the predefined speakers list to understand their accents.
    2. For each news segment, assign lines to an appropriate speaker.
    3. Ensure each line is engaging and humorous if appropriate, using short, concise language and conversational style.
    4. Use tags such as [excitedly], [whispering], [loudly] to indicate how the lines should be said rather than writing "spoke in blah blah in x accent"
    
    # Notes
    - Ensure the script encapsulates only enough content for a 2-minute episode. 
    - Format the JSON meticulously to avoid errors in interpretation. 
    """
)

# Combine messages for the chat
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": textinput},
]

params = {
    "model": "gpt-4o",
    "messages": messages,
    "temperature": 0.9,
    "max_tokens": 5000,

}
# Structured outputs
script = client.beta.chat.completions.parse(**params, response_format=ScriptOutput)

#  Parse the JSON content
script_data = script.choices[0].message.parsed.segments

# Print the first segment as an example
if script_data:
    first_segment = script_data[0]
    print("First Segment of the Podcast Script:")
    print(first_segment.model_dump_json(indent=2))
else:
    print("No script segments found.")
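
Structured outputs constrain the shape of each segment, but they do not guarantee that the model copied `voice`, `accent`, or `personality` verbatim from the predefined list. A hedged sketch that re-applies the predefined values by speaker name; `apply_predefined_speakers` is a hypothetical helper, it assumes segments are mutable objects with these attributes, and `SimpleNamespace` stands in for `ScriptSegment` so the example runs on its own:

```python
from types import SimpleNamespace


def apply_predefined_speakers(segments, speakers):
    """Overwrite personality, accent, and voice on each segment from the speaker list.

    Segments whose speaker name is not in the list are dropped and reported.
    """
    by_name = {s["speaker"].lower(): s for s in speakers}
    kept = []
    for seg in segments:
        known = by_name.get(seg.speaker.strip().lower())
        if known is None:
            print(f"Skipping segment with unknown speaker: {seg.speaker!r}")
            continue
        seg.personality = known["personality"]
        seg.accent = known["accent"]
        seg.voice = known["voice"]
        kept.append(seg)
    return kept


# Stand-in data; in the notebook you would pass script_data and predefined_speakers.
demo_speakers = [
    {"speaker": "Dave", "personality": "News anchor host, comedian",
     "accent": "london british male", "voice": "ash"},
]
demo_segments = [
    SimpleNamespace(speaker="dave", personality="", accent="", voice="alloy", content="Hi!"),
    SimpleNamespace(speaker="Bob", personality="", accent="", voice="echo", content="Hello!"),
]
cleaned = apply_predefined_speakers(demo_segments, demo_speakers)
```

Dropping unknown speakers is one design choice; mapping them to a default speaker would work equally well.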

### Processing the audio segments
Now let's generate the audio for each segment and stitch the segments together.

If you don't have pydub installed, let's install it now.

pydub relies on the `audioop` module, which was deprecated in Python 3.11 and removed in 3.13, so on Python 3.13 and later we also install `audioop-lts` to restore it.

(We make no guarantees about the usability or security of 3rd party software such as PyDub.)
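
A quick way to check whether your interpreter needs the backport, as a sketch; `needs_audioop_backport` is a hypothetical helper name:

```python
import sys


def needs_audioop_backport(version_info=sys.version_info) -> bool:
    """Return True when the standard library no longer ships audioop (Python 3.13+)."""
    return tuple(version_info[:2]) >= (3, 13)


if needs_audioop_backport():
    print("Python 3.13+: install audioop-lts alongside pydub.")
else:
    print("audioop is still in the standard library; pydub alone is enough.")
```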

In [None]:
%pip install --upgrade pydub audioop-lts

In [None]:
from pydub import AudioSegment
from pydub.effects import normalize
audio_segments = []
intro_path = "intro.mp3"
outro_path = "outro.mp3"

# For each script segment process the audio
for idx, segment in enumerate(script_data):
    accent = segment.accent
    personality = segment.personality
    speaker = segment.speaker
    voice = segment.voice
    speech_content = segment.content

    print(f"Processing segment {idx+1}: {speaker} - {accent}")

    # Currently can only change the voice on each completion request not message.
    try:
        completion = client.chat.completions.create(
            model="gpt-4o-mini-audio-preview",  #for better quality use gpt-4o-audio-preview-2024-12-17
            modalities=["text", "audio"],
            max_tokens = 800,
            audio={"voice": voice, "format": "mp3"},
            temperature= 0.8,
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a helpful assistant. output the speech provided using a {accent} accent and {personality} personality. Never say the type of accent or personality"""
                },
                {
                    "role": "user",
                    "content": speech_content
                }
            ],
        )
    except Exception as e:
        print(f"Error generating audio for segment {idx+1}: {e}")
        continue

    # Decode audio and save to file
    try:
        mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        speech_file_path = os.path.join(output_dir, f"speech_{timestamp}.mp3")
        with open(speech_file_path, "wb") as f:
            f.write(mp3_bytes)
        print(f"Saved audio to {speech_file_path}")
    except Exception as e:
        print(f"Error saving audio for segment {idx+1}: {e}")
        continue

    # Load the audio segment using pydub, normalize, and store
    try:
        audio = AudioSegment.from_mp3(speech_file_path)
        audio = normalize(audio)
        audio_segments.append(audio)
    except Exception as e:
        print(f"Error processing audio for segment {idx+1}: {e}")
        continue

try:
    if os.path.exists(intro_path):
        intro = AudioSegment.from_mp3(intro_path)
        intro = normalize(intro)
        print("Intro music added.")
    else:
        intro = AudioSegment.silent(duration=500)
        print("Intro music not found. Using 0.5 second of silence instead.")
except Exception as e:
    print(f"Error adding intro music: {e}")
    intro = AudioSegment.silent(duration=1000)

try:
    if os.path.exists(outro_path):
        outro = AudioSegment.from_mp3(outro_path)
        outro = normalize(outro)
        print("Outro music added.")
    else:
        outro = AudioSegment.silent(duration=1000)
        print("Outro music not found. Using 1 second of silence instead.")
except Exception as e:
    print(f"Error adding outro music: {e}")
    outro = AudioSegment.silent(duration=1000)

# Combine all audio: Intro + speech segments + Outro
final_podcast = intro
for audio in audio_segments:
    final_podcast += AudioSegment.silent(duration=250)
    final_podcast += audio
final_podcast += outro

# Export the final podcast
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
final_output_path = f"podcast_{timestamp}.mp3"
try:
    final_podcast.export(final_output_path, format="mp3")
    print(f"Final output saved to {final_output_path}")
except Exception as e:
    print(f"Error exporting final output: {e}")

### The voices seem random?
You'll notice that the voices can change from segment to segment. As of today, it's not possible to keep the audio generation consistent and deterministic.

One option is to generate all the segments in one go with a cut-segment identifier, then re-slice the audio. We'll show that example soon.

Alternatively, you could use a TTS engine, at the cost of expressiveness.
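
As a rough sketch of the TTS route, the function below calls the SDK's speech endpoint (`client.audio.speech.create`) and writes the returned bytes to disk. `synthesize_with_tts` is a hypothetical helper, the `tts-1` default is an assumption, and not every voice name in the speaker list is necessarily available on that endpoint:

```python
def synthesize_with_tts(client, text: str, voice: str, path: str, model: str = "tts-1") -> str:
    """Render text with the speech endpoint and write the audio bytes to path.

    Hypothetical helper: accent and personality prompts are not used here,
    which is the loss of expressiveness mentioned above.
    """
    response = client.audio.speech.create(model=model, voice=voice, input=text)
    with open(path, "wb") as f:
        f.write(response.read())
    return path
```

In the segment loop this would replace the chat completion call, for example `synthesize_with_tts(client, segment.content, segment.voice, speech_file_path)`.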

### How about a scary bedtime story?
Now let's make a scary bedtime story. The steps are the same as before, but with a different intro and outro.

In [44]:
predefined_speakers = [
    {"speaker": "Zombie Kelly", "personality": "scary zombie with sass","accent": "scottish female", "voice": "nova"},
    {"speaker": "Scary Harry", "personality": "scary silly ghost","accent": "scottish male raspy", "voice": "echo"},
]

# the story prompt
textinput = """Peter and Jane go to the haunted castle in Scotland to find the treasure"""

In [None]:
# scary intros
intro_path = "intro2.mp3"
outro_path = "outro2.mp3"

# Define the schema for each script segment
class ScriptSegment(BaseModel):
    speaker: str
    personality: str
    accent: str
    voice: str
    content: str 

class ScriptOutput(BaseModel):
    segments: List[ScriptSegment]

# Convert predefined speakers to JSON string
speakers_json = json.dumps(predefined_speakers, indent=2)

# Prepare system prompt with predefined speakers
system_message = (
    f"""
    Generate a short story for a kids radio show using predefined speakers, each with a specific accent and voice. Ensure the script is engaging and fun, incorporating humor where appropriate.
    Keep the story engaging and short and suitable for children.

    - Start by going straight into the story and set the scene
    - Introduce new characters
    - Use short, concise sentences for clarity. maximum 1-3 sentences. 

    # Script expected output

    - **Speaker**: The character or predefined speaker delivering the lines.
    - **Personality** The predefined personality of the voice
    - **Accent**: The predefined accent in which the speaker should deliver their lines.
    - **Voice**: the predefined name of the voice
    - **Content**: The dialogue or script that the speaker will say.

    Predefined Speakers:\n{speakers_json}

    # Steps

    1. Parse the predefined speakers list to understand their accents.
    2. For each story segment, assign lines to an appropriate speaker.
    3. Ensure each line is engaging and humorous if appropriate, using short, concise language and conversational style.
    4. Use tags such as [excitedly], [whispering], [loudly] to indicate how the lines should be said rather than writing "spoke in blah blah in x accent"
    
    # Notes
    - Ensure the script encapsulates only enough content for a 2-minute episode. 
    - Format the JSON meticulously to avoid errors in interpretation. 
    """
)

# Combine messages for the chat
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": textinput},
]

params = {
    "model": "gpt-4o",
    "messages": messages,
    "temperature": 0.9,
    "max_tokens": 5000,

}
# Structured outputs
script = client.beta.chat.completions.parse(**params, response_format=ScriptOutput)

#  Parse the JSON content
script_data = script.choices[0].message.parsed.segments

from pydub import AudioSegment
from pydub.effects import normalize
audio_segments = []

for idx, segment in enumerate(script_data):
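    accent = segment.accent
    # personality is used in the system prompt below; without this line the
    # value left over from the podcast loop would be reused
    personality = segment.personality
    speaker = segment.speaker
    voice = segment.voice
    speech_content = segment.content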

    print(f"Processing segment {idx+1}: {speaker} - {accent}")

    # completion for each segment
    try:
        completion = client.chat.completions.create(
            model="gpt-4o-mini-audio-preview",  #for better quality use gpt-4o-audio-preview-2024-12-17
            modalities=["text", "audio"],
            max_tokens = 800,
            audio={"voice": voice, "format": "mp3"},
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a helpful assistant. output the speech provided using a {accent} accent and {personality} personality. Never say the type of accent or personality"""
                },
                {
                    "role": "user",
                    "content": speech_content
                }
            ],
        )
    except Exception as e:
        print(f"Error generating audio for segment {idx+1}: {e}")
        continue

    # Decode audio and save to file
    try:
        mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        speech_file_path = os.path.join(output_dir, f"speech_{timestamp}.mp3")
        with open(speech_file_path, "wb") as f:
            f.write(mp3_bytes)
        print(f"Saved audio to {speech_file_path}")
    except Exception as e:
        print(f"Error saving audio for segment {idx+1}: {e}")
        continue

    # Load the audio segment using pydub, normalize, and store
    try:
        audio = AudioSegment.from_mp3(speech_file_path)
        audio = normalize(audio)
        audio_segments.append(audio)
    except Exception as e:
        print(f"Error processing audio for segment {idx+1}: {e}")
        continue

try:
    if os.path.exists(intro_path):
        intro = AudioSegment.from_mp3(intro_path)
        intro = normalize(intro)
        print("Intro music added.")
    else:
        intro = AudioSegment.silent(duration=500)
        print("Intro music not found. Using 0.5 second of silence instead.")
except Exception as e:
    print(f"Error adding intro music: {e}")
    intro = AudioSegment.silent(duration=1000)

try:
    if os.path.exists(outro_path):
        outro = AudioSegment.from_mp3(outro_path)
        outro = normalize(outro)
        print("Outro music added.")
    else:
        outro = AudioSegment.silent(duration=1000)
        print("Outro music not found. Using 1 second of silence instead.")
except Exception as e:
    print(f"Error adding outro music: {e}")
    outro = AudioSegment.silent(duration=1000)

# Combine all audio: Intro + speech segments + Outro
final_podcast = intro
for audio in audio_segments:
    final_podcast += AudioSegment.silent(duration=250)
    final_podcast += audio
final_podcast += outro

# Export the final podcast
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
final_output_path = f"story_{timestamp}.mp3"
try:
    final_podcast.export(final_output_path, format="mp3")
    print(f"Final output saved to {final_output_path}")
except Exception as e:
    print(f"Error exporting final output: {e}")


## Conclusion
I hope you've had fun playing with GPT-4o's multimodal capabilities. Please share what you've made!