# Text to Speech with Azure AI Foundry gpt-4o-mini-tts

<img src="https://devblogs.microsoft.com/foundry/wp-content/uploads/sites/89/2025/04/image-1024x576.png">

GPT-4o Mini TTS is a text-to-speech model that converts written text into spoken language with high accuracy and efficiency. Its voice output is customizable: developers can instruct it to speak in a specific way, such as "talk like a sympathetic customer service agent".

GPT-4o Mini TTS Key Capabilities:
- High Accuracy: The model offers improved word error rate (WER) performance, making it highly accurate in converting text to speech.
- Efficiency: Optimized for faster and more cost-efficient speech generation, suitable for applications that need quick responses and lower resource consumption.
- Customization: The model can be instructed via prompting on how to say the input, allowing control over nuances such as accent, emotional range, intonation, impressions, speed of speech, tone, and whispering.

> Model comparison: https://devblogs.microsoft.com/foundry/get-started-azure-openai-advanced-audio-models/#model-comparison

A soundboard demo is available:
> https://github.com/Azure-Samples/azure-openai-tts-demo

In [1]:
import gradio as gr
import os
import sys
import tempfile
import time

from openai import AzureOpenAI
from datetime import datetime
from dotenv import load_dotenv

In [2]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 18-Apr-2025 07:45:23


In [4]:
load_dotenv("azure.env")

api_version = '2025-03-01-preview'
model_name = "gpt-4o-mini-tts"
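
The client created later in this notebook reads the `endpoint` and `key` variables from `azure.env`. If either one is missing, the failure only shows up when the first request is sent. The sketch below is one possible early check; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os


def missing_env_vars(names):
    """Return the names from `names` that are unset or empty in the environment."""
    return [name for name in names if not os.getenv(name)]


# 'endpoint' and 'key' are the variable names this notebook reads with os.getenv.
missing = missing_env_vars(["endpoint", "key"])
if missing:
    print(f"Missing settings in azure.env: {', '.join(missing)}")
```

Run it right after `load_dotenv("azure.env")` so that a misnamed or empty variable is reported before any API call is made.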

> The gpt-4o-mini-tts service currently offers 11 voices, each with its own character: Alloy (neutral), Ash (calm), Ballad (melodic), Coral (energetic), Echo (deep), Fable (warm), Onyx (authoritative), Nova (friendly), Sage (wise), Shimmer (cheerful), and Verse (poetic).

In [5]:
# List of voices
VOICES = [
    "alloy", "ash", "ballad", "coral", "echo", "fable", "onyx", "nova", "sage",
    "shimmer", "verse"
]
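
When the voice name comes from somewhere other than the dropdown (a script or a config file, for example), it can be checked against the list before a request is sent. This is a minimal sketch; `validate_voice` is a hypothetical helper, and the list is repeated only to keep the example self-contained.

```python
# Same voice names as the VOICES list in the cell above.
VOICES = [
    "alloy", "ash", "ballad", "coral", "echo", "fable", "onyx", "nova", "sage",
    "shimmer", "verse"
]


def validate_voice(name):
    """Return the normalized (lowercase, trimmed) voice name, or raise ValueError."""
    cleaned = name.strip().lower()
    if cleaned not in VOICES:
        raise ValueError(
            f"Unknown voice {name!r}; expected one of: {', '.join(VOICES)}")
    return cleaned
```

For example, `validate_voice(" Ash ")` returns `"ash"`, while `validate_voice("robot")` raises a `ValueError`.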

In [6]:
def gpt4ominitts(instructions, text, voice):
    """
    Generate a speech audio file with the gpt-4o-mini-tts model, using the specified voice and instructions.

    Args:
        instructions (str): Instructions for speech synthesis, including tone, delivery, and emotion.
        text (str): The text to be converted into speech.
        voice (str): The name of the voice to use (one of VOICES).

    Returns:
        str: The path of the generated temporary audio file in WAV format.
    """
    client = AzureOpenAI(
        azure_endpoint=os.getenv('endpoint'),
        api_key=os.getenv('key'),
        api_version=api_version,
    )

    response = client.audio.speech.create(
        model=model_name,
        voice=voice,
        input=text,
        instructions=instructions,
        response_format="wav",
    )

    with tempfile.NamedTemporaryFile(delete=False,
                                     suffix=".wav") as temp_audio:
        temp_audio.write(response.content)
        tempfilename = temp_audio.name
    time.sleep(0.5)

    return tempfilename
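
The function above builds a new `AzureOpenAI` client on every call. One possible variation, sketched below, takes the client as a parameter so that it can be created once and reused, or swapped for a stub when testing without network access. The name `synthesize_to_file` and its `suffix` parameter are hypothetical. The only call it makes is `client.audio.speech.create`, with the same arguments as the cell above.

```python
import tempfile


def synthesize_to_file(client, model, voice, text, instructions, suffix=".wav"):
    """Request speech from `client` and write the returned bytes to a temp file.

    `client` is any object exposing audio.speech.create(...), such as the
    AzureOpenAI client used in this notebook. Returns the temp file path.
    """
    response = client.audio.speech.create(
        model=model,
        voice=voice,
        input=text,
        instructions=instructions,
        response_format=suffix.lstrip("."),  # ".wav" -> "wav"
    )
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as temp_audio:
        temp_audio.write(response.content)
        return temp_audio.name
```

With a client created once as in `gpt4ominitts`, a call would look like `synthesize_to_file(client, model_name, "ash", text, instructions)`.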

In [7]:
instructions = """
Tone: Inspirational, determined, and unifying, conveying a deep commitment to shared values and collective progress.

Delivery: Measured and deliberate, with intentional pacing to allow key messages to resonate and reinforce a sense of gravity.

Emotion: Passionate, sincere, and empathetic, appealing to both reason and the heart to build trust and rally support.

Punctuation: Purposeful, strong sentence structure—utilizing pauses, rhetorical repetition, and emphatic phrasing to drive momentum and clarity.

Pronunciation: Clear and deliberate enunciation, especially on policy points and names, to project authority and reinforce message precision.

Personality Affect: Statesmanlike and relatable—balancing gravitas with warmth, conveying both leadership and a genuine connection to the people.
"""

In [8]:
text = """
ARTICLE PREMIER.
La France est une République indivisible, laïque, démocratique et sociale. Elle assure l'égalité devant la loi de tous les citoyens sans distinction d'origine, de race ou de religion. Elle respecte toutes les croyances. Son organisation est décentralisée.

La loi favorise l'égal accès des femmes et des hommes aux mandats électoraux et fonctions électives, ainsi qu'aux responsabilités professionnelles et sociales.
"""

In [9]:
with gr.Blocks(
        theme='soft',
        css=
        ".gr-box {border-radius: 12px; box-shadow: 0 4px 12px rgba(0,0,0,0.05);} .gr-button {font-size: 16px;}"
) as webapp:
    gr.Markdown("""
    # 🎙️ Azure OpenAI GPT-4o Mini TTS
    #### Powered by **Azure AI Foundry**
    Transform your text into high-quality speech using the GPT-4o Mini Text-to-Speech model.
    """)

    with gr.Row(equal_height=True):
        with gr.Column():
            gr.Markdown("### 📝 Text Input")
            instructions_box = gr.Textbox(
                label="Model Instructions",
                placeholder="E.g., use a calm tone, slow speech, etc.",
                value=instructions,
                lines=10,
                interactive=True)
            input_box = gr.Textbox(
                label="Text to Convert",
                placeholder="Enter the text you want to synthesize...",
                value=text,
                lines=10,
                interactive=True)

        with gr.Column():
            gr.Markdown("### 🎧 Output & Settings")
            voice_picker = gr.Dropdown(
                VOICES,
                value="ash",
                label="Voice Template",
                info="Select a predefined voice configuration")
            output_audio = gr.Audio(label="🎵 Generated Audio",
                                    type="filepath",
                                    interactive=False)

    with gr.Row():
        play_button = gr.Button("▶️ Generate Speech",
                                variant="primary",
                                size="lg")
        play_button.click(
            fn=gpt4ominitts,
            inputs=[instructions_box, input_box, voice_picker],
            outputs=[output_audio],
        )

webapp.launch(share=True)

* Running on local URL:  http://127.0.0.1:7862
* Running on public URL: https://d40d4e1ab19d9d4537.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
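
Because `gpt4ominitts` creates its temporary files with `delete=False`, each click on the button leaves a WAV file on disk. A small cleanup step, sketched below, can be run once the session is over. `remove_files` is a hypothetical helper, and it assumes the caller has kept the paths returned by `gpt4ominitts`.

```python
import os


def remove_files(paths):
    """Delete each existing file in `paths`; return the paths actually removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Paths that no longer exist are skipped, so the function is safe to call more than once with the same list.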


