# gpt-4o-mini-transcribe for speech-to-text with Azure AI Foundry

<img src="https://devblogs.microsoft.com/foundry/wp-content/uploads/sites/89/2025/04/image-1024x576.png">

> https://devblogs.microsoft.com/foundry/get-started-azure-openai-advanced-audio-models/

In [1]:
import os
import sys
import time

from datetime import datetime
from dotenv import load_dotenv
from openai import AzureOpenAI

In [2]:
model = "gpt-4o-mini-transcribe"

In [3]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [4]:
print(f"Today is {datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 18-Apr-2025 07:46:48


In [5]:
if load_dotenv("azure.env"):
    print('OK')
else:
    print('ERROR: Check file location or name.')

OK


## Client

In [6]:
client = AzureOpenAI(
    azure_endpoint=os.getenv("endpoint"),
    api_key=os.getenv("key"),
    api_version="2025-01-01-preview",
)
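`os.getenv` returns `None` when a variable is missing, and the resulting client error may not name the variable that was absent. The following is a minimal sketch of a check that does, assuming the variables are read from the process environment as above. `require_env` and `DEMO_ENDPOINT` are hypothetical names used for illustration, not part of the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check azure.env")
    return value


# Demonstration with a placeholder variable (not a real endpoint)
os.environ["DEMO_ENDPOINT"] = "https://example.invalid/"
print(require_env("DEMO_ENDPOINT"))
```

In the client cell, the same helper could wrap the `endpoint` and `key` lookups, for example `azure_endpoint=require_env("endpoint")`.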

## Example

In [7]:
audiofile_path = "speech.wav"

In [8]:
start = time.perf_counter()

# Keep the handle open: later cells reuse `audio_file`
audio_file = open(audiofile_path, "rb")

transcription = client.audio.transcriptions.create(model=model,
                                                   file=audio_file)

end = time.perf_counter()
print(f"Done in {end - start:.3f} seconds")

Done in 2.285 seconds


In [9]:
transcription

Transcription(text="There are some parts that I quibbled with because I didn't understand at the time that it's in the editing that a film is shaped. So I noticed that Stephen had shot almost everything I had demanded, you know, that had to be there, but a lot of it ended up cut. And that wasn't very good to see. But the story is strong. It's a very robust story, and it's very universal, so that no matter how they cut it, it's always true.", logprobs=None)

In [10]:
print(transcription.text)

There are some parts that I quibbled with because I didn't understand at the time that it's in the editing that a film is shaped. So I noticed that Stephen had shot almost everything I had demanded, you know, that had to be there, but a lot of it ended up cut. And that wasn't very good to see. But the story is strong. It's a very robust story, and it's very universal, so that no matter how they cut it, it's always true.


## Response format

In [11]:
transcription = client.audio.transcriptions.create(model=model,
                                                   file=audio_file,
                                                   response_format="text")

print(transcription)

And there are some parts that I quibbled with, because I didn't understand at the time that it's in the editing that a film is shaped. So I noticed that Stephen had shot almost everything I had demanded, you know, that had to be there, but a lot of it ended up cut. And that wasn't very good to see. But the story is strong. It's a very robust story, and it's very universal. So that no matter how they cut it, it's always true.
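Judging by the outputs above, the default call returns a `Transcription` object with a `.text` attribute, while `response_format="text"` returns a plain string. The sketch below shows one way to handle both shapes in downstream code. `transcript_text` is a hypothetical helper, and the sample values are stand-ins, so no API call is made.

```python
from types import SimpleNamespace


def transcript_text(result) -> str:
    """Return the transcript whether `result` is a plain string or an object with `.text`."""
    if isinstance(result, str):
        return result
    return result.text


# Stand-ins for the two result shapes shown above (no API call is made here)
as_object = SimpleNamespace(text="But the story is strong.")
as_string = "But the story is strong.\n"

print(transcript_text(as_object).strip())
print(transcript_text(as_string).strip())
```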



## Streaming

In [12]:
start = time.perf_counter()

stream = client.audio.transcriptions.create(model=model,
                                            file=audio_file,
                                            stream=True)

# Each delta is a token, not necessarily a whole word (e.g. " qu", "ibb", "led")
for idx, event in enumerate(stream, start=1):
    if event.type == "transcript.text.delta":
        elapsed = time.perf_counter() - start
        print(f"{idx:03} word: {event.delta:15} [{elapsed:.3f} sec]")

end = time.perf_counter()
print(f"\nDone in {end - start:.3f} seconds")

001 word: There           [1.504 sec]
002 word:  are            [1.543 sec]
003 word:  some           [1.558 sec]
004 word:  parts          [1.567 sec]
005 word:  that           [1.568 sec]
006 word:  I              [1.573 sec]
007 word:  qu             [1.574 sec]
008 word: ibb             [1.576 sec]
009 word: led             [1.577 sec]
010 word:  with           [1.579 sec]
011 word:  because        [1.579 sec]
012 word:  I              [1.580 sec]
013 word:  didn't         [1.581 sec]
014 word:  understand     [1.583 sec]
015 word:  at             [1.583 sec]
016 word:  the            [1.584 sec]
017 word:  time           [1.584 sec]
018 word:  that           [1.585 sec]
019 word:  it's           [1.585 sec]
020 word:  in             [1.586 sec]
021 word:  the            [1.587 sec]
022 word:  editing        [1.588 sec]
023 word:  that           [1.588 sec]
024 word:  a              [1.589 sec]
025 word:  film           [1.589 sec]
026 word:  is             [1.591 sec]
027 word:  s
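The streamed deltas are fragments, so a caller usually needs to join them to recover the transcript. The sketch below uses stand-in events that mirror the first deltas printed above, so it runs without an API call. `collect_transcript` is a hypothetical helper name.

```python
from types import SimpleNamespace


def collect_transcript(events) -> str:
    """Concatenate the `delta` of every `transcript.text.delta` event, in order."""
    parts = []
    for event in events:
        if event.type == "transcript.text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Stand-in events mirroring the first deltas printed above (no API call is made here)
fake_stream = [
    SimpleNamespace(type="transcript.text.delta", delta=d)
    for d in ["There", " are", " some", " parts", " that", " I", " qu", "ibb", "led"]
]
print(collect_transcript(fake_stream))  # There are some parts that I quibbled
```

Because each delta carries its own leading space, a plain `"".join` is enough; no separator needs to be added.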

## Logprob
While transcribing audio, the model predicts the next token (a word or word fragment) and assigns each prediction a probability, a measure of how confident it is.

Instead of returning that probability directly (such as 0.92, or 92%), the API returns its natural logarithm, which is what `logprob` represents.

The closer the logprob is to 0, the more confident the model is: a logprob of 0 corresponds to a probability of 1, and more negative values mean lower confidence.
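To read a logprob as an ordinary probability, apply the exponential function. The sketch below assumes the values are natural logarithms, as is usual for OpenAI logprobs, and uses a few values copied from the sample output in this section.

```python
import math

# Logprob values copied from the sample output in this section
samples = {"There": -0.1535009100, " are": -0.0040983470, " because": -0.2032676200, "led": 0.0}

for token, logprob in samples.items():
    probability = math.exp(logprob)  # inverse of the natural logarithm
    print(f"{token!r:12} logprob = {logprob:.6f}  probability = {probability:.4f}")
```

Tokens such as `There` and ` because` come out noticeably below 1, which matches their more negative logprobs in the output.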

In [13]:
start = time.perf_counter()

stream = client.audio.transcriptions.create(model=model,
                                            file=audio_file,
                                            response_format="json",
                                            stream=True,
                                            include=["logprobs"])

for event in stream:
    timestamp = time.perf_counter() - start

    if event.type == "transcript.text.delta":
        # logprobs may be absent; format the number only when a value is available
        logprob = event.logprobs[0].logprob if event.logprobs else None
        logprob_str = f"{logprob:.10f}" if logprob is not None else "n/a"
        print(f"{timestamp:.3f} sec: {event.delta:20} [logprob = {logprob_str}]")

    elif event.type == "transcript.text.done":
        text = event.text
        snippet = text[:20] + "..." if len(text) > 20 else text
        print(f"\nFinished at {timestamp:.3f}s: {snippet}")

1.489 sec: There                [logprob = -0.1535009100]
1.490 sec:  are                 [logprob = -0.0040983470]
1.494 sec:  some                [logprob = -0.0000057962]
1.498 sec:  parts               [logprob = -0.0000006704]
1.505 sec:  that                [logprob = -0.0000341667]
1.509 sec:  I                   [logprob = -0.0000077034]
1.514 sec:  qu                  [logprob = -0.0000022201]
1.519 sec: ibb                  [logprob = -0.0001161788]
1.524 sec: led                  [logprob = 0.0000000000]
1.532 sec:  with                [logprob = -0.0000023393]
1.534 sec:  because             [logprob = -0.2032676200]
1.539 sec:  I                   [logprob = -0.0000065114]
1.543 sec:  didn't              [logprob = -0.0004909569]
1.549 sec:  understand          [logprob = -0.0000026969]
1.556 sec:  at                  [logprob = -0.0000160477]
1.560 sec:  the                 [logprob = -0.0000253456]
1.563 sec:  time                [logprob = -0.0000003128]
1.569 sec:  tha