# Usage of Speech to Text APIs in Python

This notebook demonstrates various approaches for speech-to-text transcription using Azure services.

## Table of Contents

1. [Setup and Initialization](#Setup-and-Initialization)
2. [Azure Speech SDK](#Azure-Speech-SDK)
   - [Recognizing from Microphone](#Recognise-from-mic)
   - [Recognizing from File](#From-a-file)
   - [Understanding SpeechRecognitionResult](#Understanding-speechsdk.SpeechRecognitionResult)
3. [Continuous Speech Recognition](#Continuous-Speech-recognition)
   - [Continuous Recognition on File](#Optional:-Continuous-Speech-recognition-on-File)
4. [Speech Recognition with Diarization](#Continuous-Speech-recognition-with-diarization)
5. [Fast Transcription API](#Fast-Transcription)
6. [Azure OpenAI Whisper](#Azure-OpenAI-Whisper)
7. [Azure OpenAI GPT-4o-transcribe](#Azure-OpenAI-GPT-4o-transcribe-Model)
   - [File Transcription](#1.-Transcribing-Audio-Files-with-GPT-4o-transcribe)
   - [Advanced Transcription Options](#Advanced-Options-for-File-Transcription)
   - [Streaming Transcription for Files](#Streaming-Transcription-for-Completed-Audio-Files)
8. [WebSockets for Real-time Transcription (OpenAI Real-time API)](#Using-WebSockets-with-OpenAI-Realtime-API-for-Live-Transcription)

This notebook requires several API keys and configurations to be set in a `.env` file.

## Setup and Initialization

The following cells import the libraries and environment variables needed for the Azure Speech SDK. Additional setup is done in the appropriate sections.

## Environment Setup

This notebook requires several API keys and configurations to be set in a `.env` file in the same directory as the notebook. Below is a guide on how to set up your `.env` file with all required credentials.

### Required Environment Variables

Create a `.env` file in the notebook directory with the following variables:

```
# Azure Speech Service credentials (for Azure Speech SDK and Fast Transcription)
SPEECH_KEY=your_azure_speech_service_key
# Region of the Speech resource, e.g. uksouth or eastus
SERVICE_REGION=your_azure_region

# Azure OpenAI Service credentials (for Whisper model)
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint

# Azure OpenAI GPT-4o credentials (for GPT-4o-transcribe)
AZURE_OPENAI_GPT4O_API_KEY=your_azure_openai_gpt4o_api_key
AZURE_OPENAI_GPT4O_ENDPOINT=your_azure_openai_gpt4o_endpoint
AZURE_OPENAI_GPT4O_DEPLOYMENT_ID=your_azure_openai_gpt4o_deployment_id

# Direct OpenAI API credentials (optional, for using OpenAI's services directly)
OPENAI_API_KEY=your_openai_api_key
```

### Credential Usage

| Variable | Used For |
|---------|----------|
| `SPEECH_KEY`, `SERVICE_REGION` | Azure Speech SDK, speech recognition, diarization, and fast transcription API |
| `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` | Azure OpenAI Whisper model for audio transcription |
| `AZURE_OPENAI_GPT4O_API_KEY`, `AZURE_OPENAI_GPT4O_ENDPOINT`, `AZURE_OPENAI_GPT4O_DEPLOYMENT_ID` | Azure OpenAI GPT-4o-transcribe model for high-quality transcription |
| `OPENAI_API_KEY` | Direct access to OpenAI services (optional alternative to Azure) |
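
Before running any cell, it can help to confirm that the variables in the table above are actually set. The sketch below is one way to do that, assuming the `.env` file has already been loaded into the process environment; `missing_env_vars` and `REQUIRED_VARS` are illustrative names, not part of the notebook's code.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the table above; OPENAI_API_KEY is optional and omitted.
REQUIRED_VARS = [
    "SPEECH_KEY",
    "SERVICE_REGION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_GPT4O_API_KEY",
    "AZURE_OPENAI_GPT4O_ENDPOINT",
    "AZURE_OPENAI_GPT4O_DEPLOYMENT_ID",
]


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing .env entries:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.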

### Setting Up Azure Resources

1. For the Azure Speech Service, create a resource in the Azure portal and copy the key and region
2. For Azure OpenAI, create a resource and deploy the Whisper model with your chosen deployment name
3. For GPT-4o-transcribe, deploy the model in your Azure OpenAI resource and note the deployment ID

Now let's begin by importing the necessary libraries and loading these environment variables:

In [1]:
import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv
from openai import AzureOpenAI, OpenAI
import time

import os
import json

load_dotenv()
speech_key = os.getenv("SPEECH_KEY")
service_region = os.getenv("SERVICE_REGION")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")

In [None]:
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

## Azure Speech SDK

Azure Speech SDK provides a robust set of speech recognition capabilities. This section demonstrates different ways to use the SDK for transcription from various sources.

The Speech SDK is a software development kit that exposes many of the Azure Speech service's capabilities, so you can build speech-enabled applications across multiple platforms and programming languages.

[Official Documentation: Azure Speech SDK](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-sdk)

### Recognise from mic

In [None]:
def from_mic() -> speechsdk.SpeechRecognitionResult:
    """
    Capture speech from the microphone and perform speech recognition.
    
    Returns:
        speechsdk.SpeechRecognitionResult: The recognition result object
    """
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    print("Speak into your microphone.")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    print(speech_recognition_result.text)
    return speech_recognition_result


speech_recognition_result = from_mic()

print(json.dumps(json.loads(speech_recognition_result.json), indent=4))

### From a file

The following section demonstrates how to transcribe speech from an audio file using Azure Speech SDK.

In [None]:
FILE_NAME = "../data/dummy-call-centre.wav"
audio_config = speechsdk.AudioConfig(filename=FILE_NAME)


def from_file() -> speechsdk.SpeechRecognitionResult:
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    print(f"Recognizing speech from file: {FILE_NAME}")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    return speech_recognition_result


speech_recognition_result = from_file()

print(json.dumps(json.loads(speech_recognition_result.json), indent=4))
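
The JSON printed above reports timing in 100-nanosecond ticks, which is awkward to read. A small sketch of turning such a payload into seconds follows. The `sample` string is a made-up payload, and the field names (`RecognitionStatus`, `DisplayText`, `Offset`, `Duration`) are assumed to match the simple output format, so compare them with your own output before relying on this.

```python
import json

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def summarize_result(raw_json: str) -> dict:
    """Extract status, text, and timing (in seconds) from a recognition result payload."""
    payload = json.loads(raw_json)
    offset = payload.get("Offset", 0)
    duration = payload.get("Duration", 0)
    return {
        "status": payload.get("RecognitionStatus"),
        "text": payload.get("DisplayText", ""),
        "start_s": offset / TICKS_PER_SECOND,
        "end_s": (offset + duration) / TICKS_PER_SECOND,
    }


# Hypothetical payload for illustration only
sample = (
    '{"RecognitionStatus": "Success", "Offset": 5000000, '
    '"Duration": 25000000, "DisplayText": "Hello there."}'
)
print(summarize_result(sample))
```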

### Understanding `speechsdk.SpeechRecognitionResult`

The SDK returns a `speechsdk.SpeechRecognitionResult`, whose `reason` field tells you whether speech was recognized, nothing matched, or the request was canceled. The same result type is used in the next section on continuous speech recognition.

In [None]:
def recognize_from_microphone():
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    print("Speak into your microphone.")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print(
            "No speech could be recognized: {}".format(
                speech_recognition_result.no_match_details
            )
        )
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")


# Stay silent when prompted to see the NoMatch branch instead of RecognizedSpeech
recognize_from_microphone()

## Continuous Speech recognition

We can use `start_continuous_recognition()` and `stop_continuous_recognition()` to recognize speech in the background. The SDK invokes _callbacks_ whenever new data is available.

Continuous speech recognition enables real-time transcription by processing speech as it's being spoken rather than waiting until the end. This is particularly useful for applications requiring live transcription or voice commands.

[Official Documentation: Continuous Recognition](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech?pivots=programming-language-python#continuous-recognition)

In [None]:
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config, audio_config=audio_config
)


# Callback function that is called each time a speech recognition event occurs
def process_callback(evt: speechsdk.SpeechRecognitionEventArgs):
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # Print final recognised text
        print("Recognised: ", evt.result.text)
    elif evt.result.reason == speechsdk.ResultReason.RecognizingSpeech:
        # Continuously print recognised text
        print("Recognising: ", evt.result.text, end="\r")
    else:
        print("Event: {}".format(evt))

In [None]:
# We are using the same callback function for each kind of event.
#   The most interesting events are `recognizing` and `recognized`.
#   `recognizing` fires when the recognizer has hypothesized a partial result (RecognizingSpeech)
#   `recognized` fires when the recognizer has produced a final result (RecognizedSpeech)
# Each signal is connected once; connecting twice would run the callback twice per event.
speech_recognizer.recognizing.connect(process_callback)
speech_recognizer.recognized.connect(process_callback)
speech_recognizer.session_started.connect(process_callback)
speech_recognizer.session_stopped.connect(process_callback)
speech_recognizer.canceled.connect(process_callback)

In [None]:
# Start continuous speech recognition
speech_recognizer.start_continuous_recognition()

In [None]:
speech_recognizer.stop_continuous_recognition()
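
The callback above only prints. When the final text is needed after recognition stops, the handler can collect it instead. The sketch below shows that idea with `types.SimpleNamespace` objects standing in for SDK events, so it runs without a microphone; `TranscriptCollector` and `fake_event` are hypothetical names introduced here.

```python
from types import SimpleNamespace
from typing import List


class TranscriptCollector:
    """Accumulates final phrases delivered to a `recognized`-style callback."""

    def __init__(self) -> None:
        self.phrases: List[str] = []

    def on_recognized(self, evt) -> None:
        # Only evt.result.text is read, the same attribute process_callback prints
        text = (evt.result.text or "").strip()
        if text:  # silence can produce empty results; skip them
            self.phrases.append(text)

    @property
    def transcript(self) -> str:
        return " ".join(self.phrases)


def fake_event(text: str) -> SimpleNamespace:
    """Stand-in for an SDK event object, used here for demonstration only."""
    return SimpleNamespace(result=SimpleNamespace(text=text))


collector = TranscriptCollector()
for text in ["Hello.", "", "How can I help?"]:
    collector.on_recognized(fake_event(text))
print(collector.transcript)
```

With a real recognizer, the collector would be wired up as `speech_recognizer.recognized.connect(collector.on_recognized)`.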

### Optional: Continuous Speech recognition on File

In [None]:
FILE_NAME = "../data/dummy-call-centre.wav"
audio_config = speechsdk.AudioConfig(filename=FILE_NAME)


def from_file():
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    print(f"Recognizing speech from file: {FILE_NAME}")

    done = False

    def stop_recognition(evt):
        print("CLOSING on {}".format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    speech_recognizer.recognizing.connect(process_callback)
    speech_recognizer.recognized.connect(process_callback)
    speech_recognizer.session_stopped.connect(stop_recognition)
    speech_recognizer.canceled.connect(stop_recognition)

    speech_recognizer.start_continuous_recognition()
    # Sleep between checks instead of spinning the CPU in a busy loop
    while not done:
        time.sleep(0.5)


from_file()
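
An alternative to polling a `done` flag is to block on a `threading.Event` that the stop callback sets. The sketch below illustrates the pattern with `FakeSignal`, a hypothetical stand-in that mimics only the `connect` method of the SDK's event signals; it is not the SDK class itself.

```python
import threading
from typing import Callable, List


class FakeSignal:
    """Stand-in for an SDK event signal: stores callbacks and fires them on demand."""

    def __init__(self) -> None:
        self._callbacks: List[Callable] = []

    def connect(self, callback: Callable) -> None:
        self._callbacks.append(callback)

    def fire(self, evt) -> None:
        for callback in self._callbacks:
            callback(evt)


def wait_for_stop(session_stopped: FakeSignal, timeout: float) -> bool:
    """Block until the signal fires or `timeout` seconds pass; True if it fired."""
    done = threading.Event()
    session_stopped.connect(lambda evt: done.set())
    return done.wait(timeout)


signal = FakeSignal()
# Simulate the session ending shortly after recognition starts
threading.Timer(0.1, signal.fire, args=("session stopped",)).start()
print("stopped:", wait_for_stop(signal, timeout=2.0))
```

The timeout keeps the notebook from hanging forever if the stop event never arrives.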

## Continuous Speech recognition with diarization

Diarization is the process of identifying and separating different speakers in an audio recording. This feature is particularly valuable for transcribing conversations, meetings, or call center interactions where multiple speakers are involved.

[Official Documentation: Speaker Recognition](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speaker-recognition-overview)

In [None]:
speech_config.set_property(
    property_id=speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults,
    value="true",
)


def process_transcription_callback(
    evt: speechsdk.transcription.ConversationTranscriptionEventArgs,
):
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # Print final recognised text
        if evt.result.speaker_id:
            print(f"Speaker {evt.result.speaker_id}: {evt.result.text}")
        else:
            print("Recognised: ", evt.result.text)
    elif evt.result.reason == speechsdk.ResultReason.RecognizingSpeech:
        # Continuously print recognised text
        if evt.result.speaker_id:
            print(f"Speaker {evt.result.speaker_id}: {evt.result.text}", end="\r")
        else:
            print("Recognising: ", evt.result.text, end="\r")
    else:
        print("Event: {}".format(evt))


def transcribe(file=None):
    # Use the given file when one is passed; otherwise fall back to the default microphone
    if file:
        audio_config = speechsdk.AudioConfig(filename=file)
        print(f"Recognizing speech from file: {file}")
    else:
        audio_config = speechsdk.AudioConfig(use_default_microphone=True)
        print("Recognizing speech from the default microphone.")
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
        speech_config=speech_config, audio_config=audio_config
    )

    done = False

    def stop_transcription(evt):
        print("CLOSING on {}".format(evt))
        conversation_transcriber.stop_transcribing_async()
        nonlocal done
        done = True

    conversation_transcriber.transcribing.connect(process_transcription_callback)
    conversation_transcriber.transcribed.connect(process_transcription_callback)
    conversation_transcriber.session_stopped.connect(stop_transcription)
    conversation_transcriber.canceled.connect(stop_transcription)

    conversation_transcriber.start_transcribing_async()

    # Keep looping until keyboard interrupt
    try:
        while not done:
            time.sleep(0.5)
    except KeyboardInterrupt:
        conversation_transcriber.stop_transcribing_async()

In [None]:
transcribe()

In [None]:
transcribe(file="../data/dummy-call-centre.wav")
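
Diarized output arrives one utterance at a time, so the same speaker often appears on several consecutive lines. A minimal sketch of merging those into turns is shown below. It works on plain `(speaker_id, text)` tuples that a callback could collect, and the speaker labels in `sample` are illustrative rather than taken from a real run.

```python
from typing import Iterable, List, Tuple

Utterance = Tuple[str, str]  # (speaker_id, text)


def merge_speaker_turns(utterances: Iterable[Utterance]) -> List[Utterance]:
    """Join consecutive utterances from the same speaker into one turn."""
    turns: List[Utterance] = []
    for speaker, text in utterances:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: extend it
            turns[-1] = (speaker, f"{turns[-1][1]} {text}")
        else:
            turns.append((speaker, text))
    return turns


# Hypothetical utterances for illustration
sample = [
    ("Guest-1", "Thank you for calling."),
    ("Guest-1", "May I have your account number?"),
    ("Guest-2", "Sure, one moment."),
    ("Guest-1", "Take your time."),
]
for speaker, text in merge_speaker_turns(sample):
    print(f"{speaker}: {text}")
```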

## Fast Transcription

Fast Transcription is an Azure AI Speech Service REST API designed for quick, efficient transcription of audio files. It provides a simplified workflow for batch processing without the overhead of continuous recognition, making it ideal for scenarios requiring rapid transcription of pre-recorded audio.

[Official Documentation: Fast Transcription](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create?tabs=locale-specified)

In [None]:
import requests

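# Build the endpoint from the configured region rather than hard-coding one
url = (
    f"https://{service_region}.api.cognitive.microsoft.com"
    "/speechtotext/transcriptions:transcribe?api-version=2024-11-15"
)
headers = {"Ocp-Apim-Subscription-Key": speech_key}

# Open the audio in a context manager so the handle is closed after the upload
with open("../data/dummy-call-centre.wav", "rb") as audio_file:
    files = {
        "audio": audio_file,
        "definition": (None, '{"locales":["en-US"]}'),
    }
    response = requests.post(url, headers=headers, files=files)

if response.status_code == 200:
    for phrase in response.json()["phrases"]:
        print(phrase["text"])
else:
    print(f"Request failed ({response.status_code}): {response.text}")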
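
The request above hand-writes the `definition` JSON and prints only the phrase text. The sketch below shows one way to build the definition with `json.dumps` and to add timestamps to each phrase. The `phrases` and `text` keys appear in the cell above; `offsetMilliseconds` and `durationMilliseconds` are assumed field names for API version 2024-11-15, and `sample_response` is fabricated for the demo.

```python
import json
from typing import List


def build_definition(locales: List[str]) -> str:
    """Serialize the `definition` form field instead of hand-writing the JSON string."""
    return json.dumps({"locales": locales})


def format_phrases(response_body: dict) -> List[str]:
    """Render each phrase as '[start-end s] text', tolerating missing timing fields."""
    lines = []
    for phrase in response_body.get("phrases", []):
        # Timing keys are assumptions; default to 0 when they are absent
        start = phrase.get("offsetMilliseconds", 0) / 1000
        end = start + phrase.get("durationMilliseconds", 0) / 1000
        lines.append(f"[{start:.2f}-{end:.2f}s] {phrase['text']}")
    return lines


# Made-up response body for illustration
sample_response = {
    "phrases": [
        {"offsetMilliseconds": 80, "durationMilliseconds": 1920, "text": "Hello."},
        {"text": "No timing on this one."},
    ]
}
print(build_definition(["en-US"]))
for line in format_phrases(sample_response):
    print(line)
```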

## Azure OpenAI Whisper

Whisper is a speech recognition model from OpenAI that can transcribe audio in multiple languages. Azure OpenAI hosts it as a deployable model, which the `openai` Python client calls through the audio transcriptions endpoint.

[Official Documentation: Azure OpenAI Whisper Model](https://learn.microsoft.com/en-us/azure/ai-services/openai/whisper-quickstart)

In [None]:
client = AzureOpenAI(
    api_key=azure_openai_api_key,
    api_version="2024-02-01",
    azure_endpoint=azure_openai_endpoint,
)

# Must match the custom name you chose when you deployed the model
deployment_id = "whisper"
audio_test_file = "../data/dummy-call-centre.wav"

# Close the file handle once the upload completes
with open(audio_test_file, "rb") as audio_file:
    result = client.audio.transcriptions.create(file=audio_file, model=deployment_id)

print(result.text)
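
Uploads to the transcription endpoint are size-limited, so a quick check before sending can save a failed request. The sketch below assumes a 25 MB limit, which is the figure documented for Whisper at the time of writing; treat `MAX_UPLOAD_BYTES` as an assumption and confirm it against the current service documentation.

```python
import os
import tempfile

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed limit; verify against current docs


def fits_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demonstrate with a throwaway file rather than a real recording
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
    demo_path = tmp.name

print(fits_upload_limit(demo_path))
os.remove(demo_path)
```

Longer recordings would need to be split or compressed before upload; that step is outside this sketch.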

## Azure OpenAI GPT-4o-transcribe Model

This section demonstrates how to use the GPT-4o-transcribe model for both file transcription and real-time audio streaming using Azure OpenAI.

GPT-4o-transcribe is a state-of-the-art transcription model that offers high accuracy across various languages and audio qualities. It provides advantages like improved handling of domain-specific terminology and streaming capabilities.

[Official Documentation: Realtime API for speech and audio](https://learn.microsoft.com/en-us/azure/ai-services/openai/realtime-audio-quickstart).


In [13]:
# Load necessary libraries for GPT-4o-transcribe
from openai import AzureOpenAI, OpenAI
import os
from dotenv import load_dotenv
import pyaudio
import wave
import numpy as np
import time
import threading
import requests
import json
import websocket
import base64
import queue

# Load environment variables if not already loaded
load_dotenv()

# Get GPT-4o-transcribe credentials from .env file
AZURE_OPENAI_GPT4O_API_KEY = os.getenv("AZURE_OPENAI_GPT4O_API_KEY")
AZURE_OPENAI_GPT4O_ENDPOINT = os.getenv("AZURE_OPENAI_GPT4O_ENDPOINT")
AZURE_OPENAI_GPT4O_DEPLOYMENT_ID = os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT_ID")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # Direct OpenAI API key

# Initialize Azure OpenAI client for GPT-4o
gpt4o_client = AzureOpenAI(
    api_key=AZURE_OPENAI_GPT4O_API_KEY,
    api_version="2025-03-01-preview",  # Make sure to use the correct API version
    azure_endpoint=f"https://{AZURE_OPENAI_GPT4O_ENDPOINT.split('/openai/deployments')[0]}"  # Base endpoint without the path
)

# Initialize direct OpenAI client
openai_client = OpenAI(api_key=OPENAI_API_KEY)
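
The client setup above prepends `https://` to the endpoint value, which works only when the `.env` entry is stored without a scheme. A more forgiving approach is to normalize the value first. The helper below is a sketch of that idea; `base_endpoint` is a hypothetical name and `my-resource` is a placeholder host.

```python
from urllib.parse import urlsplit


def base_endpoint(raw: str) -> str:
    """Reduce an endpoint value to 'https://<host>', whether or not the input has a scheme or path."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw  # urlsplit needs a scheme to locate the host
    parts = urlsplit(raw)
    # Always return https, even if the stored value used another scheme such as wss
    return f"https://{parts.netloc}"


for value in (
    "my-resource.openai.azure.com",
    "https://my-resource.openai.azure.com/",
    "my-resource.openai.azure.com/openai/deployments/demo",
):
    print(base_endpoint(value))
```

All three inputs reduce to the same base URL, which could then be passed as `azure_endpoint`.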

### 1. Transcribing Audio Files with GPT-4o-transcribe

First, let's demonstrate how to transcribe an existing audio file using GPT-4o-transcribe.

In [14]:
def transcribe_file_with_gpt4o(file_path, response_format="text"):
    """
    Transcribe an audio file using Azure OpenAI's GPT-4o-transcribe model
    
    Args:
        file_path (str): Path to the audio file
        response_format (str): Format of the response ('text' or 'json')
    
    Returns:
        The transcription result
    """
    try:
        with open(file_path, "rb") as audio_file:
            transcription = gpt4o_client.audio.transcriptions.create(
                model=AZURE_OPENAI_GPT4O_DEPLOYMENT_ID,
                file=audio_file,
                response_format=response_format
            )
            
            # Both response formats are returned unchanged
            return transcription
    except Exception as e:
        print(f"Error transcribing file: {e}")
        raise

# Test with an audio file
audio_test_file = "../data/realistic-call-centre.wav"

print("Transcribing audio file...")
transcription = transcribe_file_with_gpt4o(audio_test_file)


Transcribing audio file...


In [15]:
print("\nTranscription:")
print(transcription)


Transcription:
Thank you for calling Rocket Speed Internet my name is Giza may I please have your phone or account number? I'm sorry can you hear me okay now? I was asking you about your phone or account number. I'm really sorry for the inconvenience. I would probably feel the same way if I'm in your situation. But don't worry I promise you that we'll get your issue resolved. Let me get first your account number so we can check your account. Would that be okay? You got it. May I please verify the name on the account? Okay Mr. Robert can we call you back at the same number or do you have a better callback number? Sure I was just asking you if we can call you at the same number you gave me or if you have a better callback number. Okay based on our test results it shows here that you are not getting a DSL signal. That's why you can't get online or check your email. We can actually fix this problem over the phone but I will need to walk you through on some steps. Would that be okay sir? O

### Advanced Options for File Transcription

GPT-4o-transcribe supports additional options like prompting to improve the quality of transcription.

In [16]:
def transcribe_file_with_prompt(file_path, prompt="", response_format="text"):
    """
    Transcribe an audio file with a prompt to guide the transcription
    
    Args:
        file_path (str): Path to the audio file
        prompt (str): A prompt to guide the transcription
        response_format (str): Format of the response ('text' or 'json')
    
    Returns:
        The transcription result
    """
    try:
        with open(file_path, "rb") as audio_file:
            transcription = gpt4o_client.audio.transcriptions.create(
                model=AZURE_OPENAI_GPT4O_DEPLOYMENT_ID,
                file=audio_file,
                response_format=response_format,
                prompt=prompt
            )
            
            # Both response formats are returned unchanged
            return transcription
    except Exception as e:
        print(f"Error transcribing file: {e}")
        return None

# Test with a prompt for call center context
call_center_prompt = "The following is a call center conversation between a customer service representative and a customer discussing a banking issue."

print("Transcribing audio file with call center context prompt...")
transcription_with_prompt = transcribe_file_with_prompt(
    audio_test_file, 
    prompt=call_center_prompt
)


Transcribing audio file with call center context prompt...


In [17]:
print("\nTranscription with prompt:")
print(transcription_with_prompt)


Transcription with prompt:
Thank you for calling Rocket Speed Internet, my name is Friza, may I please have your phone or account number?

I'm sorry, can you hear me okay now? I was asking you about your phone or account number.

I'm really sorry for the inconvenience. I would probably feel the same way if I'm in your situation. But don't worry, I promise you that we'll get your issue resolved. Let me get first your account number so we can check your account, would that be okay?

You got it. May I please verify the name on the account?

Okay Mr. Robert, can we call you back at the same number or do you have a better callback number?

Sure, I was just asking you if we can call you at the same number you gave me or if you have a better callback number.

Okay, based on our test results, it shows here that you are not getting a DSL signal. That's why you can't get online or check your email. We can actually fix this problem over the phone, but I will need to walk you through on some step
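
To see what the prompt actually changed, the two transcripts can be compared word by word. The sketch below uses `difflib.SequenceMatcher` from the standard library; the two sample strings are the opening words of the outputs shown above, and `word_changes` is an illustrative helper name.

```python
import difflib
from typing import List, Tuple


def word_changes(before: str, after: str) -> List[Tuple[str, str]]:
    """Return (old, new) word spans that differ between two transcripts."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


plain = "Thank you for calling Rocket Speed Internet my name is Giza"
prompted = "Thank you for calling Rocket Speed Internet, my name is Friza,"
for old, new in word_changes(plain, prompted):
    print(f"{old!r} -> {new!r}")
```

Because the comparison splits on whitespace, punctuation differences show up as changed words too, which is useful here since the prompted run adds commas.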

### Streaming Transcription for Completed Audio Files

GPT-4o-transcribe supports streaming responses for completed audio files, which allows getting transcription results incrementally.

In [24]:
def stream_transcription_from_file(file_path):
    """
    Stream transcription results from a completed audio file
    
    Args:
        file_path (str): Path to the audio file
    """
    try:
        with open(file_path, "rb") as audio_file:
            stream = gpt4o_client.audio.transcriptions.create(
                model=AZURE_OPENAI_GPT4O_DEPLOYMENT_ID,
                file=audio_file,
                response_format="json",
                stream=True,
                include=["logprobs"],
            )
            
            full_transcript = ""
            print("Streaming transcription:")
            for event in stream:
                if event.type == "transcript.text.delta":
                    full_transcript += event.delta
                    full_transcript = full_transcript.replace("\n", "")
                    print("Recognizing: ", full_transcript, end="\r", flush=True)
                elif event.type == "transcript.text.done":
                    return event            
            return None
    except Exception as e:
        print(f"Error streaming transcription: {e}")



In [None]:
print("Streaming transcription from file...")
transcipt_response = stream_transcription_from_file(audio_test_file)

print("\n\nFinal Transcription:")
print(transcipt_response.text)

Streaming transcription from file...
Streaming transcription:
Recognizing: r calling Rocket Speed Internet, my name is Queza. May I please have your phone or account number?I'm sorry, can you hear me okay now? I was asking you about your phone or account number.I'm really sorry for the inconvenience, I would probably feel the same way if I'm in your situation. But don't worry, I promise you that we'll get your issue resolved. Let me get first your account number so we can check your account, would that be okay?You got it, may I please verify the name on the account?Okay Mr. Robert, can we call you back at the same number or do you have a better callback number?Sure, I was just asking you if we can call you at the same number you gave me or if you have a better callback number.Okay, based on our test results, it shows here that you are not getting a DSL signal, that's why you can't get online or check your email. We c
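
The streaming loop can be tried without an API call by feeding it stand-in events. The sketch below separates the accumulation logic into a generator and drives it with `types.SimpleNamespace` objects. The event type names and the `delta` and `text` attributes are the ones used in the cell above, while the events themselves are fabricated.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional, Tuple


def accumulate_deltas(events: Iterable) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (partial_text, final_text); final_text stays None until the done event."""
    partial = ""
    for event in events:
        if event.type == "transcript.text.delta":
            partial += event.delta
            yield partial, None
        elif event.type == "transcript.text.done":
            yield partial, event.text
            return


# Fabricated events standing in for the stream returned by the client
fake_stream = [
    SimpleNamespace(type="transcript.text.delta", delta="Thank"),
    SimpleNamespace(type="transcript.text.delta", delta=" you"),
    SimpleNamespace(type="transcript.text.done", text="Thank you"),
]
for partial, final in accumulate_deltas(fake_stream):
    print(final if final is not None else f"... {partial}")
```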

In [None]:
# Print each token and its log probability
# for logprob in transcipt_response.logprobs:
#     token = logprob.token
#     logprob = logprob.logprob
#     print(f"Token: {token}, Log Probability: {logprob}")
    
# Example output:
# Token:  on, Log Probability: -5.9153886e-06
# Token:  you, Log Probability: -0.0067254375
# Token: ., Log Probability: -0.0021892798
# Token:  Wow, Log Probability: -4.441817e-05
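
Log probabilities are easier to act on once converted to probabilities, for example to flag tokens the model was unsure about. The sketch below works on plain `(token, logprob)` tuples. The first four sample values come from the example output above, and the `" Que"` entry is a made-up low-confidence token added for the demo.

```python
import math
from typing import List, Tuple


def low_confidence_tokens(
    logprobs: List[Tuple[str, float]], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (token, probability) pairs whose probability falls below `threshold`."""
    flagged = []
    for token, logprob in logprobs:
        probability = math.exp(logprob)  # log probability -> probability in [0, 1]
        if probability < threshold:
            flagged.append((token, round(probability, 4)))
    return flagged


sample = [
    (" on", -5.9153886e-06),
    (" you", -0.0067254375),
    (".", -0.0021892798),
    (" Wow", -4.441817e-05),
    (" Que", -2.8),  # hypothetical value, not from the run above
]
print(low_confidence_tokens(sample))
```

Flagged tokens are good candidates for manual review, since names and rare words tend to score lowest.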

In [26]:
# Print the transcription result, color coded by log probability
def color_gradient(value, min_value, max_value):
    """
    Generate a color gradient where higher values are green and lower values are red
    
    Args:
        value (float): The value to color
        min_value (float): The minimum value for the gradient
        max_value (float): The maximum value for the gradient
    
    Returns:
        str: ANSI escape code for the color
    """
    ratio = (value - min_value) / (max_value - min_value)
    g = int(255 * ratio)  # Green increases with value
    r = int(255 * (1 - ratio))  # Red decreases with value
    return f"\033[38;2;{r};{g};0m"  # RGB color code

# Print the transcription result with a color gradient based on log probability
for entry in transcipt_response.logprobs:
    # Convert the log probability to a percentage before mapping it to a color
    prob = np.round(np.exp(entry.logprob) * 100, 2)
    color = color_gradient(prob, 0, 100)  # Color gradient from 0 to 100%
    print(f"{color}{entry.token}\033[0m", end="")

[38;2;0;254;0mThank[0m[38;2;0;255;0m you[0m[38;2;0;255;0m for[0m[38;2;0;255;0m calling[0m[38;2;2;252;0m Rocket[0m[38;2;24;230;0m Speed[0m[38;2;4;250;0m Internet[0m[38;2;160;94;0m,[0m[38;2;1;253;0m my[0m[38;2;0;255;0m name[0m[38;2;0;254;0m is[0m[38;2;239;15;0m Que[0m[38;2;97;157;0mza[0m[38;2;180;74;0m.[0m[38;2;0;254;0m May[0m[38;2;0;254;0m I[0m[38;2;0;254;0m please[0m[38;2;0;254;0m have[0m[38;2;0;255;0m your[0m[38;2;0;255;0m phone[0m[38;2;0;254;0m or[0m[38;2;0;254;0m account[0m[38;2;0;254;0m number[0m[38;2;139;115;0m?

[0m[38;2;0;254;0mI'm[0m[38;2;0;255;0m sorry[0m[38;2;16;238;0m,[0m[38;2;0;254;0m can[0m[38;2;0;255;0m you[0m[38;2;0;255;0m hear[0m[38;2;0;254;0m me[0m[38;2;5;249;0m okay[0m[38;2;0;254;0m now[0m[38;2;1;253;0m?[0m[38;2;0;254;0m I[0m[38;2;0;255;0m was[0m[38;2;0;254;0m asking[0m[38;2;37;217;0m you[0m[38;2;0;255;0m about[0m[38;2;0;254;0m your[0m[38;2;0;255;0m phone[0m[38;2;0;254;0m or[0m[38;2;0

## Using WebSockets with OpenAI Realtime API for Live Transcription

This section demonstrates how to use WebSockets for real-time audio transcription using OpenAI's Realtime API. This API allows for continuous audio streaming and transcription, which is useful for applications like voice assistants, live captioning, and more.

### How WebSocket Transcription Works

The WebSocket-based transcription service provides several advantages over traditional file-based transcription:

1. **Real-time results**: Transcription happens as you speak, without waiting for the complete audio
2. **Continuous streaming**: Audio is sent in small chunks through a persistent connection
3. **Turn detection**: Automatically detects speech segments using Voice Activity Detection (VAD)
4. **Configurable noise reduction**: Can be optimized for near-field or far-field speech
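
The continuous-streaming step can be sketched as splitting raw audio into small chunks and wrapping each one in a JSON message. The code below is an illustration under stated assumptions, not the implementation in `transcription_websocket_service.py`: it assumes 24 kHz mono PCM16 input (so 4800 bytes is about 100 ms) and the `input_audio_buffer.append` event name from the Realtime API documentation.

```python
import base64
import json
from typing import Iterator


def audio_append_messages(pcm: bytes, chunk_bytes: int = 4800) -> Iterator[str]:
    """Split raw PCM16 audio into JSON messages carrying base64-encoded chunks."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",  # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# 10,000 bytes of silence stand in for microphone input
silence = bytes(10_000)
messages = list(audio_append_messages(silence))
print(len(messages), "messages")
```

Each message would be sent over the open WebSocket connection in order; the final chunk is simply shorter than the rest.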

### Configuration Options

The `TranscriptionService` class in `transcription_websocket_service.py` supports these key parameters:

- `service_type`: Choose between `"azure"` or `"openai"` (direct) services
- `model`: Specify the model (`"gpt-4o-transcribe"` or `"gpt-4o-mini-transcribe"`)
- `noise_reduction`: Set to `"near_field"` or `"far_field"` for different environments
- `turn_threshold`: Sensitivity for voice activity detection (0.0-1.0)
- `include_logprobs`: Whether to include confidence scores for transcribed text
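
These parameters plausibly end up in a session configuration message sent once the socket opens. The sketch below validates the options listed above and serializes such a message. The message type and the field names inside `session` are assumptions based on OpenAI's Realtime transcription guide, and the real `TranscriptionService` class may structure this differently.

```python
import json

ALLOWED_MODELS = {"gpt-4o-transcribe", "gpt-4o-mini-transcribe"}
ALLOWED_NOISE_REDUCTION = {"near_field", "far_field"}


def build_session_update(
    model: str = "gpt-4o-transcribe",
    noise_reduction: str = "near_field",
    turn_threshold: float = 0.5,
    include_logprobs: bool = False,
) -> str:
    """Validate the options and serialize a session configuration message."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model: {model}")
    if noise_reduction not in ALLOWED_NOISE_REDUCTION:
        raise ValueError(f"unsupported noise_reduction: {noise_reduction}")
    if not 0.0 <= turn_threshold <= 1.0:
        raise ValueError("turn_threshold must be between 0.0 and 1.0")

    # Field names below are assumptions, not confirmed against the service file
    session = {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {"model": model},
        "input_audio_noise_reduction": {"type": noise_reduction},
        "turn_detection": {"type": "server_vad", "threshold": turn_threshold},
    }
    if include_logprobs:
        session["include"] = ["item.input_audio_transcription.logprobs"]
    return json.dumps({"type": "transcription_session.update", "session": session})


print(build_session_update(noise_reduction="far_field", turn_threshold=0.6))
```

Validating up front gives a clear error in the notebook instead of a rejected message from the service.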

### Required Environment Variables

For Azure OpenAI service (used in this example):
- `AZURE_OPENAI_GPT4O_ENDPOINT`: Your Azure OpenAI endpoint
- `AZURE_OPENAI_GPT4O_DEPLOYMENT_ID`: The deployment name for your GPT-4o-transcribe model
- `AZURE_OPENAI_GPT4O_API_KEY`: Your Azure OpenAI API key

For direct OpenAI service (alternative option):
- `OPENAI_API_KEY`: Your OpenAI API key

### Official Documentation

- [OpenAI Speech to Text Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- [OpenAI Realtime Transcription Guide](https://platform.openai.com/docs/guides/realtime-transcription)
- [Realtime Transcription API Reference](https://platform.openai.com/docs/guides/realtime-transcription#page-top)

Note: The WebSocket API is currently in preview as of May 2025. Refer to [`transcription_websocket_service.py`](./transcription_websocket_service.py) for detailed implementation.

In [1]:
import os
import nest_asyncio

from transcription_websocket_service import start_azure_transcription

# Allow nested event loops so asyncio code can run inside Jupyter's already-running loop
nest_asyncio.apply()


In [2]:
endpoint = os.environ.get("AZURE_OPENAI_GPT4O_ENDPOINT")
deployment = os.environ.get("AZURE_OPENAI_GPT4O_DEPLOYMENT_ID")
api_key = os.environ.get("AZURE_OPENAI_GPT4O_API_KEY")

In [3]:
transcript, probs = start_azure_transcription(
    endpoint=endpoint, 
    deployment=deployment, 
    api_key=api_key, 
    duration=60
)


🚀 Starting transcription for 60 seconds. Speak into your microphone.
🔄 Connecting to Azure OpenAI Realtime API...
🎙️ Recording started...
🔗 WebSocket connection established
✅ Sent session configuration
✅ Transcription session created
✅ Transcription session updated

🎤 Speech detected, listening...
🔇 Speech stopped
📤 Audio buffer committed
📝 New conversation item created

📝 Azure Completed Transcript: "Everything seems to be working fine now."

🎤 Speech detected, listening...
🔇 Speech stopped
📤 Audio buffer committed
📝 New conversation item created

📝 Azure Completed Transcript: "Great."
⛔ Interrupted by user
📤 Audio sending complete
🎙️ Recording stopped
📥 Message receiving complete
✅ WebSocket connection closed
✅ Transcription session ended
📋 Full Transcript:
-------------------
Everything seems to be working fine now.
Great.
