# Adding Speech with OpenAI’s GPT-4o Audio

In this lesson, we look at the new audio capabilities of GPT-4o through the "gpt-4o-audio-preview" model. We write code that records our voice, sends the recording to the model, and plays back the audio response. We also learn how to parse the streaming audio output and start playback as soon as the first audio chunks arrive. Finally, we integrate this with the AI tutor knowledge base, ending up with a script that listens to the user's query, instructs the LLM to retrieve information from the knowledge base to answer it, and plays back the final audio response.

## Libraries and Environment Variables

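The code has been tested with the following libraries installed (later cells also import `simpleaudio` and `llama_index.vector_stores.chroma`, which are not pinned in this list and may need to be installed separately):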

```
chromadb==0.5.3
huggingface-hub==0.26.2
llama-index==0.10.49
llama-index-embeddings-openai==0.1.11
numpy==1.26.4
openai==1.54.3
PyAudio==0.2.14
sounddevice==0.5.1
wavio==0.0.9
```

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR-API-KEY>"

## Load Knowledge Base and Create Retriever

In this section, we download our 500-blog dataset, which is distributed as a zipped vector store, and create a vector retriever on top of it.

In [None]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [None]:
# Download the 500-blog dataset (a zipped vector store) to use as the knowledge base
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir=".")
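
The download above fetches a zip archive, while the next cell opens a Chroma directory, so the archive has to be unpacked in between. Below is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_archive` is hypothetical, and the folder layout inside `vectorstore.zip` is not specified in this lesson, so it is worth inspecting the returned names before pointing Chroma at a path.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Unpack a zip archive and return the sorted top-level names it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        # Collect top-level entries so we can see which folder to open next
        top_level = {Path(name).parts[0] for name in archive.namelist() if name}
    return sorted(top_level)


# Hypothetical usage, assuming the download cell saved "vectorstore.zip" here:
# print(extract_archive("vectorstore.zip"))
```

The printed names should indicate which directory to pass as `path` when creating the `chromadb.PersistentClient`.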

In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

# Load the vector store from local storage
# NB: This path is machine-specific; point it at your own unzipped copy of the vector store
db = chromadb.PersistentClient(path="/Users/fabio/Desktop/temp/ai_tutor_knowledge")
chroma_collection = db.get_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Create the index based on the vector store
vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Create retriever
vector_retriever = vector_index.as_retriever(similarity_top_k=5)

In [None]:
# Test the retriever with a query
nodes = vector_retriever.retrieve("How does RAG work?")
for node in nodes:
    print(node.metadata["title"])
    print(node.metadata["url"])
    print("-" * 5)
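
To build intuition for what `similarity_top_k=5` asks for, here is a toy sketch of top-k retrieval by cosine similarity in plain Python. It is only an illustration: the vectors and document ids are made up, and the actual index structure and distance metric used by Chroma may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # Highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up 3-dimensional "embeddings" (real ones have many more dimensions)
toy_docs = {
    "rag-intro": [0.9, 0.1, 0.0],
    "fine-tuning": [0.1, 0.9, 0.0],
    "vector-dbs": [0.7, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], toy_docs, k=2))  # ['rag-intro', 'vector-dbs']
```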

## Recording Audio and Generating Audio Responses with GPT-4o

In this section, we see how to (1) record audio from your microphone, (2) send the audio to GPT-4o to generate an audio response, and (3) play the audio response and show its transcript.

In [None]:
import sounddevice as sd
import numpy as np
import wavio
import base64
from openai import OpenAI
import tempfile
import json
import simpleaudio as sa

In [None]:
def record_audio(key="q", sample_rate=44100, channels=1):
    """Record audio from the microphone until the user types `key` and presses Enter."""
    print(f"Recording... Type '{key}' and press Enter to stop.")
    audio_data = []

    # Define a callback function to capture audio data
    def callback(indata, frames, time, status):
        audio_data.append(indata.copy())

    # Open audio input stream and start recording
    with sd.InputStream(samplerate=sample_rate, channels=channels, callback=callback):
        while True:
            if input() == key:
                break
    print("Stopped recording.")

    # Combine the recorded chunks into a single numpy array
    audio_data = np.concatenate(audio_data, axis=0)

    # Save the audio to a temporary file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as audio_file:
        wavio.write(audio_file.name, audio_data, sample_rate, sampwidth=2)
        audio_file_path = audio_file.name

    return audio_file_path
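
When no microphone is available (for example on a remote machine), it can help to check the rest of the pipeline with a synthetic file in the same format that `record_audio` produces: mono, 16-bit WAV. The sketch below uses only the standard library. The helper name `write_test_tone` is hypothetical, and since a sine tone is not speech, it is useful for checking file handling rather than the quality of the model's answers.

```python
import math
import struct
import tempfile
import wave


def write_test_tone(duration_s=1.0, freq_hz=440.0, sample_rate=44100):
    """Write a mono 16-bit sine tone to a temporary WAV file and return its path."""
    n_samples = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_samples):
        sample = 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        # Scale the float in [-1, 1] to a signed 16-bit little-endian integer
        frames += struct.pack("<h", int(sample * 32767))

    # Reserve a temporary file name, then write the WAV data to it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        path = tmp.name
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return path
```

The returned path can be used wherever `audio_file_path` is expected.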

In [None]:
def send_audio_to_llm(audio_file_path, prompt):
    """Sends an audio file to the OpenAI API and returns the audio completion."""
    # Read the temp file and encode as base64
    with open(audio_file_path, "rb") as audio_file:
        encoded_audio = base64.b64encode(audio_file.read()).decode('utf-8')

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_audio,
                        "format": "wav"
                    }
                }
            ]
        },
    ]

    # Send to OpenAI API
    completion = openai_client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "pcm16"},
        messages=messages
    )

    return completion
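
The same message payload is built in several places in this lesson, so it could be factored into a small helper. The sketch below is one possible shape for it. The name `build_audio_message` is hypothetical and the text part is made optional; it only builds the dictionary and does not call the API.

```python
import base64


def build_audio_message(wav_bytes, prompt=None):
    """Build a user message carrying base64 WAV audio plus an optional text prompt."""
    content = []
    if prompt:
        content.append({"type": "text", "text": prompt})
    content.append({
        "type": "input_audio",
        "input_audio": {
            # The API expects the audio as a base64-encoded string
            "data": base64.b64encode(wav_bytes).decode("utf-8"),
            "format": "wav",
        },
    })
    return {"role": "user", "content": content}
```

With such a helper, the `messages` list would reduce to `[build_audio_message(wav_bytes, prompt)]`.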

In [None]:
def play_sound(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Plays a sound from PCM bytes using simpleaudio"""
    play_obj = sa.play_buffer(
        pcm_bytes,
        num_channels=channels,
        bytes_per_sample=sample_width,
        sample_rate=sample_rate
    )
    play_obj.wait_done()

In [None]:
# Record audio until the user presses 'q'
audio_file_path = record_audio()

# Initialize OpenAI API client
openai_client = OpenAI()

# Print transcription result
prompt = "Transcribe the attached recording. Write only the transcription and nothing else."
completion = send_audio_to_llm(audio_file_path, prompt)
print(completion.choices[0].message.audio.transcript)

# Play the audio response
pcm_bytes = base64.b64decode(completion.choices[0].message.audio.data)
play_sound(pcm_bytes)
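
The "pcm16" response is raw sample data without a header, so its duration and layout must be inferred from parameters we already know. The sketch below assumes the same defaults as `play_sound` (24 kHz, mono, 2 bytes per sample). The helper names are hypothetical, and saving to WAV is shown only as a convenient way to keep a response for later inspection.

```python
import wave


def pcm_duration_seconds(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM audio: total bytes divided by bytes per second."""
    return len(pcm_bytes) / (sample_rate * channels * sample_width)


def save_pcm_as_wav(pcm_bytes, path, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container so standard players can open them."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


# One second of silence at the assumed format is 24000 * 1 * 2 = 48000 bytes
print(pcm_duration_seconds(b"\x00" * 48000))  # 1.0
```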

## Using Streaming Outputs

In this section, we see how to leverage the streaming outputs of the OpenAI API to retrieve the audio response chunk by chunk. This lets us play the response audio with lower latency, because we play the first bytes as soon as we receive them instead of waiting for the whole audio output.

In [None]:
import pyaudio
import threading
import queue

In [None]:
def play_sound_from_queue(pcm_queue, sample_rate=24000, channels=1, sample_width=2):
    """
    Play PCM audio data from a queue that is being filled over time.

    Args:
        pcm_queue: A Queue object from which PCM data is read.
    """
    p = pyaudio.PyAudio()
    format = p.get_format_from_width(sample_width)

    # Open a blocking stream
    stream = p.open(format=format,
                    channels=channels,
                    rate=sample_rate,
                    output=True)

    # Read data from the queue and write to the stream
    while True:
        data = pcm_queue.get()
        if data is None:
            break  # No more data to play
        stream.write(data)

    # Clean up
    stream.stop_stream()
    stream.close()
    p.terminate()
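
The function above relies on a producer-consumer pattern: one thread fills a queue while another drains it, and a `None` sentinel marks the end of the data. The sketch below shows the same pattern without any audio hardware. The names `consume` and `run_demo` are hypothetical, and appending to a list stands in for `stream.write`.

```python
import queue
import threading
import time


def consume(chunk_queue, sink):
    """Drain the queue into `sink` until the None sentinel arrives."""
    while True:
        data = chunk_queue.get()  # Blocks until a chunk is available
        if data is None:
            break
        sink.append(data)  # Stand-in for stream.write(data)


def run_demo(chunks, delay_s=0.01):
    """Feed chunks to a consumer thread and return everything it received."""
    chunk_queue = queue.Queue()
    received = []
    consumer = threading.Thread(target=consume, args=(chunk_queue, received))
    consumer.start()
    for chunk in chunks:
        time.sleep(delay_s)  # Simulate network latency between chunks
        chunk_queue.put(chunk)
    chunk_queue.put(None)  # Signal end of data
    consumer.join()  # Wait for the consumer to finish
    return b"".join(received)


print(run_demo([b"ab", b"cd", b"ef"]))  # b'abcdef'
```

Because `queue.Queue` is first-in, first-out and there is a single consumer, the chunks are processed in the order in which they were produced.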

In [None]:
def play_sound_and_print_transcript(stream):
    """
    Starting from a stream of audio chunks (the response to the LLM call),
    plays the response audio and prints its transcript.
    """
    if stream is None:
        print("No stream to play.")
        return

    pcm_queue = queue.Queue()
    playback_thread = None  # Created lazily, when the first audio bytes arrive
    for chunk in stream:
        if not chunk.choices:
            continue  # Skip chunks that carry no choices
        chunk_audio = getattr(chunk.choices[0].delta, "audio", None)
        if not chunk_audio:
            continue  # Skip chunks without audio content
        if "transcript" in chunk_audio:
            print(chunk_audio["transcript"], end="") # Print the transcript
        elif "data" in chunk_audio:
            pcm_bytes = base64.b64decode(chunk_audio["data"])
            pcm_queue.put(pcm_bytes) # Add the audio data to the queue
            if playback_thread is None:
                # Start the playback thread
                playback_thread = threading.Thread(target=play_sound_from_queue, args=(pcm_queue,))
                playback_thread.start()
    pcm_queue.put(None) # Signal end of data
    if playback_thread is not None:
        playback_thread.join() # Wait for playback to finish

In [None]:
# Read the recorded audio and encode it as base64
with open(audio_file_path, "rb") as audio_file:
    encoded_audio = base64.b64encode(audio_file.read()).decode('utf-8')

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_audio,
                    "format": "wav"
                }
            }
        ]
    },
]

# Get streaming response from the LLM
stream = openai_client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=messages,
    stream=True,
)

# Play the audio response and print the transcript
play_sound_and_print_transcript(stream)

## Integrating Audio Inputs and Outputs with RAG

In this section, we see how to (1) define the tool that retrieves relevant information from our knowledge base, (2) send the user query to the LLM specifying the available tools, (3) manage the LLM response if it asks to use a tool, (4) get the final audio response via streaming from the LLM leveraging the tool response, and (5) play the audio response.

In [None]:
# This function will be used as a tool by the LLM to retrieve resources
def retrieve_resources(query: str) -> str:
    """Given a query, retrieve relevant resources and return them as a formatted string."""
    nodes = vector_retriever.retrieve(query)

    context_text = ""
    for i, node in enumerate(nodes):
        context_text += f"<resource-{i+1}>" + "\n"
        context_text += "<resource-title>" + node.node.metadata["title"] + "</resource-title>" + "\n\n"
        context_text += "<resource-text>" + "\n" + node.node.text + "\n" + "</resource-text>" + "\n"
        context_text += f"</resource-{i+1}>" + "\n\n"
    context_text = context_text.strip()

    return context_text
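
To see what the tool's output string looks like without querying the vector store, the formatting step can be isolated as below. This is a sketch with stand-in data: `format_resources` is a hypothetical helper that takes plain `(title, text)` pairs instead of retrieved nodes, and is intended to produce the same tag layout as `retrieve_resources`.

```python
def format_resources(resources):
    """Format (title, text) pairs with the same tags as retrieve_resources."""
    blocks = []
    for i, (title, text) in enumerate(resources, start=1):
        blocks.append(
            f"<resource-{i}>\n"
            f"<resource-title>{title}</resource-title>\n\n"
            f"<resource-text>\n{text}\n</resource-text>\n"
            f"</resource-{i}>"
        )
    # Resources are separated by a blank line, with no trailing whitespace
    return "\n\n".join(blocks)


# Made-up titles and texts, for illustration only
print(format_resources([
    ("What is RAG?", "RAG combines retrieval with generation."),
    ("Embeddings 101", "Embeddings map text to vectors."),
]))
```

Numbered tags of this kind give the model clear boundaries between resources.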

In [None]:
# Define the tools for the LLM
tools = [
    {
        "type": "function",
        "function": {
            "name": "retrieve_resources",
            "description": "Given a query, find resources that are relevant to the query and useful for answering it. It leverages an internal knowledge base.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "A query that will be used (via embeddings similarity search) to find relevant resources."
                    }
                },
                "required": ["query"],
                "additionalProperties": False
            },
            "response": {
                "type": "string",
                "description": "A textual representation of the resources found that are relevant to the query."
            }
        }
    }
]

In [None]:
system_prompt = """
You are a helpful assistant whose job is answering user queries about artificial intelligence topics.
Leverage the "retrieve_resources" tool to find resources based on the user's query.
You can use the tool at most once per user query.
Always leverage the retrieved resources to provide a helpful response.
If you can't find useful information, don't use your knowledge to make up an answer, just say that you can't find the information in your knowledge base.
Speak fast.
Be very concise. Answer with at most 50 words.
""".strip()

In [None]:
def send_audio_to_llm(audio_file_path, system_prompt):
    """Sends an audio file to the OpenAI API, with the retrieval tool available, and returns the completion."""
    # manage_tool_call continues the conversation from the module-level `messages`,
    # so we rebind that variable here instead of creating a local one
    global messages

    # Read the temp file and encode as base64
    with open(audio_file_path, "rb") as audio_file:
        encoded_audio = base64.b64encode(audio_file.read()).decode('utf-8')

    # Define the messages to send to the LLM
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_audio,
                        "format": "wav"
                    }
                }
            ]
        },
    ]

    # Send to OpenAI API
    completion = openai_client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "pcm16"},
        messages=messages,
        tools=tools,
    )

    return completion

In [None]:
completion = send_audio_to_llm(audio_file_path, system_prompt)

In [None]:
# Show the response (spoiler: it's a function call)
completion.choices[0].to_dict()

In [None]:
def manage_tool_call(completion):
    """
    If the LLM completion contains a tool call, retrieve the resources and continue the conversation.
    The returned conversation is in the form of a stream.
    """
    if completion.choices[0].finish_reason == "tool_calls":
        tool_call_id = completion.choices[0].message.tool_calls[0].id
        tool_name = completion.choices[0].message.tool_calls[0].function.name # not used
        tool_query = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)["query"]
        resources = retrieve_resources(tool_query)

        new_messages = messages + [
            completion.choices[0].message,
            {
                "role": "tool",
                "content": json.dumps({
                    "query": tool_query,
                    "resources": resources,
                }),
                "tool_call_id": tool_call_id
            },
        ]

        stream = openai_client.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "pcm16"},
            messages=new_messages,
            stream=True,
        )

        return stream
    return None
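
`manage_tool_call` handles only the first tool call and ignores the tool name. With more than one tool, or several calls in one response, a small dispatcher could be used instead. The sketch below is one possible shape, operating on plain dictionaries such as those produced by `to_dict()`. The names `run_tool_calls` and `registry` are hypothetical, and the content layout of the tool message is an illustrative choice.

```python
import json


def run_tool_calls(tool_calls, registry):
    """Execute each tool call and return the matching "tool" role messages."""
    tool_messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string
        arguments = json.loads(call["function"]["arguments"])
        if name not in registry:
            result = f"Unknown tool: {name}"
        else:
            result = registry[name](**arguments)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # Links the result to the request
            "content": json.dumps({"arguments": arguments, "result": result}),
        })
    return tool_messages


# Stand-in tool call and tool function, for illustration only
fake_calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "echo", "arguments": json.dumps({"query": "rag"})},
}]
print(run_tool_calls(fake_calls, {"echo": lambda query: query.upper()}))
```

In this lesson, the registry would contain a single entry mapping `"retrieve_resources"` to the function of the same name.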

In [None]:
# Run the tool call and play the audio response
stream = manage_tool_call(completion)
play_sound_and_print_transcript(stream)

## Putting It All Together

Last, we put everything together in a single script so that (1) the user records their question via audio, (2) the LLM generates a final audio response leveraging the retrieval tool, and (3) the audio response is played via streaming.

In [None]:
# 1. Record audio until the user presses 'q'
audio_file_path = record_audio()

# 2. Send audio to GPT-4o
completion = send_audio_to_llm(audio_file_path, system_prompt)

# 3. Manage tool call
# NB: We're assuming that the first LLM response is always a tool call!
stream = manage_tool_call(completion)

# 4. Play final response
play_sound_and_print_transcript(stream)
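
The script above assumes that the first response is always a tool call. If the model answers directly instead, the first completion should already carry the audio, which could be played with `play_sound`. The sketch below shows one way to detect that case. The helper name `direct_audio_bytes` is hypothetical, and `SimpleNamespace` objects stand in for real completions so that the logic can be tried without calling the API.

```python
import base64
from types import SimpleNamespace


def direct_audio_bytes(completion):
    """Return decoded PCM bytes if the model answered directly, else None."""
    choice = completion.choices[0]
    if choice.finish_reason == "tool_calls":
        return None  # The tool call must be handled first
    audio = getattr(choice.message, "audio", None)
    if audio is None:
        return None  # No audio in this response
    return base64.b64decode(audio.data)


# Stand-in for a completion in which the model answered without a tool call
fake_completion = SimpleNamespace(choices=[SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(audio=SimpleNamespace(data=base64.b64encode(b"pcm").decode("utf-8"))),
)])
print(direct_audio_bytes(fake_completion))  # b'pcm'
```

When the helper returns bytes, they can be passed to `play_sound`; when it returns `None`, the flow continues through `manage_tool_call` as before.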