# Unlocking Audio Capabilities with OpenAI's gpt-4o-audio-preview Model: A Practical Guide
    
This notebook walks you through using OpenAI’s `gpt-4o-audio-preview` model with LangChain.
We’ll go step by step, from environment setup through audio input and audio output to tool calling.

Whether you want to transcribe audio or generate spoken responses, this guide will get you up and running with practical examples.

---

## Prerequisites

Before we get started, ensure you have:
- An OpenAI account and API key.
- The `langchain-openai` package installed.
- (Optional) LangSmith for tracing your API calls.

Let's begin by setting up our environment!

### 1. Installing the Required Packages

We’ll need the `langchain-openai` package to interact with OpenAI models. Run the command below to install it.

In [None]:
# Install langchain-openai package
%pip install -qU langchain-openai

### 2. Setting Up Environment Variables

Now we’ll set up the environment variables to store your OpenAI API key. This keeps sensitive information out of your code.

You can manually set the environment variable in your terminal, or use a `.env` file in combination with `python-dotenv`. Here, we’ll show you how to set it within Python.

In [None]:
import getpass
import os

# Set your OpenAI API key as an environment variable
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

### 3. Instantiating the Model

Now that we have our environment set up, let’s instantiate the `gpt-4o-audio-preview` model using LangChain. We’ll also configure some basic parameters like temperature and token limits.

In [None]:
from langchain_openai import ChatOpenAI

# Instantiate the model
llm = ChatOpenAI(
    model="gpt-4o-audio-preview",  # Audio-capable chat model
    temperature=0,  # Minimize randomness for more repeatable output
    max_tokens=None,  # No explicit cap; the model's default limit applies
    timeout=None,  # No explicit request timeout
    max_retries=2,  # Retry a failed request up to two times
)

### 4. Uploading and Encoding Audio Files

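We’ll now read a local audio file and encode it as base64 so that it can be sent to the model inside the request. Nothing is uploaded at this point; the encoded string is simply embedded in the message we build in the next step. Replace `path/to/audio.wav` with the path to your own file.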

In [None]:
import base64

# Read the audio file as raw bytes
with open("path/to/audio.wav", "rb") as f:
    audio_data = f.read()

# Convert binary audio data to base64
audio_b64 = base64.b64encode(audio_data).decode()
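
If you encode several files, it can help to wrap the steps above in a small helper that also derives the format string from the file extension. The sketch below is illustrative: `encode_audio_file` and `SUPPORTED_FORMATS` are hypothetical names, and the set of accepted formats (`wav` and `mp3`) is an assumption you should check against the current API documentation.

```python
import base64
from pathlib import Path

# Assumption: the input formats accepted for `input_audio` content parts.
SUPPORTED_FORMATS = {"wav", "mp3"}


def encode_audio_file(path: str) -> tuple[str, str]:
    """Return (base64_string, format) for an audio file on disk."""
    file_path = Path(path)
    # Derive the format from the extension, e.g. "clip.WAV" -> "wav"
    fmt = file_path.suffix.lower().lstrip(".")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt!r}")
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return encoded, fmt
```

Calling `encode_audio_file("path/to/audio.wav")` would return the same base64 string as the cell above, together with `"wav"` as the format to pass along in the request.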

### 5. Transcribing Audio

Now that we’ve encoded the audio, we can pass it to the model and get a transcription. Let’s send the request and retrieve the transcribed text.

In [None]:
# Send audio for transcription
messages = [
    (
        "human",
        [
            {"type": "text", "text": "Transcribe the following:"},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    )
]

# Invoke the model and get the transcription
output_message = llm.invoke(messages)
print(output_message.content)  # The transcription will appear here
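
The message structure above is just nested Python data, so it can be built by a function when you want to reuse it with different prompts or clips. This is a minimal sketch; `build_audio_message` is a hypothetical helper, not part of LangChain, and the `"AAAA"` payload is placeholder data rather than real audio.

```python
def build_audio_message(prompt: str, audio_b64: str, audio_format: str = "wav") -> list:
    """Build a one-turn message list pairing a text prompt with an audio clip."""
    return [
        (
            "human",
            [
                {"type": "text", "text": prompt},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": audio_format},
                },
            ],
        )
    ]


# Placeholder payload for illustration; a real call would pass the encoded clip.
example_messages = build_audio_message("Summarize the following recording:", "AAAA")
```

The result can be passed to `llm.invoke(...)` exactly like the hand-written `messages` list.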

### 6. Generating Audio Responses

Let’s now configure the model to generate audio outputs, allowing it to respond with actual speech. We’ll specify the voice and format for the output.

In [None]:
# Configure the model to generate audio responses
llm = ChatOpenAI(
    model="gpt-4o-audio-preview",
    temperature=0,
    model_kwargs={
        "modalities": ["text", "audio"],  # Enable audio responses
        "audio": {"voice": "alloy", "format": "wav"},  # Set voice and output format
    }
)

# Generate a response with audio
messages = [("human", "Are you made by OpenAI? Just answer yes or no.")]
output_message = llm.invoke(messages)

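# Access the generated audio data (a base64-encoded string)
audio_response = output_message.additional_kwargs["audio"]["data"]
# The full string can be very long, so print only a short preview
print(f"Generated audio (base64, first 60 characters): {audio_response[:60]}...")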
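
The audio entry in `additional_kwargs` is a dictionary, so you can summarize it instead of printing the raw base64 string. The sketch below assumes the dictionary may also carry `id` and `transcript` keys alongside `data`; that is an assumption about the response shape, which is why every key is read with `.get`. The name `describe_audio_payload` is hypothetical.

```python
import base64


def describe_audio_payload(audio: dict) -> dict:
    """Summarize an audio payload without printing the full base64 string."""
    data = audio.get("data", "")
    return {
        "id": audio.get("id"),  # assumed key; None if absent
        "transcript": audio.get("transcript"),  # assumed key; None if absent
        "decoded_bytes": len(base64.b64decode(data)),
    }
```

For example, `describe_audio_payload(output_message.additional_kwargs["audio"])` would report how many bytes of audio came back and, if present, the text the model spoke.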

### 7. Saving and Playing Back Audio

After generating the audio response, you will usually want to save it so you can play it back. Here’s how to decode the base64-encoded audio data and write it to a `.wav` file, which you can then open in any audio player.

In [None]:
# Decode the base64 audio data
audio_bytes = base64.b64decode(audio_response)

# Save the audio as a .wav file
with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print("Audio saved as output.wav")
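
Before writing the file, you may want to confirm that the decoded bytes really are WAV data. The sketch below uses the standard library's `wave` module for that check and returns the clip's duration. It assumes the response is standard PCM WAV, and the duration is read from the file header, so it is only as reliable as that header. `save_wav_from_base64` is a hypothetical helper name.

```python
import base64
import io
import wave


def save_wav_from_base64(audio_b64: str, out_path: str) -> float:
    """Decode base64 WAV data, write it to out_path, and return its duration in seconds."""
    audio_bytes = base64.b64decode(audio_b64)
    # wave.open raises wave.Error if the bytes do not form a readable WAV file
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav_file:
        duration = wav_file.getnframes() / wav_file.getframerate()
    # Only write the file once the header has been parsed successfully
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return duration
```

With the variables from the earlier cells, `save_wav_from_base64(audio_response, "output.wav")` would write the same file as above and tell you roughly how long the spoken reply is.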

### 8. Tool Binding and Task Chaining

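In more advanced use cases, you can bind tools to the model so that it can request structured actions. For example, we can bind a weather tool: the model does not fetch the weather itself, but responds with a tool call containing the arguments your own code would use to do so.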

In [None]:
from pydantic import BaseModel, Field

# Define the schema for a weather tool; the model fills in its arguments
class GetWeather(BaseModel):
    """Get the current weather in a given location."""

    location: str = Field(..., description="The city and state, e.g. San Francisco, CA")

# Bind the tool to the model
llm_with_tools = llm.bind_tools([GetWeather])

# Ask a question that should trigger the tool. The model returns a tool call
# request (see ai_msg.tool_calls), not actual weather data.
ai_msg = llm_with_tools.invoke("What's the weather like in San Francisco, CA?")
print(ai_msg)
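
Binding a tool only lets the model ask for it; running the tool is up to your code. The sketch below shows one way to dispatch the entries in `ai_msg.tool_calls`, each of which is a dictionary with `name`, `args`, and `id` keys. The `get_weather` function, the `TOOL_REGISTRY` mapping, and the sample call are all hypothetical stand-ins, and the weather string is canned rather than fetched.

```python
def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather lookup; returns canned text."""
    return f"It is sunny in {location}."


# Map tool names (as the model reports them) to local Python callables.
TOOL_REGISTRY = {"GetWeather": get_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested tool and return (call_id, result) pairs."""
    results = []
    for call in tool_calls:
        handler = TOOL_REGISTRY.get(call["name"])
        if handler is None:
            raise KeyError(f"No handler registered for tool {call['name']!r}")
        results.append((call["id"], handler(**call["args"])))
    return results


# Sample data shaped like the entries in `ai_msg.tool_calls`.
sample_calls = [
    {"name": "GetWeather", "args": {"location": "San Francisco, CA"}, "id": "call_1"}
]
print(run_tool_calls(sample_calls))
```

In the notebook you would call `run_tool_calls(ai_msg.tool_calls)` and send each result back to the model as a tool message tied to the matching call ID.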