# Voice Agent for Conversational AI with Pipecat
In this notebook, we walk through how to build and deploy a voice AI agent using Pipecat AI. We illustrate the basic Pipecat flow with the `meta/llama-3.3-70b-instruct`* LLM and Riva for STT (Speech-To-Text) and TTS (Text-To-Speech). However, Pipecat is not opinionated, and other models and STT/TTS services can easily be used. See the [Pipecat documentation](https://docs.pipecat.ai/server/services/supported-services#supported-services) for other supported services.

Pipecat AI is an open-source framework for building voice and multimodal conversational agents. It simplifies the complex voice-to-voice AI pipeline and lets developers build AI capabilities with open-source, commercial, and custom models. See [Pipecat's Core Concepts](https://docs.pipecat.ai/getting-started/core-concepts) for a deep dive into how it works.

The framework was developed by Daily, a company that has provided real-time video and audio communication infrastructure since 2016. Pipecat is fully vendor neutral and is not tightly coupled to Daily's infrastructure, although this demo does use Daily for its WebRTC transport. Sign up for a Daily-agent API key [here](https://agents.daily.co/sign-up).

> [Development Note]: *We are using "meta/llama-3.3-70b-instruct" for now because it supports tool calling. The model can be updated or changed at any time with a one-line change in the notebook.

Below is the architecture diagram:

![Architecture Diagram](./arch.png)

A three-phase approach is used to build the conversational AI agent with Pipecat and NVIDIA NIM:

Phase 1: User Input
- Audio processing with NVIDIA Riva ASR with NIM

Phase 2: Context Aggregator with Pipecat and NVIDIA NIM
- Custom processing with Pipecat
- NVIDIA Riva TTS with NIM

Phase 3: Run the Agent


# Content Overview 

- [Prerequisites](#prerequisites)
- [Initialize the User Input](#initialize-the-user-input)
- [Initialize the Context Aggregator](#initialize-the-context-aggregator)
- [Run the Agent](#run-the-agent)

## Prerequisites

### NGC API Key
Before getting started, you will need to create an API key for the NVIDIA API Catalog.

- NVIDIA API Catalog
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
  2. Select any model, such as llama-3.3-70b-instruct.
  3. On the right panel above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

#### Export API Keys

Save the API key as an environment variable. The cell below prompts for your NVIDIA API key if `NVIDIA_API_KEY` is not already set to a value starting with `nvapi-`.

In [None]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
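The `getpass` prompt above needs an interactive session, and Python strips `assert` statements when run with `-O`. For non-interactive runs, a helper that always fails fast with a clear message can be handier. This is a minimal sketch: `require_api_key` is a hypothetical name, and the `nvapi-` prefix check simply mirrors the cell above.

```python
import os


def require_api_key(name: str = "NVIDIA_API_KEY", prefix: str = "nvapi-") -> str:
    """Return the key stored in the environment variable `name`, or raise RuntimeError.

    Hypothetical helper: unlike `assert`, this check still runs under `python -O`.
    """
    key = os.environ.get(name, "")
    if not key.startswith(prefix):
        # Show only the first few characters so the error never leaks the full key.
        shown = (key[:5] + "...") if key else "<unset>"
        raise RuntimeError(f"{name}={shown} is missing or does not start with {prefix!r}")
    return key
```

Calling `require_api_key()` at the top of a script gives the same guarantee as the interactive cell without prompting.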

### Install dependencies

First, we set up our environment.

We use Daily for transport, OpenAI for context aggregation, Riva for STT and TTS, and Silero for VAD (Voice Activity Detection). If you use different services, for example Cartesia for TTS, install the matching extra, e.g. `pip install "pipecat-ai[cartesia]"`.

> [Development note]: We're installing from the GitHub main branch here to ensure we have the latest improvements. By the time we address feedback, there will be a new Pipecat release, and we will install only the Pipecat extras we use.

In [None]:
!pip install "git+https://github.com/pipecat-ai/pipecat.git@main"
!pip install "pipecat-ai[daily,openai,riva,silero]"
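The two install lines above first install Pipecat from `main` and then its extras. A single PEP 508 requirement can express both at once, keeping the extras and the Git ref together. The hypothetical helper below builds such a string for any set of extras (for example, adding `cartesia` to switch TTS providers); the extras names must match those Pipecat actually publishes.

```python
def pipecat_requirement(extras, ref=None):
    """Build a pip requirement string for pipecat-ai with the given extras.

    Hypothetical helper. If `ref` is given, the requirement points at that
    branch or tag of the Pipecat GitHub repository instead of PyPI.
    """
    spec = "pipecat-ai[" + ",".join(sorted(set(extras))) + "]"
    if ref:
        # PEP 508 direct reference: "name[extras] @ url"
        spec += f" @ git+https://github.com/pipecat-ai/pipecat.git@{ref}"
    return spec
```

In a notebook, `spec = pipecat_requirement(["daily", "openai", "riva", "silero"], ref="main")` followed by `!pip install "{spec}"` would install it, since IPython expands `{...}` in shell commands.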

## Initialize the User Input

Configure the Daily transport for WebRTC communication:
- `DAILY_SAMPLE_ROOM_URL`: the Daily room the agent joins (open this URL in your browser to talk to the agent)
- `None`: no meeting token is needed for this room
- `"NVIDIA NIM Agent"`: the name the agent uses in the call

In [None]:
# URL of the Daily room where you talk to the NVIDIA NIM agent.
# Replace this with your own room URL after obtaining a Daily-agent API key.
# NOTE: if you change this, update any links elsewhere that point to the sample room.

DAILY_SAMPLE_ROOM_URL = "https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/prod-test"

In [None]:
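from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    DAILY_SAMPLE_ROOM_URL,  # Daily room to join
    None,                   # meeting token (not needed for this room)
    "NVIDIA NIM Agent",     # name the agent uses in the call
    DailyParams(
        audio_out_enabled=True,            # send the agent's TTS audio into the call
        vad_enabled=True,                  # detect when the user starts/stops speaking
        vad_analyzer=SileroVADAnalyzer(),  # Silero model used for VAD
        vad_audio_passthrough=True,        # also forward input audio downstream (to STT)
    ),
)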
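With `vad_enabled=True`, the transport uses a voice activity detector to decide when the user starts and stops talking. Silero is a neural model; the sketch below swaps in a much cruder RMS-energy detector purely to illustrate the idea of turning audio frames into speech start/stop events. `energy_vad`, its threshold, and its frame counts are all hypothetical and are not how Silero or Pipecat work internally.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def energy_vad(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,
    start_frames: int = 2,
    stop_frames: int = 3,
) -> List[Tuple[str, int]]:
    """Toy VAD over frames of 16-bit PCM samples (NOT Silero; RMS energy only).

    Emits ("speech_start", i) once `start_frames` consecutive loud frames are seen
    and ("speech_stop", i) once `stop_frames` consecutive quiet frames are seen,
    where i is the index of the frame at which the decision is made.
    """
    events: List[Tuple[str, int]] = []
    speaking = False
    run = 0  # consecutive frames that disagree with the current state
    for i, frame in enumerate(frames):
        rms = math.sqrt(sum(s * s for s in frame) / max(len(frame), 1))
        loud = rms >= threshold
        if loud != speaking:
            run += 1
            needed = stop_frames if speaking else start_frames
            if run >= needed:
                speaking = loud
                run = 0
                events.append(("speech_start" if speaking else "speech_stop", i))
        else:
            run = 0
    return events
```

Requiring several consecutive frames (hysteresis) keeps a single click or a short pause from toggling the state; the trade-off is a few frames of extra latency before a turn is considered over.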

#### Initialize the LLM and Riva services with NVIDIA NIM

You can use a different LLM by changing the `model` argument; the Riva STT (Parakeet) and TTS (FastPitch) services are configured independently of the LLM.

In [None]:
import os
from pipecat.services.nim import NimLLMService
from pipecat.services.riva import FastPitchTTSService, ParakeetSTTService

stt = ParakeetSTTService(api_key=os.getenv("NVIDIA_API_KEY"))

llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.3-70b-instruct"
)

tts = FastPitchTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
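`NimLLMService` is configured with just an API key and a model name. Assuming it talks to the OpenAI-compatible chat completions endpoint of the NVIDIA API Catalog (base URL `https://integrate.api.nvidia.com/v1`), the request it sends has roughly the shape below. This sketch only builds the request with the standard library and never sends it; `build_chat_request` is a hypothetical helper, not part of Pipecat.

```python
import json
import os
import urllib.request

# Assumption: OpenAI-compatible base URL of the NVIDIA API Catalog.
NIM_BASE_URL = "https://integrate.api.nvidia.com/v1"


def build_chat_request(messages, model="meta/llama-3.3-70b-instruct", api_key=None):
    """Build (but do not send) an OpenAI-style chat completions request for NIM."""
    api_key = api_key or os.environ.get("NVIDIA_API_KEY", "")
    body = {
        "model": model,
        "messages": messages,
        # Streaming lets a voice pipeline start speaking before the full reply arrives.
        "stream": True,
    }
    return urllib.request.Request(
        f"{NIM_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(req)` with a valid key; with `"stream": True`, the reply arrives as a sequence of server-sent event chunks rather than a single JSON body.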

Next, define the LLM system prompt. You can always edit the prompt as desired.

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way and make a weather pun if it is possible.",
    },
]
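The system prompt asks the model to avoid special characters because its output is converted to audio. As an extra safeguard, text could also be cleaned before it reaches TTS. The sketch below is a hypothetical `clean_for_tts` helper, not part of Pipecat, that strips common markdown symbols and collapses whitespace; in a real pipeline, one option would be to wrap it in a custom Pipecat `FrameProcessor` placed between `llm` and `tts` (not shown here).

```python
import re

# Markdown-style symbols a TTS engine might read aloud or stumble over (crude, hypothetical list;
# note it also removes underscores inside words).
_MARKUP = re.compile(r"[*_#`~|<>\[\]{}]")
_WHITESPACE = re.compile(r"\s+")


def clean_for_tts(text: str) -> str:
    """Strip markdown-style symbols and collapse runs of whitespace into single spaces."""
    text = _MARKUP.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()
```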

#### Define a tool-calling function for weather queries

Here we use the classic weather example, a `get_current_weather` tool. We describe the tool with OpenAI's `ChatCompletionToolParam` and register a handler for it with the LLM service.

In [None]:
from openai.types.chat import ChatCompletionToolParam
from pipecat.frames.frames import TextFrame


# Called when the tool call starts: push a short filler phrase downstream to TTS
# so the user hears something while the weather "API" is queried.
async def start_fetch_weather(function_name, llm, context):
    await llm.push_frame(TextFrame("Let me check on that."))
    print(f"Starting fetch_weather_from_api with function_name: {function_name}")


# Tool handler: returns hard-coded demo data through result_callback
# (no real weather API is called here).
async def fetch_weather_from_api(function_name, tool_call_id, args, llm, context, result_callback):
    await result_callback({"conditions": "nice", "temperature": "75"})


# JSON Schema description of the tool that the LLM may choose to call.
tools = [
    ChatCompletionToolParam(
        type="function",
        function={
            "name": "get_current_weather",
            "description": "Returns the current weather at a location, if one is specified, and defaults to the user's location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to find the weather of, or if not provided, it's the default location.",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Whether to use SI or USCS units (celsius or fahrenheit).",
                    },
                },
                "required": ["location", "format"],
            },
        },
    )
]

# Registering under None makes fetch_weather_from_api handle every tool call;
# pass "get_current_weather" instead to bind it to that tool only.
llm.register_function(None, fetch_weather_from_api, start_callback=start_fetch_weather)
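When the LLM decides to call a tool, its response contains a tool call with an id, a function name, and the arguments encoded as a JSON string (the OpenAI chat completions format). Pipecat parses this and invokes the registered handler. The sketch below models that dispatch step with a hypothetical `dispatch_tool_call` function and the same handler signature as `fetch_weather_from_api`; it is a simplified illustration, not Pipecat's implementation. The `None` fallback mimics registering a handler under `None`, as done in the cell above.

```python
import asyncio
import json


async def fetch_weather(function_name, tool_call_id, args, llm, context, result_callback):
    # Same signature as fetch_weather_from_api, but echoes the requested location.
    await result_callback({"location": args.get("location"), "conditions": "nice", "temperature": "75"})


async def dispatch_tool_call(handlers, tool_call):
    """Toy dispatcher: find a handler by name (None = catch-all) and return its result."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    handler = handlers.get(name) or handlers.get(None)
    if handler is None:
        raise KeyError(f"no handler registered for {name!r}")

    results = []

    async def result_callback(result):
        results.append(result)

    # llm and context are not needed by this toy handler, so pass None for both.
    await handler(name, tool_call["id"], args, None, None, result_callback)
    return results[0]
```

In a notebook, `await dispatch_tool_call(...)` can be called directly; in a plain script, wrap it with `asyncio.run(...)`.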

## Initialize the Context Aggregator

In [None]:
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(messages, tools)
context_aggregator = llm.create_context_aggregator(context)
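The context holds the running conversation, and `create_context_aggregator` returns the `user()` and `assistant()` aggregators that the pipeline uses to keep it up to date. The simplified model below shows the idea with a hypothetical `ToyAggregator` class: the user side collects transcribed fragments and appends one `user` message at the end of a turn, and the assistant side does the same for the LLM's streamed reply. Pipecat's real aggregators are frame processors with more logic (for example, around interruptions); this is only a sketch.

```python
from typing import Dict, List


class ToyAggregator:
    """Hypothetical stand-in for a context aggregator: buffers text, then appends one turn."""

    def __init__(self, messages: List[Dict[str, str]], role: str, sep: str):
        self.messages = messages  # shared history, analogous to the list passed to OpenAILLMContext
        self.role = role
        self.sep = sep  # " " joins STT fragments; "" joins streamed LLM tokens
        self._buffer: List[str] = []

    def add(self, text: str) -> None:
        self._buffer.append(text)

    def flush(self) -> None:
        """Call at a turn boundary (e.g. user stopped speaking, LLM response ended)."""
        if self._buffer:
            self.messages.append({"role": self.role, "content": self.sep.join(self._buffer)})
            self._buffer.clear()
```

Because both aggregators share one message list, each new LLM request automatically sees the full history, including the system prompt.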

Create a pipeline that transcribes speech to text with Riva, sends the text to the LLM on NVIDIA NIM, and turns the LLM's response text back into speech with Riva.

In [None]:
from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline(
    [
        transport.input(),              # Transport user input
        stt,                            # STT
        context_aggregator.user(),      # User responses
        llm,                            # LLM
        tts,                            # TTS
        transport.output(),             # Transport agent output
        context_aggregator.assistant(), # Assistant spoken responses
    ]
)
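Pipecat's Core Concepts describe a pipeline as a chain of processors that frames flow through. The toy model below mirrors the STT, LLM, TTS order of the pipeline above: each stage transforms the frames it understands and passes everything else through unchanged. Frames are plain dicts, and `run_pipeline` plus the `fake_*` coroutines are hypothetical; for simplicity it runs one stage at a time, whereas Pipecat streams frames through processors as they are produced.

```python
import asyncio
from typing import Awaitable, Callable, List

# A "processor" here takes one frame and returns zero or more frames for the next stage.
Processor = Callable[[dict], Awaitable[List[dict]]]


async def fake_stt(frame: dict) -> List[dict]:
    if frame["type"] == "audio":
        return [{"type": "text", "role": "user", "text": "what's the weather"}]
    return [frame]  # pass through anything this stage does not handle


async def fake_llm(frame: dict) -> List[dict]:
    if frame["type"] == "text" and frame.get("role") == "user":
        return [{"type": "text", "role": "assistant", "text": "Sunny, 75 degrees."}]
    return [frame]


async def fake_tts(frame: dict) -> List[dict]:
    if frame["type"] == "text" and frame.get("role") == "assistant":
        return [{"type": "audio_out", "bytes": frame["text"].encode()}]
    return [frame]


async def run_pipeline(processors: List[Processor], frames: List[dict]) -> List[dict]:
    """Run all frames through each stage in turn (a batch simplification of streaming)."""
    for proc in processors:
        next_frames: List[dict] = []
        for frame in frames:
            next_frames.extend(await proc(frame))
        frames = next_frames
    return frames
```

The pass-through rule matters: control frames (such as an end-of-session signal) must travel the whole chain even though most stages ignore them.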

Create a `PipelineTask` with interruptions enabled, so the user can interrupt the agent while it is speaking.

In [None]:
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
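With `allow_interruptions=True`, the agent stops talking when the user starts speaking over it. Pipecat implements this with frames flowing through the pipeline; the sketch below is only an analogy using `asyncio` task cancellation. `speak` and `reply_with_barge_in` are hypothetical, and a timer stands in for the VAD event that signals the user started speaking.

```python
import asyncio
from typing import List


async def speak(text: str, spoken: List[str], word_delay: float = 0.01) -> None:
    """Pretend TTS playback: 'say' one word per tick."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(word_delay)


async def reply_with_barge_in(reply: str, user_speaks_after: float) -> List[str]:
    """Start speaking `reply`; if the user starts talking first, cancel the rest."""
    spoken: List[str] = []
    playback = asyncio.create_task(speak(reply, spoken))
    # Stand-in for a VAD "user started speaking" event arriving mid-reply.
    await asyncio.sleep(user_speaks_after)
    if not playback.done():
        playback.cancel()  # drop whatever the bot has not said yet
        try:
            await playback
        except asyncio.CancelledError:
            pass
    return spoken
```

The key point carried over from the real system is that an interruption discards the unspoken remainder of the reply rather than queuing it for later.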

Create a pipeline runner to manage the processing pipeline.

In [None]:
from pipecat.pipeline.runner import PipelineRunner

runner = PipelineRunner()

#### Set transport event handlers
There are two handlers here:

- The `on_first_participant_joined` handler tells the agent to start the conversation when you join the call.
- The `on_participant_left` handler queues an `EndFrame`, which signals the pipeline to terminate.

In [None]:
from pipecat.frames.frames import LLMMessagesFrame, EndFrame

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    # Kick off the conversation.
    messages.append({"role": "system", "content": "Please introduce yourself to the user and deliver a weather fact."})
    await task.queue_frames([LLMMessagesFrame(messages)])

@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    print(f"Participant left: {participant}")
    await task.queue_frame(EndFrame())   
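The `@transport.event_handler("...")` decorator registers a coroutine that the transport calls when the named event occurs, passing the transport as the first argument. The hypothetical `ToyEventEmitter` below sketches that registration pattern in plain Python; it is not Pipecat's implementation, and the events are triggered manually instead of by a real call.

```python
import asyncio
from collections import defaultdict


class ToyEventEmitter:
    """Minimal decorator-based event registry, similar in shape to transport.event_handler."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def event_handler(self, name):
        def decorator(fn):
            self._handlers[name].append(fn)
            return fn  # return the function unchanged so it can still be called directly
        return decorator

    async def emit(self, name, *args):
        # Handlers receive the emitter first, like Pipecat handlers receive the transport.
        for handler in self._handlers[name]:
            await handler(self, *args)
```

Emitting an event with no registered handlers is simply a no-op.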

## Run the Agent!

`NOTE:` The first time you run the agent, it loads the weights for a voice activity detection model into the local Python process, which takes 10-15 seconds. To talk to the agent, open `DAILY_SAMPLE_ROOM_URL` in your browser. A permissions dialog will ask you to allow the browser to access your camera and microphone; click yes to start talking to the agent. If you have any trouble with this, see [here](https://help.daily.co/en/articles/2525908-allow-camera-and-mic-access).

In [None]:
await runner.run(task)