# Speech-to-Speech Voice Agent with NVIDIA Pipecat

Welcome to Module 2! In this notebook, you’ll learn how to build a basic **voice-enabled AI agent** using the NVIDIA Pipecat framework. 

By the end of this module, you will have a working conversational agent that:
- Listens to user speech
- Transcribes it into text
- Generates an intelligent response
- Speaks the response back to the user

This notebook focuses on the **fundamental building blocks** for digital humans and intelligent avatars, and introduces the nvidia-pipecat framework.

## Introduction
**ACE Controller** is a framework for building advanced conversational agents, built on top of NVIDIA Pipecat. It provides a modular pipeline for connecting speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) modules, and is designed for real-time, interactive applications.

**Goal:** Deploy a basic voice agent that acts as a friendly museum guide, using a FastAPI server and websocket-based communication.

## Prerequisites
Prior to getting started, you will need to create an API Key for the NVIDIA API Catalog for the voice agent.

### Obtain API Keys
#### NGC API Key
- NVIDIA API Catalog
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/meta/llama-3_3-70b-instruct)**.
  2. This will take you to the `llama-3.3-70b-instruct` model.
  3. On the right above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

### Export API Keys
Save the API key as an environment variable (`NVIDIA_API_KEY`) in the `.env` file in this directory.

The cell below checks whether the NVIDIA API key is set as an environment variable. If it is not, the cell prompts you to enter the key.

In [1]:
import os
import getpass
from dotenv import load_dotenv

# Load environment variables from the .env file, if present
load_dotenv()

# Prompt for the key if it is missing or does not look like an NVIDIA API key
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

# Configure ACE Transport for WebSocket Communication

In this section, we configure how our Pipecat agent **communicates with a client**.

### What's Happening?

- **WebSocket** is a network protocol that, unlike a plain HTTP request/response exchange, keeps a persistent two-way connection open between the client and server.
- Our Pipecat agent uses this WebSocket to **receive user audio** and **send back AI responses** (like text and synthesized speech) with low latency.
- **ACETransport** is the Pipecat transport class responsible for **managing these audio/text events** over the WebSocket.

🔹 ACETransport also supports **RTSP input** for streaming audio if needed — a feature useful when scaling up to real-time video avatars or remote microphone inputs.

---

## Understanding Pipecat Transports

In Pipecat, a **Transport** defines how frames (chunks of data like audio, text, or images) move between the external world and the internal AI pipeline.

Different transport types support different connection methods:
- WebSocket (for browser or app clients)
- RTSP streams (for camera/mic feeds)
- Custom transports (for specialized devices)

In this notebook, we use **ACETransport** because it easily integrates with other ACE microservices needed for communicating with a digital human.

---

### ✏️ Try It Yourself:

- Change the `vad_audio_passthrough` flag to `False` in `ACETransportParams` and observe how audio streaming changes.
- Think about: How would you modify the transport if your client was sending **video frames** instead of just audio?

In [2]:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from nvidia_pipecat.transports.network.ace_fastapi_websocket import ACETransport, ACETransportParams

# Transport setup
def create_transport(pipeline_metadata):
    
    return ACETransport(
        # Connect the websocket provided by the pipeline
        websocket=pipeline_metadata.websocket,  # Active connection between the client and the server
        
        # Set transport parameters
        params=ACETransportParams(
            vad_enabled=True,  # Enable Voice Activity Detection (VAD)
            vad_analyzer=SileroVADAnalyzer(),   # Use Silero model to detect when the user is speaking
            vad_audio_passthrough=True,  # Pass through audio even when VAD is active (does not cut off)
        ),
    )
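
To build intuition for what Voice Activity Detection does, the sketch below flags audio chunks as speech or silence using a simple energy threshold. This is a deliberately simplified substitute for illustration only: `SileroVADAnalyzer` uses a neural model, not an energy threshold, and the function names and the threshold value of 500 are assumptions chosen for this example.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one chunk of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_speech(chunks: Sequence[Sequence[int]], threshold: float = 500.0) -> List[bool]:
    """Flag each chunk as speech (True) or silence (False) by energy alone."""
    return [rms(chunk) >= threshold for chunk in chunks]


# Two quiet chunks around one loud chunk (made-up 16-bit sample values).
silence = [0, 3, -2, 1]
speech = [1200, -900, 1500, -1100]
flags = detect_speech([silence, speech, silence])
print(flags)
```

A real VAD is far more robust to background noise than this threshold, which is one reason the notebook relies on the Silero model.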

# Set Up AI Services: ASR, TTS, and LLM

Now that our transport is ready to handle connections, let's set up the **AI services** that will drive our conversational agent.

### What's Happening?

- **ASR (Automatic Speech Recognition):** Converts user speech into text.  
  ➔ We use **RivaASRService**, NVIDIA's high-accuracy, low-latency speech-to-text engine.
  
- **TTS (Text-to-Speech):** Converts the AI’s text replies into natural-sounding speech.  
  ➔ We use **RivaTTSService** for flexible, high-quality voice synthesis.

- **LLM (Large Language Model):** Generates intelligent text responses based on the conversation history.  
  ➔ We use **NvidiaLLMService** to access cloud-hosted NIM LLMs like Meta Llama 3.

These three services form the **core intelligence and voice** of our digital human agent.

---
### Understanding Pipecat Services

In Pipecat, **Services** are special types of frame processors that:
- Take incoming frames (audio or text)
- Call an external AI model (like ASR, LLM, TTS)
- Output transformed frames (transcripts, responses, synthesized audio)

**NVIDIA Pipecat** extends the basic Pipecat framework with ready-made service processors that connect to NVIDIA Riva, Audio2Face, Foundational RAG, and more. For now, this notebook will focus on Riva services.
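
The three steps above (take a frame, call a model, emit a transformed frame) can be sketched in plain Python. This is a conceptual model only, not the Pipecat service API: `TextFrame`, `ToyService`, `fake_llm`, and `run_chain` are hypothetical names, and the "model" here is just a local function standing in for an external ASR, LLM, or TTS call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextFrame:
    """A chunk of text moving through the chain (hypothetical frame type)."""
    text: str


class ToyService:
    """Hypothetical processor that delegates the work to a `model` callable."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model

    def process(self, frame: TextFrame) -> TextFrame:
        # Take an incoming frame, call the model, output a transformed frame.
        return TextFrame(self.model(frame.text))


def fake_llm(prompt: str) -> str:
    # Stand-in for a remote model call.
    return f"You asked about: {prompt}"


def run_chain(services: List[ToyService], frame: TextFrame) -> TextFrame:
    # Pass the frame through each service in order.
    for service in services:
        frame = service.process(frame)
    return frame


result = run_chain(
    [ToyService(str.strip), ToyService(fake_llm)],
    TextFrame("  the Egypt gallery  "),
)
print(result.text)
```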

---
#### ✏️ Try It Yourself:

- Change the **LLM model** name in `NvidiaLLMService` to use a different NIM-hosted model. These can be found at [build.nvidia.com](https://build.nvidia.com).
- Modify the **TTS voice_id** to hear your agent respond with a different voice or accent.
- Explore the `language` parameter — can you make your agent speak or understand another language?

In [3]:
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService
from nvidia_pipecat.services.riva_speech import RivaASRService, RivaTTSService

def create_services():
    # Setting up a LLM service
    llm = NvidiaLLMService(
        api_key=os.getenv("NVIDIA_API_KEY"),
        model="meta/llama-3.3-70b-instruct",
    )

    # Setting up an ASR service
    stt = RivaASRService(api_key=os.getenv("NVIDIA_API_KEY"))

    # Setting up a TTS service
    tts = RivaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))

    return llm, stt, tts

# Define the Services
We use ACE for transport, the Llama-3.3-70B-Instruct NIM for the LLM, Riva for STT and TTS, and Silero for VAD (Voice Activity Detection). We'll also set a system prompt to make the agent act as a friendly museum guide.

We will show how to build a simple speech-to-speech voice assistant pipeline using nvidia-pipecat together with the pipecat-ai library, and deploy it for testing. This pipeline uses the WebSocket-based ACETransport, Riva ASR and TTS models, and the NVIDIA LLM Service. We recommend first reading the Pipecat documentation or the Pipecat Overview section to understand the core concepts.

### Define LLM Prompt
Let's set a basic prompt for the LLM. You can edit the prompt as desired.

In [4]:
messages = [
    {
        "role": "system",
        "content": """
You are Lydia; a conversational voice agent who acts as a friendly museum curator. 
You listen carefully to visitors and answer their questions about the exhibits, collections, and the museum itself. 
The purpose is to demonstrate natural, open-ended voice conversation.

Here is background content to reference in the conversation. Only use the background content provided.

BACKGROUND:

You work at a prestigious art and history museum. 
The museum's key exhibits include:
  - Ancient civilizations (Egypt, Greece, Rome)
  - Renaissance art (Da Vinci, Michelangelo, Botticelli)
  - Modern art (Picasso, Matisse, O'Keeffe)
  - Natural history (Dinosaurs, fossils, early mammals)
  - Technological innovation (early computers, space exploration artifacts)

The museum is also known for its interactive experiences, educational programs, and traveling exhibits that rotate every six months.

CRITICAL VOICE REQUIREMENTS:

Your responses will be converted to audio. 
Please avoid special characters except for '!' or '?'. 
Speak clearly and naturally as a professional curator would.

RESPONSE REQUIREMENTS:

Speaking style:
- Keep responses natural, brief, and welcoming
- Start with one clear fact or comment related to the visitor's question
- Add one or two short supporting details if relevant
- Then ask a question to continue the conversation
- Never repeat or rephrase information already said
- Never restate the visitor's exact words
- Avoid filler phrases like also, additionally, furthermore, moreover

Example of BAD response (too long):
"Our Ancient Egypt collection includes artifacts from the Old Kingdom, Middle Kingdom, and New Kingdom. You will find funerary masks, canopic jars, and intricate jewelry, many of which were used in religious ceremonies or burial practices. It's fascinating to explore the craftsmanship of the time. Would you like me to recommend a guided tour?"

Example of BAD response (too short):
"We have Egyptian artifacts. Want a tour?"

Example of GOOD response:
"Our Egyptian gallery features burial artifacts from the New Kingdom. Are you more interested in jewelry or tomb relics?"

Natural Acknowledgments:
- Use short, professional acknowledgments like "That's a great question" or "Fascinating topic"
- Stay focused on museum content
- Avoid emotional support or overly casual phrases like "No worries" or "You're doing great"

Example of BAD acknowledgment:
"That's wonderful! You're asking such great questions."

Example of GOOD acknowledgment:
"Fascinating topic. Our modern art gallery is one of the most visited. Are you interested in early 20th century works?"

INSTRUCTIONS

You can:
  - Answer questions about the museum exhibits, collections, and programs
  - Share interesting facts about art, history, and science based on the background
  - Recommend galleries or activities based on visitor interest

You cannot:
  - Provide information outside of the background content
  - Make up exhibits or historical facts

INITIAL GREETING:

Introduce yourself by saying:
"Hello, I'm Lydia, the curator here. I'm excited to share stories and discoveries from our exhibits. What brings you to the museum today?"

If the visitor introduces themselves, reply with:
"Nice to meet you! Is there a particular exhibit you're most excited to explore?"

If the visitor does not introduce themselves, simply continue the conversation naturally.
"""
    },
]
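
The prompt asks the model to avoid special characters, but a model may not always comply. As an optional safeguard, you could clean the text before it reaches TTS. The helper below is a hypothetical sketch and not part of the original pipeline. It assumes that letters, digits, whitespace, and the punctuation `. , ' ! ? -` are safe to speak, which is an assumption you may want to adjust.

```python
import re

# Anything outside this allow-list is removed (the allow-list is an assumption).
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,'!?-]")


def sanitize_for_tts(text: str) -> str:
    """Drop characters that are likely to be read aloud awkwardly."""
    cleaned = _DISALLOWED.sub("", text)
    # Collapse any whitespace runs left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


print(sanitize_for_tts("**Welcome** to the #1 museum!"))
```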

# Initialize the Context Aggregator

In [5]:
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

def create_context_aggregator(llm_service):
    """
    Set up the LLM conversational context and aggregator.
    """
    
    context = OpenAILLMContext(messages)
    context_aggregator = llm_service.create_context_aggregator(context)
    
    return context, context_aggregator, messages  # Note: return messages too for later use!
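
Conceptually, the context is an ordered list of chat messages that grows as the conversation proceeds: the user side adds transcribed speech, and the assistant side adds the model's replies. The sketch below models that idea in plain Python. `ToyContext` and its methods are hypothetical names for illustration and do not reflect the `OpenAILLMContext` API.

```python
from typing import Dict, List

Message = Dict[str, str]


class ToyContext:
    """Hypothetical stand-in for an LLM context: an ordered list of messages."""

    def __init__(self, initial: List[Message]) -> None:
        # Copy the list so the caller's original is left untouched.
        self.messages: List[Message] = list(initial)

    def add_user(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"role": "assistant", "content": text})


context = ToyContext([{"role": "system", "content": "You are a museum guide."}])
context.add_user("Where are the dinosaurs?")
context.add_assistant("Our natural history wing has them. Do you like fossils?")
roles = [m["role"] for m in context.messages]
print(roles)
```

On each turn the full message list is what the LLM sees, which is how the agent keeps track of what has already been said.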

# Pipeline Setup

In [6]:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.task import PipelineParams, PipelineTask
from nvidia_pipecat.pipeline.ace_pipeline_runner import ACEPipelineRunner, PipelineMetadata

# Full pipeline task setup
async def create_pipeline_task(pipeline_metadata: PipelineMetadata):
    """
    Creates the main speech-to-speech conversational agent pipeline.
    """
    # Create transport
    transport = create_transport(pipeline_metadata)

    # Create services
    llm, stt, tts = create_services()

    # Create context and aggregator
    context, context_aggregator, messages = create_context_aggregator(llm)

    # Define the processing pipeline
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline)

    # Event handler for when client connects
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        messages.append({"role": "system", "content": "Introduce yourself to the user."})
        await task.queue_frames([LLMMessagesFrame(messages)])

    return task
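
One detail worth noticing in the cell above: `messages` is the module-level list defined earlier, so the `on_client_connected` handler appends another "Introduce yourself" instruction to that shared list every time a client connects. If that matters for your use case, one possible approach is to build a fresh list per connection. The sketch below shows the idea in isolation; `BASE_MESSAGES` and `greeting_messages` are hypothetical names, not part of the notebook's code.

```python
import copy
from typing import Dict, List

Message = Dict[str, str]

# Stand-in for the notebook's system prompt list.
BASE_MESSAGES: List[Message] = [
    {"role": "system", "content": "You are a museum guide."}
]


def greeting_messages(base: List[Message]) -> List[Message]:
    """Return a fresh list with the greeting instruction; `base` is unchanged."""
    fresh = copy.deepcopy(base)
    fresh.append({"role": "system", "content": "Introduce yourself to the user."})
    return fresh


# Simulate two separate client connections.
first = greeting_messages(BASE_MESSAGES)
second = greeting_messages(BASE_MESSAGES)
print(len(first), len(second), len(BASE_MESSAGES))
```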

# Launch the FastAPI Server

In [7]:
import os
import asyncio
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from nvidia_pipecat.transports.services.ace_controller.routers.websocket_router import router as websocket_router


# FastAPI app setup
app = FastAPI()

# Websocket route
app.include_router(websocket_router)

# Set pipeline runner - Only run ONCE!
runner = ACEPipelineRunner(pipeline_callback=create_pipeline_task)

# Mount static web client (for connecting users)
app.mount("/static", StaticFiles(directory="static"), name="static")

The ACEPipelineRunner above should only be run once per session. If changes need to be made, we recommend restarting the kernel using the refresh icon in the toolbar.
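
If a server from an earlier run is still alive, starting a second one on the same port will typically fail because the address is already in use. As a quick sanity check before launching, you could test whether something is already listening on the port. This is a hypothetical helper using only the standard library; the function name and the half-second timeout are assumptions.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8100)` returning `True` before you start the server suggests an earlier instance is still running and the kernel should be restarted.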

In [None]:

# Run the server from within the Jupyter notebook
# (alternatively, skip this cell and run the app externally)
import nest_asyncio
import uvicorn

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Start the Uvicorn server
uvicorn.run(app, host="0.0.0.0", port=8100)

INFO:     Started server process [81479]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8100 (Press CTRL+C to quit)


INFO:     127.0.0.1:55294 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:55294 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:55297 - "GET /static/index.html HTTP/1.1" 200 OK
INFO:     127.0.0.1:55297 - "GET /static/frames.proto HTTP/1.1" 200 OK
INFO:     127.0.0.1:55297 - "GET /favicon.ico HTTP/1.1" 404 Not Found


INFO:     ('127.0.0.1', 55321) - "WebSocket /ws/52cdb8fb-2662-4a06-954c-0fedaa87a75b" [accepted]
2025-05-14 19:19:54.547 | DEBUG    | pipecat.audio.vad.silero:__init__:111 - Loading Silero VAD model...
2025-05-14 19:19:54.630 | DEBUG    | pipecat.audio.vad.silero:__init__:133 - Loaded Silero VAD
2025-05-14 19:19:56.963 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#0 -> ACEInputTransport#0
2025-05-14 19:19:56.964 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking ACEInputTransport#0 -> RivaASRService#0
2025-05-14 19:19:56.965 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking RivaASRService#0 -> OpenAIUserContextAggrega

INFO:     127.0.0.1:55299 - "GET /favicon.ico HTTP/1.1" 404 Not Found


2025-05-14 19:19:57.868 | DEBUG    | nvidia_pipecat.services.riva_speech:run_tts:172 - Generating TTS: [ I'm excited to share stories and discoveries from our exhibits.]
2025-05-14 19:19:58.014 | DEBUG    | nvidia_pipecat.services.riva_speech:_response_handler:422 - Sending new Riva ASR streaming request...
2025-05-14 19:19:58.207 | DEBUG    | nvidia_pipecat.services.riva_speech:run_tts:172 - Generating TTS: [ What brings you to the museum today?]
2025-05-14 19:20:00.286 | DEBUG    | nvidia_pipecat.services.riva_speech:_handle_response:479 - Transcript received at Riva ASR: [yeah]
2025-05-14 19:20:00.287 | DEBUG    | nvidia_pipecat.services.riva_speech:_handle_response:500 - Interim Us

Now open http://localhost:8100/static/index.html in your browser.

Leave this cell running while you interact with the web UI.