# Deploying a Speech-to-Speech Virtual Tour Guide with NVIDIA Pipecat

In this notebook, we will learn how to build, configure, and deploy a voice AI agent using ACE Controller, which leverages NVIDIA Pipecat. We'll customize it into a simple Virtual Museum Guide that interacts with users by voice.

We'll illustrate the basic flow using NVIDIA Pipecat along with the Pipecat-AI library, then deploy the agent for testing and development.

## Introduction
**ACE Controller** is a framework for building advanced conversational agents, built on top of NVIDIA Pipecat. It provides a modular pipeline for connecting speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) modules, and is designed for real-time, interactive applications.

**Goal:** Deploy a basic voice agent that acts as a friendly museum guide, using a FastAPI server and websocket-based communication.

## Prerequisites
Prior to getting started, you will need to create an API Key for the NVIDIA API Catalog and a Daily API Key for the voice agent's transport layer in this demo.

### Obtain API Keys
#### NGC API Key
- NVIDIA API Catalog
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
  2. Select any model, such as `llama-3.3-70b-instruct`.
  3. On the right panel above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

#### Daily API Key
1. Sign up at **[Daily](https://dashboard.daily.co/u/signup?pipecat=y)**.
2. Verify your email address and choose a subdomain to complete onboarding.
3. Click on "Developers" in the left-side menu of the Daily dashboard to reveal your API Key.

### Export API Keys
Save these API keys as environment variables in the `.env` file in this directory.

The cell below checks whether the NVIDIA API Key is set as an environment variable. If it is not, it prompts you to enter the key.

In [1]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Enter your NVIDIA API key:  ········


Now set the Daily API Key as an environment variable.
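As a minimal sketch of one way to do this, the helper below parses `KEY=VALUE` lines such as those in a `.env` file, applies them without overwriting variables that are already set, and reports which required keys are still missing. The variable name `DAILY_API_KEY` and the helper names are assumptions made for illustration; they are not defined elsewhere in this notebook.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values, environ=os.environ):
    """Set only the variables that are not already present; return the mapping."""
    for key, value in values.items():
        environ.setdefault(key, value)
    return environ


def missing_keys(required, environ=os.environ):
    """Return the names from `required` that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example usage from the notebook directory (assumes a `.env` file exists):
# with open(".env") as env_file:
#     apply_env(parse_env_text(env_file.read()))
# print(missing_keys(["NVIDIA_API_KEY", "DAILY_API_KEY"]))
```

Keeping already-set variables untouched means a key entered interactively in the cell above takes precedence over the file.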

### Install dependencies

Let's set up our environment.

In [2]:
!pip install nvidia-pipecat
!pip install "pipecat-ai[nim,daily,openai,riva,silero]"
!pip install websockets
!pip install fastapi
!pip install uvicorn
!pip install nest_asyncio  # needed later to run Uvicorn inside the notebook


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip

# Initialize User Input
Configure the ACE Transport for websocket communication.

### What's this?
- A **WebSocket** is a network protocol (like HTTP or TCP) that creates a persistent, two-way connection between a client and a server.
- Our Pipecat agent connects to the client using this websocket.
- Responses, such as the agent's text and voice, are streamed back over the same websocket with low latency.
- `ACETransport` handles sending and receiving these events over the websocket.
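To make the "persistent connection" point concrete: a WebSocket starts as an ordinary HTTP request that the server upgrades, and RFC 6455 defines how the server proves it understood the upgrade. The sketch below computes that handshake response with the standard library only. It illustrates the protocol itself and is not part of `ACETransport`, which handles this step for you.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing the handshake response.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept_key(client_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from RFC 6455, section 1.3.
print(websocket_accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Once this exchange succeeds, the same TCP connection stays open and both sides can send messages at any time, which is what lets the agent stream audio back while still listening.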

Let's set up our services. These are initialized once and can be plugged into our Pipecat pipeline, which makes it easy to swap out the models that drive the LLM, ASR, and TTS functionality.

`NvidiaLLMService` supports both NIM-hosted models and locally deployed NIM LLMs. This notebook shows the NIM-hosted method.

In [3]:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from nvidia_pipecat.transports.network.ace_fastapi_websocket import ACETransport, ACETransportParams

# Transport setup
def create_transport(pipeline_metadata):
    
    return ACETransport(
        # Connect the websocket provided by the pipeline
        websocket=pipeline_metadata.websocket,  # Active connection between the client and the server
        
        # Set transport parameters
        params=ACETransportParams(
            vad_enabled=True,  # Enable Voice Activity Detection (VAD)
            vad_analyzer=SileroVADAnalyzer(),   # Use Silero model to detect when the user is speaking
            vad_audio_passthrough=True,  # Pass through audio even when VAD is active (does not cut off)
        ),
    )
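To give a feel for what Voice Activity Detection does, here is a deliberately simple energy-threshold detector. It is a stand-in for illustration only: Silero uses a neural model, not an RMS threshold, and the function names, threshold, and frame counts below are assumptions chosen for the example.

```python
import math


def frame_rms(samples):
    """Root-mean-square amplitude of one audio frame (a list of PCM samples)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_events(frames, threshold=500.0, start_frames=2, stop_frames=2):
    """Emit ("started", i) / ("stopped", i) events from a sequence of frames.

    Speech starts after `start_frames` consecutive loud frames and stops
    after `stop_frames` consecutive quiet frames.
    """
    events = []
    speaking = False
    loud_run = quiet_run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            loud_run += 1
            quiet_run = 0
        else:
            quiet_run += 1
            loud_run = 0
        if not speaking and loud_run >= start_frames:
            speaking = True
            events.append(("started", index))
        elif speaking and quiet_run >= stop_frames:
            speaking = False
            events.append(("stopped", index))
    return events


quiet = [0] * 160
loud = [1000, -1000] * 80
print(speech_events([quiet, loud, loud, loud, quiet, quiet, quiet]))
# [('started', 2), ('stopped', 5)]
```

Requiring several consecutive frames before switching state avoids flapping on short clicks or brief pauses, which is the same reason real VADs expose start and stop durations.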

In [6]:
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService
from nvidia_pipecat.services.riva_speech import RivaASRService, RivaTTSService

def create_services():
    # Setting up an LLM service
    llm = NvidiaLLMService(
        api_key=os.getenv("NVIDIA_API_KEY"),
        model="meta/llama-3.3-70b-instruct",
    )

    # Setting up an ASR service
    stt = RivaASRService(api_key=os.getenv("NVIDIA_API_KEY"))

    # Setting up a TTS service
    tts = RivaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))

    return llm, stt, tts
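Since `NvidiaLLMService` can also target a locally deployed NIM, one possible way to switch between the two is to build the constructor arguments from the environment. The sketch below assumes the service accepts a `base_url` keyword and uses a hypothetical `NIM_LLM_BASE_URL` variable; check both against the version of `nvidia-pipecat` you have installed before relying on them.

```python
import os


def llm_service_kwargs(environ=os.environ):
    """Build keyword arguments for the LLM service.

    Assumption: the service takes `api_key`, `model`, and an optional
    `base_url`. `NIM_LLM_BASE_URL` is a hypothetical variable name.
    """
    kwargs = {
        "api_key": environ.get("NVIDIA_API_KEY"),
        "model": "meta/llama-3.3-70b-instruct",
    }
    base_url = environ.get("NIM_LLM_BASE_URL")
    if base_url:
        # Point at a locally deployed NIM instead of the hosted endpoint.
        kwargs["base_url"] = base_url
    return kwargs


print(sorted(llm_service_kwargs({"NVIDIA_API_KEY": "nvapi-example"})))
# ['api_key', 'model']
```

Under those assumptions, `create_services()` could call `NvidiaLLMService(**llm_service_kwargs())` so that switching deployments needs no code change.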

# Define the Services
We use ACE for transport, the Llama-3.3-70B-Instruct NIM for the LLM, Riva for STT and TTS, and Silero for VAD (Voice Activity Detection). We'll also set a system prompt to make the agent act as a friendly museum guide.

We will showcase how to build a simple speech-to-speech voice assistant pipeline using nvidia-pipecat along with the pipecat-ai library and deploy it for testing. This pipeline uses the WebSocket-based ACETransport, Riva ASR and TTS models, and the NVIDIA LLM Service. It is recommended to first follow the Pipecat documentation or the Pipecat Overview section to understand the core concepts.

### Define LLM Prompt
Let's set a basic prompt for the LLM. You can edit the prompt as desired.

In [7]:
messages = [
    {
        "role": "system",
        "content": """
You are Lydia, a conversational voice agent who acts as a friendly museum curator.
You listen carefully to visitors and answer their questions about the exhibits, collections, and the museum itself. 
The purpose is to demonstrate natural, open-ended voice conversation.

Here is background content to reference in the conversation. Only use the background content provided.

BACKGROUND:

You work at a prestigious art and history museum. 
The museum's key exhibits include:
  - Ancient civilizations (Egypt, Greece, Rome)
  - Renaissance art (Da Vinci, Michelangelo, Botticelli)
  - Modern art (Picasso, Matisse, O'Keeffe)
  - Natural history (Dinosaurs, fossils, early mammals)
  - Technological innovation (early computers, space exploration artifacts)

The museum is also known for its interactive experiences, educational programs, and traveling exhibits that rotate every six months.

CRITICAL VOICE REQUIREMENTS:

Your responses will be converted to audio. 
Please avoid special characters except for '!' or '?'. 
Speak clearly and naturally as a professional curator would.

RESPONSE REQUIREMENTS:

Speaking style:
- Keep responses natural, brief, and welcoming
- Start with one clear fact or comment related to the visitor's question
- Add one or two short supporting details if relevant
- Then ask a question to continue the conversation
- Never repeat or rephrase information already said
- Never restate the visitor's exact words
- Avoid filler phrases like also, additionally, furthermore, moreover

Example of BAD response (too long):
"Our Ancient Egypt collection includes artifacts from the Old Kingdom, Middle Kingdom, and New Kingdom. You will find funerary masks, canopic jars, and intricate jewelry, many of which were used in religious ceremonies or burial practices. It's fascinating to explore the craftsmanship of the time. Would you like me to recommend a guided tour?"

Example of BAD response (too short):
"We have Egyptian artifacts. Want a tour?"

Example of GOOD response:
"Our Egyptian gallery features burial artifacts from the New Kingdom. Are you more interested in jewelry or tomb relics?"

Natural Acknowledgments:
- Use short, professional acknowledgments like "That's a great question" or "Fascinating topic"
- Stay focused on museum content
- Avoid emotional support or overly casual phrases like "No worries" or "You're doing great"

Example of BAD acknowledgment:
"That's wonderful! You're asking such great questions."

Example of GOOD acknowledgment:
"Fascinating topic. Our modern art gallery is one of the most visited. Are you interested in early 20th century works?"

INSTRUCTIONS

You can:
  - Answer questions about the museum exhibits, collections, and programs
  - Share interesting facts about art, history, and science based on the background
  - Recommend galleries or activities based on visitor interest

You cannot:
  - Provide information outside of the background content
  - Make up exhibits or historical facts

INITIAL GREETING:

Introduce yourself by saying:
"Hello, I'm Lydia, the curator here. I'm excited to share stories and discoveries from our exhibits. What brings you to the museum today?"

If the visitor introduces themselves, reply with:
"Nice to meet you! Is there a particular exhibit you're most excited to explore?"

If the visitor does not introduce themselves, simply continue the conversation naturally.
"""
    },
]
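The prompt asks the model to avoid special characters because its output is converted to audio. Models do not always comply, so a defensive filter on the generated text can help. The following is a minimal sketch, not something the pipeline in this notebook does; the function name and the exact set of allowed characters (letters, digits, whitespace, and basic punctuation) are assumptions.

```python
import re

# Anything outside letters, digits, whitespace, and basic punctuation is dropped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,'!?-]")


def sanitize_for_tts(text: str) -> str:
    """Strip markup-like characters and collapse repeated whitespace."""
    cleaned = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(sanitize_for_tts("**Welcome** to the #1 gallery!  Ready?"))
# Welcome to the 1 gallery! Ready?
```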

# Initialize the Context Aggregator

In [8]:
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

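def create_context_aggregator(llm_service):
    """
    Set up the LLM conversational context and aggregator.
    """
    # Copy the prompt so each connection gets its own history instead of
    # appending to the module-level `messages` list on every new client.
    session_messages = list(messages)

    context = OpenAILLMContext(session_messages)
    context_aggregator = llm_service.create_context_aggregator(context)

    return context, context_aggregator, session_messages  # Messages are returned for later use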
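Conceptually, the context aggregator keeps the running conversation so the LLM sees earlier turns on every request. The toy class below mimics that bookkeeping in plain Python. It is an illustrative stand-in with made-up names, not the `OpenAILLMContext` API.

```python
class ToyContext:
    """Toy stand-in that keeps the running list of chat messages."""

    def __init__(self, messages):
        # Copy the list so each session owns its own history.
        self.messages = list(messages)

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        self.messages.append({"role": "assistant", "content": text})


ctx = ToyContext([{"role": "system", "content": "You are a museum guide."}])
ctx.add_user("Where are the dinosaurs?")
ctx.add_assistant("Our natural history wing has them. Do you like fossils?")
print([m["role"] for m in ctx.messages])  # ['system', 'user', 'assistant']
```

In the real pipeline, the user-side aggregator sits after STT and the assistant-side aggregator sits after the output transport, so both halves of each turn are recorded.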

# Pipeline Setup

In [10]:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.task import PipelineParams, PipelineTask
from nvidia_pipecat.pipeline.ace_pipeline_runner import ACEPipelineRunner, PipelineMetadata

# Full pipeline task setup
async def create_pipeline_task(pipeline_metadata: PipelineMetadata):
    """
    Creates the main speech-to-speech conversational agent pipeline.
    """
    # Create transport
    transport = create_transport(pipeline_metadata)

    # Create services
    llm, stt, tts = create_services()

    # Create context and aggregator
    context, context_aggregator, messages = create_context_aggregator(llm)

    # Define the processing pipeline
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline)

    # Event handler for when client connects
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        messages.append({"role": "system", "content": "Introduce yourself to the user."})
        await task.queue_frames([LLMMessagesFrame(messages)])

    return task
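The order of processors in the `Pipeline` list is the order in which data flows: audio in, transcript, LLM text, audio out. The toy version below shows only that chaining idea with plain functions and dictionaries. Real Pipecat processors are asynchronous and exchange typed frames, so treat every name here as hypothetical.

```python
def toy_stt(frame):
    # Pretend the incoming audio frame already carries what was spoken.
    return {"type": "transcript", "text": frame["spoken"]}


def toy_llm(frame):
    # Stand-in for the LLM: turn a transcript into a reply.
    return {"type": "llm_text", "text": f"You asked about {frame['text']}."}


def toy_tts(frame):
    # Stand-in for TTS: "audio" is just the reply encoded as bytes.
    return {"type": "audio", "payload": frame["text"].encode("utf-8")}


def run_pipeline(stages, frame):
    """Pass a frame through each stage in order, like a linear pipeline."""
    for stage in stages:
        frame = stage(frame)
    return frame


result = run_pipeline([toy_stt, toy_llm, toy_tts], {"type": "audio", "spoken": "Egypt"})
print(result)  # {'type': 'audio', 'payload': b'You asked about Egypt.'}
```

Because each stage only depends on the frame it receives, swapping one service for another does not affect the rest of the chain.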

# Launch the FastAPI Server

In [11]:
import os
import asyncio
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from nvidia_pipecat.transports.services.ace_controller.routers.websocket_router import router as websocket_router


# FastAPI app setup
app = FastAPI()

# Websocket route
app.include_router(websocket_router)

# Set pipeline runner - Only run ONCE!
runner = ACEPipelineRunner(pipeline_callback=create_pipeline_task)

# Mount static web client (for connecting users)
app.mount("/static", StaticFiles(directory="static"), name="static")

In [None]:

# Run the server inside the Jupyter notebook
# (alternatively, skip this cell and run the app externally as a script)
import nest_asyncio
import uvicorn

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Start the Uvicorn server
uvicorn.run(app, host="0.0.0.0", port=8100)

INFO:     Started server process [96261]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8100 (Press CTRL+C to quit)


INFO:     127.0.0.1:58470 - "GET /static/index.html HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:58470 - "GET /static/frames.proto HTTP/1.1" 304 Not Modified


INFO:     ('127.0.0.1', 58474) - "WebSocket /ws/691dd732-85d1-43ea-93b4-8345616b6b97" [accepted]
2025-04-28 18:33:03.394 | DEBUG    | pipecat.audio.vad.silero:__init__:111 - Loading Silero VAD model...
2025-04-28 18:33:03.472 | DEBUG    | pipecat.audio.vad.silero:__init__:133 - Loaded Silero VAD
2025-04-28 18:33:04.533 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#0 -> ACEInputTransport#0
2025-04-28 18:33:04.534 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking ACEInputTransport#0 -> RivaASRService#0
2025-04-28 18:33:04.535 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking RivaASRService#0 -> OpenAIUserContextAggrega

INFO:     127.0.0.1:58472 - "GET /favicon.ico HTTP/1.1" 404 Not Found


2025-04-28 18:33:05.278 | DEBUG    | nvidia_pipecat.services.riva_speech:run_tts:172 - Generating TTS: [ I'm excited to share stories and discoveries from our exhibits.]
2025-04-28 18:33:05.464 | DEBUG    | nvidia_pipecat.services.riva_speech:_handle_response:479 - Transcript received at Riva ASR: [and]
2025-04-28 18:33:05.466 | DEBUG    | nvidia_pipecat.services.riva_speech:_handle_response:500 - Interim User transcript: [and]
2025-04-28 18:33:05.470 | DEBUG    | pipecat.transports.base_input:process_frame:119 - Emulating user started speaking
2025-04-28 18:33:05.471 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:154 - User started speaking
2025-04-28 18

Now open http://localhost:8100/static/index.html in your browser to talk to the agent.
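If the page does not load, it can help to confirm that the server is actually listening. Because `uvicorn.run` blocks the notebook kernel, a check like the one below would need to run from a separate Python session. It is a small standard-library sketch; the helper names are made up for illustration, and the host and port mirror the values used above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def client_url(host="localhost", port=8100):
    """Build the URL of the static web client served by the FastAPI app."""
    return f"http://{host}:{port}/static/index.html"


if is_port_open("localhost", 8100):
    print(f"Server is up, open {client_url()}")
else:
    print("Nothing is listening on port 8100 yet")
```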