# NVIDIA-Pipecat Text to Speech Basics

Welcome to this section of Module 2, where we dive into Text-to-Speech (TTS) integration using NVIDIA Pipecat. You'll learn how to convert text, whether static or dynamically generated by a Large Language Model (LLM), into audible speech in a streaming fashion.

The `RivaTTSService` is a key component of the `nvidia-pipecat` library. It leverages NVIDIA's Riva TTS models to provide high-quality speech synthesis. This service is designed for real-time applications, making it ideal for digital humans and voice agents.

## Learning Objectives:
- Understand how `RivaTTSService` processes text frames and generates audio frames within a Pipecat pipeline.
- Implement a basic pipeline to synthesize speech from predefined text.
- Extend the pipeline to synthesize speech from dynamically generated text using `NvidiaLLMService`.
- Explore customization options for voice and language.

## Prerequisites
Before you begin, ensure you have:
- Set up your Python environment according to `0-0-Environment-Setup-Guide.md`.
- Selected the `nv-pipecat-env` Jupyter kernel.
- An NVIDIA API key from the NVIDIA API Catalog, which is needed to access the models used in this notebook. Store it in your `.env` file as `NVIDIA_API_KEY`; if it is missing, the next cell prompts you for it.

**Need an API Key? It's Free!**
1. Navigate to the **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
2. Select any model (e.g., `meta/llama-3.3-70b-instruct`).
3. Click "Get API Key" on the model's page.

In [2]:
import os
import getpass
from dotenv import load_dotenv

# Load environment variables from the .env file, if one exists
load_dotenv()

# Prompt for the key only if a valid-looking one is not already set
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

In [18]:
# Necessary imports (only the modules this notebook actually uses)
import os
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.pipeline.runner import PipelineRunner
from pipecat.frames.frames import LLMMessagesFrame, TTSSpeakFrame, EndFrame
from pipecat.transports.local.audio import LocalAudioTransport, LocalAudioTransportParams

from nvidia_pipecat.services.riva_speech import RivaTTSService
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService

## Part 1: Generating Speech from Predefined Text

First, we'll create a simple pipeline that takes a fixed string of text, converts it to speech using `RivaTTSService`, and plays it back.

### The `RivaTTSService`
The `RivaTTSService` is a Pipecat `FrameProcessor` specifically designed for NVIDIA's Riva TTS. Key features include:
- **Input:** It primarily processes `TextFrame` or `TTSSpeakFrame`. A `TTSSpeakFrame` is a specialized frame that signals the TTS service to synthesize the contained text.
- **Output:** It generates `TTSAudioRawFrame`s, which contain chunks of the synthesized audio, and control frames like `TTSStartedFrame` and `TTSStoppedFrame`.
- **Configuration:** You can specify the `api_key` (for cloud models), `voice_id` (to choose different voices and languages), `sample_rate`, and other TTS parameters.
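
To make this input/output contract concrete, here is a toy model in plain Python. It does not use Pipecat or Riva: the dataclasses only borrow the frame names listed above, and `fake_tts`, the 16 kHz sample rate, the 60 ms-per-character duration, and the silent audio are all illustrative assumptions. The sketch shows only the order of frames a downstream processor can expect for one speak request.

```python
from dataclasses import dataclass
from typing import Iterator, Union


# Toy stand-ins that borrow Pipecat's frame names; NOT the real classes.
@dataclass
class TTSSpeakFrame:
    text: str


@dataclass
class TTSStartedFrame:
    pass


@dataclass
class TTSAudioRawFrame:
    audio: bytes
    sample_rate: int


@dataclass
class TTSStoppedFrame:
    pass


OutputFrame = Union[TTSStartedFrame, TTSAudioRawFrame, TTSStoppedFrame]


def fake_tts(frame: TTSSpeakFrame, sample_rate: int = 16000,
             chunk_ms: int = 100) -> Iterator[OutputFrame]:
    """Yield Started -> audio chunks -> Stopped for one speak request.

    Hypothetical behaviour: emits 16-bit mono silence, 60 ms per character.
    """
    bytes_per_ms = sample_rate * 2 // 1000  # 2 bytes per 16-bit sample
    total_ms = 60 * len(frame.text)
    yield TTSStartedFrame()
    for start in range(0, total_ms, chunk_ms):
        length_ms = min(chunk_ms, total_ms - start)
        yield TTSAudioRawFrame(audio=bytes(bytes_per_ms * length_ms),
                               sample_rate=sample_rate)
    yield TTSStoppedFrame()


frames = list(fake_tts(TTSSpeakFrame("Hello")))
print([type(f).__name__ for f in frames])
```

The point to take away is the bracketing: audio chunks arrive between a start marker and a stop marker, so a consumer can begin playback on the first chunk instead of waiting for the whole utterance.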

Let's define our TTS service instance.

In [21]:
# Define our text-to-speech service instance
tts_service = RivaTTSService(
    api_key=os.getenv("NVIDIA_API_KEY"), 
    voice_id="English-US.Female-1"  # Example: A standard US English female voice. Explore other voice_ids!
)

### Building and Running the Static TTS Pipeline
We'll define a message and then construct a pipeline to speak it. The `LocalAudioTransport` is used here to play the audio output on your local machine.

The pipeline will be: `TTSSpeakFrame (queued manually)` → `RivaTTSService` → `TTSAudioRawFrame (streamed)` → `LocalAudioTransport (output)`.

In [24]:
# The message you want the agent to speak. Try changing this!
static_message = "Hello from NVIDIA Pipecat! I can speak this pre defined text."

In [25]:
async def run_static_tts_pipeline():
    print(f"Attempting to speak: '{static_message}'")
    # LocalAudioTransport handles playback of audio frames from TTS.
    audio_transport = LocalAudioTransport(LocalAudioTransportParams(audio_out_enabled=True))

    # Define the pipeline: TTS service -> Audio output transport
    pipeline = Pipeline([tts_service, audio_transport.output()])

    # Create a task for this pipeline execution
    task = PipelineTask(pipeline)

    # This inner function will queue the text to be spoken after the pipeline starts.
    async def speak_message():
        await asyncio.sleep(1)  # Allow pipeline to initialize
        # TTSSpeakFrame signals the TTS service to synthesize the text.
        # EndFrame signals the end of input for this task.
        await task.queue_frames([TTSSpeakFrame(static_message), EndFrame()])
        print("Message queued for TTS.")

    runner = PipelineRunner()

    # Run the pipeline task and the message queuing concurrently
    await asyncio.gather(runner.run(task), speak_message())
    print("Static TTS pipeline finished.")

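# Top-level `await` works in Jupyter/IPython; in a plain script, call asyncio.run(run_static_tts_pipeline()) instead.
if __name__ == "__main__":
    await run_static_tts_pipeline()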

2025-05-14 15:25:19.311 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#4 -> RivaTTSService#2
2025-05-14 15:25:19.312 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking RivaTTSService#2 -> LocalAudioOutputTransport#4
2025-05-14 15:25:19.312 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking LocalAudioOutputTransport#4 -> PipelineSink#4
2025-05-14 15:25:19.313 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineTaskSource#4 -> Pipeline#4
2025-05-14 15:25:19.313 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking Pipeline#4 -> PipelineTaskSink#4
2025-05-14 15:25:19.314 |

Attempting to speak: 'Hello from NVIDIA Pipecat! I can speak this pre defined text.'

2025-05-14 15:25:20.318 | DEBUG    | nvidia_pipecat.services.riva_speech:run_tts:172 - Generating TTS: [Hello from NVIDIA Pipecat! I can speak this pre defined text.]

Message queued for TTS.

2025-05-14 15:25:20.659 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:224 - Bot started speaking
2025-05-14 15:25:24.807 | DEBUG    | pipecat.pipeline.runner:run:50 - Runner PipelineRunner#4 finished running PipelineTask#4

Static TTS pipeline finished.


Turn up your volume to hear the output! 

In this example:
1. We create a `PipelineTask` for our simple TTS pipeline.
2. We manually queue a `TTSSpeakFrame` containing our `static_message` into the task. This frame acts as the input to the `RivaTTSService`.
3. The `RivaTTSService` processes this frame, synthesizes speech, and outputs a stream of `TTSAudioRawFrame`s.
4. The `LocalAudioTransport` consumes these audio frames and plays them through your speakers.
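
The queue-then-run pattern in these steps can be sketched without Pipecat at all. The toy below uses plain `asyncio` queues; `END`, `stage`, and `run_toy_pipeline` are hypothetical names invented for this illustration, and the "audio" is just a string. It is a simplified picture of how queued frames flow through linked stages and how an end marker shuts each stage down in turn, not a description of Pipecat's internals.

```python
import asyncio

END = object()  # toy stand-in for an end-of-input marker such as EndFrame


async def stage(name, transform, inbox, outbox, log):
    """Pull items from inbox, transform them, push them downstream.

    When END arrives, forward it (if there is a downstream queue) and stop.
    """
    while True:
        item = await inbox.get()
        if item is END:
            log.append(f"{name}: end")
            if outbox is not None:
                await outbox.put(END)
            return
        result = transform(item)
        log.append(f"{name}: {result}")
        if outbox is not None:
            await outbox.put(result)


async def run_toy_pipeline(messages):
    source, middle = asyncio.Queue(), asyncio.Queue()
    log = []

    async def feed():
        # Queue the inputs, then signal that no more input will follow.
        for message in messages:
            await source.put(message)
        await source.put(END)

    # Run both stages and the feeder concurrently, as the notebook cell does
    # with runner.run(task) and speak_message().
    await asyncio.gather(
        stage("tts", lambda text: f"<audio for {text!r}>", source, middle, log),
        stage("speaker", lambda audio: f"played {audio}", middle, None, log),
        feed(),
    )
    return log


if __name__ == "__main__":
    # In a notebook, use `await run_toy_pipeline(["Hello"])` instead.
    print(asyncio.run(run_toy_pipeline(["Hello"])))
```

Because the end marker travels through the same queues as the data, every stage finishes its pending work before it stops. That is why the `EndFrame` in the real example can be queued right behind the `TTSSpeakFrame`.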

### Exercise:
- Modify the `static_message` variable and re-run the cell to hear different outputs.
- Change the `voice_id` in the `RivaTTSService` definition. Available voices are listed in the Riva or NIM documentation for the TTS service. For example, try `"English-US.Male-1"`, or explore other languages and accents if your API key has access to them.

## Part 2: Generating Speech from LLM-Generated Text

While speaking static text is useful, voice agents typically need to speak dynamically generated content, often from an LLM. Let's enhance our pipeline to include `NvidiaLLMService` for an LLM -> TTS pipeline.

### The `NvidiaLLMService`
This service (introduced in Module 1.1) connects to NVIDIA NIM LLM endpoints. 
- **Input:** It expects an `LLMMessagesFrame`, which contains a list of messages (system prompt, user queries, assistant history).
- **Output:** It streams `TextFrame` (or `LLMTokenFrame`) objects containing the LLM's response.

These output `TextFrame`s will then be consumed by our `RivaTTSService`.
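
As a small illustration of what the message list inside an `LLMMessagesFrame` looks like, the helper below assembles a system prompt, prior turns, and a new user query into role-tagged dictionaries, the same `role`/`content` shape used later in this notebook. `build_messages` and its `history` parameter are hypothetical conveniences for this sketch, not part of `nvidia-pipecat`.

```python
from typing import Dict, List, Tuple


def build_messages(system_prompt: str,
                   history: List[Tuple[str, str]],
                   user_input: str) -> List[Dict[str, str]]:
    """Assemble a chat message list: system prompt, past turns, new query.

    `history` holds (user_text, assistant_text) pairs, oldest first.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_input})
    return messages


example = build_messages(
    "You are a helpful assistant.",
    [("Hi", "Hello! How can I help?")],
    "Tell me a fact about virtual humans.",
)
for message in example:
    print(message["role"], "->", message["content"])
```

With an empty `history`, this reduces to the two-message list (system plus user) that the pipeline in this section queues.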

In [28]:
# Define our LLM service
llm_service = NvidiaLLMService(  
    api_key=os.getenv("NVIDIA_API_KEY"),
    model="meta/llama-3.3-70b-instruct" 
)

### Defining User Input and System Prompt for the LLM
We'll provide a simple user query and a system prompt to guide the LLM's response style.

In [29]:
# User input to be processed by the LLM
dynamic_user_input = "Tell me a short, interesting fact about virtual humans."

# System prompt for the LLM
llm_system_prompt = "You are a helpful and enthusiastic assistant. Keep your responses concise and engaging."

### Building and Running the LLM-TTS Pipeline
The pipeline will now be: `LLMMessagesFrame (queued manually)` → `NvidiaLLMService` → `TextFrame (streamed)` → `RivaTTSService` → `TTSAudioRawFrame (streamed)` → `LocalAudioTransport (output)`.

In [30]:
async def run_dynamic_tts_pipeline():  
    print(f"User asks: '{dynamic_user_input}'")
    # Set up audio output transport, same as before
    audio_transport = LocalAudioTransport(LocalAudioTransportParams(audio_out_enabled=True))  
      
    # Create a pipeline: LLM service -> TTS service -> Audio output transport
    pipeline = Pipeline([llm_service, tts_service, audio_transport.output()])  
      
    task = PipelineTask(pipeline)  
      
    async def generate_and_speak():  
        await asyncio.sleep(1) # Allow pipeline to initialize
          
        # Prepare messages for the LLM
        messages_for_llm = [  
            {"role": "system", "content": llm_system_prompt},
            {"role": "user", "content": dynamic_user_input}
        ]  
          
        # Queue the LLMMessagesFrame to the LLM service.
        # The LLM's output (TextFrames) will automatically flow to the TTS service.
        await task.queue_frames([LLMMessagesFrame(messages_for_llm), EndFrame()])
        print("Message queued for LLM and then TTS.")
      
    runner = PipelineRunner()  
      
    await asyncio.gather(runner.run(task), generate_and_speak())  
    print("Dynamic LLM-TTS pipeline finished.")
  
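# Top-level `await` works in Jupyter/IPython; in a plain script, call asyncio.run(run_dynamic_tts_pipeline()) instead.
if __name__ == "__main__":
    await run_dynamic_tts_pipeline()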

2025-05-14 15:27:46.725 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#5 -> NvidiaLLMService#2
2025-05-14 15:27:46.726 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking NvidiaLLMService#2 -> RivaTTSService#2
2025-05-14 15:27:46.727 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking RivaTTSService#2 -> LocalAudioOutputTransport#5
2025-05-14 15:27:46.727 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking LocalAudioOutputTransport#5 -> PipelineSink#5
2025-05-14 15:27:46.727 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineTaskSource#5 -> Pipeline#5
2025-05-14 15:27:46.728

User asks: 'Tell me a short, interesting fact about virtual humans.'

2025-05-14 15:27:47.731 | DEBUG    | nvidia_pipecat.services.nvidia_llm:_stream_chat_completions:176 - Generating chat: [{"role": "system", "content": "You are a helpful and enthusiastic assistant. Keep your responses concise and engaging.", "name": "system"}, {"role": "user", "content": "Tell me a short, interesting fact about virtual humans.", "name": "user"}]

Message queued for LLM and then TTS.

2025-05-14 15:27:48.604 | DEBUG    | nvidia_pipecat.services.riva_speech:run_tts:172 - Generating TTS: [Did you know that virtual humans, also known as digital humans, can now be created with such precision that they can even mimic the subtleties of human emotions and behaviors, making them almost indistinguishable from real people?]
2025-05-14 15:27:49.111 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:224 - Bot started speaking
2025-05-14 15:28:02.143 | DEBUG    | pipecat.pipeline.runner:run:50 - Runner PipelineRunner#5 finished running PipelineTask#5

Dynamic LLM-TTS pipeline finished.


### How This Works
1.  The `LLMMessagesFrame` is sent to the `NvidiaLLMService`.
2.  The LLM processes the system prompt and user input and streams its response as `TextFrame`s (or as `LLMTokenFrame`s, which are aggregated into `TextFrame`s if no downstream token aggregator handles them).
3.  These `TextFrame`s are then passed sequentially to the `RivaTTSService`.
4.  `RivaTTSService` converts the incoming text chunks into `TTSAudioRawFrame`s.
5.  `LocalAudioTransport` plays the audio as it's received, demonstrating the streaming capability from LLM text generation through to speech output.

This creates a complete pipeline from a user query to a spoken response:
`User Input (text)` → `NvidiaLLMService` → `TextFrame (stream)` → `RivaTTSService` → `TTSAudioRawFrame (stream)` → `Spoken Output`
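
In the log output above, the `Generating TTS` entry contains a complete sentence rather than individual tokens, which suggests that streamed text is buffered up to a sentence boundary before synthesis. The sketch below illustrates that idea in plain Python. It is an assumption-laden toy, not Pipecat's actual aggregation logic: `aggregate_sentences` is a hypothetical name, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is deliberately simplistic, so abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace (simplistic on purpose).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def aggregate_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text chunks and yield one complete sentence at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part is still incomplete; keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


streamed = ["Did you ", "know? ", "Virtual humans ", "can smile", "."]
for sentence in aggregate_sentences(streamed):
    print(repr(sentence))
```

Sending whole sentences to a synthesizer generally gives more natural prosody than sending fragments, at the cost of waiting for the first sentence boundary before any audio can start.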

### ✏️ Exercises & Further Exploration:
1.  **Change LLM Model:** In `llm_service`, try a different `model` from the NVIDIA NIM catalog (a smaller, faster model, or one specialized for chat if available).
2.  **Modify System Prompt:** Experiment with different `llm_system_prompt` values to see how it influences the LLM's tone and content, and subsequently the spoken output.
3.  **Temperature Control:** Add a `temperature` parameter to the `NvidiaLLMService` initialization (for example, `temperature=0.7`). Observe how different temperature values affect the creativity and predictability of the LLM's responses and the resulting speech.
4.  **Observe Frames (Advanced):** Adapt the `FramePrinter` observer from Module 1.1 to log the `TextFrame`s coming from the LLM and the `TTSAudioRawFrame`s from the TTS. This will help visualize the streaming flow.
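
As background for the temperature exercise, sampling temperature is conventionally applied by dividing the model's logits by the temperature before the softmax. The standalone sketch below shows that arithmetic on made-up logits; `softmax_with_temperature` is a hypothetical helper, and nothing here calls `NvidiaLLMService` or reflects how a specific endpoint implements sampling.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which makes responses more predictable. Higher temperatures flatten the distribution, which makes responses more varied.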

## Conclusion

In this notebook, you've learned how to use `RivaTTSService` within NVIDIA Pipecat to synthesize speech, both from static text and from dynamic text generated by an LLM. You've seen how Pipecat's pipeline architecture allows for seamless, streaming integration of these powerful AI services.

Key takeaways:
- `RivaTTSService` converts `TextFrame` or `TTSSpeakFrame` inputs into `TTSAudioRawFrame` outputs.
- Pipelines can chain services like LLMs and TTS to create responsive voice interactions.
- `LocalAudioTransport` provides a simple way to hear TTS output during development.

In the next sections and modules, we will build upon this foundation by integrating Automatic Speech Recognition (ASR) to create a full speech-to-speech conversational agent, and explore more advanced features of `nvidia-pipecat` for building sophisticated digital humans.