# Intro to Speech Pipelines for Digital Humans
This module introduces the core technical components of a real-time voice agent system, with a focus on speech-to-speech (S2S) processing. These pipelines form the backbone of interactive digital human applications across support, education, entertainment, and more.

We won’t cover chatbot prompting or dialogue logic here; that comes in the next module. The goal of this section is to understand how speech data flows through a pipeline, from raw audio input to synthesized voice output.
The end goal is a real-time, interactive digital human application, such as a customer support avatar or a game character, but this module concentrates on the voice and speech components of the pipeline.

⚠️ This notebook is intentionally light on theory. The priority is to help you implement, compare, and modify working components in a modular AI agent stack.

## By the end of this module, students should be able to:
- Describe the full stack of a speech-to-speech pipeline and its role in a digital human.
- Compare and choose appropriate technologies for ASR and TTS.
- Design for real-time constraints, such as streaming vs. batch processing and low-latency interaction.
- Understand how data flows between components and what format conversions are required.
- Identify basic deployment considerations, including hardware, containerization, and service orchestration.
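
To make the format-conversion and streaming objectives concrete, here is a minimal sketch using only the Python standard library. It converts floating-point samples to 16-bit PCM bytes and splits them into fixed-size frames, the way a streaming client might send audio to an ASR service. The 16 kHz sample rate, the 20 ms frame size, and the helper names `float_to_pcm16` and `iter_frames` are assumptions chosen for illustration. They are not requirements of any particular service.

```python
import struct
from typing import Iterator, List

SAMPLE_RATE = 16_000  # Hz; assumed for illustration
FRAME_MS = 20         # frame duration in milliseconds; assumed for illustration


def float_to_pcm16(samples: List[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_frames(pcm: bytes, sample_rate: int = SAMPLE_RATE,
                frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield fixed-size PCM frames; the last frame may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of silence stands in for microphone input.
one_second = [0.0] * SAMPLE_RATE
pcm = float_to_pcm16(one_second)
frames = list(iter_frames(pcm))
```

With these assumed values, one second of audio becomes 50 frames of 640 bytes each. A batch system would instead wait for the whole utterance before sending anything.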

# Core Components of a Voice Agent Pipeline
**Automatic Speech Recognition (ASR)**: Converts spoken language into text.  
**LLM/NLU**: Understands the input and generates a meaningful response.  
**Text to Speech (TTS)**: Converts the LLM response into speech.  
**Voice Activity Detection (VAD)**: Detects when a user is speaking. This component is optional.
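
The way these components hand data to one another can be sketched as a single conversational turn. This is an illustration only: `VoiceTurn`, `run_turn`, and the stand-in stage functions are hypothetical names, and each real stage would call an actual speech or language service instead of returning canned values.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceTurn:
    transcript: str     # text produced by ASR
    reply_text: str     # text produced by the LLM
    reply_audio: bytes  # audio produced by TTS


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    vad: Optional[Callable[[bytes], bool]] = None,
) -> Optional[VoiceTurn]:
    """Run one user turn: (optional VAD) -> ASR -> LLM -> TTS."""
    if vad is not None and not vad(audio):
        return None  # no speech detected, so skip the expensive stages
    transcript = asr(audio)
    reply_text = llm(transcript)
    return VoiceTurn(transcript, reply_text, tts(reply_text))


# Stand-in stages so the sketch runs without any speech service.
def fake_vad(audio: bytes) -> bool:
    return len(audio) > 0


def fake_asr(audio: bytes) -> str:
    return "hello"


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_turn(b"\x00\x01", fake_asr, fake_llm, fake_tts, vad=fake_vad)
```

Note the types at each boundary: bytes go into ASR, text moves between ASR, LLM, and TTS, and bytes come back out. Those boundaries are where format conversions happen in a real pipeline.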



Implementing effective end-to-end (e2e) conversational systems is a major challenge.  
Applications like voice assistants are nondeterministic in nature, and the multi-component architecture above introduces problems such as latency.
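
Because every stage adds delay, it helps to measure each one separately. The following is a minimal sketch of per-stage timing using only the standard library. The helpers `timed` and `summarize` are hypothetical names, and the millisecond values in `example_ms` are made-up placeholders, not measurements from any real system.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def summarize(stage_ms: Dict[str, float]) -> Tuple[float, str]:
    """Return the total latency and the name of the slowest stage."""
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return total, slowest


# Timing a trivial stand-in stage.
text, elapsed_ms = timed(lambda: "hello".upper())

# Placeholder numbers purely for illustration.
example_ms = {"asr": 120.0, "llm": 300.0, "tts": 80.0}
total_ms, slowest_stage = summarize(example_ms)
```

In a sequential pipeline the per-stage times add up, which is one reason streaming each stage's output into the next matters for perceived responsiveness.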

In [None]:
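# This cell builds the *smallest* possible Pipecat pipeline:
#   TTSSpeakFrame  ─▶  RivaTTSService  ─▶  LocalAudioTransport
#
# When you run it, you should hear the configured voice speak
# the message defined in `message`.

import asyncio
import os
import sys

import nest_asyncio

# Pipecat imports. These module paths are an assumption: they differ between
# Pipecat releases, so adjust them to match your installed version.
from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.riva.tts import RivaTTSService
from pipecat.transports.local.audio import LocalAudioTransport, LocalAudioTransportParams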

# 1)  Configure the Riva TTS processor
# ------------------------------------
tts = RivaTTSService(
    api_key=os.getenv("NVIDIA_API_KEY"),  # read the API key from the environment
    voice_id="English-US.Female-1",       # voice used for synthesis
)

# 2)  Editable message (rerun the cell after you change it)
# ---------------------------------------------------------
message = "Hello there, how is it going!"

# 3)  Async driver
# ----------------
async def main():
    # LocalAudioTransport plays the raw audio directly on your machine.
    transport = LocalAudioTransport(LocalAudioTransportParams(audio_out_enabled=True))

    # Build pipeline: [Riva TTS] → [Audio output]
    pipeline = Pipeline([tts, transport.output()])  # `tts` is the RivaTTSService configured above

    # Wrap in a PipelineTask so the runner can start/stop it
    task = PipelineTask(pipeline)

    # Queue a single speech request, then end the pipeline.
    async def say_something():
        # Small delay so the pipeline is fully up before we push frames
        await asyncio.sleep(1)

        await task.queue_frames([
            TTSSpeakFrame(message),  # trigger TTS
            EndFrame(),              # signal that the pipeline is done
        ])

    # PipelineRunner handles the event loop & graceful shutdown.
    # SIGINT handling is turned off on Windows.
    runner = PipelineRunner(handle_sigint=sys.platform != "win32")

    # Run the pipeline and our helper in parallel
    await asyncio.gather(runner.run(task), say_something())

# 4)  Jupyter already runs an event loop, so 'nest_asyncio' is applied to allow
#     nested loops. The top-level `await` below only works inside a notebook cell.
if __name__ == "__main__":
    nest_asyncio.apply()
    await main()

# Conclusion
A real-time voice agent for digital humans requires careful coordination of VAD, ASR, LLM, and TTS. Each step must be optimized for latency, accuracy, and compatibility with the stages around it. In this module, we introduced the essential technical building blocks. In the next module, we’ll apply them to interactive prompting and memory-enabled agents.