# NVIDIA-Pipecat Automatic Speech Recognition Basics

The RivaASRService provides streaming speech recognition using NVIDIA’s Riva ASR models. It supports real-time transcription with interim results and interruption handling.

## Setup and Prerequisites
Before running this notebook, make sure you have:
- An NVIDIA API key for accessing cloud-hosted models via NVCF: [build.nvidia.com](https://build.nvidia.com)

## Setup Environment and Import Libraries

In [1]:
# Enter your NVIDIA API key
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Enter your NVIDIA API key:  ········


In [1]:
import asyncio
import nest_asyncio
import os
import io
import sys

import IPython.display as ipd  # provides ipd.Audio, used later to play the sample clip
from dotenv import load_dotenv
from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService
from nvidia_pipecat.services.riva_speech import RivaASRService, RivaTTSService
from pipecat.transports.local.audio import LocalAudioTransport, LocalAudioTransportParams

## Transcription with Riva ASR
**ASR** takes an audio stream or audio buffer as input and returns one or more text transcripts, along with additional optional metadata. Speech recognition in Riva is a GPU-accelerated compute pipeline, with optimized performance and accuracy.  

Riva provides state-of-the-art OOTB (out-of-the-box) models and pipelines for multiple languages, like English, Spanish, German, Russian and Mandarin, that can be easily deployed with nvidia-pipecat.  

Now, let's generate a transcript using Riva ASR Service for a sample audio clip, starting with English.

In [2]:
# Create the Riva ASR (speech-to-text) service
stt = RivaASRService(
    api_key=os.getenv("NVIDIA_API_KEY"),  # NVIDIA API key set above
)

### Offline recognition for English
You can use Riva ASR in either **streaming** mode or **offline** mode. In streaming mode, a continuous stream of audio is captured and recognized, producing a stream of transcribed text.  
In offline mode, an audio clip of a set length is transcribed to text. Riva ASR supports .wav files in pulse-code modulation (PCM) format, as well as .alaw, .mulaw, and .flac formats.
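
To make the difference between the two modes concrete, the sketch below splits a PCM buffer into fixed-duration chunks. A streaming recognizer typically consumes input in this shape, whereas offline mode submits the whole buffer at once. The helper name `iter_pcm_chunks`, the 100 ms chunk size, and the 16 kHz, 16-bit mono format are illustrative assumptions; none of them come from the Riva or nvidia-pipecat API.

```python
from typing import Iterator


def iter_pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16000,   # assumed sample rate in Hz
    sample_width: int = 2,      # assumed bytes per sample (16-bit PCM)
    channels: int = 1,          # assumed mono audio
    chunk_ms: int = 100,        # assumed chunk duration in milliseconds
) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM audio, each about chunk_ms long."""
    bytes_per_frame = sample_width * channels
    frames_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = frames_per_chunk * bytes_per_frame
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for the given sample rate")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


# One second of silence as 16-bit mono PCM at 16 kHz (synthetic data).
one_second = bytes(16000 * 2)

# Offline mode: the whole clip is submitted in one request.
offline_payload = one_second

# Streaming mode: the same audio arrives as a sequence of small chunks.
streaming_chunks = list(iter_pcm_chunks(one_second))
print(len(offline_payload), len(streaming_chunks), len(streaming_chunks[0]))
```

Smaller chunks generally lower the delay before interim results can appear, at the cost of more requests. The 100 ms value here is only a starting point for experimentation.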

Now, let's make a gRPC request to the Riva Speech server for ASR with a sample .wav file in offline mode. Start by loading an English audio clip:

In [3]:
!pwd

/Users/avasquez/Developer/nvidia-pipecat-notebooks/notebooks/1-Foundations of Digital Human Agents


In [4]:
# This example uses a .wav file with LINEAR_PCM encoding.
# Read the audio file from local disk.
path = "./audio_samples/en-Mark_Neutral.wav"
with open(path, "rb") as fh:
    content = fh.read()
ipd.Audio(path)  # play the clip inline

FileNotFoundError: [Errno 2] No such file or directory: 'audio_samples/en-Mark_Neutral.wav'
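
The `FileNotFoundError` above indicates that the sample clip was not found relative to the notebook's working directory. As a minimal sketch, the helper below checks that a file exists and reads its WAV header with Python's standard `wave` module before any recognition request is made. The name `describe_wav` and the synthetic demo clip are assumptions made for illustration; they are not part of nvidia-pipecat.

```python
import tempfile
import wave
from pathlib import Path


def describe_wav(path: str) -> dict:
    """Return basic header fields of a PCM .wav file, or raise if it is missing."""
    wav_path = Path(path)
    if not wav_path.is_file():
        raise FileNotFoundError(f"No audio file at {wav_path.resolve()}")
    with wave.open(str(wav_path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Write half a second of silence as a synthetic 16 kHz, 16-bit mono clip.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(bytes(16000))  # 8000 frames * 2 bytes each
    info = describe_wav(demo_path)

print(info)
```

Calling `describe_wav` on the notebook's own sample path would report the resolved absolute path in the error message, which makes a wrong working directory easier to spot.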

In [6]:
async def main():
    transport = LocalAudioTransport(LocalAudioTransportParams(audio_out_enabled=True))

    pipeline = Pipeline([stt, transport.output()])  # The Riva service defined above feeds the local audio output

    task = PipelineTask(pipeline)

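    async def say_something():
        await asyncio.sleep(1)
        # Hypothetical sample text: `message` is not defined anywhere in the original notebook.
        message = "Hello from NVIDIA Riva."
        await task.queue_frames([TTSSpeakFrame(message), EndFrame()])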

    runner = PipelineRunner(handle_sigint=sys.platform != "win32")

    await asyncio.gather(runner.run(task), say_something())


if __name__ == "__main__":
    nest_asyncio.apply()
    await main()

2025-05-01 11:00:04.749 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#0 -> RivaTTSService#0
2025-05-01 11:00:04.750 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking RivaTTSService#0 -> LocalAudioOutputTransport#0
2025-05-01 11:00:04.750 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking LocalAudioOutputTransport#0 -> PipelineSink#0
2025-05-01 11:00:04.751 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking PipelineTaskSource#0 -> Pipeline#0
2025-05-01 11:00:04.751 | DEBUG    | pipecat.processors.frame_processor:link:177 - Linking Pipeline#0 -> PipelineTaskSink#0
2025-05-01 11:00:04.751 | 