# NVIDIA Parakeet ASR Model Deployment on Amazon SageMaker AI using PyTorch container

## Introduction

This notebook demonstrates how to deploy the [**NVIDIA Parakeet TDT 0.6B v2**](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) model for Automatic Speech Recognition (ASR) tasks using Amazon SageMaker with both real-time and asynchronous inference capabilities.

### About NVIDIA Parakeet TDT 0.6B v2

The **Parakeet TDT (Token-and-Duration Transducer) 0.6B v2** is a neural speech recognition model developed by NVIDIA:

- **Architecture**: FastConformer encoder combined with a TDT (Token-and-Duration Transducer) decoder
- **Model Size**: 600 million parameters, balancing accuracy and efficiency
- **Performance**: Excellent accuracy on diverse speech datasets with low latency
- **Language Support**: Primarily English, with robust performance across accents
- **Optimization**: Built with NVIDIA NeMo framework for production deployment

### SageMaker Asynchronous Inference Benefits

**Asynchronous inference** is particularly well-suited for ASR workloads:

1. **Long Processing Times**: Supports requests whose processing takes anywhere from several seconds up to one hour
2. **Variable Input Sizes**: Audio files range from seconds to hours in duration
3. **Managed Queuing**: Incoming requests are queued by the endpoint, so many audio files can be submitted at once
4. **Cost Optimization**: Pay only for actual processing time, with automatic scaling to zero
5. **Large File Support**: Handle audio files up to 1 GB in size
6. **Event-Driven Pipelines**: Async endpoints can publish notifications to Amazon SNS, which helps with building an event-driven pipeline




## Prerequisites and Setup

**❗If you run this notebook in SageMaker Studio, consider building the custom Docker container with an alternative service. This notebook was tested on a SageMaker notebook instance of type ml.g5.2xlarge with a 100 GB EBS volume.**

### Required Dependencies

In [None]:
# Install required packages
%pip install datasets==2.14.7
%pip install sagemaker==2.246.0 huggingface_hub
%pip install librosa -q
%pip install soundfile -q
%pip install libcst==1.1.0
%pip install --upgrade ml_dtypes

Also install the NeMo toolkit for ASR. The installation takes a few minutes to finish.

In [None]:
%pip install nemo_toolkit['asr']

**❗Please restart the kernel before executing the cells below.**

In [None]:
# Import required packages
import sagemaker
import time
import json
import boto3
import soundfile as sf
from datasets import load_dataset
import nemo.collections.asr as nemo_asr

from botocore.exceptions import ClientError

## Configuration Setup

Initialize the basic configuration for SageMaker deployment:

- **SageMaker Session**: Manages interactions with SageMaker services
- **S3 Bucket**: Default bucket for storing model artifacts and data
- **IAM Role**: Execution role with necessary permissions for SageMaker
- **S3 Prefixes**: Organized folder structure for model artifacts
- **Runtime Client**: For invoking endpoints after deployment

In [None]:
# Basic configurations
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
prefix = 'parakeet-asr'
role = sagemaker.get_execution_role()
s3_model_prefix = (
    "hf-asr-models/nvidia-asr"  # folder within bucket where code artifact will go
)
# boto3 client used for invoking the asynchronous endpoint
sm_runtime = boto3.client("sagemaker-runtime")
region = sess.boto_region_name
account_id = sess.account_id()

## Model Preparation and Testing

### Local Model Testing

Before deploying to SageMaker, we'll first test the Parakeet model locally to ensure it works correctly. This step helps validate:

- Model loading and initialization
- Audio file processing capabilities

### Download Model from HuggingFace Hub

Download the NVIDIA Parakeet model from HuggingFace Hub for SageMaker deployment:

- **Model Repository**: `nvidia/parakeet-tdt-0.6b-v2`
- **File Filtering**: Only download necessary files (`*.json`, `*.safetensors`, `*.nemo`)
- **Local Caching**: Store model files locally for packaging
- **LFS Support**: Handle large model files using Git LFS

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

parakeet = "nvidia/parakeet-tdt-0.6b-v2"

# Download the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = parakeet
# Only download config, safetensors, and NeMo checkpoint files
allow_patterns = ["*.json", "*.safetensors", "*.nemo"]

# Use snapshot_download to fetch the model, since the files are stored in the repository using Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
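
The `allow_patterns` filter uses shell-style globs. The sketch below illustrates how such patterns select files, using the standard-library `fnmatch` module; the file listing is hypothetical and only for illustration, and the exact matching rules inside `huggingface_hub` are assumed to behave the same way for these simple patterns.

```python
from fnmatch import fnmatch

# Same patterns as in the download cell above.
allow_patterns = ["*.json", "*.safetensors", "*.nemo"]


def is_allowed(filename, patterns):
    """Return True if the filename matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical repository listing, for illustration only.
repo_files = ["README.md", "config.json", "parakeet-tdt-0.6b-v2.nemo", ".gitattributes"]
selected = [name for name in repo_files if is_allowed(name, allow_patterns)]
print(selected)
```

Files that match none of the patterns, such as the README, are skipped, which keeps the local download small.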

In [None]:
model = nemo_asr.models.ASRModel.restore_from(restore_path=model_download_path + "/parakeet-tdt-0.6b-v2.nemo")

In [None]:
output = model.transcribe(['../data/test.wav'])
print(output[0].text)
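
Before transcribing, it can help to confirm what the audio file actually contains. The following is a minimal sketch that uses only the standard-library `wave` module, so it works for uncompressed PCM WAV files only. The 16 kHz mono expectation is an assumption taken from typical ASR setups; check the model card for the authoritative input requirements. The helper names are hypothetical.

```python
import wave


def describe_wav(path):
    """Return sample rate, channel count, and duration in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / float(sample_rate)
    return {"sample_rate": sample_rate, "channels": channels, "duration_s": duration}


def looks_compatible(info, expected_rate=16000, expected_channels=1):
    """Assumption: the model expects 16 kHz mono audio; adjust if the model card differs."""
    return info["sample_rate"] == expected_rate and info["channels"] == expected_channels
```

For example, `looks_compatible(describe_wav('../data/test.wav'))` returning `False` would suggest resampling or downmixing the file (for instance with `librosa` or `ffmpeg`) before sending it to the model.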

### Package Model for SageMaker

Create a compressed archive of the model files for SageMaker deployment. The cell below might take a few minutes to finish.

In [None]:
!cp $model_download_path/parakeet-tdt-0.6b-v2.nemo parakeet-tdt-0.6b-v2.nemo
!tar -cvzf model.tar.gz parakeet-tdt-0.6b-v2.nemo
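
If shell commands are not available, the same archive can be produced in Python. This is a sketch using the standard-library `tarfile` module; the function names are hypothetical, and it assumes the model file should sit at the top level of `model.tar.gz`, as it does with the `tar` command above.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_file, archive_path="model.tar.gz"):
    """Create a gzip-compressed tarball with the model file at the archive root."""
    model_file = Path(model_file)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname drops the directory part so the file sits at the top level.
        archive.add(model_file, arcname=model_file.name)
    return archive_path


def list_archive(archive_path):
    """Return the member names, useful as a sanity check before uploading."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Calling `list_archive("model.tar.gz")` before the upload is a quick way to confirm that the `.nemo` file is present and not nested inside an unexpected directory.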

In [None]:
# Upload the model to S3
model_uri = sess.upload_data('model.tar.gz', bucket=bucket, key_prefix=s3_model_prefix)
!rm model.tar.gz
!rm parakeet-tdt-0.6b-v2.nemo
model_uri

### Create Custom Docker Image by extending from AWS prebuilt PyTorch container

Build a custom Docker image optimized for NVIDIA Parakeet ASR model:

**Base Image**: PyTorch inference container with GPU support
- `pytorch-inference:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker`

**System Dependencies**:
- **ffmpeg**: Audio format conversion and processing
- **libsndfile1**: Audio file I/O library

**Python Dependencies**:
- **nemo_toolkit[asr]**: NVIDIA NeMo framework for ASR
- **ffmpeg-python**: Python wrapper for ffmpeg
- **soundfile**: Audio file reading/writing

Note that the notebook was tested in `us-west-2`. If you are using another region, look up the prebuilt image URI for that region in the list of [available images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

In [None]:
%%writefile Dockerfile
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install nemo_toolkit[asr] ffmpeg-python soundfile cuda-python



### Configuring Docker Storage on Amazon SageMaker Notebook Instances
Amazon SageMaker notebook instances come with a 5 GB Amazon EBS storage volume by default, but Docker stores images and containers on the instance's root volume. The root volume has limited disk space, which can run out quickly when building multiple Docker images and cause "no space left on device" errors.

To work around this limitation, you can redirect Docker to the larger EBS volume by modifying the Docker daemon configuration. Note that the EBS volume attached to the SageMaker notebook instance is mounted at `/home/ec2-user/SageMaker`, whereas a new terminal opens in `/home/ec2-user`.

From the terminal, you can stop the Docker service and edit the Docker daemon configuration.

```
# Stop the Docker service
sudo systemctl stop docker

# Create the new Docker data directory on the EBS volume
sudo mkdir -p /home/ec2-user/SageMaker/docker

# Create or edit the Docker daemon configuration file
sudo nano /etc/docker/daemon.json
```

Add the following configuration to `/etc/docker/daemon.json`:
```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "data-root": "/home/ec2-user/SageMaker/docker"
}
```
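
If `daemon.json` already contains other settings, the `data-root` key can be merged in rather than overwriting the file. The sketch below shows the merge on an in-memory dictionary only; the helper name is hypothetical, and reading or writing the real file (which needs `sudo`) is left out on purpose.

```python
import json


def with_data_root(existing_config, data_root):
    """Return a copy of a Docker daemon config dict with `data-root` set."""
    updated = dict(existing_config)  # shallow copy; other keys are kept as-is
    updated["data-root"] = data_root
    return updated


# Example: a config that already declares the NVIDIA runtime.
current = {"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}
merged = with_data_root(current, "/home/ec2-user/SageMaker/docker")
print(json.dumps(merged, indent=4))
```

The printed JSON can then be pasted into the editor opened in the previous step.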

Restart the Docker service:
```
# Start the Docker service with new configuration
sudo systemctl start docker

# Verify Docker is running with correct data directory
docker info | grep "Docker Root Dir"
```

### Build and Push Docker Image to ECR

Build the custom Docker image and push it to Amazon Elastic Container Registry (ECR):

**Process Overview**:
1. **ECR Repository**: Create or verify ECR repository exists
2. **Docker Build**: Build the custom image with all dependencies
3. **Authentication**: Login to ECR using AWS credentials
4. **Tag and Push**: Tag the image and push to ECR

**Repository Naming**: `asr-sagemaker` for easy identification

Note that the IAM role must have permission to access the ECR service.

In [None]:
%%sh
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPOSITORY_NAME=asr-sagemaker

# login to access the base image from the prebuilt images
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin 763104351884.dkr.ecr.$REGION.amazonaws.com

# Create ECR repository if needed
if aws ecr describe-repositories --repository-names "${REPOSITORY_NAME}" >/dev/null 2>&1; then
    echo "Repository ${REPOSITORY_NAME} already exists"
else
    echo "Creating ECR repository ${REPOSITORY_NAME}..."
    aws ecr create-repository \
        --repository-name "${REPOSITORY_NAME}" \
        --region "${REGION}"
fi

# Build the Docker image and push it to the ECR repository
docker build -t asr-sagemaker .
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker tag asr-sagemaker:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

### Create PyTorch Model Object

Create a SageMaker PyTorchModel with environment variables configured for async workloads:

**Key Configuration Parameters**:
- **SAGEMAKER_MODEL_SERVER_WORKERS**: number of TorchServe workers; each worker loads its own copy of the model into GPU memory
- **TS_DEFAULT_RESPONSE_TIMEOUT**: response timeout, in seconds, for a TorchServe worker; increase it when processing long audio files
- **TS_MAX_REQUEST_SIZE**: maximum request size in bytes, set to 1 GB (1073741824) for the async endpoint
- **TS_MAX_RESPONSE_SIZE**: maximum response size in bytes
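
The size limits are plain byte counts passed as strings. As a small sketch, the helper below derives them from human-readable values so that the numbers are easier to audit; the function names are hypothetical, and the variable names are the ones listed above.

```python
def gib_to_bytes(gib):
    """Convert gibibytes to a byte count (1 GiB = 1024**3 bytes)."""
    return int(gib * 1024 ** 3)


def build_torchserve_env(workers, max_payload_gib, response_timeout_s):
    """Build the environment dict; every value is passed as a string."""
    payload_bytes = str(gib_to_bytes(max_payload_gib))
    return {
        "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
        "TS_MAX_REQUEST_SIZE": payload_bytes,
        "TS_MAX_RESPONSE_SIZE": payload_bytes,
        "TS_DEFAULT_RESPONSE_TIMEOUT": str(response_timeout_s),
    }


env = build_torchserve_env(workers=2, max_payload_gib=1, response_timeout_s=300)
print(env)
```

With these arguments the result matches the values used in this notebook, and the dictionary could be passed as the `env` argument of `PyTorchModel`.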
  
**Session Selection**: Switch between local testing and cloud deployment.

SageMaker local mode is a feature of the SageMaker Python SDK that lets you create estimators and run training, processing, and inference jobs locally in Docker containers instead of on managed AWS infrastructure. It provides a fast way to test and debug your machine learning scripts before scaling to production. You can see more examples in this [GitHub repo](https://github.com/aws-samples/amazon-sagemaker-local-mode).

In [None]:
# Generate a unique model name and provide image uri

id = int(time.time())
model_name = f'parakeet-model-{id}'

# The image URI is built from the current account and region, so it must match the region where the image was pushed (e.g. us-east-1)
image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/asr-sagemaker:latest"

In [None]:
from sagemaker.local import LocalSession

local_session = LocalSession()
local_session.config = {'local': {'local_code': True}}

In [None]:
# Create a PyTorchModel for deployment
from sagemaker.pytorch.model import PyTorchModel

parakeet_model = PyTorchModel(
    entry_point="inference.py",
    source_dir="code",
    model_data=model_uri,
    image_uri=image,
    role=role,
    name=model_name,
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2",
         "TS_MAX_REQUEST_SIZE": "1073741824",
         "TS_MAX_RESPONSE_SIZE": "1073741824",
         "TS_DEFAULT_RESPONSE_TIMEOUT": "300"
        },
    # sagemaker_session=local_session # used for local test
    sagemaker_session=sess  # used for actual endpoint
)

### Real-time inference 

Set up data serialization for audio file processing

In [None]:
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

# Define serializers and deserializer
audio_serializer = DataSerializer(content_type="audio/x-audio")
deserializer = JSONDeserializer()

### Deploy Real-time Endpoint

Deploy the model as a real-time inference endpoint. If you created the model object with the local SageMaker session, change `instance_type` to `local_gpu` to test the endpoint quickly on the SageMaker notebook instance itself. If you plan to deploy the model to an async endpoint, make sure the PyTorchModel object is created with the actual SageMaker session, which is `sess` in this notebook.

In [None]:
# Deploy the model for real-time inference locally or remotely
endpoint_name = f'parakeet-real-time-endpoint-{id}'

real_time_predictor = parakeet_model.deploy(
    initial_instance_count=1,  # number of instances
    # instance_type="local_gpu",
    instance_type="ml.g5.xlarge",  # instance type
    endpoint_name=endpoint_name,
    serializer=audio_serializer,
    deserializer=deserializer
)


### Test Real-time Inference

Test the deployed real-time endpoint with a sample audio file:

- **Input**: Audio file path (automatically serialized)
- **Processing**: Synchronous transcription
- **Output**: JSON response with transcription results

In [None]:
import json
# Perform real-time inference
audio_path = "../data/test.wav"
response = real_time_predictor.predict(data=audio_path)
print(response[0])


Run the cell below to delete the endpoint once you finish testing.

In [None]:
## optional: Delete real-time inference endpoint
real_time_predictor.delete_endpoint()


## Asynchronous Inference Deployment

Set up asynchronous inference configuration with comprehensive monitoring:

**Configuration Components**:
- **Output Path**: S3 location for storing transcription results
- **Concurrency**: Maximum concurrent invocations per instance (set to 4 in this example; tune it for your GPU and audio lengths)
- **SNS Notifications**: Real-time alerts for job completion status
- **Failure Path**: Separate S3 location for failed job artifacts

**SNS Topics**: Configure separate topics for success and failure notifications


### Create SNS Topics for Async Notifications

Create SNS topics for monitoring asynchronous inference job status:

- **Success Topic**: Receives notifications when transcription jobs complete successfully
- **Error Topic**: Receives notifications when transcription jobs fail
- **Automatic Creation**: Creates topics if they don't exist, reuses if they do
- **Subscription Ready**: Topics are ready for email, SMS, or Lambda subscriptions

In [None]:
print(f"Account ID: {account_id}")
print(f"Region: {region}")

# Initialize SNS client
sns_client = boto3.client('sns')

def create_sns_topic_if_not_exists(topic_name, description):
    """Create SNS topic if it doesn't exist, return the ARN"""
    try:
        # Try to create the topic (idempotent operation)
        response = sns_client.create_topic(Name=topic_name)
        topic_arn = response['TopicArn']
        
        # Set topic attributes for better identification
        sns_client.set_topic_attributes(
            TopicArn=topic_arn,
            AttributeName='DisplayName',
            AttributeValue=description
        )
        
        print(f"✅ Topic '{topic_name}' ready: {topic_arn}")
        return topic_arn
        
    except ClientError as e:
        print(f"❌ Error creating topic '{topic_name}': {e}")
        raise

# Create success topic
success_topic_name = "async-success"
success_description = "SageMaker Async Inference Success Notifications"
success_topic_arn = create_sns_topic_if_not_exists(success_topic_name, success_description)

# Create error topic  
error_topic_name = "async-failed"
error_description = "SageMaker Async Inference Error Notifications"
error_topic_arn = create_sns_topic_if_not_exists(error_topic_name, error_description)

print("\n📧 SNS Topics Created Successfully:")
print(f"Success Topic ARN: {success_topic_arn}")
print(f"Error Topic ARN: {error_topic_arn}")

print("\n🔧 Topics are ready for AsyncInferenceConfig!")


In [None]:
%%time
from sagemaker.async_inference import AsyncInferenceConfig

# Create an AsyncInferenceConfig object
async_config = AsyncInferenceConfig(
    output_path=f"s3://{bucket}/{prefix}/output",
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": success_topic_arn,
        "ErrorTopic": error_topic_arn,
    },  # SNS topics created in the previous cell
    failure_path=f"s3://{bucket}/{prefix}/failed",
)

# Deploy the model for async inference
endpoint_name = f'parakeet-async-endpoint-{id}'
async_predictor = parakeet_model.deploy(
    async_inference_config=async_config,
    initial_instance_count=1, # number of instances
    instance_type ='ml.g5.xlarge', # instance type
    endpoint_name = endpoint_name
)

Upload the test audio file to S3 for asynchronous processing:

In [None]:
import os

def upload_to_s3(s3_client, file_path, bucket_name, s3_key):
    """Upload a local file to S3; return True on success, False otherwise."""
    try:
        s3_client.upload_file(file_path, bucket_name, s3_key)
        return True
    except Exception as e:
        print(f"Error uploading {s3_key}: {e}")
        return False

s3_client = boto3.client('s3')
audio_path = "../data/test_audio.wav"
# Use only the file name in the key so the relative "../" does not end up in the S3 path
s3_key = f"{prefix}/data/{os.path.basename(audio_path)}"
upload_to_s3(s3_client, audio_path, bucket, s3_key)

In [None]:
# Provide the S3 path for the audio file you want to process

input_path = f"s3://{bucket}/{s3_key}"
input_path
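
The input location is an S3 URI of the form `s3://<bucket>/<key>`. As a small sketch, the hypothetical helper below splits such a URI into its bucket and key using only the standard library, which can be handy for validating paths before submitting a request.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, for illustration only.
example_bucket, example_key = split_s3_uri("s3://my-bucket/parakeet-asr/data/test_audio.wav")
print(example_bucket, example_key)
```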

In [None]:
# Perform async inference
initial_args = {'ContentType':"audio/x-audio"}
response = async_predictor.predict_async(initial_args = initial_args, input_path=input_path)
response.output_path



Monitor and retrieve the results from asynchronous processing:

In [None]:
import time
import urllib.parse
from botocore.exceptions import ClientError

def get_output(output_location):
    """Poll S3 until the async result object exists, then return its contents."""
    output_url = urllib.parse.urlparse(output_location)
    output_bucket = output_url.netloc
    output_key = output_url.path[1:]
    while True:
        try:
            return sess.read_s3_file(
                        bucket=output_bucket,
                        key_prefix=output_key)
        except ClientError as e:
            if e.response['Error']['Code'] == 'NoSuchKey':
                print("waiting for output...")
                time.sleep(2)
                continue
            raise
            
output = get_output(response.output_path)
print(f"Output: {output}")
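
The polling loop above waits indefinitely, which can hang if an invocation fails and its result is written to the failure path instead of the output path. The sketch below shows one way to bound the wait with a deadline. The function and parameter names are hypothetical, and the fetch logic is passed in as a callable so that the helper stays independent of S3.

```python
import time


def poll_until_ready(fetch, is_not_ready, timeout_s=600.0, interval_s=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or timeout_s elapses.

    fetch: zero-argument callable returning the result.
    is_not_ready: callable(exception) -> True if the error means "try again later".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fetch()
        except Exception as error:
            if not is_not_ready(error):
                raise  # unrelated error: surface it immediately
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no result after {timeout_s} seconds") from error
            sleep(interval_s)
```

In this notebook, `fetch` could wrap the `sess.read_s3_file` call, and `is_not_ready` could check for a `ClientError` whose error code is `NoSuchKey`, mirroring the loop above.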

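### Optional: Auto-scaling Configuration for Asynchronous Inference

Auto-scaling for asynchronous endpoints supports scaling down to zero instances, which saves cost when no requests are pending. Note that once the endpoint is at zero instances, AWS documentation describes an additional scaling policy based on the `HasBacklogWithoutCapacity` metric to scale back out when new requests arrive.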

In [None]:
autoscale = boto3.client('application-autoscaling') 
resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic'

# Register scalable target
register_response = autoscale.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,  
    MaxCapacity=3 # * check how many instances available in your account
)

# Define scaling policy
scalingPolicy_response = autoscale.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', # The namespace of the AWS service that provides the resource. 
    ResourceId=resource_id,  
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # SageMaker supports only Instance Count
    PolicyType='TargetTrackingScaling', # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 3.0, # The target value for the metric. This needs to be setup based on load testing
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name }
            ],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 60, # The cooldown period helps you prevent your Auto Scaling group from launching or terminating 
                                # additional instances before the effects of previous activities are visible. 
                                # You can configure the length of time based on your instance startup time or other application needs.
                                # ScaleInCooldown - The amount of time, in seconds, after a scale in activity completes before another scale in activity can start. 
        'ScaleOutCooldown': 60 # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled. 
                            # If the value is true , scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    }
)

scalingPolicy_response
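
To build intuition for the `TargetValue` setting, the sketch below estimates how many instances would be needed to keep the backlog per instance at or below the target. This is a simplified mental model only, not how Application Auto Scaling computes its decisions; the real service also relies on CloudWatch alarms and the cooldown periods. The function name is hypothetical.

```python
import math


def estimated_instance_count(backlog, target_per_instance, min_capacity, max_capacity):
    """Rough estimate: enough instances to bring backlog-per-instance down to the target."""
    if backlog <= 0:
        return min_capacity
    wanted = math.ceil(backlog / target_per_instance)
    # Clamp to the registered scalable target range.
    return max(min_capacity, min(max_capacity, wanted))


print(estimated_instance_count(backlog=4, target_per_instance=3.0, min_capacity=0, max_capacity=3))
```

With a target of 3.0 and the capacity range registered above, any sizeable backlog saturates at the maximum of three instances, so load testing is what determines a sensible target.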

In [None]:
# Trigger 100 asynchronous invocations so that autoscaling scales out from 1 to 3 instances,
# then scales in to 0 once the backlog is processed

print(endpoint_name)
for i in range(100):
    response = sm_runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_path,
        ContentType="audio/x-audio",
    )
    
print("\nAsync invocations for PyTorch serving with autoscaling\n")

### Clean up

In [None]:
# Delete Asynchronous inference endpoint
async_predictor.delete_endpoint()