# NVIDIA NIM ASR Model Deployment on Amazon SageMaker AI using BYOC (Bring Your Own Container)

## Introduction

This notebook demonstrates how to deploy the **NVIDIA NIM ASR (Parakeet 1.1B CTC EN-US)** model for Automatic Speech Recognition (ASR) tasks using Amazon SageMaker with a custom container that supports both HTTP and gRPC protocols.

### About NVIDIA NIM ASR

**NVIDIA NIM ASR** provides a production-ready speech recognition service:

- **Architecture**: Combined HTTP + gRPC routing to NVIDIA NIM ASR container
- **Model**: Parakeet 1.1B CTC EN-US optimized for streaming ASR
- **Performance**: Excellent accuracy with intelligent routing between protocols
- **Features**: Supports speaker diarization, variable file sizes, and automatic protocol selection
- **Deployment**: Ready for SageMaker real-time inference

### Key Features

1. **Dual Protocol Support**: Auto-routing between HTTP and gRPC based on file size and features
2. **Speaker Diarization**: Advanced speaker separation capabilities via gRPC
3. **Production Ready**: Built with NVIDIA NIM for enterprise deployment

## Prerequisites and Setup

**❗ Important Notes:**
- This notebook requires Docker for building custom containers
- You need an **NGC_API_KEY** from NVIDIA NGC (export as environment variable)
- ECR permissions are required for pushing custom Docker images
- Recommended instance: ml.g5.xlarge or larger (the endpoint configuration in this notebook uses ml.g5.4xlarge)
- Default region: us-east-1 (adjust ECR URIs if using different region)

### Required Dependencies and Environment Setup

In [None]:
# Install required packages
# Quote the version specifier so the shell does not treat ">=" as a redirect
!pip install "sagemaker>=2.246.0" boto3


In [None]:
!conda install -c conda-forge librosa -y

In [None]:
!pip install soundfile python-multipart

In [None]:
# Import required packages
import sagemaker
import time
import json
import boto3
import os
import base64
import uuid
from botocore.exceptions import ClientError
import subprocess
import tempfile

## Configuration Setup

Initialize the basic configuration for SageMaker deployment:


In [None]:
# Basic configurations
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
account_id = sess.account_id()

# Runtime clients
sm_runtime = boto3.client("sagemaker-runtime")
ecr_client = boto3.client('ecr')
sts_client = boto3.client('sts')

print(f"Region: {region}")
print(f"Account ID: {account_id}")
print(f"S3 Bucket: {bucket}")
print(f"IAM Role: {role}")

### Environment Variables Setup

**⚠️ IMPORTANT**: You must set your NGC_API_KEY environment variable before proceeding.

Get your NGC API Key from [NVIDIA NGC](https://ngc.nvidia.com/) and set it as an environment variable:

```bash
export NGC_API_KEY="your_ngc_api_key_here"
```

Or set it in this notebook session:

In [None]:
# Check NGC API Key
ngc_api_key = os.getenv('NGC_API_KEY')
if not ngc_api_key:
    print("❌ NGC_API_KEY not found in environment variables")
    print("Please set NGC_API_KEY before continuing:")
    print("export NGC_API_KEY='your_api_key_here'")
    # Uncomment the line below and add your key directly if needed
    # os.environ['NGC_API_KEY'] = 'your_api_key_here'
else:
    print("✅ NGC_API_KEY found and configured")
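
If the key is not already exported, one option is to prompt for it without echoing it to the notebook output. The following is a minimal sketch, not part of the original notebook: `resolve_ngc_api_key` is a hypothetical helper name, and the `env` and `prompt` parameters exist only so the function can be exercised without a real terminal.

```python
import os
from getpass import getpass


def resolve_ngc_api_key(env=os.environ, prompt=getpass):
    """Return the NGC API key from the environment, prompting if it is missing.

    Hypothetical helper: `env` and `prompt` are injectable for testing.
    """
    key = env.get("NGC_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key is not stored in cell output
        key = prompt("NGC API key: ").strip()
    if not key:
        raise ValueError("NGC_API_KEY is required to pull the NIM base image")
    return key
```

A call such as `ngc_api_key = resolve_ngc_api_key()` could then replace the manual check, assuming an interactive session is available.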

## Custom Container Development

### Understanding the NIM BYOC Architecture

Our custom container combines:

1. **Base NIM Container**: `nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us:latest`
2. **SageMaker Adapter**: Python server that handles SageMaker inference protocol
3. **Dual Protocol Support**: Routes requests to either HTTP or gRPC based on requirements
4. **Intelligent Routing**: Automatically selects optimal protocol for each request
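
The routing decision described above can be sketched as a small pure function. This is an illustration only, not the adapter's actual code: `choose_protocol` and `HTTP_MAX_BYTES` are hypothetical names, and the 5 MB threshold is an assumption based on the notebook's statement that the HTTP route is optimized for files under 5 MB.

```python
# Assumed threshold: the notebook states the HTTP route is optimized for files under 5 MB.
HTTP_MAX_BYTES = 5 * 1024 * 1024


def choose_protocol(file_size_bytes: int, speaker_diarization: bool = False) -> str:
    """Return 'grpc' or 'http' for a request (hypothetical routing rule)."""
    if speaker_diarization:
        # Diarization is only offered through the gRPC route
        return "grpc"
    if file_size_bytes >= HTTP_MAX_BYTES:
        # Larger files go to gRPC
        return "grpc"
    return "http"
```

Under these assumptions, a 1 KB file without diarization goes to HTTP, while a 10 MB file or any diarization request goes to gRPC.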

### Container Files Overview

Let's examine the three key files that make up our custom container:

### Dockerfile Analysis


In [None]:
# Display the Dockerfile contents
try:
    with open('Dockerfile', 'r') as f:
        dockerfile_content = f.read()
    print("📄 Dockerfile contents:")
    print("=" * 50)
    print(dockerfile_content)
except FileNotFoundError:
    print("❌ Dockerfile not found in current directory")

### Configuring Docker Storage on Amazon SageMaker Notebook Instances
If you are running this notebook on an Amazon SageMaker notebook instance, configure Docker storage before building the container. Notebook instances come with a 5 GB Amazon EBS storage volume by default, but Docker stores images and containers on the instance's root volume. The root volume has limited space and can fill up quickly when you build multiple Docker images, causing "no space left on device" errors.

To work around this limitation, redirect Docker to the larger EBS volume by changing the Docker daemon configuration. Note that the EBS volume attached to the notebook instance is mounted at `/home/ec2-user/SageMaker`, whereas a new terminal session starts in `/home/ec2-user`.

From the terminal, stop the Docker service and edit the Docker daemon configuration:

```
# Stop the Docker service
sudo systemctl stop docker

# Create the new Docker data directory on the EBS volume
sudo mkdir -p /home/ec2-user/SageMaker/docker

# Create or edit the Docker daemon configuration file
sudo nano /etc/docker/daemon.json
```

Add the following configuration to `/etc/docker/daemon.json`:
```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "data-root": "/home/ec2-user/SageMaker/docker"
}
```

Restart the Docker service:
```
# Start the Docker service with new configuration
sudo systemctl start docker

# Verify Docker is running with correct data directory
docker info | grep "Docker Root Dir"
```
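
If `/etc/docker/daemon.json` already contains other settings, it is safer to merge the `data-root` key than to overwrite the file. The following is a minimal sketch of that merge. `with_data_root` is a hypothetical helper, and writing the result back to `/etc/docker/daemon.json` still requires root privileges, which this sketch does not handle.

```python
import json


def with_data_root(existing_json: str, data_root: str) -> str:
    """Return daemon.json content with "data-root" set, keeping other keys intact."""
    # An empty or missing file is treated as an empty configuration
    config = json.loads(existing_json) if existing_json.strip() else {}
    config["data-root"] = data_root
    return json.dumps(config, indent=4)
```

For example, passing the existing file contents and `"/home/ec2-user/SageMaker/docker"` yields a configuration that keeps the `nvidia` runtime entry and adds the new data directory.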

## Container Build and Push Process

### Build Custom Docker Image

Build a custom Docker image that combines NVIDIA NIM ASR with SageMaker compatibility:


In [None]:
# Build and push Docker image to ECR
repository_name = "nim-sagemaker-asr"
image_tag = "latest"

print(f"Building Docker image: {repository_name}:{image_tag}")
print(f"Region: {region}")
print(f"Account ID: {account_id}")

# ECR repository URI
ecr_repository_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}"
full_image_uri = f"{ecr_repository_uri}:{image_tag}"

print(f"Target ECR URI: {full_image_uri}")

In [None]:
# Create ECR repository if it doesn't exist
def create_ecr_repository(repository_name):
    """Create ECR repository if it doesn't exist"""
    try:
        response = ecr_client.create_repository(
            repositoryName=repository_name,
            imageScanningConfiguration={'scanOnPush': True}
        )
        print(f"✅ Created ECR repository: {repository_name}")
        return response['repository']['repositoryUri']
    except ClientError as e:
        if e.response['Error']['Code'] == 'RepositoryAlreadyExistsException':
            print(f"✅ ECR repository already exists: {repository_name}")
            return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}"
        else:
            print(f"❌ Error creating repository: {e}")
            raise

# Create ECR repository
repository_uri = create_ecr_repository(repository_name)
print(f"🎯 Repository URI: {repository_uri}")

### Docker Build Process

Now we'll build the Docker image. This process includes:

1. **Base Image Authentication**: Login to NVIDIA NGC to pull base NIM container
2. **Docker Build**: Build our custom image with SageMaker integration
3. **ECR Authentication**: Login to Amazon ECR for pushing
4. **Image Push**: Upload the built image to ECR

**⚠️ Note**: This process may take 10-15 minutes.
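
For reference, the ECR authentication step can also be driven from Python. The token returned by the ECR `get_authorization_token` API is a base64 encoding of `AWS:<password>`, and the password part is what `docker login` expects. The sketch below only decodes such a token. `decode_ecr_token` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
import base64


def decode_ecr_token(authorization_token: str):
    """Split a base64 ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # The token has the form "AWS:<password>"; split on the first colon only
    username, _, password = decoded.partition(":")
    return username, password
```

The token itself would come from `ecr_client.get_authorization_token()["authorizationData"][0]["authorizationToken"]`.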

In [None]:
%%bash -s "$repository_name" "$region" "$account_id" "$ngc_api_key"

# Parameters from notebook
REPOSITORY_NAME=$1
REGION=$2
ACCOUNT_ID=$3
NGC_API_KEY=$4

echo "Starting Docker build process..."
echo "Repository: $REPOSITORY_NAME"
echo "Region: $REGION"
echo "Account: $ACCOUNT_ID"

# Check if NGC_API_KEY is available
if [ -z "$NGC_API_KEY" ]; then
    echo "❌ NGC_API_KEY not provided. Please set it in the previous cell."
    exit 1
fi

# Login to NVIDIA NGC (for base image access)
echo "🔐 Authenticating with NVIDIA NGC..."
# Quote the variables so keys or names with special characters are passed intact
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# Build the Docker image
echo "🏗️  Building Docker image..."
docker build -t "$REPOSITORY_NAME:latest" --build-arg NGC_API_KEY="$NGC_API_KEY" .

if [ $? -eq 0 ]; then
    echo "✅ Docker build completed successfully"
else
    echo "❌ Docker build failed"
    exit 1
fi

# Login to ECR
echo "Authenticating with Amazon ECR..."
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# Tag the image for ECR
echo "Tagging image for ECR..."
docker tag $REPOSITORY_NAME:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

# Push to ECR
echo "Pushing image to ECR..."
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

if [ $? -eq 0 ]; then
    echo "✅ Image pushed successfully to ECR"
    echo "Image URI: $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest"
else
    echo "❌ Failed to push image to ECR"
    exit 1
fi

## SageMaker Model Creation

### Create SageMaker Model with Custom Container

Now we'll create a SageMaker model using our custom NIM container:
- **Container Image**: Our custom ECR image with NIM + SageMaker integration
- **Environment Variables**: NGC_API_KEY for NIM authentication
- **Model Artifacts**: No separate model artifacts needed (included in container)

In [None]:
# Generate unique names for AWS resources
timestamp = int(time.time())
model_name = f'nim-asr-combined-model-{timestamp}'
endpoint_config_name = f'nim-asr-combined-config-{timestamp}'
endpoint_name = f'nim-asr-combined-endpoint-{timestamp}'

# Full ECR image URI
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{image_tag}"

print(f"Model name: {model_name}")
print(f"Endpoint config: {endpoint_config_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Image URI: {image_uri}")

In [None]:
# Create SageMaker model
sagemaker_client = boto3.client('sagemaker')

print(f"Creating SageMaker model: {model_name}")


create_model_response = sagemaker_client.create_model(
    ModelName=model_name,
    PrimaryContainer={
        'Image': image_uri,
        'Environment': {
            'NGC_API_KEY': ngc_api_key,
            'NIM_HOST': '127.0.0.1',
            'NIM_HTTP_PORT': '9000',
            'RIVA_GRPC_PORT': '50051',
            'SAGEMAKER_BIND_TO_PORT': '8080',
            'NIM_TAGS_SELECTOR': 'name=parakeet-1-1b-ctc-en-us,mode=ofl'
        }
    },
    ExecutionRoleArn=role
)
print(f"Model created successfully: {model_name}")


### Create Endpoint Configuration

Configure the SageMaker endpoint with appropriate settings:

**Instance Configuration**:
- **Instance Type**: ml.g5.4xlarge
- **Instance Count**: 1 (autoscaling can be set up if your workload requires it)
- **Timeouts**: Extended timeouts for container startup (2400 seconds)
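
If autoscaling is needed, it is typically configured through Application Auto Scaling after the endpoint is in service. The following sketch only builds the request parameters. `build_autoscaling_requests` is a hypothetical helper, and the capacity limits and the target of 10 invocations per instance are placeholder assumptions, not tuned values.

```python
def build_autoscaling_requests(endpoint_name, variant_name="primary",
                               min_capacity=1, max_capacity=2,
                               invocations_per_instance=10.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Placeholder target; choose a value based on load testing
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy
```

The two dictionaries can be passed to `boto3.client("application-autoscaling")` as `register_scalable_target(**target)` followed by `put_scaling_policy(**policy)`.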

In [None]:
# Create endpoint configuration
print(f"Creating endpoint configuration: {endpoint_config_name}")

create_endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'primary',
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.4xlarge',
            'ContainerStartupHealthCheckTimeoutInSeconds': 2400,
            'ModelDataDownloadTimeoutInSeconds': 2400
        }
    ]
)
print(f"Endpoint configuration created: {endpoint_config_name}")


## SageMaker Endpoint Deployment

### Deploy the Endpoint

Create the SageMaker endpoint with our NIM ASR container:

In [None]:
# Create the endpoint
print(f"Deploying endpoint: {endpoint_name}")
print("This process typically takes 5-10 minutes...")

# Start endpoint creation (the call returns immediately; deployment continues in the background)
create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print(f"Endpoint creation initiated: {endpoint_name}")


### Wait for Endpoint to be Ready

Monitor the endpoint deployment status and wait for it to become "InService":

**What's happening during startup**:
- Container image download and extraction
- NIM services initialization
- Model loading into GPU memory
- HTTP and gRPC service startup
- Health check validation
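
As an alternative to the built-in waiter, the status can be polled explicitly, which makes it easier to surface the failure reason. The following is a minimal sketch under the assumption that `client` behaves like the boto3 SageMaker client. `wait_for_endpoint` is a hypothetical helper, and the `sleep` parameter exists only so the loop can be tested without real delays.

```python
import time


def wait_for_endpoint(client, endpoint_name, delay=60, max_attempts=40, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    for _ in range(max_attempts):
        description = client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return description
        if status == "Failed":
            # FailureReason is only present when SageMaker reports one
            raise RuntimeError(description.get("FailureReason", "endpoint creation failed"))
        sleep(delay)
    raise TimeoutError(f"{endpoint_name} not InService after {max_attempts} checks")
```

With the defaults, this waits up to roughly 40 minutes, matching the waiter configuration used in this notebook.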

In [None]:
# Wait for endpoint to be in service
print(f"Waiting for endpoint {endpoint_name} to be ready...")

waiter = sagemaker_client.get_waiter('endpoint_in_service')

# Check current status first
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
current_status = response['EndpointStatus']
print(f"Current status: {current_status}")

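from botocore.exceptions import WaiterError

if current_status != 'InService':
    print("Waiting for endpoint to reach InService status...")
    try:
        waiter.wait(
            EndpointName=endpoint_name,
            WaiterConfig={
                'Delay': 60,  # Check every 60 seconds
                'MaxAttempts': 40  # Wait up to 40 minutes
            }
        )
    except WaiterError as e:
        # The waiter raises on failure or timeout; the status check below reports details
        print(f"Waiter stopped early: {e}")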

# Verify final status
final_response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
final_status = final_response['EndpointStatus']

if final_status == 'InService':
    print(f"\nEndpoint {endpoint_name} is ready for inference!")
    print(f"Endpoint ARN: {final_response['EndpointArn']}")
    print(f"Creation time: {final_response['CreationTime']}")
else:
    print(f"Endpoint failed to reach InService status: {final_status}")
    
    # Get failure reason if available
    if 'FailureReason' in final_response:
        print(f"Failure reason: {final_response['FailureReason']}")


## Inference Testing and Validation

### Test Audio File Preparation

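Point the tests at a sample audio file for the NIM ASR endpoint:

In [None]:
# Use an existing WAV file for testing
import soundfile as sf

test_audio_path = "../data/test.wav"

# Confirm the file exists and report its basic properties before invoking the endpoint
if os.path.exists(test_audio_path):
    info = sf.info(test_audio_path)
    print(f"Test audio: {info.duration:.1f} s, {info.samplerate} Hz, {info.channels} channel(s)")
else:
    print(f"Test audio file not found: {test_audio_path}")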
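
If no recording is at hand, a synthetic WAV file can be used to check that requests reach the endpoint. The sketch below writes a short sine tone using only the standard library. `write_test_tone` is a hypothetical helper, the 16 kHz mono 16-bit format is an assumption rather than a documented requirement of the container, and a pure tone contains no speech, so an empty transcript would be expected.

```python
import math
import struct
import wave


def write_test_tone(path, seconds=1.0, sample_rate=16000, frequency=440.0):
    """Write a mono 16-bit PCM sine tone to `path` and return the path."""
    n_frames = int(seconds * sample_rate)
    # Scale to 30% of full range to avoid clipping
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * frequency * i / sample_rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return path
```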

### Test Endpoint with Different Protocols

Our NIM ASR endpoint supports multiple inference methods:

1. **Auto-routing** (`/invocations`): Automatically chooses HTTP or gRPC based on file size
2. **Force HTTP** (`X-Amzn-SageMaker-Custom-Attributes: /invocations/http`): Direct HTTP route
3. **Force gRPC** (`X-Amzn-SageMaker-Custom-Attributes: /invocations/grpc`): Direct gRPC route with diarization support
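
All three tests send the audio as `multipart/form-data`, so the payload construction can be factored into one helper. The following is a minimal sketch: `build_multipart` is a hypothetical name, and the form field names (`file`, `language_code`, and so on) are the ones used by the test cells in this notebook.

```python
import uuid


def build_multipart(audio_bytes, fields, filename="test.wav", boundary=None):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = boundary or f"----FormBoundary{uuid.uuid4().hex}"
    crlf = b"\r\n"
    body = bytearray()
    # File part: headers, blank line, raw audio bytes
    body += f"--{boundary}".encode() + crlf
    body += f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode() + crlf
    body += b"Content-Type: audio/wav" + crlf + crlf
    body += audio_bytes + crlf
    # One part per additional text field
    for name, value in fields.items():
        body += f"--{boundary}".encode() + crlf
        body += f'Content-Disposition: form-data; name="{name}"'.encode() + crlf + crlf
        body += str(value).encode() + crlf
    # Closing boundary
    body += f"--{boundary}--".encode() + crlf
    return bytes(body), f"multipart/form-data; boundary={boundary}"
```

The returned values map directly onto the `Body` and `ContentType` arguments of `sm_runtime.invoke_endpoint`. For the gRPC test, `fields` would also carry `speaker_diarization` and `max_speakers`.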

#### Test 1: Auto-routing (Recommended)

In [None]:
# Test 1: Auto-routing inference
def test_endpoint_auto_routing(audio_file_path):
    """Test endpoint with auto-routing"""
    print(f"Testing auto-routing with {audio_file_path}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        file_size = len(audio_data)
        print(f"Audio file size: {file_size:,} bytes ({file_size / (1024*1024):.2f} MB)")
        
        # Create multipart form data
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload
        parts = []
        parts.append(f'--{boundary}')
        parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        parts.append('Content-Type: audio/wav')
        parts.append('')
        
        # Join text parts
        text_part = '\r\n'.join(parts) + '\r\n'
        language_part = f'\r\n--{boundary}\r\nContent-Disposition: form-data; name="language_code"\r\n\r\nen-US\r\n--{boundary}--'
        
        # Combine all parts
        payload = text_part.encode() + audio_data + language_part.encode()
        
        # Invoke endpoint
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload
        )
        
        # Parse response
        result = json.loads(response['Body'].read().decode())
        print(f"\nAuto-routing inference successful!")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"Auto-routing test failed: {e}")
        return None

# Run auto-routing test
auto_result = test_endpoint_auto_routing(test_audio_path)

#### Test 2: Force HTTP Route

Test the HTTP-only route (optimized for files <5MB):

In [None]:
# Test 2: Force HTTP route
def test_endpoint_http_route(audio_file_path):
    """Test endpoint with forced HTTP route"""
    print(f"Testing HTTP route with {audio_file_path}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        # Create multipart form data
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload (same as auto-routing)
        parts = []
        parts.append(f'--{boundary}')
        parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        parts.append('Content-Type: audio/wav')
        parts.append('')
        
        text_part = '\r\n'.join(parts) + '\r\n'
        language_part = f'\r\n--{boundary}\r\nContent-Disposition: form-data; name="language_code"\r\n\r\nen-US\r\n--{boundary}--'
        payload = text_part.encode() + audio_data + language_part.encode()
        
        # Invoke endpoint with HTTP route forced
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload,
            CustomAttributes='/invocations/http'  # Force HTTP route
        )
        
        result = json.loads(response['Body'].read().decode())
        print(f"\nHTTP route inference successful!")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"HTTP route test failed: {e}")
        return None

# Run HTTP route test
http_result = test_endpoint_http_route(test_audio_path)

#### Test 3: Force gRPC Route with Speaker Diarization

Test the gRPC route with speaker diarization capabilities:

In [None]:
# Test 3: Force gRPC route with speaker diarization
def test_endpoint_grpc_route(audio_file_path, enable_diarization=True, max_speakers=4):
    """Test endpoint with forced gRPC route and speaker diarization"""
    print(f"Testing gRPC route with speaker diarization: {enable_diarization}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        # Create multipart form data with diarization parameters
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload with additional parameters
        payload_parts = []
        
        # Audio file
        payload_parts.append(f'--{boundary}')
        payload_parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        payload_parts.append('Content-Type: audio/wav')
        payload_parts.append('')
        
        text_part = '\r\n'.join(payload_parts) + '\r\n'
        
        # Additional parameters
        additional_params = []
        
        # Language code
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="language_code"')
        additional_params.append('')
        additional_params.append('en-US')
        
        # Speaker diarization
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="speaker_diarization"')
        additional_params.append('')
        additional_params.append('true' if enable_diarization else 'false')
        
        # Max speakers
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="max_speakers"')
        additional_params.append('')
        additional_params.append(str(max_speakers))
        
        # Close boundary
        additional_params.append(f'--{boundary}--')
        
        params_part = '\r\n'.join(additional_params)
        
        # Combine all parts
        payload = text_part.encode() + audio_data + ('\r\n' + params_part).encode()
        
        # Invoke endpoint with gRPC route forced
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload,
            CustomAttributes='/invocations/grpc'  # Force gRPC route
        )
        
        result = json.loads(response['Body'].read().decode())
        print(f"\ngRPC route inference successful!")
        print(f"Speaker diarization enabled: {enable_diarization}")
        print(f"Max speakers: {max_speakers}")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"gRPC route test failed: {e}")
        return None

# Run gRPC route test
grpc_result = test_endpoint_grpc_route("../data/medical-diarization.wav")

## Resource Cleanup

### Important: Clean Up Resources

**Cost Warning**: Make sure to clean up resources when done testing.
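
When several people share an account, it may be preferable to delete only the resources created in this session rather than everything matching a name pattern. The following is a minimal sketch of that approach. `delete_run_resources` is a hypothetical helper, and `client` is assumed to behave like the boto3 SageMaker client.

```python
def delete_run_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete one endpoint, its config, and its model; report the outcome of each step."""
    steps = [
        ("endpoint", client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint config", client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", client.delete_model, {"ModelName": model_name}),
    ]
    results = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
            results[label] = "deleted"
        except Exception as exc:
            # Keep going so one failure does not leave the other resources behind
            results[label] = f"error: {exc}"
    return results
```

A call such as `delete_run_resources(sagemaker_client, endpoint_name, endpoint_config_name, model_name)` would use the names generated earlier in this notebook.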


In [None]:
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
print(f"Endpoint {endpoint_name} deleted")
# List all endpoint configs with our naming pattern
response = sagemaker_client.list_endpoint_configs(
    NameContains='nim-asr-combined-config'
)
for config in response['EndpointConfigs']:
    config_name = config['EndpointConfigName']
    try:
        sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
        print(f"Endpoint config {config_name} deleted")
    except ClientError as e:
        print(f"Error deleting endpoint config {config_name}: {e}")
        
# List all models with our naming pattern
response = sagemaker_client.list_models(
    NameContains='nim-asr-combined-model'
)

for model in response['Models']:
    model_name = model['ModelName']
    try:
        sagemaker_client.delete_model(ModelName=model_name)
        print(f"Model {model_name} deleted")
    except ClientError as e:
        print(f"Error deleting model {model_name}: {e}")


### Additional Cleanup (Optional)

Clean up additional resources if needed:

In [None]:
# Optional: Clean up ECR repository
delete_ecr_repo = False  # Set to True if you want to delete the ECR repository

if delete_ecr_repo:
    print("Deleting ECR repository...")
    try:
        ecr_client.delete_repository(
            repositoryName=repository_name,
            force=True  # Delete even if images exist
        )
        print(f"ECR repository {repository_name} deleted")
    except ClientError as e:
        print(f"Error deleting ECR repository: {e}")
else:
    print(f"ECR repository {repository_name} preserved for future use")

## Conclusion and Next Steps

### Summary

This notebook successfully demonstrated:
1. **Custom Container Development**: Built a BYOC solution combining NVIDIA NIM with SageMaker
2. **Dual Protocol Support**: Implemented both HTTP and gRPC routing capabilities
3. **Speaker Diarization**: Added advanced speaker separation features
4. **Intelligent Routing**: Automatic protocol selection based on file size and requirements
5. **Production Deployment**: Successfully deployed to SageMaker real-time inference

**🎉 Congratulations!** You have successfully deployed a production-ready NVIDIA NIM ASR solution on Amazon SageMaker using BYOC (Bring Your Own Container).