# Deploy Whisper model on AWS Inferentia2 using Optimum Neuron
The easiest way to deploy a model on Neuron devices is to use the [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index) library. Whisper is one of the models it already supports.

First, install the dependencies and download the test file. You can skip this step if you have already run the [01_Whisper_gpu](01_Whisper_gpu.ipynb) notebook.

In [None]:
%%capture
!pip install -U sagemaker librosa
!wget --no-check-certificate https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac

Remember to restart the kernel after installing the dependencies.
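
After the restart, it can help to confirm that the packages from the install cell are visible to the kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of any SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip install cell above.
for name in ("sagemaker", "librosa"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package is reported as not installed, re-run the install cell and restart the kernel again.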

In [None]:
import sagemaker
import boto3
import librosa
from sagemaker.huggingface import HuggingFaceModel

inf_region = 'us-east-2'

session = sagemaker.Session(boto_session=boto3.Session(region_name=inf_region))

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
print(role)

inf_bucket = session.default_bucket()
print(inf_bucket)

## Compile model to neuron
Before we deploy the model, we need to compile it so that it can run on Neuron devices. To do that, we will use an Amazon SageMaker training job that runs the compilation script and exports the compiled model to S3.

Let's start by creating the `src` directory, where we put the `requirements.txt` file for the compilation job and the compilation script.

In [None]:
!mkdir -p src

In [None]:
%%writefile src/requirements.txt
--extra-index-url https://pip.repos.neuron.amazonaws.com
optimum-neuron[neuronx]==0.3.0
librosa==0.11.0

The code inside the `if __name__ == "__main__":` block is used to compile the model. The same script is later used to deploy the SageMaker endpoint. For deployment, SageMaker only uses the functions defined before the main block, for instance `model_fn`, `input_fn`, and `predict_fn`.
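
To make the division of labour concrete, here is a simplified sketch of how a model server might chain these handlers for one request. The handler bodies are hypothetical stand-ins (no Neuron device or audio decoding involved), and `handle_request` is an illustrative name, not a SageMaker API; the real inference toolkit does more, such as content negotiation and error handling.

```python
import json


# Hypothetical stand-ins for the handlers in compile.py.
def model_fn(model_dir, context=None):
    # Real script: load the processor and compiled model, return an ASR pipeline.
    return lambda audio: {"text": f"<{len(audio)} samples from {model_dir}>"}


def input_fn(input_data, content_type, context=None):
    if content_type != "audio/x-audio":
        raise ValueError(f"Unsupported mime type: {content_type}")
    # Real script: decode the bytes with librosa; here each byte is one "sample".
    return list(input_data)


def predict_fn(audio_array, asr_pipeline, context=None):
    return {"transcription": asr_pipeline(audio_array)["text"]}


def handle_request(model, body, content_type):
    """Simplified view of the per-request flow: input_fn, then predict_fn."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return json.dumps(prediction)  # JSON output is assumed for this sketch


model = model_fn("/opt/ml/model")  # loaded once, then reused across requests
print(handle_request(model, b"\x00\x01\x02", "audio/x-audio"))
```

Nothing under `if __name__ == "__main__":` runs in this flow, which is why the compilation code can live in the same file.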

In [None]:
%%writefile src/compile.py
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import os
import io
import librosa
import shutil
import logging
import argparse
from huggingface_hub import login
from transformers import AutoProcessor
from optimum.neuron import NeuronWhisperForConditionalGeneration, pipeline

# model_fn loads the processor and the compiled model from model_dir and returns an ASR pipeline.
def model_fn(model_dir, context=None):
    processor = AutoProcessor.from_pretrained(model_dir)
    neuron_model = NeuronWhisperForConditionalGeneration.from_pretrained(model_dir)
    neuron_model.config.forced_decoder_ids = None
    neuron_model.config.suppress_tokens = []
    neuron_model.generation_config.forced_decoder_ids = None
    neuron_model.generation_config._from_model_config = True

    return pipeline(
        task="automatic-speech-recognition",
        model=neuron_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        # batch_size=16,  # batch size for inference
    )

# Defines an input_fn function to process incoming requests.
def input_fn(input_data, content_type, context=None):
    if content_type == 'audio/x-audio':
        # Direct audio bytes
        audio_array, sr = librosa.load(io.BytesIO(input_data), sr=16000)
        return audio_array
    else:
        raise ValueError(f"Unsupported mime type: {content_type}. Supported: audio/x-audio")

# Defines a predict_fn function that generates predictions based on user input.
def predict_fn(audio_array, asr_pipeline, context=None):
    logging.info("starting inference")
    output = asr_pipeline(audio_array)
    logging.info(f"output: {output}")
    return {"transcription": output["text"]}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--batch_size", type=int, default=1, help="Batch size the model is compiled for")
    parser.add_argument("--max_seq_len", type=int, default=448, help="Maximum decoder sequence length the model is compiled for")
    parser.add_argument("--hf_token", type=str, default=None, help="Hugging Face token, needed only for gated or private models")
    parser.add_argument("--model_id", type=str, default="openai/whisper-large-v3", help="Hugging Face id of the pre-trained model to compile")
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])

    args, _ = parser.parse_known_args()
    if args.hf_token:
        print("HF token defined. Logging in...")
        login(token=args.hf_token)

    compiler_args = {"auto_cast": "all", "auto_cast_type": "bf16"}
    input_shapes = {"batch_size": args.batch_size, "sequence_length": args.max_seq_len}
    # Export (compile) the model passed via --model_id instead of a hard-coded id
    model = NeuronWhisperForConditionalGeneration.from_pretrained(
        args.model_id,
        export=True,
        inline_weights_to_neff=False,
        **compiler_args,
        **input_shapes,
    )
    # Save locally
    model.save_pretrained(args.model_dir)

    code_path = os.path.join(args.model_dir, 'code')
    os.makedirs(code_path, exist_ok=True)

    shutil.copy(__file__, os.path.join(code_path, "inference.py"))
    shutil.copy('requirements.txt', os.path.join(code_path, 'requirements.txt'))

Define the training job and run it. We use a `trn1.2xlarge` instance because compiling the model requires more memory than running it. Later, we will deploy the compiled model on AWS Inferentia2.

In [None]:
import json
import logging
from sagemaker.pytorch import PyTorch

HF_TOKEN=""
tp_degree=1
batch_size=1
# Compilation needs fixed input shapes, so the maximum decoder output length must be set up front.
max_seq_len=44

# optimum-neuron 0.3.0 requires neuronxcc-2.19.8089 which is sdk2.24.1
image_uri=f"763104351884.dkr.ecr.{inf_region}.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04"

hyperparameters={
    "max_seq_len": max_seq_len,
    "batch_size": batch_size,
    "model_id": "openai/whisper-large-v3"
}

if HF_TOKEN and len(HF_TOKEN) > 3:
    hyperparameters["hf_token"]= HF_TOKEN
    
estimator = PyTorch(
    entry_point="compile.py", # Specify your train script
    source_dir="src",
    role=role,
    sagemaker_session=session,
    container_log_level=logging.DEBUG,
    instance_count=1,
    instance_type='ml.trn1.2xlarge',
    output_path=f"s3://{inf_bucket}/output",
    disable_profiler=True,
    disable_output_compression=True,


    image_uri=image_uri,
    env={
        'NEURON_RT_NUM_CORES': str(tp_degree)
    },
    hyperparameters=hyperparameters
)

Compilation will take roughly 15-20 minutes.

_NOTE: The job sometimes fails on this instance size with out-of-memory (OOM) errors. If that happens, try compiling on an instance with more RAM, e.g. inf2.8xlarge (verify that your AWS account service quota allows this instance type)._

In [None]:
estimator.fit()

## Deploy compiled model to AWS Inferentia2
First, let's find out where the compiled model is stored.

In [None]:
model_data=estimator.model_data
print(estimator.model_data)

If you already have a pre-compiled version of the model in S3, you can paste its S3 URI below:

In [None]:
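precompiled_s3_uri = 'PASTE_S3_URI_HERE'
# Only override model_data when a real S3 URI has been pasted above
if precompiled_s3_uri.startswith('s3://'):
    model_data={'S3DataSource': {'S3Uri': precompiled_s3_uri, 'S3DataType': 'S3Prefix', 'CompressionType': 'None'}}
print(model_data)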
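
Because the training job ran with `disable_output_compression=True`, the artifacts sit under an S3 prefix rather than in a single archive, which is why `model_data` is a dictionary and not a plain URI. A small helper can build that dictionary and reject an unfilled placeholder early. This is a sketch; `uncompressed_model_data` is a hypothetical name, and the bucket in the usage line is made up.

```python
def uncompressed_model_data(s3_uri: str) -> dict:
    """Build the model_data dict for an uncompressed model stored under an S3 prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Expected an S3 URI starting with 's3://', got: {s3_uri!r}")
    return {
        "S3DataSource": {
            "S3Uri": s3_uri,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    }


# Hypothetical bucket and prefix, for illustration only.
print(uncompressed_model_data("s3://example-bucket/output/whisper/model/"))
```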

Now let's deploy the model.

In [None]:
import logging
from sagemaker.utils import name_from_base
from sagemaker.pytorch.model import PyTorchModel


print(f"Model data: {model_data}")

instance_type="ml.inf2.xlarge"
num_workers=1

image_uri=f"763104351884.dkr.ecr.{inf_region}.amazonaws.com/pytorch-inference-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04"

print(f"Instance type: {instance_type}. Num SM workers: {num_workers}")
pytorch_model = PyTorchModel(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    name=name_from_base('whisper-neuronx'),
    sagemaker_session=session,
    container_log_level=logging.DEBUG,
    model_server_workers=num_workers,
    framework_version="2.1.2",
    env = {
        'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600',
        'MAX_SEQ_LEN': str(max_seq_len),
        'NEURON_RT_NUM_CORES': str(tp_degree)
    },
    # for production it is important to define vpc_config and use a vpc_endpoint
    #vpc_config={
    #    'Subnets': ['<SUBNET1>', '<SUBNET2>'],
    #    'SecurityGroupIds': ['<SECURITYGROUP1>', '<DEFAULTSECURITYGROUP>']
    #}
)
pytorch_model._is_compiled_model = True

Deployment takes roughly 8-10 minutes.

In [None]:
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    model_data_download_timeout=3600, # it takes some time to download all the artifacts and load the model
    container_startup_health_check_timeout=1800,
)

Play the audio file we are going to transcribe.

In [None]:
import IPython.display as ipd
import librosa

# Load and play
audio, sr = librosa.load("mlk.flac")
ipd.Audio(audio, rate=sr)

Configure the serializer and deserializer for the endpoint. The input is an audio file and the output is its transcription.

In [None]:
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer
predictor.serializer = DataSerializer(content_type='audio/x-audio')
predictor.deserializer = JSONDeserializer()
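
Clients that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` API. The sketch below assumes the endpoint returns the JSON object produced by `predict_fn` (a `transcription` key); the function name `transcribe` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
import json


def transcribe(runtime_client, endpoint_name: str, audio_bytes: bytes) -> str:
    """Send raw audio bytes to the endpoint and return the transcription text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="audio/x-audio",  # must match what input_fn accepts
        Accept="application/json",
        Body=audio_bytes,
    )
    payload = json.loads(response["Body"].read())
    return payload["transcription"]


# Usage (requires AWS credentials and a deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
# with open("mlk.flac", "rb") as f:
#     print(transcribe(runtime, predictor.endpoint_name, f.read()))
```

The `ContentType` header plays the same role as `DataSerializer(content_type='audio/x-audio')` in the cell above.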

Execute transcription

In [None]:
with open("mlk.flac", "rb") as f:
    data = f.read()

output = predictor.predict(data)
output

Calculate average transcription time

In [None]:
import time
iters = 10

# perf_counter is a monotonic clock intended for measuring elapsed time
start = time.perf_counter()
for _ in range(iters):
    predictor.predict(data)
end = time.perf_counter()

transcription_time = (end - start) / iters
print(f"Average transcription time: {transcription_time:.3f} s")
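
A single average can hide a slow first request or occasional outliers. As an optional refinement, the sketch below times each call separately after an untimed warm-up call and reports a few summary statistics. The helper `measure_latency` is hypothetical, and the `time.sleep` workload is only a stand-in for `predictor.predict(data)`.

```python
import statistics
import time


def measure_latency(call, iters=10, warmup=1):
    """Time `call()` `iters` times after `warmup` untimed calls; return stats in seconds."""
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# With the notebook's objects this would be:
# stats = measure_latency(lambda: predictor.predict(data))
stats = measure_latency(lambda: time.sleep(0.01), iters=5)  # stand-in workload
print({name: round(value, 4) for name, value in stats.items()})
```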

## Cost performance calculation

In [None]:
duration = librosa.get_duration(path="mlk.flac")
print(f"Audio duration: {duration}")

At the moment, `ml.inf2.xlarge` is not supported by the Pricing API, so we set the hourly price manually according to the [Amazon SageMaker AI pricing page](https://aws.amazon.com/sagemaker/ai/pricing/).

In [None]:
price=0.99 # USD/hour in us-east-2

In [None]:
# Audio seconds transcribed per hour = (requests per hour) * (audio seconds per request)
price_to_transcribe_1_sec = price / (3600.0 / transcription_time * duration)
price_per_minute = price_to_transcribe_1_sec * 60
print(f"Cost to transcribe 1 second of audio using Whisper on {instance_type}: ${price_to_transcribe_1_sec:.6f} USD")
print(f"Cost per minute: ${price_per_minute:.4f} USD")
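
The calculation above can be written step by step as a small function. It assumes requests are sent one after another and that the instance is busy for the whole hour, so idle time is not accounted for. The function name and the example numbers are illustrative, not measurements from this notebook.

```python
def cost_per_audio_second(price_per_hour, transcription_time, audio_duration):
    """USD to transcribe one second of audio on a fully utilised instance.

    price_per_hour: instance price in USD/hour
    transcription_time: seconds of wall-clock time per request
    audio_duration: seconds of audio in each request
    """
    requests_per_hour = 3600.0 / transcription_time
    audio_seconds_per_hour = requests_per_hour * audio_duration
    return price_per_hour / audio_seconds_per_hour


# Illustrative numbers only: a 13 s clip transcribed in 0.5 s at $0.99/hour.
example = cost_per_audio_second(0.99, transcription_time=0.5, audio_duration=13.0)
print(f"${example:.8f} per audio second, real-time factor {0.5 / 13.0:.3f}")
```

The real-time factor (processing time divided by audio duration) is a convenient companion metric: values below 1 mean the endpoint transcribes faster than the audio plays.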

## Clean up resources

In [None]:
predictor.delete_model()
predictor.delete_endpoint()