# Deploying models using AWS Trainium and AWS Inferentia2 to reduce cost
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper *Robust Speech Recognition via Large-Scale Weak Supervision* by Alec Radford et al. from OpenAI. It is available on [Hugging Face](https://huggingface.co/openai/whisper-large-v3), and we will use it to demonstrate several deployment approaches and to show how AWS Inferentia2 can achieve better cost performance.

First, let's install the required libraries and download the sample file that we will use for testing.

In [None]:
%%capture
!pip install -U sagemaker librosa
!wget --no-check-certificate https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac

Remember to restart the kernel after installing the dependencies. Next, let's establish a SageMaker session.

In [None]:
region = 'us-east-1'

import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel

session = sagemaker.Session(boto_session=boto3.Session(region_name=region))

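try:
    # Works inside SageMaker notebooks and Studio, where an execution role is attached
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look the role up by name (adjust RoleName to match your account)
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
role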

## Deploying Whisper on an Amazon SageMaker endpoint using a GPU

In [None]:
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'openai/whisper-large-v3',
    'HF_TASK': 'automatic-speech-recognition'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.49.0',
    pytorch_version='2.6.0',
    py_version='py312',
    env=hub,
    role=role,
    sagemaker_session=session
)

For the later cost-performance comparison we will use a G5 instance. You can also try the newer G6 generation.

In [None]:
instance_type='ml.g5.xlarge'
#instance_type='ml.g6.xlarge'

Deploy the model to SageMaker Inference. This takes roughly 8-10 minutes.

In [None]:
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type=instance_type  # SageMaker ML instance type
)

If your notebook loses its connection after the endpoint has been successfully deployed, you can recreate the predictor from the endpoint name shown in the console. Uncomment these lines and fill in `endpoint_name`.

In [None]:
# from sagemaker.predictor import Predictor
# predictor = Predictor(
#     endpoint_name="YOUR_ENDPOINT_NAME",
#     sagemaker_session=session
# )

Let's configure the serializer for the input and the deserializer for the output. The input is an audio file and the output is its transcription.

In [None]:
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = DataSerializer(content_type='audio/x-audio')
predictor.deserializer = JSONDeserializer()

First, play the audio file we are going to transcribe.

In [None]:
import IPython.display as ipd
import librosa

# Load and play
audio, sr = librosa.load("mlk.flac")
ipd.Audio(audio, rate=sr)

Now run the transcription.

In [None]:
with open("mlk.flac", "rb") as f:
    data = f.read()
predictor.predict(data)

## Cost performance calculation

In [None]:
# Length of the sample file in seconds
duration = librosa.get_duration(path="mlk.flac")
print(f"Audio duration: {duration:.2f} s")

In [None]:
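import time
iters = 10

# Send the same audio file several times and average the wall-clock time per request.
# The measurement includes the network round trip between the notebook and the endpoint.
start = time.perf_counter()
for _ in range(iters):
    predictor.predict(data)
end = time.perf_counter()

transcription_time = (end - start) / iters
print(f"Average transcription time: {transcription_time:.3f} s")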
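
A single average can hide a slow first request or an occasional outlier. The sketch below is one possible way to collect per-request latencies with a few untimed warm-up calls first. The helper name `benchmark` and its `warmup` parameter are illustrative assumptions, not part of the SageMaker SDK.

```python
import statistics
import time


def benchmark(fn, payload, iters=10, warmup=2):
    """Call fn(payload) repeatedly and return latency statistics in seconds.

    Hypothetical helper for illustration; fn is any callable, e.g. predictor.predict.
    """
    # Warm-up calls are not timed, since the first requests to a fresh endpoint can be slower.
    for _ in range(warmup):
        fn(payload)

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(payload)
        latencies.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "min": min(latencies),
        "max": max(latencies),
    }
```

With the endpoint from this notebook it could be used as `stats = benchmark(predictor.predict, data)`, and `stats["mean"]` would then play the role of `transcription_time`.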

Look up the on-demand hourly price of the instance type with the AWS Price List API.

In [None]:
region_names = {
    'us-east-1': 'US East (N. Virginia)',
    'us-east-2': 'US East (Ohio)',
    'us-west-1': 'US West (N. California)',
    'us-west-2': 'US West (Oregon)',
    'eu-west-1': 'Europe (Ireland)',
    'eu-west-2': 'Europe (London)',
    'eu-west-3': 'Europe (Paris)',
    'eu-central-1': 'Europe (Frankfurt)',
    'ap-southeast-1': 'Asia Pacific (Singapore)',
    'ap-southeast-2': 'Asia Pacific (Sydney)',
    'ap-northeast-1': 'Asia Pacific (Tokyo)',
    # Add more as needed
}

# The Price List Query API is served from only a few regions; us-east-1 is one of them
pricing = boto3.client('pricing', region_name='us-east-1')

response = pricing.get_products(
    ServiceCode='AmazonSageMaker',
    Filters=[
        {'Type': 'TERM_MATCH', 'Field': 'instanceType', 'Value': instance_type},
        {'Type': 'TERM_MATCH', 'Field': 'productFamily', 'Value': 'ML Instance'},
        {'Type': 'TERM_MATCH', 'Field': 'location', 'Value': region_names[region]}
    ]
)

import json

# Use a separate name here: `data` still holds the audio bytes used by the benchmark above
product = json.loads(response['PriceList'][0])
on_demand = product['terms']['OnDemand']
first_term = next(iter(on_demand.values()))
first_dimension = next(iter(first_term['priceDimensions'].values()))
price = float(first_dimension['pricePerUnit']['USD'])
print(f"Price per hour for {instance_type}: {price} USD")
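
If the filters match nothing, `PriceList` comes back empty and indexing `[0]` raises an `IndexError` that says little about the cause. The following is a minimal sketch of the same extraction wrapped in a function with an explicit error. The function name `first_on_demand_usd` is hypothetical, and the sample entry, including its keys and the price value, is made up purely to show the nesting; it is not real pricing data.

```python
import json


def first_on_demand_usd(price_list):
    """Return the first on-demand USD price from a Pricing API PriceList.

    Hypothetical helper: price_list is the list of JSON strings found in
    response['PriceList'].
    """
    if not price_list:
        raise LookupError("PriceList is empty; check the instance type, location and filters")

    product = json.loads(price_list[0])
    for term in product["terms"]["OnDemand"].values():
        for dimension in term["priceDimensions"].values():
            usd = dimension["pricePerUnit"].get("USD")
            if usd is not None:
                return float(usd)
    raise LookupError("no USD on-demand price found in the first PriceList entry")


# Made-up entry that only mirrors the nesting used above (not a real price).
sample_entry = json.dumps({
    "terms": {
        "OnDemand": {
            "EXAMPLE_TERM": {
                "priceDimensions": {
                    "EXAMPLE_DIMENSION": {"pricePerUnit": {"USD": "1.2345"}}
                }
            }
        }
    }
})

print(first_on_demand_usd([sample_entry]))
```

In the notebook this would be called as `price = first_on_demand_usd(response['PriceList'])`.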

In [None]:
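# Seconds of audio one instance can transcribe per hour when requests run back to back
audio_seconds_per_hour = 3600.0 / transcription_time * duration
price_to_transcribe_1_sec = price / audio_seconds_per_hour
print(f"Cost to transcribe 1 second of audio using Whisper on {instance_type}: {price_to_transcribe_1_sec} USD")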
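
To make the arithmetic easier to reuse when comparing instance types, the calculation can be sketched as two small functions. This assumes the instance is fully utilised and handles one request at a time, which is the same assumption the cell above makes; an endpoint that sits idle costs more per transcribed second. The function names and the numbers in the example are hypothetical, not measurements.

```python
def real_time_factor(transcription_time, audio_duration):
    """Seconds of compute needed per second of audio (lower is better)."""
    return transcription_time / audio_duration


def cost_per_audio_second(price_per_hour, transcription_time, audio_duration):
    """USD to transcribe one second of audio on a fully utilised instance."""
    # Price per second of instance time, scaled by how much instance time
    # one second of audio consumes.
    return price_per_hour / 3600.0 * real_time_factor(transcription_time, audio_duration)


# Hypothetical numbers: 1.00 USD per hour, 2 s to transcribe a 10 s clip.
example_cost = cost_per_audio_second(1.0, 2.0, 10.0)
print(f"{example_cost:.8f} USD per audio second")
print(f"{example_cost * 3600:.4f} USD per hour of audio")
```

This is algebraically the same as `price / (3600.0 / transcription_time * duration)`; writing it through the real-time factor only makes the dependence on model speed explicit.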

## Clean up resources

In [None]:
predictor.delete_model()
predictor.delete_endpoint()