# Deploying a Pre-trained Faster-Whisper-Large-v3 Model on SageMaker with a Multi-Model Endpoint and a Custom Triton Inference Server Container

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

---

This notebook deploys a pre-trained Faster-Whisper-Large-v3 model from Hugging Face on Amazon SageMaker. The deployment uses a multi-model endpoint (MME), which lets us serve multiple models efficiently on a single SageMaker instance.

## Overview

In this demonstration, we'll cover the following key aspects:

1. **Model Selection:** We use the Hugging Face model hub to obtain Faster-Whisper-Large-v3, a pre-trained automatic speech recognition (ASR) model. It is OpenAI's Whisper large-v3 converted to the CTranslate2 format used by the faster-whisper library.

2. **SageMaker Deployment:** Learn how to deploy the selected model on Amazon SageMaker, a fully managed service that enables the training and deployment of machine learning models at scale.

3. **Multi-Model Endpoint:** Explore the advantages of using a Multi-Model Endpoint on SageMaker. This approach allows us to host and serve multiple models on a single instance, optimizing resource utilization and cost efficiency.

4. **Triton Inference Server Integration:** Understand how NVIDIA Triton Inference Server, an open-source model serving platform, dynamically loads models onto the GPU and unloads them again. This capability is crucial for managing resources effectively and reducing operational costs.

## Cost Optimization

A multi-model endpoint lets several models share one instance instead of each occupying a dedicated endpoint. This raises resource utilization and lowers cost, especially when the models are not all invoked at the same time.

## Note
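Ensure that you have [git-lfs](https://git-lfs.com/) installed. The model weights on Hugging Face are stored with Git LFS, and without it `git clone` fetches only small pointer files.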



### Installs
Install the dependencies required to package the model and run inference with Triton server.<br>This updates pip, the AWS CLI, boto3, and the SageMaker Python SDK.



In [None]:
%pip install -qU pip awscli boto3 sagemaker

### Imports and variables

#### Setting up SageMaker Execution Role

1. **Create Role:**
   - Go to IAM Console.
   - Choose "Roles" &rarr; "Create role."
   - Select "AWS service" &rarr; "SageMaker"
   - Attach necessary policies.
   - Name the role: `AmazonSageMaker-ExecutionRole-mme-FasterWishper`.

2. **Assign Permissions:**
   - Go to SageMaker Console.
   - Select your instance or job.
   - In settings, choose the created role.
   - Save changes.

Now, your SageMaker instance/job is set up with the required execution role.
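
The "Attach necessary policies" step depends on your account setup. As a minimal sketch, the role needs at least a trust policy that lets the SageMaker service assume it. The helper name `sagemaker_trust_policy` below is hypothetical, and the permissions policies themselves (for example, read access to the model bucket in S3 and pull access to the custom image in ECR) still have to be attached separately.

```python
import json


def sagemaker_trust_policy():
    """Return a trust policy document that lets SageMaker assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# IAM expects the trust policy as a JSON string.
trust_policy_json = json.dumps(sagemaker_trust_policy(), indent=2)
print(trust_policy_json)
```

If you create the role with boto3 instead of the console, this JSON string is what the IAM `create_role` call takes as `AssumeRolePolicyDocument`.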


In [None]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# sagemaker variables
region_name = "eu-west-1"  # change to your region
# Create every client and session from one boto3 session so they all use the same region
boto_session = boto3.Session(region_name=region_name)
sm_client = boto_session.client("sagemaker")
runtime_sm_client = boto_session.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = f"arn:aws:iam::{boto_session.client('sts').get_caller_identity().get('Account')}:role/AmazonSageMaker-ExecutionRole-mme-FasterWishper"
s3_client = boto_session.client("s3")
bucket = sagemaker_session.default_bucket()
print("S3 Bucket: {}".format(bucket))
prefix = "faster-whisper-large-v3-mme-gpu"

# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session(region_name=region_name).region_name
if region not in account_id_map:
    # Raising a plain string is a TypeError in Python 3, so raise a real exception
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.05-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

### Generate model artifact for faster-whisper-large-v3 from Hugging Face

We will clone the model repository from Hugging Face.

In [None]:
!git clone https://huggingface.co/Systran/faster-whisper-large-v3

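Next, we arrange the files the way Triton server expects them. The model repository for Faster Whisper large-v3 has the structure shown below, where `1` is the model version directory and `model.py` is the Python backend entry point. The cell after it moves the cloned model files into `1/`. Note that `model.py` is not part of the Hugging Face repository, so it has to be placed in `1/` before packaging.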

```
faster-whisper-large-v3
├── 1
│   └── model.py
└── config.pbtxt
```
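
Once the directory has been arranged and `config.pbtxt` has been written, the layout above can be checked before packaging. The sketch below is a hypothetical helper (`check_model_repository` is not part of Triton or SageMaker). It assumes the default Python backend convention of a `model.py` file inside each numeric version directory.

```python
from pathlib import Path


def check_model_repository(model_dir):
    """Return a list of problems found in a Triton model directory layout."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are named with integers such as "1".
    versions = sorted(p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit())
    if not versions:
        problems.append("missing numeric version directory")

    for version in versions:
        if not (version / "model.py").is_file():
            problems.append(f"missing {version.name}/model.py")

    return problems
```

For example, `check_model_repository("faster-whisper-large-v3")` returns an empty list when the layout matches the tree above.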

In [None]:
!mkdir -p ./faster-whisper-large-v3/1
!mv ./faster-whisper-large-v3/*.* ./faster-whisper-large-v3/1/
# .git is a directory, so it needs -r to be removed
!rm -rf ./faster-whisper-large-v3/.gitattributes ./faster-whisper-large-v3/.git

### Create config.pbtxt

In [None]:
%%writefile faster-whisper-large-v3/config.pbtxt
name: "faster-whisper-large-v3"
backend: "python"
max_batch_size: 1

input [
    {
        name: "audio_array"
        data_type: TYPE_STRING
        dims: [1]
    },
    {
        name: "audio_lang"
        data_type: TYPE_STRING
        dims: [1]
    }
]

output [
    {
        name: "transcript"
        data_type: TYPE_STRING
        dims: [ 1 ]
    }
]

instance_group [
  {
    kind: KIND_GPU
  }
]


### Package models and upload to S3
Next, we package the model directory as a `.tar.gz` archive and upload it to S3. The archive file name matters: it is the value passed as `TargetModel` at inference time.


In [None]:
!tar -C ./ -czf faster-whisper-large-v3.tar.gz faster-whisper-large-v3
model_uri_faster_whisper_large = sagemaker_session.upload_data(
    path="faster-whisper-large-v3.tar.gz", key_prefix=prefix
)
print("Model URI: {}".format(model_uri_faster_whisper_large))
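
Where the `tar` command is not available, the same archive can be produced with Python's standard `tarfile` module. This is a minimal sketch, not the notebook's own packaging step. The function name `package_model` is hypothetical, and it assumes the model directory should be the single top-level entry in the archive, as it is with the `tar` command above.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, output_path=None):
    """Create <model_dir>.tar.gz with the model directory as the archive root."""
    model_dir = Path(model_dir)
    if output_path is None:
        # Write the archive next to the model directory, not inside it.
        output_path = model_dir.parent / (model_dir.name + ".tar.gz")
    output_path = Path(output_path)

    with tarfile.open(output_path, "w:gz") as archive:
        # arcname keeps "<model name>/..." as the top-level entry in the archive.
        archive.add(model_dir, arcname=model_dir.name)
    return output_path
```

Calling `package_model("faster-whisper-large-v3")` writes `faster-whisper-large-v3.tar.gz` to the current directory, ready for `upload_data`.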

### Build and Push docker image
Next, we build a Docker image based on the SageMaker Triton Inference Server image, one of the [SageMaker deep learning images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

#### Create the Dockerfile
The Dockerfile installs our extra requirements on top of the base image.

In [None]:
from IPython.core.magic import register_line_cell_magic


@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, "w") as f:
        f.write(cell.format(**globals()))

In [None]:
%%writetemplate Dockerfile
FROM {mme_triton_image_uri}

RUN apt-get update && apt-get install -y \
    ffmpeg \
    libcublas11

RUN pip3 install faster-whisper ffmpeg ffmpeg-python 


Build the Docker image and push it to Amazon ECR.

In [None]:
docker_version = "0.0.1"

In [None]:
!docker build -t {sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/triton/faster-whisper:{docker_version} .

In [None]:
!aws ecr create-repository --repository-name triton/faster-whisper --region {region}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com
!docker push {sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/triton/faster-whisper:{docker_version}

### Create SageMaker Endpoint

Now that we have uploaded the model artifacts to S3, we can create a SageMaker multi-model endpoint.

#### Define the serving container
In the container definition, set `ModelDataUrl` to the S3 prefix that holds all the model archives the SageMaker multi-model endpoint can load and serve. Set `Mode` to `MultiModel` so that SageMaker creates the endpoint with MME container specifications. The `Image` is the custom image built above, which is based on a Triton image that supports GPU multi-model endpoints.

In [None]:
model_data_url = f"s3://{bucket}/{prefix}/"

container = {
    "Image": f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/triton/faster-whisper:{docker_version}",
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}
print("Container: {}".format(container))

#### Create a multi-model object

Once the image and the data location are set, we create the model with `create_model`, specifying the `ModelName` and the container definition.

In [None]:
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{prefix}-mdl-{ts}"
sagemaker_execution_role = role

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=sagemaker_execution_role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

#### Define configuration for the multi-model endpoint

Using the model above, we create an [endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where we can specify the type and number of instances we want in the endpoint. Here we deploy to an `ml.g4dn.2xlarge` instance, which has one NVIDIA T4 GPU.

In [None]:
endpoint_config_name = f"{prefix}-epc-{ts}-2xl"

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.2xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

#### Create Multi-Model Endpoint

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [None]:
endpoint_name = f"{prefix}-ep-{ts}-2xl"

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
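
The loop above runs for as long as the status stays `Creating` and does not distinguish a failed deployment from a successful one. A minimal sketch of a bounded variant follows. The helper name `wait_for_endpoint` and its parameters are hypothetical, and the `describe` argument is assumed to be any zero-argument callable that returns a dict with an `EndpointStatus` key.

```python
import time


def wait_for_endpoint(describe, poll_seconds=60, timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint leaves "Creating" or the timeout expires.

    Raises TimeoutError if the endpoint is still being created after
    timeout_seconds, and RuntimeError if creation ends in any status
    other than "InService".
    """
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe()["EndpointStatus"]

    if status != "InService":
        raise RuntimeError(f"Endpoint creation ended with status {status}")
    return status
```

In this notebook it could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also provides a built-in alternative: `sm_client.get_waiter("endpoint_in_service")`.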

### Run Inference

Once the endpoint is running, we can send a sample audio file for inference using JSON as the payload format. The audio bytes are base64-encoded so that they can travel as a string in the JSON body. For the inference request format, Triton uses the KServe (formerly KFServing) community standard [inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md).

In [None]:
import io
import base64


def get_payload(wav_file_path, language):
    with open(wav_file_path, "rb") as f:
        wav = base64.b64encode(f.read()).decode("ascii")

    payload = {}
    payload["inputs"] = []
    payload["inputs"].append(
        {
            "name": "audio_array",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": [wav],
        }
    )
    payload["inputs"].append(
        {
            "name": "audio_lang",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": [language],
        }
    )
    return payload
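
Because `get_payload` sends the audio as base64 text, the server-side inference code has to reverse that encoding before transcription. The server code is not shown in this notebook, so the following is only a sketch of that step under that assumption. The helper name `decode_audio` is hypothetical. The round trip is demonstrated with a generated one-second silent WAV file rather than a real recording.

```python
import base64
import io
import wave


def decode_audio(b64_text):
    """Reverse the client-side encoding: base64 text back to a binary stream."""
    return io.BytesIO(base64.b64decode(b64_text))


# Build a tiny in-memory WAV file: one second of 16-bit mono silence at 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(16000)
    wav_out.writeframes(b"\x00\x00" * 16000)

# Encode the bytes the same way get_payload does, then decode them again.
encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
with wave.open(decode_audio(encoded), "rb") as wav_in:
    print(wav_in.getframerate(), wav_in.getnframes())
```

The decoded stream is an ordinary file-like object, so it can be handed to any audio reader that accepts one.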

In [None]:
payload = get_payload("./audio/sample.wav", "fr")

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
    TargetModel="faster-whisper-large-v3.tar.gz",
)

response_body = json.loads(response["Body"].read().decode("utf-8"))
print(response_body)
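
The printed response is a JSON object in the inference protocol's response format, where each output tensor is listed by name under `outputs`. A minimal sketch for extracting the transcript follows. The helper name `get_output` is hypothetical, and `sample_response` is an invented stand-in shaped like the protocol's JSON response, not real endpoint output.

```python
def get_output(response_body, name):
    """Return the data list of the named output tensor from a response dict."""
    for output in response_body.get("outputs", []):
        if output.get("name") == name:
            return output.get("data", [])
    raise KeyError(f"Output {name!r} not found in response")


# Hypothetical response body used only to illustrate the structure.
sample_response = {
    "model_name": "faster-whisper-large-v3",
    "model_version": "1",
    "outputs": [
        {"name": "transcript", "datatype": "BYTES", "shape": [1, 1], "data": ["bonjour"]}
    ],
}
print(get_output(sample_response, "transcript")[0])
```

With the real response, `get_output(response_body, "transcript")` returns the list of transcribed strings, using the `transcript` output name declared in `config.pbtxt`.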

### Clean up

Delete the endpoint, its configuration, and the model to stop incurring charges.

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|nlp|realtime|triton|multi-model|mme-triton-custom-faster-whisper|mme-triton-custom-faster-whisper.ipynb)
