# Compile, Deploy, and Benchmark Mathstral on Inferentia2 with Optimum Neuron and SageMaker
---

In this notebook, we walk through the basics of how you can get started with compiling models for AWS Neuron to deploy on Inferentia2 instances.
AWS Neuron is an SDK with a compiler, runtime, and profiling tools that unlocks high-performance and cost-effective deep learning (DL) acceleration. It supports high-performance training on AWS Trainium-based Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. For model deployment, it supports high-performance and low-latency inference on AWS Inferentia-based Amazon EC2 Inf1 instances and AWS Inferentia2-based Amazon EC2 Inf2 instances. With Neuron, you can use popular frameworks, such as TensorFlow and PyTorch, and optimally train and deploy machine learning (ML) models on Amazon EC2 Trn1, Inf1, and Inf2 instances, and Neuron is designed to minimize code changes and tie-in to vendor-specific solutions.

In this sample, we demonstrate how to compile the [mistralai/Mathstral-7B-v0.1](https://huggingface.co/mistralai/Mathstral-7B-v0.1) model, deploy it to [Amazon SageMaker](https://aws.amazon.com/sagemaker/), and benchmark it. We use the Hugging Face LLM DLC, a purpose-built inference container designed to facilitate the deployment of Large Language Models (LLMs) in a secure and managed environment. This Deep Learning Container (DLC) is powered by <b>Text Generation Inference (TGI)</b>, a scalable and optimized solution for deploying and serving LLMs efficiently. Detailed instance requirements for various model sizes will also be provided to ensure optimal deployment configurations. We use the Ray/llmperf tool to benchmark the performance of our SageMaker endpoint on Inf2.

## Set up environment

#### Local Setup (Optional)

For a local server, follow these steps to run this Jupyter notebook:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

By following these steps, you can set up a local Jupyter Notebook environment capable of deploying machine learning models on Amazon SageMaker using the appropriate IAM role for granting the necessary permissions.

---

#### Prerequisites

This Jupyter notebook can be run on a t3.medium instance (ml.t3.medium). However, to deploy Mistral models to an Inf2 endpoint, you may need to request a quota increase.

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.inf2.24xlarge` for endpoint usage, or
   - `ml.inf2.48xlarge` for endpoint usage
   
    Note that although this example showcases Mathstral (which has been compiled to run on 12 Neuron cores) on inf2.24xlarge, you can also deploy and benchmark other models compiled for Neuron from our [neuron-compile-jobs](https://huggingface.co/collections/nithiyn/neuron-compile-jobs-66fc4163c5350829c9121e80) collection of Mistral models on Hugging Face.
4. If needed, request a quota increase for these resources.

---

#### Requirements

To deploy a Mistral model compiled for AWS Neuron to Amazon SageMaker with the `sagemaker` Python SDK, you need a configured AWS account and the `sagemaker` Python SDK installed.

1. Create an Amazon SageMaker Notebook Instance 
- [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
- For Notebook Instance type, choose `ml.m5.4xlarge`, since we will be benchmarking performance in this notebook.
    
2. For Select Kernel, choose [conda_pytorch_p310](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).

3. Install the required packages.

4. Set up your [HuggingFace token](https://huggingface.co/docs/transformers.js/en/guides/private): 
- User Access Tokens are the preferred way to authenticate an application to Hugging Face services.
- To generate an access token, navigate to the Access Tokens tab in your settings and click on the New token button.
- Choose a name for your token and click Generate a token (we recommend keeping the “Role” as read-only). You can then click the Copy button next to your newly-created token to copy it to your clipboard.
- Paste this token into the `HF_TOKEN` parameter in the Optimum Neuron compile section and into the `HUGGING_FACE_HUB_TOKEN` entry of the `config` in the deployment section.
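
Rather than pasting the token directly into a cell, one option is to read it from an environment variable so that it never lands in the saved notebook. The following is a minimal sketch; the helper name `load_hf_token` and the choice of `HF_TOKEN` as the variable name are assumptions made for illustration, not part of the original workflow.

```python
import os


def load_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable.

    `load_hf_token` is a hypothetical helper. It raises a RuntimeError when
    the variable is missing or empty, so the problem surfaces before the
    compile or deploy steps run.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Hugging Face token"
        )
    return token


# Demonstration with a stand-in mapping instead of the real environment.
# In the notebook you would call load_hf_token() with no arguments.
print(load_hf_token({"HF_TOKEN": "hf_example_token"}))
```

The returned value can then be assigned to `HF_TOKEN` in the compile section and to `HUGGING_FACE_HUB_TOKEN` in the deployment `config`.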


In [None]:
#Install packages and import libraries

In [None]:
!pip install sagemaker --upgrade --quiet

In [None]:
!pip install transformers --upgrade --quiet
!pip install gradio --upgrade --quiet

In [None]:
import boto3
import gradio as gr
import json
import os
import sagemaker
import sys
import time
from packaging import version
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri


print(sagemaker.__version__)
# Compare parsed versions; a plain string comparison would rank "2.99.0" above "2.232.0"
if version.parse(sagemaker.__version__) < version.parse("2.232.0"):
    print("You need to upgrade or restart the kernel if you already upgraded")

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name


We use v0.0.25 of the tgi-neuronx image from the ECR Deep Learning Container repository, because our models for Inf2 were compiled with neuronx-cc > 2.15 and need an image that supports it.
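
If your endpoint runs in a different region, the image URI has to point at that region's registry. Below is a small sketch that rebuilds the URI from its parts. The helper `tgi_neuronx_image_uri` is hypothetical, and it assumes the same registry account ID serves the target region, which does not hold for every AWS region, so verify the URI before deploying.

```python
def tgi_neuronx_image_uri(
    region,
    account="763104351884",  # assumption: this registry account serves the target region
    tag="2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0",
):
    """Build the ECR URI of the TGI neuronx container for a given region."""
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-tgi-inference:{tag}"
    )


# Reproduces the us-west-2 URI used in this notebook
print(tgi_neuronx_image_uri("us-west-2"))
```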

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri
# pin the Hugging Face TGI image for neuronx (this URI points to us-west-2)
llm_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0"
 
# print ecr image uri
print(f"llm image uri: {llm_image}")

## Compile models for Optimum Neuron - optional
AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size ahead of time. 
For ease of use, our [Mistral-on-AWS](https://github.com/aws-samples/mistral-on-aws) team has pre-compiled these models for Neuron. To compile your own models to NEFF (Neuron Executable File Format), follow the steps below:


<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> This section is optional. We have already compiled this model for you in our [neuron-compile-jobs](https://huggingface.co/collections/nithiyn/neuron-compile-jobs-66fc4163c5350829c9121e80) collection of Mistral models on Hugging Face.
</div>
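
Because a compiled model only accepts the fixed batch size and sequence length chosen at compile time, every request has to be brought to that exact shape. The sketch below illustrates the idea on plain Python lists of token IDs. It is a conceptual illustration only: the serving container handles this internally, and `pad_batch` and `pad_id=0` are hypothetical names and values.

```python
def pad_batch(token_ids, batch_size, sequence_length, pad_id=0):
    """Pad a list of token-ID lists to exactly batch_size x sequence_length.

    Inputs that do not fit the compiled shape are rejected, since a
    statically compiled model cannot grow to accommodate them.
    """
    if len(token_ids) > batch_size:
        raise ValueError("more sequences than the compiled batch size")
    rows = []
    for ids in token_ids:
        if len(ids) > sequence_length:
            raise ValueError("sequence longer than the compiled sequence length")
        # right-pad each sequence up to the fixed length
        rows.append(list(ids) + [pad_id] * (sequence_length - len(ids)))
    # fill unused batch slots with all-padding rows
    while len(rows) < batch_size:
        rows.append([pad_id] * sequence_length)
    return rows


print(pad_batch([[1, 2], [3]], batch_size=3, sequence_length=4))
# prints [[1, 2, 0, 0], [3, 0, 0, 0], [0, 0, 0, 0]]
```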

### Prerequisites for Compilation
Follow the steps below to set up your EC2 instance with the HuggingFace Neuron DLAMI from the AWS Marketplace.

---
#### Create Your EC2 instance
##### Follow the steps here for a detailed set up of your EC2 instance: [setup](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html)

##### Steps:
- Navigate to the EC2 dashboard from the AWS Management Console and launch your instance.
- Search for the [HuggingFace Neuron DLAMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).
- Choose `inf2.24xlarge` or `inf2.48xlarge` as the instance size, or any other AWS Neuron based instance.
- Set the inbound rule for `ssh` to your local machine's IP address or to `anywhere` (note that allowing traffic from any IPv4 address is not in accordance with security best practices; secure these ports once you are done testing).
- Create and specify your SSH key in the instance configuration step. You will need your `.pem` file.
- Create your instance.

Once you have launched your instance, open your terminal or VS Code and follow the steps below:

<b>ssh for powershell:</b>
```
$PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console
$KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/trn.pem

ssh -i $KEY_PATH -L 8080:localhost:8080 ubuntu@$PUBLIC_DNS
```
<b>ssh for linux/macOS:</b>
```
export PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console
export KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/trn.pem

ssh -i $KEY_PATH -L 8080:localhost:8080 ubuntu@$PUBLIC_DNS
``` 
You should now be connected to your EC2 instance over SSH.
Next, from the home directory, navigate to the `text-generation` directory of `huggingface-neuron-notebooks` and start the Jupyter server:

```
(aws_neuronx_venv_pytorch_2_1) ubuntu@ip-172-31-0-5:~$ cd huggingface-neuron-notebooks/
(aws_neuronx_venv_pytorch_2_1) ubuntu@ip-172-31-0-5:~/huggingface-neuron-notebooks$ cd text-generation/
(aws_neuronx_venv_pytorch_2_1) ubuntu@ip-172-31-0-5:~/huggingface-neuron-notebooks/text-generation$ python -m notebook --allow-root --port=8080
```
You should see the familiar Jupyter output with a URL to the notebook.

`http://localhost:8080/....`

We can click on it, and a Jupyter environment opens in our local browser. Upload this notebook to your Jupyter environment and run the cells below, modifying them for the model you would like to compile for Neuron.

In [None]:
!mkdir -p ./mistral-model #set the name of the model as directory name

In [None]:
!rm -rf /var/tmp/neuron-compile-cache/* # clear neuron cache

In [None]:
# look up the Mistral model you would like to compile to see if it is already in the Neuron persistent cache (append the model name after `mistralai/`)
!optimum-cli neuron cache lookup mistralai/

In [None]:
# Replace the empty values below with the input shapes you would like for your model. For the input shapes used for Mathstral, refer to the next section.
MODEL_ID = ""  # HF model ID of the Mistral model you would like to compile
SEQUENCE_LENGTH = ""  # sequence length that the model exported by the neuronx-cc compiler will accept as input
BATCH_SIZE = ""  # batch size that the model exported by the neuronx-cc compiler will accept as input
NUM_CORES = ""  # each Inferentia2 chip has 2 cores, e.g. inf2.48xlarge has 12 chips or 24 cores
PRECISION = ""  # fp32/bf16/fp16 depending on the precision
HF_MODEL_ID_TO_PUSH = ""  # change this to your desired model ID
HF_TOKEN = ""  # the HF token that you generated in the Requirements section
 
# log in to the Hugging Face Hub to access gated models, like Llama or Mistral
!huggingface-cli login --token $HF_TOKEN
# compile the model with Optimum Neuron using the input shapes set above
!optimum-cli export neuron -m {MODEL_ID} --batch_size {BATCH_SIZE} --sequence_length {SEQUENCE_LENGTH} --num_cores {NUM_CORES} --auto_cast_type {PRECISION} ./mistral-model
# push model to hub [repo_id] [local_path] [path_in_repo]
!huggingface-cli upload {HF_MODEL_ID_TO_PUSH} ./mistral-model ./

Once you run the above cells in your Jupyter server, your compile job should finish and push your model to the Hub under the model ID that you specified. If you would like to continue without compiling the model yourself, our Mistral-on-AWS team has created a collection of Neuron-compiled NEFF binaries for you to use [here](https://huggingface.co/collections/nithiyn/neuron-compile-jobs-66fc4163c5350829c9121e80). This collection is a work in progress, and we will continue adding models compiled for Neuron to it.

## Deploying Your Model to an Endpoint

----

In this example, we deploy the [mistralai/Mathstral-7B-v0.1](https://huggingface.co/mistralai/Mathstral-7B-v0.1) model to an inf2.24xlarge. This model has been exported to the Neuron format by the Mistral-on-AWS team using the specific input shapes and compiler parameters detailed in the paragraphs below.

It has been compiled to run on an inf2.24xlarge instance on AWS. 
Note that this compilation uses 12 cores, which is every Neuron core on an inf2.24xlarge.

For demonstration purposes, we have compiled it with the input shapes below; feel free to recompile as needed.

SEQUENCE_LENGTH = 4096

BATCH_SIZE = 4

NUM_CORES = 12

PRECISION = "bf16"
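
The `NUM_CORES` value follows from the hardware: each Inferentia2 accelerator exposes 2 Neuron cores and 32 GB of accelerator memory. The sketch below derives the core count for each Inf2 size and gives a rough weights-only memory estimate. The 7 billion parameter count is a rounded assumption for a 7B model, and the estimate ignores the KV cache and runtime overhead, so treat it as a lower bound rather than a sizing guarantee.

```python
# Inferentia2 accelerators per instance size
INF2_ACCELERATORS = {
    "inf2.xlarge": 1,
    "inf2.8xlarge": 1,
    "inf2.24xlarge": 6,
    "inf2.48xlarge": 12,
}
CORES_PER_ACCELERATOR = 2
GB_PER_ACCELERATOR = 32
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def neuron_cores(instance_type):
    """Return the number of Neuron cores, accepting an optional 'ml.' prefix."""
    name = instance_type[3:] if instance_type.startswith("ml.") else instance_type
    return INF2_ACCELERATORS[name] * CORES_PER_ACCELERATOR


def weights_gb(num_params, precision):
    """Weights-only memory footprint in GB (no KV cache, no runtime overhead)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


print(neuron_cores("ml.inf2.24xlarge"))  # prints 12
print(weights_gb(7e9, "bf16"))           # prints 14.0
print(INF2_ACCELERATORS["inf2.24xlarge"] * GB_PER_ACCELERATOR)  # prints 192
```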

In production environments, to deploy Hugging Face 🤗 Transformers models on Neuron devices, you need to compile your models and export them to a serialized format before inference. Through Ahead-Of-Time (AOT) compilation with the Neuron Compiler (neuronx-cc or neuron-cc), models are converted to serialized and optimized TorchScript modules.

Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations:

- The input shapes and data types used during the compilation cannot be changed.

- Neuron models are specialized for each hardware and SDK version, which means:
1. Models compiled with Neuron can no longer be executed in a non-Neuron environment.
2. Models compiled for inf1 (NeuronCore-v1) are not compatible with inf2 (NeuronCore-v2), and vice versa.
3. Models compiled for an SDK version are (generally) not compatible with another SDK version.
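
Since the compiled shapes cannot be changed, it can help to check the endpoint settings against them before deploying. The following is a sketch under stated assumptions: it treats the serving limits as upper-bounded by the compiled shapes and requires the core count and precision to match exactly. `check_serving_config` is a hypothetical helper, not part of the `sagemaker` SDK or the container.

```python
def check_serving_config(config, compiled):
    """Return a list of mismatches between endpoint settings and compiled shapes."""
    problems = []
    if int(config["MAX_BATCH_SIZE"]) > compiled["batch_size"]:
        problems.append("MAX_BATCH_SIZE exceeds the compiled batch size")
    if int(config["MAX_TOTAL_TOKENS"]) > compiled["sequence_length"]:
        problems.append("MAX_TOTAL_TOKENS exceeds the compiled sequence length")
    if int(config["MAX_INPUT_LENGTH"]) >= int(config["MAX_TOTAL_TOKENS"]):
        problems.append("MAX_INPUT_LENGTH leaves no room for generated tokens")
    if int(config["HF_NUM_CORES"]) != compiled["num_cores"]:
        problems.append("HF_NUM_CORES differs from the compiled core count")
    if config["HF_AUTO_CAST_TYPE"] != compiled["precision"]:
        problems.append("HF_AUTO_CAST_TYPE differs from the compiled precision")
    return problems


# Shapes used to compile Mathstral in this notebook
compiled = {"sequence_length": 4096, "batch_size": 4, "num_cores": 12, "precision": "bf16"}
config = {
    "HF_NUM_CORES": "12",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_LENGTH": "4000",
    "MAX_TOTAL_TOKENS": "4096",
}
print(check_serving_config(config, compiled))  # prints []
```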


In [None]:
from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFaceModel
 
# sagemaker config
instance_type = "ml.inf2.24xlarge"
health_check_timeout=2400 # additional time to load the model
volume_size=512 # size in GB of the EBS volume
 
# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "nithiyn/mathstral-neuron", # replace with your model id if you are using your own model
    "HF_NUM_CORES": "12", # number of neuron cores
    "HF_AUTO_CAST_TYPE": "bf16",  # dtype of the model
    "MAX_BATCH_SIZE": "4", # max batch size for the model
    "HUGGING_FACE_HUB_TOKEN": "REPLACE WITH YOUR HF TOKEN",
    "MAX_INPUT_LENGTH": "4000", # max length of the input prompt, in tokens
    "MAX_TOTAL_TOKENS": "4096", # max total tokens (input plus generated text)
    "MESSAGES_API_ENABLED": "true", # Enable the messages API
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm_model._is_compiled_model = True # We have precompiled the model
 
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  volume_size=volume_size
)
 

After our endpoint is deployed, we can run inference with it using the `predict` method of the predictor. We can run inference with different parameters to influence the generation. Generation parameters such as `top_p` or `temperature` are sent as part of the payload. See the Text Generation Inference (TGI) documentation for the supported parameters.

The `Messages API` allows us to interact with the model in a conversational way. We can define the role of the message and the content. The role can be system, assistant, or user. The system role is used to provide context to the model, and the user role is used to ask questions or provide input to the model.

In [None]:
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ]
}

In [None]:
# Prompt to generate
messages=[
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ]
 
# Generation arguments
parameters = {
    "model": "nithiyn/mathstral-neuron", # placeholder, needed
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["</s>"],
}
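
To actually send the prompt, the messages and generation arguments are combined into one request body and passed to `predict`. Below is a minimal sketch. It assumes the generation arguments sit at the top level of the body next to `messages`, and that the response follows the OpenAI-style chat completion layout with a `choices` list. The helpers `build_chat_payload` and `extract_reply` are hypothetical, and `sample_response` is a hand-written stand-in, not real model output.

```python
def build_chat_payload(messages, parameters):
    """Merge messages and generation arguments into one Messages API request body."""
    return {"messages": messages, **parameters}


def extract_reply(response):
    """Pull the assistant text out of a chat-completion style response dict."""
    return response["choices"][0]["message"]["content"].strip()


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]
parameters = {
    "model": "nithiyn/mathstral-neuron",
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
}
payload = build_chat_payload(messages, parameters)

# With a deployed endpoint you would call: response = llm.predict(payload)
# Here a stand-in response (assumed layout) keeps the example runnable offline.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": " A stand-in reply. "}}]
}
print(extract_reply(sample_response))
```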

---
#### Streaming Responses with a Gradio Application

[Amazon SageMaker supports streaming responses](https://aws.amazon.com/de/blogs/machine-learning/elevating-the-generative-ai-experience-introducing-streaming-support-in-amazon-sagemaker-hosting/) from your model. Using this capability, let's build a Gradio application that streams responses.

The code for the Gradio application in the following steps can be found in [mistral_codestral.py](../gradio_neuron/mistral_codestral.py). The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs.
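
To give a sense of what consuming a streamed response involves, here is a sketch of a parser for server-sent-event style chunks. It is not the code from the Gradio application. It assumes each event arrives as a `data: {...}` line carrying an OpenAI-style `choices[0].delta.content` field and that the stream may end with `data: [DONE]`. `iter_stream_text` is a hypothetical helper, and the byte chunks are made up for the demonstration.

```python
import json


def iter_stream_text(chunks):
    """Yield text deltas from an iterable of raw byte chunks in SSE format."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # an event can be split across chunks, so only consume complete lines
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # skip blank separator lines and non-data fields
            data = line[len(b"data:"):].strip()
            if data == b"[DONE]":
                return
            event = json.loads(data)
            delta = event["choices"][0]["delta"].get("content")
            if delta:
                yield delta


# Made-up chunks; the second event is deliberately split across two chunks
fake_chunks = [
    b'data: {"choices": [{"delta": {"content": "Deep "}}]}\n\n',
    b'data: {"choices": [{"delta": {"con',
    b'tent": "learning"}}]}\n\ndata: [DONE]\n\n',
]
print("".join(iter_stream_text(fake_chunks)))  # prints Deep learning
```

Against a real endpoint, the chunks would typically come from the `PayloadPart` events in the body returned by the boto3 `sagemaker-runtime` client's `invoke_endpoint_with_response_stream` call; the exact event layout should be checked against the container's output.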

In [None]:
# add the gradio_neuron directory to the path
import sys
sys.path.append("gradio_neuron") 
from mathstral_chat import create_gradio_app
 
# create gradio app
create_gradio_app(
    llm.endpoint_name,           # Sagemaker endpoint name
    session=sess.boto_session,   # boto3 session used to send request 
    system_prompt="You are a helpful assistant called Mathstral.",
    share=True,                  # Share app publicly
)

---
## Benchmarking with llmperf

LLMPerf is a benchmarking tool designed to evaluate the performance of Large Language Models (LLMs) across various platforms, hardware configurations, and environments. It aims to standardize the process of measuring the efficiency, speed, and resource usage of LLMs by providing a set of tools, metrics, and frameworks that can be used to test different LLM implementations. We have forked this repository and modified its SageMaker client. Use the fork [nithiyn/llmperf](https://github.com/nithiyn/llmperf) for this notebook.


In [None]:
!git clone https://github.com/nithiyn/llmperf.git
!pip install -e llmperf/ --quiet

In [None]:
# tell llmperf that we are using the messages api
!MESSAGES_API=true python llmperf/token_benchmark_ray.py \
--model {llm.endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 100 \
--timeout 600 \
--num-concurrent-requests 5 \
--results-dir "results"

In [None]:
#summarize

In [None]:
import glob
import json
 
# Reads the summary.json file and prints the results
with open(glob.glob('results/*summary.json')[0], 'r') as file:
    data = json.load(file)
    
print("Concurrent requests: 5")
print(f"Avg. Input token length: {int(data['results_number_input_tokens_mean'])}")
print(f"Avg. Output token length: {int(data['results_number_output_tokens_mean'])}")
print(f"Avg. Time-to-first-Token: {data['results_ttft_s_mean']*1000:.2f}ms")
print(f"Avg. Inter-Token-Latency: {data['results_inter_token_latency_s_mean']*1000:.2f}ms/token")
print(f"Avg. Throughput: {data['results_mean_output_throughput_token_per_s']:.2f} tokens/sec")
print(f"Request per minute (RPM): {data['results_num_completed_requests_per_min']:.2f} req/min")
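
To make the reported metrics concrete, the sketch below computes time-to-first-token, inter-token latency, and output throughput for a single request from token arrival timestamps. This is an illustration of the definitions only, not llmperf's implementation, which may define or aggregate these values differently. `latency_metrics` and the timestamps are hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute per-request latency metrics from token arrival times (seconds)."""
    # time from sending the request until the first token arrives
    ttft = token_times[0] - request_start
    # gaps between consecutive tokens
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    # tokens generated per second over the whole request
    total = token_times[-1] - request_start
    throughput = len(token_times) / total
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": inter_token,
        "output_throughput_token_per_s": throughput,
    }


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT: {metrics['ttft_s'] * 1000:.2f}ms")
print(f"Inter-token latency: {metrics['inter_token_latency_s'] * 1000:.2f}ms/token")
print(f"Throughput: {metrics['output_throughput_token_per_s']:.2f} tokens/sec")
```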

## Conclusion

In this notebook, we've successfully gone over the process of compiling, deploying, and benchmarking Mathstral on Inferentia2.

---
## Distributors

- AWS
- Mistral