## Deploy Mathstral on Amazon SageMaker with TGI/LMI

---

[Mathstral](https://mistral.ai/news/mathstral/) is an open-weight generative AI model designed for math and reasoning tasks. It achieves state-of-the-art reasoning capabilities in its size category across industry-standard benchmarks: 56.6% on MATH and 63.47% on MMLU. The figure below shows the MMLU performance difference by subject between Mathstral 7B and Mistral 7B.

---

As a 7B model, Mathstral sets a new standard in the performance/latency space for math and reasoning compared to similar models. It can achieve significantly better results with more inference-time computation: Mathstral 7B scores 68.37% on MATH with majority voting and 74.59% with a strong reward model among 64 candidates.

![mathstral](imgs/mathstral-benchmarks.png)
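Majority voting, as mentioned above, can be sketched in a few lines: sample several candidate answers for the same problem and keep the most frequent final answer. The function name and the sample answers below are illustrative assumptions and are not taken from Mistral's evaluation code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled candidates."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    # Strip whitespace so "42" and "42 " count as the same answer.
    counts = Counter(a.strip() for a in answers)
    # most_common(1) returns [(answer, count)] for the top answer.
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from 5 sampled generations.
candidates = ["42", "41", "42 ", "42", "40"]
voted = majority_vote(candidates)
print(voted)
```

In practice the candidates would come from repeated calls to the endpoint with a non-zero temperature, followed by extracting the final answer from each response.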

In this notebook, you will learn how to deploy the [mistralai/mathstral-7B-v0.1](https://huggingface.co/mistralai/mathstral-7B-v0.1) model to [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and perform inference. We will use the Hugging Face LLM DLC, a purpose-built inference container designed to facilitate the deployment of Large Language Models (LLMs) in a secure and managed environment. This Deep Learning Container (DLC) is powered by <b>Text Generation Inference (TGI)</b>, a scalable and optimized solution for deploying and serving LLMs efficiently. You can also use the <b>Large Model Inference (LMI)</b> container as an alternative DLC within this notebook. LMI containers are specialized Docker containers for LLM inference, provided by AWS. With these containers, you can use high-performance open-source inference libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. Instance requirements are also provided to help you choose a suitable deployment configuration.




In the notebook, we will cover how to:
1. [Set up environment](#1-set-up-environment)
2. [Retrieve the DLC](#2-retrieve-the-dlc)
3. [Hardware requirements](#3-hardware-requirements)
4. [Deploy Mathstral to Amazon SageMaker](#4-deploy-mathstral-to-amazon-sagemaker)
5. [Run inference and chat with the model](#5-run-inference-and-chat-with-the-model)
6. [Clean up](#6-clean-up)
7. [Conclusion](#7-conclusion)




<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Mathstral is a 7B open-weight model licensed under Apache 2.0.
</div>

### 1. Set up environment

#### Local Setup (Optional)

On a local server, follow these steps to run this Jupyter notebook:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

By following these steps, you can set up a local Jupyter Notebook environment capable of deploying machine learning models on Amazon SageMaker using the appropriate IAM role for granting the necessary permissions.

---

#### Prerequisites

This Jupyter Notebook can be run on a t3.medium instance (ml.t3.medium). However, to deploy `Mathstral`, you may need to request a quota increase. 

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.g5.12xlarge` for endpoint usage 
4. If needed, request a quota increase for these resources.
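The quota can also be checked programmatically. The sketch below assumes the Service Quotas API returns entries shaped like `{"QuotaName": ..., "Value": ...}` and that the quota is named `ml.g5.12xlarge for endpoint usage`, as in the list above. The helper names are hypothetical, and `list_sagemaker_quotas` requires AWS credentials, so it is defined but not called here.

```python
def find_quota(quotas, quota_name):
    """Return the numeric value of the named quota, or None if absent."""
    for quota in quotas:
        if quota.get("QuotaName") == quota_name:
            return quota.get("Value")
    return None

def list_sagemaker_quotas(region_name):
    """Fetch SageMaker quotas via boto3 (needs AWS credentials; not called here)."""
    import boto3  # imported lazily so find_quota works without boto3 installed
    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas

# Illustrative data shaped like the API response (values are made up).
sample = [
    {"QuotaName": "ml.g5.12xlarge for endpoint usage", "Value": 0.0},
    {"QuotaName": "ml.t3.medium for notebook instance usage", "Value": 2.0},
]
current = find_quota(sample, "ml.g5.12xlarge for endpoint usage")
print(current)
```

A value of 0 would mean that a quota increase is needed before the endpoint can be created.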

---

#### Requirements

To deploy Mathstral with the `sagemaker` Python SDK, you need a configured AWS account and the `sagemaker` Python SDK installed.

1. Create an [Amazon SageMaker notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html). For the notebook instance type, choose `ml.t3.medium`.

2. For Select Kernel, choose [conda_pytorch_p310](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).

3. Install the required packages.

4. Set up your [Hugging Face token](https://huggingface.co/docs/transformers.js/en/guides/private):
   - User Access Tokens are the preferred way to authenticate an application to Hugging Face services.
   - To generate an access token, navigate to the Access Tokens tab in your settings and click the New token button.
   - Choose a name for your token and click Generate a token (we recommend keeping the “Role” as read-only). You can then click the Copy button next to your newly created token to copy it to your clipboard.
   - Paste this token into the `HUGGING_FACE_HUB_TOKEN` parameter under `config` in the deployment section below.
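As an alternative to pasting the token into the notebook, it can be read from an environment variable so that it never ends up in a saved cell. This is a minimal sketch: the variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions made for illustration, not something this notebook defines.

```python
import os

def load_hf_token(env=None, var_name="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before deploying")
    return token

# Example with an explicit mapping; the token value is a placeholder.
demo_token = load_hf_token({"HF_TOKEN": "hf_example_placeholder"})
print(len(demo_token))
```

The returned value can then be used for `HUGGING_FACE_HUB_TOKEN` in `config`.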

In [1]:
# Install required packages to run this notebook
!pip install sagemaker --upgrade --quiet # NOTE: this notebook requires sagemaker version 2.219.0 or later
!pip install gradio  --upgrade --quiet

---
#### Import Necessary Libraries

In the below section, we import the necessary libraries to run this notebook.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b> To run the Gradio app for Mathstral, make sure you have cloned or copied the <b>Mathstral_chat_ui</b> subfolder, which contains the Mathstral_chat module.

</div>


In [2]:
import gradio as gr
import json
import os
import sagemaker
import sys
import time
from packaging import version
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri


print(sagemaker.__version__)
# Compare parsed versions; plain string comparison would misorder e.g. "2.22.0" and "2.219.0"
if version.parse(sagemaker.__version__) < version.parse("2.219.0"):
    print("You need to upgrade or restart the kernel if you already upgraded")

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
2.232.3


---
### 2. Retrieve the DLC

#### 2.a. Retrieve the latest HuggingFace DLC

The first step is to retrieve the DLC URI, which is passed to the `HuggingFaceModel` class through its `image_uri` parameter. The DLC is a pre-configured Docker image that bundles the dependencies and frameworks required to run our LLM efficiently in the SageMaker environment.
To streamline this process, the `sagemaker` SDK provides the `get_huggingface_llm_image_uri` function, which retrieves a suitable Hugging Face LLM DLC URI based on two key parameters:

<b>backend</b>: The deep learning inference framework. This notebook uses `huggingface` (TGI) and `lmi`.

<b>region</b>: The AWS region where you're deploying your model. It's important to use the correct region to ensure optimal performance and compliance with data residency requirements.


In [3]:
# retrieve the huggingface llm image uri
tgi_image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", #tgi
  region=region
)

# print ecr image uri
print(f"huggingface llm image uri: {tgi_image_uri}")

huggingface llm image uri: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04
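To confirm that the retrieved image matches the intended region and TGI version, the URI can be split into its parts. The helper below is a small sketch that assumes the standard ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`; `parse_ecr_image_uri` is a hypothetical name, not part of the `sagemaker` SDK.

```python
def parse_ecr_image_uri(image_uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    registry, _, remainder = image_uri.partition("/")
    repository, _, tag = remainder.partition(":")
    # Registry host looks like <account>.dkr.ecr.<region>.amazonaws.com
    host_parts = registry.split(".")
    return {
        "account": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "tag": tag,
    }

uri = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04"
)
info = parse_ecr_image_uri(uri)
print(info["region"], info["repository"], info["tag"])
```

Comparing `info["region"]` with the session's `region` is a quick sanity check before deploying.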


#### 2.b. Retrieve the latest LMI container

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can use high-performance open-source inference libraries such as vLLM and DeepSpeed to deploy LLMs on Amazon SageMaker endpoints.



In [4]:
# retrieve the lmi image uri
lmi_image_uri = get_huggingface_llm_image_uri(
  backend="lmi", 
  region=region
)

# print ecr image uri
print(f"lmi image uri: {lmi_image_uri}")

lmi image uri: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.24.0-deepspeed0.10.0-cu118


The section above only shows how LMI images can be retrieved. To deploy Mathstral to SageMaker with `LMI`, the deployment cell notes that `image_uri` can be switched to the LMI image if needed.
The sections below use `TGI` as the default deep learning inference framework.

---
### 3. Hardware requirements

[Mathstral](https://huggingface.co/mistralai/mathstral-7B-v0.1) is a 7-billion parameter open-weight model with a 32k context length. For small-scale math and reasoning tasks, the `g5.12xlarge` with 96GB of VRAM comfortably accommodates the model's memory requirements.

> For the purpose of this notebook, we will deploy only the unquantized version of the model to a SageMaker endpoint with TGI on the g5.12xlarge.


| Model                                                                       | Instance Type       | Quantization | NUM_GPUS | VRAM |
|-----------------------------------------------------------------------------|---------------------|--------------|----------|------|
| [Mathstral](https://huggingface.co/mistralai/mathstral-7B-v0.1) | `(ml.)g5.12xlarge` | `-` / bitsandbytes (8-bit)        | 4        | 96GB |
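A back-of-the-envelope estimate shows why the unquantized model fits comfortably. The sketch assumes roughly 7 billion parameters stored as 16-bit values (2 bytes each) and 4 GPUs with 24GB each; it counts weights only and ignores the KV cache and activations, which use part of the remaining headroom.

```python
def weights_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumptions: ~7 billion parameters, 16-bit weights, 4 GPUs x 24 GB each.
weights_gb = weights_memory_gb(7e9)
total_vram_gb = 4 * 24
headroom_gb = total_vram_gb - weights_gb
print(f"weights ~{weights_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB of {total_vram_gb} GB")
```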




---
### 4. Deploy Mathstral to Amazon SageMaker

To deploy [Mathstral](https://huggingface.co/mistralai/mathstral-7B-v0.1) to Amazon SageMaker, we create a `HuggingFaceModel` model class and define our endpoint configuration, including the `hf_model_id`, `instance_type`, and `huggingface_hub_token`. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory. If you use a different instance type, change `number_of_gpus` to match it (refer to the table above).

The `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` parameters can be altered as needed. We set `health_check_timeout` to 900 seconds to give the model enough time to load into memory; this can also be adjusted as needed. If your container is correctly set up and the CloudWatch logs indicate a health check timeout, increase this timeout so the container has enough time to respond to health checks.

In [5]:

# SageMaker endpoint configuration
instance_type = "ml.g5.12xlarge"   
number_of_gpus = 4                 # number of GPUs on the chosen instance
health_check_timeout = 900         # seconds allowed for the model to load and pass health checks
max_total_tokens = 32768           # max total tokens (input + output) for Mathstral


config = {
  'HF_MODEL_ID': "mistralai/mathstral-7B-v0.1",                                    # model_id from HuggingFace
  'HF_TASK': "text-generation",                                                     # huggingface inference pipeline
  'SM_NUM_GPUS': json.dumps(number_of_gpus),                                        # Number of GPU used per replica
  'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>",                            #add your huggingface hub access token with read permissions
  'MAX_INPUT_TOKENS': json.dumps(max_total_tokens - 1),                             # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(max_total_tokens),                                 # Max length of the generation (including input text)
  # not supported currently - 'HF_MODEL_QUANTIZE': 'bitsandbytes' # 8-bit quantization would reduce memory use at the cost of some precision
}

# check if token is set
assert config['HUGGING_FACE_HUB_TOKEN'] !="<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=tgi_image_uri, #switch to lmi if needed
  env=config
)
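The two token limits in `config` interact with the per-request `max_new_tokens`: the prompt length plus `max_new_tokens` has to fit within `MAX_TOTAL_TOKENS`. The helper below is a sketch of that arithmetic under this assumption; `check_token_budget` is a hypothetical name and is not part of TGI or the `sagemaker` SDK.

```python
def check_token_budget(max_input_tokens, max_total_tokens, max_new_tokens):
    """Validate the limits and return the longest prompt that still fits."""
    if max_input_tokens >= max_total_tokens:
        raise ValueError("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    if max_new_tokens >= max_total_tokens:
        raise ValueError("max_new_tokens must leave room for at least one input token")
    # Longest prompt that still leaves room for max_new_tokens of output.
    return min(max_input_tokens, max_total_tokens - max_new_tokens)

# Values from this notebook: 32768 total tokens, 3000 new tokens per request.
longest_prompt = check_token_budget(32767, 32768, 3000)
print(longest_prompt)
```

With these settings, a request asking for 3000 new tokens can carry a prompt of at most 29768 tokens.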

After we have created the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs.

In [6]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 900 seconds (15 minutes) to be able to load the model
)


----------------!

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

---
### 5. Run inference and chat with the model

After our endpoint is deployed, we can run inference on it. Generation parameters are defined in the `parameters` attribute of the payload.

Mistral models have the following prompt structure:
  
```
<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]
```
Let's now define our prompt and set the parameters for our payload.

##### Sample math & reasoning generation questions:
1. "Please provide a step-by-step reasoning process to estimate the number of stars in our galaxy, the Milky Way. Break down the calculation into logical steps, explaining your thought process and any assumptions you make along the way. Use scientific notation where appropriate, and conclude with a final estimate."
2. "What were the main reasons Frank Lloyd Wright designed his Oak Park Studio with high windows placed near the ceiling, and how did this feature reflect his architectural philosophy and the needs of an architecture firm's workspace?"
3. "Escape velocity from a neutron star: Given: A neutron star has a mass of 2.5 solar masses and a radius of 12 km. Task: Calculate the escape velocity from the surface of this neutron star in km/s"
4. "Calculating the orbital period of an exoplanet: Given: An exoplanet orbits its star at a distance of 2.5 AU (Astronomical Units). The star has a mass of 1.2 solar masses. Task: Calculate the orbital period of the exoplanet in Earth years"
5. "How did the Catholic Church's Counter-Reformation influence the development and characteristics of Baroque art, particularly in terms of its emotional intensity and use of dramatic techniques?"
6. "How is the rise of AI-generated art challenging traditional notions of creativity, authorship, and artistic value, and what are the potential long-term implications for human artists, art institutions, and the art market?"

In [8]:
# Math and reasoning prompt
prompt = "<s> [INST] Please provide a step-by-step reasoning process to estimate the number of stars in our galaxy, the Milky Way. Break down the calculation into logical steps, explaining your thought process and any assumptions you make along the way. Use scientific notation where appropriate, and conclude with a final estimate. [/INST]"

# payload params
payload = {
    "top_p": 0.6,
    "temperature": 0.1,
    "top_k": 50,
    "max_new_tokens": 3000, # change as needed; if the model response times out, use a larger instance or decrease max_new_tokens
    "stop": ["</s>"]
}

Okay, let's test it.

In [9]:
# Send the prompt and generation parameters to the endpoint
chat = llm.predict({"inputs":prompt, "parameters":payload})

print(chat[0]["generated_text"])

<s> [INST] Please provide a step-by-step reasoning process to estimate the number of stars in our galaxy, the Milky Way. Break down the calculation into logical steps, explaining your thought process and any assumptions you make along the way. Use scientific notation where appropriate, and conclude with a final estimate. [/INST] Estimating the number of stars in the Milky Way is a complex task that involves several steps and assumptions. Here's a simplified version of the process:

1. **Determine the size of the Milky Way**: The Milky Way is a barred spiral galaxy, and its size can be estimated by observing its visible structure. The diameter of the Milky Way is estimated to be about 100,000 light-years.

2. **Estimate the density of stars**: The density of stars in the Milky Way can vary greatly depending on where you are in the galaxy. For simplicity, let's assume an average density of 1 star per cubic light-year.

3. **Calculate the volume of the Milky Way**: The volume of a sphere 
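The sample output above begins with the prompt itself. If only the model's answer is wanted, the echoed prompt can be removed on the client side. This is a minimal sketch with a hypothetical helper name; TGI also documents a `return_full_text` request parameter, and whether it changes this behavior for your container version is something to verify rather than assume.

```python
def strip_prompt_echo(prompt, generated_text):
    """Remove the prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text

# Made-up prompt and output, shaped like the response shown above.
demo_prompt = "<s> [INST] What is 2 + 2? [/INST]"
demo_output = demo_prompt + " 2 + 2 = 4"
answer = strip_prompt_echo(demo_prompt, demo_output)
print(answer)
```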

---
#### Streaming Responses with a Gradio Application

[Amazon SageMaker supports streaming responses](https://aws.amazon.com/de/blogs/machine-learning/elevating-the-generative-ai-experience-introducing-streaming-support-in-amazon-sagemaker-hosting/) from your model. Using this capability, in the below section, let's build a gradio application to stream responses.

The code for the Gradio application in the following steps can be found in [Mathstral_chat.py](../Mathstral_chat_ui/Mathstral_chat.py). The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs.

In [10]:
# add directory to path
sys.path.append("Mathstral_chat_ui") 
from Mathstral_chat import create_gradio_app
# params
parameters = {
    "top_p": 0.6,
    "temperature": 0.1,
    "top_k": 50,
    "max_new_tokens": 30000,
    "stop": ["</s>"]
}

# define format function for our input
def format_prompt(message, history, system_prompt):
    # Mistral template: <s> [INST] user 1 [/INST] answer 1</s> [INST] user 2 [/INST]
    prompt = "<s>"
    # replay earlier turns, closing each model answer with </s>
    for user_prompt, bot_response in history:
        prompt += f" [INST] {user_prompt} [/INST] {bot_response}</s>"
    # append the new user message
    prompt += f" [INST] {message} [/INST] "
    return prompt

# create gradio app
create_gradio_app(
    llm.endpoint_name,           # Sagemaker endpoint name
    session=sess.boto_session,   # boto3 session used to send request 
    parameters=parameters,       # Request parameters
    system_prompt=None,          # System prompt to use -> Mistral does not support system prompts
    format_prompt=format_prompt, # Function to format prompt
    share=True,                  # Share app publicly
)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://e6db99270d7258a0f1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
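Outside the Gradio app, a streamed response can also be consumed directly with boto3's `invoke_endpoint_with_response_stream`. The sketch below makes several assumptions: that the container emits server-sent-event lines of the form `data:{...}` with a `token` object holding `text` and `special` fields, and that the request body must include `"stream": True`. Both function names are hypothetical. Only the parser runs here, on synthetic bytes; `stream_endpoint` needs a live endpoint.

```python
import json

def iter_stream_tokens(byte_chunks):
    """Yield token texts from an iterable of raw byte chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Process only complete lines; keep any partial line in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            event = json.loads(line[len(b"data:"):])
            token = event.get("token") or {}
            # Skip special tokens such as the end-of-sequence marker.
            if not token.get("special", False):
                yield token.get("text", "")

def stream_endpoint(endpoint_name, body, boto_session):
    """Call the endpoint with streaming enabled (requires AWS; not run here)."""
    client = boto_session.client("sagemaker-runtime")
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    chunks = (e["PayloadPart"]["Bytes"] for e in response["Body"] if "PayloadPart" in e)
    return iter_stream_tokens(chunks)

# Synthetic chunks; the second event is deliberately split across two chunks.
fake_chunks = [
    b'data:{"token":{"text":"Hel","special":false}}\n\n',
    b'data:{"token":{"text":"lo",',
    b'"special":false}}\n\n',
    b'data:{"token":{"text":"</s>","special":true}}\n\n',
]
streamed = "".join(iter_stream_tokens(fake_chunks))
print(streamed)
```

Buffering until a full line arrives matters because payload parts are not guaranteed to align with event boundaries.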


---
### 6. Clean up

To clean up, we can delete the model and endpoint.


In [None]:
llm.delete_model()
llm.delete_endpoint()
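Because a running endpoint keeps incurring charges, it can be worth confirming that the deletion took effect. The sketch below checks whether an endpoint name is still listed, using the SageMaker `list_endpoints` call with `NameContains`. The helper name and the fake client are illustrative assumptions, and pagination is ignored for brevity.

```python
def endpoint_exists(sm_client, endpoint_name):
    """Return True if an endpoint with exactly this name is still listed."""
    response = sm_client.list_endpoints(NameContains=endpoint_name)
    return any(e["EndpointName"] == endpoint_name for e in response["Endpoints"])

class FakeSageMakerClient:
    """Stand-in for boto3's SageMaker client, for illustration only."""
    def __init__(self, names):
        self._names = names

    def list_endpoints(self, NameContains=""):
        matches = [n for n in self._names if NameContains in n]
        return {"Endpoints": [{"EndpointName": n} for n in matches]}

# Simulate an account where the Mathstral endpoint has been deleted.
fake = FakeSageMakerClient(["other-endpoint"])
still_there = endpoint_exists(fake, "mathstral-endpoint")
print(still_there)
```

With a real session, `sess.sagemaker_client` could be passed in place of the fake client, using an endpoint name saved before deletion.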

### 7. Conclusion

In this notebook, we've explored the process of deploying the mistralai/mathstral-7B-v0.1 model on Amazon SageMaker and performing inference using the Hugging Face LLM Deep Learning Container. By leveraging the power of Text Generation Inference (TGI), we've demonstrated how to efficiently deploy and serve this large language model in a secure, managed environment.
We've walked through the steps of setting up the SageMaker environment, configuring the model deployment, and showcasing both standard inference and streaming responses. The integration with a Gradio application further illustrates the practical application of this deployment, enabling interactive and user-friendly access to the model's capabilities.

---