## Deploy Mistral NeMo 12B on Amazon SageMaker with HuggingFace LLM DLC

---

[Mistral NeMo](https://mistral.ai/news/mistral-nemo/) is a 12-billion (12B) parameter model developed in collaboration between Mistral AI and NVIDIA. The model provides a large context window of up to 128k tokens, and Mistral AI reports state-of-the-art reasoning, world knowledge, and coding accuracy within its size category.

Mistral NeMo is designed as a drop-in replacement for systems that use the Mistral 7B model. It is available under the Apache 2.0 license to foster widespread adoption among researchers and enterprises.

Mistral NeMo employs the Tekken tokenizer, which is more efficient than the tokenizers used in previous Mistral models, particularly in compressing source code and various languages. The model provides robust multilingual capabilities, covering languages such as English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Additionally, the model has undergone advanced fine-tuning to excel in following instructions, reasoning, multi-turn conversations, and code generation. Pre-trained base and instruction-tuned checkpoints are accessible on HuggingFace.

---

Mistral NeMo 12B base model performance and accuracy compared to Gemma 2 9B and Llama 3 8B:

![nemo](imgs/nemo-base-performance.png)

Mistral NeMo 12B instruction fine-tuned results:

![nemo](imgs/nemo-instruct-performance.png)

---

In this notebook, you will learn how to deploy the [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) and [mistralai/Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) models to [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and perform inference. The walkthrough deploys the Instruct checkpoint; the Base checkpoint can be deployed the same way by changing the `HF_MODEL_ID`. We will use the Hugging Face LLM DLC, a purpose-built inference container designed to facilitate the deployment of Large Language Models (LLMs) in a secure and managed environment. This Deep Learning Container (DLC) is powered by <b>Text Generation Inference (TGI)</b>, a scalable and optimized solution for deploying and serving LLMs efficiently.

In the notebook, we will cover how to:
1. Set up environment
2. Retrieve the DLC
3. Model architecture and hardware requirements
4. Deploy Mistral NeMo to Amazon SageMaker
5. Run inference and chat with the model
6. Clean up
7. Conclusion




---
## 1. Set up environment




#### Prerequisites

You can execute this notebook either in SageMaker Studio or locally. If you are new to Amazon SageMaker, please follow the guidance provided here: [Guide to getting set up with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html).

For a local setup, please make sure to configure the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your credentials, install the necessary libraries, and ensure you have [permissions for SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

Whether you work locally or in SageMaker Studio, also check your instance quota. The notebook itself can run on an ml.t3.medium instance, but deploying the model to a SageMaker endpoint may require a quota increase. To request a quota increase, follow these steps:

1. Navigate to the Service Quotas console.
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
    - ml.g5.12xlarge for endpoint usage
4. If needed, request a quota increase for these resources.


#### Install necessary packages and dependencies

In [None]:
!pip install sagemaker pip boto3 botocore python-dotenv --upgrade  --quiet

##### Import necessary libraries

In the below section, we import the necessary libraries to run this notebook.

In [16]:
import boto3
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
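from packaging import version  # installed as a dependency of the sagemaker SDK

print(sagemaker.__version__)
# Compare parsed versions: a plain string comparison would rank "2.99.0" above "2.219.0"
if version.parse(sagemaker.__version__) < version.parse("2.219.0"):
    print("You need to upgrade or restart the kernel if you already upgraded")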

sess = sagemaker.Session()

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

2.219.0


Couldn't call 'get_role' to get Role ARN from role name power-user-plus to get Role path.


In [17]:
from dotenv import load_dotenv
load_dotenv()

import os

HF_HUB_TOKEN = os.environ.get('HF_HUB_TOKEN')



---
## 2. Retrieve the HuggingFace Deep Learning Container (DLC)

The first step is to retrieve the DLC URI. This URI is crucial as it serves as a reference point for the HuggingFaceModel class, specifically through the image_uri parameter. The DLC is a pre-configured Docker image that encapsulates all the necessary dependencies and frameworks required to run our LLM efficiently in the SageMaker environment.
To streamline this process, the sagemaker SDK provides a specialized method called `get_huggingface_llm_image_uri`. This method is designed to retrieve the most suitable Hugging Face LLM DLC URI based on two key parameters:

<b>backend</b>: This specifies the inference backend of the container, for example `huggingface` for TGI or `lmi`.

<b>region</b>: This refers to the AWS region where you're deploying your model. It's important to use the correct region to ensure optimal performance and compliance with data residency requirements.

In this notebook we instead set the TGI image URI manually, using the latest release listed in the [deep learning containers repository](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true) at the time of writing.


In [18]:
tgi_image_uri = "763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

print(f"huggingface llm image uri: {tgi_image_uri}")

huggingface llm image uri: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0
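
The hard-coded URI embeds a specific region (`eu-central-1`), while SageMaker typically expects the container image to come from a registry in the same region as the endpoint. The sketch below shows one way to point the same image tag at another region by rewriting the registry host. `retarget_image_uri` is a hypothetical helper, not part of the SageMaker SDK, and it assumes the DLC is published under the same registry account ID in the target region, which holds for many but not all AWS regions.

```python
import re

# Matches ECR image URIs of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)


def retarget_image_uri(image_uri: str, region: str) -> str:
    """Return image_uri with its region replaced (hypothetical helper)."""
    match = ECR_URI_PATTERN.match(image_uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {image_uri}")
    # Assumption: the same registry account hosts the image in the target region.
    return (
        f"{match['account']}.dkr.ecr.{region}.amazonaws.com/"
        f"{match['repo']}:{match['tag']}"
    )


source_uri = (
    "763104351884.dkr.ecr.eu-central-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"
)
print(retarget_image_uri(source_uri, "us-east-1"))
```

In the notebook, the target region would typically come from `sess.boto_region_name`.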


---
## 3. Model architecture and hardware requirements

[Mistral NeMo](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) is a 12-billion parameter open-weight model with a 128k context length. 

Taking into account the model architecture (parameters, precision factor, ...) we can define the model loading size and KV cache size. Further taking into account batch size and token length we can approximate GPU vRAM requirement. 

For small scale tasks, the `g5.12xlarge` with 96GB VRAM comfortably accommodates the model's memory requirements.

If you run into CUDA out-of-memory errors, you can adjust the hyperparameters (e.g. reduce the token length), leverage quantization, or switch to a larger instance.
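
The sizing reasoning in this section can be sketched as a back-of-the-envelope calculation. The architecture values used here (40 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published model configuration rather than something measured in this notebook, and the estimate ignores activations, CUDA context, and framework overhead.

```python
# Back-of-the-envelope GPU memory estimate.
# The architecture values are assumptions, not measured in this notebook.
NUM_PARAMS = 12e9        # 12B parameters
BYTES_PER_PARAM = 2      # 16-bit weights; use 1 for 8-bit quantization
NUM_LAYERS = 40          # assumed number of transformer layers
NUM_KV_HEADS = 8         # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128           # assumed dimension per attention head
BYTES_PER_KV_VALUE = 2   # KV cache kept in 16-bit precision


def weights_gb(num_params=NUM_PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9


def kv_cache_bytes_per_token():
    """KV cache size for one token: a key and a value per layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_KV_VALUE


def kv_cache_gb(batch_size, tokens_per_sequence):
    """KV cache size for a full batch, in GB."""
    return batch_size * tokens_per_sequence * kv_cache_bytes_per_token() / 1e9


def total_gb(batch_size, tokens_per_sequence):
    """Weights plus KV cache, in GB (overheads not included)."""
    return weights_gb() + kv_cache_gb(batch_size, tokens_per_sequence)


print(f"weights: {weights_gb():.1f} GB")
print(f"KV cache per token: {kv_cache_bytes_per_token() / 1024:.0f} KiB")
print(f"batch of 4 x 8,192 tokens: {total_gb(4, 8192):.1f} GB")
```

Under these assumptions the weights take about 24 GB and each cached token about 160 KiB, so long sequences and large batches are what push the total towards the 96 GB available across the four GPUs.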

> For the purpose of this notebook, we deploy the model to a SageMaker endpoint with TGI on the g5.12xlarge. Note that the configuration in section 4 sets `HF_MODEL_QUANTIZE` to `fp8`; remove that entry to deploy the unquantized weights.


| Model                                                                       | Instance Type       | Quantization | NUM_GPUS | VRAM |
|-----------------------------------------------------------------------------|---------------------|--------------|----------|------|
| [Mistral NeMo Instruct](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | `(ml.)g5.12xlarge` | `-` / fp8 (8-bit)        | 4        | 96GB |




---
## 4. Deploy Mistral NeMo to Amazon SageMaker

To deploy [Mistral NeMo](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration, including the `HF_MODEL_ID`, the `instance_type`, and the `HUGGING_FACE_HUB_TOKEN`. We will use a `g5.12xlarge` instance type, which has 48 vCPUs, 192 GB of memory, 4 NVIDIA A10G GPUs and 96GB of GPU memory. If you use a different instance type, you will also need to change `number_of_gpus` to match it (refer to the table above).

We set the `health_check_timeout` to 900 seconds to give the model enough time to load into memory. This can be adjusted as needed. If your container is correctly set up and the CloudWatch logs indicate a health check timeout, you should increase this timeout so the container has enough time to respond to health checks.

In [19]:

# SageMaker endpoint configuration
endpoint_name = "mistral-nemo-instruct-hf-dlc-tgi"
instance_type = "ml.g5.12xlarge"
number_of_gpus = 4                 # number of GPUs on the instance in use
health_check_timeout = 900         # seconds allowed for the model to load

config = {
  'HUGGING_FACE_HUB_TOKEN': HF_HUB_TOKEN,                     # add your huggingface hub access token with read permissions
  'HF_MODEL_ID': "mistralai/Mistral-Nemo-Instruct-2407",      # model_id from HuggingFace
  'HF_TASK': "text-generation",                               # HuggingFace inference pipeline
  'SM_NUM_GPUS': json.dumps(number_of_gpus),                  # Number of GPU used per replica
  'HF_MODEL_QUANTIZE': 'fp8'                                  # TGI can quantize the model to 8-bit quantization to further improve performance at the cost of a certain degree of loss to precision
}

llm_model = HuggingFaceModel(
  role=role,
  image_uri=tgi_image_uri, 
  env=config
)
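
The `config` dictionary is handed to the container as environment variables, which must be strings; this is why `number_of_gpus` goes through `json.dumps`. The sketch below wraps that convention in a hypothetical helper, `build_tgi_env`, and adds two token-budget settings as one way to act on the out-of-memory advice from section 3. `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS` are assumed to be honoured by the TGI launcher in this container version, and the limits of 4096 and 8192 are illustrative values, not recommendations.

```python
import json


def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens, hub_token=None):
    """Build a string-only environment for the TGI container (hypothetical helper)."""
    # The prompt budget has to leave room for at least one generated token
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),                  # "4", not 4
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),     # assumed TGI setting: prompt tokens
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),     # assumed TGI setting: prompt + generated
    }
    if hub_token:
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


example_env = build_tgi_env(
    model_id="mistralai/Mistral-Nemo-Instruct-2407",
    num_gpus=4,
    max_input_length=4096,   # illustrative value
    max_total_tokens=8192,   # illustrative value
)
print(example_env)
```

A smaller token budget shrinks the KV cache the server has to reserve, at the cost of rejecting longer prompts.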

After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs.

In [20]:
# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 15 minutes (900 s) to be able to load the model
  endpoint_name=endpoint_name
)


-----------!

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

---
## 5. Run inference and chat with the model

After our endpoint is deployed we can run inference on it. Mistral instruction-tuned models expect the following prompt structure:
  
```
<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]
```
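
For multi-turn conversations, the structure above can be assembled programmatically. The following is a minimal sketch; `build_prompt` is a hypothetical helper, and its whitespace handling simply mirrors the template shown here. The chat template shipped with the model's tokenizer remains the authoritative reference.

```python
def build_prompt(turns):
    """Format (user, assistant) turns into the [INST] prompt structure.

    `turns` is a list of (user_message, assistant_reply) tuples; the reply of
    the final turn is None because the model is expected to generate it.
    """
    prompt = "<s>"
    for user_message, assistant_reply in turns:
        prompt += f" [INST] {user_message} [/INST]"
        if assistant_reply is not None:
            # Completed model answers are closed with the end-of-sequence token
            prompt += f" {assistant_reply}</s>"
    return prompt


conversation = [
    ("User Instruction 1", "Model answer 1"),
    ("User instruction 2", None),
]
print(build_prompt(conversation))
```

With the two turns above, the helper reproduces the template line exactly.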

Additional generation parameters can be passed in the `parameters` attribute of the payload. Let's now define the parameters and the prompt for our payload.

In [None]:
# payload params
payload = {
    "temperature": 0.1, # controls the randomness of the predictions
    "top_p": 0.6, # nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches top_p
    "top_k": 50, # sample only from the k most likely tokens
    "max_new_tokens": 4000, # specifies the maximum number of tokens to generate in the response
    "stop": ["</s>"]
}
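
To make the effect of `top_k` and `top_p` more concrete, here is a conceptual sketch on a toy next-token distribution. It is not TGI's implementation, which works on logits and may order the steps differently; `filter_top_k_top_p` and the toy probabilities are made up for illustration, and temperature is left out.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return the renormalised distribution after top-k, then top-p filtering."""
    # Rank tokens from most to least likely and keep at most top_k of them
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break  # smallest prefix whose probability mass reaches top_p
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_distribution = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
filtered = filter_top_k_top_p(toy_distribution, top_k=3, top_p=0.6)
print(filtered)
```

Here `top_k=3` drops the least likely candidate, and `top_p=0.6` then keeps only "the" and "a", because together they already cover at least 60% of the probability mass. Lower values of either parameter narrow the set of tokens the model can sample from.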

##### General knowledge question

In [36]:
prompt = "<s> [INST]What is the history behind Artificial Intelligence? [/INST]"

chat = llm.predict({"inputs":prompt, "parameters":payload})

print(chat[0]["generated_text"])

<s> [INST]What is the history behind Artificial Intelligence? [/INST]Artificial Intelligence (AI) has a rich history that spans over seven decades, evolving from a theoretical concept to the powerful technology it is today. Here's a brief overview of its history:

1. **Alan Turing (1936-1950)**: The concept of AI can be traced back to the work of British mathematician and computer scientist Alan Turing. In 1936, he proposed the idea of a "universal machine" that could carry out calculations of any complexity, given enough time and resources. In 1950, he introduced the "Turing Test" to determine whether a machine could exhibit intelligent behavior indistinguishable from that of a human.

2. **John McCarthy Coins the Term "Artificial Intelligence" (1956)**: While working at Dartmouth College, John McCarthy coined the term "Artificial Intelligence" in a proposal for a conference on the subject. This conference is often considered the birth of the AI field.

3. **Early AI Programs (1950s-1

##### Sample math & reasoning generation question

In [39]:
prompt = "<s> [INST] Please provide a step by step reasoning process to estimate the number of smarties (sweet) that fit into a smart (car). [/INST]"

chat = llm.predict({"inputs":prompt, "parameters":payload})

print(chat[0]["generated_text"])

<s> [INST] Please provide a step by step reasoning process to estimate the number of smarties (sweet) that fit into a smart (car). [/INST]To estimate the number of Smarties (sweets) that can fit into a Smart (car), we'll follow a step-by-step reasoning process that involves several assumptions and calculations. Here's how we can approach this:

1. **Determine the volume of the Smart car:**
   - The Smart Fortwo (the smallest Smart car model) has a length of approximately 2.69 meters, a width of 1.55 meters, and a height of 1.66 meters.
   - Calculate the volume (V) of the car using the formula for the volume of a rectangular prism: V = length × width × height.
   - V = 2.69 m × 1.55 m × 1.66 m ≈ 6.77 cubic meters.

2. **Determine the volume of a single Smartie:**
   - A typical Smartie sweet is roughly spherical with a diameter of about 1.5 cm (or 0.015 meters).
   - Calculate the radius (r) of the Smartie: r = diameter / 2 = 0.015 m / 2 = 0.0075 m.
   - Calculate the volume (v) of a s

##### Sample code generation question

In [40]:
prompt="<s> [INST] Create a React component that calculates body mass index. [/INST]"

chat = llm.predict({"inputs":prompt, "parameters":payload})

print(chat[0]["generated_text"])

<s> [INST] Create a React component that calculates body mass index. [/INST]Here is a simple React component that calculates Body Mass Index (BMI). This component takes weight and height as props and calculates the BMI using the formula: weight(kg) / height(m)².

```jsx
import React from 'react';

const BmiCalculator = ({ weight, height }) => {
  const bmi = weight / Math.pow(height, 2);

  return (
    <div>
      <h2>Body Mass Index Calculator</h2>
      <p>Weight: {weight} kg</p>
      <p>Height: {height} m</p>
      <p>BMI: {bmi.toFixed(2)}</p>
      <p>
        Interpretation:
        {bmi < 18.5 ? 'Underweight' :
        bmi >= 18.5 && bmi < 24.9 ? 'Normal weight' :
        bmi >= 25 && bmi < 29.9 ? 'Overweight' :
        bmi >= 30 ? 'Obesity' : ''}
      </p>
    </div>
  );
};

export default BmiCalculator;
```

You can use this component in your app like this:

```jsx
import React from 'react';
import BmiCalculator from './BmiCalculator';

function App() {
  return (
    <div 

---
#### Streaming Responses

[Amazon SageMaker supports streaming responses](https://aws.amazon.com/de/blogs/machine-learning/elevating-the-generative-ai-experience-introducing-streaming-support-in-amazon-sagemaker-hosting/) from your model. Below we will demonstrate how to create a streaming response using our existing endpoint. 

In [71]:
import json
import boto3
import logging
import io

from sagemaker.base_deserializers import StreamDeserializer

boto3.set_stream_logger("",logging.INFO)

llm.deserializer=StreamDeserializer()

smr = boto3.client('sagemaker-runtime')

In [77]:
# Helper class to iterate the lines via https://github.com/aws-samples/sagemaker-hosting/blob/main/GenAI-Hosting/Large-Language-Model-Hosting/LLM-Streaming/Falcon-40b-and-7b/falcon-40b-and-7b-tgi-streaming/falcon-7b-streaming_tgi.ipynb
class LineIterator:
    """
    A helper class for parsing the byte stream input from TGI container. 
    
    The output of the model will be in the following format:
    ```
    b'data:{"token": {"text": " a"}}\n\n'
    b'data:{"token": {"text": " challenging"}}\n\n'
    b'data:{"token": {"text": " problem"
    b'}}'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by appending the bytes of each PayloadPart to an internal
    buffer and returning complete lines (ending with a '\n' character) from '__next__'.
    It tracks the last read position so that bytes already returned are not exposed again,
    and it keeps any pending line that does not end with a '\n' until the rest of it
    arrives, so that truncated lines are concatenated.
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if 'PayloadPart' not in chunk:
                print('Unknown event type:', chunk)
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
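
To see why the buffering matters, the following self-contained sketch applies the same idea to simulated events, with no endpoint involved. `iter_token_texts` is a hypothetical, simplified stand-in for `LineIterator`, and the events are made up for illustration; the last JSON object is deliberately split across two `PayloadPart` events.

```python
import json


def iter_token_texts(events):
    """Yield token texts from an iterable of SageMaker-style stream events."""
    buffer = b""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip event types this sketch does not handle
        buffer += part["Bytes"]
        # Only complete lines are parsed; a trailing partial line stays buffered
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if not line.startswith(b"data:"):
                continue  # blank separator lines between events
            payload = json.loads(line[len(b"data:"):].decode("utf-8"))
            yield payload["token"]["text"]


# Simulated events (made up for illustration)
simulated_events = [
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": " a"}}\n\n'}},
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": " challenging"}}\n\ndata:{"token": {"te'}},
    {"PayloadPart": {"Bytes": b'xt": " problem"}}\n\n'}},
]
print("".join(iter_token_texts(simulated_events)))
```

Parsing each event on its own would fail on the second one, because it ends in the middle of a JSON object. Unlike `LineIterator`, this sketch silently drops any trailing data that never receives its closing newline.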

In [79]:
stop_token = '</s>'

body = {
    "inputs":"<s> [INST]What is the history behind Artificial Intelligence? [/INST]", # prompt
    "parameters":{
        "max_new_tokens": 4000,
        "return_full_text": False
    },
    "stream": True # set stream to True to enable streaming response
}

resp = smr.invoke_endpoint_with_response_stream(EndpointName=llm.endpoint_name, Body=json.dumps(body), ContentType='application/json')
print(resp)
event_stream = resp['Body']
start_json = b'{'
for line in LineIterator(event_stream):
    if line != b'' and start_json in line:
        data = json.loads(line[line.find(start_json):].decode('utf-8'))
        if data['token']['text'] != stop_token:
            print(data['token']['text'],end='')

{'ResponseMetadata': {'RequestId': '757670cc-b895-4331-a0f3-810fced965a6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '757670cc-b895-4331-a0f3-810fced965a6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'x-amzn-sagemaker-content-type': 'text/event-stream', 'date': 'Sat, 10 Aug 2024 17:17:49 GMT', 'content-type': 'application/vnd.amazon.eventstream', 'transfer-encoding': 'chunked', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'text/event-stream', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.eventstream.EventStream object at 0x12d11da10>}
Artificial Intelligence (AI) has a rich history that spans over seven decades, evolving from a theoretical concept to the powerful technology it is today. Here's a brief overview of its history:

1. **Alan Turing (1936-1950)**: The concept of AI can be traced back to the work of British mathematician and computer scientist Alan Turing. In 1936, he proposed the "Turing Machine," a theoretical model

---
## 6. Clean up

To clean up, we can delete the model and endpoint.

In [80]:
llm.delete_model()
llm.delete_endpoint()

2024-08-10 19:19:44,453 sagemaker [INFO] Deleting model with name: huggingface-pytorch-tgi-inference-2024-08-10-15-57-56-989
2024-08-10 19:19:44,453 sagemaker [INFO] Deleting model with name: huggingface-pytorch-tgi-inference-2024-08-10-15-57-56-989
2024-08-10 19:19:44,621 sagemaker [INFO] Deleting endpoint configuration with name: mistral-nemo-instruct-hf-dlc-tgi
2024-08-10 19:19:44,621 sagemaker [INFO] Deleting endpoint configuration with name: mistral-nemo-instruct-hf-dlc-tgi
2024-08-10 19:19:44,787 sagemaker [INFO] Deleting endpoint with name: mistral-nemo-instruct-hf-dlc-tgi
2024-08-10 19:19:44,787 sagemaker [INFO] Deleting endpoint with name: mistral-nemo-instruct-hf-dlc-tgi


---
## 7. Conclusion

In this notebook, we've explored the process of deploying the Mistral NeMo 12B instruction-tuned model on Amazon SageMaker and performing inference using the Hugging Face LLM Deep Learning Container; the base model can be deployed the same way by changing the `HF_MODEL_ID`. Leveraging the power of Text Generation Inference (TGI), we've demonstrated how to efficiently deploy and serve this large language model in a secure, managed environment.

We've walked through the steps of setting up the SageMaker environment, configuring the model deployment, and showcasing both standard inference and streaming responses. 

---
