# Deploy Mixtral 8x7b with NVIDIA NIM on Amazon SageMaker

---

## What is NIM

[NVIDIA NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/mistralai/containers/mixtral-8x7b-instruct-v01) enables efficient deployment of large language models (LLMs) across various environments, including cloud, data centers, and workstations. It simplifies self-hosting LLMs by providing scalable, high-performance microservices optimized for NVIDIA GPUs. NIM's containerized approach allows for easy integration into existing workflows, with support for advanced language models and enterprise-grade security. Leveraging GPU acceleration, NIM offers fast inference capabilities and flexible deployment options, empowering developers to build powerful AI applications such as chatbots, content generators, and translation services.

### Features

NIM abstracts away model inference internals such as the execution engine and runtime operations. NIMs are also presented as the most performant option available, whether the backend is [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm), or another engine. NIM offers the following high-performance features:

1. Scalable Deployment: performant serving that scales from a few users to millions.
2. Advanced Language Model support: pre-generated, optimized engines for a diverse range of cutting-edge LLM architectures.
3. Flexible Integration: the microservice fits into existing workflows and applications. Developers get an OpenAI API-compatible programming model and custom NVIDIA extensions for additional functionality.
4. Enterprise-Grade Security: NIM uses safetensors, and NVIDIA continuously monitors and patches CVEs in its stack and conducts internal penetration tests.

### Architecture

NIMs are packaged as container images on a per model/model family basis. Each NIM is its own Docker container with a model, such as [mistralai/Mixtral-8x7B](https://huggingface.co/mistralai). These containers include a runtime that runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. NIMs are distributed as [NGC](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html) container images through the NVIDIA NGC Catalog. NIM automatically downloads the model from NGC, leveraging a local filesystem cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, downloading additional NIMs is extremely fast.

In this example we show how to deploy `Mixtral 8x7B` on a `p4d.24xlarge` instance with NIM on Amazon SageMaker.

## Model Card
---
### Mixtral 8x7B Instruct

- **Description:** A 7B sparse Mixture-of-Experts model with stronger capabilities than Mistral 7B. Utilizes 12B active parameters out of 45B total.
- **Max Tokens:** 4,096
- **Context Window:** 32K
- **Languages:** English, French, German, Spanish, Italian
- **Supported Use Cases:** Text summarization, structuration, question answering, and code completion

## Prerequisites
---

<div class="alert alert-block alert-info">
<b>NOTE:</b>  To run NIM on SageMaker you will need to have your `NGC API KEY` to access NGC resources. Check out <a href="https://build.nvidia.com/mistralai/mixtral-8x7b-instruct"> this LINK</a> to learn how to get an NGC API KEY. 
</div>

##### 1. Set up and retrieve your API key:

1. Sign in to [NGC](https://ngc.nvidia.com/signin) with your NVIDIA account and password.
2. Navigate to Setup.
3. Select “Get API Key”.
4. Generate your API key.
5. Keep your API key secret and in a safe place. Do not share it or store it where others can see or copy it.

For more information on NIM, check out the [NIM LLM docs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).

##### 2. Your execution role must have `ecr:CreateRepository` and the appropriate push permissions

##### 3. The NIM public ECR image is currently available only in the `us-east-1` region

---

## Setup

Import the dependencies and set up the session, clients, and execution role required to package the model and create the SageMaker endpoint.

In [3]:
import boto3 
import json
import os
import sagemaker
import time
from pathlib import Path
from sagemaker import get_execution_role

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
region = sess.region_name
sts_client = sess.client('sts')
account_id = sts_client.get_caller_identity()['Account']

### Set Variables

In this example, since we are deploying `Mixtral-8x7B` we define some configurations below for retrieving our ECR image for NIM along with some other requirements.

In [4]:
# mixtral-8x7b-instruct
public_nim_image = "public.ecr.aws/nvidia/nim:mixtral-8x7b-instruct-v01-1.0.0"
nim_model = "nim-mixtral-8x7b-instruct"
sm_model_name = "nim-mixtral-8x7b-instruct"
instance_type = "ml.p4d.24xlarge"
payload_model = "mistralai/mixtral-8x7b-instruct-v0.1"

### NIM Container

We first pull the NIM image from public ECR and then push it to a private ECR repository in your account so that it can be deployed on a SageMaker endpoint.

Note, as mentioned previously:
  - The NIM ECR image is currently available only in the `us-east-1` region
  - Your execution role must have `ecr:CreateRepository` and the appropriate push permissions

In [5]:
import subprocess

# Get AWS account ID
result = subprocess.run(['aws', 'sts', 'get-caller-identity', '--query', 'Account', '--output', 'text'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    # Stop here: the script below cannot be built without the account ID
    raise RuntimeError(f"Error getting AWS account ID: {result.stderr}")

account = result.stdout.strip()
print(f"AWS account ID: {account}")

bash_script = f"""
echo "Public NIM Image: {public_nim_image}"
docker pull {public_nim_image}


echo "Resolved account: {account}"
echo "Resolved region: {region}"

nim_image="{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"

# Ensure the repository name adheres to AWS constraints
repository_name=$(echo "{nim_model}" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]._/-')

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "$repository_name" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "$repository_name" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin "{account}.dkr.ecr.{region}.amazonaws.com"

docker tag {public_nim_image} $nim_image
docker push $nim_image
echo -n $nim_image
"""
nim_image=f"{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"
# Run the bash script and print its output in real time.
# stderr is merged into stdout so that a full stderr pipe cannot stall the read loop.
process = subprocess.Popen(bash_script, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

for line in process.stdout:
    print(line.decode().strip())

# Report a failed pull, login, or push instead of continuing silently
return_code = process.wait()
if return_code != 0:
    print(f"Errors: script exited with code {return_code}")


AWS account ID: 570598552974
Public NIM Image: public.ecr.aws/nvidia/nim:mixtral-8x7b-instruct-v01-1.0.0
mixtral-8x7b-instruct-v01-1.0.0: Pulling from nvidia/nim
Digest: sha256:29183d1f5b27fbb95963cc539f313c1d7f92458bc440aaa9599d4eecdae4a582
Status: Image is up to date for public.ecr.aws/nvidia/nim:mixtral-8x7b-instruct-v01-1.0.0
public.ecr.aws/nvidia/nim:mixtral-8x7b-instruct-v01-1.0.0
Resolved account: 570598552974
Resolved region: us-west-2
Login Succeeded
Using default tag: latest
The push refers to repository [570598552974.dkr.ecr.us-west-2.amazonaws.com/nim-mixtral-8x7b-instruct]
368e1459fd45: Preparing
1c581f8c3425: Preparing
d1247f59f9e1: Preparing
1226ae4cbcd1: Preparing
a73a9c988583: Preparing
ff45a40404c2: Preparing
fa87cb652a48: Preparing
69a5953b82a5: Preparing
9cc945697f55: Preparing
861b0723f458: Preparing
277cc9d6a73f: Preparing
0b60d597c587: Preparing
8c01e44155fa: Preparing
8921a0fe7f12: Preparing
28e03e31e934: Preparing
41c3d70405ca: Preparing
0a7e6d30c81b: Preparing

We print the private ECR NIM image in your account that we will use for the SageMaker deployment.
- It should look similar to `"<ACCOUNT ID>.dkr.ecr.<REGION>.amazonaws.com/<NIM_MODEL>"`. No tag is specified, so Docker pushes the image under the default `latest` tag.

In [6]:
print(nim_image)

570598552974.dkr.ecr.us-west-2.amazonaws.com/nim-mixtral-8x7b-instruct
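
The image reference above relies on the implicit `latest` tag. As a minimal sketch, the helper below builds the same URI with an explicit tag so the deployed version is visible in the model definition. The function name `private_ecr_image_uri` is hypothetical and not part of the original notebook; the account ID in the usage line is a placeholder.

```python
def private_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build a fully qualified private ECR image URI with an explicit tag."""
    fields = (("account_id", account_id), ("region", region), ("repository", repository), ("tag", tag))
    for name, value in fields:
        if not value:
            raise ValueError(f"{name} must not be empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID; in the notebook this would use `account`, `region`, and `nim_model`.
print(private_ecr_image_uri("123456789012", "us-east-1", "nim-mixtral-8x7b-instruct"))
```

If an explicit tag is used here, the `docker tag` and `docker push` commands would need to use the same tag.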


---

## Create SageMaker Endpoint

**Before proceeding further, please set your NGC API Key.**

In [7]:
# Set your NGC API key here
NGC_API_KEY = "SET KEY HERE"
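
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared or committed. One possible alternative, sketched here under the assumption that the key is exported as an `NGC_API_KEY` environment variable (the helper name `load_ngc_api_key` is hypothetical), reads it from the environment and can fall back to a hidden prompt:

```python
import getpass
import os


def load_ngc_api_key(env=None, prompt=False):
    """Return the NGC API key from the environment, optionally prompting for it."""
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY", "").strip()
    if not key and prompt:
        # getpass does not echo the typed key into the notebook output
        key = getpass.getpass("NGC API key: ").strip()
    if not key:
        raise ValueError("NGC_API_KEY is not set")
    return key
```

In the cell above, the placeholder assignment could then become `NGC_API_KEY = load_ngc_api_key(prompt=True)`.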

Pass the **NGC_API_KEY** to the container as an environment variable and define the model.

In [None]:
container = {
    "Image": nim_image,
    "Environment": {"NGC_API_KEY": NGC_API_KEY}
}
create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Next, we create the endpoint configuration, which deploys the Mixtral model on the specified instance type.

In [None]:
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            "ContainerStartupHealthCheckTimeoutInSeconds": 850
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [11]:
endpoint_name = sm_model_name

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:570598552974:endpoint/nim-mixtral-8x7b-instruct


In [12]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:570598552974:endpoint/nim-mixtral-8x7b-instruct
Status: InService
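
The polling loop above exits on any status other than `Creating`, so a `Failed` endpoint is only noticed in the final print, and a stuck deployment is polled indefinitely. The following is a minimal sketch of a stricter variant, not part of the original notebook. The name `wait_for_endpoint` and its parameters are hypothetical; `describe` is assumed to be any callable shaped like `sm.describe_endpoint`.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed"; FailureReason is read defensively in case it is absent
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the clients defined earlier, this would be called as `resp = wait_for_endpoint(sm.describe_endpoint, endpoint_name)`. boto3 also provides a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which covers the same case without printing intermediate statuses.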


## Test Inference and Streaming Inference with Endpoint

Once the endpoint's status is `InService`, we can send a sample chat completion request using JSON as the payload format. NIM on SageMaker currently supports the OpenAI API chat completions inference protocol. For an explanation of the supported parameters, see [this link](https://platform.openai.com/docs/api-reference/chat).

<div class="alert alert-block alert-info">
<b>NOTE:</b> The model's name in the inference request payload needs to be the name of the NIM model. 
</div>

In [13]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Explain to me in detail how Optimum Neuron helps compile LLMs for AWS infrastructure"}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 1024
}


response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

{
  "id": "cmpl-c6262b90892348b19e2eea30965366af",
  "object": "chat.completion",
  "created": 1722008071,
  "model": "mistralai/mixtral-8x7b-instruct-v0.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Optimum Neuron is a library developed by Hugging Face for efficiently training and deploying large language models (LLMs) on Amazon Web Services (AWS) infrastructure. The library provides a simple interface for defining and launching training jobs on AWS, as well as optimized model configurations that take advantage of hardware acceleration.\n\nHere's a detailed explanation of how Optimum Neuron helps compile LLMs for AWS infrastructure:\n\n1. Preparing the model: First, you need to select a model architecture and pre-trained weights from the Hugging Face Model Hub or train your own model. Optimum Neuron provides pre-configured model classes for various architectures such as BERT, RoBERTa, DistilBERT, etc. These model classe

In [14]:
content = output["choices"][0]["message"]["content"]
print(content)

 Optimum Neuron is a library developed by Hugging Face for efficiently training and deploying large language models (LLMs) on Amazon Web Services (AWS) infrastructure. The library provides a simple interface for defining and launching training jobs on AWS, as well as optimized model configurations that take advantage of hardware acceleration.

Here's a detailed explanation of how Optimum Neuron helps compile LLMs for AWS infrastructure:

1. Preparing the model: First, you need to select a model architecture and pre-trained weights from the Hugging Face Model Hub or train your own model. Optimum Neuron provides pre-configured model classes for various architectures such as BERT, RoBERTa, DistilBERT, etc. These model classes have built-in optimizations for various hardware and software configurations.
2. AWS Infrastructure setup: Optimum Neuron integrates with Amazon SageMaker, a fully managed machine learning service that makes it easy to build, train, and deploy machine learning models
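
Indexing straight into `output["choices"][0]` raises an unhelpful error if the endpoint returns an error body or an empty `choices` list. A small defensive sketch follows, assuming the response follows the OpenAI chat completion shape shown above. The helper name `extract_chat_reply` is hypothetical, and `finish_reason` is read from the OpenAI schema rather than from the truncated output in this notebook.

```python
def extract_chat_reply(completion):
    """Return (content, finish_reason) from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    first = choices[0]
    content = (first.get("message") or {}).get("content") or ""
    # Responses in this notebook start with a leading space, so strip it
    return content.strip(), first.get("finish_reason")
```

Calling `content, finish_reason = extract_chat_reply(output)` would replace the direct indexing. In the OpenAI protocol, a `finish_reason` of `"length"` indicates that generation stopped at `max_tokens`.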

### Try streaming inference

NIM on SageMaker also supports streaming inference. To enable it, set **`"stream"` to `True`** in the payload and call the [`invoke_endpoint_with_response_stream`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) method.

In [15]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Explain to me in detail what inference engines and llm serving frameworks are"}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 1024,
  "stream": True
}


response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

We have some postprocessing code for the streaming output.

In [16]:
event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue

 Sure, I'd be happy to explain!

An inference engine is a key component of expert systems and other artificial intelligence (AI) applications. It is responsible for reasoning and making decisions based on the rules and knowledge encoded in the system. The inference engine applies various logical inference techniques to draw conclusions and make predictions based on the available data. In other words, it uses the knowledge base to make inferences about new situations or data.

Inference engines can be classified into several categories based on the inference techniques they use, such as forward chaining, backward chaining, resolution, and constraint satisfaction. Forward chaining starts with the available data and applies the rules to deduce new knowledge until a goal is achieved. Backward chaining, on the other hand, starts with the goal and works backward to find the evidence needed to reach that goal.

On the other hand, an LLM (Language Learning Model) serving framework is a system 
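
The marker-based parser above depends on the exact text `"finish_reason":null}]}` and therefore skips any chunk whose `finish_reason` is set. As an alternative sketch, the generator below splits the stream on newlines instead. It assumes the endpoint emits server-sent-event style lines of the form `data: {...}` and may end with a `data: [DONE]` sentinel, as OpenAI-compatible servers commonly do; neither detail is confirmed by this notebook. The names `iter_stream_content` and `_content_from_line` are hypothetical.

```python
import json


def _content_from_line(line):
    """Return the content delta carried by one `data:` line, or None."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    body = text[len("data:"):].strip()
    if body == "[DONE]":  # assumed end-of-stream sentinel
        return None
    try:
        chunk = json.loads(body)
    except json.JSONDecodeError:
        return None
    choices = chunk.get("choices") or [{}]
    return (choices[0].get("delta") or {}).get("content")


def iter_stream_content(event_stream):
    """Yield content deltas from a SageMaker response stream, line by line."""
    buffer = b""
    for event in event_stream:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        # Parse only complete lines; a partial line waits for the next event
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            content = _content_from_line(line)
            if content:
                yield content
    # Flush a final line that arrived without a trailing newline
    content = _content_from_line(buffer)
    if content:
        yield content
```

With a fresh streaming response, the loop becomes `for piece in iter_stream_content(response["Body"]): print(piece, end="", flush=True)`. Buffering raw bytes until a newline also avoids decoding a multi-byte character that was split across two events.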

---
### Delete endpoint and clean up artifacts

In [17]:
sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'c0e3552f-ead3-4380-b9ec-192c8ef17cc2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c0e3552f-ead3-4380-b9ec-192c8ef17cc2',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Fri, 26 Jul 2024 15:35:52 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

---
### Acknowledgements

This notebook draws inspiration from NVIDIA's [nim-deploy](https://github.com/NVIDIA/nim-deploy) samples repository, which showcases different ways NVIDIA NIMs can be deployed.


---
## Distributors
- Amazon Web Services
- Mistral AI
