# Deploy Llama 3 70B with NVIDIA NIM on Amazon SageMaker

---

## What is NIM

[NVIDIA NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/mistralai/containers/mixtral-8x7b-instruct-v01) enables efficient deployment of large language models (LLMs) across various environments, including cloud, data centers, and workstations. It simplifies self-hosting LLMs by providing scalable, high-performance microservices optimized for NVIDIA GPUs. NIM's containerized approach allows for easy integration into existing workflows, with support for advanced language models and enterprise-grade security. Leveraging GPU acceleration, NIM offers fast inference capabilities and flexible deployment options, empowering developers to build powerful AI applications such as chatbots, content generators, and translation services.

### Features

NIM abstracts away model inference internals such as the execution engine and runtime operations. NIMs are also the most performant option available, whether the backend is [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm), or another engine. NIM offers the following high-performance features:

1. Scalable deployment: performant, and able to scale seamlessly from a few users to millions.
2. Advanced language model support: pre-generated optimized engines for a diverse range of cutting-edge LLM architectures.
3. Flexible integration: the microservice fits into existing workflows and applications. Developers get an OpenAI API-compatible programming model and custom NVIDIA extensions for additional functionality.
4. Enterprise-grade security: models are stored as safetensors, CVEs in the NIM stack are continuously monitored and patched, and internal penetration tests are conducted.

For supported models and GPUs, see the [NIM Support Matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html).

### Architecture

NIMs are packaged as container images on a per model/model family basis. Each NIM is its own Docker container with a model, such as llama3. These containers include a runtime that runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. In this sample, we will be using the [NVIDIA NIM public ECR gallery on AWS](https://gallery.ecr.aws/nvidia/nim).

In this example, we show how to deploy `Llama 3 70B` on a `p4d.24xlarge` instance with NIM on Amazon SageMaker.

## Model Card
---
### Llama 3 70B

- **Description:** Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications. 
- **Max Tokens:** 2,048
- **Context Window:** 8,192
- **Languages:** English
- **Supported Use Cases:** Synthetic Text Generation and Accuracy, Text Classification and Nuance, Sentiment Analysis and Nuance Reasoning, Language Modeling, Dialogue Systems, and Code Generation.

## Prerequisites
---

<div class="alert alert-block alert-info">
<b>NOTE:</b> To run NIM on SageMaker, you need an <code>NGC API KEY</code> to access NGC resources. See <a href="https://build.nvidia.com/meta/llama3-70b?signin=true">this link</a> to learn how to get one.
</div>

##### 1. Set up and retrieve your API key:

1. Sign in to [NGC](https://ngc.nvidia.com/signin) with your NVIDIA account and password.
2. Navigate to Setup.
3. Select “Get API Key”.
4. Generate your API key.
5. Keep your API key secret and in a safe place. Do not share it or store it in a place where others can see or copy it.

For more information on NIM, check out the [NIM LLM docs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).
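
One way to follow the advice about keeping the key secret is to read it from the environment instead of pasting it into a notebook cell. The sketch below is illustrative only: the environment variable name `NGC_API_KEY` and the helpers `load_ngc_api_key` and `mask` are assumptions, not part of NIM or SageMaker.

```python
import os

# Placeholder string used later in this notebook; treated here as "not set".
PLACEHOLDER = "<YOUR NGC API KEY>"


def load_ngc_api_key(env_var: str = "NGC_API_KEY") -> str:
    """Return the NGC API key stored in an environment variable.

    Raises ValueError if the variable is unset, empty, or still the placeholder.
    The variable name is an assumption; use whatever your environment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise ValueError(f"Set the {env_var} environment variable to your NGC API key.")
    return key


def mask(key: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

With this in place, a cell could assign `NGC_API_KEY = load_ngc_api_key()` and print `mask(NGC_API_KEY)` to confirm that a key was found without revealing it.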

##### 2. You must have the appropriate push permissions associated with your execution role
- Add the following JSON inline policy to your `Amazon SageMaker Execution Role`:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "imagebuilder:GetComponent",
                "imagebuilder:GetContainerRecipe",
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:PutImage"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "kms:EncryptionContextKeys": "aws:imagebuilder:arn",
                    "aws:CalledVia": [
                        "imagebuilder.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::ec2imagebuilder*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/imagebuilder/*"
        }
    ]
}
```
- Or add the `EC2InstanceProfileForImageBuilderECRContainerBuilds` permission policy to your `SageMaker Execution Role`
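
The managed-policy option can also be applied programmatically. The following is a minimal sketch, assuming an IAM client created with `boto3.client("iam")` and credentials that are allowed to modify the role. The helper name and the role name in the usage note are hypothetical, and the policy ARN follows the usual pattern for AWS managed policies, so confirm it in the IAM console before relying on it.

```python
# Assumed ARN for the AWS managed policy named above; verify it in the IAM console.
MANAGED_POLICY_ARN = (
    "arn:aws:iam::aws:policy/EC2InstanceProfileForImageBuilderECRContainerBuilds"
)


def attach_ecr_build_policy(iam_client, role_name: str) -> bool:
    """Attach the managed policy to the role unless it is already attached.

    Returns True if a new attachment was made, False if it was already present.
    Only the first page of attached policies is inspected, which is enough for
    roles with a small number of policies.
    """
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    for policy in response.get("AttachedPolicies", []):
        if policy.get("PolicyArn") == MANAGED_POLICY_ARN:
            return False
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=MANAGED_POLICY_ARN)
    return True
```

A call might look like `attach_ecr_build_policy(boto3.client("iam"), "MySageMakerExecutionRole")`, where the role name is a placeholder for your own execution role.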

##### 3. The NIM public ECR image is currently available only in the `us-east-1` region

##### 4. This Jupyter Notebook can be run on a t3.medium instance (ml.t3.medium). However, to deploy `Llama 3 70B`, you may need to request a quota increase. 

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.p4d.24xlarge` for endpoint usage
4. If needed, request a quota increase for these resources.
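
The quota review in the steps above can also be done from code. This is a sketch under stated assumptions: the client is created with `boto3.client("service-quotas")`, and the quota is listed under the name `ml.p4d.24xlarge for endpoint usage`. The helper `endpoint_quota` is hypothetical; check the quota name against what the console shows for your account.

```python
def endpoint_quota(quota_client, instance_type: str = "ml.p4d.24xlarge"):
    """Return the applied quota for '<instance_type> for endpoint usage'.

    Returns None if no quota with that name is listed for SageMaker.
    """
    wanted = f"{instance_type} for endpoint usage"
    paginator = quota_client.get_paginator("list_service_quotas")
    # SageMaker has many quotas, so walk every page of results.
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page.get("Quotas", []):
            if quota.get("QuotaName") == wanted:
                return quota.get("Value")
    return None
```

A value of `0` (or `None`) would indicate that a quota increase is needed before the endpoint can be created.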

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

---

## Setup

Import the dependencies and create the SageMaker clients required to package the model and create the SageMaker endpoint.

In [1]:
import boto3
import json

sess = boto3.Session()
sm = sess.client("sagemaker")
client = boto3.client("sagemaker-runtime")

### Set Variables

Because we are deploying `Llama 3 70B`, we define the configuration below: the NIM ECR image to retrieve, the model names, the instance type, and the NGC API key.

In [2]:
public_nim_image = "public.ecr.aws/nvidia/nim:llama3-70b-instruct-1.0.0"
nim_model = "nim-llama3-70b-instruct"
sm_model_name = "nim-llama3-70b-instruct"
instance_type = "ml.p4d.24xlarge"
payload_model = "meta/llama3-70b-instruct"
NGC_API_KEY = "<YOUR NGC API KEY>"

In [3]:
# Use store magic to save the global variables for running base nim notebook.
%store public_nim_image nim_model sm_model_name instance_type payload_model NGC_API_KEY

Stored 'nim_model' (str)
Stored 'sm_model_name' (str)
Stored 'instance_type' (str)
Stored 'payload_model' (str)
Stored 'NGC_API_KEY' (str)


## Create SageMaker Endpoint with NIM Container

Using the container configuration above, we create a SageMaker endpoint and wait for the deployment to finish.

In [4]:
%run ../base_nim_NVIDIA.ipynb

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
Stored 'endpoint_name' (str)
Stored 'sm_model_name' (str)
Stored 'endpoint_config_name' (str)
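
The contents of `base_nim_NVIDIA.ipynb` are not shown here. As a rough illustration of what creating such an endpoint usually involves, the sketch below chains the standard SageMaker calls (`create_model`, `create_endpoint_config`, `create_endpoint`, and the `endpoint_in_service` waiter). It is not the base notebook's code: the function name, the `AllTraffic` variant name, the startup timeout, and the `role_arn` and `image_uri` arguments are assumptions. The push permissions listed in the prerequisites suggest the image is first copied to a private ECR repository, and that step is left out.

```python
def deploy_nim_endpoint(sm_client, *, image_uri, model_name, endpoint_config_name,
                        endpoint_name, instance_type, role_arn, ngc_api_key,
                        startup_timeout_s=850):
    """Create a model, endpoint config, and endpoint, then wait for InService."""
    sm_client.create_model(
        ModelName=model_name,
        PrimaryContainer={
            "Image": image_uri,  # assumed: a private ECR copy of the NIM image
            "Environment": {"NGC_API_KEY": ngc_api_key},
        },
        ExecutionRoleArn=role_arn,
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # arbitrary variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            # Large models take a while to load; this value is illustrative.
            "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
        }],
    )
    sm_client.create_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
    )
    # Blocks until the endpoint is InService or the waiter gives up.
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return endpoint_name
```

Passing the client as an argument keeps the function easy to exercise with a stub before it is pointed at a real account.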


In [5]:
# Use store magic to retrieve the global variables to run inference over the sagemaker endpoint.
%store -r endpoint_name sm_model_name endpoint_config_name

## Test Inference and Streaming Inference with Endpoint

Once the endpoint's status is `InService`, we can send a sample chat completion request using JSON as the payload format. NIM on SageMaker currently supports the OpenAI API chat completions inference protocol. For an explanation of the supported parameters, see [this link](https://platform.openai.com/docs/api-reference/chat).

<div class="alert alert-block alert-info">
<b>NOTE:</b> The model's name in the inference request payload needs to be the name of the NIM model. 
</div>
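
Before sending a request, it can help to confirm that the endpoint really is `InService`. The check below is a small sketch that uses the SageMaker `describe_endpoint` call. The helper name `require_in_service` is an assumption, and `sm_client` is expected to be a client such as the `sm` object created in the setup cell.

```python
def require_in_service(sm_client, endpoint_name: str) -> None:
    """Raise RuntimeError unless the endpoint reports the InService status."""
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = description.get("EndpointStatus")
    if status != "InService":
        # FailureReason is only present when the endpoint has failed.
        reason = description.get("FailureReason", "no failure reason reported")
        raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
```

Calling `require_in_service(sm, endpoint_name)` before the inference cells turns a confusing invocation error into an explicit status message.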

In [None]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Explain to me in detail how Optimum Neuron helps compile LLMs for AWS infrastructure"}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 1024
}


response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

In [None]:
content = output["choices"][0]["message"]["content"]
print(content)

### Try streaming inference

NIM on SageMaker also supports streaming inference. Enable it by setting **`"stream"` to `True`** in the payload and calling the [`invoke_endpoint_with_response_stream`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) method.

In [None]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Explain to me in detail what inference engines and llm serving frameworks are"}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 1024,
  "stream": True
}


response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

The following postprocessing code reassembles the streamed chunks and prints the generated text as it arrives.

In [None]:
event_stream = response['Body']
accumulated_data = ""
# Each streamed message starts with 'data:'; intermediate messages end with
# a null finish_reason, which is used here as the end-of-message marker.
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        # Use a distinct name so the request `payload` dict is not overwritten.
        chunk = event.get('PayloadPart', {}).get('Bytes', b'')
        if chunk:
            accumulated_data += chunk.decode('utf-8')

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                # Keep whatever follows the message for the next iteration
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    # Skip fragments that are not valid JSON
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue
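
As an alternative to marker matching, the stream can be parsed line by line. The generator below is a sketch, assuming each message arrives as a line of the form `data: {json}` and that the stream may end with `data: [DONE]`, as in the OpenAI streaming format. The function name `iter_stream_text` is hypothetical, and a final message without a trailing newline would not be parsed.

```python
import json


def iter_stream_text(event_stream):
    """Yield text fragments from a streamed chat completions response."""
    buffer = b""
    for event in event_stream:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        # Only complete lines are parsed; a partial line stays in the buffer,
        # so multi-byte characters split across events are decoded safely.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue  # blank separator lines and anything unexpected
            body = text[len("data:"):].strip()
            if body == "[DONE]":
                return
            try:
                message = json.loads(body)
            except json.JSONDecodeError:
                continue
            choices = message.get("choices") or [{}]
            fragment = (choices[0].get("delta") or {}).get("content")
            if fragment:
                yield fragment
```

It could be used as `for fragment in iter_stream_text(response["Body"]): print(fragment, end="", flush=True)`. Buffering bytes rather than decoded strings is a deliberate choice here, because an event boundary can fall in the middle of a UTF-8 character.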

---
### Delete endpoint and clean up artifacts

In [None]:
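# Delete the endpoint first so the instance stops accruing charges,
# then remove the endpoint configuration and the model.
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)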

---
## Distributors
- Amazon Web Services
- Meta
