# Using Llama-3.1-SuperNova-Lite on SageMaker through the Hugging Face hub

This sample notebook shows you how to deploy [Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) using Amazon SageMaker. Llama-3.1-SuperNova-Lite is a conversational model developed by [Arcee.ai](https://www.arcee.ai). 

Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture. It is a distilled version of the larger Llama-3.1-405B-Instruct model, leveraging offline logits extracted from the 405B parameter variant. This 8B variation of Llama-3.1-SuperNova maintains high performance while offering exceptional instruction-following capabilities and domain-specific adaptability.

The model was trained using a state-of-the-art distillation pipeline and an instruction dataset generated with [EvolKit](https://github.com/arcee-ai/EvolKit), ensuring accuracy and efficiency across a wide range of tasks. For more information on its training, visit blog.arcee.ai. 

## Use cases

Llama-3.1-SuperNova-Lite excels in both benchmark performance and real-world applications, providing the power of large-scale models in a more compact, efficient form ideal for organizations seeking high performance with reduced resource requirements.

## Pre-requisites
1. Before running this notebook, make sure you obtained it from the model catalog in the Amazon SageMaker console.
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.

## Contents
1. [Import dependencies](#1.-Import-dependencies)

2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
    1. [Define the endpoint configuration](#A.-Define-the-endpoint-configuration)
    2. [Create the endpoint](#B.-Create-the-endpoint)
    3. [Define a test payload](#C.-Define-a-test-payload)
    4. [Perform real-time inference](#D.-Perform-real-time-inference)
    5. [Visualize output](#E.-Visualize-output)

3. [Clean-up](#4.-Clean-up)
    1. [Delete the endpoint](#A.-Delete-the-endpoint)
    2. [Delete the model](#B.-Delete-the-model)
    

## Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter.

## 1. Import dependencies

In [None]:
%%sh
pip install -qU boto3 sagemaker

In [None]:
import datetime
import json
import pprint

import boto3
import sagemaker
from IPython.display import Markdown, display
from sagemaker import get_execution_role
from sagemaker.huggingface import (HuggingFaceModel,
                                   get_huggingface_llm_image_uri)

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()
runtime_sm_client = boto3.client("runtime.sagemaker")

## 2. Create an endpoint and perform real-time inference

In this example, we're deploying Llama-3.1-SuperNova-Lite on a SageMaker real-time endpoint hosted on a GPU instance. If you need general information on real-time inference with Amazon SageMaker, please refer to the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

The endpoint runs a Hugging Face [Deep Learning Container](https://huggingface.co/docs/sagemaker/index), powered by the Hugging Face [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/index) Server (TGI). TGI enables high-performance text generation for the most popular open-source language models. 

For flexibility, you can pick from two sample configurations, depending on your use case and the instance types available to you. Please make sure to run just one of the configuration cells below.

#### G5 endpoint

The first configuration focuses on cost efficiency. It uses a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/) instance. This instance has a single NVIDIA A10G GPU with 24 GB of GPU memory. Llama-3.1-SuperNova-Lite has 8 billion 16-bit parameters (about 16 GB of weights), which fit on this GPU without the need for quantization.
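The memory claim above can be checked with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes exactly 8 billion parameters and counts the weights alone, ignoring activations and the key-value cache.

```python
# Rough estimate of the GPU memory needed for the model weights.
# Assumption: exactly 8 billion parameters, each stored on 16 bits.
num_parameters = 8_000_000_000
bytes_per_parameter = 2  # 16 bits = 2 bytes
gpu_memory_gb = 24  # single NVIDIA A10G on a g5.2xlarge

weights_gb = num_parameters * bytes_per_parameter / 1e9
headroom_gb = gpu_memory_gb - weights_gb

print(f"Weights: {weights_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

The headroom is what remains for everything else the inference server needs at runtime, which is one reason to keep the context size modest on this instance.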

For context size, we use the default value defined by the TGI inference server, i.e. 4K.

#### P4 endpoint

The second configuration focuses on performance. It uses a [p4d.24xlarge](https://aws.amazon.com/ec2/instance-types/p4/) instance. This instance has eight NVIDIA A100 GPUs with a total of 320 GB of GPU memory. This is more than enough to load the model and use a much larger context size of 80K tokens.

#### OpenAI compatibility

For both configurations, we enable the [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) available in TGI. This allows you to invoke the endpoint in the same way you would invoke an OpenAI model. Likewise, the output format is identical to that of OpenAI models. If that's not desirable, you can simply comment out the line setting `MESSAGES_API_ENABLED` to `true`.
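To make the difference concrete, here is a minimal sketch of the two request bodies. The helper name `build_payload` is hypothetical and only used for illustration. With the Messages API disabled, TGI expects its native format, an `inputs` string plus a `parameters` dictionary, and as far as we know it does not apply the chat template for you, so the prompt would need to be formatted beforehand.

```python
def build_payload(prompt, messages_api=True, max_tokens=1024):
    """Build a request body for the Messages API or for TGI's native format.

    Hypothetical helper, for illustration only.
    """
    if messages_api:
        # OpenAI-style body, used when MESSAGES_API_ENABLED is "true"
        return {
            "model": "tgi",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }
    # TGI native body, used when the Messages API is not enabled
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_tokens}}


print(build_payload("Hello!"))
print(build_payload("Hello!", messages_api=False))
```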



### A. Define the endpoint configuration

In [None]:
model_id = "arcee-ai/Llama-3.1-SuperNova-Lite"
endpoint_name_prefix = "Llama-SuperNova-Lite"

In [None]:
# g5 endpoint

real_time_inference_instance_type = "ml.g5.2xlarge"

model_environment = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": "1",
    "MESSAGES_API_ENABLED": "true",
}

In [None]:
# p4 endpoint

real_time_inference_instance_type = "ml.p4d.24xlarge"

model_environment = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": "8",
    "MAX_INPUT_TOKENS": "40960",
    "MAX_TOTAL_TOKENS": "81920",
    "MESSAGES_API_ENABLED": "true",
}
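If you adjust the token limits, it can help to sanity-check them before deploying, since TGI expects the input limit to be strictly lower than the total limit. The sketch below uses a hypothetical helper, `check_token_limits`, and a copy of the p4 values from the cell above; it is not part of the SageMaker SDK.

```python
def check_token_limits(env):
    """Return the number of tokens left for generation, or None if defaults apply.

    Hypothetical helper: raises ValueError when the limits are inconsistent.
    """
    if "MAX_INPUT_TOKENS" not in env or "MAX_TOTAL_TOKENS" not in env:
        return None  # the TGI defaults are used, nothing to check
    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_TOKENS must be lower than MAX_TOTAL_TOKENS")
    return max_total - max_input


# Same limits as the p4 configuration above
p4_limits = {"MAX_INPUT_TOKENS": "40960", "MAX_TOTAL_TOKENS": "81920"}
print(check_token_limits(p4_limits))
```

In the notebook, you could call `check_token_limits(model_environment)` right after running one of the configuration cells.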

### B. Create the endpoint

In [None]:
# create a deployable model that pulls the weights from the Hugging Face hub
model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.2.0"),
    role=role,
    env=model_environment,
)

# create a unique endpoint name
timestamp = "{:%Y-%m-%d-%H-%M-%S}".format(datetime.datetime.now())
endpoint_name = f"{endpoint_name_prefix}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
response = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=3600,
    container_startup_health_check_timeout=900,
)

Once the endpoint is in service, you will be able to perform real-time inference.
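If you want to confirm the endpoint state yourself, one option is to query it through the SageMaker API. The sketch below assumes a boto3 SageMaker client is passed in; the helper name `get_endpoint_status` is hypothetical.

```python
def get_endpoint_status(sm_client, name):
    """Return the status string of a SageMaker endpoint, e.g. "InService".

    sm_client is expected to be a boto3 SageMaker client,
    such as boto3.client("sagemaker").
    """
    return sm_client.describe_endpoint(EndpointName=name)["EndpointStatus"]


# Usage in this notebook (requires AWS credentials):
# print(get_endpoint_status(boto3.client("sagemaker"), endpoint_name))
```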

### C. Define a test payload

In [None]:
model_sample_input = {
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a friendly and helpful AI assistant."},
        {
            "role": "user",
            "content": "Suggest 5 names for a new neighborhood pet food store. Names should be short, fun, easy to remember, and respectful of pets. \
        Explain why customers would like them.",
        },
    ],
    "max_tokens": 1024,
}

### D. Perform real-time inference

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
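Since the invocation code is the same for every prompt, you may prefer to wrap it in a small function. This is only a sketch: the helper name `chat` is hypothetical, and it assumes the Messages API is enabled on the endpoint.

```python
import json


def chat(runtime_client, endpoint_name, messages, max_tokens=1024):
    """Send OpenAI-style messages to the endpoint and return the parsed JSON reply.

    runtime_client is expected to be a boto3 SageMaker runtime client.
    """
    payload = {"model": "tgi", "messages": messages, "max_tokens": max_tokens}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode("utf8"))


# Usage in this notebook:
# output = chat(runtime_sm_client, endpoint_name, model_sample_input["messages"])
```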

### E. Visualize output

We can print the raw JSON output in OpenAI format.

In [None]:
pprint.pprint(output)

We can also print the generated output with Markdown formatting.

In [None]:
display(Markdown(output["choices"][0]["message"]["content"]))
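Beyond the generated text, the OpenAI-style output usually carries a finish reason and token counts, which are handy for checking whether an answer was cut off by `max_tokens`. The sketch below reads them defensively; the helper name `summarize` is hypothetical, and `sample_output` is a hand-written illustration, not an actual model response.

```python
def summarize(output):
    """Extract the finish reason and token counts from an OpenAI-style reply."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Illustrative reply only: the values below are made up for the example.
sample_output = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1024, "total_tokens": 1036},
}

print(summarize(sample_output))
```

A finish reason of `length` typically indicates that generation stopped because it reached the token limit, in which case you may want to raise `max_tokens`.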

Here are some more examples. Please feel free to tweak them and add your own!

In [None]:
prompt = """Please write a friendly marketing pitch for a new SaaS AI platform called Arcee Cloud.
We will send this pitch by email to business and technical decision-makers, so make it sound exciting yet professional.
The contact email is sales@arcee.ai. Feel free to use emojis as appropriate.
Arcee Cloud makes it simple for enterprise users to tailor open-source small language models to their own domain knowledge,
in order to build high-quality, cost-effective and secure AI solutions."""

model_sample_input = {
    "model": "tgi",
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful Marketing Manager working at Arcee.ai.",
        },
        {"role": "user", "content": prompt},
    ],
    "max_tokens": 1024,
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
display(Markdown(output["choices"][0]["message"]["content"]))

In [None]:
model_sample_input = {
    "model": "tgi",
    "messages": [
        {
            "role": "system",
            "content": "As a friendly technical assistant engineer, answer the question in detail.",
        },
        {"role": "user", "content": "Why are transformers better models than LSTM?"},
    ],
    "max_tokens": 1024,
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
display(Markdown(output["choices"][0]["message"]["content"]))

In [None]:
model_sample_input = {
    "model": "tgi",
    "messages": [
        {
            "role": "system",
            "content": "You are Darlene, a friendly and helpful salesperson \
        working at Crystal River Classic Bikes, a classic motorcycle dealership in central Florida.",
        },
        {
            "role": "user",
            "content": "Using English, write a personalized customer email to get \
        them to sign up for a test ride on the new 2025 motorcycles that are visible at the dealership. \
        Tone should be warm and personal, make sure to weave in the customer information below. \
        Wyatt, your chief mechanic and road captain, has just won the 2024 State Award for Best Mechanic. \
        \
        Customer information:\
        - name: Julien \
        - last visit: 6 months ago for bike service \
        - Owns 2 bikes, a 2002 sporty bike and a 2007 cruiser \
        ",
        },
    ],
    "max_tokens": 1024,
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
display(Markdown(output["choices"][0]["message"]["content"]))

Now that you have successfully performed real-time inference, you no longer need the endpoint. You can delete it to avoid further charges.

## 4. Clean-up

Please don't forget to run the cells below to delete all resources and avoid unnecessary charges.

### A. Delete the endpoint

In [None]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [None]:
model.delete_model()
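To double-check that nothing is left running, you could list the endpoints whose name contains the prefix used in this notebook. This is a minimal sketch with a hypothetical helper, `remaining_endpoints`; it only looks at the first page of results, which should be enough for a handful of endpoints.

```python
def remaining_endpoints(sm_client, name_fragment):
    """Return the names of endpoints whose name contains name_fragment.

    sm_client is expected to be a boto3 SageMaker client.
    Only the first page of results is inspected.
    """
    response = sm_client.list_endpoints(NameContains=name_fragment)
    return [endpoint["EndpointName"] for endpoint in response["Endpoints"]]


# Usage in this notebook (requires AWS credentials):
# print(remaining_endpoints(boto3.client("sagemaker"), endpoint_name_prefix))
```

An empty list suggests that the endpoints created by this notebook are gone, or at least no longer listed.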

Thank you for trying out Llama-3.1-SuperNova-Lite on SageMaker. We have only scratched the surface of what you can do with this model.

We'd be happy to hear from you, learn more about your use case, and help you build your next AI-driven solution. Please reach out to julien@arcee.ai.