# Using Llama SuperNova Lite on SageMaker and Inferentia2

This sample notebook shows you how to deploy [Llama SuperNova Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) on Inferentia2 using Amazon SageMaker. Llama SuperNova Lite is a conversational model developed by [Arcee.ai](https://www.arcee.ai).

Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture. It is a distilled version of the larger Llama-3.1-405B-Instruct model, leveraging offline logits extracted from the 405B parameter variant. This 8B variation of Llama-3.1-SuperNova maintains high performance while offering exceptional instruction-following capabilities and domain-specific adaptability.

The model was trained using a state-of-the-art distillation pipeline and an instruction dataset generated with EvolKit, ensuring accuracy and efficiency across a wide range of tasks. For more information on its training, visit blog.arcee.ai.

## Use cases

Llama-3.1-SuperNova-Lite excels in both benchmark performance and real-world applications, providing the power of large-scale models in a more compact, efficient form ideal for organizations seeking high performance with reduced resource requirements.

## Pre-requisites
1. Before running this notebook, make sure you obtained it from the model catalog in the SageMaker section of the AWS Management Console.
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.

## Contents
1. [Import dependencies](#1.-Import-dependencies)

2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
    1. [Define the endpoint configuration](#A.-Define-the-endpoint-configuration)
    2. [Create the endpoint](#B.-Create-the-endpoint)
    3. [Define a test payload](#C.-Define-a-test-payload)
    4. [Perform real-time inference](#D.-Perform-real-time-inference)
    5. [Visualize output](#E.-Visualize-output)
    6. [Perform streaming inference](#F.-Perform-streaming-inference)


3. [Clean-up](#4.-Clean-up)
    1. [Delete the endpoint](#A.-Delete-the-endpoint)
    2. [Delete the model](#B.-Delete-the-model)
    
## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Import dependencies

In [None]:
%%sh
pip install -qU boto3 sagemaker

In [None]:
import datetime
import json
import pprint

import boto3
import sagemaker
from IPython.display import Markdown, display
from sagemaker import Model, get_execution_role, image_uris
from sagemaker.djl_inference.model import DJLModel
from sagemaker_streaming import print_event_stream

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()
sagemaker_bucket = sagemaker_session.default_bucket()
sm_client = boto3.client("sagemaker")
runtime_sm_client = boto3.client("runtime.sagemaker")

## 2. Create an endpoint and perform real-time inference

In this example, we're deploying Llama SuperNova Lite on a SageMaker real-time endpoint. If you need general information on real-time inference with Amazon SageMaker, please refer to the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

The endpoint runs a Large Model Inference (LMI) [Deep Learning Container](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html), powered by the [DJLServing](https://docs.djl.ai/master/docs/serving/index.html) server and the [transformers-neuronx](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/tnx_user_guide.html) library.

We will deploy the model to an [ml.inf2.xlarge](https://aws.amazon.com/ec2/instance-types/inf2/) instance, matching the `instance_type` set in the endpoint configuration. This instance has two NeuronCores v2, with a total of 32 GB of accelerator memory.
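
To see why an 8B model fits in this amount of accelerator memory, here is a rough back-of-the-envelope estimate. It is only a sketch: the parameter count, the 16-bit data type and the attention dimensions are assumptions taken from the public Llama-3.1-8B configuration, and compiler or runtime overhead is not counted.

```python
# Back-of-the-envelope memory estimate. All figures are assumptions taken from
# the public Llama-3.1-8B configuration, not measurements on the endpoint.
GIB = 1024**3

params = 8.03e9          # approximate parameter count (assumption)
bytes_per_value = 2      # 16-bit weights and KV cache (assumption)
num_layers = 32          # transformer layers (assumption)
num_kv_heads = 8         # grouped-query attention KV heads (assumption)
head_dim = 128           # dimension per attention head (assumption)

weights_gib = params * bytes_per_value / GIB

# Each token stores one key and one value vector per layer and per KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(batch_size: int, seq_len: int) -> float:
    """KV cache size in GiB for a full batch at maximum sequence length."""
    return batch_size * seq_len * kv_bytes_per_token / GIB


print(f"weights: {weights_gib:.1f} GiB")
print(f"KV cache (batch 2, 8192 tokens): {kv_cache_gib(2, 8192):.1f} GiB")
print(f"KV cache (batch 4, 4096 tokens): {kv_cache_gib(4, 4096):.1f} GiB")
```

Under these assumptions, the weights take about 15 GiB and both sample configurations in this notebook need about 2 GiB of KV cache, which leaves headroom within the 32 GB available.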

For flexibility, you can pick from two sample configurations. Please make sure to run just one of the configurations in the cells below.

1. Download the model from the Hugging Face hub and compile it on the fly

    This configuration is more flexible, as we can pick the batch size and the sequence length at deployment time.

    However, the endpoint creation time is longer, as we need to download the model from the hub and compile it. In this example, endpoint creation should take 12-13 minutes, including about 4 minutes of model compilation.
   
    Accessing the hub may also not be possible in air-gapped deployment scenarios. We could load a Hugging Face model previously saved in S3, which would remove the dependency on the hub and speed up the download a bit. Of course, model compilation would still be required.

2. Load a precompiled model from an Amazon S3 bucket (batch size 4, sequence length 4096)

    You should follow these [instructions](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/tutorials/tnx_aot_tutorial.html) to pre-compile and package the model. 

    Deployment is faster as we skip compilation: in this example, endpoint creation should take about 7-8 minutes.

    Also, a compiled model allows you to lock down the configuration of your endpoint and make sure it's deployed with static settings.

    However, batch size and sequence length are fixed: you'll need several compiled models for different settings.

#### OpenAI compatibility

For both configurations, the endpoint supports the [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/messages_api). This allows you to invoke the endpoint in the same way you would invoke an OpenAI model. Likewise, the output format is identical to that of OpenAI models.
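
As a minimal sketch of what this output format looks like, the snippet below parses a hand-written response in the OpenAI chat completion layout and extracts the assistant text. The sample values and the `first_message_text` helper are hypothetical and only serve as an illustration. A real endpoint response will contain different content and possibly additional fields.

```python
import json

# Illustrative response in the OpenAI chat completion layout. The field values
# are made up for this sketch; a real endpoint response will differ.
sample_response = json.loads("""
{
  "id": "chatcmpl-example",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
""")


def first_message_text(response: dict) -> str:
    """Return the assistant text of the first choice, or raise if absent."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


print(first_message_text(sample_response))
```

The same `["choices"][0]["message"]["content"]` path is used later in this notebook to display the generated text.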

### A. Define the endpoint configuration

In [None]:
model_id = "arcee-ai/Llama-3.1-SuperNova-Lite"
model_name_prefix = "llama-supernova-lite-neuron"
instance_type = "ml.inf2.xlarge"

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sagemaker_session.boto_session.region_name,
    version="0.29.0",
)

print(image_uri)

#### First configuration: deploy with a model compiled on the fly

First, we define serving parameters in the model environment. Then, we create the model object.

In [None]:
model_environment = {
    "OPTION_ENTRYPOINT": "djl_python.transformers_neuronx",
    "OPTION_ROLLING_BATCH": "auto",
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",
    #  1 doesn't work, use 2-16 https://github.com/deepjavalibrary/djl-serving/issues/2354
    "OPTION_MAX_ROLLING_BATCH_SIZE": "2",
    "OPTION_N_POSITIONS": "8192",
    "OPTION_MODEL_LOADING_TIMEOUT": "900",
}

djl_model = DJLModel(
    model_id=model_id, image_uri=image_uri, env=model_environment, role=role
)

Once we've done this, we can [create the endpoint](#B.-Create-the-endpoint).
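
Before deploying, it can help to sanity-check the environment dictionary, since SageMaker passes environment variables to the container as strings. The sketch below is one possible check, not part of the SageMaker SDK. `check_environment` and `INTEGER_KEYS` are hypothetical names introduced here, and the list of integer settings is an assumption based on the values used in this notebook.

```python
# Settings that are expected to hold positive integers, written as strings
# (assumption based on the values used in this notebook).
INTEGER_KEYS = (
    "OPTION_TENSOR_PARALLEL_DEGREE",
    "OPTION_MAX_ROLLING_BATCH_SIZE",
    "OPTION_N_POSITIONS",
    "OPTION_MODEL_LOADING_TIMEOUT",
)


def check_environment(env: dict) -> list:
    """Return a list of problems found in a model environment dictionary."""
    problems = []
    for key, value in env.items():
        # Environment variables are passed to the container as strings.
        if not isinstance(value, str):
            problems.append(f"{key}: {value!r} is not a string")
    for key in INTEGER_KEYS:
        value = env.get(key)
        if isinstance(value, str) and not (value.isdigit() and int(value) > 0):
            problems.append(f"{key}: {value!r} is not a positive integer")
    return problems


example_env = {
    "OPTION_ENTRYPOINT": "djl_python.transformers_neuronx",
    "OPTION_ROLLING_BATCH": "auto",
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "2",
    "OPTION_N_POSITIONS": "8192",
    "OPTION_MODEL_LOADING_TIMEOUT": "900",
}
print(check_environment(example_env))
```

An empty list means no problem was found. In the notebook, you could call `check_environment(model_environment)` before creating the model object.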

#### Second configuration: deploy with a precompiled model (batch size 4, sequence length 4096)

Once we've followed the [instructions](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/tutorials/tnx_aot_tutorial.html), the model and its compiled version are available in S3, in this example at `s3://arcee-uswest2-marketplace-models/XXX/`

```
|- config.json
|- generation_config.json (not in the instructions, fixes a warning at loading time)
|- special_tokens_map.json
|- tokenizer*.*
|- checkpoint/
|- - config.json
|- - generation_config.json
|- - model*.safetensors
|- - model.safetensors.index.json
|- compiled/
|- - VERSION
|- - *.neff
```

Next, we define serving parameters in a configuration file. Please make sure that `tensor_parallel_degree`, `n_positions` and `max_rolling_batch_size` match the values used at compilation time.

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=s3://arcee-uswest2-marketplace-models/XXX/
option.tensor_parallel_degree=2
option.n_positions=4096
option.rolling_batch=auto
option.max_rolling_batch_size=4
option.model_loading_timeout=3600
option.enable_mixed_precision_accumulation=true
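
Since a mismatch between the serving settings and the compilation settings is easy to miss, here is a small sketch that compares the two. `parse_properties`, `mismatches` and the `compiled_with` record are hypothetical helpers written for this illustration. They assume a plain `key=value` file like the one above.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def mismatches(settings: dict, compiled: dict) -> dict:
    """Return {key: (serving value, compile-time value)} for differing keys."""
    return {
        key: (settings.get(key), expected)
        for key, expected in compiled.items()
        if settings.get(key) != expected
    }


serving_text = """\
option.tensor_parallel_degree=2
option.n_positions=4096
option.max_rolling_batch_size=4
"""

# Hypothetical record of the values used at compilation time.
compiled_with = {
    "option.tensor_parallel_degree": "2",
    "option.n_positions": "4096",
    "option.max_rolling_batch_size": "4",
}

print(mismatches(parse_properties(serving_text), compiled_with))
```

An empty dictionary means that the three settings agree. Any differing key is reported with both values so that it can be corrected before deployment.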

Then, we package the configuration file and upload it to S3.

In [None]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
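
If you prefer to stay in Python, the packaging step can also be sketched with the standard `tarfile` module. `package_properties` is a hypothetical helper that works in a directory you pass in. Unlike the shell cell above, it does not delete the `mymodel` folder afterwards.

```python
import tarfile
import tempfile
from pathlib import Path


def package_properties(properties_text: str, workdir: Path) -> Path:
    """Write serving.properties under mymodel/ and archive it as mymodel.tar.gz."""
    model_dir = workdir / "mymodel"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)
    archive = workdir / "mymodel.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the same top-level folder as the shell version.
        tar.add(model_dir, arcname="mymodel")
    return archive


# Demonstration in a temporary directory with a one-line properties file.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = package_properties("engine=Python\n", Path(tmp))
    with tarfile.open(archive_path) as tar:
        members = sorted(tar.getnames())
    print(members)
```

The archive contains the `mymodel` folder and the `serving.properties` file inside it, which is the same layout produced by the `tar czvf` command.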

In [None]:
properties_artifact = sagemaker_session.upload_data(
    "mymodel.tar.gz", sagemaker_bucket, model_name_prefix
)

Finally, we create the model. Once we've done this, we can [create the endpoint](#B.-Create-the-endpoint).

In [None]:
djl_model = Model(model_data=properties_artifact, image_uri=image_uri, role=role)

### B. Create the endpoint

In [None]:
%%time
# create a unique endpoint name
timestamp = "{:%Y-%m-%d-%H-%M-%S}".format(datetime.datetime.now())
endpoint_name = f"{model_name_prefix}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
predictor = djl_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=900,
    container_startup_health_check_timeout=900,
)

Once the endpoint is in service, you will be able to perform real-time inference.
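
The endpoint name built in the cell above combines a prefix and a timestamp. As a hedged sketch, the helper below builds the same kind of name and checks it against the naming rules that, to our knowledge, the SageMaker `CreateEndpoint` API documents (63 characters at most, alphanumeric characters and hyphens). `make_endpoint_name` is a hypothetical helper, and the limits should be confirmed against the current API reference.

```python
import datetime
import re

# Pattern and length limit as documented for the SageMaker CreateEndpoint API
# (assumption: check the current API reference before relying on them).
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_LENGTH = 63


def make_endpoint_name(prefix: str, now: datetime.datetime) -> str:
    """Build '<prefix>-<timestamp>' and validate it."""
    name = f"{prefix}-{now:%Y-%m-%d-%H-%M-%S}"
    if len(name) > MAX_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError(f"invalid endpoint name: {name!r}")
    return name


example_name = make_endpoint_name(
    "llama-supernova-lite-neuron", datetime.datetime(2024, 9, 1, 12, 0, 0)
)
print(example_name)
```

With a longer prefix, the 19-character timestamp can push the name over the limit, so validating early avoids a failed deployment call.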

### C. Define a test payload

In [None]:
model_sample_input = {
    "messages": [
        {"role": "system", "content": "You are a friendly and helpful AI assistant."},
        {
            "role": "user",
            "content": "Suggest 5 names for a new neighborhood pet food store. Names should be short, fun, easy to remember, and respectful of pets. \
        Explain why customers would like them.",
        },
    ],
    "max_tokens": 1024,
    "stream": False
}
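
Because the sequence length is fixed by `n_positions`, the prompt and the requested `max_tokens` need to fit within it together. The sketch below illustrates that budget. Both helpers are hypothetical, and the four-characters-per-token ratio is a rough assumption, since the real count depends on the tokenizer and the chat template.

```python
def rough_token_count(messages: list, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count (assumption)."""
    total_chars = sum(len(message["content"]) for message in messages)
    return int(total_chars / chars_per_token) + 1


def fits_context(prompt_tokens: int, max_tokens: int, n_positions: int) -> bool:
    """True if the prompt plus the requested completion fits the sequence length."""
    return prompt_tokens + max_tokens <= n_positions


example_messages = [
    {"role": "system", "content": "You are a friendly and helpful AI assistant."},
    {"role": "user", "content": "Suggest 5 names for a new neighborhood pet food store."},
]

estimate = rough_token_count(example_messages)
print(fits_context(estimate, max_tokens=1024, n_positions=4096))
```

For short prompts like the ones in this notebook, a `max_tokens` value of 1024 fits comfortably in either sample configuration. The check matters more for long documents or multi-turn conversations.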

### D. Perform real-time inference

In [None]:
%%time
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

### E. Visualize output

We can print the raw JSON output in OpenAI format.

In [None]:
output = json.loads(response["Body"].read().decode("utf8"))
pprint.pprint(output)

We can also print the generated output with Markdown formatting.

In [None]:
display(Markdown(output["choices"][0]["message"]["content"]))

Here are some more examples. Please feel free to tweak them and add your own!

In [None]:
prompt = """Please write a friendly marketing pitch for a new SaaS AI platform called Arcee Cloud.
We will send this pitch by email to business and technical decision-makers, so make it sound exciting yet professional.
The contact email is sales@arcee.ai. Feel free to use emojis as appropriate.
Arcee Cloud makes it simple for enterprise users to tailor open-source small language models to their own domain knowledge,
in order to build high-quality, cost-effective and secure AI solutions."""

model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful Marketing Manager working at Arcee.ai.",
        },
        {"role": "user", "content": prompt},
    ],
    "max_tokens": 1024,
    "stream": False
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
display(Markdown(output["choices"][0]["message"]["content"]))

### F. Perform streaming inference

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "As a friendly technical assistant engineer, answer the question in detail.",
        },
        {"role": "user", "content": "Why are transformers better models than LSTM?"},
    ],
    "max_tokens": 1024,
    "stream": True
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType='application/json'
)

print_event_stream(response['Body'])
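
The `print_event_stream` helper comes from the `sagemaker_streaming` module, whose code is not shown in this notebook. As a rough sketch of what such a helper may need to do, the function below reads `PayloadPart` events, reassembles lines that are split across events, and yields the text deltas. The wire format is an assumption: newline-delimited JSON chunks in the OpenAI streaming layout, optionally prefixed with `data:`. This is not the actual implementation of `print_event_stream`.

```python
import json


def _parse_line(line: bytes):
    """Yield text deltas from one line of the stream (format is an assumption)."""
    text = line.decode("utf-8").strip()
    if text.startswith("data:"):
        text = text[len("data:"):].strip()
    if not text or text == "[DONE]":
        return
    chunk = json.loads(text)
    for choice in chunk.get("choices", []):
        delta = choice.get("delta", {}).get("content")
        if delta:
            yield delta


def iter_stream_text(events):
    """Yield text deltas from an iterable of response-stream events."""
    buffer = b""
    for event in events:
        part = event.get("PayloadPart")
        if not part:
            continue
        buffer += part["Bytes"]
        # A JSON object may be split across events, so only parse full lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield from _parse_line(line)
    # Handle a final line that is not terminated by a newline.
    if buffer.strip():
        yield from _parse_line(buffer)


# Fake events for illustration: the first JSON object is split in two parts.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"choices": [{"delta": {"content": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}]}\n{"choices": [{"delta": {"content": " world"}}]}\n'}},
]
print("".join(iter_stream_text(fake_events)))
```

With a live endpoint, you could pass `response["Body"]` instead of `fake_events`, provided the endpoint emits chunks in the assumed format.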

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are Darlene, a friendly and helpful salesperson \
        working at Crystal River Classic Bikes, a classic motorcycle dealership in central Florida.",
        },
        {
            "role": "user",
            "content": "Using English, write a personalized customer email to get \
        them to sign up for a test ride on the new 2025 motorcycles that are visible at the dealership. \
        Tone should be warm and personal, make sure to weave in the customer information below. \
        Wyatt, your chief mechanic and road captain, has just won the 2024 State Award for Best Mechanic. \
        \
        Customer information:\
        - name: Julien \
        - last visit: 6 months ago for bike service \
        - Owns 2 bikes, a 2002 sporty bike and a 2007 cruiser \
        ",
        },
    ],
    "max_tokens": 1024,
    "stream": True
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType='application/json'
)

print_event_stream(response['Body'])

Now that you have successfully performed real-time inference, you no longer need the endpoint. You can delete it to avoid further charges.

## 4. Clean-up

Please don't forget to run the cells below to delete all resources and avoid unnecessary charges.

### A. Delete the endpoint

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)

### B. Delete the model

In [None]:
djl_model.delete_model()

Thank you for trying out Llama SuperNova Lite on Inferentia2 and SageMaker. We have only scratched the surface of what you can do with this model.

We'd be happy to hear from you, learn more about your use case, and help you build your next AI-driven solution. Please reach out to julien@arcee.ai.