# Using Arcee-Lite on SageMaker through Model Packages

---

**If you already deployed the model package with CloudFormation, the AWS CLI or directly in the SageMaker console, please use the [sample-notebook-all-models-existing-sagemaker-endpoint.ipynb](sample-notebook-all-models-existing-sagemaker-endpoint.ipynb) notebook instead.**

---

This sample notebook shows you how to deploy [Arcee Lite](https://huggingface.co/arcee-ai/arcee-lite) using Amazon SageMaker. Arcee-Lite is a compact yet powerful 1.5B parameter language model developed by [Arcee.ai](https://www.arcee.ai) as part of the [DistillKit](https://github.com/arcee-ai/DistillKit) open-source project. Despite its small size, Arcee-Lite demonstrates impressive performance, particularly on the MMLU (Massive Multitask Language Understanding) benchmark.

Arcee-Lite is a distillation of the phi-3-medium 14B model into a Qwen2 1.5B model. It has a 32K-token context size.

## Use cases
* Embedded systems
* Mobile applications
* Edge computing
* Resource-constrained environments


## Pre-requisites
1. Before running this notebook, make sure you obtained it from the model catalog in the Amazon SageMaker console.
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
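
One way to check the last prerequisite programmatically is to list the managed policies attached to the role. The sketch below is a minimal illustration, not an official check: `role_has_policy` is a hypothetical helper, it only reads the first page of results, and it ignores inline policies and permissions granted by other policies. It assumes the caller is allowed to call `iam:ListAttachedRolePolicies`.

```python
def role_has_policy(iam_client, role_arn, policy_name="AmazonSageMakerFullAccess"):
    """Return True if a managed policy with this name is attached to the role.

    Hypothetical helper: iam_client is expected to behave like
    boto3.client("iam"). Only the first page of results is inspected.
    """
    # The role name is the last path segment of the role ARN.
    role_name = role_arn.split("/")[-1]
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    return any(
        policy["PolicyName"] == policy_name
        for policy in response["AttachedPolicies"]
    )


# Usage inside the notebook (requires AWS credentials):
#   import boto3
#   from sagemaker import get_execution_role
#   print(role_has_policy(boto3.client("iam"), get_execution_role()))
```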

## Contents
1. [Select the model package](#1.-Select-the-model-package)

2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
    1. [Define the endpoint configuration](#A.-Define-the-endpoint-configuration)
    2. [Create the endpoint](#B.-Create-the-endpoint)
    3. [Define a test payload](#C.-Define-a-test-payload)
    4. [Perform real-time inference](#D.-Perform-real-time-inference)
    5. [Visualize output](#E.-Visualize-output)

3. [Clean-up](#4.-Clean-up)
    1. [Delete the endpoint](#A.-Delete-the-endpoint)
    2. [Delete the model](#B.-Delete-the-model)
    

## Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter.

## 1. Select the model package
Confirm that you received this notebook from the model catalog in the Amazon SageMaker console.

In [None]:
model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",  # Oregon
}
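
A map like this is easy to break when editing it by hand, so it can be worth checking that each ARN names the same region as its key. The following is a minimal sketch: `find_mismatched_arns` is a hypothetical helper, not part of the SageMaker SDK, and `example_map` reuses two entries from the map above.

```python
def find_mismatched_arns(package_map):
    """Return the region keys whose ARN does not point at that same region."""
    mismatched = []
    for region, arn in package_map.items():
        # ARN layout: arn:partition:service:region:account-id:resource
        parts = arn.split(":", 5)
        if len(parts) != 6 or parts[2] != "sagemaker" or parts[3] != region:
            mismatched.append(region)
    return mismatched


# Two entries copied from the map above, used here as sample input.
example_map = {
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/arcee-lite-tgi-marketplace-v1-743078a6db673d138724a78cb048d735",
}

print(find_mismatched_arns(example_map))  # prints []
```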

In [None]:
import datetime
import json
import pprint

import boto3
import sagemaker
from IPython.display import Markdown, display
from sagemaker import ModelPackage, get_execution_role

In [None]:
region = boto3.Session().region_name
if region not in model_package_map:
    # Python 3 cannot raise a plain string, so raise a proper exception.
    raise ValueError(f"Unsupported region: {region}")

model_package_arn = model_package_map[region]

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()
runtime_sm_client = boto3.client("runtime.sagemaker")

## 2. Create an endpoint and perform real-time inference

In this example, we're deploying Arcee-Lite on a SageMaker real-time endpoint hosted on a GPU instance. If you need general information on real-time inference with Amazon SageMaker, please refer to the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

The endpoint runs a Hugging Face [Deep Learning Container](https://huggingface.co/docs/sagemaker/index), powered by the Hugging Face [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/index) Server (TGI). TGI enables high-performance text generation for the most popular open-source language models. 

The configuration focuses on cost efficiency. It uses a [g5.xlarge](https://aws.amazon.com/ec2/instance-types/g5/) instance. This instance has a single NVIDIA A10G GPU, with 24 GB of GPU RAM. Context size is set to 4K tokens. Even on this small GPU instance, we should be able to get over 100 tokens per second.
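
Once the endpoint is running, you can estimate throughput yourself. The sketch below is one rough way to do it, under two assumptions: `invoke` is a hypothetical callable that sends a payload to the endpoint and returns the parsed JSON response, and that response carries an OpenAI-style `usage.completion_tokens` field. Because the timing includes network and queueing latency, the result is an end-to-end figure and will understate raw generation speed.

```python
import time


def tokens_per_second(invoke, payload):
    """Time a single request and divide generated tokens by elapsed seconds.

    invoke is a hypothetical callable: payload dict in, parsed JSON dict out.
    """
    start = time.perf_counter()
    output = invoke(payload)
    elapsed = time.perf_counter() - start
    # Assumes an OpenAI-style usage block in the response.
    generated = output["usage"]["completion_tokens"]
    return generated / elapsed if elapsed > 0 else float("inf")
```

A single short request is noisy, so averaging over several requests with a large `max_tokens` gives a more stable number.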

#### OpenAI compatibility

We enable the [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) available in TGI. This lets you invoke the endpoint the same way you would invoke an OpenAI model. Likewise, the output format is identical to that of OpenAI models.
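
To make the request and response shapes concrete, here is a small sketch of both sides. `build_chat_payload` and `extract_reply` are hypothetical helpers, and `example_output` is a hand-written illustration trimmed to the fields this notebook reads, not a captured response.

```python
def build_chat_payload(user_prompt, system_prompt=None, max_tokens=256):
    """Build an OpenAI-style chat request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": "tgi", "messages": messages, "max_tokens": max_tokens}


def extract_reply(output):
    """Return the assistant text from an OpenAI-style chat response."""
    return output["choices"][0]["message"]["content"]


# Illustrative response shape only; real responses contain more fields.
example_output = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(example_output))  # prints Hello!
```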

### A. Define the endpoint configuration

In [None]:
model_name = "Arcee-Lite"
real_time_inference_instance_type = "ml.g5.xlarge"

### B. Create the endpoint

In [None]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# create a unique endpoint name
timestamp = "{:%Y-%m-%d-%H-%M-%S}".format(datetime.datetime.now())
endpoint_name = f"{model_name}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
response = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=3600,
    container_startup_health_check_timeout=600,
)

Once the endpoint is in service, you will be able to perform real-time inference.
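
With its default settings, `deploy()` waits until the endpoint is ready. If you return to the notebook later, or deployed without waiting, you can check the status yourself. The sketch below assumes `sm_client` behaves like `boto3.client("sagemaker")`; the helper names are hypothetical.

```python
def endpoint_status(sm_client, endpoint_name):
    """Return the endpoint status string, for example 'Creating' or 'InService'."""
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"]


def is_in_service(sm_client, endpoint_name):
    """Return True once the endpoint can serve requests."""
    return endpoint_status(sm_client, endpoint_name) == "InService"


# Usage inside the notebook (requires AWS credentials):
#   sm_client = boto3.client("sagemaker")
#   print(endpoint_status(sm_client, endpoint_name))
```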

### C. Define a test payload

In [None]:
model_sample_input = {
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a friendly and helpful AI assistant."},
        {
            "role": "user",
            "content": (
                "Suggest 5 names for a new neighborhood pet food store. "
                "Names should be short, fun, easy to remember, and respectful of pets. "
                "Explain why customers would like them."
            ),
        },
    ],
    "max_tokens": 1024,
}

### D. Perform real-time inference

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
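
If you plan to send several prompts, the call above can be wrapped in a small function. This is a minimal sketch: `chat` is a hypothetical helper, and `runtime_client` is assumed to behave like the `runtime_sm_client` created earlier in this notebook.

```python
import json


def chat(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes holding a JSON document.
    return json.loads(response["Body"].read().decode("utf8"))


# Usage inside the notebook:
#   output = chat(runtime_sm_client, endpoint_name, model_sample_input)
```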

### E. Visualize output

We can print the raw JSON output in OpenAI format.

In [None]:
pprint.pprint(output)

We can also print the generated output with Markdown formatting.

In [None]:
display(Markdown(output["choices"][0]["message"]["content"]))

Here are some more examples. Please feel free to tweak them and add your own!

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "As a friendly technical assistant engineer, answer the question in detail.",
        },
        {"role": "user", "content": "Why are transformers better models than LSTM?"},
    ],
    "max_tokens": 1024,
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
display(Markdown(output["choices"][0]["message"]["content"]))
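
One way to build on these examples is a follow-up question. In the OpenAI message format, a conversation is continued by appending the previous assistant reply and the next user message to the list. The helper below is a hypothetical sketch of that bookkeeping; it assumes the model's chat template accepts alternating user and assistant turns.

```python
def extend_conversation(messages, assistant_reply, next_user_prompt):
    """Return a new message list with the last reply and a follow-up question."""
    # A new list is built so the original history is left unchanged.
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_prompt},
    ]


# Usage inside the notebook:
#   reply = output["choices"][0]["message"]["content"]
#   model_sample_input["messages"] = extend_conversation(
#       model_sample_input["messages"], reply, "Summarize that in two sentences."
#   )
```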

Now that you have successfully performed real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

## 4. Clean-up

Please don't forget to run the cells below to delete all resources and avoid unnecessary charges.

### A. Delete the endpoint

In [None]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [None]:
model.delete_model()
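
After cleaning up, you may want to confirm that no endpoint from this notebook is still running. The sketch below lists endpoints whose name contains a given string. `remaining_endpoints` is a hypothetical helper, `sm_client` is assumed to behave like `boto3.client("sagemaker")`, and only the first page of results is read.

```python
def remaining_endpoints(sm_client, name_fragment):
    """Return the names of endpoints whose name contains name_fragment.

    Only the first page of results is inspected, which is enough for a
    quick check in an account with few endpoints.
    """
    response = sm_client.list_endpoints(NameContains=name_fragment)
    return [endpoint["EndpointName"] for endpoint in response["Endpoints"]]


# Usage inside the notebook (requires AWS credentials):
#   sm_client = boto3.client("sagemaker")
#   print(remaining_endpoints(sm_client, model_name))
```

An empty list means no matching endpoint is left. An endpoint that was just deleted may still appear briefly while the deletion completes.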

Thank you for trying out Arcee-Lite on SageMaker. We have only scratched the surface of what you can do with this model.

We'd be happy to hear from you, learn more about your use case, and help you build your next AI-driven solution. Please reach out to julien@arcee.ai.