# Using Arcee.ai Virtuoso Models on SageMaker through Model Packages
*The latest version of this notebook is available on [Github](https://github.com/arcee-ai/aws-samples/tree/main/model_package_notebooks).*

This notebook shows you how to deploy the [Arcee.ai](https://www.arcee.ai) Virtuoso models listed on [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=seller-r7b33ivdczgs6). You must subscribe to the appropriate model on AWS Marketplace before you can deploy it.

The Virtuoso models are general-purpose, high-performance models that do well both on benchmarks and in real-world applications. They deliver quality comparable to much larger models in a more compact form, which makes them a good fit for organizations looking for both performance and cost efficiency.

They are available in three sizes, all with a 128K-token context window:
* **Virtuoso Large**: Best-in-class frontier model.
* **Virtuoso Medium**: Mid-tier general-purpose performance at a lower cost.
* **Virtuoso Small**: Optimized for lightweight tasks and faster inference.

Models are deployed to an Amazon SageMaker endpoint. For general information on real-time inference with SageMaker, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

**If you already deployed the model package with CloudFormation, the AWS CLI or directly in the AWS console, there is no need to deploy it again with this notebook. For inference, please use the [sample-notebook-all-models-existing-sagemaker-endpoint.ipynb](sample-notebook-all-models-existing-sagemaker-endpoint.ipynb) notebook instead.**

## Use cases
The Virtuoso models are suitable for a wide range of language tasks, demonstrating particular strength in:
* **Reasoning**: Solving complex problems and drawing logical conclusions.
* **Creative Writing**: Generating engaging and original content across various genres.
* **General Language Understanding**: Comprehending and generating human-like text in diverse contexts.

They can be applied to various business tasks such as:
* **Customer Service**: Implement sophisticated chatbots and virtual assistants.
* **Content Creation**: Generate high-quality written content for marketing and documentation.
* **Data Analysis**: Enhance data interpretation and generate insightful reports.
* **Research and Development**: Assist in literature reviews and hypothesis generation.
* **Legal and Compliance**: Automate contract analysis and regulatory compliance checks.
* **Education and Training**: Create adaptive learning systems and intelligent tutoring programs.

## Prerequisites
1. This notebook works for models listed on AWS Marketplace. Make sure you have subscribed to the appropriate model before running it.
1. Ensure that the IAM role attached to this notebook has the **AmazonSageMakerFullAccess** IAM policy.

## Contents
1. [Select the model package](#1.-Select-the-model-package)

2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
    1. [Define the endpoint configuration](#A.-Define-the-endpoint-configuration)
    2. [Create the endpoint](#B.-Create-the-endpoint)
    3. [Define a test payload](#C.-Define-a-test-payload)
    4. [Perform real-time inference](#D.-Perform-real-time-inference)
    5. [Visualize output](#E.-Visualize-output)
    6. [Perform streaming inference](#F.-Perform-streaming-inference)

3. [Clean-up](#3.-Clean-up)
    1. [Delete the endpoint](#A.-Delete-the-endpoint)
    2. [Delete the model](#B.-Delete-the-model)

In [None]:
%%sh
pip install -q boto3 sagemaker

In [None]:
import datetime
import json
import pprint

import boto3
import sagemaker
from IPython.display import Markdown, display
from sagemaker import ModelPackage, get_execution_role
from sagemaker_streaming import print_event_stream

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()
runtime_sm_client = boto3.client("runtime.sagemaker")

## 1. Select the model package

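Virtuoso Small, Medium and Large are packaged separately. Run **only one** of the three cells below to select the model size and the instance type to deploy it on: each cell overwrites `model_name`, `real_time_inference_instance_type` and `model_package_map`, so the last cell you run wins.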

By default, models are deployed on Amazon EC2 [g6e](https://aws.amazon.com/ec2/instance-types/g6e/) instances powered by NVIDIA L40S GPUs. You may use other instance types as long as they're supported by the model package: you will find the list on the AWS Marketplace model page.

In [None]:
# Run this cell to deploy Virtuoso Small

model_name = "virtuoso-small"
real_time_inference_instance_type = "ml.g6e.12xlarge"

model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0",  # Oregon
}

In [None]:
# Run this cell to deploy Virtuoso Medium

model_name = "virtuoso-medium"
real_time_inference_instance_type = "ml.g6e.12xlarge"

model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/virtuoso-medium-vllm-marketpla-c91c71ed9080314ab3a082d29c9bd3bb",  # Oregon
}

In [None]:
# Run this cell to deploy Virtuoso Large

model_name = "virtuoso-large"
real_time_inference_instance_type = "ml.g6e.48xlarge"

model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/virtuoso-large-vllm-marketplac-4fbe605cb0dd3237aeef5fe94842be93",  # Oregon
}

In [None]:
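region = boto3.Session().region_name

# Fail early if the selected model package is not available in this region.
# Note: raising a plain string is a TypeError in Python 3, so raise a real exception.
if region not in model_package_map:
    raise ValueError(f"Unsupported region: {region}")

model_package_arn = model_package_map[region]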
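Before deploying, it can help to confirm that the selected ARN really belongs to the current region. The sketch below splits an ARN using the standard `arn:partition:service:region:account-id:resource` layout. `parse_model_package_arn` is a hypothetical helper written for illustration; it is not part of the SageMaker SDK.

```python
from typing import NamedTuple


class ModelPackageArn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    package_name: str


def parse_model_package_arn(arn: str) -> ModelPackageArn:
    """Split a SageMaker model package ARN into its components.

    Hypothetical helper: assumes the standard ARN layout
    arn:partition:service:region:account-id:model-package/name
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not an ARN: {arn}")
    _, partition, service, region, account_id, resource = parts
    resource_type, _, name = resource.partition("/")
    if service != "sagemaker" or resource_type != "model-package":
        raise ValueError(f"Not a SageMaker model package ARN: {arn}")
    return ModelPackageArn(partition, service, region, account_id, name)


# Example with one of the ARNs listed above
arn = (
    "arn:aws:sagemaker:us-west-2:594846645681:model-package/"
    "virtuoso-small-vllm-marketplac-823845f606f738259097c384364ff2f0"
)
parsed = parse_model_package_arn(arn)
print(parsed.region, parsed.account_id)
```

In the notebook, you could compare `parse_model_package_arn(model_package_arn).region` with `region` before calling `deploy`.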

## 2. Create an endpoint and perform real-time inference

### A. Define the endpoint configuration

Models have been pre-packaged and stored in AWS, so no public download takes place at deployment time.

The SageMaker endpoint runs the AWS [Large Model Inference](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html) (LMI) container, powered by the vLLM inference server. vLLM provides high-performance text generation for the most popular open-source language models.

The endpoint accepts requests in the [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) format, which vLLM supports.
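To make the request format concrete, here is a minimal sketch of a helper that builds a Messages API request body like the ones used later in this notebook. `build_chat_payload` is hypothetical. `messages`, `max_tokens` and `stream` appear in this notebook's own payloads; `temperature` is a standard OpenAI sampling parameter, and we assume (without having verified it) that the endpoint honors it.

```python
import json
from typing import Optional


def build_chat_payload(
    user_prompt: str,
    system_prompt: Optional[str] = None,
    max_tokens: int = 1024,
    stream: bool = False,
    temperature: Optional[float] = None,
) -> dict:
    """Build a request body in the OpenAI Messages (chat completions) format.

    Hypothetical helper. `temperature` is assumed, not verified, to be
    supported by the endpoint.
    """
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    payload = {"messages": messages, "max_tokens": max_tokens}
    if stream:
        payload["stream"] = True  # request incremental output
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


# The JSON string is what gets sent as the request Body
body = json.dumps(
    build_chat_payload("Say hello.", system_prompt="You are a helpful assistant.")
)
print(body)
```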

### B. Create the endpoint

In [None]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# create a unique endpoint name
timestamp = "{:%Y-%m-%d-%H-%M-%S}".format(datetime.datetime.now())
endpoint_name = f"{model_name}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
response = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=900,
    container_startup_health_check_timeout=900,
)

Once the endpoint is in service, you will be able to perform real-time inference.
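By default, `model.deploy` blocks until the endpoint is ready. If you deploy with `wait=False`, or come back to an existing endpoint later, a polling loop like the one below can check its status. It relies on the boto3 SageMaker client's `describe_endpoint`, whose response includes an `EndpointStatus` field; the helper name, timeout and poll interval are illustrative choices. boto3 also provides a built-in `endpoint_in_service` waiter that serves the same purpose.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint reports InService.

    Hypothetical helper: sm_client is expected to be boto3.client("sagemaker").
    Raises RuntimeError if the endpoint fails, TimeoutError if it takes too long.
    """
    waited = 0
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_s)
        waited += poll_s


# Usage in the notebook:
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```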

### C. Define a test payload

In [None]:
model_sample_input = {
    "messages": [
        {"role": "system", "content": "You are a friendly and helpful AI assistant."},
        {
            "role": "user",
            "content": (
                "Suggest 5 names for a new neighborhood pet food store. "
                "Names should be short, fun, easy to remember, and respectful of pets. "
                "Explain why customers would like them."
            ),
        },
    ],
    "max_tokens": 1024,
}

### D. Perform real-time inference

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))

### E. Visualize output

We can print the raw JSON output in OpenAI format.

In [None]:
pprint.pprint(output)

We can also print the generated output with Markdown formatting.

In [None]:
display(Markdown(output["choices"][0]["message"]["content"]))
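Beyond the generated text, an OpenAI-format response usually carries a `finish_reason` and a `usage` block with token counts, which are handy for tracking cost and spotting truncated answers (in the OpenAI schema, a `finish_reason` of `"length"` means generation stopped at `max_tokens`). The sketch below is a hypothetical helper; it reads the optional fields with `.get()` because we have not confirmed that the endpoint always returns them, and the `sample` dictionary is illustrative, not real endpoint output.

```python
def summarize_chat_response(output: dict) -> dict:
    """Pull the generated text and token counts from an OpenAI-format response.

    Hypothetical helper; missing optional fields come back as None.
    """
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Illustrative response in the OpenAI chat completions shape
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
print(summarize_chat_response(sample))
```

In the notebook, you would call `summarize_chat_response(output)` on the parsed response from section D.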

### F. Perform streaming inference

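The examples below set `"stream": True` and call `invoke_endpoint_with_response_stream`, so the generated text is printed as it arrives. Feel free to tweak them and add your own!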

In [None]:
prompt = """Please write a friendly marketing pitch for a new SaaS AI platform called Arcee Cloud.
We will send this pitch by email to business and technical decision-makers, so make it sound exciting yet professional.
The contact email is sales@arcee.ai.
Arcee Cloud makes it simple for enterprise users to tailor open-source small language models to their own domain knowledge,
in order to build high-quality, cost-effective and secure AI solutions."""

model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful AI marketing assistant.",
        },
        {"role": "user", "content": prompt},
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "As a friendly financial assistant, answer the question in detail.",
        },
        {
            "role": "user",
            "content": (
                "Suggest a pre-earnings options hedging strategy for a volatile tech stock. "
                "Show me an example with a fictitious company."
            ),
        },
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are Darlene, a friendly and helpful salesperson working at "
                "Crystal River Classic Bikes, a classic motorcycle dealership in central Florida."
            ),
        },
        {
            "role": "user",
            "content": (
                "Using English, write a personalized customer email to get them to sign up "
                "for a test ride on the new 2025 motorcycles that are visible at the dealership. "
                "Tone should be warm and personal; make sure to weave in the customer information below. "
                "Wyatt, your chief mechanic and road captain, has just won the 2024 State Award for Best Mechanic.\n"
                "\n"
                "Customer information:\n"
                "- name: Julien\n"
                "- last visit: 6 months ago for bike service\n"
                "- Owns 2 bikes, a 2002 sporty bike and a 2007 cruiser\n"
                "- Wishes he had more time to ride!"
            ),
        },
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

Once you have finished experimenting, you no longer need the endpoint. Delete it to stop incurring charges.

## 3. Clean-up

Please don't forget to run the cells below to delete all resources and avoid unnecessary charges.

### A. Delete the endpoint

In [None]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [None]:
model.delete_model()
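The cells above rely on the `model` object still being in memory. If the kernel has restarted, you could instead look up and delete the resources by endpoint name with the boto3 SageMaker client, as sketched below. `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config` and `delete_model` are boto3 SageMaker client methods; the helper itself is hypothetical. It deletes every model referenced by the endpoint configuration, so double-check the names before running it in a shared account.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references.

    Hypothetical helper; sm_client is expected to be boto3.client("sagemaker").
    Names are looked up first, so nothing assumes the config shares the endpoint's name.
    """
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete the endpoint first, then the resources it depended on
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return config_name, model_names


# Usage in the notebook:
# delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```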

Thank you for trying Virtuoso. We have only scratched the surface of what you can do with this model.

We'd be happy to hear from you, learn more about your use case, and help you build your next AI-powered product or service. Please reach out to julien@arcee.ai.