# Using Arcee.ai Coder Models on SageMaker through Model Packages
*The latest version of this notebook is available on [GitHub](https://github.com/arcee-ai/aws-samples/tree/main/model_package_notebooks).*

This notebook shows you how to deploy the [Arcee.ai](https://www.arcee.ai) Coder models listed on [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=seller-r7b33ivdczgs6). You must have previously subscribed to the appropriate model to deploy it.

The Coder models are purpose-built for developers, and excel in both benchmark performance and real-world applications. They bring you the same quality as much larger models in a more compact form, which makes them a good fit for organizations that need both performance and cost efficiency.

They are available in two sizes, both with a 32K token context size:
* **Coder Large** handles advanced programming and development tasks.
* **Coder Small** is a lightweight option for faster, simpler coding workflows and autocomplete tasks.

Models are deployed to an Amazon SageMaker endpoint.  If you need general information on real-time inference with SageMaker, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

**If you already deployed the model package with CloudFormation, the AWS CLI or directly in the AWS console, there is no need to deploy it again with this notebook. For inference, please use the [sample-notebook-all-models-existing-sagemaker-endpoint.ipynb](sample-notebook-all-models-existing-sagemaker-endpoint.ipynb) notebook instead.**

## Use cases
The Coder models are suitable for a wide range of code-related tasks, demonstrating particular strength in:
* **Writing and Generating Code**: Creating code snippets, scripts, and full programs in various programming languages (e.g., Python, Java, C++, JavaScript).
* **Debugging Code**: Identifying and fixing errors in existing code.
* **Code Optimization**: Improving the performance and efficiency of code.
* **Algorithm Design**: Developing algorithms to solve specific problems.
* **Learning Resources**: Recommending tutorials, documentation, and learning materials for different programming languages and frameworks.


They can be applied to various business tasks such as:
* **Software Development**: Writing and generating code for custom software solutions tailored to business needs.
* **API Integration**: Assisting with the integration of third-party APIs to enhance functionality and data flow.
* **Data Analysis**: Using code to perform data analysis and generate insights for business decision-making.
* **DevOps Implementation**: Helping with setting up CI/CD pipelines, containerization, and cloud deployments to streamline development processes.
* **Security Assurance**: Providing code reviews and security best practices to protect business applications from vulnerabilities.

## Pre-requisites
1. This notebook works for models listed on AWS Marketplace. Please make sure you have previously subscribed to the appropriate model.
1. Ensure that the IAM role attached to this notebook has the **AmazonSageMakerFullAccess** IAM policy.

## Contents
1. [Select the model package](#1.-Select-the-model-package)

2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
    1. [Define the endpoint configuration](#A.-Define-the-endpoint-configuration)
    2. [Create the endpoint](#B.-Create-the-endpoint)
    3. [Define a test payload](#C.-Define-a-test-payload)
    4. [Perform real-time inference](#D.-Perform-real-time-inference)
    5. [Visualize output](#E.-Visualize-output)
    6. [Perform streaming inference](#F.-Perform-streaming-inference)

3. [Clean-up](#3.-Clean-up)
    1. [Delete the endpoint](#A.-Delete-the-endpoint)
    2. [Delete the model](#B.-Delete-the-model)

In [None]:
%%sh
pip install -q boto3 sagemaker

In [None]:
import datetime
import json
import pprint

import boto3
import sagemaker
from IPython.display import Markdown, display
from sagemaker import ModelPackage, get_execution_role
from sagemaker_streaming import print_event_stream

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()
runtime_sm_client = boto3.client("runtime.sagemaker")

## 1. Select the model package

Coder Small and Coder Large are packaged separately. Please run one of the two cells below to select the size you'd like to deploy, and the instance type you'd like to deploy it on.

By default, models are deployed on Amazon EC2 [g6e](https://aws.amazon.com/ec2/instance-types/g6e/) instances powered by NVIDIA L40S GPUs. You may use other instance types as long as they're supported by the model package: you will find the list on the AWS Marketplace model page.

In [None]:
# Run this cell to deploy Coder Small

model_name = "coder-small"
real_time_inference_instance_type = "ml.g6e.12xlarge"

model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/coder-small-vllm-marketplace-v-d0c67172309a363d990442eeb1299404",  # Oregon
}

In [None]:
# Run this cell to deploy Coder Large

model_name = "coder-large"
real_time_inference_instance_type = "ml.g6e.48xlarge"

model_package_map = {
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Tokyo
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Seoul
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Mumbai
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Singapore
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Sydney
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Canada Central
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Frankfurt
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Stockholm
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Ireland
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # London
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Paris
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # São Paulo
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # N. Virginia
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Ohio
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # N. California
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/coder-large-vllm-marketplace-v-271e62e9152c3b48a99025600eb78bfb",  # Oregon
}

In [None]:
region = boto3.Session().region_name
if region not in model_package_map:
    # Raising a plain string is a TypeError in Python 3, so raise a real exception
    raise ValueError(f"Unsupported region: {region}")

model_package_arn = model_package_map[region]
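
The region lookup above can also be wrapped in a small helper that lists the supported regions when the lookup fails. This is only a sketch: the `resolve_model_package_arn` name and the `example_map` placeholder ARNs are hypothetical and are not part of the SageMaker SDK or of the model packages listed above.

```python
def resolve_model_package_arn(region, package_map):
    """Return the model package ARN for `region`, or raise a descriptive error."""
    if region not in package_map:
        supported = ", ".join(sorted(package_map))
        raise ValueError(
            f"Region {region!r} is not supported. Supported regions: {supported}"
        )
    return package_map[region]


# Hypothetical map with placeholder ARNs, for illustration only
example_map = {
    "us-east-1": "arn:aws:sagemaker:us-east-1:111111111111:model-package/example",
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:222222222222:model-package/example",
}
example_arn = resolve_model_package_arn("eu-west-1", example_map)
```

In the notebook, you would pass `boto3.Session().region_name` and the `model_package_map` defined in the cell you ran.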

## 2. Create an endpoint and perform real-time inference

### A. Define the endpoint configuration

Models have been pre-packaged and stored in AWS. No public download takes place at deployment time.

The SageMaker endpoint runs the AWS [Large Model Inference](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html) container (LMI), powered by the vLLM inference server. vLLM enables high-performance text generation for the most popular open-source language models. 

The [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) is available in vLLM, so requests and responses use the OpenAI chat format.
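
For reference, a request in this format is a JSON object with a `messages` list of role/content pairs, plus generation parameters such as `max_tokens` and `stream`. The sketch below builds such a payload using only the fields that appear in this notebook; the `build_chat_payload` helper is hypothetical and simply illustrates the shape of the request.

```python
import json


def build_chat_payload(user_prompt, system_prompt=None, max_tokens=1024, stream=False):
    """Build a chat request body in the OpenAI Messages format."""
    messages = []
    if system_prompt:
        # The optional system message sets the assistant's overall behavior
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    payload = {"messages": messages, "max_tokens": max_tokens}
    if stream:
        # Only add the flag when streaming is requested
        payload["stream"] = True
    return payload


payload = build_chat_payload(
    "Write a Python function that reverses a string.",
    system_prompt="You are a friendly and helpful AI coding assistant.",
)
body = json.dumps(payload)  # this string is what gets sent as the request body
```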

### B. Create the endpoint

In [None]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# create a unique endpoint name
timestamp = "{:%Y-%m-%d-%H-%M-%S}".format(datetime.datetime.now())
endpoint_name = f"{model_name}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
response = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=900,
    container_startup_health_check_timeout=900,
)

Once the endpoint is in service, you will be able to perform real-time inference.
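
If you return to the notebook later, for example after a kernel restart, you may want to confirm the endpoint status before invoking it. The sketch below polls the SageMaker `describe_endpoint` API until the status is `InService`. The `wait_until_in_service` helper, its polling interval and its retry count are assumptions chosen for illustration; boto3 also offers a built-in `endpoint_in_service` waiter that serves the same purpose.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll the endpoint status until it is InService, or raise on failure/timeout."""
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not in service after {max_polls} polls")


# Example usage (requires AWS credentials and an existing endpoint):
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```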

### C. Define a test payload

In [None]:
model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful AI coding assistant.",
        },
        {
            "role": "user",
            "content": """Explain the difference between logits distillation and hidden states distillation.
            In particular, explain how the loss functions differ. Show code snippets with Pytorch.""",
        },
    ],
    "max_tokens": 1024,
}

### D. Perform real-time inference

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))

### E. Visualize output

We can print the raw JSON output in OpenAI format.

In [None]:
pprint.pprint(output)

We can also print the generated output with Markdown formatting.

In [None]:
display(Markdown(output["choices"][0]["message"]["content"]))
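
Besides the generated text, a response in OpenAI format usually carries a `finish_reason` and token counts under `usage`. The sketch below pulls these out defensively. The `summarize_completion` helper and the `sample_output` dictionary are hypothetical, and the presence of the `finish_reason` and `usage` fields is an assumption based on the OpenAI format rather than something verified against this endpoint.

```python
def summarize_completion(output):
    """Extract the text, finish reason and token usage from an OpenAI-format response."""
    choice = output["choices"][0]
    usage = output.get("usage", {})  # assumed field, may be absent
    return {
        "text": choice["message"]["content"],
        # "length" typically means generation stopped at max_tokens
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written sample response, for illustration only
sample_output = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "print('hello')"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 5},
}
summary = summarize_completion(sample_output)
```

A `finish_reason` of `length` is a hint that the answer was truncated and that `max_tokens` may need to be raised.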

### F. Perform streaming inference

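The examples below set `"stream": True` in the payload and call `invoke_endpoint_with_response_stream`, so the output is printed as it is generated. Please feel free to tweak them and add your own!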

In [None]:
code = """
import numpy as np
import pandas as pd

def process_data(file_path):
    # Load data from CSV
    data = pd.read_csv(file_path)
    # Filter data where 'age' is greater than 30
    filtered_data = data[data['age'] > 30]
    # Calculate the mean of the 'salary' column
    mean_salary = np.mean(filtered_data['salary'])
    # Calculate the standard deviation of the 'salary' column
    std_salary = np.std(filtered_data['salary'])
    # Create a summary dictionary
    summary = {
        'mean_salary': mean_salary,
        'std_salary': std_salary
    }
    return summary
"""

model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful AI coding assistant.",
        },
        {"role": "user", "content": f"Identify and fix issues in this code: {code}"},
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

In [None]:
code = """
import boto3

my_bucket = "image-bucket'
s3 = boto3.client('s3')

# List all JPEG images larger than 8MB
"""

model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful AI coding assistant.",
        },
        {"role": "user", "content": f"Complete this code fragment: {code}"},
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

In [None]:
code = """
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3ListObjects {

    public static void main(String[] args) {
        // Hardcoded AWS credentials (bad practice)
        String accessKey = "YOUR_ACCESS_KEY";
        String secretKey = "YOUR_SECRET_KEY";
        String bucketName = "image-bucket";

        // Create AWS credentials
        BasicAWSCredentials awsCreds = new BasicAWSCredentials(accessKey, secretKey);

        // Create S3 client
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withRegion("us-west-2")
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .build();

        // List objects in the bucket
        ObjectListing objectListing = s3Client.listObjects(bucketName);

        // Iterate over the objects and print their keys
        for (S3ObjectSummary os : objectListing.getObjectSummaries()) {
            System.out.println(" - " + os.getKey() + " (size = " + os.getSize() + ")");
        }
    }
}
"""

model_sample_input = {
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly and helpful AI coding assistant.",
        },
        {
            "role": "user",
            "content": f"Fix all issues in the code. Write a short description for a pull request: {code}",
        },
    ],
    "max_tokens": 1024,
    "stream": True,
}

response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(model_sample_input),
    ContentType="application/json",
)

print_event_stream(response["Body"])

Now that you have successfully performed real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

## 3. Clean-up

Please don't forget to run the cells below to delete all resources and avoid unnecessary charges.

### A. Delete the endpoint

In [None]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [None]:
model.delete_model()
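
After clean-up, you may want to confirm that no endpoint created by this notebook is still running. The sketch below pages through the SageMaker `list_endpoints` API with its `NameContains` filter. The `remaining_endpoints` helper is hypothetical, and matching on the model name assumes the endpoint naming scheme used earlier in this notebook.

```python
def remaining_endpoints(sm_client, name_contains):
    """Return the names of all endpoints whose name contains `name_contains`."""
    names = []
    kwargs = {"NameContains": name_contains}
    while True:
        page = sm_client.list_endpoints(**kwargs)
        names.extend(endpoint["EndpointName"] for endpoint in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            return names
        # More results are available: request the next page
        kwargs["NextToken"] = token


# Example usage (requires AWS credentials):
# leftovers = remaining_endpoints(boto3.client("sagemaker"), model_name)
# print(leftovers or "No matching endpoints are running.")
```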

Thank you for trying Coder. We have only scratched the surface of what you can do with this model.

We'd be happy to hear from you, learn more about your use case, and help you build your next AI-powered product or service. Please reach out to julien@arcee.ai.