# Using Solar Mini Chat on SageMaker JumpStart




**Solar Mini Chat** is an English/Korean large language model developed by Upstage. It is fine-tuned for multi-turn chat and performs well across a wide range of natural language processing tasks. This fine-tuning helps it handle extended conversations, which makes it well suited to interactive applications. The model is built with a scaling method called ‘depth up-scaling’ (DUS), which consists of depthwise scaling and continued pretraining. DUS enlarges smaller models in a more straightforward and efficient way than other scaling methods such as mixture-of-experts.
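
As a rough illustration of the depthwise-scaling half of DUS (continued pretraining is not shown), the sketch below duplicates a stack of layers, trims the overlapping ends of the two copies, and concatenates what remains. The function name `depth_up_scale` and the sizes used (32 base layers, 8 trimmed per copy) are assumptions for illustration and are not stated in this notebook.

```python
def depth_up_scale(base_layers, trim):
    """Return the layer order of a depthwise-scaled stack.

    Two copies of the base stack are made. The last `trim` layers are dropped
    from the first copy, the first `trim` layers are dropped from the second
    copy, and the two remainders are concatenated.
    """
    if not 0 <= trim < len(base_layers):
        raise ValueError("trim must be in [0, number of base layers)")
    first_copy = base_layers[: len(base_layers) - trim]
    second_copy = base_layers[trim:]
    return first_copy + second_copy


# Assumed sizes, for illustration only: 32 base layers, 8 trimmed per copy.
base = list(range(32))
scaled = depth_up_scale(base, trim=8)
print(len(scaled))  # 48
```

The result is a deeper stack built only from existing layers, which is why the enlarged model still needs continued pretraining afterwards.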

This sample notebook shows you how to deploy [Solar Mini Chat](https://aws.amazon.com/marketplace/pp/prodview-7fug6scf2nc6g) and [Solar Mini Chat - Quant](https://aws.amazon.com/marketplace/pp/prodview-npmcixwzkjoxu) using Amazon SageMaker.

## Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Your IAM role has these three permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
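
The three Marketplace permissions above could be granted with an IAM policy along the following lines. This is a minimal sketch: the helper name `build_marketplace_policy` is hypothetical, and the `"Resource": "*"` scope is an assumption that you should review against your organization's access rules before attaching the policy to a role.

```python
import json

# The three AWS Marketplace actions listed in the prerequisites.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def build_marketplace_policy(actions=MARKETPLACE_ACTIONS):
    """Build an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: the actions are allowed on all resources.
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


policy_document = json.dumps(build_marketplace_policy(), indent=2)
print(policy_document)
```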

## Contents:
1. [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
   1. [Create an endpoint](#A.-Create-an-endpoint)
   2. [Prepare input payload](#B.-Prepare-input-payload)
   3. [Perform real-time inference](#C.-Perform-real-time-inference)
   4. [Perform real-time inference with streaming](#D.-Perform-real-time-inference-with-Streaming)
3. [Clean-up](#3.-Clean-up)
   1. [Delete the endpoint](#A.-Delete-the-endpoint)
   2. [Delete the model](#B.-Delete-the-model)
    

## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the [Solar Mini Chat](https://aws.amazon.com/marketplace/pp/prodview-7fug6scf2nc6g) or [Solar Mini Chat - Quant](https://aws.amazon.com/marketplace/pp/prodview-npmcixwzkjoxu) model package listing page.
2. On the AWS Marketplace listing, click the **Continue to subscribe** button.
3. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you and your organization agree with the EULA, pricing, and support terms.
4. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

Two packages are available:
* Solar-1-Mini-Chat (supports `ml.g5.12xlarge`)
* 4-bit quantized version of Solar-1-Mini-Chat (supports `ml.g5.2xlarge`)

In [None]:
%pip install sseclient-py

In [None]:
import time
import json

import sseclient

import boto3
import sagemaker
from sagemaker import ModelPackage, get_execution_role

In [None]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()

sagemaker_runtime = boto3.client("sagemaker-runtime")

In [None]:
# Choose one of the two model packages (non-quantized or 4-bit quantized)
# by leaving exactly one of the lines below uncommented.

model_package_name = "solar-1-mini-chat-240612-r2-80ca3470b1d032c582ac66a019de67d0"
# model_package_name = "solar-1-mini-chat-4bit-240612--b09baa1c640133d49d247beb85eef04d" # 4-bit quantized model

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{model_package_name}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{model_package_name}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{model_package_name}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{model_package_name}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{model_package_name}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{model_package_name}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{model_package_name}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{model_package_name}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{model_package_name}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{model_package_name}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{model_package_name}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{model_package_name}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{model_package_name}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{model_package_name}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{model_package_name}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{model_package_name}",
}

region = sagemaker_session.boto_region_name
if region not in model_package_map:
    raise ValueError(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

print(f"Model Package: '{model_package_arn}'")
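
If you paste a Product ARN copied from the Marketplace configuration page instead of using the mapping, it can help to check its shape and region first. The sketch below is one way to do that. The helper `parse_model_package_arn` and its regular expression are assumptions written for this example, and the pattern only covers the standard `aws` partition.

```python
import re

# Assumed shape: arn:aws:sagemaker:<region>:<12-digit account>:model-package/<name>
ARN_PATTERN = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):"
    r"model-package/(?P<name>[A-Za-z0-9-]+)$"
)


def parse_model_package_arn(arn):
    """Split a model package ARN into its region, account, and name."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a SageMaker model package ARN: {arn!r}")
    return match.groupdict()


example_arn = (
    "arn:aws:sagemaker:us-east-1:865070037744:model-package/"
    "solar-1-mini-chat-240612-r2-80ca3470b1d032c582ac66a019de67d0"
)
parts = parse_model_package_arn(example_arn)
print(parts["region"], parts["account"])
```

Comparing `parts["region"]` with the region of your SageMaker session catches the common mistake of copying the ARN for a different region.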

## 2. Create an endpoint and perform real-time inference

To understand how real-time inference with Amazon SageMaker works, see the [Amazon SageMaker hosting documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
model_name = "solar-1-mini-chat"
content_type = "application/json"

# ml.g5.2xlarge for quantized version, or
# ml.g5.12xlarge for non-quantized version
real_time_inference_instance_type = (
    # "ml.g5.2xlarge"
    "ml.g5.12xlarge"
)
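
Because the instance type has to match the chosen package, one option is to derive it from the package name instead of editing two cells by hand. The sketch below assumes that the quantized package can be recognized by the substring `4bit` in its name, as in the two package names used in this notebook. The helper `instance_type_for` and the dictionary keys are hypothetical.

```python
# Instance types from the package list in section 1; the keys are labels
# chosen for this sketch.
SUPPORTED_INSTANCE_TYPES = {
    "full": "ml.g5.12xlarge",
    "4bit": "ml.g5.2xlarge",
}


def instance_type_for(model_package_name):
    """Pick the supported instance type for a model package name."""
    # Assumption: only the quantized package has "4bit" in its name.
    variant = "4bit" if "4bit" in model_package_name else "full"
    return SUPPORTED_INSTANCE_TYPES[variant]


print(instance_type_for("solar-1-mini-chat-4bit-240612--b09baa1c640133d49d247beb85eef04d"))
```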

### A. Create an endpoint

In [None]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

endpoint_name = sagemaker.utils.name_from_base(model_name)
print(f"endpoint name: '{endpoint_name}'")

In [None]:
# Deploy the model
model.deploy(1, real_time_inference_instance_type, endpoint_name=endpoint_name)

Once the endpoint has been created, you can perform real-time inference.
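
By default `model.deploy` waits until the endpoint is ready. If a deployment was started without waiting (for example with `wait=False`), the endpoint status can be polled before sending requests. The following is a minimal sketch: the helper `wait_until_in_service`, its polling interval, and its timeout are assumptions, and `sm_client` is expected to be a `boto3.client("sagemaker")` object.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name!r} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name!r} is still {status}")
        time.sleep(poll_seconds)


# Example call (requires AWS credentials and an existing endpoint):
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```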

### B. Prepare input payload

The endpoint accepts request and response payloads that are compatible with OpenAI's Chat Completions endpoint.

Supported parameters:
- messages (list of objects)*: List of messages, each containing `role` and `content`. `role` must be one of [`system`, `user`, `assistant`].
- model (string): Accepted for compatibility, but not required because only one model is served.
- frequency_penalty (number): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
- presence_penalty (number): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
- max_tokens (integer): The maximum number of tokens that can be generated in the chat completion. Solar supports a context of at most 4k (4096) tokens, counting input and generated tokens together.
- temperature (number): The sampling temperature to use, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.
- top_p (number): An alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens that make up the top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.

**required*
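
The constraints in the parameter list can be checked on the client before a request is sent, so that mistakes surface as a clear error and not as a failed endpoint call. The sketch below is one possible check. The helper `build_payload` is hypothetical, the ranges are copied from the list above, and `top_p` is passed through unchecked because the list does not state its range.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}
# Ranges copied from the parameter list above.
NUMERIC_RANGES = {
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "temperature": (0.0, 2.0),
}
MAX_CONTEXT_TOKENS = 4096


def build_payload(messages, **params):
    """Validate messages and parameters, then return a request payload."""
    if not messages:
        raise ValueError("messages must contain at least one message")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Unsupported role: {message.get('role')!r}")
        if "content" not in message:
            raise ValueError("Every message needs a 'content' field")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in params and not low <= params[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if "max_tokens" in params and not 0 < params["max_tokens"] <= MAX_CONTEXT_TOKENS:
        raise ValueError(f"max_tokens must be between 1 and {MAX_CONTEXT_TOKENS}")
    return {"messages": list(messages), **params}


payload = build_payload(
    [{"role": "user", "content": "Hello"}], temperature=0.7, max_tokens=256
)
print(payload)
```

Note that `max_tokens` limits only the generated tokens, so a value that passes this check can still exceed the 4096-token context once the input is counted.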

In [None]:
# Single-turn chat example
input = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Can you provide a Python script to merge two sorted lists?",
        },
    ],
    "temperature": 0.7,
}

In [None]:
# To limit the length of the output, use the max_tokens parameter.
input = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Can you provide a Python script to merge two sorted lists?",
        },
    ],
    "max_tokens": 1024,
}

In [None]:
# Multi-turn chat example
input = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Can you provide a Python script to merge two sorted lists?",
        },
        {
            "role": "assistant",
            "content": """Sure, here is a Python script to merge two sorted lists:

                    ```python
                    def merge_lists(list1, list2):
                        return sorted(list1 + list2)
                    ```
                    """,
        },
        {
            "role": "user",
            "content": "Can you provide an example of how to use this function?",
        },
    ]
}

### C. Perform real-time inference

In [None]:
# real-time inference
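def invoke_endpoint(endpoint_name, payload):
    """Send a JSON payload to the endpoint and return the response body as text."""
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The body is a streaming object: read it fully, then decode it.
    return response["Body"].read().decode("utf-8")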

In [None]:
response = json.loads(invoke_endpoint(endpoint_name, input))

In [None]:
print(response)

In [None]:
print("assistant: ")
print(response["choices"][0]["message"]["content"])
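
To continue a multi-turn chat, the assistant's reply can be appended to the message history together with the next user message, and the extended list sent as the next request. The sketch below assumes the response shape printed above. The helper `extend_conversation` and the stand-in response are hypothetical and exist only so the example runs without an endpoint.

```python
def extend_conversation(messages, response, next_user_message):
    """Return a new message list with the reply and the next user turn added."""
    reply = response["choices"][0]["message"]["content"]
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_user_message},
    ]


# Stand-in for a parsed endpoint response (not real model output).
fake_response = {
    "choices": [{"message": {"role": "assistant", "content": "Here is a script."}}]
}
history = [{"role": "user", "content": "Can you provide a Python script?"}]
history = extend_conversation(history, fake_response, "How do I use it?")
print([message["role"] for message in history])
```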

### D. Perform real-time inference with Streaming

In [None]:
input["stream"] = True

In [None]:
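def invoke_endpoint_stream(endpoint_name, request_body):
    """Invoke the endpoint in streaming mode and yield raw bytes as they arrive."""
    response = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(request_body),
        ContentType="application/json",
    )

    # Each event in the stream wraps a chunk of bytes in a PayloadPart.
    for event in response["Body"]:
        yield event["PayloadPart"]["Bytes"]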

In [None]:
# print stream response
response = invoke_endpoint_stream(endpoint_name, input)

client = sseclient.SSEClient(response)
for event in client.events():
    if event.data == "[DONE]":
        break

    data = json.loads(event.data)
    if data.get("choices"):
        print(data["choices"][0]["delta"].get("content", ""), end="")
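
To make clearer what the `sseclient` loop is doing, the sketch below parses server-sent events by hand and joins the streamed text. It is a simplified stand-in, not a replacement for the library: it assumes events are separated by a blank line written as `\n\n`, reads only `data:` fields, and uses hand-written byte chunks in place of a real endpoint stream. The helper names are hypothetical.

```python
import json


def iter_sse_data(byte_chunks):
    """Yield the data field of each complete server-sent event."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # An event is complete once a blank line ("\n\n") has arrived.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            data_lines = []
            for line in raw_event.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    value = line[len("data:"):]
                    # Drop the single optional space after the colon.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                yield "\n".join(data_lines)


def collect_stream_text(byte_chunks):
    """Join the delta contents of a streamed chat completion."""
    pieces = []
    for data in iter_sse_data(byte_chunks):
        if data == "[DONE]":
            break
        payload = json.loads(data)
        if payload.get("choices"):
            pieces.append(payload["choices"][0]["delta"].get("content", ""))
    return "".join(pieces)


# Hand-written chunks; the second event is split across two chunks on purpose.
fake_chunks = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\nda',
    b'ta: {"choices": [{"delta": {"content": "lo"}}]}\n\n',
    b"data: [DONE]\n\n",
]
print(collect_stream_text(fake_chunks))
```

Buffering until the blank line arrives matters because the byte chunks yielded by the endpoint are not guaranteed to line up with event boundaries.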

## 3. Clean-up

### A. Delete the endpoint

Now that you have successfully performed real-time inference, delete the endpoint to avoid further charges.

In [None]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [None]:
model.delete_model()
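
If one clean-up call fails, the cells after it may never be run, which can leave billable resources behind. The sketch below wraps the same three calls so that every step is attempted and failures are reported together. The helper `clean_up` is hypothetical, and `model` is expected to be the `ModelPackage` object created earlier.

```python
def clean_up(model, endpoint_name):
    """Attempt every clean-up step and return the ones that failed."""
    session = model.sagemaker_session
    steps = [
        ("delete_endpoint", lambda: session.delete_endpoint(endpoint_name)),
        ("delete_endpoint_config", lambda: session.delete_endpoint_config(endpoint_name)),
        ("delete_model", model.delete_model),
    ]
    failures = {}
    for name, step in steps:
        try:
            step()
        except Exception as error:
            # Keep going so the remaining resources are still removed.
            failures[name] = error
    return failures


# Example call (requires the deployed model from this notebook):
# failures = clean_up(model, endpoint_name)
# print(failures or "All resources deleted.")
```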