# Using Solar Mini Chat on SageMaker JumpStart




**Solar Mini Chat** is an advanced English/Korean large language model developed by Upstage. It is fine-tuned specifically for multi-turn chat and performs well across a wide range of natural language processing tasks. This fine-tuning helps it handle extended conversations more effectively, which makes it well suited to interactive applications. The model is built with a scaling method called ‘depth up-scaling’ (DUS), which comprises depthwise scaling and continued pretraining. DUS enlarges smaller models in a more straightforward and efficient way than other scaling methods such as mixture-of-experts.

This sample notebook shows you how to deploy [Solar Mini Chat](https://aws.amazon.com/marketplace/...) using Amazon SageMaker. (TODO: Fix the link)

## Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that your IAM role has these three permissions and that you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    1. **aws-marketplace:ViewSubscriptions**
    1. **aws-marketplace:Unsubscribe**
    1. **aws-marketplace:Subscribe**
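
The three Marketplace actions listed above could be granted through an inline IAM policy. The following is a minimal sketch that builds such a policy document in Python. The `Sid` value and the wildcard `Resource` are assumptions made for illustration and are not requirements stated by this notebook; your organization may prefer a managed policy or a narrower scope.

```python
import json

# Hypothetical inline policy covering only the three AWS Marketplace
# actions named in the prerequisites. The Sid and the "*" resource are
# assumptions for this sketch.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowMarketplaceSubscriptions",
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

# Print the document so it can be pasted into the IAM console or a template.
print(json.dumps(marketplace_policy, indent=2))
```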

## Contents:
1. [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
   1. [Create an endpoint](#A.-Create-an-endpoint)
   2. [Prepare input payload](#B.-Prepare-input-payload)
   3. [Perform real-time inference](#C.-Perform-real-time-inference)
3. [Clean-up](#3.-Clean-up)
   1. [Delete the endpoint](#A.-Delete-the-endpoint)
   2. [Delete the model](#B.-Delete-the-model)
    

## Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter.

## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package [listing page](https://aws.amazon.com/marketplace/pp/...) (TODO: Update link)
2. On the AWS Marketplace listing, click the **Continue to subscribe** button.
3. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
4. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn** displayed. This is the model package ARN that you need to specify when creating a deployable model with the SageMaker Python SDK. Copy the ARN corresponding to your region and specify it in the following cell.

We offer two types of packages. (TODO: Update ARNs)
* Solar-1-Mini-Chat (supports `ml.g5.12xlarge`)
    * `arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat`
* 4-bit quantized version of Solar-1-Mini-Chat (supports `ml.g5.2xlarge`)
    * `arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat-4bit`
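
Because each package is tied to a specific instance type, it can help to select both together so they never get out of sync. The sketch below is one way to do that. The variant labels `"full"` and `"4bit"` and the helper `select_variant` are hypothetical names introduced here; the ARNs are the placeholder values listed above and should be replaced with the ones shown on your listing page.

```python
# Variant labels are hypothetical; ARNs and instance types are the
# placeholder values from the list above.
MODEL_VARIANTS = {
    "full": {
        "model_package_arn": "arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat",
        "instance_type": "ml.g5.12xlarge",
    },
    "4bit": {
        "model_package_arn": "arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat-4bit",
        "instance_type": "ml.g5.2xlarge",
    },
}


def select_variant(name):
    """Return (model_package_arn, instance_type) for a variant label."""
    try:
        variant = MODEL_VARIANTS[name]
    except KeyError:
        raise ValueError(
            f"Unknown variant {name!r}; expected one of {sorted(MODEL_VARIANTS)}"
        ) from None
    return variant["model_package_arn"], variant["instance_type"]


selected_arn, selected_instance_type = select_variant("4bit")
print(selected_arn, selected_instance_type)
```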

In [22]:
# Choose one of our model packages
model_package_arn = (
    # "arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat"
    "arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat-4bit"
)
print(f"Model Package: '{model_package_arn}'")

Model Package: 'arn:aws:sagemaker:us-west-2::model-package/Solar-1-Mini-Chat-4bit'


In [59]:
import json

import sagemaker
from sagemaker import ModelPackage, get_execution_role, serializers, deserializers

In [24]:
role = get_execution_role()
sagemaker_session = sagemaker.Session()

## 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [28]:
model_name = "solar-1-mini-chat"
content_type = "application/json"

# ml.g5.2xlarge for quantized version, or
# ml.g5.12xlarge for non-quantized version
real_time_inference_instance_type = (
    "ml.g5.2xlarge"
    # "ml.g5.12xlarge"
)

### A. Create an endpoint

In [29]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

endpoint_name = sagemaker.utils.name_from_base(model_name)
print(f"endpoint name: '{endpoint_name}'")

endpoint name: 'solar-1-mini-chat-2024-02-06-06-07-59-503'


In [30]:
# Deploy the model
model.deploy(1, real_time_inference_instance_type, endpoint_name=endpoint_name)

------!
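
By default `deploy` blocks until the endpoint is ready, and the line above is its progress output. If you want to confirm the state afterwards, the endpoint status can be read with the `DescribeEndpoint` API. The following is a minimal sketch; the helper name `endpoint_status` and the stub client are hypothetical and exist only so the example runs without AWS credentials. In this notebook the real client would be `sagemaker_session.sagemaker_client`.

```python
def endpoint_status(sagemaker_client, endpoint_name):
    """Return the EndpointStatus string reported by DescribeEndpoint."""
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


class StubSageMakerClient:
    """Stand-in for the boto3 SageMaker client, for illustration only."""

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": "InService"}


# With a real session you would pass sagemaker_session.sagemaker_client instead.
print(endpoint_status(StubSageMakerClient(), "solar-1-mini-chat-example"))
```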

In [34]:
# our requests and responses will be in json format,
# so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

Once the endpoint has been created, you can perform real-time inference.
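
Applications that do not use the SageMaker Python SDK can usually reach the same endpoint through the `InvokeEndpoint` API of the `sagemaker-runtime` Boto3 client. The sketch below assumes the endpoint accepts and returns `application/json`, as configured for the predictor above. The helper `invoke_chat` and the stub client are hypothetical names; the stub only lets the example run offline.

```python
import io
import json


def invoke_chat(runtime_client, endpoint_name, payload):
    """Send a JSON chat payload with InvokeEndpoint and decode the JSON reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes containing the JSON document.
    return json.loads(response["Body"].read())


class StubRuntimeClient:
    """Stand-in for boto3.client("sagemaker-runtime"), for illustration only."""

    def invoke_endpoint(self, **kwargs):
        reply = {"choices": [{"message": {"role": "assistant", "content": "hi"}}]}
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


# With real credentials: runtime_client = boto3.client("sagemaker-runtime")
result = invoke_chat(
    StubRuntimeClient(),
    "solar-1-mini-chat-example",
    {"messages": [{"role": "user", "content": "Hello"}]},
)
print(result["choices"][0]["message"]["content"])
```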

### B. Prepare input payload

We support request and response payloads compatible with OpenAI's Chat Completions endpoint.

Supported parameters:
- messages(list of objects)\*: List of messages, each containing a `role` and `content`. `role` must be one of [`system`, `user`, `assistant`].
- model(string): You can pass the `model` parameter for compatibility, but since we have only one model, this parameter is not required.
- frequency_penalty(number): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
- presence_penalty(number): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
- max_tokens(integer): The maximum number of tokens that can be generated in the chat completion. Solar supports a maximum context of 4k (4096) tokens, shared between input and generated tokens.
- temperature(number): The sampling temperature to use, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.
- top_p(number): An alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens within the top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.

\* required
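
The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch of such a check; `build_payload` is a hypothetical helper, and whether the endpoint itself rejects out-of-range values is not stated here, so treat the validation as a convenience rather than a mirror of server behavior.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}
CONTEXT_WINDOW = 4096  # maximum context stated in the parameter list

# Inclusive ranges taken from the parameter descriptions above.
NUMERIC_RANGES = {
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "temperature": (0.0, 2.0),
}


def build_payload(messages, **params):
    """Validate messages and optional parameters, then return the request dict."""
    if not messages:
        raise ValueError("messages is required and must not be empty")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"invalid role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs a string 'content'")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in params and not low <= params[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if "max_tokens" in params and not 1 <= params["max_tokens"] <= CONTEXT_WINDOW:
        raise ValueError(f"max_tokens must be between 1 and {CONTEXT_WINDOW}")
    return {"messages": list(messages), **params}


example_payload = build_payload(
    [{"role": "user", "content": "Hello"}], temperature=0.7, max_tokens=256
)
print(example_payload)
```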

In [52]:
# Single-turn chat example
input = {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you provide a Python script to merge two sorted lists?"
      }
    ],
    "temperature": 0.7,
}

In [68]:
# To limit the length of the output, use the max_tokens parameter.
input = {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you provide a Python script to merge two sorted lists?"
      }
    ],
    "max_tokens": 1024,
}
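
Because the 4096-token context is shared between the prompt and the generated text, the largest usable `max_tokens` shrinks as the prompt grows. The sketch below shows that arithmetic. `max_completion_tokens` is a hypothetical helper, and the prompt token count is assumed to come from elsewhere, for example the `usage.prompt_tokens` field of an earlier response.

```python
CONTEXT_WINDOW = 4096  # input plus generated tokens, per the parameter list


def max_completion_tokens(prompt_tokens, requested=None):
    """Return the largest max_tokens that still fits in the context window."""
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return remaining if requested is None else min(requested, remaining)


print(max_completion_tokens(34, requested=1024))    # 1024: the request fits
print(max_completion_tokens(4000, requested=1024))  # 96: capped by the window
```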

In [32]:
# Multi-turn chat example
input = {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you provide a Python script to merge two sorted lists?"
      },
      {
        "role": "assistant",
        "content": """Sure, here is a Python script to merge two sorted lists:

                    ```python
                    def merge_lists(list1, list2):
                        return sorted(list1 + list2)
                    ```
                    """
      },
      {
        "role": "user",
        "content": "Can you provide an example of how to use this function?"
      }
    ]
}
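
In a multi-turn conversation the message list grows by two entries per turn: the assistant's reply, then the next user message. The following sketch builds that list from a previous response instead of writing it by hand. `next_turn` is a hypothetical helper, and the response shape is the one shown in this notebook's response example.

```python
def next_turn(messages, response, user_message):
    """Return a new message list extended with the reply and a follow-up question."""
    reply = response["choices"][0]["message"]
    # Build a new list so the caller's history is left unchanged.
    return messages + [
        {"role": reply["role"], "content": reply["content"]},
        {"role": "user", "content": user_message},
    ]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you provide a Python script to merge two sorted lists?"},
]
# Illustrative response; a real one would come from predictor.predict(...).
previous_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Sure, here is a script."}}
    ]
}

follow_up = {
    "messages": next_turn(
        history,
        previous_response,
        "Can you provide an example of how to use this function?",
    )
}
print([message["role"] for message in follow_up["messages"]])
```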

### C. Perform real-time inference

In [62]:
# real-time inference
response = predictor.predict(
    input
)

Response example:

```
{
    "id": "cmpl-cc3f71b58086441fa6a53480b94fd0ec",
    "object": "chat.completion",
    "created": 3421,
    "model": "/opt/ml/model",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "What do you call a fake noodle?\n\nAn impasta!"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 34,
        "total_tokens": 334,
        "completion_tokens": 300
    }
}
```
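
The fields most callers need from this structure are the reply text, the finish reason, and the token usage. The sketch below extracts them from a dictionary shaped like the example above. `summarize_response` is a hypothetical helper. Treating a `finish_reason` of `"length"` as a truncated reply follows OpenAI's convention and is an assumption about this endpoint.

```python
def summarize_response(response):
    """Pull the reply text, finish reason, and token usage out of a response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    finish_reason = choice.get("finish_reason")
    return {
        "content": choice["message"]["content"],
        "finish_reason": finish_reason,
        # Assumption: "length" means generation stopped at the token limit.
        "truncated": finish_reason == "length",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Values copied from the response example above.
sample_response = {
    "id": "cmpl-cc3f71b58086441fa6a53480b94fd0ec",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "What do you call a fake noodle?\n\nAn impasta!",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 34, "total_tokens": 334, "completion_tokens": 300},
}

summary = summarize_response(sample_response)
print(summary["finish_reason"], summary["total_tokens"])
```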


In [63]:
# Since we use `JSONDeserializer`, the output of `predict` is a dictionary object
print(json.dumps(response, indent=4))

{
    "id": "cmpl-514c9a7f9ef449a0b4ba1fd950715622",
    "object": "chat.completion",
    "created": 4207,
    "model": "/opt/ml/model",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Sure, here's a simple Python script to merge two sorted lists:\n\n```python\ndef merge_sorted_lists(list1, list2):\n    result = []\n    while list1 and list2:\n        if list1[0] < list2[0]:\n            result.append(list1.pop(0))\n        else:\n            result.append(list2.pop(0))\n    result += list1\n    result += list2\n    return result\n\n# Test the function\nlist1 = [1, 3, 5]\nlist2 = [2, 4, 6]\nprint(merge_sorted_lists(list1, list2))  # Output: [1, 2, 3, 4, 5, 6]\n```\n\nThis script defines a function called `merge_sorted_lists` that takes two lists as input and returns a new list containing the elements of both input lists in sorted order. The function uses a while loop to compare the first elemen

In [64]:
print("assistant: ")
print(response["choices"][0]["message"]["content"])

assistant: 
Sure, here's a simple Python script to merge two sorted lists:

```python
def merge_sorted_lists(list1, list2):
    result = []
    while list1 and list2:
        if list1[0] < list2[0]:
            result.append(list1.pop(0))
        else:
            result.append(list2.pop(0))
    result += list1
    result += list2
    return result

# Test the function
list1 = [1, 3, 5]
list2 = [2, 4, 6]
print(merge_sorted_lists(list1, list2))  # Output: [1, 2, 3, 4, 5, 6]
```

This script defines a function called `merge_sorted_lists` that takes two lists as input and returns a new list containing the elements of both input lists in sorted order. The function uses a while loop to compare the first elements of the input lists and append the smaller one to the result list, then pops that element from the corresponding input list. The loop continues until one of the input lists is empty, at which point the function appends the remaining elements from the non-empty input list to the resul

## 3. Clean-up

### A. Delete the endpoint

Now that you have successfully performed real-time inference, you can delete the endpoint to avoid incurring further charges.

In [66]:
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Delete the model

In [67]:
model.delete_model()
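
If one of the deletions fails, the remaining resources can be left running and continue to incur charges. The following sketch wraps the three calls used above so that each is attempted even when an earlier one raises. `clean_up` is a hypothetical helper, catching every exception is a simplification made for this example, and the stub classes exist only so the sketch runs without AWS access.

```python
def clean_up(model, endpoint_name):
    """Attempt every deletion and return a list of (label, error) failures."""
    session = model.sagemaker_session
    steps = [
        ("endpoint", lambda: session.delete_endpoint(endpoint_name)),
        ("endpoint config", lambda: session.delete_endpoint_config(endpoint_name)),
        ("model", model.delete_model),
    ]
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as error:  # simplification: report and keep going
            failures.append((label, error))
    return failures


class StubSession:
    """Stand-in for sagemaker.Session, for illustration only."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, name):
        self.calls.append(("delete_endpoint", name))

    def delete_endpoint_config(self, name):
        self.calls.append(("delete_endpoint_config", name))


class StubModel:
    """Stand-in for the ModelPackage object, for illustration only."""

    def __init__(self):
        self.sagemaker_session = StubSession()

    def delete_model(self):
        self.sagemaker_session.calls.append(("delete_model", None))


stub_model = StubModel()
print(clean_up(stub_model, "solar-1-mini-chat-example"))
```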