# Deploy Friendli Container: Solar Pro Preview Instruct Model Package from AWS Marketplace 

**Friendli Container: Solar Pro Preview Instruct** is a SageMaker model package of the [Friendli Container](https://friendli.ai/products/container) with Solar Pro Preview Instruct, an instruction-tuned 22 billion parameter language model from Upstage.

This sample notebook demonstrates how to deploy the [Friendli Container: Solar Pro Preview Instruct](https://aws.amazon.com/marketplace/pp/prodview-fmmknqim7hr6s) on Amazon SageMaker.

## Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or from Amazon SageMaker Studio.
2. Ensure that the IAM role used for the instance has the **AmazonSageMakerFullAccess** permission.
3. To deploy this ML model successfully, ensure that:
    1. Your IAM role has these three permissions, and you have the authority to make AWS Marketplace subscriptions through your AWS account:
        1. **aws-marketplace:ViewSubscriptions**
        2. **aws-marketplace:Unsubscribe**
        3. **aws-marketplace:Subscribe**

## Contents:
1. [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
   1. [Create an endpoint](#A.-Create-an-endpoint)
   2. [Create the input payload](#B.-Create-the-input-payload)
   3. [Perform real-time inference](#C.-Perform-real-time-inference)
   4. [Perform real-time streaming inference](#D.-Perform-real-time-streaming-inference)
   5. [Delete the endpoint](#E.-Delete-the-endpoint)
3. [Clean-up](#3.-Clean-up)
    1. [Delete the model](#A.-Delete-the-model)

## Usage instructions
You can run this notebook one cell at a time by pressing `Shift+Enter`.

## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [Friendli Container: Solar Pro Preview Instruct](https://aws.amazon.com/marketplace/pp/prodview-fmmknqim7hr6s)
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose your **region**, a **Product Arn** is displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and assign it to `model_package_arn` in the cell below.

In [None]:
%pip install boto3 sagemaker sseclient-py

In [None]:
# Replace this value with the Product ARN shown for your region on the configuration page.
model_package_arn = "arn:aws:sagemaker:us-east-1:172947302787:model-package/friendli-container-v1-6-11-llama-3-1-8b-instruct-int8"
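Because the Product ARN is region-specific, it can help to check the pasted value before deploying. The helper below is a minimal sketch, not part of the Friendli or SageMaker tooling: `parse_model_package_arn` is a hypothetical name, and it only assumes the standard `arn:partition:service:region:account:resource` layout visible in the ARN above.

```python
def parse_model_package_arn(arn):
    """Split a SageMaker model package ARN into its named parts.

    Hypothetical helper for illustration; raises ValueError on unexpected input.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "model-package" or not name:
        raise ValueError(f"Not a model package ARN: {arn!r}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account": parts[4],
        "name": name,
    }


example_arn = (
    "arn:aws:sagemaker:us-east-1:172947302787:model-package/"
    "friendli-container-v1-6-11-llama-3-1-8b-instruct-int8"
)
info = parse_model_package_arn(example_arn)
print(info["region"], info["name"])
```

Comparing `info["region"]` with the region of your SageMaker session is one way to catch an ARN copied for the wrong region before the deployment call fails.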

In [None]:
import json
import sseclient

import boto3
import sagemaker
from sagemaker import ModelPackage, get_execution_role

In [None]:
# get_execution_role() may raise an error if you run this notebook outside SageMaker (e.g., in a local environment).
# If an error occurs, set `role` to "<ARN OF ROLE WITH AmazonSageMakerFullAccess>" instead.
role = get_execution_role()

sagemaker_session = sagemaker.Session()

sagemaker_runtime = boto3.client("sagemaker-runtime")

## 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [this documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
model_name = "friendli-container-solar-pro-preview-instruct"

content_type = "application/json"

real_time_inference_instance_type = "ml.g5.xlarge"

### A. Create an endpoint

In [None]:
# Create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

endpoint_name = model_name

# Deploy the model to a single instance behind a real-time endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=endpoint_name,
)

Once the endpoint is created, you can perform real-time inference.
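`model.deploy` waits for the endpoint by default, but if you reconnect from another session you may want to confirm that the endpoint is ready before sending requests. The following is a minimal sketch: `wait_for_endpoint` is a hypothetical helper, and `sm_client` is assumed to be a `boto3.client("sagemaker")` object (the example uses a stub so that it runs without AWS access). It relies on the `EndpointStatus` field returned by `describe_endpoint`.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if creation fails and TimeoutError if the
    endpoint is not ready after max_polls checks.
    """
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


class StubSageMakerClient:
    """Stand-in for boto3.client("sagemaker") that replays a list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


stub = StubSageMakerClient(["Creating", "Creating", "InService"])
print(wait_for_endpoint(stub, "friendli-container-solar-pro-preview-instruct", poll_seconds=0))
```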

### B. Create the input payload

Request/response payloads are compatible with the OpenAI chat completion endpoint.

The input payload is composed of:
- **messages**(**required**, list of objects): A list of messages comprising the conversation so far.
  - **role**(**required**, [`system`, `user`, `assistant`, `tool`]): The role of the message's author.
  - **content**(**required**, string): The content of the message.
- frequency_penalty(float): Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.
- presence_penalty(float): Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.
- repetition_penalty(float): Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0; 1.0 means no penalty. See Keskar et al., 2019 for more details. This is similar to Hugging Face's [repetition_penalty](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.repetition_penalty) argument.
- max_tokens(integer): The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus `max_tokens` should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, `max_tokens` should not exceed the model's maximum output length. This is similar to Hugging Face's [max_new_tokens](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.max_new_tokens) argument.
- min_tokens(integer): The minimum number of tokens to generate. The default value is 0. This is similar to Hugging Face's [min_new_tokens](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.min_new_tokens) argument.
- n(integer): The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's [num_return_sequences](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.num_return_sequences) argument.
- stop(list of strings): When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Defaults to empty list.
- stream(boolean): Whether to stream the generation result. When set to true, each token is sent as a [server-sent event](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format) once generated.
- temperature(float): Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., `top_k = 1`) sampling. Defaults to 1.0. This is similar to Hugging Face's [temperature](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.temperature) argument.
- top_p(float): Tokens comprising the top `top_p` probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's [top_p](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.top_p) argument.
- top_k(integer): The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's [top_k](https://huggingface.co/docs/transformers/v4.26.0/en/main_classes/text_generation#transformers.GenerationConfig.top_k) argument.
- timeout_microseconds(integer): Request timeout in microseconds. When the timeout is exceeded, the API responds with the `HTTP 429 Too Many Requests` status code. By default, there is no timeout.
- seed(list of integers): Seed to control the random procedure. If no seed is given, a random seed is used for sampling, and the seed is returned along with the generated result. When using the `n` argument, you can pass a list of seed values to control all of the independent generations.
- eos_token(list of integers): A list of end-of-sentence (EOS) tokens.
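The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Friendli Container API: `validate_payload` is a hypothetical helper that covers only the required `messages` fields and the numeric ranges stated in this list.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_payload(payload):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []

    messages = payload.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is required and must be a list")
        messages = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}].role must be one of {sorted(ALLOWED_ROLES)}")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}].content must be a string")

    frequency_penalty = payload.get("frequency_penalty")
    if frequency_penalty is not None and not -2.0 <= frequency_penalty <= 2.0:
        problems.append("frequency_penalty must be between -2.0 and 2.0")

    top_p = payload.get("top_p")
    if top_p is not None and not 0.0 < top_p <= 1.0:
        problems.append("top_p must be in the range (0.0, 1.0]")

    top_k = payload.get("top_k")
    if top_k is not None and top_k < 0:
        problems.append("top_k must be 0 or greater")

    return problems


print(validate_payload({"messages": [{"role": "bot", "content": "Hello!"}], "top_p": 0.0}))
```

Returning a list of problems instead of raising on the first one lets you report every issue in a payload at once.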

In [None]:
# Input payload example
input_payload = {
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 200,
}

### C. Perform real-time inference

In [None]:
def invoke_endpoint(endpoint_name, payload):
    # Send the JSON payload to the endpoint and return the response body as text.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType=content_type,
    )
    return response["Body"].read().decode("utf-8")

response = json.loads(invoke_endpoint(endpoint_name, input_payload))
print(response)
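Since the response follows the OpenAI chat completion format, the generated text sits under `choices[0].message.content`. The snippet below is a minimal sketch: `extract_reply` is a hypothetical helper, and `sample_response` is an illustrative dictionary with made-up values, not actual output from the endpoint.

```python
def extract_reply(response):
    """Return the assistant text of the first choice, or "" if there is none."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Illustrative response in the OpenAI chat completion shape (values are made up).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 18, "completion_tokens": 9, "total_tokens": 27},
}

print(extract_reply(sample_response))
```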

### D. Perform real-time streaming inference

In [None]:
def invoke_streaming_endpoint(endpoint_name, payload):
    response = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    event_stream = response['Body']
    for event in event_stream:
        yield event["PayloadPart"]["Bytes"]

input_payload['stream'] = True
response = invoke_streaming_endpoint(endpoint_name, input_payload)
client = sseclient.SSEClient(response)

for event in client.events():
    if event.data == "[DONE]":
        break
    data = json.loads(event.data)
    if data.get("choices"):
        print(data["choices"][0]["delta"].get("content", ""), end="")
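To see what `sseclient` does with the byte chunks yielded above, the sketch below reassembles server-sent events by hand and joins the streamed `delta` fragments into a single string. It is a simplification under stated assumptions: events are separated by a blank line (`\n\n`) and each event carries a single `data:` line. `iter_sse_data` and `collect_stream_text` are hypothetical helpers, and the sample chunks are made up.

```python
import json


def iter_sse_data(chunks):
    """Yield the data field of each complete event found in an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # An event is complete once its terminating blank line has arrived.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    yield line[len("data:"):].strip()


def collect_stream_text(chunks):
    """Concatenate the delta content of every streamed choice until [DONE]."""
    parts = []
    for data in iter_sse_data(chunks):
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices")
        if choices:
            parts.append(choices[0]["delta"].get("content") or "")
    return "".join(parts)


# Made-up chunks; the second event is deliberately split across two chunks.
sample_chunks = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\nda',
    b'ta: {"choices": [{"delta": {"content": "lo"}}]}\n\n',
    b"data: [DONE]\n\n",
]
print(collect_stream_text(sample_chunks))
```

Buffering matters because a chunk boundary can fall in the middle of an event, as in the sample above. In the notebook, `collect_stream_text(invoke_streaming_endpoint(endpoint_name, input_payload))` would return the full reply instead of printing it token by token.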

### E. Delete the endpoint

Once you're finished with the real-time inference, you can terminate the endpoint to avoid extra charges.

In [None]:
# The endpoint and its configuration were both created under `endpoint_name`.
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

## 3. Clean-up

### A. Delete the model

In [None]:
model.delete_model()