# Deploy SageMaker Real-Time Endpoint

This notebook demonstrates how to create an Amazon SageMaker Real-Time Endpoint that serves Falcon-40B Instruct (`tiiuae/falcon-40b-instruct`).

In this notebook, we will create a SageMaker Real-Time Endpoint with the Hugging Face LLM inference container, which downloads the model from the Hugging Face Hub when it starts.

**SageMaker Studio Kernel**: Data Science 3.0

In this exercise you will:
 - Get the Falcon-40B Instruct model from the Hugging Face Hub
 - Deploy an Amazon SageMaker Real-Time Endpoint with the Hugging Face LLM container
 - Test the endpoint by performing a prediction

***

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [None]:
import boto3
from botocore.exceptions import ClientError
import json
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFacePredictor
from sagemaker.model import Model
from sagemaker.predictor import Predictor
import sagemaker.session
import traceback

In [None]:
s3_client = boto3.client("s3")
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime = boto3.client('sagemaker-runtime')

Create a SageMaker session, then store the default S3 bucket, the region, and the execution role in Python variables.

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
bucket_name = sagemaker_session.default_bucket()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

***

# Step 2 - Retrieve Model info

Let's retrieve the URI of the Hugging Face LLM inference container image that the endpoint will run.

Retrieve image_uri

In [None]:
deploy_image_uri = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

print(f'Deploy image URI => {deploy_image_uri}')

***

# Step 3 - Deploy an Amazon SageMaker Real-Time Endpoint

Here we are creating a real-time endpoint

By using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), we are going to create a `Model` backed by the Hugging Face LLM container image retrieved in Step 2, with a [HuggingFace Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-predictor) as its predictor class.

The model weights are not packaged or uploaded to S3 in this notebook: the container downloads the model named in `HF_MODEL_ID` from the Hugging Face Hub when it starts, and the remaining environment variables set the number of GPUs and the token limits.

## Global Parameters

In [None]:
MODEL_ID = 'tiiuae/falcon-40b-instruct'

inference_instance_count = 1
inference_instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

env = {
    'HF_MODEL_ID': MODEL_ID,
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(1536),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
}
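The container reads every environment value as a string, which is why the cell above serializes the numbers with `json.dumps`. The following is a minimal sketch, assuming a hypothetical helper named `build_tgi_env` (it is not part of the SageMaker SDK), that builds the same dictionary and checks that the input limit stays below the total token limit, so that at least one token can be generated.

```python
import json


def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens):
    """Hypothetical helper: build the container environment as a dict of strings."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # The prompt budget has to leave room for at least one generated token.
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }


# Same values as the cell above.
env_sketch = build_tgi_env("tiiuae/falcon-40b-instruct", 4, 1536, 2048)
print(env_sketch)
```

Failing early in the notebook is cheaper than waiting for the container health check to time out on a misconfigured endpoint.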

### Create SageMaker model

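The `Model` object below ties together the container image, the execution role, and the environment variables. Nothing is created in SageMaker until `deploy` or `create` is called on it.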

In [None]:
model_name = "falcon-40b-instruct"

In [None]:
model = Model(
    name=model_name,
    image_uri=deploy_image_uri,
    role=role,
    predictor_cls=HuggingFacePredictor,
    env=env)

### Deploy a SageMaker Endpoint

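Let's create or update the Amazon SageMaker Endpoint. The cell below first tries to deploy a new endpoint. If that call fails with a `ClientError` (for example, because an endpoint with this name already exists), it registers a new model under a timestamped name and updates the existing endpoint to use it.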

In [None]:
endpoint_name = "falcon-40b-instruct-endpoint"

In [None]:
import time

try:
    # Create the model, the endpoint configuration, and the endpoint in one call
    predictor = model.deploy(
        endpoint_name=endpoint_name,
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type,
        container_startup_health_check_timeout=health_check_timeout
    )
except ClientError:
    # Deployment failed: print the error, then update the existing endpoint instead
    print(traceback.format_exc())

    model = Model(
        name=model_name + "-" + str(round(time.time())),
        image_uri=deploy_image_uri,
        role=role,
        predictor_cls=HuggingFacePredictor,
        env=env
    )
    
    model.create(
        instance_type=inference_instance_type
    )
    
    predictor = Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session
    )

    predictor.update_endpoint(
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type,
        model_name=model.name
    )
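Loading a 40B-parameter model can take several minutes, so it can be useful to check the endpoint status from another notebook or script before sending requests. The following is a minimal sketch of a polling loop built on the `describe_endpoint` call of the boto3 SageMaker client. The function name `wait_for_endpoint`, the polling interval, and the timeout are assumptions chosen for illustration.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    `client` is expected to behave like boto3.client("sagemaker").
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    waited = 0
    while True:
        response = client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status in ("Failed", "OutOfService"):
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook (not executed here):
# wait_for_endpoint(sagemaker_client, endpoint_name)
```

boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which covers the same case with less control over the failure handling.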

***

# Step 5 - Test the Endpoint Locally

Here we'll test the Amazon SageMaker Endpoint by performing some predictions. Our endpoint expects a JSON payload with at least an `inputs` key.

In [None]:
import json

In [None]:
endpoint_name = "falcon-40b-instruct-endpoint"

## Text Generation

In [None]:
payload = """
Hello, how are you?
"""

parameters = {
    "inputs": payload,
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.2,
        "top_p": 0.9,
    }
}

print(json.dumps(parameters).encode("utf-8"))

results = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(parameters).encode("utf-8"))

response = json.loads(results["Body"].read())

response
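The request and response handling above can be wrapped in two small functions. This is a sketch under one stated assumption: that the container answers with a JSON list of objects carrying a `generated_text` key, which is the shape the Hugging Face LLM container commonly returns. Both function names are hypothetical.

```python
import json


def build_generation_request(prompt, max_new_tokens=512, temperature=0.2, top_p=0.9):
    """Serialize a prompt and its generation parameters into a request body."""
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    return json.dumps(body).encode("utf-8")


def extract_generated_text(raw_body):
    """Return the generated strings from a response body.

    Assumes a JSON list of {"generated_text": ...} objects; a single
    object is accepted as well and wrapped into a list.
    """
    decoded = json.loads(raw_body)
    if isinstance(decoded, dict):
        decoded = [decoded]
    return [item["generated_text"] for item in decoded]


# Usage in this notebook (not executed here):
# results = sagemaker_runtime.invoke_endpoint(
#     EndpointName=endpoint_name,
#     ContentType="application/json",
#     Body=build_generation_request("Hello, how are you?"))
# print(extract_generated_text(results["Body"].read()))
```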

***

# Step 6 - Delete the Endpoint and its Configuration

In [None]:
endpoint_name = "falcon-40b-instruct-endpoint"

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

predictor.delete_endpoint(delete_endpoint_config=True)
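The cell above removes the endpoint and its configuration, but the SageMaker model objects stay registered, including any timestamped ones created by the update path in Step 3. The following is a minimal sketch, assuming a hypothetical function `delete_endpoint_resources`, that uses the boto3 SageMaker client to look up the models behind an endpoint and delete them together with the endpoint and its configuration.

```python
def delete_endpoint_resources(client, endpoint_name):
    """Delete an endpoint, its configuration, and the models it references.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    # Look up the configuration and models before anything is deleted.
    endpoint = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        client.delete_model(ModelName=name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }


# Usage in this notebook, as an alternative to the cell above (not executed here):
# delete_endpoint_resources(sagemaker_client, endpoint_name)
```

Models from earlier deployments that are no longer referenced by the endpoint configuration are not found by this lookup and would need to be deleted by name.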