# Deploy SageMaker Real-Time Endpoint

This notebook demonstrates how to create an Amazon SageMaker Real-Time Endpoint that serves the GPT-J 6B FP-16 text-embedding model.


**SageMaker Studio Kernel**: Data Science 3.0

In this exercise you will:
 - Get the GPT-J 6B FP-16 model from the SageMaker JumpStart model hub
 - Deploy an Amazon SageMaker Real-Time Endpoint
 - Test the endpoint by performing a prediction

***

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [None]:
import boto3
from botocore.exceptions import ClientError
from sagemaker import image_uris, model_uris
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.model import Model
from sagemaker.predictor import Predictor
import sagemaker.session
import traceback

In [None]:
s3_client = boto3.client("s3")
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime = boto3.client('sagemaker-runtime')

Create a SageMaker session and store the default bucket, the region, and the execution role in Python variables.

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
bucket_name = sagemaker_session.default_bucket()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

***

# Step 2 - Retrieve Model info

Let's list the text-embedding models available in SageMaker JumpStart and select the one to deploy.

In [None]:
FILTER = 'task == textembedding'
embeddings_models = list_jumpstart_models(filter=FILTER)
embeddings_models

In [None]:
IMAGE_SCOPE = 'inference'
MODEL_ID = 'huggingface-textembedding-gpt-j-6b-fp16'
MODEL_VERSION = '*'

inference_instance_type = "ml.g5.2xlarge"

Retrieve the container image URI (`deploy_image_uri`) and the model artifact URI (`model_uri`).

In [None]:
deploy_image_uri = image_uris.retrieve(region=region,
                                       framework=None,
                                       image_scope=IMAGE_SCOPE,
                                       model_id=MODEL_ID,
                                       model_version=MODEL_VERSION,
                                       instance_type=inference_instance_type)

print(f'Deploy image URI => {deploy_image_uri}')

In [None]:
model_uri = model_uris.retrieve(region=region,
                                model_id=MODEL_ID,
                                model_version=MODEL_VERSION,
                                model_scope=IMAGE_SCOPE)

print(f'Model URI => {model_uri}')
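The model URI printed above is expected to be an S3 URI. If the bucket and key are needed separately (for example to inspect the artifact with `s3_client`), a minimal sketch is shown below. `split_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and the example URI is made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps a leading slash that is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration only; in the notebook you would pass `model_uri`.
example_bucket, example_key = split_s3_uri("s3://example-bucket/path/to/model.tar.gz")
print(example_bucket, example_key)
```

In the notebook, `split_s3_uri(model_uri)` would give the two values that S3 client calls take as `Bucket` and `Key`.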

***

# Step 3 - Deploy an Amazon SageMaker Real-Time Endpoint

Here we create a real-time endpoint.

Using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), we are going to use a [HuggingFace Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-predictor) together with a built-in SageMaker container for HuggingFace, which lets us provide the inference scripts and a `requirements.txt` file for installing additional dependencies.

To make sure that Amazon SageMaker installs our additional Python modules from `requirements.txt`, we compress the content of the [inference](./code) folder and upload it to the default S3 bucket.

## Global Parameters

In [None]:
inference_instance_count = 1
inference_instance_type = "ml.g5.2xlarge"

env = {
    'SAGEMAKER_MODEL_SERVER_TIMEOUT': str(3600),
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/code/',
    'SAGEMAKER_PROGRAM': 'inference.py',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1',
    'TS_DEFAULT_WORKERS_PER_MODEL': '1',
}
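SageMaker passes container environment variables as string-to-string pairs, which is why the timeout above is wrapped in `str(...)`. The sketch below shows one way to guard against accidentally passing integers. `stringify_env` is a hypothetical helper introduced here for illustration; it is not part of the SageMaker SDK.

```python
def stringify_env(env):
    """Return a copy of `env` with every key and value converted to str."""
    cleaned = {}
    for key, value in env.items():
        if value is None:
            # A missing value is more likely a mistake than an intended "None" string.
            raise ValueError(f"Environment variable {key!r} has no value")
        cleaned[str(key)] = str(value)
    return cleaned


# Subset of the variables used in this notebook, with numbers left as integers.
example_env = stringify_env({
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": 3600,
    "SAGEMAKER_MODEL_SERVER_WORKERS": 1,
    "SAGEMAKER_PROGRAM": "inference.py",
})
print(example_env)
```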

### Create SageMaker model

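Define the SageMaker model from the container image, the model artifacts, the execution role, and the environment variables. The call to `prepare_container_def()` displays the resulting container definition.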

In [None]:
model_name = "gpt-j-qa"

In [None]:
model = Model(
    name=model_name,
    image_uri=deploy_image_uri,
    model_data=model_uri,
    role=role,
    predictor_cls=Predictor,
    env=env)

model.prepare_container_def()

### Deploy a SageMaker Endpoint

Let's deploy the endpoint. The code creates the endpoint or, if the deployment raises a `ClientError` (for example because the endpoint already exists), registers a new model with a timestamped name and updates the existing endpoint to use it.

In [None]:
endpoint_name = "gpt-j-qa-endpoint"

In [None]:
import time

try:
    model.deploy(
        endpoint_name=endpoint_name,
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type,
        model_data_download_timeout=3600,
        container_startup_health_check_timeout=3600
    )
except ClientError:
    # Deployment failed, most likely because the endpoint already exists:
    # log the error, then fall back to updating the existing endpoint.
    print(traceback.format_exc())

    model = Model(
        name=model_name + "-" + str(round(time.time())),
        image_uri=deploy_image_uri,
        model_data=model_uri,
        role=role,
        predictor_cls=Predictor,
        env=env)
    
    model.create(
        instance_type=inference_instance_type
    )
    
    predictor = Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session
    )

    predictor.update_endpoint(
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type,
        model_name=model.name
    )

***

# Step 5 - Test the Endpoint Locally

Here we'll test the Amazon SageMaker Endpoint by performing some predictions. Our endpoint expects a JSON payload with at least a `text_inputs` key, which holds the list of strings to embed.
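The request cell in this section sends a JSON body with a `text_inputs` list and reads the `embedding` key from the response. A minimal sketch of that encoding and decoding as reusable helpers follows. `build_payload` and `parse_embeddings` are hypothetical names, and the fake response body only mimics the shape read later in this notebook.

```python
import json


def build_payload(texts):
    """Encode one string or a list of strings as the JSON request body."""
    if isinstance(texts, str):
        texts = [texts]
    return json.dumps({"text_inputs": list(texts)}).encode("utf-8")


def parse_embeddings(body):
    """Decode a JSON response body and return the list stored under 'embedding'."""
    return json.loads(body)["embedding"]


request_body = build_payload("This is an example of text embedding")
print(request_body)

# Fake response for illustration; a real one comes from results['Body'].read().
fake_response = json.dumps({"embedding": [[0.1, 0.2, 0.3]]}).encode("utf-8")
print(parse_embeddings(fake_response)[0])
```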

In [None]:
endpoint_name = "gpt-j-qa-endpoint"

In [None]:
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

## Text Embeddings

In [None]:
import json

payload = {
    "text_inputs": ["This is an example of text embedding"]
}

payload = json.dumps(payload).encode('utf-8')

results = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=payload)

model_predictions = json.loads(results['Body'].read())
embedding = model_predictions['embedding'][0]

print(embedding)
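Embedding vectors such as the one printed above are commonly compared with cosine similarity. The sketch below uses only the standard library. `cosine_similarity` is a helper written for this example, and the vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: the first pair points in the same direction, the second is orthogonal.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With real data, you would call the endpoint once per text (or once with several `text_inputs`) and pass two of the returned vectors to the function.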

***

# Step 6 - Delete the Endpoint

In [None]:
endpoint_name = "gpt-j-qa-endpoint"

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

predictor.delete_endpoint(delete_endpoint_config=True)