# Deploy SageMaker Real-Time Endpoint

This notebook demonstrates how to deploy the Flan-T5 XXL model to an Amazon SageMaker Real-Time Endpoint.

In this notebook, we create a SageMaker Real-Time Endpoint by providing our own custom [inference script](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#write-an-inference-script).

**SageMaker Studio Kernel**: Data Science 3.0

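In this exercise you will:
 - Get the Flan-T5 XXL model from the Hugging Face Hub
 - Deploy an Amazon SageMaker Real-Time Endpoint using a custom inference script
 - Create an AWS Lambda function that uses Amazon Comprehend and Amazon Translate to handle requests in multiple languages
 - Test the endpoint by performing a prediction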

***

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [None]:
import boto3
from botocore.exceptions import ClientError
import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel, HuggingFacePredictor
import traceback

In [None]:
comprehend_client = boto3.client("comprehend")
lambda_client = boto3.client("lambda")
s3_client = boto3.client("s3")
sagemaker_client = boto3.client("sagemaker")
translate_client = boto3.client("translate")

Create a SageMaker session and store the default S3 bucket, the region, and the execution role in Python variables.

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
bucket_name = sagemaker_session.default_bucket()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

***

# Step 2 - Download Flan-T5 XXL Model

Let's download the model files from the Hugging Face Hub into a local `model` directory.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import shutil
from tempfile import TemporaryDirectory

In [None]:
HF_MODEL_ID = "philschmid/flan-t5-xxl-sharded-fp16"
model_dir_name = "model"

In [None]:
# create the model dir (no error if it already exists)
model_dir = Path(model_dir_name)
model_dir.mkdir(exist_ok=True)

with TemporaryDirectory() as tmpdir:
    # download the model snapshot into a temporary cache
    snapshot_dir = snapshot_download(repo_id=HF_MODEL_ID, cache_dir=tmpdir)
    # copy the snapshot into the model dir; symlinks in the cache are copied as regular files
    shutil.copytree(snapshot_dir, str(model_dir), dirs_exist_ok=True)

## Copy the code folder into the model directory

In [None]:
import shutil
from pathlib import Path

In [None]:
model_dir_name = "model"
model_dir = Path(model_dir_name)

In [None]:
# copy the inference script and requirements.txt into model/code
shutil.copytree("code/", str(model_dir.joinpath("code")), dirs_exist_ok=True)

## Create model.tar.gz

In [None]:
import os
from pathlib import Path
import tarfile

In [None]:
model_dir_name = "model"
model_dir = Path(model_dir_name)

In [None]:
# helper to create model.tar.gz with the contents of tar_dir at the archive root
def compress(tar_dir, output_file="model.tar.gz"):
    parent_dir = os.getcwd()
    os.chdir(tar_dir)
    try:
        with tarfile.open(os.path.join(parent_dir, output_file), "w:gz") as tar:
            for item in os.listdir("."):
                print(item)
                tar.add(item, arcname=item)
    finally:
        # return to the original working directory even if archiving fails
        os.chdir(parent_dir)

compress(str(model_dir))
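Before uploading, it can be useful to confirm that the archive has the layout the Hugging Face inference container expects: the model files and the `code/` folder at the root of `model.tar.gz`, not nested under a `model/` prefix. The helpers below are a hypothetical sanity check (`top_level_entries` and `check_model_archive` are our own names, not part of the SageMaker SDK); by default the only required root entry is `code`.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .tar.gz archive."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = member.name
            # tolerate archives built with arcname="." (entries like "./code/...")
            if name.startswith("./"):
                name = name[2:]
            if name and name != ".":
                names.add(name.split("/", 1)[0])
    return sorted(names)


def check_model_archive(archive_path, required=("code",)):
    """Hypothetical check: return (root entries, required entries missing from the root)."""
    entries = top_level_entries(archive_path)
    missing = [name for name in required if name not in entries]
    return entries, missing
```

In the notebook, `check_model_archive("model.tar.gz")` should report `code` among the root entries and an empty list of missing entries.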

## Upload to S3

In [None]:
model_name = "flan-t5-xxl"
model_path = "models"

In [None]:
model_url = sagemaker.Session().upload_data(
    "model.tar.gz", bucket=bucket_name, key_prefix="/".join([model_path, model_name])
)

model_url

***

# Step 3 - Deploy an Amazon SageMaker Real-Time Endpoint

Here we create the real-time endpoint.

Using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), we deploy the model on the built-in SageMaker container for Hugging Face and interact with it through a [HuggingFace Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-predictor). This container lets us provide our own inference script and a `requirements.txt` file for installing additional dependencies.

To make sure that Amazon SageMaker installs the additional Python modules listed in `requirements.txt`, the content of the [inference](./code) folder was packaged into `model.tar.gz` under `code/` and uploaded to the default S3 bucket in Step 2.

## Global Parameters

In [None]:
inference_framework_version = "1.10"
inference_python_version = "py38"
inference_transformers_version = "4.17"
inference_instance_count = 1
inference_instance_type = "ml.g5.xlarge"

### Create SageMaker model

Create a `HuggingFaceModel` object that points to the model archive uploaded to S3 and to the container versions defined above.

In [None]:
model_name = "flan-t5-xxl"

In [None]:
model = HuggingFaceModel(
    name=model_name,
    transformers_version=inference_transformers_version,
    pytorch_version=inference_framework_version,
    py_version=inference_python_version,
    model_data=model_url,
    role=role,
    sagemaker_session=sagemaker_session
)

### Deploy a SageMaker Endpoint

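Let's deploy the endpoint. The cell below first tries to create a new endpoint with `model.deploy()`. If that call fails with a `ClientError` (for example, because an endpoint with the same name already exists from a previous run), it registers the model under a new, timestamped name and updates the existing endpoint to use it.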

In [None]:
endpoint_name = "flan-t5-endpoint"

In [None]:
import time

try:
    model.deploy(
        endpoint_name=endpoint_name,
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type
    )
except ClientError as e:
    stacktrace = traceback.format_exc()
    print("{}".format(stacktrace))

    model = HuggingFaceModel(
        name=model_name + "-" + str(round(time.time())),
        transformers_version=inference_transformers_version,
        pytorch_version=inference_framework_version,
        py_version=inference_python_version,
        model_data=model_url,
        role=role,
        sagemaker_session=sagemaker_session
    )
    
    model.create(
        instance_type=inference_instance_type
    )
    
    predictor = HuggingFacePredictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session
    )

    predictor.update_endpoint(
        initial_instance_count=inference_instance_count,
        instance_type=inference_instance_type,
        model_name=model.name
    )
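The fallback branch above builds a new model name by appending a Unix timestamp. SageMaker model names may contain only letters, digits, and hyphens and are limited to 63 characters, so a long base name plus the suffix could be rejected. A minimal sketch of a hypothetical helper (`unique_model_name` is our own name) that sanitizes and truncates the base name before appending the suffix; the start/end-with-alphanumeric rule it checks is an assumption kept deliberately strict:

```python
import re
import time

# Assumed naming rules for SageMaker models: letters, digits, and hyphens,
# starting and ending with a letter or digit, at most 63 characters.
MAX_NAME_LENGTH = 63
_VALID_NAME = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")


def unique_model_name(base_name, timestamp=None):
    """Return base_name plus a "-<timestamp>" suffix that stays a valid model name."""
    if timestamp is None:
        timestamp = round(time.time())
    suffix = "-{}".format(timestamp)
    # replace unsupported characters with hyphens and drop leading/trailing ones
    cleaned = re.sub(r"[^a-zA-Z0-9-]", "-", base_name).strip("-")
    # truncate the base so that base + suffix fits in MAX_NAME_LENGTH
    cleaned = cleaned[: MAX_NAME_LENGTH - len(suffix)].rstrip("-")
    name = cleaned + suffix
    if len(name) > MAX_NAME_LENGTH or not _VALID_NAME.match(name):
        raise ValueError("cannot build a valid model name from {!r}".format(base_name))
    return name
```

It could replace the inline concatenation in the fallback branch, e.g. `name=unique_model_name(model_name)`.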

***

# Step 4 - Create Lambda Function

In this section, we create an AWS Lambda function that handles requests for the SageMaker Endpoint. The function uses [Amazon Comprehend](https://aws.amazon.com/comprehend/) to detect the input language and [Amazon Translate](https://aws.amazon.com/translate/) to translate the payload text into English.
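The contents of `./lambda/handler.py` are only printed below, not reproduced here. As a rough illustration of the flow just described, here is a hypothetical sketch (not the actual handler) that mirrors the local logic of Step 6: detect the language, translate to English, call the endpoint with an `inputs` key, and translate the answer back. The AWS calls are injected as plain functions so the logic can run without AWS, and the response shape (`statusCode`, a JSON `body` with `language` and `generated_text`) is an assumption.

```python
import json


def handle_request(event, detect_language, translate, predict):
    """Hypothetical sketch of a multi-language request flow.

    detect_language(text) -> language code (e.g. backed by Amazon Comprehend)
    translate(text, source, target) -> translated text (e.g. Amazon Translate)
    predict(request) -> [{"generated_text": ...}] (e.g. the SageMaker endpoint)
    """
    text = event["payload"]
    parameters = event.get("parameters", {})

    source_language = detect_language(text)
    if source_language != "en":
        # the model is prompted in English, so translate the input first
        text = translate(text, source_language, "en")

    # the endpoint expects a JSON document with at least an "inputs" key
    results = predict({"inputs": text, "parameters": parameters})
    generated = results[0]["generated_text"]

    if source_language != "en":
        # translate the answer back into the caller's language
        generated = translate(generated, "en", source_language)

    return {
        "statusCode": 200,
        "body": json.dumps({"language": source_language, "generated_text": generated}),
    }
```

In a real handler, the injected functions could wrap `detect_dominant_language` on a Comprehend client, `translate_text` on a Translate client, and `invoke_endpoint` on a `sagemaker-runtime` client.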

## Create a Lambda function

In [None]:
endpoint_name = "flan-t5-endpoint"
lambda_function_name = "Multi-Language-GenAI"

In [None]:
! pygmentize ./lambda/handler.py

Zip the Lambda code.

In [None]:
from zipfile import ZipFile

In [None]:
with ZipFile('./lambda.zip', 'w') as zip_object:
    # add the handler; it is stored in the archive as lambda/handler.py
    zip_object.write('./lambda/handler.py')

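Create the Lambda function. It reuses the SageMaker execution role, so that role must allow the `lambda.amazonaws.com` service principal to assume it and must grant access to Amazon Comprehend, Amazon Translate, and `sagemaker:InvokeEndpoint`.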

In [None]:
with open('lambda.zip', 'rb') as f:
    zipped_code = f.read()

In [None]:
response = lambda_client.create_function(
    FunctionName=lambda_function_name,
    Runtime='python3.9',
    Role=role,
    Handler='lambda.handler.lambda_handler',
    Code=dict(ZipFile=zipped_code),
    Timeout=900,  # maximum timeout allowed by Lambda (15 minutes)
    Environment={
        'Variables': {
            'SAGEMAKER_ENDPOINT': endpoint_name
        }
    }
)
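The `Handler` setting and the path inside the zip must agree. For the Python runtime, the part of the handler string before the last dot names the module, with dots acting as package separators, so `lambda.handler.lambda_handler` refers to the function `lambda_handler` in `lambda/handler.py`. The helpers below are a hypothetical local check of that mapping against the archive (`handler_to_path` and `zip_contains_handler` are our own names):

```python
import zipfile


def handler_to_path(handler):
    """Map a Python handler string to (module file path inside the zip, function name)."""
    module, _, function = handler.rpartition(".")
    if not module or not function:
        raise ValueError("handler must look like 'module.function', got {!r}".format(handler))
    return module.replace(".", "/") + ".py", function


def zip_contains_handler(zip_path, handler):
    """Return True if the module file referenced by handler is in the archive."""
    path, _ = handler_to_path(handler)
    with zipfile.ZipFile(zip_path) as archive:
        return path in archive.namelist()
```

For this notebook, `zip_contains_handler("lambda.zip", "lambda.handler.lambda_handler")` should return `True`.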

***

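## Update Lambda Function

If the Lambda function already exists (for example, from a previous run), run the following cells instead of `create_function` to update its code and configuration.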

In [None]:
endpoint_name = "flan-t5-endpoint"
lambda_function_name = "Multi-Language-GenAI"

In [None]:
! pygmentize ./lambda/handler.py

Zip the Lambda code.

In [None]:
from zipfile import ZipFile

In [None]:
with ZipFile('./lambda.zip', 'w') as zip_object:
    # add the handler; it is stored in the archive as lambda/handler.py
    zip_object.write('./lambda/handler.py')

In [None]:
with open('lambda.zip', 'rb') as f:
    zipped_code = f.read()

Update the Lambda function code and publish a new version.

In [None]:
response = lambda_client.update_function_code(
    FunctionName=lambda_function_name,
    ZipFile=zipped_code,
    Publish=True
)

response

In [None]:
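# wait until the code update has finished; changing the configuration while an
# update is still in progress fails with a ResourceConflictException
lambda_client.get_waiter("function_updated").wait(FunctionName=lambda_function_name)

response = lambda_client.update_function_configuration(
    FunctionName=lambda_function_name,
    Environment={
        'Variables': {
            'SAGEMAKER_ENDPOINT': endpoint_name
        }
    }
)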

response

***

# Step 5 - Test the Endpoint

Here we test the Amazon SageMaker Endpoint by invoking the Lambda function created above. The request body for the Lambda function uses the `payload` and `parameters` keys, while the endpoint itself expects a JSON document with at least an `inputs` key.

In [None]:
import json

In [None]:
endpoint_name = "flan-t5-endpoint"
lambda_function_name = "Multi-Language-GenAI"

In [None]:
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

In [None]:
payload = """Summarize the following text:
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.
"""

payload_2 = """Riassumete il testo seguente:
Peter ed Elizabeth hanno preso un taxi per partecipare a una festa notturna in città. Durante la festa, Elizabeth ha avuto un collasso ed è stata portata d'urgenza in ospedale.
Poiché le è stata diagnosticata una lesione cerebrale, il medico ha detto a Peter di starle accanto finché non si fosse ripresa.
Pertanto, Peter rimase con lei in ospedale per 3 giorni senza uscire.
"""

parameters = {
  "early_stopping": True,
  "length_penalty": 2.0,
  "temperature": 0,
  "min_length": 10,
  "no_repeat_ngram_size": 3,
}

body = {
    "payload": payload_2,
    "parameters": parameters
}

response = lambda_client.invoke(
    FunctionName=lambda_function_name,
    Payload=json.dumps(body)
)

results = json.loads(response['Payload'].read().decode("utf-8"))

print(results["body"])
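`lambda_client.invoke` returns the function output as a streaming `Payload`. When the function raises, the response also carries a `FunctionError` field and the payload describes the error instead of containing a `body`, so `results["body"]` would fail with a `KeyError`. A small hypothetical helper (`parse_invoke_response` is our own name) can surface that case explicitly; in the test, `io.BytesIO` only stands in for the real streaming body.

```python
import json


def parse_invoke_response(response):
    """Decode a Lambda Invoke response, raising if the function itself failed."""
    payload = json.loads(response["Payload"].read().decode("utf-8"))
    if "FunctionError" in response:
        # for function errors the payload typically holds errorMessage/errorType
        raise RuntimeError(
            "Lambda function error ({}): {}".format(response["FunctionError"], payload)
        )
    return payload
```

It can be used as `results = parse_invoke_response(response)` right after the `invoke` call.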

# Step 6 - Test the Endpoint Locally (Optional)

Here we test the Amazon SageMaker Endpoint directly from the notebook, reproducing the language detection and translation steps locally. The endpoint expects a JSON document with at least an `inputs` key.

In [None]:
endpoint_name = "flan-t5-endpoint"

In [None]:
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

## Translate Service

In [None]:
def translate_string(row, start_lan="it", end_lan="en"):
    try:
        print("Translating {} from {} to {}".format(row, start_lan, end_lan))

        response = translate_client.translate_text(
            Text=row,
            SourceLanguageCode=start_lan,
            TargetLanguageCode=end_lan
        )

        return response["TranslatedText"]

    except Exception as e:
        stacktrace = traceback.format_exc()
        print("{}".format(stacktrace))

        raise e    
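Amazon Translate limits the size of the text accepted by a single real-time `translate_text` request (see the service quotas for the current value). For longer inputs, one option is to split the text at whitespace into chunks below a byte budget and translate each chunk separately. The helper below is a hypothetical sketch of that splitting step; the 5,000-byte default is a conservative assumption, not a value read from the service.

```python
import re


def split_for_translation(text, max_bytes=5000):
    """Split text into consecutive chunks of at most max_bytes UTF-8 bytes.

    Chunks break only at whitespace, so "".join(chunks) == text. A single
    word longer than max_bytes is kept as its own (oversized) chunk.
    """
    chunks = []
    current = ""
    # keep whitespace runs as separate tokens so nothing is lost when re-joining
    for token in re.split(r"(\s+)", text):
        if not token:
            continue
        candidate = current + token
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = token
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `translate_string` and the translated pieces concatenated.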

## Detect Language

In [None]:
def detect_language(body):
    try:
        results = comprehend_client.detect_dominant_language(Text=body)

        max_result = max(results["Languages"], key=lambda x: x['Score'])

        return max_result["LanguageCode"]
    except Exception as e:
        stacktrace = traceback.format_exc()
        print("{}".format(stacktrace))

        raise e
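The `detect_dominant_language` response contains a `Languages` list of `LanguageCode`/`Score` pairs, and `detect_language` above simply takes the highest score. For short or mixed-language inputs that top score can be low. A hypothetical refinement (`pick_language` is our own name) falls back to a default language when the best score is below a threshold; the 0.5 threshold and the `en` default are assumptions, and the response dictionaries in the test are hand-written, not real Comprehend output.

```python
def pick_language(comprehend_response, min_score=0.5, default="en"):
    """Return the highest-scoring language code, or default if none is confident enough."""
    languages = comprehend_response.get("Languages", [])
    if not languages:
        return default
    best = max(languages, key=lambda item: item["Score"])
    return best["LanguageCode"] if best["Score"] >= min_score else default
```

For example: `pick_language(comprehend_client.detect_dominant_language(Text=payload_2))`.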

## Text Summarization

In [None]:
payload = """Summarize the following text:
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.
"""

payload_2 = """Riassumete il testo seguente:
Peter ed Elizabeth hanno preso un taxi per partecipare a una festa notturna in città. Durante la festa, Elizabeth ha avuto un collasso ed è stata portata d'urgenza in ospedale.
Poiché le è stata diagnosticata una lesione cerebrale, il medico ha detto a Peter di starle accanto finché non si fosse ripresa.
Pertanto, Peter rimase con lei in ospedale per 3 giorni senza uscire.
"""

start_lan = detect_language(payload_2)

if start_lan != "en":
    payload = translate_string(payload_2, start_lan, "en")

    print("Translated sentence: {}".format(payload))
else:
    # the input is already in English, so send it to the endpoint unchanged
    payload = payload_2

    print("Detected en language")

parameters = {
  "early_stopping": True,
  "length_penalty": 2.0,
  "temperature": 0,
  "min_length": 10,
  "no_repeat_ngram_size": 3,
}

results = predictor.predict({
    "inputs": payload,
    "parameters" :parameters
})

if start_lan != "en":
    results[0]["generated_text"] = translate_string(results[0]["generated_text"], "en", start_lan)
else:
    print("Detected en language")

results

***

# Step 7 - Delete Endpoint and Function

In [None]:
lambda_function_name = "Multi-Language-GenAI"

lambda_client.delete_function(
    FunctionName=lambda_function_name
)

In [None]:
endpoint_name = "flan-t5-endpoint"

predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

predictor.delete_endpoint(delete_endpoint_config=True)
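Deleting the endpoint and its configuration leaves the SageMaker model objects in place, including any timestamped models created by the fallback in Step 3. A minimal sketch, assuming the boto3 SageMaker client's `list_models` (filtered with `NameContains`, paginated with `NextToken`) and `delete_model` calls; the helper name and the prefix matching are our own choices. A prefix can also match unrelated models, so `dry_run=True` only lists what would be deleted.

```python
def delete_models_with_prefix(sagemaker_client, prefix, dry_run=False):
    """Delete SageMaker models whose names start with prefix; return their names."""
    names = []
    kwargs = {"NameContains": prefix}
    while True:
        page = sagemaker_client.list_models(**kwargs)
        # NameContains matches anywhere in the name, so keep only true prefix matches
        names.extend(
            model["ModelName"]
            for model in page.get("Models", [])
            if model["ModelName"].startswith(prefix)
        )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    if not dry_run:
        for name in names:
            sagemaker_client.delete_model(ModelName=name)
    return names
```

For example, `delete_models_with_prefix(sagemaker_client, "flan-t5-xxl", dry_run=True)` with the client created in Step 1, then the same call without `dry_run` once the list looks right.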