# Flan-T5-XXL
In this notebook we create and deploy a Flan-T5-XXL model using inference components on the endpoint you created in the first notebook. For this model we use Hugging Face's TGI (Text Generation Inference) container, and we reserve two GPUs for each copy of the inference component we create. This is the 3rd notebook in a series of 5 notebooks used to deploy models against the endpoint you created in the first notebook. The last notebook shows other available APIs and cleans up the artifacts created.

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/2b_flant5_xxl-tgi.ipynb)

---

Tested using the `Python 3 (Data Science)` kernel on SageMaker Studio and `conda_python3` kernel on SageMaker Notebook Instance.

### Install dependencies

Upgrade the SageMaker Python SDK.

In [None]:
!pip install sagemaker --upgrade

### Import libraries

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
from sagemaker import Model, image_uris, serializers, deserializers

### Set configurations

`REPLACE` the `endpoint_name` value with the name of the endpoint created in the first notebook.

In [None]:
%store -r \
endpoint_name

if "endpoint_name" not in locals():
    print("Please specify the endpoint_name before proceeding.")

else:
    print(f"Endpoint name: {endpoint_name}")

We start by creating the objects we need for this notebook: the boto3 clients used to interact with SageMaker, and other variables that are referenced later in the notebook.

In [None]:
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")

sagemaker_session = (
    sagemaker.session.Session()
)  # sagemaker session for interacting with different AWS APIs
region = sagemaker_session.boto_region_name  # public attribute instead of the private _region_name

In [None]:
role = sagemaker.get_execution_role()
print(f"Role: {role}")

s3_client = boto3.client("s3")
prefix = sagemaker.utils.unique_name_from_base("DEMO")


s3_bucket = sagemaker_session.default_bucket()
s3_prefix = f"sagemaker/{prefix}"
s3_key = f"s3://{s3_bucket}/{s3_prefix}"
print(f"Demo S3 key: {s3_key}")

print(f"Demo endpoint name: {endpoint_name}")

inference_component_name = f"{prefix}-inference-component"
print(f"Demo inference component name: {inference_component_name}")

## Create Model Artifact
We will deploy the Flan-T5-XXL model using the TGI container. To do so, you need to set the image you would like to use, along with the proper configuration. You can also create a SageMaker model to be referenced when you create your inference component.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
hf_inference_dlc = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

# print ecr image uri
print(f"llm image uri: {hf_inference_dlc}")

In [None]:
deployment_name = "sm"

flant5xxlmodel = {
    "Image": hf_inference_dlc,
    "Environment": {"HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text-generation"},
}


# create SageMaker Model
sagemaker_client.create_model(
    ModelName=f"{deployment_name}-model-flan-t5-xxl",
    ExecutionRoleArn=role,
    Containers=[flant5xxlmodel],
)
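The container configuration is passed through the `Environment` dictionary. As a minimal sketch, the helper below (hypothetical, not part of the original notebook) assembles such a dictionary and converts every value to a string. It assumes the Hugging Face LLM container reads `SM_NUM_GPUS`, `MAX_INPUT_LENGTH`, and `MAX_TOTAL_TOKENS`; check the container documentation for the version you use before relying on these names.

```python
import json


def build_tgi_environment(model_id, num_gpus=None, max_input_length=None, max_total_tokens=None):
    """Return a container environment dict; all values are strings."""
    env = {"HF_MODEL_ID": model_id}
    if num_gpus is not None:
        # Assumption: the container uses SM_NUM_GPUS as the number of shards.
        env["SM_NUM_GPUS"] = json.dumps(num_gpus)
    if max_input_length is not None:
        # Assumption: MAX_INPUT_LENGTH caps the prompt length in tokens.
        env["MAX_INPUT_LENGTH"] = json.dumps(max_input_length)
    if max_total_tokens is not None:
        # Assumption: MAX_TOTAL_TOKENS caps prompt plus generated tokens.
        env["MAX_TOTAL_TOKENS"] = json.dumps(max_total_tokens)
    return env


example_container = {
    "Image": "<llm-image-uri>",  # placeholder for the value of hf_inference_dlc
    "Environment": build_tgi_environment("google/flan-t5-xxl", num_gpus=2),
}
print(example_container["Environment"])
```

The resulting dictionary has the same shape as `flant5xxlmodel` above and could be passed in the `Containers` list of `create_model`.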

We can now create the inference component, which will be deployed on the endpoint that you specify. Note that the specification accepts either a SageMaker model or a container. If you provide a container, you need to supply an image and an artifact URL as parameters. In this example we set it to the name of the model we prepared in the cells above. You can also set `ComputeResourceRequirements` to tell SageMaker what should be reserved for each copy of the inference component, and set the copy count to the number of copies you would like to deploy. These can be managed and scaled as the capabilities become available.

Note that in this example we set `NumberOfAcceleratorDevicesRequired` to `2`. By doing so we reserve 2 accelerators for each copy of this inference component so that we can use tensor parallelism.
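Reserving accelerators per copy also bounds how many copies fit on one instance. The sketch below works through that arithmetic for hypothetical instance sizes. It considers accelerators only; CPU and memory reservations, and other inference components sharing the endpoint, can lower the real limit.

```python
def max_copies_per_instance(accelerators_on_instance, accelerators_per_copy):
    """Upper bound on copies per instance, counting accelerators only."""
    if accelerators_per_copy <= 0:
        raise ValueError("accelerators_per_copy must be positive")
    return accelerators_on_instance // accelerators_per_copy


# Hypothetical instance sizes, with 2 accelerators reserved per copy.
for total_accelerators in (4, 8):
    copies = max_copies_per_instance(total_accelerators, 2)
    print(f"{total_accelerators} accelerators -> at most {copies} copies")
```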

In [None]:
inference_component_name_flant5 = f"{prefix}-IC-flan-xxl"
variant_name = "AllTraffic"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_flant5,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": f"{deployment_name}-model-flan-t5-xxl",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
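The call above references a SageMaker model by name. As a rough sketch of the two options for the specification, the hypothetical helper below builds the `Specification` dictionary from either a model name or a container definition, and treats the two as alternatives. The image URI and S3 path are placeholders, not values from this notebook.

```python
def build_specification(compute, model_name=None, container=None):
    """Build a Specification dict from a model name or a container, not both."""
    if (model_name is None) == (container is None):
        raise ValueError("Provide exactly one of model_name or container.")
    spec = {"ComputeResourceRequirements": dict(compute)}
    if model_name is not None:
        spec["ModelName"] = model_name
    else:
        spec["Container"] = dict(container)
    return spec


compute = {
    "NumberOfAcceleratorDevicesRequired": 2,
    "NumberOfCpuCoresRequired": 1,
    "MinMemoryRequiredInMb": 1024,
}

# Option 1: reference an existing SageMaker model, as this notebook does.
by_model = build_specification(compute, model_name="sm-model-flan-t5-xxl")

# Option 2: describe the container inline (placeholder values).
by_container = build_specification(
    compute,
    container={
        "Image": "<image-uri>",
        "ArtifactUrl": "s3://<bucket>/<prefix>/model.tar.gz",
    },
)
print(sorted(by_model), sorted(by_container))
```

Either dictionary could be passed as the `Specification` argument of `create_inference_component`.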

Wait until the inference component is `InService`.

In [None]:
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_flant5
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
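The loop above exits on both `InService` and `Failed` and has no upper time limit. One possible variant, sketched below, raises on failure and gives up after a timeout. The function name, the default timeout, and the injected `sleep` and `clock` parameters are illustrative choices, and the demo uses a fake `describe` function so the block runs without an AWS account.

```python
import time


def wait_for_inference_component(
    describe, name, timeout_s=1800, poll_s=30, sleep=time.sleep, clock=time.monotonic
):
    """Poll until the component is InService; raise on Failed or timeout."""
    deadline = clock() + timeout_s
    while True:
        desc = describe(InferenceComponentName=name)
        status = desc["InferenceComponentStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Inference component {name} failed: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"Inference component {name} still {status} after {timeout_s}s")
        sleep(poll_s)


# Demo with a fake describe function standing in for the SageMaker client.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(InferenceComponentName):
    return {"InferenceComponentStatus": next(_fake_statuses)}


print(wait_for_inference_component(fake_describe, "demo-ic", sleep=lambda _seconds: None))
```

In the notebook, the first argument would be `sagemaker_client.describe_inference_component` and the second `inference_component_name_flant5`.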

In [None]:
# Store the inference component name for notebook 3.

ic2_name = inference_component_name_flant5
%store \
ic2_name

Now that the inference component is `InService`, it is available to serve requests. Here we invoke the endpoint, but notice that we add an additional parameter called `InferenceComponentName`. This allows SageMaker to direct your request to the proper inference component.

In [None]:
import json


payload = '''Summarize the following text:
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.'''


response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_flant5,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(
        {
            "inputs": payload,
            "parameters": {
                "early_stopping": True,
                "length_penalty": 2.0,
                "max_new_tokens": 50,
                "temperature": 1,
                "min_length": 10,
                "no_repeat_ngram_size": 3,
            },
        }
    ),
)
result = json.loads(response["Body"].read().decode())
result
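To pull just the text out of the response, a small helper can be used. The sketch below assumes the container returns a JSON list of objects with a `generated_text` key, which is the shape TGI commonly returns on SageMaker; verify it against the `result` printed above. The sample body is made up for illustration and is not real model output.

```python
import json


def extract_generated_text(raw_body):
    """Return the generated_text values from a TGI-style JSON response body."""
    parsed = json.loads(raw_body)
    if isinstance(parsed, dict):
        # Tolerate a single object as well as a list of objects.
        parsed = [parsed]
    return [item["generated_text"] for item in parsed]


# Made-up response body, for illustration only.
sample_body = b'[{"generated_text": "Peter stayed with Elizabeth at the hospital."}]'
print(extract_generated_text(sample_body))
```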

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2b_flant5_xxl-tgi.ipynb)