# Access and host Large Language Models on AWS
This notebook demonstrates how to use LLMs on AWS. 

The following options are available to host and use LLMs/FMs on AWS:
1. [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-service.html)
2. [Amazon SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models.html)
3. Amazon SageMaker-managed hosting using [SageMaker endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html)
4. [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/)
5. Self-hosted on EC2 containers using [Amazon Elastic Container Service (ECS)](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html) or [Elastic Kubernetes Service (EKS)](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html)

Refer to the [Generative AI on AWS](https://aws.amazon.com/generative-ai/) landing page for details and use cases for each deployment option. 

This notebook demonstrates how you can use Amazon Bedrock, Amazon SageMaker JumpStart, and Amazon SageMaker real-time and asynchronous endpoints to host generative AI models.

## How to use this notebook
You use this notebook to deploy LLM endpoints for the different labs throughout the workshop.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><b>This notebook creates cost-incurring resources</b></h2>
    <br>
    <p style="text-align: center; margin: auto;"><b>You don't need to run this notebook if you're going to use Amazon Bedrock only for LLM access</b></p>
        <p></p>
    <p style="text-align: center; margin: auto;">Create <b>at least one</b> SageMaker endpoint to host an LLM if you'd like to experiment with endpoints and use the endpoint in a lab</p>
    <p style="text-align: center; margin: auto;">You can create more than one inference endpoint. Feel free to experiment but be aware of potential costs</p>
    <br>
</div>

## Setup environment
Select the _PyTorch 2.0.0 Python 3.10 CPU Optimized_ image and the `ml.t3.medium` compute instance for this notebook:

![](../static/img/notebook-image-kernel.png)

In [2]:
!pip install sagemaker boto3 gradio langchain --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.132 requires botocore==1.29.132, but you have botocore 1.31.79 which is incompatible.
awscli 1.27.132 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
awscli 1.27.132 requires s3transfer<0.7.0,>=0.6.0, but you have s3transfer 0.7.0 which is incompatible.
confection 0.0.4 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.4.2 which is incompatible.
spacy 3.5.2 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.4.2 which is incompatible.
spacy 3.5.2 requires typer<0.8.0,>=0.3.0, but you have typer 0.9.0 which is incompatible.
thinc 8.1.10 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.4.2 which is incompatible.

In [None]:
# Restart kernel to get the packages
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [3]:
import sagemaker
import boto3
import os
import json
import uuid

print(sagemaker.__version__)

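from packaging import version  # compare versions numerically, not as strings

assert version.parse(sagemaker.__version__) >= version.parse("2.195.0")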

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
2.197.0


In [4]:
# Get some variables you need to interact with SageMaker service
boto_session = boto3.Session()
region = boto_session.region_name
bucket_name = sagemaker.Session().default_bucket()
bucket_prefix = "genai-on-aws-workshop"  
sm_session = sagemaker.Session()
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()
account_id = boto3.client("sts").get_caller_identity()["Account"]

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [5]:
# The notebook tags all deployed resources
workshop_tags = [{'Key': 'project', 'Value': 'genai-on-aws-workshop'}]

In [6]:
# Get domain id and user profile
NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"
domain_id = None
user_profile_name = None  # initialized so that %store works even if the metadata file is missing

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        md = json.loads(f.read())
        domain_id = md.get('DomainId')
        user_profile_name = md.get('UserProfileName')
        
        print(f"SageMaker domain id: {domain_id}\n"
              f"User profile name: {user_profile_name}")

SageMaker domain id: d-gg3vgzr0ftzx
User profile name: canvas-demo-user


In [7]:
%store domain_id
%store user_profile_name
%store region
%store account_id
%store bucket_prefix

%store

Stored 'domain_id' (str)
Stored 'user_profile_name' (str)
Stored 'region' (str)
Stored 'account_id' (str)
Stored 'bucket_prefix' (str)
Stored variables and their in-db values:
account_id                             -> '949335012047'
baseline_s3_url                        -> 's3://sagemaker-us-east-1-949335012047/from-idea-t
bucket_name                            -> 'sagemaker-us-east-1-949335012047'
bucket_prefix                          -> 'genai-on-aws-workshop'
domain_id                              -> 'd-gg3vgzr0ftzx'
evaluation_s3_url                      -> 's3://sagemaker-us-east-1-949335012047/from-idea-t
experiment_name                        -> 'from-idea-to-prod-experiment-20-07-44-29'
initialized                            -> True
input_s3_url                           -> 's3://sagemaker-us-east-1-949335012047/from-idea-t
model_package_group_name               -> 'from-idea-to-prod-model-group'
output_s3_url                          -> 's3://sagemaker-us-east-1-949335012047

## Check quotas

In [8]:
def check_quota(quota_code, min_v):
    """Print whether the SageMaker quota `quota_code` is at least `min_v`."""
    # `quotas_client` is the Service Quotas client created in the next cell
    r = quotas_client.get_service_quota(
        ServiceCode="sagemaker",
        QuotaCode=quota_code,
    )

    q = r["Quota"]["Value"]
    n = r["Quota"]["QuotaName"]

    if q < min_v:
        print(
            f"WARNING: Your quota {q} for {n} is less than required value of {min_v}"
        )
    else:
        print(
            f"SUCCESS: Your quota {q} for {n} is equal or more than required value of {min_v}"
        )

In [9]:
quotas_client = boto3.client("service-quotas")
llm_instance_types = [
    "ml.g5.2xlarge",  # needed for Falcon-7b deployment
    "ml.g5.12xlarge", # needed for Falcon-40b deployment
    "ml.g5.48xlarge", # needed for Falcon-40b deployment
]
                      
quotas = {
    "ml.g5.2xlarge": ["L-9614C779", 1],
    "ml.g5.12xlarge": ["L-65C4BD00", 1],
    "ml.g5.48xlarge": ["L-0100B823", 0],
    "ml.g4dn.xlarge": ["L-B67CFA0C", 1],
}
     
for i in llm_instance_types:
    check_quota(quotas[i][0], quotas[i][1])

SUCCESS: Your quota 2.0 for ml.g5.2xlarge for endpoint usage is equal or more than required value of 1
SUCCESS: Your quota 2.0 for ml.g5.12xlarge for endpoint usage is equal or more than required value of 1
SUCCESS: Your quota 2.0 for ml.g5.48xlarge for endpoint usage is equal or more than required value of 0
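
The check above prints one line per instance type. If you prefer to collect the shortfalls programmatically, for example to stop the notebook early, a minimal sketch could look like the following. The helper name `find_quota_shortfalls` is hypothetical; it assumes a client object that exposes `get_service_quota` with the same keyword arguments as the Service Quotas client used above.

```python
def find_quota_shortfalls(client, requirements):
    """Return (quota_name, current_value, required_value) for every unmet quota.

    `client` is expected to behave like boto3.client("service-quotas").
    `requirements` maps an instance type to [quota_code, minimum_value],
    the same shape as the `quotas` dictionary above.
    """
    shortfalls = []
    for instance_type, (quota_code, min_value) in requirements.items():
        response = client.get_service_quota(
            ServiceCode="sagemaker",
            QuotaCode=quota_code,
        )
        quota = response["Quota"]
        if quota["Value"] < min_value:
            shortfalls.append((quota["QuotaName"], quota["Value"], min_value))
    return shortfalls
```

Calling `find_quota_shortfalls(quotas_client, {i: quotas[i] for i in llm_instance_types})` returns an empty list when every quota is sufficient.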


## Self-hosting LLM
This section shows how to:
1. Use SageMaker JumpStart to deploy a real-time endpoint with two lines of code
2. Use SageMaker [real-time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) to host an LLM
3. Use SageMaker [asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) to host an LLM

In [10]:
# Import HuggingFace classes
from sagemaker.huggingface import (
    get_huggingface_llm_image_uri, 
    HuggingFaceModel, 
    HuggingFacePredictor,
)

### Deploy using SageMaker JumpStart
The easiest option to deploy an LLM is to use the [`JumpStartModel`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.jumpstart.model.JumpStartModel) Python SDK class. Refer to [Introduction to SageMaker JumpStart - Text Generation with Falcon models](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-falcon.ipynb) for an example.

For available models refer to the [JumpStart model list](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html).

You deploy the [Falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) LLM to a SageMaker real-time inference endpoint.

If you have access to at least one `ml.g5.12xlarge` instance for real-time inference, you can deploy a bigger and more capable model, [Falcon-40B-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct). 

<div class="alert alert-info"> 💡   
We recommend using Falcon-40B-instruct for the retrieval augmented generation lab.
</div>

In [11]:
from sagemaker.jumpstart.model import JumpStartModel

In [12]:
# Uncomment the line for the model you'd like to use

js_model_id = "huggingface-llm-falcon-7b-bf16" # Falcon-7B base model; the instruct variant is "huggingface-llm-falcon-7b-instruct-bf16"
# js_model_id = "huggingface-llm-falcon-40b-instruct-bf16" # Falcon-40B-instruct

js_model = JumpStartModel(model_id=js_model_id)



In [13]:
predictor_js = js_model.deploy(tags=workshop_tags)

-----------!



In [None]:
# send request
predictor_js.predict({
    "inputs": "Hey Falcon! Any recommendations for my holidays in Seattle?",
    "parameters": {"max_new_tokens": 50},
})

[{'generated_text': "\nI'm going to Seattle for a week in July. I'm staying in the downtown area, but I'm not sure what to do. I'm not a big shopper, but I'm not opposed to it. I"}]

### Deploy using HuggingFace TGI
<div class="alert alert-info"> 💡 
This section deploys another SageMaker real-time endpoint. If you've already deployed an endpoint using SageMaker JumpStart and don't want to deploy the second endpoint, skip this section.
</div>

If you'd like to use a model which is not onboarded to JumpStart, you can use a HuggingFace container.

![](../static/img/hf-tgi.png)

[HuggingFace LLM DLC](https://huggingface.co/blog/sagemaker-huggingface-llm) is a purpose-built inference container for deploying LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving LLMs. TGI enables high-performance text generation using tensor parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. TGI is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and it implements optimizations for all supported model architectures.

<div class="alert alert-info"> 💡   
We recommend using Falcon-40B-instruct for the retrieval augmented generation lab.
</div>

In [None]:
# Uncomment the settings for the model you'd like to use

# Falcon-7b instruct
# hf_model_id = 'tiiuae/falcon-7b-instruct'
# llm_instance_type = 'ml.g5.2xlarge'
# gpus = 1

# Falcon-40b instruct
hf_model_id = 'tiiuae/falcon-40b-instruct'
llm_instance_type = 'ml.g5.12xlarge' # recommended ml.g5.48xlarge, gpus = 8
gpus = 4 # or 8 if ml.g5.48xlarge

tgi_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

# Hub Model configuration. https://huggingface.co/models
config = {
    'HF_MODEL_ID': hf_model_id,
    'SM_NUM_GPUS': json.dumps(gpus),
    'HF_MODEL_QUANTIZE': "bitsandbytes", # Use quantization with ml.g5.12xlarge
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=tgi_image,
    env=config,
    role=sm_role,
)

# print ecr image uri
print(f"llm image uri: {tgi_image}")



In [None]:
# deploy model to SageMaker real-time endpoint
predictor_tgi = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=llm_instance_type,
    container_startup_health_check_timeout=1800,
    tags=workshop_tags,
)



In [None]:
# send request
predictor_tgi.predict({
    "inputs": "Hey Falcon! Any recommendations for my holidays in Seattle?",
    "parameters": {"max_new_tokens": 500},
})

### Deploy an asynchronous endpoint

You can use SageMaker [asynchronous inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) to host LLMs.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously. It is suitable for workloads that do not require sub-second latency. Consider asynchronous inference in two cases:  
1. Inference requests can take up to 15 minutes to process and payloads can be up to 1 GB in size  
2. You want to control costs: asynchronous endpoints can scale the instance count down to zero when idle, so you pay only while the endpoint is processing requests.

Invocation of asynchronous endpoints differs from real-time endpoints. Rather than passing the request payload inline with the request, you upload the payload to Amazon S3 and pass an Amazon S3 URI as a part of the request. Upon receiving the request, SageMaker provides you with a token with the output location where the result will be placed once processed. Internally, SageMaker maintains a queue with these requests and processes them. During endpoint creation, you can optionally specify an Amazon SNS topic to receive success or error notifications. Once you receive the notification that your inference request has been successfully processed, you can access the result in the output Amazon S3 location.

The following diagram shows an overview of the end-to-end flow with an asynchronous inference endpoint.

![](../static/img/sm-async-endpoint.png)

In [24]:
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

Creating an asynchronous endpoint is not much different from creating a real-time endpoint. You need to create a model, an endpoint configuration, and an endpoint. If you use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), the [Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) class is a convenient abstraction that creates all needed structures with just one line of code by calling the `deploy()` method.

First, create a model:

In [25]:
# Uncomment the settings for the model you'd like to use
# Falcon-7b instruct
hf_model_id = 'tiiuae/falcon-7b-instruct'
llm_instance_type = 'ml.g5.2xlarge'
gpus = 1

# Falcon-40b instruct
# hf_model_id = 'tiiuae/falcon-40b-instruct'
# llm_instance_type = 'ml.g5.12xlarge' # recommended ml.g5.48xlarge, gpus = 8
# gpus = 4 # or 8 if ml.g5.48xlarge

tgi_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

# Hub Model configuration. https://huggingface.co/models
config = {
    'HF_MODEL_ID': hf_model_id,
    'SM_NUM_GPUS': json.dumps(gpus),
    # 'HF_MODEL_QUANTIZE': "bitsandbytes", # Use quantization with ml.g5.12xlarge
}

# create Hugging Face Model Class
async_inference_model = HuggingFaceModel(
    image_uri=tgi_image,
    env=config,
    role=sm_role,
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


Second, configure [`AsyncInferenceConfig`](https://sagemaker.readthedocs.io/en/stable/api/inference/async_inference.html) to use in `model.deploy()`. In the inference config you can specify an S3 URL where the endpoint uploads the inference output, an optional Amazon SNS notification config, and other settings such as an encryption key or a failure path.

In [26]:
output_path = f"s3://{bucket_name}/{bucket_prefix}/async-inference/output"
async_inference_config = AsyncInferenceConfig(
    output_path=output_path,
)
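
The optional settings mentioned above correspond to fields of the `AsyncInferenceConfig` structure in the SageMaker `CreateEndpointConfig` API. The sketch below builds that structure as a plain dictionary to show where the output path, the failure path, and the SNS topics go. The function name `build_async_inference_config` and its parameter names are hypothetical and only serve as an illustration; the dictionary keys follow the API structure.

```python
def build_async_inference_config(output_path, failure_path=None,
                                 success_topic=None, error_topic=None,
                                 max_concurrent_invocations=None):
    """Build the AsyncInferenceConfig structure used by CreateEndpointConfig."""
    output_config = {"S3OutputPath": output_path}
    if failure_path:
        # S3 location for responses of failed inference requests
        output_config["S3FailurePath"] = failure_path

    # Optional Amazon SNS topics for success and error notifications
    notification_config = {}
    if success_topic:
        notification_config["SuccessTopic"] = success_topic
    if error_topic:
        notification_config["ErrorTopic"] = error_topic
    if notification_config:
        output_config["NotificationConfig"] = notification_config

    config = {"OutputConfig": output_config}
    if max_concurrent_invocations is not None:
        config["ClientConfig"] = {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_invocations
        }
    return config
```

With only `output_path` set, the result contains a single `OutputConfig` entry with `S3OutputPath`, which mirrors the minimal configuration used in this notebook.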

Third, create an asynchronous endpoint using `model.deploy()` method:



In [27]:
# deploy model to SageMaker asynchronous endpoint
predictor_async = async_inference_model.deploy(
    initial_instance_count=1,
    instance_type=llm_instance_type,
    async_inference_config=async_inference_config, # important: AsyncInferenceConfig makes this an asynchronous endpoint
    container_startup_health_check_timeout=1800,
    tags=workshop_tags,
)

--------------!



You can use the `predictor.predict()` method to send a request to an asynchronous endpoint. The Python SDK class implementation takes care of all required steps, such as uploading the request to Amazon S3, calling the SageMaker runtime API [`InvokeEndpointAsync`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html), waiting until the endpoint uploads the inference result to the output path, and returning the result.

If you use SageMaker API or boto3 SDK method [`invoke_endpoint_async`](https://boto3.amazonaws.com/v1/documentation/api/1.26.83/reference/services/sagemaker-runtime/client/invoke_endpoint_async.html), you need to implement all these steps. For an example refer to a sample [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb).
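
To make these steps concrete, here is a minimal sketch of the manual flow. It is not the SDK's implementation: the function name `invoke_async_and_wait` and its parameters are hypothetical, and it assumes `s3_client` and `runtime_client` behave like `boto3.client("s3")` and `boto3.client("sagemaker-runtime")`. It also assumes the caller is allowed to list the output bucket, so that a missing result surfaces as `NoSuchKey`, and it does not inspect the failure location.

```python
import json
import time
import uuid
from urllib.parse import urlparse


def invoke_async_and_wait(s3_client, runtime_client, endpoint_name, payload,
                          input_bucket, input_prefix,
                          poll_seconds=5, timeout_seconds=900):
    """Upload a request to S3, invoke the endpoint, and poll for the result."""
    # 1. Upload the request payload to Amazon S3
    input_key = f"{input_prefix}/{uuid.uuid4()}.json"
    s3_client.put_object(
        Bucket=input_bucket,
        Key=input_key,
        Body=json.dumps(payload).encode("utf-8"),
    )

    # 2. Invoke the endpoint with the S3 URI of the payload
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{input_key}",
        ContentType="application/json",
    )

    # 3. Poll the output location until the result object exists
    output = urlparse(response["OutputLocation"])
    output_bucket, output_key = output.netloc, output.path.lstrip("/")
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            obj = s3_client.get_object(Bucket=output_bucket, Key=output_key)
            return json.loads(obj["Body"].read())
        except s3_client.exceptions.NoSuchKey:
            time.sleep(poll_seconds)
    raise TimeoutError(
        f"No result at {response['OutputLocation']} after {timeout_seconds} s"
    )
```

Polling keeps the sketch short. For production workloads, the optional Amazon SNS notifications described earlier avoid repeated S3 requests.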

In [28]:
predictor_async.predict({
    "inputs": "Hey Falcon! Any recommendations for my holidays in Seattle?",
    "parameters": {"max_new_tokens": 500},
})

[{'generated_text': 'Hey Falcon! Any recommendations for my holidays in Seattle?\nThere are so many great things to do in Seattle! Some popular recommendations include visiting the Space Needle, hiking in the nearby mountains, and checking out the local food scene. You could also consider taking a tour of the city, or visiting some of the nearby islands for a change of scenery. Let me know if you have any specific interests or preferences, and I can help you come up with a plan!'}]

Run the next cell to check that the inference result is uploaded to the specified S3 URL:

In [43]:
!aws s3 ls {output_path} --recursive

2023-10-30 20:28:48        485 genai-on-aws-workshop/async-inference/output/03f45e09-3dc7-4ae8-a4ca-a74964f50274.out
2023-10-30 17:46:29        485 genai-on-aws-workshop/async-inference/output/10598ef0-a8a9-47da-9393-90439550c06b.out
2023-10-31 10:49:28        485 genai-on-aws-workshop/async-inference/output/3387b96b-cdb3-42c9-9ef5-fefc5f1775bd.out
2023-10-30 15:02:39        485 genai-on-aws-workshop/async-inference/output/662e57c1-3128-4c2b-8882-5a87575a6659.out
2023-10-30 20:28:32        485 genai-on-aws-workshop/async-inference/output/cbe09475-d754-4488-a538-62b48434d231.out
2023-10-31 09:01:11        485 genai-on-aws-workshop/async-inference/output/cee62680-c50c-46f7-b3c6-1c1d0144a416.out


### Add auto scaling policy to the asynchronous endpoint
Amazon SageMaker supports [automatic scaling (autoscaling) of your asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-autoscale.html). Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Autoscaling works with both real-time and asynchronous endpoints, but with asynchronous inference you can also scale an endpoint down to zero instances. Requests that arrive while the endpoint has zero instances are queued and processed once the endpoint scales up. This section implements an autoscaling policy to scale to zero for the deployed asynchronous endpoint.

In [25]:
try:
    predictor_async.endpoint_name
except NameError:
    raise Exception(f"ERROR: You haven't deployed an asynchronous endpoint!")
    
ep = sm_client.describe_endpoint(EndpointName=predictor_async.endpoint_name)

First, define a scaling policy. The scaling policy defines the desired scaling behavior in response to changes in metrics.
For a list of possible scaling metrics refer to [Asynchronous Inference Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-monitor.html#async-inference-monitor-cloudwatch-async). This configuration tracks the `ApproximateBacklogSizePerInstance` and `HasBacklogWithoutCapacity` metrics.

In [26]:
TargetTrackingScalingPolicyConfiguration = {
    'TargetValue': 2.0, # The target value for the metric. Here the metric is: ApproximateBacklogSizePerInstance
    'CustomizedMetricSpecification': {
        'MetricName': 'ApproximateBacklogSizePerInstance',
        'Namespace': 'AWS/SageMaker',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': predictor_async.endpoint_name},
        ],
        'Statistic': 'Average',
    },
    'ScaleInCooldown': 300, # seconds to wait after a scale-in activity before another scale-in can start
    'ScaleOutCooldown': 300, # seconds to wait after a scale-out activity before another scale-out can start
}
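
To build an intuition for the `TargetValue` of 2.0, the following back-of-the-envelope sketch estimates how many instances keep the backlog per instance at or below the target. This is a deliberately simplified model and not the actual Application Auto Scaling algorithm: it ignores cooldowns, alarm evaluation periods, and metric delays. The function name `estimate_desired_instances` is hypothetical, and the default capacity bounds are assumptions taken from the `MinCapacity` and `MaxCapacity` values registered later in this section.

```python
import math


def estimate_desired_instances(backlog_size, target_per_instance,
                               min_capacity=0, max_capacity=2):
    """Estimate the instance count that brings backlog per instance to the target."""
    if backlog_size <= 0:
        # Nothing is queued, so the endpoint can shrink to its minimum
        return min_capacity
    desired = math.ceil(backlog_size / target_per_instance)
    # Keep the estimate within the registered capacity bounds
    return max(min_capacity, min(max_capacity, desired))
```

For example, with 10 queued requests and a target of 2.0, the unclamped estimate is 5 instances, which the maximum capacity of 2 then limits to 2.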

In [27]:
StepScalingPolicyConfiguration = {
    "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
    "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
    "Cooldown": 300, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
    "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
    [ 
        {
          "MetricIntervalLowerBound": 0,
          "ScalingAdjustment": 1
        }
    ]
}

Second, register the endpoint as a target for autoscaling. The `MinCapacity` is zero, which means the endpoint can be scaled down to zero instances:

In [28]:
aa = boto3.client("application-autoscaling")
cw = boto3.client("cloudwatch")

In [29]:
# you need to use this format for resource_id
resource_id = ("endpoint/" + predictor_async.endpoint_name + "/variant/" + ep['ProductionVariants'][0]['VariantName'])

# Configure Autoscaling on asynchronous endpoint down to zero instances
r = aa.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=2,
)

Third, apply scaling policies to the endpoint. We apply two policies: the first tracks the number of requests in the queue and scales the endpoint accordingly; the second scales the endpoint up from zero instances if at least one request is waiting in the queue.

In [30]:
r = aa.put_scaling_policy(
    PolicyName=f"ApproximateBacklogSizePerInstance-ScalingPolicy-{predictor_async.endpoint_name}",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint production variant
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration=TargetTrackingScalingPolicyConfiguration,    
    
)

In [31]:
r = aa.put_scaling_policy(
    PolicyName=f"HasBacklogWithoutCapacity-ScalingPolicy-{predictor_async.endpoint_name}",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint production variant
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration=StepScalingPolicyConfiguration,    
)

step_scaling_policy_arn = r['PolicyARN']

Finally, create a CloudWatch alarm with the SageMaker metric `HasBacklogWithoutCapacity`. When this alarm is triggered, it initiates the step scaling policy we defined in the previous step:

In [41]:
r = cw.put_metric_alarm(
    AlarmName=f"StepTracking-endpoint/{predictor_async.endpoint_name}/variant/{ep['ProductionVariants'][0]['VariantName']}-AlarmStepUp-{uuid.uuid4()}",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': predictor_async.endpoint_name},
    ],
    Period=60,
    AlarmActions=[step_scaling_policy_arn]
)

Now the endpoint is scaled down to zero instances if there are no requests in the queue and scaled up from zero if at least one request is waiting in the queue.

### Use asynchronous endpoints in LangChain
SageMaker asynchronous endpoints are a cost-optimized solution when you need to host an LLM but traffic is unpredictable. While you can use the SageMaker Python SDK classes [`Predictor`](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) and [`AsyncPredictor`](https://sagemaker.readthedocs.io/en/stable/api/inference/predictor_async.html) to access asynchronous endpoints, the LangChain class [`SageMakerEndpoint`](https://python.langchain.com/docs/integrations/llms/sagemaker) doesn't support asynchronous inference.

This workshop contains an implementation of the `SagemakerAsyncEndpoint` class that you can use in the LangChain framework. The code is in the `llm/sagemaker_async_endpoint.py` file. The implementation is based on [this GitHub repository](https://github.com/dgallitelli/langchain/blob/master/langchain/llms/sagemaker_async_endpoint.py).

Run the following cells to test LangChain with the `SagemakerAsyncEndpoint` abstraction.

In [52]:
from typing import Dict
from langchain import PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import LLMChain
from llm.sagemaker_async_endpoint import SagemakerAsyncEndpoint

In [53]:
input_bucket = bucket_name
input_prefix = f"{bucket_prefix}/async-endpoint"

In [54]:
class ContentHandler(LLMContentHandler):
    content_type:str = "application/json"
    accepts:str = "application/json"
    len_prompt:int = 0

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 100, "do_sample": False, "repetition_penalty": 1.1}})
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        response_json = output.read()
        res = json.loads(response_json)
        ans = res[0]['generated_text']
        return ans
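
The handler above records `len_prompt` but does not use it, and the responses shown earlier echo the prompt at the start of `generated_text`. As an illustration only, the sketch below mirrors the two transformations with plain functions and strips the echoed prompt. The helper names `build_request_body` and `extract_answer` are hypothetical, and an in-memory stream stands in for the endpoint response, which is read with `.read()` in the same way as in `transform_output`.

```python
import io
import json


def build_request_body(prompt, max_new_tokens=100):
    """Serialize a prompt into the JSON request format used by the handler."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(body).encode("utf-8")


def extract_answer(output_stream, prompt):
    """Parse a TGI-style response stream and drop the echoed prompt, if present."""
    generated = json.loads(output_stream.read())[0]["generated_text"]
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


# Simulate an endpoint response with an in-memory stream
prompt = "Any recommendations for my holidays in Seattle?"
fake_response = io.BytesIO(
    json.dumps([{"generated_text": prompt + "\nVisit the Space Needle."}]).encode("utf-8")
)
print(extract_answer(fake_response, prompt))  # prints: Visit the Space Needle.
```

Whether to strip the prompt depends on the downstream chain; some prompts rely on the echoed text as context.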

In [55]:
chain = LLMChain(
    llm=SagemakerAsyncEndpoint(
        input_bucket=input_bucket,
        input_prefix=input_prefix,
        endpoint_name=predictor_async.endpoint_name,
        region_name=sagemaker.Session().boto_region_name,
        content_handler=ContentHandler(),
    ),
    prompt=PromptTemplate(
        input_variables=["query"],
        template="{query}",
    ),
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


<div class="alert alert-info"> 💡 
If the async endpoint has scaled to zero, the following cell raises the exception "The endpoint is not running". The endpoint wakes up automatically and is available after approximately 10 minutes. Re-run the cell once the endpoint is in the InService status again.
</div>

In [58]:
chain.run("Hey Falcon! Any recommendations for my holidays in Seattle?")

'Hey Falcon! Any recommendations for my holidays in Seattle?\nThere are so many great things to do in Seattle! Some popular recommendations include visiting the Space Needle, hiking in the nearby mountains, and checking out the local food scene. You could also consider taking a tour of the city, or exploring some of the nearby neighborhoods like Ballard or Fremont. Let me know if you have any specific interests or preferences!'

## Use Amazon Bedrock to access LLM
The easiest way to access different LLMs is to use the Amazon Bedrock managed API. With Amazon Bedrock you don't need to create and maintain an inference endpoint. You can access Amazon and third-party models.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
You must request access to a model before you can use it. If you try to use the model (with the API or console) before you have requested access to it, you receive an error message. For more information, see <a href=https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html>Model access</a>.
</div>

In [None]:
# make sure you have a boto3 version that supports Amazon Bedrock
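from packaging import version  # compare versions numerically, not as strings

assert version.parse(boto3.__version__) >= version.parse('1.28.57')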

In [None]:
bedrock = boto3.client(service_name='bedrock')

# list all available models you have access to
bedrock.list_foundation_models()["modelSummaries"]

To use a model you need to provide an Amazon Bedrock model ID. You can find the list of available model IDs in the Amazon Bedrock [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids-arns.html).

In [None]:
bedrock_model_id = 'anthropic.claude-instant-v1'
llm = bedrock.get_foundation_model(modelIdentifier=bedrock_model_id)

In [None]:
bedrock_runtime = boto3.client(service_name='bedrock-runtime')
body = json.dumps({
    "prompt": "\n\nHuman:explain black holes to 8th graders\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.1,
    "top_p": 0.9,
})

modelId = llm['modelDetails']['modelId']
accept = 'application/json'
contentType = 'application/json'

response = bedrock_runtime.invoke_model(
    body=body, 
    modelId=modelId, 
    accept=accept, 
    contentType=contentType
)

response_body = json.loads(response.get('body').read())

# text
print(response_body.get('completion'))
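
The request body in the cell above wraps the question in a prompt that starts with `\n\nHuman:` and ends with `\n\nAssistant:`. If you send several questions, a small helper keeps that format consistent. The function name `build_claude_body` is hypothetical, and the default values simply repeat the ones used in the cell above.

```python
import json


def build_claude_body(question, max_tokens_to_sample=300, temperature=0.1, top_p=0.9):
    """Build a JSON request body in the Human/Assistant prompt format used above."""
    prompt = f"\n\nHuman: {question.strip()}\n\nAssistant:"
    return json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": max_tokens_to_sample,
        "temperature": temperature,
        "top_p": top_p,
    })
```

The returned string can be passed as the `body` argument of `invoke_model` in place of the hand-written JSON.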

## Test the LLM endpoints

If you successfully deployed one or more SageMaker LLM endpoints, you can chat with the deployed LLMs. To create a user interface around the LLM, you can use the [Gradio](https://gradio.app/) Python library.

In [None]:
import gradio_app

Configure and create Gradio chatbot application:

In [None]:
# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:", "<|endoftext|>", " User:", "###"],
}

system_prompt = "You are a helpful Assistant, called LLM. You know everything about AWS."

List all endpoints you created in this notebook:

In [18]:
def list_workshop_endpoints():
    workshop_endpoints = []
    workshop_tag_key = workshop_tags[0]['Key']
    workshop_tag_value = workshop_tags[0]['Value']

    for ep in sm_client.list_endpoints().get('Endpoints'):
        ep_name = ep['EndpointName']
        ep_arn = ep['EndpointArn']
        v = [d['Value'] for d in sm_client.list_tags(ResourceArn=ep_arn)['Tags'] if d['Key'] == workshop_tag_key]
        if len(v) and v[0] == workshop_tag_value:
            workshop_endpoints.append(ep_name)

    return workshop_endpoints

def describe_workshop_endpoints():
    for i, ep_name in enumerate(list_workshop_endpoints()):
        ep_config_name = sm_client.describe_endpoint(EndpointName=ep_name)['EndpointConfigName']
        ep_config = sm_client.describe_endpoint_config(EndpointConfigName=ep_config_name)

        instance = ep_config['ProductionVariants'][0]['InstanceType']
        m_name = ep_config['ProductionVariants'][0]['ModelName']
        is_async = ep_config.get('AsyncInferenceConfig') is not None

        print(f"endpoint {i} ({'async' if is_async else 'real-time'}): {ep_name} ({instance}) -> model: {m_name}")
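
Note that a single `list_endpoints` call returns only one page of results (10 endpoints by default), so the listing above can miss tagged endpoints in accounts with many endpoints. A minimal sketch of following the `NextToken` pagination marker is shown below. The helper name `iter_all_endpoints` is hypothetical, and `client` is assumed to behave like `boto3.client("sagemaker")`.

```python
def iter_all_endpoints(client, page_size=100):
    """Yield every endpoint summary, following NextToken across pages."""
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_endpoints(**kwargs)
        yield from page.get("Endpoints", [])
        next_token = page.get("NextToken")
        if not next_token:
            break
        # Request the next page on the following iteration
        kwargs["NextToken"] = next_token
```

In `list_workshop_endpoints`, the loop could iterate over `iter_all_endpoints(sm_client)` instead of the first page only.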

In [22]:
describe_workshop_endpoints()

endpoint 0 (async): huggingface-pytorch-tgi-inference-2023-10-31-08-46-21-240 (ml.g5.2xlarge) -> model: huggingface-pytorch-tgi-inference-2023-10-31-08-46-20-523
endpoint 1 (real-time): hf-llm-falcon-7b-bf16-2023-10-31-08-37-14-377 (ml.g5.2xlarge) -> model: hf-llm-falcon-7b-bf16-2023-10-31-08-37-13-274


In [None]:
# Use any of existing endpoints from the list above
if len(list_workshop_endpoints()) > 0:
    endpoint_name_to_use = list_workshop_endpoints()[0]
else:
    raise Exception(f"You don't have any SageMaker endpoints. You need to create one to run Gradio chatbot")

boto_session=boto3.session.Session()

In [None]:
# You cannot use asynchronous endpoint with Gradio, choose a real-time endpoint instead
if sm_client.describe_endpoint(EndpointName=endpoint_name_to_use).get('AsyncInferenceConfig'):
    raise Exception(f"You cannot use an async endpoint with Gradio, choose a real-time endpoint")
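
The selection cell above takes the first tagged endpoint, which may be an asynchronous one, in which case this check raises. As an alternative, the sketch below picks the first real-time endpoint from a list of names. The helper name `pick_realtime_endpoint` is hypothetical, and `client` is assumed to behave like `sm_client`.

```python
def pick_realtime_endpoint(client, endpoint_names):
    """Return the first endpoint name without an AsyncInferenceConfig, or None."""
    for name in endpoint_names:
        description = client.describe_endpoint(EndpointName=name)
        if not description.get("AsyncInferenceConfig"):
            return name
    return None
```

For example, `endpoint_name_to_use = pick_realtime_endpoint(sm_client, list_workshop_endpoints())` yields `None` when only asynchronous endpoints are deployed.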

The cell below creates a chat window where you can play with the model by asking questions.

In [None]:
# create gradio app
gradio_app.create_gradio_app(
    endpoint_name_to_use,
    session=boto_session,
    parameters=parameters, 
    system_prompt=system_prompt
)

## Use endpoints in workshop labs
<div style="border: 4px solid coral; text-align: center; margin: auto;">
If you need an endpoint name to use in a workshop lab, the following cell prints all deployed endpoints:
</div>

In [22]:
describe_workshop_endpoints()

endpoint 0 (async): huggingface-pytorch-tgi-inference-2023-10-31-08-46-21-240 (ml.g5.2xlarge) -> model: huggingface-pytorch-tgi-inference-2023-10-31-08-46-20-523


## Clean up
To avoid unexpected costs, you must delete the deployed endpoints after you have completed the workshop labs that use them.

In [16]:
from sagemaker.predictor import Predictor

def delete_by_endpoint_name(ep_name):
    print(f"The endpoint {ep_name} will be deleted!")
    print("Are you sure you want to delete this endpoint? (y/n)")
    
    if input() == 'y':
        print(f"Deleting {ep_name}")
        predictor = Predictor(endpoint_name=ep_name)
        predictor.delete_model()
        predictor.delete_endpoint()

In [29]:
for ep in list_workshop_endpoints():
    delete_by_endpoint_name(ep)

The endpoint huggingface-pytorch-tgi-inference-2023-11-10-19-08-26-786 will be deleted!
Are you sure you want to delete this endpoint? (y/n)


 y


Deleting huggingface-pytorch-tgi-inference-2023-11-10-19-08-26-786
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
The endpoint huggingface-pytorch-tgi-inference-2023-10-31-08-46-21-240 will be deleted!
Are you sure you want to delete this endpoint? (y/n)


 n


In [30]:
describe_workshop_endpoints()

endpoint 0 (async): huggingface-pytorch-tgi-inference-2023-10-31-08-46-21-240 (ml.g5.2xlarge) -> model: huggingface-pytorch-tgi-inference-2023-10-31-08-46-20-523
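
The cleanup cells above delete endpoints and their models. If you added autoscaling to the asynchronous endpoint, the scalable target and the CloudWatch alarm for the step scaling policy are separate resources. A minimal sketch for removing them is shown below. The helper name `remove_autoscaling` is hypothetical; `aa_client` and `cw_client` are assumed to behave like the `aa` and `cw` clients created earlier, and `alarm_names` must contain the alarm name that you passed to `put_metric_alarm`, which this notebook does not store in a variable.

```python
def remove_autoscaling(aa_client, cw_client, resource_id, alarm_names):
    """Deregister the scalable target and delete the related CloudWatch alarms."""
    # Deregistering the target also removes the scaling policies attached to it
    aa_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    # Alarms created manually with put_metric_alarm are deleted separately
    if alarm_names:
        cw_client.delete_alarms(AlarmNames=list(alarm_names))
```

The `resource_id` is the same `endpoint/<endpoint name>/variant/<variant name>` string that was used to register the scalable target.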


## Shutdown Kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>