# How to deploy Large Language Models (LLMs) to Amazon SageMaker using the new Hugging Face LLM DLC

This example shows how to deploy open-source LLMs, like [BLOOM](https://huggingface.co/bigscience/bloom), to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container. We will deploy [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct), an open-source chat LLM trained by TII.

The example covers:
1. [Setup development environment](#1-setup-development-environment)
2. [Retrieve the new Hugging Face LLM DLC](#2-retrieve-the-new-hugging-face-llm-dlc)
3. [Deploy Falcon to Amazon SageMaker](#3-deploy-realtime-endpoint-with-falcon-to-amazon-sagemaker)
4. [Run inference and chat with our model](#4-run-inference-and-chat-with-our-model)

## What is Hugging Face LLM Inference DLC?

Hugging Face LLM DLC is a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. 
Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:
* Tensor Parallelism and custom cuda kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)
* Logits warpers (temperature scaling, topk, repetition penalty ...)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)
* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala)
* [Starcoder](https://huggingface.co/bigcode/starcoder) / [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and Inference API for LLM models on the Hugging Face Hub. 

Let's get started!


## 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy Falcon to Amazon SageMaker. We need an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
!pip install sagemaker --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).


In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

## 2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).


In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

# print ecr image uri
print(f"llm image uri: {llm_image}")
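
It can be useful to check which account, region, repository, and tag a returned image URI points to, for example to confirm that the region matches the session region. The following is a minimal sketch that assumes the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The helper `parse_ecr_image_uri` and the example URI are hypothetical and are not part of the `sagemaker` SDK.

```python
import re

# Hypothetical URI for illustration only: account ID, repository, and tag are made up.
EXAMPLE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-tgi-inference:0.9.3-gpu"

# Assumed ECR image URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_image_uri(EXAMPLE_URI)
print(parts)
```

In the notebook, the same helper could be called on `llm_image` and its `region` compared with `sess.boto_region_name`.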

## 3. Deploy Realtime Endpoint with Falcon to Amazon SageMaker

To deploy [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. The code below defaults to `ml.g4dn.12xlarge` (4 NVIDIA T4 GPUs, 64 GB of GPU memory in total) with bitsandbytes quantization enabled. The commented-out alternative, `ml.g5.12xlarge`, has 4 NVIDIA A10G GPUs and 96GB of GPU memory.

_Note: We could also optimize the deployment for cost and use `g5.2xlarge` instance type and enable int-8 quantization._
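
A quick back-of-the-envelope estimate helps explain why quantization matters for the instance choice. The sketch below counts only the memory needed for the weights and ignores activations, the KV cache, and framework overhead, so real requirements are higher. The 40e9 parameter count is the nominal figure implied by the model name, and the 64 GB total for `ml.g4dn.12xlarge` is an assumption worth verifying against the AWS instance documentation.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


FALCON_40B_PARAMS = 40e9  # nominal parameter count implied by the model name

# Total GPU memory per instance. The 96 GB figure is taken from the text above;
# the 64 GB figure (4 x 16 GB) is an assumption to verify against AWS documentation.
INSTANCE_GPU_MEMORY_GB = {"ml.g5.12xlarge": 96, "ml.g4dn.12xlarge": 64}

estimates = {
    "float16": weight_memory_gb(FALCON_40B_PARAMS, 2),  # 2 bytes per parameter
    "int8": weight_memory_gb(FALCON_40B_PARAMS, 1),  # 1 byte per parameter
}

for instance, capacity in INSTANCE_GPU_MEMORY_GB.items():
    for dtype, needed in estimates.items():
        verdict = "fits" if needed < capacity else "does not fit"
        print(f"{instance}: {dtype} weights need ~{needed:.0f} GB of {capacity} GB -> {verdict}")
```

Under these assumptions, 16-bit weights do not fit on the smaller instance, which is consistent with enabling `HF_MODEL_QUANTIZE` when using `g4dn`.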

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# Define Model and Endpoint configuration parameter
hf_model_id = "tiiuae/falcon-40b-instruct"  # model id from huggingface.co/models
instance_type = "ml.g4dn.12xlarge" # "ml.g5.12xlarge"  # instance type to use for deployment
number_of_gpu = 4  # number of gpus to use for inference and tensor parallelism
health_check_timeout = 600  # Increase the timeout for the health check to 10 minutes for downloading the model

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": hf_model_id,
        "HF_MODEL_REVISION": "1e7fdcc9f45d13704f3826e99937917e007cd975",
        'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize when using g4dn, comment out when using g5 instances
        "SM_NUM_GPUS": json.dumps(number_of_gpu),
        "MAX_INPUT_LENGTH": json.dumps(1900),  # Max length of input text
        "MAX_TOTAL_TOKENS": json.dumps(2048),  # Max length of the generation (including input text)
    },
)
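
Container environment variables are passed as strings, which is why the numeric settings above go through `json.dumps`. The sketch below builds such a mapping and checks two conditions before deployment. The helper `build_tgi_env` is hypothetical, and the rule that `MAX_INPUT_LENGTH` must be smaller than `MAX_TOTAL_TOKENS` is an assumption about how TGI validates its settings.

```python
import json


def build_tgi_env(model_id: str, num_gpus: int, max_input_length: int, max_total_tokens: int) -> dict:
    """Return the container environment as a str -> str mapping."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Assumed constraint: the prompt budget must leave room for at least one new token.
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }


env = build_tgi_env("tiiuae/falcon-40b-instruct", 4, 1900, 2048)

# Tokens that remain for generation when the prompt uses the full input budget.
generation_budget = int(env["MAX_TOTAL_TOKENS"]) - int(env["MAX_INPUT_LENGTH"])

print(env)
print(f"tokens left for generation at maximum input length: {generation_budget}")
```

With the values used in this notebook, a prompt that fills the whole input budget leaves only a small number of tokens for the answer, so long prompts call for a small `max_new_tokens`.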

After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the instance type defined above. TGI will automatically distribute and shard the model across all GPUs.

In [None]:
import datetime
model_name = hf_model_id.split("/")[-1].replace(".", "-")
endpoint_name = model_name + "-realtime-" + instance_type.replace(".", "-").replace("ml","") + "-" + str(datetime.datetime.now().strftime("%y-%m-%d--%H-%M-%S"))
endpoint_name
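
Generated endpoint names can silently grow too long once model name, instance type, and timestamp are combined. The following sketch builds a name in the same spirit as the cell above and validates it before deployment. The limit of 63 characters and the allowed character pattern are assumptions about SageMaker's naming rules that should be checked against the API reference, and `make_endpoint_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed SageMaker constraints: at most 63 characters, alphanumerics separated by hyphens.
MAX_ENDPOINT_NAME_LENGTH = 63
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_endpoint_name(model_id: str, kind: str, instance_type: str, now: datetime) -> str:
    """Combine model name, endpoint kind, instance type, and timestamp into one name."""
    model_name = model_id.split("/")[-1].replace(".", "-")
    instance = instance_type.replace("ml.", "").replace(".", "-")
    stamp = now.strftime("%y-%m-%d--%H-%M-%S")
    name = f"{model_name}-{kind}-{instance}-{stamp}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"invalid endpoint name: {name!r}")
    return name


example_name = make_endpoint_name(
    "tiiuae/falcon-40b-instruct", "realtime", "ml.g4dn.12xlarge", datetime(2023, 9, 14, 7, 54, 50)
)
print(example_name, len(example_name))
```

Passing the timestamp in as an argument keeps the helper deterministic, which makes it easy to test.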

In [None]:
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name=endpoint_name,
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

## 4. Run inference and chat with our model

After our endpoint is deployed, we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. As of today, TGI supports the following parameters:
* `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
* `repetition_penalty`: Controls the likelihood of repetition, defaults to `null`.
* `seed`: The seed to use for random generation, default is `null`.
* `stop`: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.
* `top_k`: The number of highest probability vocabulary tokens to keep for top-k-filtering. Default value is `null`, which disables top-k-filtering.
* `top_p`: The cumulative probability threshold for nucleus sampling: only the smallest set of highest-probability vocabulary tokens whose probabilities add up to `top_p` is kept. Default value is `null`.
* `do_sample`: Whether or not to use sampling ; use greedy decoding otherwise. Default value is `false`.
* `best_of`: Generate `best_of` sequences and return the one with the highest token logprobs. Default value is `null`.
* `details`: Whether or not to return details about the generation. Default value is `false`.
* `return_full_text`: Whether or not to return the full text or only the generated part. Default value is `false`.
* `truncate`: Whether or not to truncate the input to the maximum length of the model. Default value is `true`.
* `typical_p`: The typical probability of a token. Default value is `null`.
* `watermark`: The watermark to use for the generation. Default value is `false`.

You can find the open api specification of the TGI in the [swagger documentation](https://huggingface.github.io/text-generation-inference/)
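
A misspelled parameter name is easy to miss when payloads are written by hand. The sketch below builds a payload in the shape used in this notebook (`inputs` plus `parameters`) and rejects parameter names that are not in the list above. The helper `build_payload` is hypothetical and performs no value validation.

```python
import json

# Parameter names copied from the list above.
SUPPORTED_PARAMETERS = {
    "temperature", "max_new_tokens", "repetition_penalty", "seed", "stop", "top_k",
    "top_p", "do_sample", "best_of", "details", "return_full_text", "truncate",
    "typical_p", "watermark",
}


def build_payload(prompt: str, **parameters) -> dict:
    """Build a request body and reject parameter names that are not listed above."""
    unknown = set(parameters) - SUPPORTED_PARAMETERS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    # Drop None values so that the server-side defaults apply.
    cleaned = {name: value for name, value in parameters.items() if value is not None}
    return {"inputs": prompt, "parameters": cleaned}


payload = build_payload(
    "Hello, how are you?",
    temperature=0.7,
    max_new_tokens=64,
    stop=["\nUser:"],
    top_k=None,
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed to `predict` in the same way as the hand-written payloads in the cells that follow.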

In [None]:
prompt = """System: You are a smart assistant designed to help high school teachers come up with reading comprehension questions.\nGiven a piece of text, you must come up with a question and answer pair that can be used to test a student\'s reading comprehension abilities.\nWhen coming up with this question/answer pair, you must respond in the following format:\n```\n{\n    "question": "$YOUR_QUESTION_HERE",\n    "answer": "$THE_ANSWER_HERE"\n}\n```\n\nEverything between the ``` must be valid json.\n\nHuman: Please come up with a question/answer pair, in the specified JSON format, for the following text:\n----------------\nWhat are Cookies?\n\n\nCookies are small text files which a website may install on your computer or mobile device when you first visit a site or page. A Cookie will help the website, or another website, to recognize your device the next time you visit. Web beacons or other similar files can also do the same thing. We use the term "Cookies" in this policy to refer to all files that collect information in this way.\n\n\nThere are many functions Cookies serve. For example, they can help us to remember your preferences, analyze how well our website is performing, or even allow us to recommend content we believe will be most relevant to you.\n\n\nCertain Cookies contain Personal Information. You can block Cookies by activating the setting on your browser that allows you to refuse the setting of all or some Cookies.\n\n\nHowever, if you use your browser settings to block all Cookies (including essential Cookies) you may not be able to access all or parts of our site."""

chat = llm.predict({"inputs": prompt, "parameters": {"temperature": 0.1}})

print(chat[0]["generated_text"])

In [None]:
from sagemaker.huggingface import HuggingFacePredictor

predictor = HuggingFacePredictor(endpoint_name="falcon-40b-instruct-realtime--g4dn-12xlarge-23-09-14--07-54-50")

In [None]:
chat = predictor.predict({"inputs": """Hello, how are you?"""})

print(chat[0]["generated_text"])

In [None]:
input_json = {
    "inputs": """
The following is a conversation between a highly knowledgeable and intelligent AI assistant, called AWSomeChat, and a human user, called User. In the following interactions, User and AWSomeChat will converse in natural language, and AWSomeChat will answer User's questions. AWSomeChat was built to be respectful, polite and inclusive. AWSomeChat was built by the Amazon Web Services in Zurich. AWSomeChat will never decline to answer a question, and always attempts to give an answer that User would be satisfied with. It knows a lot, and always tells the truth.  The conversation begins.

User: Please confirm that you will use OpenSearch Index for information retrieval.

AWSomeChat:
""",
    "parameters": {"temperature": 0.8},
}
chat = predictor.predict(input_json)

print(chat[0]["generated_text"])
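
For multi-turn conversations, the prompt in the cell above can be assembled from a system preamble and a list of turns instead of being written by hand. The following is a minimal sketch under that assumption. The helpers `build_chat_prompt` and `extract_reply` are hypothetical, and the `User:` / `AWSomeChat:` speaker labels simply mirror the example above rather than any format required by the model.

```python
def build_chat_prompt(system: str, turns: list, user: str = "User", assistant: str = "AWSomeChat") -> str:
    """Render a system preamble and (role, text) turns, ending with an open assistant turn."""
    lines = [system.strip(), ""]
    for role, text in turns:
        speaker = user if role == "user" else assistant
        lines.append(f"{speaker}: {text.strip()}")
        lines.append("")
    lines.append(f"{assistant}:")
    return "\n".join(lines)


def extract_reply(generated_text: str, prompt: str, user: str = "User") -> str:
    """Strip an echoed prompt and cut the reply where the model starts a new user turn."""
    reply = generated_text[len(prompt):] if generated_text.startswith(prompt) else generated_text
    return reply.split(f"\n{user}:")[0].strip()


turns = [("user", "What is Amazon SageMaker?")]
prompt = build_chat_prompt("The conversation begins.", turns)
print(prompt)

# Simulated model output; a real call would use predictor.predict(...) instead.
simulated_output = prompt + " A managed machine learning service.\nUser: thanks"
print(extract_reply(simulated_output, prompt))
```

Passing `"stop": ["\nUser:"]` in the request parameters would be a complementary way to keep the model from writing the user's next turn.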

## 5. Deploy Async Endpoint with Falcon to Amazon SageMaker

To deploy [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) to Amazon SageMaker as an asynchronous endpoint, we again create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. As before, the code below uses `ml.g4dn.12xlarge` (4 NVIDIA T4 GPUs, 64 GB of GPU memory in total) with bitsandbytes quantization enabled, and `ml.g5.12xlarge` is the commented-out alternative.

_Note: We could also optimize the deployment for cost and use `g5.2xlarge` instance type and enable int-8 quantization._

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join

# Define Model and Endpoint configuration parameter
hf_model_id = "tiiuae/falcon-40b-instruct" # model id from huggingface.co/models
instance_type = "ml.g4dn.12xlarge" # "ml.g5.12xlarge"  # instance type to use for deployment, g4dn typically ensures higher availability when restarting the endpoint
number_of_gpu = 4 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 600 # Increase the timeout for the health check to 10 minutes for downloading the model

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env={
    'HF_MODEL_ID': hf_model_id,
    'HF_MODEL_REVISION': "1e7fdcc9f45d13704f3826e99937917e007cd975",
    'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize when using g4dn, comment out when using g5 instances
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(1900),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  }
)

In [None]:
# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://",sagemaker_session_bucket,"async_inference/output") , # Where our results will be stored
    # notification_config={
            #   "SuccessTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            #   "ErrorTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
    # }, #  Notification configuration
)
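
Asynchronous results are written as objects under the configured S3 `output_path`, so reading them back by hand requires splitting an S3 URI into bucket and key. The sketch below shows this with the standard library only. The helper `split_s3_uri`, the bucket name, and the object name are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical output location for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/async_inference/output/example-result.out")
print(bucket, key)

# With AWS credentials available, the object could then be read with boto3, e.g.:
#   body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```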



After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the instance type defined above and pass the `async_inference_config`, which makes the endpoint asynchronous. TGI will automatically distribute and shard the model across all GPUs.

In [None]:
import datetime
model_name = hf_model_id.split("/")[-1].replace(".", "-")
endpoint_name = model_name + "-async-" + instance_type.replace(".", "-") + "--" + str(datetime.datetime.now().strftime("%y-%m-%d--%H-%M-%S"))
endpoint_name

In [None]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  endpoint_name=endpoint_name,
  async_inference_config=async_config # <<<<<--------------- ASYNC
)


## Autoscale (to Zero) the Asynchronous Inference Endpoint: Option 1 (recommended)
The example below is the first of two options for attaching an autoscaling policy to an asynchronous endpoint and is based on [this blog post](https://medium.com/@neethu.v.gopal/asynchronous-endpoints-for-stable-diffusion-in-aws-using-sagemaker-with-autoscaling-b0db4206648b).

In [None]:
# application-autoscaling client
asg_client = boto3.client("application-autoscaling")

# This is the format in which application autoscaling references the endpoint
resource_id = f"endpoint/{llm.endpoint_name}/variant/AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)


#Configure scaling policy to increase instance count from zero when new requests come
response = asg_client.put_scaling_policy(
    PolicyName = f'HasBacklogWithoutCapacity-ScalingPolicy-{llm.endpoint_name}',
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
        "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
        "Cooldown": 300, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
        "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
        [ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
          ]
    },    
)

cw_client = boto3.client('cloudwatch')
step_scaling_policy_arn = response['PolicyARN']

response = cw_client.put_metric_alarm(
    AlarmName=f'step_scaling_policy_alarm_name-{llm.endpoint_name}',
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods= 1,
    DatapointsToAlarm= 1,
    Threshold= 0.5,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[
        { 'Name':'EndpointName', 'Value':llm.endpoint_name },
    ],
    Period= 60,
    AlarmActions=[step_scaling_policy_arn]
)


#Configure scaling policy to decrease instance count to zero when there are no further requests to process
response_scalein = asg_client.put_scaling_policy(
    PolicyName = f'scaleinpolicy-{llm.endpoint_name}',
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
        "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
        "Cooldown": 300, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
        "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
        [ 
            {
              "MetricIntervalUpperBound": 0,
              "ScalingAdjustment": -1
            }
          ]
    },    
)


stepin_scaling_policy_arn = response_scalein['PolicyARN']

response = cw_client.put_metric_alarm(
    AlarmName=f'step_scale-in_policy-{llm.endpoint_name}',
    MetricName='ApproximateBacklogSizePerInstance',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods= 2,
    DatapointsToAlarm= 2,
    Threshold= 0.5,
    ComparisonOperator='LessThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[
        { 'Name':'EndpointName', 'Value':llm.endpoint_name },
    ],
    Period= 3600,
    AlarmActions=[stepin_scaling_policy_arn]
)
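
To get a feeling for how the two alarms above behave, their evaluation can be simulated offline. The sketch below is a deliberately simplified model of an "M out of N datapoints" alarm and is not CloudWatch's exact algorithm, since it ignores missing data and state transitions. The function `alarm_states` and the metric series are made up for illustration, while thresholds and evaluation settings are copied from the cell above.

```python
from operator import ge, le


def alarm_states(datapoints, threshold, comparison, evaluation_periods, datapoints_to_alarm):
    """Return, per datapoint, whether the simplified alarm would be in ALARM state."""
    states = []
    for i in range(len(datapoints)):
        # Look at the most recent `evaluation_periods` datapoints.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if comparison(value, threshold))
        states.append(len(window) == evaluation_periods and breaching >= datapoints_to_alarm)
    return states


# Scale-out alarm: HasBacklogWithoutCapacity >= 0.5 for 1 of 1 periods (60 s each).
scale_out = alarm_states([0, 0, 1, 1, 0], 0.5, ge, evaluation_periods=1, datapoints_to_alarm=1)

# Scale-in alarm: ApproximateBacklogSizePerInstance <= 0.5 for 2 of 2 periods (3600 s each).
scale_in = alarm_states([3.0, 0.4, 0.0, 0.0, 2.0], 0.5, le, evaluation_periods=2, datapoints_to_alarm=2)

print("scale-out:", scale_out)
print("scale-in: ", scale_in)
```

With a period of 3600 seconds and two evaluation periods, the scale-in alarm in this model needs roughly two hours of near-empty backlog before it fires, whereas the scale-out alarm can react after a single one-minute datapoint.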


## Autoscale (to Zero) the Asynchronous Inference Endpoint: Option 2 (alternative)
The example below is the second of two options for attaching an autoscaling policy to an asynchronous endpoint and is based on [this blog post](https://www.philschmid.de/sagemaker-huggingface-async-inference).

In [None]:
# application-autoscaling client
asg_client = boto3.client("application-autoscaling")

# This is the format in which application autoscaling references the endpoint
resource_id = f"endpoint/{llm.endpoint_name}/variant/AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

response = asg_client.put_scaling_policy(
    PolicyName=f'Request-ScalingPolicy-{llm.endpoint_name}',
    ServiceNamespace="sagemaker",  
    ResourceId=resource_id, 
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0, 
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance", # HasBacklogWithoutCapacity
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": llm.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300, # The amount of time, in seconds, after a scale-in activity completes before another scale-in activity can start.
        "ScaleOutCooldown": 60 # The amount of time, in seconds, to wait for a previous scale-out activity to take effect.
    },
)
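
The idea behind target tracking can be illustrated with simple arithmetic: capacity is adjusted so that the metric per instance moves towards the target value. The sketch below uses a proportional rule as an approximation and is not the exact algorithm used by Application Auto Scaling. The function `desired_capacity` is hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, min_capacity, max_capacity):
    """Proportional estimate of the instance count, clamped to the registered limits."""
    proposed = math.ceil(current * metric_value / target_value)
    return max(min_capacity, min(max_capacity, proposed))


# One instance with a backlog of 4 requests per instance and a target of 1.0:
# the rule asks for 4 instances, but MaxCapacity=1 caps the result.
print(desired_capacity(1, 4.0, 1.0, min_capacity=0, max_capacity=1))

# One instance and an empty backlog: the endpoint may scale in to zero.
print(desired_capacity(1, 0.0, 1.0, min_capacity=0, max_capacity=1))

# Zero instances: a purely proportional rule can never leave zero on its own.
print(desired_capacity(0, 10.0, 1.0, min_capacity=0, max_capacity=1))
```

The last case hints at why a separate signal such as `HasBacklogWithoutCapacity`, mentioned in the code comment above, is commonly used to scale out from zero instances.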


## Test the deployed endpoint

In [None]:
import time
from sagemaker.async_inference.waiter_config import WaiterConfig
from datetime import datetime

start = time.time()
print(datetime.fromtimestamp(start))

output_list=[]

# send 10 requests
for i in range(10):
  resp = llm.predict_async(data={"inputs": "it 's a charming and often affecting journey ."})
  output_list.append(resp)

# iterate over list of output paths and get results
results = []
for async_response in output_list:
    response = async_response.get_result(WaiterConfig(max_attempts=600))
    results.append(response)

print(f"Time taken: {time.time() - start}s")


## Delete the async inference endpoint & Autoscaling policy

In [None]:
# response = asg_client.deregister_scalable_target(
#     ServiceNamespace='sagemaker',
#     ResourceId=resource_id,
#     ScalableDimension='sagemaker:variant:DesiredInstanceCount'
# )
# llm.delete_model()
# llm.delete_endpoint()

## Warm up a cold endpoint

In [None]:
import json
import sagemaker
import boto3

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account. 
endpoint_name='falcon-40b-instruct-async-ml-g4dn-12xlarge--23-09-12--04-27-16'

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

deployed_llm = sagemaker.predictor.Predictor(endpoint_name, sagemaker_session=sess)
deployed_llm = sagemaker.predictor_async.AsyncPredictor(deployed_llm)
print(f"deployed model: {deployed_llm}")

## Test the endpoint warming up

In [None]:
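import time, json
from sagemaker.async_inference.waiter_config import WaiterConfig
from sagemaker.s3 import s3_path_join  # needed for input_path below when running this section on its own
import uuid
from datetime import datetime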

start = time.time()
print(datetime.fromtimestamp(start))

output_list=[]

# send 10 requests
for i in range(10):
    unser_data = {"inputs": "it 's a charming and often affecting journey ."}
    data = json.dumps( unser_data )
    # print(type(unser_data),unser_data,type(data),data)
    input_path = s3_path_join("s3://",sagemaker_session_bucket,"async_inference/input_folder/")+"/"+str(uuid.uuid4())
    print(input_path)
    resp = deployed_llm.predict_async(data=data, input_path=input_path)
    output_list.append(resp)

# iterate over list of output paths and get results
results = []
for async_response in output_list:
    response = async_response.get_result(WaiterConfig(max_attempts=5, delay=5))
    results.append(response)

print(f"Time taken: {time.time() - start}s")
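
The waiter settings decide how long the client keeps polling for a result: with `max_attempts=5` and `delay=5` the budget is about 25 seconds, which a cold endpoint that first has to scale out from zero will usually exceed. The sketch below shows the general polling pattern with the standard library only. The functions `wait_budget_seconds` and `poll` are hypothetical and do not reproduce the SDK's internal implementation.

```python
import time


def wait_budget_seconds(max_attempts: int, delay: float) -> float:
    """Upper bound on the time spent waiting between attempts."""
    return max_attempts * delay


def poll(check, max_attempts: int, delay: float, sleep=time.sleep):
    """Call `check` until it returns a non-None value or the attempts are used up."""
    for attempt in range(1, max_attempts + 1):
        result = check()
        if result is not None:
            return result, attempt
        if attempt < max_attempts:
            sleep(delay)
    raise TimeoutError(f"no result after {max_attempts} attempts")


# Simulated result check that succeeds on the third call.
calls = {"count": 0}


def fake_check():
    calls["count"] += 1
    return "done" if calls["count"] >= 3 else None


# A no-op sleep keeps the example fast; a real client would use time.sleep.
result, attempts = poll(fake_check, max_attempts=5, delay=5, sleep=lambda seconds: None)
print(result, attempts, wait_budget_seconds(5, 5))
```

If the warm-up requests time out, a larger `max_attempts` or `delay` gives the endpoint more time to start before the client gives up.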
