# Hugging Face Large Model Inference

This notebook demonstrates how to deploy Hugging Face transformer models using the Hugging Face Text Generation Inference (TGI) Deep Learning Container on Amazon SageMaker.

TGI is an open-source, high-performance inference library that can be used to deploy large language models from Hugging Face's repository in minutes. The library includes advanced functionality like model parallelism and dynamic batching to simplify production inference with large language models like flan-t5-xxl, LLaMA, StableLM, and GPT-NeoX.

## Setup

### Install the SageMaker Python SDK

First, make sure that a recent version of the SageMaker Python SDK (2.163.0 or later) is installed.

In [4]:
%pip install "sagemaker>=2.163.0"

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com
[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


### Setup account and role

Then, we import the SageMaker Python SDK and instantiate a `sagemaker_session`, which we use to determine the current region. We also look up the execution role.

In [5]:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

## Retrieve the LLM Image URI

We use the helper function `get_huggingface_llm_image_uri()` to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference.

The function takes a required parameter, `backend`, and several optional parameters. `backend` selects the inference backend and accepts either "lmi" or "huggingface": "lmi" selects the SageMaker LMI inference backend, and "huggingface" selects the Hugging Face TGI inference backend.

In [6]:
image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", # or lmi
  region=region
)

## Create the Hugging Face Model

Next, we configure the `model` object by specifying a unique name, the `image_uri` for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including `HF_MODEL_ID`, which identifies the model from the Hugging Face Hub that will be deployed, and `HF_TASK`, which configures the inference task to be performed by the model.

You should also define `SM_NUM_GPUS`, which specifies the tensor parallelism degree of the model. Tensor parallelism splits the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Set `SM_NUM_GPUS` to the number of GPUs available on your selected instance type. For example, in this tutorial we set `SM_NUM_GPUS` to 4 because the instance type we deploy to, `ml.g5.24xlarge`, has 4 GPUs.

Additionally, we could reduce the memory footprint of the model by setting the `HF_MODEL_QUANTIZE` environment variable to true.
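
To make the sizing argument concrete, here is a rough back-of-the-envelope sketch of how much GPU memory the model weights alone need per device when sharded evenly. The helper `weights_gb_per_gpu` is hypothetical, and the inputs are assumptions: a nominal 40 billion parameters, 2 bytes per parameter for 16-bit weights (1 byte for 8-bit quantized weights), and 24 GB of memory per GPU. The estimate ignores the KV cache, activations, and runtime overhead, so treat it as a lower bound rather than a guarantee that the model fits.

```python
def weights_gb_per_gpu(num_params: float, bytes_per_param: float, num_gpus: int) -> float:
    """Approximate weight memory per GPU in GB, assuming an even tensor-parallel split."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 1e9


NOMINAL_PARAMS = 40e9    # assumption: nominal parameter count of a "40B" model
GPU_MEMORY_GB = 24       # assumption: nominal memory of one GPU on the instance
NUM_GPUS = 4             # matches SM_NUM_GPUS in this tutorial

fp16_per_gpu = weights_gb_per_gpu(NOMINAL_PARAMS, 2, NUM_GPUS)   # 16-bit weights
int8_per_gpu = weights_gb_per_gpu(NOMINAL_PARAMS, 1, NUM_GPUS)   # 8-bit quantized weights

print(f"16-bit weights: {fp16_per_gpu:.1f} GB per GPU (limit {GPU_MEMORY_GB} GB)")
print(f"8-bit weights:  {int8_per_gpu:.1f} GB per GPU (limit {GPU_MEMORY_GB} GB)")
```

Under these assumptions the 16-bit weights leave little headroom on each GPU, which is one reason quantization can be attractive.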

In [7]:
model_name = "falcon-40b-async-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'tiiuae/falcon-40b',
    'HF_TASK':'text-generation',
    'SM_NUM_GPUS':'4'
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)

In [8]:
s3_bucket=sagemaker_session.default_bucket()
bucket_prefix='falcon-async-inference'

In [9]:
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=f"s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=4,
    # Optionally specify Amazon SNS topics
    # notification_config = {
    # "SuccessTopic": "arn:aws:sns:<aws-region>:<account-id>:<topic-name>",
    # "ErrorTopic": "arn:aws:sns:<aws-region>:<account-id>:<topic-name>",
    # }
)

In [10]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
)

## Creating a SageMaker Endpoint

Next, we deploy the model by invoking the `deploy()` function.

To efficiently deploy and run large language models, it is important to choose an instance type that can handle the computational requirements. Here we use an `ml.g5.24xlarge` instance, which comes with 4 NVIDIA A10G GPUs. By setting the `SM_NUM_GPUS` environment variable to 4 when we created the model, we indicate that this model should be sharded across all 4 GPU devices. Because we pass `async_inference_config` to `deploy()`, the endpoint is created as an asynchronous inference endpoint.

Please refer to the [guide](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-choosing-instance-types.html) provided by Amazon SageMaker on instance type selection for large model inference.

In [15]:
# Generate a fresh timestamped name; it is used below as the endpoint name
model_name = "falcon-40b-async-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.24xlarge",
    endpoint_name=model_name,
    async_inference_config=async_config
)

----------------!

In [16]:
import boto3
client = boto3.client(
    "application-autoscaling"
)  # Common class representing Application Auto Scaling for SageMaker amongst other services

resource_id = (
    "endpoint/" + model_name + "/variant/" + "AllTraffic"
)  # This is the format in which application autoscaling references the endpoint

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Resource ID of the endpoint variant
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # The target value for the metric; here the metric is ApproximateBacklogSizePerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": model_name}],
            "Statistic": "Average",
        },
        # Cooldown periods prevent the policy from adding or removing instances
        # before the effects of previous scaling activities are visible.
        # Choose them based on your instance startup time or other application needs.
        # ScaleInCooldown - seconds after a scale-in activity completes before another scale-in activity can start.
        "ScaleInCooldown": 600,
        # ScaleOutCooldown - seconds after a scale-out activity completes before another scale-out activity can start.
        "ScaleOutCooldown": 300,
        # "DisableScaleIn": True|False - indicates whether scale-in by the target tracking policy is disabled.
        # If the value is True, scale-in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    },
)
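
As a rough mental model for the policy above, the sketch below estimates how many instances a target of 5.0 queued requests per instance would call for at a given backlog, clamped to the registered minimum and maximum capacity. This is a deliberate simplification and not the actual Application Auto Scaling algorithm, which acts on CloudWatch alarms and respects the cooldown periods. The helpers `build_resource_id` and `estimate_desired_instances` are hypothetical names introduced only for illustration.

```python
import math


def build_resource_id(endpoint_name: str, variant_name: str = "AllTraffic") -> str:
    """Format the resource ID the way the cell above does for an endpoint variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def estimate_desired_instances(
    total_backlog: int,
    target_per_instance: float = 5.0,
    min_capacity: int = 0,
    max_capacity: int = 5,
) -> int:
    """Simplified estimate: enough instances to keep backlog per instance at or below the target."""
    if total_backlog <= 0:
        return min_capacity
    needed = math.ceil(total_backlog / target_per_instance)
    # Clamp to the capacity range registered for the scalable target
    return max(min_capacity, min(max_capacity, needed))


for backlog in (0, 12, 100):
    print(backlog, "queued requests ->", estimate_desired_instances(backlog), "instance(s)")
```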



## Running Inference

Once the endpoint is up and running, we can evaluate the model using the `predict()` function.

In [107]:
input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}
predictor.predict(input_data)

[{'generated_text': "The diamondback terrapin was the first reptile to be officially designated as a Maryland State Reptile.\nThe diamondback terrapin is Maryland's State reptile.\nThe diamondback terrapin is Maryland's State Reptile.\nAs the diamondback terrapin was adopted as the state reptile, it is a protected species.\nThe diamondback terrapin was adopted as Maryland's State Reptile.\nThe diamondback terrapin was adopted as the state reptile.\nThe Maryland"}]
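
In the output above, `generated_text` begins with the prompt itself. If only the continuation is needed, a small helper can remove the prompt prefix. This is a minimal sketch: `strip_prompt` is a hypothetical helper, and it assumes the response has the list-of-dicts shape shown above.

```python
def strip_prompt(response, prompt):
    """Return the generated text without the leading prompt, if the prompt is present."""
    text = response[0]["generated_text"]
    if text.startswith(prompt):
        return text[len(prompt):]
    return text


# Example with a shortened response in the same shape as the output above
example_prompt = "The diamondback terrapin was the first reptile to"
example_response = [{"generated_text": example_prompt + " be officially designated as a Maryland State Reptile."}]
continuation = strip_prompt(example_response, example_prompt)
print(continuation)
```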

In [19]:
import os


def upload_file(input_location):
    prefix = f"{bucket_prefix}/input"
    return sagemaker_session.upload_data(
        input_location,
        bucket=sagemaker_session.default_bucket(),
        key_prefix=prefix,
        extra_args={"ContentType": "text/json"},
    )

In [None]:
import boto3
smr_client = boto3.client("sagemaker-runtime")

In [135]:
%%writefile ./async_endpoint_input.jsonl
{"inputs": "What is the purpose of life?", "parameters": {"do_sample": true, "temperature": 0.7,"min_length": 100,"max_length": 150}}

Overwriting ./async_endpoint_input.jsonl
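
Instead of writing the JSON by hand, the input file can also be generated from a Python dictionary, which avoids mistakes such as writing Python's `True` where JSON expects `true`. The following is a minimal sketch using only the standard library. `build_payload` and `write_payload` are hypothetical helpers, and the file is written to a temporary directory so the sketch does not overwrite the file created above.

```python
import json
import os
import tempfile


def build_payload(prompt, **parameters):
    """Return a request body in the {"inputs": ..., "parameters": ...} shape used above."""
    return {"inputs": prompt, "parameters": parameters}


def write_payload(path, payload):
    """Serialize the payload as a single JSON line and write it to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")


payload = build_payload("What is the purpose of life?", do_sample=True, temperature=0.7)

# Temporary location chosen for illustration only
example_path = os.path.join(tempfile.mkdtemp(), "async_endpoint_input.jsonl")
write_payload(example_path, payload)
```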


In [136]:
input_1_location = "./async_endpoint_input.jsonl"
input_1_s3_location = upload_file(input_1_location)

In [137]:
response_model = smr_client.invoke_endpoint_async(
    EndpointName=model_name,  # name of the endpoint deployed above
    InputLocation=input_1_s3_location,
    Accept='application/json',
    ContentType="application/json"
)

output_location = response_model["OutputLocation"]
print(f"OutputLocation: {output_location}")

OutputLocation: s3://sagemaker-us-west-2-461312420708/falcon-async-inference/output/a19a1fe2-4ce1-43b8-843c-a998f9b6c0e4.out


In [44]:
import time
import urllib.parse

from botocore.exceptions import ClientError


def get_output(output_location):
    # Split the S3 URI (s3://bucket/key) into bucket and key
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]
    print("key", key)
    # Poll until the asynchronous result has been written to S3
    while True:
        try:
            return sagemaker_session.read_s3_file(bucket=bucket, key_prefix=key)
        except ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print("waiting for output...")
                time.sleep(2)
                continue
            raise

In [138]:
output = get_output(output_location)
print(f"Output: {output}")

key falcon-async-inference/output/a19a1fe2-4ce1-43b8-843c-a998f9b6c0e4.out
waiting for output...
Output: [{"generated_text":"What is the purpose of life?\nI was wondering what is the purpose of life? I know that we are all here to learn"}]
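
The `get_output` helper above polls indefinitely. A bounded variant can be sketched with the standard library alone, as below. `parse_s3_uri` and `poll_until_ready` are hypothetical helpers, and `fetch` stands in for any callable that returns `None` while the result is not yet available (for example, a wrapper around the S3 read that catches `NoSuchKey`).

```python
import time
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def poll_until_ready(fetch, timeout_s=60.0, interval_s=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a non-None value or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        sleep(interval_s)


bucket, key = parse_s3_uri(
    "s3://sagemaker-us-west-2-461312420708/falcon-async-inference/output/a19a1fe2-4ce1-43b8-843c-a998f9b6c0e4.out"
)
print(bucket, key)
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the polling loop easy to exercise without real waiting.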

## Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.

In [None]:
# predictor.delete_model()
# predictor.delete_endpoint()

## Conclusion

In this tutorial, we used a TGI container to deploy Falcon-40B to an asynchronous inference endpoint, sharded across the 4 GPUs of a SageMaker `ml.g5.24xlarge` instance. With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like GPT-NeoX, flan-t5-xxl, and LLaMA.