# Deploy Large Language Models (LLMs) to Amazon SageMaker using Hugging Face Text Generation Inference Container

This example shows how to deploy open-source LLMs to Amazon SageMaker for inference using your own build of the Hugging Face TGI container.

It demonstrates how to deploy a fine-tuned model from Amazon S3 to Amazon SageMaker.

If you want to learn more about the Hugging Face TGI container, check out the [Hugging Face TGI GitHub repository](https://github.com/huggingface/text-generation-inference/tree/main). Let's get started!



## 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy a fine-tuned Falcon LLM to Amazon SageMaker. We need an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
!pip install "sagemaker==2.163.0" "huggingface_hub" "hf-transfer" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).


In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

## 2. Locate fine-tuned model

Replace the Amazon S3 URI with the output from your Amazon SageMaker training job.

In [None]:
# safetensors to be used with sharded model
# https://discuss.huggingface.co/t/deploying-fine-tune-falcon-40b-with-qlora-on-sagemaker-inference-error/46841
# https://discuss.huggingface.co/t/bin-to-safetensors-without-publishing-it-on-hub/39956/2
s3_model_uri = "<Amazon S3 URI that contains the model.tar.gz of your fine-tuned model>/model.tar.gz"
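
Before deploying, it can help to sanity-check the URI. The following is a minimal sketch that is not part of the original notebook: `split_model_uri` is a hypothetical helper, and the bucket and key in the example are placeholders.

```python
from urllib.parse import urlparse


def split_model_uri(uri: str) -> tuple:
    """Split an s3://bucket/key URI and check it points at a model.tar.gz archive."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket/key URI, got {uri!r}")
    key = parsed.path.lstrip("/")
    if not key.endswith("model.tar.gz"):
        raise ValueError(f"expected the key to end with model.tar.gz, got {key!r}")
    return parsed.netloc, key


# placeholder URI for illustration only
bucket, key = split_model_uri("s3://my-bucket/training-job/output/model.tar.gz")
print(bucket, key)
```

A check like this catches a forgotten placeholder or a missing `s3://` prefix before an endpoint is created.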

## 3. Build latest version of Hugging Face Text Generation Inference Container for Amazon SageMaker

Version `0.8.2` of the Hugging Face TGI container image that is available for Amazon SageMaker in the public Amazon ECR repository did not work.
So we need to build our own Hugging Face TGI container image from a newer version in the [Hugging Face Text Generation Inference repository](https://github.com/huggingface/text-generation-inference).

In [None]:
tgi_version = "0.9.3"  # the version of Hugging Face TGI to build
aws_account_id = "123456789012"  # replace with your account id
aws_region = "eu-west-1"  # replace with your region
ecr_base_url = f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com"
target_ecr_repo = f"{ecr_base_url}/huggingface/text-generation-inference"  # your private Amazon ECR repo to push the container image to

In [None]:
!aws ecr get-login-password --region $aws_region | docker login --username AWS --password-stdin $ecr_base_url

In [None]:
!git clone -b v$tgi_version https://github.com/huggingface/text-generation-inference.git
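# each `!` command runs in its own subshell, so a `!cd` would not carry over;
# pass the cloned repository as the build context instead
!docker build -t $target_ecr_repo:$tgi_version --target sagemaker text-generation-inference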
!docker push $target_ecr_repo:$tgi_version
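
Outside a notebook, the same steps can be scripted with `subprocess`. This is a minimal sketch that passes the cloned directory to `docker build` as the build context explicitly. `build_commands` and `run_all` are hypothetical helpers, and the account ID and region in the example are the placeholder values from above.

```python
import subprocess

TGI_REPO = "https://github.com/huggingface/text-generation-inference.git"


def build_commands(tgi_version, target_ecr_repo, clone_dir="text-generation-inference"):
    """Return the clone, build and push commands as argument lists."""
    image = f"{target_ecr_repo}:{tgi_version}"
    return [
        ["git", "clone", "-b", f"v{tgi_version}", TGI_REPO, clone_dir],
        # the build context is passed explicitly, so no directory change is needed
        ["docker", "build", "-t", image, "--target", "sagemaker", clone_dir],
        ["docker", "push", image],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands(
    "0.9.3",
    "123456789012.dkr.ecr.eu-west-1.amazonaws.com/huggingface/text-generation-inference",
)
for cmd in commands:
    print(" ".join(cmd))
# call run_all(commands) to actually execute them
```

Building the commands separately from running them makes it easy to inspect what will be executed before starting a long image build.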

## 3. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. For the official Hugging Face LLM DLC, the `sagemaker` SDK provides the `get_huggingface_llm_image_uri` method, which returns the URI for the desired DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers). Because we built our own image, we instead define a small function with the same name that returns the URI of the image in our private Amazon ECR repository.


In [None]:
def get_huggingface_llm_image_uri(backend, version):
    return f"{target_ecr_repo}:{version}"

In [None]:
# from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface", version=tgi_version)

# print ecr image uri
print(f"llm image uri: {llm_image}")
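
To double-check the printed URI, for example to compare its region with the session region, it can be split into its parts. This is a minimal sketch: `parse_ecr_image_uri` is a hypothetical helper, and the pattern assumes a tagged image in a private ECR repository on the standard `amazonaws.com` domain.

```python
import re

ECR_IMAGE_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split a private ECR image URI into account, region, repository and tag."""
    match = ECR_IMAGE_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a tagged private ECR image URI: {uri!r}")
    return match.groupdict()


# placeholder account ID and region, as used earlier in this example
parts = parse_ecr_image_uri(
    "123456789012.dkr.ecr.eu-west-1.amazonaws.com/huggingface/text-generation-inference:0.9.3"
)
print(parts["region"], parts["tag"])
```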

## 4. Deploy fine-tuned model to Amazon SageMaker

To deploy your fine-tuned model to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration, including the `HF_MODEL_ID`, `instance_type`, etc. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "/opt/ml/model",  # path to where sagemaker stores the model
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPU used per replica
    "MAX_INPUT_LENGTH": json.dumps(1900),  # Max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(
        2048
    ),  # Max length of the generation (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # Uncomment to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role, image_uri=llm_image, model_data=s3_model_uri, env=config
)
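
Two properties of this configuration are easy to get wrong: container environment variables must be strings (hence the `json.dumps` calls), and TGI expects `MAX_INPUT_LENGTH` to be smaller than `MAX_TOTAL_TOKENS`. The following minimal sketch checks both. `check_tgi_config` is a hypothetical helper, not part of the `sagemaker` SDK.

```python
import json


def check_tgi_config(config):
    """Check the value types and the two length limits of a TGI env config."""
    for name, value in config.items():
        if not isinstance(value, str):
            raise TypeError(f"{name} must be a string, got {type(value).__name__}")
    max_input = int(config["MAX_INPUT_LENGTH"])
    max_total = int(config["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # tokens left for generation when the prompt uses the full input budget
    return max_total - max_input


example_config = {
    "HF_MODEL_ID": "/opt/ml/model",
    "SM_NUM_GPUS": json.dumps(1),
    "MAX_INPUT_LENGTH": json.dumps(1900),
    "MAX_TOTAL_TOKENS": json.dumps(2048),
}
print(check_tgi_config(example_config))  # 148
```

With these values, a prompt that uses all 1900 input tokens leaves 148 tokens for generation.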

After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. TGI shards the model across the number of GPUs given in `SM_NUM_GPUS`; with `number_of_gpu = 1` as set above, the model runs on a single GPU.

In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout,  # 300 seconds (5 minutes) to be able to load the model
    endpoint_name="huggingface-qlora-falcon-7b-instruct-finetuned",
)

## 5. Test the model and run inference

After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can pass different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. You can find a list of parameters in the [announcement blog post](https://huggingface.co/blog/sagemaker-huggingface-llm) or in the [swagger documentation](https://huggingface.github.io/text-generation-inference/).

Replace the prompt with one that is relevant for your model.

For a conversational model that answers coding questions, we can simply prompt it by asking our question:
  
```
<|system|>\n You are an Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>
```

Let's give it a first try and ask how to filter a list in Python:


In [None]:
query = "How can i filter a list of dictionaries?"

res = llm.predict(
    {
        "inputs": f"<|system|>\n You are an Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
    }
)

print(res[0]["generated_text"])
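
Since the same template is reused for every request, it can be wrapped in a small function. This is a minimal sketch: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, and the template simply mirrors the one shown above, so adapt it to the format your model was fine-tuned on.

```python
PROMPT_TEMPLATE = "<|system|>\n {system}<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"


def build_prompt(query, system="You are an Python Expert"):
    """Fill the chat template used in this example with a system message and a query."""
    return PROMPT_TEMPLATE.format(system=system, query=query)


example_prompt = build_prompt("How can i filter a list of dictionaries?")
print(example_prompt)
```

The resulting string can be passed as the `inputs` value of the payload.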

Now we will run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. The `stop` parameter can be used to have the model stop generating at the end of the assistant's turn.

In [None]:
# define payload
prompt = f"<|system|>\n You are an Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"

# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.95,
        "temperature": 0.2,
        "top_k": 50,
        "max_new_tokens": 256,
        "repetition_penalty": 1.03,
        "stop": ["<|end|>"],
    },
}

# send request to endpoint
response = llm.predict(payload)

# print(response[0]["generated_text"][:-len("<human>:")])
print(response[0]["generated_text"])
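
Depending on the container version and request parameters, the returned text may still contain the prompt or end with the stop sequence; this is an assumption to verify against your own endpoint. The following minimal sketch handles both cases. `clean_generation` is a hypothetical helper.

```python
def clean_generation(generated_text, prompt="", stop_sequences=("<|end|>",)):
    """Drop an echoed prompt and a trailing stop sequence from generated text."""
    text = generated_text
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    for stop in stop_sequences:
        # skip empty strings, which would otherwise match every text
        if stop and text.endswith(stop):
            text = text[: -len(stop)]
            break
    return text.strip()


raw = "Use a list comprehension.<|end|>"
print(clean_generation(raw))  # Use a list comprehension.
```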

Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now it's time for you to try it out yourself and build Generative AI applications with the new Hugging Face LLM DLC on Amazon SageMaker.

## 6. Clean up

To clean up, we can delete the model and endpoint.


In [None]:
llm.delete_model()
llm.delete_endpoint()