# SageMaker AI - Deploy Hugging Face LLMs to an Endpoint
* Notebook by Adam Lang
* Date: 3/26/2025

# Overview
* This notebook shows how to deploy two Hugging Face models to SageMaker endpoints and run inference on them: a small encoder transformer (DistilBERT) and a decoder-only LLM (Falcon).

# Install Dependencies

In [16]:
%%capture
!pip install -U sagemaker 

# Sagemaker Setup

In [None]:
import sagemaker 
import boto3

## init sagemaker session
sess = sagemaker.Session()

## sagemaker bucket session -- used for uploading data, models, logs
## sagemaker automatically creates this bucket if it doesn't exist
sagemaker_session_bucket=None 
if sagemaker_session_bucket is None and sess is not None:
    # set default bucket if bucket name not given 
    sagemaker_session_bucket = sess.default_bucket()

## sagemaker Role management
try:
    role=sagemaker.get_execution_role()
except ValueError: 
    iam=boto3.client("iam") 
    role=iam.get_role(RoleName="sagemaker_execution_role")['Role']['Arn']


## init session with default bucket
session=sagemaker.Session(default_bucket=sagemaker_session_bucket)

## print arn role & region_name
print(f"Sagemaker role arn: {role}")
print(f"Sagemaker session region: {sess.boto_region_name}")
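
The role printed above is an IAM role ARN, and a malformed value tends to surface only later, at deployment time. As a minimal sketch (not part of the SageMaker SDK), the helper below checks the general ARN shape before going further. The function name `parse_role_arn` and the example ARN are hypothetical and only for illustration.

```python
import re

# General shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account id>:role/<optional path and role name>
ROLE_ARN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into partition, account id, and role name (with path)."""
    match = ROLE_ARN.match(arn)
    if match is None:
        raise ValueError(f"Not an IAM role ARN: {arn!r}")
    partition, account_id, role_name = match.groups()
    return {"partition": partition, "account_id": account_id, "role_name": role_name}


# Hypothetical ARN for illustration only; substitute the value printed by the cell above.
example_arn = "arn:aws:iam::123456789012:role/sagemaker_execution_role"
print(parse_role_arn(example_arn))
```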

# Load a Hugging Face Model
* Model loaded: `distilbert/distilbert-base-uncased-distilled-squad`
* Model card: https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad

In [11]:
import torch  
import transformers

In [12]:
## check versions of torch and transformers
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")

PyTorch version: 2.3.0+cu121
Transformers version: 4.48.0


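Pin `transformers` and `torch` to the versions that will be passed to `HuggingFaceModel` below (4.48.0 and 2.3.0), so the local environment matches the versions requested for the endpoint.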

In [13]:
%%capture 
!pip install transformers==4.48.0

In [14]:
%%capture 
!pip install torch==2.3.0

In [15]:
## Again check versions of torch and transformers
## (already-imported modules keep reporting the old version until the kernel is restarted)
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")

PyTorch version: 2.3.0+cu121
Transformers version: 4.48.0


In [16]:
## check python version
!python --version

Python 3.11.11


In [21]:
from sagemaker.huggingface.model import HuggingFaceModel

## HF hub model config
hf_hub = {
    "HF_MODEL_ID": 'distilbert/distilbert-base-uncased-distilled-squad', ## HF model id
    "HF_TASK": "question-answering" ## input prediction task 
} 
## create Hugging Face Model Class 
huggingface_model = HuggingFaceModel(
    env=hf_hub,                    ## HF hub model config
    role=role,                     ## IAM role permissions in AWS Sagemaker
    transformers_version="4.48.0", ## transformers version to use
    pytorch_version="2.3.0",       ## pytorch version to use
    py_version="py311",            ## python version to use

)
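
The version strings passed to `HuggingFaceModel` are plain release numbers, while the notebook output reports PyTorch as `2.3.0+cu121` and Python as `3.11.11`. As a small sketch, the helpers below derive the strings used above from those outputs. The function names `base_version` and `py_version_tag` are hypothetical conveniences, not SDK functions.

```python
import sys


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def py_version_tag(major: int, minor: int) -> str:
    """Format a Python version the way the notebook passes it, e.g. 'py311'."""
    return f"py{major}{minor}"


# Values taken from the cell outputs above.
print(base_version("2.3.0+cu121"))  # 2.3.0
print(py_version_tag(3, 11))        # py311

# Tag for whichever interpreter is running this cell.
print(py_version_tag(sys.version_info.major, sys.version_info.minor))
```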

# Deploy Model

In [22]:
## deploy model for Sagemaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge", ## general-purpose CPU instance
)


--------!

# Inference with Deployed Model

In [24]:
## test request #1  
data = {
    "inputs": {
        "question": "What is used for inference?",
        "context": "My name is Joe and I work in a button factory. This model is used with sagemaker for inference."
    }
}
## make request 
predictor.predict(data)

{'score': 0.998674750328064, 'start': 71, 'end': 80, 'answer': 'sagemaker'}

In [25]:
## test request #2 
data = {
    "inputs": {
        "question": "What does Tom like?",
        "context": "My name is Tom and I play NFL football for the New England Patriots."
    }
}
## make request 
predictor.predict(data)

{'score': 0.2989235520362854, 'start': 26, 'end': 38, 'answer': 'NFL football'}
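
In both responses, `start` and `end` are character offsets into the `context` string, with `end` exclusive, so slicing the context reproduces the `answer` field. The sketch below checks this against the two outputs shown above; the helper name `answer_span` is hypothetical.

```python
def answer_span(context: str, prediction: dict) -> str:
    """Return the slice of the context that the prediction's offsets point at."""
    return context[prediction["start"]:prediction["end"]]


# Contexts and predictions copied from the two test requests above.
context_1 = (
    "My name is Joe and I work in a button factory. "
    "This model is used with sagemaker for inference."
)
prediction_1 = {"score": 0.998674750328064, "start": 71, "end": 80, "answer": "sagemaker"}

context_2 = "My name is Tom and I play NFL football for the New England Patriots."
prediction_2 = {"score": 0.2989235520362854, "start": 26, "end": 38, "answer": "NFL football"}

print(answer_span(context_1, prediction_1))  # sagemaker
print(answer_span(context_2, prediction_2))  # NFL football
```

The lower score on the second request fits the question: the context never says what Tom likes, so the model returns its best available span with low confidence.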

# Delete Endpoint
* docs: https://huggingface.co/docs/sagemaker/en/inference

In [26]:
# delete endpoint
predictor.delete_endpoint()
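
An endpoint keeps running, and incurring cost, until it is deleted, so it helps to make cleanup happen even when an inference call raises. The sketch below wraps a predictor in a context manager that calls `delete_model()` and then `delete_endpoint()` on exit. The `cleanup_on_exit` helper and the `StubPredictor` class are hypothetical; the stub only records calls so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_on_exit(predictor):
    """Yield the predictor, then always remove its model and its endpoint."""
    try:
        yield predictor
    finally:
        try:
            predictor.delete_model()
        finally:
            # Runs even if delete_model() fails, so the endpoint is not left behind.
            predictor.delete_endpoint()


class StubPredictor:
    """Hypothetical stand-in for a SageMaker predictor; it only records calls."""

    def __init__(self):
        self.calls = []

    def predict(self, data):
        self.calls.append("predict")
        return {"answer": "stub"}

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


stub = StubPredictor()
with cleanup_on_exit(stub) as p:
    p.predict({"inputs": {}})

print(stub.calls)  # ['predict', 'delete_model', 'delete_endpoint']
```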

# LLM Endpoint Deployment
* This is a more advanced SageMaker endpoint deployment than the example above: instead of a small encoder model, we deploy a large language model.
* The steps are mostly the same, with a few exceptions.

## Upgrade Sagemaker

In [27]:
%%capture
!pip install -U sagemaker 

## Sagemaker Setup

In [None]:
import sagemaker 
import boto3

## init sagemaker session
sess = sagemaker.Session()

## sagemaker bucket session -- used for uploading data, models, logs
## sagemaker automatically creates this bucket if it doesn't exist
sagemaker_session_bucket=None 
if sagemaker_session_bucket is None and sess is not None:
    # set default bucket if bucket name not given 
    sagemaker_session_bucket = sess.default_bucket()

## sagemaker Role management
try:
    role=sagemaker.get_execution_role()
except ValueError: 
    iam=boto3.client("iam") 
    role=iam.get_role(RoleName="sagemaker_execution_role")['Role']['Arn']


## init session with default bucket
session=sagemaker.Session(default_bucket=sagemaker_session_bucket)

## print arn role & region_name
print(f"Sagemaker role arn: {role}")
print(f"Sagemaker session region: {sess.boto_region_name}")

## Advanced Model Loading
* Unlike the regular Hugging Face model deployment above, we first need a container image URI, which we pass to the `HuggingFaceModel` class through its `image_uri` argument.
* To obtain the Hugging Face LLM Deep Learning Container in AWS SageMaker, we use the `get_huggingface_llm_image_uri` function from the SageMaker SDK.
* This function returns the URI of the Hugging Face LLM Deep Learning Container for the given backend, version, session, and region.
* In effect, the endpoint runs this container, which loads the model named in `HF_MODEL_ID` and serves inference requests for it.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

## get LLM image URI 
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2",
)

## print ECR image URI 
print(f"LLM image URI: {llm_image}") 
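
The printed value is an Amazon ECR image URI, which follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a rough sketch, the helper below splits such a URI into its parts, which makes it easy to confirm the region and container version at a glance. The `parse_ecr_uri` name and the example URI are hypothetical; the example is not the URI the SDK returns.

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {uri!r}")
    return match.groupdict()


# Hypothetical URI for illustration only; use the value printed by the cell above.
example_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-llm-inference:0.0.0-example"
parts = parse_ecr_uri(example_uri)
print(parts["region"], parts["repository"], parts["tag"])
```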

# Deploy an LLM in AWS Sagemaker
* To deploy a large language model such as `Falcon-40B-Instruct`, we first create the `HuggingFaceModel` class and define the endpoint configuration through the `HF_MODEL_ID` environment variable and the `instance_type`.
* For endpoint inference we use `ml.g5.12xlarge`, which has 4 NVIDIA A10G GPUs and so matches `SM_NUM_GPUS` = 4 in the config below. A single-GPU instance such as `ml.g5.2xlarge` cannot shard the model across 4 GPUs.
* This is the model we will deploy: `tiiuae/falcon-40b-instruct`
  * model card: https://huggingface.co/tiiuae/falcon-40b-instruct
  * This is a causal decoder-only model.

In [32]:
import json
from sagemaker.huggingface import HuggingFaceModel

## setup sagemaker config 
instance_type = "ml.g5.12xlarge" ## 4 x A10G GPUs; must provide at least number_of_gpu GPUs
number_of_gpu = 4 

# TGI config
config = {
    "HF_MODEL_ID": "tiiuae/falcon-40b-instruct", ## HF model id checkpoint 
    "SM_NUM_GPUS": json.dumps(number_of_gpu), ## number of GPU used per replica 
    "MAX_INPUT_LENGTH": json.dumps(1024), ## max len input to LLM 
    "MAX_TOTAL_TOKENS": json.dumps(2048), ## max total tokens per request (input + generated)
    # "HF_MODEL_QUANTIZE": "bitsandbytes", ## uncomment to quantize LLM 
}

## now create class HuggingFaceModel 
llm_model = HuggingFaceModel(
    role=role, ## IAM role 
    image_uri=llm_image, ## image uri from above 
    env=config, 
)
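
Before deploying, it can help to sanity-check the instance type against the TGI settings, since a mismatch only shows up after a long startup. The sketch below is a rough check under stated assumptions: the small `G5_GPUS` lookup table is hand-written (g5 instances use 24 GB A10G GPUs), the parameter count of about 40 billion is taken from the model name, and only 16-bit weights are counted, ignoring the KV cache and activations. The `check_tgi_config` helper is hypothetical.

```python
import json

# Hand-written lookup of GPU counts for a few g5 sizes (NVIDIA A10G, 24 GB each).
G5_GPUS = {"ml.g5.2xlarge": 1, "ml.g5.12xlarge": 4, "ml.g5.48xlarge": 8}
A10G_MEMORY_GB = 24


def check_tgi_config(instance_type, env, params_billions, bytes_per_param=2):
    """Return a list of human-readable problems; an empty list means none were found."""
    if instance_type not in G5_GPUS:
        raise ValueError(f"Instance type not in the lookup table: {instance_type}")
    problems = []

    gpus_available = G5_GPUS[instance_type]
    gpus_requested = json.loads(env["SM_NUM_GPUS"])
    if gpus_requested > gpus_available:
        problems.append(
            f"SM_NUM_GPUS={gpus_requested} but {instance_type} has {gpus_available} GPU(s)"
        )

    if json.loads(env["MAX_INPUT_LENGTH"]) >= json.loads(env["MAX_TOTAL_TOKENS"]):
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")

    # 1e9 parameters * bytes per parameter = size of the weights in GB.
    weights_gb = params_billions * bytes_per_param
    gpu_memory_gb = min(gpus_requested, gpus_available) * A10G_MEMORY_GB
    if weights_gb > gpu_memory_gb:
        problems.append(
            f"about {weights_gb} GB of weights will not fit in {gpu_memory_gb} GB of GPU memory"
        )
    return problems


# Same values as the TGI config above.
env = {
    "SM_NUM_GPUS": json.dumps(4),
    "MAX_INPUT_LENGTH": json.dumps(1024),
    "MAX_TOTAL_TOKENS": json.dumps(2048),
}
for candidate in ("ml.g5.2xlarge", "ml.g5.12xlarge"):
    print(candidate, check_tgi_config(candidate, env, params_billions=40))
```

With these assumptions, the single-GPU `ml.g5.2xlarge` fails both the GPU-count and the memory check, while `ml.g5.12xlarge` passes: roughly 80 GB of 16-bit weights against 96 GB of total GPU memory.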

In [None]:
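## Deploy model to an endpoint 
llm = llm_model.deploy(
    initial_instance_count=1, 
    instance_type=instance_type, 
    container_startup_health_check_timeout=600, ## seconds to wait for the container to load the model; 600 is an assumed value, tune for your setup
    # volume_size=400, ## if using instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
)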

# Inference with Deployed LLM

In [None]:
## define payload for LLM 
prompt = """You are a very helpful assistant named Falcon and you know a lot about AWS.

User: Can you tell me 3 important points about AWS Sagemaker?
Falcon:"""

## LLM generation parameters 
payload = { 
    "inputs": prompt,
    "parameters": { 
        "do_sample": True, 
        "top_p": 0.8,
        "temperature": 0.5,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.04, 
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]  ## stop at the next user turn or an end-of-text token
    }
}

## send request to endpoint
response = llm.predict(payload)

# response is usually a list of dicts with a 'generated_text' key; a single dict is handled too
if isinstance(response, dict) and 'generated_text' in response:
    print(f"Result: {response['generated_text']}")
elif isinstance(response, list):
    for seq in response:
        if isinstance(seq, dict) and 'generated_text' in seq:
            print(f"Result: {seq['generated_text']}")
        else:
            print(f"Unexpected sequence format: {seq}")
else:
    print(f"Unexpected response format: {response}")
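
The prompt above follows a simple `User:` / `Falcon:` turn format, and the `"\nUser:"` stop sequence relies on it to end generation before the model invents the next user turn. As a sketch, the helper below builds prompts in that same format, optionally with earlier turns. The `build_prompt` name and its signature are hypothetical, and the format is the one used in this notebook rather than an official Falcon chat template.

```python
def build_prompt(system, question, turns=(), assistant="Falcon"):
    """Build a prompt in the notebook's 'User:' / '<assistant>:' turn format.

    `turns` holds earlier (user_message, assistant_message) pairs, oldest first.
    """
    lines = [system, ""]
    for user_message, assistant_message in turns:
        lines.append(f"User: {user_message}")
        lines.append(f"{assistant}: {assistant_message}")
    lines.append(f"User: {question}")
    # End on the assistant label so the model continues with the assistant's answer.
    lines.append(f"{assistant}:")
    return "\n".join(lines)


system = "You are a very helpful assistant named Falcon and you know a lot about AWS."
prompt = build_prompt(system, "Can you tell me 3 important points about AWS Sagemaker?")
print(prompt)
```

For a follow-up question, pass the previous exchange through `turns` so the model sees the conversation so far.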