### Deploy Model for Inference with SageMaker

Install the sagemaker library (the SageMaker Python SDK).

In [2]:
!pip install sagemaker

Collecting sagemaker
  Downloading sagemaker-2.239.0-py3-none-any.whl.metadata (16 kB)
Collecting attrs<24,>=23.1.0 (from sagemaker)
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting cloudpickle>=2.2.1 (from sagemaker)
  Downloading cloudpickle-3.1.1-py3-none-any.whl.metadata (7.1 kB)
Collecting docker (from sagemaker)
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting fastapi (from sagemaker)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting google-pasta (from sagemaker)
  Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting importlib-metadata<7.0,>=1.4.0 (from sagemaker)
  Downloading importlib_metadata-6.11.0-py3-none-any.whl.metadata (4.9 kB)
Collecting omegaconf<=2.3,>=2.2 (from sagemaker)
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting pathos (from sagemaker)
  Downloading pathos-0.3.3-py3-none-any.whl.metadata (11 kB)
Collecting sagemaker-core<2.0.0,>=1.0.

Next, get a SageMaker session and an execution role that allow you to
create and manage SageMaker resources.

In [3]:
import os
import sagemaker

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print(f"sagemaker role arn: {role}")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


sagemaker role arn: arn:aws:iam::232087633204:role/EC2SageMakerIAMRole2
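`sagemaker.get_execution_role()` only returns a role when the current AWS identity is itself a role, as it is in the environment used here. When the same code runs somewhere else (for example with IAM user credentials), one common workaround is to look the role up by name through IAM. The sketch below illustrates that fallback. `resolve_role` and the role name in the usage comment are hypothetical, and the lookup functions are passed in as arguments so the logic can be tried without AWS credentials.

```python
def resolve_role(get_execution_role, get_role, fallback_role_name):
    """Return an execution role ARN, falling back to an IAM lookup by name.

    get_execution_role: callable such as sagemaker.get_execution_role
    get_role: callable such as boto3.client("iam").get_role
    fallback_role_name: role name to look up if the first call fails
                        (a placeholder chosen for this sketch)
    """
    try:
        return get_execution_role()
    except ValueError:
        # Raised by the SDK when the caller identity is not a role.
        return get_role(RoleName=fallback_role_name)["Role"]["Arn"]


# Intended use (requires AWS credentials; the role name is a placeholder):
#   import boto3, sagemaker
#   role = resolve_role(sagemaker.get_execution_role,
#                       boto3.client("iam").get_role,
#                       "my-sagemaker-execution-role")
```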


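We will use a Hugging Face Deep Learning Container (DLC) for Text
Generation Inference (TGI) built for AWS Neuron devices. Retrieve its
image URI: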

In [6]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface-neuronx",
  version="0.0.25"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")


llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04
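The printed URI packs several facts into one string: the AWS account that hosts the image, the region, the repository, and a tag that encodes the framework and Optimum Neuron versions. A minimal sketch for splitting it apart is shown below. `EcrImage` and `parse_ecr_uri` are hypothetical helpers, and the layout is inferred from the single URI printed above rather than from a documented contract.

```python
from typing import NamedTuple


class EcrImage(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


def parse_ecr_uri(uri: str) -> EcrImage:
    """Split <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>."""
    registry, _, path = uri.partition("/")
    repository, _, tag = path.rpartition(":")
    parts = registry.split(".")
    if len(parts) < 6 or parts[1:3] != ["dkr", "ecr"] or not repository:
        raise ValueError(f"not an ECR image URI: {uri}")
    return EcrImage(account=parts[0], region=parts[3],
                    repository=repository, tag=tag)


image = parse_ecr_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:"
    "2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04"
)
print(image.region)      # us-east-1
print(image.repository)  # huggingface-pytorch-tgi-inference
print(image.tag)         # 2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04
```

Checking the region this way can help catch a mismatch between the image and the region set in AWS_DEFAULT_REGION.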


Deploy the compiled model to a SageMaker real-time endpoint on AWS
Inferentia.

If you are deploying your own model, set HF_MODEL_ID to your model
repository (your Hugging Face username followed by the model name) and
set HUGGING_FACE_HUB_TOKEN to your Hugging Face access token.

In [None]:
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.24xlarge"
health_check_timeout=2400 # additional time to load the model
volume_size=512 # size in GB of the EBS volume

# Define model and endpoint configuration parameters
config = {
    "HF_MODEL_ID": "liorsadan/Mixtral-8x7B-Instruct-v0.1.25", # replace with your model id if you are using your own model
    "HF_NUM_CORES": "4", # number of neuron cores
    "HF_AUTO_CAST_TYPE": "fp16",  # dtype of the model
    "MAX_BATCH_SIZE": "1", # max batch size for the model
    "MAX_INPUT_LENGTH": "1000", # max number of input (prompt) tokens
    "MAX_TOTAL_TOKENS": "1024", # max number of input plus generated tokens
    "MESSAGES_API_ENABLED": "true", # Enable the messages API
    "HUGGING_FACE_HUB_TOKEN": "YOUR_HUGGINGFACE_TOKEN" # Add your Hugging Face token here
}


# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
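Pasting an access token into a notebook makes it easy to leak when the notebook is shared. One alternative is to read the token from an environment variable when building the container configuration. The sketch below assumes a hypothetical variable name, `HF_TOKEN`, and a hypothetical helper, `build_env`. It also checks that the input limit is smaller than the total token limit, which TGI expects at startup.

```python
import os


def build_env(model_id, max_input_length, max_total_tokens, token_var="HF_TOKEN"):
    """Build part of the container environment without hardcoding the token.

    token_var is a variable name chosen for this sketch, not an SDK setting.
    """
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    token = os.environ.get(token_var)
    if not token:
        raise RuntimeError(f"set {token_var} before building the config")
    return {
        "HF_MODEL_ID": model_id,
        # Container environment values are passed as strings.
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        "HUGGING_FACE_HUB_TOKEN": token,
    }


# Intended use: merge the result into the config dictionary, for example
#   config.update(build_env("liorsadan/Mixtral-8x7B-Instruct-v0.1.25", 1000, 1024))
```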

You are now ready to deploy the model to a SageMaker real-time inference
endpoint. SageMaker provisions the necessary compute instance, then
retrieves and launches the inference container, which downloads the
model artifacts from your Hugging Face repository, loads the model onto
the Inferentia devices, and starts serving inference requests. This
process can take several minutes.

In [9]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm_model._is_compiled_model = True # We precompiled the model

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  volume_size=volume_size
)

-------------------------------!
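By default `deploy()` blocks until the endpoint is ready, printing a dash per poll and a `!` on success, as in the output above. If the deployment is started with `wait=False`, or its progress is checked from another process, the endpoint status can be polled instead. The sketch below is one way to do that. `wait_until_in_service` is a hypothetical helper, and the describe function is passed in so the loop can be exercised without AWS access.

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name,
                          poll_seconds=30, timeout_seconds=2400,
                          sleep=time.sleep):
    """Poll the endpoint status until it is InService.

    describe_endpoint: callable such as sess.sagemaker_client.describe_endpoint
    Raises RuntimeError if the endpoint reports Failed, TimeoutError if the
    time budget is used up first.
    """
    waited = 0
    while True:
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Intended use (requires AWS credentials):
#   wait_until_in_service(sess.sagemaker_client.describe_endpoint, endpoint_name)
```

boto3 also ships a built-in waiter for this, `get_waiter("endpoint_in_service")` on the SageMaker client, which may be preferable when custom timeout handling is not needed.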

Next, run a simple test to check the endpoint. First, create the prompt
and the generation parameters. If you deployed your own model, replace
the username in the model field with your Hugging Face username.

In [25]:
# Prompt to generate
messages=[
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Create a program in python that: 1. Asks for the user's name, 2. Handles empty input, and 3. Returns a greeting" }
]

# Generation arguments
parameters = {
    "model": "liorsadan/Mixtral-8x7B-Instruct-v0.1.25", # replace the username with your own if you deployed your own model
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 800,
}
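The value of max_tokens interacts with the limits set at deployment time. As we understand TGI's request validation, the prompt tokens plus max_tokens must fit within MAX_TOTAL_TOKENS, and the prompt must also fit within MAX_INPUT_LENGTH. The arithmetic below is a rough sketch of the resulting prompt budget. `max_prompt_tokens` is a hypothetical helper, and the real count also includes tokens added by the chat template.

```python
def max_prompt_tokens(max_total_tokens: int, max_input_length: int,
                      max_new_tokens: int) -> int:
    """Largest prompt, in tokens, that still leaves room for max_new_tokens."""
    remaining = max_total_tokens - max_new_tokens
    if remaining <= 0:
        raise ValueError("max_tokens leaves no room for the prompt")
    return min(remaining, max_input_length)


# Values from this notebook: MAX_TOTAL_TOKENS=1024, MAX_INPUT_LENGTH=1000,
# max_tokens=800.
print(max_prompt_tokens(1024, 1000, 800))  # 224
```

With these settings a long prompt is likely to be rejected, so for longer inputs either lower max_tokens or redeploy with a larger MAX_TOTAL_TOKENS.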

Then send the prompt to the SageMaker real-time endpoint for inference.

In [30]:
chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

Here is a simple Python program that meets your requirements:

```python
# Ask for the user's name
name = input("Please enter your name: ")

# Handle empty input
while name == "":
    print("Sorry, you can't leave the name empty.")
    name = input("Please enter your name: ")

# Return a greeting
print("Hello, " + name + "! Nice to meet you.")
```

In this program, the user is asked to enter their name. If the user enters an empty string, the program will display a message asking them to enter a name again. This will continue until the user enters a name that is not an empty string. Once a valid name is entered, the program will display a greeting.</s>
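The response above ends with a literal `</s>`, the model's end-of-sequence marker, which downstream code usually does not want. A small sketch for pulling out the reply and trimming that marker follows. `extract_reply` is a hypothetical helper, and the response shape is the chat-completion style structure already used in the cell above.

```python
def extract_reply(response: dict, eos_token: str = "</s>") -> str:
    """Return the assistant text, without a trailing end-of-sequence marker."""
    content = response["choices"][0]["message"]["content"].strip()
    if content.endswith(eos_token):
        content = content[: -len(eos_token)].rstrip()
    return content


# A made-up response used only to show the shape being read.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": " Hello there.</s>"}}
    ]
}
print(extract_reply(sample))  # Hello there.
```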


In [28]:
# Ask for the user's name
name = input("Please enter your name: ")

# Handle empty input
while name == "":
    print("Sorry, you can't leave your name empty. Please try again.")
    name = input("Please enter your name: ")

# Return a greeting
print("Hello, " + name + "! Nice to meet you.")

Please enter your name:  


Sorry, you can't leave your name empty. Please try again.


Please enter your name:  Yair


Hello, Yair! Nice to meet you.


If you later want to connect to this inference endpoint from other
applications, first find the name of the inference endpoint:

In [12]:
endpoints = sess.sagemaker_client.list_endpoints()

for endpoint in endpoints['Endpoints']:
    print(endpoint['EndpointName'])

huggingface-pytorch-tgi-inference-ml-in-2025-02-03-23-29-21-549
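In an account with many endpoints, a single `list_endpoints` call returns only one page of results. The sketch below filters by name and status on the server side and follows the pagination token. `list_endpoint_names` is a hypothetical helper, and the client is passed in so that it can be replaced with a stub.

```python
def list_endpoint_names(client, name_contains=None, status="InService"):
    """Collect endpoint names across all result pages.

    client: a SageMaker client such as sess.sagemaker_client
    """
    kwargs = {"StatusEquals": status}
    if name_contains:
        kwargs["NameContains"] = name_contains
    names = []
    while True:
        page = client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in page["Endpoints"])
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token


# Intended use (requires AWS credentials):
#   list_endpoint_names(sess.sagemaker_client,
#                       name_contains="huggingface-pytorch-tgi")
```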


Alternatively, use the AWS console > SageMaker > Inference >
Endpoints to see a list of all the SageMaker endpoints deployed in your
account.

Use this endpoint name to update the following code, which can be run
from other locations.

In [29]:
from sagemaker.huggingface import HuggingFacePredictor

endpoint_name="huggingface-pytorch-tgi-inference-ml-in-2025-02-03-23-29-21-549"

llm = HuggingFacePredictor(
  endpoint_name=endpoint_name,
  sagemaker_session=sess
)
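Applications that should not depend on the SageMaker Python SDK can call the same endpoint through the boto3 `sagemaker-runtime` client. The sketch below sends the same Messages API payload with `invoke_endpoint`. The helper name `chat` is hypothetical, and the client is passed in so the function can be tried with a stub.

```python
import json


def chat(runtime_client, endpoint_name, messages, **parameters):
    """Send a Messages API request and return the decoded JSON response.

    runtime_client: e.g. boto3.client("sagemaker-runtime", region_name="us-east-1")
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"messages": messages, **parameters}),
    )
    # The body is a stream of bytes containing the JSON document.
    return json.loads(response["Body"].read())


# Intended use (requires AWS credentials):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
#   reply = chat(runtime, endpoint_name, messages, max_tokens=800)
#   print(reply["choices"][0]["message"]["content"])
```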

### Cleanup

Make sure to delete the model and the endpoint to stop paying for the
provisioned resources.


In [31]:
llm.delete_model()
llm.delete_endpoint()
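The endpoint's instance is what accrues charges, so it is worth making sure the endpoint is deleted even if removing the model fails. A minimal sketch, with a hypothetical `cleanup` helper wrapping the same two predictor calls used above:

```python
def cleanup(predictor):
    """Delete the model, then the endpoint, even if model deletion fails."""
    try:
        predictor.delete_model()
    finally:
        # Runs whether or not delete_model raised; any error still propagates.
        predictor.delete_endpoint()


# Intended use:
#   cleanup(llm)
```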