### [Deploy LLM on Amazon SageMaker](https://huggingface.co/docs/sagemaker/en/inference)

In [40]:
!pip3 install sagemaker --upgrade --quiet

### Authenticate

In [41]:
import json
import sagemaker
import boto3

iam = boto3.client('iam')
role = iam.get_role(RoleName='sagemaker-dlc-demo')['Role']['Arn']

In [42]:
role

'arn:aws:iam::754289655784:role/sagemaker-dlc-demo'

### Select image URI

In [43]:
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.7.0-tgi3.3.6-gpu-py311-cu124-ubuntu22.04-v1.0"
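The URI above is pinned to `us-east-1` and to one specific image tag. As a minimal sketch, the snippet below splits the URI into its parts so the region and tag can be inspected or compared against the region the session uses. The helper name `parse_ecr_image_uri` is hypothetical and not part of the SageMaker SDK; it only assumes the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Hypothetical helper: split an ECR image URI into its named parts."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


llm_image = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.7.0-tgi3.3.6-gpu-py311-cu124-ubuntu22.04-v1.0"
)
parts = parse_ecr_image_uri(llm_image)
print(parts["region"], parts["repository"], parts["tag"])
```

If the parsed region differs from the region of the boto3 session, that mismatch is worth resolving before deploying.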

### Configure instance and model

In [44]:
from sagemaker.huggingface import HuggingFaceModel

# SageMaker endpoint settings
instance_type = "ml.g6.16xlarge"
health_check_timeout = 900  # seconds allowed for the container to start and pass health checks


# Model and container configuration, passed to the container as environment variables
config = {
  'HF_MODEL_ID': "Qwen/Qwen3-4B-Thinking-2507", # model_id from hf.co/models
  'SM_NUM_GPUS': "1", # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': "2048",  # Maximum number of input tokens per request
  #'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"  # only needed for gated or private models
}


# Create the HuggingFaceModel with the image URI
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
  name="qwen3-4b-thinking-2507-demo-endpoint"  # this is the SageMaker model name
)
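The values in `config` are written as quoted strings because they reach the container as environment variables. As a rough sketch, a small helper can normalize a settings dictionary so that numeric values are converted and unset entries are dropped. The name `to_container_env` is hypothetical and only illustrates the idea.

```python
def to_container_env(settings: dict) -> dict:
    """Hypothetical helper: copy settings with every value as a string, skipping None."""
    env = {}
    for key, value in settings.items():
        if value is None:
            continue  # leave unset options (e.g. a missing token) out entirely
        env[key] = str(value)
    return env


config = to_container_env({
    "HF_MODEL_ID": "Qwen/Qwen3-4B-Thinking-2507",
    "SM_NUM_GPUS": 1,
    "MAX_INPUT_LENGTH": 2048,
    "HUGGING_FACE_HUB_TOKEN": None,  # not set in this sketch
})
print(config)
```

The resulting dictionary has the same keys and string values as the hand-written `config` above, without the token entry.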

In [45]:
llm_model

<sagemaker.huggingface.model.HuggingFaceModel at 0x135a89950>

### Deploy

In [28]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

Using already existing model: qwen3-4b-thinking-2507-demo-endpoint


----------!

### Run inference

In [46]:
# Prompt to generate
messages = [
    {"role": "user", "content": "Give me a short introduction to large language model"}
]

# Generation arguments
parameters = {
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 128,
}


In [47]:
chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

Okay, the user asked for a short introduction to large language models. Let me start by recalling what I know. LLMs are a big topic in AI, so I need to keep it concise but informative. They want it short, so I shouldn't go into too much detail.

First, I should define what an LLM is. Maybe mention they're AI models trained on massive text data. Highlight key features like understanding and generating human-like text. The user might be a beginner, so avoid jargon where possible. Terms like "neural networks" are okay but should be explained simply.

I remember that LLMs like
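The text printed above is the model's reasoning, and it stops mid-sentence because generation was capped at `max_tokens` of 128. Assuming the Thinking variant closes its reasoning with a `</think>` tag before the final answer (as the model card describes), a minimal sketch for separating the two could look like this. The helper `split_thinking` and both sample strings are hypothetical; they are not real endpoint output.

```python
from typing import Optional, Tuple


def split_thinking(content: str, close_tag: str = "</think>") -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split content into (reasoning, answer).

    answer is None when close_tag never appears, e.g. when the
    response was cut off by the token limit before the model finished thinking.
    """
    reasoning, found, answer = content.partition(close_tag)
    # Drop an opening tag if one is present; it may be absent.
    reasoning = reasoning.strip().removeprefix("<think>").strip()
    if not found:
        return reasoning, None
    return reasoning, answer.strip()


# Illustrative strings only, not real endpoint output.
complete = "The user wants a short intro.\n</think>\n\nLarge language models are ..."
truncated = "Okay, the user asked for a short introduction"

print(split_thinking(complete))
print(split_thinking(truncated))
```

When the second element is `None`, raising `max_tokens` in `parameters` is the likely way to give the model enough room to finish reasoning and produce an answer.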


### Clean up

In [48]:
# Delete the model and the endpoint to stop incurring charges
llm.delete_model()
llm.delete_endpoint()