# Deploying an LLM to an Amazon SageMaker AI real-time endpoint

## Prerequisites

To use SageMaker AI endpoints in these examples, you first need to deploy a managed endpoint. In this example you will deploy one through SageMaker JumpStart, a feature that helps machine learning practitioners quickly get started with hundreds of production-ready models in SageMaker AI.

## Dependencies

<div class="alert alert-block alert-info">
⚠️ <b>Important:</b> Make sure you've run the <code>1-required-dependencies-strands.ipynb</code> notebook in this folder before proceeding. If you haven't, close this notebook, run that one first, then come back to this one.
</div>

## Create SageMaker Endpoint

In this notebook, we first create an endpoint config that defines the parameters for the endpoint, then create the endpoint from that config. Finally, we create an inference component that hosts the model on the endpoint and fields our requests. The component takes several minutes to become available.

**NOTE**: To deploy a SageMaker endpoint on a larger instance type, such as `ml.g5.48xlarge`, you may have to request a quota increase for your account. Please refer to the AWS documentation on how to [increase quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).
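
Before requesting an increase, it can help to check the quota currently applied to the account. The sketch below looks it up through the Service Quotas API. `find_endpoint_quota` is a hypothetical helper, and the quota name pattern `"<instance type> for endpoint usage"` is an assumption about how SageMaker hosting quotas are labeled, so confirm it in the Service Quotas console.

```python
def find_endpoint_quota(quotas_client, instance_type):
    """Return the applied quota value for hosting endpoints on `instance_type`, or None."""
    # Assumption: hosting quotas are named "<instance type> for endpoint usage".
    wanted = f"{instance_type} for endpoint usage"
    paginator = quotas_client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page["Quotas"]:
            if quota["QuotaName"] == wanted:
                return quota["Value"]
    # No quota with that name was listed for this account and region.
    return None


# Example usage (requires AWS credentials):
# import boto3
# quotas_client = boto3.client("service-quotas", region_name="us-west-2")
# print(find_endpoint_quota(quotas_client, "ml.g5.48xlarge"))
```

A result of `0` or `None` suggests that the deployment later in this notebook would be rejected until a quota increase is granted.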



### Run this cell to make sure the Strands Agent libraries are installed

In [1]:
%pip show strands-agents strands-agents-tools

Name: strands-agents
Version: 1.4.0
Summary: A model-driven approach to building AI agents in just a few lines of code
Home-page: https://github.com/strands-agents/sdk-python
Author: 
Author-email: AWS <opensource@amazon.com>
License: Apache-2.0
Location: /opt/conda/lib/python3.12/site-packages
Requires: boto3, botocore, docstring-parser, mcp, opentelemetry-api, opentelemetry-instrumentation-threading, opentelemetry-sdk, pydantic, typing-extensions, watchdog
Required-by: strands-agents-builder, strands-agents-tools
---
Name: strands-agents-tools
Version: 0.2.6
Summary: A collection of specialized tools for Strands Agents
Home-page: https://github.com/strands-agents/tools
Author: 
Author-email: AWS <opensource@amazon.com>
License: Apache-2.0
Location: /opt/conda/lib/python3.12/site-packages
Requires: aiohttp, aws-requests-auth, botocore, dill, markdownify, pillow, prompt-toolkit, pyjwt, readabilipy, requests, rich, slack-bolt, strands-agents, sympy, tenacity, watchdog
Required-by: stran

### If Strands Agents libraries do not show above, then install them by running this cell

In [2]:
# Uncomment line below to run pip install
# %pip install 'strands-agents[sagemaker]' strands-agents-tools

In [3]:
import boto3
import json
from sagemaker import get_execution_role
from sagemaker.jumpstart.model import JumpStartModel

# Setup role and sagemaker session
iam_role = get_execution_role()
boto_session = boto3.Session(region_name='us-west-2')
sagemaker_client = boto_session.client('sagemaker')
# Create the runtime client from the same session so both clients use the same region
sagemaker_runtime = boto_session.client("sagemaker-runtime")

# Names for endpoint config and endpoint
endpointName = 'strands-endpoint-001'
endpointConfigName = 'strands-endpoint-config'
inferenceComponentName = 'mistral-24b-instruct-2501-ic'


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Delete Resources if they already exist

**Note**: If you re-run this notebook, or see failures to create resources, run the next 3 cells to delete previously created resources.

In [12]:
# Run this line if the endpoint already exists
sagemaker_client.delete_endpoint(EndpointName=endpointName)

{'ResponseMetadata': {'RequestId': 'b8a68ac1-efe8-4b98-9f8e-ccb90d8974a2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b8a68ac1-efe8-4b98-9f8e-ccb90d8974a2',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 10 Sep 2025 15:32:35 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [13]:
# Run this line if the endpoint config already exists
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpointConfigName)

{'ResponseMetadata': {'RequestId': '776861f6-f561-45b9-a37b-2c23c9c06b19',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '776861f6-f561-45b9-a37b-2c23c9c06b19',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 10 Sep 2025 15:32:42 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [None]:
# Run this line if the inference component already exists
sagemaker_client.delete_inference_component(InferenceComponentName=inferenceComponentName)
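
Each of the delete calls above raises an error when its resource does not exist, which interrupts a "Run All". A best-effort variant is sketched below. `delete_if_present` is a hypothetical helper, and its ordering (inference component first, then endpoint, then endpoint config) is an assumption that a hosted component should be removed before the endpoint that hosts it.

```python
def delete_if_present(sagemaker_client, endpoint_name, endpoint_config_name,
                      inference_component_name):
    """Attempt each delete call in turn and report the outcome instead of raising."""
    steps = [
        ("inference component", sagemaker_client.delete_inference_component,
         {"InferenceComponentName": inference_component_name}),
        ("endpoint", sagemaker_client.delete_endpoint,
         {"EndpointName": endpoint_name}),
        ("endpoint config", sagemaker_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
    ]
    results = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
            results[label] = "delete requested"
        except sagemaker_client.exceptions.ClientError as err:
            # A missing or still-busy resource surfaces as a ClientError;
            # keep the message so the caller can inspect it.
            results[label] = f"skipped: {err}"
    return results


# Example usage with the names defined earlier in this notebook:
# print(delete_if_present(sagemaker_client, endpointName,
#                         endpointConfigName, inferenceComponentName))
```

Deletion is asynchronous, so a step may be reported as skipped while an earlier delete is still in progress. Re-running the helper after a short wait is usually enough.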

## Create Endpoint Config, Endpoint, and Inference Component

In [15]:
# Create endpoint config

endpoint_config = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpointConfigName,
    EnableNetworkIsolation=False,
    ExecutionRoleArn=iam_role,
    ProductionVariants=[
        {
            "VariantName": 'AllTraffic',
            "InstanceType": 'ml.g5.48xlarge',
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 0,
                "MaxInstanceCount": 2,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)


In [16]:
# Create endpoint
endpoint = sagemaker_client.create_endpoint(
    EndpointName=endpointName,
    EndpointConfigName=endpointConfigName
)
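
`create_endpoint` returns as soon as the request is accepted, and the endpoint is provisioned in the background. If you want to block until it is ready before continuing, a minimal sketch using the boto3 `endpoint_in_service` waiter follows. `wait_for_endpoint` is a hypothetical helper, and the delay and attempt values are illustrative assumptions.

```python
def wait_for_endpoint(sagemaker_client, endpoint_name, delay_seconds=30, max_attempts=40):
    """Block until the endpoint reports InService, then return its status."""
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    # The waiter raises if the endpoint fails or the attempts run out.
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Example usage with the names defined earlier in this notebook:
# print(wait_for_endpoint(sagemaker_client, endpointName))
```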

In [17]:
env = {
    'ENDPOINT_SERVER_TIMEOUT': '600',
    'SM_NUM_GPUS': json.dumps(4),
    'MESSAGES_API_ENABLED': 'true'
}

model = JumpStartModel(
    model_id="huggingface-llm-mistral-small-24B-Instruct-2501",
    enable_network_isolation=False,
    env=env,
    role=iam_role
)

model.create()
model_name = model.name

Using model 'huggingface-llm-mistral-small-24B-Instruct-2501' with wildcard version identifier '*'. You can pin to version '3.0.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


In [18]:
# Create inference component

mistral_24b_ic = sagemaker_client.create_inference_component(
    EndpointName=endpointName,
    InferenceComponentName=inferenceComponentName,
    RuntimeConfig={
        'CopyCount': 1
    },
    Specification={
        'ModelName': model_name,
        'StartupParameters': {
            'ModelDataDownloadTimeoutInSeconds': 3600,
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600
        },
        'ComputeResourceRequirements': {
            'MinMemoryRequiredInMb': 1024,
            'NumberOfAcceleratorDevicesRequired': 4,
        }
    },
    Tags=[{
        'Key': 'Usage',
        'Value': 'Strands Agents'
    }],
    VariantName="AllTraffic"
)

<div class="alert alert-block alert-info">
⚠️ <b>Note:</b> Deployment of the SageMaker endpoint inference component will take 5 to 10 minutes.
</div>

## Wait up to 10 minutes for endpoint auto scaling to complete

**Notice**: If you get the error message below, wait for the endpoint to complete its auto scaling.

`ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component
has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to 
increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.`

If the cell below fails, you can simply retry after waiting.
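
Instead of retrying by hand, you can poll the inference component until it is ready. The sketch below uses `describe_inference_component` and treats an `InferenceComponentStatus` of `InService` as the ready signal. `wait_for_inference_component` is a hypothetical helper, the timeout and polling interval are illustrative assumptions, and a component that reports `InService` may still briefly refuse requests while scaling finishes.

```python
import time


def wait_for_inference_component(sagemaker_client, name, timeout_seconds=900,
                                 poll_seconds=30, sleep=time.sleep):
    """Poll the inference component until it is InService, fails, or times out."""
    waited = 0
    while True:
        description = sagemaker_client.describe_inference_component(
            InferenceComponentName=name
        )
        status = description["InferenceComponentStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Inference component {name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Inference component {name} is still {status}")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage with the names defined earlier in this notebook:
# wait_for_inference_component(sagemaker_client, inferenceComponentName)
```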

In [20]:
messages = [
    {"role": "system", "content": "You are a helpful assistant capable of performing math calculations."},
    {"role": "user", "content": "Calculate the square root of 81. Briefly explain the significance of the result."}
]

payload = {
    "messages": messages,
    "max_tokens": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
}

# Send the chat-completion payload defined above to the inference component
response_model = sagemaker_runtime.invoke_endpoint(
    InferenceComponentName=inferenceComponentName,
    EndpointName=endpointName,
    Body=json.dumps(payload),
    ContentType="application/json"
)

response = json.loads(response_model['Body'].read().decode('utf-8'))
print(response)

{'id': 'chatcmpl-19101479311344f58381c4a5a3e75e47', 'object': 'chat.completion', 'created': 1757520299, 'model': 'lmi', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'The square root of 81 is 9.\n\nThe significance of this result is that 9 is the number that, when multiplied by itself, gives 81. In mathematical terms, if \\( x^2 = 81 \\), then \\( x = \\sqrt{81} = 9 \\). This is a fundamental concept in algebra and arithmetic, demonstrating the relationship between a number and its square root.', 'tool_calls': []}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None}], 'usage': {'prompt_tokens': 33, 'total_tokens': 120, 'completion_tokens': 87, 'prompt_tokens_details': None}, 'prompt_logprobs': None}
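
The response follows a chat-completion layout, so the assistant's text sits under `choices[0].message.content`. A small sketch for pulling out the useful fields is shown below. `extract_reply` is a hypothetical helper that assumes the response has the same keys as the output printed above.

```python
def extract_reply(response):
    """Return the assistant text plus a few diagnostics from a chat-completion response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "completion_tokens": response["usage"]["completion_tokens"],
    }


# Example usage with the `response` dict from the cell above:
# reply = extract_reply(response)
# print(reply["text"])
```

A `finish_reason` other than `stop` can indicate that the reply was cut off by the token limit.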


### Save Endpoint Name Attributes

We save the endpoint name, endpoint config name, and inference component name with the `%store` magic command, as they will be needed in later notebooks.

In [21]:
MISTRAL_ENDPOINT_NAME = endpointName
MISTRAL_ENDPOINT_CONFIG_NAME = endpointConfigName
MISTRAL_INFERENCE_COMPONENT_NAME = inferenceComponentName

print(f"Endpoint name: {MISTRAL_ENDPOINT_NAME}")
print(f"Endpoint Config Name: {MISTRAL_ENDPOINT_CONFIG_NAME}")
print(f"Inference Component Name: {MISTRAL_INFERENCE_COMPONENT_NAME}")

%store MISTRAL_ENDPOINT_NAME
%store MISTRAL_ENDPOINT_CONFIG_NAME
%store MISTRAL_INFERENCE_COMPONENT_NAME
    

Endpoint name: strands-endpoint-001
Endpoint Config Name: strands-endpoint-config
Inference Component Name: mistral-24b-instruct-2501-ic
Stored 'MISTRAL_ENDPOINT_NAME' (str)
Stored 'MISTRAL_ENDPOINT_CONFIG_NAME' (str)
Stored 'MISTRAL_INFERENCE_COMPONENT_NAME' (str)
