# Deploying an LLM to an Amazon SageMaker AI real-time endpoint

## Prerequisites

To use SageMaker AI endpoints in these examples, you first need to deploy a managed endpoint. In this example, you will deploy an endpoint through SageMaker JumpStart, a feature that helps machine learning practitioners quickly get started with hundreds of production-ready models in SageMaker AI.

## Dependencies

<div class="alert alert-block alert-info">
⚠️ <b>Important:</b> Make sure you've run the <code>1-required-dependencies-strands.ipynb</code> notebook in this folder before proceeding. If you haven't, close this notebook, run the previous one first, then come back to this.
</div>

## Create SageMaker Endpoint

In this notebook, we first create an endpoint config that defines the parameters for the endpoint, and then create the endpoint itself. We then register a model and specify an inference component that will field our requests. The inference component is deployed onto the SageMaker endpoint and becomes available after some delay.

**NOTE**: To deploy a SageMaker endpoint on a larger instance type, such as `ml.g5.48xlarge`, you may have to request a quota increase for your account. Please refer to the AWS documentation on how to [increase quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).



### Run this cell to make sure the Strands Agents libraries are installed

In [None]:
%pip show strands-agents strands-agents-tools

### If the Strands Agents libraries do not show above, install them by running this cell

In [None]:
# Uncomment line below to run pip install
# %pip install 'strands-agents[sagemaker]' strands-agents-tools

In [None]:
import boto3
import json
from sagemaker.core.helper.session_helper import get_execution_role

# Set up the IAM role and the boto3 clients.
# Both clients come from the same session so they target the same region.
iam_role = get_execution_role()
boto_session = boto3.Session(region_name='us-west-2')
sagemaker_client = boto_session.client('sagemaker')
sagemaker_runtime = boto_session.client('sagemaker-runtime')

# Names for endpoint config and endpoint
endpointName='strands-endpoint-001'
endpointConfigName='strands-endpoint-config'
inferenceComponentName='mistral-24b-instruct-2501-ic'


## Delete Resources if they already exist

**Note**: If you re-run this notebook, or see failures to create resources, run the next 3 cells to delete previously created resources.

In [None]:
# Run this line if the inference component already exists
sagemaker_client.delete_inference_component(InferenceComponentName=inferenceComponentName)

In [None]:
# Run this line if the endpoint already exists
sagemaker_client.delete_endpoint(EndpointName=endpointName)

In [None]:
# Run this line if the endpoint config already exists
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpointConfigName)

## Create Endpoint Config, Endpoint, and Inference Component

In [None]:
# Create endpoint config

endpoint_config = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpointConfigName,
    EnableNetworkIsolation=False,
    ExecutionRoleArn=iam_role,
    ProductionVariants=[
        {
            "VariantName": 'AllTraffic',
            "InstanceType": 'ml.g5.48xlarge',
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 0,
                "MaxInstanceCount": 2,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)


In [None]:
# Create endpoint
endpoint = sagemaker_client.create_endpoint(
    EndpointName=endpointName,
    EndpointConfigName=endpointConfigName
)

In [None]:
# SDK v3: use boto3 to create the SageMaker model directly.
# No model artifacts are passed here; the container loads the model named in HF_MODEL_ID below.
import time

# Mistral Small 24B model specs
# JumpStart model ID (for reference only; not used below)
model_id = "huggingface-llm-mistral-small-24B-Instruct-2501"
region = boto_session.region_name

# Use the Hugging Face TGI container to serve Mistral
tgi_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.1-gpu-py311-cu124-ubuntu22.04"

model_name = f"mistral-small-24b-{int(time.time())}"

env = {
    'ENDPOINT_SERVER_TIMEOUT': '600',
    'SM_NUM_GPUS': json.dumps(4),
    'MESSAGES_API_ENABLED': 'true',
    'HF_MODEL_ID': 'mistralai/Mistral-Small-24B-Instruct-2501',
    'MAX_INPUT_LENGTH': '4096',
    'MAX_TOTAL_TOKENS': '8192',
}

# Create model using boto3
response = sagemaker_client.create_model(
    ModelName=model_name,
    PrimaryContainer={
        'Image': tgi_image,
        'Environment': env,
    },
    ExecutionRoleArn=iam_role,
    EnableNetworkIsolation=False,
)
print(f"Created model: {model_name}")

### WAIT up to 5 minutes before creating Inference Component

Endpoint creation typically takes up to 5 minutes. Wait until the endpoint reaches the `InService` status before creating the inference component.
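
Rather than waiting a fixed amount of time, you can poll the endpoint status. The helper below is a minimal sketch, not part of the original notebook: `wait_for_endpoint` and its parameters are hypothetical names, and it assumes the client passed in is the boto3 SageMaker client (`sagemaker_client`), whose `describe_endpoint` response contains an `EndpointStatus` field.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30,
                      timeout_seconds=900, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if the endpoint reports Failed, and TimeoutError if
    it is still not InService after roughly timeout_seconds of waiting.
    """
    waited = 0
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook the call would be `wait_for_endpoint(sagemaker_client, endpointName)`. boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which may be preferable if you do not need custom timeout handling.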



In [None]:
# Create inference component

mistral_small_24b_ic = sagemaker_client.create_inference_component(
    EndpointName=endpointName,
    InferenceComponentName=inferenceComponentName,
    RuntimeConfig={
        'CopyCount': 1
    },
    Specification={
        'ModelName': model_name,
        'StartupParameters': {
            'ModelDataDownloadTimeoutInSeconds': 3600,
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600
        },
        'ComputeResourceRequirements': {
            'MinMemoryRequiredInMb': 1024,
            'NumberOfAcceleratorDevicesRequired': 4,
        }
    },
    Tags=[{
        'Key': 'Usage',
        'Value': 'Strands Agents'
    }],
    VariantName="AllTraffic"
)

<div class="alert alert-block alert-info">
⚠️ <b>Note:</b> Deployment of the SageMaker endpoint inference component can take 5 to 10 minutes.
</div>

## Must WAIT up to 10 minutes for Endpoint AutoScaling to complete

**Notice**: If you get the error message below, you must WAIT for the endpoint to complete its auto-scaling:

`ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component
has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to 
increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.`

If the cell below fails, you can simply retry after waiting.
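
If you would rather not retry by hand, the wait-and-retry loop can be automated. The following is a rough sketch under stated assumptions: `invoke_with_retry` is a hypothetical helper, not part of boto3 or the original notebook, and it decides whether to retry by matching the "no capacity" text from the error message quoted above.

```python
import time


def invoke_with_retry(invoke, attempts=10, delay_seconds=60, sleep=time.sleep):
    """Call invoke() and retry while the error reports missing capacity.

    Any other error, or a capacity error on the last attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return invoke()
        except Exception as exc:
            # Assumption: the capacity error is recognised by its message text.
            retryable = "no capacity to process this request" in str(exc)
            if not retryable or attempt == attempts:
                raise
            sleep(delay_seconds)
```

To use it here, wrap the `sagemaker_runtime.invoke_endpoint(...)` call from the next cell in a `lambda` and pass that as `invoke`. The defaults of 10 attempts spaced 60 seconds apart roughly match the 10-minute wait mentioned above.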

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant capable of performing math calculations."},
    {"role": "user", "content": "Calculate the square root of 81. Briefly explain the significance of the result."}
]

# OpenAI-style chat parameters; the Messages API is enabled via MESSAGES_API_ENABLED
payload = {
    "messages": messages,
    "max_tokens": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
}

response_model = sagemaker_runtime.invoke_endpoint(
    InferenceComponentName=inferenceComponentName,
    EndpointName=endpointName,
    Body=json.dumps(payload),
    ContentType="application/json"
)

response = json.loads(response_model['Body'].read().decode('utf-8'))
print(response)
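
The printed response is the full JSON document returned by the container. As a minimal sketch, assuming the response follows the OpenAI-style chat completion layout (a `choices` list whose items carry a `message` with `content`), the assistant's text can be pulled out as shown below. `extract_reply` is a hypothetical helper name, and the layout should be checked against the actual output printed above.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Assumption: the first choice holds the assistant message.
    return choices[0]["message"]["content"]
```

With the cell above, `print(extract_reply(response))` would then show only the model's answer.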

### Save names of Endpoint, Endpoint Config, and Inference Component

We save the endpoint name, endpoint config name, and inference component name using the `%store` magic command, as they will be needed in later notebooks (where they can be restored with `%store -r`).

In [None]:
MISTRAL_ENDPOINT_NAME = endpointName
MISTRAL_ENDPOINT_CONFIG_NAME = endpointConfigName
MISTRAL_INFERENCE_COMPONENT_NAME = inferenceComponentName

print(f"Endpoint name: {MISTRAL_ENDPOINT_NAME}")
print(f"Endpoint Config Name: {MISTRAL_ENDPOINT_CONFIG_NAME}")
print(f"Inference Component Name: {MISTRAL_INFERENCE_COMPONENT_NAME}")

%store MISTRAL_ENDPOINT_NAME
%store MISTRAL_ENDPOINT_CONFIG_NAME
%store MISTRAL_INFERENCE_COMPONENT_NAME
