# Schedule your SageMaker Inference Endpoints to scale in to zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:

* **Weekend Scale-in (Friday Evening)**: Configure a scheduled action to reduce the number of model copies to zero. This instructs SageMaker to scale the number of instances behind the endpoint to zero, eliminating compute costs during the idle weekend period.
* **Workweek Scale-out (Monday Morning)**: Set up a complementary scheduled action to restore the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.

This notebook demonstrates how to schedule your SageMaker endpoint to scale in to zero instances during idle periods, removing the previous requirement of keeping at least one instance running.

**Note:** Scale-to-zero is only supported when using inference components. For more information on inference components, see the blog post “[Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)”.



## Set up

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time

sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")

role = sagemaker.get_execution_role()
print(f"Role: {role}")

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

## Set up your SageMaker Real-time Endpoint

### Create a SageMaker endpoint configuration

We begin by creating the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale in all the way down to zero instances when not in use.

In [None]:
# Set a unique name for our endpoint config
endpoint_config_name = f"{prefix}-llama3-8b-scale-to-zero-sc-config"
print(f"Endpoint config name: {endpoint_config_name}") 

In [None]:
# Configure variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0 # Minimum instance must be set to 0
max_instance_count = 2

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

### Create the SageMaker endpoint
Next, we create our endpoint using the endpoint configuration.

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-llama3-8b-scale-to-zero-sc-endpoint"
print(f"Endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

#### We wait for our endpoint to go InService. This step can take ~3 mins

In [None]:
# Measure how long the endpoint takes to reach a terminal state
start_time = time.time()

while True:
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = desc["EndpointStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
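The polling loop above runs until the endpoint reaches a terminal state, with no upper bound on the wait. A minimal sketch of a reusable variant with a timeout is shown here; `wait_for_status` and its parameters are hypothetical names introduced for illustration, not part of boto3 or the SageMaker SDK.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    terminal_states: Iterable[str] = ("InService", "Failed"),
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print(status, flush=True)
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

For the endpoint created above, the status callable could be `lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`.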

## Import the required libraries and set some variables for the model that we will be using

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import jinja2
from sagemaker import image_uris
from sagemaker.session import Session
import os
import json
from pathlib import Path
from datetime import datetime
from getpass import getpass
from rich import print
from sagemaker.deserializers import JSONDeserializer
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

In [None]:
boto_session = boto3.Session()

sagemaker_session = sagemaker.session.Session(boto_session)  # SageMaker session for interacting with different AWS APIs
region = sagemaker_session.boto_region_name  # public attribute instead of the private _region_name

model_bucket = sagemaker_session.default_bucket()  # bucket to house model artifacts

account_id = sagemaker_session.account_id()

#### Set the relevant Model Configurations and select the relevant Large Model Inference container image
SageMaker offers optimized [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that contain different frameworks for model parallelism, enabling inference of LLMs on multiple GPUs. For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md).

#### Select the container to use

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi", region=region, version="0.30.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

#### Set the model to use. In this example, we will use Llama 3 8B Instruct
1. `HF_MODEL_ID`: The model ID of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model ID to download the corresponding model repository from huggingface.co.

2. `HF_TOKEN`: Your HuggingFace token for accessing gated models.

We pass the HuggingFace model ID as an environment variable. Because this is a gated model, we also pass our HuggingFace token.

In [None]:
HF_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN") or getpass("Enter HUGGINGFACE Access Token: ")

hf_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

llama3model = {
    "Image": inference_image_uri,
    "Environment": {
        "HF_MODEL_ID": hf_model_id,  # model_id from hf.co/models
        "HF_TOKEN": HF_TOKEN,
    },
}

## Create an inference component for our llama3_8b model and invoke the model
Inference components can reuse a SageMaker model that you have already created. You also have the option to specify your artifacts and container directly when creating an inference component. In this example, we create a SageMaker model and reference it by name, so that you can reuse it later.

### Create Inference Component (IC)
We can now create our inference component. Note below that we specify an inference component name. You can use this name to update your inference component or to view its metrics and logs in CloudWatch. You also need to set `ComputeResourceRequirements`, which tells SageMaker how much of each resource to reserve for EACH COPY of your inference component. Finally, we set the number of copies that we want to deploy. The number of copies can be managed through autoscaling policies.
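Because resources are reserved per copy, the number of copies that fit on the endpoint follows from simple arithmetic. The sketch below assumes accelerators are the only limiting resource (memory and CPU are ignored); `max_copies` is a hypothetical helper. An ml.g5.12xlarge instance has 4 GPUs, and this notebook reserves 1 accelerator per copy with at most 2 instances.

```python
def max_copies(accelerators_per_instance: int,
               accelerators_per_copy: int,
               max_instances: int) -> int:
    """Upper bound on copies when accelerators are the only limiting resource."""
    if accelerators_per_copy <= 0:
        raise ValueError("accelerators_per_copy must be positive")
    copies_per_instance = accelerators_per_instance // accelerators_per_copy
    return copies_per_instance * max_instances


# 4 GPUs per ml.g5.12xlarge, 1 GPU per copy, MaxInstanceCount of 2
print(max_copies(accelerators_per_instance=4, accelerators_per_copy=1, max_instances=2))  # 8
```

A scheduled scale-out that requests more copies than this bound would not be able to place all of them on the configured instances.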

In [None]:
# Set a unique name for our IC
inference_component_name = f"{prefix}-llama3-8b-scale-to-zero-sc"
print(f"inference component name: {inference_component_name}")

model_name = f"{prefix}-llama3-8b-scale-to-zero-scheduled"
print(f"model name: {model_name}")

sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[llama3model],
)

initial_copy_count = 1
max_copy_count_per_instance = 4  # up to 4 llama3-8b

variant_name = "AllTraffic"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600
min_memory_required_in_mb = 1024  # max memory util is up to 85%
number_of_accelerator_devices_required = 1

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

#### We wait for our IC to go InService. This step can take ~8 mins or longer
Let's wait for the inference component to be ready before proceeding with inference.

In [None]:
# Let's see how much it takes
start_time = time.time()

while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

### Invoke the endpoint with a sample prompt
Now we can invoke our endpoint with sample text to test its functionality and see the model's output.

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import jinja2
from sagemaker import image_uris
from sagemaker.session import Session
import os
import json
from pathlib import Path
from datetime import datetime
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

In [None]:
# create predictor object
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    component_name=inference_component_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise"},
    {"role": "user", "content": "What is deep learning?"},
]

# Generation arguments
parameters = {
    "model": hf_model_id,  # model id is required
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
}

chat = predictor.predict({"messages": messages, **parameters})

# Unpack and print response
print(chat["choices"][0]["message"]["content"].strip())

## Schedules using the UpdateInferenceComponentRuntimeConfig API

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the [UpdateInferenceComponentRuntimeConfig API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateInferenceComponentRuntimeConfig.html). This approach keeps your endpoint configuration in place while eliminating compute costs during periods of inactivity.
```
sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        'CopyCount': 0
    }
)
```
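The update call above is asynchronous, so it can be useful to check whether the component has actually reached zero copies. The following is a minimal sketch that reads the `RuntimeConfig` section returned by `describe_inference_component`; the helper names are hypothetical, and treating a missing count as 0 is an assumption made for brevity.

```python
from typing import Any, Tuple


def copy_counts(sm_client: Any, component_name: str) -> Tuple[int, int]:
    """Return (desired, current) copy counts for an inference component."""
    desc = sm_client.describe_inference_component(
        InferenceComponentName=component_name
    )
    runtime = desc.get("RuntimeConfig", {})
    # Assumption: a missing field is treated as zero copies.
    return runtime.get("DesiredCopyCount", 0), runtime.get("CurrentCopyCount", 0)


def is_scaled_to_zero(sm_client: Any, component_name: str) -> bool:
    """True once both the desired and the current copy count are zero."""
    return copy_counts(sm_client, component_name) == (0, 0)
```

In this notebook, the call would look like `is_scaled_to_zero(sagemaker_client, inference_component_name)`.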

### Create a schedule to shut down the endpoint on Friday and bring it back on Monday

[Amazon EventBridge Scheduler](https://docs.aws.amazon.com/eventbridge/latest/userguide/using-eventbridge-scheduler.html) can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. Please refer to the [documentation](https://docs.aws.amazon.com/scheduler/latest/UserGuide/setting-up.html#setting-up-execution-role) on how to create this role. The specific permissions needed depend on the target API being called.
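As a rough illustration of what such an execution role involves, the sketch below builds a trust policy that lets EventBridge Scheduler assume the role, plus a permissions policy covering the three SageMaker operations targeted in this notebook. The `"Resource": "*"` entry is a placeholder assumption that you would normally narrow to your own inference component ARNs, and your setup may need additional permissions that are not listed here.

```python
import json

# Trust policy: allows the EventBridge Scheduler service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "scheduler.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: the SageMaker operations the schedules in this notebook call.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateInferenceComponentRuntimeConfig",
                "sagemaker:DeleteInferenceComponent",
                "sagemaker:CreateInferenceComponent",
            ],
            "Resource": "*",  # placeholder assumption; scope this down in practice
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

The cells in this notebook pass the SageMaker execution role as `RoleArn`, which only works if that role also trusts `scheduler.amazonaws.com`.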

The code below creates two scheduled actions for the inference component during 2024-2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, while the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedules will start on November 29, 2024, end on December 31, 2025, and be deleted after completion.
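EventBridge cron expressions use six fields, `cron(minutes hours day-of-month month day-of-week year)`, and number the days of the week from 1 (Sunday) to 7 (Saturday), so Friday is 6 and Monday is 2. A small sketch of a builder for weekly expressions follows; `weekly_cron` is a hypothetical helper, not an AWS API.

```python
# EventBridge numbers days of the week from 1 (Sunday) to 7 (Saturday).
DAY_OF_WEEK = {"SUN": 1, "MON": 2, "TUE": 3, "WED": 4, "THU": 5, "FRI": 6, "SAT": 7}


def weekly_cron(day: str, hour: int, minute: int = 0, years: str = "*") -> str:
    """Build cron(minutes hours day-of-month month day-of-week year) for one weekday."""
    if not 0 <= hour <= 23 or not 0 <= minute <= 59:
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    dow = DAY_OF_WEEK[day.upper()[:3]]
    # "?" leaves day-of-month unspecified because day-of-week is set.
    return f"cron({minute} {hour} ? * {dow} {years})"


print(weekly_cron("Friday", 18, years="2024-2025"))  # cron(0 18 ? * 6 2024-2025)
print(weekly_cron("Monday", 7, years="2024-2025"))   # cron(0 7 ? * 2 2024-2025)
```

These describe the same Friday 18:00 and Monday 07:00 times as the expressions used in this notebook, written without leading zeros.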


#### Weekend Scale-in (Friday Evening)
We start by creating a schedule to scale in our endpoint to 0 every Friday at 18:00 UTC+1, starting on November 29, 2024 and ending on December 31, 2025. We need to specify the target API to call, in this case the [UpdateInferenceComponentRuntimeConfig API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateInferenceComponentRuntimeConfig.html), and the correct parameters for that API.

In [None]:
import boto3
import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 0}, "InferenceComponentName": inference_component_name })
}

# Scale in our endpoint to 0 every Friday at 18:00 UTC+1, starting on November 29, 2024
update_IC_scale_in_schedule = "scale-to-zero-schedule"
scheduler.create_schedule(
    Name=update_IC_scale_in_schedule,
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)



#### Workweek Scale-out (Monday Morning)
Set up a complementary scheduled action to restore the required model capacity for the inference component every Monday morning at 07:00 UTC+1, so your application is ready for weekday operations.

In [None]:
# Specify the SageMaker target API for the scale out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 2}, "InferenceComponentName": inference_component_name })
}

# Scale out our endpoint every Monday at 07:00 UTC+1
update_IC_scale_out_schedule = "scale-out-schedule"
scheduler.create_schedule(
    Name=update_IC_scale_out_schedule,
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)
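Once both schedules exist, it can help to read them back and confirm the expression, time zone, and state. The sketch below wraps the EventBridge Scheduler `get_schedule` call; `summarize_schedules` is a hypothetical helper, and the client is passed in as a parameter so the function can be exercised without AWS access.

```python
from typing import Any, Dict, Iterable


def summarize_schedules(scheduler_client: Any, names: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """Fetch each schedule and keep the fields that matter for a quick review."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name in names:
        resp = scheduler_client.get_schedule(Name=name)
        summary[name] = {
            "expression": resp.get("ScheduleExpression"),
            "timezone": resp.get("ScheduleExpressionTimezone"),
            "state": resp.get("State"),
        }
    return summary
```

In this notebook, it could be called as `summarize_schedules(scheduler, [update_IC_scale_in_schedule, update_IC_scale_out_schedule])`.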

## Schedules using the DeleteInferenceComponent API

The second method is to delete the inference components by calling the [DeleteInferenceComponent API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteInferenceComponent.html). This alternative approach achieves the same cost savings while completely removing the components from your configuration. The following code creates a scheduled action that automatically deletes the IC every Friday at 18:00 UTC+1 during 2024-2025. It also creates a complementary scheduled action that recreates the IC every Monday at 07:00 UTC+1.

#### Weekend Scale-in (Friday Evening)
The following code creates a scheduled action that automatically deletes the IC every Friday at 18:00 UTC+1 during 2024-2025.

In [None]:
import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name })
}

# Scale in our endpoint by deleting the IC every Friday at 18:00 UTC+1
delete_IC_scale_in_schedule = "scale-to-zero-schedule-1"
scheduler.create_schedule(
    Name=delete_IC_scale_in_schedule,
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

#### Workweek Scale-out (Monday Morning)
Create a complementary scheduled action that recreates the IC every Monday at 07:00 UTC+1.

In [None]:
# Specify the SageMaker target API and its input for the scale out schedule
input_config = {
  "EndpointName": endpoint_name,
  "InferenceComponentName": inference_component_name,
  "RuntimeConfig": {
    "CopyCount": 2
  },
  "Specification": {
    "ModelName": model_name,
    "StartupParameters": {
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    },
    "ComputeResourceRequirements": {
      "MinMemoryRequiredInMb": 1024,
      "NumberOfAcceleratorDevicesRequired": 1
    }
  },
  "VariantName": variant_name
}

scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config)
}

# Scale out our endpoint by recreating the IC every Monday at 07:00 UTC+1
delete_IC_scale_out_schedule = "scale-out-schedule-1"
scheduler.create_schedule(
    Name=delete_IC_scale_out_schedule,
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

### Note

To schedule scale-to-zero on an endpoint with multiple inference components (ICs), all ICs must either be set to 0 copies or deleted. You can automate this by using EventBridge Scheduler to trigger a Lambda function that either deletes all ICs or sets their copy count to zero.
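A minimal sketch of such a Lambda function is shown below, using the zero-setting variant. The function names and the event shape (`{"EndpointName": ..., "CopyCount": ...}`) are assumptions made for illustration; the SageMaker client is passed into the helpers so that they can be tested without AWS access.

```python
from typing import Any, Dict, List


def list_component_names(sm_client: Any, endpoint_name: str) -> List[str]:
    """Collect the names of all inference components on an endpoint, following pagination."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm_client.list_inference_components(**kwargs)
        names += [ic["InferenceComponentName"] for ic in page.get("InferenceComponents", [])]
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def set_copy_count_for_all(sm_client: Any, endpoint_name: str, copy_count: int) -> List[str]:
    """Request the same copy count for every inference component on the endpoint."""
    names = list_component_names(sm_client, endpoint_name)
    for name in names:
        sm_client.update_inference_component_runtime_config(
            InferenceComponentName=name,
            DesiredRuntimeConfig={"CopyCount": copy_count},
        )
    return names


def lambda_handler(event, context):
    """Hypothetical Lambda entry point; the event shape is an assumption."""
    import boto3  # imported here so the helpers above stay usable without boto3

    client = boto3.client("sagemaker")
    return set_copy_count_for_all(client, event["EndpointName"], int(event.get("CopyCount", 0)))
```

The same handler could serve both schedules: the Friday schedule would send a `CopyCount` of 0, and the Monday schedule would send the copy count to restore.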

## Optionally clean up the environment

- Delete schedules

In [None]:
# Delete all the schedules created above
schedules = [delete_IC_scale_out_schedule, delete_IC_scale_in_schedule, update_IC_scale_out_schedule, update_IC_scale_in_schedule]

for schedule in schedules:
    try:
        scheduler.delete_schedule(Name=schedule)
        print(f"Deleted schedule [b]{schedule}[/b] ✅")
    except scheduler.exceptions.ResourceNotFoundException:
        print(f"Schedule [b]{schedule}[/b] not found.")

- Delete the inference component, endpoint, endpoint configuration, and model


In [None]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sagemaker_client.delete_model(ModelName=model_name)