# Schedule your SageMaker Inference Endpoints to Scale In to Zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:

* **Weekend Scale-in (Friday Evening)**: Configure a scheduled action to reduce the number of model copies to zero. This instructs SageMaker to scale the number of instances behind the endpoint to zero, eliminating compute costs during the idle weekend period.
* **Workweek Scale-out (Monday Morning)**: Set up a complementary scheduled action that restores the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.

This notebook demonstrates how to schedule your SageMaker endpoint to scale in to zero instances during idle periods, removing the previous requirement of keeping at least one instance running.

**Note:** Scale-to-zero is only supported when using inference components. For more information on inference components, see the blog post “[Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)”.



## Set up

In [None]:
!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import json

sagemaker_client = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
print(f"Role: {role}")

boto_session = boto3.Session()
sagemaker_session = sagemaker.session.Session(boto_session) # sagemaker session for interacting with different AWS APIs
region = sagemaker_session.boto_region_name  # public attribute instead of the private _region_name

model_bucket = sagemaker_session.default_bucket()  # bucket to house model artifacts

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

## Set up your SageMaker Real-time Endpoint

### Create a SageMaker endpoint configuration

We begin by creating the endpoint configuration with `MinInstanceCount` set to 0. This allows the endpoint to scale in all the way down to zero instances when not in use.

In [None]:
# Set a unique name for our endpoint config
endpoint_config_name = f"{prefix}-llama3-8b-scale-to-zero-sc-config"
print(f"Endpoint config name: {endpoint_config_name}") 

In [None]:
# Configure variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0 # Must be 0 so the endpoint can scale in to zero instances
max_instance_count = 3

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

### Create the SageMaker endpoint
Next, we create our endpoint using the endpoint configuration.

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-llama3-8b-scale-to-zero-sc-endpoint"
print(f"Endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

#### Wait for the endpoint to reach InService. This step can take about 3 minutes.

In [None]:
import time
# Measure how long the endpoint takes to become InService
start_time = time.time()

while True:
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = desc["EndpointStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
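
The loop above exits on `Failed` without raising, so later cells would run against a broken endpoint. As a minimal sketch, the same polling could be wrapped in a helper that raises on failure and gives up after a timeout. The helper name `wait_for_endpoint` and its parameters are hypothetical; it only assumes a client object that exposes `describe_endpoint`, as `sagemaker_client` does.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_endpoint until InService; raise on Failed or timeout.

    `sleep` is injectable so the helper can be exercised without waiting.
    """
    waited = 0
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the notebook's variables (requires AWS access):
# wait_for_endpoint(sagemaker_client, endpoint_name)
```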

## Create Model Builder
We use Amazon SageMaker Fast Model Loader. The feature works by streaming model weights directly from Amazon S3 to GPU accelerators, bypassing the typical sequential loading steps that contribute to deployment latency. In internal testing, this approach has been shown to load large models up to 15 times faster than traditional methods. For more information on this feature, refer to our example [notebook on GitHub](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Llama3.1/Llama3.1-70B-SageMaker-Fast-Model-Loader.ipynb).

We use the ModelBuilder class to prepare and package the model for deployment as an inference component. In this example, we use the Meta-Llama-3-8B-Instruct model from SageMaker JumpStart.

Key configurations:
- Model: Meta-Llama-3-8B-Instruct
- Schema Builder: Defines input/output format
- Model metadata: `CUSTOM_MODEL_PATH` points to the shards produced by the previous model optimization run in the autoscaling example notebook, which we reuse here
- Instance type: ml.g5.12xlarge

In [None]:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

model_id = "meta-textgeneration-llama-3-8b-instruct"
output_path = f"s3://{model_bucket}/llama3-8b/sharding"

model_builder = ModelBuilder(
    model=model_id,
    model_metadata={
                "CUSTOM_MODEL_PATH": output_path,
            },
    role_arn=role,
    schema_builder=SchemaBuilder(sample_input=prompt, sample_output=response),
    instance_type="ml.g5.12xlarge",
    log_level=logging.WARN
)

## Build and Deploy Model
After optimization, we'll build the final model artifacts and deploy them to a SageMaker endpoint. 

Key configurations:
- Instance Type: ml.g5.12xlarge
- Memory Request: 104096 MB
- Number of Accelerators: 4 (for tensor parallelism)

In [None]:
final_model = model_builder.build()

In [None]:
# Make sure our model is sharded
if not final_model._is_sharded_model:
    final_model._is_sharded_model = True
final_model._is_sharded_model

In [None]:
# EnableNetworkIsolation must not be True, because SageMaker Fast Model Loader requires network access.
if final_model._enable_network_isolation:
    final_model._enable_network_isolation = False
    
final_model._enable_network_isolation

#### Select the container image to use
Use the latest LMI image to take advantage of the caching feature.

In [None]:
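# Build the image URI for the current region. The account ID is the one from the
# original us-west-2 URI; some regions use a different account, so verify it for yours.
final_model.image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"
print(f"Using image: {final_model.image_uri}")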

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

resources_required = ResourceRequirements(
    requests={
        "memory" : 104096,
        "num_accelerators": 4,
        "copies": 1, # specify the number of initial copies (default is 1)
    },
)
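
To see how these requests translate into instances, here is a small arithmetic sketch. It assumes an ml.g5.12xlarge offers 4 GPUs and 192 GiB of memory (the instance specification, not a value taken from this notebook), and it ignores memory reserved by the host, so treat the result as an upper bound. The function names are hypothetical.

```python
def max_copies_per_instance(instance_gpus, instance_memory_mb, copy_gpus, copy_memory_mb):
    """Upper bound on model copies one instance can host, limited by GPUs and memory."""
    return min(instance_gpus // copy_gpus, instance_memory_mb // copy_memory_mb)


def instances_needed(copy_count, copies_per_instance):
    """Ceiling division: instances required to host `copy_count` copies."""
    return -(-copy_count // copies_per_instance)


# Assumed ml.g5.12xlarge capacity: 4 GPUs, 192 GiB memory
G5_12XLARGE_GPUS = 4
G5_12XLARGE_MEMORY_MB = 192 * 1024

# Each copy requests 4 accelerators and 104096 MB, as in ResourceRequirements above
per_instance = max_copies_per_instance(G5_12XLARGE_GPUS, G5_12XLARGE_MEMORY_MB, 4, 104096)
print(per_instance)                       # 1
print(instances_needed(2, per_instance))  # 2
print(instances_needed(0, per_instance))  # 0
```

Under these assumptions each copy occupies a whole instance, so a copy count of 2 needs two instances and a copy count of 0 needs none, which is what allows the endpoint to scale in to zero.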

#### Deploy your model to the endpoint

Deploy your model with the model's existing `deploy` method. We pass the name of our existing real-time endpoint so that SageMaker hosts the model on it and the model can start serving predictions on incoming requests.

In [None]:
final_model.deploy(
    instance_type="ml.g5.12xlarge", 
    accept_eula=True, 
    endpoint_name=endpoint_name,
    # endpoint_logging=False, 
    resources=resources_required,
)

### Test the endpoint with a sample prompt
Now we can invoke our endpoint with sample text to test its functionality and see the model's output.

In [None]:
from sagemaker.predictor import retrieve_default 

endpoint_name = final_model.endpoint_name 
inference_component_name = final_model.inference_component_name

predictor = retrieve_default(endpoint_name, sagemaker_session=sagemaker_session, 
                             inference_component_name = inference_component_name, 
                             model_id=model_id) 

payload = { "inputs": "What is deep learning?", 
            "parameters": { 
                "max_new_tokens": 64, 
                "top_p": 0.7, 
                "temperature": 0.9 
            } 
        }
response = predictor.predict(payload) 
print(response['generated_text']) 

## Schedules using the UpdateInferenceComponentRuntimeConfig API

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies of your inference component to zero using the [UpdateInferenceComponentRuntimeConfig API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateInferenceComponentRuntimeConfig.html). This approach keeps your endpoint configuration in place while eliminating compute costs during periods of inactivity.
```
sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        'CopyCount': 0
    }
)
```
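
The update call returns before the copies are actually removed. A minimal sketch for confirming that the inference component reached the requested copy count is shown below. It assumes the `describe_inference_component` response reports `RuntimeConfig.CurrentCopyCount`; the helper name `wait_for_copy_count` and its parameters are hypothetical.

```python
import time


def wait_for_copy_count(client, inference_component_name, target,
                        poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll until the inference component reports `target` running copies."""
    for _ in range(max_polls):
        desc = client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )
        current = desc["RuntimeConfig"]["CurrentCopyCount"]
        if current == target:
            return current
        sleep(poll_seconds)
    raise TimeoutError(
        f"{inference_component_name} did not reach {target} copies after {max_polls} polls"
    )


# Usage with the notebook's variables (requires AWS access):
# wait_for_copy_count(sagemaker_client, inference_component_name, target=0)
```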

### Create a schedule that shuts down the endpoint on Friday and brings it back on Monday

[Amazon EventBridge Scheduler](https://docs.aws.amazon.com/eventbridge/latest/userguide/using-eventbridge-scheduler.html) can automate SageMaker API calls using cron or rate expressions for recurring schedules, or one-time invocations. EventBridge Scheduler requires an execution role with permission to invoke the target API operations on your behalf. Refer to the [documentation](https://docs.aws.amazon.com/scheduler/latest/UserGuide/setting-up.html#setting-up-execution-role) to create this role. The specific permissions needed depend on the target API being called.
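
As a rough illustration of what such a role involves, the sketch below builds the two policy documents as Python dictionaries. The trust policy lets the `scheduler.amazonaws.com` service principal assume the role. The list of SageMaker actions is an assumption derived from the three API operations this notebook schedules, and the `"*"` resource is a placeholder to narrow down; check both against the linked documentation. The function names are hypothetical. The cells in this notebook pass the SageMaker execution role as `RoleArn`, which only works if that role also trusts EventBridge Scheduler.

```python
import json


def scheduler_trust_policy():
    """Trust policy allowing EventBridge Scheduler to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "scheduler.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def scheduler_permissions_policy(resource_arn="*"):
    """Permissions for the SageMaker operations scheduled in this notebook (assumed list)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker:UpdateInferenceComponentRuntimeConfig",
                    "sagemaker:DeleteInferenceComponent",
                    "sagemaker:CreateInferenceComponent",
                ],
                "Resource": resource_arn,  # placeholder: restrict to your resources
            }
        ],
    }


print(json.dumps(scheduler_trust_policy(), indent=2))
print(json.dumps(scheduler_permissions_policy(), indent=2))
```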

The code below creates two schedules for the inference component, covering 2024-2025. The first scales the CopyCount in to zero every Friday at 18:00 UTC+1, and the second restores model capacity every Monday at 07:00 UTC+1. The schedules start on November 29, 2024, end on December 31, 2025, and are deleted after completion.
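
EventBridge cron expressions use six fields (minutes, hours, day-of-month, month, day-of-week, year), and day-of-week is numbered 1 to 7 starting on Sunday, so Friday is 6 and Monday is 2. The sketch below makes that mapping explicit; `weekly_cron` is a hypothetical helper, not part of any AWS SDK.

```python
# Day-of-week numbering in EventBridge cron expressions: 1 = Sunday ... 7 = Saturday
DAY_OF_WEEK = {"SUN": 1, "MON": 2, "TUE": 3, "WED": 4, "THU": 5, "FRI": 6, "SAT": 7}


def weekly_cron(day, hour, minute=0, years="*"):
    """Build a weekly cron expression: cron(minutes hours day-of-month month day-of-week year)."""
    # "?" in day-of-month is required because day-of-week is specified
    return f"cron({minute} {hour} ? * {DAY_OF_WEEK[day]} {years})"


print(weekly_cron("FRI", 18, 0, "2024-2025"))  # cron(0 18 ? * 6 2024-2025)
print(weekly_cron("MON", 7, 0, "2024-2025"))   # cron(0 7 ? * 2 2024-2025)
```

These are equivalent to the `cron(00 18 ? * 6 2024-2025)` and `cron(00 07 ? * 2 2024-2025)` expressions used in the cells that follow.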


#### Weekend Scale-in (Friday Evening)
We start by creating a schedule that scales our endpoint in to zero every Friday at 18:00 UTC+1, starting on November 29, 2024 and ending on December 31, 2025. We need to specify the target API to call, the [UpdateInferenceComponentRuntimeConfig API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateInferenceComponentRuntimeConfig.html) in this case, and the correct parameters for that API.

In [None]:
inference_component_name = final_model.inference_component_name

In [None]:
import boto3
import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 0}, "InferenceComponentName": inference_component_name })
}

# Scale in our endpoint to 0 every Friday at 18:00 UTC+1, starting on November 29, 2024
update_IC_scale_in_schedule = "scale-to-zero-schedule"
scheduler.create_schedule(
    Name=update_IC_scale_in_schedule,
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T17:00:00",
    EndDate="2025-12-31T23:59:59"
)

#### Workweek Scale-out (Monday Morning)
Set up a complementary scheduled action to restore the required model capacity for the inference component every Monday morning at 07:00 UTC+1, so your application is ready for weekday operations.

In [None]:
# Specify the SageMaker target API for the scale out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 2}, "InferenceComponentName": inference_component_name })
}

# Scale out our endpoint every Monday at 07:00 UTC+1
update_IC_scale_out_schedule = "scale-out-schedule"
scheduler.create_schedule(
    Name=update_IC_scale_out_schedule,
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T17:00:00",
    EndDate="2025-12-31T23:59:59"
)

## Schedules using the DeleteInferenceComponent API

The second method is to delete the inference components by calling the [DeleteInferenceComponent API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteInferenceComponent.html). This alternative achieves the same cost saving while completely removing the components from your configuration. The following code creates a scheduled action that automatically deletes the IC every Friday at 18:00 UTC+1 during 2024-2025. It also creates a complementary scheduled action that recreates the IC every Monday at 07:00 UTC+1.

We fetch the `model_name` so that we can recreate our inference component.

In [None]:
res = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
model_name = res['Specification']['ModelName']
print(model_name)
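
One way to keep a recreated inference component consistent with the deleted one is to derive the `CreateInferenceComponent` input from the `describe_inference_component` response instead of retyping the values. The sketch below assumes the response carries `EndpointName`, `VariantName`, and a `Specification` with `ModelName`, `ComputeResourceRequirements`, and optionally `StartupParameters`. The helper name `build_create_ic_input` and the sample values are hypothetical.

```python
def build_create_ic_input(description, copy_count):
    """Build CreateInferenceComponent input from a describe_inference_component response."""
    spec = description["Specification"]
    new_spec = {
        "ModelName": spec["ModelName"],
        "ComputeResourceRequirements": dict(spec["ComputeResourceRequirements"]),
    }
    # StartupParameters are optional, so copy them only when present
    if "StartupParameters" in spec:
        new_spec["StartupParameters"] = dict(spec["StartupParameters"])
    return {
        "EndpointName": description["EndpointName"],
        "InferenceComponentName": description["InferenceComponentName"],
        "VariantName": description["VariantName"],
        "Specification": new_spec,
        "RuntimeConfig": {"CopyCount": copy_count},
    }


# Illustrative response shape with made-up names
sample_description = {
    "InferenceComponentName": "demo-ic",
    "EndpointName": "demo-endpoint",
    "VariantName": "AllTraffic",
    "Specification": {
        "ModelName": "demo-model",
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 104096,
            "NumberOfAcceleratorDevicesRequired": 4,
        },
    },
}
print(build_create_ic_input(sample_description, copy_count=2))

# Usage with the notebook's variables (requires AWS access):
# input_config = build_create_ic_input(res, copy_count=2)
```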

#### Weekend Scale-in (Friday Evening)
The following code creates a scheduled action that automatically deletes the IC every Friday at 18:00 UTC+1 during 2024-2025.

In [None]:
import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name })
}

# Scale in our endpoint by deleting the IC every Friday at 18:00 UTC+1
delete_IC_scale_in_schedule = "scale-to-zero-schedule-1"
scheduler.create_schedule(
    Name=delete_IC_scale_in_schedule,
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T17:00:00",
    EndDate="2025-12-31T23:59:59"
)

#### Workweek Scale-out (Monday Morning)
Create a complementary scheduled action that recreates the IC every Monday at 07:00 UTC+1.

In [None]:
# Specify the SageMaker target API input for the scale out schedule
input_config = {
  "EndpointName": endpoint_name,
  "InferenceComponentName": inference_component_name,
  "RuntimeConfig": {
    "CopyCount": 2
  },
  "Specification": {
    "ModelName": model_name,
    "StartupParameters": {
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    },
    # Match the resources requested at deployment (see resources_required above)
    "ComputeResourceRequirements": {
      "MinMemoryRequiredInMb": 104096,
      "NumberOfAcceleratorDevicesRequired": 4
    }
  },
  "VariantName": variant_name
}

scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config)
}

# Scale out our endpoint by recreating the IC every Monday at 07:00 UTC+1
delete_IC_scale_out_schedule = "scale-out-schedule-1"
scheduler.create_schedule(
    Name=delete_IC_scale_out_schedule,
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T17:00:00",
    EndDate="2025-12-31T23:59:59"
)

### Note: 

To schedule scale-to-zero on an endpoint with multiple inference components (ICs), every IC must either be set to zero copies or deleted. You can also automate this by using EventBridge Scheduler to trigger a Lambda function that handles either the deletion or the zero-setting of all ICs.
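
A minimal sketch of such a Lambda function is shown below, using the zero-setting variant. It lists the inference components on one endpoint with `list_inference_components` (filtered by `EndpointNameEquals`) and sets each copy count to zero. The function names and the `endpoint_name` event key are hypothetical, and error handling is omitted.

```python
def scale_endpoint_to_zero(client, endpoint_name):
    """Set CopyCount to 0 for every inference component on the endpoint."""
    scaled = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = client.list_inference_components(**kwargs)
        for ic in page["InferenceComponents"]:
            name = ic["InferenceComponentName"]
            client.update_inference_component_runtime_config(
                InferenceComponentName=name,
                DesiredRuntimeConfig={"CopyCount": 0},
            )
            scaled.append(name)
        # Follow pagination until no NextToken is returned
        token = page.get("NextToken")
        if not token:
            return scaled
        kwargs["NextToken"] = token


def lambda_handler(event, context):
    """Lambda entry point; expects {"endpoint_name": "..."} in the event (assumed shape)."""
    import boto3  # imported here so the helper above can be tested without AWS access

    client = boto3.client("sagemaker")
    return {"scaled": scale_endpoint_to_zero(client, event["endpoint_name"])}
```

The schedule's target would then be the Lambda function rather than the SageMaker API, with the endpoint name passed in the target input.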

## Optionally clean up the environment

- Delete schedules

In [None]:
# Delete all the schedules created above
schedules = [delete_IC_scale_out_schedule, delete_IC_scale_in_schedule, update_IC_scale_out_schedule, update_IC_scale_in_schedule]

for schedule in schedules:
    try:
        scheduler.delete_schedule(Name=schedule)
        print(f"Deleted schedule {schedule} ✅")
    except scheduler.exceptions.ResourceNotFoundException:
        print(f"Schedule {schedule} not found.")

- Delete the inference component, the endpoint, and the endpoint configuration

In [None]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)