# Overview

This adapts the [notebook](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/scale-to-zero-endpoint/llama3-8b-scale-to-zero-autoscaling.ipynb) for scale-to-zero to *focus on deploying a basic model through the SageMaker endpoint workflow*.

It is designed to make scale-to-zero work with traditional SageMaker endpoints, which do not have inference components by default.

**Core Issue**

The “scale-to-zero” autoscaling setting `ManagedInstanceScaling` introduced in the AWS blog does not work on endpoint *models*. Rather, it integrates only with *inference components*, which are a new and separate abstraction from the rest of the SageMaker SDK. This leads to:

- No integration with traditional SageMaker inference endpoints
- No automatic integration with CloudWatch
- No automatic integration with auto scaling
- Broken integration with the AWS SageMaker Console.

**Solution Overview**

- Create an endpoint config and an endpoint *without* a model
- Create a model without deploying it to an endpoint
- Create an inference component from the model
- Attach the *inference component* to the endpoint
- Manually configure CloudWatch and autoscaling
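
As a rough map, the steps above correspond to the calls used in the rest of this notebook. The `DEPLOYMENT_STEPS` list and the `format_plan` helper are illustrative names only and are not part of any SDK.

```python
# Illustrative only: pairs each step above with the call this notebook uses for it.
DEPLOYMENT_STEPS = [
    ("Create endpoint config without a model", "sagemaker_client.create_endpoint_config"),
    ("Create empty endpoint", "sagemaker_client.create_endpoint"),
    ("Create model without deploying", "PyTorchModel(...)"),
    ("Create inference component on the endpoint", "sagemaker_client.create_inference_component"),
    ("Register autoscaling target and policies", "aas_client.register_scalable_target / put_scaling_policy"),
    ("Create CloudWatch alarm", "cloudwatch_client.put_metric_alarm"),
]


def format_plan(steps):
    """Return one numbered, human-readable line per (description, call) pair."""
    return [f"{i}. {desc} -> {call}" for i, (desc, call) in enumerate(steps, start=1)]


for line in format_plan(DEPLOYMENT_STEPS):
    print(line)
```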

**Session Overhead**

In [None]:
from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
import sagemaker
import boto3

# NOTE: Still using the us-west-2 region for the GPU.
# The region must be specified explicitly when not operating from it.
boto_session = boto3.Session(region_name='us-west-2')
sagemaker_session = sagemaker.Session(boto_session=boto_session)
sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")
role = sagemaker.get_execution_role()
# Use the public attribute rather than the private _region_name
assert sagemaker_session.boto_region_name == 'us-west-2'
model_bucket = sagemaker_session.default_bucket()  # bucket to house model artifacts

**Create Endpoint Config**

The endpoint *config* is separate from the endpoint object. Normally, the `model.deploy()` function creates this config behind the scenes. You can also create an endpoint config from the AWS console. The code below creates the config programmatically through the boto3 SageMaker client.

The important part of this config is the `ManagedInstanceScaling` block, which enables the new “scale-to-zero” feature. The snippet closely follows the reference blog.

In [None]:
#### Create Endpoint Config
prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")
variant_name = "AllTraffic"
# Large A10G machine to process pipeline
# 16 vCPUs
instance_type = "ml.g5.4xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0 # Minimum instance must be set to 0
max_instance_count = 1

# Config creation function
sagemaker_client.create_endpoint_config(
    EndpointConfigName='scale-to-zero-endpoint-config',
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name, # NOTE: ModelName in config will NOT work
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

The SageMaker docs say that you can add `ModelName` directly to the [EndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html).

This is true if you are deploying a normal endpoint to SageMaker. However, when the config also uses newer keys like `ManagedInstanceScaling`, you will receive this error about the IAM role:

`ExecutionRoleArn is not supported with the given EndpointConfig setup.`
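
To catch this before calling the API, a variant can be checked locally. This is a minimal sketch based only on the observations in this notebook. `check_variant_for_scale_to_zero` is a hypothetical helper, and its rules are not a complete copy of the service-side validation.

```python
def check_variant_for_scale_to_zero(variant: dict) -> list:
    """Return a list of problems found in one ProductionVariants entry.

    The rules mirror this notebook's findings only (assumption, not the full API contract).
    """
    problems = []
    scaling = variant.get("ManagedInstanceScaling", {})
    if scaling.get("Status") != "ENABLED":
        problems.append("ManagedInstanceScaling is not ENABLED")
    if scaling.get("MinInstanceCount") != 0:
        problems.append("MinInstanceCount must be 0 for scale-to-zero")
    if "ModelName" in variant:
        problems.append("ModelName must be omitted; attach the model via an inference component")
    return problems


good_variant = {
    "VariantName": "AllTraffic",
    "InstanceType": "ml.g5.4xlarge",
    "ManagedInstanceScaling": {"Status": "ENABLED", "MinInstanceCount": 0, "MaxInstanceCount": 1},
}
bad_variant = dict(good_variant, ModelName="example-model-name")

print(check_variant_for_scale_to_zero(good_variant))
print(check_variant_for_scale_to_zero(bad_variant))
```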

**Create Empty Endpoint**

Now that we have the endpoint config, we can deploy an endpoint from it.

In [None]:
endpoint_name = f"{prefix}-example-endpoint"
print(f"Endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName='scale-to-zero-endpoint-config',  # must match the config created above
)

Note that this endpoint does not have a model. After it deploys, it will even scale up to 1 instance, but you cannot perform inference on it.

One would think you can update the endpoint by adding a model to it with the SageMaker [`UpdateEndpoint` API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html).

This yields the error:
`"ClientError: An error occurred (ValidationException) when calling the UpdateEndpoint operation: Cannot update in-progress endpoint"`
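
The message suggests the endpoint was still being created when the update was attempted. Whatever the cause, later steps are safer once the endpoint reports `InService`. Below is a minimal polling sketch that reads the `EndpointStatus` field returned by `describe_endpoint`. `wait_for_endpoint` is a hypothetical helper, and its polling interval and timeout are arbitrary assumptions.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    waited = 0
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage (requires AWS credentials, so it is left commented out):
# wait_for_endpoint(sagemaker_client, endpoint_name)
```

boto3 also ships a built-in waiter for the same purpose: `sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`.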


**Create SageMaker Model**

These steps are similar to our normal workflow, with one exception: we do not call `model.deploy()`.

The objective is to create a `Model` object without deploying it to a specific endpoint.

In [None]:
# NOTE: re-upload model.tar.gz if the inference code changes
model_data = sagemaker_session.upload_data(
    path="model.tar.gz",
    bucket=sagemaker_session.default_bucket(),
    key_prefix="models/model"
)
    
model = PyTorchModel(
    model_data=model_data,
    role=get_execution_role(),
    entry_point="inference.py",
    source_dir="code",
    sagemaker_session=sagemaker_session,
    framework_version="2.3",
    env={"PYTHONUNBUFFERED": "1"},
    py_version="py311",
    # transformers_version="4.37",
    code_location=f"s3://{sagemaker_session.default_bucket()}/code",
)

## DO NOT DEPLOY
# model.deploy(
#     instance_type='ml.g5.4xlarge',
#     endpoint_name="example-endpoint-name",
#     initial_instance_count=1
# )
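
The inference component in the next section refers to the model by `ModelName`, so the model has to exist as a SageMaker resource first. One way to register it without creating an endpoint is the SDK's `Model.create()` method. This is a minimal sketch that assumes `create()` fills in `model.name`. `register_model` is a hypothetical helper, and it is an assumption that the model in this workflow was registered this way.

```python
def register_model(model, instance_type):
    """Register the model with SageMaker without creating an endpoint.

    Returns the model name to pass as ModelName to an inference component.
    """
    # Model.create() registers the model resource only; no endpoint is created.
    model.create(instance_type=instance_type)
    return model.name


# Usage (requires AWS credentials, so it is left commented out):
# model_name = register_model(model, "ml.g5.4xlarge")
# print(f"Registered model: {model_name}")
```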

**Create Inference Components**

This is what took the longest for me to figure out. 

SageMaker inference components are a new version of endpoint provisioning designed for LLM [deployments](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).

They don’t follow the traditional `Model + EndpointConfig -> Inference Endpoint` setup that SageMaker uses otherwise. Instead, they are additional units attached to an existing endpoint, each of which you invoke for inference by name. They appear to be useful for hosting multiple models on one endpoint.

Importantly for this workflow, the new `ManagedInstanceScaling` part of the endpoint config expects to scale inference components, NOT endpoint models.

In [None]:
##### Create Inference Components
sagemaker_client.create_inference_component(
    InferenceComponentName="scale-to-zero-inference-component",
    EndpointName=endpoint_name,  # the empty endpoint created above
    VariantName=variant_name,  # "AllTraffic", as set in the endpoint config
    Specification={
        "ModelName": "example-model-name",  # must match the name of a model registered in SageMaker
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
# Describe the inference component to check its status
response = sagemaker_client.describe_inference_component(
    InferenceComponentName="scale-to-zero-inference-component"
)
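
The `ComputeResourceRequirements` values limit how many copies fit on one instance. The sketch below estimates that upper bound with simple integer division. `max_copies_per_instance` is a hypothetical helper, and the instance figures (1 GPU, 16 vCPUs, 64 GiB for `ml.g5.4xlarge`) are passed in as assumptions. Real placement may also reserve some capacity for overhead, so treat the result as a ceiling only.

```python
def max_copies_per_instance(instance_gpus, instance_vcpus, instance_memory_mb,
                            req_gpus, req_cpu_cores, req_memory_mb):
    """Upper bound on inference component copies that fit on a single instance."""
    limits = []
    if req_gpus > 0:
        limits.append(instance_gpus // req_gpus)
    if req_cpu_cores > 0:
        limits.append(instance_vcpus // req_cpu_cores)
    if req_memory_mb > 0:
        limits.append(instance_memory_mb // req_memory_mb)
    # The scarcest resource decides how many copies fit.
    return min(limits) if limits else 0


# Assumed ml.g5.4xlarge capacity: 1 GPU, 16 vCPUs, 64 GiB of memory.
copies = max_copies_per_instance(
    instance_gpus=1, instance_vcpus=16, instance_memory_mb=64 * 1024,
    req_gpus=1, req_cpu_cores=2, req_memory_mb=1024,
)
print(copies)  # 1: the single GPU is the limiting resource
```

Under these assumptions, and with `MaxInstanceCount` set to 1 in the endpoint config, a copy count above 1 could not be placed without raising the instance limit.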

**Create AutoScaling Policies and Log Groups for Inference Component**

- Inference components do not have a log group created automatically, so we have to create them.
- You cannot configure autoscaling for inference components from the AWS console, so you have to do it programmatically, too. 

In [None]:
aas_client = sagemaker_session.boto_session.client("application-autoscaling", region_name='us-west-2')
cloudwatch_client = sagemaker_session.boto_session.client("cloudwatch", region_name='us-west-2')
inference_component_name = 'scale-to-zero-inference-component'

# Autoscaling parameters
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"
min_copy_count = 0
max_copy_count = 2

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=min_copy_count,
    MaxCapacity=max_copy_count,
)

aas_client.describe_scalable_targets(
    ServiceNamespace=service_namespace,
    ResourceIds=[resource_id],
    ScalableDimension=scalable_dimension,
)

# The policy name for the target tracking policy (must differ from the step scaling policy name)
target_tracking_policy_name = f"target-tracking-{inference_component_name}"

aas_client.put_scaling_policy(
    PolicyName=target_tracking_policy_name,
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        # Low TPS + load TPS
        "TargetValue": 5,  # you need to adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)


# The policy name for the step scaling policy (must differ from the target tracking policy name)
step_scaling_policy_name = f"step-scaling-{inference_component_name}"

aas_client.put_scaling_policy(
    PolicyName=step_scaling_policy_name,
    PolicyType="StepScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments":
          [
             {
               "MetricIntervalLowerBound": 0,
               "ScalingAdjustment": 1
             }
          ]
    },
)

resp = aas_client.describe_scaling_policies(
    PolicyNames=[step_scaling_policy_name],
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)
step_scaling_policy_arn = resp['ScalingPolicies'][0]['PolicyARN']
print(f"step_scaling_policy_arn: {step_scaling_policy_arn}")

# The alarm name for the step scaling alarm
step_scaling_alarm_name = f"console-endpoint-{inference_component_name}"

cloudwatch_client.put_metric_alarm(
    AlarmName=step_scaling_alarm_name,
    AlarmActions=[step_scaling_policy_arn],  # ARN of the step scaling policy created above
    MetricName='NoCapacityInvocationFailures',
    Namespace='AWS/SageMaker',
    Statistic='Maximum',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name  # the inference component to monitor
        }
    ],
    Period=30, # Set a lower period 
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)
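
With `Period=30`, `EvaluationPeriods=1`, `DatapointsToAlarm=1` and `Threshold=1`, the alarm is meant to fire as soon as a single 30-second datapoint records a `NoCapacityInvocationFailures` value of 1 or more, which in turn triggers the step scaling policy to add a copy. The sketch below is a simplified model of that "M out of N" evaluation. `alarm_fires` is a hypothetical helper, and it deliberately ignores CloudWatch's missing-data handling by treating missing datapoints (`None`) as not breaching.

```python
def alarm_fires(datapoints, threshold=1, evaluation_periods=1, datapoints_to_alarm=1):
    """Simplified M-out-of-N check using GreaterThanOrEqualToThreshold.

    datapoints: per-period Maximum values, oldest first; None means no data.
    """
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value is not None and value >= threshold)
    return breaching >= datapoints_to_alarm


print(alarm_fires([0, 0, 1]))  # latest period saw a failed invocation
print(alarm_fires([1, 0]))     # latest period is healthy again
```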

**Endpoint Monitoring**

You can use these Python commands to monitor aspects of the endpoint:

Describing the inference component:

In [None]:
sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
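
The raw responses are verbose, so it can help to pull out only the scaling-related fields. This is a minimal sketch. `summarize_scaling` is a hypothetical helper, the field names are the ones these two describe calls are expected to return, and the sample dictionaries are made-up values for illustration.

```python
def summarize_scaling(component_desc, endpoint_desc):
    """Extract status and copy/instance counts from the two describe responses."""
    runtime = component_desc.get("RuntimeConfig", {})
    variants = endpoint_desc.get("ProductionVariants", [])
    return {
        "component_status": component_desc.get("InferenceComponentStatus"),
        "current_copies": runtime.get("CurrentCopyCount"),
        "desired_copies": runtime.get("DesiredCopyCount"),
        "endpoint_status": endpoint_desc.get("EndpointStatus"),
        "instances": [
            (v.get("VariantName"), v.get("CurrentInstanceCount"), v.get("DesiredInstanceCount"))
            for v in variants
        ],
    }


# Made-up sample responses, trimmed to the fields used above.
sample_component = {
    "InferenceComponentStatus": "InService",
    "RuntimeConfig": {"CurrentCopyCount": 0, "DesiredCopyCount": 0},
}
sample_endpoint = {
    "EndpointStatus": "InService",
    "ProductionVariants": [
        {"VariantName": "AllTraffic", "CurrentInstanceCount": 0, "DesiredInstanceCount": 0}
    ],
}
print(summarize_scaling(sample_component, sample_endpoint))
```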

**Important Notes on the console:** 

The console will show an error stating that your endpoint does not have a `ModelName` (which is true, it has an inference component instead). There does not seem to be support for inference components in the dashboard.

The “Desired Instance Count” should reflect the actual autoscaling and billing accurately.

You cannot configure autoscaling in the console. This is because `ManagedInstanceScaling` is NOT actually “Auto Scaling” in SageMaker. It is a different API, geared towards inference components, not endpoint models.


**Endpoint Invocation**

In [None]:
import json
import boto3

# The keyword argument is region_name, not region
runtime = boto3.client('sagemaker-runtime', region_name='us-west-2')
payload = {
    "key": "value"
}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload),
    InferenceComponentName=inference_component_name,  # must specify inference component name
)
# The response body is a stream; read and decode it to see the result
print(response['Body'].read().decode('utf-8'))
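
When the inference component has scaled down to zero copies, the first requests are expected to fail until the step scaling policy adds a copy, which is what the `NoCapacityInvocationFailures` alarm is for. A client can absorb this with a simple retry loop. This is a minimal sketch. `invoke_with_retry` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions that should be tuned to the model's actual cold-start time.

```python
import time


def invoke_with_retry(invoke, max_attempts=10, delay_seconds=30,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call invoke() until it succeeds, retrying on the given exception types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)


# Usage (requires AWS credentials, so it is left commented out):
# from botocore.exceptions import ClientError
# response = invoke_with_retry(
#     lambda: runtime.invoke_endpoint(
#         EndpointName=endpoint_name,
#         ContentType="application/json",
#         Body=json.dumps(payload),
#         InferenceComponentName=inference_component_name,
#     ),
#     retry_on=(ClientError,),
# )
```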