# Deploy Multiple SOTA LLMs on a single endpoint with Scale-to-Zero

This notebook demonstrates how to scale your SageMaker endpoint in to zero instances during idle periods, which removes the previous requirement to keep at least one instance running.

The new [Scaling to Zero feature](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/) expands the possibilities for managing SageMaker Inference endpoints. It allows customers to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. Using this feature customers can closely match their compute resource usage to their actual needs, potentially reducing costs during times of low demand. This enhancement builds upon SageMaker's existing auto-scaling capabilities, offering more granular control over resource allocation. Customers can now configure their scaling policies to include scaling to zero, allowing for more precise management of their AI inference infrastructure. 

The Scaling to Zero feature presents new opportunities for how businesses can approach their cloud-based machine learning operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, customers are encouraged to carefully evaluate how it fits into their overall architecture and operational needs, considering factors such as response times and the specific requirements of their applications.

#### Determining When to Scale Down to Zero

SageMaker's scale-to-zero capability is ideal for three scenarios:

1. **Predictable traffic patterns:** If your inference traffic is predictable and follows a consistent schedule, you can use this scaling functionality to automatically scale in to zero during periods of low or no usage. This eliminates the need to manually delete and recreate inference components/endpoints.

2. **Sporadic workloads:** For applications that experience sporadic or variable inference traffic patterns, scaling in to zero instances can provide significant cost savings. However, it's important to note that scaling out from zero instances to serving traffic is not instantaneous. During the scale-out process, any requests sent to the endpoint will fail, and these "NoCapacityInvocationFailures" will be captured in CloudWatch.

3. **Development and testing:** The scale-to-zero functionality is also beneficial when testing and evaluating new machine learning models. During model development and experimentation, you may create temporary inference endpoints to test different configurations. However, it's easy to forget to delete these endpoints when you're done. Scaling to zero ensures these test endpoints automatically scale in to zero instances when not in use, preventing unwanted charges. This allows you to freely experiment without closely monitoring infrastructure usage or remembering to manually delete endpoints. The automatic scaling to zero provides a cost-effective way to test out ideas and iterate on your machine learning solutions.
   
**Note:** Scale-to-zero is only supported when using inference components. For more information on inference components, see the blog post "[Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)".


## Set up

- Fetch and import dependencies
- Initialize SageMaker environment and required clients to access AWS services

In [None]:
%pip install sagemaker==2.245.0 --upgrade --quiet --no-warn-conflicts

In [None]:
import time
import sys
import json
import sagemaker
import boto3
from IPython.display import display, Markdown

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker version: {sagemaker.__version__}")

## Set up your SageMaker Real-time Endpoint

In [None]:
CONTAINER_VERSION = "0.34.0-lmi16.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

instance = {"type": "ml.g5.12xlarge", "num_gpu": 4}
endpoint_config_name = endpoint_name = sagemaker.utils.name_from_base("lab3", short = True)
timeout = 600
variant_name = "main"

In [None]:
lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
}

### Create a SageMaker endpoint configuration and Endpoint

We configure a few parameters for the endpoint. We begin by creating the endpoint configuration with `MinInstanceCount` set to 0, which allows the endpoint to scale in all the way to zero instances when not in use (see the [blog](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/)). In addition, we enable Managed Instance Scaling, which lets SageMaker scale the number of instances based on the scaling needs of your inference components. Lastly, we set *RoutingStrategy* so that the endpoint routes requests to instances and inference components for the best performance.

In [None]:
endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance["type"],
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
            "RoutingConfig": {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ManagedInstanceScaling': {
                'Status': 'ENABLED',
                'MinInstanceCount': 0,
                'MaxInstanceCount': 2
            },
        },
    ],
)
endpoint = sm_client.create_endpoint(EndpointName = endpoint_name,
                                     EndpointConfigName = endpoint_config_name)
_ = sess.wait_for_endpoint(endpoint_name)

### Model deployment

#### Qwen/Qwen3-4B

We will use 1 GPU on the endpoint

In [None]:
qwen_env = {
    "HF_MODEL_ID": "Qwen/Qwen3-4B",
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
}
qwen_model_name = sagemaker.utils.name_from_base("qwen", short=True)
qwen_ic_name = f"ic-{qwen_model_name}"

min_memory_required_in_mb = 4096
number_of_accelerator_devices_required = 1

model_response = sm_client.create_model(
    ModelName = qwen_model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image,
        "Environment": qwen_env | lmi_env,
    },
)

ic_response = sm_client.create_inference_component(
    InferenceComponentName = qwen_ic_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": qwen_model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": 1,
    },
)
_ = sess.wait_for_inference_component(qwen_ic_name)

#### Inference test

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 InferenceComponentName = qwen_ic_name,
                                 Body = json.dumps(payload),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

## Configure per model Autoscaling and Enable Cost-Saving Autoscaling with Scale-to-Zero

### Scaling policies 

Once our models are deployed and InService, we can then add the necessary scaling policies:

* A [target tracking](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) policy that scales the copy count of our inference component in to zero, and between 1 and n copies.
* A [step scaling policy](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) that allows the endpoint to scale out from zero.

These policies work together to provide cost-effective scaling - the endpoint can scale to zero when idle and automatically scale out as needed to handle incoming requests.

### Autoscaling Helper Function

> **Note:** For the target tracking policy, Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 30 seconds (using 3 sub-minute data points), while the second triggers scale-in after 15 minutes (using 90 sub-minute data points). The scaling action usually starts 1 to 2 minutes later than these thresholds, because the endpoint needs time to publish metrics to CloudWatch and Auto Scaling needs time to react.
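
The alarm timings in the note above follow from the number of data points each alarm evaluates. Here is a quick arithmetic sketch, assuming the high-resolution metric publishes one data point every 10 seconds. That period is inferred from the figures in the note, not stated in this notebook, and the `alarm_window_seconds` helper is hypothetical.

```python
# Assumption: one data point every 10 seconds, inferred from
# "3 data points = 30 seconds" in the note above.
METRIC_PERIOD_SECONDS = 10


def alarm_window_seconds(datapoints, period_seconds=METRIC_PERIOD_SECONDS):
    """Time an alarm needs to collect `datapoints` consecutive data points."""
    return datapoints * period_seconds


scale_out_window = alarm_window_seconds(3)   # scale-out alarm
scale_in_window = alarm_window_seconds(90)   # scale-in alarm

print(f"scale-out alarm window: {scale_out_window} s")
print(f"scale-in alarm window: {scale_in_window} s ({scale_in_window / 60:.0f} min)")
```

The extra 1 to 2 minutes mentioned in the note come on top of these windows.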


In [None]:
aas_client = boto3.client("application-autoscaling")
cloudwatch_client = boto3.client("cloudwatch")

In [None]:
def setup_autoscaling(
    inference_component_name,
    min_capacity=0,
    max_capacity=2,
    target_requests_per_copy=3
):
    """
    Configure complete autoscaling: scale-to-zero + scale-out

    Args:
        inference_component_name (str): Name of the inference component to configure
        min_capacity (int): Minimum number of model copies (set to 0 for scale-to-zero)
        max_capacity (int): Maximum number of model copies this component can scale to
        target_requests_per_copy (int): Concurrent requests threshold per copy before scaling out
    """

    resource_id = f"inference-component/{inference_component_name}"

    # 1. Register autoscaling target
    aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        MinCapacity=min_capacity,  # Configurable minimum (0 enables scale-to-zero)
        MaxCapacity=max_capacity,  # Configurable maximum
    )

    # 2. Target tracking policy (scales min_capacity+1 → max_capacity)
    aas_client.put_scaling_policy(
        PolicyName=f"target-tracking-{inference_component_name}",
        PolicyType="TargetTrackingScaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
            },
            "TargetValue": target_requests_per_copy,
        },
    )

    # 3. Step scaling policy (only needed if min_capacity = 0)
    step_policy_response = aas_client.put_scaling_policy(
        PolicyName=f"step-scaling-{inference_component_name}",
        PolicyType="StepScaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}]
        },
    )

    # CloudWatch alarm for scale-out from zero
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"scale-from-zero-{inference_component_name}",
        AlarmActions=[step_policy_response['PolicyARN']],
        MetricName='NoCapacityInvocationFailures',
        Namespace='AWS/SageMaker',
        Statistic='Maximum',
        Dimensions=[{'Name': 'InferenceComponentName', 'Value': inference_component_name}],
        Period=30,
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=1,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
    )

    print(f"✅ Autoscaling configured for {inference_component_name}")

### Setup autoscaling for Qwen model

In [None]:
# Apply autoscaling configuration
setup_autoscaling(qwen_ic_name)

## Testing scale to Zero behaviour

### IC copy count scales in to zero
We'll pause for 15 minutes without making any invocations to our model. Based on our target tracking policy, when our SageMaker endpoint doesn't receive requests for about 10 to 15 minutes, it automatically scales the number of model copies in to zero.

In [None]:
time.sleep(900)
start_time = time.time()
while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=qwen_ic_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sm_client.describe_inference_component(InferenceComponentName=qwen_ic_name)
print(desc)

### Endpoint's instances scale in to zero

After 10 additional minutes of inactivity, SageMaker automatically terminates all underlying instances of the endpoint, eliminating all associated costs.

In [None]:
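# after 10 minutes the instances will scale in to 0
time.sleep(600)
# wait for any in-progress endpoint update to finish
sess.wait_for_endpoint(endpoint_name)
# verify whether CurrentInstanceCount is zero
desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
for variant in desc["ProductionVariants"]:
    print(f"{variant['VariantName']}: CurrentInstanceCount = {variant.get('CurrentInstanceCount')}")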

#### Invoke Qwen model with a sample prompt

If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error: `An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.`
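
A client that has to tolerate this error can retry while the endpoint scales out. The block below is a minimal sketch of retry with exponential backoff, not part of the SageMaker SDK. The `invoke_with_retry` helper and its parameters are hypothetical, and matching on the "no capacity" text is an assumption based on the error message quoted above. A stand-in function replaces the real endpoint call so the example runs without AWS access.

```python
import time


def invoke_with_retry(invoke, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `invoke()` until it succeeds or the attempts run out.

    `invoke` is any zero-argument callable, for example a lambda that wraps
    smr_client.invoke_endpoint(...). The delay doubles after each failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception as err:
            # Assumption: capacity errors are recognised by their message text.
            is_capacity_error = "no capacity" in str(err).lower()
            if not is_capacity_error or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for the endpoint: fails twice with a capacity error, then succeeds.
calls = {"n": 0}


def fake_invoke():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Inference Component has no capacity to process this request.")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = invoke_with_retry(fake_invoke, sleep=delays.append)
print(result, delays)
```

Because provisioning an instance and loading a model takes considerably longer than a few seconds, a real client would likely need a much larger `base_delay` than the value used here.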

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                     InferenceComponentName = qwen_ic_name,
                                     Body = json.dumps(payload),
                                     ContentType = "application/json")
    response = json.loads(res["Body"].read().decode("utf8"))
    display(Markdown(response["choices"][0]["message"]["content"]))

except Exception as e:
    print(f"   Reason: {str(e)}")

### Scale out from zero kicks in
However, after about 1 minute, our step scaling policy should kick in. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests. This demonstrates the endpoint's ability to automatically scale out from zero when needed.

In [None]:
start_time = time.time()
while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=qwen_ic_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

#### Verify that our endpoint has successfully scaled out from zero

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                     InferenceComponentName = qwen_ic_name,
                                     Body = json.dumps(payload),
                                     ContentType = "application/json")
    response = json.loads(res["Body"].read().decode("utf8"))
    display(Markdown(response["choices"][0]["message"]["content"]))

except Exception as e:
    print(f"   Reason: {str(e)}")

## Testing scaling behavior 1->n

### Helper functions

In [None]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_prediction(endpoint_name, payload, request_id):
    """Make a single prediction and return results with timing"""
    start_time = time.time()
    try:
        res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                         InferenceComponentName = qwen_ic_name,
                                         Body = json.dumps(payload),
                                         ContentType = "application/json")
        response = json.loads(res["Body"].read().decode("utf8"))
        duration = time.time() - start_time
        return {
            'request_id': request_id,
            'success': True,
            'duration': duration,
            'response': response['choices'][0]['message']['content'][:50] + "..." if 'choices' in response else response[0]['generated_text'][:50] + "..."
        }
    except Exception as e:
        duration = time.time() - start_time
        return {
            'request_id': request_id,
            'success': False,
            'duration': duration,
            'error': str(e)
        }

def get_current_copy_count(inference_component_name):
    """Get the current number of inference component copies"""
    try:
        response = sm_client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )
        return response['RuntimeConfig']['CurrentCopyCount']
    except Exception as e:
        return "Error"

### Autoscaling demonstration

In the following code we will run 6 iterations with 6 concurrent requests each.

What we will observe:
- Initial low component count
- Concurrent load being applied
- Component count increasing as autoscaling kicks in
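
To see why this load should push the component toward two copies, here is a back-of-the-envelope sketch. It assumes target tracking aims for roughly `ceil(concurrent_requests / target_requests_per_copy)` copies, clamped to the registered capacity range. This is a simplified mental model, not the exact algorithm Application Auto Scaling runs, and the `expected_copies` helper is hypothetical. The defaults mirror the values passed to `setup_autoscaling` earlier.

```python
import math


def expected_copies(concurrent_requests, target_per_copy=3, min_capacity=0, max_capacity=2):
    """Rough estimate of the copy count target tracking would settle on."""
    wanted = math.ceil(concurrent_requests / target_per_copy)
    # Never go below the registered minimum or above the registered maximum.
    return max(min_capacity, min(max_capacity, wanted))


for load in (0, 3, 6, 12):
    print(f"{load:>2} concurrent requests -> {expected_copies(load)} copies")
```

With 6 concurrent requests and a target of 3 per copy, the estimate lands on the maximum of 2 copies, which is what the check after the load test looks for.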

In [None]:
num_iterations = 6
requests_per_iteration = 6
wait_between_iterations = 10

print("Starting Inference load")
print("="*50)

for iteration in range(1, num_iterations + 1):
    print(f"\nIteration {iteration}/{num_iterations}")

    # Show current inference component count
    current_copies = get_current_copy_count(qwen_ic_name)
    print(f"Current Inference Components: {current_copies}")

    # Make concurrent requests
    print(f"Sending {requests_per_iteration} concurrent requests...")
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=requests_per_iteration) as executor:
        futures = [
            executor.submit(make_prediction, endpoint_name, payload, f"{iteration}-{i+1}")
            for i in range(requests_per_iteration)
        ]

        results = [future.result() for future in as_completed(futures)]

    # Summary
    successful = sum(1 for r in results if r['success'])
    total_time = time.time() - start_time

    if successful > 0:
        avg_response_time = sum(r['duration'] for r in results if r['success']) / successful
        print(f"✅ Results: {successful}/{requests_per_iteration} successful")
        print(f"⏱️ Total: {total_time:.1f}s, Average response: {avg_response_time:.1f}s")
    else:
        print(f"❌ All requests failed")

    # Show any failures
    failures = [r for r in results if not r['success']]
    if failures:
        print(f"⚠️  {len(failures)} requests failed")

    # Wait between iterations (except last one)
    if iteration < num_iterations:
        print(f"⏳ Waiting {wait_between_iterations}s before next iteration...")
        time.sleep(wait_between_iterations)

print(f"\n Inference load complete!")

#### Check that our model has scaled to 2 copies

> **Note:** You can rerun the cell above and increase the number of iterations.

In [None]:
while True:
    desc = sm_client.describe_inference_component(InferenceComponentName=qwen_ic_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

current_copies = get_current_copy_count(qwen_ic_name)
print(f"Current Inference Components: {current_copies}")

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                     InferenceComponentName = qwen_ic_name,
                                     Body = json.dumps(payload),
                                     ContentType = "application/json")
    response = json.loads(res["Body"].read().decode("utf8"))
    display(Markdown(response["choices"][0]["message"]["content"]))

except Exception as e:
    print(f"   Reason: {str(e)}")

### Note: 
**If you do not have a multi-GPU instance, you can skip to the cleanup section**

## (OPTIONAL MULTI-GPUs INSTANCE) Deploy gpt-oss-20b model on the same endpoint

Now we will deploy the gpt-oss-20b model on the same endpoint that we created earlier for the Qwen model.
> **Please note that you require a multi-GPU machine for this to work**

In [None]:
gptoss_env = {
    "HF_MODEL_ID": "openai/gpt-oss-20b",
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
}
gptoss_model_name = sagemaker.utils.name_from_base("gpt-oss", short=True)
gptoss_ic_name = f"ic-{gptoss_model_name}"

min_memory_required_in_mb = 4096
number_of_accelerator_devices_required = 1

model_response = sm_client.create_model(
    ModelName = gptoss_model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image,
        "Environment": gptoss_env | lmi_env,
    },
)

ic_response = sm_client.create_inference_component(
    InferenceComponentName = gptoss_ic_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": gptoss_model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": 1,
    },
)
_ = sess.wait_for_inference_component(gptoss_ic_name)

#### Test inference

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 InferenceComponentName = gptoss_ic_name,
                                 Body = json.dumps(payload),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

### Setup autoscaling for GPT-OSS model

In [None]:
# Apply autoscaling configuration
setup_autoscaling(gptoss_ic_name)

In [None]:
current_copies = get_current_copy_count(gptoss_ic_name)
print(f"Current Inference Components: {current_copies}")

### Understanding Model Copies and scale-to-zero

**Important:** For an endpoint with multiple inference components to scale to zero instances, every component must either be scaled to 0 copies or deleted.

**Model Copy**: One loaded instance of your model in GPU memory
- Copies can share GPU instances (depending on model size and available accelerators)

**Instance**: The underlying compute resource (e.g., ml.g5.2xlarge)
- SageMaker automatically manages instances based on copy demands
- Multiple small model copies can share one instance
- Large model copies might need dedicated instances
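
The relationship between copies and instances can be sketched with simple arithmetic. The block below is a rough estimate that counts accelerators only. It assumes the 4-GPU instance type and the one-accelerator-per-copy setting used earlier in this notebook, and it ignores memory requirements and SageMaker's actual placement logic. The `instances_needed` helper is hypothetical.

```python
import math


def instances_needed(copies_per_component, accelerators_per_copy=1, gpus_per_instance=4):
    """Estimate instance count from accelerator demand alone."""
    total_accelerators = sum(copies_per_component.values()) * accelerators_per_copy
    return math.ceil(total_accelerators / gpus_per_instance)


one_copy_each = instances_needed({"qwen": 1, "gpt-oss": 1})
max_copies_each = instances_needed({"qwen": 2, "gpt-oss": 2})
all_scaled_in = instances_needed({"qwen": 0, "gpt-oss": 0})

print(one_copy_each, max_copies_each, all_scaled_in)
```

The last case mirrors the rule stated above: the estimate only reaches zero instances once every component is at zero copies.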

In [None]:
!aws sagemaker describe-inference-component --inference-component-name $gptoss_ic_name

In [None]:
!aws sagemaker describe-endpoint --endpoint-name $endpoint_name

# Clean up the environment

- Deregister scalable target
- Delete cloudwatch alarms
- Delete scaling policies

In [None]:
def cleanup_autoscaling(inference_component_names, aas_client, cloudwatch_client):
    """Clean up autoscaling resources for workshop"""

    if isinstance(inference_component_names, str):
        inference_component_names = [inference_component_names]

    print("🧹 Cleaning up autoscaling resources...")

    for ic_name in inference_component_names:
        resource_id = f"inference-component/{ic_name}"

        # Clean up all autoscaling components
        try:
            # Get and delete policies
            policies = aas_client.describe_scaling_policies(
                ServiceNamespace="sagemaker",
                ResourceId=resource_id,
                ScalableDimension="sagemaker:inference-component:DesiredCopyCount"
            )['ScalingPolicies']

            for policy in policies:
                aas_client.delete_scaling_policy(
                    PolicyName=policy['PolicyName'],
                    ServiceNamespace="sagemaker",
                    ResourceId=resource_id,
                    ScalableDimension="sagemaker:inference-component:DesiredCopyCount"
                )

            # Delete alarm
            cloudwatch_client.delete_alarms(AlarmNames=[f"scale-from-zero-{ic_name}"])

            # Deregister target
            aas_client.deregister_scalable_target(
                ServiceNamespace="sagemaker",
                ResourceId=resource_id,
                ScalableDimension="sagemaker:inference-component:DesiredCopyCount"
            )

            print(f"✅ Cleaned up autoscaling for {ic_name}")

        except Exception as e:
            print(f"ℹ️  Partial cleanup for {ic_name} (some resources may not exist): {e}")

    print("🎉 Autoscaling cleanup complete!")

In [None]:
cleanup_autoscaling([qwen_ic_name, gptoss_ic_name], aas_client, cloudwatch_client)

- Delete inference component
- Delete endpoint
- Delete endpoint-config
- Delete model

In [None]:
sess.delete_inference_component(qwen_ic_name, wait=True)
sess.delete_inference_component(gptoss_ic_name, wait=True)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(qwen_model_name)
sess.delete_model(gptoss_model_name)

print("✅ Workshop cleanup complete!")