# Using SageMaker Efficient Multi-Adapter Serving to Host LoRA Adapters at Scale

In this notebook, we will show you how to deploy a SageMaker AI endpoint with a single base model, **Llama 3.1 8B Instruct**, and **3 fine-tuned LoRA adapters**, and how to switch adapters between requests using [SageMaker Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

WHY LoRA?
- LoRA (Low-Rank Adaptation) is a powerful technique for fine-tuning large language models.
- This technique significantly reduces the number of trainable parameters compared to traditional full fine-tuning while achieving comparable or, in some cases, superior performance.
- A major benefit of LoRA is that fine-tuned adapters can easily be added to and removed from the base model, which makes switching adapters at runtime cheap and practical (demonstrated in Step 5).

Multi-adapter serving allows multiple fine-tuned models to be hosted cost-efficiently on a single endpoint. With a multi-adapter approach, you can tackle multiple different tasks with a single base LLM. In this example, you will use a pre-trained LoRA adapter that was fine-tuned from *Llama 3.1 8B Instruct* on the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum).

You will also see how to dynamically load and unload these adapters using [SageMaker Inference Components](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/). In this example, we specifically explore the Inference Component Adapter feature, which allows you to load hundreds of adapters on a SageMaker real-time endpoint.

![](./images/ic-adapter-architecture.png)

#### Details of Steps:

- **Step 0: Prerequisites**
- **Step 1: Set up**
    - Fetch and import dependencies
    - Restart kernel before continuing
    - Configure development environment and boto3 clients
- **Step 2: Deploy a base model to SageMaker IC-based endpoint**
  - Select a Large Model Inference (LMI) container image
  - Configure model container environment
  - Pick one option to deploy base model
  - View logs for the base inference component (and adapters after they're loaded)
  - Create the Inference Components (ICs) for the adapters
  - Copy the compressed local adapter ectsum-adapter.tar.gz to S3
  - Create ECTSum adapter inference component
  - View Cloudwatch Inference logs
- **Step 3: Invoking the Endpoint**
  - Get sample dataset
  - Create the prompt to invoke the LLM
  - Invoke base model without any adapters and observe response
  - Invoke ECTSum adapter and observe response
  - Compare outputs
- **Step 4: Swapping adapters between GPU/CPU/disk**
  - Create second adapter
  - Create third adapter
- **Step 5: Upload a new ECTSum adapter artifact and update the live adapter inference component**
  - Upload compressed adapter to S3
  - Update Inference component adapter
  - Retest the updated adapter
- **Step 6: Clean up the environment**
  - Delete adapters
  - Delete base inference components and model
  - Delete SageMaker endpoint and config



## Step 0: Prerequisites

- Make sure you have access to an ml.g5.2xlarge instance for endpoint usage
- Run these two commands in a terminal
    - `sudo apt-get install git-lfs`
    - `git clone https://github.com/aws-samples/sagemaker-genai-hosting-examples.git`

## Step 1: Setup

### 1.1 Fetch and import dependencies 
Ignore incompatibility errors.

In [26]:
%pip install -Uq datasets==3.0.0 --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


In [27]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


### 1.2 Restart kernel before continuing 
 Menu Bar > Kernel > Restart Kernel...

In [1]:
import sagemaker
import boto3
import json

print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
boto3 version: 1.40.18
sagemaker version: 2.251.0


### 1.3 Configure development environment and boto3 clients

In [2]:
role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public accessor for the session's region

sm_client = boto3.client(service_name="sagemaker")
sm_runtime = boto3.client(service_name="sagemaker-runtime")
print(bucket)

sagemaker-us-east-1-230904922206


## Step 2: Deploy a base model to SageMaker IC-based endpoint



### 2.1 Select a Large Model Inference (LMI) container image

Select one of the [available Large Model Inference (LMI) container images for hosting](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). Efficient adapter inference requires a sufficiently recent LMI release; the `0.32.0-lmi14.0.0` image used below supports it. Ensure that you are using the image URI for the region that corresponds with your deployment region.

In [4]:
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124"

print(f"Inference container image : {inference_image_uri}")

Inference container image : 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124


### 2.2 Configure model container environment

Create a container environment for the hosting container. LMI container parameters can be found in the [LMI User Guides](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/index.html).

By using the `OPTION_MAX_LORAS` and `OPTION_MAX_CPU_LORAS` parameters, you can control how adapters are loaded and unloaded into GPU/CPU memory. The `OPTION_MAX_LORAS` parameter defines the number of adapters that will be held in GPU memory. The `OPTION_MAX_CPU_LORAS` parameter controls the number of adapters that will be held in CPU memory. It is important to note that adapters which are loaded to GPU have to be precached in CPU memory and will occupy space in the CPU cache. This means `OPTION_MAX_CPU_LORAS` should be set to `OPTION_MAX_LORAS + <number of adapters you want to cache in CPU>`. Any adapters beyond this will be offloaded to local SSD. 

In the following example, the container will hold 30 adapters in GPU memory, and 70 adapters in CPU memory. Out of the 70, 30 will be precached adapters that already reside in GPU, leaving you with 40 slots free.

```
env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}
```


Later in this workshop you will test scenarios where you will force adapters to swap between different tiers. To make this easier, you will set the `OPTION_MAX_LORAS` property to `1` and the `OPTION_MAX_CPU_LORAS` to `2`. This will allow you to hold 1 adapter in GPU memory and 1 in CPU memory (plus 1 precached from GPU) before moving adapters to disk.
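
The sizing rule above can be captured in a small helper. This is a minimal sketch: the function name `lora_cache_env` and its argument names are hypothetical, and only the two `OPTION_*` keys come from the container configuration shown in this notebook.

```python
def lora_cache_env(gpu_adapters: int, extra_cpu_adapters: int) -> dict:
    """Return the LoRA cache settings as container environment strings.

    Adapters resident on GPU are also precached in CPU memory, so the CPU
    limit must cover the GPU-resident adapters plus any CPU-only ones.
    """
    if gpu_adapters < 1 or extra_cpu_adapters < 0:
        raise ValueError("need at least one GPU slot and a non-negative CPU count")
    return {
        "OPTION_MAX_LORAS": str(gpu_adapters),
        "OPTION_MAX_CPU_LORAS": str(gpu_adapters + extra_cpu_adapters),
    }
```

With this helper, `lora_cache_env(30, 40)` corresponds to the 30/70 example, and `lora_cache_env(1, 1)` corresponds to the 1/2 setting used for the swapping tests. The result could be merged into the `env` dictionary with `env.update(...)`.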

### 2.3 Options to deploy base model

You can deploy the base model on SageMaker endpoint from several sources:
- Option 1 : SageMaker JumpStart (**Use this for AWS provisioned lab**)
- Option 2 : HuggingFace model hub 
- Option 3 : Amazon S3 bucket 

**Note: Please choose only ONE deployment option below**


### 2.3 Option 1: Deploy a model from SageMaker JumpStart

In [5]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id, model_version = "meta-textgeneration-llama-3-1-8b-instruct", "2.7.2"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": "/opt/ml/model",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

jumpstart_model = JumpStartModel(model_id=model_id, 
                                 model_version=model_version, 
                                 name=model_name,
                                 image_uri=inference_image_uri,
                                 env=env)

jumpstart_model.deploy(
    accept_eula=True,
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=900,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=base_inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1,}),
)

----!------------------!

<sagemaker.base_predictor.Predictor at 0x7f009a062900>

### 2.3 Option 2: Deploy a model from HuggingFace model hub (Skip)

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id = "meta-llama/Llama-3.1-8B-Instruct"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": model_id,
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

lmi_model = sagemaker.Model(image_uri = inference_image_uri,
                            env = env,
                            role = role,
                            name = model_name)


lmi_model.deploy(instance_type = "ml.g5.2xlarge",
                 initial_instance_count = 1,
                 container_startup_health_check_timeout = 900,
                 endpoint_name = endpoint_name,
                 endpoint_type = sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
                 inference_component_name = base_inference_component_name,
                 resources = ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1}))

### 2.3 Option 3: S3 bucket (Skip)

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id = "s3://YOUR_BUCKET"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": model_id,
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

lmi_model = sagemaker.Model(image_uri = inference_image_uri,
                            env = env,
                            role = role,
                            name = model_name)


lmi_model.deploy(instance_type = "ml.g5.2xlarge",
                 initial_instance_count = 1,
                 container_startup_health_check_timeout = 900,
                 endpoint_name = endpoint_name,
                 endpoint_type = sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
                 inference_component_name = base_inference_component_name,
                 resources = ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1}))

### 2.4 View logs for the base inference component (and adapters after they're loaded)

In [6]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

cw_path = urllib.parse.quote_plus(f'/aws/sagemaker/InferenceComponents/{base_inference_component_name}', safe='', encoding=None, errors=None)

print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-test-2025-08-28-19-54-59-710


### 2.5 Create the Inference Components (ICs) for the adapters

In this example you’ll create a single adapter, but you could host up to hundreds of them per endpoint. They will need to be compressed and uploaded to S3.

The adapter package has the following files at the root of the archive with no sub-folders:

![](./images/adapter_files.png)

For this example, an adapter was fine-tuned using QLoRA and [Fully Sharded Data Parallel (FSDP)](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-v2.html) on the training split of the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum). Training took 21 minutes on an ml.p4d.24xlarge and cost ~$13 using current [on-demand pricing](https://aws.amazon.com/sagemaker/pricing/).
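
Before uploading, it can help to confirm that an archive really is flat, as the layout described above requires. The following is a rough sketch using only the standard library; the helper name `nested_members` is hypothetical and the check makes no assumption about which file names the adapter contains.

```python
import tarfile


def nested_members(archive_path: str) -> list:
    """Return archive member names that are not plain entries at the archive root."""
    offenders = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            name = member.name
            # Some tar tools prefix root-level entries with "./"; normalize that away.
            if name.startswith("./"):
                name = name[2:]
            if name in ("", "."):
                continue  # the archive root itself
            # Directories and anything containing a path separator count as nested.
            if member.isdir() or "/" in name:
                offenders.append(member.name)
    return offenders
```

An empty list from `nested_members("ectsum-adapter.tar.gz")` would indicate that every file sits at the root of the archive.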

### 2.6 Copy the compressed local adapter ectsum-adapter.tar.gz to S3

In [7]:
ectsum_adapter_filename = "ectsum-adapter.tar.gz"
ectsum_adapter_s3_uri = f"s3://{bucket}/adapters/{ectsum_adapter_filename}"
print(ectsum_adapter_s3_uri)

!aws s3 cp ./{ectsum_adapter_filename} {ectsum_adapter_s3_uri}

s3://sagemaker-us-east-1-230904922206/adapters/ectsum-adapter.tar.gz
upload: ./ectsum-adapter.tar.gz to s3://sagemaker-us-east-1-230904922206/adapters/ectsum-adapter.tar.gz


### 2.7 Create ECTSum adapter inference component

For each adapter that you are going to deploy, you need to specify an **`InferenceComponentName`**, an **`ArtifactUrl`** with the S3 location of the adapter archive, and a **`BaseInferenceComponentName`** to create the connection between the base model IC and the new adapter ICs. You will repeat this process for each additional adapter.

#### This step can take around 2 minutes

In [8]:
%%time

ic1_adapter_name = f"ic1-ectsum-{model_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic1_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f"\nCreated Adapter inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

------!
Created Adapter inference component ARN: arn:aws:sagemaker:us-east-1:230904922206:inference-component/ic1-ectsum-test-2025-08-28-19-54-59-710
CPU times: user 51.9 ms, sys: 2.86 ms, total: 54.7 ms
Wall time: 2min 21s
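
Because every adapter needs the same three fields, the requests for several adapters can be generated from a simple mapping. This is a sketch only: `adapter_component_requests` and the mapping format are hypothetical, while the request keys mirror the `create_inference_component` call used in the cell above.

```python
def adapter_component_requests(endpoint_name, base_component_name, adapters):
    """Build one create_inference_component request per adapter.

    `adapters` maps an adapter inference component name to the S3 URI
    of its compressed adapter archive.
    """
    return [
        {
            "InferenceComponentName": component_name,
            "EndpointName": endpoint_name,
            "Specification": {
                "BaseInferenceComponentName": base_component_name,
                "Container": {"ArtifactUrl": artifact_url},
            },
        }
        for component_name, artifact_url in adapters.items()
    ]
```

Each resulting dictionary could then be passed as `sm_client.create_inference_component(**request)`, followed by `sess.wait_for_inference_component(request["InferenceComponentName"])` as in the cell above.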


### 2.8 View Cloudwatch Inference logs 
Look at the base inference component logs again. They should show a line that looks like:

`Registered adapter <ADAPTER_NAME> from /opt/ml/models/ ... successfully`.

In [9]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-test-2025-08-28-19-54-59-710


## Step 3: Invoking the Endpoint

### 3.1 Get sample dataset 
First you will pull a random datapoint from the ECTSum test split. You'll use the `text` field to invoke the model and the `summary` field to compare with ground truth later.

In [10]:
from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, split="test")

# Due to GPU memory limitations on ml.g5.2xlarge, the max sequence length is limited to 6000 tokens.
# Some of the ECTSum samples are too large.
# This loop draws samples until one is estimated at <= 5500 tokens
# (roughly 4 characters per token) so that inference does not throw errors.

valid_test_value = False
while not valid_test_value:
    test_item = test_dataset.shuffle().select(range(1))
    sample_size = len(test_item["text"][0])/4
    if sample_size > 5500:
        print(f'sample size {sample_size} > 5500, fetching new sample.')
    else:
        print(f'sample_size {sample_size}')
        valid_test_value = True

ground_truth_response = test_item["summary"]

sample_size 2077.25
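
The loop above estimates token count as character count divided by four. As a rough sketch of the same heuristic applied to a plain list of strings (the function names here are hypothetical, and the four-characters-per-token ratio is only an approximation, not a tokenizer):

```python
def estimated_tokens(text: str) -> float:
    """Rough token estimate: about four characters per token (heuristic)."""
    return len(text) / 4


def first_sample_within(texts, max_tokens=5500):
    """Return the index of the first text whose estimate fits, or None if none do."""
    for index, text in enumerate(texts):
        if estimated_tokens(text) <= max_tokens:
            return index
    return None
```

Unlike the `while` loop, this variant returns `None` when no sample fits instead of looping indefinitely.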


### 3.2 Create the prompt to invoke the LLM
Next you will build a prompt **to invoke the model for earnings summarization**, filling in the source text with a random item from the ECTSum dataset. 

In [11]:
prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                You are an AI assistant trained to summarize earnings calls. Provide a concise summary of the call, capturing the key points and overall context. Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
                <|eot_id|><|start_header_id|>user<|end_header_id|>
                Summarize the following earnings call:

                {test_item["text"][0]}
                <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

### 3.3 Invoke base model without any adapters and observe response

To test the base model, specify the `EndpointName` for the endpoint you created earlier and the name of the base inference component as `InferenceComponentName` along with your prompt and other inference parameters in the `Body` parameter.

In [22]:
%%time

component_to_invoke = base_inference_component_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 135}
        }
    ),
    ContentType = "application/json",
)

base_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"]}\n\n')
print(f'Base Model Response:\n\n {base_response}\n')
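
The same request shape is reused for every invocation in this notebook, with only the inference component name and `max_new_tokens` changing. A minimal sketch of a helper that builds those arguments follows; `build_invoke_request` is a hypothetical name, and the sampling parameters simply repeat the values used in the cell above.

```python
import json


def build_invoke_request(endpoint_name, component_name, prompt, max_new_tokens=135):
    """Return keyword arguments for sm_runtime.invoke_endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.9,
            "max_new_tokens": max_new_tokens,
        },
    }
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "Body": json.dumps(payload),
        "ContentType": "application/json",
    }
```

The result could be used as `sm_runtime.invoke_endpoint(**build_invoke_request(endpoint_name, base_inference_component_name, prompt))`; passing an adapter's inference component name instead routes the request through that adapter.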

### 3.4 Invoke ECTSum adapter and observe response

To invoke the adapter, use the adapter inference component name in your `invoke_endpoint` call.

In [14]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 135}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"]}\n\n')
print(f'Adapter Model Response:\n\n {adapter_response}\n')

Ground Truth:

 ['badger meter q3 earnings per share $0.54.\nq3 earnings per share $0.54.\nq3 sales $128.7 million versus refinitiv ibes estimate of $127 million.\nanticipate component shortages and lengthened lead times will ease over time, but assume they will persist well into 2022.']


Adapter Model Response:

 
                Here is a concise summary of the call:

                q3 revenue increased 13.3 percent to $128.7 million.
q3 earnings per share $0.54 versus $0.51.
for the third quarter of 2021, badger meter reported record sales of $128.7 million, representing a 13.3% increase over prior year sales of $113.6 million.
q3 diluted earnings per share $0.54.
                

CPU times: user 10.9 ms, sys: 3.87 ms, total: 14.8 ms
Wall time: 4.09 s


### 3.5 Compare outputs

Compare the outputs of the **base model** and **adapter** to ground truth. 

In this test, notice that while the base model response may look more polished, the adapter response is **significantly closer to ground truth**, which is what you are looking for. This is quantified with metrics after the comparison output.

In [13]:
print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print("\n----------------------------------\n")
print(f'Base Model Response:\n\n {base_response}')
print("\n----------------------------------\n")
print(f'Adapter Model Response:\n\n {adapter_response}')

Ground Truth:

 compname reports qtrly earnings per share of $0.54.
qtrly earnings per share $0.54.
continued confidence in sales outlook; fy 2021 adjusted earnings per share guidance of $2.07 to $2.27 increased $0.07.
sees q2 2021 total sales to be between $1.25 and $1.33 billion & adjusted earnings per share of $0.54 to $0.60.
qtrly sales grew 8% to $1.2 billion; underlying sales grew 5%.
overall, full year 2021 sales guidance remains at $4.9 to $5.3 billion.
qtrly global transcatheter aortic valve replacement sales of $792 million, up 7% on reported basis.
remains confident that tavr global opportunity will exceed $7 billion by 2024.
continues to anticipate underlying tavr sales growth in 15 to 20 percent range in 2021.



----------------------------------

Base Model Response:

 

Here's a summary of the earnings call:

**Key Highlights:**

* Q1 2021 sales: $1.2 billion, up 5% on a constant currency basis from Q1 2020
* Adjusted EPS: $0.54, up 8% from Q1 2020
* Raised full-year 20

To validate the true adapter performance, you can use a tool like [fmeval](https://github.com/aws/fmeval) to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BERTScore metrics for the adapter versus the base model. Doing so against the test split of ECTSum yields the following results:

![](./images/fmeval-overall.png)

The fine-tuned adapter shows a 59% increase in METEOR score, a 159% increase in ROUGE score, and an 8.6% increase in BERTScore. The histogram below shows the frequency distribution of scores for each metric, with the adapter scoring higher more often on all of them.

Since the adapter is already loaded into GPU memory, model latency is largely unaffected, with only a difference of 2% between direct base model invocation and the adapter. If the adapter is loaded from CPU memory or disk, it will incur a cold start delay for the first load to GPU.

![](./images/fmeval-histogram.png)

## Step 4: Swapping adapters between GPU/CPU/disk

To illustrate the swapping of adapters between different tiers, you will create 2 more adapter inference components. For simplicity, you can reuse the same adapter artifact from earlier.

When registering new adapters, the newest registration moves into GPU memory. If this causes `OPTION_MAX_LORAS` to be exceeded, the least recently used (LRU) adapter is evicted to the CPU tier. If that move causes `OPTION_MAX_CPU_LORAS` to be exceeded, the LRU adapter in the CPU tier is then evicted to disk.

Since you have set up `OPTION_MAX_LORAS` to `1` and `OPTION_MAX_CPU_LORAS` to `2` in the earlier section, the registration of IC2 in the next step will:
- precache IC2 in CPU
- load IC2 in GPU
- evict IC1 to CPU

The subsequent registration of IC3 will:
- precache IC3 to CPU
- evict IC1 from CPU (available from disk)
- load IC3 in GPU
- evict IC2 from GPU (already precached in CPU)

Invoking adapters not currently in GPU will incur a cold start penalty on the first invocation. In the test below, `max_new_tokens` is set to `1` to focus on the cold start impact.
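
The eviction sequence described above can be reproduced with a small simulation. This is a simplified model for intuition only, not the container's actual implementation: the class `AdapterTiers` and its methods are hypothetical, and it assumes strict LRU eviction with every GPU-resident adapter also holding a CPU cache slot.

```python
from collections import OrderedDict


class AdapterTiers:
    """Toy model of GPU / CPU / disk adapter placement with LRU eviction."""

    def __init__(self, max_loras: int, max_cpu_loras: int):
        self.max_loras = max_loras
        self.max_cpu_loras = max_cpu_loras
        self.gpu = OrderedDict()  # least recently used first
        self.cpu = OrderedDict()  # includes precached copies of GPU adapters
        self.registered = []      # every adapter remains available on disk

    def touch(self, name: str) -> str:
        """Register or invoke an adapter and report which tier it came from."""
        if name in self.gpu:
            source = "GPU"
        elif name in self.cpu:
            source = "CPU"
        else:
            source = "DISK"
        if name not in self.registered:
            self.registered.append(name)
        # Precache in CPU, evicting the least recently used entry if over the limit.
        self.cpu[name] = True
        self.cpu.move_to_end(name)
        while len(self.cpu) > self.max_cpu_loras:
            self.cpu.popitem(last=False)
        # Load into GPU, evicting the least recently used entry if over the limit.
        self.gpu[name] = True
        self.gpu.move_to_end(name)
        while len(self.gpu) > self.max_loras:
            self.gpu.popitem(last=False)
        return source

    def placement(self) -> dict:
        """Return the fastest tier each adapter currently occupies."""
        return {
            "GPU": list(self.gpu),
            "CPU": [n for n in self.cpu if n not in self.gpu],
            "DISK": [n for n in self.registered if n not in self.cpu],
        }
```

Creating `AdapterTiers(max_loras=1, max_cpu_loras=2)` and calling `touch` for `ic1`, `ic2`, and `ic3` in order leaves `ic3` on GPU, `ic2` in CPU, and `ic1` on disk, which matches the registration walkthrough above.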

### 4.1 Create second adapter

In [15]:
%%time

ic2_adapter_name = f"ic2-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic2_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic2_adapter_name)

print(f"\nCreated Adapter 2 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

------!
Created Adapter 2 inference component ARN: arn:aws:sagemaker:us-east-1:230904922206:inference-component/ic2-ectsum-base-test-2025-08-28-19-54-59-710
CPU times: user 49.8 ms, sys: 5.79 ms, total: 55.6 ms
Wall time: 2min 21s


### 4.2 Create third adapter

In [16]:
%%time

ic3_adapter_name = f"ic3-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic3_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic3_adapter_name)

print(f"\nCreated Adapter 3 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

-----!
Created Adapter 3 inference component ARN: arn:aws:sagemaker:us-east-1:230904922206:inference-component/ic3-ectsum-base-test-2025-08-28-19-54-59-710
CPU times: user 46.2 ms, sys: 3.44 ms, total: 49.6 ms
Wall time: 2min 2s


In [17]:
import time

#starting tier indexes 0 - GPU, 1 - CPU, 2 - Disk
tiers = [ ic3_adapter_name, ic2_adapter_name, ic1_adapter_name ]

cycles = 10

invocation_order = [
    ic3_adapter_name, #ic3 in GPU already.  GPU: ic3 CPU: ic2 DISK: ic1
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic3 DISK: ic1
    ic2_adapter_name, #ic2 is still in GPU. GPU: ic2 CPU: ic3 DISK: ic1
    ic1_adapter_name, #swap ic1 from disk.  GPU: ic1 CPU: ic2 DISK: ic3
    ic1_adapter_name, #ic1 is still in GPU. GPU: ic1 CPU: ic2 DISK: ic3
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic1 DISK: ic3
    ic3_adapter_name, #swap ic3 from disk.  GPU: ic3 CPU: ic2 DISK: ic1
    # back to the starting configuration
]

no_swaps = []
cpu_swaps = []
disk_swaps = []

swap_type = ""

for cycle in range(cycles):
    for invocation in invocation_order:
    
        if invocation == base_inference_component_name or tiers.index(invocation) == 0:
            # the adapter (or base model) is already on the GPU, so no swap is needed
            swap_type = "NONE"
        elif tiers.index(invocation) == 1:
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_CPU"
        elif tiers.index(invocation) == 2:
            tiers[2] = tiers[1]
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_DISK"


        component_to_invoke = invocation
        
        start = time.time()*1000
        
        response_model = sm_runtime.invoke_endpoint(
            EndpointName = endpoint_name,
            InferenceComponentName = component_to_invoke,
            Body = json.dumps(
                {
                    "inputs": prompt,
                    "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 1}
                }
            ),
            ContentType = "application/json",
        )
    
        end = time.time()*1000

        total = int(end - start)
        
        if swap_type == "NONE":
            no_swaps.append(total)
        elif swap_type == "FROM_CPU":
            cpu_swaps.append(total)
        elif swap_type == "FROM_DISK":
            disk_swaps.append(total)
    
        print(f'call to [{invocation.split("-")[0]}] {total} ms. swap: [{swap_type}] [ GPU: {tiers[0].split("-")[0]} CPU: {tiers[1].split("-")[0]} Disk: {tiers[2].split("-")[0]} ]')

no_swaps_count = len(no_swaps)
no_swaps_avg = int(sum(no_swaps)/len(no_swaps))
cpu_swaps_count = len(cpu_swaps)
cpu_swaps_avg = int(sum(cpu_swaps)/len(cpu_swaps))
disk_swaps_count = len(disk_swaps)
disk_swaps_avg = int(sum(disk_swaps)/len(disk_swaps))

call to [ic3] 941 ms. swap: [NONE] [ GPU: ic3 CPU: ic2 Disk: ic1 ]
call to [ic2] 936 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic3 Disk: ic1 ]
call to [ic2] 545 ms. swap: [NONE] [ GPU: ic2 CPU: ic3 Disk: ic1 ]
call to [ic1] 611 ms. swap: [FROM_DISK] [ GPU: ic1 CPU: ic2 Disk: ic3 ]
call to [ic1] 544 ms. swap: [NONE] [ GPU: ic1 CPU: ic2 Disk: ic3 ]
call to [ic2] 556 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic1 Disk: ic3 ]
call to [ic3] 604 ms. swap: [FROM_DISK] [ GPU: ic3 CPU: ic2 Disk: ic1 ]
call to [ic3] 510 ms. swap: [NONE] [ GPU: ic3 CPU: ic2 Disk: ic1 ]
call to [ic2] 557 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic3 Disk: ic1 ]
call to [ic2] 544 ms. swap: [NONE] [ GPU: ic2 CPU: ic3 Disk: ic1 ]
call to [ic1] 609 ms. swap: [FROM_DISK] [ GPU: ic1 CPU: ic2 Disk: ic3 ]
call to [ic1] 540 ms. swap: [NONE] [ GPU: ic1 CPU: ic2 Disk: ic3 ]
call to [ic2] 565 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic1 Disk: ic3 ]
call to [ic3] 604 ms. swap: [FROM_DISK] [ GPU: ic3 CPU: ic2 Disk: ic1 ]
call to [ic3] 542 ms. swap

In [18]:
import pandas as pd
import numpy as np
np_no_swaps = np.array(no_swaps)
np_cpu_swaps = np.array(cpu_swaps)
np_disk_swaps = np.array(disk_swaps)

data = {
    "count": [no_swaps_count, cpu_swaps_count, disk_swaps_count],
    "average": [no_swaps_avg, cpu_swaps_avg, disk_swaps_avg],
    "+latency avg": [0, cpu_swaps_avg-no_swaps_avg, disk_swaps_avg-no_swaps_avg],
    "+%latency avg": [0, ((cpu_swaps_avg-no_swaps_avg)/no_swaps_avg)*100, ((disk_swaps_avg-no_swaps_avg)/no_swaps_avg)*100],
    "p50": [int(np.percentile(np_no_swaps, 50)), int(np.percentile(np_cpu_swaps, 50)), int(np.percentile(np_disk_swaps, 50))],
    "p75": [int(np.percentile(np_no_swaps, 75)), int(np.percentile(np_cpu_swaps, 75)), int(np.percentile(np_disk_swaps, 75))],
    "p99": [int(np.percentile(np_no_swaps, 99)), int(np.percentile(np_cpu_swaps, 99)), int(np.percentile(np_disk_swaps, 99))]
}


df = pd.DataFrame(data, index = ["no swap", "cpu swap", "disk swap"])

df

| | count | average | +latency avg | +%latency avg | p50 | p75 | p99 |
|---|---|---|---|---|---|---|---|
| no swap | 30 | 554 | 0 | 0.0 | 542 | 544 | 852 |
| cpu swap | 20 | 589 | 35 | 6.31769 | 557 | 564 | 907 |
| disk swap | 20 | 608 | 54 | 9.747292 | 602 | 605 | 706 |


If you were to run a similar test on 1000 cycles (7000 invocations), you'd see the following:

![](./images/adapter-load-latency-1000.png)

## Step 5: Upload a new ECTSum adapter artifact and update the live adapter inference component

Since adapters are managed as Inference Components, you can update them on a running endpoint. SageMaker handles the unloading/deregistering of the old adapter and the loading/registering of the new adapter on every base IC, across all of the instances that the IC is running on for this endpoint.

- To update an adapter IC, use the **update_inference_component** API and supply the existing **IC name** and the **S3 path** to the new compressed adapter archive.

You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.

### 5.1 Upload compressed adapter to S3

In [19]:
new_ectsum_adapter_s3_uri = f"s3://{bucket}/adapters/new-ectsum-adapter.tar.gz"
print(new_ectsum_adapter_s3_uri)

!aws s3 cp ./ectsum-adapter.tar.gz {new_ectsum_adapter_s3_uri}

s3://sagemaker-us-east-1-230904922206/adapters/new-ectsum-adapter.tar.gz
upload: ./ectsum-adapter.tar.gz to s3://sagemaker-us-east-1-230904922206/adapters/new-ectsum-adapter.tar.gz


### 5.2 Update Inference component adapter
**Note: This step can take 5 to 10 minutes**

In [23]:
%%time

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = ic1_adapter_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f'\nUpdated inference component adapter ARN: {update_inference_component_response["InferenceComponentArn"]}')

-------------------------!
Updated inference component adapter ARN: arn:aws:sagemaker:us-east-1:230904922206:inference-component/ic1-ectsum-test-2025-08-28-19-54-59-710
CPU times: user 129 ms, sys: 11.7 ms, total: 140 ms
Wall time: 8min 42s
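
If you are not using the SageMaker Python SDK session, the wait can be approximated by polling the `describe_inference_component` API directly. This is a rough sketch rather than a replacement for `sess.wait_for_inference_component`: the function name, polling interval, and timeout are illustrative assumptions, and the injectable `sleep` argument exists only to make the function easy to test.

```python
import time


def wait_until_in_service(sm_client, component_name,
                          poll_seconds=30, timeout_seconds=1800,
                          sleep=time.sleep):
    """Poll DescribeInferenceComponent until the component is InService.

    Raises RuntimeError if the status becomes Failed, and TimeoutError
    if the component is not InService within timeout_seconds.
    """
    waited = 0
    while True:
        desc = sm_client.describe_inference_component(
            InferenceComponentName=component_name
        )
        status = desc["InferenceComponentStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"{component_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"{component_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the clients from this notebook, the call would be `wait_until_in_service(sm_client, ic1_adapter_name)`.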


If you view your inference component logs (link below), you will see log entries for the deregistration of the old adapter and the registration of the new one.

You should see something similar to:
`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Unregistered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 successfully`

`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Registered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 from /opt/ml/models/container_340043819279-ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401-1732570150851-MaeveWestworldService-1.0.9353.0 successfully`

In [20]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-test-2025-08-27-18-49-32-454
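
If you export the log messages from CloudWatch, a small parser can confirm that the old adapter was unregistered and the new one registered. The sketch below assumes the log lines follow the two sample formats shown above; the exact wording may differ between container versions, and the names `ADAPTER_EVENT` and `parse_adapter_events` are hypothetical.

```python
import re

# Matches the "Registered adapter <name> [from <path>] successfully" and
# "Unregistered adapter <name> successfully" messages shown in the samples.
ADAPTER_EVENT = re.compile(
    r"INFO::(?P<action>Registered|Unregistered) adapter (?P<name>\S+)"
    r"(?: from (?P<path>\S+))? successfully"
)


def parse_adapter_events(lines):
    """Return (action, adapter_name) tuples in the order they appear."""
    events = []
    for line in lines:
        match = ADAPTER_EVENT.search(line)
        if match:
            events.append((match.group("action"), match.group("name")))
    return events
```

A successful update should yield an `Unregistered` event followed by a `Registered` event for the same adapter name.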


### 5.3 Retest with updated adapter

In [21]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 135}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print(f'Updated Adapter Model Response:\n\n {adapter_response}\n')

Ground Truth:

 compname reports qtrly earnings per share of $0.54.
qtrly earnings per share $0.54.
continued confidence in sales outlook; fy 2021 adjusted earnings per share guidance of $2.07 to $2.27 increased $0.07.
sees q2 2021 total sales to be between $1.25 and $1.33 billion & adjusted earnings per share of $0.54 to $0.60.
qtrly sales grew 8% to $1.2 billion; underlying sales grew 5%.
overall, full year 2021 sales guidance remains at $4.9 to $5.3 billion.
qtrly global transcatheter aortic valve replacement sales of $792 million, up 7% on reported basis.
remains confident that tavr global opportunity will exceed $7 billion by 2024.
continues to anticipate underlying tavr sales growth in 15 to 20 percent range in 2021.


Updated Adapter Model Response:

 
                Here is a concise summary of the call:

                q1 earnings per share $0.54.
sees q2 adjusted earnings per share $0.54 to $0.60.
sees fy adjusted earnings per share $2.07 to $2.27.
sees q2 sales $1.25 billi
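
For a quick numeric sanity check alongside the visual comparison, you could measure how many ground-truth tokens also appear in the response. The sketch below is a crude unigram recall, not ROUGE or any metric used elsewhere in this notebook; the tokenizer and the function names are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into words and numbers, keeping '$' on amounts."""
    return re.findall(r"\$?\d+(?:\.\d+)?|[a-z][a-z0-9]*", text.lower())


def unigram_recall(reference, candidate):
    """Fraction of reference tokens that also occur in the candidate.

    Counts are clipped, so a token repeated in the reference is only
    credited as many times as it appears in the candidate.
    """
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / total
```

In this notebook the call would be `unigram_recall(test_item["summary"][0], adapter_response)`. Keep in mind that a response cut off by `max_new_tokens` will score lower than a complete one.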

## Step 6: Clean up the environment

If you need to delete an adapter, call the **`delete_inference_component`** API with the IC name to remove it. 

In [22]:
sess.delete_inference_component(ic1_adapter_name, wait = True)
print(f'Adapter Component {ic1_adapter_name} deleted.')

sess.delete_inference_component(ic2_adapter_name, wait = True)
print(f'Adapter Component {ic2_adapter_name} deleted.')

sess.delete_inference_component(ic3_adapter_name, wait = True)
print(f'Adapter Component {ic3_adapter_name} deleted.')

Adapter Component ic1-ectsum-test-2025-08-27-18-49-32-454 deleted.
Adapter Component ic2-ectsum-base-test-2025-08-27-18-49-32-454 deleted.
Adapter Component ic3-ectsum-base-test-2025-08-27-18-49-32-454 deleted.


**Deleting the base model IC** will automatically delete the base IC and any associated adapter ICs.

In [24]:
sess.delete_inference_component(base_inference_component_name, wait = True)

print(f'Base Component {base_inference_component_name} deleted.')

Base Component base-test-2025-08-27-18-49-32-454 deleted.
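
Before deleting the endpoint, it can help to confirm that no inference components remain on it. The sketch below lists the components on an endpoint and orders them so adapters come before base components. It assumes the `describe_inference_component` response reports `BaseInferenceComponentName` under `Specification` for adapter ICs; the function name `deletion_order` is hypothetical.

```python
def deletion_order(sm_client, endpoint_name):
    """Return IC names on the endpoint, adapter ICs first, base ICs last."""
    adapters, bases = [], []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm_client.list_inference_components(**kwargs)
        for summary in page["InferenceComponents"]:
            name = summary["InferenceComponentName"]
            desc = sm_client.describe_inference_component(
                InferenceComponentName=name
            )
            # Assumption: adapter ICs reference their base IC here.
            spec = desc.get("Specification", {})
            if spec.get("BaseInferenceComponentName"):
                adapters.append(name)
            else:
                bases.append(name)
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return adapters + bases
```

An empty list from `deletion_order(sm_client, endpoint_name)` means the endpoint has no remaining components; otherwise you can pass each name, in order, to `sess.delete_inference_component(name, wait=True)`.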


Finally, delete the **endpoint, its configuration, and the model**.

In [25]:
sess.delete_endpoint(endpoint_name)
print(f'Endpoint {endpoint_name} deleted.')

sess.delete_endpoint_config(endpoint_name)
print(f'Endpoint Configuration {endpoint_name} deleted.')

sess.delete_model(model_name)
print(f'Model {model_name} deleted.')

Endpoint test-2025-08-27-18-49-32-454 deleted.
Endpoint Configuration test-2025-08-27-18-49-32-454 deleted.
Model test-2025-08-27-18-49-32-454 deleted.
