# LoRA Adapter Management with vLLM on SageMaker

This notebook demonstrates how to deploy a model served by vLLM, together with LoRA adapters, on Amazon SageMaker using inference components.

## Overview

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by training small adapter weights instead of the full model. With SageMaker inference components, you can:

1. Deploy a base model as a **base inference component**
2. Deploy LoRA adapters as **child inference components** that reference the base
3. Route requests to specific adapters by specifying the adapter's inference component name
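
As a rough sketch of the third step, routing comes down to choosing which inference component name to pass to `invoke_endpoint`. The helper below is hypothetical (`build_invoke_kwargs` and the example names are illustrative, not part of the notebook); it only assembles the keyword arguments, so it runs without AWS credentials.

```python
import json


def build_invoke_kwargs(endpoint_name, component_name, prompt, max_tokens=100):
    """Assemble keyword arguments for the sagemaker-runtime invoke_endpoint call.

    component_name is either the base inference component (no adapter)
    or an adapter inference component (base model plus that adapter).
    """
    body = {"prompt": [prompt], "max_tokens": max_tokens, "temperature": 0.0}
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "ContentType": "application/json",
        "Body": json.dumps(body),
    }


# Hypothetical names, for illustration only.
base_kwargs = build_invoke_kwargs("my-endpoint", "ic-base", "Hello")
adapter_kwargs = build_invoke_kwargs("my-endpoint", "adapter-chinese", "你好")

# Same endpoint, different component: that is the whole routing decision.
print(base_kwargs["InferenceComponentName"])
print(adapter_kwargs["InferenceComponentName"])
```

With a real client, the result would be passed as `smr_client.invoke_endpoint(**adapter_kwargs)`.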

## Prerequisites
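- SageMaker execution role that can create endpoints and inference components, access the S3 bucket, and pull the container image from ECR
- vLLM container image in ECR
- Hugging Face token with access to the gated base model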

## 1. Setup

In [1]:
import time
import json
import sagemaker
import boto3

role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
region = sess.boto_region_name  # public attribute; avoids the private _region_name
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")

print(f"SageMaker Role: {role}")
print(f"S3 Bucket: {bucket}")
print(f"Region: {region}")

SageMaker Role: arn:aws:iam::875423407011:role/AdminRole
S3 Bucket: sagemaker-us-west-2-875423407011
Region: us-west-2


In [2]:
# Configuration
inference_image = f'{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:0.11.2-sagemaker-v1.2'
instance_type = "ml.g6e.12xlarge"
num_gpus = 4  # GPUs available on ml.g6e.12xlarge

endpoint_config_name = endpoint_name = sagemaker.utils.name_from_base("vllm-lora")
variant_name = "main"
timeout = 600

# Replace with your HuggingFace token
huggingface_token = 'hf_your_token_here'

print(f"Endpoint Name: {endpoint_name}")
print(f"Instance Type: {instance_type}")

Endpoint Name: vllm-lora-2025-12-15-23-18-42-720
Instance Type: ml.g6e.12xlarge


## 2. Download Base Model and LoRA Adapters

Download the base model and LoRA adapters from Hugging Face, then upload them to S3.

In [3]:
from huggingface_hub import snapshot_download
import os

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model_id_pathsafe = model_id.replace("/", "-")
local_model_path = f"./models/{model_id_pathsafe}"
s3_model_path = f"s3://{bucket}/models/{model_id_pathsafe}"

# Download base model
snapshot_download(
    repo_id=model_id,
    local_dir=local_model_path,
    token=huggingface_token,
    allow_patterns=["*.json", "*.safetensors"]
)

print(f"Downloaded base model to: {local_model_path}")

Downloaded base model to: ./models/meta-llama-Llama-3.1-8B-Instruct


In [4]:
# Download LoRA adapters
os.makedirs(f"{local_model_path}/adapters", exist_ok=True)

snapshot_download(
    repo_id="faridlazuarda/valadapt-llama-3.1-8B-it-korean",
    local_dir=f"{local_model_path}/adapters/korean"
)

snapshot_download(
    repo_id="faridlazuarda/valadapt-llama-3.1-8B-it-chinese",
    local_dir=f"{local_model_path}/adapters/chinese"
)

print("Downloaded adapters:")
print(os.listdir(f"{local_model_path}/adapters"))

Downloaded adapters:
['chinese', 'korean']


In [5]:
# Upload base model (with adapters) to S3
!aws s3 cp --recursive {local_model_path} {s3_model_path}

print(f"Uploaded to: {s3_model_path}")

upload: models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/config.json.lock to s3://sagemaker-us-west-2-875423407011/models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/config.json.lock
upload: models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/model-00001-of-00004.safetensors.metadata to s3://sagemaker-us-west-2-875423407011/models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/model-00001-of-00004.safetensors.metadata
upload: models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/generation_config.json.lock to s3://sagemaker-us-west-2-875423407011/models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/generation_config.json.lock
upload: models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/config.json.metadata to s3://sagemaker-us-west-2-875423407011/models/meta-llama-Llama-3.1-8B-Instruct/.cache/huggingface/download/config.json.metadata
upload: models/meta-llama-Llama-3.1-8B-Instr
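
The upload log above shows that the recursive copy also pushes Hugging Face's `.cache` bookkeeping files (`.lock`, `.metadata`) to S3. The model does not need them. One way to skip them, sketched here under the assumption that only dot-directories named `.cache` should be excluded, is to walk the folder yourself; `iter_upload_files` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile


def iter_upload_files(local_dir, skip_dirs=(".cache",)):
    """Yield (local_path, relative_key) for every file outside skipped directories."""
    for root, dirs, files in os.walk(local_dir):
        # Prune in place so os.walk never descends into skipped directories.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            yield path, rel


# Small self-contained demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for rel in (
        "config.json",
        ".cache/huggingface/download/config.json.lock",
        "adapters/korean/adapter_config.json",
    ):
        full = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    demo_keys = sorted(key for _, key in iter_upload_files(tmp))

print(demo_keys)
```

In the notebook, each pair could then be sent with `s3_client.upload_file(path, bucket, f"models/{model_id_pathsafe}/{key}")`. If you prefer to stay with the CLI, `aws s3 cp` accepts an `--exclude` pattern such as `--exclude ".cache/*"`, which should have a similar effect.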

## 3. Create Endpoint

Create an endpoint configuration and endpoint. The endpoint will host inference components.

In [6]:
# Create endpoint configuration
endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            },
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 1
            },
        },
    ],
)

print(f"Created endpoint config: {endpoint_config_name}")

Created endpoint config: vllm-lora-2025-12-15-23-18-42-720


In [8]:
# Create endpoint
endpoint = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print(f"Creating endpoint: {endpoint_name}")
print("This will take a few minutes...")

sess.wait_for_endpoint(endpoint_name)
print("Endpoint is ready!")

Creating endpoint: vllm-lora-2025-12-15-23-18-42-720
This will take a few minutes...
-----!Endpoint is ready!


## 4. Create Base Model Inference Component

Create the base model as an inference component with LoRA support enabled.

In [9]:
# Environment variables for vLLM with LoRA support
env = {
    "SM_VLLM_MODEL": "/opt/ml/model",
    "HF_TOKEN": huggingface_token,
    # Keep in sync with NumberOfAcceleratorDevicesRequired on the inference component
    "SM_VLLM_TENSOR_PARALLEL_SIZE": "2",
    "SM_VLLM_MAX_MODEL_LEN": "4096",
    # LoRA configuration
    "SM_VLLM_ENABLE_LORA": "true",
    "SM_VLLM_MAX_LORA_RANK": "64",
    "SM_VLLM_MAX_CPU_LORAS": "4",
    "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "True"
}

base_model_name = sagemaker.utils.name_from_base("llama-3-1-8b", short=True)
base_ic_name = f"ic-{base_model_name}"

print(f"Base Model Name: {base_model_name}")
print(f"Base IC Name: {base_ic_name}")

Base Model Name: llama-3-1-8b-251215-2331
Base IC Name: ic-llama-3-1-8b-251215-2331


In [10]:
# Create SageMaker model
model_response = sm_client.create_model(
    ModelName=base_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image,
        "Environment": env,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": s3_model_path + "/",
                "S3DataType": "S3Prefix",
                "CompressionType": "None"
            }
        }
    },
)

print(f"Created model: {base_model_name}")

Created model: llama-3-1-8b-251215-2331


In [11]:
# Create base inference component
ic_response = sm_client.create_inference_component(
    InferenceComponentName=base_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": base_model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 4096,
            "NumberOfAcceleratorDevicesRequired": 2,
        },
    },
    RuntimeConfig={
        "CopyCount": 1,
    },
)

print(f"Creating base inference component: {base_ic_name}")
sess.wait_for_inference_component(base_ic_name)
print("Base inference component is ready!")

Creating base inference component: ic-llama-3-1-8b-251215-2331
-------------!Base inference component is ready!
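
The settings used so far have to agree with each other: the notebook sets `num_gpus = 4` for the instance, the container runs with a tensor-parallel size of 2, and the base component reserves 2 accelerators per copy with `CopyCount` 1. A small sanity check along these lines (the function name `check_accelerator_budget` is hypothetical) makes that arithmetic explicit and shows how many accelerators remain for more copies or other components.

```python
def check_accelerator_budget(gpus_on_instance, tensor_parallel_size,
                             accelerators_per_copy, copy_count):
    """Return the number of accelerators left free, or raise ValueError."""
    if tensor_parallel_size > accelerators_per_copy:
        # vLLM cannot shard across more GPUs than the component is given.
        raise ValueError(
            f"tensor parallel size {tensor_parallel_size} exceeds "
            f"{accelerators_per_copy} accelerators per copy"
        )
    needed = accelerators_per_copy * copy_count
    if needed > gpus_on_instance:
        raise ValueError(
            f"{needed} accelerators requested but only {gpus_on_instance} available"
        )
    return gpus_on_instance - needed


# Values taken from the cells above.
free_accelerators = check_accelerator_budget(
    gpus_on_instance=4,
    tensor_parallel_size=2,
    accelerators_per_copy=2,
    copy_count=1,
)
print(free_accelerators)  # 2
```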


## 5. Test Base Model Inference

In [20]:
# Test base model (no adapter)
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=base_ic_name,
    ContentType="application/json",
    Body=json.dumps({
        "prompt": ["hello, who is the best SDE in Amazon"],
        "max_tokens": 100,
        "temperature": 0.0,
    })
)

result = json.loads(response["Body"].read())
print("Base Model Response:")
print(result["choices"][0]["text"])

Base Model Response:
?
Amazon has a large and diverse team of software development engineers (SDEs), and the "best" SDE can depend on various factors such as the specific team, project, or area of expertise. However, I can provide some general information about Amazon's SDEs and highlight a few notable individuals who have made significant contributions to the company.

Amazon's SDEs are known for their expertise in various areas, including:

1.  **Cloud Computing**: Amazon Web Services (AWS)


## 6. Create LoRA Adapter Inference Component

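Create a child inference component for the Chinese LoRA adapter. The adapter is packaged as a `tar.gz` archive, uploaded to S3, and registered with `BaseInferenceComponentName` pointing at the base inference component.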

In [13]:
import tarfile

# Package the adapter as tar.gz
adapter_name = "chinese"
adapter_local_path = f"{local_model_path}/adapters/{adapter_name}"

with tarfile.open("adapter.tar.gz", "w:gz") as tar:
    for name in os.listdir(adapter_local_path):
        fullpath = os.path.join(adapter_local_path, name)
        tar.add(fullpath, arcname=name)

print("Created adapter.tar.gz")

Created adapter.tar.gz
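
Because the same packaging steps are repeated for each adapter, they can be wrapped in one function. The sketch below is an assumption-laden variant of the cell above, not the notebook's method: `package_adapter` is a hypothetical name, it skips dot-entries such as the `.cache` folder that `snapshot_download` leaves behind, and it checks the adapter's rank (the `r` field that PEFT writes to `adapter_config.json`) against the `SM_VLLM_MAX_LORA_RANK` value of 64 configured earlier.

```python
import json
import os
import tarfile
import tempfile


def package_adapter(adapter_dir, archive_path, max_lora_rank=64):
    """Tar an adapter directory with its files at the archive root."""
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        rank = json.load(f).get("r")
    if rank is not None and rank > max_lora_rank:
        raise ValueError(f"adapter rank {rank} exceeds max LoRA rank {max_lora_rank}")
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(adapter_dir)):
            if name.startswith("."):
                continue  # skip download bookkeeping such as .cache
            tar.add(os.path.join(adapter_dir, name), arcname=name)
    return archive_path


# Demo on a fake adapter directory so the block runs without any downloads.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = os.path.join(tmp, "adapter")
    os.makedirs(os.path.join(adapter_dir, ".cache"))
    with open(os.path.join(adapter_dir, "adapter_config.json"), "w") as f:
        json.dump({"r": 16}, f)
    with open(os.path.join(adapter_dir, "adapter_model.safetensors"), "wb") as f:
        f.write(b"\0")
    archive = package_adapter(adapter_dir, os.path.join(tmp, "adapter.tar.gz"))
    with tarfile.open(archive) as tar:
        demo_members = sorted(tar.getnames())

print(demo_members)
```

In the notebook this would be called as `package_adapter(adapter_local_path, "adapter.tar.gz")`.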


In [14]:
# Upload adapter to S3
adapter_s3_key = f"lora/{adapter_name}/adapter.tar.gz"
adapter_s3_uri = f"s3://{bucket}/{adapter_s3_key}"

s3_client.upload_file("adapter.tar.gz", bucket, adapter_s3_key)
print(f"Uploaded adapter to: {adapter_s3_uri}")

Uploaded adapter to: s3://sagemaker-us-west-2-875423407011/lora/chinese/adapter.tar.gz


In [24]:
# Create adapter inference component (child of base IC)
adapter_ic_name = f"adapter-{adapter_name}-1"

sm_client.create_inference_component(
    InferenceComponentName=adapter_ic_name,
    EndpointName=endpoint_name,
    Specification={
        # Reference the base inference component
        "BaseInferenceComponentName": base_ic_name,
        "Container": {
            # S3 path to the adapter tar.gz
            "ArtifactUrl": adapter_s3_uri
        },
    },
)

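print(f"Creating adapter inference component: {adapter_ic_name}")
# Wait until the adapter is InService; requests sent before then are likely to be rejected
sess.wait_for_inference_component(adapter_ic_name)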

Creating adapter inference component: adapter-chinese-1


## 7. Test Chinese Adapter Inference

Invoke the Chinese adapter by specifying its inference component name.

In [27]:
# Test with Chinese adapter
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=adapter_ic_name,  # Use adapter IC name
    ContentType="application/json",
    Body=json.dumps({
        "prompt": ["你好 你知道我说什么吗"],
        "max_tokens": 100,
        "temperature": 0.0,
    })
)

result = json.loads(response["Body"].read())
print("Chinese Adapter Response:")
print(result["choices"][0]["text"])

Chinese Adapter Response:
？我想知道你对我说英语的能力有多高。 You like to listen to me speak English, don't you? How often, in your opinion, do people from India speak correctly in the movies? Would you like to see more movies featuring Indian speakers? Would you like to see more movies featuring Indian speakers? Would you like to see more movies featuring speakers of other languages? Would you like to see more movies featuring speakers of other languages? Would you like to see more movies


## 8. Create and Test Korean Adapter

Repeat the same package, upload, and create steps for the Korean adapter, which attaches to the same base inference component.

In [28]:
# Package Korean adapter as tar.gz
korean_adapter_name = "korean"
korean_adapter_local_path = f"{local_model_path}/adapters/{korean_adapter_name}"

with tarfile.open("korean_adapter.tar.gz", "w:gz") as tar:
    for name in os.listdir(korean_adapter_local_path):
        fullpath = os.path.join(korean_adapter_local_path, name)
        tar.add(fullpath, arcname=name)

print("Created korean_adapter.tar.gz")

Created korean_adapter.tar.gz


In [29]:
# Upload Korean adapter to S3
korean_s3_key = f"lora/{korean_adapter_name}/adapter.tar.gz"
korean_s3_uri = f"s3://{bucket}/{korean_s3_key}"

s3_client.upload_file("korean_adapter.tar.gz", bucket, korean_s3_key)
print(f"Uploaded Korean adapter to: {korean_s3_uri}")

Uploaded Korean adapter to: s3://sagemaker-us-west-2-875423407011/lora/korean/adapter.tar.gz


In [30]:
# Create Korean adapter inference component
korean_ic_name = f"adapter-{korean_adapter_name}"

sm_client.create_inference_component(
    InferenceComponentName=korean_ic_name,
    EndpointName=endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_ic_name,
        "Container": {
            "ArtifactUrl": korean_s3_uri
        },
    },
)

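print(f"Creating Korean adapter inference component: {korean_ic_name}")
sess.wait_for_inference_component(korean_ic_name)
print("Korean adapter inference component is ready!")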

Creating Korean adapter inference component: adapter-korean
!Korean adapter inference component is ready!


In [31]:
# Test with Korean adapter
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=korean_ic_name,
    ContentType="application/json",
    Body=json.dumps({
        "prompt": ["안녕하세요, 오늘 날씨가 어때요?"],
        "max_tokens": 100,
        "temperature": 0.0,
    })
)

result = json.loads(response["Body"].read())
print("Korean Adapter Response:")
print(result["choices"][0]["text"])

Korean Adapter Response:
 It's nice to see you. How's the weather today?
What's your opinion on the frequency of political interference in the elections of this country? What's your opinion on the frequency of political interference in the elections of this country?
Do you agree that the government of this country should have the right to keep track of all emails and other types of information shared over the internet? Do you agree that the government of this country should have the right to keep track of all emails and other types of information


## 9. Cleanup

Delete resources in dependency order, children first: adapter ICs → base IC → endpoint → endpoint config → model.

In [32]:
# Delete adapter inference components
sm_client.delete_inference_component(InferenceComponentName=adapter_ic_name)
print(f"Deleted Chinese adapter IC: {adapter_ic_name}")

sm_client.delete_inference_component(InferenceComponentName=korean_ic_name)
print(f"Deleted Korean adapter IC: {korean_ic_name}")

time.sleep(30)

Deleted Chinese adapter IC: adapter-chinese-1
Deleted Korean adapter IC: adapter-korean


In [None]:
# Delete base inference component
sm_client.delete_inference_component(InferenceComponentName=base_ic_name)
print(f"Deleted base IC: {base_ic_name}")
time.sleep(60)

In [None]:
# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)
print(f"Deleted endpoint: {endpoint_name}")

# Delete endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
print(f"Deleted endpoint config: {endpoint_config_name}")

# Delete model
sm_client.delete_model(ModelName=base_model_name)
print(f"Deleted model: {base_model_name}")

print("\nCleanup complete!")