# Using Deployment Configuration to replace the Qwen3-8B model on an IC with GPT-OSS

In this notebook we show how to deploy Qwen3-8B on SageMaker AI Inference Components (IC), and then use a deployment configuration to replace the model on the same IC with GPT-OSS.

We will use a 1-GPU `ml.g6e.4xlarge` instance for the SageMaker AI real-time endpoint, and we will deploy:
- [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)

We are going to use the AWS SDK for Python (`boto3`) for model deployments.

## Step 1: Setup

Fetch and import dependencies

In [None]:
%pip install sagemaker==2.245.0 --upgrade --quiet --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


In [None]:
import time
import json
import sagemaker
import boto3
from IPython.display import display, Markdown

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker version: {sagemaker.__version__}")

## Deployment

### Common setup

In [34]:
CONTAINER_VERSION = "0.34.0-lmi16.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

instance = {"type": "ml.g6e.4xlarge", "num_gpu": 1}
endpoint_config_name = endpoint_name = sagemaker.utils.name_from_base("IC-dep", short = True)
timeout = 600
variant_name = "main"

In [35]:
lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
}

#### Creating Endpoint

In [None]:
initial_instance_count = 1
min_instance_count = 1
max_instance_count = 4

endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance["type"],
            "InitialInstanceCount": initial_instance_count,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
            'ManagedInstanceScaling': {
                'Status': 'ENABLED',
                'MinInstanceCount': min_instance_count,
                'MaxInstanceCount': max_instance_count,
            },
        },
    ],
)
endpoint = sm_client.create_endpoint(EndpointName = endpoint_name,
                                     EndpointConfigName = endpoint_config_name)
_ = sess.wait_for_endpoint(endpoint_name)

----!

### Model deployment

#### Qwen/Qwen3-8B

We will use 1 GPU on the endpoint.

##### Please note that we created the endpoint with 1 instance, but we are deploying 2 copies of the IC.
##### SageMaker AI will automatically launch another instance because we enabled ManagedInstanceScaling in the previous step.

```
{
    "EndpointName": "IC-dep-251116-2100",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/IC-dep-251116-2100",
    "EndpointConfigName": "IC-dep-251116-2100",
    "ProductionVariants": [
        {
            "VariantName": "main",
            "CurrentInstanceCount": ** 1 **,
            "DesiredInstanceCount": ** 2 **,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4
            },
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            }
        }
    ],
    "EndpointStatus": "Updating",
    "CreationTime": "2025-11-16T21:00:26.774000+00:00",
    "LastModifiedTime": "2025-11-16T21:03:31.088000+00:00"
}
```
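
The instance arithmetic behind this behaviour can be sketched as follows. This is a rough illustration rather than SageMaker's actual placement logic: the helper name `instances_needed` is hypothetical, and it assumes that accelerators are the only binding constraint and that a single copy cannot span instances.

```python
import math


def instances_needed(copy_count: int, gpus_per_copy: int, gpus_per_instance: int) -> int:
    """Estimate how many instances are needed to place all IC copies.

    Assumption (not SageMaker's real scheduler): GPUs are the only
    constraint and each copy must fit entirely on one instance.
    """
    if gpus_per_copy > gpus_per_instance:
        raise ValueError("a single copy does not fit on one instance")
    copies_per_instance = gpus_per_instance // gpus_per_copy
    return math.ceil(copy_count / copies_per_instance)


# Values used in this notebook: CopyCount=2, 1 GPU per copy,
# and 1 GPU per ml.g6e.4xlarge instance.
print(instances_needed(copy_count=2, gpus_per_copy=1, gpus_per_instance=1))  # prints 2
```

With one GPU per instance, every extra copy needs its own instance, which is why `DesiredInstanceCount` moves from 1 to 2 in the output above.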

In [None]:
qwen_env = {
    "HF_MODEL_ID": "Qwen/Qwen3-8B",
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
}
qwen_model_name = sagemaker.utils.name_from_base("qwen", short=True)
qwen_ic_name = f"ic-{qwen_model_name}"

min_memory_required_in_mb = 4096
number_of_accelerator_devices_required = 1

model_response = sm_client.create_model(
    ModelName = qwen_model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image,
        "Environment": qwen_env | lmi_env,
    },
)

ic_response = sm_client.create_inference_component(
    InferenceComponentName = qwen_ic_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": qwen_model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": 2,
    },
)
_ = sess.wait_for_inference_component(qwen_ic_name)

-------------------------------!

##### You can use AWS CLI to check status of the endpoint and IC

In [None]:
!aws sagemaker describe-endpoint --endpoint-name $endpoint_name

In [None]:
!aws sagemaker describe-inference-component --inference-component-name $qwen_ic_name
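
If you prefer to check the status from Python instead of the CLI, a small polling helper is one option. The sketch below is an assumption-laden example, not part of the SageMaker SDK: `wait_for_status` is a hypothetical helper that takes any zero-argument `describe` callable, so it can wrap `sm_client.describe_inference_component` (status key `InferenceComponentStatus`) or `sm_client.describe_endpoint` (status key `EndpointStatus`).

```python
import time
from typing import Callable


def wait_for_status(
    describe: Callable[[], dict],
    status_key: str,
    target: str = "InService",
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> dict:
    """Poll `describe()` until `status_key` equals `target`.

    Raises RuntimeError if the resource reports "Failed" and
    TimeoutError if the target status is not reached in time.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe()
        status = description[status_key]
        if status == target:
            return description
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"resource entered Failed state: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Example usage in this notebook (requires AWS credentials, so left commented out):
# wait_for_status(
#     lambda: sm_client.describe_inference_component(InferenceComponentName=qwen_ic_name),
#     status_key="InferenceComponentStatus",
# )
```

Passing the describe call as a callable keeps the helper independent of any particular client, which also makes it easy to exercise with a stub.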

## Inference Example

### Qwen3-8B - before the deployment

In [40]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 InferenceComponentName = qwen_ic_name,
                                 Body = json.dumps(payload),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

✅ Response time: 28.52s



<think>
Okay, the user is asking for popular places to visit in London. Let me start by recalling the main attractions. First, the British Museum is a must. It's free and has a lot of historical artifacts. Then there's the Tower of London, which is iconic with the Crown Jewels. Big Ben and the Houses of Parliament are also key, though they're more for the view than the buildings themselves. The London Eye is a modern attraction with great views of the city. 

Wait, I should mention the Tower Bridge too. It's a famous landmark, even if it's not as old as the Tower. The Sherlock Holmes Museum might be a good spot for fans of the detective stories. The National Gallery is another must, especially for art lovers. 

What about the Coca-Cola London Eye? Oh, right, that's the same as the London Eye. Maybe I should clarify that. The Trafalgar Square is a big area with Nelson's Column and the National Gallery nearby. The Westminster Abbey is important for historical and religious reasons. 

Oh, and the London Underground is a unique experience, even if it's not a place to visit. The British Library is another cultural spot. The South Bank area is lively with the River Thames and the Shakespeare's Globe Theatre. 

Wait, I should check if there are any other notable places. The Natural History Museum and Science Museum are popular. The Victoria and Albert Museum for art and design. The Shard is a modern skyscraper with a viewing platform. 

I need to make sure I'm not missing any major attractions. Maybe the Regent's Park for the London Zoo or the Royal Botanic Gardens at Kew. Also, the Camden Market for shopping and food. 

I should structure this in a list, maybe with categories like historical, cultural, modern, and parks. But the user just asked for popular places, so a straightforward list with brief descriptions would work. Let me organize them in order of popularity and include a bit of detail for each. Also, mention that some are free or have entry fees. Alright, that should cover the main points.
</think>

London is a city rich in history, culture, and iconic landmarks. Here are some of the most popular places to visit:

### **Historical & Cultural Attractions**
1. **British Museum**  
   - Home to artifacts like the Rosetta Stone, Parthenon Marbles, and Egyptian mummies. Free entry.

2. **Tower of London**  
   - A historic castle housing the Crown Jewels, Tower Bridge, and the White Tower. Entry fee applies.

3. **Houses of Parliament & Big Ben**  
   - Iconic landmarks with stunning architecture and a view of the Thames. Free to visit the exterior.

4. **Westminster Abbey**  
   - A historic church where British monarchs are crowned and buried. Free entry.

5. **St. Paul’s Cathedral**  
   - A breathtaking example of English Baroque architecture with panoramic views.

### **Modern & Iconic Landmarks**
6. **London Eye**  
   - A giant Ferris wheel offering 360° views of the city. Entry fee.

7. **The Shard**  
   - A modern skyscraper with a viewing platform and a unique glass floor.

8. **Trafalgar Square**  
   - A bustling square with Nelson’s Column, the National Gallery, and the statue of the Fourth Earl of Chatham.

9. **Buckingham Palace**  
   - The official residence of the British monarch. Free to view the exterior and gardens.

### **Art & Museums**
10. **National Gallery**  
    - Houses masterpieces by Van Gogh, Rembrandt, and Turner. Free entry.

11. **Natural History Museum**  
    - Famous for its dinosaur skeletons and the Hintze Hall. Entry fee.

12. **Science Museum**  
    - Interactive exhibits on science and technology, including the Apollo 10 spacecraft.

### **Parks & Green Spaces**
13. **Hyde Park & Kensington Gardens**  
    - Perfect for walking, picnicking, and visiting the Serpentine Lake.

14. **Regent’s Park**  
    - Home to the London Zoo and the Royal Botanic Gardens (Kew Gardens are nearby).

### **Shopping & Markets**
15. **Oxford Street & Bond Street**  
    - Retail hubs with high-end brands and department stores.

16. **Camden Market**  
    - A vibrant market offering vintage, street food, and eclectic shops.

### **Theaters & Performing Arts**
17. **West End**  
    - Renowned for its musicals and theater productions (e.g., *Hamilton*, *Les Misérables*).

18. **Shakespeare’s Globe Theatre**  
    - A reconstructed Elizabethan playhouse in Southwark.

### **Other Highlights**
19. **Covent Garden**  
    - A lively area with street performers, markets, and historic buildings.

20. **South Bank**  
    - A cultural hub with the National Theatre, River Thames views, and the St. Thomas Hospital.

### **Tips**  
- **Free attractions**: Many museums, parks, and landmarks are free.  
- **Transport**: Use the London Underground (Tube) for easy access.  
- **Seasonal events**: Check for festivals like the London Eye fireworks or the Notting Hill Carnival.

London’s mix of old and new ensures there’s something for everyone, whether you’re into history, art, shopping, or simply soaking in the city’s atmosphere! 🗼🌆
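
As the output above shows, Qwen3-8B returns its reasoning inside `<think>...</think>` tags in the same `content` field as the final answer. If you only want to display the answer, one option is to split the two parts. The sketch below is illustrative: `split_reasoning` is a hypothetical helper, and it assumes the tags appear at most once in the message content.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a message that may contain a <think> block."""
    match = _THINK_RE.search(content)
    if match is None:
        # No reasoning block, e.g. a model that does not emit <think> tags.
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer


sample = "<think>\nList the main attractions.\n</think>\n\nLondon is a city rich in history."
reasoning, answer = split_reasoning(sample)
print(answer)  # prints: London is a city rich in history.
```

In the notebook you could pass `response["choices"][0]["message"]["content"]` to the helper and render only the second element with `Markdown`.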

### openai/gpt-oss-20b - Replace Qwen3 on the IC

We will use 1 GPU on the endpoint (for illustrative purposes).

In [None]:
gptoss_env = {
    "HF_MODEL_ID": "openai/gpt-oss-20b",
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
}
gptoss_model_name = sagemaker.utils.name_from_base("gpt-oss", short=True)

min_memory_required_in_mb = 4096
number_of_accelerator_devices_required = 1

model_response = sm_client.create_model(
    ModelName = gptoss_model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image,
        "Environment": gptoss_env | lmi_env,
    },
)

##### Please note that in `update_inference_component` we reuse the IC name but deploy a different model (`ModelName`).

At some point during the update, SageMaker AI will launch 1 additional instance for the rolling deployment:

```
{
    "EndpointName": "IC-dep-251116-2100",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/IC-dep-251116-2100",
    "EndpointConfigName": "IC-dep-251116-2100",
    "ProductionVariants": [
        {
            "VariantName": "main",
            "CurrentInstanceCount": ** 3 **,
            "DesiredInstanceCount": ** 3 **,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4
            },
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            }
        }
    ],
    "EndpointStatus": "Updating",
    "CreationTime": "2025-11-16T21:00:26.774000+00:00",
    "LastModifiedTime": "2025-11-16T21:17:22.257000+00:00"
}
```

In [42]:
dep_config = {
    'RollingUpdatePolicy': {
        'MaximumBatchSize': {
            'Type': 'COPY_COUNT',
            'Value': 1
        },
        'WaitIntervalInSeconds': 60,
        'MaximumExecutionTimeoutInSeconds': 3600,
        'RollbackMaximumBatchSize': {
            'Type': 'COPY_COUNT',
            'Value': 1
        }
    },
}

ic_response = sm_client.update_inference_component(
    InferenceComponentName = qwen_ic_name,
    Specification={
        "ModelName": gptoss_model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": 2,
    },
    DeploymentConfig = dep_config,
)
_ = sess.wait_for_inference_component(qwen_ic_name)

----------------------------------------------------------------------------------------------!
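
To reason about what a rolling update policy implies, it can help to work out the numbers by hand. The sketch below is a back-of-the-envelope estimate, not SageMaker's internal algorithm: `rollout_plan` is a hypothetical helper, and it assumes that each batch of new copies is started before the corresponding old copies are removed and that the wait interval applies once per batch.

```python
import math


def rollout_plan(copy_count: int, batch_size: int, wait_interval_s: int) -> dict:
    """Estimate batches, peak copies, and minimum total wait for a rolling update.

    Assumptions (illustrative only): new copies in a batch run alongside
    the old ones until the batch completes, and the wait interval is
    observed once per batch.
    """
    batches = math.ceil(copy_count / batch_size)
    return {
        "batches": batches,
        "peak_copies": copy_count + min(batch_size, copy_count),
        "min_wait_seconds": batches * wait_interval_s,
    }


# Values from this notebook: CopyCount=2, MaximumBatchSize=1 copy,
# WaitIntervalInSeconds=60.
print(rollout_plan(copy_count=2, batch_size=1, wait_interval_s=60))
# prints {'batches': 2, 'peak_copies': 3, 'min_wait_seconds': 120}
```

A peak of 3 copies at 1 GPU each on 1-GPU instances is consistent with the endpoint temporarily growing to 3 instances during the update.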

In [None]:
!aws sagemaker describe-endpoint --endpoint-name $endpoint_name

##### Once the IC is in service, you can see that its model name has changed to the GPT-OSS model, as shown below:

```
{
    "InferenceComponentName": "ic-qwen-251116-2103",
    "InferenceComponentArn": "arn:aws:sagemaker:us-east-1:123456789012:inference-component/ic-qwen-251116-2103",
    "EndpointName": "IC-dep-251116-2100",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/ic-dep-251116-2100",
    "VariantName": "main",
    "Specification": {
        "ModelName": "gpt-oss-251116-2116",
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1.0,
            "MinMemoryRequiredInMb": 4096
        }
    },
    "RuntimeConfig": {
        "DesiredCopyCount": 2,
        "CurrentCopyCount": 2
    },
    "CreationTime": "2025-11-16T21:03:23.576000+00:00",
    "LastModifiedTime": "2025-11-16T21:48:54.509000+00:00",
    "InferenceComponentStatus": "InService",
    "LastDeploymentConfig": {
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 60,
            "MaximumExecutionTimeoutInSeconds": 3600,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        }
    }
}
```

In [None]:
!aws sagemaker describe-inference-component --inference-component-name $qwen_ic_name

## Inference Example

### GPT-OSS - after the deployment

#### Please note the very different output from the same IC name

In [44]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 InferenceComponentName = qwen_ic_name,
                                 Body = json.dumps(payload),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

✅ Response time: 4.19s



Here are some of London’s most popular places to visit:

- **The British Museum** – World-class collections from ancient Egypt to the modern era.  
- **The Tower of London** – Historic castle, Crown Jewels, and Yeoman Warder tours.  
- **St Paul’s Cathedral** – Iconic dome, stunning interior, and panoramic city views from the dome.  
- **The National Gallery** – Home to European masterpieces from da Vinci to Van Gogh.  
- **The Tate Modern** – Contemporary art housed in a converted power station.  
- **Buckingham Palace** – The royal residence; watch the Changing of the Guard.  
- **The Houses of Parliament & Big Ben** – The heart of UK politics with stunning Gothic architecture.  
- **The River Thames & Southbank** – Scenic walks, the London Eye, and cultural venues like the Globe Theatre.  
- **The Natural History Museum** – Dinosaurs, the blue whale skeleton, and interactive exhibits.  
- **The Victoria and Albert Museum (V&A)** – World-class design and decorative arts.  
- **Camden Town** – Vibrant market, live music, and eclectic street food.  
- **Piccadilly Circus & Leicester Square** – Iconic entertainment hubs with neon lights.  
- **Oxford Street** – Europe's busiest shopping street.  
- **The Shard** – Tallest building in the UK; sky-deck offers breathtaking city vistas.  
- **Hyde Park & Kensington Gardens** – Lush green spaces for relaxation and boating.  
- **The Royal Botanic Gardens, Kew** – Extensive plant collections and greenhouse spectacles.  
- **Covent Garden** – Street performers, boutique shops, and the Royal Opera House.  
- **The Golden Galleon, the Gherkin** – Modernist architecture adjacent to the financial district.  
- **The Sherlock Holmes Museum** – 221B Baker Street for fans of the detective.  
- **The Saatchi Art Gallery & Contemporary Space** – Cutting-edge contemporary art scene.  

Feel free to pick your favorites based on interests: culture, history, shopping, or green spaces!

### After a timeout, SageMaker AI will terminate the additional instance that was used for the rolling deployment.

```
{
    "EndpointName": "IC-dep-251116-2100",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/IC-dep-251116-2100",
    "EndpointConfigName": "IC-dep-251116-2100",
    "ProductionVariants": [
        {
            "VariantName": "main",
            "CurrentInstanceCount": ** 2 **,
            "DesiredInstanceCount": ** 2 **,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4
            },
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            }
        }
    ],
    "EndpointStatus": "Updating",
    "CreationTime": "2025-11-16T21:00:26.774000+00:00",
    "LastModifiedTime": "2025-11-16T21:58:45.998000+00:00"
}
```

## Cleanup

In [45]:
sess.delete_inference_component(qwen_ic_name, wait=True)
sess.delete_model(qwen_model_name)
sess.delete_model(gptoss_model_name)

In [46]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)