# Deploy OpenAI gpt-oss model on SageMaker AI

In this notebook, we deploy the OpenAI gpt-oss model on SageMaker AI using three options: SageMaker JumpStart, model weights stored in S3, and a custom vLLM container.

See this introductory [blog post](https://simonwillison.net/2025/Aug/5/gpt-oss/) on the gpt-oss models for more details.

## Setup

Fetch and import dependencies

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import json
import sagemaker
import boto3

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker endpoints

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker version: {sagemaker.__version__}")

## Deployment

### Option 1. Deploy gpt-oss-120b from JumpStart 

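We will use an inference component (IC) enabled endpoint, where the accelerators and memory reserved for the model are declared through `ResourceRequirements`.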

In [None]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

accept_eula = False  # Change to True to agree to the terms and conditions and accept the EULA
model_id, model_version = "openai-reasoning-gpt-oss-120b", "1.0.0"

model_name = endpoint_name = sagemaker.utils.name_from_base("gpt-oss-120b")
inference_component_name = f"ic-{model_name}"

jumpstart_model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
    name=model_name
)

jumpstart_model.deploy(
    accept_eula=accept_eula,
    instance_type="ml.p5en.48xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=900,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 8, "memory": 1024*10, "copies": 1,}),
)

# Predictor that sends JSON requests to the inference component
llm = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    component_name=inference_component_name
)

### Option 2. Deploy gpt-oss model from S3

If you need to deploy these models from S3 (for example, after fine-tuning), you can use the code below.

Change `model_s3_path` to the S3 prefix that holds your uncompressed model weights.
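Before deploying, it can help to sanity-check the value of `model_s3_path`. The helper below is a minimal sketch, not part of the SageMaker SDK: `normalize_s3_prefix` is a hypothetical name, and the trailing-slash rule reflects the assumption that an `S3Prefix` data source with `CompressionType` set to `None` expects the URI to end with `/`.

```python
def normalize_s3_prefix(uri: str) -> str:
    """Validate an S3 prefix URI and make sure it ends with a slash.

    Hypothetical helper for illustration; it performs no AWS calls and
    does not check that the bucket or prefix actually exists.
    """
    # Catch the unedited template value from the notebook.
    if "<" in uri or ">" in uri:
        raise ValueError("replace the <BUCKET> and <PREFIX> placeholders first")
    if not uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")

    # Split "bucket/some/prefix" into the bucket and the key prefix.
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError("the URI is missing a bucket name")

    # Assumption: uncompressed S3Prefix sources should end with "/".
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return f"s3://{bucket}/{prefix}"
```

For example, `model_s3_path = normalize_s3_prefix("s3://my-bucket/models/gpt-oss")` would yield a prefix with the trailing slash added, while the unedited placeholder value raises a `ValueError`.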

In [None]:
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.0.0.dev1-lmi0.0.0-cu128"
instance_type = "ml.p5en.48xlarge"

model_s3_path = "s3://<BUCKET>/<PREFIX>/"

lmi_env = {
    "OPTION_MODEL_ID": "/opt/ml/model",
    #"TIKTOKEN_ENCODINGS_BASE": "/opt/ml/model",
    "OPTION_TENSOR_PARALLEL_SIZE": "8",
}

model_name = sagemaker.utils.name_from_base("model-lmi")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=lmi_env,
    role=role,
    name=model_name,
    model_data={
        'S3DataSource': {
            'S3Uri': model_s3_path,
            'S3DataType': 'S3Prefix',
            'CompressionType': 'None'
        }
    },
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 8, "memory": 1024*10, "copies": 1,}),
)

llm = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    component_name=inference_component_name
)

### Option 3. Deploy gpt-oss-20b from Hugging Face using a bring-your-own-container (BYOC) image

Amazon SageMaker AI provides the ability to build Docker containers to run on SageMaker endpoints, where they listen for health checks on `/ping` and receive real-time inference requests on `/invocations`.

Below, we'll demonstrate how to adapt the [vLLM](https://github.com/vllm-project/vllm) framework to run on SageMaker AI endpoints.
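To make the `/ping` and `/invocations` contract described above concrete, here is a minimal sketch built only on the Python standard library. It is not the vLLM adapter used in this notebook: `route`, `ContractHandler`, and `serve` are hypothetical names, and the echo response stands in for a real model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(method: str, path: str, body: bytes) -> tuple[int, dict]:
    """Map a request to (status code, JSON body) following the contract."""
    if method == "GET" and path == "/ping":
        # Health check: a 200 response signals that the container is ready.
        return 200, {"status": "ok"}
    if method == "POST" and path == "/invocations":
        try:
            request = json.loads(body or b"{}")
        except ValueError:
            return 400, {"error": "request body must be valid JSON"}
        if not isinstance(request, dict):
            return 400, {"error": "request body must be a JSON object"}
        # Placeholder "model": echo the last message back to the caller.
        messages = request.get("messages") or []
        last = messages[-1].get("content", "") if messages else ""
        return 200, {"echo": last}
    return 404, {"error": f"no route for {method} {path}"}


class ContractHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around route()."""

    def _reply(self, method: str) -> None:
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else b""
        status, payload = route(method, self.path, body)
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        self._reply("GET")

    def do_POST(self) -> None:
        self._reply("POST")


def serve(port: int = 8080) -> None:
    """Block and serve requests; SageMaker sends traffic to port 8080."""
    HTTPServer(("0.0.0.0", port), ContractHandler).serve_forever()
```

Keeping the routing logic in a plain function makes it easy to check without opening a socket, for example `route("GET", "/ping", b"")`.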

#### Container preparation:
1. Enable Docker access in your Studio domain. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local-get-started.html#studio-updated-local-enable) for details.
2. Install Docker in your Studio environment. See this [link](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local-get-started.html#studio-updated-local-docker-installation) for details.
3. Edit `build.sh` to set `REGION` and `ACCOUNT_ID`, and change `REPOSITORY_NAME` and `TAG` if required.
4. Build the image and push it to an ECR repository by running `build.sh` in the `docker` directory.
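The image URI that the deployment cell uses has to match the one `build.sh` pushes. The sketch below mirrors the `full_name` pattern from the script in Python; `ecr_image_uri` is a hypothetical helper, and the 12-digit account ID check is only a guard against leaving the `YOUR_ACCOUNT_ID` placeholder in place.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build an ECR image URI in the same shape as full_name in build.sh.

    Hypothetical helper for illustration; it does not contact ECR.
    """
    # AWS account IDs are 12 digits, so this rejects unedited placeholders.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"account_id should be 12 digits, got {account_id!r}")
    for label, value in (("region", region), ("repository", repository), ("tag", tag)):
        if not value or "YOUR_" in value:
            raise ValueError(f"{label} is not set: {value!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

With the defaults from `build.sh`, `ecr_image_uri(account_id, region, "vllm", "v0.10.0-gpt-oss")` produces the value expected for `inference_image`.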


In [None]:
%%writefile docker/build.sh
export REGION=YOUR_REGION
export ACCOUNT_ID=YOUR_ACCOUNT_ID
export REPOSITORY_NAME=vllm
export TAG=v0.10.0-gpt-oss

full_name="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPOSITORY_NAME}:${TAG}"

DOCKER_BUILDKIT=1 docker build . --tag $REPOSITORY_NAME:$TAG --file Dockerfile.gptoss

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region ${REGION} --repository-names "${REPOSITORY_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --region ${REGION} --repository-name "${REPOSITORY_NAME}" > /dev/null
fi

docker tag $REPOSITORY_NAME:$TAG ${full_name}
docker push ${full_name}

In [None]:
# Uncomment to build and push the image (runs in a subshell, so no `cd ..` is needed)
# !cd docker && bash build.sh

In [None]:
#
# Please make sure you are using the image that you pushed into ECR in a previous step
#
inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.0-gpt-oss"
instance_type = "ml.g6e.4xlarge"
num_gpu = 1
model_name = sagemaker.utils.name_from_base("model-byoc")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"

config = {
    "OPTION_MODEL": "openai/gpt-oss-20b",
    "OPTION_SERVED_MODEL_NAME": "model",
    "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpu),
    "VLLM_ATTENTION_BACKEND": "TRITON_ATTN_VLLM_V1",
    "OPTION_ASYNC_SCHEDULING": "true",
}

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=config,
    role=role,
    name=model_name,
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": num_gpu, "memory": 1024*5, "copies": 1,}),
)

llm = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    component_name=inference_component_name
)

## Inference Examples

In [36]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

-----
Here are some of the must‑see spots in London — a mix of iconic landmarks, world‑class museums, and vibrant neighborhoods:

| # | Place | Why It’s Popular |
|---|-------|------------------|
| 1 | **Buckingham Palace** | The Queen’s official London residence – watch the Changing of the Guard. |
| 2 | **The Tower of London & Tower Bridge** | Historic castle, Crown Jewels, and the iconic bridge with glass floors. |
| 3 | **The British Museum** | World‑famous collection from the Rosetta Stone to Egyptian mummies (free entry). |
| 4 | **The Houses of Parliament & Big Ben** | The classic symbol of London’s politics and architecture. |
| 5 | **The National Gallery (Tate Britain)** | Home to masterpieces from Van Gogh to Turner. |
| 6 | **Buckinghamshire Gardens (Kew Gardens)** | Stunning botanical gardens with a glasshouse and the Horniman Insect Zoo. |
| 7 | **Camden Market** | Eclectic stalls, street food, music and vintage fashion. |
| 8 | **Covent Garden** | Lively piazza with stree

In [37]:
payload={
    "messages": [
        {"role": "user", "content": "What is bigger 9.11 or 9.8?"}
    ],
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

-----
**9.8** is the larger number.

- 9.11 is nine and eleven‑hundredths (9 + 0.11).  
- 9.8 is nine and eight‑tenths (9 + 0.8).  

Since 0.8 > 0.11, 9.8 > 9.11.
-----

{'prompt_tokens': 84, 'total_tokens': 233, 'completion_tokens': 149, 'prompt_tokens_details': None}
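The same requests can also be sent through the low-level `sagemaker-runtime` client created in the setup cell. The following is a minimal sketch: `invoke_chat` and `first_message` are hypothetical helper names, and the response shape is assumed to match the `choices` / `message` / `content` structure printed above.

```python
import json


def invoke_chat(runtime_client, endpoint_name, inference_component_name, payload):
    """Send a chat payload to an inference component and decode the JSON reply.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=inference_component_name,  # routes to the component
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The body is a stream, so it has to be read before parsing.
    return json.loads(response["Body"].read())


def first_message(result: dict) -> str:
    """Return the text of the first choice in a chat-style response."""
    return result["choices"][0]["message"]["content"]
```

With the variables defined earlier in the notebook, a call would look like `first_message(invoke_chat(smr_client, endpoint_name, inference_component_name, payload))`.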


## Cleanup

In [38]:
sess.delete_inference_component(inference_component_name)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)
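If one of the calls above fails (for example, because a resource was already deleted), the remaining resources are left behind. A best-effort variant is sketched below; `cleanup` is a hypothetical helper that calls the same four session methods in the same order and collects failures instead of stopping at the first one.

```python
def cleanup(session, inference_component_name, endpoint_name, model_name):
    """Delete the resources in dependency order and return any failures.

    session is expected to expose the same delete_* methods as the
    SageMaker session used in the cell above.
    """
    steps = [
        ("inference component", session.delete_inference_component, inference_component_name),
        ("endpoint", session.delete_endpoint, endpoint_name),
        ("endpoint config", session.delete_endpoint_config, endpoint_name),
        ("model", session.delete_model, model_name),
    ]
    failures = []
    for label, delete, name in steps:
        try:
            delete(name)
        except Exception as err:  # keep going so later resources are still removed
            failures.append((label, name, str(err)))
    return failures
```

Calling `cleanup(sess, inference_component_name, endpoint_name, model_name)` returns an empty list when every deletion succeeded; anything in the list needs a manual check in the console.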