# Lab 0: Deploy GPT-OSS Model to SageMaker AI Endpoint

This notebook guides you through deploying a [GPT-OSS 20B](https://github.com/openai/gpt-oss) model to [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) using the [Large Model Inference (LMI)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) v16 container and the vLLM backend. We'll use this deployed endpoint in later notebooks to explore detection methods that rely on the extra internal visibility offered by open-weight models.

⏰ We start the deployment early because it can take some time to complete, **but**...

🏎️ Since the first lab doesn't need it yet, you can start working on that **in parallel** when you hit a long wait in this notebook!

For further examples of hosting different Foundation Models on SageMaker, check out [aws-samples/sagemaker-genai-hosting-examples](https://github.com/aws-samples/sagemaker-genai-hosting-examples).

## Getting started and connecting to AWS

If you haven't already, install the dependencies for the workshop as listed in [pyproject.toml](./pyproject.toml) by running the cell below:

In [None]:
%pip install -e .

<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>üîÑ Restart the kernel after installing</h4>
    <p>
        You'll need to restart the notebook kernel after running this cell, for the installations
        to take full effect.
    </p>
    <p>
        Note that you may see some error notices about dependency conflicts in SageMaker Studio
        environments, but this is okay as long as the installations are completed.
    </p>
</div>
<br/>

Once the necessary libraries are installed, run the cell below to initially connect to the AWS services we'll use.

For more information, you can refer to:
- [`sagemaker` Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/)
- Boto3 client docs for [sagemaker](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) and [sagemaker-runtime](https://boto3.amazonaws.com/v1/documentation/api/1.21.44/reference/services/sagemaker-runtime.html)

In [None]:
# Python Built-Ins:
import json

# External Dependencies:
import boto3  # AWS SDKs for Python
import sagemaker  # High-level SDK for SageMaker AI
from sagemaker.compute_resource_requirements.resource_requirements import (
    ResourceRequirements,
)
from sagemaker.enums import EndpointType
from sagemaker.session import Session as SageMakerSession

# You could instead set region_name explicitly in Session() if wanted:
boto_sess = boto3.Session()
region = boto_sess.region_name

# Low-level clients for SageMaker control plane and endpoint invocation:
sm_client = boto_sess.client("sagemaker")
smr_client = boto_sess.client("sagemaker-runtime")

sm_sess = SageMakerSession(boto_session=boto_sess)  # High-level client for SageMaker

print(f"Working in AWS Region: {region}")

## Specify Container and vLLM Serving Properties

To deploy the GPT-OSS-20B model, we'll use:

- The AWS [Large Model Inference (LMI)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) container image, at version 16+ to include vLLM version 0.10.2+
- A GPU-accelerated G5 compute instance type: [ml.g5.4xlarge](https://aws.amazon.com/ec2/instance-types/g5/)

LMI v16 also adds support for several newer features. You can find documentation and samples for them at the links below:

- [Custom Input/Output Formatters](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/input_formatter_schema.md)
- [Multi-Adapter Inference](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/adapters.md)
- [Sticky Session Routing](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/stateful_sessions.md)

In [None]:
# Specify the LMI v16 container image:
CONTAINER_VERSION = "0.34.0-lmi16.0.0-cu128-v1.2"
inference_image = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"
)
num_gpu = 1
instance_type = "ml.g5.4xlarge"
print(f"Using image URI: {inference_image}")

**Regarding your endpoint `role`:**

Your endpoint will run with an [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) that defines its permissions, configured below. The provided role must:

1. Be assumable by the SageMaker service
2. Have sufficient permissions to pull your container image from Amazon ECR (and should probably also have permissions to send logs, events, etc. to Amazon CloudWatch) - see [here in the SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-createmodel-perms) for more details

If you're running this notebook outside of SageMaker, the `sagemaker.get_execution_role()` we use below might fail or you might be using a role that's not assumable by SageMaker. You can instead set `role` to an existing role ARN of your choice.
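
The fallback described above could be sketched as follows. This is a minimal illustration, not part of the workshop code: the helper name `resolve_role` and the environment variable `SAGEMAKER_ROLE_ARN` are hypothetical names chosen for this example. Note that it only covers the case where the lookup fails outright; it cannot detect a role that is returned but isn't assumable by SageMaker.

```python
import os


def resolve_role(get_role, env_var="SAGEMAKER_ROLE_ARN"):
    """Return get_role() if it succeeds, else fall back to an ARN from the environment.

    `get_role` is any zero-argument callable, e.g. sagemaker.get_execution_role.
    `env_var` is a hypothetical variable name used only for this sketch.
    """
    try:
        return get_role()
    except Exception as err:  # Broad on purpose: the failure type varies by environment
        fallback = os.environ.get(env_var)
        if not fallback:
            raise RuntimeError(
                f"Could not look up an execution role, and {env_var} is not set"
            ) from err
        return fallback


# Example usage (assumes the `sagemaker` SDK is installed):
# import sagemaker
# role = resolve_role(sagemaker.get_execution_role)
```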

**Configuring vLLM:**

This container is set up to map `OPTION_...` environment variables into [vllm serve CLI arguments](https://docs.vllm.ai/en/latest/cli/serve.html). For example:
- `OPTION_MODEL` -> [`--model`](https://docs.vllm.ai/en/latest/cli/serve.html#-model)
- `OPTION_TOOL_CALL_PARSER` -> [`--tool-call-parser`](https://docs.vllm.ai/en/latest/cli/serve.html#-tool-call-parser).
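
To make the naming convention concrete, here is a small sketch of the mapping implied by the two examples above. It only illustrates the pattern (strip the `OPTION_` prefix, lower-case, swap underscores for hyphens); it is not the container's actual implementation, and some variables may not follow this pattern exactly. The function name is hypothetical.

```python
def option_env_to_cli_flag(env_name: str) -> str:
    """Illustrate the OPTION_... -> --cli-flag naming pattern (sketch only)."""
    prefix = "OPTION_"
    if not env_name.startswith(prefix):
        raise ValueError(f"Expected a name starting with {prefix!r}, got {env_name!r}")
    # Drop the prefix, lower-case the rest, and use hyphens instead of underscores
    return "--" + env_name[len(prefix):].lower().replace("_", "-")


for name in ("OPTION_MODEL", "OPTION_TOOL_CALL_PARSER"):
    print(f"{name} -> {option_env_to_cli_flag(name)}")
```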

**Learn more:**
- [SageMaker Inference Components](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-component-based-endpoint.html)
- [vLLM Engine Arguments](https://docs.vllm.ai/en/latest/models/engine_args.html)
- [SageMaker Instance Types](https://aws.amazon.com/sagemaker/pricing/)

In [None]:
# Environment variables including vLLM config
config = {
    "HF_MODEL_ID": "openai/gpt-oss-20b",  # Load GPT-OSS from Hugging Face
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_REASONING_PARSER": "openai_gptoss",
    "OPTION_SERVED_MODEL_NAME": "model",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_TOOL_CALL_PARSER": "openai",
    "SERVING_FAIL_FAST": "true",
}

# Naming your SageMaker resources:
# (name_from_base attaches a timestamp-based suffix for transparent uniqueness)
model_name = sagemaker.utils.name_from_base("gpt-oss-20b-vllm")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"

# IAM permissions (see note above):
role = sagemaker.get_execution_role()

print(f"Using IAM Role ARN:\n{role}")

## Create Inference Component and Endpoint
Here we use the higher-level SageMaker Python SDK to create both an endpoint and an Inference Component. Adjust the resource requirements to suit your model and use case.

This next cell deploys your model to SageMaker AI. The deployment process:
1. Creates a SageMaker Model object from your container image
2. Deploys a component-based Endpoint (with associated Endpoint Configuration) to host your model
3. Adds the model to the endpoint as an 'inference component'

**Learn more:**
- [SageMaker Model Deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deploy-models.html)
- [SageMaker Predictor API](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html)

In [None]:
lmi_model = sagemaker.Model(
    env=config,
    image_uri=inference_image,
    name=model_name,
    role=role,
    sagemaker_session=sm_sess,
)

print(f"Deploying endpoint name: {endpoint_name}")
print(f"Inference component name: {inference_component_name}")
lmi_model.deploy(
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    initial_instance_count=1,
    instance_type=instance_type,
    resources=ResourceRequirements(
        requests={
            "num_accelerators": num_gpu,
            "memory": 1024 * 3,  # Minimum host memory, in MB
            "copies": 1,
        }
    ),
)
print("\nDeployed!")

# Store the endpoint configuration for use in other notebooks
%store endpoint_name
%store inference_component_name



<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>⏰ This process can take ~6-10 minutes</h4>
    <p>
        While it's running, you can move on to start working through Lab 1 and check back later. The SageMaker endpoint must be successfully deployed before you start Lab 2, so come back here once you've finished Lab 1.
    </p>
    <p>
        Check the deployment cell's output to make sure the deployment completed successfully
        before moving on in this notebook.
    </p>
</div>
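
If you come back later (or from another notebook) and want to confirm the endpoint is ready, one option is to poll its status through the SageMaker `DescribeEndpoint` API. The helper below is a sketch under stated assumptions: the function name, polling interval and timeout are illustrative choices, and it expects a boto3 `sagemaker` client such as the `sm_client` created earlier.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll DescribeEndpoint until the endpoint is InService (sketch only)."""
    # Statuses from which the endpoint will not become InService on its own
    terminal = {"Failed", "OutOfService", "Deleting", "UpdateRollbackFailed"}
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status in terminal:
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Example usage:
# wait_for_endpoint(sm_client, endpoint_name)
```

The inference component has its own status, which can be checked in the same way with `sm_client.describe_inference_component(InferenceComponentName=...)` and the `InferenceComponentStatus` field.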

## Testing the Deployed Endpoint

Now that your model is deployed, let's test it with various inference examples using the OpenAI-compatible API format.

Here, we'll set up a "Predictor" client using the SageMaker high-level Python SDK to make inference requests. It's also possible to use the low-level SageMaker `InvokeEndpoint` API via generic AWS SDKs like `boto3` for Python or equivalents for other languages.

**Learn more:**
- [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create)
- [SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html)
- [SageMaker Python SDK 'Predictor' client](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html)
- [Boto3 SageMakerRuntime with low-level Python `invoke_endpoint` method](https://boto3.amazonaws.com/v1/documentation/api/1.21.44/reference/services/sagemaker-runtime.html)
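
For comparison, the low-level route mentioned above might look like the sketch below. It assumes a boto3 `sagemaker-runtime` client (like the `smr_client` created earlier) and a JSON request/response body; the helper name `invoke_chat` is hypothetical.

```python
import json


def invoke_chat(smr_client, endpoint_name, inference_component_name, payload):
    """Send a JSON payload via InvokeEndpoint and return the parsed JSON response."""
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        # Required when targeting an inference-component-based endpoint:
        InferenceComponentName=inference_component_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes that needs reading and decoding
    return json.loads(response["Body"].read())


# Example usage, once endpoint_name and inference_component_name are defined:
# res = invoke_chat(
#     smr_client,
#     endpoint_name,
#     inference_component_name,
#     {"messages": [{"role": "user", "content": "Hello!"}]},
# )
```

The Predictor client used in this notebook wraps the same API call and handles the serialization and deserialization for you.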

In [None]:
%store -r endpoint_name
%store -r inference_component_name

llm = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sm_sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    component_name=inference_component_name,
)

### Basic Inference Example

This example demonstrates a simple, non-streaming chat completion request to the model - extracting just the text content of the response:

In [None]:
payload = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}
res = llm.predict(payload)
print("\n".join(("-----", res["choices"][0]["message"]["content"], "-----", "")))
print(res["usage"])

Alternatively, you can also view the full response object including metadata:

In [None]:
print(json.dumps(res, indent=2))

### Tool calling

The following shows how tool/function calling typically works through the model API: you pass in a list of one or more tool definitions, and the model can decide to invoke them as part of its response.
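
When the model does decide to call a tool, the request usually needs to be pulled out of the response before your code can act on it. The sketch below assumes the OpenAI Chat Completions response shape, where `message.tool_calls` is a list and each call's `function.arguments` is a JSON-encoded string. The helper name and the `sample_response` are hand-written for illustration and are not real model output.

```python
import json


def extract_tool_calls(response):
    """Return a list of {id, name, arguments} dicts from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    # tool_calls may be missing or None when the model answers in plain text
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        raw_args = fn.get("arguments") or "{}"
        calls.append(
            {
                "id": call.get("id"),
                "name": fn["name"],
                "arguments": json.loads(raw_args),  # Arguments arrive as a JSON string
            }
        )
    return calls


# Hand-written example in the assumed response shape:
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "get_pricing_service_codes",
                            "arguments": "{}",
                        },
                    }
                ],
            }
        }
    ]
}

for call in extract_tool_calls(sample_response):
    print(call["name"], call["arguments"])
```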

In [None]:
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": "What's the service pricing code for Amazon SageMaker AI?",
                    "type": "text",
                }
            ],
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_pricing_service_codes",
                "description": "Get AWS service codes available in the Price List API.",
                "parameters": {
                    "properties": {},
                    "title": "get_pricing_service_codesArguments",
                    "type": "object",
                    "required": [],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

res = llm.predict(payload)
print("\n".join(("-----", json.dumps(res["choices"][0], indent=2), "-----", "")))
print(res["usage"])

### Advanced options: Log-Probabilities

We can also enable **log probabilities** (logprobs), which return the model's confidence scores for generated tokens. This is useful for:
- **Uncertainty quantification:** Understanding model confidence in its predictions
- **Hallucination detection:** Lower confidence scores may indicate uncertain or hallucinated content
- **Alternative token analysis:** Seeing what other tokens the model considered

**Parameters:**
- `logprobs: True` - Enable log probability output
- `top_logprobs: 5` - Return the 5 most likely alternative tokens, with their log probabilities, for each position
- `n: 2` - (Optional) Generate multiple response choices for comparison

**Learn more:**
- [Understanding Log Probabilities in LLMs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-logprobs)
- [Using Logprobs for Hallucination Detection](https://github.com/aws-samples/sagemaker-genai-hosting-examples)

In [None]:
payload = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in Singapore?"}
    ],
    "logprobs": True,  # <- Enable logprobs
    "top_logprobs": 5,  # <- Set number of logprobs to return per token
    # "n": 2,  # <- Generate multiple 'choices' (good for semantic similarity!)
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

You can view the detailed log probability data for each token in the response. Each entry contains:
- The generated token
- Its log probability (confidence score)
- Top alternative tokens and their probabilities
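
Since a log probability is the natural log of a probability, `math.exp` converts it back to a value between 0 and 1. The sketch below shows one way to summarize the per-token entries, assuming each entry is a dict with `token` and `logprob` keys as in the OpenAI format. The helper name and the 0.5 threshold are arbitrary choices for illustration, not a recommended cut-off.

```python
import math


def summarize_logprobs(token_entries, threshold=0.5):
    """Convert per-token logprobs to probabilities and flag low-confidence tokens.

    Returns (rows, mean_logprob); mean_logprob is None for an empty input.
    """
    rows = []
    for entry in token_entries:
        prob = math.exp(entry["logprob"])  # logprob -> probability in [0, 1]
        rows.append(
            {
                "token": entry["token"],
                "prob": prob,
                "low_confidence": prob < threshold,
            }
        )
    if not token_entries:
        return rows, None
    mean_logprob = sum(e["logprob"] for e in token_entries) / len(token_entries)
    return rows, mean_logprob


# Example usage with a response that requested logprobs:
# rows, mean_lp = summarize_logprobs(res["choices"][0]["logprobs"]["content"])
# print([r["token"] for r in rows if r["low_confidence"]], mean_lp)
```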

In [None]:
res["choices"][0]["logprobs"]["content"]

## Clean Up Resources (Optional)

You'll need this endpoint for the labs in the workshop, but endpoints are billable for as long as they're deployed. Remember to delete your endpoint when you're done experimenting with the workshop, to avoid ongoing charges.

**Learn more:**
- [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)
- [Deleting SageMaker Resources](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html)

In [None]:
# Uncomment the lines below to delete the endpoint and its associated resources:
# sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
# sm_client.delete_model(ModelName=model_name)