# Reinforcement Learning from Verifiable Rewards (RLVR) with SageMaker

## Lab 4 — LLM Deployment

In this lab, you will deploy the RLVR fine-tuned model to a **SageMaker real-time endpoint** and test it with a math question.

### Deployment architecture

This lab uses the **Inference Component** pattern for SageMaker real-time endpoints, which decouples the endpoint infrastructure from the model:

1. **Endpoint Configuration** — defines the instance type and routing strategy
2. **Endpoint** — provisions the compute infrastructure
3. **Model** — registers the fine-tuned weights with a serving container (DJL LMI + vLLM)
4. **Inference Component** — attaches the model to the endpoint with specific resource requirements

This separation lets you host multiple models on a single endpoint or swap models without recreating the infrastructure.

---

## 1. Prerequisites

### Set up the SageMaker session

In [None]:
import boto3
from rich.pretty import pprint
from sagemaker.core.helper.session_helper import Session, get_execution_role

# Set this to a bucket name to override the session's default bucket.
sagemaker_session_bucket = None

sess = Session()
if sagemaker_session_bucket is None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = get_execution_role()
except ValueError:
    # The current identity is not a role (e.g. a local IAM user):
    # look up the execution role by name instead.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

s3_client = boto3.client("s3")
# Recreate the session so that it uses the chosen bucket.
sess = Session(default_bucket=sagemaker_session_bucket)
sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Retrieve the fine-tuned model

We look up the Model Package from Lab 2 and extract the S3 URI of the merged model weights. We also add a random suffix to the resource names so that collisions with earlier runs are unlikely.

In [None]:
import random
from sagemaker.core.resources import ModelPackage, ModelPackageGroup
from sagemaker.core import s3

base_model_id = "huggingface-llm-qwen2-5-7b-instruct"
model_name = f"{base_model_id}-rlvr-{random.randint(100, 100000)}"

model_package_group_name = f"{base_model_id}-rlvr"
model_package_version = "1"

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"
ic_name = f"{model_name}-ic"
region = sess.boto_region_name

model_package_group = ModelPackageGroup.get(model_package_group_name)

fine_tuned_model_package_arn = f"{model_package_group.model_package_group_arn.replace('model-package-group', 'model-package', 1)}/{model_package_version}"
print(f"Fine-tuned Model Package ARN: {fine_tuned_model_package_arn}")

model_package = ModelPackage.get(fine_tuned_model_package_arn)

merged_model_s3_uri = s3.s3_path_join(
    model_package.inference_specification.containers[0].model_data_source.s3_data_source.s3_uri,
    "checkpoints", "hf_merged"
) + "/"
print(f"Merged model S3 URI: {merged_model_s3_uri}")
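The ARN construction in the cell above can be isolated in a small helper, which makes the string manipulation easier to check. The sketch below is illustrative only: `model_package_arn` and `check_resource_name` are hypothetical helpers, not part of the SageMaker SDK, and the 63-character limit and name pattern reflect my reading of the SageMaker API reference for endpoint, model, and inference component names.

```python
import re

# Assumed SageMaker naming rule: alphanumerics separated by hyphens, max 63 chars.
_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LEN = 63


def model_package_arn(group_arn: str, version: str) -> str:
    """Derive a versioned Model Package ARN from a Model Package Group ARN."""
    marker = ":model-package-group/"
    if marker not in group_arn:
        raise ValueError(f"Not a model package group ARN: {group_arn}")
    # Replace only the resource type, then append the version number.
    return group_arn.replace(marker, ":model-package/", 1) + f"/{version}"


def check_resource_name(name: str) -> str:
    """Return the name unchanged if it looks valid, otherwise raise ValueError."""
    if len(name) > MAX_NAME_LEN or not _NAME_RE.match(name):
        raise ValueError(f"Invalid SageMaker resource name: {name!r}")
    return name


# Hypothetical account ID and group name, for illustration only.
example_group_arn = (
    "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-group"
)
print(model_package_arn(example_group_arn, "1"))
print(check_resource_name("huggingface-llm-qwen2-5-7b-instruct-rlvr-100000-endpoint"))
```

Checking the longest generated name (the endpoint name with a six-digit suffix) up front avoids a validation error later in the lab.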

---

## 2. Create the endpoint

We first create an **Endpoint Configuration** that specifies the instance type (`ml.g5.2xlarge` — a single NVIDIA A10G GPU) and routing strategy, then create the **Endpoint** itself.

> **⏱ Expected duration:** The endpoint takes a few minutes to reach `InService` status.

In [None]:
from sagemaker.core.resources import Endpoint, EndpointConfig
from sagemaker.core.shapes import ProductionVariant

print(f"Creating EndpointConfig: {endpoint_config_name}")
endpoint_config = EndpointConfig.create(
    endpoint_config_name=endpoint_config_name,
    execution_role_arn=role,
    production_variants=[
        ProductionVariant(
            variant_name="AllTraffic",
            instance_type="ml.g5.2xlarge",
            initial_instance_count=1,
            model_data_download_timeout_in_seconds=700,
            routing_config={"routing_strategy": "LEAST_OUTSTANDING_REQUESTS"}
        )
    ]
)

In [None]:
print(f"Creating Endpoint: {endpoint_name}")
endpoint = Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=endpoint_config_name
)
endpoint.wait_for_status("InService")
print(f"Endpoint {endpoint_name} is InService")
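`wait_for_status` hides the polling loop. If you want an explicit timeout or a fast exit on failure, the same idea can be sketched with a plain function. This is a minimal sketch, not the SDK's implementation: `wait_until_in_service` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_until_in_service(describe, timeout_seconds=1800, poll_seconds=30,
                          sleep=time.sleep):
    """Poll `describe()` until it returns "InService".

    `describe` is any zero-argument callable that returns the current status
    string. Raises RuntimeError on "Failed" and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Resource entered the Failed state")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Offline demonstration with canned statuses instead of a real endpoint.
_statuses = iter(["Creating", "Creating", "InService"])
print(wait_until_in_service(lambda: next(_statuses), sleep=lambda s: None))
```

Against a real endpoint, the callable could be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`, using the boto3 client created in the prerequisites.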

---

## 3. Create the model and inference component

We register the fine-tuned model using the **DJL LMI** (Large Model Inference) container with **vLLM** as the serving backend. The environment variables configure:

- `OPTION_MAX_MODEL_LEN` — maximum sequence length (16K tokens)
- `OPTION_TENSOR_PARALLEL_DEGREE` — set to `max` to use all available GPUs
- `OPTION_ENTRYPOINT` — the vLLM async serving engine

Then we create an **Inference Component** that attaches the model to the endpoint with specific memory and accelerator requirements.

In [None]:
CONTAINER_VERSION = "0.36.0-lmi18.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_TRUST_REMOTE_CODE": "true",
}
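A quick back-of-the-envelope calculation helps judge whether `OPTION_MAX_MODEL_LEN=16384` is realistic on a single 24 GB A10G. The sketch below estimates the KV cache for one full-length sequence. The architecture values (28 layers, 4 key/value heads, head dimension 128, 16-bit cache) are assumptions about Qwen2.5-7B-Instruct; confirm them against the model's `config.json` before relying on the result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of `seq_len` tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


# Assumed architecture values (not taken from this notebook).
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
MAX_MODEL_LEN = 16384  # matches OPTION_MAX_MODEL_LEN above

per_sequence = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, MAX_MODEL_LEN)
print(f"KV cache for one full-length sequence: {per_sequence / 2**20:.0f} MiB")
# → KV cache for one full-length sequence: 896 MiB
```

Under these assumptions, one maximum-length request needs a little under 1 GiB of cache on top of the model weights (roughly 15 GB for about 7.6 billion parameters in 16-bit precision), which leaves headroom on the GPU for a few concurrent long requests.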

In [None]:
from sagemaker.core.resources import Model
from sagemaker.core.shapes import ContainerDefinition, ModelDataSource, S3ModelDataSource

fine_tuned_model = Model.create(
    model_name=model_name,
    primary_container=ContainerDefinition(
        image=inference_image,
        model_data_source=ModelDataSource(
            s3_data_source=S3ModelDataSource(
                s3_uri=merged_model_s3_uri,
                s3_data_type="S3Prefix",
                compression_type="None"
            )
        ),
        environment=lmi_env
    ),
    execution_role_arn=role
)

pprint(fine_tuned_model)

In [None]:
from sagemaker.core.resources import InferenceComponent
from sagemaker.core.shapes import (
    InferenceComponentSpecification,
    InferenceComponentComputeResourceRequirements,
    InferenceComponentRuntimeConfig,
)

inference_component = InferenceComponent.create(
    inference_component_name=ic_name,
    endpoint_name=endpoint_name,
    variant_name="AllTraffic",
    specification=InferenceComponentSpecification(
        model_name=model_name,
        compute_resource_requirements=InferenceComponentComputeResourceRequirements(
            min_memory_required_in_mb=10240,
            number_of_accelerator_devices_required=1,
        )
    ),
    runtime_config=InferenceComponentRuntimeConfig(
        copy_count=1
    ),
    region=region
)

print(f"InferenceComponent created: {inference_component.inference_component_name}")
print(f"Endpoint ARN: {endpoint.endpoint_arn}")
inference_component.wait_for_status("InService")
print(f"InferenceComponent {ic_name} is InService")
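For readers who prefer the low-level API, the `InferenceComponent.create` call above corresponds to the boto3 `create_inference_component` operation. The sketch below only builds the request dictionary so that it can be inspected offline; `inference_component_request` is a hypothetical helper and the names passed to it are placeholders.

```python
def inference_component_request(ic_name, endpoint_name, model_name,
                                variant_name="AllTraffic",
                                min_memory_mb=10240, accelerators=1,
                                copy_count=1):
    """Build keyword arguments for boto3's create_inference_component."""
    return {
        "InferenceComponentName": ic_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                # Memory and accelerators reserved for each copy of the model.
                "MinMemoryRequiredInMb": min_memory_mb,
                "NumberOfAcceleratorDevicesRequired": accelerators,
            },
        },
        # Number of model copies to start on the endpoint.
        "RuntimeConfig": {"CopyCount": copy_count},
    }


# Placeholder names, for illustration only.
request = inference_component_request("demo-ic", "demo-endpoint", "demo-model")
print(request["Specification"]["ComputeResourceRequirements"])
```

With a real client, the request would be sent as `sm_client.create_inference_component(**request)`. Raising `copy_count` adds model copies, each of which needs its own accelerator and memory reservation on the endpoint's instances.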

---

## 4. Test the endpoint

Let's send a math question to the deployed model using streaming inference. The helper functions below handle the SageMaker Runtime API call and parse the streaming response tokens.

In [None]:
import json
import io


def execute_inference(prompt, endpoint_name, inference_component_name, stream=True):
    sm_rt_client = boto3.client("sagemaker-runtime")

    payload = {
        "inputs": f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {"max_new_tokens": 512, "temperature": 0.1, "top_p": 0.9},
    }

    if stream:
        payload["stream"] = True
        result = sm_rt_client.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            InferenceComponentName=inference_component_name,
            CustomAttributes='accept_eula=true',
            Body=json.dumps(payload),
            ContentType="application/json"
        )
        return result['Body']
    else:
        result = sm_rt_client.invoke_endpoint(
            EndpointName=endpoint_name,
            InferenceComponentName=inference_component_name,
            CustomAttributes='accept_eula=true',
            Body=json.dumps(payload),
            ContentType="application/json"
        )
        return result["Body"].read().decode("utf8")


class LineIterator:
    """Yield newline-delimited lines from a SageMaker response event stream."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                # Stream exhausted: return trailing bytes that lack a final
                # newline exactly once, then stop (avoids an infinite loop).
                if line:
                    self.read_pos += len(line)
                    return line
                raise
            if 'PayloadPart' not in chunk:
                continue
            # Append new bytes at the end; read_pos marks the unread part.
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])


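def print_stream(stream):
    for line in LineIterator(stream):
        if line == b'':
            continue
        try:
            resp = json.loads(line)
            print(resp["token"].get("text") or "", end='')
        except (ValueError, KeyError, AttributeError, TypeError):
            # Not a token record: print the raw line instead of failing.
            print(line)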
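The core difficulty the helpers above deal with is that a JSON line can be split across several `PayloadPart` events. A generator expresses the same buffering more compactly and can be exercised without an endpoint. This is a sketch of an alternative, not a replacement for the notebook's `LineIterator`; `iter_lines` is a hypothetical name, and the fake events mimic the `{"token": {"text": ...}}` shape that `print_stream` expects.

```python
import json


def iter_lines(event_stream):
    """Yield complete lines (without the newline) from PayloadPart events."""
    pending = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        pending += part["Bytes"]
        # Everything before the last newline is complete; keep the remainder.
        *lines, pending = pending.split(b"\n")
        yield from lines
    if pending:
        yield pending  # final line without a trailing newline


# Fake events: the first JSON record is split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"te'}},
    {"PayloadPart": {"Bytes": b'xt": "Hi"}}\n{"token"'}},
    {"Other": {}},
    {"PayloadPart": {"Bytes": b': {"text": "!"}}'}},
]

text = "".join(json.loads(line)["token"]["text"] for line in iter_lines(fake_events))
print(text)
# → Hi!
```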

In [None]:
prompt = "What is 25 * 4 + 10? Show your reasoning step by step."

stream = execute_inference(prompt, endpoint_name, ic_name, stream=True)
print_stream(stream)
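Because the question has a single numeric answer (25 * 4 + 10 = 110), the response can be checked automatically once the generated text has been collected into a string. The sketch below uses a simple heuristic that treats the last number in the text as the final answer. It is an illustrative stand-in, not the reward function used during RLVR training, and `last_number` and `is_correct` are hypothetical helpers.

```python
import re


def last_number(text):
    """Return the last number that appears in `text`, or None if there is none."""
    # Drop thousands separators so "1,000" is read as 1000.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(text, expected, tol=1e-6):
    """True if the last number in `text` matches `expected` within `tol`."""
    value = last_number(text)
    return value is not None and abs(value - expected) <= tol


# Hand-written sample responses, not real model output.
print(is_correct("First, 25 * 4 = 100. Then 100 + 10 = 110.", 110))
print(is_correct("The answer is 100.", 110))
```

A last-number heuristic is brittle when the model keeps writing after its answer, so stricter checks usually require a fixed answer format in the prompt.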

---

## 5. Clean up resources

Uncomment and run the cell below to delete the resources and avoid ongoing charges. Delete them in reverse order of creation: inference component → model → endpoint → endpoint config. Inference component deletion is asynchronous, so wait for it to finish before deleting the endpoint.

> **⚠️ Important:** Remember to clean up after completing the workshop!

In [None]:
# inference_component.delete()
# fine_tuned_model.delete()
# endpoint.delete()
# endpoint_config.delete()
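If the endpoint deletion fails because the inference component is still being removed, an explicit wait between the steps helps. The sketch below uses the boto3 SageMaker client rather than the resource objects above. It is a minimal sketch under two assumptions: that `describe_inference_component` raises a `ClientError` once the component no longer exists, and that the polling interval and timeout shown are acceptable. `delete_in_order` is a hypothetical helper.

```python
import time


def delete_in_order(sm_client, ic_name, model_name, endpoint_name,
                    endpoint_config_name, poll_seconds=15,
                    timeout_seconds=900, sleep=time.sleep):
    """Delete the lab's resources in reverse order of creation."""
    sm_client.delete_inference_component(InferenceComponentName=ic_name)

    # Poll until describe fails, which this sketch treats as "deleted".
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            sm_client.describe_inference_component(InferenceComponentName=ic_name)
        except sm_client.exceptions.ClientError:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{ic_name} still exists after {timeout_seconds}s")
        sleep(poll_seconds)

    sm_client.delete_model(ModelName=model_name)
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
```

In the notebook this could be called as `delete_in_order(sm_client, ic_name, model_name, endpoint_name, endpoint_config_name)`, reusing the names generated earlier.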