# Reinforcement Learning from Verifiable Rewards (RLVR) with SageMaker

## Lab 2 — Fine-Tuning with RLVRTrainer

In this lab, you will launch a **serverless RLVR training job** using the SageMaker `RLVRTrainer` to fine-tune **Qwen 2.5 — 7B Instruct** on the math dataset you prepared in Lab 1.

### What you'll do in this notebook

1. Set up your SageMaker session and retrieve the datasets from Lab 1
2. Create a Model Package Group to store the fine-tuned model
3. Configure the `RLVRTrainer` with model, data, and hyperparameters
4. Submit the serverless training job and monitor progress

### How serverless RLVR training works

With SageMaker serverless customization, you don't need to provision or manage any compute infrastructure. SageMaker automatically:
- Selects the right instance types and cluster size for your model
- Runs the RLVR training loop (policy rollouts → reward verification → policy update)
- Logs metrics to MLflow for experiment tracking
- Saves the fine-tuned model to S3 and registers it in the Model Registry
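
The "reward verification" step in the loop above uses a programmatic check instead of a learned reward model. This lab does not show the reward function the serverless job runs, so the following is only a minimal sketch of the idea. It assumes, purely for illustration, that completions put their final answer in `\boxed{...}`; the function names are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verify_math_answer(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically when both sides parse as numbers ("0.5" vs "1/2").
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Otherwise fall back to an exact string comparison.
        return 1.0 if answer == reference.strip() else 0.0
```

Because the reward is computed by a deterministic check, it can be reproduced and audited, which is what makes it "verifiable".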

---

## 1. Prerequisites

### Set up the SageMaker session

We initialize the SageMaker session and resolve the IAM role, just like in Lab 1.

In [None]:
import boto3
import os
from sagemaker.core.helper.session_helper import Session, get_execution_role

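# Start with a default session so we can look up the default bucket.
sess = Session()
# Set this to a bucket name to use a bucket other than the session default.
sagemaker_session_bucket = None

if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = get_execution_role()
except ValueError:
    # The execution role could not be resolved (for example, when running
    # outside SageMaker), so look it up by name instead.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

s3_client = boto3.client("s3")
# Recreate the session so it is bound to the chosen bucket.
sess = Session(default_bucket=sagemaker_session_bucket)
sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix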

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Retrieve datasets and configure the job

We fetch the training and validation datasets that were registered in the SageMaker AI Registry during Lab 1. We also define the base model ID, output path, and MLflow experiment name.
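
The configuration cell that follows derives the job name from the model ID. The `CreateTrainingJob` API limits job names to 63 characters drawn from letters, digits, and non-leading, non-trailing hyphens. The helper below is a hedged sketch of a stricter sanitizer; `to_job_name` and its `reserve` parameter are hypothetical. `reserve` leaves room for any suffix the SDK might append to `base_job_name`, and whether it appends one is an assumption.

```python
import re

# Constraints documented for TrainingJobName in the CreateTrainingJob API.
MAX_JOB_NAME_LENGTH = 63
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def to_job_name(model_id: str, prefix: str = "rlvr", reserve: int = 0) -> str:
    """Build a job name from a model ID, keeping `reserve` characters free."""
    if not 0 <= reserve < MAX_JOB_NAME_LENGTH:
        raise ValueError("reserve must be between 0 and 62")
    # Drop any "org/" prefix, then collapse every run of invalid characters to "-".
    short_id = model_id.split("/")[-1]
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", short_id).strip("-")
    if not cleaned:
        raise ValueError(f"no usable characters in model ID {model_id!r}")
    # Truncate to the allowed length and make sure the name does not end in "-".
    name = f"{prefix}-{cleaned}"[: MAX_JOB_NAME_LENGTH - reserve].rstrip("-")
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"cannot build a valid job name from {model_id!r}")
    return name
```

For the model ID used in this lab the result matches the simpler expression in the notebook; the extra checks only matter for IDs that are long or contain other punctuation.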

In [None]:
from sagemaker.ai_registry.dataset import DataSet

# Required config
base_model_id = "huggingface-llm-qwen2-5-7b-instruct"
training_dataset = DataSet.get(name="rlvr-train")
validation_dataset = DataSet.get(name="rlvr-val")

# Optional configs
# Derive the job name from the model ID: drop any "org/" prefix and replace dots with hyphens.
job_name = f"rlvr-{base_model_id.split('/')[-1].replace('.', '-')}"
mlflow_experiment_name = "qwen2-5-7b-rlvr"

if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{base_model_id}-rlvr"
else:
    output_path = f"s3://{bucket_name}/{base_model_id}-rlvr"

os.environ["SAGEMAKER_MLFLOW_CUSTOM_ENDPOINT"] = (
    f"https://mlflow.sagemaker.{sess.boto_region_name}.app.aws"
)

print(f"Base model:    {base_model_id}")
print(f"Training data: {training_dataset.arn}")
print(f"Val data:      {validation_dataset.arn}")
print(f"Output path:   {output_path}")

---

## 2. Create a Model Package Group

A **Model Package Group** is a container in the SageMaker Model Registry that holds versioned model artifacts. Each time you run a training job, the resulting model is saved as a new version in this group — making it easy to compare runs and roll back if needed.
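
Group names are unique within an account and Region, so creating a group whose name is already taken is likely to fail, for example when you re-run this notebook. The sketch below shows one way to make the step idempotent with the boto3 SageMaker client (`list_model_package_groups` and `create_model_package_group`). The helper names are hypothetical, and the functions return the group ARN, not the `ModelPackageGroup` object that is passed to the trainer.

```python
from typing import Any, Optional


def find_model_package_group_arn(sm_client: Any, name: str) -> Optional[str]:
    """Return the ARN of the group named exactly `name`, or None if absent."""
    kwargs = {"NameContains": name}
    while True:
        page = sm_client.list_model_package_groups(**kwargs)
        for summary in page.get("ModelPackageGroupSummaryList", []):
            # NameContains is a substring filter, so check for an exact match.
            if summary["ModelPackageGroupName"] == name:
                return summary["ModelPackageGroupArn"]
        token = page.get("NextToken")
        if not token:
            return None
        kwargs["NextToken"] = token


def ensure_model_package_group(sm_client: Any, name: str, description: str = "") -> str:
    """Create the group only if it does not exist yet; return its ARN."""
    arn = find_model_package_group_arn(sm_client, name)
    if arn is not None:
        return arn
    request = {"ModelPackageGroupName": name}
    if description:
        request["ModelPackageGroupDescription"] = description
    return sm_client.create_model_package_group(**request)["ModelPackageGroupArn"]
```

With the `sm_client` created in the prerequisites, a call such as `ensure_model_package_group(sm_client, "my-group-name")` could be used to check for an existing group before creating one.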

In [None]:
from sagemaker.core.resources import ModelPackageGroup

model_package_group_name = f"{base_model_id}-rlvr"

model_package_group = ModelPackageGroup.create(
    model_package_group_name=model_package_group_name,
    model_package_group_description='Store models from SageMaker serverless RLVR customization'
)

print(f"Model Package Group: {model_package_group_name}")

---

## 3. Configure the RLVR Trainer

The `RLVRTrainer` is the high-level API for serverless RLVR training. Its main parameters are:

| Parameter | Description |
|---|---|
| `model` | Base model ID from the SageMaker Hub |
| `training_dataset` | Registered training dataset from the AI Registry |
| `validation_dataset` | Registered validation dataset from the AI Registry |
| `model_package_group` | Where to store the fine-tuned model versions |
| `s3_output_path` | S3 location for model artifacts |
| `mlflow_experiment_name` | MLflow experiment name for metric tracking |

> **Note:** Gated models require `accept_eula=True`, which confirms that you accept the model's end-user license agreement.

In [None]:
from sagemaker.train.rlvr_trainer import RLVRTrainer

trainer = RLVRTrainer(
    model=base_model_id,
    model_package_group=model_package_group,
    training_dataset=training_dataset,
    validation_dataset=validation_dataset,
    s3_output_path=output_path,
    mlflow_experiment_name=mlflow_experiment_name,
    base_job_name=job_name,
    sagemaker_session=sess,
    accept_eula=True,
    role=role
)

### View hyperparameters

The trainer comes with sensible defaults. Inspect them first, then override any values you want to change.
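
When you do change values, it helps to record exactly what differs from the defaults. The helper below is a small sketch that compares two dictionaries such as those returned by `trainer.hyperparameters.to_dict()`. It assumes `to_dict()` returns a flat dictionary; the function name and the hyperparameter names in the docstring are hypothetical.

```python
from typing import Any, Dict, Tuple


def changed_hyperparameters(
    defaults: Dict[str, Any], current: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed name to a (default, current) pair.

    Example with made-up names:
        defaults = {"learning_rate": 1e-5, "epochs": 1}
        current  = {"learning_rate": 5e-6, "epochs": 1}
    yields {"learning_rate": (1e-5, 5e-6)}.
    A name missing on one side is reported with None in that position.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for name in sorted(set(defaults) | set(current)):
        before, after = defaults.get(name), current.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

To use it, take a snapshot with `dict(trainer.hyperparameters.to_dict())` before making changes, then compare it with a fresh `to_dict()` call afterwards.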

In [None]:
from rich.pretty import pprint

print("Default fine-tuning options:")
pprint(trainer.hyperparameters.to_dict())

---

## 4. Submit the training job

Calling `trainer.train(wait=True)` submits the job and blocks until it completes. SageMaker will:

1. Provision the compute cluster automatically
2. Download the base model and datasets
3. Run the RLVR training loop for the configured number of epochs
4. Upload the fine-tuned model artifacts to S3
5. Register the model in the Model Package Group

> **⏱ Expected duration:** 15-20 minutes. You can monitor progress in SageMaker Studio under **Jobs → Training**, or check the MLflow experiment for live metrics.
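
If you prefer not to block the notebook, you can poll the job status yourself. The sketch below uses the boto3 `describe_training_job` call and assumes the serverless job is visible through that API, as its listing under **Jobs → Training** suggests. The function name, poll interval, and timeout are illustrative choices, and submitting with `trainer.train(wait=False)` is assumed to return without waiting.

```python
import time
from typing import Any, Callable

# Statuses after which a training job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    sm_client: Any,
    job_name: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a training job until it reaches a terminal status and return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

Once you have the job name, `wait_for_training_job(sm_client, job_name)` returns `"Completed"`, `"Failed"`, or `"Stopped"`, so the caller can decide how to react to each outcome.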

In [None]:
from rich.pretty import pprint

training_job = trainer.train(wait=True)

TRAINING_JOB_NAME = training_job.training_job_name

pprint(training_job)

---

## Next steps

Your RLVR-tuned model is now registered in the SageMaker Model Registry. In the next notebook, you'll evaluate it against the base model to measure the improvement on math reasoning tasks.