# Reinforcement Learning from Verifiable Rewards (RLVR) with SageMaker

## Lab 1 — Data Preparation

In this lab, you will prepare a dataset for **Reinforcement Learning from Verifiable Rewards (RLVR)** training on the **Qwen 2.5 — 7B Instruct** model.

### What is RLVR?

RLVR is a technique for fine-tuning large language models by automatically rewarding correct solutions to tasks that have **objectively verifiable answers** — such as math problems or coding challenges. Unlike RLHF (Reinforcement Learning from Human Feedback), RLVR doesn't require human annotators; instead, a rule-based reward function checks whether the model's output matches the known correct answer.
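
As a rough illustration of such a rule-based check, the sketch below scores a completion by comparing the text after the last `####` with the known answer. The name `rule_based_reward` and the 0.0/1.0 scoring are assumptions made for this example; the reward function that SageMaker applies during training may differ.

```python
import re


def rule_based_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the answer after the last '####' matches ground_truth, else 0.0.

    Hypothetical helper for illustration only.
    """
    # Use the last "####" so earlier mentions do not count as the final answer.
    matches = re.findall(r"####\s*([^\n]+)", model_output)
    if not matches:
        return 0.0
    # Normalize both sides the same way: trim whitespace, drop thousands separators.
    predicted = matches[-1].strip().replace(",", "")
    expected = ground_truth.strip().replace(",", "")
    return 1.0 if predicted == expected else 0.0


print(rule_based_reward("48 / 2 = 24, and 48 + 24 = 72.\n#### 72", "72"))  # 1.0
print(rule_based_reward("The answer is probably 70.", "72"))  # 0.0
```

Because the check is a plain string comparison, no human labeling is involved: any completion can be scored as soon as the ground truth is known.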

### What you'll do in this notebook

1. Install dependencies and set up your SageMaker session
2. Load the **GSM8K** math dataset from Hugging Face
3. Transform each sample into the RLVR training format
4. Split the data into train / validation / test sets
5. Upload the prepared data to Amazon S3
6. Register the datasets in the SageMaker AI Registry

Let's get started!

---

## 1. Prerequisites

First, install the required Python packages. We need:

- `sagemaker` — the SageMaker Python SDK for session management and registry access
- `datasets` — Hugging Face library for loading and manipulating datasets
- `huggingface_hub` — for authenticating with the Hugging Face Hub
- `fsspec` — filesystem abstraction used by `datasets` under the hood

In [None]:
# scikit-learn is also installed because section 4 uses it to split the data
%pip install --upgrade sagemaker fsspec datasets huggingface_hub scikit-learn

### Set up the SageMaker session

Next, we initialize a SageMaker session and resolve the IAM execution role. The session gives us a default S3 bucket where we'll store the prepared dataset, and the role is used by SageMaker to access AWS resources on our behalf.

In [None]:
import boto3
from sagemaker.core.helper.session_helper import Session, get_execution_role

# Set this to a bucket name to override the session default; None keeps the default bucket.
sagemaker_session_bucket = None

sess = Session()
if sagemaker_session_bucket is None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    # Resolve the execution role attached to the current SageMaker environment.
    role = get_execution_role()
except ValueError:
    # Fallback when no role can be resolved: look one up by name.
    # A role named "sagemaker_execution_role" must already exist in the account.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

s3_client = boto3.client("s3")
sess = Session(default_bucket=sagemaker_session_bucket)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket_name}")
print(f"sagemaker session region: {sess.boto_region_name}")

---

## 2. Load the GSM8K dataset

[GSM8K](https://huggingface.co/datasets/openai/gsm8k) (Grade School Math 8K) is a dataset of roughly 8,500 grade-school math word problems created by OpenAI, with 7,473 problems in the `train` split and 1,319 in the `test` split. Each sample contains a natural-language **question** and a step-by-step **answer** that ends with the final numerical result after a `####` delimiter.

For this workshop we take a subset of **1,000 samples** to keep training fast. We stream the dataset and shuffle it to get a random sample.

In [None]:
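from datasets import Dataset, load_dataset
import pandas as pd

stream = (
    load_dataset(
        "openai/gsm8k",
        "main",
        split="train",
        streaming=True,
    )
    # Shuffle before taking the subset. The buffer is larger than the
    # 7,473-row train split, so rows are drawn from the whole split.
    .shuffle(seed=42, buffer_size=10_000)
    .take(1000)
)

# Materialize the streamed subset as a regular in-memory Dataset.
dataset = Dataset.from_generator(lambda: stream, features=stream.features)

print(f"Dataset size: {len(dataset)} samples")
pd.DataFrame(dataset).head()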
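
On a streamed dataset, a shuffle can only mix the rows currently held in a fixed-size buffer, so the order of `shuffle` and `take` changes which rows can end up in the subset. The sketch below is a simplified pure-Python model of a buffered shuffle, not the `datasets` implementation; `buffered_shuffle` is a hypothetical name used only here.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order randomized within a sliding buffer (simplified model)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and store the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


rows = range(100)

# take-then-shuffle: only the first 10 rows can ever be selected.
take_then_shuffle = list(buffered_shuffle(islice(rows, 10), buffer_size=20))

# shuffle-then-take: later rows can be selected, but only those that have
# entered the buffer by the time 10 rows have been emitted.
shuffle_then_take = list(islice(buffered_shuffle(rows, buffer_size=20), 10))

print(sorted(take_then_shuffle))
print(sorted(shuffle_then_take))
```

In this model, a buffer at least as large as the dataset never overflows, so shuffling first and then taking rows behaves like sampling from a full shuffle.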

---

## 3. Transform to RLVR format

SageMaker's RLVR training expects each sample to be a JSON object with the following fields:

| Field | Description |
|---|---|
| `prompt` | A chat-formatted list of messages (role + content) containing the question |
| `reward_model` | The ground-truth answer and verification style (`"rule"` for exact-match checking) |
| `data_source` | Identifier for the source dataset |
| `ability` | The skill being tested (e.g., `"math"`) |
| `extra_info` | Optional metadata (original question, answer, index, split) |

The helper functions below:

1. Extract the final numerical answer that follows the `####` delimiter in the GSM8K answer
2. Wrap the question in a chat prompt that instructs the model to think step-by-step
3. Package everything into the required schema

In [None]:
import re


def extract_answer(answer_text):
    """Return the final answer after the #### delimiter, without thousands separators.

    Returns an empty string if the delimiter is missing.
    """
    match = re.search(r"####\s*(.+)", answer_text)
    return match.group(1).strip().replace(",", "") if match else ""


def prepare_rlvr_sample(sample, index):
    """Yield the RLVR-format record for a single GSM8K sample."""
    ground_truth = extract_answer(sample["answer"])
    yield {
        "data_source": "openai/gsm8k",
        "prompt": [
            {
                "content": f"{sample['question']} Let's think step by step and output the final answer after \"####\".",
                "role": "user",
            }
        ],
        "ability": "math",
        "reward_model": {"ground_truth": ground_truth, "style": "rule"},
        "extra_info": {
            "answer": sample["answer"],
            "index": index,
            "question": sample["question"],
            # Source split in GSM8K, not the train/validation/test split assigned in section 4.
            "split": "train",
        },
    }

Now let's apply the transformation to every sample in the dataset.

In [None]:
from tqdm import tqdm

records = []
for idx, row in tqdm(enumerate(dataset), total=len(dataset), desc="Converting to RLVR format"):
    # prepare_rlvr_sample is a generator, so extend() collects everything it yields.
    records.extend(prepare_rlvr_sample(row, idx))

print(f"Total RLVR samples: {len(records)}")

Let's inspect a random sample to verify the format looks correct.

In [None]:
import json
from random import randint

print(json.dumps(records[randint(0, len(records) - 1)], indent=2))
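
A quick structural check can complement the visual inspection. The sketch below tests a record against the fields listed in the table in section 3; `validate_rlvr_record` and `REQUIRED_FIELDS` are hypothetical names, and the validation that SageMaker itself performs is not specified here.

```python
REQUIRED_FIELDS = ("prompt", "reward_model", "data_source", "ability")


def validate_rlvr_record(record: dict) -> list:
    """Return a list of problems found in one RLVR record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    # Every chat message needs both a role and its content.
    for message in record.get("prompt", []):
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            problems.append("each prompt message needs 'role' and 'content'")
            break

    # An empty ground truth usually means the '####' delimiter was not found.
    if "reward_model" in record:
        reward_model = record["reward_model"]
        if not isinstance(reward_model, dict) or not reward_model.get("ground_truth"):
            problems.append("reward_model.ground_truth is empty")

    return problems


sample_record = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "What is 2 + 3?"}],
    "ability": "math",
    "reward_model": {"ground_truth": "5", "style": "rule"},
}
print(validate_rlvr_record(sample_record))  # []
```

In the notebook, the same function could be applied to every entry in `records` to flag samples whose answer could not be extracted.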

---

## 4. Split the dataset

We split the data into three sets:

- **Train (72%)** — used by the RLVR training loop
- **Validation (20%)** — used to monitor training progress and detect overfitting
- **Test (8%)** — held out for final evaluation after training completes

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% for validation, then 10% of the remaining 80% (8% overall) for test.
train_data, val_data = train_test_split(records, test_size=0.2, random_state=42)
train_data, test_data = train_test_split(train_data, test_size=0.1, random_state=42)

print(f"Train samples:      {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples:       {len(test_data)}")
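
The 72% / 20% / 8% figures follow from applying the two splits in sequence: the second `test_size` is a fraction of what remains after the first split, not of the full dataset. The sketch below reproduces that arithmetic. `split_sizes` is a hypothetical helper, and it assumes a fractional held-out size is rounded up, which makes no difference for 1,000 samples.

```python
import math


def split_sizes(n_samples, val_fraction=0.2, test_fraction=0.1):
    """Return (train, validation, test) sizes for two sequential splits."""
    n_val = math.ceil(n_samples * val_fraction)
    remaining = n_samples - n_val
    # The test fraction applies to the remainder, not to the full dataset.
    n_test = math.ceil(remaining * test_fraction)
    n_train = remaining - n_test
    return n_train, n_val, n_test


print(split_sizes(1000))  # (720, 200, 80)
```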

---

## 5. Upload to Amazon S3

We write each split to a local JSONL file (one JSON object per line), then upload them to the SageMaker default S3 bucket. The files are organized under a `datasets/serverless-model-customization-rlvr/` prefix with separate folders for each split.

In [None]:
import os
import shutil

if default_prefix:
    input_path = f"{default_prefix}/datasets/serverless-model-customization-rlvr"
else:
    input_path = "datasets/serverless-model-customization-rlvr"

train_s3_path = f"s3://{bucket_name}/{input_path}/train/train_rlvr.jsonl"
val_s3_path = f"s3://{bucket_name}/{input_path}/val/val_rlvr.jsonl"
test_s3_path = f"s3://{bucket_name}/{input_path}/test/test_rlvr.jsonl"

print(f"Train path: {train_s3_path}")
print(f"Val path:   {val_s3_path}")
print(f"Test path:  {test_s3_path}")

In [None]:
os.makedirs("./data", exist_ok=True)

for name, split in [("train", train_data), ("val", val_data), ("test", test_data)]:
    local_path = f"./data/{name}_rlvr.jsonl"
    # Write one JSON object per line (JSONL).
    with open(local_path, "w", encoding="utf-8") as f:
        for item in split:
            f.write(json.dumps(item) + "\n")
    # The key matches the S3 paths printed above: <input_path>/<split>/<split>_rlvr.jsonl
    s3_client.upload_file(local_path, bucket_name, f"{input_path}/{name}/{name}_rlvr.jsonl")

# Remove the local copies once they are in S3.
shutil.rmtree("./data")

print("Upload complete!")
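
One way to sanity-check the JSONL format is a local round trip: write a few records, read them back, and compare. The sketch below does this in a temporary directory; `write_jsonl` and `read_jsonl` are hypothetical helpers, not part of any SDK, and the two example records are made up for the illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse every non-empty line as a JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


example_records = [
    {
        "prompt": [{"role": "user", "content": "What is 2 + 3?"}],
        "reward_model": {"ground_truth": "5", "style": "rule"},
    },
    {
        "prompt": [{"role": "user", "content": "What is 10 - 4?"}],
        "reward_model": {"ground_truth": "6", "style": "rule"},
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_rlvr.jsonl")
    write_jsonl(path, example_records)
    round_trip = read_jsonl(path)

print(round_trip == example_records)  # True
```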

---

## 6. Register datasets in SageMaker AI Registry

Finally, we register each split as a **DataSet** resource in the SageMaker AI Registry. This makes the datasets discoverable and reusable across notebooks and training jobs. The ARNs printed below will be used in the next lab to kick off the RLVR training job.

In [None]:
from sagemaker.ai_registry.dataset import DataSet

dataset_train = DataSet.create(
    name="rlvr-train",
    source=train_s3_path,
    wait=True,
)
print(f"TRAINING_DATASET ARN: {dataset_train.arn}")

dataset_val = DataSet.create(
    name="rlvr-val",
    source=val_s3_path,
    wait=True,
)
print(f"VALIDATION_DATASET ARN: {dataset_val.arn}")

dataset_test = DataSet.create(
    name="rlvr-test",
    source=test_s3_path,
    wait=True,
)
print(f"TEST_DATASET ARN: {dataset_test.arn}")

---

## Next steps

Your data is now prepared and registered. In the next notebook, you'll use these dataset ARNs to launch an RLVR training job on SageMaker.