# Reinforcement Learning from Verifiable Rewards (RLVR) with SageMaker

## Lab 3 — Benchmark Evaluation

In this lab, you will evaluate the RLVR fine-tuned model against the **MATH benchmark** and compare it to the base model to measure the improvement in mathematical reasoning.

### Why benchmark evaluation?

RLVR trains models on tasks with objectively verifiable answers, so it makes sense to evaluate them the same way — using a standardized benchmark with known correct answers rather than subjective LLM-as-a-Judge scoring.
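To make "objectively verifiable" concrete, here is a minimal sketch of exact-match scoring against known answers. It is an illustration only, not how the SageMaker benchmark scores responses internally: the helpers `normalize_answer` and `exact_match_accuracy` are hypothetical names, and real MATH grading usually also handles equivalent LaTeX forms, which this sketch ignores.

```python
def normalize_answer(text: str) -> str:
    """Lowercase, trim, and drop spaces, commas, and trailing periods."""
    cleaned = text.strip().lower().replace(",", "").replace(" ", "")
    return cleaned.rstrip(".")


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Because each answer is either right or wrong, two runs over the same benchmark give directly comparable scores, which is harder to guarantee with a judge model.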

### What you'll do in this notebook

1. Retrieve the fine-tuned model from the Model Registry
2. Explore available benchmarks and create a `BenchMarkEvaluator`
3. Run the evaluation on both the fine-tuned and base models
4. Compare the results

---

## 1. Prerequisites

### Set up the SageMaker session

In [None]:
import boto3
from sagemaker.core.helper.session_helper import Session, get_execution_role

# Set this to a bucket name to override the session's default bucket.
sagemaker_session_bucket = None

sess = Session()
if sagemaker_session_bucket is None:
    sagemaker_session_bucket = sess.default_bucket()

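try:
    role = get_execution_role()
except ValueError:
    # get_execution_role could not resolve a role (typically when running
    # outside SageMaker), so look one up by name instead.
    # Change RoleName to a role that exists in your account.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]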

s3_client = boto3.client("s3")
sess = Session(default_bucket=sagemaker_session_bucket)
sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Retrieve the fine-tuned model

We look up the Model Package Group created in Lab 2 and build the ARN of its first model version (`model_package_version = "1"`). We also set the S3 output path for the evaluation artifacts.

In [None]:
from sagemaker.core.resources import ModelPackageGroup

base_model_id = "huggingface-llm-qwen2-5-7b-instruct"
model_package_group_name = f"{base_model_id}-rlvr"
model_package_version = "1"

model_package_group = ModelPackageGroup.get(model_package_group_name)

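# A model package ARN has the same shape as its group ARN, with
# "model-package-group" replaced by "model-package" and "/<version>" appended.
model_package_arn_prefix = model_package_group.model_package_group_arn.replace(
    "model-package-group", "model-package", 1
)
fine_tuned_model_package_arn = f"{model_package_arn_prefix}/{model_package_version}"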
print(f"Fine-tuned Model Package ARN: {fine_tuned_model_package_arn}")

if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{base_model_id}/evaluation"
else:
    output_path = f"s3://{bucket_name}/{base_model_id}/evaluation"

print(f"Evaluation output path: {output_path}")
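The ARN construction in the cell above is plain string manipulation, so it can be tried without any AWS calls. The sketch below wraps it in a hypothetical helper, `model_package_arn`, and runs it on a made-up ARN; the account ID and region are placeholders, not values from this lab.

```python
def model_package_arn(group_arn: str, version: str) -> str:
    """Derive a versioned model package ARN from its group ARN."""
    marker = ":model-package-group/"
    if marker not in group_arn:
        raise ValueError(f"not a model package group ARN: {group_arn}")
    # Swap the resource type once, then append the version number.
    return f"{group_arn.replace(marker, ':model-package/', 1)}/{version}"


# Placeholder account ID and region, for illustration only.
example_group_arn = (
    "arn:aws:sagemaker:us-east-1:111122223333:"
    "model-package-group/huggingface-llm-qwen2-5-7b-instruct-rlvr"
)
print(model_package_arn(example_group_arn, "1"))
# arn:aws:sagemaker:us-east-1:111122223333:model-package/huggingface-llm-qwen2-5-7b-instruct-rlvr/1
```

Matching on `:model-package-group/` rather than the bare substring is a small safety choice: it fails loudly if the input is not a group ARN instead of returning a malformed string.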

---

## 2. Explore available benchmarks

SageMaker provides several built-in benchmarks. Let's list them and inspect the properties of the **MATH** benchmark we'll use.

In [None]:
from sagemaker.train.evaluate import BenchMarkEvaluator, get_benchmarks, get_benchmark_properties
from rich.pretty import pprint

Benchmark = get_benchmarks()
pprint(list(Benchmark))

In [None]:
pprint(get_benchmark_properties(benchmark=Benchmark.MATH))

---

## 3. Create and run the benchmark evaluator

The `BenchMarkEvaluator` runs the selected benchmark against your model. Setting `evaluate_base_model=True` also evaluates the original base model so you can directly compare the two.

> **⏱ Expected duration:** 15–30 minutes when evaluating both models.

In [None]:
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MATH,
    model=fine_tuned_model_package_arn,
    model_package_group=model_package_group_name,
    s3_output_path=output_path,
    evaluate_base_model=True,
)

pprint(evaluator)

In [None]:
execution = evaluator.evaluate()
execution.wait()

---

## 4. View evaluation results

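We take the first benchmark evaluation returned by `EvaluationPipelineExecution.get_all` whose status is `Succeeded`, which this lab treats as the latest one, and display its results. The output shows the MATH benchmark scores for the base model and the RLVR fine-tuned model side by side.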

In [None]:
from sagemaker.train.evaluate import EvaluationPipelineExecution
from sagemaker.train.evaluate.constants import EvalType

latest_succeeded = next(
    (
        e
        for e in EvaluationPipelineExecution.get_all(eval_type=EvalType.BENCHMARK)
        if e.status.overall_status == "Succeeded"
    ),
    None,
)

pprint(latest_succeeded)
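The `next(...)` expression above returns the first succeeded execution in iteration order, so it is the latest one only if executions are listed newest first. If you prefer not to rely on that ordering, one option is to pick the maximum by start time. The sketch below shows the idea on a hypothetical `ExecutionRecord` stand-in; its fields are assumptions for illustration and do not correspond to the SDK's attributes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class ExecutionRecord:
    """Hypothetical stand-in for an evaluation execution."""
    name: str
    status: str
    started_at: datetime


def latest_succeeded_record(
    records: Iterable[ExecutionRecord],
) -> Optional[ExecutionRecord]:
    """Return the most recently started succeeded record, or None."""
    succeeded = [r for r in records if r.status == "Succeeded"]
    return max(succeeded, key=lambda r: r.started_at, default=None)


# Made-up records, deliberately not sorted by time.
records = [
    ExecutionRecord("run-a", "Succeeded", datetime(2025, 1, 1, 9, 0)),
    ExecutionRecord("run-c", "Failed", datetime(2025, 1, 3, 9, 0)),
    ExecutionRecord("run-b", "Succeeded", datetime(2025, 1, 2, 9, 0)),
]
print(latest_succeeded_record(records).name)  # run-b
```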

In [None]:
if latest_succeeded:
    latest_succeeded.show_results()
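Once you have read the two scores from the results, the comparison is simple arithmetic. Here is a minimal sketch, assuming each score is a fraction between 0 and 1; `compare_scores` is a hypothetical helper, and the numbers passed to it are placeholders, not results from this lab.

```python
def compare_scores(base_score: float, tuned_score: float) -> dict[str, float]:
    """Absolute and relative change from the base to the fine-tuned score."""
    if base_score <= 0:
        raise ValueError("base_score must be positive to compute a relative change")
    absolute = tuned_score - base_score
    return {"absolute": absolute, "relative_pct": 100.0 * absolute / base_score}


# Placeholder scores for illustration; substitute the values you observe.
delta = compare_scores(base_score=0.5, tuned_score=0.625)
print(f"absolute: {delta['absolute']:+.3f}, relative: {delta['relative_pct']:+.1f}%")
# absolute: +0.125, relative: +25.0%
```

Reporting both numbers is useful because the same absolute gain is a larger relative change for a weaker base model.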

---

## Summary

You've completed the RLVR lab! Here's what you accomplished:

1. **Lab 1** — Prepared the GSM8K dataset in RLVR format and registered it in the AI Registry
2. **Lab 2** — Launched a serverless RLVR training job with the `RLVRTrainer`
3. **Lab 3** — Evaluated the fine-tuned model against the MATH benchmark and compared it to the base model

The benchmark scores show whether, and by how much, RLVR training improved the model's mathematical reasoning, without any human feedback or a separate reward model.