# SageMaker Benchmark Evaluation - Basic Usage

This notebook demonstrates the basic user-facing flow for creating and managing benchmark evaluation jobs using the BenchmarkEvaluator with Jinja2 template-based pipeline generation.

## Step 1: Discover Available Benchmarks

List the available benchmarks and inspect the properties of each one. The supported options are documented at:
https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html

In [2]:
from sagemaker.train.evaluate import get_benchmarks, get_benchmark_properties
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

# Get available benchmarks
Benchmark = get_benchmarks()
pprint(list(Benchmark))

# Print properties for a specific benchmark
pprint(get_benchmark_properties(benchmark=Benchmark.GEN_QA))
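
Because `list(Benchmark)` and `Benchmark.GEN_QA` both work, the returned object appears to behave like a Python `Enum`. The sketch below assumes that and uses a stand-in enum to show one way to resolve a benchmark from a user-supplied string. `StandInBenchmark`, its member values, and `benchmark_from_name` are hypothetical and not part of the SDK.

```python
from enum import Enum


class StandInBenchmark(Enum):
    """Stand-in for the object returned by get_benchmarks(); values are made up."""
    GEN_QA = "gen_qa"
    MMLU = "mmlu"


def benchmark_from_name(enum_cls, name):
    """Look up an enum member by name, case-insensitively, with a readable error."""
    key = name.strip().upper()
    try:
        return enum_cls[key]
    except KeyError:
        valid = ", ".join(member.name for member in enum_cls)
        raise ValueError(f"Unknown benchmark {name!r}; expected one of: {valid}") from None


print(benchmark_from_name(StandInBenchmark, "gen_qa"))
```

If the real object is an enum, the same helper should work when passed `Benchmark` instead of the stand-in.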

## Step 2: Create BenchmarkEvaluator

Create a BenchmarkEvaluator instance with the desired benchmark. The evaluator will use Jinja2 templates to render a complete pipeline definition.

**Required Parameters:**
- `benchmark`: Benchmark type from the Benchmark enum
- `model`: Model to evaluate, given as a model package ARN or a SageMaker hub model ID
- `s3_output_path`: S3 location for evaluation outputs
- `mlflow_resource_arn`: MLflow tracking server ARN for experiment tracking

**Optional Template Fields:**
These fields are used for template rendering. If not provided, defaults will be used:
- `model_package_group`: Model package group ARN
- `source_model_package`: Source model package ARN
- `dataset`: S3 URI of evaluation dataset
- `model_artifact`: ARN of model artifact for lineage tracking (auto-inferred from source_model_package)
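
A malformed ARN or S3 URI typically surfaces only once the pipeline is running, so it can help to sanity-check the inputs first. The sketch below is one possible pre-flight check. `check_evaluator_inputs` is a hypothetical helper, its argument names mirror the keyword arguments used in this notebook's code cells, and the regular expressions are deliberately loose approximations rather than the official ARN or bucket-name rules.

```python
import re

# Loose patterns for illustration only; real ARNs and bucket names have stricter rules.
ARN_PATTERN = re.compile(r"^arn:aws[\w-]*:sagemaker:[a-z0-9-]+:(\d{12}|aws):.+$")
S3_URI_PATTERN = re.compile(r"^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9](/.*)?$")


def check_evaluator_inputs(model, s3_output_path, mlflow_resource_arn, dataset=None):
    """Return a list of readable problems; an empty list means the inputs look plausible."""
    problems = []
    # `model` may be an ARN or a hub model ID, so only validate values that start with "arn:".
    if model.startswith("arn:") and not ARN_PATTERN.match(model):
        problems.append(f"model is not a well-formed SageMaker ARN: {model}")
    if not S3_URI_PATTERN.match(s3_output_path):
        problems.append(f"s3_output_path is not an S3 URI: {s3_output_path}")
    if not ARN_PATTERN.match(mlflow_resource_arn):
        problems.append(f"mlflow_resource_arn is not a SageMaker ARN: {mlflow_resource_arn}")
    if dataset is not None and not S3_URI_PATTERN.match(dataset):
        problems.append(f"dataset is not an S3 URI: {dataset}")
    return problems


print(check_evaluator_inputs(
    model="meta-textgeneration-llama-3-2-1b-instruct",
    s3_output_path="your-bucket/eval/",  # missing the s3:// scheme
    mlflow_resource_arn="not-an-arn",
))
```

Returning a list instead of raising on the first problem lets you report every issue in one pass.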

In [None]:
from sagemaker.train.evaluate import BenchMarkEvaluator

# Create evaluator with GEN_QA benchmark
# These values match our successfully tested configuration
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.GEN_QA,
    model="arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/28",
    s3_output_path="s3://mufi-test-serverless-smtj/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",
    dataset="s3://sagemaker-us-west-2-052150106756/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl",
    model_package_group="arn:aws:sagemaker:us-west-2:052150106756:model-package-group/example-name-aovqo",  # Optional; inferred from `model` when it is a model package
    base_eval_name="gen-qa-eval-demo",
    # Note: sagemaker_session is optional and is created automatically if not provided
    # Note: region is optional and is inferred from the SAGEMAKER_REGION or AWS_REGION environment variable
)

pprint(evaluator)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mufi/Library/Application Support/sagemaker/config.yaml


In [None]:
# # [Optional] BASE MODEL EVAL

# from sagemaker.train.evaluate import BenchMarkEvaluator

# # Create evaluator with GEN_QA benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
#     benchmark=Benchmark.GEN_QA,
#     model="meta-textgeneration-llama-3-2-1b-instruct",
#     s3_output_path="s3://mufi-test-serverless-smtj/eval/",
#     mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",
#     dataset="s3://sagemaker-us-west-2-052150106756/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl",
#     # model_package_group="arn:aws:sagemaker:us-west-2:052150106756:model-package-group/example-name-aovqo", # Optional inferred from model if model package
#     base_eval_name="gen-qa-eval-demo",
#     # Note: sagemaker_session is optional and will be auto-created if not provided
#     # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )

# pprint(evaluator)

In [None]:
# # [Optional] Nova testing IAD Prod

# from sagemaker.train.evaluate import BenchMarkEvaluator

# # Create evaluator with GEN_QA benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
#     benchmark=Benchmark.GEN_QA,
#     # model="arn:aws:sagemaker:us-east-1:052150106756:model-package/bgrv-nova-micro-sft-lora/1",
#     model="arn:aws:sagemaker:us-east-1:052150106756:model-package/test-nova-finetuned-models/3",
#     s3_output_path="s3://mufi-test-serverless-iad/eval/",
#     mlflow_resource_arn="arn:aws:sagemaker:us-east-1:052150106756:mlflow-tracking-server/mlflow-prod-server",
#     dataset="s3://sagemaker-us-east-1-052150106756/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl",
#     model_package_group="arn:aws:sagemaker:us-east-1:052150106756:model-package-group/test-nova-finetuned-models", # Optional inferred from model if model package
#     base_eval_name="gen-qa-eval-demo",
#     region="us-east-1",
#     # Note: sagemaker_session is optional and will be auto-created if not provided
#     # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )

# pprint(evaluator)

INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
INFO - sagemaker.modules.evaluate.base_evaluator - Model package group provided as ARN: arn:aws:sagemaker:us-east-1:052150106756:model-package-group/test-nova-finetuned-models


### Optionally update the hyperparameters

In [3]:
pprint(evaluator.hyperparameters.to_dict())

# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"

# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()


## Step 3: Run Evaluation

Start a benchmark evaluation job. The system will:
1. Build template context with all required parameters
2. Render the pipeline definition from `DETERMINISTIC_TEMPLATE` using Jinja2
3. Create or update the pipeline with the rendered definition
4. Start the pipeline execution with empty parameters (all values pre-substituted)

**What happens during execution:**
- CreateEvaluationAction: Sets up lineage tracking
- EvaluateBaseModel & EvaluateCustomModel: Run in parallel as serverless training jobs
- AssociateLineage: Links evaluation results to lineage tracking
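
The render-then-start flow above can be pictured with a small stand-in. The SDK renders `DETERMINISTIC_TEMPLATE` with Jinja2; this sketch swaps in the standard library's `string.Template` so it runs without extra dependencies. The template text, the context keys, and `render_pipeline_definition` are all hypothetical, since the real template is not shown in this notebook.

```python
import json
from string import Template

# Hypothetical, heavily simplified pipeline template. Not the SDK's DETERMINISTIC_TEMPLATE.
PIPELINE_TEMPLATE = Template("""
{
  "Steps": [
    {"Name": "CreateEvaluationAction", "Type": "Lineage"},
    {"Name": "EvaluateBaseModel", "Type": "Training", "Dataset": "$dataset", "Output": "$output"},
    {"Name": "EvaluateCustomModel", "Type": "Training", "Dataset": "$dataset", "Output": "$output"},
    {"Name": "AssociateLineage", "Type": "Lineage"}
  ]
}
""")


def render_pipeline_definition(context):
    """Substitute every placeholder up front; substitute() raises KeyError if a key is missing."""
    rendered = PIPELINE_TEMPLATE.substitute(context)
    # Parse the result to confirm the rendered text is valid JSON.
    return json.loads(rendered)


definition = render_pipeline_definition(
    {"dataset": "s3://your-bucket/data.jsonl", "output": "s3://your-bucket/output/"}
)
print([step["Name"] for step in definition["Steps"]])
```

Because every value is substituted at render time, the resulting definition has no runtime parameters left to fill, which is consistent with starting the execution with empty parameters.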

In [5]:
# Run evaluation with configured parameters
execution = evaluator.evaluate()
pprint(execution)

print(f"\nPipeline Execution ARN: {execution.arn}")
print(f"Initial Status: {execution.status.overall_status}")


Pipeline Execution ARN: arn:aws:sagemaker:us-west-2:052150106756:pipeline/SagemakerEvaluation-BenchmarkEvaluation-c344c91d-6f62-4907-85cc-7e6b29171c42/execution/95qr3e96dblb
Initial Status: Executing


### Alternative: Override Subtasks at Runtime

For benchmarks with subtask support, you can override subtasks when calling evaluate():

In [None]:
# Override subtasks at evaluation time
# (assumes `mmlu_evaluator` was created with a benchmark that supports subtasks)
# execution = mmlu_evaluator.evaluate(subtask="abstract_algebra")  # Single subtask
# execution = mmlu_evaluator.evaluate(subtask=["abstract_algebra", "anatomy"])  # Multiple subtasks
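
The `subtask` argument accepts either a single name or a list of names. How the SDK normalizes the two forms is not shown here. As a sketch, a hypothetical helper such as `normalize_subtasks` could turn both into one clean list before the call:

```python
def normalize_subtasks(subtask):
    """Accept None, a single name, or an iterable of names; return a de-duplicated list."""
    if subtask is None:
        return []
    if isinstance(subtask, str):
        subtask = [subtask]
    seen = []
    for name in subtask:
        cleaned = name.strip()
        if not cleaned:
            raise ValueError("subtask names must be non-empty")
        if cleaned not in seen:  # keep first-seen order while dropping repeats
            seen.append(cleaned)
    return seen


print(normalize_subtasks("abstract_algebra"))
print(normalize_subtasks(["abstract_algebra", "anatomy", "anatomy"]))
```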

## Step 4: Monitor Execution

Check the job status and refresh as needed:

In [5]:
# Refresh status
execution.refresh()

# Display job status with step details
pprint(execution.status)

# Display individual step statuses
if execution.status.step_details:
    print("\nStep Details:")
    for step in execution.status.step_details:
        print(f"  {step.name}: {step.status}")


Step Details:
  EvaluateCustomModel: Executing
  EvaluateBaseModel: Executing
  CreateEvaluationAction: Succeeded
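
For pipelines with many steps, a compact summary can be easier to scan than the full list. The sketch below assumes only that each step detail exposes `name` and `status`, as in the loop above. `StepStub` and `summarize_steps` are hypothetical, and the `"Failed"` status string is assumed to follow the usual SageMaker Pipelines step statuses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StepStub:
    """Stand-in for a step detail object exposing `name` and `status`."""
    name: str
    status: str


def summarize_steps(step_details):
    """Count steps per status and list the names of any failed steps."""
    counts = Counter(step.status for step in step_details)
    failed = [step.name for step in step_details if step.status == "Failed"]
    return dict(counts), failed


steps = [
    StepStub("EvaluateCustomModel", "Executing"),
    StepStub("EvaluateBaseModel", "Executing"),
    StepStub("CreateEvaluationAction", "Succeeded"),
]
print(summarize_steps(steps))
```

In the notebook you would pass `execution.status.step_details` instead of the stub list.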


## Step 5: Wait for Completion

Wait for the pipeline to complete. This provides rich progress updates in Jupyter notebooks:

In [9]:
# Wait for job completion with progress updates
# This will show a rich progress display in Jupyter
execution.wait(target_status="Succeeded", poll=5, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")


Final Status: Succeeded
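
The `poll` and `timeout` arguments suggest a standard polling loop. The sketch below shows the general pattern under that assumption. It is not the SDK's implementation. `wait_for_status` and its `terminal` tuple are hypothetical, and the status sequence is simulated so the example runs instantly.

```python
import time


def wait_for_status(get_status, target_status="Succeeded", poll=5, timeout=3600,
                    terminal=("Succeeded", "Failed", "Stopped"), sleep=time.sleep):
    """Poll get_status() until it returns target_status.

    Raises RuntimeError if a different terminal status is reached and
    TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target_status:
            return status
        if status in terminal:
            raise RuntimeError(f"Execution ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(poll)


# Simulated status sequence so the example does not call any AWS API.
statuses = iter(["Executing", "Executing", "Succeeded"])
print(wait_for_status(lambda: next(statuses), poll=0))
```

Checking for other terminal states avoids waiting out the full timeout when the execution has already failed or been stopped.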


## Step 6: View Results

Evaluation results are written to S3 and can be displayed as a formatted table with `show_results()`.

**Output Structure:**

Results are stored under the configured S3 output location:

```
s3://your-bucket/output/
└── job_name/
    └── output/
        └── output.tar.gz
```

Extract `output.tar.gz` to reveal:

```
run_name/
├── eval_results/
│   ├── results_[timestamp].json
│   ├── inference_output.jsonl (for gen_qa)
│   └── details/
│       └── model/
│           └── <execution-date-time>/
│               └── details_<task_name>_#_<datetime>.parquet
└── tensorboard_results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```

In [10]:
pprint(execution.s3_output_path)
# Display results in a formatted table
execution.show_results()

## Step 7: Retrieve an Existing Job

You can retrieve and inspect any existing evaluation job:

In [None]:
from sagemaker.train.evaluate import EvaluationPipelineExecution
from rich.pretty import pprint


# Get an existing job by ARN
# Replace with your actual pipeline execution ARN
existing_arn = "arn:aws:sagemaker:us-west-2:052150106756:pipeline/SagemakerEvaluation-BenchmarkEvaluation-c344c91d-6f62-4907-85cc-7e6b29171c42/execution/inlsexrd7jes"

# base model only example
# existing_arn = "arn:aws:sagemaker:us-west-2:052150106756:pipeline/SagemakerEvaluation-benchmark/execution/gdp9f4dbv2vi"
existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region="us-west-2"
)

pprint(existing_execution)
print(f"\nStatus: {existing_execution.status.overall_status}")

existing_execution.show_results()


Status: Executing
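
The execution ARN already encodes the region, so the `region` argument can be derived from it rather than hard-coded. The sketch below splits a pipeline execution ARN into its parts. `ExecutionArn` and `parse_execution_arn` are hypothetical helpers, not SDK functions.

```python
from typing import NamedTuple


class ExecutionArn(NamedTuple):
    region: str
    account_id: str
    pipeline_name: str
    execution_id: str


def parse_execution_arn(arn):
    """Split a pipeline execution ARN into region, account, pipeline name, and execution ID."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn}")
    # The resource part has the shape pipeline/<name>/execution/<id>.
    resource = parts[5].split("/")
    if len(resource) != 4 or resource[0] != "pipeline" or resource[2] != "execution":
        raise ValueError(f"Not a pipeline execution ARN: {arn}")
    return ExecutionArn(parts[3], parts[4], resource[1], resource[3])


parsed = parse_execution_arn(
    "arn:aws:sagemaker:us-west-2:052150106756:pipeline/"
    "SagemakerEvaluation-benchmark/execution/gdp9f4dbv2vi"
)
print(parsed.region, parsed.pipeline_name, parsed.execution_id)
```

With this in place, `region=parse_execution_arn(existing_arn).region` keeps the ARN and the region from drifting apart.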


In [None]:
# Run evaluation with configured parameters
execution = evaluator.evaluate()
pprint(execution)

print(f"\nPipeline Execution ARN: {execution.arn}")
print(f"Initial Status: {execution.status.overall_status}")

INFO - sagemaker.modules.evaluate.benchmark_evaluator - Getting or creating artifact for source: arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/28
INFO - sagemaker.modules.evaluate.base_evaluator - Searching for existing artifact for model package: arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/28
INFO - sagemaker.modules.evaluate.base_evaluator - Found existing artifact: arn:aws:sagemaker:us-west-2:052150106756:artifact/2b64ef9fe915b3138877d772ec489bef
INFO - sagemaker.modules.evaluate.benchmark_evaluator - Resolved model info - base_model_name: meta-textgeneration-llama-3-2-1b-instruct, base_model_arn: arn:aws:sagemaker:us-west-2:aws:hub-content/SageMakerPublicHub/Model/meta-textgeneration-llama-3-2-1b-instruct/1.10.0, source_model_package_arn: arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/28
INFO - sagemaker.modules.evaluate.benchmark_evaluator - Using configured hyperparamet

INFO - sagemaker_core.main.resources - Updating pipeline resource.
INFO - sagemaker.modules.evaluate.execution - Successfully updated pipeline: SagemakerEvaluation-benchmark
INFO - sagemaker.modules.evaluate.execution - Starting pipeline execution: gen-qa-eval-demo-1763843077
INFO - sagemaker.modules.evaluate.execution - Pipeline execution started: arn:aws:sagemaker:us-west-2:052150106756:pipeline/SagemakerEvaluation-benchmark/execution/gv93gtwgr7w8



Pipeline Execution ARN: arn:aws:sagemaker:us-west-2:052150106756:pipeline/SagemakerEvaluation-benchmark/execution/gv93gtwgr7w8
Initial Status: Executing


## Step 8: List All Benchmark Evaluations

Retrieve all benchmark evaluation executions:

In [None]:
# Get all benchmark evaluations (returns iterator)
all_executions_iter = BenchMarkEvaluator.get_all(region="us-west-2")
all_executions = list(all_executions_iter)

print(f"Found {len(all_executions)} evaluation(s)\n")
for past_execution in all_executions[:5]:  # Show first 5 (avoid shadowing the built-in `exec`)
    print(f"  {past_execution.name}: {past_execution.status.overall_status}")

Found 2 evaluation(s)

  95qr3e96dblb: Executing
  inlsexrd7jes: Executing
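
Since `get_all` returns an iterator, calling `list()` on it materializes every execution before any are shown. When only the first few are needed, `itertools.islice` stops early. The sketch below uses a stand-in generator in place of the real iterator. `first_n` and `fake_executions` are hypothetical names.

```python
from itertools import islice


def first_n(iterator, n):
    """Return at most n items without consuming the rest of the iterator."""
    return list(islice(iterator, n))


def fake_executions():
    """Stand-in generator; each yield represents one result fetched from the service."""
    for index in range(1000):
        yield f"execution-{index}"


print(first_n(fake_executions(), 5))
```

Whether this saves API calls depends on how the SDK paginates, which is not shown in this notebook.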


## Step 9: Stop a Running Job (Optional)

You can stop a running evaluation if needed:

In [None]:
# Uncomment to stop the job
# existing_execution.stop()
# print(f"Execution stopped. Status: {existing_execution.status.overall_status}")


Succeeded


AWS service error when stopping pipeline execution: Pipeline execution with ARN arn:aws:sagemaker:us-west-2:052150106756:pipeline/sagemakerevaluation-benchmark/execution/7rr30o7c2qfb status 'Succeeded'. Only pipelines with 'Executing' status can be stopped.
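
As the error above shows, only executions in the `Executing` state can be stopped. A small guard avoids the failed call. The sketch below relies only on the `refresh()`, `stop()`, and `status.overall_status` members used elsewhere in this notebook. `stop_if_executing` and `ExecutionStub` are hypothetical.

```python
from types import SimpleNamespace


def stop_if_executing(execution):
    """Call execution.stop() only when the pipeline is still running; return True if stopped."""
    execution.refresh()
    if execution.status.overall_status != "Executing":
        return False
    execution.stop()
    return True


class ExecutionStub:
    """Stand-in for an execution object so the example runs without AWS access."""

    def __init__(self, status):
        self.status = SimpleNamespace(overall_status=status)
        self.stop_calls = 0

    def refresh(self):
        pass  # a real execution would re-fetch its status here

    def stop(self):
        self.stop_calls += 1


print(stop_if_executing(ExecutionStub("Succeeded")))
print(stop_if_executing(ExecutionStub("Executing")))
```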


## Understanding the Pipeline Structure

The rendered pipeline definition includes:

**4 Steps:**
1. **CreateEvaluationAction** (Lineage): Sets up tracking
2. **EvaluateBaseModel** (Training): Evaluates base model
3. **EvaluateCustomModel** (Training): Evaluates custom model
4. **AssociateLineage** (Lineage): Links results

**Key Features:**
- Template-based: Uses Jinja2 for flexible pipeline generation
- Parallel execution: Base and custom models evaluated simultaneously
- Serverless: No need to manage compute resources
- MLflow integration: Automatic experiment tracking
- Lineage tracking: Full traceability of evaluation artifacts

**Typical Execution Time:**
- Total: ~10-12 minutes
- Downloading phase: ~5-7 minutes (model and dataset)
- Training phase: ~3-5 minutes (running evaluation)
- Lineage steps: ~2-4 seconds each