# SageMaker Custom Scorer Evaluation - Demo

This notebook demonstrates how to use `CustomScorerEvaluator` to evaluate a model with a custom scorer: either your own evaluator from AI Registry or a built-in metric.

## Setup

Import necessary modules.

In [None]:
from sagemaker.train.evaluate import CustomScorerEvaluator
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)
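
To see what the format string above produces, the sketch below sends one message through a handler that writes to an in-memory buffer. The logger name `demo.evaluation` is hypothetical and chosen only for illustration; the last line shows one common way to quiet a chatty dependency (here `botocore`) while keeping INFO output elsewhere.

```python
import io
import logging

# Route a demo logger to an in-memory buffer so the formatted line can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(name)s - %(message)s"))

demo_logger = logging.getLogger("demo.evaluation")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo message out of the root handlers

demo_logger.info("pipeline started")
formatted = buffer.getvalue().strip()
print(formatted)

# Optionally raise the threshold for a noisy library logger.
logging.getLogger("botocore").setLevel(logging.WARNING)
```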

## Configure Evaluation Parameters

Set up the parameters for your custom scorer evaluation.

In [None]:
# Evaluator ARN (custom evaluator from AI Registry)
# evaluator_arn = "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/JsonDoc/00-goga-qa-evaluation/1.0.0"
# evaluator_arn = "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/JsonDoc/nikmehta-reward-function/1.0.0"
# evaluator_arn = "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/JsonDoc/eval-lambda-test/0.0.1"
evaluator_arn = "arn:aws:sagemaker:us-west-2:052150106756:hub-content/F3LMYANDKWPZCROJVCKMJ7TOML6QMZBZRRQOVTUL45VUK7PJ4SXA/JsonDoc/eval-lambda-test/0.0.1"

# Dataset - can be S3 URI or AIRegistry DataSet ARN
dataset = "s3://sagemaker-us-west-2-052150106756/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl"

# Base model - can be:
# 1. Model package ARN: "arn:aws:sagemaker:region:account:model-package/name/version"
# 2. JumpStart model ID: "llama-3-2-1b-instruct" (not working yet: base-model-only evaluation is not implemented or tested)
base_model = "arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/28"

# S3 location for outputs
s3_output_path = "s3://mufi-test-serverless-smtj/eval/"

# Optional: MLflow tracking server ARN
mlflow_resource_arn = "arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment"

print("Configuration:")
print(f"  Evaluator: {evaluator_arn}")
print(f"  Dataset: {dataset}")
print(f"  Base Model: {base_model}")
print(f"  Output Location: {s3_output_path}")
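
Before submitting a job, it can help to sanity-check the strings above. The following is a minimal sketch and not part of the SageMaker SDK: `parse_arn` and `is_s3_uri` are hypothetical helpers that rely only on the general ARN layout (`arn:partition:service:region:account-id:resource`) and the `s3://bucket/key` form. The example ARN and bucket name are placeholders.

```python
from urllib.parse import urlparse


def parse_arn(arn: str) -> dict:
    """Split an ARN into its six colon-separated fields (hypothetical helper)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    keys = ("prefix", "partition", "service", "region", "account", "resource")
    return dict(zip(keys, parts))


def is_s3_uri(uri: str) -> bool:
    """Return True for strings of the form s3://bucket[/key] (hypothetical helper)."""
    parsed = urlparse(uri)
    return parsed.scheme == "s3" and bool(parsed.netloc)


# Placeholder values, not real resources.
example_arn = "arn:aws:sagemaker:us-west-2:123456789012:model-package/example-group/1"
fields = parse_arn(example_arn)
print(fields["service"], fields["region"], fields["resource"])
print(is_s3_uri("s3://example-bucket/eval/"))
```

A check like this only catches malformed strings; it does not confirm that the resource exists or that your role can access it.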

## Create CustomScorerEvaluator Instance

Instantiate the evaluator with your configuration. The evaluator can accept:
- **Custom Evaluator ARN** (string): Points to your custom evaluator in AI Registry
- **Built-in Metric** (string or enum): Use preset metrics like "code_executions", "math_answers", etc.
- **Evaluator Object**: A sagemaker.ai_registry.evaluator.Evaluator instance

In [None]:
# Create evaluator with custom evaluator ARN
evaluator = CustomScorerEvaluator(
    evaluator=evaluator_arn,  # Custom evaluator ARN
    dataset=dataset,
    model=base_model,
    s3_output_path=s3_output_path,
    mlflow_resource_arn=mlflow_resource_arn,
    # model_package_group="arn:aws:sagemaker:us-west-2:052150106756:model-package-group/Demo-test-deb-2", 
    evaluate_base_model=False  # Set to True to also evaluate the base model
)

print("\n✓ CustomScorerEvaluator created successfully")
pprint(evaluator)

### Optionally update the hyperparameters

In [None]:
pprint(evaluator.hyperparameters.to_dict())

# Optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"

# Optionally get more info on types, limits, and defaults
# evaluator.hyperparameters.get_info()

## Alternative: Using Built-in Metrics

Instead of a custom evaluator ARN, you can use built-in metrics:

In [None]:
# Example with built-in metrics (commented out)
# from sagemaker.train.evaluate import get_builtin_metrics
# 
# BuiltInMetric = get_builtin_metrics()
# 
# evaluator_builtin = CustomScorerEvaluator(
#     evaluator=BuiltInMetric.PRIME_MATH,  # Or use string: "prime_math"
#     dataset=dataset,
#     model=base_model,
#     s3_output_path=s3_output_path
# )

## Start Evaluation

Call `evaluate()` to start the evaluation job. This will:
1. Create or update the evaluation pipeline
2. Start a pipeline execution
3. Return an `EvaluationPipelineExecution` object for monitoring

In [None]:
# Start evaluation
execution = evaluator.evaluate()

print("\n✓ Evaluation execution started successfully!")
print(f"  Execution Name: {execution.name}")
print(f"  Pipeline Execution ARN: {execution.arn}")
print(f"  Status: {execution.status.overall_status}")

## Monitor Job Progress

Use `refresh()` to update the job status, or `wait()` to block until completion.

In [None]:
# Check current status
execution.refresh()
print(f"Current Status: {execution.status.overall_status}")

pprint(execution.status)
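
The relationship between the two calls can be sketched as a manual polling loop: blocking until completion amounts to refreshing repeatedly until a terminal status appears or a timeout elapses. `FakeExecution`, `poll_until_done`, and the status strings below are stand-ins invented for illustration; they are not the SDK's classes or values.

```python
import time


class FakeExecution:
    """Stand-in for an execution object; NOT the SDK class."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.status = "Pending"

    def refresh(self):
        # Each refresh moves to the next scripted status, if any remain.
        self.status = next(self._statuses, self.status)


def poll_until_done(execution, poll=0.01, timeout=1.0,
                    terminal=("Succeeded", "Failed", "Stopped")):
    """Call refresh() every `poll` seconds until a terminal status or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        execution.refresh()
        if execution.status in terminal:
            return execution.status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {execution.status!r} after {timeout}s")
        time.sleep(poll)


final_status = poll_until_done(FakeExecution(["Executing", "Executing", "Succeeded"]))
print(final_status)
```

With a real execution you would compare `execution.status.overall_status` rather than the flat `status` attribute used here, and in most cases calling `execution.wait()` is the simpler choice.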

## Wait for Completion

Use `wait()` to block until the job completes. In Jupyter notebooks it renders rich progress output while it polls.

In [None]:
# Wait for job to complete (with rich visual feedback)
execution.wait(poll=30, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")

In [None]:
# show results
execution.show_results()

## Retrieve Existing Job

You can retrieve a previously started evaluation job using its ARN.

In [None]:
from sagemaker.train.evaluate import EvaluationPipelineExecution

# Get existing job by ARN
existing_arn = execution.arn  # Or use a specific ARN

existing_exec = EvaluationPipelineExecution.get(arn=existing_arn)

print(f"Retrieved job: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")

## List All Custom Scorer Evaluations

Retrieve all custom scorer evaluation executions.

In [None]:
# Get all custom scorer evaluations
all_executions = list(CustomScorerEvaluator.get_all())

print(f"Found {len(all_executions)} custom scorer evaluation(s):\n")
# Use a separate loop variable so the `execution` started earlier is not overwritten
for past_execution in all_executions:
    print(f"  - {past_execution.name} - {past_execution.arn}: {past_execution.status.overall_status}")
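
Once executions are listed, a quick tally by status gives an overview. The sketch below uses stand-in records and made-up names and status strings rather than real SDK objects; with real results you would read `status.overall_status` from each item instead.

```python
from collections import Counter, namedtuple

# Stand-in records; real execution objects expose `name` and `status.overall_status`.
Record = namedtuple("Record", ["name", "overall_status"])

records = [
    Record("eval-a", "Succeeded"),
    Record("eval-b", "Failed"),
    Record("eval-c", "Succeeded"),
]

# Count how many executions ended in each status.
status_counts = Counter(r.overall_status for r in records)
for status, count in status_counts.most_common():
    print(f"{status}: {count}")
```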

## Stop a Running Job (Optional)

You can stop a running evaluation if needed.

In [None]:
# Uncomment to stop the job
# execution.stop()
# execution.refresh()  # Re-fetch the status so the line below is not stale
# print(f"Execution stopped. Status: {execution.status.overall_status}")

## Summary

This notebook demonstrated:
1. ✅ Creating a CustomScorerEvaluator with a custom evaluator ARN
2. ✅ Starting an evaluation job
3. ✅ Monitoring job progress with refresh() and wait()
4. ✅ Retrieving existing jobs
5. ✅ Listing all custom scorer evaluations

### Key Points:
- The `evaluator` parameter accepts:
  - Custom evaluator ARN (for AI Registry evaluators)
  - Built-in metric names ("code_executions", "math_answers", "exact_match")
  - Evaluator objects from sagemaker.ai_registry.evaluator.Evaluator
- Set `evaluate_base_model=False` to evaluate only the custom model; set it to `True` to also evaluate the base model
- Use `execution.wait()` for automatic monitoring with rich visual feedback
- Use `execution.refresh()` for manual status updates
- The SageMaker session is automatically inferred from your environment