---
title: "Eval Hub API Examples"
subtitle: "Comprehensive guide to using the Evaluation Hub REST API"
author: "Evaluation Service Team"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    theme: cosmo
  ipynb:
    output-file: api_examples.ipynb
jupyter: python3
---

# Eval Hub API Examples

This notebook demonstrates how to interact with the Evaluation Hub REST API running on `localhost:8000`.

## Setup and Dependencies

In [None]:
import requests
import json
import time
from uuid import uuid4
from datetime import datetime
from typing import Dict, Any, List

# Configuration
BASE_URL = "http://localhost:8000"
API_BASE = f"{BASE_URL}/api/v1"

# Helper function for pretty printing JSON responses
def print_json(data):
    print(json.dumps(data, indent=2, default=str))

# Helper function for API requests
def api_request(method: str, endpoint: str, **kwargs) -> requests.Response:
    """Make an API request and print the method, URL, status code, and response body."""
    url = f"{API_BASE}{endpoint}"
    response = requests.request(method, url, **kwargs)

    print(f"{method.upper()} {url}")
    print(f"Status: {response.status_code}")

    if response.headers.get('content-type', '').startswith('application/json'):
        print("Response:")
        print_json(response.json())
    else:
        print(f"Response: {response.text}")

    print("-" * 50)
    return response
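
The helper above calls `response.json()` whenever the `content-type` header announces JSON, which raises if the body is empty or malformed. A minimal sketch of a more defensive parser follows. `parse_body` is a hypothetical name used for illustration and is not part of the Eval Hub API.

```python
import json
from typing import Any


def parse_body(content_type: str, text: str) -> Any:
    """Return decoded JSON when possible, otherwise the raw text."""
    # Only attempt JSON decoding when the server declares a JSON body.
    if content_type.lower().startswith("application/json"):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Malformed or empty JSON: fall back to the raw text.
            return text
    return text


print(parse_body("application/json; charset=utf-8", '{"status": "ok"}'))
print(parse_body("text/plain", "pong"))
```

Inside `api_request`, this could be called as `parse_body(response.headers.get('content-type', ''), response.text)`.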

## Health Check

First, let's verify the service is running:

In [None]:
response = api_request("GET", "/health")

if response.status_code == 200:
    health_data = response.json()
    print("✅ Service is healthy!")
    print(f"Version: {health_data['version']}")
    print(f"Uptime: {health_data['uptime_seconds']:.1f} seconds")
else:
    print("❌ Service is not responding correctly")

## Provider Management

### List All Providers

In [None]:
response = api_request("GET", "/providers")

if response.status_code == 200:
    providers_data = response.json()
    print(f"Found {providers_data['total_providers']} providers:")
    for provider in providers_data['providers']:
        print(f"  - {provider['provider_name']} ({provider['provider_id']})")
        print(f"    Type: {provider['provider_type']}")
        print(f"    Benchmarks: {provider['benchmark_count']}")

### Get Specific Provider Details

In [None]:
# Get details for the lm_evaluation_harness provider
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}")

if response.status_code == 200:
    provider = response.json()
    print(f"Provider: {provider['provider_name']}")
    print(f"Description: {provider['description']}")
    print(f"Number of benchmarks: {len(provider['benchmarks'])}")

## Benchmark Discovery

### List All Benchmarks

In [None]:
response = api_request("GET", "/benchmarks")

if response.status_code == 200:
    benchmarks_data = response.json()
    print(f"Total benchmarks available: {benchmarks_data['total_count']}")

    # Show first 5 benchmarks
    for benchmark in benchmarks_data['benchmarks'][:5]:
        print(f"  - {benchmark['name']} ({benchmark['benchmark_id']})")
        print(f"    Category: {benchmark['category']}")
        print(f"    Provider: {benchmark['provider_id']}")

### Filter Benchmarks by Category

In [None]:
response = api_request("GET", "/benchmarks", params={"category": "math"})

if response.status_code == 200:
    math_benchmarks = response.json()
    print(f"Math benchmarks: {math_benchmarks['total_count']}")
    for benchmark in math_benchmarks['benchmarks']:
        print(f"  - {benchmark['name']}: {benchmark['description']}")

### Get Provider-Specific Benchmarks

In [None]:
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}/benchmarks")

if response.status_code == 200:
    benchmarks = response.json()
    print(f"Benchmarks for {provider_id}: {len(benchmarks)}")

    # Group by category
    categories = {}
    for benchmark in benchmarks:
        category = benchmark['category']
        if category not in categories:
            categories[category] = []
        categories[category].append(benchmark['name'])

    for category, names in categories.items():
        print(f"\n{category.title()}: {len(names)} benchmarks")
        print(f"  Examples: {', '.join(names[:3])}")

## Collections

### List Available Collections

In [None]:
response = api_request("GET", "/collections")

if response.status_code == 200:
    collections = response.json()
    print(f"Available collections: {collections['total_collections']}")

    for collection in collections['collections']:
        print(f"\n📁 {collection['name']} ({collection['collection_id']})")
        print(f"   Description: {collection['description']}")
        print(f"   Benchmarks: {len(collection['benchmarks'])}")
        for benchmark_ref in collection['benchmarks'][:3]:  # Show first 3
            print(f"     - {benchmark_ref['provider_id']}::{benchmark_ref['benchmark_id']}")

## Model Management

### List All Models

In [None]:
response = api_request("GET", "/models")

if response.status_code == 200:
    models_data = response.json()
    print(f"Total models: {models_data['total_models']}")
    print(f"Runtime models: {len(models_data.get('runtime_models', []))}")
    
    print("\n📋 Registered Models:")
    for model in models_data.get('models', []):
        print(f"  - {model['model_name']} ({model['model_id']})")
        print(f"    Type: {model['model_type']}")
        print(f"    Status: {model['status']}")
        if model.get('base_url'):
            print(f"    Base URL: {model['base_url']}")
    
    if models_data.get('runtime_models'):
        print("\n⚙️ Runtime Models (from environment variables):")
        for model in models_data['runtime_models']:
            print(f"  - {model['model_name']} ({model['model_id']})")
            print(f"    Type: {model['model_type']}")

### List Only Active Models

In [None]:
response = api_request("GET", "/models", params={"include_inactive": False})

if response.status_code == 200:
    models_data = response.json()
    print(f"Active models: {models_data['total_models']}")
    for model in models_data.get('models', []):
        print(f"  - {model['model_name']} ({model['model_id']}) - {model['status']}")

### Get Model by ID

In [None]:
# Get details for a specific model
model_id = "gpt-4-turbo"  # Replace with an actual model ID from your system
response = api_request("GET", f"/models/{model_id}")

if response.status_code == 200:
    model = response.json()
    print(f"Model: {model['model_name']}")
    print(f"ID: {model['model_id']}")
    print(f"Type: {model['model_type']}")
    print(f"Description: {model['description']}")
    print(f"Base URL: {model.get('base_url', 'N/A')}")
    print(f"Status: {model['status']}")
    
    if model.get('capabilities'):
        print(f"\nCapabilities:")
        caps = model['capabilities']
        if caps.get('max_tokens'):
            print(f"  Max tokens: {caps['max_tokens']}")
        if caps.get('context_window'):
            print(f"  Context window: {caps['context_window']}")
        if caps.get('supports_streaming'):
            print(f"  Supports streaming: {caps['supports_streaming']}")
    
    if model.get('tags'):
        print(f"\nTags: {', '.join(model['tags'])}")
elif response.status_code == 404:
    print(f"❌ Model '{model_id}' not found")

### Register a New Model

In [None]:
# Register an OpenAI-compatible model
new_model = {
    "model_id": "groq-llama-3.1-70b",
    "model_name": "Llama 3.1 70B via Groq",
    "description": "Meta's Llama 3.1 70B model accessed through Groq API",
    "model_type": "openai-compatible",
    "base_url": "https://api.groq.com/openai/v1",
    "api_key_required": True,
    "model_path": "llama-3.1-70b-versatile",
    "capabilities": {
        "max_tokens": 8192,
        "supports_streaming": True,
        "supports_function_calling": True,
        "context_window": 131072
    },
    "config": {
        "temperature": 0.7,
        "max_tokens": 2048,
        "timeout": 60,
        "retry_attempts": 3
    },
    "status": "active",
    "tags": ["groq", "llama", "openai-compatible", "fast"]
}

print("📝 Registering new model...")
print_json(new_model)

response = api_request("POST", "/models", json=new_model)

if response.status_code == 201:
    registered_model = response.json()
    print(f"✅ Model registered successfully!")
    print(f"Model ID: {registered_model['model_id']}")
    print(f"Created at: {registered_model.get('created_at', 'N/A')}")
else:
    print(f"❌ Failed to register model: {response.text}")

### Register a vLLM Server Model

In [None]:
# Register a vLLM server model
vllm_model = {
    "model_id": "local-llama-2-7b",
    "model_name": "Local Llama 2 7B",
    "description": "Llama 2 7B running on a local vLLM server",
    "model_type": "vllm",
    "base_url": "http://localhost:8000",  # Port 8000 is also used by the Eval Hub above; adjust to your vLLM server
    "api_key_required": False,
    "model_path": "/models/llama-2-7b",
    "capabilities": {
        "max_tokens": 4096,
        "supports_streaming": True,
        "context_window": 4096
    },
    "config": {
        "temperature": 0.0,
        "max_tokens": 512,
        "timeout": 120,
        "retry_attempts": 2
    },
    "status": "active",
    "tags": ["vllm", "local", "llama-2"]
}

print("📝 Registering vLLM model...")
response = api_request("POST", "/models", json=vllm_model)

if response.status_code == 201:
    print(f"✅ vLLM model registered: {response.json()['model_id']}")
else:
    print(f"⚠️ Note: This may fail if the model ID already exists")
    print(f"Response: {response.text}")
    print(f"Response: {response.text}")

### Update a Model

In [None]:
# Update model details
model_id = "groq-llama-3.1-70b"  # Replace with an actual model ID

update_request = {
    "model_name": "Llama 3.1 70B (Groq) - Updated",
    "description": "Updated description for Llama 3.1 70B via Groq",
    "status": "active",
    "tags": ["groq", "llama", "openai-compatible", "fast", "updated"]
}

print(f"📝 Updating model: {model_id}")
print_json(update_request)

response = api_request("PUT", f"/models/{model_id}", json=update_request)

if response.status_code == 200:
    updated_model = response.json()
    print(f"✅ Model updated successfully!")
    print(f"New name: {updated_model['model_name']}")
    print(f"Tags: {', '.join(updated_model.get('tags', []))}")
elif response.status_code == 404:
    print(f"❌ Model '{model_id}' not found")
else:
    print(f"❌ Failed to update model: {response.text}")

### Delete a Model

In [None]:
# Delete a model (runtime models cannot be deleted via API)
model_id = "groq-llama-3.1-70b"  # Replace with an actual model ID

print(f"🗑️ Deleting model: {model_id}")
response = api_request("DELETE", f"/models/{model_id}")

if response.status_code == 200:
    result = response.json()
    print(f"✅ {result.get('message', 'Model deleted successfully')}")
elif response.status_code == 404:
    print(f"❌ Model '{model_id}' not found")
elif response.status_code == 400:
    print(f"❌ Cannot delete runtime model (configured via environment variables)")
    print(f"Response: {response.text}")
else:
    print(f"❌ Failed to delete model: {response.text}")

### Reload Runtime Models

In [None]:
# Reload models configured via environment variables
print("🔄 Reloading runtime models from environment variables...")
response = api_request("POST", "/models/reload")

if response.status_code == 200:
    result = response.json()
    print(f"✅ {result.get('message', 'Runtime models reloaded successfully')}")
    
    # List models again to see any new runtime models
    print("\n📋 Updated model list:")
    list_response = api_request("GET", "/models")
    if list_response.status_code == 200:
        models_data = list_response.json()
        print(f"Total models: {models_data['total_models']}")
        print(f"Runtime models: {len(models_data.get('runtime_models', []))}")
else:
    print(f"❌ Failed to reload models: {response.text}")

## Basic Evaluation Examples

### Single Benchmark Evaluation from a Built-in Provider (Simplified API)

In [None]:
# Example: Run a single benchmark using the simplified API (Llama Stack compatible)
provider_id = "lm_evaluation_harness"
benchmark_id = "arc_easy"

single_benchmark_request = {
    "model_name": "gpt-4o-mini",
    "model_configuration": {
        "temperature": 0.0,
        "max_tokens": 512
    },
    "timeout_minutes": 30,
    "retry_attempts": 1,
    "limit": 100,  # Limit to 100 samples for faster execution
    "num_fewshot": 0,
    "experiment_name": "Single Benchmark - ARC Easy",
    "tags": {
        "example_type": "single_benchmark",
        "provider": "lm_evaluation_harness",
        "benchmark": "arc_easy"
    }
}

print("📝 Creating single benchmark evaluation request...")
print(f"Provider ID: {provider_id}")
print(f"Benchmark ID: {benchmark_id}")
print_json(single_benchmark_request)

response = api_request("POST", f"/evaluations/benchmarks/{provider_id}/{benchmark_id}", json=single_benchmark_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print(f"✅ Single benchmark evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")

### Simple Evaluation with Risk Category

In [None]:
# Create a simple evaluation request using risk category
evaluation_request = {
    "request_id": str(uuid4()),
    "experiment_name": "Simple Risk-Based Evaluation",
    "evaluations": [
        {
            "name": "GPT-4o Mini Low Risk Evaluation",
            "description": "Basic evaluation using low risk benchmarks",
            "model_name": "gpt-4o-mini",
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512
            },
            "risk_category": "low",
            "timeout_minutes": 30,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "risk_category",
        "complexity": "simple"
    }
}

print("📝 Creating evaluation request...")
print_json(evaluation_request)

response = api_request("POST", "/evaluations", json=evaluation_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print(f"✅ Evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")

### Evaluation with Explicit Backend Configuration

In [None]:
# Create an evaluation with explicit backend configuration
explicit_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Explicit Backend Configuration",
    "evaluations": [
        {
            "name": "LM-Eval Harness Evaluation",
            "description": "Evaluation with explicit lm-evaluation-harness configuration",
            "model_name": "gpt-4o-mini",
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 256,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "lm-eval-backend",
                    "type": "lm-evaluation-harness",
                    "config": {
                        "batch_size": 1,
                        "device": "cpu"
                    },
                    "benchmarks": [
                        {
                            "name": "arc_easy",
                            "tasks": ["arc_easy"],
                            "config": {
                                "num_fewshot": 5,
                                "limit": 50
                            }
                        },
                        {
                            "name": "hellaswag",
                            "tasks": ["hellaswag"],
                            "config": {
                                "num_fewshot": 10,
                                "limit": 100
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 45,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "explicit_backend",
        "complexity": "intermediate"
    }
}

print("📝 Creating evaluation with explicit backend...")
response = api_request("POST", "/evaluations", json=explicit_evaluation)

if response.status_code == 202:
    explicit_response = response.json()
    explicit_request_id = explicit_response["request_id"]
    print(f"✅ Explicit evaluation created!")
    print(f"Request ID: {explicit_request_id}")

## NeMo Evaluator Integration

### Single NeMo Evaluator Container

In [None]:
# Example with single NeMo Evaluator container
nemo_single_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "NeMo Evaluator Single Container",
    "evaluations": [
        {
            "name": "GPT-4 via NeMo Evaluator",
            "description": "Remote evaluation using NeMo Evaluator container",
            "model_name": "gpt-4-turbo",
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "remote-nemo-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "localhost",
                        "port": 3825,
                        "model_endpoint": "https://api.openai.com/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "OPENAI_API_KEY",
                        "timeout_seconds": 1800,
                        "max_retries": 2,
                        "verify_ssl": False,
                        "framework_name": "eval-hub-example",
                        "parallelism": 1,
                        "limit_samples": 25,
                        "temperature": 0.0,
                        "top_p": 0.95
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro_sample",
                            "tasks": ["mmlu_pro"],
                            "config": {
                                "limit": 25,
                                "num_fewshot": 5
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 60,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_single",
        "complexity": "advanced",
        "backend": "remote_container"
    }
}

print("📝 Creating NeMo Evaluator evaluation...")
print("Note: This requires a running NeMo Evaluator container on localhost:3825")

response = api_request("POST", "/evaluations", json=nemo_single_evaluation)

if response.status_code == 202:
    nemo_response = response.json()
    nemo_request_id = nemo_response["request_id"]
    print(f"✅ NeMo evaluation created!")
    print(f"Request ID: {nemo_request_id}")
else:
    print("⚠️ NeMo evaluation failed (container may not be running)")
    print(f"Response: {response.text}")

### Multi-Container NeMo Evaluator Setup

In [None]:
# Example with multiple specialized NeMo Evaluator containers
nemo_multi_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Multi-Container NeMo Evaluation",
    "evaluations": [
        {
            "name": "Distributed LLaMA Evaluation",
            "description": "Multi-container evaluation across specialized endpoints",
            "model_name": "llama-3.1-8b",
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "academic-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "academic-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "timeout_seconds": 3600,
                        "framework_name": "eval-hub-academic",
                        "parallelism": 2
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro",
                            "tasks": ["mmlu_pro"],
                            "config": {"limit": 100, "num_fewshot": 5}
                        },
                        {
                            "name": "arc_challenge",
                            "tasks": ["arc_challenge"],
                            "config": {"limit": 200, "num_fewshot": 25}
                        }
                    ]
                },
                {
                    "name": "math-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "math-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "temperature": 0.0,
                        "parallelism": 1,
                        "framework_name": "eval-hub-math"
                    },
                    "benchmarks": [
                        {
                            "name": "gsm8k",
                            "tasks": ["gsm8k"],
                            "config": {"limit": 100, "num_fewshot": 8}
                        },
                        {
                            "name": "math",
                            "tasks": ["hendrycks_math"],
                            "config": {"limit": 50, "num_fewshot": 4}
                        }
                    ]
                }
            ],
            "timeout_minutes": 120,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_multi",
        "complexity": "expert",
        "backend": "distributed_containers"
    }
}

print("📝 Creating multi-container NeMo evaluation...")
print("Note: This is a hypothetical example with multiple remote containers")
print_json(nemo_multi_evaluation)

## Evaluation Status Monitoring

### Check Evaluation Status

In [None]:
# Function to check evaluation status
def check_evaluation_status(request_id: str):
    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code == 200:
        status_data = response.json()
        print(f"📊 Evaluation Status for {request_id}")
        print(f"Status: {status_data['status']}")
        print(f"Progress: {status_data.get('progress_percentage', 0):.1f}%")
        print(f"Total evaluations: {status_data.get('total_evaluations', 0)}")
        print(f"Completed: {status_data.get('completed_evaluations', 0)}")
        print(f"Failed: {status_data.get('failed_evaluations', 0)}")

        if status_data.get('results'):
            print(f"Results available: {len(status_data['results'])}")

        return status_data
    else:
        print(f"❌ Failed to get status: {response.text}")
        return None

# Check the status of a previously created evaluation (if one exists)
if 'request_id' in globals():
    check_evaluation_status(request_id)
else:
    print("No evaluation request_id available to check")

### Monitor Evaluation Progress

In [None]:
# Function to monitor evaluation until completion
def monitor_evaluation(request_id: str, max_wait_time: int = 300):
    """Monitor an evaluation until completion or timeout."""
    start_time = time.time()

    while time.time() - start_time < max_wait_time:
        status_data = check_evaluation_status(request_id)

        if not status_data:
            break

        status = status_data['status']

        if status in ['completed', 'failed', 'cancelled']:
            print(f"🏁 Evaluation {status}!")

            if status == 'completed' and status_data.get('results'):
                print("\n📊 Results Summary:")
                for result in status_data['results'][:3]:  # Show first 3 results
                    print(f"  - {result['benchmark_name']}: {result['status']}")
                    if result.get('metrics'):
                        for metric, value in list(result['metrics'].items())[:2]:
                            print(f"    {metric}: {value}")

            return status_data

        print(f"⏳ Still {status}, waiting...")
        time.sleep(10)

    print(f"⏰ Monitoring timed out after {max_wait_time} seconds")
    return None

# Example usage (uncomment if you have a running evaluation)
# monitor_evaluation(request_id)
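
Polling at a fixed 10-second interval is simple, but for long evaluations a growing delay reduces load on the service. The sketch below shows one way to do this with exponential backoff. `poll_until_done` and its parameters are hypothetical; `get_status` stands in for any callable that returns the status dictionary (for example a wrapper around `check_evaluation_status`), and the terminal states are the ones used above. Elapsed time is tracked as the sum of sleep intervals rather than wall-clock time, which keeps the function easy to exercise with a fake `sleep`.

```python
import time
from typing import Callable, Dict, Optional

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[], Optional[Dict]],
    max_wait_time: float = 300.0,
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict]:
    """Poll get_status with exponential backoff until a terminal state or timeout."""
    waited = 0.0
    delay = initial_delay
    while True:
        status_data = get_status()
        if status_data is None:
            return None  # Status lookup failed; give up.
        if status_data.get("status") in TERMINAL_STATES:
            return status_data
        if waited >= max_wait_time:
            return None  # Timed out.
        # Never sleep past the deadline.
        step = min(delay, max_wait_time - waited)
        sleep(step)
        waited += step
        # Double the delay after each attempt, up to max_delay.
        delay = min(delay * 2, max_delay)


# Demo with a fake status feed and a no-op sleep, so it runs without the service.
feed = iter([{"status": "running"}, {"status": "completed"}])
print(poll_until_done(lambda: next(feed), sleep=lambda seconds: None))
```

Against a running service, the call might look like `poll_until_done(lambda: check_evaluation_status(request_id))`.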

## List All Evaluations

In [None]:
response = api_request("GET", "/evaluations")

if response.status_code == 200:
    evaluations = response.json()
    print(f"📋 Active evaluations: {len(evaluations)}")

    for eval_resp in evaluations:
        print(f"\n🔍 {eval_resp['request_id']}")
        print(f"   Status: {eval_resp['status']}")
        print(f"   Progress: {eval_resp.get('progress_percentage', 0):.1f}%")
        print(f"   Created: {eval_resp['created_at']}")

## System Metrics

In [None]:
response = api_request("GET", "/metrics/system")

if response.status_code == 200:
    metrics = response.json()
    print("📊 System Metrics:")
    print(f"  Active evaluations: {metrics['active_evaluations']}")
    print(f"  Running tasks: {metrics['running_tasks']}")
    print(f"  Total requests: {metrics['total_requests']}")

    if metrics.get('status_breakdown'):
        print("\n  Status breakdown:")
        for status, count in metrics['status_breakdown'].items():
            print(f"    {status}: {count}")

    if metrics.get('memory_usage'):
        print(f"\n  Memory usage:")
        print(f"    Active evaluations: {metrics['memory_usage']['active_evaluations_mb']:.1f} MB")

## Evaluation Management

### Cancel an Evaluation

In [None]:
# Function to cancel an evaluation
def cancel_evaluation(request_id: str):
    response = api_request("DELETE", f"/evaluations/{request_id}")

    if response.status_code == 200:
        result = response.json()
        print(f"✅ {result['message']}")
        return True
    else:
        print(f"❌ Failed to cancel: {response.text}")
        return False

# Example usage (uncomment if you want to cancel an evaluation)
# cancel_evaluation(request_id)

## Error Handling Examples

### Invalid Request Handling

In [None]:
# Example of invalid request to demonstrate error handling
invalid_request = {
    "request_id": "invalid-uuid-format",
    "evaluations": [
        {
            "name": "",  # Invalid: empty name
            "model_name": "",  # Invalid: empty model name
            "backends": []  # Invalid: no backends
        }
    ]
}

print("📝 Testing error handling with invalid request...")
response = api_request("POST", "/evaluations", json=invalid_request)

if response.status_code >= 400:
    print("✅ Error handling working correctly")
    error_data = response.json()
    print(f"Error type: {response.status_code}")
    print(f"Error message: {error_data.get('detail', 'Unknown error')}")

### Non-existent Resource Handling

In [None]:
# Test accessing non-existent evaluation
fake_request_id = str(uuid4())
print(f"🔍 Testing access to non-existent evaluation: {fake_request_id}")

response = api_request("GET", f"/evaluations/{fake_request_id}")

if response.status_code == 404:
    print("✅ 404 handling working correctly")
    error_data = response.json()
    print(f"Error: {error_data['detail']}")

## Advanced Examples

### Batch Evaluation Requests

In [None]:
# Create multiple evaluations for comparison
batch_requests = []

models_to_compare = ["gpt-4o-mini", "gpt-3.5-turbo"]
risk_levels = ["low", "medium"]

for model in models_to_compare:
    for risk in risk_levels:
        batch_request = {
            "request_id": str(uuid4()),
            "experiment_name": f"Batch Comparison - {model} - {risk} risk",
            "evaluations": [
                {
                    "name": f"{model} {risk} risk evaluation",
                    "model_name": model,
                    "model_configuration": {
                        "temperature": 0.0,
                        "max_tokens": 256
                    },
                    "risk_category": risk,
                    "timeout_minutes": 30
                }
            ],
            "tags": {
                "batch_id": "model_comparison_001",
                "model": model,
                "risk_level": risk
            }
        }
        batch_requests.append(batch_request)

print(f"📦 Creating {len(batch_requests)} batch evaluations...")

batch_results = []
for i, request in enumerate(batch_requests):
    print(f"\n📝 Creating batch request {i+1}/{len(batch_requests)}")
    response = api_request("POST", "/evaluations", json=request)

    if response.status_code == 202:
        batch_results.append(response.json())
        print(f"✅ Batch {i+1} created: {response.json()['request_id']}")
    else:
        print(f"❌ Batch {i+1} failed")

print(f"\n📊 Successfully created {len(batch_results)} batch evaluations")
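
Once a batch has been submitted, a quick tally by status gives an overview of where things stand. The sketch below assumes each entry in `batch_results` carries a `status` field, as the creation responses earlier in this notebook do. `summarize_by_status` is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_by_status(results: Iterable[Dict]) -> Dict[str, int]:
    """Count evaluation responses per status, using 'unknown' when the field is missing."""
    return dict(Counter(r.get("status", "unknown") for r in results))


# Made-up sample responses; real entries would come from batch_results.
sample = [
    {"request_id": "a", "status": "pending"},
    {"request_id": "b", "status": "running"},
    {"request_id": "c", "status": "pending"},
]
print(summarize_by_status(sample))
```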

### Configuration Validation

In [None]:
# Test various configuration combinations
test_configs = [
    {
        "name": "High timeout test",
        "config": {"timeout_minutes": 120, "retry_attempts": 5},
        "expected": "success"
    },
    {
        "name": "Zero timeout test",
        "config": {"timeout_minutes": 0, "retry_attempts": 1},
        "expected": "validation_error"
    },
    {
        "name": "Negative retry test",
        "config": {"timeout_minutes": 30, "retry_attempts": -1},
        "expected": "validation_error"
    }
]

for test in test_configs:
    print(f"\n🧪 Testing: {test['name']}")

    test_request = {
        "request_id": str(uuid4()),
        "experiment_name": test['name'],
        "evaluations": [
            {
                "name": "Config test",
                "model_name": "gpt-4o-mini",
                "risk_category": "low",
                **test['config']
            }
        ]
    }

    response = api_request("POST", "/evaluations", json=test_request)

    if test['expected'] == "success" and response.status_code == 202:
        print("✅ Test passed")
    elif test['expected'] == "validation_error" and response.status_code >= 400:
        print("✅ Validation correctly rejected invalid config")
    else:
        print(f"❌ Unexpected result: {response.status_code}")
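
The three cases above expect the service to reject a zero timeout and a negative retry count. Assuming those are the rules (a positive `timeout_minutes` and a non-negative `retry_attempts`), the same checks can be sketched client-side to catch mistakes before a request is sent. `validate_evaluation_config` is a hypothetical helper, and the service's actual validation may be stricter.

```python
from typing import Any, Dict, List


def validate_evaluation_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    timeout = config.get("timeout_minutes")
    retries = config.get("retry_attempts")
    # Assumed rule: timeout, when given, must be a positive integer.
    if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
        problems.append("timeout_minutes must be a positive integer")
    # Assumed rule: retries, when given, must be a non-negative integer.
    if retries is not None and (not isinstance(retries, int) or retries < 0):
        problems.append("retry_attempts must be a non-negative integer")
    return problems


for cfg in (
    {"timeout_minutes": 120, "retry_attempts": 5},
    {"timeout_minutes": 0, "retry_attempts": 1},
    {"timeout_minutes": 30, "retry_attempts": -1},
):
    print(cfg, "->", validate_evaluation_config(cfg) or "ok")
```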

## Summary

This notebook demonstrated comprehensive usage of the Eval Hub API including:

- ✅ **Basic Operations**: Health checks, provider/benchmark discovery
- ✅ **Model Management**: Register, list, update, and delete models
- ✅ **Simple Evaluations**: Risk category-based evaluations
- ✅ **Advanced Evaluations**: Explicit backend configuration
- ✅ **NeMo Integration**: Single and multi-container setups
- ✅ **Monitoring**: Status checking and progress tracking
- ✅ **Management**: Cancellation and system metrics
- ✅ **Error Handling**: Validation and error responses
- ✅ **Batch Operations**: Multiple evaluation management

For production use, remember to:
- Use proper API keys and authentication
- Configure appropriate timeouts for your evaluation complexity
- Monitor resource usage and system metrics
- Handle errors gracefully in your applications
- Use the async evaluation mode for long-running evaluations
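
As a rough illustration of the first point, credentials are best read from the environment rather than hard-coded in a notebook. The sketch below assumes bearer-token authentication and an environment variable named `EVAL_HUB_API_KEY`. Both are assumptions made for illustration, since this notebook does not show how the service authenticates requests, and `build_auth_headers` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping


def build_auth_headers(
    env: Mapping[str, str] = os.environ,
    var_name: str = "EVAL_HUB_API_KEY",  # Hypothetical variable name
) -> Dict[str, str]:
    """Build request headers from an API key stored in the environment."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before calling the API")
    # Assumed scheme: a bearer token in the Authorization header.
    return {"Authorization": f"Bearer {token}"}


# Demo with an explicit mapping so no real secret is needed.
print(build_auth_headers({"EVAL_HUB_API_KEY": "example-token"}))
```

Because `api_request` forwards its keyword arguments to `requests.request`, the headers could be passed as `api_request("GET", "/models", headers=build_auth_headers())`.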

The Eval Hub provides a powerful and flexible API for orchestrating machine learning model evaluations across multiple backends and evaluation frameworks.