# Lighteval Example with the EvalHub SDK

This notebook demonstrates how to use the EvalHub client SDK, via `SyncEvalHubClient`, to run a Lighteval benchmark against the EvalHub evaluation service.

## Prerequisites: EvalHub credentials (`.env`)

Before running this notebook, create a `.env` file with your EvalHub credentials. Run the block below in a terminal (e.g. from the `examples/` directory); it writes `EVALHUB_TOKEN` and `EVALHUB_URL` to `.env`. The notebook loads these values with `load_dotenv()` before it creates the client.

```bash
# Get your OpenShift authentication token
export EVALHUB_TOKEN=$(oc whoami -t)

# Get the EvalHub route URL from your namespace (e.g., 'test')
export EVALHUB_URL=https://$(oc get route evalhub -n test -o jsonpath='{.spec.host}')

# Write to .env file
cat > .env << EOF
EVALHUB_TOKEN=$EVALHUB_TOKEN
EVALHUB_URL=$EVALHUB_URL
EOF

# Verify
echo "Token: ${EVALHUB_TOKEN:0:20}..."
echo "URL: $EVALHUB_URL"
echo "Written to .env"
```
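Before opening the notebook, it can help to confirm that `.env` actually contains both entries, without printing the token. The following is a minimal sketch using only the standard library; `read_env_file` and `missing_keys` are hypothetical helper names for illustration, not part of the SDK or of `python-dotenv`, and the parser handles only plain `KEY=VALUE` lines like the ones written above.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(
    values: Dict[str, str],
    required: Iterable[str] = ("EVALHUB_TOKEN", "EVALHUB_URL"),
) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(read_env_file(".env"))` returns an empty list when both entries are present and non-empty.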


## OpenShift Authentication

The SDK has built-in support for OpenShift authentication. The token can come from three sources:

1. **ServiceAccount Token** (`auth_token_path`): For pods running inside OpenShift, use the automounted token at `/var/run/secrets/kubernetes.io/serviceaccount/token`.
2. **User Token** (`auth_token`): For local development, use the token from `oc whoami -t` (shown above).
3. **Environment Variable** (`auth_token`): For production, read the token from an environment variable or secret (used in the examples below).
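One way to choose between these sources at runtime is sketched below. This is an illustration only: `resolve_auth_token` is a hypothetical helper, not an SDK function, and the order it uses (environment variable first, then the automounted ServiceAccount token file) is an assumption rather than documented SDK behavior.

```python
import os
from pathlib import Path
from typing import Optional

# Path of the automounted ServiceAccount token inside a pod (from the list above).
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def resolve_auth_token(
    env_var: str = "EVALHUB_TOKEN",
    token_path: str = SA_TOKEN_PATH,
) -> Optional[str]:
    """Hypothetical helper: return a token from the environment, else from a file."""
    token = os.getenv(env_var)
    if token:
        return token.strip()
    path = Path(token_path)
    if path.is_file():
        return path.read_text().strip()
    # Neither source is available; the caller decides how to fail.
    return None
```

The returned string could then be passed as `auth_token` when constructing the client.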

In [1]:
# Read environment variables from .env if present (for local runs)
from dotenv import load_dotenv

load_dotenv(".env")

True

## Setup: Import Required Modules

In [2]:
from evalhub import (
    SyncEvalHubClient,
    JobSubmissionRequest,
    BenchmarkConfig,
    ModelConfig,
    JobStatus,
)
import os

# Verify environment variables are set
if not os.getenv("EVALHUB_TOKEN"):
    raise ValueError(
        "EVALHUB_TOKEN environment variable is not set. See Prerequisites section above."
    )
if not os.getenv("EVALHUB_URL"):
    raise ValueError(
        "EVALHUB_URL environment variable is not set. See Prerequisites section above."
    )

## Connect to EvalHub and Check Health Status

In [3]:
client = SyncEvalHubClient(
    base_url=os.getenv("EVALHUB_URL"),
    auth_token=os.getenv("EVALHUB_TOKEN"),
    insecure=True,  # Disables TLS verification; use only for local testing
)
try:
    # Check health
    health = client.health()
    print(f"✓ EvalHub is healthy: {health['status']}")
    print(f"  Version: {health.get('version', 'unknown')}")
    print(f"  Uptime: {health.get('uptime_seconds', 0):.1f}s")
except Exception as e:
    print(f"✗ Failed to connect: {e}")

TLS verification disabled - skipping CA bundle detection
TLS verification disabled (insecure mode)


✓ EvalHub is healthy: healthy
  Version: unknown
  Uptime: 0.0s


## List Providers and Benchmarks

In [4]:
# List all available providers
providers = client.providers.list()
print(f"✓ Found {len(providers)} providers:")
for provider in providers:
    print(f"  - {provider.id}: {provider.name}")

✓ Found 5 providers:
  - garak: Garak
  - guidellm: GuideLLM
  - lighteval: Lighteval
  - lm_evaluation_harness: LM Evaluation Harness
  - ragas: RAGAS


### Use the `lighteval` provider

This notebook uses the `lighteval` provider. List the benchmarks available for it:

In [5]:
provider_id = "lighteval"

In [6]:
benchmarks = client.benchmarks.list(provider_id=provider_id)
print(f"\n✓ Found {len(benchmarks)} benchmarks for {provider_id}:")
for benchmark in benchmarks:
    print(f"  - {benchmark.id}: {benchmark.name}")


✓ Found 23 benchmarks for lighteval:
  - commonsense_reasoning: Commonsense Reasoning Suite
  - scientific_reasoning: Scientific Reasoning Suite
  - physical_commonsense: Physical Commonsense Suite
  - truthfulness: Truthfulness Suite
  - math: Math Suite
  - knowledge: Knowledge Suite
  - language_understanding: Language Understanding Suite
  - hellaswag: HellaSwag
  - winogrande: Winogrande
  - openbookqa: OpenBookQA
  - arc:easy: ARC Easy
  - arc:challenge: ARC Challenge
  - piqa: PIQA
  - truthfulqa:mc: TruthfulQA MC
  - truthfulqa:generation: TruthfulQA Generation
  - gsm8k: GSM8K
  - math:algebra: MATH Algebra
  - math:counting_and_probability: MATH Counting & Probability
  - mmlu: MMLU
  - triviaqa: TriviaQA
  - glue:cola: GLUE CoLA
  - glue:sst2: GLUE SST-2
  - glue:mrpc: GLUE MRPC


## Submit an Evaluation Job

To find available models in your namespace:

```bash
# List services in your namespace
oc get svc -n <namespace>
# Example: a service named 'granite-llm-metrics' in the 'prabhu' namespace, listening on port 8080.
# The model endpoint URL has the form:
#   http://<service-name>.<namespace>.svc.cluster.local:<port>/v1
# In this case, the endpoint serves the `granite-llm` model.
#
# Correct: http://granite-llm-metrics.prabhu.svc.cluster.local:8080/v1
# Wrong (stop at /v1, do not append a route): http://granite-llm-metrics.prabhu.svc.cluster.local:8080/v1/completions
```
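The URL pattern above can be captured in a small helper so that the `/v1/completions` mistake is caught before a job is submitted. This is a sketch under the format shown in the comments; `build_model_url` and `is_base_model_url` are hypothetical names introduced here for illustration.

```python
def build_model_url(service: str, namespace: str, port: int) -> str:
    """Build the in-cluster base URL for a model service."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}/v1"


def is_base_model_url(url: str) -> bool:
    """Return True if the URL stops at /v1 rather than at a specific route."""
    return url.rstrip("/").endswith("/v1")


url = build_model_url("granite-llm-metrics", "prabhu", 8080)
if not is_base_model_url(url):
    raise ValueError(f"Model URL should end in /v1: {url}")
```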

In [7]:
model_url = "http://granite-llm-metrics.prabhu.svc.cluster.local:8080/v1"
model_name = "granite-llm"

In [8]:
# Create benchmark configuration
benchmark1 = BenchmarkConfig(
    id="gsm8k",
    provider_id="lighteval",
    parameters={
        "num_examples": 5,  # Number of examples to evaluate (use small number for testing)
        "tokenizer": "google/flan-t5-small",  # HuggingFace tokenizer for the model
    },
)

# Create job submission request
job_request = JobSubmissionRequest(
    model=ModelConfig(url=model_url, name=model_name),
    benchmarks=[benchmark1],
    timeout_minutes=15,
    retry_attempts=1,
)

# Submit job using nested resource
job = client.jobs.submit(job_request)

# Store job ID for later use
submitted_job_id = job.id
print("✓ Job submitted successfully")
print(f"  Job ID: {submitted_job_id}")
print(f"  State: {job.state}")
print(f"  Created: {job.resource.created_at}")

✓ Job submitted successfully
  Job ID: c02c0831-0097-4598-b4cb-5cdb0870904c
  State: JobStatus.PENDING
  Created: 2026-02-13 06:42:54.696571+00:00


In [9]:
# Check status
updated_job = client.jobs.get(submitted_job_id)
print(f"\n✓ Current job state: {updated_job.state}")
if updated_job.status and updated_job.status.message:
    print(f"  Message: {updated_job.status.message.message}")


✓ Current job state: JobStatus.RUNNING
  Message: Evaluation job is running


## Wait for the job to complete

In [10]:
import time

while updated_job.state not in (JobStatus.COMPLETED, JobStatus.FAILED):
    print(f"\n✓ Current job state: {updated_job.state}")
    time.sleep(20)
    updated_job = client.jobs.get(submitted_job_id)

if updated_job.state == JobStatus.FAILED:
    if updated_job.status and updated_job.status.message:
        print(f"Job failed: {updated_job.status.message.message}")
    else:
        print("Job failed: Unknown")
else:
    print("✓ Job completed successfully")



✓ Current job state: JobStatus.RUNNING

✓ Current job state: JobStatus.RUNNING
✓ Job completed successfully
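The loop above waits indefinitely if the job never reaches `COMPLETED` or `FAILED`. A bounded variant is sketched below as a generic polling helper; `wait_for_state` is a hypothetical function written for this notebook, not an SDK method, and the default timeout and interval are arbitrary choices.

```python
import time
from typing import Callable, Collection, TypeVar

S = TypeVar("S")


def wait_for_state(
    fetch_state: Callable[[], S],
    terminal_states: Collection[S],
    timeout_seconds: float = 900.0,
    poll_interval: float = 20.0,
) -> S:
    """Poll fetch_state() until it returns a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout_seconds}s")
        time.sleep(poll_interval)
```

With the client from this notebook, `fetch_state` could be `lambda: client.jobs.get(submitted_job_id).state` and `terminal_states` could be `(JobStatus.COMPLETED, JobStatus.FAILED)`.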


In [11]:
print("benchmark metrics:", updated_job.results.benchmarks[0].metrics)

benchmark metrics: {'all.extractive_match': 0.8, 'gsm8k|0.extractive_match': 0.8}
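When a job returns more than a couple of metrics, the raw dictionary becomes hard to scan. The sketch below renders a metrics mapping as aligned rows; `format_metrics` is a hypothetical helper, and the example values are copied from the output above.

```python
from typing import Mapping


def format_metrics(metrics: Mapping[str, float]) -> str:
    """Render metric names and values as aligned rows, sorted by name."""
    if not metrics:
        return "(no metrics)"
    width = max(len(name) for name in metrics)
    rows = [f"{name:<{width}}  {value:.3f}" for name, value in sorted(metrics.items())]
    return "\n".join(rows)


# Metrics copied from the output above.
example = {"all.extractive_match": 0.8, "gsm8k|0.extractive_match": 0.8}
print(format_metrics(example))
```

In the notebook, the same function could be applied to `updated_job.results.benchmarks[0].metrics`.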


In [12]:
# Print the full job details as JSON
print(updated_job.model_dump_json(indent=2))

{
  "resource": {
    "id": "c02c0831-0097-4598-b4cb-5cdb0870904c",
    "tenant": "TODO",
    "created_at": "2026-02-13T06:42:54.696571Z",
    "updated_at": "2026-02-13T06:43:24.914291Z",
    "mlflow_experiment_id": null,
    "message": {
      "message": "Evaluation job is completed",
      "message_code": "evaluation_job_updated"
    }
  },
  "status": {
    "state": "completed",
    "message": {
      "message": "Evaluation job is completed",
      "message_code": "evaluation_job_updated"
    },
    "benchmarks": [
      {
        "id": "gsm8k",
        "provider_id": "",
        "status": "completed",
        "message": null
      }
    ]
  },
  "results": {
    "benchmarks": [
      {
        "id": "gsm8k",
        "provider_id": "",
        "metrics": {
          "all.extractive_match": 0.8,
          "gsm8k|0.extractive_match": 0.8
        },
        "artifacts": {
          "oci_digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
          "oci_r

The benchmark metrics are available in the job results, under `results.benchmarks[].metrics` in the JSON above.

## Clean up

In [13]:
# Run this cell to cancel the job and delete it (hard_delete=True)
client.jobs.cancel(submitted_job_id, hard_delete=True)

True

In [14]:
# Close the client
client.close()