# Scoring Deep Dive

This notebook explains how GEPA computes scores from Karenina's verification results. Understanding scoring is crucial for:

1. Interpreting optimization progress
2. Balancing correctness vs quality
3. Multi-model optimization with Pareto frontiers
4. Diagnosing template/rubric failures

---

## Setup

In [None]:
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))

from karenina import Benchmark
from karenina.integrations.gepa import (
    compute_improvement,
    compute_multi_model_score,
    compute_single_score,
    compute_weighted_score,
    extract_failed_fields,
)
from karenina.schemas import ModelConfig, VerificationConfig

# Load benchmark
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)
print(f"Loaded: {benchmark.name} ({len(benchmark.get_question_ids())} questions)")

---

## Run Verification to Get Results

First, let's run verification to get some real results to score.

In [None]:
# Configure verification with Claude Haiku
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a math expert. Solve the problem and provide the final integer answer (0-999).",
        )
    ],
    parsing_models=[
        ModelConfig(
            id="claude-haiku-parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

# Run on a small subset
question_ids = benchmark.get_question_ids()[:5]
print(f"Running verification on {len(question_ids)} questions...")

results = benchmark.run_verification(config, question_ids=question_ids)
print(f"Got {len(results.results)} results")

---

## compute_single_score(): Scoring One Result

`compute_single_score()` is the core scoring function: it converts a `VerificationResult` into a float score between 0.0 and 1.0.

In [None]:
# Score individual results
for result in results.results:
    # Default weights: 70% template, 30% rubric
    score = compute_single_score(result)

    # Get verification status
    passed = result.template.verify_result if result.template else False
    status = "PASS" if passed else "FAIL"

    print(f"{status} | Score: {score:.2f} | Question: {result.metadata.question_text[:50]}...")

### Understanding the Score Formula

```
score = template_weight * template_score + rubric_weight * rubric_score
```

Where:
- **template_score**: Binary (1.0 if verify_result=True, else 0.0)
- **rubric_score**: Average of normalized rubric trait scores
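
To make the formula concrete, here is a minimal standalone sketch of the same arithmetic. The helper name `weighted_score`, its inputs (a pass/fail flag and a list of already-normalized trait scores), and the choice to treat an empty rubric as 0.0 are assumptions made for illustration; this is not Karenina's implementation. The default weights mirror the 70/30 split mentioned above.

```python
def weighted_score(template_passed, rubric_scores, template_weight=0.7, rubric_weight=0.3):
    """Hypothetical helper: combine a binary template score with averaged rubric scores."""
    # Template score is binary: 1.0 on pass, 0.0 on fail.
    template_score = 1.0 if template_passed else 0.0
    # Rubric score is the mean of trait scores already normalized to [0, 1].
    # Assumption: an empty rubric contributes 0.0.
    rubric_score = sum(rubric_scores) / len(rubric_scores) if rubric_scores else 0.0
    return template_weight * template_score + rubric_weight * rubric_score


# A passing answer with middling quality vs. a failing answer with the same quality.
print(f"{weighted_score(True, [0.5, 1.0]):.3f}")
print(f"{weighted_score(False, [0.5, 1.0]):.3f}")
```

Under these assumptions, a correct answer can never score below `template_weight`, and an incorrect one can never score above `rubric_weight`.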

In [None]:
# Score with different weights
result = results.results[0]

# Template-only (correctness focused)
score_template_only = compute_single_score(result, template_weight=1.0, rubric_weight=0.0)

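# Rubric-only (quality focused). The run above used evaluation_mode="template_only",
# so this value is only informative when rubric traits were actually evaluated.
score_rubric_only = compute_single_score(result, template_weight=0.0, rubric_weight=1.0)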

# Balanced
score_balanced = compute_single_score(result, template_weight=0.5, rubric_weight=0.5)

print(f"Template-only score: {score_template_only:.2f}")
print(f"Rubric-only score: {score_rubric_only:.2f}")
print(f"Balanced score: {score_balanced:.2f}")

---

## compute_weighted_score(): Aggregating Multiple Results

`compute_weighted_score()` aggregates the scores of multiple results, here one per question, into a single value.

In [None]:
# Build results dict (keyed by result index)
results_dict = {str(i): r for i, r in enumerate(results.results)}

# Compute aggregate score
aggregate_score = compute_weighted_score(
    results_dict,
    template_weight=1.0,  # AIME: correctness only
    rubric_weight=0.0,
)

print(f"Aggregate score across {len(results_dict)} questions: {aggregate_score:.2%}")

In [None]:
# Manual calculation to verify
individual_scores = [compute_single_score(r, template_weight=1.0, rubric_weight=0.0) for r in results.results]
manual_avg = sum(individual_scores) / len(individual_scores)

print(f"Individual scores: {individual_scores}")
print(f"Manual average: {manual_avg:.2%}")
print(f"compute_weighted_score: {aggregate_score:.2%}")
print(f"Match: {abs(manual_avg - aggregate_score) < 1e-6}")

---

## compute_multi_model_score(): Multi-Model Scoring

For multi-model benchmarks, compute per-model scores for Pareto optimization.

In [None]:
# Run verification with two models
multi_model_config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a math expert. Give the final integer answer.",
        ),
        ModelConfig(
            id="claude-sonnet",
            model_provider="anthropic",
            model_name="claude-sonnet-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a math expert. Show your work and give the final integer answer.",
        ),
    ],
    parsing_models=[
        ModelConfig(
            id="parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

# Run on the first three questions of the subset
multi_question_ids = question_ids[:3]
print(f"Running multi-model verification on {len(multi_question_ids)} questions...")
multi_results = benchmark.run_verification(multi_model_config, question_ids=multi_question_ids)
print(f"Got {len(multi_results.results)} results")

In [None]:
# Group results by model
from collections import defaultdict

results_by_model = defaultdict(list)
for result in multi_results.results:
    model_name = result.metadata.answering_model or "unknown"
    results_by_model[model_name].append(result)

print("Results per model:")
for model, model_results in results_by_model.items():
    print(f"  {model}: {len(model_results)} results")

In [None]:
# Compute multi-model score
overall_score, model_scores = compute_multi_model_score(
    dict(results_by_model),
    template_weight=1.0,
    rubric_weight=0.0,
)

print(f"Overall score: {overall_score:.2%}")
print("\nPer-model scores:")
for model, score in model_scores.items():
    print(f"  {model}: {score:.2%}")
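
As a rough illustration of what per-model scoring produces, the sketch below averages binary pass/fail outcomes for each model and then averages those per-model scores. Both the helper `per_model_scores` and the plain mean used for the overall score are assumptions made for illustration; `compute_multi_model_score()` may aggregate differently.

```python
def per_model_scores(outcomes_by_model):
    """Hypothetical helper: map model name -> pass rate, plus an overall mean of those rates."""
    scores = {
        model: sum(1.0 for passed in outcomes if passed) / len(outcomes)
        for model, outcomes in outcomes_by_model.items()
        if outcomes  # skip models with no results to avoid dividing by zero
    }
    # Assumption: the overall score is the unweighted mean of per-model scores.
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    return overall, scores


# Made-up outcomes for two models on three questions.
overall, scores = per_model_scores({
    "haiku": [True, False, True],
    "sonnet": [True, True, True],
})
print(f"Overall: {overall:.2%}")
for model, score in scores.items():
    print(f"  {model}: {score:.2%}")
```

Keeping the per-model scores separate, rather than collapsing them immediately, is what lets an optimizer compare candidates on each model individually.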

---

## extract_failed_fields(): Diagnosing Failures

Identify which template fields failed verification.

In [None]:
# Find a failed result
failed_results = [r for r in results.results if r.template and not r.template.verify_result]

if failed_results:
    failed = failed_results[0]

    # Extract failed fields
    failed_fields = extract_failed_fields(failed)

    print(f"Question: {failed.metadata.question_text[:60]}...")
    print(f"Expected answer: {failed.metadata.raw_answer}")
    print(f"Model response: {failed.template.raw_llm_response[:100]}...")
    print(f"\nFailed fields: {failed_fields}")
else:
    print("All results passed! No failures to analyze.")
    print("\nExample of extract_failed_fields() with a passing result:")
    print(f"Failed fields: {extract_failed_fields(results.results[0])}")
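
Conceptually, finding failed fields amounts to comparing what the template expected with what was parsed from the response, field by field. The sketch below shows that idea on plain dictionaries. The helper `diff_fields`, the field names, and the values are all hypothetical; `extract_failed_fields()` works on a `VerificationResult` and its internals are not shown here.

```python
def diff_fields(expected: dict, parsed: dict) -> list[str]:
    """Hypothetical helper: names of expected fields that are missing or differ in `parsed`."""
    return [
        name
        for name, want in expected.items()
        if name not in parsed or parsed[name] != want
    ]


# Made-up example: the parsed answer has the wrong integer but the right parity.
expected = {"final_answer": 204, "is_even": True}
parsed = {"final_answer": 240, "is_even": True}
print(diff_fields(expected, parsed))
```

A list of field names like this is the kind of signal that can be fed back into prompt revisions, since it says what went wrong rather than only that something did.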

---

## compute_improvement(): Measuring Optimization Progress

Compute relative improvement from baseline to optimized score.

In [None]:
# Example: Baseline vs optimized scores
baseline_score = 0.60  # 60% before optimization
optimized_score = 0.75  # 75% after optimization

improvement = compute_improvement(baseline_score, optimized_score)

print(f"Baseline: {baseline_score:.2%}")
print(f"Optimized: {optimized_score:.2%}")
print(f"Relative improvement: {improvement:.2%}")

In [None]:
# Edge cases
print("Edge cases:")

# Improvement from zero baseline
imp_from_zero = compute_improvement(0.0, 0.50)
print(f"  From 0% to 50%: {imp_from_zero:.2%} (returns absolute score when baseline is 0)")

# Negative improvement (worse)
negative_imp = compute_improvement(0.70, 0.60)
print(f"  From 70% to 60%: {negative_imp:.2%} (negative = got worse)")

# No change
no_change = compute_improvement(0.50, 0.50)
print(f"  No change: {no_change:.2%}")
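
The behavior shown in these cells is consistent with the usual relative-change formula plus a special case for a zero baseline. The sketch below is an assumption about the arithmetic, written as a standalone helper named `relative_improvement`; it is not the library's source.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Hypothetical helper: relative change from `baseline` to `optimized`."""
    if baseline == 0:
        # A relative change is undefined here; fall back to the absolute score,
        # matching the zero-baseline edge case described above.
        return optimized
    return (optimized - baseline) / baseline


print(f"{relative_improvement(0.60, 0.75):+.1%}")  # improvement
print(f"{relative_improvement(0.70, 0.60):+.1%}")  # regression
print(f"{relative_improvement(0.00, 0.50):+.1%}")  # zero baseline
```

Because the result is relative to the baseline, the same absolute gain of 15 points reads as a larger improvement from a low baseline than from a high one.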

---

## Practical Scoring Examples

### Example 1: AIME Benchmark (Correctness Only)

In [None]:
# For AIME, we only care about correctness
aime_scores = []
for result in results.results:
    score = compute_single_score(result, template_weight=1.0, rubric_weight=0.0)
    aime_scores.append(score)

accuracy = sum(aime_scores) / len(aime_scores)
print(f"AIME Accuracy: {accuracy:.2%} ({sum(aime_scores):.0f}/{len(aime_scores)} correct)")

### Example 2: Tracking Optimization Progress

In [None]:
# Simulate optimization progress
generation_scores = [
    0.40,  # Gen 0: Seed prompt
    0.45,  # Gen 1
    0.52,  # Gen 2
    0.58,  # Gen 3
    0.55,  # Gen 4 (regression)
    0.62,  # Gen 5
    0.65,  # Gen 6
    0.70,  # Gen 7: Best
]

baseline = generation_scores[0]
best_score = max(generation_scores)
best_gen = generation_scores.index(best_score)

print("Optimization Progress:")
for gen, score in enumerate(generation_scores):
    imp = compute_improvement(baseline, score)
    marker = " <- BEST" if gen == best_gen else ""
    print(f"  Gen {gen}: {score:.2%} (improvement: {imp:+.1%}){marker}")

final_improvement = compute_improvement(baseline, best_score)
print(f"\nTotal improvement: {final_improvement:+.1%} (from {baseline:.2%} to {best_score:.2%})")

### Example 3: Multi-Model Pareto Analysis

In [None]:
# Simulate multi-model scores for different prompts
candidates = {
    "Prompt A": {"haiku": 0.65, "sonnet": 0.70},
    "Prompt B": {"haiku": 0.70, "sonnet": 0.60},  # Better for haiku
    "Prompt C": {"haiku": 0.60, "sonnet": 0.75},  # Better for sonnet
    "Prompt D": {"haiku": 0.68, "sonnet": 0.72},  # Balanced
}

print("Multi-Model Candidate Analysis:")
print("-" * 50)

for prompt, scores in candidates.items():
    avg = sum(scores.values()) / len(scores)
    print(f"{prompt}:")
    for model, score in scores.items():
        print(f"  {model}: {score:.2%}")
    print(f"  Average: {avg:.2%}")
    print()

# Pareto analysis, written out by hand for the four candidates above
print("Pareto Analysis:")
print("  - Prompt A: Not Pareto-optimal (dominated by D on both models)")
print("  - Prompt B: Pareto-optimal (best haiku score)")
print("  - Prompt C: Pareto-optimal (best sonnet score)")
print("  - Prompt D: Pareto-optimal (best average)")
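
The same conclusions can be derived programmatically rather than written out by hand. The sketch below applies the standard dominance definition: a candidate is dominated if another one is at least as good on every model and strictly better on at least one. The helpers `dominates` and `pareto_front` are illustrative names and not part of the Karenina or GEPA API; the candidate scores are the simulated ones from the cell above.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` on every model and strictly better on one."""
    return all(a[m] >= b[m] for m in b) and any(a[m] > b[m] for m in b)


def pareto_front(candidates: dict) -> list[str]:
    """Names of candidates that no other candidate dominates, in input order."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "Prompt A": {"haiku": 0.65, "sonnet": 0.70},
    "Prompt B": {"haiku": 0.70, "sonnet": 0.60},
    "Prompt C": {"haiku": 0.60, "sonnet": 0.75},
    "Prompt D": {"haiku": 0.68, "sonnet": 0.72},
}

front = pareto_front(candidates)
print("Pareto-optimal:", front)
```

Prompt A drops out because Prompt D beats it on both models; B, C, and D each win on at least one axis against every rival, so all three stay on the front.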

---

## Summary

| Function | Purpose | Returns |
|----------|---------|--------|
| `compute_single_score()` | Score one result | float (0.0-1.0) |
| `compute_weighted_score()` | Aggregate multiple results | float (0.0-1.0) |
| `compute_multi_model_score()` | Per-model + overall scores | (float, dict) |
| `compute_improvement()` | Relative improvement | float (fraction) |
| `extract_failed_fields()` | Find failed template fields | list[str] |

## Key Takeaways

1. **Score = weighted combination** of template (correctness) and rubric (quality)
2. **Use `template_weight=1.0` and `rubric_weight=0.0`** for correctness-only benchmarks like AIME
3. **Multi-model scoring** enables Pareto optimization across models
4. **extract_failed_fields()** helps diagnose verification failures
5. **compute_improvement()** tracks optimization progress

## Next Steps

- [05_karenina_adapter.ipynb](05_karenina_adapter.ipynb) - Using the KareninaAdapter