# Week 5: Building a Generic Benchmark Engine
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1. Use the generic benchmark engine from `src/benchmark_engine`
2. Create custom model functions that follow the callable interface
3. Build synthetic datasets for testing
4. Implement and compare different metric functions
5. Analyze benchmark results including timing and throughput

---

## 🧠 The Generic Benchmark Engine

### Why Generic?

| Before (Ad-Hoc) | After (Generic Engine) |
|-----------------|------------------------|
| Different code for each benchmark | One `run_benchmark` function |
| Inconsistent timing measurement | Built-in timing for all |
| Copy-paste evaluation loops | Reusable, tested engine |
| Tightly coupled to model type | Works with any callable |

### Core Components

```python
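# Signature only; the implementation lives in src/benchmark_engine/engine.py
from typing import Any, Callable, Dict, Iterator, Tuple

def run_benchmark(
    model_fn: Callable[[str], str],          # prompt -> model output
    dataset: Iterator[Tuple[str, str]],      # yields (input, reference)
    metric_fn: Callable[[str, str], float],  # (output, reference) -> score
    batch_size: int = 1,
) -> Dict[str, Any]: ...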
```
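
To make the signature concrete, here is a minimal sketch of how an engine with this shape could work. It is not the code in `src/benchmark_engine`; the name `run_benchmark_sketch` is hypothetical, `batch_size` is left out for brevity, and the result keys simply mirror the ones this notebook reads later on.

```python
import time
from typing import Any, Callable, Dict, Iterator, List, Tuple


def run_benchmark_sketch(
    model_fn: Callable[[str], str],
    dataset: Iterator[Tuple[str, str]],
    metric_fn: Callable[[str, str], float],
) -> Dict[str, Any]:
    """Hypothetical stand-in for run_benchmark; not the engine's actual code."""
    per_example: List[Dict[str, Any]] = []
    start = time.perf_counter()
    for prompt, reference in dataset:
        t0 = time.perf_counter()
        output = model_fn(prompt)
        elapsed = time.perf_counter() - t0  # time the model call only
        per_example.append({
            "input": prompt,
            "reference": reference,
            "model_output": output,
            "score": metric_fn(output, reference),
            "inference_time_seconds": elapsed,
        })
    total_time = time.perf_counter() - start
    n = len(per_example)
    return {
        "total_examples": n,
        "mean_score": sum(r["score"] for r in per_example) / n if n else 0.0,
        "total_time_seconds": total_time,
        "examples_per_second": n / total_time if total_time > 0 else 0.0,
        "results": per_example,
    }


# Toy run: the "model" upper-cases its input, the metric is strict equality.
demo = run_benchmark_sketch(
    model_fn=str.upper,
    dataset=iter([("a", "A"), ("b", "X")]),
    metric_fn=lambda out, ref: float(out == ref),
)
print(demo["total_examples"], demo["mean_score"])  # prints: 2 0.5
```

Because the loop only ever calls `model_fn` and `metric_fn`, swapping in a different model or metric needs no change to the loop itself.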

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
import time
from typing import Callable, Iterator, Tuple, Dict, Any, List

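# Make the current directory importable so `from src.benchmark_engine...` works.
# This assumes the notebook runs from the repository root (for example in Colab
# after cloning); adjust the path if your working directory differs.
sys.path.insert(0, '.')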

print("✅ Setup complete!")

---

## 📦 Step 2: Import the Benchmark Engine

The benchmark engine is located in `src/benchmark_engine/engine.py`.

In [None]:
# Import the benchmark engine
from src.benchmark_engine.engine import (
    run_benchmark,
    exact_match_metric,
    contains_metric
)

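print("✅ Benchmark engine imported successfully!")
# Show the first docstring line, tolerating a missing or single-line docstring
doc_lines = (run_benchmark.__doc__ or "").strip().splitlines()
print(f"   - run_benchmark: {doc_lines[0].strip() if doc_lines else '(no docstring)'}")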

---

## 🤖 Step 3: Create a Model Function

A model function must follow this interface:
- **Input:** A string (the prompt)
- **Output:** A string (the model's response)
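
Many real backends take more than a prompt (temperature, length limits, and so on). One way to fit them to this interface is a small closure that fixes the extra arguments up front. The sketch below assumes a hypothetical `fake_generate` backend and a hypothetical `make_model_fn` helper; neither is part of the engine.

```python
from typing import Callable


def make_model_fn(generate: Callable[..., str], **fixed_kwargs) -> Callable[[str], str]:
    """Bind extra arguments so the result has the (str) -> str shape."""
    def model_fn(prompt: str) -> str:
        # Coerce to str so the metric always receives a string.
        return str(generate(prompt, **fixed_kwargs))
    return model_fn


# Hypothetical backend with extra parameters, used only for illustration.
def fake_generate(prompt: str, prefix: str = "", max_chars: int = 100) -> str:
    return (prefix + prompt)[:max_chars]


short_model = make_model_fn(fake_generate, prefix="Echo: ", max_chars=10)
print(short_model("What is 2+2?"))  # prints: Echo: What
```

The wrapped `short_model` can then be passed anywhere a model function is expected.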

Let's create a simple mock model for testing.

In [None]:
def mock_model(prompt: str) -> str:
    """
    A simple mock model that returns predefined answers.
    
    This simulates an LLM by matching keywords in the prompt
    and returning corresponding answers.
    
    Args:
        prompt: The input question or prompt
    
    Returns:
        The model's answer
    """
    # Define a knowledge base of answers
    answers = {
        "capital of france": "Paris",
        "2+2": "4",
        "largest planet": "Jupiter",
        "formula for water": "H2O",
        "speed of light": "299792458 m/s",
    }
    
    # Search for matching keywords
    prompt_lower = prompt.lower()
    for key, answer in answers.items():
        if key in prompt_lower:
            return answer
    
    return "I don't know"


# Test the mock model
print("🧪 Testing mock model:")
print("   Q: What is the capital of France?")
print(f"   A: {mock_model('What is the capital of France?')}")
print("   Q: What is 2+2?")
print(f"   A: {mock_model('What is 2+2?')}")

---

## 📊 Step 4: Create a Synthetic Dataset

A dataset is an iterator of `(input, reference)` tuples:
- **input:** The prompt to send to the model
- **reference:** The expected/correct answer

In [None]:
# Create 5 synthetic QA pairs
synthetic_dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2+2?", "4"),
    ("What is the largest planet in our solar system?", "Jupiter"),
    ("What is the chemical formula for water?", "H2O"),
    ("What is the speed of light?", "299792458 m/s"),
]

print(f"📝 Created dataset with {len(synthetic_dataset)} examples:")
print("=" * 60)
for i, (question, answer) in enumerate(synthetic_dataset, 1):
    print(f"   Q{i}: {question}")
    print(f"   A{i}: {answer}")
    print()
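
A list works well for five examples, but larger datasets are often streamed. The sketch below shows one possible generator that yields `(input, reference)` tuples lazily from JSON Lines text. The field names `question` and `answer` and the function name `qa_pairs_from_jsonl` are assumptions made for this example.

```python
import json
from typing import Iterable, Iterator, Tuple


def qa_pairs_from_jsonl(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Lazily yield (input, reference) tuples from JSON Lines text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "question" and "answer" are assumed field names.
        yield record["question"], record["answer"]


sample_lines = [
    '{"question": "What is the capital of France?", "answer": "Paris"}',
    "",
    '{"question": "What is 2+2?", "answer": "4"}',
]

pairs = qa_pairs_from_jsonl(sample_lines)
first_pass = list(pairs)
second_pass = list(pairs)  # the generator is already exhausted
print(len(first_pass), len(second_pass))  # prints: 2 0
```

The second pass is empty because an iterator can only be consumed once. That is why the later steps call `iter(synthetic_dataset)` again for every benchmark run instead of reusing one iterator.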

---

## 🎯 Step 5: Understand the Metric Functions

The engine comes with two built-in metrics:

1. **exact_match_metric:** Returns 1.0 if output exactly matches reference (case-insensitive)
2. **contains_metric:** Returns 1.0 if reference is found anywhere in output

In [None]:
# Test the metric functions
print("📐 Testing Metric Functions:")
print("=" * 60)

test_cases = [
    ("Paris", "Paris"),           # Exact match
    ("paris", "Paris"),           # Case insensitive
    ("  Paris  ", "Paris"),       # Whitespace handling
    ("The answer is Paris", "Paris"),  # Contains
    ("London", "Paris"),          # No match
]

print(f"{'Output':<25} {'Reference':<15} {'Exact':<8} {'Contains':<8}")
print("-" * 60)
for output, reference in test_cases:
    exact = exact_match_metric(output, reference)
    contains = contains_metric(output, reference)
    print(f"{output:<25} {reference:<15} {exact:<8.1f} {contains:<8.1f}")
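
If you are curious what such metrics might look like inside, here is a rough sketch. It assumes both metrics normalize by stripping surrounding whitespace and lower-casing, which is a guess based on the test cases above and not a copy of the engine's code. The `_sketch` names are hypothetical.

```python
def exact_match_sketch(output: str, reference: str) -> float:
    """1.0 when the normalized strings are equal, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())


def contains_sketch(output: str, reference: str) -> float:
    """1.0 when the normalized reference occurs inside the normalized output."""
    return float(reference.strip().lower() in output.strip().lower())


cases = [("  Paris  ", "Paris"), ("The answer is Paris", "Paris"), ("London", "Paris")]
for out, ref in cases:
    print(repr(out), exact_match_sketch(out, ref), contains_sketch(out, ref))
```

Note that a substring check like this one is lenient: `contains_sketch("The answer is 14", "4")` also returns 1.0, so short references can match by accident.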

---

## 🚀 Step 6: Run the Benchmark!

Now let's put it all together and run the benchmark.

In [None]:
# Run the benchmark
print("\n" + "=" * 60)
print("🚀 Running Benchmark on 5 Synthetic QA Pairs")
print("=" * 60)

results = run_benchmark(
    model_fn=mock_model,
    dataset=iter(synthetic_dataset),  # Convert list to iterator
    metric_fn=exact_match_metric,
    batch_size=1
)

print("\n✅ Benchmark complete!")

---

## 📊 Step 7: Analyze the Results

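The engine returns a dictionary with aggregate statistics (`total_examples`, `mean_score`, `total_time_seconds`, `examples_per_second`) and a per-example list under `results`.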

In [None]:
# Display summary statistics
print("\n📊 Results Summary:")
print("=" * 60)
print(f"   Total examples:        {results['total_examples']}")
print(f"   Mean score (accuracy): {results['mean_score']:.2%}")
print(f"   Total time:            {results['total_time_seconds']:.4f} seconds")
print(f"   Throughput:            {results['examples_per_second']:.2f} examples/second")

In [None]:
# Display detailed per-example results
print("\n📋 Detailed Results:")
print("=" * 60)

for i, result in enumerate(results['results'], 1):
    status = "✓" if result['score'] == 1.0 else "✗"
    print(f"\n[{status}] Example {i}:")
    print(f"    Input:    {result['input']}")
    print(f"    Expected: {result['reference']}")
    print(f"    Got:      {result['model_output']}")
    print(f"    Score:    {result['score']:.2f}")
    print(f"    Time:     {result['inference_time_seconds']*1000:.2f} ms")
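
Beyond printing each row, it can help to condense the per-example list into a few numbers. The sketch below pulls out the failing inputs and some latency statistics. `summarize_results` is a hypothetical helper, and `fake_results` is a hand-written stand-in that only reuses the keys shown above.

```python
import statistics
from typing import Any, Dict, List


def summarize_results(results: Dict[str, Any]) -> Dict[str, Any]:
    """Collect failing inputs and latency statistics in milliseconds."""
    rows: List[Dict[str, Any]] = results["results"]
    times_ms = [r["inference_time_seconds"] * 1000 for r in rows]
    failures = [r for r in rows if r["score"] < 1.0]
    return {
        "failed_inputs": [r["input"] for r in failures],
        "mean_ms": statistics.mean(times_ms) if times_ms else 0.0,
        "median_ms": statistics.median(times_ms) if times_ms else 0.0,
        "max_ms": max(times_ms, default=0.0),
    }


# Hand-written stand-in for the dictionary returned by run_benchmark.
fake_results = {
    "results": [
        {"input": "Q1", "score": 1.0, "inference_time_seconds": 0.002},
        {"input": "Q2", "score": 0.0, "inference_time_seconds": 0.004},
        {"input": "Q3", "score": 1.0, "inference_time_seconds": 0.009},
    ]
}
summary = summarize_results(fake_results)
print(summary["failed_inputs"], round(summary["median_ms"], 3))
```

The median is often more informative than the mean here, since a single slow call (a cold start, for instance) can dominate the average on a small dataset.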

---

## 🧪 Step 8: Try Different Metrics

Let's see how results differ with the `contains_metric`.

In [None]:
# Create a "verbose" model that adds extra text
def verbose_model(prompt: str) -> str:
    """A model that gives verbose answers."""
    answers = {
        "capital of france": "The capital of France is Paris, a beautiful city.",
        "2+2": "The answer to 2+2 is 4, of course!",
        "largest planet": "Jupiter is the largest planet in our solar system.",
        "formula for water": "The chemical formula for water is H2O.",
        "speed of light": "The speed of light is approximately 299792458 m/s.",
    }
    prompt_lower = prompt.lower()
    for key, answer in answers.items():
        if key in prompt_lower:
            return answer
    return "I don't know the answer to that question."


print("🔍 Comparing metrics with verbose model:")
print("=" * 60)

# Run with exact_match
exact_results = run_benchmark(
    model_fn=verbose_model,
    dataset=iter(synthetic_dataset),
    metric_fn=exact_match_metric
)

# Run with contains_metric
contains_results = run_benchmark(
    model_fn=verbose_model,
    dataset=iter(synthetic_dataset),
    metric_fn=contains_metric
)

print("\n📊 Comparison:")
print(f"   Exact Match Accuracy:    {exact_results['mean_score']:.2%}")
print(f"   Contains Match Accuracy: {contains_results['mean_score']:.2%}")
print("\n💡 Insight: The verbose model fails exact match but passes contains!")
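
The comparison above calls the model once per metric. With a slow or paid model you may prefer to generate each answer once and score it under several metrics. The sketch below does that without the engine, so it records no timing; `score_with_metrics` and the toy model and metrics are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Metric = Callable[[str, str], float]


def score_with_metrics(
    model_fn: Callable[[str], str],
    pairs: List[Tuple[str, str]],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Call the model once per example, then average each metric."""
    outputs = [(model_fn(prompt), reference) for prompt, reference in pairs]
    if not outputs:
        return {name: 0.0 for name in metrics}
    return {
        name: sum(metric(out, ref) for out, ref in outputs) / len(outputs)
        for name, metric in metrics.items()
    }


# Toy stand-ins so the example runs on its own.
toy_pairs = [("capital of France?", "Paris"), ("2+2?", "4")]
toy_model = lambda prompt: "It is Paris." if "France" in prompt else "4"
toy_metrics = {
    "exact": lambda out, ref: float(out.strip().lower() == ref.strip().lower()),
    "contains": lambda out, ref: float(ref.lower() in out.lower()),
}
print(score_with_metrics(toy_model, toy_pairs, toy_metrics))
# prints: {'exact': 0.5, 'contains': 1.0}
```

The trade-off is that you lose the engine's built-in timing, so this pattern fits best when you only care about scores.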

---

## 📝 Step 9: Create Your Own Custom Metric

Let's implement a **partial match metric** that gives partial credit.

In [None]:
def partial_match_metric(output: str, reference: str) -> float:
    """
    Compute partial match score using word overlap.
    
    Returns the proportion of reference words found in output.
    This gives partial credit for partially correct answers.
    
    Args:
        output: Model generated text
        reference: Expected/correct answer
    
    Returns:
        Score between 0.0 and 1.0
    """
    output_words = set(output.strip().lower().split())
    reference_words = set(reference.strip().lower().split())
    
    if not reference_words:
        return 1.0  # Empty reference is always matched
    
    overlap = output_words & reference_words
    return len(overlap) / len(reference_words)


# Test the partial match metric
print("🧪 Testing Partial Match Metric:")
print("=" * 60)

test_cases = [
    ("Paris", "Paris"),                    # Perfect match (1/1)
    ("The city is Paris", "Paris"),       # Reference word present (1/1)
    ("New York City", "New York"),        # Both reference words present (2/2)
    ("I think New is the answer", "New York"),  # Partial credit (1/2)
    ("London", "Paris"),                  # No match (0/1)
]

print(f"{'Output':<30} {'Reference':<15} {'Score':<8}")
print("-" * 60)
for output, reference in test_cases:
    score = partial_match_metric(output, reference)
    print(f"{output:<30} {reference:<15} {score:<8.2f}")

In [None]:
# Run benchmark with custom metric
print("\n🚀 Running Benchmark with Partial Match Metric:")
print("=" * 60)

partial_results = run_benchmark(
    model_fn=verbose_model,
    dataset=iter(synthetic_dataset),
    metric_fn=partial_match_metric
)

print("\n📊 Results with Partial Match:")
print(f"   Mean score: {partial_results['mean_score']:.2%}")
print("\n📋 Per-example scores:")
for i, result in enumerate(partial_results['results'], 1):
    print(f"   Q{i}: {result['score']:.2f} - {result['input'][:40]}...")

---

## 🔧 Step 10: Integration with Real Models

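The benchmark engine works with any model that you can wrap in a `(str) -> str` callable. The sketches below outline three common integrations. In each one, `dataset` stands for any iterator of `(input, reference)` tuples, and the model names and file paths are placeholders.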

### ONNX Runtime

```python
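import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def onnx_model(prompt: str, max_new_tokens: int = 20) -> str:
    # Assumes the exported graph takes only "input_ids" and returns logits of
    # shape (batch, sequence, vocab) as its first output; check your export.
    input_ids = tokenizer(prompt, return_tensors="np")["input_ids"].astype(np.int64)
    prompt_len = input_ids.shape[1]
    for _ in range(max_new_tokens):
        logits = session.run(None, {"input_ids": input_ids})[0]
        next_id = logits[0, -1].argmax()  # greedy choice of the next token
        input_ids = np.concatenate([input_ids, [[next_id]]], axis=1)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(input_ids[0, prompt_len:])

results = run_benchmark(onnx_model, dataset, exact_match_metric)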
```

### Hugging Face Transformers

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

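def hf_model(prompt: str) -> str:
    # return_full_text=False drops the echoed prompt, so the metric only
    # sees newly generated text; max_new_tokens bounds the completion length.
    result = generator(prompt, max_new_tokens=50, return_full_text=False)
    return result[0]["generated_text"]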

results = run_benchmark(hf_model, dataset, contains_metric)
```

### OpenAI API

```python
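from openai import OpenAI  # requires openai>=1.0

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def openai_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""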

results = run_benchmark(openai_model, dataset, exact_match_metric)
```
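
Remote backends can fail transiently, and one exception in the middle of a run would otherwise cost you the whole benchmark. Because a model function is just a callable, you can wrap it. The sketch below is one possible retry wrapper; `with_retries` and `flaky_model` are hypothetical names, and the retry policy (fixed delay, catch every `Exception`) is deliberately simple.

```python
import time
from typing import Callable


def with_retries(
    model_fn: Callable[[str], str],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Callable[[str], str]:
    """Return a model function that retries failed calls a few times."""
    def wrapped(prompt: str) -> str:
        last_error = None
        for attempt in range(attempts):
            try:
                return model_fn(prompt)
            except Exception as error:  # broad on purpose for this sketch
                last_error = error
                if attempt < attempts - 1:
                    time.sleep(delay_seconds)
        raise RuntimeError(f"model call failed after {attempts} attempts") from last_error
    return wrapped


# Simulated backend that fails twice before succeeding.
calls = {"count": 0}

def flaky_model(prompt: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "Paris"


robust_model = with_retries(flaky_model, attempts=3)
print(robust_model("What is the capital of France?"), calls["count"])  # prints: Paris 3
```

Keep in mind that any time spent retrying happens inside the model call, so it would presumably show up in the per-example inference time.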

---

## 📈 Step 11: Visualize Results

In [None]:
# Create a simple text-based visualization
print("\n📊 Score Distribution:")
print("=" * 60)

for i, result in enumerate(results['results'], 1):
    score = result['score']
    bar = "█" * int(score * 20) + "░" * (20 - int(score * 20))
    status = "✓" if score == 1.0 else "✗"
    print(f"Q{i} [{status}] {bar} {score:.0%}")

print("\n📊 Inference Time Distribution:")
print("=" * 60)

max_time = max(r['inference_time_seconds'] for r in results['results'])
for i, result in enumerate(results['results'], 1):
    time_ms = result['inference_time_seconds'] * 1000
    ratio = result['inference_time_seconds'] / max_time if max_time > 0 else 0
    bar = "█" * int(ratio * 20) + "░" * (20 - int(ratio * 20))
    print(f"Q{i}     {bar} {time_ms:.2f}ms")
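
Both loops build their bars with the same expression, so you could factor it into a small helper. The sketch below is one way to do that; `render_bar` is a hypothetical name, and the clamping is an added safeguard for custom metrics that might return values outside 0.0 to 1.0.

```python
def render_bar(fraction: float, width: int = 20) -> str:
    """Render a fraction between 0 and 1 as a fixed-width text bar."""
    clamped = min(max(fraction, 0.0), 1.0)  # keep the bar within its width
    filled = int(clamped * width)
    return "█" * filled + "░" * (width - filled)


print(render_bar(0.5, width=10))  # prints: █████░░░░░
```

Without the clamp, a score above 1.0 would produce a bar longer than `width` and misalign the output.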

---

## 🎓 Mini-Project: Build Your Own Benchmark Suite

### Task

Create a benchmark suite with:
1. At least 10 QA pairs
2. A custom metric function
3. Compare results across different metrics

### Template

In [None]:
# Your custom benchmark suite
my_benchmark = [
    # Add your QA pairs here
    ("What is the capital of Germany?", "Berlin"),
    ("What is 10 * 5?", "50"),
    ("What color is the sky?", "blue"),
    # ... add more
]

# Your custom model (extend the mock_model)
def my_model(prompt: str) -> str:
    # Implement your model here
    pass

# Your custom metric
def my_metric(output: str, reference: str) -> float:
    # Implement your metric here
    pass

# Run and analyze
# results = run_benchmark(my_model, iter(my_benchmark), my_metric)
# print(f"My Benchmark Score: {results['mean_score']:.2%}")

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 6, ensure you can check all boxes:

- [ ] I can import and use `run_benchmark` from the engine module
- [ ] I understand the callable interface for model functions
- [ ] I can create synthetic datasets as iterators of tuples
- [ ] I can use `exact_match_metric` and `contains_metric`
- [ ] I can implement custom metric functions
- [ ] I can interpret the results dictionary
- [ ] I understand when to use different metrics

---

**Week 5 Complete!** 🎉

**Next:** *Week 6: Automated Evaluation Pipelines (LLM-as-Judge)*