# 08: Evaluating AI Agents

**Duration:** 90 minutes

**What You'll Learn:**
- Why evaluation is critical for AI systems
- Building gold standard test sets
- Key metrics for classification and regression
- Confidence calibration analysis
- Running systematic evaluations
- A/B testing prompts scientifically
- Tracking improvements over time

**The Reality:**
Without evaluation, you're flying blind. You can't tell if your changes improve the system, break it, or do nothing. Professional AI development requires systematic measurement.

---

## Why Evaluate?

### The Problem: Subjective Assessment

When we built our agents, we looked at a few outputs and thought "that looks good!" But:

- **How good?** 85% accurate? 95%? We don't know.
- **Which agent is weakest?** Filter? Rating? Generator?
- **Did that prompt change help?** Feels better, but did it really improve outcomes?
- **Will it break in production?** What about edge cases we haven't seen?

### The Solution: Quantitative Metrics

Professional AI development follows this cycle:

```
1. Measure baseline   → Know where you are
2. Make a change      → Hypothesis: this will improve X
3. Measure again      → Did X actually improve?
4. Keep or revert     → Data-driven decision
```

This is called **evaluation-driven development** and it's how production AI systems are built.

---

### Real Example: The Prompt That Made Things Worse

Imagine we change our filter prompt to be "more friendly":

**Old:** "Analyze this tender. Is it relevant for cybersecurity, AI, or software development?"

**New:** "Hey! Check out this tender. Does it seem like something we'd be interested in? Like, does it involve AI stuff or security things?"

**Subjective assessment:** "The new one feels more approachable!"

**Objective measurement:**
- Old precision: 0.92
- New precision: 0.73  ❌

The friendly prompt introduced ambiguity and cut precision from 0.92 to 0.73, a drop of 19 percentage points (about 21% in relative terms). Without metrics, we would have shipped a worse system.
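
A change between two metric values can be reported in absolute terms (percentage points) or relative terms (percent of the old value), and the two are easy to mix up. The sketch below computes both for the precision figures above; `describe_change` is a hypothetical helper name, not part of the project.

```python
def describe_change(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_points = (new - old) * 100
    relative_percent = (new - old) / old * 100
    return absolute_points, relative_percent


# Precision values from the "friendly prompt" example above
abs_pts, rel_pct = describe_change(old=0.92, new=0.73)
print(f"Absolute: {abs_pts:+.1f} points, relative: {rel_pct:+.1f}%")
```

Stating which of the two you mean avoids confusion when comparing results later.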

---

## Part 1: Understanding Evaluation Metrics

Let's learn the key metrics by working through examples.

### Binary Classification Metrics (Filter Agent)

Our filter agent makes YES/NO decisions. Four outcomes are possible:

```
                    Predicted YES  |  Predicted NO
Actually YES             TP        |       FN
Actually NO              FP        |       TN
```

- **TP (True Positive):** Correctly identified relevant tender
- **FP (False Positive):** Incorrectly said irrelevant tender was relevant
- **TN (True Negative):** Correctly rejected irrelevant tender
- **FN (False Negative):** Missed a relevant tender

---

In [None]:
# Setup
import sys
sys.path.insert(0, '../src')

import asyncio
from procurement_ai.models import Tender
from procurement_ai.agents.filter import FilterAgent
from procurement_ai.services.llm import LLMService
from procurement_ai.config import Config

# Initialize
config = Config()
llm = LLMService(config)
filter_agent = FilterAgent(llm, config)

print("Evaluation framework initialized!")
print(f"Using model: {config.LLM_MODEL}")
print(f"Temperature: {config.TEMPERATURE_PRECISE}")

### Example: Calculating Metrics by Hand

Let's say we test our filter agent on 10 tenders:

| Test Case | Actual | Predicted | Result |
|---|---|---|---|
| AI Security Project | ✅ Relevant | ✅ Relevant | TP |
| Office Furniture | ❌ Irrelevant | ❌ Irrelevant | TN |
| Software Dev | ✅ Relevant | ✅ Relevant | TP |
| Construction | ❌ Irrelevant | ❌ Irrelevant | TN |
| ML Platform | ✅ Relevant | ✅ Relevant | TP |
| Catering | ❌ Irrelevant | ❌ Irrelevant | TN |
| Custom ERP | ✅ Relevant | ❌ Irrelevant | **FN** ⚠️ |
| Vehicle Fleet | ❌ Irrelevant | ❌ Irrelevant | TN |
| Network Hardware | ❌ Irrelevant | ✅ Relevant | **FP** ⚠️ |
| Cybersecurity Audit | ✅ Relevant | ✅ Relevant | TP |

**Counts:** TP=4, FP=1, TN=4, FN=1
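
Rather than counting by eye, the four outcomes can be tallied from (actual, predicted) pairs. This is a minimal sketch using the ten rows of the table above; the `outcome` helper and the `labels` list are illustrative names, not part of `procurement_ai`.

```python
from collections import Counter

# (actual, predicted) for the 10 tenders in the table; True means "relevant"
labels = [
    (True, True),    # AI Security Project
    (False, False),  # Office Furniture
    (True, True),    # Software Dev
    (False, False),  # Construction
    (True, True),    # ML Platform
    (False, False),  # Catering
    (True, False),   # Custom ERP
    (False, False),  # Vehicle Fleet
    (False, True),   # Network Hardware
    (True, True),    # Cybersecurity Audit
]


def outcome(actual: bool, predicted: bool) -> str:
    """Map one (actual, predicted) pair to its confusion-matrix cell."""
    if predicted:
        return "TP" if actual else "FP"
    return "FN" if actual else "TN"


counts = Counter(outcome(actual, predicted) for actual, predicted in labels)
for cell in ("TP", "FP", "TN", "FN"):
    print(f"{cell}: {counts[cell]}")
```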

In [None]:
# Calculate metrics from confusion matrix
TP = 4
FP = 1
TN = 4
FN = 1

# Precision: Of all we said YES to, how many were correct?
precision = TP / (TP + FP)
print(f"Precision: {precision:.2%}")
print("  → When we say 'relevant', we're right 80% of the time")
print()

# Recall: Of all actual YES cases, how many did we find?
recall = TP / (TP + FN)
print(f"Recall: {recall:.2%}")
print("  → We found 80% of all relevant tenders")
print()

# F1: Harmonic mean (balances precision and recall)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1:.2%}")
print("  → Overall filter quality metric")
print()

# Accuracy: Overall correctness
accuracy = (TP + TN) / (TP + FP + TN + FN)
print(f"Accuracy: {accuracy:.2%}")
print("  → We're correct 80% of the time overall")

### Which Metric Matters?

**Depends on your business goal:**

**High Precision (minimize FP):**
- Use case: "Don't waste time on irrelevant tenders"
- Prefer: Miss some opportunities, but every one we bid on is legitimate
- Example: Small team with limited capacity

**High Recall (minimize FN):**
- Use case: "Don't miss any opportunity"
- Prefer: Look at some irrelevant tenders, but catch every real one
- Example: Large team, can afford to review more

**F1 Score:**
- Balanced approach
- Good default metric
- What we'll optimize for
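
If the business goal leans toward precision or recall, one standard way to encode that preference in a single number is the F-beta score, where `beta > 1` weights recall more heavily and `beta < 1` weights precision more heavily. The sketch below is an illustration only; the notebook itself optimizes for F1, and the precision and recall values are made up.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up filter that is precise but misses many relevant tenders
p, r = 0.9, 0.6
f1 = f_beta(p, r, beta=1.0)          # balanced
f2 = f_beta(p, r, beta=2.0)          # recall matters more ("don't miss anything")
f_half = f_beta(p, r, beta=0.5)      # precision matters more ("don't waste time")
print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f_half:.3f}")
```

For this filter the recall-weighted score is lowest, which matches the intuition that a team afraid of missing opportunities would be least happy with it.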

---

## Part 2: Building a Gold Standard Test Set

The foundation of evaluation is a **gold standard test set**: cases where we know the correct answer.

### Characteristics of a Good Test Set

1. **Diverse:** Cover all scenarios (easy, hard, edge cases)
2. **Balanced:** Mix of positive and negative examples
3. **Representative:** Reflects real-world distribution
4. **Documented:** Clear reasoning for each label
5. **Stable:** Don't change labels frequently

### Our Test Set Structure

We've created 18 carefully designed test cases:

- 4 **Clear Relevant:** Obvious matches (build confidence)
- 4 **Clear Irrelevant:** Obvious rejections (test specificity)
- 5 **Edge Cases:** Tricky scenarios (find weaknesses)
- 3 **Rating Validation:** Test scoring accuracy
- 2 **Category Tests:** Challenge category detection
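
A quick audit can confirm that a test set stays balanced and documented as it grows. The sketch below uses a hypothetical `GoldCase` dataclass and made-up toy cases; the project's real test cases live in `tests/fixtures/evaluation_dataset.py` and may be structured differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldCase:
    """Hypothetical stand-in for a labeled test case (not the project's class)."""
    case_id: str
    category: str
    expected_relevant: bool
    notes: str


def audit(cases: list[GoldCase]) -> dict:
    """Summarize size, category mix, label balance, and missing documentation."""
    positives = sum(1 for case in cases if case.expected_relevant)
    return {
        "total": len(cases),
        "by_category": dict(Counter(case.category for case in cases)),
        "positive_share": positives / len(cases),
        "undocumented": [case.case_id for case in cases if not case.notes.strip()],
    }


# Toy cases for illustration only
toy_cases = [
    GoldCase("T1", "clear_relevant", True, "Explicit AI security scope"),
    GoldCase("T2", "clear_irrelevant", False, "Office furniture"),
    GoldCase("T3", "edge_case", False, "Mostly hardware, minor software"),
    GoldCase("T4", "edge_case", True, ""),
]
report = audit(toy_cases)
print(report)
```

An empty `notes` field shows up in `undocumented`, which is a cheap way to enforce the "Documented" characteristic above.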

---

In [None]:
# Load our evaluation dataset
from tests.fixtures.evaluation_dataset import (
    ALL_TEST_CASES,
    DATASET_STATS,
    get_test_cases_by_category,
    TestCaseCategory
)

# Show dataset statistics
print("📊 Evaluation Dataset Statistics")
print("=" * 50)
for key, value in DATASET_STATS.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

print("\n\n📝 Example Test Cases:\n")

# Show one from each category
categories = [
    TestCaseCategory.CLEAR_RELEVANT,
    TestCaseCategory.CLEAR_IRRELEVANT,
    TestCaseCategory.EDGE_CASE
]

for cat in categories:
    cases = get_test_cases_by_category(cat)
    if cases:
        tc = cases[0]
        print(f"\n{cat.value.upper().replace('_', ' ')}")
        print(f"  ID: {tc.tender_id}")
        print(f"  Title: {tc.title}")
        print(f"  Expected: {'Relevant' if tc.expected_relevance else 'Irrelevant'}")
        print(f"  Note: {tc.notes}")

### Example: An Edge Case

Edge cases are where agents fail. Here's a tricky one:

**"Network Infrastructure Upgrade with Management Software"**

- 80% hardware procurement (switches, routers)
- 20% basic software configuration

**Why it's tricky:**
- Keywords like "network" and "infrastructure" might trigger a false positive
- There IS software mentioned, but it's minimal
- Tests whether the agent understands proportion and primary focus

**Expected:** Irrelevant (hardware-dominant project)

**What we learn:** Can our filter agent distinguish between projects that *mention* software vs projects that are *primarily* software?

---

## Part 3: Running Your First Evaluation

Now let's run the evaluation framework on our test set!

In [None]:
# Run complete evaluation
from procurement_ai.evaluation import Evaluator, ConsoleReporter

async def run_evaluation():
    """Run complete evaluation and show results"""
    
    print("🔬 Starting evaluation...")
    print(f"   Testing {len(ALL_TEST_CASES)} cases")
    print("   This will take 2-3 minutes...\n")
    
    # Initialize evaluator
    evaluator = Evaluator(config=config, llm_service=llm)
    
    # Run evaluation
    result = await evaluator.evaluate_dataset(
        test_cases=ALL_TEST_CASES,
        max_concurrent=3  # Process 3 at a time
    )
    
    # Show results
    ConsoleReporter.report(result, detailed=False)
    
    return result

# Run it!
result = await run_evaluation()

### Interpreting the Results

Let's break down what each metric tells us:

#### Filter Agent Metrics

**Precision: 0.XX**
- If this is low (<80%): We're saying YES to too many irrelevant tenders
- Action: Make filter criteria more strict

**Recall: 0.XX**
- If this is low (<80%): We're missing relevant opportunities
- Action: Broaden filter criteria, reduce strictness

**F1 Score: 0.XX**
- Overall filter quality
- Good: F1 > 0.85
- Needs work: F1 < 0.75

**Specificity: 0.XX**
- How well we reject irrelevant tenders
- Important for not wasting time
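
Specificity was not part of the hand calculation earlier, so here is the formula as a minimal sketch: the share of truly irrelevant tenders that the filter rejected, TN / (TN + FP). The function name is illustrative; the counts are the ones from the 10-tender example.

```python
def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of irrelevant tenders that were rejected."""
    if tn + fp == 0:
        raise ValueError("no negative examples; specificity is undefined")
    return tn / (tn + fp)


# Counts from the 10-tender example: TN=4, FP=1
spec = specificity(tn=4, fp=1)
print(f"Specificity: {spec:.2%}")
```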

#### Category Detection

**Accuracy: 0.XX**
- Are we correctly identifying cybersecurity vs AI vs software?
- Important because rating agent uses these categories

#### Rating Agent

**MAE (Mean Absolute Error): X.XX**
- Average error in our scores
- Good: MAE < 1.0 (scores off by less than 1 point on average)
- Needs work: MAE > 2.0

**Correlation: 0.XX**
- Do our scores track with expected scores?
- Good: > 0.70
- Excellent: > 0.85
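
To make the two rating metrics concrete, here is a minimal pure-Python sketch of MAE and Pearson correlation. It assumes "correlation" means Pearson's r, which the project's `metrics.py` may or may not use, and the scores are made up for illustration.

```python
import math


def mean_absolute_error(expected: list[float], predicted: list[float]) -> float:
    """Average absolute difference between expected and predicted scores."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)


def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Pearson's r: covariance normalized by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up expected vs predicted ratings for three tenders
expected_scores = [8, 6, 3]
predicted_scores = [7, 6, 5]
mae = mean_absolute_error(expected_scores, predicted_scores)
corr = pearson_correlation(expected_scores, predicted_scores)
print(f"MAE: {mae:.2f}, correlation: {corr:.2f}")
```

Note that the two metrics answer different questions: a rater can rank tenders in the right order (high correlation) while still being off by a point on average.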

---

## Part 4: Confidence Calibration

**Confidence calibration** measures: "When the model says 90% confident, is it actually right 90% of the time?"

### Why This Matters

Imagine your filter agent says:
- "This tender is relevant (confidence: 0.95)"

If the agent is **well-calibrated**:
- It's right 95% of the time when it says 0.95

If the agent is **overconfident**:
- It's only right 70% of the time when it says 0.95
- This is dangerous! You trust predictions you shouldn't.

If the agent is **underconfident**:
- It's right 99% of the time when it says 0.95
- Wasteful: you're double-checking predictions you can trust
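
One common way to turn this idea into a number is Expected Calibration Error (ECE): group predictions into confidence bins, then take the weighted average gap between each bin's mean confidence and its accuracy. The sketch below assumes equal-width bins and made-up predictions; the project's own implementation may bin differently.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty inputs")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up predictions: four at 0.95 confidence (3 right), two at 0.65 (1 right)
confs = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
hits = [True, True, True, False, True, False]
ece = expected_calibration_error(confs, hits)
print(f"ECE: {ece:.4f}")
```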

---

In [None]:
# Analyze confidence calibration
calibration = result.confidence_calibration

print("📊 Confidence Calibration Analysis")
print("=" * 60)
print(f"\nExpected Calibration Error (ECE): {calibration.expected_calibration_error:.4f}")
print("  → Lower is better. 0 = perfectly calibrated")
print()

# Show calibration curve
print("Calibration Curve:")
print("-" * 60)
print(f"{'Confidence Bin':<20} {'Accuracy':<15} {'Count':<10} {'Error'}")
print("-" * 60)

for bin_data in calibration.get_calibration_curve():
    conf = bin_data['mean_confidence']
    acc = bin_data['accuracy']
    count = bin_data['count']
    error = bin_data['calibration_error']
    
    bar_length = int(acc * 20)
    bar = '█' * bar_length
    
    print(f"{conf:.2f} confidence     {acc:.2%}  {bar:<20} {count:>3}     {error:.3f}")

print("\n💡 Interpretation:")
print("   - Perfect calibration: Confidence = Accuracy for each bin")
print("   - Overconfident: Confidence > Accuracy")
print("   - Underconfident: Confidence < Accuracy")

### Example: Poor Calibration

```
Confidence  | Accuracy | Interpretation
------------|----------|----------------
0.95        | 0.72     | ❌ Overconfident! Says 95% sure but only right 72% of time
0.85        | 0.65     | ❌ Overconfident
0.70        | 0.68     | ✅ Well calibrated
0.60        | 0.95     | ⚠️  Underconfident (but less harmful)
```

**Action:** If your agent is overconfident, you need to either:
1. Adjust confidence thresholds in your workflow
2. Prompt the agent to be more conservative
3. Add post-processing to recalibrate scores
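
For option 3, one simple post-processing technique is histogram binning: learn the observed accuracy in each confidence bin on held-out labeled data, then replace a raw confidence with the accuracy of its bin. This is a sketch under stated assumptions (equal-width bins, made-up data, hypothetical function names), not something the framework provides.

```python
def fit_histogram_binning(
    confidences: list[float], correct: list[bool], n_bins: int = 5
) -> list[float]:
    """Return one calibrated value per bin: the observed accuracy in that bin."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        hits[index] += int(ok)
        counts[index] += 1
    # Empty bins fall back to the bin midpoint (i.e. leave confidence roughly unchanged)
    return [
        hits[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    ]


def recalibrate(confidence: float, table: list[float]) -> float:
    """Look up the calibrated confidence for a raw confidence value."""
    n_bins = len(table)
    return table[min(int(confidence * n_bins), n_bins - 1)]


# Made-up held-out data: the agent reports ~0.9+ but is right 3 times out of 4
table = fit_histogram_binning([0.95, 0.92, 0.97, 0.91], [True, True, False, True])
adjusted = recalibrate(0.93, table)
print(f"Raw 0.93 -> calibrated {adjusted:.2f}")
```

With a test set as small as ours, each bin holds very few cases, so the calibration table should be fit on a separate, larger labeled sample to avoid overfitting.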

---

## Part 5: A/B Testing Prompts

Now we can **scientifically test** if prompt changes improve performance!

### Experiment: Does Adding Examples Help?

Let's test two prompt variations:

**Version A (Current):** Criteria-based
```
"Analyze this tender:
CRITERIA: Relevant if cybersecurity, AI, or software development.
NOT relevant if hardware, construction, or non-technical."
```

**Version B (With Examples):** Few-shot learning
```
"Analyze using these examples:
RELEVANT: 'AI threat detection' → YES (AI + Cybersecurity)
RELEVANT: 'Custom CRM app' → YES (Software Development)
NOT RELEVANT: 'Office furniture' → NO (Non-technical)

Now analyze this tender: ..."
```

**Hypothesis:** Version B will improve accuracy on edge cases.

---

In [None]:
# Let's test on edge cases specifically
from tests.fixtures.evaluation_dataset import TestCaseCategory, get_test_cases_by_category

edge_cases = get_test_cases_by_category(TestCaseCategory.EDGE_CASE)

print(f"🧪 Testing on {len(edge_cases)} edge cases")
print("   These are the tricky, ambiguous scenarios\n")

# NOTE: To actually A/B test, you would:
# 1. Modify filter.py to use Version B prompt
# 2. Run evaluation again
# 3. Compare results

# For now, let's show how to compare results
print("Example Comparison:")
print("=" * 60)
print(f"{'Metric':<25} {'Version A':<15} {'Version B':<15} {'Change'}")
print("-" * 60)

# Hypothetical results
metrics = [
    ("F1 Score", 0.82, 0.87, "+6%"),
    ("Precision", 0.85, 0.88, "+4%"),
    ("Recall", 0.79, 0.86, "+9%"),
    ("Edge Case Accuracy", 0.60, 0.80, "+33%"),
]

for metric, v_a, v_b, change in metrics:
    arrow = "📈" if "+" in change else "📉"
    print(f"{metric:<25} {v_a:.2%}          {v_b:.2%}          {arrow} {change}")

print("\n💡 Result (hypothetical numbers): Version B with examples improves edge case handling.")
print("   If a real run showed the same pattern, we would adopt Version B.")

### The Scientific Method for AI

This is **evaluation-driven development** in action:

```python
# 1. Establish baseline
baseline_result = await evaluator.evaluate_dataset(test_cases)
baseline_f1 = baseline_result.filter_metrics.f1_score

# 2. Make hypothesis
# "Adding examples to the prompt will improve F1 by 5%"

# 3. Implement change
# (Modify filter.py with new prompt)

# 4. Measure again
new_result = await evaluator.evaluate_dataset(test_cases)
new_f1 = new_result.filter_metrics.f1_score

# 5. Decide based on data
improvement = new_f1 - baseline_f1
if improvement > 0.03:  # F1 rose by more than 0.03 (3 percentage points)
    print("✅ Keep the change!")
else:
    print("❌ Revert - not worth the complexity")
```
```
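
With a small test set, a difference of a few points can come down to one or two cases flipping. One way to check whether a difference is more than noise is an exact McNemar test on the paired per-case outcomes, which looks only at cases where the two versions disagree. The sketch below is an illustration with made-up outcomes and a hypothetical function name; it is not part of the evaluation framework.

```python
from math import comb


def exact_mcnemar_p_value(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Two-sided exact McNemar test on paired correctness flags."""
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    disagreements = only_a + only_b
    if disagreements == 0:
        return 1.0  # the two versions behave identically on this test set

    # Under the null hypothesis, each disagreement favors A or B with probability 0.5
    smaller = min(only_a, only_b)
    tail = sum(comb(disagreements, k) for k in range(smaller + 1)) / 2 ** disagreements
    return min(1.0, 2 * tail)


# Made-up per-case results for the same 8 test cases under each prompt version
version_a = [True, True, True, False, False, False, False, True]
version_b = [True, True, False, True, True, True, True, True]
p_value = exact_mcnemar_p_value(version_a, version_b)
print(f"p-value: {p_value:.3f}")
```

Here Version B fixes four cases and breaks one, yet the p-value is well above the usual 0.05 threshold, so more test cases would be needed before treating the improvement as established.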

---

## Part 6: Saving and Tracking Results

Professional practice: **Track evaluation results over time**.

This lets you:
- Detect regressions (did I accidentally make things worse?)
- Show progress to stakeholders
- Compare different model versions
- Understand what changes actually matter

---

In [None]:
# Save this evaluation as a baseline
from procurement_ai.evaluation import JSONReporter, MarkdownReporter
from pathlib import Path
from datetime import datetime

# Create results directory
results_dir = Path("../benchmarks/results")
results_dir.mkdir(parents=True, exist_ok=True)

# Save JSON (for programmatic comparison)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
json_path = results_dir / f"evaluation_{timestamp}.json"
JSONReporter.report(result, output_file=json_path)

# Save Markdown (for documentation)
md_path = results_dir / f"evaluation_{timestamp}.md"
MarkdownReporter.report(
    result,
    output_file=md_path,
    title="Baseline Evaluation - Version 1.0"
)

print("✅ Results saved!")
print(f"   JSON: {json_path}")
print(f"   Markdown: {md_path}")

In [None]:
# Compare two evaluations
from procurement_ai.evaluation import ComparisonReporter

# Example: Compare current result with baseline
# (In practice, you'd load baseline from JSON)

comparison_md = ComparisonReporter.compare(
    baseline=result,
    comparison=result,  # Normally this would be a new result
    baseline_name="Version 1.0",
    comparison_name="Version 1.1 (with examples)"
)

print(comparison_md)
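
Loading a saved baseline and diffing it against a new run could look like the sketch below. It assumes the saved file is a flat JSON object mapping metric names to numbers, which is a simplification and not necessarily the schema `JSONReporter` writes; the helper names and values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics: dict[str, float], path: Path) -> None:
    """Write a flat metric-name -> value mapping as JSON."""
    path.write_text(json.dumps(metrics, indent=2))


def load_metrics(path: Path) -> dict[str, float]:
    """Read a flat metric-name -> value mapping from JSON."""
    return json.loads(path.read_text())


def diff_metrics(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Per-metric change (current minus baseline) for metrics present in both."""
    return {name: current[name] - baseline[name] for name in baseline if name in current}


# Round-trip through a temporary file so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_metrics({"f1_score": 0.82, "precision": 0.85}, baseline_path)
    deltas = diff_metrics(
        load_metrics(baseline_path),
        {"f1_score": 0.87, "precision": 0.88},
    )

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```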

## Part 7: Regression Testing

**Regression testing:** Ensure changes don't break existing functionality.

### The Problem

You add a new feature:
- "Make agent better at detecting AI tenders"

Accidentally:
- Cybersecurity detection drops from 95% to 78% ❌

Without regression testing, you don't notice until production!

### The Solution

Run evaluation after EVERY significant change:

```bash
# Before making changes
python -m procurement_ai.evaluation.run --save-baseline

# After making changes
python -m procurement_ai.evaluation.run --output results/after_change.json

# Compare (point at wherever the baseline and the new run were actually saved)
python -m procurement_ai.evaluation.compare baseline.json results/after_change.json
```

**Gate your deployments:** Don't ship if F1 drops by >2%.
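
A deployment gate can be a few lines that turn the comparison into a pass/fail exit code. The sketch below reads "drops by >2%" as an absolute drop of more than 0.02 in F1, which is an assumption; the function names are hypothetical and not part of the project's CLI.

```python
def passes_gate(baseline_f1: float, new_f1: float, max_drop: float = 0.02) -> bool:
    """True if F1 did not fall by more than max_drop (absolute)."""
    # Small tolerance so a drop of exactly max_drop is not rejected by float rounding
    return (baseline_f1 - new_f1) <= max_drop + 1e-9


def gate_exit_code(baseline_f1: float, new_f1: float) -> int:
    """Return 0 when the change may ship, 1 when it should be blocked."""
    if passes_gate(baseline_f1, new_f1):
        print(f"OK: F1 {baseline_f1:.3f} -> {new_f1:.3f}")
        return 0
    print(f"BLOCKED: F1 dropped from {baseline_f1:.3f} to {new_f1:.3f}")
    return 1


# Made-up values: a small drop passes, a larger one is blocked
small_drop = gate_exit_code(0.88, 0.87)
large_drop = gate_exit_code(0.88, 0.85)
```

In a CI job, the returned code would typically be passed to `sys.exit` so that a blocked change fails the pipeline.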

---

## Part 8: Practical Exercises

### Exercise 1: Find the Weakest Cases

Look at your test results. Which specific test cases failed?

In [None]:
# Find failed test cases
failed_cases = [tr for tr in result.test_results if not tr.is_correct]

print(f"❌ Failed Cases: {len(failed_cases)}\n")

for tc in failed_cases:
    print(f"Test: {tc.test_id}")
    print(f"  Category: {tc.test_category}")
    print(f"  Predicted: {'Relevant' if tc.predicted_relevant else 'Irrelevant'} "
          f"(confidence: {tc.predicted_confidence:.2f})")
    print(f"  Expected: {'Relevant' if tc.expected_relevant else 'Irrelevant'}")
    print(f"  Notes: {tc.notes}")
    print()

# Exercise: Pick one failed case and analyze WHY it failed
# Then propose a fix to the prompt or criteria

### Exercise 2: Test Temperature Impact

Our filter agent uses temperature=0.1 for consistency. What if we try 0.3? Or 0.7?

**Task:** Modify config, re-run evaluation, compare results.

In [None]:
# Test different temperatures
async def test_temperature(temp: float, test_cases):
    """Evaluate at a specific temperature"""
    test_config = Config()
    test_config.TEMPERATURE_PRECISE = temp
    
    test_llm = LLMService(test_config)
    evaluator = Evaluator(config=test_config, llm_service=test_llm)
    
    result = await evaluator.quick_eval(test_cases[:5])  # Quick test on 5 cases
    return result

# Test range
print("🌡️  Temperature Impact Test")
print("=" * 60)

temperatures = [0.0, 0.1, 0.3, 0.5, 0.7]

# Note: In practice, you'd run this. For the notebook, we show the pattern.
print("Testing temperatures:", temperatures)
print("\nHypothesis: Very low temp (0.0) = most consistent")
print("            Higher temp (0.7) = more varied, possibly less accurate")
print("\nRun this experiment and record results!")

# Uncomment to actually run:
# for temp in temperatures:
#     metrics = await test_temperature(temp, edge_cases)
#     print(f"Temp {temp:.1f}: F1={metrics['f1_score']:.2%}, "
#           f"Time={metrics['processing_time']:.1f}s")

### Exercise 3: Category-Specific Analysis

Are we better at detecting cybersecurity than AI tenders? Or vice versa?

**Task:** Group test results by expected category and calculate accuracies.

In [None]:
# Category-specific analysis
from collections import defaultdict

category_performance = defaultdict(lambda: {"correct": 0, "total": 0})

for tc in result.test_results:
    if tc.expected_relevant and tc.expected_categories:
        for cat in tc.expected_categories:
            category_performance[cat]["total"] += 1
            if tc.categories_correct:
                category_performance[cat]["correct"] += 1

print("📊 Category Detection Performance")
print("=" * 50)

for cat, perf in sorted(category_performance.items()):
    if perf["total"] > 0:
        accuracy = perf["correct"] / perf["total"]
        bar = "█" * int(accuracy * 20)
        print(f"{cat:<20} {accuracy:>6.1%}  {bar}")
        print(f"  {perf['correct']}/{perf['total']} correct")
        print()

print("\n💡 Insights:")
print("   - Are certain categories harder to detect?")
print("   - Should we adjust the prompt to emphasize weak categories?")

## Key Takeaways

### What We Learned

1. **Measurement is Essential**
   - Can't improve what you don't measure
   - Subjective assessment ("looks good") is not enough
   - Metrics provide objective truth

2. **Good Test Sets are Hard**
   - Need diverse, balanced, representative cases
   - Edge cases are where systems fail
   - Document your reasoning for each label

3. **Multiple Metrics Tell the Full Story**
   - Precision vs Recall trade-off
   - F1 balances both
   - Calibration ensures confidence is trustworthy

4. **Evaluation Enables Science**
   - A/B test prompts objectively
   - Track changes over time
   - Prevent regressions

5. **Professional Development Cycle**
   ```
   Measure → Change → Measure → Decide
   ```

---

## Next Steps

### Immediate Actions

1. **Run baseline evaluation**
   ```bash
   python -m procurement_ai.evaluation.run --save-baseline
   ```

2. **Analyze weaknesses**
   - Which test cases failed?
   - Which categories are hard to detect?
   - Is the agent over/under confident?

3. **Make targeted improvements**
   - Fix the weakest area first
   - Re-evaluate after each change

4. **Establish regression testing**
   - Run evaluation before every PR
   - Don't ship if metrics drop

### What's Next?

**Next notebook: RAG (Retrieval-Augmented Generation)**

Now that we can measure improvements, we'll add RAG to:
- Improve document quality
- **Prove** it works with our evaluation framework
- Measure the exact improvement (e.g., +22% quality)

The evaluation framework enables everything that comes next!

---

## Appendix: Quick Reference

### Essential Metrics

**Classification (Filter Agent):**
- **Precision:** TP / (TP + FP) - "When we say yes, how often are we right?"
- **Recall:** TP / (TP + FN) - "Of all real yes cases, how many did we catch?"
- **F1:** Harmonic mean of precision and recall
- **Accuracy:** (TP + TN) / Total - "Overall correctness"

**Regression (Rating Agent):**
- **MAE:** Mean Absolute Error - "Average prediction error"
- **Correlation:** How well scores track expected scores

**Calibration:**
- **ECE:** Expected Calibration Error - "Does confidence match accuracy?"

### CLI Commands

```bash
# Run evaluation and show in console
python -m procurement_ai.evaluation.run

# Save as baseline
python -m procurement_ai.evaluation.run --save-baseline

# Save custom output
python -m procurement_ai.evaluation.run --output myresults.json --markdown report.md

# Show detailed results
python -m procurement_ai.evaluation.run --detailed
```

### Common Patterns

**A/B Testing:**
```python
baseline = await evaluator.evaluate_dataset(test_cases)
# Make change to prompt/config
new_result = await evaluator.evaluate_dataset(test_cases)
improvement = new_result.filter_metrics.f1_score - baseline.filter_metrics.f1_score
```

**Quick Iteration:**
```python
metrics = await evaluator.quick_eval(test_cases[:10])
print(f"F1: {metrics['f1_score']:.2%}")
```

---

## Resources

- [Evaluation Driven Development (Google)](https://developers.google.com/machine-learning/testing-debugging/metrics/metrics)
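- [On Calibration of Modern Neural Networks (Guo et al., 2017)](https://arxiv.org/abs/1706.04599) - Background on Expected Calibration Error
- [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) - Rich Sutton's essay on general methods that scale with computation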
- Our evaluation dataset: `tests/fixtures/evaluation_dataset.py`
- Our metrics implementation: `src/procurement_ai/evaluation/metrics.py`

**Remember:** Professional AI development is empirical. Measure everything!