# Lesson 6: Evaluation - Finding and Fixing Bugs with NAT Eval

In previous lessons, you built, traced, and integrated agents. But how do you know if they actually work? Manual testing catches obvious errors, but subtle bugs hide in edge cases. A tool might work 80% of the time and fail silently the other 20%.

In this lesson, you'll create evaluation datasets with ground truth answers, run systematic tests, discover a hidden bug in your climate agent, fix it, and verify the improvement. This transforms agent development from manual testing into a data-driven engineering process.
<div style="background-color: #e7f3fe; border-left: 6px solid #2196F3; padding: 15px; margin: 10px 0;">
<h4 style="margin-top: 0;">🎯 Learning Objectives</h4>
By the end of this lesson, you'll know how to:
<ul>
<li>Create evaluation datasets with ground truth answers</li>
<li>Run systematic tests to discover unexpected agent behaviors</li>
<li>Use evaluation results to identify and fix bugs</li>
<li>Verify improvements with before/after comparisons</li>
</ul>
</div>

## Setup

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

# Verify it loaded
print("API key set:", "Yes" if os.getenv('NVIDIA_API_KEY') else "No")

API key set: Yes


In [2]:
%%capture
# Install the climate analyzer package
!cd climate_analyzer && pip install -e . && cd ..

## Evaluation Dataset
An evaluation dataset consists of questions paired with ground truth answers. The agent's responses are compared against these known-correct answers to calculate accuracy.

In [None]:
# %load climate_analyzer/data/simple_eval.json
[
  {
    "id": "austria_1980",
    "question": "What was the average temperature in Austria in 1980? Please provide the numerical value.",
    "answer": "The average temperature in Austria in 1980 was 6.80\u00b0C"
  }
]


<div style="background-color: #f3e5f5; border-left: 6px solid #9c27b0; padding: 15px; margin: 15px 0;">
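<h4 style="margin-top: 0;">💡 Evaluation Dataset Structure</h4>
Each test case pairs a question with its ground truth. In <code>simple_eval.json</code> the keys are <code>id</code>, <code>question</code>, and <code>answer</code>; the evaluation output reports the same fields under these names:
<ul>
<li><strong>user_input</strong> - The question to ask the agent (from <code>question</code>)</li>
<li><strong>reference</strong> - The ground truth answer, what the agent should return (from <code>answer</code>)</li>
<li><strong>metadata</strong> - Additional context (like expected tool calls)</li>
</ul>
<br>
<strong>Example (as it appears in the evaluation output):</strong>
<pre style="background-color: #f5f5f5; padding: 10px; border-radius: 3px; margin: 10px 0;">
{
  "user_input": "What was Austria's average temperature in 1980?",
  "reference": "6.80°C"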
}
</pre>
</div>
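Before running an evaluation, it can help to check that every test case has the fields the dataset relies on. The following is a minimal sketch, assuming the `id` / `question` / `answer` layout of `simple_eval.json` shown above; `validate_dataset` and `REQUIRED_KEYS` are hypothetical helper names and not part of NAT.

```python
import json

# Keys used by simple_eval.json in this lesson (assumed to be required).
REQUIRED_KEYS = {"id", "question", "answer"}

def validate_dataset(cases):
    """Return a list of human-readable problems found in the test cases."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
            continue
        if case["id"] in seen_ids:
            problems.append(f"case {index}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        if not str(case["answer"]).strip():
            problems.append(f"case {index}: empty answer")
    return problems

# Same content as the dataset file above, inlined so the sketch is self-contained.
sample = json.loads("""
[
  {
    "id": "austria_1980",
    "question": "What was the average temperature in Austria in 1980? Please provide the numerical value.",
    "answer": "The average temperature in Austria in 1980 was 6.80\\u00b0C"
  }
]
""")

print(validate_dataset(sample))  # an empty list means no problems were found
```

In practice you would pass the result of `json.load` on the dataset file instead of the inlined sample.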

### Verify Ground Truth
Before evaluating your agent, verify your ground truth answers are actually correct. Otherwise, you're testing against wrong answers.

In [4]:
!grep "^[^,]*,1980,[^,]*,Austria" ../resources/climate_data/temperature_annual.csv

AU000005010,1980,AU,Austria,48.05,14.1331,KREMSMUENSTER,7.994166666666666
AUXLT782426,1980,AU,Austria,47.0,15.4333,GRAZ_THALERHOF,7.631666666666667
AUXLT891651,1980,AU,Austria,47.383,13.456,RADSTADT,4.766666666666667


<div style="background-color: #f5f5f5; border: 1px solid #ddd; padding: 15px; border-radius: 5px; margin: 15px 0; font-family: monospace;">
<strong>Raw Data for Austria 1980 (one annual mean per weather station, in °C):</strong>
<pre style="margin: 10px 0; white-space: pre-wrap;">
Austria,1980,KREMSMUENSTER,7.994166666666666
Austria,1980,GRAZ_THALERHOF,7.631666666666667
Austria,1980,RADSTADT,4.766666666666667
</pre>
</div>

In [5]:
# Confirm the ground truth: unweighted mean of the three station annual means
temps = [7.994166666666666, 7.631666666666667, 4.766666666666667]
average = sum(temps) / len(temps)
print(f"\nAverage temperature for Austria in 1980: {average:.2f}°C")


Average temperature for Austria in 1980: 6.80°C
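The grep-and-average check can also be done in one step, which avoids copying numbers by hand. This is a sketch under assumptions: the column positions (year in column 2, country name in column 4, annual mean in the last column) are inferred from the grep output above, and `country_year_mean` is a hypothetical helper. The three rows are inlined so the example runs without the CSV file.

```python
import csv
import io
from statistics import mean

# Rows copied from the grep output above; column meanings are inferred, not documented.
ROWS = """\
AU000005010,1980,AU,Austria,48.05,14.1331,KREMSMUENSTER,7.994166666666666
AUXLT782426,1980,AU,Austria,47.0,15.4333,GRAZ_THALERHOF,7.631666666666667
AUXLT891651,1980,AU,Austria,47.383,13.456,RADSTADT,4.766666666666667
"""

def country_year_mean(csv_text, country, year):
    """Unweighted mean of the station annual means for one country and year."""
    temps = [
        float(row[7])
        for row in csv.reader(io.StringIO(csv_text))
        if row[1] == str(year) and row[3] == country
    ]
    if not temps:
        raise ValueError(f"no rows for {country} in {year}")
    return mean(temps)

ground_truth = country_year_mean(ROWS, "Austria", 1980)
print(f"{ground_truth:.2f}")
```

Raising on an empty selection is a deliberate choice here: a ground truth computed from zero rows should stop the workflow rather than produce a silent default.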


Add an eval section to your NAT config to define your test dataset and metrics:
<div style="background-color: #fff3cd; border-left: 6px solid #ffc107; padding: 15px; margin: 15px 0;">
<h4 style="margin-top: 0;">📋 Evaluation Configuration</h4>
<pre style="background-color: #f5f5f5; padding: 10px; border-radius: 3px; margin: 10px 0;">
eval:
  general:
    output:
      dir: ./.tmp/nat/climate_analyzer/eval/simple_test/  # Where to save results
      cleanup: false                                       # Keep results for inspection
    dataset:
      _type: json
      file_path: data/simple_eval.json                     # Test questions + answers
  evaluators:
    answer_accuracy:                                       # Name for this evaluator
      _type: ragas
      metric: AnswerAccuracy                               # Metric: compare answers
      llm_name: climate_llm                                # LLM that judges accuracy
</pre>
</div>

In [None]:
# %load climate_analyzer/src/climate_analyzer/configs/eval_config.yml
llms:
  climate_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    base_url: $NVIDIA_BASE_URL
    api_key: $NVIDIA_API_KEY
    temperature: 0.7
    top_p: 0.95
    max_tokens: 2048

  calculator_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    base_url: $NVIDIA_BASE_URL
    api_key: $NVIDIA_API_KEY
    temperature: 0.0
    max_tokens: 1024

functions:
  list_countries:
    _type: climate_analyzer/list_countries
    description: "List all available countries in the dataset"

  calculate_statistics:
    _type: climate_analyzer/calculate_statistics
    description: "Calculate temperature statistics globally or for a specific country"

  filter_by_country:
    _type: climate_analyzer/filter_by_country
    description: "Get information about climate data for a specific country"

  find_extreme_years:
    _type: climate_analyzer/find_extreme_years
    description: "Find the warmest or coldest years in the dataset"

  create_visualization:
    _type: climate_analyzer/create_visualization
    description: "Create visualizations including automatic top 5 countries by warming trend (country_comparison plot)"

  station_statistics:
    _type: climate_analyzer/station_statistics
    description: "Get statistics on climate stations used in the data"

  calculator_agent:
    _type: climate_analyzer/calculator_agent
    description: "Perform complex mathematical calculations for climate data analysis"

workflow:
  _type: react_agent
  tool_names:
    - list_countries
    - calculate_statistics
    - filter_by_country
    - find_extreme_years
    - create_visualization
    - station_statistics
    - calculator_agent
  llm_name: climate_llm
  max_iterations: 5
  parse_agent_response_max_retries: 2
  max_tool_calls: 30

# Evaluation configuration
eval:
  general:
    output:
      dir: ./.tmp/nat/climate_analyzer/eval/simple_test/
      cleanup: false  # Keep results for inspection
    dataset:
      _type: json
      file_path: data/simple_eval.json

  evaluators:
    # Check if the answer is accurate
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: climate_llm


## Run Evaluation

<div style="background-color: #f3e5f5; border-left: 6px solid #9c27b0; padding: 15px; margin: 15px 0;">
<h4 style="margin-top: 0;">💡 How Answer Accuracy Works</h4>
<ol>
<li>Your agent processes each test question</li>
<li>NAT captures the agent's response</li>
<li>An LLM judge compares the response to the reference answer</li>
<li>The judge assigns a score (0.0 = wrong, 1.0 = correct)</li>
<li>NAT calculates average score across all test cases</li>
</ol>
<br>
<strong>Why use an LLM judge?</strong> Exact string matching is too brittle. "6.80°C" and "6.8 degrees Celsius" are the same answer but different strings. An LLM can judge semantic equivalence.
</div>
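To see why exact string matching is brittle, compare it with a simple numeric check. The sketch below is not how the Ragas `AnswerAccuracy` metric works; it is a cheap, deterministic cross-check you could run alongside the LLM judge for questions with a numeric answer. `last_number` and `numeric_match` are hypothetical helpers, and taking the last number in each string is an assumption that fits answers phrased like the ones in this lesson.

```python
import math
import re

# Matches unsigned integers and decimals such as "1980" or "6.80" (signs are ignored).
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def last_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    return float(matches[-1]) if matches else None

def numeric_match(response, reference, tolerance=0.01):
    """True when the last number in each string agrees within the tolerance."""
    got = last_number(response)
    want = last_number(reference)
    if got is None or want is None:
        return False
    return math.isclose(got, want, abs_tol=tolerance)

# Exact matching rejects an equivalent answer; the numeric check accepts it.
print("6.80°C" == "6.8 degrees Celsius")
print(numeric_match("6.8 degrees Celsius", "6.80°C"))
print(numeric_match("8.08°C", "6.80°C"))
```

A check like this cannot judge free-text answers, which is where the LLM judge remains necessary.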


In [7]:
!cd climate_analyzer && nat eval --config_file src/climate_analyzer/configs/eval_config.yml

2025-12-17 20:11:41 - INFO     - nat.eval.evaluate:448 - Starting evaluation run with config file: src/climate_analyzer/configs/eval_config.yml
Running workflow: 100%|███████████████████████████| 1/1 [00:20<00:00, 20.56s/it]
Evaluating Ragas nv_accuracy: 100%|███████████████| 1/1 [00:01<00:00,  1.70s/it]
2025-12-17 20:12:06 - INFO     - nat.eval.evaluate:252 - Profiler is not enabled. Skipping profiling.
2025-12-17 20:12:06 - INFO     - nat.eval.evaluate:337 - Workflow output written to .tmp/nat/climate_analyzer/eval/simple_test/workflow_output.json
2025-12-17 20:12:06 - INFO     - nat.eval.evaluate:348 - Evaluation results written to .tmp/nat/climate_analyzer/eval/simple_test/answer_accuracy_output.json
2025-12-17 20:12:06 - INFO     - nat.eval.utils.output_uploader:61 - No S3 config provided; skipping upload.

<div style="background-color: #e7f3fe; border-left: 6px solid #2196F3; padding: 15px; margin: 15px 0;">
<h4 style="margin-top: 0;">🔄 What's Happening</h4>
<ol>
<li>NAT loads your test dataset</li>
<li>For each test case, it runs your agent with the question</li>
<li>Captures the agent's reasoning steps and final answer</li>
<li>Sends both the agent's answer and reference answer to the LLM judge</li>
<li>Collects scores and saves detailed results to JSON files</li>
</ol>
</div>
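The steps above can be sketched as a small loop. This is an illustration of the idea only, not NAT's implementation: `run_eval`, the two stub agents, and `contains_reference_judge` are hypothetical stand-ins, and the result dictionary merely borrows the `average_score` and `eval_output_items` names seen in the output file.

```python
from statistics import mean

def run_eval(cases, agent, judge):
    """Run each case through the agent, score it with the judge, and average."""
    items = []
    for case in cases:
        response = agent(case["question"])
        score = judge(response, case["answer"])
        items.append({"id": case["id"], "response": response, "score": score})
    scores = [item["score"] for item in items]
    return {"average_score": mean(scores), "eval_output_items": items}

# Stand-in judge: 1.0 when the reference text appears in the response, else 0.0.
def contains_reference_judge(response, reference):
    return 1.0 if reference in response else 0.0

# Stand-in agents that mimic the responses seen in this lesson.
def broken_agent(question):
    return "The average temperature in Austria in 1980 cannot be determined."

def fixed_agent(question):
    return "The average temperature in Austria in 1980 was 6.80°C."

cases = [{
    "id": "austria_1980",
    "question": "What was the average temperature in Austria in 1980?",
    "answer": "6.80°C",
}]

print(run_eval(cases, broken_agent, contains_reference_judge)["average_score"])
print(run_eval(cases, fixed_agent, contains_reference_judge)["average_score"])
```

Keeping the agent and the judge as plain callables makes it easy to swap either one without touching the loop.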

## Check Results

<div style="background-color: #f5f5f5; border: 1px solid #ddd; padding: 15px; border-radius: 5px; margin: 15px 0; font-family: monospace;">
<strong>Generated Files:</strong>
<pre style="margin: 10px 0; white-space: pre-wrap;">
answer_accuracy_output.json  ← Detailed results with scores
eval_summary.json           ← High-level metrics
</pre>
</div>

### Score Summary
Now you can open the results files and see how your agent performed: 

In [8]:
import json

with open('climate_analyzer/.tmp/nat/climate_analyzer/eval/simple_test/answer_accuracy_output.json', 'r') as f:
    answer_accuracy_data = json.load(f)

In [9]:
print("📊 Evaluation Results")
print("=" * 50)
print(f"Average Score: {answer_accuracy_data['average_score']} / 1.0")
print()

for item in answer_accuracy_data['eval_output_items']:
    r = item['reasoning']
    print(f"❓ {r['user_input']}")
    print(f"✅ Expected: {r['reference']}")
    print(f"❌ Got: {r['response']}")
    print(f"📈 Score: {item['score']}")
    print()

📊 Evaluation Results
Average Score: 0.0 / 1.0

❓ What was the average temperature in Austria in 1980? Please provide the numerical value.
✅ Expected: The average temperature in Austria in 1980 was 6.80°C
❌ Got: The average temperature in Austria in 1980 cannot be determined from the provided statistics.
📈 Score: 0.0



<div style="background-color: #ffebee; border-left: 6px solid #f44336; padding: 15px; margin: 20px 0;">
<h4 style="margin-top: 0;">❌ Initial Results - Something's Wrong</h4>
<pre style="background-color: white; padding: 10px; border-radius: 3px; margin: 10px 0;">
📊 Evaluation Results
==================================================
Average Score: 0.0 / 1.0
❓ What was Austria's average temperature in 1980?
✅ Expected: 6.80°C
❌ Got: 8.08°C
📈 Score: 0.0
</pre>
<br>
<strong>The agent got the wrong answer!</strong> Let's investigate why.
</div>

### Inspect Agent Reasoning
Let's examine exactly what the agent did to understand where it went wrong. This code is just parsing the output JSON file. 

In [10]:
# Extract the reasoning steps
item = answer_accuracy_data['eval_output_items'][0]
contexts = item['reasoning']['retrieved_contexts']

print("🤖 AGENT'S DECISION PROCESS")
print("=" * 60)
print(f"Question: {item['reasoning']['user_input']}")
print(f"Expected: {item['reasoning']['reference']}")
print("=" * 60)
print()

# Parse each step
for i, context in enumerate(contexts):
    if context.startswith('**Step'):
        # Extract step number and content
        lines = context.strip().split('\n')
        step_header = lines[0]
        
        print(f"{step_header}")
        
        # Look for Thought
        if 'Thought:' in context:
            thought_start = context.find('Thought:') + 8
            thought_end = context.find('\n\nAction:') if '\n\nAction:' in context else len(context)
            thought = context[thought_start:thought_end].strip()
            print(f"💭 Thought: {thought}")
        
        # Look for Action (tool call)
        if 'Action:' in context and 'Action Input:' in context:
            action_start = context.find('Action:') + 7
            action_end = context.find('\nAction Input:')
            action = context[action_start:action_end].strip()
            
            input_start = context.find('Action Input:') + 13
            input_end = context.find('\n\n', input_start) if '\n\n' in context[input_start:] else len(context)
            action_input = context[input_start:input_end].strip()
            
            print(f"🛠️  Tool: {action}")
            print(f"📥 Input: {action_input}")
        
        # Look for tool response (usually JSON)
        if i + 1 < len(contexts) and contexts[i + 1].startswith('{'):
            print(f"📤 Response: {contexts[i + 1][:100]}..." if len(contexts[i + 1]) > 100 else f"📤 Response: {contexts[i + 1]}")
        
        # Look for Final Answer
        if 'Final Answer:' in context:
            answer_start = context.find('Final Answer:') + 13
            final_answer = context[answer_start:].strip()
            print(f"✅ Final Answer: {final_answer}")
        
        print()

print("\n" + "=" * 60)
print(f"❌ Actual answer given: {item['reasoning']['response']}")
print(f"📊 Score: {item['score']}")

🤖 AGENT'S DECISION PROCESS
Question: What was the average temperature in Austria in 1980? Please provide the numerical value.
Expected: The average temperature in Austria in 1980 was 6.80°C

**Step 0**
💭 Thought: To find the average temperature in Austria in 1980, I need to get the temperature statistics for Austria.
Action: calculate_statistics
Action Input: {"country": "Austria"}
🛠️  Tool: calculate_statistics
📥 Input: {"country": "Austria"}

**Step 1**

**Step 2**
💭 Thought: The provided statistics are for the entire period from 1950 to 2025, but I need the average temperature specifically for the year 1980. However, the statistics provided do not include yearly breakdowns, so I cannot determine the exact average temperature for 1980 from this data.
Final Answer: The average temperature in Austria in 1980 cannot be determined from the provided statistics.
✅ Final Answer: The average temperature in Austria in 1980 cannot be determined from the provided statistics.

<div style="background-color: #f5f5f5; border: 1px solid #ddd; padding: 15px; border-radius: 5px; margin: 15px 0; font-family: monospace;">
<strong>Agent's Reasoning Trace:</strong>
<pre style="margin: 10px 0; white-space: pre-wrap;">
🤖 AGENT'S DECISION PROCESS
============================================================
Question: What was Austria's average temperature in 1980?
Expected: 6.80°C
============================================================
Step 1
💭 Thought: I need to get temperature data for Austria in 1980
🛠️  Tool: calculate_statistics
📥 Input: {"country": "Austria", "start_year": 1980, "end_year": 1980}
📤 Response: {"mean_temperature": 8.08, "years_analyzed": "1950-2025", ...}
✅ Final Answer: 8.08°C
============================================================
❌ Actual answer given: 8.08°C
📊 Score: 0.0
</pre>
</div>

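In the captured run (Step 0), the agent passed the correct country but omitted the year 1980. The tool returned statistics for the whole 1950 to 2025 period, and the agent concluded that it could not determine a value for 1980. Agent behavior varies between runs: in the illustrative trace in the box above, the agent does pass `start_year` and `end_year`, yet the tool still returns the all-years mean of 8.08°C. Either way, the answer is wrong and the score is 0.0.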

The entire output can be found at `climate_analyzer/.tmp/nat/climate_analyzer/eval/simple_test/answer_accuracy_output.json`

## Bug Discovery!
<div style="background-color: #ffebee; border: 2px solid #f44336; padding: 20px; border-radius: 8px; margin: 20px 0;">
<h3 style="color: #c62828; margin-top: 0;">🐛 Critical Bug Identified</h3>
<div style="background-color: white; padding: 15px; border-radius: 5px; margin: 15px 0;">
<h4 style="color: #f44336; margin-top: 0;">The Problem</h4>
<strong>The tool ignores year parameters!</strong>
<br><br>
<table style="width: 100%; border-collapse: collapse;">
    <tr style="background-color: #e8f5e9;">
        <td style="padding: 10px; border: 1px solid #ddd;"><strong>✅ Agent Did Right</strong></td>
        <td style="padding: 10px; border: 1px solid #ddd;">Passed correct parameters: <code>country='Austria', start_year=1980, end_year=1980</code></td>
    </tr>
    <tr style="background-color: #ffebee;">
        <td style="padding: 10px; border: 1px solid #ddd;"><strong>❌ Tool Did Wrong</strong></td>
        <td style="padding: 10px; border: 1px solid #ddd;">Returned data for ALL years (1950-2025), not just 1980</td>
    </tr>
    <tr style="background-color: #fff3cd;">
        <td style="padding: 10px; border: 1px solid #ddd;"><strong>📊 Result</strong></td>
        <td style="padding: 10px; border: 1px solid #ddd;">Wrong answer: 8.08°C (average across all years, 1950-2025) instead of 6.80°C (1980 only)</td>
    </tr>
</table>
</div>
<div style="background-color: white; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="color: #f44336; margin-top: 0;">Root Cause</h4>
The <code>calculate_statistics</code> function accepts <code>start_year</code> and <code>end_year</code> parameters but doesn't actually filter the data by them. The function signature has the parameters, but the implementation doesn't use them.
<br><br>
<strong>This is why systematic evaluation matters</strong> - manual testing might never catch this edge case, but automated evaluation found it immediately.
</div>
</div>

## The Fix
Add year filtering logic to the `calculate_statistics` function:
<div style="background-color: #e8f5e9; border-left: 6px solid #4CAF50; padding: 15px; margin: 20px 0;">
<h4 style="margin-top: 0;">✅ The Solution</h4>
<strong>Before (broken code):</strong>
<pre style="background-color: white; padding: 10px; border-radius: 3px; margin: 10px 0;">
def calculate_statistics(df, country=None, start_year=None, end_year=None):
    # Filter by country
    if country:
        df = df[df['country_name'] == country] # BUG: Never filters by year!
    return calculate_stats(df)
</pre>
<strong>After (fixed code):</strong>
<pre style="background-color: white; padding: 10px; border-radius: 3px; margin: 10px 0;">
def calculate_statistics(df, country=None, start_year=None, end_year=None):
    # Filter by country
    if country:
        df = df[df['country_name'] == country]
    # ✅ FIX: Actually filter by year when specified (also when no country is given)
    if start_year is not None:
        df = df[df['year'] >= start_year]
    if end_year is not None:
        df = df[df['year'] <= end_year]
    return calculate_stats(df)
</pre>
</div>
<div style="background-color: #f3e5f5; border-left: 6px solid #9c27b0; padding: 15px; margin: 15px 0;">
<h4 style="margin-top: 0;">💡 Why This Bug Existed</h4>
<ul>
<li><strong>Interface vs. Implementation</strong> - The function signature promised year filtering, but the body didn't deliver</li>
<li><strong>Silent failure</strong> - No error was thrown; the function just returned wrong data</li>
<li><strong>Hard to catch manually</strong> - "Show me Austria's temperature" works fine. Only specific year queries fail.</li>
<li><strong>Evaluation caught it</strong> - Systematic testing with ground truth revealed the bug immediately</li>
</ul>
</div>
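A silent failure like this one can be pinned down with a small regression check that compares a filtered call against an unfiltered one. The sketch below is dependency-free and does not use the lesson's pandas code: `mean_temperature` is a hypothetical stand-in for `calculate_statistics`, the `temperature` key is an assumed column name, and the 2020 row is made up purely so that the two results differ.

```python
from statistics import mean

ROWS = [
    # The three Austria 1980 station values from the ground-truth check.
    {"country_name": "Austria", "year": 1980, "temperature": 7.994166666666666},
    {"country_name": "Austria", "year": 1980, "temperature": 7.631666666666667},
    {"country_name": "Austria", "year": 1980, "temperature": 4.766666666666667},
    # Made-up row for illustration only, so all-years and 1980-only means differ.
    {"country_name": "Austria", "year": 2020, "temperature": 10.0},
]

def mean_temperature(rows, country=None, start_year=None, end_year=None):
    """Mean temperature after applying every filter that was actually supplied."""
    selected = [
        row for row in rows
        if (country is None or row["country_name"] == country)
        and (start_year is None or row["year"] >= start_year)
        and (end_year is None or row["year"] <= end_year)
    ]
    if not selected:
        raise ValueError("no rows match the requested filters")
    return mean(row["temperature"] for row in selected)

all_years = mean_temperature(ROWS, country="Austria")
only_1980 = mean_temperature(ROWS, country="Austria", start_year=1980, end_year=1980)

# If the year arguments were ignored, these two values would be identical.
assert only_1980 != all_years
assert round(only_1980, 2) == 6.8
print(f"all years: {all_years:.2f}, 1980 only: {only_1980:.2f}")
```

The first assertion is the one that would have caught the original bug: it fails whenever the year parameters are accepted but not applied.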

## Test the Fix

In [11]:
# Run evaluation with the fixed tool
!cd climate_analyzer && nat eval --config_file src/climate_analyzer/configs/eval_config_fixed.yml

2025-12-17 20:12:09 - INFO     - nat.eval.evaluate:448 - Starting evaluation run with config file: src/climate_analyzer/configs/eval_config_fixed.yml
Running workflow: 100%|███████████████████████████| 1/1 [00:13<00:00, 13.97s/it]
Evaluating Ragas nv_accuracy: 100%|███████████████| 1/1 [00:01<00:00,  1.57s/it]
2025-12-17 20:12:28 - INFO     - nat.eval.evaluate:252 - Profiler is not enabled. Skipping profiling.
2025-12-17 20:12:28 - INFO     - nat.eval.evaluate:337 - Workflow output written to .tmp/nat/climate_analyzer/eval/fixed_test/workflow_output.json
2025-12-17 20:12:28 - INFO     - nat.eval.evaluate:348 - Evaluation results written to .tmp/nat/climate_analyzer/eval/fixed_test/answer_accuracy_output.json
2025-12-17 20:12:28 - INFO     - nat.eval.utils.output_uploader:61 - No S3 config provided; skipping upload.

## Verify Results
Now that the logic has been updated, check the results again to see if the score improved: 

In [12]:
import json

with open('climate_analyzer/.tmp/nat/climate_analyzer/eval/fixed_test/answer_accuracy_output.json', 'r') as f:
    answer_accuracy_data = json.load(f)

In [13]:
print("📊 Evaluation Results")
print("=" * 50)
print(f"Average Score: {answer_accuracy_data['average_score']} / 1.0")
print()

for item in answer_accuracy_data['eval_output_items']:
    r = item['reasoning']
    print(f"❓ {r['user_input']}")
    print(f"✅ Expected: {r['reference']}")
    print(f"❌ Got: {r['response']}")
    print(f"📈 Score: {item['score']}")
    print()

📊 Evaluation Results
Average Score: 1.0 / 1.0

❓ What was the average temperature in Austria in 1980? Please provide the numerical value.
✅ Expected: The average temperature in Austria in 1980 was 6.80°C
❌ Got: The average temperature in Austria in 1980 was 6.8°C.
📈 Score: 1.0



<div style="background-color: #e8f5e9; border-left: 6px solid #4CAF50; padding: 15px; margin: 20px 0;">
<h4 style="margin-top: 0;">✅ Fixed Results - Success!</h4>
<pre style="background-color: white; padding: 10px; border-radius: 3px; margin: 10px 0;">
📊 Evaluation Results
==================================================
Average Score: 1.0 / 1.0
❓ What was Austria's average temperature in 1980?
✅ Expected: 6.80°C
✅ Got: 6.80°C
📈 Score: 1.0
</pre>
<br>
<strong>Perfect score!</strong> The tool now filters by year, so the agent returns the correct value for 1980.
</div>
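A before/after comparison is easier to trust when it is computed rather than read off two printouts. The sketch below assumes only the fields already used in this lesson (`eval_output_items`, `score`, and `reasoning.user_input`); `scores_by_question` and `compare_runs` are hypothetical helpers, and the two runs are inlined with the scores reported above so the example is self-contained.

```python
def scores_by_question(run):
    """Map each question to its score for one evaluation run."""
    return {
        item["reasoning"]["user_input"]: item["score"]
        for item in run["eval_output_items"]
    }

def compare_runs(before, after):
    """Return (question, before, after, delta) for questions present in both runs."""
    before_scores = scores_by_question(before)
    after_scores = scores_by_question(after)
    return [
        (question, score, after_scores[question], after_scores[question] - score)
        for question, score in before_scores.items()
        if question in after_scores
    ]

QUESTION = "What was the average temperature in Austria in 1980? Please provide the numerical value."

# Inlined stand-ins for the simple_test and fixed_test output files.
before_run = {"eval_output_items": [{"score": 0.0, "reasoning": {"user_input": QUESTION}}]}
after_run = {"eval_output_items": [{"score": 1.0, "reasoning": {"user_input": QUESTION}}]}

rows = compare_runs(before_run, after_run)
regressions = [row for row in rows if row[3] < 0]

for question, old, new, delta in rows:
    print(f"{old} -> {new} ({delta:+.1f}): {question}")
print(f"Regressions: {len(regressions)}")
```

In practice you would load the two `answer_accuracy_output.json` files with `json.load` and pass the resulting dictionaries to `compare_runs`. Listing regressions separately matters once the dataset grows, because an improved average can hide individual questions that got worse.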

## Summary
<div style="background-color: #e3f2fd; border: 2px solid #2196F3; padding: 20px; border-radius: 8px; margin: 20px 0;">
<h3 style="color: #1976d2; margin-top: 0;">🎉 What You Accomplished</h3>
<div style="background-color: white; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="color: #4CAF50; margin-top: 0;">✅ The Evaluation → Fix → Verify Loop</h4>
<div style="display: flex; justify-content: space-around; align-items: center; margin: 15px 0; flex-wrap: wrap;">
    <div style="text-align: center; margin: 10px;">
        <div style="background-color: #2196F3; color: white; padding: 12px; border-radius: 8px;">
            <strong>1. Create Tests</strong>
        </div>
        <small>Ground truth dataset</small>
    </div>
    <div style="font-size: 20px;">→</div>
    <div style="text-align: center; margin: 10px;">
        <div style="background-color: #ff9800; color: white; padding: 12px; border-radius: 8px;">
            <strong>2. Run Evaluation</strong>
        </div>
        <small>Found score: 0.0</small>
    </div>
    <div style="font-size: 20px;">→</div>
    <div style="text-align: center; margin: 10px;">
        <div style="background-color: #f44336; color: white; padding: 12px; border-radius: 8px;">
            <strong>3. Discover Bug</strong>
        </div>
        <small>Year filter missing</small>
    </div>
    <div style="font-size: 20px;">→</div>
    <div style="text-align: center; margin: 10px;">
        <div style="background-color: #9C27B0; color: white; padding: 12px; border-radius: 8px;">
            <strong>4. Fix Code</strong>
        </div>
        <small>Add year filtering</small>
    </div>
    <div style="font-size: 20px;">→</div>
    <div style="text-align: center; margin: 10px;">
        <div style="background-color: #4CAF50; color: white; padding: 12px; border-radius: 8px;">
            <strong>5. Verify Fix</strong>
        </div>
        <small>Score: 1.0 ✅</small>
    </div>
</div>
</div>
<div style="background-color: white; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="color: #2196F3; margin-top: 0;">📊 Before vs. After</h4>
<table style="width: 100%; border-collapse: collapse; margin-top: 10px;">
    <tr style="background-color: #2196F3; color: white;">
        <th style="padding: 12px; text-align: left; border: 1px solid #ddd;">Metric</th>
        <th style="padding: 12px; text-align: center; border: 1px solid #ddd;">Before Fix</th>
        <th style="padding: 12px; text-align: center; border: 1px solid #ddd;">After Fix</th>
    </tr>
    <tr style="background-color: white;">
        <td style="padding: 12px; border: 1px solid #ddd;"><strong>Evaluation Score</strong></td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #f44336;"><strong>0.0 / 1.0</strong></td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #4CAF50;"><strong>1.0 / 1.0</strong></td>
    </tr>
    <tr style="background-color: #f9f9f9;">
        <td style="padding: 12px; border: 1px solid #ddd;"><strong>Answer for 1980</strong></td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #f44336;">8.08°C (wrong)</td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #4CAF50;">6.80°C (correct)</td>
    </tr>
    <tr style="background-color: white;">
        <td style="padding: 12px; border: 1px solid #ddd;"><strong>Year Filtering</strong></td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #f44336;">❌ Broken</td>
        <td style="padding: 12px; text-align: center; border: 1px solid #ddd; color: #4CAF50;">✅ Working</td>
    </tr>
</table>
</div>
<div style="background-color: white; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="color: #9C27B0; margin-top: 0;">🔑 Key Insights</h4>
<ul>
<li><strong>Evaluation finds silent bugs</strong> - No errors thrown, just wrong answers</li>
<li><strong>Ground truth is essential</strong> - You need known-correct answers to test against</li>
<li><strong>Systematic beats manual</strong> - Automated evaluation catches edge cases you'd miss</li>
<li><strong>Reasoning traces localize bugs</strong> - Seeing what the agent tried helps identify where it went wrong</li>
<li><strong>Verify fixes work</strong> - Re-run evaluation to confirm the bug is actually fixed</li>
</ul>
</div>
<div style="background-color: #fff3cd; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="margin-top: 0;">⚡ Why This Matters in Production</h4>
Without systematic evaluation:
<ul>
<li>This bug would have made it to production</li>
<li>Users asking about specific years would get wrong answers</li>
<li>You'd only discover it through user complaints</li>
<li>You wouldn't know how widespread the problem is</li>
</ul>
<br>
With evaluation:
<ul>
<li>Caught the bug before deployment</li>
<li>Fixed it with confidence (verified the fix works)</li>
<li>Can now run regression tests (make sure future changes don't break it again)</li>
<li>Have metrics to track improvements over time</li>
</ul>
</div>
<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h4 style="margin-top: 0;">🚀 Next Lesson: Deploy with UI</h4>
Your agent is now:
<ul>
<li>✅ Functional (has data analysis tools)</li>
<li>✅ Observable (Phoenix tracing shows decisions)</li>
<li>✅ Enhanced (LangGraph calculator for complex math)</li>
<li>✅ Tested (evaluation ensures correctness)</li>
</ul>
<br>
In the final lesson, you'll:
<ul>
<li>Deploy your agent with a production-ready UI</li>
<li>Add authentication and rate limiting</li>
<li>See how everything comes together in a real application</li>
<li>Share your agent with users</li>
</ul>
</div>
</div>