# **Agent Evaluation Workshop**

**Duration:** ~50 minutes  
**Prerequisites:** Familiarity with LLM tool-calling, basic Python  
**API:** Google Generative AI (Gemini)

---

## What You'll Learn

- How to evaluate individual agent components (tool selection, parameter extraction)
- How to use an LLM as a judge to score agent outputs
- How to systematically improve agent performance through prompt iteration

---

## Workshop Structure

| Exercise | Topic | Duration |
|----------|-------|----------|
| 1 | Component-Level Evaluation | 20 min |
| 2 | LLM-as-Judge Evaluation | 15 min |
| 3 | Prompt Iteration & Improvement | 15 min |

## Setup

In [3]:
%pip install --quiet google-genai

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
import json
from google import genai
from google.genai import types
from getpass import getpass 

api_key = getpass("Paste your Google API key: ").strip()
os.environ["GOOGLE_API_KEY"] = api_key

client = genai.Client(api_key=api_key)
MODEL = "gemini-2.5-flash"

print("Setup complete!")

Paste your Google API key:  ········


Setup complete!


---

# Exercise 1: Component-Level Evaluation (20 min)

**Goal:** Measure how well individual agent components work — tool selection and parameter extraction.

We'll build a simple agent with 4 tools, then run a structured test suite against it to measure:
- **Tool Selection Accuracy** — does the agent pick the right tool?
- **Parameter Extraction Accuracy** — does it extract the correct arguments?

```
             ┌─────────────┐
  Query ───> │  LLM Agent  │ ───> Tool Call + Params
             └─────────────┘
                    │
                    ▼
           Compare against expected
           tool + expected params
                    │
                    ▼
             Accuracy Scores
```
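
Both metrics in the diagram reduce to simple ratios over the test suite. The following is a minimal sketch using hand-written records rather than real model output; the field names (`expected_tool`, `actual_tool`, `params_ok`) and the tool names are illustrative and not part of the workshop code.

```python
def accuracy(records: list[dict]) -> dict:
    """Compute tool-selection and parameter accuracy as fractions in [0, 1]."""
    total = len(records)
    if total == 0:
        return {"tool_accuracy": 0.0, "param_accuracy": 0.0}
    tool_hits = sum(r["expected_tool"] == r["actual_tool"] for r in records)
    param_hits = sum(r["params_ok"] for r in records)
    return {"tool_accuracy": tool_hits / total, "param_accuracy": param_hits / total}


# Hypothetical results for three queries (not real model output)
sample_records = [
    {"expected_tool": "tool_a", "actual_tool": "tool_a", "params_ok": True},
    {"expected_tool": "tool_b", "actual_tool": "tool_c", "params_ok": False},
    {"expected_tool": "tool_c", "actual_tool": "tool_c", "params_ok": True},
]
sample_metrics = accuracy(sample_records)
print(sample_metrics)
```

Keeping the two ratios separate makes it possible to tell a routing problem (wrong tool) from an extraction problem (right tool, wrong arguments).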

## Step 1: Define Agent Tools

We create 4 tools that our agent can call. Each tool has a clear docstring — this is what the LLM reads to decide which tool to use.
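
To see the raw material a tool description is built from, Python's `inspect` module can print a function's signature and docstring. This is only a sketch for inspection: how the SDK turns a Python function into a tool declaration is not shown here, and `describe_tool` and `convert_currency` are hypothetical names.

```python
import inspect


def convert_currency(amount: float, to_currency: str) -> dict:
    """Convert an amount in USD to another currency.

    Use this tool when the user asks for a currency conversion.
    """
    return {"amount": amount, "to_currency": to_currency, "status": "success"}


def describe_tool(fn) -> dict:
    """Collect the name, parameter types, and docstring of a tool function."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "params": {name: p.annotation.__name__ for name, p in sig.parameters.items()},
        "doc": inspect.getdoc(fn),
    }


tool_info = describe_tool(convert_currency)
print(tool_info["name"], tool_info["params"])
print(tool_info["doc"])
```

If the printed description would not be enough for a new teammate to pick the right tool, it is probably not enough for the model either.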

In [5]:
def get_weather(city: str) -> dict:
    """Get the current weather for a given city.

    Use this tool when the user asks about weather conditions,
    temperature, or forecast for a specific city or location.

    Args:
        city: Name of the city (e.g., "Delhi", "London", "Tokyo")

    Returns:
        dict with city, temperature, and conditions
    """
    # Simulated weather data
    weather_data = {
        "delhi": {"temp": 35, "conditions": "Sunny"},
        "london": {"temp": 12, "conditions": "Cloudy"},
        "tokyo": {"temp": 22, "conditions": "Clear"},
        "new york": {"temp": 18, "conditions": "Partly Cloudy"},
        "mumbai": {"temp": 32, "conditions": "Humid"},
    }
    data = weather_data.get(city.lower(), {"temp": 20, "conditions": "Unknown"})
    return {"city": city, "temperature_c": data["temp"], "conditions": data["conditions"], "status": "success"}


def search_web(query: str) -> dict:
    """Search the web for information on a given topic.

    Use this tool when the user asks general knowledge questions,
    wants to find information about a topic, or needs facts that
    aren't covered by other specialized tools.

    Args:
        query: The search query string

    Returns:
        dict with search results summary
    """
    return {"query": query, "results": f"Top results for: {query}", "status": "success"}


def calculate(expression: str) -> dict:
    """Evaluate a mathematical expression and return the result.

    Use this tool when the user asks to compute, calculate, or
    solve a math problem. Supports basic arithmetic (+, -, *, /),
    powers (**), and parentheses.

    Args:
        expression: A mathematical expression as a string (e.g., "2 + 3 * 4")

    Returns:
        dict with the expression and its computed result
    """
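    try:
        # NOTE: eval() executes arbitrary Python. Builtins are disabled here as a
        # basic guard for the workshop; avoid eval() on untrusted input in production.
        result = eval(expression, {"__builtins__": {}}, {})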
        return {"expression": expression, "result": result, "status": "success"}
    except Exception as e:
        return {"expression": expression, "error": str(e), "status": "error"}


def get_news(topic: str) -> dict:
    """Get the latest news headlines about a specific topic.

    Use this tool when the user asks about recent news, current events,
    or headlines related to a specific topic or subject.

    Args:
        topic: The news topic to search for (e.g., "AI", "sports", "politics")

    Returns:
        dict with news headlines for the topic
    """
    return {"topic": topic, "headlines": [f"Breaking: Latest on {topic}"], "status": "success"}


# Map tool names to functions for execution
TOOLS = [get_weather, search_web, calculate, get_news]
TOOL_MAP = {fn.__name__: fn for fn in TOOLS}

print("Available tools:", list(TOOL_MAP.keys()))

Available tools: ['get_weather', 'search_web', 'calculate', 'get_news']
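
The `calculate` tool passes model-generated text to `eval`, which can execute arbitrary Python. One possible alternative, sketched below, parses the expression with the standard `ast` module and allows only the operators listed in the docstring. `safe_eval` is a hypothetical helper, not part of the workshop code.

```python
import ast
import operator

# Operators permitted in an expression; anything else is rejected
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate basic arithmetic without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_eval("(15 + 25) * 2"))
print(safe_eval("2 ** 10"))
```

Function calls, names, and attribute access all raise `ValueError`. Very large exponents can still be slow to compute, so a production version would likely also bound operand sizes.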


## Step 2: Build the Agent Caller

This function sends a query to the LLM with tool definitions and returns which tool the LLM chose and what parameters it extracted.

In [7]:
SYSTEM_PROMPT = """You are a helpful assistant with access to tools.
When the user asks a question, decide which tool to call and extract the correct parameters.
Always use a tool to answer — do not answer from memory."""


def call_agent(query: str, system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Send a query to the agent and return the tool call it makes.

    Returns:
        dict with 'tool_name' and 'params'. If no tool was called or the request
        failed, 'tool_name' is None, 'params' is empty, and 'error' holds the reason.
    """
    try:
        response = client.models.generate_content(
            model=MODEL,
            contents=query,
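            config=types.GenerateContentConfig(
                system_instruction=system_prompt,
                tools=TOOLS,
                tool_config=types.ToolConfig(
                    function_calling_config=types.FunctionCallingConfig(mode="ANY")
                ),
                # We only want the model's tool choice, so stop the SDK from
                # executing the Python functions and re-calling the model itself.
                automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
            ),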
        )

        # Extract the function call from the response
        for part in response.candidates[0].content.parts:
            if part.function_call:
                return {
                    "tool_name": part.function_call.name,
                    "params": dict(part.function_call.args),
                }

        return {"tool_name": None, "params": {}, "error": "No tool call in response"}

    except Exception as e:
        return {"tool_name": None, "params": {}, "error": str(e)}
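
`call_agent` stops at the model's tool choice, which is all the evaluation needs. If the call should also be executed, dispatching through a name-to-function map is one way to do it. The sketch below uses a stand-in tool and map so that it runs without an API key; `execute_tool_call`, `echo_city`, and `DEMO_TOOL_MAP` are hypothetical names.

```python
def echo_city(city: str) -> dict:
    """Stand-in tool used only for this sketch."""
    return {"city": city, "status": "success"}


DEMO_TOOL_MAP = {"echo_city": echo_city}


def execute_tool_call(call: dict, tool_map: dict) -> dict:
    """Run the tool named in `call` with its params; return an error dict on failure."""
    name = call.get("tool_name")
    fn = tool_map.get(name)
    if fn is None:
        return {"status": "error", "error": f"Unknown tool: {name}"}
    try:
        return fn(**call.get("params", {}))
    except TypeError as e:
        # Typically raised when arguments are missing or unexpected
        return {"status": "error", "error": str(e)}


ok_result = execute_tool_call({"tool_name": "echo_city", "params": {"city": "Delhi"}}, DEMO_TOOL_MAP)
bad_result = execute_tool_call({"tool_name": "missing_tool", "params": {}}, DEMO_TOOL_MAP)
print(ok_result)
print(bad_result)
```

Returning an error dict instead of raising keeps the shape consistent with the `status` field the workshop tools already return.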

## Step 3: Define Test Cases

The suite contains 20 test queries, five per tool, each paired with the expected tool and the expected parameters.

In [8]:
test_cases = [
    # --- get_weather (5 cases) ---
    {"query": "Weather in Delhi", "expected_tool": "get_weather", "expected_params": {"city": "Delhi"}},
    {"query": "What's the temperature in London?", "expected_tool": "get_weather", "expected_params": {"city": "London"}},
    {"query": "Is it raining in Tokyo right now?", "expected_tool": "get_weather", "expected_params": {"city": "Tokyo"}},
    {"query": "Tell me the weather forecast for Mumbai", "expected_tool": "get_weather", "expected_params": {"city": "Mumbai"}},
    {"query": "How hot is it in New York?", "expected_tool": "get_weather", "expected_params": {"city": "New York"}},

    # --- search_web (5 cases) ---
    {"query": "Who invented the telephone?", "expected_tool": "search_web", "expected_params": {"query": "Who invented the telephone?"}},
    {"query": "What is photosynthesis?", "expected_tool": "search_web", "expected_params": {"query": "What is photosynthesis?"}},
    {"query": "Tell me about the history of chess", "expected_tool": "search_web", "expected_params": {"query": "history of chess"}},
    {"query": "How does a combustion engine work?", "expected_tool": "search_web", "expected_params": {"query": "How does a combustion engine work?"}},
    {"query": "What are the benefits of meditation?", "expected_tool": "search_web", "expected_params": {"query": "benefits of meditation"}},

    # --- calculate (5 cases) ---
    {"query": "What is 25 * 4?", "expected_tool": "calculate", "expected_params": {"expression": "25 * 4"}},
    {"query": "Calculate 100 / 3", "expected_tool": "calculate", "expected_params": {"expression": "100 / 3"}},
    {"query": "What's 2 to the power of 10?", "expected_tool": "calculate", "expected_params": {"expression": "2 ** 10"}},
    {"query": "Compute (15 + 25) * 2", "expected_tool": "calculate", "expected_params": {"expression": "(15 + 25) * 2"}},
    {"query": "How much is 999 - 123?", "expected_tool": "calculate", "expected_params": {"expression": "999 - 123"}},

    # --- get_news (5 cases) ---
    {"query": "What's the latest news about AI?", "expected_tool": "get_news", "expected_params": {"topic": "AI"}},
    {"query": "Any recent sports headlines?", "expected_tool": "get_news", "expected_params": {"topic": "sports"}},
    {"query": "Show me news about climate change", "expected_tool": "get_news", "expected_params": {"topic": "climate change"}},
    {"query": "Latest updates on the stock market", "expected_tool": "get_news", "expected_params": {"topic": "stock market"}},
    {"query": "What's happening in space exploration?", "expected_tool": "get_news", "expected_params": {"topic": "space exploration"}},
]

print(f"Total test cases: {len(test_cases)}")
print(f"Tools covered: {set(tc['expected_tool'] for tc in test_cases)}")

Total test cases: 20
Tools covered: {'calculate', 'get_news', 'search_web', 'get_weather'}
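
A typo in an expected tool name or parameter key would show up later as a model failure, so it can be worth checking the test cases against the tool signatures before running them. A sketch under that assumption, using a stand-in tool; `validate_cases` and `demo_lookup` are hypothetical names.

```python
import inspect


def demo_lookup(city: str) -> dict:
    """Stand-in tool for this sketch."""
    return {"city": city}


def validate_cases(cases: list[dict], tool_map: dict) -> list[str]:
    """Describe each test case that references an unknown tool or parameter."""
    problems = []
    for i, tc in enumerate(cases, start=1):
        fn = tool_map.get(tc["expected_tool"])
        if fn is None:
            problems.append(f"case {i}: unknown tool '{tc['expected_tool']}'")
            continue
        allowed = set(inspect.signature(fn).parameters)
        unknown = sorted(set(tc["expected_params"]) - allowed)
        if unknown:
            problems.append(f"case {i}: unexpected params {unknown}")
    return problems


demo_cases = [
    {"query": "Weather in Delhi", "expected_tool": "demo_lookup", "expected_params": {"city": "Delhi"}},
    {"query": "Weather in Oslo", "expected_tool": "demo_lookup", "expected_params": {"town": "Oslo"}},
    {"query": "News on AI", "expected_tool": "demo_news", "expected_params": {"topic": "AI"}},
]
case_problems = validate_cases(demo_cases, {"demo_lookup": demo_lookup})
print(case_problems)
```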


## Step 4: Run the Test Harness

We send each query to the agent and compare:
1. Did it select the **correct tool**?
2. Did it extract the **correct parameters**?

For parameters, we use a **fuzzy match** with three strategies:
- **Case-insensitive** comparison
- **Bidirectional containment** — either string containing the other (handles rephrasing)
- **Whitespace normalization** — `2**10` matches `2 ** 10`
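
These strategies are deliberately lenient, and it helps to see what each one accepts. The standalone sketch below applies them to single strings; `loose_match` is an illustrative helper separate from the harness, and it adds a guard against empty values as an assumption of this sketch. It also shows a side effect of containment: a short string matches any longer string that contains it.

```python
import re


def loose_match(expected: str, actual: str) -> bool:
    """Apply the three strategies: case folding, containment, whitespace removal."""
    exp = re.sub(r"\s+", " ", expected.lower().strip())
    act = re.sub(r"\s+", " ", actual.lower().strip())
    if not act:
        return False  # an empty value should never count as a match
    return exp in act or act in exp or exp.replace(" ", "") == act.replace(" ", "")


print(loose_match("Delhi", "delhi"))                             # case-insensitive
print(loose_match("history of chess", "the history of chess"))   # containment
print(loose_match("2 ** 10", "2**10"))                           # whitespace removed
print(loose_match("Delhi", "New Delhi"))                         # also accepted by containment
print(loose_match("Tokyo", "Kyoto"))                             # rejected
```

Whether accepting "New Delhi" for "Delhi" is acceptable depends on the application; a stricter suite might use containment only for free-text parameters such as search queries.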

In [9]:
import time
import re


def normalize(s: str) -> str:
    """Normalize a string for fuzzy comparison: lowercase, strip, collapse whitespace."""
    return re.sub(r"\s+", " ", str(s).lower().strip())


def fuzzy_param_match(expected_params: dict, actual_params: dict) -> bool:
    """Check if actual params match expected values.

    Uses bidirectional containment after normalization:
    - Case-insensitive
    - Whitespace-normalized (so '2**10' matches '2 ** 10')
    - Either string containing the other counts as a match
    """
    for key, expected_val in expected_params.items():
        if key not in actual_params:
            return False  # a missing parameter is never a match
        actual_norm = normalize(actual_params[key])
        expected_norm = normalize(expected_val)
        if not actual_norm:
            return False  # "" is a substring of every string, so reject it explicitly
        # Also compare with all whitespace removed (handles '2**10' vs '2 ** 10')
        actual_compact = actual_norm.replace(" ", "")
        expected_compact = expected_norm.replace(" ", "")
        if (expected_norm in actual_norm
            or actual_norm in expected_norm
            or expected_compact == actual_compact):
            continue
        return False
    return True


def run_evaluation(test_cases: list, system_prompt: str = SYSTEM_PROMPT, label: str = "Evaluation") -> dict:
    """Run all test cases and return accuracy metrics."""
    results = []
    tool_correct = 0
    param_correct = 0

    print(f"\n{'=' * 70}")
    print(f"  {label}")
    print(f"{'=' * 70}")

    for i, tc in enumerate(test_cases):
        result = call_agent(tc["query"], system_prompt=system_prompt)

        tool_match = result["tool_name"] == tc["expected_tool"]
        param_match = fuzzy_param_match(tc["expected_params"], result["params"])

        if tool_match:
            tool_correct += 1
        if param_match:
            param_correct += 1

        status = "PASS" if (tool_match and param_match) else "FAIL"
        icon = "+" if status == "PASS" else "X"

        print(f"  [{icon}] {i+1:2d}. {tc['query'][:45]:<45}  "
              f"Tool: {result['tool_name'] or 'None':<15} "
              f"{'OK' if tool_match else 'WRONG':>5}  "
              f"Params: {'OK' if param_match else 'WRONG'}")

        if not tool_match:
            print(f"        Expected tool: {tc['expected_tool']}")
        if not param_match:
            print(f"        Expected params: {tc['expected_params']}")
            print(f"        Got params:      {result['params']}")

        results.append({
            "query": tc["query"],
            "expected_tool": tc["expected_tool"],
            "actual_tool": result["tool_name"],
            "tool_match": tool_match,
            "param_match": param_match,
            "expected_params": tc["expected_params"],
            "actual_params": result["params"],
        })

        time.sleep(0.1)  # Rate limiting

    total = len(test_cases)
    full_correct = sum(1 for r in results if r["tool_match"] and r["param_match"])
    tool_accuracy = tool_correct / total * 100
    param_accuracy = param_correct / total * 100
    full_accuracy = full_correct / total * 100

    print(f"\n{'-' * 70}")
    print("  RESULTS")
    print(f"  Tool Selection Accuracy:    {tool_correct}/{total} = {tool_accuracy:.1f}%")
    print(f"  Parameter Extraction:       {param_correct}/{total} = {param_accuracy:.1f}%")
    print(f"  Full Match (tool + params): {full_correct}/{total} = {full_accuracy:.1f}%")
    print(f"{'-' * 70}")

    return {
        "label": label,
        "tool_accuracy": tool_accuracy,
        "param_accuracy": param_accuracy,
        "full_accuracy": full_accuracy,
        "results": results,
    }

In [10]:
baseline_eval = run_evaluation(test_cases, label="Baseline Evaluation")


  Baseline Evaluation
  [+]  1. Weather in Delhi                               Tool: get_weather        OK  Params: OK
  [+]  2. What's the temperature in London?              Tool: get_weather        OK  Params: OK
  [+]  3. Is it raining in Tokyo right now?              Tool: get_weather        OK  Params: OK
  [+]  4. Tell me the weather forecast for Mumbai        Tool: get_weather        OK  Params: OK
  [+]  5. How hot is it in New York?                     Tool: get_weather        OK  Params: OK
  [+]  6. Who invented the telephone?                    Tool: search_web         OK  Params: OK
  [+]  7. What is photosynthesis?                        Tool: search_web         OK  Params: OK
  [+]  8. Tell me about the history of chess             Tool: search_web         OK  Params: OK
  [X]  9. How does a combustion engine work?             Tool: search_web         OK  Params: WRONG
        Expected params: {'query': 'How does a combustion engine work?'}
        Got params:      {'q

## Step 5: Per-Tool Breakdown

Let's see which tools the agent handles well and which it struggles with.

In [11]:
from collections import defaultdict

# Group results by expected tool
tool_stats = defaultdict(lambda: {"total": 0, "tool_ok": 0, "param_ok": 0, "full_ok": 0})

for r in baseline_eval["results"]:
    tool = r["expected_tool"]
    tool_stats[tool]["total"] += 1
    if r["tool_match"]:
        tool_stats[tool]["tool_ok"] += 1
    if r["param_match"]:
        tool_stats[tool]["param_ok"] += 1
    if r["tool_match"] and r["param_match"]:
        tool_stats[tool]["full_ok"] += 1

print(f"{'Tool':<15} {'Tool Acc':>10} {'Param Acc':>10} {'Full Acc':>10}")
print("-" * 50)
for tool, stats in sorted(tool_stats.items()):
    t = stats["total"]
    print(f"{tool:<15} {stats['tool_ok']/t*100:>9.0f}% {stats['param_ok']/t*100:>9.0f}% {stats['full_ok']/t*100:>9.0f}%")

Tool              Tool Acc  Param Acc   Full Acc
--------------------------------------------------
calculate             100%       100%       100%
get_news              100%       100%       100%
get_weather           100%       100%       100%
search_web            100%        80%        80%
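
Per-tool accuracy shows where errors occur but not which tool was chosen instead. A sketch of a confusion count over (expected, actual) pairs follows, using made-up records shaped like the `results` list returned by `run_evaluation`.

```python
from collections import Counter

# Made-up records for illustration; only the two keys used below are needed
demo_results = [
    {"expected_tool": "get_news", "actual_tool": "get_news"},
    {"expected_tool": "get_news", "actual_tool": "search_web"},
    {"expected_tool": "search_web", "actual_tool": "search_web"},
    {"expected_tool": "get_news", "actual_tool": "search_web"},
]

confusion = Counter((r["expected_tool"], r["actual_tool"]) for r in demo_results)

# Keep only the misrouted pairs, most frequent first
misroutes = [(pair, n) for pair, n in confusion.most_common() if pair[0] != pair[1]]
for (expected, actual), n in misroutes:
    print(f"expected {expected:<12} got {actual:<12} x{n}")
```

A repeated pair such as `get_news` routed to `search_web` points at the two docstrings that need to be distinguished more clearly.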


### Discussion: Exercise 1

- Which tool had the lowest accuracy? Why might that be?
- Were parameter extraction errors caused by phrasing differences or actual mistakes?
- How would you improve tool selection? (Hint: better docstrings, system prompt, or tool names)

---

# Exercise 2: LLM-as-Judge Evaluation (15 min)

**Goal:** Use a second LLM call to evaluate the quality of agent-generated answers.

```
  Document ──> Agent LLM ──> Answer
                                │
                                ▼
                         Judge LLM
                     (scores 1-5 on 3 axes)
                                │
                                ▼
                    Compare with Human Scores
```

We'll:
1. Give the agent a document and 5 questions
2. Have a "Judge LLM" score each answer on Correctness, Completeness, and Relevance
3. Compare judge scores with your own human scores

## Step 1: The Document and Questions

In [12]:
DOCUMENT = """
Renewable Energy in India — 2024 Report

India has emerged as one of the world's fastest-growing renewable energy markets.
As of 2024, India's total installed renewable energy capacity has reached 190 GW,
accounting for approximately 43% of the country's total power generation capacity.

Solar energy leads with 82 GW of installed capacity, followed by wind energy at
46 GW. The government's National Solar Mission aims to achieve 280 GW of solar
capacity by 2030.

Key challenges include:
- Land acquisition for large solar farms, especially in densely populated states
- Grid integration issues due to intermittent nature of renewable sources
- Energy storage limitations — current battery capacity covers only 2 hours of peak demand
- Dependence on imported solar panels (60% from China)

The government has introduced several incentives:
- Production Linked Incentive (PLI) scheme for domestic solar panel manufacturing
- Green energy corridors for improved grid connectivity
- Subsidies of up to 40% for rooftop solar installations

India's renewable energy sector employs approximately 780,000 people and has
attracted over $30 billion in foreign investment since 2014.
"""

QUESTIONS = [
    "What is India's total installed renewable energy capacity in 2024?",
    "What are the top two renewable energy sources in India and their capacities?",
    "What are the key challenges facing India's renewable energy sector?",
    "What government incentives exist for renewable energy in India?",
    "How many people are employed in India's renewable energy sector?",
]

print(f"Document: {len(DOCUMENT.split())} words")
print(f"Questions: {len(QUESTIONS)}")

Document: 178 words
Questions: 5


## Step 2: Generate Agent Answers

In [13]:
def get_agent_answer(document: str, question: str) -> str:
    """Ask the agent to answer a question based on a document."""
    prompt = f"""Based ONLY on the following document, answer the question.
Be specific and cite numbers/facts from the document.

Document:
{document}

Question: {question}

Answer:"""

    response = client.models.generate_content(model=MODEL, contents=prompt)
    return response.text


# Generate answers for all questions
agent_answers = []
for i, q in enumerate(QUESTIONS):
    answer = get_agent_answer(DOCUMENT, q)
    agent_answers.append(answer)
    print(f"Q{i+1}: {q}")
    print(f"A{i+1}: {answer}")
    print("-" * 60)
    time.sleep(0.1)

Q1: What is India's total installed renewable energy capacity in 2024?
A1: As of 2024, India's total installed renewable energy capacity has reached **190 GW**.
------------------------------------------------------------
Q2: What are the top two renewable energy sources in India and their capacities?
A2: Based on the document:

The top two renewable energy sources in India and their capacities are:
1.  **Solar energy** with 82 GW of installed capacity.
2.  **Wind energy** with 46 GW of installed capacity.
------------------------------------------------------------
Q3: What are the key challenges facing India's renewable energy sector?
A3: Based ONLY on the provided document, the key challenges facing India's renewable energy sector are:

*   **Land acquisition** for large solar farms, particularly in densely populated states.
*   **Grid integration issues** due to the intermittent nature of renewable sources.
*   **Energy storage limitations**, as current battery capacity covers only

## Step 3: Build the Judge LLM

The judge receives the document, the question, and the agent's answer — then scores it on 3 dimensions.

In [14]:
JUDGE_PROMPT = """You are an expert evaluator. You will be given:
- A reference document
- A question about the document
- An answer generated by an AI agent

Score the answer on three dimensions (each 1-5):

1. **Correctness** (1-5): Are the facts in the answer accurate according to the document?
   - 1 = Mostly wrong  2 = Several errors  3 = Some errors  4 = Minor issues  5 = Fully correct

2. **Completeness** (1-5): Does the answer cover all relevant information from the document?
   - 1 = Very incomplete  2 = Missing major points  3 = Covers basics  4 = Mostly complete  5 = Thorough

3. **Relevance** (1-5): Does the answer directly address the question without unnecessary information?
   - 1 = Off-topic  2 = Partially relevant  3 = Relevant but unfocused  4 = Focused  5 = Precisely targeted

Respond ONLY in this exact JSON format:
{"correctness": <int>, "completeness": <int>, "relevance": <int>, "reasoning": "<brief explanation>"}
"""


def judge_answer(document: str, question: str, answer: str) -> dict:
    """Use a second LLM call to evaluate an answer."""
    prompt = f"""{JUDGE_PROMPT}

Document:
{document}

Question: {question}

Agent's Answer: {answer}

Your evaluation (JSON only):"""

    response = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
        ),
    )

    try:
        return json.loads(response.text)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers the case where response.text is None (no text returned)
        return {"correctness": 0, "completeness": 0, "relevance": 0, "reasoning": "Failed to parse judge response"}


print("Judge function ready.")

Judge function ready.
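
Requesting JSON output does not guarantee that the parsed object has every key or that the scores fall within 1-5. A sketch of a validation step that could run on the parsed dict follows; `validate_scores` is a hypothetical helper, and returning `None` for invalid output (rather than zeros) is a design choice so that failures are not averaged in as real scores.

```python
DIMENSIONS = ("correctness", "completeness", "relevance")


def validate_scores(parsed):
    """Return a cleaned score dict, or None if any dimension is missing or outside 1-5."""
    if not isinstance(parsed, dict):
        return None
    cleaned = {}
    for dim in DIMENSIONS:
        value = parsed.get(dim)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        cleaned[dim] = value
    cleaned["reasoning"] = str(parsed.get("reasoning", ""))
    return cleaned


good = validate_scores({"correctness": 5, "completeness": 4, "relevance": 5, "reasoning": "ok"})
bad = validate_scores({"correctness": 7, "completeness": 4, "relevance": 5})
print(good)
print(bad)
```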


## Step 4: Run the Judge

In [15]:
judge_scores = []

print(f"{'Q#':<4} {'Correctness':>12} {'Completeness':>13} {'Relevance':>10}  Reasoning")
print("=" * 80)

for i, (q, a) in enumerate(zip(QUESTIONS, agent_answers)):
    score = judge_answer(DOCUMENT, q, a)
    judge_scores.append(score)

    print(f"Q{i+1:<3} {score.get('correctness', 0):>12} {score.get('completeness', 0):>13} "
          f"{score.get('relevance', 0):>10}  {score.get('reasoning', '')[:50]}")

    time.sleep(0.3)

# Averages
avg_c = sum(s.get("correctness", 0) for s in judge_scores) / len(judge_scores)
avg_comp = sum(s.get("completeness", 0) for s in judge_scores) / len(judge_scores)
avg_r = sum(s.get("relevance", 0) for s in judge_scores) / len(judge_scores)

print("=" * 80)
print(f"AVG  {avg_c:>12.1f} {avg_comp:>13.1f} {avg_r:>10.1f}")

Q#    Correctness  Completeness  Relevance  Reasoning
Q1              5             5          5  The answer accurately provides the exact figure fo
Q2              5             5          5  The answer correctly identifies the top two renewa
Q3              5             5          5  The answer accurately and completely lists all the
Q4              5             5          5  The answer accurately and completely lists all the
Q5              5             5          5  The answer accurately extracts the exact number of
AVG           5.0           5.0        5.0


## Step 5: Compare with Human Scores

Now it's your turn! Read each answer above and give your own scores (1-5) for each dimension. Then we'll compare human vs judge scores.

In [16]:
# ============================================================
# YOUR TURN: Fill in your human scores for each answer (1-5)
# ============================================================

human_scores = [
    {"correctness": 5, "completeness": 5, "relevance": 5},  # Q1
    {"correctness": 5, "completeness": 5, "relevance": 5},  # Q2
    {"correctness": 5, "completeness": 4, "relevance": 5},  # Q3
    {"correctness": 5, "completeness": 4, "relevance": 5},  # Q4
    {"correctness": 5, "completeness": 5, "relevance": 5},  # Q5
]

# ^^ EDIT the scores above after reading the agent's answers ^^

In [None]:
# Compare human vs judge scores
print(f"{'Q#':<4} {'Dimension':<14} {'Human':>6} {'Judge':>6} {'Diff':>6}")
print("=" * 40)

total_diff = 0
total_comparisons = 0

for i, (h, j) in enumerate(zip(human_scores, judge_scores)):
    for dim in ["correctness", "completeness", "relevance"]:
        h_val = h.get(dim, 0)
        j_val = j.get(dim, 0)
        diff = j_val - h_val  # signed: positive means the judge scored higher
        total_diff += abs(diff)
        total_comparisons += 1
        print(f"Q{i+1:<3} {dim:<14} {h_val:>6} {j_val:>6} {diff:>+6}")
    print("-" * 40)

mae = total_diff / total_comparisons
print(f"\nMean Absolute Error (Human vs Judge): {mae:.2f}")
print("Interpretation (rule-of-thumb thresholds):")
if mae < 0.5:
    print("  Excellent agreement: judge scores closely track human scores.")
elif mae < 1.0:
    print("  Good agreement: judge is mostly aligned with human judgment.")
elif mae < 1.5:
    print("  Moderate agreement: judge diverges from human scores on several items.")
else:
    print("  Poor agreement: judge may not be reliable for this task.")
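
MAE measures how far apart the scores are, but not in which direction. The sketch below adds the signed mean difference (positive means the judge scores higher than the human) and the exact-agreement rate. `agreement_stats` is a hypothetical helper and the scores are invented.

```python
def agreement_stats(human: list[int], judge: list[int]) -> dict:
    """Summarize agreement between paired human and judge scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("Score lists must be non-empty and the same length")
    diffs = [j - h for h, j in zip(human, judge)]
    n = len(diffs)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "mean_signed_diff": sum(diffs) / n,  # > 0: judge is more lenient
        "exact_agreement": sum(d == 0 for d in diffs) / n,
    }


# Invented scores for one dimension across four answers
stats = agreement_stats(human=[5, 4, 4, 3], judge=[5, 5, 5, 3])
print(stats)
```

A signed difference close to the MAE suggests the judge errs consistently in one direction, which a rubric change may fix; a signed difference near zero with a high MAE suggests noise instead.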

### Discussion: Exercise 2

**When is LLM-as-Judge reliable?**
- Factual, document-grounded questions (clear right/wrong)
- Well-defined scoring rubrics with specific criteria
- Tasks where the LLM has enough context to verify

**When is it unreliable?**
- Subjective or creative tasks (no single correct answer)
- When the judge LLM has the same blind spots as the agent LLM
- Edge cases where the scoring rubric is ambiguous
- Self-evaluation (same model judging its own output) — tends to inflate scores

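**Pro tip:** Use a *different* or *stronger* model as the judge for better calibration. This notebook uses the same `MODEL` for both the agent and the judge, so treat the uniform 5/5 scores in Step 4 with some caution.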

---

# Exercise 3: Prompt Iteration & Improvement (15 min)

**Goal:** Systematically improve your agent through prompt engineering.

We'll:
1. Use the baseline evaluation from Exercise 1 as a starting point
2. Try 3 different system prompt versions
3. Compare all results in a final table

```
  Prompt V1 (baseline)  ──> Eval ──> Score
  Prompt V2 (examples)  ──> Eval ──> Score
  Prompt V3 (strict)    ──> Eval ──> Score
                                       │
                                       ▼
                              Comparison Table
```

## Step 1: Baseline (already done)

We already ran the baseline in Exercise 1. Let's record those scores.

In [None]:
print(f"Baseline scores (from Exercise 1):")
print(f"  Tool Accuracy:  {baseline_eval['tool_accuracy']:.1f}%")
print(f"  Param Accuracy: {baseline_eval['param_accuracy']:.1f}%")
print(f"  Full Accuracy:  {baseline_eval['full_accuracy']:.1f}%")

## Step 2: Prompt V2 — Add Few-Shot Examples

Adding concrete examples of correct tool selection to the system prompt.

In [None]:
PROMPT_V2 = """You are a helpful assistant with access to tools.
When the user asks a question, decide which tool to call and extract the correct parameters.
Always use a tool to answer — do not answer from memory.

Here are examples of correct tool usage:

User: "What's the weather like in Paris?"
→ Call get_weather with city="Paris"

User: "What is 42 * 18?"
→ Call calculate with expression="42 * 18"

User: "Tell me about quantum computing"
→ Call search_web with query="quantum computing"

User: "Any news about elections?"
→ Call get_news with topic="elections"

Follow these examples closely. Pick the most specific tool for each query."""

print("Prompt V2 ready. Running evaluation...")

In [None]:
eval_v2 = run_evaluation(test_cases, system_prompt=PROMPT_V2, label="V2: Few-Shot Examples")
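
Hand-written few-shot examples can accidentally duplicate test queries, which would inflate the measured accuracy. The sketch below builds the examples section from a list and checks for overlap with the test set. `build_few_shot_prompt` and `overlapping_queries` are hypothetical helpers, and the rendered format only approximates `PROMPT_V2`.

```python
def build_few_shot_prompt(base: str, examples: list[dict]) -> str:
    """Append rendered tool-call examples to a base system prompt."""
    lines = [base, "", "Here are examples of correct tool usage:", ""]
    for ex in examples:
        args = ", ".join(f'{k}="{v}"' for k, v in ex["params"].items())
        lines.append(f'User: "{ex["query"]}"')
        lines.append(f"-> Call {ex['tool']} with {args}")
        lines.append("")
    return "\n".join(lines).rstrip()


def overlapping_queries(examples: list[dict], test_queries: list[str]) -> list[str]:
    """Return example queries that also appear (case-insensitively) in the test set."""
    seen = {q.strip().lower() for q in test_queries}
    return [ex["query"] for ex in examples if ex["query"].strip().lower() in seen]


demo_examples = [
    {"query": "What's the weather like in Paris?", "tool": "get_weather", "params": {"city": "Paris"}},
    {"query": "What is 42 * 18?", "tool": "calculate", "params": {"expression": "42 * 18"}},
]
demo_prompt = build_few_shot_prompt("You are a helpful assistant with access to tools.", demo_examples)
leaks = overlapping_queries(demo_examples, ["Weather in Delhi", "what is 42 * 18?"])
print(demo_prompt)
print("Overlap with test set:", leaks)
```

Keeping the examples in a list also makes it easy to test how accuracy changes with the number or choice of examples.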

## Step 3: Prompt V3 — Add Strict Rules & Constraints

Being more explicit about when to use each tool and adding constraints.

In [None]:
PROMPT_V3 = """You are a helpful assistant with access to exactly 4 tools. You MUST call one tool for every query.

TOOL SELECTION RULES (follow strictly):

1. get_weather — Use ONLY for weather, temperature, forecast, or climate conditions in a specific city.
   Parameter: city = the city name mentioned by the user (use proper capitalization)

2. calculate — Use ONLY for math computations, arithmetic, or numerical calculations.
   Parameter: expression = a valid Python math expression (use *, /, +, -, **, parentheses)

3. get_news — Use ONLY when the user asks about recent news, current events, or headlines.
   Parameter: topic = the main subject/topic (keep it concise, 1-3 words)

4. search_web — Use for ALL other questions (general knowledge, facts, how things work, etc.)
   Parameter: query = the user's question or a refined search query

IMPORTANT:
- If a query mentions weather/temperature → ALWAYS use get_weather
- If a query involves numbers/math → ALWAYS use calculate
- If a query mentions news/headlines/latest updates → ALWAYS use get_news
- For everything else → use search_web
- Extract parameters exactly as specified above"""

print("Prompt V3 ready. Running evaluation...")

In [None]:
eval_v3 = run_evaluation(test_cases, system_prompt=PROMPT_V3, label="V3: Strict Rules")

## Step 4: Your Own Prompt — Prompt V4

Now it's your turn! Write your own system prompt below. Think about what worked in V2 and V3, and combine the best ideas.

In [None]:
# ============================================================
# YOUR TURN: Write your own system prompt
# ============================================================

PROMPT_V4 = """<Write your custom system prompt here>

Think about:
- What worked in V2 (examples) and V3 (strict rules)?
- Can you combine both approaches?
- Are there edge cases you noticed in earlier runs?
"""

# Uncomment the lines below after writing your prompt:
# eval_v4 = run_evaluation(test_cases, system_prompt=PROMPT_V4, label="V4: Your Custom Prompt")

## Step 5: Results Comparison Table

In [None]:
# Collect all evaluation runs
all_evals = [baseline_eval, eval_v2, eval_v3]

# Add V4 if it was run (eval_v4 exists only after the Step 4 cell is uncommented and executed)
if "eval_v4" in globals():
    all_evals.append(eval_v4)

print(f"\n{'=' * 70}")
print(f"  FINAL COMPARISON")
print(f"{'=' * 70}")
print(f"{'Prompt Version':<30} {'Tool Acc':>10} {'Param Acc':>10} {'Full Acc':>10}")
print("-" * 65)

best_full = max(e["full_accuracy"] for e in all_evals)

for e in all_evals:
    marker = "  <-- best" if e["full_accuracy"] == best_full else ""
    print(f"{e['label']:<30} {e['tool_accuracy']:>9.1f}% {e['param_accuracy']:>9.1f}% {e['full_accuracy']:>9.1f}%{marker}")

print("-" * 65)

# Show change from baseline, in percentage points (pp)
baseline_score = baseline_eval["full_accuracy"]
for e in all_evals[1:]:
    delta = e["full_accuracy"] - baseline_score
    print(f"  {e['label']}: {delta:+.1f} pp vs baseline")
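
On a 20-case suite, a single flipped case moves accuracy by 5 points, so aggregate deltas are easier to interpret alongside the specific cases that changed. The sketch below compares two runs case by case. It assumes both runs have the `results` structure returned by `run_evaluation` and cover the same test cases in the same order; `diff_runs` is a hypothetical helper and the data is invented.

```python
def passed(r: dict) -> bool:
    """A case passes only when both the tool and the params match."""
    return r["tool_match"] and r["param_match"]


def diff_runs(before: list[dict], after: list[dict]) -> dict:
    """List queries that were fixed or broken between two runs."""
    fixed, broken = [], []
    for b, a in zip(before, after):
        if b["query"] != a["query"]:
            raise ValueError("Runs are not aligned on the same test cases")
        if not passed(b) and passed(a):
            fixed.append(a["query"])
        elif passed(b) and not passed(a):
            broken.append(a["query"])
    return {"fixed": fixed, "broken": broken}


run_a = [
    {"query": "q1", "tool_match": True, "param_match": True},
    {"query": "q2", "tool_match": True, "param_match": False},
    {"query": "q3", "tool_match": True, "param_match": True},
]
run_b = [
    {"query": "q1", "tool_match": True, "param_match": True},
    {"query": "q2", "tool_match": True, "param_match": True},
    {"query": "q3", "tool_match": False, "param_match": True},
]
changes = diff_runs(run_a, run_b)
print(changes)
```

In this invented example the two runs have identical accuracy, yet one case was fixed and another was broken, which the aggregate table alone would hide.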

### Discussion: Exercise 3

- Which prompt version performed best? Why?
- Did few-shot examples (V2) or strict rules (V3) help more?
- What's the tradeoff between rigid rules and flexible prompts?
- How many iterations would you need in a real project to reach acceptable performance?

---

# Key Takeaways

| Concept | What We Learned |
|---------|----------------|
| **Component-Level Eval** | Test individual pieces (tool selection, params) before testing the whole agent |
| **LLM-as-Judge** | A second LLM can automate scoring, but watch for biases and blind spots |
| **Prompt Iteration** | Comparing prompt versions on a fixed test suite makes improvements (and regressions) measurable |
| **Fuzzy Matching** | Real-world eval needs flexible comparison — exact string match is too rigid |
| **Eval-Driven Development** | Write your tests first, then improve the agent to pass them |

---

## Next Steps

- Add more test cases covering edge cases and ambiguous queries
- Try using a stronger model as the judge (e.g., a larger Gemini variant)
- Build a regression test suite — run it every time you change the prompt
- Explore automated prompt optimization (DSPy, etc.)
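
The regression-suite idea can be sketched as a gate that compares a new run's metrics with a stored baseline and reports any metric that drops by more than a tolerance. `check_regression`, the tolerance value, and the numbers below are assumptions for illustration; in practice the baseline would probably be loaded from a file saved after a known-good run.

```python
def check_regression(baseline: dict, current: dict, tolerance: float = 5.0) -> list[str]:
    """Return a message for every metric that fell by more than `tolerance` points."""
    failures = []
    for metric in ("tool_accuracy", "param_accuracy", "full_accuracy"):
        drop = baseline[metric] - current[metric]
        if drop > tolerance:
            failures.append(
                f"{metric} dropped {drop:.1f} points ({baseline[metric]:.1f} -> {current[metric]:.1f})"
            )
    return failures


# Invented metrics in the same percentage units that run_evaluation returns
stored = {"tool_accuracy": 100.0, "param_accuracy": 95.0, "full_accuracy": 95.0}
new_run = {"tool_accuracy": 100.0, "param_accuracy": 85.0, "full_accuracy": 85.0}
problems = check_regression(stored, new_run)
print(problems or "No regressions")
```

Because model output can vary between runs, a tolerance of roughly one test case (5 points on a 20-case suite) avoids failing on a single noisy flip.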