# V1: Action Autonomy - Router Agent

## The Autonomy Ladder

Building effective AI agents requires a deliberate approach to increasing autonomy:

![Autonomy Ladder](assets/diagrams/autonomy_ladder.png)

**Key Philosophy:** Start with a narrow, well-defined scope. Validate thoroughly. Then expand deliberately.

## What is Action Autonomy?

**Definition:** The agent performs a single, well-defined classification or routing action.

**Use Case:** Customer support routing
- Input: Customer message
- Action: Classify intent and route to department
- Output: Routing decision
- Handoff: Human agent takes over




## Setup

Install required packages and set up environment.

In [2]:
# Install packages
!pip install -q openai pandas python-dotenv
!pip install -q 'arize-phoenix[evals]' openinference-instrumentation-openai

print("Packages installed successfully!")

Packages installed successfully!


In [3]:
# Setup for Colab vs Local
import os
import sys

# Check if running on Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Clone repository for data access
    if not os.path.exists('awesome-generative-ai-guide'):
        !git clone https://github.com/aishwaryanr/awesome-generative-ai-guide.git

    # Navigate to course directory
    os.chdir('awesome-generative-ai-guide/resources/agentic_ai_course_lil')

    # Get API key from Colab secrets
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
else:
    # Local environment - use .env file
    from dotenv import load_dotenv
    load_dotenv()

# Verify API key is set
if not os.getenv('OPENAI_API_KEY'):
    raise ValueError("Please set OPENAI_API_KEY in Colab Secrets or .env file")

print("Environment setup complete!")

Cloning into 'awesome-generative-ai-guide'...
remote: Enumerating objects: 2054, done.
remote: Counting objects: 100% (595/595), done.
remote: Compressing objects: 100% (275/275), done.
remote: Total 2054 (delta 438), reused 356 (delta 319), pack-reused 1459 (from 2)
Receiving objects: 100% (2054/2054), 150.43 MiB | 16.97 MiB/s, done.
Resolving deltas: 100% (1092/1092), done.
Environment setup complete!


## Building the Router Agent

### Architecture

Our V1 agent has a simple 4-step process:

<img src="assets/diagrams/v1_architecture.png" alt="V1 Router Architecture" width="400"/>

<img src="assets/diagrams/v1_data_flow.png" alt="Data Flow Through System" width="400"/>

### Key Design Choices

1. **Model:** GPT-4o-mini (cost-effective for classification)
2. **Temperature:** 0.1 (consistent results)
3. **Output:** JSON mode (structured response)
4. **Fallback:** ESCALATION if invalid department
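
Design choices 3 and 4 can be tried in isolation, without calling the API. The sketch below is only an illustration: `parse_department` and `VALID_DEPARTMENTS` are hypothetical names that do not appear in this notebook. It shows one way to turn a raw JSON string into a department name while falling back to ESCALATION on anything unexpected.

```python
import json

# Department names as listed in this notebook.
VALID_DEPARTMENTS = {
    "BILLING", "RETURNS", "TECHNICAL_SUPPORT", "ORDER_STATUS",
    "PRODUCT_INQUIRY", "ACCOUNT_MANAGEMENT", "ESCALATION",
}

def parse_department(raw_response: str) -> str:
    """Return a valid department name, or ESCALATION if the response is unusable."""
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return "ESCALATION"  # not valid JSON at all
    if not isinstance(payload, dict):
        return "ESCALATION"  # valid JSON, but not an object
    # `or ""` covers a missing key as well as an explicit null
    name = str(payload.get("department") or "").strip().upper()
    return name if name in VALID_DEPARTMENTS else "ESCALATION"

print(parse_department('{"department": "billing", "reasoning": "double charge"}'))  # BILLING
print(parse_department('{"department": "SHIPPING"}'))  # ESCALATION
print(parse_department("not json"))  # ESCALATION
```

Routing unknown values to ESCALATION means a malformed model response still reaches a human instead of raising an exception mid-conversation.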

Let's build it step by step.

In [4]:
# Step 1: Define data structures

from enum import Enum
from dataclasses import dataclass

class Department(Enum):
    """Available departments for routing."""
    BILLING = "billing"
    RETURNS = "returns"
    TECHNICAL_SUPPORT = "technical_support"
    ORDER_STATUS = "order_status"
    PRODUCT_INQUIRY = "product_inquiry"
    ACCOUNT_MANAGEMENT = "account_management"
    ESCALATION = "escalation"

@dataclass
class RoutingDecision:
    """Result of routing decision."""
    department: Department
    reasoning: str
    customer_message: str

print("Data structures defined!")
print(f"\nAvailable departments: {[d.name for d in Department]}")

Data structures defined!

Available departments: ['BILLING', 'RETURNS', 'TECHNICAL_SUPPORT', 'ORDER_STATUS', 'PRODUCT_INQUIRY', 'ACCOUNT_MANAGEMENT', 'ESCALATION']


In [5]:
# Step 2: Define Prompt 1 (baseline)

# Starting with minimal prompt - no department descriptions
# We'll discover what's missing through evaluation

SYSTEM_PROMPT_1 = """Route customer messages to departments.

Available departments: BILLING, RETURNS, TECHNICAL_SUPPORT, ORDER_STATUS, PRODUCT_INQUIRY, ACCOUNT_MANAGEMENT, ESCALATION

Respond with JSON:
{
    "department": "DEPARTMENT_NAME",
    "reasoning": "Your reasoning"
}
"""

print("Prompt 1 (baseline) defined!")
print(f"Prompt length: {len(SYSTEM_PROMPT_1)} chars")
print("\nNote: This is intentionally minimal. We'll see what happens...")

Prompt 1 (baseline) defined!
Prompt length: 258 chars

Note: This is intentionally minimal. We'll see what happens...


In [6]:
# Step 3: Build the RouterAgent class

import json
from openai import OpenAI

class RouterAgent:
    """V1 Action Autonomy Agent - Routes customer messages to departments."""

    def __init__(self, system_prompt):
        """Initialize agent with a system prompt."""
        self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.model = "gpt-4o-mini"
        self.system_prompt = system_prompt

    def route(self, customer_message: str) -> RoutingDecision:
        """Route a customer message to appropriate department."""

        # Step 1: Call OpenAI API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": customer_message}
            ],
            temperature=0.1,
            response_format={"type": "json_object"}
        )

        # Step 2: Parse JSON response
        result = json.loads(response.choices[0].message.content)

        # Step 3: Validate department
        # Coerce to str so a null or non-string value falls through to ESCALATION
        dept_name = str(result.get("department") or "ESCALATION").strip().upper()
        try:
            department = Department[dept_name]
        except KeyError:
            department = Department.ESCALATION

        # Step 4: Return structured decision
        return RoutingDecision(
            department=department,
            reasoning=result.get("reasoning", "No reasoning provided"),
            customer_message=customer_message
        )

print("RouterAgent class defined!")
print("Ready to route customer messages.")

RouterAgent class defined!
Ready to route customer messages.


## Demo: See the Agent in Action

Let's test our agent with a few examples before formal evaluation.

In [7]:
# Initialize agent with Prompt 1
agent = RouterAgent(system_prompt=SYSTEM_PROMPT_1)

# Test messages covering different departments
test_messages = [
    "I was charged twice for my order!",
    "Where is my package? It's been 2 weeks!",
    "I want to return these shoes, they don't fit",
    "Is the blue wireless headphone in stock?",
    "I can't log into my account, it says password invalid",
    "This is ridiculous! I've called 3 times and nobody helps me!"
]

print("=" * 70)
print("ROUTER AGENT DEMO (Prompt 1 Baseline)")
print("=" * 70)
print()

for i, message in enumerate(test_messages, 1):
    print(f"[{i}] Customer: {message}")

    decision = agent.route(message)

    print(f"    -> Department: {decision.department.name}")
    print(f"    -> Reasoning: {decision.reasoning}")
    print()

print("Demo looks good! But let's evaluate systematically...")

ROUTER AGENT DEMO (Prompt 1 Baseline)

[1] Customer: I was charged twice for my order!
    -> Department: BILLING
    -> Reasoning: The customer is reporting an issue related to being charged twice, which falls under billing inquiries.

[2] Customer: Where is my package? It's been 2 weeks!
    -> Department: ORDER_STATUS
    -> Reasoning: The customer is inquiring about the status of their package, which falls under order status inquiries.

[3] Customer: I want to return these shoes, they don't fit
    -> Department: RETURNS
    -> Reasoning: The customer is requesting to return a product due to sizing issues, which falls under the returns department.

[4] Customer: Is the blue wireless headphone in stock?
    -> Department: PRODUCT_INQUIRY
    -> Reasoning: The customer is asking about the availability of a specific product, which falls under product inquiries.

[5] Customer: I can't log into my account, it says password invalid
    -> Department: TECHNICAL_SUPPORT
    -> Reasoning: T

---

## 🎬 End of Chapter

---

## Evaluation Setup

### Why Evaluate?

The demo looked promising, but six hand-picked messages are not enough. We need systematic evaluation:
- Does it handle edge cases?
- What's the accuracy across all departments?
- Where does it fail and why?

### Evaluation Metric: Routing Accuracy

For V1 (Action Autonomy), routing accuracy is the right metric:
- **Clear ground truth:** Each message has one correct department
- **Binary outcome:** Either correct or incorrect
- **Easy to interpret:** 85% accuracy means 85% of routings are correct
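
As a minimal sketch of the metric, the functions below compute overall and per-department accuracy from `(expected, predicted)` pairs. The function names and the sample pairs are illustrative assumptions, not the notebook's own evaluation code or data.

```python
from collections import defaultdict

def routing_accuracy(pairs):
    """Fraction of (expected, predicted) pairs where the two labels match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no test cases")
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)

def per_department_accuracy(pairs):
    """Accuracy grouped by the expected department."""
    totals, hits = defaultdict(int), defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        hits[expected] += int(expected == predicted)
    return {dept: hits[dept] / totals[dept] for dept in totals}

# Illustrative labels only, not the notebook's test set.
sample = [
    ("BILLING", "BILLING"),
    ("BILLING", "RETURNS"),
    ("RETURNS", "RETURNS"),
    ("ORDER_STATUS", "ORDER_STATUS"),
]
print(f"{routing_accuracy(sample):.0%}")  # 3 of 4 correct -> 75%
print(per_department_accuracy(sample))
```

Per-department numbers matter because a high overall score can hide a department that is routed incorrectly half the time.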

### Test Dataset

30 test cases covering:
- All 7 departments
- Simple cases (clear keywords)
- Ambiguous cases (multiple possible departments)
- Edge cases (unusual requests)
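
Before running an evaluation, it can help to check that the test file is well-formed and to see which departments are under-represented. The sketch below uses only the standard library and an in-memory CSV. The column names match the test file loaded later in this notebook, while `check_test_cases` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

DEPARTMENTS = {
    "BILLING", "RETURNS", "TECHNICAL_SUPPORT", "ORDER_STATUS",
    "PRODUCT_INQUIRY", "ACCOUNT_MANAGEMENT", "ESCALATION",
}
REQUIRED_COLUMNS = {"test_id", "customer_message", "expected_department", "category"}

def check_test_cases(csv_text: str) -> Counter:
    """Validate columns and labels; return the number of cases per department."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = Counter()
    for row in reader:
        dept = row["expected_department"]
        if dept not in DEPARTMENTS:
            raise ValueError(f"{row['test_id']}: unknown department {dept!r}")
        counts[dept] += 1
    return counts

# Two rows in the same shape as the notebook's test file.
sample_csv = """test_id,customer_message,expected_department,category
TC001,I was charged twice for my order,BILLING,duplicate_charge
TC003,I want to return these shoes wrong size,RETURNS,size_exchange
"""
counts = check_test_cases(sample_csv)
print(counts)
print("Departments with no test cases:", sorted(DEPARTMENTS - set(counts)))
```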

### Evaluation Workflow



In [13]:
# Load test cases
import pandas as pd

# Load from the repository's data directory.
# On Colab the setup cell already changed into the course directory, so a
# relative path works there and in a local checkout run from that directory.
test_df = pd.read_csv('data/v1_test_cases.csv')

print(f"Loaded {len(test_df)} test cases")
print(f"\nColumns: {list(test_df.columns)}")
print(f"\nDepartment distribution:")
print(test_df['expected_department'].value_counts())

# Show a few examples
print(f"\nSample test cases:")
print(test_df[['test_id', 'customer_message', 'expected_department', 'category']].head())

Loaded 30 test cases

Columns: ['test_id', 'customer_message', 'expected_department', 'category']

Department distribution:
expected_department
BILLING               6
TECHNICAL_SUPPORT     6
RETURNS               5
PRODUCT_INQUIRY       5
ACCOUNT_MANAGEMENT    4
ORDER_STATUS          2
ESCALATION            2
Name: count, dtype: int64

Sample test cases:
  test_id                                   customer_message  \
0   TC001                   I was charged twice for my order   
1   TC002  Where is my package? Tracking says delivered b...   
2   TC003            I want to return these shoes wrong size   
3   TC004             Do you have the iPhone 15 case in red?   
4   TC005                        I can't log into my account   

  expected_department          category  
0             BILLING  duplicate_charge  
1        ORDER_STATUS  missing_delivery  
2             RETURNS     size_exchange  
3     PRODUCT_INQUIRY      availability  
4   TECHNICAL_SUPPORT       login_issue  


## Setup Arize Phoenix for Observability

### Why Phoenix?

Phoenix captures every LLM call as a "trace":
- Input: Customer message
- Prompt: System prompt sent to LLM
- Output: Department and reasoning
- Metadata: Tokens, latency, cost

This lets us:
1. See exactly what the agent received and what it returned
2. Understand why failures happen
3. Identify patterns in errors
4. Make targeted improvements
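
To make the idea of a trace concrete, here is a library-free sketch of the kind of record described above. It is not Phoenix's data model or API: `TraceRecord`, `traced_call`, and `fake_llm` are hypothetical names, and the LLM is replaced by a stub so the example runs offline.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceRecord:
    """Hypothetical stand-in for the fields a trace holds."""
    customer_message: str
    system_prompt: str
    output: str = ""
    latency_s: float = 0.0
    attributes: dict = field(default_factory=dict)

TRACES = []

def traced_call(system_prompt, customer_message, llm_fn):
    """Call llm_fn and record what went in, what came out, and how long it took."""
    record = TraceRecord(customer_message=customer_message, system_prompt=system_prompt)
    start = time.perf_counter()
    try:
        record.output = llm_fn(system_prompt, customer_message)
        return record.output
    finally:
        # Runs on success and on failure, so failed calls are recorded too.
        record.latency_s = time.perf_counter() - start
        TRACES.append(record)

def fake_llm(system_prompt, customer_message):
    """Stub that returns a fixed response instead of calling a model."""
    return '{"department": "BILLING", "reasoning": "stubbed"}'

traced_call("Route customer messages to departments.", "I was charged twice!", fake_llm)
print(len(TRACES), TRACES[0].output)
```

A tracing tool does this bookkeeping automatically for every instrumented call, which is why no such wrapper appears in the agent code.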

In [14]:
# Start Phoenix (Colab-compatible setup)
import os

# Configure Phoenix for Colab/local compatibility
os.environ["PHOENIX_HOST"] = "0.0.0.0"
os.environ["PHOENIX_PORT"] = "6006"

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

print("Starting Arize Phoenix...")
session = px.launch_app()  # don't pass port parameter
print("Phoenix session url:", session.url)

# For Google Colab compatibility
try:
    from google.colab import output
    output.serve_kernel_port_as_window(6006)
    print("✓ Phoenix running on Colab at port 6006")
except ImportError:
    print("✓ Phoenix running locally at http://localhost:6006")

print("\nClick the link above to open Phoenix UI in a new tab.")
print("Keep this tab open while running evaluations.")

Starting Arize Phoenix...


🌍 To view the Phoenix app in your browser, visit https://jr5w3q56kyc1-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix
Phoenix session url: https://jr5w3q56kyc1-496ff2e9c6d22116-6006-colab.googleusercontent.com/
Try `serve_kernel_port_as_iframe` instead.

✓ Phoenix running on Colab at port 6006

Click the link above to open Phoenix UI in a new tab.
Keep this tab open while running evaluations.


## Run Prompt 1 Evaluation

Let's evaluate the baseline (Prompt 1) to establish our starting point.

In [15]:
# Enable tracing for Prompt 1
project_name = "V1_action_autonomy_prompt_1"
print(f"Enabling tracing for project: {project_name}")

tracer_provider = register(project_name=project_name)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

print("Tracing enabled! All API calls will be captured in Phoenix.")

Enabling tracing for project: V1_action_autonomy_prompt_1
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: V1_action_autonomy_prompt_1
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

Tracing enabled! All API calls will be captured in Phoenix.


In [16]:
# Run Prompt 1 evaluation
from dataclasses import dataclass
from collections import defaultdict
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

@dataclass
class EvalResult:
    """Result of a single evaluation."""
    test_id: str
    message: str
    expected: str
    predicted: str
    correct: bool
    reasoning: str
    category: str

# Initialize agent with Prompt 1
agent_p1 = RouterAgent(system_prompt=SYSTEM_PROMPT_1)
tracer = trace.get_tracer(__name__)

results_p1 = []

print("Running Prompt 1 evaluation on 30 test cases...")
print("(Each routing decision is being traced in Phoenix)\n")

for idx, row in test_df.iterrows():
    i = idx + 1
    test_id = row['test_id']

    # Create custom span for better Phoenix visualization
    with tracer.start_as_current_span(f"test_case_{test_id}") as span:
        span.set_attribute("test.id", test_id)
        span.set_attribute("test.category", row['category'])
        span.set_attribute("test.expected_department", row['expected_department'])

        # Route the message
        decision = agent_p1.route(row['customer_message'])
        correct = decision.department.name == row['expected_department']

        # Record result in span
        span.set_attribute("result.predicted_department", decision.department.name)
        span.set_attribute("result.correct", correct)

        if correct:
            span.set_status(Status(StatusCode.OK))
        else:
            span.set_status(Status(StatusCode.ERROR, "Incorrect routing"))
            span.set_attribute("error.expected", row['expected_department'])
            span.set_attribute("error.got", decision.department.name)

        # Store result
        result = EvalResult(
            test_id=test_id,
            message=row['customer_message'],
            expected=row['expected_department'],
            predicted=decision.department.name,
            correct=correct,
            reasoning=decision.reasoning,
            category=row['category']
        )
        results_p1.append(result)

        # Show progress
        status = "PASS" if correct else "FAIL"
        print(f"[{i}/30] {test_id}: {status} (Expected: {result.expected}, Got: {result.predicted})")

print("\nEvaluation complete!")

Running Prompt 1 evaluation on 30 test cases...
(Each routing decision is being traced in Phoenix)

[1/30] TC001: PASS (Expected: BILLING, Got: BILLING)
[2/30] TC002: PASS (Expected: ORDER_STATUS, Got: ORDER_STATUS)
[3/30] TC003: PASS (Expected: RETURNS, Got: RETURNS)
[4/30] TC004: PASS (Expected: PRODUCT_INQUIRY, Got: PRODUCT_INQUIRY)
[5/30] TC005: FAIL (Expected: TECHNICAL_SUPPORT, Got: ACCOUNT_MANAGEMENT)
[6/30] TC006: PASS (Expected: ACCOUNT_MANAGEMENT, Got: ACCOUNT_MANAGEMENT)
[7/30] TC007: PASS (Expected: ESCALATION, Got: ESCALATION)
[8/30] TC008: FAIL (Expected: BILLING, Got: RETURNS)
[9/30] TC009: PASS (Expected: TECHNICAL_SUPPORT, Got: TECHNICAL_SUPPORT)
[10/30] TC010: PASS (Expected: ORDER_STATUS, Got: ORDER_STATUS)
[11/30] TC011: PASS (Expected: RETURNS, Got: RETURNS)
[12/30] TC012: PASS (Expected: PRODUCT_INQUIRY, Got: PRODUCT_INQUIRY)
[13/30] TC013: PASS (Expected: ACCOUNT_MANAGEMENT, Got: ACCOUNT_MANAGEMENT)
[14/30] TC014: PASS (Expected: BILLING, Got: BILLING)
[15/30] TC

In [17]:
# Compute Prompt 1 metrics
total = len(results_p1)
correct = sum(1 for r in results_p1 if r.correct)
accuracy = correct / total

# Per-department accuracy
dept_correct = defaultdict(int)
dept_total = defaultdict(int)
for r in results_p1:
    dept_total[r.expected] += 1
    if r.correct:
        dept_correct[r.expected] += 1

print("=" * 70)
print("PROMPT 1 EVALUATION RESULTS")
print("=" * 70)

print(f"\nOverall Accuracy: {accuracy:.1%} ({correct}/{total} correct)")

print(f"\nPer-Department Accuracy:")
for dept in sorted(dept_total.keys()):
    acc = dept_correct[dept] / dept_total[dept]
    bar = "█" * int(acc * 20) + "░" * (20 - int(acc * 20))
    print(f"  {dept:20} {bar} {acc:.0%}")

# Show errors
errors = [r for r in results_p1 if not r.correct]
if errors:
    print(f"\nErrors ({len(errors)} cases):")
    for r in errors:
        print(f"\n  [{r.test_id}] {r.message[:60]}...")
        print(f"  Expected: {r.expected} -> Got: {r.predicted}")
        print(f"  Category: {r.category}")

PROMPT 1 EVALUATION RESULTS

Overall Accuracy: 73.3% (22/30 correct)

Per-Department Accuracy:
  ACCOUNT_MANAGEMENT   ███████████████░░░░░ 75%
  BILLING              ██████████░░░░░░░░░░ 50%
  ESCALATION           ████████████████████ 100%
  ORDER_STATUS         ████████████████████ 100%
  PRODUCT_INQUIRY      ████████████████░░░░ 80%
  RETURNS              ████████████████████ 100%
  TECHNICAL_SUPPORT    ██████████░░░░░░░░░░ 50%

Errors (8 cases):

  [TC005] I can't log into my account...
  Expected: TECHNICAL_SUPPORT -> Got: ACCOUNT_MANAGEMENT
  Category: login_issue

  [TC008] My refund still hasn't shown up it's been 2 weeks...
  Expected: BILLING -> Got: RETURNS
  Category: refund_status

  [TC018] I forgo

---

## 🎬 End of Chapter

---

---

# 📊 Continuous Calibration (CC) Phase

**Goal:** Understand WHY the system fails and design metrics to measure performance.

**In this phase:**
- Observe failures in Phoenix traces
- Analyze error patterns
- Design evaluation metrics
- Identify root causes

**Output:** Clear understanding of what to fix and how to measure it.

---

## Analyze Failures in Arize Phoenix

Now comes the key part: **Understanding WHY failures happened**

### How to Use Phoenix

1. Open the Phoenix URL from above
2. Click "Traces" in the left sidebar
3. Select project "V1_action_autonomy_prompt_1"
4. Filter for failed cases (red status)
5. Click on each trace to see:
   - Customer message
   - System prompt sent to LLM
   - LLM's response (department + reasoning)
   - Why it was incorrect

### Common Failure Patterns

Look for patterns like:
- **Ambiguous keywords:** "refund" could be BILLING or RETURNS
- **Multi-issue messages:** Customer mentions both shipping and refund
- **Missing context:** Prompt 1 lacks department descriptions
- **Over-escalation:** Negative sentiment triggers ESCALATION unnecessarily

**Exercise:** Analyze 3-5 failed traces and note patterns you observe.
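
One way to surface patterns while working through the exercise is to count which department pairs get confused most often. The sketch below assumes failures are available as `(test_id, expected, predicted)` tuples; `confusion_pairs` is a hypothetical helper, and the third row is a made-up placeholder added only to show grouping.

```python
from collections import Counter

def confusion_pairs(failures):
    """Count failures by (expected, predicted) pair, most frequent first."""
    counts = Counter((expected, predicted) for _, expected, predicted in failures)
    return counts.most_common()

# The first two rows come from the Prompt 1 run above; the third is a placeholder.
failures = [
    ("TC005", "TECHNICAL_SUPPORT", "ACCOUNT_MANAGEMENT"),
    ("TC008", "BILLING", "RETURNS"),
    ("EXAMPLE", "TECHNICAL_SUPPORT", "ACCOUNT_MANAGEMENT"),
]

for (expected, predicted), n in confusion_pairs(failures):
    print(f"{expected} -> {predicted}: {n}")
```

A pair that shows up repeatedly points to a boundary the prompt does not define, which is a candidate for an explicit disambiguation rule.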


---

## 🎬 End of Chapter


---

---

# 🚀 Continuous Deployment (CD) Phase

**Goal:** Improve the system based on CC insights and measure impact.

**In this phase:**
- Make targeted improvements (Prompt 2)
- Re-evaluate with same metrics
- Compare before/after performance
- Validate improvements worked

**Output:** Better system with measured improvements.

---

## Improve to Prompt 2

Based on Phoenix analysis, we identified these issues in Prompt 1:

1. **No department descriptions** → LLM guesses based on keywords alone
2. **Ambiguous boundaries** → "refund status" routed to RETURNS instead of BILLING
3. **Password resets** → Routed to ACCOUNT_MANAGEMENT instead of TECHNICAL_SUPPORT

### Prompt 2 Improvements

Prompt 2 adds:
- Clear descriptions for each department
- Explicit disambiguation rules
- Examples of edge cases

Let's see if it helps!

### The Iterative Improvement Cycle



In [18]:
# Now let's create Prompt 2 with improvements based on what we learned

SYSTEM_PROMPT_2 = """Route customer messages to departments.

Available departments:
- BILLING: Payment issues, charges, refunds, refund status, account balances, fees
- RETURNS: Return requests, exchanges, return status, return policies
- TECHNICAL_SUPPORT: Login problems, password reset issues, website errors, checkout failures
- ORDER_STATUS: Order tracking, shipping updates, delivery questions, missing items
- PRODUCT_INQUIRY: Product questions, specifications, availability, pricing
- ACCOUNT_MANAGEMENT: Profile updates, changing saved payment methods, preferences, address changes
- ESCALATION: Very upset customers demanding managers, supervisor requests

Important:
- Login/password problems = TECHNICAL_SUPPORT (not ACCOUNT_MANAGEMENT)
- Updating payment methods = ACCOUNT_MANAGEMENT (not BILLING)
- Refund status = BILLING (not RETURNS)

Respond with JSON:
{
    "department": "DEPARTMENT_NAME",
    "reasoning": "Your reasoning"
}
"""

print("Prompt 2 (improved) created with improvements!")
print(f"\nPrompt 1 length: {len(SYSTEM_PROMPT_1)} chars")
print(f"Prompt 2 length: {len(SYSTEM_PROMPT_2)} chars")
print(f"\nAdded {len(SYSTEM_PROMPT_2) - len(SYSTEM_PROMPT_1)} chars of context")

Prompt 2 (improved) created with improvements!

Prompt 1 length: 258 chars
Prompt 2 length: 926 chars

Added 668 chars of context
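
Prompt 2 above is a hand-written string. As the list of departments and rules grows, one option is to assemble the prompt from data so each description can be reviewed and changed on its own. The sketch below is an assumption about how that could look, not part of the notebook: `build_system_prompt` and the constant names are hypothetical, and only two departments are included for brevity.

```python
# Descriptions and rule copied from Prompt 2; only two departments shown.
DEPARTMENT_DESCRIPTIONS = {
    "BILLING": "Payment issues, charges, refunds, refund status, account balances, fees",
    "RETURNS": "Return requests, exchanges, return status, return policies",
}
DISAMBIGUATION_RULES = ["Refund status = BILLING (not RETURNS)"]

JSON_INSTRUCTIONS = (
    "Respond with JSON:\n"
    "{\n"
    '    "department": "DEPARTMENT_NAME",\n'
    '    "reasoning": "Your reasoning"\n'
    "}\n"
)

def build_system_prompt(descriptions, rules):
    """Assemble a routing prompt from department descriptions and rules."""
    lines = ["Route customer messages to departments.", "", "Available departments:"]
    lines += [f"- {name}: {text}" for name, text in descriptions.items()]
    if rules:
        lines += ["", "Important:"]
        lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines) + "\n\n" + JSON_INSTRUCTIONS

print(build_system_prompt(DEPARTMENT_DESCRIPTIONS, DISAMBIGUATION_RULES))
```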


In [19]:
# Enable tracing for Prompt 2 (separate project)

# Uninstrument previous tracer to avoid overwriting Prompt 1 traces
OpenAIInstrumentor().uninstrument()

project_name_p2 = "V1_action_autonomy_prompt_2"
print(f"Enabling tracing for project: {project_name_p2}")

tracer_provider_p2 = register(project_name=project_name_p2)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider_p2)

print("Tracing enabled for Prompt 2!")



Enabling tracing for project: V1_action_autonomy_prompt_2
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: V1_action_autonomy_prompt_2
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

Tracing enabled for Prompt 2!


In [20]:
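# Run Prompt 2 evaluation
agent_p2 = RouterAgent(system_prompt=SYSTEM_PROMPT_2)
results_p2 = []

# Take the tracer from the Prompt 2 provider. The earlier `tracer` was created
# under the Prompt 1 provider, so its spans may still be attributed to that project.
tracer_p2 = tracer_provider_p2.get_tracer(__name__)

print("Running Prompt 2 evaluation on 30 test cases...\n")

for idx, row in test_df.iterrows():
    i = idx + 1
    test_id = row['test_id']

    with tracer_p2.start_as_current_span(f"test_case_{test_id}") as span: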
        span.set_attribute("test.id", test_id)
        span.set_attribute("test.expected_department", row['expected_department'])

        decision = agent_p2.route(row['customer_message'])
        correct = decision.department.name == row['expected_department']

        span.set_attribute("result.correct", correct)

        if correct:
            span.set_status(Status(StatusCode.OK))
        else:
            span.set_status(Status(StatusCode.ERROR, "Incorrect routing"))
            span.set_attribute("error.expected", row['expected_department'])
            span.set_attribute("error.got", decision.department.name)

        result = EvalResult(
            test_id=test_id,
            message=row['customer_message'],
            expected=row['expected_department'],
            predicted=decision.department.name,
            correct=correct,
            reasoning=decision.reasoning,
            category=row['category']
        )
        results_p2.append(result)

        status = "PASS" if correct else "FAIL"
        print(f"[{i}/30] {test_id}: {status}")

print("\nPrompt 2 evaluation complete!")

Running Prompt 2 evaluation on 30 test cases...

[1/30] TC001: PASS
[2/30] TC002: PASS
[3/30] TC003: PASS
[4/30] TC004: PASS
[5/30] TC005: PASS
[6/30] TC006: PASS
[7/30] TC007: PASS
[8/30] TC008: PASS
[9/30] TC009: PASS
[10/30] TC010: PASS
[11/30] TC011: PASS
[12/30] TC012: PASS
[13/30] TC013: PASS
[14/30] TC014: PASS
[15/30] TC015: PASS
[16/30] TC016: PASS
[17/30] TC017: PASS
[18/30] TC018: PASS
[19/30] TC019: PASS
[20/30] TC020: PASS
[21/30] TC021: FAIL
[22/30] TC022: FAIL
[23/30] TC023: PASS
[24/30] TC024: PASS
[25/30] TC025: PASS
[26/30] TC026: PASS
[27/30] TC027: PASS
[28/30] TC028: PASS
[29/30] TC029: PASS
[30/30] TC030: PASS

Prompt 2 evaluation complete!


In [21]:
# Compare Prompt 1 vs Prompt 2
# Recompute the Prompt 1 counts from results_p1: the evaluation loops reuse
# the name `correct` for a per-case boolean, which overwrites the earlier count.
total_p1 = len(results_p1)
correct_p1 = sum(1 for r in results_p1 if r.correct)
accuracy_p1 = correct_p1 / total_p1

total_p2 = len(results_p2)
correct_p2 = sum(1 for r in results_p2 if r.correct)
accuracy_p2 = correct_p2 / total_p2

print("=" * 70)
print("PROMPT 1 vs PROMPT 2 COMPARISON")
print("=" * 70)

print("\nOverall Accuracy:")
print(f"  Prompt 1: {accuracy_p1:.1%} ({correct_p1}/{total_p1})")
print(f"  Prompt 2: {accuracy_p2:.1%} ({correct_p2}/{total_p2})")
improvement = accuracy_p2 - accuracy_p1
print(f"  Improvement: {improvement:+.1%}")

# Which errors got fixed?
p1_errors = {r.test_id for r in results_p1 if not r.correct}
p2_errors = {r.test_id for r in results_p2 if not r.correct}

fixed = p1_errors - p2_errors
still_failing = p1_errors & p2_errors

if fixed:
    print(f"\nFixed in Prompt 2 ({len(fixed)} cases):")
    for test_id in sorted(fixed):
        r = next(r for r in results_p1 if r.test_id == test_id)
        print(f"  [{test_id}] {r.message[:50]}...")

if still_failing:
    print(f"\nStill Failing ({len(still_failing)} cases):")
    for test_id in sorted(still_failing):
        r = next(r for r in results_p2 if r.test_id == test_id)
        print(f"  [{test_id}] {r.message[:50]}...")

PROMPT 1 vs PROMPT 2 COMPARISON

Overall Accuracy:
  Prompt 1: 73.3% (22/30)
  Prompt 2: 93.3% (28/30)
  Improvement: +20.0%

Fixed in Prompt 2 (6 cases):
  [TC005] I can't log into my account...
  [TC008] My refund still hasn't shown up it's been 2 weeks...
  [TC018] I forgot my password and the reset email isn't com...
  [TC024] I need to update my credit card on file...
  [TC029] I reset my password but still can't access my acco...
  [TC030] Why was I charged a restocking fee?...

Still Failing (2 cases):
  [TC021] Why didn't I get my loyalty points for this purcha...
  [TC022] Your prices are way too high! This is ridiculous!...
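
The comparison above reports fixed and still-failing cases. A third group is worth checking whenever a prompt changes: cases that passed before and fail now. The sketch below adds that check with plain set operations. `compare_runs` is a hypothetical helper, and the test IDs are the failing cases reported in this notebook's outputs.

```python
def compare_runs(errors_before, errors_after):
    """Split failing test IDs into fixed, still failing, and newly regressed."""
    return {
        "fixed": sorted(errors_before - errors_after),
        "still_failing": sorted(errors_before & errors_after),
        "regressed": sorted(errors_after - errors_before),
    }

# Failing test IDs taken from the Prompt 1 and Prompt 2 outputs above.
p1_failures = {"TC005", "TC008", "TC018", "TC021", "TC022", "TC024", "TC029", "TC030"}
p2_failures = {"TC021", "TC022"}

report = compare_runs(p1_failures, p2_failures)
for group, test_ids in report.items():
    print(f"{group}: {test_ids}")
```

In this run the regressed group is empty, so Prompt 2 fixed six cases without breaking any that previously passed.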


## Key Takeaways

### What We Built

A V1 Action Autonomy agent that:
- Routes customer messages to departments
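- Achieves 93.3% accuracy (28/30) on the 30-case test set with Prompt 2, up from 73.3% (22/30) with the baseline prompt
- Provides reasoning for decisions
- Falls back to ESCALATION when the model returns an invalid department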

### What We Learned

1. **Start Simple:** Action autonomy is perfect for classification tasks
2. **Observability is Key:** Phoenix traces revealed failure patterns
3. **Iterate Based on Data:** Prompt 1 → Prompt 2 improvements were targeted
4. **Clear Metrics Matter:** Routing accuracy was appropriate for this task

### When to Use V1 (Action Autonomy)

V1 is appropriate when:
- Task is well-defined classification/routing
- Success criterion is clear (correct category)
- Human takes over after classification
- No multi-step reasoning required

### When V1 is NOT Enough

V1 limitations:
- Can't solve multi-step problems
- Can't retrieve relevant documentation
- Can't generate action plans
- Can't handle context from multiple sources

**That's where V2 comes in!**

### Next Steps

In the V2 notebook, we'll expand scope to **Planning Autonomy**:
- Retrieve relevant SOPs using keyword search
- Generate multi-step action plans
- Evaluate with more complex metrics
- Learn when to add vs avoid complexity

**Key Philosophy:** V1 isn't "bad" - it's appropriately scoped. V2 expands scope deliberately with proper guardrails.