# V2: Planning Autonomy

## From Single Actions to Multi-Step Plans

In V1, we built an **action autonomy** agent that performed a single classification: routing customer messages to departments.

Now we move up the autonomy ladder to **planning autonomy**: generating multi-step action plans by retrieving relevant procedures and reasoning over them.

### What You'll Learn

1. **RAG Systems**: Use BM25 to retrieve relevant Standard Operating Procedures (SOPs)
2. **Multi-Step Planning**: Generate detailed action plans instead of single actions
3. **Custom Metrics**: Design evaluation metrics from observed failures
4. **LLM-as-Judge**: Use GPT-4o to grade generated plans (gpt-4o output in Prompt 1, gpt-5 output in Prompt 2)
5. **Trace-First Evaluation**: Observe → Discover → Measure → Improve

### The Incremental Building Story

**V1 Achievement:**
- Built routing from 73% → 93% accuracy
- Prompt 1 (baseline) → Prompt 2 (improved with descriptions)

**V2 Builds On V1:**
- **KEEPS** V1's 93% routing (don't regress!)
- **ADDS** BM25 retrieval to find relevant SOPs
- **ADDS** multi-step plan generation

**Key:** Each version builds on the previous one. We never start from scratch!

## V2 Architecture

Our V2 Planning Autonomy agent builds on V1's routing by adding retrieval and multi-step planning:

![V2 Architecture](assets/diagrams/v2_architecture.png)

![Data Flow Through System](assets/diagrams/v2_data_flow.png)

**Key Points:**
- **Green boxes** = V1 components (keep the 93% routing!)
- **Orange boxes** = V2 new components (BM25 + Planning)
- **Data flows** left-to-right: Message → Routing → Retrieval → Planning → Output

**What's New in V2:**
1. **BM25 Retriever**: Finds relevant SOPs using keyword matching
2. **Plan Generator**: Creates multi-step plans using retrieved context
3. **Custom Metrics**: SOP Recall + Plan Alignment (3-class)

## Setup

Install the required packages and set up the environment.

In [None]:
# Install packages
!pip install -q openai pandas python-dotenv rank-bm25
!pip install -q 'arize-phoenix[evals]' openinference-instrumentation-openai

print("Packages installed successfully!")

In [None]:
# Setup for Colab vs Local
import os
import sys

# Check if running on Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Clone repository for data access
    if not os.path.exists('awesome-generative-ai-guide'):
        !git clone https://github.com/aishwaryanr/awesome-generative-ai-guide.git

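    # Navigate to the course directory inside the cloned repository
    # (check the same path we change into, so the chdir is not silently skipped)
    course_dir = 'awesome-generative-ai-guide/resources/agentic_ai_course_lil'
    if os.path.exists(course_dir):
        os.chdir(course_dir)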

    # Get API key from Colab secrets
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
else:
    # Local environment - use .env file
    from dotenv import load_dotenv
    load_dotenv()

# Verify API key is set
if not os.getenv('OPENAI_API_KEY'):
    raise ValueError("Please set OPENAI_API_KEY in Colab Secrets or .env file")

print("Environment setup complete!")

In [None]:
# Import libraries
import json
import glob
import pandas as pd
from openai import OpenAI
from rank_bm25 import BM25Okapi
from dataclasses import dataclass
from typing import List, Dict

# Arize Phoenix for observability
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("✓ All imports successful!")

## Load SOPs (Standard Operating Procedures)

V2 uses a knowledge base of 9 SOPs covering different customer support scenarios.

In [None]:
def load_sops(sops_directory="data/sops"):
    """Load all SOP text files."""
    sops = {}
    sop_files = glob.glob(f"{sops_directory}/sop_*.txt")

    for filepath in sorted(sop_files):
        filename = os.path.basename(filepath)
        sop_id = filename.replace('.txt', '').upper()

        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()

        sops[sop_id] = {
            'filename': filename,
            'content': content,
            'word_count': len(content.split())
        }

    return sops

# Load SOPs
sops_db = load_sops()
print(f"Loaded {len(sops_db)} SOPs")
print(f"\nSOP IDs: {list(sops_db.keys())}")
print(f"\nExample SOP (first 200 chars):")
first_sop = list(sops_db.keys())[0]
print(f"{first_sop}: {sops_db[first_sop]['content'][:200]}...")

## Build BM25 Index

BM25 is a keyword-based retrieval algorithm. We'll use it to find relevant SOPs given a customer message.

![BM25 SOP Retrieval](assets/diagrams/v2_sop_retrieval.png)

**How it works:**
1. Combine message + department as query
2. Score all 9 SOPs using BM25
3. Return top K SOPs (K=2 for Prompt 1, K=4 for Prompt 2)

**Key insight:** K=2 may miss relevant SOPs ranked #3-4; K=4 captures them, which improves recall.
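To make the scoring step concrete, here is a minimal pure-Python sketch of BM25-style ranking. It is not the `rank_bm25` implementation used in this notebook: the IDF variant, the `k1` and `b` values, and the three toy SOP texts are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (an assumption; libraries differ here)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is length-normalized via b
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical toy SOPs, not the course data
toy_docs = {
    "SOP_A": "return policy for items returned within thirty days",
    "SOP_B": "refund processing timeline after a return is received",
    "SOP_C": "password reset login troubleshooting steps",
}
ids = list(toy_docs)
corpus = [text.lower().split() for text in toy_docs.values()]

# Query = message + department, lowercased and split on whitespace
query = "can i return my jacket and get a refund RETURNS".lower().split()
scores = bm25_scores(query, corpus)
ranked = [ids[i] for i in top_k(scores, 2)]
```

Note that the department token `returns` matches nothing in this toy corpus, because keyword matching treats `return`, `returned`, and `returns` as different terms.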

In [None]:
def build_bm25_index(sops_db):
    """Build BM25 index over SOPs."""
    sop_ids = list(sops_db.keys())
    sop_contents = [sops_db[sop_id]['content'] for sop_id in sop_ids]

    # Tokenize
    tokenized_corpus = [doc.lower().split() for doc in sop_contents]

    # Build BM25
    bm25 = BM25Okapi(tokenized_corpus)

    return bm25, sop_ids

bm25_index, sop_ids = build_bm25_index(sops_db)
print(f"✓ BM25 index built over {len(sop_ids)} documents")
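One thing to watch with `doc.lower().split()`: punctuation stays attached to words, so `it?` and `it` become different BM25 terms. The sketch below shows a hypothetical regex-based `tokenize` helper as one possible alternative. It is not part of the notebook, and whether it improves recall on this dataset is untested. If you try it, apply the same tokenizer to both the corpus and the queries.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

message = "It's too big. Can I return it?"

# What the notebook's whitespace split produces
whitespace_tokens = message.lower().split()

# What the hypothetical regex tokenizer produces
regex_tokens = tokenize(message)
```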

## Planning Agent - Prompt 1 (Baseline)

**Configuration:**
- **Routing**: V1's improved Prompt 2 (93% accuracy) - EXACT copy
- **Retrieval**: K=2 (retrieve top 2 SOPs)
- **Planning**: gpt-4o

**Key: We use V1's EXACT department names and routing prompt!**

In [None]:
class PlanningAgent:
    def __init__(self, client, bm25_index, sop_ids, sops_db):
        self.client = client
        self.bm25_index = bm25_index
        self.sop_ids = sop_ids
        self.sops_db = sops_db

        # Use EXACT same department names as V1's enum
        self.departments = [
            "BILLING",
            "RETURNS",
            "TECHNICAL_SUPPORT",
            "ORDER_STATUS",
            "PRODUCT_INQUIRY",
            "ACCOUNT_MANAGEMENT",
            "ESCALATION"
        ]

    def route_message(self, message):
        """
        Route to department using V1's improved Prompt 2 (EXACT).

        This builds on V1 Action Autonomy's best routing prompt (93% accuracy).
        V2 adds planning on top of this solid routing foundation.
        Uses EXACT department names from V1's enum.
        """
        prompt = f"""Route customer messages to departments.

Available departments:
- BILLING: Payment issues, charges, refunds, refund status, account balances, fees
- RETURNS: Return requests, exchanges, return status, return policies
- TECHNICAL_SUPPORT: Login problems, password reset issues, website errors, checkout failures
- ORDER_STATUS: Order tracking, shipping updates, delivery questions, missing items
- PRODUCT_INQUIRY: Product questions, specifications, availability, pricing
- ACCOUNT_MANAGEMENT: Profile updates, changing saved payment methods, preferences, address changes
- ESCALATION: Very upset customers demanding managers, supervisor requests

Important:
- Login/password problems = TECHNICAL_SUPPORT (not ACCOUNT_MANAGEMENT)
- Updating payment methods = ACCOUNT_MANAGEMENT (not BILLING)
- Refund status = BILLING (not RETURNS)

Message: \"{message}\"

Respond with ONLY the department name, nothing else."""

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        return response.choices[0].message.content.strip()

    def retrieve_sops(self, message, department, top_k=2):
        """Retrieve relevant SOPs using BM25."""
        query = f"{message} {department}"
        tokenized_query = query.lower().split()

        scores = self.bm25_index.get_scores(tokenized_query)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

        retrieved_sops = []
        for idx in top_indices:
            sop_id = self.sop_ids[idx]
            score = scores[idx]
            content = self.sops_db[sop_id]['content']

            # Keep the first 1500 words (generate_plan later trims each excerpt to 2000 characters)
            words = content.split()[:1500]
            excerpt = ' '.join(words)

            retrieved_sops.append({
                'sop_id': sop_id,
                'score': score,
                'excerpt': excerpt,
                'full_content': content
            })

        return retrieved_sops

    def generate_plan(self, message, department, retrieved_sops):
        """Generate action plan."""
        sops_context = "\n\n".join([
            f"--- {sop['sop_id']} (Relevance: {sop['score']:.2f}) ---\n{sop['excerpt'][:2000]}..."
            for sop in retrieved_sops
        ])

        prompt = f"""You are a customer support agent planning assistant. Create a detailed, step-by-step action plan.

**Customer Message:**
"{message}"

**Department:** {department}

**Relevant Procedures (SOPs):**
{sops_context}

**Instructions:**
Create a detailed action plan that:
1. Lists specific steps the agent should take (in order)
2. References relevant SOP procedures
3. Includes verification or security steps
4. Mentions escalation criteria if applicable
5. Provides timeline expectations
6. Notes any edge cases or system limitations

Format as a numbered action plan. Be specific and actionable.

**Action Plan:**"""

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        return response.choices[0].message.content.strip()

    def plan(self, message):
        """Full pipeline."""
        department = self.route_message(message)
        retrieved_sops = self.retrieve_sops(message, department, top_k=2)
        plan = self.generate_plan(message, department, retrieved_sops)

        return {
            'message': message,
            'department': department,
            'retrieved_sops': [
                {'sop_id': sop['sop_id'], 'score': sop['score']}
                for sop in retrieved_sops
            ],
            'plan': plan
        }

# Initialize agent
agent = PlanningAgent(client, bm25_index, sop_ids, sops_db)
print("✓ PlanningAgent initialized with V1's routing + BM25 + gpt-4o planning")
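The agent stores `self.departments`, but `route_message` returns whatever text the model produces. A minimal sketch of checking the label against the allowed list is shown below. The `normalize_department` helper and its `None` fallback are assumptions for illustration, not part of the V1 or V2 code.

```python
DEPARTMENTS = [
    "BILLING",
    "RETURNS",
    "TECHNICAL_SUPPORT",
    "ORDER_STATUS",
    "PRODUCT_INQUIRY",
    "ACCOUNT_MANAGEMENT",
    "ESCALATION",
]

def normalize_department(raw_label, departments=DEPARTMENTS, fallback=None):
    """Map a raw model reply to a known department name, or return the fallback."""
    # Trim whitespace, surrounding quotes, and trailing periods, then canonicalize
    cleaned = raw_label.strip().strip('."\'').upper().replace(" ", "_")
    return cleaned if cleaned in departments else fallback

clean_label = normalize_department(" billing.\n")
spaced_label = normalize_department("Technical Support")
unknown_label = normalize_department("I think this is billing")
```

Returning `None` for unknown replies leaves the decision (retry, default department, or flag for review) to the caller.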

## Demo: Generate a Plan

Let's see the agent in action!

In [None]:
# Test the agent
test_message = "I bought a jacket last month, but it's too big. Can I return it?"

result = agent.plan(test_message)

print("="*80)
print("PLANNING AGENT DEMO")
print("="*80)
print(f"\nCustomer Message: {result['message']}")
print(f"\nRouted Department: {result['department']}")
print(f"\nRetrieved SOPs:")
for sop in result['retrieved_sops']:
    print(f"  - {sop['sop_id']} (score: {sop['score']:.2f})")
print(f"\nGenerated Action Plan:")
print(result['plan'])
print("\n" + "="*80)

---

## 🎬 End of Chapter 1: Implementing Retrieval & Planning

**Next: Chapter 2 - Continuous Calibration (CC)**

---

# 📊 Continuous Calibration (CC) Phase

**Goal:** Observe failures, design custom metrics, and identify improvements.

**In this phase:**
- Enable Phoenix tracing to observe all LLM calls
- Run systematic evaluation on test cases
- Analyze errors in Phoenix UI
- Design metrics from observed patterns (SOP Recall, Plan Alignment)
- Compute metrics to quantify performance

**Output:** Custom metrics that measure what matters + clear improvement targets.

---

## Enable Phoenix Tracing

Phoenix captures all LLM calls so we can observe what's happening.

In [None]:
# Start Phoenix (Colab-compatible setup)
import os

# Configure Phoenix for Colab/local compatibility
os.environ["PHOENIX_HOST"] = "0.0.0.0"
os.environ["PHOENIX_PORT"] = "6006"

import phoenix as px

print("="*80)
print("Starting Arize Phoenix...")
print("="*80)
session = px.launch_app()  # don't pass port parameter
print("Phoenix session url:", session.url)

# For Google Colab compatibility
try:
    from google.colab import output
    output.serve_kernel_port_as_window(6006)
    print("✓ Phoenix running on Colab at port 6006")
except ImportError:
    print("✓ Phoenix running locally at http://localhost:6006")

print("Open the URL above to view traces in real-time\n")

# Enable OpenAI instrumentation for Prompt 1
project_name = "V2_planning_autonomy_prompt_1"
print(f"Enabling tracing for project: {project_name}")
tracer_provider = register(project_name=project_name)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
tracer = trace.get_tracer(__name__)
print("✓ Tracing enabled! All API calls will be captured in Phoenix.\n")

## Load Test Cases

We have 22 grounded test cases with expected SOPs and procedure steps.

**Each test case includes:**
- Customer message
- Complexity level (simple, medium, complex)
- Expected SOPs (ground truth)
- Expected procedure steps
- Policy details to mention

In [None]:
# Load test cases
test_cases = pd.read_csv('data/v2_test_cases.csv')
print(f"Loaded {len(test_cases)} test cases")
print(f"\nColumns: {list(test_cases.columns)}")
print(f"\nSample:")
print(test_cases[['message', 'complexity', 'expected_sops']].head())

## Run Prompt 1 Evaluation

Let's evaluate the baseline and observe failures in Phoenix.

**Note:** This will make ~44 OpenAI API calls (22 test cases × 2 calls each) and takes ~5-10 minutes:
- 1 call for routing
- 1 call for plan generation

BM25 retrieval runs locally and makes no API calls.

In [None]:
def normalize_sop_name(sop):
    """Normalize SOP name to base format (e.g., SOP_001)."""
    import re
    sop = str(sop).upper()
    sop = sop.replace('SOP-', 'SOP_').replace(' ', '_')
    match = re.match(r'(SOP_\d+)', sop)
    return match.group(1) if match else sop

# Run evaluation
results = []

print("Running Prompt 1 evaluation...")
print("(Each plan generation is being traced in Phoenix)\n")

for idx, row in test_cases.iterrows():
    message = row['message']
    expected_sops = row['expected_sops'].split(',') if pd.notna(row['expected_sops']) else []
    expected_sops = [normalize_sop_name(s.strip()) for s in expected_sops]

    print(f"  [{idx+1}/{len(test_cases)}] Processing: {message[:60]}...")

    # Create Phoenix span for this test case
    with tracer.start_as_current_span(f"test_case_{idx}") as span:
        span.set_attribute("test.id", idx)
        span.set_attribute("test.message", message)
        span.set_attribute("test.complexity", row['complexity'])
        span.set_attribute("test.expected_sops", str(expected_sops))

        try:
            result = agent.plan(message)

            retrieved_sop_ids = [normalize_sop_name(sop['sop_id']) for sop in result['retrieved_sops']]

            # Record in span
            span.set_attribute("result.department", result['department'])
            span.set_attribute("result.retrieved_sops", str(retrieved_sop_ids))
            span.set_attribute("result.plan_length", len(result['plan'].split()))
            span.set_status(Status(StatusCode.OK))

            results.append({
                'test_case_id': idx,
                'message': message,
                'complexity': row['complexity'],
                'expected_sops': expected_sops,
                'retrieved_sops': retrieved_sop_ids,
                'department': result['department'],
                'plan': result['plan']
            })
        except Exception as e:
            print(f"    ERROR: {e}")
            span.set_status(Status(StatusCode.ERROR, str(e)))

results_df = pd.DataFrame(results)
print(f"\n✓ Completed {len(results_df)} evaluations")

## Observe Traces in Phoenix

### The Trace-First Evaluation Workflow

**Key workflow:** Observe → Discover → Measure → Improve

**Go to Phoenix UI:** http://localhost:6006/

**What to observe:**
1. Click on "V2_planning_autonomy_prompt_1" project
2. See all test case traces
3. Click on individual traces to see:
   - Routing call (V1's prompt)
   - Plan generation call (with SOPs)
   - Retrieved SOPs vs Expected SOPs
4. **Look for patterns:**
   - Missing expected SOPs (K=2 limitation?)
   - Plans missing critical steps
   - Wrong SOPs retrieved

**Exercise:** Find 3-5 failed cases and note what went wrong.

## Analyze Errors

From observations, design metrics to measure failures.

In [None]:
# Simple error analysis
print("="*80)
print("ERROR ANALYSIS")
print("="*80)

errors = []

for idx, row in results_df.iterrows():
    expected_sops = set(row['expected_sops']) if isinstance(row['expected_sops'], list) else set()
    retrieved_sops = set(row['retrieved_sops']) if isinstance(row['retrieved_sops'], list) else set()

    # Missing expected SOPs
    missing_sops = expected_sops - retrieved_sops
    if missing_sops:
        errors.append({
            'test_case_id': idx,
            'error_type': 'missing_sops',
            'message': row['message'][:80],
            'expected_sops': list(expected_sops),
            'retrieved_sops': list(retrieved_sops),
            'missing_sops': list(missing_sops)
        })

    # Extra/wrong SOPs
    extra_sops = retrieved_sops - expected_sops
    if extra_sops:
        errors.append({
            'test_case_id': idx,
            'error_type': 'extra_sops',
            'message': row['message'][:80],
            'expected_sops': list(expected_sops),
            'retrieved_sops': list(retrieved_sops),
            'extra_sops': list(extra_sops)
        })

print(f"\nFound {len(errors)} error instances across {len(set(e['test_case_id'] for e in errors))} test cases")

missing_sops_errors = [e for e in errors if e['error_type'] == 'missing_sops']
extra_sops_errors = [e for e in errors if e['error_type'] == 'extra_sops']

print(f"\nMissing SOPs: {len(missing_sops_errors)} cases")
print(f"Extra/Wrong SOPs: {len(extra_sops_errors)} cases")

if missing_sops_errors:
    print("\n" + "-"*80)
    print("EXAMPLE: Missing Expected SOPs")
    print("-"*80)
    for i, error in enumerate(missing_sops_errors[:3]):
        print(f"\nCase {i+1}:")
        print(f"  Message: {error['message']}")
        print(f"  Expected SOPs: {error['expected_sops']}")
        print(f"  Retrieved SOPs: {error['retrieved_sops']}")
        print(f"  Missing: {error['missing_sops']}")
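Beyond per-case examples, it can help to see which SOPs are missed most often. The sketch below tallies them from records shaped like the `errors` list above. The `count_missing_sops` helper and the sample records are made up for illustration and are not results from this notebook. Keep in mind that with a fixed K, a test case expecting fewer than K SOPs will always show up under `extra_sops`, so that count is less informative than the missing count.

```python
from collections import Counter

def count_missing_sops(error_records):
    """Tally how often each SOP appears in 'missing_sops' error records."""
    counts = Counter()
    for record in error_records:
        if record.get("error_type") == "missing_sops":
            counts.update(record.get("missing_sops", []))
    return counts

# Made-up records in the same shape as the notebook's `errors` entries
sample_errors = [
    {"error_type": "missing_sops", "missing_sops": ["SOP_003"]},
    {"error_type": "missing_sops", "missing_sops": ["SOP_003", "SOP_007"]},
    {"error_type": "extra_sops", "extra_sops": ["SOP_001"]},
]

missed = count_missing_sops(sample_errors)
most_missed = missed.most_common(1)
```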

## Design 2 Custom Metrics

Based on observed failures, we design 2 metrics:

### Metric 1: SOP Retrieval Recall @ K
- **What:** % of expected SOPs actually retrieved
- **Why:** Wrong SOPs → wrong plan (garbage in, garbage out)
- **Observed:** K=2 misses relevant SOPs ranked #3+
- **Formula:** `recall = len(retrieved ∩ expected) / len(expected)`

### Metric 2: Plan-to-Steps Alignment (3-class)
- **What:** Does plan cover expected procedure steps?
- **Classes:** good (complete), partial (minor gaps), bad (major gaps)
- **Why:** End-to-end quality check
- **Observed:** Plans missing critical steps or policy details
- **Judge:** GPT-4o evaluates with reasoning

**Key:** Metrics emerged from observations, not predetermined!
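As a worked instance of the recall formula, suppose (hypothetically) a test case expects two SOPs and K=2 retrieval returns only one of them. The SOP IDs below are illustrative, with R the retrieved set and E the expected set:

```latex
\text{recall}
= \frac{|R \cap E|}{|E|}
= \frac{|\{\text{SOP\_001}, \text{SOP\_007}\} \cap \{\text{SOP\_001}, \text{SOP\_004}\}|}{|\{\text{SOP\_001}, \text{SOP\_004}\}|}
= \frac{1}{2} = 0.5
```

The irrelevant SOP_007 does not lower recall; recall only penalizes expected SOPs that are missing.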

In [None]:
def compute_sop_recall(expected_sops, retrieved_sops):
    """What % of expected SOPs were retrieved?"""
    if not expected_sops:
        return 1.0

    expected_set = {normalize_sop_name(s) for s in expected_sops}
    retrieved_set = {normalize_sop_name(s) for s in retrieved_sops}

    relevant_retrieved = expected_set & retrieved_set
    recall = len(relevant_retrieved) / len(expected_set)

    return recall

def compute_plan_alignment(message, expected_steps, policy_details, generated_plan):
    """Does plan cover expected steps? (LLM-as-Judge with 3 classes)"""
    judge_prompt = f"""You are evaluating if a customer support action plan adequately covers expected procedure steps.

**Customer Message:**
{message}

**Expected Procedure Steps (from SOP):**
{expected_steps}

**Expected Policy Details:**
{policy_details}

**Generated Action Plan:**
{generated_plan}

**Evaluation Task:**
Classify the plan quality into one of 3 classes:

- **good**: All critical steps are covered, policy details are mentioned, plan is complete and actionable
  Example: Plan includes all verification steps, mentions specific timelines, covers edge cases

- **partial**: Most important steps are covered but missing some details or minor steps
  Example: Plan has main actions but omits policy details like timelines or approval levels

- **bad**: Plan is missing critical steps, has significant gaps, or is unrelated to the expected procedure
  Example: Plan addresses wrong issue, skips mandatory verification steps, or completely misses the procedure

Respond in this EXACT format:
CLASS: <good|partial|bad>
REASONING: <2-3 sentence explanation of what's covered and what's missing>
"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0
        )

        content = response.choices[0].message.content.strip()

        # Parse class
        class_line = [line for line in content.split('\n') if line.startswith('CLASS:')]
        reasoning_line = [line for line in content.split('\n') if line.startswith('REASONING:')]

        if class_line:
            class_text = class_line[0].replace('CLASS:', '').strip().lower()
            plan_class = class_text if class_text in ['good', 'partial', 'bad'] else 'partial'
        else:
            plan_class = 'partial'

        if reasoning_line:
            reasoning = reasoning_line[0].replace('REASONING:', '').strip()
        else:
            reasoning = content

        return {
            'class': plan_class,
            'reasoning': reasoning
        }

    except Exception as e:
        print(f"  LLM judge error: {e}")
        return {'class': 'partial', 'reasoning': str(e)}

print("✓ Metric functions defined")
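The parser above requires lines to start with exactly `CLASS:` and `REASONING:`, and it falls back to `partial` when parsing fails, which mixes parse failures into a real class. The sketch below is one possible alternative, not the notebook's method: a hypothetical `parse_judge_output` helper that tolerates Markdown emphasis and different casing, and labels unparseable replies as `unparsed` so they can be counted separately.

```python
import re

VALID_CLASSES = ("good", "partial", "bad")

def parse_judge_output(content):
    """Extract the class label and reasoning from a judge reply."""
    # \W* skips spaces and Markdown markers such as '**' after the colon
    class_match = re.search(r"CLASS:\W*(\w+)", content, flags=re.IGNORECASE)
    reasoning_match = re.search(
        r"REASONING:\W*(.+)", content, flags=re.IGNORECASE | re.DOTALL
    )

    label = class_match.group(1).lower() if class_match else None
    if label not in VALID_CLASSES:
        label = "unparsed"  # assumption: keep parse failures out of the real classes

    # Fall back to the full reply when no REASONING section is found
    reasoning = reasoning_match.group(1).strip() if reasoning_match else content.strip()
    return {"class": label, "reasoning": reasoning}

plain = parse_judge_output("CLASS: good\nREASONING: Covers all steps.")
styled = parse_judge_output(
    "**CLASS:** Partial\n**REASONING:** Missing timeline.\nAlso no escalation."
)
broken = parse_judge_output("The plan looks fine.")
```

Because the reasoning pattern uses `re.DOTALL`, multi-line explanations are kept whole rather than cut at the first line break.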

## Compute Metrics for Prompt 1

**Note:** This will make 22 more API calls (one per test case for LLM-as-Judge)

In [None]:
# Compute metrics
metrics = []

print("Computing metrics for Prompt 1...\n")

for idx, row in results_df.iterrows():
    # Look up the source test case by its recorded ID; results_df skips failed cases,
    # so its row position may not match the row position in test_cases
    test_row = test_cases.loc[row['test_case_id']]
    expected_steps = test_row.get('expected_steps', '') if pd.notna(test_row.get('expected_steps')) else ''
    policy_details = test_row.get('policy_details', '') if pd.notna(test_row.get('policy_details')) else ''

    message = row['message']
    expected_sops = row['expected_sops'] if isinstance(row['expected_sops'], list) else []
    retrieved_sops = row['retrieved_sops'] if isinstance(row['retrieved_sops'], list) else []
    plan = row['plan']

    print(f"[{idx+1}/{len(results_df)}] Evaluating: {message[:60]}...")

    # Metric 1: SOP Recall
    recall = compute_sop_recall(expected_sops, retrieved_sops)
    print(f"  SOP Recall: {recall:.2f}")

    # Metric 2: Plan Alignment
    alignment = compute_plan_alignment(message, expected_steps, policy_details, plan)
    print(f"  Plan Alignment: {alignment['class']}")

    metrics.append({
        'test_case_id': row['test_case_id'],
        'sop_recall': recall,
        'plan_alignment_class': alignment['class'],
        'plan_alignment_reasoning': alignment['reasoning']
    })

metrics_df = pd.DataFrame(metrics)
print(f"\n✓ Metrics computed for {len(metrics_df)} test cases")

## Summarize Prompt 1 Metrics

In [None]:
print("="*80)
print("PROMPT 1 METRICS SUMMARY")
print("="*80)

# Metric 1: SOP Recall
print("\n1. SOP RETRIEVAL RECALL @ K")
print("-" * 40)
recall_mean = metrics_df['sop_recall'].mean()
recall_perfect = (metrics_df['sop_recall'] == 1.0).sum()
recall_zero = (metrics_df['sop_recall'] == 0.0).sum()
print(f"  Mean Recall: {recall_mean:.2%}")
print(f"  Perfect (1.0): {recall_perfect}/{len(metrics_df)} cases")
print(f"  Zero (0.0): {recall_zero}/{len(metrics_df)} cases")
print(f"  → Interpretation: On average, we retrieve {recall_mean:.0%} of expected SOPs")

# Metric 2: Plan Alignment
print("\n2. PLAN-TO-STEPS ALIGNMENT (3-class)")
print("-" * 40)
alignment_good = (metrics_df['plan_alignment_class'] == 'good').sum()
alignment_partial = (metrics_df['plan_alignment_class'] == 'partial').sum()
alignment_bad = (metrics_df['plan_alignment_class'] == 'bad').sum()

print(f"  Good: {alignment_good}/{len(metrics_df)} cases ({alignment_good/len(metrics_df):.1%})")
print(f"  Partial: {alignment_partial}/{len(metrics_df)} cases ({alignment_partial/len(metrics_df):.1%})")
print(f"  Bad: {alignment_bad}/{len(metrics_df)} cases ({alignment_bad/len(metrics_df):.1%})")
print(f"  → Interpretation: {alignment_good} plans are complete, {alignment_partial} need minor fixes, {alignment_bad} have major gaps")

---

## 🎬 End of Chapter 2: Error Analysis & Metric Design (CC)

**Next: Chapter 3 - Continuous Deployment (CD)**

---

# 🚀 Continuous Deployment (CD) Phase

**Goal:** Make targeted improvements and measure impact.

**In this phase:**
- Identify root causes from CC metrics
- Design Prompt 2 with targeted fixes (K=2→4, gpt-4o→gpt-5)
- Re-evaluate with same metrics
- Compare Prompt 1 vs Prompt 2 performance
- Validate improvements worked

**Output:** Better system with measured improvements (SOP Recall: 54%→76%, Plan Alignment: 72%→100%).

---

## Identify Problems → Design Improvements

Based on metrics, what should we improve?

**Problem 1: Low SOP Recall (53.79%)**
- Root cause: K=2 is too restrictive
- Many relevant SOPs ranked #3-4 but not retrieved
- **Solution:** Increase K from 2 to 4

**Problem 2: Plan Alignment is not perfect (72% good)**
- Likely cause: gpt-4o's plans omit some steps or policy details
- **Solution:** Upgrade plan generation to gpt-5, which is expected to reason better over the retrieved SOPs

**Prompt 2 Improvements:**
1. K=2 → K=4 (targets SOP Recall)
2. gpt-4o → gpt-5 (targets Plan Alignment)
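To see why raising K targets recall, the sketch below computes recall at several cutoffs over a single ranked list. The ranking and expected SOPs are hypothetical, chosen so that one expected SOP sits at rank 4. The `recall_at_k` helper is illustrative and separate from the notebook's `compute_sop_recall`.

```python
def recall_at_k(ranked_ids, expected_ids, k):
    """Fraction of expected SOPs found in the top-k of a ranked list."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # same convention as the notebook: nothing expected counts as full recall
    return len(expected & set(ranked_ids[:k])) / len(expected)

# Hypothetical BM25 ranking (best first) and ground truth for one test case
ranked = ["SOP_002", "SOP_005", "SOP_001", "SOP_004", "SOP_009"]
expected = ["SOP_002", "SOP_004"]

by_k = {k: recall_at_k(ranked, expected, k) for k in (1, 2, 3, 4)}
```

In this example recall stays at 0.5 until K reaches 4. The trade-off is that a larger K also puts more non-expected SOP text into the planning prompt.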

## Prompt 2: Improved Agent

Same architecture, but with targeted improvements.

**Changes:**
- ✅ K=2 → K=4 (better SOP retrieval)
- ✅ gpt-4o → gpt-5 (better plan generation)
- ✅ Same V1 routing (keep what works!)

**Goal:** Improve both SOP Recall and Plan Alignment

In [None]:
class PlanningAgentPrompt2:
    def __init__(self, client, bm25_index, sop_ids, sops_db):
        self.client = client
        self.bm25_index = bm25_index
        self.sop_ids = sop_ids
        self.sops_db = sops_db

        # Use EXACT same department names as V1
        self.departments = [
            "BILLING",
            "RETURNS",
            "TECHNICAL_SUPPORT",
            "ORDER_STATUS",
            "PRODUCT_INQUIRY",
            "ACCOUNT_MANAGEMENT",
            "ESCALATION"
        ]

    def route_message(self, message):
        """Same V1 routing - don't regress!"""
        prompt = f"""Route customer messages to departments.

Available departments:
- BILLING: Payment issues, charges, refunds, refund status, account balances, fees
- RETURNS: Return requests, exchanges, return status, return policies
- TECHNICAL_SUPPORT: Login problems, password reset issues, website errors, checkout failures
- ORDER_STATUS: Order tracking, shipping updates, delivery questions, missing items
- PRODUCT_INQUIRY: Product questions, specifications, availability, pricing
- ACCOUNT_MANAGEMENT: Profile updates, changing saved payment methods, preferences, address changes
- ESCALATION: Very upset customers demanding managers, supervisor requests

Important:
- Login/password problems = TECHNICAL_SUPPORT (not ACCOUNT_MANAGEMENT)
- Updating payment methods = ACCOUNT_MANAGEMENT (not BILLING)
- Refund status = BILLING (not RETURNS)

Message: \"{message}\"

Respond with ONLY the department name, nothing else."""

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        return response.choices[0].message.content.strip()

    def retrieve_sops(self, message, department, top_k=4):
        """
        PROMPT 2 IMPROVEMENT: Increased top_k from 2 to 4
        Rationale: Prompt 1 had low recall, missing SOPs ranked #3-4
        """
        query = f"{message} {department}"
        tokenized_query = query.lower().split()

        scores = self.bm25_index.get_scores(tokenized_query)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

        retrieved_sops = []
        for idx in top_indices:
            sop_id = self.sop_ids[idx]
            score = scores[idx]
            content = self.sops_db[sop_id]['content']

            words = content.split()[:1500]
            excerpt = ' '.join(words)

            retrieved_sops.append({
                'sop_id': sop_id,
                'score': score,
                'excerpt': excerpt,
                'full_content': content
            })

        return retrieved_sops

    def generate_plan(self, message, department, retrieved_sops):
        """
        PROMPT 2 IMPROVEMENT: Upgraded from gpt-4o to gpt-5
        Rationale: gpt-5 has better reasoning, should improve plan quality
        """
        sops_context = "\n\n".join([
            f"--- {sop['sop_id']} (Relevance: {sop['score']:.2f}) ---\n{sop['excerpt'][:2000]}..."
            for sop in retrieved_sops
        ])

        prompt = f"""You are a customer support agent planning assistant. Create a detailed, step-by-step action plan.

**Customer Message:**
"{message}"

**Department:** {department}

**Relevant Procedures (SOPs):**
{sops_context}

**Instructions:**
Create a detailed action plan that:
1. Lists specific steps the agent should take (in order)
2. References relevant SOP procedures
3. Includes verification or security steps
4. Mentions escalation criteria if applicable
5. Provides timeline expectations
6. Notes any edge cases or system limitations

Format as a numbered action plan. Be specific and actionable.

**Action Plan:**"""

        # PROMPT 2: Use gpt-5 instead of gpt-4o
        response = self.client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}]
        )

        return response.choices[0].message.content.strip()

    def plan(self, message):
        """Full pipeline with K=4 and gpt-5."""
        department = self.route_message(message)
        retrieved_sops = self.retrieve_sops(message, department, top_k=4)
        plan = self.generate_plan(message, department, retrieved_sops)

        return {
            'message': message,
            'department': department,
            'retrieved_sops': [
                {'sop_id': sop['sop_id'], 'score': sop['score']}
                for sop in retrieved_sops
            ],
            'plan': plan
        }

# Initialize improved agent
agent_p2 = PlanningAgentPrompt2(client, bm25_index, sop_ids, sops_db)
print("✓ PlanningAgentPrompt2 initialized (K=4, gpt-5)")
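`PlanningAgentPrompt2` repeats `PlanningAgent` except for two settings. One way to make that explicit is to capture the settings in a small config object. The sketch below is an assumption about how this could be organized, not how the notebook is written: `AgentConfig` and `changed_settings` are hypothetical names.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentConfig:
    """Settings that distinguish one prompt version from another."""
    name: str
    top_k: int
    routing_model: str
    planning_model: str

PROMPT_1 = AgentConfig(name="prompt_1", top_k=2, routing_model="gpt-4o", planning_model="gpt-4o")
PROMPT_2 = AgentConfig(name="prompt_2", top_k=4, routing_model="gpt-4o", planning_model="gpt-5")

def changed_settings(before, after):
    """Return {setting: (old, new)} for every setting that differs, ignoring the name."""
    old, new = asdict(before), asdict(after)
    return {key: (old[key], new[key]) for key in old if key != "name" and old[key] != new[key]}

diff = changed_settings(PROMPT_1, PROMPT_2)
```

A single agent class could then read `config.top_k` and `config.planning_model`, which keeps the routing prompt in one place.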

## Summary

**What We Built:**
- V2 Planning Agent that generates multi-step plans
- Builds on V1's 93% routing (EXACT department names)
- Uses BM25 to retrieve relevant SOPs
- Uses LLM to generate detailed action plans

**What We Learned:**
1. **Incremental Building:** V2 = V1's routing + new capabilities
2. **Trace-First:** Observe failures → Design metrics → Improve
3. **Custom Metrics:** SOP Recall + Plan Alignment (3-class)
4. **Targeted Improvements:** K=2→4, gpt-4o→gpt-5

**Next Steps:**
1. Run Prompt 2 evaluation
2. Compute Prompt 2 metrics
3. Compare Prompt 1 vs Prompt 2
4. Verify improvements worked!
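
For the comparison step, a minimal sketch is shown below. It assumes per-case metric records with the same keys as `metrics_df` (`sop_recall`, `plan_alignment_class`). The `summarize` and `compare` helpers are hypothetical, and the sample values are made up, not results from either prompt.

```python
def summarize(run_metrics):
    """Mean SOP recall and share of 'good' plans for one evaluation run."""
    n = len(run_metrics)
    mean_recall = sum(m["sop_recall"] for m in run_metrics) / n
    good_rate = sum(m["plan_alignment_class"] == "good" for m in run_metrics) / n
    return {"mean_recall": mean_recall, "good_rate": good_rate}

def compare(before, after):
    """Change in each summary metric from the first run to the second."""
    old, new = summarize(before), summarize(after)
    return {key: new[key] - old[key] for key in old}

# Made-up per-case metrics for two runs over the same two test cases
prompt_1_metrics = [
    {"sop_recall": 0.5, "plan_alignment_class": "partial"},
    {"sop_recall": 1.0, "plan_alignment_class": "good"},
]
prompt_2_metrics = [
    {"sop_recall": 1.0, "plan_alignment_class": "good"},
    {"sop_recall": 1.0, "plan_alignment_class": "good"},
]

delta = compare(prompt_1_metrics, prompt_2_metrics)
```

Running both prompts on the same test cases keeps the comparison fair.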

**V2 Planning Autonomy: Complete!** 🎉