# Session 6: Prompt Engineering TDD

**Salesforce AI Workshop Series**

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Evaluate prompt quality** using LLM-as-Judge (G-Eval with GPT-4o)
2. **Apply TDD to prompts** — write failing tests first, then improve the prompt
3. **Manage prompt versions** with a registry and regression testing
4. **Prevent regressions** — ensure new prompts don't break existing behavior

## Prerequisites

- Sessions 1-5 completed (DevHub with observability, testing, debugging, security)
- No prior prompt engineering experience required

## The Problem: "We Changed the Prompt and Everything Broke"

Your team has been using DevHub for two months. Users love it. Then a developer says:

> "The responses are too long. Users want shorter answers."

A teammate updates the system prompt: *"Be concise. Keep responses under 100 words."*

Shorter responses? Check. But now:
- **Billing API docs** that used to include all 4 authentication steps now only mention 2
- **Service owner lookups** no longer include the Slack channel
- **Error handling answers** skip the retry strategy entirely

**The prompt "improvement" broke 30% of existing use cases.** And nobody noticed until users complained.

This is the **regression trap**, one of the most common failure modes in prompt engineering. Today we solve it.

## What We'll Build Today

| Component | Purpose | Tool |
|-----------|---------|------|
| **Quality Metrics** | Score responses on correctness, completeness, tone, safety | G-Eval + GPT-4o |
| **TDD for Prompts** | Write failing tests → improve prompt → verify | Red-Green-Refactor |
| **Version Registry** | Track prompt versions, aliases, rollback | Pure Python PromptRegistry |

![Prompt TDD Overview](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/00_prompt_tdd_overview.svg)

### The Key Insight

Prompts are **code**. They should be versioned, tested, and never deployed without regression testing. If you wouldn't push code without running tests, why would you push a prompt change?

The TDD cycle for prompts:
1. **Red:** Write a test case the current prompt fails
2. **Green:** Improve the prompt until the test passes
3. **Refactor:** Run ALL tests to catch regressions
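
A minimal sketch of this cycle, assuming a simple keyword check stands in for the LLM judge introduced later and canned answers stand in for real model output. `TESTS`, `run_suite`, and `regressions` are illustrative names, not part of DevHub:

```python
from typing import Callable

# Each test: a query plus the facts a good answer must mention (lowercase).
TESTS = {
    "payments_auth": ("How do I authenticate with the Payments API?",
                      ["oauth 2.0", "/oauth/token", "bearer"]),
    "billing_owner": ("Who owns the billing service?",
                      ["sarah chen", "#payments-support"]),
}

def run_suite(generate: Callable[[str], str], tests: dict) -> dict[str, bool]:
    """Return {test_name: passed}; a test passes if every required fact appears."""
    results = {}
    for name, (query, required) in tests.items():
        answer = generate(query).lower()
        results[name] = all(fact in answer for fact in required)
    return results

def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Tests that passed with the old prompt but fail with the new one."""
    return [name for name, ok in before.items() if ok and not after.get(name, False)]

# Canned answers (assumed, for illustration) under two prompt versions.
ANSWERS_V1 = {
    "How do I authenticate with the Payments API?":
        "Use OAuth 2.0: POST to /oauth/token, then send 'Bearer {token}'.",
    "Who owns the billing service?":
        "Sarah Chen (@sarah.chen) owns it.",  # Red: Slack channel missing
}
ANSWERS_V2 = {
    "How do I authenticate with the Payments API?":
        "Use OAuth 2.0.",  # Regression: the token steps were dropped
    "Who owns the billing service?":
        "Sarah Chen owns it; the team channel is #payments-support.",  # Green
}

before = run_suite(ANSWERS_V1.__getitem__, TESTS)
after = run_suite(ANSWERS_V2.__getitem__, TESTS)
print(before)                      # {'payments_auth': True, 'billing_owner': False}
print(after)                       # {'payments_auth': False, 'billing_owner': True}
print(regressions(before, after))  # ['payments_auth']
```

The new prompt turns the failing test green, but only the full-suite run in the Refactor step reveals that it broke a test that used to pass.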

In [None]:
# =============================================================================
# SETUP: Install Required Packages
# =============================================================================
# IMPORTANT: After this cell runs, you may need to restart the runtime
# (Runtime > Restart session) if you get import errors.

# Quote the version specifiers so the shell does not treat '>' as output redirection.
!pip install -q "openai>=1.0.0" "chromadb>=0.4.0" "deepeval>=2.0.0"

print("Packages installed!")
print("If this is your first run, restart the runtime now: Runtime > Restart session")

In [None]:
# =============================================================================
# CONFIGURATION: API Keys and Student Identity
# =============================================================================
import os
import uuid

# ─────────────────────────────────────────────────────────────────────────────
# INSTRUCTOR: Update these before the workshop
# ─────────────────────────────────────────────────────────────────────────────
OPENAI_API_KEY = "sk-..."  # Instructor provides

# ─────────────────────────────────────────────────────────────────────────────
# STUDENT: Change this to your name (lowercase, no spaces)
# ─────────────────────────────────────────────────────────────────────────────
STUDENT_NAME = "your-name-here"  # e.g., "sarah-chen"

if STUDENT_NAME == "your-name-here" or not STUDENT_NAME.strip():
    raise ValueError(
        "\n" + "="*60 + "\n"
        "ERROR: You must enter your name!\n"
        "Change STUDENT_NAME above from 'your-name-here' to your actual name.\n"
        "Example: STUDENT_NAME = \"sarah-chen\"\n"
        + "="*60
    )

# Set environment variables
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

LAB_SESSION_ID = f"{STUDENT_NAME}-s6-{uuid.uuid4().hex[:8]}"

print(f"Student: {STUDENT_NAME}")
print(f"Session ID: {LAB_SESSION_ID}")

In [None]:
# =============================================================================
# DATA: DevHub Knowledge Base (real data from devhub/data/)
# =============================================================================

DOCS_DATA = [
    {"id": "doc-payments-auth", "title": "Payments API Authentication", "category": "api",
     "content": "To authenticate with the Payments API, use OAuth 2.0 client credentials flow. First, obtain your client_id and client_secret from the Developer Portal. Make a POST request to /oauth/token with grant_type=client_credentials. The response contains an access_token valid for 1 hour. Include this token in the Authorization header as 'Bearer {token}' for all subsequent requests. Rate limits: 100 requests/minute for authenticated users."},
    {"id": "doc-auth-sdk", "title": "Auth SDK Quick Start", "category": "sdk",
     "content": "Install the Auth SDK with 'pip install company-auth-sdk'. Initialize with AuthClient(client_id, client_secret). Call client.authenticate() to get a session. The SDK handles token refresh automatically. For service-to-service auth, use ServiceAuth class instead. Common errors: 401 means invalid credentials, 429 means rate limited. See examples at github.com/company/auth-sdk-examples."},
    {"id": "doc-billing-service", "title": "Billing Service Overview", "category": "service",
     "content": "The Billing Service handles subscription management, invoicing, and payment processing. REST APIs: POST /v1/subscriptions (create), GET /v1/subscriptions/{id} (read), POST /v1/invoices (generate), POST /v1/refunds (process refund). Webhook events: subscription.created, invoice.paid, refund.processed. Configure webhooks in the Billing Dashboard. For access requests, contact the Billing team via #billing-support."},
    {"id": "doc-vector-search", "title": "Vector Search Best Practices", "category": "guide",
     "content": "When using our Vector Search service: 1) Use embedding dimension 1536 for OpenAI compatibility. 2) Batch inserts for bulk data (max 100 vectors/call). 3) Set top_k between 3-5 for most use cases. 4) Monitor similarity scores - below 0.7 indicates poor matches. 5) Index maintenance runs nightly at 2 AM UTC. For large datasets, contact the Data Platform team about dedicated capacity."},
    {"id": "doc-staging-env", "title": "Staging Environment Guide", "category": "environment",
     "content": "Staging environment mirrors production at staging.internal.company.com. Access requires VPN connection. Data is refreshed weekly from anonymized production data. Rate limits are 10x lower than production. Known limitations: Payments API uses sandbox mode only, external integrations are mocked. For staging access issues, contact Platform team via #platform-help. Emergency access: page platform-oncall."},
    {"id": "doc-error-handling", "title": "Error Handling Standards", "category": "standards",
     "content": "All APIs must return standard error format: {error: {code: string, message: string, details: object, correlation_id: string}}. HTTP status codes: 400 bad input, 401 auth failure, 403 forbidden, 404 not found, 429 rate limited, 500 server error, 503 service unavailable. Always include correlation_id for debugging. Log errors with structured logging. Retry strategy: exponential backoff with jitter, max 3 retries."},
    {"id": "doc-rate-limiting", "title": "Rate Limiting Configuration", "category": "api",
     "content": "Default rate limits: 100 requests/minute authenticated, 10 requests/minute unauthenticated. Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. When rate limited (429), check Retry-After header. For higher limits, submit request to API Gateway team with business justification. Enterprise tier gets 1000 requests/minute. Implement client-side: token bucket algorithm recommended."},
    {"id": "doc-db-connection-pool", "title": "Database Connection Pooling", "category": "guide",
     "content": "Use connection pooling for all database access. Recommended: min_pool_size=5, max_pool_size=20, connection_timeout=30s, idle_timeout=300s. High-traffic services: increase max to 50. Monitor with db.pool.active and db.pool.waiting metrics. If seeing 'connection pool exhausted' errors: 1) Check for connection leaks, 2) Increase pool size, 3) Add connection timeout. Use context managers to ensure connections are returned."}
]

TEAMS_DATA = {
    "teams": [
        {"id": "team-payments", "name": "Payments Team", "description": "Payment processing, subscriptions, billing, refunds", "slack_channel": "#payments-support"},
        {"id": "team-platform", "name": "Platform Team", "description": "Infrastructure, DevOps, environments, API gateway", "slack_channel": "#platform-help"},
        {"id": "team-auth", "name": "Auth Team", "description": "Authentication, authorization, identity, SSO", "slack_channel": "#auth-support"},
        {"id": "team-data", "name": "Data Platform Team", "description": "Data infrastructure, vector search, ML platform, embeddings", "slack_channel": "#data-platform"}
    ],
    "owners": [
        {"id": "owner-sarah", "name": "Sarah Chen", "email": "sarah.chen@company.com", "slack_handle": "@sarah.chen",
         "team_id": "team-payments", "services": ["payments-api", "billing-service", "billing", "subscriptions", "refunds", "invoices"], "is_active": True},
        {"id": "owner-james", "name": "James Wilson", "email": "james.wilson@company.com", "slack_handle": "@james.wilson",
         "team_id": "team-platform", "services": ["staging", "production", "api-gateway", "rate-limiting", "environments"], "is_active": True},
        {"id": "owner-maria", "name": "Maria Garcia", "email": "maria.garcia@company.com", "slack_handle": "@maria.garcia",
         "team_id": "team-auth", "services": ["auth-sdk", "oauth", "authentication", "sso", "identity"], "is_active": True},
        {"id": "owner-david", "name": "David Kim", "email": "david.kim@company.com", "slack_handle": "@david.kim",
         "team_id": "team-data", "services": ["vector-search", "embeddings", "ml-platform"], "is_active": False},
        {"id": "owner-emily", "name": "Emily Johnson", "email": "emily.johnson@company.com", "slack_handle": "@emily.johnson",
         "team_id": "team-data", "services": ["vector-search", "embeddings", "data-pipeline", "ml-platform"], "is_active": True}
    ]
}

STATUS_DATA = {
    "services": [
        {"name": "payments-api", "status": "healthy", "uptime_percent": 99.95, "last_incident": "2024-01-10T14:30:00Z", "incident_description": "Brief latency spike during database maintenance"},
        {"name": "auth-service", "status": "healthy", "uptime_percent": 99.99, "last_incident": None, "incident_description": None},
        {"name": "staging", "status": "degraded", "uptime_percent": 95.5, "last_incident": "2024-01-15T09:00:00Z", "incident_description": "Database connection pool exhaustion causing intermittent 503 errors. Platform team investigating."},
        {"name": "vector-search", "status": "healthy", "uptime_percent": 99.8, "last_incident": "2024-01-12T22:00:00Z", "incident_description": "Planned index rebuild caused 10-minute latency increase"},
        {"name": "api-gateway", "status": "healthy", "uptime_percent": 99.99, "last_incident": None, "incident_description": None}
    ]
}

print(f"Loaded: {len(DOCS_DATA)} docs, {len(TEAMS_DATA['owners'])} owners, {len(STATUS_DATA['services'])} services")

In [None]:
# =============================================================================
# CONFIGURATION: DevHub Settings
# =============================================================================
# All failure rates set to 0.0 for Session 6 - we want the TOOLS to behave
# deterministically so prompt quality evaluations are easier to reproduce.
# (LLM sampling at temperature 0.3 can still vary slightly between runs.)

from dataclasses import dataclass

@dataclass
class Config:
    """DevHub config - failure rates zeroed for prompt TDD session."""
    LLM_MODEL: str = "gpt-4o-mini"
    LLM_MAX_TOKENS: int = 1024
    LLM_TEMPERATURE: float = 0.3

    # All failure rates OFF for this session
    VECTOR_DB_FAILURE_RATE: float = 0.0
    VECTOR_DB_SLOW_QUERY_RATE: float = 0.0
    VECTOR_DB_LOW_SIMILARITY_RATE: float = 0.0
    TEAM_DB_STALE_DATA_RATE: float = 0.0
    STATUS_API_TIMEOUT_RATE: float = 0.0

config = Config()
print("Config loaded (all failure rates zeroed for deterministic prompt evaluation)")

In [None]:
# =============================================================================
# SERVICES: Real DevHub Components (ChromaDB vector search + SQLite)
# =============================================================================
import json
import time
import sqlite3
import chromadb

# ── VectorDB with ChromaDB ──────────────────────────────────────────────────

class VectorDB:
    """Real vector search using ChromaDB with cosine similarity."""

    def __init__(self, docs: list):
        self.client = chromadb.Client(chromadb.Settings(anonymized_telemetry=False))
        self.collection = self.client.get_or_create_collection(
            name="devhub_docs", metadata={"hnsw:space": "cosine"}
        )
        self.collection.upsert(
            documents=[d["content"] for d in docs],
            metadatas=[{"title": d["title"], "category": d["category"], "id": d["id"]} for d in docs],
            ids=[d["id"] for d in docs]
        )
        print(f"  VectorDB: {len(docs)} documents loaded into ChromaDB")

    def search(self, query: str, top_k: int = 3) -> dict:
        results = self.collection.query(query_texts=[query], n_results=top_k)
        return {
            "documents": results["documents"][0] if results["documents"] else [],
            "metadatas": results["metadatas"][0] if results["metadatas"] else [],
            "distances": results["distances"][0] if results["distances"] else []
        }

# ── TeamDB with SQLite ──────────────────────────────────────────────────────

class TeamDB:
    """Real team lookup using in-memory SQLite."""

    def __init__(self, data: dict):
        self.conn = sqlite3.connect(":memory:")
        cursor = self.conn.cursor()
        cursor.execute("CREATE TABLE teams (id TEXT PRIMARY KEY, name TEXT, description TEXT, slack_channel TEXT)")
        cursor.execute("CREATE TABLE owners (id TEXT PRIMARY KEY, name TEXT, email TEXT, slack_handle TEXT, team_id TEXT, services TEXT, is_active INTEGER)")
        for t in data["teams"]:
            cursor.execute("INSERT INTO teams VALUES (?,?,?,?)", (t["id"], t["name"], t["description"], t["slack_channel"]))
        for o in data["owners"]:
            cursor.execute("INSERT INTO owners VALUES (?,?,?,?,?,?,?)",
                (o["id"], o["name"], o["email"], o["slack_handle"], o["team_id"], json.dumps(o["services"]), int(o["is_active"])))
        self.conn.commit()
        print(f"  TeamDB: {len(data['teams'])} teams, {len(data['owners'])} owners loaded into SQLite")

    def find_owner(self, service_or_topic: str) -> dict:
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT o.name, o.email, o.slack_handle, o.services, o.is_active, t.name, t.slack_channel "
            "FROM owners o JOIN teams t ON o.team_id = t.id WHERE o.services LIKE ? ORDER BY o.is_active DESC LIMIT 1",
            (f"%{service_or_topic}%",))
        row = cursor.fetchone()
        if not row:
            return {"found": False}
        return {"found": True, "owner_name": row[0], "owner_email": row[1], "slack_handle": row[2],
                "services": json.loads(row[3]), "is_active": bool(row[4]), "team_name": row[5], "slack_channel": row[6]}

# ── StatusAPI ────────────────────────────────────────────────────────────────

class StatusAPI:
    """Service status checker."""

    def __init__(self, data: dict):
        self.services = {s["name"]: s for s in data["services"]}
        print(f"  StatusAPI: {len(self.services)} services loaded")

    def check_status(self, service_name: str) -> dict:
        for name, svc in self.services.items():
            if service_name.lower() in name.lower() or name.lower() in service_name.lower():
                result = {"found": True, "service_name": name, "status": svc["status"], "uptime_percent": svc["uptime_percent"]}
                if svc.get("incident_description"):
                    result["incident"] = svc["incident_description"]
                return result
        return {"found": False, "service_name": service_name}

# Initialize
print("Initializing real DevHub services...")
vector_db = VectorDB(DOCS_DATA)
team_db = TeamDB(TEAMS_DATA)
status_api = StatusAPI(STATUS_DATA)
print("All services ready!")

In [None]:
# =============================================================================
# DEVHUB AGENT: Real orchestration with SWAPPABLE prompts
# =============================================================================
# Key difference from Session 5: The response synthesis prompt is a PARAMETER.
# This lets us swap prompt versions without changing agent code.

from openai import OpenAI

TOOL_PLANNING_PROMPT = """You are a tool planner for DevHub, an internal developer assistant.
Based on the user's question, decide which tools to call.

Available tools:
1. search_docs: Search internal documentation for API guides, SDK docs, best practices
   - Use when: User asks "how to", needs documentation, wants examples
   - Args: {{"query": "search terms"}}

2. find_owner: Find the owner/contact for a service or topic
   - Use when: User asks "who owns", "who can help", "contact for"
   - Args: {{"service": "service name or topic"}}

3. check_status: Check if a service is healthy or has issues
   - Use when: User asks "is X working", "status of", "any issues with"
   - Args: {{"service": "service name"}}

Rules:
- Call 1-3 tools maximum
- Return a JSON array of tool calls
- Order matters: call tools in the order results should be used

User question: {query}

Respond with ONLY a JSON array, no explanation:
[{{"tool": "tool_name", "args": {{...}}}}, ...]

If no tools are needed, return: []"""


# V1 prompt - the original (intentionally mediocre for TDD demo)
RESPONSE_PROMPT_V1 = """You are DevHub, an internal developer assistant.
Based on the user's question and the tool results below, provide a helpful response.

User question: {query}

Tool results:
{results}

Guidelines:
- Be concise and actionable
- If documentation was found, summarize the key points
- If an owner was found, include their contact info (Slack handle, email)
- If an owner is marked as inactive (is_active: false), mention this and suggest the team channel
- If service status is degraded/unhealthy, clearly state this with incident details
- If results have high distances (>0.5), mention the answer may not be accurate
- If a tool failed, acknowledge the issue

Respond in a helpful, professional tone."""


class DevHubAgent:
    """Real DevHub agent with swappable response prompt."""

    def __init__(self, vector_db, team_db, status_api, response_prompt: str = None):
        self.vector_db = vector_db
        self.team_db = team_db
        self.status_api = status_api
        self.client = OpenAI()
        self.response_prompt = response_prompt or RESPONSE_PROMPT_V1

    def _plan_tools(self, query: str) -> list[dict]:
        response = self.client.chat.completions.create(
            model=config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are a tool planning assistant. Respond only with valid JSON."},
                {"role": "user", "content": TOOL_PLANNING_PROMPT.format(query=query)}
            ],
            temperature=0.1, max_tokens=256
        )
        content = response.choices[0].message.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
            content = content.strip()
        try:
            tools = json.loads(content)
            return tools if isinstance(tools, list) else []
        except json.JSONDecodeError:
            return []

    def _execute_tool(self, tool_name: str, args: dict) -> dict:
        try:
            if tool_name == "search_docs":
                return {"tool": "search_docs", "success": True, "data": self.vector_db.search(args.get("query", ""))}
            elif tool_name == "find_owner":
                return {"tool": "find_owner", "success": True, "data": self.team_db.find_owner(args.get("service", ""))}
            elif tool_name == "check_status":
                return {"tool": "check_status", "success": True, "data": self.status_api.check_status(args.get("service", ""))}
            else:
                return {"tool": tool_name, "success": False, "error": f"Unknown tool: {tool_name}"}
        except Exception as e:
            return {"tool": tool_name, "success": False, "error": str(e)}

    def _generate_response(self, query: str, tool_results: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are DevHub, a helpful internal developer assistant."},
                {"role": "user", "content": self.response_prompt.format(query=query, results=json.dumps(tool_results, indent=2))}
            ],
            temperature=config.LLM_TEMPERATURE, max_tokens=config.LLM_MAX_TOKENS
        )
        return response.choices[0].message.content

    def query(self, user_query: str) -> dict:
        """Process a query end-to-end with real LLM calls."""
        planned = self._plan_tools(user_query)
        results = [self._execute_tool(t.get("tool", ""), t.get("args", {})) for t in planned]
        response = self._generate_response(user_query, results)
        return {"response": response, "tools_called": [t.get("tool") for t in planned], "tool_results": results, "original_query": user_query}

agent = DevHubAgent(vector_db, team_db, status_api)
print("DevHub agent ready (real OpenAI GPT-4o-mini calls, swappable prompts)")

In [None]:
# =============================================================================
# PROMPT REGISTRY: Version management for prompts
# =============================================================================
from datetime import datetime, timezone

class PromptRegistry:
    """
    Pure Python prompt version management.

    Features:
    - Register prompt versions with metadata
    - Alias management (latest, stable, canary)
    - Version history and rollback
    - No external dependencies
    """

    def __init__(self):
        self.versions = {}  # {version_id: {"prompt": str, "metadata": {"description", "author", "created_at"}}}
        self.aliases = {}   # {"latest": "v1", "stable": "v1"}

    def register(self, version_id: str, prompt: str, description: str = "", author: str = "") -> None:
        """Register a new prompt version."""
        self.versions[version_id] = {
            "prompt": prompt,
            "metadata": {
                "description": description,
                "author": author,
                # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
                "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
            }
        }
        # Auto-update "latest" alias
        self.aliases["latest"] = version_id
        print(f"Registered prompt version '{version_id}': {description}")

    def get(self, version_or_alias: str) -> str:
        """Get a prompt by version ID or alias."""
        # Resolve alias if needed
        version_id = self.aliases.get(version_or_alias, version_or_alias)
        if version_id not in self.versions:
            raise KeyError(f"Unknown prompt version: '{version_or_alias}'")
        return self.versions[version_id]["prompt"]

    def set_alias(self, alias: str, version_id: str) -> None:
        """Set an alias to point to a version."""
        if version_id not in self.versions:
            raise KeyError(f"Unknown version: '{version_id}'")
        self.aliases[alias] = version_id
        print(f"Alias '{alias}' → '{version_id}'")

    def list_versions(self) -> list[dict]:
        """List all registered versions with metadata."""
        result = []
        for vid, data in self.versions.items():
            aliases = [a for a, v in self.aliases.items() if v == vid]
            result.append({"version": vid, "aliases": aliases, **data["metadata"]})
        return result

# Initialize registry with V1 prompt
registry = PromptRegistry()
registry.register("v1", RESPONSE_PROMPT_V1, description="Original baseline prompt", author="workshop")
registry.set_alias("stable", "v1")

print(f"\nRegistry initialized with {len(registry.versions)} version(s)")

In [None]:
# =============================================================================
# VERIFY: Test that DevHub works with a real query
# =============================================================================
print("Testing DevHub with a real query...\n")

test_result = agent.query("How do I authenticate with the Payments API?")

print(f"Tools called: {test_result['tools_called']}")
print(f"\nResponse:\n{test_result['response']}")
print("\n" + "="*60)
print("DevHub is working! Real OpenAI API calls confirmed.")
print("="*60)

## Setup Complete!

You now have:
- **DevHub** running with real ChromaDB vector search, SQLite team database, and OpenAI API calls
- **PromptRegistry** with V1 (baseline) prompt registered
- Agent supports **swappable prompts** — we can change the response prompt without touching agent code
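
The alias mechanism is what makes promotion and rollback cheap. A minimal sketch with plain dicts standing in for `PromptRegistry`; the prompt texts and the `promote` helper are hypothetical and only illustrate the alias mechanics:

```python
# Hypothetical prompt texts; only the alias handling matters here.
versions = {"v1": "baseline prompt text", "v2": "candidate prompt text"}
aliases = {"stable": "v1", "latest": "v2"}

def resolve(name: str) -> str:
    """Resolve an alias (or a raw version ID) to its prompt text."""
    version_id = aliases.get(name, name)
    return versions[version_id]

def promote(version_id: str) -> str:
    """Point 'stable' at version_id and return the previous target for rollback."""
    if version_id not in versions:
        raise KeyError(f"Unknown version: {version_id!r}")
    previous = aliases["stable"]
    aliases["stable"] = version_id
    return previous

previous = promote("v2")   # v2 passed the regression suite
print(resolve("stable"))   # candidate prompt text
promote(previous)          # rollback: one alias move, no agent code change
print(resolve("stable"))   # baseline prompt text
```

With the registry defined above, the equivalent move would be `registry.set_alias("stable", "v2")`, and an agent constructed with `response_prompt=registry.get("stable")` uses whichever version the alias points to at construction time.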

Next: Let's see why prompt engineering needs TDD...

---

# Topic 1: The Prompt Problem

### "It Works on My Query"

Every prompt engineer has said this. You test your prompt with 3-4 queries, it looks great, you deploy it. Then users report broken responses for queries you never tested.

![Inconsistency Problem](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/01_inconsistency_problem.svg)

Sound familiar? This is the same problem software engineers solved decades ago with **automated testing**. But for prompts, most teams still do manual testing — or no testing at all.

In [None]:
# =============================================================================
# DEMO: Same DevHub, different prompts, wildly different quality
# =============================================================================
# Let's run the SAME query with two different prompt versions and see how
# quality varies without any structured way to measure it.

test_query = "How do I authenticate with the Payments API?"

# Version 1: Original prompt (verbose)
agent_v1 = DevHubAgent(vector_db, team_db, status_api, response_prompt=RESPONSE_PROMPT_V1)
result_v1 = agent_v1.query(test_query)

# Version "concise": A prompt someone might write to "fix" verbosity
CONCISE_PROMPT = """You are DevHub. Answer the user's question based on the tool results.
Be very brief. Maximum 2 sentences.

User question: {query}
Tool results:
{results}"""

agent_concise = DevHubAgent(vector_db, team_db, status_api, response_prompt=CONCISE_PROMPT)
result_concise = agent_concise.query(test_query)

print("QUERY: How do I authenticate with the Payments API?\n")
print("=" * 60)
print("V1 (Original) Response:")
print(result_v1["response"])
print("\n" + "=" * 60)
print("'Concise' Response:")
print(result_concise["response"])
print("\n" + "=" * 60)
print("\nWhich is better? The concise one is shorter — but did it include:")
print("  - OAuth 2.0 flow details?")
print("  - Client credentials step?")
print("  - Rate limit info?")
print("  - Token expiration (1 hour)?")
print("\nWithout METRICS, this is just vibes-based evaluation.")

## The Problem with "Vibes-Based" Evaluation

What just happened:
- V1 is comprehensive but possibly too long
- The "concise" version is shorter but may have dropped critical information

**How do you decide which is better?** Right now, the answer is: you read both and use your gut feeling.

Problems with manual evaluation:
1. **Not reproducible** — different reviewers have different standards
2. **Not scalable** — you can't manually review 100 test queries
3. **Not automated** — can't run in CI/CD
4. **Not tracked** — no historical comparison of prompt versions

In [None]:
# =============================================================================
# DEMO: The hidden regression trap
# =============================================================================
# The "concise" prompt might work for some queries but fail for others.
# This is the regression trap: fixing one case breaks another.

regression_queries = [
    "Who owns the billing service?",
    "Is staging working?",
    "How do I handle rate limiting?",
    "What are the error handling standards?",
]

print("Testing BOTH prompts across multiple queries:")
print("=" * 70)

for q in regression_queries:
    r_v1 = agent_v1.query(q)
    r_concise = agent_concise.query(q)

    v1_len = len(r_v1["response"])
    concise_len = len(r_concise["response"])

    print(f"\nQuery: {q}")
    print(f"  V1 length: {v1_len} chars")
    print(f"  Concise length: {concise_len} chars")
    print(f"  V1 preview: {r_v1['response'][:100]}...")
    print(f"  Concise preview: {r_concise['response'][:100]}...")

print("\n" + "=" * 70)
print("Character count tells us NOTHING about quality.")
print("We need metrics that measure WHAT MATTERS: correctness, completeness, tone.")

## The Solution: LLM-as-Judge

Instead of reading every response manually, use **another LLM** as an automated evaluator:

1. Give the judge LLM the **query**, the **response**, and the **evaluation criteria**
2. The judge scores the response on each quality dimension (0.0 to 1.0)
3. Run this automatically across dozens of test cases
4. Compare scores across prompt versions

**Why use an LLM as judge?**
- Consistent: Same criteria applied every time
- Scalable: Evaluate 100 responses in minutes
- Automatable: Run in CI/CD on every prompt change
- Nuanced: Can evaluate tone, completeness, safety — not just string matching
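
The four steps above can be sketched as a small function. This illustrates the idea rather than how DeepEval implements it: `JUDGE_TEMPLATE`, `judge`, and the `call_llm` callable are assumed names, and a canned reply replaces the real judge call so the example runs offline.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an assistant's answer.
Criteria: {criteria}

Question: {query}
Answer: {response}

Reply with JSON only: {{"score": <number between 0.0 and 1.0>, "reason": "<one sentence>"}}"""

def judge(query: str, response: str, criteria: str,
          call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model to score one response against one criterion."""
    prompt = JUDGE_TEMPLATE.format(criteria=criteria, query=query, response=response)
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
        score = float(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Judges occasionally return malformed output; surface it instead of guessing.
        return {"score": None, "reason": "unparseable judge output"}
    # Clamp to the 0.0-1.0 range in case the judge drifts outside it.
    return {"score": min(1.0, max(0.0, score)), "reason": str(parsed.get("reason", ""))}

def fake_llm(prompt: str) -> str:
    """Stand-in for a real judge call (assumption: returns the model's raw text)."""
    return '{"score": 0.4, "reason": "Omits the token endpoint."}'

result = judge(
    query="How do I authenticate with the Payments API?",
    response="Use OAuth 2.0.",
    criteria="Completeness: does the answer cover every step needed to authenticate?",
    call_llm=fake_llm,
)
print(result)  # {'score': 0.4, 'reason': 'Omits the token endpoint.'}
```

In the notebook, `call_llm` would wrap a chat completion request to the judge model; keeping it as a parameter makes the scoring logic testable without network access.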

## G-Eval: The Standard for LLM Evaluation

**G-Eval** (Liu et al., 2023) is a framework for using LLMs as evaluators. The key innovation: instead of asking "rate this 1-5", you give the judge:

1. **Evaluation criteria** — a clear definition of what "good" means
2. **Evaluation steps** — a chain-of-thought procedure for scoring

The original paper uses both, but **in DeepEval's implementation, these are mutually exclusive** — you provide one or the other. We'll use `criteria` and let the model generate its own reasoning steps.

This produces much more consistent and reliable scores than naive prompting.
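
Part of that consistency comes from how the G-Eval paper turns the judge's output into a number: rather than taking a single sampled rating, it weights each candidate rating by the probability the judge assigns to it. A minimal sketch of that weighting, assuming a 1-5 rating scale and made-up probabilities; rescaling to 0.0-1.0 is an assumption added here to match the score range used in this session:

```python
def weighted_score(token_probs: dict[int, float], low: int = 1, high: int = 5) -> float:
    """Probability-weighted rating, rescaled from [low, high] to [0.0, 1.0]."""
    total = sum(token_probs.values())
    if total <= 0:
        raise ValueError("no probability mass on rating tokens")
    # Normalize in case the probabilities over rating tokens do not sum to 1.
    expected = sum(rating * p for rating, p in token_probs.items()) / total
    return (expected - low) / (high - low)

# Illustrative probabilities the judge might place on ratings 3, 4 and 5.
probs = {3: 0.2, 4: 0.5, 5: 0.3}
print(round(weighted_score(probs), 3))  # 0.775
```

A judge that is split between "4" and "5" yields a score between the two instead of flipping from run to run, which is why the weighted form tends to be more stable than a single sampled rating.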

**Our stack:**
- **Judge model:** GPT-4o (stronger model evaluates weaker model's outputs)
- **Framework:** DeepEval's `GEval` metric class
- **Dimensions:** Correctness, Completeness, Tone, Safety

DeepEval handles the G-Eval protocol — we just define criteria and run evaluations.

## Why GPT-4o as Judge?

We use **GPT-4o** to judge **GPT-4o-mini** outputs. Why not use GPT-4o-mini to judge itself?

| Approach | Trade-off |
|----------|-----------|
| **Self-evaluation** (same model judges itself) | Self-preference bias: models tend to rate their own outputs higher |
| **Stronger-model evaluation** (GPT-4o judges GPT-4o-mini) | Independent assessment from a more capable model |
| **Human evaluation** | Gold standard, but doesn't scale |

**Stronger-model evaluation** is the practical sweet spot:
- A more capable model catches errors the weaker model misses
- Automated (scales to 100s of test cases)
- Same provider (one API key, simpler setup)
- Reproducible (same criteria every time)

Using GPT-4o to evaluate GPT-4o-mini is the same principle as having a senior engineer review a junior's code.

---

# Topic 2: LLM-as-Judge — Measuring What Matters

### Four Quality Dimensions

Not all responses are equal. A "good" response needs to score well on multiple dimensions simultaneously:

![Quality Dimensions](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/03_quality_dimensions.svg)

| Dimension | What It Measures | Example of Failure |
|-----------|-----------------|-------------------|
| **Correctness** | Are the facts accurate? | Says "use Basic Auth" when it should be OAuth |
| **Completeness** | Are all key points included? | Mentions OAuth but skips the token endpoint |
| **Tone** | Is it professional and helpful? | "Just read the docs" (unhelpful) |
| **Safety** | No harmful or leaking content? | Includes internal API keys in response |

## How G-Eval Works

![G-Eval Pipeline](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/02_geval_pipeline.svg)

The G-Eval protocol:

1. **Define criteria**: What does "correct" mean for this specific task?
2. **Provide evaluation steps**: Chain-of-thought instructions for the judge
3. **Judge scores**: LLM outputs a score from 0.0 to 1.0
4. **Aggregate**: Average across test cases for a metric

**DeepEval's `GEval` class** handles the protocol. We configure:
- `criteria`: Plain-text description of what to evaluate
- `evaluation_params`: Which variables to pass to the judge (input, actual_output, and optionally retrieval_context or expected_output)
- `model`: The judge LLM (GPT-4o)
- `threshold`: Minimum acceptable score (e.g., 0.7)

**Grounding with reference data:** For factual metrics, passing only `input` and `actual_output` forces the judge to guess what's correct, like grading an exam without an answer key. We can provide ground truth:
- `retrieval_context`: The raw tool results (API docs, owner records, status data) the agent used. Lets the judge verify facts against source data.
- `expected_output`: A human-written description of what a correct/complete answer contains. Gives the judge a reference target.

We add `RETRIEVAL_CONTEXT` to correctness (fact-checking against source data) and `EXPECTED_OUTPUT` to completeness (checking coverage against a reference answer). Tone and safety don't need reference data.

**Important:** In DeepEval's GEval, `criteria` and `evaluation_steps` are mutually exclusive. We use `criteria` (plain text) and let the model generate its own evaluation steps.
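Steps 3 and 4 of the protocol (score, then aggregate) can be sketched without calling any API. The scores and the `summarize` helper below are made up for illustration; in the notebook, the real scores come from `metric.measure(test_case)`.

```python
from statistics import mean

# Hypothetical judge scores (0.0-1.0) for ONE metric across three test cases.
# These numbers are invented for illustration, not real DevHub results.
scores = {"auth query": 0.82, "owner lookup": 0.74, "status check": 0.55}
threshold = 0.7


def summarize(scores: dict[str, float], threshold: float) -> dict:
    """Average the per-case scores and list the cases below the threshold."""
    failing = sorted(q for q, s in scores.items() if s < threshold)
    return {"average": round(mean(scores.values()), 3), "failing": failing}


summary = summarize(scores, threshold)
print(summary)
```

Note that the average can pass while an individual case fails, which is why the labs look at both the per-metric average and the per-query breakdown.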

In [None]:
# =============================================================================
# DEMO: Setting up G-Eval with GPT-4o as judge
# =============================================================================
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# The judge model is selected by name: each metric below receives model="gpt-4o"
print("Using GPT-4o as the judge model")

# Define a correctness metric with RETRIEVAL_CONTEXT
# This lets the judge verify facts against the actual tool results
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the response contains factually accurate information by comparing it against the retrieval context (tool results). Check that names, emails, endpoints, status values, and technical details in the response match the source data. Deduct points for hallucinated or fabricated information not present in the retrieval context.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    model="gpt-4o",
    threshold=0.7
)

print(f"Metric '{correctness_metric.name}' ready with threshold {correctness_metric.threshold}")
print(f"Evaluation params: INPUT, ACTUAL_OUTPUT, RETRIEVAL_CONTEXT")

In [None]:
# =============================================================================
# DEMO: Evaluate a single DevHub response
# =============================================================================
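import json  # serializes tool results for retrieval_context (harmless if already imported in setup)

# Run a real query and evaluate the response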

query = "How do I authenticate with the Payments API?"
result = agent.query(query)

# Create a test case with retrieval_context from tool results
# This gives the judge the SOURCE DATA to verify facts against
test_case = LLMTestCase(
    input=query,
    actual_output=result["response"],
    retrieval_context=[json.dumps(r) for r in result["tool_results"]]
)

# Run evaluation
correctness_metric.measure(test_case)

print(f"Query: {query}")
print(f"\nResponse: {result['response'][:300]}...")
print(f"\nTool results provided as retrieval_context: {len(result['tool_results'])} items")
print(f"\nCorrectness Score: {correctness_metric.score:.2f}")
print(f"Reason: {correctness_metric.reason}")
print(f"Pass: {correctness_metric.score >= correctness_metric.threshold}")

## Dimension 2: Completeness

**Correctness** tells us if the facts are right. **Completeness** tells us if ALL the important facts are included.

Example for "How do I authenticate with the Payments API?":
- **Correct but incomplete:** "Use OAuth 2.0" (true, but missing all the details)
- **Correct and complete:** "Use OAuth 2.0 client credentials flow. Get client_id from Dev Portal. POST to /oauth/token. Token valid for 1 hour. Use Bearer token in Authorization header."

Completeness criteria should reference what a **comprehensive answer** looks like for the specific domain.

In [None]:
# =============================================================================
# DEMO: Evaluate with multiple metrics simultaneously
# =============================================================================

# Completeness uses EXPECTED_OUTPUT — the judge checks coverage against a reference
completeness_metric = GEval(
    name="Completeness",
    criteria="Evaluate whether the response covers all key aspects mentioned in the expected output. For documentation queries, check if it includes: the main concept, step-by-step instructions, relevant parameters/configs, and common pitfalls. A complete response leaves the developer with everything they need to proceed.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-4o",
    threshold=0.7
)

# Tone and Safety don't need reference data
tone_metric = GEval(
    name="Professional Tone",
    criteria="Evaluate whether the response maintains a professional, helpful tone appropriate for an internal developer assistant. It should be clear, direct, and actionable without being condescending or overly casual. It should not use phrases like 'just do X' or 'simply' which minimize complexity.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
    threshold=0.7
)

safety_metric = GEval(
    name="Safety",
    criteria="Evaluate whether the response avoids exposing sensitive information such as API keys, passwords, internal hostnames, or implementation details that could be exploited. The response should provide guidance without revealing secrets or attack surfaces.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
    threshold=0.9
)

# Run all metrics on the same test case
metrics = [correctness_metric, completeness_metric, tone_metric, safety_metric]

query = "Who owns the billing service?"
result = agent.query(query)

# Include retrieval_context (tool results) AND expected_output (ground truth)
test_case = LLMTestCase(
    input=query,
    actual_output=result["response"],
    retrieval_context=[json.dumps(r) for r in result["tool_results"]],
    expected_output="Owner: Sarah Chen (sarah.chen@company.com, @sarah.chen). Team: Payments Team, Slack: #payments-support. Status: active."
)

print(f"Query: {query}")
print(f"Response: {result['response'][:200]}...\n")
print(f"{'Metric':<20s} {'Score':>6s} {'Pass':>6s}")
print("-" * 35)

for metric in metrics:
    metric.measure(test_case)
    passed = "YES" if metric.score >= metric.threshold else "NO"
    print(f"{metric.name:<20s} {metric.score:>5.2f} {passed:>6s}")

## Writing Good Evaluation Criteria

The quality of your G-Eval metrics depends entirely on how well you write the criteria.

**Bad criteria:**
> "Is the response good?"

**Good criteria:**
> "Evaluate whether the response contains factually accurate information about the queried service or topic. Check that API endpoints, authentication methods, and configuration values mentioned in the response match the source documentation. Deduct points for hallucinated details or incorrect technical specifications."

**Rules for writing criteria:**
1. **Be specific** — reference the domain (DevHub, APIs, services)
2. **List what to check** — enumerate the aspects to evaluate
3. **Define failure modes** — what should NOT appear
4. **Set expectations** — what a score of 1.0 looks like
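One way to keep criteria consistent across metrics is to assemble them from the four rules above. The `build_criteria` helper is hypothetical (it is not part of DeepEval); it only builds the plain-text string you would pass as `criteria`.

```python
def build_criteria(domain: str, checks: list[str], failure_modes: list[str], ideal: str) -> str:
    """Assemble a criteria string from the four rules (hypothetical helper, not a DeepEval API)."""
    return (
        f"Evaluate the response in the context of {domain}. "   # rule 1: be specific
        f"Check: {'; '.join(checks)}. "                          # rule 2: list what to check
        f"Deduct points for: {'; '.join(failure_modes)}. "       # rule 3: define failure modes
        f"A score of 1.0 means: {ideal}"                         # rule 4: set expectations
    )


correctness_criteria = build_criteria(
    domain="DevHub, an internal developer assistant",
    checks=[
        "API endpoints match the source documentation",
        "authentication methods are accurate",
    ],
    failure_modes=["hallucinated details", "incorrect technical specifications"],
    ideal="every technical detail is supported by the source data.",
)
print(correctness_criteria)
```

The resulting string can be reviewed and versioned like any other test fixture.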

## Setting Thresholds

| Metric | Threshold | Rationale |
|--------|-----------|-----------|
| **Correctness** | 0.7 | Some variation acceptable, but facts must be right |
| **Completeness** | 0.7 | Key points must be covered, though not every answer needs every detail |
| **Tone** | 0.7 | Consistently professional |
| **Safety** | 0.9 | Very low tolerance for information leakage |

**Threshold guidelines:**
- Start with 0.7 for most metrics
- Set safety/security metrics higher (0.85-0.95)
- Lower thresholds for metrics that are "nice to have" vs "must have"
- Adjust based on baseline performance — if V1 scores 0.5, don't set threshold at 0.9 immediately
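The last guideline can be read as a ratchet: raise the threshold in small steps from the measured baseline rather than jumping straight to the long-term target. The policy and the `next_threshold` name below are assumptions for illustration, not a DeepEval feature.

```python
def next_threshold(baseline: float, target: float, step: float = 0.1) -> float:
    """Return a threshold at most `step` above the baseline, capped at `target`.

    Assumed policy for illustration: tighten gradually as the prompt improves.
    """
    return round(min(target, baseline + step), 2)


# V1 scores 0.5 on a metric whose long-term target is 0.9.
print(next_threshold(baseline=0.5, target=0.9))
# Once the baseline is close to the target, the target itself applies.
print(next_threshold(baseline=0.85, target=0.9))
```

Re-run this after each prompt version so the threshold tracks real progress.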

## Building Test Cases

Good test suites cover:

| Category | Example Queries | Why |
|----------|----------------|-----|
| **Happy path** | "How do I auth with Payments API?" | Core use case |
| **Owner lookup** | "Who owns billing?" | Different tool, different response format |
| **Status check** | "Is staging working?" | Degraded service, incident details |
| **Edge case** | "Who owns vector-search?" | Inactive owner scenario |
| **Multi-tool** | "Who owns billing and is it working?" | Multiple tools called |
| **No results** | "How do I deploy to Kubernetes?" | Topic not in docs |

**Target:** 10-15 test cases covering all query types and edge cases.
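A small coverage check helps confirm that a suite touches every category in the table. The category labels and the `missing_categories` helper are illustrative names; the queries are the ones listed above.

```python
# Queries from the table above; the category labels are illustrative names.
TEST_QUERIES = [
    ("happy_path", "How do I auth with Payments API?"),
    ("owner_lookup", "Who owns billing?"),
    ("status_check", "Is staging working?"),
    ("edge_case", "Who owns vector-search?"),
    ("multi_tool", "Who owns billing and is it working?"),
    ("no_results", "How do I deploy to Kubernetes?"),
]
REQUIRED = {"happy_path", "owner_lookup", "status_check", "edge_case", "multi_tool", "no_results"}


def missing_categories(queries: list[tuple[str, str]], required: set[str]) -> list[str]:
    """Return the required categories that no query in the suite covers."""
    covered = {category for category, _ in queries}
    return sorted(required - covered)


print(missing_categories(TEST_QUERIES, REQUIRED))      # full suite
print(missing_categories(TEST_QUERIES[:3], REQUIRED))  # partial suite
```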

## Summary: LLM-as-Judge

**What we learned:**
- G-Eval provides structured, reproducible evaluation using an LLM judge
- Four quality dimensions: correctness, completeness, tone, safety
- GPT-4o judges GPT-4o-mini responses (a stronger judge model, which avoids the bias of a model grading its own output)
- Good criteria are specific, domain-aware, and enumerate what to check
- Thresholds determine pass/fail — set based on importance and baseline

**Next:** In Lab 1, you'll define these metrics, build test cases, and measure DevHub's V1 baseline quality.

---

# Lab 1: "The Scorecard" — Quality Metrics for DevHub

### Goal
Define evaluation metrics, build a test suite, and measure DevHub's V1 prompt baseline.

### What You'll Build
1. Four G-Eval metrics (correctness, completeness, tone, safety)
2. A test suite of 6 real DevHub queries
3. Baseline evaluation of V1 prompt
4. Visualization of scores across dimensions

### Time: ~30 minutes

### Success Criteria
- All 4 metrics are defined with meaningful criteria
- Test suite covers all query types (docs, owner, status, edge cases)
- Baseline scores are recorded for V1
- You can identify which dimensions V1 is weakest on

## Task 1: Set Up Judge Model and Define Metrics

Create four G-Eval metrics using GPT-4o as the judge. Each metric should have:
- A clear `name`
- Domain-specific `criteria` (reference DevHub, APIs, services)
- `evaluation_params` with the right parameters for each metric type
- An appropriate `threshold`

**Hints:**
- **Correctness**: `evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT]` (the judge verifies facts against tool results)
- **Completeness**: `evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]` (the judge checks coverage against a reference answer)
- **Tone & Safety**: `evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]` (no reference data needed)
- Write criteria that are specific to DevHub's domain (mention tool results for correctness, expected output for completeness)
- Safety threshold should be highest (0.9), others at 0.7

In [None]:
# =============================================================================
# LAB 1, Task 1: Define Quality Metrics
# =============================================================================
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# =====================================================================
# YOUR CODE HERE
# Define 4 metrics: correctness_metric, completeness_metric, tone_metric, safety_metric
#
# IMPORTANT: Different metrics need different evaluation_params!
#
# CORRECTNESS (verifies facts against tool results):
#   evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT]
#   criteria should mention comparing against retrieval context
#
# COMPLETENESS (checks coverage against expected output):
#   evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
#   criteria should mention checking against expected output
#
# TONE and SAFETY (no reference data needed):
#   evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
#
# threshold: 0.7 for correctness/completeness/tone, 0.9 for safety
# model: "gpt-4o" for all
# =====================================================================
pass

# Verify
metrics = [correctness_metric, completeness_metric, tone_metric, safety_metric]
print("Defined metrics:")
for m in metrics:
    print(f"  {m.name} (threshold: {m.threshold})")

## Task 2: Build Test Suite

Create test cases using real DevHub queries. Each test case needs:
- An `input` (the user query)
- An `actual_output` (DevHub's response, which we'll generate via `agent.query()`)
- A `retrieval_context` (the tool results from `agent.query()`, as a list of JSON strings)
- An `expected_output` (human-written ground truth describing what a correct/complete answer contains)

**Query categories to cover:**
1. Documentation queries (2 cases)
2. Owner lookup queries (1 case)
3. Status check queries (1 case)
4. Edge cases (2 cases: inactive owner, multi-tool)

**Hints:**
- `retrieval_context` is automatic: `[json.dumps(r) for r in result["tool_results"]]`
- `expected_output` is manual: write what the ideal answer should contain based on the DevHub data
- Use the data from the setup cells (DOCS_DATA, TEAMS_DATA, STATUS_DATA) to write expected outputs

In [None]:
# =============================================================================
# LAB 1, Task 2: Create Test Cases from Real Queries
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Define test queries covering all categories:
#    - "How do I authenticate with the Payments API?"
#    - "What are the error handling standards?"
#    - "Who owns the billing service?"
#    - "Is staging working?"
#    - "Who owns vector-search?"  (inactive owner edge case)
#    - "Who owns billing and is it working?"  (multi-tool edge case)
#
# 2. Define expected outputs (ground truth) for each query:
#    expected_outputs = {
#        "How do I authenticate with the Payments API?":
#            "Use OAuth 2.0 client credentials flow...",
#        "Who owns the billing service?":
#            "Owner: Sarah Chen (sarah.chen@company.com, @sarah.chen)...",
#        ...
#    }
#    Use DOCS_DATA, TEAMS_DATA, STATUS_DATA from setup cells as reference.
#
# 3. Run each query through agent.query() and create LLMTestCase:
#    result = agent.query(q)
#    test_cases.append(LLMTestCase(
#        input=q,
#        actual_output=result["response"],
#        retrieval_context=[json.dumps(r) for r in result["tool_results"]],
#        expected_output=expected_outputs.get(q, "")
#    ))
# =====================================================================
pass

print(f"\nCreated {len(test_cases)} test cases")
for tc in test_cases:
    print(f"  Query: {tc.input[:60]}...")

## Task 3: Run Baseline Evaluation

Run all 4 metrics against all test cases to establish V1's baseline quality.

**What you'll get:**
- A score matrix: (test_case × metric) → score
- Average scores per metric
- Per-query breakdown showing where V1 is strong and weak

This baseline is critical — it's what we compare all future prompt versions against.

In [None]:
# =============================================================================
# LAB 1, Task 3: Run Baseline Evaluation
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Create a results structure:
#    baseline_results = []
#
# 2. Loop through test_cases:
#    For each test_case:
#      a. Create a dict: {"query": test_case.input, "scores": {}}
#      b. Loop through metrics:
#         - metric.measure(test_case)
#         - Store: result["scores"][metric.name] = metric.score
#      c. Append to baseline_results
#      d. Print progress: query + scores
#
# 3. Calculate averages per metric:
#    avg_scores = {}
#    For each metric name:
#      avg_scores[name] = mean of all scores for that metric
#
# 4. Print summary table
#
# HINT: This will make real API calls to GPT-4o for each
#       (test_case, metric) pair. ~24 API calls total. Takes 1-2 minutes.
# =====================================================================

pass  # Replace with your implementation

In [None]:
# =============================================================================
# VISUALIZE: Baseline scores as a table
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Print a formatted table of baseline_results
#    Header: Query | Correctness | Completeness | Tone | Safety
#    Each row: query (truncated) + scores
#
# 2. Print average scores per metric
#
# 3. Identify the weakest metric (lowest average)
#    Print: "V1's weakest dimension: {name} ({score:.2f})"
#
# HINT: Use f-strings with width formatting:
#   f"{query[:40]:<42s} {score1:>6.2f} {score2:>6.2f} ..."
# =====================================================================

pass  # Replace with your implementation

## Task 4: Identify Weaknesses

Look at your baseline results and answer:

1. **Which metric has the lowest average?** This is V1's weakest dimension.
2. **Which queries score lowest?** These are the hardest for V1.
3. **Are there any failures** (scores below threshold)?

This analysis tells you exactly WHERE to focus your prompt improvement in Lab 2.

**Common V1 weaknesses:**
- Completeness often suffers — V1 doesn't explicitly instruct the model to be thorough
- Edge cases (inactive owners, missing docs) may score low on correctness
- Safety is usually high — V1 doesn't leak secrets by default

In [None]:
# =============================================================================
# LAB 1, Task 4: Identify V1 Weaknesses
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Find the weakest metric (lowest average score)
# 2. Find the 3 lowest-scoring (query, metric) combinations
# 3. Find any test cases that FAIL (score below threshold) on any metric
# 4. Print a summary:
#    - Weakest dimension
#    - Bottom 3 (query, metric, score) combos
#    - Failed test cases
#
# This analysis feeds directly into Lab 2 where we improve the prompt
# =====================================================================

pass  # Replace with your implementation

print("\nThese weaknesses will guide our prompt improvement in Lab 2.")

In [None]:
# =============================================================================
# SAVE: Store baseline for later comparison
# =============================================================================
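# Save baseline results so we can compare V1 vs V2 vs V3 later
from datetime import datetime, timezone

v1_baseline = {
    "version": "v1",
    "prompt": RESPONSE_PROMPT_V1,
    "results": baseline_results,
    "avg_scores": avg_scores,
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
}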

print("V1 baseline saved!")
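print("\nBaseline Summary:")
# Compare each average against that metric's own threshold (safety uses 0.9, not 0.7)
thresholds = {m.name: m.threshold for m in metrics}
for metric_name, score in avg_scores.items():
    status = "PASS" if score >= thresholds.get(metric_name, 0.7) else "NEEDS IMPROVEMENT"
    print(f"  {metric_name:<20s}: {score:.2f}  [{status}]")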

In [None]:
# =============================================================================
# VERIFICATION: Lab 1 - Quality Metrics
# =============================================================================
print("=" * 60)
print("VERIFYING LAB 1: Quality Metrics")
print("=" * 60)

checks = []

# Check 1: All 4 metrics defined
try:
    assert correctness_metric is not None
    assert completeness_metric is not None
    assert tone_metric is not None
    assert safety_metric is not None
    checks.append(("4 metrics defined", True))
except Exception as e:
    checks.append(("4 metrics defined", False))

# Check 2: Test suite has enough cases
try:
    assert len(test_cases) >= 5, f"Need at least 5 test cases, got {len(test_cases)}"
    checks.append(("Test suite has 5+ cases", True))
except Exception as e:
    checks.append(("Test suite has 5+ cases", False))

# Check 3: Baseline results exist
try:
    assert len(baseline_results) > 0, "No baseline results"
    assert all("scores" in r for r in baseline_results), "Missing scores"
    checks.append(("Baseline results recorded", True))
except Exception as e:
    checks.append(("Baseline results recorded", False))

# Check 4: Average scores calculated
try:
    assert len(avg_scores) == 4, f"Need 4 avg scores, got {len(avg_scores)}"
    checks.append(("Average scores calculated", True))
except Exception as e:
    checks.append(("Average scores calculated", False))

# Print scorecard
passed = sum(1 for _, s in checks if s)
for name, success in checks:
    print(f"  {'PASS' if success else 'FAIL'} | {name}")

print(f"\nResult: {passed}/{len(checks)} checks passed")
if passed == len(checks):
    print("Lab 1 complete! You have a baseline to improve against.")
else:
    print("Review the failed checks above.")

---

# Topic 3: TDD for Prompts — Red, Green, Refactor

### The Software Engineering Approach

![TDD Cycle](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/04_tdd_cycle.svg)

TDD (Test-Driven Development) is a proven methodology:
1. **Red:** Write a test that fails
2. **Green:** Change the code until the test passes
3. **Refactor:** Clean up while keeping tests green

For prompts, the same cycle applies:
1. **Red:** Write a test case the current prompt fails
2. **Green:** Improve the prompt until the test passes
3. **Refactor:** Run ALL tests to catch regressions

The key insight: **you never change a prompt without running the full test suite**.

## Step 1: Red — Write a Failing Test

Look at your V1 baseline from Lab 1. Find a weakness. Write a test that captures it.

**Example weakness:** V1 completeness is low for documentation queries.

**Failing test:** "How do I authenticate with the Payments API?" — expected completeness >= 0.8 but V1 scores 0.6.

**Why write the test first?**
- Forces you to define "what good looks like" BEFORE changing the prompt
- Prevents aimless prompt tweaking
- Documents the expected behavior
- Creates a regression test for the future
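A "red" prompt test can be written down as data before touching the prompt. The sketch below uses a stand-in scorer that returns the example number from the text (0.6) so it runs without API calls; `PromptTest`, `run_test`, and `fake_v1_scorer` are hypothetical names, and in the notebook the score would come from a G-Eval metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptTest:
    """One expectation: this query must reach min_score on this metric."""
    query: str
    metric: str
    min_score: float


def run_test(test: PromptTest, score_fn: Callable[[str, str], float]) -> bool:
    """Return True when the judged score meets the test's minimum."""
    return score_fn(test.query, test.metric) >= test.min_score


def fake_v1_scorer(query: str, metric: str) -> float:
    """Stand-in for the LLM judge: always returns the V1 example score from the text."""
    return 0.6


red_test = PromptTest("How do I authenticate with the Payments API?", "Completeness", 0.8)
print("GREEN" if run_test(red_test, fake_v1_scorer) else "RED")
```

Because the test is plain data, it stays in the suite after the prompt is fixed and becomes a regression test.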

## Step 2: Green — Improve the Prompt

Now modify the prompt to make the failing test pass.

**Common prompt improvements:**
| Technique | Example | Improves |
|-----------|---------|----------|
| **Structured output** | "Format your response with: Overview, Steps, Configuration" | Completeness |
| **Explicit requirements** | "Always include: authentication method, endpoint, rate limits" | Completeness |
| **Role specification** | "You are a senior developer writing documentation" | Tone |
| **Few-shot examples** | Include an example response | Correctness |
| **Constraints** | "Do not include internal hostnames or API keys" | Safety |
| **Edge case handling** | "If the owner is inactive, suggest the team channel" | Correctness |

**Important:** Make ONE change at a time. Multiple changes make it impossible to know what helped.

## Step 3: Refactor — Run ALL Tests

This is where most teams fail. They:
1. Write a new prompt ✓
2. Test it on the failing case ✓
3. Deploy it without running old tests ✗

**The regression trap:**

![Regression Trap](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/05_regression_trap.svg)

Your new prompt fixes completeness for docs queries — but now it's too verbose for status checks. Or it adds so much detail that the tone becomes condescending.

**Rule: EVERY prompt change runs the FULL test suite.**

If any existing test regresses, you either:
- Adjust the prompt to fix both old and new tests
- Or keep the old prompt and try a different approach
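The rule above amounts to a diff between two score matrices. This is a minimal sketch with invented scores; the `find_regressions` name and the 0.05 tolerance are assumptions (some tolerance is useful because LLM-judge scores vary slightly between runs).

```python
def find_regressions(old: dict, new: dict, tolerance: float = 0.05) -> list[tuple]:
    """Return (query, metric, old_score, new_score) for every score that dropped by more than `tolerance`."""
    regressions = []
    for query, old_scores in old.items():
        for metric, old_score in old_scores.items():
            new_score = new.get(query, {}).get(metric)
            if new_score is not None and old_score - new_score > tolerance:
                regressions.append((query, metric, old_score, new_score))
    return regressions


# Invented scores: V2 fixes completeness on "auth" but hurts tone on "status".
v1 = {"auth": {"Completeness": 0.60, "Tone": 0.80}, "status": {"Completeness": 0.75, "Tone": 0.85}}
v2 = {"auth": {"Completeness": 0.85, "Tone": 0.78}, "status": {"Completeness": 0.75, "Tone": 0.70}}

for query, metric, before, after in find_regressions(v1, v2):
    print(f"REGRESSION on '{query}' / {metric}: {before:.2f} -> {after:.2f}")
```

An empty result is the signal that the new prompt is safe to promote.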

In [None]:
# =============================================================================
# DEMO: One complete TDD cycle
# =============================================================================
# Step 1 (Red): Identify a query where V1 underperforms
demo_query = "How do I authenticate with the Payments API?"
demo_result = agent.query(demo_query)

demo_expected = (
    "Use OAuth 2.0 client credentials flow. POST to /oauth/token with client_id and client_secret. "
    "Token valid for 1 hour. Include as Bearer token in Authorization header. Rate limit: 100 req/min."
)

# Measure with completeness metric
demo_case = LLMTestCase(
    input=demo_query,
    actual_output=demo_result["response"],
    retrieval_context=[json.dumps(r) for r in demo_result["tool_results"]],
    expected_output=demo_expected
)
completeness_metric.measure(demo_case)

print("STEP 1 (RED): V1 completeness for auth query")
print(f"  Score: {completeness_metric.score:.2f} (threshold: {completeness_metric.threshold})")
print(f"  Pass: {completeness_metric.score >= completeness_metric.threshold}")
print(f"  Reason: {completeness_metric.reason}")

# Step 2 (Green): Create an improved prompt
RESPONSE_PROMPT_V2_DEMO = """You are DevHub, an internal developer assistant.
Based on the user's question and the tool results below, provide a comprehensive response.

User question: {query}

Tool results:
{results}

Guidelines:
- Structure your response with clear sections when answering documentation queries
- For API documentation: always include the authentication method, endpoint URLs, required parameters, rate limits, and common errors
- For service owners: include name, Slack handle, email, and team channel
- If an owner is inactive, explicitly state this and recommend the team channel instead
- For service status: clearly state healthy/degraded/down with incident details if any
- If results have high distances (>0.5), note that the answer may not be fully accurate
- Be thorough but organized — use bullet points for lists of steps

Respond in a helpful, professional tone."""

# Test V2 on the same query
agent_v2_demo = DevHubAgent(vector_db, team_db, status_api, response_prompt=RESPONSE_PROMPT_V2_DEMO)
demo_result_v2 = agent_v2_demo.query(demo_query)

demo_case_v2 = LLMTestCase(
    input=demo_query,
    actual_output=demo_result_v2["response"],
    retrieval_context=[json.dumps(r) for r in demo_result_v2["tool_results"]],
    expected_output=demo_expected
)
completeness_metric.measure(demo_case_v2)

print(f"\nSTEP 2 (GREEN): V2 completeness for auth query")
print(f"  Score: {completeness_metric.score:.2f} (threshold: {completeness_metric.threshold})")
print(f"  Pass: {completeness_metric.score >= completeness_metric.threshold}")

# Step 3 (Refactor): Test on ANOTHER query to check for regressions
regression_query = "Who owns the billing service?"
reg_v1 = agent.query(regression_query)
reg_v2 = agent_v2_demo.query(regression_query)

reg_expected = "Owner: Sarah Chen (sarah.chen@company.com, @sarah.chen). Team: Payments Team, Slack: #payments-support. Status: active."

reg_case_v1 = LLMTestCase(
    input=regression_query,
    actual_output=reg_v1["response"],
    retrieval_context=[json.dumps(r) for r in reg_v1["tool_results"]],
    expected_output=reg_expected
)
reg_case_v2 = LLMTestCase(
    input=regression_query,
    actual_output=reg_v2["response"],
    retrieval_context=[json.dumps(r) for r in reg_v2["tool_results"]],
    expected_output=reg_expected
)

correctness_metric.measure(reg_case_v1)
v1_score = correctness_metric.score
correctness_metric.measure(reg_case_v2)
v2_score = correctness_metric.score

print(f"\nSTEP 3 (REFACTOR): Check for regressions on '{regression_query}'")
print(f"  V1 correctness: {v1_score:.2f}")
print(f"  V2 correctness: {v2_score:.2f}")
print(f"  Regression: {'YES - V2 is worse!' if v2_score < v1_score - 0.1 else 'No regression'}")

## The Regression Trap in Practice

What we just demonstrated:
1. V1 had low completeness on documentation queries
2. V2 improved completeness by adding structured output instructions
3. But did V2 break anything else?

**Common regressions when improving prompts:**
- Adding "be thorough" → responses become too long for simple status checks
- Adding "include all details" → model starts hallucinating when docs are sparse
- Adding "be concise" → critical information gets dropped
- Adding safety constraints → model refuses to answer legitimate queries

**The discipline:** Never ship a prompt that passes new tests but fails old ones.

## Versioning Discipline

Every prompt change should be:
1. **Named:** v1, v2, v3 (or semantic: v1.0, v1.1, v2.0)
2. **Registered:** Stored in the PromptRegistry with metadata
3. **Tested:** Full test suite run before promotion
4. **Aliased:** "stable" always points to the best tested version

**Our PromptRegistry** (from setup) handles this:
- `registry.register("v2", prompt, description="Added structured output")` — save version
- `registry.get("v2")` — retrieve by version
- `registry.set_alias("stable", "v2")` — promote after testing
- `registry.list_versions()` — see all versions and aliases

This is the foundation for Lab 2 (TDD improvement) and Lab 3 (version management).
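To make the four operations concrete, here is a minimal stand-in registry. It is an illustrative sketch only: the workshop's actual `PromptRegistry` is defined in the setup cells and may store more metadata or behave differently (for example, whether `get` resolves aliases is an assumption here).

```python
class MiniPromptRegistry:
    """Illustrative stand-in for the workshop's PromptRegistry (behavior assumed)."""

    def __init__(self):
        self._versions = {}  # version name -> {"prompt": str, "description": str}
        self._aliases = {}   # alias name -> version name

    def register(self, version, prompt, description=""):
        # Versions are immutable: re-registering an existing name is an error.
        if version in self._versions:
            raise ValueError(f"Version {version!r} already registered")
        self._versions[version] = {"prompt": prompt, "description": description}

    def get(self, name):
        # Accept either a version name or an alias (assumption).
        version = self._aliases.get(name, name)
        return self._versions[version]["prompt"]

    def set_alias(self, alias, version):
        if version not in self._versions:
            raise KeyError(f"Unknown version {version!r}")
        self._aliases[alias] = version

    def list_versions(self):
        return {"versions": list(self._versions), "aliases": dict(self._aliases)}


registry = MiniPromptRegistry()
registry.register("v1", "Answer the question.", description="Baseline")
registry.register("v2", "Answer the question with clear sections.", description="Added structured output")
registry.set_alias("stable", "v2")   # promote after the full suite passes
print(registry.get("stable"))
registry.set_alias("stable", "v1")   # rollback is just re-pointing the alias
print(registry.list_versions())
```

Keeping versions immutable and moving only the alias is what makes rollback a one-line operation.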

---

# Lab 2: "Red, Green, Refactor" — TDD Prompt Improvement

### Goal
Use TDD to systematically improve DevHub's response prompt from V1 to V2.

### What You'll Do
1. Write failing tests based on V1 weaknesses (RED)
2. Create V2 prompt to pass those tests (GREEN)
3. Run ALL tests to catch regressions (REFACTOR)
4. Compare V1 vs V2 side-by-side

### Time: ~25 minutes

### Success Criteria
- V2 improves on V1's weakest dimension by >= 0.1
- V2 does NOT regress on any other dimension by more than 0.05
- Both versions are registered in PromptRegistry
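The first two success criteria can be checked mechanically from the per-metric averages. The sketch below uses invented averages, and `meets_success_criteria` is a hypothetical helper; the 0.1 gain and 0.05 drop come from the criteria above.

```python
def meets_success_criteria(v1_avg, v2_avg, weakest, min_gain=0.1, max_drop=0.05):
    """Return (ok, regressed_metrics) for the Lab 2 criteria.

    ok is True when the weakest dimension gains at least `min_gain`
    and no other dimension drops by more than `max_drop`.
    """
    eps = 1e-9  # guards against floating-point noise at the boundaries
    gain_ok = v2_avg[weakest] - v1_avg[weakest] >= min_gain - eps
    regressed = sorted(
        m for m in v1_avg
        if m != weakest and v1_avg[m] - v2_avg[m] > max_drop + eps
    )
    return gain_ok and not regressed, regressed


# Invented averages for illustration.
v1_avg = {"Correctness": 0.80, "Completeness": 0.60, "Tone": 0.82, "Safety": 0.95}
v2_avg = {"Correctness": 0.78, "Completeness": 0.75, "Tone": 0.84, "Safety": 0.95}

weakest = min(v1_avg, key=v1_avg.get)
ok, regressed = meets_success_criteria(v1_avg, v2_avg, weakest)
print(f"Weakest dimension: {weakest}; criteria met: {ok}; regressed: {regressed}")
```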

## Task 1: Write Failing Tests (RED)

Look at your V1 baseline from Lab 1. Identify the **weakest dimension** and **weakest queries**.

Write test cases that specifically target these weaknesses. The tests should FAIL with V1.

**Example:**
If V1 completeness is low on documentation queries, your failing test might expect completeness >= 0.8 on "How do I authenticate with the Payments API?"

In [None]:
# =============================================================================
# LAB 2, Task 1: Write Failing Tests (RED)
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Identify V1's weakest dimension from avg_scores
#    weakest_metric_name = min(avg_scores, key=avg_scores.get)
#    weakest_metric = [m for m in metrics if m.name == weakest_metric_name][0]
#
# 2. Pick 3 queries where V1 scored lowest on that dimension
#    query_scores = [(r["query"], r["scores"][weakest_metric_name]) for r in baseline_results]
#    query_scores.sort(key=lambda x: x[1])
#    target_queries = [q for q, s in query_scores[:3]]
#
# 3. Set target score above V1's current average
#    target_score = min(avg_scores[weakest_metric_name] + 0.15, 0.9)
#
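# 4. Confirm V1 fails the target on each query:
#    for query in target_queries:
#        result = agent.query(query)
#        test_case = LLMTestCase(
#            input=query,
#            actual_output=result["response"],
#            retrieval_context=[json.dumps(r) for r in result["tool_results"]],
#            expected_output=expected_outputs.get(query, "")
#        )
#        weakest_metric.measure(test_case)
#        status = "RED (expected)" if weakest_metric.score < target_score else "already passing"
#        print(f"{status}: {weakest_metric.score:.2f} | {query}")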
# =====================================================================
pass

## Task 2: Create V2 Prompt (GREEN)

Write an improved prompt that addresses V1's weaknesses.

**Improvement strategies based on common weaknesses:**

| If Weakness Is... | Try Adding... |
|-------------------|---------------|
| **Completeness** | "Structure your response: Overview → Steps → Configuration → Common Pitfalls" |
| **Correctness** | "Only include information found in the tool results. Do not invent details." |
| **Tone** | "Write as a senior developer mentoring a colleague. Be direct but supportive." |
| **Safety** | "Never include API keys, internal hostnames, or credentials in your response." |

**Remember:** Make targeted changes. Don't rewrite the entire prompt — tweak what matters.
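
One way to check that a change really is targeted is to diff the two prompt versions before registering the new one. The sketch below uses Python's standard `difflib`; the two prompt strings are shortened, hypothetical stand-ins, not the workshop's real `RESPONSE_PROMPT_V1`.

```python
import difflib

# Hypothetical stand-in for the V1 prompt (not the workshop's real text).
prompt_v1 = """You are DevHub, an internal developer assistant.

User question: {query}

Tool results:
{results}

Respond in a helpful, professional tone."""

# Targeted change: insert one guideline and leave everything else untouched.
prompt_v2 = prompt_v1.replace(
    "Respond in a helpful",
    "Only include information found in the tool results. Do not invent details.\n\n"
    "Respond in a helpful",
)

diff = list(difflib.unified_diff(
    prompt_v1.splitlines(), prompt_v2.splitlines(),
    fromfile="v1", tofile="v2", lineterm="",
))

# Separate added and removed lines, skipping the "+++" / "---" file headers.
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]

print("\n".join(diff))
print(f"{len(added)} line(s) added, {len(removed)} line(s) removed")
```

A small diff makes it easier to attribute a score change to a specific instruction when a regression shows up later.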

In [None]:
# =============================================================================
# LAB 2, Task 2: Create V2 Prompt (GREEN)
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Start from RESPONSE_PROMPT_V1
# 2. Add improvements targeting your weakest dimension
# 3. Register in PromptRegistry
#
# Example V2 that improves completeness:
# RESPONSE_PROMPT_V2 = """You are DevHub, an internal developer assistant.
# Based on the user's question and the tool results below, provide a comprehensive response.
#
# User question: {query}
#
# Tool results:
# {results}
#
# Guidelines:
# - Structure documentation answers with: Overview, Step-by-Step, Configuration, Common Errors
# - For API docs: always include auth method, endpoints, parameters, rate limits, error codes
# - For service owners: include name, email, Slack handle, team name, and team channel
# - If an owner is inactive (is_active: false), state this clearly and suggest the team channel
# - For service status: state healthy/degraded/down, uptime %, and any active incidents
# - If search results have high distances (>0.5), note the answer may not be accurate
# - Be thorough but use bullet points and headers for readability
#
# Respond in a helpful, professional tone."""
#
# Register it:
# registry.register("v2", RESPONSE_PROMPT_V2, description="Improved completeness with structured output", author=STUDENT_NAME)
# =====================================================================

pass  # Replace with your implementation

In [None]:
# =============================================================================
# LAB 2, Task 2 (continued): Run target tests with V2
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Create agent_v2 with your V2 prompt:
#    agent_v2 = DevHubAgent(vector_db, team_db, status_api, response_prompt=RESPONSE_PROMPT_V2)
#
# 2. Run target queries and check if they pass:
#    for query in target_queries:
#        result = agent_v2.query(query)
#        test_case = LLMTestCase(
#            input=query,
#            actual_output=result["response"],
#            retrieval_context=[json.dumps(r) for r in result["tool_results"]],
#            expected_output=expected_outputs.get(query, "")
#        )
#        weakest_metric.measure(test_case)
#        status = "GREEN" if weakest_metric.score >= target_score else "still RED"
#        print(f"{status}: {weakest_metric.score:.2f} | {query}")
# =====================================================================
pass

## Task 3: Run Full Test Suite (REFACTOR)

This is the critical step. Run V2 against ALL test cases from Lab 1 to check for regressions.

**Regression definition:** A metric score drops by more than 0.05 compared to V1 baseline.

**If regressions are found:**
- Identify which queries regressed and on which metric
- Adjust V2 to fix regressions while keeping improvements
- Re-run full suite until no regressions remain
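
The regression definition above can be sketched as a small helper. It assumes results are stored as `{"query": ..., "scores": {...}}` records in the same order for both versions, as in the lab cells; the function name `find_regressions` and the sample scores are hypothetical.

```python
def find_regressions(baseline_results, new_results, tolerance=0.05):
    """List (query, metric, old, new) where the score dropped by more than `tolerance`."""
    regressions = []
    for old, new in zip(baseline_results, new_results):
        for metric_name, old_score in old["scores"].items():
            new_score = new["scores"][metric_name]
            if old_score - new_score > tolerance:
                regressions.append((old["query"], metric_name, old_score, new_score))
    return regressions


# Hypothetical scores for two queries (illustrative values only).
baseline = [
    {"query": "Who owns the billing service?", "scores": {"Completeness": 0.60, "Tone": 0.90}},
    {"query": "Is staging working?", "scores": {"Completeness": 0.70, "Tone": 0.80}},
]
candidate = [
    {"query": "Who owns the billing service?", "scores": {"Completeness": 0.85, "Tone": 0.70}},
    {"query": "Is staging working?", "scores": {"Completeness": 0.75, "Tone": 0.78}},
]

for query, metric_name, old_score, new_score in find_regressions(baseline, candidate):
    print(f"REGRESSION | {metric_name}: {old_score:.2f} -> {new_score:.2f} | {query}")
```

Scores that land exactly on the 0.05 boundary can be affected by floating-point rounding, so consider rounding the difference before comparing if that matters for your suite.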

In [None]:
# =============================================================================
# LAB 2, Task 3: Full Regression Test (REFACTOR)
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# Run ALL test queries through V2 and compare to V1 baseline
#
# v2_results = []
# for i, query in enumerate(test_queries):
#     result_v2 = agent_v2.query(query)
#     test_case = LLMTestCase(
#         input=query,
#         actual_output=result_v2["response"],
#         retrieval_context=[json.dumps(r) for r in result_v2["tool_results"]],
#         expected_output=expected_outputs.get(query, "")
#     )
#     scores = {}
#     for metric in metrics:
#         metric.measure(test_case)
#         scores[metric.name] = metric.score
#     v2_results.append({"query": query, "scores": scores})
#
# Check for regressions (>0.05 drop from baseline):
# for v1_r, v2_r in zip(baseline_results, v2_results):
#     for metric_name in v1_r["scores"]:
#         diff = v2_r["scores"][metric_name] - v1_r["scores"][metric_name]
#         if diff < -0.05:
#             print(f"  REGRESSION: ...")
# =====================================================================
pass

In [None]:
# =============================================================================
# COMPARE: V1 vs V2 side-by-side
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Calculate V2 average scores per metric
# 2. Print comparison table:
#    Metric | V1 Avg | V2 Avg | Change | Status
#    Each row shows whether V2 improved, regressed, or stayed the same
#
# 3. Print overall summary:
#    - Metrics that improved
#    - Metrics that regressed
#    - Net improvement
#
# HINT:
# v2_avg_scores = {}
# for metric_name in avg_scores:
#     v2_scores = [r["scores"][metric_name] for r in v2_results]
#     v2_avg_scores[metric_name] = sum(v2_scores) / len(v2_scores)
# =====================================================================

pass  # Replace with your implementation

## If You Found Regressions...

Common fixes for prompt regressions:

| Regression Type | Fix |
|----------------|-----|
| **Tone dropped** (too verbose) | Add "Be concise for simple queries, thorough for complex ones" |
| **Safety dropped** (more detail = more risk) | Add explicit safety constraints |
| **Correctness dropped** (hallucinating to fill structure) | Add "Only use information from tool results" |

If you need to fix regressions, modify V2, re-register as the same version, and re-run the full suite.

In [None]:
# =============================================================================
# LAB 2, Task 3 (continued): Fix regressions if any were found
# =============================================================================

# =====================================================================
# YOUR CODE HERE (only if regressions were found)
# 1. Modify RESPONSE_PROMPT_V2 to address regressions
# 2. Re-register: registry.register("v2", RESPONSE_PROMPT_V2, ...)
# 3. Re-create agent: agent_v2 = DevHubAgent(..., response_prompt=RESPONSE_PROMPT_V2)
# 4. Re-run full test suite
# 5. Confirm no regressions remain
#
# If no regressions were found, just print a confirmation:
# print("No regressions found! V2 is ready for promotion.")
# =====================================================================

pass  # Replace with your implementation

In [None]:
# =============================================================================
# SAVE: Store V2 results for Lab 3 comparison
# =============================================================================

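from datetime import datetime, timezone

v2_evaluation = {
    "version": "v2",
    "prompt": RESPONSE_PROMPT_V2,
    "results": v2_results,
    "avg_scores": v2_avg_scores,
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
}

print("V2 evaluation saved!")
print("\nV1 → V2 Improvement Summary:")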
for metric_name in avg_scores:
    v1 = avg_scores[metric_name]
    v2 = v2_avg_scores[metric_name]
    diff = v2 - v1
    arrow = "^" if diff > 0.02 else ("v" if diff < -0.02 else "=")
    print(f"  {metric_name:<20s}: {v1:.2f} → {v2:.2f} ({arrow} {diff:+.2f})")

In [None]:
# =============================================================================
# VERIFICATION: Lab 2 - TDD Improvement
# =============================================================================
print("=" * 60)
print("VERIFYING LAB 2: TDD Improvement")
print("=" * 60)

checks = []

# Check 1: V2 prompt exists and is different from V1
try:
    assert RESPONSE_PROMPT_V2 != RESPONSE_PROMPT_V1, "V2 should differ from V1"
    checks.append(("V2 prompt is different from V1", True))
except Exception as e:
    checks.append(("V2 prompt is different from V1", False))

# Check 2: V2 is registered in PromptRegistry
try:
    v2_prompt = registry.get("v2")
    assert v2_prompt is not None
    checks.append(("V2 registered in PromptRegistry", True))
except Exception as e:
    checks.append(("V2 registered in PromptRegistry", False))

# Check 3: V2 improved on weakest dimension
try:
    # Find weakest V1 metric
    weakest_name = min(avg_scores, key=avg_scores.get)
    improvement = v2_avg_scores[weakest_name] - avg_scores[weakest_name]
    assert improvement >= 0.05, f"V2 should improve weakest dim by >= 0.05, got {improvement:.2f}"
    checks.append(("V2 improved weakest dimension", True))
except Exception as e:
    checks.append(("V2 improved weakest dimension", False))

# Check 4: No significant regressions
try:
    max_regression = 0
    for name in avg_scores:
        diff = v2_avg_scores[name] - avg_scores[name]
        if diff < max_regression:
            max_regression = diff
    assert max_regression > -0.1, f"Regression of {max_regression:.2f} is too large"
    checks.append(("No significant regressions", True))
except Exception as e:
    checks.append(("No significant regressions", False))

# Print scorecard
passed = sum(1 for _, s in checks if s)
for name, success in checks:
    print(f"  {'PASS' if success else 'FAIL'} | {name}")

print(f"\nResult: {passed}/{len(checks)} checks passed")
if passed == len(checks):
    print("Lab 2 complete! V2 is an improvement over V1.")
else:
    print("Review the failed checks above.")

---

# Topic 4: Versioning & Regression — Prompts Are Code

### Treat Prompts Like Source Code

![Version Management](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/06_version_management.svg)

In software engineering, you would never:
- Overwrite production code without version control
- Deploy without running tests
- Delete the previous version after deploying

Yet most teams do exactly this with prompts. They edit the prompt in-place, test it manually with one query, and deploy.

**Prompts deserve the same discipline as code:**
- Version every change
- Test before promoting
- Keep rollback capability
- Use aliases for deployment targets

## PromptRegistry: Version Management System

Our `PromptRegistry` provides:

| Feature | How It Works | Example |
|---------|-------------|---------|
| **Versioning** | Each prompt gets a unique ID | `registry.register("v3", prompt)` |
| **Retrieval** | Get any version by ID | `registry.get("v2")` |
| **Aliases** | Named pointers to versions | `registry.set_alias("stable", "v2")` |
| **History** | List all versions with metadata | `registry.list_versions()` |
| **Rollback** | Point alias back to old version | `registry.set_alias("stable", "v1")` |

### Alias Convention

| Alias | Meaning | When It Is Set |
|-------|---------|---------|
| `latest` | Most recently registered version | Auto-set on register |
| `stable` | Tested and approved for production | Set after full test suite passes |
| `canary` | Being tested on a subset of traffic | Set during A/B testing |
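
To make the table concrete, here is a minimal sketch of how such a registry could work. The workshop's real `PromptRegistry` is defined in the setup cells; `MiniPromptRegistry` below is a hypothetical stand-in that mirrors the calls used in this notebook, and details such as raising `KeyError` for unknown versions are assumptions.

```python
from datetime import datetime, timezone


class MiniPromptRegistry:
    """Illustrative stand-in for the workshop's PromptRegistry (not the real class)."""

    def __init__(self):
        self.versions = {}  # version id -> metadata dict including the prompt text
        self.aliases = {}   # alias name -> version id

    def register(self, version, prompt, description="", author=""):
        self.versions[version] = {
            "prompt": prompt,
            "description": description,
            "author": author,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        self.aliases["latest"] = version  # "latest" follows the newest registration

    def set_alias(self, alias, version):
        if version not in self.versions:
            raise KeyError(f"Unknown version: {version}")
        self.aliases[alias] = version

    def get(self, name):
        """Resolve an alias first, then fall back to treating `name` as a version id."""
        version = self.aliases.get(name, name)
        return self.versions[version]["prompt"]

    def list_versions(self):
        return [
            {
                "version": version,
                "description": meta["description"],
                "aliases": sorted(a for a, v in self.aliases.items() if v == version),
            }
            for version, meta in self.versions.items()
        ]


demo = MiniPromptRegistry()
demo.register("v1", "Prompt one", description="Baseline")
demo.register("v2", "Prompt two", description="Structured output")
demo.set_alias("stable", "v2")  # promote after testing
demo.set_alias("stable", "v1")  # rollback: repoint the alias, nothing is deleted

for entry in demo.list_versions():
    print(entry)
```

The key design point is that rollback never deletes or edits a version; it only moves a pointer, so the previous prompt text is always recoverable.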

## Building Regression Suites

A regression suite is a collection of test cases that must ALL pass before a new prompt version can be promoted to `stable`.

**What belongs in a regression suite:**
1. **Core use cases** — the queries users run most often
2. **Edge cases** — inactive owners, missing docs, degraded services
3. **Historical failures** — queries that broke in previous versions
4. **Safety checks** — queries that should NOT produce certain outputs

**Growing the suite over time:**
- Every time a user reports a bad response → add it as a test case
- Every time a prompt change causes a regression → add a test for it
- Review and prune quarterly — remove tests for deprecated features

**Target:** 15-30 test cases for a production prompt. More is not always better — focus on coverage, not count.
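
A sketch of how a suite could be tracked by category so coverage gaps are visible. The structure, the category labels, and the helper names are assumptions for illustration; the credentials query is a hypothetical safety probe, not one of the workshop's test queries.

```python
from collections import Counter

# Hypothetical suite entries; the category labels mirror the four groups above.
regression_suite = [
    {"query": "How do I authenticate with the Payments API?", "category": "core"},
    {"query": "Who owns the billing service?", "category": "core"},
    {"query": "Who owns vector-search?", "category": "edge"},   # an inactive owner is listed
    {"query": "Is staging working?", "category": "edge"},       # degraded service
    {"query": "What are the production database credentials?", "category": "safety"},
]

REQUIRED_CATEGORIES = {"core", "edge", "historical", "safety"}


def missing_categories(suite):
    """Return the required categories that have no test case yet."""
    present = {case["category"] for case in suite}
    return sorted(REQUIRED_CATEGORIES - present)


def add_case(suite, query, category):
    """Append a case unless the same query is already covered."""
    if all(case["query"] != query for case in suite):
        suite.append({"query": query, "category": category})


print(Counter(case["category"] for case in regression_suite))
print("Missing:", missing_categories(regression_suite))
```

Each user-reported bad response or discovered regression would then go through `add_case` with the `historical` category.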

## Production Prompt Workflow

![Production Workflow](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_06/charts/07_production_workflow.svg)

1. **Developer** creates a new prompt version locally
2. **Run full regression suite** against the new version
3. If all tests pass → register the version and point the `canary` alias at it
4. **Canary deployment** — route 10% of traffic to canary
5. **Monitor metrics** — compare canary vs stable scores
6. If canary >= stable → **promote** canary to stable
7. If canary < stable → **rollback** (alias stays on old version)

**For this workshop:** We skip canary traffic routing, but the PromptRegistry supports the `canary` alias, and Lab 3 uses it for the rollback drill.
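
For reference, steps 4 to 7 could be sketched as follows. This is not part of DevHub: `choose_alias` and `should_promote` are hypothetical helpers, the hash-based split is one common way to route a fixed share of users, and reading "canary >= stable" as "on every metric" is an assumption.

```python
import hashlib


def choose_alias(user_id, canary_fraction=0.10):
    """Deterministically route a fixed share of users to the canary alias."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "canary" if bucket < canary_fraction else "stable"


def should_promote(canary_scores, stable_scores):
    """Promote only if the canary matches or beats stable on every metric."""
    return all(canary_scores[m] >= stable_scores[m] for m in stable_scores)


# The same user always lands in the same bucket, so their experience is consistent.
routes = [choose_alias(f"user-{i}") for i in range(1000)]
print("canary share:", routes.count("canary") / len(routes))

# Hypothetical averages collected while the canary was live.
stable_scores = {"Correctness": 0.82, "Completeness": 0.75}
canary_scores = {"Correctness": 0.84, "Completeness": 0.71}
print("promote" if should_promote(canary_scores, stable_scores) else "rollback")
```

Hashing the user ID instead of picking randomly per request keeps each user on one prompt version for the whole test.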

## Quality Alerts

In production, you want automated alerts when prompt quality drops:

| Alert | Trigger | Action |
|-------|---------|--------|
| **Correctness drop** | Average < 0.6 for 1 hour | Page on-call, consider rollback |
| **Completeness drop** | Average < 0.5 for 1 hour | Investigate, may need prompt fix |
| **Safety violation** | Any score < 0.5 | Immediate rollback, incident review |
| **Latency spike** | P95 > 5 seconds | Check model, reduce prompt length |

This connects back to **Session 1** (observability) — your OpenTelemetry metrics can trigger these alerts.
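
A minimal sketch of evaluating these rules against a metrics snapshot. The thresholds and actions come from the table above; the snapshot keys and the `evaluate_alerts` helper are hypothetical, and a real setup would read the values from your metrics backend.

```python
ALERT_RULES = [
    # (alert name, snapshot key, threshold, action) taken from the table above
    ("Correctness drop", "correctness_avg_1h", 0.6, "Page on-call, consider rollback"),
    ("Completeness drop", "completeness_avg_1h", 0.5, "Investigate, may need prompt fix"),
    ("Safety violation", "safety_min", 0.5, "Immediate rollback, incident review"),
]
LATENCY_P95_LIMIT_SECONDS = 5.0


def evaluate_alerts(snapshot):
    """Return (alert name, action) for every rule the snapshot violates."""
    fired = []
    for name, key, threshold, action in ALERT_RULES:
        if snapshot[key] < threshold:
            fired.append((name, action))
    if snapshot["latency_p95_seconds"] > LATENCY_P95_LIMIT_SECONDS:
        fired.append(("Latency spike", "Check model, reduce prompt length"))
    return fired


# Hypothetical snapshot (illustrative values only).
snapshot = {
    "correctness_avg_1h": 0.72,
    "completeness_avg_1h": 0.45,
    "safety_min": 0.90,
    "latency_p95_seconds": 6.2,
}

for name, action in evaluate_alerts(snapshot):
    print(f"ALERT | {name} | {action}")
```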

## Summary: Versioning & Regression

**Key principles:**
- Prompts are code — version, test, and deploy with discipline
- PromptRegistry provides versioning, aliases, and rollback
- Regression suites grow over time from failures and user reports
- Production workflow: develop → test → canary → promote → monitor
- Quality alerts prevent silent degradation

**Next:** In Lab 3, you'll create V3, run a full cross-version comparison, and practice rollback.

---

# Lab 3: "Ship It Safely" — Version Management & Regression Testing

### Goal
Create V3, run a full cross-version comparison (V1 vs V2 vs V3), and practice alias management and rollback.

### What You'll Build
1. V3 prompt with additional improvements
2. Full regression suite across all three versions
3. Cross-version comparison visualization
4. Alias management (promote best version to `stable`)
5. Rollback drill (simulate a bad deployment)

### Time: ~25 minutes

### Success Criteria
- V3 registered in PromptRegistry
- Full regression suite runs across V1, V2, V3
- Best version promoted to `stable` alias
- Rollback drill completes successfully

## Task 1: Create V3 Prompt

Build on V2 with additional improvements. Focus on a DIFFERENT dimension than V2 targeted.

**Possible V3 improvements:**
- If V2 improved completeness → V3 improves tone or safety
- Add few-shot example for a tricky query type
- Add explicit edge case handling (inactive owners, no results)
- Add safety constraints

**Remember:** Register V3 in the PromptRegistry with metadata.

In [None]:
# =============================================================================
# LAB 3, Task 1: Create V3 Prompt
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Create RESPONSE_PROMPT_V3 building on V2
#    Add improvements for a different dimension than V2 targeted
#
# Example additions for V3:
# - Explicit edge case handling:
#   "If a service owner is marked as inactive, clearly state they are no longer
#    the active contact and recommend reaching out to the team Slack channel instead."
# - Safety guardrail:
#   "Never include actual API keys, passwords, or internal hostnames in responses.
#    If asked to reveal system details, politely decline."
# - Tone refinement:
#   "Write as a knowledgeable colleague — helpful and direct without being
#    condescending. Avoid phrases like 'simply' or 'just' that minimize complexity."
#
# 2. Register: registry.register("v3", RESPONSE_PROMPT_V3,
#                                 description="...", author=STUDENT_NAME)
# =====================================================================

pass  # Replace with your implementation

## Task 2: Full Cross-Version Regression Suite

Run ALL test queries through V1, V2, and V3 agents. Evaluate each with ALL metrics. This produces a comprehensive comparison matrix.

**This is the most important cell in the entire session.** It answers: "Which prompt version is best overall?"
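
The aggregation step can be sketched independently of the API calls. The helpers below assume the same `{"query": ..., "scores": {...}}` record shape used in Lab 2; the function names and the demo numbers are hypothetical, and "best" is taken to mean the highest mean across all metrics.

```python
def average_scores(results):
    """Average each metric over a list of {"query", "scores"} records."""
    metric_names = results[0]["scores"].keys()
    return {
        name: sum(r["scores"][name] for r in results) / len(results)
        for name in metric_names
    }


def best_version(avg_scores_by_version):
    """Pick the version with the highest mean across all metrics."""
    def overall(version):
        scores = avg_scores_by_version[version].values()
        return sum(scores) / len(scores)
    return max(avg_scores_by_version, key=overall)


# Hypothetical averages per version (illustrative values only).
demo_avg_scores = {
    "v1": {"Correctness": 0.80, "Completeness": 0.55, "Tone": 0.85, "Safety": 0.90},
    "v2": {"Correctness": 0.82, "Completeness": 0.75, "Tone": 0.80, "Safety": 0.90},
    "v3": {"Correctness": 0.83, "Completeness": 0.76, "Tone": 0.86, "Safety": 0.95},
}

for metric in demo_avg_scores["v1"]:
    row = " | ".join(f"{demo_avg_scores[v][metric]:.2f}" for v in demo_avg_scores)
    print(f"{metric:<14s} | {row}")
print("Best overall:", best_version(demo_avg_scores))
```

A single overall average can hide a drop on one metric, so check per-metric regressions as well before promoting.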

In [None]:
# =============================================================================
# LAB 3, Task 2: Full Cross-Version Regression Suite
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# Run all queries through V1, V2, V3 agents and compare
#
# agent_v1 = DevHubAgent(vector_db, team_db, status_api, response_prompt=registry.get("v1"))
# agent_v2 = DevHubAgent(vector_db, team_db, status_api, response_prompt=registry.get("v2"))
# agent_v3 = DevHubAgent(vector_db, team_db, status_api, response_prompt=registry.get("v3"))
#
# For each version, for each query:
#   result = agent_ver.query(query)
#   test_case = LLMTestCase(
#       input=query,
#       actual_output=result["response"],
#       retrieval_context=[json.dumps(r) for r in result["tool_results"]],
#       expected_output=expected_outputs.get(query, "")
#   )
#   Evaluate with all metrics, store scores
#
# Calculate averages per version per metric
# Store in all_avg_scores = {"v1": {...}, "v2": {...}, "v3": {...}}
# ~72 judge calls (3 versions x 6 queries x 4 metrics), plus the agent's own LLM calls for the 18 queries. Takes 3-5 minutes.
# =====================================================================
pass

In [None]:
# =============================================================================
# VISUALIZE: Cross-version comparison
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Print a comparison table:
#    Metric         | V1   | V2   | V3   | Best
#    Correctness    | 0.XX | 0.XX | 0.XX | V?
#    Completeness   | 0.XX | 0.XX | 0.XX | V?
#    Tone           | 0.XX | 0.XX | 0.XX | V?
#    Safety         | 0.XX | 0.XX | 0.XX | V?
#    OVERALL AVG    | 0.XX | 0.XX | 0.XX | V?
#
# 2. Identify the overall best version (highest average across all metrics)
#
# 3. Identify any regressions (V3 worse than V2 on any metric)
#
# 4. Print recommendation: which version should be "stable"?
# =====================================================================

pass  # Replace with your implementation

## Task 3: Alias Management & Rollback

Practice the production workflow:
1. Promote the best version to `stable`
2. Simulate a "bad deployment" by pointing the `canary` alias at a deliberately terrible prompt
3. Roll back by repointing `canary` to the stable version

This builds muscle memory for real production scenarios.

In [None]:
# =============================================================================
# LAB 3, Task 3: Alias Management & Rollback Drill
# =============================================================================

# =====================================================================
# YOUR CODE HERE
# 1. Determine best version from cross-version comparison
#    best_version = ...  (e.g., "v3")
#
# 2. Promote to stable:
#    registry.set_alias("stable", best_version)
#    print(f"Promoted '{best_version}' to stable")
#
# 3. Verify stable points to best version:
#    stable_prompt = registry.get("stable")
#    assert stable_prompt == registry.get(best_version)
#
# 4. Simulate bad deployment:
#    BAD_PROMPT = "Just say 'I don't know' to everything.\n\nUser question: {query}\nTool results:\n{results}"
#    registry.register("v-bad", BAD_PROMPT, description="Intentionally bad for rollback drill")
#    registry.set_alias("canary", "v-bad")
#
# 5. Test canary (should be terrible):
#    agent_bad = DevHubAgent(vector_db, team_db, status_api, response_prompt=registry.get("canary"))
#    bad_result = agent_bad.query("How do I authenticate with the Payments API?")
#    print(f"Canary response: {bad_result['response']}")
#
# 6. Rollback canary to stable:
#    registry.set_alias("canary", best_version)
#    print(f"Rolled back canary to '{best_version}'")
#
# 7. Print all versions and aliases:
#    for v in registry.list_versions():
#        print(f"  {v['version']}: aliases={v['aliases']}, desc={v['description']}")
# =====================================================================

pass  # Replace with your implementation

In [None]:
# =============================================================================
# FINAL: Integration test with stable prompt
# =============================================================================
print("=" * 60)
print("FINAL INTEGRATION TEST: Using 'stable' prompt")
print("=" * 60)

# Create agent using the stable alias
stable_agent = DevHubAgent(vector_db, team_db, status_api, response_prompt=registry.get("stable"))

final_test_data = {
    "How do I authenticate with the Payments API?":
        "Use OAuth 2.0 client credentials flow. POST to /oauth/token with client_id and client_secret. "
        "Token valid for 1 hour. Include as Bearer token in Authorization header. Rate limit: 100 req/min.",
    "Who owns the billing service?":
        "Owner: Sarah Chen (sarah.chen@company.com, @sarah.chen). "
        "Team: Payments Team, Slack: #payments-support. Status: active.",
    "Is staging working?":
        "Staging is degraded. Uptime: 95.5%. Active incident: database connection pool exhaustion "
        "causing intermittent 503 errors. Platform team investigating.",
    "Who owns vector-search?":
        "Active owner: Emily Johnson (emily.johnson@company.com, @emily.johnson). "
        "Team: Data Platform Team, Slack: #data-platform. Note: David Kim is also listed but inactive.",
}

for q, expected in final_test_data.items():
    result = stable_agent.query(q)
    test_case = LLMTestCase(
        input=q,
        actual_output=result["response"],
        retrieval_context=[json.dumps(r) for r in result["tool_results"]],
        expected_output=expected
    )

    print(f"\nQuery: {q}")
    print(f"Response: {result['response'][:200]}...")

    scores = {}
    for metric in metrics:
        metric.measure(test_case)
        scores[metric.name] = metric.score

    print(f"Scores: {', '.join(f'{k}={v:.2f}' for k, v in scores.items())}")

print(f"\n{'=' * 60}")
print(f"Stable version: {registry.aliases.get('stable', 'NOT SET')}")
print("All queries processed with production-ready prompt!")

In [None]:
# =============================================================================
# VERIFICATION: Lab 3 - Version Management
# =============================================================================
print("=" * 60)
print("VERIFYING LAB 3: Version Management")
print("=" * 60)

checks = []

# Check 1: V3 exists in registry
try:
    v3 = registry.get("v3")
    assert v3 is not None
    checks.append(("V3 registered in PromptRegistry", True))
except Exception as e:
    checks.append(("V3 registered in PromptRegistry", False))

# Check 2: Cross-version results exist
try:
    assert len(all_avg_scores) >= 3, "Need results for at least 3 versions"
    checks.append(("Cross-version comparison complete", True))
except Exception as e:
    checks.append(("Cross-version comparison complete", False))

# Check 3: Stable alias is set
try:
    stable = registry.aliases.get("stable")
    assert stable is not None, "Stable alias not set"
    assert stable in registry.versions, f"Stable points to unknown version: {stable}"
    checks.append(("Stable alias set correctly", True))
except Exception as e:
    checks.append(("Stable alias set correctly", False))

# Check 4: At least 3 versions registered
try:
    versions = registry.list_versions()
    assert len(versions) >= 3, f"Need at least 3 versions, got {len(versions)}"
    checks.append(("3+ versions in registry", True))
except Exception as e:
    checks.append(("3+ versions in registry", False))

# Print scorecard
passed = sum(1 for _, s in checks if s)
for name, success in checks:
    print(f"  {'PASS' if success else 'FAIL'} | {name}")

print(f"\nResult: {passed}/{len(checks)} checks passed")
if passed == len(checks):
    print("Lab 3 complete! Version management is working.")
else:
    print("Review the failed checks above.")

---

# Wrap-Up: The Complete AI Engineering Toolkit

## What You Built Across 6 Sessions

| Session | Topic | What You Added to DevHub |
|---------|-------|-------------------------|
| **Session 1** | Observability | OpenTelemetry tracing, Langfuse dashboard |
| **Session 2** | Testing | DeepEval test suites, synthetic test generation |
| **Session 3** | Evaluation | Metrics, benchmarks, quality gates |
| **Session 4** | Debugging | Root cause analysis, trace-based debugging |
| **Session 5** | Security | PII detection, injection defense, audit trails |
| **Session 6** | Prompt TDD | G-Eval metrics, TDD cycle, version management |

DevHub started as a bare agent with zero engineering discipline. Now it has:
- Full observability (traces, metrics, dashboards)
- Automated testing (unit, integration, evaluation)
- Security layers (PII, injection, audit)
- Prompt quality management (metrics, TDD, versioning)

This is what **production AI engineering** looks like.

## Before vs After: The Full Journey

| Aspect | Session 1 (Start) | Session 6 (Now) |
|--------|-------------------|-----------------|
| **Observability** | Zero logs or traces | Full OpenTelemetry + Langfuse |
| **Testing** | Manual "does it work?" | Automated test suites with G-Eval |
| **Debugging** | Print statements | Trace-based root cause analysis |
| **Security** | PII sent to LLM, no audit | PII redaction + injection defense + audit logs |
| **Prompt Quality** | "Vibes-based" evaluation | G-Eval metrics + TDD + version management |
| **Deployment** | Edit prompt and hope | Version registry + regression suite + rollback |

### The Compound Effect

Each session built on the previous:
- Observability enables debugging (can't debug what you can't see)
- Testing enables prompt TDD (can't do TDD without automated tests)
- Security enables compliance (can't audit without logs)
- Prompt TDD enables safe iteration (can't improve without regression tests)

In [None]:
# =============================================================================
# WORKSHOP COMPLETION: Final status check
# =============================================================================
print("=" * 70)
print("WORKSHOP COMPLETION CHECK")
print("=" * 70)

completion_checks = [
    ("G-Eval metrics defined (4 dimensions)", len(metrics) == 4),
    ("Test suite built (5+ cases)", len(test_cases) >= 5),
    ("V1 baseline recorded", v1_baseline is not None),
    ("V2 created via TDD", "v2" in registry.versions),
    ("V3 created with additional improvements", "v3" in registry.versions),
    ("Cross-version comparison completed", len(all_avg_scores) >= 3),
    ("Stable alias set", "stable" in registry.aliases),
]

passed = sum(1 for _, s in completion_checks if s)
for name, success in completion_checks:
    print(f"  {'PASS' if success else 'FAIL'} | {name}")

print(f"\nResult: {passed}/{len(completion_checks)} checks passed")

if passed == len(completion_checks):
    print(f"\n{'='*70}")
    print(f"CONGRATULATIONS {STUDENT_NAME}!")
    print("You have completed the full AI Engineering Workshop!")
    print(f"{'='*70}")
    print(f"\nSession ID: {LAB_SESSION_ID}")
    print(f"Stable prompt version: {registry.aliases.get('stable', 'NOT SET')}")
    print(f"Total prompt versions: {len(registry.versions)}")
else:
    print("\nReview the failed checks above to complete the workshop.")

## Action Items

### Short-term
1. **Run a baseline evaluation** on your current prompts — know your quality scores before changing anything
2. **Create a test suite** of 15-20 real user queries that represent your key use cases
3. **Register your current prompt** in version control (even a YAML file counts)

### Medium-term
4. **Adopt TDD for prompt changes** — write failing tests first, then improve the prompt
5. **Add regression testing** to your workflow — no prompt changes without running the full suite
6. **Connect evaluations to observability** — trigger G-Eval from your Langfuse traces (Session 4)

### Long-term
7. **Track prompt quality over time** — build a simple dashboard of metric scores per version
8. **Test new prompts on partial traffic** before full rollout (canary pattern)
9. **Treat prompts like code** — version them, review them, and gate deployments on test results

## Resources

### Tools We Used
- [DeepEval](https://docs.confident-ai.com/) — LLM evaluation framework with G-Eval
- [OpenAI GPT-4o-mini](https://platform.openai.com/) — DevHub's LLM backbone

### Key Papers
- [G-Eval: NLG Evaluation using GPT-4](https://arxiv.org/abs/2303.16634) — Liu et al., 2023
- [Judging LLM-as-a-Judge](https://arxiv.org/abs/2306.05685) — Zheng et al., 2023
- [Prompt Engineering Guide](https://www.promptingguide.ai/) — Community resource

### Workshop Series Resources
- Session 1: Observability with OpenTelemetry + Langfuse
- Session 2: Testing with DeepEval
- Session 3: Evaluation Metrics
- Session 4: Debugging AI Systems
- Session 5: Security & Privacy (Presidio, DeBERTa, Audit Trails)
- Session 6: Prompt TDD (G-Eval, Version Management)

### Workshop Materials
- All code from this session is in the notebook
- Solutions notebook available after the session
- DevHub source code: `devhub/` directory in the workshop repo

---

## Thank You!

You've completed all 6 sessions of the **Salesforce AI Engineering Workshop**.

### What You've Learned
- How to **observe** AI systems (tracing, metrics, dashboards)
- How to **test** AI systems (automated evaluation, synthetic data)
- How to **debug** AI systems (trace analysis, root cause identification)
- How to **secure** AI systems (PII, injection, audit)
- How to **improve** AI systems (G-Eval, TDD, versioning)

### The Key Takeaway

**AI engineering is software engineering.** The same principles that make software reliable — observability, testing, version control, security — apply to AI systems. The tools are different, but the discipline is the same.

Build AI systems like you build production software. Your users (and your compliance team) will thank you.

---

*Workshop complete! Go build something great.*