# Session 1: Observability in AI Applications

**Salesforce AI Workshop Series**

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Map AI system architecture** using the 5-layer diagnostic framework
2. **Instrument Python code** with OpenTelemetry tracing
3. **Debug production issues** using distributed traces in Jaeger
4. **Identify bottlenecks** by analyzing span timings and attributes
5. **Diagnose data quality issues** through trace attributes

## Prerequisites

- Basic Python knowledge (functions, classes, decorators)
- Familiarity with APIs and JSON
- No prior observability experience required

## Session Format

- **~2.5 hours hands-on**
- Instructor demos followed by your labs
- All code runs in this notebook
- Traces visible in shared Jaeger instance

---

## The Problem: "It's Slow Sometimes"

Picture this scenario...

Your team deployed an internal AI assistant called **DevHub**. It helps developers find documentation, locate service owners, and check system status.

**Monday morning**, your Slack explodes:

> "DevHub is super slow today" - @alex
> 
> "I asked who owns billing and got someone who left 6 months ago" - @sarah
> 
> "The answers seem... wrong? Not relevant?" - @mike

You check the logs:

```
INFO: Query received
INFO: Processing...
INFO: Response sent
```

**That's it.** No errors. No clues. Just "processing."

You have NO IDEA:
- Where the time is being spent
- Why some queries are slow and others fast
- Whether the data being returned is stale or incorrect
- Which component is causing the problem

**This session teaches you how to NEVER be in this situation again.**

---

## What We'll Build Today

1. **Understand** the 5-layer AI architecture framework
2. **Experience** DevHub V0 (the broken version with no observability)
3. **Learn** distributed tracing concepts (traces, spans, attributes)
4. **Instrument** DevHub with OpenTelemetry (Lab 1)
5. **Debug** three production scenarios using traces (Lab 2)

---

## Google Colab Setup

If you're running this in Google Colab:

1. **Runtime → Change runtime type → Python 3**
2. No GPU needed for this session
3. All data is loaded from this notebook (no external files needed)

Let's start by installing the required packages...

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================
# Run this cell first! It installs all dependencies needed for this session.
# This may take 1-2 minutes on first run.

# Version specifiers are quoted so the shell does not treat ">" as a redirect.
!pip install -q \
    "chromadb>=0.4.0" \
    "openai>=1.0.0" \
    "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-exporter-otlp-proto-grpc>=1.20.0" \
    "grpcio>=1.50.0" \
    "rich>=13.0.0"

print("All packages installed successfully!")

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================
# These credentials connect you to the shared workshop infrastructure.
# DO NOT CHANGE unless instructed by your instructor.

import os

# -----------------------------------------------------------------------------
# Jaeger Configuration (Distributed Tracing)
# -----------------------------------------------------------------------------
# Jaeger collects and visualizes traces from your application
JAEGER_ENDPOINT = "http://46.224.233.5:4317"  # OTLP gRPC endpoint for sending traces
JAEGER_UI = "https://46.224.233.5/jaeger"     # Web UI for viewing traces

# -----------------------------------------------------------------------------
# OpenAI Configuration
# -----------------------------------------------------------------------------
# Your instructor will provide this key
OPENAI_API_KEY = "sk-..."  # INSTRUCTOR: Fill this before class
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# -----------------------------------------------------------------------------
# Student Identity
# -----------------------------------------------------------------------------
# CHANGE THIS to your name (lowercase, no spaces)
# This helps identify your traces in the shared Jaeger instance
STUDENT_NAME = "your-name-here"  # Example: "john-smith"

# Validate student name
if STUDENT_NAME == "your-name-here" or " " in STUDENT_NAME or STUDENT_NAME != STUDENT_NAME.lower():
    print("ERROR: Please set STUDENT_NAME to your name (lowercase, no spaces)")
    print("   Example: STUDENT_NAME = 'john-smith'")
else:
    print(f"Student identity set: {STUDENT_NAME}")
    print(f"   Your traces will appear as: devhub-{STUDENT_NAME}")

In [None]:
# =============================================================================
# TEST JAEGER CONNECTION
# =============================================================================
# This sends a test trace to verify Jaeger is reachable.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Create a test tracer
test_resource = Resource.create({"service.name": f"connection-test-{STUDENT_NAME}"})
test_provider = TracerProvider(resource=test_resource)
test_exporter = OTLPSpanExporter(endpoint=JAEGER_ENDPOINT, insecure=True)
test_provider.add_span_processor(BatchSpanProcessor(test_exporter))

test_tracer = test_provider.get_tracer("connection-test")

# Send a test span
try:
    with test_tracer.start_as_current_span("connection-test-span") as span:
        span.set_attribute("student.name", STUDENT_NAME)
        span.set_attribute("test.message", "Hello from Colab!")
    
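    # Force flush to ensure span is sent.
    # Note: the SDK logs export failures instead of raising them, so also
    # confirm the test trace shows up in the Jaeger UI.
    test_provider.force_flush()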
    
    print("Jaeger connection successful!")
    print(f"   View your test trace at: {JAEGER_UI}")
    print(f"   Search for service: connection-test-{STUDENT_NAME}")
except Exception as e:
    print(f"Jaeger connection failed: {e}")
    print("   Check that JAEGER_ENDPOINT is correct")
    print("   Ask your instructor for help")

In [None]:
# =============================================================================
# TEST OPENAI CONNECTION
# =============================================================================
# This makes a simple API call to verify OpenAI is reachable.

from openai import OpenAI

try:
    client = OpenAI()
    
    # Simple test call
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say 'Hello Workshop!' in exactly 2 words"}],
        max_tokens=10
    )
    
    result = response.choices[0].message.content
    print("OpenAI connection successful!")
    print(f"   Response: {result}")

except Exception as e:
    print(f"OpenAI connection failed: {e}")
    print("   Check that OPENAI_API_KEY is set correctly")
    print("   Ask your instructor for help")

---

## Setup Complete!

If you see success messages above for:
- Packages installed
- Student identity set
- Jaeger connection
- OpenAI connection

**You're ready to begin!**

If any step failed, raise your hand or message in the workshop chat.

---

**Next:** We'll learn the 5-layer AI architecture framework that helps us understand WHERE problems can occur.

---

# Topic 1: Understanding AI System Architecture

Before we can debug problems, we need to understand **where** problems can occur.

Most AI applications, whether ChatGPT, a RAG system, or an AI agent, follow a similar architectural pattern. Understanding this pattern is the first step to effective debugging.

## Why Architecture Matters for Debugging

When something goes wrong, you need to ask: **"Which layer is causing this?"**

Without a mental model of your system's architecture:
- You're guessing randomly
- You check the wrong things first
- You waste hours on red herrings

With a clear architecture framework:
- You systematically narrow down the problem
- You know which metrics/logs to check for each layer
- You find root causes in minutes, not hours

**The 5-Layer Framework** gives you this mental model for ANY AI application.

## The 5-Layer AI Architecture Framework

```mermaid
flowchart TB
    subgraph L1["LAYER 1: APPLICATION"]
        A1["User Interface"]
        A2["CLI, Web UI, API, Chatbot"]
    end
    
    subgraph L2["LAYER 2: GATEWAY"]
        B1["Validation & Auth"]
        B2["Rate limiting, Input validation"]
    end
    
    subgraph L3["LAYER 3: ORCHESTRATION"]
        C1["Agent / Router"]
        C2["Tool selection, Multi-step reasoning"]
    end
    
    subgraph L4["LAYER 4: LLM"]
        D1["Language Model"]
        D2["OpenAI, Claude, Local models"]
    end
    
    subgraph L5["LAYER 5: DATA"]
        E1["Data Sources"]
        E2["VectorDB, SQL, APIs, Files"]
    end
    
    L1 --> L2 --> L3 --> L4
    L3 --> L5
    L4 --> L3
    
    style L1 fill:#e1f5fe
    style L2 fill:#fff3e0
    style L3 fill:#f3e5f5
    style L4 fill:#e8f5e9
    style L5 fill:#fce4ec
```

Each layer has **different failure modes** and **different debugging approaches**.

Let's examine each layer...

![Five Layer Architecture](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/01_five_layer_architecture.svg)


## Layer 1: Application Layer

**What it does:** The user-facing interface - how users interact with your AI system.

**Examples:**
- Command-line interface (CLI)
- Web application (React, Flask)
- REST API endpoints
- Slack/Teams bot
- Mobile app

**Typical failures at this layer:**
| Symptom | Possible Cause |
|---------|----------------|
| No response at all | Server not running, network issues |
| Slow initial response | Cold start, connection pooling issues |
| Formatting errors | Response parsing bugs |
| Session issues | State management problems |

**What to check:**
- Server logs (is it even receiving requests?)
- Network connectivity
- Response serialization

## Layer 2: Gateway Layer

**What it does:** Validates, authenticates, and rate-limits incoming requests before they reach core logic.

**Examples:**
- Input validation (query length, allowed characters)
- Authentication (API keys, JWT tokens)
- Rate limiting (requests per minute)
- Request routing

**Typical failures at this layer:**
| Symptom | Possible Cause |
|---------|----------------|
| 401/403 errors | Auth misconfiguration |
| 429 errors | Rate limit exceeded |
| "Invalid input" errors | Validation too strict |
| Requests rejected silently | Middleware misconfiguration |

**What to check:**
- Auth token validity
- Rate limit counters
- Validation rules
- Middleware order
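
The two most common gateway checks, input validation and rate limiting, can be sketched in a few lines. This is a minimal illustration, not DevHub's gateway: the names `validate_query` and `RateLimiter` and the limits (500 characters, 3 requests per 60 seconds) are assumptions chosen for the example.

```python
import time
from collections import deque
from typing import Optional


def validate_query(query: str, max_length: int = 500) -> str:
    """Return the cleaned query or raise ValueError (hypothetical rules)."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query is empty")
    if len(cleaned) > max_length:
        raise ValueError(f"query exceeds {max_length} characters")
    return cleaned


class RateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window_s` seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._timestamps = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_s:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.limit:
            return False  # a real gateway would answer with HTTP 429 here
        self._timestamps.append(now)
        return True


limiter = RateLimiter(limit=3, window_s=60.0)
# Explicit timestamps keep the example deterministic.
decisions = [limiter.allow(now=t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(decisions)  # [True, True, True, False, True]
```

The fourth request is rejected because three requests already sit inside the window. The fifth is accepted because the two oldest have expired by then. When a gateway rejects requests unexpectedly, counters like `_timestamps` are the first thing to inspect.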

## Layer 3: Orchestration Layer

**What it does:** Decides WHAT to do with a request - which tools to call, in what order, how to combine results.

**Examples:**
- AI Agents (LangChain, AutoGPT)
- Tool/function routers
- Multi-step pipelines
- ReAct loops

**Typical failures at this layer:**
| Symptom | Possible Cause |
|---------|----------------|
| Wrong tool called | Poor tool descriptions, ambiguous query |
| Infinite loops | Missing stop conditions |
| Partial answers | Tool results not combined properly |
| Inconsistent behavior | Non-deterministic routing |

**What to check:**
- Which tools were selected (and why)
- Tool execution order
- How results were combined
- Agent reasoning steps

**This is often the hardest layer to debug** - decisions are made by AI, not explicit code.
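
One common guard against the "infinite loops" failure above is an explicit step budget. The sketch below assumes a hypothetical `next_action` callable that stands in for the AI planner. It is not how DevHub or any particular agent framework implements its loop.

```python
from typing import Callable, List, Optional


def run_agent_loop(
    next_action: Callable[[List[str]], Optional[str]],
    max_steps: int = 5,
) -> dict:
    """Call `next_action` until it returns None or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:  # the planner signals that it is done
            return {"steps": history, "stopped": "finished"}
        history.append(action)
    # The budget was exhausted: report it instead of looping forever.
    return {"steps": history, "stopped": "max_steps"}


# A hypothetical planner that never stops asking for the same tool.
looping = run_agent_loop(lambda history: "search_docs", max_steps=3)

# A hypothetical planner that finishes after two tool calls.
plan = ["search_docs", "find_owner"]
finishing = run_agent_loop(
    lambda history: plan[len(history)] if len(history) < len(plan) else None
)

print(looping["stopped"], len(looping["steps"]))    # max_steps 3
print(finishing["stopped"], finishing["steps"])     # finished ['search_docs', 'find_owner']
```

Recording why the loop stopped (`finished` versus `max_steps`) gives you a concrete signal to look for when an agent returns partial answers.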

## Layer 4: LLM Layer

**What it does:** The "brain" - generates text, makes decisions, synthesizes information.

**Examples:**
- OpenAI GPT-4
- Anthropic Claude
- Local models (Llama, Mistral)
- Embedding models

**Typical failures at this layer:**
| Symptom | Possible Cause |
|---------|----------------|
| Slow responses | Model overloaded, high token count |
| Hallucinations | Insufficient context, wrong model |
| Inconsistent outputs | Temperature too high |
| Token limit errors | Context too long |
| API errors | Rate limits, outages |

**What to check:**
- Token counts (input/output)
- Model latency
- Prompt content
- Temperature/sampling settings
- API error responses
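
Token counts and latency are the two numbers worth capturing on every LLM call. The sketch below shows one way to record them. `LLMCallRecord`, `record_llm_call`, and the dict shape returned by `fake_llm_call` are assumptions for illustration. A real client returns its own response object, and you would copy the usage fields from it.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LLMCallRecord:
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def record_llm_call(model: str, call: Callable[[], dict]) -> LLMCallRecord:
    """Time `call` and copy its token usage into a record.

    `call` must return a dict with "prompt_tokens" and "completion_tokens"
    (an assumed shape for this sketch, not a real client response).
    """
    start = time.perf_counter()
    usage = call()
    latency_ms = (time.perf_counter() - start) * 1000
    return LLMCallRecord(
        model, usage["prompt_tokens"], usage["completion_tokens"], latency_ms
    )


def fake_llm_call() -> dict:
    time.sleep(0.01)  # stand-in for network and generation time
    return {"prompt_tokens": 120, "completion_tokens": 30}


record = record_llm_call("gpt-4o-mini", fake_llm_call)
print(record.total_tokens, f"{record.latency_ms:.0f} ms")
```

With records like this you can tell a slow model apart from a prompt that simply grew too long.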

## Layer 5: Data Layer

**What it does:** Stores and retrieves information - documents, vectors, structured data, external APIs.

**Examples:**
- Vector databases (ChromaDB, Pinecone, Weaviate)
- SQL/NoSQL databases
- External APIs
- File systems
- Caches (Redis)

**Typical failures at this layer:**
| Symptom | Possible Cause |
|---------|----------------|
| Slow queries | Missing indexes, large scans |
| Wrong results | Stale data, poor embeddings |
| "Not found" errors | Data not indexed, wrong collection |
| Connection errors | Database down, network issues |
| Inconsistent data | Race conditions, no transactions |

**What to check:**
- Query latency
- Result relevance scores (for vector search)
- Data freshness
- Connection pool status
- Index health
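
For vector search, "result relevance" usually means looking at the distances returned with each hit. The sketch below assumes that a smaller distance means a closer match and uses an illustrative cutoff of 0.5. Both the helper name `assess_retrieval` and the cutoff are assumptions to tune for your own embeddings.

```python
from typing import List


def assess_retrieval(distances: List[float], max_distance: float = 0.5) -> dict:
    """Summarize vector-search results, where a smaller distance is a closer match."""
    if not distances:
        return {"count": 0, "best_distance": None, "low_relevance": True}
    best = min(distances)
    return {
        "count": len(distances),
        "best_distance": best,
        # Flag the whole result set when even the best hit is far away.
        "low_relevance": best > max_distance,
    }


good = assess_retrieval([0.21, 0.35, 0.48])
poor = assess_retrieval([0.71, 0.85, 0.98])
print(good["low_relevance"], poor["low_relevance"])  # False True
```

A query that returns documents is not necessarily a query that returned the right documents, so a flag like `low_relevance` is worth surfacing next to the results.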

## DevHub: Mapped to 5 Layers

Now let's see how our workshop application **DevHub** maps to this framework:

```mermaid
flowchart TB
    subgraph L1["LAYER 1: APPLICATION"]
        A1["Notebook Interface"]
    end
    
    subgraph L2["LAYER 2: GATEWAY"]
        B1["Input Validation"]
    end
    
    subgraph L3["LAYER 3: ORCHESTRATION"]
        C1["DevHubAgent"]
        C2["Tools: search_docs | find_owner | check_status"]
    end
    
    subgraph L4["LAYER 4: LLM"]
        D1["OpenAI GPT-4o-mini"]
    end
    
    subgraph L5["LAYER 5: DATA"]
        E1["VectorDB<br/>(ChromaDB)<br/>8 docs"]
        E2["TeamDB<br/>(In-memory)<br/>5 owners"]
        E3["StatusAPI<br/>(Mock)<br/>5 services"]
    end
    
    L1 --> L2 --> L3
    C1 --> D1
    C1 --> E1
    C1 --> E2
    C1 --> E3
    D1 --> C1
    
    style L1 fill:#e1f5fe
    style L2 fill:#fff3e0
    style L3 fill:#f3e5f5
    style L4 fill:#e8f5e9
    style L5 fill:#fce4ec
```

**Where are DevHub's problems?**
- VectorDB (Layer 5): Slow queries, connection failures, low similarity
- TeamDB (Layer 5): Stale data (inactive owners)
- StatusAPI (Layer 5): Timeouts

Most of DevHub's intentional problems are in **Layer 5 (Data)** - but without tracing, you wouldn't know that!

## Key Insight: Layer-Based Debugging

**Each layer has different:**
- Failure modes
- Symptoms
- Debugging tools
- Metrics to monitor

When something goes wrong:

1. **Identify which layer** is causing the issue
2. **Use layer-appropriate tools** to investigate
3. **Fix at the right level** (don't patch symptoms)
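
Step 1 can be sketched as a simple lookup from symptom to candidate layers. The keyword lists below are a small, hand-picked subset of the symptom tables above, and the helper name `candidate_layers` is hypothetical. Treat it as a thinking aid, not a diagnostic tool.

```python
# Keywords drawn from the per-layer symptom tables (illustrative subset).
LAYER_SYMPTOMS = {
    "application": ["no response", "formatting error"],
    "gateway": ["401", "403", "429", "invalid input"],
    "orchestration": ["wrong tool", "infinite loop", "partial answer"],
    "llm": ["hallucination", "token limit"],
    "data": ["stale data", "not found", "slow query"],
}


def candidate_layers(symptom: str) -> list:
    """Return the layers whose known symptoms appear in the description."""
    text = symptom.lower()
    return [
        layer
        for layer, keywords in LAYER_SYMPTOMS.items()
        if any(keyword in text for keyword in keywords)
    ]


print(candidate_layers("Got a 429 from the API"))            # ['gateway']
print(candidate_layers("Owner lookup returned stale data"))  # ['data']
print(candidate_layers("It's slow sometimes"))               # []
```

The empty result for "It's slow sometimes" is the point: a vague symptom does not identify a layer, which is why you need per-layer evidence.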

**Coming up:** We'll experience DevHub's problems firsthand, then learn how distributed tracing helps us identify WHICH layer is failing.

---

# Topic 2: DevHub - Our Workshop Application

Now let's meet the application we'll be debugging throughout this workshop.

**DevHub** is an internal developer knowledge assistant. It helps developers:
- Find documentation
- Locate service owners
- Check system status

But it has problems... problems you'll learn to diagnose.

## What DevHub Does

DevHub answers three types of questions:

### 1. Documentation Search
> "How do I authenticate with the Payments API?"

Uses **VectorDB** (ChromaDB) to find relevant documentation through semantic search.

### 2. Owner Lookup
> "Who owns the billing service?"

Uses **TeamDB** (In-memory) to find the team and person responsible for a service.

### 3. Status Check
> "Is staging working?"

Uses **StatusAPI** to check if services are healthy, degraded, or down.

### Multi-Tool Queries
> "How do I use Auth SDK and who can help?"

The agent can call **multiple tools** to answer complex questions.
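
For a multi-tool question, the planner's reply is a JSON array with one entry per tool call. The sketch below parses such a reply defensively. The sample `raw_plan` is an illustrative guess at what the LLM might return for the question above, and `parse_plan` is a hypothetical helper, not part of DevHub.

```python
import json

# Tool names used by DevHub's agent.
KNOWN_TOOLS = ("search_docs", "find_owner", "check_status")


def parse_plan(raw: str) -> list:
    """Parse a planner reply into a list of {"tool", "args"} steps.

    Unknown tools and malformed entries are dropped rather than raising.
    """
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(plan, list):
        return []
    steps = []
    for item in plan:
        if isinstance(item, dict) and item.get("tool") in KNOWN_TOOLS:
            steps.append({"tool": item["tool"], "args": item.get("args", {})})
    return steps


# Illustrative planner output for a docs-plus-owner question.
raw_plan = """[
  {"tool": "search_docs", "args": {"query": "Auth SDK usage"}},
  {"tool": "find_owner", "args": {"service": "auth-sdk"}}
]"""

print([step["tool"] for step in parse_plan(raw_plan)])  # ['search_docs', 'find_owner']
```

Dropping malformed entries silently keeps the agent running, but it also hides planning mistakes. That trade-off is one reason the plan itself is worth recording.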

![Devhub Request Flow](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/02_devhub_request_flow.svg)


## DevHub Request Flow

```mermaid
flowchart LR
    A["User Query"] --> B["Agent Receives"]
    B --> C["Tool Planning<br/>(LLM)"]
    C --> D["Tool Execution"]
    D --> E["VectorDB.search()"]
    D --> F["TeamDB.find_owner()"]
    D --> G["StatusAPI.check()"]
    E --> H["Response Synthesis<br/>(LLM)"]
    F --> H
    G --> H
    H --> I["User Answer"]
    
    style C fill:#e8f5e9
    style H fill:#e8f5e9
    style E fill:#fce4ec
    style F fill:#fce4ec
    style G fill:#fce4ec
```

The agent:
1. **Receives** the user's question
2. **Plans** which tools to call (using LLM)
3. **Executes** each tool
4. **Synthesizes** results into a coherent answer (using LLM)

**Problem:** Without tracing, you can't see steps 2, 3, or 4!
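
To see why per-step visibility matters, here is a minimal sketch that times each step with plain timers. It uses `time.sleep` as a stand-in for the real work, and the `timed` helper and `timings` dict are hypothetical names. This is hand-rolled timing, not distributed tracing, but it shows the kind of breakdown a trace gives you.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> duration in milliseconds


@contextmanager
def timed(step: str):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = (time.perf_counter() - start) * 1000


# Stand-ins for the agent's steps; the sleeps are arbitrary.
with timed("plan"):
    time.sleep(0.005)
with timed("execute"):
    time.sleep(0.05)
with timed("synthesize"):
    time.sleep(0.005)

slowest = max(timings, key=timings.get)
for step, ms in timings.items():
    print(f"{step:<12}{ms:6.1f} ms")
print("slowest step:", slowest)
```

A single total latency would hide the fact that `execute` dominates here. Tracing generalizes this idea and adds nesting, attributes, and a UI.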

In [None]:
# =============================================================================
# DEVHUB V0 - THE UNINSTRUMENTED VERSION
# =============================================================================
# This is the "broken" version of DevHub - it works, but has intentional
# problems and NO observability. You'll experience the pain of debugging
# without tracing, then fix it in Lab 1.
#
# INTENTIONAL PROBLEMS (you'll discover these):
# - 10% of VectorDB queries are slow (3 seconds)
# - 5% of VectorDB queries fail (connection error)
# - 15% of VectorDB results have low similarity (bad retrieval)
# - 10% of TeamDB lookups return stale data (an inactive owner, when one exists)
# - 2% of StatusAPI calls time out
#
# DO NOT MODIFY THIS CELL - you'll create an instrumented version in Lab 1

import json
import random
import time
from pathlib import Path

import chromadb
from chromadb.config import Settings
from openai import OpenAI


# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
class Config:
    """Configuration with intentional failure rates for workshop scenarios."""

    # Failure rates
    VECTOR_DB_FAILURE_RATE = 0.05        # 5% connection failures
    VECTOR_DB_SLOW_QUERY_RATE = 0.10     # 10% slow queries
    VECTOR_DB_LOW_SIMILARITY_RATE = 0.15 # 15% bad retrieval
    TEAM_DB_STALE_DATA_RATE = 0.10       # 10% stale contacts
    STATUS_API_TIMEOUT_RATE = 0.02       # 2% timeouts

    # Latency settings (ms)
    VECTOR_DB_LATENCY_MIN = 50
    VECTOR_DB_LATENCY_MAX = 200
    VECTOR_DB_SLOW_QUERY_LATENCY = 3000

    # LLM settings
    LLM_MODEL = "gpt-4o-mini"


# -----------------------------------------------------------------------------
# Data (embedded for Colab compatibility)
# -----------------------------------------------------------------------------
DOCS_DATA = [
    {"id": "doc-payments-auth", "title": "Payments API Authentication", "category": "api",
     "content": "To authenticate with the Payments API, use OAuth 2.0 client credentials flow. First, obtain your client_id and client_secret from the Developer Portal. Make a POST request to /oauth/token with grant_type=client_credentials. The response contains an access_token valid for 1 hour. Include this token in the Authorization header as 'Bearer {token}' for all subsequent requests."},
    {"id": "doc-auth-sdk", "title": "Auth SDK Quick Start", "category": "sdk",
     "content": "Install the Auth SDK with 'pip install company-auth-sdk'. Initialize with AuthClient(client_id, client_secret). Call client.authenticate() to get a session. The SDK handles token refresh automatically. For service-to-service auth, use ServiceAuth class instead."},
    {"id": "doc-billing-service", "title": "Billing Service Overview", "category": "service",
     "content": "The Billing Service handles subscription management, invoicing, and payment processing. REST APIs: POST /v1/subscriptions (create), GET /v1/subscriptions/{id} (read), POST /v1/invoices (generate). For access requests, contact the Billing team."},
    {"id": "doc-vector-search", "title": "Vector Search Best Practices", "category": "guide",
     "content": "When using Vector Search: 1) Use embedding dimension 1536 for OpenAI compatibility. 2) Batch inserts for bulk data (max 100 vectors/call). 3) Set top_k between 3-5 for most use cases. 4) Monitor similarity scores - below 0.7 indicates poor matches."},
    {"id": "doc-staging-env", "title": "Staging Environment Guide", "category": "environment",
     "content": "Staging environment mirrors production at staging.internal.company.com. Access requires VPN connection. Data is refreshed weekly from anonymized production data. Known limitations: Payments API uses sandbox mode only."},
    {"id": "doc-error-handling", "title": "Error Handling Standards", "category": "standards",
     "content": "All APIs must return standard error format: {error: {code, message, details, correlation_id}}. HTTP codes: 400 bad input, 401 auth failure, 403 forbidden, 404 not found, 429 rate limited, 500 server error."},
    {"id": "doc-rate-limiting", "title": "Rate Limiting Configuration", "category": "api",
     "content": "Default rate limits: 100 requests/minute authenticated, 10 requests/minute unauthenticated. Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset."},
    {"id": "doc-db-connection-pool", "title": "Database Connection Pooling", "category": "guide",
     "content": "Use connection pooling for all database access. Recommended: min_pool_size=5, max_pool_size=20, connection_timeout=30s. If seeing 'connection pool exhausted' errors: check for connection leaks."},
]

TEAMS_DATA = {
    "teams": [
        {"id": "team-payments", "name": "Payments Team", "slack_channel": "#payments-support"},
        {"id": "team-platform", "name": "Platform Team", "slack_channel": "#platform-help"},
        {"id": "team-auth", "name": "Auth Team", "slack_channel": "#auth-support"},
        {"id": "team-data", "name": "Data Platform Team", "slack_channel": "#data-platform"},
    ],
    "owners": [
        {"id": "owner-sarah", "name": "Sarah Chen", "email": "sarah.chen@company.com", "slack": "@sarah.chen", "team_id": "team-payments", "services": ["payments-api", "billing-service", "billing"], "is_active": True},
        {"id": "owner-david", "name": "David Kim", "email": "david.kim@company.com", "slack": "@david.kim", "team_id": "team-data", "services": ["vector-search", "embeddings"], "is_active": False},  # LEFT COMPANY
        {"id": "owner-emily", "name": "Emily Johnson", "email": "emily.johnson@company.com", "slack": "@emily.j", "team_id": "team-data", "services": ["vector-search", "embeddings", "data-pipeline"], "is_active": True},
        {"id": "owner-michael", "name": "Michael Brown", "email": "michael.brown@company.com", "slack": "@mbrown", "team_id": "team-auth", "services": ["auth-service", "auth-sdk"], "is_active": True},
        {"id": "owner-lisa", "name": "Lisa Wang", "email": "lisa.wang@company.com", "slack": "@lisa.wang", "team_id": "team-platform", "services": ["staging", "api-gateway"], "is_active": True},
    ]
}

STATUS_DATA = {
    "services": [
        {"name": "payments-api", "status": "healthy", "uptime": 99.95},
        {"name": "auth-service", "status": "healthy", "uptime": 99.99},
        {"name": "staging", "status": "degraded", "uptime": 95.5, "incident": "Database connection pool exhaustion"},
        {"name": "vector-search", "status": "healthy", "uptime": 99.8},
        {"name": "api-gateway", "status": "healthy", "uptime": 99.99},
    ]
}


# -----------------------------------------------------------------------------
# VectorDB Service
# -----------------------------------------------------------------------------
class VectorDB:
    """Vector database for semantic document search. Has intentional problems."""

    def __init__(self):
        self._client = chromadb.Client(Settings(anonymized_telemetry=False))
        self._collection = self._client.get_or_create_collection(
            name="devhub_docs",
            metadata={"hnsw:space": "cosine"}
        )
        self._load_documents()

    def _load_documents(self):
        ids = [doc["id"] for doc in DOCS_DATA]
        texts = [doc["content"] for doc in DOCS_DATA]
        metadatas = [{"title": doc["title"], "category": doc["category"]} for doc in DOCS_DATA]
        self._collection.upsert(ids=ids, documents=texts, metadatas=metadatas)

    def search(self, query: str, top_k: int = 3) -> dict:
        start_time = time.time()

        # INTENTIONAL PROBLEM 1: Connection failure (5%)
        if random.random() < Config.VECTOR_DB_FAILURE_RATE:
            raise ConnectionError("VectorDB connection failed: ECONNREFUSED")

        # INTENTIONAL PROBLEM 2: Slow query (10%)
        if random.random() < Config.VECTOR_DB_SLOW_QUERY_RATE:
            time.sleep(Config.VECTOR_DB_SLOW_QUERY_LATENCY / 1000)
        else:
            time.sleep(random.randint(Config.VECTOR_DB_LATENCY_MIN, Config.VECTOR_DB_LATENCY_MAX) / 1000)

        results = self._collection.query(query_texts=[query], n_results=top_k)

        distances = results["distances"][0] if results["distances"] else []

        # INTENTIONAL PROBLEM 3: Low similarity (15%)
        if random.random() < Config.VECTOR_DB_LOW_SIMILARITY_RATE:
            distances = [d + 0.5 for d in distances]

        return {
            "documents": results["documents"][0] if results["documents"] else [],
            "metadatas": results["metadatas"][0] if results["metadatas"] else [],
            "distances": distances,
            "latency_ms": int((time.time() - start_time) * 1000)
        }


# -----------------------------------------------------------------------------
# TeamDB Service
# -----------------------------------------------------------------------------
class TeamDB:
    """Team/owner lookup database. Has intentional stale data problem."""

    def __init__(self):
        self.teams = {t["id"]: t for t in TEAMS_DATA["teams"]}
        self.owners = TEAMS_DATA["owners"]

    def find_owner(self, service_name: str) -> dict:
        start_time = time.time()
        time.sleep(random.randint(20, 100) / 1000)  # Simulate latency

        # Find owners for this service
        matching_owners = [o for o in self.owners if service_name.lower() in [s.lower() for s in o["services"]]]

        if not matching_owners:
            return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

        # INTENTIONAL PROBLEM: Stale data (10%) - return inactive owner
        if random.random() < Config.TEAM_DB_STALE_DATA_RATE:
            # Find inactive owner if exists
            inactive = [o for o in matching_owners if not o["is_active"]]
            if inactive:
                owner = inactive[0]
            else:
                owner = matching_owners[0]
        else:
            # Normal: return active owner
            active = [o for o in matching_owners if o["is_active"]]
            owner = active[0] if active else matching_owners[0]

        team = self.teams.get(owner["team_id"], {})

        return {
            "found": True,
            "owner": owner,
            "team": team,
            "latency_ms": int((time.time() - start_time) * 1000)
        }


# -----------------------------------------------------------------------------
# StatusAPI Service
# -----------------------------------------------------------------------------
class StatusAPI:
    """Service status checker. Has intentional timeout problem."""

    def __init__(self):
        self.services = {s["name"]: s for s in STATUS_DATA["services"]}

    def check_status(self, service_name: str) -> dict:
        start_time = time.time()

        # INTENTIONAL PROBLEM: Timeout (2%)
        if random.random() < Config.STATUS_API_TIMEOUT_RATE:
            time.sleep(5)  # 5 second timeout
            raise TimeoutError(f"StatusAPI timeout checking {service_name}")

        time.sleep(random.randint(30, 150) / 1000)  # Normal latency

        service = self.services.get(service_name.lower())

        if not service:
            return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

        return {
            "found": True,
            "service": service,
            "latency_ms": int((time.time() - start_time) * 1000)
        }


# -----------------------------------------------------------------------------
# DevHubAgent
# -----------------------------------------------------------------------------
class DevHubAgent:
    """AI agent that orchestrates tools to answer developer questions."""

    TOOL_PLANNING_PROMPT = """You are a tool planner. Based on the user's question, decide which tools to call.

Available tools:
1. search_docs: Search documentation. Use for "how to", needs docs, wants examples. Args: {"query": "search terms"}
2. find_owner: Find service owner. Use for "who owns", "who can help". Args: {"service": "service name"}
3. check_status: Check service health. Use for "is X working", "status of". Args: {"service": "service name"}

Return ONLY a JSON array: [{"tool": "name", "args": {...}}, ...]
If no tools needed, return: []

User question: {query}"""

    RESPONSE_PROMPT = """Based on the user's question and tool results, provide a helpful response.

User question: {query}

Tool results:
{results}

Guidelines:
- Be concise and actionable
- If owner is inactive (is_active: false), mention this
- If service is degraded, clearly state this
- If similarity scores are low (distance > 0.5), mention answers may not be accurate"""

    def __init__(self):
        self.vector_db = VectorDB()
        self.team_db = TeamDB()
        self.status_api = StatusAPI()
        self.client = OpenAI()

    def _plan_tools(self, query: str) -> list:
        response = self.client.chat.completions.create(
            model=Config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are a tool planning assistant. Respond only with valid JSON."},
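                # str.replace instead of str.format: the literal JSON braces in the prompt would make format() raise.
                {"role": "user", "content": self.TOOL_PLANNING_PROMPT.replace("{query}", query)}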
            ],
            temperature=0.1,
            max_tokens=256
        )
        content = response.choices[0].message.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
        try:
            return json.loads(content.strip())
        except json.JSONDecodeError:
            return []

    def _execute_tool(self, tool_name: str, args: dict) -> dict:
        result = {"tool": tool_name, "success": False, "data": None, "error": None}
        try:
            if tool_name == "search_docs":
                result["data"] = self.vector_db.search(args.get("query", ""))
                result["success"] = True
            elif tool_name == "find_owner":
                result["data"] = self.team_db.find_owner(args.get("service", ""))
                result["success"] = True
            elif tool_name == "check_status":
                result["data"] = self.status_api.check_status(args.get("service", ""))
                result["success"] = True
        except Exception as e:
            result["error"] = str(e)
        return result

    def _generate_response(self, query: str, tool_results: list) -> str:
        response = self.client.chat.completions.create(
            model=Config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are DevHub, a helpful developer assistant."},
                {"role": "user", "content": self.RESPONSE_PROMPT.format(query=query, results=json.dumps(tool_results, indent=2))}
            ],
            temperature=0.3,
            max_tokens=1024
        )
        return response.choices[0].message.content

    def query(self, user_query: str) -> dict:
        planned_tools = self._plan_tools(user_query)
        tool_results = [self._execute_tool(t["tool"], t.get("args", {})) for t in planned_tools]
        response = self._generate_response(user_query, tool_results)
        return {
            "response": response,
            "tools_called": [t["tool"] for t in planned_tools],
            "tool_results": tool_results
        }


print("DevHub V0 loaded successfully!")
print("   Classes available: Config, VectorDB, TeamDB, StatusAPI, DevHubAgent")

In [None]:
# =============================================================================
# INITIALIZE DEVHUB
# =============================================================================
# Create instances of all services and the agent.

print("Initializing DevHub...")

# Create the agent (this also initializes all services)
agent = DevHubAgent()

print("DevHub initialized!")
print(f"   - VectorDB: {len(DOCS_DATA)} documents loaded")
print(f"   - TeamDB: {len(TEAMS_DATA['owners'])} owners loaded")
print(f"   - StatusAPI: {len(STATUS_DATA['services'])} services loaded")
print("\nReady to answer questions!")

In [None]:
# =============================================================================
# DEMO: Run a documentation query
# =============================================================================
# The instructor will run this to show DevHub in action.

query = "How do I authenticate with the Payments API?"

print(f"Question: {query}")
print("-" * 50)

result = agent.query(query)

print(f"\nAnswer:\n{result['response']}")
print(f"\nTools used: {result['tools_called']}")

In [None]:
# =============================================================================
# DEMO: Try different query types
# =============================================================================

queries = [
    "Who owns the billing service?",
    "Is staging working?",
    "How do I use Auth SDK and who can help me with it?"
]

for q in queries:
    print(f"\n{'='*60}")
    print(f"Question: {q}")
    print("-" * 60)

    result = agent.query(q)

    print(f"Answer: {result['response'][:200]}...")
    print(f"Tools: {result['tools_called']}")

---

## Lab 1.1: Explore DevHub

Now it's your turn! Run DevHub yourself and observe its behavior.

### Your Tasks:

1. Run at least **5 different queries** using the code cell below
2. **Time each query** mentally (or use the latency info)
3. **Note any issues** you observe:
   - Slow responses
   - Strange answers
   - Errors

### Suggested queries to try:
- "How do I connect to the database?"
- "Who owns vector search?"
- "Is the payments API working?"
- "What are the rate limits?"
- "How do I handle errors?"

### Questions to answer:
- Did any queries feel slow? How slow?
- Did any answers seem wrong or outdated?
- Did any queries fail completely?

In [None]:
# =============================================================================
# LAB 1.1: Explore DevHub
# =============================================================================
# Run different queries and observe the behavior.
# Note: Some queries might be slow or return unexpected results!

# Query 1
result1 = agent.query("How do I connect to the database?")
print(f"Q1 Answer: {result1['response'][:150]}...")
print(f"   Tools: {result1['tools_called']}\n")

# Query 2 - PUT YOUR CODE HERE: Try a different query
# result2 = agent.query("...")
# print(f"Q2 Answer: {result2['response'][:150]}...")

# Query 3 - PUT YOUR CODE HERE
# result3 = agent.query("...")

# Query 4 - PUT YOUR CODE HERE
# result4 = agent.query("...")

# Query 5 - PUT YOUR CODE HERE
# result5 = agent.query("...")

# -----------------------------------------------------------------------------
# YOUR OBSERVATIONS
# -----------------------------------------------------------------------------
# Did any queries feel slow? Which ones?
# YOUR ANSWER:

# Did any answers seem wrong or outdated?
# YOUR ANSWER:

# Did any queries fail? What was the error?
# YOUR ANSWER:

---

## The Frustration Exercise

Something is wrong with DevHub. Users are complaining about:
- Slow responses (sometimes 3+ seconds)
- Wrong owner information
- Irrelevant search results

**Your challenge:** Figure out what's causing these problems.

### What you have available:
- The source code (in the cells above)
- The ability to run queries
- Print statements
- Basic logging

### What you DON'T have:
- Any tracing or observability
- Metrics dashboards
- Performance profiling

**Try to debug it.** We'll see how far you get...

In [None]:
# =============================================================================
# TRY TO DEBUG DEVHUB
# =============================================================================
# Add print statements, timing, whatever you think might help.
# Spoiler: It's going to be frustrating.

import time

# Let's add some "debugging"
def debug_query(query):
    print(f"[DEBUG] Starting query: {query}")
    start = time.perf_counter()  # monotonic clock, better suited to measuring elapsed time

    try:
        result = agent.query(query)
        elapsed = time.perf_counter() - start

        print(f"[DEBUG] Query completed in {elapsed:.2f}s")
        print(f"[DEBUG] Tools called: {result['tools_called']}")

        # What else can we check?
        # We don't know:
        # - How long each tool took
        # - What data each tool returned
        # - Whether the data was stale
        # - Whether similarity scores were low

        return result

    except Exception as e:
        print(f"[DEBUG] Query failed: {e}")
        # But WHY did it fail? Which component?
        return None

# Try it
print("Running debug query...\n")
result = debug_query("Who owns vector search?")

if result:
    print(f"\nAnswer: {result['response']}")

print("\n" + "="*60)
print("Questions you can't answer with print statements:")
print("- Which specific component was slow?")
print("- Was the returned data fresh or stale?")
print("- What was the similarity score of the search results?")
print("- Why did that component fail?")
print("="*60)

---

## Discussion: What Information Would Help?

You just experienced the **pain of debugging without observability**.

### With print statements, you learned:
- Total query time
- Which tools were called
- Whether it succeeded or failed

### But you COULDN'T learn:
- How long EACH tool took
- What data each tool actually returned
- Whether data was stale (is_active field)
- What the similarity scores were
- Which specific operation was slow
- The sequence of operations

### What we need:
A way to **trace the entire request** through all components, capturing:
- **Timing** for each operation
- **Data** passed between components
- **Errors** with full context
- **Relationships** between operations

**This is exactly what distributed tracing provides.**
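
As a rough illustration of the first three ingredients, here is a stdlib-only sketch of a hand-rolled step recorder. It is not OpenTelemetry and not part of DevHub; the names `record_step` and `STEPS` are hypothetical. It captures timing, data, and errors, but it has no notion of relationships between steps, which is the piece that span parent links add.

```python
import time
from contextlib import contextmanager

STEPS = []  # hypothetical in-memory log of finished steps


@contextmanager
def record_step(name, **data):
    """Record timing, data, and any error for one named operation."""
    entry = {"name": name, "data": data, "error": None}
    start = time.perf_counter()
    try:
        yield entry
    except Exception as exc:
        # Keep the error with its context, then let it propagate
        entry["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        entry["duration_ms"] = (time.perf_counter() - start) * 1000
        STEPS.append(entry)


with record_step("search", query="auth"):
    time.sleep(0.02)  # stand-in for real work

try:
    with record_step("lookup", service="billing"):
        raise ConnectionError("ECONNREFUSED")
except ConnectionError:
    pass

for step in STEPS:
    print(f"{step['name']}: {step['duration_ms']:.0f}ms error={step['error']}")
```

Even this small recorder has to be threaded through every function by hand, which is the bookkeeping a tracing library takes over.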

---

**Next:** We'll learn the concepts behind distributed tracing, then instrument DevHub to capture all this information.

---

# Topic 3: Distributed Tracing with OpenTelemetry

Now that you've felt the pain of debugging without visibility, let's learn the solution: **distributed tracing**.

Tracing gives you a complete picture of what happens during a request - where time is spent, what data flows through, and where errors occur.

## The Problem: Where's the Bottleneck?

In modern applications, a single user request might:
- Hit multiple services
- Make database queries
- Call external APIs
- Invoke AI models

**Example:** Your DevHub query:
1. Application receives request
2. Agent plans tools (calls LLM)
3. VectorDB searches documents
4. TeamDB looks up owner
5. LLM synthesizes response
6. Application returns answer

If the total time is 5 seconds, **WHERE** is that time spent?
- Is it the VectorDB query?
- Is it the LLM call?
- Is it network latency?

**Without tracing, you're guessing.**

## The Black Box Problem

![Black Box Problem](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/00_black_box_problem.svg)

Without tracing:
- You see the **input** (request)
- You see the **output** (response)
- You see the **total time**
- You DON'T see what happened **inside**

This is the "black box" problem.

## What is a Trace?

A **trace** represents the **complete journey** of a single request through your system.

### Key properties:
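- **Unique Trace ID**: Every trace has a unique identifier, a 16-byte value usually shown as 32 hex characters (e.g., `4bf92f3577b34da6a3ce929d0e0e4736`); the examples below abbreviate it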
- **Spans**: A trace contains multiple "spans" (we'll explain next)
- **Causality**: Shows which operations triggered which other operations
- **Timing**: Captures start time, end time, and duration

### Example Trace:
```
Trace ID: abc123

[0ms]──────────────────────────────────────────────[5200ms]
│                                                        │
│  agent.query (total: 5200ms)                          │
│    │                                                   │
│    ├──[50ms] tool_planning (LLM call)                 │
│    │                                                   │
│    ├──[3100ms] vector_db.search  <- SLOW!             │
│    │                                                   │
│    ├──[80ms] team_db.find_owner                       │
│    │                                                   │
│    └──[1970ms] response_synthesis (LLM call)          │
```

**Now you can SEE** that vector_db.search took 3100ms - that's your bottleneck!

![Trace Span Hierarchy](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/03_trace_span_hierarchy.svg)
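
Reading the bottleneck off a trace is simple arithmetic on span durations. The sketch below uses the numbers from the example trace; the helper name `rank_by_share` is hypothetical and only meant to show the calculation.

```python
# Durations (ms) copied from the example trace
TOTAL_MS = 5200
child_spans = {
    "tool_planning": 50,
    "vector_db.search": 3100,
    "team_db.find_owner": 80,
    "response_synthesis": 1970,
}


def rank_by_share(spans, total_ms):
    """Return (name, duration_ms, percent_of_total) tuples, slowest first."""
    ranked = sorted(spans.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, round(100 * ms / total_ms, 1)) for name, ms in ranked]


for name, ms, pct in rank_by_share(child_spans, TOTAL_MS):
    print(f"{name:<20} {ms:>5}ms {pct:>5}%")
```

The first row is the span to investigate; Jaeger's timeline view shows the same ranking visually.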


## What is a Span?

A **span** represents a **single operation** within a trace.

### Every span has:

| Property | Description | Example |
|----------|-------------|---------|
| **Name** | What operation this is | `vector_db.search` |
| **Start Time** | When it started | `2024-01-15T10:30:00.000Z` |
| **End Time** | When it finished | `2024-01-15T10:30:03.100Z` |
| **Duration** | How long it took | `3100ms` |
| **Parent** | Which span triggered this one | `agent.query` |
| **Attributes** | Key-value metadata | `db.system=chromadb` |
| **Status** | OK, ERROR, or UNSET | `OK` |
| **Events** | Timestamped logs within the span | `query executed` |

### Span Hierarchy:

Spans form a **tree structure**:
- **Root span**: The top-level operation (e.g., `agent.query`)
- **Child spans**: Operations triggered by the parent (e.g., `vector_db.search`)
- **Nested spans**: Can go many levels deep

This hierarchy shows **causality** - which operations triggered which.
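
A tracing backend typically stores spans as flat records and rebuilds the tree from the parent links. The following is a minimal sketch of that reconstruction; the record layout (`id`, `parent`, `name`) and the function `render_tree` are assumptions made for illustration, not Jaeger's actual storage format.

```python
from collections import defaultdict

# Hypothetical flat span records, as a backend might store them
spans = [
    {"id": "a", "parent": None, "name": "agent.query"},
    {"id": "b", "parent": "a", "name": "_execute_tool"},
    {"id": "c", "parent": "b", "name": "vector_db.search"},
    {"id": "d", "parent": "a", "name": "_generate_response"},
]


def render_tree(spans):
    """Return indented lines showing the parent/child hierarchy."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


print("\n".join(render_tree(spans)))
```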

## Trace/Span Hierarchy

```mermaid
gantt
    title Trace: agent.query (5200ms total)
    dateFormat X
    axisFormat %L

    section Root
    agent.query           :0, 5200

    section Planning
    _plan_tools           :0, 50
    
    section Execution
    vector_db.search      :crit, 100, 3200
    team_db.find_owner    :3250, 3330
    
    section Synthesis
    _generate_response    :3350, 5200
```

**What this visualization shows:**
1. Total request took 5200ms
2. `vector_db.search` took 3100ms (60% of total time!)
3. We can immediately identify the bottleneck

**Attributes on vector_db.search span:**
- `db.system = "chromadb"`
- `vector.query = "How do I authenticate..."`
- `vector.latency_ms = 3100` - This confirms the delay was inside the database call itself, not in the code around it

## Span Attributes: The Secret Sauce

Attributes are **key-value pairs** attached to spans that provide context.

### Standard Attributes (OpenTelemetry Semantic Conventions):

| Attribute | Description | Example |
|-----------|-------------|---------|
| `service.name` | Which service (set once on the resource, not on each span) | `devhub-john-smith` |
| `db.system` | Database type | `chromadb`, `postgresql` |
| `db.operation` | Operation type | `query`, `insert` |
| `http.method` | HTTP method | `GET`, `POST` |
| `http.status_code` | Response code | `200`, `500` |

Newer releases of the semantic conventions rename the HTTP attributes to `http.request.method` and `http.response.status_code`; you will see both spellings in practice.

### Custom Attributes (DevHub-specific):

| Attribute | Description | Why It Matters |
|-----------|-------------|----------------|
| `vector.query` | Search query text | See what was searched |
| `vector.latency_ms` | Query latency | Identify slow queries |
| `vector.results_count` | Number of results | Check retrieval quality |
| `vector.top_distance` | Lowest distance among the results (lower = closer match) | Detect poor matches |
| `owner.is_active` | Owner status | Catch stale data |
| `llm.model` | Model used | Track model performance |
| `llm.tokens` | Tokens used | Monitor costs |

**Attributes are what make debugging possible.** They answer not just "what happened" but "WHY did it happen."

![Span Attributes](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/04_span_attributes.svg)
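
Because attributes are plain key-value pairs, checks like the ones in the "Why It Matters" column can be expressed as simple rules. This is a sketch only: the thresholds and the `diagnose` function are hypothetical and should be tuned to your own system.

```python
# Hypothetical thresholds; tune them for your own system
MAX_LATENCY_MS = 1000
MAX_TOP_DISTANCE = 0.8


def diagnose(attributes):
    """Return a list of human-readable findings for one span's attributes."""
    findings = []
    if attributes.get("vector.latency_ms", 0) > MAX_LATENCY_MS:
        findings.append("slow vector search")
    # Distance, not similarity: a larger value means a weaker match
    if attributes.get("vector.top_distance", 0.0) > MAX_TOP_DISTANCE:
        findings.append("poor match (high distance)")
    if attributes.get("owner.is_active") is False:
        findings.append("stale owner record")
    return findings


print(diagnose({"vector.latency_ms": 3100, "vector.top_distance": 0.35}))
print(diagnose({"owner.name": "Sarah Chen", "owner.is_active": False}))
```

In Lab 2 you will apply the same kind of reasoning by eye in the Jaeger UI.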


## Context Propagation: Connecting the Dots

How does a trace stay connected across different services and function calls?

### The W3C Trace Context Standard

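Every span carries three identifiers:
- **Trace ID**: Unique ID for the entire trace (the same for every span in it)
- **Span ID**: Unique ID for this specific span
- **Parent Span ID**: The span that created this one (empty for the root span)

When a request crosses a service boundary, the caller sends its trace ID and span ID in the `traceparent` HTTP header, encoded as lowercase hex:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                └─ Flags (01 = sampled)
             │  │                                └─ Span ID of the calling span (16 hex chars)
             │  └─ Trace ID (32 hex chars, shared by the whole trace)
             └─ Version
```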
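
To make the header layout concrete, here is a small stdlib-only parser for version `00` `traceparent` values. It is a sketch for illustration, assuming the four-field layout of the W3C Trace Context specification; in real code OpenTelemetry's propagators handle this for you, and `parse_traceparent` is a hypothetical name.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header into its four fields, or return None."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the spec
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(header))
```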

### How it works in OpenTelemetry:

```python
# OpenTelemetry automatically propagates context
with tracer.start_as_current_span("parent_operation"):
    # Any spans created here automatically become children
    with tracer.start_as_current_span("child_operation"):
        # This span's parent is automatically set
        pass
```

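Within a single process you don't need to pass trace IDs around manually: OpenTelemetry tracks the current span for you. Across service boundaries the IDs travel in the `traceparent` header, which instrumentation libraries or a propagator inject and extract.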

![Context Propagation](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/05_context_propagation.svg)
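
In Python, OpenTelemetry's default context storage is built on the standard `contextvars` module. The following stdlib-only sketch shows the idea in simplified form; it is not the library's implementation, and `current_span_id`, `span`, and `RECORDED` are hypothetical names.

```python
import contextvars
import itertools
from contextlib import contextmanager

current_span_id = contextvars.ContextVar("current_span_id", default=None)
_ids = itertools.count(1)
RECORDED = []  # (span_id, parent_id, name)


@contextmanager
def span(name):
    """Open a pseudo-span whose parent is whatever span is current."""
    span_id = next(_ids)
    RECORDED.append((span_id, current_span_id.get(), name))
    token = current_span_id.set(span_id)  # make this span current
    try:
        yield span_id
    finally:
        current_span_id.reset(token)  # restore the previous current span


def find_owner():
    # No span or trace ID is passed in; the parent comes from the context
    with span("team_db.find_owner"):
        pass


with span("agent.query"):
    find_owner()
    with span("_generate_response"):
        pass

print(RECORDED)
```

Note that `find_owner` takes no tracing arguments at all, yet its span is correctly attached to `agent.query`.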


## OpenTelemetry: The Industry Standard

**OpenTelemetry (OTel)** is the industry-standard framework for observability.

### Why OpenTelemetry?

| Feature | Benefit |
|---------|---------|
| **Vendor-neutral** | Works with Jaeger, Datadog, New Relic, etc. |
| **Single API** | Learn once, use everywhere |
| **Auto-instrumentation** | Many libraries instrumented automatically |
| **Wide adoption** | Used by Google, Microsoft, AWS, etc. |

### Key Components:

1. **Tracer Provider**: Creates and manages tracers
2. **Tracer**: Creates spans
3. **Span Processor**: Processes spans before export
4. **Exporter**: Sends spans to backend (Jaeger, etc.)

### Basic Pattern:

```python
from opentelemetry import trace

# Get a tracer
tracer = trace.get_tracer("my-service")

# Create spans
with tracer.start_as_current_span("operation_name") as span:
    span.set_attribute("key", "value")
    # Your code here
```

**Next:** Let's see this in action with a simple demo.

In [None]:
# =============================================================================
# DEMO: Create a Simple Trace
# =============================================================================
# This shows the basic pattern for creating traces with OpenTelemetry.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
import time

# -----------------------------------------------------------------------------
# Step 1: Configure the tracer provider
# -----------------------------------------------------------------------------
resource = Resource.create({
    "service.name": f"demo-tracing-{STUDENT_NAME}",
    "service.version": "1.0.0",
})

provider = TracerProvider(resource=resource)

# -----------------------------------------------------------------------------
# Step 2: Configure the exporter (sends traces to Jaeger)
# -----------------------------------------------------------------------------
exporter = OTLPSpanExporter(
    endpoint=JAEGER_ENDPOINT,
    insecure=True  # Use insecure for workshop (no TLS)
)

provider.add_span_processor(BatchSpanProcessor(exporter))

# -----------------------------------------------------------------------------
# Step 3: Set as global tracer provider
# -----------------------------------------------------------------------------
trace.set_tracer_provider(provider)

# -----------------------------------------------------------------------------
# Step 4: Get a tracer
# -----------------------------------------------------------------------------
tracer = trace.get_tracer("demo-tracer")

# -----------------------------------------------------------------------------
# Step 5: Create spans!
# -----------------------------------------------------------------------------
print("Creating a trace with nested spans...")

with tracer.start_as_current_span("parent_operation") as parent:
    parent.set_attribute("demo.type", "workshop")
    parent.set_attribute("student.name", STUDENT_NAME)

    # Simulate some work
    time.sleep(0.1)

    # Child span 1
    with tracer.start_as_current_span("child_operation_1") as child1:
        child1.set_attribute("operation.type", "database_query")
        time.sleep(0.2)  # Simulate DB query

    # Child span 2
    with tracer.start_as_current_span("child_operation_2") as child2:
        child2.set_attribute("operation.type", "api_call")
        time.sleep(0.15)  # Simulate API call

# Force flush to ensure spans are sent
provider.force_flush()

print("Trace created and sent to Jaeger!")
print(f"\nView it at: {JAEGER_UI}")
print(f"Service name: demo-tracing-{STUDENT_NAME}")

In [None]:
# =============================================================================
# DEMO: How to View Your Trace in Jaeger
# =============================================================================
# Follow these steps to see your trace in the Jaeger UI.

print("=" * 60)
print("HOW TO VIEW YOUR TRACE IN JAEGER")
print("=" * 60)

print(f"""
1. Open Jaeger UI in your browser:
   {JAEGER_UI}

2. Enter credentials if prompted:
   Username: workshop
   Password: salesforce2025

3. In the "Service" dropdown, select:
   demo-tracing-{STUDENT_NAME}

4. Click "Find Traces"

5. Click on the trace that appears

6. You should see:
   - parent_operation (root span)
     |-- child_operation_1 (database_query)
     |-- child_operation_2 (api_call)

7. Click on each span to see its attributes

""")

print("=" * 60)
print("TIP: If you don't see your trace, wait 10 seconds and refresh.")
print("    Traces are batched and may take a moment to appear.")
print("=" * 60)

---

## Key Insight: Traces Show WHERE, Not Just THAT

**Without tracing:**
> "The request took 5 seconds."

**With tracing:**
> "The request took 5 seconds. The vector database query accounts for 3 of them, and its `vector.latency_ms` attribute reads 3000. That matches our configured slow-query latency, so this was the simulated slow-query scenario."

Tracing transforms debugging from:
- **Guessing** -> **Knowing**
- **Hours of investigation** -> **Minutes of analysis**
- **"It's slow somewhere"** -> **"vector_db.search took 3100ms"**

---

**Next:** Now you'll instrument DevHub yourself in Lab 1, adding tracing to all components so you can see exactly what's happening inside.

---

# Lab 1: Instrument DevHub with OpenTelemetry

Time to get your hands dirty! In this lab, you'll add tracing to DevHub so you can see exactly what's happening inside.

**Duration:** ~30 minutes

**What you'll do:**
1. Initialize OpenTelemetry (Task 1)
2. Instrument VectorDB.search() (Task 2)
3. Instrument TeamDB.find_owner() (Task 3)
4. Instrument StatusAPI.check_status() (Task 4)
5. Instrument DevHubAgent.query() (Task 5)

**Scaffolding level decreases** as you go:
- Task 1: Full step-by-step guidance
- Task 2: Medium guidance
- Task 3: Light guidance
- Task 4: Minimal guidance
- Task 5: Just the goal

## What You'll Instrument

By the end of this lab, every DevHub query will generate a trace like:

```
agent.query (root span)
  |
  |-- _plan_tools
  |     |-- llm.completion
  |
  |-- _execute_tool
  |     |-- vector_db.search
  |           Attributes:
  |           - db.system = "chromadb"
  |           - vector.query = "..."
  |           - vector.latency_ms = 150
  |           - vector.results_count = 3
  |
  |-- _execute_tool
  |     |-- team_db.find_owner
  |           Attributes:
  |           - db.system = "in_memory"
  |           - owner.name = "Sarah Chen"
  |           - owner.is_active = true
  |
  |-- _generate_response
        |-- llm.completion
```

This visibility will let you diagnose any performance or data issue!

## Task 1: Initialize OpenTelemetry

**Goal:** Set up the tracing infrastructure for DevHub.

**What you need to do:**

1. Create a `Resource` with service name `devhub-{STUDENT_NAME}`
2. Create a `TracerProvider` with that resource
3. Create an `OTLPSpanExporter` pointing to Jaeger
4. Add a `BatchSpanProcessor` to the provider
5. Set the provider as global
6. Create a tracer named `devhub`

**Code structure:**
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Step 1: Create resource
resource = Resource.create({"service.name": f"devhub-{STUDENT_NAME}"})

# Step 2: Create provider
provider = TracerProvider(resource=resource)

# Step 3: Create exporter
exporter = OTLPSpanExporter(endpoint=JAEGER_ENDPOINT, insecure=True)

# Step 4: Add processor
provider.add_span_processor(BatchSpanProcessor(exporter))

# Step 5: Set global
trace.set_tracer_provider(provider)

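# Step 6: Get tracer from this provider (the global provider can only be set once,
# and the demo cell may already have set it)
tracer = provider.get_tracer("devhub")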
```

**Time:** ~5 minutes

![Otel Architecture](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/07_otel_architecture.svg)


In [None]:
# =============================================================================
# TASK 1: Initialize OpenTelemetry
# =============================================================================
# Set up the tracing infrastructure for DevHub.
# Follow the instructions in the cell above.
#
# TIME: ~5 minutes
# =============================================================================

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# -----------------------------------------------------------------------------
# PUT YOUR CODE HERE
# -----------------------------------------------------------------------------

# Step 1: Create resource with service.name = f"devhub-{STUDENT_NAME}"
resource = None  # PUT YOUR CODE HERE

# Step 2: Create TracerProvider with the resource
provider = None  # PUT YOUR CODE HERE

# Step 3: Create OTLPSpanExporter pointing to JAEGER_ENDPOINT
exporter = None  # PUT YOUR CODE HERE

# Step 4: Add BatchSpanProcessor to the provider
# PUT YOUR CODE HERE

# Step 5: Set as global tracer provider
# PUT YOUR CODE HERE

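# Step 6: Get a tracer named "devhub"
# Tip: provider.get_tracer(...) uses your new provider even if the demo cell already set a global one
tracer = None  # PUT YOUR CODE HERE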

# -----------------------------------------------------------------------------
# END YOUR CODE
# -----------------------------------------------------------------------------

# Verification
if tracer is not None:
    print("OpenTelemetry initialized!")
    print(f"   Service name: devhub-{STUDENT_NAME}")
    print(f"   Exporting to: {JAEGER_ENDPOINT}")
else:
    print("tracer is None - check your code above")

In [None]:
# =============================================================================
# SOLUTION: Task 1 - Initialize OpenTelemetry
# =============================================================================
# Expand this cell to see the solution if you get stuck.

# Step 1: Create resource
resource = Resource.create({
    "service.name": f"devhub-{STUDENT_NAME}",
    "service.version": "1.0.0",
})

# Step 2: Create provider
provider = TracerProvider(resource=resource)

# Step 3: Create exporter
exporter = OTLPSpanExporter(
    endpoint=JAEGER_ENDPOINT,
    insecure=True
)

# Step 4: Add processor
provider.add_span_processor(BatchSpanProcessor(exporter))

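# Step 5: Set global (if the demo cell already set one, this logs a warning and is ignored)
trace.set_tracer_provider(provider)

# Step 6: Get tracer from our provider so spans carry the devhub service name
tracer = provider.get_tracer("devhub")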

print("OpenTelemetry initialized (from solution)")

## Task 2: Instrument VectorDB.search()

**Goal:** Add tracing to the vector database search method.

**What to capture:**
- Span name: `vector_db.search`
- Attributes:
  - `db.system` = `"chromadb"`
  - `vector.query` = the search query (first 100 chars)
  - `vector.top_k` = the top_k parameter
  - `vector.latency_ms` = the latency from the result
  - `vector.results_count` = number of results returned
  - `vector.top_distance` = the best (lowest) distance score

**Pattern:**
```python
def search(self, query: str, top_k: int = 3) -> dict:
    with tracer.start_as_current_span("vector_db.search") as span:
        span.set_attribute("db.system", "chromadb")
        span.set_attribute("vector.query", query[:100])
        # ... existing code ...
        span.set_attribute("vector.latency_ms", result["latency_ms"])
        return result
```
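
One way to keep the attribute logic testable is to build the attribute dictionary in a plain function and attach it to the span afterwards. The sketch below is stdlib-only; `search_span_attributes` is a hypothetical helper, not part of DevHub, and the result dict mirrors the shape returned by `search()`.

```python
def search_span_attributes(query, top_k, result):
    """Build the attribute dict for one vector search (hypothetical helper)."""
    attributes = {
        "db.system": "chromadb",
        "vector.query": query[:100],  # truncate long queries
        "vector.top_k": top_k,
        "vector.latency_ms": result["latency_ms"],
        "vector.results_count": len(result["documents"]),
    }
    # Only set a top distance when there is at least one result
    if result["distances"]:
        attributes["vector.top_distance"] = min(result["distances"])
    return attributes


fake_result = {"documents": ["doc-a", "doc-b"], "distances": [0.21, 0.48], "latency_ms": 150}
attrs = search_span_attributes("x" * 250, 3, fake_result)
print(attrs)
```

Inside the span you could then call `span.set_attributes(attrs)` instead of setting each key individually.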

**Time:** ~8 minutes

In [None]:
# =============================================================================
# TASK 2: Instrument VectorDB.search()
# =============================================================================
# Add tracing to capture search queries and their performance.
#
# TIME: ~8 minutes
# =============================================================================

class VectorDBInstrumented:
    """VectorDB with OpenTelemetry instrumentation."""

    def __init__(self):
        self._client = chromadb.Client(Settings(anonymized_telemetry=False))
        self._collection = self._client.get_or_create_collection(
            name="devhub_docs_instrumented",
            metadata={"hnsw:space": "cosine"}
        )
        self._load_documents()

    def _load_documents(self):
        ids = [doc["id"] for doc in DOCS_DATA]
        texts = [doc["content"] for doc in DOCS_DATA]
        metadatas = [{"title": doc["title"], "category": doc["category"]} for doc in DOCS_DATA]
        self._collection.upsert(ids=ids, documents=texts, metadatas=metadatas)

    def search(self, query: str, top_k: int = 3) -> dict:
        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Wrap this method with a span
        # ─────────────────────────────────────────────────────────────────────

        # 1. Start a span named "vector_db.search"
        # 2. Set attributes: db.system, vector.query, vector.top_k

        # ─────────────────────────────────────────────────────────────────────

        start_time = time.time()

        # Intentional problems (keep these!)
        if random.random() < Config.VECTOR_DB_FAILURE_RATE:
            raise ConnectionError("VectorDB connection failed: ECONNREFUSED")

        if random.random() < Config.VECTOR_DB_SLOW_QUERY_RATE:
            time.sleep(Config.VECTOR_DB_SLOW_QUERY_LATENCY / 1000)
        else:
            time.sleep(random.randint(Config.VECTOR_DB_LATENCY_MIN, Config.VECTOR_DB_LATENCY_MAX) / 1000)

        results = self._collection.query(query_texts=[query], n_results=top_k)

        distances = results["distances"][0] if results["distances"] else []

        if random.random() < Config.VECTOR_DB_LOW_SIMILARITY_RATE:
            distances = [d + 0.5 for d in distances]

        result = {
            "documents": results["documents"][0] if results["documents"] else [],
            "metadatas": results["metadatas"][0] if results["metadatas"] else [],
            "distances": distances,
            "latency_ms": int((time.time() - start_time) * 1000)
        }

        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Set attributes for latency, results_count, top_distance
        # ─────────────────────────────────────────────────────────────────────

        # 3. Set: vector.latency_ms, vector.results_count, vector.top_distance

        # ─────────────────────────────────────────────────────────────────────

        return result


# Test it
test_vdb = VectorDBInstrumented()
test_result = test_vdb.search("authentication")
print(f"Search returned {len(test_result['documents'])} results")
print(f"   Latency: {test_result['latency_ms']}ms")

In [None]:
# =============================================================================
# SOLUTION: Task 2 - Instrument VectorDB.search()
# =============================================================================

class VectorDBInstrumented:
    """VectorDB with OpenTelemetry instrumentation."""

    def __init__(self):
        self._client = chromadb.Client(Settings(anonymized_telemetry=False))
        self._collection = self._client.get_or_create_collection(
            name="devhub_docs_instrumented",
            metadata={"hnsw:space": "cosine"}
        )
        self._load_documents()

    def _load_documents(self):
        ids = [doc["id"] for doc in DOCS_DATA]
        texts = [doc["content"] for doc in DOCS_DATA]
        metadatas = [{"title": doc["title"], "category": doc["category"]} for doc in DOCS_DATA]
        self._collection.upsert(ids=ids, documents=texts, metadatas=metadatas)

    def search(self, query: str, top_k: int = 3) -> dict:
        # Start span and set initial attributes
        with tracer.start_as_current_span("vector_db.search") as span:
            span.set_attribute("db.system", "chromadb")
            span.set_attribute("vector.query", query[:100])
            span.set_attribute("vector.top_k", top_k)

            start_time = time.time()

            # Intentional problems
            if random.random() < Config.VECTOR_DB_FAILURE_RATE:
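                # Explicit flags for easy filtering; the span context manager also
                # records the exception and sets the span status to ERROR on exit
                span.set_attribute("error", True)
                span.set_attribute("error.type", "ConnectionError")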
                raise ConnectionError("VectorDB connection failed: ECONNREFUSED")

            if random.random() < Config.VECTOR_DB_SLOW_QUERY_RATE:
                time.sleep(Config.VECTOR_DB_SLOW_QUERY_LATENCY / 1000)
            else:
                time.sleep(random.randint(Config.VECTOR_DB_LATENCY_MIN, Config.VECTOR_DB_LATENCY_MAX) / 1000)

            results = self._collection.query(query_texts=[query], n_results=top_k)

            distances = results["distances"][0] if results["distances"] else []

            if random.random() < Config.VECTOR_DB_LOW_SIMILARITY_RATE:
                distances = [d + 0.5 for d in distances]

            result = {
                "documents": results["documents"][0] if results["documents"] else [],
                "metadatas": results["metadatas"][0] if results["metadatas"] else [],
                "distances": distances,
                "latency_ms": int((time.time() - start_time) * 1000)
            }

            # Set result attributes
            span.set_attribute("vector.latency_ms", result["latency_ms"])
            span.set_attribute("vector.results_count", len(result["documents"]))
            if distances:
                span.set_attribute("vector.top_distance", distances[0])

            return result


print("VectorDBInstrumented defined (from solution)")

## Task 3: Instrument TeamDB.find_owner()

**Goal:** Add tracing to capture owner lookups and detect stale data.

**What to capture:**
- Span name: `team_db.find_owner`
- Attributes:
  - `db.system` = `"in_memory"`
  - `team_db.service` = the service name
  - `team_db.found` = whether owner was found
  - `owner.name` = owner's name (if found)
  - `owner.is_active` = owner's active status (CRITICAL for detecting stale data!)
  - `team_db.latency_ms` = latency

**Less guidance this time** - follow the Task 2 pattern!

**Time:** ~5 minutes

In [None]:
# =============================================================================
# TASK 3: Instrument TeamDB.find_owner()
# =============================================================================
# Add tracing to capture owner lookups. Less guidance - follow Task 2 pattern!
#
# TIME: ~5 minutes
# =============================================================================

class TeamDBInstrumented:
    """TeamDB with OpenTelemetry instrumentation."""

    def __init__(self):
        self.teams = {t["id"]: t for t in TEAMS_DATA["teams"]}
        self.owners = TEAMS_DATA["owners"]

    def find_owner(self, service_name: str) -> dict:
        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Add span and attributes
        # Hint: Follow the VectorDB pattern from Task 2
        # ─────────────────────────────────────────────────────────────────────

        start_time = time.time()
        time.sleep(random.randint(20, 100) / 1000)

        matching_owners = [o for o in self.owners if service_name.lower() in [s.lower() for s in o["services"]]]

        if not matching_owners:
            return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

        # INTENTIONAL PROBLEM: Stale data (10%)
        if random.random() < Config.TEAM_DB_STALE_DATA_RATE:
            inactive = [o for o in matching_owners if not o["is_active"]]
            if inactive:
                owner = inactive[0]
            else:
                owner = matching_owners[0]
        else:
            active = [o for o in matching_owners if o["is_active"]]
            owner = active[0] if active else matching_owners[0]

        team = self.teams.get(owner["team_id"], {})

        result = {
            "found": True,
            "owner": owner,
            "team": team,
            "latency_ms": int((time.time() - start_time) * 1000)
        }

        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Set result attributes (owner.name, owner.is_active)
        # ─────────────────────────────────────────────────────────────────────

        return result


# Test it
test_tdb = TeamDBInstrumented()
test_result = test_tdb.find_owner("billing")
print(f"Found: {test_result['found']}")
if test_result['found']:
    print(f"   Owner: {test_result['owner']['name']}")
    print(f"   Active: {test_result['owner']['is_active']}")

In [None]:
# =============================================================================
# SOLUTION: Task 3 - Instrument TeamDB.find_owner()
# =============================================================================

class TeamDBInstrumented:
    """TeamDB with OpenTelemetry instrumentation."""

    def __init__(self):
        self.teams = {t["id"]: t for t in TEAMS_DATA["teams"]}
        self.owners = TEAMS_DATA["owners"]

    def find_owner(self, service_name: str) -> dict:
        with tracer.start_as_current_span("team_db.find_owner") as span:
            span.set_attribute("db.system", "in_memory")
            span.set_attribute("team_db.service", service_name)

            start_time = time.time()
            time.sleep(random.randint(20, 100) / 1000)

            matching_owners = [o for o in self.owners if service_name.lower() in [s.lower() for s in o["services"]]]

            if not matching_owners:
                span.set_attribute("team_db.found", False)
                return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

            # INTENTIONAL PROBLEM: Stale data (10%)
            if random.random() < Config.TEAM_DB_STALE_DATA_RATE:
                inactive = [o for o in matching_owners if not o["is_active"]]
                if inactive:
                    owner = inactive[0]
                else:
                    owner = matching_owners[0]
            else:
                active = [o for o in matching_owners if o["is_active"]]
                owner = active[0] if active else matching_owners[0]

            team = self.teams.get(owner["team_id"], {})

            result = {
                "found": True,
                "owner": owner,
                "team": team,
                "latency_ms": int((time.time() - start_time) * 1000)
            }

            # Set result attributes
            span.set_attribute("team_db.found", True)
            span.set_attribute("owner.name", owner["name"])
            span.set_attribute("owner.is_active", owner["is_active"])
            span.set_attribute("team_db.latency_ms", result["latency_ms"])

            return result


print("TeamDBInstrumented defined (from solution)")

## Task 4: Instrument StatusAPI.check_status()

**Goal:** Add tracing to capture status checks.

**Minimal guidance** - you know the pattern now!

Span name: `status_api.check_status`

Capture: service name, found status, service status (healthy/degraded), latency

**Time:** ~5 minutes

In [None]:
# =============================================================================
# TASK 4: Instrument StatusAPI.check_status()
# =============================================================================
# Minimal guidance - you know the pattern!
#
# TIME: ~5 minutes
# =============================================================================

class StatusAPIInstrumented:
    """StatusAPI with OpenTelemetry instrumentation."""

    def __init__(self):
        self.services = {s["name"]: s for s in STATUS_DATA["services"]}

    def check_status(self, service_name: str) -> dict:
        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Add tracing
        # ─────────────────────────────────────────────────────────────────────

        start_time = time.time()

        # INTENTIONAL PROBLEM: Timeout (2%)
        if random.random() < Config.STATUS_API_TIMEOUT_RATE:
            time.sleep(5)
            raise TimeoutError(f"StatusAPI timeout checking {service_name}")

        time.sleep(random.randint(30, 150) / 1000)

        service = self.services.get(service_name.lower())

        if not service:
            return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

        return {
            "found": True,
            "service": service,
            "latency_ms": int((time.time() - start_time) * 1000)
        }


# Test it
test_sapi = StatusAPIInstrumented()
test_result = test_sapi.check_status("staging")
print(f"Found: {test_result['found']}")
if test_result['found']:
    print(f"   Status: {test_result['service']['status']}")

In [None]:
# =============================================================================
# SOLUTION: Task 4 - Instrument StatusAPI.check_status()
# =============================================================================

class StatusAPIInstrumented:
    """StatusAPI with OpenTelemetry instrumentation."""

    def __init__(self):
        self.services = {s["name"]: s for s in STATUS_DATA["services"]}

    def check_status(self, service_name: str) -> dict:
        with tracer.start_as_current_span("status_api.check_status") as span:
            span.set_attribute("status_api.service", service_name)

            start_time = time.time()

            # INTENTIONAL PROBLEM: Timeout (2%)
            if random.random() < Config.STATUS_API_TIMEOUT_RATE:
                span.set_attribute("error", True)
                span.set_attribute("error.type", "TimeoutError")
                time.sleep(5)
                raise TimeoutError(f"StatusAPI timeout checking {service_name}")

            time.sleep(random.randint(30, 150) / 1000)

            service = self.services.get(service_name.lower())

            if not service:
                span.set_attribute("status_api.found", False)
                return {"found": False, "latency_ms": int((time.time() - start_time) * 1000)}

            result = {
                "found": True,
                "service": service,
                "latency_ms": int((time.time() - start_time) * 1000)
            }

            span.set_attribute("status_api.found", True)
            span.set_attribute("status_api.status", service["status"])
            span.set_attribute("status_api.latency_ms", result["latency_ms"])

            return result


print("StatusAPIInstrumented defined (from solution)")

## Task 5: Instrument DevHubAgent.query()

**Goal:** Create the parent span that wraps all tool calls.

The agent's `query()` method should create a root span called `agent.query`. All the tool spans you created in Tasks 2-4 will automatically nest under this span, because OpenTelemetry's context propagation makes the currently active span the parent of any span started inside it.

**What to instrument:**
- `agent.query` - root span
- Attributes: `agent.query` (the user's question), `agent.tools_planned` (list of tools)

**Time:** ~5 minutes

In [None]:
# =============================================================================
# TASK 5: Instrument DevHubAgent.query()
# =============================================================================
# Create the parent span that wraps all tool calls.
#
# TIME: ~5 minutes
# =============================================================================

class DevHubAgentInstrumented:
    """DevHub agent with OpenTelemetry instrumentation."""

    TOOL_PLANNING_PROMPT = DevHubAgent.TOOL_PLANNING_PROMPT
    RESPONSE_PROMPT = DevHubAgent.RESPONSE_PROMPT

    def __init__(self):
        # Use instrumented services
        self.vector_db = VectorDBInstrumented()
        self.team_db = TeamDBInstrumented()
        self.status_api = StatusAPIInstrumented()
        self.client = OpenAI()

    def _plan_tools(self, query: str) -> list:
        response = self.client.chat.completions.create(
            model=Config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are a tool planning assistant. Respond only with valid JSON."},
                {"role": "user", "content": self.TOOL_PLANNING_PROMPT.format(query=query)}
            ],
            temperature=0.1,
            max_tokens=256
        )
        content = response.choices[0].message.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
        try:
            return json.loads(content.strip())
        except json.JSONDecodeError:  # planner returned non-JSON text; plan no tools
            return []

    def _execute_tool(self, tool_name: str, args: dict) -> dict:
        result = {"tool": tool_name, "success": False, "data": None, "error": None}
        try:
            if tool_name == "search_docs":
                result["data"] = self.vector_db.search(args.get("query", ""))
                result["success"] = True
            elif tool_name == "find_owner":
                result["data"] = self.team_db.find_owner(args.get("service", ""))
                result["success"] = True
            elif tool_name == "check_status":
                result["data"] = self.status_api.check_status(args.get("service", ""))
                result["success"] = True
        except Exception as e:
            result["error"] = str(e)
        return result

    def _generate_response(self, query: str, tool_results: list) -> str:
        response = self.client.chat.completions.create(
            model=Config.LLM_MODEL,
            messages=[
                {"role": "system", "content": "You are DevHub, a helpful developer assistant."},
                {"role": "user", "content": self.RESPONSE_PROMPT.format(query=query, results=json.dumps(tool_results, indent=2))}
            ],
            temperature=0.3,
            max_tokens=1024
        )
        return response.choices[0].message.content

    def query(self, user_query: str) -> dict:
        # ─────────────────────────────────────────────────────────────────────
        # PUT YOUR CODE HERE: Wrap everything in an "agent.query" span
        # ─────────────────────────────────────────────────────────────────────

        planned_tools = self._plan_tools(user_query)
        tool_results = [self._execute_tool(t["tool"], t.get("args", {})) for t in planned_tools]
        response = self._generate_response(user_query, tool_results)
        
        return {
            "response": response,
            "tools_called": [t["tool"] for t in planned_tools],
            "tool_results": tool_results
        }


print("DevHubAgentInstrumented defined")

In [None]:
# =============================================================================
# SOLUTION: Task 5 - Instrument DevHubAgent.query()
# =============================================================================

class DevHubAgentInstrumented:
    """DevHub agent with OpenTelemetry instrumentation."""

    TOOL_PLANNING_PROMPT = DevHubAgent.TOOL_PLANNING_PROMPT
    RESPONSE_PROMPT = DevHubAgent.RESPONSE_PROMPT

    def __init__(self):
        self.vector_db = VectorDBInstrumented()
        self.team_db = TeamDBInstrumented()
        self.status_api = StatusAPIInstrumented()
        self.client = OpenAI()

    def _plan_tools(self, query: str) -> list:
        with tracer.start_as_current_span("_plan_tools") as span:
            response = self.client.chat.completions.create(
                model=Config.LLM_MODEL,
                messages=[
                    {"role": "system", "content": "You are a tool planning assistant. Respond only with valid JSON."},
                    {"role": "user", "content": self.TOOL_PLANNING_PROMPT.format(query=query)}
                ],
                temperature=0.1,
                max_tokens=256
            )
            content = response.choices[0].message.content.strip()
            if content.startswith("```"):
                content = content.split("```")[1]
                if content.startswith("json"):
                    content = content[4:]
            try:
                return json.loads(content.strip())
            except json.JSONDecodeError:  # planner returned non-JSON text; plan no tools
                return []

    def _execute_tool(self, tool_name: str, args: dict) -> dict:
        with tracer.start_as_current_span(f"_execute_tool.{tool_name}") as span:
            span.set_attribute("tool.name", tool_name)
            result = {"tool": tool_name, "success": False, "data": None, "error": None}
            try:
                if tool_name == "search_docs":
                    result["data"] = self.vector_db.search(args.get("query", ""))
                    result["success"] = True
                elif tool_name == "find_owner":
                    result["data"] = self.team_db.find_owner(args.get("service", ""))
                    result["success"] = True
                elif tool_name == "check_status":
                    result["data"] = self.status_api.check_status(args.get("service", ""))
                    result["success"] = True
            except Exception as e:
                result["error"] = str(e)
                span.set_attribute("error", True)
                span.record_exception(e)  # keep the exception type and message on this span
            return result

    def _generate_response(self, query: str, tool_results: list) -> str:
        with tracer.start_as_current_span("_generate_response") as span:
            response = self.client.chat.completions.create(
                model=Config.LLM_MODEL,
                messages=[
                    {"role": "system", "content": "You are DevHub, a helpful developer assistant."},
                    {"role": "user", "content": self.RESPONSE_PROMPT.format(query=query, results=json.dumps(tool_results, indent=2))}
                ],
                temperature=0.3,
                max_tokens=1024
            )
            return response.choices[0].message.content

    def query(self, user_query: str) -> dict:
        with tracer.start_as_current_span("agent.query") as span:
            span.set_attribute("agent.query", user_query[:200])

            planned_tools = self._plan_tools(user_query)
            span.set_attribute("agent.tools_planned", str([t["tool"] for t in planned_tools]))

            tool_results = [self._execute_tool(t["tool"], t.get("args", {})) for t in planned_tools]
            response = self._generate_response(user_query, tool_results)

            return {
                "response": response,
                "tools_called": [t["tool"] for t in planned_tools],
                "tool_results": tool_results
            }


print("DevHubAgentInstrumented defined (from solution)")

In [None]:
# =============================================================================
# RUN INSTRUMENTED DEVHUB
# =============================================================================
# Now let's run some queries and generate traces!

print("Creating instrumented DevHub agent...")

# Create agent with instrumented services
instrumented_agent = DevHubAgentInstrumented()

print("Instrumented agent created!")
print("\nRunning test queries to generate traces...\n")

# Run several queries
test_queries = [
    "How do I authenticate with the Payments API?",
    "Who owns vector search?",
    "Is staging working?",
]

for q in test_queries:
    print(f"Query: {q}")
    try:
        result = instrumented_agent.query(q)
        print(f"  Tools: {result['tools_called']}")
        print(f"  Answer: {result['response'][:100]}...\n")
    except Exception as e:
        print(f"  Error: {e}\n")

# Force flush to send all traces
provider.force_flush()

print("=" * 60)
print("Traces sent to Jaeger!")
print(f"   View at: {JAEGER_UI}")
print(f"   Service: devhub-{STUDENT_NAME}")
print("=" * 60)

---

## Lab 1 Verification Checklist

Go to Jaeger and verify your instrumentation:

### 1. Find Your Traces
- [ ] Open Jaeger UI
- [ ] Select service: `devhub-{STUDENT_NAME}`
- [ ] Click "Find Traces"
- [ ] See at least 3 traces from your test queries

### 2. Verify Span Structure
- [ ] Root span: `agent.query`
- [ ] Child spans for tool planning and execution
- [ ] Nested spans for `vector_db.search`, `team_db.find_owner`, etc.

### 3. Verify Attributes
- [ ] `vector_db.search` has: `db.system`, `vector.query`, `vector.latency_ms`
- [ ] `team_db.find_owner` has: `db.system`, `owner.name`, `owner.is_active`
- [ ] `status_api.check_status` has: `status_api.service`, `status_api.status`

### 4. Look for Issues
- [ ] Find a slow query (`vector.latency_ms` > 2000)
- [ ] Find an inactive owner (`owner.is_active` = false)
- [ ] Find a low similarity result (`vector.top_distance` > 0.5)

**If you see all these, you've successfully instrumented DevHub!**
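
If you prefer to check attributes mechanically, the checklist can be written down as data. This is a minimal sketch that assumes you copy a span's attributes out of the Jaeger UI into a plain dict by hand; `REQUIRED` and `missing_attributes` are hypothetical helpers, and the attribute names follow the Task 3 and Task 4 solutions.

```python
# Attributes each span is expected to carry (from the Task 3 and Task 4 solutions).
REQUIRED = {
    "team_db.find_owner": {"db.system", "team_db.service", "team_db.found"},
    "status_api.check_status": {"status_api.service", "status_api.found"},
}


def missing_attributes(span_name, attributes):
    """Return the sorted list of required attribute keys absent from a span."""
    return sorted(REQUIRED.get(span_name, set()) - set(attributes))


# Attributes copied by hand from one span in the Jaeger UI (example values).
observed = {"db.system": "in_memory", "team_db.service": "billing"}
print(missing_attributes("team_db.find_owner", observed))
```

An empty list means the span carries everything the checklist asks for; any names printed are the attributes still to add.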

In [None]:
# =============================================================================
# HOW TO VERIFY YOUR TRACES
# =============================================================================

print(f"""
VERIFICATION STEPS:

1. Open Jaeger:
   {JAEGER_UI}

   Credentials: workshop / salesforce2025

2. Select Service:
   devhub-{STUDENT_NAME}

3. Click "Find Traces"

4. Click on any trace to see the span waterfall

5. Click on individual spans to see attributes

WHAT TO LOOK FOR:

✓ Root span "agent.query" containing all other spans
✓ Child spans for each tool (_execute_tool)
✓ Nested spans for actual tool implementations
✓ Attributes on each span (db.system, latency_ms, etc.)

IF SOMETHING IS MISSING:

- Check that your code has tracer.start_as_current_span()
- Check that you're setting attributes with span.set_attribute()
- Make sure provider.force_flush() was called
- Wait 10-15 seconds and refresh Jaeger

""")


---

# Lab 2: Debug Production Scenarios Using Traces

Now that DevHub is instrumented, let's use traces to debug real issues!

**Duration:** ~30 minutes

**What you'll do:**
1. Scenario 1: The Slow Query
2. Scenario 2: The Wrong Owner
3. Scenario 3: Poor Retrieval Quality

For each scenario:
1. **Reproduce** the problem by running queries
2. **Find** the relevant trace in Jaeger
3. **Analyze** the spans and attributes
4. **Identify** the root cause

## How to Use Jaeger for Debugging

### Step-by-Step Navigation:

1. **Open Jaeger UI** at the provided URL
2. **Select your service** from the dropdown: `devhub-{STUDENT_NAME}`
3. **Set time range** to "Last Hour" or appropriate window
4. **Click "Find Traces"** to list all traces
5. **Click on a trace** to see the span waterfall
6. **Click on individual spans** to see attributes

### What to Look For:

| Issue Type | Where to Look | What to Check |
|------------|---------------|---------------|
| Slow requests | Span duration bars | Which span takes longest? |
| Stale data | `team_db.find_owner` span | `owner.is_active` attribute |
| Poor retrieval | `vector_db.search` span | `vector.top_distance` attribute |
| Errors | Any span with red color | Error message in attributes |

### Pro Tips:
- Sort traces by **duration** to find slowest first
- Use **Compare** feature to see differences between traces
- Check **Logs** tab for any events within spans
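
The table above can also be read as a small set of rules. The sketch below applies them to spans represented as plain dicts; that shape (`name`, `duration_ms`, `attributes`) is an assumption made for illustration, not Jaeger's export format, and `triage` is a hypothetical helper. The thresholds (2000 ms, distance 0.5) are the ones used in this lab.

```python
def triage(span):
    """Apply the 'What to Look For' rules to one span given as a plain dict."""
    attrs = span.get("attributes", {})
    findings = []
    if attrs.get("error") is True:
        findings.append("error")
    if span["name"] == "team_db.find_owner" and attrs.get("owner.is_active") is False:
        findings.append("stale data")
    if span["name"] == "vector_db.search" and attrs.get("vector.top_distance", 0.0) > 0.5:
        findings.append("poor retrieval")
    if span.get("duration_ms", 0) > 2000:
        findings.append("slow")
    return findings


# Illustrative spans; the values are made up for this example.
spans = [
    {"name": "vector_db.search", "duration_ms": 3050,
     "attributes": {"vector.top_distance": 0.23, "vector.latency_ms": 3048}},
    {"name": "team_db.find_owner", "duration_ms": 64,
     "attributes": {"owner.name": "David Kim", "owner.is_active": False}},
]
for s in spans:
    print(s["name"], triage(s))
```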

## Jaeger UI Overview

```mermaid
graph TB
    subgraph JaegerUI["Jaeger UI Layout"]
        subgraph Header["🔍 Search Panel"]
            Service["Service Selector<br/>devhub-student-name"]
            Time["Time Range<br/>Last Hour"]
            Search["Find Traces Button"]
        end
        
        subgraph TraceList["📋 Trace List"]
            T1["Trace 1: agent.query - 5200ms"]
            T2["Trace 2: agent.query - 180ms"]
            T3["Trace 3: agent.query - 3400ms"]
        end
        
        subgraph TraceDetail["📊 Trace Detail (Waterfall)"]
            Root["agent.query ████████████████████"]
            Child1["├─ _plan_tools ███"]
            Child2["├─ vector_db.search █████████████"]
            Child3["├─ team_db.find_owner ██"]
            Child4["└─ _generate_response ████"]
        end
        
        subgraph Attributes["🏷️ Span Attributes"]
            Attr1["db.system: chromadb"]
            Attr2["vector.latency_ms: 3100"]
            Attr3["vector.top_distance: 0.23"]
        end
    end
    
    Header --> TraceList
    TraceList --> TraceDetail
    TraceDetail --> Attributes
```

**Key Areas:**
1. **Search Panel** - Filter traces by service, time, tags
2. **Trace List** - All matching traces, sortable by duration
3. **Waterfall View** - Visual timeline of spans
4. **Attributes Panel** - Metadata for selected span

---

## Scenario 1: The Slow Query

### The Problem

Users report: **"DevHub sometimes takes forever to respond - like 3+ seconds!"**

Your task:
1. Reproduce the slow query issue
2. Find the slow trace in Jaeger
3. Identify which component is causing the slowness
4. Determine why it's slow (check attributes)

### What You're Looking For:
- A trace that takes significantly longer than others
- Which span within that trace is the bottleneck
- The `latency_ms` attribute that confirms the slowness

In [None]:
# =============================================================================
# SCENARIO 1: Reproduce the Slow Query
# =============================================================================
# Run this query multiple times until you get a slow one.
# The 10% slow query rate means ~1 in 10 will be slow.

import time

print("Running queries to reproduce slow query issue...")
print("(Run this cell multiple times if needed)\n")

for i in range(5):
    print(f"Query {i+1}:")
    start = time.time()
    
    try:
        result = instrumented_agent.query("How do I authenticate with the Payments API?")
        elapsed = time.time() - start
        
        # Flag slow queries
        if elapsed > 2.0:
            print(f"  ⚠️  SLOW! Took {elapsed:.2f}s - Check this trace in Jaeger!")
        else:
            print(f"  ✓ Normal: {elapsed:.2f}s")
            
    except Exception as e:
        print(f"  ✗ Error: {e}")
    
    print()

# Flush traces
provider.force_flush()
print(f"\n📊 Check Jaeger for slow traces: {JAEGER_UI}")

### Scenario 1: Analysis Worksheet

Go to Jaeger and find the slow trace. Then answer these questions:

| Question | Your Answer |
|----------|-------------|
| What was the total trace duration? | ___ ms |
| Which span took the longest? | ___________ |
| How long did that span take? | ___ ms |
| What is the `latency_ms` attribute value? | ___ ms |
| What component caused the slowness? | ___________ |

**Hint:** Look for the span with the longest duration bar in the waterfall view.
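
One caveat with "longest bar": the root span is always the longest, because it contains everything else. What you usually want is the span with the largest self-time, meaning its own duration minus the time spent in its children. The sketch below computes that for a hand-written trace. The durations are invented for illustration, `self_times` is a hypothetical helper, and it assumes span names are unique within the trace and that children run one after another.

```python
def self_times(spans):
    """Map span name -> time spent in the span itself, excluding its children."""
    child_total = {}
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] = child_total.get(s["parent"], 0) + s["duration_ms"]
    return {s["name"]: s["duration_ms"] - child_total.get(s["name"], 0) for s in spans}


# Illustrative trace; durations are made up for this example.
trace = [
    {"name": "agent.query", "parent": None, "duration_ms": 5200},
    {"name": "_plan_tools", "parent": "agent.query", "duration_ms": 900},
    {"name": "_execute_tool.search_docs", "parent": "agent.query", "duration_ms": 3110},
    {"name": "vector_db.search", "parent": "_execute_tool.search_docs", "duration_ms": 3100},
    {"name": "_generate_response", "parent": "agent.query", "duration_ms": 1150},
]
times = self_times(trace)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

In the waterfall view this corresponds to the widest bar that has no equally wide bar nested beneath it.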

In [None]:
# =============================================================================
# SCENARIO 1: Your Analysis
# =============================================================================
# Record your findings from Jaeger here.

# YOUR ANSWERS:
total_trace_duration = None  # PUT YOUR ANSWER HERE (in ms)
slowest_span_name = None     # PUT YOUR ANSWER HERE (e.g., "vector_db.search")
slowest_span_duration = None # PUT YOUR ANSWER HERE (in ms)
latency_ms_attribute = None  # PUT YOUR ANSWER HERE (from span attributes)
root_cause = None            # PUT YOUR ANSWER HERE (e.g., "VectorDB slow query simulation")

# Verification
print("Your Scenario 1 Analysis:")
print(f"  Total trace duration: {total_trace_duration} ms")
print(f"  Slowest span: {slowest_span_name}")
print(f"  Slowest span duration: {slowest_span_duration} ms")
print(f"  latency_ms attribute: {latency_ms_attribute} ms")
print(f"  Root cause: {root_cause}")

### Scenario 1: Solution

<details>
<summary>Click to reveal solution</summary>

**Root Cause:** VectorDB slow query simulation

**What you should have found:**
- The `vector_db.search` span took ~3000ms (3 seconds)
- The `vector.latency_ms` attribute shows ~3000
- This matches our configured `VECTOR_DB_SLOW_QUERY_LATENCY = 3000`

**Why this happens:**
- 10% of VectorDB queries are intentionally slowed (`VECTOR_DB_SLOW_QUERY_RATE = 0.10`)
- The code sleeps for 3 seconds to simulate a slow query
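
This also explains why one run of the reproduction cell may not show the problem. As a rough check, assuming each agent query triggers exactly one VectorDB search and the slow-query draws are independent, the chance of seeing at least one slow query in a batch is:

```python
def p_at_least_one(rate, runs):
    """Probability that at least one of `runs` independent queries hits the fault."""
    return 1 - (1 - rate) ** runs


# 0.10 is VECTOR_DB_SLOW_QUERY_RATE from the workshop config.
for runs in (5, 10, 20):
    print(runs, round(p_at_least_one(0.10, runs), 3))
```

With five queries per run the odds are roughly 41%, so re-running the cell two or three times is normal.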

**In production, this could indicate:**
- Database index issues
- Large table scans
- Network latency to vector database
- Insufficient database resources

**How tracing helped:**
Without tracing, you only knew "it's slow sometimes." With tracing, you can see EXACTLY which component is slow and by how much.

</details>

---

## Scenario 2: The Wrong Owner

### The Problem

Users report: **"I asked who owns vector search and got David Kim, but he left the company 6 months ago!"**

Your task:
1. Reproduce the stale owner issue
2. Find the trace in Jaeger
3. Check the `owner.is_active` attribute
4. Understand why stale data was returned

### What You're Looking For:
- A trace where `find_owner` was called
- The `owner.is_active` attribute showing `false`
- The owner name showing someone who should no longer be returned

In [None]:
# =============================================================================
# SCENARIO 2: Reproduce the Wrong Owner Issue
# =============================================================================
# Run this query multiple times to trigger the stale data scenario.
# The 10% stale data rate means ~1 in 10 will return an inactive owner.

print("Running queries to reproduce stale owner issue...")
print("(Run this cell multiple times if needed)\n")

for i in range(5):
    print(f"Query {i+1}: 'Who owns vector search?'")
    
    try:
        result = instrumented_agent.query("Who owns vector search?")
        
        # Check if David Kim (inactive) was mentioned
        if "David Kim" in result['response']:
            print(f"  ⚠️  STALE DATA! Got David Kim (who left the company)")
            print(f"  Check this trace in Jaeger for owner.is_active=false")
        elif "Emily Johnson" in result['response']:
            print(f"  ✓ Correct: Got Emily Johnson (active owner)")
        else:
            print(f"  ? Response: {result['response'][:100]}...")
            
    except Exception as e:
        print(f"  ✗ Error: {e}")
    
    print()

# Flush traces
provider.force_flush()
print(f"\n📊 Check Jaeger for stale data traces: {JAEGER_UI}")

### Scenario 2: Analysis Worksheet

Go to Jaeger and find a trace where the wrong owner was returned. Answer these questions:

| Question | Your Answer |
|----------|-------------|
| Which span contains owner information? | ___________ |
| What is the `owner.name` attribute? | ___________ |
| What is the `owner.is_active` attribute? | ___________ |
| What is the `owner.email` attribute? | ___________ (write "not set" if your span lacks it; the Task 3 solution does not set `owner.email`) |
| Should this owner have been returned? | Yes / No |

**Hint:** Look for the `team_db.find_owner` span and examine its attributes.

In [None]:
# =============================================================================
# SCENARIO 2: Your Analysis
# =============================================================================
# Record your findings from Jaeger here.

# YOUR ANSWERS:
span_with_owner_info = None   # PUT YOUR ANSWER HERE (e.g., "team_db.find_owner")
owner_name_attribute = None   # PUT YOUR ANSWER HERE (e.g., "David Kim")
owner_is_active = None        # PUT YOUR ANSWER HERE (True or False)
owner_email = None            # PUT YOUR ANSWER HERE
should_be_returned = None     # PUT YOUR ANSWER HERE ("Yes" or "No")

# Verification
print("Your Scenario 2 Analysis:")
print(f"  Span with owner info: {span_with_owner_info}")
print(f"  owner.name: {owner_name_attribute}")
print(f"  owner.is_active: {owner_is_active}")
print(f"  owner.email: {owner_email}")
print(f"  Should this owner be returned? {should_be_returned}")

### Scenario 2: Solution

<details>
<summary>Click to reveal solution</summary>

**Root Cause:** TeamDB stale data simulation

**What you should have found:**
- The `team_db.find_owner` span has `owner.is_active = false`
- The `owner.name` attribute shows "David Kim"
- David Kim left the company but is still in the database

**Why this happens:**
- 10% of TeamDB lookups return inactive owners (`TEAM_DB_STALE_DATA_RATE = 0.10`)
- The code preferentially returns inactive owners when triggered

**In production, this could indicate:**
- Employee data not synced with HR system
- Cache not invalidated after employee departure
- Missing data validation in the application

**How tracing helped:**
Without tracing, you'd only see the wrong name in the response. With tracing, you can see the `is_active=false` attribute, proving the database returned stale data (not an LLM hallucination).

**The fix:**
Add validation to filter out inactive owners, or fix the data sync issue.
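
A minimal sketch of the validation option, assuming owner records are dicts with an `is_active` flag as in the workshop data. `pick_active_owner` is a hypothetical helper, not part of DevHub.

```python
def pick_active_owner(matching_owners):
    """Return the first active owner, or None when every match is inactive."""
    for owner in matching_owners:
        if owner.get("is_active"):
            return owner
    return None


owners = [
    {"name": "David Kim", "is_active": False},
    {"name": "Emily Johnson", "is_active": True},
]
print(pick_active_owner(owners))
```

Returning `None` when only inactive owners match lets the caller say "no current owner found" instead of naming someone who has left.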

</details>

---

## Scenario 3: Poor Retrieval Quality

### The Problem

Users report: **"The answers seem wrong or irrelevant sometimes. Like it's pulling the wrong documentation."**

Your task:
1. Reproduce the poor retrieval issue
2. Find the trace in Jaeger
3. Check the `vector.top_distance` attribute
4. Understand what a high distance means

### What You're Looking For:
- A trace where `vector_db.search` was called
- The `vector.top_distance` attribute showing a value > 0.5
- High distance = low similarity = poor retrieval quality

In [None]:
# =============================================================================
# SCENARIO 3: Reproduce Poor Retrieval Quality
# =============================================================================
# Run documentation queries to trigger low similarity results.
# The 15% low similarity rate means ~1 in 7 will have poor retrieval.

print("Running queries to reproduce poor retrieval issue...")
print("(Run this cell multiple times if needed)\n")

queries = [
    "How do I handle errors in my API?",
    "What are the database connection settings?",
    "How do rate limits work?",
]

for i, q in enumerate(queries):
    print(f"Query {i+1}: '{q}'")
    
    try:
        result = instrumented_agent.query(q)
        print(f"  ✓ Got response (check Jaeger for similarity scores)")
        
    except Exception as e:
        print(f"  ✗ Error: {e}")
    
    print()

# Flush traces
provider.force_flush()

print(f"""
📊 Check Jaeger for poor retrieval:
   {JAEGER_UI}

Look for vector_db.search spans where:
   vector.top_distance > 0.5

Normal distance: 0.1 - 0.3 (good match)
High distance: > 0.5 (poor match - results may be irrelevant)
""")

### Scenario 3: Analysis Worksheet

Go to Jaeger and find a trace with poor retrieval quality. Answer these questions:

| Question | Your Answer |
|----------|-------------|
| Which span shows similarity scores? | ___________ |
| What is the `vector.top_distance` value? | ___________ |
| Is this value good (< 0.3) or bad (> 0.5)? | ___________ |
| What was the search query? | ___________ |
| How many results were returned? | ___________ |

**Hint:** In vector search, **lower distance = higher similarity**. A distance > 0.5 indicates poor match quality.

In [None]:
# =============================================================================
# SCENARIO 3: Your Analysis
# =============================================================================
# Record your findings from Jaeger here.

# YOUR ANSWERS:
span_with_similarity = None    # PUT YOUR ANSWER HERE (e.g., "vector_db.search")
top_distance_value = None      # PUT YOUR ANSWER HERE (e.g., 0.67)
quality_assessment = None      # PUT YOUR ANSWER HERE ("good" or "bad")
search_query = None            # PUT YOUR ANSWER HERE
results_count = None           # PUT YOUR ANSWER HERE

# Verification
print("Your Scenario 3 Analysis:")
print(f"  Span with similarity: {span_with_similarity}")
print(f"  vector.top_distance: {top_distance_value}")
print(f"  Quality assessment: {quality_assessment}")
print(f"  Search query: {search_query}")
print(f"  Results count: {results_count}")

### Scenario 3: Solution

<details>
<summary>Click to reveal solution</summary>

**Root Cause:** VectorDB low similarity simulation

**What you should have found:**
- The `vector_db.search` span has `vector.top_distance > 0.5`
- Normal queries show distance of 0.1-0.3
- The artificially inflated distance indicates poor retrieval

**Why this happens:**
- 15% of VectorDB queries artificially inflate distances (`VECTOR_DB_LOW_SIMILARITY_RATE = 0.15`)
- The code adds 0.5 to all distances, making results appear irrelevant
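
The 15% rate also explains why you may need to run the reproduction cell more than once. The sketch below computes the chance that at least one query in a batch hits the simulation, assuming each query is affected independently (an assumption about how the simulation samples). With three queries, that chance is roughly 39%.

```python
LOW_SIMILARITY_RATE = 0.15  # mirrors VECTOR_DB_LOW_SIMILARITY_RATE


def chance_of_at_least_one(rate: float, n_queries: int) -> float:
    """Probability that at least one of n independent queries is affected."""
    return 1 - (1 - rate) ** n_queries


for n in (3, 10, 20):
    print(f"{n} queries: {chance_of_at_least_one(LOW_SIMILARITY_RATE, n):.0%}")
```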

**In production, this could indicate:**
- Poor embedding quality
- Query not matching indexed content style
- Stale or corrupted vector index
- Wrong embedding model version

**How tracing helped:**
Without tracing, users just say "the answer is wrong." With tracing, you can see that the top distance was high (0.67 vs. a normal 0.2), which means similarity was low and shows that the retrieval layer returned poor matches.

**The fix:**
- Monitor `vector.top_distance` and alert when it stays high (low similarity)
- Re-index with better embeddings
- Add a fallback for low-confidence results
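
The monitoring and fallback ideas can be sketched in plain Python. Everything here is hypothetical and not part of DevHub: the `DistanceMonitor` class, the `confident_results` helper, the window size, and the alert fraction are illustrative choices. Only the 0.5 distance threshold comes from this scenario.

```python
from collections import deque
from typing import Optional


class DistanceMonitor:
    """Track recent top distances and flag when too many are poor matches."""

    def __init__(self, window: int = 20, bad_distance: float = 0.5,
                 max_bad_fraction: float = 0.3) -> None:
        self.recent = deque(maxlen=window)   # only the last `window` values
        self.bad_distance = bad_distance
        self.max_bad_fraction = max_bad_fraction

    def record(self, top_distance: float) -> bool:
        """Store one observation; return True when the alert should fire."""
        self.recent.append(top_distance)
        bad = sum(d > self.bad_distance for d in self.recent)
        return bad / len(self.recent) > self.max_bad_fraction


def confident_results(results: list, max_distance: float = 0.5) -> Optional[list]:
    """Keep results at or below max_distance; None signals 'use the fallback'."""
    kept = [r for r in results if r["distance"] <= max_distance]
    return kept or None


monitor = DistanceMonitor(window=4, max_bad_fraction=0.5)
for distance in (0.2, 0.67, 0.7):
    print(distance, "alert" if monitor.record(distance) else "ok")

hits = [{"doc": "rate-limits.md", "distance": 0.67}]  # hypothetical result
print(confident_results(hits) or "No confident match; ask the user to rephrase.")
```

A rolling window avoids paging someone for a single bad query while still catching a sustained drop in retrieval quality.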

</details>

---

# Session 1: Wrap-Up

## What You Learned Today

### 1. The 5-Layer AI Architecture Framework
You can now map any AI application to five layers:
- **Application** → User interface
- **Gateway** → Validation and auth
- **Orchestration** → Agent and tool selection
- **LLM** → AI model calls
- **Data** → Databases and APIs

### 2. Distributed Tracing Concepts
You understand:
- **Traces** = Complete request journey
- **Spans** = Individual operations
- **Attributes** = Contextual metadata
- **Context propagation** = How traces stay connected

### 3. OpenTelemetry Instrumentation
You can:
- Set up a tracer provider
- Create spans with `start_as_current_span()`
- Add attributes with `set_attribute()`
- Export traces to Jaeger

### 4. Debugging with Traces
You diagnosed real issues:
- Slow queries (VectorDB latency)
- Stale data (inactive owners)
- Poor retrieval (low similarity scores)

## Before vs After: The Impact of Observability

### Before Tracing
```
User: "DevHub is slow"
You:  "Let me check the logs..."
Logs: "Processing... Done."
You:  "I have no idea what's wrong"
Time: 4 hours of guessing
```

### After Tracing
```
User: "DevHub is slow"
You:  "Let me check Jaeger..."
Trace: agent.query took 5200ms
       └── vector_db.search took 3100ms
           └── vector.latency_ms = 3100
You:  "The vector database query is slow. It's hitting our
       10% slow query simulation. In production, I'd check
       the ChromaDB indexes and query complexity."
Time: 2 minutes of analysis
```

**From 4 hours to 2 minutes. That's the power of observability.**

## Visual Comparison

```mermaid
graph LR
    subgraph Before["❌ BEFORE: No Tracing"]
        B1["User: It's slow"]
        B2["You: Check logs..."]
        B3["Logs: Processing...Done"]
        B4["You: ??? 🤷"]
        B5["4 HOURS debugging"]
        B1 --> B2 --> B3 --> B4 --> B5
    end
    
    subgraph After["✅ AFTER: With Tracing"]
        A1["User: It's slow"]
        A2["You: Check Jaeger"]
        A3["Trace: vector_db 3100ms"]
        A4["You: Found it! 🎯"]
        A5["2 MINUTES to diagnose"]
        A1 --> A2 --> A3 --> A4 --> A5
    end
```

| Metric | Before | After |
|--------|--------|-------|
| Debug time | 4 hours | 2 minutes |
| Root cause found? | Maybe | Much more likely |
| Data to prove it | None | Trace + Attributes |
| Confidence level | Low | High |

![Before After Observability](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_01/charts/06_before_after_observability.svg)


## 5 Key Takeaways

### 1. Map Before You Debug
Use the 5-layer framework to understand WHERE in your system a problem might be before diving into code.

### 2. Attributes Are Everything
The span name tells you WHAT happened. Attributes tell you WHY. Always capture relevant metadata.

### 3. Think in Hierarchies
Structure your traces as parent-child relationships. This shows causality and helps identify which component triggered an issue.
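
One way to use the hierarchy is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below uses the `agent.query` and `vector_db.search` timings from the After Tracing example. The `llm.generate` span and its 1900 ms duration are hypothetical, added only to make the tree more realistic, and the calculation assumes children run one after another rather than concurrently.

```python
from collections import defaultdict

# Flattened trace: each span points at its parent by id.
spans = [
    {"id": "a", "parent": None, "name": "agent.query", "duration_ms": 5200},
    {"id": "b", "parent": "a", "name": "vector_db.search", "duration_ms": 3100},
    {"id": "c", "parent": "a", "name": "llm.generate", "duration_ms": 1900},  # hypothetical
]


def self_times(spans):
    """Return {span name: time not accounted for by its direct children}."""
    child_total = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["duration_ms"]
    return {s["name"]: s["duration_ms"] - child_total[s["id"]] for s in spans}


times = self_times(spans)
bottleneck = max(times, key=times.get)
print(times)
print("Largest self time:", bottleneck)
```

The root span has the longest total duration, but most of that time belongs to its children, which is why self time points at the component to investigate.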

### 4. Instrument Early
Don't wait for production problems. Add tracing during development so you're ready when issues arise.

### 5. Use Semantic Conventions
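Follow OpenTelemetry semantic conventions (`db.system`, `http.method`, etc.) for consistent, interoperable traces. Attribute names change between versions of the conventions (for example, `http.method` became `http.request.method`), so check the version your SDK targets.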

## Take-Home Exercise

**Design the architecture for your own AI system.**

Think about an AI application you work on or want to build:

1. **Draw the 5 layers** - What components fit in each layer?

2. **Identify potential failure modes** - What could go wrong at each layer?

3. **Design your spans** - What would you name them? What attributes would you capture?

4. **Plan your debugging workflow** - If something goes wrong, what trace would help you find it?

### Example template:

```
My Application: _______________

Layer 1 (Application):
- Component: _______________
- Potential failures: _______________
- Key spans/attributes: _______________

Layer 2 (Gateway):
- Component: _______________
- Potential failures: _______________
- Key spans/attributes: _______________

[Continue for all 5 layers]
```
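
If you prefer to keep your design next to your code, the same template can be captured as data. This is only a sketch: `LayerPlan` and `missing_layers` are hypothetical names, and the failure modes and the `query.text` attribute are example entries rather than DevHub's real configuration.

```python
from dataclasses import dataclass, field

LAYERS = ["Application", "Gateway", "Orchestration", "LLM", "Data"]


@dataclass
class LayerPlan:
    layer: str
    component: str
    failure_modes: list
    spans: dict = field(default_factory=dict)  # span name -> attribute keys


def missing_layers(plans):
    """List the framework layers that have no planned spans yet."""
    covered = {p.layer for p in plans if p.spans}
    return [layer for layer in LAYERS if layer not in covered]


plans = [
    LayerPlan("Orchestration", "DevHub agent", ["wrong tool selected"],
              {"agent.query": ["query.text"]}),
    LayerPlan("Data", "Vector database", ["slow search", "poor matches"],
              {"vector_db.search": ["vector.top_distance", "vector.latency_ms"]}),
]
print("Still to design:", missing_layers(plans))
```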

Bring your design to Session 2 for discussion!

---

## Coming Up: Session 2 - Testing AI Applications

In Session 2, we'll learn how to **test AI applications** to prevent issues before they reach production.

### Topics:
- Why traditional testing doesn't work for AI
- Introduction to DeepEval for AI testing
- Testing retrieval quality
- Testing response accuracy
- Building a test suite for DevHub

### You'll Build:
- Automated tests for DevHub
- Quality metrics dashboard
- CI/CD integration for AI testing

---

## Congratulations!

You've completed **Session 1: Observability in AI Applications!**

**Skills gained:**
- Map AI systems to the 5-layer framework
- Instrument code with OpenTelemetry
- Debug using distributed traces
- Identify bottlenecks and data issues

See you in Session 2!

---