# 10: Observability Dashboard 🔭

This notebook serves as the **"Flight Recorder"** for the SalesOps Agent Suite (Day 9).

To win the Capstone, we must prove that our agent is **Deterministic, Measurable, and Auditable**. This dashboard visualizes the JSONL telemetry logs generated by the `observability` package.

### 🎯 Goals
1.  **Audit Runs:** See a history of all Coordinator executions (Success/Failure).
2.  **Visualize Traces:** View a Gantt chart of the agent workflow (Ingest → Detect → Explain → Act).
3.  **Analyze Performance:** Track LLM latency and estimated token costs.
4.  **Verify Actions:** Confirm that downstream actions (Jira/Email) were executed correctly.

### 🏗️ Components Used
* `observability.collector.LogCollector`: Aggregates logs from `outputs/observability/`.
* `plotly`: Interactive charts for Traces and Metrics.
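
The aggregation step can be pictured roughly as follows. This is a minimal sketch, not the actual `LogCollector` implementation: `load_jsonl_records` is a hypothetical helper, and it assumes each `*.jsonl` file holds one JSON object per line.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl_records(directory):
    """Read every *.jsonl file under `directory` into a list of parsed records.

    Hypothetical helper for illustration; the real LogCollector may differ.
    Blank lines and lines that are not valid JSON are skipped.
    """
    records = []
    for path in sorted(Path(directory).glob("*.jsonl")):
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    # A partially written line should not break the dashboard.
                    continue
    return records


# Demo with a throwaway directory and made-up records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "runs.jsonl"
    sample.write_text(
        '{"run_id": "demo_1", "status": "completed"}\n'
        "\n"
        "not json\n"
        '{"run_id": "demo_2", "status": "failed"}\n',
        encoding="utf-8",
    )
    demo_records = load_jsonl_records(tmp)

print([record["run_id"] for record in demo_records])
```

Skipping malformed lines rather than raising is a design choice for a read-only dashboard; a stricter collector might instead count and report them.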

## 1) Imports

In [7]:
import sys
import os
import json
import pandas as pd
import plotly.express as px
from IPython.display import display, Markdown

# Add project root to path. Notebooks do not define __file__, so resolve the
# parent directory relative to the current working directory instead.
sys.path.append(os.path.abspath(".."))

from observability.collector import LogCollector

# Initialize Collector
OBS_DIR = "../outputs/observability"
collector = LogCollector(OBS_DIR)

print(f"✅ Dashboard Connected to: {os.path.abspath(OBS_DIR)}")

✅ Dashboard Connected to: d:\01. Github\salesops-suite\outputs\observability


## 2) Runs Overview

In [8]:
df_runs = collector.get_runs()

if not df_runs.empty:
    print(f"📊 Total Runs: {len(df_runs)}")

    # Status Breakdown
    status_counts = df_runs["status"].value_counts().reset_index()
    status_counts.columns = ["Status", "Count"]
    display(status_counts)

    # Show Table
    display(df_runs[["run_id", "status", "start_ts", "duration_sec"]].tail())
else:
    print("❌ No runs found. Please execute 'python main.py' first.")

📊 Total Runs: 13


| | Status | Count |
|---|---|---|
| 0 | completed | 13 |


| | run_id | status | start_ts | duration_sec |
|---|---|---|---|---|
| 8 | run_20251125T161454Z_563789 | completed | 2025-11-25 16:14:54.382487+00:00 | 75.511185 |
| 9 | run_20251125T163134Z_690bac | completed | 2025-11-25 16:31:34.190574+00:00 | 30.310516 |
| 10 | run_20251125T163936Z_eef401 | completed | 2025-11-25 16:39:36.725056+00:00 | 29.291779 |
| 11 | run_20251125T170332Z_e06aac | completed | 2025-11-25 17:03:32.234913+00:00 | 29.080436 |
| 12 | run_20251125T171815Z_9f42c5 | completed | 2025-11-25 17:18:15.579626+00:00 | 28.633473 |
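
The same audit can be condensed into a few headline numbers. The sketch below uses only the five runs displayed above and plain Python; `summarize_runs` is a hypothetical helper, and treating `"completed"` as the only success status is an assumption. The median is reported because it is less sensitive to a slow outlier such as the 75.5 s run.

```python
from collections import Counter
from statistics import median

# The five most recent runs shown above, as (status, duration_sec) pairs.
recent_runs = [
    ("completed", 75.511185),
    ("completed", 30.310516),
    ("completed", 29.291779),
    ("completed", 29.080436),
    ("completed", 28.633473),
]


def summarize_runs(runs):
    """Return success rate and duration statistics for a list of runs."""
    statuses = Counter(status for status, _ in runs)
    durations = [duration for _, duration in runs]
    return {
        # Assumption: "completed" is the only status that counts as success.
        "success_rate": statuses.get("completed", 0) / len(runs),
        "median_duration_sec": median(durations),
        "max_duration_sec": max(durations),
    }


summary = summarize_runs(recent_runs)
print(summary)
```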


## 3) Trace Visualization (Gantt Chart)

In [9]:
df_spans = collector.get_traces()

if not df_spans.empty and not df_runs.empty:
    # Filter spans belonging to the latest run (based on time window)
    latest_run = df_runs.iloc[-1]
    run_start = latest_run["start_ts"]

    # Get spans that started after the run started
    current_spans = df_spans[df_spans["start_ts"] >= run_start].copy()

    if not current_spans.empty:
        fig = px.timeline(
            current_spans,
            x_start="start_ts",
            x_end="end_ts",
            y="name",
            color="component",
            title=f"Execution Trace: {latest_run['run_id']}",
            hover_data=["duration_ms", "status", "error"],
            height=400,
        )
        fig.update_yaxes(autorange="reversed")  # Root at top
        fig.show()
    else:
        print("No spans found for the latest run.")
else:
    print("❌ No trace data available.")
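
The cell above keeps every span that started at or after the run's start, which is sufficient for the latest run. To inspect an older run, an upper bound is needed as well. The following is a minimal sketch using plain dictionaries; `spans_for_run` is a hypothetical helper, and it assumes a run's window is `start_ts` plus `duration_sec`. If the span records carry a run identifier, filtering on that would be more robust than a time window.

```python
from datetime import datetime, timedelta, timezone


def spans_for_run(spans, run_start, duration_sec):
    """Keep spans whose start time falls inside the run's time window."""
    run_end = run_start + timedelta(seconds=duration_sec)
    return [span for span in spans if run_start <= span["start_ts"] <= run_end]


# Made-up spans: two inside the window, one belonging to a later run.
t0 = datetime(2025, 11, 25, 17, 3, 32, tzinfo=timezone.utc)
demo_spans = [
    {"name": "ingest", "start_ts": t0 + timedelta(seconds=1)},
    {"name": "detect", "start_ts": t0 + timedelta(seconds=10)},
    {"name": "later_run_ingest", "start_ts": t0 + timedelta(minutes=15)},
]

selected = spans_for_run(demo_spans, t0, duration_sec=29.08)
print([span["name"] for span in selected])
```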

## 4) LLM Metrics (Cost & Latency)

In [10]:
df_llm = collector.get_llm_calls()

if not df_llm.empty:
    # Latency Distribution
    fig_hist = px.histogram(
        df_llm,
        x="latency_ms",
        nbins=20,
        color="model",
        title="LLM Latency Distribution (ms)",
        marginal="box",
    )
    fig_hist.show()

    # KPIs
    total_calls = len(df_llm)
    total_tokens = df_llm["est_tokens"].sum() if "est_tokens" in df_llm.columns else 0
    avg_latency = df_llm["latency_ms"].mean()

    md = f"""
    ### 🤖 AI Metrics
    * **Total Calls:** {total_calls}
    * **Est. Tokens:** {total_tokens:,.0f}
    * **Avg Latency:** {avg_latency:.0f} ms
    """
    display(Markdown(md))
else:
    print("❌ No LLM calls recorded.")


    ### 🤖 AI Metrics
    * **Total Calls:** 10
    * **Est. Tokens:** 1,854
    * **Avg Latency:** 2118 ms
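
The cell reports token counts and mean latency only. One way to extend it toward cost and tail latency is sketched below. Both helpers are hypothetical, the per-token price is a placeholder rather than any provider's real rate, and the latency list is made-up demo data, not the recorded calls.

```python
from statistics import quantiles


def estimate_cost_usd(total_tokens, usd_per_1k_tokens):
    """Convert a token count into an estimated cost at a flat per-1k rate."""
    return total_tokens / 1000 * usd_per_1k_tokens


def latency_percentile(latencies_ms, pct):
    """Return the pct-th percentile (1..99) using linear interpolation."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th one.
    return quantiles(latencies_ms, n=100, method="inclusive")[pct - 1]


# Placeholder price: substitute the actual rate for the model in use.
PLACEHOLDER_USD_PER_1K_TOKENS = 0.50
cost = estimate_cost_usd(1854, PLACEHOLDER_USD_PER_1K_TOKENS)

# Made-up latencies with one slow call to show why the tail matters.
demo_latencies_ms = [1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 5000]
p50 = latency_percentile(demo_latencies_ms, 50)
p95 = latency_percentile(demo_latencies_ms, 95)

print(f"Estimated cost: ${cost:.4f}")
print(f"p50: {p50:.0f} ms, p95: {p95:.0f} ms")
```

A mean alone can hide a single slow call; reporting a high percentile next to it makes such outliers visible.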
    

## 5) Action Audit

In [12]:
df_actions = collector.get_actions()

if not df_actions.empty:
    # Parse nested result status if needed
    if 'result' in df_actions.columns:
        # Safe extraction
        df_actions['status_code'] = df_actions['result'].apply(lambda x: x.get('http_code') if isinstance(x, dict) else None)
        df_actions['outcome'] = df_actions['result'].apply(lambda x: x.get('status') if isinstance(x, dict) else None)

    fig_bar = px.bar(
        df_actions, 
        x="type", 
        color="outcome", 
        title="Actions Executed by Type",
        barmode="group"
    )
    fig_bar.show()
    
    print("Recent Actions:")
    
    # Handle missing timestamp column gracefully (Legacy logs compatibility)
    cols_to_show = ["action_id", "type", "outcome"]
    if "timestamp" in df_actions.columns:
        cols_to_show.insert(0, "timestamp")
        
    display(df_actions[cols_to_show].tail())
else:
    print("❌ No actions recorded.")

Recent Actions:


| | action_id | type | outcome |
|---|---|---|---|
| 1 | a7569384-ca4c-49a2-a604-72404fbe300e | create_ticket | success |
| 2 | 47125b42-9adc-48a4-8c5a-cdd09588eb2d | create_ticket | success |
| 3 | 8f833d85-72a1-4e43-a5df-deb6686347f9 | create_ticket | success |
| 4 | 65ecb819-6f42-47ad-9ce9-7e859ddc8d6b | create_ticket | success |
| 5 | 779a9712-0eae-4dab-b952-f01430f2e77e | create_ticket | success |
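
The safe-extraction pattern used in the cell (read `status` only when `result` is a dictionary) can also be used to tally outcomes per action type. This is a minimal sketch on made-up records: `tally_outcomes` is a hypothetical helper, and the `send_email` type and `http_code` values are illustrative assumptions, not taken from the logs.

```python
from collections import Counter


def tally_outcomes(actions):
    """Count (type, outcome) pairs, tolerating missing or malformed results."""
    counts = Counter()
    for action in actions:
        result = action.get("result")
        # Same guard as the notebook cell: only dict results carry a status.
        outcome = result.get("status") if isinstance(result, dict) else None
        counts[(action.get("type"), outcome or "unknown")] += 1
    return counts


# Made-up action records for illustration only.
demo_actions = [
    {"type": "create_ticket", "result": {"status": "success", "http_code": 201}},
    {"type": "create_ticket", "result": {"status": "success", "http_code": 201}},
    {"type": "send_email", "result": None},
]

outcome_counts = tally_outcomes(demo_actions)
for (action_type, outcome), count in sorted(outcome_counts.items()):
    print(f"{action_type:15s} {outcome:10s} {count}")
```

Bucketing unparseable results as `unknown` keeps them visible in the audit instead of silently dropping them.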


## 6) Deep Dive Evidence

In [13]:
if not df_llm.empty:
    last_call = df_llm.iloc[-1]
    prompt_hash = last_call.get("prompt_hash")

    print(f"🔍 Inspecting Last AI Call: {last_call['anomaly_id']}")

    raw_path = f"{OBS_DIR}/responses/{prompt_hash}.json"
    if os.path.exists(raw_path):
        with open(raw_path, "r") as f:
            raw_data = json.load(f)

        print("\n--- 📝 Prompt (Truncated) ---")
        print(
            raw_data["prompt"][:1000] + "..."
            if len(raw_data["prompt"]) > 1000
            else raw_data["prompt"]
        )

        print("\n--- 💡 Model Response ---")
        print(json.dumps(raw_data["response"], indent=2))
    else:
        print(f"⚠️ Raw response file not found: {raw_path}")

🔍 Inspecting Last AI Call: iqr_West_2016-03-10_s16

--- 📝 Prompt (Truncated) ---
You are a Senior SalesOps Analyst. Analyze this sales anomaly.

DATA CONTEXT:
- Entity: West (region)
- Metric: Sales
- Value: 7,662.96
- Expected: 507.17
- Score: 16.73

STATISTICAL CONTEXT:
Q1: 79.54
Q3: 507.17
IQR: 427.63

HISTORICAL CONTEXT (From Memory Bank):
No relevant past events found.

OUTPUT FORMAT:
Return valid JSON with these exact keys:
{
    "explanation_short": "1 sentence summary",
    "explanation_full": "2-3 sentence detailed analysis. Reference history if relevant.",
    "suggested_actions": ["Action 1", "Action 2"],
    "confidence": "High/Medium/Low",
    "needs_human_review": boolean
}

CONSTRAINT:
- Rely ONLY on provided numbers and history.
- Do NOT invent external events.
- Output pure JSON (no markdown).

--- 💡 Model Response ---
{
  "explanation_short": "West region sales are significantly higher than expected, exceeding the upper quartile by a large margin.",
  "explan

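Because the prompt demands a fixed set of JSON keys, a stored response can be checked mechanically against that contract. The sketch below takes the key names and types from the prompt's OUTPUT FORMAT section; `validate_explanation` is a hypothetical helper, and the sample responses are made up for the demo.

```python
import json

# Keys and types taken from the prompt's OUTPUT FORMAT section.
REQUIRED_KEYS = {
    "explanation_short": str,
    "explanation_full": str,
    "suggested_actions": list,
    "confidence": str,
    "needs_human_review": bool,
}
ALLOWED_CONFIDENCE = {"High", "Medium", "Low"}


def validate_explanation(raw_text):
    """Return a list of problems; an empty list means the response conforms."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for key: {key}")

    confidence = payload.get("confidence")
    if isinstance(confidence, str) and confidence not in ALLOWED_CONFIDENCE:
        problems.append(f"unexpected confidence value: {confidence}")
    return problems


# Made-up responses: one conforming, one cut off mid-key.
good_response = json.dumps({
    "explanation_short": "Sales are far above the expected range.",
    "explanation_full": "The value exceeds Q3 by many multiples of the IQR.",
    "suggested_actions": ["Verify the order", "Check for data entry errors"],
    "confidence": "Medium",
    "needs_human_review": True,
})
truncated_response = '{"explanation_short": "Sales are far above range.", "explan'

good_problems = validate_explanation(good_response)
truncated_problems = validate_explanation(truncated_response)
print(good_problems)
print(truncated_problems)
```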
## ⏭️ Next Step: Proving Quality (Evaluation)

Success! We have built the **Observability Dashboard**.
* We can see the **Trace Waterfall** of our agents.
* We can audit every **Action** taken.
* We can measure **LLM Latency and Cost**.

**But... does it actually work?**
Tracing shows *what* happened, but not *how good* it was.
* Did the detector find all the anomalies? (Recall)
* Did the explainer give accurate reasons? (Quality)
* Did the system survive errors? (Robustness)

In **Day 10**, we will build the **Evaluation Pipeline**.
We will use **Synthetic Golden Datasets**, **Automated Regression Tests**, and **Human-in-the-loop Scoring** to generate a final "Report Card" for our submission.

👉 **Proceed to `evaluation/99_evaluation_report.ipynb`.**