# 📈 Week 11-12 · Notebook 04 · Monitoring & Observability for the Manufacturing Copilot

This notebook addresses a critical MLOps gap identified in our readiness scorecard: **Monitoring & Observability**. We will instrument our FastAPI service with production-grade logging, metrics, and alerting to track performance, cost, and the quality of the copilot's responses.


## 🎯 Learning Objectives

- **Instrument with Prometheus:** Integrate the `prometheus-fastapi-instrumentator` library to automatically expose key metrics (latency, request counts, error rates) from our FastAPI endpoints.
- **Implement Custom GenAI Metrics:** Create custom Prometheus metrics to track specific GenAI concerns, such as hallucination rates, token usage, and user feedback scores.
- **Configure Structured Logging:** Set up structured (JSON) logging to capture detailed, queryable information with every request, including trace IDs for end-to-end visibility.
- **Design Alerting Policies:** Define alerting rules in Prometheus's rule-file format, with the resulting alerts routed through Alertmanager, to notify the on-call team about critical issues like SLA breaches or spikes in harmful responses.
- **Visualize with Grafana:** Outline the key dashboards needed in Grafana to provide a real-time, holistic view of the Manufacturing Copilot's health.


## 🧩 Scenario: From "Is it working?" to "How well is it working?"

Our Manufacturing Copilot is now running in a staging environment. Knowing that the API is "up" is not enough. Leadership and operations teams are asking critical questions:
-   "Are we meeting our latency SLA of <500ms for technicians on the floor?"
-   "How often does the RAG agent hallucinate or fail to find an answer?"
-   "Which pieces of equipment are causing the most requests?"
-   "How much is this service costing us in terms of LLM tokens per day?"

To answer these, we need a robust monitoring and observability stack.


## 📊 The Three Pillars of Observability for GenAI

We will focus on three key types of telemetry data:

1.  **Metrics:** Numerical measurements aggregated over time (e.g., request rate, error count). Ideal for dashboards and high-level alerting.
2.  **Logs:** Timestamped records of discrete events (e.g., a single API request). Ideal for deep-dive debugging and auditing.
3.  **Traces:** A representation of the entire lifecycle of a request as it moves through different services. We've already laid the groundwork for this with our `X-Request-Trace-ID` header.
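
For a trace to span services, the ID in `X-Request-Trace-ID` has to be reused when a caller already supplies one. The sketch below shows one way to do that; `resolve_trace_id` is a hypothetical helper (not part of the service code), and it assumes trace IDs are UUID strings and that headers arrive as a plain `dict`.

```python
import uuid

TRACE_HEADER = "X-Request-Trace-ID"


def resolve_trace_id(headers: dict) -> str:
    """Reuse the caller's trace ID if it is a valid UUID, otherwise mint a new one."""
    incoming = headers.get(TRACE_HEADER)
    if incoming:
        try:
            # Normalise and validate; malformed values are discarded.
            return str(uuid.UUID(incoming))
        except ValueError:
            pass
    return str(uuid.uuid4())


known_id = "123e4567-e89b-12d3-a456-426614174000"
reused = resolve_trace_id({TRACE_HEADER: known_id})
fresh = resolve_trace_id({})
replaced = resolve_trace_id({TRACE_HEADER: "not-a-uuid"})
print(reused == known_id, fresh != known_id, replaced != "not-a-uuid")
```

Real HTTP header lookups are case-insensitive, so in a FastAPI middleware the same logic would read from `request.headers` rather than a plain dictionary.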

### Key GenAI Metrics Framework

| Metric Name                      | Type        | Description                                                              | Labels (`key=value`)                              | Target / SLA                               |
| -------------------------------- | ----------- | ------------------------------------------------------------------------ | ------------------------------------------------- | ------------------------------------------ |
| `copilot_requests_total`         | `Counter`   | Total number of requests to the copilot API.                             | `plant_id`, `status_code`, `agent_name`           | N/A                                        |
| `copilot_latency_seconds`        | `Histogram` | Latency distribution for API requests.                                   | `plant_id`, `agent_name`                          | P95 < 0.5 seconds                          |
| `copilot_llm_token_usage_total`  | `Counter`   | Total number of input and output tokens used by the LLM.                 | `plant_id`, `agent_name`, `direction` (input/output) | Monitor for budget adherence            |
| `copilot_hallucination_total`    | `Counter`   | Responses flagged as hallucinations or "I don't know". The rate is derived in PromQL. | `plant_id`, `agent_name`             | < 5% of requests                           |
| `copilot_user_feedback_score`    | `Gauge`     | Average user satisfaction score (e.g., 1-5 stars).                       | `plant_id`, `agent_name`                          | > 4.0                                      |
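
To make the "budget adherence" target for `copilot_llm_token_usage_total` concrete, here is a minimal sketch of turning token counts into an estimated cost. The per-1,000-token prices and the daily token counts are hypothetical placeholders, not figures from any provider or from this service.

```python
# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_TOKENS_USD = {"input": 0.0005, "output": 0.0015}


def estimate_cost_usd(token_counts: dict) -> float:
    """Estimate cost from a mapping of direction ("input"/"output") to token count."""
    return sum(
        tokens / 1000 * PRICE_PER_1K_TOKENS_USD[direction]
        for direction, tokens in token_counts.items()
    )


# Hypothetical daily totals, e.g. read from the token counter split by `direction`.
daily_tokens = {"input": 1_200_000, "output": 400_000}
daily_cost = estimate_cost_usd(daily_tokens)
print(f"Estimated daily cost: ${daily_cost:.2f}")
```

Keeping input and output tokens as separate label values matters because providers commonly price the two directions differently.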


### 1. Instrumenting the FastAPI Service with Prometheus

**Prometheus** is an open-source monitoring system that collects and stores its metrics as time series data. We will use the `prometheus-fastapi-instrumentator` library to automatically expose key metrics from our API.

First, let's update our `app/main.py` to include the instrumentation and define our custom GenAI metrics.

In [None]:
%%writefile app/main.py -a

# --- Prometheus Metrics ---
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Gauge

# Define custom metrics
LLM_TOKEN_USAGE = Counter(
    "copilot_llm_token_usage_total",
    "Total number of LLM tokens used",
    ["plant_id", "agent_name", "direction"]
)

HALLUCINATION_FLAG = Counter(
    "copilot_hallucination_total",
    "Total number of responses flagged as hallucinations",
    ["plant_id", "agent_name"]
)

USER_FEEDBACK_SCORE = Gauge(
    "copilot_user_feedback_score",
    "Average user feedback score on a sliding window",
    ["plant_id", "agent_name"]
)

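# Instrument the app at import time: Starlette does not allow middleware
# to be added once the application has started.
instrumentator = Instrumentator().instrument(app)

@app.on_event("startup")
async def startup():
    """Expose the /metrics endpoint when the app starts."""
    instrumentator.expose(app)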


Now, we need to modify our agent logic to update these metrics. Let's update `app/agents.py`.

In [None]:
%%writefile app/agents.py

from uuid import uuid4
import asyncio
from .models import DiagnosisRequest, DiagnosisResponse

async def run_copilot_inference(payload: DiagnosisRequest) -> DiagnosisResponse:
    """
    This function simulates the full agentic workflow, including I/O-bound operations
    and updating Prometheus metrics.
    """
    # Import the custom metrics lazily: app/main.py imports this module at load
    # time, so a module-level `from .main import ...` would be a circular import.
    from .main import LLM_TOKEN_USAGE, HALLUCINATION_FLAG

    agent_name = "RAG_Agent"
    plant_id = payload.plant_id

    # Simulate network latency
    await asyncio.sleep(0.1)
    
    # 1. Vision Agent (Simulated)
    vision_output = {
        "defects_found": ["micro-fracture", "surface-discoloration"],
        "confidence": 0.85,
    }

    await asyncio.sleep(0.15)

    # 2. RAG Agent (Simulated)
    rag_output = {
        "recommended_steps": [
            f"1. For equipment {payload.equipment_id}, inspect the primary coolant line for leaks.",
            "2. Verify torque settings on mounting bolts (Ref: SOP-123, Sec 4.2).",
            "3. Escalate to Level-2 maintenance if vibration exceeds 5mm/s.",
        ],
        "cited_documents": ["SOP-123", "MAINT-GUIDE-V2"],
    }
    
    # --- Update Metrics ---
    # Simulate token usage
    input_tokens = len(payload.problem_description.split()) * 2 # Rough estimate
    output_tokens = sum(len(step.split()) for step in rag_output["recommended_steps"]) * 2
    LLM_TOKEN_USAGE.labels(plant_id=plant_id, agent_name=agent_name, direction="input").inc(input_tokens)
    LLM_TOKEN_USAGE.labels(plant_id=plant_id, agent_name=agent_name, direction="output").inc(output_tokens)

    # Simulate hallucination check
    if not rag_output["cited_documents"]:
        HALLUCINATION_FLAG.labels(plant_id=plant_id, agent_name=agent_name).inc()
        confidence = 0.4 # Lower confidence if no docs are cited
    else:
        confidence = 0.91

    # 3. Report Generation (Simulated)
    report = f"Incident Report for {payload.equipment_id} at {plant_id}: Visual inspection found {', '.join(vision_output['defects_found'])}. Recommended action: Follow RAG guidance."

    return DiagnosisResponse(
        request_id=uuid4(),
        vision_analysis=vision_output,
        rag_guidance=rag_output,
        generated_report=report,
        confidence_score=confidence,
    )

With these changes, every time the `/v1/diagnose` endpoint is called, our custom metrics for token usage and hallucinations will be updated. The `prometheus-fastapi-instrumentator` automatically handles standard metrics like request counts and latency.
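
The `USER_FEEDBACK_SCORE` gauge is described as a sliding-window average, but the agent code never sets it. The following is a minimal sketch of how such an average could be maintained; `SlidingWindowAverage` and the window size are assumptions for illustration, not part of the service.

```python
from collections import deque
from typing import Optional


class SlidingWindowAverage:
    """Average of the most recent `max_size` feedback scores (1-5 stars)."""

    def __init__(self, max_size: int = 100):
        # A bounded deque silently drops the oldest score once it is full.
        self._scores = deque(maxlen=max_size)

    def add(self, score: float) -> float:
        if not 1 <= score <= 5:
            raise ValueError("feedback score must be between 1 and 5")
        self._scores.append(score)
        return self.average()

    def average(self) -> Optional[float]:
        if not self._scores:
            return None
        return sum(self._scores) / len(self._scores)


window = SlidingWindowAverage(max_size=3)
for score in (5, 4, 3):
    window.add(score)
avg_before = window.average()   # window holds 5, 4, 3
avg_after = window.add(1)       # the oldest score (5) is evicted
print(avg_before, round(avg_after, 2))
```

In the service, the value returned by `add()` could be passed to `USER_FEEDBACK_SCORE.labels(plant_id=..., agent_name=...).set(...)` from a feedback endpoint, which this notebook does not define.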

After running the app, you can visit the `/metrics` endpoint (e.g., `http://localhost:8000/metrics`) to see the raw text-based exposition format that Prometheus scrapes.
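
To get a feel for that exposition format, the sketch below parses a small hand-written sample. The sample lines and label values are hypothetical, and the parser is deliberately simplified: it ignores timestamps and escaped quotes, which the real format allows.

```python
import re

# Hypothetical excerpt of what /metrics might return for the token counter.
SAMPLE = """\
# HELP copilot_llm_token_usage_total Total number of LLM tokens used
# TYPE copilot_llm_token_usage_total counter
copilot_llm_token_usage_total{plant_id="plant-7",agent_name="RAG_Agent",direction="input"} 42.0
copilot_llm_token_usage_total{plant_id="plant-7",agent_name="RAG_Agent",direction="output"} 58.0
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exposition(text: str) -> list:
    """Return (metric_name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_exposition(SAMPLE)
total_tokens = sum(value for _, _, value in samples)
print(len(samples), total_tokens)
```

Each combination of label values is its own time series, which is why high-cardinality labels (for example a per-request ID) should stay out of metrics and go into logs instead.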

### 2. Structured Logging for Deeper Insights

While metrics are great for aggregation, **structured logs** are essential for debugging individual requests. We want to move from simple string logs to queryable JSON objects.
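
As an illustration of what "queryable" means here, the following stdlib-only sketch emits JSON log lines and then filters them by a field. It is not the `structlog` setup used in this notebook; `JsonFormatter` and the `context` key passed through `extra` are assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
        }
        # Extra key/value context is attached via `extra={"context": {...}}`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("json_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.info("request_completed", extra={"context": {"status_code": 200, "duration_ms": 212.5}})
demo_logger.warning("high_latency_detected", extra={"context": {"duration_ms": 734.1}})

# Because every line is JSON, filtering is a field comparison rather than a regex.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [r for r in records if r.get("duration_ms", 0) > 500]
print(slow)
```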

Let's install and configure `structlog`.

```bash
pip install structlog
```

Now, we'll create a new file, `app/logging_config.py`, to define our logging setup.

In [None]:
%%writefile app/logging_config.py

import logging
import sys
import structlog

def setup_logging(log_level: str = "INFO"):
    """
    Configures structured logging for the application.
    """
    logging.basicConfig(
        level=log_level,
        format="%(message)s",
        stream=sys.stdout,
    )

    structlog.configure(
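        processors=[
            # Merge values bound via structlog.contextvars (e.g. trace_id) into every event.
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.filter_by_level,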
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.stdlib.PositionalArgumentsFormatter(),
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.UnicodeDecoder(),
            structlog.processors.JSONRenderer()
        ],
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )


And now we update `app/main.py` again to use this new logging configuration.

In [None]:
%%writefile app/main.py

import time
import logging
from uuid import uuid4
import structlog

from fastapi import FastAPI, Request, Depends
from .config import settings
from .models import DiagnosisRequest, DiagnosisResponse, HealthStatus
from .security import authorize_request
from .agents import run_copilot_inference
from .logging_config import setup_logging

# --- Application Setup ---
setup_logging(log_level=settings.LOG_LEVEL)
logger = structlog.get_logger("manufacturing_copilot_api")

app = FastAPI(
    title=settings.APP_TITLE,
    version=settings.APP_VERSION,
    description="API for interacting with the Manufacturing Copilot agents.",
)

# --- Middleware for Observability ---
@app.middleware("http")
async def add_observability_headers(request: Request, call_next):
    trace_id = str(uuid4())
    
    # Bind key context variables to the logger for every request
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        trace_id=trace_id,
        path=request.url.path,
        method=request.method,
    )
    
    start_time = time.perf_counter()
    
    response = await call_next(request)
    
    duration_ms = (time.perf_counter() - start_time) * 1000
    
    response.headers["X-Request-Trace-ID"] = trace_id
    response.headers["X-Response-Time-ms"] = f"{duration_ms:.2f}"
    
    logger.info(
        "request_completed",
        status_code=response.status_code,
        duration_ms=duration_ms
    )
    
    if duration_ms > 500:
        logger.warning("high_latency_detected", duration_ms=duration_ms)
        
    return response


# --- API Endpoints ---
@app.get("/health", response_model=HealthStatus, tags=["Monitoring"])
async def health_check():
    return HealthStatus(status="ok")


@app.post("/v1/diagnose", response_model=DiagnosisResponse, tags=["Copilot"])
async def diagnose_problem(
    payload: DiagnosisRequest,
    user_id: str = Depends(authorize_request)
):
    logger.info(
        "diagnosis_request_received",
        user_id=user_id,
        plant_id=payload.plant_id,
        equipment_id=payload.equipment_id
    )
    
    response = await run_copilot_inference(payload)
    
    logger.info("diagnosis_request_success", confidence=response.confidence_score)
    
    return response

# --- Prometheus Metrics ---
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Gauge

LLM_TOKEN_USAGE = Counter(
    "copilot_llm_token_usage_total",
    "Total number of LLM tokens used",
    ["plant_id", "agent_name", "direction"]
)

HALLUCINATION_FLAG = Counter(
    "copilot_hallucination_total",
    "Total number of responses flagged as hallucinations",
    ["plant_id", "agent_name"]
)

USER_FEEDBACK_SCORE = Gauge(
    "copilot_user_feedback_score",
    "Average user feedback score on a sliding window",
    ["plant_id", "agent_name"]
)

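# Add the instrumentation middleware at import time: Starlette does not allow
# middleware to be added once the application has started.
instrumentator = Instrumentator().instrument(app)

@app.on_event("startup")
async def startup():
    instrumentator.expose(app)
    logger.info("Application startup complete. Metrics exposed at /metrics.")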
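
The middleware binds `trace_id` through `structlog.contextvars`, which builds on Python's `contextvars` module. The stdlib-only sketch below illustrates why that keeps concurrent requests from overwriting each other's context; `handle_request` is a hypothetical stand-in for a request handler, not code from the service.

```python
import asyncio
import contextvars

trace_id_var = contextvars.ContextVar("trace_id", default=None)


async def handle_request(trace_id: str, delay: float):
    """Bind a trace ID, yield to other tasks, then read the value back."""
    trace_id_var.set(trace_id)
    await asyncio.sleep(delay)  # other "requests" run while this one waits
    return trace_id, trace_id_var.get()


async def main():
    # gather() wraps each coroutine in its own task, and every task gets
    # its own copy of the context, so the two set() calls do not collide.
    return await asyncio.gather(
        handle_request("trace-a", 0.02),
        handle_request("trace-b", 0.01),
    )


results = asyncio.run(main())
print(results)
```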


### 3. Alerting Policy for GenAI Metrics

Metrics only help if someone is told when they go wrong. We can define rules that fire when a metric crosses a critical threshold. The rules below use the **Prometheus alerting rule** format: they would be saved in a file like `alert_rules.yml`, loaded and evaluated by Prometheus, and the resulting alerts would be routed to the on-call team by **Alertmanager**.

```yaml
# alert_rules.yml

groups:
- name: GenAI_Copilot_Alerts
  rules:
  - alert: HighHallucinationRate
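    # This PromQL query divides the rate of flagged hallucinations over the last 15 minutes
    # by the request rate for the same plant and agent. It assumes the `copilot_requests_total`
    # counter from the metrics framework is exported with matching `plant_id` and `agent_name`
    # labels. The instrumentator's default `http_requests_total` carries neither label, so it
    # cannot be matched against the hallucination counter per plant and agent.
    expr: (sum(rate(copilot_hallucination_total[15m])) by (plant_id, agent_name)) / (sum(rate(copilot_requests_total[15m])) by (plant_id, agent_name)) > 0.05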
    # The condition must be true for 5 minutes before firing to avoid "flapping" on brief spikes.
    for: 5m
    labels:
      severity: critical
      team: ai_operations
    annotations:
      summary: "High hallucination rate for agent {{ $labels.agent_name }} at plant {{ $labels.plant_id }}."
      description: "The hallucination rate has been over 5% for the last 5 minutes. Current value is {{ $value | humanizePercentage }}. This indicates a potential issue with the RAG retriever or the underlying data."
      runbook_url: "https://internal-wiki.example.com/runbooks/hallucination-escalation"

  - alert: HighRequestLatency
    # This PromQL query finds the 95th percentile latency over the last 10 minutes.
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[10m])) by (le, job)) > 0.5
    for: 5m
    labels:
      severity: warning
      team: ai_operations
    annotations:
      summary: "P95 API latency is above 500ms."
      description: "The API P95 latency has been over 500ms for 5 minutes. This is impacting user experience and violating our SLA."
      runbook_url: "https://internal-wiki.example.com/runbooks/api-latency-investigation"
      
  - alert: NoUserFeedbackReceived
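    # This alert fires if the feedback gauge for a plant has not changed in 24 hours.
    # An unchanged gauge is only a proxy for "no new feedback"; a dedicated feedback
    # counter would make this check more reliable.
    expr: changes(copilot_user_feedback_score[24h]) == 0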
    for: 1h
    labels:
      severity: info
      team: product_management
    annotations:
      summary: "No user feedback received from plant {{ $labels.plant_id }} in 24 hours."
      description: "The feedback mechanism might be broken or users are not engaging. Please investigate."
```
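
The `for:` clause is what prevents flapping: Prometheus keeps an alert in a pending state until the expression has been continuously true for the full duration. The sketch below simulates that behaviour on a hypothetical series of one-minute evaluations. It is a simplified model for intuition, not Prometheus's actual implementation.

```python
def first_firing_time(evaluations, threshold, for_seconds):
    """Return the timestamp at which the alert would start firing, or None.

    `evaluations` is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= for_seconds:
                return timestamp  # true for the whole `for` window: firing
        else:
            pending_since = None  # condition cleared: reset the pending timer
    return None


# Hypothetical hallucination ratios, evaluated once per minute.
ratios = [0.04, 0.06, 0.07, 0.03, 0.06, 0.06, 0.08, 0.09, 0.07, 0.06]
evaluations = [(i * 60, ratio) for i, ratio in enumerate(ratios)]

fired_at = first_firing_time(evaluations, threshold=0.05, for_seconds=300)
brief_spike = first_firing_time(evaluations[:4], threshold=0.05, for_seconds=300)
print(fired_at, brief_spike)
```

In this series the two-minute spike at the start never fires because the ratio drops back under the threshold, while the later sustained breach does.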

### 4. Designing Grafana Dashboards for GenAI

A picture is worth a thousand log lines. A well-designed dashboard provides an at-a-glance view of the system's health and performance. Here’s a blueprint for a Grafana dashboard for our Manufacturing Copilot.

**Dashboard Title:** Manufacturing Copilot - Service Health

**Variables:**
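- `plant_id`: A dropdown to filter panels built on the custom `copilot_*` metrics by manufacturing plant. The instrumentator's default HTTP metrics carry no `plant_id` label, so panels based on them show service-wide values.
- `agent_name`: A dropdown to filter by a specific agent (e.g., `RAG_Agent`).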

---

#### Row 1: Key Performance Indicators (KPIs)

- **Stat Panel: P95 Latency**
  - **Query:** `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="fastapi"}[5m])) by (le))`
  - **Unit:** Seconds
  - **Thresholds:** Base: < 0.3s, Warning: 0.3s-0.5s, Critical: > 0.5s

- **Stat Panel: Error Rate (%)**
  - **Query:** `(sum(rate(http_requests_total{job="fastapi", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="fastapi"}[5m]))) * 100`
  - **Unit:** Percent
  - **Thresholds:** Base: < 1%, Warning: 1-3%, Critical: > 3%

- **Stat Panel: Hallucination Rate (%)**
  - **Query:** `(sum(rate(copilot_hallucination_total{plant_id=~"$plant_id", agent_name=~"$agent_name"}[5m])) / sum(rate(copilot_requests_total{plant_id=~"$plant_id", agent_name=~"$agent_name"}[5m]))) * 100` (assumes `copilot_requests_total` from the metrics framework is exported with the same labels)
  - **Unit:** Percent
  - **Thresholds:** Base: < 2%, Warning: 2-5%, Critical: > 5%

- **Stat Panel: Average User Feedback**
  - **Query:** `avg_over_time(copilot_user_feedback_score{plant_id=~"$plant_id"}[24h])`
  - **Unit:** Stars (1-5)
  - **Thresholds:** Base: > 4.0, Warning: 3.5-4.0, Critical: < 3.5

---

#### Row 2: Latency & Throughput Analysis

- **Time Series Panel: API Latency (P95, P99, Median)**
  - **Queries:**
    - P99: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="fastapi"}[5m])) by (le))`
    - P95: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="fastapi"}[5m])) by (le))`
    - Median: `histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{job="fastapi"}[5m])) by (le))`
  - **Y-Axis:** Seconds

- **Time Series Panel: Requests per Second (RPS)**
  - **Query:** `sum(rate(http_requests_total{job="fastapi"}[1m])) by (status)`
  - **Visualization:** Stacked bar chart, colored by status code (2xx green, 4xx orange, 5xx red).
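
The latency queries in this row rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw request timings. The sketch below reproduces the basic idea, linear interpolation inside the bucket that contains the target rank. It is a simplified model with hypothetical bucket counts, and Prometheus's own implementation handles more edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the (+Inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for `le` = 0.1s, 0.25s, 0.5s, 1.0s and +Inf.
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 96), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
print(p95 < 0.5)
```

Because the result is interpolated, its accuracy depends on bucket boundaries: placing a bucket edge at the SLA value (0.5 seconds here) makes the comparison against the SLA more trustworthy.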

---

#### Row 3: GenAI-Specific Metrics

- **Time Series Panel: LLM Token Usage**
  - **Query:** `sum(rate(copilot_llm_token_usage_total{plant_id=~"$plant_id", agent_name=~"$agent_name"}[5m])) by (direction)`
  - **Legend:** `{{direction}}` (i.e., input, output)
  - **Y-Axis:** Tokens per second

- **Time Series Panel: Hallucination Count**
  - **Query:** `sum(increase(copilot_hallucination_total{plant_id=~"$plant_id", agent_name=~"$agent_name"}[5m])) by (agent_name)`
  - **Visualization:** Bar chart.

- **Table Panel: Top 5 Hallucinating Agents**
  - **Query:** `topk(5, sum(increase(copilot_hallucination_total[24h])) by (agent_name))`
  - **Columns:** Agent Name, Hallucination Count (24h)

---

#### Row 4: System & Resource Monitoring

- **Gauge Panel: Active Agents**
  - **Query:** `sum(copilot_active_agents{plant_id=~"$plant_id"})` (assumes a `copilot_active_agents` gauge, which this notebook does not define)
  - **Unit:** Integer

- **Time Series Panel: CPU / Memory Usage**
  - **Queries:** Standard cAdvisor metrics for container CPU and memory, or node exporter metrics for the host.
  - **Y-Axis:** Percent / Bytes

- **Log Panel: Recent Errors**
  - **Source:** Loki
  - **LogQL Query:** `{job="fastapi"} | json | level="error"`
  - **Description:** Shows the structured JSON logs for recent errors, allowing for quick inspection of `trace_id`, error messages, and other context.

This dashboard provides a holistic view, combining high-level business metrics (feedback, hallucinations) with low-level system performance indicators (latency, resource usage). The use of variables makes it a powerful tool for targeted debugging.

### 5. Generating a Weekly Health Report

Dashboards serve the on-call team, while leadership usually wants a periodic summary. The script below renders the headline metrics into a Markdown report using Jinja2.

In [None]:
# --- weekly_report_generator.py ---
# This script could be run automatically to generate a weekly markdown report.

from jinja2 import Template

# In a real scenario, this data would be fetched from the Prometheus API.
report_data = {
    "week_of": "2025-10-27",
    "overall_health": "GREEN",
    "p95_latency_ms": 380,
    "sla_latency_ms": 500,
    "uptime_percentage": 99.98,
    "total_requests": 150230,
    "hallucination_rate_percent": 3.2,
    "avg_user_feedback": 4.6,
    "total_cost_usd": 750.45,
    "top_issue": "High latency spikes during shift change (7am-8am).",
    "action_item": "Investigate database connection pool contention during peak load."
}

report_template_str = """
# Manufacturing Copilot Weekly Health Report

**Week of:** {{ week_of }}
**Overall Health:** {{ overall_health }}

| Metric                      | Value                               | Target      |
| --------------------------- | ----------------------------------- | ----------- |
| Uptime                      | {{ uptime_percentage }}%            | > 99.9%     |
| P95 Latency                 | {{ p95_latency_ms }}ms              | < {{ sla_latency_ms }}ms |
| Total Requests              | {{ "{:,}".format(total_requests) }} | N/A         |
| Hallucination Rate          | {{ hallucination_rate_percent }}%   | < 5%        |
| Avg. User Feedback (1-5)    | {{ avg_user_feedback }}             | > 4.0       |
| Estimated Weekly Cost (USD) | ${{ "{:,.2f}".format(total_cost_usd) }} | < $1,000    |

---

### Key Observations & Action Items

*   **Top Issue:** {{ top_issue }}
*   **Action Item:** {{ action_item }}
"""

template = Template(report_template_str)
report = template.render(report_data)

print(report)
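
The `overall_health` field above is hard-coded. As a minimal sketch, it could instead be derived from the targets listed in the report table. The GREEN/AMBER/RED rule used here (no missed targets, one missed, more than one missed) is an assumption for illustration, as are the helper names.

```python
# Targets taken from the report table: ("min", x) means the value must exceed x,
# ("max", x) means it must stay below x.
TARGETS = {
    "uptime_percentage": ("min", 99.9),
    "p95_latency_ms": ("max", 500),
    "hallucination_rate_percent": ("max", 5.0),
    "avg_user_feedback": ("min", 4.0),
    "total_cost_usd": ("max", 1000.0),
}


def missed_targets(data: dict) -> list:
    """Return the names of metrics that do not meet their target."""
    missed = []
    for name, (kind, limit) in TARGETS.items():
        value = data[name]
        met = value > limit if kind == "min" else value < limit
        if not met:
            missed.append(name)
    return missed


def overall_health(data: dict) -> str:
    """Hypothetical rule: GREEN if all targets are met, AMBER for one miss, RED otherwise."""
    count = len(missed_targets(data))
    if count == 0:
        return "GREEN"
    return "AMBER" if count == 1 else "RED"


weekly_metrics = {
    "uptime_percentage": 99.98,
    "p95_latency_ms": 380,
    "hallucination_rate_percent": 3.2,
    "avg_user_feedback": 4.6,
    "total_cost_usd": 750.45,
}
print(overall_health(weekly_metrics), missed_targets(weekly_metrics))
```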


## 🧪 Lab Assignment: Instrument and Observe

1.  **Instrument Your FastAPI Service:**
    -   Add the `prometheus-fastapi-instrumentator` to your `main.py` as shown above.
    -   Add at least one custom metric (e.g., `LLM_TOKEN_USAGE`).
    -   Run your application and navigate to `http://localhost:8000/metrics`. You should see a long list of metrics.

2.  **Set Up a Local Monitoring Stack:**
    -   Create a `docker-compose.monitoring.yml` file that includes services for **Prometheus** and **Grafana**. You can find many examples online.
    -   Configure Prometheus to scrape the `/metrics` endpoint of your running FastAPI application.
    -   Run `docker-compose -f docker-compose.monitoring.yml up`.

3.  **Create a Grafana Dashboard:**
    -   Log in to Grafana (usually at `http://localhost:3000`).
    -   Connect to your Prometheus data source.
    -   Create a new dashboard with at least two panels:
        -   A time-series graph showing the request rate, e.g. `rate(http_requests_total[1m])`.
        -   A stat or gauge panel showing the value of your custom metric.

4.  **Implement Structured Logging:**
    -   Choose a structured logging library (like `structlog`).
    -   Configure it to output JSON logs from your FastAPI application.
    -   Modify your logging middleware to use the structured logger.


## ✅ Checklist for this Notebook

- [X] Standard and custom GenAI metrics identified and defined.
- [X] FastAPI application instrumented to export metrics to Prometheus.
- [X] Structured logging format designed to include critical context like trace IDs.
- [X] Alerting policies for key GenAI failure modes (hallucination, latency) drafted.
- [X] Key Grafana dashboard panels designed for at-a-glance analysis.
- [ ] **TODO:** Complete the Lab Assignment to set up a local monitoring stack and visualize your API's metrics.


## 📚 References and Further Reading

-   [Prometheus Documentation](https://prometheus.io/docs/introduction/overview/)
-   [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
-   [prometheus-fastapi-instrumentator on GitHub](https://github.com/trallnag/prometheus-fastapi-instrumentator) - The library for instrumenting FastAPI.
-   [Google SRE Book, Chapter 6: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/) - A foundational text on monitoring philosophy.
-   [structlog Documentation](https://www.structlog.org/en/stable/) - A popular library for structured logging in Python.
