# 📡 Week 09-10 · Notebook 11 · Post-Tuning Evaluation, Monitoring & Drift Management

Operationalize continuous evaluation, drift detection, and rollback for fine-tuned manufacturing copilots.

## 🎯 Learning Objectives
- **Build Evaluation Harnesses:** Create a robust evaluation framework that combines automated metrics (e.g., ROUGE, BLEU) with structured human-in-the-loop (HITL) reviews.
- **Monitor for Drift:** Implement strategies to detect embedding drift (semantic shift in user queries) and response quality degradation over time.
- **Deploy with Safeguards:** Understand and apply deployment strategies like canary releases and shadow testing to de-risk model updates.
- **Establish Governance SOPs:** Draft a Standard Operating Procedure (SOP) for model rollback and incident response, aligned with plant governance policies.

## 🧩 Scenario
A fine-tuned assistant is live in four plants. Leadership wants weekly monitoring packs summarizing accuracy, safety, and drift signals.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.metrics.pairwise import cosine_distances

## 🧪 Evaluation & Monitoring Log Schema

To monitor the model in production, we need to log every request and response. This structured log is the foundation for all post-deployment analysis.

```json
{
  "request_id": "uuid-1234-abcd",
  "timestamp": "2024-10-28T10:05:00Z",
  "plant_id": "Plant_B",
  "model_version": "sop-assistant-v1.1",
  "user_prompt": "What is the lubrication schedule for the main conveyor?",
  "prompt_embedding": [0.12, -0.05, ...], // 768-dim embedding
  "generated_response": "The main conveyor requires lubrication every 200 operating hours. Use ISO VG 46 oil.",
  "response_embedding": [0.08, 0.21, ...],
  "latency_ms": 250,
  "human_feedback_score": 4, // Optional: 1-5 scale from a user feedback button
  "contains_safety_keyword": false // Flagged by a post-processing check
}
```

This schema captures not just the text, but also the semantic meaning (embeddings) and performance metrics needed for drift detection and KPI tracking.
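
One way to keep logs consistent with this schema is to validate each record before it is persisted. The sketch below is a minimal example, not a prescribed implementation: `validate_log_record` is a hypothetical helper, the required-field list simply mirrors the example record above, and the 768-dimension check is taken from the comment in the schema. The embeddings in the usage example are placeholder values.

```python
from datetime import datetime

# Assumption: these required fields and types mirror the example record above.
REQUIRED_FIELDS = {
    "request_id": str,
    "timestamp": str,
    "plant_id": str,
    "model_version": str,
    "user_prompt": str,
    "prompt_embedding": list,
    "generated_response": str,
    "response_embedding": list,
    "latency_ms": (int, float),
    "contains_safety_keyword": bool,
}
EMBEDDING_DIM = 768  # Assumption: taken from the "768-dim" comment in the schema.


def validate_log_record(record: dict) -> list:
    """Hypothetical helper: returns a list of problems (empty if the record is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")

    for field in ("prompt_embedding", "response_embedding"):
        emb = record.get(field)
        if isinstance(emb, list) and len(emb) != EMBEDDING_DIM:
            problems.append(f"{field} has {len(emb)} dims, expected {EMBEDDING_DIM}")

    # human_feedback_score is optional, but must be on the 1-5 scale when present.
    score = record.get("human_feedback_score")
    if score is not None and score not in (1, 2, 3, 4, 5):
        problems.append("human_feedback_score must be on the 1-5 scale")

    # Timestamps are expected in the UTC form used above, e.g. "2024-10-28T10:05:00Z".
    ts = record.get("timestamp")
    if isinstance(ts, str):
        try:
            datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            problems.append("timestamp is not in YYYY-MM-DDTHH:MM:SSZ format")
    return problems


example = {
    "request_id": "uuid-1234-abcd",
    "timestamp": "2024-10-28T10:05:00Z",
    "plant_id": "Plant_B",
    "model_version": "sop-assistant-v1.1",
    "user_prompt": "What is the lubrication schedule for the main conveyor?",
    "prompt_embedding": [0.0] * EMBEDDING_DIM,    # placeholder values
    "generated_response": "The main conveyor requires lubrication every 200 operating hours.",
    "response_embedding": [0.0] * EMBEDDING_DIM,  # placeholder values
    "latency_ms": 250,
    "human_feedback_score": 4,
    "contains_safety_keyword": False,
}
print(validate_log_record(example))
```

Returning a list of problems instead of raising on the first error makes it easier to log every issue with a malformed record in one pass.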

### 1. Generate Synthetic Production Logs

First, we'll create a synthetic dataset that mimics the production logs from our deployed Manufacturing Copilot. This data is crucial for simulating our monitoring and drift detection workflows.

In [None]:
def generate_prod_logs(num_logs=500):
    """Generates a DataFrame of synthetic production logs for the Manufacturing Copilot."""
    logs = []
    base_time = datetime.now()
    plants = ["Plant_A", "Plant_B", "Plant_C", "Plant_D"]
    
    # Pre-generate random choices to speed up the loop
    plant_choices = np.random.choice(plants, size=num_logs)
    latency_values = np.random.randint(150, 500, size=num_logs)
    feedback_scores = np.random.choice([3, 4, 5], p=[0.1, 0.2, 0.7], size=num_logs)  # Skewed toward positive feedback
    safety_flags = np.random.choice([True, False], p=[0.05, 0.95], size=num_logs)

    for i in range(num_logs):
        logs.append({
            "timestamp": base_time - timedelta(hours=i),
            "plant_id": plant_choices[i],
            "model_version": "sop-assistant-v1.1",
            "latency_ms": latency_values[i],
            "human_feedback_score": feedback_scores[i],
            "contains_safety_keyword": safety_flags[i]
        })
    
    df = pd.DataFrame(logs)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df

# Generate and display the logs
prod_logs = generate_prod_logs(num_logs=500)
print("Generated Production Logs:")
prod_logs.head()

### 2. Calculate and Display KPI Dashboard

With the logs collected, we can compute key performance indicators (KPIs). This dashboard provides a high-level weekly overview for plant leadership, enabling them to quickly spot anomalies. A sudden drop in feedback score at one plant, for instance, would trigger an immediate investigation.

In [None]:
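# --- Calculate KPIs over the most recent 7 days of logs ---
# The synthetic logs span about three weeks (one entry per hour), so filter to one week first.
week_start = prod_logs['timestamp'].max() - timedelta(days=7)
weekly_logs = prod_logs[prod_logs['timestamp'] > week_start]

kpi_df = weekly_logs.groupby('plant_id').agg(
    avg_latency_ms=('latency_ms', 'mean'),
    avg_feedback_score=('human_feedback_score', 'mean'),
    safety_incidents=('contains_safety_keyword', 'sum')
).round(2)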

print("--- Weekly KPI Dashboard ---")
kpi_df
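
The "sudden drop in feedback score" check described above can be sketched as a week-over-week comparison per plant. This is a minimal example under stated assumptions: `flag_feedback_drops` is a hypothetical helper, the 0.5-point tolerance is an illustrative value, and the block builds its own small synthetic log so that it runs on its own.

```python
import pandas as pd


def flag_feedback_drops(logs: pd.DataFrame, now: pd.Timestamp, max_drop: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: compares each plant's mean feedback score in the
    last 7 days against the 7 days before that and flags large drops."""
    current_start = now - pd.Timedelta(days=7)
    previous_start = now - pd.Timedelta(days=14)

    current = logs[(logs["timestamp"] > current_start) & (logs["timestamp"] <= now)]
    previous = logs[(logs["timestamp"] > previous_start) & (logs["timestamp"] <= current_start)]

    summary = pd.DataFrame({
        "previous_week": previous.groupby("plant_id")["human_feedback_score"].mean(),
        "current_week": current.groupby("plant_id")["human_feedback_score"].mean(),
    })
    summary["change"] = summary["current_week"] - summary["previous_week"]
    # A plant missing from either week yields NaN, which is never flagged.
    summary["investigate"] = summary["change"] < -max_drop
    return summary


# Synthetic data: Plant_A is stable, Plant_B drops from 5 to 3 in the current week.
now = pd.Timestamp("2024-10-28 12:00:00")
rows = []
for day in range(14):
    ts = now - pd.Timedelta(days=day, hours=1)
    rows.append({"timestamp": ts, "plant_id": "Plant_A", "human_feedback_score": 5})
    rows.append({"timestamp": ts, "plant_id": "Plant_B", "human_feedback_score": 3 if day < 7 else 5})
logs = pd.DataFrame(rows)

report = flag_feedback_drops(logs, now)
print(report)
```

Comparing against the plant's own previous week, and not against a global average, keeps plants with structurally different score levels from triggering false alarms.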

### 3. Monitor for Embedding Drift

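**Data drift** is a critical concept in MLOps. It occurs when the statistical properties of the production data change over time, so the data the model was tuned on becomes less representative of what it sees in production. For an LLM, this often manifests as **embedding drift**: the topics and semantics of user prompts shift, which moves their embeddings away from the baseline distribution. (This is related to, but distinct from, **concept drift**, where the correct response for a given input changes.)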

We can detect this by comparing the embeddings of recent prompts to a baseline established during training (e.g., from the validation set). A significant divergence, measured by cosine distance, indicates that the model's knowledge may be becoming stale. This is a primary signal for triggering a retraining or fine-tuning cycle.

In [None]:
# --- Simulate Embedding Drift ---
# In a real scenario, these would be loaded from a vector DB or log storage.
# Baseline embeddings from the validation set (e.g., centered around a certain point)
baseline_embeddings = np.random.randn(100, 768) * 0.1 + 0.5

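# Current week's production prompt embeddings.
# Simulate drift by shifting only the first half of the dimensions, which changes the
# centroid's direction and mimics a shift in user topics. A uniform shift applied to every
# dimension would mostly rescale the centroid, and cosine distance ignores magnitude.
current_embeddings = np.random.randn(50, 768) * 0.1 + 0.5
current_embeddings[:, :384] += 0.5

# --- Calculate Drift using Centroid Cosine Distance ---
# We compute the cosine distance between the centroid (average vector) of the
# current embeddings and the centroid of the baseline embeddings.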
baseline_centroid = baseline_embeddings.mean(axis=0).reshape(1, -1)
current_centroid = current_embeddings.mean(axis=0).reshape(1, -1)

# Cosine distance is (1 - cosine similarity). A higher value means more difference.
drift_score = cosine_distances(baseline_centroid, current_centroid)[0, 0]

print("--- Embedding Drift Report ---")
print(f"Baseline Centroid Norm: {np.linalg.norm(baseline_centroid):.2f}")
print(f"Current Centroid Norm:  {np.linalg.norm(current_centroid):.2f}")
print(f"Drift Score (Cosine Distance): {drift_score:.4f}")

# --- Governance Rule ---
# This threshold must be tuned based on historical data and business tolerance.
DRIFT_THRESHOLD = 0.02 

if drift_score > DRIFT_THRESHOLD:
    print(f"\n🚨 ALERT: Drift score ({drift_score:.4f}) exceeds threshold of {DRIFT_THRESHOLD}. Retraining may be required.")
else:
    print(f"\n✅ OK: Drift score ({drift_score:.4f}) is within acceptable limits.")
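
Because the threshold has to be tuned, one possible starting point is to measure how much the drift score fluctuates when there is no drift at all. The sketch below is a minimal example of that idea, assuming a hypothetical `calibrate_drift_threshold` helper: it repeatedly holds out a random subset of the baseline as a stand-in for one monitoring window, measures the centroid cosine distance to the remaining rows, and takes a high percentile as a candidate threshold. The number of resamples, the percentile, and the synthetic baseline are illustrative assumptions.

```python
import numpy as np


def centroid_cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance (1 - cosine similarity) between the centroids of two embedding sets."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return float(1.0 - np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))


def calibrate_drift_threshold(baseline, window_size, n_resamples=200, percentile=99.0, seed=0):
    """Hypothetical helper: estimates the no-drift noise level of the drift score.

    Each resample holds out `window_size` random baseline rows as a stand-in for
    one monitoring window and compares them with the remaining rows.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_resamples):
        idx = rng.permutation(len(baseline))
        held_out, rest = baseline[idx[:window_size]], baseline[idx[window_size:]]
        scores.append(centroid_cosine_distance(held_out, rest))
    return float(np.percentile(scores, percentile))


# Synthetic stand-in for validation-set embeddings (same shape as the baseline above).
rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.5, scale=0.1, size=(100, 768))
candidate_threshold = calibrate_drift_threshold(baseline, window_size=50)
print(f"Candidate threshold from no-drift resampling: {candidate_threshold:.4f}")
```

A threshold derived this way reflects sampling noise only. Week-to-week topic variation that the business considers acceptable would push the practical threshold higher, so treat the result as a lower bound and not as the final governance value.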

### 4. Generate Automated Governance Report

This final step automates the creation of a weekly performance report. The report, formatted in Markdown, summarizes KPIs and drift analysis, providing clear status indicators and recommended actions. This artifact is ready to be emailed to stakeholders or posted to a documentation portal, ensuring transparent and consistent governance.

In [None]:
from IPython.display import Markdown

# Determine status based on drift score
drift_status = "ALERT" if drift_score > DRIFT_THRESHOLD else "OK"
status_color = "red" if drift_status == "ALERT" else "green"

summary_text = "Significant prompt drift detected. An investigation into new user query patterns is recommended." if drift_status == "ALERT" else "No significant anomalies detected."
action_items = "Analyze new query clusters from production logs.\n    - Curate new training data reflecting these patterns.\n    - Schedule a potential retraining cycle." if drift_status == "ALERT" else "None."

# --- Build Markdown Report ---
report_template = f"""
# Weekly Manufacturing Copilot Performance Report

**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Model Version:** sop-assistant-v1.1

---

## 1. Key Performance Indicators (KPIs)

| Plant | Avg. Latency (ms) | Avg. Feedback Score | Safety Incidents |
|:---|:---|:---|:---|
"""
# Populate KPI table
for index, row in kpi_df.iterrows():
    report_template += f"| **{index}** | {row['avg_latency_ms']:.0f} | {row['avg_feedback_score']:.2f} | {int(row['safety_incidents'])} |\n"

report_template += f"""
---

## 2. Model Drift Analysis

- **Prompt Embedding Drift Score:** `{drift_score:.4f}`
- **Drift Threshold:** `{DRIFT_THRESHOLD}`
- **Status:** <font color='{status_color}'>**{drift_status}**</font>

---

## 3. Summary & Actions

- **Overall Performance:** {'The model is performing within expected parameters, but requires attention regarding data drift.' if drift_status == 'ALERT' else 'The model is performing within expected parameters.'}
- **Anomalies:** {summary_text}
- **Action Items:**
    - {action_items}
"""

# Display the rendered Markdown report
Markdown(report_template)

---

## 🛡️ Deployment Strategies & Rollback SOP

While monitoring tells us *when* to act, we also need robust procedures for *how* to act. This involves safe deployment strategies to de-risk model updates and clear Standard Operating Procedures (SOPs) for handling incidents.

### Deployment Strategies for Safe Rollouts

1.  **Canary Release:**
    -   **Concept:** Route a small percentage of live traffic (e.g., 5% of users from a single plant) to the new model version (`v1.2`), while the majority remains on the stable version (`v1.1`).
    -   **Benefit:** Limits the "blast radius" of any potential issues. It provides real-world performance data from a small, controlled user group before a full rollout.
    -   **Procedure:** If KPIs for the canary group remain stable and positive for a predefined period (e.g., 48 hours), gradually increase the traffic split (25%, 50%, and finally 100%).

2.  **Shadow Testing (or Dark Launch):**
    -   **Concept:** Run the new model `v1.2` in parallel with the current model `v1.1`. Live traffic is served by `v1.1`, but a copy of each request is also sent to `v1.2` "in the dark." The `v1.2` responses are logged but not sent to the user.
    -   **Benefit:** Lets you compare the outputs of the new model against the old one on 100% of live traffic without exposing users to the new model's responses. You can analyze discrepancies, performance, and error rates offline.
    -   **Procedure:** Log cases where `v1.2`'s response differs significantly from `v1.1` (e.g., using embedding distance or keyword checks). Use this analysis to find bugs or unexpected behaviors before committing to a release.
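
Both strategies come down to small pieces of routing and comparison logic. The sketch below is a minimal illustration, not a production router: `assign_model_version` and `shadow_disagrees` are hypothetical helpers, hash-based bucketing is one common way to keep each user on a consistent version, and the 0.3 disagreement threshold is an arbitrary placeholder that would need tuning.

```python
import hashlib

import numpy as np


def assign_model_version(user_id: str, canary_fraction: float,
                         stable: str = "v1.1", canary: str = "v1.2") -> str:
    """Canary routing: deterministically maps a user to a bucket in [0, 1)
    so the same user always sees the same version for a given fraction."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32
    return canary if bucket < canary_fraction else stable


def shadow_disagrees(live_embedding, shadow_embedding, threshold: float = 0.3) -> bool:
    """Shadow testing: flags a request when the shadow response's embedding is
    far (in cosine distance) from the embedding of the response actually served."""
    a = np.asarray(live_embedding, dtype=float)
    b = np.asarray(shadow_embedding, dtype=float)
    distance = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(distance > threshold)


# Usage: check what share of a hypothetical user population lands on the canary at 5%.
users = [f"operator-{i}" for i in range(10_000)]
canary_share = sum(assign_model_version(u, 0.05) == "v1.2" for u in users) / len(users)
print(f"Share of users routed to the canary: {canary_share:.3f}")

print(shadow_disagrees([1.0, 0.0], [0.9, 0.1]))  # nearly parallel embeddings
print(shadow_disagrees([1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings
```

Because a user's bucket is fixed, raising `canary_fraction` from 0.05 to 0.25 keeps every existing canary user on `v1.2` and only adds new ones, which avoids users flipping between versions during the rollout.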

### Standard Operating Procedure (SOP): Model Rollback

A rollback SOP is a non-negotiable component of production MLOps. It ensures a rapid, predictable response to model failures, minimizing impact on business operations.

---
**SOP-MLOPS-003: Emergency Model Rollback**

-   **Trigger Conditions (any of the following):**
    1.  Critical safety incident rate (as defined by `contains_safety_keyword`) increases by > 1% over a 12-hour rolling window.
    2.  Average human feedback score drops by > 0.5 points over a 24-hour period at any plant.
    3.  P95 latency increases by > 50% for more than 1 hour.
    4.  The model generates responses that are confirmed to violate a documented safety or compliance rule (e.g., hallucinating a dangerous chemical mixture).

-   **Procedure:**
    1.  **Immediate Action (On-call MLOps Engineer):** Via the API gateway or load balancer, immediately re-route 100% of traffic back to the previous stable model version (e.g., from `v1.2` to `v1.1`).
    2.  **Communication:** Post a status update in the designated incident channel (e.g., `#manufacturing-ai-status`), notifying stakeholders that a rollback has occurred and the system is stable.
    3.  **Investigation:** Create a high-priority incident ticket. The AI/MLOps team must perform a root cause analysis (RCA) using the production logs leading up to the incident.
    4.  **Resolution:** The problematic model version (`v1.2`) is quarantined. It cannot be redeployed until the root cause is fixed, the model is re-validated in a staging environment, and the fix is approved by the AI Governance Board.
---
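
The trigger conditions above can also be encoded so that a monitoring job evaluates them the same way every time. The sketch below is a minimal example under stated assumptions: `WindowMetrics` and `rollback_triggers` are hypothetical names, the "> 1%" safety trigger is read as more than 1 percentage point, and the baseline values are assumed to come from the previous stable period. Trigger 4 is represented as a flag because it depends on human confirmation.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    """Hypothetical container for metrics aggregated over the SOP's rolling windows."""
    safety_rate: float               # share of flagged responses, 12-hour window (0.0-1.0)
    baseline_safety_rate: float
    feedback_score: float            # mean score, 24-hour window, for the worst plant
    baseline_feedback_score: float
    p95_latency_ms: float            # level sustained for more than 1 hour
    baseline_p95_latency_ms: float
    confirmed_violation: bool = False  # set by a human reviewer


def rollback_triggers(m: WindowMetrics) -> list:
    """Returns the names of all SOP trigger conditions that currently fire."""
    fired = []
    # Assumption: "> 1%" is interpreted as more than 1 percentage point.
    if m.safety_rate - m.baseline_safety_rate > 0.01:
        fired.append("safety_incident_rate")
    if m.baseline_feedback_score - m.feedback_score > 0.5:
        fired.append("feedback_score_drop")
    if m.p95_latency_ms > 1.5 * m.baseline_p95_latency_ms:
        fired.append("p95_latency")
    if m.confirmed_violation:
        fired.append("confirmed_safety_or_compliance_violation")
    return fired


# Illustrative values only: feedback has dropped by 0.6 points, other metrics are stable.
metrics = WindowMetrics(
    safety_rate=0.05, baseline_safety_rate=0.05,
    feedback_score=3.9, baseline_feedback_score=4.5,
    p95_latency_ms=480.0, baseline_p95_latency_ms=450.0,
)
fired = rollback_triggers(metrics)
print("ROLLBACK" if fired else "OK", fired)
```

Returning the list of fired triggers, and not just a boolean, gives the on-call engineer the reason for the rollback to include in the incident ticket.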

## 🧪 Lab Assignment

1.  **Enhance the KPI Dashboard:**
    *   Add a new KPI to the `kpi_df` DataFrame: `p95_latency_ms`. This should calculate the 95th percentile of latency for each plant. Use the `.quantile(0.95)` method within the `agg` function.
    *   Update the final Markdown report to include this new P95 Latency column.

2.  **Simulate a "No Drift" Scenario:**
    *   In the "Monitor for Embedding Drift" section, create a second set of `current_embeddings_no_drift`.
    *   These embeddings should be generated with the *same mean* as the `baseline_embeddings` (e.g., `... * 0.1 + 0.5`).
    *   Calculate a `drift_score_no_drift` and print it. Confirm that this new score is below the `DRIFT_THRESHOLD`.

3.  **Refine the Rollback SOP:**
    *   Add a fifth "Trigger Condition" to the SOP. This new condition should be: "More than 3 user-reported escalations related to incorrect or nonsensical answers are confirmed within a 4-hour window."

4.  **Create a Drift Alert Function:**
    *   Write a Python function `check_drift(baseline_embeds, current_embeds, threshold)` that takes baseline embeddings, current embeddings, and a threshold as input.
    *   The function should perform the centroid distance calculation and return a tuple: `(drift_score, alert_status)`, where `alert_status` is either `"OK"` or `"ALERT"`.
    *   Call this function in your notebook to generate the drift report.

## ✅ Checklist
- [ ] Production logs are structured to capture embeddings, latency, and user feedback.
- [ ] KPIs are automatically calculated and monitored for anomalies.
- [ ] Embedding drift is quantitatively measured against a baseline.
- [ ] A governance rule is in place to trigger alerts when drift exceeds a threshold.
- [ ] A clear, automated report is generated for stakeholders.
- [ ] Deployment strategies like canary and shadow testing are understood.
- [ ] A formal SOP for model rollback is documented and socialized.

## 📚 References
- [MLOps Course: Setting up ML in Production](https://www.deeplearning.ai/courses/machine-learning-engineering-for-production-mlops/) (DeepLearning.AI)
- *Designing Machine Learning Systems* by Chip Huyen (O'Reilly, 2022)
- [Hugging Face Hub: Model Cards](https://huggingface.co/docs/hub/model-cards)
- [Google Cloud: MLOps concepts](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)