# Advanced Monitoring & Drift Detection

This notebook focuses on advanced monitoring techniques to ensure the long-term reliability and quality of the Self-Critique pipeline. It provides tools for detecting data drift, concept drift, and model performance degradation.

## Learning Objectives

- **Data Drift Detection**: Identify changes in the statistical properties of input data.
- **Model Drift Detection**: Monitor for degradation in the pipeline's quality scores over time.
- **Statistical Monitoring**: Apply statistical tests (e.g., Kolmogorov-Smirnov) to detect drift.
- **Alerting**: Establish a framework for automated drift alerts.
- **Remediation Strategies**: Understand how to respond to detected drift.

## Business Context

LLM-based systems can degrade silently over time as input data characteristics or user expectations change. Proactive drift detection is essential to maintain quality and trust. This notebook helps answer:

- Is the pipeline performing as well today as it did last month?
- Has the type of content we're processing changed?
- How can we automatically detect and get alerted to quality degradation?
- When should we consider updating our prompts or models?

---


## Section 1: Setup and Configuration

In [None]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from typing import Dict, List, Any

from notebooks._shared_utilities import (
    load_monitoring_data,
    compare_distributions
)

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 7)

print("✓ Environment setup complete")

## Section 2: Simulating Monitoring Data

We'll create a simulated dataset representing pipeline performance over time. This will include a "baseline" period of normal performance and a "current" period where we'll introduce drift.


In [None]:
def simulate_monitoring_data(num_records: int, drift: bool = False) -> pd.DataFrame:
    """Generates a DataFrame of simulated monitoring logs."""
    base_date = pd.to_datetime('2024-01-01')
    dates = [base_date + pd.Timedelta(hours=i) for i in range(num_records)]
    
    data = {
        'timestamp': dates,
        'paper_length': np.random.normal(3000, 500, num_records),
        'overall_quality': np.random.normal(8.8, 0.4, num_records),
        'latency_seconds': np.random.normal(6.0, 1.0, num_records)
    }
    
    if drift:
        # Introduce drift in the second half of the data
        drift_point = num_records // 2
        data['paper_length'][drift_point:] = np.random.normal(4500, 700, num_records - drift_point) # Data drift
        data['overall_quality'][drift_point:] = np.random.normal(7.2, 0.8, num_records - drift_point) # Model drift
        data['latency_seconds'][drift_point:] = np.random.normal(8.5, 1.5, num_records - drift_point) # Performance drift
    
    return pd.DataFrame(data)

# Generate baseline and current data (seeded so the simulation is reproducible across runs)
np.random.seed(42)
baseline_data = simulate_monitoring_data(1000, drift=False)
current_data = simulate_monitoring_data(1000, drift=True)

# Split current data into pre-drift and post-drift for visualization
current_baseline = current_data.iloc[:500]
current_drifted = current_data.iloc[500:]

print("Simulated Data Summary:")
print("\nBaseline Data:")
print(baseline_data.describe())
print("\nCurrent (Drifted) Data:")
print(current_drifted.describe())

## Section 3: Data Drift Detection

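Data drift occurs when the statistical properties of the input data change. We'll monitor `paper_length` as a proxy for input complexity. The same baseline-versus-current comparison is also applied to `overall_quality` and `latency_seconds`, so that data drift, model drift, and performance drift can be viewed side by side.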
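
The implementation of `compare_distributions` lives in the shared utilities and is not shown in this notebook. As a rough illustration of what a two-sample Kolmogorov-Smirnov check involves, here is a minimal sketch built on `scipy.stats.ks_2samp`. The helper name `ks_drift_check` and the sample arrays are hypothetical, and the shared helper may behave differently.

```python
import numpy as np
from scipy import stats


def ks_drift_check(baseline, current, alpha=0.05):
    """Hypothetical helper: run a two-sample KS test.

    Returns (statistic, p_value, drift_detected).
    """
    result = stats.ks_2samp(baseline, current)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)


# Illustrative samples mirroring the simulated paper lengths in this notebook
rng = np.random.default_rng(42)
baseline_lengths = rng.normal(3000, 500, 1000)
shifted_lengths = rng.normal(4500, 700, 500)

stat, p_value, drifted = ks_drift_check(baseline_lengths, shifted_lengths)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}, drift={drifted}")
```

The KS statistic is the largest gap between the two empirical cumulative distributions, so it is worth reporting alongside the p-value: with large samples, even a small gap can produce a significant p-value.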


In [None]:
def plot_drift(baseline, current, metric, ax, title):
    """Helper to plot and compare distributions."""
    sns.kdeplot(baseline[metric], ax=ax, label='Baseline', color='blue', fill=True)
    sns.kdeplot(current[metric], ax=ax, label='Current', color='red', fill=True)
    ax.set_title(title)
    ax.legend()
    
    # Perform KS test
    stat, p_value = compare_distributions(baseline[metric], current[metric], test='ks')
    drift_detected = p_value < 0.05
    verdict = "Drift Detected" if drift_detected else "No Drift"
    ax.text(0.95, 0.95, f'p-value: {p_value:.4f}\n{verdict}', 
            transform=ax.transAxes, ha='right', va='top', 
            bbox=dict(boxstyle='round,pad=0.5', fc='wheat', alpha=0.5))
    return drift_detected

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Drift Analysis: Baseline vs. Current', fontsize=16)

# Data Drift (Paper Length)
plot_drift(baseline_data, current_drifted, 'paper_length', axes[0], 'Data Drift: Paper Length')

# Model Performance Drift (Quality Score)
plot_drift(baseline_data, current_drifted, 'overall_quality', axes[1], 'Model Drift: Quality Score')

# System Performance Drift (Latency)
plot_drift(baseline_data, current_drifted, 'latency_seconds', axes[2], 'Performance Drift: Latency')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

## Section 4: Time Series Drift Monitoring

We can also monitor metrics over time using a sliding window to detect gradual drift.
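
A rolling mean becomes actionable once it is compared against a threshold. The sketch below is one possible rule, not part of the notebook's utilities: it flags the first window whose mean sits more than `z_threshold` standard errors from the baseline mean. The function name `first_drift_index`, the default threshold, and the toy series are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def first_drift_index(series, baseline_mean, baseline_std,
                      window_size=100, z_threshold=3.0):
    """Hypothetical helper: return the index label of the first full window
    whose mean deviates from the baseline mean by more than z_threshold
    standard errors, or None if no window does."""
    rolling_mean = series.rolling(window=window_size).mean()
    # Standard error of a window mean, assuming independent observations
    standard_error = baseline_std / np.sqrt(window_size)
    z_scores = (rolling_mean - baseline_mean) / standard_error
    # Incomplete windows yield NaN, which compares as False here
    breaches = z_scores.abs() > z_threshold
    if not breaches.any():
        return None
    return breaches.idxmax()


# Deterministic toy series: 300 scores at the baseline level, then a step down
scores = pd.Series([8.8] * 300 + [7.2] * 300)
print(first_drift_index(scores, baseline_mean=8.8, baseline_std=0.4))
```

A larger window smooths out noise but reacts more slowly to a shift, so the window size and threshold are best tuned together against the false-alarm rate the team can tolerate.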


In [None]:
def monitor_over_time(data: pd.DataFrame, metric: str, window_size: int = 100) -> pd.DataFrame:
    """Calculates rolling statistics to monitor drift over time."""
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    rolling = data[metric].rolling(window=window_size)
    data[f'{metric}_rolling_mean'] = rolling.mean()
    data[f'{metric}_rolling_std'] = rolling.std()
    return data

monitored_data = monitor_over_time(current_data, 'overall_quality')

fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(monitored_data['timestamp'], monitored_data['overall_quality'], 'k.', alpha=0.1, label='Raw Score')
ax.plot(monitored_data['timestamp'], monitored_data['overall_quality_rolling_mean'], 'b-', label='Rolling Mean')
ax.fill_between(monitored_data['timestamp'], 
                monitored_data['overall_quality_rolling_mean'] - 2 * monitored_data['overall_quality_rolling_std'], 
                monitored_data['overall_quality_rolling_mean'] + 2 * monitored_data['overall_quality_rolling_std'], 
                color='blue', alpha=0.2, label='±2 Std Dev')

ax.set_title('Time Series Monitoring for Quality Score Drift')
ax.set_xlabel('Timestamp')
ax.set_ylabel('Overall Quality Score')
ax.grid(True)

# Add a horizontal line for the baseline mean
baseline_mean = baseline_data['overall_quality'].mean()
ax.axhline(baseline_mean, color='red', linestyle='--', label='Baseline Mean')
ax.legend()
plt.show()


## Section 5: Automated Drift Alerts

This section outlines a simple alerting mechanism that could be integrated into a production monitoring system.
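
For integration with a production system, it can help to carry alerts as structured records rather than formatted strings, and to send them through pluggable sinks. The sketch below is one way that could look. `DriftAlert`, `build_alerts`, `dispatch`, and `stdout_sink` are hypothetical names, and a real sink (for a chat or paging tool) would replace the print-based one.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class DriftAlert:
    """Hypothetical structured alert record."""
    metric: str
    p_value: float
    threshold: float

    @property
    def message(self) -> str:
        return f"DRIFT DETECTED in '{self.metric}' (p-value: {self.p_value:.4f})"


AlertSink = Callable[[DriftAlert], None]


def build_alerts(p_values: Dict[str, float], threshold: float = 0.05) -> List[DriftAlert]:
    """Create one alert per metric whose p-value falls below the threshold."""
    return [DriftAlert(metric, p, threshold)
            for metric, p in p_values.items() if p < threshold]


def dispatch(alerts: Iterable[DriftAlert], sinks: Iterable[AlertSink]) -> None:
    """Send every alert to every sink."""
    sinks = list(sinks)
    for alert in alerts:
        for sink in sinks:
            sink(alert)


def stdout_sink(alert: DriftAlert) -> None:
    """Stand-in sink that prints the alert as one JSON line."""
    print(json.dumps(asdict(alert)))


# Illustrative p-values; in practice these would come from the drift tests
collected: List[DriftAlert] = []
p_values = {"paper_length": 0.0001, "overall_quality": 0.2, "latency_seconds": 0.03}
dispatch(build_alerts(p_values), [stdout_sink, collected.append])
```

Keeping the sink a plain callable means the detection logic does not need to know which alerting tool is in use.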


In [None]:
class DriftDetector:
    def __init__(self, baseline_df: pd.DataFrame, p_value_threshold: float = 0.05):
        self.baseline_df = baseline_df
        self.p_value_threshold = p_value_threshold
        print(f"Drift detector initialized with baseline. p-value threshold: {p_value_threshold}")

    def check(self, current_df: pd.DataFrame) -> List[str]:
        """Checks for drift in key metrics and returns a list of alerts."""
        alerts = []
        metrics_to_check = ['paper_length', 'overall_quality', 'latency_seconds']

        for metric in metrics_to_check:
            # Request the KS test explicitly, matching the call used in plot_drift
            _, p_value = compare_distributions(self.baseline_df[metric], current_df[metric], test='ks')
            if p_value < self.p_value_threshold:
                alert_msg = f"🚨 DRIFT DETECTED in '{metric}' (p-value: {p_value:.4f})"
                print(alert_msg)
                alerts.append(alert_msg)
        
        if not alerts:
            print("✅ No significant drift detected.")
            
        return alerts

# Initialize the detector with the baseline data
detector = DriftDetector(baseline_data)

print("\n--- Checking for drift in the first half (no drift) ---")
alerts_no_drift = detector.check(current_baseline)

print("\n--- Checking for drift in the second half (drift) ---")
alerts_drift = detector.check(current_drifted)
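
Because `check` tests three metrics against the same threshold, the chance of at least one false alarm per run is higher than the threshold suggests: roughly 14% at 0.05 if the tests were independent (1 - 0.95^3). One common adjustment, which the detector above does not apply, is the Bonferroni correction, dividing the threshold by the number of tests. A minimal sketch with a hypothetical helper `bonferroni_flags` and made-up p-values:

```python
from typing import Dict, List


def bonferroni_flags(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Hypothetical helper: return the metrics whose p-value falls below
    the Bonferroni-adjusted threshold alpha / number_of_tests."""
    if not p_values:
        return []
    adjusted_alpha = alpha / len(p_values)
    return [metric for metric, p in p_values.items() if p < adjusted_alpha]


# Illustrative p-values only; not results from this notebook
example_p_values = {
    "paper_length": 0.001,
    "overall_quality": 0.03,
    "latency_seconds": 0.20,
}
flagged = bonferroni_flags(example_p_values)
print(flagged)
```

Here `overall_quality` would trip an unadjusted 0.05 threshold but not the adjusted one (0.05 / 3), which trades some sensitivity for fewer spurious alerts.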


## Section 6: Remediation Strategies

When drift is detected, several actions can be taken:

| Drift Type | Potential Causes | Remediation Strategies |
| :--- | :--- | :--- |
| **Data Drift** | New data sources, changing topics, different document formats | - **Update Prompts**: Modify prompts to handle new data characteristics.<br>- **Retrain/Fine-tune**: If using a fine-tuned model, retrain on recent data.<br>- **Data Validation**: Add stricter validation at the input layer. |
| **Model Drift** | Data drift, concept drift (meaning of quality changes), model staleness | - **Prompt Engineering**: A/B test new prompts to improve performance.<br>- **Model Upgrade**: Evaluate a newer, more capable model.<br>- **Human-in-the-Loop**: Collect human feedback on recent outputs to understand the failure mode. |
| **Performance Drift** | API provider issues, larger inputs/outputs, inefficient code | - **Optimize Prompts**: Reduce token usage.<br>- **Caching**: Implement caching for common requests.<br>- **Infrastructure Scaling**: Increase resources if it's a bottleneck. |
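
To make alerts easier to act on, the table above could be encoded as a lookup that attaches suggested actions to each drifted metric. The sketch below assumes the metric-to-drift-type mapping used in the simulation (`paper_length` for data drift, `overall_quality` for model drift, `latency_seconds` for performance drift). The names `METRIC_TO_DRIFT_TYPE`, `REMEDIATIONS`, and `suggest_remediations` are hypothetical.

```python
from typing import Dict, Iterable, List

# Assumed mapping, following the metrics simulated in this notebook
METRIC_TO_DRIFT_TYPE: Dict[str, str] = {
    "paper_length": "Data Drift",
    "overall_quality": "Model Drift",
    "latency_seconds": "Performance Drift",
}

# Short labels for the strategies listed in the remediation table
REMEDIATIONS: Dict[str, List[str]] = {
    "Data Drift": ["Update prompts", "Retrain or fine-tune", "Add input validation"],
    "Model Drift": ["A/B test new prompts", "Evaluate a newer model", "Collect human feedback"],
    "Performance Drift": ["Optimize prompts", "Add caching", "Scale infrastructure"],
}


def suggest_remediations(drifted_metrics: Iterable[str]) -> Dict[str, List[str]]:
    """Map each known drifted metric to its suggested actions.

    Metrics without a known drift type are skipped."""
    suggestions: Dict[str, List[str]] = {}
    for metric in drifted_metrics:
        drift_type = METRIC_TO_DRIFT_TYPE.get(metric)
        if drift_type is not None:
            suggestions[metric] = REMEDIATIONS[drift_type]
    return suggestions


print(suggest_remediations(["overall_quality", "unknown_metric"]))
```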


## Conclusion

This notebook provides a framework for monitoring and detecting drift in the Self-Critique pipeline. Key takeaways:

1. **Multi-faceted Monitoring**: It's crucial to monitor data, model quality, and system performance metrics.
2. **Statistical Rigor**: Statistical tests like the Kolmogorov-Smirnov test provide a principled way to detect distribution shifts. The test statistic, rather than the p-value alone, indicates how large a shift is.
3. **Automation is Key**: Automated monitoring and alerting are essential for catching issues before they impact users.
4. **Have a Plan**: A clear set of remediation strategies is necessary to act on drift alerts effectively.

### Next Steps

1. **Integrate with a Monitoring Stack**: Export these metrics to a system like Prometheus and build dashboards in Grafana.
2. **Set Up Automated Alerting**: Connect the `DriftDetector` to an alerting system like PagerDuty or Slack.
3. **Establish a Re-evaluation Cadence**: Schedule regular, automated re-evaluation of the pipeline against the benchmark dataset from `model_evaluation_qa.ipynb`.