# 📊 Monitoring & Metrics Analysis: KubeSentiment Observability

This notebook explores the comprehensive monitoring and observability stack of KubeSentiment, including metrics collection, alerting, and performance analysis.

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand the three pillars of observability (metrics, logs, traces)
2. Explore Prometheus metrics and Grafana dashboards
3. Analyze alerting rules and incident management
4. Perform data drift detection and model monitoring
5. Understand health checks and readiness probes
6. Create custom monitoring dashboards and alerts

## 📈 Three Pillars of Observability

### 1. **Metrics** - Quantitative Measurements
- **System Metrics**: CPU, memory, disk, network
- **Application Metrics**: Response times, error rates, throughput
- **Business Metrics**: Prediction accuracy, user satisfaction
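
To make these metric categories concrete, here is a minimal sketch of how counter samples look in the Prometheus text exposition format. The helper `render_counter` and the metric name `sentiment_requests_total` are hypothetical, chosen only for illustration; a real service would normally rely on a Prometheus client library rather than format lines by hand.

```python
def render_counter(name, help_text, samples):
    """Render counter samples in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        # Series without labels are written without braces
        series = name + "{" + label_str + "}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and values, for illustration only
exposition = render_counter(
    "sentiment_requests_total",
    "Total prediction requests (hypothetical metric).",
    {
        (("status", "200"),): 1027,
        (("status", "500"),): 3,
    },
)
print(exposition)
```

Each series is one line: the metric name, an optional label set in braces, and the current value. The `status` label is what lets an error-rate query separate 5xx responses from the rest.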

### 2. **Logs** - Discrete Events
- **Structured Logging**: JSON format with correlation IDs
- **Log Levels**: DEBUG, INFO, WARNING, ERROR, CRITICAL
- **Centralized Collection**: Elasticsearch + Kibana or Loki
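
As a rough illustration of structured logging with correlation IDs, the sketch below uses only the standard library. The `JsonFormatter` class, the logger name, and the field names are assumptions made for this example, not KubeSentiment's actual logging setup.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # correlation_id is attached per call through `extra`
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


logger = logging.getLogger("kubesentiment.demo")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

# One ID per request lets a log backend group all lines for that request
correlation_id = str(uuid.uuid4())
logger.info("prediction served", extra={"correlation_id": correlation_id})
```

Because every line is valid JSON, a collector such as Loki or Elasticsearch can index `correlation_id` and retrieve all events belonging to a single request.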

### 3. **Traces** - Request Flow
- **Distributed Tracing**: OpenTelemetry integration
- **Request Correlation**: Track requests across services
- **Performance Analysis**: Identify bottlenecks
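
The idea behind request correlation can be sketched without any tracing library. The example below is a simplified stand-in, not OpenTelemetry: the `span` helper, the `recorded_spans` list, and the span names are all hypothetical. It shows how nested units of work share one trace ID and how per-span timings expose the slowest step.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Trace ID shared by every span recorded while handling one request
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)
recorded_spans = []


@contextmanager
def span(name):
    """Time a block of work and record it under the active trace ID."""
    trace_id = current_trace_id.get()
    token = None
    if trace_id is None:
        # First span of a request: start a new trace
        trace_id = uuid.uuid4().hex
        token = current_trace_id.set(trace_id)
    start = time.perf_counter()
    try:
        yield trace_id
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        recorded_spans.append(
            {"trace_id": trace_id, "name": name, "duration_ms": duration_ms}
        )
        if token is not None:
            # Root span finished: clear the trace for the next request
            current_trace_id.reset(token)


with span("handle_request"):
    with span("tokenize"):
        time.sleep(0.01)
    with span("model_inference"):
        time.sleep(0.02)

for s in recorded_spans:
    print(f"{s['trace_id'][:8]}  {s['name']:<16} {s['duration_ms']:.1f} ms")
```

Child spans finish before the root, so the root span is recorded last. A real tracing setup would also propagate the trace ID across service boundaries, which this sketch does not attempt.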

### KubeSentiment Monitoring Stack
```
Application Layer
    ↓ (Metrics)
Prometheus ← Grafana
    ↓ (Alerts)
Alertmanager → Slack/Email
    ↓ (Logs)
Loki/Elasticsearch ← Grafana/Kibana
```
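
Grafana and ad hoc scripts both read from Prometheus through its HTTP API. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and parses a response of the instant-vector shape. The helper names and the sample payload are illustrative assumptions; the payload is hand-written, not captured from a running server, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, float value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    # Each sample is [unix_timestamp, "value as string"]
    return [(item["metric"], float(item["value"][1])) for item in results]


# Hand-written sample response with a hypothetical job label
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "kubesentiment"}, "value": [1700000000, "42.5"]}
    ]
  }
}
""")

url = build_query_url("http://localhost:9090", "sum(rate(http_requests_total[5m]))")
samples = parse_instant_vector(sample_response)
print(url)
print(samples)
```

Sample values arrive as strings, so they need an explicit `float` conversion before any arithmetic.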

In [None]:
# Setup and imports
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime, timedelta
import requests
import time
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('default')
sns.set_palette("husl")

# Configuration
PROMETHEUS_URL = "http://localhost:9090"  # Default Prometheus port
API_BASE_URL = "http://localhost:8000"    # KubeSentiment API

print("✅ Libraries imported successfully!")
print(f"📊 Prometheus URL: {PROMETHEUS_URL}")
print(f"🌐 API URL: {API_BASE_URL}")

## 📋 Loading Monitoring Configuration

Let's examine the monitoring configuration files and understand the alerting rules.
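
For orientation, a Prometheus rules file parsed with `yaml.safe_load` becomes a nested dict of groups and rules. The sketch below uses a hand-written, hypothetical rule group (not the contents of `prometheus-rules.yaml`) to show the fields an alerting rule usually carries and how to walk them.

```python
# Hypothetical rule group, written as the dict that yaml.safe_load would return
example_rules = {
    "groups": [
        {
            "name": "kubesentiment-example",
            "rules": [
                {
                    "alert": "HighErrorRate",
                    "expr": (
                        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
                        "/ sum(rate(http_requests_total[5m])) > 0.05"
                    ),
                    "for": "5m",
                    "labels": {"severity": "critical"},
                    "annotations": {"summary": "Error rate above 5%"},
                },
                # Recording rules use `record` instead of `alert`
                {
                    "record": "job:http_requests:rate5m",
                    "expr": "sum(rate(http_requests_total[5m]))",
                },
            ],
        }
    ]
}


def list_alerts(rules):
    """Return (group, alert, severity, for) tuples for every alerting rule."""
    found = []
    for group in rules.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # skip recording rules
            severity = rule.get("labels", {}).get("severity", "unspecified")
            found.append((group.get("name"), rule["alert"], severity, rule.get("for")))
    return found


print(list_alerts(example_rules))
```

The `for` field is the time the expression must stay true before the alert fires, and `labels.severity` is what Alertmanager routes typically match on.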

In [None]:
# Load monitoring configurations
def load_monitoring_configs():
    """Load monitoring configuration files."""
    configs = {}
    config_dir = "../config"

    # Import once up front so every loader below can rely on it
    try:
        import yaml
    except ImportError:
        print("⚠️ PyYAML not available")
        return configs

    # (config key, file name, label used in status messages)
    config_files = [
        ("prometheus_rules", "prometheus-rules.yaml", "Prometheus rules"),
        ("alertmanager", "alertmanager-config.yaml", "Alertmanager config"),
        ("environments", "environments.yaml", "Environment configs"),
    ]

    for key, filename, label in config_files:
        path = f"{config_dir}/{filename}"
        if not os.path.exists(path):
            continue
        try:
            with open(path, 'r') as f:
                configs[key] = yaml.safe_load(f)
            print(f"✅ {label} loaded")
        except (OSError, yaml.YAMLError) as exc:
            # Report the failure instead of silently skipping the file
            print(f"⚠️ Could not load {label}: {exc}")

    return configs

# Load configurations
monitoring_configs = load_monitoring_configs()

print("\n📊 Monitoring Configuration Analysis:")
print("=" * 50)

# Analyze Prometheus rules
if 'prometheus_rules' in monitoring_configs:
    rules = monitoring_configs['prometheus_rules']
    print("🚨 Prometheus Alerting Rules:")
    
    if 'groups' in rules:
        for group in rules['groups']:
            print(f"\n📋 Group: {group.get('name', 'Unknown')}")
            
            if 'rules' in group:
                for rule in group['rules']:
                    if 'alert' in rule:
                        print(f"   🚨 {rule['alert']}")
                        if 'expr' in rule:
                            expr = rule['expr']
                            if len(expr) > 60:
                                expr = expr[:57] + "..."
                            print(f"      Expression: {expr}")
                        if 'for' in rule:
                            print(f"      Duration: {rule['for']}")
                        if 'labels' in rule and 'severity' in rule['labels']:
                            print(f"      Severity: {rule['labels']['severity']}")
    else:
        print("   No rule groups found")

# Analyze Alertmanager config
if 'alertmanager' in monitoring_configs:
    am_config = monitoring_configs['alertmanager']
    print("\n📢 Alertmanager Configuration:")
    
    if 'route' in am_config:
        route = am_config['route']
        print(f"   📋 Default receiver: {route.get('receiver', 'unknown')}")
        
        if 'routes' in route:
            print(f"   🔀 Routing rules: {len(route['routes'])}")
    
    if 'receivers' in am_config:
        receivers = am_config['receivers']
        print(f"   📞 Receivers configured: {len(receivers)}")
        
        for receiver in receivers:
            name = receiver.get('name', 'unknown')
            if 'slack_configs' in receiver:
                print(f"      💬 Slack: {name}")
            if 'email_configs' in receiver:
                print(f"      📧 Email: {name}")
            if 'pagerduty_configs' in receiver:
                print(f"      🚨 PagerDuty: {name}")

# Sample alerting rules for demonstration
sample_alerts = {
    "HighErrorRate": {
        "description": "Error rate above 5% for 5 minutes",
        "severity": "critical",
        # Sum both sides so the ratio compares 5xx traffic with all traffic
        "query": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05"
    },
    "HighLatency": {
        "description": "95th percentile latency above 500ms for 10 minutes",
        "severity": "warning",
        "query": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[10m]))) > 0.5"
    },
    "LowThroughput": {
        "description": "Request rate below 10 RPS for 15 minutes",
        "severity": "warning",
        "query": "sum(rate(http_requests_total[15m])) < 10"
    },
    "HighMemoryUsage": {
        "description": "Memory usage above 90% for 5 minutes",
        "severity": "critical",
        "query": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9"
    },
    "ModelAccuracyDrop": {
        "description": "Model accuracy drops below 85%",
        "severity": "critical",
        "query": "ml_model_accuracy < 0.85"
    }
}

print("\n🚨 Sample Alerting Rules:")
for alert_name, alert_info in sample_alerts.items():
    print(f"\n📢 {alert_name} ({alert_info['severity'].upper()})")
    print(f"   📝 {alert_info['description']}")
    print(f"   🔍 Query: {alert_info['query'][:80]}..." if len(alert_info['query']) > 80 else f"   🔍 Query: {alert_info['query']}")

## 📊 Generating Sample Metrics Data

Let's generate synthetic monitoring data, with a daily traffic cycle and occasional random spikes, to exercise the analysis cells that follow.

In [None]:
# Generate sample monitoring data
def generate_monitoring_data(hours=24):
    """Generate sample monitoring data for analysis."""
    
    # Create time series data
    timestamps = pd.date_range(
        start=datetime.now() - timedelta(hours=hours), 
        end=datetime.now(), 
        freq='5min'
    )
    
    data = []
    np.random.seed(42)
    
    for ts in timestamps:
        # Base metrics with some variability
        base_load = 50 + 30 * np.sin(2 * np.pi * (ts.hour / 24))  # Daily pattern
        
        # Add some anomalies
        anomaly_multiplier = 1.0
        if ts.hour in [2, 3, 14, 15]:  # Low traffic periods
            anomaly_multiplier = 0.3
        elif np.random.random() < 0.05:  # 5% chance of spike
            anomaly_multiplier = np.random.uniform(1.5, 3.0)
        
        # HTTP metrics
        requests_per_second = base_load * anomaly_multiplier + np.random.normal(0, 5)
        requests_per_second = max(0, requests_per_second)
        
        error_rate = np.random.beta(2, 98) * 100  # Low error rate, occasional spikes
        if np.random.random() < 0.02:  # 2% chance of error spike
            error_rate = np.random.uniform(5, 15)
        
        # Latency metrics (in milliseconds)
        p50_latency = 45 + (requests_per_second / 10) + np.random.normal(0, 5)
        p95_latency = p50_latency * (1.5 + np.random.exponential(0.5))
        p99_latency = p95_latency * (1.3 + np.random.exponential(0.3))
        
        # System metrics
        cpu_usage = 20 + (requests_per_second / 5) + np.random.normal(0, 3)
        cpu_usage = min(95, max(5, cpu_usage))
        
        memory_usage = 45 + (requests_per_second / 8) + np.random.normal(0, 2)
        memory_usage = min(90, max(30, memory_usage))
        
        # Model metrics
        model_accuracy = 0.91 + np.random.normal(0, 0.005)
        model_accuracy = max(0.85, min(0.95, model_accuracy))
        
        # Inference time
        inference_time = 42 + (cpu_usage / 10) + np.random.normal(0, 3)
        inference_time = max(20, inference_time)
        
        # Cache metrics
        cache_hit_rate = 0.75 + np.random.normal(0, 0.05)
        cache_hit_rate = max(0.6, min(0.9, cache_hit_rate))
        
        data.append({
            "timestamp": ts,
            "requests_per_second": round(requests_per_second, 2),
            "error_rate_percent": round(error_rate, 3),
            "p50_latency_ms": round(p50_latency, 2),
            "p95_latency_ms": round(p95_latency, 2),
            "p99_latency_ms": round(p99_latency, 2),
            "cpu_usage_percent": round(cpu_usage, 2),
            "memory_usage_percent": round(memory_usage, 2),
            "model_accuracy": round(model_accuracy, 4),
            "inference_time_ms": round(inference_time, 2),
            "cache_hit_rate": round(cache_hit_rate, 3),
            # Clamp at zero: noise can push the count negative when traffic is low
            "active_connections": max(0, int(requests_per_second * 2 + np.random.normal(0, 5)))
        })
    
    return pd.DataFrame(data)

# Generate sample data
monitoring_df = generate_monitoring_data(hours=24)
monitoring_df.set_index('timestamp', inplace=True)

print("📊 Sample Monitoring Data Generated:")
print("=" * 50)
print(f"📋 Data points: {len(monitoring_df)}")
print(f"⏱️ Time range: {monitoring_df.index.min()} to {monitoring_df.index.max()}")
print(f"📊 Metrics collected: {len(monitoring_df.columns)}")

print("\n🔍 Data Preview:")
display(monitoring_df.head())

print("\n📈 Summary Statistics:")
display(monitoring_df.describe())

## 📈 Time Series Analysis

Let's analyze the monitoring data over time to understand system behavior and identify patterns.
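
One simple way to surface unusual points in a series, before reading the charts, is a rolling z-score: compare each sample with the mean and spread of the preceding window. The sketch below is an illustrative technique using only the standard library, with made-up latency values; the function name, window size, and threshold are assumptions, not part of the notebook's pipeline.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_flags(values, window=12, z_threshold=3.0):
    """Flag points that deviate strongly from the trailing window's mean."""
    history = deque(maxlen=window)
    flags = []
    for value in values:
        if len(history) < window:
            is_outlier = False  # not enough history yet
        else:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma == 0:
                # A flat window has no spread, so any change counts as unusual
                is_outlier = value != mu
            else:
                is_outlier = abs(value - mu) / sigma > z_threshold
        flags.append(is_outlier)
        history.append(value)
    return flags


# Made-up P95 latency samples in milliseconds, one spike near the end
latency_ms = [50, 52, 49, 51, 50, 48, 52, 51, 49, 50, 51, 50, 180, 51]
flags = rolling_zscore_flags(latency_ms, window=12)
print([i for i, flagged in enumerate(flags) if flagged])
```

With 5-minute samples, a window of 12 covers the previous hour. A spike inflates the window's spread once it enters the history, so the points right after it are judged more leniently.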

In [None]:
# Time series analysis and visualization
fig, axes = plt.subplots(4, 2, figsize=(16, 20))
fig.suptitle('KubeSentiment Monitoring Dashboard - 24 Hour Overview', fontsize=16, fontweight='bold')

# 1. Request throughput over time
axes[0, 0].plot(monitoring_df.index, monitoring_df['requests_per_second'], linewidth=2, color='blue')
axes[0, 0].set_title('Request Throughput (RPS)')
axes[0, 0].set_ylabel('Requests/sec')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].axhline(monitoring_df['requests_per_second'].mean(), color='red', linestyle='--', alpha=0.7,
                   label=f'Mean: {monitoring_df["requests_per_second"].mean():.1f}')
axes[0, 0].legend()

# 2. Error rate over time
axes[0, 1].plot(monitoring_df.index, monitoring_df['error_rate_percent'], linewidth=2, color='red')
axes[0, 1].set_title('Error Rate (%)')
axes[0, 1].set_ylabel('Error Rate (%)')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].axhline(5, color='orange', linestyle='--', alpha=0.7, label='Alert Threshold: 5%')
axes[0, 1].legend()

# 3. Latency percentiles
axes[1, 0].plot(monitoring_df.index, monitoring_df['p50_latency_ms'], label='P50', linewidth=2, color='green')
axes[1, 0].plot(monitoring_df.index, monitoring_df['p95_latency_ms'], label='P95', linewidth=2, color='orange')
axes[1, 0].plot(monitoring_df.index, monitoring_df['p99_latency_ms'], label='P99', linewidth=2, color='red')
axes[1, 0].set_title('Response Latency Percentiles')
axes[1, 0].set_ylabel('Latency (ms)')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].axhline(500, color='red', linestyle='--', alpha=0.7, label='Alert Threshold')
axes[1, 0].legend()  # called after axhline so the threshold label appears in the legend

# 4. System resource usage
axes[1, 1].plot(monitoring_df.index, monitoring_df['cpu_usage_percent'], label='CPU', linewidth=2, color='purple')
axes[1, 1].plot(monitoring_df.index, monitoring_df['memory_usage_percent'], label='Memory', linewidth=2, color='brown')
axes[1, 1].set_title('System Resource Usage')
axes[1, 1].set_ylabel('Usage (%)')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axhline(80, color='red', linestyle='--', alpha=0.7, label='High Usage Threshold')
axes[1, 1].legend()

# 5. Model performance metrics
axes[2, 0].plot(monitoring_df.index, monitoring_df['model_accuracy'] * 100, linewidth=2, color='darkgreen')
axes[2, 0].set_title('Model Accuracy Over Time')
axes[2, 0].set_ylabel('Accuracy (%)')
axes[2, 0].grid(True, alpha=0.3)
axes[2, 0].axhline(85, color='red', linestyle='--', alpha=0.7, label='Accuracy Threshold')
axes[2, 0].legend()

# 6. Inference performance
axes[2, 1].plot(monitoring_df.index, monitoring_df['inference_time_ms'], linewidth=2, color='teal')
axes[2, 1].set_title('Model Inference Time')
axes[2, 1].set_ylabel('Time (ms)')
axes[2, 1].grid(True, alpha=0.3)
axes[2, 1].axhline(100, color='red', linestyle='--', alpha=0.7, label='Latency Threshold')
axes[2, 1].legend()

# 7. Cache performance
axes[3, 0].plot(monitoring_df.index, monitoring_df['cache_hit_rate'] * 100, linewidth=2, color='magenta')
axes[3, 0].set_title('Cache Hit Rate')
axes[3, 0].set_ylabel('Hit Rate (%)')
axes[3, 0].grid(True, alpha=0.3)
axes[3, 0].axhline(70, color='green', linestyle='--', alpha=0.7, label='Good Performance')
axes[3, 0].legend()

# 8. Active connections
axes[3, 1].plot(monitoring_df.index, monitoring_df['active_connections'], linewidth=2, color='olive')
axes[3, 1].set_title('Active Connections')
axes[3, 1].set_ylabel('Connections')
axes[3, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate key metrics
print("📊 Key Performance Metrics (24h):")
print("=" * 50)

metrics_summary = {
    "Average RPS": f"{monitoring_df['requests_per_second'].mean():.1f}",
    "Peak RPS": f"{monitoring_df['requests_per_second'].max():.1f}",
    "Average Error Rate": f"{monitoring_df['error_rate_percent'].mean():.2f}%",
    "Max Error Rate": f"{monitoring_df['error_rate_percent'].max():.2f}%",
    "Average P95 Latency": f"{monitoring_df['p95_latency_ms'].mean():.1f}ms",
    "Max P95 Latency": f"{monitoring_df['p95_latency_ms'].max():.1f}ms",
    "Average CPU Usage": f"{monitoring_df['cpu_usage_percent'].mean():.1f}%",
    "Average Memory Usage": f"{monitoring_df['memory_usage_percent'].mean():.1f}%",
    "Model Accuracy": f"{monitoring_df['model_accuracy'].mean():.1%}",
    "Cache Hit Rate": f"{monitoring_df['cache_hit_rate'].mean():.1%}",
    "Average Inference Time": f"{monitoring_df['inference_time_ms'].mean():.1f}ms"
}

for metric, value in metrics_summary.items():
    print(f"{metric:<25}: {value}")

# Identify potential issues
issues = []
if monitoring_df['error_rate_percent'].max() > 5:
    issues.append("High error rate detected")
if monitoring_df['p95_latency_ms'].max() > 500:
    issues.append("High latency spikes detected")
if monitoring_df['cpu_usage_percent'].max() > 80:
    issues.append("High CPU usage detected")
if monitoring_df['memory_usage_percent'].max() > 85:
    issues.append("High memory usage detected")
if monitoring_df['model_accuracy'].min() < 0.85:
    issues.append("Model accuracy degradation detected")

if issues:
    print(f"\n🚨 Potential Issues Detected ({len(issues)}):")
    for issue in issues:
        print(f"   • {issue}")
else:
    print("\n✅ No critical issues detected in the monitoring period")

## 🚨 Alert Analysis and Incident Detection

Let's check the monitoring data against the sample alert thresholds, sample by sample, and plot a timeline for each alert that would have triggered.
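
The rule evaluation in this section treats every sample on its own, while the sample rule descriptions mention durations such as "for 5 minutes". The sketch below approximates that behaviour by only reporting runs where the condition held for a minimum number of consecutive samples. The function name and the example values are hypothetical, and this is a simplification of how Prometheus handles a `for` clause.

```python
def sustained_breaches(breaches, min_consecutive):
    """Return (start, end) index pairs of runs that lasted long enough to fire.

    `breaches` is a sequence of booleans, one per sample. A run is reported
    only if the condition held for at least `min_consecutive` samples in a row.
    """
    intervals = []
    run_start = None
    # The trailing False acts as a sentinel that closes a run at the end
    for i, breached in enumerate(list(breaches) + [False]):
        if breached and run_start is None:
            run_start = i
        elif not breached and run_start is not None:
            if i - run_start >= min_consecutive:
                intervals.append((run_start, i - 1))
            run_start = None
    return intervals


# Made-up error rates in percent; only the second excursion is sustained
error_rate = [1.2, 6.1, 1.0, 5.5, 7.2, 8.0, 6.4, 2.0]
breaches = [value > 5 for value in error_rate]
print(sustained_breaches(breaches, min_consecutive=3))
```

The single-sample spike at index 1 is ignored, which is the point of a duration: it keeps brief blips from paging anyone.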

In [None]:
# Alert analysis and incident detection
def analyze_alerts(monitoring_df, alert_rules):
    """Analyze monitoring data against alert rules."""
    
    alerts_triggered = []
    
    for alert_name, rule in alert_rules.items():
        severity = rule['severity']
        description = rule['description']
        
        # Simple per-sample rule evaluation (a real deployment would evaluate the PromQL query and its `for` duration)
        violations = []
        
        if alert_name == "HighErrorRate":
            violations = monitoring_df[monitoring_df['error_rate_percent'] > 5].index.tolist()
        elif alert_name == "HighLatency":
            violations = monitoring_df[monitoring_df['p95_latency_ms'] > 500].index.tolist()
        elif alert_name == "LowThroughput":
            violations = monitoring_df[monitoring_df['requests_per_second'] < 10].index.tolist()
        elif alert_name == "HighMemoryUsage":
            violations = monitoring_df[monitoring_df['memory_usage_percent'] > 90].index.tolist()
        elif alert_name == "ModelAccuracyDrop":
            violations = monitoring_df[monitoring_df['model_accuracy'] < 0.85].index.tolist()
        
        if violations:
            alerts_triggered.append({
                "alert_name": alert_name,
                "severity": severity,
                "description": description,
                "violations_count": len(violations),
                "first_violation": violations[0],
                "last_violation": violations[-1],
                "violation_percentage": len(violations) / len(monitoring_df) * 100
            })
    
    return alerts_triggered

# Analyze alerts
alerts_detected = analyze_alerts(monitoring_df, sample_alerts)

print("🚨 Alert Analysis Results:")
print("=" * 50)

if alerts_detected:
    print(f"📢 {len(alerts_detected)} alerts would have been triggered:")
    
    for alert in alerts_detected:
        severity_color = "🔴" if alert['severity'] == 'critical' else "🟠" if alert['severity'] == 'warning' else "🟢"
        print(f"\n{severity_color} {alert['alert_name']} ({alert['severity'].upper()})")
        print(f"   📝 {alert['description']}")
        print(f"   🔢 Violations: {alert['violations_count']} times")
        print(f"   📊 Violation rate: {alert['violation_percentage']:.1f}%")
        print(f"   🕒 First: {alert['first_violation']}")
        print(f"   🕒 Last: {alert['last_violation']}")

    # Alert severity distribution
    severity_counts = {}
    for alert in alerts_detected:
        severity_counts[alert['severity']] = severity_counts.get(alert['severity'], 0) + 1
    
    print(f"\n📈 Alert Severity Distribution:")
    for severity, count in severity_counts.items():
        print(f"   {severity.upper()}: {count} alerts")
    
else:
    print("✅ No alerts would have been triggered in this period")

# Create alert timeline visualization
if alerts_detected:
    fig, axes = plt.subplots(len(alerts_detected), 1, figsize=(12, 4 * len(alerts_detected)))
    fig.suptitle('Alert Timeline Analysis', fontsize=16)
    
    if len(alerts_detected) == 1:
        axes = [axes]
    
    for i, alert in enumerate(alerts_detected):
        ax = axes[i]
        
        # Plot the relevant metric; `below` marks alerts that fire when the value drops under the threshold
        below = False
        if alert['alert_name'] == "HighErrorRate":
            metric_data = monitoring_df['error_rate_percent']
            threshold = 5
            ylabel = 'Error Rate (%)'
        elif alert['alert_name'] == "HighLatency":
            metric_data = monitoring_df['p95_latency_ms']
            threshold = 500
            ylabel = 'P95 Latency (ms)'
        elif alert['alert_name'] == "LowThroughput":
            metric_data = monitoring_df['requests_per_second']
            threshold = 10
            ylabel = 'Requests/sec'
            below = True
        elif alert['alert_name'] == "HighMemoryUsage":
            metric_data = monitoring_df['memory_usage_percent']
            threshold = 90
            ylabel = 'Memory Usage (%)'
        elif alert['alert_name'] == "ModelAccuracyDrop":
            metric_data = monitoring_df['model_accuracy'] * 100
            threshold = 85
            ylabel = 'Model Accuracy (%)'
            below = True
        
        # Shade the side of the threshold on which the alert actually fires
        breach = (metric_data < threshold) if below else (metric_data > threshold)
        
        ax.plot(monitoring_df.index, metric_data, linewidth=2, color='blue', label='Actual Value')
        ax.axhline(threshold, color='red', linestyle='--', linewidth=2, 
                   label=f'Threshold: {threshold}')
        ax.fill_between(monitoring_df.index, metric_data, threshold, 
                       where=breach, color='red', alpha=0.3, label='Alert Zone')
        
        ax.set_title(f"{alert['alert_name']} - {alert['description']}")
        ax.set_ylabel(ylabel)
        ax.grid(True, alpha=0.3)
        ax.legend()
        
        # Rotate x-axis labels
        plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()

# Simulate real-time alerting
def simulate_real_time_monitoring(monitoring_df, check_interval_minutes=5):
    """Simulate real-time monitoring and alerting."""
    
    print("\n📊 Real-Time Monitoring Simulation:")
    print("=" * 50)
    
    # Check last few data points
    recent_data = monitoring_df.tail(12)  # Last hour (12 * 5min intervals)
    
    active_alerts = []
    
    # Check each alert rule
    latest_values = recent_data.iloc[-1]
    
    if latest_values['error_rate_percent'] > 5:
        active_alerts.append("High Error Rate")
    
    if latest_values['p95_latency_ms'] > 500:
        active_alerts.append("High Latency")
    
    if latest_values['requests_per_second'] < 10:
        active_alerts.append("Low Throughput")
    
    if latest_values['memory_usage_percent'] > 90:
        active_alerts.append("High Memory Usage")
    
    if latest_values['model_accuracy'] < 0.85:
        active_alerts.append("Model Accuracy Drop")
    
    if active_alerts:
        print(f"🚨 ACTIVE ALERTS ({len(active_alerts)}):")
        for alert in active_alerts:
            print(f"   • {alert}")
        
        print("\n📢 Recommended Actions:")
        if "High Error Rate" in active_alerts:
            print("   • Check application logs for error patterns")
            print("   • Review recent deployments or configuration changes")
        if "High Latency" in active_alerts:
            print("   • Check system resource utilization")
            print("   • Consider scaling up instance size")
        if "Low Throughput" in active_alerts:
            print("   • Investigate potential service outages")
            print("   • Check network connectivity")
        if "High Memory Usage" in active_alerts:
            print("   • Monitor for memory leaks")
            print("   • Consider increasing instance memory")
        if "Model Accuracy Drop" in active_alerts:
            print("   • Check for data drift")
            print("   • Consider model retraining")
    else:
        print("✅ All systems operating normally")
        print(f"   📊 Current RPS: {latest_values['requests_per_second']:.1f}")
        print(f"   📈 Current P95 Latency: {latest_values['p95_latency_ms']:.1f}ms")
        print(f"   🎯 Model Accuracy: {latest_values['model_accuracy']:.1%}")

simulate_real_time_monitoring(monitoring_df)

## 🔍 Data Drift Detection

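Let's implement data drift detection by comparing a reference window with a current window, using a Kolmogorov-Smirnov test, a t-test, and a normalized shift in mean and standard deviation.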
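
To show what the Kolmogorov-Smirnov statistic measures, here is a minimal standard-library sketch that computes the largest gap between two empirical CDFs. It is written for illustration only; the function name and sample values are hypothetical, and in practice `scipy.stats.ks_2samp` also provides the p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Made-up accuracy samples for illustration
reference = [0.90, 0.91, 0.92, 0.91, 0.90]
shifted = [0.80, 0.81, 0.82, 0.81, 0.80]
print(ks_statistic(reference, reference))  # identical samples
print(ks_statistic(reference, shifted))    # samples with no overlap
```

The statistic ranges from 0 for identical samples to 1 for samples that do not overlap at all, so it reacts to changes in shape as well as shifts in the mean.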

In [None]:
# Data drift detection
def detect_data_drift(reference_data, current_data, feature_name, threshold=0.05):
    """Flag drift from a normalized shift in mean and std; KS and t-test p-values are returned for reference only."""
    
    from scipy.stats import ks_2samp, ttest_ind
    
    # Kolmogorov-Smirnov test
    ks_stat, ks_p_value = ks_2samp(reference_data, current_data)
    
    # T-test
    t_stat, t_p_value = ttest_ind(reference_data, current_data)
    
    # Simple drift detection based on distribution change
    ref_mean, ref_std = np.mean(reference_data), np.std(reference_data)
    curr_mean, curr_std = np.mean(current_data), np.std(current_data)
    
    # Calculate drift score (normalized difference)
    mean_drift = abs(curr_mean - ref_mean) / ref_std if ref_std > 0 else 0
    std_drift = abs(curr_std - ref_std) / ref_std if ref_std > 0 else 0
    
    drift_score = (mean_drift + std_drift) / 2
    
    return {
        "feature": feature_name,
        "drift_detected": drift_score > threshold,
        "drift_score": drift_score,
        "threshold": threshold,
        "ks_p_value": ks_p_value,
        "t_p_value": t_p_value,
        "reference_mean": ref_mean,
        "current_mean": curr_mean,
        "reference_std": ref_std,
        "current_std": curr_std
    }

# Simulate data drift analysis
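def analyze_model_drift(monitoring_df, reference_window_hours=6):
    """Compare the first half of the data (reference) with the second half (current).

    Note: `reference_window_hours` is currently unused; the split is always at the midpoint.
    """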
    
    # Split data into reference and current windows
    midpoint = len(monitoring_df) // 2
    reference_data = monitoring_df.iloc[:midpoint]
    current_data = monitoring_df.iloc[midpoint:]
    
    # Features to monitor for drift
    drift_features = {
        "model_accuracy": "Model Accuracy",
        "inference_time_ms": "Inference Time",
        "p95_latency_ms": "P95 Latency",
        "error_rate_percent": "Error Rate"
    }
    
    drift_results = []
    
    for feature, display_name in drift_features.items():
        if feature in monitoring_df.columns:
            result = detect_data_drift(
                reference_data[feature].values,
                current_data[feature].values,
                display_name
            )
            drift_results.append(result)
    
    return drift_results

# Analyze drift
drift_analysis = analyze_model_drift(monitoring_df)

print("🔍 Data Drift Detection Analysis:")
print("=" * 50)

drift_detected = False
for result in drift_analysis:
    status = "🚨 DRIFT DETECTED" if result['drift_detected'] else "✅ STABLE"
    print(f"\n{status} - {result['feature']}")
    print(f"   📊 Drift Score: {result['drift_score']:.3f} (threshold: {result['threshold']})")
    print(f"   📈 Reference Mean: {result['reference_mean']:.3f}")
    print(f"   📉 Current Mean: {result['current_mean']:.3f}")
    print(f"   📏 Reference Std: {result['reference_std']:.3f}")
    print(f"   📐 Current Std: {result['current_std']:.3f}")
    print(f"   🔬 KS Test p-value: {result['ks_p_value']:.4f}")
    
    if result['drift_detected']:
        drift_detected = True
        print("   ⚠️  Recommendation: Investigate data distribution changes")

if drift_detected:
    print(f"\n🚨 SUMMARY: Data drift detected in {sum(1 for r in drift_analysis if r['drift_detected'])} metrics")
    print("   📋 Recommended actions:")
    print("   • Review recent data collection changes")
    print("   • Check for changes in user behavior")
    print("   • Consider model retraining if drift persists")
    print("   • Update model monitoring thresholds")
else:
    print("\n✅ SUMMARY: No significant data drift detected")
    print("   📋 System operating within normal parameters")

# Visualize drift for key metrics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Data Drift Analysis - Reference vs Current Distribution', fontsize=16)

# Split data for visualization
midpoint = len(monitoring_df) // 2
reference_data = monitoring_df.iloc[:midpoint]
current_data = monitoring_df.iloc[midpoint:]

drift_features = ['model_accuracy', 'inference_time_ms', 'p95_latency_ms', 'error_rate_percent']
feature_names = ['Model Accuracy', 'Inference Time (ms)', 'P95 Latency (ms)', 'Error Rate (%)']

for i, (feature, name) in enumerate(zip(drift_features, feature_names)):
    ax = axes[i // 2, i % 2]
    
    # Plot distributions
    ax.hist(reference_data[feature], alpha=0.7, label='Reference', bins=20, density=True)
    ax.hist(current_data[feature], alpha=0.7, label='Current', bins=20, density=True)
    
    ax.set_title(f'{name} Distribution')
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Create drift monitoring dashboard
fig, ax = plt.subplots(figsize=(12, 8))

# Plot drift scores
features = [r['feature'] for r in drift_analysis]
drift_scores = [r['drift_score'] for r in drift_analysis]
thresholds = [r['threshold'] for r in drift_analysis]

x = np.arange(len(features))
bars = ax.bar(x, drift_scores, alpha=0.7, color='skyblue', label='Drift Score')
ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='Drift Threshold')

# Color bars based on drift detection
for i, (bar, score, threshold) in enumerate(zip(bars, drift_scores, thresholds)):
    if score > threshold:
        bar.set_color('red')
        bar.set_alpha(0.9)

ax.set_xlabel('Metrics')
ax.set_ylabel('Drift Score')
ax.set_title('Data Drift Detection Dashboard')
ax.set_xticks(x)
ax.set_xticklabels(features, rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3)

# Add value labels on bars
for bar, score in zip(bars, drift_scores):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.001,
            f'{score:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 📊 Custom Dashboard Creation

Let's create a custom monitoring dashboard with key metrics and alerts.
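
The composite health score used by the dashboard in this section starts at 100 and subtracts a fixed penalty for each breached threshold. A table-driven variant is sketched below; the thresholds and penalties are the ones from the dashboard code, while the names `HEALTH_RULES` and `composite_health` and the sample readings are hypothetical.

```python
import operator

# (metric, comparison, threshold, penalty), mirroring the dashboard's checks
HEALTH_RULES = [
    ("error_rate_percent", operator.gt, 5, 30),
    ("p95_latency_ms", operator.gt, 500, 20),
    ("cpu_usage_percent", operator.gt, 80, 15),
    ("memory_usage_percent", operator.gt, 85, 15),
    ("model_accuracy", operator.lt, 0.85, 20),
]


def composite_health(latest, rules=HEALTH_RULES):
    """Start at 100 and subtract a penalty for every breached rule."""
    score = 100
    breached = []
    for metric, compare, threshold, penalty in rules:
        if compare(latest[metric], threshold):
            score -= penalty
            breached.append(metric)
    return score, breached


# Made-up readings: error rate and model accuracy are out of bounds
degraded = {
    "error_rate_percent": 7.5,
    "p95_latency_ms": 210.0,
    "cpu_usage_percent": 41.0,
    "memory_usage_percent": 55.0,
    "model_accuracy": 0.80,
}
print(composite_health(degraded))
```

Keeping the rules in one table means the score and any list of breached metrics shown on the dashboard cannot drift apart.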

In [None]:
# Create comprehensive monitoring dashboard
def create_monitoring_dashboard(monitoring_df, alerts_detected, drift_analysis):
    """Create a comprehensive monitoring dashboard."""
    
    fig, axes = plt.subplots(4, 3, figsize=(18, 20))
    fig.suptitle('KubeSentiment - Comprehensive Monitoring Dashboard', fontsize=16, fontweight='bold')
    
    # Current status summary (top row)
    latest = monitoring_df.iloc[-1]
    
    # Status indicators. `higher_is_better` marks metrics whose threshold is a minimum.
    # P95 latency is charted in the second row, so the third tile is reserved for the health score.
    status_data = {
        'RPS': {'value': latest['requests_per_second'], 'threshold': 10, 'unit': 'req/s', 'higher_is_better': True},
        'Error Rate': {'value': latest['error_rate_percent'], 'threshold': 5, 'unit': '%', 'higher_is_better': False}
    }
    
    colors = []
    for i, (metric, data) in enumerate(status_data.items()):
        if data['higher_is_better']:
            healthy = data['value'] >= data['threshold']
        else:
            healthy = data['value'] <= data['threshold']
        color = 'green' if healthy else 'red'
        colors.append(color)
        
        axes[0, i].text(0.5, 0.7, f"{data['value']:.1f} {data['unit']}", 
                       ha='center', va='center', fontsize=24, color=color, fontweight='bold')
        axes[0, i].text(0.5, 0.3, metric, ha='center', va='center', fontsize=12)
        axes[0, i].set_xlim(0, 1)
        axes[0, i].set_ylim(0, 1)
        axes[0, i].axis('off')
        
        # Add threshold indicator
        axes[0, i].text(0.5, 0.1, f"Threshold: {data['threshold']} {data['unit']}", 
                       ha='center', va='center', fontsize=8, color='gray')
    
    # Overall system health score
    health_score = 100
    if latest['error_rate_percent'] > 5:
        health_score -= 30
    if latest['p95_latency_ms'] > 500:
        health_score -= 20
    if latest['cpu_usage_percent'] > 80:
        health_score -= 15
    if latest['memory_usage_percent'] > 85:
        health_score -= 15
    if latest['model_accuracy'] < 0.85:
        health_score -= 20
    
    health_color = 'green' if health_score > 80 else 'orange' if health_score > 60 else 'red'
    axes[0, 2].text(0.5, 0.7, f"{health_score}%", ha='center', va='center', 
                    fontsize=24, color=health_color, fontweight='bold')
    axes[0, 2].text(0.5, 0.3, 'System Health', ha='center', va='center', fontsize=12)
    axes[0, 2].text(0.5, 0.1, 'Composite Score', ha='center', va='center', fontsize=8, color='gray')
    axes[0, 2].set_xlim(0, 1)
    axes[0, 2].set_ylim(0, 1)
    axes[0, 2].axis('off')
    
    # Time series plots (second row)
    time_window = monitoring_df.tail(60)  # Last 5 hours
    
    axes[1, 0].plot(time_window.index, time_window['requests_per_second'], color='blue', linewidth=2)
    axes[1, 0].set_title('Throughput (Last 5h)')
    axes[1, 0].set_ylabel('RPS')
    axes[1, 0].grid(True, alpha=0.3)
    
    axes[1, 1].plot(time_window.index, time_window['p95_latency_ms'], color='orange', linewidth=2)
    axes[1, 1].axhline(500, color='red', linestyle='--', alpha=0.7)
    axes[1, 1].set_title('Latency (Last 5h)')
    axes[1, 1].set_ylabel('P95 Latency (ms)')
    axes[1, 1].grid(True, alpha=0.3)
    
    axes[1, 2].plot(time_window.index, time_window['model_accuracy'] * 100, color='green', linewidth=2)
    axes[1, 2].axhline(85, color='red', linestyle='--', alpha=0.7)
    axes[1, 2].set_title('Model Accuracy (Last 5h)')
    axes[1, 2].set_ylabel('Accuracy (%)')
    axes[1, 2].grid(True, alpha=0.3)
    
    # Resource utilization (third row)
    axes[2, 0].plot(time_window.index, time_window['cpu_usage_percent'], label='CPU', color='purple')
    axes[2, 0].plot(time_window.index, time_window['memory_usage_percent'], label='Memory', color='brown')
    axes[2, 0].axhline(80, color='red', linestyle='--', alpha=0.7, label='CPU threshold (80%)')
    axes[2, 0].set_title('Resource Utilization')
    axes[2, 0].set_ylabel('Usage (%)')
    axes[2, 0].legend()
    axes[2, 0].grid(True, alpha=0.3)
    
    # Cache performance
    axes[2, 1].plot(time_window.index, time_window['cache_hit_rate'] * 100, color='magenta')
    axes[2, 1].axhline(70, color='green', linestyle='--', alpha=0.7)
    axes[2, 1].set_title('Cache Performance')
    axes[2, 1].set_ylabel('Hit Rate (%)')
    axes[2, 1].grid(True, alpha=0.3)
    
    # Error rate
    axes[2, 2].plot(time_window.index, time_window['error_rate_percent'], color='red')
    axes[2, 2].axhline(5, color='red', linestyle='--', alpha=0.7)
    axes[2, 2].set_title('Error Rate')
    axes[2, 2].set_ylabel('Error Rate (%)')
    axes[2, 2].grid(True, alpha=0.3)
    
    # Alerts and drift summary (bottom row)
    axes[3, 0].axis('off')
    alert_summary = f"Active Alerts: {len(alerts_detected)}\n\n"
    for alert in alerts_detected[:3]:  # Show the first 3 alerts
        alert_summary += f"🚨 {alert['alert_name']}\n   {alert['severity'].upper()}\n\n"
    
    axes[3, 0].text(0.1, 0.9, alert_summary, transform=axes[3, 0].transAxes, 
                    fontsize=10, verticalalignment='top', fontfamily='monospace',
                    bbox=dict(boxstyle="round,pad=0.3", facecolor="lightcoral", alpha=0.5))
    axes[3, 0].set_title('Alert Status')
    
    # Drift status
    axes[3, 1].axis('off')
    drift_summary = "Drift Detection\n\n"
    drift_issues = sum(1 for r in drift_analysis if r['drift_detected'])
    drift_summary += f"Issues Found: {drift_issues}/{len(drift_analysis)}\n\n"
    
    for result in drift_analysis:
        status = "❌" if result['drift_detected'] else "✅"
        drift_summary += f"{status} {result['feature'][:15]}\n   {result['drift_score']:.3f}\n\n"
    
    axes[3, 1].text(0.1, 0.9, drift_summary, transform=axes[3, 1].transAxes, 
                    fontsize=9, verticalalignment='top', fontfamily='monospace',
                    bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue", alpha=0.5))
    axes[3, 1].set_title('Drift Status')
    
    # Recommendations
    axes[3, 2].axis('off')
    recommendations = "Recommendations:\n\n"
    
    if alerts_detected:
        recommendations += "• Review active alerts\n• Check system resources\n"
    else:
        recommendations += "• System operating normally\n• Monitor trends\n"
    
    if drift_issues > 0:
        recommendations += "• Investigate data drift\n• Consider model retraining\n"
    
    if latest['cpu_usage_percent'] > 70:
        recommendations += "• Monitor CPU utilization\n"
    
    if latest['cache_hit_rate'] < 0.7:
        recommendations += "• Optimize cache performance\n"
    
    recommendations += "\n• Regular maintenance\n• Performance monitoring"
    
    axes[3, 2].text(0.1, 0.9, recommendations, transform=axes[3, 2].transAxes, 
                    fontsize=10, verticalalignment='top', fontfamily='monospace',
                    bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen", alpha=0.5))
    axes[3, 2].set_title('Action Items')
    
    plt.tight_layout()
    plt.show()

# Create the comprehensive dashboard
create_monitoring_dashboard(monitoring_df, alerts_detected, drift_analysis)

print("🎉 Monitoring Analysis Complete!")
print("\n📋 Dashboard Summary:")
print(f"   📊 Monitoring period: {len(monitoring_df)} data points")
print(f"   🚨 Alerts detected: {len(alerts_detected)}")
print(f"   🔍 Drift issues: {sum(1 for r in drift_analysis if r['drift_detected'])}")
print(f"   📈 Average RPS: {monitoring_df['requests_per_second'].mean():.1f}")
print(f"   ⏱️ Average P95 latency: {monitoring_df['p95_latency_ms'].mean():.1f}ms")
print(f"   🎯 Average model accuracy: {monitoring_df['model_accuracy'].mean():.1%}")
print(f"   💾 Average cache hit rate: {monitoring_df['cache_hit_rate'].mean():.1%}")

print("\n🚀 Next Steps:")
print("   • Set up real Prometheus + Grafana monitoring")
print("   • Configure alerting rules in production")
print("   • Implement automated drift detection")
print("   • Create custom Grafana dashboards")
print("   • Set up log aggregation with Loki/Elasticsearch")
print("\n💡 Pro Tip: Use this notebook as a template for creating custom monitoring dashboards!")
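
The composite health score is computed inline inside `create_monitoring_dashboard`, which makes the penalty thresholds hard to reuse or test. One possible refactoring, shown here only as a sketch, moves the rules into a table. The names `HEALTH_RULES`, `compute_health_score`, and `health_color` are hypothetical and not part of KubeSentiment; the thresholds and penalties are copied from the dashboard code above, and the sample values are made up for illustration.

```python
import operator

# Hypothetical rule table: (metric key, comparison, threshold, penalty).
# Thresholds and penalties mirror the inline checks in create_monitoring_dashboard.
HEALTH_RULES = [
    ("error_rate_percent", operator.gt, 5, 30),
    ("p95_latency_ms", operator.gt, 500, 20),
    ("cpu_usage_percent", operator.gt, 80, 15),
    ("memory_usage_percent", operator.gt, 85, 15),
    ("model_accuracy", operator.lt, 0.85, 20),
]


def compute_health_score(latest, rules=HEALTH_RULES):
    """Start at 100 and subtract the penalty of every breached rule."""
    score = 100
    for key, breached, threshold, penalty in rules:
        if breached(latest[key], threshold):
            score -= penalty
    return max(score, 0)  # Penalties sum to 100, so this is only a safety floor


def health_color(score):
    """Map a score to the traffic-light color used by the health tile."""
    if score > 80:
        return "green"
    if score > 60:
        return "orange"
    return "red"


# Illustrative values only: latency and memory breach their thresholds.
sample = {
    "error_rate_percent": 2.0,
    "p95_latency_ms": 620.0,
    "cpu_usage_percent": 65.0,
    "memory_usage_percent": 88.0,
    "model_accuracy": 0.91,
}
score = compute_health_score(sample)
print(score, health_color(score))  # 65 orange
```

Because the helper only indexes by key, it should accept either a plain dict or the last row of `monitoring_df` (for example `monitoring_df.iloc[-1]`). Keeping the rules in one table also makes it easier to keep the score consistent with the threshold lines drawn on the plots.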