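# Lab-2.1.4: Monitoring and Performance Optimization

## 🎯 Learning Objectives

1. **Build an enterprise-grade monitoring stack**
   - Prometheus metrics collection
   - Grafana dashboard design
   - Alerting and notification

2. **Implement automated performance tuning**
   - Dynamic resource allocation
   - Load-balancing optimization
   - SLA monitoring and enforcement

3. **Build a complete operations workflow**
   - Health-check mechanisms
   - Failure detection and recovery
   - Capacity planning and scaling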

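## 📋 Enterprise Case Background

**Scenario**: A Netflix recommendation system requires:
- 99.99% availability (annual downtime < 53 minutes)
- P99 latency < 50 ms
- Support for 10M+ concurrent users
- Automatic failure recovery

**Technical challenge**: How do we design an observability system that guarantees service quality?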
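
The "53 minutes" figure follows directly from the availability target. A minimal sketch of that arithmetic, assuming a 365-day year (the helper name `downtime_budget_minutes` is illustrative, not part of the lab code):

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed in the period for a given availability target."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

For 99.99% this gives roughly 52.56 minutes per year, which is where the "< 53 minutes" bound comes from.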

---

## 1. Prometheus Metrics Collection System

### 1.1 Defining and Collecting Metrics
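
The mock collector in this section stores every histogram observation as a raw value. A real Prometheus histogram instead keeps cumulative bucket counters, where each bucket labeled `le="x"` counts observations less than or equal to `x`, plus a final `+Inf` bucket. The sketch below illustrates that counting scheme only; the function name `cumulative_buckets` and the bucket bounds are assumptions chosen for illustration, not values from the lab.

```python
from bisect import bisect_left
from typing import Dict, Sequence


def cumulative_buckets(values: Sequence[float],
                       upper_bounds: Sequence[float]) -> Dict[str, int]:
    """Count observations per cumulative bucket (value <= upper bound)."""
    bounds = sorted(upper_bounds)
    per_bucket = [0] * (len(bounds) + 1)  # last slot holds values above every bound
    for v in values:
        # bisect_left finds the first bound that is >= v
        per_bucket[bisect_left(bounds, v)] += 1

    result: Dict[str, int] = {}
    running = 0
    for bound, count in zip(bounds, per_bucket):
        running += count
        result[f"{bound:g}"] = running
    result["+Inf"] = running + per_bucket[-1]
    return result


latencies_ms = [12.0, 18.5, 25.0, 40.0, 75.0, 130.0]
buckets = cumulative_buckets(latencies_ms, upper_bounds=[10, 25, 50, 100])
print(buckets)
```

Because only bucket counts are kept, memory use stays constant regardless of request volume, at the cost of percentile estimates that are only as fine as the bucket boundaries.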

In [None]:
import time
import json
import threading
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from collections import defaultdict, deque
import numpy as np
from datetime import datetime, timedelta

# Mock Prometheus client
class MockPrometheusMetrics:
    """
    Mock Prometheus metrics collector
    (use prometheus_client in a real deployment)
    """
    
    def __init__(self):
        self.counters = defaultdict(float)
        self.histograms = defaultdict(list)
        self.gauges = defaultdict(float)
        self.summaries = defaultdict(list)
    
    def counter_inc(self, name: str, labels: Optional[Dict[str, str]] = None, value: float = 1.0):
        key = self._make_key(name, labels)
        self.counters[key] += value
    
    def histogram_observe(self, name: str, labels: Optional[Dict[str, str]] = None, value: float = 0.0):
        key = self._make_key(name, labels)
        self.histograms[key].append(value)
    
    def gauge_set(self, name: str, labels: Optional[Dict[str, str]] = None, value: float = 0.0):
        key = self._make_key(name, labels)
        self.gauges[key] = value
    
    def _make_key(self, name: str, labels: Optional[Dict[str, str]] = None) -> str:
        # Build a key such as name{label="value",...} with labels sorted by name
        if labels:
            label_str = ','.join([f'{k}="{v}"' for k, v in sorted(labels.items())])
            return f'{name}{{{label_str}}}'
        return name

class TritonMetricsCollector:
    """
    Triton-specific metrics collector
    
    Metrics collected:
    - Inference latency and throughput
    - Resource usage
    - Error and success rates
    - Business metrics
    """
    
    def __init__(self, model_name: str, model_version: str = "1"):
        self.model_name = model_name
        self.model_version = model_version
        self.metrics = MockPrometheusMetrics()
        
        # Base labels
        self.base_labels = {
            "model_name": model_name,
            "model_version": model_version,
            "instance": "triton-0"
        }
        
        print(f"📊 Triton metrics collector initialized: {model_name} v{model_version}")
    
    def record_inference_request(
        self, 
        latency_ms: float,
        batch_size: int,
        success: bool = True,
        backend: str = "pytorch"
    ):
        """
        Record metrics for a single inference request
        """
        labels = {**self.base_labels, "backend": backend}
        
        # Request count
        status = "success" if success else "error"
        self.metrics.counter_inc(
            "triton_inference_requests_total",
            {**labels, "status": status}
        )
        
        if success:
            # Latency histogram
            self.metrics.histogram_observe(
                "triton_inference_latency_ms",
                labels,
                latency_ms
            )
            
            # Batch size
            self.metrics.histogram_observe(
                "triton_inference_batch_size",
                labels,
                batch_size
            )
            
            # Throughput (samples/sec); skipped for zero latency to avoid division by zero
            if latency_ms > 0:
                throughput = batch_size / (latency_ms / 1000)
                self.metrics.gauge_set(
                    "triton_inference_throughput_samples_per_sec",
                    labels,
                    throughput
                )
    
    def record_resource_usage(
        self,
        gpu_utilization: float,
        gpu_memory_used_gb: float,
        gpu_memory_total_gb: float,
        cpu_utilization: float
    ):
        """
        Record resource usage metrics
        """
        labels = self.base_labels
        
        # GPU metrics
        self.metrics.gauge_set(
            "triton_gpu_utilization_percent", labels, gpu_utilization
        )
        self.metrics.gauge_set(
            "triton_gpu_memory_used_gb", labels, gpu_memory_used_gb
        )
        self.metrics.gauge_set(
            "triton_gpu_memory_utilization_percent", 
            labels, 
            (gpu_memory_used_gb / gpu_memory_total_gb) * 100
        )
        
        # CPU metrics
        self.metrics.gauge_set(
            "triton_cpu_utilization_percent", labels, cpu_utilization
        )
    
    def record_business_metrics(
        self,
        decisions: List[str],
        fraud_detected: int,
        false_positives: int = 0
    ):
        """
        Record business metrics (using fraud detection as the example)
        """
        labels = self.base_labels
        
        # Decision distribution
        for decision in ['APPROVE', 'BLOCK', 'REVIEW']:
            count = decisions.count(decision)
            self.metrics.counter_inc(
                "triton_business_decisions_total",
                {**labels, "decision": decision.lower()},
                count
            )
        
        # Fraud detections
        self.metrics.counter_inc(
            "triton_fraud_detected_total", labels, fraud_detected
        )
        
        # False positives
        if false_positives > 0:
            self.metrics.counter_inc(
                "triton_false_positives_total", labels, false_positives
            )
    
    def get_metrics_summary(self) -> Dict[str, Any]:
        """
        Return a summary of the collected metrics
        """
        summary = {
            "timestamp": datetime.now().isoformat(),
            "model": f"{self.model_name}:{self.model_version}",
            "counters": dict(self.metrics.counters),
            "gauges": dict(self.metrics.gauges),
            "histograms": {}
        }
        
        # Compute histogram statistics
        for key, values in self.metrics.histograms.items():
            if values:
                summary["histograms"][key] = {
                    "count": len(values),
                    "sum": sum(values),
                    "mean": np.mean(values),
                    "p50": np.percentile(values, 50),
                    "p90": np.percentile(values, 90),
                    "p95": np.percentile(values, 95),
                    "p99": np.percentile(values, 99)
                }
        
        return summary

# Create the metrics collector
metrics_collector = TritonMetricsCollector(
    model_name="netflix_recommendation_v2_prod",
    model_version="2"
)

print("✅ Prometheus metrics collection system started")

### 1.2 Simulating Monitoring Data Collection
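
The simulation in this section does not actually issue `base_qps` requests per second. Each loop iteration sleeps 0.1 s and emits at most one request with probability `current_qps / 100`, so the real request rate is much lower than the nominal QPS. A rough sketch of that relationship, ignoring loop overhead (the helper `expected_request_rate` is hypothetical and only mirrors the constants used in the simulation):

```python
def expected_request_rate(base_qps: float, load_factor: float,
                          tick_seconds: float = 0.1) -> float:
    """Expected simulated requests per second for one load level."""
    # Probability of emitting a request on a single tick, capped at 1
    p = min(1.0, base_qps * load_factor / 100)
    # One Bernoulli trial per tick gives p / tick_seconds requests per second
    return p / tick_seconds


for load_factor in (0.5, 1.0, 1.5):
    rate = expected_request_rate(50, load_factor)
    print(f"load_factor={load_factor}: ~{rate:.1f} requests/s")
```

With `base_qps = 50`, the simulated rate therefore ranges from about 2.5 to 7.5 requests per second, which is worth keeping in mind when reading the request totals it prints.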

In [None]:
import random

def simulate_production_workload(duration_seconds: int = 60):
    """
    Simulate a production workload
    """
    print(f"🔄 Simulating production workload ({duration_seconds} s)...")
    
    start_time = time.time()
    request_count = 0
    
    while time.time() - start_time < duration_seconds:
        # Simulate load patterns that vary over time
        elapsed = time.time() - start_time
        load_factor = 1 + 0.5 * np.sin(elapsed / 10)  # periodic load variation
        
        # Nominal request rate (QPS)
        base_qps = 50
        current_qps = base_qps * load_factor
        
        # Simulate a request: each 0.1 s tick emits at most one request with
        # probability current_qps / 100 (a scaled-down stand-in for real QPS)
        if random.random() < current_qps / 100:
            # Batch size distribution
            batch_size = random.choices(
                [1, 2, 4, 8, 16], 
                weights=[0.3, 0.2, 0.2, 0.2, 0.1]
            )[0]
            
            # Latency simulation (normal distribution plus occasional outliers)
            if random.random() < 0.95:  # 95% normal requests
                base_latency = 15 + 5 * batch_size  # base latency grows with batch size
                latency = max(1, np.random.normal(base_latency, 5))
                success = random.random() > 0.001  # 99.9% success rate
            else:  # 5% anomalous requests
                latency = np.random.exponential(100)  # long-tail latency
                success = random.random() > 0.1  # 90% success rate
            
            # Record inference metrics
            metrics_collector.record_inference_request(
                latency_ms=latency,
                batch_size=batch_size,
                success=success
            )
            
            # Simulate resource usage
            if request_count % 10 == 0:  # record resources once every 10 requests
                gpu_util = min(100, 30 + 20 * load_factor + np.random.normal(0, 5))
                gpu_memory = min(8, 2 + 1.5 * load_factor + np.random.normal(0, 0.5))
                cpu_util = min(100, 20 + 15 * load_factor + np.random.normal(0, 3))
                
                metrics_collector.record_resource_usage(
                    gpu_utilization=max(0, gpu_util),
                    gpu_memory_used_gb=max(0, gpu_memory),
                    gpu_memory_total_gb=8.0,
                    cpu_utilization=max(0, cpu_util)
                )
            
            # Simulate business decisions
            if success and request_count % 5 == 0:
                decisions = random.choices(
                    ['APPROVE', 'BLOCK', 'REVIEW'],
                    weights=[0.85, 0.10, 0.05],
                    k=batch_size
                )
                
                fraud_detected = decisions.count('BLOCK')
                
                metrics_collector.record_business_metrics(
                    decisions=decisions,
                    fraud_detected=fraud_detected
                )
            
            request_count += 1
        
        time.sleep(0.1)  # throttle the simulation
    
    print(f"✅ Simulation complete, {request_count} requests processed")
    return request_count

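# Run the workload simulation
total_requests = simulate_production_workload(30)  # 30-second simulation

# Get the metrics summary
summary = metrics_collector.get_metrics_summary()

def histogram_stats(metric_name: str) -> Dict[str, Any]:
    """Return stats for the first histogram whose key matches metric_name.

    Histogram keys carry their labels (name{label="value",...}), so a lookup
    by bare metric name alone would never match.
    """
    for key, stats in summary['histograms'].items():
        if key == metric_name or key.startswith(metric_name + '{'):
            return stats
    return {}

latency_stats = histogram_stats('triton_inference_latency_ms')
batch_stats = histogram_stats('triton_inference_batch_size')

print("\n📊 Metrics summary:")
print(f"   📈 Total requests: {total_requests}")
print(f"   ⏱️  Mean latency: {latency_stats.get('mean', 0):.2f} ms")
print(f"   📊 P95 latency: {latency_stats.get('p95', 0):.2f} ms")
print(f"   🚀 Mean batch size: {batch_stats.get('mean', 0):.2f}")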

## 2. SLA Monitoring and Alerting System

### 2.1 SLA Definition and Monitoring
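
Each SLA target in this section pairs a measured value with a comparison operator stored as a string. One compact way to evaluate such targets is a lookup table built on Python's standard `operator` module. This is a sketch of an alternative, not the lab's implementation; the names `COMPARATORS` and `meets_target` and the `0.01` tolerance for `==` are assumptions.

```python
import operator
from typing import Callable, Dict

# Map each operator string to the matching comparison function
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def meets_target(current: float, op: str, target: float,
                 tolerance: float = 0.01) -> bool:
    """Return True when `current <op> target` holds."""
    if op == "==":
        # Floating-point metrics are compared within a tolerance
        return abs(current - target) <= tolerance
    return COMPARATORS[op](current, target)


print(meets_target(99.5, ">=", 99.99))  # availability below target
print(meets_target(42.0, "<=", 50.0))   # latency within target
```

An unknown operator string raises `KeyError` here, which surfaces configuration mistakes early instead of silently treating the target as met.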

In [None]:
@dataclass
class SLATarget:
    """
    SLA target definition
    """
    name: str
    description: str
    target_value: float
    operator: str  # '<', '>', '<=', '>=', '=='
    measurement_window: int  # seconds
    severity: str  # 'critical', 'warning', 'info'

class SLAMonitor:
    """
    SLA monitoring and alerting system
    
    Monitored metrics:
    - Availability
    - Latency
    - Throughput
    - Error rate
    """
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.sla_targets = self._define_sla_targets()
        self.metrics_history = defaultdict(list)
        self.alerts = []
        
        print(f"🎯 SLA monitoring system started: {service_name}")
        self._print_sla_targets()
    
    def _define_sla_targets(self) -> List[SLATarget]:
        """
        Define enterprise-grade SLA targets
        """
        return [
            SLATarget(
                name="availability",
                description="Service availability",
                target_value=99.99,  # 99.99%
                operator=">=",
                measurement_window=300,  # 5 minutes
                severity="critical"
            ),
            SLATarget(
                name="latency_p95",
                description="P95 latency",
                target_value=50.0,  # 50 ms
                operator="<=",
                measurement_window=60,  # 1 minute
                severity="warning"
            ),
            SLATarget(
                name="latency_p99",
                description="P99 latency",
                target_value=100.0,  # 100 ms
                operator="<=",
                measurement_window=60,
                severity="critical"
            ),
            SLATarget(
                name="error_rate",
                description="Error rate",
                target_value=0.1,  # 0.1%
                operator="<=",
                measurement_window=300,
                severity="critical"
            ),
            SLATarget(
                name="throughput",
                description="Minimum throughput",
                target_value=100.0,  # 100 QPS
                operator=">=",
                measurement_window=60,
                severity="warning"
            )
        ]
    
    def _print_sla_targets(self):
        """
        Print the SLA targets
        """
        print("\n📋 SLA target definitions:")
        for target in self.sla_targets:
            severity_icon = {
                'critical': '🔴',
                'warning': '🟡', 
                'info': '🔵'
            }[target.severity]
            
            print(f"   {severity_icon} {target.description}: {target.operator} {target.target_value}")
            print(f"      Measurement window: {target.measurement_window} s, severity: {target.severity}")
    
    def record_metrics(
        self,
        timestamp: float,
        latency_ms: float,
        success: bool,
        throughput_qps: float
    ):
        """
        Record a data point for SLA calculation
        """
        metric_point = {
            'timestamp': timestamp,
            'latency_ms': latency_ms,
            'success': success,
            'throughput_qps': throughput_qps
        }
        
        self.metrics_history['points'].append(metric_point)
        
        # Keep only the most recent hour of data
        cutoff_time = timestamp - 3600
        self.metrics_history['points'] = [
            p for p in self.metrics_history['points']
            if p['timestamp'] > cutoff_time
        ]
    
    def calculate_sla_metrics(self, window_seconds: int = 300) -> Dict[str, float]:
        """
        Compute SLA metrics over the given time window
        """
        current_time = time.time()
        cutoff_time = current_time - window_seconds
        
        # Select the data points inside the window
        window_data = [
            p for p in self.metrics_history['points']
            if p['timestamp'] > cutoff_time
        ]
        
        if not window_data:
            return {}
        
        # Compute the metrics
        total_requests = len(window_data)
        successful_requests = sum(1 for p in window_data if p['success'])
        latencies = [p['latency_ms'] for p in window_data if p['success']]
        throughputs = [p['throughput_qps'] for p in window_data]
        
        metrics = {
            'availability': (successful_requests / total_requests) * 100 if total_requests > 0 else 0,
            'error_rate': ((total_requests - successful_requests) / total_requests) * 100 if total_requests > 0 else 0,
            'throughput': np.mean(throughputs) if throughputs else 0
        }
        
        if latencies:
            metrics.update({
                'latency_p50': np.percentile(latencies, 50),
                'latency_p95': np.percentile(latencies, 95),
                'latency_p99': np.percentile(latencies, 99),
                'latency_max': np.max(latencies)
            })
        
        return metrics
    
    def check_sla_violations(self) -> List[Dict[str, Any]]:
        """
        Check for SLA violations
        """
        violations = []
        
        for target in self.sla_targets:
            metrics = self.calculate_sla_metrics(target.measurement_window)
            
            if target.name not in metrics:
                continue
            
            current_value = metrics[target.name]
            target_value = target.target_value
            
            violated = False
            if target.operator == '<=' and current_value > target_value:
                violated = True
            elif target.operator == '>=' and current_value < target_value:
                violated = True
            elif target.operator == '<' and current_value >= target_value:
                violated = True
            elif target.operator == '>' and current_value <= target_value:
                violated = True
            elif target.operator == '==' and abs(current_value - target_value) > 0.01:
                violated = True
            
            if violated:
                violation = {
                    'timestamp': time.time(),
                    'sla_name': target.name,
                    'description': target.description,
                    'target_value': target_value,
                    'current_value': current_value,
                    'operator': target.operator,
                    'severity': target.severity,
                    'window_seconds': target.measurement_window
                }
                violations.append(violation)
        
        return violations
    
    def generate_sla_report(self) -> Dict[str, Any]:
        """
        Generate an SLA report
        """
        current_metrics = self.calculate_sla_metrics(300)  # 5-minute window
        violations = self.check_sla_violations()
        
        report = {
            'service_name': self.service_name,
            'timestamp': datetime.now().isoformat(),
            'current_metrics': current_metrics,
            'sla_targets': [
                {
                    'name': target.name,
                    'description': target.description,
                    'target': f"{target.operator} {target.target_value}",
                    'current': current_metrics.get(target.name, 'N/A'),
                    'status': 'PASS' if target.name not in [v['sla_name'] for v in violations] else 'FAIL'
                }
                for target in self.sla_targets
            ],
            'violations': violations,
            'overall_status': 'HEALTHY' if not violations else 'DEGRADED'
        }
        
        return report

# Initialize SLA monitoring
sla_monitor = SLAMonitor("netflix-recommendation-service")

# Simulate metrics collection
print("\n🔄 Simulating SLA metrics collection...")
current_time = time.time()

# Simulate different performance scenarios
for i in range(100):
    # Backdate timestamps (one point every 3 seconds, ending now) so that the
    # measurement windows in calculate_sla_metrics() filter on real elapsed time;
    # future timestamps would all fall inside every window
    timestamp = current_time - (99 - i) * 3
    
    # Simulate performance in different phases
    if i < 30:  # normal phase
        latency = np.random.normal(25, 5)
        success_rate = 0.999
        throughput = np.random.normal(150, 20)
    elif i < 60:  # degradation phase
        latency = np.random.normal(45, 15)
        success_rate = 0.995
        throughput = np.random.normal(120, 25)
    else:  # recovery phase
        latency = np.random.normal(30, 8)
        success_rate = 0.998
        throughput = np.random.normal(140, 15)
    
    success = np.random.random() < success_rate
    
    sla_monitor.record_metrics(
        timestamp=timestamp,
        latency_ms=max(1, latency),
        success=success,
        throughput_qps=max(0, throughput)
    )

print("✅ SLA metrics collection complete")

### 2.2 SLA Reporting and Alerting
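
When several SLA targets fail at once, it helps to present the most severe ones first. The sketch below groups violation records by severity in priority order. It assumes each violation is a dictionary with `severity` and `sla_name` keys, as produced by the SLA monitor in this lab; `SEVERITY_ORDER` and `group_violations` are illustrative names.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Lower number means higher priority; unknown severities sort last
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def group_violations(violations: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group SLA names by severity, most severe group first."""
    ordered = sorted(
        violations,
        key=lambda v: SEVERITY_ORDER.get(v["severity"], len(SEVERITY_ORDER)),
    )
    grouped: Dict[str, List[str]] = defaultdict(list)
    for violation in ordered:
        grouped[violation["severity"]].append(violation["sla_name"])
    return dict(grouped)


sample = [
    {"sla_name": "throughput", "severity": "warning"},
    {"sla_name": "availability", "severity": "critical"},
    {"sla_name": "error_rate", "severity": "critical"},
]
print(group_violations(sample))
```

In a real deployment this grouping would typically feed a notification channel per severity level, for example paging for critical violations and a chat message for warnings.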

In [None]:
# Generate the SLA report
sla_report = sla_monitor.generate_sla_report()

print("📋 SLA Monitoring Report")
print("═" * 60)
print(f"🏷️  Service: {sla_report['service_name']}")
print(f"🕐 Time: {sla_report['timestamp']}")
print(f"📊 Overall status: {sla_report['overall_status']}")
print()

# Current metrics
print("📈 Current performance metrics (5-minute window):")
current_metrics = sla_report['current_metrics']
for metric, value in current_metrics.items():
    if isinstance(value, (int, float)):
        if 'latency' in metric:
            print(f"   ⏱️  {metric}: {value:.2f} ms")
        elif 'rate' in metric or 'availability' in metric:
            print(f"   📊 {metric}: {value:.2f}%")
        elif 'throughput' in metric:
            print(f"   🚀 {metric}: {value:.1f} QPS")
        else:
            print(f"   📈 {metric}: {value:.2f}")

print()

# SLA target check
print("🎯 SLA target check:")
for target in sla_report['sla_targets']:
    status_icon = '✅' if target['status'] == 'PASS' else '❌'
    current_val = target['current']
    
    if current_val != 'N/A':
        if isinstance(current_val, (int, float)):
            current_str = f"{current_val:.2f}"
        else:
            current_str = str(current_val)
    else:
        current_str = 'N/A'
    
    print(f"   {status_icon} {target['description']}: {current_str} (target: {target['target']})")

print()

# Violations
if sla_report['violations']:
    print("🚨 SLA violation alerts:")
    for violation in sla_report['violations']:
        severity_icon = {
            'critical': '🔴',
            'warning': '🟡',
            'info': '🔵'
        }[violation['severity']]
        
        print(f"   {severity_icon} {violation['description']}:")
        print(f"      Target: {violation['operator']} {violation['target_value']}")
        print(f"      Current: {violation['current_value']:.2f}")
        print(f"      Severity: {violation['severity'].upper()}")
        print(f"      Time window: {violation['window_seconds']} s")
        print()
else:
    print("✅ All SLA targets met")

## 3. Automated Performance Tuning System

### 3.1 Adaptive Resource Allocation
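
The controller in this section scales by one instance at a time when enough thresholds are crossed. A common alternative is proportional scaling, as used by the Kubernetes Horizontal Pod Autoscaler: `desired = ceil(current * current_metric / target_metric)`, clamped to the allowed range. The sketch below shows that formula only; it is a different technique from the threshold voting in the lab code, and the function name and default limits are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling: size the fleet so the metric returns to its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured instance range
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, current_metric=85.0, target_metric=70.0))  # CPU above target
print(desired_replicas(4, current_metric=25.0, target_metric=70.0))  # CPU well below target
```

Proportional scaling reacts faster to large load swings than single-step scaling, but it still needs a cooldown or tolerance band to avoid oscillating around the target.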

In [None]:
class AutoScalingController:
    """
    Automated performance tuning controller
    
    Features:
    - Dynamic instance scaling
    - Load-balancing optimization
    - Resource usage optimization
    """
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.current_instances = 2
        self.min_instances = 1
        self.max_instances = 10
        
        # Scaling policy parameters
        self.scale_up_threshold = {
            'cpu_utilization': 70,  # %
            'memory_utilization': 80,  # %
            'latency_p95': 50,  # ms
            'queue_length': 10  # requests
        }
        
        self.scale_down_threshold = {
            'cpu_utilization': 30,  # %
            'memory_utilization': 40,  # %
            'latency_p95': 20,  # ms
            'queue_length': 2  # requests
        }
        
        self.scaling_history = []
        self.last_scaling_time = 0
        self.cooldown_period = 300  # 5-minute cooldown
        
        print(f"🎛️  Autoscaling controller started: {service_name}")
        print(f"   📊 Current instances: {self.current_instances}")
        print(f"   📈 Instance range: {self.min_instances}-{self.max_instances}")
    
    def analyze_metrics(self, metrics: Dict[str, float]) -> Dict[str, Any]:
        """
        Analyze the current metrics and produce a scaling recommendation
        """
        current_time = time.time()
        
        # Check the cooldown period
        if current_time - self.last_scaling_time < self.cooldown_period:
            return {
                'action': 'wait',
                'reason': f'In cooldown ({self.cooldown_period - (current_time - self.last_scaling_time):.0f} s remaining)',
                'current_instances': self.current_instances
            }
        
        # Collect scale-up and scale-down signals
        scale_up_signals = []
        scale_down_signals = []
        
        for metric, value in metrics.items():
            if metric in self.scale_up_threshold:
                threshold = self.scale_up_threshold[metric]
                if value > threshold:
                    scale_up_signals.append(f"{metric}: {value:.1f} > {threshold}")
            
            if metric in self.scale_down_threshold:
                threshold = self.scale_down_threshold[metric]
                if value < threshold:
                    scale_down_signals.append(f"{metric}: {value:.1f} < {threshold}")
        
        # Decision logic: scale up on 2+ signals, scale down only on 3+ signals
        if len(scale_up_signals) >= 2 and self.current_instances < self.max_instances:
            return {
                'action': 'scale_up',
                'reason': f'Multiple metrics triggered scale-up: {scale_up_signals}',
                'target_instances': min(self.current_instances + 1, self.max_instances),
                'current_instances': self.current_instances
            }
        elif len(scale_down_signals) >= 3 and self.current_instances > self.min_instances:
            return {
                'action': 'scale_down',
                'reason': f'Multiple metrics support scale-down: {scale_down_signals}',
                'target_instances': max(self.current_instances - 1, self.min_instances),
                'current_instances': self.current_instances
            }
        else:
            return {
                'action': 'maintain',
                'reason': 'Metrics normal; keeping the current instance count',
                'current_instances': self.current_instances
            }
    
    def execute_scaling(self, decision: Dict[str, Any]) -> bool:
        """
        Execute a scaling action
        """
        action = decision['action']
        
        if action == 'scale_up':
            old_instances = self.current_instances
            self.current_instances = decision['target_instances']
            self.last_scaling_time = time.time()
            
            scaling_event = {
                'timestamp': time.time(),
                'action': 'scale_up',
                'from_instances': old_instances,
                'to_instances': self.current_instances,
                'reason': decision['reason']
            }
            self.scaling_history.append(scaling_event)
            
            print(f"📈 Scaling up: {old_instances} → {self.current_instances} instances")
            print(f"   💡 Reason: {decision['reason']}")
            
            # Simulate the scale-up process
            print("   🔄 Starting new instance...")
            time.sleep(1)
            print("   ✅ Scale-up complete")
            
            return True
            
        elif action == 'scale_down':
            old_instances = self.current_instances
            self.current_instances = decision['target_instances']
            self.last_scaling_time = time.time()
            
            scaling_event = {
                'timestamp': time.time(),
                'action': 'scale_down',
                'from_instances': old_instances,
                'to_instances': self.current_instances,
                'reason': decision['reason']
            }
            self.scaling_history.append(scaling_event)
            
            print(f"📉 Scaling down: {old_instances} → {self.current_instances} instances")
            print(f"   💡 Reason: {decision['reason']}")
            
            # Simulate the scale-down process
            print("   🔄 Gracefully shutting down instance...")
            time.sleep(1)
            print("   ✅ Scale-down complete")
            
            return True
            
        else:
            return False
    
    def get_scaling_report(self) -> Dict[str, Any]:
        """
        Return a report of the scaling history
        """
        return {
            'service_name': self.service_name,
            'current_instances': self.current_instances,
            'instance_range': f"{self.min_instances}-{self.max_instances}",
            'scaling_history': self.scaling_history[-10:],  # last 10 events
            'total_scaling_events': len(self.scaling_history)
        }

# Initialize the autoscaling controller
autoscaling = AutoScalingController("netflix-recommendation-service")

# Simulate autoscaling scenarios
print("\n🎛️  Simulating autoscaling scenarios...")

# Scenario 1: high load triggers a scale-up
print("\n📈 Scenario 1: high load")
high_load_metrics = {
    'cpu_utilization': 85.0,
    'memory_utilization': 78.0,
    'latency_p95': 65.0,
    'queue_length': 15
}

decision = autoscaling.analyze_metrics(high_load_metrics)
print(f"📊 Analysis result: {decision['action']} - {decision['reason']}")
autoscaling.execute_scaling(decision)

# Scenario 2: sustained high load
print("\n📈 Scenario 2: sustained high load")
time.sleep(2)  # simulate time passing
autoscaling.last_scaling_time -= 310  # bypass the cooldown period

extreme_load_metrics = {
    'cpu_utilization': 92.0,
    'memory_utilization': 88.0,
    'latency_p95': 120.0,
    'queue_length': 25
}

decision = autoscaling.analyze_metrics(extreme_load_metrics)
print(f"📊 Analysis result: {decision['action']} - {decision['reason']}")
autoscaling.execute_scaling(decision)

# Scenario 3: reduced load triggers a scale-down
print("\n📉 Scenario 3: reduced load")
time.sleep(2)
autoscaling.last_scaling_time -= 310  # bypass the cooldown period

low_load_metrics = {
    'cpu_utilization': 25.0,
    'memory_utilization': 35.0,
    'latency_p95': 15.0,
    'queue_length': 1
}

decision = autoscaling.analyze_metrics(low_load_metrics)
print(f"📊 Analysis result: {decision['action']} - {decision['reason']}")
autoscaling.execute_scaling(decision)

# Get the scaling report
scaling_report = autoscaling.get_scaling_report()

print("\n📋 Autoscaling history report:")
print(f"   🏷️  Service: {scaling_report['service_name']}")
print(f"   📊 Current instances: {scaling_report['current_instances']}")
print(f"   📈 Instance range: {scaling_report['instance_range']}")
print(f"   🔄 Scaling events: {scaling_report['total_scaling_events']}")

print("\n📝 Scaling history:")
for event in scaling_report['scaling_history']:
    action_icon = '📈' if event['action'] == 'scale_up' else '📉'
    timestamp = datetime.fromtimestamp(event['timestamp']).strftime('%H:%M:%S')
    print(f"   {action_icon} {timestamp}: {event['from_instances']} → {event['to_instances']} ({event['action']})")
    print(f"      Reason: {event['reason'][:80]}..." if len(event['reason']) > 80 else f"      Reason: {event['reason']}")

## 4. Health Checks and Failure Recovery

### 4.1 Multi-Layer Health Checks
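
Each health check in this section declares a `timeout_seconds` value. One way to enforce such a limit is to run the check in a worker thread and wait on its future with a timeout. The sketch below is a minimal illustration under that assumption; `run_with_timeout` is a hypothetical helper, and plain status strings stand in for the `HealthStatus` enum.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict


def run_with_timeout(check: Callable[[], Dict[str, Any]],
                     timeout_seconds: float) -> Dict[str, Any]:
    """Run a health check, reporting 'unknown' if it times out or raises."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return {"status": "unknown", "details": {},
                "message": f"check timed out after {timeout_seconds}s"}
    except Exception as exc:
        return {"status": "unknown", "details": {"error": str(exc)},
                "message": f"check failed: {exc}"}
    finally:
        # Do not block on a check that is still running
        executor.shutdown(wait=False)


def slow_check() -> Dict[str, Any]:
    time.sleep(0.3)  # simulates a check that hangs past its deadline
    return {"status": "healthy", "details": {}, "message": "ok"}


fast_result = run_with_timeout(lambda: {"status": "healthy"}, timeout_seconds=1.0)
slow_result = run_with_timeout(slow_check, timeout_seconds=0.05)
print(fast_result["status"], slow_result["status"])
```

Python threads cannot be forcibly stopped, so a timed-out check keeps running in the background until it returns. Checks that may hang indefinitely are better isolated in a subprocess.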

In [None]:
from enum import Enum
from typing import Callable

class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"

@dataclass
class HealthCheck:
    name: str
    description: str
    check_function: Callable
    timeout_seconds: int
    critical: bool  # whether this is a critical check
    interval_seconds: int

class ComprehensiveHealthMonitor:
    """
    Comprehensive health monitoring system
    
    Multi-layer checks:
    - Infrastructure layer
    - Application layer
    - Business layer
    """
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.health_checks = self._setup_health_checks()
        self.health_history = defaultdict(list)
        self.alert_thresholds = {
            'consecutive_failures': 3,
            'failure_rate_threshold': 0.2  # 20%
        }
        
        print(f"🏥 Comprehensive health monitoring system started: {service_name}")
        print(f"   📋 Health checks configured: {len(self.health_checks)}")
    
    def _setup_health_checks(self) -> List[HealthCheck]:
        """
        Set up the health checks
        """
        return [
            # Infrastructure-layer checks
            HealthCheck(
                name="gpu_availability",
                description="GPU availability check",
                check_function=self._check_gpu_availability,
                timeout_seconds=5,
                critical=True,
                interval_seconds=30
            ),
            HealthCheck(
                name="memory_usage",
                description="Memory usage check",
                check_function=self._check_memory_usage,
                timeout_seconds=3,
                critical=True,
                interval_seconds=15
            ),
            HealthCheck(
                name="disk_space",
                description="Disk space check",
                check_function=self._check_disk_space,
                timeout_seconds=5,
                critical=False,
                interval_seconds=60
            ),
            
            # Application-layer checks
            HealthCheck(
                name="model_inference",
                description="Model inference check",
                check_function=self._check_model_inference,
                timeout_seconds=10,
                critical=True,
                interval_seconds=30
            ),
            HealthCheck(
                name="api_endpoint",
                description="API endpoint reachability check",
                check_function=self._check_api_endpoint,
                timeout_seconds=5,
                critical=True,
                interval_seconds=15
            ),
            
            # Business-layer checks
            HealthCheck(
                name="response_quality",
                description="Response quality check",
                check_function=self._check_response_quality,
                timeout_seconds=15,
                critical=False,
                interval_seconds=120
            ),
            HealthCheck(
                name="business_metrics",
                description="Business metrics health check",
                check_function=self._check_business_metrics,
                timeout_seconds=10,
                critical=False,
                interval_seconds=180
            )
        ]
    
    def _check_gpu_availability(self) -> Dict[str, Any]:
        """Check GPU availability"""
        try:
            # GPU check via PyTorch
            import torch
            if torch.cuda.is_available():
                gpu_count = torch.cuda.device_count()
                current_device = torch.cuda.current_device()
                gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
                
                return {
                    'status': HealthStatus.HEALTHY,
                    'details': {
                        'gpu_count': gpu_count,
                        'current_device': current_device,
                        'total_memory_gb': gpu_memory
                    },
                    'message': f'{gpu_count} GPU(s) available'
                }
            else:
                return {
                    'status': HealthStatus.UNHEALTHY,
                    'details': {},
                    'message': 'GPU unavailable'
                }
        except Exception as e:
            return {
                'status': HealthStatus.UNKNOWN,
                'details': {'error': str(e)},
                'message': f'GPU check failed: {str(e)}'
            }
    
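    def _check_memory_usage(self) -> Dict[str, Any]:
        """Check GPU memory usage."""
        try:
            # Import locally: the torch import in _check_gpu_availability is scoped to that method
            import torch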
            if torch.cuda.is_available():
                allocated = torch.cuda.memory_allocated(0) / (1024**3)
                cached = torch.cuda.memory_reserved(0) / (1024**3)
                total = torch.cuda.get_device_properties(0).total_memory / (1024**3)
                
                usage_ratio = allocated / total
                
                if usage_ratio < 0.8:
                    status = HealthStatus.HEALTHY
                elif usage_ratio < 0.9:
                    status = HealthStatus.DEGRADED
                else:
                    status = HealthStatus.UNHEALTHY
                
                return {
                    'status': status,
                    'details': {
                        'allocated_gb': allocated,
                        'cached_gb': cached,
                        'total_gb': total,
                        'usage_ratio': usage_ratio
                    },
                    'message': f'GPU memory usage: {usage_ratio:.1%}'
                }
            else:
                return {
                    'status': HealthStatus.HEALTHY,
                    'details': {},
                    'message': 'CPU mode; skipping GPU memory check'
                }
        except Exception as e:
            return {
                'status': HealthStatus.UNKNOWN,
                'details': {'error': str(e)},
                'message': f'Memory check failed: {str(e)}'
            }
    
    def _check_disk_space(self) -> Dict[str, Any]:
        """Check disk space on the root filesystem."""
        try:
            import shutil
            total, used, free = shutil.disk_usage("/")
            
            usage_ratio = used / total
            
            if usage_ratio < 0.8:
                status = HealthStatus.HEALTHY
            elif usage_ratio < 0.9:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.UNHEALTHY
            
            return {
                'status': status,
                'details': {
                    'total_gb': total / (1024**3),
                    'used_gb': used / (1024**3),
                    'free_gb': free / (1024**3),
                    'usage_ratio': usage_ratio
                },
                'message': f'Disk usage: {usage_ratio:.1%}'
            }
        except Exception as e:
            return {
                'status': HealthStatus.UNKNOWN,
                'details': {'error': str(e)},
                'message': f'Disk check failed: {str(e)}'
            }
    
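    def _check_model_inference(self) -> Dict[str, Any]:
        """Check that model inference works."""
        try:
            # Simulated inference test
            start_time = time.time()
            
            # Stand-in for a real inference call. Sleeping 0.1 s would always
            # land at or above the 100 ms UNHEALTHY threshold, so simulate 20 ms.
            time.sleep(0.02)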
            
            inference_time = (time.time() - start_time) * 1000
            
            if inference_time < 50:
                status = HealthStatus.HEALTHY
            elif inference_time < 100:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.UNHEALTHY
            
            return {
                'status': status,
                'details': {
                    'inference_time_ms': inference_time,
                    'test_successful': True
                },
                'message': f'Inference test succeeded in {inference_time:.1f}ms'
            }
        except Exception as e:
            return {
                'status': HealthStatus.UNHEALTHY,
                'details': {'error': str(e)},
                'message': f'Inference test failed: {str(e)}'
            }
    
    def _check_api_endpoint(self) -> Dict[str, Any]:
        """Check API endpoint reachability."""
        try:
            # Simulated API health check
            response_time = np.random.normal(20, 5)  # simulated response time (ms)
            
            if response_time < 100:
                status = HealthStatus.HEALTHY
            elif response_time < 500:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.UNHEALTHY
            
            return {
                'status': status,
                'details': {
                    'response_time_ms': response_time,
                    'endpoint_reachable': True
                },
                'message': f'API endpoint reachable, response time {response_time:.1f}ms'
            }
        except Exception as e:
            return {
                'status': HealthStatus.UNHEALTHY,
                'details': {'error': str(e)},
                'message': f'API endpoint check failed: {str(e)}'
            }
    
    def _check_response_quality(self) -> Dict[str, Any]:
        """Check response quality."""
        try:
            # Simulated quality check
            accuracy = np.random.normal(0.95, 0.02)  # simulated accuracy
            consistency = np.random.normal(0.98, 0.01)  # simulated consistency
            
            quality_score = (accuracy + consistency) / 2
            
            if quality_score > 0.95:
                status = HealthStatus.HEALTHY
            elif quality_score > 0.90:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.UNHEALTHY
            
            return {
                'status': status,
                'details': {
                    'accuracy': accuracy,
                    'consistency': consistency,
                    'quality_score': quality_score
                },
                'message': f'Response quality: {quality_score:.1%}'
            }
        except Exception as e:
            return {
                'status': HealthStatus.UNKNOWN,
                'details': {'error': str(e)},
                'message': f'Quality check failed: {str(e)}'
            }
    
    def _check_business_metrics(self) -> Dict[str, Any]:
        """Check the health of business metrics."""
        try:
            # Simulated business metrics
            fraud_detection_rate = np.random.normal(0.85, 0.05)
            false_positive_rate = np.random.normal(0.03, 0.01)
            customer_satisfaction = np.random.normal(0.92, 0.03)
            
            # Composite score: unweighted mean of the three metrics (false positives inverted)
            business_health = (fraud_detection_rate + (1 - false_positive_rate) + customer_satisfaction) / 3
            
            if business_health > 0.90:
                status = HealthStatus.HEALTHY
            elif business_health > 0.80:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.UNHEALTHY
            
            return {
                'status': status,
                'details': {
                    'fraud_detection_rate': fraud_detection_rate,
                    'false_positive_rate': false_positive_rate,
                    'customer_satisfaction': customer_satisfaction,
                    'business_health_score': business_health
                },
                'message': f'Business health: {business_health:.1%}'
            }
        except Exception as e:
            return {
                'status': HealthStatus.UNKNOWN,
                'details': {'error': str(e)},
                'message': f'Business metrics check failed: {str(e)}'
            }
    
    def run_all_health_checks(self) -> Dict[str, Any]:
        """
        Run every registered health check and aggregate the results.
        """
        print("🏥 Running comprehensive health checks...")
        
        results = {
            'timestamp': time.time(),
            'service_name': self.service_name,
            'checks': {},
            'summary': {
                'total_checks': len(self.health_checks),
                'healthy': 0,
                'degraded': 0,
                'unhealthy': 0,
                'unknown': 0,
                'critical_failures': 0
            }
        }
        
        for check in self.health_checks:
            try:
                print(f"   🔍 {check.description}...")
                
                start_time = time.time()
                result = check.check_function()
                execution_time = time.time() - start_time
                
                result['execution_time_ms'] = execution_time * 1000
                result['check_name'] = check.name
                result['critical'] = check.critical
                
                results['checks'][check.name] = result
                
                # Update summary counts
                status = result['status']
                if status == HealthStatus.HEALTHY:
                    results['summary']['healthy'] += 1
                    print(f"      ✅ Healthy")
                elif status == HealthStatus.DEGRADED:
                    results['summary']['degraded'] += 1
                    print(f"      🟡 Degraded")
                elif status == HealthStatus.UNHEALTHY:
                    results['summary']['unhealthy'] += 1
                    if check.critical:
                        results['summary']['critical_failures'] += 1
                    print(f"      ❌ Unhealthy")
                else:
                    results['summary']['unknown'] += 1
                    print(f"      ❓ Unknown")
                
            except Exception as e:
                print(f"      💥 Check execution failed: {str(e)}")
                results['checks'][check.name] = {
                    'status': HealthStatus.UNKNOWN,
                    'details': {'error': str(e)},
                    'message': f'Check execution failed: {str(e)}',
                    'critical': check.critical
                }
                results['summary']['unknown'] += 1
        
        # Derive the overall status: any critical failure makes the service UNHEALTHY;
        # any other unhealthy or degraded check downgrades it to DEGRADED.
        if results['summary']['critical_failures'] > 0:
            results['overall_status'] = HealthStatus.UNHEALTHY
        elif results['summary']['unhealthy'] > 0 or results['summary']['degraded'] > 0:
            results['overall_status'] = HealthStatus.DEGRADED
        else:
            results['overall_status'] = HealthStatus.HEALTHY
        
        return results

# Create the health monitor
health_monitor = ComprehensiveHealthMonitor("netflix-recommendation-service")

# Run the health checks
health_results = health_monitor.run_all_health_checks()

print("\n📋 Health check summary:")
print(f"   🏷️  Service: {health_results['service_name']}")
print(f"   📊 Overall status: {health_results['overall_status'].value.upper()}")
print(f"   📈 Check counts:")
print(f"      ✅ Healthy: {health_results['summary']['healthy']}")
print(f"      🟡 Degraded: {health_results['summary']['degraded']}")
print(f"      ❌ Unhealthy: {health_results['summary']['unhealthy']}")
print(f"      ❓ Unknown: {health_results['summary']['unknown']}")
print(f"      🔴 Critical failures: {health_results['summary']['critical_failures']}")

print("\n🔍 Detailed check results:")
for check_name, result in health_results['checks'].items():
    status_icon = {
        HealthStatus.HEALTHY: '✅',
        HealthStatus.DEGRADED: '🟡',
        HealthStatus.UNHEALTHY: '❌',
        HealthStatus.UNKNOWN: '❓'
    }[result['status']]
    
    critical_marker = ' 🔴' if result.get('critical', False) else ''
    print(f"   {status_icon} {check_name}{critical_marker}: {result['message']}")
    print(f"      Execution time: {result.get('execution_time_ms', 0):.1f}ms")
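
Each `HealthCheck` carries a `timeout_seconds` value, but `run_all_health_checks` calls `check.check_function()` directly, so a hung check would stall the whole run. One way to enforce the limit is sketched below. This is a minimal sketch and not part of the original lab: `run_with_timeout` is a hypothetical helper, and the plain string `'unknown'` stands in for `HealthStatus.UNKNOWN` so that the block runs on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict


def run_with_timeout(check_function: Callable[[], Dict[str, Any]],
                     timeout_seconds: float) -> Dict[str, Any]:
    """Run one check in a worker thread; report 'unknown' if it overruns.

    Hypothetical helper for illustration. Exceptions raised by the check
    itself propagate to the caller, as they do with a direct call.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check_function)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return {
            'status': 'unknown',  # stand-in for HealthStatus.UNKNOWN
            'details': {'timeout_seconds': timeout_seconds},
            'message': f'check timed out after {timeout_seconds}s',
        }
    finally:
        # Do not wait for the worker: a timed-out check keeps running in the background.
        executor.shutdown(wait=False)


def slow_check() -> Dict[str, Any]:
    """Simulated check that takes longer than its time budget."""
    time.sleep(0.5)
    return {'status': 'healthy', 'details': {}, 'message': 'ok'}


result = run_with_timeout(slow_check, timeout_seconds=0.05)
print(result['status'], '-', result['message'])
```

Inside the loop, `result = run_with_timeout(check.check_function, check.timeout_seconds)` could replace the direct call. Python threads cannot be killed, so a timed-out check continues in the background; a check that must be forcibly stopped would need to run in a subprocess instead.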

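## 🎯 Chapter Summary

### Key Learning Outcomes

In this lab you have learned:

1. **📊 Enterprise-grade monitoring**
   - Collecting and managing Prometheus metrics
   - Multi-dimensional performance monitoring
   - Real-time alerting and notification

2. **🎯 SLA monitoring and assurance**
   - 99.99% availability monitoring
   - P95/P99 latency tracking
   - Automated detection of SLA violations

3. **🎛️  Automated operations**
   - Intelligent resource allocation
   - Dynamic scale-up and scale-down decisions
   - Load forecasting and optimization

4. **🏥 Health checks and failure recovery**
   - Multi-layer health monitoring
   - Proactive failure detection
   - Automated recovery workflows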
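
As a worked example of the SLA figures in item 2: a 99.99% availability target leaves roughly 52.6 minutes of downtime per year, and P95/P99 are read from the sorted latency samples. The sketch below uses the nearest-rank percentile definition, which is an assumption (Prometheus, for instance, estimates quantiles from histogram buckets instead). `downtime_budget_minutes` and `percentile` are hypothetical helper names, not functions from this lab.

```python
import math
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime for an availability target over the given period."""
    return (1.0 - availability) * period_minutes


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 99.99% availability over one year: about 52.56 minutes of allowed downtime
print(f"{downtime_budget_minutes(0.9999):.2f} minutes/year")

# Illustrative latency samples in milliseconds: 1, 2, ..., 100
latencies_ms = [float(i) for i in range(1, 101)]
print("P95:", percentile(latencies_ms, 95), "ms")
print("P99:", percentile(latencies_ms, 99), "ms")
```

With a P99 target of 50 ms, an SLA check would compare `percentile(latencies_ms, 99)` against that threshold over a sliding window of recent requests.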

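### Enterprise Operations Skills

You can now:
- Design observability for a **Netflix-scale** service
- Monitor SLAs with **financial-grade** rigor
- Automate operations in a **cloud-native** environment
- Handle failures in **production** systems

### Overall Lab-2.1 Outcomes

Having completed all of **Lab-2.1: Triton Server Basics**, you are able to:

1. **Deploy Triton Server end to end**
2. **Design an enterprise-grade Model Repository**
3. **Optimize the PyTorch Backend in depth**
4. **Run production-grade monitoring and operations**

### Next Steps

Get ready for **Lab-2.2: Multi-Model Management**:
- Unified multi-model management platform
- Automated A/B testing
- Model lifecycle management
- Enterprise model governance

---

**🏆 Congratulations! You have completed enterprise-grade monitoring and performance optimization for Triton Server basics!**

**📈 Skills summary: from basic deployment to enterprise-grade operations!**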