# Lab 2.5-03: Performance Analysis and Diagnostics

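## Lab Objectives

This lab analyzes vLLM performance data in depth, covering:
- In-depth analysis of historical data
- Identifying and diagnosing performance bottlenecks
- Load testing and benchmark evaluation
- Capacity planning and scaling recommendations
- Formulating performance optimization strategies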

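## Analysis Framework

### 1. Multi-Dimensional Performance Analysis
```
Performance analysis dimensions:
├── Time-based
│   ├── Trend analysis
│   ├── Periodic patterns
│   └── Spike detection
├── Resource-based
│   ├── CPU/Memory/GPU utilization
│   ├── Bottleneck resource identification
│   └── Resource ratio optimization
└── Business-based
    ├── Request processing efficiency
    ├── User experience metrics
    └── Cost-benefit analysis
```

### 2. Performance Metric System
- **Responsiveness metrics**: latency distribution, TTFT (time to first token), TPOT (time per output token)
- **Throughput metrics**: QPS (queries per second), TPS, concurrent processing capacity
- **Reliability metrics**: error rate, availability, stability
- **Efficiency metrics**: resource utilization, cost-efficiency ratio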
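
To make the responsiveness and throughput metrics concrete, here is a minimal sketch that derives them from per-request timings. The `RequestTrace` record and its field names are hypothetical (they are not part of vLLM), TPOT is taken here as the average gap between output tokens after the first one, and the throughput figure is reported as tokens per second as an assumption about what TPS means in this lab.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (field names are illustrative)."""
    start: float          # when the request was sent, in seconds
    first_token: float    # when the first output token arrived
    end: float            # when the last output token arrived
    output_tokens: int    # number of generated tokens


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.first_token - trace.start


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first one."""
    if trace.output_tokens <= 1:
        return 0.0
    return (trace.end - trace.first_token) / (trace.output_tokens - 1)


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    """Aggregate responsiveness and throughput over the observed window."""
    window = max(t.end for t in traces) - min(t.start for t in traces)
    return {
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_tpot_s": mean(tpot(t) for t in traces),
        "qps": len(traces) / window,
        "tokens_per_s": sum(t.output_tokens for t in traces) / window,
    }


traces = [
    RequestTrace(start=0.0, first_token=0.5, end=2.5, output_tokens=21),
    RequestTrace(start=1.0, first_token=1.25, end=4.25, output_tokens=31),
]
summary = summarize(traces)
print(summary)
```

Both sample requests have a TPOT of 0.1 s, while their TTFT values differ (0.5 s and 0.25 s), which is why the two metrics are tracked separately.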

## 1. Environment Setup and Data Loading

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
from typing import Dict, List, Tuple, Optional, Any
from pathlib import Path

# Scientific computing and statistics
from scipy import stats
from scipy.signal import find_peaks
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Time series analysis
try:
    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.tsa.stattools import adfuller
    STATSMODELS_AVAILABLE = True
except ImportError:
    STATSMODELS_AVAILABLE = False
    print("⚠️  statsmodels is not installed; some time series analysis features are unavailable")

# Advanced visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.offline as pyo

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
pyo.init_notebook_mode(connected=True)

print("✅ Performance analysis environment initialized")

In [None]:
class PerformanceDataLoader:
    """Loads vLLM monitoring data for analysis."""
    
    def __init__(self, data_dir: str = "."):
        self.data_dir = Path(data_dir)
        self.monitoring_data = None
        self.load_test_data = None
    
    def find_monitoring_files(self) -> List[Path]:
        """Find monitoring data files, newest first."""
        json_files = list(self.data_dir.glob("vllm_monitoring_data_*.json"))
        csv_files = list(self.data_dir.glob("vllm_monitoring_data_*.csv"))
        
        return sorted(json_files + csv_files, key=lambda x: x.stat().st_mtime, reverse=True)
    
    def load_monitoring_data(self, file_path: Optional[str] = None) -> pd.DataFrame:
        """Load monitoring data, falling back to generated sample data if no file is found."""
        if file_path is None:
            # Pick the most recently modified data file
            files = self.find_monitoring_files()
            if not files:
                print("❌ No monitoring data file found")
                return self.generate_sample_data()
            file_path = files[0]
            print(f"📁 Using data file: {file_path}")
        
        file_path = Path(file_path)
        
        if file_path.suffix == '.json':
            return self._load_json_data(file_path)
        elif file_path.suffix == '.csv':
            return self._load_csv_data(file_path)
        else:
            raise ValueError(f"Unsupported file format: {file_path.suffix}")
    
    def _load_json_data(self, file_path: Path) -> pd.DataFrame:
        """Load data in JSON format."""
        with open(file_path, 'r') as f:
            data = json.load(f)
        
        # Build the DataFrame
        df_data = {}
        
        # Timestamps
        timestamps = [datetime.fromisoformat(ts) for ts in data['timestamps']]
        df_data['timestamp'] = timestamps
        
        # System metrics
        for metric_name, values in data['system_metrics'].items():
            # Pad or truncate so every column matches the timestamp count
            padded_values = values + [None] * (len(timestamps) - len(values))
            df_data[metric_name] = padded_values[:len(timestamps)]
        
        # vLLM metrics
        for metric_name, values in data['vllm_metrics'].items():
            padded_values = values + [None] * (len(timestamps) - len(values))
            df_data[metric_name] = padded_values[:len(timestamps)]
        
        df = pd.DataFrame(df_data)
        df.set_index('timestamp', inplace=True)
        
        print(f"✅ Loaded JSON data: {len(df)} rows, {len(df.columns)} columns")
        return df
    
    def _load_csv_data(self, file_path: Path) -> pd.DataFrame:
        """Load data in CSV format."""
        df = pd.read_csv(file_path)
        
        if 'timestamp' in df.columns:
            df['timestamp'] = pd.to_datetime(df['timestamp'])
            df.set_index('timestamp', inplace=True)
        
        print(f"✅ Loaded CSV data: {len(df)} rows, {len(df.columns)} columns")
        return df
    
    def generate_sample_data(self, duration_hours: int = 2, interval_seconds: int = 30) -> pd.DataFrame:
        """Generate sample data for analysis."""
        print("🔧 Generating sample monitoring data...")
        
        # Time range
        end_time = datetime.now()
        start_time = end_time - timedelta(hours=duration_hours)
        timestamps = pd.date_range(start_time, end_time, freq=f'{interval_seconds}s')
        
        n_points = len(timestamps)
        
        # Fix the random seed so the sample data is reproducible
        np.random.seed(42)
        
        # Time factor: two sine cycles across the window, simulating periodic load variation
        time_factor = np.sin(np.linspace(0, 4*np.pi, n_points)) * 0.2 + 1
        
        data = {
            # System metrics
            'cpu_percent': np.clip(
                30 + 20 * time_factor + np.random.normal(0, 5, n_points), 0, 100
            ),
            'memory_percent': np.clip(
                60 + 10 * time_factor + np.random.normal(0, 3, n_points), 0, 100
            ),
            'gpu_memory_used': np.clip(
                45 + 25 * time_factor + np.random.normal(0, 8, n_points), 0, 100
            ),
            'gpu_utilization': np.clip(
                40 + 30 * time_factor + np.random.normal(0, 10, n_points), 0, 100
            ),
            
            # vLLM metrics (simulated)
            'vllm_num_requests_running': np.random.poisson(3 * time_factor, n_points),
            'vllm_num_requests_waiting': np.random.poisson(1 * time_factor, n_points),
            'vllm_request_success_total': np.cumsum(np.random.poisson(2 * time_factor, n_points)),
            'vllm_request_failure_total': np.cumsum(np.random.poisson(0.1 * time_factor, n_points)),
            
            # Latency metrics (seconds)
            'vllm_time_to_first_token_seconds_sum': np.cumsum(
                np.random.exponential(1.5 * time_factor, n_points)
            ),
            'vllm_time_to_first_token_seconds_count': np.cumsum(
                np.random.poisson(1 * time_factor, n_points)
            ),
        }
        
        # Inject a few load spikes
        spike_points = np.random.choice(n_points, size=5, replace=False)
        for point in spike_points:
            data['cpu_percent'][point] = min(95, data['cpu_percent'][point] + 30)
            data['gpu_utilization'][point] = min(100, data['gpu_utilization'][point] + 40)
        
        df = pd.DataFrame(data, index=timestamps)
        
        print(f"✅ Generated sample data: {len(df)} rows, {len(df.columns)} columns")
        return df

# Initialize the data loader
data_loader = PerformanceDataLoader()
print("✅ Data loader initialized")
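
The JSON branch of the loader expects three top-level keys: `timestamps`, `system_metrics`, and `vllm_metrics`. The sketch below shows that layout on a tiny hand-written payload and mirrors the pad-then-truncate step in a standalone helper. The helper name `json_to_frame` and the sample values are illustrative assumptions, not part of the lab code.

```python
import json
from datetime import datetime

import pandas as pd

# Hypothetical payload in the layout read by _load_json_data
sample = {
    "timestamps": ["2024-01-01T00:00:00", "2024-01-01T00:00:30", "2024-01-01T00:01:00"],
    "system_metrics": {"cpu_percent": [35.0, 40.5, 38.2]},
    "vllm_metrics": {"vllm_num_requests_running": [2, 3]},  # one sample short
}


def json_to_frame(payload: dict) -> pd.DataFrame:
    """Build a timestamp-indexed DataFrame, padding short metric lists with None."""
    index = pd.DatetimeIndex(
        [datetime.fromisoformat(ts) for ts in payload["timestamps"]], name="timestamp"
    )
    columns = {}
    for group in ("system_metrics", "vllm_metrics"):
        for name, values in payload.get(group, {}).items():
            # Pad lists that are too short, then cut lists that are too long
            padded = list(values) + [None] * (len(index) - len(values))
            columns[name] = padded[:len(index)]
    return pd.DataFrame(columns, index=index)


frame = json_to_frame(json.loads(json.dumps(sample)))
print(frame)
```

The short `vllm_num_requests_running` list ends up with one missing value in the last row, which is the kind of gap the integrity check in the next cell reports.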

In [None]:
# Load monitoring data
df = data_loader.load_monitoring_data()

# Basic dataset information
print("\n📊 Dataset overview:")
print(f"   Time range: {df.index.min()} to {df.index.max()}")
print(f"   Data points: {len(df)}")
print(f"   Time span: {(df.index.max() - df.index.min()).total_seconds():.0f} s")
# max(..., 1) avoids division by zero when only one sample is present
print(f"   Average interval: {(df.index.max() - df.index.min()).total_seconds() / max(len(df) - 1, 1):.1f} s")

print("\n📈 Available metrics:")
system_metrics = [col for col in df.columns if not col.startswith('vllm_')]
vllm_metrics = [col for col in df.columns if col.startswith('vllm_')]

print(f"   System metrics ({len(system_metrics)}): {', '.join(system_metrics[:5])}{'...' if len(system_metrics) > 5 else ''}")
print(f"   vLLM metrics ({len(vllm_metrics)}): {', '.join(vllm_metrics[:5])}{'...' if len(vllm_metrics) > 5 else ''}")

# Check data completeness
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    print("\n⚠️  Missing data:")
    for col, missing in missing_data[missing_data > 0].items():
        print(f"   {col}: {missing} missing values ({missing/len(df)*100:.1f}%)")
else:
    print("\n✅ Data is complete, no missing values")
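
The average interval printed above hides irregular sampling: a collector that paused for two minutes still yields a plausible mean. A small sketch for spotting such gaps follows. The helper name `find_sampling_gaps` and the 1.5× median tolerance are assumptions chosen for illustration.

```python
import pandas as pd


def find_sampling_gaps(index: pd.DatetimeIndex, tolerance: float = 1.5) -> pd.Series:
    """Return intervals (in seconds) longer than `tolerance` times the median interval.

    Each returned value is labeled with the timestamp at which the gap ends.
    """
    deltas = index.to_series().diff().dt.total_seconds().dropna()
    if deltas.empty:
        return deltas
    return deltas[deltas > tolerance * deltas.median()]


index = pd.to_datetime([
    "2024-01-01 00:00:00",
    "2024-01-01 00:00:30",
    "2024-01-01 00:01:00",
    "2024-01-01 00:03:00",  # two minutes after the previous sample
    "2024-01-01 00:03:30",
])
gaps = find_sampling_gaps(index)
print(gaps)
```

On real data this could be called as `find_sampling_gaps(df.index)` before computing per-interval rates, since a long gap inflates any counter difference that spans it.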

## 2. Comprehensive Performance Analyzer

In [None]:
class ComprehensivePerformanceAnalyzer:
    """Comprehensive performance analyzer."""
    
    def __init__(self, data: pd.DataFrame):
        self.data = data.copy()
        # Handle missing values: forward-fill, then zero-fill any leading gaps.
        # DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in pandas 2.x.
        self.data = self.data.ffill().fillna(0)
        
        # Classify metrics
        self.system_metrics = [col for col in self.data.columns if not col.startswith('vllm_')]
        self.vllm_metrics = [col for col in self.data.columns if col.startswith('vllm_')]
        
        # Performance thresholds (percent)
        self.performance_thresholds = {
            'cpu_percent': {'good': 60, 'warning': 80, 'critical': 95},
            'memory_percent': {'good': 70, 'warning': 85, 'critical': 95},
            'gpu_memory_used': {'good': 70, 'warning': 85, 'critical': 95},
            'gpu_utilization': {'good': 80, 'warning': 90, 'critical': 98}
        }
    
    def calculate_derived_metrics(self):
        """Compute derived metrics."""
        print("🔧 Computing derived metrics...")
        
        # Request success rate (cumulative, because both counters are running totals)
        if 'vllm_request_success_total' in self.data.columns and 'vllm_request_failure_total' in self.data.columns:
            total_requests = self.data['vllm_request_success_total'] + self.data['vllm_request_failure_total']
            self.data['success_rate'] = np.where(total_requests > 0, 
                                                self.data['vllm_request_success_total'] / total_requests * 100, 100)
        
        # Average TTFT (cumulative mean: running sum divided by running count)
        if ('vllm_time_to_first_token_seconds_sum' in self.data.columns and 
            'vllm_time_to_first_token_seconds_count' in self.data.columns):
            count = self.data['vllm_time_to_first_token_seconds_count']
            total_time = self.data['vllm_time_to_first_token_seconds_sum']
            self.data['avg_ttft'] = np.where(count > 0, total_time / count, 0)
        
        # Request throughput (QPS): counter increase divided by elapsed seconds
        if 'vllm_request_success_total' in self.data.columns:
            self.data['qps'] = self.data['vllm_request_success_total'].diff().fillna(0) / (
                self.data.index.to_series().diff().dt.total_seconds().fillna(1)
            )
            self.data['qps'] = self.data['qps'].clip(lower=0)  # Remove negative values
        
        # System load index (composite metric)
        load_components = []
        if 'cpu_percent' in self.data.columns:
            load_components.append(self.data['cpu_percent'] / 100)
        if 'memory_percent' in self.data.columns:
            load_components.append(self.data['memory_percent'] / 100)
        if 'gpu_utilization' in self.data.columns:
            load_components.append(self.data['gpu_utilization'] / 100)
        
        if load_components:
            self.data['system_load_index'] = np.mean(load_components, axis=0) * 100
        
        derived = ['success_rate', 'avg_ttft', 'qps', 'system_load_index']
        added = [col for col in derived if col in self.data.columns]
        print(f"✅ Derived metrics computed, {len(added)} metrics added")
    
    def performance_summary_statistics(self) -> Dict[str, Any]:
        """Summary statistics for performance metrics."""
        stats = {}
        
        # System resource statistics
        for metric in self.system_metrics:
            if metric in self.data.columns:
                series = self.data[metric].dropna()
                if len(series) > 0:
                    stats[metric] = {
                        'mean': series.mean(),
                        'median': series.median(),
                        'std': series.std(),
                        'min': series.min(),
                        'max': series.max(),
                        'p95': series.quantile(0.95),
                        'p99': series.quantile(0.99)
                    }
        
        # Derived vLLM performance statistics
        derived_metrics = ['success_rate', 'avg_ttft', 'qps', 'system_load_index']
        for metric in derived_metrics:
            if metric in self.data.columns:
                series = self.data[metric].dropna()
                if len(series) > 0:
                    stats[metric] = {
                        'mean': series.mean(),
                        'median': series.median(),
                        'std': series.std(),
                        'min': series.min(),
                        'max': series.max(),
                        'p95': series.quantile(0.95),
                        'p99': series.quantile(0.99)
                    }
        
        return stats
    
    def identify_performance_bottlenecks(self) -> Dict[str, Any]:
        """Identify performance bottlenecks."""
        bottlenecks = {
            'critical_periods': [],
            'resource_constraints': [],
            'performance_degradation': [],
            'recommendations': []
        }
        
        # Check for periods where resource usage exceeds thresholds
        for metric, thresholds in self.performance_thresholds.items():
            if metric in self.data.columns:
                series = self.data[metric]
                
                # Flag samples above the critical threshold
                critical_mask = series > thresholds['critical']
                
                if critical_mask.any():
                    critical_periods = self._find_consecutive_periods(critical_mask)
                    for start, end, duration in critical_periods:
                        bottlenecks['critical_periods'].append({
                            'metric': metric,
                            'start_time': start,
                            'end_time': end,
                            'duration_seconds': duration,
                            'max_value': series[start:end].max(),
                            'severity': 'critical'
                        })
                
                # Resource constraint analysis
                high_usage_pct = (series > thresholds['warning']).mean() * 100
                if high_usage_pct > 20:  # Above the warning threshold more than 20% of the time
                    bottlenecks['resource_constraints'].append({
                        'metric': metric,
                        'high_usage_percentage': high_usage_pct,
                        'avg_usage': series.mean(),
                        'p95_usage': series.quantile(0.95)
                    })
        
        # Performance degradation detection
        if 'qps' in self.data.columns and len(self.data) > 100:
            qps_series = self.data['qps'].rolling(window=20).mean()
            if len(qps_series.dropna()) > 50:
                early_performance = qps_series.dropna().iloc[:25].mean()
                late_performance = qps_series.dropna().iloc[-25:].mean()
                
                if early_performance > 0 and (late_performance / early_performance) < 0.8:
                    bottlenecks['performance_degradation'].append({
                        'metric': 'qps',
                        'early_avg': early_performance,
                        'late_avg': late_performance,
                        'degradation_pct': (1 - late_performance / early_performance) * 100
                    })
        
        # Generate recommendations
        bottlenecks['recommendations'] = self._generate_recommendations(bottlenecks)
        
        return bottlenecks
    
    def _find_consecutive_periods(self, mask: pd.Series) -> List[Tuple]:
        """Find consecutive periods where the mask is True."""
        periods = []
        start = None
        
        for i, (timestamp, value) in enumerate(mask.items()):
            if value and start is None:
                start = timestamp
            elif not value and start is not None:
                end = timestamp
                duration = (end - start).total_seconds()
                periods.append((start, end, duration))
                start = None
        
        # Handle a run that extends to the end of the series
        if start is not None:
            end = mask.index[-1]
            duration = (end - start).total_seconds()
            periods.append((start, end, duration))
        
        return periods
    
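    def _generate_recommendations(self, bottlenecks: Dict) -> List[str]:
        """Generate optimization recommendations."""
        recommendations = []
        
        # Recommendations based on resource constraints
        for constraint in bottlenecks['resource_constraints']:
            metric = constraint['metric']
            if metric == 'cpu_percent':
                recommendations.append("CPU utilization is too high; consider optimizing algorithms or adding CPU cores")
            elif metric == 'memory_percent':
                recommendations.append("Memory utilization is too high; consider adding memory or reducing memory usage")
            elif metric == 'gpu_memory_used':
                recommendations.append("GPU memory utilization is too high; consider adjusting the batch size or sharding the model")
            elif metric == 'gpu_utilization':
                recommendations.append("GPU utilization is too high; consider adding GPUs or optimizing computation")
        
        # Recommendations based on performance degradation
        if bottlenecks['performance_degradation']:
            recommendations.append("Performance degradation detected; check the model cache and look for memory leaks")
        
        # Recommendations based on critical periods
        if bottlenecks['critical_periods']:
            recommendations.append("Critical resource usage periods detected; consider load balancing and autoscaling")
        
        return recommendations

# Initialize the analyzer
analyzer = ComprehensivePerformanceAnalyzer(df)

# Compute derived metrics
analyzer.calculate_derived_metrics()

print("✅ Comprehensive performance analyzer initialized")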
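
The derived metrics above divide one running total by another, so `success_rate` and `avg_ttft` are averages since the first sample and react slowly to recent changes. If per-interval values are wanted instead, one option is to difference both counters first. This is a sketch under the assumption that the columns are monotonically increasing counters; the helper name `windowed_ratio` is hypothetical and counter resets are not handled.

```python
import pandas as pd


def windowed_ratio(numerator_total: pd.Series,
                   denominator_total: pd.Series,
                   window: int = 1) -> pd.Series:
    """Ratio of the counter increases over the last `window` samples.

    Returns NaN where the denominator did not increase (no events in the window).
    """
    num = numerator_total.diff(window)
    den = denominator_total.diff(window)
    return num / den.where(den > 0)


# Cumulative TTFT sum (seconds) and count, as a histogram's _sum and _count would look
ttft_sum = pd.Series([0.0, 2.0, 3.0, 3.0, 9.0])
ttft_count = pd.Series([0, 2, 4, 4, 6])

interval_ttft = windowed_ratio(ttft_sum, ttft_count)
cumulative_ttft = ttft_sum.iloc[-1] / ttft_count.iloc[-1]
print(interval_ttft.tolist())
print(cumulative_ttft)
```

In this toy series the last interval averages 3.0 s per request while the cumulative mean is only 1.5 s, which shows how a running average can mask a recent latency regression.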

## 3. Detailed Performance Statistics

In [None]:
# Get performance statistics
performance_stats = analyzer.performance_summary_statistics()

# Print detailed statistics
print("\n" + "="*80)
print("📊 Detailed Performance Statistics")
print("="*80)

# System resource statistics
print("\n🖥️  System resource usage statistics:")
print(f"{'Metric':<20} {'Mean':<10} {'Median':<10} {'P95':<10} {'P99':<10} {'Max':<10} {'Std':<10}")
print("-" * 80)

# Use metric_stats rather than stats so the module-level scipy.stats import
# is not overwritten; the time series cells below still call stats.linregress.
for metric in ['cpu_percent', 'memory_percent', 'gpu_memory_used', 'gpu_utilization']:
    if metric in performance_stats:
        metric_stats = performance_stats[metric]
        print(f"{metric:<20} {metric_stats['mean']:<10.1f} {metric_stats['median']:<10.1f} "
              f"{metric_stats['p95']:<10.1f} {metric_stats['p99']:<10.1f} {metric_stats['max']:<10.1f} {metric_stats['std']:<10.1f}")

# Performance metric statistics
print("\n🚀 vLLM performance metric statistics:")
performance_metrics = ['success_rate', 'avg_ttft', 'qps', 'system_load_index']

for metric in performance_metrics:
    if metric in performance_stats:
        metric_stats = performance_stats[metric]
        unit = '%' if metric in ['success_rate', 'system_load_index'] else ('s' if 'ttft' in metric else 'req/s')
        print(f"\n   {metric}:")
        print(f"     Mean: {metric_stats['mean']:.3f} {unit}")
        print(f"     Median: {metric_stats['median']:.3f} {unit}")
        print(f"     P95: {metric_stats['p95']:.3f} {unit}")
        print(f"     P99: {metric_stats['p99']:.3f} {unit}")
        print(f"     Range: {metric_stats['min']:.3f} - {metric_stats['max']:.3f} {unit}")
        # The coefficient of variation is undefined when the mean is zero
        if metric_stats['mean'] != 0:
            print(f"     Coefficient of variation: {(metric_stats['std']/metric_stats['mean']*100):.1f}%")

print("\n" + "="*80)

## 4. Performance Bottleneck Identification and Diagnosis

In [None]:
# Run the bottleneck analysis
bottlenecks = analyzer.identify_performance_bottlenecks()

print("\n" + "="*80)
print("🔍 Performance Bottleneck Diagnostic Report")
print("="*80)

# Critical period analysis
if bottlenecks['critical_periods']:
    print("\n⚠️  Critical performance periods detected:")
    for i, period in enumerate(bottlenecks['critical_periods'], 1):
        print(f"\n   Period {i}:")
        print(f"     Metric: {period['metric']}")
        print(f"     Time: {period['start_time'].strftime('%H:%M:%S')} - {period['end_time'].strftime('%H:%M:%S')}")
        print(f"     Duration: {period['duration_seconds']:.0f} s")
        print(f"     Peak value: {period['max_value']:.1f}%")
        print(f"     Severity: {period['severity']}")
else:
    print("\n✅ No critical performance periods detected")

# Resource constraint analysis
if bottlenecks['resource_constraints']:
    print("\n📈 Resource constraint analysis:")
    for constraint in bottlenecks['resource_constraints']:
        print(f"\n   {constraint['metric']}:")
        print(f"     Share of time at high usage: {constraint['high_usage_percentage']:.1f}%")
        print(f"     Mean usage: {constraint['avg_usage']:.1f}%")
        print(f"     P95 usage: {constraint['p95_usage']:.1f}%")
else:
    print("\n✅ No significant resource constraints found")

# Performance degradation analysis
if bottlenecks['performance_degradation']:
    print("\n📉 Performance degradation detected:")
    for degradation in bottlenecks['performance_degradation']:
        print(f"\n   {degradation['metric']}:")
        print(f"     Early performance: {degradation['early_avg']:.2f}")
        print(f"     Late performance: {degradation['late_avg']:.2f}")
        print(f"     Degradation: {degradation['degradation_pct']:.1f}%")
else:
    print("\n✅ No significant performance degradation detected")

# Optimization recommendations
if bottlenecks['recommendations']:
    print("\n💡 Optimization recommendations:")
    for i, recommendation in enumerate(bottlenecks['recommendations'], 1):
        print(f"   {i}. {recommendation}")
else:
    print("\n✅ The system is running well; no specific recommendations")

print("\n" + "="*80)
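
The critical periods in the report come from scanning a boolean mask for runs of consecutive `True` values. The same idea can be expressed without an explicit loop over every sample. The sketch below is an alternative formulation, not the analyzer's own helper: the name `consecutive_true_runs` is hypothetical, and unlike `_find_consecutive_periods` it reports the last sample inside each run as the end rather than the first sample after it.

```python
import pandas as pd


def consecutive_true_runs(mask: pd.Series) -> pd.DataFrame:
    """Return one row per run of consecutive True values in `mask`."""
    mask = mask.astype(bool)
    # The run id increases every time the mask value changes
    run_id = mask.ne(mask.shift(fill_value=False)).cumsum()
    rows = []
    for _, run in mask[mask].groupby(run_id[mask]):
        rows.append({
            "start": run.index[0],
            "end": run.index[-1],
            "samples": len(run),
        })
    return pd.DataFrame(rows, columns=["start", "end", "samples"])


index = pd.date_range("2024-01-01", periods=6, freq="30s")
usage = pd.Series([50, 96, 97, 60, 99, 40], index=index)

runs = consecutive_true_runs(usage > 95)
print(runs)
```

Counting samples per run also makes it easy to ignore single-sample spikes, for example by keeping only rows where `samples` is at least 2.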

## 5. Advanced Time Series Analysis

In [None]:
class TimeSeriesAnalyzer:
    """Time series analyzer."""
    
    def __init__(self, data: pd.DataFrame):
        self.data = data
    
    def trend_analysis(self, metric: str, window_size: int = 20) -> Dict[str, Any]:
        """Trend analysis."""
        if metric not in self.data.columns:
            return {}
        
        series = self.data[metric].dropna()
        if len(series) < window_size:
            return {}
        
        # Moving average
        rolling_mean = series.rolling(window=window_size).mean()
        
        # Linear trend fit
        x = np.arange(len(series))
        slope, intercept, r_value, p_value, std_err = stats.linregress(x, series)
        
        # Classify trend strength by |r|
        trend_strength = abs(r_value)
        if trend_strength > 0.7:
            trend_type = "strong " + ("upward" if slope > 0 else "downward")
        elif trend_strength > 0.3:
            trend_type = "moderate " + ("upward" if slope > 0 else "downward")
        else:
            trend_type = "no clear trend"
        
        return {
            'slope': slope,
            'r_squared': r_value**2,
            'p_value': p_value,
            'trend_type': trend_type,
            'rolling_mean': rolling_mean,
            'trend_line': slope * x + intercept
        }
    
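    def seasonality_detection(self, metric: str, period: Optional[int] = None) -> Dict[str, Any]:
        """Seasonality detection via additive decomposition."""
        if not STATSMODELS_AVAILABLE or metric not in self.data.columns:
            return {}
        
        series = self.data[metric].dropna()
        if len(series) < 100:  # Need enough data points
            return {}
        
        try:
            # Fall back to a default period when none is given
            if period is None:
                # Assumes minute-level sampling and tries an hourly cycle (60 samples),
                # capped at a quarter of the series length
                period = min(60, len(series) // 4)
            
            # Seasonal decomposition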
            decomposition = seasonal_decompose(
                series, 
                model='additive', 
                period=period,
                extrapolate_trend='freq'
            )
            
            # Seasonal strength: std of the seasonal component relative to the series std
            seasonal_strength = np.std(decomposition.seasonal) / np.std(series)
            
            return {
                'has_seasonality': seasonal_strength > 0.1,
                'seasonal_strength': seasonal_strength,
                'period': period,
                'trend': decomposition.trend,
                'seasonal': decomposition.seasonal,
                'residual': decomposition.resid
            }
        
        except Exception as e:
            print(f"Seasonality analysis failed: {e}")
            return {}
    
    def anomaly_detection_advanced(self, metric: str, contamination: float = 0.1) -> Dict[str, Any]:
        """Advanced anomaly detection (Isolation Forest combined with z-scores)."""
        if metric not in self.data.columns:
            return {}
        
        series = self.data[metric].dropna()
        if len(series) < 20:
            return {}
        
        # Prepare the data as a single-feature matrix
        X = series.values.reshape(-1, 1)
        
        # Isolation Forest anomaly detection
        iso_forest = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_estimators=100
        )
        
        anomaly_labels = iso_forest.fit_predict(X)
        anomaly_scores = iso_forest.decision_function(X)
        
        # Statistical anomaly detection (z-score)
        z_scores = np.abs(stats.zscore(series))
        z_anomalies = z_scores > 3
        
        # Combine both detectors: a point flagged by either one counts as an anomaly
        anomalies = (anomaly_labels == -1) | z_anomalies
        
        anomaly_indices = series.index[anomalies]
        anomaly_values = series[anomalies]
        
        return {
            'anomaly_count': anomalies.sum(),
            'anomaly_rate': anomalies.mean(),
            'anomaly_timestamps': anomaly_indices,
            'anomaly_values': anomaly_values,
            'anomaly_scores': anomaly_scores,
            'z_scores': z_scores
        }
    
    def correlation_analysis(self, metrics: List[str]) -> Dict[str, Any]:
        """Correlation analysis."""
        available_metrics = [m for m in metrics if m in self.data.columns]
        
        if len(available_metrics) < 2:
            return {}
        
        # Compute the correlation matrix
        correlation_matrix = self.data[available_metrics].corr()
        
        # Collect strong and moderate correlations
        strong_correlations = []
        for i, metric1 in enumerate(available_metrics):
            for j, metric2 in enumerate(available_metrics[i+1:], i+1):
                corr_value = correlation_matrix.loc[metric1, metric2]
                if abs(corr_value) > 0.7:  # Strong correlation threshold
                    strong_correlations.append({
                        'metric1': metric1,
                        'metric2': metric2,
                        'correlation': corr_value,
                        'strength': 'strong'
                    })
                elif abs(corr_value) > 0.4:  # Moderate correlation threshold
                    strong_correlations.append({
                        'metric1': metric1,
                        'metric2': metric2,
                        'correlation': corr_value,
                        'strength': 'moderate'
                    })
        
        return {
            'correlation_matrix': correlation_matrix,
            'strong_correlations': strong_correlations
        }

# Initialize the time series analyzer
ts_analyzer = TimeSeriesAnalyzer(analyzer.data)
print("✅ Time series analyzer initialized")

In [None]:
# Run the time series analysis
print("\n" + "="*80)
print("📈 In-Depth Time Series Analysis")
print("="*80)

# Select key metrics to analyze
key_metrics = ['cpu_percent', 'memory_percent', 'gpu_utilization', 'system_load_index']
available_metrics = [m for m in key_metrics if m in analyzer.data.columns]

# Trend analysis
print("\n📊 Trend analysis results:")
for metric in available_metrics:
    trend_result = ts_analyzer.trend_analysis(metric)
    if trend_result:
        print(f"\n   {metric}:")
        print(f"     Trend type: {trend_result['trend_type']}")
        print(f"     Trend slope: {trend_result['slope']:.4f} units per sample")
        print(f"     Coefficient of determination (R²): {trend_result['r_squared']:.3f}")
        print(f"     Statistical significance (p-value): {trend_result['p_value']:.4f}")

# Anomaly detection
print("\n🔍 Advanced anomaly detection:")
total_anomalies = 0
for metric in available_metrics:
    anomaly_result = ts_analyzer.anomaly_detection_advanced(metric)
    if anomaly_result:
        count = anomaly_result['anomaly_count']
        rate = anomaly_result['anomaly_rate'] * 100
        total_anomalies += count
        
        print(f"\n   {metric}:")
        print(f"     Anomaly count: {count}")
        print(f"     Anomaly rate: {rate:.1f}%")
        
        if count > 0:
            # anomaly_scores covers every sample, while anomaly_timestamps holds only the
            # flagged ones, so align the scores with the series index before selecting.
            # The lowest Isolation Forest score marks the most anomalous point.
            scores = pd.Series(anomaly_result['anomaly_scores'],
                               index=analyzer.data[metric].dropna().index)
            worst_timestamp = scores.loc[anomaly_result['anomaly_timestamps']].idxmin()
            worst_value = anomaly_result['anomaly_values'].loc[worst_timestamp]
            print(f"     Most severe anomaly: {worst_timestamp.strftime('%H:%M:%S')} (value: {worst_value:.1f})")

print(f"\n   Total anomalies: {total_anomalies}")

# Correlation analysis
correlation_result = ts_analyzer.correlation_analysis(available_metrics)
if correlation_result and correlation_result['strong_correlations']:
    print("\n🔗 Strong and moderate correlations:")
    for corr in correlation_result['strong_correlations']:
        print(f"   {corr['metric1']} ↔ {corr['metric2']}: {corr['correlation']:.3f} ({corr['strength']})")
else:
    print("\n   No significant correlations found")

print("\n" + "="*80)
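
The trend labels in the report come from a straight-line fit whose correlation coefficient is compared against 0.7 and 0.3. The sketch below reproduces that classification with NumPy only, on two deterministic series. For a simple linear fit, `np.polyfit` and `np.corrcoef` give the same slope and r value that `stats.linregress` reports. The function name `classify_trend` is an assumption made for this example.

```python
import numpy as np


def classify_trend(values, strong: float = 0.7, moderate: float = 0.3) -> dict:
    """Fit a straight line and label the trend by the size of |r|."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation with the sample position
    direction = "upward" if slope > 0 else "downward"
    if abs(r) > strong:
        label = f"strong {direction}"
    elif abs(r) > moderate:
        label = f"moderate {direction}"
    else:
        label = "no clear trend"
    return {"slope": slope, "r_squared": r ** 2, "trend_type": label}


ramp = 40 + 0.5 * np.arange(100)                               # steady climb
zigzag = np.where(np.arange(100) % 2 == 0, 1.0, -1.0)          # alternates, no drift

ramp_result = classify_trend(ramp)
zigzag_result = classify_trend(zigzag)
print(ramp_result["trend_type"], zigzag_result["trend_type"])
```

Because the label depends on |r| rather than on the slope, a small but very consistent drift is classified as a strong trend, so the slope should be read alongside the label.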

## 6. Advanced Performance Visualization

In [None]:
def create_comprehensive_performance_dashboard():
    """Create the comprehensive performance analysis dashboard."""
    fig, axes = plt.subplots(3, 2, figsize=(20, 16))
    fig.suptitle('vLLM Comprehensive Performance Dashboard', fontsize=18, fontweight='bold')
    
    # 1. System resource usage trends
    ax1 = axes[0, 0]
    for metric in ['cpu_percent', 'memory_percent', 'gpu_utilization']:
        if metric in analyzer.data.columns:
            ax1.plot(analyzer.data.index, analyzer.data[metric], 
                    label=metric.replace('_', ' ').title(), alpha=0.8, linewidth=2)
    
    ax1.set_title('System Resource Usage Trends', fontweight='bold', fontsize=14)
    ax1.set_ylabel('Utilization (%)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 100)
    
    # 2. Distribution of resource metrics
    ax2 = axes[0, 1]
    performance_metrics = ['cpu_percent', 'memory_percent', 'gpu_utilization']
    available_perf_metrics = [m for m in performance_metrics if m in analyzer.data.columns]
    
    if available_perf_metrics:
        data_for_box = [analyzer.data[metric].dropna() for metric in available_perf_metrics]
        labels = [m.replace('_', ' ').title() for m in available_perf_metrics]
        
        box_plot = ax2.boxplot(data_for_box, labels=labels, patch_artist=True)
        colors = ['lightblue', 'lightgreen', 'lightcoral']
        for patch, color in zip(box_plot['boxes'], colors[:len(box_plot['boxes'])]):
            patch.set_facecolor(color)
    
    ax2.set_title('Resource Metric Distribution', fontweight='bold', fontsize=14)
    ax2.set_ylabel('Utilization (%)')
    ax2.grid(True, alpha=0.3)
    
    # 3. vLLM request processing performance
    ax3 = axes[1, 0]
    if 'qps' in analyzer.data.columns:
        qps_data = analyzer.data['qps'].rolling(window=10).mean()
        ax3.plot(analyzer.data.index, qps_data, color='purple', linewidth=2, label='QPS (10-point avg)')
        ax3.fill_between(analyzer.data.index, qps_data, alpha=0.3, color='purple')
    
    if 'vllm_num_requests_running' in analyzer.data.columns:
        ax3_twin = ax3.twinx()
        ax3_twin.plot(analyzer.data.index, analyzer.data['vllm_num_requests_running'], 
                     color='orange', linewidth=2, alpha=0.7, label='Running Requests')
        ax3_twin.set_ylabel('Active Requests', color='orange')
        ax3_twin.tick_params(axis='y', labelcolor='orange')
    
    ax3.set_title('vLLM Request Processing Performance', fontweight='bold', fontsize=14)
    ax3.set_ylabel('QPS')
    ax3.legend(loc='upper left')
    ax3.grid(True, alpha=0.3)
    
    # 4. System load index and anomalies
    ax4 = axes[1, 1]
    if 'system_load_index' in analyzer.data.columns:
        load_data = analyzer.data['system_load_index']
        ax4.plot(analyzer.data.index, load_data, color='red', linewidth=2, alpha=0.8)
        
        # Mark anomalies
        anomaly_result = ts_analyzer.anomaly_detection_advanced('system_load_index')
        if anomaly_result and len(anomaly_result['anomaly_timestamps']) > 0:
            ax4.scatter(anomaly_result['anomaly_timestamps'], 
                       anomaly_result['anomaly_values'],
                       color='red', s=50, alpha=0.8, marker='x', label=f"Anomalies ({len(anomaly_result['anomaly_timestamps'])})")
        
        # Add warning and critical lines
        ax4.axhline(y=80, color='orange', linestyle='--', alpha=0.7, label='Warning (80%)')
        ax4.axhline(y=95, color='red', linestyle='--', alpha=0.7, label='Critical (95%)')
        # Build the legend once, after every labeled artist has been added
        ax4.legend()
    
    ax4.set_title('System Load Index and Anomaly Detection', fontweight='bold', fontsize=14)
    ax4.set_ylabel('Load Index (%)')
    ax4.grid(True, alpha=0.3)
    
    # 5. Correlation heatmap
    ax5 = axes[2, 0]
    correlation_metrics = ['cpu_percent', 'memory_percent', 'gpu_utilization', 'gpu_memory_used']
    available_corr_metrics = [m for m in correlation_metrics if m in analyzer.data.columns]
    
    if len(available_corr_metrics) > 1:
        corr_matrix = analyzer.data[available_corr_metrics].corr()
        im = ax5.imshow(corr_matrix, cmap='RdYlBu_r', aspect='auto', vmin=-1, vmax=1)
        
        # Add value labels
        for i in range(len(available_corr_metrics)):
            for j in range(len(available_corr_metrics)):
                text = ax5.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}',
                               ha="center", va="center", color="black", fontweight='bold')
        
        ax5.set_xticks(range(len(available_corr_metrics)))
        ax5.set_yticks(range(len(available_corr_metrics)))
        ax5.set_xticklabels([m.replace('_', '\n') for m in available_corr_metrics], rotation=45)
        ax5.set_yticklabels([m.replace('_', '\n') for m in available_corr_metrics])
        
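        # Add a colorbar
        cbar = plt.colorbar(im, ax=ax5, shrink=0.8)
        cbar.set_label('Correlation coefficient', rotation=270, labelpad=20)
    
    ax5.set_title('Metric Correlation Analysis', fontweight='bold', fontsize=14)
    
    # 6. Performance summary radar chart
    # A radar chart needs polar axes, so swap the Cartesian subplot for a polar one
    axes[2, 1].remove()
    ax6 = fig.add_subplot(3, 2, 6, projection='polar')
    
    # Compute performance scores (0-100)
    scores = {}
    labels = []
    values = []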
    
    if 'success_rate' in analyzer.data.columns:
        scores['Success Rate'] = analyzer.data['success_rate'].mean()
    
    # CPU efficiency (100 - mean utilization)
    if 'cpu_percent' in analyzer.data.columns:
        scores['CPU Efficiency'] = max(0, 100 - analyzer.data['cpu_percent'].mean())
    
    # Memory efficiency
    if 'memory_percent' in analyzer.data.columns:
        scores['Memory Efficiency'] = max(0, 100 - analyzer.data['memory_percent'].mean())
    
    # GPU 效率
    if 'gpu_utilization' in analyzer.data.columns:
        scores['GPU Efficiency'] = max(0, 100 - analyzer.data['gpu_utilization'].mean())
    
    # 穩定性 (100 - 變異係數)
    if 'system_load_index' in analyzer.data.columns:
        cv = analyzer.data['system_load_index'].std() / analyzer.data['system_load_index'].mean()
        scores['Stability'] = max(0, 100 - cv * 100)
    
    if scores:
        labels = list(scores.keys())
        values = list(scores.values())
        
        # 雷達圖
        angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
        values += values[:1]  # 閉合圖形
        angles += angles[:1]
        
        ax6.plot(angles, values, 'o-', linewidth=2, label='Performance Score')
        ax6.fill(angles, values, alpha=0.25)
        ax6.set_xticks(angles[:-1])
        ax6.set_xticklabels(labels)
        ax6.set_ylim(0, 100)
        ax6.grid(True)
        
        # 添加分數標籤
        for angle, value, label in zip(angles[:-1], values[:-1], labels):
            ax6.text(angle, value + 5, f'{value:.0f}', ha='center', va='center', fontweight='bold')
    
    ax6.set_title('性能評分雷達圖', fontweight='bold', fontsize=14)
    
    plt.tight_layout()
    plt.show()

# Generate the comprehensive performance dashboard
print("📊 Generating the comprehensive performance dashboard...")
create_comprehensive_performance_dashboard()
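
The stability score on the radar chart is 100 minus the coefficient of variation (standard deviation divided by mean) of the load index, expressed in percent. The sketch below reproduces that calculation on two made-up series using only the standard library. `stability_score` is a hypothetical helper, not part of the notebook; like pandas' `std()`, `statistics.stdev` uses the sample standard deviation.

```python
from statistics import mean, stdev


def stability_score(samples):
    """Return 100 - CV * 100 clamped to [0, 100], or None when the CV is undefined."""
    if len(samples) < 2:
        return None
    average = mean(samples)
    if average <= 0:
        return None  # CV is undefined for a zero (or negative) mean
    cv = stdev(samples) / average
    return max(0.0, min(100.0, 100.0 - cv * 100.0))


steady = [50, 52, 49, 51, 50]   # made-up load index samples, small swings
bursty = [10, 90, 15, 95, 12]   # made-up load index samples, large swings

print(f"steady: {stability_score(steady):.1f}")
print(f"bursty: {stability_score(bursty):.1f}")
```

A steady series scores close to 100, while a series that swings between idle and saturated scores close to 0, even though both may have a similar mean.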

## 7. Load Testing and Baseline Evaluation

In [None]:
class LoadTestAnalyzer:
    """Load test analyzer."""
    
    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.baseline_metrics = self._calculate_baseline()
    
    def _calculate_baseline(self) -> Dict[str, float]:
        """Compute baseline metrics from the first 25% of the samples."""
        # Keep at least one row so a very short data set does not produce an all-NaN baseline
        baseline_size = max(1, len(self.data) // 4)
        baseline_data = self.data.iloc[:baseline_size]
        
        # select_dtypes covers every numeric dtype (int32, float32, ...),
        # not only int64 and float64
        return baseline_data.select_dtypes(include='number').mean().to_dict()
    
    def analyze_load_patterns(self) -> Dict[str, Any]:
        """Analyze load patterns."""
        patterns = {
            'peak_periods': [],
            'low_periods': [],
            'load_distribution': {},
            'performance_under_load': {}
        }
        
        if 'system_load_index' in self.data.columns:
            load_series = self.data['system_load_index']
            
            # Thresholds are relative (80th and 20th percentile of this data set), so
            # roughly 20% of the samples fall into each band however busy the system
            # was. "High load" here means "high relative to this run".
            high_load_threshold = load_series.quantile(0.8)
            low_load_threshold = load_series.quantile(0.2)
            
            high_load_mask = load_series > high_load_threshold
            low_load_mask = load_series < low_load_threshold
            
            # Find consecutive high-load and low-load periods
            patterns['peak_periods'] = self._find_consecutive_periods(high_load_mask)
            patterns['low_periods'] = self._find_consecutive_periods(low_load_mask)
            
            # Load distribution statistics
            patterns['load_distribution'] = {
                'mean': load_series.mean(),
                'median': load_series.median(),
                'std': load_series.std(),
                'min': load_series.min(),
                'max': load_series.max(),
                'q25': load_series.quantile(0.25),
                'q75': load_series.quantile(0.75),
                'high_load_percentage': (high_load_mask.sum() / len(load_series)) * 100,
                'low_load_percentage': (low_load_mask.sum() / len(load_series)) * 100
            }
            
            # Performance under each load band
            high_load_data = self.data[high_load_mask]
            low_load_data = self.data[low_load_mask]
            medium_load_data = self.data[~(high_load_mask | low_load_mask)]
            
            for load_type, data_subset in [('high', high_load_data), ('medium', medium_load_data), ('low', low_load_data)]:
                if len(data_subset) > 0:
                    # A missing column is reported as 0, which is a placeholder, not a measurement
                    patterns['performance_under_load'][load_type] = {
                        'avg_cpu': data_subset['cpu_percent'].mean() if 'cpu_percent' in data_subset.columns else 0,
                        'avg_memory': data_subset['memory_percent'].mean() if 'memory_percent' in data_subset.columns else 0,
                        'avg_gpu': data_subset['gpu_utilization'].mean() if 'gpu_utilization' in data_subset.columns else 0,
                        'avg_qps': data_subset['qps'].mean() if 'qps' in data_subset.columns else 0,
                        'data_points': len(data_subset)
                    }
        
        return patterns
    
    def _find_consecutive_periods(self, mask: pd.Series) -> List[Dict]:
        """Find consecutive runs of True values in a boolean mask.
        
        The mask must be indexed by timestamps (a DatetimeIndex), because the
        duration is computed by subtracting index values.
        """
        periods = []
        start = None
        
        for timestamp, value in mask.items():
            if value and start is None:
                start = timestamp
            elif not value and start is not None:
                # The period ends at the first sample that falls outside the run
                end = timestamp
                duration = (end - start).total_seconds()
                periods.append({
                    'start': start,
                    'end': end,
                    'duration_seconds': duration
                })
                start = None
        
        # Close a run that is still open at the end of the data
        if start is not None:
            end = mask.index[-1]
            duration = (end - start).total_seconds()
            periods.append({
                'start': start,
                'end': end,
                'duration_seconds': duration
            })
        
        return periods
    
    def capacity_planning_analysis(self) -> Dict[str, Any]:
        """Capacity planning analysis."""
        capacity_analysis = {
            'current_utilization': {},
            'bottleneck_analysis': [],
            'scaling_recommendations': []
        }
        
        # Current resource usage. Every metric is treated as a percentage (0-100).
        # If gpu_memory_used is recorded in MB or GB, convert it to a percentage
        # first; otherwise its headroom and the thresholds below are meaningless.
        resource_metrics = ['cpu_percent', 'memory_percent', 'gpu_utilization', 'gpu_memory_used']
        
        for metric in resource_metrics:
            if metric in self.data.columns:
                series = self.data[metric]
                # Compare the first and last 20 samples; a difference within
                # 1 percentage point is treated as noise
                trend_delta = series.tail(20).mean() - series.head(20).mean()
                if trend_delta > 1.0:
                    trend = 'increasing'
                elif trend_delta < -1.0:
                    trend = 'decreasing'
                else:
                    trend = 'stable'
                
                capacity_analysis['current_utilization'][metric] = {
                    'average': series.mean(),
                    'peak': series.max(),
                    'p95': series.quantile(0.95),
                    'headroom': 100 - series.quantile(0.95),  # remaining capacity at P95
                    'utilization_trend': trend
                }
        
        # Bottleneck analysis: P95 above 90% is critical, above 80% is a warning
        bottlenecks = []
        for metric, utilization in capacity_analysis['current_utilization'].items():
            if utilization['p95'] > 90:
                bottlenecks.append({
                    'resource': metric,
                    'severity': 'critical',
                    'p95_usage': utilization['p95'],
                    'headroom': utilization['headroom']
                })
            elif utilization['p95'] > 80:
                bottlenecks.append({
                    'resource': metric,
                    'severity': 'warning',
                    'p95_usage': utilization['p95'],
                    'headroom': utilization['headroom']
                })
        
        capacity_analysis['bottleneck_analysis'] = bottlenecks
        
        # Scaling recommendations
        recommendations = []
        
        for bottleneck in bottlenecks:
            resource = bottleneck['resource']
            if resource == 'cpu_percent':
                if bottleneck['severity'] == 'critical':
                    recommendations.append("Add CPU cores immediately or optimize CPU-intensive work")
                else:
                    recommendations.append("Consider adding CPU resources within the next 1-2 months")
            
            elif resource == 'memory_percent':
                if bottleneck['severity'] == 'critical':
                    recommendations.append("Add memory immediately or reduce memory usage")
                else:
                    recommendations.append("Plan a memory upgrade to cover future demand")
            
            elif resource in ['gpu_utilization', 'gpu_memory_used']:
                if bottleneck['severity'] == 'critical':
                    recommendations.append("Add GPU resources immediately or shard the model across GPUs")
                else:
                    recommendations.append("Prepare a GPU scaling plan for load growth")
        
        if not recommendations:
            recommendations.append("Current resources are sufficient; keep monitoring")
        
        capacity_analysis['scaling_recommendations'] = recommendations
        
        return capacity_analysis

# Initialize the load test analyzer
load_analyzer = LoadTestAnalyzer(analyzer.data)
print("✅ Load test analyzer initialized")
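
`_find_consecutive_periods` walks the mask one sample at a time. The same run detection can be sketched with `itertools.groupby`, shown here on a plain list of made-up load values and a hypothetical absolute threshold of 80. `find_runs` is an illustrative helper, not part of the notebook. Unlike the method above, it reports the last sample inside each run as the end, not the first sample after it.

```python
from itertools import groupby


def find_runs(flags):
    """Return (start_index, end_index) for each run of True values, end inclusive."""
    runs = []
    position = 0
    for is_high, group in groupby(flags):
        length = sum(1 for _ in group)
        if is_high:
            runs.append((position, position + length - 1))
        position += length
    return runs


load = [35, 82, 91, 88, 40, 42, 95, 30]   # made-up load index samples
threshold = 80                            # hypothetical absolute threshold
flags = [value > threshold for value in load]

print(find_runs(flags))  # [(1, 3), (6, 6)]
```

With an absolute threshold the number of flagged samples reflects how busy the system was, whereas the percentile thresholds used by `analyze_load_patterns` always flag about 20% of the samples.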

In [None]:
# Run the load pattern analysis
load_patterns = load_analyzer.analyze_load_patterns()

print("\n" + "="*80)
print("📈 Load Pattern Analysis Report")
print("="*80)

# Load distribution statistics
if 'load_distribution' in load_patterns and load_patterns['load_distribution']:
    dist = load_patterns['load_distribution']
    print("\n🔍 Load distribution:")
    print(f"   Mean load: {dist['mean']:.1f}%")
    print(f"   Median load: {dist['median']:.1f}%")
    print(f"   Load range: {dist['min']:.1f}% - {dist['max']:.1f}%")
    print(f"   Load variability: {dist['std']:.1f}% (standard deviation)")
    print(f"   Share of time under high load: {dist['high_load_percentage']:.1f}%")
    print(f"   Share of time under low load: {dist['low_load_percentage']:.1f}%")

# Peak periods
if load_patterns['peak_periods']:
    print(f"\n⚡ Detected {len(load_patterns['peak_periods'])} high-load periods:")
    for i, period in enumerate(load_patterns['peak_periods'][:5], 1):  # show the first 5
        print(f"   Period {i}: {period['start'].strftime('%H:%M:%S')} - {period['end'].strftime('%H:%M:%S')} "
              f"(lasting {period['duration_seconds']:.0f} s)")
else:
    print("\n✅ No significant high-load periods detected")

# Performance under each load band
if 'performance_under_load' in load_patterns:
    print("\n🎯 Performance by load band:")
    for load_type, perf in load_patterns['performance_under_load'].items():
        if perf['data_points'] > 0:
            print(f"\n   {load_type.upper()} load ({perf['data_points']} data points):")
            print(f"     Mean CPU: {perf['avg_cpu']:.1f}%")
            print(f"     Mean memory: {perf['avg_memory']:.1f}%")
            print(f"     Mean GPU: {perf['avg_gpu']:.1f}%")
            print(f"     Mean QPS: {perf['avg_qps']:.2f}")

print("\n" + "="*80)

In [None]:
# Run the capacity planning analysis
capacity_analysis = load_analyzer.capacity_planning_analysis()

print("\n" + "="*80)
print("🏗️  Capacity Planning Report")
print("="*80)

# Current resource usage
print("\n📊 Current resource usage:")
print(f"{'Resource':<18} {'Average':<10} {'Peak':<10} {'P95':<10} {'Headroom':<10} {'Trend':<10}")
print("-" * 73)

for resource, utilization in capacity_analysis['current_utilization'].items():
    print(f"{resource:<18} {utilization['average']:<10.1f} {utilization['peak']:<10.1f} "
          f"{utilization['p95']:<10.1f} {utilization['headroom']:<10.1f} {utilization['utilization_trend']:<10}")

# Bottleneck analysis
if capacity_analysis['bottleneck_analysis']:
    print("\n⚠️  Resource bottlenecks:")
    for bottleneck in capacity_analysis['bottleneck_analysis']:
        severity_icon = "🔴" if bottleneck['severity'] == 'critical' else "🟡"
        print(f"   {severity_icon} {bottleneck['resource']}: {bottleneck['severity'].upper()} "
              f"(P95: {bottleneck['p95_usage']:.1f}%, headroom: {bottleneck['headroom']:.1f}%)")
else:
    print("\n✅ No resource bottlenecks found")

# Scaling recommendations
print("\n💡 Capacity scaling recommendations:")
for i, recommendation in enumerate(capacity_analysis['scaling_recommendations'], 1):
    print(f"   {i}. {recommendation}")

print("\n" + "="*80)
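
The capacity report states how much headroom is left, but not how quickly it is being used up. As a rough complement, a least-squares line through recent utilization samples gives a first-order estimate of how many more samples it takes to reach a limit. This is a sketch under strong assumptions: growth is taken to be linear, the sample values are made up, and `linear_fit` and `samples_until` are hypothetical helpers that are not part of the notebook.

```python
def linear_fit(ys):
    """Least-squares fit of ys against sample index 0..n-1; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def samples_until(ys, limit=100.0):
    """Samples left until the fitted line reaches `limit`; None if usage is flat or falling."""
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    current = intercept + slope * (len(ys) - 1)  # fitted value at the latest sample
    return max(0.0, (limit - current) / slope)


usage = [60.0, 62.0, 64.0, 66.0, 68.0]   # made-up utilization samples (%)
print(samples_until(usage, limit=90.0))  # 11.0
```

Multiplying the result by the sampling interval converts it to wall-clock time. Real load is rarely linear, so the figure is best read as an order of magnitude, not a forecast.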

## 8. Generating Performance Optimization Recommendations

In [None]:
class PerformanceOptimizationAdvisor:
    """Generates performance optimization recommendations."""
    
    def __init__(self, analyzer, bottlenecks, load_patterns, capacity_analysis):
        self.analyzer = analyzer
        self.bottlenecks = bottlenecks
        self.load_patterns = load_patterns
        self.capacity_analysis = capacity_analysis
        
    def generate_comprehensive_recommendations(self) -> Dict[str, Any]:
        """Generate the full set of optimization recommendations."""
        recommendations = {
            'immediate_actions': [],
            'short_term_optimizations': [],
            'long_term_planning': [],
            'monitoring_improvements': [],
            'performance_kpis': {},
            'risk_assessment': {}
        }
        
        # Immediate actions
        recommendations['immediate_actions'] = self._generate_immediate_actions()
        
        # Short-term optimizations
        recommendations['short_term_optimizations'] = self._generate_short_term_optimizations()
        
        # Long-term planning
        recommendations['long_term_planning'] = self._generate_long_term_planning()
        
        # Monitoring improvements
        recommendations['monitoring_improvements'] = self._generate_monitoring_improvements()
        
        # Suggested performance KPIs
        recommendations['performance_kpis'] = self._generate_performance_kpis()
        
        # Risk assessment
        recommendations['risk_assessment'] = self._assess_risks()
        
        return recommendations
    
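    def _generate_immediate_actions(self) -> List[Dict[str, str]]:
        """Generate immediate actions."""
        actions = []
        
        # Critical periods
        if self.bottlenecks.get('critical_periods'):
            actions.append({
                'action': 'Set up automatic alerts',
                'description': 'Add real-time alerting for periods of critical resource usage',
                'priority': 'high',
                'estimated_effort': '1-2 hours'
            })
        
        # Resource constraints
        critical_resources = [c for c in self.bottlenecks.get('resource_constraints', [])
                            if c['high_usage_percentage'] > 80]
        
        if critical_resources:
            for resource in critical_resources:
                if resource['metric'] == 'gpu_memory_used':
                    # Gradient accumulation is a training technique and does not apply
                    # to inference; for vLLM the relevant knobs are its engine arguments.
                    actions.append({
                        'action': 'Adjust the GPU memory configuration',
                        'description': 'Lower --gpu-memory-utilization, --max-num-seqs or '
                                       '--max-model-len in vLLM to reduce GPU memory pressure',
                        'priority': 'high',
                        'estimated_effort': '30 minutes'
                    })
                elif resource['metric'] == 'cpu_percent':
                    actions.append({
                        'action': 'Optimize CPU usage',
                        'description': 'Review CPU-intensive operations and consider parallelizing them',
                        'priority': 'medium',
                        'estimated_effort': '2-4 hours'
                    })
        
        # Fallback when nothing urgent was found
        if not actions:
            actions.append({
                'action': 'Validate the current configuration',
                'description': 'The system is healthy; check that the current configuration follows best practices',
                'priority': 'low',
                'estimated_effort': '1 hour'
            })
        
        return actions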
    
    def _generate_short_term_optimizations(self) -> List[Dict[str, str]]:
        """Generate short-term optimizations.
        
        The expected_benefit figures are rough rules of thumb, not measurements
        taken from this data set.
        """
        optimizations = []
        
        # Based on load patterns
        if self.load_patterns.get('peak_periods'):
            optimizations.append({
                'optimization': 'Introduce load balancing',
                'description': 'Spread requests or adjust serving capacity automatically during high-load periods',
                'expected_benefit': '20-30% lower response latency',
                'timeline': '1-2 weeks'
            })
        
        # Memory
        if 'memory_percent' in self.analyzer.data.columns:
            avg_memory = self.analyzer.data['memory_percent'].mean()
            if avg_memory > 70:
                optimizations.append({
                    'optimization': 'Reduce memory usage',
                    'description': 'Introduce memory pooling and tune garbage collection',
                    'expected_benefit': '10-15% lower memory usage',
                    'timeline': '1 week'
                })
        
        # KV cache
        optimizations.append({
            'optimization': 'Tune the KV cache',
            'description': 'Adjust KV cache size and policy to improve inference efficiency',
            'expected_benefit': '15-25% higher throughput',
            'timeline': '3-5 days'
        })
        
        # Concurrency
        if 'qps' in self.analyzer.data.columns:
            avg_qps = self.analyzer.data['qps'].mean()
            if avg_qps < 10:  # assumed target QPS
                optimizations.append({
                    'optimization': 'Tune concurrency',
                    'description': 'Adjust the concurrency level and batching strategy',
                    'expected_benefit': '30-50% higher throughput',
                    'timeline': '1 week'
                })
        
        return optimizations
    
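    def _generate_long_term_planning(self) -> List[Dict[str, str]]:
        """Generate long-term planning items.
        
        The 'investment' values stay as the original literals because the report
        cell matches on them: '高等' = high, '中等' = medium, '中等到高等' = medium to high.
        """
        planning = []
        
        # Based on the capacity analysis
        bottlenecks = self.capacity_analysis.get('bottleneck_analysis', [])
        
        if bottlenecks:
            planning.append({
                'plan': 'Infrastructure expansion',
                'description': 'Draw up a hardware expansion plan from the resource bottleneck analysis',
                'timeline': '3-6 months',
                'investment': '中等到高等'
            })
        
        # Architecture upgrade
        planning.append({
            'plan': 'Distributed architecture upgrade',
            'description': 'Move to multi-node distributed inference for better scalability',
            'timeline': '6-12 months',
            'investment': '高等'
        })
        
        # Operations automation
        planning.append({
            'plan': 'DevOps automation',
            'description': 'Automate monitoring, deployment and autoscaling',
            'timeline': '2-4 months',
            'investment': '中等'
        })
        
        # Model optimization
        planning.append({
            'plan': 'Model compression and quantization',
            'description': 'Apply quantization, distillation and pruning to improve inference efficiency',
            'timeline': '4-8 months',
            'investment': '中等'
        })
        
        return planning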
    
    def _generate_monitoring_improvements(self) -> List[Dict[str, str]]:
        """Generate monitoring improvements."""
        improvements = [
            {
                'improvement': 'Add business-level monitoring',
                'description': 'Track business metrics such as model accuracy and user satisfaction',
                'priority': 'high'
            },
            {
                'improvement': 'Introduce distributed tracing',
                'description': 'Trace the request lifecycle with Jaeger or Zipkin',
                'priority': 'medium'
            },
            {
                'improvement': 'Define custom metrics',
                'description': 'Add monitoring for business-specific KPIs',
                'priority': 'medium'
            },
            {
                'improvement': 'Strengthen alerting',
                'description': 'Set up multi-level alerts and automated responses',
                'priority': 'high'
            }
        ]
        
        return improvements
    
    def _generate_performance_kpis(self) -> Dict[str, Dict[str, float]]:
        """Suggest performance KPI targets (fixed defaults, not derived from the data)."""
        kpis = {
            'latency_targets': {
                'p50_response_time': 1.0,  # seconds
                'p95_response_time': 2.0,
                'p99_response_time': 5.0,
                'ttft_target': 0.5
            },
            'throughput_targets': {
                'min_qps': 10,
                'target_qps': 50,
                'peak_qps': 100
            },
            'resource_targets': {
                'max_cpu_usage': 80,  # percent
                'max_memory_usage': 85,
                'max_gpu_usage': 90,
                'target_gpu_utilization': 75
            },
            'reliability_targets': {
                'min_success_rate': 99.0,  # percent
                'max_error_rate': 1.0,
                'target_availability': 99.9
            }
        }
        
        return kpis
    
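    def _assess_risks(self) -> Dict[str, Any]:
        """Assess risks."""
        risks = {
            'high_priority': [],
            'medium_priority': [],
            'low_priority': []
        }
        
        # High risk: critical usage periods
        if self.bottlenecks.get('critical_periods'):
            risks['high_priority'].append({
                'risk': 'System overload',
                'description': 'Periods of critical resource usage were detected and could cause service interruptions',
                'mitigation': 'Introduce autoscaling and load balancing'
            })
        
        # Performance degradation
        if self.bottlenecks.get('performance_degradation'):
            risks['medium_priority'].append({
                'risk': 'Performance degradation',
                'description': 'A downward performance trend was detected and may affect user experience',
                'mitigation': 'Run regular performance benchmarks and optimize'
            })
        
        # Capacity
        critical_bottlenecks = [b for b in self.capacity_analysis.get('bottleneck_analysis', [])
                              if b['severity'] == 'critical']
        
        if critical_bottlenecks:
            risks['high_priority'].append({
                'risk': 'Insufficient capacity',
                'description': 'A key resource is close to its capacity limit',
                'mitigation': 'Prepare an emergency scaling plan'
            })
        
        # With no high or medium risks, add a preventive item
        if not risks['high_priority'] and not risks['medium_priority']:
            risks['low_priority'].append({
                'risk': 'Insufficient monitoring coverage',
                'description': 'The system is running well, but broader monitoring coverage is advisable',
                'mitigation': 'Extend the monitored metrics and alerting'
            })
        
        return risks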

# Initialize the optimization advisor
advisor = PerformanceOptimizationAdvisor(
    analyzer, bottlenecks, load_patterns, capacity_analysis
)

# Generate the recommendations
optimization_recommendations = advisor.generate_comprehensive_recommendations()

print("✅ Performance optimization recommendations generated")
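
The advisor's recommendations are generic. For a vLLM server they usually translate into engine arguments such as `--gpu-memory-utilization`, `--max-num-seqs`, `--max-model-len` and `--enable-prefix-caching`. Note that vLLM preallocates GPU memory up to the `--gpu-memory-utilization` fraction, so a high `gpu_memory_used` reading is expected and is not a bottleneck by itself. The sketch below maps bottleneck metric names to candidate arguments. `suggest_vllm_args`, the chosen values and the model name `my-org/my-model` are all illustrative assumptions, not tuned settings.

```python
def suggest_vllm_args(bottleneck_metrics):
    """Map bottleneck metric names to candidate vLLM engine arguments (illustrative values)."""
    args = []
    if "gpu_memory_used" in bottleneck_metrics:
        # Reserve a smaller share of GPU memory and cap the number of concurrent sequences
        args += ["--gpu-memory-utilization", "0.85", "--max-num-seqs", "128"]
    if "gpu_utilization" in bottleneck_metrics:
        # Reuse cached KV blocks for prompts that share a prefix, which can cut prefill work
        args += ["--enable-prefix-caching"]
    return args


# "my-org/my-model" is a placeholder model name
command = ["vllm", "serve", "my-org/my-model"] + suggest_vllm_args({"gpu_memory_used"})
print(" ".join(command))
```

Whether any of these settings helps depends on the workload, so each change is worth re-measuring against the baseline before it is kept.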

In [None]:
# Display the optimization recommendations
print("\n" + "="*80)
print("🎯 vLLM Performance Optimization Report")
print("="*80)

# Immediate actions
print("\n🚨 Immediate actions:")
for i, action in enumerate(optimization_recommendations['immediate_actions'], 1):
    priority_icon = "🔴" if action['priority'] == 'high' else "🟡" if action['priority'] == 'medium' else "🟢"
    print(f"\n   {i}. {priority_icon} {action['action']}")
    print(f"      Description: {action['description']}")
    print(f"      Priority: {action['priority'].upper()}")
    print(f"      Estimated effort: {action['estimated_effort']}")

# Short-term optimizations
print("\n📈 Short-term optimizations (1-4 weeks):")
for i, optimization in enumerate(optimization_recommendations['short_term_optimizations'], 1):
    print(f"\n   {i}. {optimization['optimization']}")
    print(f"      Description: {optimization['description']}")
    print(f"      Expected benefit: {optimization['expected_benefit']}")
    print(f"      Timeline: {optimization['timeline']}")

# Long-term planning
# The advisor stores investment levels as Chinese literals; map them to an icon and an English label
investment_display = {
    '高等': ("💰", "high"),
    '中等': ("💵", "medium"),
    '中等到高等': ("💴", "medium to high"),
}
print("\n🏗️  Long-term planning (3-12 months):")
for i, plan in enumerate(optimization_recommendations['long_term_planning'], 1):
    investment_icon, investment_label = investment_display.get(
        plan['investment'], ("💴", plan['investment']))
    print(f"\n   {i}. {investment_icon} {plan['plan']}")
    print(f"      Description: {plan['description']}")
    print(f"      Timeline: {plan['timeline']}")
    print(f"      Investment level: {investment_label}")

# Monitoring improvements
print("\n📊 Monitoring improvements:")
for i, improvement in enumerate(optimization_recommendations['monitoring_improvements'], 1):
    priority_icon = "🔴" if improvement['priority'] == 'high' else "🟡" if improvement['priority'] == 'medium' else "🟢"
    print(f"   {i}. {priority_icon} {improvement['improvement']}")
    print(f"      {improvement['description']}")

# Suggested performance KPIs
print("\n🎯 Suggested performance KPI targets:")
kpis = optimization_recommendations['performance_kpis']

print("\n   📊 Latency targets:")
for metric, target in kpis['latency_targets'].items():
    print(f"     {metric}: ≤ {target} s")

print("\n   🚀 Throughput targets:")
for metric, target in kpis['throughput_targets'].items():
    print(f"     {metric}: {target} QPS")

print("\n   💻 Resource usage targets:")
for metric, target in kpis['resource_targets'].items():
    print(f"     {metric}: {target}%")

print("\n   🛡️  Reliability targets:")
for metric, target in kpis['reliability_targets'].items():
    print(f"     {metric}: {target}%")

# Risk assessment
print("\n⚠️  Risk assessment:")
risks = optimization_recommendations['risk_assessment']

for priority in ['high_priority', 'medium_priority', 'low_priority']:
    if risks[priority]:
        priority_name = priority.replace('_', ' ').title()
        priority_icon = "🔴" if priority == 'high_priority' else "🟡" if priority == 'medium_priority' else "🟢"
        print(f"\n   {priority_icon} {priority_name} risks:")
        
        for risk in risks[priority]:
            print(f"     • {risk['risk']}")
            print(f"       {risk['description']}")
            print(f"       Mitigation: {risk['mitigation']}")

print("\n" + "="*80)
print("📋 Summary: based on the current monitoring data, the system " +
      ("needs attention" if risks['high_priority'] else "is running well") +
      ". Start with the immediate actions, then plan the short-term optimizations.")
print("="*80)
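
The KPI targets above are only printed. A natural next step is to compare them with measured values. The sketch below does that for three of the targets. `check_kpis`, the `(limit, kind)` tuple layout and the measured numbers are assumptions made for illustration and are not part of the notebook.

```python
def check_kpis(measured, targets):
    """Compare measured values with targets; each target is (limit, 'max' or 'min')."""
    report = {}
    for name, (limit, kind) in targets.items():
        value = measured.get(name)
        if value is None:
            report[name] = "no data"
        elif kind == "max":
            report[name] = "ok" if value <= limit else "violated"
        else:
            report[name] = "ok" if value >= limit else "violated"
    return report


targets = {
    "p95_response_time": (2.0, "max"),  # seconds, from latency_targets
    "min_qps": (10, "min"),             # from throughput_targets
    "max_gpu_usage": (90, "max"),       # percent, from resource_targets
}
measured = {"p95_response_time": 2.4, "min_qps": 14.0}  # made-up measurements

print(check_kpis(measured, targets))
# {'p95_response_time': 'violated', 'min_qps': 'ok', 'max_gpu_usage': 'no data'}
```

Reporting a missing metric as "no data" instead of "ok" keeps gaps in monitoring coverage visible.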

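## Lab Summary

This lab carried out an end-to-end performance analysis and diagnosis of vLLM and built a reusable performance evaluation framework.

### ✅ Key Results

1. **In-depth performance analysis**
   - Multi-dimensional performance statistics
   - Historical trend and pattern recognition
   - Anomaly detection and root cause analysis

2. **Bottleneck diagnosis**
   - Automatic identification of resource constraints
   - Detection of critical periods
   - Performance degradation analysis

3. **Load pattern analysis**
   - Identification of peak and low periods
   - Performance under different load levels
   - Load distribution statistics

4. **Capacity planning**
   - Current resource utilization analysis
   - Identification of bottleneck resources
   - Scaling recommendations

5. **Optimization recommendations**
   - Tiered action items
   - Short-term and long-term planning
   - Risk assessment and mitigation strategies

### 🎯 Technical Highlights

- **Machine-learning anomaly detection**: anomalies are identified with Isolation Forest
- **Time series analysis**: trend detection, seasonality analysis, correlation assessment
- **Statistical methods**: a range of summary statistics and distribution analyses
- **Visual dashboard**: a multi-panel view of overall performance
- **Automated recommendations**: rule-based optimization advice derived from the data

### 📊 Analysis Coverage

- **System level**: CPU, memory and GPU resource analysis
- **Application level**: vLLM serving performance metrics
- **Business level**: QPS, latency and success rate analysis
- **Operations level**: capacity planning and risk assessment

### 🔧 Tools

- **ComprehensivePerformanceAnalyzer**: comprehensive performance analyzer
- **TimeSeriesAnalyzer**: time series analysis tool
- **LoadTestAnalyzer**: load test analyzer
- **PerformanceOptimizationAdvisor**: optimization recommendation generator

### 📋 Next Steps

Continue with **04-Alerting_and_Optimization.ipynb** to learn how to build an alerting system and automate optimization strategies.

---

**Practical value**:
- Provides a data-driven basis for performance optimization decisions
- Establishes a standardized performance evaluation process
- Supports preventive maintenance and capacity planning
- Reduces the risk of system failures and lowers operating costs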