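# Lab 2.5-02: Real-Time Metrics Collection and Monitoring

## Lab Objectives

This lab takes a close look at collecting vLLM performance metrics in real time, covering:
- A detailed analysis of vLLM's built-in metrics
- Building a high-frequency real-time data stream
- Advanced metric aggregation and computation
- Building a dynamic monitoring dashboard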

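## Core Monitoring Metrics

### 1. Request-level Metrics
- **TTFT (Time to First Token)**: time from receiving the request to emitting the first token
- **TPOT (Time Per Output Token)**: average time taken per output token
- **Request Duration**: total end-to-end time of a request
- **Queue Time**: time a request spends waiting in the queue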
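
The four request-level metrics above can all be derived from a handful of timestamps. The sketch below is illustrative only: `RequestTiming` and `compute_request_timing` are hypothetical names, and it assumes TPOT is averaged over the tokens after the first one, which is a common convention but not the only one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    queue_time: float  # seconds spent waiting before scheduling
    ttft: float        # seconds from arrival to the first token
    tpot: float        # mean seconds per token after the first
    duration: float    # seconds from arrival to the last token


def compute_request_timing(arrival: float, scheduled: float,
                           token_times: List[float]) -> RequestTiming:
    """Derive request-level metrics from raw timestamps (all in seconds)."""
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    first, last = token_times[0], token_times[-1]
    n_tokens = len(token_times)
    # Assumption: TPOT excludes the first token, whose cost is counted in TTFT.
    tpot = (last - first) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return RequestTiming(
        queue_time=scheduled - arrival,
        ttft=first - arrival,
        tpot=tpot,
        duration=last - arrival,
    )


timing = compute_request_timing(
    arrival=0.0,
    scheduled=0.05,
    token_times=[0.30, 0.35, 0.40, 0.45, 0.50],
)
print(timing)
```

With these numbers the request waits 0.05 s in the queue, produces its first token after 0.30 s, and then emits roughly one token every 0.05 s.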

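### 2. Throughput Metrics
- **Requests per Second (RPS)**: number of requests processed per second
- **Tokens per Second (TPS)**: number of tokens generated per second
- **Concurrent Requests**: number of requests in flight at the same time
- **Batch Size Distribution**: distribution of batch sizes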
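
RPS and TPS are usually not exported directly. They are derived from monotonically increasing counters by dividing the increase between two scrapes by the elapsed time. The following is a minimal sketch of that calculation; `counter_rate` is a hypothetical helper, and treating a decrease as a counter reset is an assumption borrowed from how Prometheus-style rates are commonly computed.

```python
from typing import Optional


def counter_rate(prev_value: float, prev_time: float,
                 curr_value: float, curr_time: float) -> Optional[float]:
    """Per-second rate of a cumulative counter between two scrapes."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None  # no time has passed, so the rate is undefined
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter was reset (for example by a server restart),
        # so everything counted since the reset is the increase.
        delta = curr_value
    return delta / elapsed


# Two scrapes taken 10 seconds apart
rps = counter_rate(prev_value=120, prev_time=10.0, curr_value=150, curr_time=20.0)
tps = counter_rate(prev_value=48_000, prev_time=10.0, curr_value=51_000, curr_time=20.0)
print(f"RPS: {rps}, TPS: {tps}")
```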

### 3. Resource Utilization
- **GPU Memory Usage**: how much GPU memory is in use
- **KV Cache Utilization**: how much of the KV cache is in use
- **Model Loading Time**: time taken to load the model
- **Attention Computation Time**: time spent computing attention

## 1. Environment Setup and Imports

In [None]:
import os
import time
import json
import asyncio
import threading
import logging
import statistics
from datetime import datetime, timedelta
from collections import defaultdict, deque
from typing import Dict, List, Optional, Any

# Data processing and analysis
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.animation import FuncAnimation
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.offline as pyo

# HTTP clients
import requests
import aiohttp

# Prometheus metrics
from prometheus_client import (
    Gauge, Counter, Histogram, Summary, 
    CollectorRegistry, start_http_server
)
from prometheus_client.parser import text_string_to_metric_families

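# System monitoring
import psutil
try:
    import pynvml
    pynvml.nvmlInit()
    NVIDIA_GPU_AVAILABLE = True
except Exception:
    # Covers both a missing pynvml package (ImportError) and NVML
    # initialisation failures such as a missing driver or GPU.
    NVIDIA_GPU_AVAILABLE = False

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Initialise Plotly notebook mode (connected=True loads plotly.js from a CDN)
pyo.init_notebook_mode(connected=True)

print("✅ Real-time monitoring environment initialised")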

## 2. vLLM Metrics Parser and Collector

In [None]:
class VLLMMetricsParser:
    """Parser for the Prometheus metrics exposed by vLLM."""

    def __init__(self, metrics_url: str = "http://127.0.0.1:8001/metrics"):
        self.metrics_url = metrics_url
        self.last_scrape_time = None
        self.metrics_cache = {}

    def fetch_raw_metrics(self) -> Optional[str]:
        """Fetch the raw metrics text from the endpoint."""
        try:
            response = requests.get(self.metrics_url, timeout=5)
            if response.status_code == 200:
                self.last_scrape_time = datetime.now()
                return response.text
            else:
                logger.warning(f"Metrics endpoint returned {response.status_code}")
                return None
        except requests.exceptions.RequestException as e:
            logger.error(f"Failed to fetch metrics: {e}")
            return None
    
    def parse_metrics(self, raw_metrics: str) -> Dict[str, Any]:
        """Parse metrics in the Prometheus text exposition format."""
        parsed_metrics = {}
        
        try:
            for family in text_string_to_metric_families(raw_metrics):
                metric_name = family.name
                metric_type = family.type
                
                if metric_type == 'counter':
                    for sample in family.samples:
                        parsed_metrics[sample.name] = {
                            'value': sample.value,
                            'labels': sample.labels,
                            'type': 'counter'
                        }
                
                elif metric_type == 'gauge':
                    for sample in family.samples:
                        parsed_metrics[sample.name] = {
                            'value': sample.value,
                            'labels': sample.labels,
                            'type': 'gauge'
                        }
                
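                elif metric_type == 'histogram':
                    # Group samples by label set, ignoring `le`, so the buckets,
                    # sum and count of one series land in the same dict.
                    histogram_data = defaultdict(dict)
                    for sample in family.samples:
                        key = sample.name.replace(f'{metric_name}_', '')
                        labels = {k: v for k, v in sample.labels.items() if k != 'le'}
                        series = histogram_data[str(sorted(labels.items()))]
                        if key == 'bucket':
                            series.setdefault('buckets', {})[sample.labels.get('le')] = sample.value
                        else:
                            series[key] = sample.value

                    parsed_metrics[metric_name] = {
                        'type': 'histogram',
                        'data': dict(histogram_data)
                    }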
        
        except Exception as e:
            logger.error(f"Error parsing metrics: {e}")
        
        return parsed_metrics
    
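    def get_specific_metrics(self, metric_patterns: List[str]) -> Dict[str, Any]:
        """Return the metrics whose names contain any of the given patterns."""
        raw_metrics = self.fetch_raw_metrics()
        if not raw_metrics:
            return {}

        all_metrics = self.parse_metrics(raw_metrics)
        filtered_metrics = {}

        for metric_name, metric_data in all_metrics.items():
            # vLLM prefixes its metric names with `vllm:` (for example
            # `vllm:num_requests_running`). Normalise the colon to an underscore
            # so the `vllm_...` names used in the rest of this lab match.
            normalized_name = metric_name.replace(':', '_')
            if any(pattern in normalized_name for pattern in metric_patterns):
                filtered_metrics[normalized_name] = metric_data

        return filtered_metrics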

# Initialise the metrics parser
metrics_parser = VLLMMetricsParser()
print("✅ vLLM metrics parser initialised")
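
Latency histograms carry more information than their `_sum` and `_count`: the cumulative `le` buckets allow a percentile to be estimated. The sketch below shows one way to do that with linear interpolation inside the bucket that contains the target rank, similar in spirit to Prometheus's `histogram_quantile`. The function name and the bucket values are made up for illustration, and the lowest bucket is assumed to start at 0.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list is expected to include the +Inf bucket, whose count is the total.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: observations are >= 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate inside the open-ended bucket
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT buckets: 10 requests <= 0.1 s, 60 <= 0.5 s, 90 <= 1.0 s, 100 total
ttft_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.50, ttft_buckets)
p95 = histogram_quantile(0.95, ttft_buckets)
print(f"p50 ~ {p50:.2f}s, p95 ~ {p95:.2f}s")
```

The estimate is only as fine as the bucket boundaries, so a percentile that falls in the `+Inf` bucket is reported as the highest finite bound.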

In [None]:
class RealTimeMetricsCollector:
    """Real-time metrics collector."""

    def __init__(self, collection_interval: float = 1.0, max_data_points: int = 1000):
        self.collection_interval = collection_interval
        self.max_data_points = max_data_points
        self.running = False

        # Time-series storage
        self.time_series_data = defaultdict(lambda: deque(maxlen=max_data_points))
        self.timestamps = deque(maxlen=max_data_points)

        # Aggregated statistics
        self.aggregated_stats = defaultdict(dict)

        # Metrics of interest. Exact names depend on the vLLM version, so
        # check them against the output of the /metrics endpoint.
        self.key_metrics = [
            'vllm_request_success_total',
            'vllm_request_failure_total',
            'vllm_time_to_first_token_seconds',
            'vllm_time_per_output_token_seconds',
            'vllm_request_duration_seconds',
            'vllm_num_requests_running',
            'vllm_num_requests_waiting',
            'vllm_gpu_cache_usage_perc'
        ]

        # System metrics
        self.system_metrics = {
            'cpu_percent': deque(maxlen=max_data_points),
            'memory_percent': deque(maxlen=max_data_points),
            'gpu_memory_used': deque(maxlen=max_data_points),
            'gpu_utilization': deque(maxlen=max_data_points)
        }
    
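    def collect_system_metrics(self) -> Dict[str, float]:
        """Collect system-level metrics."""
        metrics = {}

        # CPU and memory (with interval=None the first reading is a
        # meaningless 0.0; later readings cover the time since the last call)
        metrics['cpu_percent'] = psutil.cpu_percent(interval=None)
        metrics['memory_percent'] = psutil.virtual_memory().percent

        # GPU metrics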
        if NVIDIA_GPU_AVAILABLE:
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(0)
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                util_info = pynvml.nvmlDeviceGetUtilizationRates(handle)
                
                metrics['gpu_memory_used'] = (mem_info.used / mem_info.total) * 100
                metrics['gpu_utilization'] = util_info.gpu
            except Exception as e:
                logger.warning(f"GPU metrics collection failed: {e}")
                metrics['gpu_memory_used'] = 0
                metrics['gpu_utilization'] = 0
        else:
            metrics['gpu_memory_used'] = 0
            metrics['gpu_utilization'] = 0
        
        return metrics
    
    def collect_vllm_metrics(self) -> Dict[str, float]:
        """Collect vLLM-specific metrics."""
        vllm_metrics = metrics_parser.get_specific_metrics(self.key_metrics)

        # Flatten each metric to plain numeric values
        simplified_metrics = {}

        for metric_name, metric_data in vllm_metrics.items():
            if metric_data['type'] in ['counter', 'gauge']:
                simplified_metrics[metric_name] = metric_data['value']
            elif metric_data['type'] == 'histogram':
                # For histograms, keep only the cumulative count and sum
                for labels_key, hist_data in metric_data['data'].items():
                    if 'count' in hist_data:
                        simplified_metrics[f"{metric_name}_count"] = hist_data['count']
                    if 'sum' in hist_data:
                        simplified_metrics[f"{metric_name}_sum"] = hist_data['sum']
        
        return simplified_metrics
    
    def collect_single_round(self):
        """Run a single collection round."""
        timestamp = datetime.now()

        # Collect system metrics
        system_data = self.collect_system_metrics()
        for key, value in system_data.items():
            self.system_metrics[key].append(value)

        # Collect vLLM metrics
        vllm_data = self.collect_vllm_metrics()
        for key, value in vllm_data.items():
            self.time_series_data[key].append(value)

        # Record the timestamp
        self.timestamps.append(timestamp)

        # Recompute the aggregated statistics
        self.update_aggregated_stats()
    
    def update_aggregated_stats(self):
        """Update the aggregated statistics."""
        # Number of samples that span five minutes at the current interval
        window = max(1, int(300 / self.collection_interval))

        # System metrics first, then vLLM metrics
        for series in (self.system_metrics, self.time_series_data):
            for metric_name, values in series.items():
                if len(values) == 0:
                    continue
                recent = list(values)[-window:]
                # Cast to float so the stats stay JSON-serialisable
                # (np.max/np.min over ints return np.int64)
                self.aggregated_stats[metric_name] = {
                    'current': float(recent[-1]),
                    'avg_5m': float(np.mean(recent)),
                    'max_5m': float(np.max(recent)),
                    'min_5m': float(np.min(recent))
                }
    
    def start_collection(self):
        """Start real-time collection in a background thread."""
        if self.running:
            # Avoid spawning a second collection thread
            logger.warning("Collection is already running")
            return
        self.running = True

        def collection_loop():
            while self.running:
                try:
                    self.collect_single_round()
                except Exception as e:
                    logger.error(f"Collection error: {e}")
                time.sleep(self.collection_interval)

        # Run in a daemon thread so it does not block interpreter shutdown
        self.collection_thread = threading.Thread(target=collection_loop, daemon=True)
        self.collection_thread.start()

        logger.info(f"Started real-time metrics collection (interval: {self.collection_interval}s)")

    def stop_collection(self):
        """Stop real-time collection."""
        self.running = False
        logger.info("Stopped real-time metrics collection")
    
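    def get_latest_data(self, metric_name: str, num_points: int = 60) -> tuple:
        """Return the latest (timestamps, values) for a metric."""
        if metric_name in self.system_metrics:
            values = list(self.system_metrics[metric_name])[-num_points:]
        elif metric_name in self.time_series_data:
            values = list(self.time_series_data[metric_name])[-num_points:]
        else:
            return [], []

        # The background thread may be mid-update, so trim both lists to a
        # common length. The explicit zero check matters because a slice
        # starting at -0 would return the whole list.
        timestamps = list(self.timestamps)
        n = min(len(timestamps), len(values))
        if n == 0:
            return [], []
        return timestamps[-n:], values[-n:]

    def get_summary_stats(self) -> Dict[str, Any]:
        """Return the summary statistics."""
        return dict(self.aggregated_stats)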

# Initialise the real-time collector
collector = RealTimeMetricsCollector(collection_interval=2.0)
print("✅ Real-time metrics collector initialised")
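
The collector above writes to its deques from a background thread while the notebook reads them from the main thread, and each metric keeps its own deque separate from the timestamps. One possible way to keep readers and the writer from interfering, and to keep every value paired with its timestamp, is to store one row per collection round behind a lock. This is a sketch of that idea, not part of the lab's collector; `AlignedSeriesBuffer` is a hypothetical name.

```python
import threading
from collections import deque
from typing import Deque, Dict, List, Tuple


class AlignedSeriesBuffer:
    """Bounded buffer that stores one (timestamp, sample) row per round."""

    def __init__(self, maxlen: int = 1000):
        self._lock = threading.Lock()
        self._rows: Deque[Tuple[float, Dict[str, float]]] = deque(maxlen=maxlen)

    def append(self, timestamp: float, sample: Dict[str, float]) -> None:
        """Add one collection round; called from the collector thread."""
        with self._lock:
            self._rows.append((timestamp, dict(sample)))

    def latest(self, name: str, num_points: int = 60) -> Tuple[List[float], List[float]]:
        """Return aligned timestamps and values from the last num_points rounds."""
        with self._lock:
            rows = list(self._rows)[-num_points:] if num_points > 0 else []
        timestamps: List[float] = []
        values: List[float] = []
        for ts, sample in rows:
            if name in sample:  # rounds that lack this metric are skipped
                timestamps.append(ts)
                values.append(sample[name])
        return timestamps, values


buffer = AlignedSeriesBuffer(maxlen=5)
buffer.append(1.0, {"cpu_percent": 10.0})
buffer.append(2.0, {"cpu_percent": 20.0, "vllm_num_requests_running": 3.0})
buffer.append(3.0, {"cpu_percent": 30.0})
print(buffer.latest("cpu_percent"))
print(buffer.latest("vllm_num_requests_running"))
```

Because a metric that was missing in some rounds simply has fewer rows, its timestamps never drift out of step with its values.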

## 3. Dynamic Visualisation Dashboard

In [None]:
class RealTimeDashboard:
    """Real-time monitoring dashboard."""

    def __init__(self, collector: RealTimeMetricsCollector):
        self.collector = collector
        self.fig = None
        self.animation = None

        # Set the matplotlib style (this style name needs matplotlib 3.6+)
        plt.style.use('seaborn-v0_8-darkgrid')

        # Colour palette
        self.colors = {
            'cpu': '#FF6B6B',
            'memory': '#4ECDC4',
            'gpu_memory': '#45B7D1',
            'gpu_util': '#96CEB4',
            'requests': '#FFEAA7',
            'latency': '#DDA0DD'
        }
    
    def create_static_dashboard(self):
        """Render a static snapshot of the dashboard."""
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        fig.suptitle('vLLM Real-Time Performance Dashboard', fontsize=16, fontweight='bold')

        # System resources
        self._plot_system_metrics(axes[0, 0], 'cpu_percent', 'CPU Usage (%)', self.colors['cpu'])
        self._plot_system_metrics(axes[0, 1], 'memory_percent', 'Memory Usage (%)', self.colors['memory'])
        self._plot_system_metrics(axes[0, 2], 'gpu_memory_used', 'GPU Memory Usage (%)', self.colors['gpu_memory'])

        # vLLM-specific metrics
        self._plot_vllm_requests(axes[1, 0])
        self._plot_vllm_latency(axes[1, 1])
        self._plot_summary_stats(axes[1, 2])

        plt.tight_layout()
        plt.show()
    
    def _plot_system_metrics(self, ax, metric_name, title, color):
        """Plot one system metric as a time series."""
        timestamps, values = self.collector.get_latest_data(metric_name, 60)

        if len(values) > 0:
            ax.plot(timestamps, values, color=color, linewidth=2)
            ax.fill_between(timestamps, values, alpha=0.3, color=color)

            # Annotate the current value
            current_value = values[-1]
            ax.text(0.02, 0.98, f'Current: {current_value:.1f}%',
                   transform=ax.transAxes, verticalalignment='top',
                   bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

        ax.set_title(title, fontweight='bold')
        ax.set_ylabel('Usage (%)')
        ax.set_ylim(0, 100)
        ax.grid(True, alpha=0.3)

        # Fit the x-axis to the time range (needs at least two distinct points)
        if len(timestamps) > 1:
            ax.set_xlim(timestamps[0], timestamps[-1])
    
    def _plot_vllm_requests(self, ax):
        """Plot vLLM request counts."""
        # Running and waiting requests
        timestamps, running = self.collector.get_latest_data('vllm_num_requests_running', 60)
        timestamps_waiting, waiting = self.collector.get_latest_data('vllm_num_requests_waiting', 60)

        if len(running) > 0:
            ax.plot(timestamps, running, label='Running', color=self.colors['requests'], linewidth=2)

        if len(waiting) > 0:
            ax.plot(timestamps_waiting, waiting, label='Waiting', color='orange', linewidth=2)

        ax.set_title('vLLM Request State', fontweight='bold')
        ax.set_ylabel('Number of requests')
        # Only draw a legend when something was plotted, to avoid a warning
        if len(running) > 0 or len(waiting) > 0:
            ax.legend()
        ax.grid(True, alpha=0.3)
    
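    def _plot_vllm_latency(self, ax):
        """Plot latency metrics."""
        # TTFT is a histogram: _sum and _count are cumulative since server start
        timestamps, ttft_count = self.collector.get_latest_data('vllm_time_to_first_token_seconds_count', 60)
        _, ttft_sum = self.collector.get_latest_data('vllm_time_to_first_token_seconds_sum', 60)

        n = min(len(timestamps), len(ttft_count), len(ttft_sum))
        if n >= 2:
            timestamps, ttft_count, ttft_sum = timestamps[-n:], ttft_count[-n:], ttft_sum[-n:]
            # Mean TTFT per collection interval, from the deltas of sum and count.
            # Dividing the raw cumulative values would give the lifetime average
            # instead. NaN marks intervals with no new first tokens.
            d_sum = np.diff(ttft_sum)
            d_count = np.diff(ttft_count)
            avg_ttft = [s / c if c > 0 else np.nan for s, c in zip(d_sum, d_count)]
            ax.plot(timestamps[1:], avg_ttft, label='Mean TTFT', color=self.colors['latency'],
                    linewidth=2, marker='o')
            ax.legend()

        ax.set_title('Latency', fontweight='bold')
        ax.set_ylabel('Time (s)')
        ax.grid(True, alpha=0.3)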
    
    def _plot_summary_stats(self, ax):
        """Plot a bar chart of the current system state."""
        stats = self.collector.get_summary_stats()

        # Key metrics to display
        key_stats = {
            'CPU': stats.get('cpu_percent', {}).get('current', 0),
            'Memory': stats.get('memory_percent', {}).get('current', 0),
            'GPU Mem': stats.get('gpu_memory_used', {}).get('current', 0),
            'GPU Util': stats.get('gpu_utilization', {}).get('current', 0)
        }

        labels = list(key_stats.keys())
        values = list(key_stats.values())

        bars = ax.bar(labels, values,
                     color=[self.colors['cpu'], self.colors['memory'],
                           self.colors['gpu_memory'], self.colors['gpu_util']])

        # Add value labels above the bars
        for bar, value in zip(bars, values):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{value:.1f}%', ha='center', va='bottom')

        ax.set_title('Current System State', fontweight='bold')
        ax.set_ylabel('Usage (%)')
        ax.set_ylim(0, 100)
        ax.grid(True, alpha=0.3)
    
    def create_interactive_dashboard(self):
        """Create an interactive Plotly dashboard."""
        # Create the subplot grid
        fig = make_subplots(
            rows=2, cols=3,
            subplot_titles=['CPU Usage', 'Memory Usage', 'GPU Memory Usage',
                          'vLLM Request State', 'Latency', 'System Summary'],
            specs=[[{"type": "scatter"}, {"type": "scatter"}, {"type": "scatter"}],
                   [{"type": "scatter"}, {"type": "scatter"}, {"type": "bar"}]]
        )
        
        # Add CPU usage
        timestamps, cpu_values = self.collector.get_latest_data('cpu_percent', 100)
        if len(cpu_values) > 0:
            fig.add_trace(
                go.Scatter(x=timestamps, y=cpu_values, 
                          mode='lines+markers', name='CPU',
                          line=dict(color='#FF6B6B', width=2)),
                row=1, col=1
            )
        
        # Add memory usage
        timestamps, mem_values = self.collector.get_latest_data('memory_percent', 100)
        if len(mem_values) > 0:
            fig.add_trace(
                go.Scatter(x=timestamps, y=mem_values,
                          mode='lines+markers', name='Memory',
                          line=dict(color='#4ECDC4', width=2)),
                row=1, col=2
            )
        
        # Add GPU memory usage
        timestamps, gpu_mem_values = self.collector.get_latest_data('gpu_memory_used', 100)
        if len(gpu_mem_values) > 0:
            fig.add_trace(
                go.Scatter(x=timestamps, y=gpu_mem_values,
                          mode='lines+markers', name='GPU Memory',
                          line=dict(color='#45B7D1', width=2)),
                row=1, col=3
            )
        
        # Add vLLM request state
        timestamps, running = self.collector.get_latest_data('vllm_num_requests_running', 100)
        if len(running) > 0:
            fig.add_trace(
                go.Scatter(x=timestamps, y=running,
                          mode='lines+markers', name='Running Requests',
                          line=dict(color='#FFEAA7', width=2)),
                row=2, col=1
            )
        
        # Update the layout
        fig.update_layout(
            title='vLLM Real-Time Performance Dashboard',
            height=800,
            showlegend=False
        )

        # Display the figure
        pyo.iplot(fig)

# Initialise the dashboard
dashboard = RealTimeDashboard(collector)
print("✅ Real-time dashboard initialised")
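
The `_sum` and `_count` series of a Prometheus histogram only ever grow, so dividing one by the other yields the average since the server started. For a dashboard, the average over each collection interval is often more useful, and it comes from the differences between consecutive samples. The helper below is a minimal sketch of that calculation; `interval_means` is a hypothetical name and the sample values are invented.

```python
from typing import List, Optional, Sequence


def interval_means(sums: Sequence[float],
                   counts: Sequence[float]) -> List[Optional[float]]:
    """Mean observation per interval from cumulative sum and count series."""
    means: List[Optional[float]] = []
    for i in range(1, min(len(sums), len(counts))):
        d_sum = sums[i] - sums[i - 1]
        d_count = counts[i] - counts[i - 1]
        if d_count > 0 and d_sum >= 0:
            means.append(d_sum / d_count)
        else:
            # No new observations, or the counters were reset
            means.append(None)
    return means


# Cumulative TTFT sum (seconds) and count over four scrapes
ttft_sum = [1.0, 1.0, 2.5, 4.5]
ttft_count = [10, 10, 15, 19]
per_interval = interval_means(ttft_sum, ttft_count)
print(per_interval)
```

Here the lifetime average after the last scrape is about 0.24 s, while the last interval alone averages about 0.5 s, which is the kind of recent slowdown a cumulative average hides.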

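## 4. Start Real-Time Monitoring

In [None]:
# Start real-time metrics collection
collector.start_collection()

print("🚀 Real-time monitoring started")
print(f"   Collection interval: {collector.collection_interval} s")
print(f"   Max data points: {collector.max_data_points}")
print("   Metrics being monitored:")
print("   - System resources (CPU, memory, GPU)")
print("   - vLLM request metrics")
print("   - Latency and throughput metrics")
print("\n⏳ Waiting 30 seconds to collect initial data...")

# Wait while some initial data is collected
time.sleep(30)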

## 5. Simulating a vLLM Workload

In [None]:
class VLLMWorkloadSimulator:
    """vLLM workload simulator."""

    def __init__(self, api_base: str = "http://127.0.0.1:8000"):
        self.api_base = api_base
        self.session = requests.Session()

        # Test prompts
        self.test_prompts = [
            "What is the capital of France?",
            "Explain quantum computing in simple terms.",
            "Write a short story about a robot.",
            "How do neural networks work?",
            "Describe the process of photosynthesis.",
            "What are the benefits of renewable energy?",
            "Explain the theory of relativity.",
            "Write a poem about the ocean."
        ]
    
    def check_vllm_availability(self) -> bool:
        """Check whether the vLLM service is reachable."""
        try:
            response = self.session.get(f"{self.api_base}/v1/models", timeout=5)
            return response.status_code == 200
        except requests.exceptions.RequestException:
            return False
    
    def send_completion_request(self, prompt: str, max_tokens: int = 100) -> Dict[str, Any]:
        """Send a completion request and time it."""
        start_time = time.time()

        payload = {
            "model": "microsoft/DialoGPT-medium",  # must match the model served by vLLM
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "stream": False
        }
        
        try:
            response = self.session.post(
                f"{self.api_base}/v1/completions",
                json=payload,
                timeout=30
            )
            
            end_time = time.time()
            duration = end_time - start_time
            
            result = {
                'success': response.status_code == 200,
                'duration': duration,
                'status_code': response.status_code,
                'prompt_length': len(prompt),
                'response_length': 0
            }
            
            if response.status_code == 200:
                data = response.json()
                if 'choices' in data and len(data['choices']) > 0:
                    result['response_length'] = len(data['choices'][0].get('text', ''))
            
            return result
        
        except requests.exceptions.RequestException as e:
            return {
                'success': False,
                'duration': time.time() - start_time,
                'error': str(e),
                'prompt_length': len(prompt),
                'response_length': 0
            }
    
    async def simulate_concurrent_load(self, num_requests: int = 10, delay_range: tuple = (1, 5)):
        """Simulate concurrent load."""
        print(f"🔄 Simulating {num_requests} concurrent requests...")

        async def single_request(request_id: int):
            prompt = np.random.choice(self.test_prompts)
            max_tokens = np.random.randint(50, 150)

            # Random delay before sending, to stagger the requests
            await asyncio.sleep(np.random.uniform(*delay_range))

            # send_completion_request blocks, so run it in a worker thread.
            # Calling it directly would stall the event loop and serialise the
            # requests. asyncio.to_thread needs Python 3.9+; an aiohttp client
            # would avoid threads altogether.
            result = await asyncio.to_thread(self.send_completion_request, prompt, max_tokens)
            result['request_id'] = request_id

            return result

        # Create the concurrent tasks
        tasks = [single_request(i) for i in range(num_requests)]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Analyse the results
        successful_requests = [r for r in results if isinstance(r, dict) and r.get('success', False)]
        failed_requests = [r for r in results if not (isinstance(r, dict) and r.get('success', False))]

        print(f"✅ Simulation finished: {len(successful_requests)} succeeded, {len(failed_requests)} failed")

        if successful_requests:
            durations = [r['duration'] for r in successful_requests]
            print(f"   Mean latency: {np.mean(durations):.2f}s")
            print(f"   P95 latency: {np.percentile(durations, 95):.2f}s")

        return results
    
    def simulate_steady_load(self, duration_minutes: int = 5, requests_per_minute: int = 12):
        """Simulate a steady load."""
        print(f"🔄 Simulating {duration_minutes} minutes of steady load ({requests_per_minute} requests/minute)...")

        end_time = time.time() + duration_minutes * 60
        request_interval = 60.0 / requests_per_minute
        request_count = 0

        results = []

        while time.time() < end_time:
            prompt = np.random.choice(self.test_prompts)
            max_tokens = np.random.randint(50, 150)

            result = self.send_completion_request(prompt, max_tokens)
            result['request_id'] = request_count
            result['timestamp'] = datetime.now()

            results.append(result)
            request_count += 1

            if request_count % 10 == 0:
                successful = len([r for r in results[-10:] if r.get('success', False)])
                print(f"   Sent {request_count} requests (last 10: {successful}/10 succeeded)")

            # Wait for the next request. Clamp at zero because the jitter can
            # make the value negative, which time.sleep rejects. The request
            # time itself is not subtracted, so the real rate is slightly lower.
            time.sleep(max(0.0, request_interval + np.random.uniform(-0.5, 0.5)))

        print(f"✅ Steady load simulation finished: {request_count} requests in total")
        return results

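# Initialise the workload simulator
simulator = VLLMWorkloadSimulator()

# Check whether the vLLM service is available
if simulator.check_vllm_availability():
    print("✅ vLLM service is available, ready to start the load test")
    vllm_available = True
else:
    print("⚠️  vLLM service is unavailable, the load test will be skipped")
    print("   Make sure the vLLM service is running: bash start_vllm_with_metrics.sh")
    vllm_available = False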

In [None]:
# If vLLM is available, start the load test
if vllm_available:
    print("🚀 Starting the simulated workload...")

    # Simulate 2 minutes of steady load
    load_results = simulator.simulate_steady_load(duration_minutes=2, requests_per_minute=6)

    print("\n📊 Load test results:")
    successful_results = [r for r in load_results if r.get('success', False)]

    if successful_results:
        durations = [r['duration'] for r in successful_results]
        print(f"   Successful requests: {len(successful_results)}/{len(load_results)}")
        print(f"   Mean latency: {np.mean(durations):.2f}s")
        print(f"   Median latency: {np.median(durations):.2f}s")
        print(f"   P95 latency: {np.percentile(durations, 95):.2f}s")
        print(f"   Max latency: {np.max(durations):.2f}s")
else:
    print("⏭️  Skipping the load test, continuing with the monitoring data analysis")
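
The steady-load test above sends one request at a time. To exercise the server with several requests in flight, a blocking client call can be pushed onto worker threads and capped with a semaphore. The sketch below demonstrates the pattern with a stand-in `fake_request` that just sleeps, so it runs without a vLLM server; `run_concurrently`, `fake_request` and `max_in_flight` are hypothetical names, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import Callable, List


async def run_concurrently(blocking_call: Callable[[int], dict],
                           num_requests: int,
                           max_in_flight: int = 4) -> List[dict]:
    """Run a blocking call num_requests times, at most max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(request_id: int) -> dict:
        async with semaphore:
            # The blocking call runs in a worker thread, so the event loop
            # stays free to start the other requests.
            return await asyncio.to_thread(blocking_call, request_id)

    return await asyncio.gather(*(one(i) for i in range(num_requests)))


def fake_request(request_id: int) -> dict:
    """Stand-in for a blocking HTTP request that takes about 0.1 s."""
    time.sleep(0.1)
    return {"request_id": request_id, "success": True}


started = time.perf_counter()
results = asyncio.run(run_concurrently(fake_request, num_requests=8, max_in_flight=4))
elapsed = time.perf_counter() - started
print(f"{len(results)} requests in {elapsed:.2f}s")
```

Inside a Jupyter notebook an event loop is already running, so the call would be `results = await run_concurrently(...)` rather than `asyncio.run(...)`. Run one after another, the eight calls would take about 0.8 s; with four in flight they finish in roughly two waves.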

## 6. Displaying the Real-Time Dashboard

In [None]:
# Render the static dashboard
print("📊 Rendering the real-time monitoring dashboard...")
dashboard.create_static_dashboard()

In [None]:
# Show a detailed statistics summary
def display_detailed_stats():
    stats = collector.get_summary_stats()

    print("\n" + "="*60)
    print("📈 Detailed Monitoring Summary")
    print("="*60)

    # System resource statistics
    print("\n🖥️  System resource usage:")
    system_metrics = ['cpu_percent', 'memory_percent', 'gpu_memory_used', 'gpu_utilization']

    for metric in system_metrics:
        if metric in stats:
            data = stats[metric]
            print(f"   {metric:20}: current {data.get('current', 0):6.1f}% | "
                  f"5-min avg {data.get('avg_5m', 0):6.1f}% | "
                  f"max {data.get('max_5m', 0):6.1f}%")

    # vLLM metric statistics
    print("\n🤖 vLLM service metrics:")
    vllm_metrics = [k for k in stats.keys() if k.startswith('vllm_')]

    if vllm_metrics:
        for metric in sorted(vllm_metrics)[:10]:  # show the first 10 metrics
            data = stats[metric]
            print(f"   {metric:35}: {data.get('current', 0):8.2f} | "
                  f"avg {data.get('avg_5m', 0):8.2f}")
    else:
        print("   No vLLM metrics yet (the service may not be running)")

    # Collection statistics
    print("\n📊 Collection statistics:")
    print(f"   Samples collected: {len(collector.timestamps)}")
    print(f"   Collection interval: {collector.collection_interval} s")
    print(f"   Status: {'active' if collector.running else 'stopped'}")

    if len(collector.timestamps) >= 2:
        duration = (collector.timestamps[-1] - collector.timestamps[0]).total_seconds()
        print(f"   Monitoring duration: {duration:.0f} s ({duration/60:.1f} min)")

    print("\n" + "="*60)

display_detailed_stats()

## 7. Advanced Metrics Analysis

In [None]:
class AdvancedMetricsAnalyzer:
    """Advanced metrics analyser."""

    def __init__(self, collector: RealTimeMetricsCollector):
        self.collector = collector

    def analyze_resource_correlation(self):
        """Analyse correlations between resource usage metrics."""
        print("🔍 Analysing system resource correlations...")

        # Fetch the system metric series
        _, cpu_data = self.collector.get_latest_data('cpu_percent', 1000)
        _, memory_data = self.collector.get_latest_data('memory_percent', 1000)
        _, gpu_memory_data = self.collector.get_latest_data('gpu_memory_used', 1000)
        _, gpu_util_data = self.collector.get_latest_data('gpu_utilization', 1000)

        if len(cpu_data) < 10:
            print("   Not enough data points for a correlation analysis")
            return

        # Build the DataFrame from series trimmed to a common length
        min_len = min(len(cpu_data), len(memory_data), len(gpu_memory_data), len(gpu_util_data))

        df = pd.DataFrame({
            'CPU': cpu_data[-min_len:],
            'Memory': memory_data[-min_len:],
            'GPU_Memory': gpu_memory_data[-min_len:],
            'GPU_Util': gpu_util_data[-min_len:]
        })

        # Compute the correlation matrix (constant columns produce NaN)
        correlation_matrix = df.corr()

        # Visualise the correlations
        plt.figure(figsize=(10, 8))

        # Correlation heatmap
        plt.subplot(2, 2, 1)
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                   square=True, linewidths=0.5)
        plt.title('System Resource Correlation')

        # Scatter plots
        plt.subplot(2, 2, 2)
        plt.scatter(df['CPU'], df['Memory'], alpha=0.6, color='blue')
        plt.xlabel('CPU usage (%)')
        plt.ylabel('Memory usage (%)')
        plt.title('CPU vs Memory')
        plt.grid(True, alpha=0.3)

        plt.subplot(2, 2, 3)
        plt.scatter(df['GPU_Memory'], df['GPU_Util'], alpha=0.6, color='green')
        plt.xlabel('GPU memory usage (%)')
        plt.ylabel('GPU utilisation (%)')
        plt.title('GPU Memory vs GPU Utilization')
        plt.grid(True, alpha=0.3)

        # Time-series trend
        plt.subplot(2, 2, 4)
        timestamps = list(range(len(df)))
        plt.plot(timestamps, df['CPU'], label='CPU', alpha=0.7)
        plt.plot(timestamps, df['Memory'], label='Memory', alpha=0.7)
        plt.plot(timestamps, df['GPU_Memory'], label='GPU Mem', alpha=0.7)
        plt.xlabel('Sample index')
        plt.ylabel('Usage (%)')
        plt.title('Resource Usage Trend')
        plt.legend()
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        return correlation_matrix
    
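    def detect_anomalies(self, metric_name: str, threshold_std: float = 2.0):
        """Detect anomalous values in a metric."""
        timestamps, values = self.collector.get_latest_data(metric_name, 1000)

        # Keep timestamps and values the same length before indexing both
        n = min(len(timestamps), len(values))
        timestamps, values = timestamps[-n:] if n else [], values[-n:] if n else []

        if len(values) < 20:
            print(f"   {metric_name}: not enough data points to detect anomalies")
            return []

        # Compare each point with the mean and standard deviation of the
        # sliding window that precedes it
        window_size = min(20, len(values) // 4)

        anomalies = []

        for i in range(window_size, len(values)):
            window_data = values[i-window_size:i]
            mean_val = np.mean(window_data)
            std_val = np.std(window_data)

            current_val = values[i]

            if abs(current_val - mean_val) > threshold_std * std_val:
                anomalies.append({
                    'timestamp': timestamps[i],
                    'value': current_val,
                    'expected_range': (mean_val - threshold_std * std_val,
                                     mean_val + threshold_std * std_val),
                    # A change after a perfectly flat window has no finite
                    # z-score; report infinity so it ranks as the most severe
                    'deviation': abs(current_val - mean_val) / std_val if std_val > 0 else float('inf')
                })

        return anomalies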
    
    def generate_performance_report(self):
        """Generate a performance report."""
        print("\n" + "="*60)
        print("📋 vLLM Performance Analysis Report")
        print("="*60)

        # Anomaly detection on system resources
        print("\n🔍 Anomaly detection results:")

        metrics_to_check = ['cpu_percent', 'memory_percent', 'gpu_memory_used', 'gpu_utilization']

        total_anomalies = 0
        for metric in metrics_to_check:
            anomalies = self.detect_anomalies(metric, threshold_std=2.0)
            total_anomalies += len(anomalies)

            if anomalies:
                print(f"   {metric}: found {len(anomalies)} anomalous points")
                # Show the most severe anomaly
                worst_anomaly = max(anomalies, key=lambda x: x['deviation'])
                print(f"     Most severe: {worst_anomaly['value']:.1f} (deviation {worst_anomaly['deviation']:.1f}σ)")
            else:
                print(f"   {metric}: no anomalies detected")

        if total_anomalies == 0:
            print("   ✅ System is stable, no significant anomalies detected")

        # Performance summary
        stats = self.collector.get_summary_stats()

        print("\n📊 Performance summary:")

        # Resource usage averages
        cpu_avg = stats.get('cpu_percent', {}).get('avg_5m', 0)
        memory_avg = stats.get('memory_percent', {}).get('avg_5m', 0)
        gpu_memory_avg = stats.get('gpu_memory_used', {}).get('avg_5m', 0)

        print(f"   Mean CPU usage: {cpu_avg:.1f}%")
        print(f"   Mean memory usage: {memory_avg:.1f}%")
        print(f"   Mean GPU memory usage: {gpu_memory_avg:.1f}%")

        # Efficiency hints
        if cpu_avg < 20:
            print("   💡 CPU usage is low, consider increasing the load")
        elif cpu_avg > 80:
            print("   ⚠️  CPU usage is high, consider optimising or scaling out")

        # Note: vLLM pre-allocates GPU memory up to its gpu_memory_utilization
        # setting, so high device memory usage is expected while it is serving.
        if gpu_memory_avg < 30 and NVIDIA_GPU_AVAILABLE:
            print("   💡 GPU memory usage is low, consider increasing the batch size")
        elif gpu_memory_avg > 90:
            print("   ⚠️  GPU memory usage is high, consider reducing the batch size")

        print("\n" + "="*60)

# Create the advanced analyzer
analyzer = AdvancedMetricsAnalyzer(collector)

# Run the analysis
print("🔬 Running advanced metrics analysis...")
correlation_matrix = analyzer.analyze_resource_correlation()

In [None]:
# Generate the performance report
analyzer.generate_performance_report()
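
The report flags points that lie more than `threshold_std` standard deviations from a mean. The sketch below illustrates that rule on a synthetic series. It is not the notebook's `detect_anomalies`: the trailing `window` parameter, the function name `zscore_anomalies`, and the choice to skip points whose history has zero variance are all assumptions made for illustration.

```python
import statistics
from typing import Dict, List, Sequence


def zscore_anomalies(values: Sequence[float],
                     window: int = 20,
                     threshold_std: float = 2.0) -> List[Dict[str, float]]:
    """Flag points that deviate from the trailing-window mean by more than
    threshold_std standard deviations. `window` is an assumed parameter."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]      # trailing window, excludes point i
        mean_val = statistics.mean(history)
        std_val = statistics.pstdev(history)
        if std_val == 0:
            continue                        # flat history: z-score undefined, skip
        deviation = abs(values[i] - mean_val) / std_val
        if deviation > threshold_std:
            anomalies.append({'index': i, 'value': values[i], 'deviation': deviation})
    return anomalies


# Synthetic CPU-like series: a small oscillation with one spike at index 30
series = [50.0 + (i % 5) for i in range(40)]
series[30] = 95.0
found = zscore_anomalies(series, window=20, threshold_std=2.0)
print([a['index'] for a in found])
```

Because the spike enters the history of the points that follow it, it inflates their standard deviation and makes them harder to flag. That side effect is one reason to tune `threshold_std` against real workload data.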

## 8. Exporting and Storing Monitoring Data

In [None]:
def export_monitoring_data():
    """Export the collected monitoring data to JSON and CSV."""
    print("💾 Exporting monitoring data...")
    
    # Assemble the export payload
    export_data = {
        'metadata': {
            'export_time': datetime.now().isoformat(),
            'collection_interval': collector.collection_interval,
            'total_data_points': len(collector.timestamps),
            'monitoring_duration_seconds': (collector.timestamps[-1] - collector.timestamps[0]).total_seconds() if len(collector.timestamps) >= 2 else 0
        },
        'timestamps': [ts.isoformat() for ts in collector.timestamps],
        'system_metrics': {},
        'vllm_metrics': {},
        'aggregated_stats': dict(collector.aggregated_stats)
    }
    
    # Export system metrics (deques are converted to plain lists for JSON)
    for metric_name, values in collector.system_metrics.items():
        export_data['system_metrics'][metric_name] = list(values)
    
    # Export vLLM metrics
    for metric_name, values in collector.time_series_data.items():
        export_data['vllm_metrics'][metric_name] = list(values)
    
    # Save as a timestamped JSON file
    filename = f"vllm_monitoring_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    
    with open(filename, 'w') as f:
        json.dump(export_data, f, indent=2)
    
    print(f"✅ Monitoring data exported to: {filename}")
    
    # Also export as CSV (convenient for analysis)
    if len(collector.timestamps) > 0:
        # Build the DataFrame columns
        df_data = {'timestamp': [ts.isoformat() for ts in collector.timestamps]}
        
        # Add system metrics
        for metric_name, values in collector.system_metrics.items():
            # Pad with None, then truncate, so every column matches the timestamp count
            padded_values = list(values) + [None] * (len(collector.timestamps) - len(values))
            df_data[metric_name] = padded_values[:len(collector.timestamps)]
        
        # Add a subset of the vLLM metrics
        for metric_name, values in list(collector.time_series_data.items())[:10]:  # first 10 series only
            padded_values = list(values) + [None] * (len(collector.timestamps) - len(values))
            df_data[metric_name] = padded_values[:len(collector.timestamps)]
        
        df = pd.DataFrame(df_data)
        csv_filename = filename.replace('.json', '.csv')
        df.to_csv(csv_filename, index=False)
        print(f"✅ CSV data exported to: {csv_filename}")
    
    return filename

# Export the data
export_filename = export_monitoring_data()
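
To check an export, the JSON file can be read back and reshaped into one row per timestamp. The sketch below uses only the standard library and assumes the top-level keys written by `export_monitoring_data` (`timestamps`, `system_metrics`, `vllm_metrics`). The helper name `load_export_rows`, the sample values, and the metric names in the sample are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_export_rows(path: str) -> List[Dict[str, Any]]:
    """Rebuild one row per timestamp from an exported JSON file.
    Series shorter than the timestamp list are padded with None at the end,
    mirroring the padding used by the CSV export."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    timestamps = data['timestamps']
    series = {**data.get('system_metrics', {}), **data.get('vllm_metrics', {})}
    rows = []
    for i, ts in enumerate(timestamps):
        row = {'timestamp': ts}
        for name, values in series.items():
            row[name] = values[i] if i < len(values) else None
        rows.append(row)
    return rows


# Hypothetical sample in the same shape as export_data
sample = {
    'timestamps': ['2024-01-01T12:00:00', '2024-01-01T12:00:02', '2024-01-01T12:00:04'],
    'system_metrics': {'cpu_percent': [12.5, 40.0]},        # one point short
    'vllm_metrics': {'example_running_requests': [1, 2, 3]},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample_export.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    rows = load_export_rows(path)

print(rows[-1])
```

If a system metric and a vLLM metric share a name, the vLLM value wins in this sketch. Prefixing the keys would avoid that collision.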

## 9. Cleanup and Stopping Monitoring

In [None]:
# Stop the monitoring collector
print("🛑 Stopping real-time monitoring collection...")
collector.stop_collection()

# Final statistics
final_stats = collector.get_summary_stats()
print("\n📊 Final monitoring statistics:")
print(f"   Data points collected: {len(collector.timestamps)}")
print(f"   System metric types: {len(collector.system_metrics)}")
print(f"   vLLM metric types: {len(collector.time_series_data)}")

if len(collector.timestamps) >= 2:
    duration = (collector.timestamps[-1] - collector.timestamps[0]).total_seconds()
    print(f"   Monitoring duration: {duration:.0f} s ({duration/60:.1f} min)")

print("\n✅ Real-time monitoring lab complete")

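## Lab Summary

This lab built an end-to-end real-time monitoring system for vLLM, covering the following core functions:

### ✅ Completed Items

1. **Real-time metrics collection**
   - Parsing of vLLM's native Prometheus metrics
   - System resource monitoring (CPU, memory, GPU)
   - High-frequency data collection (2-second interval)

2. **Dynamic monitoring visualization**
   - Multi-panel real-time dashboard
   - System resource usage trend charts
   - vLLM service status monitoring

3. **Advanced analysis**
   - Resource usage correlation analysis
   - Anomaly detection algorithm
   - Performance assessment report

4. **Workload simulation**
   - Concurrent request simulation
   - Steady-load testing
   - Latency analysis

5. **Data management**
   - Time-series data storage
   - JSON/CSV export
   - Aggregated statistics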

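### 🎯 Key Results

- **Real-time monitoring architecture**: Built an extensible architecture for collecting monitoring data
- **Metrics parsing engine**: Developed a parser for Prometheus-format metrics
- **Anomaly detection**: Implemented a statistical (standard-deviation threshold) anomaly detector
- **Performance analysis**: Produced a performance report with threshold-based tuning suggestions

### 📈 Metrics Coverage

- **System layer**: CPU, memory, and GPU usage
- **Application layer**: vLLM request status and latency metrics
- **Business layer**: Throughput, error rate, and concurrency

### 🔧 Technical Features

- **Non-blocking collection**: Background data collection on a separate thread
- **Memory efficiency**: A sliding window caps the number of retained data points
- **Fault tolerance**: Network errors and service unavailability are handled
- **Standardized output**: Export to multiple data formats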
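
The first three features (a background thread, a bounded sliding window, and tolerance for failed reads) can be combined in a few lines. The sketch below is a minimal illustration, not the notebook's collector: the class name `BackgroundSampler`, its parameters, and the stand-in reader are hypothetical.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Tuple


class BackgroundSampler:
    """Calls `read_fn` every `interval` seconds on a daemon thread and keeps
    only the most recent `max_points` samples."""

    def __init__(self, read_fn: Callable[[], float], interval: float = 2.0,
                 max_points: int = 300):
        self.read_fn = read_fn
        self.interval = interval
        # deque(maxlen=...) drops the oldest sample automatically when full
        self.samples: Deque[Tuple[float, float]] = deque(maxlen=max_points)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.samples.append((time.time(), self.read_fn()))
            except Exception:
                pass  # tolerate a failed read (e.g. service unavailable) and keep sampling
            self._stop.wait(self.interval)  # sleeps, but wakes immediately on stop()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join(timeout=self.interval + 1.0)


# Stand-in reader; a real collector would query psutil or the /metrics endpoint here
counter = iter(range(1_000_000))
sampler = BackgroundSampler(lambda: float(next(counter)), interval=0.01, max_points=5)
sampler.start()
time.sleep(0.2)
sampler.stop()
print(len(sampler.samples), "samples retained (capped at 5)")
```

Waiting on the `threading.Event` instead of calling `time.sleep` lets `stop()` return promptly even when the sampling interval is long.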

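### 📋 Next Steps

Continue with **03-Performance_Analysis.ipynb** to learn in-depth performance analysis and bottleneck diagnosis.

---

**Notes**:
- In production, lower the collection frequency to save resources
- Long-term monitoring requires a data retention policy
- Anomaly detection thresholds should be tuned to the actual workload
- Consider pairing this setup with Grafana for richer visualization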
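
As an illustration of the retention note, one common approach is to keep recent points at full resolution and replace older points with per-bucket averages. The sketch below assumes that approach. The function name `apply_retention`, the 5-minute raw window, and the 1-minute bucket are hypothetical choices, not settings taken from this lab.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[datetime, float]


def apply_retention(points: List[Point], now: datetime,
                    raw_window: timedelta = timedelta(minutes=5),
                    bucket: timedelta = timedelta(minutes=1)) -> List[Point]:
    """Keep raw points newer than `raw_window`; replace older points with one
    mean value per `bucket`-sized interval (aligned to the epoch)."""
    if not points:
        return []
    cutoff = now - raw_window
    bucket_s = bucket.total_seconds()
    tz = points[0][0].tzinfo                 # preserve naive/aware style of the input
    recent = [p for p in points if p[0] >= cutoff]
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        if ts < cutoff:
            buckets[int(ts.timestamp() // bucket_s)].append(value)
    downsampled = [
        (datetime.fromtimestamp(key * bucket_s, tz=tz), mean(vals))
        for key, vals in sorted(buckets.items())
    ]
    return downsampled + recent


# Ten minutes of synthetic samples at a 2-second interval
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
raw = [(start + timedelta(seconds=2 * i), float(i)) for i in range(300)]
kept = apply_retention(raw, now=start + timedelta(minutes=10))
print(len(raw), "->", len(kept), "points")
```

Averaging hides short spikes, so a retention policy that also feeds anomaly detection may want to store each bucket's maximum alongside its mean.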