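# Lab-2.4.3: Extreme Performance Tuning 🚀

## 🎯 Learning Objectives

- **Bottleneck analysis**: identify and resolve performance bottlenecks
- **GPU memory management**: optimize memory usage strategies
- **Multi-GPU configuration**: scale out to multi-GPU environments
- **Performance monitoring**: monitor and tune metrics in real time

## 🏗️ Lab Architecture

```
Performance Optimization Pipeline
├── Bottleneck analysis and diagnosis
├── GPU memory optimization
├── Multi-GPU load balancing
└── Real-time performance monitoring
```

## 1. Environment Preparation and Basic Setup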

In [None]:
import os
import sys
import time
import json
import asyncio
import psutil
import subprocess
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
from datetime import datetime
import concurrent.futures

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

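# GPU monitoring
try:
    import pynvml
    pynvml.nvmlInit()
    GPU_AVAILABLE = True
    print("✅ NVIDIA GPU monitoring initialized")
except Exception as e:
    # ImportError when pynvml is missing; NVMLError when no NVIDIA driver or GPU is present
    GPU_AVAILABLE = False
    print(f"⚠️  NVIDIA GPU monitoring unavailable: {e}")

# Set the plot style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("🔧 Environment setup complete")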

## 2. Performance Monitoring Infrastructure

In [None]:
@dataclass
class PerformanceMetrics:
    """性能指標數據結構"""
    timestamp: float
    latency_ms: float
    throughput_rps: float
    cpu_usage: float
    memory_usage: float
    gpu_usage: Optional[float] = None
    gpu_memory_usage: Optional[float] = None
    queue_size: Optional[int] = None
    batch_size: Optional[int] = None

class PerformanceMonitor:
    """實時性能監控器"""
    
    def __init__(self, triton_url: str = "localhost:8000"):
        self.triton_url = triton_url
        self.client = httpclient.InferenceServerClient(url=triton_url)
        self.metrics_history: List[PerformanceMetrics] = []
        self.monitoring = False
        
    def get_system_metrics(self) -> Dict:
        """獲取系統資源使用情況"""
        metrics = {
            'cpu_usage': psutil.cpu_percent(interval=0.1),
            'memory_usage': psutil.virtual_memory().percent,
            'timestamp': time.time()
        }
        
        # GPU 監控
        if GPU_AVAILABLE:
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(0)
                gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                
                metrics.update({
                    'gpu_usage': gpu_util.gpu,
                    'gpu_memory_usage': (mem_info.used / mem_info.total) * 100
                })
            except Exception as e:
                print(f"GPU 監控錯誤: {e}")
                
        return metrics
    
    def get_triton_metrics(self) -> Dict:
        """獲取 Triton 服務器指標"""
        try:
            # 獲取服務器統計信息
            stats = self.client.get_inference_statistics()
            
            # 獲取服務器狀態
            server_metadata = self.client.get_server_metadata()
            
            return {
                'server_stats': stats,
                'server_metadata': server_metadata,
                'timestamp': time.time()
            }
        except Exception as e:
            print(f"Triton 指標獲取錯誤: {e}")
            return {}
    
    async def monitor_performance(self, duration: int = 60, interval: float = 1.0):
        """持續監控性能"""
        self.monitoring = True
        start_time = time.time()
        
        print(f"🔍 開始性能監控 ({duration} 秒)")
        
        while self.monitoring and (time.time() - start_time) < duration:
            # 收集指標
            system_metrics = self.get_system_metrics()
            triton_metrics = self.get_triton_metrics()
            
            # 記錄指標
            metric = PerformanceMetrics(
                timestamp=system_metrics['timestamp'],
                latency_ms=0,  # 將在負載測試中更新
                throughput_rps=0,  # 將在負載測試中更新
                cpu_usage=system_metrics['cpu_usage'],
                memory_usage=system_metrics['memory_usage'],
                gpu_usage=system_metrics.get('gpu_usage'),
                gpu_memory_usage=system_metrics.get('gpu_memory_usage')
            )
            
            self.metrics_history.append(metric)
            
            await asyncio.sleep(interval)
        
        print("✅ 性能監控完成")
    
    def stop_monitoring(self):
        """停止監控"""
        self.monitoring = False
    
    def plot_metrics(self, figsize: Tuple[int, int] = (15, 10)):
        """Plot the collected performance metrics"""
        if not self.metrics_history:
            print("⚠️  No performance data to display")
            return
        
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        
        timestamps = [m.timestamp for m in self.metrics_history]
        start_time = timestamps[0]
        relative_times = [(t - start_time) for t in timestamps]
        
        # CPU utilization
        axes[0, 0].plot(relative_times, [m.cpu_usage for m in self.metrics_history], 'b-')
        axes[0, 0].set_title('CPU utilization (%)')
        axes[0, 0].set_ylabel('Utilization (%)')
        axes[0, 0].grid(True)
        
        # Host memory utilization
        axes[0, 1].plot(relative_times, [m.memory_usage for m in self.metrics_history], 'g-')
        axes[0, 1].set_title('Memory utilization (%)')
        axes[0, 1].set_ylabel('Utilization (%)')
        axes[0, 1].grid(True)
        
        # GPU utilization
        if any(m.gpu_usage is not None for m in self.metrics_history):
            gpu_usage = [m.gpu_usage or 0 for m in self.metrics_history]
            axes[1, 0].plot(relative_times, gpu_usage, 'r-')
            axes[1, 0].set_title('GPU utilization (%)')
            axes[1, 0].set_ylabel('Utilization (%)')
            axes[1, 0].grid(True)
        
        # GPU memory utilization
        if any(m.gpu_memory_usage is not None for m in self.metrics_history):
            gpu_mem = [m.gpu_memory_usage or 0 for m in self.metrics_history]
            axes[1, 1].plot(relative_times, gpu_mem, 'orange')
            axes[1, 1].set_title('GPU memory utilization (%)')
            axes[1, 1].set_ylabel('Utilization (%)')
            axes[1, 1].grid(True)
        
        plt.tight_layout()
        plt.show()

# 初始化監控器
monitor = PerformanceMonitor()
print("📊 性能監控器已初始化")
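
The monitor above records `latency_ms` and `throughput_rps` as placeholders to be filled in by a load test. As a minimal sketch of how those two fields could be derived from raw per-request timings, the helper below computes the mean, nearest-rank percentiles, and throughput. The function name `summarize_latencies` and its return keys are illustrative assumptions, not part of the lab code.

```python
import math
from typing import Dict, List


def summarize_latencies(latencies_ms: List[float], wall_time_s: float) -> Dict[str, float]:
    """Summarize per-request latencies (ms) collected over wall_time_s seconds."""
    if not latencies_ms or wall_time_s <= 0:
        raise ValueError("need at least one sample and a positive wall time")
    ordered = sorted(latencies_ms)

    def percentile(p: int) -> float:
        # Nearest-rank percentile: the smallest sample with at least p% of
        # all samples at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
        "throughput_rps": len(ordered) / wall_time_s,
    }


# Synthetic data: 100 requests with latencies 1..100 ms, completed in 2 seconds.
summary = summarize_latencies([float(i) for i in range(1, 101)], wall_time_s=2.0)
print(summary)
```

Tail percentiles (p95, p99) are usually more informative than the mean when tuning batching and queue delays, because a small number of slow requests can hide behind a healthy average.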

## 3. Bottleneck Analysis and Diagnosis

In [None]:
class BottleneckAnalyzer:
    """瓶頸分析器"""
    
    def __init__(self, triton_url: str = "localhost:8000"):
        self.triton_url = triton_url
        self.client = httpclient.InferenceServerClient(url=triton_url)
        
    def analyze_model_performance(self, model_name: str, test_duration: int = 30) -> Dict:
        """分析單個模型的性能瓶頸"""
        print(f"🔍 分析模型 {model_name} 的性能...")
        
        # 獲取模型配置
        try:
            model_config = self.client.get_model_config(model_name)
            model_metadata = self.client.get_model_metadata(model_name)
        except Exception as e:
            print(f"❌ 無法獲取模型信息: {e}")
            return {}
        
        # 分析結果
        analysis = {
            'model_name': model_name,
            'config': model_config,
            'metadata': model_metadata,
            'bottlenecks': [],
            'recommendations': []
        }
        
        # 檢查配置潛在問題
        config_dict = model_config
        
        # 檢查動態批次設置
        if 'dynamic_batching' in config_dict:
            db_config = config_dict['dynamic_batching']
            max_queue_delay = db_config.get('max_queue_delay_microseconds', 0)
            
            if max_queue_delay == 0:
                analysis['bottlenecks'].append("動態批次未設置最大延遲")
                analysis['recommendations'].append("設置適當的 max_queue_delay_microseconds")
            
            preferred_batch_size = db_config.get('preferred_batch_size', [])
            if not preferred_batch_size:
                analysis['bottlenecks'].append("未設置首選批次大小")
                analysis['recommendations'].append("配置 preferred_batch_size 優化吞吐量")
        else:
            analysis['bottlenecks'].append("未啟用動態批次")
            analysis['recommendations'].append("啟用動態批次以提高吞吐量")
        
        # Check the instance count (guard against a missing or empty instance_group)
        instance_group = (config_dict.get('instance_group') or [{}])[0]
        count = instance_group.get('count', 1)
        
        if count == 1:
            analysis['bottlenecks'].append("A single instance may become a bottleneck")
            analysis['recommendations'].append("Consider increasing the instance count")
        
        # Check the instance kind (CPU vs. GPU)
        kind = instance_group.get('kind', 'KIND_AUTO')
        if kind == 'KIND_CPU':
            analysis['bottlenecks'].append("CPU inference may be slow")
            analysis['recommendations'].append("Consider GPU inference")
        
        return analysis
    
    def benchmark_latency_throughput(self, model_name: str, 
                                   batch_sizes: Tuple[int, ...] = (1, 2, 4, 8, 16, 32),
                                   concurrency_levels: Tuple[int, ...] = (1, 2, 4, 8, 16)) -> Dict:
        """Benchmark latency and throughput across batch sizes and concurrency levels"""
        print(f"📊 對模型 {model_name} 進行基準測試...")
        
        results = {
            'batch_size_analysis': {},
            'concurrency_analysis': {},
            'optimal_config': {}
        }
        
        # 測試不同批次大小
        print("🔬 測試批次大小影響...")
        for batch_size in batch_sizes:
            try:
                latency, throughput = self._test_batch_size(model_name, batch_size)
                results['batch_size_analysis'][batch_size] = {
                    'latency_ms': latency,
                    'throughput_rps': throughput
                }
                print(f"  批次大小 {batch_size}: 延遲 {latency:.2f}ms, 吞吐量 {throughput:.2f} RPS")
            except Exception as e:
                print(f"  批次大小 {batch_size} 測試失敗: {e}")
        
        # 測試不同並發級別
        print("🔬 測試並發級別影響...")
        for concurrency in concurrency_levels:
            try:
                latency, throughput = self._test_concurrency(model_name, concurrency)
                results['concurrency_analysis'][concurrency] = {
                    'latency_ms': latency,
                    'throughput_rps': throughput
                }
                print(f"  並發級別 {concurrency}: 延遲 {latency:.2f}ms, 吞吐量 {throughput:.2f} RPS")
            except Exception as e:
                print(f"  並發級別 {concurrency} 測試失敗: {e}")
        
        # 找出最優配置
        if results['batch_size_analysis']:
            best_batch = max(results['batch_size_analysis'].items(), 
                           key=lambda x: x[1]['throughput_rps'])
            results['optimal_config']['best_batch_size'] = best_batch[0]
            
        if results['concurrency_analysis']:
            best_concurrency = max(results['concurrency_analysis'].items(), 
                                 key=lambda x: x[1]['throughput_rps'])
            results['optimal_config']['best_concurrency'] = best_concurrency[0]
        
        return results
    
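    def _test_batch_size(self, model_name: str, batch_size: int) -> Tuple[float, float]:
        """Measure latency and throughput for a specific batch size"""
        from tritonclient.utils import triton_to_np_dtype
        
        inputs = []
        outputs = []
        
        try:
            # Read the model metadata to learn the input format
            metadata = self.client.get_model_metadata(model_name)
            
            # Build sample inputs (adjust to the actual model as needed)
            for inp in metadata['inputs']:
                input_name = inp['name']
                input_dtype = inp['datatype']
                
                # Batching models report the batch dimension as a leading -1:
                # substitute batch_size there and use 1 for other dynamic dims
                dims = [int(d) for d in inp['shape']]
                if dims and dims[0] == -1:
                    shape = [batch_size] + [1 if d == -1 else d for d in dims[1:]]
                else:
                    # No batch dimension reported: prepend one, as the original test did
                    shape = [batch_size] + [1 if d == -1 else d for d in dims]
                
                # Generate random data in the dtype the model expects
                np_dtype = triton_to_np_dtype(input_dtype) or np.float32
                if np.issubdtype(np_dtype, np.integer):
                    test_data = np.random.randint(0, 1000, shape).astype(np_dtype)
                else:
                    test_data = np.random.random(shape).astype(np_dtype)
                
                inputs.append(httpclient.InferInput(input_name, list(test_data.shape), input_dtype))
                inputs[-1].set_data_from_numpy(test_data)
            
            for out in metadata['outputs']:
                outputs.append(httpclient.InferRequestedOutput(out['name']))
        
        except Exception as e:
            # Metadata unavailable: fall back to a default image-shaped input
            print(f"  Falling back to the default test input: {e}")
            test_data = np.random.random((batch_size, 224, 224, 3)).astype(np.float32)
            inputs = [httpclient.InferInput("input", list(test_data.shape), "FP32")]
            inputs[0].set_data_from_numpy(test_data)
            outputs = [httpclient.InferRequestedOutput("output")]
        
        # Run the test, timing each request and counting only successes
        num_requests = 100
        latencies = []
        start_time = time.time()
        
        for _ in range(num_requests):
            request_start = time.time()
            try:
                self.client.infer(model_name, inputs, outputs=outputs)
                latencies.append((time.time() - request_start) * 1000)  # milliseconds
            except Exception:
                pass  # failed requests are excluded from the statistics
        
        total_time = time.time() - start_time
        
        if not latencies:
            raise RuntimeError(f"all {num_requests} requests failed for model {model_name}")
        
        avg_latency = sum(latencies) / len(latencies)
        throughput = (len(latencies) * batch_size) / total_time  # samples per second
        
        return avg_latency, throughput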
    
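    def _test_concurrency(self, model_name: str, concurrency: int) -> Tuple[float, float]:
        """Placeholder for the concurrency test"""
        # NOTE: `concurrency` is not used yet; every level re-runs the sequential
        # batch-size-1 test, so results across concurrency levels will look alike
        return self._test_batch_size(model_name, 1)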
    
    def generate_optimization_report(self, model_name: str) -> str:
        """生成優化建議報告"""
        analysis = self.analyze_model_performance(model_name)
        benchmark = self.benchmark_latency_throughput(model_name)
        
        report = f"""
# 🎯 模型性能優化報告: {model_name}

## 📊 瓶頸分析
"""
        
        if analysis.get('bottlenecks'):
            report += "\n### 🚨 發現的瓶頸:\n"
            for bottleneck in analysis['bottlenecks']:
                report += f"- {bottleneck}\n"
        
        if analysis.get('recommendations'):
            report += "\n### 💡 優化建議:\n"
            for rec in analysis['recommendations']:
                report += f"- {rec}\n"
        
        if benchmark.get('optimal_config'):
            report += "\n### 🎯 最優配置:\n"
            config = benchmark['optimal_config']
            if 'best_batch_size' in config:
                report += f"- 最佳批次大小: {config['best_batch_size']}\n"
            if 'best_concurrency' in config:
                report += f"- 最佳並發級別: {config['best_concurrency']}\n"
        
        return report

# 初始化分析器
analyzer = BottleneckAnalyzer()
print("🔍 瓶頸分析器已初始化")
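
The `_test_concurrency` method above is a simplified stand-in that does not issue requests in parallel. The sketch below shows one way a concurrency level could be exercised with a thread pool. It is written against a generic `send_request` callable so that it runs without a server; `run_concurrent_load` and `fake_request` are hypothetical names. When wiring in a real Triton call, it is safer to create one client per worker thread, since the `tritonclient` documentation describes client objects as not thread safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_concurrent_load(send_request: Callable[[], None],
                        concurrency: int,
                        requests_per_worker: int = 20) -> Tuple[float, float]:
    """Return (mean latency in ms, throughput in requests/s) at a concurrency level."""

    def worker() -> List[float]:
        # Each worker sends its requests back to back and times every one.
        latencies = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            send_request()
            latencies.append((time.perf_counter() - start) * 1000)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        latencies = [ms for future in futures for ms in future.result()]
    wall_time = time.perf_counter() - wall_start

    return sum(latencies) / len(latencies), len(latencies) / wall_time


def fake_request() -> None:
    # Stand-in for a real inference call; sleeps for about 5 ms.
    time.sleep(0.005)


for level in (1, 4):
    latency_ms, rps = run_concurrent_load(fake_request, concurrency=level)
    print(f"concurrency={level}: {latency_ms:.1f} ms, {rps:.0f} req/s")
```

With a real server, throughput typically rises with concurrency until the GPU or the request queue saturates, after which latency grows while throughput flattens; that knee is the value worth recording as the best concurrency level.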

## 4. GPU Memory Management Optimization

In [None]:
class GPUMemoryOptimizer:
    """GPU 記憶體優化器"""
    
    def __init__(self):
        self.gpu_available = GPU_AVAILABLE
        
    def get_gpu_memory_info(self) -> Dict:
        """獲取 GPU 記憶體信息"""
        if not self.gpu_available:
            return {'error': 'GPU 不可用'}
        
        try:
            device_count = pynvml.nvmlDeviceGetCount()
            gpu_info = {}
            
            for i in range(device_count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
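                # nvmlDeviceGetName returns bytes in older pynvml releases and str in newer ones
                name = pynvml.nvmlDeviceGetName(handle)
                if isinstance(name, bytes):
                    name = name.decode('utf-8')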
                
                gpu_info[f'gpu_{i}'] = {
                    'name': name,
                    'total_memory_gb': mem_info.total / (1024**3),
                    'used_memory_gb': mem_info.used / (1024**3),
                    'free_memory_gb': mem_info.free / (1024**3),
                    'utilization_percent': (mem_info.used / mem_info.total) * 100
                }
            
            return gpu_info
            
        except Exception as e:
            return {'error': f'獲取 GPU 信息失敗: {e}'}
    
    def analyze_memory_usage_pattern(self, duration: int = 60) -> Dict:
        """分析記憶體使用模式"""
        print(f"📈 分析 GPU 記憶體使用模式 ({duration} 秒)...")
        
        if not self.gpu_available:
            return {'error': 'GPU 不可用'}
        
        memory_history = []
        start_time = time.time()
        
        while time.time() - start_time < duration:
            gpu_info = self.get_gpu_memory_info()
            if 'error' not in gpu_info:
                timestamp = time.time()
                for gpu_id, info in gpu_info.items():
                    memory_history.append({
                        'timestamp': timestamp,
                        'gpu_id': gpu_id,
                        'used_memory_gb': info['used_memory_gb'],
                        'utilization_percent': info['utilization_percent']
                    })
            time.sleep(1)
        
        # 分析模式
        analysis = {
            'memory_history': memory_history,
            'peak_usage': {},
            'average_usage': {},
            'memory_spikes': [],
            'recommendations': []
        }
        
        # 計算統計信息
        if memory_history:
            gpus = set(item['gpu_id'] for item in memory_history)
            
            for gpu_id in gpus:
                gpu_data = [item for item in memory_history if item['gpu_id'] == gpu_id]
                usage_values = [item['utilization_percent'] for item in gpu_data]
                
                analysis['peak_usage'][gpu_id] = max(usage_values)
                analysis['average_usage'][gpu_id] = sum(usage_values) / len(usage_values)
                
                # 檢測記憶體峰值
                threshold = 90  # 90% 使用率閾值
                spikes = [item for item in gpu_data if item['utilization_percent'] > threshold]
                if spikes:
                    analysis['memory_spikes'].extend(spikes)
        
        # 生成建議
        if analysis['memory_spikes']:
            analysis['recommendations'].append("檢測到記憶體使用峰值，考慮優化批次大小")
        
        avg_usage = analysis.get('average_usage', {})
        if avg_usage and any(usage > 80 for usage in avg_usage.values()):
            analysis['recommendations'].append("記憶體使用率較高，建議啟用記憶體優化策略")
        
        return analysis
    
    def generate_memory_optimization_config(self, target_memory_usage: float = 75.0) -> Dict:
        """生成記憶體優化配置"""
        gpu_info = self.get_gpu_memory_info()
        
        if 'error' in gpu_info:
            return gpu_info
        
        optimization_config = {
            'memory_pool_config': {},
            'instance_group_config': {},
            'batching_config': {},
            'recommendations': []
        }
        
        for gpu_id, info in gpu_info.items():
            total_memory_gb = info['total_memory_gb']
            target_memory_gb = total_memory_gb * (target_memory_usage / 100)
            
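            # Memory pool sizing (informational; when launching tritonserver the related
            # flags are --cuda-memory-pool-byte-size=<gpu id>:<bytes> and
            # --pinned-memory-pool-byte-size=<bytes>)
            optimization_config['memory_pool_config'][gpu_id] = {
                'gpu_memory_pool_byte_size': int(target_memory_gb * 1024**3),
                'pinned_memory_pool_byte_size': 268435456  # 256 MiB
            }
            
            # Instance group configuration
            optimization_config['instance_group_config'][gpu_id] = {
                'count': 1,
                'kind': 'KIND_GPU',
                'gpus': [int(gpu_id.split('_')[1])]
            }
            
            # Suggested batching configuration (rough heuristic: assumes about 2 GB per
            # sample in a batch, capped at 32; measure the real model before relying on it)
            optimization_config['batching_config'][gpu_id] = {
                'max_batch_size': min(32, max(1, int(target_memory_gb / 2))),
                'preferred_batch_size': [1, 2, 4, 8]
            }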
        
        # 通用建議
        optimization_config['recommendations'] = [
            "使用動態批次以最大化記憶體效率",
            "監控記憶體使用情況並調整批次大小",
            "考慮使用模型並行化分散記憶體負載",
            f"目標記憶體使用率: {target_memory_usage}%"
        ]
        
        return optimization_config
    
    def plot_memory_usage(self, memory_history: List[Dict], figsize: Tuple[int, int] = (12, 6)):
        """繪製記憶體使用圖表"""
        if not memory_history:
            print("⚠️  沒有記憶體使用數據")
            return
        
        plt.figure(figsize=figsize)
        
        # 按 GPU 分組
        gpus = set(item['gpu_id'] for item in memory_history)
        
        for gpu_id in gpus:
            gpu_data = [item for item in memory_history if item['gpu_id'] == gpu_id]
            timestamps = [item['timestamp'] for item in gpu_data]
            utilization = [item['utilization_percent'] for item in gpu_data]
            
            # 轉換為相對時間
            start_time = min(timestamps)
            relative_times = [(t - start_time) / 60 for t in timestamps]  # 轉換為分鐘
            
            plt.plot(relative_times, utilization, label=f'{gpu_id.upper()}')
        
        plt.title('GPU memory utilization trend')
        plt.xlabel('Time (minutes)')
        plt.ylabel('Memory utilization (%)')
        plt.grid(True)
        plt.axhline(y=80, color='orange', linestyle='--', alpha=0.7, label='Warning (80%)')
        plt.axhline(y=90, color='red', linestyle='--', alpha=0.7, label='Critical (90%)')
        plt.legend()  # called after axhline so the threshold lines appear in the legend
        plt.tight_layout()
        plt.show()

# 初始化 GPU 記憶體優化器
gpu_optimizer = GPUMemoryOptimizer()
print("🎯 GPU 記憶體優化器已初始化")

# 顯示當前 GPU 狀態
gpu_info = gpu_optimizer.get_gpu_memory_info()
print("\n📊 當前 GPU 狀態:")
for gpu_id, info in gpu_info.items():
    if gpu_id != 'error':
        print(f"  {gpu_id.upper()}: {info['name']}")
        print(f"    總記憶體: {info['total_memory_gb']:.2f} GB")
        print(f"    已使用: {info['used_memory_gb']:.2f} GB ({info['utilization_percent']:.1f}%)")
        print(f"    可用: {info['free_memory_gb']:.2f} GB")
    else:
        print(f"  {info}")
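
The optimizer above derives `max_batch_size` from a fixed rule of thumb. A measurement-based alternative is sketched below: record GPU memory use at two batch sizes, fit a straight line, and solve for the largest batch that fits a memory budget. This assumes memory grows roughly linearly with batch size, which is an assumption to verify per model. The function name `estimate_max_batch_size` and the sample numbers are illustrative only.

```python
def estimate_max_batch_size(mem_small_gb: float, batch_small: int,
                            mem_large_gb: float, batch_large: int,
                            budget_gb: float, hard_cap: int = 32) -> int:
    """Estimate the largest batch size that fits in budget_gb.

    Assumes memory(batch) = fixed + per_sample * batch, fitted from two
    measurements taken at batch_small and batch_large.
    """
    if batch_large <= batch_small:
        raise ValueError("batch_large must be greater than batch_small")

    per_sample_gb = (mem_large_gb - mem_small_gb) / (batch_large - batch_small)
    if per_sample_gb <= 0:
        # Memory did not grow with batch size, so only the cap applies.
        return hard_cap

    fixed_gb = mem_small_gb - per_sample_gb * batch_small
    headroom_gb = budget_gb - fixed_gb
    # A result of 0 means even a single sample exceeds the budget.
    return min(hard_cap, max(0, int(headroom_gb // per_sample_gb)))


# Example: 2.5 GB used at batch 1 and 6.0 GB at batch 8 gives 0.5 GB per
# sample plus 2.0 GB fixed. A 12 GB budget (75% of a 16 GB card) then leaves
# 10 GB of headroom.
print(estimate_max_batch_size(2.5, 1, 6.0, 8, budget_gb=12.0))
```

The two measurements can come from `get_gpu_memory_info()` while the model serves requests at each batch size; taking them after a warm-up period avoids counting one-time allocations as per-sample cost.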

## 5. Multi-GPU Configuration and Load Balancing

In [None]:
class MultiGPUManager:
    """Multi-GPU 管理器"""
    
    def __init__(self, triton_url: str = "localhost:8000"):
        self.triton_url = triton_url
        self.client = httpclient.InferenceServerClient(url=triton_url)
        self.gpu_optimizer = GPUMemoryOptimizer()
        
    def analyze_gpu_topology(self) -> Dict:
        """分析 GPU 拓撲結構"""
        print("🔍 分析 GPU 拓撲結構...")
        
        if not GPU_AVAILABLE:
            return {'error': 'GPU 不可用'}
        
        try:
            device_count = pynvml.nvmlDeviceGetCount()
            topology = {
                'device_count': device_count,
                'devices': {},
                'nvlink_topology': {},
                'recommendations': []
            }
            
            for i in range(device_count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
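                # nvmlDeviceGetName returns bytes in older pynvml releases and str in newer ones
                name = pynvml.nvmlDeviceGetName(handle)
                if isinstance(name, bytes):
                    name = name.decode('utf-8')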
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                
                try:
                    # 嘗試獲取 GPU 拓撲信息
                    pci_info = pynvml.nvmlDeviceGetPciInfo(handle)
                    
                    topology['devices'][f'gpu_{i}'] = {
                        'name': name,
                        'memory_total_gb': mem_info.total / (1024**3),
                        'pci_bus': pci_info.bus,
                        'pci_device': pci_info.device,
                        'pci_domain': pci_info.domain
                    }
                except Exception as e:
                    topology['devices'][f'gpu_{i}'] = {
                        'name': name,
                        'memory_total_gb': mem_info.total / (1024**3),
                        'error': str(e)
                    }
            
            # NVLink connectivity analysis (placeholder: 'nvlink_topology' stays empty)
            if device_count > 1:
                try:
                    for i in range(device_count):
                        handle_i = pynvml.nvmlDeviceGetHandleByIndex(i)
                        for j in range(i + 1, device_count):
                            handle_j = pynvml.nvmlDeviceGetHandleByIndex(j)
                            # NVLink detection between GPU i and GPU j could be added here;
                            # it may need another method, e.g. `nvidia-smi topo -m` from the CLI
                            pass
                except Exception:
                    pass
            
            # 生成建議
            if device_count == 1:
                topology['recommendations'].append("單 GPU 配置，考慮增加 GPU 數量以提高性能")
            elif device_count > 1:
                topology['recommendations'].append(f"檢測到 {device_count} 個 GPU，可以配置模型並行")
                topology['recommendations'].append("建議為不同模型分配不同 GPU 以避免競爭")
            
            return topology
            
        except Exception as e:
            return {'error': f'分析 GPU 拓撲失敗: {e}'}
    
    def generate_multi_gpu_config(self, model_configs: List[Dict]) -> Dict:
        """生成 Multi-GPU 配置"""
        print("⚙️  生成 Multi-GPU 配置...")
        
        topology = self.analyze_gpu_topology()
        if 'error' in topology:
            return topology
        
        device_count = topology['device_count']
        if device_count < 2:
            return {'error': '需要至少 2 個 GPU 來配置 Multi-GPU'}
        
        multi_gpu_config = {
            'gpu_allocation': {},
            'load_balancing': {},
            'model_configs': {},
            'recommendations': []
        }
        
        # GPU allocation strategy: round-robin, so models share GPUs when they
        # outnumber the devices. 'gpu_allocation' always maps model name -> GPU ids
        num_models = len(model_configs)
        
        for i, model_config in enumerate(model_configs):
            model_name = model_config.get('name', f'model_{i}')
            gpu_id = i % device_count
            
            multi_gpu_config['gpu_allocation'][model_name] = [gpu_id]
            multi_gpu_config['model_configs'][model_name] = {
                'instance_group': [
                    {
                        'count': 1,
                        'kind': 'KIND_GPU',
                        'gpus': [gpu_id]
                    }
                ]
            }
        
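        # Load-balancing settings (advisory metadata for an external router or proxy;
        # these keys are not part of a Triton model configuration)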
        multi_gpu_config['load_balancing'] = {
            'strategy': 'round_robin',
            'health_check': {
                'enabled': True,
                'interval_ms': 5000
            },
            'failover': {
                'enabled': True,
                'retry_count': 3
            }
        }
        
        # 生成建議
        multi_gpu_config['recommendations'] = [
            f"已為 {num_models} 個模型配置 {device_count} 個 GPU",
            "建議監控各 GPU 的負載均衡情況",
            "可以根據模型使用頻率調整 GPU 分配",
            "考慮為高頻模型配置多個實例"
        ]
        
        if num_models > device_count:
            multi_gpu_config['recommendations'].append(
                "模型數量超過 GPU 數量，建議監控 GPU 使用率"
            )
        
        return multi_gpu_config
    
    def monitor_multi_gpu_performance(self, duration: int = 60) -> Dict:
        """監控 Multi-GPU 性能"""
        print(f"📊 監控 Multi-GPU 性能 ({duration} 秒)...")
        
        if not GPU_AVAILABLE:
            return {'error': 'GPU 不可用'}
        
        performance_data = {
            'timeline': [],
            'gpu_metrics': {},
            'load_balance_analysis': {},
            'recommendations': []
        }
        
        start_time = time.time()
        
        while time.time() - start_time < duration:
            timestamp = time.time()
            gpu_info = self.gpu_optimizer.get_gpu_memory_info()
            
            if 'error' not in gpu_info:
                frame_data = {'timestamp': timestamp}
                
                for gpu_id, info in gpu_info.items():
                    frame_data[gpu_id] = {
                        'memory_usage': info['utilization_percent'],
                        'memory_used_gb': info['used_memory_gb']
                    }
                    
                    # 累積統計
                    if gpu_id not in performance_data['gpu_metrics']:
                        performance_data['gpu_metrics'][gpu_id] = {
                            'usage_history': [],
                            'peak_usage': 0,
                            'avg_usage': 0
                        }
                    
                    performance_data['gpu_metrics'][gpu_id]['usage_history'].append(
                        info['utilization_percent']
                    )
                    performance_data['gpu_metrics'][gpu_id]['peak_usage'] = max(
                        performance_data['gpu_metrics'][gpu_id]['peak_usage'],
                        info['utilization_percent']
                    )
                
                performance_data['timeline'].append(frame_data)
            
            time.sleep(1)
        
        # 計算平均使用率和負載均衡分析
        for gpu_id, metrics in performance_data['gpu_metrics'].items():
            usage_history = metrics['usage_history']
            if usage_history:
                metrics['avg_usage'] = sum(usage_history) / len(usage_history)
        
        # 負載均衡分析
        avg_usages = [metrics['avg_usage'] for metrics in performance_data['gpu_metrics'].values()]
        if len(avg_usages) > 1:
            usage_variance = np.var(avg_usages)
            performance_data['load_balance_analysis'] = {
                'usage_variance': usage_variance,
                'max_usage_diff': max(avg_usages) - min(avg_usages),
                'is_balanced': usage_variance < 100  # 變異數閾值
            }
            
            if not performance_data['load_balance_analysis']['is_balanced']:
                performance_data['recommendations'].append(
                    "檢測到負載不均衡，建議重新分配模型或調整實例數量"
                )
        
        return performance_data
    
    def plot_multi_gpu_performance(self, performance_data: Dict, figsize: Tuple[int, int] = (15, 8)):
        """繪製 Multi-GPU 性能圖表"""
        if not performance_data.get('timeline'):
            print("⚠️  沒有性能數據可顯示")
            return
        
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        
        # 提取數據
        timeline = performance_data['timeline']
        timestamps = [frame['timestamp'] for frame in timeline]
        start_time = min(timestamps)
        relative_times = [(t - start_time) / 60 for t in timestamps]  # 轉換為分鐘
        
        # GPU memory utilization timeline
        gpu_ids = [key for key in timeline[0].keys() if key.startswith('gpu_')]
        
        for gpu_id in gpu_ids:
            usage_data = [frame[gpu_id]['memory_usage'] for frame in timeline]
            axes[0, 0].plot(relative_times, usage_data, label=gpu_id.upper())
        
        axes[0, 0].set_title('GPU memory utilization timeline')
        axes[0, 0].set_xlabel('Time (minutes)')
        axes[0, 0].set_ylabel('Utilization (%)')
        axes[0, 0].legend()
        axes[0, 0].grid(True)
        
        # Average memory utilization per GPU
        gpu_metrics = performance_data['gpu_metrics']
        gpu_names = list(gpu_metrics.keys())
        avg_usages = [gpu_metrics[gpu]['avg_usage'] for gpu in gpu_names]
        
        axes[0, 1].bar(range(len(gpu_names)), avg_usages)
        axes[0, 1].set_title('Average GPU memory utilization')
        axes[0, 1].set_xlabel('GPU')
        axes[0, 1].set_ylabel('Average utilization (%)')
        axes[0, 1].set_xticks(range(len(gpu_names)))
        axes[0, 1].set_xticklabels([gpu.upper() for gpu in gpu_names])
        axes[0, 1].grid(True, axis='y')
        
        # Peak memory utilization per GPU
        peak_usages = [gpu_metrics[gpu]['peak_usage'] for gpu in gpu_names]
        
        axes[1, 0].bar(range(len(gpu_names)), peak_usages, color='orange')
        axes[1, 0].set_title('Peak GPU memory utilization')
        axes[1, 0].set_xlabel('GPU')
        axes[1, 0].set_ylabel('Peak utilization (%)')
        axes[1, 0].set_xticks(range(len(gpu_names)))
        axes[1, 0].set_xticklabels([gpu.upper() for gpu in gpu_names])
        axes[1, 0].grid(True, axis='y')
        
        # Load-balance analysis (the dict is empty when only one GPU was monitored,
        # so test for content rather than for the key)
        balance_info = performance_data.get('load_balance_analysis')
        if balance_info:
            axes[1, 1].text(0.1, 0.8, f"Utilization variance: {balance_info['usage_variance']:.2f}", 
                          transform=axes[1, 1].transAxes, fontsize=12)
            axes[1, 1].text(0.1, 0.6, f"Max utilization gap: {balance_info['max_usage_diff']:.2f}%", 
                          transform=axes[1, 1].transAxes, fontsize=12)
            balance_status = "yes" if balance_info['is_balanced'] else "no"
            axes[1, 1].text(0.1, 0.4, f"Balanced: {balance_status}", 
                          transform=axes[1, 1].transAxes, fontsize=12, 
                          color='green' if balance_info['is_balanced'] else 'red')
            axes[1, 1].set_title('Load-balance analysis')
            axes[1, 1].set_xlim(0, 1)
            axes[1, 1].set_ylim(0, 1)
        axes[1, 1].axis('off')
        
        plt.tight_layout()
        plt.show()

# Initialize the Multi-GPU manager
multi_gpu_manager = MultiGPUManager()
print("🚀 Multi-GPU manager initialized")

# Analyze the GPU topology
topology = multi_gpu_manager.analyze_gpu_topology()
if 'error' not in topology:
    print("\n🔍 GPU topology analysis:")
    print(f"  Detected {topology['device_count']} GPU(s)")
    for gpu_id, info in topology['devices'].items():
        print(f"    {gpu_id.upper()}: {info['name']} ({info['memory_total_gb']:.1f} GB)")
else:
    print(f"GPU topology analysis failed: {topology['error']}")
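
The load-balance panel above reads `usage_variance`, `max_usage_diff`, and `is_balanced` from `performance_data['load_balance_analysis']`. The following is a minimal sketch of how such values could be derived from per-GPU average usage. The function name `summarize_load_balance` and the 20-percentage-point threshold are assumptions made for illustration; they are not taken from `MultiGPUManager`.

```python
from statistics import pvariance
from typing import Dict


def summarize_load_balance(avg_usage_by_gpu: Dict[str, float],
                           max_diff_threshold: float = 20.0) -> Dict:
    """Summarize how evenly load is spread across GPUs.

    avg_usage_by_gpu maps a GPU id (for example 'gpu_0') to its average
    usage in percent. max_diff_threshold is a hypothetical cutoff.
    """
    usages = list(avg_usage_by_gpu.values())
    if not usages:
        raise ValueError("at least one GPU is required")
    max_usage_diff = max(usages) - min(usages)
    return {
        'usage_variance': pvariance(usages),   # population variance
        'max_usage_diff': max_usage_diff,      # busiest minus idlest GPU
        'is_balanced': max_usage_diff <= max_diff_threshold,
    }


# Hypothetical averages for three GPUs
summary = summarize_load_balance({'gpu_0': 70.0, 'gpu_1': 50.0, 'gpu_2': 60.0})
print(summary)
```

A spread-based check like this is easy to read in a dashboard, but the right threshold depends on the workload, so treat the default as a placeholder.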

## 6. Comprehensive Performance Optimization in Practice

In [None]:
class ComprehensiveOptimizer:
    """Runs baseline, bottleneck, memory, and Multi-GPU analysis in one pass."""
    
    def __init__(self, triton_url: str = "localhost:8000"):
        self.triton_url = triton_url
        self.client = httpclient.InferenceServerClient(url=triton_url)
        self.monitor = PerformanceMonitor(triton_url)
        self.analyzer = BottleneckAnalyzer(triton_url)
        self.gpu_optimizer = GPUMemoryOptimizer()
        self.multi_gpu_manager = MultiGPUManager(triton_url)
    
    async def run_comprehensive_optimization(self, model_names: List[str], 
                                           optimization_duration: int = 300) -> Dict:
        """Run the full optimization analysis for the given models."""
        print("🚀 Starting comprehensive performance optimization analysis...")
        
        optimization_results = {
            'models_analyzed': model_names,
            'baseline_performance': {},
            'bottleneck_analysis': {},
            'memory_analysis': {},
            'multi_gpu_analysis': {},
            'optimization_recommendations': {},
            'projected_improvements': {}
        }
        
        # 1. Baseline performance test
        print("📊 1/6 Running baseline performance tests...")
        for model_name in model_names:
            try:
                baseline = await self._measure_baseline_performance(model_name)
                optimization_results['baseline_performance'][model_name] = baseline
                if 'error' in baseline:
                    # The helper reports failures in the result dict instead of raising
                    print(f"  {model_name} baseline test failed: {baseline['error']}")
                else:
                    print(f"  {model_name}: latency {baseline['avg_latency_ms']:.2f} ms, "
                          f"throughput {baseline['throughput_rps']:.2f} RPS")
            except Exception as e:
                print(f"  {model_name} baseline test failed: {e}")
        
        # 2. Bottleneck analysis
        print("🔍 2/6 Running bottleneck analysis...")
        for model_name in model_names:
            try:
                analysis = self.analyzer.analyze_model_performance(model_name)
                optimization_results['bottleneck_analysis'][model_name] = analysis
                if analysis.get('bottlenecks'):
                    print(f"  {model_name}: found {len(analysis['bottlenecks'])} bottleneck(s)")
            except Exception as e:
                print(f"  {model_name} bottleneck analysis failed: {e}")
        
        # 3. Memory analysis
        print("💾 3/6 Running memory analysis...")
        try:
            # Assumes the argument is a duration in seconds; capped at the
            # original 60 so that shorter runs can be requested by the caller.
            memory_analysis = self.gpu_optimizer.analyze_memory_usage_pattern(
                min(optimization_duration, 60)
            )
            optimization_results['memory_analysis'] = memory_analysis
            if memory_analysis.get('memory_spikes'):
                print(f"  Detected {len(memory_analysis['memory_spikes'])} memory spike(s)")
        except Exception as e:
            print(f"  Memory analysis failed: {e}")
        
        # 4. Multi-GPU analysis
        print("🔄 4/6 Running Multi-GPU analysis...")
        try:
            model_configs = [{'name': name} for name in model_names]
            multi_gpu_config = self.multi_gpu_manager.generate_multi_gpu_config(model_configs)
            optimization_results['multi_gpu_analysis'] = multi_gpu_config
            if 'error' not in multi_gpu_config:
                print(f"  Generated a Multi-GPU configuration covering {len(model_names)} model(s)")
        except Exception as e:
            print(f"  Multi-GPU analysis failed: {e}")
        
        # 5. Generate optimization recommendations
        print("💡 5/6 Generating optimization recommendations...")
        optimization_results['optimization_recommendations'] = \
            self._generate_comprehensive_recommendations(optimization_results)
        
        # 6. Project the expected improvements
        print("📈 6/6 Projecting improvements...")
        optimization_results['projected_improvements'] = \
            self._project_improvements(optimization_results)
        
        print("✅ Comprehensive performance optimization analysis complete")
        return optimization_results
    
    async def _measure_baseline_performance(self, model_name: str) -> Dict:
        """Measure baseline latency and throughput with sequential requests."""
        from tritonclient.utils import triton_to_np_dtype
        
        try:
            # Fetch the model metadata
            metadata = self.client.get_model_metadata(model_name)
            
            # Build random test inputs
            inputs = []
            outputs = []
            
            for inp in metadata['inputs']:
                input_name = inp['name']
                input_dtype = inp['datatype']
                # The metadata shape already contains the batch dimension for
                # batching models, and variable-size dimensions appear as -1,
                # so substitute 1 for them instead of prepending a batch axis.
                input_shape = [int(d) if int(d) > 0 else 1 for d in inp['shape']]
                
                # Generate data in the dtype the model declares, so the payload
                # matches the datatype passed to InferInput.
                np_dtype = triton_to_np_dtype(input_dtype)
                if np_dtype is None or input_dtype == 'BYTES':
                    raise ValueError(f"Cannot generate random data for datatype {input_dtype}")
                if np.issubdtype(np_dtype, np.floating):
                    test_data = np.random.random(input_shape).astype(np_dtype)
                else:
                    test_data = np.random.randint(0, 100, input_shape).astype(np_dtype)
                
                inputs.append(httpclient.InferInput(input_name, test_data.shape, input_dtype))
                inputs[-1].set_data_from_numpy(test_data)
            
            for out in metadata['outputs']:
                outputs.append(httpclient.InferRequestedOutput(out['name']))
            
            # Warm up
            for _ in range(10):
                self.client.infer(model_name, inputs, outputs=outputs)
            
            # Measure sequential request latency with a monotonic clock
            num_requests = 100
            latencies = []
            
            for _ in range(num_requests):
                start_time = time.perf_counter()
                self.client.infer(model_name, inputs, outputs=outputs)
                latencies.append((time.perf_counter() - start_time) * 1000)  # milliseconds
            
            # Compute statistics
            avg_latency = sum(latencies) / len(latencies)
            p95_latency = float(np.percentile(latencies, 95))
            p99_latency = float(np.percentile(latencies, 99))
            # Single-client, sequential throughput in requests per second
            throughput = num_requests / (sum(latencies) / 1000)
            
            return {
                'avg_latency_ms': avg_latency,
                'p95_latency_ms': p95_latency,
                'p99_latency_ms': p99_latency,
                'throughput_rps': throughput,
                'min_latency_ms': min(latencies),
                'max_latency_ms': max(latencies)
            }
            
        except Exception as e:
            return {'error': str(e)}
    
    def _generate_comprehensive_recommendations(self, results: Dict) -> Dict:
        """Turn the analysis results into prioritized recommendations."""
        recommendations = {
            'high_priority': [],
            'medium_priority': [],
            'low_priority': [],
            'implementation_order': []
        }
        
        # Map each detected bottleneck to a recommendation.
        # The Chinese keywords below must match the bottleneck strings
        # produced by BottleneckAnalyzer, so they are left untranslated.
        for model_name, analysis in results.get('bottleneck_analysis', {}).items():
            if analysis.get('bottlenecks'):
                for bottleneck in analysis['bottlenecks']:
                    if '動態批次' in bottleneck:  # dynamic batching
                        recommendations['high_priority'].append(
                            f"{model_name}: enable and tune dynamic batching"
                        )
                    elif '實例' in bottleneck:  # instance count
                        recommendations['medium_priority'].append(
                            f"{model_name}: increase the number of model instances"
                        )
                    elif 'CPU' in bottleneck:
                        recommendations['high_priority'].append(
                            f"{model_name}: move inference from CPU to GPU"
                        )
        
        # Memory recommendations
        memory_analysis = results.get('memory_analysis', {})
        if memory_analysis.get('memory_spikes'):
            recommendations['high_priority'].append(
                "Optimize GPU memory usage to reduce memory spikes"
            )
        
        if memory_analysis.get('recommendations'):
            recommendations['medium_priority'].extend(
                memory_analysis['recommendations']
            )
        
        # Multi-GPU recommendations
        multi_gpu_analysis = results.get('multi_gpu_analysis', {})
        if 'error' not in multi_gpu_analysis and multi_gpu_analysis.get('recommendations'):
            recommendations['low_priority'].extend(
                multi_gpu_analysis['recommendations']
            )
        
        # Suggested implementation order
        recommendations['implementation_order'] = [
            "1. Enable and tune dynamic batching",
            "2. Optimize GPU memory configuration",
            "3. Adjust the number of model instances",
            "4. Configure Multi-GPU load balancing",
            "5. Set up performance monitoring and alerting"
        ]
        
        return recommendations
    
    def _project_improvements(self, results: Dict) -> Dict:
        """Project improvements using fixed heuristic multipliers.

        The multipliers are rough rules of thumb, not measurements; verify
        any projection against a benchmark of the tuned configuration.
        """
        improvements = {
            'latency_improvement': {},
            'throughput_improvement': {},
            'memory_efficiency': {},
            'overall_score': {}
        }
        
        for model_name, baseline in results.get('baseline_performance', {}).items():
            if 'error' in baseline:
                continue
            
            # Project improvements from the detected bottlenecks
            bottlenecks = results.get('bottleneck_analysis', {}).get(model_name, {}).get('bottlenecks', [])
            
            latency_improvement = 1.0  # 1.0 means no change from the baseline
            throughput_improvement = 1.0
            
            # Dynamic batching (keyword must match BottleneckAnalyzer output)
            if any('動態批次' in b for b in bottlenecks):
                latency_improvement *= 0.8  # assumed 20% lower latency
                throughput_improvement *= 3.0  # assumed 3x throughput
            
            # Instance count
            if any('實例' in b for b in bottlenecks):
                throughput_improvement *= 2.0  # assumed 2x throughput
            
            # CPU-to-GPU migration
            if any('CPU' in b for b in bottlenecks):
                latency_improvement *= 0.3  # assumed 70% lower latency
                throughput_improvement *= 5.0  # assumed 5x throughput
            
            improvements['latency_improvement'][model_name] = {
                'current_ms': baseline.get('avg_latency_ms', 0),
                'projected_ms': baseline.get('avg_latency_ms', 0) * latency_improvement,
                'improvement_percent': (1 - latency_improvement) * 100
            }
            
            improvements['throughput_improvement'][model_name] = {
                'current_rps': baseline.get('throughput_rps', 0),
                'projected_rps': baseline.get('throughput_rps', 0) * throughput_improvement,
                'improvement_percent': (throughput_improvement - 1) * 100
            }
            
            # Overall score: 20 when nothing is projected to change, capped at 100
            score = (throughput_improvement + (2 - latency_improvement)) / 2
            improvements['overall_score'][model_name] = min(score * 20, 100)
        
        return improvements
    
    def generate_optimization_report(self, results: Dict) -> str:
        """Render the analysis results as a Markdown report."""
        
        def fmt(value) -> str:
            # A missing metric would make the ':.2f' format spec fail on 'N/A'
            return f"{value:.2f}" if isinstance(value, (int, float)) else "N/A"
        
        report = f"""
# 🚀 Comprehensive Performance Optimization Report

## 📊 Analysis Overview
- Models analyzed: {len(results['models_analyzed'])}
- Analysis time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
- Model list: {', '.join(results['models_analyzed'])}

## 📈 Baseline Performance
"""
        
        for model_name, baseline in results.get('baseline_performance', {}).items():
            if 'error' not in baseline:
                report += f"""
### {model_name}
- Average latency: {fmt(baseline.get('avg_latency_ms'))} ms
- P95 latency: {fmt(baseline.get('p95_latency_ms'))} ms
- Throughput: {fmt(baseline.get('throughput_rps'))} RPS
"""
        
        report += "\n## 🔍 Bottleneck Analysis\n"
        for model_name, analysis in results.get('bottleneck_analysis', {}).items():
            if analysis.get('bottlenecks'):
                report += f"\n### {model_name}\n"
                for bottleneck in analysis['bottlenecks']:
                    report += f"- ⚠️  {bottleneck}\n"
        
        report += "\n## 💡 Optimization Recommendations\n"
        recommendations = results.get('optimization_recommendations', {})
        
        if recommendations.get('high_priority'):
            report += "\n### 🔴 High Priority\n"
            for rec in recommendations['high_priority']:
                report += f"- {rec}\n"
        
        if recommendations.get('medium_priority'):
            report += "\n### 🟡 Medium Priority\n"
            for rec in recommendations['medium_priority']:
                report += f"- {rec}\n"
        
        # Low-priority items are collected above, so include them as well
        if recommendations.get('low_priority'):
            report += "\n### 🟢 Low Priority\n"
            for rec in recommendations['low_priority']:
                report += f"- {rec}\n"
        
        if recommendations.get('implementation_order'):
            report += "\n### 📋 Implementation Order\n"
            for order in recommendations['implementation_order']:
                report += f"{order}\n"
        
        report += "\n## 📊 Projected Improvements\n"
        improvements = results.get('projected_improvements', {})
        
        for model_name in results['models_analyzed']:
            if model_name in improvements.get('latency_improvement', {}):
                lat_imp = improvements['latency_improvement'][model_name]
                thr_imp = improvements['throughput_improvement'][model_name]
                score = improvements['overall_score'].get(model_name, 0)
                
                report += f"""
### {model_name}
- Latency improvement: {lat_imp['improvement_percent']:.1f}% ({lat_imp['current_ms']:.2f}ms → {lat_imp['projected_ms']:.2f}ms)
- Throughput improvement: {thr_imp['improvement_percent']:.1f}% ({thr_imp['current_rps']:.2f} → {thr_imp['projected_rps']:.2f} RPS)
- Overall score: {score:.1f}/100
"""
        
        return report

# Initialize the comprehensive optimizer
comprehensive_optimizer = ComprehensiveOptimizer()
print("🎯 Comprehensive performance optimizer initialized")
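
The baseline step above times sequential requests and derives average latency, tail percentiles, and throughput from them. The sketch below shows the same measurement pattern in isolation, so it can be tried without a running server. `measure_latency`, `percentile`, and `fake_infer` are hypothetical names introduced here; the percentile uses the nearest-rank method, which may differ slightly from `np.percentile`'s default interpolation.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure_latency(request_fn: Callable[[], object],
                    num_requests: int = 100,
                    warmup: int = 10) -> Dict[str, float]:
    """Time sequential calls to request_fn and summarize them."""
    for _ in range(warmup):          # warm-up calls are not recorded
        request_fn()
    latencies_ms = []
    wall_start = time.perf_counter()  # monotonic clock, safe for intervals
    for _ in range(num_requests):
        start = time.perf_counter()
        request_fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    wall_seconds = time.perf_counter() - wall_start
    ordered = sorted(latencies_ms)
    return {
        'avg_latency_ms': sum(ordered) / len(ordered),
        'p95_latency_ms': percentile(ordered, 95),
        'p99_latency_ms': percentile(ordered, 99),
        'min_latency_ms': ordered[0],
        'max_latency_ms': ordered[-1],
        'throughput_rps': num_requests / wall_seconds,
    }


def fake_infer() -> None:
    """Stand-in for client.infer(...), used only for this illustration."""
    time.sleep(0.001)


stats = measure_latency(fake_infer, num_requests=20, warmup=2)
print(stats)
```

Because requests are issued one at a time, the throughput here is a single-client figure. Measuring server capacity requires concurrent load.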

## 7. Hands-On Demo: The Complete Optimization Workflow

In [None]:
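# Configure the models to analyze
models_to_analyze = [
    "text_classification",  # replace with your actual model name
    "sentiment_analysis",   # replace with your actual model name
]

print("🚀 Starting the complete performance optimization demo")
print(f"Models to analyze: {', '.join(models_to_analyze)}")
print("\nNote: make sure these models are loaded in the Triton server")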

In [None]:
# Check model availability
available_models = []

try:
    client = httpclient.InferenceServerClient(url="localhost:8000")
    # Fetching the server metadata doubles as a connectivity check
    server_metadata = client.get_server_metadata()
    print("🔍 Checking available models...")
    
    for model_name in models_to_analyze:
        try:
            model_metadata = client.get_model_metadata(model_name)
            available_models.append(model_name)
            print(f"  ✅ {model_name} - available")
        except Exception as e:
            print(f"  ❌ {model_name} - unavailable: {e}")
    
    if not available_models:
        print("\n⚠️  No models are available for analysis")
        print("Make sure the Triton server is running and the models are loaded")
    else:
        print(f"\n✅ Found {len(available_models)} available model(s)")
        
except Exception as e:
    print(f"❌ Could not connect to the Triton server: {e}")
    print("Make sure the Triton server is running on localhost:8000")
    available_models = []
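
The availability check above infers readiness from a successful metadata call. As an alternative sketch, the tritonclient `InferenceServerClient` classes also expose `is_server_live()` and `is_model_ready(model_name)`, which ask the server directly. The helper name `filter_ready_models` and the `FakeClient` stand-in are hypothetical and exist only so the example runs without a server.

```python
from typing import Iterable, List


def filter_ready_models(client, model_names: Iterable[str]) -> List[str]:
    """Return the subset of model_names that the server reports as ready.

    client must provide is_server_live() and is_model_ready(name).
    """
    if not client.is_server_live():
        return []
    ready = []
    for name in model_names:
        try:
            if client.is_model_ready(name):
                ready.append(name)
        except Exception:
            # Treat a client error for one model as "not ready" and move on
            continue
    return ready


class FakeClient:
    """Stand-in used only to exercise the helper without a running server."""

    def __init__(self, ready_models):
        self._ready = set(ready_models)

    def is_server_live(self) -> bool:
        return True

    def is_model_ready(self, name: str) -> bool:
        return name in self._ready


ready_models = filter_ready_models(
    FakeClient({"text_classification"}),
    ["text_classification", "sentiment_analysis"],
)
print(ready_models)
```

With a real server, pass the `httpclient.InferenceServerClient` instance created above in place of `FakeClient`.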

In [None]:
# Run the comprehensive optimization analysis (if any model is available)
if available_models:
    print("🎯 Starting comprehensive optimization analysis...")
    
    # This is a coroutine; Jupyter supports top-level await
    optimization_results = await comprehensive_optimizer.run_comprehensive_optimization(
        model_names=available_models,
        optimization_duration=60  # shortened for the demo
    )
    
    # Generate the detailed report
    detailed_report = comprehensive_optimizer.generate_optimization_report(optimization_results)
    print(detailed_report)
    
else:
    print("⚠️  Skipping comprehensive analysis - no models available")
    
    # Demonstrate the other features instead
    print("\n🔧 Demonstrating other optimization features...")
    
    # GPU memory information
    gpu_info = gpu_optimizer.get_gpu_memory_info()
    print("\n📊 GPU memory status:")
    if 'error' not in gpu_info:
        for gpu_id, info in gpu_info.items():
            print(f"  {gpu_id}: {info['utilization_percent']:.1f}% utilization")
    else:
        print(f"  {gpu_info['error']}")
    
    # Memory optimization configuration
    memory_config = gpu_optimizer.generate_memory_optimization_config()
    print("\n⚙️  Memory optimization configuration generated")
    
    # Multi-GPU topology
    topology = multi_gpu_manager.analyze_gpu_topology()
    if 'error' not in topology:
        print(f"\n🔄 Detected {topology['device_count']} GPU(s)")
    else:
        print(f"\n🔄 GPU topology: {topology['error']}")
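
The projected figures in the report are products of fixed multipliers, one pair per detected bottleneck class. The worked example below repeats that arithmetic outside the class. The multipliers are copied from `_project_improvements`; the English keys, the function name `project`, and the baseline numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# (latency multiplier, throughput multiplier) per bottleneck class.
# Values mirror _project_improvements; they are heuristics, not measurements.
MULTIPLIERS: Dict[str, Tuple[float, float]] = {
    'dynamic_batching': (0.8, 3.0),
    'instances': (1.0, 2.0),
    'cpu': (0.3, 5.0),
}


def project(baseline_latency_ms: float, baseline_rps: float,
            bottlenecks: List[str]) -> Dict[str, float]:
    """Apply the multipliers for each bottleneck and compute the score."""
    latency_factor, throughput_factor = 1.0, 1.0
    for name in bottlenecks:
        lat, thr = MULTIPLIERS[name]
        latency_factor *= lat
        throughput_factor *= thr
    score = min((throughput_factor + (2 - latency_factor)) / 2 * 20, 100)
    return {
        'projected_ms': baseline_latency_ms * latency_factor,
        'projected_rps': baseline_rps * throughput_factor,
        'overall_score': score,
    }


# Hypothetical baseline: 50 ms average latency at 200 RPS
result = project(50.0, 200.0, ['dynamic_batching', 'instances'])
print(result)
no_change = project(50.0, 200.0, [])
print(no_change)
```

With both bottlenecks present, the factors are 0.8 for latency and 3.0 × 2.0 = 6.0 for throughput, which gives 40 ms, 1200 RPS, and a score of (6.0 + 1.2) / 2 × 20 = 72. A model with no detected bottlenecks scores 20, so the score measures projected headroom rather than current quality.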

## 8. Performance Monitoring Dashboard

In [None]:
# Start real-time performance monitoring (demo version)
if available_models:
    print("📊 Starting real-time performance monitoring...")
    
    # Start monitoring as a background task
    monitoring_task = asyncio.create_task(
        monitor.monitor_performance(duration=30, interval=1.0)
    )
    
    # Wait for monitoring to finish
    await monitoring_task
    
    # Show the monitoring results
    if monitor.metrics_history:
        print(f"\n📈 Collected {len(monitor.metrics_history)} monitoring data point(s)")
        monitor.plot_metrics()
    else:
        print("\n⚠️  No monitoring data was collected")
        
else:
    print("⚠️  Skipping real-time monitoring - no models available")
    
    # Demonstrate static GPU monitoring
    print("\n📊 Demonstrating static GPU monitoring...")
    
    # Simulated monitoring data
    import random
    
    mock_history = []
    for i in range(30):
        mock_history.append({
            'timestamp': time.time() + i,
            'gpu_id': 'gpu_0',
            'used_memory_gb': 4.5 + random.uniform(-0.5, 0.5),
            'utilization_percent': 60 + random.uniform(-20, 20)
        })
    
    gpu_optimizer.plot_memory_usage(mock_history)
    print("The plot above shows simulated GPU memory usage")
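
The monitoring cell relies on a coroutine that samples metrics at a fixed interval for a fixed duration. A minimal sketch of such a loop follows. `sample_metrics` and `fake_sampler` are hypothetical names, and this is not the implementation of `PerformanceMonitor.monitor_performance`.

```python
import asyncio
import time
from typing import Callable, Dict, List


async def sample_metrics(sampler: Callable[[], Dict],
                         duration: float,
                         interval: float) -> List[Dict]:
    """Call sampler roughly every `interval` seconds for `duration` seconds."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    num_samples = max(1, round(duration / interval))
    history: List[Dict] = []
    for _ in range(num_samples):
        sample = dict(sampler())           # copy so the caller's dict is untouched
        sample['timestamp'] = time.time()
        history.append(sample)
        # Sleeping with await yields to the event loop between samples
        await asyncio.sleep(interval)
    return history


def fake_sampler() -> Dict:
    """Stand-in for a real system or GPU metrics reader."""
    return {'cpu_usage': 12.5}


# In a script; inside a notebook cell use `await sample_metrics(...)` instead
history = asyncio.run(sample_metrics(fake_sampler, duration=0.05, interval=0.01))
print(len(history))
```

Using `asyncio.sleep` rather than `time.sleep` keeps the notebook's event loop free, so inference requests can be issued while sampling runs.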

## 9. Optimized Configuration Generator

In [None]:
def generate_optimized_model_config(model_name: str, optimization_level: str = "balanced") -> Dict:
    """Build a model configuration dict for the given optimization level.

    Levels: "latency", "throughput", "memory"; any other value keeps the
    balanced defaults.
    """
    
    base_config = {
        "name": model_name,
        "platform": "tensorrt_plan",  # or another platform that fits your model
        "max_batch_size": 32,
        "input": [],
        "output": [],
        "dynamic_batching": {
            "preferred_batch_size": [1, 2, 4, 8, 16],
            "max_queue_delay_microseconds": 1000
        },
        "instance_group": [
            {
                "count": 1,
                "kind": "KIND_GPU",
                "gpus": [0]
            }
        ],
        "optimization": {
            "graph": {
                "level": 1
            }
        }
    }
    
    # Adjust the configuration for the requested level
    if optimization_level == "latency":
        # Favor latency: small batches, short queue delay
        base_config["max_batch_size"] = 4
        base_config["dynamic_batching"]["preferred_batch_size"] = [1, 2]
        base_config["dynamic_batching"]["max_queue_delay_microseconds"] = 100
        base_config["instance_group"][0]["count"] = 2
        
    elif optimization_level == "throughput":
        # Favor throughput: large batches, longer queue delay
        base_config["max_batch_size"] = 64
        base_config["dynamic_batching"]["preferred_batch_size"] = [8, 16, 32, 64]
        base_config["dynamic_batching"]["max_queue_delay_microseconds"] = 5000
        base_config["instance_group"][0]["count"] = 1
        
    elif optimization_level == "memory":
        # Favor memory: smaller batches reduce peak activation memory.
        # Memory pool sizes are not model-config fields; set them when starting
        # the server, e.g. --cuda-memory-pool-byte-size and
        # --pinned-memory-pool-byte-size.
        base_config["max_batch_size"] = 16
        base_config["dynamic_batching"]["preferred_batch_size"] = [1, 2, 4, 8]
        
    return base_config

# Generate example configurations for each optimization level
print("⚙️  Generating example optimization configurations...")

optimization_levels = ["latency", "throughput", "memory", "balanced"]

for level in optimization_levels:
    config = generate_optimized_model_config("example_model", level)
    print(f"\n📋 {level.upper()} configuration:")
    print(f"  Max batch size: {config['max_batch_size']}")
    print(f"  Preferred batch sizes: {config['dynamic_batching']['preferred_batch_size']}")
    print(f"  Max queue delay: {config['dynamic_batching']['max_queue_delay_microseconds']} μs")
    print(f"  Instance count: {config['instance_group'][0]['count']}")

print("\n💾 Configuration dicts generated; convert them to protobuf text format before saving as config.pbtxt")

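## 10. Summary and Best Practices

### 🎯 Performance Optimization Summary

This lab covered performance tuning techniques for Triton Inference Server:

#### ✅ Optimization Work Completed

1. **Bottleneck analysis and diagnosis**
   - Automated identification of model performance bottlenecks
   - Tuning of batch size and concurrency level
   - Detection of configuration problems, with recommendations

2. **GPU memory management**
   - Real-time monitoring of memory usage
   - Analysis of memory usage patterns
   - Automated generation of memory configurations

3. **Multi-GPU configuration**
   - Analysis of the GPU topology
   - Load-balancing configuration across GPUs
   - Cross-GPU performance monitoring

4. **Comprehensive performance optimization**
   - End-to-end optimization workflow
   - Projection of expected improvements
   - Generation of a detailed optimization report

#### 🏆 Best Practices

1. **Configuration tuning**
   - Enable dynamic batching
   - Choose preferred batch sizes that match your traffic
   - Adjust the instance count to the hardware

2. **Memory management**
   - Monitor GPU memory utilization
   - Set memory pool sizes that fit the workload
   - Avoid memory fragmentation

3. **Multi-GPU deployment**
   - Distribute models sensibly across GPUs
   - Implement load balancing and failover
   - Monitor the load distribution across GPUs

4. **Continuous optimization**
   - Build a performance monitoring system
   - Run performance evaluations regularly
   - Adjust the configuration as business requirements change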
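
One way to put the configuration guidance above into practice is to benchmark several batch sizes and keep the one with the highest throughput that still meets a latency budget. The sketch below assumes you already have such measurements. `Measurement`, `pick_batch_size`, and all numbers are hypothetical and shown only for illustration.

```python
from typing import List, NamedTuple, Optional


class Measurement(NamedTuple):
    batch_size: int
    p95_latency_ms: float
    throughput_rps: float


def pick_batch_size(measurements: List[Measurement],
                    latency_budget_ms: float) -> Optional[Measurement]:
    """Return the highest-throughput measurement within the latency budget."""
    candidates = [m for m in measurements if m.p95_latency_ms <= latency_budget_ms]
    if not candidates:
        return None  # no batch size satisfies the budget
    return max(candidates, key=lambda m: m.throughput_rps)


# Hypothetical sweep results, for illustration only
sweep = [
    Measurement(1, 5.0, 200.0),
    Measurement(4, 9.0, 440.0),
    Measurement(8, 14.0, 570.0),
    Measurement(16, 26.0, 610.0),
]

best = pick_batch_size(sweep, latency_budget_ms=15.0)
print(best)
```

In this made-up sweep, batch size 16 has the highest throughput but exceeds the 15 ms budget, so batch size 8 is selected.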

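#### 🚀 Suggested Next Steps

1. **Production deployment**
   - Apply the optimized configurations from this lab
   - Set up performance monitoring and alerting
   - Automate the optimization workflow

2. **Advanced features to explore**
   - Model ensembles and pipelines
   - Custom backend development
   - Distributed inference architectures

3. **Continued learning**
   - Follow new Triton feature releases
   - Take part in community discussions and contribute
   - Explore new optimization techniques

### 📚 Related Resources

- [Triton Performance Guide](https://github.com/triton-inference-server/tutorials)
- [GPU Memory Optimization](https://developer.nvidia.com/blog/cuda-memory-optimization/)
- [Multi-GPU Best Practices](https://docs.nvidia.com/deeplearning/frameworks/user-guide/)

🎉 **Congratulations! You have worked through the performance tuning techniques for Triton Inference Server.**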