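# PyTorch Backend: Advanced Configuration and Optimization

## 🎯 Learning Objectives

This lab takes a closer look at advanced configuration and optimization techniques for the Triton PyTorch Backend, with the goal of getting the most out of PyTorch's flexibility and performance inside Triton.

### Key Topics
- ✅ Advanced PyTorch Backend configuration
- ✅ Dynamic shape handling
- ✅ Memory pool management and optimization
- ✅ Custom operator integration
- ✅ Batching optimization
- ✅ Model hot-reload mechanism

### Architecture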
```
Triton Server
├── PyTorch Backend Engine
│   ├── Dynamic Shape Manager
│   ├── Memory Pool Allocator  
│   ├── Custom Operator Registry
│   └── Batch Scheduler
├── Model Repository
│   ├── config.pbtxt (Advanced)
│   ├── model.py (Custom Logic)
│   └── version_policy.json
└── Performance Monitor
```

## 1. Environment Setup and Verification

In [None]:
import os
import sys
import json
import time
import torch
import numpy as np
import requests
import tritonclient.http as httpclient
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Any, Optional
import logging
from dataclasses import dataclass
import psutil
import GPUtil

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("🔧 Environment check")
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name()}")
    print(f"CUDA version: {torch.version.cuda}")

In [None]:
# Set up working directories
BASE_DIR = Path.cwd()
MODEL_REPO = BASE_DIR / "model_repository_advanced"
CONFIGS_DIR = BASE_DIR / "configs"
SCRIPTS_DIR = BASE_DIR / "scripts"

# Create the required directories
for dir_path in [MODEL_REPO, CONFIGS_DIR, SCRIPTS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"📁 Working directory: {BASE_DIR}")
print(f"📁 Model repository: {MODEL_REPO}")
print(f"📁 Config directory: {CONFIGS_DIR}")
print(f"📁 Scripts directory: {SCRIPTS_DIR}")

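## 2. Dynamic Shape Handling: Architecture and Design

### Core Concepts
Dynamic shapes are a key requirement in modern deep learning applications, particularly when handling variable-length sequences or images of different sizes.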
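
To make the variable-length case concrete, here is a minimal sketch of how sequences of different lengths can be right-padded into one rectangular batch together with an attention mask. The helper name `pad_batch` and the choice of `0` as the padding id are assumptions made for illustration; they are not part of Triton or PyTorch.

```python
from typing import List, Tuple

def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id sequences to the longest length in the batch.

    Returns (padded_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

ids, mask = pad_batch([[5, 8, 2], [7], [4, 9]])
print(ids)   # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The padded length changes from batch to batch, which is why the model input needs a variable dimension rather than a fixed one.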

In [None]:
@dataclass
class DynamicShapeConfig:
    """Shape bounds for one dynamically shaped input (batch dimension excluded)."""
    input_name: str
    min_shape: List[int]
    opt_shape: List[int]
    max_shape: List[int]
    data_type: str = "TYPE_FP32"

class AdvancedModelConfig:
    """Generator for an advanced config.pbtxt."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.backend = "pytorch"
        # dynamic_batching requires max_batch_size > 0 (0 disables batching),
        # and no preferred_batch_size may exceed it.
        self.max_batch_size = 16
        self.dynamic_configs = []

    def add_dynamic_input(self, config: DynamicShapeConfig):
        """Register a dynamically shaped input."""
        self.dynamic_configs.append(config)

    def generate_config(self) -> str:
        """Render the advanced config.pbtxt as text."""
        config_lines = [
            f'name: "{self.model_name}"',
            f'backend: "{self.backend}"',
            f'max_batch_size: {self.max_batch_size}',
            '',
            '# Dynamically shaped inputs'
        ]

        # One input block per registered dynamic input
        for config in self.dynamic_configs:
            config_lines.extend([
                'input {',
                f'  name: "{config.input_name}"',
                f'  data_type: {config.data_type}',
                '  dims: [ -1 ]  # variable-length dimension',
                '}',
                ''
            ])

        # The CUDA graph spec targets the first dynamic input, or placeholder values if none exist
        if self.dynamic_configs:
            graph_input, graph_dim = self.dynamic_configs[0].input_name, 16
        else:
            graph_input, graph_dim = "input", 224

        # Output, instance group, batching, and optimization settings
        config_lines.extend([
            'output {',
            '  name: "output"',
            '  data_type: TYPE_FP32',
            '  dims: [ -1 ]',
            '}',
            '',
            '# Instance group',
            'instance_group {',
            '  count: 2',
            '  kind: KIND_GPU',
            '}',
            '',
            '# Dynamic batching',
            'dynamic_batching {',
            '  preferred_batch_size: [ 4, 8, 16 ]',
            '  max_queue_delay_microseconds: 100',
            '  preserve_ordering: true',
            '}',
            '',
            '# Optimization (model_config.proto notes that CUDA graphs are',
            '# currently recognized only by the TensorRT backend)',
            'optimization {',
            '  cuda {',
            '    graphs: true',
            '    graph_spec {',
            '      batch_size: 1',
            '      input {',
            f'        key: "{graph_input}"',
            '        value {',
            f'          dim: [ {graph_dim} ]',
            '        }',
            '      }',
            '    }',
            '  }',
            '}'
        ])

        return '\n'.join(config_lines)

# Example: dynamic configuration for a text classification model
text_config = DynamicShapeConfig(
    input_name="input_ids",
    min_shape=[1],      # minimum sequence length
    opt_shape=[128],    # typical (optimal) sequence length
    max_shape=[512],    # maximum sequence length
    data_type="TYPE_INT64"
)

model_config = AdvancedModelConfig("text_classifier_dynamic")
model_config.add_dynamic_input(text_config)

print("🔧 Dynamic shape config generator created")
print(f"📝 Registered dynamic inputs: {len(model_config.dynamic_configs)}")
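
The `min_shape` and `max_shape` fields are recorded but never written into the generated config, since `dims: [ -1 ]` accepts any length. One way to put them to use is a check on the client or inside the model before inference. The sketch below assumes a hypothetical `ShapeBounds` stand-in and a hypothetical helper `shape_in_bounds`; neither is a Triton API.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class ShapeBounds:
    """Hypothetical stand-in for the min/max fields of DynamicShapeConfig."""
    min_shape: List[int]
    max_shape: List[int]

def shape_in_bounds(shape: Sequence[int], bounds: ShapeBounds) -> bool:
    """True if `shape` has the expected rank and every dim lies in [min, max]."""
    if len(shape) != len(bounds.min_shape) or len(shape) != len(bounds.max_shape):
        return False
    return all(lo <= d <= hi
               for d, lo, hi in zip(shape, bounds.min_shape, bounds.max_shape))

bounds = ShapeBounds(min_shape=[1], max_shape=[512])
print(shape_in_bounds([128], bounds))     # True
print(shape_in_bounds([1024], bounds))    # False (too long)
print(shape_in_bounds([8, 128], bounds))  # False (rank mismatch)
```

Rejecting out-of-range requests early keeps oversized inputs from reaching the GPU, where they would otherwise inflate peak memory.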

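## 3. Memory Pool Management and Optimization

### Memory Optimization Strategies
- **Pre-allocated pools**: avoid the overhead of dynamic allocation
- **Segmented management**: manage tensors of different sizes separately
- **Reclamation policy**: reclaim memory that is no longer in use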
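
The segmented-management idea can be sketched without a GPU. The toy pool below buckets buffers by size rounded up to a power of two and reuses released buffers from the matching bucket. `SizeClassPool` is a hypothetical illustration built on `bytearray`; it is not how PyTorch's caching allocator is implemented.

```python
from collections import defaultdict
from typing import Dict, List

class SizeClassPool:
    """Toy pool: buffers are bucketed by size rounded up to a power of two."""

    def __init__(self) -> None:
        self._free: Dict[int, List[bytearray]] = defaultdict(list)
        self.reuse_count = 0

    @staticmethod
    def size_class(nbytes: int) -> int:
        """Round nbytes up to the next power of two (minimum 1)."""
        return 1 << max(nbytes - 1, 0).bit_length()

    def acquire(self, nbytes: int) -> bytearray:
        """Reuse a free buffer from the matching bucket, or allocate a new one."""
        bucket = self._free[self.size_class(nbytes)]
        if bucket:
            self.reuse_count += 1
            return bucket.pop()
        return bytearray(self.size_class(nbytes))

    def release(self, buf: bytearray) -> None:
        """Return a buffer to the bucket that matches its size."""
        self._free[len(buf)].append(buf)

pool = SizeClassPool()
a = pool.acquire(1000)   # allocates a 1024-byte buffer
pool.release(a)
b = pool.acquire(900)    # same size class, so the buffer is reused
print(len(b), pool.reuse_count, a is b)  # 1024 1 True
```

Rounding to size classes trades some wasted space for a much higher chance that a released buffer fits the next request.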

In [None]:
class MemoryPoolManager:
    """Memory pool manager (bookkeeping on top of PyTorch's caching allocator)."""

    def __init__(self):
        self.pools = {}
        self.usage_stats = {}
        self.peak_usage = 0

    def create_pool(self, name: str, size_mb: int):
        """Create a memory pool."""
        if torch.cuda.is_available():
            # Warm up the caching allocator: allocate the block once, then drop the
            # reference. The freed block stays cached for later allocations.
            # torch.cuda.empty_cache() is deliberately not called here, because it
            # would return the block to the driver and undo the pre-allocation.
            pool_size = size_mb * 1024 * 1024
            dummy_tensor = torch.empty(pool_size // 4, dtype=torch.float32, device='cuda')
            del dummy_tensor

        self.pools[name] = {
            'size_mb': size_mb,
            'allocated': 0,
            'peak': 0,
            'tensors': []
        }

        logger.info(f"Created memory pool '{name}': {size_mb}MB")

    def allocate_tensor(self, pool_name: str, shape: tuple, dtype=torch.float32) -> torch.Tensor:
        """Allocate a tensor and account for it in the named pool."""
        if pool_name not in self.pools:
            raise ValueError(f"Memory pool '{pool_name}' does not exist")

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        tensor = torch.empty(shape, dtype=dtype, device=device)

        # Update the pool statistics
        pool = self.pools[pool_name]
        tensor_size = tensor.numel() * tensor.element_size()
        pool['allocated'] += tensor_size
        pool['peak'] = max(pool['peak'], pool['allocated'])
        pool['tensors'].append(tensor)

        return tensor
    
    def get_memory_stats(self) -> Dict[str, Any]:
        """Return memory usage statistics."""
        stats = {}
        
        if torch.cuda.is_available():
            gpu_stats = {
                'allocated_mb': torch.cuda.memory_allocated() / 1024**2,
                'reserved_mb': torch.cuda.memory_reserved() / 1024**2,
                'max_allocated_mb': torch.cuda.max_memory_allocated() / 1024**2
            }
            stats['gpu'] = gpu_stats
            
        # Host memory used by this process
        process = psutil.Process()
        memory_info = process.memory_info()
        stats['system'] = {
            'rss_mb': memory_info.rss / 1024**2,
            'vms_mb': memory_info.vms / 1024**2
        }
        
        # Per-pool statistics
        pool_stats = {}
        for name, pool in self.pools.items():
            pool_stats[name] = {
                'size_mb': pool['size_mb'],
                'allocated_mb': pool['allocated'] / 1024**2,
                'peak_mb': pool['peak'] / 1024**2,
                'utilization': pool['allocated'] / (pool['size_mb'] * 1024**2) * 100
            }
        stats['pools'] = pool_stats
        
        return stats
    
    def cleanup_pool(self, pool_name: str):
        """Release all tensors tracked by a pool."""
        if pool_name in self.pools:
            pool = self.pools[pool_name]
            # Drop the tensor references so the memory can be reclaimed
            pool['tensors'].clear()
            pool['allocated'] = 0

            if torch.cuda.is_available():
                torch.cuda.empty_cache()

            logger.info(f"Cleaned up memory pool '{pool_name}'")

# Create the memory manager
memory_manager = MemoryPoolManager()

# Create pools for different purposes
memory_manager.create_pool("inference", 512)      # inference: 512MB
memory_manager.create_pool("preprocessing", 128)  # preprocessing: 128MB
memory_manager.create_pool("cache", 256)          # cache: 256MB

print("🧠 Memory pool manager initialized")
print(f"📊 Created {len(memory_manager.pools)} memory pools")

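## 4. Custom PyTorch Model Implementation

### Advanced Model Class Design
Implement a PyTorch model for Triton that supports dynamic shapes and tracks memory usage. Because the model is defined in `model.py` on top of `triton_python_backend_utils`, it is served through Triton's Python backend.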

In [None]:
# Create the advanced text classification model
model_dir = MODEL_REPO / "text_classifier_advanced" / "1"
model_dir.mkdir(parents=True, exist_ok=True)

# Model implementation (written to model.py below)
model_code = '''
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
import triton_python_backend_utils as pb_utils
import numpy as np
from typing import Dict, List, Any, Optional
import logging
import time

class AdvancedTextClassifier(nn.Module):
    """Text classifier that accepts variable-length input sequences."""
    
    def __init__(self, vocab_size=30000, embed_dim=256, hidden_dim=512, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(hidden_dim * 2, 8, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, num_classes)
        )
        
        # Placeholder flag; JIT tracing itself happens in TritonPythonModel.initialize
        self._compiled = False
        
    def forward(self, input_ids, attention_mask=None):
        # Derive the mask from padding tokens (id 0) when none is supplied
        if attention_mask is None:
            attention_mask = (input_ids != 0).float()
            
        # Embedding layer
        embeds = self.embedding(input_ids)
        
        # Bidirectional LSTM encoder
        lstm_out, _ = self.lstm(embeds)
        
        # Self-attention; padded positions are excluded as keys
        attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out, 
                                   key_padding_mask=(attention_mask == 0))
        
        # Pooling: masked mean over the non-padded positions
        mask_expanded = attention_mask.unsqueeze(-1).expand_as(attn_out)
        sum_embeddings = torch.sum(attn_out * mask_expanded, dim=1)
        sum_mask = torch.sum(mask_expanded, dim=1)
        pooled = sum_embeddings / (sum_mask + 1e-9)
        
        # Classification head
        logits = self.classifier(pooled)
        return logits
class TritonPythonModel:
    """Advanced PyTorch model served through Triton's Python backend."""
    
    def initialize(self, args):
        self.model_config = json.loads(args[\'model_config\'])
        self.model_instance_name = args[\'model_instance_name\']
        
        # Set up logging
        self.logger = logging.getLogger(f"AdvancedModel-{self.model_instance_name}")
        
        # Select the device
        if torch.cuda.is_available():
            self.device = torch.device(f"cuda:{args.get(\'model_instance_device_id\', 0)}")
            self.logger.info(f"Using GPU: {self.device}")
        else:
            self.device = torch.device("cpu")
            self.logger.info("Using CPU")
            
        # Build the model
        self.model = AdvancedTextClassifier()
        self.model.to(self.device)
        self.model.eval()
        
        # Performance optimization: JIT tracing
        if torch.cuda.is_available():
            try:
                # Trace the model with an example input; fall back to eager mode on failure
                example_input = torch.randint(1, 1000, (1, 128), device=self.device)
                with torch.no_grad():
                    self.model = torch.jit.trace(self.model, example_input)
                self.logger.info("JIT tracing succeeded")
            except Exception as e:
                self.logger.warning(f"JIT tracing failed: {e}")
        
        # Request and memory statistics
        self.memory_stats = {
            \'total_requests\': 0,
            \'total_time\': 0,
            \'peak_memory\': 0
        }
        
        # Warm up the model
        self._warmup_model()
        
        self.logger.info("Advanced PyTorch model initialized")
    
    def _warmup_model(self):
        """Warm up the model."""
        self.logger.info("Starting model warmup...")
        
        # Warmup inputs of several sequence lengths
        warmup_lengths = [16, 64, 128, 256]
        
        with torch.no_grad():
            for length in warmup_lengths:
                dummy_input = torch.randint(1, 1000, (2, length), device=self.device)
                _ = self.model(dummy_input)
                
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            
        self.logger.info("Model warmup complete")
    
    def execute(self, requests):
        """Run inference for each request."""
        responses = []
        
        for request in requests:
            start_time = time.time()
            
            try:
                # Read the input
                input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
                input_data = input_ids.as_numpy()
                
                # Convert to a PyTorch tensor
                input_tensor = torch.from_numpy(input_data).to(self.device)
                
                # Inference
                with torch.no_grad():
                    logits = self.model(input_tensor)
                    probs = F.softmax(logits, dim=-1)
                
                # Convert back to numpy
                output_data = probs.cpu().numpy()
                
                # Build the output tensor
                output_tensor = pb_utils.Tensor("output", output_data)
                response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
                
                # Update statistics
                end_time = time.time()
                self.memory_stats[\'total_requests\'] += 1
                self.memory_stats[\'total_time\'] += (end_time - start_time)
                
                if torch.cuda.is_available():
                    current_memory = torch.cuda.memory_allocated() / 1024**2
                    self.memory_stats[\'peak_memory\'] = max(
                        self.memory_stats[\'peak_memory\'], current_memory
                    )
                
            except Exception as e:
                self.logger.error(f"Inference error: {e}")
                error_response = pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(f"Inference failed: {str(e)}")
                )
                responses.append(error_response)
                continue
                
            responses.append(response)
        
        return responses
    
    def finalize(self):
        """Release resources."""
        self.logger.info("Cleaning up model resources...")
        
        # Log summary statistics
        if self.memory_stats[\'total_requests\'] > 0:
            avg_time = self.memory_stats[\'total_time\'] / self.memory_stats[\'total_requests\']
            self.logger.info(f"Total requests: {self.memory_stats[\'total_requests\']}")
            self.logger.info(f"Average inference time: {avg_time:.4f}s")
            self.logger.info(f"Peak memory usage: {self.memory_stats[\'peak_memory\']:.2f}MB")
        
        # Release cached GPU memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            
        self.logger.info("Model resource cleanup complete")
'''

# Write the model file
with open(model_dir / "model.py", "w", encoding="utf-8") as f:
    f.write(model_code)

print(f"✅ Advanced PyTorch model created: {model_dir / 'model.py'}")

## 5. Generating the Advanced Configuration File

In [None]:
# Generate the advanced configuration file
advanced_config = '''
name: "text_classifier_advanced"
backend: "python"
# dynamic_batching requires max_batch_size > 0; requests then carry a leading batch dimension
max_batch_size: 32

# Dynamic input
input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [ -1 ]  # variable sequence length
}

output {
  name: "output"
  data_type: TYPE_FP32
  dims: [ 10 ]  # 10 classes
}

# Instance group: multiple instances to raise throughput
instance_group {
  count: 2
  kind: KIND_GPU
  gpus: [ 0 ]
}

# Dynamic batching
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8, 16 ]
  max_queue_delay_microseconds: 500
  preserve_ordering: false
  priority_levels: 2
  default_priority_level: 1  # must lie in [1, priority_levels]
  
  # Queue policy
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 1000000
    allow_timeout_override: true
    max_queue_size: 32
  }
}

# Optimization (model_config.proto notes that CUDA graphs are
# currently recognized only by the TensorRT backend)
optimization {
  cuda {
    graphs: true
    busy_wait_events: true
    graph_spec {
      batch_size: 1
      input {
        key: "input_ids"
        value {
          dim: [ 128 ]
        }
      }
    }
    graph_spec {
      batch_size: 4
      input {
        key: "input_ids"
        value {
          dim: [ 256 ]
        }
      }
    }
  }
}

# Model warmup
model_warmup {
  name: "warmup_sample_1"
  batch_size: 1
  inputs {
    key: "input_ids"
    value {
      data_type: TYPE_INT64
      dims: [ 128 ]
      zero_data: true
    }
  }
}

model_warmup {
  name: "warmup_sample_batch"
  batch_size: 8
  inputs {
    key: "input_ids"
    value {
      data_type: TYPE_INT64
      dims: [ 256 ]
      zero_data: true
    }
  }
}

# Additional parameters
# To run model.py in a custom Python environment, set EXECUTION_ENV_PATH to a
# conda-pack archive (.tar.gz); the default environment is used when it is omitted.
parameters {
  key: "shm_region_prefix_name"
  value: { string_value: "triton_python_backend" }
}
'''

# Write the configuration file
config_path = MODEL_REPO / "text_classifier_advanced" / "config.pbtxt"
with open(config_path, "w", encoding="utf-8") as f:
    f.write(advanced_config.strip())

print(f"✅ Advanced configuration file created: {config_path}")
print("\n🔧 Configuration features:")
print("- ✅ Variable sequence length support")
print("- ✅ Dynamic batch scheduling")
print("- ✅ CUDA Graph optimization")
print("- ✅ Multiple parallel instances")
print("- ✅ Model warmup")
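
According to Triton's ragged batching documentation, the dynamic batcher by default only combines requests whose inputs have the same shape. With a variable sequence length, few requests share a shape, so batches tend to stay small. A common workaround is to pad each request up to one of a few fixed bucket lengths on the client. The sketch below assumes hypothetical bucket sizes and helper names (`BUCKETS`, `bucket_length`, `pad_to_bucket`) and uses `0` as the padding id, matching the model's `padding_idx=0`.

```python
from bisect import bisect_left
from typing import List, Sequence

BUCKETS = (16, 32, 64, 128, 256, 512)  # hypothetical bucket lengths

def bucket_length(seq_len: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket >= seq_len; raise if none is large enough."""
    idx = bisect_left(buckets, seq_len)
    if idx == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds max bucket {buckets[-1]}")
    return buckets[idx]

def pad_to_bucket(token_ids: List[int], pad_id: int = 0) -> List[int]:
    """Right-pad token_ids with pad_id up to its bucket length."""
    target = bucket_length(len(token_ids))
    return token_ids + [pad_id] * (target - len(token_ids))

print(bucket_length(16))              # 16
print(bucket_length(17))              # 32
print(len(pad_to_bucket([7] * 100)))  # 128
```

Fewer distinct shapes give the batcher more opportunities to group requests, at the cost of some wasted computation on padding.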

## 6. Performance Monitoring and Analysis Tools

In [None]:
class PerformanceMonitor:
    """Performance monitor."""
    
    def __init__(self, triton_url="localhost:8000"):
        self.triton_url = triton_url
        self.client = httpclient.InferenceServerClient(url=triton_url)
        self.metrics_history = []
        
    def get_server_metrics(self) -> Dict[str, Any]:
        """Collect server metrics and record them in metrics_history."""
        try:
            # Inference statistics reported by the server
            stats = self.client.get_inference_statistics()
            
            # Models known to the repository
            models = self.client.get_model_repository_index()
            
            metrics = {
                'timestamp': time.time(),
                'server_stats': stats,
                'model_count': len(models),
                'system_resources': self._get_system_resources()
            }
            
            # Keep a history so plot_performance_trends() has data to draw
            self.metrics_history.append(metrics)
            return metrics
            
        except Exception as e:
            logger.error(f"Failed to collect metrics: {e}")
            return {}
    
    def _get_system_resources(self) -> Dict[str, float]:
        """Return current system resource usage."""
        resources = {
            'cpu_percent': psutil.cpu_percent(),
            'memory_percent': psutil.virtual_memory().percent,
            'memory_used_gb': psutil.virtual_memory().used / 1024**3
        }
        
        # GPU resources (first GPU only)
        try:
            gpus = GPUtil.getGPUs()
            if gpus:
                gpu = gpus[0]
                resources.update({
                    'gpu_utilization': gpu.load * 100,
                    'gpu_memory_used': gpu.memoryUsed,
                    'gpu_memory_total': gpu.memoryTotal,
                    'gpu_memory_percent': (gpu.memoryUsed / gpu.memoryTotal) * 100,
                    'gpu_temperature': gpu.temperature
                })
        except Exception as e:
            logger.debug(f"GPU metrics unavailable: {e}")
            
        return resources
    
    def benchmark_model(self, model_name: str, test_data: List[np.ndarray], 
                       concurrent_requests: int = 1, iterations: int = 100) -> Dict[str, Any]:
        """Benchmark a model.

        Requests are sent one at a time; concurrent_requests is only echoed in the
        log output and does not add client-side parallelism.
        """
        
        latencies = []
        errors = 0
        
        print(f"🚀 Starting benchmark: {model_name}")
        print(f"📊 Parameters: {concurrent_requests} concurrent, {iterations} iterations")
        
        start_time = time.time()
        
        for i in range(iterations):
            # Cycle through the test data
            test_input = test_data[i % len(test_data)]
            
            try:
                # Build the input
                inputs = [httpclient.InferInput("input_ids", test_input.shape, "INT64")]
                inputs[0].set_data_from_numpy(test_input)
                
                # Send the request and time it
                request_start = time.time()
                response = self.client.infer(model_name, inputs)
                request_end = time.time()
                
                latency = (request_end - request_start) * 1000  # convert to milliseconds
                latencies.append(latency)
                
                if i % 20 == 0:
                    print(f"  Progress: {i}/{iterations}, current latency: {latency:.2f}ms")
                    
            except Exception as e:
                errors += 1
                logger.error(f"Request failed: {e}")
        
        end_time = time.time()
        total_time = end_time - start_time
        
        # Compute summary statistics
        if latencies:
            stats = {
                'total_requests': iterations,
                'successful_requests': len(latencies),
                'error_rate': errors / iterations * 100,
                'total_time_seconds': total_time,
                'throughput_rps': len(latencies) / total_time,
                'latency_stats': {
                    'mean_ms': np.mean(latencies),
                    'median_ms': np.median(latencies),
                    'p95_ms': np.percentile(latencies, 95),
                    'p99_ms': np.percentile(latencies, 99),
                    'min_ms': np.min(latencies),
                    'max_ms': np.max(latencies),
                    'std_ms': np.std(latencies)
                }
            }
        else:
            stats = {
                'total_requests': iterations,
                'successful_requests': 0,
                'error_rate': 100.0,
                'total_time_seconds': total_time,
                'throughput_rps': 0
            }
        
        return stats
    
    def plot_performance_trends(self, save_path: Optional[str] = None):
        """Plot performance trends from metrics_history."""
        if not self.metrics_history:
            print("⚠️  No historical data to plot")
            return
            
        # Extract the time series
        timestamps = [m['timestamp'] for m in self.metrics_history]
        cpu_usage = [m['system_resources']['cpu_percent'] for m in self.metrics_history]
        memory_usage = [m['system_resources']['memory_percent'] for m in self.metrics_history]
        
        # Create the figure
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        fig.suptitle('Triton Server Performance Dashboard', fontsize=16)
        
        # CPU utilization
        axes[0, 0].plot(timestamps, cpu_usage, 'b-', linewidth=2)
        axes[0, 0].set_title('CPU utilization')
        axes[0, 0].set_ylabel('Utilization (%)')
        axes[0, 0].grid(True, alpha=0.3)
        
        # Memory utilization
        axes[0, 1].plot(timestamps, memory_usage, 'r-', linewidth=2)
        axes[0, 1].set_title('Memory utilization')
        axes[0, 1].set_ylabel('Utilization (%)')
        axes[0, 1].grid(True, alpha=0.3)
        
        # GPU utilization (0 when no GPU data was collected)
        gpu_usage = []
        gpu_memory = []
        for m in self.metrics_history:
            resources = m['system_resources']
            gpu_usage.append(resources.get('gpu_utilization', 0))
            gpu_memory.append(resources.get('gpu_memory_percent', 0))
            
        axes[1, 0].plot(timestamps, gpu_usage, 'g-', linewidth=2, label='GPU utilization')
        axes[1, 0].plot(timestamps, gpu_memory, 'orange', linewidth=2, label='GPU memory')
        axes[1, 0].set_title('GPU resources')
        axes[1, 0].set_ylabel('Utilization (%)')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Number of models
        model_counts = [m['model_count'] for m in self.metrics_history]
        axes[1, 1].plot(timestamps, model_counts, 'm-', linewidth=2)
        axes[1, 1].set_title('Models in repository')
        axes[1, 1].set_ylabel('Model count')
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
            print(f"📊 Performance chart saved: {save_path}")
        
        plt.show()

# Create the performance monitor
monitor = PerformanceMonitor()
print("📊 Performance monitor created")

# Prepare test data
def generate_test_data(num_samples: int = 50) -> List[np.ndarray]:
    """Generate test data."""
    test_data = []
    
    # Sequences of different lengths
    lengths = [16, 32, 64, 128, 256]
    
    for _ in range(num_samples):
        length = np.random.choice(lengths)
        # Random token ids starting at 1, so no position looks like padding (id 0)
        sequence = np.random.randint(1, 5000, size=(1, length), dtype=np.int64)
        test_data.append(sequence)
    
    return test_data

test_data = generate_test_data()
print(f"📋 Generated {len(test_data)} test samples")
print(f"📏 Sequence lengths of the first samples: {[data.shape[1] for data in test_data[:5]]}...")
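
`benchmark_model` issues its requests sequentially, so it measures single-stream latency rather than behavior under load. A rough sketch of a concurrent driver is shown below. It takes a hypothetical `send_request` callable instead of a Triton client, so it runs without a server; in a real test that callable would wrap an inference call, ideally with one client per worker thread, since sharing one client across threads may not be safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median
from typing import Callable, Dict, List

def run_concurrent(send_request: Callable[[int], None], total: int, workers: int) -> Dict[str, float]:
    """Issue `total` calls to `send_request` from `workers` threads and time each call."""

    def timed_call(i: int) -> float:
        start = time.perf_counter()
        send_request(i)
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies: List[float] = list(pool.map(timed_call, range(total)))
    wall = time.perf_counter() - wall_start

    return {
        "requests": float(total),
        "throughput_rps": total / wall,
        "mean_ms": mean(latencies),
        "median_ms": median(latencies),
        "max_ms": max(latencies),
    }

def fake_request(i: int) -> None:
    """Stand-in for a real inference call; sleeps for 10 ms."""
    time.sleep(0.01)

stats = run_concurrent(fake_request, total=20, workers=4)
print(f"{stats['throughput_rps']:.1f} RPS, mean {stats['mean_ms']:.1f} ms")
```

With several workers, throughput should rise above the single-stream figure while per-request latency stays near the cost of one call, as long as the server is not saturated.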

## 7. Starting Triton Server and Loading the Model

In [None]:
# Create the Docker launch script (raw f-string so the shell line continuations survive)
docker_script = rf'''
#!/bin/bash

# Launch script for Triton Server with the advanced configuration

MODEL_REPO="{MODEL_REPO.absolute()}"
CONTAINER_NAME="triton-advanced-pytorch"

echo "🚀 Starting Triton Server (advanced PyTorch setup)"
echo "📁 Model repository: $MODEL_REPO"

# Stop any existing container
docker stop $CONTAINER_NAME 2>/dev/null || true
docker rm $CONTAINER_NAME 2>/dev/null || true

# Start Triton Server
docker run -d \
  --name $CONTAINER_NAME \
  --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v "$MODEL_REPO:/models" \
  -e CUDA_VISIBLE_DEVICES=0 \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver \
    --model-repository=/models \
    --backend-directory=/opt/tritonserver/backends \
    --model-control-mode=explicit \
    --strict-model-config=false \
    --log-verbose=1 \
    --log-info=true \
    --log-warning=true \
    --log-error=true \
    --exit-on-error=false \
    --exit-timeout-secs=120 \
    --buffer-manager-thread-count=2 \
    --model-load-thread-count=4

echo "⏳ Waiting for the server to start..."
sleep 10

# Check the server status
if curl -s http://localhost:8000/v2/health/ready > /dev/null; then
    echo "✅ Triton Server is ready"
    echo "🌐 HTTP: http://localhost:8000"
    echo "📊 Metrics: http://localhost:8002/metrics"
    
    # Load the advanced model
    echo "📥 Loading the advanced PyTorch model..."
    curl -X POST http://localhost:8000/v2/repository/models/text_classifier_advanced/load
    
    echo "📋 Checking model status:"
    curl -s http://localhost:8000/v2/models/text_classifier_advanced
else
    echo "❌ Triton Server failed to start"
    docker logs $CONTAINER_NAME
    exit 1
fi
'''

script_path = SCRIPTS_DIR / "start_triton_advanced.sh"
with open(script_path, "w", encoding="utf-8") as f:
    f.write(docker_script.strip())

# Make the script executable
os.chmod(script_path, 0o755)

print(f"📜 Docker launch script created: {script_path}")
print("\n🚀 To start Triton Server, run:")
print(f"   bash {script_path}")
print("\n⚠️  Note: Docker and the NVIDIA Container Toolkit must be installed")
print("   model.py imports torch, so torch must be installed in the container's Python environment")

## 8. Performance Testing and Benchmarking

In [None]:
def wait_for_server(max_wait_time: int = 60) -> bool:
    """Wait until the server reports ready."""
    print("⏳ Waiting for Triton Server to become ready...")
    
    for i in range(max_wait_time):
        try:
            client = httpclient.InferenceServerClient(url="localhost:8000")
            if client.is_server_ready():
                print("✅ Triton Server is ready")
                return True
        except Exception:
            # The server is not reachable yet; retry after a short pause
            pass
        
        time.sleep(1)
        if i % 10 == 0:
            print(f"  Waiting... ({i}/{max_wait_time}s)")
    
    print("❌ Timed out waiting for the server")
    return False

def run_comprehensive_benchmark():
    """Run the full benchmark suite."""
    
    if not wait_for_server():
        print("Cannot connect to Triton Server; make sure the server is running")
        return
    
    print("\n🎯 Starting the comprehensive performance benchmark")
    print("=" * 50)
    
    # Test configurations
    test_configs = [
        {"name": "Small batch, low latency", "concurrent": 1, "iterations": 50},
        {"name": "Medium batch, balanced", "concurrent": 4, "iterations": 100},
        {"name": "Large batch, high throughput", "concurrent": 8, "iterations": 200}
    ]
    
    results = {}
    
    for config in test_configs:
        print(f"\n📊 Test: {config['name']}")
        print(f"   Concurrency: {config['concurrent']}, iterations: {config['iterations']}")
        
        try:
            result = monitor.benchmark_model(
                "text_classifier_advanced",
                test_data,
                concurrent_requests=config['concurrent'],
                iterations=config['iterations']
            )
            results[config['name']] = result
            
            # Show the results
            if 'latency_stats' in result:
                stats = result['latency_stats']
                print(f"   ✅ Throughput: {result['throughput_rps']:.2f} RPS")
                print(f"   ⚡ Mean latency: {stats['mean_ms']:.2f}ms")
                print(f"   📈 P95 latency: {stats['p95_ms']:.2f}ms")
                print(f"   ❌ Error rate: {result['error_rate']:.2f}%")
            else:
                print(f"   ❌ Test failed, error rate: {result['error_rate']:.2f}%")
                
        except Exception as e:
            print(f"   ❌ Test failed: {e}")
            results[config['name']] = {"error": str(e)}
    
    return results

# Run the performance tests (if the server is available)
try:
    client = httpclient.InferenceServerClient(url="localhost:8000")
    if client.is_server_ready():
        print("🎯 Triton Server detected, starting performance tests")
        benchmark_results = run_comprehensive_benchmark()
    else:
        print("⚠️  Triton Server is not ready, skipping performance tests")
        print("   Start the server first, then rerun this cell")
        benchmark_results = None
except Exception as e:
    print(f"⚠️  Cannot connect to Triton Server: {e}")
    print("   Make sure the server is running and reachable")
    benchmark_results = None

## 9. Memory Usage Analysis

In [None]:
def analyze_memory_usage():
    """Print and return a summary of memory usage."""
    
    print("🧠 Memory usage analysis")
    print("=" * 30)
    
    # Collect memory statistics
    stats = memory_manager.get_memory_stats()
    
    # Host memory
    if 'system' in stats:
        sys_stats = stats['system']
        print("📊 Host memory (this process):")
        print(f"   RSS: {sys_stats['rss_mb']:.2f} MB")
        print(f"   VMS: {sys_stats['vms_mb']:.2f} MB")
    
    # GPU memory
    if 'gpu' in stats:
        gpu_stats = stats['gpu']
        print("\n🎮 GPU memory:")
        print(f"   Allocated: {gpu_stats['allocated_mb']:.2f} MB")
        print(f"   Reserved: {gpu_stats['reserved_mb']:.2f} MB")
        print(f"   Peak: {gpu_stats['max_allocated_mb']:.2f} MB")
        
        # Utilization relative to the total memory of GPU 0
        if torch.cuda.is_available():
            total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**2
            utilization = gpu_stats['allocated_mb'] / total_memory * 100
            print(f"   Utilization: {utilization:.1f}%")
    
    # Memory pool statistics
    if 'pools' in stats and stats['pools']:
        print("\n🏊 Memory pools:")
        for pool_name, pool_stats in stats['pools'].items():
            print(f"   {pool_name}:")
            print(f"     Size: {pool_stats['size_mb']:.2f} MB")
            print(f"     Used: {pool_stats['allocated_mb']:.2f} MB")
            print(f"     Peak: {pool_stats['peak_mb']:.2f} MB")
            print(f"     Utilization: {pool_stats['utilization']:.1f}%")
    
    return stats

def create_memory_visualization(stats: Dict[str, Any]):
    """Visualize memory pool usage."""
    
    if 'pools' not in stats or not stats['pools']:
        print("⚠️  No memory pool data to visualize")
        return
    
    # Prepare the data
    pool_names = list(stats['pools'].keys())
    pool_sizes = [stats['pools'][name]['size_mb'] for name in pool_names]
    pool_used = [stats['pools'][name]['allocated_mb'] for name in pool_names]
    pool_utilization = [stats['pools'][name]['utilization'] for name in pool_names]
    
    # Create the figure
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # Pool sizes
    axes[0].bar(pool_names, pool_sizes, color='skyblue', alpha=0.7)
    axes[0].set_title('Memory pool size', fontsize=14)
    axes[0].set_ylabel('Size (MB)')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Capacity versus usage per pool
    x = np.arange(len(pool_names))
    width = 0.35
    
    axes[1].bar(x - width/2, pool_sizes, width, label='Capacity', color='lightgray', alpha=0.7)
    axes[1].bar(x + width/2, pool_used, width, label='Used', color='orange', alpha=0.7)
    axes[1].set_title('Capacity vs. usage', fontsize=14)
    axes[1].set_ylabel('Memory (MB)')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(pool_names, rotation=45)
    axes[1].legend()
    
    # Utilization per pool. A bar chart is used because the percentages are
    # independent of each other and are all zero until tensors are allocated,
    # which a pie chart cannot represent.
    colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
    axes[2].bar(pool_names, pool_utilization, color=colors[:len(pool_names)], alpha=0.8)
    axes[2].set_title('Memory pool utilization', fontsize=14)
    axes[2].set_ylabel('Utilization (%)')
    axes[2].set_ylim(0, 100)
    axes[2].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    
    # Save the chart
    chart_path = BASE_DIR / "memory_analysis.png"
    plt.savefig(chart_path, dpi=300, bbox_inches='tight')
    print(f"📊 Memory analysis chart saved: {chart_path}")
    
    plt.show()

# Run the memory analysis
print("🔍 Running memory usage analysis...")
memory_stats = analyze_memory_usage()

# Visualize (if pool data is available)
if memory_stats and memory_stats.get('pools'):
    create_memory_visualization(memory_stats)
else:
    print("📊 Not enough memory pool data, skipping visualization")

## 10. Custom Operator Integration Example

In [None]:
# Create the custom operator model
custom_model_dir = MODEL_REPO / "custom_ops_model" / "1"
custom_model_dir.mkdir(parents=True, exist_ok=True)

custom_ops_code = '''
import torch
import torch.nn as nn
import torch.nn.functional as F
import triton_python_backend_utils as pb_utils
import numpy as np
import json
from typing import List, Tuple

class CustomAttentionLayer(nn.Module):
    """Custom multi-head attention layer used to demonstrate custom operator integration."""
    
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(0.1)
        # Register the scale as a non-persistent buffer so that model.to(device)
        # moves it together with the parameters.
        self.register_buffer(
            "scale", torch.sqrt(torch.tensor(float(self.head_dim))), persistent=False
        )
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        seq_len = query.shape[1]
        
        # Linear projections
        Q = self.q_linear(query)
        K = self.k_linear(key) 
        V = self.v_linear(value)
        
        # Reshape to [batch, n_heads, seq_len, head_dim]
        # (key and value are assumed to share the query sequence length, as in self-attention)
        Q = Q.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        
        # Attention computation (the custom operator of this example)
        attention = self._custom_attention(Q, K, V, mask)
        
        # Merge the heads back into [batch, seq_len, d_model]
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )
        
        # Output projection
        output = self.out_linear(attention)
        return output
    
    def _custom_attention(self, Q, K, V, mask=None):
        """Scaled dot-product attention, standing in for a custom operator."""
        
        # Make sure the scale tensor lives on the same device as the inputs
        if self.scale.device != Q.device:
            self.scale = self.scale.to(Q.device)
            
        # Attention scores: Q K^T / sqrt(head_dim)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        # Apply the padding mask
        if mask is not None:
            mask = mask.unsqueeze(1).unsqueeze(1)  # [batch, 1, 1, seq_len]
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Softmax + dropout (dropout is a no-op in eval mode)
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Weighted sum of the values
        attention = torch.matmul(attention_weights, V)
        
        return attention

class CustomOpsModel(nn.Module):
    """Model built on the custom attention operator."""
    
    def __init__(self, vocab_size=10000, d_model=256, n_heads=8, num_layers=3):
        super().__init__()
        self.d_model = d_model
        
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        # Learned positional encoding; supports sequences of up to 1000 tokens
        self.pos_encoding = nn.Parameter(torch.randn(1000, d_model))
        
        # Stack of custom attention layers
        self.attention_layers = nn.ModuleList([
            CustomAttentionLayer(d_model, n_heads) for _ in range(num_layers)
        ])
        
        self.layer_norms = nn.ModuleList([
            nn.LayerNorm(d_model) for _ in range(num_layers)
        ])
        
        # A single feed-forward network, shared by all layers
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(d_model * 4, d_model)
        )
        
        self.output_norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 5)  # 5-class classification
        
    def forward(self, input_ids):
        seq_len = input_ids.shape[1]
        
        # Embedding + positional encoding
        x = self.embedding(input_ids)
        x = x + self.pos_encoding[:seq_len].unsqueeze(0)
        
        # Attention mask: 1 for real tokens, 0 for padding (token id 0)
        attention_mask = (input_ids != 0).float()
        
        # Transformer layers
        for attention_layer, layer_norm in zip(self.attention_layers, self.layer_norms):
            # Self-attention + residual connection
            attn_output = attention_layer(x, x, x, attention_mask)
            x = layer_norm(x + attn_output)
            
            # FFN + residual connection (reuses the same LayerNorm as above)
            ffn_output = self.ffn(x)
            x = layer_norm(x + ffn_output)
        
        # Final normalization
        x = self.output_norm(x)
        
        # Masked mean pooling over the non-padding positions
        mask_expanded = attention_mask.unsqueeze(-1).expand_as(x)
        sum_embeddings = torch.sum(x * mask_expanded, dim=1)
        sum_mask = torch.sum(mask_expanded, dim=1)
        pooled = sum_embeddings / (sum_mask + 1e-9)
        
        # Classification
        logits = self.classifier(pooled)
        return logits

class TritonPythonModel:
    """Triton Python model that serves the custom operator model."""
    
    def initialize(self, args):
        self.model_config = json.loads(args[\'model_config\'])
        
        # Device selection
        device_id = args.get(\'model_instance_device_id\', 0)
        self.device = torch.device(f"cuda:{device_id}" if torch.cuda.is_available() else "cpu")
        
        # Build the model. The weights are randomly initialized here;
        # a real deployment would load a trained state_dict.
        self.model = CustomOpsModel()
        self.model.to(self.device)
        self.model.eval()
        
        # Performance settings
        if torch.cuda.is_available():
            # Let cuDNN autotune kernels and allow TF32 matmuls
            torch.backends.cudnn.benchmark = True
            torch.backends.cuda.matmul.allow_tf32 = True
            
        print(f"Custom operator model loaded on {self.device}")
        
    def execute(self, requests):
        responses = []
        
        for request in requests:
            try:
                # Read the input tensor
                input_tensor = pb_utils.get_input_tensor_by_name(request, "input_ids")
                input_data = torch.from_numpy(input_tensor.as_numpy()).to(self.device)
                
                # Inference
                with torch.no_grad():
                    logits = self.model(input_data)
                    probs = F.softmax(logits, dim=-1)
                
                # Build the output tensor
                output_data = probs.cpu().numpy()
                output_tensor = pb_utils.Tensor("output", output_data)
                response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
                
            except Exception as e:
                response = pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(f"Inference error: {str(e)}")
                )
            
            responses.append(response)
        
        return responses
    
    def finalize(self):
        print("Custom operator model resources released")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
'''

# Write the custom operator model
with open(custom_model_dir / "model.py", "w", encoding="utf-8") as f:
    f.write(custom_ops_code)

# Create the configuration file
custom_config = '''
name: "custom_ops_model"
backend: "python"
max_batch_size: 16

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [ -1 ]
}

output {
  name: "output"
  data_type: TYPE_FP32
  dims: [ 5 ]
}

instance_group {
  count: 1
  kind: KIND_GPU
}

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 200
}

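# Python backend parameter: "no" lets Triton leave input tensors where they
# already are (possibly on the GPU) instead of copying them to the CPU first.
# model.py reads inputs with as_numpy(), which needs CPU-resident tensors.
parameters {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}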
'''

config_path = MODEL_REPO / "custom_ops_model" / "config.pbtxt"
with open(config_path, "w", encoding="utf-8") as f:
    f.write(custom_config.strip())

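print("✅ Custom operator model created")
print(f"📁 Model path: {custom_model_dir}")
print(f"🔧 Config file: {config_path}")
print("\n🎯 Features:")
print("- ✅ Custom multi-head attention layer")
print("- ✅ Scaled dot-product attention with padding mask")
print("- ✅ Multi-layer Transformer architecture")
print("- ✅ CUDA settings (cuDNN benchmark, TF32)")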
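
The model above treats token id 0 as padding (`padding_idx=0`, mask `input_ids != 0`) and its positional encoding table has 1000 rows, so a client has to pad variable-length sequences with 0 and keep them within 1000 tokens. The following is a minimal sketch of that client-side preparation in plain Python. The helper names `pad_batch` and `attention_mask` and the constants are hypothetical and only mirror the values used in `model.py`.

```python
from typing import List

PAD_ID = 0          # mirrors padding_idx=0 in the embedding layer
MAX_SEQ_LEN = 1000  # mirrors the 1000-row positional encoding table

def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID,
              max_len: int = MAX_SEQ_LEN) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    if not sequences:
        raise ValueError("need at least one sequence")
    longest = max(len(seq) for seq in sequences)
    if longest > max_len:
        raise ValueError(f"sequence length {longest} exceeds limit {max_len}")
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

def attention_mask(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """1 for real tokens, 0 for padding, matching (input_ids != 0) in the model."""
    return [[1 if token != pad_id else 0 for token in row] for row in batch]

batch = pad_batch([[5, 8, 2], [7]])
mask = attention_mask(batch)
print(batch)
print(mask)
```

The padded rows can then be converted to an INT64 array and sent as the `input_ids` input declared in `config.pbtxt`.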

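## 11. Summary and Best Practices

### 🎯 Key Takeaways

This lab covered advanced configuration and optimization techniques for the Triton PyTorch Backend:

#### Core technical results
1. **Dynamic shape handling**: support for flexible sequence lengths
2. **Memory pool management**: an allocation strategy based on pre-sized pools
3. **Custom operators**: an integrated custom attention mechanism
4. **Performance monitoring**: tooling for monitoring and analysis
5. **Advanced configuration**: configuration practices for production deployments

#### Production deployment checklist
- ✅ **CUDA Graph optimization**: reduces GPU kernel launch overhead
- ✅ **Dynamic batch scheduling**: balances latency against throughput
- ✅ **Memory pre-allocation**: avoids performance jitter caused by dynamic allocation
- ✅ **Model warmup**: keeps first-request latency stable
- ✅ **Multi-instance deployment**: improves GPU utilization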

In [None]:
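# Generate the best-practices summary report
def generate_best_practices_report():
    """Build and print the best-practices report."""
    
    report = {
        "title": "PyTorch Backend Advanced Optimization Best Practices",
        "sections": {
            "Configuration": {
                "Dynamic batching": [
                    "Choose preferred_batch_size to match the hardware resources",
                    "Tune max_queue_delay_microseconds to balance latency and throughput",
                    "Use preserve_ordering=false to improve throughput"
                ],
                "CUDA optimization": [
                    "Enable CUDA Graph (graphs: true) where the backend supports it",
                    "Predefine graph_spec entries for common input shapes",
                    "Set busy_wait_events to reduce latency"
                ],
                "Instance groups": [
                    "Use KIND_GPU for GPU models",
                    "Set the instance count according to available GPU memory",
                    "Consider model parallelism for large models"
                ]
            },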
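            "Memory management": {
                "Pre-allocation strategy": [
                    "Pre-allocate memory pools before inference",
                    "Manage tensors in separate size classes",
                    "Periodically release unused tensors"
                ],
                "GPU memory": [
                    "Monitor GPU memory utilization",
                    "Avoid memory fragmentation",
                    "Call torch.cuda.empty_cache() when appropriate"
                ]
            },
            "Performance tuning": {
                "PyTorch optimization": [
                    "Enable torch.backends.cudnn.benchmark",
                    "Use JIT compilation (torch.jit.trace)",
                    "Consider TensorFloat-32 (TF32)"
                ],
                "Model optimization": [
                    "Warm up the model to reduce cold-start latency",
                    "Use torch.no_grad() to reduce memory usage",
                    "Consider mixed-precision inference"
                ]
            },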
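            "Monitoring and alerting": {
                "Key metrics": [
                    "Inference latency (P50, P95, P99)",
                    "Throughput (RPS)",
                    "GPU memory utilization",
                    "Error rate"
                ],
                "Alert thresholds": [
                    "P99 latency > 500ms",
                    "Error rate > 1%",
                    "GPU memory utilization > 90%",
                    "Unhealthy instance status"
                ]
            }
        },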
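        "common_pitfalls": {
            "Configuration pitfalls": [
                "Leaving max_batch_size=0 while enabling dynamic batching (it requires max_batch_size > 0)",
                "Too many instances, leading to out-of-memory errors",
                "Misconfigured dynamic shapes, leading to shape mismatches"
            ],
            "Performance pitfalls": [
                "Frequent CPU-GPU data transfers",
                "Not using batched inference",
                "Memory leaks that end in OOM"
            ]
        }
    }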
    
    # Print the report
    print(f"📋 {report['title']}")
    print("=" * 50)
    
    for section_name, content in report['sections'].items():
        print(f"\n🔹 {section_name}")
        for subsection, items in content.items():
            print(f"  📌 {subsection}:")
            for item in items:
                print(f"     • {item}")
    
    print("\n⚠️  Common pitfalls")
    for category, pitfalls in report['common_pitfalls'].items():
        print(f"  🚫 {category}:")
        for pitfall in pitfalls:
            print(f"     • {pitfall}")
    
    return report

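# Generate the report
best_practices = generate_best_practices_report()

print("\n🎉 PyTorch Backend advanced configuration lab complete!")
print("\n📚 Suggested next steps:")
print("1. 🔧 Practice TensorRT Backend integration (Notebook 02)")
print("2. 🚀 Explore vLLM Backend integration (Notebook 03)")
print("3. 🛠️  Develop a custom Python Backend (Notebook 04)")
print("4. 📊 Go deeper into performance tuning and monitoring")
print("5. 🏢 Production deployment practice")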
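
The alert thresholds listed in the report (P99 latency > 500 ms, error rate > 1%, GPU memory utilization > 90%) can be checked against collected measurements with a few lines of Python. The sketch below uses a nearest-rank percentile. The function names `percentile` and `check_alerts`, their parameters, and the sample numbers are illustrative assumptions rather than part of the notebook's monitoring code.

```python
import math
from typing import List

def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_alerts(latencies_ms: List[float], errors: int, total_requests: int,
                 gpu_mem_used_pct: float) -> List[str]:
    """Return one message per threshold from the report that is exceeded."""
    alerts = []
    p99 = percentile(latencies_ms, 99)
    if p99 > 500:
        alerts.append(f"P99 latency {p99:.0f} ms > 500 ms")
    if total_requests and errors / total_requests > 0.01:
        alerts.append(f"error rate {errors / total_requests:.1%} > 1%")
    if gpu_mem_used_pct > 90:
        alerts.append(f"GPU memory utilization {gpu_mem_used_pct:.0f}% > 90%")
    return alerts

# Made-up sample: 98 fast requests and 2 slow ones, 2 errors, GPU memory at 75%.
sample_latencies = [100.0] * 98 + [800.0, 800.0]
alerts = check_alerts(sample_latencies, errors=2, total_requests=100,
                      gpu_mem_used_pct=75.0)
for message in alerts:
    print(message)
```

Nearest-rank is only one of several percentile definitions; a monitoring system may interpolate instead, so values near a threshold can differ slightly between tools.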

## 🔗 Resources and Further Reading

### Official documentation
- [Triton PyTorch Backend](https://github.com/triton-inference-server/pytorch_backend)
- [CUDA Graph optimization guide](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)
- [PyTorch JIT compilation](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html)

### Performance optimization
- [PyTorch performance tuning guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)
- [NVIDIA GPU optimization best practices](https://docs.nvidia.com/deeplearning/performance/index.html)
- [Triton performance tuning](https://github.com/triton-inference-server/server/blob/main/docs/optimization.md)

### Case studies
- [Designing an enterprise AI inference platform](https://developer.nvidia.com/blog/deploying-ai-at-scale-with-triton-inference-server/)
- [Deploying large-scale NLP models](https://developer.nvidia.com/blog/how-to-deploy-almost-any-hugging-face-model-on-nvidia-triton-inference-server/)

---

**🎓 Lab completion marker**: PyTorch Backend advanced configuration and optimization techniques mastered ✅