# TensorRT Backend Integration and Performance Optimization

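## 🎯 Learning Objectives

This lab covers integrating and tuning the Triton TensorRT backend: converting a PyTorch model into an optimized TensorRT engine and serving it through Triton for low-latency, high-throughput inference.

### Key Topics
- ✅ Building and optimizing TensorRT engines
- ✅ The model conversion pipeline (PyTorch → ONNX → TensorRT)
- ✅ Precision modes (FP32/FP16/INT8)
- ✅ Dynamic shape support
- ✅ Performance benchmarking and comparison
- ✅ Production deployment configuration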

### Technical Architecture
```
PyTorch model
     ↓
ONNX export (torch.onnx.export)
     ↓
TensorRT optimization (trtexec / TensorRT Python API)
     ↓
Triton TensorRT Backend
     ↓
High-performance inference service
```

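### Expected Performance Gains
- 🚀 **Inference speed**: roughly 2-10x faster, depending on the model, GPU, and precision mode
- 💾 **Memory usage**: typically 30-50% lower
- ⚡ **Latency**: millisecond-level responses
- 🔋 **Energy efficiency**: noticeably less energy per inference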
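
These ranges depend heavily on the workload, so it is worth computing them from your own measurements. The sketch below shows the arithmetic; the helper names and the sample numbers are placeholders for illustration, not measured results.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


def percent_reduction(baseline: float, optimized: float) -> float:
    """Return the relative reduction, in percent, from baseline to optimized."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# Placeholder numbers only; substitute your own latency and memory measurements.
example_speedup = speedup(baseline_ms=8.0, optimized_ms=2.0)
example_memory_saving = percent_reduction(baseline=1000.0, optimized=600.0)
print(f"speedup: {example_speedup:.1f}x, memory reduction: {example_memory_saving:.0f}%")
```

Reporting both numbers next to the hardware and precision mode used makes the comparison reproducible.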

## 1. Environment Setup and TensorRT Configuration

In [None]:
import os
import sys
import json
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import requests
import tritonclient.http as httpclient
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Any, Optional, Tuple
import logging
import subprocess
import shutil
import pickle
import onnx
import onnxruntime as ort
from dataclasses import dataclass
import psutil
import GPUtil

# TensorRT-related imports
try:
    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    TRT_AVAILABLE = True
    print(f"✅ TensorRT version: {trt.__version__}")
except ImportError as e:
    TRT_AVAILABLE = False
    print(f"⚠️  TensorRT is not available: {e}")
    print("   Install TensorRT (and PyCUDA) to use the full functionality")

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("🔧 Environment check")
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"ONNX version: {onnx.__version__}")
print(f"ONNX Runtime version: {ort.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Compute capability: {torch.cuda.get_device_capability()}")

In [None]:
# Set up the working directories
BASE_DIR = Path.cwd()
MODEL_REPO = BASE_DIR / "model_repository_tensorrt"
ONNX_DIR = BASE_DIR / "onnx_models"
TRT_DIR = BASE_DIR / "tensorrt_engines"
BENCHMARK_DIR = BASE_DIR / "benchmarks"
SCRIPTS_DIR = BASE_DIR / "scripts"

# Create the required directories
for dir_path in [MODEL_REPO, ONNX_DIR, TRT_DIR, BENCHMARK_DIR, SCRIPTS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"📁 Working directory: {BASE_DIR}")
print(f"📁 Model repository: {MODEL_REPO}")
print(f"📁 ONNX models: {ONNX_DIR}")
print(f"📁 TensorRT engines: {TRT_DIR}")
print(f"📁 Benchmarks: {BENCHMARK_DIR}")

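## 2. Building the Reference PyTorch Model

We create a model that is well suited to TensorRT optimization and use it to demonstrate the conversion pipeline and the performance comparison.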

In [None]:
class EfficientTextClassifier(nn.Module):
    """Text classification model designed to be TensorRT-friendly"""

    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=5):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # 1D convolution layers (well supported by TensorRT)
        self.conv_layers = nn.Sequential(
            nn.Conv1d(embed_dim, hidden_dim // 2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim // 2, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

        # Global max pooling
        self.global_pool = nn.AdaptiveMaxPool1d(1)

        # Classification head
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes)
        )

        # Weight initialization
        self._initialize_weights()

    def _initialize_weights(self):
        """Initialize the weights of all layers"""
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.Embedding):
                nn.init.normal_(m.weight, 0, 0.1)

    def forward(self, input_ids):
        # input_ids shape: [batch_size, sequence_length]

        # Embedding: [batch_size, sequence_length, embed_dim]
        embedded = self.embedding(input_ids)

        # Transpose for convolution: [batch_size, embed_dim, sequence_length]
        embedded = embedded.transpose(1, 2)

        # Convolutional feature extraction
        conv_output = self.conv_layers(embedded)

        # Global pooling: [batch_size, hidden_dim, 1]
        pooled = self.global_pool(conv_output)

        # Flatten: [batch_size, hidden_dim]
        flattened = pooled.squeeze(2)

        # Classification
        logits = self.classifier(flattened)

        return logits

# Create and initialize the model
model = EfficientTextClassifier(
    vocab_size=10000,
    embed_dim=128,
    hidden_dim=256,
    num_classes=5
)

# Move the model to the target device and switch to inference mode
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

print("🎯 EfficientTextClassifier created")
print(f"📊 Parameter count: {sum(p.numel() for p in model.parameters()):,}")
print(f"🔧 Device: {device}")

# Compute the model size
param_size = sum(p.numel() * p.element_size() for p in model.parameters())
buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
model_size_mb = (param_size + buffer_size) / 1024**2
print(f"💾 Model size: {model_size_mb:.2f} MB")
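
The parameter count and model size printed by this cell can be cross-checked by hand. The sketch below does the arithmetic for the default constructor arguments, assuming FP32 weights (4 bytes each); the helper names are illustrative and not part of the notebook.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output feature."""
    return in_features * out_features + out_features


vocab_size, embed_dim, hidden_dim, num_classes = 10000, 128, 256, 5

total_params = (
    embedding_params(vocab_size, embed_dim)            # embedding table
    + conv1d_params(embed_dim, hidden_dim // 2, 3)     # first Conv1d
    + conv1d_params(hidden_dim // 2, hidden_dim, 3)    # second Conv1d
    + linear_params(hidden_dim, hidden_dim // 2)       # first Linear
    + linear_params(hidden_dim // 2, num_classes)      # output Linear
)
size_mb = total_params * 4 / 1024**2  # FP32: 4 bytes per parameter

print(f"{total_params:,} parameters, {size_mb:.2f} MB in FP32")
# prints "1,461,381 parameters, 5.57 MB in FP32"
```

The embedding table accounts for about 88% of the parameters, so the vocabulary size dominates the size of this model.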

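## 3. PyTorch to ONNX Conversion

### Conversion Pipeline
PyTorch → ONNX → TensorRT is the standard optimization path: the model is exported to ONNX, and TensorRT's ONNX parser reads that file to build an engine.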
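
`torch.onnx.export` marks dimensions as dynamic through its `dynamic_axes` argument, a mapping from tensor name to `{axis_index: axis_label}`. The sketch below builds that mapping from a per-axis list of labels; `build_dynamic_axes` is a hypothetical helper for illustration, not part of PyTorch.

```python
from typing import Dict, List, Optional


def build_dynamic_axes(axis_labels: Dict[str, List[Optional[str]]]) -> Dict[str, Dict[int, str]]:
    """Turn per-axis labels into a dynamic_axes mapping.

    A label of None means the axis stays static; tensors without any
    dynamic axis are left out of the result.
    """
    dynamic_axes: Dict[str, Dict[int, str]] = {}
    for tensor_name, labels in axis_labels.items():
        dynamic = {index: label for index, label in enumerate(labels) if label is not None}
        if dynamic:
            dynamic_axes[tensor_name] = dynamic
    return dynamic_axes


axes = build_dynamic_axes({
    "input_ids": ["batch_size", "sequence_length"],  # both axes dynamic
    "logits": ["batch_size", None],                   # class axis stays fixed
})
print(axes)
```

Passing an empty mapping instead exports every dimension with the static size of the dummy input, which is what the fixed-shape export in this section does.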

In [None]:
@dataclass
class ONNXExportConfig:
    """ONNX export configuration"""
    input_names: List[str]
    output_names: List[str]
    dynamic_axes: Dict[str, Any]
    opset_version: int = 11
    do_constant_folding: bool = True
    verbose: bool = False

class PyTorchToONNXConverter:
    """PyTorch to ONNX converter"""

    def __init__(self, model: nn.Module, device: torch.device):
        self.model = model
        self.device = device
        self.model.eval()

    def export_onnx(self,
                   dummy_input: torch.Tensor,
                   onnx_path: Path,
                   config: ONNXExportConfig) -> bool:
        """Export the model to ONNX and validate the result"""

        try:
            logger.info(f"Exporting ONNX model to: {onnx_path}")

            with torch.no_grad():
                torch.onnx.export(
                    self.model,
                    dummy_input,
                    str(onnx_path),
                    export_params=True,
                    opset_version=config.opset_version,
                    do_constant_folding=config.do_constant_folding,
                    input_names=config.input_names,
                    output_names=config.output_names,
                    dynamic_axes=config.dynamic_axes,
                    verbose=config.verbose
                )

            # Validate the exported ONNX model
            self._validate_onnx_model(onnx_path, dummy_input)

            logger.info("ONNX export succeeded")
            return True

        except Exception as e:
            logger.error(f"ONNX export failed: {e}")
            return False

    def _validate_onnx_model(self, onnx_path: Path, dummy_input: torch.Tensor):
        """Check the ONNX graph and compare its output with PyTorch"""

        # Load the ONNX model and run the structural checker
        onnx_model = onnx.load(str(onnx_path))
        onnx.checker.check_model(onnx_model)

        # Reference output from PyTorch
        with torch.no_grad():
            pytorch_output = self.model(dummy_input).cpu().numpy()

        # ONNX Runtime inference. The provider is named explicitly because
        # GPU builds of ONNX Runtime 1.9+ require the `providers` argument.
        ort_session = ort.InferenceSession(
            str(onnx_path), providers=["CPUExecutionProvider"]
        )
        ort_inputs = {ort_session.get_inputs()[0].name: dummy_input.cpu().numpy()}
        onnx_output = ort_session.run(None, ort_inputs)[0]

        # Check that the two outputs agree
        max_diff = np.max(np.abs(pytorch_output - onnx_output))
        logger.info(f"Max difference, PyTorch vs ONNX: {max_diff:.6f}")

        if max_diff > 1e-3:
            logger.warning(f"Outputs differ noticeably: {max_diff}")
        else:
            logger.info("Output consistency check passed")
# Prepare the export configurations
export_configs = {
    # Fixed-shape version (allows the most aggressive optimization)
    "fixed": ONNXExportConfig(
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={},  # no dynamic axes
        opset_version=11
    ),

    # Dynamic-shape version (for flexibility)
    "dynamic": ONNXExportConfig(
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence_length"},
            "logits": {0: "batch_size"}
        },
        opset_version=11
    )
}

# Create the converter
converter = PyTorchToONNXConverter(model, device)

# Export each version of the ONNX model
export_results = {}

for config_name, config in export_configs.items():
    print(f"\n🔄 Exporting {config_name} ONNX model")

    # Create an example input
    if config_name == "fixed":
        # Fixed shape: batch=4, seq_len=128
        dummy_input = torch.randint(1, 1000, (4, 128), dtype=torch.long, device=device)
        onnx_path = ONNX_DIR / f"text_classifier_{config_name}_b4_s128.onnx"
    else:
        # Dynamic shape: batch=1, seq_len=64 (example values only)
        dummy_input = torch.randint(1, 1000, (1, 64), dtype=torch.long, device=device)
        onnx_path = ONNX_DIR / f"text_classifier_{config_name}.onnx"

    # Run the export
    success = converter.export_onnx(dummy_input, onnx_path, config)
    export_results[config_name] = {
        "path": onnx_path,
        "success": success,
        "dummy_input_shape": dummy_input.shape
    }

    if success:
        # Report the file size
        file_size_mb = onnx_path.stat().st_size / 1024**2
        print(f"  ✅ Export succeeded, file size: {file_size_mb:.2f} MB")
        print(f"  📄 File path: {onnx_path}")
    else:
        print(f"  ❌ Export failed")

print(f"\n📊 ONNX export summary: {sum(1 for r in export_results.values() if r['success'])}/{len(export_results)} succeeded")

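## 4. Building TensorRT Engines

### Engine Optimization Strategies
- **FP16 precision**: up to about 2x faster than FP32 on GPUs with fast FP16 support, usually with minimal accuracy loss
- **INT8 quantization**: up to about 4x faster than FP32, but requires calibration data
- **Dynamic shapes**: one engine supports a range of input sizes
- **Layer fusion**: TensorRT fuses compatible layers automatically while optimizing the graph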
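
To make the precision trade-off concrete, the sketch below shows what reduced precision does to individual values: a round trip through IEEE 754 half precision, and symmetric INT8 quantization with a scale derived from calibration samples. This is a simplified illustration using max-abs calibration and made-up sample values; TensorRT's own calibrators choose scales with different algorithms, and all function names here are hypothetical.

```python
import struct
from typing import Iterable


def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def calibrate_scale(samples: Iterable[float]) -> float:
    """Max-abs calibration: map the largest observed magnitude to 127."""
    max_abs = max(abs(x) for x in samples)
    if max_abs == 0:
        raise ValueError("calibration samples must contain a non-zero value")
    return max_abs / 127.0


def quantize(value: float, scale: float) -> int:
    """Quantize to a signed 8-bit integer, clamping to [-127, 127]."""
    return max(-127, min(127, round(value / scale)))


def dequantize(quantized: int, scale: float) -> float:
    """Map an INT8 value back to the real-valued range."""
    return quantized * scale


# FP16 stores only about three significant decimal digits.
print(f"0.1 as FP16: {to_fp16(0.1)}")

# INT8 needs representative data to choose the scale, hence the calibration step.
calibration_samples = [-2.54, 1.0, 0.3]  # made-up activations
scale = calibrate_scale(calibration_samples)
for x in calibration_samples:
    q = quantize(x, scale)
    print(f"{x:+.2f} -> {q:+4d} -> {dequantize(q, scale):+.4f}")
```

Values outside the calibrated range are clamped, which is why calibration data should resemble production inputs.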

In [None]:
if TRT_AVAILABLE:

    @dataclass
    class TensorRTBuildConfig:
        """TensorRT build configuration"""
        precision: str = "fp16"  # fp32, fp16, int8
        max_batch_size: int = 16
        max_workspace_size: int = 1 << 30  # 1 GiB
        optimization_level: int = 3
        enable_dynamic_shapes: bool = True
        min_shapes: Optional[Dict[str, Tuple]] = None
        opt_shapes: Optional[Dict[str, Tuple]] = None
        max_shapes: Optional[Dict[str, Tuple]] = None

    class TensorRTEngineBuilder:
        """Builds TensorRT engines from ONNX models"""

        def __init__(self):
            self.logger = trt.Logger(trt.Logger.INFO)
            self.builder = trt.Builder(self.logger)

        def build_engine_from_onnx(self,
                                  onnx_path: Path,
                                  engine_path: Path,
                                  config: TensorRTBuildConfig) -> bool:
            """Build a TensorRT engine from an ONNX model and save it to disk"""

            try:
                logger.info(f"Building TensorRT engine: {engine_path}")
                logger.info(f"Precision mode: {config.precision}")

                # Create the network in explicit-batch mode
                network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
                network = self.builder.create_network(network_flags)
                parser = trt.OnnxParser(network, self.logger)

                # Parse the ONNX model
                with open(onnx_path, 'rb') as model:
                    if not parser.parse(model.read()):
                        logger.error("Failed to parse the ONNX model")
                        for error in range(parser.num_errors):
                            logger.error(parser.get_error(error))
                        return False

                # Create the builder config. `max_workspace_size` is deprecated;
                # TensorRT 8.4+ sets the workspace limit through the memory pool API.
                build_config = self.builder.create_builder_config()
                build_config.set_memory_pool_limit(
                    trt.MemoryPoolType.WORKSPACE, config.max_workspace_size
                )

                # Select the precision
                if config.precision == "fp16":
                    if self.builder.platform_has_fast_fp16:
                        build_config.set_flag(trt.BuilderFlag.FP16)
                        logger.info("FP16 precision enabled")
                    else:
                        logger.warning("GPU has no fast FP16 support, using FP32")
                elif config.precision == "int8":
                    if self.builder.platform_has_fast_int8:
                        build_config.set_flag(trt.BuilderFlag.INT8)
                        # A calibrator is required here; this simplified example omits it
                        logger.info("INT8 precision enabled (calibration data required)")
                    else:
                        logger.warning("GPU has no fast INT8 support, using FP32")

                # Dynamic shape configuration
                if config.enable_dynamic_shapes and config.min_shapes:
                    profile = self.builder.create_optimization_profile()
                    opt_shapes = config.opt_shapes or {}
                    max_shapes = config.max_shapes or {}

                    for input_name, min_shape in config.min_shapes.items():
                        # Fall back so that min <= opt <= max always holds
                        opt_shape = opt_shapes.get(input_name, min_shape)
                        max_shape = max_shapes.get(input_name, opt_shape)

                        profile.set_shape(input_name, min_shape, opt_shape, max_shape)
                        logger.info(f"Dynamic shape {input_name}: min={min_shape}, opt={opt_shape}, max={max_shape}")

                    build_config.add_optimization_profile(profile)

                # Build the serialized engine. `build_engine` is deprecated, so use
                # `build_serialized_network` (available since TensorRT 8.0).
                logger.info("Building the engine... (this can take several minutes)")
                serialized_engine = self.builder.build_serialized_network(network, build_config)

                if serialized_engine is None:
                    logger.error("Engine build failed")
                    return False

                # Save the serialized engine
                with open(engine_path, 'wb') as f:
                    f.write(serialized_engine)

                logger.info(f"TensorRT engine saved: {engine_path}")

                # Deserialize the engine to report its inputs and outputs
                runtime = trt.Runtime(self.logger)
                engine = runtime.deserialize_cuda_engine(serialized_engine)
                if engine is not None:
                    self._print_engine_info(engine, engine_path)

                return True

            except Exception as e:
                logger.error(f"TensorRT engine build error: {e}")
                return False

        def _print_engine_info(self, engine, engine_path: Path):
            """Log the engine's file size and I/O tensors (needs TensorRT 8.5+)"""
            file_size_mb = engine_path.stat().st_size / 1024**2
            logger.info(f"Engine file size: {file_size_mb:.2f} MB")

            # The name-based tensor API replaces the deprecated binding API
            names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
            input_names = [
                n for n in names
                if engine.get_tensor_mode(n) == trt.TensorIOMode.INPUT
            ]
            logger.info(f"Inputs: {len(input_names)}, outputs: {len(names) - len(input_names)}")

            for name in names:
                shape = engine.get_tensor_shape(name)
                dtype = engine.get_tensor_dtype(name)
                kind = "input" if name in input_names else "output"
                logger.info(f"  {kind} {name}: {shape} ({dtype})")
    
    # Create the TensorRT engine builder
    builder = TensorRTEngineBuilder()

    # Build engines with different precisions
    build_configs = {
        "fp32": TensorRTBuildConfig(
            precision="fp32",
            max_batch_size=16,
            enable_dynamic_shapes=False
        ),
        "fp16": TensorRTBuildConfig(
            precision="fp16",
            max_batch_size=16,
            enable_dynamic_shapes=False
        ),
        "dynamic_fp16": TensorRTBuildConfig(
            precision="fp16",
            max_batch_size=0,  # batch range comes from the optimization profile
            enable_dynamic_shapes=True,
            min_shapes={"input_ids": (1, 16)},
            opt_shapes={"input_ids": (4, 128)},
            max_shapes={"input_ids": (16, 512)}
        )
    }

    build_results = {}

    for config_name, build_config in build_configs.items():
        print(f"\n🔧 Building {config_name} TensorRT engine")

        # Pick the matching ONNX model
        if "dynamic" in config_name:
            onnx_path = export_results["dynamic"]["path"]
        else:
            onnx_path = export_results["fixed"]["path"]

        if not export_results["dynamic" if "dynamic" in config_name else "fixed"]["success"]:
            print(f"  ⚠️  Skipping {config_name}: ONNX model not available")
            continue

        engine_path = TRT_DIR / f"text_classifier_{config_name}.engine"

        # Build the engine
        success = builder.build_engine_from_onnx(onnx_path, engine_path, build_config)
        build_results[config_name] = {
            "path": engine_path,
            "success": success,
            "config": build_config
        }

        if success:
            print(f"  ✅ Engine built successfully")
        else:
            print(f"  ❌ Engine build failed")

    print(f"\n📊 TensorRT engine build summary: {sum(1 for r in build_results.values() if r['success'])}/{len(build_results)} succeeded")

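else:
    print("⚠️  TensorRT is not installed, skipping the engine build")
    print("   The trtexec command-line tool can be used instead")

    # Generate example trtexec commands
    trtexec_commands = []

    if export_results["fixed"]["success"]:
        onnx_path = export_results["fixed"]["path"]
        engine_path = TRT_DIR / "text_classifier_fp16.engine"

        # `\\` keeps the shell line continuations in the printed command.
        # `--memPoolSize=workspace:<MiB>` replaces the deprecated `--workspace`.
        cmd = f"""trtexec --onnx={onnx_path} \\
  --saveEngine={engine_path} \\
  --fp16 \\
  --memPoolSize=workspace:1024 \\
  --verbose"""

        trtexec_commands.append(("FP16 engine", cmd))

    if trtexec_commands:
        print("\n📜 trtexec command reference:")
        for name, cmd in trtexec_commands:
            print(f"\n{name}:")
            print(cmd)

    build_results = {}  # empty dict so the following cells still run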
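
The optimization profile used for the dynamic engine has to satisfy `min <= opt <= max` in every dimension, and at inference time each input shape has to fall between `min` and `max`. The sketch below checks both conditions in plain Python; the function names are hypothetical and the code does not call TensorRT.

```python
from typing import Sequence


def validate_profile(min_shape: Sequence[int],
                     opt_shape: Sequence[int],
                     max_shape: Sequence[int]) -> None:
    """Raise ValueError unless 0 <= min <= opt <= max holds for every dimension."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        raise ValueError("min/opt/max shapes must have the same rank")
    for axis, (lo, opt, hi) in enumerate(zip(min_shape, opt_shape, max_shape)):
        if not (0 <= lo <= opt <= hi):
            raise ValueError(f"axis {axis}: expected min <= opt <= max, got {lo}, {opt}, {hi}")


def shape_in_profile(shape: Sequence[int],
                     min_shape: Sequence[int],
                     max_shape: Sequence[int]) -> bool:
    """Return True if the runtime shape lies inside the profile's range."""
    return len(shape) == len(min_shape) and all(
        lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape)
    )


# Same values as the dynamic_fp16 configuration: (batch, sequence_length)
profile = {"min": (1, 16), "opt": (4, 128), "max": (16, 512)}
validate_profile(profile["min"], profile["opt"], profile["max"])

for candidate in [(8, 256), (32, 128)]:
    ok = shape_in_profile(candidate, profile["min"], profile["max"])
    print(f"{candidate}: {'inside' if ok else 'outside'} the profile")
```

A batch of 32 falls outside this profile, so such a request would have to be split or the engine rebuilt with a larger `max` shape.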

## 5. Triton TensorRT Backend Configuration

### Creating the Triton Model Repository Layout
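
Triton expects each model in its own directory, with `config.pbtxt` at the model level and the serialized engine as `model.plan` inside a numbered version directory. The sketch below creates that layout in a temporary directory with empty placeholder files; `create_model_layout` is a hypothetical helper used only for illustration.

```python
import tempfile
from pathlib import Path


def create_model_layout(repo_root: Path, model_name: str, version: int = 1) -> Path:
    """Create <repo>/<model>/config.pbtxt and <repo>/<model>/<version>/model.plan."""
    version_dir = repo_root / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (repo_root / model_name / "config.pbtxt").touch()  # placeholder model config
    (version_dir / "model.plan").touch()               # placeholder for the engine
    return version_dir


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository_tensorrt"
    create_model_layout(repo, "text_classifier_trt_fp16")
    layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))

for entry in layout:
    print(entry)
```

The generator class in the next cell produces the same structure, copying the real engine and writing a real `config.pbtxt`.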

In [None]:
class TritonTensorRTModelGenerator:
    """Generates Triton model repository entries for TensorRT engines"""

    def __init__(self, model_repo_path: Path):
        self.model_repo = model_repo_path

    def create_tensorrt_model(self,
                             model_name: str,
                             engine_path: Path,
                             input_spec: Dict[str, Any],
                             output_spec: Dict[str, Any],
                             config_params: Optional[Dict[str, Any]] = None) -> bool:
        """Create the model directory, copy the engine, and write config.pbtxt"""

        model_dir = self.model_repo / model_name
        version_dir = model_dir / "1"
        version_dir.mkdir(parents=True, exist_ok=True)

        # Copy the engine file
        if engine_path.exists():
            target_engine = version_dir / "model.plan"
            shutil.copy2(engine_path, target_engine)
            logger.info(f"TensorRT engine copied to: {target_engine}")
        else:
            logger.error(f"Engine file not found: {engine_path}")
            return False

        # Generate the config file
        config = self._generate_config(model_name, input_spec, output_spec, config_params)
        config_path = model_dir / "config.pbtxt"

        with open(config_path, "w", encoding="utf-8") as f:
            f.write(config)

        logger.info(f"Config file created: {config_path}")
        return True

    def _generate_config(self,
                        model_name: str,
                        input_spec: Dict[str, Any],
                        output_spec: Dict[str, Any],
                        config_params: Optional[Dict[str, Any]] = None) -> str:
        """Generate the config.pbtxt text for a TensorRT model"""

        config_params = config_params or {}
        max_batch_size = config_params.get("max_batch_size", 16)

        # Basic settings
        config_lines = [
            f'name: "{model_name}"',
            'platform: "tensorrt_plan"',
            f'max_batch_size: {max_batch_size}',
            ''
        ]

        # Inputs
        for input_name, spec in input_spec.items():
            config_lines.extend([
                'input {',
                f'  name: "{input_name}"',
                f'  data_type: {spec["data_type"]}',
                f'  dims: {spec["dims"]}',
                '}',
                ''
            ])

        # Outputs
        for output_name, spec in output_spec.items():
            config_lines.extend([
                'output {',
                f'  name: "{output_name}"',
                f'  data_type: {spec["data_type"]}',
                f'  dims: {spec["dims"]}',
                '}',
                ''
            ])

        # Instance group
        instance_count = config_params.get("instance_count", 1)
        config_lines.extend([
            'instance_group {',
            f'  count: {instance_count}',
            '  kind: KIND_GPU',
            '}',
            ''
        ])

        # Dynamic batching
        if config_params.get("enable_dynamic_batching", True):
            preferred_batch_sizes = config_params.get("preferred_batch_sizes", [1, 4, 8])
            config_lines.extend([
                'dynamic_batching {',
                f'  preferred_batch_size: {preferred_batch_sizes}',
                f'  max_queue_delay_microseconds: {config_params.get("max_queue_delay", 100)}',
                '}',
                ''
            ])

        # A CUDA graph spec must use batch_size 0 when the model has
        # max_batch_size 0; otherwise it must lie in [1, max_batch_size].
        graph_batch_size = 1 if max_batch_size > 0 else 0

        # TensorRT-specific optimization and warmup
        config_lines.extend([
            '# TensorRT optimization settings',
            'optimization {',
            '  cuda {',
            '    graphs: true',
            '    graph_spec {',
            f'      batch_size: {graph_batch_size}',
            '      input {',
            f'        key: "{list(input_spec.keys())[0]}"',
            '        value {',
            f'          dim: [ {config_params.get("opt_input_dims", "128")} ]',
            '        }',
            '      }',
            '    }',
            '  }',
            '}',
            '',
            '# Model warmup',
            'model_warmup {',
            '  name: "warmup_sample"',
            '  batch_size: 1',
            '  inputs {',
            f'    key: "{list(input_spec.keys())[0]}"',
            '    value {',
            f'      data_type: {list(input_spec.values())[0]["data_type"]}',
            f'      dims: [ {config_params.get("warmup_dims", "128")} ]',
            '      zero_data: true',
            '    }',
            '  }',
            '}'
        ])

        return '\n'.join(config_lines)

# Create the Triton model generator
model_generator = TritonTensorRTModelGenerator(MODEL_REPO)

# Create a Triton model for every engine that was built successfully
triton_models = {}

if TRT_AVAILABLE and build_results:
    # TensorRT releases before 10 have no native INT64, so the ONNX parser
    # turns the int64 `input_ids` into an INT32 engine input. Assumption:
    # TensorRT 10+ keeps INT64. Check the engine info logged during the build.
    trt_major = int(trt.__version__.split(".")[0])
    input_dtype = "TYPE_INT64" if trt_major >= 10 else "TYPE_INT32"

    for config_name, result in build_results.items():
        if result["success"]:
            model_name = f"text_classifier_trt_{config_name}"

            # Input/output specs
            if "dynamic" in config_name:
                # The profile covers batch sizes 1-16, so Triton can batch;
                # dims omit the batch dimension.
                max_batch_size = 16
                input_dims = [-1]  # dynamic sequence length
                output_dims = [5]  # 5 classes
                opt_dims = "128"
            else:
                # Built from the fixed (4, 128) export: the engine accepts only
                # that exact shape, so batching is off and dims include the batch.
                max_batch_size = 0
                input_dims = [4, 128]
                output_dims = [4, 5]
                opt_dims = "4, 128"

            input_spec = {
                "input_ids": {"data_type": input_dtype, "dims": input_dims}
            }
            output_spec = {
                "logits": {"data_type": "TYPE_FP32", "dims": output_dims}
            }

            # Config parameters
            config_params = {
                "max_batch_size": max_batch_size,
                "instance_count": 1,
                "enable_dynamic_batching": max_batch_size > 0,
                "preferred_batch_sizes": [1, 4, 8] if max_batch_size > 0 else [],
                "opt_input_dims": opt_dims,
                "warmup_dims": opt_dims
            }

            # Create the model
            success = model_generator.create_tensorrt_model(
                model_name,
                result["path"],
                input_spec,
                output_spec,
                config_params
            )

            triton_models[model_name] = {
                "engine_path": result["path"],
                "success": success,
                "config_name": config_name
            }

            if success:
                print(f"✅ Triton model created: {model_name}")
            else:
                print(f"❌ Failed to create Triton model: {model_name}")

# Summary
print(f"\n📊 Triton TensorRT model creation summary:")
print(f"   Succeeded: {sum(1 for m in triton_models.values() if m['success'])}")
print(f"   Total: {len(triton_models)}")

if not TRT_AVAILABLE:
    print("\n⚠️  TensorRT is not installed, model creation was skipped")
    print("   Install TensorRT to run the full workflow")

## 6. Performance Benchmarking System

In [None]:
from tritonclient.utils import triton_to_np_dtype

class ComprehensiveBenchmark:
    """Benchmark harness for models served by Triton"""

    def __init__(self, triton_url: str = "localhost:8000", input_datatype: str = "INT64"):
        self.triton_url = triton_url
        # Triton datatype of `input_ids`. Use "INT32" when the engine exposes an
        # INT32 input, which is typical for TensorRT releases before 10.
        self.input_datatype = input_datatype
        self.results = {}

    def benchmark_model(self,
                       model_name: str,
                       test_data: List[np.ndarray],
                       iterations: int = 100,
                       warmup_iterations: int = 10) -> Dict[str, Any]:
        """Benchmark a single model"""

        try:
            client = httpclient.InferenceServerClient(url=self.triton_url)

            if not client.is_model_ready(model_name):
                logger.warning(f"Model {model_name} is not ready")
                return {"error": "Model not ready"}

            logger.info(f"Starting benchmark: {model_name}")

            # Warmup
            for i in range(warmup_iterations):
                test_input = test_data[i % len(test_data)]
                self._single_inference(client, model_name, test_input)

            logger.info(f"Warmup done, starting measured run ({iterations} iterations)")

            # Measured run; perf_counter is monotonic and high resolution
            latencies = []
            errors = 0
            start_time = time.perf_counter()

            for i in range(iterations):
                test_input = test_data[i % len(test_data)]

                try:
                    request_start = time.perf_counter()
                    response = self._single_inference(client, model_name, test_input)
                    request_end = time.perf_counter()

                    latency = (request_end - request_start) * 1000
                    latencies.append(latency)

                except Exception as e:
                    errors += 1
                    logger.error(f"Inference error: {e}")

            end_time = time.perf_counter()
            total_time = end_time - start_time

            # Compute statistics
            if latencies:
                stats = {
                    "model_name": model_name,
                    "total_requests": iterations,
                    "successful_requests": len(latencies),
                    "error_count": errors,
                    "error_rate_percent": (errors / iterations) * 100,
                    "total_time_seconds": total_time,
                    "throughput_rps": len(latencies) / total_time,
                    "latency_ms": {
                        "mean": np.mean(latencies),
                        "median": np.median(latencies),
                        "p95": np.percentile(latencies, 95),
                        "p99": np.percentile(latencies, 99),
                        "min": np.min(latencies),
                        "max": np.max(latencies),
                        "std": np.std(latencies)
                    },
                    "raw_latencies": latencies
                }
            else:
                stats = {
                    "model_name": model_name,
                    "total_requests": iterations,
                    "successful_requests": 0,
                    "error_count": errors,
                    "error_rate_percent": 100.0,
                    "throughput_rps": 0
                }

            return stats

        except Exception as e:
            logger.error(f"Benchmark failed for {model_name}: {e}")
            return {"error": str(e)}

    def _single_inference(self, client, model_name: str, input_data: np.ndarray):
        """Run a single inference request"""
        # Cast to the dtype the served model expects before sending
        input_data = input_data.astype(triton_to_np_dtype(self.input_datatype), copy=False)
        inputs = [httpclient.InferInput("input_ids", list(input_data.shape), self.input_datatype)]
        inputs[0].set_data_from_numpy(input_data)

        response = client.infer(model_name, inputs)
        return response.as_numpy("logits")
    
    def compare_models(self,
                      model_names: List[str],
                      test_data: List[np.ndarray],
                      iterations: int = 100) -> Dict[str, Any]:
        """Benchmark several models and collect their results"""

        comparison_results = {}

        for model_name in model_names:
            print(f"\n🚀 Benchmarking model: {model_name}")
            result = self.benchmark_model(model_name, test_data, iterations)
            comparison_results[model_name] = result

            if "error" not in result and "latency_ms" in result:
                latency_stats = result["latency_ms"]
                print(f"  ⚡ Throughput: {result['throughput_rps']:.2f} RPS")
                print(f"  📊 Mean latency: {latency_stats['mean']:.2f}ms")
                print(f"  📈 P95 latency: {latency_stats['p95']:.2f}ms")
                print(f"  ❌ Error rate: {result['error_rate_percent']:.2f}%")
            else:
                print(f"  ❌ Benchmark failed: {result.get('error', 'unknown error')}")

        return comparison_results

    def generate_performance_report(self, results: Dict[str, Any]) -> str:
        """Generate a Markdown performance report"""

        report_lines = [
            "# TensorRT vs PyTorch Performance Comparison Report",
            "=" * 50,
            ""
        ]

        # Keep only successful results
        successful_results = {
            k: v for k, v in results.items()
            if "error" not in v and "latency_ms" in v
        }

        if not successful_results:
            return "No benchmark results available\n"

        # Sort by throughput, highest first
        sorted_results = sorted(
            successful_results.items(),
            key=lambda x: x[1]["throughput_rps"],
            reverse=True
        )

        # Detailed results table
        report_lines.extend([
            "## Detailed Performance Metrics",
            "",
            "| Model | Throughput (RPS) | Mean latency (ms) | P95 latency (ms) | Error rate (%) |",
            "|-------|------------------|-------------------|------------------|----------------|"
        ])

        for model_name, result in sorted_results:
            throughput = result["throughput_rps"]
            mean_latency = result["latency_ms"]["mean"]
            p95_latency = result["latency_ms"]["p95"]
            error_rate = result["error_rate_percent"]

            report_lines.append(
                f"| {model_name} | {throughput:.2f} | {mean_latency:.2f} | {p95_latency:.2f} | {error_rate:.2f} |"
            )

        # Speedup analysis
        if len(sorted_results) > 1:
            baseline_name, baseline_result = sorted_results[-1]  # slowest model is the baseline
            baseline_throughput = baseline_result["throughput_rps"]
            baseline_latency = baseline_result["latency_ms"]["mean"]

            report_lines.extend([
                "",
                "## Speedup Analysis",
                f"(baseline: {baseline_name})",
                ""
            ])

            for model_name, result in sorted_results[:-1]:
                throughput_speedup = result["throughput_rps"] / baseline_throughput
                latency_speedup = baseline_latency / result["latency_ms"]["mean"]

                report_lines.append(
                    f"- **{model_name}**: {throughput_speedup:.2f}x throughput, {latency_speedup:.2f}x lower latency"
                )

        # Recommendations
        report_lines.extend([
            "",
            "## Deployment Recommendations",
            "",
            "### High-throughput scenarios",
            f"- Recommended: {sorted_results[0][0]}",
            f"- Throughput: {sorted_results[0][1]['throughput_rps']:.2f} RPS",
            "",
            "### Low-latency scenarios",
        ])

        # Find the model with the lowest mean latency
        lowest_latency = min(
            successful_results.items(),
            key=lambda x: x[1]["latency_ms"]["mean"]
        )

        report_lines.extend([
            f"- Recommended: {lowest_latency[0]}",
            f"- Mean latency: {lowest_latency[1]['latency_ms']['mean']:.2f} ms"
        ])

        return "\n".join(report_lines)

# Create the benchmark runner
benchmark = ComprehensiveBenchmark()

# Prepare test data
def generate_tensorrt_test_data(num_samples: int = 50) -> List[np.ndarray]:
    """Generate random token-ID sequences for the TensorRT benchmarks."""
    test_data = []
    
    # Fixed-length sequences (suitable for fixed-shape engines)
    lengths = [128]  # TensorRT usually performs better with fixed shapes
    
    for _ in range(num_samples):
        length = int(np.random.choice(lengths))
        # Random token IDs in [1, 5000)
        sequence = np.random.randint(1, 5000, size=(1, length), dtype=np.int64)
        test_data.append(sequence)
    
    return test_data

tensorrt_test_data = generate_tensorrt_test_data()
print(f"📋 TensorRT test data generated: {len(tensorrt_test_data)} samples")
print(f"📏 Sequence lengths: {[data.shape[1] for data in tensorrt_test_data[:5]]}")

# List the models that were configured successfully
available_models = [name for name, info in triton_models.items() if info["success"]]
print("\n📋 Available TensorRT models:")
for model in available_models:
    print(f"  - {model}")

if not available_models:
    print("⚠️  No TensorRT models are available for testing")
    print("   Make sure the TensorRT engines were built successfully and Triton Server is running")

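The report reads `throughput_rps`, `latency_ms["mean"]`, and `latency_ms["p95"]` from each result. As a minimal sketch of how such a summary could be derived from raw per-request latencies, the hypothetical helper `summarize_latencies` below (its name, the extra `p50`/`p99` fields, and the `wall_time_s` argument are illustrative assumptions, not part of `ComprehensiveBenchmark`) uses `np.percentile`:

```python
import numpy as np

def summarize_latencies(latencies_ms, wall_time_s):
    """Summarize raw latencies into the result layout read by the report.

    Hypothetical helper for illustration: `latencies_ms` holds one latency
    per successful request and `wall_time_s` is the total elapsed time.
    """
    arr = np.asarray(latencies_ms, dtype=np.float64)
    return {
        "throughput_rps": len(arr) / wall_time_s,
        "latency_ms": {
            "mean": float(arr.mean()),
            "p50": float(np.percentile(arr, 50)),
            "p95": float(np.percentile(arr, 95)),
            "p99": float(np.percentile(arr, 99)),
        },
        "raw_latencies": arr.tolist(),
    }

summary = summarize_latencies([10.0, 20.0, 30.0, 40.0], wall_time_s=0.1)
print(summary["throughput_rps"], summary["latency_ms"])
```

Keeping the raw latencies alongside the aggregates is what makes a later box plot of the distribution possible.
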
## 7. Running the Performance Comparison

In [None]:
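def wait_for_triton_server(max_wait_time: int = 60) -> bool:
    """Poll Triton Server about once per second until it is ready or the wait times out."""
    print("⏳ Checking Triton Server status...")
    
    for i in range(max_wait_time):
        try:
            client = httpclient.InferenceServerClient(url="localhost:8000")
            if client.is_server_ready():
                print("✅ Triton Server is ready")
                return True
        except Exception:
            # Server not reachable yet; retry. A bare `except:` would also
            # swallow KeyboardInterrupt and make the wait hard to cancel.
            pass
        
        time.sleep(1)
        if i % 10 == 0 and i > 0:
            print(f"  Waiting... ({i}/{max_wait_time}s)")
    
    print("❌ Timed out waiting for Triton Server")
    return False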

def run_tensorrt_benchmark():
    """Run the TensorRT performance benchmark."""
    
    if not wait_for_triton_server():
        print("Unable to connect to Triton Server")
        print("\n🚀 To start Triton Server, run:")
        print("   docker run -d --name triton-trt --gpus all \\")
        print("     -p 8000:8000 -p 8001:8001 -p 8002:8002 \\")
        print(f"     -v {MODEL_REPO.absolute()}:/models \\")
        print("     nvcr.io/nvidia/tritonserver:24.01-py3 \\")
        print("     tritonserver --model-repository=/models")
        return None
    
    # Check which models are available on the server
    try:
        client = httpclient.InferenceServerClient(url="localhost:8000")
        server_models = [m["name"] for m in client.get_model_repository_index()]
        
        # Keep only the models that are actually ready
        testable_models = []
        for model_name in available_models:
            if model_name in server_models:
                try:
                    if client.is_model_ready(model_name):
                        testable_models.append(model_name)
                        print(f"✅ Model ready: {model_name}")
                    else:
                        print(f"⚠️  Model not ready: {model_name}")
                except Exception as e:
                    print(f"❌ Model check failed: {model_name} ({e})")
        
        if not testable_models:
            print("❌ No testable models")
            return None
            
        print(f"\n🎯 Starting performance comparison ({len(testable_models)} models)")
        print("=" * 60)
        
        # Run the benchmark
        comparison_results = benchmark.compare_models(
            testable_models,
            tensorrt_test_data,
            iterations=100
        )
        
        # Generate the report
        performance_report = benchmark.generate_performance_report(comparison_results)
        
        # Save the report
        report_path = BENCHMARK_DIR / "tensorrt_performance_report.md"
        with open(report_path, "w", encoding="utf-8") as f:
            f.write(performance_report)
        
        print(f"\n📊 Performance report saved: {report_path}")
        print("\n" + "=" * 60)
        print(performance_report)
        
        return comparison_results
        
    except Exception as e:
        print(f"❌ Benchmark failed: {e}")
        return None

# Run the benchmark (if any models are available)
if available_models:
    print("🎯 Preparing to run the TensorRT performance benchmark")
    benchmark_results = run_tensorrt_benchmark()
else:
    print("⚠️  Skipping benchmark: no TensorRT models available")
    benchmark_results = None
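
The readiness check above counts loop iterations, so the real wait can exceed `max_wait_time` when each connection attempt is slow. A minimal sketch of a deadline-based alternative follows; `wait_until` and `fake_probe` are hypothetical names introduced here for illustration, and the probe stands in for a call such as `client.is_server_ready()`:

```python
import time

def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Call `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper for illustration. Exceptions raised by the
    predicate are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if predicate():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is starting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Stand-in for a readiness probe: fails twice, then succeeds.
calls = {"n": 0}

def fake_probe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not up yet")
    return True

ready = wait_until(fake_probe, timeout_s=5.0, interval_s=0.01)
print(ready, calls["n"])
```

`time.monotonic()` is used instead of `time.time()` so that system clock adjustments cannot shorten or lengthen the timeout.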

## 8. Performance Visualization

In [None]:
def create_performance_visualization(results: Dict[str, Any]):
    """Plot throughput, latency, latency distribution, and speedup charts."""
    
    if not results:
        print("⚠️  No performance data to visualize")
        return
    
    # Keep only successful results
    successful_results = {
        k: v for k, v in results.items() 
        if "error" not in v and "latency_ms" in v
    }
    
    if not successful_results:
        print("⚠️  No successful test results")
        return
    
    # Prepare the data
    model_names = list(successful_results.keys())
    throughputs = [result["throughput_rps"] for result in successful_results.values()]
    mean_latencies = [result["latency_ms"]["mean"] for result in successful_results.values()]
    p95_latencies = [result["latency_ms"]["p95"] for result in successful_results.values()]
    
    # Shorten model names for display
    display_names = []
    for name in model_names:
        if "fp32" in name:
            display_names.append("TensorRT FP32")
        elif "fp16" in name:
            if "dynamic" in name:
                display_names.append("TensorRT FP16\n(Dynamic)")
            else:
                display_names.append("TensorRT FP16")
        else:
            display_names.append(name.replace("text_classifier_trt_", "TRT "))
    
    # Create the figure (English labels also avoid missing-glyph warnings
    # with matplotlib's default font, which has no CJK coverage)
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('TensorRT Backend Performance Analysis', fontsize=16, fontweight='bold')
    
    # Color palette
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
    
    # 1. Throughput comparison
    axes[0, 0].bar(range(len(display_names)), throughputs, 
                   color=colors[:len(display_names)], alpha=0.8)
    axes[0, 0].set_title('Throughput Comparison (RPS)', fontweight='bold')
    axes[0, 0].set_xlabel('Model configuration')
    axes[0, 0].set_ylabel('Requests per second (RPS)')
    axes[0, 0].set_xticks(range(len(display_names)))
    axes[0, 0].set_xticklabels(display_names, rotation=45, ha='right')
    axes[0, 0].grid(axis='y', alpha=0.3)
    
    # Add value labels above the bars
    for i, v in enumerate(throughputs):
        axes[0, 0].text(i, v + max(throughputs) * 0.01, f'{v:.1f}', 
                       ha='center', va='bottom', fontweight='bold')
    
    # 2. Latency comparison
    x = np.arange(len(display_names))
    width = 0.35
    
    axes[0, 1].bar(x - width/2, mean_latencies, width, label='Mean latency', 
                   color='skyblue', alpha=0.8)
    axes[0, 1].bar(x + width/2, p95_latencies, width, label='P95 latency',
                   color='orange', alpha=0.8)
    axes[0, 1].set_title('Latency Comparison (ms)', fontweight='bold')
    axes[0, 1].set_xlabel('Model configuration')
    axes[0, 1].set_ylabel('Latency (ms)')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels(display_names, rotation=45, ha='right')
    axes[0, 1].legend()
    axes[0, 1].grid(axis='y', alpha=0.3)
    
    # 3. Latency distribution (box plot)
    if len(successful_results) > 1:
        latency_data = [result["raw_latencies"] for result in successful_results.values()]
        bp = axes[1, 0].boxplot(latency_data, labels=display_names, patch_artist=True)
        
        # Color the boxes
        for patch, color in zip(bp['boxes'], colors[:len(display_names)]):
            patch.set_facecolor(color)
            patch.set_alpha(0.7)
        
        axes[1, 0].set_title('Latency Distribution', fontweight='bold')
        axes[1, 0].set_xlabel('Model configuration')
        axes[1, 0].set_ylabel('Latency (ms)')
        axes[1, 0].tick_params(axis='x', rotation=45)
        axes[1, 0].grid(axis='y', alpha=0.3)
    else:
        axes[1, 0].text(0.5, 0.5, 'Multiple models are needed\nto compare distributions', 
                       ha='center', va='center', transform=axes[1, 0].transAxes)
    
    # 4. Speedup factors
    if len(successful_results) > 1:
        # Use the slowest model as the baseline
        baseline_throughput = min(throughputs)
        baseline_latency = max(mean_latencies)
        
        speedup_throughput = [t / baseline_throughput for t in throughputs]
        speedup_latency = [baseline_latency / l for l in mean_latencies]
        
        x = np.arange(len(display_names))
        axes[1, 1].bar(x - width/2, speedup_throughput, width, 
                      label='Throughput speedup', color='lightgreen', alpha=0.8)
        axes[1, 1].bar(x + width/2, speedup_latency, width,
                      label='Latency improvement', color='lightcoral', alpha=0.8)
        
        axes[1, 1].set_title('Speedup Factors', fontweight='bold')
        axes[1, 1].set_xlabel('Model configuration')
        axes[1, 1].set_ylabel('Speedup (x)')
        axes[1, 1].set_xticks(x)
        axes[1, 1].set_xticklabels(display_names, rotation=45, ha='right')
        
        # Draw the 1x baseline before building the legend so its label is included
        axes[1, 1].axhline(y=1, color='red', linestyle='--', alpha=0.7, label='Baseline')
        axes[1, 1].legend()
        axes[1, 1].grid(axis='y', alpha=0.3)
    else:
        axes[1, 1].text(0.5, 0.5, 'Multiple models are needed\nto compare speedups', 
                       ha='center', va='center', transform=axes[1, 1].transAxes)
    
    plt.tight_layout()
    
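    # Save the figure
    chart_path = BENCHMARK_DIR / "tensorrt_performance_charts.png"
    plt.savefig(chart_path, dpi=300, bbox_inches='tight')
    print(f"📊 Performance charts saved: {chart_path}")
    
    plt.show()

# Create the charts (if benchmark results exist)
if benchmark_results:
    print("📊 Generating performance charts...")
    create_performance_visualization(benchmark_results)
else:
    print("⚠️  Skipping visualization: no benchmark results")
    
    # Draw a demonstration chart from mock data instead
    print("📊 Generating an example chart from mock data (not measured results)...")
    
    # Mock data, for illustration only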
    mock_results = {
        "text_classifier_trt_fp32": {
            "throughput_rps": 85.2,
            "latency_ms": {
                "mean": 11.7,
                "p95": 18.4
            },
            "raw_latencies": np.random.normal(11.7, 2.5, 100).tolist()
        },
        "text_classifier_trt_fp16": {
            "throughput_rps": 156.8,
            "latency_ms": {
                "mean": 6.4,
                "p95": 10.2
            },
            "raw_latencies": np.random.normal(6.4, 1.8, 100).tolist()
        }
    }
    
    create_performance_visualization(mock_results)
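
The fourth panel divides each model's throughput by the lowest throughput, and the highest mean latency by each model's mean latency. As a small sketch of that arithmetic, using the same mock numbers as the example chart (illustrative values, not measurements), the hypothetical helper `speedups_vs_slowest` computes both ratios:

```python
def speedups_vs_slowest(results):
    """Return per-model speedup factors relative to the slowest values.

    Hypothetical helper for illustration. The throughput baseline is the
    minimum throughput and the latency baseline is the maximum mean
    latency, which need not come from the same model.
    """
    baseline_throughput = min(r["throughput_rps"] for r in results.values())
    baseline_latency = max(r["latency_ms"]["mean"] for r in results.values())
    return {
        name: {
            "throughput_x": r["throughput_rps"] / baseline_throughput,
            "latency_x": baseline_latency / r["latency_ms"]["mean"],
        }
        for name, r in results.items()
    }

# Mock numbers, for illustration only.
mock = {
    "text_classifier_trt_fp32": {"throughput_rps": 85.2, "latency_ms": {"mean": 11.7}},
    "text_classifier_trt_fp16": {"throughput_rps": 156.8, "latency_ms": {"mean": 6.4}},
}
speedups = speedups_vs_slowest(mock)
for name, s in speedups.items():
    print(f"{name}: {s['throughput_x']:.2f}x throughput, {s['latency_x']:.2f}x latency")
```

A value of 1.00x marks the baseline itself, which is why the chart draws a horizontal reference line at 1.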

## 9. Generating Triton TensorRT Deployment Scripts

In [None]:
# Generate a complete Triton TensorRT deployment script
deployment_script = f'''
#!/bin/bash

# Triton TensorRT Backend deployment script
# Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}

set -e  # exit immediately on error

echo "🚀 Triton TensorRT Backend deployment script"
echo "====================================="

# Configuration
MODEL_REPO="{MODEL_REPO.absolute()}"
CONTAINER_NAME="triton-tensorrt-server"
TRITON_IMAGE="nvcr.io/nvidia/tritonserver:24.01-py3"

# Check for Docker
if ! command -v docker &> /dev/null; then
    echo "❌ Docker is not installed. Please install Docker first."
    exit 1
fi

# Check the NVIDIA Container Toolkit by running nvidia-smi inside the Triton image.
# Short tags such as nvidia/cuda:11.0-base may no longer be available on Docker Hub,
# in which case the pull fails even on a correctly configured host.
if ! docker run --rm --gpus all "$TRITON_IMAGE" nvidia-smi &> /dev/null; then
    echo "❌ NVIDIA Container Toolkit is not installed or is misconfigured"
    echo "   Please install the NVIDIA Container Toolkit"
    exit 1
fi

echo "✅ Docker and GPU support checks passed"

# Stop and remove any existing container
echo "🧹 Cleaning up existing container..."
docker stop $CONTAINER_NAME 2>/dev/null || true
docker rm $CONTAINER_NAME 2>/dev/null || true

# Check the model repository
if [ ! -d "$MODEL_REPO" ]; then
    echo "❌ Model repository not found: $MODEL_REPO"
    exit 1
fi

echo "📁 Model repository: $MODEL_REPO"
echo "📋 Available models:"
ls -la "$MODEL_REPO"

# Start Triton Server
echo "🚀 Starting Triton TensorRT Server..."
docker run -d \
  --name $CONTAINER_NAME \
  --gpus all \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v "$MODEL_REPO:/models" \
  -e CUDA_VISIBLE_DEVICES=0 \
  $TRITON_IMAGE \
  tritonserver \
    --model-repository=/models \
    --backend-directory=/opt/tritonserver/backends \
    --model-control-mode=explicit \
    --strict-model-config=false \
    --log-verbose=1 \
    --log-info=true \
    --log-warning=true \
    --log-error=true \
    --exit-on-error=false \
    --exit-timeout-secs=120 \
    --buffer-manager-thread-count=4 \
    --model-load-thread-count=8 \
    --backend-config=tensorrt,default-max-workspace-size=1073741824

# Wait for the server to start (-f makes curl fail on HTTP errors such as "not ready")
echo "⏳ Waiting for Triton Server to start..."
for i in {{1..30}}; do
    if curl -sf http://localhost:8000/v2/health/ready > /dev/null; then
        echo "✅ Triton Server is ready"
        break
    fi
    echo "  Waiting... ($i/30)"
    sleep 2
done

# Check server status
if ! curl -sf http://localhost:8000/v2/health/ready > /dev/null; then
    echo "❌ Triton Server failed to start"
    echo "Container logs:"
    docker logs $CONTAINER_NAME
    exit 1
fi

# Show the service endpoints
echo "🌐 Triton Server endpoints:"
echo "   HTTP:    http://localhost:8000"
echo "   gRPC:    localhost:8001"
echo "   Metrics: http://localhost:8002/metrics"

# Load all available models
echo "📥 Loading TensorRT models..."
'''

# Append a load command for each available model
for model_name in available_models:
    deployment_script += f'''
echo "  Loading model: {model_name}"
curl -f -X POST "http://localhost:8000/v2/repository/models/{model_name}/load" || echo "Load failed: {model_name}"
'''

deployment_script += '''

# Check model status (Triton lists models through the repository index endpoint)
echo "📋 Model status:"
curl -s -X POST http://localhost:8000/v2/repository/index | python3 -m json.tool

echo "🎉 Deployment complete!"
echo ""
echo "💡 Useful commands:"
echo "   Follow container logs: docker logs -f $CONTAINER_NAME"
echo "   Stop the server:       docker stop $CONTAINER_NAME"
echo "   Open a shell:          docker exec -it $CONTAINER_NAME bash"
echo "   Model statistics:      curl http://localhost:8000/v2/models/stats"
'''

# Save the deployment script
deploy_script_path = SCRIPTS_DIR / "deploy_tensorrt_triton.sh"
with open(deploy_script_path, "w", encoding="utf-8") as f:
    f.write(deployment_script.strip())

# Make it executable
os.chmod(deploy_script_path, 0o755)

print(f"📜 TensorRT deployment script created: {deploy_script_path}")
print("\n🚀 To deploy Triton TensorRT Server, run:")
print(f"   bash {deploy_script_path}")

# Create the test script.
# Note: newlines inside the generated print() calls are written with a doubled
# backslash so the generated file contains the two-character escape rather than
# a literal line break, which would be a syntax error in the generated script.
test_script = f'''
#!/usr/bin/env python3
# TensorRT model smoke test

import time

import numpy as np
import tritonclient.http as httpclient

def test_tensorrt_models():
    """Run a single inference against each TensorRT model on the server."""
    
    triton_url = "localhost:8000"
    client = httpclient.InferenceServerClient(url=triton_url)
    
    print("🎯 TensorRT model test")
    print("=" * 30)
    
    # Check server status
    if not client.is_server_ready():
        print("❌ Triton Server is not ready")
        return
    
    # Get the model list
    models = client.get_model_repository_index()
    tensorrt_models = [m["name"] for m in models if "trt" in m["name"].lower()]
    
    if not tensorrt_models:
        print("❌ No TensorRT models found")
        return
    
    print(f"📋 Found {{len(tensorrt_models)}} TensorRT models")
    
    # Test each model
    for model_name in tensorrt_models:
        print(f"\\n🚀 Testing model: {{model_name}}")
        
        try:
            # Check model status
            if not client.is_model_ready(model_name):
                print("  ⚠️  Model not ready")
                continue
            
            # Create a test input
            test_input = np.random.randint(1, 1000, size=(1, 128), dtype=np.int64)
            
            # Run inference
            inputs = [httpclient.InferInput("input_ids", test_input.shape, "INT64")]
            inputs[0].set_data_from_numpy(test_input)
            
            start_time = time.perf_counter()
            response = client.infer(model_name, inputs)
            end_time = time.perf_counter()
            
            # Fetch the output
            output = response.as_numpy("logits")
            
            print("  ✅ Inference succeeded")
            print(f"  ⚡ Latency: {{(end_time - start_time) * 1000:.2f}} ms")
            print(f"  📊 Output shape: {{output.shape}}")
            print(f"  🎯 Predicted class: {{np.argmax(output[0])}}")
            
        except Exception as e:
            print(f"  ❌ Test failed: {{e}}")
    
    print("\\n🎉 Testing complete!")

if __name__ == "__main__":
    test_tensorrt_models()
'''

# Save the test script
test_script_path = SCRIPTS_DIR / "test_tensorrt_models.py"
with open(test_script_path, "w", encoding="utf-8") as f:
    f.write(test_script.strip())

os.chmod(test_script_path, 0o755)

print(f"🧪 Test script created: {test_script_path}")
print("\n📋 To test after deployment, run:")
print(f"   python3 {test_script_path}")
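
Both generated scripts are built from non-raw triple-quoted strings, so Python processes escape sequences before the text reaches the file. The sketch below, with illustrative variable names, shows three effects worth checking whenever shell or Python source is generated this way:

```python
# 1. In a non-raw string, a backslash at the end of a line is a line
#    continuation: the backslash and the newline both disappear.
plain = '''docker run -d \
  --gpus all'''
raw = r'''docker run -d \
  --gpus all'''

# 2. "\n" becomes a real newline unless the backslash is doubled.
inner_broken = 'print("\nhi")'   # generated code spans two lines
inner_ok = 'print("\\nhi")'      # generated code keeps the escape sequence

# 3. In an f-string, doubled braces produce literal braces.
loop = f"for i in {{1..30}}; do"

print(repr(plain))
print(repr(raw))
print(inner_broken.count("\n"), inner_ok.count("\n"))
print(loop)
```

For a multi-line `docker run` command the collapse is harmless, because the result is still one valid command line. Inside generated Python source it is not: a raw line break within a single-quoted or double-quoted string literal is a syntax error.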

## 10. TensorRT Best Practices Summary

In [None]:
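def generate_tensorrt_best_practices():
    """Generate the TensorRT best practices guide."""
    
    best_practices = {
        "Model Conversion": {
            "ONNX export": [
                "Use a reasonably recent ONNX opset version (11+)",
                "Enable constant folding",
                "Declare only the dynamic axes you need; avoid excess flexibility",
                "Verify that PyTorch and ONNX outputs match"
            ],
            "TensorRT build": [
                "Choose a precision suited to the hardware (FP16/INT8)",
                "Allow a sufficiently large workspace (1 GB+)",
                "Create optimization profiles for common input shapes",
                "Use batched inference to raise throughput"
            ]
        },
        "Performance Optimization": {
            "Precision": [
                "FP16: up to about 2x speedup with minimal accuracy loss (model and GPU dependent); the recommended first choice",
                "INT8: up to about 4x speedup (model and GPU dependent); requires calibration data; suited to production",
                "Dynamic-range quantization: an INT8 alternative when no calibration data is available"
            ],
            "Batching": [
                "Fixed batch sizes give the best performance",
                "Dynamic batching adds flexibility at some cost in performance",
                "Size the maximum batch to fit GPU memory"
            ],
            "Memory management": [
                "Preallocate inference buffers to avoid dynamic allocation",
                "Use CUDA streams to overlap compute with memory transfers",
                "Monitor GPU memory usage to avoid OOM"
            ]
        },
        "Triton Integration": {
            "Instance groups": [
                "Set the instance count from GPU memory and latency requirements",
                "Use a single instance for large models; small models can run several in parallel",
                "Pin instances to specific GPUs where appropriate"
            ],
            "Dynamic batching": [
                "Set sensible preferred batch sizes",
                "Tune the queue delay to balance throughput against response time",
                "Use priority levels for requests of differing importance"
            ],
            "Model warmup": [
                "Configure warmup samples for common input sizes",
                "Include warmup entries for different batch sizes",
                "Avoid cold-start latency on the first inference"
            ]
        },
        "Deployment Environment": {
            "Hardware": [
                "Use GPUs with Tensor Cores (V100/T4/RTX/A100)",
                "Make sure the CUDA version is compatible with the TensorRT version",
                "Provision enough system memory for model loading"
            ],
            "Containers": [
                "Use the official Triton container images",
                "Configure GPU resource allocation correctly",
                "Set an appropriate shared memory size (--shm-size)",
                "Raise the locked-memory limit (--ulimit memlock=-1)"
            ]
        },
        "Monitoring and Alerting": {
            "Key metrics": [
                "GPU utilization and memory usage",
                "Inference latency distribution (P50/P95/P99)",
                "Throughput and error rate",
                "Model load time and warmup status"
            ],
            "Suggested alert thresholds": [
                "GPU memory usage > 85%",
                "P99 latency increase > 50%",
                "Error rate > 0.1%",
                "Throughput drop > 20%"
            ]
        }
    }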
    
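    # Build the report
    report_lines = [
        "# TensorRT Backend Best Practices Guide",
        "=" * 40,
        f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}",
        "",
        "## 🎯 Overview",
        "",
        "This guide summarizes best practices for using the TensorRT Backend with Triton Inference Server,",
        "covering model conversion, performance optimization, deployment configuration, and monitoring.",
        ""
    ]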
    
    for section_name, content in best_practices.items():
        report_lines.append(f"## 🔧 {section_name}")
        report_lines.append("")
        
        for subsection, items in content.items():
            report_lines.append(f"### 📌 {subsection}")
            report_lines.append("")
            
            for item in items:
                report_lines.append(f"- {item}")
            
            report_lines.append("")
    
    # Benchmark summary
    if benchmark_results:
        report_lines.extend([
            "## 📊 Benchmark Summary for This Run",
            "",
            "| Configuration | Throughput (RPS) | Mean latency (ms) | Notes |",
            "|------|-------------|--------------|------|"
        ])
        
        for model_name, result in benchmark_results.items():
            if "error" not in result and "latency_ms" in result:
                throughput = result["throughput_rps"]
                latency = result["latency_ms"]["mean"]
                
                # Shorten the name and add a note
                if "fp16" in model_name:
                    config = "TensorRT FP16"
                    note = "Recommended"
                elif "fp32" in model_name:
                    config = "TensorRT FP32"
                    note = "Baseline"
                else:
                    config = model_name.replace("text_classifier_trt_", "")
                    note = "-"
                
                report_lines.append(
                    f"| {config} | {throughput:.1f} | {latency:.2f} | {note} |"
                )
    
    # Recommended configuration templates
    report_lines.extend([
        "",
        "## 🏗️ Recommended Configuration Templates",
        "",
        "### High-throughput workloads",
        "```protobuf",
        'name: "model_high_throughput"',
        'platform: "tensorrt_plan"',
        'max_batch_size: 32',
        "",
        'instance_group {',
        '  count: 2',
        '  kind: KIND_GPU',
        '}',
        "",
        'dynamic_batching {',
        '  preferred_batch_size: [ 8, 16, 32 ]',
        '  max_queue_delay_microseconds: 1000',
        '}',
        "```",
        "",
        "### Low-latency workloads",
        "```protobuf",
        'name: "model_low_latency"',
        'platform: "tensorrt_plan"',
        'max_batch_size: 4',
        "",
        'instance_group {',
        '  count: 1',
        '  kind: KIND_GPU',
        '}',
        "",
        'dynamic_batching {',
        '  preferred_batch_size: [ 1, 2 ]',
        '  max_queue_delay_microseconds: 100',
        '}',
        "```",
        "",
        "---",
        "",
        "💡 **Tip**: Adjust these parameters to match your workload and hardware resources."
    ])
    
    return "\n".join(report_lines)

# Generate the best practices guide
best_practices_guide = generate_tensorrt_best_practices()

# Save the guide
guide_path = BASE_DIR / "TensorRT_Best_Practices.md"
with open(guide_path, "w", encoding="utf-8") as f:
    f.write(best_practices_guide)

print(f"📚 TensorRT best practices guide generated: {guide_path}")
print("\n" + "=" * 60)
print(best_practices_guide[:2000] + "..." if len(best_practices_guide) > 2000 else best_practices_guide)

# Lab summary
print("\n🎉 TensorRT Backend integration lab complete!")
print("\n📊 Summary of results:")
print(f"  ✅ ONNX models exported: {sum(1 for r in export_results.values() if r['success'])}/{len(export_results)}")

if TRT_AVAILABLE:
    print(f"  ✅ TensorRT engines built: {sum(1 for r in build_results.values() if r['success'])}/{len(build_results)}")
    print(f"  ✅ Triton models configured: {sum(1 for m in triton_models.values() if m['success'])}/{len(triton_models)}")
else:
    print("  ⚠️  TensorRT engine build: skipped (TensorRT not installed)")
    print("  ⚠️  Triton model configuration: skipped")

print("  ✅ Deployment scripts: generated")
print("  ✅ Best practices guide: generated")

print("\n📚 Suggested next steps:")
print("1. 🔧 Practice vLLM Backend integration (Notebook 03)")
print("2. 🛠️  Build a custom Python Backend (Notebook 04)")
print("3. 📊 Go deeper into INT8 quantization")
print("4. 🏢 Multi-GPU distributed deployment")
print("5. 🚀 Production performance tuning")

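The two configuration templates in the guide differ only in a handful of parameters. As a minimal sketch, assuming a hypothetical helper named `render_trt_config` (not part of Triton or of this notebook), the same fields can be rendered from arguments, with a sanity check that no preferred batch size exceeds `max_batch_size`. Input and output tensor definitions are deliberately omitted.

```python
def render_trt_config(name, max_batch_size, instance_count,
                      preferred_batch_sizes, max_queue_delay_us):
    """Render a minimal config.pbtxt body for a TensorRT plan model.

    Hypothetical helper for illustration: it covers only the fields used
    in the templates and omits input/output tensor definitions.
    """
    too_large = [b for b in preferred_batch_sizes if b > max_batch_size]
    if too_large:
        raise ValueError(
            f"preferred batch sizes {too_large} exceed max_batch_size={max_batch_size}"
        )
    preferred = ", ".join(str(b) for b in preferred_batch_sizes)
    return "\n".join([
        f'name: "{name}"',
        'platform: "tensorrt_plan"',
        f"max_batch_size: {max_batch_size}",
        "",
        "instance_group [",
        "  {",
        f"    count: {instance_count}",
        "    kind: KIND_GPU",
        "  }",
        "]",
        "",
        "dynamic_batching {",
        f"  preferred_batch_size: [ {preferred} ]",
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
    ])

low_latency = render_trt_config("model_low_latency", 4, 1, [1, 2], 100)
print(low_latency)
```

Generating the file from parameters makes it easier to sweep queue delay or instance count during benchmarking than editing each `config.pbtxt` by hand.
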
## 🔗 Resources and Further Reading

### Official Documentation
- [TensorRT Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html)
- [Triton TensorRT Backend](https://github.com/triton-inference-server/tensorrt_backend)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)

### Performance Optimization
- [TensorRT Best Practices](https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html)
- [GPU Inference Optimization Guide](https://developer.nvidia.com/blog/optimizing-inference-performance-using-nvidia-tensorrt/)
- [Mixed-Precision Training of Deep Neural Networks](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/)

### Case Studies
- [Deploying TensorRT in Production](https://developer.nvidia.com/blog/deploying-deep-learning-nvidia-tensorrt/)
- [Optimizing Large-Scale Inference Services](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/)

### Tools
- [Netron (model visualization)](https://netron.app/)
- [NVIDIA Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) (for large language models)

---

**🎓 Lab completion milestone**: TensorRT Backend integration and optimization techniques covered ✅

**🚀 Performance gains**: Compared with native PyTorch inference, TensorRT can deliver roughly 2-10x faster inference and 30-50% lower memory use, depending on the model, precision, and GPU. In enterprise deployments, gains of this kind can substantially reduce inference cost and improve user experience.