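# Lab-1.8-04: ORPO Model Production Deployment Guide

**Deployment goal**: deploy a trained, ORPO-aligned model to a production environment
- Model quantization and optimization
- Building the inference service
- Performance monitoring and A/B testing
- Strategies for continuous optimization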

---

## 1. Environment Setup and Dependency Installation

Install the additional dependencies required for production deployment.

In [None]:
# Check for and install deployment dependencies
import subprocess
import sys

def install_if_missing(package):
    """Import `package`; install it with pip if the import fails."""
    try:
        __import__(package)
        print(f"✅ {package} is already installed")
    except ImportError:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Deployment dependencies
deployment_packages = [
    'fastapi',
    'uvicorn',
    'gradio',
    'prometheus_client',
    'psutil'
]

print("🚀 Checking deployment dependencies...")
for package in deployment_packages:
    install_if_missing(package)

print("\n📋 Importing required modules...")
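
The check above imports each package as a side effect. As a minimal sketch, a dry run can report what is missing without importing or installing anything. The `PIP_NAMES` mapping and the `missing_packages` helper are hypothetical names introduced for illustration; the mapping only matters when the import name and the pip name differ.

```python
import importlib.util

# Hypothetical mapping: import name -> pip distribution name.
PIP_NAMES = {"prometheus_client": "prometheus-client", "yaml": "PyYAML"}

def missing_packages(import_names):
    """Return pip names for top-level modules that cannot be found (nothing is imported)."""
    missing = []
    for name in import_names:
        if importlib.util.find_spec(name) is None:
            missing.append(PIP_NAMES.get(name, name))
    return missing

# "json" ships with Python; the second name is deliberately bogus.
print(missing_packages(["json", "surely_not_a_real_module_xyz"]))
```

A list like this can be reviewed before deciding whether to let the notebook call pip on its own.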

In [None]:
import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
import json
import time
import gc
import psutil
import threading
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline
)
from peft import (
    PeftModel,
    LoraConfig,
    get_peft_model
)

# Deployment-related
import fastapi
import uvicorn
import gradio as gr
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Check the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🎯 Deployment device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Model Loading and Optimization

Load the trained ORPO model and optimize it for production.

In [None]:
class ProductionORPOModel:
    """Production wrapper around an ORPO model."""
    
    def __init__(self, model_name="microsoft/DialoGPT-medium", use_quantization=True):
        self.model_name = model_name
        self.use_quantization = use_quantization
        self.model = None
        self.tokenizer = None
        self.pipeline = None
        
        # Performance statistics
        self.request_count = 0
        self.total_inference_time = 0
        self.avg_tokens_per_second = 0
        
        print(f"🏭 Initializing production model: {model_name}")
        
    def load_model(self):
        """Load and optimize the model."""
        print("📥 Loading model and tokenizer...")
        
        # Load the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Quantization config (4-bit loading with bitsandbytes generally needs a CUDA GPU)
        if self.use_quantization:
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.bfloat16
            )
            print("⚡ Using 4-bit quantization")
        else:
            bnb_config = None
            print("🔥 Using the unquantized model")
        
        # bfloat16 matches the 4-bit compute dtype; without quantization use
        # float16 on GPU and float32 on CPU, where half precision is poorly supported
        if self.use_quantization:
            dtype = torch.bfloat16
        else:
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        
        # Load the model
        # Note: trust_remote_code executes code shipped in the model repository;
        # enable it only for repositories you trust
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=bnb_config,
            device_map="auto",
            torch_dtype=dtype,
            trust_remote_code=True
        )
        
        # Build the generation pipeline (the model above is already placed on its device)
        self.pipeline = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=256
        )
        
        print("✅ Model loaded")
        print(f"📊 Parameter count: {self.model.num_parameters() / 1e6:.1f}M")
        
    def generate_response(self, prompt: str, max_tokens: int = 256) -> Dict:
        """Generate a response and collect timing statistics."""
        start_time = time.time()
        
        try:
            # Generate; return_full_text=False returns only the continuation,
            # which is more robust than slicing the prompt off by character count
            outputs = self.pipeline(
                prompt,
                max_new_tokens=max_tokens,
                num_return_sequences=1,
                return_full_text=False,
                pad_token_id=self.tokenizer.eos_token_id
            )
            
            response = outputs[0]['generated_text'].strip()
            
            # Compute performance metrics (the token count is an approximation,
            # obtained by re-encoding the decoded text)
            inference_time = time.time() - start_time
            token_count = len(self.tokenizer.encode(response, add_special_tokens=False))
            tokens_per_second = token_count / inference_time if inference_time > 0 else 0
            
            # Update the running statistics
            self.request_count += 1
            self.total_inference_time += inference_time
            self.avg_tokens_per_second = (
                (self.avg_tokens_per_second * (self.request_count - 1) + tokens_per_second) 
                / self.request_count
            )
            
            return {
                'response': response,
                'inference_time': inference_time,
                'token_count': token_count,
                'tokens_per_second': tokens_per_second,
                'success': True
            }
            
        except Exception as e:
            return {
                'response': f"Error: {str(e)}",
                'inference_time': time.time() - start_time,
                'token_count': 0,
                'tokens_per_second': 0,
                'success': False,
                'error': str(e)
            }
    
    def get_stats(self) -> Dict:
        """Return model statistics."""
        return {
            'request_count': self.request_count,
            'total_inference_time': self.total_inference_time,
            'avg_inference_time': self.total_inference_time / max(self.request_count, 1),
            'avg_tokens_per_second': self.avg_tokens_per_second,
            'model_name': self.model_name,
            'quantized': self.use_quantization
        }

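# Initialize the production model; quantize only when a CUDA GPU is available
production_model = ProductionORPOModel(use_quantization=torch.cuda.is_available())
production_model.load_model()

print("🎉 Production model initialized!")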
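
To see why 4-bit loading matters, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: `estimate_weight_memory_gb` is a hypothetical helper, the 7-billion-parameter count is chosen only for illustration, and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
def estimate_weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Hypothetical 7-billion-parameter model, chosen only for illustration.
N = 7_000_000_000
for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {estimate_weight_memory_gb(N, bits):.1f} GB")
```

Comparing this estimate with the VRAM figure printed by the device check gives a quick sense of whether a given precision can fit at all.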

## 3. Performance Benchmark

Measure how the model performs in the production environment.

In [None]:
def run_performance_benchmark():
    """Run the performance benchmark."""
    
    print("🚀 Starting performance benchmark...")
    
    # Test prompts
    test_prompts = [
        "How can I improve my productivity at work?",
        "What are the benefits of regular exercise?",
        "Explain the concept of machine learning",
        "How to cook a perfect pasta?",
        "What is the meaning of life?"
    ]
    
    results = []
    
    # Prime psutil: the first cpu_percent() call has no reference point and returns 0.0
    psutil.cpu_percent()
    
    for i, prompt in enumerate(test_prompts):
        print(f"\n📝 Test {i+1}/{len(test_prompts)}: {prompt[:50]}...")
        
        # Record the baseline and reset the peak counter, so that memory used
        # only temporarily during generation is captured as well
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
        gpu_memory_before = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0
        
        # Generate a response
        result = production_model.generate_response(prompt, max_tokens=128)
        
        # Peak GPU memory during generation; CPU utilization since the previous call
        gpu_memory_peak = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
        cpu_percent = psutil.cpu_percent()
        
        # Store the result
        test_result = {
            'prompt': prompt,
            'response': result['response'][:100] + '...' if len(result['response']) > 100 else result['response'],
            'inference_time': result['inference_time'],
            'token_count': result['token_count'],
            'tokens_per_second': result['tokens_per_second'],
            'gpu_memory_usage': gpu_memory_peak - gpu_memory_before,
            'cpu_usage': cpu_percent,
            'success': result['success']
        }
        
        results.append(test_result)
        
        print(f"⏱️  Inference time: {result['inference_time']:.3f}s")
        print(f"🔤 Token count: {result['token_count']}")
        print(f"⚡ Tokens/s: {result['tokens_per_second']:.1f}")
        
        # Free memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
    
    return results

# Run the benchmark
benchmark_results = run_performance_benchmark()

# Summary statistics
df = pd.DataFrame(benchmark_results)
print("\n📊 Performance summary:")
print(f"Mean inference time: {df['inference_time'].mean():.3f}s")
print(f"Mean tokens/s: {df['tokens_per_second'].mean():.1f}")
print(f"Success rate: {df['success'].mean()*100:.1f}%")
if torch.cuda.is_available():
    print(f"Mean GPU memory increase: {df['gpu_memory_usage'].mean():.3f} GB")
print(f"Mean CPU utilization: {df['cpu_usage'].mean():.1f}%")
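
The summary above reports means, which hide tail latency. A minimal sketch of computing P50/P95/P99 from a list of inference times with the standard library follows; `latency_percentiles` and the sample values are hypothetical, and with only a handful of samples (as in the five-prompt benchmark) the upper percentiles are rough estimates at best.

```python
import statistics

def latency_percentiles(latencies_s):
    """Return the P50/P95/P99 of a list of latencies, in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical latencies, in seconds.
samples = [0.42, 0.51, 0.48, 0.95, 0.47, 0.50, 1.80, 0.49, 0.53, 0.46]
print(latency_percentiles(samples))
```

In the notebook, the input could be `df['inference_time'].tolist()` from a longer benchmark run.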

## 4. Monitoring Setup

Set up Prometheus monitoring metrics.

In [None]:
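# Prometheus monitoring metrics
from prometheus_client import CollectorRegistry, Counter, Histogram, Gauge, generate_latest

class ModelMonitoring:
    """Model monitoring system."""
    
    def __init__(self):
        # A dedicated registry, so re-running this cell does not raise
        # "Duplicated timeseries" from the global default registry
        self.registry = CollectorRegistry()
        
        # Request counter
        self.request_count = Counter(
            'orpo_model_requests_total',
            'Total number of requests to ORPO model',
            ['status'],  # success, error
            registry=self.registry
        )
        
        # Inference time distribution
        self.inference_duration = Histogram(
            'orpo_model_inference_duration_seconds',
            'Time spent on model inference',
            buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
            registry=self.registry
        )
        
        # Token generation speed
        self.tokens_per_second = Gauge(
            'orpo_model_tokens_per_second',
            'Tokens generated per second for the most recent request',
            registry=self.registry
        )
        
        # GPU memory usage
        self.gpu_memory_usage = Gauge(
            'orpo_model_gpu_memory_gb',
            'GPU memory usage in GB',
            registry=self.registry
        )
        
        # CPU utilization
        self.cpu_usage = Gauge(
            'orpo_model_cpu_percent',
            'CPU usage percentage',
            registry=self.registry
        )
        
        print("📊 Monitoring initialized")
    
    def record_request(self, result: Dict):
        """Record the metrics for one request."""
        # Record the request status
        status = 'success' if result['success'] else 'error'
        self.request_count.labels(status=status).inc()
        
        # Record the inference time
        self.inference_duration.observe(result['inference_time'])
        
        # Update the speed gauge
        self.tokens_per_second.set(result['tokens_per_second'])
        
        # Update resource usage
        if torch.cuda.is_available():
            self.gpu_memory_usage.set(torch.cuda.memory_allocated() / 1e9)
        self.cpu_usage.set(psutil.cpu_percent())
    
    def get_metrics(self) -> str:
        """Return the metrics in the Prometheus text format."""
        # generate_latest() returns bytes; decode to match the declared return type
        return generate_latest(self.registry).decode("utf-8")

# Initialize monitoring
monitoring = ModelMonitoring()

print("📈 Monitoring ready")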
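
Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all of them. The sketch below reproduces that counting with the standard library, using the same bucket bounds as the histogram above; `cumulative_bucket_counts` is a hypothetical helper, not part of `prometheus_client`.

```python
import bisect

BUCKETS = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]  # upper bounds, in seconds

def cumulative_bucket_counts(observations, bounds=BUCKETS):
    """Count observations per cumulative bucket (value <= upper bound), plus +Inf."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value;
        # len(bounds) means the value only falls into +Inf.
        first = bisect.bisect_left(bounds, value)
        for i in range(first, len(counts)):
            counts[i] += 1
    labels = [str(b) for b in bounds] + ["+Inf"]
    return dict(zip(labels, counts))

# Hypothetical inference times, in seconds.
print(cumulative_bucket_counts([0.05, 0.5, 0.7, 12.0]))
```

If most real requests land only in `+Inf`, the bounds are too low for the workload and percentile estimates derived from the histogram will be poor, so the bucket list is worth revisiting after the first benchmark.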

## 5. Building the FastAPI Service

Create a RESTful API service.

In [None]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from fastapi.responses import PlainTextResponse
import uvicorn

# API data models
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9

class GenerationResponse(BaseModel):
    response: str
    inference_time: float
    token_count: int
    tokens_per_second: float
    request_id: str

# Create the FastAPI application
app = FastAPI(
    title="ORPO Model API",
    description="Production API for ORPO-aligned language model",
    version="1.0.0"
)

@app.get("/")
async def root():
    return {"message": "ORPO Model API is running", "status": "healthy"}

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    gpu_available = torch.cuda.is_available()
    gpu_memory = torch.cuda.memory_allocated() / 1e9 if gpu_available else 0
    
    return {
        "status": "healthy",
        "gpu_available": gpu_available,
        "gpu_memory_gb": gpu_memory,
        "model_loaded": production_model.model is not None,
        "total_requests": production_model.request_count
    }

@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest):
    """Text generation endpoint."""
    try:
        # Build a request ID from the timestamp and the request counter
        request_id = f"req_{int(time.time()*1000)}_{production_model.request_count}"
        
        # Generate a response (request.temperature and request.top_p are accepted
        # by the schema, but generate_response() does not apply them yet)
        result = production_model.generate_response(
            request.prompt, 
            max_tokens=request.max_tokens
        )
        
        # Record monitoring metrics
        monitoring.record_request(result)
        
        if not result['success']:
            raise HTTPException(status_code=500, detail=result.get('error', 'Generation failed'))
        
        return GenerationResponse(
            response=result['response'],
            inference_time=result['inference_time'],
            token_count=result['token_count'],
            tokens_per_second=result['tokens_per_second'],
            request_id=request_id
        )
        
    except HTTPException:
        # Re-raise unchanged; the generic handler below would otherwise re-wrap it
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/stats")
async def get_stats():
    """Return model statistics."""
    return production_model.get_stats()

@app.get("/metrics", response_class=PlainTextResponse)
async def get_metrics():
    """Prometheus metrics endpoint."""
    return monitoring.get_metrics()

print("🌐 FastAPI service configured")
print("Available endpoints:")
print("  GET  /          - service status")
print("  GET  /health    - health check")
print("  POST /generate  - text generation")
print("  GET  /stats     - model statistics")
print("  GET  /metrics   - Prometheus metrics")

# Note: uvicorn.run() cannot be called directly inside Jupyter (an event loop is already running)
# For an actual deployment use: uvicorn main:app --host 0.0.0.0 --port 8000
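
Once the service is running, a client only needs to POST JSON to `/generate`. The sketch below builds such a request with the standard library and leaves the actual network call commented out; `build_generate_request` is a hypothetical helper, and the base URL assumes the service listens on `localhost:8000` as in the `uvicorn` command above.

```python
import json
import urllib.request

def build_generate_request(base_url, prompt, max_tokens=256):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumption: the service is reachable at this address.
req = build_generate_request(
    "http://localhost:8000", "How can I improve my productivity at work?", 64
)
print(req.get_method(), req.full_url)

# To send it once the server is up:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.loads(resp.read())["response"])
```

Separating request construction from sending keeps the payload easy to inspect and test without a running model.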

## 6. Gradio User Interface

Create a friendly web interface that users can test with.

In [None]:
import gradio as gr

def gradio_generate(prompt, max_tokens, temperature):
    """Generation function for the Gradio interface."""
    try:
        # Simplified: only max_tokens is forwarded; the temperature slider is not applied
        result = production_model.generate_response(prompt, max_tokens=int(max_tokens))
        
        # Record monitoring metrics
        monitoring.record_request(result)
        
        if result['success']:
            stats_text = f"""📊 Generation statistics:
⏱️ Inference time: {result['inference_time']:.3f}s
🔤 Token count: {result['token_count']}
⚡ Tokens/s: {result['tokens_per_second']:.1f}
📈 Total requests: {production_model.request_count}"""
            
            return result['response'], stats_text
        else:
            return f"❌ Generation failed: {result.get('error', 'Unknown error')}", ""
            
    except Exception as e:
        return f"❌ Error: {str(e)}", ""

# Create the Gradio interface
def create_gradio_interface():
    """Create the Gradio web interface."""
    
    with gr.Blocks(title="ORPO Model Demo", theme=gr.themes.Soft()) as demo:
        gr.Markdown(
            """
            # 🤖 ORPO-Aligned Model Demo
            
            This is an aligned language model trained with ORPO (Odds Ratio Preference Optimization).
            ORPO is a single-stage alignment method that is more efficient than the conventional SFT+DPO pipeline.
            """
        )
        
        with gr.Row():
            with gr.Column(scale=2):
                prompt_input = gr.Textbox(
                    label="Prompt",
                    placeholder="Enter your question or prompt...",
                    lines=3
                )
                
                with gr.Row():
                    max_tokens_slider = gr.Slider(
                        label="Max tokens",
                        minimum=50,
                        maximum=512,
                        value=256,
                        step=10
                    )
                    
                    temperature_slider = gr.Slider(
                        label="Temperature",
                        minimum=0.1,
                        maximum=2.0,
                        value=0.7,
                        step=0.1
                    )
                
                generate_btn = gr.Button("🚀 Generate", variant="primary")
                
            with gr.Column(scale=1):
                stats_display = gr.Textbox(
                    label="Performance statistics",
                    value="Waiting for generation...",
                    lines=6,
                    interactive=False
                )
        
        response_output = gr.Textbox(
            label="Model response",
            lines=8,
            interactive=False
        )
        
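        # Preset examples
        gr.Examples(
            examples=[
                ["How can I work more efficiently?", 200, 0.7],
                ["Explain the basic concepts of machine learning", 300, 0.8],
                ["Give me some advice on healthy eating", 250, 0.6],
                ["How do I learn a new programming language?", 280, 0.7]
            ],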
            inputs=[prompt_input, max_tokens_slider, temperature_slider]
        )
        
        # Event binding
        generate_btn.click(
            fn=gradio_generate,
            inputs=[prompt_input, max_tokens_slider, temperature_slider],
            outputs=[response_output, stats_display]
        )
        
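        gr.Markdown(
            """
            ### 📋 Instructions
            
            1. **Enter a prompt**: type your question or prompt in the text box above
            2. **Adjust parameters**: set the maximum token count and the temperature
            3. **Generate**: click the "Generate" button to get the model's response
            4. **Check the statistics**: performance statistics for the generation appear on the right
            
            **Parameters**:
            - **Max tokens**: limits the length of the response
            - **Temperature**: controls how creative the response is (higher is more creative, lower is more conservative)
            """
        )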
    
    return demo

# Create the interface
gradio_demo = create_gradio_interface()
print("🎨 Gradio interface configured")
print("Run gradio_demo.launch() to start the interface")

# Note: in Jupyter, the following line starts the interface
# gradio_demo.launch(share=False, server_port=7860)

## 7. A/B Testing Framework

Implement comparison tests between model versions.

In [None]:
import hashlib
import random
from dataclasses import dataclass
from typing import List, Dict, Optional
import json
from datetime import datetime, timedelta

@dataclass
class ABTestConfig:
    """A/B test configuration."""
    test_name: str
    model_a_name: str
    model_b_name: str
    traffic_split: float = 0.5  # share of traffic sent to model A (0.5 = 50/50)
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    
@dataclass
class ABTestResult:
    """A/B test result."""
    user_id: str
    model_version: str
    prompt: str
    response: str
    inference_time: float
    user_rating: Optional[int] = None  # rating from 1 to 5
    timestamp: Optional[datetime] = None

class ABTestManager:
    """A/B test manager."""
    
    def __init__(self):
        self.active_tests: Dict[str, ABTestConfig] = {}
        self.test_results: List[ABTestResult] = []
        
    def create_test(self, config: ABTestConfig):
        """Create a new A/B test."""
        if config.start_time is None:
            config.start_time = datetime.now()
        if config.end_time is None:
            config.end_time = config.start_time + timedelta(days=7)
            
        self.active_tests[config.test_name] = config
        print(f"🧪 A/B test created: {config.test_name}")
        print(f"   Model A: {config.model_a_name}")
        print(f"   Model B: {config.model_b_name}")
        print(f"   Traffic split: {config.traffic_split*100:.0f}% / {(1-config.traffic_split)*100:.0f}%")
        
    def assign_model(self, test_name: str, user_id: str) -> str:
        """Assign a model version to a user."""
        if test_name not in self.active_tests:
            return "default"
            
        config = self.active_tests[test_name]
        
        # Check that the test is within its active period
        now = datetime.now()
        if now < config.start_time or now > config.end_time:
            return config.model_a_name  # fall back to version A
        
        # Deterministic assignment from a stable hash of test name and user ID, so a
        # user always gets the same version. The built-in hash() is salted per process
        # for strings, and random.seed() would also reset the global RNG.
        digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        if bucket < config.traffic_split:
            return config.model_a_name
        else:
            return config.model_b_name
    
    def record_result(self, result: ABTestResult):
        """Record a test result."""
        if result.timestamp is None:
            result.timestamp = datetime.now()
        self.test_results.append(result)
        
    def analyze_test(self, test_name: str) -> Dict:
        """Analyze the results of an A/B test."""
        if test_name not in self.active_tests:
            return {"error": "Test not found"}
        
        config = self.active_tests[test_name]
        
        # Select the results that belong to either model in this test
        relevant_results = [
            r for r in self.test_results 
            if r.model_version in [config.model_a_name, config.model_b_name]
        ]
        
        if not relevant_results:
            return {"error": "No results found"}
        
        # Group by model version
        model_a_results = [r for r in relevant_results if r.model_version == config.model_a_name]
        model_b_results = [r for r in relevant_results if r.model_version == config.model_b_name]
        
        def calculate_stats(results):
            if not results:
                return {"count": 0, "avg_inference_time": 0, "avg_rating": 0, "rating_count": 0}
            
            ratings = [r.user_rating for r in results if r.user_rating is not None]
            
            return {
                "count": len(results),
                "avg_inference_time": sum(r.inference_time for r in results) / len(results),
                "avg_rating": sum(ratings) / len(ratings) if ratings else 0,
                "rating_count": len(ratings)
            }
        
        model_a_stats = calculate_stats(model_a_results)
        model_b_stats = calculate_stats(model_b_results)
        
        return {
            "test_name": test_name,
            "model_a": {
                "name": config.model_a_name,
                "stats": model_a_stats
            },
            "model_b": {
                "name": config.model_b_name,
                "stats": model_b_stats
            },
            "total_results": len(relevant_results)
        }

# Initialize the A/B test manager
ab_test_manager = ABTestManager()

# Create an example A/B test
test_config = ABTestConfig(
    test_name="orpo_vs_baseline",
    model_a_name="ORPO_Model",
    model_b_name="Baseline_Model",
    traffic_split=0.5
)

ab_test_manager.create_test(test_config)

# Simulate some test results
def simulate_ab_test_results():
    """Simulate A/B test results."""
    print("\n🔬 Simulating A/B test results...")
    
    users = [f"user_{i}" for i in range(100)]
    prompts = [
        "How to be more productive?",
        "Explain quantum computing",
        "Best practices for coding",
        "How to stay healthy?"
    ]
    
    for user in users:
        prompt = random.choice(prompts)
        model_version = ab_test_manager.assign_model("orpo_vs_baseline", user)
        
        # Simulate a performance difference between the models (synthetic numbers, not measurements)
        if model_version == "ORPO_Model":
            inference_time = random.uniform(0.5, 1.2)
            rating = random.choices([3, 4, 5], weights=[0.1, 0.4, 0.5])[0]
        else:
            inference_time = random.uniform(0.8, 1.8)
            rating = random.choices([2, 3, 4], weights=[0.2, 0.5, 0.3])[0]
        
        result = ABTestResult(
            user_id=user,
            model_version=model_version,
            prompt=prompt,
            response="Generated response...",
            inference_time=inference_time,
            user_rating=rating if random.random() > 0.3 else None  # about 70% of users leave a rating
        )
        
        ab_test_manager.record_result(result)
    
    return ab_test_manager.analyze_test("orpo_vs_baseline")

# Run the simulated test
test_analysis = simulate_ab_test_results()
print("\n📊 A/B test analysis:")
print(json.dumps(test_analysis, indent=2, default=str))
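
The analysis above reports averages per model but does not say whether the difference could be due to chance. A minimal sketch of a two-sided test on the mean ratings follows. It uses a normal approximation, which is an assumption that suits large samples (for small ones a t-distribution is more accurate), and `compare_means` and the rating lists are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def compare_means(a, b):
    """Two-sided test for a difference in means, using a normal approximation."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two observations")
    # Standard error of the difference, allowing unequal variances.
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    z = (mean(a) - mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"diff": mean(a) - mean(b), "z": z, "p_value": p_value}

# Hypothetical ratings (1-5) for two model versions.
ratings_a = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
ratings_b = [3, 4, 2, 3, 3, 4, 3, 2, 4, 3]
print(compare_means(ratings_a, ratings_b))
```

In the notebook, the two inputs would be the non-null `user_rating` values of each model's results.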

## 8. Production Deployment Checklist

Make sure the model is ready to go into production.

In [None]:
def production_readiness_check():
    """Production readiness check."""
    
    print("🔍 Production readiness check")
    print("=" * 50)
    
    checklist = {
        "Model loaded": production_model.model is not None,
        "GPU available": torch.cuda.is_available(),
        "Sufficient memory": torch.cuda.get_device_properties(0).total_memory > 8e9 if torch.cuda.is_available() else True,
        "Monitoring": monitoring is not None,
        "API service": app is not None,
        "A/B testing": ab_test_manager is not None,
        "Benchmark completed": len(benchmark_results) > 0,
    }
    
    all_passed = True
    
    for check, passed in checklist.items():
        status = "✅" if passed else "❌"
        print(f"{status} {check}")
        if not passed:
            all_passed = False
    
    print("\n" + "="*50)
    
    if all_passed:
        print("🎉 All checks passed! The model is ready for production.")
    else:
        print("⚠️  Some checks failed; resolve them before deploying.")
    
    # Deployment recommendations
    print("\n📋 Production deployment recommendations:")
    suggestions = [
        "🔧 Containerize the deployment with Docker",
        "🔒 Enable HTTPS and authentication",
        "📊 Set up Prometheus + Grafana monitoring",
        "🚦 Configure a load balancer",
        "💾 Schedule regular model backups",
        "🔄 Adopt a rolling update strategy",
        "⚡ Consider NVIDIA Triton Inference Server",
        "📝 Set up log aggregation and analysis",
        "🧪 Keep running A/B tests",
        "🔍 Set up anomaly detection and automatic alerts"
    ]
    
    for suggestion in suggestions:
        print(f"  {suggestion}")
    
    return all_passed

# Run the readiness check
ready_for_production = production_readiness_check()

## 9. Docker Deployment Configuration

Generate the Docker configuration files.

In [None]:
def generate_docker_files():
    """Generate the Docker deployment files."""
    
    # Dockerfile (raw string, so the trailing backslashes survive as line continuations)
    dockerfile_content = r"""
# ORPO Model Production Dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set the working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Set environment variables
ENV PYTHONPATH=/app
ENV CUDA_VISIBLE_DEVICES=0

# Expose ports
EXPOSE 8000 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Start command
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
""".strip()
    
    # requirements.txt
    requirements_content = """
torch>=2.1.0
transformers>=4.35.0
peft>=0.6.0
bitsandbytes>=0.41.0
accelerate>=0.24.0
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
gradio>=4.0.0
prometheus-client>=0.19.0
psutil>=5.9.0
pydantic>=2.0.0
numpy>=1.24.0
pandas>=2.0.0
""".strip()
    
    # docker-compose.yml
    docker_compose_content = """
version: '3.8'

services:
  orpo-model:
    build: .
    ports:
      - "8000:8000"  # FastAPI
      - "7860:7860"  # Gradio (if needed)
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - MODEL_NAME=microsoft/DialoGPT-medium
      - USE_QUANTIZATION=true
    volumes:
      - ./models:/app/models
      - ./logs:/app/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
    restart: unless-stopped

volumes:
  grafana-storage:
""".strip()
    
    # prometheus.yml
    prometheus_config = """
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'orpo-model'
    static_configs:
      - targets: ['orpo-model:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s
""".strip()
    
    # main.py (a simplified version of the FastAPI application)
    main_py_content = """
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from prometheus_client import generate_latest, Counter, Histogram
from fastapi.responses import PlainTextResponse

# Initialize FastAPI
app = FastAPI(title="ORPO Model API", version="1.0.0")

# Prometheus metrics
REQUEST_COUNT = Counter('requests_total', 'Total requests', ['method', 'endpoint'])
REQUEST_DURATION = Histogram('request_duration_seconds', 'Request duration')

# Global state
model = None
tokenizer = None
text_generator = None

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.on_event("startup")
async def startup_event():
    global model, tokenizer, text_generator
    
    model_name = os.getenv("MODEL_NAME", "microsoft/DialoGPT-medium")
    use_quantization = os.getenv("USE_QUANTIZATION", "true").lower() == "true"
    
    print(f"Loading model: {model_name}")
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Apply 4-bit quantization only when requested and a CUDA GPU is present
    quantization_config = None
    if use_quantization and torch.cuda.is_available():
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
        torch_dtype=torch.bfloat16 if quantization_config else torch.float16
    )
    
    text_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto"
    )
    
    print("Model loaded successfully!")

@app.get("/")
async def root():
    REQUEST_COUNT.labels(method="GET", endpoint="/").inc()
    return {"message": "ORPO Model API", "status": "running"}

@app.get("/health")
async def health():
    REQUEST_COUNT.labels(method="GET", endpoint="/health").inc()
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "gpu_available": torch.cuda.is_available()
    }

@app.post("/generate")
async def generate(request: GenerationRequest):
    REQUEST_COUNT.labels(method="POST", endpoint="/generate").inc()
    
    with REQUEST_DURATION.time():
        if text_generator is None:
            raise HTTPException(status_code=500, detail="Model not loaded")
        
        try:
            outputs = text_generator(
                request.prompt,
                max_new_tokens=request.max_tokens,
                do_sample=True,
                temperature=0.7
            )
            
            response = outputs[0]['generated_text'][len(request.prompt):].strip()
            
            return {
                "response": response,
                "prompt": request.prompt
            }
            
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))

@app.get("/metrics", response_class=PlainTextResponse)
async def get_metrics():
    return generate_latest()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
""".strip()
    
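    # Write the files to disk (the output directory name is a choice made for this lab)
    files = {
        "Dockerfile": dockerfile_content,
        "requirements.txt": requirements_content,
        "docker-compose.yml": docker_compose_content,
        "prometheus.yml": prometheus_config,
        "main.py": main_py_content
    }
    
    output_dir = Path("orpo_deploy")
    output_dir.mkdir(exist_ok=True)
    for filename, content in files.items():
        (output_dir / filename).write_text(content + "\n", encoding="utf-8")
    
    print(f"🐳 Docker deployment files written to {output_dir}/ (preview):")
    print("="*50)
    
    for filename, content in files.items():
        print(f"\n📄 {filename}:")
        print("-" * 30)
        print(content[:200] + "..." if len(content) > 200 else content)
    
    print("\n🚀 Deployment commands:")
    commands = [
        f"cd {output_dir}",
        "",
        "# Build and start the services",
        "docker-compose up --build -d",
        "",
        "# Follow the logs",
        "docker-compose logs -f orpo-model",
        "",
        "# Stop the services",
        "docker-compose down",
        "",
        "# Restart a specific service",
        "docker-compose restart orpo-model"
    ]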
    
    for cmd in commands:
        print(cmd)
    
    return files

# Generate the Docker configuration
docker_files = generate_docker_files()
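
The compose file passes `MODEL_NAME` and `USE_QUANTIZATION` to the container as environment variables. As a minimal sketch, those values can be parsed once into a typed settings object instead of being read ad hoc. `ServiceSettings`, `load_settings`, and the `PORT` variable are hypothetical names that do not appear in the generated `main.py`.

```python
import os
from dataclasses import dataclass

TRUE_VALUES = {"1", "true", "yes", "on"}

@dataclass(frozen=True)
class ServiceSettings:
    model_name: str
    use_quantization: bool
    port: int

def load_settings(env=os.environ):
    """Read service settings from a mapping of environment variables."""
    port = int(env.get("PORT", "8000"))  # PORT is a hypothetical variable
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    flag = env.get("USE_QUANTIZATION", "true").strip().lower()
    return ServiceSettings(
        model_name=env.get("MODEL_NAME", "microsoft/DialoGPT-medium"),
        use_quantization=flag in TRUE_VALUES,
        port=port,
    )

# Passing a plain dict makes the parsing easy to try without touching os.environ.
print(load_settings({"USE_QUANTIZATION": "False", "PORT": "9000"}))
```

Validating at startup means a mistyped value fails fast rather than surfacing later as an unexpected model configuration.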

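## 10. Deployment Summary and Best Practices

### 🎯 Deployment Checklist

✅ **Model optimization**: quantization reduces memory use  
✅ **API service**: FastAPI provides a high-performance RESTful interface  
✅ **User interface**: Gradio provides a friendly web interface for testing  
✅ **Monitoring**: Prometheus metrics collection and analysis  
✅ **A/B testing**: a framework for comparing model versions  
✅ **Containerization**: Docker configuration for a consistent deployment environment  
✅ **Health checks**: automated monitoring of service status  
✅ **Performance benchmark**: measured inference performance  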

### 🏗️ Recommended Production Architecture

```
User request → Load balancer → API Gateway → ORPO model service
                                                      ↓
                                       Monitoring (Prometheus/Grafana)
                                                      ↓
                                       A/B testing system → Data analysis
```

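### 📊 Key Performance Indicators (KPIs)

1. **Latency**:
   - P50 inference time < 1.0s
   - P95 inference time < 2.0s
   - P99 inference time < 5.0s

2. **Throughput**:
   - QPS (queries per second) > 10
   - Tokens/s > 50

3. **Reliability**:
   - Availability > 99.9%
   - Error rate < 0.1%

4. **Resource utilization**:
   - GPU utilization 60-80%
   - Memory utilization < 90%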

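### 🔧 Operations Best Practices

1. **Monitoring and alerting**:
   - Alert on abnormal inference times
   - Alert on GPU memory usage
   - Monitor service health

2. **Automated deployment**:
   - CI/CD pipeline for automated testing and deployment
   - Blue-green deployment or rolling updates
   - Automatic rollback

3. **Data management**:
   - Back up model checkpoints regularly
   - Rotate and archive logs
   - Collect user feedback data

4. **Security**:
   - API authentication and authorization
   - Input filtering and validation
   - Rate limiting to prevent abuse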
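
Rate limiting, the last item above, is often implemented as a token bucket: each client gets a bucket that refills at a fixed rate, and a request proceeds only if a token is available. The class below is a minimal single-process sketch under that assumption. `TokenBucket` is a hypothetical name, the injectable `clock` exists only to make the behavior testable, and a multi-worker deployment would need shared state (for example in an external store) instead.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client key (for example an API key); values here are illustrative.
buckets = {}

def is_allowed(client_key, rate=2, capacity=5):
    bucket = buckets.setdefault(client_key, TokenBucket(rate, capacity))
    return bucket.allow()

print([is_allowed("client-a") for _ in range(6)])
```

In an API handler, a `False` result would typically translate into an HTTP 429 response.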

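### 🚀 Scaling Strategies

1. **Horizontal scaling**: add model instances to handle concurrent requests
2. **Model sharding**: distribute large models across devices
3. **Caching**: cache the results of common queries
4. **Edge deployment**: deploy to CDN edge nodes to reduce latency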
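
For the caching strategy, a small in-memory cache with expiry is often enough to start with. The sketch below combines least-recently-used eviction with a time-to-live. `TTLCache` is a hypothetical name, and caching whole responses is only sensible when identical prompts should return identical answers, which is an assumption that does not hold for the sampling settings used in this guide (`do_sample=True`, `temperature=0.7`).

```python
import time
from collections import OrderedDict

class TTLCache:
    """Small LRU cache whose entries expire after `ttl_seconds`."""

    def __init__(self, max_entries=1024, ttl_seconds=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        """Store a value and evict the least recently used entries if over capacity."""
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

# The key would usually combine the prompt with the generation parameters.
cache = TTLCache(max_entries=2, ttl_seconds=60.0)
cache.put(("What is ORPO?", 128), "cached answer")
print(cache.get(("What is ORPO?", 128)))
```

A lookup before calling the model, followed by a `put` after a successful generation, is the usual integration point.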

---

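**🎉 The ORPO model production deployment guide is complete!**

You now have an end-to-end production deployment plan for an ORPO-aligned model, including:
- A high-performance inference service
- A monitoring system
- An A/B testing framework
- A containerized deployment setup
- Recommendations for production-grade best practices

This workflow is intended to help the ORPO model run stably and efficiently in production and to give users a good experience with the service.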