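# Lab-1.4: Memory Profiling and Optimization

**Learning objectives**:
- Use PyTorch's memory analysis tools
- Identify memory bottlenecks during training
- Profile performance with the PyTorch Profiler
- Plan a memory optimization strategy

**Estimated time**: 30-45 minutes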

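## 1. Background

### 1.1 What Occupies GPU Memory

When training a deep learning model, GPU memory is used mainly for:

```
total memory = model parameters + gradients + optimizer state + activations + temporary buffers
```

**Breakdown (7B model, FP32 training)**:
- Model parameters: 28 GB (7B × 4 bytes)
- Gradients: 28 GB (same size as the parameters)
- Optimizer state (Adam): 56 GB (momentum + variance, 2 × 28 GB)
- Activations: depends on batch size and sequence length (typically 10-30 GB)
- **Total**: ~120-140 GB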
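
The static part of the breakdown above (everything except activations) is plain arithmetic, so it can be checked with a few lines of Python. This is a minimal sketch; `training_memory_gb` is a hypothetical helper name, and it assumes 1 GB = 1e9 bytes and an Adam-style optimizer with two states per parameter.

```python
def training_memory_gb(num_params, bytes_per_param=4, optimizer_states=2):
    """Static training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    params = num_params * bytes_per_param / 1e9
    grads = params                         # one gradient value per parameter
    optimizer = params * optimizer_states  # Adam keeps momentum + variance
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total": params + grads + optimizer,
    }


estimate = training_memory_gb(7e9)  # 7B parameters, FP32
for name, gb in estimate.items():
    print(f"{name:<10} {gb:6.1f} GB")
```

The static total for the 7B example is 112 GB; adding 10-30 GB of activations gives the ~120-140 GB range quoted above.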

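### 1.2 Goals of Memory Analysis

1. **Locate the memory peak**: find the point in training where memory usage is highest
2. **Detect memory leaks**: check for tensors that are never released
3. **Optimize allocation**: cut unnecessary memory overhead
4. **Predict memory requirements**: plan hardware resources for production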

## 2. Environment Setup

In [1]:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torch.profiler import profile, ProfilerActivity, record_function
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import numpy as np
from tqdm.auto import tqdm
import gc
import time

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

PyTorch Version: 2.6.0+cu124
CUDA Available: True
Using device: cuda
GPU: NVIDIA RTX 2000 Ada Generation
Total memory: 16.71 GB


## 3. Basic Memory Monitoring Tools

In [2]:
def print_gpu_memory(prefix=""):
    """Print the current GPU memory usage and return it as a dict (values in GB)."""
    if not torch.cuda.is_available():
        print(f"{prefix}GPU not available")
        return

    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    max_allocated = torch.cuda.max_memory_allocated() / 1e9

    print(f"{prefix}Memory usage:")
    print(f"  Allocated: {allocated:.3f} GB")
    print(f"  Reserved:  {reserved:.3f} GB")
    print(f"  Peak:      {max_allocated:.3f} GB")

    return {
        "allocated": allocated,
        "reserved": reserved,
        "peak": max_allocated
    }


def reset_gpu_memory():
    """Free unreferenced tensors, release cached blocks, and reset the peak statistics."""
    # Collect Python garbage first so unreachable tensors are actually freed;
    # only then can empty_cache() hand their blocks back to the driver.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()


# Quick test
print("=" * 60)
print("Initial GPU memory state")
print("=" * 60)
reset_gpu_memory()
print_gpu_memory()

Initial GPU memory state
Memory usage:
  Allocated: 0.000 GB
  Reserved:  0.000 GB
  Peak:      0.000 GB


{'allocated': 0.0, 'reserved': 0.0, 'peak': 0.0}

## 4. Memory Tracker Class

In [3]:
class MemoryProfiler:
    """Collects labelled memory snapshots and a memory timeline."""
    def __init__(self):
        self.snapshots = []
        self.timeline = []
        # Initialized here so record_timeline() works even without start_timeline()
        self.timeline_start = time.time()

    def reset(self):
        """Clear all recorded data and reset the GPU memory statistics."""
        self.snapshots = []
        self.timeline = []
        self.timeline_start = time.time()
        reset_gpu_memory()

    def snapshot(self, label="", timestamp=None):
        """Record a snapshot of the current memory state."""
        if not torch.cuda.is_available():
            return

        if timestamp is None:
            timestamp = time.time()

        snapshot = {
            "label": label,
            "timestamp": timestamp,
            "allocated": torch.cuda.memory_allocated() / 1e9,
            "reserved": torch.cuda.memory_reserved() / 1e9,
            "peak": torch.cuda.max_memory_allocated() / 1e9
        }

        self.snapshots.append(snapshot)
        return snapshot

    def start_timeline(self):
        """Start recording the memory timeline."""
        self.timeline = []
        self.timeline_start = time.time()

    def record_timeline(self, label=""):
        """Record one point on the timeline."""
        if not torch.cuda.is_available():
            return

        elapsed = time.time() - self.timeline_start
        self.timeline.append({
            "time": elapsed,
            "label": label,
            "allocated": torch.cuda.memory_allocated() / 1e9
        })

    def plot_snapshots(self):
        """Plot the memory snapshots as grouped bars."""
        if not self.snapshots:
            print("No snapshots to plot")
            return

        labels = [s["label"] for s in self.snapshots]
        allocated = [s["allocated"] for s in self.snapshots]
        reserved = [s["reserved"] for s in self.snapshots]
        peak = [s["peak"] for s in self.snapshots]

        fig, ax = plt.subplots(figsize=(12, 6))
        x = np.arange(len(labels))
        width = 0.25

        ax.bar(x - width, allocated, width, label="Allocated", color="#3498db")
        ax.bar(x, reserved, width, label="Reserved", color="#95a5a6")
        ax.bar(x + width, peak, width, label="Peak", color="#e74c3c")

        ax.set_xlabel("Stage")
        ax.set_ylabel("Memory (GB)")
        ax.set_title("Memory usage snapshots", fontsize=14, fontweight="bold")
        ax.set_xticks(x)
        ax.set_xticklabels(labels, rotation=45, ha="right")
        ax.legend()
        ax.grid(axis="y", alpha=0.3)

        plt.tight_layout()
        plt.show()

    def plot_timeline(self):
        """Plot the memory timeline."""
        if not self.timeline:
            print("No timeline to plot")
            return

        times = [t["time"] for t in self.timeline]
        memory = [t["allocated"] for t in self.timeline]

        plt.figure(figsize=(12, 5))
        plt.plot(times, memory, linewidth=2, color="#3498db", marker="o")
        plt.xlabel("Time (s)")
        plt.ylabel("Allocated memory (GB)")
        plt.title("Memory usage timeline", fontsize=14, fontweight="bold")
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.show()

    def print_summary(self):
        """Print a summary of the recorded snapshots."""
        if not self.snapshots:
            print("No snapshot data")
            return

        # The snapshot with the highest allocated memory at the time it was taken
        peak_snapshot = max(self.snapshots, key=lambda s: s["allocated"])

        print("\n" + "=" * 60)
        print("Memory analysis summary")
        print("=" * 60)
        print(f"Total snapshots: {len(self.snapshots)}")
        print(f"\nHighest memory usage:")
        print(f"  Stage: {peak_snapshot['label']}")
        print(f"  Allocated: {peak_snapshot['allocated']:.3f} GB")
        print(f"  Reserved: {peak_snapshot['reserved']:.3f} GB")
        print(f"  Peak: {peak_snapshot['peak']:.3f} GB")


# Create a global profiler
profiler = MemoryProfiler()
print("✅ Memory profiler initialized")

✅ Memory profiler initialized
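
A complementary pattern to labelled snapshots is to measure how much memory one region of code adds. The sketch below is one possible way to do that with a context manager; `track_memory`, `read_bytes`, and `log` are hypothetical names, and the counter is a stand-in so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def track_memory(label, read_bytes, log):
    """Append the change in `read_bytes()` across the with-block to `log`."""
    before = read_bytes()
    try:
        yield
    finally:
        after = read_bytes()
        log.append({"label": label, "delta_gb": (after - before) / 1e9})


# Stand-in counter with made-up values (not a real measurement).
fake_allocated = {"bytes": 0}
log = []

with track_memory("load model", lambda: fake_allocated["bytes"], log):
    fake_allocated["bytes"] += 1_500_000_000  # pretend 1.5 GB was allocated

print(log)
```

On a CUDA machine, `torch.cuda.memory_allocated` could be passed as `read_bytes` instead of the stand-in lambda.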


## 5. Experiment 1: Memory Analysis of Model Loading

In [4]:
print("=" * 70)
print("Experiment 1: Memory analysis of model loading")
print("=" * 70)

profiler.reset()
profiler.snapshot("Initial state")

# Load the model
print("\nLoading GPT-2 Medium...")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
profiler.snapshot("Model loaded (CPU)")

# Move it to the GPU
print("Moving to GPU...")
model = model.to(device)
profiler.snapshot("Model on GPU")

# Count the model parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
param_size_gb = total_params * 4 / 1e9  # FP32: 4 bytes per param

print(f"\nModel statistics:")
print(f"  Total parameters: {total_params / 1e6:.1f}M")
print(f"  Trainable parameters: {trainable_params / 1e6:.1f}M")
print(f"  Theoretical size (FP32): {param_size_gb:.3f} GB")

# Show the memory snapshots
profiler.plot_snapshots()
profiler.print_summary()

# Analysis: compare measured memory with the theoretical parameter size
stats = profiler.snapshots[-1]
print(f"\nAnalysis:")
print(f"  Actual memory usage: {stats['allocated']:.3f} GB")
print(f"  Theoretical parameter size: {param_size_gb:.3f} GB")
print(f"  Overhead: {(stats['allocated'] - param_size_gb) / param_size_gb * 100:.1f}%")

# Clean up
del model
reset_gpu_memory()

Experiment 1: Memory analysis of model loading

Loading GPT-2 Medium...
Moving to GPU...


OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 15.57 GiB of which 23.06 MiB is free. Process 3859131 has 12.57 GiB memory in use. Process 3880458 has 2.10 GiB memory in use. Including non-PyTorch memory, this process has 510.00 MiB memory in use. Of the allocated memory 413.55 MiB is allocated by PyTorch, and 12.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
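
The error message reports capacity in GiB (2^30 bytes), while the setup cell in section 2 divides by 1e9 and prints GB, so the two totals differ even though they describe the same card. A small sketch of the conversion, using only the numbers printed above (`gb_to_gib` is a hypothetical helper name):

```python
def gb_to_gib(gb):
    """Convert decimal gigabytes (1e9 bytes) to binary gibibytes (2**30 bytes)."""
    return gb * 1e9 / 2**30


total_gib = gb_to_gib(16.71)  # the "16.71 GB" printed in section 2
print(f"Total capacity: {total_gib:.2f} GiB")

# Memory held by the two other processes named in the error message
other_processes_gib = 12.57 + 2.10
share = other_processes_gib / 15.57
print(f"Held by other processes: {other_processes_gib:.2f} GiB ({share:.0%} of the card)")
```

The converted total agrees with the 15.57 GiB in the message up to rounding of the printed value. The message also shows that most of the card was held by other processes, so this OOM reflects a shared GPU, not the size of GPT-2 Medium.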

## 6. Experiment 2: Memory Analysis of the Training Loop

In [None]:
# Prepare the data
class SimpleTextDataset(Dataset):
    def __init__(self, tokenizer, num_samples=100, seq_length=128):
        self.tokenizer = tokenizer
        self.num_samples = num_samples
        self.seq_length = seq_length
        self.texts = ["The quick brown fox jumps over the lazy dog. " * 10 for _ in range(num_samples)]

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        encodings = self.tokenizer(
            self.texts[idx],
            max_length=self.seq_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        input_ids = encodings["input_ids"].squeeze(0)
        attention_mask = encodings["attention_mask"].squeeze(0)
        # Exclude padding positions from the loss (-100 is the ignore index of the LM loss)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset = SimpleTextDataset(tokenizer, num_samples=100, seq_length=128)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

print("=" * 70)
print("Experiment 2: Memory analysis of the training loop")
print("=" * 70)

# Reload the model
profiler.reset()
model = GPT2LMHeadModel.from_pretrained("gpt2")
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

profiler.snapshot("Before training")

# Train for one epoch
model.train()
profiler.start_timeline()

print("\nStarting training...")
for step, batch in enumerate(tqdm(dataloader, desc="Training")):
    batch = {k: v.to(device) for k, v in batch.items()}

    # Record the state before the forward pass
    if step == 0:
        profiler.record_timeline("Before forward")

    # Forward pass
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss

    if step == 0:
        profiler.record_timeline("After forward")

    # Backward pass
    loss.backward()

    if step == 0:
        profiler.record_timeline("After backward")

    # Parameter update
    optimizer.step()

    if step == 0:
        profiler.record_timeline("After update")
        profiler.snapshot("After iteration 1")

    # Record once every 10 steps
    if step % 10 == 0:
        profiler.record_timeline(f"Step {step}")

profiler.snapshot("Training finished")

# Plot the results
print("\nMemory snapshots:")
profiler.plot_snapshots()

print("\nMemory timeline:")
profiler.plot_timeline()

profiler.print_summary()

# Clean up
del model, optimizer
reset_gpu_memory()
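
The timeline recorded for the first step is easier to read as per-phase changes: how much each phase added or released. The sketch below computes those deltas from records shaped like the entries of `profiler.timeline`. The readings are hypothetical, chosen only to show the usual pattern (activations appear in the forward pass, are released in the backward pass as gradients appear, and Adam allocates its state on the first update); `phase_deltas` is a hypothetical helper name.

```python
def phase_deltas(timeline):
    """Return (label, change in allocated GB since the previous record) pairs."""
    deltas = []
    for prev, cur in zip(timeline, timeline[1:]):
        deltas.append((cur["label"], cur["allocated"] - prev["allocated"]))
    return deltas


# Hypothetical readings, not measured results.
example_timeline = [
    {"time": 0.00, "label": "Before forward", "allocated": 0.50},
    {"time": 0.05, "label": "After forward", "allocated": 1.30},
    {"time": 0.12, "label": "After backward", "allocated": 1.00},
    {"time": 0.15, "label": "After update", "allocated": 2.00},
]

deltas = phase_deltas(example_timeline)
for label, delta in deltas:
    print(f"{label:<16} {delta:+.2f} GB")
```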

## 7. Experiment 3: Effect of Batch Size on Memory

In [None]:
print("=" * 70)
print("Experiment 3: Effect of batch size on memory")
print("=" * 70)

def measure_batch_size_memory(batch_size, num_steps=10):
    """Measure peak memory for a given batch size."""
    reset_gpu_memory()
    
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    model.train()
    dataloader_iter = iter(dataloader)
    
    for step in range(num_steps):
        try:
            batch = next(dataloader_iter)
        except StopIteration:
            dataloader_iter = iter(dataloader)
            batch = next(dataloader_iter)
        
        batch = {k: v.to(device) for k, v in batch.items()}
        
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    
    del model, optimizer
    reset_gpu_memory()
    
    return peak_memory


# Try different batch sizes
batch_sizes = [1, 2, 4, 8, 16]
memory_usage = []

print("\nTesting different batch sizes...")
for bs in batch_sizes:
    print(f"  Testing batch_size={bs}...", end=" ")
    try:
        mem = measure_batch_size_memory(bs, num_steps=10)
        memory_usage.append(mem)
        print(f"Peak memory: {mem:.2f} GB")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("OOM (out of memory)")
            memory_usage.append(None)
        else:
            raise e

# Plot the results
valid_bs = [bs for bs, mem in zip(batch_sizes, memory_usage) if mem is not None]
valid_mem = [mem for mem in memory_usage if mem is not None]

plt.figure(figsize=(10, 6))
plt.plot(valid_bs, valid_mem, marker="o", linewidth=2, markersize=10, color="#3498db")
plt.xlabel("Batch size")
plt.ylabel("Peak memory (GB)")
plt.title("Batch size vs. memory usage", fontsize=14, fontweight="bold")
plt.grid(alpha=0.3)

# Add value labels
for bs, mem in zip(valid_bs, valid_mem):
    plt.text(bs, mem, f"{mem:.2f}GB", ha="center", va="bottom", fontsize=10)

plt.tight_layout()
plt.show()

# Analysis
print("\n" + "=" * 70)  # was "\n=" * 70, which printed 70 separate "=" lines
print("Batch size memory analysis")
print("=" * 70)
print(f"\n{'Batch size':<15} {'Peak memory (GB)':<20} {'Relative to BS=1':<20}")
print("-" * 70)
for bs, mem in zip(batch_sizes, memory_usage):
    if mem is not None:
        relative = f"{mem / memory_usage[0]:.2f}x" if memory_usage[0] else "N/A"
        print(f"{bs:<15} {mem:<20.2f} {relative:<20}")
    else:
        print(f"{bs:<15} {'OOM':<20} {'N/A':<20}")
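
If peak memory grows roughly linearly with batch size, a straight-line fit separates the fixed cost (parameters, gradients, optimizer state) from the per-sample activation cost. The sketch below uses ordinary least squares on hypothetical readings; `fit_line` and the `example_*` values are illustrative and are not results from this lab.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical peak-memory readings in GB, not measured results.
example_batch_sizes = [1, 2, 4, 8]
example_peaks_gb = [2.3, 2.6, 3.2, 4.4]

fixed_gb, per_sample_gb = fit_line(example_batch_sizes, example_peaks_gb)
print(f"Fixed cost:      {fixed_gb:.2f} GB")
print(f"Per-sample cost: {per_sample_gb:.2f} GB")
print(f"Predicted peak at batch_size=16: {fixed_gb + per_sample_gb * 16:.2f} GB")
```

With real measurements, the lists from the experiment (`valid_bs`, `valid_mem`) could be passed in directly; the extrapolation is only a rough guide, since allocator behavior is not perfectly linear.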

## 8. Experiment 4: Detailed Analysis with the PyTorch Profiler

In [None]:
print("=" * 70)
print("Experiment 4: Detailed performance analysis with the PyTorch Profiler")
print("=" * 70)

reset_gpu_memory()

# Reload the model
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

print("\nProfiling training with the PyTorch Profiler...")
print("(this may take a few minutes)\n")

model.train()
dataloader_iter = iter(dataloader)

# Run the profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=False
) as prof:
    for step in range(5):  # profile only 5 steps
        with record_function(f"training_step_{step}"):
            batch = next(dataloader_iter)
            batch = {k: v.to(device) for k, v in batch.items()}
            
            with record_function("forward"):
                optimizer.zero_grad()
                outputs = model(**batch)
                loss = outputs.loss
            
            with record_function("backward"):
                loss.backward()
            
            with record_function("optimizer_step"):
                optimizer.step()

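# Print the profiler results
print("\n" + "=" * 70)
print("Profiler report (sorted by CUDA time, top 10)")
print("=" * 70)
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

print("\n" + "=" * 70)
print("Profiler report (sorted by memory usage, top 10)")
print("=" * 70)
print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=10
))

# Per-phase analysis of the record_function regions
print("\n" + "=" * 70)
print("Phase-level analysis")
print("=" * 70)

events = prof.key_averages()
for evt in events:
    if evt.key in ["forward", "backward", "optimizer_step"]:
        # Attribute names differ across PyTorch releases (device_* in newer
        # ones, cuda_* in older ones), so try the newer name first.
        device_time = getattr(evt, "device_time_total", None)
        if device_time is None:
            device_time = evt.cuda_time_total
        device_mem = getattr(evt, "device_memory_usage", None)
        if device_mem is None:
            device_mem = evt.cuda_memory_usage
        print(f"\n{evt.key}:")
        print(f"  CPU time: {evt.cpu_time_total / 1e3:.2f} ms")
        print(f"  CUDA time: {device_time / 1e3:.2f} ms")
        print(f"  CPU memory: {evt.cpu_memory_usage / 1e6:.2f} MB")
        print(f"  CUDA memory: {device_mem / 1e6:.2f} MB")

# Clean up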
del model, optimizer
reset_gpu_memory()

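## 9. Experiment 5: Comparing Optimization Techniques

In [None]:
print("=" * 70)
print("Experiment 5: Memory comparison of optimization techniques")
print("=" * 70)

def measure_optimization(config_name, use_amp=False, use_checkpoint=False, accumulation_steps=1):
    """Measure peak memory for one optimization configuration."""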
    reset_gpu_memory()
    
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
    if use_checkpoint:
        model.gradient_checkpointing_enable()
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    scaler = GradScaler() if use_amp else None
    
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    dataloader_iter = iter(dataloader)
    
    model.train()
    
    # Train for a few steps
    for step in range(10):
        if step % accumulation_steps == 0:
            optimizer.zero_grad()
        
        batch = next(dataloader_iter)
        batch = {k: v.to(device) for k, v in batch.items()}
        
        if use_amp:
            with autocast(dtype=torch.float16):
                outputs = model(**batch)
                loss = outputs.loss / accumulation_steps
            scaler.scale(loss).backward()
            
            if (step + 1) % accumulation_steps == 0:
                scaler.step(optimizer)
                scaler.update()
        else:
            outputs = model(**batch)
            loss = outputs.loss / accumulation_steps
            loss.backward()
            
            if (step + 1) % accumulation_steps == 0:
                optimizer.step()
    
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    
    del model, optimizer
    reset_gpu_memory()
    
    return peak_memory


# Test different configurations
configs = [
    ("Baseline (no optimization)", False, False, 1),
    ("Mixed precision (FP16)", True, False, 1),
    ("Gradient checkpointing", False, True, 1),
    ("Gradient accumulation (x4)", False, False, 4),
    ("FP16 + checkpointing", True, True, 1),
    ("All optimizations", True, True, 4)
]

results = []

print("\nTesting different optimization configurations...")
for config_name, use_amp, use_checkpoint, accum in configs:
    print(f"  {config_name}...", end=" ")
    try:
        mem = measure_optimization(config_name, use_amp, use_checkpoint, accum)
        results.append((config_name, mem))
        print(f"Peak: {mem:.2f} GB")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("OOM")
            results.append((config_name, None))
        else:
            raise e

# Plot the results
valid_results = [(name, mem) for name, mem in results if mem is not None]
names = [name for name, _ in valid_results]
memories = [mem for _, mem in valid_results]

plt.figure(figsize=(12, 6))
colors = ["#e74c3c", "#3498db", "#2ecc71", "#f39c12", "#9b59b6", "#1abc9c"]
bars = plt.bar(range(len(names)), memories, color=colors[:len(names)])
plt.xticks(range(len(names)), names, rotation=45, ha="right")
plt.ylabel("Peak memory (GB)")
plt.title("Memory usage of different optimization techniques", fontsize=14, fontweight="bold")
plt.grid(axis="y", alpha=0.3)

# Add value labels
for bar, mem in zip(bars, memories):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{mem:.2f}GB',
             ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# Analysis
print("\n" + "=" * 80)
print("Memory savings by optimization technique")
print("=" * 80)
baseline_mem = results[0][1]
print(f"\n{'Configuration':<30} {'Peak memory (GB)':<20} {'Savings':<15}")
print("-" * 80)
for name, mem in results:
    if mem is not None:
        saving = (baseline_mem - mem) / baseline_mem * 100 if baseline_mem else 0
        print(f"{name:<30} {mem:<20.2f} {saving:>12.1f}%")
    else:
        print(f"{name:<30} {'OOM':<20} {'N/A':<15}")
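
The `loss / accumulation_steps` scaling in `measure_optimization` is what makes accumulated micro-batch gradients match the gradient of one large batch. The sketch below checks this on a one-parameter least-squares model in plain Python; the data and the `mse_grad` helper are made up for illustration, and the equality assumes equally sized micro-batches.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)**2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]  # made-up (x, y) pairs
w = 0.5

# Gradient of one large batch containing all four samples
full_grad = mse_grad(w, data)

# The same samples as two micro-batches with gradient accumulation
accumulation_steps = 2
micro_batches = [data[0:2], data[2:4]]
accumulated = 0.0
for micro in micro_batches:
    # Dividing by accumulation_steps mirrors `loss / accumulation_steps`
    accumulated += mse_grad(w, micro) / accumulation_steps

print(f"full batch:  {full_grad:.4f}")
print(f"accumulated: {accumulated:.4f}")
```

Only one micro-batch of activations is alive at a time, which is why accumulation gives a larger effective batch without raising peak memory.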

## 10. Memory Optimization Suggestion Generator

In [None]:
def generate_optimization_suggestions(model_params_gb, available_gpu_gb, batch_size, seq_length):
    """
    Generate memory optimization suggestions for a hardware configuration.

    Args:
        model_params_gb: size of the model parameters (GB)
        available_gpu_gb: available GPU memory (GB)
        batch_size: desired batch size
        seq_length: sequence length
    """
    print("=" * 70)
    print("Memory optimization suggestion generator")
    print("=" * 70)

    print(f"\nConfiguration:")
    print(f"  Model size: {model_params_gb:.2f} GB")
    print(f"  Available memory: {available_gpu_gb:.2f} GB")
    print(f"  Target batch size: {batch_size}")
    print(f"  Sequence length: {seq_length}")

    # Estimate the memory requirement:
    # parameters + gradients + optimizer state (Adam: 2x params) + activations
    estimated_base = model_params_gb * 4  # params + grads + optimizer
    activation_estimate = batch_size * seq_length * 0.001  # rough heuristic
    total_estimate = estimated_base + activation_estimate

    print(f"\nEstimated memory requirement:")
    print(f"  Model + gradients + optimizer: {estimated_base:.2f} GB")
    print(f"  Activations (estimate): {activation_estimate:.2f} GB")
    print(f"  Total: {total_estimate:.2f} GB")

    # Generate the suggestions
    print("\n" + "=" * 70)
    print("Suggestions")
    print("=" * 70)

    suggestions = []

    if total_estimate > available_gpu_gb:
        shortage = total_estimate - available_gpu_gb
        print(f"\n⚠️  Not enough memory: short by {shortage:.2f} GB")
        print("\nRequired optimizations (in priority order):")

        # 1. Mixed precision
        print("\n1. ✅ Enable mixed precision training (FP16/BF16)")
        print("   - Memory savings: roughly half of the activation memory")
        print("     (under autocast, weights, gradients, and Adam state stay FP32)")
        print("   - Speedup: up to 2-3x on GPUs with Tensor Cores")
        print("   - Code: wrap the forward pass in autocast()")
        suggestions.append("mixed_precision")

        # Autocast shrinks activations, not the FP32 parameter/optimizer footprint
        activation_estimate *= 0.5
        total_estimate = estimated_base + activation_estimate

        if total_estimate > available_gpu_gb:
            # 2. Gradient checkpointing
            print("\n2. ✅ Enable gradient checkpointing")
            print("   - Memory savings: ~40% of the activation memory")
            print("   - Time overhead: +20-30%")
            print("   - Code: model.gradient_checkpointing_enable()")
            suggestions.append("gradient_checkpointing")

            activation_estimate *= 0.6
            total_estimate = estimated_base + activation_estimate

        if total_estimate > available_gpu_gb:
            # 3. Smaller batch + gradient accumulation
            suggested_micro_batch = max(1, batch_size // 4)
            accumulation = batch_size // suggested_micro_batch
            print(f"\n3. ✅ Reduce the batch size and use gradient accumulation")
            print(f"   - Suggested: micro_batch={suggested_micro_batch}, accumulation={accumulation}")
            print(f"   - Effective batch: {suggested_micro_batch * accumulation}")
            print(f"   - Activation memory savings: ~{(1 - suggested_micro_batch/batch_size)*100:.0f}%")
            suggestions.append(f"gradient_accumulation_{accumulation}")

            activation_estimate *= suggested_micro_batch / batch_size
            total_estimate = estimated_base + activation_estimate

        if total_estimate > available_gpu_gb:
            # 4. Other options
            print("\n4. 💡 Consider other options:")
            print("   - DeepSpeed ZeRO (multi-GPU / CPU offload)")
            print("   - Model quantization (INT8/INT4)")
            print("   - PEFT techniques (LoRA/QLoRA)")
    else:
        margin = available_gpu_gb - total_estimate
        print(f"\n✅ Enough memory: {margin:.2f} GB to spare")
        print("\nOptional optimizations (for speed):")
        print("\n1. ⚡ Enable mixed precision training")
        print("   - Speedup: up to 2-3x on GPUs with Tensor Cores")
        print("   - Additional memory savings: roughly half of the activation memory")

        if margin > model_params_gb:
            print("\n2. 📈 The batch size can be increased")
            suggested_bs = int(batch_size * (1 + margin / total_estimate))
            print(f"   - Suggested batch size: {suggested_bs}")

    return suggestions


# Example usage
print("\nExample 1: Training GPT-2 Medium on an 8 GB GPU")
generate_optimization_suggestions(
    model_params_gb=1.4,  # GPT-2 Medium FP32
    available_gpu_gb=8.0,
    batch_size=8,
    seq_length=512
)

print("\n\n" + "=" * 70 + "\n")

print("Example 2: Training GPT-2 Large on a 24 GB GPU")
generate_optimization_suggestions(
    model_params_gb=3.0,  # GPT-2 Large FP32
    available_gpu_gb=24.0,
    batch_size=16,
    seq_length=1024
)

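## 11. Summary and Best Practices

### Conclusions

1. **Memory composition**:
   - Model parameters: fixed size
   - Gradients: same size as the parameters
   - Optimizer state: Adam needs 2x the parameter size
   - Activations: proportional to batch size and sequence length

2. **Effect of batch size**:
   - A larger batch means more activation memory
   - Total memory grows close to linearly: a fixed cost plus a per-sample cost
   - Batch size has to be traded off against available memory

3. **Effect of the optimization techniques**:
   - Mixed precision: highest priority; roughly halves activation memory (FP32 weights, gradients, and optimizer state keep their size under autocast) and can be 2-3x faster on Tensor Core GPUs
   - Gradient checkpointing: saves 30-50% of activation memory at a 20-30% time cost
   - Gradient accumulation: reaches a large effective batch without extra memory
   - Combined: savings of 70-80% are possible when activations dominate the footprint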

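### Memory Optimization Decision Tree

```
Start training
    |
    ├─ OOM? ─ no ─→ consider mixed precision for speed
    |         |        |
    |         └─ OOM? ─ no ─→ the batch size can be increased
    |
    └─ yes
       |
       ├─ Step 1: enable mixed precision (FP16/BF16)
       |          ↓
       |       still OOM?
       |
       ├─ Step 2: enable gradient checkpointing
       |          ↓
       |       still OOM?
       |
       ├─ Step 3: reduce the batch size + gradient accumulation
       |          ↓
       |       still OOM?
       |
       └─ Step 4: consider DeepSpeed/PEFT/quantization
```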
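
The OOM branch of the decision tree is an ordered list of mitigations, each tried only if the previous ones were not enough. One way to encode that order is sketched below; `OOM_STEPS` and `next_oom_step` are hypothetical names, and the step identifiers are labels for this sketch only.

```python
# Mitigations in the order given by the decision tree above
OOM_STEPS = [
    "mixed_precision",
    "gradient_checkpointing",
    "smaller_batch_with_accumulation",
    "deepspeed_peft_or_quantization",
]


def next_oom_step(already_applied):
    """Return the first mitigation that has not been applied yet, or None."""
    for step in OOM_STEPS:
        if step not in already_applied:
            return step
    return None  # every step in the tree has been tried


print(next_oom_step([]))
print(next_oom_step(["mixed_precision", "gradient_checkpointing"]))
```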

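### Summary of PyTorch Memory Management Tools

#### 1. Basic Monitoring

```python
# Inspect current memory usage (values in bytes)
torch.cuda.memory_allocated()  # allocated to live tensors
torch.cuda.memory_reserved()   # reserved by the caching allocator
torch.cuda.max_memory_allocated()  # peak allocated

# Reset the statistics
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()  # release unused cached blocks
```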

#### 2. Using the Profiler

```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True
) as prof:
    # training code
    pass

# Inspect the results
print(prof.key_averages().table(sort_by="cuda_memory_usage"))
```

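#### 3. Memory Snapshots

```python
# Record a memory snapshot (requires PyTorch 2.1+; these are private, underscore-prefixed APIs)
torch.cuda.memory._record_memory_history()
# ... training code
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```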

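### Diagnosing Common Memory Problems

#### Problem 1: OOM (Out of Memory)

**Symptom**: `torch.cuda.OutOfMemoryError: CUDA out of memory` (a subclass of `RuntimeError`; older releases report `RuntimeError: CUDA out of memory`)

**Diagnosis**:
1. Print the peak memory: `torch.cuda.max_memory_allocated()`
2. Check whether the batch size is too large
3. Check for memory leaks
4. Check whether other processes are holding GPU memory

**Solutions**:
```python
# 1. Reduce the batch size
batch_size = 1

# 2. Enable mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**batch)

# 3. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 4. Release cached blocks (helps only when memory is reserved but unallocated)
torch.cuda.empty_cache()
```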

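#### Problem 2: Memory Leaks

**Symptom**: memory keeps growing and is never released

**Diagnosis**:
```python
# Monitor memory at regular steps
for step in range(100):
    # training step goes here
    if step % 10 == 0:
        print(f"Step {step}: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```

**Common causes**:
- Tensors that were never detached
- Keeping the whole computation graph alive
- Accumulating values in global variables

**Fix**:
```python
# .item() returns a plain Python float that holds no reference to the graph
loss_value = loss.item()

# Avoid storing tensors in lists
losses.append(loss.item())  # ✅ correct
losses.append(loss)         # ❌ leaks: keeps every step's graph alive
```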
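
The per-step readings from the diagnosis loop can be checked automatically instead of by eye. The sketch below flags a run whose allocated memory keeps rising after a short warm-up; `looks_like_leak`, its thresholds, and the readings are illustrative assumptions, not a rule from PyTorch.

```python
def looks_like_leak(readings_gb, warmup=2, tolerance_gb=0.01):
    """Return True if allocated memory rises on every reading after warm-up.

    The first `warmup` readings are skipped because gradients and optimizer
    state are allocated during the first steps of a healthy run.
    """
    steady = readings_gb[warmup:]
    if len(steady) < 2:
        return False
    rises = [b - a for a, b in zip(steady, steady[1:])]
    return all(rise > tolerance_gb for rise in rises)


# Hypothetical readings in GB, not measured results.
healthy = [1.0, 2.0, 2.0, 2.01, 2.0, 2.0]
leaking = [1.0, 2.0, 2.1, 2.2, 2.3, 2.4]

print("healthy run flagged:", looks_like_leak(healthy))
print("leaking run flagged:", looks_like_leak(leaking))
```

Requiring a rise on every reading is deliberately strict; a noisier real run may need a slope-based check instead.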

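#### Problem 3: Memory Fragmentation

**Symptom**: there is enough total memory, but an allocation still fails

**Fix**: clean up periodically. The OOM message in Experiment 1 also suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` when reserved-but-unallocated memory is large.
```python
# Periodic cleanup (collect garbage first so freed blocks can be released)
if step % 100 == 0:
    gc.collect()
    torch.cuda.empty_cache()
```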

### Recommendations for Production

#### 1. Memory Monitoring

```python
class MemoryMonitor:
    def __init__(self, log_interval=100):
        self.log_interval = log_interval
        self.peak_memory = 0

    def log(self, step):
        if step % self.log_interval == 0:
            current = torch.cuda.memory_allocated() / 1e9
            peak = torch.cuda.max_memory_allocated() / 1e9
            self.peak_memory = max(self.peak_memory, peak)

            print(f"Step {step}: {current:.2f}GB (peak: {peak:.2f}GB)")

            # Log to tensorboard/wandb
            # writer.add_scalar('memory/allocated', current, step)

monitor = MemoryMonitor()
for step in range(num_steps):
    # training step goes here
    monitor.log(step)
```

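#### 2. Automatic Configuration

```python
def auto_configure(model, gpu_memory_gb):
    """Choose optimization settings from the available GPU memory (rough heuristic)."""
    model_size_gb = sum(p.numel() for p in model.parameters()) * 4 / 1e9
    # FP32 parameters + gradients + Adam state take about 4x the parameter size
    static_gb = model_size_gb * 4

    config = {
        'use_amp': True,          # always use mixed precision
        'use_checkpoint': False,
    }

    # Use checkpointing when memory is tight
    if static_gb > gpu_memory_gb * 0.7:
        config['use_checkpoint'] = True

    # Derive the batch size from what is left for activations
    available = gpu_memory_gb - static_gb
    config['batch_size'] = max(1, int(available / 0.5))  # rough: assumes ~0.5 GB per sample

    return config
```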

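### Memory Optimization Checklist

Before training:
- [ ] Enable mixed precision training
- [ ] Pick a batch size that fits the GPU
- [ ] Consider gradient checkpointing for large models
- [ ] Set up memory monitoring

During training:
- [ ] Log peak memory regularly
- [ ] Check whether memory keeps growing (leak)
- [ ] Watch GPU utilization

On OOM:
- [ ] Reduce the batch size
- [ ] Enable gradient accumulation
- [ ] Enable gradient checkpointing
- [ ] Empty the CUDA cache
- [ ] Consider DeepSpeed/FSDP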

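## 12. Next Steps

Congratulations on completing all the Lab-1.4 experiments! 🎉

You have now covered:
- ✅ Mixed precision training (up to 2-3x faster, lower activation memory)
- ✅ Gradient accumulation (working around memory limits)
- ✅ Gradient checkpointing (30-50% less activation memory)
- ✅ Memory profiling and optimization (locating bottlenecks, planning a strategy)

### Recommended next steps:

1. **Apply this to the PEFT labs**
   - Use these optimizations in LoRA/QLoRA training
   - Train larger models

2. **Learn distributed training**
   - Lab-1.2: PyTorch DDP Basics
   - Lab-1.3: DeepSpeed (multi-GPU)

3. **Advanced optimization techniques**
   - FlashAttention (long-sequence optimization)
   - Efficient Attention (MQA/GQA)

4. **Production deployment**
   - Model quantization and compression
   - Inference optimization

Training optimization is a core skill in LLM engineering. Keep going! 🚀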