
# Gradient-Sensitivity-Based Adaptive LoRA

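### 1. Background and Objectives

Standard LoRA typically uses a uniform rank (the low-rank dimension), giving the attention modules of every Transformer layer the same LoRA parameter budget. Layers differ markedly in how much they matter for a downstream task, however, so a uniform rank can waste parameters on some layers while leaving others short of capacity.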

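This experiment therefore proposes and implements a gradient-sensitivity-based Adaptive LoRA method. Its core objective is:

Use the gradient magnitudes of the individual Transformer layers early in training to assign each layer its own LoRA rank, improving parameter efficiency while preserving task performance.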
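
To make per-layer rank assignment concrete, the sketch below splits a fixed total rank budget across layers in proportion to their gradient scores. The function name `allocate_ranks`, the budget, and the `r_min`/`r_max` bounds are illustrative assumptions and do not appear in this notebook's code, which instead applies a single rank to a selected subset of layers.

```python
def allocate_ranks(layer_importance, total_rank_budget, r_min=2, r_max=16):
    """Split a rank budget across layers in proportion to importance.

    Hypothetical helper: proportional allocation is one possible rule,
    not the scheme used elsewhere in this notebook.
    """
    total = sum(layer_importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")

    ranks = {}
    for layer, score in layer_importance.items():
        raw = total_rank_budget * score / total
        # Clamp so no layer is starved or given an oversized adapter.
        ranks[layer] = int(min(r_max, max(r_min, round(raw))))
    return ranks


# Toy scores: layer 1 has three times the gradient norm of layer 0.
toy_importance = {0: 1.0, 1: 3.0}
print(allocate_ranks(toy_importance, total_rank_budget=8))  # prints {0: 2, 1: 6}
```

Because of rounding and clamping, the assigned ranks may not sum exactly to the budget. Recent PEFT releases expose a `rank_pattern` field on `LoraConfig` that maps module-name patterns to ranks, which is one place such a dictionary could be plugged in; check the installed version before relying on it.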

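### 2. Overview of the Experimental Pipeline

The experiment proceeds in five stages:

1. Load the base BERT model (without LoRA).
2. Run a warm-up phase that measures each layer's gradient sensitivity.
3. Perform adaptive rank allocation based on the gradient information. In the code below, this takes the form of applying LoRA with r = 8 only to the top 30% of layers ranked by gradient norm.
4. Build the Adaptive LoRA model.
5. Train and validate on the SST-2 dataset.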
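
As a worked example of the layer-selection stage, the sketch below applies a top-ratio rule to the twelve per-layer gradient norms that the warm-up run in this notebook reports. The function `top_layers` is a standalone illustration that mirrors `select_important_layers` from the code cell; the scores are copied from the logged output.

```python
# Per-layer mean gradient norms as printed by the warm-up run in this notebook.
layer_importance = {
    0: 0.090118, 1: 0.079651, 2: 0.082389, 3: 0.087262,
    4: 0.088129, 5: 0.080186, 6: 0.085607, 7: 0.077054,
    8: 0.094206, 9: 0.082342, 10: 0.083280, 11: 0.082088,
}


def top_layers(importance, top_ratio=0.3):
    """Return the layer ids with the largest scores, most important first."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    # int() truncates: 12 layers * 0.3 = 3.6 -> 3 layers; always keep at least one.
    k = max(1, int(len(ranked) * top_ratio))
    return [layer for layer, _ in ranked[:k]]


selected = top_layers(layer_importance)
print(selected)  # prints [8, 0, 4]
```

The largest and smallest scores here differ by only about 22%, so the resulting ranking may be sensitive to which warm-up batches happen to be drawn.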


In [1]:
# =====================================================
# Gradient-aware Adaptive LoRA for BERT (SST-2)
# Fully corrected, directly runnable version
# =====================================================

import time
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from collections import defaultdict

from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup
)

from peft import LoraConfig, get_peft_model
from tqdm import tqdm


# =====================================================
# 1. Basic configuration
# =====================================================
MODEL_NAME = "bert-base-uncased"
BATCH_SIZE = 16
EPOCHS = 3
LR = 2e-5
MAX_LEN = 128
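WARMUP_BATCHES = 50     # batches used to estimate per-layer gradient norms
TOP_LAYER_RATIO = 0.3   # fraction of layers that receive LoRA adapters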

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)


# =====================================================
# 2. Dataset loading (GLUE SST-2)
# =====================================================
dataset = load_dataset("glue", "sst2")

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(
        batch["sentence"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN
    )

dataset = dataset.map(tokenize, batched=True)
dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "label"]
)

train_loader = DataLoader(
    dataset["train"],
    batch_size=BATCH_SIZE,
    shuffle=True
)

val_loader = DataLoader(
    dataset["validation"],
    batch_size=BATCH_SIZE
)


# =====================================================
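# 3. Gradient-sensitivity statistics (warm-up)
# =====================================================
def collect_layer_gradients(model, dataloader, device, num_batches):
    """
    Compute the mean gradient norm of the attention query/value parameters
    in each Transformer layer over the first num_batches batches.
    """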
    model.train()
    grad_stats = defaultdict(list)

    for step, batch in enumerate(dataloader):
        if step >= num_batches:
            break

        # Fix 1: the model's forward() expects "labels", not "label"
        batch = {k: v.to(device) for k, v in batch.items()}
        batch["labels"] = batch.pop("label")

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        for name, param in model.named_parameters():
            if param.grad is None:
                continue

            if "encoder.layer" in name and (
                "attention.self.query" in name or
                "attention.self.value" in name
            ):
                # Fix 2: parse layer_id robustly from the parameter name
                name_parts = name.split(".")
                if "layer" not in name_parts:
                    continue

                layer_idx = name_parts.index("layer")
                if layer_idx + 1 >= len(name_parts):
                    continue

                try:
                    layer_id = int(name_parts[layer_idx + 1])
                except ValueError:
                    continue

                grad_norm = param.grad.norm().item()
                grad_stats[layer_id].append(grad_norm)

        model.zero_grad()

    layer_importance = {
        layer: sum(vals) / len(vals)
        for layer, vals in grad_stats.items()
    }

    return layer_importance


# =====================================================
# 4. Select important layers (Adaptive LoRA)
# =====================================================
def select_important_layers(layer_importance, top_ratio=0.3):
    sorted_layers = sorted(
        layer_importance.items(),
        key=lambda x: x[1],
        reverse=True
    )
    top_k = max(1, int(len(sorted_layers) * top_ratio))
    return [layer for layer, _ in sorted_layers[:top_k]]


# =====================================================
# 5. Training and evaluation functions
# =====================================================
def train_one_epoch(model, loader, optimizer, scheduler):
    model.train()
    total_loss = 0

    for batch in tqdm(loader, desc="Training"):
        optimizer.zero_grad()

        # Fix 3: the model's forward() expects "labels", not "label"
        batch = {k: v.to(device) for k, v in batch.items()}
        batch["labels"] = batch.pop("label")

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    return total_loss / len(loader)


def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0

    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"]
            )
            preds = torch.argmax(outputs.logits, dim=1)
            correct += (preds == batch["label"]).sum().item()
            total += batch["label"].size(0)

    return correct / total


# =====================================================
# 6. Parameter-counting utility
# =====================================================
def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# =====================================================
# 7. Main pipeline
# =====================================================
print("\n=== Step 1: Initialize the base model (no LoRA) ===")
base_model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2
).to(device)

print("\n=== Step 2: Gradient-sensitivity statistics (warm-up) ===")
layer_importance = collect_layer_gradients(
    base_model,
    train_loader,
    device,
    WARMUP_BATCHES
)

print("\nLayer importance (gradient norm):")
for k, v in sorted(layer_importance.items()):
    print(f"Layer {k}: {v:.6f}")

print("\n=== Step 3: Select important layers ===")
important_layers = select_important_layers(
    layer_importance,
    TOP_LAYER_RATIO
)
print("Selected layers:", important_layers)

print("\n=== Step 4: Build the Adaptive LoRA model ===")
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query", "value"],
    layers_to_transform=important_layers,
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

model = get_peft_model(model, lora_config)
model.to(device)
model.print_trainable_parameters()

total_p, trainable_p = count_parameters(model)
print(f"Total Params: {total_p:,}")
print(f"Trainable Params: {trainable_p:,}")


# =====================================================
# 8. Optimizer and scheduler
# =====================================================
optimizer = AdamW(model.parameters(), lr=LR)
total_steps = len(train_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)


# =====================================================
# 9. Training
# =====================================================
print("\n=== Step 5: Train Adaptive LoRA ===")
training_log = []
start_time = time.time()

for epoch in range(EPOCHS):
    epoch_start = time.time()

    train_loss = train_one_epoch(
        model, train_loader, optimizer, scheduler
    )
    val_acc = evaluate(model, val_loader)

    epoch_time = time.time() - epoch_start
    training_log.append((epoch + 1, train_loss, val_acc, epoch_time))

    print(f"Epoch {epoch + 1}: "
          f"Loss={train_loss:.4f}, "
          f"Val Acc={val_acc:.4f}, "
          f"Time={epoch_time:.2f}s")

total_time = time.time() - start_time

print("\n=== Training Summary ===")
print("Epoch | Train Loss | Val Acc | Time(s)")
for e, l, a, t in training_log:
    print(f"{e:^5} | {l:^10.4f} | {a:^7.4f} | {t:^8.2f}")

print(f"\nTotal Training Time: {total_time:.2f}s")
print("\nTraining finished.")


Using device: cuda

=== Step 1: Initialize the base model (no LoRA) ===


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== Step 2: Gradient-sensitivity statistics (warm-up) ===

Layer importance (gradient norm):
Layer 0: 0.090118
Layer 1: 0.079651
Layer 2: 0.082389
Layer 3: 0.087262
Layer 4: 0.088129
Layer 5: 0.080186
Layer 6: 0.085607
Layer 7: 0.077054
Layer 8: 0.094206
Layer 9: 0.082342
Layer 10: 0.083280
Layer 11: 0.082088

=== Step 3: Select important layers ===
Selected layers: [8, 0, 4]

=== Step 4: Build the Adaptive LoRA model ===


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 75,266 || all params: 109,559,044 || trainable%: 0.0687
Total Params: 109,559,044
Trainable Params: 75,266

=== Step 5: Train Adaptive LoRA ===


Training: 100%|██████████| 4210/4210 [07:54<00:00,  8.86it/s]


Epoch 1: Loss=0.4074, Val Acc=0.8830, Time=478.07s


Training: 100%|██████████| 4210/4210 [07:54<00:00,  8.87it/s]


Epoch 2: Loss=0.2992, Val Acc=0.8853, Time=477.45s


Training: 100%|██████████| 4210/4210 [07:54<00:00,  8.88it/s]


Epoch 3: Loss=0.2896, Val Acc=0.8807, Time=477.19s

=== Training Summary ===
Epoch | Train Loss | Val Acc | Time(s)
  1   |   0.4074   | 0.8830  |  478.07 
  2   |   0.2992   | 0.8853  |  477.45 
  3   |   0.2896   | 0.8807  |  477.19 

Total Training Time: 1432.71s

Training finished.
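
The trainable-parameter count in the log can be checked by hand. The sketch below assumes BERT-base's hidden size of 768, LoRA adapters with r = 8 on the query and value projections of the three selected layers, and a trainable 768-to-2 classification head alongside the adapters. The variable names are illustrative, and the accounting is an assumption that happens to reproduce the logged figure.

```python
HIDDEN_SIZE = 768      # BERT-base hidden dimension
RANK = 8               # LoRA rank from LoraConfig(r=8)
NUM_LABELS = 2         # SST-2 is a binary task
selected_layers = [8, 0, 4]
target_modules = ["query", "value"]

# Each adapted linear layer gains two matrices: A (r x d_in) and B (d_out x r).
lora_per_module = RANK * HIDDEN_SIZE + HIDDEN_SIZE * RANK
lora_total = lora_per_module * len(target_modules) * len(selected_layers)

# Classification head: weight (2 x 768) plus bias (2).
classifier = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS

trainable = lora_total + classifier
print(lora_total, classifier, trainable)  # prints 73728 1538 75266

# Share of all parameters, using the total reported in the log.
print(f"{100 * trainable / 109_559_044:.4f}")  # prints 0.0687
```

The result matches the 75,266 trainable parameters and the 0.0687% share printed above, which suggests that only the three selected layers carry adapters and that the classification head is trained with them.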
