## LLaMA 2 Instruction Fine-Tuning (Alpaca-Style on the Dolly-15K Dataset)

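Key training elements of the example code:
- Use the Dolly-15K dataset and format the training data in the Alpaca instruction style
- Load the `LLaMA 2-7B` model with 4-bit (NF4) quantization
- Train the model with QLoRA in `bf16` mixed precision
- Use `SFTTrainer` from `HuggingFace TRL` for supervised instruction fine-tuning
- Use Flash Attention to speed up training (requires hardware support)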

### Download the databricks-dolly-15k dataset

In [1]:
from datasets import load_dataset
from random import randrange
 
# Load the dataset from the Hugging Face Hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")


In [2]:
# Total number of examples in the dataset: 15011
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [3]:
# Pick a random example and print it
print(dataset[randrange(len(dataset))])

{'instruction': 'How is Tesla organized as a company?', 'context': '', 'response': 'Tesla has a functional organizational structure that is flat as well. Tesla’s organizational structure is designed in a way that is easy to manage and built for maximum efficiency. Tesla can still be considered as a startup in the automotive industry as its competitors are typically 75+ years old. As a startup, Tesla is designed for efficiency, to begin with as it must be easier to group and manage similar tasks. \n\n \n\nSome of the key functional groups of Tesla is as below: \n\n \n\nMaterials Engineering \n\nArtificial Intelligence for Auto Pilot \n\nHardware design engineering \n\nEnergy operations \n\nGlobal communications \n\nGlobal Environment health and safety \n\nGlobal security \n\nChief of staff \n\nInformation operations \n\n \n\nAll the functional groups report to the CEO and Tesla has a heavy flat organization structure. Materials engineering is responsible for material research and develo

### Format the instruction data in Alpaca style

`Alpaca-style` format: https://github.com/tatsu-lab/stanford_alpaca#data-release

In [4]:
def format_instruction(sample_data):
    """
    Formats the given data into a structured instruction format.

    Parameters:
    sample_data (dict): A dictionary containing 'response' and 'instruction' keys.

    Returns:
    str: A formatted string containing the instruction, input, and response.
    """
    # Check if required keys exist in the sample_data
    if 'response' not in sample_data or 'instruction' not in sample_data:
        # Handle the error or return a default message
        return "Error: 'response' or 'instruction' key missing in the input data."

    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 
 
### Input:
{sample_data['response']}
 
### Response:
{sample_data['instruction']}
"""

In [7]:
# Pick a random example and print it in Alpaca format
print(format_instruction(dataset[randrange(len(dataset))]))

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
Because you have one life to live.
You need to explore as much as you can to fulfill your dream so you need a target to full and that's why you need bucketlist

### Response:
Why you should have a bucket list?
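
`format_instruction` above returns an error string when a key is missing, and that string would then be passed on as training text. The following is a minimal sketch of a stricter variant, not the notebook's own method: the name `format_instruction_strict` is hypothetical, and the template whitespace is slightly normalized compared with the original.

```python
def format_instruction_strict(sample_data: dict) -> str:
    """Build the Alpaca-style training text, raising on malformed samples.

    Hypothetical variant of format_instruction: a missing key raises KeyError
    instead of returning an error message as if it were training data.
    """
    missing = [key for key in ("instruction", "response") if key not in sample_data]
    if missing:
        raise KeyError(f"missing required keys: {missing}")

    # Same reversed task as the original: the dataset's response is the input,
    # and the dataset's instruction is what the model learns to produce.
    return (
        "### Instruction:\n"
        "Use the Input below to create an instruction, which could have been "
        "used to generate the input using an LLM.\n\n"
        "### Input:\n"
        f"{sample_data['response']}\n\n"
        "### Response:\n"
        f"{sample_data['instruction']}\n"
    )


sample = {
    "instruction": "Why you should have a bucket list?",
    "response": "Because you have one life to live.",
}
print(format_instruction_strict(sample))
```

Raising early makes a malformed record visible at formatting time rather than after training has consumed it.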



### Speed up training with Flash Attention

Check whether your GPU supports `flash-attn`:

```shell
$ python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Hardware not supported for Flash Attention
```
**Result: the NVIDIA T4 used for this demo does not support Flash Attention (its compute capability is 7.5, below the required 8.0).**
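
The same check can be written as a small Python function. This is a sketch under the same rule as the one-liner above (major compute capability of at least 8); the function name `supports_flash_attention` is hypothetical, and it takes the capability tuple as an argument so it can be tried without a GPU.

```python
from typing import Tuple


def supports_flash_attention(capability: Tuple[int, int]) -> bool:
    """Apply the rule from the shell check: major compute capability >= 8."""
    major, _minor = capability
    return major >= 8


# Compute capabilities: T4 is 7.5 (Turing), A100 is 8.0 (Ampere).
print(supports_flash_attention((7, 5)))  # False
print(supports_flash_attention((8, 0)))  # True
```

On a machine with CUDA available, the argument would come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.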

#### Install the flash-attn package (requires GPU hardware support)

```shell
$ MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

### Load the model

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# If the hardware supports it and flash-attn installed successfully, set use_flash_attention to True
use_flash_attention = False

# Uncomment to use flash-attn
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True
 
 
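# Get the LLaMA 2-7B model weights
# Model weights that do not require approval from Meta AI
model_id = "NousResearch/Llama-2-7b-hf"
# After approval from Meta AI, this model ID can be used instead
# model_id = "meta-llama/Llama-2-7b-hf"


# Load the quantized model with bitsandbytes (BnB)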
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
 
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1 
 
# Verify that the model is using flash attention by comparing docstrings
if use_flash_attention:
    from utils.llama_patch import forward    
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"
 
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|██████████| 2/2 [01:10<00:00, 35.17s/it]


### Load the PEFT model with a QLoRA configuration

In [9]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
 
# QLoRA configuration
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM", 
)
 
 
# Wrap the model as a PEFT model using the QLoRA configuration
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)

In [10]:
qlora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12433454005023165
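
The trainable parameter count above can be reproduced by hand. The sketch below assumes that PEFT's default target modules for Llama are `q_proj` and `v_proj` (the `LoraConfig` above does not set `target_modules`), and uses the Llama-2-7B dimensions of 32 decoder layers and a hidden size of 4096.

```python
hidden_size = 4096   # Llama-2-7B hidden dimension
num_layers = 32      # Llama-2-7B decoder layers
rank = 16            # r in the LoraConfig above
target_modules = 2   # assumption: PEFT default for Llama (q_proj and v_proj)

# Each adapted 4096x4096 projection gains two low-rank matrices:
# A with shape (rank, hidden_size) and B with shape (hidden_size, rank).
params_per_module = rank * hidden_size + hidden_size * rank
trainable = num_layers * target_modules * params_per_module

total = 6_746_804_224  # "all params" reported by print_trainable_parameters()

print(trainable)                           # 8388608
print(f"{100 * trainable / total:.4f}%")   # 0.1243%
```

Under these assumptions the result matches the reported 8,388,608 trainable parameters, about 0.12% of the total.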


### Training hyperparameters

In [11]:
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
# timestamp = "20250817_120925"

# Demo training settings (set to False for a real training run)
demo_train = True
output_dir = f"models/llama-7-int4-dolly-{timestamp}"

In [12]:
from transformers import TrainingArguments
 
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=100, # when > 0, overrides num_train_epochs; set to -1 to train by epochs
    per_device_train_batch_size=3, # largest batch size that fits in the 16 GB of an NVIDIA T4
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03, # has no effect with the "constant" scheduler; "constant_with_warmup" would apply it
    lr_scheduler_type="constant"
)


# from transformers import TrainingArguments

# args = TrainingArguments(
#     output_dir=output_dir,
#     num_train_epochs=1 if demo_train else 3,
#     max_steps=100,
#     per_device_train_batch_size=2,  # reduce the batch size further (if 3 still runs out of memory)
#     gradient_accumulation_steps=2 if demo_train else 8,  # more gradient accumulation to compensate for the smaller batch
#     gradient_checkpointing=True,  # enabled: trades some speed for memory
#     optim="paged_adamw_32bit",  # paged optimizer, which helps absorb GPU memory spikes
#     logging_steps=10,
#     save_strategy="steps" if demo_train else "epoch",
#     save_steps=10,
#     learning_rate=2e-4,
#     bf16=True,  # mixed precision enabled; use fp16 instead if the GPU does not support bf16
#     fp16_full_eval=True,  # also use reduced precision during evaluation
#     max_grad_norm=0.3,
#     warmup_ratio=0.03,
#     lr_scheduler_type="constant",

#     # Additional memory-related options
#     load_best_model_at_end=False,  # do not load the best model (saves memory in the validation phase)
#     report_to="none",  # disable logging integrations such as wandb
#     remove_unused_columns=True,  # automatically drop unused feature columns
#     torch_compile=False,  # disable torch.compile (which may increase memory usage)
# )
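
The active configuration and the commented-out alternative trade per-device batch size against gradient accumulation. A minimal sketch of the resulting effective batch size per optimizer step, assuming a single GPU (the helper name `effective_batch_size` is hypothetical):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


# Active configuration: batch size 3, accumulation 1 (demo) or 4 (full run)
print(effective_batch_size(3, 1))  # 3
print(effective_batch_size(3, 4))  # 12

# Commented-out alternative: batch size 2, accumulation 2 (demo) or 8 (full run)
print(effective_batch_size(2, 2))  # 4
print(effective_batch_size(2, 8))  # 16
```

The alternative lowers peak memory per forward pass while keeping the effective batch size at or above that of the active configuration.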

### Instantiate SFTTrainer

In [13]:
from trl import SFTTrainer
 
# Maximum sequence length (1158 training examples remain after filtering)
max_seq_length = 2048 
 
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction, 
    args=args,
)

Detected kernel version 5.4.241, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


### Train the model

In [14]:
trainer.train()



Step,Training Loss
10,1.6502
20,1.3811
30,1.2937
40,1.3274
50,1.2457
60,1.2854
70,1.2553
80,1.2136
90,1.2451
100,1.2125




TrainOutput(global_step=100, training_loss=1.3110004615783692, metrics={'train_runtime': 662.4464, 'train_samples_per_second': 0.453, 'train_steps_per_second': 0.151, 'total_flos': 2.43882352705536e+16, 'train_loss': 1.3110004615783692, 'epoch': 0.26})
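
The reported metrics can be cross-checked against the configuration. The sketch below assumes a single GPU with batch size 3 and no gradient accumulation (the demo settings), and assumes that the 1158 mentioned in the `SFTTrainer` cell is the number of training sequences the trainer iterates over.

```python
steps = 100            # global_step
batch_size = 3         # per_device_train_batch_size, accumulation of 1
runtime_s = 662.4464   # train_runtime
num_examples = 1158    # assumption: training sequences seen by the trainer

samples_seen = steps * batch_size

print(round(samples_seen / runtime_s, 3))      # 0.453 (train_samples_per_second)
print(round(steps / runtime_s, 3))             # 0.151 (train_steps_per_second)
print(round(samples_seen / num_examples, 2))   # 0.26  (epoch)

# The reported training_loss is close to the mean of the logged losses.
logged = [1.6502, 1.3811, 1.2937, 1.3274, 1.2457,
          1.2854, 1.2553, 1.2136, 1.2451, 1.2125]
mean_loss = sum(logged) / len(logged)
print(round(mean_loss, 4))
```

Under these assumptions, 100 steps cover roughly a quarter of one pass over the data, which is consistent with the `epoch` value of 0.26 in the output.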

### Save the model

In [15]:
trainer.save_model()

### Model inference (test)

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 1. Load the quantization config (same as in training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 2. Load the model and tokenizer
model_path = "models/llama-7-int4-dolly-20250818_005604"  # actual model path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)

# 3. Inference function
def generate_response(prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# 4. Example usage
if __name__ == "__main__":
    test_prompt = "What is the capital of France?"
    print(generate_response(test_prompt))

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.27s/it]


What is the capital of France?
 sierpina
Which country is the capital of France?
What is the capital of France?
What is the capital of France
What is the capital of France
What is the capital of France?
What is the capital of France
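
The test prompt above is a bare question, whereas the training text followed the `### Instruction / ### Input / ### Response` template from `format_instruction`. As a sketch only, assuming the fine-tuned model may respond more sensibly to the template it was trained on, a prompt can be built in that format and left open at the response section. The helper name `build_inference_prompt` is hypothetical.

```python
def build_inference_prompt(text: str) -> str:
    """Build a prompt in the training template, ending at the response header.

    The model was trained to produce an instruction for a given input text,
    so `text` plays the role of the dataset's 'response' field.
    """
    return (
        "### Instruction:\n"
        "Use the Input below to create an instruction, which could have been "
        "used to generate the input using an LLM.\n\n"
        "### Input:\n"
        f"{text}\n\n"
        "### Response:\n"
    )


prompt = build_inference_prompt("Because you have one life to live.")
print(prompt)
```

With the model and tokenizer from the inference cell loaded, the prompt would be passed as `generate_response(prompt)`. No output is shown here because this variant was not run in the notebook.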
